Instructions to use yasserrmd/glm5.1-distill with libraries and local apps.

Libraries

How to use yasserrmd/glm5.1-distill with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="yasserrmd/glm5.1-distill")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("yasserrmd/glm5.1-distill")
model = AutoModelForCausalLM.from_pretrained("yasserrmd/glm5.1-distill")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use yasserrmd/glm5.1-distill with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "yasserrmd/glm5.1-distill"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "yasserrmd/glm5.1-distill",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
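The vLLM server, like the SGLang server in the next section, exposes an OpenAI-compatible API, so the curl call above can also be made from Python. The following is a minimal standard-library sketch. The helper names `build_chat_payload` and `chat` are hypothetical, and the code assumes the usual OpenAI response shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_payload(model, messages, max_tokens=256, **sampling):
    """Build the JSON body for an OpenAI-compatible /v1/chat/completions request."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    # Optional sampling fields, e.g. temperature=0.1, top_p=0.1.
    payload.update(sampling)
    return payload


def chat(base_url, payload, timeout=60):
    """POST the payload to the server and return the first choice's message text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With a server running, `chat("http://localhost:8000", build_chat_payload("yasserrmd/glm5.1-distill", messages, temperature=0.1))` queries vLLM. Point it at `http://localhost:30000` for SGLang instead. Extra fields such as `top_k` or `repetition_penalty` can be passed through `**sampling`, assuming the server accepts them.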

Use Docker

docker model run hf.co/yasserrmd/glm5.1-distill

SGLang

How to use yasserrmd/glm5.1-distill with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "yasserrmd/glm5.1-distill" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "yasserrmd/glm5.1-distill",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "yasserrmd/glm5.1-distill" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "yasserrmd/glm5.1-distill",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Unsloth Studio

How to use yasserrmd/glm5.1-distill with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for yasserrmd/glm5.1-distill to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for yasserrmd/glm5.1-distill to start chatting

Using Hugging Face Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for yasserrmd/glm5.1-distill to start chatting

Load model with FastModel

# Install first (shell): pip install unsloth
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    model_name="yasserrmd/glm5.1-distill",
    max_seq_length=2048,
)

Docker Model Runner
How to use yasserrmd/glm5.1-distill with Docker Model Runner:
```
docker model run hf.co/yasserrmd/glm5.1-distill
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

glm5.1-distill

yasserrmd/glm5.1-distill is a 1.2B-parameter instruction-tuned chat model built on top of LiquidAI/LFM2.5-1.2B-Base. It was supervised fine-tuned (SFT) on a 50k-example subset of Jackrong/GLM-5.1-Reasoning-1M-Cleaned, a cleaned reasoning-style chat corpus distilled from the GLM-5.1 family.

The goal is to transfer some of the conversational reasoning behavior of the larger GLM-5.1 teacher models into the small, efficient LFM2.5 architecture, so that the model runs comfortably on a single consumer GPU, on edge devices, or in ONNX, GGUF, or MLX runtimes, optionally quantized.

Note: This is an independent community fine-tune. It is not affiliated with or endorsed by Liquid AI or Z.ai/THUDM (the GLM authors).

Model summary

Property	Value
Architecture	LFM2 (hybrid conv + attention)
Parameters	~1.2B
Tensor dtype	BF16
Context length	4096 (trained at 2048 with packing)
Base model	`LiquidAI/LFM2.5-1.2B-Base`
Fine-tuning method	LoRA SFT (merged back to base)
Trainer	Unsloth + TRL `SFTTrainer`
Chat template	LFM2 / ChatML-style (`<|im_start|>` / `<|im_end|>` turn markers)
License	Apache 2.0
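The context-length entry mentions training at 2048 tokens with packing. As a rough illustration of what packing means, the sketch below concatenates tokenized examples, each followed by an end-of-sequence id, and slices the resulting stream into fixed-length chunks. This is a simplified assumption, not TRL's exact algorithm (its packing strategy differs across versions), and `pack_sequences` is a hypothetical name.

```python
def pack_sequences(examples, max_len, eos_id):
    """Concatenate tokenized examples (each followed by eos_id) and cut the stream into max_len chunks."""
    stream = []
    for ids in examples:
        stream.extend(ids)
        stream.append(eos_id)  # mark the boundary between examples
    # The final chunk may be shorter than max_len; some implementations drop or pad it.
    return [stream[i:i + max_len] for i in range(0, len(stream), max_len)]
```

Packing avoids spending compute on padding tokens. The trade-off is that a single training sequence can span several unrelated conversations.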

Intended use

This model is designed for:

General assistant-style chat
Lightweight reasoning, step-by-step answers, and explanations
On-device and edge deployments where a 1B-class model is appropriate
A starting checkpoint for further domain-specific fine-tuning

It is not a safety-aligned, production-ready assistant on its own. Treat its output as that of a small distilled student model: it can be confidently wrong, especially on long-horizon math, code correctness, current events, and anything safety-critical.

Out of scope

Medical, legal, financial, or other high-stakes advice
Any setting that requires guaranteed factuality
Any use that violates the Apache 2.0 license terms or the license of the upstream LFM2.5 base model

Quickstart (Transformers)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "yasserrmd/glm5.1-distill"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain why the sky is blue in two short paragraphs."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
    return_dict=True,
).to(model.device)

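streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# do_sample=True makes generate() apply temperature/top_k/top_p;
# under greedy decoding those settings are ignored.
_ = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.1,
    top_k=50,
    top_p=0.1,
    repetition_penalty=1.05,
    streamer=streamer,
)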

Recommended sampling

The base LFM2.5 family is sensitive to sampling settings. The following defaults (inherited from Liquid AI's reference settings) work well:

Use case	temperature	top_k	top_p	repetition_penalty
Factual / short answers	0.1	50	0.1	1.05
Creative / longer text	0.7	50	0.9	1.10
Code / structured output	0.2	40	0.9	1.05
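To switch between the rows of the table above in code, the presets can be kept in a dictionary and expanded into `model.generate()` keyword arguments. The preset names (`factual`, `creative`, `code`) and the `generation_kwargs` helper are hypothetical. `do_sample=True` is included because Transformers ignores temperature, top_k, and top_p under greedy decoding.

```python
# Values copied from the recommended-sampling table; the keys are illustrative labels.
SAMPLING_PRESETS = {
    "factual": {"temperature": 0.1, "top_k": 50, "top_p": 0.1, "repetition_penalty": 1.05},
    "creative": {"temperature": 0.7, "top_k": 50, "top_p": 0.9, "repetition_penalty": 1.10},
    "code": {"temperature": 0.2, "top_k": 40, "top_p": 0.9, "repetition_penalty": 1.05},
}


def generation_kwargs(use_case, max_new_tokens=512):
    """Return keyword arguments for model.generate() for one preset (KeyError if unknown)."""
    preset = SAMPLING_PRESETS[use_case]
    # Enable sampling so the temperature/top-k/top-p values actually take effect.
    return {"do_sample": True, "max_new_tokens": max_new_tokens, **preset}
```

Usage, with `inputs` prepared as in the Quickstart: `model.generate(**inputs, **generation_kwargs("code"))`.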

Chat template

The tokenizer ships with a ChatML-style template. A two-turn example serializes to:

<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
Hey there!<|im_end|>

Always use tokenizer.apply_chat_template(..., add_generation_prompt=True) at inference time. Do not hand-roll the prompt.
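To make the layout concrete, the sketch below renders a message list into the ChatML-style string shown above. It is meant for reading and debugging only, for example to compare against `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`. It does not replace the tokenizer's template, which may add tokens this sketch omits (such as a beginning-of-sequence token). `render_chatml` is a hypothetical name.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Illustrative only: serialize messages in the ChatML-style layout shown above."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```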

Training details

Data

Source: Jackrong/GLM-5.1-Reasoning-1M-Cleaned, main config
Slice: first 50,000 rows of the train split
Format: ShareGPT-style multi-turn conversations, normalized via unsloth.chat_templates.standardize_data_formats
Loss masking: train_on_responses_only so only assistant tokens contribute to the loss
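Two steps of this data pipeline can be illustrated in plain Python: converting ShareGPT-style turns (`from`/`value` keys) into `role`/`content` messages, and masking every non-assistant token out of the loss. The sketch assumes role boundaries are already known for each token span. Unsloth's `train_on_responses_only` instead locates assistant spans by searching for the chat template's marker strings, and the role-mapping table below is an assumption about what `standardize_data_formats` does. Labels set to -100 are ignored by PyTorch's cross-entropy loss by default.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

# Assumed mapping from ShareGPT speaker tags to chat roles.
ROLE_MAP = {"human": "user", "user": "user", "gpt": "assistant", "assistant": "assistant", "system": "system"}


def sharegpt_to_messages(conversation):
    """Convert ShareGPT turns ({"from", "value"}) into {"role", "content"} dicts."""
    return [{"role": ROLE_MAP[turn["from"]], "content": turn["value"]} for turn in conversation]


def build_labels(segments):
    """segments: list of (role, token_ids). Only assistant tokens keep their ids as labels."""
    input_ids, labels = [], []
    for role, ids in segments:
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)  # these tokens contribute to the loss
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # masked out of the loss
    return input_ids, labels
```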

LoRA configuration

Hyperparameter	Value
Rank `r`	16
`lora_alpha`	16
`lora_dropout`	0
Bias	none
Target modules	`q_proj`, `k_proj`, `v_proj`, `out_proj`, `in_proj`, `w1`, `w2`, `w3`
Gradient checkpointing	`unsloth`
Random seed	3407
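The table configures LoRA with r = 16 and lora_alpha = 16, so the usual scaling factor alpha / r is 1.0. The sketch below shows, in plain Python with toy matrices, the standard LoRA forward pass y = W x + (alpha / r) B A x on a frozen weight W, plus the number of trainable parameters LoRA adds to one linear layer. It illustrates the technique with hypothetical helper names and is not Unsloth's kernel implementation.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x).

    W: d_out x d_in (frozen), A: r x d_in and B: d_out x r (trainable).
    """
    scale = alpha / r
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


def lora_extra_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer (A plus B)."""
    return r * (d_in + d_out)
```

For a hypothetical 2048 x 2048 projection, `lora_extra_params(2048, 2048, 16)` gives 65,536 trainable parameters, compared with about 4.2M in the frozen weight.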

SFT hyperparameters

Hyperparameter	Value
Epochs	1
Per-device batch size	32
Gradient accumulation	1
Effective batch size	32
Packing	True
Max sequence length	2048
Optimizer	`adamw_torch`
Learning rate	2e-5
LR scheduler	linear
Warmup steps	50
Weight decay	0.01
Precision	BF16
Seed	3407

Merge & export

After SFT, the LoRA adapters were merged into the base weights using Unsloth's push_to_hub_merged(..., save_method="merged_16bit"). The repository contains the resulting full BF16 model, not adapters.
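Merging folds the trained adapter into the base weight, W_merged = W + (alpha / r) B A, so the exported model needs no LoRA code at inference time and gives the same outputs as base plus adapter, up to the rounding introduced by saving in 16-bit. A plain-Python sketch with toy matrices and hypothetical helper names:

```python
def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B A, the merged weight with the adapter folded in."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]
```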

Hardware

Trained on a single GPU using Unsloth's optimized kernels. Training memory and time are driven mainly by the 50k-row, packed 2048-token setup described above; the specific GPU and wall-clock time are not reported.

Evaluation

No formal benchmark scores are reported for this checkpoint yet. It has been smoke-tested on:

General Q&A (e.g. "Why is the sky blue?")
Short creative writing prompts
Multi-turn instruction following

Quantitative evaluations on benchmarks such as MMLU, GSM8K, IFEval, or MT-Bench are left as future work. Contributions via the HF community tab are welcome.

Limitations and biases

Inherits all limitations and biases of the LFM2.5 base model and of the GLM-5.1-derived training data.
At 1.2B parameters, the model is small. Expect weaker performance than 7B+ chat models on hard reasoning, long-context tasks, and code generation.
The training corpus is predominantly English. Other languages will work to varying degrees but are not the target.
The model can hallucinate facts confidently. Verify anything important.

ONNX version

An ONNX export of this model is available at:

yasserrmd/glm5.1-distill-onnx

It can be used with onnxruntime and optimum for CPU and accelerated inference. See that repository's README for usage details.

Citation

If you use this checkpoint, please cite it:

@misc{yasserrmd_glm51_distill_2026,
  title  = {glm5.1-distill: a small LFM2.5 student fine-tuned on GLM-5.1 reasoning data},
  author = {Mohamed Yasser},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/yasserrmd/glm5.1-distill}}
}

Please also cite the base model and dataset:

LiquidAI, LFM2.5-1.2B-Base, 2025.
Jackrong, GLM-5.1-Reasoning-1M-Cleaned, Hugging Face Datasets.

Acknowledgements

Liquid AI for the LFM2.5 base model.
Jackrong for the cleaned GLM-5.1 reasoning dataset.
Unsloth for the 2x faster SFT pipeline and memory-efficient LoRA kernels.
Hugging Face TRL for SFTTrainer.
