glm5.1-distill
yasserrmd/glm5.1-distill is a 1.2B parameter instruction-tuned chat model
built on top of LiquidAI/LFM2.5-1.2B-Base.
It is supervised-fine-tuned (SFT) on a 50k subset of
Jackrong/GLM-5.1-Reasoning-1M-Cleaned,
a cleaned reasoning-style chat corpus distilled from the GLM-5.1 family.
The goal is to bring some of the conversational reasoning behavior of larger GLM-5.1 teacher models into the small, efficient LFM2.5 architecture so it can run comfortably on a single consumer GPU, on edge devices, or via quantized runtimes such as ONNX, GGUF, or MLX.
Note: This is an independent community fine-tune. It is not affiliated with or endorsed by Liquid AI or Z.ai/THUDM (the GLM authors).
Model summary
| Property | Value |
|---|---|
| Architecture | LFM2 (hybrid conv + attention) |
| Parameters | ~1.2B |
| Tensor dtype | BF16 |
| Context length | 4096 (trained at 2048 with packing) |
| Base model | LiquidAI/LFM2.5-1.2B-Base |
| Fine-tuning method | LoRA SFT (merged back to base) |
| Trainer | Unsloth + TRL SFTTrainer |
| Chat template | LFM2 / ChatML-style |
| License | Apache 2.0 |
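As a rough guide to what the table implies for deployment, the sketch below estimates weight storage from the parameter count and dtype. It assumes exactly 1.2e9 parameters (the card only says "~1.2B"). The `int8` and `int4` rows are idealized and ignore quantization metadata, activations, and KV cache. The helper name `weight_memory_gb` is illustrative.

```python
# Back-of-the-envelope weight memory for a ~1.2B-parameter model.
# Assumption: exactly 1.2e9 parameters; the card only states "~1.2B".
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    # Weights only: runtime memory is higher once activations are included.
    print(f"{dtype}: {weight_memory_gb(1.2e9, dtype):.1f} GB")
```

Under these assumptions the BF16 weights need roughly 2.4 GB, which is consistent with the single-consumer-GPU target described above.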
Intended use
This model is designed for:
- General assistant-style chat
- Lightweight reasoning, step-by-step answers, and explanations
- On-device and edge deployments where a 1B class model is appropriate
- A starting checkpoint for further domain-specific fine-tuning
It is not a safety-aligned, production-ready assistant on its own. Treat its output as that of a small distilled student model: it can be confidently wrong, especially on long-horizon math, code correctness, current events, and anything safety-critical.
Out of scope
- Medical, legal, financial, or other high-stakes advice
- Any setting that requires guaranteed factuality
- Generating content that violates the Apache 2.0 license terms or the upstream LFM2.5 base model license
Quickstart (Transformers)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "yasserrmd/glm5.1-distill"

# Load the tokenizer and the merged BF16 weights.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain why the sky is blue in two short paragraphs."},
]

# Build the prompt with the shipped chat template and move it to the model's device.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    tokenize=True,
    return_dict=True,
).to(model.device)

# Stream tokens to stdout as they are generated, without echoing the prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,  # set explicitly so the sampling settings below take effect
    temperature=0.1,
    top_k=50,
    top_p=0.1,
    repetition_penalty=1.05,
    streamer=streamer,
)
```
Recommended sampling
The base LFM2.5 family is sensitive to sampling settings. The following defaults (inherited from Liquid AI's reference settings) work well:
| Use case | temperature | top_k | top_p | repetition_penalty |
|---|---|---|---|---|
| Factual / short answers | 0.1 | 50 | 0.1 | 1.05 |
| Creative / longer text | 0.7 | 50 | 0.9 | 1.10 |
| Code / structured output | 0.2 | 40 | 0.9 | 1.05 |
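One way to keep these settings in one place is a small preset table. This is a sketch only: `SAMPLING_PRESETS`, `generation_kwargs`, and the preset keys are illustrative names, not part of the model or of Transformers. `do_sample=True` is included because Transformers otherwise decodes greedily.

```python
# Presets copied from the table above; the keys are illustrative names.
SAMPLING_PRESETS = {
    "factual": {"temperature": 0.1, "top_k": 50, "top_p": 0.1, "repetition_penalty": 1.05},
    "creative": {"temperature": 0.7, "top_k": 50, "top_p": 0.9, "repetition_penalty": 1.10},
    "code": {"temperature": 0.2, "top_k": 40, "top_p": 0.9, "repetition_penalty": 1.05},
}


def generation_kwargs(use_case: str, max_new_tokens: int = 512) -> dict:
    """Return keyword arguments suitable for model.generate(**inputs, **kwargs)."""
    if use_case not in SAMPLING_PRESETS:
        raise KeyError(f"unknown use case {use_case!r}; expected one of {sorted(SAMPLING_PRESETS)}")
    # do_sample=True enables sampling so temperature/top_k/top_p are applied.
    return {"do_sample": True, "max_new_tokens": max_new_tokens, **SAMPLING_PRESETS[use_case]}
```

A call would then look like `model.generate(**inputs, **generation_kwargs("code"))`.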
Chat template
The tokenizer ships with a ChatML-style template. A two-turn example serializes to:
```
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
Hey there!<|im_end|>
```
Always use tokenizer.apply_chat_template(..., add_generation_prompt=True)
at inference time. Do not hand-roll the prompt.
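To make the layout concrete, here is a minimal sketch that mimics the serialization shown above. It is for illustration and debugging only and is not a substitute for the shipped template. `render_chatml` is a hypothetical helper, and the real template may add tokens or handle system prompts in ways this sketch does not.

```python
# Illustration only: mimics the ChatML-style layout shown above.
# The tokenizer's own template remains the source of truth.
IM_START, IM_END = "<|im_start|>", "<|im_end|>"


def render_chatml(messages, add_generation_prompt=False):
    """Serialize a list of {"role", "content"} dicts in ChatML-style layout."""
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    if add_generation_prompt:
        # An open assistant turn tells the model it is its turn to speak.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


print(render_chatml([{"role": "user", "content": "Hello!"}], add_generation_prompt=True))
```

The trailing open `assistant` turn is what `add_generation_prompt=True` contributes. Without it the model has no cue to start a reply.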
Training details
Data
- Source: Jackrong/GLM-5.1-Reasoning-1M-Cleaned, `main` config
- Slice: first 50,000 rows of the `train` split
- Format: ShareGPT-style multi-turn conversations, normalized via `unsloth.chat_templates.standardize_data_formats`
- Loss masking: `train_on_responses_only`, so only assistant tokens contribute to the loss
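The idea behind response-only loss masking can be sketched in a few lines. This shows the general technique, not Unsloth's implementation. `mask_non_assistant` and the token ids are made up for the example. The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_non_assistant(input_ids, is_assistant):
    """Copy input_ids into labels, hiding every token outside assistant turns."""
    if len(input_ids) != len(is_assistant):
        raise ValueError("input_ids and is_assistant must have the same length")
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(input_ids, is_assistant)]


# Hypothetical token ids: three prompt tokens followed by two assistant tokens.
labels = mask_non_assistant([11, 12, 13, 21, 22], [False, False, False, True, True])
print(labels)  # [-100, -100, -100, 21, 22]
```

Only the positions that keep their token id contribute to the loss. The model is therefore trained to produce answers, not to reproduce user prompts.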
LoRA configuration
| Hyperparameter | Value |
|---|---|
| Rank `r` | 16 |
| `lora_alpha` | 16 |
| `lora_dropout` | 0 |
| Bias | none |
| Target modules | q_proj, k_proj, v_proj, out_proj, in_proj, w1, w2, w3 |
| Gradient checkpointing | unsloth |
| Random seed | 3407 |
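Two quantities follow directly from this table: the adapter scaling factor and the number of trainable parameters per adapted layer. The sketch below assumes standard LoRA scaling (`lora_alpha / r`, not rank-stabilized). The 2048 x 2048 layer shape is hypothetical and is not taken from the LFM2.5 config.

```python
def lora_scaling(r: int, lora_alpha: int) -> float:
    """Standard LoRA scaling applied to the low-rank update B @ A."""
    return lora_alpha / r


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one adapted linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


scale = lora_scaling(r=16, lora_alpha=16)
# Hypothetical 2048 x 2048 projection, used only to show the order of magnitude.
adapter = lora_params(d_in=2048, d_out=2048, r=16)
full = 2048 * 2048
print(f"scaling={scale}, adapter params={adapter}, share of full layer={adapter / full:.2%}")
```

With `r = lora_alpha = 16` the scaling is 1.0, so the low-rank update is added to the base weights unscaled.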
SFT hyperparameters
| Hyperparameter | Value |
|---|---|
| Epochs | 1 |
| Per-device batch size | 32 |
| Gradient accumulation | 1 |
| Effective batch size | 32 |
| Packing | True |
| Max sequence length | 2048 |
| Optimizer | adamw_torch |
| Learning rate | 2e-5 |
| LR scheduler | linear |
| Warmup steps | 50 |
| Weight decay | 0.01 |
| Precision | BF16 |
| Seed | 3407 |
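The table fixes the learning-rate shape but not the total step count, which depends on how many packed sequences the 50k rows produce. The sketch below shows the usual linear-warmup, linear-decay multiplier with `total_steps` left as a parameter. The value 1000 is purely hypothetical, and `linear_lr` is an illustrative helper, not the trainer's API.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 2e-5, warmup_steps: int = 50) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Upper bound on tokens per optimizer step: batch size x grad accumulation x max length.
tokens_per_step = 32 * 1 * 2048
print(tokens_per_step)  # 65536

# total_steps=1000 is a made-up value used only to show the shape of the schedule.
for step in (0, 25, 50, 525, 1000):
    print(step, linear_lr(step, total_steps=1000))
```

With packing enabled, most sequences are filled close to the 2048-token limit, so each step sees close to the 65,536-token upper bound.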
Merge & export
After SFT, the LoRA adapters were merged into the base weights using
Unsloth's push_to_hub_merged(..., save_method="merged_16bit"). The
repository contains the resulting full BF16 model, not adapters.
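Conceptually, merging folds the low-rank update into each adapted weight matrix, so no adapter is needed at inference time. The sketch below shows that arithmetic on tiny hand-written matrices, assuming standard LoRA scaling. It is not the Unsloth implementation, and `matmul` and `merge_lora` are illustrative helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, a, b, r, lora_alpha):
    """Return W + (lora_alpha / r) * (B @ A), the merged weight matrix."""
    scale = lora_alpha / r
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy shapes: d_out=2, d_in=3, r=2. All values are made up.
W = [[1, 0, 2], [0, 1, 0]]
A = [[1, 2, 0], [0, 1, 1]]
B = [[1, 0], [2, 1]]
merged = merge_lora(W, A, B, r=2, lora_alpha=2)
print(merged)  # [[2.0, 2.0, 2.0], [2.0, 6.0, 1.0]]
```

Applying `merged` to an input gives the same result as applying `W` and the adapter path separately and adding the outputs. That is why the merged checkpoint behaves like the base model plus adapters.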
Hardware
Trained on a single GPU using Unsloth's optimized kernels. End-to-end training memory and time are dominated by the 50k-row, packed-2048 setup described above.
Evaluation
No formal benchmark scores are reported for this checkpoint yet. It has been smoke-tested on:
- General Q&A (e.g. "Why is the sky blue?")
- Short creative writing prompts
- Multi-turn instruction following
Quantitative evaluations on benchmarks such as MMLU, GSM8K, IFEval, or MT-Bench are left as future work. Contributions via the HF community tab are welcome.
Limitations and biases
- Inherits all limitations and biases of the LFM2.5 base model and of the GLM-5.1-derived training data.
- 1.2B parameters is small. Expect weaker performance than 7B+ chat models on hard reasoning, long context, and code generation.
- The training corpus is predominantly English. Other languages will work to varying degrees but are not the target.
- The model can hallucinate facts confidently. Verify anything important.
ONNX version
An ONNX export of this model is available at:
yasserrmd/glm5.1-distill-onnx
It can be used with onnxruntime and optimum for CPU and accelerated
inference. See that repository's README for usage details.
Citation
If you use this checkpoint, please cite the upstream work as well:
```bibtex
@misc{yasserrmd_glm51_distill_2026,
  title = {glm5.1-distill: a small LFM2.5 student fine-tuned on GLM-5.1 reasoning data},
  author = {Mohamed Yasser},
  year = {2026},
  howpublished = {\url{https://huggingface.co/yasserrmd/glm5.1-distill}}
}
```
And the base model and dataset:
- LiquidAI, LFM2.5-1.2B-Base, 2025.
- Jackrong, GLM-5.1-Reasoning-1M-Cleaned, Hugging Face Datasets.
Acknowledgements
- Liquid AI for the LFM2.5 base model.
- Jackrong for the cleaned GLM-5.1 reasoning dataset.
- Unsloth for the 2x faster SFT pipeline and memory-efficient LoRA kernels.
- Hugging Face TRL for `SFTTrainer`.
