🧠 DeepGemma-2B-Reasoning

DeepGemma-2B-Reasoning is a fine-tuned version of google/gemma-4-E2B-it, optimized for step-by-step (Chain-of-Thought) reasoning. It was trained via knowledge distillation on datasets generated by Claude Opus, Qwen3.5, and KIMI.

The model writes its internal reasoning inside <thought> / <think> tags before giving the final answer, which improves the quality of its logical and mathematical responses.
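Because the reasoning and the final answer arrive in one string, downstream code usually needs to separate them. The following is a minimal sketch, assuming the reasoning is wrapped in a matching pair of <think>...</think> or <thought>...</thought> tags; the helper name `split_reasoning` and the sample output are hypothetical, and the exact tag layout the model emits may differ.

```python
import re

# Assumption: reasoning is enclosed in <think>...</think> or <thought>...</thought>.
_REASONING_RE = re.compile(r"<(think|thought)>(.*?)</\1>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) extracted from raw model output."""
    # Collect the contents of every reasoning block, in order.
    blocks = [m.group(2).strip() for m in _REASONING_RE.finditer(text)]
    # Whatever remains after removing the blocks is treated as the answer.
    answer = _REASONING_RE.sub("", text).strip()
    return "\n".join(blocks), answer


# Hypothetical model output, for illustration only.
raw = "<think>3 - 1 = 2 left; buy 2 more; 2 + 2 = 4.</think>\nYou have 4 apples."
reasoning, answer = split_reasoning(raw)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

If the output contains no tags at all, the helper returns an empty reasoning string and the full text as the answer.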

🏆 Benchmarks (GSM8K)

| Model | GSM8K accuracy | Improvement |
|---|---|---|
| google/gemma-4-E2B-it (Base) | 30.0% | - |
| DeepGemma-2B-Reasoning (Ours) | 44.0% | +14.0 pp 🚀 |
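The improvement reported above is an absolute difference in accuracy (percentage points), which is not the same as the relative gain over the base model. A short sketch of both calculations, using only the two accuracy figures from the table:

```python
# Accuracy figures taken from the benchmark table.
base_acc = 30.0   # google/gemma-4-E2B-it, percent
tuned_acc = 44.0  # DeepGemma-2B-Reasoning, percent

# Absolute gain: difference in percentage points.
absolute_gain = tuned_acc - base_acc

# Relative gain: improvement as a fraction of the base score.
relative_gain = absolute_gain / base_acc * 100

print(f"Absolute gain: {absolute_gain:.1f} percentage points")  # 14.0
print(f"Relative gain: {relative_gain:.1f}%")                   # 46.7%
```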

🛠 Training Details

Training was conducted using Unsloth (QLoRA) on an RTX 4090 (24GB VRAM).

  • Method: QLoRA (4-bit quantization, BF16 adapters)
  • LoRA Parameters: Rank = 48, Alpha = 48
  • Epochs: 2 | Global Steps: 4672 | Learning Rate: 2e-4
  • Final Loss: 1.24
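Two quantities follow directly from the hyperparameters above. The sketch below assumes the standard LoRA scaling rule (adapter update multiplied by alpha / rank, rather than a rank-stabilized variant) and assumes both epochs ran the same number of optimizer steps; neither assumption is confirmed by the training log.

```python
# Hyperparameters from the list above.
rank, alpha = 48, 48
epochs, global_steps = 2, 4672

# Standard LoRA multiplies the low-rank update by alpha / rank (assumed here).
lora_scale = alpha / rank

# Assumes every epoch contributed an equal number of optimizer steps.
steps_per_epoch = global_steps // epochs

print(f"LoRA scale: {lora_scale}")            # 1.0
print(f"Steps per epoch: {steps_per_epoch}")  # 2336
```

With Alpha equal to Rank, the adapter update is applied at unit scale under this rule.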

🗜️ GGUF Version (llama.cpp)

A quantized Q4_K_M GGUF version is available directly in this repo.

  • File: gemma4_e2b-q4_k_m.gguf (~4.7 GB)
  • Quantization: llama.cpp Q4_K_M
  • Merge: Full LoRA merge before quantization (Unsloth)

⚡ Performance (RTX 4090, llama.cpp, ngl=999)

| Metric | Value |
|---|---|
| Prompt processing | ~400 tok/s |
| Generation | ~239–262 tok/s |
| Context | 4096 tokens |
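These throughput figures can be turned into a rough end-to-end latency estimate. The sketch below is a back-of-the-envelope calculation, assuming throughput stays constant over the whole request and ignoring model load time; the function name and the example token counts are hypothetical.

```python
# Throughput figures from the table above (RTX 4090, llama.cpp).
PROMPT_TPS = 400.0               # prompt processing, tokens/s
GEN_TPS_RANGE = (239.0, 262.0)   # generation, tokens/s (low, high)


def estimate_seconds(prompt_tokens: int, new_tokens: int) -> tuple[float, float]:
    """Return (fastest, slowest) estimated wall-clock seconds for one request."""
    prefill = prompt_tokens / PROMPT_TPS
    slowest = prefill + new_tokens / GEN_TPS_RANGE[0]
    fastest = prefill + new_tokens / GEN_TPS_RANGE[1]
    return fastest, slowest


# Example: a 200-token prompt followed by 512 generated tokens.
fastest, slowest = estimate_seconds(prompt_tokens=200, new_tokens=512)
print(f"Estimated latency: {fastest:.2f} to {slowest:.2f} s")
```

Long reasoning traces dominate the total, since generation runs slower than prompt processing.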

Usage (llama.cpp)

./llama-cli -m gemma4_e2b-q4_k_m.gguf \
  -p "<start_of_turn>user\nYour question here<end_of_turn>\n<start_of_turn>model\n" \
  -n 512 -ngl 999 -c 4096
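The prompt passed with -p follows the Gemma turn format: each message is wrapped in <start_of_turn> / <end_of_turn> markers, and the prompt ends with an open model turn. A minimal sketch of building that string programmatically is shown below; the helper name `build_gemma_prompt` is hypothetical, and extending the single-turn format from the command above to multiple turns is an assumption.

```python
def build_gemma_prompt(turns: list[tuple[str, str]]) -> str:
    """Build a Gemma-style prompt from (role, text) pairs.

    Roles must be "user" or "model". The result ends with an open
    model turn so the model continues from there.
    """
    parts = []
    for role, text in turns:
        if role not in ("user", "model"):
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    # Leave the final model turn open for generation.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


prompt = build_gemma_prompt([("user", "Your question here")])
print(prompt)
```

The resulting string can be written to a file or passed to the CLI in place of the hand-written prompt.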

💻 Usage (Transformers / Unsloth)

from unsloth import FastVisionModel

# Load the fine-tuned model in 4-bit (requires a CUDA GPU).
model, tokenizer = FastVisionModel.from_pretrained(
    model_name="Zhantas/DeepGemma-2B-Reasoning",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)  # switch to inference mode

question = "I had 3 apples. I ate one, and then bought as many as I had left. How many apples do I have now? Reason step by step."
# Gemma turn format: a user turn followed by an open model turn.
prompt = f"<start_of_turn>user\n{question}<end_of_turn>\n<start_of_turn>model\n"

inputs = tokenizer(text=prompt, return_tensors="pt").to("cuda")
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,  # set explicitly so that temperature takes effect
    temperature=0.3,
    repetition_penalty=1.1,
)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

⚠️ Limitations

The model is prone to "overthinking" on simple tasks. It is best suited for logic puzzles, coding, and mathematics.
