# GGUF Conversion Script
This script converts DragonLLM models from Hugging Face to GGUF format for use with Ollama on Mac.
## Quick Start
```bash
# Activate the virtual environment
cd /Users/jeanbapt/simple-llm-pro-finance
source venv/bin/activate

# Run the conversion (defaults to Qwen-Pro-Finance-R-32B)
python3 scripts/convert_to_gguf.py

# Or specify a model by number (1-5) or name
python3 scripts/convert_to_gguf.py 1   # Qwen-Pro-Finance-R-32B
python3 scripts/convert_to_gguf.py 2   # qwen3-32b-fin-v1.0
python3 scripts/convert_to_gguf.py "DragonLLM/qwen3-32b-fin-v1.0"
```
## Available 32B Models
1. `DragonLLM/Qwen-Pro-Finance-R-32B` (recommended, latest)
2. `DragonLLM/qwen3-32b-fin-v1.0`
3. `DragonLLM/qwen3-32b-fin-v0.3`
4. `DragonLLM/qwen3-32b-fin-v1.0-fp8` (already quantized to FP8)
5. `DragonLLM/Qwen-Pro-Finance-R-32B-FP8` (already quantized to FP8)
## What It Does
- Downloads llama.cpp (if not already present)
- Converts the model to a base GGUF file (FP16, ~64GB)
- Quantizes it to multiple levels (see the command sketch below):
  - Q5_K_M (~20GB): best quality/size balance (recommended)
  - Q6_K (~24GB): higher quality
  - Q4_K_M (~16GB): smaller size
  - Q8_0 (~32GB): highest quality
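Under the hood this is llama.cpp's standard tooling. A rough sketch of the equivalent manual commands, assuming default paths (the exact arguments the script passes may differ):

```bash
# Convert the downloaded Hugging Face checkpoint to a base FP16 GGUF
# (convert_hf_to_gguf.py ships with llama.cpp)
python3 llama.cpp/convert_hf_to_gguf.py ./Qwen-Pro-Finance-R-32B \
    --outfile gguf_models/Qwen-Pro-Finance-R-32B-f16.gguf --outtype f16

# Quantize the FP16 base file down to a target level
./llama.cpp/llama-quantize \
    gguf_models/Qwen-Pro-Finance-R-32B-f16.gguf \
    gguf_models/Qwen-Pro-Finance-R-32B-q5_k_m.gguf Q5_K_M
```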
## Memory Requirements
- Base conversion (FP16): ~64GB RAM
- Quantization: ~32GB RAM (can be done separately)
## Output
Files are saved to `simple-llm-pro-finance/gguf_models/`:
```
gguf_models/
├── Qwen-Pro-Finance-R-32B-f16.gguf     (~64GB)
├── Qwen-Pro-Finance-R-32B-q5_k_m.gguf  (~20GB)  ← Recommended
├── Qwen-Pro-Finance-R-32B-q6_k.gguf    (~24GB)
├── Qwen-Pro-Finance-R-32B-q4_k_m.gguf  (~16GB)
└── Qwen-Pro-Finance-R-32B-q8_0.gguf    (~32GB)
```
## Using with Ollama
After conversion, create an Ollama model:
```bash
# Create the Modelfile
cat > Modelfile << 'EOF'
FROM ./gguf_models/Qwen-Pro-Finance-R-32B-q5_k_m.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8192
PARAMETER temperature 0.7
EOF

# Create the model
ollama create qwen-finance-32b -f Modelfile

# Use it
ollama run qwen-finance-32b "What is compound interest?"
```
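Beyond the CLI, the same model can be queried through Ollama's local REST API, which listens on port 11434 by default:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "qwen-finance-32b",
  "prompt": "What is compound interest?",
  "stream": false
}'
```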
## Tool Calling Support
GGUF models retain tool-calling capabilities, and Ollama exposes an OpenAI-compatible function-calling API:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="qwen-finance-32b",
    messages=[{"role": "user", "content": "Calculate future value of 10000 at 5% for 10 years"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "calculate_fv",
            "description": "Calculate future value",
            "parameters": {
                "type": "object",
                "properties": {
                    "pv": {"type": "number"},
                    "rate": {"type": "number"},
                    "nper": {"type": "number"},
                },
            },
        },
    }],
    tool_choice="auto",
)
```
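Ollama returns the tool call; executing the function is up to your code. A minimal sketch of dispatching it, where `calculate_fv` is a hypothetical local implementation of the standard future-value formula FV = pv * (1 + rate)^nper:

```python
import json

def calculate_fv(pv: float, rate: float, nper: float) -> float:
    # Standard future-value formula
    return pv * (1 + rate) ** nper

# tool_calls is None when the model answers in plain text instead
for call in response.choices[0].message.tool_calls or []:
    if call.function.name == "calculate_fv":
        args = json.loads(call.function.arguments)
        # pv=10000, rate=0.05, nper=10 -> ~16288.95
        print(calculate_fv(**args))
```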
## Troubleshooting
### Out of Memory
- Use Q4_K_M instead of Q5_K_M
- Close other applications
- Reduce the context window in Ollama (`num_ctx 4096`), as shown below
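For example, edit the `PARAMETER num_ctx` line in the Modelfile from the section above and recreate the model:

```bash
# Halve the context window, then rebuild the Ollama model (BSD sed on macOS)
sed -i '' 's/num_ctx 8192/num_ctx 4096/' Modelfile
ollama create qwen-finance-32b -f Modelfile
```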
### Conversion Fails
- Ensure `HF_TOKEN_LC2` is set in `.env`
- Check that you have access to the model on Hugging Face
- Verify you have enough disk space (~200GB recommended)
### Quantization Fails
- The base FP16 file is still usable
- Try quantizing manually:

```bash
./llama.cpp/llama-quantize input.gguf output.gguf Q5_K_M
```
## Notes
- FP8 models (models 4 and 5) are already quantized, but Ollama still needs the GGUF conversion to run them
- Q5_K_M is recommended as the best quality/size trade-off on Mac
- Base conversion takes 30-60 minutes, depending on your system
- Quantization takes 10-20 minutes per level