Author: Simon-Pierre Boucher

# Qwen2.5-14B-Instruct - GGUF Quantized by worthdoing

*Quantized for local Mac inference (Apple Silicon / Metal) by worthdoing*

## About
This is a GGUF quantized version of Qwen2.5-14B-Instruct, optimized for running locally on Apple Silicon Macs with llama.cpp, Ollama, or LM Studio.
- Original model: Qwen/Qwen2.5-14B-Instruct
- Parameters: 14B
- Quantized by: worthdoing
- Pipeline: corelm-model v1.0
## Description

Qwen's 14B flagship instruct model, with exceptional multilingual, reasoning, and coding capabilities.
## Available Quantizations

| File | Quant | BPW | Size | Use Case |
|---|---|---|---|---|
| qwen2.5-14b-instruct-Q4_K_M-worthdoing.gguf | Q4_K_M | 4.58 | ~7.5 GB | **Recommended** - best quality/size ratio |
| qwen2.5-14b-instruct-Q5_K_M-worthdoing.gguf | Q5_K_M | 5.33 | ~8.7 GB | Higher quality, still fast |
| qwen2.5-14b-instruct-Q8_0-worthdoing.gguf | Q8_0 | 7.96 | ~13.0 GB | Near-original quality |
## How to Use

### With Ollama

```bash
# Create a Modelfile
cat > Modelfile <<'MODELEOF'
FROM ./qwen2.5-14b-instruct-Q4_K_M-worthdoing.gguf
MODELEOF

ollama create qwen2.5-14b-instruct -f Modelfile
ollama run qwen2.5-14b-instruct
```
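Ollama will often pick up the chat template from the GGUF metadata automatically, but you can pin it explicitly in the Modelfile. A fuller sketch, assuming Qwen2.5's ChatML prompt format; the sampling parameters are illustrative, not values from the pipeline:

```bash
cat > Modelfile <<'MODELEOF'
FROM ./qwen2.5-14b-instruct-Q4_K_M-worthdoing.gguf

# ChatML prompt format used by the Qwen2.5 instruct models
TEMPLATE """<|im_start|>system
{{ .System }}<|im_end|>
<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

# Illustrative defaults; tune for your workload
PARAMETER stop "<|im_end|>"
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
MODELEOF

ollama create qwen2.5-14b-instruct -f Modelfile
```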
### With llama.cpp

```bash
llama-cli -m qwen2.5-14b-instruct-Q4_K_M-worthdoing.gguf -p "Your prompt here" -ngl 99
```
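To expose the model as a local OpenAI-compatible API instead of a one-shot prompt, `llama-server` (bundled with llama.cpp) can serve the same file; the context size and port below are illustrative:

```bash
# Serve the model with full Metal offload (-ngl 99) on an illustrative port
llama-server -m qwen2.5-14b-instruct-Q4_K_M-worthdoing.gguf -ngl 99 -c 8192 --port 8080

# Query it from another shell via the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in French."}]}'
```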
### With LM Studio

1. Download the GGUF file
2. Open LM Studio -> My Models -> Import
3. Select the GGUF file and start chatting
## Quantization Method
Our quantization pipeline (corelm-model v1.0) follows a rigorous multi-step process to ensure maximum quality and compatibility:
### Step 1: Download & Validation

- Model weights are downloaded from HuggingFace Hub in SafeTensors format (`.safetensors`); a download sketch follows this list
- Legacy formats (`.bin`, `.pt`) are excluded to ensure clean, verified weights
- Tokenizer, configuration, and all metadata are preserved
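A sketch of what this step looks like with `huggingface-cli`; the exact tooling inside corelm-model is not documented here, and the include patterns are assumptions:

```bash
# Fetch SafeTensors weights plus tokenizer/config, skipping legacy .bin/.pt files
huggingface-cli download Qwen/Qwen2.5-14B-Instruct \
  --include "*.safetensors" "*.json" "*.txt" \
  --local-dir ./Qwen2.5-14B-Instruct
```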
### Step 2: Conversion to GGUF F16 Baseline

- The original model is converted to GGUF format at FP16 precision using `convert_hf_to_gguf.py` from llama.cpp (sketched below)
- This lossless baseline preserves the full original model quality
- Architecture-specific tensors (attention, FFN, embeddings, MoE routing) are mapped to their GGUF equivalents
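A sketch of the conversion call, assuming the model directory from Step 1; the output filename is illustrative:

```bash
# Produce the lossless FP16 GGUF baseline with llama.cpp's converter
python convert_hf_to_gguf.py ./Qwen2.5-14B-Instruct \
  --outtype f16 \
  --outfile qwen2.5-14b-instruct-f16.gguf
```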
### Step 3: K-Quant Quantization

- The F16 baseline is quantized with `llama-quantize` using k-quant methods (see the example after the table)
- K-quants use a mixed-precision approach: more important layers (attention, output) retain higher precision, while less sensitive layers (FFN) are compressed more aggressively
- Each quantization level offers a different quality/size tradeoff:
| Method | Bits per Weight | Strategy |
|---|---|---|
| Q4_K_M | ~4.58 bpw | Mixed 4/5-bit. Attention & output layers use Q5_K, FFN layers use Q4_K. Best balance of quality and size. |
| Q5_K_M | ~5.33 bpw | Mixed 5/6-bit. Attention & output layers use Q6_K, FFN layers use Q5_K. Higher quality with moderate size increase. |
| Q8_0 | ~7.96 bpw | Uniform 8-bit. All layers quantized to 8-bit. Near-lossless quality, largest file size. |
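A sketch of the quantization call for the recommended Q4_K_M level; swap the final type argument for Q5_K_M or Q8_0 to produce the other files:

```bash
# Quantize the FP16 baseline down to Q4_K_M
llama-quantize qwen2.5-14b-instruct-f16.gguf \
  qwen2.5-14b-instruct-Q4_K_M-worthdoing.gguf Q4_K_M
```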
### Step 4: Metadata Injection

- Custom metadata is embedded directly in each GGUF file (a verification sketch follows):
  - `general.quantized_by`: worthdoing
  - `general.quantization_version`: corelm-1.0
- This ensures full traceability and provenance of every quantized file
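One way to verify the embedded keys, assuming the `gguf` Python package (`pip install gguf`), which ships a `gguf-dump` utility:

```bash
# Print only the provenance keys from the file's metadata
gguf-dump qwen2.5-14b-instruct-Q4_K_M-worthdoing.gguf \
  | grep -E "quantized_by|quantization_version"
```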
## Tools & Environment

- llama.cpp: used for both conversion and quantization; the industry-standard open-source LLM inference engine
- Target platform: Apple Silicon Macs (M1/M2/M3/M4) with Metal GPU acceleration
- Inference runtimes: compatible with `llama.cpp`, `Ollama`, `LM Studio`, `koboldcpp`, and any GGUF-compatible runtime
## Recommended Hardware
| Quant | Min RAM | Recommended |
|---|---|---|
| Q4_K_M | 9 GB | Mac with 14 GB+ RAM |
| Q5_K_M | 11 GB | Mac with 17 GB+ RAM |
| Q8_0 | 16 GB | Mac with 25 GB+ RAM |
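The minimum-RAM figures roughly track the GGUF file size (the weights must fit in memory) plus working overhead for the KV cache and runtime. The file sizes themselves follow directly from bits per weight; a back-of-envelope check, assuming the nominal 14B parameter count:

```bash
# size_bytes ≈ params × bpw / 8; printed in GiB to match the table above
python3 -c '
for quant, bpw in [("Q4_K_M", 4.58), ("Q5_K_M", 5.33), ("Q8_0", 7.96)]:
    print(f"{quant}: ~{14e9 * bpw / 8 / 2**30:.1f} GiB")
'
# Q4_K_M: ~7.5 GiB, Q5_K_M: ~8.7 GiB, Q8_0: ~13.0 GiB
```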
## Tags
general, multilingual, coding, reasoning
Quantized with the corelm-model pipeline by worthdoing on 2026-04-17