Qwen2.5-14B-Instruct - GGUF Quantized by worthdoing

Quantized for local Mac inference (Apple Silicon / Metal) by worthdoing

Author: Simon-Pierre Boucher

About

This is a GGUF quantized version of Qwen2.5-14B-Instruct, optimized for running locally on Apple Silicon Macs with llama.cpp, Ollama, or LM Studio.

Description

Qwen's 14B instruction-tuned flagship, with exceptional multilingual, reasoning, and coding capabilities.

Available Quantizations

File                                         Quant   BPW   Size      Use Case
qwen2.5-14b-instruct-Q4_K_M-worthdoing.gguf  Q4_K_M  4.58  ~7.5 GB   Recommended - best quality/size ratio
qwen2.5-14b-instruct-Q5_K_M-worthdoing.gguf  Q5_K_M  5.33  ~8.7 GB   Higher quality, still fast
qwen2.5-14b-instruct-Q8_0-worthdoing.gguf    Q8_0    7.96  ~13.0 GB  Near-original quality

How to Use

With Ollama

# Create a Modelfile
cat > Modelfile <<'MODELEOF'
FROM ./qwen2.5-14b-instruct-Q4_K_M-worthdoing.gguf
MODELEOF

ollama create qwen2.5-14b-instruct -f Modelfile
ollama run qwen2.5-14b-instruct
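
The Modelfile can also pin inference defaults. FROM and PARAMETER are standard Ollama Modelfile directives; the parameter values below are illustrative, not official recommendations for this model:

```shell
# Modelfile with explicit inference defaults (values are illustrative).
cat > Modelfile <<'MODELEOF'
FROM ./qwen2.5-14b-instruct-Q4_K_M-worthdoing.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
MODELEOF
```

Re-run `ollama create qwen2.5-14b-instruct -f Modelfile` after editing the Modelfile so the new parameters take effect.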

With llama.cpp

llama-cli -m qwen2.5-14b-instruct-Q4_K_M-worthdoing.gguf -p "Your prompt here" -ngl 99
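
Beyond one-shot prompting, llama.cpp also ships llama-server, which exposes an OpenAI-compatible HTTP API. The port and context size below are illustrative:

```shell
# Start a local server with full GPU offload (Metal); -c sets the context window.
llama-server -m qwen2.5-14b-instruct-Q4_K_M-worthdoing.gguf -ngl 99 -c 8192 --port 8080

# From another terminal, query the OpenAI-compatible endpoint:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```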

With LM Studio

  1. Download the GGUF file
  2. Open LM Studio -> My Models -> Import
  3. Select the GGUF file and start chatting
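
For any of the runtimes above, a single quant can be fetched from the worthdoing/Qwen2.5-14B-Instruct-GGUF repository with the Hugging Face CLI (Q4_K_M shown; a sketch, assuming the CLI extra is installed):

```shell
pip install -U "huggingface_hub[cli]"
huggingface-cli download worthdoing/Qwen2.5-14B-Instruct-GGUF \
  qwen2.5-14b-instruct-Q4_K_M-worthdoing.gguf --local-dir .
```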

Quantization Method

Our quantization pipeline (corelm-model v1.0) follows a rigorous multi-step process to ensure maximum quality and compatibility:

Step 1 - Download & Validation

  • Model weights are downloaded from HuggingFace Hub in SafeTensors format (.safetensors)
  • Legacy formats (.bin, .pt) are excluded to ensure clean, verified weights
  • Tokenizer, configuration, and all metadata are preserved
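
A download restricted to verified SafeTensors weights can be expressed with the Hugging Face CLI's include/exclude filters (a sketch; the patterns are illustrative):

```shell
# Fetch only SafeTensors weights plus tokenizer/config; skip legacy .bin/.pt files.
huggingface-cli download Qwen/Qwen2.5-14B-Instruct \
  --include "*.safetensors" "*.json" "tokenizer*" \
  --exclude "*.bin" "*.pt" \
  --local-dir ./Qwen2.5-14B-Instruct
```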

Step 2 - Conversion to GGUF F16 Baseline

  • The original model is converted to GGUF format at FP16 precision using convert_hf_to_gguf.py from llama.cpp
  • This lossless baseline preserves the full original model quality
  • Architecture-specific tensors (attention, FFN, embeddings, MoE routing) are mapped to their GGUF equivalents
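
Assuming a llama.cpp checkout with its Python requirements installed, the conversion step looks roughly like this (paths are illustrative):

```shell
# Convert the SafeTensors checkpoint to a lossless F16 GGUF baseline.
python convert_hf_to_gguf.py ./Qwen2.5-14B-Instruct \
  --outfile qwen2.5-14b-instruct-f16.gguf \
  --outtype f16
```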

Step 3 - K-Quant Quantization

  • The F16 baseline is quantized using llama-quantize with k-quant methods
  • K-quants use a mixed-precision approach: more important layers (attention, output) retain higher precision, while less sensitive layers (FFN) are compressed more aggressively
  • Each quantization level offers a different quality/size tradeoff:
Method   Bits per Weight   Strategy
Q4_K_M   ~4.58 bpw         Mixed 4-/5-bit: attention & output layers use Q5_K, FFN layers use Q4_K. Best balance of quality and size.
Q5_K_M   ~5.33 bpw         Mixed 5-/6-bit: attention & output layers use Q6_K, FFN layers use Q5_K. Higher quality with a moderate size increase.
Q8_0     ~7.96 bpw         Uniform 8-bit across all layers. Near-lossless quality, largest file size.
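
From the F16 baseline, each release is produced with llama-quantize; the tool takes the input GGUF, the output GGUF, and the quantization type as positional arguments (paths are illustrative):

```shell
# Quantize the F16 baseline down to Q4_K_M.
llama-quantize qwen2.5-14b-instruct-f16.gguf \
  qwen2.5-14b-instruct-Q4_K_M-worthdoing.gguf Q4_K_M
```

The same baseline feeds all three outputs; only the final type argument (Q4_K_M, Q5_K_M, Q8_0) changes.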

Step 4 - Metadata Injection

  • Custom metadata is embedded directly in each GGUF file:
    • general.quantized_by: worthdoing
    • general.quantization_version: corelm-1.0
  • This ensures full traceability and provenance of every quantized file
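
These fields can be checked after download; the gguf Python package ships a gguf-dump utility (a sketch, assuming `pip install gguf`):

```shell
# Print GGUF metadata and filter for the provenance fields.
gguf-dump qwen2.5-14b-instruct-Q4_K_M-worthdoing.gguf | grep -i quantized
```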

Tools & Environment

  • llama.cpp: Used for both conversion and quantization - the industry-standard open-source LLM inference engine
  • Target platform: Apple Silicon Macs (M1/M2/M3/M4) with Metal GPU acceleration
  • Inference runtimes: Compatible with llama.cpp, Ollama, LM Studio, koboldcpp, and any GGUF-compatible runtime

Recommended Hardware

Quant    Min RAM   Recommended
Q4_K_M   9 GB      Mac with 14 GB+ RAM
Q5_K_M   11 GB     Mac with 17 GB+ RAM
Q8_0     16 GB     Mac with 25 GB+ RAM
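
The minimum-RAM figures track file size plus runtime overhead. A back-of-the-envelope check (the ~14.8B parameter count and ~1.5 GiB overhead are assumptions; real files differ slightly because metadata and per-layer quant mixes vary):

```shell
PARAMS_B=14.8   # approximate parameter count, in billions (assumption)
BPW=4.58        # bits per weight for Q4_K_M
# File size in GiB: params * bits-per-weight / 8 bits-per-byte, converted to GiB.
FILE_GIB=$(awk -v p="$PARAMS_B" -v b="$BPW" \
  'BEGIN { printf "%.1f", p * 1e9 * b / 8 / 1073741824 }')
# Add ~1.5 GiB for KV cache and runtime overhead (rough assumption).
NEED_GIB=$(awk -v f="$FILE_GIB" 'BEGIN { printf "%.1f", f + 1.5 }')
echo "Q4_K_M: ~${FILE_GIB} GiB file, ~${NEED_GIB} GiB RAM minimum"
```

This lands close to the 9 GB minimum listed for Q4_K_M above.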

Tags

general, multilingual, coding, reasoning


Quantized with corelm-model pipeline by worthdoing on 2026-04-17

Base model: Qwen/Qwen2.5-14B