
Author: Simon-Pierre Boucher


Qwen2.5-Coder-7B-Instruct - GGUF Quantized by worthdoing

Quantized for local Mac inference (Apple Silicon / Metal) by worthdoing

About

This is a GGUF quantized version of Qwen2.5-Coder-7B-Instruct, optimized for running locally on Apple Silicon Macs with llama.cpp, Ollama, or LM Studio.

Description

Qwen's dedicated coding model. Top-tier code generation and understanding.

Available Quantizations

| File | Quant | BPW | Size | Use Case |
|------|-------|-----|------|----------|
| qwen2.5-coder-7b-instruct-Q4_K_M-worthdoing.gguf | Q4_K_M | 4.58 | ~3.7 GB | Recommended: best quality/size ratio |
| qwen2.5-coder-7b-instruct-Q5_K_M-worthdoing.gguf | Q5_K_M | 5.33 | ~4.3 GB | Higher quality, still fast |
| qwen2.5-coder-7b-instruct-Q8_0-worthdoing.gguf | Q8_0 | 7.96 | ~6.5 GB | Near-original quality |

How to Use

With Ollama

# Create a Modelfile
cat > Modelfile <<'MODELEOF'
FROM ./qwen2.5-coder-7b-instruct-Q4_K_M-worthdoing.gguf
MODELEOF

ollama create qwen2.5-coder-7b-instruct -f Modelfile
ollama run qwen2.5-coder-7b-instruct
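
The Modelfile can also pin generation defaults at import time. A sketch (the PARAMETER values below are illustrative, not tuned for this model):

```
FROM ./qwen2.5-coder-7b-instruct-Q4_K_M-worthdoing.gguf
# Illustrative defaults; adjust for your workload
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
```

`temperature` and `num_ctx` (context window size) are standard Ollama Modelfile parameters.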

With llama.cpp

llama-cli -m qwen2.5-coder-7b-instruct-Q4_K_M-worthdoing.gguf -p "Your prompt here" -ngl 99
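
The same file also works with llama.cpp's HTTP server if you want an OpenAI-compatible API rather than a one-shot prompt (port and context size below are illustrative):

```shell
llama-server -m qwen2.5-coder-7b-instruct-Q4_K_M-worthdoing.gguf \
  -ngl 99 -c 8192 --port 8080
```

`-ngl 99` offloads all layers to the Metal GPU; the server then listens on `http://localhost:8080` with OpenAI-compatible `/v1` endpoints.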

With LM Studio

  1. Download the GGUF file
  2. Open LM Studio -> My Models -> Import
  3. Select the GGUF file and start chatting

Quantization Method

Our quantization pipeline (corelm-model v1.0) follows a multi-step process designed to preserve model quality and maximize runtime compatibility:

Step 1 – Download & Validation

  • Model weights are downloaded from HuggingFace Hub in SafeTensors format (.safetensors)
  • Legacy formats (.bin, .pt) are excluded to ensure clean, verified weights
  • Tokenizer, configuration, and all metadata are preserved

Step 2 – Conversion to GGUF F16 Baseline

  • The original model is converted to GGUF format at FP16 precision using convert_hf_to_gguf.py from llama.cpp
  • This lossless baseline preserves the full original model quality
  • Architecture-specific tensors (attention, FFN, embeddings, MoE routing) are mapped to their GGUF equivalents

Step 3 – K-Quant Quantization

  • The F16 baseline is quantized using llama-quantize with k-quant methods
  • K-quants use a mixed-precision approach: more important layers (attention, output) retain higher precision, while less sensitive layers (FFN) are compressed more aggressively
  • Each quantization level offers a different quality/size tradeoff:
| Method | Bits per Weight | Strategy |
|--------|-----------------|----------|
| Q4_K_M | ~4.58 bpw | Mixed 4/5-bit: attention & output layers use Q5_K, FFN layers use Q4_K. Best balance of quality and size. |
| Q5_K_M | ~5.33 bpw | Mixed 5/6-bit: attention & output layers use Q6_K, FFN layers use Q5_K. Higher quality with a moderate size increase. |
| Q8_0 | ~7.96 bpw | Uniform 8-bit: all layers quantized to 8-bit. Near-lossless quality, largest file size. |
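
As a ballpark check on these levels, a quantized file's size is roughly parameter count × bits-per-weight / 8. A minimal shell sketch (the ~7.6B parameter count is an assumption, and real files vary a bit with rounding and with exactly which tensors keep higher precision):

```shell
# Rough GGUF size: params * bpw / 8 bytes (bpw scaled by 100 for integer math)
params=7620000000   # assumed ~7.6B weights
bpw=458             # Q4_K_M ~= 4.58 bpw
bytes=$(( params * bpw / 100 / 8 ))
gib=$(( bytes / 1024 / 1024 / 1024 ))
echo "Q4_K_M ~ ${gib} GiB"   # prints: Q4_K_M ~ 4 GiB
```

Treat this as an estimate only; the sizes listed in the table of available quantizations come from the actual files.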

Step 4 – Metadata Injection

  • Custom metadata is embedded directly in each GGUF file:
    • general.quantized_by: worthdoing
    • general.quantization_version: corelm-1.0
  • This ensures full traceability and provenance of every quantized file
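
End to end, the four steps above reduce to a handful of commands. A sketch (local paths and output names are illustrative; the metadata script ships in llama.cpp's gguf-py/scripts directory):

```shell
# 1. Download SafeTensors weights from the Hub
huggingface-cli download Qwen/Qwen2.5-Coder-7B-Instruct --local-dir ./hf-model

# 2. Convert to the lossless F16 GGUF baseline
python convert_hf_to_gguf.py ./hf-model --outtype f16 --outfile model-f16.gguf

# 3. Quantize with k-quants
llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M

# 4. Inject provenance metadata
python gguf_set_metadata.py model-Q4_K_M.gguf general.quantized_by worthdoing
```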

Tools & Environment

  • llama.cpp: used for both conversion and quantization; the industry-standard open-source LLM inference engine
  • Target platform: Apple Silicon Macs (M1/M2/M3/M4) with Metal GPU acceleration
  • Inference runtimes: Compatible with llama.cpp, Ollama, LM Studio, koboldcpp, and any GGUF-compatible runtime

Recommended Hardware

| Quant | Min RAM | Recommended |
|-------|---------|-------------|
| Q4_K_M | 4 GB | Mac with 8 GB+ RAM |
| Q5_K_M | 5 GB | Mac with 8 GB+ RAM |
| Q8_0 | 8 GB | Mac with 12 GB+ RAM |
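
The "Min RAM" figures are roughly file size plus KV cache and runtime overhead. For a Qwen2.5-7B-class model (28 layers, 4 KV heads via GQA, head dim 128, per the upstream config), the F16 KV cache at an 8K context works out as:

```shell
# F16 KV cache = 2 (K and V) * layers * kv_heads * head_dim * 2 bytes * context
layers=28; kv_heads=4; head_dim=128; ctx=8192
kv_bytes=$(( 2 * layers * kv_heads * head_dim * 2 * ctx ))
kv_mib=$(( kv_bytes / 1024 / 1024 ))
echo "KV cache @ ${ctx} ctx: ${kv_mib} MiB"   # prints: KV cache @ 8192 ctx: 448 MiB
```

So budget roughly half a GiB on top of the file size for an 8K context, and proportionally more for longer contexts.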

Tags

coding, code-generation, code-review


Quantized with corelm-model pipeline by worthdoing on 2026-04-17

Model Details

Model size: 8B params
Architecture: qwen2
Base model: Qwen/Qwen2.5-7B