How to use from
llama.cpp
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf worthdoing/Phi-4-mini-GGUF:
# Run inference directly in the terminal:
llama-cli -hf worthdoing/Phi-4-mini-GGUF:
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf worthdoing/Phi-4-mini-GGUF:
# Run inference directly in the terminal:
llama-cli -hf worthdoing/Phi-4-mini-GGUF:
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf worthdoing/Phi-4-mini-GGUF:
# Run inference directly in the terminal:
./llama-cli -hf worthdoing/Phi-4-mini-GGUF:
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf worthdoing/Phi-4-mini-GGUF:
# Run inference directly in the terminal:
./build/bin/llama-cli -hf worthdoing/Phi-4-mini-GGUF:
Use Docker
docker model run hf.co/worthdoing/Phi-4-mini-GGUF:
Quick Links

worthdoing

Author: Simon-Pierre Boucher

GGUF Parameters Apple Silicon License worthdoing

Q4_K_M Q5_K_M Q8_0

Phi-4-mini - GGUF Quantized by worthdoing

Quantized for local Mac inference (Apple Silicon / Metal) by worthdoing

About

This is a GGUF quantized version of Phi-4-mini, optimized for running locally on Apple Silicon Macs with llama.cpp, Ollama, or LM Studio.

Description

Microsoft's reasoning powerhouse in a tiny package.

Available Quantizations

File Quant BPW Size Use Case
phi-4-mini-Q4_K_M-worthdoing.gguf Q4_K_M 4.58 ~2.0 GB Recommended - Best quality/size ratio
phi-4-mini-Q5_K_M-worthdoing.gguf Q5_K_M 5.33 ~2.4 GB Higher quality, still fast
phi-4-mini-Q8_0-worthdoing.gguf Q8_0 7.96 ~3.5 GB Near-original quality

How to Use

With Ollama

# Create a Modelfile
cat > Modelfile <<'MODELEOF'
FROM ./phi-4-mini-Q4_K_M-worthdoing.gguf
MODELEOF

ollama create phi-4-mini -f Modelfile
ollama run phi-4-mini

With llama.cpp

llama-cli -m phi-4-mini-Q4_K_M-worthdoing.gguf -p "Your prompt here" -ngl 99

With LM Studio

  1. Download the GGUF file
  2. Open LM Studio -> My Models -> Import
  3. Select the GGUF file and start chatting

Quantization Method

Our quantization pipeline (corelm-model v1.0) follows a rigorous multi-step process to ensure maximum quality and compatibility:

Step 1 โ€” Download & Validation

  • Model weights are downloaded from HuggingFace Hub in SafeTensors format (.safetensors)
  • Legacy formats (.bin, .pt) are excluded to ensure clean, verified weights
  • Tokenizer, configuration, and all metadata are preserved

Step 2 โ€” Conversion to GGUF F16 Baseline

  • The original model is converted to GGUF format at FP16 precision using convert_hf_to_gguf.py from llama.cpp
  • This lossless baseline preserves the full original model quality
  • Architecture-specific tensors (attention, FFN, embeddings, MoE routing) are mapped to their GGUF equivalents

Step 3 โ€” K-Quant Quantization

  • The F16 baseline is quantized using llama-quantize with k-quant methods
  • K-quants use a mixed-precision approach: more important layers (attention, output) retain higher precision, while less sensitive layers (FFN) are compressed more aggressively
  • Each quantization level offers a different quality/size tradeoff:
Method Bits per Weight Strategy
Q4_K_M ~4.58 bpw Mixed 4/5-bit. Attention & output layers use Q5_K, FFN layers use Q4_K. Best balance of quality and size.
Q5_K_M ~5.33 bpw Mixed 5/6-bit. Attention & output layers use Q6_K, FFN layers use Q5_K. Higher quality with moderate size increase.
Q8_0 ~7.96 bpw Uniform 8-bit. All layers quantized to 8-bit. Near-lossless quality, largest file size.

Step 4 โ€” Metadata Injection

  • Custom metadata is embedded directly in each GGUF file:
    • general.quantized_by: worthdoing
    • general.quantization_version: corelm-1.0
  • This ensures full traceability and provenance of every quantized file

Tools & Environment

  • llama.cpp: Used for both conversion and quantization โ€” the industry-standard open-source LLM inference engine
  • Target platform: Apple Silicon Macs (M1/M2/M3/M4) with Metal GPU acceleration
  • Inference runtimes: Compatible with llama.cpp, Ollama, LM Studio, koboldcpp, and any GGUF-compatible runtime

Recommended Hardware

Quant Min RAM Recommended
Q4_K_M 4 GB Mac with 8 GB+ RAM
Q5_K_M 4 GB Mac with 8 GB+ RAM
Q8_0 4 GB Mac with 8 GB+ RAM

Tags

general, reasoning, coding, math


Quantized with corelm-model pipeline by worthdoing on 2026-04-17

Downloads last month
123
GGUF
Model size
4B params
Architecture
phi3
Hardware compatibility
Log In to add your hardware

3-bit

4-bit

5-bit

6-bit

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for worthdoing/Phi-4-mini-GGUF

Quantized
(147)
this model