llama.cpp Support Now Available!
I'm excited to announce that IQuest-Loop-Instruct models are now fully supported in llama.cpp!
This is the world's first implementation of loop attention in the GGUF ecosystem.
What's New:
- Full loop attention support - Dual attention with learned per-head gates
- GGUF conversion - Convert PyTorch models to GGUF format (see the sketch below)
- Quantization support - Q4_K_M, Q5_K_M, Q8_0 quantization available
- Production ready - Tested and working with text generation
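If you'd rather convert and quantize a checkpoint yourself instead of using the pre-converted files linked below, something along these lines should work with the PR branch checked out (the paths and filenames are placeholders, not the exact commands from the PR):

```bash
# Convert the Hugging Face checkpoint to GGUF at F16.
# Loop-attention support requires the llama.cpp branch from the PR linked below.
python convert_hf_to_gguf.py /path/to/IQuest-Coder-V1-40B-Loop-Instruct \
    --outfile IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf \
    --outtype f16

# Quantize the F16 GGUF down to Q4_K_M (Q5_K_M and Q8_0 work the same way).
./llama-quantize IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf \
    IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf Q4_K_M
```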
Quick Start:
```bash
# Run inference
./llama-cli --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf \
    --prompt "Write a function to reverse a linked list" \
    --n-predict 256
```
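The same file should also run behind llama.cpp's built-in OpenAI-compatible server; the context size and port below are just illustrative values:

```bash
# Serve the model over HTTP (OpenAI-compatible API).
./llama-server -m IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf \
    -c 4096 --port 8080
```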
GGUF Models Available:
Pre-converted GGUF models: https://huggingface.co/Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF
Sizes:
- F16: 75GB
- Q8_0: 40GB
- Q5_K_M: 27GB
- Q4_K_M: 23GB
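To fetch one of the pre-converted files, something like this should work (the exact filename inside the repo is an assumption, so check the repo's file listing):

```bash
# Download the Q4_K_M file from the Hugging Face repo (filename assumed).
huggingface-cli download Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF \
    IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf --local-dir .
```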
Technical Details:
The implementation includes:
- Loop iteration wrapper (loop_num=2)
- Global K/V caching from Loop 0
- Dual attention (local + global) with gate mixing (sketched below)
- Full backwards compatibility with standard llama models
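On the gate mixing: a common formulation of a learned per-head gate is a sigmoid-weighted blend, roughly `out_h = g_h * local_attn_h + (1 - g_h) * global_attn_h` for each head h. The exact formulation used here is the one in the PR below; the line above is an illustrative sketch, not the implementation.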
PR to llama.cpp: https://github.com/ggml-org/llama.cpp/pull/18680
Performance:
Tested on IQuest-Coder-V1-40B-Loop-Instruct:
- Prompt processing: ~3.4 t/s
- Text generation: ~0.8 t/s
- Memory overhead: ~512MB for global K/V cache
Big thanks to the llama.cpp community and @ggerganov for the amazing ecosystem!