πŸŽ‰ llama.cpp Support Now Available!

I'm excited to announce that IQuest-Loop-Instruct models are now fully supported in llama.cpp! πŸš€

To my knowledge, this is the first implementation of loop attention in the GGUF ecosystem.

What's New:

βœ… Full loop attention support - Dual attention with learned per-head gates
βœ… GGUF conversion - Convert PyTorch checkpoints to GGUF format (see the sketch after this list)
βœ… Quantization support - Q4_K_M, Q5_K_M, and Q8_0 available
βœ… Production ready - Tested and working with text generation
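
Want to do the conversion yourself? The standard llama.cpp tooling applies; here's a minimal sketch (paths are placeholders, adjust to your checkout):

# Convert the PyTorch/safetensors checkpoint to GGUF at F16
python convert_hf_to_gguf.py /path/to/IQuest-Coder-V1-40B-Loop-Instruct \
    --outfile IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf \
    --outtype f16

# Quantize (Q5_K_M and Q8_0 work the same way)
./llama-quantize IQuest-Coder-V1-40B-Loop-Instruct-f16.gguf \
    IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf Q4_K_M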

Quick Start:

# Run inference
./llama-cli --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf \
    --prompt "Write a function to reverse a linked list" \
    --n-predict 256
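
The quantized model should also work through the bundled HTTP server. The flags below are standard llama.cpp options; I haven't separately verified the loop-attention path under llama-server, so treat this as a sketch:

# Serve an OpenAI-compatible endpoint
./llama-server --model IQuest-Coder-V1-40B-Loop-Instruct-q4_k_m.gguf \
    --ctx-size 4096 --port 8080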

GGUF Models Available:

Pre-converted GGUF models: https://huggingface.co/Avarok/IQuest-Coder-V1-40B-Loop-Instruct-GGUF

Sizes:

  • F16: 75GB
  • Q8_0: 40GB
  • Q5_K_M: 27GB
  • Q4_K_M: 23GB

Technical Details:

The implementation includes:

  • Loop iteration wrapper (loop_num=2)
  • Global K/V caching from Loop 0
  β€’ Dual attention (local + global) with gate mixing (see the sketch after this list)
  • Full backwards compatibility with standard llama models
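
For intuition, the per-head gate mixing plausibly reduces to the convex combination below; the sigmoid parameterization is my assumption, not confirmed from the PR:

out_h = g_h * local_attn_h + (1 - g_h) * global_attn_h,   where g_h = sigmoid(gamma_h)

Here local_attn_h attends within the current loop iteration, global_attn_h attends over the K/V cached from Loop 0, and gamma_h is one learned scalar per head.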

PR to llama.cpp: https://github.com/ggml-org/llama.cpp/pull/18680

Performance:

Tested on IQuest-Coder-V1-40B-Loop-Instruct:

  • Prompt processing: ~3.4 t/s
  • Text generation: ~0.8 t/s
  β€’ Memory overhead: ~512MB for the global K/V cache (sizing formula below)
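
To estimate the global cache for other context lengths, standard K/V sizing arithmetic applies (generic, not specific to this PR; I haven't re-derived the ~512MB figure from the 40B config):

kv_bytes β‰ˆ 2 * n_layers * n_ctx * n_kv_heads * d_head * bytes_per_elem

The leading 2 covers K and V, and bytes_per_elem is 2 for F16 (less if the cache itself is quantized).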

Big thanks to the llama.cpp community and @ggerganov for the amazing ecosystem! πŸ™


Related:

https://github.com/ggml-org/llama.cpp/pull/18680

Note: the upstream PR was rejected by the llama.cpp maintainers, flagged as AI-generated content violating their contributor guidelines.
