---
language:
- ru
- en
license: mit
base_model: ai-sage/GigaChat3-10B-A1.8B-bf16
tags:
- gguf
- llama.cpp
- experimental
- unstable
- moe
model_type: deepseek_v3
library_name: llama.cpp
---

# GigaChat3-10B-A1.8B GGUF [EXPERIMENTAL]

⚠️ **UNSTABLE BUILD** - This is an experimental GGUF conversion with known quality issues. Use for testing only.

**UPDATE** - Currently does not work with llama.cpp release b7127 or newer; only earlier releases work.

## What is this?

An experimental GGUF conversion of [GigaChat3-10B-A1.8B](https://huggingface.co/ai-sage/GigaChat3-10B-A1.8B) - a Russian dialogue model with a MoE + MLA architecture.

**Model specs:**
- 10B parameters (1.8B active)
- 64 experts, 4 active per token
- 262k context window
- BF16 → GGUF conversion

## ⚠️ Known Issues

**This conversion has degraded quality compared to the original model** due to an architectural incompatibility:

1. **Hybrid MLA problem:** GigaChat3 uses a standard Q-projection (no compression) together with a compressed KV-cache, a combination llama.cpp does not support natively
2. **RoPE mismatch:** Position embeddings are applied in the wrong dimensional space
3. **Symptoms:** Incoherent long-form generation, context confusion, occasional nonsense

**Why it still loads:** The missing MLA components are emulated with identity matrices, which satisfies llama.cpp's loader but breaks the positional logic.

## When to use this

✅ **Good for:**
- Short prompts (1-3 turns)
- Fact retrieval / memorized knowledge
- Testing GGUF tooling compatibility
- A placeholder until proper support arrives

❌ **Bad for:**
- Production use
- Long conversations
- Complex reasoning tasks
- Anything requiring positional awareness

## Conversion method

```bash
# 1. Restructure weights to emulate MLA
#    Original:  Q = X @ q_proj [6144, 1536]
#    Emulated:  Q = ((X @ Identity[1536, 1536]) * ones) @ q_proj [6144, 1536]

# 2. Convert with q_lora_rank = 1536
python prepare_weights.py          # creates fake q_a_proj, q_a_norm, q_b_proj
python convert_hf_to_gguf.py ./model-fixed --outfile model.gguf
```

**The math is preserved, but RoPE positioning is broken.** (A minimal sketch of the weight restructuring appears at the end of this card.)

## Usage

```bash
# llama.cpp
./llama-cli -m model.gguf \
  --temp 0.3 --top-p 0.9 -n 512 \
  -p "User: [query]\nAssistant:"

# Recommended params
#   temperature: 0.0-0.5
#   top_p:       0.8-0.9
#   max_tokens:  < 512 (quality degrades further out)
```

(A llama-cpp-python sketch is also included at the end of this card.)

## Better alternatives

For production quality, use the original model with:
- **vLLM** (native FP8 support, proper inference)
- **transformers** (HF native, slower but correct)
- **SGLang** (fast + correct)

Or wait for proper llama.cpp support (requires a C++ patch).

## Technical details

**Problem:** llama.cpp's DeepSeek implementation assumes Q-vectors are compressed (q_lora_rank < hidden_size). GigaChat3 skips Q-compression.

**Hack:** Set q_lora_rank = hidden_size (1536) and inject identity matrices to fake the compression.

**Result:** The loader accepts the model, but RoPE is applied to the wrong intermediate representation → broken positional encoding → quality loss.

## Future

If you're a llama.cpp dev: the fix is adding a branch for `q_lora_rank == null` in the DeepSeek V3 / V2 attention code (~100 LOC). Happy to help test!

## License

MIT (inherited from base model)

---
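**Sketch: the identity-injection restructuring.** The conversion steps above only name `prepare_weights.py`; the sketch below shows roughly what that restructuring could look like. It is *not* the actual script: the single-file safetensors layout, the parameter names (`q_proj`, `q_a_proj`, `q_a_layernorm`, `q_b_proj`), and the `q_lora_rank` config key are assumptions based on DeepSeek-V3-style checkpoints and may need adjusting to the real shard layout.

```python
# Minimal sketch of the identity-injection step described in "Conversion method".
# Assumption: single-file safetensors checkpoint with DeepSeek-V3-style names.
import json
import torch
from safetensors.torch import load_file, save_file

HIDDEN_SIZE = 1536  # the q_lora_rank we pretend the model has

state = load_file("model.safetensors")
patched = {}

for name, tensor in state.items():
    if name.endswith("self_attn.q_proj.weight"):
        prefix = name[: -len("q_proj.weight")]
        # Fake compression: q_a_proj is an identity matrix, its norm weight is
        # all ones, and the original uncompressed q_proj becomes q_b_proj.
        patched[prefix + "q_a_proj.weight"] = torch.eye(HIDDEN_SIZE, dtype=tensor.dtype)
        patched[prefix + "q_a_layernorm.weight"] = torch.ones(HIDDEN_SIZE, dtype=tensor.dtype)
        patched[prefix + "q_b_proj.weight"] = tensor  # shape [6144, 1536], unchanged
    else:
        patched[name] = tensor

save_file(patched, "model-fixed/model.safetensors")

# config.json must also claim q_lora_rank = hidden_size so convert_hf_to_gguf.py
# treats the model as a regular compressed-Q DeepSeek-V3 checkpoint.
with open("model-fixed/config.json") as f:
    cfg = json.load(f)
cfg["q_lora_rank"] = HIDDEN_SIZE
with open("model-fixed/config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```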
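**Sketch: conservative sampling from Python.** If you would rather drive the GGUF from Python than from `llama-cli`, the llama-cpp-python bindings can apply the same conservative settings from the Usage section. This is a hedged sketch: it assumes llama-cpp-python is installed and built against a llama.cpp release older than b7127 (see the UPDATE note above); the model path and prompt are placeholders.

```python
# Sketch: short-context, low-temperature generation via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",  # path to this GGUF
    n_ctx=2048,               # keep the context short; long contexts are where quality collapses
)

out = llm(
    "User: Столица Франции?\nAssistant:",
    max_tokens=256,           # stay well under 512
    temperature=0.3,          # recommended range 0.0-0.5
    top_p=0.9,
    stop=["User:"],
)
print(out["choices"][0]["text"])
```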