---
language:
- ru
- en
license: mit
base_model: ai-sage/GigaChat3-10B-A1.8B-bf16
tags:
- gguf
- llama.cpp
- experimental
- unstable
- moe
model_type: deepseek_v3
library_name: llama.cpp
---

# GigaChat3-10B-A1.8B GGUF [EXPERIMENTAL]

⚠️ **UNSTABLE BUILD** - This is an experimental GGUF conversion with known quality issues. Use for testing only.

**UPDATE** - Currently does not work with llama.cpp release b7127 or newer; only earlier releases work.

## What is this?

An experimental GGUF conversion of [GigaChat3-10B-A1.8B](https://huggingface.co/ai-sage/GigaChat3-10B-A1.8B) - a Russian dialogue model with a MoE + MLA architecture.

**Model specs:**
- 10B parameters (1.8B active)
- 64 experts, 4 active per token
- 262k context window
- BF16 → GGUF conversion

## ⚠️ Known Issues

**This conversion has degraded quality compared to the original model** due to an architectural incompatibility:

1. **Hybrid MLA problem:** GigaChat3 uses a standard Q-projection (no compression) together with a compressed KV-cache, a combination llama.cpp does not support natively
2. **RoPE mismatch:** Position embeddings are applied in the wrong dimensional space
3. **Symptoms:** Incoherent long-form generation, context confusion, occasional nonsense

**Why it still loads:** The missing MLA components are emulated with identity matrices, which satisfies llama.cpp's loader but breaks the positional logic.

## When to use this

✅ **Good for:**
- Short prompts (1-3 turns)
- Fact retrieval / memorized knowledge
- Testing GGUF tooling compatibility
- A placeholder until proper support arrives

❌ **Bad for:**
- Production use
- Long conversations
- Complex reasoning tasks
- Anything requiring positional awareness

## Conversion method

```bash
# 1. Restructure weights to emulate MLA
#    Original:  Q = X @ q_proj [6144, 1536]
#    Emulated:  Q = ((X @ Identity[1536, 1536]) * ones) @ q_proj [6144, 1536]

# 2. Convert with q_lora_rank = 1536
python prepare_weights.py          # creates fake q_a_proj, q_a_norm, q_b_proj
python convert_hf_to_gguf.py ./model-fixed --outfile model.gguf
```

**The math is preserved, but RoPE positioning is broken.** (A minimal sketch of the weight restructuring appears at the end of this card.)

## Usage

```bash
# llama.cpp
./llama-cli -m model.gguf \
  --temp 0.3 --top-p 0.9 -n 512 \
  -p "User: [query]\nAssistant:"

# Recommended params
#   temperature: 0.0-0.5
#   top_p:       0.8-0.9
#   max_tokens:  < 512 (quality degrades further out)
```

(A llama-cpp-python sketch is also included at the end of this card.)

## Better alternatives

For production quality, use the original model with:
- **vLLM** (native FP8 support, proper inference)
- **transformers** (HF native, slower but correct)
- **SGLang** (fast + correct)

Or wait for proper llama.cpp support (requires a C++ patch).

## Technical details

**Problem:** llama.cpp's DeepSeek implementation assumes Q-vectors are compressed (q_lora_rank < hidden_size). GigaChat3 skips Q-compression.

**Hack:** Set q_lora_rank = hidden_size (1536) and inject identity matrices to fake the compression.

**Result:** The loader accepts the model, but RoPE is applied to the wrong intermediate representation → broken positional encoding → quality loss.

## Future

If you're a llama.cpp dev: the fix is adding a branch for `q_lora_rank == null` in the DeepSeek V3 / V2 attention code (~100 LOC). Happy to help test!

## License

MIT (inherited from base model)

---
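**Sketch: the identity-injection restructuring.** The conversion steps above only name `prepare_weights.py`; the sketch below shows roughly what that restructuring could look like. It is *not* the actual script: the single-file safetensors layout, the parameter names (`q_proj`, `q_a_proj`, `q_a_layernorm`, `q_b_proj`), and the `q_lora_rank` config key are assumptions based on DeepSeek-V3-style checkpoints and may need adjusting to the real shard layout.

```python
# Minimal sketch of the identity-injection step described in "Conversion method".
# Assumption: single-file safetensors checkpoint with DeepSeek-V3-style names.
import json
import torch
from safetensors.torch import load_file, save_file

HIDDEN_SIZE = 1536  # the q_lora_rank we pretend the model has

state = load_file("model.safetensors")
patched = {}

for name, tensor in state.items():
    if name.endswith("self_attn.q_proj.weight"):
        prefix = name[: -len("q_proj.weight")]
        # Fake compression: q_a_proj is an identity matrix, its norm weight is
        # all ones, and the original uncompressed q_proj becomes q_b_proj.
        patched[prefix + "q_a_proj.weight"] = torch.eye(HIDDEN_SIZE, dtype=tensor.dtype)
        patched[prefix + "q_a_layernorm.weight"] = torch.ones(HIDDEN_SIZE, dtype=tensor.dtype)
        patched[prefix + "q_b_proj.weight"] = tensor  # shape [6144, 1536], unchanged
    else:
        patched[name] = tensor

save_file(patched, "model-fixed/model.safetensors")

# config.json must also claim q_lora_rank = hidden_size so convert_hf_to_gguf.py
# treats the model as a regular compressed-Q DeepSeek-V3 checkpoint.
with open("model-fixed/config.json") as f:
    cfg = json.load(f)
cfg["q_lora_rank"] = HIDDEN_SIZE
with open("model-fixed/config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```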
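**Sketch: conservative sampling from Python.** If you would rather drive the GGUF from Python than from `llama-cli`, the llama-cpp-python bindings can apply the same conservative settings from the Usage section. This is a hedged sketch: it assumes llama-cpp-python is installed and built against a llama.cpp release older than b7127 (see the UPDATE note above); the model path and prompt are placeholders.

```python
# Sketch: short-context, low-temperature generation via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",  # path to this GGUF
    n_ctx=2048,               # keep the context short; long contexts are where quality collapses
)

out = llm(
    "User: Столица Франции?\nAssistant:",
    max_tokens=256,           # stay well under 512
    temperature=0.3,          # recommended range 0.0-0.5
    top_p=0.9,
    stop=["User:"],
)
print(out["choices"][0]["text"])
```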