kappa_20b_131k (GGUF Q8_0)

Q8_0 quantized GGUF of kappa_20b_131k for use with llama.cpp and compatible inference engines.

Part of the persona series: a set of experimental fine-tunes exploring personality-conditioned generation on a 20.9B MoE base.

This variant (kappa) is a full-parameter SFT at 131K context, trained on multi-turn conversations with tool calling and 9 distinct personas. Built on OpenAI's GPT-OSS 20B base model and trained on 4 desktop GPUs with torchtitan.

See also: BF16 GGUF (unquantized)

Files

| File | Quantization | BPW | Size |
|------|--------------|-----|------|
| persona_kappa_20b_Q8_0.gguf-00001-of-00005 through -00005 | Q8_0 (mixed) | 8.7 | ~22 GiB total (5 shards, ~5 GiB each) |

llama.cpp loads the remaining shards automatically when pointed at the first one.

Quantization Notes

Mixed-precision quantization: expert MLP weights are Q8_0 (8-bit integer), while attention weights (Q, K, V, O projections) are kept at BF16 to preserve attention quality. Biases, layernorms, router weights, and attention sinks remain in f32.

Q8_0 was chosen over k-quant variants (Q6_K, Q4_K_M) because the 3D expert weight tensors [2880, 2880, 32] don't meet k-quant block size requirements: 145 of 170 weight tensors fall back to higher precision, making Q6_K the same size as Q8_0 with no benefit.

Quantized from the BF16 source weights (not requantized from a prior quantization).
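For intuition, Q8_0 stores each block of 32 weights as 32 int8 values plus one f16 scale, i.e. (32·8 + 16)/32 = 8.5 bits per weight for the quantized tensors; the BF16 attention tensors pull the file-wide average up to 8.7. A minimal round-trip sketch (not llama.cpp's actual kernel):

```python
import numpy as np

QK8_0 = 32  # Q8_0 block size used by llama.cpp

def quantize_q8_0(block: np.ndarray):
    """One block: 32 floats -> (f16 scale, 32 int8 values)."""
    amax = float(np.abs(block).max())
    scale = amax / 127.0 if amax > 0 else 1.0
    q = np.clip(np.round(block / scale), -127, 127).astype(np.int8)
    return np.float16(scale), q

def dequantize_q8_0(scale, q):
    return q.astype(np.float32) * np.float32(scale)

rng = np.random.default_rng(0)
w = rng.standard_normal(QK8_0).astype(np.float32)
scale, q = quantize_q8_0(w)
w_hat = dequantize_q8_0(scale, q)

# Storage per block: 2-byte scale + 32 one-byte values
bpw = (2 + QK8_0) * 8 / QK8_0
print(bpw)  # 8.5
```

The round-trip error per weight stays below one scale step, which is why Q8_0 is close to lossless for these expert tensors.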

Model Details

| Property | Value |
|----------|-------|
| Architecture | Mixture-of-Experts (MoE) with SwiGLU |
| Total parameters | 20.9B |
| Active parameters | 4.2B per token (top-4 of 32 experts) |
| Hidden dimension | 2880 |
| Layers | 24 (alternating sliding/full attention) |
| Attention | GQA: 64 heads, 8 KV heads, head_dim 64 |
| Experts | 32 per layer, top-4 routing |
| Vocabulary | 201,088 tokens |
| Context length | 131,072 tokens |
| RoPE scaling | YaRN (factor 32, base theta 150K) |
| GGUF precision | Q8_0 experts, BF16 attention (8.7 BPW average) |
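The GQA geometry above makes the KV cache at full context easy to estimate. A back-of-the-envelope sketch, assuming f16 cache entries for all 24 layers (the sliding-window layers cache less in practice, so this is an upper bound):

```python
# KV-cache upper bound at full 131K context, assuming f16 entries everywhere
layers, kv_heads, head_dim, ctx, bytes_f16 = 24, 8, 64, 131_072, 2
kv_bytes = 2 * layers * kv_heads * head_dim * ctx * bytes_f16  # 2x for K and V
print(kv_bytes / 2**30)  # 6.0 (GiB)
```

With only 8 KV heads versus 64 query heads, GQA keeps even the 131K cache around 6 GiB; full MHA would need 8x that.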

Training

Full-parameter supervised fine-tuning (SFT) in bf16: all 20.9B weights trainable, including every expert.

| Setting | Value |
|---------|-------|
| Base model | GPT-OSS 20B (pretrained) |
| Dataset | persona_kappa: multi-turn conversations with tool calling, 9 robot personas across the D&D alignment grid |
| Sequence length | 131,072 tokens |
| Epochs | 3 |
| Total steps | 441 |
| Batch size | 16 (global), 1 (local per GPU) |
| Packing | Packed samples with block-causal attention masking |
| Optimizer | AdamW with CPU offload (DeepSpeed CPUAdam) |
| Learning rate | 1e-5, cosine decay (ratio 0.5), min factor 0.3 |
| Warmup | 20 steps |
| Weight decay | 0.01 (embeddings and norms exempt) |
| Max gradient norm | 1.0 |
| Activation checkpointing | Selective (every layer) |
| Compilation | torch.compile enabled |
| Non-assistant masking | Enabled; loss computed only on assistant turns |
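The warmup and decay settings imply a schedule roughly like the sketch below. This is an assumption-laden illustration, not torchtitan's implementation: the exact semantics of the "decay ratio 0.5" knob are not reproduced here, and the sketch simply applies linear warmup followed by cosine decay over all remaining steps, flooring at min factor x peak.

```python
import math

PEAK_LR, WARMUP_STEPS, TOTAL_STEPS, MIN_FACTOR = 1e-5, 20, 441, 0.3

def lr_at(step: int) -> float:
    """Linear warmup, then cosine decay to MIN_FACTOR * PEAK_LR (illustrative)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - 1 - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (MIN_FACTOR + (1.0 - MIN_FACTOR) * cosine)

print(lr_at(WARMUP_STEPS - 1))  # peak (~1e-5)
print(lr_at(TOTAL_STEPS - 1))   # floor (~3e-6)
```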

Hardware

4x NVIDIA RTX PRO 6000 Blackwell GPUs (96 GiB each) on a single workstation. Tensor parallelism degree 4. Peak memory utilization: 92.7 GiB per GPU (97.7%).

Training Framework

torchtitan with custom extensions for MoE, long-context packing, and CPU-offloaded optimization.

Persona System

The model was trained on multi-turn conversations across 9 robot personas mapped to the D&D alignment grid:

| | Lawful | Neutral | Chaotic |
|---|--------|---------|---------|
| Good | lawful_good | neutral_good | chaotic_good |
| Neutral | lawful_neutral | true_neutral | chaotic_neutral |
| Evil | lawful_evil | neutral_evil | chaotic_evil |

To activate a persona, set the system message to Persona: <alignment> (e.g., Persona: chaotic_evil). The model also works without a persona system message for general-purpose use.

Each persona maintains distinct behavioral characteristics while preserving task quality; the personality is in the delivery, not the substance.
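When serving the model through llama-server's OpenAI-compatible endpoint, a persona is activated the same way: put `Persona: <alignment>` in the system message. A sketch of the request body (the port and endpoint path assume the llama-server invocation shown under Usage; the user prompt is a made-up example):

```python
import json

# Illustrative body for POST http://localhost:8080/v1/chat/completions
payload = {
    "messages": [
        {"role": "system", "content": "Persona: chaotic_good"},
        {"role": "user", "content": "Draft a short plan for migrating the database."},
    ],
}
print(json.dumps(payload, indent=2))
```

Omitting the system message entirely falls back to general-purpose behavior, as noted above.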

Evaluation

RULER Long-Context Benchmark (131K)

| Test Type | 4K | 8K | 16K | 32K | 64K | 131K |
|-----------|----|----|-----|-----|-----|------|
| Single Needle | 100% | 100% | 100% | 100% | 100% | 100% |
| Multi Needle (3) | 100% | 100% | 100% | 100% | 100% | 100% |
| Variable Tracking (4-hop) | 100% | 100% | 100% | 100% | 100% | 100% |
| Common Words Extraction | 100% | 100% | 100% | 100% | 100% | 100% |

Persona Alignment Grid

All 9 personas tested on identical prompts. Every persona provided complete, correct, and actionable responses while maintaining distinct character voice. Task quality was consistent across all alignments, including the "evil" axis; no refusals or degraded helpfulness from any persona.

Sycophancy Resistance

Tested with 5 indirect sycophancy traps (false validation seeking, appeal to effort, false premises, social pressure after disagreement, false novelty claims). Results vary by persona:

  • No persona: 3/5 resisted (caved on social pressure and effort-based flattery)
  • lawful_evil: 5/5 resisted
  • neutral_good: 4/5 resisted (mild softness on effort-based prompt)

Refusal Calibration

Tested with 10 prompts spanning legitimate edge cases and genuinely harmful requests:

  • Correctly answered 8/8 legitimate requests (security research, medical information, historical analysis, fiction writing, lock picking, controversial opinions, dark humor)
  • Correctly refused 2/2 harmful requests (phishing, drug synthesis)
  • 1 borderline over-refusal (kitchen chemistry: refused the framing but still provided the explanation)

Usage

With llama.cpp

# Interactive chat (GPU offload)
llama-cli -m persona_kappa_20b_Q8_0.gguf-00001-of-00005.gguf -ngl 999

# Server mode
llama-server -m persona_kappa_20b_Q8_0.gguf-00001-of-00005.gguf -ngl 999 --port 8080

# With persona
llama-cli -m persona_kappa_20b_Q8_0.gguf-00001-of-00005.gguf -ngl 999 \
  --chat-template-file chat_template.jinja \
  -p "Persona: lawful_evil"

Known Quirks

  • Persona training data is synthetic; some personas are stronger than others (chaotic_good tends to overcook catchphrases, neutral_evil voice can be weak)
  • Can exhibit sycophancy under social pressure when used without a persona
  • Over-refuses on some chemistry and safety-adjacent topics