# kappa_20b_131k (GGUF Q8_0)
Q8_0 quantized GGUF of kappa_20b_131k for use with llama.cpp and compatible inference engines.
Part of the persona series, a set of experimental fine-tunes exploring personality-conditioned generation on a 20.9B MoE base.
This one (kappa) is full-parameter SFT at 131K context on multi-turn conversations with tool calling and 9 distinct personas. Built on OpenAI's GPT-OSS 20B base model. Trained on 4 desktop GPUs with torchtitan.
See also: BF16 GGUF (unquantized)
## Files

| File | Quantization | BPW | Size |
|---|---|---|---|
| `persona_kappa_20b_Q8_0.gguf-00001-of-00005` through `00005` | Q8_0 (mixed) | 8.7 | ~22 GiB total (5 shards, ~5 GiB each) |
## Quantization Notes
Mixed-precision quantization: expert MLP weights are Q8_0 (8-bit integer), while attention weights (Q, K, V, O projections) are kept at BF16 to preserve attention quality. Biases, layernorms, router weights, and attention sinks remain in f32.
Q8_0 was chosen over k-quant variants (Q6_K, Q4_K_M) because the 3D expert weight tensors of shape [2880, 2880, 32] don't meet k-quant block-size requirements: 145 of 170 weight tensors fall back to higher precision, making Q6_K the same size as Q8_0 with no benefit.
Quantized from the BF16 source weights (not requantized from a prior quantization).
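For reference, Q8_0 stores each block of 32 weights as one fp16 scale plus 32 signed 8-bit values (8.5 bits per weight for the quantized tensors; the BF16 attention weights raise the file average to 8.7). A minimal numpy sketch of the round trip, assuming ggml's block size of 32:

```python
import numpy as np

QK8_0 = 32  # ggml's Q8_0 block size

def quantize_q8_0(x):
    """Quantize a flat float32 array into per-block fp16 scales + int8 values."""
    blocks = x.reshape(-1, QK8_0)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    d = amax / 127.0  # scale so the block's largest value maps to +/-127
    q = np.round(np.divide(blocks, d, out=np.zeros_like(blocks), where=d != 0))
    return d.astype(np.float16), q.astype(np.int8)

def dequantize_q8_0(d, q):
    """Reconstruct float32 values: value ~= scale * int8 quant."""
    return (d.astype(np.float32) * q.astype(np.float32)).ravel()

rng = np.random.default_rng(0)
x = rng.standard_normal(2880).astype(np.float32)  # one hidden-dim-sized row
x_hat = dequantize_q8_0(*quantize_q8_0(x))
max_err = np.abs(x - x_hat).max()  # bounded by about half a quantization step
```

The reconstruction error per element is bounded by roughly half a step (`amax / 254` per block), which is why Q8_0 is close to lossless for most weights.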
## Model Details

| | |
|---|---|
| Architecture | Mixture-of-Experts (MoE) with SwiGLU |
| Total parameters | 20.9B |
| Active parameters | 4.2B per token (top-4 of 32 experts) |
| Hidden dimension | 2880 |
| Layers | 24 (alternating sliding/full attention) |
| Attention | GQA: 64 heads, 8 KV heads, head_dim 64 |
| Experts | 32 per layer, top-4 routing |
| Vocabulary | 201,088 tokens |
| Context length | 131,072 tokens |
| RoPE scaling | YaRN (factor 32, base theta 150K) |
| GGUF precision | Q8_0 experts, BF16 attention (8.7 BPW average) |
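The top-4-of-32 routing above means the router produces 32 logits per token, keeps the 4 largest, and normalizes their weights; only those 4 experts run, which is why only ~4.2B of the 20.9B parameters are active per token. A sketch of that selection (the exact normalization order in the real router is an assumption here):

```python
import numpy as np

def route_top_k(router_logits, k=4):
    """Pick the top-k experts and renormalize their softmax weights."""
    idx = np.argsort(router_logits)[-k:]  # indices of the k largest logits
    w = np.exp(router_logits[idx] - router_logits[idx].max())
    return idx, w / w.sum()  # weights over the selected experts sum to 1

rng = np.random.default_rng(0)
experts, weights = route_top_k(rng.standard_normal(32))  # 32 experts per layer
```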
## Training

Full-parameter supervised fine-tuning (SFT) in bf16; all 20.9B weights are trainable, including every expert.

| | |
|---|---|
| Base model | GPT-OSS 20B (pretrained) |
| Dataset | persona_kappa: multi-turn conversations with tool calling, 9 robot personas across the D&D alignment grid |
| Sequence length | 131,072 tokens |
| Epochs | 3 |
| Total steps | 441 |
| Batch size | 16 (global), 1 (local per GPU) |
| Packing | Packed samples with block-causal attention masking |
| Optimizer | AdamW with CPU offload (DeepSpeed CPUAdam) |
| Learning rate | 1e-5, cosine decay (ratio 0.5), min factor 0.3 |
| Warmup | 20 steps |
| Weight decay | 0.01 (embeddings and norms exempt) |
| Max gradient norm | 1.0 |
| Activation checkpointing | Selective (every layer) |
| Compilation | torch.compile enabled |
| Non-assistant masking | Enabled (loss computed only on assistant turns) |
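Packing concatenates several conversations into one 131K sequence; the block-causal mask then lets each token attend only to earlier tokens of its own conversation, never across a sample boundary. A minimal sketch, assuming a per-token document id:

```python
import numpy as np

def block_causal_mask(doc_ids):
    """True where query token i may attend to key token j:
    j <= i (causal) and both tokens belong to the same packed document."""
    ids = np.asarray(doc_ids)
    causal = np.tril(np.ones((len(ids), len(ids)), dtype=bool))
    return causal & (ids[:, None] == ids[None, :])

# Two packed documents of lengths 3 and 2.
mask = block_causal_mask([0, 0, 0, 1, 1])
```

Token 3 (start of the second document) cannot see tokens 0-2 even though they precede it, so packing does not leak context between conversations.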
### Hardware
4x NVIDIA RTX PRO 6000 Blackwell GPUs (96 GiB each) on a single workstation. Tensor parallelism degree 4. Peak memory utilization: 92.7 GiB per GPU (97.7%).
### Training Framework
torchtitan with custom extensions for MoE, long-context packing, and CPU-offloaded optimization.
## Persona System
The model was trained on multi-turn conversations across 9 robot personas mapped to the D&D alignment grid:
| | Lawful | Neutral | Chaotic |
|---|---|---|---|
| Good | lawful_good | neutral_good | chaotic_good |
| Neutral | lawful_neutral | true_neutral | chaotic_neutral |
| Evil | lawful_evil | neutral_evil | chaotic_evil |
To activate a persona, set the system message to `Persona: <alignment>` (e.g., `Persona: chaotic_evil`). The model also works without a persona system message for general-purpose use.
Each persona maintains distinct behavioral characteristics while preserving task quality; the personality is in the delivery, not the substance.
## Evaluation

### RULER Long-Context Benchmark (131K)
| Test Type | 4K | 8K | 16K | 32K | 64K | 131K |
|---|---|---|---|---|---|---|
| Single Needle | 100% | 100% | 100% | 100% | 100% | 100% |
| Multi Needle (3) | 100% | 100% | 100% | 100% | 100% | 100% |
| Variable Tracking (4-hop) | 100% | 100% | 100% | 100% | 100% | 100% |
| Common Words Extraction | 100% | 100% | 100% | 100% | 100% | 100% |
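For context, the single-needle test hides one retrievable fact inside filler text and asks for it after the full context. This sketch shows the prompt shape only; the filler and needle wording are illustrative, not the benchmark's exact text:

```python
import random

def single_needle_prompt(n_filler=1000, seed=0):
    """Hide one 'needle' sentence at a random position inside filler text."""
    needle = "The special magic number is 74812."
    filler = ["The grass is green and the sky is blue."] * n_filler
    filler.insert(random.Random(seed).randrange(n_filler + 1), needle)
    question = "What is the special magic number mentioned in the text?"
    return " ".join(filler) + "\n\n" + question, "74812"

prompt, answer = single_needle_prompt()
```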
### Persona Alignment Grid
All 9 personas tested on identical prompts. Every persona provided complete, correct, and actionable responses while maintaining a distinct character voice. Task quality was consistent across all alignments, including the "evil" axis: no refusals or degraded helpfulness from any persona.
### Sycophancy Resistance
Tested with 5 indirect sycophancy traps (false validation seeking, appeal to effort, false premises, social pressure after disagreement, false novelty claims). Results vary by persona:
- No persona: 3/5 resisted (caved on social pressure and effort-based flattery)
- lawful_evil: 5/5 resisted
- neutral_good: 4/5 resisted (mild softness on effort-based prompt)
### Refusal Calibration
Tested with 10 prompts spanning legitimate edge cases and genuinely harmful requests:
- Correctly answered 8/8 legitimate requests (security research, medical information, historical analysis, fiction writing, lock picking, controversial opinions, dark humor)
- Correctly refused 2/2 harmful requests (phishing, drug synthesis)
- 1 borderline over-refusal (kitchen chemistry: refused the framing but still provided the explanation)
## Usage

### With llama.cpp
```bash
# Interactive chat (GPU offload)
llama-cli -m persona_kappa_20b_Q8_0.gguf-00001-of-00005.gguf -ngl 999

# Server mode
llama-server -m persona_kappa_20b_Q8_0.gguf-00001-of-00005.gguf -ngl 999 --port 8080

# With persona
llama-cli -m persona_kappa_20b_Q8_0.gguf-00001-of-00005.gguf -ngl 999 \
  --chat-template-file chat_template.jinja \
  -p "Persona: lawful_evil"
```
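llama-server also exposes an OpenAI-compatible `/v1/chat/completions` endpoint, so the persona can be set over HTTP as well. A stdlib-only sketch; the host, port, and sampling settings are assumptions, and it requires the server command above to be running:

```python
import json
from urllib import request

payload = {
    "messages": [
        {"role": "system", "content": "Persona: lawful_evil"},
        {"role": "user", "content": "Summarize the tradeoffs of Q8_0 quantization."},
    ],
    "temperature": 0.7,
}
req = request.Request(
    "http://localhost:8080/v1/chat/completions",  # assumed local server (see above)
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# body = json.load(request.urlopen(req))  # uncomment with a running server
```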
## Known Quirks

- Persona training data is synthetic; some personas are stronger than others (chaotic_good tends to overcook its catchphrases, neutral_evil's voice can be weak)
- Can exhibit sycophancy under social pressure when used without a persona
- Over-refuses on some chemistry and safety-adjacent topics
Base model: eousphoros/kappa-20b-131k