Note: This model has not been instruction-tuned (no SFT); it fails to respond to questions 9 times out of 10.
Glint-1.3
⚠️ IMPORTANT NOTICE
- This model is experimental. Glint-1.3 is a 982K parameter research model.
- Performance characteristics: This model may occasionally output incoherent text. If it does, try again. It is shy.
- Not production-ready: This is a tiny neural network running on a prayer and a GPU.
Quick Stats
| Stat | Value |
|---|---|
| Parameters | 982,656 (under 1M) |
| Training Tokens | 100 Billion (FineWeb-Edu) |
| Hardware | RTX 5090 |
| Context Window | 256 tokens |
| Inference Speed | 138,562 tok/s |
| Vibe | Doing its best |
What Is This?
Glint-1.3 is the first model in the CompactAI scaling-down plan.
We spent months adding features: SPIN, DPO, sleep gates, retention, recurrent loops, LoRA, engrams. More parameters, more tricks, more complexity. And you know what? The features were hurting the models. The tiny models couldn't breathe. So we're doing the opposite now: scaling down. Strip everything. Pure Llama. See how far simplicity goes.
This is that experiment. ~1M params. No gimmicks. Just a transformer doing its best.
It runs at 138,000 tokens per second on an RTX 5090. Fun, but useless. lmao.
The Journey
The model improves monotonically over 95K training steps on 100B tokens, with Wikitext-2 cross-entropy loss dropping from 4.29 to 3.08. For a 1M parameter model, this is actually respectable.
Model Specifications
| Parameter | Value |
|---|---|
| Architecture | Transformer Decoder (Llama-style) |
| Parameters | 982,656 |
| Hidden Dim | 128 |
| Layers | 4 |
| Attention Heads | 4 |
| KV Heads | 4 (GQA config, one KV head per query head, i.e. standard MHA) |
| MLP Intermediate | 384 (SwiGLU) |
| Context Length | 256 tokens |
| Vocab Size | 500 (ByteLevel BPE) |
| Normalization | RMSNorm |
| Position Encoding | RoPE |
| Embeddings | Tied input/output |
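The spec table maps fairly directly onto a standard Llama-style configuration. Below is a minimal sketch using transformers' `LlamaConfig`, purely for illustration: it is not the exact training config, and the resulting parameter count may differ slightly from the table depending on implementation details.

```python
# Illustrative sketch only: an approximate Llama-style config matching the spec table.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=500,
    hidden_size=128,
    intermediate_size=384,        # SwiGLU MLP
    num_hidden_layers=4,
    num_attention_heads=4,
    num_key_value_heads=4,        # one KV head per query head
    max_position_embeddings=256,  # 256-token context window
    tie_word_embeddings=True,     # tied input/output embeddings
)
model = LlamaForCausalLM(config)

# Roughly 1M parameters; the exact count may not match the table precisely.
print(sum(p.numel() for p in model.parameters()))
```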
Benchmarks
All checkpoints evaluated on Wikitext-2, BLiMP (grammaticality), and ARC-Easy (science QA). Sliding-window log-prob scoring methodology from the CompactAI benchmark suite.
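As a rough illustration of that scoring approach, here is a minimal sliding-window log-prob sketch for multiple-choice items, assuming a Hugging Face-style causal LM. The actual CompactAI harness may differ in windowing, stride, and normalization details; all function names here are illustrative.

```python
# Minimal sketch: sliding-window log-prob scoring for multiple-choice evaluation.
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, window=256, stride=128):
    """Sum token log-probs of a sequence, sliding a fixed window over longer inputs."""
    total, scored = 0.0, 0  # `scored` = index of the last target token already counted
    for start in range(0, max(1, input_ids.size(1) - 1), stride):
        chunk = input_ids[:, start:start + window]
        if chunk.size(1) < 2:
            break
        with torch.no_grad():
            logits = model(chunk).logits                      # (1, T, vocab)
        logp = F.log_softmax(logits[:, :-1].float(), dim=-1)  # predict token t+1 from token t
        token_lp = logp.gather(-1, chunk[:, 1:].unsqueeze(-1)).squeeze(-1)
        new_from = max(0, scored - start)                     # skip tokens scored by earlier windows
        total += token_lp[:, new_from:].sum().item()
        scored = start + chunk.size(1) - 1
        if scored >= input_ids.size(1) - 1:
            break
    return total

def pick_answer(model, tokenizer, question, options):
    """Return the index of the option whose full sequence gets the highest log-prob."""
    ids = [tokenizer(question + " " + o, return_tensors="pt").input_ids for o in options]
    scores = [sequence_logprob(model, x) for x in ids]
    return max(range(len(options)), key=lambda i: scores[i])
```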
Per-Metric Standouts
| Metric | Best Checkpoint | Score |
|---|---|---|
| Wikitext-2 CE Loss | Step 95,000 | 3.06 |
| BLiMP Accuracy | Step 11,500 | 64.2% |
| ARC-Easy Accuracy | Step 55,500 | 32.5% |
Merged Model (Model Soup)
Weight averaging the best checkpoints per benchmark via per-parameter-group SLERP produces a model that exceeds individual bests on certain metrics:
| Model | WT Loss | BLiMP | ARC | Composite |
|---|---|---|---|---|
| Best Merged | 3.148 | 68.7% | 29.0% | 1.391 |
| Best WT (step 95367) | 3.080 | 53.7% | 25.0% | 1.431 |
| Best BLiMP (step 11500) | 3.307 | 64.2% | 22.5% | 1.480 |
| Best ARC (step 55500) | 3.128 | 50.7% | 32.5% | 1.432 |
The merged model achieves superadditive BLiMP gains (+4.5% over the individual best checkpoint) through spherical interpolation of attention and MLP weights at different blend factors.
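For intuition, here is a minimal sketch of what per-parameter-group SLERP merging can look like, assuming two checkpoints with identical state-dict keys. The actual merge script, grouping rules, and blend factors behind the table above are not reproduced here; everything below is illustrative.

```python
# Minimal sketch: per-parameter-group SLERP merge of two state dicts.
import torch

def slerp(a: torch.Tensor, b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical interpolation between two weight tensors, flattened to vectors."""
    a_flat, b_flat = a.flatten().float(), b.flatten().float()
    a_n = a_flat / (a_flat.norm() + eps)
    b_n = b_flat / (b_flat.norm() + eps)
    dot = torch.clamp(a_n @ b_n, -1.0, 1.0)
    omega = torch.acos(dot)
    if omega < eps:  # nearly parallel directions: fall back to linear interpolation
        out = (1 - t) * a_flat + t * b_flat
    else:
        out = (torch.sin((1 - t) * omega) * a_flat + torch.sin(t * omega) * b_flat) / torch.sin(omega)
    return out.reshape(a.shape).to(a.dtype)

def merge_checkpoints(sd_a: dict, sd_b: dict, t_attn: float = 0.5, t_mlp: float = 0.5) -> dict:
    """Blend attention and MLP groups at separate factors; copy remaining weights from sd_a."""
    merged = {}
    for name, wa in sd_a.items():
        wb = sd_b[name]
        if "attn" in name or "attention" in name:
            merged[name] = slerp(wa, wb, t_attn)
        elif "mlp" in name:
            merged[name] = slerp(wa, wb, t_mlp)
        else:
            merged[name] = wa.clone()
    return merged
```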
Training Details
| Parameter | Value |
|---|---|
| Dataset | FineWeb-Edu (sample-10BT) |
| Batch Size | 4,096 (gradient accumulation 1) |
| Sequence Length | 256 |
| Learning Rate | 8e-4 (cosine decay, 200 step warmup) |
| Weight Decay | 0.05 |
| Max Grad Norm | 0.5 |
| Optimizer | AdamW (fused, β₁=0.9, β₂=0.95) |
| Precision | bfloat16 |
| Hardware | NVIDIA RTX 5090 (throughout) |
| Training Time | ~30 hours for 95K steps |
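As a rough sketch, the optimizer and schedule rows above correspond to something like the following PyTorch setup. The actual training script is not shown here, so names and defaults are illustrative.

```python
# Illustrative sketch of the optimizer and LR schedule described in the table.
import math
import torch

def build_optimizer_and_scheduler(model, total_steps=95_000, warmup_steps=200,
                                  peak_lr=8e-4, weight_decay=0.05):
    # fused=True expects model parameters to live on a CUDA device.
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.95), weight_decay=weight_decay,
                                  fused=True)

    def lr_lambda(step):
        # Linear warmup for the first 200 steps, then cosine decay to zero.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

In the training loop itself, gradients would also be clipped to the max norm of 0.5 from the table (e.g. `torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)`) before each optimizer step.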
Usage
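A minimal usage sketch, assuming the checkpoint is published as a Hugging Face Llama-style causal LM with its tokenizer. The repo id below is a placeholder, not a confirmed path.

```python
# Minimal usage sketch; the repo id is hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "CompactAI-O/Glint-1.3"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("The sun is", return_tensors="pt")
# Keep generations short: the context window is only 256 tokens.
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```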
Limitations
- Context window: 256 tokens severely limits long-range dependencies
- Knowledge: Extremely limited world knowledge due to parameter constraints
- Coherence: May lose track of topic after a few sentences
- Repetition: Tends toward repetitive patterns at higher temperatures
- Reliability: Not suitable for any production application
- Purpose: Research, education, and architectural experimentation
Related Models
- Glint-1.3 – 1M params, instruction-tuned, our other scaling-down experiment
- Shard-1 – 54.5M params, Gemma-4 attention
- TMLM-Haiku-2.3 – 1M params, 10B tokens, SPIN-optimized (pre-scaling-down era)
Citation
@misc{tinylm1m,
author = {CompactAI},
title = {Glint-1.3: a 982K parameter Llama-style transformer},
year = {2026},
publisher = {GitHub},
url = {https://github.com/CompactAI-O/TinyLM}
}
Built by CompactAI. Small models trying their best since 2026.
