A newer version of this model is available: AxiomicLabs/GPT-X2-125M

GPT-X-125M

A modern Llama-style language model trained from scratch: 125M parameters, 15B tokens of FineWeb-Edu. It outperforms GPT-3 (125M) on HellaSwag with 20x fewer training tokens (15B vs. 300B).

Results

Evaluated with an internal harness modeled on EleutherAI/lm-eval-harness; all benchmarks are zero-shot.

| Company | Model | HellaSwag | ARC (Average) | PIQA | LogicMark | Winogrande | ArithMark | Average | Training tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HuggingFace | SmolLM2-135M | 43.22% | 44.62% | 67.52% | 48.78% | 48.46% | 33.26% | 47.64% | 2T |
| Axiomic Labs | GPT-X2-125M (Unreleased for now) | 40.55% | 39.90% | 66.97% | 49.12% | 49.01% | 34.78% | 46.72% | 75B |
| HuggingFace | SmolLM-135M | 42.70% | 43.17% | 67.19% | 43.89% | 50.43% | 32.34% | 46.62% | 600B |
| Facebook | MobileLLM-R1-140M-base | 33.91% | 37.47% | 62.79% | 45.04% | 50.28% | 46.94% | 46.07% | 4.2T |
| Axiomic Labs | GPT-X-125M | 36.57% | 38.84% | 65.72% | 43.83% | 50.83% | 30.52% | 44.39% | 15B |
| Facebook | MobileLLM-125M | 38.90% | 35.50% | 65.30% | 42.04% | 53.10% | 31.16% | 44.33% | 1T |
| OpenAI | GPT-3 (125M) | 33.70% | 35.10% | 64.60% | NA | 52.00% | NA | NA | 300B |
| OpenAI | GPT-2 Medium (355M) | 39.40% | 34.80% | 66.30% | 44.90% | 50.40% | 34.80% | 43.94% | ~10B |
| OpenAI | GPT-2 (124M) | 31.49% | 31.40% | 63.28% | 44.52% | 48.54% | 32.80% | 42.01% | ~10B |
| EleutherAI | Pythia-160M | 30.46% | 29.95% | 57.94% | 40.87% | 49.41% | 28.06% | 39.45% | ~225B |
| Facebook | OPT-125M | 31.39% | 31.53% | 62.02% | 43.81% | 49.96% | 27.48% | 41.03% | 180B |
| EleutherAI | GPT-Neo-125M | 30.55% | 31.43% | 61.75% | 45.40% | 49.09% | 29.98% | 41.37% | 300B |
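
The card does not say how the Average column is computed. The sketch below assumes it is the unweighted mean of the six benchmark columns and checks that assumption against the GPT-X-125M row; the helper name `benchmark_average` is hypothetical.

```python
# Scores for the GPT-X-125M row, in column order:
# HellaSwag, ARC (Average), PIQA, LogicMark, Winogrande, ArithMark.
gpt_x_scores = [36.57, 38.84, 65.72, 43.83, 50.83, 30.52]


def benchmark_average(scores):
    """Unweighted mean of per-benchmark accuracies (assumed definition)."""
    return sum(scores) / len(scores)


avg = benchmark_average(gpt_x_scores)
# The result lands within 0.01 of the 44.39% listed in the table.
print(f"{avg:.3f}")
```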

LogicMark and ArithMark are procedural benchmarks designed to evaluate structured reasoning and arithmetic generalization across increasing difficulty levels.


Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True lets transformers run the custom model code shipped in the repo.
model = AutoModelForCausalLM.from_pretrained(
    "Datdanboi25/GPT-X-125M",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("Datdanboi25/GPT-X-125M")

# Tokenize the prompt and sample up to 50 new tokens.
inputs = tokenizer("The future of AI is", return_tensors="pt", add_special_tokens=True)
output = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,  # reuse EOS as the padding token
    repetition_penalty=1.1,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Architecture

GPT-X replaces every major component of GPT-2 with modern alternatives proven at scale by Llama 3, Mistral, and Gemma 2.

| Component | Details |
| --- | --- |
| Position encoding | RoPE (rotary, zero extra params) |
| Normalization | RMSNorm (float32 upcast) |
| Feed-forward | SwiGLU (3-matrix gated MLP) |
| Attention | Grouped Query Attention: 9Q / 3KV (3:1) |
| QK stability | QK-Norm (RMSNorm per head, before RoPE) |
| Bias | None (all layers bias-free) |
| Embedding | sqrt(d_model) scaling + weight tying |
| Auxiliary loss | z-loss (1e-4 on logit magnitudes) |
| Depth | 27 layers x 576 hidden (deep & narrow) |
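
To make two of these components concrete, here is a minimal pure-Python sketch of RMSNorm and a bias-free SwiGLU MLP on toy sizes. It is not the model's actual implementation: the `eps` value, the helper names, and the toy weights are assumptions, and the float32 upcast is not modeled because Python floats are already double precision.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)  # eps value is an assumption
    return [v * inv_rms * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def swiglu(x, w_gate, w_up, w_down):
    """Three-matrix gated MLP: down(silu(gate(x)) * up(x)), with no bias terms."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


# Toy sizes: hidden width 2, intermediate width 3 (the card lists 576 and 1,536).
x = [3.0, 4.0]
normed = rms_norm(x, weight=[1.0, 1.0])
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 x 2
w_up = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # 3 x 2
w_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]    # 2 x 3
out = swiglu(normed, w_gate, w_up, w_down)
print(normed, out)
```

After normalization the vector has a root-mean-square of approximately 1, which is what keeps activations at a stable scale across 27 layers.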

Config

vocab_size     = 50,304    (GPT-2 BPE, padded)
n_layer        = 27
n_head         = 9         (query heads)
n_kv_heads     = 3         (key-value heads, 3:1 GQA)
n_embd         = 576
head_dim       = 64
intermediate   = 1,536     (SwiGLU, 2.67x ratio)
block_size     = 1,024
rope_theta     = 10,000
total params   = 124,561,728
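
As an illustration of how `n_head`, `n_kv_heads`, `head_dim`, and `rope_theta` interact, the sketch below derives the query-to-KV head grouping and the RoPE rotation frequencies from the values above. The interleaved channel pairing and the `rope_rotate` helper are assumptions; the model's own code may pair channels differently.

```python
import math

n_head, n_kv_heads, head_dim, rope_theta = 9, 3, 64, 10_000

# GQA: each group of consecutive query heads shares one key-value head.
group_size = n_head // n_kv_heads
kv_head_for_query = [q // group_size for q in range(n_head)]
print(kv_head_for_query)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]

# RoPE: one rotation frequency per pair of channels within a head.
inv_freq = [rope_theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def rope_rotate(vec, position):
    """Rotate consecutive channel pairs of vec by position-dependent angles."""
    out = []
    for i, freq in enumerate(inv_freq):
        angle = position * freq
        a, b = vec[2 * i], vec[2 * i + 1]
        out.append(a * math.cos(angle) - b * math.sin(angle))
        out.append(a * math.sin(angle) + b * math.cos(angle))
    return out


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Attention scores depend only on the relative offset between positions.
q = [math.sin(i) for i in range(head_dim)]
k = [math.cos(i) for i in range(head_dim)]
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_far = dot(rope_rotate(q, 12), rope_rotate(k, 10))
print(score_near, score_far)
```

Both scores use an offset of 2 between query and key positions, so they agree up to floating-point error; this is why RoPE adds position information without any learned parameters.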

Parameter Breakdown

Component Params
Token embeddings (50304 x 576) 28,975,104
Per block (x27): attention + SwiGLU + norms 3,540,224
27 transformer blocks 95,586,048
Final RMSNorm 576
LM head (tied with embeddings) 0
Total 124,561,728
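
The breakdown can be reproduced from the config values. The sketch below assumes two RMSNorm weight vectors per block (before attention and before the MLP) and one `head_dim`-sized QK-Norm weight vector each for queries and keys; the card does not itemize these, but the totals match under that assumption.

```python
vocab_size, n_layer, n_embd = 50_304, 27, 576
n_head, n_kv_heads, head_dim, intermediate = 9, 3, 64, 1_536

embedding = vocab_size * n_embd                # shared with the LM head (weight tying)
q_proj = n_embd * n_head * head_dim
kv_proj = 2 * n_embd * n_kv_heads * head_dim   # key and value projections
o_proj = n_head * head_dim * n_embd
attention = q_proj + kv_proj + o_proj
swiglu = 3 * n_embd * intermediate             # gate, up, and down matrices
block_norms = 2 * n_embd                       # assumed: pre-attention and pre-MLP RMSNorm
qk_norms = 2 * head_dim                        # assumed: one weight vector each for Q and K

per_block = attention + swiglu + block_norms + qk_norms
total = embedding + n_layer * per_block + n_embd  # last term is the final RMSNorm
print(f"{per_block:,} {total:,}")  # 3,540,224 124,561,728
```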

Training

Data

  • Dataset: FineWeb-Edu sample-100BT (educational web text)
  • Tokens: 15B nominal (30,500 steps x 524,288 tokens/step, about 15.99B)
  • Tokenizer: GPT-2 BPE (tiktoken, 50,257 vocab padded to 50,304)
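
The figures in this list can be checked with a few lines of arithmetic. The step count and tokens per step multiply out to about 15.99B tokens, which the card reports as 15B. The vocabulary padding is shown as rounding up to a multiple of 64, which is an assumption: the card gives only the padded size, and `pad_vocab` is a hypothetical helper.

```python
steps, tokens_per_step = 30_500, 524_288
total_tokens = steps * tokens_per_step
print(f"{total_tokens:,}")  # 15,990,784,000


def pad_vocab(size, multiple=64):
    """Round size up to the next multiple (the choice of 64 is an assumption)."""
    return -(-size // multiple) * multiple


print(pad_vocab(50_257))  # 50304
```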

Optimization

  • Optimizer: AdamW (betas=0.9/0.95, weight_decay=0.1)
  • Learning rate: 6e-4 max, 6e-5 min
  • Schedule: WSD (warmup-stable-decay): 1,000-step warmup, stable phase, linear decay over the final 20% of steps
  • Batch size: 524,288 tokens (micro_batch=8, seq_len=1024, grad accumulation)
  • Precision: bfloat16 mixed precision
  • Gradient clipping: 1.0
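
The schedule and batch settings above can be sketched as follows. This is an illustration under stated assumptions, not the training script: warmup is taken to be linear from zero, the decay is taken to end exactly at the minimum learning rate on the final step, and `wsd_lr` is a hypothetical helper name.

```python
def wsd_lr(step, total_steps=30_500, warmup_steps=1_000,
           max_lr=6e-4, min_lr=6e-5, decay_frac=0.2):
    """Warmup-stable-decay learning rate at a given optimizer step."""
    decay_steps = round(total_steps * decay_frac)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warmup from 0 (assumed shape).
        return max_lr * step / warmup_steps
    if step < decay_start:
        # Stable phase at the peak rate.
        return max_lr
    # Linear decay from max_lr to min_lr over the final decay_steps.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return max_lr + (min_lr - max_lr) * progress


# Gradient accumulation needed to reach the stated batch on a single GPU.
batch_tokens, micro_batch, seq_len = 524_288, 8, 1_024
accum_steps = batch_tokens // (micro_batch * seq_len)
print(accum_steps)  # 64

for s in (0, 500, 1_000, 24_400, 30_500):
    print(s, wsd_lr(s))
```

With these values the decay phase begins at step 24,400, and each optimizer step accumulates gradients over 64 micro-batches of 8 sequences.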

Hardware

  • 1x NVIDIA RTX 3080 Ti (training)
  • 1x Intel i9-13900K (data tokenization)
  • Training time: ~100 hours

Design Decisions

  • 27 layers x 576 hidden: SmolLM-135M and MobileLLM-125M showed that deep, narrow models perform best at the 125M scale. This is 2.25x the depth of GPT-2's 12 layers.
  • GQA 3:1: Sharing each key-value head across three query heads saves attention parameters, which are reinvested in a larger SwiGLU. Quality loss at this ratio is negligible.
  • SwiGLU: A gated MLP with SiLU activation outperforms a GELU MLP; Llama, PaLM, and Mistral all use it.
  • QK-Norm: Prevents attention logit explosion in deep models. Applied before RoPE (the ordering used by Gemma 3 and Qwen3).
  • z-loss: Prevents logit magnitude drift during training (used in PaLM and T5).
  • WSD schedule: Holds the peak LR until the final 20% of training, then decays sharply. Beats cosine when the token budget is limited.
  • No bias: Bias terms add no measurable quality in modern transformers, and recent open models such as Llama and Mistral omit them.
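
The z-loss item can be illustrated with a small pure-Python sketch. It assumes the PaLM-style form, a coefficient times the squared log of the softmax normalizer, added to the cross-entropy; the card states only the 1e-4 coefficient, and the function names here are hypothetical.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log of the softmax normalizer Z."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))


def cross_entropy_with_z_loss(logits, target, z_coeff=1e-4):
    """Return (cross-entropy + z_coeff * log(Z)^2, the z-loss term alone)."""
    log_z = log_sum_exp(logits)
    ce = log_z - logits[target]
    z_loss = z_coeff * log_z ** 2
    return ce + z_loss, z_loss


logits = [2.0, 0.5, -1.0]
shifted = [v + 10.0 for v in logits]  # same softmax, larger logit magnitudes

total_small, z_small = cross_entropy_with_z_loss(logits, target=0)
total_large, z_large = cross_entropy_with_z_loss(shifted, target=0)
print(z_small, z_large)
```

Adding a constant to every logit leaves the softmax and the cross-entropy unchanged, so cross-entropy alone cannot stop the logits from drifting. The z-loss term grows with that drift, which is what penalizes it.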

Limitations

  • Small model: 125M parameters limit reasoning and factual recall
  • Educational data only: Trained on FineWeb-Edu; not representative of general web text
  • Not instruction-tuned: Base model only, not aligned for chat
  • English only
  • 1,024-token context window

License

Apache 2.0


Citation

@misc{gptx2025,
  title={GPT-X: A Modern Llama-Style Language Model at 125M Scale},
  author={Datdanboi25},
  year={2025},
  url={https://huggingface.co/Datdanboi25/GPT-X-125M}
}