A newer version of this model is available: AxiomicLabs/GPT-X2-125M

GPT-X-125M

A modern Llama-style language model trained from scratch: 125M parameters, 15B tokens of FineWeb-Edu. It outperforms GPT-3 (125M) on HellaSwag (36.57% vs. 33.70%) with one-twentieth of the training tokens (15B vs. 300B).

Results

Evaluated with an internal harness modeled on EleutherAI/lm-evaluation-harness; all benchmarks are zero-shot.

Company	Model	HellaSwag	ARC (Average)	PIQA	LogicMark	Winogrande	ArithMark	Average	Training tokens
HuggingFace	SmolLM2-135M	43.22%	44.62%	67.52%	48.78%	48.46%	33.26%	47.64%	2T
Axiomic Labs	GPT-X2-125M (Unreleased for now)	40.55%	39.90%	66.97%	49.12%	49.01%	34.78%	46.72%	75B
HuggingFace	SmolLM-135M	42.70%	43.17%	67.19%	43.89%	50.43%	32.34%	46.62%	600B
Facebook	MobileLLM-R1-140M-base	33.91%	37.47%	62.79%	45.04%	50.28%	46.94%	46.07%	4.2T
Axiomic Labs	GPT-X-125M	36.57%	38.84%	65.72%	43.83%	50.83%	30.52%	44.39%	15B
Facebook	MobileLLM-125M	38.90%	35.50%	65.30%	42.04%	53.10%	31.16%	44.33%	1T
OpenAI	GPT-3 (125M)	33.70%	35.10%	64.60%	NA	52.00%	NA	NA	300B
OpenAI	GPT-2 Medium (355M)	39.40%	34.80%	66.30%	44.90%	50.40%	34.80%	43.94%	~10B
OpenAI	GPT-2 (124M)	31.49%	31.40%	63.28%	44.52%	48.54%	32.80%	42.01%	~10B
EleutherAI	GPT-Neo-125M	30.55%	31.43%	61.75%	45.40%	49.09%	29.98%	41.37%	300B
Facebook	OPT-125M	31.39%	31.53%	62.02%	43.81%	49.96%	27.48%	41.03%	180B
EleutherAI	Pythia-160M	30.46%	29.95%	57.94%	40.87%	49.41%	28.06%	39.45%	~225B

LogicMark and ArithMark are procedural benchmarks designed to evaluate structured reasoning and arithmetic generalization across increasing difficulty levels.
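
The Average column in the table above can be read as the unweighted mean of the six benchmark scores. A minimal Python check for the GPT-X-125M row, assuming that interpretation (the function name `benchmark_average` is illustrative and not part of any released code):

```python
def benchmark_average(scores):
    """Unweighted mean of benchmark accuracies given in percent."""
    return sum(scores) / len(scores)

# GPT-X-125M row: HellaSwag, ARC (Average), PIQA, LogicMark, Winogrande, ArithMark
gpt_x_125m = [36.57, 38.84, 65.72, 43.83, 50.83, 30.52]

average = benchmark_average(gpt_x_125m)
print(f"{average:.3f}%")  # compare with the 44.39% listed in the table
```

Rows with an NA entry (GPT-3) have no Average because the mean is only defined over all six benchmarks.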

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Datdanboi25/GPT-X-125M"

# trust_remote_code=True allows transformers to run the model code shipped in the repo
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Tokenize the prompt into PyTorch tensors
inputs = tokenizer("The future of AI is", return_tensors="pt", add_special_tokens=True)

# Sample up to 50 new tokens; EOS is reused as the padding token id
output = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    repetition_penalty=1.1,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Architecture

GPT-X replaces every major component of GPT-2 with a modern alternative used in recent large models such as Llama 3, Mistral, and Gemma 2.

Component	Details
Position encoding	RoPE (rotary, zero extra params)
Normalization	RMSNorm (float32 upcast)
Feed-forward	SwiGLU (3-matrix gated MLP)
Attention	Grouped Query Attention — 9Q / 3KV (3:1)
QK stability	QK-Norm (RMSNorm per head, before RoPE)
Bias	None (all layers bias-free)
Embedding	sqrt(d_model) scaling + weight tying
Auxiliary loss	z-loss (1e-4 on logit magnitudes)
Depth	27 layers x 576 hidden (deep & narrow)
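
Two rows of the table above are easy to illustrate without a deep learning framework: how 9 query heads share 3 key-value heads, and what RMSNorm computes. The sketch below is plain Python written for illustration only. The contiguous grouping of query heads, the `eps` value, and the function names are assumptions, not taken from the model's code.

```python
import math

N_HEAD = 9      # query heads
N_KV_HEADS = 3  # key-value heads

def kv_head_for(query_head, n_head=N_HEAD, n_kv_heads=N_KV_HEADS):
    """Index of the KV head shared by a given query head.

    Assumes contiguous grouping: query heads 0-2 use KV head 0, and so on.
    """
    group_size = n_head // n_kv_heads  # 3 query heads per KV head
    return query_head // group_size

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over one vector: x / sqrt(mean(x^2) + eps), scaled by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

mapping = [kv_head_for(h) for h in range(N_HEAD)]
print(mapping)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]

# After normalization the mean of squares is approximately 1
normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(normed)
```

Unlike LayerNorm, RMSNorm subtracts no mean and has no bias term, which fits the bias-free design listed in the table.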

Config

vocab_size     = 50,304    (GPT-2 BPE, padded)
n_layer        = 27
n_head         = 9         (query heads)
n_kv_heads     = 3         (key-value heads, 3:1 GQA)
n_embd         = 576
head_dim       = 64
intermediate   = 1,536     (SwiGLU, 2.67x ratio)
block_size     = 1,024
rope_theta     = 10,000
total params   = 124,561,728
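
The listed values are linked by a few simple relations (heads times head size equals the hidden size, query heads divide evenly over KV heads). A hedged sketch of the same settings as a Python dataclass follows; the class name `GPTXConfig` and its methods are hypothetical, and the configuration class in the model repository may use different names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GPTXConfig:
    """Illustrative container for the values listed above (not the repo's class)."""
    vocab_size: int = 50_304
    n_layer: int = 27
    n_head: int = 9
    n_kv_heads: int = 3
    n_embd: int = 576
    head_dim: int = 64
    intermediate: int = 1_536
    block_size: int = 1_024
    rope_theta: float = 10_000.0

    def validate(self):
        # Query heads must tile the hidden size exactly
        assert self.n_head * self.head_dim == self.n_embd
        # Each KV head serves a whole number of query heads
        assert self.n_head % self.n_kv_heads == 0

    @property
    def kv_dim(self):
        """Width of the key and value projections under GQA."""
        return self.n_kv_heads * self.head_dim

cfg = GPTXConfig()
cfg.validate()
print(cfg.kv_dim)                               # 192
print(round(cfg.intermediate / cfg.n_embd, 2))  # 2.67
```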

Parameter Breakdown

Component	Params
Token embeddings (50304 x 576)	28,975,104
Per block (x27): attention + SwiGLU + norms	3,540,224
27 transformer blocks	95,586,048
Final RMSNorm	576
LM head (tied with embeddings)	0
Total	124,561,728
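
The breakdown above can be reproduced from the config values. The sketch below assumes bias-free projections, two RMSNorm weight vectors per block, and QK-Norm weights of size head_dim for queries and keys; that split of the norm parameters is an assumption, chosen because it matches the listed per-block figure. The function name is illustrative.

```python
def count_parameters(vocab_size=50_304, n_layer=27, n_head=9, n_kv_heads=3,
                     n_embd=576, head_dim=64, intermediate=1_536):
    """Recompute the parameter breakdown for a bias-free, weight-tied model."""
    q_dim = n_head * head_dim        # 576
    kv_dim = n_kv_heads * head_dim   # 192

    # Q and output projections are n_embd x q_dim; K and V are n_embd x kv_dim
    attention = 2 * n_embd * q_dim + 2 * n_embd * kv_dim
    # SwiGLU uses three matrices: gate, up, and down projections
    swiglu = 3 * n_embd * intermediate
    # Assumption: two RMSNorm weight vectors per block
    block_norms = 2 * n_embd
    # Assumption: QK-Norm holds one head_dim-sized weight each for Q and K
    qk_norm = 2 * head_dim

    per_block = attention + swiglu + block_norms + qk_norm
    embeddings = vocab_size * n_embd
    final_norm = n_embd
    lm_head = 0  # tied with the token embeddings
    total = embeddings + n_layer * per_block + final_norm + lm_head
    return {"embeddings": embeddings, "per_block": per_block,
            "blocks": n_layer * per_block, "total": total}

counts = count_parameters()
print(counts)
```

Under these assumptions the function returns 3,540,224 parameters per block and 124,561,728 in total, matching the table.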

Training

Data

Dataset: FineWeb-Edu sample-100BT (educational web text)
Tokens: 15B (about 28,610 steps x 524,288 tokens/step)
Tokenizer: GPT-2 BPE (tiktoken, 50,257 vocab padded to 50,304)
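
The padded vocabulary size follows from rounding the GPT-2 vocabulary up to a hardware-friendly multiple. A small sketch, assuming the padding target is the next multiple of 64 (the card does not state the multiple; 128 gives the same result), with an illustrative function name:

```python
def pad_vocab(vocab_size, multiple=64):
    """Round vocab_size up to the next multiple of `multiple`."""
    return ((vocab_size + multiple - 1) // multiple) * multiple

padded = pad_vocab(50_257)
print(padded)           # 50304
print(padded - 50_257)  # 47 unused token ids
```

The extra ids are never produced by the tokenizer; they only make the embedding matrix dimensions more convenient for GPU kernels.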

Optimization

Optimizer: AdamW (betas=0.9/0.95, weight_decay=0.1)
Learning rate: 6e-4 max, 6e-5 min
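Schedule: WSD (warmup-stable-decay) with a 1,000-step warmup, a stable phase at the maximum learning rate, then linear decay over the final 20% of steps
Batch size: 524,288 tokens per step (micro_batch=8 x seq_len=1024 = 8,192 tokens per micro-batch, 64 gradient-accumulation steps)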
Precision: bfloat16 mixed precision
Gradient clipping: 1.0
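
The schedule and batch settings above can be sketched in a few lines. This is an illustrative reconstruction, not the training script: the linear warmup shape, decaying down to the minimum learning rate, and the function names are assumptions, and `total_steps=10_000` in the example is an arbitrary value chosen only to show the shape.

```python
def wsd_lr(step, total_steps, max_lr=6e-4, min_lr=6e-5,
           warmup_steps=1_000, decay_fraction=0.2):
    """Warmup-stable-decay learning rate for a given optimizer step."""
    decay_steps = int(total_steps * decay_fraction)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Assumption: linear warmup from near zero up to max_lr
        return max_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate
        return max_lr
    # Linear decay from max_lr to min_lr over the final decay_steps
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return max_lr + (min_lr - max_lr) * progress

def grad_accum_steps(tokens_per_step=524_288, micro_batch=8, seq_len=1_024):
    """Micro-batches accumulated per optimizer step."""
    tokens_per_micro_batch = micro_batch * seq_len  # 8,192
    assert tokens_per_step % tokens_per_micro_batch == 0
    return tokens_per_step // tokens_per_micro_batch

print(grad_accum_steps())  # 64
for s in (0, 500, 5_000, 9_000, 10_000):
    print(s, wsd_lr(s, total_steps=10_000))
```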

Hardware

1x NVIDIA RTX 3080 Ti (training)
1x Intel i9-13900K (data tokenization)
Training time: ~100 hours

Design Decisions

27 layers x 576 hidden: SmolLM-135M and MobileLLM-125M showed that deep, narrow models perform best at the 125M scale. 27 layers is 2.25x the depth of GPT-2's 12.
GQA 3:1: Sharing each key-value head across three query heads saves attention parameters, which are reinvested in a larger SwiGLU. Quality loss at this ratio is negligible.
SwiGLU: A gated MLP with SiLU activation, which outperforms GELU MLPs and is used by Llama, PaLM, and Mistral.
QK-Norm: Prevents attention logit explosion in deep models. RMSNorm is applied per head to queries and keys before RoPE, the ordering used by Gemma 3 and Qwen3.
z-loss: Prevents logit magnitude drift during training (PaLM, T5).
WSD schedule: After warmup, holds the peak LR until 80% of training is done, then decays linearly over the final 20%. Beats cosine when tokens are limited.
No bias: Bias terms bring no measurable quality benefit in modern transformers, and most recent frontier LLMs omit them.
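
As an illustration of the z-loss item in the list above, the sketch below adds a penalty on the squared log-partition function to the usual cross-entropy, with the 1e-4 coefficient from the architecture table. It follows the PaLM-style formulation, which is an assumption about how this model implements it; the function names are illustrative and the code handles a single position in plain Python.

```python
import math

def log_sum_exp(logits):
    """Numerically stable log of the softmax normalizer Z."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))

def cross_entropy_with_z_loss(logits, target, z_coeff=1e-4):
    """Return (total, cross_entropy, z_loss) for one token position."""
    log_z = log_sum_exp(logits)
    ce = log_z - logits[target]     # standard cross-entropy
    z_loss = z_coeff * log_z ** 2   # pushes log Z toward 0
    return ce + z_loss, ce, z_loss

total, ce, z = cross_entropy_with_z_loss([2.0, 0.5, -1.0], target=0)
print(total, ce, z)

# Shifting all logits by a constant leaves cross-entropy unchanged
# but increases the z-loss, which is what discourages logit drift.
_, ce_shifted, z_shifted = cross_entropy_with_z_loss([12.0, 10.5, 9.0], target=0)
print(ce_shifted, z_shifted)
```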

Limitations

Small model: 125M parameters limit reasoning and factual recall
Educational data only: Trained on FineWeb-Edu; not representative of general web text
Not instruction-tuned: Base model only, not aligned for chat
English only
1,024-token context window

License

Apache 2.0

Citation

@misc{gptx2025,
  title={GPT-X: A Modern Llama-Style Language Model at 125M Scale},
  author={Datdanboi25},
  year={2025},
  url={https://huggingface.co/Datdanboi25/GPT-X-125M}
}
