GPT-TinyStories-512

A GPT-style language model with 51 million parameters, trained from scratch on the TinyStories dataset. This model generates creative short stories using vocabulary and concepts that young children can understand.

Model Description

This is a decoder-only transformer model built using PyTorch, implementing the GPT architecture with the following specifications:

  • Parameters: ~51 million
  • Architecture: 8-layer transformer with 8 attention heads
  • Embedding Dimension: 512
  • Context Window: 256 tokens
  • Vocabulary: 50,257 tokens (GPT-2 tokenizer)
  • Training Dataset: TinyStories (synthetic stories for 3-4 year olds)
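The ~51M figure can be checked directly from the configuration. The sketch below assumes nanoGPT-style GPT-2 blocks with biases, a 4x MLP expansion, and a weight-tied `lm_head` (assumptions, not confirmed by this card):

```python
# Rough parameter count for the listed configuration (assumes GPT-2-style
# blocks with biases and a weight-tied lm_head).
vocab, ctx, layers, d = 50257, 256, 8, 512

tok_emb = vocab * d                        # token embedding (shared with lm_head)
pos_emb = ctx * d                          # learned positional embedding
attn = d * 3 * d + 3 * d + d * d + d      # fused QKV + output projection, with biases
mlp = d * 4 * d + 4 * d + 4 * d * d + d   # two linear layers, 4x expansion
norms = 2 * 2 * d                          # two LayerNorms per block (weight + bias)
per_layer = attn + mlp + norms
total = tok_emb + pos_emb + layers * per_layer + 2 * d  # + final LayerNorm

print(f"{total / 1e6:.1f}M parameters")    # ≈ 51.1M
```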

Training Details

Training Data

The model was trained on TinyStories, a dataset of short stories generated by GPT-3.5 and GPT-4 using simple vocabulary.

Training Procedure

  • Optimizer: AdamW (lr=5e-4, betas=(0.9, 0.95), weight_decay=0.1)
  • Learning Rate Schedule: Linear warmup (2,000 steps) + Cosine annealing
  • Batch Size: 32 (with 32 gradient accumulation steps, effective batch size: 1,024)
  • Training Steps: 40,000 iterations
  • Mixed Precision: FP16 (with gradient scaling) or BF16 where supported
  • Hardware: Single GPU training
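The warmup-plus-cosine schedule above can be sketched as a single function. The decay floor (here one tenth of the peak rate) and decay horizon are common defaults and are assumptions, not values stated in this card:

```python
import math

MAX_LR = 5e-4        # peak learning rate from the training procedure
WARMUP_STEPS = 2_000
MAX_STEPS = 40_000
MIN_LR = MAX_LR / 10  # assumed decay floor

def lr_at(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay toward MIN_LR."""
    if step < WARMUP_STEPS:
        return MAX_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    coeff = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return MIN_LR + coeff * (MAX_LR - MIN_LR)
```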

Training Results

| Iteration | Train Loss | Validation Loss |
|----------:|-----------:|----------------:|
| 1,000     | 5.46       | 5.46            |
| 5,000     | 3.07       | 3.07            |
| 10,000    | 2.38       | 2.39            |
| 20,000    | 1.89       | 1.92            |
| 40,000    | 1.51       | 1.57            |
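Since these are cross-entropy losses, perplexity is just their exponential: the final validation loss of 1.57 corresponds to a perplexity of roughly exp(1.57) ≈ 4.8.

```python
import math

# Convert the reported validation losses to perplexity: ppl = exp(loss)
for step, val_loss in [(1_000, 5.46), (10_000, 2.39), (40_000, 1.57)]:
    print(f"step {step:>6}: ppl {math.exp(val_loss):.1f}")
```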

Usage

Loading the Model

import torch
import torch.nn as nn
from dataclasses import dataclass

# Define the model configuration and architecture
# (Copy the GPT class from the training notebook)

# Load the trained weights
config = GPTConfig(
    vocab_size=50257,
    block_size=256,
    n_layer=8,
    n_head=8,
    n_embd=512,
    dropout=0.0,
    bias=True
)

model = GPT(config)
model.load_state_dict(torch.load("pytorch_model.pt", map_location="cpu"))  # map_location avoids GPU requirement
model.eval()
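The snippet above assumes a `GPTConfig` class from the training notebook. A minimal definition matching the fields used here would look like the following (a sketch assuming a nanoGPT-style dataclass; the authoritative version lives in the notebook):

```python
from dataclasses import dataclass

# Minimal config matching the fields referenced in the loading example
# (field names and defaults are assumptions based on nanoGPT conventions).
@dataclass
class GPTConfig:
    vocab_size: int = 50257  # GPT-2 tokenizer vocabulary
    block_size: int = 256    # maximum context length
    n_layer: int = 8         # transformer blocks
    n_head: int = 8          # attention heads per block
    n_embd: int = 512        # embedding dimension
    dropout: float = 0.0
    bias: bool = True        # use biases in Linear and LayerNorm
```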

Generating Text

import tiktoken

# Initialize tokenizer
enc = tiktoken.get_encoding("gpt2")

# Prepare input
prompt = "Once upon a time there was a"
context = torch.tensor(enc.encode_ordinary(prompt), dtype=torch.long).unsqueeze(0)  # shape (1, T)

# Generate
with torch.no_grad():
    output = model.generate(context, max_new_tokens=200, temperature=0.8, top_k=40)

# Decode
generated_text = enc.decode(output.squeeze().tolist())
print(generated_text)
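The `generate` method with `temperature` and `top_k` comes from the training notebook's model class. For illustration, one sampling step of a typical top-k implementation looks like the pure-Python sketch below (an assumed reconstruction of the technique, not the model's actual code):

```python
import math
import random

def sample_top_k(logits, k=40, temperature=0.8, rng=random):
    """One decoding step: keep the k largest logits, apply temperature
    scaling, softmax over the survivors, and draw a token id."""
    scaled = [l / temperature for l in logits]
    # Indices of the k largest scaled logits
    top = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # Numerically stable softmax over the top-k candidates
    m = max(scaled[i] for i in top)
    weights = [math.exp(scaled[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]
```

A full `generate` loop repeats this step, appending each sampled token to the context (cropped to the 256-token window) until `max_new_tokens` have been produced.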

Limitations and Bias

  • The model is trained on synthetic data and may not reflect real-world language patterns
  • Limited to simple vocabulary suitable for young children
  • May generate repetitive or nonsensical text for longer sequences
  • No safety filtering or alignment training has been applied

Citation

If you use this model, please cite:

@misc{tinystories-gpt-512,
  author = {Usama Asif},
  title = {GPT-TinyStories-512: A Small Language Model for Story Generation},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/usamaasif-ua/GPT-TinyStories-512}
}
