Stories-SLM 2 🤗
This model is part of a collection of Small Language Models pretrained from scratch on the TinyStories dataset. The collection currently contains 2 pretrained models, with more on the way. The model variants in the collection range from a standard GPT to Mixture-of-Experts versions built with RoPE, Group Query Attention, and RMSNormalization.
Model Name: Stories-SLM 2
Model Description
Stories-SLM 2 is an advanced small language model pretrained from scratch on the TinyStories dataset. It has 48 million parameters and was trained for ~7,000 steps on a single Tesla T4 GPU, on the next-token prediction task using cross-entropy loss over 457M tokens.
- Developed by: Namrata Thakur
- Model type: Text Generation
- Language(s) (NLP): English
- License: MIT
- Training Type: Pretraining
Model Sources
- Repository: https://github.com/NamrataThakur/Large_Language_Model_From_Scratch_Implementation
- Demo: [More Information Needed]
How to Get Started with the Model
To install Stories-SLM 2, follow these steps:
# Clone the repository
git clone https://github.com/NamrataThakur/Large_Language_Model_From_Scratch_Implementation.git
cd Large_Language_Model_From_Scratch_Implementation

# Create a virtual environment and activate it
python -m venv env
source env/bin/activate  # on Windows: env\Scripts\activate

# Install the required packages
pip install -r requirements.txt
Uses
Stories-SLM 2 can be used to generate short, simple stories that are grammatically and semantically coherent and suitable for children.
Chainlit Interface 🖥️
The easiest way to interact with Stories-SLM 2 is through its Chainlit interface:
chainlit run app_pretrain.py
This will launch a web application where you can input text and see the model's generated responses. The web interface lets you choose between the Stories-SLM and Stories-SLM 2 models.
Downloading from Hugging Face 🤗
To use Stories-SLM 2 by downloading the weights from Hugging Face:
- Step 1: Clone the repository locally
- Step 2: Run the script below
from transformer_blocks.gpt2_gqa import GQAGPT2
from gpt_Pretraining.text_generation import Text_Generation
import torch
model = GQAGPT2.from_pretrained("NamrataThakur/Small_Language_Model_GQA_48M_Pretrained")
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# ---------- Check the generation to make sure everything is okay ----------
generation = Text_Generation(model=model, device=device, tokenizer_model='gpt2',
                             arch_type='GQA')
start_context = "One day, a "
response = generation.text_generation(input_text=start_context, max_new_tokens = 160, temp = 0.5, top_k=10, kv_cache=False)
print(response)
Model Architecture and Objective
Stories-SLM 2 uses a standard GPT decoder-only transformer architecture with:
- Attention Type: Group Query Attention
- Num KV Groups: 4
- Normalization: RMSNormalization
- Position Embeddings: Rotary Positional Embeddings (RoPE)
- Num transformer blocks: 8
- Num attention heads: 8
- Embedding dimensions: 512
- Vocabulary size: 50,257 tokens
- Context window: 256 tokens
- Feed-Forward Hidden Dimension: 1024
- Parameters: ~48M (48.83M exact)
- Attention Dropout: 0.2
- Feed-Forward Dropout: 0.2
- Token Dropout: 0.03
- Weight tying between token embeddings and output head
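The hyperparameters above can be collected into a single configuration for reference. This is only a sketch: the key names below are illustrative and may not match the repository's actual field names, but the values are taken from the model card.

```python
# Illustrative config for Stories-SLM 2; key names are hypothetical,
# values come from the architecture list above.
STORIES_SLM2_CONFIG = {
    "vocab_size": 50257,      # GPT-2 BPE vocabulary
    "context_length": 256,    # context window in tokens
    "emb_dim": 512,           # embedding dimensions
    "n_layers": 8,            # transformer blocks
    "n_heads": 8,             # attention heads
    "n_kv_groups": 4,         # Group Query Attention KV groups
    "ffn_hidden_dim": 1024,   # feed-forward hidden dimension
    "attn_dropout": 0.2,
    "ffn_dropout": 0.2,
    "token_dropout": 0.03,
    "weight_tying": True,     # token embedding shared with output head
}

# Sanity checks implied by the architecture:
head_dim = STORIES_SLM2_CONFIG["emb_dim"] // STORIES_SLM2_CONFIG["n_heads"]
# each KV group must serve a whole number of query heads
assert STORIES_SLM2_CONFIG["n_heads"] % STORIES_SLM2_CONFIG["n_kv_groups"] == 0
```

With 8 query heads and 4 KV groups, every pair of query heads shares one key/value head, roughly halving the KV projection and cache size relative to full multi-head attention.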
Optimization Config:
- Optimizer: AdamW
- Weight Decay: 0.1
- Beta1: 0.9
- Beta2: 0.95
- Warmup Steps: 829 steps
- Total Steps: ~7000 steps
- use_gradient_clip: True
- Initial Learning Rate: 0.00003
- Maximum Learning Rate: 0.0003
- Gradient Accumulation Steps: 16
- Batch Size: 16
- Global Batch Size: 256
- Scheduler: Linear Increase, followed by Cosine Annealing
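The scheduler above (linear warmup followed by cosine annealing) can be sketched as a pure function of the training step. This is a sketch under assumptions: the repository's actual schedule may differ, and the cosine floor `min_lr` is assumed here to equal the initial learning rate, which the model card does not state.

```python
import math

def lr_at_step(step, warmup_steps=829, total_steps=7000,
               initial_lr=3e-5, max_lr=3e-4, min_lr=3e-5):
    """Linear warmup to max_lr, then cosine annealing down to min_lr.

    min_lr is an assumption (reused from the initial LR); defaults match
    the optimization config above.
    """
    if step < warmup_steps:
        # linear increase from initial_lr to max_lr
        return initial_lr + (max_lr - initial_lr) * step / warmup_steps
    # cosine annealing from max_lr down to min_lr over the remaining steps
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Note also that with a per-device batch size of 16 and 16 gradient accumulation steps, one optimizer step consumes 16 × 16 = 256 sequences, matching the global batch size listed above.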
Training Details
Training Data
The model was trained on the TinyStories dataset, a collection of short stories designed for training language models. This dataset provides simple narratives that help the model learn coherent story generation while maintaining a smaller size compared to larger language models.
Training Procedure
Stories-SLM 2 was trained using PyTorch on the TinyStories dataset. The training process involved:
- Tokenizing the input text
- Creating sliding windows of fixed block size
- Training the model with cross-entropy loss
- Applying learning rate scheduling with warmup and cosine decay
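The sliding-window step above can be sketched as follows. `sliding_windows` is an illustrative helper, not the repository's actual code: it splits a token stream into fixed-size input blocks, with targets shifted by one position for next-token prediction.

```python
def sliding_windows(token_ids, block_size, stride):
    """Split a token stream into (input, target) pairs for next-token
    prediction; targets are the inputs shifted right by one token."""
    inputs, targets = [], []
    for i in range(0, len(token_ids) - block_size, stride):
        inputs.append(token_ids[i : i + block_size])
        targets.append(token_ids[i + 1 : i + 1 + block_size])
    return inputs, targets

# Example on a toy token stream:
x, y = sliding_windows(list(range(10)), block_size=4, stride=4)
```

With `stride == block_size` the windows do not overlap; a smaller stride would produce overlapping windows and more training examples from the same text.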
Training Plots
- Learning Rate Vs Steps:
- Loss Vs Steps:
Inference
During inference, Stories-SLM 2 uses several techniques to produce high-quality text:
- Temperature scaling for controlling randomness
- Top-k sampling for focus and diversity
- Efficient token generation one at a time
- Max New Tokens to determine generation length
- KV Cache for efficient autoregressive generation
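The temperature-scaling and top-k steps above can be sketched in plain Python. This is an illustrative implementation of the sampling logic only (not the repository's `Text_Generation` class), operating on a raw list of logits for one position.

```python
import math
import random

def sample_next_token(logits, temp=0.5, top_k=10, rng=None):
    """Sample one token id: temperature scaling, then top-k filtering,
    then softmax sampling over the surviving logits."""
    # temperature scaling: temp < 1 sharpens the distribution
    scaled = [l / temp for l in logits]
    # top-k filtering: drop everything below the k-th highest logit
    if top_k < len(scaled):
        threshold = sorted(scaled, reverse=True)[top_k - 1]
    else:
        threshold = min(scaled)
    filtered = [s if s >= threshold else float("-inf") for s in scaled]
    # numerically stable softmax over the remaining logits
    m = max(filtered)
    exps = [math.exp(s - m) for s in filtered]
    total = sum(exps)
    probs = [e / total for e in exps]
    rng = rng or random.Random()
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# With top_k=1 this is greedy decoding: the argmax token is always chosen.
best = sample_next_token([1.0, 5.0, 2.0, 0.1], top_k=1)
```

Autoregressive generation repeats this one token at a time, appending each sampled id to the context; the KV cache avoids recomputing attention keys and values for tokens already processed.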
Results
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: Single Tesla-T4 16GB
- Hours used: [More Information Needed]
- Cloud Provider: Lightning-AI
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support ❤️
If you find Stories-SLM useful, please consider starring the repository ⭐