Stories-SLM 2 🤗
This model is part of a collection of Small Language Models pretrained from scratch on the TinyStories dataset. The collection currently contains 2 pretrained models, with more on the way. The model variants in the collection range from a standard GPT to Mixture-of-Experts versions built with RoPE, Group Query Attention, and RMSNormalization.
Model Name: Stories-SLM 2
Model Description
Stories-SLM 2 is an advanced small language model pretrained from scratch on the TinyStories dataset. It has 48 million parameters and was trained for ~7,000 steps on a single Tesla T4 GPU, on the next-token prediction task using cross-entropy loss over 457M tokens.
- Developed by: Namrata Thakur
- Model type: Text Generation
- Language(s) (NLP): English
- License: MIT
- Training Type: Pretraining
Model Sources
- Repository: https://github.com/NamrataThakur/Large_Language_Model_From_Scratch_Implementation
- Demo: [More Information Needed]
How to Get Started with the Model
To install Stories-SLM 2, follow these steps:
# Clone the repository
git clone https://github.com/NamrataThakur/Large_Language_Model_From_Scratch_Implementation.git
cd Large_Language_Model_From_Scratch_Implementation

# Create a virtual environment and activate it
python -m venv env
source env/bin/activate  # on Windows: env\Scripts\activate

# Install the required packages
pip install -r requirements.txt
Uses
Stories-SLM 2 can be used to generate short, simple stories that are grammatically and semantically coherent and suitable for children.
Chainlit Interface 🖥️
The easiest way to interact with Stories-SLM 2 is through its Chainlit interface:
chainlit run app_pretrain.py
This will launch a web application where you can input text and see the model's generated responses. The web interface lets you choose between the Stories-SLM and Stories-SLM 2 models.
Downloading from Hugging Face 🤗
To use Stories-SLM 2 by downloading the weights from Hugging Face:
- Step 1: Clone the repository locally
- Step 2: Run the script below
from transformer_blocks.gpt2_gqa import GQAGPT2
from gpt_Pretraining.text_generation import Text_Generation
import torch
model = GQAGPT2.from_pretrained("NamrataThakur/Small_Language_Model_GQA_48M_Pretrained")
model.eval()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# ---------- Check the generation to make sure everything is okay ----------
generation = Text_Generation(model=model, device=device, tokenizer_model='gpt2',
                             arch_type='GQA')
start_context = "One day, a "
response = generation.text_generation(input_text=start_context, max_new_tokens = 160, temp = 0.5, top_k=10, kv_cache=False)
print(response)
Model Architecture and Objective
Stories-SLM 2 uses a standard GPT decoder-only transformer architecture with:
- Attention Type: Group Query Attention
- Num KV Groups: 4
- Normalization: RMSNormalization
- Position Embeddings: Rotary Positional Embeddings (RoPE)
- Num transformer blocks: 8
- Num attention heads: 8
- Embedding dimensions: 512
- Vocabulary size: 50,257 tokens
- Context window: 256 tokens
- Feed-Forward Hidden Dimension: 1024
- Parameters: ~48M (48.83M exact)
- Attention Dropout: 0.2
- Feed-Forward Dropout: 0.2
- Token Dropout: 0.03
- Weight tying between token embeddings and output head
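The hyperparameters above can be collected into a single configuration for reference. This is only a sketch: the key names below are illustrative and may not match the repository's actual field names, but the values are taken from the model card.

```python
# Illustrative config for Stories-SLM 2; key names are hypothetical,
# values come from the architecture list above.
STORIES_SLM2_CONFIG = {
    "vocab_size": 50257,      # GPT-2 BPE vocabulary
    "context_length": 256,    # context window in tokens
    "emb_dim": 512,           # embedding dimensions
    "n_layers": 8,            # transformer blocks
    "n_heads": 8,             # attention heads
    "n_kv_groups": 4,         # Group Query Attention KV groups
    "ffn_hidden_dim": 1024,   # feed-forward hidden dimension
    "attn_dropout": 0.2,
    "ffn_dropout": 0.2,
    "token_dropout": 0.03,
    "weight_tying": True,     # token embedding shared with output head
}

# Sanity checks implied by the architecture:
head_dim = STORIES_SLM2_CONFIG["emb_dim"] // STORIES_SLM2_CONFIG["n_heads"]
# each KV group must serve a whole number of query heads
assert STORIES_SLM2_CONFIG["n_heads"] % STORIES_SLM2_CONFIG["n_kv_groups"] == 0
```

With 8 query heads and 4 KV groups, every pair of query heads shares one key/value head, roughly halving the KV projection and cache size relative to full multi-head attention.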
Optimization Config:
- Optimizer: AdamW
- Weight Decay: 0.1
- Beta1: 0.9
- Beta2: 0.95
- Warmup Steps: 829 steps
- Total Steps: ~7000 steps
- use_gradient_clip: True
- Initial Learning Rate: 0.00003
- Maximum Learning Rate: 0.0003
- Gradient Accumulation Steps: 16
- Batch Size: 16
- Global Batch Size: 256
- Scheduler: Linear Increase, followed by Cosine Annealing
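The scheduler above (linear warmup followed by cosine annealing) can be sketched as a pure function of the training step. This is a sketch under assumptions: the repository's actual schedule may differ, and the cosine floor `min_lr` is assumed here to equal the initial learning rate, which the model card does not state.

```python
import math

def lr_at_step(step, warmup_steps=829, total_steps=7000,
               initial_lr=3e-5, max_lr=3e-4, min_lr=3e-5):
    """Linear warmup to max_lr, then cosine annealing down to min_lr.

    min_lr is an assumption (reused from the initial LR); defaults match
    the optimization config above.
    """
    if step < warmup_steps:
        # linear increase from initial_lr to max_lr
        return initial_lr + (max_lr - initial_lr) * step / warmup_steps
    # cosine annealing from max_lr down to min_lr over the remaining steps
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Note also that with a per-device batch size of 16 and 16 gradient accumulation steps, one optimizer step consumes 16 × 16 = 256 sequences, matching the global batch size listed above.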
Training Details
Training Data
The model was trained on the TinyStories dataset, a collection of short stories designed for training language models. This dataset provides simple narratives that help the model learn coherent story generation while maintaining a smaller size compared to larger language models.
Training Procedure
Stories-SLM 2 was trained using PyTorch on the TinyStories dataset. The training process involved:
- Tokenizing the input text
- Creating sliding windows of fixed block size
- Training the model with cross-entropy loss
- Applying learning rate scheduling with warmup and cosine decay
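The sliding-window step above can be sketched as follows. `sliding_windows` is an illustrative helper, not the repository's actual code: it splits a token stream into fixed-size input blocks, with targets shifted by one position for next-token prediction.

```python
def sliding_windows(token_ids, block_size, stride):
    """Split a token stream into (input, target) pairs for next-token
    prediction; targets are the inputs shifted right by one token."""
    inputs, targets = [], []
    for i in range(0, len(token_ids) - block_size, stride):
        inputs.append(token_ids[i : i + block_size])
        targets.append(token_ids[i + 1 : i + 1 + block_size])
    return inputs, targets

# Example on a toy token stream:
x, y = sliding_windows(list(range(10)), block_size=4, stride=4)
```

With `stride == block_size` the windows do not overlap; a smaller stride would produce overlapping windows and more training examples from the same text.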
Training Plots
- Learning Rate Vs Steps:
- Loss Vs Steps:
Inference
During inference, Stories-SLM 2 uses several techniques to produce high-quality text:
- Temperature scaling for controlling randomness
- Top-k sampling for focus and diversity
- Efficient token generation one at a time
- Max New Tokens to determine generation length
- KV Cache for efficient autoregressive generation
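The temperature-scaling and top-k steps above can be sketched in plain Python. This is an illustrative implementation of the sampling logic only (not the repository's `Text_Generation` class), operating on a raw list of logits for one position.

```python
import math
import random

def sample_next_token(logits, temp=0.5, top_k=10, rng=None):
    """Sample one token id: temperature scaling, then top-k filtering,
    then softmax sampling over the surviving logits."""
    # temperature scaling: temp < 1 sharpens the distribution
    scaled = [l / temp for l in logits]
    # top-k filtering: drop everything below the k-th highest logit
    if top_k < len(scaled):
        threshold = sorted(scaled, reverse=True)[top_k - 1]
    else:
        threshold = min(scaled)
    filtered = [s if s >= threshold else float("-inf") for s in scaled]
    # numerically stable softmax over the remaining logits
    m = max(filtered)
    exps = [math.exp(s - m) for s in filtered]
    total = sum(exps)
    probs = [e / total for e in exps]
    rng = rng or random.Random()
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# With top_k=1 this is greedy decoding: the argmax token is always chosen.
best = sample_next_token([1.0, 5.0, 2.0, 0.1], top_k=1)
```

Autoregressive generation repeats this one token at a time, appending each sampled id to the context; the KV cache avoids recomputing attention keys and values for tokens already processed.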
Results
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: Single Tesla-T4 16GB
- Hours used: [More Information Needed]
- Cloud Provider: Lightning-AI
License
This project is licensed under the MIT License - see the LICENSE file for details.
Support ❤️
If you find Stories-SLM useful, please consider starring the repository ⭐