---
license: mit
datasets:
- roneneldan/TinyStories
language:
- en
pipeline_tag: text-generation
tags:
- Pretrain
- SLM
- Group_Query_Attention
- Custom_Model
- From_Scratch
- tiny-model
- research
---
*This repository demonstrates a small Group-Query-Attention language model trained from scratch for educational and research purposes.*
# Stories-SLM 2 🤖
<!-- Provide a quick summary of what the model is/does. -->
This model is part of a collection of Small Language Models pretrained from scratch on the TinyStories dataset. The collection currently contains **3** pretrained models, with more on the way.
The model variants in the collection range from a standard GPT to **Mixture-of-Experts** versions built with **RoPE**, **Group Query Attention**, and **RMSNormalization**.
| Model | Params | Architecture | Validation Loss |
| ----------------- | ------ | --------------------------- | --------------- |
| Stories-SLM | 53M | Dense - MHA | 1.78 |
| **Stories-SLM 2** | 48M | Dense - GQA | **1.73** |
| Stories-SLM 2-MoE | 127M | Sparse - Mixture-of-Experts | 1.67 |
**Model Name:** **Stories-SLM 2**
### Model Description
<!-- Provide a longer summary of what this model is. -->
**Stories-SLM 2** is a small language model pretrained from scratch on the TinyStories dataset. It has **48 million** parameters and was trained for **7000** steps on a single Tesla T4 GPU.
It is trained on the next-token prediction task using cross-entropy loss over **457M** tokens. The architecture uses **Group Query Attention** with **4 KV groups**.
It replaces the LayerNormalization of Stories-SLM with **RMSNormalization**.
It also applies **Rotary Positional Embeddings (RoPE)** to the query and key vectors within each attention block.
Each of these techniques has been coded **FROM SCRATCH**.
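For reference, here is a minimal sketch of RMSNorm in the standard formulation the description refers to; it is an illustration only, not the repository's exact implementation:
```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm: rescale by the root-mean-square of the activations, no mean subtraction or bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS is computed over the last (embedding) dimension
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```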
- **Developed by:** Namrata Thakur
- **Model type:** Text Generation
- **Language(s) (NLP):** English
- **License:** MIT
- **Training Type:** Pretraining
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** [GitHub Repo](https://github.com/NamrataThakur/Large_Language_Model_From_Scratch_Implementation)
## How to Get Started with the Model
To install **Stories-SLM 2**, follow these steps:
```bash
# Clone the repository and move into it
git clone https://github.com/NamrataThakur/Large_Language_Model_From_Scratch_Implementation.git
cd Large_Language_Model_From_Scratch_Implementation

# Create and activate a virtual environment
python -m venv env
source env/bin/activate

# Install the required packages
pip install -r requirements.txt
```
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
**Stories-SLM 2** can be used to generate short, grammatically and semantically coherent stories suitable for children.
### Chainlit Interface 🖥️
The easiest way to interact with **Stories-SLM 2** is through its Chainlit interface:
```bash
chainlit run app_pretrain.py
```
This will launch a web application where you can input text and see the model's generated responses.
The web interface also lets you choose between the **Stories-SLM** and **Stories-SLM 2** models.

### Downloading from Huggingface 🤗
To interact with **Stories-SLM 2** by downloading it from the Hugging Face Hub:
- Step 1: Clone the repository locally (see the installation steps above).
- Step 2: Run the snippet below.
```python
import torch

from transformer_blocks.gpt2_gqa import GQAGPT2
from gpt_Pretraining.text_generation import Text_Generation

# Load the pretrained GQA model from the Hugging Face Hub
model = GQAGPT2.from_pretrained("NamrataThakur/Small_Language_Model_GQA_48M_Pretrained")
model.eval()

# Move the model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# ---------- Check generation to make sure everything is okay ----------
generation = Text_Generation(model=model, device=device, tokenizer_model='gpt2',
                             arch_type='GQA')
start_context = "One day, a "
response = generation.text_generation(input_text=start_context, max_new_tokens=160,
                                       temp=0.5, top_k=10, kv_cache=False)
print(response)
```
## Model Architecture and Objective
**Stories-SLM 2** uses a standard GPT decoder-only transformer architecture with:
- **Attention Type: Group Query Attention** (sketched after this list)
- **Num KV Groups: 4**
- **Normalization: RMSNormalization**
- **Position Embeddings: Rotary Positional Embeddings (RoPE)**
- Num transformer blocks: 8
- Num attention heads: 8
- Embedding dimensions: 512
- Vocabulary size: 50,257 tokens
- Context window: 256 tokens
- Feed-Forward Hidden Dimension: 1024
- Parameters: ~48M (48.83M exact)
- Attention Dropout: 0.2
- Feed-Forward Dropout: 0.2
- Token Dropout: 0.03
- Weight tying between token embeddings and output head
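To make the GQA configuration above concrete, here is a minimal, self-contained sketch of how 8 query heads can share 4 key/value groups. It is an illustration only (RoPE and dropout are omitted), not the repository's `GQAGPT2` implementation:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupQueryAttention(nn.Module):
    """Illustrative GQA block: n_heads query heads share n_kv_groups key/value heads."""
    def __init__(self, d_model=512, n_heads=8, n_kv_groups=4):
        super().__init__()
        assert n_heads % n_kv_groups == 0
        self.n_heads, self.n_kv_groups = n_heads, n_kv_groups
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_groups * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_groups * self.head_dim, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_groups, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_groups, self.head_dim).transpose(1, 2)
        # Repeat each KV head so every query head in a group attends to the same K/V
        repeat = self.n_heads // self.n_kv_groups
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(out.transpose(1, 2).reshape(b, t, -1))
```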
**Optimization Config**:
- Optimizer: AdamW
- Weight Decay: 0.1
- Beta1: 0.9
- Beta2: 0.95
- Warmup Steps: 829 steps
- Total Steps: ~7000 steps
- use_gradient_clip: True
- Initial Learning Rate: 0.00003
- Maximum Learning Rate: 0.0003
- Gradient Accumulation Steps: 16
- Batch Size: 16
- Global Batch Size: 256
- Scheduler: Linear warmup followed by cosine annealing (see the sketch below)
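The schedule listed above (linear warmup from the initial to the maximum learning rate over 829 steps, then cosine annealing over the remaining steps) can be written roughly as follows. The helper name and the choice to decay back to the initial learning rate are assumptions for illustration:
```python
import math

def lr_at_step(step, warmup_steps=829, total_steps=7000,
               init_lr=3e-5, max_lr=3e-4):
    """Hypothetical helper: linear warmup followed by cosine annealing."""
    if step < warmup_steps:
        # Linear increase from init_lr to max_lr
        return init_lr + (max_lr - init_lr) * step / warmup_steps
    # Cosine annealing from max_lr back down to init_lr (assumed floor)
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return init_lr + 0.5 * (max_lr - init_lr) * (1 + math.cos(math.pi * progress))
```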
## Training Details
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
The model was trained on the TinyStories dataset, a collection of short stories designed for training language models.
This dataset provides simple narratives that help the model learn coherent story generation while maintaining a smaller size compared to larger language models.
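For reference, the dataset can be loaded from the Hugging Face Hub with the `datasets` library (how the repository actually ingests it may differ):
```python
from datasets import load_dataset

# TinyStories, as referenced in this card's metadata
dataset = load_dataset("roneneldan/TinyStories")
print(dataset["train"][0]["text"][:200])
```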
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
**Stories-SLM 2** was trained using PyTorch on the TinyStories dataset. The training process involved:
1. Tokenizing the input text
2. Creating sliding windows of a fixed block size (see the sketch after this list)
3. Training the model with cross-entropy loss
4. Applying learning rate scheduling with warmup and cosine decay
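A minimal sketch of steps 1 and 2, assuming the GPT-2 BPE tokenizer via `tiktoken` and a non-overlapping stride; the repository's dataset class may differ in names and details:
```python
import tiktoken
import torch
from torch.utils.data import Dataset

class SlidingWindowDataset(Dataset):
    """Hypothetical illustration: tokenize text, then cut it into fixed-size input/target windows."""
    def __init__(self, text, block_size=256, stride=256):
        enc = tiktoken.get_encoding("gpt2")  # 50,257-token vocabulary, as in the model config
        ids = enc.encode(text)
        self.inputs, self.targets = [], []
        for i in range(0, len(ids) - block_size, stride):
            self.inputs.append(torch.tensor(ids[i : i + block_size]))
            # Targets are shifted by one token for next-token prediction
            self.targets.append(torch.tensor(ids[i + 1 : i + block_size + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]
```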
**Training Plots**
- Learning Rate Vs Steps:

- Loss Vs Steps:

## Inference
During inference, Stories-SLM 2 uses several techniques to produce high-quality text:
- Temperature scaling for controlling randomness
- Top-k sampling for balancing focus and diversity (both sketched after this list)
- Autoregressive token generation, one token at a time
- Max New Tokens to determine generation length
- KV Cache for efficient autoregressive generation
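A minimal sketch of one temperature-plus-top-k sampling step on the model's logits. This is illustrative only; the repository's `Text_Generation` class wraps this kind of step together with the max-new-tokens loop and the optional KV cache:
```python
import torch

def sample_next_token(logits, temp=0.5, top_k=10):
    """Pick the next token id from the logits of the last position."""
    logits = logits / temp                          # temperature scaling
    top_vals, top_idx = torch.topk(logits, top_k)   # keep only the k most likely tokens
    probs = torch.softmax(top_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return top_idx.gather(-1, choice)
```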
### Results


## Environmental Impact
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** Single Tesla-T4 16GB
- **Hours used:** [More Information Needed]
- **Cloud Provider:** Lightning-AI
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Support ❤️
If you find the Stories-SLM models useful, please consider starring the repository ⭐