---
license: mit
datasets:
- roneneldan/TinyStories
language:
- en
pipeline_tag: text-generation
tags:
- Pretrain
- SLM
- Group_Query_Attention
- Custom_Model
- From_Scratch
- tiny-model
- research
---

*This repository demonstrates a small Group-Query-Attention language model trained from scratch for educational and research purposes.*

# Stories-SLM 2 🤖

This model is part of a collection of Small Language Models pretrained from scratch on the TinyStories dataset. The collection currently contains **3** pretrained models, with more on the way.
The model variants in the collection range from a standard GPT to **Mixture-of-Experts** versions built with **RoPE**, **Group Query Attention**, and **RMSNormalization**.

| Model             | Params | Architecture                | Validation Loss |
| ----------------- | ------ | --------------------------- | --------------- |
| Stories-SLM       | 53M    | Dense - MHA                 | 1.78            |
| **Stories-SLM 2** | 48M    | Dense - GQA                 | **1.73**        |
| Stories-SLM 2-MoE | 127M   | Sparse - Mixture-of-Experts | 1.67            |

**Model Name:** **Stories-SLM 2**

### Model Description

**Stories-SLM 2** is a small language model pretrained from scratch on the TinyStories dataset. It has **48 million** parameters and was trained for **7000** steps on a single Tesla T4 GPU.
It was trained on the next-token prediction task using cross-entropy loss over **457M** tokens. The architecture uses **Group Query Attention** with **4 KV groups**.
It replaces the LayerNormalization of Stories-SLM with **RMSNormalization**, and it applies **Rotary Positional Embeddings (RoPE)** to the Query and Key vectors within each attention block.
Each of these techniques was coded **FROM SCRATCH**.
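
For illustration, here is a minimal, self-contained sketch of grouped-query attention with the dimensions quoted above (8 query heads, 4 KV groups, 512-dimensional embeddings). It is not the repository's `GQAGPT2` implementation, and all names in it are hypothetical:

```python
import torch
import torch.nn as nn


class GroupedQueryAttention(nn.Module):
    """Minimal grouped-query attention: 8 query heads share 4 K/V groups."""

    def __init__(self, emb_dim=512, n_heads=8, n_kv_groups=4):
        super().__init__()
        assert n_heads % n_kv_groups == 0
        self.n_heads = n_heads
        self.n_kv_groups = n_kv_groups
        self.head_dim = emb_dim // n_heads

        self.W_q = nn.Linear(emb_dim, n_heads * self.head_dim, bias=False)
        # K and V are projected once per group, not once per head.
        self.W_k = nn.Linear(emb_dim, n_kv_groups * self.head_dim, bias=False)
        self.W_v = nn.Linear(emb_dim, n_kv_groups * self.head_dim, bias=False)
        self.out_proj = nn.Linear(emb_dim, emb_dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.W_q(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.W_k(x).view(b, t, self.n_kv_groups, self.head_dim).transpose(1, 2)
        v = self.W_v(x).view(b, t, self.n_kv_groups, self.head_dim).transpose(1, 2)

        # Each K/V group serves n_heads // n_kv_groups query heads.
        repeat = self.n_heads // self.n_kv_groups
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)

        # Causal attention over the sequence.
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim**0.5
        scores = scores.masked_fill(mask, float("-inf"))
        ctx = torch.softmax(scores, dim=-1) @ v

        return self.out_proj(ctx.transpose(1, 2).reshape(b, t, -1))
```

Because each K/V group is shared by `n_heads // n_kv_groups` query heads, the key/value projections here are half the size of their multi-head-attention counterparts, while the output shape is unchanged.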

- **Developed by:** Namrata Thakur
- **Model type:** Text Generation
- **Language(s) (NLP):** English
- **License:** MIT
- **Training Type:** Pretraining

### Model Sources


- **Repository:** [GitHub Repo](https://github.com/NamrataThakur/Large_Language_Model_From_Scratch_Implementation)
- **Demo [optional]:** [More Information Needed]

## How to Get Started with the Model

To install **Stories-SLM 2**, follow these steps:

```bash
# Clone the repository
git clone https://github.com/NamrataThakur/Large_Language_Model_From_Scratch_Implementation.git
cd Large_Language_Model_From_Scratch_Implementation

# Create and activate a virtual environment
python -m venv env
source env/bin/activate  # on Windows: env\Scripts\activate

# Install the required packages
pip install -r requirements.txt
```

## Uses

**Stories-SLM 2** can be used to generate short, grammatically and semantically coherent stories suitable for children.

### Chainlit Interface 🖥️

The easiest way to interact with **Stories-SLM 2** is through its Chainlit interface:

```bash
chainlit run app_pretrain.py
```

This will launch a web application where you can enter a prompt and see the model's generated responses.
The web interface lets you choose between the **Stories-SLM** and **Stories-SLM 2** models.

![image](https://cdn-uploads.huggingface.co/production/uploads/684ef699c5b31f6acb9a698d/wjfXRW7xlrdesy1Ffs41p.png)

### Downloading from Huggingface 🤗

To interact with **Stories-SLM 2** by downloading it from the Hugging Face Hub:

- Step 1: Clone the repository locally
- Step 2: Run the Python snippet below

```python
from transformer_blocks.gpt2_gqa import GQAGPT2
from gpt_Pretraining.text_generation import Text_Generation
import torch

model = GQAGPT2.from_pretrained("NamrataThakur/Small_Language_Model_GQA_48M_Pretrained")
model.eval()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# ---------- Checking the generation to make sure everything is okay ----------
generation = Text_Generation(model=model, device=device, tokenizer_model='gpt2', 
                                          arch_type='GQA')
start_context = "One day, a "
response = generation.text_generation(input_text=start_context, max_new_tokens = 160, temp = 0.5, top_k=10, kv_cache=False)
print(response)
```

## Model Architecture and Objective

**Stories-SLM 2** uses a standard GPT decoder-only transformer architecture with:

- **Attention Type: Group Query Attention**
- **Num KV Groups: 4**
- **Normalization: RMSNormalization**
- **Position Embeddings: Rotary Positional Embeddings (RoPE)** (both RMSNormalization and RoPE are sketched after this list)
- Num transformer blocks: 8
- Num attention heads: 8
- Embedding dimensions: 512
- Vocabulary size: 50,257 tokens
- Context window: 256 tokens
- Feed-Forward Hidden Dimension: 1024
- Parameters: ~48M (48.83M exact)
- Attention Dropout: 0.2
- Feed-Forward Dropout: 0.2
- Token Dropout: 0.03
- Weight tying between token embeddings and output head
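
For illustration, minimal sketches of RMSNormalization and RoPE follow, assuming the 512-dimensional embeddings above. They are illustrative only (hypothetical names, and one common RoPE pairing convention), not the repository's implementations:

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """RMS normalization: rescale by the root mean square; no mean subtraction, no bias."""

    def __init__(self, dim=512, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.scale * (x / rms)


def apply_rope(x, base=10000.0):
    """Rotate q or k of shape (batch, heads, seq, head_dim) by position-dependent angles."""
    b, h, t, d = x.shape
    # One rotation frequency per pair of dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, device=x.device).float() / d))
    angles = torch.arange(t, device=x.device).float()[:, None] * inv_freq[None, :]  # (t, d/2)
    cos, sin = angles.cos(), angles.sin()

    x_even, x_odd = x[..., 0::2], x[..., 1::2]   # adjacent dims form the rotation pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out
```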

**Optimization Config**:

- Optimizer: AdamW
- Weight Decay: 0.1
- Beta1: 0.9
- Beta2: 0.95
- Warmup Steps: 829
- Total Steps: ~7000
- use_gradient_clip: True
- Initial Learning Rate: 0.00003
- Maximum Learning Rate: 0.0003
- Gradient Accumulation Steps: 16
- Batch Size: 16
- Global Batch Size: 256
- Scheduler: Linear warmup followed by cosine annealing (sketched below)
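
The scheduler can be sketched as a single function of the step index, using the numbers listed above (illustrative only, not the repository's training loop; the final learning rate is assumed to decay toward zero):

```python
import math


def lr_at_step(step, warmup_steps=829, total_steps=7000,
               initial_lr=3e-5, max_lr=3e-4, min_lr=0.0):
    """Linear warmup from initial_lr to max_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        # Linear increase during warmup.
        return initial_lr + (max_lr - initial_lr) * step / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In a training loop, the returned value would be written into each optimizer parameter group before calling `optimizer.step()`.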


## Training Details

### Training Data


The model was trained on the TinyStories dataset, a collection of short stories designed for training language models. 
This dataset provides simple narratives that help the model learn coherent story generation while maintaining a smaller size compared to larger language models.

### Training Procedure

**Stories-SLM 2** was trained using PyTorch on the TinyStories dataset. The training process involved:

1. Tokenizing the input text
2. Creating sliding windows of fixed block size (see the sketch after this list)
3. Training the model with cross-entropy loss
4. Applying learning rate scheduling with warmup and cosine decay
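
A minimal sketch of steps 1–2, assuming the GPT-2 BPE tokenizer (via `tiktoken`) and the 256-token context window listed above; the repository's actual data pipeline may differ:

```python
import tiktoken
import torch
from torch.utils.data import Dataset


class SlidingWindowDataset(Dataset):
    """Turn raw text into (input, target) windows for next-token prediction."""

    def __init__(self, text, block_size=256, stride=256):
        tokenizer = tiktoken.get_encoding("gpt2")          # GPT-2 BPE, 50,257-token vocab
        ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
        self.inputs, self.targets = [], []
        for start in range(0, len(ids) - block_size, stride):
            window = ids[start : start + block_size + 1]
            self.inputs.append(torch.tensor(window[:-1]))   # current tokens
            self.targets.append(torch.tensor(window[1:]))   # same window shifted by one

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]
```

Wrapping such a dataset in a `DataLoader` with batch size 16 and accumulating gradients over 16 micro-batches reproduces the global batch size of 256 quoted above.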

**Training Plots**

- Learning Rate Vs Steps:

![image](https://cdn-uploads.huggingface.co/production/uploads/684ef699c5b31f6acb9a698d/-ZK87GW_v4RxhlX4ln91L.png)

- Loss Vs Steps:

![image](https://cdn-uploads.huggingface.co/production/uploads/684ef699c5b31f6acb9a698d/pWIxGfPh_sBvD7UG3Tz-1.png)


## Inference 

During inference, Stories-SLM 2 uses several techniques to produce high-quality text:

- Temperature scaling to control randomness
- Top-k sampling to balance focus and diversity (see the sampling sketch after this list)
- Autoregressive generation, one token at a time
- A max-new-tokens limit to bound generation length
- An optional KV cache for efficient autoregressive decoding
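
As an illustration of the temperature and top-k steps, here is a minimal single-step sampling sketch (hypothetical names, not the repository's `Text_Generation` class):

```python
import torch


@torch.no_grad()
def sample_next_token(logits, temp=0.5, top_k=10):
    """Pick the next token id from last-position logits of shape (batch, vocab)."""
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[..., -1, None]      # k-th largest logit per row
        logits = logits.masked_fill(logits < kth, float("-inf"))   # drop everything below it
    if temp > 0:
        probs = torch.softmax(logits / temp, dim=-1)               # temperature rescales confidence
        return torch.multinomial(probs, num_samples=1)
    return logits.argmax(dim=-1, keepdim=True)                     # temp == 0 falls back to greedy
```

In an autoregressive loop this is called once per new token, feeding the generated tokens (or their cached key/value states when `kv_cache=True`) back into the model until `max_new_tokens` is reached.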


### Results

![image](https://cdn-uploads.huggingface.co/production/uploads/684ef699c5b31f6acb9a698d/cOe4bEdYgQrWNwDOQ_cuK.png)

![image](https://cdn-uploads.huggingface.co/production/uploads/684ef699c5b31f6acb9a698d/VPS8a6G14B86I7LZBBhUV.png)


## Environmental Impact


Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** Single Tesla-T4 16GB
- **Hours used:** [More Information Needed]
- **Cloud Provider:** Lightning-AI

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Support ❤️

If you find Stories-SLM useful, please consider starring the repository ⭐