---
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
  - causal-lm
  - decoder-only
  - pytorch
  - rope
  - rmsnorm
  - swiglu
  - custom-architecture
language:
  - en
model_type: qed
---

[Try it Right Now](https://qedlm.art)

![Frame 33](https://cdn-uploads.huggingface.co/production/uploads/695b8d7a2114f706bdcee465/Wu3QCW8XNwUXrYaANG7Ss.png)

![compute_vs_score_scatter](https://cdn-uploads.huggingface.co/production/uploads/695b8d7a2114f706bdcee465/wgr_RTC2YhZ2cESPcdR5Y.png)


# QED-75M

QED-75M is a compact **decoder-only causal language model** implemented for Hugging Face using a custom `transformers` module. The model architecture combines **RoPE** (rotary position embeddings), **RMSNorm**, **SwiGLU** feed-forward blocks, and causal self-attention implemented via `torch.nn.functional.scaled_dot_product_attention`. The token embedding weights can be tied with the output projection (`tie_word_embeddings`).

This model card focuses on the **model itself** (architecture, tensor interface, runtime constraints). Training data, training procedure, and export scripts are described in the repository `README.md`.

## Table of Contents

- [Model Details](#model-details)
- [Uses](#uses)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Training Details](#training-details)
- [Evaluation](#evaluation)
- [Technical Specifications](#technical-specifications)
  - [Model Architecture](#model-architecture)
  - [Attention and RoPE](#attention-and-rope)
  - [MLP (SwiGLU)](#mlp-swiglu)
  - [Embeddings and Output Head](#embeddings-and-output-head)
  - [Input/Output Interface](#inputoutput-interface)
  - [Attention Masking](#attention-masking)
  - [Length Constraints](#length-constraints)
  - [Default Hyperparameters](#default-hyperparameters)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Model Card Contact](#model-card-contact)

---

# Model Details

## Model Description

QED is a **next-token prediction** model (causal LM). Given a sequence of token ids, the model produces logits over the vocabulary for each position. When `labels` are provided, the model computes the training loss as cross-entropy over the next-token targets (with `ignore_index=-100`).
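
The loss described above can be sketched in plain PyTorch. This is a minimal illustration, not the code in `modeling_qed.py`: the helper name `next_token_loss` is hypothetical, and it assumes the shift between logits and labels happens inside the loss, as in most Hugging Face causal LMs.

```python
import torch
import torch.nn.functional as F


def next_token_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the logits at position t and the label at t + 1."""
    # Assumption: labels are aligned with input_ids and shifted here.
    shift_logits = logits[:, :-1, :]   # the last position has no next token
    shift_labels = labels[:, 1:]       # the first token is never a target
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,             # positions labelled -100 are skipped
    )


torch.manual_seed(0)
logits = torch.randn(2, 5, 11)          # [batch_size, seq_len, vocab_size]
labels = torch.randint(0, 11, (2, 5))   # [batch_size, seq_len]
labels[:, :2] = -100                    # e.g. do not score prompt tokens
loss = next_token_loss(logits, labels)  # scalar tensor
```

Setting prompt positions to `-100` is a common way to score only a continuation, which is what the "Scoring" use case below relies on.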

The Hugging Face integration provides:

- `QEDConfig` (`model_type: qed`)
- `QEDForCausalLM`

Both classes are defined in the repo module `modeling_qed.py` and are loaded with `trust_remote_code=True`.

## Model Sources

- Code: the repository containing `modeling_qed.py` and the exported model artifacts.
- Transformers implementation: `modeling_qed.py` (remote code in the model repo).
- Training artifacts (checkpoints, logs, and related outputs): [levossadtchi/QED-75M_artifacts](https://huggingface.co/levossadtchi/QED-75M_artifacts).

---

# Uses

## Direct Use

- Text generation using `model.generate(...)`; the repository also includes a ready-to-run local inference script: `generate_gravity_example.py`.
- Scoring / evaluating conditional likelihoods via `model(input_ids=..., labels=...)`.

## Downstream Use

- Fine-tuning or adapting the model (for example, SFT or LoRA) is technically possible, but quality and safety must be validated for the target domain.

## Out-of-Scope Use

- Using the model for high-stakes decisions (medical, legal, finance) without human verification.
- Assuming the model is always factually correct or always safe.
- Using the model to bypass safety systems or to generate disallowed content.

---

# Bias, Risks, and Limitations

Like other language models, QED may produce:

- **Hallucinations** (confident but incorrect statements).
- **Pattern repetition** from training data.
- **Uneven quality** across topics and languages, depending on what the specific checkpoint was trained on.

Mitigations:

- Use output filtering and constrain the generation strategy when deploying in real applications.
- Perform domain-specific evaluations before relying on the model.
- Treat the model as a suggestion engine, not a ground-truth source.

---

# Training Details

This model family was trained with a multi-stage pipeline (pretraining, context-length annealing, and SFT preparation).

High-level training data summary:

- Pretraining volume: **12.6B tokens**.
- Data is a mixed corpus pipeline configured in the repository and processed into tokenized shards before training.
- SFT stage uses chat/instruction-style datasets with assistant-targeted supervision.

All training artifacts are published separately at:

- [levossadtchi/QED-75M_artifacts](https://huggingface.co/levossadtchi/QED-75M_artifacts)

---

# Evaluation

We evaluated QED-75M and several baseline models with a custom evaluation pipeline based on the Hugging Face **LightEval** harness, as used in the SmolLM2 model evaluations. The pipeline reports a **"general"** average over a fixed suite of tasks:

- `MMLU` (aggregated over its subtasks, as in the LightEval leaderboard)
- `HellaSwag`
- `ARC-Challenge`
- `Winogrande`
- `CommonsenseQA`

The numbers below come from `all_results_summary.csv` produced by the evaluation run.

| Model | Average (general) | arc:challenge | commonsense_qa | hellaswag | winogrande | mmlu |
|---|---:|---:|---:|---:|---:|---:|
| `HuggingFaceTB/SmolLM2-135M` | 0.299140 | 0.283276 | 0.190827 | 0.252440 | 0.519337 | 0.249822 |
| `levossadtchi/QED-75M` | 0.287318 | 0.231229 | 0.204750 | 0.253336 | 0.506709 | 0.240564 |
| `EleutherAI/gpt-neo-125m` | 0.279464 | 0.191126 | 0.205569 | 0.249751 | 0.521705 | 0.229170 |
| `EleutherAI/pythia-160m-deduped` | 0.275796 | 0.202218 | 0.194922 | 0.250846 | 0.501184 | 0.229811 |
| `openai-community/gpt2` | 0.273993 | 0.188567 | 0.196560 | 0.250249 | 0.505919 | 0.228671 |
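
The "Average (general)" column is consistent with an unweighted mean of the five task scores. The check below uses only the numbers from the QED-75M row; the variable names are illustrative.

```python
from statistics import mean

# Per-task scores copied from the QED-75M row of the table above.
qed_scores = {
    "arc:challenge": 0.231229,
    "commonsense_qa": 0.204750,
    "hellaswag": 0.253336,
    "winogrande": 0.506709,
    "mmlu": 0.240564,
}

general_average = mean(list(qed_scores.values()))
print(f"{general_average:.6f}")  # matches the tabulated 0.287318
```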


---

# Technical Specifications

## Model Architecture

QEDForCausalLM is a decoder-only transformer with the following high-level structure:

- Token embeddings: `embed_tokens = Embedding(vocab_size, d_model)`
- `n_layers` identical blocks (`TransformerBlock`), each applying:
  - Residual attention: `x = x + Attention(RMSNorm(x))`
  - Residual MLP: `x = x + SwiGLU(RMSNorm(x))`
- Final normalization: `norm = RMSNorm(d_model)`
- Output head: `lm_head = Linear(d_model, vocab_size, bias=True)`

Attention applies RoPE to Q and K and uses causal masking.
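
The pre-norm residual pattern used by each block can be sketched as follows. This is an illustrative reimplementation under assumptions: the class names `RMSNorm` and `PreNormResidual` and the single learned scale vector are not taken from `modeling_qed.py`, and a plain linear layer stands in for the attention or MLP sublayer.

```python
import torch
from torch import nn


class RMSNorm(nn.Module):
    """Root-mean-square normalization with a learned per-channel scale."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Divide by the RMS over the last dimension; no mean subtraction.
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight


class PreNormResidual(nn.Module):
    """Computes x + sublayer(RMSNorm(x))."""

    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))


x = torch.randn(2, 5, 384) * 3.0
normed = RMSNorm(384)(x)  # each vector now has RMS close to 1
block = PreNormResidual(384, nn.Linear(384, 384, bias=False))  # stand-in sublayer
y = block(x)              # same shape as x
```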

## Attention and RoPE

- Projection layers (per attention block):
  - `q_proj`, `k_proj`, `v_proj`, `o_proj` are `Linear(d_model, d_model, bias=config.bias)`
- Number of heads: `n_heads`
- Head dimension: `head_dim = d_model / n_heads`
- RoPE:
  - Rotary embedding precomputes `cos_cached` and `sin_cached` up to `max_seq_len`
  - RoPE is applied to Q and K using `position_ids`
- Attention kernel:
  - Implemented with `torch.nn.functional.scaled_dot_product_attention`
  - Uses explicit scaling `scale = head_dim ** -0.5`
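
A minimal sketch of the cached rotary embedding described above. It assumes the "rotate half" convention common in Hugging Face implementations; the repository code may pair dimensions differently (for example, interleaved pairs), and the function names here are hypothetical.

```python
import torch


def build_rope_cache(head_dim: int, max_seq_len: int, theta: float = 10000.0):
    """Precompute cos/sin tables of shape [max_seq_len, head_dim]."""
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len).float()
    freqs = torch.outer(positions, inv_freq)   # [max_seq_len, head_dim / 2]
    angles = torch.cat((freqs, freqs), dim=-1)  # [max_seq_len, head_dim]
    return angles.cos(), angles.sin()


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rope(x, cos_cached, sin_cached, position_ids):
    """x: [batch, n_heads, seq_len, head_dim]; position_ids: [batch, seq_len]."""
    cos = cos_cached[position_ids].unsqueeze(1)  # broadcast over heads
    sin = sin_cached[position_ids].unsqueeze(1)
    return x * cos + rotate_half(x) * sin


# A short cache keeps the example cheap; the checkpoint caches up to max_seq_len.
cos_cached, sin_cached = build_rope_cache(head_dim=64, max_seq_len=128)
q = torch.randn(1, 6, 4, 64)                  # 6 heads of size 64, 4 tokens
position_ids = torch.arange(4).unsqueeze(0)   # [[0, 1, 2, 3]]
q_rot = apply_rope(q, cos_cached, sin_cached, position_ids)
```

The rotation changes only the direction of each query and key vector, not its norm, and position `0` is left unchanged.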

## MLP (SwiGLU)

The feed-forward sublayer is a SwiGLU variant:

- `gate_proj: Linear(d_model, ffn_hidden_dim)`
- `up_proj: Linear(d_model, ffn_hidden_dim)`
- `down_proj: Linear(ffn_hidden_dim, d_model)`
- Compute:
  - `SwiGLU(x) = down_proj( silu(gate_proj(x)) * up_proj(x) )`
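
The formula above maps directly onto a small PyTorch module. This sketch reuses the layer names from the list, takes the sizes from the default hyperparameters, and assumes bias-free projections; it is not copied from `modeling_qed.py`.

```python
import torch
import torch.nn.functional as F
from torch import nn


class SwiGLU(nn.Module):
    def __init__(self, d_model: int = 384, ffn_hidden_dim: int = 1024, bias: bool = False):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, ffn_hidden_dim, bias=bias)
        self.up_proj = nn.Linear(d_model, ffn_hidden_dim, bias=bias)
        self.down_proj = nn.Linear(ffn_hidden_dim, d_model, bias=bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The SiLU-activated gate scales the "up" branch elementwise.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


mlp = SwiGLU()
y = mlp(torch.randn(2, 5, 384))                        # [2, 5, 384]
n_params = sum(p.numel() for p in mlp.parameters())    # 3 * 384 * 1024
```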

## Embeddings and Output Head

- `embed_tokens`: size `[vocab_size, d_model]`
- `lm_head`: `Linear(d_model, vocab_size)` with **bias enabled**; its weight tensor has shape `[vocab_size, d_model]`, the same as `embed_tokens`
- Weight tying:
  - When `tie_word_embeddings=True`, `lm_head.weight` is tied to `embed_tokens.weight`
  - The `lm_head` bias remains a separate parameter.
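
A minimal illustration of weight tying in PyTorch, using toy sizes. It shows why tying is possible (both weight tensors have shape `[vocab_size, d_model]`) and that the bias stays separate; the actual model may perform the tying through the Transformers `tie_weights` mechanism rather than a direct assignment.

```python
import torch
from torch import nn

vocab_size, d_model = 1000, 16  # toy sizes; the checkpoint uses 49152 and 384

embed_tokens = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=True)

# Share one Parameter between the input embedding and the output projection.
lm_head.weight = embed_tokens.weight

hidden = torch.randn(2, 5, d_model)
logits = lm_head(hidden)  # [2, 5, vocab_size]
```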

## Input/Output Interface

Typical usage via Transformers:

- `input_ids`: `torch.LongTensor` of shape `[batch_size, seq_len]`
- Optional:
  - `position_ids`: `torch.LongTensor` of shape `[batch_size, seq_len]`
  - `attention_mask`: `torch.Tensor` of shape `[batch_size, seq_len]`
  - `labels`: `torch.LongTensor` of shape `[batch_size, seq_len]` (positions with `-100` are ignored)
  - `past_key_values`: list of length `n_layers` with cached keys/values
- Outputs:
  - `logits`: `[batch_size, seq_len, vocab_size]`
  - `loss`: scalar when `labels` are provided
  - `past_key_values`: cached KV tensors when `use_cache=True`

## Attention Masking

When `attention_mask` is provided, the model converts it to a key-padding boolean mask:

- `key_padding_mask = attention_mask[:, None, None, :].to(torch.bool)`

The final attention mask is the logical AND of two constraints:

- the causal constraint (a query position cannot attend to future keys)
- `key_padding_mask` (padded keys are masked out)

Practical recommendation:

- Use the standard HF convention: `attention_mask` values should be `1` for real tokens and `0` for padding tokens.
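
The mask construction can be sketched as below. The helper `build_attention_mask` is hypothetical, and it assumes that `attention_mask` covers cached plus new tokens (the usual convention during generation) and that `True` means "may attend", which is how boolean masks are interpreted by `scaled_dot_product_attention`.

```python
import torch


def build_attention_mask(attention_mask: torch.Tensor, past_length: int = 0) -> torch.Tensor:
    """Return a boolean mask of shape [batch, 1, q_len, k_len]."""
    total_len = attention_mask.shape[1]
    key_padding_mask = attention_mask[:, None, None, :].to(torch.bool)
    # Query i sits at absolute position past_length + i and may see keys up to it.
    q_pos = torch.arange(past_length, total_len)[:, None]
    k_pos = torch.arange(total_len)[None, :]
    causal = (k_pos <= q_pos)[None, None, :, :]
    return causal & key_padding_mask


attention_mask = torch.tensor([[1, 1, 1, 0]])  # one right-padded sequence
mask = build_attention_mask(attention_mask)
print(mask[0, 0].int())
```

With right padding, every query row keeps at least one visible key, which avoids fully masked rows in the attention kernel.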

## Length Constraints

The model enforces:

- `total_seq_len = past_length + seq_len <= config.max_seq_len`

If `total_seq_len` exceeds `max_seq_len`, the model raises a `ValueError`.

The exported config for this checkpoint sets `max_seq_len` to `8192`.
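
The constraint is easy to check before calling the model. The helpers below are hypothetical and only mirror the rule stated above; the token budget is a conservative estimate that assumes one new token per decoding step.

```python
def check_length(past_length: int, seq_len: int, max_seq_len: int = 8192) -> int:
    """Mirror the model's length check and return the total sequence length."""
    total_seq_len = past_length + seq_len
    if total_seq_len > max_seq_len:
        raise ValueError(
            f"total_seq_len={total_seq_len} exceeds max_seq_len={max_seq_len}"
        )
    return total_seq_len


def new_token_budget(prompt_len: int, max_seq_len: int = 8192) -> int:
    """Conservative upper bound for max_new_tokens given a prompt length."""
    return max(0, max_seq_len - prompt_len)


check_length(past_length=8000, seq_len=192)  # exactly at the limit, accepted
budget = new_token_budget(prompt_len=8000)   # tokens left for generation
```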

## Default Hyperparameters

The exported `config.json` for the QED-75M checkpoint sets:

| Hyperparameter | Value |
|---|---:|
| Approx. parameter count | ~75M |
| `n_layers` | 32 |
| `d_model` | 384 |
| `n_heads` | 6 |
| `head_dim` | 64 |
| `ffn_hidden_dim` | 1024 |
| `vocab_size` | 49152 |
| `max_seq_len` | 8192 |
| `rope_theta` | 10000.0 |
| `rms_norm_eps` | 1e-5 |
| `dropout` | 0.0 |
| `tie_word_embeddings` | true |
| internal linear `bias` (QKV/MLP) | false |
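
As a sanity check, the approximate parameter count follows from the values in the table. The arithmetic below assumes one weight vector per RMSNorm, no biases on the internal projections, a tied `lm_head` weight with a separate bias, and RoPE caches stored as buffers rather than parameters.

```python
n_layers, d_model, n_heads = 32, 384, 6
ffn_hidden_dim, vocab_size = 1024, 49152

head_dim = d_model // n_heads        # 64, as listed in the table

embedding = vocab_size * d_model     # shared with lm_head.weight when tied
attention = 4 * d_model * d_model    # q_proj, k_proj, v_proj, o_proj
mlp = 3 * d_model * ffn_hidden_dim   # gate_proj, up_proj, down_proj
norms = 2 * d_model                  # two RMSNorm weight vectors per block
per_block = attention + mlp + norms

final_norm = d_model
lm_head_bias = vocab_size
total = embedding + n_layers * per_block + final_norm + lm_head_bias

print(f"{total:,}")  # 75,571,584, i.e. roughly 75M
```

Under these assumptions the token embedding accounts for about a quarter of all parameters, which is why tying it to the output projection matters at this scale.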

Tokenizer / special tokens (from exported `tokenizer_config.json`):

- `<pad>` id `0`
- `<bos>` id `1`
- `<eos>` id `2`
- `<unk>` id `3`

---

# How to Get Started with the Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "levossadtchi/QED-75M"  # Hub repo id as listed in the Evaluation table; change it if you use a fork

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # optional
)

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=50, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

For loss computation:

- pass `labels` with the same shape as `input_ids`
- use `-100` in positions you want to ignore.

---

# Model Card Contact

For questions or updates about this model card, use the Issues/Discussions in the code repository or contact the model owner on Hugging Face.