---
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
  - causal-lm
  - decoder-only
  - pytorch
  - rope
  - rmsnorm
  - swiglu
  - custom-architecture
language:
  - en
model_type: qed
---

[Try it Right Now](https://qedlm.art)

![Frame 33](https://cdn-uploads.huggingface.co/production/uploads/695b8d7a2114f706bdcee465/Wu3QCW8XNwUXrYaANG7Ss.png)

![compute_vs_score_scatter](https://cdn-uploads.huggingface.co/production/uploads/695b8d7a2114f706bdcee465/wgr_RTC2YhZ2cESPcdR5Y.png)


# QED-75M

QED-75M is a compact **decoder-only causal language model** for Hugging Face `transformers`, shipped as a custom remote-code module. The architecture combines **RoPE** (rotary position embeddings), **RMSNorm**, **SwiGLU** feed-forward blocks, and causal self-attention computed with `torch.nn.functional.scaled_dot_product_attention`. The token embedding weights can be tied to the output projection (`tie_word_embeddings`).

This model card focuses on the **model itself** (architecture, tensor interface, runtime constraints). Training data, training procedure, and export scripts are described in the repository `README.md`.

## Table of Contents

- [Model Details](#model-details)
- [Uses](#uses)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Training Details](#training-details)
- [Evaluation](#evaluation)
- [Technical Specifications](#technical-specifications)
  - [Model Architecture](#model-architecture)
  - [Attention and RoPE](#attention-and-rope)
  - [MLP (SwiGLU)](#mlp-swiglu)
  - [Embeddings and Output Head](#embeddings-and-output-head)
  - [Input/Output Interface](#inputoutput-interface)
  - [Attention Masking](#attention-masking)
  - [Length Constraints](#length-constraints)
  - [Default Hyperparameters](#default-hyperparameters)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Model Card Contact](#model-card-contact)

---

# Model Details

## Model Description

QED is a **next-token prediction** model (causal LM). Given a sequence of token ids, the model produces logits over the vocabulary for each position. When `labels` are provided, the model computes the training loss as cross-entropy over the next-token targets (with `ignore_index=-100`).
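
The loss described above can be sketched in plain Python. This is a minimal illustration, assuming the usual Hugging Face convention that the logits at position `t` are scored against the label at position `t + 1`; whether the shift happens inside the model is an assumption here, and `modeling_qed.py` remains the authority. The helper name `next_token_loss` is hypothetical.

```python
import math

IGNORE_INDEX = -100  # label value that is excluded from the loss


def next_token_loss(logits, labels):
    """Mean cross-entropy of logits[t] against labels[t + 1].

    logits: list of per-position score lists, shape [seq_len][vocab_size]
    labels: list of token ids, shape [seq_len]; IGNORE_INDEX entries are skipped
    """
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]
        if target == IGNORE_INDEX:
            continue
        row = logits[t]
        # Numerically stable log-sum-exp over the vocabulary.
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_z - row[target]
        count += 1
    # With no supervised positions there is nothing to average.
    return total / count if count else float("nan")
```

With uniform logits the result equals `log(vocab_size)`, which is a convenient sanity check for an untrained model.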

The Hugging Face integration provides:

- `QEDConfig` (`model_type: qed`)
- `QEDForCausalLM`

Both classes are defined in the repo module `modeling_qed.py` and are loaded with `trust_remote_code=True`.

## Model Sources

- Code: the repository containing `modeling_qed.py` and the exported model artifacts.
- Transformers implementation: `modeling_qed.py` (remote code in the model repo).
- Training artifacts (checkpoints, logs, and related outputs): [levossadtchi/QED-75M_artifacts](https://huggingface.co/levossadtchi/QED-75M_artifacts).

---

# Uses

## Direct Use

- Text generation using `model.generate(...)`; the repository also includes a ready-to-run local inference script: `generate_gravity_example.py`.
- Scoring / evaluating conditional likelihoods via `model(input_ids=..., labels=...)`.

## Downstream Use

- Fine-tuning or adapting the model (for example, SFT or LoRA) is technically possible, but quality and safety must be validated for the target domain.

## Out-of-Scope Use

- Using the model for high-stakes decisions (medical, legal, finance) without human verification.
- Assuming the model is always factually correct or always safe.
- Using the model to bypass safety systems or to generate disallowed content.

---

# Bias, Risks, and Limitations

Like other language models, QED may produce:

- **Hallucinations** (confident but incorrect statements).
- **Pattern repetition** from training data.
- **Uneven quality** across topics and languages, depending on what the specific checkpoint was trained on.

Mitigations:

- Use output filtering and constrain the generation strategy when deploying in real applications.
- Perform domain-specific evaluations before relying on the model.
- Treat the model as a suggestion engine, not a ground-truth source.

---

# Training Details

This model family was trained with a multi-stage pipeline (pretraining, context-length annealing, and SFT preparation).

High-level training data summary:

- Pretraining volume: **12.6B tokens**.
- The data is a mixed corpus, configured in the repository and processed into tokenized shards before training.
- The SFT stage uses chat/instruction-style datasets with assistant-targeted supervision.

All training artifacts are published separately at:

- [levossadtchi/QED-75M_artifacts](https://huggingface.co/levossadtchi/QED-75M_artifacts)

---

# Evaluation

We evaluated the following models with a custom evaluation pipeline based on the Hugging Face **LightEval** harness used in the SmolLM2 model evaluations. The evaluation reports a **"general"** average over a fixed suite of tasks:

- `MMLU` (aggregated over the MMLU subtasks in the LightEval leaderboard)
- `HellaSwag`
- `ARC-Challenge`
- `Winogrande`
- `CommonsenseQA`

The numbers below come from `all_results_summary.csv` produced by the evaluation run.

| Model | Average (general) | arc:challenge | commonsense_qa | hellaswag | winogrande | mmlu |
|---|---:|---:|---:|---:|---:|---:|
| `HuggingFaceTB/SmolLM2-135M` | 0.299140 | 0.283276 | 0.190827 | 0.252440 | 0.519337 | 0.249822 |
| `levossadtchi/QED-75M` | 0.287318 | 0.231229 | 0.204750 | 0.253336 | 0.506709 | 0.240564 |
| `EleutherAI/gpt-neo-125m` | 0.279464 | 0.191126 | 0.205569 | 0.249751 | 0.521705 | 0.229170 |
| `EleutherAI/pythia-160m-deduped` | 0.275796 | 0.202218 | 0.194922 | 0.250846 | 0.501184 | 0.229811 |
| `openai-community/gpt2` | 0.273993 | 0.188567 | 0.196560 | 0.250249 | 0.505919 | 0.228671 |
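
The reported averages are consistent with an unweighted mean of the five task columns. The check below reproduces them from the table values; how `all_results_summary.csv` actually computes the column is not stated here, so the formula is an inference, and the names `RESULTS` and `general_average` are illustrative.

```python
# Per-task scores copied from the table, in the order:
# arc:challenge, commonsense_qa, hellaswag, winogrande, mmlu.
# The second element of each tuple is the reported "Average (general)".
RESULTS = {
    "HuggingFaceTB/SmolLM2-135M": ([0.283276, 0.190827, 0.252440, 0.519337, 0.249822], 0.299140),
    "levossadtchi/QED-75M": ([0.231229, 0.204750, 0.253336, 0.506709, 0.240564], 0.287318),
    "EleutherAI/gpt-neo-125m": ([0.191126, 0.205569, 0.249751, 0.521705, 0.229170], 0.279464),
    "EleutherAI/pythia-160m-deduped": ([0.202218, 0.194922, 0.250846, 0.501184, 0.229811], 0.275796),
    "openai-community/gpt2": ([0.188567, 0.196560, 0.250249, 0.505919, 0.228671], 0.273993),
}


def general_average(scores):
    """Unweighted mean over the task scores."""
    return sum(scores) / len(scores)


for name, (scores, reported) in RESULTS.items():
    computed = general_average(scores)
    print(f"{name}: computed={computed:.6f} reported={reported:.6f}")
```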




---

# Technical Specifications

## Model Architecture

`QEDForCausalLM` is a decoder-only transformer with the following high-level structure:

- Token embeddings: `embed_tokens = Embedding(vocab_size, d_model)`
- `n_layers` identical blocks (`TransformerBlock`), each applying:
  - Residual attention: `x = x + Attention(RMSNorm(x))`
  - Residual MLP: `x = x + SwiGLU(RMSNorm(x))`
- Final normalization: `norm = RMSNorm(d_model)`
- Output head: `lm_head = Linear(d_model, vocab_size, bias=True)`

Attention applies RoPE to Q and K and uses causal masking.
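
The pre-norm residual structure of a block can be sketched for a single token vector as follows. This is a rough illustration in plain Python, assuming the standard RMSNorm definition (divide by the root mean square, then scale by a learned weight); `attention` and `mlp` are placeholder callables and do not reflect the actual code in `modeling_qed.py`.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """Standard RMSNorm: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def transformer_block(x, attention, mlp, attn_norm_weight, mlp_norm_weight):
    """Apply x = x + Attention(RMSNorm(x)), then x = x + MLP(RMSNorm(x))."""
    attn_out = attention(rms_norm(x, attn_norm_weight))
    x = [a + b for a, b in zip(x, attn_out)]
    mlp_out = mlp(rms_norm(x, mlp_norm_weight))
    return [a + b for a, b in zip(x, mlp_out)]
```

Because each sublayer only adds to the residual stream, a sublayer that outputs zeros leaves the input unchanged.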

## Attention and RoPE

- Projection layers (per attention block):
  - `q_proj`, `k_proj`, `v_proj`, `o_proj` are `Linear(d_model, d_model, bias=config.bias)`
- Number of heads: `n_heads`
- Head dimension: `head_dim = d_model // n_heads` (so `d_model` must be divisible by `n_heads`)
- RoPE:
  - Rotary embedding precomputes `cos_cached` and `sin_cached` up to `max_seq_len`
  - RoPE is applied to Q and K using `position_ids`
- Attention kernel:
  - Implemented with `torch.nn.functional.scaled_dot_product_attention`
  - Uses explicit scaling `scale = head_dim ** -0.5`
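
A minimal sketch of the RoPE cache and rotation, in plain Python for one head vector. It assumes the common "half-split" pairing (dimension `i` is rotated together with dimension `i + head_dim / 2`); the actual pairing used in `modeling_qed.py` may differ, and the function names here are illustrative.

```python
import math


def build_rope_cache(max_seq_len, head_dim, theta=10000.0):
    """Precompute cos/sin tables of shape [max_seq_len][head_dim // 2]."""
    half = head_dim // 2
    inv_freq = [theta ** (-2.0 * i / head_dim) for i in range(half)]
    cos = [[math.cos(pos * f) for f in inv_freq] for pos in range(max_seq_len)]
    sin = [[math.sin(pos * f) for f in inv_freq] for pos in range(max_seq_len)]
    return cos, sin


def apply_rope(vec, pos, cos, sin):
    """Rotate each (i, i + half) pair of vec by the angle for position pos."""
    half = len(vec) // 2
    out = [0.0] * len(vec)
    for i in range(half):
        a, b = vec[i], vec[i + half]
        c, s = cos[pos][i], sin[pos][i]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


def attention_scale(head_dim):
    """Explicit softmax scaling factor applied to Q @ K^T."""
    return head_dim ** -0.5
```

Since every pair is rotated by an angle proportional to its position, the dot product between a rotated query and a rotated key depends only on their relative offset, which is the property RoPE is designed to provide.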

## MLP (SwiGLU)

The feed-forward sublayer is a SwiGLU variant:

- `gate_proj: Linear(d_model, ffn_hidden_dim)`
- `up_proj: Linear(d_model, ffn_hidden_dim)`
- `down_proj: Linear(ffn_hidden_dim, d_model)`
- Compute:
  - `SwiGLU(x) = down_proj( silu(gate_proj(x)) * up_proj(x) )`
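
The SwiGLU formula above can be written out for a single vector in plain Python. This is only a sketch of the arithmetic: the weight matrices are passed as nested lists in `[out_features][in_features]` layout, biases are omitted (the exported config sets the internal linear `bias` to false), and the helper names are hypothetical.

```python
import math


def silu(v):
    """SiLU / swish activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(weight, x):
    """Multiply a [out][in] matrix by a vector of length in."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in weight]


def swiglu(x, w_gate, w_up, w_down):
    """down_proj(silu(gate_proj(x)) * up_proj(x)) for one token vector."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise gating
    return matvec(w_down, hidden)
```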

## Embeddings and Output Head

- `embed_tokens`: size `[vocab_size, d_model]`
- `lm_head`: `Linear(d_model, vocab_size)` with **bias enabled**; its weight tensor has shape `[vocab_size, d_model]`, the same shape as `embed_tokens`
- Weight tying:
  - When `tie_word_embeddings=True`, `lm_head.weight` is tied to `embed_tokens.weight`
  - The `lm_head` bias remains a separate parameter.

## Input/Output Interface

Typical usage via Transformers:

- `input_ids`: `torch.LongTensor` of shape `[batch_size, seq_len]`
- Optional:
  - `position_ids`: `torch.LongTensor` of shape `[batch_size, seq_len]`
  - `attention_mask`: `torch.Tensor` of shape `[batch_size, seq_len]`
  - `labels`: `torch.LongTensor` of shape `[batch_size, seq_len]` (positions with `-100` are ignored)
  - `past_key_values`: list of length `n_layers` with cached keys/values
- Outputs:
  - `logits`: `[batch_size, seq_len, vocab_size]`
  - `loss`: scalar when `labels` are provided
  - `past_key_values`: cached KV tensors when `use_cache=True`

## Attention Masking

When `attention_mask` is provided, the model converts it to a key-padding boolean mask:

- `key_padding_mask = attention_mask[:, None, None, :].to(torch.bool)`

It then combines two constraints with a logical AND:

- the causal constraint (positions cannot attend to future keys)
- the `key_padding_mask` (padded keys are masked out)

Practical recommendation:

- Use the standard HF convention: `attention_mask` values should be `1` for real tokens and `0` for padding tokens.
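
The combined mask can be illustrated for a single sequence without a KV cache. This plain-Python sketch follows the boolean convention of `scaled_dot_product_attention`, where `True` means the key takes part in attention; the function name `combined_mask` is hypothetical and the cached-generation case is not covered.

```python
def combined_mask(attention_mask):
    """Return allowed[q][k]: True if query position q may attend to key k.

    attention_mask: list of 0/1 values, 1 for real tokens and 0 for padding.
    """
    n = len(attention_mask)
    return [
        [(k <= q) and attention_mask[k] == 1 for k in range(n)]
        for q in range(n)
    ]


# Left-padded example: the first position is padding.
for row in combined_mask([0, 1, 1]):
    print(row)
```

Note that a padded query position at the start of a left-padded sequence ends up with no allowed keys; its output is normally discarded, since only real-token positions are used downstream.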

## Length Constraints

The model enforces:

- `total_seq_len = past_length + seq_len <= config.max_seq_len`

If `total_seq_len` exceeds `max_seq_len`, the model raises a `ValueError`.

Default `max_seq_len` in the exported config for this checkpoint is `8192`.
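
When driving generation manually, it can help to apply the same check before calling the model. The helper below is a hypothetical client-side guard that mirrors the rule above; it is not part of `modeling_qed.py`, and the error message is illustrative.

```python
def check_length(past_length, seq_len, max_seq_len=8192):
    """Return the total length, or raise ValueError if it exceeds max_seq_len."""
    total_seq_len = past_length + seq_len
    if total_seq_len > max_seq_len:
        raise ValueError(
            f"total_seq_len={total_seq_len} exceeds max_seq_len={max_seq_len}"
        )
    return total_seq_len
```

In practice this means the prompt length plus `max_new_tokens` should stay at or below `max_seq_len`.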

## Default Hyperparameters

The exported `config.json` for the QED-75M checkpoint sets:

| Hyperparameter | Value |
|---|---:|
| Approx. parameter count | ~75M |
| `n_layers` | 32 |
| `d_model` | 384 |
| `n_heads` | 6 |
| `head_dim` | 64 |
| `ffn_hidden_dim` | 1024 |
| `vocab_size` | 49152 |
| `max_seq_len` | 8192 |
| `rope_theta` | 10000.0 |
| `rms_norm_eps` | 1e-5 |
| `dropout` | 0.0 |
| `tie_word_embeddings` | true |
| internal linear `bias` (QKV/MLP) | false |
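
The approximate parameter count can be reproduced from the values in the table. The sketch below assumes that each RMSNorm holds a single weight vector of size `d_model`, that each block has two of them, and that RoPE caches are buffers rather than parameters; these are assumptions about the implementation, and `count_parameters` is an illustrative helper.

```python
def count_parameters(
    n_layers=32,
    d_model=384,
    ffn_hidden_dim=1024,
    vocab_size=49152,
    tie_word_embeddings=True,
):
    """Estimate trainable parameters from the exported hyperparameters."""
    embed = vocab_size * d_model
    attn = 4 * d_model * d_model          # q_proj, k_proj, v_proj, o_proj (no bias)
    mlp = 3 * d_model * ffn_hidden_dim    # gate_proj, up_proj, down_proj (no bias)
    norms = 2 * d_model                   # assumed: two RMSNorm weights per block
    per_layer = attn + mlp + norms
    final_norm = d_model
    lm_head_bias = vocab_size             # bias stays separate even when tied
    lm_head_weight = 0 if tie_word_embeddings else vocab_size * d_model
    return embed + n_layers * per_layer + final_norm + lm_head_bias + lm_head_weight


print(f"{count_parameters():,}")  # 75,571,584 under the assumptions above
```

Under these assumptions the embedding table accounts for roughly a quarter of the total, which is why weight tying matters at this scale.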

Tokenizer / special tokens (from exported `tokenizer_config.json`):

- `<pad>` id `0`
- `<bos>` id `1`
- `<eos>` id `2`
- `<unk>` id `3`

---

# How to Get Started with the Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "levossadtchi/QED-75M"  # Hub id as listed in the evaluation table; change it for a fork or local path

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # optional
)

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=50, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

For loss computation:

- pass `labels` with the same shape as `input_ids`
- use `-100` in positions you want to ignore.
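
One way to build such labels is to copy `input_ids` and overwrite the positions that should not contribute to the loss. The sketch below works on plain Python lists for a single sequence; the `prompt_len` argument (masking a leading prompt, as is common in SFT) and the function name `build_labels` are illustrative assumptions, not part of the model API.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def build_labels(input_ids, attention_mask, prompt_len=0):
    """Copy input_ids, replacing padding and prompt positions with IGNORE_INDEX."""
    labels = []
    for i, (token_id, keep) in enumerate(zip(input_ids, attention_mask)):
        is_padding = keep == 0
        is_prompt = i < prompt_len
        labels.append(IGNORE_INDEX if (is_padding or is_prompt) else token_id)
    return labels


# Example: one prompt token, two supervised tokens, one right-padding token.
print(build_labels([5, 6, 7, 0], [1, 1, 1, 0], prompt_len=1))
```

The resulting list can be wrapped in a `torch.LongTensor` with the same shape as `input_ids` before being passed as `labels`.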

---

# Model Card Contact

For questions or updates about this model card, use the Issues/Discussions in the code repository or contact the model owner on Hugging Face.