---
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
- causal-lm
- decoder-only
- pytorch
- rope
- rmsnorm
- swiglu
- custom-architecture
language:
- en
model_type: qed
---

[Try it Right Now](https://qedlm.art)

![Frame 33](https://cdn-uploads.huggingface.co/production/uploads/695b8d7a2114f706bdcee465/Wu3QCW8XNwUXrYaANG7Ss.png)

![compute_vs_score_scatter](https://cdn-uploads.huggingface.co/production/uploads/695b8d7a2114f706bdcee465/wgr_RTC2YhZ2cESPcdR5Y.png)

# QED-75M

QED-75M is a compact **decoder-only causal language model** implemented for Hugging Face using a custom `transformers` module.

The model architecture combines **RoPE** (rotary position embeddings), **RMSNorm**, **SwiGLU** feed-forward blocks, and causal self-attention implemented via `torch.nn.functional.scaled_dot_product_attention`. The token embedding weights can be tied with the output projection (`tie_word_embeddings`).

This model card focuses on the **model itself** (architecture, tensor interface, runtime constraints). Training data, training procedure, and export scripts are described in the repository `README.md`.
## Table of Contents

- [Model Details](#model-details)
- [Uses](#uses)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Training Details](#training-details)
- [Evaluation](#evaluation)
- [Technical Specifications](#technical-specifications)
  - [Model Architecture](#model-architecture)
  - [Attention and RoPE](#attention-and-rope)
  - [MLP (SwiGLU)](#mlp-swiglu)
  - [Embeddings and Output Head](#embeddings-and-output-head)
  - [Input/Output Interface](#inputoutput-interface)
  - [KV Cache and Generation Semantics](#kv-cache-and-generation-semantics)
  - [Attention Masking](#attention-masking)
  - [Length Constraints](#length-constraints)
  - [Default Hyperparameters](#default-hyperparameters)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Citation](#citation)
- [Model Card Contact](#model-card-contact)

---

# Model Details

## Model Description

QED is a **next-token prediction** model (causal LM). Given a sequence of token ids, the model produces logits over the vocabulary for each position. When `labels` are provided, the model computes the training loss as cross-entropy over the next-token targets (with `ignore_index=-100`).

The Hugging Face integration provides:

- `QEDConfig` (`model_type: qed`)
- `QEDForCausalLM`

Both classes are defined in the repo module `modeling_qed.py` and are loaded with `trust_remote_code=True`.

## Model Sources

- Code: the repository containing `modeling_qed.py` and the exported model artifacts.
- Transformers implementation: `modeling_qed.py` (remote code in the model repo).
- Training artifacts (checkpoints, logs, and related outputs): [levossadtchi/QED-75M_artifacts](https://huggingface.co/levossadtchi/QED-75M_artifacts).

---

# Uses

## Direct Use

- Text generation using `model.generate(...)`; the repository also includes a ready-to-run local inference script: `generate_gravity_example.py`.
- Scoring / evaluating conditional likelihoods via `model(input_ids=..., labels=...)`.
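
The scoring path in the last bullet can be illustrated without downloading the checkpoint. The following is a minimal sketch of next-token cross-entropy with `ignore_index=-100`, assuming the usual Hugging Face convention that the logits at position `t` are scored against the label at position `t + 1`. The helper name `next_token_loss` and the random tensors are illustrative and are not taken from `modeling_qed.py`.

```python
import torch
import torch.nn.functional as F


def next_token_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: mean cross-entropy over next-token targets."""
    # Assumption: position t predicts token t + 1, so drop the last logit
    # and the first label before comparing them.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # positions labelled -100 do not contribute
    )


torch.manual_seed(0)
batch_size, seq_len, vocab_size = 2, 5, 11  # toy sizes, not the checkpoint's
logits = torch.randn(batch_size, seq_len, vocab_size)
labels = torch.randint(0, vocab_size, (batch_size, seq_len))

# Mask the first three positions, e.g. to score only a continuation.
masked_labels = labels.clone()
masked_labels[:, :3] = -100

full_loss = next_token_loss(logits, labels)
masked_loss = next_token_loss(logits, masked_labels)
print(full_loss.item(), masked_loss.item())
```

With the masked labels, only the last two targets of each row contribute to the mean, which is the same mechanism used to restrict supervision to selected tokens.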
## Downstream Use

- Fine-tuning or adapting the model (for example, SFT or LoRA) is technically possible, but quality and safety must be validated for the target domain.

## Out-of-Scope Use

- Using the model for high-stakes decisions (medical, legal, finance) without human verification.
- Assuming the model is always factually correct or always safe.
- Using the model to bypass safety systems or to generate disallowed content.

---

# Bias, Risks, and Limitations

Like other language models, QED may produce:

- **Hallucinations** (confident but incorrect statements).
- **Pattern repetition** from training data.
- **Uneven quality** across topics and languages, depending on what the specific checkpoint was trained on.

Mitigations:

- Use output filtering and constrain the generation strategy when deploying in real applications.
- Perform domain-specific evaluations before relying on the model.
- Treat the model as a suggestion engine, not a ground-truth source.

---

# Training Details

This model family was trained with a multi-stage pipeline (pretraining, context-length annealing, and SFT preparation).

High-level training data summary:

- Pretraining volume: **12.6B tokens**.
- Data is a mixed corpus pipeline configured in the repository and processed into tokenized shards before training.
- SFT stage uses chat/instruction-style datasets with assistant-targeted supervision.

All training artifacts are published separately at:

- [levossadtchi/QED-75M_artifacts](https://huggingface.co/levossadtchi/QED-75M_artifacts)

---

# Evaluation

We evaluated the following models with a custom evaluation pipeline based on the Hugging Face **LightEval** harness used in the SmolLM2 model evaluations.
The evaluation reports a **"general"** average over a fixed suite of tasks:

- `MMLU` (aggregated over its subtasks in the LightEval leaderboard)
- `HellaSwag`
- `ARC-Challenge`
- `Winogrande`
- `CommonsenseQA`

The numbers below come from `all_results_summary.csv` produced by the evaluation run.

| Model | Average (general) | arc:challenge | commonsense_qa | hellaswag | winogrande | mmlu |
|---|---:|---:|---:|---:|---:|---:|
| `HuggingFaceTB/SmolLM2-135M` | 0.299140 | 0.283276 | 0.190827 | 0.252440 | 0.519337 | 0.249822 |
| `levossadtchi/QED-75M` | 0.287318 | 0.231229 | 0.204750 | 0.253336 | 0.506709 | 0.240564 |
| `EleutherAI/gpt-neo-125m` | 0.279464 | 0.191126 | 0.205569 | 0.249751 | 0.521705 | 0.229170 |
| `EleutherAI/pythia-160m-deduped` | 0.275796 | 0.202218 | 0.194922 | 0.250846 | 0.501184 | 0.229811 |
| `openai-community/gpt2` | 0.273993 | 0.188567 | 0.196560 | 0.250249 | 0.505919 | 0.228671 |

![compute_vs_score_scatter](https://cdn-uploads.huggingface.co/production/uploads/695b8d7a2114f706bdcee465/wgr_RTC2YhZ2cESPcdR5Y.png)

---

# Technical Specifications

## Model Architecture

`QEDForCausalLM` is a decoder-only transformer with the following high-level structure:

- Token embeddings: `embed_tokens = Embedding(vocab_size, d_model)`
- `n_layers` identical blocks (`TransformerBlock`), each applying:
  - Residual attention: `x = x + Attention(RMSNorm(x))`
  - Residual MLP: `x = x + SwiGLU(RMSNorm(x))`
- Final normalization: `norm = RMSNorm(d_model)`
- Output head: `lm_head = Linear(d_model, vocab_size, bias=True)`

The attention applies RoPE to Q and K and uses causal masking.
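
The pre-norm residual structure described above can be sketched in plain PyTorch. This is an illustrative reimplementation under stated assumptions, not the code in `modeling_qed.py`: the class names `RMSNormSketch`, `SwiGLUSketch`, `CausalSelfAttentionSketch`, and `TransformerBlockSketch` are hypothetical, RoPE and the KV cache are omitted for brevity, and the RMSNorm formula is the commonly used one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNormSketch(nn.Module):
    """Assumed RMSNorm form: x / sqrt(mean(x^2) + eps) * weight."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight


class SwiGLUSketch(nn.Module):
    def __init__(self, d_model: int, ffn_hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, ffn_hidden_dim, bias=False)
        self.up_proj = nn.Linear(d_model, ffn_hidden_dim, bias=False)
        self.down_proj = nn.Linear(ffn_hidden_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))


class CausalSelfAttentionSketch(nn.Module):
    """Causal attention without RoPE or caching (both omitted in this sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def _split_heads(self, h: torch.Tensor) -> torch.Tensor:
        b, t, _ = h.shape
        return h.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q = self._split_heads(self.q_proj(x))
        k = self._split_heads(self.k_proj(x))
        v = self._split_heads(self.v_proj(x))
        # The default scale of this kernel is head_dim ** -0.5.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, d))


class TransformerBlockSketch(nn.Module):
    def __init__(self, d_model: int, n_heads: int, ffn_hidden_dim: int):
        super().__init__()
        self.attn_norm = RMSNormSketch(d_model)
        self.attn = CausalSelfAttentionSketch(d_model, n_heads)
        self.mlp_norm = RMSNormSketch(d_model)
        self.mlp = SwiGLUSketch(d_model, ffn_hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.attn_norm(x))  # residual attention
        x = x + self.mlp(self.mlp_norm(x))    # residual MLP
        return x


block = TransformerBlockSketch(d_model=384, n_heads=6, ffn_hidden_dim=1024)
hidden = torch.randn(2, 8, 384)  # [batch_size, seq_len, d_model]
print(block(hidden).shape)
```

Because each sublayer normalizes its input and adds its output back to `x`, the block preserves the `[batch_size, seq_len, d_model]` shape, which is what allows `n_layers` identical blocks to be stacked.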
## Attention and RoPE

- Projection layers (per attention block):
  - `q_proj`, `k_proj`, `v_proj`, `o_proj` are `Linear(d_model, d_model, bias=config.bias)`
- Number of heads: `n_heads`
- Head dimension: `head_dim = d_model / n_heads`
- RoPE:
  - Rotary embedding precomputes `cos_cached` and `sin_cached` up to `max_seq_len`
  - RoPE is applied to Q and K using `position_ids`
- Attention kernel:
  - Implemented with `torch.nn.functional.scaled_dot_product_attention`
  - Uses explicit scaling `scale = head_dim ** -0.5`

## MLP (SwiGLU)

The feed-forward sublayer is a SwiGLU variant:

- `gate_proj: Linear(d_model, ffn_hidden_dim)`
- `up_proj: Linear(d_model, ffn_hidden_dim)`
- `down_proj: Linear(ffn_hidden_dim, d_model)`
- Compute:
  - `SwiGLU(x) = down_proj( silu(gate_proj(x)) * up_proj(x) )`

## Embeddings and Output Head

- `embed_tokens`: weight of shape `[vocab_size, d_model]`
- `lm_head`: `Linear(d_model, vocab_size)` with **bias enabled**; its weight is stored with shape `[vocab_size, d_model]`, matching `embed_tokens`
- Weight tying:
  - When `tie_word_embeddings=True`, `lm_head.weight` is tied to `embed_tokens.weight`
  - The `lm_head` bias remains a separate parameter.
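
The tying behaviour can be demonstrated with two standalone PyTorch modules. This is a minimal sketch under the assumption that tying is done by sharing one `Parameter` object between the two modules; the actual mechanism in `modeling_qed.py` may differ, and the sizes below are toy values, not the checkpoint's.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 16  # toy sizes for illustration only

embed_tokens = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=True)

# Tie the weights: both modules now reference one Parameter of shape
# [vocab_size, d_model]. The bias of lm_head is untouched.
lm_head.weight = embed_tokens.weight

token_ids = torch.tensor([[1, 2, 3]])
hidden = embed_tokens(token_ids)  # [1, 3, d_model]
logits = lm_head(hidden)          # [1, 3, vocab_size]

# Count each distinct Parameter object once, so the shared matrix is not
# counted twice.
all_params = list(embed_tokens.parameters()) + list(lm_head.parameters())
unique_params = {id(p): p for p in all_params}
n_params = sum(p.numel() for p in unique_params.values())

print(lm_head.weight is embed_tokens.weight, tuple(logits.shape), n_params)
```

In this toy setup the count is `vocab_size * d_model + vocab_size`: the shared matrix plus the separate output bias.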
## Input/Output Interface

Typical usage via Transformers:

- `input_ids`: `torch.LongTensor` of shape `[batch_size, seq_len]`
- Optional:
  - `position_ids`: `torch.LongTensor` of shape `[batch_size, seq_len]`
  - `attention_mask`: `torch.Tensor` of shape `[batch_size, seq_len]`
  - `labels`: `torch.LongTensor` of shape `[batch_size, seq_len]` (positions with `-100` are ignored)
  - `past_key_values`: list of length `n_layers` with cached keys/values
- Outputs:
  - `logits`: `[batch_size, seq_len, vocab_size]`
  - `loss`: scalar when `labels` are provided
  - `past_key_values`: cached KV tensors when `use_cache=True`

## Attention Masking

When `attention_mask` is provided, the model converts it to a key-padding boolean mask:

- `key_padding_mask = attention_mask[:, None, None, :].to(torch.bool)`

Then it builds:

- causal constraint (positions cannot attend to future keys)
- AND with `key_padding_mask` (mask out padded keys)

Practical recommendation:

- Use the standard HF convention: `attention_mask` values should be `1` for real tokens and `0` for padding tokens.

## Length Constraints

The model enforces:

- `total_seq_len = past_length + seq_len <= config.max_seq_len`

If `total_seq_len` exceeds `max_seq_len`, the model raises a `ValueError`.

Default `max_seq_len` in the exported config for this checkpoint is `8192`.

## Default Hyperparameters

The exported `config.json` for the QED-75M checkpoint sets:

| Hyperparameter | Value |
|---|---:|
| Approx. parameter count | ~75M |
| `n_layers` | 32 |
| `d_model` | 384 |
| `n_heads` | 6 |
| `head_dim` | 64 |
| `ffn_hidden_dim` | 1024 |
| `vocab_size` | 49152 |
| `max_seq_len` | 8192 |
| `rope_theta` | 10000.0 |
| `rms_norm_eps` | 1e-5 |
| `dropout` | 0.0 |
| `tie_word_embeddings` | true |
| internal linear `bias` (QKV/MLP) | false |

Tokenizer / special tokens (from exported `tokenizer_config.json`):

- `` id `0`
- `` id `1`
- `` id `2`
- `` id `3`

---

# How to Get Started with the Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "YOUR_ORG/QED-75M"  # replace with your actual Hub repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # optional
)

inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=50, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

For loss computation:

- pass `labels` with the same shape as `input_ids`
- use `-100` in positions you want to ignore.

---

# Model Card Contact

For questions or updates about this model card, use the Issues/Discussions in the code repository or contact the model owner on Hugging Face.