X-raying a Transformer Forward Pass
What does attention actually do, token by token, layer by layer? Not the textbook answer — the actual numbers, on a real prompt, with a real model.
I built a forward-pass tracer into rocmforge that captures every attention edge as inference runs, then renders it as a graph. This post shows what came out.
What the tracer captures
Every transformer forward pass is a flow: embeddings at the bottom, logits at the top, attention routing information between positions at each layer.
The tracer records this as a JSONL stream:
- node records: one per component (input_embedding, query, key, value, attention_output, mlp_hidden, logits, confidence) per layer per sequence position
- edge records: attention edges with src_position, dst_position, weight (the raw softmax output, summed across heads)
- meta record: predicted token, confidence, and expected attention positions for the prompt
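For the graphs and tables below, weights are summed across all 25 layers and all heads. This gives total attention mass per (src, dst) pair across the full forward pass, which is why the values exceed 1 even though each softmax row sums to 1.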
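The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the tracer's own code: the field names src_position, dst_position, and weight come from the record description, while the "kind" discriminator field, the function name, and the sample records are assumptions made for the example.

```python
import json
from collections import defaultdict


def sum_attention_edges(lines):
    """Sum edge weights per (src_position, dst_position) pair."""
    totals = defaultdict(float)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # "kind" is a hypothetical discriminator field; the real trace
        # may mark record types differently.
        if record.get("kind") != "edge":
            continue
        key = (record["src_position"], record["dst_position"])
        totals[key] += record["weight"]
    return dict(totals)


# Made-up sample records, standing in for lines read from a trace file.
sample = [
    '{"kind": "edge", "src_position": 8, "dst_position": 0, "weight": 0.5}',
    '{"kind": "edge", "src_position": 8, "dst_position": 0, "weight": 0.25}',
    '{"kind": "edge", "src_position": 8, "dst_position": 4, "weight": 0.125}',
    '{"kind": "node", "component": "query", "position": 8}',
]
totals = sum_attention_edges(sample)
print(totals)
```

With a real trace, the same function could be fed an open file handle, since iterating a file yields one line at a time.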
Correct prediction: Paris → city
Prompt: "The capital of France is Paris. Paris is a..."
Predicted token: city (confidence 0.773)
Expected positions: {0, 4, 5} — BOS token, "Paris", "is"
Left: attention flow graph, positions 0–8, components stacked bottom to top. Right: what the last position (pos 8) attends to, colored by expected (green) vs unexpected (orange/red).
The convergence bar is what matters. Position 8 (prediction position) attends to:
| Position | Token | Weight | Status |
|---|---|---|---|
| 0 | BOS | 166 | expected |
| 4 | "Paris" | 31 | expected |
| 2 | "capital" | ~8 | unexpected |
| 5 | "is" | ~6 | expected |
Strong BOS sink. Dominant expected positions. Four unexpected positions with low mass. Model routes to the right context.
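To put a number on "dominant expected positions", the mass can be split by the expected set. The sketch below uses only the four rows shown in the table (the weights for positions 2 and 5 are approximate), so it covers four of the nine active positions; the function name is hypothetical.

```python
def split_attention_mass(weights, expected):
    """Split total attention mass into expected and unexpected parts."""
    total = sum(weights.values())
    expected_mass = sum(w for pos, w in weights.items() if pos in expected)
    return {
        "expected": expected_mass,
        "unexpected": total - expected_mass,
        "expected_share": expected_mass / total if total else 0.0,
    }


# Rows from the Paris table: position -> summed weight.
paris_weights = {0: 166.0, 4: 31.0, 2: 8.0, 5: 6.0}
paris_expected = {0, 4, 5}

paris_split = split_attention_mass(paris_weights, paris_expected)
print(paris_split)
```

On these four rows, the expected set holds 203 of 211 units of mass.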
Wrong prediction: Myanmar → Yangon
Prompt: "The capital of Myanmar is"
Predicted token: Yang (→ Yangon, confidence 0.9999)
Correct answer: Naypyidaw (Myanmar moved its capital in 2006)
Expected positions: {0, 1, 3, 4} — BOS, "The", "capital", "of"
Same layout. Position 13 (prediction) attends to 14 positions. Many are unexpected.
| Position | Weight | Status |
|---|---|---|
| 0 (BOS) | 173 | expected |
| 13 (self) | 30 | unexpected |
| 12 | 22 | unexpected |
| 6 ("Myanmar") | 22 | unexpected |
| 9 | 14 | unexpected |
| 7 | 13 | unexpected |
| 4 | 10 | expected |
| 5 | 7 | unexpected |
| 3 | 7 | expected |
| 11 | 5 | unexpected |
| 8 | 5 | unexpected |
| 10 | 2 | unexpected |
| 1 | 2 | expected |
What the comparison shows
| Metric | Correct | Wrong |
|---|---|---|
| BOS sink (pos 0) | 166 | 173 |
| Active positions | 9 | 14 |
| Unexpected positions > 0.1 | 4 | 10 |
| Confidence | 0.773 | 0.9999 |
The BOS sink does not move; it gets slightly stronger in the wrong prediction (173 vs 166). For this pair, that rules out sink displacement as the failure cause.
What changes: outside the BOS sink, unexpected positions dominate. In the rows of the Myanmar table, they hold 120 of the 139 units of non-BOS mass. The model's final token pulls mass from positions that activate the Myanmar→Yangon co-occurrence: Yangon was the capital until 2006 and appears far more frequently in training data than Naypyidaw. The model commits to this with 0.9999 confidence, not because the readout layer fails, but because attention routed to the wrong context.
Failure modes observed: higher attention entropy and unexpected-position mass dominance. The readout layer (logit projection) works correctly on whatever context attention delivered — the error is upstream.
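Attention entropy here can be read as the Shannon entropy of the last position's normalized attention weights. The sketch below assumes that definition (in bits) and applies it to the rows of the two tables; the Paris table lists only four of its nine active positions, so the comparison is indicative rather than exact.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (bits) of weights normalized to a distribution."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)


# Summed weights as listed in the two tables.
paris = [166, 31, 8, 6]
myanmar = [173, 30, 22, 22, 14, 13, 10, 7, 7, 5, 5, 2, 2]

print(f"Paris:   {attention_entropy(paris):.2f} bits")
print(f"Myanmar: {attention_entropy(myanmar):.2f} bits")
```

A uniform distribution over n positions gives log2(n) bits and a single dominant position gives a value near 0, so a higher value means mass is spread over more positions.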
7B follow-up: depth-stratified commitment
After the 0.5B traces, I ran 80 prompt pairs on Qwen2.5-7B-Instruct with the same tracer and computed, per layer, the Jensen-Shannon (JS) divergence between the attention distributions of correct and wrong predictions.
The hourglass pattern holds at 7B, but with a new finding: commitment depth varies by task type.
| Category | Peak JS layer (avg) |
|---|---|
| Capital cities | L5.4 |
| Science facts | L12.8 |
| Entity completion | L13.4 |
Simple pattern-matching tasks commit early. Compositional reasoning commits deeper. The model routes differently depending on what kind of retrieval the prompt demands — and this is visible from the attention graph alone, without probing classifiers or fine-tuning.
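The peak-layer metric can be sketched as follows. This is an illustration under stated assumptions, not the sweep's actual code: it assumes the two attention distributions in a pair are already aligned over the same positions, uses base-2 logs, and indexes layers from 0. The function names and toy inputs are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (bits) between two weight vectors."""
    p_total, q_total = sum(p), sum(q)
    p = [x / p_total for x in p]
    q = [x / q_total for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def peak_js_layer(correct_layers, wrong_layers):
    """Return (layer index with the largest JS divergence, all scores)."""
    scores = [js_divergence(c, w) for c, w in zip(correct_layers, wrong_layers)]
    peak = max(range(len(scores)), key=scores.__getitem__)
    return peak, scores


# Toy per-layer distributions over two positions (made-up numbers).
correct = [[1, 1], [1, 0], [3, 1]]
wrong = [[1, 1], [0, 1], [1, 1]]
peak, scores = peak_js_layer(correct, wrong)
print(peak, scores)
```

Averaging the peak index over all pairs in a category would give a fractional value like the ones in the table above.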
Where this runs
The tracer is in rocmforge, emitting JSONL from the CPU inference path. It runs on any GGUF model. The visualization is a Python script (plot_forward_graph.py) in geographdb-core.
```bash
cargo run --example infer -- \
  --model models/qwen2.5-0.5b-instruct-q8_0.gguf \
  --prompt "The capital of Myanmar is" \
  --forward-graph-trace /tmp/trace.jsonl \
  --expected-attention '{"13": [0,1,3,4]}'

python examples/plot_forward_graph.py /tmp/trace.jsonl
```
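The value passed to --expected-attention looks like a JSON object mapping a prediction position to its expected source positions. Under that reading (an assumption drawn from the example value, not from the tool's documentation), it can be loaded in Python like this; the helper name is hypothetical.

```python
import json


def parse_expected_attention(raw):
    """Parse '{"13": [0,1,3,4]}' into {13: {0, 1, 3, 4}}."""
    parsed = json.loads(raw)
    # JSON object keys are always strings, so convert them to ints.
    return {int(pos): set(expected) for pos, expected in parsed.items()}


expected = parse_expected_attention('{"13": [0,1,3,4]}')
print(expected)
```

Storing the positions as a set makes it cheap to check whether an attended position is in the expected set.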
The GPU path (rocmforge ROCm kernels, 1.57× decode speedup on 7B via DP4A) is working and was used for the 80-pair sweep.
What comes next
The depth-complexity correlation needs cross-architecture validation. Does the same pattern hold on MoE models? On Mamba/SSM architectures? Does commitment depth scale predictably with model size beyond 7B?
These are testable questions with the existing tooling. Working on it.
Code: rocmforge | geographdb-core | Blog: oldnordic.github.io

