X-raying a Transformer Forward Pass

Community Article
Published June 17, 2026

What does attention actually do, token by token, layer by layer? Not the textbook answer — the actual numbers, on a real prompt, with a real model.

I built a forward-pass tracer into rocmforge that captures every attention edge as inference runs, then renders it as a graph. This post shows what came out.


What the tracer captures

Every transformer forward pass is a flow: embeddings at the bottom, logits at the top, attention routing information between positions at each layer.

The tracer records this as a JSONL stream:

  • node records: one per component (input_embedding, query, key, value, attention_output, mlp_hidden, logits, confidence) per layer per sequence position
  • edge records: attention edges with src_position, dst_position, weight — the raw softmax output, summed across heads
  • meta record: predicted token, confidence, and expected attention positions for the prompt

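For the plots below, weights are summed across all 25 layers and all heads. This gives total attention mass per (src, dst) pair across the full forward pass. Because these are sums rather than averages, the weights reported below can be far larger than 1.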
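The per-pair summation can be sketched in a few lines of Python. This is a minimal sketch, not the tracer's actual reader: the field names src_position, dst_position, and weight come from the edge records listed above, while the "type" discriminator, the "layer" field, and the sample records are assumptions made up for illustration.

```python
import json
from collections import defaultdict

def total_attention_mass(jsonl_lines):
    """Sum edge weights per (src_position, dst_position) pair.

    Assumption: each record has a "type" field and edge records use
    type == "edge". The real trace schema may name this differently.
    """
    mass = defaultdict(float)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("type") != "edge":
            continue  # skip node and meta records
        key = (record["src_position"], record["dst_position"])
        mass[key] += record["weight"]
    return dict(mass)

# Made-up sample: one (src, dst) pair seen in two layers, a second pair,
# and a node record that should be ignored.
sample = [
    '{"type": "edge", "layer": 0, "src_position": 0, "dst_position": 8, "weight": 0.5}',
    '{"type": "edge", "layer": 1, "src_position": 0, "dst_position": 8, "weight": 0.25}',
    '{"type": "edge", "layer": 1, "src_position": 4, "dst_position": 8, "weight": 0.125}',
    '{"type": "node", "layer": 1, "component": "query", "position": 8}',
]
totals = total_attention_mass(sample)
print(totals)
```

Because the function only iterates over lines, an open file handle on a trace such as /tmp/trace.jsonl can be passed in place of the sample list.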


Correct prediction: Paris → city

Prompt: "The capital of France is Paris. Paris is a..."

Predicted token: city (confidence 0.773)

Expected positions: {0, 4, 5} — BOS token, "Paris", "is"

[Figure: xray_correct]

Left: attention flow graph, positions 0–8, components stacked bottom to top. Right: what the last position (pos 8) attends to, colored by expected (green) vs unexpected (orange/red).

The convergence bar is what matters. Position 8 (prediction position) attends to:

Position  Token      Weight  Status
0         BOS        166     expected
4         "Paris"    31      expected
2         "capital"  ~8      unexpected
5         "is"       ~6      expected

Strong BOS sink. Dominant expected positions. Four unexpected positions carry low mass; only one of them, "capital", appears in the table above. The model routes to the right context.


Wrong prediction: Myanmar → Yangon

Prompt: "The capital of Myanmar is"

Predicted token: Yang (→ Yangon, confidence 0.9999)

Correct answer: Naypyidaw (Myanmar relocated its capital from Yangon in late 2005 and announced the new capital's name in 2006)

Expected positions: {0, 1, 3, 4} — BOS, "The", "capital", "of"

[Figure: xray_wrong]

Same layout. Position 13 (the prediction position) attends to 14 positions, and only 4 of them are in the expected set.

Position        Weight  Status
0 (BOS)         173     expected
13 (self)       30      unexpected
12              22      unexpected
6 ("Myanmar")   22      unexpected
9               14      unexpected
7               13      unexpected
4               10      expected
5               7       unexpected
3               7       expected
11              5       unexpected
8               5       unexpected
10              2       unexpected
1               2       expected

What the comparison shows

Metric                               Correct  Wrong
BOS sink weight (pos 0)              166      173
Active positions                     9        14
Unexpected positions (weight > 0.1)  4        10
Confidence                           0.773    0.9999

The BOS sink does not move. It is slightly stronger in the wrong prediction (173 vs. 166), which argues against sink displacement as the cause of this failure.

What changes is the rest of the distribution: outside the BOS sink, unexpected positions dominate. The model's final token pulls mass from positions that activate the Myanmar→Yangon co-occurrence. Yangon was the capital until the 2005-2006 relocation and presumably appears far more often in training data than Naypyidaw. The model commits to this with 0.9999 confidence, not because the readout layer fails, but because attention routed to the wrong context.

Two failure signatures show up: higher attention entropy (mass spread over 14 active positions instead of 9) and dominance of unexpected-position mass. The readout layer (logit projection) works as designed on whatever context attention delivers; the error is upstream.
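Both signatures can be checked against the numbers already in the tables. The sketch below is an illustration, not the tracer's own metric code. It assumes "attention entropy" means the Shannon entropy of the summed weights normalized to a distribution, and it defines the unexpected share as the fraction of non-BOS mass that falls outside the expected set. The correct-prediction table lists only 4 of its 9 active positions, so the figures for that prompt are approximate.

```python
import math

# Weights copied from the two tables above (position -> summed weight).
# The correct-prediction table is partial (4 of 9 active positions).
correct = {0: 166, 4: 31, 2: 8, 5: 6}
correct_expected = {0, 4, 5}
wrong = {0: 173, 13: 30, 12: 22, 6: 22, 9: 14, 7: 13, 4: 10,
         5: 7, 3: 7, 11: 5, 8: 5, 10: 2, 1: 2}
wrong_expected = {0, 1, 3, 4}

def entropy_bits(weights):
    """Shannon entropy (bits) of the weights after normalizing to sum 1."""
    total = sum(weights.values())
    return -sum((w / total) * math.log2(w / total)
                for w in weights.values() if w > 0)

def unexpected_share(weights, expected, sink=0):
    """Fraction of non-sink mass on positions outside the expected set."""
    non_sink = {p: w for p, w in weights.items() if p != sink}
    unexpected = sum(w for p, w in non_sink.items() if p not in expected)
    return unexpected / sum(non_sink.values())

h_correct = entropy_bits(correct)
h_wrong = entropy_bits(wrong)
share_correct = unexpected_share(correct, correct_expected)
share_wrong = unexpected_share(wrong, wrong_expected)

print(f"entropy (bits):   correct={h_correct:.2f}  wrong={h_wrong:.2f}")
print(f"unexpected share: correct={share_correct:.2f}  wrong={share_wrong:.2f}")
```

On the listed rows this gives roughly 1.0 bits against 2.4 bits of entropy, and an unexpected share of about 0.18 against 0.86 once the BOS sink is excluded. Excluding the sink is a design choice: it holds more than half the mass in both traces, so leaving it in would mask the shift in the remainder.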


7B follow-up: depth-stratified commitment

After the Qwen2.5-0.5B traces above, I ran 80 prompt pairs on Qwen2.5-7B-Instruct using the same tracer, computing the Jensen-Shannon (JS) divergence between the attention distributions of correct and wrong predictions at each layer.

The hourglass pattern holds at 7B, but with a new finding: commitment depth varies by task type.

Category           Peak JS layer (avg)
Capital cities     L5.4
Science facts      L12.8
Entity completion  L13.4

Simple pattern-matching tasks commit early. Compositional reasoning commits deeper. The model routes differently depending on what kind of retrieval the prompt demands — and this is visible from the attention graph alone, without probing classifiers or fine-tuning.
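For readers who want to reproduce the per-layer comparison, here is a minimal sketch of a peak-JS-layer computation. It is written from the textbook definition of Jensen-Shannon divergence, not from the sweep code. It assumes each layer yields one attention distribution over the same set of positions for both prompts in a pair, and the three toy layers are made-up numbers.

```python
import math

def kl_bits(p, q):
    """KL divergence D(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

def peak_js_layer(correct_layers, wrong_layers):
    """Return (index of the layer with the largest JS divergence, all scores)."""
    scores = [js_divergence(p, q)
              for p, q in zip(correct_layers, wrong_layers)]
    peak = max(range(len(scores)), key=scores.__getitem__)
    return peak, scores

# Toy example: three layers, attention over three positions (made-up values).
correct_layers = [[0.8, 0.1, 0.1], [1.0, 0.0, 0.0], [0.6, 0.3, 0.1]]
wrong_layers = [[0.8, 0.1, 0.1], [0.0, 1.0, 0.0], [0.5, 0.3, 0.2]]

peak, scores = peak_js_layer(correct_layers, wrong_layers)
print(peak, [round(s, 3) for s in scores])
```

Averaging the returned peak index over all prompt pairs in a category would give a figure comparable to the "Peak JS layer (avg)" column, assuming that is how the average was taken.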


Where this runs

The tracer is in rocmforge, emitting JSONL from the CPU inference path. It runs on any GGUF model. The visualization is a Python script (plot_forward_graph.py) in geographdb-core.

cargo run --example infer -- \
  --model models/qwen2.5-0.5b-instruct-q8_0.gguf \
  --prompt "The capital of Myanmar is" \
  --forward-graph-trace /tmp/trace.jsonl \
  --expected-attention '{"13": [0,1,3,4]}'

python examples/plot_forward_graph.py /tmp/trace.jsonl

The GPU path (rocmforge ROCm kernels, 1.57× decode speedup on 7B via DP4A) is working and was used for the 80-pair sweep.


What comes next

The depth-complexity correlation needs cross-architecture validation. Does the same pattern hold on MoE models? On Mamba/SSM architectures? Does commitment depth scale predictably with model size beyond 7B?

These are testable questions with the existing tooling. Working on it.

Code: rocmforge | geographdb-core | Blog: oldnordic.github.io
