Attention Mechanism Math

Math for LLMs

Part 1: Intuition to 3. Core Mechanics
1. Intuition

Intuition explains how transformer layers route information across sequence positions using differentiable, mask-aware retrieval.

1.1 Attention as soft retrieval

Purpose. Attention as soft retrieval focuses on why each token reads from other token states. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

Q=XW_Q,\qquad K=XW_K,\qquad V=XW_V.

Operational definition.

Attention lets each token form a query, compare it against key vectors, and read a weighted mixture of value vectors.

Worked reading.

A token representing it can assign high weight to an earlier noun phrase, causing the next hidden state to mix in information from that earlier position.

Object | Shape | Meaning
X | T\times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T\times d_k | query and key address vectors
V | T\times d_v | value payload vectors
S=QK^\top/\sqrt{d_k} | T\times T | compatibility scores
A=\operatorname{softmax}(S+M) | T\times T | attention weights
Y=AV | T\times d_v | mixed output values

Examples:

  1. pronoun resolution.
  2. copying from context.
  3. retrieved document use.

Non-examples:

  1. fixed convolution window only.
  2. one hidden state with no content-based mixing.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
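The soft-retrieval pipeline above can be sketched end to end in plain Python. The matrices below are toy values chosen for illustration, not anything from a trained model; queries and keys are aligned so that each token retrieves mostly from its matching position.

```python
import math

def softmax(row):
    # Subtract the row max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    z = sum(exps)
    return [e / z for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    # S = Q K^T / sqrt(d_k): one compatibility score per (query, key) pair.
    S = [[sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
         for q in Q]
    # A = softmax(S): each row becomes a probability distribution over keys.
    A = [softmax(row) for row in S]
    # Y = A V: each output is a weighted mixture of value vectors.
    Y = [[sum(a * v[j] for a, v in zip(row, V)) for j in range(len(V[0]))]
         for row in A]
    return A, Y

Q = [[1.0, 0.0], [0.0, 1.0]]   # T=2 queries, d_k=2
K = [[1.0, 0.0], [0.0, 1.0]]   # matching keys
V = [[10.0, 0.0], [0.0, 10.0]]  # distinctive payloads
A, Y = attention(Q, K, V)

# Every attention row sums to one over visible keys.
assert all(abs(sum(row) - 1.0) < 1e-9 for row in A)
```

Because query 0 aligns with key 0, the first output mixes in mostly the first value vector, which is exactly the "soft retrieval" reading above.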

1.2 Queries, keys, and values as roles

Purpose. Queries, keys, and values as roles focuses on the three roles a token state plays: search vector (query), address vector (key), and payload vector (value). This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

S=\frac{QK^\top}{\sqrt{d_k}}.

Operational definition.

This concept is part of the attention mechanism that mixes token representations according to learned compatibility scores.

Worked reading.

The implementation habit is to write shapes, scores, masks, softmax, and value aggregation explicitly.

Object | Shape | Meaning
X | T\times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T\times d_k | query and key address vectors
V | T\times d_v | value payload vectors
S=QK^\top/\sqrt{d_k} | T\times T | compatibility scores
A=\operatorname{softmax}(S+M) | T\times T | attention weights
Y=AV | T\times d_v | mixed output values

Examples:

  1. self-attention.
  2. decoder attention.
  3. attention over retrieved context.

Non-examples:

  1. independent token processing.
  2. fixed averaging with no learned scores.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
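The prefill/decode distinction mentioned in the implementation lens can be made concrete with a minimal decode-step sketch, assuming a single head and a hypothetical list-based KV cache; names like `K_cache` and `q_new` are illustrative, not from any particular serving stack.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    z = sum(exps)
    return [e / z for e in exps]

d_k = 2
K_cache = [[1.0, 0.0], [0.0, 1.0]]   # keys for tokens already processed
V_cache = [[1.0, 0.0], [0.0, 1.0]]   # values for those same tokens
q_new = [1.0, 1.0]                    # query for the single new token
k_new, v_new = [0.5, 0.5], [0.5, 0.5]

# Decode step: extend the cache by one entry, then attend with one query.
K_cache.append(k_new)
V_cache.append(v_new)
scores = [sum(a * b for a, b in zip(q_new, k)) / math.sqrt(d_k)
          for k in K_cache]
weights = softmax(scores)             # one row: new token vs all cached keys
y = [sum(w * v[j] for w, v in zip(weights, V_cache)) for j in range(d_k)]

assert len(scores) == 3               # the new query sees every cached key
assert abs(sum(weights) - 1.0) < 1e-9
```

Prefill would instead run the full T×T score matrix once over the whole prompt; decode repeats this one-row computation per generated token.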

1.3 Why scaling is needed

Purpose. Why scaling is needed focuses on variance control for dot products. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

A=\operatorname{softmax}(S+M),\qquad \sum_j A_{ij}=1.

Operational definition.

This concept is part of the attention mechanism that mixes token representations according to learned compatibility scores.

Worked reading.

The implementation habit is to write shapes, scores, masks, softmax, and value aggregation explicitly.

Object | Shape | Meaning
X | T\times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T\times d_k | query and key address vectors
V | T\times d_v | value payload vectors
S=QK^\top/\sqrt{d_k} | T\times T | compatibility scores
A=\operatorname{softmax}(S+M) | T\times T | attention weights
Y=AV | T\times d_v | mixed output values

Examples:

  1. self-attention.
  2. decoder attention.
  3. attention over retrieved context.

Non-examples:

  1. independent token processing.
  2. fixed averaging with no learned scores.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
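The variance-control claim in the purpose line can be checked empirically, assuming i.i.d. unit-variance query and key entries (the standard assumption behind the \sqrt{d_k} derivation, not something this lesson pins down): the raw dot product q·k has variance close to d_k, while the scaled score has variance close to 1.

```python
import random

random.seed(0)

def dot_variance(d_k, trials=5000):
    # Sample q . k for random q, k with unit-variance Gaussian entries
    # and estimate the variance of the raw score.
    samples = []
    for _ in range(trials):
        q = [random.gauss(0, 1) for _ in range(d_k)]
        k = [random.gauss(0, 1) for _ in range(d_k)]
        samples.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / trials
    # Raw variance grows like d_k; dividing scores by sqrt(d_k)
    # divides the variance by d_k, bringing it back near 1.
    return var, var / d_k

raw, scaled = dot_variance(64)
assert 50 < raw < 80        # close to d_k = 64
assert 0.8 < scaled < 1.2   # close to 1 after scaling
```

Without the \sqrt{d_k} factor, larger heads would push scores into the saturated region of softmax, making gradients vanish.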

1.4 Why masks are needed

Purpose. Why masks are needed focuses on causality padding and visibility constraints. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

Y=AV.

Operational definition.

A mask changes which key positions a query is allowed to see by adding large negative values to forbidden logits before softmax.

Worked reading.

In decoder-only language modeling, token i may attend to positions j\le i but not to future positions j>i.

Object | Shape | Meaning
X | T\times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T\times d_k | query and key address vectors
V | T\times d_v | value payload vectors
S=QK^\top/\sqrt{d_k} | T\times T | compatibility scores
A=\operatorname{softmax}(S+M) | T\times T | attention weights
Y=AV | T\times d_v | mixed output values

Examples:

  1. causal masks.
  2. padding masks.
  3. structured prompt masks.

Non-examples:

  1. zeroing output after softmax.
  2. trusting data order without a mask.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
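The additive-mask recipe above (mask the logits before softmax, never zero the output after) can be sketched for a tiny T=3 case, using a large negative constant as a stand-in for -infinity; the score values are arbitrary toy numbers.

```python
import math

NEG_INF = -1e9  # large negative stand-in for -infinity

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    z = sum(exps)
    return [e / z for e in exps]

T = 3
S = [[0.5, 1.0, -0.2],
     [0.1, 0.3,  0.9],
     [0.0, 0.4,  0.2]]
# Causal mask M: query position i may see key positions j <= i only.
M = [[0.0 if j <= i else NEG_INF for j in range(T)] for i in range(T)]
A = [softmax([s + m for s, m in zip(S[i], M[i])]) for i in range(T)]

assert A[0][1] < 1e-6 and A[0][2] < 1e-6       # token 0 sees only itself
assert abs(A[0][0] - 1.0) < 1e-6
assert all(abs(sum(row) - 1.0) < 1e-9 for row in A)  # rows still sum to one
```

Zeroing entries after softmax instead would leave each row summing to less than one and silently rescale the surviving weights wrongly, which is exactly the non-example above.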

1.5 Why attention replaced recurrence in LLMs

Purpose. Why attention replaced recurrence in LLMs focuses on parallel sequence mixing. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

\operatorname{MHA}(X)=\operatorname{Concat}(H_1,\ldots,H_h)W_O.

Operational definition.

This concept is part of the attention mechanism that mixes token representations according to learned compatibility scores.

Worked reading.

The implementation habit is to write shapes, scores, masks, softmax, and value aggregation explicitly.

Object | Shape | Meaning
X | T\times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T\times d_k | query and key address vectors
V | T\times d_v | value payload vectors
S=QK^\top/\sqrt{d_k} | T\times T | compatibility scores
A=\operatorname{softmax}(S+M) | T\times T | attention weights
Y=AV | T\times d_v | mixed output values

Examples:

  1. self-attention.
  2. decoder attention.
  3. attention over retrieved context.

Non-examples:

  1. independent token processing.
  2. fixed averaging with no learned scores.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
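The multi-head formula \operatorname{MHA}(X)=\operatorname{Concat}(H_1,\ldots,H_h)W_O reduces to a concatenation and one matrix product. A shape-level sketch, assuming h=2 toy per-head outputs and an identity W_O chosen only so the result is easy to check by eye:

```python
def matmul(A, B):
    # (m x n) @ (n x p) -> (m x p), iterating over B's columns via zip(*B).
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

T, d_v, h, d_model = 2, 2, 2, 4
H1 = [[1.0, 0.0], [0.0, 1.0]]   # head 1 output, T x d_v
H2 = [[2.0, 0.0], [0.0, 2.0]]   # head 2 output, T x d_v
concat = [h1 + h2 for h1, h2 in zip(H1, H2)]   # T x (h * d_v)
# Identity output projection (h * d_v) x d_model, toy choice.
W_O = [[1.0 if i == j else 0.0 for j in range(d_model)]
       for i in range(h * d_v)]
out = matmul(concat, W_O)       # T x d_model

assert len(out) == T and len(out[0]) == d_model
```

Each head runs its attention independently (and hence in parallel across positions), and W_O is the only place their results interact.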

2. Formal Definitions

Formal Definitions explains how transformer layers route information across sequence positions using differentiable, mask-aware retrieval.

2.1 Input hidden-state matrix

Purpose. Input hidden-state matrix focuses on the T\times d_{\mathrm{model}} sequence matrix. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

S=\frac{QK^\top}{\sqrt{d_k}}.

Operational definition.

This concept is part of the attention mechanism that mixes token representations according to learned compatibility scores.

Worked reading.

The implementation habit is to write shapes, scores, masks, softmax, and value aggregation explicitly.

Object | Shape | Meaning
X | T\times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T\times d_k | query and key address vectors
V | T\times d_v | value payload vectors
S=QK^\top/\sqrt{d_k} | T\times T | compatibility scores
A=\operatorname{softmax}(S+M) | T\times T | attention weights
Y=AV | T\times d_v | mixed output values

Examples:

  1. self-attention.
  2. decoder attention.
  3. attention over retrieved context.

Non-examples:

  1. independent token processing.
  2. fixed averaging with no learned scores.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.

2.2 Linear Q K V projections

Purpose. Linear Q K V projections focuses on Q=XW_Q, K=XW_K, V=XW_V. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

A=\operatorname{softmax}(S+M),\qquad \sum_j A_{ij}=1.

Operational definition.

This concept is part of the attention mechanism that mixes token representations according to learned compatibility scores.

Worked reading.

The implementation habit is to write shapes, scores, masks, softmax, and value aggregation explicitly.

Object | Shape | Meaning
X | T\times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T\times d_k | query and key address vectors
V | T\times d_v | value payload vectors
S=QK^\top/\sqrt{d_k} | T\times T | compatibility scores
A=\operatorname{softmax}(S+M) | T\times T | attention weights
Y=AV | T\times d_v | mixed output values

Examples:

  1. self-attention.
  2. decoder attention.
  3. attention over retrieved context.

Non-examples:

  1. independent token processing.
  2. fixed averaging with no learned scores.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
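The projections Q=XW_Q, K=XW_K, V=XW_V are ordinary matrix products, so the first derivation habit (write the shapes) can be exercised directly. A minimal shape check, assuming toy dimensions d_model=4 and d_k=2 with arbitrary illustrative weights:

```python
def matmul(A, B):
    # (m x n) @ (n x p) -> (m x p).
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

T, d_model, d_k = 3, 4, 2
X = [[float(i + j) for j in range(d_model)] for i in range(T)]  # T x d_model
W_Q = [[0.1] * d_k for _ in range(d_model)]                     # d_model x d_k
Q = matmul(X, W_Q)   # (T x d_model) @ (d_model x d_k) -> T x d_k

assert len(Q) == T and all(len(row) == d_k for row in Q)
```

The same shapes apply to W_K (giving K) and W_V (giving V, with d_v in place of d_k); all three projections read the same input X.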

2.3 Scaled dot-product attention

Purpose. Scaled dot-product attention focuses on \operatorname{softmax}(QK^\top/\sqrt{d_k})V. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

Y=AV.

Operational definition.

Scaled dot-product attention computes pairwise query-key scores, normalizes each row with softmax, and uses the resulting weights to average values.

Worked reading.

The factor \sqrt{d_k} keeps random dot products from growing too large as the head dimension increases.

Object | Shape | Meaning
X | T\times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T\times d_k | query and key address vectors
V | T\times d_v | value payload vectors
S=QK^\top/\sqrt{d_k} | T\times T | compatibility scores
A=\operatorname{softmax}(S+M) | T\times T | attention weights
Y=AV | T\times d_v | mixed output values

Examples:

  1. transformer self-attention.
  2. cross-attention.
  3. decoder attention.

Non-examples:

  1. nearest neighbor with hard argmax only.
  2. unscaled scores with unstable softmax.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.

2.4 Attention weights

Purpose. Attention weights focuses on row-stochastic probability-like matrices. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

\operatorname{MHA}(X)=\operatorname{Concat}(H_1,\ldots,H_h)W_O.

Operational definition.

This concept is part of the attention mechanism that mixes token representations according to learned compatibility scores.

Worked reading.

The implementation habit is to write shapes, scores, masks, softmax, and value aggregation explicitly.

Object | Shape | Meaning
X | T\times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T\times d_k | query and key address vectors
V | T\times d_v | value payload vectors
S=QK^\top/\sqrt{d_k} | T\times T | compatibility scores
A=\operatorname{softmax}(S+M) | T\times T | attention weights
Y=AV | T\times d_v | mixed output values

Examples:

  1. self-attention.
  2. decoder attention.
  3. attention over retrieved context.

Non-examples:

  1. independent token processing.
  2. fixed averaging with no learned scores.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.

2.5 Causal and padding masks

Purpose. Causal and padding masks focuses on additive masks before softmax. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

\operatorname{cost}_{\mathrm{scores}}\in O(T^2 d_k),\qquad \operatorname{memory}_{\mathrm{scores}}\in O(T^2).

Operational definition.

A mask changes which key positions a query is allowed to see by adding large negative values to forbidden logits before softmax.

Worked reading.

In decoder-only language modeling, token i may attend to positions j\le i but not to future positions j>i.

Object | Shape | Meaning
X | T\times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T\times d_k | query and key address vectors
V | T\times d_v | value payload vectors
S=QK^\top/\sqrt{d_k} | T\times T | compatibility scores
A=\operatorname{softmax}(S+M) | T\times T | attention weights
Y=AV | T\times d_v | mixed output values

Examples:

  1. causal masks.
  2. padding masks.
  3. structured prompt masks.

Non-examples:

  1. zeroing output after softmax.
  2. trusting data order without a mask.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
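Causal and padding masks compose by addition before softmax. A sketch assuming the last of T=3 positions is a padding token (the scores are arbitrary toy values):

```python
import math

NEG_INF = -1e9  # large negative stand-in for -infinity

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    z = sum(exps)
    return [e / z for e in exps]

T = 3
pad = [False, False, True]    # position 2 is padding
S = [[0.2, 0.7, 0.1],
     [0.5, 0.3, 0.9],
     [0.0, 0.0, 0.0]]
# Combined mask: visible only if causal (j <= i) AND not a padded key.
mask = [[0.0 if (j <= i and not pad[j]) else NEG_INF for j in range(T)]
        for i in range(T)]
A = [softmax([s + m for s, m in zip(S[i], mask[i])]) for i in range(T)]

# Query 1 may see positions 0 and 1 but never the padded key.
assert A[1][2] < 1e-6
assert abs(sum(A[1]) - 1.0) < 1e-9
```

The "trusting data order without a mask" non-example fails exactly here: without the additive mask, the padded key would silently receive weight.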

3. Core Mechanics

Core Mechanics explains how transformer layers route information across sequence positions using differentiable, mask-aware retrieval.

3.1 Softmax normalization

Purpose. Softmax normalization focuses on turning scores into weights. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

A=\operatorname{softmax}(S+M),\qquad \sum_j A_{ij}=1.

Operational definition.

This concept is part of the attention mechanism that mixes token representations according to learned compatibility scores.

Worked reading.

The implementation habit is to write shapes, scores, masks, softmax, and value aggregation explicitly.

Object | Shape | Meaning
X | T\times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T\times d_k | query and key address vectors
V | T\times d_v | value payload vectors
S=QK^\top/\sqrt{d_k} | T\times T | compatibility scores
A=\operatorname{softmax}(S+M) | T\times T | attention weights
Y=AV | T\times d_v | mixed output values

Examples:

  1. self-attention.
  2. decoder attention.
  3. attention over retrieved context.

Non-examples:

  1. independent token processing.
  2. fixed averaging with no learned scores.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
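Softmax normalization can be written in a few lines; this sketch shows that the output of one row is a probability distribution regardless of the input scale, and that it preserves the ordering of the scores.

```python
import math

def softmax(row):
    m = max(row)                       # shift by the row max (stability)
    exps = [math.exp(s - m) for s in row]
    z = sum(exps)
    return [e / z for e in exps]

weights = softmax([2.0, 1.0, 0.1])

assert abs(sum(weights) - 1.0) < 1e-12   # normalizes to one
assert all(w > 0 for w in weights)       # strictly positive weights
assert weights[0] > weights[1] > weights[2]  # score order is preserved
```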

3.2 Weighted value aggregation

Purpose. Weighted value aggregation focuses on convex combinations of value vectors. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

Y=AV.

Operational definition.

This concept is part of the attention mechanism that mixes token representations according to learned compatibility scores.

Worked reading.

The implementation habit is to write shapes, scores, masks, softmax, and value aggregation explicitly.

Object | Shape | Meaning
X | T\times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T\times d_k | query and key address vectors
V | T\times d_v | value payload vectors
S=QK^\top/\sqrt{d_k} | T\times T | compatibility scores
A=\operatorname{softmax}(S+M) | T\times T | attention weights
Y=AV | T\times d_v | mixed output values

Examples:

  1. self-attention.
  2. decoder attention.
  3. attention over retrieved context.

Non-examples:

  1. independent token processing.
  2. fixed averaging with no learned scores.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
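Because each attention row is nonnegative and sums to one, Y = AV is a convex combination of value vectors, so every output coordinate stays inside the range spanned by the corresponding value coordinates. A quick check with toy numbers:

```python
A_row = [0.7, 0.2, 0.1]                     # one attention row (sums to one)
V = [[1.0, -2.0], [3.0, 0.0], [5.0, 4.0]]   # three value vectors, d_v = 2
y = [sum(a * v[j] for a, v in zip(A_row, V)) for j in range(2)]

for j in range(2):
    col = [v[j] for v in V]
    # Convexity: the mixed output cannot leave the values' range.
    assert min(col) <= y[j] <= max(col)
```

This is why attention mixes information rather than inventing it: the output of one head is always inside the convex hull of the visible values.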

3.3 Attention entropy

Purpose. Attention entropy focuses on sharp versus diffuse attention. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

\operatorname{MHA}(X)=\operatorname{Concat}(H_1,\ldots,H_h)W_O.

Operational definition.

Attention diagnostics inspect weights, entropy, masks, and head importance, but they do not by themselves prove causal explanations.

Worked reading.

A low-entropy row means one or a few keys dominate; a high-entropy row means information is mixed broadly.

Object | Shape | Meaning
X | T\times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T\times d_k | query and key address vectors
V | T\times d_v | value payload vectors
S=QK^\top/\sqrt{d_k} | T\times T | compatibility scores
A=\operatorname{softmax}(S+M) | T\times T | attention weights
Y=AV | T\times d_v | mixed output values

Examples:

  1. attention heatmaps.
  2. head ablations.
  3. entropy dashboards.

Non-examples:

  1. claiming attention weight equals explanation.
  2. inspecting only one prompt.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
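The entropy diagnostic H(A_i) = -\sum_j A_{ij}\log A_{ij} is a one-liner; a sharp (near one-hot) row has entropy near 0, and a uniform row over T keys has entropy log T. A minimal sketch using natural log:

```python
import math

def entropy(row):
    # H(A_i) = -sum_j A_ij log A_ij; skip zero weights (0 log 0 = 0).
    return -sum(a * math.log(a) for a in row if a > 0)

sharp = [0.98, 0.01, 0.01]     # one key dominates: low entropy
diffuse = [1 / 3, 1 / 3, 1 / 3]  # broad mixing: maximal entropy

assert entropy(sharp) < entropy(diffuse)
assert abs(entropy(diffuse) - math.log(3)) < 1e-9
```

As the worked reading says, this measures sharpness of mixing only; as with attention maps, it is a diagnostic, not a causal explanation.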

3.4 Temperature and score scale

Purpose. Temperature and score scale focuses on how scaling changes concentration. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

\operatorname{cost}_{\mathrm{scores}}\in O(T^2 d_k),\qquad \operatorname{memory}_{\mathrm{scores}}\in O(T^2).

Operational definition.

This concept is part of the attention mechanism that mixes token representations according to learned compatibility scores.

Worked reading.

The implementation habit is to write shapes, scores, masks, softmax, and value aggregation explicitly.

Object | Shape | Meaning
X | T\times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T\times d_k | query and key address vectors
V | T\times d_v | value payload vectors
S=QK^\top/\sqrt{d_k} | T\times T | compatibility scores
A=\operatorname{softmax}(S+M) | T\times T | attention weights
Y=AV | T\times d_v | mixed output values

Examples:

  1. self-attention.
  2. decoder attention.
  3. attention over retrieved context.

Non-examples:

  1. independent token processing.
  2. fixed averaging with no learned scores.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
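A temperature \tau makes the concentration effect of score scale explicit: softmax(s/\tau) sharpens as \tau shrinks and flattens as \tau grows, which is the same role the \sqrt{d_k} divisor plays for raw scores. A sketch with toy scores:

```python
import math

def softmax_t(row, tau):
    # Temperature-scaled softmax: divide scores by tau before normalizing.
    scaled = [s / tau for s in row]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

scores = [2.0, 1.0, 0.0]
cool = softmax_t(scores, 0.1)    # low temperature: near one-hot
warm = softmax_t(scores, 10.0)   # high temperature: near uniform

assert cool[0] > 0.99
assert max(warm) - min(warm) < 0.1
```

Dividing scores by \sqrt{d_k} acts like picking a temperature that keeps typical score magnitudes, and hence softmax concentration, roughly constant across head sizes.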

3.5 Numerical stability

Purpose. Numerical stability focuses on subtracting row maxima before exponentiation. This is a core part of how transformer layers turn a sequence of embeddings into context-aware hidden states.

H(A_i)=-\sum_j A_{ij}\log A_{ij}.

Operational definition.

This concept is part of the attention mechanism that mixes token representations according to learned compatibility scores.

Worked reading.

The implementation habit is to write shapes, scores, masks, softmax, and value aggregation explicitly.

Object | Shape | Meaning
X | T\times d_{\mathrm{model}} | hidden states entering the layer
Q, K | T\times d_k | query and key address vectors
V | T\times d_v | value payload vectors
S=QK^\top/\sqrt{d_k} | T\times T | compatibility scores
A=\operatorname{softmax}(S+M) | T\times T | attention weights
Y=AV | T\times d_v | mixed output values

Examples:

  1. self-attention.
  2. decoder attention.
  3. attention over retrieved context.

Non-examples:

  1. independent token processing.
  2. fixed averaging with no learned scores.

Derivation habit.

  1. Write the shapes of X, Q, K, V, S, A, Y.
  2. Add masks before softmax, not after.
  3. Check every attention row sums to one over visible keys.
  4. Separate mathematical attention from kernel implementation details.
  5. For LLM serving, distinguish prefill attention from decode attention with a KV cache.

Implementation lens.

A correct attention implementation is mostly a shape and masking discipline. The bug that hurts language modeling most is often not the matrix multiplication; it is allowing a token to see future positions or padding tokens.

For efficient inference, the formula stays the same but the workload changes. During prefill, the model processes a full prompt. During decode, it adds one query at a time while reading cached keys and values from previous tokens.

For interpretation, attention weights are useful traces of information flow, but they are not the whole model explanation. Residual connections, MLPs, layer norms, and later layers can change or override what a single attention map appears to show.
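Subtracting the row maximum before exponentiation is not cosmetic: a naive softmax overflows for large logits (in Python, `math.exp(1000)` raises `OverflowError`), while the shifted version keeps the largest exponent at exactly 0 and changes nothing mathematically, since the shift cancels in the ratio.

```python
import math

def naive_softmax(row):
    exps = [math.exp(s) for s in row]       # overflows for large scores
    z = sum(exps)
    return [e / z for e in exps]

def stable_softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]   # largest exponent is exactly 0
    z = sum(exps)
    return [e / z for e in exps]

big = [1000.0, 999.0]
try:
    naive_softmax(big)
    overflowed = False
except OverflowError:
    overflowed = True

assert overflowed                  # the naive version fails on large logits
out = stable_softmax(big)
assert abs(sum(out) - 1.0) < 1e-12 # the shifted version is well behaved
assert out[0] > out[1]
```

Masked logits benefit from the same trick: a mask's large negative values become harmless underflows to zero weight rather than NaNs.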
