
Positional Encodings

Math for LLMs



Part 1: 1. Intuition to 4. Learned Absolute Positions

1. Intuition

This section builds intuition for how transformer sequence order is represented in hidden states or attention scores.

1.1 Why attention needs position

Purpose. Why attention needs position focuses on how self-attention is permutation-equivariant without order signals. This is the part of transformer math that tells attention where each token is and how far apart tokens are.

\operatorname{PE}_{p,2k}=\sin\left(p/10000^{2k/d}\right),\qquad \operatorname{PE}_{p,2k+1}=\cos\left(p/10000^{2k/d}\right).
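As a concrete sketch, the formula above can be tabulated in a few lines of numpy (the sizes T = 128 and d = 64 are arbitrary choices):

```python
import numpy as np

def sinusoidal_pe(T, d, base=10000.0):
    """Sinusoidal position encoding: row p holds sin(p / base^(2k/d)) on
    even dimensions 2k and cos of the same angle on odd dimensions 2k+1."""
    pos = np.arange(T)[:, None]            # (T, 1) position indices
    k = np.arange(d // 2)[None, :]         # (1, d/2) frequency indices
    angles = pos / base ** (2 * k / d)     # (T, d/2) angles per position
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(T=128, d=64)
# Each (sin, cos) pair sits on the unit circle, so every row has norm sqrt(d/2).
assert np.allclose(np.linalg.norm(pe, axis=-1), np.sqrt(64 / 2))
```

Because every pair is a point on the unit circle, moving along positions changes only the phase of each pair, never the scale of the encoding.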

Operational definition.

Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.

Worked reading.

Without a position signal, self-attention can mix content but cannot know which token came first.

| Scheme | Where position enters | Typical strength | Typical risk |
| --- | --- | --- | --- |
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |

Examples:

  1. sinusoidal features.
  2. position rows.
  3. relative offsets.

Non-examples:

  1. bag-of-words attention.
  2. token ids used as positions.

Derivation habit.

  1. State whether the scheme is absolute or relative.
  2. State whether the signal is added to hidden states, applied to Q/K, or added to scores.
  3. Check shape compatibility with heads, sequence length, and cached decoding.
  4. Test length behavior beyond the training context if the model will be served there.
  5. Keep masks separate from position signals.

Implementation lens.

Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.

The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.

1.2 Absolute versus relative position

Purpose. Absolute versus relative position focuses on index identity versus pairwise offset.

\mathbf{h}_p=\mathbf{x}_p+\mathbf{p}_p.

Operational definition.

Relative position methods make attention depend on pairwise distance instead of only absolute index identity.

Worked reading.

The same offset can share parameters across many positions, which is useful when patterns repeat across the sequence.


Examples:

  1. relative bias.
  2. relative keys.
  3. bucketed distance bins.

Non-examples:

  1. one independent vector per absolute index.
  2. no order signal at all.


1.3 Additive versus score-based position

Purpose. Additive versus score-based position focuses on where the signal enters the computation.

S_{ij}=\frac{\mathbf{q}_i^\top\mathbf{k}_j}{\sqrt{d_k}}+b_{i-j}.
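The score-level form can be sketched directly; here the bias vector b is a hypothetical stand-in for learned per-offset parameters:

```python
import numpy as np

def scores_with_relative_bias(Q, K, b):
    """Scaled dot-product scores plus a bias that depends only on the
    offset i - j. b[r] holds the bias for offset r, stored for offsets
    -(T-1) .. T-1 and shifted by T-1 into valid array indices."""
    T, dk = Q.shape
    S = Q @ K.T / np.sqrt(dk)            # (T, T) content scores
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return S + b[(i - j) + (T - 1)]      # positional term shared per offset

rng = np.random.default_rng(0)
T, dk = 5, 8
Q, K = rng.normal(size=(T, dk)), rng.normal(size=(T, dk))
b = rng.normal(size=2 * T - 1)           # one bias per possible offset
S = scores_with_relative_bias(Q, K, b)
# Pairs (3,1) and (4,2) have the same offset, so they share the same bias.
assert np.isclose(S[3, 1] - (Q[3] @ K[1]) / np.sqrt(dk),
                  S[4, 2] - (Q[4] @ K[2]) / np.sqrt(dk))
```

The key design point is parameter sharing: every query-key pair at the same distance reuses one scalar, instead of one vector per absolute index.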


1.4 Length extrapolation

Purpose. Length extrapolation focuses on why training length and inference length can diverge.

R_p^\top R_j=R_{j-p}.
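The identity above can be checked numerically with a single 2x2 rotation pair (the angle and positions are arbitrary choices):

```python
import numpy as np

def rot(theta):
    """2x2 rotation matrix; RoPE applies one of these per coordinate pair."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

theta = 0.3                  # per-position angle increment for one pair
p, j = 7, 11
# R_p^T R_j rotates by (j - p) * theta: absolute positions cancel and
# only the offset survives, which is why RoPE scores are relative.
assert np.allclose(rot(p * theta).T @ rot(j * theta), rot((j - p) * theta))
```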


1.5 Position in decoder-only LLMs

Purpose. Position in decoder-only LLMs focuses on causal order and generated prefixes.

\operatorname{RoPE}(\mathbf{q}_p)=R_p\mathbf{q}_p.
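A single-vector RoPE sketch, assuming the standard angle schedule theta_k = base^(-2k/d); it checks the two properties this lesson's tests call for, norm preservation and offset-only dot products:

```python
import numpy as np

def rope(x, p, base=10000.0):
    """Rotate consecutive coordinate pairs of x by p * theta_k:
    a block-diagonal rotation R_p applied to one query or key vector."""
    d = x.shape[-1]
    k = np.arange(d // 2)
    theta = p / base ** (2 * k / d)      # one angle per coordinate pair
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(theta) - x2 * np.sin(theta)
    out[1::2] = x1 * np.sin(theta) + x2 * np.cos(theta)
    return out

rng = np.random.default_rng(1)
q, kv = rng.normal(size=64), rng.normal(size=64)
# Rotations preserve norm ...
assert np.isclose(np.linalg.norm(rope(q, 5)), np.linalg.norm(q))
# ... and the dot product depends only on the offset (9-5 == 109-105).
assert np.isclose(rope(q, 5) @ rope(kv, 9), rope(q, 105) @ rope(kv, 109))
```

In a decode loop, p is the absolute index of each generated token; the second assertion is exactly why assigning the wrong p shifts every attention score.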


2. Formal Setup

This section sets up the formal notation for how sequence order enters hidden states or attention scores.

2.1 Position indices

Purpose. Position indices focuses on the index set p\in\{0,\ldots,T-1\}.

\mathbf{h}_p=\mathbf{x}_p+\mathbf{p}_p.


2.2 Token plus position representation

Purpose. Token plus position representation focuses on \mathbf{h}_p=\mathbf{x}_p+\mathbf{p}_p.

\mathbf{h}_p=\mathbf{x}_p+\mathbf{p}_p.


2.3 Attention score modification

Purpose. Attention score modification focuses on biases and rotations applied before the softmax.

R_p^\top R_j=R_{j-p}.


2.4 Relative offset notation

Purpose. Relative offset notation focuses on r=i-j.

\operatorname{RoPE}(\mathbf{q}_p)=R_p\mathbf{q}_p.


2.5 Position interpolation and scaling

Purpose. Position interpolation and scaling focuses on changing effective positions for long context.

\operatorname{ALiBi}_{ij}=m_h(i-j)\quad\text{for }j\le i.
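A sketch of the ALiBi bias matrix, written with positive slopes and the signed distance j - i so that more distant keys get a larger negative bias (equivalent to the formula above with m_h taken negative); the geometric slope schedule 2^(-8h/H) follows the ALiBi paper for power-of-two head counts:

```python
import numpy as np

def alibi_bias(T, n_heads):
    """Per-head causal bias m_h * (j - i) for j <= i: zero on the
    diagonal, linearly more negative as the key falls farther behind."""
    h = np.arange(1, n_heads + 1)
    slopes = 2.0 ** (-8.0 * h / n_heads)     # (H,) head-specific slopes
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    dist = np.where(j <= i, j - i, 0)        # 0, -1, -2, ... along each row
    return slopes[:, None, None] * dist      # (H, T, T), added to scores

bias = alibi_bias(T=6, n_heads=4)
assert bias.shape == (4, 6, 6)
assert bias[0, 5, 0] < bias[0, 5, 4] < 0.0   # farther key, bigger penalty
```

Because the bias is a fixed linear function of distance, the same rule applies unchanged at positions beyond the training length, which is the source of ALiBi's extrapolation behavior, and also of its rigidity.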


3. Sinusoidal Encodings

This section explains how sinusoidal encodings inject sequence order into hidden states.

3.1 Frequency ladder

Purpose. Frequency ladder focuses on geometric wavelengths across dimensions.

S_{ij}=\frac{\mathbf{q}_i^\top\mathbf{k}_j}{\sqrt{d_k}}+b_{i-j}.


3.2 Sine cosine pairs

Purpose. Sine cosine pairs focuses on phase representation by coordinate pairs.

R_p^\top R_j=R_{j-p}.


3.3 Linear relative-offset intuition

Purpose. Linear relative-offset intuition focuses on why fixed sinusoids can expose offsets.

\operatorname{RoPE}(\mathbf{q}_p)=R_p\mathbf{q}_p.


3.4 Visualization and aliasing

Purpose. Visualization and aliasing focuses on what high and low frequencies show.

\operatorname{ALiBi}_{ij}=m_h(i-j)\quad\text{for }j\le i.


3.5 Limitations

Purpose. Limitations focuses on absolute addition and finite precision.

\lambda_k=10000^{2k/d}.
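The ladder of denominators above can be tabulated to see why both ends are delicate: the fastest pair wraps every few positions, while the slowest barely moves inside a typical context (d = 64 is an arbitrary choice):

```python
import numpy as np

d = 64
k = np.arange(d // 2)
lam = 10000.0 ** (2 * k / d)     # angle denominators, geometric from 1 up
periods = 2 * np.pi * lam        # full sine period per pair, in positions

assert lam[0] == 1.0             # fastest pair: one period every ~6 positions
assert periods[-1] > 40_000      # slowest pair: far longer than most contexts
```

At the fast end, nearby positions differ by large angles and small float errors matter; at the slow end, thousands of positions map to nearly identical values, so the addition into hidden states carries little distinguishing signal there.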


4. Learned Absolute Positions

This section explains how learned absolute position embeddings inject sequence order into hidden states.

4.1 Learned position table

Purpose. Learned position table focuses on P\in\mathbb{R}^{T_{\max}\times d}.

\mathbf{h}_p=\mathbf{x}_p+P_p.

Operational definition.

Learned absolute position embeddings assign a trainable vector to each position index up to a maximum length.

Worked reading.

They are simple and effective in-range, but positions beyond the trained table require resizing or another strategy.
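A minimal numpy sketch of that behavior, with the table P standing in for a trainable embedding (T_max, d, and the init scale are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
T_max, d = 512, 64
P = rng.normal(scale=0.02, size=(T_max, d))   # one trainable row per index

def add_positions(X, start=0):
    """Add the learned row for each absolute index start, start+1, ...
    Indices past T_max have no trained row, hence the hard failure."""
    T = X.shape[0]
    if start + T > T_max:
        raise ValueError("position index beyond the trained table")
    return X + P[start:start + T]

H = add_positions(rng.normal(size=(100, d)))
assert H.shape == (100, d)
```

The ValueError is the honest version of the extrapolation problem: serving a longer context requires resizing or interpolating the table, not just passing a bigger input.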


Examples:

  1. BERT-style position rows.
  2. GPT-style learned absolute positions.
  3. interpolation for longer windows.

Non-examples:

  1. relative distance bias.
  2. rotation of Q/K coordinates.

Derivation habit.

  1. State whether the scheme is absolute or relative.
  2. State whether the signal is added to hidden states, applied to Q/K, or added to scores.
  3. Check shape compatibility with heads, sequence length, and cached decoding.
  4. Test length behavior beyond the training context if the model will be served there.
  5. Keep masks separate from position signals.

Implementation lens.

Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.

The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
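A minimal sketch of the lookup-and-add step, using plain Python lists as a stand-in for a trained T_max × d table (all names and values here are illustrative, not trained weights):

```python
import random

T_MAX, D = 8, 4  # illustrative maximum length and model width

# Trainable table P with one row per position id (random stand-in for trained weights).
random.seed(0)
P = [[random.gauss(0.0, 0.02) for _ in range(D)] for _ in range(T_MAX)]

def add_positions(token_embeddings):
    """Add the learned position row P[p] to the token embedding at position p."""
    if len(token_embeddings) > T_MAX:
        raise IndexError("position id beyond the trained table; resize or interpolate")
    return [[x + q for x, q in zip(tok, P[pos])]
            for pos, tok in enumerate(token_embeddings)]

tokens = [[1.0] * D for _ in range(3)]   # three identical tokens
hidden = add_positions(tokens)
assert len(hidden) == 3 and all(len(h) == D for h in hidden)
# Identical tokens at different positions now carry different hidden vectors.
assert hidden[0] != hidden[1]
```

The final assertion is the whole point of the scheme: without the added rows, repeated tokens would be indistinguishable to attention.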

4.2 Training length limit

Purpose. Training length limit focuses on why rows beyond training are unavailable. This is the part of transformer math that tells attention where each token is and how far apart tokens are.

P_p\ \text{is trained only for}\ 0\le p<T_{\max}.

Operational definition.

Learned absolute position embeddings assign a trainable vector to each position index up to a maximum length.

Worked reading.

They are simple and effective in-range, but positions beyond the trained table require resizing or another strategy.

| Scheme | Where position enters | Typical strength | Typical risk |
| --- | --- | --- | --- |
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned rows | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |

Examples:

  1. BERT-style position rows.
  2. GPT-style learned absolute positions.
  3. interpolation for longer windows.

Non-examples:

  1. relative distance bias.
  2. rotation of Q/K coordinates.

Derivation habit.

  1. State whether the scheme is absolute or relative.
  2. State whether the signal is added to hidden states, applied to Q/K, or added to scores.
  3. Check shape compatibility with heads, sequence length, and cached decoding.
  4. Test length behavior beyond the training context if the model will be served there.
  5. Keep masks separate from position signals.

Implementation lens.

Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.

The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
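The length limit can be demonstrated in a few lines. This sketch uses a hypothetical position_row helper and a toy table to show why out-of-range position ids should fail loudly rather than silently wrap:

```python
T_MAX = 4
P = [[float(p)] for p in range(T_MAX)]  # stand-in table: row p holds [p]

def position_row(p):
    """Return the learned row for position p, refusing out-of-range ids."""
    if not 0 <= p < T_MAX:
        raise IndexError(f"position id {p} outside trained range [0, {T_MAX})")
    return P[p]

assert position_row(3) == [3.0]

# Without the explicit check, Python's negative indexing would silently alias
# P[-1] to the last row: a shape-correct position bug of exactly the kind the
# implementation lens warns about.
try:
    position_row(T_MAX)  # one past the table: must fail
    failed = False
except IndexError:
    failed = True
assert failed
```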

4.3 Interpolation resizing

Purpose. Interpolation resizing focuses on adapting rows to longer contexts. This is the part of transformer math that tells attention where each token is and how far apart tokens are.

P'_p=(1-\alpha)\,P_{\lfloor s\rfloor}+\alpha\,P_{\lceil s\rceil},\qquad s=p\,\frac{T_{\max}-1}{T_{\mathrm{new}}-1},\quad \alpha=s-\lfloor s\rfloor.

Operational definition.

Learned absolute position embeddings assign a trainable vector to each position index up to a maximum length.

Worked reading.

They are simple and effective in-range, but positions beyond the trained table require resizing or another strategy.

| Scheme | Where position enters | Typical strength | Typical risk |
| --- | --- | --- | --- |
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned rows | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |

Examples:

  1. BERT-style position rows.
  2. GPT-style learned absolute positions.
  3. interpolation for longer windows.

Non-examples:

  1. relative distance bias.
  2. rotation of Q/K coordinates.

Derivation habit.

  1. State whether the scheme is absolute or relative.
  2. State whether the signal is added to hidden states, applied to Q/K, or added to scores.
  3. Check shape compatibility with heads, sequence length, and cached decoding.
  4. Test length behavior beyond the training context if the model will be served there.
  5. Keep masks separate from position signals.

Implementation lens.

Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.

The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
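The interpolation step can be read as code. Below is a minimal sketch over a toy one-column table; the helper name resize_positions is illustrative, a real table is a T_max × d matrix of trained weights, and the sketch assumes the new length is at least 2 rows:

```python
import math

def resize_positions(P, t_new):
    """Linearly interpolate a position table with len(P) rows to t_new rows."""
    t_old = len(P)
    out = []
    for p in range(t_new):
        s = p * (t_old - 1) / (t_new - 1)   # map the new index into the old range
        lo, hi = math.floor(s), math.ceil(s)
        a = s - lo
        out.append([(1 - a) * x + a * y for x, y in zip(P[lo], P[hi])])
    return out

P = [[0.0], [1.0], [2.0]]                    # toy 3-row, 1-dimensional table
P5 = resize_positions(P, 5)
assert len(P5) == 5
assert P5[0] == [0.0] and P5[4] == [2.0]     # endpoints are preserved
assert abs(P5[1][0] - 0.5) < 1e-9            # new rows fall between trained rows
```

Keeping the endpoints fixed means every resized position id still lands inside the range the rows were trained on, which is the whole argument for interpolation over extrapolation.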

4.4 BERT GPT-style usage

Purpose. BERT GPT-style usage focuses on where learned rows appear. This is the part of transformer math that tells attention where each token is and how far apart tokens are.

\mathbf{h}_p=E_{t_p}+P_p\;(+\,S_{s_p}\ \text{in BERT}).

Operational definition.

Learned absolute position embeddings assign a trainable vector to each position index up to a maximum length.

Worked reading.

They are simple and effective in-range, but positions beyond the trained table require resizing or another strategy.

| Scheme | Where position enters | Typical strength | Typical risk |
| --- | --- | --- | --- |
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned rows | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |

Examples:

  1. BERT-style position rows.
  2. GPT-style learned absolute positions.
  3. interpolation for longer windows.

Non-examples:

  1. relative distance bias.
  2. rotation of Q/K coordinates.

Derivation habit.

  1. State whether the scheme is absolute or relative.
  2. State whether the signal is added to hidden states, applied to Q/K, or added to scores.
  3. Check shape compatibility with heads, sequence length, and cached decoding.
  4. Test length behavior beyond the training context if the model will be served there.
  5. Keep masks separate from position signals.

Implementation lens.

Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.

The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
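A sketch of where the learned rows enter in BERT- and GPT-style inputs. The tables tok_emb, pos_emb, and seg_emb below are random stand-ins for trained weights, and all names are illustrative:

```python
import random

random.seed(0)
D = 4
tok_emb = {t: [random.gauss(0, 0.02) for _ in range(D)] for t in range(10)}
pos_emb = [[random.gauss(0, 0.02) for _ in range(D)] for _ in range(16)]
seg_emb = [[random.gauss(0, 0.02) for _ in range(D)] for _ in range(2)]

def embed(token_ids, segment_ids=None):
    """BERT-style input: token row + position row (+ segment row); GPT-style omits segments."""
    out = []
    for p, t in enumerate(token_ids):
        row = [a + b for a, b in zip(tok_emb[t], pos_emb[p])]
        if segment_ids is not None:
            row = [a + b for a, b in zip(row, seg_emb[segment_ids[p]])]
        out.append(row)
    return out

gpt_style = embed([5, 5])             # same token at two positions
bert_style = embed([5, 5], [0, 1])    # same tokens, two segments
assert gpt_style[0] != gpt_style[1]   # position rows disambiguate repeats
assert bert_style[0] != gpt_style[0]  # the segment row shifts the BERT input
```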

4.5 Failure modes

Purpose. Failure modes focuses on length extrapolation and out-of-range ids. This is the part of transformer math that tells attention where each token is and how far apart tokens are.

\operatorname{position\ id}_{\mathrm{decode}}=T_{\mathrm{prefix}}+t.

Operational definition.

Learned absolute position embeddings assign a trainable vector to each position index up to a maximum length.

Worked reading.

They are simple and effective in-range, but positions beyond the trained table require resizing or another strategy.

| Scheme | Where position enters | Typical strength | Typical risk |
| --- | --- | --- | --- |
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned rows | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |

Examples:

  1. BERT-style position rows.
  2. GPT-style learned absolute positions.
  3. interpolation for longer windows.

Non-examples:

  1. relative distance bias.
  2. rotation of Q/K coordinates.

Derivation habit.

  1. State whether the scheme is absolute or relative.
  2. State whether the signal is added to hidden states, applied to Q/K, or added to scores.
  3. Check shape compatibility with heads, sequence length, and cached decoding.
  4. Test length behavior beyond the training context if the model will be served there.
  5. Keep masks separate from position signals.

Implementation lens.

Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.

The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
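The decode-time formula can be checked with a toy sketch that contrasts correct position ids with the shape-correct bug described in the implementation lens (function names are illustrative):

```python
def decode_position_ids(t_prefix, n_new):
    """Correct position ids for generated tokens: continue from the prompt length."""
    return [t_prefix + t for t in range(n_new)]

def buggy_decode_position_ids(t_prefix, n_new):
    """A shape-correct bug: every generated token reuses position id 0."""
    return [0 for _ in range(n_new)]

# Prompt of 5 tokens, then 3 generated tokens.
good = decode_position_ids(5, 3)
bad = buggy_decode_position_ids(5, 3)
assert good == [5, 6, 7]   # matches position id_decode = T_prefix + t
assert bad == [0, 0, 0]    # same length and type, wrong positions: nothing crashes
```

Because both versions return a list of the right length, only an explicit test on the values, not the shapes, exposes the bug.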
