Positional Encodings, Part 1: Intuition through Learned Absolute Positions
1. Intuition
Intuition explains how transformer sequence order is represented in hidden states or attention scores.
1.1 Why attention needs position
Purpose. Why attention needs position focuses on the fact that self-attention is permutation-equivariant without an order signal. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
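The permutation point is easy to check directly. Below is a minimal NumPy sketch (toy dimensions and random weights are assumptions for illustration): with no position signal added, shuffling the input tokens shuffles the self-attention outputs in exactly the same way, so order is invisible to the layer.

```python
# Minimal sketch: self-attention with no position signal is
# permutation-equivariant, so permuting the tokens just permutes the outputs.
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                       # toy sequence length and model width
X = rng.normal(size=(T, d))       # token embeddings, no position added
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    return softmax(scores) @ V

perm = rng.permutation(T)
out_then_perm = attend(X)[perm]    # attend, then permute outputs
perm_then_out = attend(X[perm])    # permute inputs, then attend
print(np.allclose(out_then_perm, perm_then_out))  # True: order is invisible
```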
1.2 Absolute versus relative position
Purpose. Absolute versus relative position focuses on index identity versus pairwise offset. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Relative position methods make attention depend on pairwise distance instead of only absolute index identity.
Worked reading.
The same offset can share parameters across many positions, which is useful when patterns repeat across the sequence.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- relative bias.
- relative keys.
- bucketed distance bins.
Non-examples:
- one independent vector per absolute index.
- no order signal at all.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
1.3 Additive versus score-based position
Purpose. Additive versus score-based position focuses on where the signal enters the computation. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
1.4 Length extrapolation
Purpose. Length extrapolation focuses on why training length and inference length can diverge. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
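One of the small tests mentioned above is a retrieval probe at several depths. The sketch below is hypothetical scaffolding: the filler sentence, needle format, and chosen depths are assumptions, and `ask_model` is a stub for whatever serving API is actually in use.

```python
# Hypothetical position-sensitivity probe: plant a "needle" fact at several
# depths of a long filler context and ask the model to retrieve it.
FILLER = "The sky was clear and the meeting ran long. "
NEEDLE = "The secret code is {code}. "
QUESTION = "\nWhat is the secret code? Answer with the code only."

def build_probe(total_sentences: int, depth_fraction: float, code: str) -> str:
    insert_at = int(total_sentences * depth_fraction)
    parts = [FILLER] * total_sentences
    parts.insert(insert_at, NEEDLE.format(code=code))
    return "".join(parts) + QUESTION

def ask_model(prompt: str) -> str:
    raise NotImplementedError("call your model here")

if __name__ == "__main__":
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        prompt = build_probe(total_sentences=400, depth_fraction=depth, code="7413")
        # correct = ask_model(prompt).strip() == "7413"
        print(f"depth={depth:.2f} prompt_chars={len(prompt)}")
```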
1.5 Position in decoder-only LLMs
Purpose. Position in decoder-only LLMs focuses on causal order and generated prefixes. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
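The decode-loop bug mentioned above is concrete enough to sketch. The snippet assumes a Hugging Face-style interface (`position_ids`, `past_key_values`, `logits`); the exact argument names are assumptions, but the bookkeeping is the point: each new token's position id must continue the sequence, not reset to zero or stay at the prompt length.

```python
# Sketch of position-id bookkeeping in a cached greedy decode loop.
import torch

def decode(model, prompt_ids: torch.Tensor, max_new_tokens: int):
    # prompt_ids: (1, prompt_len)
    past = None
    input_ids = prompt_ids
    position_ids = torch.arange(prompt_ids.shape[1]).unsqueeze(0)
    generated = []
    for _ in range(max_new_tokens):
        out = model(input_ids=input_ids, position_ids=position_ids,
                    past_key_values=past)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)
        # Next step feeds one token whose position continues the sequence.
        input_ids = next_id
        position_ids = position_ids[:, -1:] + 1
    return torch.cat(generated, dim=1)
```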
2. Formal Setup
Formal Setup explains how transformer sequence order is represented in hidden states or attention scores.
2.1 Position indices
Purpose. Position indices focuses on how each token in a sequence is assigned an integer index. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
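A common source of wrong position ids is padding. The sketch below (left-padding convention and shapes are illustrative assumptions) derives position indices from the attention mask so padded slots do not shift real tokens to the wrong positions.

```python
# Deriving position indices from an attention mask under left padding.
import torch

attention_mask = torch.tensor([
    [0, 0, 1, 1, 1],   # left-padded sequence of real length 3
    [1, 1, 1, 1, 1],   # full-length sequence
])

# Count real tokens seen so far; padding slots are clamped to position 0
# (they are masked out of attention anyway).
position_ids = attention_mask.cumsum(dim=-1) - 1
position_ids = position_ids.clamp(min=0)
print(position_ids)
# tensor([[0, 0, 0, 1, 2],
#         [0, 1, 2, 3, 4]])
```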
2.2 Token plus position representation
Purpose. Token plus position representation focuses on how token embeddings and position signals are combined into a single hidden state. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
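The additive input recipe is short enough to write out. This is a minimal sketch with toy sizes (vocabulary, width, maximum length are assumptions): the hidden state at each slot is the token row plus the position row for that slot.

```python
# Hidden state = token embedding + position embedding, per slot.
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 1000, 128, 64
tok_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(max_len, d_model)

input_ids = torch.randint(0, vocab_size, (2, 10))          # (batch, seq)
positions = torch.arange(input_ids.shape[1]).unsqueeze(0)  # (1, seq), broadcasts over batch
hidden = tok_emb(input_ids) + pos_emb(positions)            # (batch, seq, d_model)
print(hidden.shape)
```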
2.3 Attention score modification
Purpose. Attention score modification focuses on biases and rotations before softmax. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
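A score-based scheme can be sketched generically. The bias below is a stand-in tensor (an assumption); concrete schemes fill it differently, for example learned per-offset values or ALiBi slopes times distance. Note that the causal mask stays a separate additive term, as the derivation habit recommends.

```python
# Score-based position: pairwise bias added to QK^T/sqrt(d) before softmax.
import torch
import torch.nn.functional as F

T, d = 6, 16
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)

scores = q @ k.T / d ** 0.5                       # content term
pos_bias = torch.randn(T, T)                      # stand-in position term
causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

attn = F.softmax(scores + pos_bias + causal_mask, dim=-1)
out = attn @ v
print(out.shape)
```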
2.4 Relative offset notation
Purpose. Relative offset notation focuses on notation for the pairwise distance between query and key positions. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Relative position methods make attention depend on pairwise distance instead of only absolute index identity.
Worked reading.
The same offset can share parameters across many positions, which is useful when patterns repeat across the sequence.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- relative bias.
- relative keys.
- bucketed distance bins.
Non-examples:
- one independent vector per absolute index.
- no order signal at all.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
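The offset notation itself is just a matrix of differences. A minimal sketch (the clipping window and the one-bias-per-offset table are illustrative assumptions): offsets beyond the window share one parameter, which is what lets far positions reuse what nearby positions learned.

```python
# Relative offsets offset[i, j] = j - i, clipped to a window and used to
# index a small learned bias table.
import torch

T, max_rel = 8, 3
pos = torch.arange(T)
offsets = pos[None, :] - pos[:, None]              # (T, T), entry j - i
clipped = offsets.clamp(-max_rel, max_rel)         # far offsets share params
indices = clipped + max_rel                        # shift into [0, 2*max_rel]

bias_table = torch.nn.Embedding(2 * max_rel + 1, 1)  # one learned bias per offset
pos_bias = bias_table(indices).squeeze(-1)            # (T, T) additive score bias
print(offsets[0, :5], pos_bias.shape)
```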
2.5 Position interpolation and scaling
Purpose. Position interpolation and scaling focuses on changing effective positions for long context. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
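Interpolation-style scaling is simple at the index level. The sketch below assumes a rotary-style scheme and illustrative trained/target lengths: instead of feeding raw positions past the trained window, positions are rescaled so the longest serving length maps back inside the trained range (usually followed by fine-tuning).

```python
# Position interpolation: compress serving positions into the trained range.
import torch

trained_len, target_len = 4096, 16384
scale = trained_len / target_len                  # 0.25: compress positions

positions = torch.arange(target_len, dtype=torch.float32)
effective_positions = positions * scale           # stays within [0, trained_len)

print(positions[-1].item(), effective_positions[-1].item())  # 16383.0 4095.75
```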
3. Sinusoidal Encodings
Sinusoidal Encodings explains how transformer sequence order is represented in hidden states or attention scores.
3.1 Frequency ladder
Purpose. Frequency ladder focuses on geometric wavelengths across dimensions. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
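The ladder itself is one line of arithmetic. A minimal sketch, using the common base of 10000 as an assumption: each coordinate pair gets an inverse frequency, and the corresponding wavelengths grow geometrically from a few tokens up to roughly 2*pi*base tokens.

```python
# Frequency ladder: inverse frequency per coordinate pair and its wavelength.
import numpy as np

d_model, base = 64, 10000.0
i = np.arange(d_model // 2)
inv_freq = base ** (-2 * i / d_model)         # fast rungs first
wavelength = 2 * np.pi / inv_freq             # period in token positions

print(wavelength[:3])    # short periods: fine-grained local order
print(wavelength[-3:])   # long periods: coarse global order
```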
3.2 Sine cosine pairs
Purpose. Sine cosine pairs focuses on phase representation by coordinate pairs. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
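Putting the pairs together gives the standard table. A minimal sketch with toy sizes: each position contributes sin and cos of position times the inverse frequency, interleaved by coordinate pair, and the resulting (T, d) table is added to the hidden states.

```python
# Standard sinusoidal table: sin/cos pairs per position, interleaved by dimension.
import numpy as np

def sinusoidal_table(T: int, d_model: int, base: float = 10000.0) -> np.ndarray:
    positions = np.arange(T)[:, None]                    # (T, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d/2)
    angles = positions * base ** (-2 * i / d_model)      # (T, d/2)
    table = np.zeros((T, d_model))
    table[:, 0::2] = np.sin(angles)
    table[:, 1::2] = np.cos(angles)
    return table

pe = sinusoidal_table(T=16, d_model=8)
print(pe.shape, pe[0, :4])   # position 0 starts [0, 1, 0, 1, ...]
```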
3.3 Linear relative-offset intuition
Purpose. Linear relative-offset intuition focuses on why fixed sinusoids can expose offsets. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Relative position methods make attention depend on pairwise distance instead of only absolute index identity.
Worked reading.
The same offset can share parameters across many positions, which is useful when patterns repeat across the sequence.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- relative bias.
- relative keys.
- bucketed distance bins.
Non-examples:
- one independent vector per absolute index.
- no order signal at all.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
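The offset intuition can be checked numerically for one sine/cosine pair. In this sketch the frequency and offset are arbitrary assumptions: a fixed 2x2 rotation that depends only on the offset k maps the pair at position p to the pair at position p + k, regardless of p, which is why a fixed linear map can read off relative offsets.

```python
# Check: pair at p+k is a fixed rotation (depending only on k) of the pair at p.
import numpy as np

omega, k = 0.03, 5                      # one frequency and one offset
R = np.array([[np.cos(omega * k),  np.sin(omega * k)],
              [-np.sin(omega * k), np.cos(omega * k)]])

for p in (0, 7, 123):
    pair_p  = np.array([np.sin(omega * p), np.cos(omega * p)])
    pair_pk = np.array([np.sin(omega * (p + k)), np.cos(omega * (p + k))])
    print(np.allclose(R @ pair_p, pair_pk))   # True for every p
```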
3.4 Visualization and aliasing
Purpose. Visualization and aliasing focuses on what high and low frequencies show. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
3.5 Limitations
Purpose. Limitations focuses on absolute addition and finite precision. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
4. Learned Absolute Positions
Learned Absolute Positions explains how transformer sequence order is represented in hidden states or attention scores.
4.1 Learned position table
Purpose. Learned position table focuses on the trainable table that holds one vector per position index. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Learned absolute position embeddings assign a trainable vector to each position index up to a maximum length.
Worked reading.
They are simple and effective in-range, but positions beyond the trained table require resizing or another strategy.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- BERT-style position rows.
- GPT-style learned absolute positions.
- interpolation for longer windows.
Non-examples:
- relative distance bias.
- rotation of Q/K coordinates.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
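The table is a plain embedding over position indices. A minimal sketch with toy sizes (the module name and dimensions are assumptions): look up one trainable row per position and add it to the token states.

```python
# Learned absolute positions: a trainable (max_len, d) table indexed by position id.
import torch
import torch.nn as nn

class LearnedPositions(nn.Module):
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.table = nn.Embedding(max_len, d_model)

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq, d_model)
        seq_len = token_states.shape[1]
        positions = torch.arange(seq_len, device=token_states.device)
        return token_states + self.table(positions)

pe = LearnedPositions(max_len=512, d_model=64)
print(pe(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])
```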
4.2 Training length limit
Purpose. Training length limit focuses on why rows beyond training are unavailable. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Learned absolute position embeddings assign a trainable vector to each position index up to a maximum length.
Worked reading.
They are simple and effective in-range, but positions beyond the trained table require resizing or another strategy.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- BERT-style position rows.
- GPT-style learned absolute positions.
- interpolation for longer windows.
Non-examples:
- relative distance bias.
- rotation of Q/K coordinates.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
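The training-length limit is a hard boundary, not a gradual one. A minimal sketch with a toy table: indexing one position past the trained window fails outright rather than degrading gracefully, which is why serving beyond the table requires resizing or a different scheme.

```python
# A learned table has no row past max_len, so out-of-range indexing fails hard.
import torch
import torch.nn as nn

max_len, d_model = 512, 64
table = nn.Embedding(max_len, d_model)

ok = table(torch.arange(512))              # positions 0..511 exist
print(ok.shape)                            # torch.Size([512, 64])
try:
    table(torch.tensor([512]))             # one past the trained window
except IndexError as err:
    print("out of range:", err)
```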
4.3 Interpolation resizing
Purpose. Interpolation resizing focuses on adapting rows to longer contexts. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
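One way to resize is to treat the table as a signal along the position axis and resample it. The sketch below uses linear interpolation as an illustrative choice, not the only option; whether quality survives depends on how smooth the learned rows are, and models are usually fine-tuned after the resize.

```python
# Interpolation resizing: linearly resample a (max_len, d) table to a longer length.
import torch
import torch.nn.functional as F

old_table = torch.randn(512, 64)             # trained rows (toy values)
new_len = 1024

resized = F.interpolate(
    old_table.T.unsqueeze(0),                # (1, d, old_len): dims act as channels
    size=new_len,
    mode="linear",
    align_corners=True,
).squeeze(0).T                                # back to (new_len, d)
print(resized.shape)                          # torch.Size([1024, 64])
```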
4.4 BERT and GPT-style usage
Purpose. BERT and GPT-style usage focuses on where learned rows appear. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
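The two classic recipes differ mainly in what gets summed at the input. A minimal sketch with toy sizes (all dimensions are assumptions): BERT-style encoder inputs sum word, position, and segment rows, while GPT-style decoder inputs sum word and position rows before the stack.

```python
# Where learned rows appear in the two classic recipes.
import torch
import torch.nn as nn

d = 64
word = nn.Embedding(1000, d)
pos = nn.Embedding(512, d)
seg = nn.Embedding(2, d)

ids = torch.randint(0, 1000, (1, 10))
positions = torch.arange(10).unsqueeze(0)
segments = torch.zeros(1, 10, dtype=torch.long)

bert_style = word(ids) + pos(positions) + seg(segments)   # encoder input
gpt_style = word(ids) + pos(positions)                    # decoder input
print(bert_style.shape, gpt_style.shape)
```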
4.5 Failure modes
Purpose. Failure modes focuses on length extrapolation and out-of-range ids. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.