Positional Encodings, Part 10: Exercises to References
10. Exercises
- (*) Compute a sinusoidal position row.
  - (a) State the scheme.
  - (b) Work through a small numeric example.
  - (c) Explain the consequence for LLMs.
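A minimal NumPy sketch for this exercise, assuming the standard sin/cos scheme from Vaswani et al.; `pos = 3` and `d_model = 8` are placeholder values:

```python
import numpy as np

def sinusoidal_row(pos: int, d_model: int) -> np.ndarray:
    """One absolute-position row: sin on even dims, cos on odd dims."""
    i = np.arange(d_model // 2)                    # pair index
    angles = pos / (10000.0 ** (2 * i / d_model))  # pos / 10000^(2i/d_model)
    row = np.empty(d_model)
    row[0::2] = np.sin(angles)
    row[1::2] = np.cos(angles)
    return row

print(np.round(sinusoidal_row(pos=3, d_model=8), 4))
```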
- (*) Show token-plus-position addition.
  - (a) State the scheme.
  - (b) Work through a small numeric example.
  - (c) Explain the consequence for LLMs.
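A sketch of the additive scheme, with random stand-in token embeddings and illustrative sizes `seq_len = 4`, `d_model = 8`:

```python
import numpy as np

seq_len, d_model = 4, 8
rng = np.random.default_rng(0)
tok = rng.normal(size=(seq_len, d_model))  # stand-in token embeddings

# sinusoidal table for positions 0..seq_len-1: sin on even dims, cos on odd dims
positions = np.arange(seq_len)[:, None]
pair = np.arange(d_model // 2)[None, :]
angles = positions / (10000.0 ** (2 * pair / d_model))
pe = np.empty((seq_len, d_model))
pe[:, 0::2] = np.sin(angles)
pe[:, 1::2] = np.cos(angles)

x = tok + pe      # additive absolute positions, same shape as the embeddings
print(x.shape)    # (4, 8)
```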
- (*) Build a relative distance matrix.
  - (a) State the scheme.
  - (b) Work through a small numeric example.
  - (c) Explain the consequence for LLMs.
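One way to build the signed distance matrix with NumPy broadcasting; `seq_len = 5` is a placeholder:

```python
import numpy as np

seq_len = 5
pos = np.arange(seq_len)
dist = pos[None, :] - pos[:, None]  # dist[q, k] = k - q: negative = past, positive = future
print(dist)
```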
- (**) Create a relative attention bias matrix.
  - (a) State the scheme.
  - (b) Work through a small numeric example.
  - (c) Explain the consequence for LLMs.
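A sketch in the spirit of Shaw et al., assuming one learned scalar per clipped offset; the table values here are random stand-ins:

```python
import numpy as np

seq_len, max_rel = 5, 2
rng = np.random.default_rng(0)
table = rng.normal(size=2 * max_rel + 1)  # one learned scalar per clipped offset

pos = np.arange(seq_len)
rel = np.clip(pos[None, :] - pos[:, None], -max_rel, max_rel)
bias = table[rel + max_rel]               # (seq_len, seq_len), added to attention logits
print(np.round(bias, 3))
```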
- (**) Apply a RoPE rotation and check norm preservation.
  - (a) State the scheme.
  - (b) Work through a small numeric example.
  - (c) Explain the consequence for LLMs.
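A two-dimensional sketch; `theta = 1.0` and the vector values are arbitrary choices for the check:

```python
import numpy as np

def rotate_pair(vec, pos, theta=1.0):
    """Rotate one (x, y) feature pair by angle pos * theta."""
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    x, y = vec
    return np.array([c * x - s * y, s * x + c * y])

v = np.array([3.0, 4.0])
r = rotate_pair(v, pos=2)
# both print 5.0 (up to floating-point error): rotation preserves norm
print(np.linalg.norm(v), np.linalg.norm(r))
```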
- (**) Verify a RoPE relative-offset dot-product identity in two dimensions.
  - (a) State the scheme.
  - (b) Work through a small numeric example.
  - (c) Explain the consequence for LLMs.
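A numeric check of the two-dimensional identity dot(R_m q, R_n k) = dot(q, R_(n-m) k); the vectors, positions, and `theta` here are arbitrary choices:

```python
import numpy as np

def rot(pos, theta=0.5):
    """2-D rotation matrix for angle pos * theta."""
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    return np.array([[c, -s], [s, c]])

q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])
m, n = 7, 3
lhs = np.dot(rot(m) @ q, rot(n) @ k)  # score with both vectors rotated
rhs = np.dot(q, rot(n - m) @ k)       # same score from the offset n - m alone
print(np.isclose(lhs, rhs))           # True
```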
- (**) Build an ALiBi matrix.
  - (a) State the scheme.
  - (b) Work through a small numeric example.
  - (c) Explain the consequence for LLMs.
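A sketch assuming 8 heads, for which the slope schedule from Press et al. is 1/2 through 1/256; `seq_len = 5` is a placeholder:

```python
import numpy as np

seq_len, n_heads = 5, 8
slopes = 2.0 ** -np.arange(1, n_heads + 1)       # 1/2 ... 1/256 for 8 heads
pos = np.arange(seq_len)
rel = pos[None, :] - pos[:, None]                # k - q, <= 0 for visible past keys
alibi = slopes[:, None, None] * rel[None, :, :]  # (heads, q, k); future entries are masked anyway
print(np.round(alibi[0], 3))
```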
- (***) Count the parameters of a learned position table.
  - (a) State the scheme.
  - (b) Work through a small numeric example.
  - (c) Explain the consequence for LLMs.
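A one-line parameter count; the `max_positions` and `d_model` values are placeholders:

```python
max_positions, d_model = 2048, 4096       # placeholder sizes
params = max_positions * d_model          # one learned row per absolute position
print(f"{params:,} position-table parameters")  # 8,388,608
```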
- (***) Check decode position ids for a KV cache.
  - (a) State the scheme.
  - (b) Work through a small numeric example.
  - (c) Explain the consequence for LLMs.
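A sketch of position-id bookkeeping during prefill and per-token decode; the token ids are stand-ins:

```python
prompt = [101, 7592, 2088]  # stand-in token ids
cache_len = 0

prefill_pos = list(range(cache_len, cache_len + len(prompt)))
print("prefill position ids:", prefill_pos)  # [0, 1, 2]
cache_len += len(prompt)

for step in range(3):        # one new token per decode step
    next_pos = cache_len     # 3, then 4, then 5: continues from the cache length
    print(f"decode step {step}: position id {next_pos}")
    cache_len += 1
```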
- (***) Design a long-context position diagnostic.
  - (a) State the scheme.
  - (b) Work through a small numeric example.
  - (c) Explain the consequence for LLMs.
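A skeleton of a needle-at-depth probe; `ask_model`, the filler text, and the character-level insertion are hypothetical simplifications, not a real evaluation harness:

```python
context_len = 8000                   # placeholder context size
depths = [0.0, 0.25, 0.5, 0.75, 1.0]  # relative needle depth

def run_probe(ask_model, needle: str, filler: str):
    """Insert the needle at each depth, query it, record retrieval per depth."""
    scores = {}
    for depth in depths:
        insert_at = int(depth * context_len)
        prompt = filler[:insert_at] + needle + filler[insert_at:]
        answer = ask_model(prompt + "\nWhat was the needle?")
        scores[depth] = needle in answer  # crude exact-match retrieval check
    return scores
```

Plotting retrieval against depth is what exposes lost-in-the-middle behavior: a dip at intermediate depths with strong ends.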
11. Why This Matters for AI
| Concept | AI impact |
|---|---|
| Sinusoidal encodings | Provide fixed absolute order features without learned position rows. |
| Learned position tables | Work well in-range but tie the model to trained maximum positions. |
| Relative biases | Let attention reason about pairwise distances. |
| RoPE | Supports relative offset behavior through rotations and is common in modern decoder LLMs. |
| ALiBi | Adds simple distance penalties that extrapolate without a learned position table. |
| Position ids | Matter for KV-cache decoding and long-context serving correctness. |
| Long-context diagnostics | Expose lost-in-the-middle, recency bias, and extrapolation failures. |
| Mask interaction | Ensures order signals do not override causal or padding visibility. |
12. Conceptual Bridge
The backward bridge is attention. Attention computes content-based interactions, but position mechanisms determine whether those interactions know sequence order and distance.
The forward bridge is language-model probability. In next-token prediction, position affects which prefix states are visible, how generated tokens receive ids, and whether the model can use long contexts reliably.
+--------------+      +------------------+      +----------------------+
| attention    | ---> | position signal  | ---> | ordered next-token   |
| content mix  |      | absolute / rel   |      | prediction           |
+--------------+      +------------------+      +----------------------+
The practical habit is to test length behavior, not only maximum accepted length. Position encodings can be mathematically valid and still behave poorly outside the training regime.
13. References
- Vaswani et al. Attention Is All You Need. https://arxiv.org/abs/1706.03762
- Shaw, Uszkoreit, and Vaswani. Self-Attention with Relative Position Representations. https://arxiv.org/abs/1803.02155
- Dai et al. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context. https://arxiv.org/abs/1901.02860
- Su et al. RoFormer: Enhanced Transformer with Rotary Position Embedding. https://arxiv.org/abs/2104.09864
- Press et al. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. https://arxiv.org/abs/2108.12409