"Attention is differentiable retrieval over the current context."
Overview
Attention is the mechanism that lets each token representation read from other token representations. It creates query, key, and value vectors, computes compatibility scores, masks illegal positions, normalizes with softmax, and mixes value vectors into context-aware outputs.
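The steps above can be sketched in a few lines of NumPy. This is a minimal, illustrative implementation (the shapes, seed, and variable names are examples, not taken from the notebooks), using the numerically stable shifted softmax and a causal mask:

```python
import numpy as np

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention with a numerically stable softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # compatibility scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)      # mask illegal positions before softmax
    scores -= scores.max(axis=-1, keepdims=True)   # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # rows are probability distributions
    return weights @ V                             # mix value vectors

rng = np.random.default_rng(0)
T, d = 4, 8
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
causal = np.tril(np.ones((T, T), dtype=bool))      # token t may attend only to positions <= t
out = attention(Q, K, V, causal)
print(out.shape)  # (4, 8)
```

Note that with the causal mask, the first token can attend only to itself, so its output is exactly its own value vector, a quick sanity check for mask leakage.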
For LLMs, attention is the central sequence-mixing operation. It powers in-context learning, copying, retrieval, chat-template boundaries, and long-context behavior. It is also one of the main systems bottlenecks because the score matrix grows quadratically with sequence length.
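The quadratic growth is easy to quantify: the score matrix has one entry per (query, key) pair, so quadrupling the sequence length multiplies its memory by sixteen. A quick back-of-the-envelope sketch (fp32, per head, materialized naively):

```python
# Naive attention materializes a T x T score matrix per head.
# In fp32 (4 bytes per entry), memory grows quadratically with T.
for T in (1_024, 4_096, 16_384):
    mib = T * T * 4 / 2**20
    print(f"T={T:>6}: {mib:>6.0f} MiB per head")  # 4, 64, 1024 MiB
```

This is precisely the memory traffic that FlashAttention avoids by never materializing the full score matrix.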
This section uses Markdown with inline LaTeX (`$...$`) and `code` spans. The companion notebooks implement stable softmax attention, causal masks, multi-head shapes, entropy diagnostics, KV-cache sizing, and efficient-attention intuition in small NumPy examples.
Prerequisites
Companion Notebooks
| Notebook | Description |
|---|---|
| `theory.ipynb` | Executable demonstrations of Q/K/V attention, masks, entropy, heads, KV cache, ALiBi-style bias, and FlashAttention intuition. |
| `exercises.ipynb` | Ten checked exercises covering attention mechanics and LLM serving consequences. |
Learning Objectives
After completing this section, you will be able to:
- Define queries, keys, values, attention scores, masks, weights, and outputs.
- Compute scaled dot-product attention by hand for a small sequence.
- Apply causal and padding masks before softmax.
- Explain why attention rows are probability-like but not full explanations.
- Track multi-head tensor shapes through split, attention, concat, and output projection.
- Compute attention entropy and interpret sharp or diffuse rows.
- Explain KV cache memory and the prefill/decode distinction.
- Describe why FlashAttention is exact attention with a more efficient memory algorithm.
- Diagnose mask leakage and padding bugs.
- Connect attention math to in-context learning, RAG, long context, and safety boundaries.
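The KV-cache objective above reduces to simple arithmetic: per layer, the cache stores one key tensor and one value tensor of shape `[batch, kv_heads, seq_len, head_dim]`. A sketch with hypothetical 7B-class dimensions (the configuration numbers are illustrative, not a specific model's):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Memory for the KV cache: 2 tensors (K and V) per layer,
    each of shape [batch, n_kv_heads, seq_len, head_dim]."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical 7B-class configuration, fp16 cache (2 bytes per element):
gib = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                     seq_len=4096, batch=1) / 2**30
print(f"{gib:.1f} GiB")  # 2.0 GiB
```

Note the linear dependence on sequence length and batch: doubling either doubles cache memory, which is why long-context serving is dominated by KV-cache capacity rather than weights.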
Study Flow
- Read the pages in order and pause after each page to restate the main definition or theorem.
- Run `theory.ipynb` when you want to check the formulas numerically.
- Use `exercises.ipynb` after the reading path, not before it.
- Return to this overview page when you need the chapter-level navigation.