Positional Encodings, Part 1: Intuition through Learned Absolute Positions
1. Intuition
Intuition explains how transformer sequence order is represented in hidden states or attention scores.
1.1 Why attention needs position
Purpose. Why attention needs position focuses on the fact that self-attention is permutation-equivariant without an order signal. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
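The permutation point is easy to check directly. Below is a minimal NumPy sketch (toy dimensions and random weights are assumptions for illustration): with no position signal added, shuffling the input tokens shuffles the self-attention outputs in exactly the same way, so order is invisible to the layer.

```python
# Minimal sketch: self-attention with no position signal is
# permutation-equivariant, so permuting the tokens just permutes the outputs.
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                       # toy sequence length and model width
X = rng.normal(size=(T, d))       # token embeddings, no position added
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    return softmax(scores) @ V

perm = rng.permutation(T)
out_then_perm = attend(X)[perm]    # attend, then permute outputs
perm_then_out = attend(X[perm])    # permute inputs, then attend
print(np.allclose(out_then_perm, perm_then_out))  # True: order is invisible
```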
1.2 Absolute versus relative position
Purpose. Absolute versus relative position focuses on index identity versus pairwise offset. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Relative position methods make attention depend on pairwise distance instead of only absolute index identity.
Worked reading.
The same offset can share parameters across many positions, which is useful when patterns repeat across the sequence.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- relative bias.
- relative keys.
- bucketed distance bins.
Non-examples:
- one independent vector per absolute index.
- no order signal at all.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
1.3 Additive versus score-based position
Purpose. Additive versus score-based position focuses on where the signal enters the computation. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
1.4 Length extrapolation
Purpose. Length extrapolation focuses on why training length and inference length can diverge. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
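One of the small tests mentioned above is a retrieval probe at several depths. The sketch below is hypothetical scaffolding: the filler sentence, needle format, and chosen depths are assumptions, and `ask_model` is a stub for whatever serving API is actually in use.

```python
# Hypothetical position-sensitivity probe: plant a "needle" fact at several
# depths of a long filler context and ask the model to retrieve it.
FILLER = "The sky was clear and the meeting ran long. "
NEEDLE = "The secret code is {code}. "
QUESTION = "\nWhat is the secret code? Answer with the code only."

def build_probe(total_sentences: int, depth_fraction: float, code: str) -> str:
    insert_at = int(total_sentences * depth_fraction)
    parts = [FILLER] * total_sentences
    parts.insert(insert_at, NEEDLE.format(code=code))
    return "".join(parts) + QUESTION

def ask_model(prompt: str) -> str:
    raise NotImplementedError("call your model here")

if __name__ == "__main__":
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        prompt = build_probe(total_sentences=400, depth_fraction=depth, code="7413")
        # correct = ask_model(prompt).strip() == "7413"
        print(f"depth={depth:.2f} prompt_chars={len(prompt)}")
```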
1.5 Position in decoder-only LLMs
Purpose. Position in decoder-only LLMs focuses on causal order and generated prefixes. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
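The decode-loop bug mentioned above is concrete enough to sketch. The snippet assumes a Hugging Face-style interface (`position_ids`, `past_key_values`, `logits`); the exact argument names are assumptions, but the bookkeeping is the point: each new token's position id must continue the sequence, not reset to zero or stay at the prompt length.

```python
# Sketch of position-id bookkeeping in a cached greedy decode loop.
import torch

def decode(model, prompt_ids: torch.Tensor, max_new_tokens: int):
    # prompt_ids: (1, prompt_len)
    past = None
    input_ids = prompt_ids
    position_ids = torch.arange(prompt_ids.shape[1]).unsqueeze(0)
    generated = []
    for _ in range(max_new_tokens):
        out = model(input_ids=input_ids, position_ids=position_ids,
                    past_key_values=past)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)
        # Next step feeds one token whose position continues the sequence.
        input_ids = next_id
        position_ids = position_ids[:, -1:] + 1
    return torch.cat(generated, dim=1)
```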
2. Formal Setup
Formal Setup explains how transformer sequence order is represented in hidden states or attention scores.
2.1 Position indices
Purpose. Position indices focuses on how each token in a sequence is assigned an integer index. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
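A common source of wrong position ids is padding. The sketch below (left-padding convention and shapes are illustrative assumptions) derives position indices from the attention mask so padded slots do not shift real tokens to the wrong positions.

```python
# Deriving position indices from an attention mask under left padding.
import torch

attention_mask = torch.tensor([
    [0, 0, 1, 1, 1],   # left-padded sequence of real length 3
    [1, 1, 1, 1, 1],   # full-length sequence
])

# Count real tokens seen so far; padding slots are clamped to position 0
# (they are masked out of attention anyway).
position_ids = attention_mask.cumsum(dim=-1) - 1
position_ids = position_ids.clamp(min=0)
print(position_ids)
# tensor([[0, 0, 0, 1, 2],
#         [0, 1, 2, 3, 4]])
```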
2.2 Token plus position representation
Purpose. Token plus position representation focuses on how token embeddings and position signals are combined into a single hidden state. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
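The additive input recipe is short enough to write out. This is a minimal sketch with toy sizes (vocabulary, width, maximum length are assumptions): the hidden state at each slot is the token row plus the position row for that slot.

```python
# Hidden state = token embedding + position embedding, per slot.
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 1000, 128, 64
tok_emb = nn.Embedding(vocab_size, d_model)
pos_emb = nn.Embedding(max_len, d_model)

input_ids = torch.randint(0, vocab_size, (2, 10))          # (batch, seq)
positions = torch.arange(input_ids.shape[1]).unsqueeze(0)  # (1, seq), broadcasts over batch
hidden = tok_emb(input_ids) + pos_emb(positions)            # (batch, seq, d_model)
print(hidden.shape)
```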
2.3 Attention score modification
Purpose. Attention score modification focuses on biases and rotations before softmax. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
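A score-based scheme can be sketched generically. The bias below is a stand-in tensor (an assumption); concrete schemes fill it differently, for example learned per-offset values or ALiBi slopes times distance. Note that the causal mask stays a separate additive term, as the derivation habit recommends.

```python
# Score-based position: pairwise bias added to QK^T/sqrt(d) before softmax.
import torch
import torch.nn.functional as F

T, d = 6, 16
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)

scores = q @ k.T / d ** 0.5                       # content term
pos_bias = torch.randn(T, T)                      # stand-in position term
causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

attn = F.softmax(scores + pos_bias + causal_mask, dim=-1)
out = attn @ v
print(out.shape)
```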
2.4 Relative offset notation
Purpose. Relative offset notation focuses on notation for the pairwise distance between query and key positions. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Relative position methods make attention depend on pairwise distance instead of only absolute index identity.
Worked reading.
The same offset can share parameters across many positions, which is useful when patterns repeat across the sequence.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- relative bias.
- relative keys.
- bucketed distance bins.
Non-examples:
- one independent vector per absolute index.
- no order signal at all.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
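The offset notation itself is just a matrix of differences. A minimal sketch (the clipping window and the one-bias-per-offset table are illustrative assumptions): offsets beyond the window share one parameter, which is what lets far positions reuse what nearby positions learned.

```python
# Relative offsets offset[i, j] = j - i, clipped to a window and used to
# index a small learned bias table.
import torch

T, max_rel = 8, 3
pos = torch.arange(T)
offsets = pos[None, :] - pos[:, None]              # (T, T), entry j - i
clipped = offsets.clamp(-max_rel, max_rel)         # far offsets share params
indices = clipped + max_rel                        # shift into [0, 2*max_rel]

bias_table = torch.nn.Embedding(2 * max_rel + 1, 1)  # one learned bias per offset
pos_bias = bias_table(indices).squeeze(-1)            # (T, T) additive score bias
print(offsets[0, :5], pos_bias.shape)
```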
2.5 Position interpolation and scaling
Purpose. Position interpolation and scaling focuses on changing effective positions for long context. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
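Interpolation-style scaling is simple at the index level. The sketch below assumes a rotary-style scheme and illustrative trained/target lengths: instead of feeding raw positions past the trained window, positions are rescaled so the longest serving length maps back inside the trained range (usually followed by fine-tuning).

```python
# Position interpolation: compress serving positions into the trained range.
import torch

trained_len, target_len = 4096, 16384
scale = trained_len / target_len                  # 0.25: compress positions

positions = torch.arange(target_len, dtype=torch.float32)
effective_positions = positions * scale           # stays within [0, trained_len)

print(positions[-1].item(), effective_positions[-1].item())  # 16383.0 4095.75
```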
3. Sinusoidal Encodings
Sinusoidal Encodings explains how transformer sequence order is represented in hidden states or attention scores.
3.1 Frequency ladder
Purpose. Frequency ladder focuses on geometric wavelengths across dimensions. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
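The ladder itself is one line of arithmetic. A minimal sketch, using the common base of 10000 as an assumption: each coordinate pair gets an inverse frequency, and the corresponding wavelengths grow geometrically from a few tokens up to roughly 2*pi*base tokens.

```python
# Frequency ladder: inverse frequency per coordinate pair and its wavelength.
import numpy as np

d_model, base = 64, 10000.0
i = np.arange(d_model // 2)
inv_freq = base ** (-2 * i / d_model)         # fast rungs first
wavelength = 2 * np.pi / inv_freq             # period in token positions

print(wavelength[:3])    # short periods: fine-grained local order
print(wavelength[-3:])   # long periods: coarse global order
```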
3.2 Sine cosine pairs
Purpose. Sine cosine pairs focuses on phase representation by coordinate pairs. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
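Putting the pairs together gives the standard table. A minimal sketch with toy sizes: each position contributes sin and cos of position times the inverse frequency, interleaved by coordinate pair, and the resulting (T, d) table is added to the hidden states.

```python
# Standard sinusoidal table: sin/cos pairs per position, interleaved by dimension.
import numpy as np

def sinusoidal_table(T: int, d_model: int, base: float = 10000.0) -> np.ndarray:
    positions = np.arange(T)[:, None]                    # (T, 1)
    i = np.arange(d_model // 2)[None, :]                 # (1, d/2)
    angles = positions * base ** (-2 * i / d_model)      # (T, d/2)
    table = np.zeros((T, d_model))
    table[:, 0::2] = np.sin(angles)
    table[:, 1::2] = np.cos(angles)
    return table

pe = sinusoidal_table(T=16, d_model=8)
print(pe.shape, pe[0, :4])   # position 0 starts [0, 1, 0, 1, ...]
```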
3.3 Linear relative-offset intuition
Purpose. Linear relative-offset intuition focuses on why fixed sinusoids can expose offsets. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Relative position methods make attention depend on pairwise distance instead of only absolute index identity.
Worked reading.
The same offset can share parameters across many positions, which is useful when patterns repeat across the sequence.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- relative bias.
- relative keys.
- bucketed distance bins.
Non-examples:
- one independent vector per absolute index.
- no order signal at all.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
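The offset intuition can be checked numerically for one sine/cosine pair. In this sketch the frequency and offset are arbitrary assumptions: a fixed 2x2 rotation that depends only on the offset k maps the pair at position p to the pair at position p + k, regardless of p, which is why a fixed linear map can read off relative offsets.

```python
# Check: pair at p+k is a fixed rotation (depending only on k) of the pair at p.
import numpy as np

omega, k = 0.03, 5                      # one frequency and one offset
R = np.array([[np.cos(omega * k),  np.sin(omega * k)],
              [-np.sin(omega * k), np.cos(omega * k)]])

for p in (0, 7, 123):
    pair_p  = np.array([np.sin(omega * p), np.cos(omega * p)])
    pair_pk = np.array([np.sin(omega * (p + k)), np.cos(omega * (p + k))])
    print(np.allclose(R @ pair_p, pair_pk))   # True for every p
```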
3.4 Visualization and aliasing
Purpose. Visualization and aliasing focuses on what high and low frequencies show. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
3.5 Limitations
Purpose. Limitations focuses on absolute addition and finite precision. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
4. Learned Absolute Positions
Learned Absolute Positions explains how transformer sequence order is represented in hidden states or attention scores.
4.1 Learned position table
Purpose. Learned position table focuses on the trainable table that holds one vector per position index. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Learned absolute position embeddings assign a trainable vector to each position index up to a maximum length.
Worked reading.
They are simple and effective in-range, but positions beyond the trained table require resizing or another strategy.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- BERT-style position rows.
- GPT-style learned absolute positions.
- interpolation for longer windows.
Non-examples:
- relative distance bias.
- rotation of Q/K coordinates.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
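The table is a plain embedding over position indices. A minimal sketch with toy sizes (the module name and dimensions are assumptions): look up one trainable row per position and add it to the token states.

```python
# Learned absolute positions: a trainable (max_len, d) table indexed by position id.
import torch
import torch.nn as nn

class LearnedPositions(nn.Module):
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.table = nn.Embedding(max_len, d_model)

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq, d_model)
        seq_len = token_states.shape[1]
        positions = torch.arange(seq_len, device=token_states.device)
        return token_states + self.table(positions)

pe = LearnedPositions(max_len=512, d_model=64)
print(pe(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 10, 64])
```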
4.2 Training length limit
Purpose. Training length limit focuses on why rows beyond training are unavailable. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Learned absolute position embeddings assign a trainable vector to each position index up to a maximum length.
Worked reading.
They are simple and effective in-range, but positions beyond the trained table require resizing or another strategy.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- BERT-style position rows.
- GPT-style learned absolute positions.
- interpolation for longer windows.
Non-examples:
- relative distance bias.
- rotation of Q/K coordinates.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
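The training-length limit is a hard boundary, not a gradual one. A minimal sketch with a toy table: indexing one position past the trained window fails outright rather than degrading gracefully, which is why serving beyond the table requires resizing or a different scheme.

```python
# A learned table has no row past max_len, so out-of-range indexing fails hard.
import torch
import torch.nn as nn

max_len, d_model = 512, 64
table = nn.Embedding(max_len, d_model)

ok = table(torch.arange(512))              # positions 0..511 exist
print(ok.shape)                            # torch.Size([512, 64])
try:
    table(torch.tensor([512]))             # one past the trained window
except IndexError as err:
    print("out of range:", err)
```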
4.3 Interpolation resizing
Purpose. Interpolation resizing focuses on adapting rows to longer contexts. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
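One way to resize is to treat the table as a signal along the position axis and resample it. The sketch below uses linear interpolation as an illustrative choice, not the only option; whether quality survives depends on how smooth the learned rows are, and models are usually fine-tuned after the resize.

```python
# Interpolation resizing: linearly resample a (max_len, d) table to a longer length.
import torch
import torch.nn.functional as F

old_table = torch.randn(512, 64)             # trained rows (toy values)
new_len = 1024

resized = F.interpolate(
    old_table.T.unsqueeze(0),                # (1, d, old_len): dims act as channels
    size=new_len,
    mode="linear",
    align_corners=True,
).squeeze(0).T                                # back to (new_len, d)
print(resized.shape)                          # torch.Size([1024, 64])
```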
4.4 BERT and GPT-style usage
Purpose. BERT and GPT-style usage focuses on where learned rows appear. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.
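The two classic recipes differ mainly in what gets summed at the input. A minimal sketch with toy sizes (all dimensions are assumptions): BERT-style encoder inputs sum word, position, and segment rows, while GPT-style decoder inputs sum word and position rows before the stack.

```python
# Where learned rows appear in the two classic recipes.
import torch
import torch.nn as nn

d = 64
word = nn.Embedding(1000, d)
pos = nn.Embedding(512, d)
seg = nn.Embedding(2, d)

ids = torch.randint(0, 1000, (1, 10))
positions = torch.arange(10).unsqueeze(0)
segments = torch.zeros(1, 10, dtype=torch.long)

bert_style = word(ids) + pos(positions) + seg(segments)   # encoder input
gpt_style = word(ids) + pos(positions)                    # decoder input
print(bert_style.shape, gpt_style.shape)
```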
4.5 Failure modes
Purpose. Failure modes focuses on length extrapolation and out-of-range ids. This is the part of transformer math that tells attention where each token is and how far apart tokens are.
Operational definition.
Position encoding injects order into transformer computation so identical tokens at different positions can have different roles.
Worked reading.
Without a position signal, self-attention can mix content but cannot know which token came first.
| Scheme | Where position enters | Typical strength | Typical risk |
|---|---|---|---|
| sinusoidal | added to hidden states | fixed and simple | limited flexibility |
| learned absolute | added learned row | strong in-range fit | weak extrapolation |
| relative bias | attention score | offset-aware | implementation complexity |
| RoPE | rotates Q/K | relative dot products | scaling choices matter |
| ALiBi | attention score bias | simple extrapolation | distance prior may be too rigid |
Examples:
- sinusoidal features.
- position rows.
- relative offsets.
Non-examples:
- bag-of-words attention.
- token ids used as positions.
Derivation habit.
- State whether the scheme is absolute or relative.
- State whether the signal is added to hidden states, applied to Q/K, or added to scores.
- Check shape compatibility with heads, sequence length, and cached decoding.
- Test length behavior beyond the training context if the model will be served there.
- Keep masks separate from position signals.
Implementation lens.
Position bugs are subtle because tensor shapes can still look correct. A decode loop can run while assigning the wrong position id to every generated token. A long-context model can accept a large input while quality collapses in the middle of the context.
The practical defense is small tests: compare attention with shifted inputs, verify RoPE norm preservation, check ALiBi bias values by distance, and run retrieval probes at several positions in the context window.