Math for LLMs / Embedding Space Math

Embedding Space Math, Part 2: Geometry of Meaning to Positional Information

4. Geometry of Meaning

Geometry of Meaning connects token ids to continuous vectors and prepares the exact geometry used by attention, language-model logits, and dense retrieval.

4.1 Analogy directions

Purpose. This subsection focuses on linear offsets and relational structure in embedding space. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\ell=-\log\frac{\exp(\mathbf{h}^{\top}\mathbf{w}_y)}{\sum_{j\in\mathcal{V}}\exp(\mathbf{h}^{\top}\mathbf{w}_j)}.

Operational definition.

This concept explains how discrete language symbols become continuous vectors with trainable geometry.

Worked reading.

The operational question is what shape the vector has, how it is compared, and how training changes it.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. embedding rows.
  2. hidden states.
  3. similarity search.

Non-examples:

  1. raw text in linear algebra.
  2. ids treated as distances.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
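As a concrete sketch of analogy offsets, the following toy example uses made-up 3-dimensional vectors (not real trained embeddings, and a deliberately tiny vocabulary) to show the classic king - man + woman pattern recovered by cosine search:

```python
import numpy as np

# Hypothetical toy embedding table; the vectors are hand-picked so the
# "gender" offset is shared between the (king, queen) and (man, woman) pairs.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "woman": np.array([0.5, 0.2, 0.7]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Analogy direction: king - man + woman should land near queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
```

With real embeddings the input words are excluded from the candidate set exactly as above, because the offset often lands closest to one of its own operands.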

4.2 Subspaces and probes

Purpose. This subsection focuses on where features are linearly readable in the embedding space. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\nabla_{\mathbf{w}_j}\ell=(p_j-\mathbf{1}\{j=y\})\,\mathbf{h}.

Operational definition.

Embedding spaces have geometry: norms, directions, subspaces, clusters, and dominant components. These structures can encode useful features and dataset artifacts.

Worked reading.

Centering an embedding cloud removes the mean direction; whitening rescales dominant axes so cosine neighborhoods are less dominated by global components.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. feature probes.
  2. bias directions.
  3. PCA diagnostics.

Non-examples:

  1. assuming every axis has semantic meaning.
  2. judging geometry from one 2D plot only.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
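A linear probe of the kind described above can be sketched on synthetic data. The planted feature direction and the plain gradient-descent training loop are illustrative assumptions, not a specific library API:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 400

# Synthetic "hidden states": a binary feature is written along one
# hypothetical direction, plus isotropic noise.
feature_dir = rng.normal(size=d)
feature_dir /= np.linalg.norm(feature_dir)
labels = rng.integers(0, 2, size=n)
X = rng.normal(scale=0.5, size=(n, d)) + np.outer(2 * labels - 1, feature_dir)

# Linear probe: logistic regression trained by plain gradient descent.
w = np.zeros(d)
for _ in range(200):
    p = 1 / (1 + np.exp(-X @ w))        # predicted probabilities
    w -= 0.1 * X.T @ (p - labels) / n   # cross-entropy gradient step

probe_acc = float(np.mean((X @ w > 0) == labels))
```

High probe accuracy shows the feature is linearly readable at this layer; it does not by itself show the model uses that direction causally.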

4.3 Isotropy and anisotropy

Purpose. This subsection focuses on why many embedding spaces collapse toward a few dominant directions. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{PE}_{p,2k}=\sin\left(p/10000^{2k/d}\right),\qquad \operatorname{PE}_{p,2k+1}=\cos\left(p/10000^{2k/d}\right).

Operational definition.

Embedding spaces have geometry: norms, directions, subspaces, clusters, and dominant components. These structures can encode useful features and dataset artifacts.

Worked reading.

Centering an embedding cloud removes the mean direction; whitening rescales dominant axes so cosine neighborhoods are less dominated by global components.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. feature probes.
  2. bias directions.
  3. PCA diagnostics.

Non-examples:

  1. assuming every axis has semantic meaning.
  2. judging geometry from one 2D plot only.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
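One common anisotropy diagnostic is the mean pairwise cosine similarity: near zero for an isotropic cloud, large when a shared component dominates. A minimal sketch with synthetic vectors (the shared-offset construction is an assumption mimicking the common "mean direction" artifact):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 64

def mean_pairwise_cosine(X):
    m = X.shape[0]
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    # Average over distinct pairs only (subtract the m self-similarities).
    return float((sims.sum() - m) / (m * (m - 1)))

isotropic = rng.normal(size=(n, d))
# Anisotropic cloud: same noise plus one large shared component.
anisotropic = isotropic + 4.0 * rng.normal(size=d)

iso_cos = mean_pairwise_cosine(isotropic)
aniso_cos = mean_pairwise_cosine(anisotropic)
```

When this statistic is high, cosine search mostly measures the shared component, which motivates the centering and whitening repairs in the next subsection.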

4.4 Centering and whitening

Purpose. This subsection focuses on simple geometric repairs for similarity search. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{RoPE}(\mathbf{q},p)=R_p\mathbf{q},\qquad (R_p\mathbf{q})^\top(R_s\mathbf{k})=\mathbf{q}^\top R_{s-p}\mathbf{k}.

Operational definition.

This concept explains how discrete language symbols become continuous vectors with trainable geometry.

Worked reading.

The operational question is what shape the vector has, how it is compared, and how training changes it.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. embedding rows.
  2. hidden states.
  3. similarity search.

Non-examples:

  1. raw text in linear algebra.
  2. ids treated as distances.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
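Centering and whitening can be sketched as follows. The synthetic cloud and the eigendecomposition-based whitening are illustrative choices; PCA, ZCA, and Cholesky whitening are all equivalent up to a rotation:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 400, 32

# Embeddings with a large mean offset and one dominant axis.
X = rng.normal(size=(n, d))
X[:, 0] *= 10.0          # dominant component
X += 5.0                 # global mean direction

# Centering removes the mean direction.
Xc = X - X.mean(axis=0)

# Whitening rescales the principal axes to unit variance.
cov = Xc.T @ Xc / n
eigvals, eigvecs = np.linalg.eigh(cov)
Xw = Xc @ eigvecs / np.sqrt(eigvals)   # divide each projected axis by its std

white_cov = Xw.T @ Xw / n
```

After whitening the covariance is the identity, so cosine neighborhoods are no longer dominated by the global offset or the single large axis.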

4.5 Bias and representation directions

Purpose. This subsection focuses on how embedding geometry can encode social or dataset artifacts. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{params}_{\mathrm{embed}}=|\mathcal{V}|\,d_{\mathrm{model}}.

Operational definition.

Embedding spaces have geometry: norms, directions, subspaces, clusters, and dominant components. These structures can encode useful features and dataset artifacts.

Worked reading.

Centering an embedding cloud removes the mean direction; whitening rescales dominant axes so cosine neighborhoods are less dominated by global components.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. feature probes.
  2. bias directions.
  3. PCA diagnostics.

Non-examples:

  1. assuming every axis has semantic meaning.
  2. judging geometry from one 2D plot only.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
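A difference-of-means bias direction and projection-based debiasing can be sketched on synthetic word groups. The planted `bias_dir` is an assumption for illustration; with real embeddings the groups would be curated word lists:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 32
bias_dir = rng.normal(size=d)
bias_dir /= np.linalg.norm(bias_dir)

# Two hypothetical word groups separated along a single bias direction.
group_a = rng.normal(scale=0.3, size=(50, d)) + bias_dir
group_b = rng.normal(scale=0.3, size=(50, d)) - bias_dir

# Estimate the bias direction as the difference of group means.
est = group_a.mean(axis=0) - group_b.mean(axis=0)
est /= np.linalg.norm(est)
alignment = abs(float(est @ bias_dir))

# Debias by projecting the estimated direction out of every vector.
def project_out(X, v):
    return X - np.outer(X @ v, v)

gap_after = float(np.linalg.norm(
    project_out(group_a, est).mean(axis=0)
    - project_out(group_b, est).mean(axis=0)))
```

Projecting out one direction removes the mean separation by construction; it does not guarantee the feature is unrecoverable from the remaining subspace.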

5. Training Embeddings

Training Embeddings connects token ids to continuous vectors and prepares the exact geometry used by attention, language-model logits, and dense retrieval.

5.1 Language-model loss gradients

Purpose. This subsection focuses on how the cross-entropy loss updates embedding and output rows. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\nabla_{\mathbf{w}_j}\ell=(p_j-\mathbf{1}\{j=y\})\,\mathbf{h}.

Operational definition.

Embedding training moves rows according to prediction errors and co-occurrence structure. Frequent tokens receive many updates, and rare tokens may remain poorly estimated.

Worked reading.

For a softmax output row $\mathbf{w}_j$, the gradient is proportional to $(p_j-\mathbf{1}\{j=y\})\,\mathbf{h}$.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. language-model cross-entropy.
  2. negative sampling.
  3. co-occurrence factorization.

Non-examples:

  1. hand-written semantic coordinates.
  2. frozen random rows with no adaptation.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
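The gradient formula can be checked numerically against the cross-entropy loss with a finite-difference probe (toy sizes and random data, purely for verification):

```python
import numpy as np

rng = np.random.default_rng(4)
V, d = 6, 8
W = rng.normal(size=(V, d))   # output embedding rows
h = rng.normal(size=d)        # hidden state
y = 2                         # target token id

def loss(W):
    logits = W @ h
    logp = logits - np.log(np.sum(np.exp(logits)))
    return -logp[y]

# Analytic gradient for every row: (p_j - 1{j=y}) * h.
p = np.exp(W @ h); p /= p.sum()
analytic = (p - np.eye(V)[y])[:, None] * h[None, :]

# Central finite difference on one entry of W.
eps = 1e-6
j, k = 0, 3
Wp = W.copy(); Wp[j, k] += eps
Wm = W.copy(); Wm[j, k] -= eps
numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
grad_err = abs(numeric - analytic[j, k])
```

The same check works for the input-embedding gradient; only the chain through $\mathbf{h}$ changes.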

5.2 Word2vec and negative sampling

Purpose. This subsection focuses on predictive embeddings that predate transformers. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{PE}_{p,2k}=\sin\left(p/10000^{2k/d}\right),\qquad \operatorname{PE}_{p,2k+1}=\cos\left(p/10000^{2k/d}\right).

Operational definition.

Embedding training moves rows according to prediction errors and co-occurrence structure. Frequent tokens receive many updates, and rare tokens may remain poorly estimated.

Worked reading.

For a softmax output row $\mathbf{w}_j$, the gradient is proportional to $(p_j-\mathbf{1}\{j=y\})\,\mathbf{h}$.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. language-model cross-entropy.
  2. negative sampling.
  3. co-occurrence factorization.

Non-examples:

  1. hand-written semantic coordinates.
  2. frozen random rows with no adaptation.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
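The skip-gram negative-sampling objective can be sketched directly from its definition, -log sigma(u_pos . v_c) - sum over negatives of log sigma(-u_neg . v_c), using toy 2-D vectors chosen by hand:

```python
import numpy as np

def sgns_loss(center, context_pos, context_negs):
    """Skip-gram negative-sampling loss for one (center, context) pair."""
    sig = lambda x: 1 / (1 + np.exp(-x))
    loss = -np.log(sig(context_pos @ center))       # pull the true context in
    for neg in context_negs:
        loss -= np.log(sig(-(neg @ center)))        # push sampled negatives away
    return float(loss)

v_c = np.array([1.0, 0.0])
u_pos = np.array([2.0, 0.0])     # agrees with the center vector -> low loss
u_neg = np.array([-2.0, 0.0])    # disagrees -> an easy negative
good = sgns_loss(v_c, u_pos, [u_neg])
bad = sgns_loss(v_c, u_neg, [u_pos])   # roles swapped -> high loss
```

Each update therefore moves the center row toward observed contexts and away from sampled ones, which is how co-occurrence structure enters the geometry.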

5.3 GloVe and co-occurrence factorization

Purpose. This subsection focuses on global co-occurrence count structure. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{RoPE}(\mathbf{q},p)=R_p\mathbf{q},\qquad (R_p\mathbf{q})^\top(R_s\mathbf{k})=\mathbf{q}^\top R_{s-p}\mathbf{k}.

Operational definition.

Embedding training moves rows according to prediction errors and co-occurrence structure. Frequent tokens receive many updates, and rare tokens may remain poorly estimated.

Worked reading.

For a softmax output row $\mathbf{w}_j$, the gradient is proportional to $(p_j-\mathbf{1}\{j=y\})\,\mathbf{h}$.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. language-model cross-entropy.
  2. negative sampling.
  3. co-occurrence factorization.

Non-examples:

  1. hand-written semantic coordinates.
  2. frozen random rows with no adaptation.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
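One term of the GloVe objective, f(X_ij) (w_i . w_j + b_i + b_j - log X_ij)^2, can be sketched with toy vectors; the x_max and alpha values follow the commonly cited defaults from the original paper:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # GloVe's weighting: downweights rare counts, caps frequent ones at 1.
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_term(w_i, w_j, b_i, b_j, x_ij):
    # One weighted least-squares term of the GloVe objective.
    return glove_weight(x_ij) * (w_i @ w_j + b_i + b_j - np.log(x_ij)) ** 2

w_i = np.array([0.5, 0.5]); w_j = np.array([1.0, 1.0])
x_ij = 50.0
# Choose biases so dot product + biases exactly equals log count: zero loss.
b_i = b_j = (np.log(x_ij) - w_i @ w_j) / 2
fit_term = glove_term(w_i, w_j, b_i, b_j, x_ij)
off_term = glove_term(w_i, w_j, b_i, b_j, 500.0)  # mismatched count -> loss
```

Because the target is a log count, GloVe is factorizing global co-occurrence statistics rather than making per-window predictions.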

5.4 Fine-tuning drift

Purpose. This subsection focuses on why embeddings move during adaptation. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{params}_{\mathrm{embed}}=|\mathcal{V}|\,d_{\mathrm{model}}.

Operational definition.

This concept explains how discrete language symbols become continuous vectors with trainable geometry.

Worked reading.

The operational question is what shape the vector has, how it is compared, and how training changes it.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. embedding rows.
  2. hidden states.
  3. similarity search.

Non-examples:

  1. raw text in linear algebra.
  2. ids treated as distances.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
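Per-row cosine similarity between two checkpoints is a simple drift diagnostic. Here the "fine-tuned" table is simulated: a few hypothetical domain rows are moved a lot and the rest barely, which stands in for real checkpoint pairs:

```python
import numpy as np

rng = np.random.default_rng(5)
V, d = 100, 16
E_before = rng.normal(size=(V, d))

# Simulated fine-tuning: small noise everywhere, large moves on a few rows.
E_after = E_before + 0.01 * rng.normal(size=(V, d))
moved = [3, 7, 42]
E_after[moved] += rng.normal(size=(len(moved), d))

def row_cosines(A, B):
    num = np.sum(A * B, axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return num / den

cos = row_cosines(E_before, E_after)
drifted = np.argsort(cos)[:3]   # lowest cosine = most drift
```

Ranking rows by drift and inspecting which tokens moved is often enough to spot domain shift or tokenizer mismatches after adaptation.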

5.5 Vocabulary resizing

Purpose. This subsection focuses on how new rows are initialized and trained. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

E\in\mathbb{R}^{|\mathcal{V}|\times d_{\mathrm{model}}},\qquad \mathbf{x}_t=E_{i_t,:}.

Operational definition.

This concept explains how discrete language symbols become continuous vectors with trainable geometry.

Worked reading.

The operational question is what shape the vector has, how it is compared, and how training changes it.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. embedding rows.
  2. hidden states.
  3. similarity search.

Non-examples:

  1. raw text in linear algebra.
  2. ids treated as distances.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
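A common resizing heuristic, initializing new rows near the mean of the existing rows so their logits start close to an "average token", can be sketched as follows (toy sizes; real frameworks wrap this in a resize call, but the underlying array operation is the same idea):

```python
import numpy as np

rng = np.random.default_rng(6)
old_V, d, new_tokens = 1000, 32, 4
E = rng.normal(size=(old_V, d))

# Initialize new rows at the mean of existing rows, plus small noise to
# break symmetry between the new tokens.
mean_row = E.mean(axis=0)
new_rows = mean_row + 0.01 * rng.normal(size=(new_tokens, d))
E_resized = np.vstack([E, new_rows])
```

If the input and output embeddings are tied, the same new rows also become logit rows, which is one reason mean initialization is preferred over large random values.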

6. Positional Information

Positional Information connects token ids to continuous vectors and prepares the exact geometry used by attention, language-model logits, and dense retrieval.

6.1 Why positions are needed

Purpose. This subsection focuses on why positions are needed: self-attention without order information is permutation-equivariant. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{PE}_{p,2k}=\sin\left(p/10000^{2k/d}\right),\qquad \operatorname{PE}_{p,2k+1}=\cos\left(p/10000^{2k/d}\right).

Operational definition.

Position mechanisms inject order into otherwise content-based attention. They may be additive vectors, rotations, or attention-score biases.

Worked reading.

Sinusoidal encodings add fixed frequency features; RoPE rotates query/key pairs; ALiBi adds a distance-dependent bias to attention logits.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. sinusoidal encodings.
  2. learned position rows.
  3. RoPE.
  4. ALiBi.

Non-examples:

  1. bag-of-token attention with no order.
  2. token ids used as positions.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
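Permutation equivariance of position-free attention can be verified directly with a single unmasked head and random weights: permuting the input rows just permutes the output rows, so no position is special.

```python
import numpy as np

rng = np.random.default_rng(7)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(X):
    # Single-head scaled dot-product attention, no mask, no positions.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    S = Q @ K.T / np.sqrt(d)
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

perm = rng.permutation(T)
equivariant = bool(np.allclose(attend(X[perm]), attend(X)[perm]))
```

A causal mask already breaks this symmetry partially, but without a position signal the model still cannot distinguish reorderings of its visible prefix.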

6.2 Sinusoidal positional encodings

Purpose. This subsection focuses on fixed-frequency position features. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{RoPE}(\mathbf{q},p)=R_p\mathbf{q},\qquad (R_p\mathbf{q})^\top(R_s\mathbf{k})=\mathbf{q}^\top R_{s-p}\mathbf{k}.

Operational definition.

Position mechanisms inject order into otherwise content-based attention. They may be additive vectors, rotations, or attention-score biases.

Worked reading.

Sinusoidal encodings add fixed frequency features; RoPE rotates query/key pairs; ALiBi adds a distance-dependent bias to attention logits.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. sinusoidal encodings.
  2. learned position rows.
  3. RoPE.
  4. ALiBi.

Non-examples:

  1. bag-of-token attention with no order.
  2. token ids used as positions.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
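The sinusoidal formula above can be implemented in a few lines (assuming an even model dimension and the interleaved sin/cos layout):

```python
import numpy as np

def sinusoidal_pe(max_pos, d):
    # PE[p, 2k] = sin(p / 10000^(2k/d)), PE[p, 2k+1] = cos(p / 10000^(2k/d))
    pos = np.arange(max_pos)[:, None]
    k = np.arange(d // 2)[None, :]
    angle = pos / (10000.0 ** (2 * k / d))
    pe = np.zeros((max_pos, d))
    pe[:, 0::2] = np.sin(angle)   # even dims: sine features
    pe[:, 1::2] = np.cos(angle)   # odd dims: cosine features
    return pe

pe = sinusoidal_pe(128, 64)
```

Because each frequency pair is a rotation of position, a fixed linear map can express PE(p + k) in terms of PE(p), which is the usual argument for relative-offset readability.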

6.3 Learned positional embeddings

Purpose. This subsection focuses on trainable position rows. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{params}_{\mathrm{embed}}=|\mathcal{V}|\,d_{\mathrm{model}}.

Operational definition.

Position mechanisms inject order into otherwise content-based attention. They may be additive vectors, rotations, or attention-score biases.

Worked reading.

Sinusoidal encodings add fixed frequency features; RoPE rotates query/key pairs; ALiBi adds a distance-dependent bias to attention logits.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. sinusoidal encodings.
  2. learned position rows.
  3. RoPE.
  4. ALiBi.

Non-examples:

  1. bag-of-token attention with no order.
  2. token ids used as positions.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
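The normalization advice above can be sketched as follows; the vectors are illustrative:

```python
import math

def normalize(v):
    """Scale v to unit L2 norm so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a, b = [3.0, 4.0], [6.0, 8.0]
# Same direction, different norms: raw dot products differ, cosine does not.
cos = sum(x * y for x, y in zip(normalize(a), normalize(b)))
print(round(cos, 6))  # 1.0
```

Skipping this step silently mixes norm into the ranking, which is only correct if the training objective used unnormalized dot products.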

6.4 Rotary positional embeddings

Purpose. Rotary positional embeddings (RoPE) focus on encoding position as complex-plane rotations of query and key coordinate pairs. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\begin{pmatrix}x'_{2i}\\x'_{2i+1}\end{pmatrix}=\begin{pmatrix}\cos(p\theta_i)&-\sin(p\theta_i)\\\sin(p\theta_i)&\cos(p\theta_i)\end{pmatrix}\begin{pmatrix}x_{2i}\\x_{2i+1}\end{pmatrix},\qquad\theta_i=10000^{-2i/d}.

Operational definition.

RoPE encodes position by rotating query and key coordinate pairs. The resulting attention dot product depends on relative offset through the rotation difference.

Worked reading.

A vector at position pp is rotated by angles tied to pp and frequency index, so attention can represent relative distance without adding a separate position vector.
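A minimal RoPE sketch in plain Python, assuming the standard frequency base of 10000 and an even head dimension; the final check confirms that the query-key dot product depends only on the relative offset of the two positions:

```python
import math

BASE = 10000.0  # frequency base used by standard RoPE

def rope(vec, pos):
    """Rotate each (even, odd) coordinate pair of vec by pos * theta_i."""
    out = []
    for i in range(0, len(vec), 2):
        theta = BASE ** (-i / len(vec))  # theta_i = 10000^(-2i/d) per pair
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

q = [1.0, 0.0, 1.0, 0.0]
# Dot products at offsets (5,3) and (7,5) agree: both see relative distance 2.
d1 = sum(a * b for a, b in zip(rope(q, 5), rope(q, 3)))
d2 = sum(a * b for a, b in zip(rope(q, 7), rope(q, 5)))
print(abs(d1 - d2) < 1e-9)  # True
```

No separate position vector is ever added; the position lives entirely in the rotation applied to queries and keys.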


Examples:

  1. Llama-style position handling.
  2. long-context extrapolation studies.
  3. relative offset in attention.

Non-examples:

  1. absolute learned position rows.
  2. ALiBi scalar bias only.


6.5 ALiBi and relative biases

Purpose. ALiBi and relative biases focus on linear distance penalties added directly to attention scores. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

a_{ij}=\frac{\mathbf{q}_i^{\top}\mathbf{k}_j}{\sqrt{d_k}}-m_h\,(i-j),\qquad j\le i.

Operational definition.

ALiBi keeps token embeddings position-free and instead subtracts a head-specific linear penalty, proportional to the query-key distance, from each attention logit.

Worked reading.

Each head $h$ has a fixed slope $m_h$; the logit for query $i$ attending to key $j$ becomes $\mathbf{q}_i^{\top}\mathbf{k}_j/\sqrt{d_k}-m_h(i-j)$, so more distant keys are penalized, with some heads decaying faster than others.
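A minimal sketch of the causal bias matrix, with illustrative head slopes:

```python
def alibi_bias(slope, n):
    """Lower-triangular ALiBi bias: -slope * (i - j) for each key j <= query i."""
    return [[-slope * (i - j) for j in range(i + 1)] for i in range(n)]

# One slope per head; the original ALiBi recipe uses a geometric sequence
# (e.g. 1/2, 1/4, ... for 8 heads). Four heads here, for illustration.
slopes = [2.0 ** -(h + 1) for h in range(4)]

bias = alibi_bias(slopes[0], 4)
print(bias[3])  # [-1.5, -1.0, -0.5, -0.0]
```

The bias is added to the attention logits before softmax; no vectors are rotated and no rows are looked up, which is why ALiBi extrapolates to longer sequences cheaply.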


Examples:

  1. BLOOM-style attention biases.
  2. length extrapolation past the training context.
  3. head-specific distance slopes.

Non-examples:

  1. RoPE rotations of queries and keys.
  2. additive learned position vectors.

