Lesson overview | Previous part | Next part
Embedding Space Math, Part 4 (Geometry of Meaning) through Part 6 (Positional Information)
4. Geometry of Meaning
Geometry of Meaning connects token ids to continuous vectors and prepares the exact geometry used by attention, language-model logits, and dense retrieval.
4.1 Analogy directions
Purpose. Analogy directions focus on linear offsets and relational structure. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
This concept explains how discrete language symbols become continuous vectors with trainable geometry.
Worked reading.
The operational question is what shape the vector has, how it is compared, and how training changes it.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | index into the embedding table |
| embedding table | $\|\mathcal{V}\| \times d$ matrix | maps token ids to vectors |
| hidden states | contextual vectors after lookup and layers | feed attention, the LM head, or retrieval |
| LM head | $d \times \|\mathcal{V}\|$ matrix, often tied to the embedding table | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- embedding rows.
- hidden states.
- similarity search.
Non-examples:
- raw text in linear algebra.
- ids treated as distances.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
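As a concrete sketch of the analogy-offset recipe, the toy example below builds a hand-made 2-D table in which a "royalty" and a "gender" direction exist by construction; in a trained space such directions emerge (imperfectly) from data. The vocabulary and helper names are illustrative.

```python
import numpy as np

# Toy 2-D space built by hand: axis 0 ~ "royalty", axis 1 ~ "gender".
vocab = ["king", "queen", "man", "woman"]
E = np.array([
    [1.0,  1.0],   # king:  royal, male
    [1.0, -1.0],   # queen: royal, female
    [0.0,  1.0],   # man:   male
    [0.0, -1.0],   # woman: female
])
idx = {t: i for i, t in enumerate(vocab)}

def nearest(vec, exclude):
    # Cosine similarity against every row, skipping the query words.
    sims = E @ vec / (np.linalg.norm(E, axis=1) * np.linalg.norm(vec) + 1e-9)
    return next(vocab[i] for i in np.argsort(-sims) if vocab[i] not in exclude)

# Analogy as a linear offset: king - man + woman lands on queen here.
target = E[idx["king"]] - E[idx["man"]] + E[idx["woman"]]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```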
4.2 Subspaces and probes
Purpose. Subspaces and probes focus on where features are linearly readable. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
Embedding spaces have geometry: norms, directions, subspaces, clusters, and dominant components. These structures can encode useful features and dataset artifacts.
Worked reading.
Centering an embedding cloud removes the mean direction; whitening rescales dominant axes so cosine neighborhoods are less dominated by global components.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | index into the embedding table |
| embedding table | $\|\mathcal{V}\| \times d$ matrix | maps token ids to vectors |
| hidden states | contextual vectors after lookup and layers | feed attention, the LM head, or retrieval |
| LM head | $d \times \|\mathcal{V}\|$ matrix, often tied to the embedding table | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- feature probes.
- bias directions.
- PCA diagnostics.
Non-examples:
- assuming every axis has semantic meaning.
- judging geometry from one 2D plot only.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
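A linear probe makes "linearly readable" operational: freeze the vectors, fit a plain linear classifier, and check whether it separates the labels. The sketch below uses synthetic embeddings with a feature direction injected on purpose; the dimensions and names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))                 # stand-in "embeddings"
labels = (rng.random(200) > 0.5).astype(int)   # synthetic binary feature
feature_dir = rng.normal(size=64)
X += np.outer(labels * 2 - 1, feature_dir)     # inject the feature along one direction

# If a linear classifier on the frozen vectors recovers the labels,
# the feature is linearly readable in this representation.
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("probe accuracy:", probe.score(X, labels))
```

High probe accuracy shows the feature is present and linearly accessible; it does not by itself show that the model uses it, which is the usual caveat for probing.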
4.3 Isotropy and anisotropy
Purpose. Isotropy and anisotropy focus on why many embeddings collapse into dominant directions. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
Embedding spaces have geometry: norms, directions, subspaces, clusters, and dominant components. These structures can encode useful features and dataset artifacts.
Worked reading.
Centering an embedding cloud removes the mean direction; whitening rescales dominant axes so cosine neighborhoods are less dominated by global components.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | index into the embedding table |
| embedding table | $\|\mathcal{V}\| \times d$ matrix | maps token ids to vectors |
| hidden states | contextual vectors after lookup and layers | feed attention, the LM head, or retrieval |
| LM head | $d \times \|\mathcal{V}\|$ matrix, often tied to the embedding table | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- feature probes.
- bias directions.
- PCA diagnostics.
Non-examples:
- assuming every axis has semantic meaning.
- judging geometry from one 2D plot only.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
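One way to quantify anisotropy, sketched below on a synthetic cloud: the average cosine similarity between random pairs of vectors sits near zero for an isotropic cloud and climbs toward one when a shared component dominates. The data and sizes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
common = 5.0 * rng.normal(size=64)                 # dominant shared direction
E = rng.normal(size=(1000, 64)) + common           # anisotropic cloud

def mean_pairwise_cosine(E, n_pairs=5000):
    U = E / np.linalg.norm(E, axis=1, keepdims=True)
    i, j = rng.integers(0, len(E), size=(2, n_pairs))
    return float(np.mean(np.sum(U[i] * U[j], axis=1)))

print("raw:", mean_pairwise_cosine(E))                        # close to 1
print("centered:", mean_pairwise_cosine(E - E.mean(axis=0)))  # close to 0
```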
4.4 Centering and whitening
Purpose. Centering and whitening focus on simple geometry repairs for similarity search. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
This concept explains how discrete language symbols become continuous vectors with trainable geometry.
Worked reading.
The operational question is what shape the vector has, how it is compared, and how training changes it.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | index into the embedding table |
| embedding table | $\|\mathcal{V}\| \times d$ matrix | maps token ids to vectors |
| hidden states | contextual vectors after lookup and layers | feed attention, the LM head, or retrieval |
| LM head | $d \times \|\mathcal{V}\|$ matrix, often tied to the embedding table | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- embedding rows.
- hidden states.
- similarity search.
Non-examples:
- raw text in linear algebra.
- ids treated as distances.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
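A minimal sketch of centering and PCA whitening on a synthetic anisotropic cloud; the important detail is that the query passes through exactly the same transform as the corpus before cosine search. Names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(500, 32)) + 3.0 * rng.normal(size=32)   # anisotropic cloud

# Centering: remove the mean direction.
mu = E.mean(axis=0)
Ec = E - mu

# Whitening: rotate onto principal axes and rescale each to unit variance.
eigvals, eigvecs = np.linalg.eigh(np.cov(Ec, rowvar=False))
W = eigvecs / np.sqrt(eigvals + 1e-8)
Ew = Ec @ W

# The query must go through the same transform before cosine search.
q = rng.normal(size=32)
qw = (q - mu) @ W
sims = Ew @ qw / (np.linalg.norm(Ew, axis=1) * np.linalg.norm(qw))
print("top neighbor index:", int(np.argmax(sims)))
```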
4.5 Bias and representation directions
Purpose. Bias and representation directions focus on why geometry can encode social or dataset artifacts. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
Embedding spaces have geometry: norms, directions, subspaces, clusters, and dominant components. These structures can encode useful features and dataset artifacts.
Worked reading.
Centering an embedding cloud removes the mean direction; whitening rescales dominant axes so cosine neighborhoods are less dominated by global components.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | index into the embedding table |
| embedding table | $\|\mathcal{V}\| \times d$ matrix | maps token ids to vectors |
| hidden states | contextual vectors after lookup and layers | feed attention, the LM head, or retrieval |
| LM head | $d \times \|\mathcal{V}\|$ matrix, often tied to the embedding table | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- feature probes.
- bias directions.
- PCA diagnostics.
Non-examples:
- assuming every axis has semantic meaning.
- judging geometry from one 2D plot only.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
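A hedged sketch of the classic direction-based recipe: estimate a candidate bias direction from a few definitional pairs, then project it out of a target vector. The vectors here are random, so this shows the mechanics only; it is not a validated debiasing procedure, and real bias can live in more than one direction.

```python
import numpy as np

rng = np.random.default_rng(0)
E = {w: rng.normal(size=50) for w in ["he", "she", "man", "woman", "doctor"]}

# Candidate bias direction: average difference over definitional pairs.
pairs = [("he", "she"), ("man", "woman")]
direction = np.mean([E[a] - E[b] for a, b in pairs], axis=0)
direction /= np.linalg.norm(direction)

def remove_component(v, d):
    # Project out the component of v along unit direction d.
    return v - np.dot(v, d) * d

debiased = remove_component(E["doctor"], direction)
print("projection after removal:", float(np.dot(debiased, direction)))  # ~0
```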
5. Training Embeddings
Training Embeddings explains how gradient updates shape this geometry: the same rows later feed attention, language-model logits, and dense retrieval.
5.1 Language-model loss gradients
Purpose. Language-model loss gradients focus on how cross-entropy updates rows. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
Embedding training moves rows according to prediction errors and co-occurrence structure. Frequent tokens receive many updates, and rare tokens may remain poorly estimated.
Worked reading.
For a softmax over logits $z_v = e_v^\top h$, the cross-entropy gradient with respect to output row $e_v$ is proportional to $(p_v - \mathbf{1}[v = y])\,h$: the correct token's row is pulled toward the hidden state, and every other row is pushed away in proportion to its predicted probability.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | index into the embedding table |
| embedding table | $\|\mathcal{V}\| \times d$ matrix | maps token ids to vectors |
| hidden states | contextual vectors after lookup and layers | feed attention, the LM head, or retrieval |
| LM head | $d \times \|\mathcal{V}\|$ matrix, often tied to the embedding table | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- language-model cross-entropy.
- negative sampling.
- co-occurrence factorization.
Non-examples:
- hand-written semantic coordinates.
- frozen random rows with no adaptation.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
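The sketch below spells out that gradient in numpy: softmax the logits, subtract the one-hot target, and take the outer product with the hidden state. The shapes, learning rate, and target index are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 8
W_out = 0.1 * rng.normal(size=(V, d))   # output embedding rows (LM head)
h = rng.normal(size=d)                  # hidden state at one position
y = 3                                   # correct next-token id

# Forward pass: logits and softmax probabilities.
logits = W_out @ h
p = np.exp(logits - logits.max())
p /= p.sum()

# d(loss)/d(logits) = p - one_hot(y), so row v moves by (p_v - 1[v == y]) * h:
# the correct row is pulled toward h, every other row is pushed away.
grad_logits = p.copy()
grad_logits[y] -= 1.0
W_out -= 0.1 * np.outer(grad_logits, h)
```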
5.2 Word2vec and negative sampling
Purpose. Word2vec and negative sampling focus on predictive embeddings before transformers. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
Embedding training moves rows according to prediction errors and co-occurrence structure. Frequent tokens receive many updates, and rare tokens may remain poorly estimated.
Worked reading.
For a softmax over logits $z_v = e_v^\top h$, the cross-entropy gradient with respect to output row $e_v$ is proportional to $(p_v - \mathbf{1}[v = y])\,h$: the correct token's row is pulled toward the hidden state, and every other row is pushed away in proportion to its predicted probability.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | index into the embedding table |
| embedding table | $\|\mathcal{V}\| \times d$ matrix | maps token ids to vectors |
| hidden states | contextual vectors after lookup and layers | feed attention, the LM head, or retrieval |
| LM head | $d \times \|\mathcal{V}\|$ matrix, often tied to the embedding table | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- language-model cross-entropy.
- negative sampling.
- co-occurrence factorization.
Non-examples:
- hand-written semantic coordinates.
- frozen random rows with no adaptation.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
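A compact skip-gram-with-negative-sampling step, assuming toy tables and uniformly sampled negatives (real implementations draw negatives from a smoothed unigram distribution): the observed context vector is pulled toward the center vector, the sampled negatives are pushed away.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 100, 16
W_in = 0.1 * rng.normal(size=(V, d))    # center-word vectors
W_out = 0.1 * rng.normal(size=(V, d))   # context-word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives, lr=0.05):
    v = W_in[center]
    grad_v = np.zeros_like(v)
    # Label 1 for the true context word, 0 for each sampled negative.
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = W_out[word]
        g = sigmoid(np.dot(u, v)) - label   # gradient of the logistic loss
        grad_v += g * u
        W_out[word] -= lr * g * v
    W_in[center] -= lr * grad_v

sgns_step(center=5, context=7, negatives=rng.integers(0, V, size=5).tolist())
```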
5.3 GloVe and co-occurrence factorization
Purpose. GloVe and co-occurrence factorization focus on global count structure. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
Embedding training moves rows according to prediction errors and co-occurrence structure. Frequent tokens receive many updates, and rare tokens may remain poorly estimated.
Worked reading.
For a softmax over logits $z_v = e_v^\top h$, the cross-entropy gradient with respect to output row $e_v$ is proportional to $(p_v - \mathbf{1}[v = y])\,h$: the correct token's row is pulled toward the hidden state, and every other row is pushed away in proportion to its predicted probability.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | index into the embedding table |
| embedding table | $\|\mathcal{V}\| \times d$ matrix | maps token ids to vectors |
| hidden states | contextual vectors after lookup and layers | feed attention, the LM head, or retrieval |
| LM head | $d \times \|\mathcal{V}\|$ matrix, often tied to the embedding table | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- language-model cross-entropy.
- negative sampling.
- co-occurrence factorization.
Non-examples:
- hand-written semantic coordinates.
- frozen random rows with no adaptation.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
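A toy full-batch sketch of the GloVe objective on synthetic counts: fit word and context vectors plus biases so their dot product approximates log co-occurrence, with a weighting function that damps rare pairs. Counts, sizes, and learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 50, 8
X = rng.poisson(2.0, size=(V, V)) + 1          # toy co-occurrence counts (>= 1)
W = 0.1 * rng.normal(size=(V, d))              # word vectors
C = 0.1 * rng.normal(size=(V, d))              # context vectors
bw, bc = np.zeros(V), np.zeros(V)

def weight(x, x_max=100.0, alpha=0.75):
    # Damp the influence of rare pairs, cap the influence of frequent ones.
    return np.minimum((x / x_max) ** alpha, 1.0)

lr, f = 0.05, weight(X)
for _ in range(200):
    diff = W @ C.T + bw[:, None] + bc[None, :] - np.log(X)
    gW, gC = (f * diff) @ C / V, (f * diff).T @ W / V
    gbw, gbc = (f * diff).sum(axis=1) / V, (f * diff).sum(axis=0) / V
    W -= lr * gW; C -= lr * gC; bw -= lr * gbw; bc -= lr * gbc

print("weighted loss:", float(0.5 * np.sum(f * diff ** 2)))
```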
5.4 Fine-tuning drift
Purpose. Fine-tuning drift focuses on why embeddings move during adaptation. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
This concept explains how discrete language symbols become continuous vectors with trainable geometry.
Worked reading.
The operational question is what shape the vector has, how it is compared, and how training changes it.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | index into the embedding table |
| embedding table | $\|\mathcal{V}\| \times d$ matrix | maps token ids to vectors |
| hidden states | contextual vectors after lookup and layers | feed attention, the LM head, or retrieval |
| LM head | $d \times \|\mathcal{V}\|$ matrix, often tied to the embedding table | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- embedding rows.
- hidden states.
- similarity search.
Non-examples:
- raw text in linear algebra.
- ids treated as distances.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
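One simple way to watch drift, sketched here with a simulated before/after pair of tables: per-token cosine between old and new rows flags which parts of the vocabulary moved most during adaptation. The simulated drift pattern is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 64
E_before = rng.normal(size=(V, d))
# Simulated adaptation: lower token ids drift more than higher ones.
scale = np.linspace(0.5, 0.01, V)[:, None]
E_after = E_before + scale * rng.normal(size=(V, d))

def row_normalize(M):
    return M / np.linalg.norm(M, axis=1, keepdims=True)

# Drift per token = 1 - cosine(before, after); large values flag rows whose
# neighborhoods (and downstream logits) may have shifted.
drift = 1.0 - np.sum(row_normalize(E_before) * row_normalize(E_after), axis=1)
print("most drifted token ids:", np.argsort(-drift)[:5])
```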
5.5 Vocabulary resizing
Purpose. Vocabulary resizing focuses on how new rows are initialized and trained. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
This concept explains how discrete language symbols become continuous vectors with trainable geometry.
Worked reading.
The operational question is what shape the vector has, how it is compared, and how training changes it.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | index into the embedding table |
| embedding table | $\|\mathcal{V}\| \times d$ matrix | maps token ids to vectors |
| hidden states | contextual vectors after lookup and layers | feed attention, the LM head, or retrieval |
| LM head | $d \times \|\mathcal{V}\|$ matrix, often tied to the embedding table | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- embedding rows.
- hidden states.
- similarity search.
Non-examples:
- raw text in linear algebra.
- ids treated as distances.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
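A minimal resizing sketch in plain numpy (frameworks usually provide a helper for this): append rows for the new tokens and initialize them near the mean of the existing rows so their initial logits are not out of scale. Sizes and the noise scale are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
V_old, d, n_new = 1000, 64, 4
E = 0.02 * rng.normal(size=(V_old, d))           # existing embedding table

# New rows start near the mean of the old rows, plus small noise so they
# are not identical; they then train normally from their first updates.
mean_row = E.mean(axis=0)
new_rows = mean_row + 0.01 * rng.normal(size=(n_new, d))
E_resized = np.vstack([E, new_rows])
print(E_resized.shape)                           # (1004, 64)

# If input and output embeddings are tied, the LM head grows with this table;
# if they are untied, the output matrix needs the same treatment separately.
```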
6. Positional Information
Positional Information explains how order is injected into the same geometry used by attention, language-model logits, and dense retrieval.
6.1 Why positions are needed
Purpose. This subsection explains why positions are needed: self-attention without a position signal is permutation-equivariant. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
Position mechanisms inject order into otherwise content-based attention. They may be additive vectors, rotations, or attention-score biases.
Worked reading.
Sinusoidal encodings add fixed frequency features; RoPE rotates query/key pairs; ALiBi adds a distance-dependent bias to attention logits.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | index into the embedding table |
| embedding table | $\|\mathcal{V}\| \times d$ matrix | maps token ids to vectors |
| hidden states | contextual vectors after lookup and layers | feed attention, the LM head, or retrieval |
| LM head | $d \times \|\mathcal{V}\|$ matrix, often tied to the embedding table | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- sinusoidal encodings.
- learned position rows.
- RoPE.
- ALiBi.
Non-examples:
- bag-of-token attention with no order.
- token ids used as positions.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
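The permutation-equivariance claim can be checked directly. In the sketch below, with random weights and no position signal, shuffling the input rows of a single attention layer just shuffles its output rows, so nothing in the output distinguishes the original order from the shuffled one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))                       # token embeddings, no positions
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V

perm = rng.permutation(n)
out, out_perm = attention(X), attention(X[perm])
print(np.allclose(out[perm], out_perm))           # True: order is invisible
```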
6.2 Sinusoidal positional encodings
Purpose. Sinusoidal positional encodings focus on fixed-frequency features. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
Position mechanisms inject order into otherwise content-based attention. They may be additive vectors, rotations, or attention-score biases.
Worked reading.
Sinusoidal encodings add fixed frequency features; RoPE rotates query/key pairs; ALiBi adds a distance-dependent bias to attention logits.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | index into the embedding table |
| embedding table | $\|\mathcal{V}\| \times d$ matrix | maps token ids to vectors |
| hidden states | contextual vectors after lookup and layers | feed attention, the LM head, or retrieval |
| LM head | $d \times \|\mathcal{V}\|$ matrix, often tied to the embedding table | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- sinusoidal encodings.
- learned position rows.
- RoPE.
- ALiBi.
Non-examples:
- bag-of-token attention with no order.
- token ids used as positions.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
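A small implementation of the standard sinusoidal table: each position gets sine and cosine features at geometrically spaced frequencies, and the table is added to the token embeddings before the first layer.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # pe[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(seq_len=128, d_model=64)
# Usage: X = token_embeddings + pe[:num_tokens]
print(pe.shape)  # (128, 64)
```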
6.3 Learned positional embeddings
Purpose. Learned positional embeddings focus on trainable position rows. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
Position mechanisms inject order into otherwise content-based attention. They may be additive vectors, rotations, or attention-score biases.
Worked reading.
Sinusoidal encodings add fixed frequency features; RoPE rotates query/key pairs; ALiBi adds a distance-dependent bias to attention logits.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | index into the embedding table |
| embedding table | $\|\mathcal{V}\| \times d$ matrix | maps token ids to vectors |
| hidden states | contextual vectors after lookup and layers | feed attention, the LM head, or retrieval |
| LM head | $d \times \|\mathcal{V}\|$ matrix, often tied to the embedding table | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- sinusoidal encodings.
- learned position rows.
- RoPE.
- ALiBi.
Non-examples:
- bag-of-token attention with no order.
- token ids used as positions.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
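A sketch of learned absolute positions with toy numpy tables: one trainable row per position, looked up by index and added to the token embedding. The sizes and ids are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, vocab_size, d = 512, 1000, 64
token_table = 0.02 * rng.normal(size=(vocab_size, d))
pos_table = 0.02 * rng.normal(size=(max_len, d))   # one trainable row per position

token_ids = np.array([5, 42, 7, 7, 99])
positions = np.arange(len(token_ids))

# Both lookups are trainable; their sum is the first residual-stream state.
X = token_table[token_ids] + pos_table[positions]
print(X.shape)  # (5, 64)
# Sequences longer than max_len have no trained row to look up, one reason
# learned absolute positions extrapolate poorly to longer contexts.
```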
6.4 Rotary positional embeddings
Purpose. Rotary positional embeddings focus on RoPE as complex-plane rotations. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
RoPE encodes position by rotating query and key coordinate pairs. The resulting attention dot product depends on relative offset through the rotation difference.
Worked reading.
A query or key vector at position $m$ is rotated by angles $m\theta_i$ tied to the position and the frequency index $i$, so attention can represent relative distance without adding a separate position vector.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | index into the embedding table |
| embedding table | $\|\mathcal{V}\| \times d$ matrix | maps token ids to vectors |
| hidden states | contextual vectors after lookup and layers | feed attention, the LM head, or retrieval |
| LM head | $d \times \|\mathcal{V}\|$ matrix, often tied to the embedding table | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- Llama-style position handling.
- long-context extrapolation studies.
- relative offset in attention.
Non-examples:
- absolute learned position rows.
- ALiBi scalar bias only.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
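A numpy sketch of the rotary idea, assuming the interleaved-pair convention (implementations differ in how they pair coordinates): rotate consecutive pairs of query and key coordinates by position-dependent angles, then observe that shifting every position by the same constant leaves the query-key dot products unchanged, i.e., they depend only on the relative offset.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    # Rotate coordinate pairs (x[2i], x[2i+1]) by angle position * freq_i.
    half = x.shape[-1] // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))
    angles = positions[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
pos = np.arange(6, dtype=float)

# Shifting all positions by 100 leaves the rotated dot products unchanged.
same = np.allclose(rope(q, pos) @ rope(k, pos).T,
                   rope(q, pos + 100) @ rope(k, pos + 100).T)
print(same)  # True: scores depend only on relative offsets
```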
6.5 ALiBi and relative biases
Purpose. ALiBi and relative biases focus on linear distance penalties in attention scores. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.
Operational definition.
Position mechanisms inject order into otherwise content-based attention. They may be additive vectors, rotations, or attention-score biases.
Worked reading.
Sinusoidal encodings add fixed frequency features; RoPE rotates query/key pairs; ALiBi adds a distance-dependent bias to attention logits.
| Object | Shape or formula | Role |
|---|---|---|
| token ids | discrete sequence from tokenizer | index into the embedding table |
| embedding table | $\|\mathcal{V}\| \times d$ matrix | maps token ids to vectors |
| hidden states | contextual vectors after lookup and layers | feed attention, the LM head, or retrieval |
| LM head | $d \times \|\mathcal{V}\|$ matrix, often tied to the embedding table | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |
Examples:
- sinusoidal encodings.
- learned position rows.
- RoPE.
- ALiBi.
Non-examples:
- bag-of-token attention with no order.
- token ids used as positions.
Derivation habit.
- Write the tensor shape before writing the operation.
- State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
- Choose dot product, cosine similarity, or Euclidean distance deliberately.
- Check whether position information is additive, rotary, learned, or an attention bias.
- Track whether input and output embeddings are tied.
Implementation lens.
In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.
When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
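A sketch of ALiBi-style biasing, assuming the geometric slope schedule from the original paper for a power-of-two head count: each head adds a linear penalty on backward distance to the attention logits before the softmax (future positions would additionally be masked in a causal model).

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    # Head h gets slope 2^(-8h / num_heads); the bias for attending from
    # position i back to position j is -slope * (i - j).
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    distance = np.maximum(i - j, 0)            # only penalize looking backward
    return -slopes[:, None, None] * distance   # (heads, seq, seq)

rng = np.random.default_rng(0)
bias = alibi_bias(seq_len=6, num_heads=4)
scores = rng.normal(size=(4, 6, 6))            # raw q.k attention logits
biased = scores + bias                         # added before the softmax
print(bias[0])                                 # 0 on the diagonal, more negative further back
```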