Math for LLMs / Embedding Space Math

Embedding Space Math, Part 2: Geometry of Meaning to Positional Information

4. Geometry of Meaning

Geometry of Meaning connects token ids to continuous vectors and prepares the exact geometry used by attention, language-model logits, and dense retrieval.

4.1 Analogy directions

Purpose. This subsection focuses on linear offsets and relational structure in embedding space. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\ell=-\log\frac{\exp(\mathbf{h}^{\top}\mathbf{w}_y)}{\sum_{j\in\mathcal{V}}\exp(\mathbf{h}^{\top}\mathbf{w}_j)}.

Operational definition.

This concept explains how discrete language symbols become continuous vectors with trainable geometry.

Worked reading.

The operational question is what shape the vector has, how it is compared, and how training changes it.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. embedding rows.
  2. hidden states.
  3. similarity search.

Non-examples:

  1. raw text in linear algebra.
  2. ids treated as distances.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
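As a concrete sketch of analogy offsets, the following toy example uses made-up 3-dimensional vectors (not real trained embeddings, and a deliberately tiny vocabulary) to show the classic king - man + woman pattern recovered by cosine search:

```python
import numpy as np

# Hypothetical toy embedding table; the vectors are hand-picked so the
# "gender" offset is shared between the (king, queen) and (man, woman) pairs.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "woman": np.array([0.5, 0.2, 0.7]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Analogy direction: king - man + woman should land near queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
```

With real embeddings the input words are excluded from the candidate set exactly as above, because the offset often lands closest to one of its own operands.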

4.2 Subspaces and probes

Purpose. This subsection focuses on where features are linearly readable in the embedding space. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\nabla_{\mathbf{w}_j}\ell=(p_j-\mathbf{1}\{j=y\})\,\mathbf{h}.

Operational definition.

Embedding spaces have geometry: norms, directions, subspaces, clusters, and dominant components. These structures can encode useful features and dataset artifacts.

Worked reading.

Centering an embedding cloud removes the mean direction; whitening rescales dominant axes so cosine neighborhoods are less dominated by global components.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. feature probes.
  2. bias directions.
  3. PCA diagnostics.

Non-examples:

  1. assuming every axis has semantic meaning.
  2. judging geometry from one 2D plot only.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
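A linear probe of the kind described above can be sketched on synthetic data. The planted feature direction and the plain gradient-descent training loop are illustrative assumptions, not a specific library API:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 400

# Synthetic "hidden states": a binary feature is written along one
# hypothetical direction, plus isotropic noise.
feature_dir = rng.normal(size=d)
feature_dir /= np.linalg.norm(feature_dir)
labels = rng.integers(0, 2, size=n)
X = rng.normal(scale=0.5, size=(n, d)) + np.outer(2 * labels - 1, feature_dir)

# Linear probe: logistic regression trained by plain gradient descent.
w = np.zeros(d)
for _ in range(200):
    p = 1 / (1 + np.exp(-X @ w))        # predicted probabilities
    w -= 0.1 * X.T @ (p - labels) / n   # cross-entropy gradient step

probe_acc = float(np.mean((X @ w > 0) == labels))
```

High probe accuracy shows the feature is linearly readable at this layer; it does not by itself show the model uses that direction causally.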

4.3 Isotropy and anisotropy

Purpose. This subsection focuses on why many embedding spaces collapse toward a few dominant directions. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{PE}_{p,2k}=\sin\left(p/10000^{2k/d}\right),\qquad \operatorname{PE}_{p,2k+1}=\cos\left(p/10000^{2k/d}\right).

Operational definition.

Embedding spaces have geometry: norms, directions, subspaces, clusters, and dominant components. These structures can encode useful features and dataset artifacts.

Worked reading.

Centering an embedding cloud removes the mean direction; whitening rescales dominant axes so cosine neighborhoods are less dominated by global components.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. feature probes.
  2. bias directions.
  3. PCA diagnostics.

Non-examples:

  1. assuming every axis has semantic meaning.
  2. judging geometry from one 2D plot only.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
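One common anisotropy diagnostic is the mean pairwise cosine similarity: near zero for an isotropic cloud, large when a shared component dominates. A minimal sketch with synthetic vectors (the shared-offset construction is an assumption mimicking the common "mean direction" artifact):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 64

def mean_pairwise_cosine(X):
    m = X.shape[0]
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    # Average over distinct pairs only (subtract the m self-similarities).
    return float((sims.sum() - m) / (m * (m - 1)))

isotropic = rng.normal(size=(n, d))
# Anisotropic cloud: same noise plus one large shared component.
anisotropic = isotropic + 4.0 * rng.normal(size=d)

iso_cos = mean_pairwise_cosine(isotropic)
aniso_cos = mean_pairwise_cosine(anisotropic)
```

When this statistic is high, cosine search mostly measures the shared component, which motivates the centering and whitening repairs in the next subsection.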

4.4 Centering and whitening

Purpose. This subsection focuses on simple geometric repairs for similarity search. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{RoPE}(\mathbf{q},p)=R_p\mathbf{q},\qquad (R_p\mathbf{q})^\top(R_s\mathbf{k})=\mathbf{q}^\top R_{s-p}\mathbf{k}.

Operational definition.

This concept explains how discrete language symbols become continuous vectors with trainable geometry.

Worked reading.

The operational question is what shape the vector has, how it is compared, and how training changes it.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. embedding rows.
  2. hidden states.
  3. similarity search.

Non-examples:

  1. raw text in linear algebra.
  2. ids treated as distances.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
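Centering and whitening can be sketched as follows. The synthetic cloud and the eigendecomposition-based whitening are illustrative choices; PCA, ZCA, and Cholesky whitening are all equivalent up to a rotation:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 400, 32

# Embeddings with a large mean offset and one dominant axis.
X = rng.normal(size=(n, d))
X[:, 0] *= 10.0          # dominant component
X += 5.0                 # global mean direction

# Centering removes the mean direction.
Xc = X - X.mean(axis=0)

# Whitening rescales the principal axes to unit variance.
cov = Xc.T @ Xc / n
eigvals, eigvecs = np.linalg.eigh(cov)
Xw = Xc @ eigvecs / np.sqrt(eigvals)   # divide each projected axis by its std

white_cov = Xw.T @ Xw / n
```

After whitening the covariance is the identity, so cosine neighborhoods are no longer dominated by the global offset or the single large axis.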

4.5 Bias and representation directions

Purpose. This subsection focuses on how embedding geometry can encode social or dataset artifacts. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{params}_{\mathrm{embed}}=|\mathcal{V}|\,d_{\mathrm{model}}.

Operational definition.

Embedding spaces have geometry: norms, directions, subspaces, clusters, and dominant components. These structures can encode useful features and dataset artifacts.

Worked reading.

Centering an embedding cloud removes the mean direction; whitening rescales dominant axes so cosine neighborhoods are less dominated by global components.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. feature probes.
  2. bias directions.
  3. PCA diagnostics.

Non-examples:

  1. assuming every axis has semantic meaning.
  2. judging geometry from one 2D plot only.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
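A difference-of-means bias direction and projection-based debiasing can be sketched on synthetic word groups. The planted `bias_dir` is an assumption for illustration; with real embeddings the groups would be curated word lists:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 32
bias_dir = rng.normal(size=d)
bias_dir /= np.linalg.norm(bias_dir)

# Two hypothetical word groups separated along a single bias direction.
group_a = rng.normal(scale=0.3, size=(50, d)) + bias_dir
group_b = rng.normal(scale=0.3, size=(50, d)) - bias_dir

# Estimate the bias direction as the difference of group means.
est = group_a.mean(axis=0) - group_b.mean(axis=0)
est /= np.linalg.norm(est)
alignment = abs(float(est @ bias_dir))

# Debias by projecting the estimated direction out of every vector.
def project_out(X, v):
    return X - np.outer(X @ v, v)

gap_after = float(np.linalg.norm(
    project_out(group_a, est).mean(axis=0)
    - project_out(group_b, est).mean(axis=0)))
```

Projecting out one direction removes the mean separation by construction; it does not guarantee the feature is unrecoverable from the remaining subspace.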

5. Training Embeddings

Training Embeddings connects token ids to continuous vectors and prepares the exact geometry used by attention, language-model logits, and dense retrieval.

5.1 Language-model loss gradients

Purpose. This subsection focuses on how the cross-entropy loss updates embedding and output rows. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\nabla_{\mathbf{w}_j}\ell=(p_j-\mathbf{1}\{j=y\})\,\mathbf{h}.

Operational definition.

Embedding training moves rows according to prediction errors and co-occurrence structure. Frequent tokens receive many updates, and rare tokens may remain poorly estimated.

Worked reading.

For a softmax output row $\mathbf{w}_j$, the gradient is proportional to $(p_j-\mathbf{1}\{j=y\})\,\mathbf{h}$.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. language-model cross-entropy.
  2. negative sampling.
  3. co-occurrence factorization.

Non-examples:

  1. hand-written semantic coordinates.
  2. frozen random rows with no adaptation.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
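The gradient formula can be checked numerically against the cross-entropy loss with a finite-difference probe (toy sizes and random data, purely for verification):

```python
import numpy as np

rng = np.random.default_rng(4)
V, d = 6, 8
W = rng.normal(size=(V, d))   # output embedding rows
h = rng.normal(size=d)        # hidden state
y = 2                         # target token id

def loss(W):
    logits = W @ h
    logp = logits - np.log(np.sum(np.exp(logits)))
    return -logp[y]

# Analytic gradient for every row: (p_j - 1{j=y}) * h.
p = np.exp(W @ h); p /= p.sum()
analytic = (p - np.eye(V)[y])[:, None] * h[None, :]

# Central finite difference on one entry of W.
eps = 1e-6
j, k = 0, 3
Wp = W.copy(); Wp[j, k] += eps
Wm = W.copy(); Wm[j, k] -= eps
numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
grad_err = abs(numeric - analytic[j, k])
```

The same check works for the input-embedding gradient; only the chain through $\mathbf{h}$ changes.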

5.2 Word2vec and negative sampling

Purpose. This subsection focuses on predictive embeddings that predate transformers. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{PE}_{p,2k}=\sin\left(p/10000^{2k/d}\right),\qquad \operatorname{PE}_{p,2k+1}=\cos\left(p/10000^{2k/d}\right).

Operational definition.

Embedding training moves rows according to prediction errors and co-occurrence structure. Frequent tokens receive many updates, and rare tokens may remain poorly estimated.

Worked reading.

For a softmax output row $\mathbf{w}_j$, the gradient is proportional to $(p_j-\mathbf{1}\{j=y\})\,\mathbf{h}$.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. language-model cross-entropy.
  2. negative sampling.
  3. co-occurrence factorization.

Non-examples:

  1. hand-written semantic coordinates.
  2. frozen random rows with no adaptation.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
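The skip-gram negative-sampling objective can be sketched directly from its definition, -log sigma(u_pos . v_c) - sum over negatives of log sigma(-u_neg . v_c), using toy 2-D vectors chosen by hand:

```python
import numpy as np

def sgns_loss(center, context_pos, context_negs):
    """Skip-gram negative-sampling loss for one (center, context) pair."""
    sig = lambda x: 1 / (1 + np.exp(-x))
    loss = -np.log(sig(context_pos @ center))       # pull the true context in
    for neg in context_negs:
        loss -= np.log(sig(-(neg @ center)))        # push sampled negatives away
    return float(loss)

v_c = np.array([1.0, 0.0])
u_pos = np.array([2.0, 0.0])     # agrees with the center vector -> low loss
u_neg = np.array([-2.0, 0.0])    # disagrees -> an easy negative
good = sgns_loss(v_c, u_pos, [u_neg])
bad = sgns_loss(v_c, u_neg, [u_pos])   # roles swapped -> high loss
```

Each update therefore moves the center row toward observed contexts and away from sampled ones, which is how co-occurrence structure enters the geometry.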

5.3 GloVe and co-occurrence factorization

Purpose. This subsection focuses on global co-occurrence count structure. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{RoPE}(\mathbf{q},p)=R_p\mathbf{q},\qquad (R_p\mathbf{q})^\top(R_s\mathbf{k})=\mathbf{q}^\top R_{s-p}\mathbf{k}.

Operational definition.

Embedding training moves rows according to prediction errors and co-occurrence structure. Frequent tokens receive many updates, and rare tokens may remain poorly estimated.

Worked reading.

For a softmax output row $\mathbf{w}_j$, the gradient is proportional to $(p_j-\mathbf{1}\{j=y\})\,\mathbf{h}$.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. language-model cross-entropy.
  2. negative sampling.
  3. co-occurrence factorization.

Non-examples:

  1. hand-written semantic coordinates.
  2. frozen random rows with no adaptation.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
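One term of the GloVe objective, f(X_ij) (w_i . w_j + b_i + b_j - log X_ij)^2, can be sketched with toy vectors; the x_max and alpha values follow the commonly cited defaults from the original paper:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # GloVe's weighting: downweights rare counts, caps frequent ones at 1.
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_term(w_i, w_j, b_i, b_j, x_ij):
    # One weighted least-squares term of the GloVe objective.
    return glove_weight(x_ij) * (w_i @ w_j + b_i + b_j - np.log(x_ij)) ** 2

w_i = np.array([0.5, 0.5]); w_j = np.array([1.0, 1.0])
x_ij = 50.0
# Choose biases so dot product + biases exactly equals log count: zero loss.
b_i = b_j = (np.log(x_ij) - w_i @ w_j) / 2
fit_term = glove_term(w_i, w_j, b_i, b_j, x_ij)
off_term = glove_term(w_i, w_j, b_i, b_j, 500.0)  # mismatched count -> loss
```

Because the target is a log count, GloVe is factorizing global co-occurrence statistics rather than making per-window predictions.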

5.4 Fine-tuning drift

Purpose. This subsection focuses on why embeddings move during adaptation. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{params}_{\mathrm{embed}}=|\mathcal{V}|\,d_{\mathrm{model}}.

Operational definition.

This concept explains how discrete language symbols become continuous vectors with trainable geometry.

Worked reading.

The operational question is what shape the vector has, how it is compared, and how training changes it.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. embedding rows.
  2. hidden states.
  3. similarity search.

Non-examples:

  1. raw text in linear algebra.
  2. ids treated as distances.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
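Per-row cosine similarity between two checkpoints is a simple drift diagnostic. Here the "fine-tuned" table is simulated: a few hypothetical domain rows are moved a lot and the rest barely, which stands in for real checkpoint pairs:

```python
import numpy as np

rng = np.random.default_rng(5)
V, d = 100, 16
E_before = rng.normal(size=(V, d))

# Simulated fine-tuning: small noise everywhere, large moves on a few rows.
E_after = E_before + 0.01 * rng.normal(size=(V, d))
moved = [3, 7, 42]
E_after[moved] += rng.normal(size=(len(moved), d))

def row_cosines(A, B):
    num = np.sum(A * B, axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return num / den

cos = row_cosines(E_before, E_after)
drifted = np.argsort(cos)[:3]   # lowest cosine = most drift
```

Ranking rows by drift and inspecting which tokens moved is often enough to spot domain shift or tokenizer mismatches after adaptation.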

5.5 Vocabulary resizing

Purpose. This subsection focuses on how new rows are initialized and trained. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

E\in\mathbb{R}^{|\mathcal{V}|\times d_{\mathrm{model}}},\qquad \mathbf{x}_t=E_{i_t,:}.

Operational definition.

This concept explains how discrete language symbols become continuous vectors with trainable geometry.

Worked reading.

The operational question is what shape the vector has, how it is compared, and how training changes it.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. embedding rows.
  2. hidden states.
  3. similarity search.

Non-examples:

  1. raw text in linear algebra.
  2. ids treated as distances.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
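A common resizing heuristic, initializing new rows near the mean of the existing rows so their logits start close to an "average token", can be sketched as follows (toy sizes; real frameworks wrap this in a resize call, but the underlying array operation is the same idea):

```python
import numpy as np

rng = np.random.default_rng(6)
old_V, d, new_tokens = 1000, 32, 4
E = rng.normal(size=(old_V, d))

# Initialize new rows at the mean of existing rows, plus small noise to
# break symmetry between the new tokens.
mean_row = E.mean(axis=0)
new_rows = mean_row + 0.01 * rng.normal(size=(new_tokens, d))
E_resized = np.vstack([E, new_rows])
```

If the input and output embeddings are tied, the same new rows also become logit rows, which is one reason mean initialization is preferred over large random values.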

6. Positional Information

Positional Information connects token ids to continuous vectors and prepares the exact geometry used by attention, language-model logits, and dense retrieval.

6.1 Why positions are needed

Purpose. This subsection focuses on why positions are needed: self-attention without order information is permutation-equivariant. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{PE}_{p,2k}=\sin\left(p/10000^{2k/d}\right),\qquad \operatorname{PE}_{p,2k+1}=\cos\left(p/10000^{2k/d}\right).

Operational definition.

Position mechanisms inject order into otherwise content-based attention. They may be additive vectors, rotations, or attention-score biases.

Worked reading.

Sinusoidal encodings add fixed frequency features; RoPE rotates query/key pairs; ALiBi adds a distance-dependent bias to attention logits.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. sinusoidal encodings.
  2. learned position rows.
  3. RoPE.
  4. ALiBi.

Non-examples:

  1. bag-of-token attention with no order.
  2. token ids used as positions.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
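Permutation equivariance of position-free attention can be verified directly with a single unmasked head and random weights: permuting the input rows just permutes the output rows, so no position is special.

```python
import numpy as np

rng = np.random.default_rng(7)
T, d = 5, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attend(X):
    # Single-head scaled dot-product attention, no mask, no positions.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    S = Q @ K.T / np.sqrt(d)
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V

perm = rng.permutation(T)
equivariant = bool(np.allclose(attend(X[perm]), attend(X)[perm]))
```

A causal mask already breaks this symmetry partially, but without a position signal the model still cannot distinguish reorderings of its visible prefix.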

6.2 Sinusoidal positional encodings

Purpose. This subsection focuses on fixed-frequency position features. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{RoPE}(\mathbf{q},p)=R_p\mathbf{q},\qquad (R_p\mathbf{q})^\top(R_s\mathbf{k})=\mathbf{q}^\top R_{s-p}\mathbf{k}.

Operational definition.

Position mechanisms inject order into otherwise content-based attention. They may be additive vectors, rotations, or attention-score biases.

Worked reading.

Sinusoidal encodings add fixed frequency features; RoPE rotates query/key pairs; ALiBi adds a distance-dependent bias to attention logits.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. sinusoidal encodings.
  2. learned position rows.
  3. RoPE.
  4. ALiBi.

Non-examples:

  1. bag-of-token attention with no order.
  2. token ids used as positions.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
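The sinusoidal formula above can be implemented in a few lines (assuming an even model dimension and the interleaved sin/cos layout):

```python
import numpy as np

def sinusoidal_pe(max_pos, d):
    # PE[p, 2k] = sin(p / 10000^(2k/d)), PE[p, 2k+1] = cos(p / 10000^(2k/d))
    pos = np.arange(max_pos)[:, None]
    k = np.arange(d // 2)[None, :]
    angle = pos / (10000.0 ** (2 * k / d))
    pe = np.zeros((max_pos, d))
    pe[:, 0::2] = np.sin(angle)   # even dims: sine features
    pe[:, 1::2] = np.cos(angle)   # odd dims: cosine features
    return pe

pe = sinusoidal_pe(128, 64)
```

Because each frequency pair is a rotation of position, a fixed linear map can express PE(p + k) in terms of PE(p), which is the usual argument for relative-offset readability.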

6.3 Learned positional embeddings

Purpose. This subsection focuses on trainable position rows. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{params}_{\mathrm{embed}}=|\mathcal{V}|\,d_{\mathrm{model}}.

Operational definition.

Position mechanisms inject order into otherwise content-based attention. They may be additive vectors, rotations, or attention-score biases.

Worked reading.

Sinusoidal encodings add fixed frequency features; RoPE rotates query/key pairs; ALiBi adds a distance-dependent bias to attention logits.

Object | Shape or formula | Role
token ids | $B\times T$ | discrete sequence from tokenizer
embedding table | $|\mathcal{V}|\times d_{\mathrm{model}}$ | one trainable row per vocabulary id
hidden states | $B\times T\times d$ | contextual vectors after lookup and layers
LM head | $d_{\mathrm{model}}\times|\mathcal{V}|$ | maps hidden states to vocabulary logits
position signal | vector, rotation, or bias | injects order into attention

Examples:

  1. sinusoidal encodings.
  2. learned position rows.
  3. RoPE.
  4. ALiBi.

Non-examples:

  1. bag-of-token attention with no order.
  2. token ids used as positions.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
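The normalization advice above can be sketched as follows; the vectors are illustrative:

```python
import math

def normalize(v):
    """Scale v to unit L2 norm so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a, b = [3.0, 4.0], [6.0, 8.0]
# Same direction, different norms: raw dot products differ, cosine does not.
cos = sum(x * y for x, y in zip(normalize(a), normalize(b)))
print(round(cos, 6))  # 1.0
```

Skipping this step silently mixes norm into the ranking, which is only correct if the training objective used unnormalized dot products.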

6.4 Rotary positional embeddings

Purpose. Rotary positional embeddings (RoPE) focus on encoding position as complex-plane rotations of query and key coordinate pairs. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\begin{pmatrix}x'_{2i}\\x'_{2i+1}\end{pmatrix}=\begin{pmatrix}\cos(p\theta_i)&-\sin(p\theta_i)\\\sin(p\theta_i)&\cos(p\theta_i)\end{pmatrix}\begin{pmatrix}x_{2i}\\x_{2i+1}\end{pmatrix},\qquad\theta_i=10000^{-2i/d}.

Operational definition.

RoPE encodes position by rotating query and key coordinate pairs. The resulting attention dot product depends on relative offset through the rotation difference.

Worked reading.

A vector at position pp is rotated by angles tied to pp and frequency index, so attention can represent relative distance without adding a separate position vector.
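A minimal RoPE sketch in plain Python, assuming the standard frequency base of 10000 and an even head dimension; the final check confirms that the query-key dot product depends only on the relative offset of the two positions:

```python
import math

BASE = 10000.0  # frequency base used by standard RoPE

def rope(vec, pos):
    """Rotate each (even, odd) coordinate pair of vec by pos * theta_i."""
    out = []
    for i in range(0, len(vec), 2):
        theta = BASE ** (-i / len(vec))  # theta_i = 10000^(-2i/d) per pair
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

q = [1.0, 0.0, 1.0, 0.0]
# Dot products at offsets (5,3) and (7,5) agree: both see relative distance 2.
d1 = sum(a * b for a, b in zip(rope(q, 5), rope(q, 3)))
d2 = sum(a * b for a, b in zip(rope(q, 7), rope(q, 5)))
print(abs(d1 - d2) < 1e-9)  # True
```

No separate position vector is ever added; the position lives entirely in the rotation applied to queries and keys.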


Examples:

  1. Llama-style position handling.
  2. long-context extrapolation studies.
  3. relative offset in attention.

Non-examples:

  1. absolute learned position rows.
  2. ALiBi scalar bias only.


6.5 ALiBi and relative biases

Purpose. ALiBi and relative biases focus on linear distance penalties added directly to attention scores. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

a_{ij}=\frac{\mathbf{q}_i^{\top}\mathbf{k}_j}{\sqrt{d_k}}-m_h\,(i-j),\qquad j\le i.

Operational definition.

ALiBi keeps token embeddings position-free and instead subtracts a head-specific linear penalty, proportional to the query-key distance, from each attention logit.

Worked reading.

Each head $h$ has a fixed slope $m_h$; the logit for query $i$ attending to key $j$ becomes $\mathbf{q}_i^{\top}\mathbf{k}_j/\sqrt{d_k}-m_h(i-j)$, so more distant keys are penalized, with some heads decaying faster than others.
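A minimal sketch of the causal bias matrix, with illustrative head slopes:

```python
def alibi_bias(slope, n):
    """Lower-triangular ALiBi bias: -slope * (i - j) for each key j <= query i."""
    return [[-slope * (i - j) for j in range(i + 1)] for i in range(n)]

# One slope per head; the original ALiBi recipe uses a geometric sequence
# (e.g. 1/2, 1/4, ... for 8 heads). Four heads here, for illustration.
slopes = [2.0 ** -(h + 1) for h in range(4)]

bias = alibi_bias(slopes[0], 4)
print(bias[3])  # [-1.5, -1.0, -0.5, -0.0]
```

The bias is added to the attention logits before softmax; no vectors are rotated and no rows are looked up, which is why ALiBi extrapolates to longer sequences cheaply.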


Examples:

  1. BLOOM-style attention biases.
  2. length extrapolation past the training context.
  3. head-specific distance slopes.

Non-examples:

  1. RoPE rotations of queries and keys.
  2. additive learned position vectors.

