Embedding Space Math

Math for LLMs

Part 1: Intuition to 3. Similarity Geometry

1. Intuition

This section connects token ids to continuous vectors and prepares the exact geometry used by attention, language-model logits, and dense retrieval.

1.1 From token ids to vectors

Purpose. From token ids to vectors focuses on why lookup tables are the first neural layer. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

E\in\mathbb{R}^{|\mathcal{V}|\times d_{\mathrm{model}}},\qquad \mathbf{x}_t=E_{i_t,:}.

Operational definition.

Token ids are discrete indices. The embedding matrix turns each id into a trainable vector so gradient-based neural networks can process language.

Worked reading.

For ids [3, 7, 7], lookup returns three rows of E. Repeated ids reuse the same row before contextual layers make their states differ.
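
A minimal PyTorch sketch of this lookup; the vocabulary size, width, and ids here are illustrative:

```python
import torch
import torch.nn as nn

# Toy vocabulary of 10 ids, model width 4: E has shape (10, 4).
emb = nn.Embedding(num_embeddings=10, embedding_dim=4)

ids = torch.tensor([3, 7, 7])   # token ids i_t
x = emb(ids)                    # rows E[3], E[7], E[7]; shape (3, 4)

# Repeated ids select the identical row before any contextual layer runs.
assert torch.equal(x[1], x[2])
print(x.shape)                  # torch.Size([3, 4])
```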

| Object | Shape or formula | Role |
| --- | --- | --- |
| token ids | B×T | discrete sequence from tokenizer |
| embedding table | \|V\|×d | trainable vector row per token id |
| hidden states | B×T×d | contextual vectors after lookup and layers |
| LM head | d×\|V\| | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |

Examples:

  1. input embedding lookup.
  2. LM-head row selection.
  3. token-level residual stream initialization.

Non-examples:

  1. raw strings inside attention.
  2. one scalar id treated as an ordinal number.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.
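
A sketch of those inspections, assuming a trained table E (random here for illustration):

```python
import torch
import torch.nn.functional as F

E = torch.randn(50_000, 512)      # stand-in for a trained embedding table

norms = E.norm(dim=1)             # per-token L2 norms
print(norms.mean(), norms.max())  # outliers can dominate dot products

mu = E.mean(dim=0)                # mean direction of the embedding cloud
print(mu.norm())                  # a large mean norm signals anisotropy

# Cosine nearest neighbors of token id 42 (normalize first).
En = F.normalize(E, dim=1)
sims = En @ En[42]
top = sims.topk(6).indices[1:]    # drop the query itself
print(top)
```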

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.

1.2 Why continuous geometry helps

Purpose. Why continuous geometry helps focuses on similarity, sharing, and smooth optimization. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\mathbf{x}_t=\mathbf{e}_{i_t}^{\top}E.

Operational definition.

This concept explains how discrete language symbols become continuous vectors with trainable geometry.

Worked reading.

The operational question is what shape the vector has, how it is compared, and how training changes it.

| Object | Shape or formula | Role |
| --- | --- | --- |
| token ids | B×T | discrete sequence from tokenizer |
| embedding table | \|V\|×d | trainable vector row per token id |
| hidden states | B×T×d | contextual vectors after lookup and layers |
| LM head | d×\|V\| | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |

Examples:

  1. embedding rows.
  2. hidden states.
  3. similarity search.

Non-examples:

  1. raw text in linear algebra.
  2. ids treated as distances.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.

1.3 Embedding space as model memory

Purpose. Embedding space as model memory focuses on what can be stored in rows and directions. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{cos}(\mathbf{u},\mathbf{v})=\frac{\langle \mathbf{u},\mathbf{v}\rangle}{\lVert\mathbf{u}\rVert_2\lVert\mathbf{v}\rVert_2}.

Operational definition.

Embedding tables are systems objects too: they consume memory, depend on tokenizer ids, and must handle special rows carefully.

Worked reading.

Changing a tokenizer changes which row each token id selects, so old weights no longer mean the same thing without migration.
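
A hedged sketch of one migration-safe resizing, where old ids keep their rows and new rows start at the mean embedding (one common heuristic, not the only one):

```python
import torch

def resize_embedding(E_old: torch.Tensor, new_vocab: int) -> torch.Tensor:
    """Grow an embedding table, keeping old rows aligned with old ids."""
    old_vocab, d = E_old.shape
    E_new = torch.empty(new_vocab, d)
    E_new[:old_vocab] = E_old                    # ids 0..old_vocab-1 unchanged
    # Heuristic: initialize new rows near the mean of existing rows.
    E_new[old_vocab:] = E_old.mean(dim=0, keepdim=True)
    return E_new

E = torch.randn(100, 16)
E2 = resize_embedding(E, 104)                    # e.g. four new special tokens
assert torch.equal(E2[:100], E)
```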

| Object | Shape or formula | Role |
| --- | --- | --- |
| token ids | B×T | discrete sequence from tokenizer |
| embedding table | \|V\|×d | trainable vector row per token id |
| hidden states | B×T×d | contextual vectors after lookup and layers |
| LM head | d×\|V\| | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |

Examples:

  1. vocabulary resizing.
  2. special token initialization.
  3. embedding quantization.

Non-examples:

  1. renaming token ids without changing weights.
  2. ignoring padding row behavior.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.

1.4 Static versus contextual meaning

Purpose. Static versus contextual meaning focuses on why a token row is only the starting representation. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\ell=-\log\frac{\exp(\mathbf{h}^{\top}\mathbf{w}_y)}{\sum_{j\in\mathcal{V}}\exp(\mathbf{h}^{\top}\mathbf{w}_j)}.

Operational definition.

This concept explains how discrete language symbols become continuous vectors with trainable geometry.

Worked reading.

The operational question is what shape the vector has, how it is compared, and how training changes it.

| Object | Shape or formula | Role |
| --- | --- | --- |
| token ids | B×T | discrete sequence from tokenizer |
| embedding table | \|V\|×d | trainable vector row per token id |
| hidden states | B×T×d | contextual vectors after lookup and layers |
| LM head | d×\|V\| | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |

Examples:

  1. embedding rows.
  2. hidden states.
  3. similarity search.

Non-examples:

  1. raw text in linear algebra.
  2. ids treated as distances.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.

1.5 Pipeline position after tokenization

Purpose. Pipeline position after tokenization focuses on how ids become residual-stream states. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\nabla_{\mathbf{w}_j}\ell=(p_j-\mathbf{1}\{j=y\})\mathbf{h}.

Operational definition.

Position mechanisms inject order into otherwise content-based attention. They may be additive vectors, rotations, or attention-score biases.

Worked reading.

Sinusoidal encodings add fixed-frequency features; RoPE rotates query/key pairs; ALiBi adds a distance-dependent bias to attention logits.
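
A compact sketch of two of these mechanisms; the ALiBi-style slope here is a single placeholder value, whereas real ALiBi uses a per-head geometric sequence of slopes:

```python
import torch

def sinusoidal_pe(T: int, d: int) -> torch.Tensor:
    """PE[p, 2k] = sin(p / 10000^(2k/d)), PE[p, 2k+1] = cos(same angle)."""
    p = torch.arange(T, dtype=torch.float32).unsqueeze(1)   # (T, 1)
    k = torch.arange(0, d, 2, dtype=torch.float32)          # (d/2,)
    angle = p / (10000.0 ** (k / d))                        # (T, d/2)
    pe = torch.zeros(T, d)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

def alibi_bias(T: int, slope: float = 0.5) -> torch.Tensor:
    """ALiBi-style additive bias: penalize scores by query-key distance."""
    pos = torch.arange(T)
    return -slope * (pos.unsqueeze(0) - pos.unsqueeze(1)).abs().float()

print(sinusoidal_pe(8, 16).shape, alibi_bias(8).shape)
```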

| Object | Shape or formula | Role |
| --- | --- | --- |
| token ids | B×T | discrete sequence from tokenizer |
| embedding table | \|V\|×d | trainable vector row per token id |
| hidden states | B×T×d | contextual vectors after lookup and layers |
| LM head | d×\|V\| | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |

Examples:

  1. sinusoidal encodings.
  2. learned position rows.
  3. RoPE.
  4. ALiBi.

Non-examples:

  1. bag-of-token attention with no order.
  2. token ids used as positions.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.

2. Formal Definitions

This section connects token ids to continuous vectors and prepares the exact geometry used by attention, language-model logits, and dense retrieval.

2.1 Embedding matrix

Purpose. Embedding matrix focuses on the table E ∈ ℝ^{|V|×d}. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\mathbf{x}_t=\mathbf{e}_{i_t}^{\top}E.

Operational definition.

The embedding matrix is a table E ∈ ℝ^{|V|×d} whose row E_{i,:} is the vector for token id i.

Worked reading.

A vocabulary of 50,000 and width 4096 already has over 200 million input-embedding parameters if untied from the output head.
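
The arithmetic behind that count:

```python
vocab, d = 50_000, 4096
input_params = vocab * d
print(f"{input_params:,}")   # 204,800,000 input-embedding parameters
# An untied LM head of shape (d, vocab) doubles this count.
```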

| Object | Shape or formula | Role |
| --- | --- | --- |
| token ids | B×T | discrete sequence from tokenizer |
| embedding table | \|V\|×d | trainable vector row per token id |
| hidden states | B×T×d | contextual vectors after lookup and layers |
| LM head | d×\|V\| | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |

Examples:

  1. GPT-style token embeddings.
  2. BERT wordpiece embeddings.
  3. new rows after vocabulary expansion.

Non-examples:

  1. a dictionary from words to definitions.
  2. a single dense vector for an entire sentence.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.

2.2 One-hot lookup equivalence

Purpose. One-hot lookup equivalence focuses on why e_i^⊤E selects a row. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{cos}(\mathbf{u},\mathbf{v})=\frac{\langle \mathbf{u},\mathbf{v}\rangle}{\lVert\mathbf{u}\rVert_2\lVert\mathbf{v}\rVert_2}.

Operational definition.

This concept explains how discrete language symbols become continuous vectors with trainable geometry.

Worked reading.

The operational question is what shape the vector has, how it is compared, and how training changes it.
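
A quick numerical check of the one-hot equivalence this subsection names:

```python
import torch

V, d = 10, 4
E = torch.randn(V, d)

i = 7
e_i = torch.zeros(V)
e_i[i] = 1.0                    # one-hot basis vector e_i

# e_i^T E is exactly row i of E; lookup is the efficient special case.
assert torch.allclose(e_i @ E, E[i])
```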

| Object | Shape or formula | Role |
| --- | --- | --- |
| token ids | B×T | discrete sequence from tokenizer |
| embedding table | \|V\|×d | trainable vector row per token id |
| hidden states | B×T×d | contextual vectors after lookup and layers |
| LM head | d×\|V\| | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |

Examples:

  1. embedding rows.
  2. hidden states.
  3. similarity search.

Non-examples:

  1. raw text in linear algebra.
  2. ids treated as distances.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.

2.3 Batch sequence tensor shapes

Purpose. Batch sequence tensor shapes focuses on how ids become B×T×d arrays. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\ell=-\log\frac{\exp(\mathbf{h}^{\top}\mathbf{w}_y)}{\sum_{j\in\mathcal{V}}\exp(\mathbf{h}^{\top}\mathbf{w}_j)}.

Operational definition.

This concept explains how discrete language symbols become continuous vectors with trainable geometry.

Worked reading.

The operational question is what shape the vector has, how it is compared, and how training changes it.
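
A shape check with illustrative sizes:

```python
import torch
import torch.nn as nn

B, T, V, d = 2, 5, 100, 32
ids = torch.randint(0, V, (B, T))   # (B, T) integer ids
emb = nn.Embedding(V, d)

x = emb(ids)                        # (B, T, d) residual-stream input
assert x.shape == (B, T, d)
```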

| Object | Shape or formula | Role |
| --- | --- | --- |
| token ids | B×T | discrete sequence from tokenizer |
| embedding table | \|V\|×d | trainable vector row per token id |
| hidden states | B×T×d | contextual vectors after lookup and layers |
| LM head | d×\|V\| | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |

Examples:

  1. embedding rows.
  2. hidden states.
  3. similarity search.

Non-examples:

  1. raw text in linear algebra.
  2. ids treated as distances.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.

2.4 Input output and tied embeddings

Purpose. Input output and tied embeddings focuses on when the same matrix is reused for logits. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\nabla_{\mathbf{w}_j}\ell=(p_j-\mathbf{1}\{j=y\})\mathbf{h}.

Operational definition.

This concept explains how discrete language symbols become continuous vectors with trainable geometry.

Worked reading.

The operational question is what shape the vector has, how it is compared, and how training changes it.
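
A minimal weight-tying sketch in PyTorch; sizes are illustrative:

```python
import torch
import torch.nn as nn

V, d = 100, 32
emb = nn.Embedding(V, d)
lm_head = nn.Linear(d, V, bias=False)

# Weight tying: the output projection reuses the input table (both (V, d)).
lm_head.weight = emb.weight

h = torch.randn(d)                  # a final hidden state
logits = lm_head(h)                 # h @ E^T, one logit per token id
assert logits.shape == (V,)
```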

| Object | Shape or formula | Role |
| --- | --- | --- |
| token ids | B×T | discrete sequence from tokenizer |
| embedding table | \|V\|×d | trainable vector row per token id |
| hidden states | B×T×d | contextual vectors after lookup and layers |
| LM head | d×\|V\| | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |

Examples:

  1. embedding rows.
  2. hidden states.
  3. similarity search.

Non-examples:

  1. raw text in linear algebra.
  2. ids treated as distances.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.

2.5 Residual stream initialization

Purpose. Residual stream initialization focuses on embedding plus positional information. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{PE}_{p,2k}=\sin\left(p/10000^{2k/d}\right),\qquad \operatorname{PE}_{p,2k+1}=\cos\left(p/10000^{2k/d}\right).

Operational definition.

Embeddings become contextual through projections, attention, MLPs, and residual additions. Attention compares projected vectors, not raw token ids.

Worked reading.

The query and key projections turn hidden states into vectors whose dot products define attention weights.
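
A sketch of one residual-stream initialization, using GPT-2-style learned additive positions as the example scheme:

```python
import torch
import torch.nn as nn

B, T, V, d = 2, 8, 100, 32
tok_emb = nn.Embedding(V, d)
pos_emb = nn.Embedding(T, d)        # learned additive positions

ids = torch.randint(0, V, (B, T))
pos = torch.arange(T)
x0 = tok_emb(ids) + pos_emb(pos)    # first residual-stream state, (B, T, d)
# RoPE instead rotates q/k inside attention, and ALiBi biases attention
# scores, so in those schemes x0 would be the token embedding alone.
assert x0.shape == (B, T, d)
```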

| Object | Shape or formula | Role |
| --- | --- | --- |
| token ids | B×T | discrete sequence from tokenizer |
| embedding table | \|V\|×d | trainable vector row per token id |
| hidden states | B×T×d | contextual vectors after lookup and layers |
| LM head | d×\|V\| | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |

Examples:

  1. QKV projections.
  2. residual stream states.
  3. dense retrieval vectors.

Non-examples:

  1. nearest neighbors over integer ids.
  2. one fixed meaning for a token in every context.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.

3. Similarity Geometry

This section connects token ids to continuous vectors and prepares the exact geometry used by attention, language-model logits, and dense retrieval.

3.1 Dot product

Purpose. Dot product focuses on magnitude-sensitive alignment. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{cos}(\mathbf{u},\mathbf{v})=\frac{\langle \mathbf{u},\mathbf{v}\rangle}{\lVert\mathbf{u}\rVert_2\lVert\mathbf{v}\rVert_2}.

Operational definition.

This concept explains how discrete language symbols become continuous vectors with trainable geometry.

Worked reading.

The operational question is what shape the vector has, how it is compared, and how training changes it.
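
A two-vector illustration of magnitude sensitivity:

```python
import torch

u = torch.tensor([1.0, 0.0])
v = torch.tensor([3.0, 0.0])        # same direction, three times the norm

print(torch.dot(u, v))                            # 3.0: magnitude-sensitive
print(torch.dot(u, v) / (u.norm() * v.norm()))    # 1.0: angle only
```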

| Object | Shape or formula | Role |
| --- | --- | --- |
| token ids | B×T | discrete sequence from tokenizer |
| embedding table | \|V\|×d | trainable vector row per token id |
| hidden states | B×T×d | contextual vectors after lookup and layers |
| LM head | d×\|V\| | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |

Examples:

  1. embedding rows.
  2. hidden states.
  3. similarity search.

Non-examples:

  1. raw text in linear algebra.
  2. ids treated as distances.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.

3.2 Cosine similarity

Purpose. Cosine similarity focuses on angle-only semantic comparison. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\ell=-\log\frac{\exp(\mathbf{h}^{\top}\mathbf{w}_y)}{\sum_{j\in\mathcal{V}}\exp(\mathbf{h}^{\top}\mathbf{w}_j)}.

Operational definition.

Cosine similarity compares directions rather than magnitudes. It is often better for semantic neighbor queries when norms carry frequency or confidence effects.

Worked reading.

If two vectors point in the same direction, their cosine is near 1 even when one has a much larger norm.
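
The same point in code:

```python
import torch
import torch.nn.functional as F

u = torch.tensor([1.0, 1.0])
v = 10.0 * u                        # same direction, much larger norm

print(F.cosine_similarity(u, v, dim=0))   # tensor(1.0000)

# Normalizing first makes dot product and cosine agree.
un, vn = F.normalize(u, dim=0), F.normalize(v, dim=0)
print(torch.dot(un, vn))                  # also 1.0
```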

| Object | Shape or formula | Role |
| --- | --- | --- |
| token ids | B×T | discrete sequence from tokenizer |
| embedding table | \|V\|×d | trainable vector row per token id |
| hidden states | B×T×d | contextual vectors after lookup and layers |
| LM head | d×\|V\| | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |

Examples:

  1. nearest-neighbor search.
  2. embedding evaluation.
  3. retrieval scoring after normalization.

Non-examples:

  1. raw dot-product logits in a trained LM head.
  2. Euclidean clustering without normalization.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.

3.3 Euclidean distance

Purpose. Euclidean distance focuses on metric distance and its caveats. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\nabla_{\mathbf{w}_j}\ell=(p_j-\mathbf{1}\{j=y\})\mathbf{h}.

Operational definition.

This concept explains how discrete language symbols become continuous vectors with trainable geometry.

Worked reading.

The operational question is what shape the vector has, how it is compared, and how training changes it.
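
The identity linking Euclidean distance to the dot product, checked numerically:

```python
import torch

u, v = torch.randn(8), torch.randn(8)

# ||u - v||^2 = ||u||^2 + ||v||^2 - 2 <u, v>
lhs = (u - v).norm() ** 2
rhs = u.norm() ** 2 + v.norm() ** 2 - 2 * torch.dot(u, v)
assert torch.allclose(lhs, rhs, atol=1e-5)

# On unit vectors this becomes ||u - v||^2 = 2 - 2 cos(u, v), so Euclidean
# and cosine rankings agree after normalization.
```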

| Object | Shape or formula | Role |
| --- | --- | --- |
| token ids | B×T | discrete sequence from tokenizer |
| embedding table | \|V\|×d | trainable vector row per token id |
| hidden states | B×T×d | contextual vectors after lookup and layers |
| LM head | d×\|V\| | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |

Examples:

  1. embedding rows.
  2. hidden states.
  3. similarity search.

Non-examples:

  1. raw text in linear algebra.
  2. ids treated as distances.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.

3.4 Norms and frequency effects

Purpose. Norms and frequency effects focuses on why common tokens can have large or biased norms. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{PE}_{p,2k}=\sin\left(p/10000^{2k/d}\right),\qquad \operatorname{PE}_{p,2k+1}=\cos\left(p/10000^{2k/d}\right).

Operational definition.

Embedding spaces have geometry: norms, directions, subspaces, clusters, and dominant components. These structures can encode useful features and dataset artifacts.

Worked reading.

Centering an embedding cloud removes the mean direction; whitening rescales dominant axes so cosine neighborhoods are less dominated by global components.
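
A sketch of centering plus a PCA diagnostic; the cloud here is synthetic, with an injected mean direction:

```python
import torch

E = torch.randn(1000, 64) + 5.0      # embedding cloud with a large mean direction

Ec = E - E.mean(dim=0, keepdim=True)  # centering removes the mean direction
print(E.mean(dim=0).norm(), Ec.mean(dim=0).norm())   # large vs ~0

# PCA diagnostic: how much variance the top components explain.
_, S, _ = torch.linalg.svd(Ec, full_matrices=False)
var = S**2 / (S**2).sum()
print(var[:3])                        # dominant axes to consider whitening
```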

| Object | Shape or formula | Role |
| --- | --- | --- |
| token ids | B×T | discrete sequence from tokenizer |
| embedding table | \|V\|×d | trainable vector row per token id |
| hidden states | B×T×d | contextual vectors after lookup and layers |
| LM head | d×\|V\| | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |

Examples:

  1. feature probes.
  2. bias directions.
  3. PCA diagnostics.

Non-examples:

  1. assuming every axis has semantic meaning.
  2. judging geometry from one 2D plot only.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.

3.5 Nearest neighbors

Purpose. Nearest neighbors focuses on retrieval in embedding space. This matters because every later transformer operation starts from these vectors or from hidden states derived from them.

\operatorname{RoPE}(\mathbf{q},p)=R_p\mathbf{q},\qquad (R_p\mathbf{q})^{\top}(R_s\mathbf{k})=\mathbf{q}^{\top}R_{s-p}\mathbf{k}.

Operational definition.

Embedding spaces have geometry: norms, directions, subspaces, clusters, and dominant components. These structures can encode useful features and dataset artifacts.

Worked reading.

Centering an embedding cloud removes the mean direction; whitening rescales dominant axes so cosine neighborhoods are less dominated by global components.
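
A minimal cosine nearest-neighbor search over a synthetic corpus:

```python
import torch
import torch.nn.functional as F

E = torch.randn(10_000, 256)         # corpus embeddings
q = torch.randn(256)                 # query embedding

# Normalize so the dot product is cosine similarity.
En = F.normalize(E, dim=1)
qn = F.normalize(q, dim=0)

sims = En @ qn                       # (10000,) cosine scores
scores, idx = sims.topk(5)           # five nearest neighbors
print(idx, scores)
```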

| Object | Shape or formula | Role |
| --- | --- | --- |
| token ids | B×T | discrete sequence from tokenizer |
| embedding table | \|V\|×d | trainable vector row per token id |
| hidden states | B×T×d | contextual vectors after lookup and layers |
| LM head | d×\|V\| | maps hidden states to vocabulary logits |
| position signal | vector, rotation, or bias | injects order into attention |

Examples:

  1. feature probes.
  2. bias directions.
  3. PCA diagnostics.

Non-examples:

  1. assuming every axis has semantic meaning.
  2. judging geometry from one 2D plot only.

Derivation habit.

  1. Write the tensor shape before writing the operation.
  2. State whether vectors are raw input embeddings, hidden states, output rows, or retrieval embeddings.
  3. Choose dot product, cosine similarity, or Euclidean distance deliberately.
  4. Check whether position information is additive, rotary, learned, or an attention bias.
  5. Track whether input and output embeddings are tied.

Implementation lens.

In code, embedding lookup is simple indexing. Conceptually, it is the step where the model stops seeing symbolic ids and starts seeing trainable vectors. That is why tokenizer changes, vocabulary resizing, and special-token handling are not superficial.

When debugging model behavior, inspect embedding norms, nearest neighbors, and the mean direction. Large norms or dominant components can affect logits and similarity search. For retrieval models, normalize vectors before cosine search unless the training objective explicitly uses norm as signal.

For transformer internals, remember that input embeddings are only the first residual-stream state. After attention and MLP layers, a token position's hidden state reflects surrounding context. Static nearest neighbors and contextual hidden-state probes answer different questions.
