Vectors and Spaces


"Deep learning runs on arrays of floating-point numbers, but the right mental model is not 'array processing.' It is geometry inside vector spaces."

Overview

Vectors and spaces are the basic language of modern machine learning. An embedding is a vector. A batch of activations is a matrix whose rows or columns are vectors. A transformer's attention mechanism compares vectors with inner products, combines vectors with weighted sums, and moves information through learned subspaces. Optimization updates a parameter vector in a very high-dimensional space. Even when the underlying object is not a list of numbers, the mathematics usually places it inside a vector space: matrices, polynomials, functions, sequences, kernels, and distributions all become easier to reason about once their linear structure is exposed.

This chapter is deliberately broader than a minimal first course treatment. It starts with concrete vectors in \mathbb{R}^n, then moves to abstract vector spaces, norms, inner products, projections, convex geometry, high-dimensional effects, linear maps, and the AI-specific spaces that appear in transformers and large language models. The goal is not just to define terms, but to build the geometric intuition that later chapters on matrices, orthogonality, SVD, optimization, embeddings, and attention depend on.

Where possible, the discussion connects classical linear algebra to modern AI research. Word vectors made semantic geometry operational in NLP (Mikolov et al., 2013). The Transformer made dot-product geometry the core computational primitive of sequence modeling (Vaswani et al., 2017). Low-rank adaptation methods such as LoRA rely directly on rank, subspace, and projection ideas (Hu et al., 2022). Neural tangent kernel theory studies training as dynamics in function space (Jacot et al., 2018). Understanding vectors and spaces is therefore not background decoration. It is the medium in which AI computation happens.

Prerequisites

  • Basic algebra and coordinate geometry
  • Familiarity with sets and functions
  • Comfort with summation notation
  • Basic proof ideas (especially direct proof and contradiction)

Learning Objectives

After completing this chapter, you should be able to:

  • Explain vectors geometrically, algebraically, and abstractly
  • Work fluently with vector addition, scalar multiplication, span, and linear independence
  • Verify whether a set is a vector space or a subspace
  • Find bases, compute dimensions, and interpret coordinate systems
  • Use norms, inner products, orthogonality, and projections
  • Recognize affine and convex structure in optimization and probabilistic modeling
  • Understand why high-dimensional geometry differs sharply from low-dimensional intuition
  • Interpret kernels, images, null spaces, and rank-nullity for linear maps
  • Connect embedding spaces, attention subspaces, parameter space, and function space to concrete AI systems
  • Reason about regularization geometrically rather than as a bag of tricks

1. Intuition

1.1 What Are Vectors and Spaces?

A vector is first encountered as an arrow with magnitude and direction, but that picture is only the beginning. In modern linear algebra, a vector is any element of a set that supports two operations:

  1. addition of vectors
  2. multiplication of a vector by a scalar

If those operations satisfy the right axioms, the set is called a vector space. The power of the abstraction is that the "things" inside the space do not need to be arrows in physical space. They may be coordinate tuples, polynomials, matrices, functions, sequences, or even gradient fields.

Concrete examples help fix the idea:

  • (3, -1, 2) \in \mathbb{R}^3 is a vector
  • a polynomial such as 1 + 2x - 5x^2 is a vector in a polynomial space
  • a matrix in \mathbb{R}^{m \times n} is a vector in a matrix space
  • a continuous function f : [0,1] \to \mathbb{R} is a vector in a function space

The unifying idea is linear structure. If you can sensibly add two objects of the same kind and scale the result by numbers, then linear algebra is likely available.

1.2 Why Vectors and Spaces Are Central to AI

Nearly every important object in deep learning is a vector or lives in a vector space:

  • token embeddings are vectors in \mathbb{R}^d
  • hidden states are vectors in a residual stream
  • query, key, and value representations are vectors in learned subspaces
  • parameter sets are points in a very large Euclidean space
  • gradients are vectors in the same parameter space
  • logits are vectors in vocabulary space
  • probability outputs live in the probability simplex, which sits inside a vector space but is not itself one

The Transformer made this especially explicit. Attention compares a query vector q with key vectors k_j using dot products, then produces an output as a weighted sum of value vectors v_j:

\mathrm{Attention}(q, K, V) = \sum_{j=1}^{n} \alpha_j v_j, \qquad \alpha_j = \frac{\exp(q^\top k_j / \sqrt{d_k})}{\sum_{\ell=1}^{n} \exp(q^\top k_\ell / \sqrt{d_k})}.

That is pure vector-space computation: inner products, scaling, exponentiation of scores, and linear combination. The semantic geometry of embeddings became operational in NLP through word2vec (Mikolov et al., 2013), and the geometry of projected subspaces became central to model architecture through self-attention (Vaswani et al., 2017). In large open models such as Llama 3, parameter space itself has hundreds of billions of coordinates at the frontier scale (Dubey et al., 2024), so geometric reasoning is not optional; it is the only way to think clearly about what the model can represent.
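
A minimal NumPy sketch of this computation for a single query (illustrative only; the names q, K, V and the helper below are assumptions, not any particular library's API):

import numpy as np

def attention(q, K, V):
    """Scaled dot-product attention for one query vector."""
    d_k = K.shape[1]
    scores = K @ q / np.sqrt(d_k)            # one inner product per key
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V                       # weighted sum of value vectors

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(5, 8))   # 5 keys in R^8
V = rng.normal(size=(5, 8))   # 5 values in R^8
out = attention(q, K, V)      # lies in the convex hull of the rows of V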

1.3 Geometric, Algebraic, and Abstract Views

There are three complementary ways to think about vectors.

View | Core idea | Strength | Typical AI use
Geometric | A vector is an arrow with length and direction | Builds intuition for angle, distance, projection | Similarity, attention, clustering
Algebraic | A vector is an ordered tuple of numbers | Easy to compute with componentwise rules | Arrays, tensors, embeddings, gradients
Abstract | A vector is any element of a vector space | Generalizes linear algebra beyond coordinates | Function spaces, kernels, NTK, RKHS

The geometric view tells you why cosine similarity matters. The algebraic view tells you how to implement it. The abstract view tells you why the same mathematics reappears for functions, distributions, and operators.

One idea, three views

Geometric:             Algebraic:              Abstract:

    y                     [v1, v2, ... , vn]      v in V
    ^                                              where V has
    |   / v                                         vector addition
    |  /                                            and scalar multiplication
    | /
----+--------------> x

1.4 Where Vectors and Spaces Appear in AI

The flow through a language model can be read as a sequence of moves between vector spaces:

Token IDs
  -> embedding vectors in R^d
  -> sequence matrix X in R^(n x d)
  -> projected query/key/value spaces
  -> attention outputs as weighted sums
  -> feed-forward maps R^d -> R^d_ff -> R^d
  -> logits in R^|V|
  -> probabilities in the simplex Delta^(|V|-1)

Even when the last object is not a vector space, it usually sits inside one. The simplex of probability vectors is an affine slice of \mathbb{R}^{|V|} with nonnegativity constraints. That pattern is common in AI: model objects often live in structured subsets of ambient vector spaces.

1.5 The Hierarchy of Structure

A useful way to organize the subject is by how much extra structure is available.

set
  -> abelian group (addition)
  -> vector space (addition + scalar multiplication)
     -> normed space (size)
        -> Banach space (complete normed space)
     -> inner product space (angles and lengths)
        -> Hilbert space (complete inner product space)

This branching picture is more accurate than a single chain. A normed space need not come from an inner product. An inner product space automatically induces a norm, and a Hilbert space is therefore also a Banach space, but not every Banach space is Hilbert. Most day-to-day ML uses finite-dimensional Euclidean spaces, where these distinctions collapse because all norms are equivalent and completeness is automatic. In theoretical ML and functional analysis, they matter a great deal.

1.6 Historical Timeline

The language of vectors and spaces was assembled gradually.

Period | Figure or development | Importance
Ancient mechanics | Directed quantities in geometry and physics | Proto-vector intuition
17th century | Galileo and Newton | Composition of velocity and force
1843-1844 | Hamilton and Grassmann | Algebraic extension beyond ordinary 3D geometry
Late 19th century | Peano and axiomatic linear algebra | Abstract vector space formulation
Early 20th century | Hilbert | Infinite-dimensional inner product spaces
1920s-1930s | Banach and functional analysis | Normed complete spaces and operator theory
20th century computing | Numerical linear algebra | Matrix computation becomes practical at scale
2013 | word2vec | Semantic geometry becomes an engineering object
2017 | Transformer | Dot-product geometry becomes core architecture
2020s | Foundation models | Representation geometry becomes a first-class research topic

The modern lesson is simple: vectors started as geometry, became algebra, and now serve as the organizing language of computation.


2. Vectors in R^n

2.1 Definition and Notation

A vector in \mathbb{R}^n is an ordered n-tuple of real numbers:

\mathbf{v} = (v_1, v_2, \ldots, v_n) \in \mathbb{R}^n.

In linear algebra, the default convention is usually the column vector:

\mathbf{v} = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix}.

Key notation:

  • v_i is the i-th component or coordinate
  • \mathbf{0} = (0, \ldots, 0) is the zero vector
  • \mathbf{e}_i is the i-th standard basis vector, with a 1 in position i and 0 elsewhere
  • \mathbf{v}^\top denotes the transpose, turning a column into a row

Every vector in \mathbb{R}^n can be written as

\mathbf{v} = v_1 \mathbf{e}_1 + v_2 \mathbf{e}_2 + \cdots + v_n \mathbf{e}_n.

That formula is already telling you something profound: coordinates are not the vector itself; they are the coefficients of the vector relative to a chosen basis.

2.2 Vector Addition

Vectors in \mathbb{R}^n add componentwise:

(\mathbf{u} + \mathbf{v})_i = u_i + v_i.

So if

\mathbf{u} = \begin{pmatrix} 1 \\ -2 \\ 5 \end{pmatrix}, \qquad \mathbf{v} = \begin{pmatrix} 3 \\ 4 \\ -1 \end{pmatrix},

then

\mathbf{u} + \mathbf{v} = \begin{pmatrix} 4 \\ 2 \\ 4 \end{pmatrix}.

Vector addition satisfies the familiar algebraic laws:

  • commutativity: \mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}
  • associativity: (\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})
  • identity: \mathbf{v} + \mathbf{0} = \mathbf{v}
  • inverse: \mathbf{v} + (-\mathbf{v}) = \mathbf{0}

Geometrically, addition follows the parallelogram law: place the two vectors tail to tail, and the sum is the diagonal of the parallelogram they span. Equivalently, place the tail of one vector at the head of the other; the arrow from the first tail to the last head is the sum. That picture becomes important later when we interpret residual connections as additive updates to a representation vector.

Head-to-tail view of vector addition

O ----u----> A ----v----> B
O ----------------------> B
            u + v

2.3 Scalar Multiplication

If \alpha \in \mathbb{R} and \mathbf{v} \in \mathbb{R}^n, then scalar multiplication is also componentwise:

(\alpha \mathbf{v})_i = \alpha v_i.

For example,

2 \begin{pmatrix} 1 \\ -3 \\ 4 \end{pmatrix} = \begin{pmatrix} 2 \\ -6 \\ 8 \end{pmatrix}, \qquad - \begin{pmatrix} 1 \\ -3 \\ 4 \end{pmatrix} = \begin{pmatrix} -1 \\ 3 \\ -4 \end{pmatrix}.

The main properties are:

  • \alpha(\mathbf{u} + \mathbf{v}) = \alpha \mathbf{u} + \alpha \mathbf{v}
  • (\alpha + \beta)\mathbf{v} = \alpha \mathbf{v} + \beta \mathbf{v}
  • \alpha(\beta \mathbf{v}) = (\alpha \beta)\mathbf{v}
  • 1\mathbf{v} = \mathbf{v}

Geometrically, positive scaling stretches or shrinks the vector, while negative scaling also reverses direction.

Scalar multiplication changes length and possibly direction

alpha > 1           0 < alpha < 1         alpha < 0

O----->             O-->                  O<-----
stretch             shrink                flip + scale

2.4 Linear Combinations

A linear combination of vectors \mathbf{v}_1, \ldots, \mathbf{v}_k is any expression of the form

\alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \cdots + \alpha_k \mathbf{v}_k = \sum_{i=1}^{k} \alpha_i \mathbf{v}_i

with scalars \alpha_1, \ldots, \alpha_k.

This idea is central because most linear algebra questions reduce to asking which linear combinations are possible. If you know the allowable combinations, you know what a system can generate.

In AI, linear combinations are everywhere:

  • a dense layer computes weighted sums of input coordinates before adding bias
  • attention outputs are weighted sums of value vectors
  • residual streams combine contributions from multiple components additively
  • concept arithmetic in embedding space uses vector addition and subtraction as semantic approximations

The famous analogy

\text{king} - \text{man} + \text{woman} \approx \text{queen}

is a statement that some semantic relations behave approximately linearly in embedding space (Mikolov et al., 2013). It is not an exact algebraic law, but it is a powerful geometric clue.
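
A toy sketch of this kind of vector arithmetic, using hand-picked 3-dimensional vectors (real embeddings are learned and have hundreds of dimensions; the numbers below are assumptions chosen only to make the analogy come out):

import numpy as np

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.0]),
    "woman": np.array([0.5, 0.1, 0.7]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = emb["king"] - emb["man"] + emb["woman"]
print(max(emb, key=lambda w: cosine(emb[w], target)))  # "queen" for these toy vectors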

2.5 Linear Independence

Vectors \mathbf{v}_1, \ldots, \mathbf{v}_k are linearly independent if the only way to make the zero vector from them is the trivial way:

\alpha_1 \mathbf{v}_1 + \cdots + \alpha_k \mathbf{v}_k = \mathbf{0} \quad \Longrightarrow \quad \alpha_1 = \cdots = \alpha_k = 0.

If a nontrivial combination gives zero, the vectors are linearly dependent. Dependence means redundancy: at least one vector can be expressed in terms of the others.

Examples:

  • (1,0) and (0,1) are independent in \mathbb{R}^2
  • (1,2) and (2,4) are dependent because (2,4) = 2(1,2)
  • any set containing \mathbf{0} is automatically dependent

Testing independence computationally is a rank question. Put the vectors as columns of a matrix A. Then:

  • if \mathrm{rank}(A) = k, the columns are independent
  • if \mathrm{rank}(A) < k, they are dependent

In model analysis, approximate dependence matters as much as exact dependence. If different heads or features occupy nearly the same direction, they carry overlapping information and may be prunable.
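
A quick NumPy check of the rank test (a sketch, not a robust numerical routine):

import numpy as np

# Three vectors in R^3 as columns; the third is the sum of the first two.
A = np.column_stack([[1, 0, 2], [0, 1, 1], [1, 1, 3]])
k = A.shape[1]
print(np.linalg.matrix_rank(A), k)   # 2 3 -> rank < k, so the columns are dependent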

Independent vs dependent in R^2

Independent                     Dependent

      ^ y                            ^ y
      |   w                          |  2v
      |  /                           | /
      | /                            |/
------+------> x             -------+------> x
     /                               /
    v                               v

two distinct directions         same line, one is redundant

2.6 Span

The span of a set of vectors is the set of all linear combinations of those vectors:

\mathrm{span}\{\mathbf{v}_1, \ldots, \mathbf{v}_k\} = \left\{ \sum_{i=1}^{k} \alpha_i \mathbf{v}_i \;\middle|\; \alpha_i \in \mathbb{R} \right\}.

The span tells you every direction reachable from the given generating set.

Typical cases:

  • \mathrm{span}\{\mathbf{v}\} is a line through the origin
  • \mathrm{span}\{\mathbf{v}, \mathbf{w}\} is a plane through the origin if \mathbf{v} and \mathbf{w} are independent
  • \mathrm{span}\{\mathbf{e}_1, \ldots, \mathbf{e}_n\} = \mathbb{R}^n

Two facts are worth memorizing:

  1. span is always a subspace
  2. adding a vector outside the current span increases the dimension by one

This is why span is the right language for representational capacity. A low-rank matrix can only output vectors inside a low-dimensional span. A collection of value vectors can only produce attention outputs inside their span.

What span looks like geometrically

span{v}              span{v, w}               span{e1, ..., en}

   line                 plane                    whole ambient space

    /                    __________
   /                    /        /|
--+--->                /________/ |
                      |        |  |
                      |        | /
                      |________|/

3. Abstract Vector Spaces

Reading note: This section introduces the vector space axioms and their most common concrete instances (\mathbb{R}^n, function spaces, polynomial spaces). The focus here is on geometric intuition and computational examples. For the full axiomatic treatment - subspace criteria, span, linear independence, quotient spaces, and the abstract theory of bases - see the dedicated section.

-> Full axiomatic treatment: Vector Spaces and Subspaces 2-6

3.1 The Vector Space Axioms

Let F be a field, usually \mathbb{R} or \mathbb{C}. A vector space over F is a set V equipped with:

  • vector addition: + : V \times V \to V
  • scalar multiplication: \cdot : F \times V \to V

such that, for all \mathbf{u}, \mathbf{v}, \mathbf{w} \in V and \alpha, \beta \in F, the following laws hold.

Axiom | Statement
Additive closure | \mathbf{u} + \mathbf{v} \in V
Commutativity | \mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}
Associativity | (\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})
Zero vector | There exists \mathbf{0} \in V with \mathbf{v} + \mathbf{0} = \mathbf{v}
Additive inverse | For each \mathbf{v} there exists -\mathbf{v} with \mathbf{v} + (-\mathbf{v}) = \mathbf{0}
Scalar closure | \alpha \mathbf{v} \in V
Distributivity over vector addition | \alpha(\mathbf{u} + \mathbf{v}) = \alpha \mathbf{u} + \alpha \mathbf{v}
Distributivity over scalar addition | (\alpha + \beta)\mathbf{v} = \alpha \mathbf{v} + \beta \mathbf{v}
Scalar associativity | \alpha(\beta \mathbf{v}) = (\alpha \beta)\mathbf{v}
Unit scalar | 1\mathbf{v} = \mathbf{v}

Some texts count eight axioms by folding closure into the operation definitions. Listing ten is often pedagogically clearer.

3.2 Consequences of the Axioms

The axioms are minimal, but they already imply a lot.

The zero vector is unique.
If \mathbf{0} and \mathbf{0}' both behave like additive identities, then

\mathbf{0} = \mathbf{0} + \mathbf{0}' = \mathbf{0}'.

Additive inverses are unique.
If \mathbf{w} and \mathbf{w}' both satisfy \mathbf{v} + \mathbf{w} = \mathbf{0} and \mathbf{v} + \mathbf{w}' = \mathbf{0}, then

\mathbf{w} = \mathbf{w} + \mathbf{0} = \mathbf{w} + (\mathbf{v} + \mathbf{w}') = (\mathbf{w} + \mathbf{v}) + \mathbf{w}' = \mathbf{0} + \mathbf{w}' = \mathbf{w}'.

Zero scalar kills every vector.

0\mathbf{v} = (0 + 0)\mathbf{v} = 0\mathbf{v} + 0\mathbf{v}.

Add the additive inverse of 0\mathbf{v} to both sides to get 0\mathbf{v} = \mathbf{0}.

Every scalar kills the zero vector.

\alpha \mathbf{0} = \alpha(\mathbf{0} + \mathbf{0}) = \alpha \mathbf{0} + \alpha \mathbf{0},

so again \alpha \mathbf{0} = \mathbf{0}.

Multiplying by -1 gives the additive inverse.

\mathbf{v} + (-1)\mathbf{v} = (1 + (-1))\mathbf{v} = 0\mathbf{v} = \mathbf{0},

so (-1)\mathbf{v} = -\mathbf{v}.

These are simple but useful. Many later proofs quietly depend on them.

3.3 Examples of Vector Spaces

The same axioms govern many different mathematical objects.

1. Euclidean space \mathbb{R}^n
This is the standard example. Addition and scalar multiplication are componentwise.

2. Polynomial space P_n
The set of polynomials of degree at most n:

P_n = \{a_0 + a_1 x + \cdots + a_n x^n : a_i \in \mathbb{R}\}.

Addition is polynomial addition, and scalar multiplication rescales coefficients.

3. Matrix space \mathbb{R}^{m \times n}
All m \times n real matrices form a vector space with componentwise addition and scaling.

4. Continuous function space C([a,b])
All continuous real-valued functions on [a,b] form a vector space under pointwise operations:

(f+g)(x) = f(x) + g(x), \qquad (\alpha f)(x) = \alpha f(x).

5. Sequence spaces
Infinite sequences can also form vector spaces. Important examples include \ell^2, the square-summable sequences, and \ell^1, the absolutely summable sequences.

6. L^2([0,1])
Square-integrable functions form an infinite-dimensional vector space that becomes a Hilbert space once we equip it with the usual inner product.

This last example matters in machine learning theory because kernels, Fourier methods, and approximation theory are often most naturally phrased in function spaces rather than coordinate spaces.

3.4 Non-Examples

Students usually learn vector spaces faster by seeing what fails.

Set | Why it is not a vector space
\{x \in \mathbb{R}^n : x_i > 0 \text{ for all } i\} | Not closed under additive inverse
\{x \in \mathbb{R}^n : \|x\| = 1\} | Not closed under addition or scaling
Probability simplex \Delta^{n-1} | Not closed under arbitrary addition or scaling
\mathbb{Z}^n over \mathbb{R} | Not closed under real scalar multiplication
A line not through the origin | Does not contain the zero vector

The probability simplex is especially important in AI. It lives inside a vector space, but it is not itself a vector space because probabilities must remain nonnegative and sum to one.

3.5 Subspaces

A subset W \subseteq V is a subspace if it is itself a vector space under the same operations. The practical test is short:

  1. \mathbf{0} \in W
  2. if \mathbf{u}, \mathbf{v} \in W, then \mathbf{u} + \mathbf{v} \in W
  3. if \mathbf{v} \in W and \alpha \in F, then \alpha \mathbf{v} \in W

Equivalently, a nonempty set is a subspace if it is closed under linear combinations.

Examples inside \mathbb{R}^3:

  • \{\mathbf{0}\} is the trivial subspace
  • any line through the origin is a one-dimensional subspace
  • any plane through the origin is a two-dimensional subspace
  • \mathbb{R}^3 itself is a subspace

Non-examples:

  • a translated line or plane not passing through the origin
  • the unit sphere
  • the set of vectors with first coordinate equal to 1

Subspaces are how linear algebra talks about structure. The column space of a matrix, the null space of a map, the subspace spanned by a collection of attention heads, and the tangent space of a model family all follow the same logic: identify the directions that are allowed, closed, and linearly stable.

Subspace vs non-subspace in R^2

Subspace (through origin)      Not a subspace (shifted)

      /                             /
     /                             /
----+----> x                  -----/----> x
   /                             /
  /

contains 0                     misses 0

4. Basis and Dimension

4.1 Basis Definition

A basis for a vector space V is a set of vectors

B = \{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_n\}

that satisfies two conditions:

  1. the vectors are linearly independent
  2. the vectors span the whole space

These two requirements pull in opposite directions and must hold simultaneously. Independence prevents redundancy; spanning prevents missing directions.

If B is a basis, then every vector \mathbf{v} \in V has a unique expansion

\mathbf{v} = \alpha_1 \mathbf{b}_1 + \alpha_2 \mathbf{b}_2 + \cdots + \alpha_n \mathbf{b}_n.

The coefficients \alpha_1, \ldots, \alpha_n are the coordinates of \mathbf{v} relative to B.

Uniqueness matters. If the same vector had two different coordinate descriptions in the same basis, then subtracting those descriptions would produce a nontrivial linear combination of basis vectors equal to zero, contradicting independence.

In R^2:

one vector          two non-collinear vectors       three vectors
not enough          just enough = basis             spanning but redundant

   /                    ^ y                           ^ y
  /                     |  /                          |  /|
 /                      | /                           | / |
-----------------> x    +--------> x                 +--------> x
                       /                             /   /

4.2 Standard Basis of R^n

The standard basis of \mathbb{R}^n is

\mathbf{e}_1 = (1, 0, \ldots, 0), \; \mathbf{e}_2 = (0, 1, \ldots, 0), \; \ldots, \; \mathbf{e}_n = (0, 0, \ldots, 1).

In the standard basis, the coordinate vector is just the familiar list of components:

\mathbf{v} = \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} = v_1 \mathbf{e}_1 + \cdots + v_n \mathbf{e}_n.

But the standard basis is not sacred. Many problems become simpler after a change of basis:

  • PCA chooses a basis aligned with data variance
  • Fourier analysis chooses sinusoidal basis functions
  • diagonalization chooses an eigenbasis when available
  • orthonormal bases simplify coordinates, projections, and energy calculations

The vector does not change when the basis changes. Only its coordinate description changes.

4.3 Dimension

For finite-dimensional vector spaces, every basis has the same number of elements. That common number is the dimension of the space:

\dim(V) = \text{number of vectors in any basis of } V.

Examples:

  • \dim(\mathbb{R}^n) = n
  • \dim(\mathbb{R}^{m \times n}) = mn
  • \dim(P_n) = n + 1 because \{1, x, x^2, \ldots, x^n\} is a basis
  • \dim(C([a,b])) = \infty

Dimension is a count of independent directions, not a count of how many vectors happen to be mentioned in a description. You can describe a plane in \mathbb{R}^3 using ten spanning vectors if you want; the dimension is still 2.

In AI practice, dimension is often both a modeling choice and a computational budget. Embedding dimension, head dimension, hidden width, and bottleneck rank are all dimension decisions.

4.4 Dimension and Subspaces

If W is a subspace of a finite-dimensional vector space V, then

\dim(W) \leq \dim(V).

Moreover:

  • if \dim(W) = \dim(V), then W = V
  • if \dim(W) = 0, then W = \{\mathbf{0}\}
  • in \mathbb{R}^2, the only subspaces are \{\mathbf{0}\}, lines through the origin, and \mathbb{R}^2
  • in \mathbb{R}^3, the only subspaces are \{\mathbf{0}\}, lines through the origin, planes through the origin, and \mathbb{R}^3

The codimension of W in V is

\mathrm{codim}(W) = \dim(V) - \dim(W).

Codimension tells you how many independent directions are missing.

This is the right language for low-rank modeling. If a matrix maps a 4096-dimensional residual stream into a 64-dimensional attention head space, then its image has dimension at most 64, so most of the ambient directions are not expressible in that head.

4.5 Coordinates and Change of Basis

Let B = \{\mathbf{b}_1, \ldots, \mathbf{b}_n\} be a basis for V. The coordinate vector of \mathbf{v} relative to B is

[\mathbf{v}]_B = \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{pmatrix} \quad \text{where} \quad \mathbf{v} = \alpha_1 \mathbf{b}_1 + \cdots + \alpha_n \mathbf{b}_n.

If B' is another basis, then coordinates transform by an invertible matrix:

[\mathbf{v}]_{B'} = P\,[\mathbf{v}]_B.

The change-of-basis matrix is invertible because coordinates in each basis are unique. If it were not invertible, some nonzero coordinate vector would collapse to zero, contradicting uniqueness.

Suppose a linear map T is represented by matrix A in one basis and by matrix A' in another basis. Then the relationship is

A' = P^{-1} A P.

This is a similarity transformation. It does not change the underlying map; it changes the coordinates in which the map is described.
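
A small NumPy check of this invariance, assuming a randomly generated (hence almost surely invertible) change-of-basis matrix P:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))                 # the map in the old coordinates
P = rng.normal(size=(3, 3))                 # change-of-basis matrix
A_prime = np.linalg.inv(P) @ A @ P          # the same map in the new coordinates

# Eigenvalues are a property of the map, not of the coordinates.
print(np.sort(np.linalg.eigvals(A)))
print(np.sort(np.linalg.eigvals(A_prime)))  # same values up to round-off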

That idea appears all over ML:

  • whitening and PCA rotate data into more convenient coordinates
  • orthogonal parameterizations change basis while preserving norms
  • query, key, and value projections choose learned coordinate systems inside the embedding dimension
Same vector, different coordinates

standard basis:              rotated basis:

      y                           b2
      ^                          /
      |    v                    /
      |   /                    /   v
      |  /                    /
------+------> x        -----/----------> b1

The geometric arrow is the same.
Only the coordinate description changes.

4.6 Rank-Nullity Theorem

Let T : V \to W be a linear map with finite-dimensional domain. Then

\dim(\ker T) + \dim(\mathrm{im}\, T) = \dim(V).

The two terms are named:

  • nullity: \dim(\ker T)
  • rank: \dim(\mathrm{im}\, T)

For a matrix A \in \mathbb{R}^{m \times n}, this becomes

\mathrm{rank}(A) + \mathrm{nullity}(A) = n.

Interpretation:

  • rank counts how many independent directions survive the map
  • nullity counts how many independent directions are lost completely

If A has rank r < n, then every output lies in an r-dimensional subspace of \mathbb{R}^m, and an (n-r)-dimensional family of input directions gets sent to zero.
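
A small NumPy illustration of rank-nullity, using an SVD to read off a null-space basis (a sketch under the stated rank assumptions):

import numpy as np

A = np.array([[1., 2., 3.],
              [2., 4., 6.]])                # rank-1 map from R^3 to R^2
n = A.shape[1]
rank = np.linalg.matrix_rank(A)
print(rank, n - rank)                       # 1 2, and 1 + 2 = n

_, s, Vt = np.linalg.svd(A)
null_basis = Vt[rank:]                      # rows spanning the null space
print(np.allclose(A @ null_basis.T, 0))     # True: these directions are sent to zero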

This theorem explains a great deal of ML engineering:

  • low-rank adaptation works because useful updates often live in small image spaces (Hu et al., 2022)
  • bottleneck layers deliberately compress information into lower-dimensional subspaces
  • overparameterized models have large null spaces in parameter space, which helps explain why many parameter settings realize similar functions

5. Norms and Metric Spaces

5.1 Norms on Vector Spaces

A norm on a vector space V is a function

\|\cdot\| : V \to \mathbb{R}

satisfying, for all \mathbf{u}, \mathbf{v} \in V and all scalars \alpha:

  1. positive definiteness: \|\mathbf{v}\| \geq 0, with equality iff \mathbf{v} = \mathbf{0}
  2. homogeneity: \|\alpha \mathbf{v}\| = |\alpha| \, \|\mathbf{v}\|
  3. triangle inequality: \|\mathbf{u} + \mathbf{v}\| \leq \|\mathbf{u}\| + \|\mathbf{v}\|

A norm measures size. Once a norm is available, we can talk about boundedness, convergence, approximation error, and regularization.

5.2 The p-Norms

For \mathbf{v} \in \mathbb{R}^n and 1 \leq p < \infty, the p-norm is

\|\mathbf{v}\|_p = \left( \sum_{i=1}^{n} |v_i|^p \right)^{1/p}.

The limiting case is

\|\mathbf{v}\|_\infty = \max_i |v_i|.

The most important examples are:

L1 norm

\|\mathbf{v}\|_1 = \sum_{i=1}^{n} |v_i|.

This encourages sparsity in optimization because its unit ball has corners aligned with coordinate axes.

L2 norm

\|\mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{n} v_i^2}.

This is the Euclidean length and the most common norm in deep learning.

L-infinity norm

\|\mathbf{v}\|_\infty = \max_i |v_i|.

This measures worst-case component size and is common in adversarial robustness.

The geometry of the unit ball depends on p:

  • p = 1: diamond or cross-polytope
  • p = 2: sphere
  • p = \infty: cube

Those shapes are not cosmetic. They explain different optimization behavior.

Unit balls in 2D

L1 ball              L2 ball              L_inf ball

   /\                  ____               +------+
  /  \                /    \              |      |
  \  /                \____/              |      |
   \/                                     +------+

corners              smooth                flat faces

5.3 Norm Equivalence

In finite-dimensional spaces, all norms induce the same notion of convergence. More precisely, if \|\cdot\|_a and \|\cdot\|_b are norms on \mathbb{R}^n, then there exist constants c_1, c_2 > 0 such that

c_1 \|\mathbf{v}\|_a \leq \|\mathbf{v}\|_b \leq c_2 \|\mathbf{v}\|_a \qquad \text{for all } \mathbf{v} \in \mathbb{R}^n.

Useful special inequalities:

\|\mathbf{v}\|_\infty \leq \|\mathbf{v}\|_2 \leq \|\mathbf{v}\|_1, \qquad \|\mathbf{v}\|_1 \leq \sqrt{n}\,\|\mathbf{v}\|_2, \qquad \|\mathbf{v}\|_2 \leq \sqrt{n}\,\|\mathbf{v}\|_\infty, \qquad \|\mathbf{v}\|_1 \leq n\,\|\mathbf{v}\|_\infty.

This means that in finite dimensions, saying "a sequence converges" does not depend on whether you measure error with L1, L2, or L-infinity. In infinite dimensions, this is false. Different norms can induce genuinely different topologies.
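
A quick numerical sanity check of these inequalities (a sketch; any vector works):

import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=1000)
n = v.size

l1, l2, linf = (np.linalg.norm(v, p) for p in (1, 2, np.inf))

assert linf <= l2 <= l1
assert l1 <= np.sqrt(n) * l2
assert l2 <= np.sqrt(n) * linf
assert l1 <= n * linf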

5.4 Matrix Norms

Matrices also admit norms. Important ones include:

Frobenius norm

\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\mathrm{tr}(A^\top A)}.

This is the Euclidean norm of the matrix viewed as one long vector.

Spectral norm

\|A\|_2 = \sigma_{\max}(A).

This is the maximum stretch factor of the linear map:

\|A\|_2 = \sup_{\|\mathbf{x}\|_2 = 1} \|A\mathbf{x}\|_2.

Nuclear norm

\|A\|_* = \sum_i \sigma_i(A).

This is the sum of singular values. It is a convex surrogate for rank and appears in matrix completion and low-rank learning (Candes and Recht, 2009).

Induced 1-norm and infinity-norm

\|A\|_1 = \max_j \sum_i |A_{ij}|, \qquad \|A\|_\infty = \max_i \sum_j |A_{ij}|.

In ML:

  • Frobenius norm corresponds to standard weight decay on matrix entries
  • spectral norm controls the Lipschitz constant of a linear layer
  • nuclear norm promotes low-rank structure
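
A short NumPy sketch computing these matrix norms and checking the singular-value identities:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
s = np.linalg.svd(A, compute_uv=False)         # singular values

frobenius = np.linalg.norm(A, 'fro')
spectral  = np.linalg.norm(A, 2)               # largest singular value
nuclear   = s.sum()                            # sum of singular values

print(np.isclose(frobenius**2, (s**2).sum()))  # True: ||A||_F^2 = sum of sigma_i^2
print(np.isclose(spectral, s.max()))           # True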

5.5 Metric Spaces

A metric space is a set X equipped with a distance function

d : X \times X \to \mathbb{R}

such that for all x, y, z \in X:

  1. d(x,y) \geq 0, with equality iff x = y
  2. d(x,y) = d(y,x)
  3. d(x,z) \leq d(x,y) + d(y,z)

Every norm induces a metric by

d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|.

Not every metric comes from a norm. Edit distance on strings is a metric, but there is no vector subtraction of strings that generates it as a norm.

AI uses many distances:

  • Euclidean distance for retrieval and clustering
  • cosine distance for directional similarity
  • Hamming distance for binary codes
  • edit distance for token sequences
  • Wasserstein distance for distributions
  • KL divergence as a useful non-metric divergence

The important point is conceptual: vector-space geometry gives one family of distances, but machine learning uses broader metric ideas too.

5.6 Convergence in Normed Spaces

A sequence (\mathbf{v}_n) converges to \mathbf{v} in norm if

\|\mathbf{v}_n - \mathbf{v}\| \to 0 \qquad \text{as } n \to \infty.

A sequence is Cauchy if its terms eventually become arbitrarily close to each other:

\forall \varepsilon > 0,\ \exists N \text{ such that } m, n \geq N \implies \|\mathbf{v}_n - \mathbf{v}_m\| < \varepsilon.

A normed space is complete if every Cauchy sequence converges to a limit that still belongs to the space. A complete normed space is called a Banach space.

Important examples:

  • \mathbb{R}^n with any norm is complete
  • matrix spaces with standard norms are complete
  • C([a,b]) with the sup norm is complete
  • L^p spaces are complete for 1 \leq p \leq \infty

Completeness matters because many learning algorithms are iterative. Gradient methods, fixed-point solvers, and optimization routines generate sequences. Completeness is the condition that prevents the limit from "falling out of the space."


6. Inner Product Spaces

Reading note: This section develops inner products concretely in \mathbb{R}^n and establishes the geometric tools used throughout this chapter (dot product, angle, Cauchy-Schwarz, orthogonality). The abstract inner product space theory - Gram-Schmidt, orthonormal bases, Hilbert spaces, and orthogonal complements - is developed in full later.

-> Full treatment: Vector Spaces and Subspaces 9

6.1 Inner Products

An inner product on a real vector space V is a function

\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}

satisfying, for all \mathbf{u}, \mathbf{v}, \mathbf{w} \in V and scalars \alpha, \beta:

  1. symmetry: \langle \mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{v}, \mathbf{u} \rangle
  2. linearity: \langle \alpha \mathbf{u} + \beta \mathbf{v}, \mathbf{w} \rangle = \alpha \langle \mathbf{u}, \mathbf{w} \rangle + \beta \langle \mathbf{v}, \mathbf{w} \rangle
  3. positive definiteness: \langle \mathbf{v}, \mathbf{v} \rangle \geq 0, with equality iff \mathbf{v} = \mathbf{0}

An inner product upgrades a vector space with geometry. Once it is available, we can measure lengths, define angles, talk about orthogonality, and project onto subspaces.

For complex vector spaces, the definition is slightly modified: \langle \mathbf{u}, \mathbf{v} \rangle = \overline{\langle \mathbf{v}, \mathbf{u} \rangle} and one argument is conjugate-linear. The core geometric story remains the same.

6.2 Standard Inner Product on R^n

The standard inner product on \mathbb{R}^n is the dot product:

\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^\top \mathbf{v} = \sum_{i=1}^{n} u_i v_i.

This inner product induces the Euclidean norm:

\|\mathbf{v}\|_2 = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}.

So in Euclidean space, angle and length are not independent notions. They are both derived from the same primitive object.

6.3 Geometric Interpretation

The angle \theta between nonzero vectors \mathbf{u} and \mathbf{v} is defined by

\cos \theta = \frac{\langle \mathbf{u}, \mathbf{v} \rangle}{\|\mathbf{u}\| \, \|\mathbf{v}\|}.

Interpretation:

  • \theta = 0: same direction, maximal alignment
  • \theta = \pi/2: orthogonal, zero alignment
  • \theta = \pi: opposite directions

Cosine similarity is exactly this normalized inner product. It is widely used for embeddings because it measures direction rather than raw magnitude.

For attention, the score q^\top k / \sqrt{d_k} combines both angular alignment and magnitude. If vectors are normalized, the score is proportional to cosine similarity. If they are not, norm effects matter too.
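
A minimal sketch contrasting the two quantities (the helper names are illustrative assumptions):

import numpy as np

def cosine_similarity(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))   # in [-1, 1]

def attention_score(q, k):
    return q @ k / np.sqrt(q.size)    # scaled dot product: angle and magnitude both matter

q = np.array([1.0, 2.0, 0.0, -1.0])
k = np.array([0.5, 1.0, 1.0,  0.0])
print(cosine_similarity(q, k), attention_score(q, k))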

Angle and alignment

same direction          orthogonal            opposite direction

-----> ----->           ----->                <----- ----->
                        |
                        |
                        v

cos(theta) = 1          cos(theta) = 0        cos(theta) = -1

6.4 Cauchy-Schwarz Inequality

The fundamental inequality of inner-product geometry is

|\langle \mathbf{u}, \mathbf{v} \rangle| \leq \|\mathbf{u}\| \, \|\mathbf{v}\|.

Equality holds exactly when \mathbf{u} and \mathbf{v} are linearly dependent.

A classic proof uses positivity. For any real t,

0 \leq \|\mathbf{u} + t \mathbf{v}\|^2 = \langle \mathbf{u} + t\mathbf{v}, \mathbf{u} + t\mathbf{v} \rangle = \|\mathbf{u}\|^2 + 2t \langle \mathbf{u}, \mathbf{v} \rangle + t^2 \|\mathbf{v}\|^2.

This quadratic in t is nonnegative for every t, so its discriminant must be nonpositive:

4\langle \mathbf{u}, \mathbf{v} \rangle^2 - 4\|\mathbf{u}\|^2 \|\mathbf{v}\|^2 \leq 0,

which implies the result.

Cauchy-Schwarz guarantees that cosine similarity always lies in [-1, 1]. It also gives quick upper bounds on correlations, dot products, and approximation error.

6.5 Orthogonality

Vectors are orthogonal if their inner product is zero:

\mathbf{u} \perp \mathbf{v} \quad \Longleftrightarrow \quad \langle \mathbf{u}, \mathbf{v} \rangle = 0.

An orthogonal set is a collection of pairwise orthogonal nonzero vectors. Such a set is automatically linearly independent.

Proof sketch: if

\alpha_1 \mathbf{v}_1 + \cdots + \alpha_k \mathbf{v}_k = \mathbf{0},

take the inner product with \mathbf{v}_j. All cross-terms vanish, leaving

\alpha_j \|\mathbf{v}_j\|^2 = 0,

so \alpha_j = 0 for every j.

An orthonormal set is orthogonal and normalized:

\langle \mathbf{u}_i, \mathbf{u}_j \rangle = \delta_{ij}.

In an orthonormal basis, coordinates are exceptionally simple:

\mathbf{v} = \sum_{i=1}^{n} \langle \mathbf{v}, \mathbf{u}_i \rangle \mathbf{u}_i.

You get coefficients by taking inner products. No matrix inversion is needed.

Orthogonality is central in ML because it reduces interference:

  • orthogonal initializations help preserve signal norms through depth (Saxe et al., 2014)
  • orthogonal features are easier to disentangle
  • orthogonal subspaces make decomposition interpretable

6.6 Gram-Schmidt Orthogonalization

Given linearly independent vectors \mathbf{v}_1, \ldots, \mathbf{v}_n, the Gram-Schmidt procedure constructs an orthonormal basis \mathbf{u}_1, \ldots, \mathbf{u}_n spanning the same subspace.

First set

\mathbf{u}_1 = \frac{\mathbf{v}_1}{\|\mathbf{v}_1\|}.

Then recursively subtract the components already accounted for:

\widetilde{\mathbf{u}}_k = \mathbf{v}_k - \sum_{j=1}^{k-1} \langle \mathbf{v}_k, \mathbf{u}_j \rangle \mathbf{u}_j,

and normalize:

\mathbf{u}_k = \frac{\widetilde{\mathbf{u}}_k}{\|\widetilde{\mathbf{u}}_k\|}.

Each step removes the part of \mathbf{v}_k already explained by the previous orthonormal vectors. What remains is orthogonal to the earlier directions.

Conceptually, Gram-Schmidt says:

  1. start with a spanning description
  2. strip away redundancy direction by direction
  3. normalize what survives

Algorithmically, Gram-Schmidt underlies QR decomposition. Numerically, modified Gram-Schmidt and Householder methods are often preferred for stability.
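
A compact NumPy version of the procedure above (classical Gram-Schmidt; a sketch assuming the inputs are linearly independent, not a numerically robust QR):

import numpy as np

def gram_schmidt(vectors):
    basis = []
    for v in vectors:
        w = v.astype(float).copy()
        for u in basis:
            w = w - (v @ u) * u       # remove the component already explained by u
        basis.append(w / np.linalg.norm(w))
    return np.array(basis)

V = np.array([[1., 1., 0.],
              [1., 0., 1.],
              [0., 1., 1.]])
Q = gram_schmidt(V)
print(np.allclose(Q @ Q.T, np.eye(3)))   # True: the rows are orthonormal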

Gram-Schmidt idea for v2

u1 -------------------->

v2 ------------------------>

split v2 into:

proj_u1(v2) -------------->
remainder                  ^
                           |
                           |

Keep the remainder, then normalize it.

6.7 Orthogonal Complement

If W is a subspace of an inner-product space V, its orthogonal complement is

W^\perp = \{\mathbf{v} \in V : \langle \mathbf{v}, \mathbf{w} \rangle = 0 \text{ for all } \mathbf{w} \in W\}.

Important facts:

  • W^\perp is always a subspace
  • in finite dimensions, \dim(W) + \dim(W^\perp) = \dim(V)
  • every vector decomposes uniquely as
\mathbf{v} = \mathbf{w} + \mathbf{w}^\perp \qquad \text{with } \mathbf{w} \in W, \; \mathbf{w}^\perp \in W^\perp

This decomposition is one of the most useful ideas in linear algebra. It is how least squares, residuals, and projection error are understood.

For matrices, the row space is orthogonal to the null space, and the column space is orthogonal to the left null space. Those are special cases of the same principle.

Orthogonal decomposition

               v
              /|
             / |
            /  |  v_perp
-----------*---+-----------------> W
          proj_W(v)

v = proj_W(v) + v_perp

6.8 Hilbert Spaces

A Hilbert space is a complete inner-product space. It has both geometry and completeness.

Finite-dimensional inner-product spaces are automatically Hilbert spaces. The interesting cases are infinite-dimensional:

  • \ell^2: square-summable sequences
  • L^2([a,b]): square-integrable functions
  • reproducing kernel Hilbert spaces (RKHS), central to kernel methods

In a Hilbert space, orthonormal bases and projection theory still work, though bases may now be infinite. Parseval-type identities and Fourier expansions become available:

\|f\|^2 = \sum_{i=1}^{\infty} |\langle f, e_i \rangle|^2

for suitable orthonormal bases \{e_i\}.

Hilbert spaces matter for AI because many learning-theoretic objects are not finite vectors at all; they are functions. Kernel methods, Gaussian processes, and parts of neural network theory live more naturally in Hilbert spaces than in \mathbb{R}^n.


7. Orthogonal Projections

7.1 Projection onto a Subspace

Given a subspace W of an inner-product space and a vector \mathbf{v}, the orthogonal projection of \mathbf{v} onto W is the vector in W closest to \mathbf{v}:

\mathrm{Proj}_W(\mathbf{v}) = \arg\min_{\mathbf{w} \in W} \|\mathbf{v} - \mathbf{w}\|.

The key characterization is:

  • \mathrm{Proj}_W(\mathbf{v}) \in W
  • \mathbf{v} - \mathrm{Proj}_W(\mathbf{v}) \in W^\perp

For a one-dimensional subspace spanned by a nonzero vector \mathbf{u},

\mathrm{Proj}_{\mathbf{u}}(\mathbf{v}) = \frac{\langle \mathbf{v}, \mathbf{u} \rangle}{\langle \mathbf{u}, \mathbf{u} \rangle} \mathbf{u}.

If \mathbf{u} is already unit length, this simplifies to

\mathrm{Proj}_{\mathbf{u}}(\mathbf{v}) = \langle \mathbf{v}, \mathbf{u} \rangle \mathbf{u}.

Projection is the formal version of "keep the part of the vector that points in the subspace, discard the perpendicular remainder."

Projection onto a line or subspace

          v
         /|
        / |
       /  |  residual = v - Proj_W(v)
------*---+---------------------- W
     Proj_W(v)

7.2 Projection Matrix Properties

If P is the matrix of an orthogonal projection, then:

  • P^2 = P (idempotence)
  • P^\top = P (symmetry)

The first identity says projecting twice is the same as projecting once. The second says the projection is orthogonal rather than oblique.

For projection onto the line spanned by a nonzero vector \mathbf{u},

P = \frac{\mathbf{u}\mathbf{u}^\top}{\mathbf{u}^\top \mathbf{u}}.

Its eigenvalues are 0 and 1 only:

  • eigenvalue 1 corresponds to directions already in the target subspace
  • eigenvalue 0 corresponds to directions annihilated by projection

The complementary projector is I - P, which projects onto the orthogonal complement.

7.3 Projection onto Column Space

Let A \in \mathbb{R}^{m \times n} have full column rank. The orthogonal projector onto the column space of A is

P_A = A(A^\top A)^{-1}A^\top.

Why this formula works:

  1. every projected vector has the form A\mathbf{x}, so it lies in \mathrm{col}(A)
  2. the residual must be orthogonal to every column of A
  3. orthogonality of the residual gives the normal equations

If the columns of A are orthonormal, then A^\top A = I and the formula simplifies to

P_A = AA^\top.

This expression appears in least squares, regression, PCA, and low-dimensional approximation.

In Transformer language, learned projections W_Q, W_K, and W_V are not orthogonal projectors in the strict algebraic sense, but they do map activations into lower-dimensional subspaces where later computations occur. Orthogonal projection is therefore the clean mathematical ideal behind many approximate representation moves.
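
A small NumPy check of the projector formula and its defining properties (a sketch assuming A has full column rank):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 2))
P = A @ np.linalg.inv(A.T @ A) @ A.T         # projector onto col(A)

b = rng.normal(size=6)
r = b - P @ b                                # residual

print(np.allclose(P @ P, P), np.allclose(P, P.T))   # idempotent and symmetric
print(np.allclose(A.T @ r, 0))                      # residual orthogonal to col(A)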

Projecting b onto col(A)

          b
         /|
        / |
       /  | residual
      /   |
-----*----+----------> col(A)
   A x_hat

Best-fit output lies inside col(A).
The leftover error is orthogonal to col(A).

7.4 Gram Matrix

Given vectors \mathbf{v}_1, \ldots, \mathbf{v}_n in \mathbb{R}^d, the Gram matrix is

G_{ij} = \langle \mathbf{v}_i, \mathbf{v}_j \rangle.

If V is the matrix whose rows are the vectors, then

G = VV^\top.

Gram matrices have two key properties:

  1. they are symmetric
  2. they are positive semidefinite

Indeed, for any coefficient vector \mathbf{x},

\mathbf{x}^\top G \mathbf{x} = \mathbf{x}^\top VV^\top \mathbf{x} = \|V^\top \mathbf{x}\|^2 \geq 0.

The Gram matrix is positive definite exactly when the vectors are linearly independent.

In AI:

  • attention score matrices are built from dot products, so they are Gram-like objects before softmax scaling and masking
  • kernel matrices are Gram matrices in feature space
  • covariance and similarity matrices are Gram constructions with centering or normalization added

7.5 Best Approximation and Least Squares

Orthogonal projection solves the best approximation problem:

\min_{\mathbf{w} \in W} \|\mathbf{v} - \mathbf{w}\|_2.

If \mathbf{p} = \mathrm{Proj}_W(\mathbf{v}), then the residual \mathbf{r} = \mathbf{v} - \mathbf{p} is orthogonal to W, and Pythagoras gives

\|\mathbf{v}\|^2 = \|\mathbf{p}\|^2 + \|\mathbf{r}\|^2.

Least squares is exactly this geometry in matrix form. Given an overdetermined system A\mathbf{x} \approx \mathbf{b}, the least-squares solution minimizes

\|A\mathbf{x} - \mathbf{b}\|_2^2.

The residual must be orthogonal to the column space of A:

A^\top (A\mathbf{x} - \mathbf{b}) = 0,

so

A^\top A \mathbf{x} = A^\top \mathbf{b}.

When A has full column rank,

\mathbf{x}^* = (A^\top A)^{-1} A^\top \mathbf{b}.

Geometrically, A\mathbf{x}^* is the projection of \mathbf{b} onto the column space of A.

This is why projection is not just a geometric curiosity. It is the linear algebra under classical regression, representation compression, denoising, and many approximation arguments used throughout machine learning.
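
A minimal NumPy sketch of the least-squares geometry (the normal-equations route shown here is for illustration; QR-based solvers such as lstsq are preferred numerically):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
b = rng.normal(size=20)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)      # normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)   # library least-squares solver

print(np.allclose(x_normal, x_lstsq))             # True
print(np.allclose(A.T @ (A @ x_lstsq - b), 0))    # residual is orthogonal to col(A)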


8. Affine Subspaces and Convexity

8.1 Affine Subspaces

A linear subspace must pass through the origin. Many important geometric sets do not. An affine subspace is a translated subspace:

A = \mathbf{v}_0 + W = \{\mathbf{v}_0 + \mathbf{w} : \mathbf{w} \in W\}.

Here \mathbf{v}_0 is a base point and W is a linear subspace called the direction space.

Examples:

  • a line not through the origin in \mathbb{R}^2
  • a plane ax + by + cz = d in \mathbb{R}^3 with d \neq 0
  • the hyperplane \sum_i p_i = 1 containing the probability simplex \Delta^{n-1}, because the sum-to-one condition is an affine constraint

Affine spaces are not vector spaces unless they happen to pass through the origin. You can subtract two points in an affine space to get a direction vector, but you cannot usually add two points and stay in the set.

Linear subspace vs affine subspace

subspace W                  affine set v0 + W

      /                           /
     /                           /
----+----> x                ----/----> x
   /                           /
  /

through 0                    shifted away from 0

8.2 Convex Sets

A set C is convex if, whenever \mathbf{u}, \mathbf{v} \in C and \lambda \in [0,1],

\lambda \mathbf{u} + (1-\lambda)\mathbf{v} \in C.

This means the entire line segment joining any two points of the set stays inside the set.

Important convex sets:

  • every subspace
  • every affine subspace
  • every norm ball \{\mathbf{v} : \|\mathbf{v}\| \leq r\} for a true norm
  • every half-space \{\mathbf{x} : \mathbf{w}^\top \mathbf{x} \leq b\}
  • the probability simplex

The convex hull of a set S, written \mathrm{conv}(S), is the set of all convex combinations of points in S. It is the smallest convex set containing S.

This matters directly for attention. The output of a single attention head at one position is

\sum_j \alpha_j \mathbf{v}_j \quad \text{with } \alpha_j \geq 0, \; \sum_j \alpha_j = 1,

so it lies in the convex hull of the value vectors at that position.

Convex vs non-convex

Convex: segment stays inside

*-----------*

Non-convex: segment leaves the set

*---- gap ----*

8.3 Hyperplanes and Half-Spaces

A hyperplane in \mathbb{R}^n is a set of the form

\{\mathbf{x} \in \mathbb{R}^n : \mathbf{w}^\top \mathbf{x} = b\}

for some nonzero normal vector \mathbf{w}.

Geometrically, a hyperplane is an (n-1)-dimensional affine subspace. It cuts space into two half-spaces:

\{\mathbf{x} : \mathbf{w}^\top \mathbf{x} \leq b\}, \qquad \{\mathbf{x} : \mathbf{w}^\top \mathbf{x} \geq b\}.

Hyperplanes are the basic decision surfaces of linear models. A neuron computes

\mathbf{w}^\top \mathbf{x} + b.

Thresholding that expression defines a half-space. A ReLU network can therefore be viewed as composing piecewise-linear maps defined by many hyperplane arrangements.

Convex analysis adds a deeper theorem: disjoint convex sets can often be separated by hyperplanes. This is the geometric basis for max-margin classification, linear probes, and many duality arguments.

Hyperplane and half-spaces

half-space        hyperplane         half-space

xxxxx             |                  .....
xxxxx             |  w^T x = b       .....
xxxxx             |                  .....
                  ^
                  normal direction w

8.4 Convex Functions and Sublevel Sets

A function f : V \to \mathbb{R} is convex if for all \mathbf{u}, \mathbf{v} and \lambda \in [0,1],

f(\lambda \mathbf{u} + (1-\lambda)\mathbf{v}) \leq \lambda f(\mathbf{u}) + (1-\lambda) f(\mathbf{v}).

Convex functions are important because their sublevel sets

\{\mathbf{v} : f(\mathbf{v}) \leq c\}

are convex. This makes optimization far more tractable.

Examples:

  • f(\mathbf{x}) = \|\mathbf{x}\|_2^2 is convex
  • f(\mathbf{x}) = \|\mathbf{x}\|_1 is convex
  • logistic loss and cross-entropy are convex in suitable linear-model settings
  • deep network training objectives are generally not globally convex

The contrast matters. Classical optimization often uses convex structure globally. Deep learning rarely gets global convexity, but convex sets and convex penalties still appear everywhere locally and architecturally.


9. High-Dimensional Geometry

9.1 The Curse of Dimensionality

Low-dimensional intuition is unreliable in high dimensions. This broad phenomenon is often called the curse of dimensionality.

One form is volume concentration. For a d-dimensional ball of radius r, the fraction of volume inside the inner ball of radius r - \varepsilon is

\left(1 - \frac{\varepsilon}{r}\right)^d.

As d grows, this shrinks rapidly. Most of the volume moves near the boundary.

Another form is distance concentration. Random points in high-dimensional spaces tend to be at similar distances from each other, so nearest-neighbor structure becomes less contrastive unless data has meaningful low-dimensional structure.

For embeddings, this means random vectors in high dimension are usually nearly orthogonal. Meaningful similarity is therefore a signal against a strong null background of near-right angles.

Volume concentration intuition

low dimension:            high dimension:

[ core | shell ]          [tiny core | shell shell shell shell]

As dimension grows, most volume moves into the shell.

9.2 Concentration of Measure

Concentration of measure says that in high-dimensional spaces, many reasonably smooth functions vary much less than low-dimensional intuition suggests. Informally: a Lipschitz function on a high-dimensional sphere is almost constant on most of the sphere.

One representative result is that if \mathbf{x} is random on the unit sphere S^{d-1} and \mathbf{u} is a fixed unit vector, then the inner product \langle \mathbf{x}, \mathbf{u} \rangle is sharply concentrated near 0 as d grows.

Consequences:

  • random directions are almost orthogonal
  • norms and averages become more stable than naive geometric pictures suggest
  • random projections preserve more structure than one might expect

These ideas are central to modern high-dimensional probability (Vershynin, 2018) and are one reason random features and approximate nearest neighbor methods work surprisingly well.

9.3 Johnson-Lindenstrauss Lemma

The Johnson-Lindenstrauss lemma formalizes dimension reduction with bounded distortion (Johnson and Lindenstrauss, 1984; Dasgupta and Gupta, 2003).

Given n points in a high-dimensional Euclidean space and an accuracy parameter \varepsilon \in (0,1), there exists a linear map

f : \mathbb{R}^d \to \mathbb{R}^k

with

k = O\!\left(\frac{\log n}{\varepsilon^2}\right)

such that all pairwise distances are preserved up to multiplicative error (1 \pm \varepsilon):

(1-\varepsilon)\|\mathbf{v}_i - \mathbf{v}_j\|_2^2 \leq \|f(\mathbf{v}_i) - f(\mathbf{v}_j)\|_2^2 \leq (1+\varepsilon)\|\mathbf{v}_i - \mathbf{v}_j\|_2^2.

Remarkably, the target dimension depends logarithmically on the number of points, not on the original ambient dimension.
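
A quick sketch of the idea with a Gaussian random projection, one standard JL construction (the specific sizes below are arbitrary assumptions):

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 5000, 500
X = rng.normal(size=(n, d))

R = rng.normal(size=(d, k)) / np.sqrt(k)   # scaled random projection matrix
Y = X @ R

ratio = np.linalg.norm(Y[0] - Y[1]) / np.linalg.norm(X[0] - X[1])
print(ratio)   # close to 1: the pairwise distance survives the 10x compression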

Practical message for ML:

  • random projections can compress data while preserving geometry
  • retrieval systems can work in reduced spaces
  • head dimension can be much smaller than model dimension without destroying all pairwise structure
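
A minimal sketch of the random-projection idea, assuming a Gaussian projection scaled by 1/sqrt(k) and dimensions chosen purely for illustration: project points from d = 4096 down to k = 256 and compare squared pairwise distances before and after.

# Sketch: a scaled Gaussian random projection approximately preserves pairwise distances.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 4096, 256
X = rng.normal(size=(n, d))                  # n points in R^d
P = rng.normal(size=(d, k)) / np.sqrt(k)     # random projection, scaled so norms are preserved on average
Y = X @ P                                    # the same points in R^k

def pairwise_sq_dists(Z):
    sq = np.sum(Z**2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * Z @ Z.T

ratio = pairwise_sq_dists(Y) / (pairwise_sq_dists(X) + 1e-12)
off_diag = ratio[~np.eye(n, dtype=bool)]
print("distance ratio range:", off_diag.min(), off_diag.max())   # typically close to 1 on both sides

The point of the experiment is the JL message: the distortion depends on k and the number of points, not on the original dimension 4096.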

9.4 Random Vectors in High Dimensions

If \mathbf{x} and \mathbf{y} are independent random unit vectors in \mathbb{R}^d, then for large d,

\mathbb{E}[\langle \mathbf{x}, \mathbf{y} \rangle] = 0, \qquad \mathrm{Var}(\langle \mathbf{x}, \mathbf{y} \rangle) \approx \frac{1}{d}.

So the inner product typically has size about 1/\sqrt{d}, which goes to zero with dimension.

A related fact is

\|\mathbf{x} - \mathbf{y}\|_2^2 = \|\mathbf{x}\|_2^2 + \|\mathbf{y}\|_2^2 - 2\langle \mathbf{x}, \mathbf{y} \rangle \approx 2.

Thus random unit vectors are not only almost orthogonal; they are also all about the same Euclidean distance apart.

This helps explain why high-dimensional spaces can pack many nearly independent directions. It also explains why cosine similarity is a more informative baseline than raw Euclidean distance in many embedding applications.
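
Both facts are easy to check by simulation. The sketch below (d = 1000 and 2000 sampled pairs are arbitrary choices) estimates the inner-product statistics and the typical distance between random unit vectors.

# Sketch: random unit vectors in high dimension are nearly orthogonal and nearly equidistant.
import numpy as np

rng = np.random.default_rng(0)
d, pairs = 1000, 2000
x = rng.normal(size=(pairs, d)); x /= np.linalg.norm(x, axis=1, keepdims=True)
y = rng.normal(size=(pairs, d)); y /= np.linalg.norm(y, axis=1, keepdims=True)

dots = np.sum(x * y, axis=1)
dists = np.linalg.norm(x - y, axis=1)
print("mean dot:", dots.mean(), " std dot:", dots.std())   # ~0 and ~1/sqrt(d) ~ 0.032
print("mean distance:", dists.mean())                      # ~ sqrt(2) ~ 1.414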

9.5 Geometry of Embedding Spaces

Learned embeddings are not random vectors, but high-dimensional geometry strongly shapes how they behave.

Important empirical observations:

  • simple semantic relations can appear as approximately linear directions (Mikolov et al., 2013)
  • contextual embedding spaces are often anisotropic rather than uniformly spread on a sphere (Ethayarajh, 2019)
  • nominal dimension may be large while intrinsic dimension is much smaller
  • singular values of embedding or feature matrices often decay quickly, indicating effective low-rank structure

Anisotropy matters because cosine similarity becomes biased if almost all vectors occupy a narrow cone. Various centering, whitening, and contrastive training methods attempt to improve isotropy for retrieval and representation quality.

The lesson is subtle: high-dimensional space provides enormous capacity, but learned models do not use that capacity uniformly. They carve out structured manifolds, cones, and low-rank directions inside the ambient space.

Embedding geometry: isotropic vs anisotropic

roughly isotropic              anisotropic / narrow cone

     .   .                           . .
   .   o   .                          . .
     .   .                            . .
   .       .                           . .

spread in many directions         packed into a small angular region
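
One common diagnostic is the average pairwise cosine similarity: near 0 for roughly isotropic collections, close to 1 when vectors occupy a narrow cone. The sketch below uses random data as a stand-in for real embeddings and simulates anisotropy by adding a large shared offset; mean-centering is shown as one simple mitigation.

# Sketch: average cosine similarity as a crude anisotropy measure (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 256

def mean_cosine(E):
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E @ E.T
    return sims[~np.eye(n, dtype=bool)].mean()

iso = rng.normal(size=(n, d))
aniso = iso + 5.0 * rng.normal(size=(1, d))      # a shared direction dominates every vector
print("isotropic:      ", mean_cosine(iso))                          # close to 0
print("anisotropic:    ", mean_cosine(aniso))                        # close to 1
print("after centering:", mean_cosine(aniso - aniso.mean(axis=0)))   # back near 0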

9.6 Softmax Temperature and Simplex Geometry

Softmax maps a logit vector \mathbf{z} \in \mathbb{R}^n to a probability vector:

p_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)},

where \tau > 0 is temperature.

Geometrically:

  • as \tau \to 0, the distribution concentrates near a vertex of the simplex
  • as \tau \to \infty, the distribution approaches the center of the simplex

So temperature controls how close the output lies to an extreme point versus the interior of the simplex.

This is not just a sampling trick. It is a geometric control knob for how sharply a model commits to one direction in probability space.

Temperature moves probability inside the simplex

high tau                         low tau

   more uniform                   more peaked

   [0.33 0.33 0.34]  ------->     [0.97 0.02 0.01]
        near center                    near a vertex
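
A minimal sketch of the same knob in code (the logits are arbitrary example values):

# Sketch: temperature moves the softmax output between the simplex center and a vertex.
import numpy as np

def softmax(z, tau=1.0):
    z = np.asarray(z, dtype=float) / tau
    z -= z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
for tau in [10.0, 1.0, 0.1]:
    print(f"tau={tau:5.1f}  p={np.round(softmax(logits, tau), 3)}")
# large tau -> nearly uniform (near the center); small tau -> nearly one-hot (near a vertex)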

10. Linear Maps Between Spaces

10.1 Definition and Properties

A map T: V \to W is linear if for all vectors \mathbf{u}, \mathbf{v} and scalars \alpha, \beta,

T(\alpha \mathbf{u} + \beta \mathbf{v}) = \alpha T(\mathbf{u}) + \beta T(\mathbf{v}).

Equivalently, it preserves addition and scalar multiplication separately.

Immediate consequences:

  • T(\mathbf{0}) = \mathbf{0}
  • T(-\mathbf{v}) = -T(\mathbf{v})
  • T is determined completely by its action on a basis

For finite-dimensional spaces, every linear map \mathbb{R}^n \to \mathbb{R}^m is represented by a unique matrix A \in \mathbb{R}^{m \times n} such that

T(\mathbf{x}) = A\mathbf{x}.

Composition of linear maps corresponds to matrix multiplication.
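
Both statements can be checked directly. The sketch below (the two maps S and T are arbitrary examples) builds the matrix of a map from its action on the standard basis and confirms that composing maps multiplies their matrices.

# Sketch: a linear map is determined by its action on a basis,
# and composition of maps corresponds to matrix multiplication.
import numpy as np

def matrix_of(T, n):
    """Columns are T applied to the standard basis vectors of R^n."""
    return np.column_stack([T(e) for e in np.eye(n)])

S = lambda x: np.array([x[0] + x[1], 2 * x[1]])         # R^2 -> R^2
T = lambda x: np.array([x[0] - x[1], x[0], 3 * x[1]])   # R^2 -> R^3

A_S, A_T = matrix_of(S, 2), matrix_of(T, 2)
x = np.array([1.0, -2.0])
print(T(S(x)))            # apply the composed map directly
print(A_T @ A_S @ x)      # same result via matrix multiplication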

10.2 Kernel and Image

The kernel of a linear map is

\ker(T) = \{\mathbf{v} \in V : T(\mathbf{v}) = \mathbf{0}\}.

The image is

\mathrm{im}(T) = \{T(\mathbf{v}) : \mathbf{v} \in V\}.

Both are subspaces.

Interpretation:

  • the kernel contains directions completely ignored by the map
  • the image contains all outputs the map can possibly produce

Injectivity and surjectivity are controlled by these spaces:

  • T is injective iff \ker(T) = \{\mathbf{0}\}
  • T is surjective iff \mathrm{im}(T) = W

This is a simple but powerful way to reason about architecture. If a projection matrix has a large kernel, then many input directions are guaranteed to be invisible downstream.

10.3 The Four Fundamental Subspaces

For a matrix A \in \mathbb{R}^{m \times n}, the four fundamental subspaces are:

  1. column space \mathrm{col}(A) \subseteq \mathbb{R}^m
  2. null space \mathrm{null}(A) \subseteq \mathbb{R}^n
  3. row space \mathrm{row}(A) = \mathrm{col}(A^\top) \subseteq \mathbb{R}^n
  4. left null space \mathrm{null}(A^\top) \subseteq \mathbb{R}^m

They satisfy the orthogonal decompositions

\mathbb{R}^n = \mathrm{row}(A) \oplus \mathrm{null}(A), \qquad \mathbb{R}^m = \mathrm{col}(A) \oplus \mathrm{null}(A^\top).

Dimensional relations:

  • \dim(\mathrm{col}(A)) = \dim(\mathrm{row}(A)) = \mathrm{rank}(A)
  • \dim(\mathrm{null}(A)) = n - \mathrm{rank}(A)
  • \dim(\mathrm{null}(A^\top)) = m - \mathrm{rank}(A)

These decompositions provide a precise vocabulary for information flow:

  • the row space describes which input combinations matter
  • the null space describes which input directions are lost
  • the column space describes what outputs are expressible
  • the left null space describes output directions orthogonal to everything the map can produce
Four fundamental subspaces

Domain R^n                          Codomain R^m

[ row(A) | null(A) ] --A--> [ col(A) | 0 ]
                               ^
                               |
                     left-null directions live here,
                     orthogonal to every reachable output
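
These dimension counts can be verified numerically. A sketch using the SVD of a small rank-deficient matrix (the matrix is an arbitrary example with one repeated row direction): the rank, the nullity, and the orthogonality between row space and null space all come out as the formulas predict.

# Sketch: rank, nullity, and orthogonal complements via SVD for a rank-deficient matrix.
import numpy as np

A = np.array([[1., 2., 3.],
              [2., 4., 6.],
              [1., 1., 2.]])           # m = 3, n = 3, rank 2 (row 2 = 2 * row 1)
U, s, Vt = np.linalg.svd(A)
rank = int(np.sum(s > 1e-10))

row_space = Vt[:rank]                  # orthonormal basis of row(A)
null_space = Vt[rank:]                 # orthonormal basis of null(A)
left_null = U[:, rank:]                # orthonormal basis of null(A^T)

print("rank:", rank, " nullity:", A.shape[1] - rank)
print("A @ null basis ~ 0:       ", np.allclose(A @ null_space.T, 0))
print("row space _|_ null space: ", np.allclose(row_space @ null_space.T, 0))
print("A^T @ left-null basis ~ 0:", np.allclose(A.T @ left_null, 0))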

10.4 Isomorphisms

A linear map T: V \to W is an isomorphism if it is bijective. Then V and W are isomorphic vector spaces.

For finite-dimensional real spaces, the classification is simple:

V \cong W \quad \Longleftrightarrow \quad \dim(V) = \dim(W).

So all n-dimensional real vector spaces are structurally equivalent to \mathbb{R}^n, even if their elements look different.

Important caution: isomorphic does not mean equal. The space of quadratic polynomials and \mathbb{R}^3 are not the same set, but they are isomorphic because both have dimension 3.

This is why linear algebra can move freely between concrete coordinates and abstract objects. Once a basis is chosen, every finite-dimensional space looks like ordinary Euclidean coordinate space.

10.5 Dual Spaces

The dual space of V is the space of linear functionals on V:

V^* = \{f: V \to \mathbb{R} \mid f \text{ is linear}\}.

Elements of the dual space eat vectors and return scalars.

If V is finite-dimensional, then

\dim(V^*) = \dim(V).

With an inner product, every linear functional can be represented as an inner product against a fixed vector:

f(\mathbf{x}) = \langle \mathbf{x}, \mathbf{v} \rangle.

This is the finite-dimensional form of the Riesz representation idea.

In coordinates, row vectors behave like dual vectors:

\mathbf{a}^\top \mathbf{x}

is a linear functional of \mathbf{x}.

This perspective is useful in attention. A query can be viewed as defining a linear functional on key space: it "asks" how much each key aligns with a certain direction. The score q^\top k is then the action of a dual vector on a primal vector.


11. Special Vector Spaces in AI

11.1 Embedding Space Geometry

Let E \in \mathbb{R}^{|V| \times d} be a token embedding matrix. Each row E_t is the embedding of token t.

This space is where discrete symbols become continuous geometry. The hope is that semantically or syntactically related tokens occupy nearby or meaningfully aligned regions. Word2vec made this idea famous by showing that certain analogical relations behave approximately linearly (Mikolov et al., 2013).

Several geometric questions matter in practice:

  • local similarity: are related tokens close under cosine or Euclidean distance?
  • directional structure: do directions correspond to interpretable relations?
  • isotropy: are embeddings spread broadly or collapsed into a narrow cone?
  • intrinsic dimension: how many effective degrees of freedom are really used?

Modern contextual embeddings complicate the picture because a token does not have a single representation; it acquires one through context-dependent computation. Even so, the geometric language remains the same. Probing methods ask whether a property is linearly readable from a representation, which is a question about whether a suitable separating hyperplane exists in the embedding space.

11.2 Representation Space Through Layers

In a Transformer, each layer maps a sequence representation

X^{(\ell)} \in \mathbb{R}^{n \times d}

to a new one

X^{(\ell+1)} = X^{(\ell)} + f^{(\ell)}(X^{(\ell)}).

The residual connection means layers do not overwrite the representation from scratch. They add a vector-valued update inside a shared ambient space.

This suggests a productive geometric picture:

  • the residual stream is a shared communication space
  • attention heads read from and write to particular directions or subspaces
  • MLP blocks add structured nonlinear updates in that same space
  • representation learning is partly the art of organizing information so it can be linearly extracted or manipulated later

Mechanistic interpretability often phrases model behavior in exactly these terms: directions, features, subspaces, superposition, and read/write operations into a shared residual stream.

Residual stream picture

X^(l) --------> [ layer computes update f^(l)(X^(l)) ] ----+
  |                                                        |
  +-------------------------- add --------------------------+
                                                           |
                                                           v
                                                        X^(l+1)

11.3 Attention Subspaces

Attention projections are linear maps

W_Q, W_K \in \mathbb{R}^{d \times d_k}, \qquad W_V \in \mathbb{R}^{d \times d_v}.

For token representations \mathbf{x}, \mathbf{y} \in \mathbb{R}^d,

q = W_Q^\top \mathbf{x}, \qquad k = W_K^\top \mathbf{y}, \qquad v = W_V^\top \mathbf{y}.

The raw attention score is

q^\top k = \mathbf{x}^\top W_Q W_K^\top \mathbf{y}.

So each head defines a bilinear form on the original embedding space. Because W_Q W_K^\top has rank at most d_k, each head uses a low-rank interaction relative to the full d \times d ambient space.

This is one of the cleanest examples of linear-algebraic structure in a large model:

  • attention does not compare full vectors directly
  • it compares them after moving them into learned low-dimensional subspaces
  • different heads can be understood as different learned geometric lenses
Attention subspace flow

x ----W_Q^T----> q
                  \
                   \   score = q^T k
                    \
y ----W_K^T----> k
|
----W_V^T----> v

output = weighted sum of v vectors
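
A small sketch of the same identity with random stand-in weights (d = 512 and d_k = 64 follow the dimensions used above; the matrices are not from any trained model): the score computed through q and k equals the bilinear form, and the combined matrix has rank at most d_k.

# Sketch: an attention head's score is a low-rank bilinear form on the embedding space.
import numpy as np

rng = np.random.default_rng(0)
d, d_k = 512, 64
W_Q = rng.normal(size=(d, d_k))
W_K = rng.normal(size=(d, d_k))
x, y = rng.normal(size=d), rng.normal(size=d)

q, k = W_Q.T @ x, W_K.T @ y
score_via_projections = q @ k
score_via_bilinear = x @ (W_Q @ W_K.T) @ y
print(np.allclose(score_via_projections, score_via_bilinear))   # True
print(np.linalg.matrix_rank(W_Q @ W_K.T))                       # at most d_k = 64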

11.4 The Probability Simplex

The probability simplex in \mathbb{R}^n is

\Delta^{n-1} = \left\{ \mathbf{p} \in \mathbb{R}^n : p_i \geq 0,\; \sum_{i=1}^{n} p_i = 1 \right\}.

It is not a vector space, but it is a highly structured convex set:

  • its vertices are the one-hot vectors
  • its center is the uniform distribution
  • it lies in an affine hyperplane defined by \sum_i p_i = 1

Softmax maps logits from \mathbb{R}^n into the interior of this simplex. Sampling, calibration, entropy, KL divergence, and decoding strategies all live on this geometric object.

A deeper geometric view uses the Fisher information metric, which turns the simplex into a curved statistical manifold. That perspective becomes important in information geometry and natural-gradient methods.

Probability simplex in 3 classes

          e2
         /\
        /  \
       / p  \
      /      \
     /________\
   e1          e3

vertices = one-hot distributions
center = uniform distribution

11.5 Parameter Space

If a model has p trainable parameters, then its parameter space is

\Theta = \mathbb{R}^p.

For contemporary language models, p can be enormous. A 7B model has roughly 7 \times 10^9 coordinates, and frontier open models reach hundreds of billions (Dubey et al., 2024).

Key objects in parameter space:

  • parameter vector \theta
  • loss function L(\theta)
  • gradient \nabla L(\theta)
  • Hessian \nabla^2 L(\theta)

This space is flat as a Euclidean manifold, but the loss defined on it can be highly nonconvex. Optimization therefore studies geometry not of the ambient space alone, but of the scalar field drawn on top of it.

Mode connectivity, low-loss valleys, and flat-versus-sharp minima are all geometric statements about subsets of parameter space.

11.6 Function Spaces and the Neural Tangent Kernel

A neural network with parameters \theta defines a function

f_\theta : \mathcal{X} \to \mathcal{Y}.

As \theta varies, the model traces a family of functions inside a function space. Locally, infinitesimal parameter changes move the model through the tangent directions

\frac{\partial f_\theta}{\partial \theta_j}.

The neural tangent kernel (NTK) is

K(x, x') = \sum_{j=1}^{p} \frac{\partial f_\theta(x)}{\partial \theta_j} \frac{\partial f_\theta(x')}{\partial \theta_j} = \left\langle \nabla_\theta f_\theta(x), \nabla_\theta f_\theta(x') \right\rangle.

So the NTK is an inner product in parameter-gradient space. Jacot et al. (2018) showed that in the infinite-width limit, gradient descent can be analyzed using this kernel viewpoint, making neural network training resemble kernel regression.

This is a good example of why abstract vector spaces matter. The relevant geometry is no longer primarily in input space or parameter space, but in a function space whose coordinates are themselves derivatives.
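
For a model that is linear in its parameters, the NTK can be written down by hand, which makes the definition concrete. A sketch under that simplifying assumption: for f_\theta(x) = \theta \cdot x the parameter gradient is x itself, so K(x, x') reduces to the ordinary inner product \langle x, x' \rangle.

# Sketch: empirical NTK for the simplest possible model, f_theta(x) = theta . x.
# Here grad_theta f_theta(x) = x exactly, so the NTK is the ordinary inner product.
import numpy as np

rng = np.random.default_rng(0)
p = 5
theta = rng.normal(size=p)

def f(theta, x):
    return theta @ x

def grad_theta_f(theta, x):
    return x                        # exact gradient for a linear-in-parameters model

x1, x2 = rng.normal(size=p), rng.normal(size=p)
ntk = grad_theta_f(theta, x1) @ grad_theta_f(theta, x2)
print(ntk, "==", x1 @ x2)           # the two numbers agree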


12. Direct Sums and Quotient Spaces

12.1 Direct Sum

If W_1 and W_2 are subspaces of V, then

V = W_1 \oplus W_2

means:

  1. every vector in V can be written as \mathbf{w}_1 + \mathbf{w}_2
  2. the decomposition is unique

Equivalently:

  • W_1 + W_2 = V
  • W_1 \cap W_2 = \{\mathbf{0}\}

Dimensions add:

\dim(W_1 \oplus W_2) = \dim(W_1) + \dim(W_2).

The most important example in inner-product spaces is

V = W \oplus W^\perp.

Direct sums say that a space can be assembled from independent coordinate subsystems.

Direct sum intuition

      w2
      ^
      |
O---->*
  w1   \
        \
         v = w1 + w2

Two independent pieces add to one vector.

12.2 Decomposing R^n

For a matrix A, one fundamental decomposition is

\mathbb{R}^n = \mathrm{row}(A) \oplus \mathrm{null}(A).

Likewise,

\mathbb{R}^m = \mathrm{col}(A) \oplus \mathrm{null}(A^\top).

For symmetric matrices, eigenspaces corresponding to distinct eigenvalues are orthogonal, allowing spectral decompositions of the form

\mathbb{R}^n = E_{\lambda_1} \oplus \cdots \oplus E_{\lambda_k}.

In AI, decompositions of hidden spaces into feature subspaces, head subspaces, or signal-versus-noise components are often informal versions of direct-sum thinking.

12.3 Quotient Spaces

Given a subspace W \subseteq V, the quotient space V/W identifies vectors that differ by an element of W:

\mathbf{v}_1 \sim \mathbf{v}_2 \quad \Longleftrightarrow \quad \mathbf{v}_1 - \mathbf{v}_2 \in W.

The elements of the quotient are cosets

\mathbf{v} + W = \{\mathbf{v} + \mathbf{w} : \mathbf{w} \in W\}.

Dimension behaves cleanly:

\dim(V/W) = \dim(V) - \dim(W).

Conceptually, quotienting collapses all directions in W to zero. If a model is invariant to a set of directions, the meaningful representation can often be thought of as living in a quotient space.

This is the right abstraction for "ignore all variation of type X." It appears in invariant learning, symmetry reduction, and representation factorization.

Quotient-space intuition

Before quotient by W:          After collapsing W-directions:

*----*----*                    [o]   [o]   [o]
*----*----*        ---->          each class = one coset
*----*----*

Points that differ only along W become the same object.

12.4 Tensor Product

The tensor product V \otimes W is the vector space generated by formal bilinear pairs v \otimes w subject to bilinearity rules. For finite-dimensional spaces,

\dim(V \otimes W) = \dim(V)\dim(W).

If \{\mathbf{e}_i\} is a basis for V and \{\mathbf{f}_j\} is a basis for W, then \{\mathbf{e}_i \otimes \mathbf{f}_j\} is a basis for V \otimes W.

In coordinates, matrices can be viewed as tensor-product objects:

\mathbb{R}^{m \times n} \cong \mathbb{R}^m \otimes \mathbb{R}^n.

A rank-1 matrix is a simple tensor:

\mathbf{u}\mathbf{v}^\top \leftrightarrow \mathbf{u} \otimes \mathbf{v}.

This is relevant for model compression and LoRA. A low-rank update

\Delta W = BA

is a sum of a small number of rank-1 outer products, so it lives in a low-dimensional tensorially structured subset of matrix space.

Outer product / simple tensor

column u           row v^T            matrix u v^T

[u1]               [v1 v2 v3]         [u1v1 u1v2 u1v3]
[u2]       x                          [u2v1 u2v2 u2v3]

One column times one row creates a rank-1 matrix.
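
A sketch of the low-rank update idea, with shapes chosen only for illustration: \Delta W = BA with inner dimension r equals a sum of r rank-1 outer products, and its rank never exceeds r.

# Sketch: a LoRA-style update Delta W = B @ A has rank at most r
# and equals the sum of r rank-1 outer products.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 128, 4
B = rng.normal(size=(d_out, r))
A = rng.normal(size=(r, d_in))

delta_W = B @ A
sum_of_rank1 = sum(np.outer(B[:, i], A[i, :]) for i in range(r))

print(np.linalg.matrix_rank(delta_W))        # 4
print(np.allclose(delta_W, sum_of_rank1))    # True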

13. Norms, Regularization, and Geometry

13.1 L2 Regularization as a Geometric Constraint

L2 regularization adds a penalty

\lambda \|\theta\|_2^2

to the loss. Geometrically, this prefers solutions near the origin and can be viewed through the constrained problem

\min_\theta L(\theta) \quad \text{subject to} \quad \|\theta\|_2 \leq r.

The L2 ball is smooth and rotationally symmetric. As a result, L2 regularization shrinks parameters continuously rather than forcing exact zeros.

This is the modern descendant of ridge regression (Hoerl and Kennard, 1970). In neural networks it appears as weight decay.

L2 constraint set in 2D

      ____
    /      \
   |   .    |
    \______/

smooth boundary -> shrink smoothly

13.2 L1 Regularization and Sparsity

L1 regularization uses

\lambda \|\theta\|_1.

Its geometry is different. The L1 ball has corners along coordinate axes, so when an objective first touches the constraint boundary, it is disproportionately likely to do so at a sparse point.

That geometric fact explains why L1 promotes sparsity (Tibshirani, 1996). The corresponding proximal operator is soft thresholding:

\mathrm{prox}_{\lambda \|\cdot\|_1}(x_i) = \mathrm{sign}(x_i)\max(0, |x_i| - \lambda).

Small coordinates are driven exactly to zero.
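
A direct sketch of the operator applied elementwise (the input vector and threshold are arbitrary example values): entries smaller than \lambda in magnitude go exactly to zero, and the rest shrink toward zero by \lambda.

# Sketch: soft thresholding, the proximal operator of the L1 penalty.
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(0.0, np.abs(x) - lam)

x = np.array([3.0, -0.2, 0.05, -1.5, 0.4])
print(soft_threshold(x, lam=0.5))   # [ 2.5  0.   0.  -1.   0. ]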

This matters for:

  • feature selection
  • sparse coding
  • pruning
  • sparse MoE routing and sparse attention variants
L1 constraint set in 2D

        /\
       /  \
       \  /
        \/

corners on coordinate axes -> sparse solutions are favored

13.3 Spectral Regularization

For a matrix-valued parameter WW, several geometry-aware penalties are common.

Frobenius penalty

\|W\|_F^2

which is entrywise L2 regularization.

Spectral norm control

\|W\|_2 = \sigma_{\max}(W)

which bounds the largest one-step amplification of the layer.

Nuclear norm

\|W\|_*

which promotes low rank.

Spectral normalization rescales a weight matrix by an estimate of its top singular value so that the layer has bounded operator norm (Miyato et al., 2018). This is a geometric way of controlling Lipschitz constants and stabilizing training.

What spectral control means geometrically

unit ball              after W                after spectral normalization

   ()                    (------)                  (----)
                         long axis = sigma_max     longest stretch capped

Spectral norm controls the biggest stretching direction.
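
Spectral normalization is usually implemented with a few steps of power iteration rather than a full SVD. A minimal sketch under simplifying assumptions (a random matrix and a fixed iteration count; practical implementations persist the iterate across training steps):

# Sketch: estimate sigma_max by power iteration and rescale W to unit spectral norm.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))

u = rng.normal(size=64)
for _ in range(20):                       # a few power-iteration steps
    v = W.T @ u; v /= np.linalg.norm(v)
    u = W @ v;  u /= np.linalg.norm(u)
sigma_est = u @ W @ v

W_sn = W / sigma_est
print("true sigma_max:", np.linalg.svd(W, compute_uv=False)[0])
print("estimate:      ", sigma_est)
print("after rescale: ", np.linalg.svd(W_sn, compute_uv=False)[0])   # ~1.0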

13.4 The Geometry of Gradient Descent

The gradient

\nabla L(\theta)

is a vector in parameter space pointing in the direction of steepest increase of the loss under the Euclidean metric. Gradient descent updates by

\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t).

Several geometric facts follow:

  • the gradient is orthogonal to level sets of L
  • ill-conditioned curvature creates narrow valleys and zig-zagging updates
  • preconditioning changes the metric of parameter space to make descent more isotropic

If the Hessian has eigenvalues \lambda_{\max} and \lambda_{\min} with large ratio

\kappa = \frac{\lambda_{\max}}{\lambda_{\min}},

then the local landscape is elongated and naive gradient descent is inefficient.

Adaptive methods and natural gradient can be understood as changes of geometry. Natural gradient replaces the Euclidean metric with the Fisher information metric, aligning steepest descent with the geometry of the model's predictive distribution (Amari, 1998).

This is the right mental model: optimization is not just arithmetic on parameters; it is motion through a geometric space whose metric determines what "steepest" even means.

Gradient descent in a narrow valley

\                /
 \   x ->      /
  \           /
  /   <- x   \
 /           \
/_____________\

Poor conditioning makes updates zig-zag.
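
The zig-zag picture can be reproduced on a two-dimensional quadratic with condition number 100. A sketch (the step size is deliberately chosen near the stability limit for the stiff direction):

# Sketch: gradient descent on an ill-conditioned quadratic L(theta) = 0.5 * theta^T H theta.
import numpy as np

H = np.diag([100.0, 1.0])            # condition number kappa = 100
eta = 0.019                          # just under the stability limit 2 / lambda_max = 0.02
theta = np.array([1.0, 1.0])

for t in range(30):
    grad = H @ theta
    theta = theta - eta * grad

print(theta)   # the stiff coordinate oscillates slowly toward 0; the flat one has barely moved

The stiff coordinate flips sign each step while shrinking, which is the zig-zag; the flat coordinate makes almost no progress because the same step size is far too small for it.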

14. Common Mistakes

Mistake | Why it is wrong | Better statement
"A set closed under addition is automatically a subspace." | It must also contain the zero vector and be closed under scalar multiplication. | Check all subspace conditions, not just one.
"Independent vectors are orthogonal." | Orthogonality implies independence for nonzero vectors, but independence does not imply orthogonality. | Use Gram-Schmidt when you want orthogonality.
"The zero vector is not in a proper subspace." | Every subspace must contain the zero vector. | If \mathbf{0} \notin W, then W is not a subspace.
"Cosine similarity tells you Euclidean closeness." | High cosine can coexist with large Euclidean distance when norms differ a lot. | Cosine measures angle; Euclidean distance also depends on magnitude.
"All vectors in high dimension are similar." | Random high-dimensional vectors are typically nearly orthogonal, not highly aligned. | Meaningful similarity is rare relative to the random baseline.
"A spanning set has as many vectors as the dimension." | Spanning sets can be redundant. Dimension counts basis vectors, not all listed vectors. | Reduce to a linearly independent spanning set to find dimension.
"The simplex is a vector space because it contains vectors." | It is not closed under arbitrary scaling or addition. | The simplex is a convex subset of a vector space.
"Projection leaves every vector unchanged." | It leaves only vectors already in the target subspace unchanged. | Projection keeps the in-subspace component and removes the orthogonal remainder.
"Two spaces of the same dimension are equal." | They are isomorphic, not literally the same set. | Same dimension means same linear structure up to coordinate relabeling.
"Null(A) and Null(A^\top) are the same thing." | They live in different ambient spaces unless A is square and special. | Right null space and left null space are distinct objects.

15. Exercises

These are designed to force both symbolic fluency and geometric interpretation.

  1. Vector space verification. Determine whether each set is a vector space under the stated operations. If it is, justify all axioms; if not, identify a failing axiom.

    • (a) \{(x,y) \in \mathbb{R}^2 : x \geq 0\} with standard operations
    • (b) \{f:\mathbb{R}\to\mathbb{R} : f(0)=0\} with pointwise operations
    • (c) \{A \in \mathbb{R}^{2 \times 2} : \det(A)=1\} with matrix addition
    • (d) \{\mathbf{p} \in \mathbb{R}^3 : p_1+p_2+p_3=1\} with componentwise addition
  2. Subspaces and dimension.

    • (a) Show that \{(x,y,z)\in\mathbb{R}^3 : x+2y-z=0\} is a subspace, find a basis, and compute its dimension.
    • (b) Find the dimension of the intersection of the planes x+y+z=0 and x-y+2z=0.
    • (c) Let W=\mathrm{span}\{(1,1,0),(1,0,1),(0,1,-1)\}. Find a basis for W and determine whether W=\mathbb{R}^3.
  3. Inner products and orthogonality. Let \mathbf{u}=(1,2,-1) and \mathbf{v}=(3,0,1).

    • (a) Compute \langle \mathbf{u}, \mathbf{v} \rangle.
    • (b) Compute \|\mathbf{u}\|_2 and \|\mathbf{v}\|_2.
    • (c) Find the angle between the vectors.
    • (d) Compute the projection of \mathbf{u} onto \mathbf{v}.
    • (e) Verify the Pythagorean decomposition.
    • (f) Apply Gram-Schmidt to produce an orthonormal basis for \mathrm{span}\{\mathbf{u}, \mathbf{v}\}.
  4. Rank-nullity. For

A = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \\ 1 & 1 & 2 \end{pmatrix},

find:

    • (a) a basis for \mathrm{null}(A)
    • (b) a basis for \mathrm{col}(A)
    • (c) a basis for \mathrm{row}(A)
    • (d) the rank and nullity, verifying rank-nullity
    • (e) a nonzero vector in \mathrm{null}(A^\top)

  5. Projection matrices.

    • (a) Compute the orthogonal projector onto the line spanned by (1,2,2)^\top.
    • (b) Verify P^2=P and P^\top=P.
    • (c) Compute I-P and identify its target subspace.
    • (d) For

A = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix},

compute P_A = A(A^\top A)^{-1}A^\top and interpret the result geometrically.

  6. High-dimensional geometry.

    • (a) Suppose \mathbf{x}, \mathbf{y} are independent random unit vectors in \mathbb{R}^{100}. What are the expected value and approximate variance of \langle \mathbf{x}, \mathbf{y} \rangle?
    • (b) For 500 points and \varepsilon=0.1, estimate a Johnson-Lindenstrauss target dimension k up to constants.
    • (c) Show that all distinct standard basis vectors in \mathbb{R}^d are at Euclidean distance \sqrt{2} from one another.
  7. Attention geometry. Let W_Q \in \mathbb{R}^{512 \times 64}.

    • (a) What is the maximum possible rank of W_Q?
    • (b) What is the dimension of its null space if the rank is maximal?
    • (c) What kinds of input directions are lost under this projection?
    • (d) Why is W_Q W_K^\top low rank?
  8. Norms and distances.

    • (a) For \mathbf{v}=(3,-4,0,1)^\top, compute \|\mathbf{v}\|_1, \|\mathbf{v}\|_2, and \|\mathbf{v}\|_\infty.
    • (b) Verify \|\mathbf{v}\|_\infty \leq \|\mathbf{v}\|_2 \leq \|\mathbf{v}\|_1.
    • (c) For

A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix},

compute \|A\|_F and \|A\|_2.

  9. Quotient-space thinking. Suppose a representation is unchanged when you add any vector from a subspace W. Explain why the meaningful information lives in V/W rather than in V itself.

  10. Regularization geometry. Sketch or describe the L1 and L2 unit balls in \mathbb{R}^2, then explain geometrically why L1 tends to produce sparse solutions and L2 typically does not.


16. Why This Matters for AI

Aspect | Why vectors and spaces matter
Embeddings | Every token, position, and feature is represented as a vector in a learned space.
Attention | Attention is built from inner products, low-rank projections, and weighted sums.
Representation learning | Features live in subspaces, not just in scalar coordinates.
Compression | Low-rank methods, adapters, and LoRA rely on dimension, span, and rank.
Optimization | Gradients are vectors in parameter space, and regularization is geometric constraint design.
Generalization theory | RKHS, NTK, margin bounds, and norm-based complexity all use vector-space language.
Numerical stability | Orthogonality, conditioning, and norm control determine whether training remains stable.
Interpretability | Linear probes, concept vectors, activation steering, and superposition are all geometric analyses.
Retrieval and search | Similarity search depends on metrics, angle, and concentration in high dimensions.
Probabilistic modeling | Logits live in vector spaces and probabilities live in simplices embedded in them.

The short version is that vectors and spaces are not one topic among many. They are the common substrate of model architecture, learning dynamics, representation quality, and theoretical analysis.


17. Conceptual Bridge

Scalars are the atoms of computation. Vectors are organized collections of scalars. Vector spaces tell us which collections can be added and scaled consistently. Norms and inner products add geometry. Linear maps move us between spaces. Convexity constrains what combinations are allowed. High-dimensional geometry tells us why intuition breaks and why large models can still work.

That is the bridge to the next chapters:

sets and functions
  -> vectors and spaces
  -> matrices and linear maps
  -> orthogonality, eigenvalues, and SVD
  -> optimization, probability, and information
  -> neural networks and transformers

If this chapter gives you one durable mental model, let it be this: deep learning is geometry performed by linear maps inside structured high-dimensional spaces, with nonlinearities deciding which geometric regions remain active.


References

Core Linear Algebra and Geometry

AI and Representation Geometry

Low Rank, Kernels, and Optimization Geometry

Regularization and Information Geometry

