"The eigenvalues of a matrix are not just numbers - they are the heartbeat of the linear map, the frequencies at which it resonates, the rates at which it remembers and forgets."
Overview
Eigenvalues and eigenvectors are the most powerful diagnostic tool in linear algebra. Given a square matrix $A$, an eigenvector is a non-zero vector $v$ that the matrix maps to a scalar multiple of itself: $Av = \lambda v$. The scalar $\lambda$ is the corresponding eigenvalue. A generic vector gets rotated when $A$ is applied; eigenvectors are the special directions that survive without rotation - they are the natural axes of the linear map, the directions in which the map acts most simply.
The power of this concept is hard to overstate. The long-run behaviour of any linear dynamical system - whether a Markov chain, a recurrent neural network, or gradient descent on a quadratic loss - is determined by the eigenvalue spectrum. The eigenvector for the largest eigenvalue is where the system ends up; the ratio of the top two eigenvalues controls the convergence rate; a single eigenvalue outside the unit disk triggers instability. Understanding eigenvalues is understanding the future of any linear system.
For AI practitioners in 2026, eigenvalues appear everywhere: PCA decomposes the data covariance matrix into eigenvectors (principal components) and eigenvalues (variances); the Hessian eigenspectrum determines gradient descent convergence and reveals flat vs sharp minima; attention matrices are row-stochastic with Perron-Frobenius eigenvalue 1; LoRA adapts the dominant eigenspace of weight matrices; spectral normalisation bounds the largest singular value to stabilise GAN training; graph neural networks convolve in the eigenspace of the graph Laplacian. This section develops all of these connections from first principles.
Prerequisites
- Vector spaces, subspaces, span, and linear independence (Section 02-06)
- Matrix multiplication, transpose, and invertibility (Section 02-02)
- Determinants and their connection to invertibility (Section 02-04)
- Systems of linear equations and row reduction / RREF (Section 02-03)
- Matrix rank and null spaces (Section 02-05)
- Basic familiarity with complex numbers (for complex eigenvalues)
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive demonstrations: geometric visualisation, power iteration, QR algorithm, spectral theorem, PCA, Hessian spectrum, graph Laplacian, spectral normalisation |
| exercises.ipynb | 8 graded problems: eigenpair verification, characteristic polynomial, diagonalisation, spectral theorem, matrix powers, Rayleigh quotient, PCA from scratch, power iteration |
Learning Objectives
After completing this section, you will:
- State the eigenvalue equation and explain its geometric meaning
- Compute eigenvalues by solving the characteristic equation
- Find eigenvectors by computing the null space of $A - \lambda I$
- Distinguish algebraic and geometric multiplicity; identify defective matrices
- Diagonalise a matrix as $A = V\Lambda V^{-1}$ and use it to compute matrix powers
- State and apply the spectral theorem for symmetric matrices ($A = Q\Lambda Q^T$)
- Characterise positive (semi)definite matrices via their eigenvalue spectrum
- Apply the Rayleigh quotient and Courant-Fischer theorem
- Compute the dominant eigenvector via power iteration
- Connect eigenvalues to PCA, gradient descent convergence, RNN stability, and attention structure
- Explain the Hessian eigenspectrum and its role in neural network training dynamics
- Use Gershgorin circles and Weyl inequalities to bound eigenvalues without computing them
Table of Contents
- 1. Intuition
- 2. Formal Definitions
- 3. Computing Eigenvalues and Eigenvectors
- 4. Properties of Eigenvalues
- 5. Diagonalisation
- 6. The Spectral Theorem
- 7. Jordan Normal Form
- 8. Singular Value Decomposition
- 9. Eigendecomposition in AI Applications
- 10. Generalised Eigenproblems
- 11. Eigenvalues and Stability Analysis
- 12. Common Mistakes
- 13. Exercises
- 14. Why This Matters for AI (2026 Perspective)
- 15. Conceptual Bridge
1. Intuition
1.1 What Are Eigenvalues and Eigenvectors?
A linear map generally rotates and stretches every vector it acts on. Apply $A$ to a random vector $x$ and the result $Ax$ points in a different direction. But for certain special non-zero vectors $v$, something remarkable happens: $Av$ points in exactly the same direction as $v$. The map stretches or shrinks $v$, possibly flips it, but does not rotate it. These special vectors are called eigenvectors of $A$, and the stretching factor is the corresponding eigenvalue $\lambda$:

$$Av = \lambda v$$

The eigenvalue encodes everything about how $A$ acts on $v$: if $\lvert\lambda\rvert > 1$ the vector is stretched; if $\lvert\lambda\rvert < 1$ it is shrunk; if $\lambda < 0$ it is reversed in direction and possibly scaled; if $\lambda = 0$ it is collapsed to the zero vector; if $\lambda = 1$ it is left completely unchanged. The eigenvector gives the direction; the eigenvalue gives the magnitude and sign of the action.
The word eigen is German for "own" or "characteristic." Eigenvectors are the characteristic directions that belong intrinsically to the linear map - they are independent of the coordinate system you choose to describe the map. No matter how you rotate your axes, the eigenvectors of remain the same geometric objects. They are the natural language in which the map speaks most simply.
For AI: The eigenvectors of the sample covariance matrix are the principal components - the directions of maximum variance in the data. The eigenvalues of the Hessian are the curvatures of the loss landscape in each eigendirection; large eigenvalues correspond to sharp, fast-converging directions, small eigenvalues to flat, slow-converging ones. The eigenvalues of a recurrent weight matrix determine whether the RNN's hidden state grows, decays, or oscillates as time steps accumulate.
1.2 The Geometric Picture
The clearest way to see eigenvalues geometrically is to watch what $A$ does to every unit vector simultaneously. Imagine the unit circle in $\mathbb{R}^2$. Apply $A$ to every point on this circle. The result is an ellipse. The eigenvectors of $A$ are the directions that map to points on the same ray through the origin - vectors that get stretched along their own line.
For a symmetric matrix ($A = A^T$), the eigenvectors are orthogonal to each other. The ellipse's axes align exactly with the eigenvectors, and the lengths of the semi-axes equal the absolute eigenvalues. The map acts as a pure coordinate-wise scaling in the eigenbasis: no coupling, no rotation, just independent stretching of orthogonal directions. This is the content of the Spectral Theorem (Section 6), the most important structural result in applied linear algebra.
For a non-symmetric matrix, the picture is more complex. Eigenvectors may not be orthogonal. They may be complex even for a real matrix (a 2D rotation matrix, for instance, has no real eigenvectors - it rotates every real vector). But the eigenvalues still control the long-run behaviour: repeated application of $A$ amplifies the direction of the eigenvalue largest in magnitude and suppresses all others.
GEOMETRIC ACTION OF A 2\times2 MATRIX
========================================================================
Unit circle After applying A
* * *
* * A * *
* * ---------> * *
* * (ellipse, axes = eigenvectors,
* semi-axes = |eigenvalues|)
Most vectors get ROTATED. Eigenvectors v_1, v_2 stay on
the same ray - only stretched.
========================================================================
1.3 Why Eigenvalues Are Central to AI
Eigenvalues are not a piece of abstract machinery that happens to appear in ML - they are the fundamental language of how linear systems evolve, converge, compress, and fail. Here are the core connections:
PCA and covariance structure. Given a centred data matrix $X \in \mathbb{R}^{n \times d}$, the sample covariance is $\Sigma = \frac{1}{n}X^TX$, a symmetric positive semidefinite matrix. Its eigendecomposition $\Sigma = Q\Lambda Q^T$ gives eigenvectors (the principal components) and eigenvalues (the variances in each direction). Keeping only the top-$k$ eigenvectors gives the optimal rank-$k$ approximation to the data - the subspace capturing maximum variance.
Gradient descent convergence. For a quadratic loss $L(\theta) = \frac{1}{2}\theta^TH\theta$, gradient descent with step size $\eta$ converges iff $\lvert 1 - \eta\lambda_i\rvert < 1$ for all eigenvalues $\lambda_i$ of the Hessian $H$. The convergence rate in each eigendirection is $\lvert 1 - \eta\lambda_i\rvert$. The condition number $\kappa = \lambda_{\max}/\lambda_{\min}$ determines how much slower the slowest direction converges relative to the fastest - large $\kappa$ means gradient descent zigzags and stalls.
Vanishing and exploding gradients. In a deep linear network with the same weight matrix $W$ at each of $L$ layers, the gradient magnitude scales as $\rho(W)^L$, where $\rho(W)$ is the spectral radius. If $\rho(W) > 1$: gradients explode. If $\rho(W) < 1$: gradients vanish. The eigenspectrum of $W$ is the fundamental quantity controlling gradient flow through depth.
Attention and Perron-Frobenius. The attention weight matrix is row-stochastic. By the Perron-Frobenius theorem, its dominant eigenvalue is exactly 1, with all others satisfying $\lvert\lambda\rvert \le 1$. The structure of the remaining eigenvalues characterises the type of attention pattern - induction heads, name-mover heads, and copying heads each have distinct eigenvalue signatures.
Spectral normalisation and stability. Dividing a weight matrix $W$ by its largest singular value $\sigma_{\max}(W)$ (the spectral norm) enforces a Lipschitz constant of 1 on that layer. This stabilises GAN training and normalising flow invertibility. The largest singular value is itself the square root of the largest eigenvalue of $W^TW$.
1.4 Eigenvalues as Long-Run Behaviour
The deepest reason eigenvalues matter is that they completely determine the long-run behaviour of any linear dynamical system. If $Av_i = \lambda_i v_i$, then repeated application gives $A^k v_i = \lambda_i^k v_i$.
Each eigenvector mode grows or decays at rate $\lambda_i^k$. For any initial vector decomposed in the eigenbasis, $x_0 = \sum_i c_i v_i$:

$$A^k x_0 = \sum_i c_i \lambda_i^k v_i$$

The mode with the largest $\lvert\lambda_i\rvert$ - the dominant eigenvalue - eventually dominates everything: as $k \to \infty$, $A^k x_0 \approx c_1\lambda_1^k v_1$. This is why power iteration works: repeatedly apply $A$ and renormalise, and the iterates converge to the eigenvector for $\lambda_1$ at rate $\lvert\lambda_2/\lambda_1\rvert$ per step. It is also why PageRank works - the steady-state page importance vector is the dominant eigenvector of the web link matrix.
For AI: Training dynamics near a loss minimum $\theta^*$ are approximately a linear system, $\theta_{t+1} - \theta^* \approx (I - \eta H)(\theta_t - \theta^*)$, with Hessian $H$. The eigenvalues of $I - \eta H$ are $1 - \eta\lambda_i$. The long-run behaviour of training error is therefore $\sum_i c_i(1 - \eta\lambda_i)^t$ - each Hessian eigendirection decays geometrically at its own rate. The slowest-converging direction is the one with the smallest Hessian eigenvalue (smallest curvature = flattest direction).
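A minimal NumPy sketch of this dominance effect: repeatedly applying a matrix aligns any starting vector with the dominant eigenvector. The matrix and starting vector below are illustrative, not taken from the text.

```python
# Repeated application of A amplifies the dominant eigendirection.
import numpy as np

A = np.array([[0.9, 0.4],
              [0.1, 0.6]])                       # eigenvalues 1.0 and 0.5
eigvals, eigvecs = np.linalg.eig(A)
dominant = eigvecs[:, np.argmax(np.abs(eigvals))]

x = np.array([1.0, 0.0])
for k in range(50):                              # x_k = A^k x_0
    x = A @ x

# Direction of A^k x_0 aligns with the dominant eigenvector (up to sign/scale).
print(x / np.linalg.norm(x))
print(dominant / np.linalg.norm(dominant))
```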
1.5 Historical Timeline
| Year | Who | What |
|---|---|---|
| 1750s | Euler | Special exponential solutions to ODEs; implicit eigenvalue reasoning for linear differential equations |
| 1762 | Lagrange | Reduction of quadratic forms to "principal axes"; normal modes of mechanical systems |
| 1829 | Cauchy | Rigorous proof that real symmetric matrices have real eigenvalues; foundation of spectral theory |
| 1856 | Hermite | Generalisation to complex Hermitian matrices; eigenvalues real for $A = A^*$ |
| 1866 | Kronecker | Formulated the characteristic polynomial $\det(A - \lambda I) = 0$; eigenvalue problem as polynomial root-finding |
| 1870 | Jordan | Jordan normal form; canonical classification of all square matrices up to similarity |
| 1904-10 | Hilbert | Spectral theorem for infinite-dimensional operators; eigenvalues of integral operators; birth of functional analysis |
| 1925-27 | Heisenberg, von Neumann | Quantum mechanics as eigenvalue problems; observable quantities = eigenvalues of Hermitian operators |
| 1950 | Lanczos | Krylov subspace method for extreme eigenvalues of large sparse symmetric matrices |
| 1955 | Wigner | Random matrix theory; semicircle law for eigenvalue distributions of large random symmetric matrices |
| 1961-62 | Francis; Kublanovskaya | QR algorithm - the practical standard for computing all eigenvalues; one of the top 10 algorithms of the 20th century |
| 1965 | Golub-Reinsch | Practical SVD algorithm; singular values as generalised eigenvalues for rectangular matrices |
| 1989 | LeCun et al. | Hessian eigenspectrum of neural networks analysed; curvature-aware learning rate scheduling |
| 2015 | Martens & Grosse | K-FAC - Kronecker-factored approximate curvature; eigendecomposition of layer-wise Hessian approximations |
| 2019-26 | Martin & Mahoney | WeightWatcher - heavy-tailed eigenvalue distributions as quality metrics for trained neural networks |
| 2017-26 | Transformer era | Spectral analysis of attention matrices; NTK eigenvalues; spectral normalisation in GANs; LoRA via dominant eigenspace |
2. Formal Definitions
2.1 The Eigenvalue Equation
Definition. Let $A \in \mathbb{R}^{n \times n}$ be a square matrix. A scalar $\lambda$ is an eigenvalue of $A$ and a non-zero vector $v \in \mathbb{R}^n$ is the corresponding eigenvector if:

$$Av = \lambda v, \qquad v \neq 0$$
Three equivalent ways to state this:
- Geometric: $A$ maps $v$ to a scalar multiple of itself; direction is preserved, only magnitude (and possibly sign) changes.
- Algebraic: $(A - \lambda I)v = 0$ has a non-trivial solution $v \neq 0$.
- Subspace: $\ker(A - \lambda I) \neq \{0\}$, and $v$ is any non-zero vector in that kernel.
The requirement $v \neq 0$ is essential: the zero vector satisfies $A\,0 = \lambda\,0$ for every $\lambda$, so it carries no information. Any non-zero scalar multiple of an eigenvector is also an eigenvector: $A(cv) = cAv = c\lambda v = \lambda(cv)$. Eigenvectors are therefore not unique - they define a direction, not a specific vector. The set of all eigenvectors for a given $\lambda$, together with $0$, forms a subspace called the eigenspace (Section 2.4).
Non-examples of eigenvalues:
- The zero matrix has every non-zero vector as an eigenvector with $\lambda = 0$, but no non-zero eigenvalue.
- A rotation matrix in $\mathbb{R}^2$ by an angle other than $0$ or $\pi$ has no real eigenvalues - it rotates every real vector off its own direction.
- Non-square matrices have no eigenvalues (the equation $Av = \lambda v$ requires $A$ to map $\mathbb{R}^n \to \mathbb{R}^n$).
2.2 The Characteristic Polynomial
For $\lambda$ to be an eigenvalue, $(A - \lambda I)v = 0$ must have a non-trivial solution. This happens precisely when $A - \lambda I$ is singular, i.e., when its determinant is zero. This gives the characteristic equation:

$$\det(A - \lambda I) = 0$$

The function $p(\lambda) = \det(\lambda I - A)$ is a degree-$n$ polynomial in $\lambda$ called the characteristic polynomial of $A$. Its roots are exactly the eigenvalues of $A$. By convention we use $\det(\lambda I - A)$ (rather than $\det(A - \lambda I)$) so that the leading coefficient is $+1$ (monic polynomial):

$$p(\lambda) = \lambda^n + c_{n-1}\lambda^{n-1} + \cdots + c_1\lambda + c_0 = \prod_{i=1}^n (\lambda - \lambda_i)$$
Two key coefficients appear regardless of the matrix:
- Coefficient of $\lambda^{n-1}$: $-\operatorname{tr}(A)$ - the sum of eigenvalues equals the trace.
- Constant term: $(-1)^n\det(A)$ - the product of eigenvalues equals the determinant (up to sign).
By the Fundamental Theorem of Algebra, $p(\lambda)$ has exactly $n$ roots over $\mathbb{C}$ (counting multiplicity). Every real $n \times n$ matrix therefore has exactly $n$ eigenvalues in $\mathbb{C}$, though some may be complex and some may be repeated.
For a $2 \times 2$ matrix $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$:

$$p(\lambda) = \lambda^2 - (a + d)\lambda + (ad - bc) = \lambda^2 - \operatorname{tr}(A)\,\lambda + \det(A)$$

The discriminant $\Delta = \operatorname{tr}(A)^2 - 4\det(A)$ determines the nature of the eigenvalues: $\Delta > 0$ gives two distinct real eigenvalues; $\Delta = 0$ gives one repeated real eigenvalue; $\Delta < 0$ gives two complex conjugate eigenvalues.
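A minimal sketch of the $2\times2$ recipe in NumPy, checking the trace/determinant form of the characteristic polynomial and the discriminant test against np.linalg.eigvals. The two matrices are illustrative.

```python
import numpy as np

def char_poly_2x2(A):
    """Coefficients of lambda^2 - tr(A) lambda + det(A), highest degree first."""
    return np.array([1.0, -np.trace(A), np.linalg.det(A)])

for A in [np.array([[2.0, 1.0], [1.0, 2.0]]),    # symmetric: real eigenvalues
          np.array([[0.0, -1.0], [1.0, 0.0]])]:   # rotation: complex eigenvalues
    coeffs = char_poly_2x2(A)
    disc = coeffs[1]**2 - 4 * coeffs[2]           # tr^2 - 4 det
    roots = np.roots(coeffs)                      # roots of the characteristic polynomial
    print(disc, roots, np.linalg.eigvals(A))      # roots match np.linalg.eigvals
```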
2.3 Algebraic and Geometric Multiplicity
When an eigenvalue $\lambda_i$ appears as a repeated root of $p(\lambda)$, we distinguish two notions of multiplicity:
Algebraic multiplicity $\mu(\lambda_i)$: the number of times $(\lambda - \lambda_i)$ divides $p(\lambda)$; the multiplicity of $\lambda_i$ as a root of the characteristic polynomial.
Geometric multiplicity $\gamma(\lambda_i)$: the dimension of the eigenspace $E_{\lambda_i} = \ker(A - \lambda_i I)$; the number of linearly independent eigenvectors for $\lambda_i$.
These two numbers satisfy the fundamental inequality:

$$1 \le \gamma(\lambda_i) \le \mu(\lambda_i)$$

The lower bound holds because every eigenvalue has at least one eigenvector. The upper bound requires proof but follows from the fact that the eigenspace cannot be larger than the multiplicity of the root. When $\gamma(\lambda) = \mu(\lambda)$ for every eigenvalue, the matrix is diagonalisable - it has $n$ linearly independent eigenvectors. When $\gamma(\lambda) < \mu(\lambda)$ for some eigenvalue, the matrix is defective and cannot be diagonalised; it requires the Jordan Normal Form (Section 7).
Example. Consider, for instance, $A = \begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}$. The characteristic polynomial is $(\lambda - 2)^2$, so $\lambda = 2$ has $\mu(2) = 2$. But $A - 2I = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$ has null space spanned only by $(1, 0)^T$, so $\gamma(2) = 1$. This matrix is defective.
2.4 Eigenspaces
For eigenvalue $\lambda$, the eigenspace is:

$$E_\lambda = \{v \in \mathbb{R}^n : Av = \lambda v\} = \ker(A - \lambda I)$$

$E_\lambda$ is always a subspace (it is the null space of $A - \lambda I$). It contains the zero vector and is closed under addition and scalar multiplication. Its dimension equals the geometric multiplicity: $\dim E_\lambda = \gamma(\lambda)$.
Two key properties make eigenspaces structurally special:
Invariance: $E_\lambda$ is invariant under $A$. If $v \in E_\lambda$ then $Av = \lambda v \in E_\lambda$. Applying $A$ keeps you inside the eigenspace - eigenvectors stay eigenvectors.
Independence across eigenvalues: Eigenvectors for different eigenvalues are linearly independent. If $Av_1 = \lambda_1 v_1$ and $Av_2 = \lambda_2 v_2$ with $\lambda_1 \neq \lambda_2$, then $v_1$ and $v_2$ cannot be parallel. Proof: suppose $v_2 = cv_1$ for some scalar $c \neq 0$. Then $\lambda_2 v_2 = Av_2 = cAv_1 = c\lambda_1 v_1 = \lambda_1 v_2$, giving $(\lambda_2 - \lambda_1)v_2 = 0$. Since $v_2 \neq 0$ and $\lambda_1 \neq \lambda_2$, contradiction. More generally: eigenvectors for distinct eigenvalues are always linearly independent.
For AI: In PCA, the eigenspaces of the covariance matrix are the principal component subspaces. The first eigenspace (for largest variance) gives the direction of greatest spread. In spectral clustering, the eigenspaces of the graph Laplacian partition the graph. In the analysis of attention heads, the eigenspace for eigenvalue 1 is the fixed-point subspace of the attention map.
2.5 The Spectrum
The spectrum of $A$ is the set of all distinct eigenvalues: $\sigma(A) = \{\lambda_1, \lambda_2, \ldots\}$.
The spectral radius is $\rho(A) = \max_i \lvert\lambda_i\rvert$ - the largest absolute eigenvalue. It controls the long-run growth rate of $A^k$:
- $\rho(A) < 1$: $A^k \to 0$ as $k \to \infty$ (all trajectories of $x_{k+1} = Ax_k$ converge to $0$)
- $\rho(A) = 1$: trajectories are bounded (marginal stability, provided the Jordan blocks for eigenvalues with $\lvert\lambda\rvert = 1$ have size 1)
- $\rho(A) > 1$: some trajectories grow without bound (unstable)
For real matrices, complex eigenvalues always come in conjugate pairs: if $\lambda \in \sigma(A)$ then $\bar\lambda \in \sigma(A)$, with corresponding conjugate eigenvectors. This means real matrices of odd dimension always have at least one real eigenvalue.
For AI: The spectral radius of the recurrent weight matrix $W$ in an RNN determines stability: $\rho(W) < 1$ causes vanishing gradients; $\rho(W) > 1$ causes explosion; $\rho(W) = 1$ is the critical point targeted by orthogonal RNN designs. The spectral radius of the attention matrix is always 1 (Perron-Frobenius). The spectral radius of $I - \eta H$ (where $H$ is the Hessian) determines gradient descent convergence.
3. Computing Eigenvalues and Eigenvectors
3.1 The Two-Step Process
Computing eigenvalues and eigenvectors follows a universal two-step recipe:
Step 1 - Find eigenvalues: Solve the characteristic equation $\det(A - \lambda I) = 0$. This gives a degree-$n$ polynomial whose roots are the eigenvalues (with algebraic multiplicity).
Step 2 - Find eigenvectors: For each eigenvalue $\lambda_i$, solve the homogeneous linear system $(A - \lambda_i I)v = 0$. Row-reduce to RREF. The free variables give the null space basis vectors - these are the eigenvectors for $\lambda_i$.
This procedure is exact for small matrices. For $n \ge 5$, the Abel-Ruffini theorem guarantees there is no general algebraic formula for the roots of a degree-$n$ polynomial, so numerical methods (Section 3.7) are mandatory in practice.
3.2 The 2\times2 Case
For $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$, the characteristic polynomial is $\lambda^2 - (a+d)\lambda + (ad - bc)$, giving:

$$\lambda = \frac{(a+d) \pm \sqrt{(a+d)^2 - 4(ad - bc)}}{2} = \frac{\operatorname{tr}(A) \pm \sqrt{\operatorname{tr}(A)^2 - 4\det(A)}}{2}$$
Worked example. $A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$.
- $p(\lambda) = \lambda^2 - 4\lambda + 3 = (\lambda - 3)(\lambda - 1)$; eigenvalues $\lambda_1 = 3$, $\lambda_2 = 1$.
- For $\lambda_1 = 3$: $A - 3I = \begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix}$; eigenvector $v_1 = (1, 1)^T$.
- For $\lambda_2 = 1$: $A - I = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$; eigenvector $v_2 = (1, -1)^T$.
- Check: $\lambda_1 + \lambda_2 = 4 = \operatorname{tr}(A)$ OK; $\lambda_1\lambda_2 = 3 = \det(A)$ OK; $v_1^Tv_2 = 0$ OK (symmetric matrix).
3.3 The 3\times3 and Higher Cases
For a $3 \times 3$ matrix $A$, expand $\det(A - \lambda I)$ by cofactor expansion along any row or column:

$$\det(A - \lambda I) = -\lambda^3 + \operatorname{tr}(A)\lambda^2 - (M_{11} + M_{22} + M_{33})\lambda + \det(A)$$

The coefficient of $\lambda$ involves the sum of the three principal $2 \times 2$ minors $M_{11} + M_{22} + M_{33}$ (where $M_{ii}$ is the determinant of $A$ with row $i$ and column $i$ deleted). Factor the cubic if possible; otherwise use the cubic formula or numerical root-finding.
For $n = 4$: expand the determinant directly; factor by guessing integer roots; use the rational root theorem. For $n \ge 5$: numerical methods are mandatory (Section 3.7).
Key shortcut - triangular and diagonal matrices. If $A$ is upper triangular, lower triangular, or diagonal, then $\det(A - \lambda I) = \prod_i (a_{ii} - \lambda)$. The eigenvalues are exactly the diagonal entries. No expansion needed.
Key shortcut - block diagonal matrices. If $A = \operatorname{diag}(B_1, \ldots, B_m)$ is block diagonal, then $\det(A - \lambda I) = \prod_j \det(B_j - \lambda I)$. The eigenvalues of $A$ are the union of the eigenvalues of the blocks.
3.4 Worked Example - 3\times3 Defective Matrix
Let $A = \begin{pmatrix} 2 & 1 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 3 \end{pmatrix}$ (upper triangular).
Step 1: $\det(A - \lambda I) = (2 - \lambda)^2(3 - \lambda)$. Eigenvalues: $\lambda = 2$ (algebraic multiplicity 2), $\lambda = 3$ (multiplicity 1).
Step 2a - eigenvectors for $\lambda = 2$:

$$A - 2I = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

Free variable: $x_1$; forced: $x_2 = 0$, $x_3 = 0$. Eigenvector: $(1, 0, 0)^T$. Geometric multiplicity $\gamma(2) = 1 < \mu(2) = 2$. Defective eigenvalue. The matrix cannot be diagonalised.
Step 2b - eigenvectors for $\lambda = 3$:

$$A - 3I = \begin{pmatrix} -1 & 1 & 0 \\ 0 & -1 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$

Free variable: $x_3$; forced: $x_1 = 0$, $x_2 = 0$. Eigenvector: $(0, 0, 1)^T$. Geometric multiplicity $\gamma(3) = 1 = \mu(3)$. Non-defective.
Conclusion: This matrix has only 2 linearly independent eigenvectors for 3 eigenvalues (counting multiplicity). Diagonalisation fails; Section 7 covers the Jordan form needed here.
3.5 Eigenvalues of Special Matrices
A table of shortcuts that apply throughout the course:
| Matrix type | Condition | Eigenvalue property |
|---|---|---|
| Triangular (upper/lower) | $a_{ij} = 0$ for $i > j$ (or $i < j$) | Eigenvalues = diagonal entries |
| Diagonal | $a_{ij} = 0$ for $i \neq j$ | Eigenvalues $a_{ii}$; eigenvectors $e_i$ |
| Symmetric | $A = A^T$ | All $\lambda_i$ real; eigenvectors orthogonal |
| Skew-symmetric | $A = -A^T$ | All $\lambda_i$ purely imaginary or 0 |
| Orthogonal | $Q^TQ = I$ | All $\lvert\lambda_i\rvert = 1$ |
| Projection | $P^2 = P$, $P^T = P$ | Eigenvalues $\in \{0, 1\}$ |
| Nilpotent | $A^k = 0$ for some $k$ | All $\lambda_i = 0$ |
| Row-stochastic | All entries $\ge 0$, row sums $= 1$ | $\lambda_1 = 1$; all $\lvert\lambda_i\rvert \le 1$ |
| PSD | $x^TAx \ge 0$ for all $x$ | All $\lambda_i \ge 0$ |
| PD | $x^TAx > 0$ for all $x \neq 0$ | All $\lambda_i > 0$ |
3.6 Power Iteration
Power iteration computes the dominant eigenvector - the eigenvector corresponding to the eigenvalue with largest absolute value - at cost $O(n^2)$ per step (or $O(\text{nnz})$ for sparse matrices).
Algorithm: Given an initial vector $x_0$ (random, not orthogonal to $v_1$):
For k = 1, 2, 3, ...:
y_k = A x_{k-1} (apply A - one matrix-vector multiply)
x_k = y_k / ||y_k|| (normalise to unit length)
\lambda_k = x_k^T A x_k (Rayleigh quotient estimate)
stop when |\lambda_k - \lambda_{k-1}| < tolerance
Convergence: The iterates converge to $\pm v_1$ at geometric rate $\lvert\lambda_2/\lambda_1\rvert$ per step. If the spectral gap is large, convergence is rapid; if the gap is small, convergence is slow.
Requirements: $\lambda_1$ must be strictly larger in absolute value than all other eigenvalues; $x_0$ must have a non-zero component in the $v_1$ direction (almost surely true for a random initialisation).
For AI: Power iteration underlies PageRank (dominant eigenvector of the link transition matrix), spectral initialisation of word embeddings, and efficient estimation of the spectral norm $\sigma_{\max}(W)$ for spectral normalisation in GANs and normalising flows. One step of power iteration per gradient update costs only one matrix-vector product and gives a running estimate of $\sigma_{\max}(W)$.
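A minimal NumPy sketch of the algorithm above; the test matrix is illustrative and chosen symmetric so the result can be checked against np.linalg.eigvalsh.

```python
import numpy as np

def power_iteration(A, num_iters=500, tol=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    lam_prev = np.inf
    for _ in range(num_iters):
        y = A @ x                      # one matrix-vector product per step
        x = y / np.linalg.norm(y)      # renormalise to unit length
        lam = x @ A @ x                # Rayleigh quotient estimate
        if abs(lam - lam_prev) < tol:
            break
        lam_prev = lam
    return lam, x

A = np.diag([5.0, 2.0, 1.0]) + 0.1 * np.ones((3, 3))   # symmetric test matrix
lam, v = power_iteration(A)
print(lam, np.max(np.linalg.eigvalsh(A)))               # estimates should agree closely
```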
3.7 The QR Algorithm
The QR algorithm is the practical standard for computing all eigenvalues of a dense matrix. It is one of the top 10 algorithms of the 20th century (Dongarra and Sullivan, 2000) and underlies numpy.linalg.eig, scipy.linalg.eig, and LAPACK's dgeev/dsyev.
Basic QR iteration:
A_0 = A
For k = 1, 2, 3, ...:
    A_{k-1} = Q_k R_k   (QR factorisation of current matrix)
    A_k = R_k Q_k       (reverse-multiply; A_k is similar to A_{k-1})
Each $A_k$ is orthogonally similar to $A$ and thus has the same eigenvalues. The sequence converges to an upper triangular (Schur) form whose diagonal entries are the eigenvalues.
Practical QR algorithm: Reduce $A$ to Hessenberg form first (zero below the first sub-diagonal) using Householder reflections - once. QR steps on a Hessenberg matrix cost $O(n^2)$ each. With Wilkinson shifts (shifting by an estimate of the bottom eigenvalue), convergence is typically cubic. Total cost: $O(n^3)$.
For AI: The QR algorithm is called every time you run np.linalg.eig(A) or np.linalg.eigh(A) (the symmetric version). Computing the full eigendecomposition of the Hessian approximation in K-FAC uses block-wise QR. Eigenvalue decomposition of the attention weight matrix for mechanistic interpretability uses np.linalg.eig.
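A minimal sketch of the unshifted QR iteration in NumPy. Production implementations (LAPACK, np.linalg.eigh) add the Hessenberg reduction and shifts described above; this bare version is only meant to show the similarity transformations converging, on an illustrative symmetric matrix.

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
Ak = A.copy()
for _ in range(200):
    Q, R = np.linalg.qr(Ak)   # QR factorisation of the current iterate
    Ak = R @ Q                # reverse product: similar to Ak, same eigenvalues

print(np.sort(np.diag(Ak)))              # diagonal converges to the eigenvalues
print(np.sort(np.linalg.eigvalsh(A)))    # reference values
```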
3.8 Lanczos Algorithm for Large Sparse Matrices
For large sparse symmetric matrices (e.g., graph Laplacians with millions of nodes), computing all eigenvalues via QR is $O(n^3)$ - completely infeasible. The Lanczos algorithm computes the $k$ extreme eigenvalues (largest or smallest) at cost $O(k \cdot \text{nnz})$, where $\text{nnz}$ is the number of non-zeros.
Core idea: Build a Krylov subspace $\mathcal{K}_k(A, b) = \operatorname{span}\{b, Ab, A^2b, \ldots, A^{k-1}b\}$. The eigenvalues of the tridiagonal projection $T_k = Q_k^TAQ_k$ (called Ritz values) converge to the extreme eigenvalues of $A$ as $k$ grows. Typically $k \ll n$.
Algorithm (symmetric Lanczos):
\beta_0 = 0, q_0 = 0, q_1 = b/||b||
For j = 1, 2, ..., k:
z = A q_j
\alpha_j = q_j^T z
    z <- z - \alpha_j q_j - \beta_{j-1} q_{j-1}
    \beta_j = ||z||;  q_{j+1} = z / \beta_j
T_k = tridiag(\alpha_1,...,\alpha_k; \beta_1,...,\beta_{k-1})
The Ritz values (eigenvalues of $T_k$) converge to the extreme eigenvalues of $A$ at an exponential rate. In practice, re-orthogonalisation is needed to combat floating-point accumulation of rounding errors.
For AI: Lanczos is used to compute top- eigenvectors of large graph Laplacians for spectral clustering and graph neural network positional encodings; to estimate the Hessian eigenspectrum of large neural networks (PyHessian, HessianFlow); to compute dominant eigenvectors of large kernel matrices (Nystrom approximation); and to analyse the spectral structure of attention weight matrices in large transformers.
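A minimal sketch using scipy's Lanczos-based eigsh (ARPACK) to get extreme eigenvalues of a large sparse Laplacian. The random graph and sizes below are illustrative; shift-invert around a point below zero is used for the smallest eigenvalues.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

n = 2000
A = sp.random(n, n, density=0.002, random_state=0)
A = ((A + A.T) > 0).astype(float)              # symmetric 0/1 adjacency matrix
deg = np.asarray(A.sum(axis=1)).ravel()
L = (sp.diags(deg) - A).tocsc()                # graph Laplacian, symmetric PSD

vals_large = eigsh(L, k=6, which='LA', return_eigenvectors=False)
vals_small = eigsh(L, k=6, sigma=-0.01, which='LM', return_eigenvectors=False)
print(np.sort(vals_small))                     # smallest eigenvalues, including ~0
print(np.sort(vals_large))                     # largest eigenvalues
```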
4. Properties of Eigenvalues
4.1 Trace and Determinant Relations
Two fundamental identities connect eigenvalues to matrix invariants computable without solving the characteristic equation:

$$\operatorname{tr}(A) = \sum_{i=1}^n \lambda_i, \qquad \det(A) = \prod_{i=1}^n \lambda_i$$

Proof (trace). The characteristic polynomial is $p(\lambda) = \prod_i (\lambda - \lambda_i)$. Expanding: the coefficient of $\lambda^{n-1}$ is $-\sum_i \lambda_i$. Expanding $\det(\lambda I - A)$ directly: the $\lambda^{n-1}$ term comes only from the product of diagonal entries $\prod_i (\lambda - a_{ii})$, giving $-\sum_i a_{ii}$ for the coefficient of $\lambda^{n-1}$. Hence $\sum_i \lambda_i = \operatorname{tr}(A)$.
Proof (determinant). $p(0) = \prod_i (0 - \lambda_i) = (-1)^n\prod_i \lambda_i$. Also $p(0) = \det(0 \cdot I - A) = \det(-A) = (-1)^n\det(A)$. Hence $\det(A) = \prod_i \lambda_i$.
These identities provide quick sanity checks for eigenvalue computations and useful theoretical tools.
For AI: $\operatorname{tr}(H)$ is the total curvature of the loss surface; it can be estimated cheaply via Hutchinson's trace estimator (for a random vector $z$ with i.i.d. $\pm 1$ entries, $\mathbb{E}[z^THz] = \operatorname{tr}(H)$). $\log\det\Sigma$ appears in Gaussian log-likelihoods and normalising flow log-densities. $\log\det A = \sum_i \log\lambda_i$ for symmetric positive definite matrices.
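A minimal sketch of Hutchinson's estimator. Here the "Hessian" is an explicit PSD matrix for checking; in practice the same estimator only needs Hessian-vector products. Matrix and sample count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
B = rng.standard_normal((n, n))
H = B @ B.T                                 # symmetric PSD stand-in for a Hessian

num_samples = 1000
estimates = []
for _ in range(num_samples):
    z = rng.choice([-1.0, 1.0], size=n)     # Rademacher probe vector
    estimates.append(z @ H @ z)             # one quadratic form per sample

print(np.mean(estimates), np.trace(H))      # estimator mean vs exact trace
```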
4.2 Eigenvalues Under Matrix Operations
A powerful feature of eigenvalues: many matrix operations transform eigenvalues in simple, predictable ways - without changing the eigenvectors.
| Operation | Eigenvalues | Eigenvectors | Proof sketch |
|---|---|---|---|
| same | |||
| same | Induction: | ||
| (if exists) | same | ||
| same | Linearity + rule | ||
| same | different (left eigenvectors of ) | ||
| same | |||
| same |
For AI: The eigenvalues of $I - \eta H$ are $1 - \eta\lambda_i$ - this is the key formula for gradient descent convergence. The eigenvalues of $e^{tA}$ are $e^{t\lambda_i}$ - this governs neural ODE dynamics. Preconditioning replaces $H$ with $P^{-1}H$ to give better-conditioned eigenvalues.
4.3 Eigenvalues of Special Matrices
Symmetric matrices ($A = A^T$): All eigenvalues are real; eigenvectors for distinct eigenvalues are orthogonal; always diagonalisable with an orthonormal eigenbasis. (Full proof in Section 6.)
Positive semidefinite ($A \succeq 0$): All $\lambda_i \ge 0$. Equivalent characterisations: $x^TAx \ge 0$ for all $x$; $A = B^TB$ for some $B$; all principal minors non-negative.
Orthogonal ($Q^TQ = I$): All $\lvert\lambda\rvert = 1$. Over $\mathbb{R}$: real eigenvalues can only be $\pm 1$. Over $\mathbb{C}$: eigenvalues lie on the unit circle. Proof: $\|Qv\| = \|v\|$ (orthogonal maps preserve length) and $\|Qv\| = \lvert\lambda\rvert\|v\|$, so $\lvert\lambda\rvert = 1$.
Projection ($P^2 = P$, $P^T = P$): Eigenvalues $\in \{0, 1\}$. The 1-eigenspace is $\operatorname{range}(P)$; the 0-eigenspace is $\ker(P)$. Proof: $Pv = \lambda v$ implies $P^2v = \lambda^2v$; but $P^2 = P$, so $\lambda^2v = \lambda v$; hence $\lambda(\lambda - 1) = 0$.
Row-stochastic ($a_{ij} \ge 0$, $\sum_j a_{ij} = 1$): By Perron-Frobenius, the dominant eigenvalue is $\lambda_1 = 1$ with eigenvector $\mathbf{1}$ (the all-ones vector); all other eigenvalues satisfy $\lvert\lambda\rvert \le 1$. The attention weight matrix is row-stochastic, so its largest eigenvalue is always 1.
4.4 The Rayleigh Quotient
For a symmetric matrix $A$ and any non-zero $x$, the Rayleigh quotient is:

$$R(x) = \frac{x^TAx}{x^Tx}$$

It is bounded between the smallest and largest eigenvalues:

$$\lambda_{\min} \le R(x) \le \lambda_{\max}$$

Proof. Write $x = \sum_i c_iq_i$ in the orthonormal eigenbasis. Then $x^TAx = \sum_i \lambda_ic_i^2$ and $x^Tx = \sum_i c_i^2$, giving $R(x) = \sum_i \lambda_ic_i^2 \big/ \sum_i c_i^2$ - a weighted average of eigenvalues with non-negative weights summing to 1. A weighted average is bounded by the min and max of its terms. Equality at $x = q_{\max}$ gives $\lambda_{\max}$; at $x = q_{\min}$ it gives $\lambda_{\min}$.
This gives the variational characterisation of extreme eigenvalues:

$$\lambda_{\max} = \max_{x \neq 0} R(x), \qquad \lambda_{\min} = \min_{x \neq 0} R(x)$$

For AI: The PCA objective - find the direction of maximum variance - is exactly $\max_{x \neq 0} \frac{x^T\Sigma x}{x^Tx}$, achieved at the top eigenvector $q_1$. The curvature of the loss in direction $d$ is $\frac{d^THd}{d^Td}$; the maximum curvature is $\lambda_{\max}(H)$; the minimum is $\lambda_{\min}(H)$. Rayleigh quotient iteration refines an eigenvector estimate by shifting: solve $(A - R(x_k)I)y = x_k$, renormalise - this gives cubic convergence to the nearest eigenvector.
4.5 The Courant-Fischer Min-Max Theorem
The Rayleigh quotient gives $\lambda_{\max}$ and $\lambda_{\min}$. The Courant-Fischer theorem generalises this to all eigenvalues. For symmetric $A$ with eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$:

$$\lambda_k = \max_{\substack{S \subseteq \mathbb{R}^n \\ \dim S = k}} \; \min_{\substack{x \in S \\ x \neq 0}} \frac{x^TAx}{x^Tx}$$

Equivalently, $\lambda_k$ is the best minimum Rayleigh quotient you can guarantee if you choose a $k$-dimensional subspace optimally.
Weyl's inequalities give eigenvalue stability under perturbation: for symmetric $A$ and a symmetric perturbation $E$:

$$\lambda_k(A) + \lambda_{\min}(E) \;\le\; \lambda_k(A + E) \;\le\; \lambda_k(A) + \lambda_{\max}(E)$$

In particular, $\lvert\lambda_k(A + E) - \lambda_k(A)\rvert \le \|E\|_2$. Eigenvalues of symmetric matrices are Lipschitz in the matrix entries with constant 1 (in spectral norm).
For AI: Weyl's inequalities bound how much eigenvalues shift when model weights are perturbed (e.g., by quantisation noise or gradient updates). If the perturbation satisfies $\|E\|_2 \le \epsilon$, then no eigenvalue moves by more than $\epsilon$. This is why spectral properties of trained models are relatively robust to small weight perturbations.
4.6 Gershgorin Circle Theorem
The Gershgorin Circle Theorem localises eigenvalues using only the matrix entries - no eigenvalue computation required.
Theorem. For $A \in \mathbb{C}^{n \times n}$, every eigenvalue lies in at least one Gershgorin disc:

$$D_i = \{z \in \mathbb{C} : \lvert z - a_{ii}\rvert \le R_i\}, \qquad R_i = \sum_{j \neq i} \lvert a_{ij}\rvert$$

Each disc is centred at the diagonal entry $a_{ii}$ with radius equal to the sum of absolute values of the off-diagonal entries in row $i$. All eigenvalues: $\sigma(A) \subseteq \bigcup_i D_i$.
Proof. Let $\lambda$ be an eigenvalue with eigenvector $v$. Choose $i$ such that $\lvert v_i\rvert = \max_j \lvert v_j\rvert$ (the largest component). From $Av = \lambda v$ at row $i$: $(\lambda - a_{ii})v_i = \sum_{j \neq i} a_{ij}v_j$. Taking absolute values and dividing by $\lvert v_i\rvert$: $\lvert\lambda - a_{ii}\rvert \le \sum_{j \neq i} \lvert a_{ij}\rvert\,\frac{\lvert v_j\rvert}{\lvert v_i\rvert} \le R_i$. So $\lambda \in D_i$.
Corollary (diagonal dominance): If $\lvert a_{ii}\rvert > R_i$ for all $i$, then all Gershgorin discs exclude the origin, so $0 \notin \sigma(A)$ - the matrix is invertible.
For AI: Gershgorin circles give fast bounds on the eigenvalue range of attention matrices, Hessians, or weight matrices without computing eigenvalues. If all discs lie in the right half-plane ($a_{ii} - R_i > 0$ for all $i$), a symmetric matrix is positive definite. Useful for verifying that a discretised ODE system is stable before running the forward pass.
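A minimal sketch of Gershgorin bounds in NumPy; the matrix is illustrative and the exact eigenvalues are printed for comparison.

```python
import numpy as np

def gershgorin_bounds(A):
    """Lower/upper bounds on eigenvalue real parts from the union of Gershgorin discs."""
    centers = np.diag(A)
    radii = np.sum(np.abs(A), axis=1) - np.abs(centers)   # off-diagonal row sums
    return (centers - radii).min(), (centers + radii).max()

A = np.array([[10.0, 1.0, 0.5],
              [0.2,  5.0, 0.3],
              [0.1,  0.4, 2.0]])
lo, hi = gershgorin_bounds(A)
print(lo, hi)                      # every eigenvalue's real part lies in [lo, hi]
print(np.linalg.eigvals(A))        # exact eigenvalues for comparison
```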
5. Diagonalisation
5.1 The Diagonalisation Theorem
Definition. A matrix $A \in \mathbb{R}^{n \times n}$ is diagonalisable if there exists an invertible matrix $V$ and a diagonal matrix $\Lambda$ such that:

$$A = V\Lambda V^{-1}$$

where $V$ has the eigenvectors $v_1, \ldots, v_n$ as columns and $\Lambda$ has the corresponding eigenvalues $\lambda_1, \ldots, \lambda_n$ on the diagonal.
Proof (independent eigenvectors $\Rightarrow$ factorisation). If $Av_i = \lambda_iv_i$ for $i = 1, \ldots, n$ and the $v_i$ are linearly independent, then $AV = V\Lambda$ (column $i$ of $AV$ is $Av_i = \lambda_iv_i$, which equals column $i$ of $V\Lambda$). Since $V$ is invertible (independent columns): $A = V\Lambda V^{-1}$.
Proof (factorisation $\Rightarrow$ independent eigenvectors). If $A = V\Lambda V^{-1}$ then $AV = V\Lambda$; column $i$ gives $Av_i = \lambda_iv_i$; the columns of $V$ are eigenvectors. Since $V$ is invertible, they are linearly independent.
Key insight: Diagonalisation is a change of basis. In the eigenbasis (coordinates given by $V^{-1}x$), the matrix acts as pure coordinate-wise scaling by $\Lambda$ - no mixing between different directions. This is why diagonalisable matrices are "simple": in their natural basis, they do nothing but stretch.
5.2 When Is A Diagonalisable?
Sufficient condition: If $A$ has $n$ distinct eigenvalues, it is diagonalisable. Proof: eigenvectors for distinct eigenvalues are linearly independent (Section 2.4); $n$ independent vectors in $\mathbb{R}^n$ form a basis; $V$ is invertible.
Necessary and sufficient: $A$ is diagonalisable if and only if $\gamma(\lambda) = \mu(\lambda)$ for every eigenvalue $\lambda$. Equivalently: the eigenspaces together span all of $\mathbb{R}^n$.
Always diagonalisable:
- Symmetric matrices (Spectral Theorem, Section 6)
- Any matrix with $n$ distinct eigenvalues
- Normal matrices, satisfying $AA^* = A^*A$ (diagonalisable over $\mathbb{C}$)
Not necessarily diagonalisable:
- Matrices with a repeated eigenvalue where $\gamma(\lambda) < \mu(\lambda)$ (defective)
- The shear matrix $\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$ has $\lambda = 1$ with $\mu(1) = 2$, $\gamma(1) = 1$ - defective
Non-examples of diagonalisability over $\mathbb{R}$:
- Any real matrix with complex (non-real) eigenvalues (e.g., rotation matrices) cannot be diagonalised over $\mathbb{R}$, though it can be over $\mathbb{C}$
5.3 Matrix Powers via Diagonalisation
If $A = V\Lambda V^{-1}$, then matrix powers become trivial:

$$A^k = V\Lambda^kV^{-1}, \qquad \Lambda^k = \operatorname{diag}(\lambda_1^k, \ldots, \lambda_n^k)$$

Proof. $A^2 = V\Lambda V^{-1}V\Lambda V^{-1} = V\Lambda^2V^{-1}$. By induction: $A^k = V\Lambda^kV^{-1}$.
Computational gain. Without diagonalisation: computing $A^k$ by repeated matrix multiplication costs $O(n^3\log k)$ (via fast exponentiation). With diagonalisation: once $V$ and $\Lambda$ are known, $A^kx$ costs $O(n^2)$ per query (three matrix-vector products) for any $k$.
Worked example - Fibonacci numbers. The Fibonacci recurrence is a linear system: $\begin{pmatrix} F_{k+1} \\ F_k \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} F_k \\ F_{k-1} \end{pmatrix}$. Diagonalising gives eigenvalues $\varphi = \frac{1+\sqrt5}{2}$ (the golden ratio) and $\psi = \frac{1-\sqrt5}{2}$. Therefore $F_k = \frac{\varphi^k - \psi^k}{\sqrt5}$ - Binet's formula. The dominant eigenvalue controls asymptotic growth: $F_k \sim \varphi^k/\sqrt5$.
For AI: Matrix powers appear in RNN unrolling ($W^t$); Markov chain mixing ($P^t$); and computing the influence of past inputs on current hidden states. The eigendecomposition reveals which "modes" persist long-term (large $\lvert\lambda_i\rvert$) and which decay quickly (small $\lvert\lambda_i\rvert$).
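A minimal NumPy sketch of matrix powers via diagonalisation, using the Fibonacci matrix from the worked example; reading off an entry of $A^k$ recovers $F_k$.

```python
import numpy as np

F = np.array([[1.0, 1.0],
              [1.0, 0.0]])
lam, V = np.linalg.eig(F)            # eigenvalues phi and psi, eigenvectors in V

def fib(k):
    # A^k = V diag(lam^k) V^{-1}; the (0, 1) entry of F^k is F_k
    Ak = V @ np.diag(lam**k) @ np.linalg.inv(V)
    return Ak[0, 1].real

print([round(fib(k)) for k in range(1, 11)])   # 1, 1, 2, 3, 5, 8, 13, 21, 34, 55
```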
5.4 Matrix Functions via Diagonalisation
For a diagonalisable matrix $A = V\Lambda V^{-1}$ and any function $f$ defined on the eigenvalues, define:

$$f(A) = Vf(\Lambda)V^{-1}, \qquad f(\Lambda) = \operatorname{diag}(f(\lambda_1), \ldots, f(\lambda_n))$$

This is the natural extension of scalar functions to matrices: apply $f$ to each eigenvalue.
Matrix exponential: $e^A = Ve^\Lambda V^{-1}$. The solution to $\dot x = Ax$ is $x(t) = e^{tA}x(0)$. For AI: neural ODEs use the matrix exponential to propagate hidden states through continuous time; the stability of the ODE is governed by whether $\operatorname{Re}(\lambda_i) < 0$ for all $i$.
Matrix square root: $A^{1/2} = V\Lambda^{1/2}V^{-1}$ - requires $\lambda_i \ge 0$ (valid for PSD matrices). Used in preconditioning: the natural gradient update involves $F^{-1}$ (or $F^{-1/2}$), where $F$ is the Fisher information matrix.
Log-determinant: $\log\det A = \sum_i \log\lambda_i$. Appears in: Gaussian log-likelihoods ($-\frac12\log\det\Sigma$); normalising flow log-densities; variational inference ELBO terms. Efficient estimation via stochastic trace estimators when $A$ is large.
Cayley-Hamilton theorem: Every matrix satisfies its own characteristic polynomial: $p(A) = 0$. This means $A^n$ can be written as a linear combination of $I, A, \ldots, A^{n-1}$ - matrix polynomials can always be reduced to degree at most $n - 1$.
5.5 Change of Basis Interpretation
Diagonalisation encodes a three-step computation:
DIAGONALISATION AS CHANGE OF BASIS
========================================================================
  Input x  -->  V^{-1} x  -->  \Lambda (V^{-1} x)  -->  V \Lambda V^{-1} x = Ax
               (express in     (scale each            (convert back
                eigenbasis)     coordinate             to standard
                                by \lambda_i)           basis)
In the eigenbasis, A acts as pure scaling - no coupling!
========================================================================
This is the fundamental reason diagonalisable matrices are tractable: in the right coordinate system, they decouple completely. The $i$-th eigenvector direction is scaled by $\lambda_i$ and is independent of all other directions. No information leaks between eigenvector components.
For AI: PCA is precisely this change of basis. The eigenvectors of the covariance matrix $\Sigma$ form a new coordinate system (the principal component basis). In this basis, the covariance is diagonal ($\Sigma = Q\Lambda Q^T$, so $Q^T\Sigma Q = \Lambda$): the features are uncorrelated. Each principal component direction is independent, scaled by its variance $\lambda_i$. Running PCA = expressing your data in the eigenbasis of its covariance.
5.6 Simultaneous Diagonalisation
Two symmetric matrices $A$ and $B$ are simultaneously diagonalisable - they share the same orthonormal eigenbasis - if and only if $AB = BA$ (they commute).
Proof (shared eigenbasis $\Rightarrow$ commute). If $A = Q\Lambda_AQ^T$ and $B = Q\Lambda_BQ^T$ with the same $Q$: $AB = Q\Lambda_A\Lambda_BQ^T = Q\Lambda_B\Lambda_AQ^T = BA$.
Proof (commute $\Rightarrow$ shared eigenbasis). If $AB = BA$ and $Av = \lambda v$: $A(Bv) = B(Av) = \lambda(Bv)$, so $Bv$ is also in the $\lambda$-eigenspace of $A$. Thus $B$ maps each eigenspace of $A$ into itself. On each eigenspace, the restriction of $B$ is symmetric, so it can be diagonalised there. Choose an orthonormal basis for each eigenspace of $A$ that also diagonalises $B$. Together these form a simultaneous eigenbasis.
For AI: Simultaneous diagonalisation appears in the analysis of optimisers. If the Hessian $H$ and the Fisher information matrix $F$ commute, they share an eigenbasis - in this basis, the natural gradient is a simple eigenvalue-rescaled version of the gradient. K-FAC (Kronecker-Factored Approximate Curvature) builds a structured approximation to $F$ that is easier to diagonalise, enabling efficient natural gradient updates. Attention heads whose key-query weight matrices commute have aligned eigenspaces, simplifying their circuit analysis.
6. The Spectral Theorem
6.1 Statement and Significance
The Spectral Theorem is the crown jewel of applied linear algebra. It guarantees that symmetric matrices have a perfect, complete structure - one that makes them tractable both theoretically and computationally.
Real Spectral Theorem. For any symmetric matrix $A = A^T \in \mathbb{R}^{n \times n}$:
- All eigenvalues are real: $\lambda_i \in \mathbb{R}$.
- Eigenvectors for distinct eigenvalues are orthogonal: $q_i^Tq_j = 0$ for $\lambda_i \neq \lambda_j$.
- There always exist $n$ orthonormal eigenvectors, regardless of multiplicities.
- $A$ is orthogonally diagonalisable: $A = Q\Lambda Q^T$, where $Q$ is orthogonal ($Q^TQ = I$) and $\Lambda$ is real diagonal.
The significance cannot be overstated. For a general matrix, diagonalisation may fail (defective case) or require a non-orthogonal $V$ (so $V^{-1} \neq V^T$, making computations expensive). For symmetric matrices: diagonalisation always works, and $Q^{-1} = Q^T$ (free, numerically stable). The eigenbasis is orthonormal - the most natural and well-conditioned basis possible.
Why symmetry matters for AI: Covariance matrices, Gram matrices, Hessians at local minima, graph Laplacians, and kernel matrices are all symmetric PSD. The Spectral Theorem applies to all of them, guaranteeing real non-negative eigenvalues and an orthonormal eigenbasis. This is why PCA, spectral clustering, and natural gradient methods are well-defined and numerically stable.
6.2 Proof Sketch
All eigenvalues real. Let $Av = \lambda v$ with $v \neq 0$ (allowing $\lambda$ and $v$ to be complex). Take the complex inner product: $\bar v^TAv = \lambda\,\bar v^Tv$. Also, $\bar v^TAv = (A\bar v)^Tv = \bar\lambda\,\bar v^Tv$ (using $A$ real and symmetric). So $\lambda = \bar\lambda$, meaning $\lambda$ is real.
Eigenvectors for distinct eigenvalues are orthogonal. Let $Av_1 = \lambda_1v_1$ and $Av_2 = \lambda_2v_2$ with $\lambda_1 \neq \lambda_2$. Then:

$$\lambda_1v_1^Tv_2 = (Av_1)^Tv_2 = v_1^TAv_2 = \lambda_2v_1^Tv_2$$

(using $A = A^T$ - self-adjointness of $A$). Since $\lambda_1 \neq \lambda_2$: $v_1^Tv_2 = 0$.
Existence of orthonormal eigenbasis. By induction on $n$. $A$ has at least one real eigenvalue (proved above). Let $q_1$ be a unit eigenvector. Consider the restriction of $A$ to $q_1^\perp$ (the orthogonal complement). This restriction is again symmetric (since $A$ maps $q_1^\perp$ to itself: if $x \perp q_1$ then $q_1^T(Ax) = (Aq_1)^Tx = \lambda_1q_1^Tx = 0$). Apply induction to this $(n-1)$-dimensional restriction.
6.3 The Spectral Decomposition
From $A = Q\Lambda Q^T$ with columns $q_1, \ldots, q_n$:

$$A = \sum_{i=1}^n \lambda_i\,q_iq_i^T$$

Each term $\lambda_iq_iq_i^T$ is a rank-1 symmetric matrix - a scaled orthogonal projection onto $\operatorname{span}(q_i)$. The full matrix is a weighted sum of rank-1 orthogonal projections, with weights equal to the eigenvalues.
The projectors $P_i = q_iq_i^T$ satisfy: $P_i^2 = P_i$ (projection); $P_i^T = P_i$ (symmetric); $P_iP_j = 0$ for $i \neq j$ (orthogonality); $\sum_i P_i = I$ (completeness). These are the building blocks of the spectral decomposition.
For repeated eigenvalues: If $\lambda$ has multiplicity $m$ with orthonormal eigenvectors $q_1, \ldots, q_m$, the contribution is $\lambda\sum_{i=1}^m q_iq_i^T$ - the eigenvalue times the orthogonal projection onto the eigenspace.
For AI: The spectral decomposition of the covariance matrix reveals PCA directly: each term $\lambda_iq_iq_i^T$ is one principal component scaled by its variance. Keeping the top-$k$ terms gives the best rank-$k$ approximation to $\Sigma$ (Eckart-Young). The spectral decomposition of the Hessian decomposes the loss curvature into independent directions: gradient descent along $q_i$ converges at rate $\lvert 1 - \eta\lambda_i\rvert$, independent of all other directions.
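A minimal NumPy check of the spectral decomposition: a symmetric matrix equals the eigenvalue-weighted sum of rank-1 projectors $q_iq_i^T$. The matrix is random and illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                          # symmetric test matrix

lam, Q = np.linalg.eigh(A)                 # orthonormal eigenvectors in columns of Q
recon = sum(lam[i] * np.outer(Q[:, i], Q[:, i]) for i in range(4))

print(np.allclose(A, recon))               # True: A = sum_i lambda_i q_i q_i^T
print(np.allclose(Q.T @ Q, np.eye(4)))     # True: the eigenbasis is orthonormal
```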
6.4 Positive (Semi)definite Matrices via the Spectral Theorem
A symmetric matrix $A$ is positive semidefinite (PSD, written $A \succeq 0$) iff all eigenvalues are non-negative; positive definite (PD, $A \succ 0$) iff all eigenvalues are strictly positive.
Proof. Using the spectral decomposition: $x^TAx = \sum_i \lambda_i(q_i^Tx)^2$. This is a sum of eigenvalues weighted by the non-negative squares $(q_i^Tx)^2$. The sum is $\ge 0$ for all $x$ iff all $\lambda_i \ge 0$; strictly $> 0$ for all $x \neq 0$ iff all $\lambda_i > 0$.
Equivalent characterisations of PSD:
- $A = B^TB$ for some matrix $B$ (Gram matrix form)
- All principal minors are non-negative
- Cholesky decomposition exists (with positive diagonal for PD)
Condition number: For a PD matrix $A$: $\kappa(A) = \lambda_{\max}/\lambda_{\min}$. It measures how elongated the ellipsoid $\{x : x^TAx = 1\}$ is. Large $\kappa$: ill-conditioned; gradient descent converges slowly.
For AI: Covariance matrices and Gram matrices are always PSD (they are of the form $B^TB$). The Hessian at a local minimum is PSD; at a saddle point, it has both positive and negative eigenvalues. Checking PSD via eigenvalues is the standard diagnostic: if np.linalg.eigvalsh(H).min() < 0, the critical point is not a local minimum.
6.5 The Complex Spectral Theorem
For complex Hermitian matrices $A = A^*$ (conjugate transpose), the same theorem holds over $\mathbb{C}$: all eigenvalues are real; eigenvectors for distinct eigenvalues are orthogonal under the complex inner product; $A$ is unitarily diagonalisable: $A = U\Lambda U^*$, where $U$ is unitary ($U^*U = I$).
For AI: Quantum computing represents observables as Hermitian operators; measurement outcomes are eigenvalues (always real). Complex-valued neural networks and unitary RNNs use Hermitian or unitary weight matrices to ensure stable, norm-preserving dynamics. The Fourier transform is a unitary operator whose eigenvalues are complex roots of unity - the eigenfunctions are pure sinusoids.
7. Jordan Normal Form
7.1 Motivation - Defective Matrices
Not every matrix is diagonalisable. When $\gamma(\lambda) < \mu(\lambda)$ for some eigenvalue, we lack enough eigenvectors to form a basis. The Jordan Normal Form (JNF) is the canonical replacement for diagonalisation - it works for every square matrix over $\mathbb{C}$, whether diagonalisable or not.
The JNF tells us exactly what the "obstruction" to diagonalisation is: instead of a pure scaling in each coordinate, we get a scaling plus a "coupling" to the next coordinate - a Jordan block.
7.2 Jordan Blocks
A Jordan block of size $k$ for eigenvalue $\lambda$ is:

$$J_k(\lambda) = \begin{pmatrix} \lambda & 1 & & \\ & \lambda & \ddots & \\ & & \ddots & 1 \\ & & & \lambda \end{pmatrix} \in \mathbb{C}^{k \times k}$$

Diagonal: all $\lambda$. Superdiagonal: all 1. Everything else: 0.
- $k = 1$: ordinary eigenvalue, non-defective.
- $k \ge 2$: exactly one eigenvector $e_1$; the remaining basis vectors are generalised eigenvectors, satisfying $(J_k(\lambda) - \lambda I)^mv = 0$ for some $m \le k$.
7.3 The Jordan Normal Form
Theorem. For any $A \in \mathbb{C}^{n \times n}$, there exists an invertible $P$ such that:

$$A = PJP^{-1}, \qquad J = \operatorname{diag}\bigl(J_{k_1}(\lambda_1), \ldots, J_{k_m}(\lambda_m)\bigr)$$

$J$ is block diagonal with Jordan blocks; $\sum_j k_j = n$. For each eigenvalue $\lambda$: the number of Jordan blocks equals $\gamma(\lambda)$; the block sizes are determined by the rank sequence of $(A - \lambda I)^m$ for $m = 1, 2, \ldots$.
The JNF is the unique canonical form for matrices up to similarity: two matrices are similar ($B = S^{-1}AS$ for some invertible $S$) iff they have the same Jordan form (up to reordering blocks).
7.4 Powers of Jordan Blocks
The entries of $J_k(\lambda)^t$ involve binomial terms $\binom{t}{j}\lambda^{t-j}$ and grow like $t^{k-1}\lvert\lambda\rvert^t$ for fixed $k$.
- $\lvert\lambda\rvert < 1$: all entries $\to 0$ despite the polynomial factor - stable convergence.
- $\lvert\lambda\rvert = 1$, $k \ge 2$: entries grow polynomially as $t^{k-1}$ - marginal instability.
- $\lvert\lambda\rvert > 1$: exponential growth - unstable.
For AI: An RNN with $\rho(W) = 1$ exactly (at the stability boundary) is dangerous if $W$ has a Jordan block of size $\ge 2$ for a unit-modulus eigenvalue - the hidden state grows polynomially even though no eigenvalue exceeds 1. LSTM forget gates aim to keep the effective spectral radius exactly 1 while avoiding non-trivial Jordan blocks. Gradient clipping and spectral normalisation are practical safeguards against this polynomial growth.
8. Singular Value Decomposition
8.1 SVD as Generalisation of Eigendecomposition
Eigendecomposition requires $A$ to be square and diagonalisable. The Singular Value Decomposition (SVD) generalises to any matrix $A \in \mathbb{R}^{m \times n}$:

$$A = U\Sigma V^T$$

where $U \in \mathbb{R}^{m \times m}$ is orthogonal (left singular vectors), $\Sigma \in \mathbb{R}^{m \times n}$ is "diagonal" with non-negative entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$ (singular values), and $V \in \mathbb{R}^{n \times n}$ is orthogonal (right singular vectors).
SVD always exists and is unique (when singular values are distinct). It applies to rectangular matrices, rank-deficient matrices, and matrices with repeated or zero singular values - no assumptions needed.
8.2 Connection to Eigenvalues
SVD and eigendecomposition are deeply linked via the symmetric matrices $A^TA$ and $AA^T$:
- $A^TA \in \mathbb{R}^{n \times n}$ is symmetric PSD; its eigenvalues are $\sigma_i^2$; its orthonormal eigenvectors are the right singular vectors (columns of $V$): $A^TAv_i = \sigma_i^2v_i$.
- $AA^T \in \mathbb{R}^{m \times m}$ is symmetric PSD; its non-zero eigenvalues are the same $\sigma_i^2$; its orthonormal eigenvectors are the left singular vectors (columns of $U$): $AA^Tu_i = \sigma_i^2u_i$.
- The connection: $Av_i = \sigma_iu_i$ and $A^Tu_i = \sigma_iv_i$.
For a square symmetric PSD matrix: singular values = eigenvalues; $U = V = Q$ (the orthogonal eigenvector matrix); SVD = eigendecomposition.
8.3 Thin and Truncated SVD
Full SVD: $U \in \mathbb{R}^{m \times m}$, $\Sigma \in \mathbb{R}^{m \times n}$, $V \in \mathbb{R}^{n \times n}$. Includes singular vectors for zero singular values (null space directions).
Thin (economy) SVD for $m \ge n$: $U \in \mathbb{R}^{m \times n}$, $\Sigma \in \mathbb{R}^{n \times n}$ (square), $V \in \mathbb{R}^{n \times n}$. Equivalent for reconstructing $A$ but smaller: np.linalg.svd(A, full_matrices=False).
Truncated rank-$k$ SVD: Keep only the top $k$ singular triplets:

$$A_k = \sum_{i=1}^k \sigma_iu_iv_i^T$$

Eckart-Young Theorem: $A_k$ is the best rank-$k$ approximation to $A$ in both spectral norm and Frobenius norm:

$$\|A - A_k\|_2 = \sigma_{k+1}, \qquad \|A - A_k\|_F = \sqrt{\textstyle\sum_{i > k}\sigma_i^2}$$

No other rank-$k$ matrix is closer to $A$. This is the mathematical foundation for LoRA, PCA, image compression, and recommender systems.
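A minimal NumPy sketch of the Eckart-Young theorem: the truncated SVD's spectral-norm error equals the first discarded singular value. The matrix and rank are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation

print(np.linalg.norm(A - A_k, 2), s[k])            # equal: ||A - A_k||_2 = sigma_{k+1}
print(np.linalg.norm(A - A_k, 'fro'),
      np.sqrt(np.sum(s[k:]**2)))                   # Frobenius error matches too
```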
8.4 Geometric Interpretation
The SVD decomposes any linear map into three pure operations:
SVD GEOMETRIC DECOMPOSITION
========================================================================
  x  -->  V^T x  ------>  \Sigma (V^T x)  ------>  U \Sigma V^T x = Ax
         (rotate/         (scale each             (rotate/reflect
          reflect into      coordinate              into output
          input basis)      by \sigma_i)            space)
Unit sphere --> (same orientation) --> ellipsoid
semi-axes: \sigma_1u_1, \sigma_2u_2, ...
========================================================================
The unit sphere in $\mathbb{R}^n$ maps to an ellipsoid in $\mathbb{R}^m$ with semi-axes $\sigma_1u_1, \sigma_2u_2, \ldots$. The singular values are the lengths of the ellipsoid axes; the right singular vectors are the input directions that map to those axes; the left singular vectors are the output directions.
8.5 SVD and the Four Fundamental Subspaces
SVD reveals all four fundamental subspaces of $A \in \mathbb{R}^{m \times n}$ with rank $r$ simultaneously:
| Subspace | Basis | Space |
|---|---|---|
| Row space | $v_1, \ldots, v_r$ (columns of $V$) | $\mathbb{R}^n$ |
| Null space | $v_{r+1}, \ldots, v_n$ | $\mathbb{R}^n$ |
| Column space | $u_1, \ldots, u_r$ (columns of $U$) | $\mathbb{R}^m$ |
| Left null space | $u_{r+1}, \ldots, u_m$ | $\mathbb{R}^m$ |
The orthogonality of $U$ and $V$ gives explicit orthogonal decompositions: $\mathbb{R}^n = \operatorname{row}(A) \oplus \ker(A)$ and $\mathbb{R}^m = \operatorname{col}(A) \oplus \ker(A^T)$, with orthonormal bases for each subspace directly from $U$ and $V$.
For AI: LoRA represents weight updates as $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, $r \ll \min(d, k)$. The SVD of $\Delta W$ has at most $r$ non-zero singular values - it lives in an $r$-dimensional subspace of weight space. The Eckart-Young theorem justifies this: if the true weight update is approximately low-rank (as empirically observed), LoRA captures most of its structure with far fewer parameters.
9. Eigendecomposition in AI Applications
9.1 Principal Component Analysis (PCA)
PCA is eigendecomposition of the sample covariance matrix, period. Given centred data $X \in \mathbb{R}^{n \times d}$ (subtract column means):

$$\Sigma = \frac{1}{n}X^TX = Q\Lambda Q^T, \qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$$

The eigenvectors $q_1, \ldots, q_d$ are the principal components (directions of decreasing variance). The eigenvalues $\lambda_i$ are the principal variances. The projection onto the top-$k$ components is:

$$Z = XQ_k, \qquad Q_k = [q_1 \cdots q_k] \in \mathbb{R}^{d \times k}$$

Fraction of variance explained by the top-$k$ components: $\sum_{i=1}^k \lambda_i \big/ \sum_{i=1}^d \lambda_i$. A scree plot of $\lambda_i$ vs $i$ shows how quickly variance falls off - a sharp "elbow" indicates a natural low-rank structure.
Equivalently via SVD: $X = USV^T$; principal components = columns of $V$; principal variances = $s_i^2/n$.
For AI: PCA of word embedding matrices (e.g., GloVe, word2vec) reveals semantic axes - the top principal components correspond to major semantic dimensions (sentiment, formality, etc.). Activation PCA in transformer residual streams (Cunningham et al. 2023) identifies sparse, interpretable directions corresponding to concepts. PCA of weight matrices diagnoses training pathologies: a sudden drop in the effective rank of weight matrices signals representational collapse.
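A minimal PCA-from-scratch sketch (the same computation appears as an exercise in exercises.ipynb): eigendecompose the sample covariance, project, and cross-check against the SVD route. The synthetic data is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 10, 3
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))   # correlated data
X = X - X.mean(axis=0)                     # centre columns

cov = X.T @ X / n                          # sample covariance (1/n convention)
lam, Q = np.linalg.eigh(cov)               # eigh returns ascending eigenvalues
order = np.argsort(lam)[::-1]              # sort descending
lam, Q = lam[order], Q[:, order]

Z = X @ Q[:, :k]                           # projection onto top-k principal components
print(lam[:k].sum() / lam.sum())           # fraction of variance explained

# Cross-check with SVD: principal variances are s_i^2 / n
_, s, _ = np.linalg.svd(X, full_matrices=False)
print(np.allclose(lam, s**2 / n))
```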
9.2 Eigenvalues and Gradient Descent
For a quadratic loss $L(\theta) = \frac12(\theta - \theta^*)^TH(\theta - \theta^*)$ (the local approximation near any critical point), gradient descent with step size $\eta$ gives:

$$\theta_{t+1} = \theta_t - \eta H(\theta_t - \theta^*)$$

The error $e_t = \theta_t - \theta^*$ satisfies $e_{t+1} = (I - \eta H)e_t$. In the eigenbasis of $H$ (writing $e_t = \sum_i c_i^{(t)}q_i$):

$$c_i^{(t)} = (1 - \eta\lambda_i)^t\,c_i^{(0)}$$

Each eigendirection decays independently. Convergence condition: $\lvert 1 - \eta\lambda_i\rvert < 1$ for all $i$, i.e., $0 < \eta < 2/\lambda_{\max}$.
Optimal step size: $\eta^* = \frac{2}{\lambda_{\max} + \lambda_{\min}}$, giving convergence rate $\frac{\kappa - 1}{\kappa + 1}$, where $\kappa = \lambda_{\max}/\lambda_{\min}$ is the condition number.
Ill-conditioning: Large $\kappa$ means slow convergence. Gradient descent zigzags: it overshoots in high-curvature directions (large $\lambda_i$) while barely moving in low-curvature ones (small $\lambda_i$). Preconditioning (multiply the gradient by $H^{-1}$ - Newton's method, or an approximation - Adam, K-FAC) aligns step sizes to eigenvalues and achieves $\kappa = 1$ in the ideal case.
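A minimal sketch of these decay rates: run gradient descent on a diagonal quadratic and check that each coordinate contracts by $\lvert 1 - \eta\lambda_i\rvert$ per step. The Hessian values are illustrative.

```python
import numpy as np

H = np.diag([10.0, 1.0, 0.1])              # Hessian with condition number 100
eta = 2.0 / (10.0 + 0.1)                   # optimal step size 2 / (l_max + l_min)

theta = np.array([1.0, 1.0, 1.0])
history = [theta.copy()]
for t in range(100):
    theta = theta - eta * (H @ theta)      # GD step on L(theta) = 0.5 theta^T H theta
    history.append(theta.copy())

history = np.array(history)
print(np.abs(history[50] / history[49]))   # empirical per-coordinate contraction
print(np.abs(1 - eta * np.diag(H)))        # predicted |1 - eta * lambda_i|
```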
9.3 Neural Network Hessian Spectrum
The Hessian of a neural network loss at a critical point has a universal spectral structure observed empirically across architectures and datasets:
TYPICAL NEURAL NETWORK HESSIAN EIGENSPECTRUM
========================================================================
Density
|
|####
|########
|########### - -
|################ - -
+------------------------------------------------ \lambda
0 small bulk large outliers
Bulk: \approx Marchenko-Pastur distribution (random matrix theory)
Outliers: O(1)-O(100) eigenvalues, well-separated from bulk
Ratio \kappa = \lambda_max / \lambda_min can be 10^6 or larger
========================================================================
- Bulk eigenvalues: Dense cluster near 0; approximately Marchenko-Pastur distributed; correspond to weakly-learned or unlearned features.
- Outlier eigenvalues: A small number of large eigenvalues (often on the order of the number of classes), well-separated from the bulk; correspond to the most-curved loss directions; dominate training dynamics.
- Evolution during training: Outliers grow during early training as the model fits data; bulk shifts toward zero; the outlier/bulk gap widens.
2026 tools: PyHessian (Yao et al. 2020) estimates the top eigenvalues and eigenvectors of the Hessian using randomised Lanczos; HessianFlow provides gradient-free eigenvalue density estimation. Both are practical for networks with millions to billions of parameters.
9.4 Attention Eigenstructure
The scaled dot-product attention weight matrix $A = \operatorname{softmax}(QK^T/\sqrt{d_k})$ is row-stochastic: all entries are non-negative and each row sums to 1. By the Perron-Frobenius theorem: the dominant eigenvalue is $\lambda_1 = 1$; all other eigenvalues satisfy $\lvert\lambda\rvert \le 1$; the eigenvector for $\lambda_1 = 1$ is the all-ones vector $\mathbf{1}$ (uniform across positions).
Softmax temperature effects on the spectrum:
- High temperature ($\tau \to \infty$): $A \to \frac{1}{n}\mathbf{1}\mathbf{1}^T$ (rank-1 uniform); eigenvalues $\{1, 0, \ldots, 0\}$. Completely diffuse attention.
- Low temperature ($\tau \to 0$): $A$ approaches a permutation-like matrix with one-hot rows; for a true permutation the eigenvalues are roots of unity on the unit circle. Sharp, deterministic attention.
- Intermediate: eigenvalues lie on a spectrum between these extremes; their distribution characterises head behaviour.
Mechanistic interpretability: Induction heads (heads that copy from the previous occurrence of a token) have attention matrices approximating a shifted identity - eigenvalues cluster near roots of unity. Name-mover heads have matrices with a few dominant eigenvalues corresponding to "this token attends to its referent." The eigenvalue structure is a fingerprint of the head's function.
9.5 Graph Laplacian and GNNs
For an undirected graph with $n$ nodes, adjacency matrix $A$ (symmetric, $A_{ij} = 1$ if edge $\{i, j\}$ exists), and degree matrix $D = \operatorname{diag}(d_1, \ldots, d_n)$ where $d_i = \sum_j A_{ij}$, the graph Laplacian is:

$$L = D - A$$

$L$ is symmetric PSD with eigenvalues $0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$.
Key eigenvalue facts:
- $\lambda_1 = 0$ always, with eigenvector $\mathbf{1}$ (constant on all nodes).
- The number of zero eigenvalues equals the number of connected components.
- $\lambda_2 > 0$ iff the graph is connected; $\lambda_2$ is the Fiedler value (algebraic connectivity): larger $\lambda_2$ means better connectivity and faster mixing.
Spectral clustering: The bottom-$k$ eigenvectors of $L$ embed each node into $\mathbb{R}^k$ via its row in $[q_1 \cdots q_k]$. Running $k$-means on these embeddings clusters the graph into $k$ groups - nodes in the same cluster are well-connected. This is spectral clustering.
Spectral GNNs: Graph convolution is defined spectrally as $g_\theta(L)x = Qg_\theta(\Lambda)Q^Tx$ - a function of $L$'s eigenvalues applied in the eigenbasis. Polynomial filters (ChebConv) approximate any spectral filter without computing the full eigendecomposition. This is the mathematical foundation of graph neural networks.
For AI (2026): Graph transformers use Laplacian eigenvectors as positional encodings - the eigenvectors encode the multi-scale geometric structure of the graph, serving as a canonical "position" for each node regardless of graph isomorphism. Standard in molecular property prediction (OGB benchmarks) and code analysis.
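A minimal sketch of the Fiedler-vector idea on a toy two-cluster graph: the sign pattern of the eigenvector for $\lambda_2$ recovers the two communities. Graph sizes and edge probabilities are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20                                       # two clusters of 10 nodes each
A = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        same = (i < 10) == (j < 10)
        p = 0.8 if same else 0.05            # dense within clusters, sparse across
        if rng.random() < p:
            A[i, j] = A[j, i] = 1.0

L = np.diag(A.sum(axis=1)) - A               # graph Laplacian
lam, Q = np.linalg.eigh(L)
print(lam[:3])                               # lambda_1 ~ 0, lambda_2 = Fiedler value

fiedler = Q[:, 1]                            # eigenvector for lambda_2
print((fiedler > 0).astype(int))             # sign pattern splits the two clusters
```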
9.6 Spectral Analysis of Weight Matrices
WeightWatcher (Martin & Mahoney, 2019-2026) analyses the eigenvalue (or singular value) spectra of weight matrices to predict model quality and generalisation - without any test data.
Heavy-tailed self-regularisation: Well-trained models develop weight matrices whose eigenvalue distributions follow a power law, $\rho(\lambda) \sim \lambda^{-\alpha}$, in the tail. This deviates from the Marchenko-Pastur distribution (expected for random Gaussian matrices) - the deviation indicates that the model has learned non-random structure.
Alpha metric: Fit a power law to the tail of the singular value distribution of each layer; $\alpha$ roughly in the range 2-4 indicates a well-trained layer; a much larger $\alpha$ or a purely MP distribution indicates an undertrained layer.
Stable rank: $\operatorname{srank}(W) = \|W\|_F^2 / \|W\|_2^2 = \sum_i \sigma_i^2 / \sigma_{\max}^2$. Measures effective dimensionality. Higher stable rank = more uniform singular value distribution = richer representations. Decreasing stable rank during fine-tuning signals representational collapse.
For AI (2026): WeightWatcher is deployed at scale to compare LLM checkpoints, guide layer-wise learning rate selection in fine-tuning (layers with weaker spectral quality metrics need more adaptation), and identify which layers to include in LoRA adaptation without any downstream evaluation.
9.7 Eigenvalues in Recurrent Networks
The linearised RNN update $h_{t+1} = Wh_t$ is a linear dynamical system. After $t$ steps from $h_0$:

$$h_t = W^th_0$$

The gradient of the loss with respect to $h_0$ involves the product $(W^T)^t$. By eigendecomposition, $W^t = V\Lambda^tV^{-1}$:
- $\rho(W) < 1$: gradients vanish exponentially; long-range dependencies are forgotten.
- $\rho(W) > 1$: gradients explode; training diverges without clipping.
- $\rho(W) = 1$: marginal stability; gradients preserved in magnitude, but Jordan blocks cause polynomial growth.
Orthogonal RNNs: Constrain $W$ to be orthogonal ($W^TW = I$) so that all eigenvalues lie on the unit circle and there are no non-trivial Jordan blocks. Used in unitary RNNs (uRNN, EURNN) for long-range sequence modelling.
LSTM/GRU gating: The forget gate $f_t$ dynamically controls the effective spectral radius of the recurrent dynamics: when $f_t = 1$, the effective eigenvalue is 1 (perfect memory); when $f_t = 0$, it is 0 (full reset). Gating allows the effective spectral radius to be context-dependent.
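A minimal simulation of the three regimes: scale a random recurrent matrix (whose spectral radius is only approximately 1) and watch the hidden-state norm decay, hold, or explode over 100 steps. Sizes and scales are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((50, 50)) / np.sqrt(50)      # spectral radius roughly 1

for scale in [0.8, 1.0, 1.2]:
    Ws = scale * W
    rho = np.max(np.abs(np.linalg.eigvals(Ws)))      # actual spectral radius
    h = rng.standard_normal(50)
    for _ in range(100):
        h = Ws @ h                                   # linearised RNN update
    print(f"rho = {rho:.2f}, ||h_100|| = {np.linalg.norm(h):.3e}")
```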
9.8 Diffusion Models and Score Functions
In a diffusion model, the forward process gradually adds Gaussian noise: $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. The conditional covariance of $x_t$ given $x_0$ is $(1 - \bar\alpha_t)I$.
Spectral perspective: Each eigenvalue $\lambda_i$ of the data covariance gets the isotropic noise level added (the marginal covariance becomes $\bar\alpha_t\Sigma + (1 - \bar\alpha_t)I$). High-variance data directions (large $\lambda_i$) survive longer before being drowned by noise; low-variance directions are masked quickly. The score function at noise level $t$ effectively operates on the "surviving" eigenspectrum.
Optimal noise scheduling: The EDM framework (Karras et al. 2022) uses a continuous noise level parameterisation that sweeps uniformly across the log-eigenvalue spectrum of the data covariance - ensuring the network receives a balanced training signal at each noise level rather than spending too long on high-noise or low-noise regimes.
Flow matching (2022-2026): Constructs a straight-line path between noise distribution and data distribution in the eigenbasis of the data covariance. Eigenvalue-aware interpolation reduces the number of function evaluations needed for accurate generation. Now the dominant paradigm for image and video generation at scale.
10. Generalised Eigenproblems
10.1 The Generalised Eigenvalue Problem
The standard eigenvalue problem $Av = \lambda v$ measures eigenvalues relative to the identity. The generalised eigenvalue problem measures them relative to a symmetric positive definite matrix $B$:

$$Av = \lambda Bv$$

For symmetric $A$ and SPD $B$: the generalised eigenvalues are all real; the generalised eigenvectors are $B$-orthogonal ($v_i^TBv_j = 0$ for $i \neq j$); the theory is analogous to the standard spectral theorem.
Reduction to standard form. Cholesky-decompose $B = LL^T$. Substituting $w = L^Tv$: $(L^{-1}AL^{-T})w = \lambda w$. The transformed matrix $L^{-1}AL^{-T}$ is symmetric, so the standard spectral theorem applies. This is the numerically preferred approach (implemented in scipy.linalg.eigh(A, B)).
Variational characterisation. The $k$-th generalised eigenvalue satisfies a Courant-Fischer-style max-min of the generalised Rayleigh quotient $\frac{x^TAx}{x^TBx}$ over $k$-dimensional subspaces - exactly as in Courant-Fischer.
10.2 AI Applications of Generalised Eigenproblem
Linear Discriminant Analysis (LDA). Find a projection $w$ maximising between-class variance relative to within-class variance: $\max_w \frac{w^TS_Bw}{w^TS_Ww}$, where $S_B$ is the between-class scatter matrix and $S_W$ is the within-class scatter. This is a generalised Rayleigh quotient; the optimal $w$ is the top generalised eigenvector solving $S_Bw = \lambda S_Ww$.
Natural gradient. The natural gradient update requires solving $F\delta = \nabla_\theta L$, where $F$ is the Fisher information matrix. This is equivalent to finding the update direction that minimises loss while accounting for the geometry of the parameter manifold. K-FAC (Kronecker-Factored Approximate Curvature) approximates $F$ as a Kronecker product and diagonalises it layer-wise, enabling efficient natural gradient updates for neural networks.
Canonical Correlation Analysis (CCA). Find directions in two feature spaces that are maximally correlated. Reduces to a generalised eigenproblem involving cross-covariance matrices. Used in multimodal representation learning (aligning image and text embeddings) and multi-view learning.
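A minimal sketch of a symmetric-definite generalised eigenproblem $Av = \lambda Bv$ with scipy.linalg.eigh(A, B), including the $B$-orthogonality check. The matrices are random and illustrative (think of $A$ as a between-class and $B$ as a within-class scatter in LDA).

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d)); A = (M + M.T) / 2              # symmetric
C = rng.standard_normal((d, d)); B = C @ C.T + d * np.eye(d)    # symmetric positive definite

lam, V = eigh(A, B)                    # generalised eigenvalues (ascending) and eigenvectors
print(lam)
print(np.allclose(A @ V, B @ V @ np.diag(lam)))   # A v_i = lambda_i B v_i
print(np.allclose(V.T @ B @ V, np.eye(d)))        # eigenvectors are B-orthonormal
```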
10.3 Spectral Graph Positional Encodings for Transformers
Standard transformers use sinusoidal positional encodings (fixed functions of position index). For graph-structured data (molecules, code, knowledge graphs), position is not a linear index - it is a graph-theoretic notion.
Solution: Use the bottom-$k$ eigenvectors of the graph Laplacian as positional encodings. Node $i$ gets encoding $(q_1(i), \ldots, q_k(i))$ (its row in the eigenvector matrix). These encodings: (1) are permutation-equivariant - relabelling nodes relabels the encoding consistently; (2) capture multi-scale graph structure - low eigenvectors encode global structure, high eigenvectors encode local; (3) are provably more expressive than message-passing GNNs for distinguishing graph structures.
2026 status: Laplacian eigenvector positional encodings are standard in graph transformers for molecular property prediction (PCQM4Mv2 leaderboard), protein structure analysis, and code understanding graphs.
11. Eigenvalues and Stability Analysis
11.1 Discrete Dynamical Systems
The system $x_{k+1} = Ax_k$ with initial condition $x_0$ has solution (for diagonalisable $A$):

$$x_k = A^kx_0 = \sum_i c_i\lambda_i^kv_i$$

where the $c_i$ are the initial projections onto each mode ($x_0 = \sum_i c_iv_i$).
Stability classification:
| Condition | Behaviour | AI example |
|---|---|---|
| $\rho(A) < 1$ | $x_k \to 0$ (asymptotically stable) | GD on strongly convex loss |
| $\rho(A) = 1$, Jordan blocks of size 1 | bounded (marginally stable) | GD at critical step size |
| $\rho(A) = 1$, Jordan block of size $\ge 2$ | grows polynomially | RNN at spectral radius = 1 with Jordan block |
| $\rho(A) > 1$ | grows exponentially (unstable) | Exploding gradients, diverging GD |
11.2 Continuous Dynamical Systems and Neural ODEs
The system $\dot x = Ax$ has solution $x(t) = e^{tA}x(0)$.
Stability: Asymptotically stable iff $\operatorname{Re}(\lambda_i) < 0$ for all $i$; marginally stable iff $\operatorname{Re}(\lambda_i) \le 0$ with the purely imaginary eigenvalues having Jordan blocks of size 1; unstable if any $\operatorname{Re}(\lambda_i) > 0$.
Neural ODEs (Chen et al. 2018) parameterise $\dot h = f_\theta(h, t)$ and use an ODE solver for the forward pass. Stability of the learned dynamics is controlled by the Jacobian eigenvalues $\lambda_i(\partial f_\theta/\partial h)$. Contractive neural ODEs enforce $\operatorname{Re}(\lambda_i) < 0$ everywhere, guaranteeing convergence to a fixed point - used for stable generative models and physics-informed networks.
11.3 Spectral Normalisation
Spectral normalisation (Miyato et al. 2018) divides each weight matrix by its spectral norm (largest singular value) at every update step:

$$\hat W = \frac{W}{\sigma_{\max}(W)}$$

This enforces a Lipschitz constant of 1 for each linear layer. Applying it to all layers (with 1-Lipschitz activations) makes the entire network 1-Lipschitz: $\|f(x) - f(y)\| \le \|x - y\|$.
Efficient implementation: Estimate $\sigma_{\max}(W)$ via one step of power iteration per training step - cost $O(mn)$ for an $m \times n$ matrix. Keep running estimates $u$, $v$ and update: $v \leftarrow W^Tu/\|W^Tu\|$, $u \leftarrow Wv/\|Wv\|$; then $\sigma_{\max} \approx u^TWv$.
Used in: GAN discriminators (stable training; eliminates mode collapse); normalising flows (invertibility guarantee); safe RL (bounding policy Lipschitz constant); diffusion model discriminators (2026 standard).
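A minimal sketch of the running power-iteration estimate described above: persistent vectors $u$, $v$ are updated once per "training step" and the weight is divided by the estimated spectral norm. The matrix and step count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
u = rng.standard_normal(64); u /= np.linalg.norm(u)

for step in range(50):                       # one power-iteration update per training step
    v = W.T @ u; v /= np.linalg.norm(v)
    u = W @ v;  u /= np.linalg.norm(u)

sigma = u @ W @ v                            # running estimate of the spectral norm
W_sn = W / sigma                             # spectrally normalised weight
print(sigma, np.linalg.norm(W, 2))           # estimate vs exact largest singular value
print(np.linalg.norm(W_sn, 2))               # ~1 after normalisation
```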
11.4 Gradient Flow and Spectral Bias
The gradient flow (continuous-time limit of gradient descent) satisfies $\dot{\theta}(t) = -\nabla L(\theta(t))$. Near a quadratic minimum $\theta^*$ with Hessian $H$, each eigendirection decays independently: writing $\theta(t) - \theta^* = \sum_i a_i(t)\, v_i$ in the eigenbasis of $H$ gives $\dot{a}_i = -\lambda_i a_i$, so each component $a_i(t) = a_i(0)\, e^{-\lambda_i t}$.
Spectral bias (frequency principle): In the infinite-width Neural Tangent Kernel (NTK) regime, the gradient flow dynamics are governed by the NTK matrix $\Theta$; the error along NTK eigendirection $i$ decays as $e^{-\lambda_i t}$. Large NTK eigenvalues correspond to low-frequency target-function components - these are learned first and fastest. Small eigenvalues correspond to high-frequency components - learned slowly or not at all. This explains the empirical observation that neural networks learn coarse structure before fine details, and are biased toward smooth functions regardless of architecture.
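A tiny illustration of the quadratic case: gradient descent on $L(\theta) = \tfrac{1}{2}\theta^\top H \theta$ with a step size small enough to approximate the flow. The Hessian eigenvalues and the initial point are arbitrary illustrative choices; the point is that large-eigenvalue modes vanish first, matching the $e^{-\lambda_i t}$ prediction:

```python
import numpy as np

# Quadratic loss L(theta) = 0.5 * theta^T H theta with a diagonal Hessian,
# so the eigendirections of H are simply the coordinate axes (values illustrative)
lambdas = np.array([10.0, 1.0, 0.1])      # eigenvalues of H
theta = np.ones(3)                        # equal initial weight on every mode
eta = 0.01                                # small step size ~ continuous-time flow

steps = 200
for _ in range(steps):
    theta = theta - eta * lambdas * theta # gradient step: grad = H theta

t = eta * steps                           # elapsed "flow time"
print("empirical :", np.round(theta, 4))
print("predicted :", np.round(np.exp(-lambdas * t), 4))   # e^{-lambda_i t} decay
```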
12. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Using $\det(A - \lambda I)$ vs $\det(\lambda I - A)$ and getting sign errors in the characteristic polynomial | Both define the same eigenvalues, but the polynomials differ by a factor of $(-1)^n$; using the wrong one gives wrong coefficients for non-eigenvalue computations | Use $\det(\lambda I - A)$ consistently (monic convention); the eigenvalues are roots of either |
| 2 | Concluding an $n \times n$ matrix is diagonalisable just because it has $n$ eigenvalues (counted with multiplicity) | Repeated eigenvalues can have fewer than $n$ independent eigenvectors (defective case) | Check GM = AM for each repeated eigenvalue; compute $\dim N(A - \lambda I)$ to verify |
| 3 | Applying eigendecomposition to non-square matrices | The equation $Av = \lambda v$ requires a square $A$ (same domain and codomain) | Use SVD for rectangular matrices; eigenvalues are defined only for square matrices |
| 4 | Forgetting that real matrices can have complex eigenvalues | A real $2 \times 2$ matrix with negative discriminant ($\operatorname{tr}(A)^2 - 4\det(A) < 0$) has a complex-conjugate eigenvalue pair; these are still valid eigenvalues | Always check the discriminant; complex eigenvalues are meaningful (e.g., rotation matrices) |
| 5 | Claiming the eigenvalues of $A^\top$ differ from those of $A$ | $A$ and $A^\top$ always have identical eigenvalues (same characteristic polynomial) but generally different eigenvectors | The eigenvectors of $A^\top$ are the left eigenvectors of $A$; the eigenvalues are the same |
| 6 | Confusing algebraic and geometric multiplicity | AM counts polynomial roots; GM counts independent eigenvectors; they can differ for defective matrices | Always compute GM via $\dim N(A - \lambda I)$; don't assume AM = GM |
| 7 | Thinking eigenvectors are unique | Eigenvectors define a direction, not a specific vector; any non-zero scalar multiple is also an eigenvector | Speak of eigenspaces (subspaces), not single eigenvectors; normalise when uniqueness is needed |
| 8 | Applying the spectral theorem to non-symmetric matrices | Only symmetric (or Hermitian) matrices are guaranteed real eigenvalues and orthogonal eigenvectors | Check $A = A^\top$ first; for general $A$, use the Schur decomposition or Jordan form |
| 9 | Setting the gradient descent step size from the wrong matrix's eigenvalues | Convergence requires $\eta < 2/\lambda_{\max}(H)$ where $H$ is the Hessian, not a weight matrix | Compute the Hessian (or its spectral norm); do not confuse weight-matrix eigenvalues with Hessian eigenvalues |
| 10 | Concluding that $\rho(A)$ is small because $\operatorname{tr}(A)$ is small | Trace = sum of eigenvalues; a small trace is compatible with some large and some negative eigenvalues | Compute $\rho(A) = \max_i \lvert\lambda_i\rvert$ directly; the trace alone does not bound it |
13. Exercises
Exercise 1 * - Eigenpair Verification and $2 \times 2$ Computation
(a) Verify that the given vector $v$ is an eigenvector of the given $2 \times 2$ matrix $A$ and find the corresponding eigenvalue. Check using the trace and determinant.
(b) Compute both eigenvalues and eigenvectors of $A$ using the characteristic polynomial. Verify $\lambda_1 + \lambda_2 = \operatorname{tr}(A)$ and $\lambda_1 \lambda_2 = \det(A)$.
(c) Is $A$ diagonalisable? Construct $P$ and $D$. Verify $A = PDP^{-1}$ numerically.
Exercise 2 * - Characteristic Polynomial and Eigenspaces
(a) Compute the characteristic polynomial of the given matrix.
(b) Find all eigenvalues and their algebraic multiplicities.
(c) For each eigenvalue, compute a basis for the eigenspace. Identify defective eigenvalues.
(d) What are the trace and the determinant? Verify that the trace equals $\sum_i \lambda_i$ and the determinant equals $\prod_i \lambda_i$.
Exercise 3 * - Diagonalisation
Let $C$ be the given matrix.
(a) Find its eigenvalues and eigenvectors. Construct $P$ and $D$.
(b) Compute $P^{-1}$ and verify $C = PDP^{-1}$.
(c) Use diagonalisation to compute $C^{10} = P D^{10} P^{-1}$ efficiently. Compare against np.linalg.matrix_power(C, 10).
(d) Compute $e^{C} = P e^{D} P^{-1}$. Compare against scipy.linalg.expm(C).
Exercise 4 ** - Spectral Theorem
Let $A$ be the given symmetric matrix.
(a) Verify that $A$ is symmetric and find its eigenvalues. Are they real and non-negative?
(b) Find orthonormal eigenvectors. Construct the orthogonal matrix $Q$ and verify $Q^\top Q = I$.
(c) Verify the spectral decomposition: $A = Q \Lambda Q^\top = \sum_i \lambda_i q_i q_i^\top$.
(d) Is $A$ PSD? PD? Compute the condition number $\kappa(A) = \lambda_{\max}/\lambda_{\min}$.
Exercise 5 ** - Matrix Powers and Recurrence Relations
Consider a Fibonacci-like recurrence $a_{k+2} = \alpha\, a_{k+1} + \beta\, a_k$ with given coefficients and initial values $a_0$, $a_1$.
(a) Write the recurrence as $x_{k+1} = A x_k$ where $x_k = (a_{k+1}, a_k)^\top$ and $A$ is the companion matrix.
(b) Find the eigenvalues and eigenvectors of $A$. Diagonalise $A = PDP^{-1}$.
(c) Use $A^k = P D^k P^{-1}$ to derive a closed-form formula for $a_k$.
(d) Verify the closed form against direct iteration numerically for the first several values of $k$.
Exercise 6 ** - Rayleigh Quotient and PSD
Let $A$ be the given symmetric matrix.
(a) Compute the eigenvalues of $A$ using np.linalg.eigvalsh. Verify that $A$ is PD.
(b) Compute the Rayleigh quotient $R(x) = \frac{x^\top A x}{x^\top x}$ for the standard basis vectors and for a random unit vector. Verify $\lambda_{\min}(A) \le R(x) \le \lambda_{\max}(A)$.
(c) Find the direction that maximises $R(x)$ - confirm it is the top eigenvector.
(d) What is the condition number $\kappa = \lambda_{\max}/\lambda_{\min}$? How does it affect the convergence rate of gradient descent on a quadratic loss with Hessian $A$?
Exercise 7 *** - PCA from Scratch
Using the provided dataset $X \in \mathbb{R}^{n \times d}$:
(a) Centre the data to obtain $X_c$. Compute the sample covariance matrix $\Sigma = \frac{1}{n-1} X_c^\top X_c$.
(b) Compute the eigendecomposition of $\Sigma$. Sort the eigenvalues in descending order.
(c) Project the data onto the top 2 principal components. Plot the 2D embedding.
(d) Compute the fraction of variance explained by 1, 2, 3 components. How many components explain 95% of the variance?
(e) Verify that the SVD of $X_c$ gives the same principal components as the eigendecomposition of $\Sigma$. Explain the connection: principal components = right singular vectors $v_i$; principal variances = $\sigma_i^2 / (n-1)$.
Exercise 8 *** - Power Iteration and PageRank
(a) Implement power iteration for the dominant eigenpair of a random symmetric positive definite matrix $A$. Track the error $\lvert \lambda^{(k)} - \lambda_1 \rvert$ against the iteration number $k$.
(b) Verify that the convergence rate is approximately $\lvert \lambda_2 / \lambda_1 \rvert$ per iteration (compare the theoretical ratio to the empirical one).
(c) Build a small web graph (6 nodes) with a link matrix. Construct the row-stochastic PageRank transition matrix $G$.
(d) Run power iteration on $G^\top$ to find the PageRank vector (the dominant left eigenvector of $G$). Compare to np.linalg.eig(G.T).
(e) For attention: Construct a row-stochastic attention weight matrix (softmax of random logits, applied row-wise). Compute its eigenvalues. Verify the dominant eigenvalue is 1. Plot the eigenvalue spectrum.
14. Why This Matters for AI (2026 Perspective)
| Concept | Where It Appears in AI/LLMs | Concrete Method |
|---|---|---|
| Eigendecomposition of covariance | Dimensionality reduction, embedding analysis | PCA, Kernel PCA |
| Spectral radius | RNN stability, gradient flow control | Orthogonal RNNs, spectral normalisation |
| Hessian eigenspectrum | Loss landscape analysis, learning rate selection | PyHessian, K-FAC, SAM |
| Perron-Frobenius eigenvalue = 1 | Attention matrix structure, Markov chain mixing | Attention head analysis, mechanistic interpretability |
| Low-rank eigenspace adaptation | Parameter-efficient fine-tuning | LoRA, DoRA, GaLore |
| Graph Laplacian eigenvectors | Graph transformer positional encodings | Lim et al. 2023, GRIT |
| SVD / singular values | Weight compression, stable rank, generalisation | WeightWatcher, model pruning |
| QR algorithm | All eigenvalue computations in numpy/scipy | np.linalg.eigh, scipy.linalg.eig |
| Power iteration | Dominant eigenvalue estimation, PageRank, spectral norm | Spectral normalisation (Miyato 2018) |
| Condition number | Gradient descent convergence rate | Preconditioning, Adam, AdaGrad |
| Rayleigh quotient | PCA objective, curvature measurement | PCA, Fisher LDA, natural gradient |
| Heavy-tailed eigenvalue distribution | Model quality without test data | WeightWatcher alpha metric |
| NTK eigenvalues | Which features are learned first (spectral bias) | NTK theory, frequency principle |
15. Conceptual Bridge
Where This Section Sits
Eigenvalues and eigenvectors are the payoff for everything in Chapters 1 and 2. Vector spaces (Section 02-06) established the abstract setting - subspaces, span, dimension. Matrix rank (Section 02-05) measured the effective dimensionality of a linear map. Determinants (Section 02-04) provided the characteristic polynomial via $\det(A - \lambda I) = 0$. Row reduction (Section 02-03) is the tool for finding null spaces, hence eigenvectors. All of this machinery was built so that we could ask and answer the deepest question about a linear map: what are its natural axes, and how does it scale along them?
What This Section Unlocks
Eigenvalues open three major directions forward in this curriculum. First, Singular Value Decomposition (Section 03-02) is the direct generalisation: the eigenvalues of $A^\top A$ are the squared singular values; the eigenvectors of $A^\top A$ and $A A^\top$ are the right and left singular vectors. SVD extends eigendecomposition to rectangular matrices and gives the four fundamental subspaces with optimal orthonormal bases. Second, PCA (Section 03-03) is eigendecomposition of the covariance matrix in practice - the full pipeline from raw data to compressed, interpretable representations. Third, Positive Definite Matrices (Section 03-07) uses the spectral theorem to characterise the geometry of quadratic forms - every optimisation problem in Chapter 8 involves a positive definite Hessian.
The Larger Arc
Stepping back further: eigenvalues connect linear algebra to analysis, probability, and AI in ways that become clearer as the curriculum progresses. The Hessian eigenspectrum (Chapter 8, Optimization) controls whether gradient descent converges, how fast, and to what kind of minimum. The graph Laplacian eigenvalues (Chapter 11, Graph Theory) determine the topology of learned graph representations. The eigenvalues of the Fisher information matrix (Chapter 13, ML-Specific Math) define the natural gradient, the theoretically optimal update direction. The NTK eigenvalues (Chapter 14, Transformers) determine which target function components are learned first. Every time you see a covariance matrix, a Gram matrix, a weight matrix, or a Hessian - you are dealing with a system whose behaviour is ultimately governed by eigenvalues.
POSITION IN CURRICULUM
========================================================================
02-06 Vector Spaces 02-05 Matrix Rank
(subspaces, basis) (null space, column space)
| |
+--------------+---------------+
v
03-01 EIGENVALUES & EIGENVECTORS <-- YOU ARE HERE
(characteristic polynomial,
diagonalisation, spectral theorem,
Jordan form, SVD connection)
|
+--------------+-----------------------+
v v v
03-02 SVD 03-03 PCA 03-07 Positive
(singular (covariance Definite
values, eigenstructure, Matrices
Eckart-Young) dimension reduction) (Cholesky,
| | curvature)
+--------------+------------------------+
v
Ch. 08 Optimization * Ch. 11 Graph Theory
Ch. 13 ML-Specific Math * Ch. 14 Transformers
========================================================================
<- Back to Advanced Linear Algebra | Next: Singular Value Decomposition ->