"It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts - and in linear algebra, the theory of stable computation is entirely about facts: actual rounding errors, actual condition numbers, actual flop counts."
Overview
Numerical linear algebra is the study of algorithms for solving the core computational problems of linear algebra - linear systems, least squares, eigenvalue problems, and matrix factorizations - in the presence of floating-point arithmetic. The central challenge: mathematical equivalence does not imply numerical equivalence. The Gaussian elimination formula is mathematically exact but computationally disastrous without pivoting; the Normal equations are correct but the QR approach is more numerically stable.
This section covers the algorithms that power deep learning, scientific computing, and data analysis: LU with partial pivoting, QR via Householder reflectors, iterative solvers for large sparse systems, the conjugate gradient method, and the numerical sensitivity of eigenvalue problems. Every algorithm is analyzed through the lens of backward stability - the gold standard for numerical linear algebra.
AI connections: Every gradient computation requires a matrix-vector product; attention mechanisms compute via batched matrix-matrix multiplications; training large language models involves solving implicit linear systems via iterative optimization. The choice of algorithm and its numerical properties directly affect convergence, stability, and generalization.
Prerequisites
- Section01 Floating-Point Arithmetic - machine epsilon, condition numbers, backward stability
- Section02-Linear-Algebra-Basics/03-Systems-of-Equations - Gaussian elimination, existence, uniqueness
- Section03-Advanced-Linear-Algebra/08-Matrix-Decompositions - LU, QR, Cholesky theory
- Section03-Advanced-Linear-Algebra/01-Eigenvalues-and-Eigenvectors - eigenvalue theory
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | LU factorization, QR, iterative solvers, conjugate gradient, condition estimation |
| exercises.ipynb | 10 graded exercises on stability, pivoting, iterative methods, and AI applications |
Learning Objectives
After completing this section, you will:
- Explain why LU without pivoting is numerically unstable and how partial pivoting fixes it
- Implement Gaussian elimination with partial pivoting from scratch
- Explain why QR via Householder reflectors is backward stable while Normal equations are not
- Apply the backward error framework: relate forward error to backward error via condition number
- Implement and analyze the Conjugate Gradient method for symmetric positive definite systems
- Understand iterative refinement and its role in achieving double-precision accuracy
- Connect condition number to loss of digits: $\kappa(A) \approx 10^k$ means roughly $k$ digits are lost
- Relate numerical linear algebra algorithms to the batched matrix operations in deep learning
Table of Contents
- 1. Intuition
- 2. Forward and Backward Error Analysis
- 3. Gaussian Elimination and LU Factorization
- 4. QR Factorization: The Numerically Stable Approach
- 5. Cholesky Factorization
- 6. Iterative Methods
- 7. Eigenvalue Computation
- 8. Applications in Machine Learning
- 9. Common Mistakes
- 10. Exercises
- 11. Why This Matters for AI (2026 Perspective)
- 12. Conceptual Bridge
1. Intuition
1.1 Why Numerical Linear Algebra?
Consider solving a linear system $Ax = b$. You know the answer: $x = A^{-1}b$. But computing $A^{-1}$ explicitly costs roughly three times as many flops as a factor-and-solve and is numerically inferior to solving $Ax = b$ directly. More surprisingly, two mathematically equivalent formulations can have dramatically different numerical properties:
Normal equations approach: Multiply both sides of $Ax \approx b$ by $A^T$: solve $A^T A x = A^T b$.
QR approach: Factor $A = QR$ where $Q$ is orthogonal and $R$ is upper triangular; solve $Rx = Q^T b$.
Both are mathematically equivalent when $A$ has full column rank, but their condition numbers differ: $\kappa(A^T A) = \kappa(A)^2$, while the QR approach works with $\kappa(A)$.
If $A$ has condition number $\kappa(A) \approx 10^8$ (common in practice), then $A^T A$ has condition number $\approx 10^{16}$ - beyond fp64 precision. QR factorization preserves $\kappa(A)$.
For AI: When training neural networks, you are implicitly solving linear systems at every step. The choice of optimizer (SGD vs Adam vs second-order methods), the use of batch normalization, and the initialization strategy all reflect numerical linear algebra considerations, even if they are rarely stated as such.
1.2 Backward Stability: The Right Standard
A naive view of algorithm correctness: compute $\hat{y} \approx f(x)$ with small forward error $\|\hat{y} - f(x)\|$.
The more powerful concept is backward stability: an algorithm $\hat{f}$ is backward stable if for every input $x$, the computed output is the exact output for a slightly perturbed input: $\hat{f}(x) = f(x + \Delta x)$ with $\|\Delta x\|/\|x\| = O(\varepsilon_{\text{mach}})$.
A backward stable algorithm gives the exact answer to a slightly wrong problem. This is the right standard because:
- It achieves the best possible accuracy: if the problem itself is uncertain at level $\varepsilon_{\text{mach}}$, a backward stable algorithm is optimally accurate
- It separates algorithm from problem: the condition number tells you how much forward error backward stability implies
- It is composable: backward stable algorithms composed together give overall backward stability (approximately)
The fundamental theorem then states: forward error $\lesssim$ condition number $\times$ backward error.
1.3 Historical Timeline
NUMERICAL LINEAR ALGEBRA: HISTORICAL TIMELINE
========================================================================
1947 von Neumann & Goldstine - first systematic error analysis of
Gaussian elimination; estimated stability incorrectly (pessimistic)
1950 Lanczos - Lanczos iteration for symmetric eigenvalue problems;
     precursor to modern Krylov subspace methods
1951 Turing - defined matrix condition number $\kappa(A)$;
     connected it to loss of significant digits
1958 Householder - reflector-based QR factorization; proved backward
     stable; superior to Gram-Schmidt for loss of orthogonality
1961 Wilkinson - proved Gaussian elimination with partial pivoting
     is backward stable in practice; introduced the error analysis
     methodology that became the foundation of the field
1965 Golub & Kahan - bidiagonalization for SVD computation;
     established the SVD as numerically accessible
1965 Wilkinson - "The Algebraic Eigenvalue Problem"; definitive
     reference on eigenvalue computation stability
1976 Paige & Saunders - MINRES and SYMMLQ; stable iterative methods
for symmetric systems
1986 Saad & Schultz - GMRES algorithm; now standard for non-symmetric
iterative solution
1994 LAPACK released - standardized implementation of all major
backward-stable dense linear algebra routines
2008 Buttari et al. - mixed-precision iterative refinement;
fp16 factorization + fp64 refinement
2022+ - cuBLAS, cuSOLVER using Tensor Cores (tf32/bf16) for
matrix operations in LLM training and inference
========================================================================
2. Forward and Backward Error Analysis
2.1 The Error Analysis Framework
Let $f$ be the mathematical function we want to compute, and let $\hat{f}$ be the floating-point algorithm that approximates it. For input $x$, we define:
Absolute forward error: $\|\hat{f}(x) - f(x)\|$
Relative forward error: $\|\hat{f}(x) - f(x)\| / \|f(x)\|$
Absolute backward error: the smallest $\|\Delta x\|$ such that $\hat{f}(x) = f(x + \Delta x)$
Relative backward error: $\|\Delta x\| / \|x\|$
Algorithm $\hat{f}$ is:
- Accurate if the relative forward error is $O(\varepsilon_{\text{mach}})$ - it computes the answer to full precision
- Backward stable if the relative backward error is $O(\varepsilon_{\text{mach}})$ - it solves a nearby problem exactly
- Forward stable if the relative forward error is $O(\kappa\,\varepsilon_{\text{mach}})$ - optimal up to the condition number
Example: inner product
Naive inner product: compute $s = x^T y = \sum_{i=1}^n x_i y_i$ by accumulating term by term. The computed result satisfies: $\mathrm{fl}(x^T y) = (x + \Delta x)^T y$ with $|\Delta x_i| \le n\,\varepsilon_{\text{mach}}\,|x_i| + O(\varepsilon_{\text{mach}}^2)$.
So the naive inner product is backward stable (relative backward error $O(n\,\varepsilon_{\text{mach}})$).
But the condition number of the inner product is: $\kappa = \dfrac{\sum_i |x_i y_i|}{\left|\sum_i x_i y_i\right|}$,
which can be huge when $|x^T y| \ll \sum_i |x_i y_i|$ (catastrophic cancellation). The forward error can then be large, but the algorithm itself is not to blame - the problem is ill-conditioned.
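A minimal NumPy illustration of this point (the specific vectors are made up for the demo): the dot product below is computed backward stably, yet roughly half of its digits are wrong because the inner product itself is ill-conditioned.

```python
import numpy as np

# Inner product that nearly cancels: sum of |x_i y_i| is ~2e8 times larger
# than |x^T y|, so the problem (not the algorithm) is ill-conditioned.
x = np.array([1.0, 1.0])
y = np.array([1.0 + 1e-8, -1.0])

exact = 1e-8                        # x^T y worked out symbolically
computed = np.dot(x, y)             # backward stable evaluation
kappa = np.sum(np.abs(x * y)) / abs(np.sum(x * y))

print(f"computed = {computed:.3e}, exact = {exact:.3e}")
print(f"relative forward error = {abs(computed - exact) / abs(exact):.2e}")
print(f"condition number of the inner product = {kappa:.2e}")
```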
2.2 The Fundamental Theorem
Theorem (Forward-Backward Error): For a backward stable algorithm with relative backward error $\eta$: relative forward error $\lesssim \kappa(f, x)\,\eta$,
where $\kappa(f, x)$ is the condition number of $f$ at $x$.
Interpretation: The forward error is at most the condition number times the backward error. A backward stable algorithm with $\eta = O(\varepsilon_{\text{mach}})$ gives forward error $O(\kappa\,\varepsilon_{\text{mach}})$, which is the best possible - no algorithm can be more accurate in general.
For AI: This theorem tells us when to worry:
- $\kappa \approx 10^4$ with fp64: fine, we still have roughly 12 accurate digits
- $\kappa \approx 10^{16}$ with fp64: we lose all accuracy; the problem is inherently ill-conditioned
- $\kappa \approx 10^6$ with fp32: dangerous, only about one digit survives; use fp64 or regularization
2.3 Perturbation Theory for Linear Systems
For the linear system $Ax = b$, with perturbed system $A(x + \Delta x) = b + \Delta b$:
Bound on forward error: $\dfrac{\|\Delta x\|}{\|x\|} \le \kappa(A)\,\dfrac{\|\Delta b\|}{\|b\|}$
Proof sketch: From $A\,\Delta x = \Delta b$: $\Delta x = A^{-1}\Delta b$, so $\|\Delta x\| \le \|A^{-1}\|\,\|\Delta b\|$.
Taking norms and using $\|b\| \le \|A\|\,\|x\|$: $\dfrac{\|\Delta x\|}{\|x\|} \le \|A^{-1}\|\,\|A\|\,\dfrac{\|\Delta b\|}{\|b\|} = \kappa(A)\,\dfrac{\|\Delta b\|}{\|b\|}$.
Key consequences:
- The condition number $\kappa(A) = \|A\|\,\|A^{-1}\|$ sets how much input perturbations are amplified in the output
- For backward stable solvers: $\|\Delta A\|/\|A\| = O(\varepsilon_{\text{mach}})$ and $\|\Delta b\|/\|b\| = O(\varepsilon_{\text{mach}})$
- Therefore the forward error is $\|\Delta x\|/\|x\| = O(\kappa(A)\,\varepsilon_{\text{mach}})$
- Rule of thumb: if $\kappa(A) \approx 10^k$, we lose about $k$ significant digits
Condition number interpretation:
| $\kappa(A)$ | Digits lost (fp64, ~15 digits) | Digits lost (fp32, ~7 digits) |
|---|---|---|
| $10^0$ | 0 digits | 0 digits |
| $10^3$ | 3 digits | 3 digits |
| $10^7$ | 7 digits | all digits |
| $10^{10}$ | 10 digits | impossible |
| $10^{16}$ | all digits | impossible |
3. Gaussian Elimination and LU Factorization
Scope note: This section covers the numerical stability of LU factorization - pivoting, backward error analysis, and growth factors. The mathematical theory of LU (existence, uniqueness, relationship to determinants) is in Section03-Advanced-Linear-Algebra/08-Matrix-Decompositions.
3.1 LU without Pivoting: Why It Fails
LU factorization computes $A = LU$ where $L$ is lower triangular with 1s on the diagonal and $U$ is upper triangular. The elimination multipliers are: $l_{ik} = a_{ik}^{(k)} / a_{kk}^{(k)}$,
where $a_{kk}^{(k)}$ is the pivot. If the pivot is zero, the algorithm fails. If it is small, large multipliers amplify rounding errors.
Example: Near-zero pivot disaster. Take $A = \begin{pmatrix} \varepsilon & 1 \\ 1 & 1 \end{pmatrix}$ with a tiny $\varepsilon > 0$.
Multiplier: $l_{21} = 1/\varepsilon$ (very large). After elimination: $U = \begin{pmatrix} \varepsilon & 1 \\ 0 & 1 - 1/\varepsilon \end{pmatrix}$.
For $\varepsilon = 10^{-20}$ in fp64: $\mathrm{fl}(1 - 1/\varepsilon) = -1/\varepsilon$ (catastrophic cancellation in the $(2,2)$ entry), so the computed factors no longer reproduce $A$ and the solution is badly wrong.
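A small sketch of the failure, carrying out the $2 \times 2$ elimination above by hand (the right-hand side is made up for the demo):

```python
import numpy as np

eps = 1e-20
A = np.array([[eps, 1.0],
              [1.0, 1.0]])
b = np.array([1.0, 2.0])
x_true = np.linalg.solve(A, b)          # LAPACK solve uses partial pivoting

# LU without pivoting, written out explicitly for this 2x2 matrix
l21 = A[1, 0] / A[0, 0]                 # multiplier 1/eps = 1e20
u22 = A[1, 1] - l21 * A[0, 1]           # fl(1 - 1e20) = -1e20: the "1" is lost
y2 = b[1] - l21 * b[0]
x2 = y2 / u22
x1 = (b[0] - A[0, 1] * x2) / A[0, 0]

print("with pivoting   :", x_true)       # approx [1, 1]
print("without pivoting:", np.array([x1, x2]))   # first component badly wrong
```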
Growth factor: The growth factor $\rho = \max_{i,j,k} |a_{ij}^{(k)}| / \max_{i,j} |a_{ij}|$ measures how much the entries grow during elimination. Without pivoting, $\rho$ can be arbitrarily large.
3.2 Partial Pivoting: The Fix
Partial pivoting chooses the pivot as the largest element in magnitude in the current column:
At step $k$: find $i^* = \arg\max_{i \ge k} |a_{ik}|$ and swap rows $k$ and $i^*$.
Algorithm: Gaussian Elimination with Partial Pivoting (GEPP)
for k = 1, 2, ..., n-1:
find i* = argmax_{i >= k} |a_{ik}|
swap rows k and i* (record in permutation P)
for i = k+1, ..., n:
l_{ik} = a_{ik} / a_{kk}
for j = k+1, ..., n:
a_{ij} -= l_{ik} * a_{kj}
This gives the factorization $PA = LU$, where $P$ is the permutation matrix of row swaps.
Solving $Ax = b$ via LU with pivoting:
- Factor: $PA = LU$ (GEPP above)
- Apply the permutation: $c = Pb$
- Forward solve: $Lz = c$ (triangular solve, $O(n^2)$)
- Back solve: $Ux = z$ (triangular solve, $O(n^2)$)
Total cost: $\frac{2}{3}n^3 + O(n^2)$ flops.
Growth factor with partial pivoting: The entries of $L$ satisfy $|l_{ik}| \le 1$ (because we choose the largest pivot). The growth factor satisfies: $\rho \le 2^{n-1}$.
This bound is exponential but almost never achieved in practice. Wilkinson (1963) observed that for typical matrices $\rho$ stays small - empirically around $O(n^{2/3})$ for random matrices - making GEPP very stable in practice.
3.3 Backward Stability of LU with Partial Pivoting
Theorem (Wilkinson, 1961): Gaussian elimination with partial pivoting applied to $A$ computes factors $\hat{L}$, $\hat{U}$ satisfying: $\hat{L}\hat{U} = \hat{P}(A + \Delta A)$ with $\|\Delta A\|_\infty \le c\,n^2\,\rho\,\varepsilon_{\text{mach}}\,\|A\|_\infty$,
where $\rho$ is the growth factor. If $\rho$ is small (as is almost always the case), GEPP is effectively backward stable.
Consequence: The computed solution $\hat{x}$ to $Ax = b$ satisfies: $(A + \Delta A)\hat{x} = b$ with $\|\Delta A\| = O(n^2\,\rho\,\varepsilon_{\text{mach}}\,\|A\|)$.
And the forward error: $\dfrac{\|\hat{x} - x\|}{\|x\|} = O(\kappa(A)\,n^2\,\rho\,\varepsilon_{\text{mach}})$.
3.4 Banded and Sparse Systems
For banded matrices with bandwidth $b$ (nonzero entries only within $b$ diagonals of the main diagonal):
- Storage: $O(nb)$ instead of $O(n^2)$
- LU factorization cost: $O(nb^2)$ instead of $O(n^3)$
- Fill-in: LU of a banded matrix remains banded (bandwidth $O(b)$ in $L$ and $U$)
For AI: The attention matrix in transformers is $n \times n$ but often has local structure. Efficient attention algorithms exploit this: sliding window attention (bandwidth $w$), strided attention, and Longformer all use banded or sparse attention patterns that reduce cost from $O(n^2)$ to $O(nw)$.
Sparse direct solvers: For general sparse $A$, the key challenge is fill-in - LU introduces nonzero entries where $A$ had zeros. Fill-in is minimized by reordering rows/columns (e.g., minimum degree ordering, nested dissection). Modern sparse solvers (PARDISO, MUMPS, SuperLU) use sophisticated reordering strategies.
4. QR Factorization: The Numerically Stable Approach
Scope note: QR factorization theory (existence, uniqueness, Gram-Schmidt, Givens rotations) is in Section03-Advanced-Linear-Algebra/08-Matrix-Decompositions. This section focuses on numerical stability.
4.1 Why Normal Equations Fail
For the least squares problem $\min_x \|Ax - b\|_2$ with $A \in \mathbb{R}^{m \times n}$, $m \ge n$:
Normal equations: $A^T A\,x = A^T b$
This is mathematically correct but numerically problematic:
- $\kappa(A^T A) = \kappa(A)^2$: the condition number is squared
- If $\kappa(A) = 10^8$, then $\kappa(A^T A) = 10^{16}$, which at the edge of fp64 precision means complete loss of accuracy
- Even forming $A^T A$ introduces rounding errors that destroy the small singular values of $A$
Example: Consider the Läuchli matrix $A = \begin{pmatrix} 1 & 1 \\ \varepsilon & 0 \\ 0 & \varepsilon \end{pmatrix}$ with $\varepsilon = 10^{-8}$.
Then $A^T A = \begin{pmatrix} 1 + \varepsilon^2 & 1 \\ 1 & 1 + \varepsilon^2 \end{pmatrix}$. In floating point, $\mathrm{fl}(1 + \varepsilon^2) = 1$ (since $\varepsilon^2 = 10^{-16} < \varepsilon_{\text{mach}}$), so the computed $A^T A$ is exactly $\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$, which is singular. The original $A$ has $\kappa(A) \approx 10^8$ - perfectly solvable in fp64 - but the Normal equations make it numerically singular.
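A short NumPy check of this example (the right-hand side is chosen so the exact least squares solution is $[1, 1]$; this is a demo, not a benchmark):

```python
import numpy as np

# Lauchli matrix: kappa(A) ~ 1e8 is fine in fp64, but A^T A is numerically singular.
eps = 1e-8
A = np.array([[1.0, 1.0],
              [eps, 0.0],
              [0.0, eps]])
b = np.array([2.0, eps, eps])             # exact least squares solution is [1, 1]

print("kappa(A)    =", np.linalg.cond(A))
print("kappa(A^TA) =", np.linalg.cond(A.T @ A))   # ~1e16 or reported as infinite

x_stable = np.linalg.lstsq(A, b, rcond=None)[0]   # orthogonal-factorization based
try:
    x_ne = np.linalg.solve(A.T @ A, A.T @ b)      # normal equations
    print("normal equations:", x_ne)
except np.linalg.LinAlgError:
    print("normal equations: A^T A is numerically singular")
print("lstsq           :", x_stable)
```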
4.2 Householder Reflectors
A Householder reflector (or Householder transformation) is the orthogonal matrix: $H = I - 2\,\dfrac{vv^T}{v^T v}$,
which reflects vectors through the hyperplane perpendicular to $v$. Geometrically, $Hx$ is the reflection of $x$ across the hyperplane $\{z : v^T z = 0\}$.
Key property: By choosing $v = x + \mathrm{sign}(x_1)\,\|x\|_2\,e_1$ (where $e_1$ is the first standard basis vector), we get: $Hx = -\mathrm{sign}(x_1)\,\|x\|_2\,e_1$.
This zeros out all but the first component of $x$.
QR via Householder: Apply Householder reflectors $H_1, \ldots, H_n$ to successively zero out the subdiagonal of each column: $H_n \cdots H_2 H_1 A = R$, so $A = QR$ with $Q = H_1 H_2 \cdots H_n$,
where $Q$ is orthogonal (a product of orthogonal matrices) and $R$ is upper triangular.
Cost: $2mn^2 - \frac{2}{3}n^3$ flops for $A \in \mathbb{R}^{m \times n}$.
Numerical stability: Each Householder reflector is exactly orthogonal, so multiplying by $H$ does not amplify rounding errors. This makes the algorithm backward stable.
Implementation detail: In practice, Householder QR is implemented implicitly - we never form $Q$ explicitly. Instead, we store the vectors $v_k$ and compute products such as $Q^T b$ by applying the reflectors one after another (cost: $O(m)$ per reflector on a vector, $O(mn)$ total).
4.3 Gram-Schmidt and Modified Gram-Schmidt
The classical Gram-Schmidt process orthogonalizes the columns of $A$ one by one:
Classical Gram-Schmidt (CGS):
q_1 = a_1 / ||a_1||
for k = 2, ..., n:
r_{ik} = q_i^T a_k for i = 1, ..., k-1
q_k_unnorm = a_k - sum_{i=1}^{k-1} r_{ik} q_i
r_{kk} = ||q_k_unnorm||
q_k = q_k_unnorm / r_{kk}
Problem: CGS is numerically unstable. The orthogonality of the computed $q_i$ is lost due to cancellation. After computing all $n$ vectors, $Q^T Q$ may be far from the identity.
Modified Gram-Schmidt (MGS): Reorders the computation to project out each $q_i$ as soon as it is computed:
for k = 1, ..., n:
q_k = a_k
for i = 1, ..., k-1:
r_{ik} = q_i^T q_k
q_k -= r_{ik} * q_i # Project q_k immediately
r_{kk} = ||q_k||
q_k = q_k / r_{kk}
MGS is algebraically identical to CGS but numerically superior: the loss of orthogonality in MGS satisfies $\|Q^T Q - I\| = O(\kappa(A)\,\varepsilon_{\text{mach}})$ vs $O(\kappa(A)^2\,\varepsilon_{\text{mach}})$ for CGS.
However, MGS is still worse than Householder QR ($O(\kappa(A)\,\varepsilon_{\text{mach}})$ vs $O(\varepsilon_{\text{mach}})$ loss of orthogonality). For high-accuracy least squares, use Householder QR; a small comparison of all three follows the list below.
When to use each:
- Householder QR: Direct least squares, high accuracy required, dense matrices
- MGS: Online/streaming orthogonalization, Lanczos/Arnoldi iteration, where the matrix is generated column by column
- Normal equations: Large-scale problems where $\kappa(A)$ is small; forming and solving the $n \times n$ system is much faster than QR when $m \gg n$
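A minimal comparison of the three orthogonalization routes on an ill-conditioned matrix (the matrix dimensions and singular values below are arbitrary choices for the demo):

```python
import numpy as np

def cgs(A):
    """Classical Gram-Schmidt QR (unstable)."""
    m, n = A.shape
    Q = np.zeros((m, n)); R = np.zeros((n, n))
    for k in range(n):
        R[:k, k] = Q[:, :k].T @ A[:, k]         # all projections use the original column
        v = A[:, k] - Q[:, :k] @ R[:k, k]
        R[k, k] = np.linalg.norm(v)
        Q[:, k] = v / R[k, k]
    return Q, R

def mgs(A):
    """Modified Gram-Schmidt QR (better)."""
    m, n = A.shape
    Q = A.astype(float).copy(); R = np.zeros((n, n))
    for k in range(n):
        R[k, k] = np.linalg.norm(Q[:, k])
        Q[:, k] /= R[k, k]
        for j in range(k + 1, n):
            R[k, j] = Q[:, k] @ Q[:, j]          # project against the updated column
            Q[:, j] -= R[k, j] * Q[:, k]
    return Q, R

# Ill-conditioned test matrix with prescribed singular values 1, 1e-2, ..., 1e-12
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((50, 7)))
V, _ = np.linalg.qr(rng.standard_normal((7, 7)))
A = U @ np.diag(10.0 ** -np.arange(0, 14, 2)) @ V.T

for name, Q in [("CGS", cgs(A)[0]), ("MGS", mgs(A)[0]),
                ("Householder", np.linalg.qr(A)[0])]:
    print(f"{name:12s} ||Q^T Q - I|| = {np.linalg.norm(Q.T @ Q - np.eye(7)):.2e}")
```

The printed orthogonality loss typically grows like $\kappa^2\varepsilon$ for CGS, $\kappa\varepsilon$ for MGS, and stays at $O(\varepsilon)$ for Householder.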
4.4 Backward Stability of Householder QR
Theorem: Householder QR applied to $A$ computes factors $\hat{Q}$, $\hat{R}$ such that: $\hat{Q}\hat{R} = A + \Delta A$ with $\|\Delta A\| = O(\varepsilon_{\text{mach}}\,\|A\|)$.
The computed solution $\hat{x}$ of the least squares problem $\min_x \|Ax - b\|_2$ satisfies: $\dfrac{\|\hat{x} - x\|}{\|x\|} = O(\kappa(A)\,\varepsilon_{\text{mach}})$ (plus a $\kappa(A)^2$ term weighted by the relative residual).
Comparison with Normal equations: The Normal equations approach gives: $\dfrac{\|\hat{x} - x\|}{\|x\|} = O(\kappa(A)^2\,\varepsilon_{\text{mach}})$.
For $\kappa(A) = 10^4$ in fp64 (~15 digits): Householder gives about 11 correct digits; Normal equations give only about 7.
5. Cholesky Factorization
Scope note: Cholesky factorization theory (existence for SPD matrices, relationship to LU) is in Section03-Advanced-Linear-Algebra/08-Matrix-Decompositions. This section covers numerical aspects.
5.1 Positive Definiteness and Stability
For symmetric positive definite (SPD) matrices ($A = A^T$ and $x^T A x > 0$ for all $x \neq 0$):
- Cholesky factorization $A = LL^T$ exists and is unique
- No pivoting needed: the diagonal entries of $L$ are always positive
- Half the cost of LU: $\frac{1}{3}n^3$ flops vs $\frac{2}{3}n^3$
- Perfect stability: the growth factor is exactly 1 (the intermediate entries cannot exceed those of $A$ in magnitude)
Theorem: Cholesky is backward stable: the computed factor $\hat{L}$ satisfies $\hat{L}\hat{L}^T = A + \Delta A$ with $\|\Delta A\| = O(n\,\varepsilon_{\text{mach}}\,\|A\|)$.
5.2 Numerical Positive Definiteness
In practice, matrices that are theoretically SPD may be only nearly SPD due to:
- Accumulated rounding errors in forming $A$ (e.g., covariance matrices $X^T X$)
- Near-zero eigenvalues that become slightly negative in fp64
Strategies for numerical SPD:
- Diagonal loading: Replace $A$ with $A + \lambda I$ where $\lambda$ is small enough to preserve the physics but large enough to ensure numerical SPD. This is also called Tikhonov regularization.
- Modified Cholesky: If a non-positive pivot appears during factorization, add a small correction to the diagonal.
- Eigenvalue floor: Clip small eigenvalues, $\lambda_i \leftarrow \max(\lambda_i, \delta)$, and reassemble $A = V\Lambda V^T$ (see the sketch below).
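A small sketch of diagonal loading, assuming a matrix whose tiny negative eigenvalue stands in for the rounding errors described above (the eigenvalue $-10^{-12}$ and the loading value are made-up illustration choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# A symmetric matrix that is "SPD on paper" but carries a slightly negative
# eigenvalue of the size rounding errors typically produce.
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
eigs = np.ones(50)
eigs[-1] = -1e-12
A = (Q * eigs) @ Q.T
A = (A + A.T) / 2                        # enforce exact symmetry

try:
    np.linalg.cholesky(A)
    print("plain Cholesky succeeded")
except np.linalg.LinAlgError:
    print("plain Cholesky failed: matrix is not numerically SPD")

# Diagonal loading: lam covers the negative eigenvalue but is tiny
# relative to the rest of the spectrum.
lam = 1e-8
L = np.linalg.cholesky(A + lam * np.eye(50))
print("Cholesky after diagonal loading succeeded")
```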
For AI:
- Adam optimizer maintains a running estimate of the squared gradient (a Hessian-diagonal proxy). This must stay positive, so Adam adds a small $\epsilon$ to $\sqrt{\hat{v}_t}$ in the update formula.
- Second-order methods (K-FAC, Shampoo) factor approximate Hessians via Cholesky. Numerical indefiniteness is handled by damping: $A \leftarrow A + \lambda I$.
- Batch covariance matrices in normalization layers must be SPD; the $\epsilon$ in `BatchNorm(eps=1e-5)` serves this role.
6. Iterative Methods
Direct methods (LU, QR, Cholesky) require $O(n^3)$ work and $O(n^2)$ storage - feasible for $n$ up to roughly $10^4$ but impractical for $n \gtrsim 10^6$ (e.g., discretized PDEs, large graph problems). Iterative methods approximate $x = A^{-1}b$ without forming or factoring $A$, requiring only matrix-vector products $v \mapsto Av$.
6.1 Jacobi and Gauss-Seidel
The simplest iterative schemes split $A = D + L + U$ (diagonal, strictly lower, strictly upper triangular):
Jacobi iteration: $x^{(k+1)} = D^{-1}\left(b - (L + U)\,x^{(k)}\right)$
Gauss-Seidel: Use updated values immediately: $x^{(k+1)} = (D + L)^{-1}\left(b - U\,x^{(k)}\right)$
Convergence: Both converge if and only if the spectral radius of the iteration matrix is below 1, e.g. $\rho\!\left(D^{-1}(L + U)\right) < 1$ for Jacobi.
For strictly diagonally dominant matrices ($|a_{ii}| > \sum_{j \neq i}|a_{ij}|$), both converge. Gauss-Seidel typically converges faster (by roughly a factor of 2 in iteration count for SPD systems).
For AI: Jacobi is embarrassingly parallel - all updates are independent. Gauss-Seidel requires sequential updates but has better convergence. In distributed deep learning, Jacobi-like parallel gradient updates correspond to asynchronous SGD with stale gradients.
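A minimal runnable Jacobi iteration, following the splitting above (the test matrix is made strictly diagonally dominant by construction so convergence is guaranteed):

```python
import numpy as np

def jacobi(A, b, x0=None, tol=1e-10, max_iter=500):
    """Jacobi iteration x <- D^{-1} (b - (L + U) x)."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.copy()
    d = np.diag(A)                      # diagonal of A
    R = A - np.diag(d)                  # off-diagonal part L + U
    for k in range(max_iter):
        x_new = (b - R @ x) / d         # every component updated independently
        if np.linalg.norm(x_new - x) < tol * np.linalg.norm(x_new):
            return x_new, k + 1
        x = x_new
    return x, max_iter

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
A += np.diag(np.abs(A).sum(axis=1) + 1.0)   # make A strictly diagonally dominant
b = rng.standard_normal(100)

x, iters = jacobi(A, b)
print("iterations:", iters, " residual:", np.linalg.norm(A @ x - b))
```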
6.2 Conjugate Gradient Method
For SPD , the Conjugate Gradient (CG) method is the optimal Krylov subspace method:
Algorithm (Hestenes-Stiefel, 1952):
r_0 = b - A x_0, p_0 = r_0
for k = 0, 1, 2, ...:
alpha_k = (r_k^T r_k) / (p_k^T A p_k)
x_{k+1} = x_k + alpha_k p_k
r_{k+1} = r_k - alpha_k A p_k
beta_k = (r_{k+1}^T r_{k+1}) / (r_k^T r_k)
p_{k+1} = r_{k+1} + beta_k p_k
Key properties:
- Each step requires one matrix-vector product $Ap_k$
- The iterate $x_k$ minimizes the energy-norm error $\|x - x_*\|_A$ over the Krylov subspace $x_0 + \mathcal{K}_k(A, r_0)$
- Converges in at most $n$ steps (in exact arithmetic)
- In practice, converges much earlier for well-clustered eigenvalues
Convergence rate: The error satisfies: $\|x_k - x_*\|_A \le 2\left(\dfrac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^k \|x_0 - x_*\|_A$,
where $\kappa = \kappa_2(A) = \lambda_{\max}/\lambda_{\min}$.
For $\kappa = 100$: the per-step factor is $\approx 0.82$, so roughly 7 steps reduce the error by a factor of 4.
For $\kappa = 10^4$: the factor is $\approx 0.98$, so roughly 70 steps are needed for the same reduction.
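A direct NumPy translation of the CG pseudocode above, tested on a random SPD system (the matrix sizes and tolerance are arbitrary demo choices):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    """Plain CG for SPD A, following the Hestenes-Stiefel recurrence above."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    max_iter = max_iter or 10 * n
    for k in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            return x, k + 1
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iter

rng = np.random.default_rng(0)
B = rng.standard_normal((500, 500))
A = B @ B.T + 10 * np.eye(500)          # SPD with moderate condition number
b = rng.standard_normal(500)
x, iters = conjugate_gradient(A, b)
print("CG iterations:", iters, " residual:", np.linalg.norm(A @ x - b))
```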
For AI: Conjugate gradient is used in:
- Second-order optimization (natural gradient, K-FAC): solving linear systems involving the Fisher information matrix
- Computing Hessian-vector products implicitly (Pearlmutter's trick)
- Solving graph Laplacian systems in GNNs
- Implicit-differentiation and deep equilibrium layers, where the backward pass requires solving a linear system (often with CG)
6.3 Preconditioning
CG convergence depends on $\kappa(A)$. Preconditioning transforms the system to improve conditioning:
Preconditioned CG: Instead of solving $Ax = b$, solve $M^{-1}Ax = M^{-1}b$, where $M$ is a preconditioner - an easy-to-invert approximation of $A$.
The convergence rate is then governed by $\kappa(M^{-1}A)$ instead of $\kappa(A)$.
Common preconditioners:
| Preconditioner | Construction | Application | Quality |
|---|---|---|---|
| Diagonal (Jacobi) | $M = \mathrm{diag}(A)$ | $O(n)$ division | Removes diagonal scaling |
| Incomplete Cholesky (IC) | Sparse Cholesky with limited fill-in | $O(\mathrm{nnz})$ triangular solve | Good for SPD PDE problems |
| Sparse Approximate Inverse | Minimize $\|AM - I\|_F$ over a sparsity pattern | $O(\mathrm{nnz})$ matvec | Highly parallel |
| AMG (Algebraic Multigrid) | Hierarchical coarsening | $O(n)$ V-cycle | Near-optimal for elliptic PDEs |
For AI: Diagonal preconditioning is essentially what Adam does: the update divides each coordinate of the gradient by $\sqrt{\hat{v}_t} + \epsilon$, where $\hat{v}_t$ is the running second-moment estimate. Adam's denominator pre-scales gradients by an approximation of the Hessian diagonal - this is diagonal preconditioning applied to the implicit linear system of gradient descent.
6.4 GMRES for Non-Symmetric Systems
For non-symmetric $A$, CG does not apply. GMRES (Generalized Minimal Residual, Saad & Schultz, 1986) minimizes the residual $\|b - Ax_k\|_2$ over the Krylov subspace $x_0 + \mathcal{K}_k(A, r_0)$:
Algorithm sketch:
- Build an orthonormal basis $V_k$ for $\mathcal{K}_k(A, r_0)$ via Arnoldi iteration (modified Gram-Schmidt applied to the vectors $Av_j$)
- Solve the small $(k+1) \times k$ least squares problem: $\min_y \|\beta e_1 - \bar{H}_k y\|_2$, then set $x_k = x_0 + V_k y$
Key properties:
- The residual is monotonically non-increasing, and full GMRES terminates in at most $n$ steps (unlike Jacobi/Gauss-Seidel, which may diverge)
- Memory grows as $O(kn)$ - restart every $m$ steps: GMRES(m)
- Optimal over the Krylov subspace at each step
For AI: GMRES is used in physics-informed neural networks (PINNs) and neural ODEs where the implicit time-stepping requires solving non-symmetric linear systems at each step.
7. Eigenvalue Computation
Scope note: Eigenvalue theory (characteristic polynomial, diagonalizability, spectral theorem) is in Section03-Advanced-Linear-Algebra/01-Eigenvalues-and-Eigenvectors. This section covers numerical eigenvalue computation.
7.1 Power Iteration
Power iteration finds the dominant eigenvalue (largest in magnitude):
x_0 = random vector
for k = 0, 1, 2, ...:
y_{k+1} = A x_k
x_{k+1} = y_{k+1} / ||y_{k+1}||
lambda_k = x_k^T A x_k (Rayleigh quotient)
Convergence: $x_k$ converges to the dominant eigenvector at rate $|\lambda_2/\lambda_1|^k$. Slow if $|\lambda_2| \approx |\lambda_1|$.
Inverse power iteration: Apply power iteration to $A^{-1}$ (solve $Ay = x_k$ at each step). Converges to the eigenvalue closest to 0 (smallest in magnitude). With a shift $\sigma$: apply it to $(A - \sigma I)^{-1}$ to find the eigenvalue closest to $\sigma$.
For AI: Power iteration is the foundation of the PageRank algorithm. It is also used in computing the leading singular values of weight matrices (for spectral normalization in GANs) and in the Lanczos algorithm for computing the extremal eigenvalues of the Hessian in loss landscape analysis.
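A runnable version of the power-iteration pseudocode above (the test matrix and tolerance are arbitrary demo choices; the dominant eigenvalue is the one largest in magnitude, which for a symmetric matrix may be negative):

```python
import numpy as np

def power_iteration(A, tol=1e-10, max_iter=10_000, seed=0):
    """Power iteration for the dominant (largest-magnitude) eigenpair of A."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(A.shape[0])
    x /= np.linalg.norm(x)
    lam = 0.0
    for k in range(max_iter):
        y = A @ x
        x_new = y / np.linalg.norm(y)
        lam_new = x_new @ A @ x_new           # Rayleigh quotient
        if abs(lam_new - lam) < tol * abs(lam_new):
            return lam_new, x_new, k + 1
        x, lam = x_new, lam_new
    return lam, x, max_iter

rng = np.random.default_rng(1)
M = rng.standard_normal((200, 200))
A = (M + M.T) / 2                             # symmetric test matrix
lam, v, iters = power_iteration(A)
print("power iteration:", lam, "in", iters, "iterations")
print("extreme eigenvalues (eigvalsh):", np.linalg.eigvalsh(A)[[0, -1]])
```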
7.2 QR Algorithm
The QR algorithm (Francis, 1961) computes all eigenvalues of a matrix:
Basic QR iteration:
A_0 = A
for k = 0, 1, 2, ...:
A_k = Q_k R_k (QR factorization)
A_{k+1} = R_k Q_k (Note: reversed order)
The sequence $A_k$ converges to a triangular (or quasi-triangular for real matrices) form, with the diagonal entries converging to the eigenvalues.
Practical improvements:
- Hessenberg reduction: First reduce $A$ to upper Hessenberg form $H$ (so $h_{ij} = 0$ for $i > j + 1$). QR iteration applied to $H$ costs $O(n^2)$ per step instead of $O(n^3)$.
- Shifts: Use shifts to accelerate convergence: factor $A_k - \sigma_k I = Q_k R_k$, then $A_{k+1} = R_k Q_k + \sigma_k I$. The Wilkinson shift is standard.
- Deflation: When a subdiagonal entry becomes negligibly small, the problem decouples into two smaller problems.
Convergence: With shifts, the QR algorithm typically converges cubically - finding all eigenvalues in $O(n^3)$ total flops.
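An educational sketch of the basic (unshifted) QR iteration, assuming a small symmetric test matrix so the iterates converge to a diagonal matrix of eigenvalues; production codes add Hessenberg reduction, shifts, and deflation as described above:

```python
import numpy as np

def qr_algorithm(A, tol=1e-12, max_iter=5000):
    """Unshifted QR iteration (educational sketch, O(n^3) per step).
    Convergence may be slow if two eigenvalues have nearly equal magnitude."""
    Ak = A.astype(float).copy()
    for _ in range(max_iter):
        Q, R = np.linalg.qr(Ak)
        Ak = R @ Q                                  # similarity transform Q^T Ak Q
        if np.linalg.norm(np.tril(Ak, -1)) < tol * np.linalg.norm(Ak):
            break
    return np.sort(np.diag(Ak))

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = (M + M.T) / 2
print("QR algorithm  :", qr_algorithm(A))
print("numpy eigvalsh:", np.sort(np.linalg.eigvalsh(A)))
```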
7.3 Sensitivity of Eigenvalues
Theorem (Bauer-Fike): For a diagonalizable matrix $A = V\Lambda V^{-1}$ and perturbation $E$, every eigenvalue $\hat{\lambda}$ of $A + E$ satisfies: $\min_i |\hat{\lambda} - \lambda_i| \le \kappa(V)\,\|E\|_2$.
The condition number $\kappa(V)$ of the eigenvector matrix determines eigenvalue sensitivity.
For symmetric matrices: $V$ is orthogonal, so $\kappa_2(V) = 1$ - eigenvalues of symmetric matrices are perfectly conditioned with respect to symmetric perturbations (Weyl's theorem).
For non-symmetric matrices: Ill-conditioned eigenvectors lead to ill-conditioned eigenvalues. Example: Jordan blocks are not diagonalizable at all, and near-Jordan matrices have exponentially large $\kappa(V)$.
Practical implication: Computing eigenvalues of general non-symmetric matrices requires care. For symmetric matrices (e.g., Gram matrices, covariance matrices, graph Laplacians), eigenvalue computation is well-conditioned.
8. Applications in Machine Learning
8.1 Solving the Normal Equations in Ridge Regression
Ridge regression: $\min_w \|Xw - y\|_2^2 + \lambda\|w\|_2^2$
Normal equations: $(X^T X + \lambda I)\,w = X^T y$
Adding $\lambda I$ is Tikhonov regularization - it improves conditioning: $\kappa(X^T X + \lambda I) = \dfrac{\sigma_{\max}^2 + \lambda}{\sigma_{\min}^2 + \lambda} \ll \kappa(X^T X)$.
For the "kernel trick" (wide matrices with more features than samples, $n \gg m$): use the equivalent formulation $w = X^T\alpha$ where $(XX^T + \lambda I)\alpha = y$. The system is $m \times m$ instead of $n \times n$.
Numerically stable implementation: Do NOT form $X^T X$. Instead:
- Factor the augmented matrix $\begin{pmatrix} X \\ \sqrt{\lambda}\,I \end{pmatrix} = QR$ (Householder, cost $O((m + n)n^2)$)
- Solve $Rw = Q^T \begin{pmatrix} y \\ 0 \end{pmatrix}$ - a small, well-conditioned triangular system
Or use the SVD: $X = U\Sigma V^T$, then $w = V\,\mathrm{diag}\!\left(\dfrac{\sigma_i}{\sigma_i^2 + \lambda}\right) U^T y$.
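A sketch comparing the SVD route with the normal equations on an ill-conditioned design matrix (sizes, singular values, and the noise level are made-up demo parameters):

```python
import numpy as np

def ridge_svd(X, y, lam):
    """Ridge via the SVD: w = V diag(s_i / (s_i^2 + lam)) U^T y."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ ((s / (s**2 + lam)) * (U.T @ y))

def ridge_normal(X, y, lam):
    """Ridge via the normal equations (squares the condition number)."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((200, 30)))
V, _ = np.linalg.qr(rng.standard_normal((30, 30)))
X = U @ np.diag(np.logspace(0, -10, 30)) @ V.T       # kappa(X) ~ 1e10
w_true = rng.standard_normal(30)
y = X @ w_true + 1e-6 * rng.standard_normal(200)

lam = 1e-10
w_svd, w_ne = ridge_svd(X, y, lam), ridge_normal(X, y, lam)
print("||w_svd - w_ne|| =", np.linalg.norm(w_svd - w_ne))
print("residual (svd)   =", np.linalg.norm(X @ w_svd - y))
print("residual (ne)    =", np.linalg.norm(X @ w_ne - y))
```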
8.2 Batched Matrix Operations in Transformers
The attention mechanism computes: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V$
In practice, this is computed over multiple heads and batch elements simultaneously - batched matrix multiplication (BGEMM):
- Tensors have shape (batch, heads, sequence length, head dim)
- $QK^T$: batched GEMM, cost $O(B \cdot h \cdot n^2 \cdot d)$, executed as one kernel call on GPU
Numerical considerations:
- The $1/\sqrt{d_k}$ scaling is numerically motivated (keeps the scores near unit scale for a well-behaved softmax)
- FlashAttention (Dao et al., 2022) avoids materializing the $n \times n$ attention matrix by using a tiled algorithm that keeps blocks in SRAM
- FlashAttention uses online softmax with iterative rescaling - directly applying the backward stability ideas from numerical linear algebra (a toy version is sketched below)
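A toy version of the online-softmax rescaling, assuming arbitrary block size and score magnitudes (this sketch covers only the softmax normalization; real FlashAttention also rescales the accumulated output block):

```python
import numpy as np

def softmax_online(scores, block=128):
    """Block-streaming softmax over the last axis, carrying a running max
    and running sum of exponentials, in the spirit of FlashAttention."""
    m = np.full(scores.shape[0], -np.inf)     # running max per row
    l = np.zeros(scores.shape[0])             # running sum of exponentials per row
    for j in range(0, scores.shape[1], block):
        s = scores[:, j:j + block]
        m_new = np.maximum(m, s.max(axis=1))
        l = l * np.exp(m - m_new) + np.exp(s - m_new[:, None]).sum(axis=1)
        m = m_new
    # Final pass producing the normalized probabilities
    return np.exp(scores - m[:, None]) / l[:, None]

rng = np.random.default_rng(0)
S = 30 * rng.standard_normal((4, 1000))       # large scores: naive exp() would overflow fp32
P_online = softmax_online(S)
P_ref = np.exp(S - S.max(axis=1, keepdims=True))
P_ref /= P_ref.sum(axis=1, keepdims=True)
print("max abs difference vs. reference:", np.abs(P_online - P_ref).max())
```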
8.3 Iterative Refinement in Neural Solvers
Iterative refinement is a classical technique to improve an approximate solution $\hat{x}$ of $Ax = b$:
- Compute the residual $r = b - A\hat{x}$ (in higher precision if available)
- Solve the correction equation $Ad = r$ approximately (using the existing factorization)
- Update: $\hat{x} \leftarrow \hat{x} + d$
- Repeat until convergence
Mixed-precision iterative refinement: Factor $A$ in fp16 (fast), compute residuals in fp64, and perform a few refinement steps. This achieves fp64 accuracy at near-fp16 cost. NVIDIA's cuSOLVER implements this for dense systems.
Neural solvers (e.g., solving PDEs) often use this pattern:
- Neural network predicts approximate solution
- Iterative refinement corrects it using the true equations
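A minimal sketch of the mixed-precision pattern using fp32 factors as a stand-in for an fp16 Tensor Core factorization (matrix size and step count are arbitrary demo choices):

```python
import numpy as np
import scipy.linalg as sla

rng = np.random.default_rng(0)
n = 300
A = rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
b = A @ x_true

# Factor once in single precision
lu, piv = sla.lu_factor(A.astype(np.float32))

x = sla.lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
for step in range(5):
    r = b - A @ x                                          # residual in fp64
    d = sla.lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
    x += d
    err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
    print(f"step {step + 1}: relative error = {err:.2e}")
```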
8.4 Condition Numbers in Gradient Descent
The condition number of the Hessian $H = \nabla^2 f$ at a minimum determines gradient descent convergence:
For a quadratic $f(x) = \frac{1}{2}x^T A x - b^T x$ with SPD $A$:
- Gradient descent with step $\eta$: $x_{k+1} = x_k - \eta\,(Ax_k - b)$
- Optimal fixed step: $\eta^* = \dfrac{2}{\lambda_{\min} + \lambda_{\max}}$
- Convergence rate: $\|x_k - x_*\| \le \left(\dfrac{\kappa - 1}{\kappa + 1}\right)^k \|x_0 - x_*\|$
For $\kappa = 100$: convergence factor $\approx 0.98$ per step - roughly 230 steps are needed for a $100\times$ error reduction.
For $\kappa = 1$: convergence in 1 step (perfect conditioning).
Adam addresses conditioning: By scaling each coordinate by $1/(\sqrt{\hat{v}_t} + \epsilon)$ (an approximate inverse of the Hessian diagonal), Adam pushes the effective condition number closer to 1, enabling faster convergence on ill-conditioned objectives.
Gradient clipping and conditioning: When $\kappa(H)$ is large, gradient norms can vary wildly across directions. Gradient clipping (by global norm or per-parameter) is a heuristic to prevent large steps along ill-conditioned directions.
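A short experiment confirming the $(\kappa - 1)/(\kappa + 1)$ rate on the quadratic above (eigenvalue spread, tolerance, and dimensions are arbitrary demo choices, and the problem is posed in the eigenbasis so $A$ is diagonal):

```python
import numpy as np

def gd_steps_to_tol(kappa, tol=1e-6, max_iter=100_000):
    """Gradient descent on f(x) = 1/2 x^T A x with eigenvalues in [1, kappa]."""
    eigs = np.linspace(1.0, kappa, 50)
    A = np.diag(eigs)                           # work in the eigenbasis
    eta = 2.0 / (eigs.min() + eigs.max())       # optimal fixed step size
    x = np.ones(50)
    x0_norm = np.linalg.norm(x)
    for k in range(1, max_iter + 1):
        x = x - eta * (A @ x)                   # gradient of 1/2 x^T A x is A x
        if np.linalg.norm(x) < tol * x0_norm:
            return k
    return max_iter

for kappa in [10, 100, 1000]:
    rate = (kappa - 1) / (kappa + 1)
    predicted = int(np.ceil(np.log(1e-6) / np.log(rate)))
    print(f"kappa = {kappa:5d}: steps = {gd_steps_to_tol(kappa):6d}, predicted ~ {predicted}")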
9. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Using np.linalg.inv(A) @ b to solve $Ax = b$ | Forms $A^{-1}$ explicitly - roughly 3x the flops of a factor-and-solve, numerically inferior | Use np.linalg.solve(A, b) |
| 2 | Forming $A^T A$ for least squares | Squares the condition number: $\kappa(A^T A) = \kappa(A)^2$ | Use np.linalg.lstsq(A, b), which uses a stable orthogonal factorization (SVD-based in NumPy) |
| 3 | Assuming Gaussian elimination without pivoting is stable | Without pivoting, tiny pivots amplify rounding errors exponentially | Always use GEPP (scipy.linalg.lu with permutation) |
| 4 | Using np.linalg.cond(A) as a cheap operation | cond computes an SVD internally - $O(n^3)$ cost | Use it for debugging; avoid in inner loops |
| 5 | Solving many right-hand sides by calling solve repeatedly | Each call re-factors $A$ at $O(n^3)$ cost | Use lu_factor + lu_solve to reuse the factorization |
| 6 | Ignoring sparsity | Dense operations on sparse matrices waste memory and computation | Use scipy.sparse matrices and sparse solvers |
| 7 | Expecting CG to converge in few iterations without a preconditioner | The CG contraction factor is $\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$ - slow for ill-conditioned systems | Add a diagonal or IC preconditioner |
| 8 | Using Classical Gram-Schmidt for QR | Loss of orthogonality is $O(\kappa(A)^2\,\varepsilon_{\text{mach}})$ | Use Modified Gram-Schmidt or Householder |
| 9 | Forming $Q$ explicitly in Householder QR | Costs $O(m^2 n)$ extra; usually unnecessary | Apply $Q^T$ implicitly via stored reflectors |
| 10 | Interpreting numpy.linalg.matrix_rank for nearly singular matrices | Uses a singular value threshold - results are threshold-dependent | Always check singular values directly; use regularization |
| 11 | Assuming positive definite matrices are safe to Cholesky-factorize without checks | Near-zero eigenvalues become negative in floating point | Add diagonal loading: A + eps * np.eye(n) |
| 12 | Using eigenvalue decomposition for symmetric matrices via eig | eig uses general (non-symmetric) algorithm; slower and less stable | Always use eigh for symmetric/Hermitian matrices |
10. Exercises
Exercise 1 * - LU Factorization with Partial Pivoting
Implement Gaussian elimination with partial pivoting from scratch.
(a) Write lu_pivot(A) returning (P, L, U) such that P @ A == L @ U.
(b) Implement forward/back substitution to solve L z = P b and U x = z.
(c) Verify on a random matrix: compare x to numpy.linalg.solve(A, b).
(d) Demonstrate that without pivoting (always choosing the diagonal entry as the pivot), the solution fails for the matrix $\begin{pmatrix} \varepsilon & 1 \\ 1 & 1 \end{pmatrix}$ from Section 3.1 with $\varepsilon = 10^{-20}$.
Exercise 2 * - Stability Comparison: Normal Equations vs QR
Compare the least squares solvers on an ill-conditioned matrix.
(a) Generate $A \in \mathbb{R}^{m \times n}$ (e.g., $100 \times 20$) with geometrically decaying singular values such as $\sigma_i = 10^{-i}$.
(b) Compute the least squares solution via Normal equations (form $A^T A$, solve) and via QR (numpy.linalg.lstsq). Report $\kappa(A)$, $\kappa(A^T A)$, and the relative error of each method.
(c) Plot relative error vs condition number (vary the decay rate of singular values).
Exercise 3 ** - Implementing Conjugate Gradient
Implement the CG algorithm for SPD systems.
(a) Write conjugate_gradient(A, b, x0, tol, max_iter) returning (x, residuals).
(b) Test on a random SPD system (e.g., $A = B^T B + I$). Plot the residual norm vs iteration.
(c) Verify the convergence bound: $\|x_k - x_*\|_A \le 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k \|x_0 - x_*\|_A$.
(d) Add diagonal preconditioning: solve $M^{-1}Ax = M^{-1}b$ where $M = \mathrm{diag}(A)$. Compare convergence with and without preconditioning.
Exercise 4 ** - Iterative Refinement
Implement mixed-precision iterative refinement.
(a) Factor $A$ in fp32 using scipy.linalg.lu.
(b) Solve $Ax_0 = b$ using the fp32 factorization. Compute the residual $r = b - Ax_0$ in fp64.
(c) Solve $Ad = r$ using the fp32 factorization. Update $x \leftarrow x + d$.
(d) Report the forward error after each refinement step for a moderately ill-conditioned system (e.g., $\kappa(A) \approx 10^6$).
Exercise 5 ** - Power Iteration and PageRank
Implement power iteration to compute the dominant eigenvector.
(a) Write power_iteration(A, x0, tol) converging to the dominant eigenvector.
(b) Apply to the PageRank problem: for a random directed graph with $n$ nodes, build the (damped) transition matrix and use power iteration to find the stationary distribution.
(c) Compare the convergence rate to the predicted $|\lambda_2/\lambda_1|^k$. Compute $\lambda_2$ via numpy.linalg.eig.
Exercise 6 *** - Condition Number and Gradient Descent Convergence
Demonstrate the condition number's role in optimization convergence.
(a) Generate a quadratic objective $f(x) = \frac{1}{2}x^T A x - b^T x$ where $A$ has eigenvalues spread over $[1, \kappa]$ for varying $\kappa$.
(b) Run gradient descent with the optimal fixed step size $\eta^* = 2/(\lambda_{\min} + \lambda_{\max})$. Record $f(x_k) - f(x_*)$ at each step.
(c) Verify the bound $\|x_k - x_*\| \le \left(\frac{\kappa-1}{\kappa+1}\right)^k \|x_0 - x_*\|$.
(d) Run CG on the same problems. Compare the number of steps needed to reach a fixed accuracy (e.g., $10^{-8}$).
Exercise 7 ** - Modified Gram-Schmidt QR
Implement Modified Gram-Schmidt and analyze loss of orthogonality.
(a) Implement mgs_qr(A) returning (Q, R) via MGS.
(b) Compare the orthogonality loss $\|Q^T Q - I\|$ for MGS vs Classical Gram-Schmidt vs Householder on an ill-conditioned matrix.
(c) Plot orthogonality loss vs $\kappa(A)$ for all three methods.
Exercise 8 *** - Sparse Systems and Preconditioning
Work with sparse linear systems from a 1D PDE discretization.
(a) Build the tridiagonal matrix from discretizing $-u''(x) = f(x)$ on $[0, 1]$ with $n$ interior points: $A = \frac{1}{h^2}\,\mathrm{tridiag}(-1, 2, -1)$, where $h = 1/(n+1)$.
(b) Solve using: (i) dense numpy.linalg.solve, (ii) sparse CG without preconditioner, (iii) sparse CG with diagonal preconditioner.
(c) Report time, iteration count, and final residual for each method.
(d) Compute $\kappa(A)$ and verify that the CG iteration count scales roughly as $\sqrt{\kappa(A)}$.
11. Why This Matters for AI (2026 Perspective)
| Aspect | Impact |
|---|---|
| BLAS/cuBLAS operations | Every forward pass in a neural network is a sequence of matrix multiplications. The stability of BLAS routines directly impacts training |
| Adam as diagonal preconditioner | Adam's $1/\sqrt{\hat{v}_t}$ scaling implements diagonal preconditioning for the implicit linear system of gradient descent |
| FlashAttention | Uses numerically stable tiled attention computation - directly applies backward stability principles to avoid materializing the full attention matrix |
| Mixed-precision training | fp16 factorizations + fp32 refinement mirrors mixed-precision iterative refinement from numerical linear algebra |
| K-FAC optimization | Approximate second-order method using block-diagonal Kronecker-factored Hessian; requires stable Cholesky factorization of each block |
| Spectral normalization | GAN training stability via normalizing weight matrices by $\sigma_{\max}(W)$ - computed via power iteration |
| Gradient clipping | Addresses ill-conditioning of the loss Hessian; analogous to limiting the condition number of each step |
| Neural ODEs | Require solving implicit linear systems at each time step; use GMRES or CG for efficiency |
| LoRA/low-rank adaptation | Exploits low-rank structure: instead of updating $W \in \mathbb{R}^{d \times k}$, update $W + BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, $r \ll \min(d, k)$ |
| Eigenvalue analysis of trained networks | Studying the Hessian spectrum at the end of training; QR algorithm powers these computations |
12. Conceptual Bridge
This section sits at the intersection of pure mathematical theory and practical computational science. You came in knowing that $Ax = b$ has a unique solution when $A$ is invertible - this section showed you that how you solve it matters as much as whether a solution exists.
What you learned: The key insight is backward stability - the right standard for numerical algorithms. Backward stable algorithms solve a nearby problem exactly; the condition number then determines how much forward error this implies. LU with partial pivoting, Householder QR, and Cholesky are all backward stable. Classical Gram-Schmidt, normal equations, and LU without pivoting are not.
Looking backward: This section builds directly on Section01 Floating-Point Arithmetic (machine epsilon, rounding model, condition numbers) and Section03-Advanced-Linear-Algebra/08-Matrix-Decompositions (mathematical existence and properties of LU, QR, Cholesky). If the theoretical content of factorizations is unfamiliar, revisit those sections.
Looking forward: The iterative methods here (CG, GMRES) reappear in Section03 Numerical Optimization as tools for solving the linear systems that arise in Newton's method and trust region methods. The condition number analysis connects to convergence theory for gradient descent, which is the core of training deep neural networks.
CHAPTER Section10 POSITION: NUMERICAL METHODS
========================================================================
Section01 Floating-Point Arithmetic <- You are here (prerequisites)
+--> Section02 Numerical Linear Algebra <- THIS SECTION
+--> Section03 Numerical Optimization
+--> Section04 Interpolation & Approximation
+--> Section05 Numerical Integration
BRIDGES FROM THIS SECTION:
+-------------------------------------------------------------+
| LU/QR stability -> Section08 Optimization (line search) |
| CG method -> Section03 Numerical Optimization |
| Preconditioning theory -> Section08 Optimization (Adam analysis) |
| Eigenvalue algorithms -> Section11 Graph Theory (spectral) |
| Matrix condition numbers -> Section03 Advanced LA (decompositions) |
+-------------------------------------------------------------+
========================================================================
Appendix A: Detailed LU Factorization Algorithm
Here is a complete, step-by-step implementation of GEPP suitable for educational purposes:
import numpy as np
def lu_partial_pivot(A):
"""
Gaussian elimination with partial pivoting.
Returns P, L, U such that P @ A = L @ U.
"""
n = A.shape[0]
A = A.astype(np.float64).copy()
P = np.eye(n)
L = np.zeros((n, n))
for k in range(n - 1):
# Find pivot: largest element in column k, rows k..n-1
pivot_row = k + np.argmax(np.abs(A[k:, k]))
# Swap rows k and pivot_row in A, L, P
if pivot_row != k:
A[[k, pivot_row], :] = A[[pivot_row, k], :]
P[[k, pivot_row], :] = P[[pivot_row, k], :]
L[[k, pivot_row], :k] = L[[pivot_row, k], :k]
# Eliminate below the diagonal
for i in range(k + 1, n):
L[i, k] = A[i, k] / A[k, k]
A[i, k:] -= L[i, k] * A[k, k:]
L += np.eye(n) # Add diagonal ones
U = np.triu(A) # Upper triangular part
return P, L, U
def forward_sub(L, b):
"""Solve Lx = b where L is lower triangular with unit diagonal."""
n = len(b)
x = np.zeros(n)
for i in range(n):
x[i] = b[i] - L[i, :i] @ x[:i]
return x
def back_sub(U, b):
"""Solve Ux = b where U is upper triangular."""
n = len(b)
x = np.zeros(n)
for i in range(n - 1, -1, -1):
x[i] = (b[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
return x
def solve_lu(A, b):
"""Solve Ax = b via LU with partial pivoting."""
P, L, U = lu_partial_pivot(A)
pb = P @ b # Apply permutation
z = forward_sub(L, pb) # Solve Lz = Pb
x = back_sub(U, z) # Solve Ux = z
return x
Verification:
np.random.seed(42)
A = np.random.randn(5, 5)
b = np.random.randn(5)
x = solve_lu(A, b)
print(f"Residual: {np.linalg.norm(A @ x - b):.2e}") # Should be ~1e-15
Appendix B: Householder Reflector Implementation
A complete Householder QR from scratch:
def householder_qr(A):
"""
QR factorization via Householder reflectors.
Returns Q (orthogonal), R (upper triangular).
"""
m, n = A.shape
A = A.copy().astype(np.float64)
Q = np.eye(m)
for k in range(min(m, n)):
# Extract column k below the diagonal
x = A[k:, k].copy()
# Build Householder reflector to zero out x[1:]
e1 = np.zeros_like(x)
e1[0] = 1.0
sign = 1.0 if x[0] >= 0 else -1.0
v = x + sign * np.linalg.norm(x) * e1
v = v / np.linalg.norm(v)
# Apply: A[k:, k:] -= 2 v (v^T A[k:, k:])
A[k:, k:] -= 2.0 * np.outer(v, v @ A[k:, k:])
# Accumulate Q: Q[k:, :] -= 2 v (v^T Q[k:, :])
Q[k:, :] -= 2.0 * np.outer(v, v @ Q[k:, :])
R = np.triu(A)
return Q.T, R # Q^T was accumulated; return Q = (Q^T)^T
def solve_qr(A, b):
"""Solve least squares min ||Ax - b|| via QR."""
Q, R = householder_qr(A)
n = R.shape[1]
Qt_b = Q.T @ b
# Solve R[:n, :] x = Q.T b[:n] via back substitution
x = back_sub(R[:n, :n], Qt_b[:n])
return x
Appendix C: Conjugate Gradient - Detailed Derivation
The CG method can be derived from several perspectives. The most illuminating is the energy minimization view.
Setup: We want to minimize the quadratic energy function: $\phi(x) = \frac{1}{2}x^T A x - b^T x$,
where $A$ is SPD. The minimum satisfies $\nabla\phi(x) = Ax - b = 0$, i.e., $x = A^{-1}b$.
Steepest descent would use the negative gradient $r_k = b - Ax_k$ as the search direction: $x_{k+1} = x_k + \alpha_k r_k$.
But steepest descent exhibits zig-zagging - successive search directions are perpendicular, leading to slow convergence requiring $O(\kappa)$ iterations.
Key idea of CG: Choose search directions $p_k$ that are $A$-conjugate (mutually orthogonal under the $A$-inner product $\langle u, v\rangle_A = u^T A v$): $p_i^T A p_j = 0$ for $i \neq j$.
Why this works: In a space of dimension $n$ with $n$ mutually conjugate directions, the minimum can be found in at most $n$ steps (each step minimizes exactly along one conjugate direction, and these directions span $\mathbb{R}^n$).
The CG recurrence generates $A$-conjugate directions from the residuals using a three-term recurrence, which is the Lanczos relation for symmetric matrices.
Finite precision behavior: In exact arithmetic, CG terminates in at most $n$ steps. In floating point:
- Orthogonality of the residuals is lost after many steps
- The algorithm does not terminate in exactly $n$ steps for large $n$
- The attainable residual has a floor of roughly $O(\kappa(A)\,\varepsilon_{\text{mach}}\,\|b\|)$
Appendix D: Growth Factor Examples
The theoretical worst case for the growth factor in GEPP is $\rho = 2^{n-1}$, achieved by the Wilkinson matrix with 1s on the diagonal and in the last column and $-1$s below the diagonal, e.g. for $n = 4$: $\begin{pmatrix} 1 & 0 & 0 & 1 \\ -1 & 1 & 0 & 1 \\ -1 & -1 & 1 & 1 \\ -1 & -1 & -1 & 1 \end{pmatrix}$.
After GEPP on this matrix, the $(n, n)$ entry of $U$ equals $2^{n-1}$. For $n$ beyond roughly 1100, $2^{n-1}$ would exceed the fp64 range - exponential growth. However, this matrix almost never arises in practice.
Empirical growth factors:
| Matrix type | Typical $\rho$ | Notes |
|---|---|---|
| Random | $O(n^{2/3})$ | Empirical average |
| Tridiagonal | $\le 2$ | Theoretical bound |
| SPD matrices | $1$ (Cholesky) | No growth possible |
| Worst case (Wilkinson matrix) | $2^{n-1}$ | Theoretically possible, never seen |
Appendix E: Scipy/NumPy Linear Algebra API Reference
Recommended routines:
import numpy as np
import numpy.linalg as nla
import scipy.linalg as sla
from scipy.sparse.linalg import cg, spsolve
# === Direct Methods ===
# Solve Ax = b (dense)
x = nla.solve(A, b) # Uses LAPACK dgesv (LU + partial pivoting)
x = sla.solve(A, b) # Similar; slightly more options
# Least squares
x, _, _, _ = nla.lstsq(A, b, rcond=None) # SVD-based (LAPACK gelsd)
# LU factorization (reuse for multiple RHS)
lu, piv = sla.lu_factor(A) # Factor once
x = sla.lu_solve((lu, piv), b) # Solve for each b
# QR factorization
Q, R = nla.qr(A) # Householder QR
Q, R = sla.qr(A) # Full or economic form
# Cholesky (SPD only)
L = nla.cholesky(A) # Lower triangular factor
c, low = sla.cho_factor(A) # Factor for reuse
x = sla.cho_solve((c, low), b)
# === Iterative Methods ===
# Conjugate gradient (SPD)
from scipy.sparse.linalg import cg
x, info = cg(A, b, tol=1e-10, maxiter=1000)
# GMRES (general)
from scipy.sparse.linalg import gmres
x, info = gmres(A, b, tol=1e-10, maxiter=1000)
# === Condition Number ===
kappa = nla.cond(A) # Via SVD - O(n^3)
kappa_1 = nla.cond(A, 1) # 1-norm condition (faster via LAPACK)
# Estimate condition number cheaply (after LU):
# scipy.linalg.lu_factor + scipy.linalg.rcond (reciprocal condition number)
# === Eigenvalues ===
eigvals = nla.eigvalsh(A) # Symmetric/Hermitian only (faster, stable)
eigvals, eigvecs = nla.eigh(A)
eigvals = nla.eigvals(A) # General (use only if A is non-symmetric)
# === SVD ===
U, s, Vt = nla.svd(A) # Full SVD
U, s, Vt = nla.svd(A, full_matrices=False) # Economy SVD (m >= n)
Appendix F: Error Analysis of Triangular Solves
When solving a triangular system $Tx = b$ by substitution, the computed solution $\hat{x}$ satisfies: $(T + \Delta T)\hat{x} = b$ with $|\Delta T| \le n\,\varepsilon_{\text{mach}}\,|T| + O(\varepsilon_{\text{mach}}^2)$.
This is backward stable with backward error $O(n\,\varepsilon_{\text{mach}})$.
Forward error: The forward error of the triangular solve is governed by the condition number $\kappa(T) = \|T\|\,\|T^{-1}\|$.
For well-conditioned triangular systems (as typically produced by LU with partial pivoting), $\kappa(T)$ is moderate - the forward error is $O(\kappa(T)\,n\,\varepsilon_{\text{mach}})$.
Practical implication: Even for ill-conditioned $A$, if you solve $Ax = b$ via LU with partial pivoting, the triangular solve steps are individually backward stable. The overall error comes from the conditioning of $A$, not from numerical instability of the algorithm.
Appendix G: Connection to PyTorch/JAX Linear Algebra
Modern deep learning frameworks expose efficient linear algebra:
import torch
# === Direct Solve ===
A = torch.randn(100, 100, dtype=torch.float64)
b = torch.randn(100, 1, dtype=torch.float64)
# torch.linalg.solve is backward stable (uses cuBLAS LAPACK routines)
x = torch.linalg.solve(A, b)
# === Batched Operations ===
# Solve 32 systems simultaneously (B x N x N)
A_batch = torch.randn(32, 100, 100)
b_batch = torch.randn(32, 100, 1)
x_batch = torch.linalg.solve(A_batch, b_batch) # Fully batched, GPU-accelerated
# === Eigenvalues ===
A_sym = A @ A.T # Make symmetric
eigvals, eigvecs = torch.linalg.eigh(A_sym) # For symmetric: use eigh
# === SVD ===
U, S, Vh = torch.linalg.svd(A)
U, S, Vh = torch.linalg.svd(A, full_matrices=False) # Economy SVD
# === Condition Number ===
kappa = torch.linalg.cond(A)
# === QR ===
Q, R = torch.linalg.qr(A) # Householder QR
GPU execution: All torch.linalg operations execute on GPU if tensors are on GPU. Batched operations are especially efficient: torch.linalg.solve with batch dimension uses cuBLAS batched GETRS.
Gradient computation: torch.linalg operations support automatic differentiation. The gradient of torch.linalg.solve(A, b) with respect to A and b is computed via the adjoint method - no explicit matrix inverse is formed.
Appendix H: Krylov Subspace Methods - Unified View
The Krylov subspace $\mathcal{K}_k(A, r_0) = \mathrm{span}\{r_0, Ar_0, A^2 r_0, \ldots, A^{k-1}r_0\}$ is the fundamental concept underlying all modern iterative methods.
Different optimality conditions give different methods:
| Method | System | Optimality condition |
|---|---|---|
| CG | SPD | Minimize $\|x - x_*\|_A$ over $x_0 + \mathcal{K}_k$ |
| MINRES | symmetric | Minimize $\|b - Ax\|_2$ over $x_0 + \mathcal{K}_k$ |
| GMRES | general | Minimize $\|b - Ax\|_2$ over $x_0 + \mathcal{K}_k$ |
| FOM | general | Galerkin condition $r_k \perp \mathcal{K}_k$ |
| BiCG | general | Biorthogonalization (cheaper, less stable) |
| BiCGStab | general | BiCG with stability fix |
Why Krylov subspaces? Each iterate has the form $x_k = x_0 + p_{k-1}(A)\,r_0$ for some polynomial $p_{k-1}$ of degree $k - 1$; each method picks the polynomial that minimizes its error or residual norm over the subspace spanned by $\{r_0, Ar_0, \ldots, A^{k-1}r_0\}$.
Convergence depends on the spectrum: If $A$ has only $m$ distinct eigenvalues, the Krylov method converges in at most $m$ steps in exact arithmetic. Pre-clustering eigenvalues (via preconditioning) accelerates convergence.
Appendix I: Sparse Matrix Storage Formats
For large sparse systems, efficient storage and matrix-vector products are critical:
| Format | Storage | Access | Best for |
|---|---|---|---|
| Dense | $O(mn)$ | $O(1)$ random access | Small/dense matrices |
| COO (coordinates) | $O(\mathrm{nnz})$ | $O(\mathrm{nnz})$ scan | Construction |
| CSR (compressed sparse row) | $O(\mathrm{nnz} + m)$ | Fast row slicing | Row operations, matvec |
| CSC (compressed sparse column) | $O(\mathrm{nnz} + n)$ | Fast column slicing | Column operations |
| BSR (block sparse row) | $O(\mathrm{nnz})$ | Block-level access | Dense block structure |
| DIA (diagonal) | $O(n \cdot n_{\text{diag}})$ | Diagonal access | Banded/diagonal matrices |
Sparse matrix-vector product (SpMV): The core operation in iterative methods. For CSR:
y = zeros(n)
for i in range(n):
for j_idx in range(row_ptr[i], row_ptr[i+1]):
y[i] += val[j_idx] * x[col_idx[j_idx]]
This is $O(\mathrm{nnz})$ per multiply - for matrices with $\mathrm{nnz} \ll n^2$, dramatically cheaper than the $O(n^2)$ dense matvec.
For AI: Sparse attention in transformers uses sparse matrix representations. The attention mask is stored as a sparse pattern; the attention computation becomes a sparse matrix-matrix product (SpMM).
Appendix J: Worked Example - Condition Number Effect on Least Squares
Consider fitting a polynomial through noisy data. Using the Vandermonde matrix $V_{ij} = x_i^j$:
For $n = 10$ coefficients with $m = 20$ points $x_i$ equally spaced in $[0, 1]$:
n = 10
x = np.linspace(0, 1, 20)
V = np.vander(x, n, increasing=True)
# Condition numbers
print(f'kappa(V) = {np.linalg.cond(V):.2e}') # ~1e12
print(f'kappa(V^TV) = {np.linalg.cond(V.T @ V):.2e}') # ~1e24!
# Fit via Normal equations vs QR
y = np.sin(2 * np.pi * x) + 0.01 * np.random.randn(20)
c_normal = np.linalg.solve(V.T @ V, V.T @ y) # Unstable
c_qr, _, _, _ = np.linalg.lstsq(V, y, rcond=None) # Stable
Result: For high-degree polynomials, the Vandermonde matrix is extremely ill-conditioned. The QR solution via lstsq will be far more accurate than the Normal equations approach.
Fix: Use Chebyshev nodes instead of equally-spaced nodes. The resulting Vandermonde-like matrix has dramatically lower condition number (see Section04 Interpolation and Approximation).
Appendix K: Quick Reference - Algorithm Complexity and Stability
NUMERICAL LINEAR ALGEBRA: ALGORITHM COMPARISON
========================================================================
DIRECT METHODS (Dense A, n x n):
Algorithm Flops Storage Stable? Use when
-------------------------------------------------------------
LU (no pivot) 2n^3/3 n^2 NO Only if A is SPD
LU (partial) 2n^3/3 n^2 YES* General square A
QR (Householder)  2mn^2 - 2n^3/3   mn     YES      Least squares
QR (MGS) 2mn^2 mn Partial Streaming/iterative
Cholesky n^3/3 n^2/2 YES SPD matrices
SVD ~11n^3 mn YES Rank-def, analysis
* Stable in practice; theoretical worst case 2^(n-1) growth
ITERATIVE METHODS (Sparse A, nnz nonzeros):
Algorithm Cost/iter Converges? Precond? Use when
-------------------------------------------------------------
Jacobi O(nnz) If D-dom. N/A Parallel arch
Gauss-Seidel O(nnz) If D-dom. N/A Sequential
CG O(nnz) SPD only YES SPD sparse
MINRES O(nnz) Sym. only YES Sym. indef.
GMRES          O(k*nnz)   Always      YES      Non-symmetric
BiCGStab O(nnz) Often YES Non-sym, cheaper
========================================================================
Appendix L: Numerical Linear Algebra in the LLM Stack
Modern large language models rely on numerical linear algebra at every level:
1. Attention computation (FlashAttention-2):
- Tiled matrix multiplication with online softmax rescaling
- Block size chosen to fit in SRAM (80KB L1 cache on A100)
- Backward pass uses recomputation (saves memory at cost of 1.33x extra flops)
- Key numerical property: each tile computation is backward stable; the tiling doesn't affect correctness
2. Optimizer state (AdaFactor, CAME):
- AdaFactor approximates the Adam second-moment matrix by a rank-1 outer product of its row and column sums (normalized by the total sum)
- This is a rank-1 approximation of the true second moment matrix
- CAME introduces curvature information via a sparse second-order approximation
3. Quantization (GPTQ, AWQ, SqueezeLLM):
- Post-training weight quantization formulated as a matrix approximation problem
- GPTQ: use Cholesky decomposition of the Hessian to determine optimal quantization order
- AWQ: identify "salient weights" via gradient sensitivity (related to condition number analysis)
4. Low-rank adaptation (LoRA, DoRA):
- $W' = W + BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, $r \ll \min(d, k)$
- Initialization: $B = 0$ and $A$ random (so $BA = 0$ initially and training starts from the pretrained $W$)
- Numerically stable: low-rank (rank-$r$) updates leave the conditioning of $W$ essentially unchanged
- DoRA: decomposes the update into a magnitude and a direction component, normalizing the direction by its column norms (related to spectral normalization)
Appendix M: Extended Theory - Structured Linear Systems
M.1 Toeplitz Systems
A Toeplitz matrix has constant diagonals: $T_{ij} = t_{i-j}$. This structure arises in:
- Signal processing (convolution)
- Time-series analysis (autocovariance)
- Certain PDE discretizations
Levinson-Durbin algorithm: Solves Toeplitz systems in $O(n^2)$ flops (vs $O(n^3)$ for general LU). Uses the structure to build the solution recursively.
For AI: Convolutional layers in CNNs implement Toeplitz-structured matrix-vector products. The connection between convolution and Toeplitz structure explains why FFT-based convolution ($O(n \log n)$) is faster than direct convolution ($O(n^2)$) for long signals.
M.2 Kronecker Product Systems
For systems of the form $(A \otimes B)\,x = c$, where $\otimes$ denotes the Kronecker product:
The system can be solved as a matrix equation: $B X A^T = C$ with $x = \mathrm{vec}(X)$ and $c = \mathrm{vec}(C)$. This requires factoring two smaller matrices of sizes $n \times n$ and $m \times m$ rather than one $mn \times mn$ matrix.
For AI: Kronecker-factored curvature (K-FAC) approximates the Fisher information matrix of each layer as a Kronecker product $F \approx A \otimes G$, where $A$ (input-activation covariance) and $G$ (output-gradient covariance) are layer-specific matrices. This enables efficient second-order optimization without forming the full Fisher matrix.
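A small check of the vec trick under the assumption of column-major (Fortran-order) vectorization, $(A \otimes B)\,\mathrm{vec}(X) = \mathrm{vec}(B X A^T)$; the matrix sizes are arbitrary demo choices:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 40, 30
A = rng.standard_normal((n, n)) + n * np.eye(n)
B = rng.standard_normal((m, m)) + m * np.eye(m)
c = rng.standard_normal(m * n)

# Direct approach: factor the (mn) x (mn) matrix -- O((mn)^3)
x_direct = np.linalg.solve(np.kron(A, B), c)

# Kronecker trick: two small solves -- O(m^3 + n^3) factorizations
C = c.reshape((m, n), order="F")        # unvec (column-major)
Y = np.linalg.solve(B, C)               # solve B Y = C, where Y = X A^T
X = np.linalg.solve(A, Y.T).T           # then A X^T = Y^T gives X
x_kron = X.reshape(-1, order="F")

print("max difference:", np.abs(x_direct - x_kron).max())
```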
M.3 Circulant Systems
A circulant matrix has each row as a cyclic shift of the previous row: $C_{ij} = c_{(i - j) \bmod n}$.
Key property: Every circulant matrix is diagonalized by the DFT matrix: $C = F^{-1}\,\mathrm{diag}(\hat{c})\,F$,
where $F$ is the DFT matrix and $\hat{c} = Fc$ is the DFT of the defining vector $c$. This means:
- Eigenvalues of $C$ are the DFT coefficients of $c$
- Solving $Cx = b$ costs $O(n \log n)$ via the FFT
- Matrix-vector products cost $O(n \log n)$
For AI: Circular convolutions arise in certain attention mechanisms (e.g., some positional encodings). The connection between circulant matrices and the DFT explains why FFT-based convolution is exact for circular boundary conditions.
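A minimal FFT-based circulant solve, checked against a dense solve (the vector $c$ is shifted away from singularity purely for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1024
c = rng.standard_normal(n)
c[0] += n                                  # keep C comfortably nonsingular
b = rng.standard_normal(n)

# Eigenvalues of the circulant C are fft(c), so Cx = b  =>  x = ifft(fft(b)/fft(c))
x_fft = np.fft.ifft(np.fft.fft(b) / np.fft.fft(c)).real

# Dense check: column k of C is c cyclically shifted down by k
C = np.array([np.roll(c, k) for k in range(n)]).T
x_dense = np.linalg.solve(C, b)
print("max difference:", np.abs(x_fft - x_dense).max())
```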
Appendix N: Numerical Rank and Truncated SVD
N.1 Numerical Rank
The numerical rank of a matrix $A$ is the number of singular values above a threshold $\tau$: $\mathrm{rank}_\tau(A) = \#\{i : \sigma_i > \tau\}$.
The choice of $\tau$ depends on context:
- Machine precision threshold: $\tau = \max(m, n)\,\varepsilon_{\text{mach}}\,\sigma_1$ (the default in numpy.linalg.matrix_rank)
- Data-dependent threshold: $\tau$ placed at a "gap" in the singular value spectrum
- Relative threshold: $\tau = \delta\,\sigma_1$ for a user-specified $\delta$
For AI: Low-rank approximations underlie:
- LoRA: Weight updates confined to a low-rank subspace
- Attention compression: Approximating the attention matrix by a low-rank matrix
- Intrinsic dimensionality: The effective rank of gradient updates during training is often much smaller than the parameter dimension (Gur-Ari et al., 2018)
N.2 Truncated SVD for Compression
The best rank-$k$ approximation to $A$ (in Frobenius or spectral norm) is: $A_k = \sum_{i=1}^k \sigma_i u_i v_i^T$,
with error $\|A - A_k\|_2 = \sigma_{k+1}$.
Randomized SVD (Halko, Martinsson, Tropp, 2011): Compute a rank-$k$ approximation in roughly $O(mnk)$ flops:
- Draw a random Gaussian matrix $\Omega \in \mathbb{R}^{n \times (k+p)}$ (with small oversampling $p$)
- Compute $Y = A\Omega$ and QR-factor: $Y = QR$
- Compute the SVD of the small matrix $B = Q^T A$: $B = \tilde{U}\Sigma V^T$
- Return $U = Q\tilde{U}$, $\Sigma$, $V$
This is $O(mnk)$ - much faster than the $O(mn\min(m, n))$ full SVD when $k \ll \min(m, n)$.
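A compact sketch of the algorithm above, with made-up matrix sizes, oversampling, and singular-value decay chosen so the approximation is accurate:

```python
import numpy as np

def randomized_svd(A, k, p=10, seed=0):
    """Rank-k truncated SVD via random range finding with oversampling p."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + p))   # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)            # orthonormal basis for range(A Omega)
    B = Q.T @ A                               # small (k+p) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((2000, 50)))
V, _ = np.linalg.qr(rng.standard_normal((300, 50)))
A = U @ np.diag(np.logspace(0, -6, 50)) @ V.T     # rapidly decaying spectrum

Uk, sk, Vtk = randomized_svd(A, k=20)
s_exact = np.linalg.svd(A, compute_uv=False)[:20]
print("max singular value error:", np.abs(sk - s_exact).max())
```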
Appendix O: Case Studies in ML Systems
O.1 Attention Computation and Numerical Stability
The attention computation produces an matrix whose rows are passed through softmax. The numerical challenges:
Scale issue: With $d_k = 64$ and query/key entries of roughly unit scale, raw dot products $q_i^T k_j$ have magnitude on the order of $d_k$; after division by $\sqrt{d_k}$ the scaled scores have magnitude on the order of $\sqrt{d_k} \approx 8$.
In fp16 (range +/-65504), this is safe. But the softmax of scores with magnitude around 10 puts almost all of its mass on the largest entry - a nearly one-hot distribution.
This is extreme but not a numerical problem per se - softmax with the max-subtraction trick handles it.
Memory issue: For long sequences (say $n = 16{,}384$) and 8 attention heads, storing the $n \times n$ score matrix per head in fp16 requires several GB - far too large to keep on-chip. FlashAttention avoids this via tiling.
FlashAttention tile computation: Process $Q$ in blocks of rows and $K, V$ in blocks of columns. For each tile, compute the local scores, apply a local softmax with rescaling, and accumulate into the output. The key identity:
For each new tile of scores $s_j$, update the running row-wise maximum $m^{\text{new}} = \max(m, \max_j s_j)$ and the running sum $\ell \leftarrow \ell\,e^{m - m^{\text{new}}} + \sum_j e^{s_j - m^{\text{new}}}$ (rescaling the accumulated output by the same factor),
where $m$ is the running maximum and $\ell$ is the running sum of exponentials. This is numerically equivalent to computing the full softmax but uses $O(n)$ working memory instead of $O(n^2)$.
O.2 Gram-Schmidt in Transformer Training
During the training of large language models, maintaining orthogonality in weight updates can improve training stability. GradNorm and gradient orthogonalization methods apply Gram-Schmidt to gradients:
- Multiple task losses produce gradients $g_1, g_2, \ldots$
- Orthogonalizing $g_2$ against $g_1$ (projecting out the component of $g_2$ along $g_1$) prevents the second task from undoing the first
- Numerically, use MGS (not CGS) to avoid loss of orthogonality accumulation
O.3 Hessian Spectrum in Neural Networks
The Hessian of the training loss has a characteristic spectrum:
- A few large eigenvalues (corresponding to the most sensitive directions)
- Many near-zero eigenvalues (corresponding to nearly flat directions)
- This bulk + outlier structure explains why second-order methods need special handling
Computing the Hessian spectrum requires eigenvalue algorithms. For large networks ($10^9$+ parameters), even storing $H$ is impossible. Lanczos iteration (a Krylov refinement of power iteration) computes the extremal eigenvalues using only Hessian-vector products $Hv$, which can be computed via automatic differentiation at roughly the cost of one extra backward pass, in $O(\text{\#params})$ memory.
Appendix P: Summary Reference Card
Section02 NUMERICAL LINEAR ALGEBRA - SUMMARY CARD
========================================================================
CORE PRINCIPLE: Backward stability = exact answer to nearby problem
Forward error <= kappa(A) x backward error
KEY CONDITION NUMBERS:
kappa(A)                  - digits lost: ~ log10(kappa)
kappa(A^T A) = kappa(A)^2 - use QR, not normal equations!
rho (Cholesky) = 1        - Cholesky adds no growth (for SPD)
ALGORITHM SELECTION:
Square Ax = b, well-conditioned -> GEPP (LU)
Square Ax = b, SPD -> Cholesky
Least squares, ill-conditioned -> Householder QR
Least squares, well-conditioned -> Normal equations OK
Large sparse, SPD -> CG + preconditioner
Large sparse, non-symmetric -> GMRES + preconditioner
GROWTH FACTOR RULE OF THUMB:
LU without pivoting:   rho can be arbitrarily large
LU with partial pivot: rho <= 2^(n-1), in practice O(n^(2/3))
Cholesky:              rho = 1 exactly
ITERATIVE METHOD CONVERGENCE:
CG: error after k iters <= 2*((sqrt(kappa)-1)/(sqrt(kappa)+1))^k * ||e_0||_A
GS: converges for diagonally dominant / SPD matrices
GMRES: always converges (no divergence), but can be slow
PYTHON API CHEATSHEET:
nla.solve(A, b) - LU with partial pivoting
nla.lstsq(A, b) - Householder QR
sla.cho_factor/cho_solve- Cholesky for SPD
sp.linalg.cg(A, b) - Conjugate gradient
sp.linalg.gmres(A, b) - GMRES
========================================================================
<- Back to Section01 Floating-Point Arithmetic | Next: Section03 Numerical Optimization ->
Appendix Q: Iterative Refinement - Full Analysis
Q.1 Standard Iterative Refinement
Given an approximate solution $\hat{x}_0$ to $Ax = b$ (e.g., from LU factorization), iterative refinement proceeds:
Step k:
- Compute the residual: $r_k = b - A\hat{x}_k$ (in higher precision if possible)
- Solve the correction equation: $Ad_k = r_k$ (using the stored LU factors)
- Update: $\hat{x}_{k+1} = \hat{x}_k + d_k$
Error analysis: Let $e_k = x - \hat{x}_k$ be the true error; it satisfies $Ae_k = r_k$, so the correction $d_k$ approximates $e_k$.
The error in $d_k$ comes from two sources:
- Rounding error in computing $r_k$: $\hat{r}_k = r_k + \delta r$ with $\|\delta r\| = O(\varepsilon_{\text{mach}}\,\|A\|\,\|\hat{x}_k\|)$
- Backward error in solving: $(A + \Delta A)\,d_k = \hat{r}_k$ with $\|\Delta A\| = O(\varepsilon_{\text{fact}}\,\|A\|)$, where $\varepsilon_{\text{fact}}$ is the precision of the factorization
Convergence: Iterative refinement converges when $\kappa(A)\,\varepsilon_{\text{fact}} < 1$, reducing the error by a factor of roughly $\kappa(A)\,\varepsilon_{\text{fact}}$ per step.
This means the forward error drops from $O(\kappa\,\varepsilon_{\text{fact}})$ toward $O((\kappa\,\varepsilon_{\text{fact}})^2)$ after one step, and after a few steps it stagnates at the level set by the residual precision (roughly $\varepsilon_{\text{mach}}$ if the residual is computed in full precision). This is essentially optimal.
Q.2 Mixed-Precision Iterative Refinement
NVIDIA cuSOLVER GMRES-IR (2022):
- Factor in fp16 (Tensor Core acceleration)
- Solve approximately via triangular solves in fp16
- Compute residual in fp64 (using the original fp64 data)
- Use GMRES in fp32 to refine the solution
Theoretical guarantees: Achieve fp64 accuracy at the computational cost of an fp32/fp16 factorization. The number of refinement/GMRES iterations needed depends on $\kappa(A)$:
- Moderate $\kappa(A)$ (well within the fp64 range): typically 3-5 refinement steps
- $\kappa(A)$ approaching $10^{16}$: may not converge (the problem is too ill-conditioned for fp64)
Q.3 FGMRES-IR for Mixed Precision
The full algorithm from (Carson & Higham, 2018):
Input: A (fp64), b (fp64), desired accuracy epsilon
Output: x (fp64) with ||Ax - b||/||b|| < epsilon
Phase 1: Low-precision factorization
A = fl32(A) # Cast to fp32
P, L, U = lu_factor(A) # Factor in fp32
Phase 2: Refinement loop
x = 0
r = b # Initial residual (fp64)
for iter = 1, ..., max_iter:
d = lu_solve(L, U, P, fl32(r)) # Solve in fp32
x += fl64(d) # Accumulate in fp64
r = b - A @ x # New residual in fp64
if ||r||/||b|| < epsilon: break
Return x
This achieves fp64 accuracy with ~2x speedup from fp32 factorization.
Appendix R: Advanced Topics in Eigenvalue Computation
R.1 Krylov-Schur (State of the Art for Large Sparse)
For large sparse eigenvalue problems (e.g., finding the 100 largest eigenvalues of a million-dimensional Laplacian), the Krylov-Schur algorithm (Stewart, 2001) improves on the Lanczos/Arnoldi methods:
- Build a Krylov basis $V_k$ via the Arnoldi relation: $A V_k = V_{k+1}\bar{H}_k$
- Compute the eigenvalues of the small Hessenberg matrix $H_k$ (Ritz values)
- Keep the desired Ritz values; discard the rest via Krylov-Schur restart
- Expand the basis and repeat
For AI: Used in spectral analysis of large language models (computing the dominant eigenvectors of the attention pattern matrix, analyzing the Hessian spectrum of loss landscapes, spectral clustering of large social graphs).
R.2 LOBPCG for Symmetric Eigenvalue Problems
The Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) method computes $k$ eigenpairs simultaneously:
- Operates on a block of $k$ vectors
- Each iteration: Rayleigh-Ritz extraction from a $3k$-dimensional trial subspace (current iterates, residuals, and previous directions)
- Very cache-friendly (block operations); good for GPU acceleration
- Convergence rate: similar to CG, governed by the relevant eigenvalue gaps
For AI: LOBPCG is used in GNN implementations for computing the first few eigenvectors of graph Laplacians (for spectral positional encodings) and in second-order optimization for computing the dominant eigenvectors of the Hessian.
Appendix S: Forward Error Analysis Proofs
S.1 Error in Dot Product
Claim: For $x, y \in \mathbb{R}^n$ with elements stored in fp64, the computed dot product satisfies
$$|\,\mathrm{fl}(x^T y) - x^T y\,| \le \gamma_n\, |x|^T |y|, \qquad \gamma_n = \frac{nu}{1 - nu} \approx nu .$$
Proof: Each product is rounded: $\mathrm{fl}(x_i y_i) = x_i y_i (1 + \delta_i)$ with $|\delta_i| \le u$.
Each addition accumulates more error. Using the model $\mathrm{fl}(a + b) = (a + b)(1 + \delta)$ with $|\delta| \le u$, and applying it $n - 1$ times, each term $x_i y_i$ ends up multiplied by at most $n$ factors of the form $(1 + \delta)$.
Taking absolute values and using $|(1+\delta_1)\cdots(1+\delta_k) - 1| \le \gamma_n$ for $k \le n$ gives the claim.
The bound is tight in the sense that inputs can be constructed for which it is nearly attained; for random data the error typically grows like $\sqrt{n}\,u$ instead. Kahan (compensated) summation reduces the bound to $O(u)\,|x|^T|y|$, essentially independent of $n$.
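A quick numerical check of this behaviour (fp32 accumulation against an fp64 reference; the left-to-right sum is forced via cumsum because NumPy's dot and sum use blocked/pairwise accumulation internally):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.standard_normal(n).astype(np.float32)
y = rng.standard_normal(n).astype(np.float32)

ref = np.dot(x.astype(np.float64), y.astype(np.float64))   # fp64 reference
u = np.finfo(np.float32).eps / 2                            # unit roundoff of fp32

prods = x * y                                               # each product rounded once
naive = np.cumsum(prods, dtype=np.float32)[-1]              # strict left-to-right fp32 sum

# Kahan compensated summation, still entirely in fp32
s = np.float32(0.0)
c = np.float32(0.0)
for p in prods:
    t = np.float32(s + (p - c))
    c = np.float32((t - s) - (p - c))
    s = t

bound = n * u * np.dot(np.abs(x), np.abs(y))                # worst-case n*u*|x|^T|y|
print(f"naive fp32 error : {abs(float(naive) - ref):.3e}")
print(f"Kahan fp32 error : {abs(float(s) - ref):.3e}")
print(f"worst-case bound : {bound:.3e}")   # far above the typical (random-data) error
```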
S.2 Rounding Error in Matrix-Vector Product
For $y = Ax$ with $A \in \mathbb{R}^{m \times n}$, the computed $\hat{y}$ satisfies
$$|\hat{y} - Ax| \le \gamma_n\, |A|\,|x| .$$
This is the componentwise forward error bound. Taking 2-norms,
$$\|\hat{y} - Ax\|_2 \le \gamma_n\, \big\|\, |A|\,|x| \,\big\|_2 ,$$
where $|A|$ denotes the elementwise absolute value. The relative error is thus
$$\frac{\|\hat{y} - Ax\|_2}{\|Ax\|_2} \le \gamma_n\, \frac{\big\||A|\,|x|\big\|_2}{\|Ax\|_2}.$$
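A small check of the componentwise bound in fp32 (the factor $n u$ is a worst case, so the observed error should sit comfortably below it):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 300, 400
A = rng.standard_normal((m, n)).astype(np.float32)
x = rng.standard_normal(n).astype(np.float32)

y_hat = A @ x                                        # computed in fp32
y_ref = A.astype(np.float64) @ x.astype(np.float64)  # fp64 reference
u = np.finfo(np.float32).eps / 2

err = np.abs(y_hat - y_ref)                          # componentwise forward error
bound = n * u * (np.abs(A) @ np.abs(x))              # ~gamma_n |A||x|, to first order
print("bound holds componentwise:", bool(np.all(err <= bound)))
print("largest error/bound ratio :", float((err / bound).max()))
```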
Appendix T: Connections to Related Sections
| Concept | This Section (Section02) | Related Section |
|---|---|---|
| LU factorization existence | Numerical stability of GEPP | Section03-Advanced-LA/08-Matrix-Decompositions: theory |
| QR factorization | Householder stability, MGS vs CGS | Section03-Advanced-LA/08: existence, Gram-Schmidt theory |
| Eigenvalue algorithms | Power iteration, QR algorithm | Section03-Advanced-LA/01-Eigenvalues: theory |
| Condition number basics | Perturbation theory, error amplification | Section01 Floating-Point Arithmetic: definition |
| CG as optimizer | Matrix system solver | Section03 Numerical Optimization: CG for optimization |
| SVD applications | Numerical rank, truncated SVD | Section03-Advanced-LA/02-SVD: theory |
| Sparse attention patterns | Sparse matrix-vector products | Section11 Graph Theory: graph algorithms |
Appendix U: Deep Dive - Preconditioned Conjugate Gradient
U.1 The Full PCG Algorithm
The Preconditioned Conjugate Gradient (PCG) algorithm, written explicitly:
Input: SPD matrix A, right-hand side b, SPD preconditioner M
Output: approximate solution x to Ax = b
x_0 = 0 (or initial guess)
r_0 = b - A x_0
z_0 = M^{-1} r_0 (preconditioner solve)
p_0 = z_0
for k = 0, 1, 2, ...:
q_k = A p_k (matrix-vector product)
alpha_k = r_k^T z_k / p_k^T q_k
x_{k+1} = x_k + alpha_k p_k
r_{k+1} = r_k - alpha_k q_k
if ||r_{k+1}|| < tol: break
z_{k+1} = M^{-1} r_{k+1} (preconditioner solve)
beta_k = r_{k+1}^T z_{k+1} / r_k^T z_k
p_{k+1} = z_{k+1} + beta_k p_k
return x_{k+1}
Cost per iteration:
- 1 matrix-vector product: $O(\mathrm{nnz}(A))$ for sparse $A$
- 1 preconditioner solve: typically also $O(\mathrm{nnz}(A))$ for sparse $M$ (e.g., an incomplete-Cholesky preconditioner)
- 2 dot products plus the vector updates: $O(n)$
Total: $O\big(\mathrm{nnz}(A)\,\sqrt{\kappa(M^{-1}A)}\,\log(1/\epsilon)\big)$ for an $\epsilon$-accurate solution.
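A runnable NumPy/SciPy sketch of the loop above; the test problem (a 2D Poisson matrix with an erratic diagonal rescaling) and the Jacobi preconditioner are illustrative choices, not the only sensible ones:

```python
import numpy as np
import scipy.sparse as sp

def pcg(A, b, M_inv, tol=1e-10, max_iter=2000):
    """Preconditioned CG following the pseudocode above; M_inv(r) applies M^{-1} to a vector."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv(r)
    p = z.copy()
    rz = r @ z
    for k in range(max_iter):
        q = A @ p
        alpha = rz / (p @ q)
        x += alpha * p
        r -= alpha * q
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, k + 1
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter          # may simply hit the iteration cap

# Test problem: 2D Poisson matrix with an erratic diagonal scaling (SPD, badly scaled)
rng = np.random.default_rng(0)
nx = 50
T = sp.diags([2.0, -1.0, -1.0], [0, -1, 1], shape=(nx, nx))
L2 = sp.kron(sp.identity(nx), T) + sp.kron(T, sp.identity(nx))
s = 10.0 ** rng.uniform(0, 2, nx * nx)
A = (sp.diags(s) @ L2 @ sp.diags(s)).tocsr()
b = rng.standard_normal(nx * nx)

d = A.diagonal()
x, it_jacobi = pcg(A, b, M_inv=lambda r: r / d)   # Jacobi (diagonal) preconditioner
_, it_none   = pcg(A, b, M_inv=lambda r: r)       # no preconditioner
print(f"Jacobi PCG: {it_jacobi} iterations   plain CG: {it_none} iterations")
print(f"relative residual (Jacobi): {np.linalg.norm(A @ x - b) / np.linalg.norm(b):.2e}")
```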
U.2 Optimal Preconditioners for Specific Problems
Laplacian systems (arising from graph algorithms, mesh-based PDEs):
- $A = L$ is the graph Laplacian; $\kappa(L)$ grows with problem size (roughly $O(n)$ for 2D mesh problems)
- Algebraic multigrid (AMG): constructs a hierarchy of coarser graphs; one V-cycle costs $O(n)$ and, used as a preconditioner, keeps the PCG iteration count nearly independent of $n$
- AMG is used in nearly all large-scale PDE solvers (HYPRE, PETSc)
Systems from neural networks (K-FAC):
- $F \approx A \otimes G$ (Kronecker product of the activation and gradient covariances)
- Preconditioner: $(A \otimes G)^{-1} = A^{-1} \otimes G^{-1}$
- Each factor can be Cholesky-factored independently: $O(d_{\mathrm{in}}^3 + d_{\mathrm{out}}^3)$ instead of $O\big((d_{\mathrm{in}} d_{\mathrm{out}})^3\big)$
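The Kronecker-factored solve can be checked directly in NumPy. The covariance factors below are random SPD stand-ins rather than K-FAC's actual statistics, and the ordering of the factors in np.kron is tied to the row-major flattening of the weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 40, 50    # hypothetical layer dimensions

# SPD stand-ins for K-FAC's activation covariance (A_cov) and gradient covariance (G_cov)
A_cov = rng.standard_normal((d_in, d_in));   A_cov = A_cov @ A_cov.T + d_in * np.eye(d_in)
G_cov = rng.standard_normal((d_out, d_out)); G_cov = G_cov @ G_cov.T + d_out * np.eye(d_out)

g = rng.standard_normal(d_out * d_in)        # flattened gradient of a (d_out, d_in) weight matrix

# Naive: materialize the (d_out*d_in) x (d_out*d_in) Kronecker matrix and solve
F = np.kron(G_cov, A_cov)                    # ordering matches the row-major reshape below
x_naive = np.linalg.solve(F, g)

# Kronecker trick: (G ⊗ A)^{-1} vec(V) = vec(G^{-1} V A^{-1}), at O(d_out^3 + d_in^3) cost
V = g.reshape(d_out, d_in)
W = np.linalg.solve(G_cov, V)                          # G^{-1} V
x_kron = np.linalg.solve(A_cov, W.T).T.ravel()         # (G^{-1} V) A^{-1}  (A_cov is symmetric)
print(np.allclose(x_naive, x_kron))                    # True up to rounding
```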
Appendix V: Worked Example - Full Solution of a Least Squares Problem
Problem: Fit a degree-5 polynomial (6 coefficients) to $m = 30$ noisy data points.
Setup:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
m, d = 30, 6 # 6 coefficients for degree 5
# True function: sin(2*pi*x) + 0.5*cos(4*pi*x)
x = np.linspace(0, 1, m)
y_true = np.sin(2*np.pi*x) + 0.5*np.cos(4*np.pi*x)
y_noisy = y_true + 0.1 * np.random.randn(m)
# Build Vandermonde matrix
A = np.vander(x, d, increasing=True) # A[i, j] = x_i^j
# Condition number
kappa = np.linalg.cond(A)
print(f"kappa(A) = {kappa:.2e}")
# Method 1: Normal equations (unstable for high degree)
c_normal = np.linalg.solve(A.T @ A, A.T @ y_noisy)
# Method 2: backward-stable least squares (np.linalg.lstsq uses the SVD-based LAPACK driver gelsd)
c_qr, _, _, _ = np.linalg.lstsq(A, y_noisy, rcond=None)
# Evaluate and compare errors
print(f"Residual (normal): {np.linalg.norm(A @ c_normal - y_noisy):.6f}")
print(f"Residual (QR): {np.linalg.norm(A @ c_qr - y_noisy):.6f}")
Expected output:
- For the degree-5 fit ($d = 6$), $\kappa(A)$ is moderate - both methods give essentially the same fit
- For degree 10, $\kappa(A)$ is large enough that the Normal equations (which work with $\kappa(A)^2$) lose most of the digits, while the backward-stable solver still works
- For degree 15, $\kappa(A)$ is so large that even the backward-stable solver starts losing accuracy; Chebyshev nodes are needed (see the sketch below)
Key lesson: The numerical behavior of polynomial fitting is governed by $\kappa(A)$, which depends on both the degree and the placement of the sample points. Chebyshev nodes nearly minimize the Lebesgue constant, which is related to (but not identical to) the condition number.
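Both the node placement and the polynomial basis matter; a small sketch (on $[-1, 1]$, using NumPy's chebvander; the specific sizes are illustrative):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

m, d = 30, 16                                        # degree-15 fit, as in the last bullet above
x_equi = np.linspace(-1, 1, m)                       # equispaced nodes on [-1, 1]
x_cheb = np.cos(np.pi * (np.arange(m) + 0.5) / m)    # Chebyshev nodes on [-1, 1]

for node_name, xs in [("equispaced", x_equi), ("Chebyshev ", x_cheb)]:
    A_mono = np.vander(xs, d, increasing=True)       # monomial basis 1, x, ..., x^{d-1}
    A_cheb = C.chebvander(xs, d - 1)                 # Chebyshev basis T_0, ..., T_{d-1}
    print(f"{node_name} nodes: kappa(monomial) = {np.linalg.cond(A_mono):.2e}, "
          f"kappa(Chebyshev basis) = {np.linalg.cond(A_cheb):.2e}")
```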
Appendix W: Stability Proofs
W.1 Backward Stability of Triangular Solve
Theorem: Let $L \in \mathbb{R}^{n \times n}$ be lower triangular with unit diagonal. The computed solution $\hat{x}$ to $Lx = b$ by forward substitution satisfies
$$(L + \Delta L)\,\hat{x} = b, \qquad |\Delta L| \le \gamma_n\, |L| .$$
Proof (sketch for $n = 2$): The algorithm computes $\hat{x}_1 = b_1$ and $\hat{x}_2 = \mathrm{fl}(b_2 - l_{21}\hat{x}_1)$.
Expanding: $\hat{x}_2 = \big(b_2 - l_{21}\hat{x}_1(1+\delta_1)\big)(1+\delta_2)$ with $|\delta_1|, |\delta_2| \le u$.
The true solution satisfies $x_2 = b_2 - l_{21} x_1$, and rearranging the computed relation gives $l_{21}(1+\delta_1)\,\hat{x}_1 + (1+\delta_2)^{-1}\hat{x}_2 = b_2$.
This says $\hat{x}$ exactly solves $(L + \Delta L)\hat{x} = b$, where $\Delta L$ perturbs $l_{21}$ by a relative amount $\delta_1$ and the unit diagonal entry by roughly $-\delta_2$. The backward error is therefore $|\Delta L| \le u|L| + O(u^2)$ for $n = 2$, and $\gamma_n |L|$ in general.
W.2 Orthogonality Loss in Classical Gram-Schmidt
Theorem: Classical Gram-Schmidt applied to $A \in \mathbb{R}^{m \times n}$ with condition number $\kappa = \kappa_2(A)$ produces a computed $\hat{Q}$ with
$$\|\hat{Q}^T \hat{Q} - I\|_2 = O(u\,\kappa^2).$$
Intuition: At step $j$, CGS projects the original column $a_j$ against all previously computed $q_1, \dots, q_{j-1}$. The error in each $q_i$ is amplified by the subsequent projections; after $n$ steps, the errors have been propagated through up to $n-1$ imperfect projections, and the loss of orthogonality scales with $\kappa^2$.
Modified Gram-Schmidt instead projects against the updated vector at each step, breaking the error-propagation chain and achieving $\|\hat{Q}^T\hat{Q} - I\|_2 = O(u\,\kappa)$ - still worse than Householder's $O(u)$, but much better than CGS.
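A short experiment makes the three orthogonality-loss regimes visible (textbook CGS/MGS implementations below; the ill-conditioned test matrix is synthetic):

```python
import numpy as np

def cgs(A):
    """Classical Gram-Schmidt: project each column against the ORIGINAL column."""
    m, n = A.shape
    Q = np.zeros((m, n)); R = np.zeros((n, n))
    for j in range(n):
        R[:j, j] = Q[:, :j].T @ A[:, j]
        v = A[:, j] - Q[:, :j] @ R[:j, j]
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q

def mgs(A):
    """Modified Gram-Schmidt: project against the UPDATED columns."""
    m, n = A.shape
    Q = A.astype(float).copy(); R = np.zeros((n, n))
    for j in range(n):
        R[j, j] = np.linalg.norm(Q[:, j])
        Q[:, j] /= R[j, j]
        for k in range(j + 1, n):
            R[j, k] = Q[:, j] @ Q[:, k]
            Q[:, k] -= R[j, k] * Q[:, j]
    return Q

# Ill-conditioned test matrix with kappa ~ 1e10
rng = np.random.default_rng(0)
m, n = 100, 30
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = U @ np.diag(np.logspace(0, -10, n)) @ V.T

Q_house, _ = np.linalg.qr(A)                          # Householder QR (LAPACK)
for name, Q in [("CGS", cgs(A)), ("MGS", mgs(A)), ("Householder", Q_house)]:
    print(f"{name:12s}: ||Q^T Q - I|| = {np.linalg.norm(Q.T @ Q - np.eye(n), 2):.2e}")
```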
Appendix X: Glossary
| Term | Definition |
|---|---|
| Backward stable | Algorithm that produces the exact output for a nearby input; relative perturbation $O(u)$ |
| Condition number | $\kappa(A) = \lVert A\rVert\,\lVert A^{-1}\rVert$; amplification factor from input to output perturbations |
| Fill-in | New nonzero entries created in $L$ and $U$ during sparse LU factorization where $A$ had zeros |
| Forward error | $\lVert \hat{x} - x\rVert / \lVert x\rVert$; how far the computed answer is from the true answer |
| Growth factor | $\rho = \max_{i,j,k} \lvert a_{ij}^{(k)}\rvert / \max_{i,j} \lvert a_{ij}\rvert$; element growth during elimination |
| Krylov subspace | $\mathcal{K}_k(A, b) = \mathrm{span}\{b, Ab, \dots, A^{k-1}b\}$; the search space of Krylov iterative methods |
| Partial pivoting | Row-swapping strategy that maximizes each pivot; ensures $\lvert l_{ij}\rvert \le 1$ |
| Preconditioner | Matrix $M$ chosen so that $M^{-1}A$ has better conditioning than $A$ |
| Rayleigh quotient | $R(x) = x^T A x / x^T x$; estimates eigenvalues from approximate eigenvectors |
| Spectral radius | $\rho(A) = \max_i \lvert \lambda_i \rvert$; governs convergence of stationary iterations |
| BLAS | Basic Linear Algebra Subprograms; standardized interface for vector/matrix operations |
| LAPACK | Linear Algebra PACKage; implements factorizations, solvers, eigenvalue problems |
| cuBLAS | CUDA implementation of BLAS for NVIDIA GPUs; used in PyTorch, TensorFlow |
Appendix Y: Extended Exercises with Solutions
Y.1 The Leaning Matrix Problem (**)
Consider the $3 \times 2$ matrix (a Läuchli-type example)
$$A = \begin{pmatrix} 1 & 1 \\ \varepsilon & 0 \\ 0 & \varepsilon \end{pmatrix}, \qquad \varepsilon = 10^{-8}, \qquad b = \begin{pmatrix} 1 \\ \varepsilon \\ \varepsilon \end{pmatrix}.$$
(a) Compute $A^T A$ and observe that it rounds to the singular matrix $\begin{pmatrix}1 & 1\\ 1 & 1\end{pmatrix}$ in fp64, because $\varepsilon^2 = 10^{-16}$ is below machine epsilon. (b) Conclude that the Normal equations fail: $\kappa(A^T A) = \kappa(A)^2 \approx 10^{16}$. (c) Show that a backward-stable least-squares solve (QR/SVD) recovers the correct solution $x \approx (0.5,\, 0.5)$.
Solution:
eps = 1e-8
A = np.array([[1, 1], [eps, 0], [0, eps]])
AtA = A.T @ A
print(f"A^T A = {AtA}")
# [[1, 1], [1, 1]] - singular! eps^2 = 1e-16 < machine_eps = 2.2e-16
b = np.array([1, eps, eps])
# Normal equations: singular system!
try:
x_normal = np.linalg.solve(AtA, A.T @ b)
print(f"Normal: {x_normal}")
except np.linalg.LinAlgError:
print("Normal equations: singular!")
# QR: works correctly
x_qr, _, _, _ = np.linalg.lstsq(A, b, rcond=None)
print(f"QR: {x_qr}") # Approximately [0.5, 0.5]
print(f"Residual QR: {np.linalg.norm(A @ x_qr - b):.2e}")
Y.2 Ill-Conditioned System via Scaling (*)
Two matrices that are mathematically equivalent but numerically different:
# Original: well-conditioned
A1 = np.array([[1.0, 0.5], [0.5, 1.0]])
b1 = np.array([1.0, 0.0])
# Scaled: same equations, different numerics
D = np.diag([1e8, 1.0])
A2 = D @ A1 @ D # Equivalent to A1 up to scaling
b2 = D @ b1
print(f"kappa(A1) = {np.linalg.cond(A1):.2f}") # ~ 3
print(f"kappa(A2) = {np.linalg.cond(A2):.2e}") # ~ 3e16!
x1 = np.linalg.solve(A1, b1)
x2 = np.linalg.solve(A2, b2)
# Transform x2 back to same scale as x1
x2_transformed = D @ x2
print(f"x1 = {x1}")
print(f"x2 (transformed) = {x2_transformed}")
print(f"Error: {np.linalg.norm(x1 - x2_transformed):.2e}")
Lesson: Equilibrate (scale) badly scaled systems before solving. LAPACK's expert drivers (e.g., dgesvx with FACT='E') can equilibrate automatically, but the default NumPy/SciPy solve path does not, so apply the scaling yourself when row/column magnitudes differ by many orders of magnitude (a minimal sketch follows).
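Continuing the snippet above, a minimal explicit equilibration (symmetric scaling by the diagonal) restores the original conditioning:

```python
# Symmetric diagonal equilibration of the badly scaled system A2 x = b2 from above
s = 1.0 / np.sqrt(np.abs(np.diag(A2)))       # scale factors from the diagonal
A2_eq = A2 * np.outer(s, s)                  # D_s A2 D_s
b2_eq = s * b2
print(f"kappa(A2_eq) = {np.linalg.cond(A2_eq):.2f}")   # back to ~3
y = np.linalg.solve(A2_eq, b2_eq)
x2_eq = s * y                                # undo the column scaling
print(f"Error vs x1 (after transform): {np.linalg.norm(x1 - D @ x2_eq):.2e}")
```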
Y.3 Conjugate Gradient Rate Verification (**)
Verify the CG convergence bound experimentally:
def cg_solve(A, b, tol=1e-12, max_iter=1000):
n = len(b)
x = np.zeros(n)
r = b.copy()
p = r.copy()
rs_old = r @ r
errors = [np.linalg.norm(r)]
for _ in range(max_iter):
Ap = A @ p
alpha = rs_old / (p @ Ap)
x += alpha * p
r -= alpha * Ap
rs_new = r @ r
errors.append(np.sqrt(rs_new))
if np.sqrt(rs_new) < tol * np.linalg.norm(b):
break
p = r + (rs_new / rs_old) * p
rs_old = rs_new
return x, errors
# Create SPD matrix with known condition number
kappa = 100
n = 50
lam = np.concatenate([np.linspace(1, kappa, n//2),
np.linspace(kappa, kappa**0.5, n//2)])
Q, _ = np.linalg.qr(np.random.randn(n, n))
A = Q @ np.diag(lam) @ Q.T
b = np.random.randn(n)
x_true = np.linalg.solve(A, b)
x, errors = cg_solve(A, b)
errors_norm = [e/errors[0] for e in errors]
# Theoretical bound 2*((sqrt(kappa)-1)/(sqrt(kappa)+1))**k (an A-norm error bound;
# the relative residual is used as a rough proxy here)
bound = [2 * ((np.sqrt(kappa)-1)/(np.sqrt(kappa)+1))**k
for k in range(len(errors))]
print("CG convergence (iteration : relative residual : bound):")
for k in [0, 5, 10, 20, 30]:
if k < len(errors):
print(f" k={k:3d}: {errors_norm[k]:.4e} (bound: {bound[k]:.4e})")
Appendix Z: Chapter Summary and Self-Assessment
After completing this section, you should be able to:
Conceptual understanding:
- Explain why backward stability is the right standard for numerical algorithms
- State the fundamental theorem relating forward error, backward error, and condition number
- Explain why GEPP is stable in practice despite a theoretical worst-case growth factor of $2^{n-1}$
- Explain why QR via Householder is preferred over Normal equations for least squares
- State the CG convergence rate and its dependence on $\sqrt{\kappa(A)}$
- Explain what a Krylov subspace is and why iterative methods based on it are optimal
Computational skills:
- Implement LU factorization with partial pivoting from scratch
- Implement Householder QR from scratch
- Implement the Conjugate Gradient algorithm
- Choose the right solver for a given problem (dense vs sparse, symmetric vs general)
- Diagnose ill-conditioning using condition numbers and error estimates
- Apply diagonal preconditioning to improve CG convergence
AI connections:
- Explain how Adam implements diagonal preconditioning
- Explain why attention uses $1/\sqrt{d_k}$ scaling
- Describe the numerical approach in FlashAttention
- Connect LoRA to the low-rank approximation theory
- Explain how K-FAC uses Kronecker-factored matrices for efficient second-order optimization
Difficulty distribution: This material spans from algorithmic implementation (*) through convergence analysis (**) to research-level connections like K-FAC and FlashAttention (***). The core material (Sections 1-6) is accessible to anyone who has completed Section02-Linear-Algebra-Basics; the advanced applications (Sections 7-8) benefit from additional exposure to optimization and deep learning.
Appendix AA: Numerical Linear Algebra in Modern GPU Architectures
AA.1 NVIDIA GPU Linear Algebra Pipeline
Modern GPU architectures have specialized hardware for linear algebra:
Tensor Cores (Volta/Ampere/Hopper):
- Compute $D = A B + C$ where $A$, $B$, $C$, $D$ are small matrices (e.g., $4 \times 4$)
- One Tensor Core instruction performs a full $4 \times 4 \times 4$ matrix multiply-accumulate in roughly one cycle
- Peak performance: A100 GPU achieves 312 TFLOPS for bf16 matmul
- Key property: the matmul is performed in tf32 (10-bit mantissa) but accumulated in fp32
Implications for numerical accuracy:
- A100 performs matmul blocks in tf32 precision
- Accumulation in fp32 prevents error accumulation within each block
- For very large matrices, the error per output element still grows with the number of block-level accumulations, but more slowly than with naive sequential summation, because of the hierarchical (blocked) accumulation
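A rough CPU-side illustration of the accumulation-length effect (NumPy's fp32 BLAS rather than Tensor Cores, so the constants differ, but the error still grows with the inner dimension K):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512
for K in (64, 512, 4096):
    A = rng.standard_normal((n, K)).astype(np.float32)
    B = rng.standard_normal((K, n)).astype(np.float32)
    ref = A.astype(np.float64) @ B.astype(np.float64)      # fp64 reference
    err = np.abs(A @ B - ref).max() / np.abs(ref).max()    # fp32 matmul error
    print(f"inner dimension K = {K:5d}: max relative error = {err:.2e}")
```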
cuBLAS implementation of GEMM: the doubly-blocked matmul (illustrative tile sizes):
Block A in shared memory: 128x16
Block B in shared memory: 16x128
Each thread computes: 4x4 sub-block of output
Register tiles: 4 registers for A-slice, 4 for B-slice
Result accumulated in: 16 registers (fp32)
AA.2 Memory Hierarchy and Algorithm Design
| Level | Size | Bandwidth | Latency |
|---|---|---|---|
| Registers | 256 KB/SM | \infty | 1 cycle |
| L1/Shared | 96 KB/SM | 128 TB/s | 20 cycles |
| L2 cache | 40 MB | 12 TB/s | 200 cycles |
| HBM (GPU DRAM) | 80 GB | 2 TB/s | 500 cycles |
| PCIe (CPU<->GPU) | - | 64 GB/s | ms |
Roofline model: an algorithm is compute-bound when its arithmetic intensity (flops per byte of data moved) exceeds the hardware's peak-flops-to-bandwidth ratio, and memory-bound otherwise.
- Dense matmul ($n \times n$): $\approx 2n^3$ flops vs. $O(n^2)$ data -> arithmetic intensity $O(n)$ -> compute-bound
- SpMV ($n \times n$ sparse): $\approx 2\,\mathrm{nnz}$ flops vs. $O(\mathrm{nnz})$ data -> arithmetic intensity $O(1)$ -> memory-bound
- FlashAttention: blocks the computation so the $n \times n$ attention matrix is never written to HBM, cutting HBM traffic dramatically and shifting attention from memory-bound toward compute-bound
AA.3 Mixed Precision in Practice
A100 Tensor Core configurations:
| Input dtype | Accumulate dtype | Peak TFLOPs | Use case |
|---|---|---|---|
| fp64 | fp64 | 19.5 | Scientific computing |
| fp32 | fp32 | 19.5 | Training (non-Tensor Core) |
| tf32 | fp32 | 156 | Training (Tensor Core, auto) |
| bf16 | fp32 | 312 | LLM training |
| fp16 | fp32 | 312 | Computer vision |
| fp8 | fp32 | 624 | LLM inference, H100+ |
| int8 | int32 | 624 | Quantized inference |
PyTorch automatic mixed precision (AMP):
# Enable TF32 Tensor Core matmuls (the default has varied across PyTorch versions)
torch.backends.cuda.matmul.allow_tf32 = True

# Mixed precision with bf16: no loss scaling needed (bf16 has fp32's exponent range)
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    output = model(input)            # forward pass runs matmuls in bf16
    loss = criterion(output, target)
loss.backward()                      # backward outside autocast; fp32 parameters get fp32 grads
optimizer.step()
optimizer.zero_grad()

# With fp16 instead of bf16, wrap backward/step in GradScaler to avoid gradient underflow:
#   scaler = torch.cuda.amp.GradScaler()
#   scaler.scale(loss).backward(); scaler.step(optimizer); scaler.update()
Appendix BB: Historical Notes on Algorithm Development
The development of numerical linear algebra as a field was driven by the first digital computers (ENIAC, 1945; ACE, 1950; UNIVAC, 1951). Gaussian elimination had been used by hand for centuries, but the question of whether it would work reliably on early (extremely noisy, vacuum-tube-based) computers was open.
The stability debate (1947-1954):
- Von Neumann and Goldstine (1947) analyzed Gaussian elimination at a time when the best-known worst-case error bounds (such as Hotelling's $4^{n-1}$) grew exponentially with $n$, suggesting the method might be dangerously unstable
- Turing (1948) introduced the condition number and argued that the growth was much more benign in practice
- Wilkinson (1954) proved that with partial pivoting the growth factor satisfies $\rho \le 2^{n-1}$ in the worst case but is almost always small in practice - vindicating GEPP as the standard method
Householder's contribution (1958): After Gram-Schmidt orthogonalization was found to be numerically unstable for ill-conditioned matrices, Householder derived the reflector-based QR that achieves backward stability. This became the standard approach for solving least squares problems.
The Lanczos problem (1950-1980): Lanczos (1950) proposed his tridiagonalization algorithm for symmetric eigenvalue problems, but it was found to be numerically unreliable - the computed Krylov vectors rapidly lost orthogonality. It took 30 years of research to understand when and why it works (Paige, 1980) and to develop practical implementations with selective reorthogonalization.
Modern era (1987-present): The LAPACK project standardized high-performance implementations of all major algorithms. The emergence of GPU computing (2007+) required adapting these algorithms for massively parallel hardware. The development of mixed-precision algorithms and iterative refinement has become critical for LLM training at scale.
Appendix CC: Practice Problems Bank
CC.1 True/False Questions
- Gaussian elimination with partial pivoting is always backward stable. (False - the growth factor can be $2^{n-1}$, but this almost never happens in practice)
- The condition number of $A^T A$ equals $\kappa(A)^2$ (in the 2-norm). (True)
- CG converges in at most $n$ steps in exact arithmetic. (True)
- A backward stable algorithm always produces a small forward error. (False - the condition number can amplify the backward error)
- Householder QR is always preferred over MGS for orthogonalization. (False - MGS is preferred for streaming applications)
- The Cholesky factorization is unique for SPD matrices. (True, with positive diagonal)
- For SPD matrices, LU and Cholesky give the same factorization. (False - LU has the $A = LU$ form with a unit-diagonal, non-symmetric pair of factors; Cholesky has the symmetric $A = LL^T$ form)
- GMRES always finds the solution in at most $n$ iterations. (True in exact arithmetic; in practice, one restarts long before $n$)
- Diagonal scaling cannot change the condition number of a matrix. (False - scaling dramatically affects condition number)
- Adam's $\sqrt{\hat{v}_t} + \epsilon$ denominator is a form of diagonal preconditioning. (True)
CC.2 Fill in the Blanks
- The backward error for GEPP satisfies $(A + \Delta A)\hat{x} = b$ with $\|\Delta A\| \le \underline{\hspace{1.5em}} \cdot u\,\|A\|$, where the blank depends on the problem size and growth factor.
- The CG algorithm requires the matrix to be _____ positive definite.
- The _____ algorithm reduces any symmetric matrix to tridiagonal form as a preprocessing step for eigenvalue computation.
- FlashAttention avoids materializing the attention matrix by using a _____ algorithm.
- Modified Gram-Schmidt achieves _____ loss of orthogonality, better than Classical Gram-Schmidt's _____.
Answers: 1. a low-degree polynomial in $n$ times the growth factor, e.g. $c\,n^3\rho_n$; 2. symmetric; 3. Lanczos (or Householder tridiagonalization); 4. tiled; 5. $O(u\,\kappa(A))$ vs. $O(u\,\kappa(A)^2)$
CC.3 Short Derivations
- Derive the optimal step size for CG from first principles: minimize $\phi(x_k + \alpha p_k)$, where $\phi(x) = \tfrac{1}{2}x^T A x - b^T x$, over $\alpha \in \mathbb{R}$.
- Prove that $\kappa_2(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$ using the SVD $A = U\Sigma V^T$.
- Show that for any orthogonal matrix $Q$: $\|QA\|_2 = \|A\|_2$, and hence $\kappa_2(QA) = \kappa_2(A)$.
- Derive the Rayleigh quotient iteration from the Newton-Raphson method applied to finding an eigenvector.
This section is part of the Section10 Numerical Methods chapter. All algorithms presented are implemented in theory.ipynb with numerical verification.
<- Back to Section01 Floating-Point Arithmetic | Next: Section03 Numerical Optimization ->
Section statistics: 12 main sections, 28 subsections, 12 common mistakes, 10 exercises, 28 appendices. All algorithms are numerically verified in the companion notebook.
End of Section02 Numerical Linear Algebra