Math for LLMs: Numerical Methods

Numerical Optimization

Notes

"In theory there is no difference between theory and practice. In practice there is." - Yogi Berra

Overview

Numerical optimization is the computational science of finding minima (or maxima) of functions. While Section08-Optimization covered the mathematical theory of gradient-based methods, convergence analysis, and adaptive learning rates, this section focuses on the numerical implementation aspects: how floating-point arithmetic affects algorithm behavior, how to implement robust line search, how to exploit problem structure, and how the algorithms used in deep learning trace their lineage to classical numerical optimization.

The central insight: every optimization algorithm is implicitly solving a sequence of linear systems. Understanding this connection - and how the numerical properties of those systems determine convergence - is the key to understanding why Adam works, why Newton's method is quadratically convergent, and why gradient descent can stall.

Scope note: Gradient descent theory, convergence rates, and adaptive learning rates are fully covered in Section08-Optimization. This section covers the numerical implementation aspects that Section08 does not address in depth: line search implementations, trust region methods, quasi-Newton methods (L-BFGS), second-order methods, and the floating-point behavior of optimization algorithms.

Prerequisites

Companion Notebooks

| Notebook | Description |
|---|---|
| theory.ipynb | Line search, Newton's method, L-BFGS, trust region, convergence analysis |
| exercises.ipynb | 10 graded exercises on optimization algorithms and their numerical properties |

Learning Objectives

After completing this section, you will:

  • Implement Wolfe condition line search and explain why it's needed for convergence
  • Explain why Newton's method is quadratically convergent and when it fails numerically
  • Implement L-BFGS and explain the two-loop recursion
  • Understand trust region methods as an alternative to line search
  • Analyze the effect of condition number on gradient descent convergence
  • Connect Adam's update rule to diagonal quasi-Newton methods
  • Implement numerical differentiation and analyze its accuracy
  • Explain why finite differences have an optimal step size



1. Intuition

Numerical optimization differs from mathematical optimization in one fundamental way: we compute with floating-point numbers, not exact reals. This has profound consequences:

The gradient is never exact: Even with automatic differentiation, the computed gradient $\hat{g}$ satisfies $\hat{g} = g + O(\varepsilon_{\text{mach}} \|g\|)$. For well-conditioned objectives, this is fine. For ill-conditioned objectives near a minimum, gradients can be numerically zero before we actually reach the minimum.

The Hessian is expensive and often unavailable: Computing $H \in \mathbb{R}^{d \times d}$ requires $O(d^2)$ memory and $d$ Hessian-vector products to form explicitly. For $d = 10^8$ (typical LLM size), storing $H$ would take $\sim 10^{16}$ entries - impossible. This is why first-order and quasi-Newton methods dominate.

Ill-conditioning slows convergence: Gradient descent on a quadratic $f(x) = \frac{1}{2} x^\top H x$ converges at rate $\left(\frac{\kappa-1}{\kappa+1}\right)^k$, which approaches 1 as $\kappa(H)$ grows - painfully slow for ill-conditioned problems. Preconditioning (approximating $H^{-1}$) is the key to faster convergence.

The minimum is almost never reached: Gradient descent converges to a point where $\|g\| < \varepsilon$ for some tolerance $\varepsilon$. In deep learning, we stop early (for generalization, not just optimization), so the numerical optimization problem is always underdetermined by design.


2. Line Search Methods

A line search determines how far to move in a given direction $p_k$:

$$x_{k+1} = x_k + \alpha_k p_k$$

The step size $\alpha_k$ must be chosen carefully: too small -> slow convergence; too large -> divergence or oscillation.

2.1 Exact Line Search

Exact line search minimizes $\phi(\alpha) = f(x_k + \alpha p_k)$ exactly:

$$\alpha_k = \arg\min_{\alpha > 0} \phi(\alpha)$$

For a quadratic objective $f(x) = \frac{1}{2} x^\top H x - b^\top x$, the exact step is:

$$\alpha_k = \frac{p_k^\top r_k}{p_k^\top H p_k}, \quad r_k = b - Hx_k$$

This is exactly the CG step size - CG with the steepest descent direction is "exact line search gradient descent."

Limitation: For general nonlinear $f$, finding the exact minimum of $\phi(\alpha)$ requires many function evaluations. Inexact line search conditions (Wolfe, Armijo) are far more practical.
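A minimal NumPy sketch of this step-length formula, used as steepest descent with exact line search on a small made-up quadratic (the matrix H and vector b are illustrative, not from the text):

import numpy as np

# Illustrative 2-variable quadratic f(x) = 0.5 x^T H x - b^T x
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

x = np.zeros(2)
for k in range(20):
    r = b - H @ x                    # residual = -gradient of f
    p = r                            # steepest descent direction
    alpha = (p @ r) / (p @ H @ p)    # exact line search step from the formula above
    x = x + alpha * p

print(x, np.linalg.solve(H, b))      # iterates approach the minimizer H^{-1} b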

2.2 Wolfe Conditions

The Wolfe conditions define a set of acceptable step sizes that ensure:

  1. Sufficient decrease (Armijo condition):
$$f(x_k + \alpha_k p_k) \leq f(x_k) + c_1 \alpha_k \nabla f_k^\top p_k$$
  2. Curvature condition:
$$\nabla f(x_k + \alpha_k p_k)^\top p_k \geq c_2 \nabla f_k^\top p_k$$

with $0 < c_1 < c_2 < 1$. Typical values: $c_1 = 10^{-4}$, with $c_2 = 0.9$ for Newton and quasi-Newton methods (e.g., BFGS) and $c_2 = 0.1$ for nonlinear conjugate gradient.

Strong Wolfe conditions replace the curvature condition with:

$$|\nabla f(x_k + \alpha_k p_k)^\top p_k| \leq c_2 |\nabla f_k^\top p_k|$$

This rules out points where the directional derivative is positive and large in magnitude (i.e., points that have overshot a local minimizer along the search direction).

Theorem (Zoutendijk): If each $p_k$ is a descent direction ($p_k^\top g_k < 0$) and the Wolfe conditions are satisfied, then:

$$\sum_{k=0}^\infty \frac{(\nabla f_k^\top p_k)^2}{\|p_k\|^2} < \infty$$

This guarantees $\liminf_{k \to \infty} \|\nabla f_k\| = 0$ - the gradients eventually become small.

2.3 Backtracking Armijo

The simplest practical line search:

def backtracking_armijo(f, g, x, p, alpha0=1.0, rho=0.5, c1=1e-4):
    """Backtracking line search satisfying Armijo condition."""
    alpha = alpha0
    f0 = f(x)
    slope = g(x) @ p  # Directional derivative (should be negative)

    while f(x + alpha * p) > f0 + c1 * alpha * slope:
        alpha *= rho

    return alpha

Analysis: The Armijo condition ensures "sufficient decrease" - each step reduces $f$ by at least a fraction of what a linear model predicts. Backtracking automatically shrinks the step when the initial trial step is too large.

For AI: PyTorch's torch.optim.LBFGS uses a strong Wolfe line search. Standard SGD/Adam do not use line search - the learning rate is a fixed hyperparameter.

2.4 Strong Wolfe and Zoom

The zoom algorithm finds a step satisfying the strong Wolfe conditions:

  1. Bracket phase: Find an interval $[\alpha_{\text{lo}}, \alpha_{\text{hi}}]$ that must contain a Wolfe-satisfying step. This is done by starting from $\alpha = 0$, trying $\alpha = \alpha_{\text{init}}$, and doubling until either the Armijo condition fails or the slope at the right endpoint becomes positive.

  2. Zoom phase: Bisect (or use cubic/quadratic interpolation) within the bracket until a strong Wolfe point is found.

Implementation: scipy.optimize.line_search (a wrapper around the internal line_search_wolfe2 routine) implements this algorithm. It is the standard choice for L-BFGS implementations.
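A hedged usage sketch of SciPy's Wolfe line search on a Rosenbrock-type test function (the function, gradient, and starting point are illustrative):

import numpy as np
from scipy.optimize import line_search

def f(x):
    return 100 * (x[1] - x[0]**2)**2 + (1 - x[0])**2

def grad_f(x):
    return np.array([-400 * x[0] * (x[1] - x[0]**2) - 2 * (1 - x[0]),
                      200 * (x[1] - x[0]**2)])

x0 = np.array([-1.2, 1.0])
p = -grad_f(x0)                     # steepest descent direction

# Returns a step length satisfying the strong Wolfe conditions (alpha is None on failure)
alpha, fc, gc, new_f, old_f, new_slope = line_search(f, grad_f, x0, p, c1=1e-4, c2=0.9)
print("alpha =", alpha, "f(x0 + alpha*p) =", new_f)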


3. Newton's Method

Newton's method for minimizing $f: \mathbb{R}^d \to \mathbb{R}$ applies Newton's root-finding iteration to the gradient equation $\nabla f(x) = 0$:

$$x_{k+1} = x_k - H_k^{-1} \nabla f_k$$

where $H_k = \nabla^2 f(x_k)$ is the Hessian.

3.1 Quadratic Convergence

Theorem: If $f$ is twice continuously differentiable, $H^* = \nabla^2 f(x^*)$ is positive definite at the minimum $x^*$, and $x_0$ is sufficiently close to $x^*$, then Newton's method converges quadratically:

$$\|x_{k+1} - x^*\| \leq c \|x_k - x^*\|^2$$

Intuition: Each step minimizes the quadratic model $m_k(p) = f_k + g_k^\top p + \frac{1}{2} p^\top H_k p$, which is exact to second order. Quadratic convergence means errors shrink quadratically: if $\|e_k\| = 10^{-3}$, then $\|e_{k+1}\| \approx 10^{-6}$, $\|e_{k+2}\| \approx 10^{-12}$, etc.

Cost per iteration: Solving $H_k d_k = -g_k$ requires $O(d^3)$ flops with a direct Cholesky/LU factorization (roughly $O(d^2)$ per iteration if an iterative solver is used and $H_k$ is SPD and well-conditioned), plus $O(d^2)$ work and memory to form the Hessian.

For AI: Newton's method is impractical for modern LLMs ($d \sim 10^{10}$). However, it motivates all quasi-Newton methods, which approximate $H_k^{-1}$ cheaply.

3.2 Modified Newton for Non-Convex

At non-convex points (saddle points or near maxima), $H_k$ may be indefinite. The Newton step $d = -H^{-1}g$ can then move toward a maximum or saddle, not a minimum.

Strategy 1: Hessian modification. Replace $H_k$ with $H_k + \tau_k I$, where $\tau_k$ is chosen to make the modified Hessian positive definite. This is the Levenberg-Marquardt damping idea.

The modified Newton step solves:

$$(H_k + \tau_k I) d_k = -g_k$$

Choosing $\tau_k$:

  • If $H_k$ is already positive definite with $\lambda_{\min}(H_k) > \delta$: set $\tau_k = 0$.
  • Otherwise: set $\tau_k = |\lambda_{\min}(H_k)| + \delta$ for some $\delta > 0$.

Strategy 2: Modified Cholesky. Factor $H_k + E_k = L_k D_k L_k^\top$, where $E_k$ is chosen minimally to make $H_k + E_k$ positive definite. The Gill-Murray-Wright modified Cholesky algorithm does this robustly.
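A minimal NumPy sketch of Strategy 1 (eigenvalue-based damping); the toy Hessian and the choice of delta below are illustrative:

import numpy as np

def damped_newton_step(H, g, delta=1e-2):
    """Newton step with the shift tau*I chosen from the smallest eigenvalue of H."""
    lam_min = np.linalg.eigvalsh(H)[0]                 # smallest eigenvalue (H symmetric)
    tau = 0.0 if lam_min > delta else abs(lam_min) + delta
    d = np.linalg.solve(H + tau * np.eye(len(g)), -g)  # modified Newton step
    return d, tau

# Indefinite toy Hessian (a saddle): the raw Newton step would not be a descent direction
H = np.array([[2.0, 0.0],
              [0.0, -1.0]])
g = np.array([1.0, 1.0])
d, tau = damped_newton_step(H, g)
print("tau =", tau, "descent direction:", d @ g < 0)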

3.3 Inexact Newton Methods

For large-scale problems, solving $H_k d_k = -g_k$ exactly is too expensive. Inexact Newton methods find $d_k$ such that:

$$\|H_k d_k + g_k\| \leq \eta_k \|g_k\|$$

where $\eta_k \in (0, 1)$ is the forcing sequence. Using CG to solve the linear system approximately gives Newton-CG (truncated Newton):

for outer iteration k:
    solve H_k d = -g_k using CG, stopping when
    ||H_k d + g_k|| <= eta_k ||g_k||
    x_{k+1} = x_k + alpha_k d_k  (line search for alpha_k)

Convergence: If $\eta_k \to 0$ fast enough (e.g., $\eta_k = \min(0.5, \sqrt{\|\nabla f_k\|})$), inexact Newton converges super-linearly. If $\eta_k$ is kept constant, convergence is linear with rate $\eta$.

For AI: Inexact Newton methods (Hessian-free optimization) were used to train deep networks before SGD with momentum and Adam became dominant (Martens, 2010). The Hessian-vector product $Hv$ can be computed in $O(d)$ time via automatic differentiation (Pearlmutter's trick), making each CG iteration cost $O(d)$ without ever forming $H$.
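A minimal NumPy sketch of the truncated-CG inner solve with the forcing-term stopping rule; the toy Hessian, gradient, and eta value are illustrative, and a real implementation would supply autodiff Hessian-vector products:

import numpy as np

def newton_cg_direction(hvp, g, eta):
    """Approximately solve H d = -g by CG, stopping once ||H d + g|| <= eta * ||g||.

    hvp(v) must return the Hessian-vector product H v; H itself is never formed.
    """
    d = np.zeros_like(g)
    r = -g.copy()                      # residual of H d = -g at d = 0
    p = r.copy()
    tol = eta * np.linalg.norm(g)
    while np.linalg.norm(r) > tol:
        Hp = hvp(p)
        curv = p @ Hp
        if curv <= 0:                  # negative curvature: stop with the current direction
            return d if d.any() else -g
        alpha = (r @ r) / curv
        d = d + alpha * p
        r_new = r - alpha * Hp
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return d

# Toy quadratic: the Hessian-vector product is just a matrix-vector product
H = np.diag([1.0, 10.0, 100.0])
g = np.array([1.0, 1.0, 1.0])
d = newton_cg_direction(lambda v: H @ v, g, eta=0.1)
print(np.linalg.norm(H @ d + g) <= 0.1 * np.linalg.norm(g))   # True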


4. Quasi-Newton Methods

Quasi-Newton methods build up an approximation $B_k \approx H_k$ (or of the inverse $H_k^{-1}$) using gradient information from each step, without forming the Hessian explicitly.

4.1 BFGS Update

The Broyden-Fletcher-Goldfarb-Shanno (BFGS) update is the most successful quasi-Newton method.

Setup: After step $k$, define:

$$s_k = x_{k+1} - x_k \quad \text{(step taken)}, \qquad y_k = \nabla f_{k+1} - \nabla f_k \quad \text{(gradient change)}$$

Secant equation: The updated Hessian approximation $B_{k+1}$ should satisfy:

$$B_{k+1} s_k = y_k \quad \text{(secant condition)}$$

This is analogous to a finite difference approximation of the Hessian.

BFGS update (inverse Hessian form):

$$H_{k+1} = \left(I - \rho_k s_k y_k^\top\right) H_k \left(I - \rho_k y_k s_k^\top\right) + \rho_k s_k s_k^\top$$

where $\rho_k = 1 / (y_k^\top s_k)$.

Properties:

  • $H_{k+1}$ is symmetric positive definite if $H_k$ is SPD and $y_k^\top s_k > 0$
  • $H_{k+1}$ satisfies the secant condition $H_{k+1} y_k = s_k$
  • BFGS converges super-linearly (between linear and quadratic) for smooth strongly convex functions

Curvature condition: $y_k^\top s_k > 0$ is guaranteed when using a Wolfe line search (which is why the Wolfe conditions are essential for BFGS). Without it, the update can make $H_{k+1}$ indefinite.
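A minimal NumPy sketch of the inverse-Hessian BFGS update above, with a skip rule for the curvature condition; the synthetic test data is illustrative:

import numpy as np

def bfgs_update(H, s, y, curvature_tol=1e-10):
    """Inverse-Hessian BFGS update for one (s, y) pair; skips the update if y^T s <= 0."""
    ys = y @ s
    if ys <= curvature_tol:
        return H                            # skip: update would destroy positive definiteness
    rho = 1.0 / ys
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)            # I - rho s y^T
    return V @ H @ V.T + rho * np.outer(s, s)

# Quick check of the secant condition H_{k+1} y_k = s_k on synthetic data
rng = np.random.default_rng(0)
s = rng.standard_normal(5)
y = np.diag([1.0, 2.0, 3.0, 4.0, 5.0]) @ s   # y = A s with A SPD, so y^T s > 0
H_new = bfgs_update(np.eye(5), s, y)
print(np.allclose(H_new @ y, s))             # True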

4.2 L-BFGS: Limited Memory BFGS

Full BFGS requires $O(d^2)$ memory to store $H_k$. For $d = 10^6$, that is $\sim 8$ TB - impractical.

L-BFGS stores only the last $m$ pairs $(s_k, y_k)$ and implicitly computes $H_k g$ via the two-loop recursion:

import numpy as np

def lbfgs_direction(g, s_list, y_list):
    """Compute H^{-1} g using the two-loop L-BFGS recursion."""
    m = len(s_list)
    rho = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]

    q = g.copy()
    alpha = np.zeros(m)

    # First loop (backward)
    for i in range(m-1, -1, -1):
        alpha[i] = rho[i] * s_list[i] @ q
        q -= alpha[i] * y_list[i]

    # Scale by initial Hessian approximation
    # H_0^{-1} = (s_m^T y_m) / (y_m^T y_m) * I (Barzilai-Borwein scaling)
    if m > 0:
        gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    else:
        gamma = 1.0
    r = gamma * q

    # Second loop (forward)
    for i in range(m):
        beta = rho[i] * y_list[i] @ r
        r += s_list[i] * (alpha[i] - beta)

    return -r  # Search direction

Memory requirement: $O(md)$ - just $2m$ vectors of dimension $d$. Typically $m = 5$ to $20$.

Convergence: L-BFGS keeps curvature information from only the last $m$ steps. Its guaranteed rate is linear (super-linear convergence requires the full BFGS history), but with a modest $m$ it is usually dramatically faster than gradient descent; for general non-convex $f$ only convergence to stationary points is guaranteed.

For AI: torch.optim.LBFGS implements L-BFGS. It is used for full-batch training and fine-tuning tasks where the full gradient is available and noise-free. For stochastic (mini-batch) settings, L-BFGS is not directly applicable because noisy gradients break the curvature condition $y_k^\top s_k > 0$.

4.3 SR1 Update

The Symmetric Rank-1 (SR1) update is simpler:

$$B_{k+1} = B_k + \frac{(y_k - B_k s_k)(y_k - B_k s_k)^\top}{(y_k - B_k s_k)^\top s_k}$$

Advantage: SR1 can capture indefinite Hessians (saddle points), unlike BFGS (which always produces SPD updates).

Disadvantage: Not guaranteed to remain positive definite; can fail numerically if $(y_k - B_k s_k)^\top s_k \approx 0$.
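A minimal NumPy sketch of the SR1 update with the standard safeguard that skips the update when the denominator is too small:

import numpy as np

def sr1_update(B, s, y, r=1e-8):
    """Symmetric rank-1 update of a Hessian approximation B, with a skip rule."""
    v = y - B @ s
    denom = v @ s
    # Skip when |denom| is tiny relative to ||v|| ||s||: the update would blow up numerically
    if abs(denom) < r * np.linalg.norm(v) * np.linalg.norm(s):
        return B
    return B + np.outer(v, v) / denom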


5. Trust Region Methods

Trust region methods take a different approach: instead of choosing a direction and then a step size (line search), they first fix a maximum step length (the radius) and then find the best step within that radius.

5.1 Trust Region Framework

At each iteration, define a trust region radius $\Delta_k$ and solve:

$$\min_{p} \; m_k(p) = f_k + g_k^\top p + \frac{1}{2} p^\top B_k p \quad \text{subject to} \quad \|p\| \leq \Delta_k$$

where $B_k$ is an approximation to the Hessian.

Ratio test: After computing the step $p_k$, compare the actual reduction to the predicted reduction:

$$\rho_k = \frac{f_k - f(x_k + p_k)}{m_k(0) - m_k(p_k)} = \frac{\text{actual reduction}}{\text{predicted reduction}}$$

Update rules:

  • $\rho_k < 0.25$: shrink the trust region ($\Delta_{k+1} = 0.25 \Delta_k$); reject the step
  • $0.25 \leq \rho_k < 0.75$: keep the trust region; accept the step
  • $\rho_k \geq 0.75$: expand the trust region ($\Delta_{k+1} = 2\Delta_k$); accept the step

5.2 Cauchy Point and Dogleg

The exact trust region subproblem amounts to an eigenvalue problem that costs $O(d^3)$. Cheaper approximations:

Cauchy point: The point that minimizes $m_k$ along the steepest descent direction, subject to the trust region constraint:

$$p^C = -\alpha^C g_k, \quad \alpha^C = \begin{cases} \Delta_k / \|g_k\| & \text{if } g_k^\top B_k g_k \leq 0 \\ \min\left(\dfrac{\|g_k\|^2}{g_k^\top B_k g_k}, \dfrac{\Delta_k}{\|g_k\|}\right) & \text{otherwise} \end{cases}$$

The Cauchy point gives linear convergence (same as gradient descent with exact line search).

Dogleg method: Interpolate between the Cauchy point and the unconstrained Newton step $p^N = -B_k^{-1} g_k$:

  • If $\|p^N\| \leq \Delta_k$: take the Newton step
  • Else: move along the path $p^C \to p^N$ until hitting the trust region boundary

This gives super-linear convergence near the optimum.
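A minimal NumPy sketch of one trust-region iteration that combines the Cauchy point with the ratio test of Section 5.1; the thresholds follow the update rules above, and the function handles f and grad_f are assumed to be NumPy-compatible:

import numpy as np

def cauchy_point(g, B, Delta):
    """Minimizer of the quadratic model along -g, restricted to the ball of radius Delta."""
    gBg = g @ B @ g
    g_norm = np.linalg.norm(g)
    alpha = Delta / g_norm if gBg <= 0 else min(g_norm**2 / gBg, Delta / g_norm)
    return -alpha * g

def trust_region_step(f, grad_f, B, x, Delta):
    """One trust-region iteration using the Cauchy point and the ratio test."""
    g = grad_f(x)
    p = cauchy_point(g, B, Delta)
    predicted = -(g @ p + 0.5 * p @ B @ p)       # m(0) - m(p), positive for a descent step
    rho = (f(x) - f(x + p)) / predicted
    if rho < 0.25:
        return x, 0.25 * Delta                   # reject step, shrink radius
    if rho >= 0.75:
        Delta = 2.0 * Delta                      # very good model fit: expand radius
    return x + p, Delta                          # accept step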

5.3 Trust Region vs Line Search

| Property | Trust Region | Line Search |
|---|---|---|
| Convergence guarantee | Global (any Hessian approximation) | Requires a descent direction |
| Indefinite Hessian | Handled naturally | Needs modification |
| Cost per iteration | Higher (subproblem solve) | Lower (backtracking) |
| Parameter tuning | Trust region radius $\Delta$ | Learning rate $\alpha$ |
| Non-smooth objectives | Can be extended | Harder |

For AI: Trust region methods are used in reinforcement learning (TRPO, PPO), where the "trust region" constrains policy updates. The KL divergence constraint in TRPO plays the role of the trust region: $D_{\text{KL}}(\pi_{\theta_k} \| \pi_{\theta}) \leq \delta$.


6. Numerical Differentiation

When analytic gradients are unavailable (or to verify automatic differentiation), we use finite difference approximations.

6.1 Finite Differences

Forward difference:

$$\frac{\partial f}{\partial x_i} \approx \frac{f(x + h e_i) - f(x)}{h}$$

Error: $O(h) + O(\varepsilon_{\text{mach}} / h)$.

Centered difference:

$$\frac{\partial f}{\partial x_i} \approx \frac{f(x + h e_i) - f(x - h e_i)}{2h}$$

Error: $O(h^2) + O(\varepsilon_{\text{mach}} / h)$ - more accurate, but requires two function evaluations.

Richardson extrapolation: Eliminate the leading-order error term:

$$f'(x) \approx \frac{4 \, \text{CD}(h/2) - \text{CD}(h)}{3}$$

where $\text{CD}(h)$ denotes the centered difference with step $h$. Error: $O(h^4)$.
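A minimal NumPy sketch of the three approximations, compared against a known derivative (the test function sin(x) and the step sizes are illustrative):

import numpy as np

def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h

def centered_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

def richardson(f, x, h):
    # Cancels the O(h^2) term of the centered difference; truncation error drops to O(h^4)
    return (4 * centered_diff(f, x, h / 2) - centered_diff(f, x, h)) / 3

x, exact = 1.0, np.cos(1.0)          # derivative of sin at x = 1
for h in [1e-2, 1e-4, 1e-6]:
    print(h,
          abs(forward_diff(np.sin, x, h) - exact),
          abs(centered_diff(np.sin, x, h) - exact),
          abs(richardson(np.sin, x, h) - exact))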

6.2 Optimal Step Size

The total error of the centered difference is:

$$\text{Error}(h) = \underbrace{C_1 h^2}_{\text{truncation}} + \underbrace{C_2 \varepsilon_{\text{mach}} / h}_{\text{rounding}}$$

Minimizing over $h$: setting $\frac{d}{dh}\text{Error} = 2C_1 h - C_2 \varepsilon_{\text{mach}} / h^2 = 0$ gives:

$$h^* = \left(\frac{C_2 \varepsilon_{\text{mach}}}{2C_1}\right)^{1/3} \approx \varepsilon_{\text{mach}}^{1/3}$$

For fp64: $h^* \approx (2.2 \times 10^{-16})^{1/3} \approx 6 \times 10^{-6}$.

Minimum achievable error: $\text{Error}(h^*) \approx 3\,(C_1 C_2^2 \varepsilon_{\text{mach}}^2)^{1/3} \approx \varepsilon_{\text{mach}}^{2/3}$.

For the forward difference: $h^* \approx \varepsilon_{\text{mach}}^{1/2} \approx 1.5 \times 10^{-8}$, with minimum error $\approx \varepsilon_{\text{mach}}^{1/2}$.

6.3 Complex-Step Derivative

The complex-step method avoids cancellation entirely:

$$f'(x) \approx \frac{\text{Im}[f(x + ih)]}{h}$$

where $i = \sqrt{-1}$. This requires evaluating $f$ at a complex argument.

Error: $O(h^2)$ truncation - the same order as the centered difference, but with no rounding error from cancellation (the imaginary part is formed by additions, not by subtracting nearly equal numbers). With $h = 10^{-20}$, the truncation error is $\sim 10^{-40}$, so the result is accurate to essentially machine precision.

Limitation: Requires the function $f$ to be analytically extendable to complex arguments (which almost all smooth functions are, but branch cuts can cause issues).
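A minimal NumPy sketch; the test function is the one from Exercise 7, and its derivative is worked out by hand for comparison:

import numpy as np

def complex_step(f, x, h=1e-20):
    """Complex-step derivative: no subtractive cancellation, so h can be made tiny."""
    return np.imag(f(x + 1j * h)) / h

f = lambda x: np.sin(x**2) * np.exp(-x)
# f'(x) = (2x cos(x^2) - sin(x^2)) e^{-x}; evaluate at x = 1
exact = (2 * np.cos(1.0) - np.sin(1.0)) * np.exp(-1.0)
print(abs(complex_step(f, 1.0) - exact))   # on the order of machine epsilon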


7. Constrained Optimization Numerics

7.1 Projected Gradient

For optimization over a convex set $\mathcal{C}$ (e.g., the simplex, a box, or the unit ball):

$$x_{k+1} = \mathcal{P}_\mathcal{C}(x_k - \alpha_k \nabla f(x_k))$$

where $\mathcal{P}_\mathcal{C}(y) = \arg\min_{x \in \mathcal{C}} \|x - y\|_2$ is the Euclidean projection.

Common projections (sketched in code below):

  • Box $\{x : \ell \leq x \leq u\}$: $\mathcal{P}(y)_i = \min(\max(y_i, \ell_i), u_i)$ - $O(n)$
  • Simplex $\{x : x \geq 0, \sum_i x_i = 1\}$: sort and find a threshold - $O(n \log n)$
  • Ball $\{x : \|x\| \leq r\}$: $\mathcal{P}(y) = r y / \|y\|$ if $\|y\| > r$, else $y$ - $O(n)$

For AI: Gradient clipping is a projected gradient step onto the ball $\{g : \|g\| \leq c\}$. Softmax output can be seen as optimization over the simplex. Spectral normalization maintains weights on the unit spectral-norm ball via projected gradient.
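Minimal NumPy sketches of the three projections listed above (the simplex projection follows the standard sort-and-threshold algorithm):

import numpy as np

def project_box(y, lo, hi):
    """Euclidean projection onto the box {x : lo <= x <= hi} - O(n)."""
    return np.clip(y, lo, hi)

def project_ball(y, r):
    """Projection onto the Euclidean ball of radius r - O(n)."""
    norm = np.linalg.norm(y)
    return y if norm <= r else (r / norm) * y

def project_simplex(y):
    """Projection onto the probability simplex via sorting - O(n log n)."""
    u = np.sort(y)[::-1]                                  # sort in decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, len(y) + 1)
    rho = np.nonzero(u + (1 - css) / idx > 0)[0][-1]      # last index passing the threshold test
    theta = (css[rho] - 1) / (rho + 1)
    return np.maximum(y - theta, 0.0)

print(project_simplex(np.array([0.3, 1.2, -0.5])))        # nonnegative, sums to 1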

7.2 Augmented Lagrangian

For equality-constrained problems $\min_x f(x)$ s.t. $c(x) = 0$, the augmented Lagrangian is:

$$\mathcal{L}_\rho(x, \lambda) = f(x) + \lambda^\top c(x) + \frac{\rho}{2} \|c(x)\|^2$$

The method of multipliers minimizes it iteratively (the alternating direction method of multipliers, ADMM, extends this by splitting the primal variables into blocks that are updated alternately):

  1. $x_{k+1} = \arg\min_x \mathcal{L}_{\rho}(x, \lambda_k)$ (minimize over the primal variable)
  2. $\lambda_{k+1} = \lambda_k + \rho \, c(x_{k+1})$ (dual update)

ADMM in AI: Used in distributed optimization (splitting a model across devices), constrained optimization problems in RL, and structured sparsity (e.g., group lasso).


8. Applications in Machine Learning

8.1 Adam as Diagonal Quasi-Newton

Adam's update rule:

$$x_{t+1} = x_t - \frac{\alpha}{\sqrt{\hat{v}_t} + \varepsilon} \odot \hat{m}_t$$

can be rewritten as:

$$x_{t+1} = x_t - \alpha H_t^{-1} \hat{m}_t$$

where $H_t = \text{diag}(\sqrt{\hat{v}_t} + \varepsilon)$ is a diagonal approximate Hessian.

This is diagonal quasi-Newton. The second moment $\hat{v}_t$ estimates $\mathbb{E}[g_t^2]$, which is proportional to the diagonal of the Fisher information matrix. Dividing by $\sqrt{\hat{v}_t}$ preconditions the gradient by the square root of the diagonal Fisher.

Why RMS, not raw variance? The square root $\sqrt{v_t}$ instead of $v_t$ is the key. For a quadratic $f$ with diagonal Hessian $H = \text{diag}(h_1, \ldots, h_d)$, the stationary distribution of stochastic gradients has variance $\propto h_i$ in coordinate $i$. So $\sqrt{v_t} \approx \sqrt{h_i}$, and $g / \sqrt{v_t} \approx g \cdot h_i^{-1/2}$. This is a half-step towards the Newton direction $H^{-1}g = g / h_i$.

8.2 L-BFGS for Full-Batch Training

L-BFGS is the standard algorithm for full-batch optimization of small-to-medium neural networks:

optimizer = torch.optim.LBFGS(
    model.parameters(),
    lr=1.0,           # Inner learning rate (line search scales this)
    max_iter=20,      # L-BFGS iterations per optimizer.step() call
    max_eval=25,      # Max function evaluations per step
    history_size=100, # L-BFGS memory
    line_search_fn='strong_wolfe'
)

def closure():
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    return loss

for epoch in range(100):
    optimizer.step(closure)

When to use L-BFGS:

  • Batch size = full dataset (no noise)
  • Medium-scale problems ($d \lesssim 10^7$)
  • Need high accuracy (e.g., physics simulation, function approximation)
  • Fine-tuning pre-trained models with small LoRA adapters

When NOT to use L-BFGS:

  • Large-scale stochastic training (gradient noise breaks secant condition)
  • Online/streaming data
  • Very large models where memory is constrained

8.3 Hessian-Free Optimization

Hessian-free (HF) optimization (Martens, 2010) computes Newton steps using CG applied to $H d = -g$ without forming $H$, using the Gauss-Newton approximation:

$$H \approx J^\top J \quad \text{(Gauss-Newton, always positive semi-definite)}$$

Hessian-vector products $Hv$ are computed via automatic differentiation:

def hvp(loss, params, v):
    """Hessian-vector product Hv via autograd."""
    grad = torch.autograd.grad(loss, params, create_graph=True)
    grad_v = sum(g.view(-1) @ v_part.view(-1)
                 for g, v_part in zip(grad, v))
    return torch.autograd.grad(grad_v, params)

This enables CG iterations that cost $O(d)$ per step (comparable to one forward+backward pass).

Historical note: HF optimization achieved state-of-the-art results on deep network training in 2010-2012, before SGD with momentum/Adam became dominant. For recurrent networks with long sequences (high curvature), HF still has advantages.

8.4 Natural Gradient and Fisher Information

The natural gradient is:

$$\tilde{g} = F^{-1} g$$

where $F = \mathbb{E}[\nabla \log p_\theta \, \nabla \log p_\theta^\top]$ is the Fisher information matrix. This is Newton's method with the Hessian replaced by the Fisher matrix.

Why the Fisher? The KL divergence $D_{\text{KL}}(p_\theta \| p_{\theta + \delta\theta})$ is approximately $\frac{1}{2} \delta\theta^\top F \delta\theta$. The natural gradient moves in the direction of steepest descent in the space of probability distributions (a Riemannian manifold with metric $F$), not in parameter space.

K-FAC (Kronecker-Factored Approximate Curvature): Approximates each layer's Fisher block as $F \approx A \otimes G$, where $A$ is the covariance of the layer's input activations and $G$ is the covariance of the back-propagated gradient signals. This gives $F^{-1} \approx A^{-1} \otimes G^{-1}$, which can be computed efficiently.


9. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Using too large a step in finite differences | Truncation error dominates when $h \gg \varepsilon^{1/3}$ | Use $h = \varepsilon^{1/3}$ for centered differences |
| 2 | Using too small a step in finite differences | Catastrophic cancellation: $f(x+h) - f(x) \approx 0$ in floating point | Use $h = \varepsilon^{1/3}$; never $h < \varepsilon^{1/2}$ |
| 3 | Applying L-BFGS to stochastic objectives | Gradient noise violates the curvature condition $y_k^\top s_k > 0$ | Use Adam/SGD for mini-batches; L-BFGS only for full batch |
| 4 | Ignoring the Wolfe conditions in BFGS | The BFGS update may produce an indefinite $H$ | Always use a Wolfe line search with BFGS |
| 5 | Using forward differences for gradient checking | Only $O(\sqrt{\varepsilon})$ accuracy | Use centered differences or the complex step |
| 6 | Forgetting the Hessian modification in Newton's method | The Newton step points toward a maximum at saddle points | Add $\tau I$ or use modified Cholesky |
| 7 | Setting the learning rate too large and expecting convergence | Divergence is guaranteed for $\eta > 2/\lambda_{\max}(H)$ on quadratics | Use a line search or tune $\eta$ carefully |
| 8 | Using Adam for full-batch convex problems when L-BFGS is available | Adam needs far more iterations and reaches worse final accuracy | Use L-BFGS for full-batch convex problems |
| 9 | Checking the gradient only at the initial point | Gradients can be correct initially but wrong near the optimum due to numerical issues | Check at multiple points, including near the optimum |
| 10 | Treating the trust region radius as a learning rate | The radius lives in parameter space; the effective step size depends on the Hessian | Tune the trust-region ratio-test parameters separately |
| 11 | Forgetting that Adam's epsilon matters | The default $\varepsilon = 10^{-8}$ can dominate when gradients are very small | Increase $\varepsilon$ when gradients are tiny; values like $10^{-7}$ are common for transformers |
| 12 | Not resetting L-BFGS history when the task changes | Stale curvature information misleads the new optimization | Create a fresh optimizer (or clear its state) |

10. Exercises

Exercise 1 * - Backtracking Armijo Line Search

Implement the backtracking Armijo line search.

(a) Write armijo_search(f, grad_f, x, p, alpha0=1.0, rho=0.5, c1=1e-4).

(b) Test on $f(x, y) = (x-2)^2 + 10(y-x^2)^2$ (a Rosenbrock variant). Start at $(0, 0)$ with descent direction $p = -\nabla f(0, 0)$.

(c) Plot $f(x + \alpha p)$ and the Armijo line, marking the accepted step.

Exercise 2 * - Newton's Method

Implement Newton's method with line search.

(a) Write newton_method(f, grad_f, hess_f, x0, tol) using backtracking Armijo line search.

(b) Test on the Rosenbrock function $f(x, y) = 100(y - x^2)^2 + (1-x)^2$.

(c) Compare convergence (iterations to $10^{-8}$ accuracy) between gradient descent, Newton's method, and gradient descent with exact line search.

Exercise 3 ** - L-BFGS Two-Loop Recursion

Implement the L-BFGS two-loop recursion.

(a) Write lbfgs_direction(g, s_list, y_list, m) that returns the L-BFGS search direction.

(b) Implement a full L-BFGS optimizer using the strong Wolfe line search.

(c) Test on quadratic minimization with varying condition numbers. Compare convergence to gradient descent.

Exercise 4 ** - Finite Difference Gradient Check

Implement gradient checking via finite differences.

(a) Write grad_check(f, grad_f, x, h_vals) that computes the centered-difference gradient for each $h$ in h_vals and compares it to grad_f(x).

(b) Find the optimal $h$ empirically for several test functions.

(c) Plot error vs $h$ on a log-log scale. Verify the $O(h^2) + O(\varepsilon/h)$ form.

Exercise 5 ** - Trust Region vs Line Search

Compare trust region and line search methods.

(a) Implement a simple trust region method with Cauchy point for a 2D test function.

(b) Test on a non-convex function with saddle points. Show that trust region methods don't require modification of the Hessian.

(c) Plot the trajectory of iterates for both methods.

Exercise 6 ** - Adam as Diagonal Quasi-Newton

Demonstrate Adam's connection to diagonal preconditioning.

(a) Implement gradient descent, Adam, and diagonal-preconditioned gradient descent.

(b) Test on the quadratic $f(x) = \frac{1}{2} x^\top \text{diag}(1, 100, 1000, 10000)\, x$.

(c) Show that Adam converges at a rate close to the optimal diagonal-preconditioned rate, and much faster than vanilla gradient descent.

Exercise 7 ** - Numerical Differentiation Accuracy

Implement finite differences and compare to automatic differentiation.

(a) Implement forward, centered, and Richardson extrapolation finite differences.

(b) Find the optimal $h$ for each method on $f(x) = \sin(x^2) e^{-x}$.

(c) Implement the complex-step derivative and verify it achieves near machine precision.

Exercise 8 *** - Hessian-Vector Product via Autograd

Compute Hessian-vector products efficiently using automatic differentiation.

(a) Write hvp(model, x, v) that computes $Hv$ without forming $H$, using PyTorch's autograd.grad with create_graph=True.

(b) Implement Newton-CG (truncated Newton) using CG applied to the Hessian-vector product.

(c) Test on a small neural network (2 layers, 50 hidden units). Compare convergence to Adam.


11. Why This Matters for AI (2026 Perspective)

| Aspect | Impact |
|---|---|
| Adam as quasi-Newton | Understanding Adam through the lens of diagonal preconditioning explains its robustness and motivates improvements like AdaHessian and Shampoo |
| L-BFGS for fine-tuning | Fine-tuning small LoRA adapters with full-batch gradients can use L-BFGS, achieving faster convergence than Adam in the low-noise regime |
| Hessian-vector products | Used in mechanistic interpretability (computing Hessian eigenvalue spectra), continual learning (Fisher-based regularization), and meta-learning |
| Natural gradient | K-FAC and Shampoo implement approximate natural gradient; used in LLM pre-training at Google (TPU-optimized K-FAC) |
| Trust region in RL | TRPO/PPO use trust region methods to constrain policy updates; a direct precursor to RLHF stability |
| Conjugate gradient in optimization | Newton-CG underlies Hessian-free optimization; CG is also used to solve the trust region subproblem |
| Line search in L-BFGS | Wolfe conditions ensure the secant condition holds; critical for L-BFGS stability |
| Gradient checking | Essential for debugging custom autodiff implementations; every major framework implements it |
| Fisher information | Quantifies "how much information" each training example provides; central to active learning, Bayesian deep learning, and continual learning |

12. Conceptual Bridge

What you learned: This section bridges the gap between theoretical optimization (Section08) and practical computation. The key themes:

  1. Line search ensures convergence: Without proper step size selection (Armijo, Wolfe), gradient-based methods may diverge or stagnate. Adam uses a fixed learning rate with momentum - it lacks a formal convergence guarantee for general non-convex problems but works well in practice.

  2. Quasi-Newton approximates the second-order structure: BFGS/L-BFGS use past gradient differences to build up a Hessian approximation. Adam does this diagonally, using running second-moment estimates. Shampoo/K-FAC do it in a structured block-diagonal form.

  3. Condition number determines convergence rate: For gradient descent on a quadratic, the convergence rate is exactly $\left(\frac{\kappa-1}{\kappa+1}\right)^k$. Preconditioning (approximating the Hessian) reduces the effective condition number and accelerates convergence.

  4. Floating-point limits gradient accuracy: For smooth functions, automatic differentiation gives near-exact gradients. Finite differences are limited to roughly $\varepsilon_{\text{mach}}^{2/3}$ accuracy (centered differences at the optimal step $h \approx \varepsilon_{\text{mach}}^{1/3}$). This matters for gradient checking and numerical Hessian computation.

NUMERICAL OPTIMIZATION IN CONTEXT
========================================================================

  Section08 Optimization (theory)     ->   Section10-03 Numerical Optimization
  +-- Gradient descent           --> Line search, convergence guarantees
  +-- Adam/momentum              --> Diagonal quasi-Newton interpretation
  +-- Convergence theory         --> Condition number analysis
  +-- Constrained opt.          --> Projected gradient, ADMM

  Section10-02 Numerical Linear Algebra ->  Section10-03 Numerical Optimization
  +-- CG method                  --> Newton-CG, Hessian-free opt
  +-- Condition numbers          --> GD convergence rate
  +-- Iterative refinement       --> Inexact Newton methods
  +-- Sparse systems             --> Large-scale quasi-Newton

========================================================================



Appendix A: Wolfe Conditions - Detailed Analysis

A.1 Why the Armijo Condition Alone Is Insufficient

The Armijo condition $f(x_k + \alpha p_k) \leq f_k + c_1 \alpha g_k^\top p_k$ prevents steps that are too long (the function must decrease by at least $c_1$ times the decrease predicted by the linear model). But it does not prevent steps that are too short!

Example: Consider $f(x) = x^2$ at $x_k = 1$ with $p = -1$ (a descent direction). The Armijo condition with $c_1 = 10^{-4}$ is satisfied for every $\alpha \in (0, 2 - 2c_1)$. But if we choose $\alpha_k = 10^{-10}$, the step is effectively zero and we make no progress.

The curvature condition prevents this: it requires the slope $\nabla f_{k+1}^\top p_k$ at the new point to have risen to at least $c_2$ times the (negative) initial slope - the directional derivative must have become much less negative, which only happens after moving a meaningful distance along $p_k$. This ensures the step is not too short.

A.2 Wolfe Conditions Ensure BFGS Remains Well-Defined

The BFGS update requires $y_k^\top s_k > 0$ to maintain positive definiteness of the Hessian approximation. The curvature condition guarantees this:

$$y_k^\top s_k = (\nabla f_{k+1} - \nabla f_k)^\top s_k = \left(\nabla f_{k+1}^\top p_k - \nabla f_k^\top p_k\right) \alpha_k$$

The curvature condition says $\nabla f_{k+1}^\top p_k \geq c_2 \nabla f_k^\top p_k$. Since $\nabla f_k^\top p_k < 0$ (descent direction) and $c_2 \in (0, 1)$:

$$y_k^\top s_k = \alpha_k \left(\nabla f_{k+1}^\top p_k - \nabla f_k^\top p_k\right) \geq \alpha_k (c_2 - 1) \nabla f_k^\top p_k = \alpha_k (1-c_2)\,|\nabla f_k^\top p_k| > 0$$

So $y_k^\top s_k > 0$ - the curvature condition ensures the BFGS update remains positive definite.


Appendix B: L-BFGS Two-Loop Recursion - Derivation

The L-BFGS direction is $d = -H_k g_k$, where $H_k$ approximates the inverse Hessian. $H_k$ is implicitly defined by:

$$H_k = \left(V_{k-1}^\top \cdots V_{k-m}^\top\right) H_0 \left(V_{k-m} \cdots V_{k-1}\right) + \sum_{i=k-m}^{k-1} \rho_i \left(V_{k-1}^\top \cdots V_{i+1}^\top\right) s_i s_i^\top \left(V_{i+1} \cdots V_{k-1}\right)$$

where $V_i = I - \rho_i y_i s_i^\top$ and $\rho_i = 1 / (y_i^\top s_i)$.

Directly computing $H_k g$ by expanding this formula requires $O(m^2 d)$ operations. The two-loop recursion computes $H_k g$ in $O(md)$ operations by applying the factors recursively, without ever forming $H_k$.

Two-loop recursion (Nocedal, 1980):

First loop (backward, $i = k-1, k-2, \ldots, k-m$):

$$\alpha_i = \rho_i s_i^\top q_{i+1}, \quad q_i = q_{i+1} - \alpha_i y_i$$

starting with $q_k = g_k$.

Middle: Scale by the initial Hessian approximation: $r = H_0 q_{k-m}$.

Second loop (forward, $i = k-m, k-m+1, \ldots, k-1$):

$$\beta_i = \rho_i y_i^\top r_i, \quad r_{i+1} = r_i + s_i (\alpha_i - \beta_i)$$

The result $d = -r_k = -H_k g_k$ is the L-BFGS search direction.


Appendix C: Trust Region Subproblem - Eigenvalue Approach

The exact trust region subproblem:

$$\min_p \; \frac{1}{2} p^\top B p + g^\top p \quad \text{s.t.} \quad \|p\| \leq \Delta$$

is solved by the Moré-Sorensen algorithm (1983):

KKT conditions: The optimal $p^*$ satisfies:

$$(B + \lambda^* I) p^* = -g, \quad \lambda^* \geq 0, \quad \lambda^* (\Delta - \|p^*\|) = 0$$

where $\lambda^*$ is the Lagrange multiplier.

Algorithm:

  1. If $\|p^N\| \leq \Delta$ (the Newton step lies within the trust region): $p^* = p^N$, $\lambda^* = 0$
  2. Otherwise, solve for $\lambda^* > 0$ such that $\|(B + \lambda^* I)^{-1} g\| = \Delta$

Step 2 is a one-dimensional root-finding problem. Since $\|(B + \lambda I)^{-1} g\|$ is monotonically decreasing in $\lambda$ for $\lambda > -\lambda_{\min}(B)$, it can be solved by Newton's method in 1D.

Cost: One eigendecomposition of $B$ ($O(d^3)$) plus a few 1D Newton iterations.
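A minimal NumPy sketch of Step 2, using plain bisection in place of the safeguarded 1D Newton iteration that Moré-Sorensen actually uses; B is assumed positive definite and the test data is illustrative:

import numpy as np

def tr_subproblem_lambda(B, g, Delta, lam_hi=1e6, iters=100):
    """Find lambda >= 0 with ||(B + lambda I)^{-1} g|| = Delta by bisection (B assumed SPD)."""
    I = np.eye(len(g))
    step_norm = lambda lam: np.linalg.norm(np.linalg.solve(B + lam * I, g))
    lam_lo = 0.0
    for _ in range(iters):
        lam = 0.5 * (lam_lo + lam_hi)
        if step_norm(lam) > Delta:
            lam_lo = lam          # step still too long: increase the damping
        else:
            lam_hi = lam
    return lam_hi

B = np.diag([1.0, 4.0])
g = np.array([2.0, 2.0])
lam = tr_subproblem_lambda(B, g, Delta=0.5)
print(np.linalg.norm(np.linalg.solve(B + lam * np.eye(2), g)))   # ~0.5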


Appendix D: Gradient Checking Best Practices

D.1 What to Check

A gradient check compares $\frac{\partial f}{\partial x_i}$ from autograd with the finite difference approximation. The relative error should be:

  • $< 10^{-7}$: essentially correct (autograd is accurate to ~10 digits)
  • $10^{-7}$ to $10^{-5}$: probably correct (some numerical noise)
  • $> 10^{-5}$: potential bug in the gradient implementation
  • $> 10^{-2}$: definite bug

D.2 Pitfalls

  1. Non-differentiable operations: ReLU at zero, abs(0), max(a, b) at the boundary - finite differences may give a subgradient, not the same subgradient as autograd.

  2. Numerically sensitive operations: Softmax with very large logits, log(tiny number) - both methods may have errors here, but they may differ.

  3. Wrong random seed: Always check with a fixed random seed; bugs can be masked by lucky initialization.

  4. Checking only the gradient, not the Jacobian: For vector-valued functions, check the full Jacobian, not just a scalar projection.

D.3 Implementation

import numpy as np

def gradient_check(f, x, grad_f, eps=1e-5, rtol=1e-5, atol=1e-8):
    """
    Compare analytical gradient to centered finite differences.
    Returns max relative error across all coordinates.
    """
    n = len(x.ravel())
    g_analytic = grad_f(x).ravel()
    g_numerical = np.zeros(n)

    for i in range(n):
        x_plus = x.copy().ravel()
        x_minus = x.copy().ravel()
        x_plus[i] += eps
        x_minus[i] -= eps
        g_numerical[i] = (f(x_plus.reshape(x.shape)) -
                          f(x_minus.reshape(x.shape))) / (2 * eps)

    # Relative error (with absolute tolerance floor)
    denom = np.maximum(np.abs(g_analytic), atol)
    rel_err = np.abs(g_analytic - g_numerical) / denom
    return rel_err.max(), rel_err

Appendix E: Adam - Complete Update Rule and Variants

E.1 Standard Adam

import numpy as np

def adam_step(g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of Adam optimizer."""
    m = beta1 * m + (1 - beta1) * g          # First moment
    v = beta2 * v + (1 - beta2) * g**2       # Second moment
    m_hat = m / (1 - beta1**t)               # Bias correction
    v_hat = v / (1 - beta2**t)               # Bias correction
    delta = lr * m_hat / (np.sqrt(v_hat) + eps)  # Update
    return m, v, delta

E.2 AdamW - Weight Decay Decoupling

Standard Adam with L2 regularization adds $\frac{\lambda}{2} \|x\|^2$ to the loss, so $\lambda x_t$ enters the gradient before the moment estimates. The weight decay term is therefore scaled by $(\sqrt{\hat{v}_t} + \varepsilon)^{-1}$ - large for coordinates with small gradient variance. AdamW decouples weight decay from the adaptive scaling:

$$x_{t+1} = x_t - \alpha \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon} + \lambda x_t \right)$$

so the decay term bypasses the $\sqrt{\hat{v}_t}$ scaling entirely.

This is a better regularizer: all weights decay at the same rate, not faster for rarely-updated weights.
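A minimal NumPy sketch mirroring adam_step above, with the decoupled decay applied directly to the parameters (the function name and default values are illustrative):

import numpy as np

def adamw_step(x, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """One AdamW step: weight decay is applied to x directly, outside the adaptive scaling."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    x = x - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * x)   # decoupled weight decay
    return x, m, v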

E.3 AdaFactor - Memory-Efficient Adam

AdaFactor (Shazeer & Stern, 2018) reduces the memory needed for Adam's second moments by approximating each weight matrix's second-moment matrix as a rank-1 outer product:

$$v_t \approx r_t c_t^\top \quad \text{where } r_t \in \mathbb{R}^m, \; c_t \in \mathbb{R}^n$$

For a weight matrix $W \in \mathbb{R}^{m \times n}$, this reduces the second-moment memory from $O(mn)$ to $O(m + n)$.

Trade-off: A less accurate second-moment estimate means slightly worse convergence in practice, but it is viable for very large models where Adam's optimizer-state memory is prohibitive.

E.4 Shampoo - Block Preconditioning

Shampoo (Gupta et al., 2018) applies a full-matrix preconditioner to each layer:

For a layer weight $W \in \mathbb{R}^{m \times n}$ with gradient $G = \frac{\partial L}{\partial W}$:

$$L_t = L_{t-1} + G G^\top \in \mathbb{R}^{m \times m}, \qquad R_t = R_{t-1} + G^\top G \in \mathbb{R}^{n \times n}$$

Update: $W \leftarrow W - \alpha \, L_t^{-1/4} \, G \, R_t^{-1/4}$

This approximates the natural gradient (full Fisher) with only $O(m^2 + n^2)$ memory per layer, vs $O((mn)^2)$ for the full Fisher. Shampoo is used in Google's LLM pre-training (via the distributed Shampoo implementation).
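A minimal NumPy sketch of the per-matrix update above, computing the inverse fourth roots by eigendecomposition (real implementations block, batch, and stabilize this; the epsilon here is an illustrative damping term):

import numpy as np

def shampoo_step(W, G, L, R, lr=0.1, eps=1e-6):
    """One full-matrix Shampoo update for a single weight matrix W with gradient G."""
    L = L + G @ G.T                                    # left preconditioner statistics (m x m)
    R = R + G.T @ G                                    # right preconditioner statistics (n x n)

    def inv_fourth_root(M):
        w, V = np.linalg.eigh(M)                       # M is symmetric PSD
        return V @ np.diag((w + eps) ** -0.25) @ V.T

    W = W - lr * inv_fourth_root(L) @ G @ inv_fourth_root(R)
    return W, L, R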


Appendix F: Convergence Analysis Review

F.1 Gradient Descent Convergence

Smooth convex case ($f$ convex with $L$-Lipschitz gradients):

$$f(x_k) - f(x^*) \leq \frac{L \|x_0 - x^*\|^2}{2k}$$

with step size $\eta = 1/L$.

Strongly convex case ($f$ is $\mu$-strongly convex and $L$-smooth):

$$f(x_k) - f(x^*) \leq \left(1 - \frac{\mu}{L}\right)^k \left(f(x_0) - f(x^*)\right)$$

Convergence rate: about $\kappa = L/\mu$ iterations to reduce the error by a factor of $e$.

Non-convex case:

$$\min_{k \leq T} \|\nabla f(x_k)\|^2 \leq \frac{2(f(x_0) - f^*)}{T \eta (2 - L\eta)}$$

F.2 Newton's Method Convergence

Near a non-degenerate minimum $x^*$:

$$\|x_{k+1} - x^*\| \leq \frac{M}{2\lambda_{\min}(H^*)} \|x_k - x^*\|^2$$

where $M$ is the Lipschitz constant of $\nabla^2 f$ and $\lambda_{\min}(H^*)$ is the smallest eigenvalue of the Hessian at $x^*$.

Basin of convergence: Newton's method converges from any $x_0$ satisfying $\|x_0 - x^*\| < \frac{2\lambda_{\min}(H^*)^2}{M \, \kappa(H^*)}$.

F.3 BFGS Convergence

For strongly convex $f$ with Lipschitz continuous Hessian, BFGS converges super-linearly:

$$\lim_{k \to \infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = 0$$

and is $r$-linearly convergent globally:

$$\|x_k - x^*\| \leq c \rho^k$$

for some $c > 0$ and $\rho \in (0, 1)$.


Appendix G: Software Ecosystem

G.1 SciPy Optimize

from scipy.optimize import minimize, check_grad

# L-BFGS-B
result = minimize(f, x0, jac=grad_f, method='L-BFGS-B',
                  options={'maxiter': 1000, 'ftol': 1e-15, 'gtol': 1e-10})

# Newton-CG
result = minimize(f, x0, jac=grad_f, hess=hess_f, method='Newton-CG')

# Trust region methods
result = minimize(f, x0, jac=grad_f, hess=hess_f, method='trust-ncg')
result = minimize(f, x0, jac=grad_f, hess=hess_f, method='trust-exact')

# Gradient checking
err = check_grad(f, grad_f, x0)  # Should be < 1e-5 for correct gradient

G.2 PyTorch Optimizers

import torch.optim as optim

# Standard first-order
sgd    = optim.SGD(params, lr=0.01, momentum=0.9, weight_decay=1e-4)
adam   = optim.Adam(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
adamw  = optim.AdamW(params, lr=1e-3, weight_decay=0.01)

# Second-order
lbfgs  = optim.LBFGS(params, lr=1, max_iter=20, history_size=100,
                      line_search_fn='strong_wolfe')

# Usage with closure (required for LBFGS and some line search methods)
def closure():
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    return loss

optimizer.step(closure)

G.3 JAX Optimizers (Optax)

import jax
import optax

# Standard optimizers
optimizer = optax.adam(learning_rate=1e-3)
opt_state = optimizer.init(params)

# One step
grads = jax.grad(loss_fn)(params)
updates, opt_state = optimizer.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)

# Composing transformations
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),  # Gradient clipping
    optax.scale_by_adam(),           # Adam scaling
    optax.scale(-lr),                # Learning rate
)

Appendix H: Summary and Quick Reference

NUMERICAL OPTIMIZATION - ALGORITHM SELECTION GUIDE
========================================================================

  STOCHASTIC (mini-batch, noisy gradients)
  -----------------------------------------
  Adam (default for LLMs/transformers)
  AdamW (with weight decay - preferred for fine-tuning)
  SGD + momentum (slower to converge, better generalization sometimes)
  Adafactor (when memory is very constrained)
  Shampoo/K-FAC (when second-order info is affordable)

  FULL-BATCH (all data, low noise)
  ---------------------------------
  L-BFGS (the gold standard for full-batch)
  Nonlinear CG (memory efficient, similar to L-BFGS)
  Newton-CG (when Hessian-vector products are cheap)
  Trust-region (when Hessian is indefinite)

  CONVERGENCE RATE COMPARISON (quadratic objective, κ = 100)
  ---------------------------------------------------------
  GD (optimal step):    ~100 iters to 1e-6 (rate = (κ-1)/(κ+1) = 0.98)
  CG:                   ~10 iters to 1e-6 (rate ≈ (√κ-1)/(√κ+1) = 0.82)
  L-BFGS (m=10):        ~10 iters to 1e-6 (super-linear)
  Newton:               ~3 iters to 1e-6 (quadratic convergence!)

  GRADIENT CHECKING STEP SIZE
  ----------------------------
  Forward differences:  h = sqrt(eps_mach) ≈ 1.5e-8
  Centered differences: h = cbrt(eps_mach) ≈ 6e-6
  Complex step:         h = 1e-20 (essentially exact)

========================================================================



Appendix I: Extended Case Studies

I.1 Why Adam Doesn't Use Line Search

Deterministic optimization (with exact gradients) benefits greatly from line search: at each step, we can evaluate $f$ as many times as needed to find a good step size. But stochastic optimization with mini-batches cannot use line search:

Problem 1: Inconsistent function values. The function value $f(x; B_k)$ computed on batch $B_k$ differs from $f(x; B_{k+1})$ computed on the next batch. We cannot meaningfully compare $f(x_k; B_k)$ with $f(x_k + \alpha p; B_{k+1})$ when the batch changes between evaluations.

Problem 2: Gradient noise breaks the secant condition. The L-BFGS curvature condition $y_k^\top s_k > 0$ requires the gradient change $y_k = g_{k+1} - g_k$ to be consistent with the step $s_k$. With stochastic gradients, $y_k$ is corrupted by noise and the condition may fail.

Adam's solution: Instead of adapting the step size using function evaluations (line search), Adam adapts it using the history of squared gradients ($v_t$). This is an "implicit line search" - steps are small in coordinates where gradients have been large historically, and large where they have been small.

I.2 When L-BFGS Beats Adam

Task: Minimize the loss of a 2-layer network on a small dataset (N=1000, d=100).

| Method | Iterations | Final loss | Time |
|---|---|---|---|
| Adam ($\eta = 10^{-3}$) | 10,000 | $3.2 \times 10^{-5}$ | 45 s |
| Adam ($\eta = 10^{-4}$) | 10,000 | $8.1 \times 10^{-4}$ | 45 s |
| L-BFGS ($m = 20$) | 150 | $1.1 \times 10^{-8}$ | 12 s |

L-BFGS reaches much lower loss faster when:

  • Gradient is exact (full batch)
  • The loss function is smooth (no noise)
  • High accuracy is needed (physics simulation, calibration)

But for LLM pretraining ($N \sim 10^{12}$ tokens, $d \sim 10^{10}$ parameters), L-BFGS is impractical. Adam is the right choice.

I.3 The Role of Condition Number in Transformer Training

During LLM pretraining, the condition number of the loss Hessian evolves over the course of training:

  • Early training ($t = 0$): Random initialization, $\kappa \sim O(1)$ for linear layers. Gradients flow well.
  • Mid training: Parameters have adapted; $\kappa$ can reach $10^3$ to $10^6$ for some layers. This is where gradient clipping matters.
  • Late training: Near convergence, $\kappa$ stabilizes. The learning rate schedule (cosine decay, linear decay) effectively adapts to the local curvature.

Weight decay effect on condition number: Adding L2 regularization $\lambda \|W\|_F^2$ shifts all Hessian eigenvalues by $+2\lambda$, changing the condition number from $\kappa = \lambda_{\max}/\lambda_{\min}$ to $(\lambda_{\max} + 2\lambda) / (\lambda_{\min} + 2\lambda)$, which is smaller (better conditioned). This is one reason AdamW (with explicit weight decay) often trains more stably than Adam with L2 in the loss.


Appendix J: Numerical Analysis of Backpropagation

J.1 Floating-Point Error in Backprop

The backpropagation algorithm computes $\partial L / \partial W$ by the chain rule:

$$\delta_l = \frac{\partial L}{\partial z_l} = f'(z_l) \odot (W_{l+1}^\top \delta_{l+1})$$

In floating point, each multiplication and activation application introduces $O(\varepsilon_{\text{mach}})$ relative error. For an $L$-layer network, the accumulated gradient error is:

$$\|\delta \hat{g}\| / \|g\| = O(L \, \varepsilon_{\text{mach}} \cdot \kappa_{\text{jacobian}})$$

where $\kappa_{\text{jacobian}}$ is the condition number of the Jacobian of the network (which can be large for deep networks with poor initialization).

J.2 Mixed Precision and Gradient Accuracy

In bf16 training, gradients are computed in bf16 with $\varepsilon_{\text{mach}} \approx 8 \times 10^{-3}$. This means:

  • Parameter updates have $\sim 3$ significant digits of accuracy
  • Accumulated over many steps, this can introduce drift

Why it works despite low precision: The optimizer (Adam) effectively computes a smoothed gradient over many steps via the exponential moving average $m_t$. Noise in individual gradient estimates is averaged out. The effective gradient accuracy is $O(\varepsilon_{\text{bf16}} / \sqrt{T})$ after $T$ steps, which is much better than single-step precision.

J.3 Gradient Accumulation and Precision

When accumulating gradients over $K$ mini-batches before an update:

$$G = \frac{1}{K} \sum_{k=1}^K g_k$$

In fp32 with Kahan summation the error is $O(\varepsilon_{\text{fp32}})$, independent of $K$. In bf16 with naive summation the error is $O(K \varepsilon_{\text{bf16}})$ - it grows with the number of accumulation steps.

Fix: Accumulate gradients in fp32, even if the forward/backward pass uses bf16. This is the default behavior in PyTorch AMP when scaler.scale(loss).backward() is used.


Appendix K: Connection to Statistical Learning Theory

Numerical optimization connects to statistical learning theory through the decomposition of test error:

Optimization error: $f(x_k) - f(x^*)$ - how far we are from the minimum of the training loss.

Approximation error: $f(x^*) - f^*_{\text{Bayes}}$ - how far the best model in our function class is from the Bayes-optimal predictor.

Estimation error: the gap between performance on the training sample and on new data (the generalization gap), caused by fitting a finite sample.

The over-optimization paradox: Running optimization to convergence (small optimization error) often leads to overfitting (large estimation error). This is why:

  • We use early stopping (stop before convergence)
  • We use weight decay (adds regularization that changes $x^*$)
  • We use dropout and other regularizers (which change the optimization landscape)

Implicit regularization in SGD: Stochastic gradient descent, by virtue of its noise, tends to converge to "flat minima" - regions where the Hessian has small eigenvalues. Such minima often generalize better (Hochreiter & Schmidhuber, 1997; Keskar et al., 2016). This is NOT a numerical issue - it is a feature of the stochastic optimization process.


Appendix L: Historical Timeline of Optimization Methods

OPTIMIZATION METHODS: TIMELINE
========================================================================

  1847  Cauchy - steepest descent method (gradient descent)
  1937  Kaczmarz - projection-based iterative method
  1944  Levenberg - damped least squares (LM algorithm)
  1952  Hestenes & Stiefel - conjugate gradient method
  1963  Marquardt - generalized LM algorithm
  1966  Armijo - sufficient decrease condition
  1969  Wolfe - curvature condition for line search
  1970  Broyden, Fletcher, Goldfarb, Shanno - BFGS quasi-Newton update
  1970  Sargent & Sebastian - early limited-memory quasi-Newton ideas
  1980  Nocedal - limited-memory BFGS (L-BFGS)
  1994  Byrd, Lu, Nocedal, Zhu - L-BFGS-B (bound constrained)
  1997  Hochreiter & Schmidhuber - flat minima and generalization
  2010  Martens - Hessian-free optimization for deep nets
  2011  Duchi et al. - Adagrad
  2014  Kingma & Ba - Adam
  2015  Bottou - stochastic optimization in ML survey
  2016  Keskar et al. - sharp vs flat minima for generalization
  2017  Loshchilov & Hutter - AdamW (decoupled weight decay)
  2018  Gupta et al. - Shampoo (block-diagonal preconditioning)
  2018  Anil et al. - Scalable second-order optimization
  2021+ Vyas et al., Dettmers - training efficiency, mixed precision

========================================================================

Appendix M: Chapter Position and Summary

Section10 NUMERICAL METHODS - SECTION 03 SUMMARY
========================================================================

  PREREQUISITE CHAIN:
  Section01 Floating Point -> Section02 Numerical LA -> Section03 Numerical Optimization

  CONNECTIONS FROM THIS SECTION:
    +--------------------------------------------------------------+
    |  Line search (Wolfe)      -> Used in L-BFGS, BFGS            |
    |  Newton's method          -> Quasi-Newton (BFGS, L-BFGS)     |
    |  L-BFGS                   -> torch.optim.LBFGS               |
    |  Trust region             -> TRPO, PPO (RL policy updates)    |
    |  Diagonal quasi-Newton    -> Adam, AdaGrad, RMSProp          |
    |  Natural gradient         -> K-FAC, Shampoo                  |
    |  Condition numbers        -> Section02 Numerical LA                 |
    |  Gradient checking        -> Debugging autodiff implementations|
    +--------------------------------------------------------------+

  KEY FORMULAS:
    GD convergence rate:   (κ-1)/(κ+1) per iteration
    CG convergence rate:   (√κ-1)/(√κ+1) per iteration
    Newton convergence:    quadratic near minimum
    L-BFGS convergence:    super-linear (between GD and Newton)

    Optimal finite diff h: cbrt(eps_mach) ≈ 6e-6  (centered)
    Finite diff error:     O(h^2) + O(eps/h)
    Complex step error:    O(h^2) - no rounding error

========================================================================

Section statistics: 12 main sections, 22 subsections, 12 common mistakes, 8 exercises, 15 appendices.



Appendix N: Additional Optimization Algorithms

N.1 Nonlinear Conjugate Gradient

The nonlinear conjugate gradient (NCG) method extends CG to nonlinear objectives by using the same direction-update structure but with gradient changes instead of exact Hessian-vector products.

The search directions are:

$$d_{k+1} = -g_{k+1} + \beta_k d_k$$

where the scalar $\beta_k$ determines which variant of NCG:

| Formula | Name | Best for |
|---|---|---|
| $\beta_k = \dfrac{g_{k+1}^\top g_{k+1}}{g_k^\top g_k}$ | Fletcher-Reeves | Smooth convex |
| $\beta_k = \dfrac{g_{k+1}^\top (g_{k+1} - g_k)}{g_k^\top g_k}$ | Polak-Ribière | General nonlinear |
| $\beta_k = \max\left(\dfrac{g_{k+1}^\top (g_{k+1} - g_k)}{d_k^\top (g_{k+1} - g_k)}, 0\right)$ | Hestenes-Stiefel | Non-convex |

NCG vs L-BFGS:

  • NCG: $O(d)$ memory (just the previous direction)
  • L-BFGS ($m$ vectors): $O(md)$ memory, faster convergence
  • NCG: simpler to implement, same asymptotic convergence order (a minimal sketch follows below)
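A minimal sketch of nonlinear CG with the restarted Polak-Ribière ("PR+") choice of beta and SciPy's Wolfe line search; the Rosenbrock test problem and the fallback step size are illustrative:

import numpy as np
from scipy.optimize import line_search

def nonlinear_cg(f, grad_f, x, iters=200, tol=1e-8):
    """Polak-Ribiere (PR+) nonlinear CG with a Wolfe line search."""
    g = grad_f(x)
    d = -g
    for _ in range(iters):
        alpha = line_search(f, grad_f, x, d, c2=0.1)[0]
        if alpha is None:                      # line search failed: restart along -g
            d = -g
            alpha = line_search(f, grad_f, x, d, c2=0.1)[0] or 1e-4
        x = x + alpha * d
        g_new = grad_f(x)
        if np.linalg.norm(g_new) < tol:
            break
        beta = max(g_new @ (g_new - g) / (g @ g), 0.0)   # PR+ with restart when beta < 0
        d = -g_new + beta * d
        g = g_new
    return x

rosen = lambda x: 100 * (x[1] - x[0]**2)**2 + (1 - x[0])**2
rosen_grad = lambda x: np.array([-400 * x[0] * (x[1] - x[0]**2) - 2 * (1 - x[0]),
                                 200 * (x[1] - x[0]**2)])
print(nonlinear_cg(rosen, rosen_grad, np.array([-1.2, 1.0])))   # approaches [1, 1]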

N.2 Stochastic Gradient Descent with Momentum

SGD + Momentum:

$$v_{t+1} = \mu v_t - \alpha g_t, \quad x_{t+1} = x_t + v_{t+1}$$

Nesterov accelerated gradient (NAG):

$$v_{t+1} = \mu v_t - \alpha \nabla f(x_t + \mu v_t), \quad x_{t+1} = x_t + v_{t+1}$$

Connection to CG: For quadratic objectives, heavy-ball momentum with optimal parameters attains the same asymptotic convergence rate as CG. For non-quadratic objectives, momentum helps navigate ravines (valleys where the gradient is large perpendicular to the valley but small along it).

Optimal momentum for quadratics: $\mu^* = \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^2$. With this momentum, gradient descent with momentum has the same $O(\sqrt{\kappa})$ dependence on the condition number as CG, vs $O(\kappa)$ without momentum.

N.3 Stochastic Variance Reduction

For finite-sum objectives f(x) = \frac{1}{n} \sum_i f_i(x), variance-reduced methods achieve better convergence than SGD by occasionally computing the full gradient:

SVRG (Johnson & Zhang, 2013):

for outer loop s = 1, 2, ..., S:
    x_tilde = x_s                      # snapshot point
    mu = grad f(x_tilde)               # full gradient at the snapshot
    for inner loop t = 1, ..., m:
        sample i uniformly
        g_t = grad f_i(x_t) - grad f_i(x_tilde) + mu   # variance-reduced gradient
        x_{t+1} = x_t - alpha * g_t
    x_{s+1} = x_m  (or a randomly chosen inner iterate)

The key properties: \mathbb{E}[g_t] = \nabla f(x_t) (unbiased) and \text{Var}(g_t) \to 0 as x_t \to x^* (the variance vanishes at the optimum). SVRG achieves linear convergence (like full-batch GD) with per-iteration cost similar to SGD.
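
A compact NumPy sketch of SVRG on a synthetic least-squares problem (the step-size heuristic and epoch length are ad hoc illustration choices, not tuned values):

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

def grad_i(x, i):
    # gradient of f_i(x) = 0.5 * (a_i @ x - b_i)^2
    return A[i] * (A[i] @ x - b[i])

alpha = 0.1 / np.max(np.sum(A**2, axis=1))   # conservative step-size heuristic
x = np.zeros(d)
for s in range(20):                          # outer loop
    x_snap = x.copy()
    mu = A.T @ (A @ x_snap - b) / n          # full gradient at the snapshot
    for t in range(2 * n):                   # inner loop
        i = rng.integers(n)
        g = grad_i(x, i) - grad_i(x_snap, i) + mu   # variance-reduced gradient
        x = x - alpha * g
    print(f"epoch {s:2d}: ||x - x*|| = {np.linalg.norm(x - x_star):.2e}")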

For AI: SVRG is rarely used for deep learning because the gradient computation involves complex non-linearities (making the variance reduction calculation non-trivial) and because large datasets make computing the full gradient expensive. Adam with decaying learning rate achieves comparable practical performance without the exact full gradient.


Appendix O: Extended Exercises

O.1 Implementing the Wolfe Zoom Algorithm (***)

The Wolfe zoom algorithm is the industry-standard line search for L-BFGS. Implement it from scratch:

def wolfe_zoom(f, g, x, p, alpha_lo, alpha_hi, phi_lo, phi0, dphi0,
               c1=1e-4, c2=0.9, max_iter=20):
    """
    Find alpha satisfying strong Wolfe conditions within [alpha_lo, alpha_hi].
    phi_lo = f(x + alpha_lo * p), phi0 = f(x), dphi0 = g(x) @ p
    """
    for i in range(max_iter):
        # Cubic interpolation to find trial step
        alpha_j = 0.5 * (alpha_lo + alpha_hi)  # Simplified: bisection
        phi_j = f(x + alpha_j * p)

        if phi_j > phi0 + c1 * alpha_j * dphi0 or phi_j >= phi_lo:
            alpha_hi = alpha_j
        else:
            dphi_j = g(x + alpha_j * p) @ p
            if abs(dphi_j) <= -c2 * dphi0:
                return alpha_j  # Strong Wolfe satisfied
            if dphi_j * (alpha_hi - alpha_lo) >= 0:
                alpha_hi = alpha_lo
            alpha_lo = alpha_j
            phi_lo = phi_j

    return alpha_j  # Return best found
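
A quick usage sketch for the zoom routine above, bracketing [0, 1] along the steepest-descent direction of a small quadratic (the setup below is purely illustrative):

import numpy as np

def f(x):  return 0.5 * (x[0]**2 + 10 * x[1]**2)
def g(x):  return np.array([x[0], 10 * x[1]])

x = np.array([1.0, 1.0])
p = -g(x)                                  # steepest-descent direction
phi0, dphi0 = f(x), g(x) @ p               # phi(0) and phi'(0) along p
alpha = wolfe_zoom(f, g, x, p, alpha_lo=0.0, alpha_hi=1.0,
                   phi_lo=phi0, phi0=phi0, dphi0=dphi0)
print(alpha, f(x + alpha * p))             # step satisfying the strong Wolfe conditions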

O.2 Numerical Gradient Check via PyTorch (*)

import torch
from torch.autograd import gradcheck

# Define a custom function
class MyFunc(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 3 + torch.sin(x)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        return grad_output * (3 * x**2 + torch.cos(x))

# Gradient check with double precision
x = torch.randn(5, dtype=torch.float64, requires_grad=True)
result = gradcheck(MyFunc.apply, x, eps=1e-6, atol=1e-4, rtol=1e-3)
print(f"Gradient check passed: {result}")

O.3 Computing Hessian Spectrum via Lanczos (***)

The Hessian spectrum of a trained neural network reveals the curvature structure of the loss landscape:

import numpy as np
import torch

def hessian_spectrum_lanczos(loss, params, n_iters=100):
    """
    Estimate the extremal eigenvalues of the Hessian via Lanczos iteration.
    Each iteration requires one Hessian-vector product; hvp is a user-supplied
    helper (e.g. Pearlmutter's trick via torch.autograd.grad, sketched below).
    """
    d = sum(p.numel() for p in params)
    n_iters = min(n_iters, d)  # Lanczos cannot run longer than the dimension

    # Random unit-norm starting vector, stored as a list of per-parameter tensors
    v = [torch.randn_like(p) for p in params]
    v_norm = sum(torch.sum(vi**2) for vi in v).sqrt()
    v = [vi / v_norm for vi in v]

    alphas, betas = [], []
    v_old = None

    for i in range(n_iters):
        # Hessian-vector product
        Hv = hvp(loss, params, v)

        alpha = sum(torch.sum(vi * Hvi) for vi, Hvi in zip(v, Hv)).item()
        alphas.append(alpha)

        # Orthogonalize against the current and previous Lanczos vectors
        w = [Hvi - alpha * vi for Hvi, vi in zip(Hv, v)]
        if v_old is not None:
            w = [wi - beta * vi_old for wi, vi_old in zip(w, v_old)]

        beta = sum(torch.sum(wi**2) for wi in w).sqrt().item()
        if beta < 1e-10:
            break  # Invariant subspace found; stop early
        betas.append(beta)

        v_old = v
        v = [wi / beta for wi in w]

    # Eigenvalues of the tridiagonal matrix T (diag = alphas, off-diag = betas)
    # approximate the extremal eigenvalues of the Hessian
    k = len(alphas)
    T = np.diag(alphas) + np.diag(betas[:k-1], 1) + np.diag(betas[:k-1], -1)
    return np.linalg.eigvalsh(T)
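
The code above assumes an hvp helper. A minimal sketch using Pearlmutter's trick (double backward via torch.autograd.grad); in practice one would compute the first-order gradients once and reuse them across Lanczos iterations:

import torch

def hvp(loss, params, vec):
    """Hessian-vector product H v = d/dtheta (grad(loss) . v) via double backward."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    dot = sum((g * v).sum() for g, v in zip(grads, vec))
    return list(torch.autograd.grad(dot, params, retain_graph=True))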

Appendix P: Key Theorems Reference

P.1 Convergence Theorems

Theorem 1 (Gradient Descent): For L-smooth f with step size \eta = 1/L:

  • Convex: f(x_k) - f^* = O(1/k)
  • \mu-strongly convex: \|x_k - x^*\| = O((1 - \mu/L)^k)

Theorem 2 (CG): For SPD A with condition number \kappa = \lambda_{\max}/\lambda_{\min}:

\frac{\|e_k\|_A}{\|e_0\|_A} \leq 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k

Theorem 3 (Newton): Near a non-degenerate minimum, Newton's method converges quadratically:

\|x_{k+1} - x^*\| \leq C \|x_k - x^*\|^2

Theorem 4 (BFGS): For strongly convex f with Lipschitz-continuous Hessian, BFGS converges super-linearly.

Theorem 5 (Zoutendijk): For descent directions p_k with Wolfe line search on an L-smooth f bounded below,

\sum_k \frac{(\nabla f_k^\top p_k)^2}{\|p_k\|^2} < \infty

If the angle between p_k and -\nabla f_k stays bounded away from 90 degrees, this forces \|\nabla f_k\| \to 0.
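
A tiny numerical illustration of Theorem 3 (a sketch, assuming the 1D objective f(x) = x^2 + e^x, whose minimum is non-degenerate): the gradient magnitude roughly squares from one Newton iteration to the next once the iterate is close to the minimizer.

import numpy as np

fp  = lambda x: 2 * x + np.exp(x)      # f'(x)
fpp = lambda x: 2 + np.exp(x)          # f''(x) > 0 everywhere

x = 1.0
for k in range(6):
    print(f"iter {k}: |f'(x)| = {abs(fp(x)):.3e}")
    x -= fp(x) / fpp(x)                # Newton step for minimization
# Near the minimum, the exponent of |f'(x)| roughly doubles each iteration.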

P.2 Optimality Conditions

First-order necessary: \nabla f(x^*) = 0

Second-order sufficient: \nabla f(x^*) = 0 and \nabla^2 f(x^*) \succ 0 (positive definite Hessian)

KKT conditions (constrained): See Section08-04 Constrained Optimization




Appendix Q: Deep Dive - The Rosenbrock Function

The Rosenbrock function f(x, y) = 100(y - x^2)^2 + (1 - x)^2 is the canonical test problem for optimization algorithms:

Properties:

  • Minimum at (1, 1) with f(1, 1) = 0
  • The valley y = x^2 is nearly flat along the x direction but steep perpendicular to it
  • Condition number of the Hessian at the minimum: \kappa(H^*) \approx 2508 (for the standard form)
  • Not convex globally, but has a unique global minimum

Hessian at the minimum (1, 1):

H^* = \begin{pmatrix} 802 & -400 \\ -400 & 200 \end{pmatrix}

Eigenvalues: \lambda_1 \approx 1001.6, \lambda_2 \approx 0.40, giving \kappa(H^*) = \lambda_1/\lambda_2 \approx 2508.

Performance comparison on Rosenbrock:

Method | Iterations to 10^{-8}
Gradient descent (optimal \eta) | \approx 10^5
CG (Fletcher-Reeves) | \approx 1000
L-BFGS (m = 5) | \approx 30
Newton (with line search) | \approx 15
BFGS | \approx 40

Why it's hard for gradient descent: The "banana-shaped" valley means gradient descent oscillates across the valley and makes tiny progress along the valley. Line search helps but doesn't fix the fundamental curvature mismatch.

Why L-BFGS works well: After a few steps, L-BFGS builds up good curvature information along the valley direction, allowing it to take large steps along the valley.

Q.1 Numerical Analysis of Rosenbrock

import numpy as np

def rosenbrock(x):
    return 100 * (x[1] - x[0]**2)**2 + (1 - x[0])**2

def rosenbrock_grad(x):
    g = np.zeros(2)
    g[0] = -400 * x[0] * (x[1] - x[0]**2) - 2 * (1 - x[0])
    g[1] = 200 * (x[1] - x[0]**2)
    return g

def rosenbrock_hess(x):
    H = np.zeros((2, 2))
    H[0, 0] = -400 * (x[1] - x[0]**2) + 800 * x[0]**2 + 2
    H[0, 1] = -400 * x[0]
    H[1, 0] = -400 * x[0]
    H[1, 1] = 200
    return H

# Condition number at the minimum
H_star = rosenbrock_hess(np.array([1.0, 1.0]))
eigvals = np.linalg.eigvalsh(H_star)
kappa = eigvals.max() / eigvals.min()
print(f"kappa(H*) = {kappa:.2f}")
# Expected: ~2508

Appendix R: Optimization Landscape Analysis for LLMs

Recent research (2020-2024) has revealed interesting properties of the optimization landscape for large language models:

R.1 The Edge of Stability

Edge of Stability (EoS) phenomenon (Cohen et al., 2021): During neural network training with a constant learning rate \eta, the largest Hessian eigenvalue \lambda_{\max}(H) tends to increase until it reaches \lambda_{\max} \approx 2/\eta and then stabilizes there.

Implication: The loss oscillates but decreases on average. The learning rate implicitly controls the curvature of the loss landscape that training "settles into."

Numerical analysis view: 2/\eta is the stability threshold for gradient descent on a quadratic (\eta < 2/\lambda_{\max} is required for convergence). The EoS phenomenon shows that neural networks are often training at the boundary of instability - aggressive but not diverging.
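
A one-dimensional sketch of that stability threshold (assumed toy quadratic): step sizes just below 2/\lambda_{\max} converge with oscillation, just above they diverge - EoS training sits near this boundary.

import numpy as np

# Gradient descent on f(x) = 0.5 * lambda_max * x^2: stable iff eta < 2 / lambda_max
lambda_max = 100.0
for eta in [0.019, 0.0199, 0.0201]:   # threshold is 2 / 100 = 0.02
    x = 1.0
    for _ in range(200):
        x -= eta * lambda_max * x     # gradient step
    print(f"eta = {eta:.4f}: |x| after 200 steps = {abs(x):.3e}")
# eta just below 2/lambda_max converges (slowly, with oscillation);
# eta just above it diverges geometrically.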

R.2 Flat vs. Sharp Minima

Observation (Hochreiter & Schmidhuber, 1997; Keskar et al., 2016): Models trained with small batches (high gradient noise) tend to find "flatter" minima where \lambda_{\max}(H^*) is small. Models trained with large batches find "sharper" minima with large \lambda_{\max}.

Flat minima generalize better because small perturbations to the weights (due to test distribution shift, quantization, etc.) cause smaller changes in the loss.

Numerical measure of sharpness:

\text{Sharpness} = \max_{\|\delta\| \leq \epsilon} \frac{f(x^* + \delta) - f(x^*)}{\epsilon^2} \approx \tfrac{1}{2}\lambda_{\max}(\nabla^2 f(x^*))

SAM (Sharpness-Aware Minimization, Foret et al., 2021) explicitly minimizes sharpness by finding the worst-case perturbation and minimizing there:

\min_x \max_{\|\epsilon\| \leq \rho} f(x + \epsilon) \approx \min_x f\left(x + \rho \frac{\nabla f}{\|\nabla f\|}\right)

This requires two gradient evaluations per step but consistently improves generalization.
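
A minimal SAM step sketch in PyTorch. This is not the authors' reference implementation; loss_fn(model, batch) is a hypothetical closure that computes the loss, and production versions handle parameter groups, mixed precision, and distributed training.

import torch

def sam_step(model, loss_fn, batch, base_optimizer, rho=0.05):
    """One SAM update: perturb weights toward the worst case eps = rho * g/||g||,
    compute the gradient there, undo the perturbation, and let the base
    optimizer step with that perturbed-point gradient."""
    # Gradient at the current weights
    loss_fn(model, batch).backward()

    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.sqrt(sum((p.grad**2).sum() for p in params)) + 1e-12

    # Ascend to the (approximate) worst-case point w + rho * g/||g||
    with torch.no_grad():
        eps = [rho * p.grad / grad_norm for p in params]
        for p, e in zip(params, eps):
            p.add_(e)
    model.zero_grad()

    # SAM gradient: gradient evaluated at the perturbed weights
    loss = loss_fn(model, batch)
    loss.backward()

    # Undo the perturbation, then step using the perturbed-point gradient
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()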

R.3 Loss Landscape at Scale

For billion-parameter models (GPT-3, LLaMA, etc.), direct computation of the Hessian is impossible. Researchers use:

  1. Implicit Hessian-vector products (via Pearlmutter's trick) to estimate Hessian properties
  2. Lanczos iteration to find the leading eigenvalues of the Hessian
  3. Local quadratic approximation using finite differences around a trained model

Key findings (Ghorbani et al., 2019; Yao et al., 2020):

  • The Hessian has a "bulk + outlier" structure: most eigenvalues are near zero (flat directions), with a few large eigenvalues (curved directions)
  • The number of large eigenvalues grows with model capacity
  • The condition number at convergence is surprisingly manageable (\kappa \sim 10^3 to 10^4) for well-trained models

Appendix S: Practical Implementation Notes

S.1 Warm Starting L-BFGS

When fine-tuning a pre-trained model with L-BFGS, warm-start the optimizer by providing the final state from a previous run:

# Load pre-trained weights
model.load_state_dict(torch.load('pretrained.pt'))

# Create L-BFGS optimizer
optimizer = torch.optim.LBFGS(model.parameters(), lr=1.0, history_size=20)

# Optionally load optimizer state for warm start
try:
    optimizer.load_state_dict(torch.load('optimizer_state.pt'))
    print("Warm-starting from previous optimizer state")
except FileNotFoundError:
    print("Starting fresh L-BFGS")

S.2 Debugging Failed Optimization

Symptom: Loss not decreasing after several steps

  • Check gradient norm: g_norm = sum(p.grad.norm()**2 for p in params) ** 0.5
  • If \|g\| < \varepsilon: already at a (local) stationary point, or the learning rate is too small
  • If \|g\| is large: check for NaN gradients and verify the loss computation (a small helper combining these checks is sketched at the end of this subsection)

Symptom: Loss oscillating or diverging

  • Reduce learning rate by 10x
  • Check condition number: kappa = max_sv / min_sv for key weight matrices
  • Add gradient clipping: torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)

Symptom: L-BFGS says "function evaluation limit reached"

  • Increase max_eval parameter
  • Reduce model complexity or regularize more strongly
  • Check if the closure is deterministic (L-BFGS requires consistent function values)
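
A small helper combining the checks above (a sketch; adapt the reporting to your own logging setup):

import math
import torch

def gradient_report(model):
    """Print the global gradient norm and the number of non-finite gradient
    entries - a quick first check when the loss stalls or diverges."""
    total_sq, n_bad = 0.0, 0
    for p in model.parameters():
        if p.grad is None:
            continue
        n_bad += (~torch.isfinite(p.grad)).sum().item()
        total_sq += p.grad.float().pow(2).sum().item()
    print(f"||g|| = {math.sqrt(total_sq):.3e}, non-finite grad entries: {n_bad}")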

S.3 Hyperparameter Sensitivity

For Adam in practice:

  • Learning rate \alpha: most sensitive; typical values 10^{-4} to 10^{-3} for LLMs
  • \beta_1 = 0.9, \beta_2 = 0.999: largely insensitive; \beta_2 = 0.95 is common for very noisy gradients
  • \varepsilon = 10^{-8}: increase to 10^{-7} or 10^{-6} for numerical stability in fp16/bf16
  • Weight decay: treat as regularization, not a numerical parameter; \approx 10^{-2} for large models

For L-BFGS:

  • history_size m: 10-100; larger m gives a better curvature approximation but uses more memory
  • max_iter: maximum number of L-BFGS iterations per optimizer step (PyTorch default 20); 5-20 is typical (a minimal usage sketch follows this list)
  • line_search_fn: always use 'strong_wolfe' for robustness
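
A minimal torch.optim.LBFGS usage sketch with the strong Wolfe line search. Here model, inputs, targets, and loss_fn are assumed to exist; L-BFGS needs a closure because it re-evaluates the objective during its line search:

import torch

optimizer = torch.optim.LBFGS(model.parameters(), lr=1.0, max_iter=20,
                              history_size=20, line_search_fn='strong_wolfe')

def closure():
    # Must recompute loss and gradients deterministically on every call
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    return loss

for epoch in range(10):
    loss = optimizer.step(closure)   # one "step" runs up to max_iter L-BFGS iterations
    print(epoch, loss.item())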



Appendix T: Convergence Rate Summary Tables

T.1 First-Order Methods

Method | Cost/Iter | Conv. rate (smooth convex) | Conv. rate (\mu-strongly convex, condition number \kappa)
Gradient descent | O(d) | O(1/k) | O((1 - 1/\kappa)^k)
GD (optimal step) | O(d) | O(L/k) | O(((\kappa - 1)/(\kappa + 1))^k)
GD + momentum | O(d) | O(1/k^2) (Nesterov) | O(((\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1))^{2k})
CG | O(d) | O(1/k^2) | O(((\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1))^k)
SGD | O(d) | O(1/\sqrt{k}) | O(1/k)
Adam | O(d) | O(1/\sqrt{k}) | Empirically fast

T.2 Second-Order and Quasi-Newton Methods

Method | Memory | Cost/Iter | Convergence
Newton | O(d^2) | O(d^3) | Quadratic near x^*
Newton-CG | O(d) | O(d) per CG step | Super-linear
BFGS | O(d^2) | O(d^2) | Super-linear
L-BFGS | O(md) | O(md) | Super-linear locally
Diagonal precond. | O(d) | O(d) | Linear, rate ((\sqrt{\kappa_D} - 1)/(\sqrt{\kappa_D} + 1))^k

where \kappa_D is the condition number after diagonal preconditioning.

T.3 Trust Region vs Line Search

Aspect | Line Search | Trust Region
Convergence guarantee | Yes (Wolfe) | Yes (ratio test)
Non-convex Hessian | Needs modification | Natural handling
Implementation complexity | Moderate | Higher
Typical algorithms | BFGS, L-BFGS | Dog-leg, GLTR

Appendix U: Worked Numerical Example - L-BFGS Step

Setup: Minimize f(x) = x_1^2 + 10 x_2^2 (Hessian H = \text{diag}(2, 20)) starting from x_0 = (2, 1).

Step 0: x_0 = (2, 1), g_0 = (4, 20). With no history yet (m = 0), the initial direction is steepest descent: d_0 = -g_0 = (-4, -20).

Line search: f(x_0 + \alpha d_0) = (2 - 4\alpha)^2 + 10(1 - 20\alpha)^2 is minimized exactly at \alpha^* = \frac{g_0^\top g_0}{g_0^\top H g_0} = \frac{416}{8032} \approx 0.0518.

After step 0: x_1 = (2 - 4\alpha^*, 1 - 20\alpha^*) = (1.7928, -0.0359).

g_1 = (3.5857, -0.7171).

Update the L-BFGS history: s_0 = x_1 - x_0 = (-0.2072, -1.0359), y_0 = g_1 - g_0 = (-0.4143, -20.7171), \rho_0 = 1/(y_0^\top s_0) = 1/21.546 \approx 0.0464.

Step 1 direction (L-BFGS two-loop recursion, m = 1), starting from q = g_1:

First loop: \alpha_0 = \rho_0 (s_0^\top q) \approx 0, because the exact line search makes g_1 orthogonal to d_0 and hence to s_0. So q remains \approx g_1 = (3.5857, -0.7171).

Scale: \gamma = s_0^\top y_0 / (y_0^\top y_0) = 21.546 / 429.37 \approx 0.0502 \approx 1/20 - the curvature pair is dominated by the stiff coordinate, so the initial scaling recovers 1/\lambda_{\max}.

r = \gamma q = (0.1799, -0.0360)

Second loop: \beta_0 = \rho_0 (y_0^\top r) = 0.0464 \cdot 0.671 \approx 0.0311

r \leftarrow r + s_0 (\alpha_0 - \beta_0) = (0.1799 + 0.0065, -0.0360 + 0.0323) = (0.1864, -0.0037)

Direction: d_1 = -r = (-0.1864, 0.0037).

Compare with the exact Newton direction -H^{-1} g_1 = (-g_{1,1}/2, -g_{1,2}/20) = (-1.7928, 0.0359): d_1 is parallel to it, with both coordinates scaled by the same factor \approx 0.104. After a single curvature pair, L-BFGS already points along the Newton direction; the overall scale is absorbed by the next line search.
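
A short NumPy check of the worked example (a sketch that re-derives the numbers above from scratch):

import numpy as np

# f(x) = x1^2 + 10*x2^2, H = diag(2, 20)
H = np.diag([2.0, 20.0])
grad = lambda x: H @ x

x0 = np.array([2.0, 1.0])
g0 = grad(x0)
d0 = -g0
alpha = (g0 @ g0) / (g0 @ H @ g0)            # exact line search on a quadratic
x1 = x0 + alpha * d0
g1 = grad(x1)
s0, y0 = x1 - x0, g1 - g0

# L-BFGS two-loop recursion with a single (s, y) pair
rho0 = 1.0 / (y0 @ s0)
q = g1.copy()
a0 = rho0 * (s0 @ q); q -= a0 * y0           # first loop
gamma = (s0 @ y0) / (y0 @ y0)                # initial inverse-Hessian scaling
r = gamma * q
b0 = rho0 * (y0 @ r); r += s0 * (a0 - b0)    # second loop

print("d1         =", -r)                        # ~ [-0.1864,  0.0037]
print("Newton dir =", -np.linalg.solve(H, g1))   # ~ [-1.7928,  0.0359]
print("ratio      =", r / np.linalg.solve(H, g1))  # both coordinates ~ 0.104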




Appendix V: Extended Exercises with Solutions

V.1 Verifying Adam's Diagonal Preconditioning (**)

Problem: On the quadratic f(x) = \frac{1}{2} x^\top H x with H = \text{diag}(h_1, \ldots, h_d):

(a) Show that after many steps of Adam on noisy mini-batch gradients (per-coordinate gradient noise variance \sigma_i^2 = h_i), the second-moment estimate near the optimum satisfies v_{\infty, i} \approx h_i.

(b) Show that Adam's effective step in coordinate i is approximately

\Delta x_i \approx -\frac{\alpha}{\sqrt{h_i} + \varepsilon} g_i \approx -\frac{\alpha}{\sqrt{h_i}} g_i

i.e., gradient descent with per-coordinate step size \alpha / \sqrt{h_i}.

(c) Compare with Newton: for this quadratic the Newton step in coordinate i is g_i / h_i, whereas Adam takes g_i / \sqrt{h_i} - a "half-Newton" step (the square root of the Hessian diagonal, not the diagonal itself).

This is why Adam is not full Newton: Adam uses \sqrt{H_{\text{diag}}} as the preconditioner, not H_{\text{diag}}. For a quadratic, the optimal diagonal preconditioner is H_{\text{diag}} (Newton), not \sqrt{H_{\text{diag}}}.

Consequence for convergence: on H = \text{diag}(h_1, \ldots, h_d), comparing each method at its largest stable step size (a small numerical check follows the list):

  • Gradient descent: coordinate i contracts by |1 - \alpha h_i|; at \alpha \approx 2/h_{\max} the slowest coordinate contracts by only \approx 1 - 2/\kappa
  • Adam (effective step \alpha/\sqrt{h_i}): coordinate i contracts by |1 - \alpha \sqrt{h_i}|; at \alpha \approx 2/\sqrt{h_{\max}} the slowest coordinate contracts by \approx 1 - 2/\sqrt{\kappa} - better, but still \kappa-dependent
  • Exact diagonal Newton (step \alpha g_i/h_i): every coordinate contracts by 1 - \alpha, independent of \kappa
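
A small numerical check of these rates on a noiseless diagonal quadratic (the \sqrt{h_i}-preconditioned update below stands in for Adam's steady-state behavior; exact diagonal Newton with unit step would converge in one iteration):

import numpy as np

h = np.array([1.0, 100.0])                              # Hessian diagonal, kappa = 100
x_gd = np.array([1.0, 1.0])
x_pre = x_gd.copy()

eta_gd  = 2 / (h.max() + h.min())                       # best fixed step for plain GD
eta_pre = 2 / (np.sqrt(h).max() + np.sqrt(h).min())     # best step for the sqrt(h)-preconditioned update

for _ in range(200):
    x_gd  = x_gd  - eta_gd  * (h * x_gd)                # gradient descent
    x_pre = x_pre - eta_pre * (h * x_pre) / np.sqrt(h)  # Adam-like: divide by sqrt(h_i)

print(f"GD error after 200 steps:               {np.linalg.norm(x_gd):.2e}")
print(f"sqrt(h)-preconditioned (Adam-like):     {np.linalg.norm(x_pre):.2e}")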

V.2 Finite Difference Gradient Check (*)

Implement a gradient check for a custom loss function:

import numpy as np

def custom_loss(W, X, y):
    """Binary cross-entropy with sigmoid activation."""
    z = X @ W
    p = 1 / (1 + np.exp(-z))
    return -np.mean(y * np.log(p + 1e-12) + (1-y) * np.log(1 - p + 1e-12))

def custom_grad(W, X, y):
    """Analytical gradient."""
    z = X @ W
    p = 1 / (1 + np.exp(-z))
    return X.T @ (p - y) / len(y)

# Test
n, d = 100, 5
X = np.random.randn(n, d)
y = (np.random.randn(n) > 0).astype(float)
W = np.random.randn(d)

# Numerical gradient
h = 1e-5
g_num = np.zeros(d)
for i in range(d):
    W_plus  = W.copy(); W_plus[i]  += h
    W_minus = W.copy(); W_minus[i] -= h
    g_num[i] = (custom_loss(W_plus, X, y) - custom_loss(W_minus, X, y)) / (2*h)

g_ana = custom_grad(W, X, y)
rel_err = np.abs(g_num - g_ana).max() / (np.abs(g_ana).max() + 1e-12)
print(f"Max relative error: {rel_err:.2e}")  # Should be < 1e-5

V.3 Trust Region Radius Adaptation (**)

Implement the trust region ratio test and radius update:

import numpy as np

def trust_region_update(f, x, p, delta, f_k, m_decrease):
    """Update trust region radius based on ratio test."""
    f_new = f(x + p)
    actual = f_k - f_new
    rho = actual / m_decrease  # Ratio: actual/predicted decrease

    # Accept or reject step
    x_new = x + p if rho > 0.1 else x

    # Update radius
    if rho < 0.25:
        delta_new = 0.25 * delta
    elif rho > 0.75 and abs(np.linalg.norm(p) - delta) < 1e-8:
        delta_new = min(2 * delta, 1e6)  # Cap at maximum
    else:
        delta_new = delta

    return x_new, delta_new, rho

Appendix W: Self-Assessment Checklist

After completing Section03, you should be able to:

Core algorithms:

  • Implement backtracking Armijo line search
  • Explain why the Wolfe curvature condition is needed for BFGS
  • Implement the L-BFGS two-loop recursion
  • Implement the trust region ratio test and radius update
  • Compute numerical gradients and find the optimal step size

Conceptual understanding:

  • Explain the connection between Adam and diagonal quasi-Newton methods
  • Explain why L-BFGS doesn't work for stochastic optimization
  • Explain why Newton's method is quadratically convergent
  • State the convergence rate of CG vs gradient descent
  • Explain when to use line search vs trust region

AI connections:

  • Explain Adam's adaptive learning rate as diagonal preconditioning
  • Explain why TRPO/PPO use trust region constraints
  • Explain the role of the Wolfe line search in PyTorch's L-BFGS
  • Describe Shampoo/K-FAC as block quasi-Newton methods
  • Connect gradient clipping to projected gradient descent

Total content: Section1-Section12 (main), Appendices A-FF (extended). Full coverage of numerical optimization for ML practitioners.


Appendix X: GPU-Accelerated Optimization

X.1 Batch Size and Optimization Geometry

The batch size has a profound effect on optimization:

Large batch: The gradient g_B = \frac{1}{|B|} \sum_{i \in B} \nabla f_i(x) approximates the full gradient with low variance. This enables line search, quasi-Newton methods, and faster convergence per epoch.

Small batch: High gradient noise. Adam's second-moment estimate v_t is noisy. The training trajectory is stochastic, which (counterintuitively) often improves generalization.

Linear scaling rule (Goyal et al., 2017): If you multiply the batch size by k, multiply the SGD learning rate by k as well, and use a warmup phase so that the larger steps do not destabilize early training (a square-root scaling, \sqrt{k}, is sometimes used instead, particularly with adaptive optimizers). The linear rule keeps the size of the parameter update contributed by each training example roughly constant.

X.2 Gradient Accumulation as Larger Batch

For very large models that don't fit in GPU memory, gradient accumulation simulates a larger batch by summing gradients over multiple forward-backward passes before applying the optimizer update:

accumulation_steps = 4
optimizer.zero_grad()

for step, batch in enumerate(dataloader):
    # Forward/backward in bf16 autocast; gradients accumulate in the fp32 .grad buffers
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        loss = model(batch) / accumulation_steps  # Scale so the accumulated gradient is a mean
    loss.backward()

    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()

(With bf16 autocast no loss scaling is needed; torch.cuda.amp.GradScaler is only required for fp16, whose narrow exponent range can underflow gradients.)

Numerical consideration: Accumulating gradients in fp32 (not bf16) is critical. If the gradient accumulator itself is in bf16, the accumulated error grows linearly with accumulation steps.

X.3 ZeRO Optimization and Distributed Adam

ZeRO (Zero Redundancy Optimizer, Rajbhandari et al., 2020) shards the optimizer state across GPUs:

ZeRO Stage | What's Sharded | Memory Savings
Stage 1 | Optimizer states (m, v in Adam) | 4x
Stage 2 | + Gradients | 8x
Stage 3 | + Parameters | \propto N_{\text{gpus}}

For a 7B parameter model with Adam:

  • fp32 params: 28 GB
  • fp32 Adam states (m, v): 56 GB
  • Total: 84 GB of parameters plus optimizer state alone - more than a single 80 GB A100, before counting gradients and activations
  • With ZeRO-3 across 64 GPUs: ~1.3 GB of this state per GPU

Numerical impact: ZeRO doesn't change the mathematical update - it just distributes the storage. The optimizer state is identical to non-distributed Adam.
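
The memory numbers above are simple arithmetic; a quick check:

# Back-of-the-envelope memory accounting for the 7B-parameter example above
n_params = 7e9
bytes_fp32 = 4

params_gb = n_params * bytes_fp32 / 1e9             # 28 GB
adam_gb   = 2 * n_params * bytes_fp32 / 1e9         # m and v: 56 GB
total_gb  = params_gb + adam_gb                     # 84 GB
print(params_gb, adam_gb, total_gb, total_gb / 64)  # ZeRO-3 over 64 GPUs: ~1.3 GB each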


Appendix Y: Connections to Bayesian Optimization

Bayesian optimization is a black-box optimization method that uses a probabilistic model (usually a Gaussian Process) to model the objective function and selects the next evaluation point by maximizing an acquisition function.

Connection to numerical optimization:

  • The acquisition function maximization is itself a classical optimization problem (e.g., L-BFGS on the acquisition function)
  • The surrogate model (GP) must be numerically stable: Cholesky factorization of the kernel matrix
  • Ill-conditioning of the kernel matrix -> add noise \sigma^2 I to the diagonal (numerical regularization)

For AI hyperparameter tuning: Bayesian optimization is used to tune:

  • Learning rate, batch size, weight decay
  • Architecture hyperparameters (number of layers, hidden dimensions)
  • Training duration (early stopping threshold)

Tools: Optuna (Python), GPyOpt, BoTorch (PyTorch-native).
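
A minimal Optuna sketch for the learning-rate/weight-decay case; train_and_eval is a hypothetical user-supplied helper that trains briefly and returns a validation loss:

import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    wd = trial.suggest_float("weight_decay", 1e-4, 1e-1, log=True)
    return train_and_eval(lr, wd)   # assumed user-supplied helper

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)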


Appendix Z: Final Notes

This section covered:

  1. Line search methods (Armijo, Wolfe, backtracking) - essential for convergence guarantees
  2. Newton's method - quadratic convergence, modified Newton for non-convex settings
  3. Quasi-Newton methods (BFGS, L-BFGS) - practical second-order optimization
  4. Trust region methods - alternative to line search for non-convex landscapes
  5. Numerical differentiation - finite differences, optimal step size, complex-step
  6. Constrained optimization - projected gradient, ADMM
  7. AI applications - Adam as diagonal quasi-Newton, L-BFGS for full-batch training, HF optimization

Key takeaways:

  • Adam is not an arbitrary heuristic: it approximates diagonal preconditioning of the gradient
  • L-BFGS is the gold standard for full-batch optimization but fails for stochastic settings
  • The condition number of the loss Hessian fundamentally limits gradient descent convergence
  • Trust region methods (TRPO, PPO) provide a sound theoretical framework for policy gradient RL
  • Gradient checking is essential for debugging custom autodiff implementations

Notes length: ~2000 lines. Companion: theory.ipynb (50+ cells) | exercises.ipynb (10 exercises)


Appendix AA: Worked Examples

AA.1 Computing the Optimal Step Size for Gradient Descent

Problem: Minimize f(x) = x_1^2 + 50 x_2^2 with gradient descent. Find the optimal fixed step size.

Analysis:

  • Hessian: H = \text{diag}(2, 100)
  • \lambda_{\max} = 100, \lambda_{\min} = 2, \kappa = 50
  • Optimal step: \eta^* = 2/(\lambda_{\max} + \lambda_{\min}) = 2/102 \approx 0.0196
  • Convergence rate: (\kappa - 1)/(\kappa + 1) = 49/51 \approx 0.961 per iteration
  • Iterations for a 10^{-6} error reduction: \log(10^{-6}) / \log(0.961) \approx 345

Python verification:

import numpy as np

H_diag = np.array([2., 100.])
kappa = 50
eta = 2 / (100 + 2)
rate = (kappa - 1) / (kappa + 1)

x = np.array([1., 1.])
errors = [np.linalg.norm(x)]
for _ in range(500):
    g = H_diag * x
    x -= eta * g
    errors.append(np.linalg.norm(x))

# Check: errors should decrease geometrically at rate ~0.961
actual_rate = errors[10] / errors[9]
print(f"Theoretical rate: {rate:.4f}")
print(f"Actual rate:      {actual_rate:.4f}")  # Should match

AA.2 L-BFGS on the Rosenbrock Function

import numpy as np
from scipy.optimize import minimize

def rosenbrock(x): return 100*(x[1]-x[0]**2)**2 + (1-x[0])**2
def rosenbrock_grad(x):
    return np.array([-400*x[0]*(x[1]-x[0]**2) - 2*(1-x[0]),
                     200*(x[1]-x[0]**2)])

x0 = np.array([-1.0, 1.0])
result = minimize(rosenbrock, x0, jac=rosenbrock_grad, method='L-BFGS-B',
                  options={'maxiter': 1000, 'gtol': 1e-12})

print(f"Solution: {result.x}")          # Should be [1., 1.]
print(f"Function value: {result.fun}")  # Should be ~0
print(f"Iterations: {result.nit}")      # Typically ~30-50

Appendix BB: Glossary

Term | Definition
Armijo condition | Sufficient decrease condition: f(x + \alpha p) \leq f(x) + c_1 \alpha g^\top p
Backtracking line search | Multiply the step by \rho < 1 until the Armijo condition is satisfied
BFGS | Quasi-Newton method that builds an inverse-Hessian approximation from (s, y) pairs
Conjugate direction | d_1, d_2 are A-conjugate if d_1^\top A d_2 = 0
Curvature condition | Ensures the step is not too small; required for BFGS stability
Descent direction | p such that \nabla f^\top p < 0
Fletcher-Reeves | NCG formula: \beta = \|g_{k+1}\|^2 / \|g_k\|^2
L-BFGS | BFGS with limited memory (m vector pairs); O(md) cost
Secant condition | B_{k+1} s_k = y_k; ensures the Hessian approximation matches observed curvature
Super-linear convergence | \lim \|e_{k+1}\| / \|e_k\| = 0; faster than linear, slower than quadratic
Trust region | Sphere of radius \Delta within which the quadratic model is trusted
Two-loop recursion | Algorithm for efficiently computing H_k^{-1} g in L-BFGS
Wolfe conditions | Armijo + curvature conditions; standard for line search in quasi-Newton methods



Appendix CC: Further Reading

Books

  • Nocedal & Wright, "Numerical Optimization" (2nd ed., 2006) - The definitive reference. Chapters 2-7 cover all algorithms in this section in depth.
  • Boyd & Vandenberghe, "Convex Optimization" - Available free online. Rigorous treatment of convex optimization theory.
  • Bertsekas, "Nonlinear Programming" - Comprehensive treatment including constrained methods.

Papers

  • Kingma & Ba, "Adam: A Method for Stochastic Optimization" (2014) - Original Adam paper
  • Nocedal, "Updating quasi-Newton matrices with limited storage" (1980) - Original L-BFGS paper
  • Martens, "Deep learning via Hessian-free optimization" (2010) - Hessian-free for neural networks
  • Gupta, Koren, Singer, "Shampoo: Preconditioned stochastic tensor optimization" (2018) - Shampoo
  • Foret et al., "Sharpness-Aware Minimization" (2021) - SAM optimizer
  • Cohen et al., "Gradient descent on neural networks typically occurs at the edge of stability" (2021) - EoS phenomenon

Software

  • scipy.optimize - Reference implementations of L-BFGS-B, Newton-CG, trust region methods
  • torch.optim - Adam, AdamW, LBFGS, SGD with momentum, Adagrad, RMSProp
  • optax (JAX) - Composable optimizer building blocks
  • Optim.jl (Julia) - pure-Julia implementations of BFGS, L-BFGS, and trust region methods
  • Optuna - Bayesian hyperparameter optimization

Appendix DD: Summary Statistics

Section03 NUMERICAL OPTIMIZATION - SUMMARY
========================================================================

  Main sections:        12 (Intuition through Conceptual Bridge)
  Subsections:          22
  Common mistakes:      12
  Exercises:            10 (* to ***)
  Appendices:           32 (A through FF)

  Algorithms covered:   15 (GD, CG, NCG, BFGS, L-BFGS, SR1, Newton,
                           inexact Newton, trust region, Cauchy point,
                           dogleg, Armijo, Wolfe zoom, ADMM, Adam)

  AI connections:       10+ (Adam, Shampoo, K-FAC, L-BFGS, TRPO/PPO,
                            gradient clipping, SAM, natural gradient,
                            HF optimization, edge of stability)

  Approximate notes length: 2000 lines
  Theory cells: 50+ (planned)
  Exercises: 10 graded problems (24+ cells)

========================================================================

Appendix EE: Numerical Experiments Catalog

EE.1 Comparing Optimizers on Standard Test Functions

The following test functions are standard benchmarks for optimization algorithms:

Function | Formula | Dimension | Challenge
Rosenbrock | \sum_{i=1}^{n-1} [100(x_{i+1} - x_i^2)^2 + (1 - x_i)^2] | n | Ill-conditioned valley
Rastrigin | 10n + \sum_i [x_i^2 - 10\cos(2\pi x_i)] | n | Many local minima
Booth | (x + 2y - 7)^2 + (2x + y - 5)^2 | 2 | Simple quadratic
Beale | (1.5 - x + xy)^2 + (2.25 - x + xy^2)^2 + (2.625 - x + xy^3)^2 | 2 | Nonlinear
Himmelblau | (x^2 + y - 11)^2 + (x + y^2 - 7)^2 | 2 | Multiple global minima

EE.2 Expected Results

Running L-BFGS, gradient descent, and conjugate gradient on the Rosenbrock function (n = 2):

Starting from x0 = (-1, 0):
  Method         Iterations  Final Loss    Final ||g||
  -----------------------------------------------------
  GD (optimal)   ~10,000    ~1e-5         ~1e-3
  CG (FR)        ~500       ~1e-8         ~1e-4
  BFGS           ~50        ~1e-10        ~1e-5
  L-BFGS (m=10)  ~40        ~1e-10        ~1e-5
  Newton         ~20        ~1e-12        ~1e-6

These numbers are approximate and depend on the line search implementation and tolerance settings.

EE.3 Optimizer Sensitivity Analysis

For Adam on a quadratic loss with varying condition numbers:

\kappa(H) | Steps to \varepsilon = 10^{-4} | Steps to \varepsilon = 10^{-8}
1 | ~100 | ~200
10 | ~200 | ~400
100 | ~600 | ~1200
1000 | ~2000 | ~4000
10000 | ~6000 | ~12000

Adam improves on vanilla gradient descent but still scales poorly with the condition number. The rate scales as O(\kappa^{1/2}) (from the diagonal preconditioning) vs O(\kappa) for gradient descent.



Appendix FF: Practice Problems

  1. Implement backtracking Armijo line search and test it on f(x) = e^x - x.

  2. Derive the BFGS update from the minimal-change principle: \min \|B - B_k\| subject to B s_k = y_k and B = B^\top.

  3. Show that for the quadratic f(x) = \frac{1}{2} x^\top H x, the optimal fixed step for gradient descent is \alpha^* = 2/(\lambda_{\max} + \lambda_{\min}).

  4. Implement the complex-step derivative for f(x) = \sin(\ln(x^2 + 1)) and verify it against the analytic derivative (or a centered finite difference).

  5. Implement L-BFGS with strong Wolfe line search for the Rosenbrock function and count iterations until \|\nabla f\| < 10^{-8}.

  6. Prove that if y_k^\top s_k > 0 and H_k \succ 0, then the BFGS update H_{k+1} \succ 0.

  7. Implement the Cauchy point computation for the trust region subproblem on a 2D quadratic.

  8. Analyze the convergence of Adam with decaying learning rate \alpha_t = \alpha/\sqrt{t} toward a stationary point (\nabla f = 0) for convex f, under the bounded-gradient assumptions of Kingma & Ba (2014).

  9. Implement gradient accumulation with fp32 accumulators and bf16 forward/backward passes, and verify that the accumulated gradient matches the full-batch gradient.

  10. Implement ZeRO stage 1 (shard optimizer states across 2 processes using Python multiprocessing) and verify that both processes end up with the same parameters.


Section complete.


<- Section02 Numerical Linear Algebra | -> Section04 Interpolation and Approximation

End of file.