Math for LLMs

Partial Derivatives and Gradients
"The gradient is the generalization of the derivative to functions of several variables, and it points in the direction in which the function increases most rapidly. This single idea powers every learning algorithm in modern AI."

  • inspired by Cauchy, 1847

Overview

Almost every function that matters in machine learning maps a high-dimensional parameter vector to a single scalar: the loss. A neural network with a billion parameters defines a surface in a billion-plus-one-dimensional space, and training means navigating that surface downhill. To navigate, you need to know the slope - not just along one axis, but simultaneously in every direction. That is precisely what the gradient provides.

This section builds the rigorous foundation for differentiation in $\mathbb{R}^n$. We begin with partial derivatives - the idea of measuring rate of change while holding all other variables fixed - then assemble them into the gradient vector, the single most important object in optimization. We prove that the gradient points in the direction of steepest ascent, show that it is perpendicular to the level sets of the function, and derive its formula as the linear part of a first-order approximation. We then compute gradients for every standard ML expression (MSE loss, cross-entropy, quadratic forms, regularized objectives) and study how to verify gradient implementations numerically.

The geometric and algebraic perspective developed here is the prerequisite for everything that follows: Jacobians and Hessians (02) generalize the gradient to vector-valued functions and second-order information; the chain rule (03) tells us how gradients propagate through composed functions; optimality conditions (04) use gradient conditions to characterize minima; automatic differentiation (05) computes gradients of arbitrary programs exactly.

Companion Notebooks

Notebook         Description
theory.ipynb     Interactive: partial derivative surfaces, gradient fields, directional derivatives, loss landscape geometry, gradient checking
exercises.ipynb  10 graded exercises, from partial derivative computation to natural gradient and softmax derivations

Learning Objectives

After completing this section, you will be able to:

  1. Define the partial derivative as a limit and compute it for polynomial, exponential, logarithmic, and composite functions
  2. Explain why a function can have all partial derivatives at a point without being differentiable there
  3. Define the gradient $\nabla_{\mathbf{x}} f \in \mathbb{R}^n$ and identify it as a column vector
  4. Prove that the gradient points in the direction of steepest ascent, using the directional derivative formula
  5. Prove that the gradient is perpendicular to every level set of the function
  6. Compute gradients of standard ML expressions: $\mathbf{a}^\top \mathbf{x}$, $\mathbf{x}^\top A \mathbf{x}$, $\lVert A\mathbf{x} - \mathbf{b} \rVert^2$, softmax cross-entropy
  7. Derive the directional derivative formula $D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}$ and use it to find the directions of maximum and minimum change
  8. Implement numerical gradient checking using centered finite differences and measure relative error
  9. Explain gradient clipping, gradient accumulation, and gradient norm as training diagnostics
  10. Preview how the gradient generalizes to Jacobians (02), flows through computation graphs (03), and is computed by automatic differentiation (05)

1. Intuition

1.1 From Slopes to Surfaces

In single-variable calculus, the derivative $f'(x)$ measures how steeply $f$ rises or falls as $x$ changes. The picture is one-dimensional: you stand on a hill, face a direction, and the derivative tells you the slope. Multivariate calculus asks: what if you can move in any direction? The surface $z = f(x, y)$ is a two-dimensional landscape - a mountain range - and from any point on it you can move north, south, east, west, or any diagonal. Different directions give different slopes.

The key insight is that knowing the slope in every possible direction sounds like infinitely much information, but it turns out you only need to store $n$ numbers - one slope per coordinate axis. These are the partial derivatives. Assembling them into a single vector - the gradient - encodes all directional information: given $\nabla f$, you can compute the slope in any direction via a dot product.

This reduction from "infinitely many slopes" to "one $n$-dimensional vector" is the central miracle of multivariable calculus. It is why gradient descent is computationally tractable for models with a billion parameters: at each point, we compute one gradient vector $\nabla_{\boldsymbol{\theta}} \mathcal{L} \in \mathbb{R}^n$, and we move in the direction $-\nabla_{\boldsymbol{\theta}} \mathcal{L}$.

GRADIENT: FROM ONE SLOPE TO A DIRECTION VECTOR


  Single variable:            Multiple variables:
  f: R -> R                   f: R^n -> R

  f'(x) in R                  nablaf(x) in R^n
  (one number)                (a vector of n partial derivatives)

  Slope along x-axis          Slope along x_1-axis: partialf/partialx_1
                              Slope along x_2-axis: partialf/partialx_2
                              ...
                              Slope along x_n-axis: partialf/partialx_n

  "Steepest direction"        "Steepest direction" = direction of nablaf
  is trivially +/-x           requires computing nablaf

For AI: A language model with $n = 7 \times 10^9$ parameters defines a loss function $\mathcal{L}: \mathbb{R}^n \to \mathbb{R}$. The gradient $\nabla_{\boldsymbol{\theta}} \mathcal{L} \in \mathbb{R}^n$ tells every parameter how much to change and in which direction. One gradient computation, via backpropagation, produces all $n$ values simultaneously - this is why reverse-mode automatic differentiation (05) is so powerful.

1.2 Historical Context

The theory of partial derivatives developed over two centuries, driven largely by physics and the need to describe phenomena depending on space and time simultaneously.

Year       Mathematician                Contribution
1734       Euler                        Introduced the partial derivative notation $\partial f / \partial x$ for fluid mechanics
1770       Lagrange                     Developed the calculus of variations; implicit use of gradients in mechanics
1816       Cauchy                       Rigorous foundation of multivariate analysis; epsilon-delta for limits in $\mathbb{R}^n$
1847       Cauchy                       Published the method of gradient descent for minimizing functions
1873       Maxwell                      Gradient, divergence, curl notation for electromagnetic field equations
1884       Gibbs / Heaviside            Modern vector calculus notation; the $\nabla$ (nabla) operator
1952       Hestenes, Stiefel            Conjugate gradient method - gradient descent with memory
1986       Rumelhart, Hinton, Williams  Backpropagation as systematic gradient computation in neural networks
2012       Krizhevsky et al.            AlexNet - gradient descent on GPUs scales to ImageNet
2017       Vaswani et al.               Attention mechanisms trained by gradient descent (Transformers)
2022-2026  LLM era                      Gradient descent with AdamW, gradient clipping, and mixed precision trains $10^{12}$-parameter models

The name nabla, after the Greek name of an ancient Assyrian harp whose shape the symbol resembles, was suggested to Peter Guthrie Tait by William Robertson Smith; Tait's vector analysis popularized the $\nabla$ operator in the 1880s. Cauchy's 1847 note on gradient descent is remarkably modern in spirit - he describes iterative minimization along the negative gradient and discusses step-size selection. The nearly 140-year gap between Cauchy's paper and backpropagation's popularization shows how long theoretical tools can wait for the right application.

1.3 Why the Gradient Is the Heart of ML

Every major operation in modern AI either computes or responds to a gradient:

  • Training: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}} \mathcal{L}$ - gradient descent or its adaptive variants (Adam, AdamW, Lion)
  • Backpropagation: the chain rule for computing $\nabla_{\boldsymbol{\theta}} \mathcal{L}$ by propagating gradients backward through a computation graph
  • Attention gradients: $\nabla_Q \mathcal{L}$ drives how query projections are updated to attend to more useful keys
  • LoRA (Hu et al., 2022): restricts weight updates to a low-rank subspace $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ - what gets computed is the gradient of $\mathcal{L}$ w.r.t. $A$ and $B$
  • Gradient clipping: prevents exploding gradients in transformer training by rescaling $\nabla \mathcal{L}$ when $\lVert \nabla \mathcal{L} \rVert > G$
  • Mechanistic interpretability: the gradient $\frac{\partial\, \text{output}}{\partial\, \text{input}}$ is a saliency map - it tells us which input tokens matter most

Understanding the gradient rigorously - not just as a formula but as a geometric object with provable properties - is the difference between knowing how to run gradient descent and understanding why it works, when it fails, and how to fix it.


2. Functions of Multiple Variables

2.1 Scalar Fields: $f: \mathbb{R}^n \to \mathbb{R}$

Definition. A scalar field (or scalar-valued function of $n$ variables) is a function

$$f: \mathcal{D} \subseteq \mathbb{R}^n \to \mathbb{R}$$

that assigns to each point $\mathbf{x} = (x_1, x_2, \ldots, x_n)^\top \in \mathcal{D}$ a real number $f(\mathbf{x})$.

The domain $\mathcal{D}$ is a subset of $\mathbb{R}^n$; we usually take $\mathcal{D} = \mathbb{R}^n$ (for polynomials) or the natural domain (where the formula is defined).

Standard examples:

Function       Formula                                                                                        Domain          Geometric meaning
Paraboloid     $f(x,y) = x^2 + y^2$                                                                           $\mathbb{R}^2$  Bowl shape; unique minimum at origin
Saddle         $f(x,y) = x^2 - y^2$                                                                           $\mathbb{R}^2$  Rises in $x$, falls in $y$; saddle point at origin
Gaussian       $f(\mathbf{x}) = e^{-\lVert \mathbf{x} \rVert^2 / (2\sigma^2)}$                                 $\mathbb{R}^n$  Bell shaped; probability density (unnormalized)
MSE loss       $f(\mathbf{w}) = \frac{1}{m}\lVert X\mathbf{w} - \mathbf{y} \rVert^2$                          $\mathbb{R}^d$  Convex paraboloid in weight space
Cross-entropy  $f(\mathbf{w}) = -\frac{1}{m}\sum_{i=1}^m \log p_{\mathbf{w}}(y^{(i)} \mid \mathbf{x}^{(i)})$  $\mathbb{R}^d$  Convex for logistic regression; non-convex for neural networks

Non-examples (not scalar fields):

  • $\mathbf{f}(\mathbf{x}) = A\mathbf{x} \in \mathbb{R}^m$ - this is vector-valued; its derivative is the Jacobian (02)
  • $F(\mathbf{x}) = \int_0^1 f(\mathbf{x}, t)\,dt$ - as a map on functions this kind of object becomes a functional, not a finite-dimensional scalar field

For AI: Every loss function in supervised learning, reinforcement learning, and self-supervised learning is a scalar field over parameter space. Training is navigation of this surface.

2.2 Level Sets and Contour Geometry

Definition. The level set (or level surface) of $f: \mathbb{R}^n \to \mathbb{R}$ at value $c$ is

$$L_c(f) = \{\mathbf{x} \in \mathbb{R}^n : f(\mathbf{x}) = c\}$$

For $n = 2$, level sets are contour lines (curves in the plane). For $n = 3$, they are level surfaces. For general $n$, they are $(n-1)$-dimensional hypersurfaces.

Geometric meaning: Moving along a level set keeps the function value constant - you gain no "potential energy." The gradient, as we will prove in 4.3, is always perpendicular to the level set through the point.

Examples:

  • $f(x,y) = x^2 + y^2$: level sets are concentric circles $x^2 + y^2 = c$ (for $c > 0$)
  • $f(x,y) = x^2 - y^2$: level sets are hyperbolas
  • $f(\mathbf{w}) = \lVert X\mathbf{w} - \mathbf{y} \rVert^2$: level sets are ellipsoids in $\mathbb{R}^d$ (elongated if $X$ is ill-conditioned)
CONTOUR GEOMETRY FOR f(x,y) = x^2 + y^2


  Level sets are concentric circles around the minimum at (0,0):
  c = 1 (inner ring), c = 4, c = 9 (outer ring).

  - Each ring = constant f value        - Minimum at center
  - nablaf points radially outward      - nablaf perpendicular to each ring (proved in 4.3)

For AI: The loss landscape's level sets determine how gradient descent navigates. If level sets are elongated ellipses (high condition number), standard gradient descent zigzags slowly - motivating preconditioning and adaptive methods like Adam, which effectively rescale the gradient to account for curvature.

2.3 Limits and Continuity in $\mathbb{R}^n$

Definition. We say $\lim_{\mathbf{x} \to \mathbf{a}} f(\mathbf{x}) = L$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that

$$0 < \lVert \mathbf{x} - \mathbf{a} \rVert < \delta \implies |f(\mathbf{x}) - L| < \varepsilon$$

This definition is formally identical to the single-variable case, but with the Euclidean distance $\lVert \mathbf{x} - \mathbf{a} \rVert$ replacing $|x - a|$.

Critical difference from single-variable calculus: In $\mathbb{R}$, you can approach a point from only two directions (left or right). In $\mathbb{R}^n$, you can approach along infinitely many paths. A limit exists only if the same value $L$ is obtained along every path.

Non-example (path-dependent limit): Consider $f(x,y) = \frac{xy}{x^2 + y^2}$ as $(x,y) \to (0,0)$:

  • Along $y = 0$: $f(x,0) = 0 \to 0$
  • Along $y = x$: $f(x,x) = \frac{x^2}{2x^2} = \frac{1}{2} \to \frac{1}{2}$

Different paths give different limits, so $\lim_{(x,y)\to(0,0)} f(x,y)$ does not exist.
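The path dependence is easy to see numerically - a minimal Python sketch evaluating $f$ along the two paths above:

```python
f = lambda x, y: x * y / (x**2 + y**2)

for t in [1e-1, 1e-3, 1e-6]:
    print(f(t, 0.0), f(t, t))   # along y = 0: always 0.0; along y = x: always 0.5
```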

Continuity: $f$ is continuous at $\mathbf{a}$ if $\lim_{\mathbf{x}\to\mathbf{a}} f(\mathbf{x}) = f(\mathbf{a})$. Polynomials, rational functions (away from zeros of the denominator), $e^{f(\mathbf{x})}$, and $\log f(\mathbf{x})$ (where $f > 0$) are all continuous on their domains.

All loss functions used in practice (MSE, cross-entropy, hinge, Huber) are continuous on their natural domains. This ensures the gradient descent trajectory is well-defined and doesn't jump discontinuously.

2.4 Differentiability vs. Partial Differentiability

This is a subtle but important distinction that illuminates why the gradient is defined the way it is.

Partial differentiability does NOT imply differentiability. The partial derivatives can exist at a point while the function fails to be differentiable (or even continuous) there.

Non-example (Cauchy 1821): Define

$$f(x,y) = \begin{cases} \dfrac{xy}{x^2 + y^2} & (x,y) \neq (0,0) \\ 0 & (x,y) = (0,0) \end{cases}$$

  • $\frac{\partial f}{\partial x}(0,0) = \lim_{h \to 0} \frac{f(h,0) - f(0,0)}{h} = \lim_{h \to 0} \frac{0}{h} = 0$
  • $\frac{\partial f}{\partial y}(0,0) = 0$ (by symmetry)
  • But $f$ is not continuous at $(0,0)$ (as shown in 2.3, the limit doesn't exist)

True differentiability requires the existence of a linear map $L: \mathbb{R}^n \to \mathbb{R}$ such that

$$\lim_{\mathbf{h} \to \mathbf{0}} \frac{f(\mathbf{a} + \mathbf{h}) - f(\mathbf{a}) - L(\mathbf{h})}{\lVert \mathbf{h} \rVert} = 0$$

When $f$ is differentiable at $\mathbf{a}$, the linear map $L$ is given by $L(\mathbf{h}) = \nabla f(\mathbf{a})^\top \mathbf{h}$. The gradient is then the representation of this linear map in the standard basis.

Sufficient condition: If all partial derivatives exist and are continuous on a neighborhood of $\mathbf{a}$, then $f$ is differentiable at $\mathbf{a}$. All standard ML functions satisfy this condition away from a few exceptional points (like $x = 0$ for ReLU).


3. Partial Derivatives

3.1 Definition as a Limit

Definition. The partial derivative of $f: \mathbb{R}^n \to \mathbb{R}$ with respect to $x_i$ at the point $\mathbf{a}$ is

$$\frac{\partial f}{\partial x_i}(\mathbf{a}) = \lim_{h \to 0} \frac{f(a_1, \ldots, a_i + h, \ldots, a_n) - f(a_1, \ldots, a_i, \ldots, a_n)}{h}$$

provided this limit exists. We perturb only the $i$-th coordinate by $h$ and hold all others fixed.

Notation variants (all in common use):

Notation                            Context
$\frac{\partial f}{\partial x_i}$   Standard - preferred in this repo
$\partial_{x_i} f$ or $f_{x_i}$     Compact, used in PDEs
$D_i f$                             Operator notation
$f_i$                               When clear from context

Geometric interpretation: $\frac{\partial f}{\partial x_i}(\mathbf{a})$ is the slope of the curve obtained by intersecting the surface $z = f(\mathbf{x})$ with the hyperplane $\{x_j = a_j : j \neq i\}$. It measures the instantaneous rate of change of $f$ when we move parallel to the $x_i$-axis.

GEOMETRIC MEANING OF A PARTIAL DERIVATIVE


  Slice the surface z = f(x, y) with the plane y = a_2 (constant).
  The cross-section is a single-variable curve in x, and
  partialf/partialx at (a_1, a_2) is the slope of that curve at x = a_1.

For AI: PyTorch's .grad attribute, populated after calling .backward() on a scalar, holds exactly $\frac{\partial \mathcal{L}}{\partial \theta_i}$ for each parameter $\theta_i$. When you call loss.backward(), all $n$ partial derivatives are computed simultaneously via reverse-mode automatic differentiation.

3.2 Computation: Treating Other Variables as Constants

The practical rule: hold all variables except $x_i$ constant, then differentiate as a single-variable function of $x_i$.

All standard differentiation rules (power rule, product rule, chain rule) apply, treating the fixed variables as parameters.

Example 1: $f(x, y) = 3x^2y + e^{xy} - y^3$

$$\frac{\partial f}{\partial x} = 6xy + ye^{xy} \qquad \text{(treat } y \text{ as constant: } \tfrac{d}{dx}[3x^2 \cdot y] = 6xy, \; \tfrac{d}{dx}[e^{xy}] = ye^{xy}\text{)}$$

$$\frac{\partial f}{\partial y} = 3x^2 + xe^{xy} - 3y^2 \qquad \text{(treat } x \text{ as constant)}$$

Example 2: $f(x, y, z) = \ln(x^2 + y^2) + z\sin(x)$

$$\frac{\partial f}{\partial x} = \frac{2x}{x^2 + y^2} + z\cos(x), \qquad \frac{\partial f}{\partial y} = \frac{2y}{x^2 + y^2}, \qquad \frac{\partial f}{\partial z} = \sin(x)$$

Example 3 (ML): Loss $\mathcal{L}(w_1, w_2) = (w_1 x_1 + w_2 x_2 - y)^2$ for one data point $(x_1, x_2, y)$:

$$\frac{\partial \mathcal{L}}{\partial w_1} = 2(w_1 x_1 + w_2 x_2 - y) \cdot x_1, \qquad \frac{\partial \mathcal{L}}{\partial w_2} = 2(w_1 x_1 + w_2 x_2 - y) \cdot x_2$$

This illustrates the general pattern: $\frac{\partial}{\partial w_i}[\text{prediction error}]^2 = 2 \cdot \text{error} \cdot x_i$ - the gradient of a squared loss is error times feature.
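These hand computations can be sanity-checked against centered finite differences (a minimal NumPy sketch of Example 1; the evaluation point and step size are arbitrary):

```python
import numpy as np

def f(x, y):
    return 3 * x**2 * y + np.exp(x * y) - y**3

def df_dx(x, y):          # hand-derived partial from Example 1
    return 6 * x * y + y * np.exp(x * y)

def df_dy(x, y):          # hand-derived partial from Example 1
    return 3 * x**2 + x * np.exp(x * y) - 3 * y**2

x0, y0, h = 1.2, -0.7, 1e-5
num_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)   # centered difference in x
num_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)   # centered difference in y
print(abs(num_dx - df_dx(x0, y0)))   # ~1e-10: analytic and numeric agree
print(abs(num_dy - df_dy(x0, y0)))
```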

3.3 Higher-Order Partial Derivatives

Since $\frac{\partial f}{\partial x_i}$ is itself a function, we can differentiate again:

Second-order partials:

$$\frac{\partial^2 f}{\partial x_i^2} = \frac{\partial}{\partial x_i}\!\left(\frac{\partial f}{\partial x_i}\right) \qquad \text{(differentiate in } x_i \text{ twice)}$$

$$\frac{\partial^2 f}{\partial x_j \partial x_i} = \frac{\partial}{\partial x_j}\!\left(\frac{\partial f}{\partial x_i}\right) \qquad \text{(differentiate in } x_i \text{ first, then } x_j\text{)}$$

The collection of all second-order partials forms the Hessian matrix $H_f \in \mathbb{R}^{n \times n}$:

$$(H_f)_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

Scope note: The Hessian is defined here, but all its properties (symmetry, definiteness, Newton's method) are treated in 02-Jacobians-and-Hessians.

Example: $f(x,y) = x^3y^2 + \sin(xy)$

$$\frac{\partial f}{\partial x} = 3x^2y^2 + y\cos(xy), \qquad \frac{\partial^2 f}{\partial x^2} = 6xy^2 - y^2\sin(xy)$$

$$\frac{\partial^2 f}{\partial y \partial x} = \frac{\partial}{\partial y}\left[3x^2y^2 + y\cos(xy)\right] = 6x^2y + \cos(xy) - xy\sin(xy)$$

$$\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial}{\partial x}\left[2x^3y + x\cos(xy)\right] = 6x^2y + \cos(xy) - xy\sin(xy)$$

Note: $\frac{\partial^2 f}{\partial y \partial x} = \frac{\partial^2 f}{\partial x \partial y}$ - this equality is Clairaut's theorem, the subject of the next subsection.
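The mixed partials and their symmetry can be confirmed with torch.autograd.functional.hessian (a small sketch of the example above at an arbitrary point):

```python
import torch

def f(v):
    x, y = v
    return x**3 * y**2 + torch.sin(x * y)

point = torch.tensor([1.5, -0.5])
H = torch.autograd.functional.hessian(f, point)
print(H)                        # 2x2 matrix of second partials at the point
print(torch.allclose(H, H.T))   # True: mixed partials agree (Clairaut)
```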

3.4 Equality of Mixed Partials - Clairaut Preview

Theorem (Clairaut/Schwarz). If $f: \mathbb{R}^n \to \mathbb{R}$ has continuous second partial derivatives on an open set, then for all $i \neq j$:

$$\frac{\partial^2 f}{\partial x_j \partial x_i} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

That is, the order of partial differentiation does not matter.

Consequence for the Hessian: The Hessian matrix $H_f$ is symmetric: $H_f = H_f^\top$.

Why continuity is required: Without continuity of the mixed partials, equality can fail. The standard counterexample is:

$$f(x,y) = \begin{cases} xy\,\dfrac{x^2 - y^2}{x^2 + y^2} & (x,y) \neq (0,0) \\ 0 & (x,y) = (0,0) \end{cases}$$

For this function, $\frac{\partial^2 f}{\partial x \partial y}(0,0) = 1 \neq -1 = \frac{\partial^2 f}{\partial y \partial x}(0,0)$.

Full proof and implications: 02-Jacobians-and-Hessians - Clairaut's theorem, Hessian symmetry, positive definiteness.

For AI: The symmetry $H_f = H_f^\top$ is why Hessian-vector products are computationally feasible - symmetric matrices have real eigenvalues, and efficient algorithms (Pearlmutter, 1994) compute $H\mathbf{v}$ without forming $H$ explicitly.

3.5 The Total Derivative

When all variables change simultaneously by small amounts $\delta x_1, \ldots, \delta x_n$, the total change in $f$ is:

$$\delta f \approx \sum_{i=1}^n \frac{\partial f}{\partial x_i}\, \delta x_i = \nabla f(\mathbf{a})^\top \boldsymbol{\delta}$$

In differential notation:

$$df = \frac{\partial f}{\partial x_1}\, dx_1 + \frac{\partial f}{\partial x_2}\, dx_2 + \cdots + \frac{\partial f}{\partial x_n}\, dx_n$$

This is the total differential of $f$ at $\mathbf{a}$. It is exact (not an approximation) in the differential sense, and gives the first-order approximation

$$f(\mathbf{a} + \boldsymbol{\delta}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^\top \boldsymbol{\delta}$$

when $\lVert \boldsymbol{\delta} \rVert$ is small. This first-order Taylor approximation is the foundation of gradient descent: taking a step $\boldsymbol{\delta} = -\eta \nabla f(\mathbf{a})$ decreases $f$ by approximately $\eta \lVert \nabla f(\mathbf{a}) \rVert^2$ (to first order).

Perturbation analysis in ML: When we change a model's weight $w_i$ by a small amount $\delta w_i$, the loss changes by approximately $\frac{\partial \mathcal{L}}{\partial w_i} \cdot \delta w_i$. This is why the gradient tells us exactly which parameters, perturbed in which direction, reduce the loss most efficiently.
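A minimal NumPy sketch of the first-order approximation, using an arbitrary test function - the prediction error shrinks quadratically as the step shrinks:

```python
import numpy as np

def f(x):
    return x[0]**2 + 3 * x[0] * x[1] + np.sin(x[1])

def grad_f(x):
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0] + np.cos(x[1])])

a = np.array([1.0, 2.0])
for scale in [1e-1, 1e-2, 1e-3]:
    delta = scale * np.array([0.6, -0.8])       # a step of norm `scale`
    actual = f(a + delta) - f(a)
    predicted = grad_f(a) @ delta               # first-order prediction
    print(scale, abs(actual - predicted))       # gap shrinks like scale^2
```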


4. The Gradient Vector

4.1 Definition and Notation

Definition. Let $f: \mathbb{R}^n \to \mathbb{R}$ be differentiable at $\mathbf{a}$. The gradient of $f$ at $\mathbf{a}$ is the column vector of all partial derivatives:

$$\nabla_{\mathbf{x}} f(\mathbf{a}) = \begin{pmatrix} \frac{\partial f}{\partial x_1}(\mathbf{a}) \\ \frac{\partial f}{\partial x_2}(\mathbf{a}) \\ \vdots \\ \frac{\partial f}{\partial x_n}(\mathbf{a}) \end{pmatrix} \in \mathbb{R}^n$$

Notation conventions (following docs/NOTATION_GUIDE.md):

Notation                                     Meaning
$\nabla f(\mathbf{a})$                       Gradient at $\mathbf{a}$; subscript on $\nabla$ omitted when the variable is clear
$\nabla_{\mathbf{x}} f$                      Gradient with respect to $\mathbf{x}$ (explicit subscript)
$\nabla_{\boldsymbol{\theta}} \mathcal{L}$   Gradient of loss $\mathcal{L}$ with respect to parameters $\boldsymbol{\theta}$
$\mathbf{g}$                                 Common shorthand for the gradient in optimization

Orientation convention: Following Nocedal & Wright (2006) and all major ML frameworks, the gradient $\nabla f(\mathbf{a}) \in \mathbb{R}^n$ is a column vector. This means the first-order approximation is $f(\mathbf{a} + \boldsymbol{\delta}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^\top \boldsymbol{\delta}$.

Notation warning: Do not write $\nabla f$ as a row vector. Some texts (especially those following Jacobian conventions) write the gradient as a row, but this creates ambiguity with the Jacobian $J_f \in \mathbb{R}^{1 \times n}$. We reserve the Jacobian convention for 02.

Examples:

$f(x,y) = x^2 + 3xy - y^2$:

$$\nabla f = \begin{pmatrix} 2x + 3y \\ 3x - 2y \end{pmatrix}$$

$f(\mathbf{x}) = \lVert \mathbf{x} \rVert^2 = \sum_i x_i^2$:

$$\nabla f = (2x_1, 2x_2, \ldots, 2x_n)^\top = 2\mathbf{x}$$

$f(\mathbf{x}) = \mathbf{a}^\top \mathbf{x} = \sum_i a_i x_i$:

$$\nabla f = (a_1, a_2, \ldots, a_n)^\top = \mathbf{a}$$
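These identities are easy to confirm with PyTorch autograd (a minimal sketch with random vectors):

```python
import torch

a = torch.randn(5)
x = torch.randn(5, requires_grad=True)

(a @ x).backward()                        # f(x) = a^T x
print(torch.allclose(x.grad, a))          # True: nabla(a^T x) = a

x.grad = None                             # clear before the next backward pass
(x @ x).backward()                        # f(x) = ||x||^2
print(torch.allclose(x.grad, 2 * x.detach()))  # True: nabla ||x||^2 = 2x
```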

4.2 The Gradient Points in the Direction of Steepest Ascent

Theorem. Among all unit vectors $\mathbf{u}$ with $\lVert \mathbf{u} \rVert = 1$, the directional derivative $D_{\mathbf{u}} f(\mathbf{a}) = \nabla f(\mathbf{a})^\top \mathbf{u}$ is maximized when $\mathbf{u} = \frac{\nabla f(\mathbf{a})}{\lVert \nabla f(\mathbf{a}) \rVert}$, giving maximum value $\lVert \nabla f(\mathbf{a}) \rVert$.

Proof. By the Cauchy-Schwarz inequality:

$$D_{\mathbf{u}} f(\mathbf{a}) = \nabla f(\mathbf{a})^\top \mathbf{u} \leq \lVert \nabla f(\mathbf{a}) \rVert \cdot \lVert \mathbf{u} \rVert = \lVert \nabla f(\mathbf{a}) \rVert$$

with equality when $\mathbf{u} \parallel \nabla f(\mathbf{a})$, i.e., $\mathbf{u} = \frac{\nabla f(\mathbf{a})}{\lVert \nabla f(\mathbf{a}) \rVert}$. $\square$

Similarly, the minimum directional derivative is $-\lVert \nabla f(\mathbf{a}) \rVert$, achieved in the direction $-\frac{\nabla f(\mathbf{a})}{\lVert \nabla f(\mathbf{a}) \rVert}$.

Corollary (gradient descent). To decrease $f$ most rapidly, step in the direction $-\nabla f(\mathbf{a})$. This is why gradient descent uses $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}} \mathcal{L}$ - it moves in the direction of steepest descent.

DIRECTIONAL DERIVATIVE vs. DIRECTION


  Given nablaf at a point, the directional derivative in direction u:

     D_u f = nablaf * u = ||nablaf|| cos(theta)
                               where theta = angle between nablaf and u

  Maximum (theta=0):    u = nablaf/||nablaf||    D_u f = ||nablaf||    <- steepest ascent
  Zero    (theta=90):   u perp nablaf          D_u f = 0          <- along level set
  Minimum (theta=180):  u = -nablaf/||nablaf||   D_u f = -||nablaf||   <- steepest descent

  Gradient descent chooses u = -nablaf/||nablaf|| at every step.


For AI: The gradient $\nabla_{\boldsymbol{\theta}} \mathcal{L}$ is the unique direction in which perturbing $\boldsymbol{\theta}$ increases the loss most rapidly per unit norm. Gradient descent follows its negative. Adam modifies this direction by dividing by the running second moment $\sqrt{v_t} + \varepsilon$, effectively normalizing each coordinate - this is why Adam often converges faster when gradients have very different scales across parameters.
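The theorem can also be checked empirically: among many random unit directions, none achieves a directional derivative above $\lVert \nabla f \rVert$ (a minimal NumPy sketch using $f(x,y) = x^2 + y^2$ at the point $(1,2)$):

```python
import numpy as np

rng = np.random.default_rng(0)
grad = np.array([2.0, 4.0])        # nabla f at (1, 2) for f(x, y) = x^2 + y^2

best = -np.inf
for _ in range(10_000):
    u = rng.normal(size=2)
    u /= np.linalg.norm(u)         # a random unit direction
    best = max(best, grad @ u)     # directional derivative D_u f

print(best)                        # just below ||grad|| = sqrt(20) ~= 4.472
print(np.linalg.norm(grad))        # the provable maximum
```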

4.3 Gradient Perpendicular to Level Sets

Theorem. Let $f: \mathbb{R}^n \to \mathbb{R}$ be differentiable, and let $\mathbf{a}$ lie on the level set $L_c(f) = \{f(\mathbf{x}) = c\}$. Then $\nabla f(\mathbf{a})$ is perpendicular to $L_c(f)$ at $\mathbf{a}$ - that is, $\nabla f(\mathbf{a})$ is orthogonal to every tangent vector of $L_c(f)$ at $\mathbf{a}$.

Proof. Let $\boldsymbol{\gamma}: (-\varepsilon, \varepsilon) \to \mathbb{R}^n$ be any smooth curve lying in $L_c(f)$ with $\boldsymbol{\gamma}(0) = \mathbf{a}$. Then $f(\boldsymbol{\gamma}(t)) = c$ for all $t$. Differentiating:

$$\frac{d}{dt} f(\boldsymbol{\gamma}(t)) \Big|_{t=0} = 0$$

By the chain rule applied along the curve:

$$\frac{d}{dt} f(\boldsymbol{\gamma}(t)) = \nabla f(\boldsymbol{\gamma}(t))^\top \boldsymbol{\gamma}'(t)$$

At $t = 0$: $\nabla f(\mathbf{a})^\top \boldsymbol{\gamma}'(0) = 0$.

Since $\boldsymbol{\gamma}'(0)$ can be any tangent vector of $L_c(f)$ at $\mathbf{a}$, we conclude $\nabla f(\mathbf{a}) \perp T_{\mathbf{a}} L_c(f)$. $\square$

Geometric consequence: The gradient always points directly "away" from the level set - it is the normal vector to the surface $f = c$. To travel along the level set (constant loss), you must move perpendicular to the gradient.

For AI: In continual learning, orthogonal gradient descent (OGD) projects new-task gradients to be orthogonal to previous task gradients. This ensures learning a new task doesn't increase loss on previous tasks - it literally moves along the level set of the old loss while decreasing the new one.

4.4 Gradient as Linear Approximation

The gradient gives the best linear (affine) approximation to $f$ near a point:

$$f(\mathbf{a} + \boldsymbol{\delta}) = f(\mathbf{a}) + \nabla f(\mathbf{a})^\top \boldsymbol{\delta} + O(\lVert \boldsymbol{\delta} \rVert^2)$$

This is the first-order multivariate Taylor expansion. The remainder term $O(\lVert \boldsymbol{\delta} \rVert^2)$ is controlled by the Hessian (02).

Interpretation: The hyperplane $z = f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a})$ is the tangent hyperplane to the surface $z = f(\mathbf{x})$ at the point $(\mathbf{a}, f(\mathbf{a}))$.

Error bound: If $f$ has bounded second derivatives - specifically if $\lVert H_f(\mathbf{x}) \rVert_2 \leq L$ for all $\mathbf{x}$ - then:

$$\left| f(\mathbf{a} + \boldsymbol{\delta}) - f(\mathbf{a}) - \nabla f(\mathbf{a})^\top \boldsymbol{\delta} \right| \leq \frac{L}{2} \lVert \boldsymbol{\delta} \rVert^2$$

For AI: The $L$-smoothness condition $\lVert \nabla f(\mathbf{x}) - \nabla f(\mathbf{y}) \rVert \leq L \lVert \mathbf{x} - \mathbf{y} \rVert$ (equivalent to a bounded Hessian for twice-differentiable $f$) is the key assumption in gradient descent convergence proofs. It guarantees that the linear approximation is not too far off, so taking a step $-\eta \nabla f$ with $\eta \leq 1/L$ always decreases $f$.

4.5 Gradient of Standard Forms

These are among the most frequently used gradient computations in ML. All can be derived from first principles via the definition.

Theorem (matrix calculus identities). Let $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{a} \in \mathbb{R}^n$, and $A \in \mathbb{R}^{n \times n}$.

$f(\mathbf{x})$                                            $\nabla_{\mathbf{x}} f$                        Proof sketch
$\mathbf{a}^\top \mathbf{x}$                               $\mathbf{a}$                                   $f = \sum_i a_i x_i$, so $\partial f/\partial x_j = a_j$
$\mathbf{x}^\top \mathbf{a}$                               $\mathbf{a}$                                   Same as above (the dot product is symmetric)
$\lVert \mathbf{x} \rVert^2 = \mathbf{x}^\top \mathbf{x}$  $2\mathbf{x}$                                  $f = \sum_i x_i^2$, so $\partial f/\partial x_j = 2x_j$
$\mathbf{x}^\top A \mathbf{x}$                             $(A + A^\top)\mathbf{x}$                       See derivation below
$\mathbf{x}^\top A \mathbf{x}$ ($A$ symmetric)             $2A\mathbf{x}$                                 Since $A = A^\top$
$\lVert A\mathbf{x} - \mathbf{b} \rVert^2$                 $2A^\top(A\mathbf{x} - \mathbf{b})$            Chain rule + above
$\log(\mathbf{a}^\top \mathbf{x})$                         $\mathbf{a} / (\mathbf{a}^\top \mathbf{x})$    Chain rule: $\partial/\partial x_j = a_j / (\mathbf{a}^\top \mathbf{x})$

Proof of $\nabla(\mathbf{x}^\top A \mathbf{x}) = (A + A^\top)\mathbf{x}$:

$$f(\mathbf{x}) = \mathbf{x}^\top A \mathbf{x} = \sum_{i,j} A_{ij} x_i x_j, \qquad \frac{\partial f}{\partial x_k} = \sum_j A_{kj} x_j + \sum_i A_{ik} x_i = (A\mathbf{x})_k + (A^\top \mathbf{x})_k$$

Therefore $\nabla f = A\mathbf{x} + A^\top \mathbf{x} = (A + A^\top)\mathbf{x}$. When $A$ is symmetric ($A = A^\top$), this simplifies to $2A\mathbf{x}$. $\square$

Proof of $\nabla \lVert A\mathbf{x} - \mathbf{b} \rVert^2 = 2A^\top(A\mathbf{x} - \mathbf{b})$:

Let $\mathbf{r} = A\mathbf{x} - \mathbf{b}$. Then $f = \lVert \mathbf{r} \rVert^2 = \mathbf{r}^\top \mathbf{r}$. By the chain rule:

$$\nabla_{\mathbf{x}} f = 2A^\top \mathbf{r} = 2A^\top(A\mathbf{x} - \mathbf{b})$$

For the MSE loss $\mathcal{L}(\mathbf{w}) = \frac{1}{m}\lVert X\mathbf{w} - \mathbf{y} \rVert^2$, this gives:

$$\nabla_{\mathbf{w}} \mathcal{L} = \frac{2}{m} X^\top (X\mathbf{w} - \mathbf{y})$$

This is the gradient of the ordinary least squares objective - the formula that underlies every linear regression solver.
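Both matrix identities can be verified with autograd on random data (a minimal PyTorch sketch):

```python
import torch

n = 4
A = torch.randn(n, n)
b = torch.randn(n)
x = torch.randn(n, requires_grad=True)

(x @ A @ x).backward()                    # f = x^T A x
print(torch.allclose(x.grad, (A + A.T) @ x.detach(), atol=1e-5))

x.grad = None
torch.sum((A @ x - b)**2).backward()      # f = ||Ax - b||^2
print(torch.allclose(x.grad, 2 * A.T @ (A @ x.detach() - b), atol=1e-4))
```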


5. Directional Derivatives

5.1 Definition and Limit Form

The partial derivative measures change along a coordinate axis. The directional derivative measures change in any arbitrary direction.

Definition. Let $f: \mathbb{R}^n \to \mathbb{R}$, $\mathbf{a} \in \mathbb{R}^n$, and $\mathbf{u} \in \mathbb{R}^n$ with $\lVert \mathbf{u} \rVert = 1$ (a unit vector). The directional derivative of $f$ at $\mathbf{a}$ in direction $\mathbf{u}$ is:

$$D_{\mathbf{u}} f(\mathbf{a}) = \lim_{h \to 0} \frac{f(\mathbf{a} + h\mathbf{u}) - f(\mathbf{a})}{h}$$

Note on unit vectors: The convention $\lVert \mathbf{u} \rVert = 1$ is important. If we allowed $\lVert \mathbf{u} \rVert \neq 1$, the directional derivative would scale with $\lVert \mathbf{u} \rVert$, conflating direction with magnitude. With $\lVert \mathbf{u} \rVert = 1$, $D_{\mathbf{u}} f$ measures purely the rate of change per unit distance in direction $\mathbf{u}$.

Relationship to partial derivatives: Partial derivatives are special cases. If $\mathbf{u} = \mathbf{e}_i$ (the $i$-th standard basis vector), then:

$$D_{\mathbf{e}_i} f(\mathbf{a}) = \lim_{h \to 0} \frac{f(\mathbf{a} + h\mathbf{e}_i) - f(\mathbf{a})}{h} = \frac{\partial f}{\partial x_i}(\mathbf{a})$$

So partial derivatives are directional derivatives along the coordinate axes.

Example: $f(x,y) = x^2 - xy + y^2$, $\mathbf{a} = (1,1)$, $\mathbf{u} = (1/\sqrt{2}, 1/\sqrt{2})$ (the northeast direction).

$$D_{\mathbf{u}} f(1,1) = \lim_{h \to 0} \frac{f(1 + h/\sqrt{2},\, 1 + h/\sqrt{2}) - f(1,1)}{h}$$

We can compute this via the gradient formula (5.2): $\nabla f = (2x - y,\, -x + 2y)^\top$, so $\nabla f(1,1) = (1, 1)^\top$, and:

$$D_{\mathbf{u}} f(1,1) = (1,1) \cdot (1/\sqrt{2}, 1/\sqrt{2}) = \sqrt{2}$$

5.2 The Gradient Formula

Theorem. If $f$ is differentiable at $\mathbf{a}$, then for any unit vector $\mathbf{u}$:

$$D_{\mathbf{u}} f(\mathbf{a}) = \nabla f(\mathbf{a})^\top \mathbf{u} = \sum_{i=1}^n \frac{\partial f}{\partial x_i}(\mathbf{a}) \, u_i$$

Proof. Define $g(t) = f(\mathbf{a} + t\mathbf{u})$ for $t \in \mathbb{R}$. Then $D_{\mathbf{u}} f(\mathbf{a}) = g'(0)$. By the chain rule applied component-wise:

$$g'(t) = \sum_{i=1}^n \frac{\partial f}{\partial x_i}(\mathbf{a} + t\mathbf{u}) \, u_i$$

At $t = 0$: $g'(0) = \sum_{i=1}^n \frac{\partial f}{\partial x_i}(\mathbf{a}) \, u_i = \nabla f(\mathbf{a})^\top \mathbf{u}$. $\square$

Geometric meaning: $D_{\mathbf{u}} f(\mathbf{a}) = \nabla f(\mathbf{a})^\top \mathbf{u} = \lVert \nabla f(\mathbf{a}) \rVert \cos\theta$, where $\theta$ is the angle between $\nabla f(\mathbf{a})$ and $\mathbf{u}$.

Consequence: The entire directional derivative function $\mathbf{u} \mapsto D_{\mathbf{u}} f$ is determined by the single vector $\nabla f(\mathbf{a})$. This is the key economy: $n$ numbers encode all directional information.

5.3 Extremizing the Directional Derivative

Summary table:

Direction                         Angle $\theta$   $D_{\mathbf{u}} f$           Meaning
$\mathbf{u} = \hat{\nabla} f$     $0°$             $\lVert \nabla f \rVert$     Steepest ascent
$\mathbf{u} \perp \nabla f$       $90°$            $0$                          Along level set
$\mathbf{u} = -\hat{\nabla} f$    $180°$           $-\lVert \nabla f \rVert$    Steepest descent

where $\hat{\nabla} f = \nabla f / \lVert \nabla f \rVert$ is the unit gradient.

Worked example: $f(x,y) = e^{x^2+y^2}$ at $\mathbf{a} = (1,0)$. Find: (a) the steepest ascent direction; (b) the directional derivative along $\mathbf{u} = (0,1)$.

$$\nabla f = \begin{pmatrix} 2x\, e^{x^2+y^2} \\ 2y\, e^{x^2+y^2} \end{pmatrix}, \qquad \nabla f(1,0) = \begin{pmatrix} 2e \\ 0 \end{pmatrix}$$

(a) Steepest ascent: $\hat{\nabla} f(1,0) = (1, 0)^\top$ (eastward), with rate $2e$.

(b) $D_{(0,1)} f(1,0) = (2e, 0) \cdot (0,1) = 0$ - the surface is flat in the $y$-direction at $(1,0)$ to first order: the slice $y \mapsto f(1,y)$ has a minimum at $y = 0$, so its slope there vanishes.
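A two-line autograd check of this worked example (minimal PyTorch sketch):

```python
import torch

def f(v):
    return torch.exp(v[0]**2 + v[1]**2)

a = torch.tensor([1.0, 0.0], requires_grad=True)
f(a).backward()
print(a.grad)               # tensor([5.4366, 0.0000]) = (2e, 0)

u = torch.tensor([0.0, 1.0])
print(a.grad @ u)           # tensor(0.) - zero slope northward, as computed above
```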

5.4 Subgradients - Preview

At points where $f$ is not differentiable, the gradient $\nabla f$ doesn't exist, but we can generalize via subgradients.

Definition. A vector $\mathbf{g} \in \mathbb{R}^n$ is a subgradient of $f$ at $\mathbf{a}$ if for all $\mathbf{x}$:

$$f(\mathbf{x}) \geq f(\mathbf{a}) + \mathbf{g}^\top (\mathbf{x} - \mathbf{a})$$

The subdifferential $\partial f(\mathbf{a})$ is the set of all subgradients.

Example (ReLU): $f(x) = \max(0, x)$:

  • For $x > 0$: $\partial f(x) = \{1\}$ (unique gradient)
  • For $x < 0$: $\partial f(x) = \{0\}$ (unique gradient)
  • At $x = 0$: $\partial f(0) = [0, 1]$ (any $g \in [0,1]$ is a subgradient)

In PyTorch, torch.relu(x).backward() uses $g = 0$ at $x = 0$ by convention - a valid subgradient choice that works fine in practice, since $x = 0$ occurs with probability zero for continuous input distributions. The snippet below confirms it.
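A one-line check of PyTorch's convention (minimal sketch):

```python
import torch

x = torch.tensor(0.0, requires_grad=True)
torch.relu(x).backward()
print(x.grad)   # tensor(0.) - the subgradient PyTorch picks at the kink
```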

Full treatment of subgradients and convex optimization: 04-Optimality-Conditions and Chapter 8 - Optimization.


6. Gradient in Machine Learning

6.1 Mean Squared Error Gradient

Setup: Linear regression model $\hat{\mathbf{y}} = X\mathbf{w}$, where $X \in \mathbb{R}^{m \times d}$ is the data matrix, $\mathbf{w} \in \mathbb{R}^d$ is the weight vector, and $\mathbf{y} \in \mathbb{R}^m$ is the target vector.

Loss: Mean squared error (MSE):

$$\mathcal{L}(\mathbf{w}) = \frac{1}{m}\lVert X\mathbf{w} - \mathbf{y} \rVert^2 = \frac{1}{m}(X\mathbf{w} - \mathbf{y})^\top(X\mathbf{w} - \mathbf{y})$$

Gradient derivation: Let $\mathbf{r} = X\mathbf{w} - \mathbf{y}$ (the residuals). Then:

$$\mathcal{L} = \frac{1}{m}\mathbf{r}^\top \mathbf{r}, \qquad \nabla_{\mathbf{w}} \mathcal{L} = \frac{1}{m} \nabla_{\mathbf{w}} \left[\mathbf{r}^\top \mathbf{r}\right] = \frac{2}{m} X^\top(X\mathbf{w} - \mathbf{y})$$

Interpretation: Each component $\frac{\partial \mathcal{L}}{\partial w_j} = \frac{2}{m} \sum_{i=1}^m x_{ij}\, r_i$ is (up to scaling) the correlation between the $j$-th feature and the residuals. The gradient is zero when all features are uncorrelated with the residuals - exactly the OLS normal equations $X^\top(X\mathbf{w} - \mathbf{y}) = \mathbf{0}$.

Gradient descent for linear regression:

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \cdot \frac{2}{m} X^\top(X\mathbf{w}_t - \mathbf{y})$$

With learning rate $\eta < 1/\sigma_{\max}(X^\top X / m)$, this converges to the least squares solution (Boyd & Vandenberghe, 2004, 9.3).
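A minimal PyTorch sketch comparing the autograd gradient of the MSE loss against the closed form $\frac{2}{m}X^\top(X\mathbf{w} - \mathbf{y})$ on random data:

```python
import torch

m, d = 50, 3
X = torch.randn(m, d)
y = torch.randn(m)
w = torch.randn(d, requires_grad=True)

loss = torch.mean((X @ w - y)**2)          # (1/m) ||Xw - y||^2
loss.backward()

analytic = (2 / m) * X.T @ (X @ w.detach() - y)
print(torch.allclose(w.grad, analytic, atol=1e-5))   # True
```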

6.2 Cross-Entropy and Logistic Regression Gradient

Logistic regression: Binary classifier $\hat{y} = \sigma(\mathbf{w}^\top \mathbf{x})$ where $\sigma(z) = 1/(1 + e^{-z})$.

Loss (binary cross-entropy on $m$ examples):

$$\mathcal{L}(\mathbf{w}) = -\frac{1}{m}\sum_{i=1}^m \left[y^{(i)} \log \sigma(\mathbf{w}^\top \mathbf{x}^{(i)}) + (1 - y^{(i)}) \log(1 - \sigma(\mathbf{w}^\top \mathbf{x}^{(i)}))\right]$$

Gradient derivation: For one example, let $z = \mathbf{w}^\top \mathbf{x}$ and $p = \sigma(z)$. Using $\sigma'(z) = \sigma(z)(1-\sigma(z))$:

$$\frac{\partial}{\partial \mathbf{w}}\left[-y\log p - (1-y)\log(1-p)\right] = \left(-\frac{y}{p} + \frac{1-y}{1-p}\right) \sigma'(z)\, \mathbf{x} = \frac{p - y}{p(1-p)} \cdot p(1-p) \cdot \mathbf{x} = (p - y)\,\mathbf{x}$$

Therefore, averaging over $m$ examples:

$$\nabla_{\mathbf{w}} \mathcal{L} = \frac{1}{m} X^\top(\hat{\mathbf{y}} - \mathbf{y})$$

where $\hat{\mathbf{y}} = \sigma(X\mathbf{w})$ elementwise. Strikingly, this has the same form as the linear regression gradient: error times features. This pattern holds for all generalized linear models in the exponential family.

Softmax (multiclass) gradient: For $K$-class classification with softmax probabilities $p_k(\mathbf{x}) = \frac{e^{\mathbf{w}_k^\top \mathbf{x}}}{\sum_j e^{\mathbf{w}_j^\top \mathbf{x}}}$ and one-hot label $\mathbf{y}$:

$$\nabla_{\mathbf{w}_k} \mathcal{L} = \frac{1}{m} \sum_{i=1}^m \left(p_k(\mathbf{x}^{(i)}) - y_k^{(i)}\right) \mathbf{x}^{(i)}$$

In matrix form: $\nabla_W \mathcal{L} = \frac{1}{m} X^\top(P - Y)$, where $P \in \mathbb{R}^{m \times K}$ has rows of predicted probabilities.

For AI: The output gradient of a transformer's logit layer, $\frac{\partial \mathcal{L}}{\partial \mathbf{z}} = \mathbf{p} - \mathbf{y}$ (predicted minus true distribution), is used directly in backpropagation. Its simplicity - no need to differentiate through the softmax explicitly - is why cross-entropy is the universal choice for classification losses.
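The same kind of check works for the binary cross-entropy gradient $\frac{1}{m}X^\top(\hat{\mathbf{y}} - \mathbf{y})$ (a minimal PyTorch sketch on random data):

```python
import torch

m, d = 40, 5
X = torch.randn(m, d)
y = (torch.rand(m) < 0.5).float()
w = torch.randn(d, requires_grad=True)

p = torch.sigmoid(X @ w)
loss = -torch.mean(y * torch.log(p) + (1 - y) * torch.log(1 - p))
loss.backward()

analytic = X.T @ (torch.sigmoid(X @ w.detach()) - y) / m   # (1/m) X^T (p - y)
print(torch.allclose(w.grad, analytic, atol=1e-5))          # True
```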

6.3 Gradient of Regularized Objectives

Regularization adds a penalty term to the loss. The gradient of the combined objective is the sum of gradients:

L2 regularization (weight decay):

$$\mathcal{L}_{\text{reg}}(\mathbf{w}) = \mathcal{L}(\mathbf{w}) + \frac{\lambda}{2}\lVert \mathbf{w} \rVert^2, \qquad \nabla_{\mathbf{w}} \mathcal{L}_{\text{reg}} = \nabla_{\mathbf{w}} \mathcal{L} + \lambda \mathbf{w}$$

The update rule becomes $\mathbf{w}_{t+1} = \mathbf{w}_t(1 - \eta\lambda) - \eta \nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w}_t)$. The factor $(1 - \eta\lambda)$ shrinks the weights at every step - this is called weight decay, and it is the standard regularization in LLM training (AdamW with $\lambda = 0.01$ to $0.1$).

L1 regularization (sparsity):

$$\mathcal{L}_{\text{reg}}(\mathbf{w}) = \mathcal{L}(\mathbf{w}) + \lambda \lVert \mathbf{w} \rVert_1$$

The gradient of $\lVert \mathbf{w} \rVert_1 = \sum_i |w_i|$ is $\operatorname{sign}(\mathbf{w})$ (elementwise) wherever no coordinate is zero; at $w_i = 0$ one must choose a subgradient. L1 regularization promotes sparsity - many weights become exactly zero - but it is less common in deep learning: plain (sub)gradient steps rarely land a weight exactly at zero (proximal or thresholding methods are needed to realize the sparsity), and L2 weight decay tends to work better empirically.

Elastic net: $\lambda_1 \lVert \mathbf{w} \rVert_1 + \lambda_2 \lVert \mathbf{w} \rVert^2$ combines both penalties.

6.4 Weight Sharing and Gradient Accumulation

Weight sharing: When the same parameter $w$ appears multiple times in a model (e.g., a shared embedding layer used in both encoder and decoder), its gradient is the sum of the gradients from each occurrence:

$$\frac{\partial \mathcal{L}}{\partial w} = \sum_{\text{occurrences}} \frac{\partial \mathcal{L}}{\partial w_{\text{copy}}}$$

This follows from the multivariable chain rule. In convolutional neural networks, all positions share the same filter weights - the filter's gradient is the sum of its gradient contributions from every spatial location.

Gradient accumulation: For large batch sizes that don't fit in GPU memory, we split the batch into micro-batches, accumulate gradients across them, and apply one optimizer step:

```python
for micro_batch in micro_batches:
    # scale each micro-batch loss so the summed gradients
    # equal the gradient of the full-batch mean loss
    loss = model(micro_batch) / num_micro_batches
    loss.backward()              # adds into each parameter's .grad
optimizer.step()                 # one update with the accumulated gradient
optimizer.zero_grad()            # reset .grad for the next cycle
```

This gives the same gradient as computing on the full batch because $\nabla \frac{1}{m}\sum_i \mathcal{L}_i = \frac{1}{m}\sum_i \nabla \mathcal{L}_i$ (linearity of the gradient).

For AI: GPT-3 and subsequent LLMs were trained with gradient accumulation because the full batch size (millions of tokens) far exceeds GPU memory. Gradient accumulation is mathematically equivalent to large-batch training, making it essential infrastructure for frontier model training.

6.5 Gradient Descent: The Algorithm

Gradient descent is the iterative algorithm for minimizing $\mathcal{L}(\boldsymbol{\theta})$:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t)$$

where $\eta_t > 0$ is the learning rate (or step size) at step $t$.

Why it works (local guarantee): By the first-order approximation:

$$\mathcal{L}(\boldsymbol{\theta}_{t+1}) \approx \mathcal{L}(\boldsymbol{\theta}_t) - \eta_t \lVert \nabla \mathcal{L}(\boldsymbol{\theta}_t) \rVert^2 + O(\eta_t^2)$$

For small enough $\eta_t$, the $-\eta_t \lVert \nabla \mathcal{L} \rVert^2$ term dominates the $O(\eta_t^2)$ remainder, guaranteeing descent: $\mathcal{L}(\boldsymbol{\theta}_{t+1}) < \mathcal{L}(\boldsymbol{\theta}_t)$ (unless $\nabla \mathcal{L} = \mathbf{0}$).

Variants:

Variant               Update                                         When to use
Batch GD              $\nabla$ over all $m$ examples                 Exact gradient; slow per step
Stochastic GD (SGD)   $\nabla$ over 1 example                        Noisy but fast; escapes sharp minima
Mini-batch GD         $\nabla$ over $B$ examples                     Best of both; standard in practice
SGD + Momentum        Adds exponential moving average of gradient   Accelerated convergence; reduces oscillation
Adam                  Adaptive per-coordinate learning rates         Default for transformer training

Convergence ($L$-smooth, convex case): If $\mathcal{L}$ is convex and $L$-smooth, gradient descent with $\eta = 1/L$ satisfies

$$\mathcal{L}(\boldsymbol{\theta}_t) - \mathcal{L}^* \leq \frac{\lVert \boldsymbol{\theta}_0 - \boldsymbol{\theta}^* \rVert^2}{2\eta t}$$

This $O(1/t)$ rate is tight for gradient descent. Strongly convex functions achieve $O((1 - \mu/L)^t)$ linear (i.e., exponential) convergence.
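A minimal NumPy sketch of batch gradient descent on the MSE objective of 6.1, with the step size chosen as $\eta = 1/L$ from the smoothness bound (dimensions and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 100, 5
X = rng.normal(size=(m, d))
y = rng.normal(size=m)

H = 2 * X.T @ X / m                        # Hessian of the quadratic loss
eta = 1.0 / np.linalg.eigvalsh(H).max()    # eta = 1/L from the smoothness bound

w = np.zeros(d)
for _ in range(500):
    grad = (2 / m) * X.T @ (X @ w - y)     # gradient from Section 6.1
    w -= eta * grad

w_star = np.linalg.lstsq(X, y, rcond=None)[0]   # closed-form least squares
print(np.linalg.norm(w - w_star))               # ~0: GD reached the OLS solution
```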

For AI: Modern LLM training uses AdamW with cosine learning rate decay, gradient clipping at $G = 1.0$, and weight decay $\lambda = 0.1$. The gradient itself is the same object described here - the engineering around it (clipping, schedules, accumulation) addresses numerical and optimization dynamics, not the mathematical definition.

Full convergence theory and algorithm variants: Chapter 8 - Optimization.


7. Numerical Differentiation

7.1 Finite Difference Formulas

When an analytical gradient is unavailable or hard to verify, we can approximate it numerically using finite differences.

Forward difference:

$$\frac{\partial f}{\partial x_i}(\mathbf{a}) \approx \frac{f(\mathbf{a} + h\mathbf{e}_i) - f(\mathbf{a})}{h}$$

Error: $O(h)$ - the first neglected Taylor term is of order $h$.

Backward difference:

$$\frac{\partial f}{\partial x_i}(\mathbf{a}) \approx \frac{f(\mathbf{a}) - f(\mathbf{a} - h\mathbf{e}_i)}{h}$$

Error: $O(h)$ - same order.

Centered difference:

$$\frac{\partial f}{\partial x_i}(\mathbf{a}) \approx \frac{f(\mathbf{a} + h\mathbf{e}_i) - f(\mathbf{a} - h\mathbf{e}_i)}{2h}$$

Error: $O(h^2)$ - second-order accurate (the even-power Taylor terms cancel in the subtraction).

Derivation of the centered difference error: By Taylor expansion:

$$f(\mathbf{a} \pm h\mathbf{e}_i) = f(\mathbf{a}) \pm h\frac{\partial f}{\partial x_i} + \frac{h^2}{2}\frac{\partial^2 f}{\partial x_i^2} \pm O(h^3)$$

Subtracting: $f(\mathbf{a} + h\mathbf{e}_i) - f(\mathbf{a} - h\mathbf{e}_i) = 2h\frac{\partial f}{\partial x_i} + O(h^3)$, so the centered difference approximation has error $O(h^2)$.

Computational cost: Computing the full gradient $\nabla f \in \mathbb{R}^n$ via centered differences requires $2n$ function evaluations. For a model with $n = 10^9$ parameters this is computationally infeasible - which is why automatic differentiation (05) is essential.
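A minimal NumPy implementation of the centered-difference gradient (the helper name numerical_gradient is illustrative):

```python
import numpy as np

def numerical_gradient(f, a, h=1e-5):
    """Centered-difference approximation of nabla f at the point a."""
    a = np.asarray(a, dtype=float)
    grad = np.zeros_like(a)
    for i in range(a.size):
        e = np.zeros_like(a)
        e[i] = h
        grad[i] = (f(a + e) - f(a - e)) / (2 * h)   # O(h^2) accurate
    return grad

f = lambda x: x[0]**2 + 3 * x[0] * x[1] - x[1]**2
print(numerical_gradient(f, [1.0, 2.0]))   # ~[8, -1] = (2x+3y, 3x-2y) at (1, 2)
```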

7.2 Optimal Step Size

Choosing hh involves a fundamental trade-off:

  • Too large $h$: the Taylor truncation error $O(h^2)$ dominates
  • Too small $h$: floating-point cancellation in the subtraction dominates

For centered differences, the optimal $h$ minimizes the total error $\approx C h^2 + \epsilon_\text{mach} / h$, which gives:

$$h_\text{opt} \approx \epsilon_\text{mach}^{1/3} \approx 10^{-5} \qquad (\text{for double precision, } \epsilon_\text{mach} \approx 10^{-16})$$
NUMERICAL GRADIENT ERROR vs. STEP SIZE h


  log(error)
      |
   -2 | \                                        /
      |  \  Roundoff error ~ eps/h              /
   -4 |   \                                    /  Truncation error O(h^2)
      |    \                                  /
   -6 |     \_______________   ______________/
      |                     \_/  <- minimum at h ~= 1e-5
   -8 |
      +---------------------------------------------  log(h)
         -12     -9      -6      -3       0

Rule of thumb: Use $h = 10^{-5}$ for centered differences in double precision. In single precision ($\epsilon_\text{mach} \approx 10^{-7}$), the same $\epsilon_\text{mach}^{1/3}$ rule gives $h \approx 5 \times 10^{-3}$.
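The trade-off is visible in a short sweep over $h$ (a minimal NumPy sketch differentiating $\sin$ at $x = 1$):

```python
import numpy as np

f, df = np.sin, np.cos                     # test function with known derivative
x = 1.0
for k in range(1, 13):
    h = 10.0**(-k)
    err = abs((f(x + h) - f(x - h)) / (2 * h) - df(x))
    print(f"h=1e-{k:02d}  error={err:.2e}")
# error falls like h^2 down to h ~ 1e-5, then roundoff (eps/h) takes over
```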

7.3 Gradient Checking

Gradient checking is the practice of verifying an analytical gradient implementation by comparing it to the numerical gradient.

Algorithm:

  1. Compute the analytical gradient $\mathbf{g}_\text{anal} = \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta})$ using backpropagation
  2. For each parameter $\theta_i$, compute the numerical gradient:
$$g_i^\text{num} = \frac{\mathcal{L}(\boldsymbol{\theta} + h\mathbf{e}_i) - \mathcal{L}(\boldsymbol{\theta} - h\mathbf{e}_i)}{2h}$$
  3. Compute the relative error:
$$\text{relerr} = \frac{\lVert \mathbf{g}_\text{anal} - \mathbf{g}_\text{num} \rVert_2}{\lVert \mathbf{g}_\text{anal} \rVert_2 + \lVert \mathbf{g}_\text{num} \rVert_2}$$

Thresholds:

  • $< 10^{-7}$: Great - the gradient is likely correct
  • $10^{-7}$ to $10^{-5}$: Acceptable for most applications
  • $10^{-5}$ to $10^{-3}$: Suspicious - often caused by non-smooth points (e.g., ReLU kinks) or single precision; investigate
  • $> 10^{-3}$: Bug in the gradient - investigate component by component

Practical tips:

  • Use double precision (float64) during gradient checks, even if training uses float16
  • Check on a small random input, not on all training data
  • Check each component separately to localize bugs
  • Do NOT gradient check with dropout active (stochastic masks break the comparison)

For AI: Andrej Karpathy's micrograd tutorial uses gradient checking to verify backpropagation at every step. Gradient checking was critical during the early days of deep learning to ensure backprop implementations were correct; today, libraries like PyTorch and JAX are well-tested, but gradient checking remains essential when implementing custom layers or loss functions.
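A minimal PyTorch sketch of the full procedure (the helper name gradient_check is illustrative; it follows the tips above by working in float64):

```python
import torch

def gradient_check(f, theta, h=1e-5):
    """Relative error between the autograd gradient and a centered-difference
    gradient. Sketch: assumes f maps a 1-D float64 tensor to a scalar."""
    theta = theta.detach().double().requires_grad_(True)
    f(theta).backward()
    g_anal = theta.grad.clone()

    g_num = torch.zeros_like(g_anal)
    with torch.no_grad():
        for i in range(theta.numel()):
            e = torch.zeros_like(theta)
            e[i] = h
            g_num[i] = (f(theta + e) - f(theta - e)) / (2 * h)

    return (g_anal - g_num).norm() / (g_anal.norm() + g_num.norm())

f = lambda w: torch.sum((w**2 - 1.0)**2)     # an arbitrary smooth test loss
print(gradient_check(f, torch.randn(10)))    # ~1e-10, well below the 1e-7 threshold
```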

7.4 Limitations and the Case for Automatic Differentiation

Numerical differentiation has three fatal limitations for large models:

  1. Cost: $O(n)$ function evaluations for an $n$-parameter model - infeasible for $n > 10^6$
  2. Accuracy: the $O(h^2)$ truncation error is often insufficient for sensitive applications (e.g., second-order methods)
  3. No higher-order information: approximating the full Hessian numerically costs $O(n^2)$ function evaluations

Automatic differentiation (05) resolves all three:

  • Cost: Reverse mode computes the full gradient at the cost of $O(1)$ model evaluations (a small constant times one forward pass)
  • Accuracy: Exact to machine precision - no truncation error
  • Higher-order: Forward-over-reverse AD computes Hessian-vector products exactly, also at $O(1)$ cost

Full treatment: 05-Automatic-Differentiation.


8. Loss Landscape Geometry

8.1 Loss Landscapes for Neural Networks

A neural network's loss $\mathcal{L}(\boldsymbol{\theta})$ defines a high-dimensional surface over the parameter space $\mathbb{R}^n$. Unlike convex functions (which have a unique global minimum), neural network loss landscapes are:

  • Non-convex: multiple local minima, saddle points, and flat regions
  • High-dimensional: intuitions from 2D and 3D break down; most "local minima" in high dimensions are actually saddle points with at least one negative-curvature direction
  • Approximately convex near minima: empirically, well-trained networks sit in wide flat basins where the Hessian has many near-zero eigenvalues and few negative ones

Critical points and their classification (using the Hessian - see 02):

Type             $\nabla \mathcal{L}$   Hessian eigenvalues   Behavior
Global minimum   $= \mathbf{0}$         All $> 0$             Best possible parameters
Local minimum    $= \mathbf{0}$         All $> 0$             Training converges here; good generalization if in a wide basin
Saddle point     $= \mathbf{0}$         Mixed signs           Gradient descent escapes with noise (SGD)
Flat region      $\approx \mathbf{0}$   Many $\approx 0$      Plateau; training stalls; warm restarts help

The Hessian eigenvalue view: The curvature of the loss surface at a minimum is characterized by the Hessian eigenvalues. Wide minima (small eigenvalues) generalize better (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017) - this is the theoretical basis for SAM (Sharpness-Aware Minimization, Foret et al., 2021), which explicitly seeks flat minima.

8.2 Gradient Norm as a Training Diagnostic

The gradient norm $\lVert \nabla_{\boldsymbol{\theta}} \mathcal{L} \rVert$ is one of the most informative training diagnostics:

Gradient norm behavior          Possible cause
Steadily decreasing             Normal convergence
Suddenly large (spike)          Batch with unusual data; learning rate too high
Near zero early in training     Vanishing gradients (very deep networks without skip connections)
Near zero late in training      Near a critical point (minimum or saddle)
Persistently large              Divergence; reduce the learning rate

Vanishing gradients: In deep networks without normalization or skip connections, the gradient norm often decays exponentially with depth. If each layer $l$ has weight matrix $W^{[l]}$ with $\lVert W^{[l]} \rVert_2 < 1$, then $\lVert \nabla_{\boldsymbol{\theta}^{[1]}} \mathcal{L} \rVert \leq \prod_l \lVert W^{[l]} \rVert_2 \to 0$ exponentially. This motivated BatchNorm (Ioffe & Szegedy, 2015), LayerNorm (Ba et al., 2016), and residual connections (He et al., 2016).

8.3 Gradient Clipping

Problem: In recurrent networks (RNNs, early transformers), gradients can explode - growing exponentially through time - making training unstable.

Solution (Pascanu et al., 2013): Rescale the gradient whenever its norm exceeds a threshold $G$:

$$\mathbf{g} \leftarrow \begin{cases} \mathbf{g} & \text{if } \lVert \mathbf{g} \rVert \leq G \\ G \cdot \dfrac{\mathbf{g}}{\lVert \mathbf{g} \rVert} & \text{if } \lVert \mathbf{g} \rVert > G \end{cases}$$

This preserves the direction of the gradient while bounding the step size. It is not the same as gradient truncation (which clips each component independently and changes the direction).

Effect: Clipping to $G$ guarantees the update $\eta \mathbf{g}$ has norm at most $\eta G$, regardless of the curvature of $\mathcal{L}$.

Standard values: $G = 1.0$ for transformer training (used in GPT-2, GPT-3, Llama, and most modern LLMs). The PyTorch call is torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).

For AI: Gradient clipping is a near-universal practice in LLM training. Without it, a single pathological batch can send parameters into a numerically unstable region from which training never recovers. The choice $G = 1.0$ is empirical - it rarely triggers during stable training but provides a safety net during instability events.
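A minimal sketch of the rescaling rule as it might be implemented by hand; the library call torch.nn.utils.clip_grad_norm_ does the same thing (the helper name is illustrative, and the small epsilon used by the library is omitted for clarity):

```python
import torch

def clip_gradient_(params, max_norm=1.0):
    """Rescale all gradients in place so their global norm is at most max_norm."""
    grads = [p.grad for p in params if p.grad is not None]
    total_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        for g in grads:
            g.mul_(scale)          # same direction, bounded norm
    return total_norm

# the equivalent, battle-tested library call:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```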

8.4 Gradient Accumulation for Large Batch Training

When the effective batch size $B$ is larger than what fits in GPU memory, gradient accumulation simulates the large batch:

$$\nabla_{\boldsymbol{\theta}} \mathcal{L}_\text{batch} = \frac{1}{B}\sum_{i=1}^B \nabla_{\boldsymbol{\theta}} \mathcal{L}_i = \frac{1}{k}\sum_{j=1}^k \underbrace{\frac{1}{B/k}\sum_{i \in \text{micro-batch}_j} \nabla_{\boldsymbol{\theta}} \mathcal{L}_i}_{\text{micro-batch gradient}}$$

where $k$ is the number of gradient accumulation steps. The gradient is accumulated across $k$ micro-batches before a single optimizer step.

Mathematical validity: By linearity of the gradient, the accumulated gradient is exactly equal to the gradient computed over the full batch - equality holds exactly, not approximately.

For AI: Llama 2 70B used a global batch size of 2048 sequences of 4096 tokens - far exceeding GPU memory for a single forward pass. Gradient accumulation, combined with data-parallel training across multiple GPUs, makes this feasible.


9. Advanced Topics

9.1 Gâteaux and Fréchet Derivatives

In infinite-dimensional settings (function spaces), the ordinary gradient doesn't apply. Two generalizations are:

Gâteaux derivative (directional derivative in function space):

DhF(f)=limε0F(f+εh)F(f)εD_h F(f) = \lim_{\varepsilon \to 0} \frac{F(f + \varepsilon h) - F(f)}{\varepsilon}

where FF is a functional (maps functions to scalars), ff is a function, and hh is a perturbation function. This is the infinite-dimensional analog of the directional derivative.

Fréchet derivative: FF is Fréchet differentiable at ff when F(f+h) - F(f) is approximated by a continuous linear map with error o(\lVert h \rVert). In that case the Gâteaux derivative D_h F(f) is linear and continuous in hh and can be written as D_h F(f) = \langle \nabla F(f), h \rangle for some function \nabla F(f) - this is the functional gradient.

Applications in ML:

  • Variational inference: Optimizing the ELBO L[q]\mathcal{L}[q] over distributions qq - the gradient is a functional derivative
  • Optimal transport: Wasserstein gradient flows define how distributions evolve under gradient descent in the space of probability measures
  • Neural tangent kernel (NTK): As network width \to \infty, gradient descent on θ\boldsymbol{\theta} induces gradient descent in function space with the NTK as inner product

9.2 Wirtinger Derivatives

When optimizing real-valued functions of complex variables - common in signal processing, compressed sensing, and some neural network architectures - the standard complex derivative doesn't apply: a non-constant real-valued function of a complex variable cannot satisfy the Cauchy-Riemann equations, so it is not holomorphic and has no complex derivative.

Wirtinger calculus defines pseudo-derivatives:

fz=12(fxify),fzˉ=12(fx+ify)\frac{\partial f}{\partial z} = \frac{1}{2}\left(\frac{\partial f}{\partial x} - i\frac{\partial f}{\partial y}\right), \qquad \frac{\partial f}{\partial \bar{z}} = \frac{1}{2}\left(\frac{\partial f}{\partial x} + i\frac{\partial f}{\partial y}\right)

where z = x + iy. For real-valued ff, gradient descent follows -\frac{\partial f}{\partial \bar{z}} (the Wirtinger gradient), which up to a factor of 2 is the ordinary gradient in the (x, y) coordinates.

For AI: Complex-valued neural networks used in audio processing and MRI reconstruction use Wirtinger derivatives. PyTorch supports complex tensors and correctly computes Wirtinger gradients via .backward().

9.3 Gradients on Manifolds and the Natural Gradient

Standard gradient descent treats parameter space as flat Euclidean space. But probability distributions form a statistical manifold with a non-Euclidean geometry defined by the Fisher information matrix:

F(θ)=Expθ ⁣[θlogpθ(x)θlogpθ(x)]F(\boldsymbol{\theta}) = \mathbb{E}_{x \sim p_{\boldsymbol{\theta}}}\!\left[\nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(x) \nabla_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(x)^\top\right]

The natural gradient (Amari, 1998) accounts for this geometry:

~θL=F(θ)1θL\tilde{\nabla}_{\boldsymbol{\theta}} \mathcal{L} = F(\boldsymbol{\theta})^{-1} \nabla_{\boldsymbol{\theta}} \mathcal{L}

Natural gradient descent often converges in far fewer iterations than standard gradient descent because it takes steps of equal "distributional size" rather than equal "parameter size."

Connection to LLM fine-tuning:

  • LoRA (Hu et al., 2022): Constrains the weight update \Delta W = BA to a low-rank subspace. The intuition is that the useful part of the gradient lies near a low-rank subspace, and parameterizing the update as BA is an efficient approximation
  • K-FAC (Martens & Grosse, 2015): Approximates the Fisher information as a Kronecker product of smaller matrices, making natural gradient descent tractable for neural networks
  • DoRA (Liu et al., 2024): Decomposes weight updates into magnitude and direction components, further refining the LoRA idea

Riemannian gradient: On a manifold with metric tensor G(θ)G(\boldsymbol{\theta}), the Riemannian gradient is G(θ)1θLG(\boldsymbol{\theta})^{-1}\nabla_{\boldsymbol{\theta}} \mathcal{L}. The Fisher information F(θ)F(\boldsymbol{\theta}) plays the role of GG on the statistical manifold.

9.4 Jacobian Preview

So far we have discussed gradients of scalar functions f:RnRf: \mathbb{R}^n \to \mathbb{R}. When ff is vector-valued, f:RnRm\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m, the gradient generalizes to the Jacobian matrix JfRm×nJ_{\mathbf{f}} \in \mathbb{R}^{m \times n}:

(Jf)ij=fixj(J_{\mathbf{f}})_{ij} = \frac{\partial f_i}{\partial x_j}

The Jacobian is the "matrix of all first-order partial derivatives" - its ii-th row is the gradient of the ii-th output component.
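As a quick numerical preview, the Jacobian can be approximated column by column with centered differences (a sketch; the test function and evaluation point are arbitrary):

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-6):
    # centered differences: J[:, j] = (f(x + h e_j) - f(x - h e_j)) / 2h
    fx = np.asarray(f(x))
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (np.asarray(f(x + e)) - np.asarray(f(x - e))) / (2 * h)
    return J

# f(x) = (x0*x1, sin(x0), x1^2) has a known 3x2 Jacobian
f = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])
x = np.array([0.7, -1.3])
J_true = np.array([[x[1], x[0]],
                   [np.cos(x[0]), 0.0],
                   [0.0, 2 * x[1]]])
print(np.max(np.abs(numerical_jacobian(f, x) - J_true)))  # ~1e-10
```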

Preview: Jacobians and Hessians

When f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m, the Jacobian JfRm×nJ_{\mathbf{f}} \in \mathbb{R}^{m \times n} generalizes the gradient and is the fundamental object for backpropagation through vector-valued layers. The Hessian HfRn×nH_f \in \mathbb{R}^{n \times n} generalizes the second derivative and characterizes curvature, saddle points, and Newton's method.

-> Full treatment: 02 Jacobians and Hessians


10. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
| --- | --- | --- | --- |
| 1 | Writing \nabla f as a row vector | Gradient is a column vector \in \mathbb{R}^n; writing as a row conflates it with the Jacobian (which is 1 \times n for scalar f) and causes dimension errors in \nabla f^\top \boldsymbol{\delta} | Always write \nabla f \in \mathbb{R}^n as a column vector; write \nabla f^\top when you need a row |
| 2 | Confusing \frac{\partial f}{\partial x}\frac{\partial g}{\partial y} with \frac{\partial^2 (fg)}{\partial x \partial y} | The product rule applies: \frac{\partial^2 (fg)}{\partial x \partial y} = \frac{\partial f}{\partial x}\frac{\partial g}{\partial y} + f \frac{\partial^2 g}{\partial x \partial y} + \frac{\partial^2 f}{\partial x \partial y} g + \frac{\partial g}{\partial x}\frac{\partial f}{\partial y} | Apply the product rule in each differentiation step |
| 3 | Treating \nabla(\mathbf{x}^\top A \mathbf{x}) = 2A\mathbf{x} when A is not symmetric | This holds only when A = A^\top; in general, \nabla(\mathbf{x}^\top A \mathbf{x}) = (A + A^\top)\mathbf{x} | Check symmetry; use the general formula from 4.5 |
| 4 | Using forward differences instead of centered differences for gradient checking | Forward difference has O(h) error; centered has O(h^2); at h = 10^{-5} this is 10^{-5} vs 10^{-10} - a factor of 10^5 | Always use centered differences \frac{f(\mathbf{x}+h\mathbf{e}_i) - f(\mathbf{x}-h\mathbf{e}_i)}{2h} for gradient checking |
| 5 | Concluding a point is a local minimum because all partial derivatives are zero | \nabla f = \mathbf{0} is necessary but not sufficient; the point could be a saddle point (e.g., f(x,y) = x^2 - y^2 at the origin) | Check the Hessian: if H_f \succ 0, it's a local minimum (02, 04) |
| 6 | Claiming a function is differentiable because its partial derivatives exist | False - all partials can exist while f is not even continuous (2.4 counterexample) | Differentiability requires more: the linear approximation error must be o(\lVert \boldsymbol{\delta}\rVert) |
| 7 | Applying gradient descent with a step \eta > 2/L for L-smooth functions | This can cause divergence; the safe step size for gradient descent is \eta \leq 1/L where L is the Lipschitz constant of \nabla f | Check the smoothness constant; use line search or backtracking if unknown |
| 8 | Gradient clipping individual components (element-wise) instead of the norm | Element-wise clipping changes the gradient direction and can slow convergence; norm clipping preserves direction | Use clip_grad_norm_() (norm clipping), not clip_grad_value_(), unless you specifically want element-wise clipping |
| 9 | Ignoring the \frac{1}{m} factor in batch gradients vs. single-sample gradients | Learning rate must be retuned when switching batch sizes; gradient magnitudes scale with m | Consistently include the \frac{1}{m} factor in the loss; or rescale the learning rate when changing batch size B |
| 10 | Confusing the directional derivative with the gradient projected onto \mathbf{u} | The directional derivative D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u} requires \lVert \mathbf{u} \rVert = 1; for non-unit \mathbf{u}, the result scales with \lVert \mathbf{u} \rVert | Normalize \mathbf{u} before computing the directional derivative |

11. Exercises

Exercise 1 - Partial Derivatives

Let f(x,y,z)=x3y2exz+ln(y2+z2)f(x, y, z) = x^3 y^2 - e^{xz} + \ln(y^2 + z^2).

(a) Compute fx\frac{\partial f}{\partial x}, fy\frac{\partial f}{\partial y}, fz\frac{\partial f}{\partial z}.

(b) Evaluate f(1,1,0)\nabla f(1, 1, 0).

(c) Verify that the mixed partials 2fxy\frac{\partial^2 f}{\partial x \partial y} and 2fyx\frac{\partial^2 f}{\partial y \partial x} are equal.

(d) For AI: The loss L(w1,w2)=(w12w2)2+(w21)2\mathcal{L}(w_1, w_2) = (w_1^2 - w_2)^2 + (w_2 - 1)^2. Find L\nabla \mathcal{L} and the critical points.


Exercise 2 - Gradient Computation and Geometry

Let f(x,y)=x2+xy+y2f(x,y) = x^2 + xy + y^2.

(a) Compute f\nabla f and evaluate at (1,1)(1, -1), (0,0)(0, 0), (2,3)(2, 3).

(b) At each of the three points, find the direction of steepest ascent (unit vector).

(c) Find all level curves f(x,y)=cf(x,y) = c for c=1,3,7c = 1, 3, 7. Verify numerically that f(1,1)\nabla f(1,-1) is perpendicular to the level curve passing through (1,1)(1,-1).

(d) Show that f(x,y)=x2+xy+y2>0f(x,y) = x^2 + xy + y^2 > 0 for all (x,y)(0,0)(x,y) \neq (0,0) by completing the square or using the discriminant of the associated quadratic form.


Exercise 3 - Directional Derivatives

Let f(x,y)=sin(xy)+cos(x)f(x,y) = \sin(xy) + \cos(x).

(a) Compute f\nabla f at a=(π/2,1)\mathbf{a} = (\pi/2, 1).

(b) Find the direction u\mathbf{u} that maximizes Duf(a)D_{\mathbf{u}} f(\mathbf{a}) and the direction that minimizes it. What are the maximum and minimum values?

(c) Find all directions u\mathbf{u} such that Duf(a)=0D_{\mathbf{u}} f(\mathbf{a}) = 0.

(d) Compute the directional derivative in direction (3/5,4/5)(3/5, 4/5)^\top and in direction (1/2,1/2)(1/\sqrt{2}, -1/\sqrt{2})^\top.


Exercise 4 - Matrix Calculus from First Principles

Prove the following from the definition fxk=limh0f(x+hek)f(x)h\frac{\partial f}{\partial x_k} = \lim_{h\to 0}\frac{f(\mathbf{x}+h\mathbf{e}_k)-f(\mathbf{x})}{h}:

(a) x(ax)=a\nabla_{\mathbf{x}}(\mathbf{a}^\top \mathbf{x}) = \mathbf{a} for fixed aRn\mathbf{a} \in \mathbb{R}^n.

(b) x(xAx)=(A+A)x\nabla_{\mathbf{x}}(\mathbf{x}^\top A \mathbf{x}) = (A + A^\top)\mathbf{x} for fixed ARn×nA \in \mathbb{R}^{n \times n}.

(c) xAxb2=2A(Axb)\nabla_{\mathbf{x}} \lVert A\mathbf{x} - \mathbf{b} \rVert^2 = 2A^\top(A\mathbf{x} - \mathbf{b}) for fixed ARm×nA \in \mathbb{R}^{m \times n}, bRm\mathbf{b} \in \mathbb{R}^m.

(d) Show that the gradient of the MSE loss 1mXwy2\frac{1}{m}\lVert X\mathbf{w} - \mathbf{y} \rVert^2 equals zero when XXw=XyX^\top X \mathbf{w} = X^\top \mathbf{y} - the normal equations of linear regression.


Exercise 5 - MSE Gradient and Linear Regression

Implement gradient descent for linear regression from scratch.

(a) Generate synthetic data: y=Xw+ε\mathbf{y} = X\mathbf{w}^* + \boldsymbol{\varepsilon} where XR100×3X \in \mathbb{R}^{100 \times 3}, w=(2,1,0.5)\mathbf{w}^* = (2, -1, 0.5)^\top, εN(0,0.1I)\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, 0.1I).

(b) Implement the MSE gradient wL=2mX(Xwy)\nabla_{\mathbf{w}} \mathcal{L} = \frac{2}{m}X^\top(X\mathbf{w} - \mathbf{y}).

(c) Run gradient descent for 1000 steps with η=0.01\eta = 0.01. Plot the loss vs. step number.

(d) Compute the analytical solution w^=(XX)1Xy\hat{\mathbf{w}} = (X^\top X)^{-1}X^\top \mathbf{y} and compare with gradient descent's solution.

(e) How does the convergence rate change with learning rate η\eta? Try η{0.001,0.01,0.1,1.0}\eta \in \{0.001, 0.01, 0.1, 1.0\}.


Exercise 6 - Gradient Checking Implementation

(a) Implement centered-difference numerical gradient for an arbitrary function f:RnRf: \mathbb{R}^n \to \mathbb{R}.

(b) Test on f(w)=12Xwy2f(\mathbf{w}) = \frac{1}{2}\lVert X\mathbf{w} - \mathbf{y} \rVert^2 with random data. Compute the relative error between numerical and analytical gradients.

(c) Plot relative error vs. step size hh for h[1012,101]h \in [10^{-12}, 10^{-1}]. Identify the optimal hh and the crossover between truncation error and floating-point cancellation.

(d) Implement a gradient checker for the logistic regression cross-entropy loss. Use it to verify the gradient formula wL=1mX(σ(Xw)y)\nabla_{\mathbf{w}} \mathcal{L} = \frac{1}{m}X^\top(\sigma(X\mathbf{w}) - \mathbf{y}).


Exercise 7 - Softmax Gradient Derivation

The softmax function softmax(z)k=ezkjezj\operatorname{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_j e^{z_j}} takes logits zRK\mathbf{z} \in \mathbb{R}^K to a probability distribution.

(a) Compute the Jacobian pkzj\frac{\partial p_k}{\partial z_j} for all pairs (k,j)(k,j). Show that it equals pk(δkjpj)p_k(\delta_{kj} - p_j) where δkj\delta_{kj} is the Kronecker delta.

(b) Let L=kyklogpk\mathcal{L} = -\sum_k y_k \log p_k (cross-entropy with one-hot labels y\mathbf{y}). Using the chain rule Lzj=kLpkpkzj\frac{\partial \mathcal{L}}{\partial z_j} = \sum_k \frac{\partial \mathcal{L}}{\partial p_k}\frac{\partial p_k}{\partial z_j}, derive zL=py\nabla_{\mathbf{z}} \mathcal{L} = \mathbf{p} - \mathbf{y}.

(c) Verify numerically: compute zL\nabla_{\mathbf{z}} \mathcal{L} using your formula and using centered differences. Confirm relative error <106< 10^{-6}.

(d) Why is the gradient py\mathbf{p} - \mathbf{y} so elegant? Explain its interpretation: what happens when the model is perfectly confident and correct? When it's confident and wrong?


Exercise 8 - Natural Gradient vs. Gradient Descent

Consider fitting a Gaussian model pμ,σ(x)=N(x;μ,σ2)p_{\mu,\sigma}(x) = \mathcal{N}(x;\mu,\sigma^2) to observed data {x1,,xm}\{x_1,\ldots,x_m\} by maximizing the log-likelihood L(μ,σ)=ilogpμ,σ(xi)\mathcal{L}(\mu,\sigma) = \sum_i \log p_{\mu,\sigma}(x_i).

(a) Compute the ordinary gradient (μ,σ)L\nabla_{(\mu,\sigma)} \mathcal{L}.

(b) Compute the Fisher information matrix F(μ,σ)=E[2logpμ,σ(X)]F(\mu,\sigma) = -\mathbb{E}\left[\nabla^2 \log p_{\mu,\sigma}(X)\right] and show that for a Gaussian, F=diag(1/σ2,2/σ2)F = \operatorname{diag}(1/\sigma^2, 2/\sigma^2).

(c) Compute the natural gradient F1LF^{-1}\nabla\mathcal{L} and compare it to the ordinary gradient.

(d) Simulate: generate 100 points from N(2,1)\mathcal{N}(2, 1), run gradient descent and natural gradient descent from (μ0,σ0)=(0,2)(\mu_0,\sigma_0) = (0, 2). Plot trajectories in (μ,σ)(\mu,\sigma) space. Which converges faster?


12. Why This Matters for AI (2026 Perspective)

| Concept | AI/ML Impact |
| --- | --- |
| Gradient \nabla_{\boldsymbol{\theta}}\mathcal{L} | The update signal for every parameter in every neural network; computed billions of times per second during LLM training |
| MSE gradient derivation | Foundation of all linear models; normal equations; least squares regularization; used in LoRA target initialization |
| Cross-entropy gradient \mathbf{p} - \mathbf{y} | Elegantly simple form used in every transformer output layer; enables efficient backpropagation through softmax |
| Gradient as steepest descent direction | Theoretical guarantee that SGD moves downhill; motivates all descent-based optimization algorithms |
| Gradient perpendicular to level sets | Foundation of constrained optimization; used in gradient projection methods and orthogonal gradient descent for continual learning |
| First-order Taylor approximation | Guarantees gradient descent decreases loss for small steps; basis of convergence proofs for smooth functions |
| Gradient of quadratic forms | \nabla(\mathbf{x}^\top A\mathbf{x}) = 2A\mathbf{x} (for symmetric A) is used in kernel methods, Gaussian process regression, and second-order optimization |
| Gradient clipping | Standard practice in all LLM training; prevents training instability from pathological batches; norm G=1.0 used in GPT-3, Llama, Mistral |
| Gradient accumulation | Enables large effective batch sizes (millions of tokens) on GPU-memory-constrained systems; essential for frontier model training |
| Numerical gradient checking | Debugging tool for custom layers; verifies correctness of backpropagation implementation before costly training runs |
| Weight decay as L2 gradient | AdamW decouples weight decay from the adaptive learning rate; the gradient of L2 regularization adds \lambda\mathbf{w} to the update |
| Gradient of log-likelihood | Score function \nabla_{\boldsymbol{\theta}}\log p_{\boldsymbol{\theta}}(x) used in REINFORCE, PPO, and all policy gradient reinforcement learning methods |
| Natural gradient / Fisher information | K-FAC approximates it for efficient second-order optimization; Shampoo computes Kronecker factors; understanding it motivates AdaGrad, RMSprop, Adam |
| Functional gradients | Wasserstein gradient flows, neural tangent kernels, variational inference - all use gradients in infinite-dimensional function spaces |

Conceptual Bridge

Looking Backward

The gradient builds directly on three prerequisites:

  1. Single-variable chain rule (04/02): The directional derivative formula Duf=fuD_{\mathbf{u}} f = \nabla f^\top \mathbf{u} was proved using the chain rule along a curve γ(t)\boldsymbol{\gamma}(t). Every property of the gradient ultimately rests on single-variable calculus applied component-by-component.

  2. First-order Taylor series (04/04): The gradient-as-linear-approximation view f(a+δ)f(a)+f(a)δf(\mathbf{a}+\boldsymbol{\delta}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^\top\boldsymbol{\delta} is the multivariate Taylor expansion truncated at first order. The remainder is O(δ2)O(\lVert\boldsymbol{\delta}\rVert^2), controlled by the second-order term - the Hessian.

  3. Vector norms and inner products (02-Linear-Algebra-Basics): The Cauchy-Schwarz inequality fufu\nabla f^\top \mathbf{u} \leq \lVert\nabla f\rVert\lVert\mathbf{u}\rVert was the key step in proving the gradient points in the steepest ascent direction.

Looking Forward

This section is the foundation for three immediate extensions:

  1. Jacobians and Hessians (02): When f:RnRm\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m is vector-valued, each output fif_i has a gradient fiRn\nabla f_i \in \mathbb{R}^n. Stacking these as rows gives the Jacobian JfRm×nJ_{\mathbf{f}} \in \mathbb{R}^{m \times n}. The Hessian stacks the gradients of the gradient components to form a symmetric matrix capturing curvature.

  2. Chain Rule and Backpropagation (03): Backpropagation is the multivariate chain rule applied to computation graphs. If z=f(y)\mathbf{z} = \mathbf{f}(\mathbf{y}) and y=g(x)\mathbf{y} = \mathbf{g}(\mathbf{x}), then Jfg=JfJgJ_{\mathbf{f}\circ\mathbf{g}} = J_{\mathbf{f}} \cdot J_{\mathbf{g}}. Propagating gradients backward layer by layer uses exactly this matrix product structure.

  3. Optimality Conditions (04): The condition f(a)=0\nabla f(\mathbf{a}) = \mathbf{0} is necessary (but not sufficient) for a\mathbf{a} to be a local minimum. Adding the condition Hf(a)0H_f(\mathbf{a}) \succ 0 (positive definite Hessian) makes it sufficient. Lagrange multipliers extend this to constrained optimization.

POSITION IN CURRICULUM


  Ch. 4 - Calculus Fundamentals
   01 Limits and Continuity    (epsilon-delta, sequential limits)
   02 Derivatives              (chain rule, activation derivatives)
   03 Integration              (FTC, probability integrals)
   04 Series and Sequences     (Taylor series, convergence)
         |
         | builds on: derivatives, Taylor expansion, vector algebra
         v
  Ch. 5 - Multivariate Calculus
   01 Partial Derivatives and Gradients   <- YOU ARE HERE
          - Partial derivatives = derivatives holding others fixed
          - Gradient = vector of all partial derivatives
          - Directional derivative = ∇f · u
          - Gradient of ML losses (MSE, cross-entropy)
          - Numerical differentiation, gradient checking
  
   02 Jacobians and Hessians              -> matrix generalization
          (gradient of vector-valued f; curvature matrix)
  
   03 Chain Rule and Backpropagation      -> gradient composition
          (Jacobian chain rule; backprop algorithm)
  
   04 Optimality Conditions               -> when ∇f = 0 is a minimum
          (first/second-order conditions; KKT; Lagrange multipliers)
  
   05 Automatic Differentiation           -> computing ∇f exactly
           (reverse mode; tapes; PyTorch/JAX)
         |
         v
  Ch. 8 - Optimization
  (gradient descent variants, Adam, Newton, convergence theory)



<- Back to Multivariate Calculus | Next: Jacobians and Hessians ->


Appendix A: Standard Gradient Reference Table

The following identities are used throughout the curriculum. All assume x,aRn\mathbf{x}, \mathbf{a} \in \mathbb{R}^n, ARn×nA \in \mathbb{R}^{n \times n}, BRm×nB \in \mathbb{R}^{m \times n}, bRm\mathbf{b} \in \mathbb{R}^m unless noted.

A.1 Linear Forms

| f(\mathbf{x}) | \nabla_{\mathbf{x}} f | Notes |
| --- | --- | --- |
| \mathbf{a}^\top \mathbf{x} | \mathbf{a} | Constant gradient - loss increases uniformly in direction \mathbf{a} |
| \mathbf{x}^\top \mathbf{a} | \mathbf{a} | Same as above (dot product commutes) |
| \mathbf{x}^\top \mathbf{x} = \lVert\mathbf{x}\rVert^2 | 2\mathbf{x} | Gradient points radially outward |
| \lVert\mathbf{x} - \mathbf{a}\rVert^2 | 2(\mathbf{x} - \mathbf{a}) | Points away from \mathbf{a}; zero at \mathbf{a} |
| \lVert\mathbf{x}\rVert | \mathbf{x}/\lVert\mathbf{x}\rVert | Unit vector in direction of \mathbf{x}; undefined at \mathbf{0} |

A.2 Quadratic Forms

| f(\mathbf{x}) | \nabla_{\mathbf{x}} f | Notes |
| --- | --- | --- |
| \mathbf{x}^\top A \mathbf{x} | (A + A^\top)\mathbf{x} | General A |
| \mathbf{x}^\top A \mathbf{x} | 2A\mathbf{x} | When A = A^\top (symmetric) |
| \mathbf{a}^\top A \mathbf{x} | A^\top \mathbf{a} | Bilinear: gradient w.r.t. \mathbf{x} |
| \mathbf{x}^\top A \mathbf{b} | A\mathbf{b} | Linear in \mathbf{x} |

A.3 Least Squares and Norms

| f(\mathbf{x}) | \nabla_{\mathbf{x}} f | Derivation |
| --- | --- | --- |
| \lVert B\mathbf{x} - \mathbf{b}\rVert^2 | 2B^\top(B\mathbf{x} - \mathbf{b}) | Chain rule: \nabla\lVert\mathbf{r}\rVert^2 = 2B^\top\mathbf{r} for \mathbf{r} = B\mathbf{x} - \mathbf{b} |
| \lVert B\mathbf{x} - \mathbf{b}\rVert^2 + \lambda\lVert\mathbf{x}\rVert^2 | 2B^\top(B\mathbf{x}-\mathbf{b}) + 2\lambda\mathbf{x} | Ridge regression gradient |
| \frac{1}{2}\mathbf{x}^\top A\mathbf{x} - \mathbf{b}^\top\mathbf{x} (A symmetric) | A\mathbf{x} - \mathbf{b} | Set to \mathbf{0}: A\mathbf{x}=\mathbf{b} (linear system) |

A.4 Elementwise Functions

For elementwise functions f(x)=ig(xi)f(\mathbf{x}) = \sum_i g(x_i), the gradient is elementwise:

(f)i=g(xi)(\nabla f)_i = g'(x_i)
| f(\mathbf{x}) | \nabla_{\mathbf{x}} f | ML use |
| --- | --- | --- |
| \sum_i e^{x_i} | e^{\mathbf{x}} (elementwise) | Partition function in softmax |
| \sum_i \log x_i | 1/\mathbf{x} (elementwise, x_i > 0) | Log-likelihood |
| \sum_i \sigma(x_i) | \sigma(\mathbf{x})\odot(1-\sigma(\mathbf{x})) | Sigmoid gradient |
| \sum_i \max(0, x_i) | \mathbf{1}[\mathbf{x}>0] (indicator) | ReLU gradient (subgradient at 0) |

A.5 Composed Functions

| f(\mathbf{x}) | \nabla_{\mathbf{x}} f | Proof |
| --- | --- | --- |
| g(\mathbf{a}^\top\mathbf{x}) | g'(\mathbf{a}^\top\mathbf{x})\,\mathbf{a} | Chain rule: \partial/\partial x_i = g'\cdot a_i |
| g(\lVert\mathbf{x}\rVert^2) | 2g'(\lVert\mathbf{x}\rVert^2)\,\mathbf{x} | Chain rule on h = \lVert\mathbf{x}\rVert^2 |
| \exp(\mathbf{a}^\top\mathbf{x}) | \exp(\mathbf{a}^\top\mathbf{x})\,\mathbf{a} | Special case of above with g=\exp |
| \log(\mathbf{a}^\top\mathbf{x}) | \mathbf{a}/(\mathbf{a}^\top\mathbf{x}) | Special case with g=\log |

Appendix B: Gradient Descent Convergence for Smooth Convex Functions

We prove a basic convergence result establishing the O(1/t)O(1/t) rate for gradient descent on smooth convex functions.

Setting: f:RnRf: \mathbb{R}^n \to \mathbb{R} is:

  • Convex: f(y)f(x)+f(x)(yx)f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^\top(\mathbf{y}-\mathbf{x}) for all x,y\mathbf{x},\mathbf{y}
  • LL-smooth: f(x)f(y)Lxy\lVert\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\rVert \leq L\lVert\mathbf{x}-\mathbf{y}\rVert for all x,y\mathbf{x},\mathbf{y}

Gradient descent: xt+1=xtηf(xt)\mathbf{x}_{t+1} = \mathbf{x}_t - \eta\nabla f(\mathbf{x}_t) with constant step η=1/L\eta = 1/L.

Lemma (descent lemma). For LL-smooth ff and step η=1/L\eta = 1/L:

f(xt+1)f(xt)12Lf(xt)2f(\mathbf{x}_{t+1}) \leq f(\mathbf{x}_t) - \frac{1}{2L}\lVert\nabla f(\mathbf{x}_t)\rVert^2

Proof of lemma. By LL-smoothness and Taylor's theorem:

f(xt+1)f(xt)+f(xt)(xt+1xt)+L2xt+1xt2f(\mathbf{x}_{t+1}) \leq f(\mathbf{x}_t) + \nabla f(\mathbf{x}_t)^\top(\mathbf{x}_{t+1}-\mathbf{x}_t) + \frac{L}{2}\lVert\mathbf{x}_{t+1}-\mathbf{x}_t\rVert^2

Substituting xt+1xt=1Lf(xt)\mathbf{x}_{t+1} - \mathbf{x}_t = -\frac{1}{L}\nabla f(\mathbf{x}_t):

f(xt+1)f(xt)1Lf(xt)2+L21L2f(xt)2=f(xt)12Lf(xt)2f(\mathbf{x}_{t+1}) \leq f(\mathbf{x}_t) - \frac{1}{L}\lVert\nabla f(\mathbf{x}_t)\rVert^2 + \frac{L}{2}\cdot\frac{1}{L^2}\lVert\nabla f(\mathbf{x}_t)\rVert^2 = f(\mathbf{x}_t) - \frac{1}{2L}\lVert\nabla f(\mathbf{x}_t)\rVert^2

\square

Convergence proof. Let x\mathbf{x}^* be a minimizer of ff with value f=f(x)f^* = f(\mathbf{x}^*). By convexity:

f(xt)f+f(xt)(xtx)f(\mathbf{x}_t) \leq f^* + \nabla f(\mathbf{x}_t)^\top(\mathbf{x}_t - \mathbf{x}^*)

Also:

xt+1x2=xt1Lf(xt)x2=xtx22Lf(xt)(xtx)+1L2f(xt)2\lVert\mathbf{x}_{t+1} - \mathbf{x}^*\rVert^2 = \lVert\mathbf{x}_t - \frac{1}{L}\nabla f(\mathbf{x}_t) - \mathbf{x}^*\rVert^2 = \lVert\mathbf{x}_t-\mathbf{x}^*\rVert^2 - \frac{2}{L}\nabla f(\mathbf{x}_t)^\top(\mathbf{x}_t-\mathbf{x}^*) + \frac{1}{L^2}\lVert\nabla f(\mathbf{x}_t)\rVert^2

Rearranging: f(xt)(xtx)=L2(xtx2xt+1x2)+12Lf(xt)2\nabla f(\mathbf{x}_t)^\top(\mathbf{x}_t-\mathbf{x}^*) = \frac{L}{2}(\lVert\mathbf{x}_t-\mathbf{x}^*\rVert^2 - \lVert\mathbf{x}_{t+1}-\mathbf{x}^*\rVert^2) + \frac{1}{2L}\lVert\nabla f(\mathbf{x}_t)\rVert^2

Combining with the descent lemma gives f(\mathbf{x}_{t+1}) - f^* \leq \frac{L}{2}\left(\lVert\mathbf{x}_t-\mathbf{x}^*\rVert^2 - \lVert\mathbf{x}_{t+1}-\mathbf{x}^*\rVert^2\right). Summing over t = 0, \ldots, T-1 telescopes, and since the descent lemma makes f(\mathbf{x}_t) non-increasing, the final iterate satisfies:

f(xT)fLx0x22Tf(\mathbf{x}_T) - f^* \leq \frac{L\lVert\mathbf{x}_0-\mathbf{x}^*\rVert^2}{2T}

This establishes the O(L/T)O(L/T) convergence rate. \square

Interpretation: After TT steps with step size η=1/L\eta = 1/L, the excess loss decreases as O(1/T)O(1/T). To achieve accuracy ε\varepsilon, we need T=O(L/ε)T = O(L/\varepsilon) steps. For strongly convex functions (Hessian lower-bounded by μI\mu I), the rate improves to O((1μ/L)T)O((1-\mu/L)^T) - exponential convergence.
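A small numerical check of the bound on a random convex quadratic (a sketch; the matrix and starting point are arbitrary):

```python
import numpy as np

# f(x) = 0.5 x^T Q x is convex and L-smooth with L = lambda_max(Q);
# its minimizer is x* = 0 with f* = 0.
rng = np.random.default_rng(1)
A = rng.normal(size=(20, 5))
Q = A.T @ A                              # positive semidefinite Hessian
L = np.linalg.eigvalsh(Q).max()          # smoothness constant
f = lambda x: 0.5 * x @ Q @ x

x0 = np.ones(5)
for T in [10, 100, 1000]:
    x = x0.copy()
    for _ in range(T):
        x -= (1.0 / L) * (Q @ x)         # gradient step with eta = 1/L
    bound = L * (x0 @ x0) / (2 * T)      # L ||x0 - x*||^2 / (2T)
    print(T, f(x) <= bound)              # the bound holds at every T
```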


Appendix C: Gradient of the Attention Mechanism

Computing the gradient of scaled dot-product attention is a fundamental exercise in multivariable calculus applied to transformers.

Setup. Let Q,K,VRn×dQ, K, V \in \mathbb{R}^{n \times d} and define:

S=QKdRn×n,A=softmax(S)Rn×n,O=AVRn×dS = \frac{QK^\top}{\sqrt{d}} \in \mathbb{R}^{n \times n}, \qquad A = \operatorname{softmax}(S) \in \mathbb{R}^{n \times n}, \qquad O = AV \in \mathbb{R}^{n \times d}

where softmax is applied row-wise. The loss L\mathcal{L} is a scalar depending on OO.

Gradient w.r.t. VV:

LV=ALO\frac{\partial \mathcal{L}}{\partial V} = A^\top \frac{\partial \mathcal{L}}{\partial O}

Derivation: O=AVO = AV, so Oij=kAikVkjO_{ij} = \sum_k A_{ik}V_{kj}, giving LVkj=iLOijAik=(ALO)kj\frac{\partial \mathcal{L}}{\partial V_{kj}} = \sum_i \frac{\partial \mathcal{L}}{\partial O_{ij}} A_{ik} = (A^\top \frac{\partial \mathcal{L}}{\partial O})_{kj}.

Gradient w.r.t. AA:

LA=LOV\frac{\partial \mathcal{L}}{\partial A} = \frac{\partial \mathcal{L}}{\partial O} V^\top

Gradient w.r.t. SS (through softmax):

Let ai=softmax(si)\mathbf{a}_i = \operatorname{softmax}(\mathbf{s}_i) be the ii-th row of AA. The Jacobian of softmax is Ji=diag(ai)aiaiJ_i = \operatorname{diag}(\mathbf{a}_i) - \mathbf{a}_i\mathbf{a}_i^\top. Therefore:

Lsi=JiLai=aiLaiai(aiLai)\frac{\partial \mathcal{L}}{\partial \mathbf{s}_i} = J_i^\top \frac{\partial \mathcal{L}}{\partial \mathbf{a}_i} = \mathbf{a}_i \odot \frac{\partial \mathcal{L}}{\partial \mathbf{a}_i} - \mathbf{a}_i \left(\mathbf{a}_i^\top \frac{\partial \mathcal{L}}{\partial \mathbf{a}_i}\right)

This is the "softmax gradient" formula used in FlashAttention (Dao et al., 2022) to avoid materializing the full n×nn \times n attention matrix.

Gradient w.r.t. QQ and KK:

LQ=LSKd,LK=(LS)Qd\frac{\partial \mathcal{L}}{\partial Q} = \frac{\partial \mathcal{L}}{\partial S} \cdot \frac{K}{\sqrt{d}}, \qquad \frac{\partial \mathcal{L}}{\partial K} = \left(\frac{\partial \mathcal{L}}{\partial S}\right)^\top \cdot \frac{Q}{\sqrt{d}}

For AI: FlashAttention implements this gradient computation in a memory-efficient way by fusing the forward and backward passes and using tiling to avoid storing the full n×nn \times n attention matrix. The gradient formulas above are exactly what FlashAttention computes, just in a hardware-aware order.
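The formulas can be checked against autograd on small random matrices (a sketch; the scalar loss here is an arbitrary stand-in for the real training loss):

```python
import torch

torch.manual_seed(0)
n, d = 4, 8
Q = torch.randn(n, d, requires_grad=True)
K = torch.randn(n, d, requires_grad=True)
V = torch.randn(n, d, requires_grad=True)

S = Q @ K.T / d**0.5
A = torch.softmax(S, dim=-1)
O = A @ V
loss = (O**2).sum()                      # arbitrary scalar loss
loss.backward()

with torch.no_grad():
    dO = 2 * O                           # dL/dO for this loss
    dV = A.T @ dO                        # dL/dV = A^T dL/dO
    dA = dO @ V.T                        # dL/dA = dL/dO V^T
    # row-wise softmax backward: a ⊙ g - a (a^T g)
    dS = A * dA - A * (A * dA).sum(-1, keepdim=True)
    dQ = dS @ K / d**0.5                 # dL/dQ = dL/dS K / sqrt(d)
    dK = dS.T @ Q / d**0.5               # dL/dK = (dL/dS)^T Q / sqrt(d)

for manual, auto in [(dQ, Q.grad), (dK, K.grad), (dV, V.grad)]:
    print(torch.allclose(manual, auto, atol=1e-4))
```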


Appendix D: Coordinate Descent and Block Coordinate Descent

When computing the full gradient fRn\nabla f \in \mathbb{R}^n is expensive, coordinate descent updates one variable at a time:

xi(t+1)=argminxif(x1(t),,xi1(t),xi,xi+1(t),,xn(t))x_i^{(t+1)} = \arg\min_{x_i} f(x_1^{(t)}, \ldots, x_{i-1}^{(t)}, x_i, x_{i+1}^{(t)}, \ldots, x_n^{(t)})

or equivalently, takes a gradient step only along the ii-th coordinate:

xi(t+1)=xi(t)ηfxi(x(t))x_i^{(t+1)} = x_i^{(t)} - \eta \frac{\partial f}{\partial x_i}(\mathbf{x}^{(t)})

Convergence: For smooth convex ff, cyclic coordinate descent (cycling through i = 1, \ldots, n) converges at a comparable O(1/t) rate to gradient descent - but each step costs a single partial derivative instead of all nn.

Block coordinate descent: Update groups of variables simultaneously:

xB(t+1)=argminxBf(xB,xBˉ(t))\mathbf{x}_{\mathcal{B}}^{(t+1)} = \arg\min_{\mathbf{x}_{\mathcal{B}}} f(\mathbf{x}_{\mathcal{B}}, \mathbf{x}_{\bar{\mathcal{B}}}^{(t)})

where B\mathcal{B} is a block of indices and Bˉ\bar{\mathcal{B}} is its complement.

For AI (LoRA): LoRA can be seen as block coordinate descent in parameter space. Instead of optimizing all W \in \mathbb{R}^{d \times k}, it freezes WW and optimizes the low-rank factors A \in \mathbb{R}^{r \times k} and B \in \mathbb{R}^{d \times r} (block \mathcal{B} = \{A, B\}). The gradients w.r.t. AA and BB are partial derivatives holding all other weights fixed.


Appendix E: Numerical Stability in Gradient Computations

E.1 Log-Sum-Exp and Softmax Gradients

Computing softmax(z)k=ezk/jezj\operatorname{softmax}(\mathbf{z})_k = e^{z_k} / \sum_j e^{z_j} naively overflows when zkz_k is large.

Stable implementation: Subtract the maximum before exponentiating:

softmax(z)k=ezkzmaxjezjzmax\operatorname{softmax}(\mathbf{z})_k = \frac{e^{z_k - z_{\max}}}{\sum_j e^{z_j - z_{\max}}}

This doesn't change the value (the common factor e^{-z_{\max}} cancels between numerator and denominator) but keeps all exponentials \leq 1.

The gradient zL=py\nabla_{\mathbf{z}} \mathcal{L} = \mathbf{p} - \mathbf{y} is inherently stable since pk[0,1]p_k \in [0,1].
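A NumPy sketch of the stable computation and the resulting logit gradient (the extreme logits are chosen to overflow a naive implementation):

```python
import numpy as np

def softmax_stable(z):
    z = z - np.max(z)           # shift by the max; the value is unchanged
    e = np.exp(z)
    return e / e.sum()

z = np.array([1000.0, 0.0, -1000.0])   # naive exp(1000) would overflow
y = np.array([0.0, 1.0, 0.0])          # one-hot label
p = softmax_stable(z)
print(p)                               # finite probabilities, sum to 1
print(p - y)                           # cross-entropy gradient w.r.t. logits
```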

E.2 Gradient Vanishing Through Deep Networks

In a depth-LL feedforward network without residual connections, the gradient of the loss w.r.t. the first layer's weights passes through LL Jacobian matrices:

W[1]L=J[L]J[L1]J[2]product of L1 JacobiansW[1]z[2]\nabla_{W^{[1]}} \mathcal{L} = \underbrace{J^{[L]} \cdot J^{[L-1]} \cdots J^{[2]}}_{\text{product of }L-1\text{ Jacobians}} \cdot \nabla_{W^{[1]}} \mathbf{z}^{[2]}

If each Jacobian has spectral norm σ<1\sigma < 1 (all singular values <1< 1), then W[1]LσL10\lVert \nabla_{W^{[1]}} \mathcal{L} \rVert \leq \sigma^{L-1} \to 0 exponentially. This is vanishing gradients.

Fixes used in practice:

  • Residual connections (He et al., 2016): x[l+1]=x[l]+F(x[l])\mathbf{x}^{[l+1]} = \mathbf{x}^{[l]} + F(\mathbf{x}^{[l]}) - the identity path ensures gradient I\mathbf{I} passes through each block
  • LayerNorm (Ba et al., 2016): Normalizes pre-activations, keeping Jacobian singular values near 1
  • Careful initialization (He, Xavier/Glorot): Sets initial weight variance to maintain gradient magnitude across layers

E.3 Mixed Precision and Gradient Scaling

In half-precision (fp16) training, gradients can underflow to zero (fp16 smallest positive 6×108\approx 6 \times 10^{-8}). Gradient scaling multiplies the loss by a large scale factor ss before backpropagation:

θ(sLscaled)=sθL\nabla_{\boldsymbol{\theta}} (\underbrace{s \cdot \mathcal{L}}_{\text{scaled}}) = s \cdot \nabla_{\boldsymbol{\theta}} \mathcal{L}

After computing the scaled gradient, we divide by ss before the optimizer step. This keeps gradient values in the representable range of fp16. PyTorch's torch.cuda.amp.GradScaler manages ss automatically, halving it on overflow and increasing it gradually otherwise.
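A self-contained sketch of the loop (requires a CUDA device; the tiny model and data are illustrative):

```python
import torch
from torch import nn

device = "cuda"
model = nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(32, 16, device=device)
    y = torch.randint(0, 4, (32,), device=device)
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()   # backprop on s * L; grads come out scaled by s
    scaler.unscale_(optimizer)      # divide grads by s so clipping sees true values
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)          # skips the update if grads overflowed
    scaler.update()                 # adjusts s based on overflow history
```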


Appendix F: Gradient Flow - Continuous-Time View

The gradient descent iteration θt+1=θtηf(θt)\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\nabla f(\boldsymbol{\theta}_t) can be viewed as the Euler discretization of the gradient flow ODE:

dθdt=f(θ(t))\frac{d\boldsymbol{\theta}}{dt} = -\nabla f(\boldsymbol{\theta}(t))

In continuous time, this is exact gradient descent - the parameter moves in the direction of steepest descent at each instant.

Properties of gradient flow:

  1. Energy decreases: ddtf(θ(t))=fdθdt=f20\frac{d}{dt}f(\boldsymbol{\theta}(t)) = \nabla f^\top \frac{d\boldsymbol{\theta}}{dt} = -\lVert\nabla f\rVert^2 \leq 0
  2. Fixed points: Equilibria occur at f=0\nabla f = \mathbf{0} - critical points of ff
  3. Convergence for convex ff: f(\boldsymbol{\theta}(t)) - f^* \leq \frac{\lVert\boldsymbol{\theta}(0)-\boldsymbol{\theta}^*\rVert^2}{2t} - no step-size restriction appears in continuous time

For AI: The Neural Tangent Kernel (NTK) theory (Jacot et al., 2018) analyzes infinitely wide networks by studying gradient flow in function space. The NTK Θ(x,x)=θfθ(x)θfθ(x)\Theta(\mathbf{x}, \mathbf{x}') = \nabla_{\boldsymbol{\theta}} f_{\boldsymbol{\theta}}(\mathbf{x})^\top \nabla_{\boldsymbol{\theta}} f_{\boldsymbol{\theta}}(\mathbf{x}') determines convergence speed and generalization in this regime. Understanding gradient flow is the entry point to NTK theory, which explains why overparameterized networks can generalize despite fitting the training data exactly.


Appendix G: Historical Notes on Gradient Methods

Augustin-Louis Cauchy (1847). "Méthode générale pour la résolution des systèmes d'équations simultanées" - the first gradient descent paper. Cauchy showed that moving in the direction of steepest descent decreases a function and proposed an iterative scheme that converges to a local minimum. His motivation was solving systems of equations, not machine learning.

Gradient descent lay dormant for nearly a century as an optimization tool because numerical computation was impractical before computers. The method was rediscovered multiple times:

  • 1951, Robbins & Monro: Stochastic approximation - convergence theory for gradient descent with noisy observations. The theoretical foundation for SGD.
  • 1952, Hestenes & Stiefel: Conjugate gradient - stores memory of previous gradient directions to avoid zigzagging on elongated loss surfaces. Remains the gold standard for solving large linear systems.
  • 1986, Rumelhart, Hinton, Williams: "Learning representations by back-propagating errors" - showed that gradient descent on layered networks via backpropagation trains useful representations. This paper launched the first neural network renaissance.
  • 2012-present: Deep learning era - GPUs make gradient descent tractable for billions of parameters. Adam (Kingma & Ba, 2015) adapts the learning rate per-parameter. Gradient clipping (Pascanu et al., 2013) stabilizes RNN/transformer training. AdamW (Loshchilov & Hutter, 2019) decouples weight decay from adaptive learning rates.

The gradient is the same mathematical object Cauchy defined in 1847 - what changed is our ability to compute it (via backpropagation and automatic differentiation) and the scale at which we apply it (billions of parameters, trillions of tokens).


Appendix H: Worked Examples - Gradient Computations in Full Detail

H.1 Gradient of Logistic Regression Loss (Full Derivation)

Setup: mm binary examples {(x(i),y(i))}\{(\mathbf{x}^{(i)}, y^{(i)})\} with y(i){0,1}y^{(i)} \in \{0,1\}. Model: p(i)=σ(wx(i))p^{(i)} = \sigma(\mathbf{w}^\top \mathbf{x}^{(i)}) where σ(z)=(1+ez)1\sigma(z) = (1+e^{-z})^{-1}.

Loss: L(w)=1mi=1m[y(i)logp(i)+(1y(i))log(1p(i))]\mathcal{L}(\mathbf{w}) = -\frac{1}{m}\sum_{i=1}^m \left[y^{(i)}\log p^{(i)} + (1-y^{(i)})\log(1-p^{(i)})\right]

Step 1: Gradient of a single term. Fix ii; let z(i)=wx(i)z^{(i)} = \mathbf{w}^\top\mathbf{x}^{(i)}.

(i)=y(i)logσ(z(i))(1y(i))log(1σ(z(i)))\ell^{(i)} = -y^{(i)}\log\sigma(z^{(i)}) - (1-y^{(i)})\log(1-\sigma(z^{(i)})) (i)z(i)=y(i)σ(z(i))σ(z(i))+1y(i)1σ(z(i))σ(z(i))\frac{\partial \ell^{(i)}}{\partial z^{(i)}} = -\frac{y^{(i)}}{\sigma(z^{(i)})} \cdot \sigma'(z^{(i)}) + \frac{1-y^{(i)}}{1-\sigma(z^{(i)})} \cdot \sigma'(z^{(i)})

Using σ(z)=σ(z)(1σ(z))\sigma'(z) = \sigma(z)(1-\sigma(z)):

=y(i)(1σ(z(i)))+(1y(i))σ(z(i))=σ(z(i))y(i)=p(i)y(i)= -y^{(i)}(1-\sigma(z^{(i)})) + (1-y^{(i)})\sigma(z^{(i)}) = \sigma(z^{(i)}) - y^{(i)} = p^{(i)} - y^{(i)}

Step 2: Gradient w.r.t. w\mathbf{w} via chain rule.

(i)w=(i)z(i)z(i)w=(p(i)y(i))x(i)\frac{\partial \ell^{(i)}}{\partial \mathbf{w}} = \frac{\partial \ell^{(i)}}{\partial z^{(i)}} \cdot \frac{\partial z^{(i)}}{\partial \mathbf{w}} = (p^{(i)} - y^{(i)}) \cdot \mathbf{x}^{(i)}

Step 3: Average over examples.

wL=1mi=1m(p(i)y(i))x(i)=1mX(py)\nabla_{\mathbf{w}} \mathcal{L} = \frac{1}{m}\sum_{i=1}^m (p^{(i)} - y^{(i)}) \mathbf{x}^{(i)} = \frac{1}{m}X^\top(\mathbf{p} - \mathbf{y})

Verification: At the optimum \nabla_{\mathbf{w}}\mathcal{L} = \mathbf{0}, so X^\top\mathbf{p} = X^\top\mathbf{y} - the residual vector \mathbf{p} - \mathbf{y} is orthogonal to every feature column.
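A centered-difference check of this formula on random data (a sketch; sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 3
X = rng.normal(size=(m, d))
y = rng.integers(0, 2, size=m).astype(float)
w = rng.normal(size=d)

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

def loss(w):
    p = sigma(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

analytic = X.T @ (sigma(X @ w) - y) / m     # the derived formula

h = 1e-6
numeric = np.array([(loss(w + h * e) - loss(w - h * e)) / (2 * h)
                    for e in np.eye(d)])    # centered differences
print(np.max(np.abs(analytic - numeric)))   # ~1e-9 or smaller
```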

H.2 Gradient of the Negative Log-Likelihood for a Gaussian

Setup: nn iid observations x1,,xnN(μ,σ2)x_1,\ldots,x_n \sim \mathcal{N}(\mu, \sigma^2). Parameters: θ=(μ,σ)\boldsymbol{\theta} = (\mu, \sigma).

Negative log-likelihood:

L(μ,σ)=n2log(2π)+nlogσ+12σ2i=1n(xiμ)2\mathcal{L}(\mu,\sigma) = \frac{n}{2}\log(2\pi) + n\log\sigma + \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2

Partial derivatives:

Lμ=1σ2i=1n(xiμ)=n(μxˉ)σ2\frac{\partial \mathcal{L}}{\partial \mu} = -\frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu) = \frac{n(\mu - \bar{x})}{\sigma^2} Lσ=nσ1σ3i=1n(xiμ)2\frac{\partial \mathcal{L}}{\partial \sigma} = \frac{n}{\sigma} - \frac{1}{\sigma^3}\sum_{i=1}^n(x_i-\mu)^2

Setting to zero (MLE):

From Lμ=0\frac{\partial \mathcal{L}}{\partial \mu} = 0: μ^=xˉ=1nixi\hat{\mu} = \bar{x} = \frac{1}{n}\sum_i x_i (sample mean).

From Lσ=0\frac{\partial \mathcal{L}}{\partial \sigma} = 0: σ^2=1ni(xixˉ)2\hat{\sigma}^2 = \frac{1}{n}\sum_i(x_i-\bar{x})^2 (biased sample variance).

For AI: Maximum likelihood estimation of neural network parameters is gradient descent on the negative log-likelihood L(θ)=ilogpθ(y(i)x(i))\mathcal{L}(\boldsymbol{\theta}) = -\sum_i \log p_{\boldsymbol{\theta}}(y^{(i)}\mid\mathbf{x}^{(i)}) - identical structure to the Gaussian example, but with a deep network defining pθp_{\boldsymbol{\theta}}.

H.3 Gradient of the Frobenius Norm Loss

Setup: Matrix regression - minimize L(W)=12XWYF2\mathcal{L}(W) = \frac{1}{2}\lVert XW - Y\rVert_F^2 where XRm×dX \in \mathbb{R}^{m\times d}, WRd×kW \in \mathbb{R}^{d\times k}, YRm×kY \in \mathbb{R}^{m\times k}.

Gradient derivation:

L(W)=12tr((XWY)(XWY))\mathcal{L}(W) = \frac{1}{2}\operatorname{tr}((XW-Y)^\top(XW-Y)) WL=X(XWY)\nabla_W \mathcal{L} = X^\top(XW - Y)

Proof: Let R=XWYR = XW - Y. Then L=12tr(RR)\mathcal{L} = \frac{1}{2}\operatorname{tr}(R^\top R). A perturbation δW\delta W gives:

δL=12tr((R+XδW)(R+XδW))12tr(RR)\delta \mathcal{L} = \frac{1}{2}\operatorname{tr}((R + X\delta W)^\top(R + X\delta W)) - \frac{1}{2}\operatorname{tr}(R^\top R) =tr((XδW)R)+O(δW2)=tr(δWXR)=XR,δWF= \operatorname{tr}((X\delta W)^\top R) + O(\lVert\delta W\rVert^2) = \operatorname{tr}(\delta W^\top X^\top R) = \langle X^\top R, \delta W \rangle_F

So WL=XR=X(XWY)\nabla_W \mathcal{L} = X^\top R = X^\top(XW-Y).

This is the matrix analog of the vector gradient 2X(Xwy)2X^\top(X\mathbf{w}-\mathbf{y}) (without the factor of 2 because of the 12\frac{1}{2}). Applied to transformer attention: optimizing the value projection WVW_V uses this exact gradient structure.


Appendix I: Gradient Interpretation - A Physics Analogy

The mathematical definition of the gradient acquires deeper meaning through physical analogies.

Temperature field: Consider T(x)T(\mathbf{x}) as the temperature at point x\mathbf{x} in a room. The gradient T(x)\nabla T(\mathbf{x}) points in the direction of fastest temperature increase. Heat flows in direction T-\nabla T (Fourier's law): q=kT\mathbf{q} = -k\nabla T, where q\mathbf{q} is the heat flux vector and kk is thermal conductivity.

Gradient descent = steepest descent on a potential:

  • Elevation field: f(x)f(\mathbf{x}) = altitude at position x\mathbf{x}. f-\nabla f = direction of steepest downhill. Water flows in direction f-\nabla f (gradient flow).
  • Electric potential: ϕ(x)\phi(\mathbf{x}) = electric potential. E=ϕ\mathbf{E} = -\nabla\phi = electric field (force per unit charge). Charges accelerate in direction ϕ-\nabla\phi.
  • Loss surface: L(θ)\mathcal{L}(\boldsymbol{\theta}) = loss at parameters θ\boldsymbol{\theta}. L-\nabla\mathcal{L} = direction of maximum loss decrease. Gradient descent moves parameters in direction L-\nabla\mathcal{L}.

The physics insight: In all three cases, the gradient defines a vector field (a vector at every point in space), and dynamics follow the negative gradient. This is the fundamental structure of steepest-descent optimization - borrowed from physics, applied to machine learning.

Potential energy landscape: The loss L(θ)\mathcal{L}(\boldsymbol{\theta}) plays the role of potential energy. Parameters play the role of a particle. Gradient descent is a particle rolling downhill with heavy friction (no momentum). SGD adds noise - like a particle in a bath of random thermal kicks (Langevin dynamics). Adam is like an adaptive friction coefficient that varies per dimension.


Appendix J: Reading Guide and Further References

Core References

  1. Nocedal & Wright, Numerical Optimization (2006), Chapters 1-2 - rigorous treatment of gradient descent, step size rules, convergence for smooth functions. The standard reference for optimization in ML.

  2. Boyd & Vandenberghe, Convex Optimization (2004), Chapter 9 - gradient descent for convex functions, convergence rates, extensions to non-smooth functions. Freely available online.

  3. Goodfellow, Bengio & Courville, Deep Learning (2016), Chapter 4 - matrix calculus for ML practitioners. Gradient and Jacobian conventions used throughout this repo follow GBC.

  4. Strang, Introduction to Linear Algebra (2016), Chapter 8 - accessible introduction to gradients and their geometric meaning, from a linear algebra perspective.

  5. Amari, Information Geometry and Its Applications (2016) - the mathematical foundation of natural gradient methods, Fisher information, and gradient flows on statistical manifolds.

For AI/ML Practitioners

  1. Kingma & Ba, "Adam: A Method for Stochastic Optimization" (2015) - introduces the Adam optimizer; Section 2 discusses the moment estimators in terms of the gradient.

  2. Pascanu, Mikolov & Bengio, "On the difficulty of training recurrent neural networks" (2013) - introduces gradient clipping; explains exploding/vanishing gradients mathematically.

  3. Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (2022) - the key idea is restricting gradient updates to a low-rank subspace.

  4. Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention" (2022) - implements attention gradients in a hardware-aware fashion; the gradient formulas in Appendix C are what FlashAttention computes.

  5. Foret et al., "Sharpness-Aware Minimization for Efficiently Improving Generalization" (2021) - uses gradient norm and loss landscape geometry to find flat minima; directly applies the concepts in 8.

Online Resources

  • Andrej Karpathy's micrograd - a 150-line automatic differentiation engine built from scratch; excellent for understanding how gradients are computed
  • Matrix Cookbook - comprehensive reference for matrix calculus identities (freely available as PDF)
  • Chris Olah's "Neural Networks, Manifolds, and Topology" - visual intuition for how gradients relate to neural network geometry

Appendix K: Stochastic Gradient Descent - Theory and Practice

K.1 Why Stochastic Gradients Work

Full gradient descent computes θL=1mi=1mθ(i)\nabla_{\boldsymbol{\theta}}\mathcal{L} = \frac{1}{m}\sum_{i=1}^m \nabla_{\boldsymbol{\theta}}\ell^{(i)}, which is expensive for large mm. SGD replaces this with an unbiased estimate using a single randomly sampled example ii:

gt=θ(it),E[gtθt]=θL(θt)\mathbf{g}_t = \nabla_{\boldsymbol{\theta}}\ell^{(i_t)}, \qquad \mathbb{E}[\mathbf{g}_t \mid \boldsymbol{\theta}_t] = \nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta}_t)

Why it converges: Under the Robbins-Monro step-size conditions, SGD converges to a stationary point (under standard regularity assumptions, even for non-convex objectives):

t=1ηt=,t=1ηt2<\sum_{t=1}^\infty \eta_t = \infty, \qquad \sum_{t=1}^\infty \eta_t^2 < \infty

A schedule satisfying these: ηt=η0/t\eta_t = \eta_0 / t (harmonic decay). In practice, cosine decay or piecewise constant schedules are used.

Variance of the stochastic gradient:

Var(gt)=E[gtL2]\operatorname{Var}(\mathbf{g}_t) = \mathbb{E}[\lVert\mathbf{g}_t - \nabla\mathcal{L}\rVert^2]

With mini-batch size BB: Var(gˉ)=Var(gt)/B\operatorname{Var}(\bar{\mathbf{g}}) = \operatorname{Var}(\mathbf{g}_t)/B - variance decreases as 1/B1/B. This is why larger batches have smaller gradient noise but don't always converge better (the noise helps escape sharp minima).
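An empirical sketch of the 1/B scaling for the MSE gradient (data, weights, and batch sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 10000, 5
X = rng.normal(size=(m, d))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=m)
w = np.zeros(d)

full = 2 / m * X.T @ (X @ w - y)              # full-batch gradient

def minibatch_grad(B):
    idx = rng.choice(m, size=B, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2 / B * Xb.T @ (Xb @ w - yb)

for B in [1, 10, 100]:
    g = np.array([minibatch_grad(B) for _ in range(2000)])
    var = np.mean(np.sum((g - full) ** 2, axis=1))
    print(B, var)                             # decays roughly like 1/B
```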

K.2 Gradient Noise as Regularization

The noise in SGD gradients acts as an implicit regularizer. Formally:

θt+1=θtη[L(θt)+ϵt]\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta[\nabla\mathcal{L}(\boldsymbol{\theta}_t) + \boldsymbol{\epsilon}_t]

where ϵt=gtL(θt)\boldsymbol{\epsilon}_t = \mathbf{g}_t - \nabla\mathcal{L}(\boldsymbol{\theta}_t) is the gradient noise. This is equivalent to gradient flow with a stochastic forcing term - a Langevin dynamics:

dθ=L(θ)dt+2ηdWtd\boldsymbol{\theta} = -\nabla\mathcal{L}(\boldsymbol{\theta})\,dt + \sqrt{2\eta}\,d\mathbf{W}_t

where Wt\mathbf{W}_t is Brownian motion. The stationary distribution of this SDE is approximately p(θ)exp(L(θ)/η)p(\boldsymbol{\theta}) \propto \exp(-\mathcal{L}(\boldsymbol{\theta})/\eta) - a Gibbs distribution at "temperature" η\eta. Lower learning rate = lower temperature = sharper distribution concentrated near minima.

Practical implication: SGD with large learning rate explores the loss landscape (high temperature), while small learning rate exploits (low temperature). Warmup schedules start with a small learning rate (before gradients are reliable) then increase to full rate, then decay.

K.3 Mini-Batch Gradient and Gradient Variance Reduction

MethodGradient estimateVarianceCost per step
Full batch1mi(i)\frac{1}{m}\sum_i \nabla\ell^{(i)}0O(m)O(m)
SGD(i)\nabla\ell^{(i)}σ2\sigma^2O(1)O(1)
Mini-batch1BiB(i)\frac{1}{B}\sum_{i\in\mathcal{B}}\nabla\ell^{(i)}σ2/B\sigma^2/BO(B)O(B)
SVRG(i)anchor(i)+gˉ\nabla\ell^{(i)} - \nabla\ell^{(i)}_\text{anchor} + \bar{\mathbf{g}}O(dist from anchor)O(\text{dist from anchor})O(1)+O(1)+ anchor cost

For LLMs: Gradient noise in transformer training is beneficial - it helps the model escape sharp minima and generalize better (Jiang et al., 2020). The typical LLM training batch (1M-4M tokens) is large enough to reduce gradient variance substantially while retaining some beneficial noise.


Appendix L: Gradient-Based Saliency and Interpretability

L.1 Gradient Saliency Maps

The simplest form of neural network interpretability uses the gradient of the output w.r.t. the input:

Sij=y^xijS_{ij} = \left|\frac{\partial \hat{y}}{\partial x_{ij}}\right|

where xijx_{ij} is the jj-th feature of the ii-th input token (or pixel). Large Sij|S_{ij}| means small changes in xijx_{ij} cause large changes in y^\hat{y} - the model is "sensitive" to that feature.

For language models: logp(yx)xi\frac{\partial \log p(y\mid\mathbf{x})}{\partial x_i} measures how much token ii's embedding affects the predicted probability. This gives a token-level attribution - which words most influence the model's prediction.

Limitations:

  • Gradient saliency is local (valid only near the current input)
  • Can be adversarially fooled - the gradient can be large for imperceptible perturbations
  • Does not account for saturation (sigmoid/tanh can saturate, making gradients near zero even for important features)

L.2 Integrated Gradients (Sundararajan et al., 2017)

To fix saturation issues, integrate the gradient along a path from a baseline x\mathbf{x}' to the input x\mathbf{x}:

ϕi(x)=(xixi)01f(x+α(xx))xidα\phi_i(\mathbf{x}) = (x_i - x'_i) \int_0^1 \frac{\partial f(\mathbf{x}' + \alpha(\mathbf{x}-\mathbf{x}'))}{\partial x_i}\,d\alpha

This satisfies the completeness axiom: iϕi(x)=f(x)f(x)\sum_i \phi_i(\mathbf{x}) = f(\mathbf{x}) - f(\mathbf{x}') (the attributions sum to the prediction difference from the baseline).

Implementation: Approximate the integral with a Riemann sum over kk steps. In practice, k=50k = 50 interpolations between x\mathbf{x}' and x\mathbf{x}, computing gradients at each.
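A sketch of that approximation (the test function, input, and baseline are arbitrary; the toy check at the end illustrates the completeness axiom):

```python
import torch

def integrated_gradients(f, x, baseline, k=50):
    # Riemann-sum approximation of the path integral of the gradient
    total = torch.zeros_like(x)
    for a in torch.linspace(0.0, 1.0, k):
        point = (baseline + a * (x - baseline)).detach().requires_grad_(True)
        f(point).backward()
        total += point.grad
    return (x - baseline) * total / k

f = lambda v: (v**2).sum() + v.prod()
x = torch.tensor([1.0, 2.0, -0.5])
b = torch.zeros(3)
phi = integrated_gradients(f, x, b, k=2000)
# completeness: attributions sum to f(x) - f(baseline)
print(phi.sum().item(), (f(x) - f(b)).item())  # approximately equal
```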

For AI: Integrated gradients is a standard attribution method for language models (implemented alongside LIME- and SHAP-style tools in many interpretability libraries). Understanding that it's an integral of partial derivatives connects it directly to the material in this section.


Appendix M: Exercises with Full Solutions

M.1 Worked Solution - Exercise 2(c)

Problem: Verify numerically that f(1,1)\nabla f(1,-1) is perpendicular to the level curve of f(x,y)=x2+xy+y2f(x,y) = x^2+xy+y^2 through (1,1)(1,-1).

Solution:

Step 1: Compute f(1,1)=11+1=1f(1,-1) = 1 - 1 + 1 = 1. The level curve is {x2+xy+y2=1}\{x^2+xy+y^2 = 1\}.

Step 2: Gradient at (1,1)(1,-1): f=(2x+y,x+2y)=(2(1)+(1),1+2(1))=(1,1)\nabla f = (2x+y, x+2y)^\top = (2(1)+(-1), 1+2(-1))^\top = (1, -1)^\top.

Step 3: Parametrize the level curve near (1,1)(1,-1). By implicit differentiation: dydx=fxfy=11=1\frac{dy}{dx} = -\frac{f_x}{f_y} = -\frac{1}{-1} = 1. So the tangent direction is (1,1)(1, 1)^\top (normalized: (1/2,1/2)(1/\sqrt{2}, 1/\sqrt{2})^\top).

Step 4: Check perpendicularity: (1,1)(1,1)=11=0(1, -1) \cdot (1, 1) = 1 - 1 = 0. Perpendicular. \checkmark

Numerical verification: Sample points on the level curve near (1,1)(1,-1): try (1+ε,1+ε)(1 + \varepsilon, -1 + \varepsilon) for small ε\varepsilon. We have f(1+ε,1+ε)=(1+ε)2+(1+ε)(1+ε)+(1+ε)2=1+2ε+ε2+(1+ε2)+12ε+ε2=1+3ε21f(1+\varepsilon, -1+\varepsilon) = (1+\varepsilon)^2 + (1+\varepsilon)(-1+\varepsilon) + (-1+\varepsilon)^2 = 1 + 2\varepsilon + \varepsilon^2 + (-1 + \varepsilon^2) + 1 - 2\varepsilon + \varepsilon^2 = 1 + 3\varepsilon^2 \approx 1 for small ε\varepsilon. This confirms the tangent direction (1,1)(1,1)^\top lies along the level curve.

M.2 Worked Solution - Key MSE Formula

Problem: Derive w12mXwy2\nabla_{\mathbf{w}} \frac{1}{2m}\lVert X\mathbf{w} - \mathbf{y}\rVert^2 step by step.

Solution:

L(w)=12m(Xwy)(Xwy)=12mi=1m(jXijwjyi)2\mathcal{L}(\mathbf{w}) = \frac{1}{2m}(X\mathbf{w}-\mathbf{y})^\top(X\mathbf{w}-\mathbf{y}) = \frac{1}{2m}\sum_{i=1}^m\left(\sum_j X_{ij}w_j - y_i\right)^2

Component-wise: The kk-th component of wL\nabla_{\mathbf{w}}\mathcal{L}:

Lwk=1mi=1m(jXijwjyi)Xik=1mi=1mXik(Xwy)i\frac{\partial\mathcal{L}}{\partial w_k} = \frac{1}{m}\sum_{i=1}^m\left(\sum_j X_{ij}w_j - y_i\right) \cdot X_{ik} = \frac{1}{m}\sum_{i=1}^m X_{ik}(X\mathbf{w}-\mathbf{y})_i

In matrix form: Lwk=1m[X(Xwy)]k\frac{\partial\mathcal{L}}{\partial w_k} = \frac{1}{m}[X^\top(X\mathbf{w}-\mathbf{y})]_k, so wL=1mX(Xwy)\nabla_{\mathbf{w}}\mathcal{L} = \frac{1}{m}X^\top(X\mathbf{w}-\mathbf{y}).

Setting to zero: X(Xwy)=0XXw=XyX^\top(X\mathbf{w}-\mathbf{y}) = \mathbf{0} \Rightarrow X^\top X\mathbf{w} = X^\top\mathbf{y} - the normal equations. When XXX^\top X is invertible: w^=(XX)1Xy\hat{\mathbf{w}} = (X^\top X)^{-1}X^\top\mathbf{y}.


Appendix N: Connection Map - Gradient to Modern AI

GRADIENT ∇f: CONNECTIONS TO MODERN AI SYSTEMS (2026)


  FOUNDATIONAL                       ML ALGORITHMS

  ∂f/∂x (partial derivative)    ->   Backpropagation (03)
  ∇f in R^n (gradient)          ->   Gradient descent / Adam / AdamW
  D_u f (directional)           ->   Projected gradient descent
  ∇f ⊥ level set                ->   Orthogonal gradient descent (continual learning)
  Linear approx f(a+δ)          ->   L-smoothness condition (convergence proofs)
  ∇f = 0 at extremum            ->   Optimality conditions -> KKT (04)

  SPECIFIC ML SYSTEMS

  ∇_z (cross-entropy)           ->   p - y: softmax output gradient (every LLM)
  ∇_w ||Xw - y||^2              ->   Linear regression, LoRA initialization
  Gradient clipping             ->   GPT-3, Llama, Mistral training stability
  Gradient accumulation         ->   LLM pretraining (effective batch = millions of tokens)
  Fisher F = E[∇log p ∇log pᵀ]  ->   Natural gradient, K-FAC, second-order methods
  ∂L/∂input (saliency)          ->   Mechanistic interpretability, activation patching
  Integrated gradients          ->   LIME, SHAP, LLM attribution
  Functional gradient           ->   Neural Tangent Kernel, Wasserstein gradient flows


Appendix O: Multivariate Calculus Identities Quick Reference

O.1 Rules for Computing Partial Derivatives

| Rule | Formula | Notes |
| --- | --- | --- |
| Sum rule | \frac{\partial(f+g)}{\partial x_i} = \frac{\partial f}{\partial x_i} + \frac{\partial g}{\partial x_i} | Always valid |
| Product rule | \frac{\partial(fg)}{\partial x_i} = \frac{\partial f}{\partial x_i}g + f\frac{\partial g}{\partial x_i} | Both factors can depend on x_i |
| Quotient rule | \frac{\partial(f/g)}{\partial x_i} = \frac{\frac{\partial f}{\partial x_i}g - f\frac{\partial g}{\partial x_i}}{g^2} | g \neq 0 required |
| Chain rule | \frac{\partial}{\partial x_i}[h(f(\mathbf{x}))] = h'(f(\mathbf{x}))\frac{\partial f}{\partial x_i} | Composite function |
| Mixed chain rule | \frac{\partial}{\partial x_i}[f(g(\mathbf{x}),h(\mathbf{x}))] = \frac{\partial f}{\partial u}\frac{\partial g}{\partial x_i} + \frac{\partial f}{\partial v}\frac{\partial h}{\partial x_i} | u=g(\mathbf{x}), v=h(\mathbf{x}) |

O.2 Gradient Operator Properties

The gradient operator \nabla (applied to scalar fields) is linear:

(af+bg)=af+bg\nabla(af + bg) = a\nabla f + b\nabla g

Product rule for gradients:

(fg)=gf+fg\nabla(fg) = g\nabla f + f\nabla g

Chain rule for gradients:

h(f(x))=h(f(x))f(x)\nabla h(f(\mathbf{x})) = h'(f(\mathbf{x}))\nabla f(\mathbf{x})

For composed functions f:RnRf: \mathbb{R}^n \to \mathbb{R}, g:RmRng: \mathbb{R}^m \to \mathbb{R}^n:

z[f(g(z))]=Jg(z)xf(g(z))\nabla_{\mathbf{z}}[f(g(\mathbf{z}))] = J_g(\mathbf{z})^\top \nabla_{\mathbf{x}} f(g(\mathbf{z}))

This is the full multivariate chain rule - proved in 03.

O.3 Gradient in Different Coordinate Systems

While we always work in Cartesian coordinates in this section, the gradient takes different forms in other systems:

Polar coordinates (r,θ)(r, \theta):

f=frr^+1rfθθ^\nabla f = \frac{\partial f}{\partial r}\hat{\mathbf{r}} + \frac{1}{r}\frac{\partial f}{\partial \theta}\hat{\boldsymbol{\theta}}

Spherical coordinates (r,θ,ϕ)(r, \theta, \phi):

f=frr^+1rfθθ^+1rsinθfϕϕ^\nabla f = \frac{\partial f}{\partial r}\hat{\mathbf{r}} + \frac{1}{r}\frac{\partial f}{\partial \theta}\hat{\boldsymbol{\theta}} + \frac{1}{r\sin\theta}\frac{\partial f}{\partial \phi}\hat{\boldsymbol{\phi}}

For ML: Riemannian optimization on manifolds (e.g., Stiefel manifold for orthogonal weight matrices, SPD manifold for covariance matrices) uses a metric-aware gradient - the Riemannian gradient - which is the projection of the Euclidean gradient onto the tangent space of the manifold.


Appendix P: Common Partial Derivative Computations in ML Practice

P.1 Attention Score Gradient (Simplified)

For a single query-key pair, the attention logit is s=qk/ds = \mathbf{q}^\top\mathbf{k}/\sqrt{d}. Gradients:

sq=kd,sk=qd\frac{\partial s}{\partial \mathbf{q}} = \frac{\mathbf{k}}{\sqrt{d}}, \qquad \frac{\partial s}{\partial \mathbf{k}} = \frac{\mathbf{q}}{\sqrt{d}}

These are the update signals that drive query and key projections to produce higher/lower attention scores.

P.2 Layer Normalization Gradient

LayerNorm normalizes: \hat{\mathbf{x}} = (\mathbf{x} - \mu\mathbf{1})/\sigma, then scales and shifts: \mathbf{y} = \boldsymbol{\gamma}\odot\hat{\mathbf{x}} + \boldsymbol{\beta}.

Gradients w.r.t. the learnable parameters \boldsymbol{\gamma} and \boldsymbol{\beta}:

\frac{\partial\mathcal{L}}{\partial\boldsymbol{\gamma}} = \frac{\partial\mathcal{L}}{\partial\mathbf{y}}\odot\hat{\mathbf{x}}, \qquad \frac{\partial\mathcal{L}}{\partial\boldsymbol{\beta}} = \frac{\partial\mathcal{L}}{\partial\mathbf{y}}

The gradient w.r.t. \mathbf{x} (for backpropagation through LayerNorm) involves the partial derivatives of \hat{\mathbf{x}} w.r.t. \mathbf{x}, which requires careful treatment because \mu and \sigma both depend on \mathbf{x} - this is a full multivariate chain rule computation covered in 03.
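
The \boldsymbol{\gamma} and \boldsymbol{\beta} formulas, at least, can be checked immediately. A minimal NumPy sketch that fixes a synthetic upstream gradient r (so \partial\mathcal{L}/\partial\mathbf{y} = r) and compares the analytic \partial\mathcal{L}/\partial\boldsymbol{\gamma} against centered differences; shapes and seed are illustrative:

  import numpy as np

  rng = np.random.default_rng(1)
  n = 6
  x = rng.normal(size=n)
  gamma, beta = rng.normal(size=n), rng.normal(size=n)
  r = rng.normal(size=n)   # synthetic upstream gradient: L = r . y

  def layernorm_loss(gamma, beta):
      mu, sigma = x.mean(), x.std()      # population std (ddof=0)
      x_hat = (x - mu) / sigma
      y = gamma * x_hat + beta
      return r @ y, x_hat

  L, x_hat = layernorm_loss(gamma, beta)
  dL_dgamma = r * x_hat                  # formula above: dL/dy (.) x_hat
  dL_dbeta = r                           # formula above: dL/dy

  eps = 1e-6
  numeric = np.array([(layernorm_loss(gamma + eps * e, beta)[0]
                       - layernorm_loss(gamma - eps * e, beta)[0]) / (2 * eps)
                      for e in np.eye(n)])
  print(np.max(np.abs(numeric - dL_dgamma)))  # ~1e-10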

P.3 Embedding Table Gradient

For a language model with vocabulary size V and embedding matrix E \in \mathbb{R}^{V\times d}, the gradient \frac{\partial\mathcal{L}}{\partial E} is sparse - only the rows corresponding to tokens that appeared in the batch have nonzero gradients. Specifically:

\left(\frac{\partial\mathcal{L}}{\partial E}\right)_{v,:} = \sum_{t : x_t = v} \frac{\partial\mathcal{L}}{\partial\mathbf{e}_{x_t}}

where \mathbf{e}_{x_t} \in \mathbb{R}^d is the embedding of token x_t. This sparsity is exploited by embedding optimizers that update only the rows corresponding to tokens in the current batch.
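
In code, this accumulation is a scatter-add over token indices. A minimal NumPy sketch (vocabulary size, batch, and seed are illustrative) - note how the repeated token id sums two contributions:

  import numpy as np

  V, d = 10, 4
  tokens = np.array([2, 7, 2, 3])    # batch token ids; token 2 appears twice
  rng = np.random.default_rng(2)
  dL_de = rng.normal(size=(len(tokens), d))   # upstream grads, one per position

  grad_E = np.zeros((V, d))
  np.add.at(grad_E, tokens, dL_de)   # scatter-add: repeated ids accumulate

  print(np.nonzero(grad_E.any(axis=1))[0])    # only rows [2 3 7] are nonzero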

P.4 Weight Tying Gradient

When the input embedding E and output projection W_\text{out} = E^\top share the same weight matrix (common in transformer language models), the gradient of \mathcal{L} w.r.t. E receives contributions from both uses:

\frac{\partial\mathcal{L}}{\partial E} = \frac{\partial\mathcal{L}}{\partial E_\text{input}} + \left(\frac{\partial\mathcal{L}}{\partial W_\text{out}}\right)^\top

This is the weight-sharing gradient accumulation rule (6.4) applied to the input-output embedding tying used in GPT-2 and many transformer models.
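
Autograd frameworks apply this accumulation automatically whenever the two uses literally share one tensor. A minimal PyTorch sketch (shapes and the sum-of-logits "loss" are illustrative) that also checks the tied gradient against the two contributions computed separately:

  import torch

  V, d = 5, 3
  E = torch.randn(V, d, requires_grad=True)
  idx = torch.tensor([1, 3])

  loss = (E[idx] @ E.t()).sum()   # input use E[idx], output use W_out = E^T
  loss.backward()                 # E.grad accumulates both contributions

  # Same two uses, isolated via detach():
  Ea = E.detach().clone().requires_grad_(True)
  (Ea[idx] @ Ea.detach().t()).sum().backward()   # input-embedding use only
  Eb = E.detach().clone().requires_grad_(True)
  (Eb.detach()[idx] @ Eb.t()).sum().backward()   # output-projection use only

  print(torch.allclose(E.grad, Ea.grad + Eb.grad))  # True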


Appendix Q: Visualization Gallery - Gradient Fields

The following descriptions correspond to plots generated in theory.ipynb.

Plot 1 - Gradient field of a paraboloid: f(x,y) = x^2 + y^2. Arrows at a grid of points show \nabla f = (2x, 2y)^\top. All arrows point radially outward from the minimum at the origin, with arrow length proportional to \lVert\nabla f\rVert = 2\sqrt{x^2+y^2}.

Plot 2 - Gradient and level curves: Contour lines of f overlaid with gradient arrows. Visually confirms \nabla f \perp level curves at every point.

Plot 3 - Saddle point gradient field: f(x,y) = x^2 - y^2. Gradient \nabla f = (2x, -2y)^\top points outward in x and inward in y. At the origin, \nabla f = \mathbf{0} - a critical point that is neither a local min nor a local max.

Plot 4 - Loss landscape and gradient descent trajectory: A non-convex 2D function (e.g., Beale function or Rosenbrock) with gradient descent paths from several starting points. Shows convergence to different minima depending on initialization.

Plot 5 - Gradient norm as a function of training step: From a simple neural network training run, shows the gradient norm \lVert\nabla_{\boldsymbol{\theta}}\mathcal{L}\rVert decreasing as training progresses, with occasional spikes.
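
Plots 1-2 can be reproduced in a few lines. A minimal matplotlib sketch (grid resolution and styling are arbitrary; the notebook's version may differ):

  import numpy as np
  import matplotlib.pyplot as plt

  x, y = np.meshgrid(np.linspace(-2, 2, 15), np.linspace(-2, 2, 15))
  fx, fy = 2 * x, 2 * y     # gradient of f(x, y) = x^2 + y^2

  fig, ax = plt.subplots(figsize=(5, 5))
  ax.contour(x, y, x**2 + y**2, levels=10)   # level curves of f
  ax.quiver(x, y, fx, fy, alpha=0.7)         # gradient field: radially outward
  ax.set_title(r"$\nabla f$ for $f(x,y) = x^2 + y^2$")
  ax.set_aspect("equal")
  plt.show()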


Appendix R: Self-Assessment Checklist

After studying this section, verify you can:

Mechanics:

  • Compute partial derivatives of polynomial, exponential, and logarithmic functions
  • Write down the gradient vector f\nabla f for a given function
  • Compute directional derivatives using the formula D_{\mathbf{u}}f = \nabla f^\top\mathbf{u}
  • Find the directions of maximum/minimum directional derivative

Theory:

  • Prove that the gradient points in the direction of steepest ascent (from Cauchy-Schwarz)
  • Prove that the gradient is perpendicular to level sets (from the curve argument)
  • Explain why partial differentiability does not imply differentiability (with a counterexample)
  • State and apply Clairaut's theorem on equality of mixed partials

Applications:

  • Derive the gradient of the MSE loss \frac{1}{m}\lVert X\mathbf{w}-\mathbf{y}\rVert^2
  • Derive the gradient of the cross-entropy loss for logistic regression
  • Implement gradient descent for a simple ML problem
  • Implement gradient checking using centered differences (see the sketch after this checklist)
  • Explain gradient clipping, gradient accumulation, and their role in LLM training

Advanced:

  • State what the Fisher information matrix is and how it relates to the natural gradient
  • Explain why gradient descent converges at rate O(1/t) for L-smooth convex functions
  • Describe how LoRA restricts gradient updates to a low-rank subspace
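
As a concrete check of the "Applications" items above, here is a minimal NumPy sketch that computes the analytic MSE gradient \frac{2}{m}X^\top(X\mathbf{w}-\mathbf{y}) and verifies it with centered differences and a relative-error measure (shapes and seed are illustrative):

  import numpy as np

  rng = np.random.default_rng(3)
  m, n = 20, 5
  X, y = rng.normal(size=(m, n)), rng.normal(size=m)
  w = rng.normal(size=n)

  loss = lambda w: np.mean((X @ w - y) ** 2)
  grad = (2 / m) * X.T @ (X @ w - y)    # analytic gradient of the MSE loss

  eps = 1e-6
  numeric = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                      for e in np.eye(n)])
  rel_err = (np.linalg.norm(numeric - grad)
             / (np.linalg.norm(numeric) + np.linalg.norm(grad)))
  print(rel_err)   # ~1e-10 or smaller indicates a correct gradient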

Appendix S: Notation Summary for This Section

| Symbol | Meaning | Convention |
| --- | --- | --- |
| f: \mathbb{R}^n \to \mathbb{R} | Scalar field (loss function) | Lowercase f; calligraphic \mathcal{L} for losses |
| \mathbf{x} \in \mathbb{R}^n | Input vector (column) | Bold lowercase |
| \frac{\partial f}{\partial x_i} | Partial derivative w.r.t. x_i | \partial (curly d), not d |
| \nabla_{\mathbf{x}} f \in \mathbb{R}^n | Gradient (column vector) | \nabla with subscript; result is a column |
| \nabla_{\boldsymbol{\theta}}\mathcal{L} | Gradient of loss w.r.t. parameters | Always a column; rows = parameter dimensions |
| D_{\mathbf{u}} f | Directional derivative in direction \mathbf{u} | Requires \lVert\mathbf{u}\rVert = 1 |
| H_f \in \mathbb{R}^{n\times n} | Hessian (matrix of second-order partials) | Symmetric for C^2 functions; full treatment in 02 |
| J_{\mathbf{f}} \in \mathbb{R}^{m\times n} | Jacobian (derivative of vector-valued \mathbf{f}) | Full treatment in 02 |
| L_c(f) | Level set \{f = c\} | (n-1)-dimensional hypersurface |
| \eta | Learning rate | Always positive; \eta \leq 1/L for guaranteed descent |
| \boldsymbol{\theta} | All model parameters | Bold lowercase; follows NOTATION_GUIDE.md 7 |
| \mathcal{L}(\boldsymbol{\theta}) | Loss function | Calligraphic \mathcal{L} reserved for loss |
| F(\boldsymbol{\theta}) | Fisher information matrix | n \times n PSD matrix; used in the natural gradient |

Appendix T: Extended Worked Examples for ML Practitioners

T.1 Gradient Tape Mental Model

Before automatic differentiation (05), practitioners had to derive gradients by hand. The "gradient tape" mental model - tracking how each computation depends on parameters - is still valuable for debugging.

Example: One-layer network forward pass.

Given \mathbf{x} \in \mathbb{R}^d, W \in \mathbb{R}^{k\times d}, \mathbf{b} \in \mathbb{R}^k, \mathbf{y}_\text{true} \in \mathbb{R}^k:

Forward:
  z = W x + b                          (pre-activation, shape k)
  a = sigma(z)                         (post-activation, shape k)
  L = (1/2) ||a - y_true||^2           (MSE loss, scalar; the 1/2 makes the gradient clean)

Backward (right to left):
  ∂L/∂a = a - y_true                                       (shape k)
  ∂L/∂z = (∂L/∂a) ⊙ sigma'(z) = (a - y_true) ⊙ sigma'(z)   (elementwise, shape k)
  ∂L/∂W = (∂L/∂z) x^T                                      (outer product, shape k×d)
  ∂L/∂b = ∂L/∂z                                            (shape k)
  ∂L/∂x = W^T (∂L/∂z)                                      (shape d, for next layer)

Every line is a gradient formula from this section:

  • \partial L/\partial \mathbf{a} = \mathbf{a} - \mathbf{y}_\text{true} - gradient of MSE (6.1)
  • The \odot\sigma'(\mathbf{z}) step - chain rule, applied elementwise (3.2, 5.2)
  • \partial L/\partial W = (\partial L/\partial\mathbf{z})\mathbf{x}^\top - gradient of a bilinear form (4.5)
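
The tape translates almost line-for-line into NumPy. A minimal sketch, with a sigmoid activation and a finite-difference check on \partial L/\partial W (shapes and seed are illustrative):

  import numpy as np

  rng = np.random.default_rng(4)
  d, k = 4, 3
  x, y_true = rng.normal(size=d), rng.normal(size=k)
  W, b = rng.normal(size=(k, d)), rng.normal(size=k)

  sigma = lambda z: 1 / (1 + np.exp(-z))   # logistic sigmoid

  def forward(W):
      z = W @ x + b
      a = sigma(z)
      return 0.5 * np.sum((a - y_true) ** 2), z, a

  L, z, a = forward(W)
  dL_da = a - y_true
  dL_dz = dL_da * sigma(z) * (1 - sigma(z))   # sigma'(z) = sigma(z)(1 - sigma(z))
  dL_dW = np.outer(dL_dz, x)
  dL_db = dL_dz
  dL_dx = W.T @ dL_dz

  eps = 1e-6
  numeric = np.zeros_like(W)
  for i in range(k):
      for j in range(d):
          Wp, Wm = W.copy(), W.copy()
          Wp[i, j] += eps
          Wm[i, j] -= eps
          numeric[i, j] = (forward(Wp)[0] - forward(Wm)[0]) / (2 * eps)
  print(np.max(np.abs(numeric - dL_dW)))      # ~1e-10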

T.2 Softmax Temperature Gradient

In language model decoding, the temperature-scaled softmax is:

p_k = \frac{e^{z_k/\tau}}{\sum_j e^{z_j/\tau}}

Gradient w.r.t. \tau:

Let s_k = z_k/\tau. By the chain rule:

\frac{\partial p_k}{\partial \tau} = \frac{\partial p_k}{\partial s_k}\frac{\partial s_k}{\partial \tau} + \sum_{j\neq k} \frac{\partial p_k}{\partial s_j}\frac{\partial s_j}{\partial \tau} = \sum_j \frac{\partial p_k}{\partial s_j}\left(-\frac{z_j}{\tau^2}\right) = -\frac{1}{\tau^2}\sum_j \frac{\partial p_k}{\partial s_j} z_j

Using the softmax Jacobian \frac{\partial p_k}{\partial s_j} = p_k(\delta_{kj} - p_j):

\frac{\partial p_k}{\partial \tau} = -\frac{1}{\tau^2}\left[p_k z_k - p_k\sum_j p_j z_j\right] = -\frac{p_k}{\tau^2}\left[z_k - \mathbb{E}_p[z]\right]

Interpretation: the temperature gradient is negative exactly when z_k > \mathbb{E}_p[z], i.e. when token k has an above-average logit. Increasing \tau flattens the distribution, draining probability from high-logit tokens; decreasing \tau sharpens it - the gradient sign encodes exactly this.
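
The closed form is again easy to verify against centered differences in \tau. A minimal NumPy sketch (logits and temperature are arbitrary):

  import numpy as np

  rng = np.random.default_rng(5)
  z = rng.normal(size=6)     # logits
  tau = 0.8

  def softmax_T(tau):
      e = np.exp(z / tau - np.max(z / tau))   # numerically stable softmax
      return e / e.sum()

  p = softmax_T(tau)
  analytic = -(p / tau**2) * (z - p @ z)      # -(p_k / tau^2)(z_k - E_p[z])

  eps = 1e-6
  numeric = (softmax_T(tau + eps) - softmax_T(tau - eps)) / (2 * eps)
  print(np.max(np.abs(analytic - numeric)))   # ~1e-10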

T.3 Gradient w.r.t. Embedding Lookup

For input tokens x_1, \ldots, x_T (indices), the embedding layer looks up \mathbf{e}_{x_t} = E_{x_t, :} (the x_t-th row of the embedding matrix E \in \mathbb{R}^{V\times d}).

For a loss \mathcal{L} that depends on the embeddings \{\mathbf{e}_{x_t}\}, the gradient w.r.t. the embedding matrix has the sparsity structure:

\frac{\partial\mathcal{L}}{\partial E_{v,:}} = \sum_{t: x_t = v} \frac{\partial\mathcal{L}}{\partial \mathbf{e}_{x_t}}

Only rows v that appear in the batch receive gradient updates. For vocabulary size V = 128{,}000 (Llama 3) and a batch of B = 4096 tokens, at most 4096 of the 128{,}000 rows have nonzero gradients. Sparse embedding optimizers (Adagrad, sparse Adam) exploit this structure for efficiency.


Appendix U: Connections to Other Branches of Mathematics

U.1 Gradient and Topology (Morse Theory)

Morse theory studies how the topology of a manifold changes as we sweep through the sublevel sets \{f \leq c\}. The topology changes only at critical points, where \nabla f = \mathbf{0}. The number and type (index = number of negative Hessian eigenvalues) of the critical points constrain the global topology.

For ML: The "loss landscape topology" program (Goodfellow et al., 2015; Li et al., 2018) studies how gradient descent navigates the network of critical points. A frequently cited finding: most local minima of large overparameterized networks have loss comparable to the global minimum (arguments in the spirit of high-dimensional Morse-Bott theory).

U.2 Gradient and Differential Forms

In differential geometry, the gradient is the unique vector field metric-dual to the exterior derivative df (a 1-form):

df = \sum_i \frac{\partial f}{\partial x_i} dx_i

The gradient \nabla f is obtained by "raising the index" using the metric tensor g:

\nabla f = g^{-1}(df)

In Euclidean space g = I, so \nabla f and df have the same components. On a Riemannian manifold, the metric g introduces a non-trivial transformation - this is exactly the natural gradient transformation F^{-1}\nabla\mathcal{L}, where the Fisher matrix F plays the role of the metric tensor on the statistical manifold.

U.3 Gradient and Functional Analysis

For functionals on a Hilbert space \mathcal{H} (a function space with an inner product), the gradient (via the Fréchet derivative) is defined analogously: F: \mathcal{H} \to \mathbb{R} is differentiable with gradient \nabla F(f) \in \mathcal{H} if:

F(f + h) = F(f) + \langle \nabla F(f), h\rangle_{\mathcal{H}} + o(\lVert h\rVert_{\mathcal{H}})

Applications: energy-based language models and the score function \nabla_{\mathbf{x}} \log p(\mathbf{x}) (used in diffusion models, score matching, Langevin MCMC) are trained through objectives of this functional form. The denoising score matching objective of diffusion models (Ho et al., 2020; Song et al., 2021) trains a network to estimate \nabla_{\mathbf{x}} \log p(\mathbf{x}) - the gradient of the log-density - which is then used to sample via Langevin dynamics.


Appendix V: Gradient Descent Variants - A Unified View

All modern optimizers can be written as preconditioned gradient descent:

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta P_t^{-1} \nabla_{\boldsymbol{\theta}} \mathcal{L}_t

where PtP_t is a preconditioning matrix that captures curvature information.

| Optimizer | Preconditioner P_t | Key property |
| --- | --- | --- |
| Gradient descent | I | No curvature info; uniform step per parameter |
| AdaGrad | \operatorname{diag}(\sum_s \mathbf{g}_s \odot \mathbf{g}_s)^{1/2} | Adapts to historical gradient magnitude per coordinate; sparse-gradient friendly |
| RMSprop | \operatorname{diag}(\mathbb{E}[\mathbf{g}^2])^{1/2} | Exponential moving average of \mathbf{g}^2 replaces the full sum, avoiding ever-shrinking steps |
| Adam | \operatorname{diag}(\hat{\mathbf{v}}_t)^{1/2}, bias-corrected | Combines momentum (first moment) with an RMSprop-style second moment |
| Natural gradient | F(\boldsymbol{\theta}) | Riemannian gradient; steps invariant to how the statistical manifold is parameterized |
| Newton | H_f(\boldsymbol{\theta}_t) | Exact curvature; quadratic convergence; O(n^3) cost for n parameters |
| K-FAC | Kronecker-factored Fisher approximation | Tractable approximation to the natural gradient for neural networks |
| Shampoo | Kronecker-factored full-matrix preconditioner per layer | Built from gradient statistics, approximating full-matrix AdaGrad; used in large-scale LLM training |

The gradient θL\nabla_{\boldsymbol{\theta}}\mathcal{L} is computed identically in all cases - the optimizer only changes what it does with that gradient. Understanding the raw gradient (this section) is prerequisite to understanding any of these methods.
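
To make the template concrete, here is a schematic NumPy sketch of the preconditioned update with AdaGrad's diagonal P_t plugged in; the learning rate, epsilon, and toy quadratic are illustrative, and the other optimizers differ only in how P_t is built:

  import numpy as np

  def precond_step(theta, grad, state, eta=0.1, eps=1e-8):
      """One step of theta <- theta - eta * P^{-1} grad, with AdaGrad's
      diagonal preconditioner P = diag(sum_s g_s (.) g_s)^{1/2}."""
      state["g2"] = state.get("g2", np.zeros_like(theta)) + grad**2
      P_diag = np.sqrt(state["g2"]) + eps   # diagonal of P_t
      return theta - eta * grad / P_diag, state

  # Toy quadratic loss L(theta) = 0.5 * theta^T A theta (ill-conditioned)
  A = np.diag([10.0, 1.0])
  theta, state = np.array([1.0, 1.0]), {}
  for _ in range(200):
      grad = A @ theta                      # gradient of the quadratic
      theta, state = precond_step(theta, grad, state)
  print(theta)   # approaches the minimum at the origin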

For AI: In 2025-2026, frontier LLM training uses AdamW with gradient clipping and weight decay. Shampoo and distributed Shampoo are beginning to appear in production (Google, DeepMind), offering faster convergence by more faithfully approximating the natural gradient. The mathematical basis for all these methods is the gradient this section defines.