"The gradient is the generalization of the derivative to functions of several variables, and it points in the direction in which the function increases most rapidly. This single idea powers every learning algorithm in modern AI."
- inspired by Cauchy, 1847
Overview
Almost every function that matters in machine learning maps a high-dimensional parameter vector to a single scalar: the loss. A neural network with a billion parameters defines a surface in a billion-plus-one-dimensional space, and training means navigating that surface downhill. To navigate, you need to know the slope - not just along one axis, but simultaneously in every direction. That is precisely what the gradient provides.
This section builds the rigorous foundation for differentiation in $\mathbb{R}^n$. We begin with partial derivatives - the idea of measuring rate of change while holding all other variables fixed - then assemble them into the gradient vector, the single most important object in optimization. We prove that the gradient points in the direction of steepest ascent, show that it is perpendicular to the level sets of the function, and derive its formula as the linear part of a first-order approximation. We then compute gradients for every standard ML expression (MSE loss, cross-entropy, quadratic forms, regularized objectives) and study how to verify gradient implementations numerically.
The geometric and algebraic perspective developed here is the prerequisite for everything that follows: Jacobians and Hessians (02) generalize the gradient to vector-valued functions and second-order information; the chain rule (03) tells us how gradients propagate through composed functions; optimality conditions (04) use gradient conditions to characterize minima; automatic differentiation (05) computes gradients of arbitrary programs exactly.
Prerequisites
- Single-variable derivatives - definition as a limit, differentiation rules, chain rule: 04/02-Derivatives-and-Differentiation
- First-order Taylor approximation - $f(x + h) \approx f(x) + f'(x)h$: 04/04-Series-and-Sequences
- Vector notation - $\mathbf{x} \in \mathbb{R}^n$, dot product, norms: 02-Linear-Algebra-Basics
- Matrix-vector multiplication - $A\mathbf{x}$, transpose: 02/02-Matrix-Operations
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive: partial derivative surfaces, gradient fields, directional derivatives, loss landscape geometry, gradient checking |
| exercises.ipynb | 10 graded exercises from partial derivative computation to natural gradient and softmax derivations |
Learning Objectives
After completing this section, you will be able to:
- Define the partial derivative as a limit and compute it for polynomial, exponential, logarithmic, and composite functions
- Explain why a function can have all partial derivatives at a point without being differentiable there
- Define the gradient and identify it as a column vector
- Prove that the gradient points in the direction of steepest ascent, using the directional derivative formula
- Prove that the gradient is perpendicular to every level set of the function
- Compute gradients of standard ML expressions: $\mathbf{a}^\top\mathbf{x}$, $\mathbf{x}^\top A\mathbf{x}$, $\lVert A\mathbf{x} - \mathbf{b}\rVert^2$, softmax cross-entropy
- Derive the directional derivative formula and use it to find the directions of maximum and minimum change
- Implement numerical gradient checking using centered finite differences and measure relative error
- Explain gradient clipping, gradient accumulation, and gradient norm as training diagnostics
- Preview how the gradient generalizes to Jacobians (02), flows through computation graphs (03), and is computed by automatic differentiation (05)
Table of Contents
- 1. Intuition
- 2. Functions of Multiple Variables
- 3. Partial Derivatives
- 4. The Gradient Vector
- 5. Directional Derivatives
- 6. Gradient in Machine Learning
- 7. Numerical Differentiation
- 8. Loss Landscape Geometry
- 9. Advanced Topics
- 10. Common Mistakes
- 11. Exercises
- 12. Why This Matters for AI (2026 Perspective)
- Conceptual Bridge
1. Intuition
1.1 From Slopes to Surfaces
In single-variable calculus, the derivative measures how steeply $f(x)$ rises or falls as $x$ changes. The picture is one-dimensional: you stand on a hill, face a direction, and the derivative tells you the slope. Multivariate calculus asks: what if you can move in any direction? The surface $z = f(x, y)$ is a two-dimensional landscape - a mountain range - and from any point on it you can move north, south, east, west, or any diagonal. Different directions give different slopes.
The key insight is that knowing the slope in every possible direction would be an infinite amount of information, but it turns out you only need to store $n$ numbers - one slope per coordinate axis. These are the partial derivatives. Assembling them into a single vector - the gradient $\nabla f$ - encodes all directional information: given $\nabla f$, you can compute the slope in any direction via a dot product.
This reduction from "infinitely many slopes" to "one $n$-dimensional vector" is the central miracle of multivariable calculus. It is why gradient descent is computationally tractable for models with a billion parameters: at each point, we compute one gradient vector $\nabla L(\theta)$, and we move in the direction $-\nabla L(\theta)$.
```
GRADIENT: FROM ONE SLOPE TO A DIRECTION VECTOR

Single variable:               Multiple variables:

f: R -> R                      f: R^n -> R
f'(x) in R                     nabla f(x) in R^n
(one number)                   (a vector of n partial derivatives)

Slope along x-axis             Slope along x_1-axis: partial f / partial x_1
                               Slope along x_2-axis: partial f / partial x_2
                               ...
                               Slope along x_n-axis: partial f / partial x_n

"Steepest direction"           "Steepest direction" = direction of nabla f
is trivially +/-x              requires computing nabla f
```
For AI: A language model with parameters $\theta \in \mathbb{R}^n$ defines a loss function $L(\theta)$. The gradient $\nabla L(\theta)$ tells every parameter how much to change and in which direction. One gradient computation, via backpropagation, produces all $n$ values $\partial L / \partial \theta_i$ simultaneously - this is why reverse-mode automatic differentiation (05) is so powerful.
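A quick numerical sanity check of this claim (toy function and point chosen arbitrarily): one stored gradient vector reproduces the measured slope in any direction via a dot product.

```python
import numpy as np

# Toy scalar field f(x, y) = x^2 + 3y^2 and its analytic gradient.
def f(p):
    x, y = p
    return x**2 + 3 * y**2

def grad_f(p):
    x, y = p
    return np.array([2 * x, 6 * y])

p = np.array([1.0, -2.0])
g = grad_f(p)               # [2, -12]

# Slope in an arbitrary unit direction u, computed two ways:
u = np.array([3.0, 4.0]) / 5.0                              # any unit vector
h = 1e-6
numeric_slope = (f(p + h * u) - f(p - h * u)) / (2 * h)     # measured directly
dot_slope = g @ u                                           # from the gradient alone

print(numeric_slope, dot_slope)   # both approximately -8.4
```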
1.2 Historical Context
The theory of partial derivatives developed over two centuries, driven largely by physics and the need to describe phenomena depending on space and time simultaneously.
| Year | Mathematician | Contribution |
|---|---|---|
| 1734 | Euler | Introduced partial derivative notation for fluid mechanics |
| 1770 | Lagrange | Developed calculus of variations; implicit use of gradients in mechanics |
| 1816 | Cauchy | Rigorous foundation of multivariate analysis; epsilon-delta for limits in $\mathbb{R}^n$ |
| 1847 | Cauchy | Published the method of gradient descent for minimizing functions |
| 1873 | Maxwell | Gradient, divergence, curl notation for electromagnetic field equations |
| 1884 | Gibbs / Heaviside | Modern vector calculus notation; $\nabla$ (nabla) operator |
| 1944 | Hestenes, Stiefel | Conjugate gradient method - gradient descent with memory |
| 1986 | Rumelhart, Hinton, Williams | Backpropagation as systematic gradient computation in neural networks |
| 2012 | Krizhevsky et al. | AlexNet - gradient descent on GPU scales to ImageNet |
| 2017 | Vaswani et al. | Attention mechanisms trained by gradient descent (Transformers) |
| 2022-2026 | LLM era | Gradient descent with AdamW, gradient clipping, and mixed precision trains $10^{12}$-parameter models |
The notation $\nabla$ (nabla) was introduced by the Scottish mathematician Peter Guthrie Tait around 1884, the name borrowing the shape of the ancient Assyrian harp. Cauchy's 1847 gradient descent paper is remarkably modern - he proves convergence for smooth functions and discusses step-size selection. The 140-year gap between Cauchy's paper and backpropagation's popularization shows how long theoretical tools can wait for the right application.
1.3 Why the Gradient Is the Heart of ML
Every major operation in modern AI either computes or responds to a gradient:
- Training: $\theta \leftarrow \theta - \eta\,\nabla_\theta L(\theta)$ - gradient descent or its adaptive variants (Adam, AdamW, Lion)
- Backpropagation: the chain rule for computing $\nabla_\theta L$ by propagating gradients backward through a computation graph
- Attention gradient: drives how query projections are updated to attend to more useful keys
- LoRA (Hu et al., 2022): restricts weight updates to a low-rank subspace $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ with $r \ll \min(d, k)$ - the gradient of $L$ w.r.t. $A$ and $B$ is what gets computed
- Gradient clipping: prevents exploding gradients in transformer training by rescaling $\nabla L$ when $\lVert \nabla L \rVert$ exceeds a threshold
- Mechanistic interpretability: the input gradient $\nabla_{\mathbf{x}} L$ is a saliency map - it tells us which input tokens matter most
Understanding the gradient rigorously - not just as a formula but as a geometric object with provable properties - is the difference between knowing how to run gradient descent and understanding why it works, when it fails, and how to fix it.
2. Functions of Multiple Variables
2.1 Scalar Fields: $f: \mathbb{R}^n \to \mathbb{R}$
Definition. A scalar field (or scalar-valued function of $n$ variables) is a function
$$f: D \subseteq \mathbb{R}^n \to \mathbb{R}$$
that assigns to each point $\mathbf{x} = (x_1, \dots, x_n) \in D$ a real number $f(\mathbf{x})$.
The domain $D$ is a subset of $\mathbb{R}^n$; we usually take $D = \mathbb{R}^n$ (for polynomials) or the natural domain (where the formula is defined).
Standard examples:
| Function | Formula | Domain | Geometric meaning |
|---|---|---|---|
| Paraboloid | $f(x, y) = x^2 + y^2$ | $\mathbb{R}^2$ | Bowl shape; unique minimum at origin |
| Saddle | $f(x, y) = x^2 - y^2$ | $\mathbb{R}^2$ | Rises in $x$, falls in $y$; saddle point at origin |
| Gaussian | $f(\mathbf{x}) = e^{-\lVert\mathbf{x}\rVert^2}$ | $\mathbb{R}^n$ | Bell shaped; probability density (unnormalized) |
| MSE loss | $L(\mathbf{w}) = \frac{1}{m}\lVert X\mathbf{w} - \mathbf{y}\rVert^2$ | $\mathbb{R}^n$ | Convex paraboloid in weight space |
| Cross-entropy | $L(\mathbf{w}) = -\frac{1}{m}\sum_i \left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$ | $\mathbb{R}^n$ | Convex for logistic regression; non-convex for neural networks |
Non-examples (not scalar fields):
- $f(x, y) = (x + y,\; x - y)$ - this is vector-valued; its derivative is the Jacobian (02)
- $f \mapsto \int_0^1 f(x)^2\,dx$ - this is a functional, not a finite-dimensional function
For AI: Every loss function in supervised learning, reinforcement learning, and self-supervised learning is a scalar field over parameter space. Training is navigation of this surface.
2.2 Level Sets and Contour Geometry
Definition. The level set (or level surface) of $f$ at value $c$ is
$$L_c = \{\mathbf{x} \in D : f(\mathbf{x}) = c\}$$
For $n = 2$, level sets are contour lines (curves in the plane). For $n = 3$, they are level surfaces. For general $n$, they are $(n-1)$-dimensional hypersurfaces.
Geometric meaning: Moving along a level set keeps the function value constant. You expend no potential energy. The gradient, as we will prove in 4.3, is always perpendicular to every level set through the point.
Examples:
- $f(x, y) = x^2 + y^2$: level sets are concentric circles (for $c > 0$)
- $f(x, y) = x^2 - y^2$: level sets are hyperbolas
- $f(\mathbf{x}) = \mathbf{x}^\top A \mathbf{x}$ with $A \succ 0$: level sets are ellipsoids in $\mathbb{R}^n$ (elongated if $A$ is ill-conditioned)
```
CONTOUR GEOMETRY FOR f(x,y) = x^2 + y^2

        ____________
       /    ____    \
      /    /    \    \       inner ring:  c = 1
     |    | (0,0)|    |      outer ring:  c = 4   (c = 9 lies further out)
      \    \____/    /       (0,0) = minimum
       \____________/

- Each ring = constant f value       - Minimum at center
- nabla f points radially outward    - nabla f _|_ each ring (proved in 4.3)
```
For AI: The loss landscape's level sets determine how gradient descent navigates. If level sets are elongated ellipses (high condition number), standard gradient descent zigzags slowly - motivating preconditioning and adaptive methods like Adam, which effectively rescale the gradient to account for curvature.
2.3 Limits and Continuity in $\mathbb{R}^n$
Definition. We say $\lim_{\mathbf{x} \to \mathbf{a}} f(\mathbf{x}) = L$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that
$$0 < \lVert \mathbf{x} - \mathbf{a} \rVert < \delta \implies |f(\mathbf{x}) - L| < \varepsilon$$
This definition is formally identical to the single-variable case, but with Euclidean distance $\lVert \mathbf{x} - \mathbf{a} \rVert$ replacing $|x - a|$.
Critical difference from single-variable calculus: In $\mathbb{R}$, you can approach a point from only two directions (left or right). In $\mathbb{R}^n$ with $n \ge 2$, you can approach along infinitely many paths. A limit exists only if the same value is obtained along every path.
Non-example (path-dependent limit): Consider $f(x, y) = \dfrac{xy}{x^2 + y^2}$ as $(x, y) \to (0, 0)$:
- Along $y = 0$: $f(x, 0) = 0 \to 0$
- Along $y = x$: $f(x, x) = \dfrac{x^2}{2x^2} = \dfrac{1}{2} \to \dfrac{1}{2}$
Different paths give different limits, so $\lim_{(x,y)\to(0,0)} f(x, y)$ does not exist.
Continuity: $f$ is continuous at $\mathbf{a}$ if $\lim_{\mathbf{x}\to\mathbf{a}} f(\mathbf{x}) = f(\mathbf{a})$. Polynomials, rational functions (away from zeros of the denominator), $e^x$, $\log x$ (where $x > 0$) are all continuous on their domains.
All loss functions used in practice (MSE, cross-entropy, hinge, Huber) are continuous on their natural domains. This ensures the gradient descent trajectory is well-defined and doesn't jump discontinuously.
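The path-dependence above is easy to see numerically (a minimal sketch using the same $f(x, y) = xy/(x^2 + y^2)$):

```python
import numpy as np

# f(x, y) = x*y / (x^2 + y^2): the classic path-dependent limit at the origin.
def f(x, y):
    return x * y / (x**2 + y**2)

ts = np.array([1e-1, 1e-3, 1e-6])       # points shrinking toward (0, 0)

along_axis = f(ts, np.zeros_like(ts))   # approach along y = 0
along_diag = f(ts, ts)                  # approach along y = x

print(along_axis)   # [0. 0. 0.]
print(along_diag)   # [0.5 0.5 0.5]  -> value depends on the path, so no limit exists
```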
2.4 Differentiability vs. Partial Differentiability
This is a subtle but important distinction that illuminates why the gradient is defined the way it is.
Partial differentiability does NOT imply differentiability. The partial derivatives can exist at a point while the function fails to be differentiable (or even continuous) there.
Non-example (Cauchy 1821): Define
$$f(x, y) = \begin{cases} \dfrac{xy}{x^2 + y^2} & (x, y) \ne (0, 0) \\[4pt] 0 & (x, y) = (0, 0) \end{cases}$$
- $\dfrac{\partial f}{\partial x}(0, 0) = \lim_{h \to 0} \dfrac{f(h, 0) - f(0, 0)}{h} = 0$, and $\dfrac{\partial f}{\partial y}(0, 0) = 0$ (by symmetry)
- But $f$ is not continuous at $(0, 0)$ (as shown in 2.3, the limit doesn't exist)
True differentiability requires the existence of a linear map $L: \mathbb{R}^n \to \mathbb{R}$ such that
$$\lim_{\mathbf{h} \to \mathbf{0}} \frac{f(\mathbf{a} + \mathbf{h}) - f(\mathbf{a}) - L(\mathbf{h})}{\lVert \mathbf{h} \rVert} = 0$$
When $f$ is differentiable at $\mathbf{a}$, the linear map is given by $L(\mathbf{h}) = \nabla f(\mathbf{a})^\top \mathbf{h}$. The gradient is then the representation of this linear map in the standard basis.
Sufficient condition: If all partial derivatives exist and are continuous on a neighborhood of $\mathbf{a}$, then $f$ is differentiable at $\mathbf{a}$. All standard ML functions satisfy this condition away from a few exceptional points (like $x = 0$ for ReLU).
3. Partial Derivatives
3.1 Definition as a Limit
Definition. The partial derivative of $f$ with respect to $x_i$ at the point $\mathbf{a} = (a_1, \dots, a_n)$ is
$$\frac{\partial f}{\partial x_i}(\mathbf{a}) = \lim_{h \to 0} \frac{f(a_1, \dots, a_i + h, \dots, a_n) - f(a_1, \dots, a_n)}{h}$$
provided this limit exists. We perturb only the $i$-th coordinate by $h$ and hold all others fixed.
Notation variants (all in common use):
| Notation | Context |
|---|---|
| $\dfrac{\partial f}{\partial x_i}$ | Standard - preferred in this repo |
| $f_{x_i}$ or $\partial_{x_i} f$ | Compact, used in PDEs |
| $D_i f$ | Operator notation |
| $\partial_i f$ | When clear from context |
Geometric interpretation: $\dfrac{\partial f}{\partial x_i}(\mathbf{a})$ is the slope of the curve obtained by intersecting the surface $z = f(\mathbf{x})$ with the hyperplane $\{x_j = a_j \text{ for all } j \ne i\}$. It measures the instantaneous rate of change of $f$ when we move parallel to the $x_i$-axis.
```
GEOMETRIC MEANING OF A PARTIAL DERIVATIVE

Surface z = f(x, y)                    Cross-section at fixed y = a_2

     z                                      z
     |    ___ surface                       |        _ curve z = f(x, a_2)
     |   /   \___                           |      _/
     |  /        \                          |    _/   slope at x = a_1 is
     | /__________\____ y                   |___/__________ x   partial f/partial x
     /    slice at y = a_2                      a_1
    x

The partial  partial f/partial x  is the slope of the curve obtained when we
slice the surface with the plane y = a_2 (constant).
```
For AI: PyTorch's .grad attribute on each parameter after loss.backward() is exactly $\partial L / \partial \theta_i$ for each parameter $\theta_i$. When you call loss.backward(), it computes all partial derivatives simultaneously via reverse-mode automatic differentiation.
3.2 Computation: Treating Other Variables as Constants
The practical rule: hold all variables except $x_i$ constant, then differentiate as a single-variable function of $x_i$.
All standard differentiation rules (power rule, product rule, chain rule) apply, treating the fixed variables as parameters.
Example 1: $f(x, y) = x^3 y^2 + y$. Treating the other variable as a constant:
$$\frac{\partial f}{\partial x} = 3x^2 y^2, \qquad \frac{\partial f}{\partial y} = 2x^3 y + 1$$
Example 2: $f(x, y) = e^{xy} \sin y$. By the product and chain rules:
$$\frac{\partial f}{\partial x} = y\,e^{xy} \sin y, \qquad \frac{\partial f}{\partial y} = x\,e^{xy} \sin y + e^{xy} \cos y$$
Example 3 (ML): Loss for one data point $(\mathbf{x}, y)$: $\ell(\mathbf{w}) = \frac{1}{2}(\mathbf{w}^\top \mathbf{x} - y)^2$. Then
$$\frac{\partial \ell}{\partial w_j} = (\mathbf{w}^\top \mathbf{x} - y)\,x_j$$
This illustrates the general pattern: $\nabla_{\mathbf{w}} \ell = (\text{prediction} - \text{target}) \cdot \mathbf{x}$ - the gradient of a squared loss is error times feature.
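The "error times feature" pattern can be verified with centered finite differences (the toy data values below are arbitrary):

```python
import numpy as np

# Squared loss for one data point: l(w) = 0.5 * (w.x - y)^2
# Claimed partial: dl/dw_j = (w.x - y) * x_j   ("error times feature")
x = np.array([0.5, -1.0, 2.0])
y = 1.5
w = np.array([0.2, 0.4, -0.3])

def loss(w):
    return 0.5 * (w @ x - y) ** 2

analytic = (w @ x - y) * x           # error times feature

# Centered finite differences, one coordinate at a time
h = 1e-6
numeric = np.zeros_like(w)
for j in range(len(w)):
    e = np.zeros_like(w)
    e[j] = h
    numeric[j] = (loss(w + e) - loss(w - e)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))   # tiny (finite-difference error only)
```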
3.3 Higher-Order Partial Derivatives
Since $\frac{\partial f}{\partial x_i}$ is itself a function of $\mathbf{x}$, we can differentiate again:
Second-order partials:
$$\frac{\partial^2 f}{\partial x_j \partial x_i} = \frac{\partial}{\partial x_j}\left(\frac{\partial f}{\partial x_i}\right)$$
The collection of all second-order partials forms the Hessian matrix $H_f(\mathbf{x}) \in \mathbb{R}^{n \times n}$, with entries $[H_f]_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$.
Scope note: The Hessian is defined here, but all properties (symmetry, definiteness, Newton's method) are treated in 02-Jacobians-and-Hessians.
Example: $f(x, y) = x^3 y^2$. Then $f_{xx} = 6xy^2$, $f_{xy} = 6x^2 y$, $f_{yx} = 6x^2 y$, $f_{yy} = 2x^3$.
Note: $f_{xy} = f_{yx}$ - this equality is Clairaut's theorem.
3.4 Equality of Mixed Partials - Clairaut Preview
Theorem (Clairaut/Schwarz). If $f$ has continuous second partial derivatives on an open set, then for all $i, j$:
$$\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$$
That is, the order of partial differentiation does not matter.
Consequence for the Hessian: The Hessian matrix is symmetric: $H = H^\top$.
Why continuity is required: Without continuity of the mixed partials, equality can fail. The standard counterexample is:
$$f(x, y) = \begin{cases} \dfrac{xy(x^2 - y^2)}{x^2 + y^2} & (x, y) \ne (0, 0) \\[4pt] 0 & (x, y) = (0, 0) \end{cases}$$
For this function, $f_{xy}(0, 0) = -1 \ne 1 = f_{yx}(0, 0)$.
Full proof and implications: 02-Jacobians-and-Hessians - Clairaut's theorem, Hessian symmetry, positive definiteness.
For AI: The symmetry is why Hessian-vector products are computationally feasible - symmetric matrices have real eigenvalues, and efficient algorithms (Pearlmutter, 1994) compute $H\mathbf{v}$ without forming $H$ explicitly.
3.5 The Total Derivative
When all variables change simultaneously by small amounts $\Delta x_1, \dots, \Delta x_n$, the total change in $f$ is, to first order:
$$\Delta f \approx \frac{\partial f}{\partial x_1} \Delta x_1 + \cdots + \frac{\partial f}{\partial x_n} \Delta x_n$$
In differential notation:
$$df = \frac{\partial f}{\partial x_1}\,dx_1 + \cdots + \frac{\partial f}{\partial x_n}\,dx_n$$
This is the total differential of $f$ at $\mathbf{x}$. It is exact (not an approximation) in the differential sense, and gives the first-order approximation
$$f(\mathbf{x} + \Delta\mathbf{x}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^\top \Delta\mathbf{x}$$
when $\lVert \Delta\mathbf{x} \rVert$ is small. This first-order Taylor approximation is the foundation of gradient descent: taking a step $\Delta\mathbf{x} = -\eta \nabla f(\mathbf{x})$ decreases $f$ by approximately $\eta \lVert \nabla f(\mathbf{x}) \rVert^2$ (to first order).
Perturbation analysis in ML: When we change a model's weight $w_i$ by a small amount $\delta$, the loss changes by approximately $\frac{\partial L}{\partial w_i}\,\delta$. This is why the gradient tells us exactly which parameters, perturbed in which direction, reduce the loss most efficiently.
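A small sketch of this perturbation analysis, using an arbitrary smooth toy "loss":

```python
import numpy as np

# First-order prediction of a loss change from the gradient.
# f(w) = w1^2 + 3*w1*w2 (an arbitrary smooth toy loss)
def f(w):
    return w[0]**2 + 3 * w[0] * w[1]

def grad_f(w):
    return np.array([2 * w[0] + 3 * w[1], 3 * w[0]])

w = np.array([1.0, -0.5])
delta = np.array([1e-3, -2e-3])      # a small perturbation of the weights

actual = f(w + delta) - f(w)
predicted = grad_f(w) @ delta        # nabla f . delta

print(actual, predicted)  # differ only at second order, O(||delta||^2)
```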
4. The Gradient Vector
4.1 Definition and Notation
Definition. Let $f: \mathbb{R}^n \to \mathbb{R}$ be differentiable at $\mathbf{x}$. The gradient of $f$ at $\mathbf{x}$ is the column vector of all partial derivatives:
$$\nabla f(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial f}{\partial x_1}(\mathbf{x}) \\ \vdots \\ \dfrac{\partial f}{\partial x_n}(\mathbf{x}) \end{bmatrix} \in \mathbb{R}^n$$
Notation conventions (following docs/NOTATION_GUIDE.md):
| Notation | Meaning |
|---|---|
| $\nabla f(\mathbf{x})$ | Gradient at $\mathbf{x}$; subscript on $\nabla$ omitted when the variable is clear |
| $\nabla_{\mathbf{x}} f$ | Gradient with respect to $\mathbf{x}$ (explicit subscript) |
| $\nabla_\theta L(\theta)$ | Gradient of loss with respect to parameters |
| $\mathbf{g}$ | Common shorthand for the gradient in optimization |
Orientation convention: Following Nocedal & Wright (2006) and all major ML frameworks, the gradient is a column vector. This means the first-order approximation is $f(\mathbf{x} + \mathbf{h}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x})^\top \mathbf{h}$.
Non-notation note: Do not write $\nabla f$ as a row vector. Some texts (especially those following Jacobian conventions) write the gradient as a row, but this creates ambiguity with the Jacobian $J_f = (\nabla f)^\top$. We reserve the Jacobian convention for 02.
Examples:
- $f(x, y) = x^2 + y^2$: $\nabla f = (2x,\; 2y)^\top$
- $f(\mathbf{x}) = \mathbf{a}^\top \mathbf{x}$: $\nabla f = \mathbf{a}$
- $f(x, y) = e^{xy}$: $\nabla f = (y\,e^{xy},\; x\,e^{xy})^\top$
4.2 The Gradient Points in the Direction of Steepest Ascent
Theorem. Among all unit vectors $\mathbf{u}$ with $\lVert \mathbf{u} \rVert = 1$, the directional derivative $D_{\mathbf{u}} f(\mathbf{x})$ is maximized when $\mathbf{u} = \dfrac{\nabla f(\mathbf{x})}{\lVert \nabla f(\mathbf{x}) \rVert}$, giving maximum value $\lVert \nabla f(\mathbf{x}) \rVert$.
Proof. By the Cauchy-Schwarz inequality:
$$D_{\mathbf{u}} f = \nabla f^\top \mathbf{u} \le \lVert \nabla f \rVert\,\lVert \mathbf{u} \rVert = \lVert \nabla f \rVert$$
with equality when $\mathbf{u}$ is parallel to $\nabla f$, i.e., $\mathbf{u} = \nabla f / \lVert \nabla f \rVert$. $\square$
Similarly, the minimum directional derivative is $-\lVert \nabla f \rVert$, achieved in direction $-\nabla f / \lVert \nabla f \rVert$.
Corollary (gradient descent). To decrease $f$ most rapidly, step in direction $-\nabla f$. This is why gradient descent uses $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$ - it moves in the direction of steepest descent.
```
DIRECTIONAL DERIVATIVE vs. DIRECTION

Given nabla f at a point, the directional derivative in direction u:

    D_u f = nabla f . u = ||nabla f|| cos(theta),
    where theta = angle between nabla f and u

Maximum (theta = 0):    u =  nabla f/||nabla f||   D_u f =  ||nabla f||  <- steepest ascent
Zero    (theta = 90):   u _|_ nabla f              D_u f =  0            <- along level set
Minimum (theta = 180):  u = -nabla f/||nabla f||   D_u f = -||nabla f||  <- steepest descent

Gradient descent chooses u = -nabla f/||nabla f|| at every step.
```
For AI: The gradient is the unique direction in which perturbing $\theta$ increases the loss most rapidly per unit norm. Gradient descent follows its negative. Adam modifies this direction by dividing each coordinate by the square root of the running second moment $\hat{\mathbf{v}}_t$, effectively normalizing per-coordinate scale - this is why Adam often converges faster when gradients have very different scales across parameters.
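The steepest-ascent theorem can be checked empirically: among thousands of random unit directions, none has a larger directional derivative than $\lVert \nabla f \rVert$ (a toy quadratic is used for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = x^T A x with a fixed symmetric A; gradient is 2 A x.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])
x = np.array([1.0, -1.0])
g = 2 * A @ x                      # analytic gradient at x

# Directional derivative D_u f = g . u for many random unit directions
us = rng.normal(size=(10000, 2))
us /= np.linalg.norm(us, axis=1, keepdims=True)
slopes = us @ g

# No sampled direction beats the gradient direction, whose slope is ||g||
print(slopes.max(), np.linalg.norm(g))   # max <= ||g||, and very close to it
```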
4.3 Gradient Perpendicular to Level Sets
Theorem. Let $f$ be differentiable, and let $\mathbf{x}_0$ lie on the level set $L_c = \{\mathbf{x} : f(\mathbf{x}) = c\}$. Then $\nabla f(\mathbf{x}_0)$ is perpendicular to $L_c$ at $\mathbf{x}_0$ - that is, $\nabla f(\mathbf{x}_0)$ is orthogonal to every tangent vector of $L_c$ at $\mathbf{x}_0$.
Proof. Let $\gamma(t)$ be any smooth curve lying in $L_c$ with $\gamma(0) = \mathbf{x}_0$. Then $f(\gamma(t)) = c$ for all $t$. Differentiating:
$$\frac{d}{dt} f(\gamma(t)) = 0$$
By the single-variable chain rule applied along the curve:
$$\nabla f(\gamma(t))^\top \gamma'(t) = 0$$
At $t = 0$: $\nabla f(\mathbf{x}_0)^\top \gamma'(0) = 0$.
Since $\gamma'(0)$ is any tangent vector of $L_c$ at $\mathbf{x}_0$, we conclude $\nabla f(\mathbf{x}_0) \perp L_c$. $\square$
Geometric consequence: The gradient always points directly "away" from the level set - it is the normal vector to the surface $f(\mathbf{x}) = c$. To travel along the level set (constant loss), you must move perpendicular to the gradient.
For AI: In continual learning and orthogonal gradient descent (OGD), gradients are projected to be orthogonal to previous task gradients. This ensures learning a new task doesn't increase loss on previous tasks - it literally moves along the level set of the old loss while decreasing the new one.
4.4 Gradient as Linear Approximation
The gradient gives the best linear (affine) approximation to $f$ near a point $\mathbf{a}$:
$$f(\mathbf{x}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a})$$
This is the first-order multivariate Taylor expansion. The remainder term is controlled by the Hessian (02).
Interpretation: The hyperplane $z = f(\mathbf{a}) + \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a})$ is the tangent hyperplane to the surface $z = f(\mathbf{x})$ at the point $(\mathbf{a}, f(\mathbf{a}))$.
Error bound: If $f$ has bounded second derivatives - specifically if $\lVert H_f(\mathbf{x}) \rVert \le L$ for all $\mathbf{x}$ - then:
$$\big| f(\mathbf{x}) - f(\mathbf{a}) - \nabla f(\mathbf{a})^\top (\mathbf{x} - \mathbf{a}) \big| \le \frac{L}{2} \lVert \mathbf{x} - \mathbf{a} \rVert^2$$
For AI: The $L$-smoothness condition (equivalent to a bounded Hessian) is the key assumption in gradient descent convergence proofs. It guarantees that the linear approximation is not too far off, so taking a step with $\eta \le 1/L$ always decreases $f$.
4.5 Gradient of Standard Forms
These are among the most frequently used gradient computations in ML. All can be derived from first principles via the definition.
Theorem (matrix calculus identities). Let $\mathbf{a}, \mathbf{x} \in \mathbb{R}^n$, $A \in \mathbb{R}^{n \times n}$ (and, for the least-squares row, $A \in \mathbb{R}^{m \times n}$, $\mathbf{b} \in \mathbb{R}^m$).
| $f(\mathbf{x})$ | $\nabla f(\mathbf{x})$ | Proof sketch |
|---|---|---|
| $\mathbf{a}^\top \mathbf{x}$ | $\mathbf{a}$ | $f = \sum_i a_i x_i$, so $\partial f / \partial x_i = a_i$ |
| $\mathbf{x}^\top \mathbf{a}$ | $\mathbf{a}$ | Same as above (dot product is symmetric) |
| $\mathbf{x}^\top \mathbf{x} = \lVert\mathbf{x}\rVert^2$ | $2\mathbf{x}$ | $f = \sum_i x_i^2$, so $\partial f / \partial x_i = 2x_i$ |
| $\mathbf{x}^\top A \mathbf{x}$ | $(A + A^\top)\mathbf{x}$ | See derivation below |
| $\mathbf{x}^\top A \mathbf{x}$ ($A$ symmetric) | $2A\mathbf{x}$ | Since $A + A^\top = 2A$ |
| $\lVert A\mathbf{x} - \mathbf{b}\rVert^2$ | $2A^\top(A\mathbf{x} - \mathbf{b})$ | Chain rule + above |
| $\lVert\mathbf{x}\rVert$ | $\mathbf{x} / \lVert\mathbf{x}\rVert$ | Chain rule: $\lVert\mathbf{x}\rVert = \sqrt{\mathbf{x}^\top\mathbf{x}}$ |
Proof of $\nabla(\mathbf{x}^\top A \mathbf{x}) = (A + A^\top)\mathbf{x}$:
$$\frac{\partial}{\partial x_k} \sum_{i,j} A_{ij} x_i x_j = \sum_j A_{kj} x_j + \sum_i A_{ik} x_i = (A\mathbf{x})_k + (A^\top \mathbf{x})_k$$
Therefore $\nabla(\mathbf{x}^\top A \mathbf{x}) = (A + A^\top)\mathbf{x}$. When $A$ is symmetric ($A = A^\top$), this simplifies to $2A\mathbf{x}$.
Proof of $\nabla \lVert A\mathbf{x} - \mathbf{b} \rVert^2 = 2A^\top(A\mathbf{x} - \mathbf{b})$:
Let $\mathbf{r} = A\mathbf{x} - \mathbf{b}$. Then $\lVert A\mathbf{x} - \mathbf{b} \rVert^2 = \mathbf{r}^\top \mathbf{r}$. By the chain rule:
$$\nabla_{\mathbf{x}}(\mathbf{r}^\top \mathbf{r}) = 2 \left( \frac{\partial \mathbf{r}}{\partial \mathbf{x}} \right)^{\!\top} \mathbf{r} = 2A^\top (A\mathbf{x} - \mathbf{b})$$
For the MSE loss $L(\mathbf{w}) = \frac{1}{m} \lVert X\mathbf{w} - \mathbf{y} \rVert^2$, this gives:
$$\nabla L(\mathbf{w}) = \frac{2}{m} X^\top (X\mathbf{w} - \mathbf{y})$$
This is the gradient of the ordinary least squares objective - the formula that underlies every linear regression solver.
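A numerical check of the identity $\nabla \lVert A\mathbf{x} - \mathbf{b} \rVert^2 = 2A^\top(A\mathbf{x} - \mathbf{b})$ against centered differences (random toy data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 6
A = rng.normal(size=(m, n))
b = rng.normal(size=m)
x = rng.normal(size=n)

def f(x):
    return np.sum((A @ x - b) ** 2)     # ||Ax - b||^2

analytic = 2 * A.T @ (A @ x - b)        # the matrix-calculus identity

# Centered finite differences, one standard basis direction at a time
h = 1e-6
numeric = np.array([
    (f(x + h * e) - f(x - h * e)) / (2 * h)
    for e in np.eye(n)
])

print(np.max(np.abs(analytic - numeric)))   # small finite-difference error
```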
5. Directional Derivatives
5.1 Definition and Limit Form
The partial derivative measures change along a coordinate axis. The directional derivative measures change in any arbitrary direction.
Definition. Let $f: \mathbb{R}^n \to \mathbb{R}$, $\mathbf{a} \in \mathbb{R}^n$, and $\mathbf{u} \in \mathbb{R}^n$ with $\lVert \mathbf{u} \rVert = 1$ (unit vector). The directional derivative of $f$ at $\mathbf{a}$ in direction $\mathbf{u}$ is:
$$D_{\mathbf{u}} f(\mathbf{a}) = \lim_{h \to 0} \frac{f(\mathbf{a} + h\mathbf{u}) - f(\mathbf{a})}{h}$$
Note on unit vectors: The convention $\lVert \mathbf{u} \rVert = 1$ is important. If we allowed $\lVert \mathbf{u} \rVert \ne 1$, the directional derivative would scale with $\lVert \mathbf{u} \rVert$, conflating direction and magnitude. With $\lVert \mathbf{u} \rVert = 1$, $D_{\mathbf{u}} f$ measures purely the rate of change per unit distance in direction $\mathbf{u}$.
Relationship to partial derivatives: Partial derivatives are special cases. If $\mathbf{u} = \mathbf{e}_i$ (the $i$-th standard basis vector), then:
$$D_{\mathbf{e}_i} f(\mathbf{a}) = \frac{\partial f}{\partial x_i}(\mathbf{a})$$
So partial derivatives are directional derivatives along coordinate axes.
Example: $f(x, y) = x^2 y$, $\mathbf{a} = (1, 2)$, $\mathbf{u} = \frac{1}{\sqrt{2}}(1, 1)$ (northeast direction).
We can compute this via the gradient formula (5.2): $\nabla f = (2xy,\; x^2)^\top$, so $\nabla f(1, 2) = (4,\; 1)^\top$, and:
$$D_{\mathbf{u}} f(1, 2) = \nabla f(1, 2)^\top \mathbf{u} = \frac{4 + 1}{\sqrt{2}} = \frac{5}{\sqrt{2}} \approx 3.54$$
5.2 The Gradient Formula
Theorem. If $f$ is differentiable at $\mathbf{a}$, then for any unit vector $\mathbf{u}$:
$$D_{\mathbf{u}} f(\mathbf{a}) = \nabla f(\mathbf{a})^\top \mathbf{u}$$
Proof. Define $g(t) = f(\mathbf{a} + t\mathbf{u})$ for $t \in \mathbb{R}$. Then $D_{\mathbf{u}} f(\mathbf{a}) = g'(0)$. By the single-variable chain rule applied component-wise:
$$g'(t) = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}(\mathbf{a} + t\mathbf{u})\,u_i = \nabla f(\mathbf{a} + t\mathbf{u})^\top \mathbf{u}$$
At $t = 0$: $g'(0) = \nabla f(\mathbf{a})^\top \mathbf{u}$. $\square$
Geometric meaning: $D_{\mathbf{u}} f = \lVert \nabla f \rVert \cos\theta$, where $\theta$ is the angle between $\nabla f$ and $\mathbf{u}$.
Consequence: The entire directional derivative function $\mathbf{u} \mapsto D_{\mathbf{u}} f(\mathbf{a})$ is determined by the single vector $\nabla f(\mathbf{a})$. This is the key economy: $n$ numbers encode all directional information.
5.3 Extremizing the Directional Derivative
Summary table:
| Direction | Angle $\theta$ | $D_{\mathbf{u}} f$ | Meaning |
|---|---|---|---|
| $\mathbf{u} = \hat{\mathbf{g}}$ | $0^\circ$ | $\lVert \nabla f \rVert$ | Steepest ascent |
| $\mathbf{u} \perp \hat{\mathbf{g}}$ | $90^\circ$ | $0$ | Along level set |
| $\mathbf{u} = -\hat{\mathbf{g}}$ | $180^\circ$ | $-\lVert \nabla f \rVert$ | Steepest descent |
where $\hat{\mathbf{g}} = \nabla f / \lVert \nabla f \rVert$ is the unit gradient.
Worked example: $f(x, y) = e^{-x^2}$ at $\mathbf{a} = (-1, 0)$. Find: (a) the steepest ascent direction; (b) the directional derivative along $\mathbf{u} = (0, 1)$.
Here $\nabla f = (-2x\,e^{-x^2},\; 0)^\top$, so $\nabla f(-1, 0) = (2e^{-1},\; 0)^\top$.
(a) Steepest ascent: $\mathbf{u}^{\star} = (1, 0)$ (eastward), rate $\lVert \nabla f \rVert = 2/e \approx 0.74$.
(b) $D_{(0,1)} f = \nabla f^\top (0, 1)^\top = 0$ - the Gaussian is flat in the $y$-direction at $\mathbf{a}$ (it's on a ridge).
5.4 Subgradients - Preview
At points where $f$ is not differentiable, the gradient doesn't exist, but we can generalize via subgradients.
Definition. A vector $\mathbf{g} \in \mathbb{R}^n$ is a subgradient of a convex function $f$ at $\mathbf{x}$ if for all $\mathbf{y}$:
$$f(\mathbf{y}) \ge f(\mathbf{x}) + \mathbf{g}^\top (\mathbf{y} - \mathbf{x})$$
The subdifferential $\partial f(\mathbf{x})$ is the set of all subgradients.
Example (ReLU): $f(x) = \max(0, x)$:
- For $x > 0$: $\partial f(x) = \{1\}$ (unique gradient)
- For $x < 0$: $\partial f(x) = \{0\}$ (unique gradient)
- For $x = 0$: $\partial f(0) = [0, 1]$ (any $g \in [0, 1]$ is a subgradient)
In PyTorch, torch.relu(x).backward() uses $g = 0$ at $x = 0$ by convention - this is a valid subgradient choice and works fine in practice since $x = 0$ occurs with probability zero for continuous distributions.
Full treatment of subgradients and convex optimization: 04-Optimality-Conditions and Chapter 8 - Optimization.
6. Gradient in Machine Learning
6.1 Mean Squared Error Gradient
Setup: Linear regression model $\hat{\mathbf{y}} = X\mathbf{w}$, where $X \in \mathbb{R}^{m \times n}$ is the data matrix, $\mathbf{w} \in \mathbb{R}^n$ is the weight vector, and $\mathbf{y} \in \mathbb{R}^m$ is the target vector.
Loss: Mean squared error (MSE):
$$L(\mathbf{w}) = \frac{1}{m} \lVert X\mathbf{w} - \mathbf{y} \rVert^2$$
Gradient derivation:
Let $\mathbf{r} = X\mathbf{w} - \mathbf{y}$ (residuals). Then:
$$\nabla L(\mathbf{w}) = \frac{2}{m} X^\top \mathbf{r} = \frac{2}{m} X^\top (X\mathbf{w} - \mathbf{y})$$
Interpretation: Each component $[\nabla L]_j = \frac{2}{m}\,\mathbf{x}_{(j)}^\top \mathbf{r}$ is the inner product of the $j$-th feature column with the residuals. The gradient is zero when all features are uncorrelated with the residuals - exactly the OLS normal equations $X^\top X \mathbf{w} = X^\top \mathbf{y}$.
Gradient descent for linear regression:
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \cdot \frac{2}{m} X^\top (X\mathbf{w}_t - \mathbf{y})$$
With a small enough learning rate $\eta$, this converges to the least squares solution (Boyd & Vandenberghe, 2004, 9.3).
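A minimal sketch of this iteration converging to the closed-form least squares solution (random toy data; the learning rate and iteration count are illustrative, not tuned):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 3
X = rng.normal(size=(m, n))
y = rng.normal(size=m)

w = np.zeros(n)
eta = 0.05
for _ in range(5000):
    grad = (2 / m) * X.T @ (X @ w - y)   # MSE gradient derived above
    w -= eta * grad

w_star, *_ = np.linalg.lstsq(X, y, rcond=None)   # closed-form least squares
print(np.max(np.abs(w - w_star)))                # essentially zero
```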
6.2 Cross-Entropy and Logistic Regression Gradient
Logistic regression: Binary classifier $\hat{y} = \sigma(\mathbf{w}^\top \mathbf{x})$ where $\sigma(z) = \dfrac{1}{1 + e^{-z}}$.
Loss (binary cross-entropy on $m$ examples):
$$L(\mathbf{w}) = -\frac{1}{m} \sum_{i=1}^{m} \big[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \big]$$
Gradient derivation:
For one example, let $z = \mathbf{w}^\top \mathbf{x}$, $\hat{y} = \sigma(z)$. Using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$:
$$\nabla_{\mathbf{w}} \ell = (\hat{y} - y)\,\mathbf{x}$$
Therefore, averaging over $m$ examples:
$$\nabla L(\mathbf{w}) = \frac{1}{m} X^\top (\hat{\mathbf{y}} - \mathbf{y})$$
where $\hat{\mathbf{y}} = \sigma(X\mathbf{w})$ elementwise. Strikingly, this has the same form as the linear regression gradient: error times features. This pattern holds for all generalized linear models in the exponential family.
Softmax (multiclass) gradient:
For $K$-class classification with softmax probabilities $\mathbf{p} = \mathrm{softmax}(\mathbf{z})$ and one-hot label $\mathbf{y}$:
$$\nabla_{\mathbf{z}} \ell = \mathbf{p} - \mathbf{y}$$
In matrix form: $\nabla_Z L = \frac{1}{m}(P - Y)$, where $P$ has rows of predicted probabilities.
For AI: The output gradient of a transformer's logit layer (predicted minus true distribution) is used directly in backpropagation. Its simplicity - no need to differentiate through the softmax explicitly - is why cross-entropy is the universal choice for classification losses.
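The claimed softmax cross-entropy gradient $\mathbf{p} - \mathbf{y}$ can be verified against finite differences (toy logits; here `y` is the true class index rather than a one-hot vector):

```python
import numpy as np

# Cross-entropy of softmax logits vs a one-hot label; claimed gradient: p - y.
def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

def ce_loss(z, y):
    return -np.log(softmax(z)[y])    # y is the true class index

z = np.array([1.0, -0.5, 2.0, 0.3])
y = 2

p = softmax(z)
analytic = p.copy()
analytic[y] -= 1.0                   # p - y, with y as one-hot

# Centered finite differences over the logits
h = 1e-6
numeric = np.array([
    (ce_loss(z + h * e, y) - ce_loss(z - h * e, y)) / (2 * h)
    for e in np.eye(len(z))
])

print(np.max(np.abs(analytic - numeric)))   # small finite-difference error
```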
6.3 Gradient of Regularized Objectives
Regularization adds a penalty term $R(\mathbf{w})$ to the loss. The gradient of the combined objective is the sum of gradients:
$$\nabla\big[ L(\mathbf{w}) + \lambda R(\mathbf{w}) \big] = \nabla L(\mathbf{w}) + \lambda \nabla R(\mathbf{w})$$
L2 regularization (weight decay): $R(\mathbf{w}) = \lVert \mathbf{w} \rVert^2$, so $\nabla R = 2\mathbf{w}$.
The update rule becomes: $\mathbf{w} \leftarrow (1 - 2\eta\lambda)\mathbf{w} - \eta \nabla L(\mathbf{w})$. The factor $(1 - 2\eta\lambda)$ shrinks the weights at every step - this is called weight decay, and it is the standard regularization in LLM training (AdamW applies it as a separate, decoupled step).
L1 regularization (sparsity): $R(\mathbf{w}) = \lVert \mathbf{w} \rVert_1 = \sum_i |w_i|$.
The gradient of $|w_i|$ is $\mathrm{sign}(w_i)$ (elementwise), which is a subgradient at $w_i = 0$. L1 regularization promotes sparsity - many weights become exactly zero - but it is less common in deep learning, partly because plain (sub)gradient steps on the nonsmooth penalty do not drive weights exactly to zero; proximal or thresholding methods are needed for exact sparsity.
Elastic net: $R(\mathbf{w}) = \alpha \lVert \mathbf{w} \rVert_1 + (1 - \alpha) \lVert \mathbf{w} \rVert^2$ combines both.
6.4 Weight Sharing and Gradient Accumulation
Weight sharing: When the same parameter $w$ appears multiple times in a model (e.g., a shared embedding layer used in both encoder and decoder), the gradient is the sum of the gradients from each occurrence:
$$\frac{\partial L}{\partial w} = \sum_k \left. \frac{\partial L}{\partial w} \right|_{\text{occurrence } k}$$
This follows from the multivariable chain rule. In convolutional neural networks, all positions share the same filter weights - the filter's gradient is the sum of its gradient contributions from every spatial location.
Gradient accumulation: For large batch sizes that don't fit in GPU memory, we split the batch into micro-batches, accumulate gradients across them, and apply one optimizer step:
```python
for micro_batch in micro_batches:
    loss = model(micro_batch) / num_micro_batches
    loss.backward()          # accumulates into .grad
optimizer.step()
optimizer.zero_grad()
```
This gives the same gradient as computing on the full batch because $\nabla \sum_k L_k = \sum_k \nabla L_k$ (linearity of the gradient).
For AI: GPT-3 and subsequent LLMs were trained with gradient accumulation because the full batch size (millions of tokens) far exceeds GPU memory. Gradient accumulation is mathematically equivalent to large-batch training, making it essential infrastructure for frontier model training.
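The linearity argument can be demonstrated directly: averaging per-micro-batch gradients reproduces the full-batch gradient exactly (equal micro-batch sizes assumed in this sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 3))       # a "full batch" of 8 examples
y = rng.normal(size=8)
w = rng.normal(size=3)

def grad_mse(Xb, yb, w):
    return (2 / len(yb)) * Xb.T @ (Xb @ w - yb)

full = grad_mse(X, y, w)

# Two micro-batches of 4; average their gradients (equal sizes here)
g1 = grad_mse(X[:4], y[:4], w)
g2 = grad_mse(X[4:], y[4:], w)
accumulated = (g1 + g2) / 2

print(np.max(np.abs(full - accumulated)))   # zero up to float rounding
```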
6.5 Gradient Descent: The Algorithm
Gradient descent is the iterative algorithm for minimizing $f$:
$$\theta_{t+1} = \theta_t - \eta_t \nabla f(\theta_t)$$
where $\eta_t > 0$ is the learning rate (or step size) at step $t$.
Why it works (local guarantee): By the first-order approximation:
$$f\big(\theta - \eta \nabla f(\theta)\big) \approx f(\theta) - \eta \lVert \nabla f(\theta) \rVert^2$$
For small enough $\eta$, the (negative) second term dominates the higher-order error, guaranteeing descent: $f(\theta_{t+1}) < f(\theta_t)$ (unless $\nabla f(\theta_t) = \mathbf{0}$).
Variants:
| Variant | Update | When to use |
|---|---|---|
| Batch GD | $\nabla L$ over all $m$ examples | Exact gradient; slow per step |
| Stochastic GD (SGD) | $\nabla L$ over 1 example | Noisy but fast; escapes sharp minima |
| Mini-batch GD | $\nabla L$ over $B$ examples | Best of both; standard in practice |
| SGD + Momentum | Adds exponential moving average of gradient | Accelerated convergence; reduces oscillation |
| Adam | Adaptive per-coordinate learning rates | Default for transformer training |
Convergence (L-smooth, convex case): If $f$ is convex and $L$-smooth, then gradient descent with step size $\eta = 1/L$ satisfies:
$$f(\theta_T) - f(\theta^{\star}) \le \frac{L \lVert \theta_0 - \theta^{\star} \rVert^2}{2T}$$
This $O(1/T)$ rate is tight for gradient descent. Strongly convex functions achieve linear (exponential) convergence.
For AI: Modern LLM training uses AdamW with cosine learning rate decay, gradient clipping at norm $1.0$, and weight decay. The gradient itself is the same object as described here - the engineering around it (clipping, schedule, accumulation) addresses numerical and optimization dynamics, not the mathematical definition.
Full convergence theory and algorithm variants: Chapter 8 - Optimization.
7. Numerical Differentiation
7.1 Finite Difference Formulas
When an analytical gradient is unavailable or hard to verify, we can approximate it numerically using finite differences.
Forward difference:
$$f'(x) \approx \frac{f(x + h) - f(x)}{h}$$
Error: $O(h)$ - the first dropped Taylor term is $\frac{h}{2} f''(x)$.
Backward difference:
$$f'(x) \approx \frac{f(x) - f(x - h)}{h}$$
Error: $O(h)$ - same order.
Centered difference:
$$f'(x) \approx \frac{f(x + h) - f(x - h)}{2h}$$
Error: $O(h^2)$ - second-order accurate (the even-order Taylor terms cancel in the subtraction).
Derivation of centered difference error:
By Taylor expansion:
$$f(x \pm h) = f(x) \pm h f'(x) + \frac{h^2}{2} f''(x) \pm \frac{h^3}{6} f'''(x) + O(h^4)$$
Subtracting: $f(x + h) - f(x - h) = 2h f'(x) + \frac{h^3}{3} f'''(x) + O(h^5)$, so the centered difference approximation has error $O(h^2)$.
Computational cost: Computing the full gradient via centered differences requires $2n$ function evaluations. For a model with $10^9$ parameters, this is computationally infeasible - this is why automatic differentiation (05) is essential.
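The error orders can be observed directly on a function with a known derivative (here $\sin$, chosen for illustration):

```python
import numpy as np

# Forward vs centered differences on f(x) = sin(x) at x = 1, where f'(x) = cos(x).
f, x, exact = np.sin, 1.0, np.cos(1.0)

for h in [1e-1, 1e-2, 1e-3]:
    fwd = (f(x + h) - f(x)) / h
    ctr = (f(x + h) - f(x - h)) / (2 * h)
    print(h, abs(fwd - exact), abs(ctr - exact))
    # forward error shrinks like h; centered error shrinks like h^2
```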
7.2 Optimal Step Size
Choosing $h$ involves a fundamental trade-off:
- Too large $h$: the Taylor truncation error $O(h^2)$ is large
- Too small $h$: floating-point cancellation error $\approx \epsilon / h$ dominates ($\epsilon \approx 2 \times 10^{-16}$ in double precision)
For centered differences, the optimal $h$ minimizes the total error $\frac{\epsilon}{h} + C h^2$, which gives:
$$h^{\star} \sim \epsilon^{1/3} \approx 10^{-5}$$
```
NUMERICAL GRADIENT ERROR vs. STEP SIZE h  (centered differences, float64)

log10(error)
  -2 |  \                                        /
     |   \  roundoff error ~ eps/h              /
  -4 |    \                                    /  truncation error ~ h^2
     |     \                                  /
  -6 |      \                                /
     |       \_                           _/
  -8 |         \_                       _/
     |           \_____        _______/
 -10 |                 \______/    <- minimum near h ~= 1e-5
     +------------------------------------------- log10(h)
       -12       -9        -6        -3        0
```
Rule of thumb: Use $h \approx 10^{-5}$ for centered differences in double precision. In single precision ($\epsilon \approx 10^{-7}$), use $h \approx 5 \times 10^{-3}$.
7.3 Gradient Checking
Gradient checking is the practice of verifying an analytical gradient implementation by comparing it to the numerical gradient.
Algorithm:
- Compute the analytical gradient $\mathbf{g}$ using backpropagation
- For each parameter $\theta_i$, compute the numerical gradient:
$$\hat{g}_i = \frac{L(\theta + h\mathbf{e}_i) - L(\theta - h\mathbf{e}_i)}{2h}$$
- Compute the relative error:
$$\text{rel} = \frac{\lVert \mathbf{g} - \hat{\mathbf{g}} \rVert}{\lVert \mathbf{g} \rVert + \lVert \hat{\mathbf{g}} \rVert}$$
Thresholds (rules of thumb):
- $\text{rel} < 10^{-7}$: Great - gradient is likely correct
- $10^{-7}$ to $10^{-4}$: Acceptable for most applications
- $\text{rel} > 10^{-4}$: Bug in gradient - investigate component-by-component
Practical tips:
- Use double precision (float64) during gradient checks, even if training uses float16
- Check on a small random input, not on all training data
- Check each component separately to localize bugs
- Do NOT gradient check with dropout active (stochastic masks break the comparison)
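A minimal gradient checker following the algorithm above (the toy functions `good` and `buggy` are illustrative):

```python
import numpy as np

def grad_check(f, grad_f, theta, h=1e-5):
    """Compare an analytic gradient to centered differences; return relative error."""
    g = grad_f(theta)
    g_num = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        g_num[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return np.linalg.norm(g - g_num) / (np.linalg.norm(g) + np.linalg.norm(g_num))

# A correct gradient passes; a buggy one (missing factor 2) fails loudly.
f = lambda w: np.sum(w ** 2)
good = lambda w: 2 * w
buggy = lambda w: w

w = np.array([0.3, -1.2, 2.0])
print(grad_check(f, good, w))    # far below 1e-7  -> correct
print(grad_check(f, buggy, w))   # ~0.33           -> bug
```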
For AI: Andrej Karpathy's micrograd tutorial uses gradient checking to verify backpropagation at every step. Gradient checking was critical during the early days of deep learning to ensure backprop implementations were correct; today, libraries like PyTorch and JAX are well-tested, but gradient checking remains essential when implementing custom layers or loss functions.
7.4 Limitations and the Case for Automatic Differentiation
Numerical differentiation has three fatal limitations for large models:
- Cost: $2n$ function evaluations for an $n$-parameter model - infeasible for $n \sim 10^9$
- Accuracy: $O(h^2)$ truncation error is often insufficient for sensitive applications (e.g., second-order methods)
- No higher-order: computing Hessian-vector products numerically requires another factor of $n$ evaluations and compounds the error
Automatic differentiation (05) resolves all three:
- Cost: Reverse mode computes the full gradient in a small constant multiple of one model evaluation (same order of cost as a forward pass)
- Accuracy: Exact to machine precision - no truncation error
- Higher-order: Forward-over-reverse AD computes Hessian-vector products exactly, again at a constant multiple of a forward pass
Full treatment: 05-Automatic-Differentiation.
8. Loss Landscape Geometry
8.1 Loss Landscapes for Neural Networks
A neural network's loss $L(\theta)$ defines a high-dimensional surface over parameter space $\theta \in \mathbb{R}^n$. Unlike convex functions (which have a unique global minimum), neural network loss landscapes are:
- Non-convex: Multiple local minima, saddle points, and flat regions
- High-dimensional: Intuitions from 2D and 3D break down; most "local minima" in high dimensions are actually saddle points with at least one negative curvature direction
- Approximately convex near minima: Empirically, well-trained networks sit in wide flat basins where the Hessian has many near-zero eigenvalues and few negative ones
Critical points ($\nabla L(\theta) = \mathbf{0}$) and their classification (using the Hessian - see 02):
| Type | Hessian eigenvalues | Behavior |
|---|---|---|
| Global minimum | All $> 0$ | Best possible parameters |
| Local minimum | All $> 0$ | Training converges here; good generalization if in a wide basin |
| Saddle point | Mixed signs | Gradient descent escapes with noise (SGD) |
| Flat region | Many $\approx 0$ | Plateau; training stalls; warm restarts help |
The Hessian eigenvalue view: The curvature of the loss surface at a minimum is characterized by the Hessian eigenvalues. Wide minima (small eigenvalues) generalize better (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017) - this is the theoretical basis for SAM (Sharpness-Aware Minimization, Foret et al., 2021), which explicitly seeks flat minima.
8.2 Gradient Norm as a Training Diagnostic
The gradient norm $\lVert \nabla L(\theta_t) \rVert$ is one of the most informative training diagnostics:
| Gradient norm behavior | Possible cause |
|---|---|
| Steadily decreasing | Normal convergence |
| Suddenly large (spike) | Batch with unusual data; learning rate too high |
| Near zero early in training | Vanishing gradients (very deep networks without skip connections) |
| Near zero late in training | Near a critical point (minimum or saddle) |
| Persistently large | Divergence; reduce learning rate |
Vanishing gradients: In deep networks without normalization or skip connections, the gradient norm often decays exponentially with depth. If every layer $l$ has weight matrix $W_l$ with $\|W_l\|_2 < 1$, the backpropagated gradient norm is bounded by $\prod_l \|W_l\|_2$, which goes to $0$ exponentially in depth. This motivated: BatchNorm (Ioffe & Szegedy, 2015), LayerNorm (Ba et al., 2016), and residual connections (He et al., 2016).
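The exponential decay is easy to reproduce. A small NumPy sketch (illustrative, with an arbitrary spectral norm of 0.9 per layer): repeatedly multiplying a vector by layer Jacobians whose spectral norm is below 1 shrinks its norm geometrically.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 64, 30

def scaled_jacobian():
    """Random d-by-d 'layer Jacobian', rescaled to spectral norm 0.9 (< 1)."""
    J = rng.standard_normal((d, d))
    return 0.9 * J / np.linalg.norm(J, 2)

g = np.ones(d)   # stand-in for the gradient arriving at the top layer
norms = []
for _ in range(depth):
    g = scaled_jacobian().T @ g   # backprop multiplies by layer Jacobians
    norms.append(np.linalg.norm(g))

# Each multiplication contracts the norm by at most the spectral norm 0.9,
# so ||g_k|| <= 0.9^k * ||g_0|| -- geometric decay with depth.
print(norms[0], norms[-1])
```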
8.3 Gradient Clipping
Problem: In recurrent networks (RNNs, early transformers), gradients can explode - growing exponentially through time - making training unstable.
Solution (Pascanu et al., 2013): Rescale the gradient $g$ whenever its norm exceeds a threshold $c$:

$$g \leftarrow g \cdot \min\left(1, \frac{c}{\|g\|}\right)$$
This preserves the direction of the gradient while bounding the step size. It is not the same as gradient truncation (which clips each component independently and changes the direction).
Effect: Clipping to norm $c$ guarantees the update $-\eta g$ has norm at most $\eta c$, regardless of the curvature of $L$.
Standard values: $c = 1.0$ for transformer training (used in GPT-2, GPT-3, Llama, and most modern LLMs). The PyTorch call is torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).
For AI: Gradient clipping is a near-universal practice in LLM training. Without it, a single pathological batch can send parameters into a numerically unstable region from which training never recovers. The choice of $c$ is empirical - clipping rarely triggers during stable training but provides a safety net during instability events.
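A minimal NumPy sketch of norm clipping (the helper `clip_grad_norm` is ours; it mirrors the behavior of `torch.nn.utils.clip_grad_norm_`, which clips against the global norm across all parameter tensors and returns the pre-clip norm):

```python
import numpy as np

def clip_grad_norm(grads, max_norm):
    """Rescale a list of gradient arrays so the global L2 norm is <= max_norm.

    Preserves the gradient direction (unlike element-wise value clipping)."""
    total_norm = np.sqrt(sum(float(np.sum(g**2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
clipped, pre_norm = clip_grad_norm(grads, max_norm=1.0)
post_norm = np.sqrt(sum(np.sum(g**2) for g in clipped))
print(pre_norm, post_norm)   # 13.0, ~1.0 -- direction preserved, norm capped
```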
8.4 Gradient Accumulation for Large Batch Training
When the effective batch size is larger than what fits in GPU memory, gradient accumulation simulates the large batch:

$$\nabla L_{\text{batch}} = \frac{1}{K} \sum_{k=1}^{K} \nabla L_{\text{micro},k}$$

where $K$ is the number of gradient accumulation steps. The gradient is accumulated across $K$ micro-batches before a single optimizer step.
Mathematical validity: By linearity of the gradient, the accumulated gradient is exactly equal to the gradient computed over the full batch. This holds exactly (not approximately).
For AI: Llama 2 70B used a global batch size of 2048 sequences of 4096 tokens - far exceeding GPU memory for a single forward pass. Gradient accumulation, combined with data-parallel training across multiple GPUs, makes this feasible.
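The linearity claim can be verified directly. A NumPy sketch (illustrative; `mse_grad` is our helper) showing that the average of micro-batch MSE gradients equals the full-batch gradient up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((32, 5))
y = rng.standard_normal(32)
w = rng.standard_normal(5)

def mse_grad(Xb, yb, w):
    """Gradient of (1/m) * ||Xb w - yb||^2 for one (micro-)batch of size m."""
    return 2.0 / len(yb) * Xb.T @ (Xb @ w - yb)

full = mse_grad(X, y, w)     # full-batch gradient in one shot

K = 4
acc = np.zeros_like(w)
for Xb, yb in zip(np.split(X, K), np.split(y, K)):
    acc += mse_grad(Xb, yb, w) / K   # average of K micro-batch gradients

print(np.max(np.abs(full - acc)))   # agreement up to floating-point error
```

Because the loss is a mean over examples and the gradient is linear, the equality is exact in real arithmetic; only rounding differs.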
9. Advanced Topics
9.1 Gâteaux and Fréchet Derivatives
In infinite-dimensional settings (function spaces), the ordinary gradient doesn't apply. Two generalizations are:
Gâteaux derivative (directional derivative in function space):

$$dF(f; h) = \lim_{\epsilon \to 0} \frac{F(f + \epsilon h) - F(f)}{\epsilon}$$

where $F$ is a functional (maps functions to scalars), $f$ is a function, and $h$ is a perturbation function. This is the infinite-dimensional analog of the directional derivative.
Fréchet derivative: When the Gâteaux derivative is linear and continuous in $h$, it can be written as $dF(f; h) = \langle \nabla F(f), h \rangle$ for some function $\nabla F(f)$ - this is the functional gradient.
Applications in ML:
- Variational inference: Optimizing the ELBO over distributions - the gradient is a functional derivative
- Optimal transport: Wasserstein gradient flows define how distributions evolve under gradient descent in the space of probability measures
- Neural tangent kernel (NTK): As network width $\to \infty$, gradient descent on the parameters induces gradient descent in function space with the NTK as inner product
9.2 Wirtinger Derivatives
When optimizing real-valued functions of complex variables - common in signal processing, compressed sensing, and some neural network architectures - the standard complex derivative doesn't apply: a real-valued function of a complex variable is holomorphic only if it is constant, and most practical objectives are not holomorphic.
Wirtinger calculus defines pseudo-derivatives:

$$\frac{\partial}{\partial z} = \frac{1}{2}\left(\frac{\partial}{\partial x} - i \frac{\partial}{\partial y}\right), \qquad \frac{\partial}{\partial \bar{z}} = \frac{1}{2}\left(\frac{\partial}{\partial x} + i \frac{\partial}{\partial y}\right)$$

where $z = x + iy$. For real-valued $f$, gradient descent follows $-2\, \partial f / \partial \bar{z}$ (the Wirtinger gradient).
For AI: Complex-valued neural networks used in audio processing and MRI reconstruction use Wirtinger derivatives. PyTorch supports complex tensors and correctly computes Wirtinger gradients via .backward().
9.3 Gradients on Manifolds and the Natural Gradient
Standard gradient descent treats parameter space as flat Euclidean space. But probability distributions $p_\theta$ form a statistical manifold with a non-Euclidean geometry defined by the Fisher information matrix:

$$F(\theta) = \mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta \log p_\theta(x)\, \nabla_\theta \log p_\theta(x)^\top\right]$$

The natural gradient (Amari, 1998) accounts for this geometry:

$$\tilde{\nabla} L(\theta) = F(\theta)^{-1}\, \nabla L(\theta)$$
Natural gradient descent converges faster than standard gradient descent because it takes steps of equal "distributional size" rather than equal "parameter size."
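A one-dimensional sketch of this effect, under assumptions we choose for illustration (known variance $\sigma^2$, estimating only the mean $\mu$ of a Gaussian, where the Fisher information is $1/\sigma^2$): the natural gradient step with unit step size lands exactly on the MLE, while plain gradient ascent creeps toward it.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 4.0
x = rng.normal(3.0, np.sqrt(sigma2), size=500)

def grad_mu(mu):
    """Average gradient of the Gaussian log-likelihood w.r.t. mu (sigma^2 fixed)."""
    return np.mean(x - mu) / sigma2

fisher = 1.0 / sigma2           # Fisher information for mu in N(mu, sigma^2)

mu = -5.0
mu_nat = mu + (1.0 / fisher) * grad_mu(mu)   # one natural-gradient step

mu_gd = -5.0
for _ in range(100):                          # plain gradient ascent, step 1.0
    mu_gd += 1.0 * grad_mu(mu_gd)

print(mu_nat, mu_gd, np.mean(x))   # natural gradient hits the MLE in one step
```

The factor $F^{-1} = \sigma^2$ exactly cancels the $1/\sigma^2$ in the gradient, which is why the natural step is invariant to the (arbitrary) scale of the parameterization.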
Connection to LLM fine-tuning:
- LoRA (Hu et al., 2022): Constrains the gradient update to a low-rank subspace. The intuition is that the gradient lies in a low-rank manifold, and parameterizing the update as $\Delta W = BA$ is an efficient approximation
- K-FAC (Martens & Grosse, 2015): Approximates the Fisher information as a Kronecker product of smaller matrices, making natural gradient descent tractable for neural networks
- DoRA (Liu et al., 2024): Decomposes weight updates into magnitude and direction components, further refining the LoRA idea
Riemannian gradient: On a manifold with metric tensor $G(x)$, the Riemannian gradient is $\mathrm{grad}\, f(x) = G(x)^{-1} \nabla f(x)$. The Fisher information $F$ plays the role of $G$ on the statistical manifold.
9.4 Jacobian Preview
So far we have discussed gradients of scalar functions $f: \mathbb{R}^n \to \mathbb{R}$. When $f$ is vector-valued, $f: \mathbb{R}^n \to \mathbb{R}^m$, the gradient generalizes to the Jacobian matrix $J \in \mathbb{R}^{m \times n}$:

$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$

The Jacobian is the "matrix of all first-order partial derivatives" - its $i$-th row is the gradient of the $i$-th output component $f_i$.
Preview: Jacobians and Hessians
When $m > 1$, the Jacobian generalizes the gradient and is the fundamental object for backpropagation through vector-valued layers. The Hessian generalizes the second derivative and characterizes curvature, saddle points, and Newton's method.
-> Full treatment: 02 Jacobians and Hessians
10. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Writing $\nabla f$ as a row vector | The gradient is a column vector in $\mathbb{R}^n$; writing it as a row conflates it with the Jacobian (which is $1 \times n$ for scalar $f$) and causes dimension errors in matrix expressions | Always write $\nabla f$ as a column vector; write $\nabla f^\top$ when you need a row |
| 2 | Confusing $\nabla(fg)$ with $(\nabla f)(\nabla g)$ | The product rule applies: $\nabla(fg) = f\, \nabla g + g\, \nabla f$ | Apply the product rule in each differentiation step |
| 3 | Treating $\nabla(x^\top A x) = 2Ax$ when $A$ is not symmetric | This holds only when $A = A^\top$; in general, $\nabla(x^\top A x) = (A + A^\top)x$ | Check symmetry; use the general formula from 4.5 |
| 4 | Using forward differences instead of centered differences for gradient checking | Forward difference has $O(h)$ error; centered has $O(h^2)$; at $h = 10^{-5}$ this is $\sim 10^{-5}$ vs $\sim 10^{-10}$ - a factor of $10^5$ | Always use centered differences for gradient checking |
| 5 | Concluding a point is a local minimum because all partial derivatives are zero | $\nabla f = 0$ is necessary but not sufficient; the point could be a saddle point (e.g., $f(x, y) = x^2 - y^2$ at the origin) | Check the Hessian: if $\nabla^2 f \succ 0$, it's a local minimum (02, 04) |
| 6 | Claiming a function is differentiable because its partial derivatives exist | False - all partials can exist while $f$ is not even continuous (2.4 counterexample) | Differentiability requires more: the linear approximation error must be $o(\lVert h \rVert)$ |
| 7 | Applying gradient descent with a step $\eta > 2/L$ for $L$-smooth functions | This can cause divergence; a safe step size for gradient descent is $\eta = 1/L$, where $L$ is the Lipschitz constant of $\nabla f$ | Check the smoothness constant; use line search or backtracking if unknown |
| 8 | Gradient clipping individual components (element-wise) instead of the norm | Element-wise clipping changes the gradient direction and can slow convergence; norm clipping preserves direction | Use clip_grad_norm_() (norm clipping), not clip_grad_value_() unless you specifically want element-wise clipping |
| 9 | Ignoring the $1/n$ factor in batch gradients vs. single-sample gradients | Learning rate must be retuned when switching conventions; gradient magnitudes scale with $n$ if the loss is summed rather than averaged | Consistently include the $1/n$ factor in the loss; or rescale the learning rate accordingly when changing batch size |
| 10 | Confusing the directional derivative with the gradient projected onto $u$ | The directional derivative $D_u f = \nabla f \cdot u$ requires $\lVert u \rVert = 1$; for non-unit $u$, the result scales with $\lVert u \rVert$ | Normalize $u$ before computing the directional derivative |
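Mistake 4 is easy to demonstrate. A small NumPy sketch (our own test function, chosen arbitrarily) comparing the forward- and centered-difference errors against an analytical gradient at the same step size:

```python
import numpy as np

def f(x):
    return np.sum(np.sin(x) * x**2)

def grad_f(x):   # analytical gradient, by the product rule
    return np.cos(x) * x**2 + 2 * x * np.sin(x)

def num_grad(f, x, h, centered=True):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        if centered:
            g[i] = (f(x + e) - f(x - e)) / (2 * h)   # O(h^2) truncation error
        else:
            g[i] = (f(x + e) - f(x)) / h             # O(h) truncation error
    return g

rng = np.random.default_rng(3)
x = rng.standard_normal(5)
h = 1e-5
err_fwd = np.max(np.abs(num_grad(f, x, h, centered=False) - grad_f(x)))
err_ctr = np.max(np.abs(num_grad(f, x, h, centered=True) - grad_f(x)))
print(err_fwd, err_ctr)   # centered error is orders of magnitude smaller
```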
11. Exercises
Exercise 1 - Partial Derivatives
Let .
(a) Compute , , .
(b) Evaluate .
(c) Verify that the mixed partials and are equal.
(d) For AI: The loss . Find and the critical points.
Exercise 2 - Gradient Computation and Geometry
Let .
(a) Compute and evaluate at , , .
(b) At each of the three points, find the direction of steepest ascent (unit vector).
(c) Find all level curves for . Verify numerically that is perpendicular to the level curve passing through .
(d) Show that for all by completing the square or using the discriminant of the associated quadratic form.
Exercise 3 - Directional Derivatives
Let .
(a) Compute at .
(b) Find the direction that maximizes and the direction that minimizes it. What are the maximum and minimum values?
(c) Find all directions such that .
(d) Compute the directional derivative in direction and in direction .
Exercise 4 - Matrix Calculus from First Principles
Prove the following from the limit definition of the partial derivative, $\frac{\partial f}{\partial x_i}(x) = \lim_{h \to 0} \frac{f(x + h e_i) - f(x)}{h}$:
(a) $\nabla_x (a^\top x) = a$ for fixed $a \in \mathbb{R}^n$.
(b) $\nabla_x (x^\top A x) = (A + A^\top) x$ for fixed $A \in \mathbb{R}^{n \times n}$.
(c) $\nabla_w \|Xw - y\|^2 = 2 X^\top (Xw - y)$ for fixed $X$, $y$.
(d) Show that the gradient of the MSE loss equals zero exactly when $X^\top X w = X^\top y$ - the normal equations of linear regression.
Exercise 5 - MSE Gradient and Linear Regression
Implement gradient descent for linear regression from scratch.
(a) Generate synthetic data: $y = X w_{\text{true}} + \epsilon$, where $X$ has random entries, $w_{\text{true}}$ is a fixed ground-truth vector, and $\epsilon$ is small Gaussian noise.
(b) Implement the MSE gradient $\nabla_w L = \frac{2}{n} X^\top (Xw - y)$.
(c) Run gradient descent for 1000 steps with a fixed learning rate $\eta$. Plot the loss vs. step number.
(d) Compute the analytical solution $w^* = (X^\top X)^{-1} X^\top y$ and compare with gradient descent's solution.
(e) How does the convergence rate change with learning rate $\eta$? Try several values spanning a few orders of magnitude.
Exercise 6 - Gradient Checking Implementation
(a) Implement centered-difference numerical gradient for an arbitrary function $f: \mathbb{R}^n \to \mathbb{R}$.
(b) Test on with random data. Compute the relative error between numerical and analytical gradients.
(c) Plot relative error vs. step size for . Identify the optimal and the crossover between truncation error and floating-point cancellation.
(d) Implement a gradient checker for the logistic regression cross-entropy loss. Use it to verify the gradient formula $\nabla_w L = \frac{1}{n} X^\top (\sigma(Xw) - y)$.
Exercise 7 - Softmax Gradient Derivation
The softmax function takes logits to a probability distribution.
(a) Compute the Jacobian $\frac{\partial p_i}{\partial z_j}$ for all pairs $(i, j)$. Show that it equals $p_i(\delta_{ij} - p_j)$, where $\delta_{ij}$ is the Kronecker delta.
(b) Let $L = -\sum_i y_i \log p_i$ (cross-entropy with one-hot labels $y$). Using the chain rule $\frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial p_i} \frac{\partial p_i}{\partial z_j}$, derive $\nabla_z L = p - y$.
(c) Verify numerically: compute using your formula and using centered differences. Confirm relative error .
(d) Why is the gradient so elegant? Explain its interpretation: what happens when the model is perfectly confident and correct? When it's confident and wrong?
Exercise 8 - Natural Gradient vs. Gradient Descent
Consider fitting a Gaussian model $\mathcal{N}(\mu, \sigma^2)$ to observed data $x_1, \ldots, x_n$ by maximizing the log-likelihood $\ell(\mu, \sigma^2)$.
(a) Compute the ordinary gradient $\nabla_{(\mu, \sigma^2)}\, \ell$.
(b) Compute the Fisher information matrix and show that for a Gaussian it is diagonal, $F = \mathrm{diag}\!\left(\frac{1}{\sigma^2}, \frac{1}{2\sigma^4}\right)$ per observation.
(c) Compute the natural gradient $F^{-1} \nabla \ell$ and compare it to the ordinary gradient.
(d) Simulate: generate 100 points from , run gradient descent and natural gradient descent from . Plot trajectories in space. Which converges faster?
12. Why This Matters for AI (2026 Perspective)
| Concept | AI/ML Impact |
|---|---|
| Gradient | The update signal for every parameter in every neural network; computed billions of times per second during LLM training |
| MSE gradient derivation | Foundation of all linear models; normal equations; least squares regularization; used in LoRA target initialization |
| Cross-entropy gradient | Elegantly simple form used in every transformer output layer; enables efficient backpropagation through softmax |
| Gradient as steepest descent direction | Theoretical guarantee that SGD moves downhill; motivates all descent-based optimization algorithms |
| Gradient perpendicular to level sets | Foundation of constrained optimization; used in gradient projection methods and orthogonal gradient descent for continual learning |
| First-order Taylor approximation | Guarantees gradient descent decreases loss for small steps; basis of convergence proofs for smooth functions |
| Gradient of quadratic forms | The identity $\nabla_x (x^\top A x) = (A + A^\top)x$ is used in kernel methods, Gaussian process regression, and second-order optimization |
| Gradient clipping | Standard practice in all LLM training; prevents training instability from pathological batches; global-norm clipping (typically at 1.0) used in GPT-3, Llama, Mistral |
| Gradient accumulation | Enables large effective batch sizes (millions of tokens) on GPU-memory-constrained systems; essential for frontier model training |
| Numerical gradient checking | Debugging tool for custom layers; verifies correctness of backpropagation implementation before costly training runs |
| Weight decay as L2 gradient | AdamW decouples weight decay from the adaptive learning rate; the gradient of the L2 penalty $\lambda \|w\|^2$ adds $2\lambda w$ to the update |
| Gradient of log-likelihood | Score function $\nabla_\theta \log p_\theta$ used in REINFORCE, PPO, and all policy gradient reinforcement learning methods |
| Natural gradient / Fisher information | K-FAC approximates it for efficient second-order optimization; Shampoo computes Kronecker factors; understanding it motivates AdaGrad, RMSprop, Adam |
| Functional gradients | Wasserstein gradient flows, neural tangent kernels, variational inference - all use gradients in infinite-dimensional function spaces |
Conceptual Bridge
Looking Backward
The gradient builds directly on three prerequisites:
- Single-variable chain rule (04/02): The directional derivative formula $D_u f(a) = \nabla f(a) \cdot u$ was proved using the chain rule along the curve $g(t) = f(a + tu)$. Every property of the gradient ultimately rests on single-variable calculus applied component-by-component.
- First-order Taylor series (04/04): The gradient-as-linear-approximation view is the multivariate Taylor expansion truncated at first order. The remainder is $O(\|h\|^2)$, controlled by the second-order term - the Hessian.
- Vector norms and inner products (02-Linear-Algebra-Basics): The Cauchy-Schwarz inequality was the key step in proving the gradient points in the steepest ascent direction.
Looking Forward
This section is the foundation for three immediate extensions:
- Jacobians and Hessians (02): When $f: \mathbb{R}^n \to \mathbb{R}^m$ is vector-valued, each output $f_i$ has a gradient $\nabla f_i$. Stacking these as rows gives the Jacobian $J \in \mathbb{R}^{m \times n}$. The Hessian stacks the gradients of the gradient components to form a symmetric matrix capturing curvature.
- Chain Rule and Backpropagation (03): Backpropagation is the multivariate chain rule applied to computation graphs. If $z = g(y)$ and $y = f(x)$, then $J_{z \leftarrow x} = J_{z \leftarrow y}\, J_{y \leftarrow x}$. Propagating gradients backward layer by layer uses exactly this matrix product structure.
- Optimality Conditions (04): The condition $\nabla f(x^*) = 0$ is necessary (but not sufficient) for $x^*$ to be a local minimum. Adding the condition $\nabla^2 f(x^*) \succ 0$ (positive definite Hessian) makes it sufficient. Lagrange multipliers extend this to constrained optimization.
POSITION IN CURRICULUM
Ch. 4 - Calculus Fundamentals
01 Limits and Continuity (epsilon-delta, sequential limits)
02 Derivatives (chain rule, activation derivatives)
03 Integration (FTC, probability integrals)
04 Series and Sequences (Taylor series, convergence)
builds on: derivatives, Taylor expansion, vector algebra
Ch. 5 - Multivariate Calculus
01 Partial Derivatives and Gradients <- YOU ARE HERE
- Partial derivatives = derivatives holding others fixed
- Gradient = vector of all partial derivatives
- Directional derivative = nablaf * u
- Gradient of ML losses (MSE, cross-entropy)
- Numerical differentiation, gradient checking
02 Jacobians and Hessians -> matrix generalization
(gradient of vector-valued f; curvature matrix)
03 Chain Rule and Backpropagation -> gradient composition
(Jacobian chain rule; backprop algorithm)
04 Optimality Conditions -> when nablaf = 0 is a minimum
(first/second-order conditions; KKT; Lagrange multipliers)
05 Automatic Differentiation -> computing nablaf exactly
(reverse mode; tapes; PyTorch/JAX)
Ch. 8 - Optimization
(gradient descent variants, Adam, Newton, convergence theory)
<- Back to Multivariate Calculus | Next: Jacobians and Hessians ->
Appendix A: Standard Gradient Reference Table
The following identities are used throughout the curriculum. All assume $x, a, b \in \mathbb{R}^n$ and $A \in \mathbb{R}^{n \times n}$, unless noted.
A.1 Linear Forms
| $f(x)$ | $\nabla f(x)$ | Notes |
|---|---|---|
| $a^\top x$ | $a$ | Constant gradient - loss increases uniformly in direction $a$ |
| $x^\top a$ | $a$ | Same as above (dot product commutes) |
| $x^\top x = \lVert x \rVert_2^2$ | $2x$ | Gradient points radially outward |
| $\lVert x - b \rVert_2^2$ | $2(x - b)$ | Points away from $b$; zero at $x = b$ |
| $\lVert x \rVert_2$ | $x / \lVert x \rVert_2$ | Unit vector in direction of $x$; undefined at $x = 0$ |
A.2 Quadratic Forms
| $f(x)$ | $\nabla f(x)$ | Notes |
|---|---|---|
| $x^\top A x$ | $(A + A^\top) x$ | General $A$ |
| $x^\top A x$ | $2Ax$ | When $A = A^\top$ (symmetric) |
| $x^\top A y$ | $Ay$ | Bilinear: gradient w.r.t. $x$ |
| $x^\top A y$, w.r.t. $y$ | $A^\top x$ | Linear in $x$ |
A.3 Least Squares and Norms
| $f(w)$ | $\nabla f(w)$ | Derivation |
|---|---|---|
| $\lVert Xw - y \rVert^2$ | $2 X^\top (Xw - y)$ | Chain rule: outer derivative $2(Xw - y)$, inner Jacobian $X$ |
| $\lVert Xw - y \rVert^2 + \lambda \lVert w \rVert^2$ | $2 X^\top (Xw - y) + 2 \lambda w$ | Ridge regression gradient |
| $\frac{1}{2} w^\top A w - b^\top w$ | $Aw - b$ ($A$ symmetric) | Set to 0: $Aw = b$ (linear system) |
A.4 Elementwise Functions
For elementwise functions $\phi$ applied coordinate-by-coordinate, $f(x) = \sum_i \phi(x_i)$, the gradient is elementwise: $\nabla f(x) = \phi'(x)$ applied componentwise.
| $f$ | Gradient | ML use |
|---|---|---|
| $\sum_i e^{x_i}$ | $e^x$ (elementwise) | Partition function in softmax |
| $\sum_i \log x_i$ | $1/x$ (elementwise, $x_i > 0$) | Log-likelihood |
| $\sigma(x)$ | $\sigma(x)(1 - \sigma(x))$ | Sigmoid gradient |
| $\max(0, x)$ | $\mathbb{1}[x > 0]$ (indicator) | ReLU gradient (subgradient at 0) |
A.5 Composed Functions
| $f(x)$ | $\nabla f(x)$ | Proof |
|---|---|---|
| $g(Ax + b)$ | $A^\top \nabla g(Ax + b)$ | Chain rule: inner Jacobian is $A$ |
| $h(g(x))$, $g$ scalar-valued | $h'(g(x))\, \nabla g(x)$ | Chain rule on $h \circ g$ |
| $\log \sum_i e^{x_i}$ | $\mathrm{softmax}(x)$ | Special case of above with $h = \log$, $g = \sum_i e^{x_i}$ |
| $\frac{1}{2} \lVert g(x) \rVert^2$ | $J_g(x)^\top g(x)$ | Special case with outer function $\frac{1}{2} \lVert \cdot \rVert^2$ |
Appendix B: Gradient Descent Convergence for Smooth Convex Functions
We prove a basic convergence result establishing the rate for gradient descent on smooth convex functions.
Setting: $f: \mathbb{R}^n \to \mathbb{R}$ is:
- Convex: $f(y) \geq f(x) + \nabla f(x)^\top (y - x)$ for all $x, y$
- $L$-smooth: $\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|$ for all $x, y$
Gradient descent: $x_{k+1} = x_k - \eta\, \nabla f(x_k)$ with constant step $\eta = 1/L$.
Lemma (descent lemma). For $L$-smooth $f$ and step $\eta = 1/L$:

$$f(x_{k+1}) \leq f(x_k) - \frac{1}{2L} \|\nabla f(x_k)\|^2$$

Proof of lemma. By $L$-smoothness and Taylor's theorem:

$$f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{L}{2} \|y - x\|^2$$

Substituting $y = x_k - \frac{1}{L} \nabla f(x_k)$:

$$f(x_{k+1}) \leq f(x_k) - \frac{1}{L} \|\nabla f(x_k)\|^2 + \frac{1}{2L} \|\nabla f(x_k)\|^2 = f(x_k) - \frac{1}{2L} \|\nabla f(x_k)\|^2$$
Convergence proof. Let $x^*$ be a minimizer of $f$ with value $f^* = f(x^*)$. By convexity:

$$f(x_k) - f^* \leq \nabla f(x_k)^\top (x_k - x^*)$$

Also:

$$\|x_{k+1} - x^*\|^2 = \|x_k - x^*\|^2 - \frac{2}{L}\, \nabla f(x_k)^\top (x_k - x^*) + \frac{1}{L^2} \|\nabla f(x_k)\|^2$$

Rearranging:

$$\nabla f(x_k)^\top (x_k - x^*) = \frac{L}{2}\left(\|x_k - x^*\|^2 - \|x_{k+1} - x^*\|^2\right) + \frac{1}{2L} \|\nabla f(x_k)\|^2$$

Combining with the descent lemma, summing over $k = 0, \ldots, T-1$ (the distance terms telescope), and using monotonicity of $f(x_k)$:

$$f(x_T) - f^* \leq \frac{L \|x_0 - x^*\|^2}{2T}$$

This establishes the $O(1/T)$ convergence rate.
Interpretation: After $T$ steps with step size $\eta = 1/L$, the excess loss decreases as $O(1/T)$. To achieve accuracy $\epsilon$, we need $T = O(1/\epsilon)$ steps. For strongly convex functions (Hessian eigenvalues lower-bounded by $\mu > 0$), the rate improves to $O\big((1 - \mu/L)^T\big)$ - geometric ("linear") convergence.
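The two guarantees - monotone descent and the $O(1/T)$ bound - can be checked numerically on a convex quadratic. A NumPy sketch (our own toy problem; the quadratic's exact minimizer comes from solving $Aw = b$):

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((10, 10))
A = M.T @ M + np.eye(10)            # symmetric positive definite => convex f
b = rng.standard_normal(10)

def f(x):  return 0.5 * x @ A @ x - b @ x
def gf(x): return A @ x - b          # gradient; its Lipschitz constant is lambda_max(A)

L = np.linalg.eigvalsh(A).max()      # smoothness constant
eta = 1.0 / L                        # step size used in the proof
x_star = np.linalg.solve(A, b)
f_star = f(x_star)

x = np.zeros(10)
vals = [f(x)]
for _ in range(200):
    x = x - eta * gf(x)
    vals.append(f(x))

gaps = np.array(vals) - f_star       # f(x_k) - f*
# Descent lemma: vals is non-increasing.
# Theorem: gaps[T] <= L * ||x0 - x*||^2 / (2T)  (here x0 = 0)
print(gaps[1], gaps[200])
```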
Appendix C: Gradient of the Attention Mechanism
Computing the gradient of scaled dot-product attention is a fundamental exercise in multivariable calculus applied to transformers.
Setup. Let $Q, K, V \in \mathbb{R}^{n \times d}$ and define:

$$S = \frac{QK^\top}{\sqrt{d}}, \qquad P = \mathrm{softmax}(S), \qquad O = PV$$

where softmax is applied row-wise. The loss $L$ is a scalar depending on $O$.
Gradient w.r.t. $V$:

$$\frac{\partial L}{\partial V} = P^\top \frac{\partial L}{\partial O}$$

Derivation: $O = PV$, so $dO = P\, dV$, giving $\frac{\partial L}{\partial V} = P^\top \frac{\partial L}{\partial O}$.
Gradient w.r.t. $P$:

$$\frac{\partial L}{\partial P} = \frac{\partial L}{\partial O}\, V^\top$$

Gradient w.r.t. $S$ (through softmax):
Let $p_i$ be the $i$-th row of $P$ and $g_i = \partial L / \partial p_i$. The Jacobian of softmax is $\mathrm{diag}(p_i) - p_i p_i^\top$. Therefore:

$$\frac{\partial L}{\partial s_i} = \left(\mathrm{diag}(p_i) - p_i p_i^\top\right) g_i = p_i \odot \left(g_i - (p_i^\top g_i)\, \mathbf{1}\right)$$
This is the "softmax gradient" formula used in FlashAttention (Dao et al., 2022) to avoid materializing the full attention matrix.
Gradient w.r.t. $Q$ and $K$:

$$\frac{\partial L}{\partial Q} = \frac{1}{\sqrt{d}}\, \frac{\partial L}{\partial S}\, K, \qquad \frac{\partial L}{\partial K} = \frac{1}{\sqrt{d}} \left(\frac{\partial L}{\partial S}\right)^\top Q$$
For AI: FlashAttention implements this gradient computation in a memory-efficient way by fusing the forward and backward passes and using tiling to avoid storing the full attention matrix. The gradient formulas above are exactly what FlashAttention computes, just in a hardware-aware order.
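The per-row softmax gradient formula $p \odot (g - (p^\top g)\mathbf{1})$ can be verified against centered differences. A NumPy sketch on a single random row of scores (illustrative dimensions and upstream gradient chosen arbitrarily):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - np.max(s))
    return e / e.sum()

rng = np.random.default_rng(5)
s = rng.standard_normal(6)          # one row of attention scores
dL_dp = rng.standard_normal(6)      # arbitrary upstream gradient dL/dp

p = softmax(s)
# Analytical: dL/ds = (diag(p) - p p^T) dL/dp = p * (dL/dp - p . dL/dp)
dL_ds = p * (dL_dp - p @ dL_dp)

# Numerical check with centered differences on L(s) = dL_dp . softmax(s)
h = 1e-6
num = np.zeros_like(s)
for i in range(len(s)):
    e = np.zeros_like(s); e[i] = h
    num[i] = (dL_dp @ softmax(s + e) - dL_dp @ softmax(s - e)) / (2 * h)

print(np.max(np.abs(dL_ds - num)))   # agreement to numerical precision
```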
Appendix D: Coordinate Descent and Block Coordinate Descent
When computing the full gradient is expensive, coordinate descent updates one variable at a time:

$$x_i \leftarrow \arg\min_{t}\; f(x_1, \ldots, x_{i-1}, t, x_{i+1}, \ldots, x_n)$$

or equivalently, takes a gradient step only along the $i$-th coordinate:

$$x_i \leftarrow x_i - \eta\, \frac{\partial f}{\partial x_i}(x)$$

Convergence: For smooth convex $f$, cyclic coordinate descent (cycling through $i = 1, \ldots, n$) converges at the same rate as gradient descent, up to constants - but each step costs one partial derivative instead of all $n$.
Block coordinate descent: Update groups of variables simultaneously:

$$x_B \leftarrow x_B - \eta\, \nabla_B f(x_B, x_{B^c})$$

where $B$ is a block of indices and $B^c$ is its complement.
For AI (LoRA): LoRA can be seen as block coordinate descent in parameter space. Instead of optimizing all weights, it freezes the pretrained $W_0$ and optimizes the low-rank factors $B$ and $A$ in $W = W_0 + BA$. The gradients w.r.t. $B$ and $A$ are partial derivatives holding all other weights fixed.
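A minimal sketch of cyclic coordinate descent on a convex quadratic (our own toy problem), using the exact one-dimensional minimization $x_i = (b_i - \sum_{j \neq i} A_{ij} x_j)/A_{ii}$ at each step (for quadratics this is Gauss-Seidel):

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.standard_normal((8, 8))
A = M.T @ M + 5 * np.eye(8)     # SPD, well conditioned => fast convergence
b = rng.standard_normal(8)

def f(x): return 0.5 * x @ A @ x - b @ x

x = np.zeros(8)
for sweep in range(300):             # cyclic coordinate descent
    for i in range(8):
        # Exact 1D minimization along coordinate i: solve df/dx_i = 0
        x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]

x_star = np.linalg.solve(A, b)       # true minimizer of the quadratic
print(np.max(np.abs(x - x_star)))    # coordinate descent reaches the minimizer
```

Each inner update touches only row $i$ of $A$, illustrating the "one partial derivative per step" cost structure.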
Appendix E: Numerical Stability in Gradient Computations
E.1 Log-Sum-Exp and Softmax Gradients
Computing $\mathrm{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}$ naively overflows when any $x_i$ is large (in fp32, $e^x$ overflows for $x \gtrsim 88$).
Stable implementation: Subtract the maximum before exponentiating:

$$\mathrm{softmax}(x)_i = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}, \qquad m = \max_j x_j$$

This doesn't change the value (the same factor $e^{-m}$ cancels in numerator and denominator) but keeps all exponentials $\leq 1$.
The gradient $\partial p_i / \partial x_j = p_i(\delta_{ij} - p_j)$ is inherently stable since $p_i \in [0, 1]$.
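A minimal sketch of the max-subtraction trick; `softmax_stable` is our helper name, and the inputs are chosen so that the naive $e^{1000}$ would overflow:

```python
import numpy as np

def softmax_stable(x):
    """Softmax with the max subtracted, so every exponent is <= 0."""
    m = np.max(x)
    e = np.exp(x - m)
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])   # naive exp(1000) overflows to inf
p = softmax_stable(x)
print(p)   # finite probabilities summing to 1
```

Shift invariance ($\mathrm{softmax}(x) = \mathrm{softmax}(x - m)$) is exactly the identity the stable formula exploits.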
E.2 Gradient Vanishing Through Deep Networks
In a depth-$D$ feedforward network without residual connections, the gradient of the loss w.r.t. the first layer's weights passes through the product of all later layer Jacobians:

$$\frac{\partial L}{\partial W_1} \;\text{involves}\; J_D J_{D-1} \cdots J_2$$

If each Jacobian has spectral norm at most $\sigma < 1$ (all singular values $< 1$), then $\|J_D \cdots J_2\| \leq \sigma^{D-1} \to 0$ exponentially. This is vanishing gradients.
Fixes used in practice:
- Residual connections (He et al., 2016): $x_{l+1} = x_l + f_l(x_l)$ - the identity path gives each block Jacobian the form $I + J_{f_l}$, ensuring the gradient passes through each block
- LayerNorm (Ba et al., 2016): Normalizes pre-activations, keeping Jacobian singular values near 1
- Careful initialization (He, Xavier/Glorot): Sets initial weight variance to maintain gradient magnitude across layers
E.3 Mixed Precision and Gradient Scaling
In half-precision (fp16) training, gradients can underflow to zero (the smallest positive normal fp16 value is about $6 \times 10^{-5}$; subnormals reach down to roughly $6 \times 10^{-8}$). Gradient scaling multiplies the loss by a large scale factor $s$ before backpropagation:

$$\nabla_\theta (s \cdot L) = s\, \nabla_\theta L$$

After computing the scaled gradient, we divide by $s$ before the optimizer step. This keeps gradient values in the representable range of fp16. PyTorch's torch.cuda.amp.GradScaler manages $s$ automatically, halving it on overflow and increasing it gradually otherwise.
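The underflow-and-rescue effect can be seen with NumPy's `float16` type alone (a sketch of the arithmetic, not of an actual mixed-precision training loop; the scale factor 1024 is an arbitrary power of two):

```python
import numpy as np

true_grad = np.array([3e-6, -2e-6, 5e-6], dtype=np.float64)

# A value below the smallest fp16 subnormal (~6e-8) flushes to zero:
tiny = np.float16(1e-8)
print(tiny)                      # 0.0 -- the gradient signal is lost

# Loss scaling: multiply by s before casting down (as if before backprop),
# then divide by s in higher precision afterwards.
s = 1024.0
scaled_fp16 = (true_grad * s).astype(np.float16)   # now well within fp16 range
recovered = scaled_fp16.astype(np.float64) / s
print(recovered)                 # close to true_grad, up to fp16 precision
```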
Appendix F: Gradient Flow - Continuous-Time View
The gradient descent iteration $x_{k+1} = x_k - \eta\, \nabla f(x_k)$ can be viewed as the Euler discretization of the gradient flow ODE:

$$\frac{dx}{dt} = -\nabla f(x(t))$$

In continuous time, this is exact gradient descent - the parameter moves in the direction of steepest descent at each instant.
Properties of gradient flow:
- Energy decreases: $\frac{d}{dt} f(x(t)) = \nabla f(x)^\top \dot{x} = -\|\nabla f(x(t))\|^2 \leq 0$
- Fixed points: Equilibria occur at $\nabla f(x) = 0$ - critical points of $f$
- Convergence for convex $f$: For $L$-smooth convex $f$: $f(x(t)) - f^* \leq \frac{\|x_0 - x^*\|^2}{2t}$
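For $f(x) = \frac{1}{2}\|x\|^2$ the gradient flow ODE has the closed-form solution $x(t) = x_0 e^{-t}$, so the Euler/gradient-descent iterates can be compared against it directly. A small NumPy sketch (step size and horizon chosen for illustration):

```python
import numpy as np

def gf(x):
    return x            # gradient of f(x) = 0.5 * ||x||^2

x0 = np.array([2.0, -1.0])
eta, steps = 1e-3, 5000   # integrate gradient flow up to t = eta * steps = 5
x = x0.copy()
for _ in range(steps):
    x = x - eta * gf(x)   # Euler discretization = gradient descent step

exact = x0 * np.exp(-eta * steps)   # exact gradient-flow solution x(t)
print(x, exact)                     # close for small eta; energy decreased
```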
For AI: The Neural Tangent Kernel (NTK) theory (Jacot et al., 2018) analyzes infinitely wide networks by studying gradient flow in function space. The NTK determines convergence speed and generalization in this regime. Understanding gradient flow is the entry point to NTK theory, which explains why overparameterized networks can generalize despite fitting the training data exactly.
Appendix G: Historical Notes on Gradient Methods
Augustin-Louis Cauchy (1847). "Méthode générale pour la résolution des systèmes d'équations simultanées" - the first gradient descent paper. Cauchy showed that moving in the direction of steepest descent decreases a function and proposed an iterative scheme that converges to a local minimum. His motivation was solving systems of equations, not machine learning.
Gradient descent lay dormant for nearly a century as an optimization tool because numerical computation was impractical before computers. The method was rediscovered multiple times:
- 1944, Hestenes & Stiefel: Conjugate gradient - stores memory of previous gradient directions to avoid zigzagging on elongated loss surfaces. Remains the gold standard for solving large linear systems.
- 1951, Robbins & Monro: Stochastic approximation - convergence theory for gradient descent with noisy observations. The theoretical foundation for SGD.
- 1986, Rumelhart, Hinton, Williams: "Learning representations by back-propagating errors" - showed that gradient descent on layered networks via backpropagation trains useful representations. This paper launched the first neural network renaissance.
- 2012-present: Deep learning era - GPUs make gradient descent tractable for billions of parameters. Adam (Kingma & Ba, 2015) adapts the learning rate per-parameter. Gradient clipping (Pascanu et al., 2013) stabilizes RNN/transformer training. AdamW (Loshchilov & Hutter, 2019) decouples weight decay from adaptive learning rates.
The gradient is the same mathematical object Cauchy defined in 1847 - what changed is our ability to compute it (via backpropagation and automatic differentiation) and the scale at which we apply it (billions of parameters, trillions of tokens).
Appendix H: Worked Examples - Gradient Computations in Full Detail
H.1 Gradient of Logistic Regression Loss (Full Derivation)
Setup: $n$ binary examples $(x_i, y_i)$ with $y_i \in \{0, 1\}$. Model: $\hat{p}_i = \sigma(w^\top x_i)$ where $\sigma(z) = 1/(1 + e^{-z})$.
Loss:

$$L(w) = -\frac{1}{n} \sum_{i=1}^n \left[y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i)\right]$$

Step 1: Gradient of a single term. Fix $i$; let $z_i = w^\top x_i$, so $\hat{p}_i = \sigma(z_i)$.
Using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$:

$$\frac{\partial}{\partial z_i} \left[-y_i \log \hat{p}_i - (1 - y_i) \log(1 - \hat{p}_i)\right] = \hat{p}_i - y_i$$

Step 2: Gradient w.r.t. $w$ via chain rule. Since $\partial z_i / \partial w = x_i$:

$$\nabla_w \left[-y_i \log \hat{p}_i - (1 - y_i) \log(1 - \hat{p}_i)\right] = (\hat{p}_i - y_i)\, x_i$$

Step 3: Average over examples.

$$\nabla_w L = \frac{1}{n} \sum_{i=1}^n (\hat{p}_i - y_i)\, x_i = \frac{1}{n} X^\top (\hat{p} - y)$$

Verification: At the optimum $\nabla_w L = 0$, so $X^\top (\hat{p} - y) = 0$ - the predicted probabilities are "feature-uncorrelated" with the residuals.
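The derived formula $\frac{1}{n} X^\top (\hat{p} - y)$ can be checked against centered differences on random data (a sketch with arbitrary dimensions; `loss` and `grad` are our helper names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    # The derived formula: (1/n) X^T (p - y)
    return X.T @ (sigmoid(X @ w) - y) / len(y)

rng = np.random.default_rng(7)
X = rng.standard_normal((40, 3))
y = (rng.random(40) < 0.5).astype(float)
w = rng.standard_normal(3) * 0.1   # small w keeps p away from 0 and 1

# Centered-difference check of the analytical gradient.
h = 1e-6
num = np.zeros(3)
for i in range(3):
    e = np.zeros(3); e[i] = h
    num[i] = (loss(w + e, X, y) - loss(w - e, X, y)) / (2 * h)

print(np.max(np.abs(num - grad(w, X, y))))
```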
H.2 Gradient of the Negative Log-Likelihood for a Gaussian
Setup: $n$ iid observations $x_i \sim \mathcal{N}(\mu, \sigma^2)$. Parameters: $\theta = (\mu, \sigma^2)$.
Negative log-likelihood:

$$\mathrm{NLL}(\mu, \sigma^2) = \frac{n}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2$$

Partial derivatives:

$$\frac{\partial\, \mathrm{NLL}}{\partial \mu} = -\frac{1}{\sigma^2} \sum_i (x_i - \mu), \qquad \frac{\partial\, \mathrm{NLL}}{\partial \sigma^2} = \frac{n}{2\sigma^2} - \frac{1}{2\sigma^4} \sum_i (x_i - \mu)^2$$

Setting to zero (MLE):
From $\partial\, \mathrm{NLL} / \partial \mu = 0$: $\hat{\mu} = \frac{1}{n} \sum_i x_i$ (sample mean).
From $\partial\, \mathrm{NLL} / \partial \sigma^2 = 0$: $\hat{\sigma}^2 = \frac{1}{n} \sum_i (x_i - \hat{\mu})^2$ (biased sample variance).
For AI: Maximum likelihood estimation of neural network parameters is gradient descent on the negative log-likelihood - identical structure to the Gaussian example, but with a deep network defining .
H.3 Gradient of the Frobenius Norm Loss
Setup: Matrix regression - minimize $L(W) = \frac{1}{2} \|XW - Y\|_F^2$ where $X \in \mathbb{R}^{n \times d}$, $W \in \mathbb{R}^{d \times k}$, $Y \in \mathbb{R}^{n \times k}$.
Gradient derivation:

$$\nabla_W L = X^\top (XW - Y)$$

Proof: Let $R = XW - Y$. Then $L = \frac{1}{2} \mathrm{tr}(R^\top R)$. A perturbation $W \to W + \Delta$ gives:

$$L(W + \Delta) = L(W) + \mathrm{tr}\left(\Delta^\top X^\top R\right) + O(\|\Delta\|^2)$$

So $\nabla_W L = X^\top R = X^\top (XW - Y)$.
This is the matrix analog of the vector gradient $X^\top (Xw - y)$ (without the factor of 2 because of the $\frac{1}{2}$). Applied to transformer attention: optimizing the value projection uses this exact gradient structure.
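The matrix gradient $X^\top (XW - Y)$ can be confirmed entry-by-entry with centered differences (a sketch with small arbitrary dimensions):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((10, 4))
Y = rng.standard_normal((10, 3))
W = rng.standard_normal((4, 3))

def L(W):
    return 0.5 * np.sum((X @ W - Y) ** 2)   # 0.5 * Frobenius norm squared

analytic = X.T @ (X @ W - Y)                 # the derived gradient

# Centered-difference check, one (i, j) entry at a time.
h = 1e-6
num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W); E[i, j] = h
        num[i, j] = (L(W + E) - L(W - E)) / (2 * h)

print(np.max(np.abs(num - analytic)))
```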
Appendix I: Gradient Interpretation - A Physics Analogy
The mathematical definition of the gradient acquires deeper meaning through physical analogies.
Temperature field: Consider $T(x, y, z)$ as the temperature at a point in a room. The gradient $\nabla T$ points in the direction of fastest temperature increase. Heat flows in direction $-\nabla T$ (Fourier's law): $q = -k\, \nabla T$, where $q$ is the heat flux vector and $k$ is thermal conductivity.
Gradient descent = steepest descent on a potential:
- Elevation field: $h(x, y)$ = altitude at position $(x, y)$. $-\nabla h$ = direction of steepest downhill. Water flows in direction $-\nabla h$ (gradient flow).
- Electric potential: $\phi$ = electric potential. $E = -\nabla \phi$ = electric field (force per unit charge). Positive charges accelerate in direction $-\nabla \phi$.
- Loss surface: $L(\theta)$ = loss at parameters $\theta$. $-\nabla L$ = direction of maximum loss decrease. Gradient descent moves parameters in direction $-\nabla L$.
The physics insight: In all three cases, the gradient defines a vector field (a vector at every point in space), and dynamics follow the negative gradient. This is the fundamental structure of steepest-descent optimization - borrowed from physics, applied to machine learning.
Potential energy landscape: The loss plays the role of potential energy. Parameters play the role of a particle. Gradient descent is a particle rolling downhill with heavy friction (no momentum). SGD adds noise - like a particle in a bath of random thermal kicks (Langevin dynamics). Adam is like an adaptive friction coefficient that varies per dimension.
Appendix J: Reading Guide and Further References
Core References
-
Nocedal & Wright, Numerical Optimization (2006), Chapters 1-2 - rigorous treatment of gradient descent, step size rules, convergence for smooth functions. The standard reference for optimization in ML.
-
Boyd & Vandenberghe, Convex Optimization (2004), Chapter 9 - gradient descent for convex functions, convergence rates, extensions to non-smooth functions. Freely available online.
-
Goodfellow, Bengio & Courville, Deep Learning (2016), Chapter 4 - matrix calculus for ML practitioners. Gradient and Jacobian conventions used throughout this repo follow GBC.
-
Strang, Introduction to Linear Algebra (2016), Chapter 8 - accessible introduction to gradients and their geometric meaning, from a linear algebra perspective.
-
Amari, Information Geometry and Its Applications (2016) - the mathematical foundation of natural gradient methods, Fisher information, and gradient flows on statistical manifolds.
For AI/ML Practitioners
-
Kingma & Ba, "Adam: A Method for Stochastic Optimization" (2015) - introduces the Adam optimizer; Section 2 discusses the moment estimators in terms of the gradient.
-
Pascanu, Mikolov & Bengio, "On the difficulty of training recurrent neural networks" (2013) - introduces gradient clipping; explains exploding/vanishing gradients mathematically.
-
Hu et al., "LoRA: Low-Rank Adaptation of Large Language Models" (2022) - the key idea is restricting gradient updates to a low-rank subspace.
-
Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention" (2022) - implements attention gradients in a hardware-aware fashion; the gradient formulas in Appendix C are what FlashAttention computes.
-
Foret et al., "Sharpness-Aware Minimization for Efficiently Improving Generalization" (2021) - uses gradient norm and loss landscape geometry to find flat minima; directly applies the concepts in 8.
Online Resources
- Andrej Karpathy's micrograd - a 150-line automatic differentiation engine built from scratch; excellent for understanding how gradients are computed
- Matrix Cookbook - comprehensive reference for matrix calculus identities (freely available as PDF)
- Chris Olah's "Neural Networks, Manifolds, and Topology" - visual intuition for how gradients relate to neural network geometry
Appendix K: Stochastic Gradient Descent - Theory and Practice
K.1 Why Stochastic Gradients Work
Full gradient descent computes $\nabla L(\theta) = \frac{1}{n} \sum_{i=1}^n \nabla \ell_i(\theta)$, which is expensive for large $n$. SGD replaces this with an unbiased estimate using a single randomly sampled example $i$:

$$\mathbb{E}_i\left[\nabla \ell_i(\theta)\right] = \nabla L(\theta)$$

Why it converges: The Robbins-Monro conditions on the step sizes ensure convergence of SGD, even for non-convex functions (to a stationary point):

$$\sum_{t=1}^\infty \eta_t = \infty, \qquad \sum_{t=1}^\infty \eta_t^2 < \infty$$

A schedule satisfying these: $\eta_t = \eta_0 / t$ (harmonic decay). In practice, cosine decay or piecewise constant schedules are used.
Variance of the stochastic gradient:

$$\sigma^2 = \mathbb{E}_i\left\|\nabla \ell_i(\theta) - \nabla L(\theta)\right\|^2$$

With mini-batch size $B$: the mini-batch gradient has variance $\sigma^2 / B$ - variance decreases as $1/B$. This is why larger batches have smaller gradient noise but don't always converge better (the noise helps escape sharp minima).
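The $1/B$ variance scaling is straightforward to observe empirically. A NumPy sketch using scalar "per-example gradients" with unit variance (a statistical illustration, not a training loop):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 20000
g = rng.standard_normal(n)   # per-example gradient values, variance ~ 1

def batch_grad_variance(B, trials=2000):
    """Empirical variance of the mean of B randomly sampled per-example grads."""
    idx = rng.integers(0, n, size=(trials, B))
    return np.var(g[idx].mean(axis=1))

v1, v16, v64 = batch_grad_variance(1), batch_grad_variance(16), batch_grad_variance(64)
print(v1, v16, v64)   # approximately 1, 1/16, 1/64
```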
K.2 Gradient Noise as Regularization
The noise in SGD gradients acts as an implicit regularizer. Formally:

$$\theta_{t+1} = \theta_t - \eta\left(\nabla L(\theta_t) + \xi_t\right), \qquad \mathbb{E}[\xi_t] = 0$$

where $\xi_t$ is the gradient noise. This is equivalent to gradient flow with a stochastic forcing term - a Langevin dynamics:

$$d\theta = -\nabla L(\theta)\, dt + \sqrt{2T}\, dW$$

where $W$ is Brownian motion. The stationary distribution of this SDE is approximately $p(\theta) \propto e^{-L(\theta)/T}$ - a Gibbs distribution at "temperature" $T$, which scales with the learning rate and the gradient noise. Lower learning rate = lower temperature = sharper distribution concentrated near minima.
Practical implication: SGD with large learning rate explores the loss landscape (high temperature), while small learning rate exploits (low temperature). Warmup schedules start with a small learning rate (before gradients are reliable) then increase to full rate, then decay.
K.3 Mini-Batch Gradient and Gradient Variance Reduction
| Method | Gradient estimate | Variance | Cost per step |
|---|---|---|---|
| Full batch | $\frac{1}{n} \sum_{i=1}^n \nabla \ell_i$ | 0 | $O(n)$ |
| SGD | $\nabla \ell_i$ (random $i$) | $\sigma^2$ | $O(1)$ |
| Mini-batch | $\frac{1}{B} \sum_{i \in \mathcal{B}} \nabla \ell_i$ | $\sigma^2 / B$ | $O(B)$ |
| SVRG | $\nabla \ell_i(\theta) - \nabla \ell_i(\tilde{\theta}) + \nabla L(\tilde{\theta})$ | shrinks as $\theta \to \tilde{\theta}$ | $O(1)$ + anchor cost |
For LLMs: Gradient noise in transformer training is beneficial - it helps the model escape sharp minima and generalize better (Jiang et al., 2020). The typical LLM training batch (1M-4M tokens) is large enough to reduce gradient variance substantially while retaining some beneficial noise.
Appendix L: Gradient-Based Saliency and Interpretability
L.1 Gradient Saliency Maps
The simplest form of neural network interpretability uses the gradient of the output w.r.t. the input:

$$S_{ij} = \left|\frac{\partial f(x)}{\partial x_{ij}}\right|$$

where $x_{ij}$ is the $j$-th feature of the $i$-th input token (or pixel). Large $S_{ij}$ means small changes in $x_{ij}$ cause large changes in the output - the model is "sensitive" to that feature.
For language models: the gradient norm w.r.t. token $i$'s embedding measures how much that embedding affects the predicted probability. This gives a token-level attribution - which words most influence the model's prediction.
Limitations:
- Gradient saliency is local (valid only near the current input)
- Can be adversarially fooled - the gradient can be large for imperceptible perturbations
- Does not account for saturation (sigmoid/tanh can saturate, making gradients near zero even for important features)
L.2 Integrated Gradients (Sundararajan et al., 2017)
To fix saturation issues, integrate the gradient along the straight path from a baseline $x'$ to the input $x$:

$$\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial f\big(x' + \alpha (x - x')\big)}{\partial x_i}\, d\alpha$$

This satisfies the completeness axiom: $\sum_i \mathrm{IG}_i(x) = f(x) - f(x')$ (the attributions sum to the prediction difference from the baseline).
Implementation: Approximate the integral with a Riemann sum over $m$ steps. In practice, one evaluates $m$ interpolations between $x'$ and $x$, computing gradients at each.
For AI: Integrated gradients is the standard attribution method for language models (used in LIME, SHAP extensions, and many interpretability tools). Understanding that it's an integral of partial derivatives connects it directly to the material in this section.
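A minimal sketch of the Riemann-sum implementation on a toy differentiable model (our own `integrated_gradients` helper and a hand-written `tanh` model, not a real network or library API); the completeness axiom provides a built-in correctness check:

```python
import numpy as np

def integrated_gradients(f, grad_f, x, baseline, m=200):
    """Riemann-sum (midpoint rule, m steps) approximation of integrated
    gradients along the straight path from `baseline` to `x`."""
    alphas = (np.arange(m) + 0.5) / m
    total = np.zeros_like(x)
    for a in alphas:
        total += grad_f(baseline + a * (x - baseline))
    return (x - baseline) * total / m

# Toy model: f(x) = tanh(w . x), with its analytical gradient.
w = np.array([1.0, -2.0, 0.5])
f = lambda x: np.tanh(w @ x)
grad_f = lambda x: (1 - np.tanh(w @ x) ** 2) * w

x = np.array([0.3, 0.1, -0.7])
baseline = np.zeros(3)
ig = integrated_gradients(f, grad_f, x, baseline)

# Completeness axiom: attributions sum to f(x) - f(baseline).
print(ig.sum(), f(x) - f(baseline))
```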
Appendix M: Exercises with Full Solutions
M.1 Worked Solution - Exercise 2(c)
Problem: Verify numerically that $\nabla f$ is perpendicular to the level curve of $f(x, y) = x^2 + y^2$ through $(1, 1)$.
Solution:
Step 1: Compute $f(1, 1) = 2$. The level curve is $\{(x, y) : x^2 + y^2 = 2\}$.
Step 2: Gradient at $(1, 1)$: $\nabla f(x, y) = (2x, 2y)$, so $\nabla f(1, 1) = (2, 2)$.
Step 3: Parametrize the level curve near $(1, 1)$. By implicit differentiation: $2x + 2y\,\frac{dy}{dx} = 0$, so $\frac{dy}{dx} = -\frac{x}{y} = -1$ at $(1, 1)$. So the tangent direction is $(1, -1)$ (normalized: $(\tfrac{1}{\sqrt{2}}, -\tfrac{1}{\sqrt{2}})$).
Step 4: Check perpendicularity: $\nabla f(1, 1) \cdot (1, -1) = 2 \cdot 1 + 2 \cdot (-1) = 0$. Perpendicular.
Numerical verification: Sample points near $(1, 1)$ along the tangent direction: try $(1 + h, 1 - h)$ for small $h$. We have $f(1 + h, 1 - h) = 2 + 2h^2 = 2 + O(h^2)$ for small $h$. This confirms the tangent direction lies along the level curve to first order.
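The check can be run directly; this minimal sketch assumes the concrete instance $f(x, y) = x^2 + y^2$ through the point $(1, 1)$.

```python
import numpy as np

# Perpendicularity of the gradient to the level curve of f(x, y) = x^2 + y^2 at (1, 1).
def f(x, y):
    return x**2 + y**2

grad = np.array([2.0, 2.0])        # nabla f(1, 1) = (2x, 2y) evaluated at (1, 1)
tangent = np.array([1.0, -1.0])    # from implicit differentiation: dy/dx = -x/y = -1

dot = grad @ tangent               # should vanish: gradient is perpendicular to tangent

# Moving along the tangent leaves the level set only to second order:
h = 1e-4
level_drift = f(1 + h, 1 - h) - f(1.0, 1.0)   # equals 2h^2, i.e. O(h^2)
```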
M.2 Worked Solution - Key MSE Formula
Problem: Derive $\nabla_{\mathbf{w}} \tfrac{1}{2}\lVert X\mathbf{w} - \mathbf{y} \rVert^2 = X^\top (X\mathbf{w} - \mathbf{y})$ step by step.
Solution:
Component-wise: Write $L(\mathbf{w}) = \tfrac{1}{2}\sum_{i=1}^n (\mathbf{x}_i^\top \mathbf{w} - y_i)^2$, where $\mathbf{x}_i^\top$ is the $i$-th row of $X$. The $j$-th component of $\nabla_{\mathbf{w}} L$:

$$\frac{\partial L}{\partial w_j} = \sum_{i=1}^n (\mathbf{x}_i^\top \mathbf{w} - y_i)\, x_{ij}$$

In matrix form: $\frac{\partial L}{\partial w_j} = \big(X^\top (X\mathbf{w} - \mathbf{y})\big)_j$, so $\nabla_{\mathbf{w}} L = X^\top (X\mathbf{w} - \mathbf{y})$.
Setting to zero: $X^\top X \mathbf{w} = X^\top \mathbf{y}$ - the normal equations. When $X^\top X$ is invertible: $\mathbf{w}^* = (X^\top X)^{-1} X^\top \mathbf{y}$.
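The derivation can be verified numerically: the analytic gradient against centered differences, and the normal-equations solution against the zero-gradient condition. The data here is random and purely illustrative.

```python
import numpy as np

# Verify grad_w (1/2)||Xw - y||^2 = X^T (Xw - y), then check the normal equations.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
y = rng.normal(size=20)
w = rng.normal(size=4)

def loss(w):
    r = X @ w - y
    return 0.5 * r @ r

analytic = X.T @ (X @ w - y)

# Centered-difference gradient check, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (loss(w + eps * np.eye(4)[j]) - loss(w - eps * np.eye(4)[j])) / (2 * eps)
    for j in range(4)
])

# Normal equations: the minimizer has zero gradient
w_star = np.linalg.solve(X.T @ X, X.T @ y)
grad_at_opt = X.T @ (X @ w_star - y)
```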
Appendix N: Connection Map - Gradient to Modern AI
GRADIENT $\nabla f$: CONNECTIONS TO MODERN AI SYSTEMS (2026)
FOUNDATIONAL ML ALGORITHMS

| Concept | Modern AI connection |
|---|---|
| $\partial f/\partial x_i$ (partial derivative) | Backpropagation (03) |
| $\nabla f \in \mathbb{R}^n$ (gradient) | Gradient descent / Adam / AdamW |
| $D_{\mathbf{u}} f$ (directional derivative) | Projected gradient descent |
| $\nabla f \perp$ level set | Orthogonal gradient descent (continual learning) |
| Linear approximation $f(\mathbf{a} + \boldsymbol{\delta}) \approx f(\mathbf{a}) + \nabla f(\mathbf{a})^\top \boldsymbol{\delta}$ | $L$-smoothness condition (convergence proofs) |
| $\nabla f = \mathbf{0}$ at extremum | Optimality conditions -> KKT (04) |

SPECIFIC ML SYSTEMS

| Concept | Modern AI connection |
|---|---|
| $\nabla_{\mathbf{w}}(\text{cross-entropy}) = \mathbf{p} - \mathbf{y}$ | Softmax output gradient (every LLM) |
| $\nabla_{\mathbf{w}} \lVert X\mathbf{w} - \mathbf{y} \rVert^2$ | Linear regression, LoRA initialization |
| Gradient clipping | GPT-3, Llama, Mistral training stability |
| Gradient accumulation | LLM pretraining (effective batch = millions of tokens) |
| Fisher $F = \mathbb{E}[\nabla \log p\, (\nabla \log p)^\top]$ | Natural gradient, K-FAC, second-order methods |
| $\partial L/\partial\,\text{input}$ (saliency) | Mechanistic interpretability, activation patching |
| Integrated gradients | LIME, SHAP, LLM attribution |
| Functional gradient | Neural Tangent Kernel, Wasserstein gradient flows |
Appendix O: Multivariate Calculus Identities Quick Reference
O.1 Rules for Computing Partial Derivatives
| Rule | Formula | Notes |
|---|---|---|
| Sum rule | $\frac{\partial}{\partial x}(f + g) = \frac{\partial f}{\partial x} + \frac{\partial g}{\partial x}$ | Always valid |
| Product rule | $\frac{\partial}{\partial x}(fg) = \frac{\partial f}{\partial x}\,g + f\,\frac{\partial g}{\partial x}$ | Both factors can depend on $x$ |
| Quotient rule | $\frac{\partial}{\partial x}\!\left(\frac{f}{g}\right) = \frac{\frac{\partial f}{\partial x}\,g - f\,\frac{\partial g}{\partial x}}{g^2}$ | $g \neq 0$ required |
| Chain rule | $\frac{\partial}{\partial x}\, h(g(x, y)) = h'(g(x, y))\,\frac{\partial g}{\partial x}$ | Composite function |
| Mixed chain rule | $\frac{d}{dt} f(x(t), y(t)) = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}$ | $x = x(t)$, $y = y(t)$ |
O.2 Gradient Operator Properties
The gradient operator $\nabla$ (applied to scalar fields) is linear:

$$\nabla(\alpha f + \beta g) = \alpha \nabla f + \beta \nabla g$$

Product rule for gradients:

$$\nabla(fg) = f \nabla g + g \nabla f$$

Chain rule for gradients (scalar outer function $h : \mathbb{R} \to \mathbb{R}$):

$$\nabla (h \circ f)(\mathbf{x}) = h'(f(\mathbf{x}))\, \nabla f(\mathbf{x})$$

For composed functions $f : \mathbb{R}^n \to \mathbb{R}^m$, $g : \mathbb{R}^m \to \mathbb{R}$:

$$\nabla (g \circ f)(\mathbf{x}) = J_f(\mathbf{x})^\top\, \nabla g(f(\mathbf{x}))$$
This is the full multivariate chain rule - proved in 03.
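The composed-function rule can be checked numerically. The functions $f$ and $g$ below are toy choices made for illustration; the test compares $J_f^\top \nabla g$ against a centered-difference gradient of $g \circ f$.

```python
import numpy as np

# Verify grad(g o f)(x) = J_f(x)^T grad_g(f(x)) for toy f: R^2 -> R^2, g: R^2 -> R.
def f(x):                       # f(x1, x2) = (x1*x2, x1 + x2^2)
    return np.array([x[0] * x[1], x[0] + x[1] ** 2])

def J_f(x):                     # Jacobian of f (rows = outputs, cols = inputs)
    return np.array([[x[1], x[0]],
                     [1.0, 2 * x[1]]])

def g(u):                       # g(u1, u2) = sin(u1) + u2^2
    return np.sin(u[0]) + u[1] ** 2

def grad_g(u):
    return np.array([np.cos(u[0]), 2 * u[1]])

x = np.array([0.3, -0.7])
analytic = J_f(x).T @ grad_g(f(x))

eps = 1e-6
numeric = np.array([
    (g(f(x + eps * np.eye(2)[i])) - g(f(x - eps * np.eye(2)[i]))) / (2 * eps)
    for i in range(2)
])
```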
O.3 Gradient in Different Coordinate Systems
While we always work in Cartesian coordinates in this section, the gradient takes different forms in other systems:
Polar coordinates $(r, \theta)$:

$$\nabla f = \frac{\partial f}{\partial r}\,\hat{\mathbf{e}}_r + \frac{1}{r}\frac{\partial f}{\partial \theta}\,\hat{\mathbf{e}}_\theta$$

Spherical coordinates $(r, \theta, \phi)$, with $\theta$ the polar angle:

$$\nabla f = \frac{\partial f}{\partial r}\,\hat{\mathbf{e}}_r + \frac{1}{r}\frac{\partial f}{\partial \theta}\,\hat{\mathbf{e}}_\theta + \frac{1}{r \sin\theta}\frac{\partial f}{\partial \phi}\,\hat{\mathbf{e}}_\phi$$
For ML: Riemannian optimization on manifolds (e.g., Stiefel manifold for orthogonal weight matrices, SPD manifold for covariance matrices) uses a metric-aware gradient - the Riemannian gradient - which is the projection of the Euclidean gradient onto the tangent space of the manifold.
Appendix P: Common Partial Derivative Computations in ML Practice
P.1 Attention Score Gradient (Simplified)
For a single query-key pair, the attention logit is $s = \mathbf{q}^\top \mathbf{k} / \sqrt{d_k}$. Gradients:

$$\frac{\partial s}{\partial \mathbf{q}} = \frac{\mathbf{k}}{\sqrt{d_k}}, \qquad \frac{\partial s}{\partial \mathbf{k}} = \frac{\mathbf{q}}{\sqrt{d_k}}$$
These are the update signals that drive query and key projections to produce higher/lower attention scores.
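A quick centered-difference check of the query gradient (the key gradient is symmetric). The vectors are random and illustrative.

```python
import numpy as np

# Check d(q.k / sqrt(d_k))/dq = k / sqrt(d_k) by centered differences.
rng = np.random.default_rng(1)
d_k = 8
q, k = rng.normal(size=d_k), rng.normal(size=d_k)

def logit(q, k):
    return q @ k / np.sqrt(d_k)

grad_q = k / np.sqrt(d_k)        # analytic gradient w.r.t. the query

eps = 1e-6
numeric_q = np.array([
    (logit(q + eps * np.eye(d_k)[i], k) - logit(q - eps * np.eye(d_k)[i], k)) / (2 * eps)
    for i in range(d_k)
])
```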
P.2 Layer Normalization Gradient
LayerNorm normalizes: $\hat{x}_i = \dfrac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$, with $\mu = \frac{1}{n}\sum_i x_i$ and $\sigma^2 = \frac{1}{n}\sum_i (x_i - \mu)^2$, then scales and shifts: $y_i = \gamma_i \hat{x}_i + \beta_i$.
Gradients w.r.t. the learnable parameters $\gamma$ and $\beta$:

$$\frac{\partial L}{\partial \gamma_i} = \frac{\partial L}{\partial y_i}\,\hat{x}_i, \qquad \frac{\partial L}{\partial \beta_i} = \frac{\partial L}{\partial y_i}$$

The gradient w.r.t. $\mathbf{x}$ (for backpropagation through LayerNorm) involves the partial derivatives of $\hat{x}_i$ w.r.t. $x_j$, which requires careful treatment because $\mu$ and $\sigma^2$ both depend on $\mathbf{x}$ - this is a full multivariate chain rule computation covered in 03.
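The parameter gradients can be checked with centered differences. The upstream gradient `c` stands in for $\partial L/\partial \mathbf{y}$; all values here are random and illustrative.

```python
import numpy as np

# Check dL/dgamma_i = (dL/dy_i) * x_hat_i and dL/dbeta_i = dL/dy_i for LayerNorm,
# using the scalar loss L = <c, y> so that dL/dy = c exactly.
rng = np.random.default_rng(2)
n = 6
x = rng.normal(size=n)
gamma, beta = rng.normal(size=n), rng.normal(size=n)
c = rng.normal(size=n)                 # stand-in upstream gradient dL/dy
eps_ln = 1e-5                          # LayerNorm stabilizer

def layernorm_loss(gamma, beta):
    x_hat = (x - x.mean()) / np.sqrt(x.var() + eps_ln)
    y = gamma * x_hat + beta
    return c @ y

x_hat = (x - x.mean()) / np.sqrt(x.var() + eps_ln)
grad_gamma = c * x_hat                 # analytic dL/dgamma
grad_beta = c.copy()                   # analytic dL/dbeta

h = 1e-6
numeric_gamma = np.array([
    (layernorm_loss(gamma + h * np.eye(n)[i], beta)
     - layernorm_loss(gamma - h * np.eye(n)[i], beta)) / (2 * h)
    for i in range(n)
])
numeric_beta = np.array([
    (layernorm_loss(gamma, beta + h * np.eye(n)[i])
     - layernorm_loss(gamma, beta - h * np.eye(n)[i])) / (2 * h)
    for i in range(n)
])
```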
P.3 Embedding Table Gradient
For a language model with vocabulary size $V$ and embedding matrix $E \in \mathbb{R}^{V \times d}$, the gradient $\partial L/\partial E$ is sparse - only the rows corresponding to tokens that appeared in the batch have nonzero gradients. Specifically:

$$\frac{\partial L}{\partial E_{v,:}} = \sum_{t \,:\, \text{token}_t = v} \frac{\partial L}{\partial \mathbf{e}_t}$$

where $\mathbf{e}_t$ is the embedding of token $t$. This sparsity is exploited by embedding optimizers that only update rows corresponding to tokens in the current batch.
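The row-sparse accumulation can be sketched with a scatter-add. The tiny vocabulary, token sequence, and unit per-position gradients are illustrative; `np.add.at` handles the repeated index correctly (a plain fancy-index assignment would not accumulate).

```python
import numpy as np

# Row-sparse embedding gradient: dL/dE[v] = sum over positions t with token_t = v of dL/de_t.
V, d = 10, 4                           # tiny vocabulary and embedding dim (illustrative)
tokens = np.array([3, 7, 3])           # token 3 appears twice -> its row accumulates
grads_per_position = np.ones((3, d))   # stand-in for dL/de_t at each position

grad_E = np.zeros((V, d))
np.add.at(grad_E, tokens, grads_per_position)   # unbuffered scatter-add into touched rows
```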
P.4 Weight Tying Gradient
When the input embedding and output projection share the same weight matrix $W$ (common in transformer language models), the gradient of $L$ w.r.t. $W$ receives contributions from both uses:

$$\frac{\partial L}{\partial W} = \left.\frac{\partial L}{\partial W}\right|_{\text{embedding}} + \left.\frac{\partial L}{\partial W}\right|_{\text{output projection}}$$
This is the weight-sharing gradient accumulation rule (6.4) applied to the input-output embedding tying used in GPT-2 and many transformer models.
Appendix Q: Visualization Gallery - Gradient Fields
The following descriptions correspond to plots generated in theory.ipynb.
Plot 1 - Gradient field of a paraboloid: $f(x, y) = x^2 + y^2$. Arrows at a grid of points show $\nabla f = (2x, 2y)$. All arrows point radially outward from the minimum at the origin, with arrow length proportional to $\lVert \nabla f \rVert$.
Plot 2 - Gradient and level curves: Contour lines overlaid with gradient arrows. Visually confirms $\nabla f \perp$ level curves at every point.
Plot 3 - Saddle point gradient field: $f(x, y) = x^2 - y^2$. The gradient points outward in $x$ and inward in $y$. At the origin, $\nabla f = \mathbf{0}$ - a critical point that is neither a local min nor max.
Plot 4 - Loss landscape and gradient descent trajectory: A non-convex 2D function (e.g., Beale function or Rosenbrock) with gradient descent path from several starting points. Shows convergence to different minima depending on initialization.
Plot 5 - Gradient norm as a function of training step: From a simple neural network training run, shows the gradient norm decreasing as training progresses, with occasional spikes.
Appendix R: Self-Assessment Checklist
After studying this section, verify you can:
Mechanics:
- Compute partial derivatives of polynomial, exponential, and logarithmic functions
- Write down the gradient vector for a given function
- Compute directional derivatives using the formula $D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}$
- Find the directions of maximum/minimum directional derivative
Theory:
- Prove that the gradient points in the direction of steepest ascent (from Cauchy-Schwarz)
- Prove that the gradient is perpendicular to level sets (from the curve argument)
- Explain why partial differentiability does not imply differentiability (with a counterexample)
- State and apply Clairaut's theorem on equality of mixed partials
Applications:
- Derive the gradient of the MSE loss
- Derive the gradient of the cross-entropy loss for logistic regression
- Implement gradient descent for a simple ML problem
- Implement gradient checking using centered differences
- Explain gradient clipping, gradient accumulation, and their role in LLM training
Advanced:
- State what the Fisher information matrix is and how it relates to the natural gradient
- Explain why gradient descent converges at rate $O(1/k)$ for $L$-smooth convex functions
- Describe how LoRA restricts gradient updates to a low-rank subspace
Appendix S: Notation Summary for This Section
| Symbol | Meaning | Convention |
|---|---|---|
| $f$, $L$ | Scalar field (loss function) | Lowercase $f$, uppercase $L$ for loss |
| $\mathbf{x} \in \mathbb{R}^n$ | Input vector (column) | Bold lowercase |
| $\partial f/\partial x_i$ | Partial derivative w.r.t. $x_i$ | $\partial$ (curly d), not $d$ |
| $\nabla f$, $\nabla_{\mathbf{x}} f$ | Gradient (column vector) | Bold variable as subscript; result is column |
| $\nabla_{\boldsymbol{\theta}} L$ | Gradient of loss w.r.t. parameters | Always column; rows = parameter dimensions |
| $D_{\mathbf{u}} f$ | Directional derivative in direction $\mathbf{u}$ | Requires $\lVert \mathbf{u} \rVert = 1$ |
| $H_f$ | Hessian (second-order partial derivatives) | Symmetric for $C^2$ functions; full treatment in 02 |
| $J_f$ | Jacobian (gradient of vector-valued $f$) | Full treatment in 02 |
| $\{\mathbf{x} : f(\mathbf{x}) = c\}$ | Level set | $(n-1)$-dimensional hypersurface |
| $\eta$ | Learning rate | Always positive; $\eta < 2/L$ for guaranteed descent |
| $\boldsymbol{\theta}$ | All model parameters (bold, lowercase) | Follows NOTATION_GUIDE.md 7 |
| $\mathcal{L}$ | Loss function | Calligraphic $\mathcal{L}$ reserved for loss |
| $F$ | Fisher information matrix | PSD matrix; used in natural gradient |
Appendix T: Extended Worked Examples for ML Practitioners
T.1 Gradient Tape Mental Model
Before automatic differentiation (05), practitioners had to derive gradients by hand. The "gradient tape" mental model - tracking how each computation depends on parameters - is still valuable for debugging.
Example: One-layer network forward pass.
Given $W \in \mathbb{R}^{k \times d}$, $\mathbf{b} \in \mathbb{R}^k$, $\mathbf{x} \in \mathbb{R}^d$, $\mathbf{y}_{\text{true}} \in \mathbb{R}^k$:
Forward:
    z = W*x + b                     (pre-activation, shape k)
    a = sigma(z)                    (post-activation, shape k)
    L = 0.5 * ||a - y_true||^2      (MSE loss, scalar)
Backward (right to left):
    ∂L/∂a = a - y_true              (shape k)
    ∂L/∂z = (∂L/∂a) ⊙ sigma'(z)     (elementwise, shape k)
    ∂L/∂W = (∂L/∂z) * x^T           (outer product, shape k×d)
    ∂L/∂b = ∂L/∂z                   (shape k)
    ∂L/∂x = W^T * (∂L/∂z)           (shape d, for next layer)
Every line is a gradient formula from this section:
- $\partial L/\partial \mathbf{a} = \mathbf{a} - \mathbf{y}_{\text{true}}$ - gradient of MSE (6.1)
- The $\sigma'(\mathbf{z})$ step - chain rule, elementwise (3.2, 5.2)
- $\partial L/\partial W = (\partial L/\partial \mathbf{z})\,\mathbf{x}^\top$ - gradient of a bilinear form (4.5)
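The gradient-tape computation above can be run directly in NumPy; this sketch assumes $\sigma = \tanh$ and $L = \tfrac{1}{2}\lVert \mathbf{a} - \mathbf{y}_{\text{true}} \rVert^2$, and spot-checks one entry of $\partial L/\partial W$ by centered differences.

```python
import numpy as np

# One-layer forward/backward pass with sigma = tanh, L = 0.5 ||a - y_true||^2.
rng = np.random.default_rng(3)
d, k = 5, 3
W, b = rng.normal(size=(k, d)), rng.normal(size=k)
x, y_true = rng.normal(size=d), rng.normal(size=k)

def forward(W):
    z = W @ x + b
    a = np.tanh(z)
    return 0.5 * np.sum((a - y_true) ** 2), z, a

L, z, a = forward(W)
dL_da = a - y_true                    # gradient of MSE
dL_dz = dL_da * (1 - np.tanh(z) ** 2) # chain rule, elementwise (tanh' = 1 - tanh^2)
dL_dW = np.outer(dL_dz, x)            # outer product, shape (k, d)
dL_db = dL_dz
dL_dx = W.T @ dL_dz                   # passed to the previous layer

# Spot-check one entry of dL/dW by centered differences
eps = 1e-6
E = np.zeros_like(W); E[1, 2] = eps
numeric = (forward(W + E)[0] - forward(W - E)[0]) / (2 * eps)
```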
T.2 Softmax Temperature Gradient
In language model decoding, the temperature-scaled softmax is:

$$p_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$$

Gradient w.r.t. $T$:
Let $u_i = z_i/T$, so $\partial u_j/\partial T = -z_j/T^2$. By the chain rule:

$$\frac{\partial p_i}{\partial T} = \sum_j \frac{\partial p_i}{\partial u_j}\,\frac{\partial u_j}{\partial T}$$

Using the softmax Jacobian $\partial p_i/\partial u_j = p_i(\delta_{ij} - p_j)$:

$$\frac{\partial p_i}{\partial T} = -\frac{1}{T^2}\sum_j p_i(\delta_{ij} - p_j)\,z_j = -\frac{p_i}{T^2}\Big(z_i - \sum_j p_j z_j\Big)$$

Interpretation: The temperature gradient is negative when $z_i > \sum_j p_j z_j$ (the chosen token has an above-average logit). Increasing $T$ flattens the distribution; decreasing $T$ sharpens it - this is reflected in the gradient sign.
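The closed form can be verified against a centered-difference derivative in $T$; the logits and temperature below are arbitrary illustrative values.

```python
import numpy as np

# Verify dp_i/dT = -(p_i / T^2)(z_i - sum_j p_j z_j) by centered differences.
z = np.array([2.0, 0.5, -1.0])
T = 0.7

def softmax_T(T):
    u = z / T
    e = np.exp(u - u.max())            # shift for numerical stability
    return e / e.sum()

p = softmax_T(T)
analytic = -(p / T**2) * (z - p @ z)   # closed-form temperature gradient

h = 1e-6
numeric = (softmax_T(T + h) - softmax_T(T - h)) / (2 * h)
```

Note that the components of `analytic` sum to zero: probabilities always sum to 1, so the total probability mass cannot change with $T$.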
T.3 Gradient w.r.t. Embedding Lookup
For input tokens $t_1, \dots, t_m$ (indices), the embedding layer looks up $\mathbf{e}_i = E_{t_i,:}$ (the $t_i$-th row of the embedding matrix $E$).
For a loss $L$ that depends on the embeddings $\mathbf{e}_1, \dots, \mathbf{e}_m$, the gradient w.r.t. the embedding matrix has the sparsity structure:

$$\frac{\partial L}{\partial E_{v,:}} = \sum_{i \,:\, t_i = v} \frac{\partial L}{\partial \mathbf{e}_i}$$

Only rows that appear in the batch receive gradient updates. For vocabulary size $V = 128{,}256$ (Llama 3) and a batch with $m$ tokens, at most $m$ of the $V$ rows have nonzero gradients. Sparse embedding optimizers (Adagrad, sparse Adam) exploit this structure for efficiency.
Appendix U: Connections to Other Branches of Mathematics
U.1 Gradient and Topology (Morse Theory)
Morse theory studies how the topology of the sublevel sets $\{\mathbf{x} : f(\mathbf{x}) \le c\}$ changes as $c$ sweeps through the values of $f$. The topology changes only at critical points where $\nabla f = \mathbf{0}$. The number and type (index = number of negative Hessian eigenvalues) of critical points constrain the global topology.
For ML: The "loss landscape topology" program (Goodfellow et al., 2015; Li et al., 2018) studies how gradient descent navigates the network of critical points. The key result: most local minima of large overparameterized networks have comparable loss to the global minimum (high-dimensional Morse-Bott theory).
U.2 Gradient and Differential Forms
In differential geometry, the gradient is the unique vector field dual to the exterior derivative (1-form) $df$:

$$df(\mathbf{v}) = \langle \nabla f, \mathbf{v} \rangle \quad \text{for all vectors } \mathbf{v}$$

The gradient is obtained by "raising the index" using the metric tensor $g$:

$$(\nabla f)^i = g^{ij}\,\frac{\partial f}{\partial x^j}$$

In Euclidean space $g_{ij} = \delta_{ij}$, so $\nabla f = (\partial f/\partial x^1, \dots, \partial f/\partial x^n)$ (as a vector). On a Riemannian manifold, the metric introduces a non-trivial transformation - this is exactly the natural gradient transformation $\tilde{\nabla} L = F^{-1} \nabla L$, where the Fisher matrix $F$ plays the role of the metric tensor on the statistical manifold.
U.3 Gradient and Functional Analysis
For functionals $J$ on Hilbert spaces (function spaces with an inner product), the gradient (Fréchet derivative) is defined analogously: $J$ is differentiable at $u$ with gradient $\nabla J(u)$ if:

$$J(u + h) = J(u) + \langle \nabla J(u), h \rangle + o(\lVert h \rVert)$$

Applications: Energy-based models and the score function $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ (used in diffusion models, score matching, Langevin MCMC) are built on such functional gradients. The denoising score matching objective of diffusion models (Ho et al., 2020; Song et al., 2021) trains a network to estimate $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ - the gradient of the log-density - which is used to sample via Langevin dynamics.
Appendix V: Gradient Descent Variants - A Unified View
All modern optimizers can be written as preconditioned gradient descent:

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\, P_t^{-1}\, \nabla L(\boldsymbol{\theta}_t)$$

where $P_t$ is a preconditioning matrix that captures curvature information.
| Optimizer | Preconditioner $P_t$ | Key property |
|---|---|---|
| Gradient descent | $I$ | No curvature info; uniform step per parameter |
| AdaGrad | $\mathrm{diag}\big(\sqrt{\textstyle\sum_{s \le t} \mathbf{g}_s^2} + \epsilon\big)$ | Adapts to historical gradient magnitude per coordinate; sparse-gradient friendly |
| RMSprop | $\mathrm{diag}(\sqrt{\mathbf{v}_t} + \epsilon)$, $\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1 - \beta)\mathbf{g}_t^2$ | Exponential moving average of $\mathbf{g}^2$; non-monotone, avoids diminishing step |
| Adam | $\mathrm{diag}(\sqrt{\hat{\mathbf{v}}_t} + \epsilon)$, bias-corrected | Combines momentum (first moment) with RMSprop (second moment) |
| Natural gradient | $F$ (Fisher information) | Riemannian gradient; parameter-efficient steps on statistical manifold |
| Newton | $H$ (Hessian) | Exact curvature; quadratic convergence; $O(n^3)$ cost for $n$ parameters |
| K-FAC | Kronecker-factored Fisher approx. | Tractable approximation to natural gradient for neural networks |
| Shampoo | Kronecker-factored full-matrix preconditioner | Full matrix preconditioning per layer; used in large-scale LLM training |
The gradient is computed identically in all cases - the optimizer only changes what it does with that gradient. Understanding the raw gradient (this section) is prerequisite to understanding any of these methods.
For AI: In 2025-2026, frontier LLM training uses AdamW with gradient clipping and weight decay. Shampoo and distributed Shampoo are beginning to appear in production (Google, DeepMind), offering faster convergence by more faithfully approximating the natural gradient. The mathematical basis for all these methods is the gradient this section defines.
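The unified update $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta P^{-1} \nabla L$ can be exercised on a toy quadratic. This sketch (illustrative matrix, step sizes chosen by hand) contrasts plain gradient descent, whose step size is capped by the largest curvature, with a diagonal preconditioner that equalizes curvature across coordinates.

```python
import numpy as np

# Preconditioned GD on the ill-conditioned quadratic L(theta) = 0.5 theta^T A theta.
A = np.diag([100.0, 1.0])           # curvatures differ by 100x (illustrative)
grad = lambda th: A @ th            # gradient of the quadratic

def run(P_inv, eta, steps=100):
    th = np.array([1.0, 1.0])
    for _ in range(steps):
        th = th - eta * P_inv @ grad(th)   # theta <- theta - eta P^{-1} grad L
    return th

# Plain GD: eta must stay below 2/100, so the flat direction crawls.
plain = run(np.eye(2), eta=0.009)
# Diagonal preconditioner P = diag(A): both directions contract at the same rate.
precond = run(np.diag([1 / 100.0, 1.0]), eta=0.9)
```

After 100 steps the preconditioned run is essentially at the minimum, while plain GD still carries most of its error in the low-curvature coordinate - the same effect (per-coordinate step adaptation) that motivates AdaGrad/RMSprop/Adam.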