"The derivative is the heart of calculus. Everything else - integrals, series, differential equations - is built around it."
- Michael Spivak, Calculus (1967)
Overview
The derivative formalizes the intuition of instantaneous rate of change. Where a limit asks "what value does $f(x)$ approach?", a derivative asks "how fast is $f(x)$ changing right now?" This single idea - the slope of the tangent line, the instantaneous velocity, the marginal cost - is the load-bearing concept of all continuous optimization.
In machine learning, every gradient computation is a derivative. The backpropagation algorithm is the chain rule applied systematically to a computation graph. The vanishing gradient problem is a consequence of activation function derivatives approaching zero. The Adam optimizer's adaptive learning rates are functions of accumulated squared derivatives. Understanding derivatives at the mathematical level - not just as "the thing PyTorch computes" - is essential for debugging, designing, and innovating in modern AI.
This section develops single-variable derivatives from the limit definition through all standard rules, applies them to activation functions and computation graphs, and connects numerical differentiation to practical gradient checking.
Prerequisites
- Limits and continuity - 04/01-Limits-and-Continuity: epsilon-delta definition, limit laws, continuity
- Functions: domain, range, composition, inverse
- Algebra: polynomial manipulation, exponential and logarithmic identities
- Trigonometry: $\sin x$, $\cos x$, $\tan x$, and their basic identities
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive: derivative definition, rules, activation functions, backprop, numerical differentiation |
| exercises.ipynb | 10 graded exercises from power rule to backpropagation and gradient checking |
Learning Objectives
After completing this section, you will be able to:
- Define the derivative as a limit and verify it for elementary functions from first principles
- Apply the power, product, quotient, and chain rules fluently
- Differentiate composite, implicit, and inverse functions
- Compute derivatives of all standard activation functions and explain their ML significance
- State and prove the Mean Value Theorem and apply it to optimization arguments
- Classify critical points using first and second derivative tests
- Implement one-sided and centered finite difference approximations with correct error order
- Perform gradient checking to validate autodiff implementations
- Trace backpropagation through a scalar computation graph using the chain rule
- Connect derivative concepts to vanishing gradients, dying ReLU, and learning rate behavior
Table of Contents
- 1. Intuition
- 2. Formal Definition
- 3. Basic Differentiation Rules
- 4. The Chain Rule
- 5. Implicit Differentiation and Related Rates
- 6. Higher-Order Derivatives
- 7. Applications: Extrema and the Mean Value Theorem
- 8. Activation Function Derivatives
- 9. Numerical Differentiation
- 10. Common Mistakes
- 11. Exercises
- 12. Why This Matters for AI (2026 Perspective)
- Conceptual Bridge
1. Intuition
1.1 What a Derivative Measures
Consider a car's position $s(t)$ at time $t$. The average velocity over the interval $[t, t+h]$ is:

$$\frac{s(t+h) - s(t)}{h}$$

As $h \to 0$, this average approaches the instantaneous velocity at time $t$. That limit - if it exists - is the derivative $s'(t)$.

Geometrically, the average rate of change is the slope of a secant line through $(x, f(x))$ and $(x+h, f(x+h))$. As $h \to 0$, the secant approaches the tangent line at $x$.

```
DERIVATIVE: FROM SECANT TO TANGENT

 f(x)
  |                     * (x+h, f(x+h))
  |                    /
  |                   /   secant slope = [f(x+h) - f(x)] / h
  |                  /
  |      (x, f(x)) *------  <- tangent line at x (slope = f'(x))
  |
  +---------------------------------> x
             x       x+h

As h -> 0: secant -> tangent, average rate -> instantaneous rate.
```

The derivative measures sensitivity: how much does output change per unit change in input? If $f'(x) = 3$, a small nudge $h$ in the input produces a change of approximately $3h$ in the output.
1.2 Historical Motivation
The derivative was invented twice - independently and almost simultaneously - by two of history's greatest mathematicians, triggering a bitter priority dispute that divided European mathematics for a generation.
Isaac Newton (1665-1666) developed "the method of fluxions" during the Great Plague, when Cambridge was closed and Newton worked alone at Woolsthorpe Manor. He called the derivative a "fluxion" - the rate of flow of a "fluent" quantity. He used the notation $\dot{x}$ (dot notation still used in physics). Newton did not publish for decades.
Gottfried Wilhelm Leibniz (1675) independently developed calculus in Paris, motivated by geometric problems. He introduced the notation $\frac{dy}{dx}$ and $\int$ (still universal today). He published first, in 1684.
The Royal Society's 1713 judgment in favor of Newton - almost certainly biased - poisoned British-Continental mathematical relations for over a century, isolating British mathematics from advances made by the Bernoullis, Euler, and Lagrange.
| Year | Event |
|---|---|
| 1665 | Newton develops fluxions (unpublished) |
| 1675 | Leibniz develops calculus independently |
| 1684 | Leibniz publishes first |
| 1687 | Newton publishes Principia (uses fluxions implicitly) |
| 1713 | Royal Society (controlled by Newton) rules for Newton |
| 1821 | Cauchy formalizes with limits (modern foundation) |
| 1861 | Weierstrass gives epsilon-delta definition |
Lagrange (1797) introduced the prime notation $f'(x)$ and attempted an algebraic foundation via power series. Cauchy (1821) finally gave the limit-based definition we use today.
1.3 Why Derivatives Are Central to AI
Every parameter update in every gradient-based learning algorithm is computed via derivatives.
Backpropagation (Rumelhart, Hinton, Williams, 1986) is the chain rule applied to a computation graph. Given a loss $L$ and parameters $\theta_1, \dots, \theta_n$, the update requires computing all partial derivatives $\partial L / \partial \theta_i$ efficiently - which backprop does in time proportional to a single forward pass.
Vanishing gradients (Hochreiter, 1991; Hochreiter & Schmidhuber, 1997): sigmoid activations have $\sigma'(z) \le 1/4$. In a 20-layer network, the chain rule multiplies 20 such factors: $(1/4)^{20} \approx 9 \times 10^{-13}$. Early layers receive effectively zero gradient - they cannot learn. The fix (ReLU, residual connections, layer norm) is entirely about derivative behavior.
Adam optimizer (Kingma & Ba, 2014): the adaptive learning rate $\eta / (\sqrt{\hat{v}_t} + \epsilon)$, where $\hat{v}_t$ is the exponential moving average of squared gradients $g_t^2$. Every component is derivative-driven.
Neural architecture search, meta-learning, differentiable programming: modern AI increasingly treats the architecture itself as differentiable - requiring derivatives through discrete operations via straight-through estimators, Gumbel-softmax, and continuous relaxations.
2. Formal Definition
2.1 The Derivative as a Limit
Recall: The limit was defined rigorously in 01-Limits-and-Continuity. The derivative is a specific instance of that definition.
Definition (Derivative). The derivative of $f$ at $x$, written $f'(x)$, is:

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

provided this limit exists (is finite). If it exists, $f$ is differentiable at $x$.

The expression $\frac{f(x+h) - f(x)}{h}$ is called a difference quotient. It is the slope of the secant line through $(x, f(x))$ and $(x+h, f(x+h))$.

Example: $f(x) = x^2$.

$$f'(x) = \lim_{h \to 0} \frac{(x+h)^2 - x^2}{h} = \lim_{h \to 0} \frac{2xh + h^2}{h} = \lim_{h \to 0} (2x + h) = 2x$$

So $f'(x) = 2x$: the derivative of $x^2$ grows linearly in $x$.

Example: $f(x) = 1/x$.

$$f'(x) = \lim_{h \to 0} \frac{\frac{1}{x+h} - \frac{1}{x}}{h} = \lim_{h \to 0} \frac{-h}{h \, x(x+h)} = -\frac{1}{x^2}$$

Example: $f(x) = e^x$.

$$f'(x) = \lim_{h \to 0} \frac{e^{x+h} - e^x}{h} = e^x \lim_{h \to 0} \frac{e^h - 1}{h} = e^x \cdot 1$$

using the fundamental limit $\lim_{h \to 0} \frac{e^h - 1}{h} = 1$ from 01 §3.2. So $(e^x)' = e^x$ - the exponential is its own derivative.

The derivative as a function. If $f$ is differentiable at every $x$ in its domain, the rule $x \mapsto f'(x)$ defines a new function $f'$ called the derivative function.
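As a quick sanity check, the difference quotient can be evaluated numerically. The following minimal sketch (function names are illustrative, not from the source) shows the quotient for $f(x) = x^2$ at $x = 3$ approaching $f'(3) = 6$:

```python
# Minimal numerical check of the limit definition: the difference quotient
# for f(x) = x^2 at x = 3 should approach f'(3) = 6 as h shrinks.
def diff_quotient(f, x, h):
    """Slope of the secant line through (x, f(x)) and (x+h, f(x+h))."""
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 2
for h in [1.0, 0.1, 0.01, 0.001]:
    print(f"h={h:<6} quotient={diff_quotient(f, 3.0, h):.6f}")
# Output approaches 6.0: 7.0, 6.1, 6.01, 6.001
```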
2.2 Differentiability vs. Continuity
Theorem. If $f$ is differentiable at $a$, then $f$ is continuous at $a$.
Proof. We show $\lim_{x \to a} f(x) = f(a)$:

$$\lim_{x \to a} \left[ f(x) - f(a) \right] = \lim_{x \to a} \frac{f(x) - f(a)}{x - a} \cdot (x - a) = f'(a) \cdot 0 = 0$$

So $\lim_{x \to a} f(x) = f(a)$.
The converse is false: continuity does not imply differentiability.
Counterexamples:
- $f(x) = |x|$: continuous everywhere, but at $x = 0$: left derivative $-1$, right derivative $+1$ - not equal, so not differentiable
- $f(x) = x^{1/3}$: continuous at $0$, but $f'(x) = \frac{1}{3}x^{-2/3} \to \infty$ as $x \to 0$ - vertical tangent, not differentiable
- Weierstrass function (1872): continuous everywhere, differentiable nowhere - the first published example showing the two concepts are genuinely distinct

```
DIFFERENTIABILITY VS. CONTINUITY

+--------------------- Continuous functions ---------------------+
|                                                                 |
|   +--- Differentiable ----+    |x|    x^{1/3}    Weierstrass fn |
|   |  (smooth, no kinks)   |                                     |
|   +-----------------------+                                     |
+-----------------------------------------------------------------+

Differentiable => Continuous          (proven above)
Continuous    =/=> Differentiable     FALSE (counterexamples above)
```

For AI: ReLU is continuous everywhere but not differentiable at $z = 0$ (left derivative $0$, right derivative $1$). In practice, we use a subgradient - any value in $[0, 1]$ at $z = 0$, typically $0$ or $1$. Training converges despite this because $P(z = 0) = 0$ for continuously distributed pre-activations.
2.3 Notational Systems
Four equivalent notations coexist; knowing all is essential for reading across fields:
| Notation | Symbol | Usage |
|---|---|---|
| Lagrange (prime) | $f'(x)$, $f''(x)$, $f^{(n)}(x)$ | Most common in pure math |
| Leibniz (fraction) | $\frac{dy}{dx}$, $\frac{d}{dx}f(x)$ | Chain rule, implicit diff, physics |
| Newton (dot) | $\dot{x}$, $\ddot{x}$ | Physics, ODEs, time derivatives |
| Euler (operator) | $Df$, $D^2 f$ | Functional analysis, operator theory |

Leibniz notation is the most powerful for computation because it makes the chain rule look like fraction cancellation:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

Higher order: $\frac{d^2 y}{dx^2} = \frac{d}{dx}\left(\frac{dy}{dx}\right)$. Note: $\frac{d^2 y}{dx^2} \neq \left(\frac{dy}{dx}\right)^2$ - a common error.
2.4 One-Sided Derivatives
Definition. The right-hand derivative is $f'_+(x) = \lim_{h \to 0^+} \frac{f(x+h) - f(x)}{h}$. The left-hand derivative is $f'_-(x) = \lim_{h \to 0^-} \frac{f(x+h) - f(x)}{h}$.
$f$ is differentiable at $x$ if and only if both one-sided derivatives exist and are equal: $f'_-(x) = f'_+(x)$.
For AI - ReLU subgradient: At $z = 0$: $f'_-(0) = 0$ and $f'_+(0) = 1$. Since $0 \neq 1$, ReLU is not differentiable at $0$. Frameworks like PyTorch use the convention $\mathrm{ReLU}'(0) = 0$ (the left derivative), which works in practice.
3. Basic Differentiation Rules
These rules allow computing derivatives algebraically without evaluating limits each time. All are proved from the limit definition.
3.1 Constant and Power Rules
Constant Rule: $\frac{d}{dx} c = 0$ for any constant $c$.
Proof: $\lim_{h \to 0} \frac{c - c}{h} = 0$.
Power Rule: $\frac{d}{dx} x^n = n x^{n-1}$ for any real $n$.
Proof (integer $n$): Expand $(x+h)^n$ by the binomial theorem:

$$(x+h)^n = x^n + n x^{n-1} h + \binom{n}{2} x^{n-2} h^2 + \cdots + h^n$$

Subtract $x^n$, divide by $h$, let $h \to 0$: only the $n x^{n-1}$ term survives.
The power rule holds for all real $n$ (including fractions and negatives):

$$\frac{d}{dx} x^{1/2} = \tfrac{1}{2} x^{-1/2}, \qquad \frac{d}{dx} x^{-1} = -x^{-2}$$
3.2 Sum and Constant-Multiple Rules
Sum Rule: $(f + g)' = f' + g'$
Constant-Multiple Rule: $(cf)' = c f'$
Together: $(af + bg)' = a f' + b g'$ - differentiation is linear.
Consequence: Differentiating polynomials term by term:

$$\frac{d}{dx} \sum_{k=0}^{n} a_k x^k = \sum_{k=1}^{n} k \, a_k x^{k-1}$$
3.3 Product Rule
Theorem. $(fg)' = f'g + fg'$
Proof:

$$\frac{f(x+h)g(x+h) - f(x)g(x)}{h}$$

Add and subtract $f(x+h)g(x)$:

$$= f(x+h) \cdot \frac{g(x+h) - g(x)}{h} + \frac{f(x+h) - f(x)}{h} \cdot g(x)$$

As $h \to 0$: first term $\to f(x)g'(x)$, second $\to f'(x)g(x)$.
Memory aid: "derivative of first times second, plus first times derivative of second."
Example: $\frac{d}{dx}\left[x^2 e^x\right] = 2x e^x + x^2 e^x$.
Generalized product (Leibniz rule): $(fg)^{(n)} = \sum_{k=0}^{n} \binom{n}{k} f^{(k)} g^{(n-k)}$
3.4 Quotient Rule
Theorem. $\left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}$, provided $g(x) \neq 0$.
Proof: Write $f = \left(\frac{f}{g}\right) \cdot g$ and apply the product rule to both sides.
Memory aid: "lo d-hi minus hi d-lo, over lo squared" (where hi $= f$, lo $= g$).
Example:

$$\frac{d}{dx} \tan x = \frac{d}{dx} \frac{\sin x}{\cos x} = \frac{\cos^2 x + \sin^2 x}{\cos^2 x} = \sec^2 x$$
3.5 Elementary Function Derivatives
Core table - memorize these:
| Function | Derivative | Notes |
|---|---|---|
| $e^x$ | $e^x$ | Only function equal to its own derivative (up to constant multiples) |
| $a^x$ | $a^x \ln a$ | For any base $a > 0$ |
| $\ln x$ | $\frac{1}{x}$ | For $x > 0$ |
| $\log_a x$ | $\frac{1}{x \ln a}$ | Change of base |
| $\sin x$ | $\cos x$ | |
| $\cos x$ | $-\sin x$ | Note the negative |
| $\tan x$ | $\sec^2 x$ | |
| $\arcsin x$ | $\frac{1}{\sqrt{1-x^2}}$ | Via inverse function rule |
| $\arctan x$ | $\frac{1}{1+x^2}$ | Via inverse function rule |

Proof ($a^x$): $a^x = e^{x \ln a}$, so $\frac{d}{dx} a^x = e^{x \ln a} \ln a = a^x \ln a$.
Proof ($\sin x$): $\frac{\sin(x+h) - \sin x}{h} = \sin x \cdot \frac{\cos h - 1}{h} + \cos x \cdot \frac{\sin h}{h} \to \cos x$, using $\lim_{h \to 0} \frac{\sin h}{h} = 1$ (from 01 §3.2).
4. The Chain Rule
4.1 Statement and Proof
The chain rule handles composites - functions of functions. If $y = f(g(x))$, then:

$$\frac{dy}{dx} = f'(g(x)) \cdot g'(x)$$

Proof (informal but rigorous enough for most purposes). Let $u = g(x)$ and $y = f(u)$. Then:

$$\frac{\Delta y}{\Delta x} = \frac{\Delta y}{\Delta u} \cdot \frac{\Delta u}{\Delta x}$$

Taking $\Delta x \to 0$: if $g'(x)$ exists and $f'(u)$ exists at $u = g(x)$, the limit yields the chain rule. (The formal proof handles the case $\Delta u = 0$ separately via an auxiliary function.)
4.2 Worked Examples
Example 1. $y = (3x^2 + 1)^5$. Let $u = 3x^2 + 1$, $y = u^5$.

$$\frac{dy}{dx} = 5u^4 \cdot 6x = 30x (3x^2 + 1)^4$$

Example 2 (Gaussian kernel). $y = e^{-x^2/2}$.

$$\frac{dy}{dx} = e^{-x^2/2} \cdot (-x) = -x e^{-x^2/2}$$

Example 3 (triple composition). $y = \sin\!\big(e^{x^2}\big)$.

$$\frac{dy}{dx} = \cos\!\big(e^{x^2}\big) \cdot e^{x^2} \cdot 2x$$
The pattern: differentiate outer, keep inner untouched, multiply by derivative of inner - repeat inward.
4.3 Chain Rule and Backpropagation
Every neural network is a deeply composed function:

$$\hat{y} = (f_L \circ f_{L-1} \circ \cdots \circ f_2 \circ f_1)(x)$$
Backpropagation is the chain rule applied to a computation graph. The gradient of the loss with respect to any parameter is a product of local derivatives along the path from that parameter to the output. The chain rule guarantees this product equals the true gradient - the theoretical justification for gradient-based learning.
Forward reference. The multivariable chain rule (Jacobians, backpropagation for vector-valued functions) is the canonical topic of 05-Multivariate Calculus. Here we develop the single-variable foundation only.
5. Implicit Differentiation and Related Rates
5.1 Implicit Differentiation
Not every curve is a function $y = f(x)$. The unit circle $x^2 + y^2 = 1$ defines $y$ implicitly. To find $\frac{dy}{dx}$, differentiate both sides with respect to $x$, treating $y$ as a function of $x$:

$$2x + 2y \frac{dy}{dx} = 0 \quad\Longrightarrow\quad \frac{dy}{dx} = -\frac{x}{y}$$

General procedure:
- Differentiate both sides w.r.t. $x$; apply chain rule wherever $y$ appears (treat $y$ as $y(x)$).
- Collect all $\frac{dy}{dx}$ terms on one side.
- Solve algebraically for $\frac{dy}{dx}$.
Example. Derive $\frac{d}{dx} \ln x = \frac{1}{x}$.
Let $y = \ln x$, so $e^y = x$. Differentiate implicitly:

$$e^y \frac{dy}{dx} = 1 \quad\Longrightarrow\quad \frac{dy}{dx} = \frac{1}{e^y} = \frac{1}{x}$$
For AI: Implicit differentiation underlies the derivation of softmax Jacobians and the inverse function theorem applied to normalizing flows - generative models that learn invertible mappings.
5.2 Related Rates
Related rates problems ask: if two quantities are linked by an equation, how fast does one change relative to the other? Differentiate the linking equation w.r.t. time $t$.
Example. The cross-entropy loss $L = -[y \ln a + (1-y)\ln(1-a)]$ and the logit $z$ are linked by the sigmoid: $a = \sigma(z)$. Rate of change of loss w.r.t. logit:

$$\frac{dL}{dz} = \frac{dL}{da} \cdot \frac{da}{dz} = \frac{a - y}{a(1-a)} \cdot a(1-a) = a - y$$

This is the gradient passed to the logit layer during backpropagation - related rates in disguise.
5.3 Linear Approximation
The tangent line at $x_0$ approximates $f$ locally:

$$f(x) \approx f(x_0) + f'(x_0)(x - x_0)$$

This linearization is the first-order Taylor approximation. It underlies gradient descent: the loss surface near $x_0$ looks like a hyperplane, and gradient descent steps in the direction of steepest descent on that plane.
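A quick numerical sketch makes the quadratic error of the linearization visible. Here I use $f(x) = e^x$ near $x_0 = 0$ (the choice of function and point is illustrative, not from the source):

```python
import numpy as np

# Tangent-line approximation of f(x) = e^x near x0 = 0: f(x) ~ 1 + x.
x0 = 0.0
for dx in [0.5, 0.1, 0.01]:
    exact = np.exp(x0 + dx)
    linear = np.exp(x0) + np.exp(x0) * dx    # f(x0) + f'(x0) * dx
    print(f"dx={dx:<5} exact={exact:.6f} linear={linear:.6f} error={exact - linear:.2e}")
# The error shrinks ~quadratically: shrinking dx by 10x shrinks the error ~100x.
```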
Forward reference. The full Taylor series (all higher-order terms, remainder, convergence) is the canonical topic of 04-Series-and-Sequences.
6. Higher-Order Derivatives and Concavity
6.1 Definition and Notation
The second derivative is the derivative of the derivative:

$$f''(x) = \frac{d}{dx} f'(x) = \lim_{h \to 0} \frac{f'(x+h) - f'(x)}{h}$$

Higher-order derivatives: $f'''(x)$, $f^{(4)}(x)$, ..., $f^{(n)}(x) = \frac{d^n f}{dx^n}$.
Geometric meaning:
- $f'(x_0) > 0$: $f$ is increasing at $x_0$
- $f'(x_0) < 0$: $f$ is decreasing at $x_0$
- $f''(x_0) > 0$: $f'$ is increasing -> $f$ is concave up (bowl shape $\cup$)
- $f''(x_0) < 0$: $f'$ is decreasing -> $f$ is concave down (dome shape $\cap$)
6.2 Inflection Points
A point $x_0$ is an inflection point if $f''$ changes sign at $x_0$. The concavity flips - from $\cup$ to $\cap$ or vice versa. Necessary condition: $f''(x_0) = 0$ (but not sufficient - check sign change).
Example. $f(x) = x^3$: $f''(x) = 6x$. Changes sign at $x = 0$: inflection point. Non-example. $f(x) = x^4$: $f''(x) = 12x^2$, equals zero at $x = 0$ but no sign change - not an inflection point.
6.3 Significance for Optimization
The second derivative determines the local curvature of the loss surface:
- Positive curvature ($f'' > 0$): a minimum is possible here; gradient descent converges toward it.
- Negative curvature ($f'' < 0$): a maximum; gradient descent moves away from such regions.
- Zero curvature: saddle point candidate - common in high-dimensional loss landscapes.
The Hessian (matrix of second partial derivatives, covered in 05-Multivariate Calculus) generalizes $f''$ to multi-parameter models. Positive definite Hessians certify local minima.
Recall: Positive definite matrices were defined in 07-Positive-Definite-Matrices - a matrix $A$ is PD iff $x^\top A x > 0$ for all nonzero $x$.
6.4 Taylor Coefficients Foreshadowed
The $n$-th derivative at a point gives the $n$-th Taylor coefficient:

$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(x_0)}{n!} (x - x_0)^n$$

So $f''(x_0)$ controls the quadratic correction to the linear approximation - exactly the second-order term Adam exploits by tracking gradient variance as a proxy for curvature.
Forward reference. The full derivation of Taylor series using higher-order derivatives is in 04-Series-and-Sequences.
7. Extrema, Critical Points, and the Mean Value Theorem
7.1 Critical Points and the First Derivative Test
A critical point of $f$ is any $x_0$ where $f'(x_0) = 0$ or $f'(x_0)$ does not exist.
First Derivative Test. If $f'$ changes sign at $x_0$:
- $-$ to $+$: local minimum
- $+$ to $-$: local maximum
- No sign change: saddle point / inflection
Example. $f(x) = x^3 - 3x$, so $f'(x) = 3x^2 - 3$. Critical points: $x = \pm 1$.
- $x = -1$: $f'$ changes $+$ to $-$ -> local max, $f(-1) = 2$
- $x = 1$: $f'$ changes $-$ to $+$ -> local min, $f(1) = -2$
7.2 Second Derivative Test
If $f'(x_0) = 0$:
- $f''(x_0) > 0$: local minimum
- $f''(x_0) < 0$: local maximum
- $f''(x_0) = 0$: inconclusive - higher-order test needed
For AI: In a scalar model, the second derivative test tells us whether a parameter has converged to a local minimum or is at a saddle. In neural networks this extends to the Hessian eigenspectrum - if all eigenvalues are positive, we've found a local minimum.
7.3 Global Extrema on Closed Intervals
Recall: The Extreme Value Theorem (EVT) from 01-Limits-and-Continuity guarantees that a continuous function on $[a, b]$ attains its maximum and minimum.
Algorithm for global extrema on $[a, b]$:
- Find all critical points in $(a, b)$ where $f' = 0$ or undefined.
- Evaluate $f$ at critical points and endpoints $a$, $b$.
- The largest value is the global max; the smallest is the global min.
7.4 The Mean Value Theorem
Theorem (MVT). If $f$ is continuous on $[a, b]$ and differentiable on $(a, b)$, there exists $c \in (a, b)$ such that:

$$f'(c) = \frac{f(b) - f(a)}{b - a}$$

Geometrically: there is a point where the tangent slope equals the average (secant) slope.
Proof. Apply Rolle's Theorem (MVT with $f(a) = f(b)$) to $g(x) = f(x) - f(a) - \frac{f(b) - f(a)}{b - a}(x - a)$.
For optimization: The MVT bounds how much a function can change along a gradient descent path. If $|f'| \le L$ everywhere, then $|f(b) - f(a)| \le L |b - a|$ - a Lipschitz bound used to set safe learning rates.
Forward reference. Critical point analysis extends to multivariable functions via gradient = 0 and Hessian definiteness in 05-Multivariate Calculus and 08-Optimization.
8. Activation Function Derivatives
Recall: Continuity of activation functions (whether ReLU, GELU, sigmoid are continuous at the origin) was analyzed in 01-Limits-and-Continuity 10. Here we derive their derivatives - essential for backpropagation.
8.1 Sigmoid
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Derivative. Using the quotient rule with $f = 1$, $g = 1 + e^{-z}$:

$$\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \sigma(z)\big(1 - \sigma(z)\big)$$

This self-referential form means the gradient can be computed from the forward-pass output - no second exponential needed. Peak value: $\sigma'(0) = 1/4$.
Vanishing gradient. As $z \to \pm\infty$: $\sigma(z) \to 0$ or $1$, so $\sigma'(z) \to 0$. Products of many near-zero gradients -> exponential decay through layers.
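The self-referential form is easy to sanity-check numerically. A minimal sketch comparing $\sigma(z)(1 - \sigma(z))$ to a centered difference (the function names here are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Check sigma'(z) = sigma(z)(1 - sigma(z)) against a centered difference.
z = np.linspace(-5, 5, 11)
h = 1e-6
analytic = sigmoid(z) * (1 - sigmoid(z))
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
print(np.max(np.abs(analytic - numeric)))   # ~1e-10: the two forms agree
print(analytic.max())                       # 0.25 at z = 0 (the peak value)
```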
8.2 Tanh
$$\tanh z = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

Derivative. Using $\tanh z = 2\sigma(2z) - 1$ or direct quotient rule:

$$\frac{d}{dz} \tanh z = 1 - \tanh^2 z$$

Peak value: $\tanh'(0) = 1$ - four times that of sigmoid. Still vanishes for large $|z|$.
8.3 ReLU and Variants
$$\mathrm{ReLU}(z) = \max(0, z), \qquad \mathrm{ReLU}'(z) = \begin{cases} 1 & z > 0 \\ 0 & z < 0 \end{cases}$$

Undefined at $z = 0$; in practice a subgradient is used. No vanishing gradient for active units ($z > 0$) - the key advantage over sigmoid/tanh.
Leaky ReLU: $\max(\alpha z, z)$ with $\alpha \approx 0.01$; derivative $\alpha$ for $z < 0$, so negative inputs still receive a small gradient.
8.4 GELU
The Gaussian Error Linear Unit:

$$\mathrm{GELU}(x) = x \, \Phi(x)$$

where $\Phi$ is the standard normal CDF.
Derivative (product rule):

$$\mathrm{GELU}'(x) = \Phi(x) + x \, \phi(x)$$

where $\phi(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ is the standard normal PDF. GELU is smooth everywhere, approximating ReLU for large positive $x$ and gating negative values by their Gaussian probability. Used in GPT-2, BERT, and most modern transformers.
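A minimal sketch of the exact GELU and its derivative $\Phi(x) + x\phi(x)$, checked against a centered difference (helper names `Phi`, `phi`, `gelu_prime` are mine, using only the standard-library `math.erf`):

```python
import math

def Phi(x):   # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):   # standard normal PDF
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def gelu(x):
    return x * Phi(x)

def gelu_prime(x):
    return Phi(x) + x * phi(x)

h = 1e-6
for x in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    numeric = (gelu(x + h) - gelu(x - h)) / (2 * h)
    print(f"x={x:+.1f} analytic={gelu_prime(x):+.6f} numeric={numeric:+.6f}")
# gelu_prime(0) = 0.5, and the derivative dips slightly below 0 near x ~ -1.5
```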
8.5 Derivative Summary Table
| Activation | Formula | Derivative | Range of derivative | Used In |
|---|---|---|---|---|
| Sigmoid | $\frac{1}{1+e^{-z}}$ | $\sigma(1-\sigma)$ | $(0, 0.25]$ | Logistic regression, old RNNs |
| Tanh | $\frac{e^z - e^{-z}}{e^z + e^{-z}}$ | $1 - \tanh^2 z$ | $(0, 1]$ | LSTMs, NLP |
| ReLU | $\max(0, z)$ | $\mathbb{1}[z > 0]$ | $\{0, 1\}$ | CNNs, ResNets |
| GELU | $x\,\Phi(x)$ | $\Phi(x) + x\phi(x)$ | $\approx (-0.13, 1.13)$ | Transformers |
| Softplus | $\ln(1 + e^z)$ | $\sigma(z)$ | $(0, 1)$ | Smooth ReLU |
9. Numerical Differentiation
Recall: 01 previewed the gradient as a limit in 01 11. Here we develop the full numerical treatment - the canonical home for this topic.
9.1 Finite Difference Approximations
Forward difference (first order, $O(h)$ error):

$$f'(x) \approx \frac{f(x+h) - f(x)}{h}$$

Taylor expansion: $f(x+h) = f(x) + f'(x)h + \frac{f''(x)}{2}h^2 + \cdots$, so the error is $O(h)$.
Centered difference (second order, $O(h^2)$ error):

$$f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}$$

Expand both: the even powers of $h$ cancel in the subtraction, leaving error $O(h^2)$. Centered differences are strongly preferred.
Second derivative via finite differences:

$$f''(x) \approx \frac{f(x+h) - 2f(x) + f(x-h)}{h^2}$$
9.2 Optimal Step Size
Two competing errors:
- Truncation error: $O(h^2)$ - decreases as $h \to 0$
- Round-off error: floating-point arithmetic contributes $O(\epsilon_{\text{machine}} / h)$ - increases as $h \to 0$
Balancing these (minimize total error $\sim \frac{|f'''|}{6} h^2 + \frac{\epsilon}{h}$) gives the optimal step:

$$h^* \sim \epsilon_{\text{machine}}^{1/3} \approx 6 \times 10^{-6}$$

In practice: $h \approx 10^{-5}$ to $10^{-6}$ for float64 centered differences.
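The trade-off is easy to see empirically. A minimal sketch sweeping $h$ for $f(x) = \sin x$ at $x = 1$ (the test function and point are my choice):

```python
import numpy as np

# Truncation vs. round-off trade-off for the centered difference.
x, true = 1.0, np.cos(1.0)
for h in [1e-1, 1e-3, 1e-5, 1e-7, 1e-9, 1e-11]:
    centered = (np.sin(x + h) - np.sin(x - h)) / (2 * h)
    print(f"h={h:.0e}  error={abs(centered - true):.2e}")
# Error falls like h^2 down to h ~ 1e-5, then rises again as round-off
# (catastrophic cancellation) dominates - matching h* ~ eps^(1/3).
```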
9.3 Gradient Checking
Gradient checking verifies an analytical gradient implementation by comparing it to a numerical estimate:
```python
import numpy as np

def grad_check(f, x, analytic_grad, h=1e-5):
    """Compare an analytic gradient to a centered-difference estimate."""
    numerical_grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy();  x_plus[i] += h
        x_minus = x.copy(); x_minus[i] -= h
        numerical_grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    # Relative error is scale-invariant; the 1e-8 guards against division by zero.
    rel_error = np.linalg.norm(analytic_grad - numerical_grad) / \
                (np.linalg.norm(analytic_grad) + np.linalg.norm(numerical_grad) + 1e-8)
    return rel_error  # < 1e-5 typically acceptable
```
Libraries like PyTorch provide torch.autograd.gradcheck() implementing exactly this. Every custom backward pass should be verified.
9.4 Automatic Differentiation
Numerical differentiation approximates derivatives; automatic differentiation (AD) computes them exactly (up to floating-point) by applying the chain rule to elementary operations in forward or reverse mode.
- Forward mode AD: accumulates derivative alongside function evaluation - efficient for few inputs, many outputs
- Reverse mode AD (backpropagation): accumulates gradient from output backward - efficient for many inputs, one output (the loss)
All modern deep learning frameworks (PyTorch, JAX, TensorFlow) implement reverse-mode AD. Numerical differentiation is used for checking, not training.
10. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | $(fg)' = f'g'$ | Product rule is $(fg)' = f'g + fg'$ | Always use product rule |
| 2 | $(f/g)' = f'/g'$ | Quotient rule is $\frac{f'g - fg'}{g^2}$ | Derive from product rule |
| 3 | Forgetting chain rule: $\frac{d}{dx}\sin(x^2) = \cos(x^2)$ | Outer function derivative x inner derivative needed | $\frac{d}{dx}\sin(x^2) = 2x\cos(x^2)$ |
| 4 | $\frac{d}{dx}\log_a x = \frac{1}{x}$ | Confused with change-of-base; correct is $\frac{1}{x \ln a}$ | Memorize: $\frac{d}{dx}\log_a x = \frac{1}{x \ln a}$ |
| 5 | $\frac{d}{dx} e^{2x} = e^{2x}$ | Missing chain rule: multiply by $2$ | $\frac{d}{dx} e^{2x} = 2e^{2x}$ |
| 6 | Differentiable implies continuous - false direction | True direction: differentiable -> continuous (not converse) | ReLU is continuous but not differentiable at 0 |
| 7 | $f''(x_0) = 0$ means saddle/inflection | Not sufficient - need sign change of $f''$ | Check sign on both sides |
| 8 | Optimal $h$ for finite differences is "as small as possible" | Catastrophic cancellation - round-off dominates | Use $h \approx 10^{-5}$ for float64 |
| 9 | Applying L'Hôpital when limit is not $\frac{0}{0}$ or $\frac{\infty}{\infty}$ | L'Hôpital only applies to indeterminate forms | Check form first |
| 10 | Chain rule direction error: $f'(x)\,g'(x)$ | Must evaluate at $g(x)$: $f'(g(x))\,g'(x)$ | Track composition carefully |
| 11 | Assuming $\mathrm{ReLU}'(0) = 0$ or $1$ is "the" derivative | ReLU is not differentiable at 0; subgradient is standard | Use subgradient convention |
| 12 | Gradient of $x^\top A x$ as $Ax$ | Matrix derivative rules needed; correct form is $(A + A^\top)x$ | See 05-Multivariate Calculus |
11. Exercises
Exercise 1 - Basic rules. Differentiate: (a) $f(x) = 4x^3 - 2x + 7$, (b) $f(x) = x^2 e^x$, (c) $f(x) = \dfrac{x}{1 + x^2}$.
Exercise 2 - Chain rule. Differentiate: (a) $\sin(x^2)$, (b) $e^{-3x}$, (c) $\ln(1 + x^2)$, (d) $\sigma(2x)$.
Exercise 3 - Implicit differentiation. Given $x^3 + y^3 = 3axy$ (folium of Descartes), find $\frac{dy}{dx}$.
Exercise 4 - Activation derivatives. (a) Derive $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ from scratch. (b) Verify numerically using centered differences at $z = 0$ with $h = 10^{-5}$.
Exercise 5 - Critical points. Find and classify all critical points of $f(x) = x^4 - 2x^2$. Identify global extrema on $[-2, 2]$.
Exercise 6 - MVT application. Prove that $|\sin x| \le |x|$ for all $x$ using the MVT.
Exercise 7 - Gradient checking. Implement centered finite differences to approximate the derivative of $f(x) = e^x$ at $x = 1$. Compare to the analytic derivative. Experiment with $h \in \{10^{-1}, 10^{-2}, \dots, 10^{-12}\}$ and explain the behavior.
Exercise 8 - Backpropagation. A 2-layer scalar network: $z_1 = w_1 x$, $a_1 = \sigma(z_1)$, $z_2 = w_2 a_1$, $a_2 = \sigma(z_2)$, $L = \frac{1}{2}(a_2 - y)^2$. Derive $\frac{\partial L}{\partial w_2}$ and $\frac{\partial L}{\partial w_1}$ using the chain rule. Verify numerically.
12. Why This Matters for AI (2026 Perspective)
| Concept | AI/ML Impact |
|---|---|
| Derivative definition | Formal basis of gradient - every optimizer (SGD, Adam, Adagrad) descends along $-\nabla_\theta L$ |
| Chain rule | Backpropagation: gradient of loss through composed layers; foundational for all neural network training |
| Product/quotient rules | Attention score derivatives; gating mechanisms in LSTMs and GRUs |
| Activation derivatives | $\sigma'$, $\mathrm{ReLU}'$, $\mathrm{GELU}'$ determine gradient magnitude at each neuron; control vanishing/exploding gradients |
| Second derivative / concavity | Hessian approximations in Adam (diagonal), K-FAC (block), Shampoo (full); Newton's method in optimization |
| Critical points | Saddle points dominate high-dimensional loss landscapes; flat minima -> better generalization |
| MVT and Lipschitz bounds | Learning rate theory: $\eta \le 1/L$ ensures descent for $L$-smooth losses |
| Implicit differentiation | Normalizing flows, invertible networks, implicit models (DEQ, neural ODEs) |
| Numerical differentiation | Gradient checking in unit tests; finite-difference Hessian in small-scale studies |
| Automatic differentiation | PyTorch autograd, JAX grad, TensorFlow GradientTape - all implement reverse-mode AD |
| Linear approximation | Gradient descent step as linear model of the loss surface near current parameters |
| GELU derivative | Used in GPT-2, BERT, T5, and effectively every transformer post-2018 |
Conceptual Bridge
Looking back. This section built on the definition of a limit (01) - the derivative is a limit of a difference quotient. The continuity of activation functions established in 01 is a prerequisite for their differentiability here (differentiable implies continuous, but not conversely).
Looking forward. The chain rule derived here for scalar functions becomes the multivariable chain rule in 05-Multivariate Calculus - the chain rule expressed through Jacobians is the mathematical heart of backpropagation for vector-valued functions. The linear approximation introduced in 5.3 becomes the full Taylor series in 04-Series-and-Sequences, which justifies optimizer update rules. Critical point analysis extends to gradient = 0 conditions in 08-Optimization.
Position in the curriculum:
CHAPTER 4 - CALCULUS FUNDAMENTALS
01 Limits and Continuity
(limit definition, L'Hôpital, continuity of activations)
02 Derivatives and Differentiation YOU ARE HERE
(rules, chain rule, activation f', numerical diff)
03 Integration
(FTC links derivatives <-> integrals)
04 Series and Sequences
(Taylor series uses all derivatives from 02)
05 Multivariate Calculus
(multivariable chain rule, Jacobians, full backprop)
<- Back to Calculus Fundamentals | Next: Integration ->
Appendix A: Extended Worked Examples - Basic Rules
A.1 Power Rule - Negative and Fractional Exponents
The power rule applies to all real $n$:

$$\frac{d}{dx} x^n = n x^{n-1}$$

Negative integer: $\frac{d}{dx} x^{-2} = -2x^{-3}$.
Fractional: $\frac{d}{dx} x^{1/2} = \frac{1}{2} x^{-1/2} = \frac{1}{2\sqrt{x}}$. Note: not defined at $x = 0$.
Irrational: $\frac{d}{dx} x^{\pi} = \pi x^{\pi - 1}$.
Proof for positive integer $n$ (inductive, using product rule). Base case $n = 1$: $\frac{d}{dx} x = 1$ from the definition. Inductive step: assume $\frac{d}{dx} x^n = n x^{n-1}$. Then $\frac{d}{dx} x^{n+1} = \frac{d}{dx}(x \cdot x^n) = x^n + x \cdot n x^{n-1} = (n+1)x^n$.
For general real $r$: write $x^r = e^{r \ln x}$ and apply the chain rule: $\frac{d}{dx} e^{r \ln x} = e^{r \ln x} \cdot \frac{r}{x} = r x^{r-1}$.
A.2 Product Rule - Triple Product
For three functions: $(fgh)' = f'gh + fg'h + fgh'$.
Proof. Write $fgh = (fg)h$ and apply the product rule twice: $((fg)h)' = (fg)'h + (fg)h' = (f'g + fg')h + fgh'$.
Example. $\frac{d}{dx}\left[x\,e^x \sin x\right] = e^x \sin x + x e^x \sin x + x e^x \cos x$.
A.3 Quotient Rule - Deriving from Product Rule
Proof. Let $q = f/g$, so $f = qg$. By product rule: $f' = q'g + qg'$. Solving: $q' = \frac{f' - qg'}{g} = \frac{f'g - fg'}{g^2}$.
Mnemonic ("lo d-hi minus hi d-lo over lo-lo"): $\left(\frac{\text{hi}}{\text{lo}}\right)' = \frac{\text{lo} \cdot \text{d-hi} - \text{hi} \cdot \text{d-lo}}{\text{lo}^2}$.
A.4 Logarithmic Differentiation
When $f$ is a product/quotient of many terms, take $\ln$ first:

$$\ln f = \sum_i \ln u_i - \sum_j \ln v_j$$

Differentiate implicitly:

$$\frac{f'}{f} = \sum_i \frac{u_i'}{u_i} - \sum_j \frac{v_j'}{v_j}$$

Multiply by $f$. Logarithmic differentiation is also the only elementary route to $x^x$:

$$\frac{d}{dx} x^x = x^x (\ln x + 1)$$
Appendix B: Extended Chain Rule Examples and Patterns
B.1 Identifying the Composition Structure
Every application of the chain rule begins with identifying:
- Outer function $f$: what is applied last
- Inner function $g$: what is the argument of the outer function

| Expression | Outer | Inner | Derivative |
|---|---|---|---|
| $\sin(x^2)$ | $\sin u$ | $x^2$ | $2x \cos(x^2)$ |
| $e^{3x}$ | $e^u$ | $3x$ | $3e^{3x}$ |
| $(1 + x^2)^{10}$ | $u^{10}$ | $1 + x^2$ | $20x (1 + x^2)^9$ |
| $\ln(1 + e^x)$ | $\ln u$ | $1 + e^x$ | $\frac{e^x}{1 + e^x}$ |

Note: $\frac{e^x}{1 + e^x} = \sigma(x)$ - the sigmoid of $x$.
B.2 Nested Chain Rule - Four Levels Deep
$y = \ln\!\big(\sin(e^{x^2})\big)$. Work from outside in:
Let $u = x^2$, $v = e^u$, $w = \sin v$, $y = \ln w$.

$$\frac{dy}{dx} = \frac{1}{\sin(e^{x^2})} \cdot \cos(e^{x^2}) \cdot e^{x^2} \cdot 2x$$
B.3 Scalar Backpropagation Step-by-Step
Consider the output node $L = -[y \ln a + (1-y)\ln(1-a)]$ where $a = \sigma(wx + b)$ (binary cross-entropy, scalar model).
Chain rule cascade:

$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} = \frac{a - y}{a(1-a)} \cdot a(1-a) = a - y$$

Gradient w.r.t. weight $w$:

$$\frac{\partial L}{\partial w} = (a - y)\,x$$

This is the gradient used in logistic regression's SGD update. The chain rule reduces the gradient to a simple product of the prediction error $(a - y)$ and the input feature $x$.
B.4 The Derivative as a Linear Map
Abstractly, the derivative is the best linear approximation to $f$ near $x_0$. More precisely, $f'(x_0)$ is the unique number such that:

$$f(x_0 + h) = f(x_0) + f'(x_0)\,h + o(h)$$

where $o(h)$ denotes terms that shrink faster than $h$. This abstract definition extends directly to vector functions (Fréchet derivative) - in that setting, $f'(x_0)$ becomes a linear map (matrix), which is the Jacobian.
Appendix C: Deep Dive - Differentiability and Regularity
C.1 The Formal epsilon-delta Definition of Differentiability
$f$ is differentiable at $x_0$ if there exists a number $L$ such that:

$$\lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h} = L$$

i.e., for all $\epsilon > 0$, there exists $\delta > 0$ such that $0 < |h| < \delta$ implies:

$$\left| \frac{f(x_0 + h) - f(x_0)}{h} - L \right| < \epsilon$$

We write $f'(x_0) = L$.
C.2 Non-Differentiable Points - A Taxonomy
Type 1 - Corner: $f(x) = |x|$ at $x = 0$. Left derivative: $-1$. Right derivative: $+1$. Limit does not exist.
Type 2 - Cusp: $f(x) = x^{2/3}$. At $x = 0$: $f'(x) = \frac{2}{3} x^{-1/3} \to \pm\infty$. The tangent line becomes vertical - the derivative limit is infinite, not finite.
Type 3 - Vertical tangent: $f(x) = x^{1/3}$ at $x = 0$. Same as cusp; the function is continuous but the difference quotient diverges.
Type 4 - Oscillatory discontinuity of derivative: $f(x) = x^2 \sin(1/x)$ for $x \neq 0$, $f(0) = 0$. The derivative exists at 0 ($f'(0) = 0$) but is not continuous at 0 - $f$ is differentiable but not $C^1$.
For AI - ReLU. ReLU has a corner at $z = 0$: left derivative is $0$, right derivative is $1$. In practice, the subgradient (any value in the interval $[0, 1]$) is used; most frameworks return $0$.
C.3 Smoothness Classes
| Class | Condition | Examples |
|---|---|---|
| $C^0$ | Continuous | ReLU, $\lvert x \rvert$ |
| $C^1$ | Derivative exists and is continuous | Softplus, GELU |
| $C^2$ | Second derivative exists and is continuous | $\lvert x \rvert^3$, polynomials |
| $C^\infty$ | All derivatives exist | $e^x$, $\sin x$, polynomials |
| $C^\omega$ | Real analytic (equals its Taylor series) | $e^x$, $\sin x$ |

For AI: Smooth activation functions ($C^2$ or better) are required for second-order optimizers (Newton, L-BFGS) and for second-order sensitivity analysis. Rectified activations ($C^0$ only) work for SGD/Adam but complicate curvature estimation.
C.4 Differentiability vs. Continuity - Summary
```
Differentiable at x_0   ===>   Continuous at x_0       (implication holds)
Continuous at x_0       =/=>   Differentiable at x_0   (no implication)
```

- Differentiable -> Continuous: Proven via $f(x) - f(a) = \frac{f(x) - f(a)}{x - a}(x - a) \to f'(a) \cdot 0 = 0$.
- Continuous -/-> Differentiable: Counterexample: $|x|$ is continuous everywhere but not differentiable at 0.
- Differentiable -/-> Continuously Differentiable: Counterexample: $x^2 \sin(1/x)$ above.
Appendix D: Advanced Topics - Inverse Function Theorem and Derivatives of Inverses
D.1 Inverse Function Theorem
If $f$ is differentiable at $x_0$ with $f'(x_0) \neq 0$, then $f^{-1}$ is differentiable at $y_0 = f(x_0)$ and:

$$(f^{-1})'(y_0) = \frac{1}{f'(x_0)}$$

Intuition. If $f$ multiplies tangent slopes by $f'(x_0)$, then $f^{-1}$ divides by $f'(x_0)$. In Leibniz notation: if $y = f(x)$, then $\frac{dx}{dy} = 1 \big/ \frac{dy}{dx}$.
Geometric meaning. The graph of $f^{-1}$ is the reflection of the graph of $f$ over the line $y = x$. The slope of the tangent to $f$ at $(x_0, y_0)$ is $f'(x_0)$; the slope of the reflected tangent to $f^{-1}$ at $(y_0, x_0)$ is $1/f'(x_0)$.
D.2 Derivatives of Inverse Trigonometric Functions
Using the inverse function theorem and implicit differentiation:
$\arcsin$: Let $y = \arcsin x$, so $\sin y = x$. Differentiate: $\cos y \cdot y' = 1$. Since $\cos y = \sqrt{1 - x^2}$:

$$\frac{d}{dx} \arcsin x = \frac{1}{\sqrt{1 - x^2}}$$

$\arccos$: Similarly, $\frac{d}{dx} \arccos x = -\frac{1}{\sqrt{1 - x^2}}$.
$\arctan$: Let $y = \arctan x$, so $\tan y = x$. Differentiate: $\sec^2 y \cdot y' = 1$. Since $\sec^2 y = 1 + x^2$:

$$\frac{d}{dx} \arctan x = \frac{1}{1 + x^2}$$

For AI. The softplus function is the antiderivative of sigmoid: $\frac{d}{dx} \ln(1 + e^x) = \sigma(x)$. This connects the inverse-function perspective to activation function design.
D.3 Generalized Power Rule via Logarithmic Differentiation
For $y = f(x)^{g(x)}$ (variable base, variable exponent): $\ln y = g \ln f$, so

$$y' = f^g \left( g' \ln f + g \frac{f'}{f} \right)$$

Special cases:
- $g(x) = n$ (constant): gives the power rule
- $f(x) = a$ (constant base): gives the exponential derivative $a^x \ln a$
Appendix E: Optimization Theory Foundations
E.1 First and Second Derivative Tests - Full Treatment
First Derivative Test (complete statement):
Let $f$ be continuous near a critical point $x_0$.
- If $f' > 0$ for $x < x_0$ and $f' < 0$ for $x > x_0$: local maximum at $x_0$.
- If $f' < 0$ for $x < x_0$ and $f' > 0$ for $x > x_0$: local minimum at $x_0$.
- If $f'$ does not change sign: neither (saddle or inflection).
Second Derivative Test:
If $f'(x_0) = 0$:
- $f''(x_0) > 0$ -> local min (concave up)
- $f''(x_0) < 0$ -> local max (concave down)
- $f''(x_0) = 0$ -> inconclusive; examine higher-order derivatives or use the first derivative test
Inconclusive case - higher-order test. If the first $n - 1$ derivatives vanish at $x_0$ and $f^{(n)}(x_0) \neq 0$:
- $n$ odd: neither max nor min (inflection-type)
- $n$ even, $f^{(n)}(x_0) > 0$: local min
- $n$ even, $f^{(n)}(x_0) < 0$: local max
Example. $f(x) = x^4$ at $x_0 = 0$: $f' = f'' = f''' = 0$ at $0$, $f^{(4)}(0) = 24 > 0$, $n = 4$ even -> local (and global) min.
E.2 Rolle's Theorem
Theorem. If $f$ is continuous on $[a, b]$, differentiable on $(a, b)$, and $f(a) = f(b)$, then there exists $c \in (a, b)$ with $f'(c) = 0$.
Proof. By EVT, $f$ attains its maximum and minimum on $[a, b]$. If both are attained at endpoints, then $f$ is constant and $f' \equiv 0$. Otherwise, an interior maximum or minimum exists at some $c$; at an interior extremum, $f'(c) = 0$ (since $f'(c) \neq 0$ would imply $f$ is locally monotone, contradicting the extremum).
Rolle's Theorem is the key lemma for the MVT.
E.3 Gradient Descent - Single Variable View
The gradient descent update for minimizing $f$:

$$x_{t+1} = x_t - \eta f'(x_t)$$

Convergence (quadratic $f$). Let $f(x) = \frac{L}{2} x^2$ (1D version of an $L$-smooth quadratic). Then:

$$x_{t+1} = x_t - \eta L x_t = (1 - \eta L)\,x_t$$

Converges if $|1 - \eta L| < 1$, i.e., $0 < \eta < 2/L$. Optimal $\eta = 1/L$ gives one-step convergence: $x_{t+1} = 0$.
For general smooth $f$ (Taylor + MVT argument):
If $|f''| \le L$ everywhere (Lipschitz gradient), then:

$$f(x_{t+1}) \le f(x_t) - \eta \left(1 - \frac{\eta L}{2}\right) f'(x_t)^2$$

The right side decreases as long as $0 < \eta < 2/L$ and $f'(x_t) \neq 0$ - guaranteeing descent at each step.
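The $(1 - \eta L)$ contraction is directly observable. A minimal sketch on $f(x) = \frac{L}{2}x^2$ with $L = 2$ (the step sizes and iteration count are my choices):

```python
# Gradient descent on f(x) = (L/2) x^2 with L = 2, showing the
# x_{t+1} = (1 - eta*L) x_t contraction and the eta < 2/L condition.
L = 2.0
f_prime = lambda x: L * x

for eta in [0.25, 0.75, 1.25]:        # eta*L = 0.5, 1.5, 2.5
    x = 1.0
    for _ in range(10):
        x = x - eta * f_prime(x)      # x <- (1 - eta*L) x
    print(f"eta={eta}: x after 10 steps = {x:+.4e}")
# eta*L < 1: monotone decay; 1 < eta*L < 2: oscillating decay; eta*L > 2: divergence
```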
Appendix F: Numerical Differentiation - Advanced Analysis
F.1 Error Analysis in Detail
Forward difference error. Taylor expand $f(x+h)$:

$$f(x+h) = f(x) + f'(x)h + \frac{f''(x)}{2}h^2 + \frac{f'''(x)}{6}h^3 + \cdots$$

Solve for $f'(x)$:

$$\frac{f(x+h) - f(x)}{h} = f'(x) + \frac{f''(x)}{2}h + O(h^2)$$

The leading error term is $\frac{f''(x)}{2}h$, so the truncation error is $O(h)$.
Centered difference error. Also expand $f(x-h)$:

$$f(x-h) = f(x) - f'(x)h + \frac{f''(x)}{2}h^2 - \frac{f'''(x)}{6}h^3 + \cdots$$

Subtract:

$$\frac{f(x+h) - f(x-h)}{2h} = f'(x) + \frac{f'''(x)}{6}h^2 + O(h^4)$$

The leading error is $\frac{f'''(x)}{6}h^2$, so the truncation error is $O(h^2)$ - one order better.
Round-off error. In floating-point arithmetic, $f(x)$ is computed with absolute error $\sim \epsilon |f(x)|$. The round-off contribution to the centered difference is:

$$\frac{2\epsilon |f(x)|}{2h} = \frac{\epsilon |f(x)|}{h}$$

Total error:

$$E(h) \approx \frac{|f'''(x)|}{6}h^2 + \frac{\epsilon |f(x)|}{h}$$

Minimize over $h$: setting $E'(h) = 0$ gives:

$$h^* \sim \left( \frac{3\epsilon |f|}{|f'''|} \right)^{1/3} \sim \epsilon^{1/3}$$

For float64 ($\epsilon \approx 2.2 \times 10^{-16}$): $h^* \approx 6 \times 10^{-6}$.
F.2 Richardson Extrapolation
Given two centered-difference approximations $D(h)$ and $D(h/2)$ at step sizes $h$ and $h/2$, eliminate the leading $O(h^2)$ error term:

$$D_{\text{rich}} = \frac{4\,D(h/2) - D(h)}{3}$$

This Richardson-extrapolated formula has error $O(h^4)$ - a massive improvement. Recursive application yields Romberg differentiation with arbitrarily high accuracy.
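A minimal sketch of one extrapolation level, again on $\sin x$ at $x = 1$ (test function and helper names are mine):

```python
import numpy as np

def centered(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

def richardson(f, x, h):
    # (4 D(h/2) - D(h)) / 3 cancels the O(h^2) term, leaving O(h^4) error.
    return (4 * centered(f, x, h / 2) - centered(f, x, h)) / 3

x, true, h = 1.0, np.cos(1.0), 1e-2
print(abs(centered(np.sin, x, h) - true))    # ~1e-5
print(abs(richardson(np.sin, x, h) - true))  # ~1e-10
```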
F.3 Complex-Step Derivative
An alternative that avoids catastrophic cancellation: compute $f$ at $x + ih$ (complex argument):

$$f'(x) \approx \frac{\operatorname{Im} f(x + ih)}{h}$$

For analytic $f$, this has only $O(h^2)$ truncation error with no subtraction cancellation, and allows very small $h$ without accuracy loss. Used in aerospace and scientific computing.
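A minimal sketch using NumPy's complex arithmetic (the test function $f(x) = e^x \sin x$, with $f'(x) = e^x(\sin x + \cos x)$, is my choice):

```python
import numpy as np

# Complex-step derivative: f'(x) ~ Im(f(x + ih)) / h, for analytic f
# implemented with complex-safe operations. No subtraction, no cancellation.
def complex_step(f, x, h=1e-200):
    return np.imag(f(x + 1j * h)) / h

f = lambda x: np.exp(x) * np.sin(x)
true = np.exp(1.0) * (np.sin(1.0) + np.cos(1.0))
print(abs(complex_step(f, 1.0) - true))   # ~1e-16 even with h = 1e-200
```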
Appendix G: Activation Functions - Extended Analysis
G.1 Vanishing Gradient Problem - Quantitative Analysis
In a network with $n$ layers using sigmoid activations, the gradient at the first layer involves a product of $n$ sigmoid derivatives:

$$\frac{\partial L}{\partial w_1} \propto \prod_{k=1}^{n} \sigma'(z_k)\,w_k$$

Since $\sigma'(z) \le 1/4$ for all $z$, the product is bounded:

$$\prod_{k=1}^{n} \sigma'(z_k) \le \left(\frac{1}{4}\right)^n$$

For a 10-layer network: $(1/4)^{10} \approx 10^{-6}$ - the gradient is effectively zero. This is the vanishing gradient problem that made training deep sigmoid networks impractical before ReLU.
ReLU advantage. For active ReLU units ($z > 0$): $\mathrm{ReLU}'(z) = 1$. The product of active ReLU derivatives is exactly 1 - no decay. This is why deep residual networks with ReLU can be trained with hundreds of layers.
G.2 Exploding Gradients
The opposite pathology: if weights are initialized too large, the product $\prod_k \sigma'(z_k)\,w_k$ can exceed 1 exponentially in $n$. Gradients explode, leading to NaN loss.
Mitigations: gradient clipping (rescale $g$ so $\|g\| \le c$), weight initialization (Xavier/He), batch normalization, residual connections.
G.3 Softmax Derivative
The softmax function maps $z \in \mathbb{R}^K$ to a probability vector:

$$p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Partial derivatives:

$$\frac{\partial p_i}{\partial z_j} = p_i(\delta_{ij} - p_j)$$

The Jacobian of softmax is $J_{ij} = p_i(\delta_{ij} - p_j)$, or in matrix form: $J = \operatorname{diag}(p) - pp^\top$.
For backprop. Combined with cross-entropy loss $L = -\sum_i y_i \ln p_i$:

$$\frac{\partial L}{\partial z_j} = p_j - y_j$$

The gradient is simply the difference between the predicted probability and the true label - remarkably clean due to the cancellation between softmax and cross-entropy.
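The $p - y$ formula can be verified against centered differences in a few lines. A minimal sketch (logit and label values are arbitrary illustrations):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for stability (see O.3)
    return e / e.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.5])
y = np.array([0.0, 1.0, 0.0])        # one-hot label
analytic = softmax(z) - y            # the p - y formula

h = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += h; zm[i] -= h
    numeric[i] = (loss(zp, y) - loss(zm, y)) / (2 * h)
print(np.max(np.abs(analytic - numeric)))   # ~1e-10
```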
G.4 GELU vs ReLU - Gradient Behavior
| Property | ReLU | GELU |
|---|---|---|
| Differentiable | No (at 0) | Yes (smooth everywhere) |
| Dead neurons | Yes (gradient = 0 for $z < 0$ always) | No (small gradient for $z < 0$) |
| Gradient at $0$ | $0$ (subgradient) | $1/2$ (smooth) |
| Computation | Trivial ($\mathbb{1}[z > 0]$) | Requires erf evaluation |
| Used in | ResNets, older models | GPT, BERT, modern transformers |
The GELU's smooth gradient in the negative regime allows small updates even for negative pre-activations, which may contribute to the empirical finding that GELU outperforms ReLU on large-scale language tasks.
Appendix H: Historical Context and Mathematical Development
H.1 The Calculus War
The derivative was developed independently in the 1660s-1680s by Newton and Leibniz - one of the most famous priority disputes in mathematics.
| Year | Newton | Leibniz |
|---|---|---|
| 1666 | Develops "method of fluxions" (flux = rate of change) | - |
| 1675 | - | Introduces $\frac{dy}{dx}$ notation |
| 1684 | - | Publishes first calculus paper |
| 1687 | Publishes Principia (uses fluents, not derivatives explicitly) | - |
| 1693 | - | Publishes systematically |
| 1704 | Newton publishes "Of the Quadrature of Curves" | - |
| 1711 | Royal Society initiates plagiarism inquiry | - |
Modern consensus: independent discovery. Newton's notation ($\dot{x}$ for time derivatives) persists in physics; Leibniz's $\frac{dy}{dx}$ dominates mathematics and engineering.
H.2 Rigorous Foundations - 19th Century
The limit-based definition of the derivative was not rigorously formulated until the 1820s:
- Cauchy (1821): Formalized limits and gave the first rigorous definition of the derivative as a limit of a difference quotient.
- Weierstrass (1872): Constructed a continuous-everywhere, differentiable-nowhere function - proving that "most" continuous functions are not differentiable.
- Riemann (1854): Formalized integration; the FTC was proved rigorously.
H.3 Connection to Modern AI
| Year | Development | Calculus Foundation |
|---|---|---|
| 1986 | Rumelhart et al. popularize backpropagation | Chain rule on computation graphs |
| 1989 | LeCun's convolutional backprop | Chain rule through spatial filters |
| 2012 | AlexNet - deep ReLU networks | Avoids vanishing gradient (8.1) |
| 2014 | Adam optimizer | Adaptive learning rates, second-moment gradient estimate |
| 2017 | Transformer attention | Softmax Jacobian (G.3) in attention scores |
| 2018 | BERT (GELU activation) | Smooth-gradient, everywhere-differentiable activations |
| 2022+ | JAX's functional AD | Efficient forward/reverse mode composition |
| 2023+ | LoRA / DoRA (PEFT) | Low-rank gradient updates - rank of Jacobians |
| 2024+ | Mechanistic interpretability | Gradient attribution to individual features |
Appendix I: Applications in Machine Learning - Deep Dives
I.1 Gradient Descent and Its Variants
All first-order optimization algorithms are based on the derivative. The basic update:

$$\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)$$

In the scalar case ($\theta \in \mathbb{R}$): $\theta_{t+1} = \theta_t - \eta L'(\theta_t)$.
Momentum (Polyak 1964):

$$v_{t+1} = \beta v_t + \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta v_{t+1}$$

Accumulates a running average of gradients, dampening oscillations in high-curvature directions.
Adam (Kingma & Ba, 2015):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta\,\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

$m_t$ is a first moment (mean) estimate; $v_t$ is a second moment (uncentered variance) estimate. Dividing by $\sqrt{\hat{v}_t}$ adapts the step size per parameter: large gradient -> small effective step; small gradient -> larger effective step. This diagonal approximation to the inverse curvature is why Adam often outperforms vanilla SGD.
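A minimal scalar Adam sketch directly transcribing the update rules above (the function name `adam_1d` and the test objective are my choices, hyperparameter defaults are the standard ones):

```python
def adam_1d(grad, theta, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Scalar Adam, assuming grad(theta) returns dL/dtheta."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g          # first moment (mean of g)
        v = beta2 * v + (1 - beta2) * g * g      # second moment (mean of g^2)
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= eta * m_hat / (v_hat ** 0.5 + eps)
    return theta

# Minimize L(theta) = (theta - 3)^2; its gradient is 2(theta - 3).
print(adam_1d(lambda th: 2 * (th - 3.0), theta=0.0))   # -> ~3.0
```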
I.2 Automatic Differentiation - Under the Hood
Forward mode. Alongside each function value, propagate the derivative using dual numbers $a + b\epsilon$ where arithmetic follows $\epsilon^2 = 0$ (product rule built in).
Reverse mode. In the forward pass, record a computation graph (tape). In the backward pass, traverse the graph in reverse, accumulating partial derivatives using the chain rule at each node.
Cost comparison:
- Forward mode: one pass per input -> $n$ passes for $n$ inputs, but all outputs in one pass
- Reverse mode (backpropagation): one backward pass -> all $n$ input gradients for a scalar output, regardless of $n$
For neural network training ($n$ = millions of parameters, scalar loss $L$): reverse mode needs one backward pass vs. $n$ forward-mode passes - the reason backpropagation is computationally feasible.
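A minimal dual-number sketch of forward mode (class and field names are mine; only `+` and `*` are implemented, enough for polynomials):

```python
# Forward-mode AD with dual numbers a + b*eps, eps^2 = 0.
# The 'b' slot carries the derivative; __mul__ implements the product rule.
class Dual:
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b            # value, derivative
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.a + o.a, self.b + o.b)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)  # product rule
    __rmul__ = __mul__

# d/dx (x^2 + 3x) at x = 2: seed the derivative slot with 1.
x = Dual(2.0, 1.0)
y = x * x + 3 * x
print(y.a, y.b)   # 10.0 7.0  (value, derivative 2x + 3 = 7)
```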
I.3 Derivative-Based Interpretability
Saliency maps. For an image classification model $f(x)$, the gradient $\frac{\partial f}{\partial x_i}$ measures how sensitive the output is to pixel $i$ - used as a first-order feature importance score.
Input x Gradient. $x_i \cdot \frac{\partial f}{\partial x_i}$ - a refinement that accounts for the magnitude of the feature, not just the sensitivity.
Integrated Gradients (Sundararajan et al. 2017). Computes the path integral of the gradient from a baseline $x'$ to the input:

$$\mathrm{IG}_i(x) = (x_i - x_i') \int_0^1 \frac{\partial f\big(x' + \alpha(x - x')\big)}{\partial x_i}\,d\alpha$$

Satisfies sensitivity and completeness axioms - considered one of the most principled gradient-based attribution methods.
Appendix J: Extended Exercises and Practice Problems
J.1 Warm-Up Drill - Compute Derivatives
Differentiate each function (answers at end of section):
Answers:
J.2 Conceptual Questions
1. True or False. If $f'(x_0) = 0$, then $f$ has a local extremum at $x_0$. (False - consider $f(x) = x^3$ at $x_0 = 0$.)

2. Proof. Show that if $f$ and $f^{-1}$ are differentiable, differentiating the identity $f^{-1}(f(x)) = x$ gives $(f^{-1})'(f(x)) \cdot f'(x) = 1$.

3. Explain. Why does the gradient descent step with $1/L < \eta < 2/L$ lead to oscillatory behavior for a quadratic, while $\eta < 1/L$ gives smooth convergence?

4. Derivation. Starting from $a = \sigma(wx + b)$, derive the gradient of binary cross-entropy loss with respect to $w$.

   Solution. $\frac{\partial L}{\partial w} = (a - y)\,x$.

5. Analysis. Centered finite differences have $O(h^2)$ error, but why do software implementations (like torch.autograd.gradcheck) still use $h = 10^{-3}$ or even $10^{-2}$ rather than the optimal $10^{-5}$?

   Answer. With 32-bit floating point ($\epsilon \approx 10^{-7}$), the optimal step is $h^* \sim \epsilon^{1/3} \approx 5 \times 10^{-3}$. The larger step avoids round-off without sacrificing much truncation accuracy in float32.
J.3 Computational Exercises
Exercise J.1. Implement the forward, backward, and centered finite difference formulas for $f'(x)$. Plot the absolute error vs. $h$ for $f(x) = \sin x$ at $x = 1$, using $h = 10^{-1}, \dots, 10^{-15}$ on a log scale. Observe the optimal region.
Exercise J.2. Implement a simple scalar backpropagation for the function $L = \big(\sigma(wx + b) - y\big)^2$:
- Forward pass: compute $z = wx + b$, $a = \sigma(z)$, $L = (a - y)^2$
- Backward pass: compute $\frac{\partial L}{\partial w}$ analytically via chain rule
- Verify with centered finite differences
Exercise J.3. For the sigmoid function, plot: (a) $\sigma(x)$, (b) $\sigma'(x)$, (c) $\sigma''(x)$. Identify: the inflection point of $\sigma$, the maximum of $\sigma'$, and the zeros of $\sigma''$.
Appendix K: Notation Reference and Quick-Reference Tables
K.1 Derivative Notation - Comparison
| Notation | Author | Reads as | Common usage |
|---|---|---|---|
| $f'(x)$ | Lagrange | "$f$ prime of $x$" | Pure mathematics |
| $\frac{dy}{dx}$ | Leibniz | "$dy$ by $dx$" | Engineering, physics |
| $\dot{x}$ | Newton | "$x$ dot" | Physics (time derivatives) |
| $Df$ | Euler/operator | "D of $f$" | Differential equations |
| $\frac{\partial f}{\partial x}$ | - | "partial $f$ partial $x$" | Multivariable calculus |
| $\nabla f$ | Hamilton | "nabla $f$" or "gradient" | Machine learning, optimization |

In this curriculum: $f'(x)$ for scalar single-variable, $\nabla_\theta L$ for the multi-parameter ML context.
K.2 Differentiation Rules Quick Reference
| Rule | Formula | Notes |
|---|---|---|
| Constant | $\frac{d}{dx} c = 0$ | |
| Identity | $\frac{d}{dx} x = 1$ | |
| Power | $\frac{d}{dx} x^n = n x^{n-1}$ | All real $n$ |
| Constant multiple | $(cf)' = c f'$ | |
| Sum | $(f + g)' = f' + g'$ | |
| Product | $(fg)' = f'g + fg'$ | |
| Quotient | $\left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}$ | $g \neq 0$ |
| Chain | $(f \circ g)' = f'(g(x))\,g'(x)$ | |
| Exponential | $(e^x)' = e^x$ | |
| Natural log | $(\ln x)' = \frac{1}{x}$ | $x > 0$ |
| Sine | $(\sin x)' = \cos x$ | |
| Cosine | $(\cos x)' = -\sin x$ | |
| Arctangent | $(\arctan x)' = \frac{1}{1 + x^2}$ | |
| Inverse | $(f^{-1})'(y) = \frac{1}{f'(x)}$ | $y = f(x)$, $f'(x) \neq 0$ |
K.3 Second Derivative Test Quick Reference
| Condition | Classification |
|---|---|
| $f'(x_0) = 0$, $f''(x_0) > 0$ | Local minimum |
| $f'(x_0) = 0$, $f''(x_0) < 0$ | Local maximum |
| $f'(x_0) = 0$, $f''(x_0) = 0$ | Inconclusive |
| $f''(x) > 0$ (any $x$) | Concave up |
| $f''(x) < 0$ (any $x$) | Concave down |
| $f''$ changes sign at $x_0$ | Inflection point |
K.4 Finite Difference Error Summary
| Method | Formula | Error order | Optimal $h$ (float64) |
|---|---|---|---|
| Forward | $\frac{f(x+h) - f(x)}{h}$ | $O(h)$ | $\sim 10^{-8}$ |
| Backward | $\frac{f(x) - f(x-h)}{h}$ | $O(h)$ | $\sim 10^{-8}$ |
| Centered | $\frac{f(x+h) - f(x-h)}{2h}$ | $O(h^2)$ | $\sim 10^{-5}$ |
| Second derivative | $\frac{f(x+h) - 2f(x) + f(x-h)}{h^2}$ | $O(h^2)$ | $\sim 10^{-4}$ |
| Richardson | Combines two centered | $O(h^4)$ | larger $h$ acceptable |
Appendix L: Implicit Differentiation - Extended Examples
L.1 Folium of Descartes
The curve $x^3 + y^3 = 3axy$ (with $a > 0$) is the folium of Descartes, defined implicitly. Differentiating:

$$3x^2 + 3y^2 \frac{dy}{dx} = 3ay + 3ax \frac{dy}{dx} \quad\Longrightarrow\quad \frac{dy}{dx} = \frac{ay - x^2}{y^2 - ax}$$

Horizontal tangents: $\frac{dy}{dx} = 0 \implies x^2 = ay$. Substituting into the curve: $x = 0$ (the origin) or $x = 2^{1/3} a$, $y = 2^{2/3} a$.
L.2 Ellipse and Normal Lines
The ellipse $\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1$. Differentiating:

$$\frac{2x}{a^2} + \frac{2y}{b^2} \frac{dy}{dx} = 0 \quad\Longrightarrow\quad \frac{dy}{dx} = -\frac{b^2 x}{a^2 y}$$

The normal line (perpendicular to tangent) at $(x_0, y_0)$ has slope $\frac{a^2 y_0}{b^2 x_0}$.
L.3 Implicit Differentiation for Normalizing Flows
In normalizing flows, a bijective function $f: \mathbb{R}^d \to \mathbb{R}^d$ is used. The change-of-variables formula for densities requires the determinant of the Jacobian:

$$p_X(x) = p_Z\big(f(x)\big)\,\left|\det \frac{\partial f}{\partial x}\right|$$

When $f$ is defined implicitly via $F(x, y) = 0$ (e.g., in continuous normalizing flows via an ODE), the implicit function theorem guarantees the Jacobian exists and computes it as:

$$\frac{\partial y}{\partial x} = -\left(\frac{\partial F}{\partial y}\right)^{-1} \frac{\partial F}{\partial x}$$
This is implicit differentiation generalized to the multivariable setting - a key tool in modern generative models (FFJORD, Neural ODEs).
Appendix M: Mean Value Theorem - Consequences and Applications
M.1 Monotonicity via Derivatives
Theorem. If $f'(x) > 0$ for all $x \in (a, b)$, then $f$ is strictly increasing on $(a, b)$.
Proof. Let $x_1 < x_2$ in $(a, b)$. By MVT, there exists $c \in (x_1, x_2)$ with:

$$f(x_2) - f(x_1) = f'(c)(x_2 - x_1) > 0$$

since $f'(c) > 0$ and $x_2 - x_1 > 0$. Thus $f(x_2) > f(x_1)$.
Corollary. $f' \equiv 0$ on $(a, b)$ iff $f$ is constant on $(a, b)$.
M.2 Error Bounds via MVT
Theorem (MVT error bound). If $|f'(x)| \le L$ for all $x \in [a, b]$, then:

$$|f(b) - f(a)| \le L\,|b - a|$$

This is the Lipschitz condition with constant $L$.
Application - gradient descent learning rate. If $|f''| \le L$ (second derivative bounded), then $f'$ is $L$-Lipschitz. The descent lemma states:

$$f\big(x - \eta f'(x)\big) \le f(x) - \eta\left(1 - \frac{L\eta}{2}\right) f'(x)^2$$

ensuring a sufficient decrease at each step when $0 < \eta < 2/L$.
M.3 Cauchy Mean Value Theorem
Theorem. If $f, g$ are continuous on $[a, b]$, differentiable on $(a, b)$, and $g'(x) \neq 0$ on $(a, b)$, then:

$$\frac{f(b) - f(a)}{g(b) - g(a)} = \frac{f'(c)}{g'(c)}$$

for some $c \in (a, b)$.
Application. L'Hôpital's Rule (01) follows from the Cauchy MVT: for $\frac{0}{0}$ forms,

$$\lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f'(x)}{g'(x)}$$

provided the last limit exists.
Appendix N: Parametric and Polar Curves
N.1 Derivatives of Parametric Curves
A parametric curve $(x(t), y(t))$ defines $y$ as an implicit function of $x$ via the parameter $t$. The derivative:

$$\frac{dy}{dx} = \frac{dy/dt}{dx/dt}$$

provided $\frac{dx}{dt} \neq 0$.
Second derivative:

$$\frac{d^2 y}{dx^2} = \frac{\frac{d}{dt}\left(\frac{dy}{dx}\right)}{dx/dt}$$

Example. Cycloid: $x = t - \sin t$, $y = 1 - \cos t$.
$\frac{dx}{dt} = 1 - \cos t$, $\frac{dy}{dt} = \sin t$, so $\frac{dy}{dx} = \frac{\sin t}{1 - \cos t}$.
At $t = \pi/2$: $\frac{dy}{dx} = 1$ - a 45° angle. As $t \to 0$: $\frac{dy}{dx} \to \infty$ - vertical tangent at the cusp.
N.2 Applications to Attention Curves
In transformer attention, the attention weight for query-key pair $(i, j)$ with scores $s_{ij}$ and inverse temperature $\beta$ is:

$$a_{ij} = \frac{e^{\beta s_{ij}}}{\sum_k e^{\beta s_{ik}}}$$

The derivative of the attention weight with respect to the scale factor $\beta$:

$$\frac{\partial a_{ij}}{\partial \beta} = a_{ij}\left(s_{ij} - \sum_k a_{ik} s_{ik}\right)$$

This is the derivative of a softmax composition - it shows that attention sharpens (moves weight to higher-scored keys) as $\beta$ increases, consistent with the "temperature" interpretation from 01.
Appendix O: Numerically Stable Derivative Computations
O.1 Sigmoid: Numerically Stable Computation
The naive form $1/(1 + e^{-x})$ overflows for large negative $x$ ($e^{-x}$ exceeds the float64 max for $x \lesssim -709$).
Stable implementation:
```python
import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid: pick the branch whose exp() stays bounded."""
    return np.where(x >= 0,
                    1 / (1 + np.exp(-x)),
                    np.exp(x) / (1 + np.exp(x)))
```

Both branches are mathematically equivalent; the first is used for $x \ge 0$ (where $e^{-x}$ is safe), the second for $x < 0$ (where $e^x$ is safe).
Derivative - stable form:
```python
def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1 - s)
```

No additional overflow risk once $\sigma(x)$ is stably computed.
O.2 Log-Sigmoid: Avoiding Catastrophic Cancellation
For the log-sigmoid $\log \sigma(x) = -\log(1 + e^{-x})$:
- For large positive $x$: $\log \sigma(x) \approx -e^{-x}$ - no cancellation.
- For large negative $x$: $\log \sigma(x) \approx x$ - use this directly.
Stable:
```python
import numpy as np

def log_sigmoid(x):
    return -np.logaddexp(0, -x)   # = -log(e^0 + e^(-x)) = -log(1 + e^(-x))
```

np.logaddexp(a, b) computes $\log(e^a + e^b)$ stably by extracting the maximum:

$$\log(e^a + e^b) = \max(a, b) + \log\!\left(1 + e^{-|a - b|}\right)$$
Naive softmax overflows for large logits. Standard fix:

$$p_i = \frac{e^{z_i - \max_j z_j}}{\sum_k e^{z_k - \max_j z_j}}$$

Subtracting the maximum shifts all logits to $\le 0$, making every exponential $\le 1$. The softmax value is unchanged since the normalization cancels the common factor $e^{-\max_j z_j}$.
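A minimal sketch of the subtract-max trick on logits that would overflow naively (the specific values are illustrative):

```python
import numpy as np

def softmax_stable(z):
    shifted = z - np.max(z)          # all entries <= 0, so exp() <= 1
    e = np.exp(shifted)
    return e / e.sum()

z = np.array([1000.0, 1001.0, 1002.0])
print(softmax_stable(z))             # [0.090 0.245 0.665] - finite, sums to 1
# np.exp(z) alone would overflow to inf and produce nan probabilities.
```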
These numerical stability techniques are directly connected to the limiting behavior of $e^x$ and $\ln(1 + x)$ analyzed in 01.
Appendix P: Derivatives in Probability and Statistics
P.1 Score Function and Fisher Information
In statistics, the score function is the derivative of the log-likelihood with respect to the parameter $\theta$:

$$s(\theta) = \frac{\partial}{\partial \theta} \ln p(x; \theta)$$

Properties:
- $\mathbb{E}[s(\theta)] = 0$ (score has zero mean)
- $\operatorname{Var}[s(\theta)] = I(\theta)$ (Fisher information)
Fisher information:

$$I(\theta) = \mathbb{E}\left[\left(\frac{\partial \ln p}{\partial \theta}\right)^2\right] = -\mathbb{E}\left[\frac{\partial^2 \ln p}{\partial \theta^2}\right]$$

The second equality (Fisher's identity) relates the variance of the first derivative to the expected value of the second derivative - a deep connection between information geometry and calculus.
P.2 KL Divergence Gradient
The KL divergence $D_{\mathrm{KL}}(p \,\|\, q_\theta)$ appears throughout ML. Its derivative with respect to a parameter $\theta$ of $q$:

$$\frac{\partial}{\partial \theta} D_{\mathrm{KL}}(p \,\|\, q_\theta) = -\mathbb{E}_{x \sim p}\left[\frac{\partial}{\partial \theta} \ln q_\theta(x)\right]$$

This is the negative expected score under $p$. In variational inference (ELBO optimization), this derivative structure underlies the "reparameterization trick" for training VAEs.
P.3 Maximum Likelihood - First Order Conditions
For i.i.d. data $x_1, \dots, x_n$, MLE maximizes:

$$\ell(\theta) = \sum_{i=1}^{n} \ln p(x_i; \theta)$$

The first-order condition $\ell'(\hat\theta) = 0$ defines the MLE. For the Gaussian with unknown mean: $\hat\mu = \frac{1}{n} \sum_i x_i$.
For neural networks, MLE is equivalent to minimizing cross-entropy loss - the derivative-based optimization (SGD, Adam) maximizes the likelihood of the training data.
Appendix Q: Special Topics - Subgradients and Non-Smooth Analysis
Q.1 Subgradients
For a convex function $f$ that may not be differentiable, a subgradient at $x_0$ is any $g$ satisfying:

$$f(x) \ge f(x_0) + g\,(x - x_0) \quad \text{for all } x$$

The subdifferential $\partial f(x_0)$ is the set of all subgradients.
Examples:
- $f(x) = |x|$: $\partial f(0) = [-1, 1]$; $\partial f(x) = \{\operatorname{sign}(x)\}$ for $x \neq 0$.
- ReLU: $\partial f(0) = [0, 1]$; $\{0\}$ for $x < 0$; $\{1\}$ for $x > 0$.
Q.2 Subgradient Descent
For non-smooth convex $f$, the subgradient descent update uses any $g_t \in \partial f(x_t)$:

$$x_{t+1} = x_t - \eta_t\,g_t$$

Unlike gradient descent for smooth functions, subgradient descent does NOT necessarily decrease $f$ at each step. Convergence requires diminishing step sizes: $\sum_t \eta_t = \infty$, $\sum_t \eta_t^2 < \infty$ (Robbins-Monro conditions from 01).
Q.3 Proximal Operators
For non-smooth regularization (L1 sparsity), instead of subgradient descent, the proximal gradient method is often used. The proximal operator of $f$ at $v$ with step size $\lambda$:

$$\operatorname{prox}_{\lambda f}(v) = \arg\min_x \left( f(x) + \frac{1}{2\lambda} (x - v)^2 \right)$$

For $f = |\cdot|$: $\operatorname{prox}_{\lambda|\cdot|}(v) = \operatorname{sign}(v)\max(|v| - \lambda, 0)$ - the soft-thresholding operator. This is the exact solution to L1-penalized regression (LASSO).
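A minimal sketch of soft thresholding (function name is mine), showing how entries within $\lambda$ of zero are zeroed out - the source of LASSO's sparsity:

```python
import numpy as np

def soft_threshold(v, lam):
    """prox of lam*|.|: shrink toward zero, clip small entries to exactly 0."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

v = np.array([-2.0, -0.3, 0.0, 0.3, 2.0])
print(soft_threshold(v, 0.5))   # [-1.5  0.  0.  0.  1.5]
```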
Appendix R: Key Proofs in Full
R.1 Proof: Product Rule
Theorem. If $f$ and $g$ are differentiable at $x$, then $(fg)'(x) = f'(x)g(x) + f(x)g'(x)$.
Proof.

$$(fg)'(x) = \lim_{h \to 0} \frac{f(x+h)g(x+h) - f(x)g(x)}{h}$$

Add and subtract $f(x+h)g(x)$:

$$= \lim_{h \to 0} \left[ f(x+h) \cdot \frac{g(x+h) - g(x)}{h} + \frac{f(x+h) - f(x)}{h} \cdot g(x) \right]$$

Since $f$ is differentiable at $x$, it is continuous there, so $f(x+h) \to f(x)$:

$$(fg)'(x) = f(x)g'(x) + f'(x)g(x) \qquad \blacksquare$$
R.2 Proof: Chain Rule (via auxiliary function)
Theorem. If $g$ is differentiable at $x$ and $f$ is differentiable at $g(x)$, then $(f \circ g)'(x) = f'(g(x))\,g'(x)$.
Proof. Define the auxiliary function:

$$\varphi(u) = \begin{cases} \dfrac{f(u) - f(g(x))}{u - g(x)} & u \neq g(x) \\[2mm] f'(g(x)) & u = g(x) \end{cases}$$

By differentiability of $f$ at $g(x)$: $\lim_{u \to g(x)} \varphi(u) = f'(g(x))$, so $\varphi$ is continuous at $g(x)$.
Let $u = g(x+h)$. Then for $h \neq 0$:

$$\frac{f(g(x+h)) - f(g(x))}{h} = \varphi\big(g(x+h)\big) \cdot \frac{g(x+h) - g(x)}{h}$$

(This holds even when $g(x+h) = g(x)$ for some values of $h$, since both sides are 0 in that case.)
Taking $h \to 0$: $\varphi(g(x+h)) \to f'(g(x))$ (by continuity of $\varphi$ and $g$), so the product converges, and:

$$(f \circ g)'(x) = f'(g(x))\,g'(x) \qquad \blacksquare$$
R.3 Proof: Differentiable Implies Continuous
Theorem. If $f$ is differentiable at $a$, then $f$ is continuous at $a$.
Proof.

$$\lim_{x \to a} \left[ f(x) - f(a) \right] = \lim_{x \to a} \frac{f(x) - f(a)}{x - a} \cdot (x - a) = f'(a) \cdot 0 = 0$$

Therefore $\lim_{x \to a} f(x) = f(a)$, which is the definition of continuity at $a$. $\blacksquare$
R.4 Proof: $\frac{d}{dx} \sin x = \cos x$
Proof. Using the sine addition formula and two fundamental limits from 01:

$$\frac{\sin(x+h) - \sin x}{h} = \sin x \cdot \frac{\cos h - 1}{h} + \cos x \cdot \frac{\sin h}{h}$$

As $h \to 0$: $\frac{\cos h - 1}{h} \to 0$ (by the squeeze theorem) and $\frac{\sin h}{h} \to 1$:

$$\frac{d}{dx} \sin x = \cos x \qquad \blacksquare$$
Appendix S: Worked Examples - Chain Rule in Neural Networks
S.1 Single-Layer Sigmoid Network - Full Backprop
Setup. Input $x$, single weight $w$, bias $b$, sigmoid activation, MSE loss:

$$z = wx + b, \qquad a = \sigma(z), \qquad L = \tfrac{1}{2}(a - y)^2$$

Forward pass (left to right):
| Variable | Value |
|---|---|
| $z$ | $wx + b$ |
| $a$ | $\sigma(z)$ |
| $L$ | $\tfrac{1}{2}(a - y)^2$ |

Backward pass (chain rule, right to left):

$$\frac{\partial L}{\partial a} = a - y, \quad \frac{\partial a}{\partial z} = a(1-a), \quad \frac{\partial L}{\partial w} = (a - y)\,a(1-a)\,x, \quad \frac{\partial L}{\partial b} = (a - y)\,a(1-a)$$

The structure: the error signal $(a - y)$ is modulated by the local gradient $a(1-a)$ before being distributed to the weight (scaled by input $x$) and bias.
S.2 Two-Layer ReLU Network
Input $x$, hidden layer $h = \mathrm{ReLU}(w_1 x + b_1)$, output $\hat{y} = w_2 h + b_2$, loss $L = \tfrac{1}{2}(\hat{y} - y)^2$.
Gradients:

$$\frac{\partial L}{\partial w_2} = (\hat{y} - y)\,h, \qquad \frac{\partial L}{\partial w_1} = (\hat{y} - y)\,w_2\,\mathbb{1}[w_1 x + b_1 > 0]\,x$$

The indicator $\mathbb{1}[\cdot]$ is the ReLU subgradient: if the pre-activation was negative, the gradient is blocked (dead neuron).
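A minimal sketch checking these two gradients against centered differences (all parameter values are arbitrary illustrations):

```python
# Scalar two-layer ReLU network: analytic gradients vs. finite differences.
def forward(w1, b1, w2, b2, x, y):
    z1 = w1 * x + b1
    h = max(z1, 0.0)                   # ReLU
    y_hat = w2 * h + b2
    return 0.5 * (y_hat - y) ** 2

w1, b1, w2, b2, x, y = 0.5, 0.1, -1.5, 0.2, 2.0, 1.0
z1 = w1 * x + b1
h = max(z1, 0.0)
err = (w2 * h + b2) - y

grad_w2 = err * h                                    # dL/dw2
grad_w1 = err * w2 * (1.0 if z1 > 0 else 0.0) * x    # dL/dw1, ReLU gate

eps = 1e-6
num_w2 = (forward(w1, b1, w2 + eps, b2, x, y) - forward(w1, b1, w2 - eps, b2, x, y)) / (2 * eps)
num_w1 = (forward(w1 + eps, b1, w2, b2, x, y) - forward(w1 - eps, b1, w2, b2, x, y)) / (2 * eps)
print(grad_w2, num_w2)   # match to ~1e-9
print(grad_w1, num_w1)
```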
S.3 Gradient Flow Visualization
```
BACKPROPAGATION GRADIENT FLOW

 INPUT                                                      LOSS
   x --> [ z = wx+b ] --> [ a = sigma(z) ] --> [ L = (1/2)(a-y)^2 ]

   Forward  ------------------------------------------------------>
   <------------------------------------------------------  Backward

   dL/dw = (a-y) * a(1-a) * x                  dL/da = a - y

Each node multiplies its local gradient into the flowing signal.
This IS the chain rule: dL/dw = (dL/da)(da/dz)(dz/dw)
```
S.4 Gradient Tape - PyTorch Analogy
In PyTorch, the computation graph is built dynamically during the forward pass. .backward() traverses it in reverse, applying the chain rule at each node:
```python
import torch

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
z = w * x + 1.0               # z = 7.0
a = torch.sigmoid(z)          # a ~= 0.999
L = 0.5 * (a - 0.5) ** 2      # L ~= 0.125
L.backward()                  # reverse-mode AD: computes dL/dw and dL/dx
print(w.grad)                 # dL/dw = (a - 0.5) * a * (1 - a) * x
print(x.grad)                 # dL/dx = (a - 0.5) * a * (1 - a) * w
```

This is the chain rule automated - the "tape" records operations in the forward pass, then .backward() applies our formulas in reverse.
Appendix T: Summary and Connections
T.1 Core Theorem Summary
| Theorem | Statement | Key Condition | Consequence |
|---|---|---|---|
| Differentiable -> Continuous | $f'(a)$ exists -> $f$ continuous at $a$ | Differentiability | Justifies assuming continuity in proofs |
| Power Rule | $(x^n)' = n x^{n-1}$ | Any real $n$ | Foundation of all polynomial differentiation |
| Product Rule | $(fg)' = f'g + fg'$ | $f, g$ differentiable | Essential for activation analysis |
| Chain Rule | $(f \circ g)' = f'(g(x))\,g'(x)$ | Both differentiable | Backbone of backpropagation |
| Mean Value Theorem | $f'(c) = \frac{f(b) - f(a)}{b - a}$ | Continuous $[a,b]$, diff $(a,b)$ | Lipschitz bounds, descent lemmas |
| First Derivative Test | Sign change of $f'$ classifies extrema | $f$ continuous at critical point | Optimization - finding local minima |
| Second Derivative Test | $f''$ sign at critical point | $f'(x_0) = 0$, $f''(x_0)$ exists | Confirms min/max; generalizes to PD Hessian |
| Inverse Function Theorem | $(f^{-1})'(y_0) = 1 / f'(x_0)$ | $f'(x_0) \neq 0$ | Derives inverse trig derivatives; flows |
T.2 Dependency Graph - All Concepts in This Section
```
DERIVATIVE DEPENDENCY MAP

Limit (01) --> Derivative Definition
                      |
      +---------------+----------------+
      |               |                |
 Constant/Power   Product Rule     Chain Rule
 Sum/Difference   Quotient Rule        |
      |               |                |
      +------ Differentiation Rules ---+
                      |
      +---------------+--------------------+
      |               |                    |
 Higher Derivatives   Implicit Diff.   Activation f'
 Concavity/Inflection Related Rates    sigma', ReLU', GELU'
      |                                    |
 Critical Points /                    Vanishing Gradients
 Extrema / MVT                        Backpropagation
      |                                    |
      +----- Gradient Descent Theory ------+ --> 08-Optimization
```
T.3 What to Review if You're Stuck
| Struggling with | Review |
|---|---|
| Chain rule applications | 4.1-4.2, Appendix B |
| Activation function derivatives | 8.1-8.5, Appendix G |
| Critical point classification | 7.1-7.3 |
| Numerical differentiation step size | 9.1-9.2, Appendix F |
| Backpropagation derivation | 4.3, Appendix S |
| Implicit differentiation | 5.1, Appendix L |
| Why ReLU avoids vanishing gradients | Appendix G.1 |
| Softmax gradient computation | Appendix G.3 |
| Adam optimizer intuition | Appendix I.1 |
| Gradient checking in code | 9.3 |
Appendix U: Practice Problems - Solutions
U.1 Full Solutions: Section 11 Exercises
Exercise 1 - Basic rules.
(a) $f(x) = 4x^3 - 2x + 7$:

$$f'(x) = 12x^2 - 2$$

(b) $f(x) = x^2 e^x$. Product rule:

$$f'(x) = 2x e^x + x^2 e^x = (x^2 + 2x)\,e^x$$

(c) $f(x) = \frac{x}{1 + x^2}$. Quotient rule (hi $= x$, lo $= 1 + x^2$):

$$f'(x) = \frac{(1 + x^2) - x \cdot 2x}{(1 + x^2)^2} = \frac{1 - x^2}{(1 + x^2)^2}$$

Exercise 2 - Chain rule.
(a) $\frac{d}{dx} \sin(x^2) = 2x \cos(x^2)$
(b) $\frac{d}{dx} e^{-3x} = -3 e^{-3x}$
(c) $\frac{d}{dx} \ln(1 + x^2) = \frac{2x}{1 + x^2}$
(d) $\frac{d}{dx} \sigma(2x) = 2\,\sigma(2x)\big(1 - \sigma(2x)\big)$
Exercise 3 - Implicit differentiation: $x^3 + y^3 = 3axy$.
Differentiate:

$$3x^2 + 3y^2 \frac{dy}{dx} = 3ay + 3ax \frac{dy}{dx}$$

Collect $\frac{dy}{dx}$:

$$\frac{dy}{dx}\,(y^2 - ax) = ay - x^2 \quad\Longrightarrow\quad \frac{dy}{dx} = \frac{ay - x^2}{y^2 - ax}$$

Exercise 4 - Sigmoid derivative.
(a) Derivation:

$$\sigma'(z) = \frac{d}{dz}\left(1 + e^{-z}\right)^{-1} = \frac{e^{-z}}{(1 + e^{-z})^2}$$

Factor: $\sigma'(z) = \sigma(z) \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\big(1 - \sigma(z)\big)$.
Exercise 6 - MVT proof: $|\sin x| \le |x|$.
The sine function is differentiable with $|\cos c| \le 1$ for all $c$. By the MVT, for any $x \neq 0$ there exists $c$ between $0$ and $x$ with:

$$\sin x - \sin 0 = \cos(c)\,(x - 0)$$

Therefore:

$$|\sin x| = |\cos c|\,|x| \le |x|$$

(The inequality also holds for $x = 0$: $\sin 0 = 0$.)
Exercise 8 - Backpropagation chain rule:
Forward: $z_1 = w_1 x$, $a_1 = \sigma(z_1)$, $z_2 = w_2 a_1$, $a_2 = \sigma(z_2)$, $L = \frac{1}{2}(a_2 - y)^2$.

$$\frac{\partial L}{\partial w_2} = (a_2 - y)\,\sigma'(z_2)\,a_1, \qquad \frac{\partial L}{\partial w_1} = (a_2 - y)\,\sigma'(z_2)\,w_2\,\sigma'(z_1)\,x$$

This reproduces the four-factor chain: loss sensitivity x downstream weight x local gate x input.
Appendix V: Connections to Later Chapters
V.1 What Integration Needs from This Section
The next section (03-Integration) uses:
- Antiderivatives: integration reverses differentiation. Every differentiation formula gives an integration formula: $\int f'(x)\,dx = f(x) + C$.
- Fundamental Theorem of Calculus (Part 2): if $F' = f$, then $\int_a^b f(x)\,dx = F(b) - F(a)$. This requires knowing derivatives of common functions.
- Substitution rule: the reverse chain rule. If $u = g(x)$, then $\int f(g(x))\,g'(x)\,dx = \int f(u)\,du$.
Every derivative formula in 3.5 has a corresponding integral formula. Master both sides.
V.2 What Series Needs from This Section
04-Series-and-Sequences uses:
- Higher-order derivatives: Taylor coefficients $\frac{f^{(n)}(x_0)}{n!}$ require computing many derivatives.
- Linear approximation: the first-order Taylor expansion is a direct extension of 5.3.
- Error bounds: the Taylor remainder involves $f^{(n+1)}$ at some intermediate point (MVT applied to integrals).
V.3 What Multivariate Calculus Needs from This Section
05-Multivariate Calculus extends every concept here:
| 02 Concept | 05 Extension |
|---|---|
| Derivative $f'(x)$ | Gradient $\nabla f$ |
| Chain rule | Chain rule with Jacobians: $J_{f \circ g} = J_f\,J_g$ |
| Second derivative $f''$ | Hessian $H$ |
| Inflection point | Saddle point in $\mathbb{R}^n$ |
| Linear approximation $f(x_0) + f'(x_0)(x - x_0)$ | $f(x_0) + \nabla f(x_0)^\top (x - x_0)$ |
| $f' = 0$ at extremum | $\nabla f = 0$ at extremum |
The scalar theory developed here is the essential foundation. Every formula in 05 will look familiar with vector notation substituted.
V.4 What Optimization Needs from This Section
08-Optimization uses directly:
- Gradient descent theory (7.3 MVT, descent lemma)
- Critical point classification (saddles dominate in high dimensions)
- Chain rule for backpropagation
- Adam optimizer - second moment as curvature estimate
- Gradient checking (9.3) for debugging
Appendix W: Glossary
| Term | Definition |
|---|---|
| Derivative | $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$ - instantaneous rate of change |
| Differentiable | The derivative exists and is finite at the point |
| Smooth ($C^\infty$) | All derivatives of all orders exist and are continuous |
| Critical point | $x_0$ where $f'(x_0) = 0$ or $f'$ undefined |
| Local minimum | $f(x_0) \le f(x)$ for all $x$ near $x_0$ |
| Concave up | $f'' > 0$ - bowl shape, gradient is increasing |
| Inflection point | Where $f''$ changes sign |
| Subgradient | Generalized derivative for non-smooth convex functions |
| Lipschitz constant | $L$ such that $\lvert f(x) - f(y)\rvert \le L \lvert x - y \rvert$ |
| Forward difference | $\frac{f(x+h) - f(x)}{h}$ - $O(h)$ approximation to $f'(x)$ |
| Centered difference | $\frac{f(x+h) - f(x-h)}{2h}$ - $O(h^2)$ approximation to $f'(x)$ |
| Automatic differentiation | Exact derivative computation by applying chain rule to elementary ops |
| Backpropagation | Reverse-mode AD on neural network computation graph |
| Vanishing gradient | Product of many small derivatives -> exponential decay |
| Score function | $\frac{\partial}{\partial \theta} \ln p(x; \theta)$ - derivative of log-likelihood |
Appendix X: Additional Examples - Differentiating Through Compositions
X.1 Softplus as Smooth ReLU
The softplus function is a smooth approximation to ReLU.
Derivative:
Softplus's derivative is the sigmoid! This means:
- Softplus is always increasing ( for all )
- Its second derivative: - always concave up
- Softplus approaches ReLU as temperature :
X.2 SiLU / Swish
Swish: $\operatorname{swish}(x) = x\,\sigma(x)$.
Derivative (product rule):

$$\operatorname{swish}'(x) = \sigma(x) + x\,\sigma(x)\big(1 - \sigma(x)\big)$$

At $x = 0$: $\operatorname{swish}'(0) = 1/2$.
Swish is non-monotone: it has a local minimum at $x \approx -1.28$ (where $\operatorname{swish}'(x) = 0$). This property, combined with self-gating, is hypothesized to improve gradient flow in deep networks.
X.3 Mish
Mish: $\operatorname{mish}(x) = x \tanh\!\big(\operatorname{softplus}(x)\big) = x \tanh\!\big(\ln(1 + e^x)\big)$.
Derivative (product rule + chain rule):

$$\operatorname{mish}'(x) = \tanh\!\big(\ln(1 + e^x)\big) + x\left(1 - \tanh^2\!\big(\ln(1 + e^x)\big)\right)\sigma(x)$$

Mish is smooth, non-monotone (like Swish), and has shown improvements over ReLU in some image classification benchmarks. Its derivative involves three activations composed - a showcase for the chain rule.
X.4 Gradient of Cross-Entropy Loss - Full Derivation
For $K$-class classification with softmax output $p = \operatorname{softmax}(z)$ and one-hot label $y$:

$$L = -\sum_{i=1}^{K} y_i \ln p_i$$

Using the softmax Jacobian from G.3:

$$\frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial p_i} \frac{\partial p_i}{\partial z_j} = -\sum_i \frac{y_i}{p_i}\,p_i(\delta_{ij} - p_j) = -y_j + p_j \sum_i y_i = p_j - y_j$$

(since $\sum_i y_i = 1$ for a one-hot vector). This elegant result - gradient = predicted probability minus true probability - is why the combined softmax + cross-entropy layer is computationally convenient and numerically stable.
Appendix Y: Connecting to the Broader Curriculum
Y.1 Prerequisites Review
Before using the material in this section, ensure you are solid on:
- Limit definition (01 §2): $\lim_{h \to 0}$ is the backbone of the derivative definition
- Fundamental limits (01 §3): $\lim_{h \to 0} \frac{e^h - 1}{h} = 1$ is used to derive $(e^x)' = e^x$
- Continuity (01 §6): differentiability implies continuity; you need continuity to apply EVT and FTC
- Algebra: polynomial factoring, rational expression manipulation - needed for quotient rule applications
- Exponential/log identities: $e^{a+b} = e^a e^b$, $\ln(ab) = \ln a + \ln b$ - used in logarithmic differentiation
Y.2 Key Formulas from This Section (Memorize These)

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}, \qquad (fg)' = f'g + fg', \qquad \left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}$$

$$\big(f(g(x))\big)' = f'(g(x))\,g'(x), \qquad \sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big), \qquad \frac{d}{dz}\tanh z = 1 - \tanh^2 z$$

These formulas appear throughout the rest of the curriculum, in optimization, probability, and all applied sections.
This section is part of the Math for LLMs curriculum. For issues or contributions, see CONTRIBUTING.md.