Math for LLMs

Derivatives and Differentiation


"The derivative is the heart of calculus. Everything else - integrals, series, differential equations - is built around it."

  • Michael Spivak, Calculus (1967)

Overview

The derivative formalizes the intuition of instantaneous rate of change. Where a limit asks "what value does f(x) approach?", a derivative asks "how fast is f changing right now?" This single idea - the slope of the tangent line, the instantaneous velocity, the marginal cost - is the load-bearing concept of all continuous optimization.

In machine learning, every gradient computation is a derivative. The backpropagation algorithm is the chain rule applied systematically to a computation graph. The vanishing gradient problem is a consequence of activation function derivatives approaching zero. The Adam optimizer's adaptive learning rates are functions of accumulated squared derivatives. Understanding derivatives at the mathematical level - not just as "the thing PyTorch computes" - is essential for debugging, designing, and innovating in modern AI.

This section develops single-variable derivatives from the limit definition through all standard rules, applies them to activation functions and computation graphs, and connects numerical differentiation to practical gradient checking.

Prerequisites

  • Limits and continuity - 04/01-Limits-and-Continuity: epsilon-delta definition, limit laws, continuity
  • Functions: domain, range, composition, inverse
  • Algebra: polynomial manipulation, exponential and logarithmic identities
  • Trigonometry: \sin x, \cos x and their basic identities

Companion Notebooks

| Notebook | Description |
| --- | --- |
| theory.ipynb | Interactive: derivative definition, rules, activation functions, backprop, numerical differentiation |
| exercises.ipynb | 10 graded exercises from power rule to backpropagation and gradient checking |

Learning Objectives

After completing this section, you will be able to:

  1. Define the derivative as a limit and verify it for elementary functions from first principles
  2. Apply the power, product, quotient, and chain rules fluently
  3. Differentiate composite, implicit, and inverse functions
  4. Compute derivatives of all standard activation functions and explain their ML significance
  5. State and prove the Mean Value Theorem and apply it to optimization arguments
  6. Classify critical points using first and second derivative tests
  7. Implement one-sided and centered finite difference approximations with correct error order
  8. Perform gradient checking to validate autodiff implementations
  9. Trace backpropagation through a scalar computation graph using the chain rule
  10. Connect derivative concepts to vanishing gradients, dying ReLU, and learning rate behavior

1. Intuition

1.1 What a Derivative Measures

Consider a car's position s(t) at time t. The average velocity over the interval [t, t+h] is:

\bar{v} = \frac{s(t+h) - s(t)}{h}

As h \to 0, this average approaches the instantaneous velocity at time t. That limit - if it exists - is the derivative s'(t).

Geometrically, the average velocity is the slope of a secant line through (t, s(t)) and (t+h, s(t+h)). As h \to 0, the secant approaches the tangent line at (t, s(t)).

DERIVATIVE: FROM SECANT TO TANGENT

  [Diagram: the secant line through (x, f(x)) and (x+h, f(x+h)), with slope
  [f(x+h)-f(x)]/h, pivots into the tangent line at x (slope f'(x)) as h shrinks.]

  As h -> 0: secant -> tangent, average rate -> instantaneous rate.

The derivative measures sensitivity: how much does output change per unit change in input? If f'(a) = 5, a small nudge \delta in the input produces a change of approximately 5\delta in the output.

1.2 Historical Motivation

The derivative was invented twice - independently and almost simultaneously - by two of history's greatest mathematicians, triggering a bitter priority dispute that divided European mathematics for a generation.

Isaac Newton (1665-1666) developed "the method of fluxions" during the Great Plague, when Cambridge was closed and Newton worked alone at Woolsthorpe Manor. He called the derivative a "fluxion" - the rate of flow of a "fluent" quantity - and used the notation \dot{x} (dot notation still used in physics). Newton did not publish for decades.

Gottfried Wilhelm Leibniz (1675) independently developed calculus in Paris, motivated by geometric problems. He introduced the notation dy/dx and \int (still universal today). He published first, in 1684.

The Royal Society's 1713 judgment in favor of Newton - almost certainly biased - poisoned British-Continental mathematical relations for over a century, isolating British mathematics from advances made by the Bernoullis, Euler, and Lagrange.

| Year | Event |
| --- | --- |
| 1665 | Newton develops fluxions (unpublished) |
| 1675 | Leibniz develops calculus independently |
| 1684 | Leibniz publishes first |
| 1687 | Newton publishes Principia (uses fluxions implicitly) |
| 1713 | Royal Society (controlled by Newton) rules for Newton |
| 1821 | Cauchy formalizes with limits (modern foundation) |
| 1861 | Weierstrass gives epsilon-delta definition |

Lagrange (1797) introduced the prime notation f'(x) and attempted an algebraic foundation via power series. Cauchy (1821) finally gave the limit-based definition we use today.

1.3 Why Derivatives Are Central to AI

Every parameter update in every gradient-based learning algorithm is computed via derivatives.

Backpropagation (Rumelhart, Hinton, Williams, 1986) is the chain rule applied to a computation graph. Given a loss \mathcal{L} and parameters \theta_1, \ldots, \theta_n, the update \theta_i \leftarrow \theta_i - \alpha\,\partial\mathcal{L}/\partial\theta_i requires computing all partial derivatives efficiently - which backprop does in O(\text{forward pass}) time.

Vanishing gradients (Hochreiter, 1991; Hochreiter & Schmidhuber, 1997): sigmoid activations have \sigma'(x) \leq 1/4. In a 20-layer network, the chain rule multiplies 20 such factors: (1/4)^{20} \approx 10^{-12}. Early layers receive effectively zero gradient - they cannot learn. The fix (ReLU, residual connections, layer norm) is entirely about derivative behavior.

Adam optimizer (Kingma & Ba, 2014): the adaptive learning rate \alpha/(\sqrt{\hat{v}_t} + \varepsilon), where \hat{v}_t is the exponential moving average of squared gradients (\partial\mathcal{L}/\partial\theta)^2. Every component is derivative-driven.

Neural architecture search, meta-learning, differentiable programming: modern AI increasingly treats the architecture itself as differentiable - requiring derivatives through discrete operations via straight-through estimators, Gumbel-softmax, and continuous relaxations.


2. Formal Definition

2.1 The Derivative as a Limit

Recall: The limit \lim_{x \to a} f(x) was defined rigorously in 01-Limits-and-Continuity. The derivative is a specific instance of that definition.

Definition (Derivative). The derivative of f at a, written f'(a), is:

f'(a) = \lim_{h \to 0} \frac{f(a+h) - f(a)}{h}

provided this limit exists (is finite). If it exists, f is differentiable at a.

The expression \frac{f(a+h)-f(a)}{h} is called a difference quotient. It is the slope of the secant line through (a, f(a)) and (a+h, f(a+h)).

Example: f(x) = x^2.

f'(a) = \lim_{h \to 0} \frac{(a+h)^2 - a^2}{h} = \lim_{h \to 0} \frac{2ah + h^2}{h} = \lim_{h \to 0}(2a + h) = 2a

So (x^2)' = 2x. The derivative of x^2 at a = 3 is f'(3) = 6.

Example: f(x) = \sqrt{x} at a > 0.

f'(a) = \lim_{h \to 0} \frac{\sqrt{a+h} - \sqrt{a}}{h} \cdot \frac{\sqrt{a+h}+\sqrt{a}}{\sqrt{a+h}+\sqrt{a}} = \lim_{h \to 0} \frac{h}{h(\sqrt{a+h}+\sqrt{a})} = \frac{1}{2\sqrt{a}}

Example: f(x) = e^x.

f'(a) = \lim_{h \to 0} \frac{e^{a+h} - e^a}{h} = e^a \lim_{h \to 0} \frac{e^h - 1}{h} = e^a \cdot 1 = e^a

using the fundamental limit \lim_{h\to 0}(e^h - 1)/h = 1 from 01 3.2. So (e^x)' = e^x - the exponential is its own derivative.

The derivative as a function. If f is differentiable at every x in its domain, the rule x \mapsto f'(x) defines a new function f': \mathbb{R} \to \mathbb{R} called the derivative function.
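
The limit definition can be checked numerically by shrinking h in the difference quotient. A minimal sketch (NumPy and the test points are illustrative choices, not part of the companion notebooks):

import numpy as np

def diff_quotient(f, a, h):
    # Slope of the secant through (a, f(a)) and (a+h, f(a+h)).
    return (f(a + h) - f(a)) / h

# (x^2)' at a = 3 should approach 2a = 6; (e^x)' at a = 1 should approach e.
for h in [1e-1, 1e-3, 1e-5]:
    print(h,
          diff_quotient(lambda x: x**2, 3.0, h),  # -> 6 + h
          diff_quotient(np.exp, 1.0, h))          # -> e (about 2.71828)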

2.2 Differentiability vs. Continuity

Theorem. If f is differentiable at a, then f is continuous at a.

Proof. We show \lim_{h\to 0} f(a+h) = f(a):

f(a+h) - f(a) = \frac{f(a+h)-f(a)}{h} \cdot h \to f'(a) \cdot 0 = 0 \qquad (h \to 0)

So \lim_{h\to 0} f(a+h) = f(a). \square

The converse is false: continuity does not imply differentiability.

Counterexamples:

  • f(x) = \lvert x \rvert: continuous everywhere, but at x = 0 the left derivative is -1 and the right derivative is +1 - not equal, so not differentiable
  • f(x) = x^{1/3}: continuous at 0, but the difference quotient \lvert h^{1/3} \rvert / \lvert h \rvert = \lvert h \rvert^{-2/3} \to \infty - vertical tangent, not differentiable
  • Weierstrass function (1872): continuous everywhere, differentiable nowhere - the first published example showing the two concepts are genuinely distinct

DIFFERENTIABILITY VS. CONTINUITY

  Differentiable functions form a proper subset of the continuous functions:
  |x|, x^{1/3}, and the Weierstrass function are continuous but not differentiable.

  Differentiable => Continuous      TRUE  (proven above)
  Continuous => Differentiable      FALSE (counterexamples above)

For AI: ReLU(x) = \max(0,x) is continuous everywhere but not differentiable at x = 0 (left derivative 0, right derivative 1). In practice we use a subgradient - any value in [0,1] at x = 0, typically 0 or 1/2. Training converges despite this because P(\text{activation} = 0) = 0 for continuously distributed pre-activations.

2.3 Notational Systems

Four equivalent notations coexist; knowing all is essential for reading across fields:

| Notation | Symbol | Usage |
| --- | --- | --- |
| Lagrange (prime) | f'(x), f''(x), f^{(n)}(x) | Most common in pure math |
| Leibniz (fraction) | \dfrac{dy}{dx}, \dfrac{d^2y}{dx^2} | Chain rule, implicit diff, physics |
| Newton (dot) | \dot{x}, \ddot{x} | Physics, ODEs, time derivatives |
| Euler (operator) | Df, D^2 f | Functional analysis, operator theory |

Leibniz notation is the most powerful for computation because it makes the chain rule look like fraction cancellation:

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}

Higher order: f''(x) = \frac{d^2y}{dx^2} = \frac{d}{dx}\!\left(\frac{dy}{dx}\right). Note: \left(\frac{dy}{dx}\right)^2 \neq \frac{d^2y}{dx^2} - a common error.

2.4 One-Sided Derivatives

Definition. The right-hand derivative is f'_+(a) = \lim_{h \to 0^+}\frac{f(a+h)-f(a)}{h}. The left-hand derivative is f'_-(a) = \lim_{h\to 0^-}\frac{f(a+h)-f(a)}{h}.

f is differentiable at a if and only if both one-sided derivatives exist and are equal: f'_-(a) = f'_+(a).

For AI - ReLU subgradient: At x = 0, \text{ReLU}'_-(0) = 0 and \text{ReLU}'_+(0) = 1. Since 0 \neq 1, ReLU is not differentiable at 0. Frameworks like PyTorch use the convention \text{ReLU}'(0) = 0 (the left derivative), which works in practice.


3. Basic Differentiation Rules

These rules allow computing derivatives algebraically without evaluating limits each time. All are proved from the limit definition.

3.1 Constant and Power Rules

Constant Rule: (c)' = 0 for any constant c.

Proof: \lim_{h\to 0}(c - c)/h = \lim_{h\to 0} 0 = 0. \square

Power Rule: (x^n)' = nx^{n-1} for any real n.

Proof (integer n \geq 1): Expand (x+h)^n by the binomial theorem:

(x+h)^n = x^n + nx^{n-1}h + \binom{n}{2}x^{n-2}h^2 + \cdots

Subtract x^n, divide by h, let h \to 0: only the nx^{n-1} term survives. \square

The power rule holds for all real n (including fractions and negatives):

\frac{d}{dx}x^{1/2} = \frac{1}{2}x^{-1/2}, \quad \frac{d}{dx}x^{-1} = -x^{-2}, \quad \frac{d}{dx}x^\pi = \pi x^{\pi-1}

3.2 Sum and Constant-Multiple Rules

Sum Rule: (f + g)' = f' + g'

Constant-Multiple Rule: (cf)' = cf'

Together: (af + bg)' = af' + bg' - differentiation is linear.

Consequence: polynomials can be differentiated term by term:

\frac{d}{dx}(3x^4 - 2x^2 + 7x - 1) = 12x^3 - 4x + 7

3.3 Product Rule

Theorem. (fg)' = f'g + fg'

Proof:

\frac{f(x+h)g(x+h) - f(x)g(x)}{h}

Add and subtract f(x+h)g(x):

= \frac{f(x+h)-f(x)}{h} \cdot g(x+h) + f(x) \cdot \frac{g(x+h)-g(x)}{h}

As h \to 0: the first term \to f'(x)g(x), the second \to f(x)g'(x). \square

Memory aid: "derivative of first times second, plus first times derivative of second."

Examples:

\frac{d}{dx}[x^2 e^x] = 2xe^x + x^2 e^x = xe^x(2+x), \qquad \frac{d}{dx}[x\sin x] = \sin x + x\cos x

Generalized product (Leibniz rule):

(f_1 f_2 \cdots f_n)' = \sum_{k=1}^n f_1 \cdots f_{k-1}\, f_k'\, f_{k+1} \cdots f_n

3.4 Quotient Rule

Theorem. \left(\dfrac{f}{g}\right)' = \dfrac{f'g - fg'}{g^2}, provided g(x) \neq 0.

Proof: Write f = (f/g) \cdot g and apply the product rule to both sides. \square

Memory aid: "lo d-hi minus hi d-lo, over lo squared" (where hi = f, lo = g).

Example:

\frac{d}{dx}\frac{\sin x}{x} = \frac{x\cos x - \sin x}{x^2}

3.5 Elementary Function Derivatives

Core table - memorize these:

| Function | Derivative | Notes |
| --- | --- | --- |
| e^x | e^x | Its own derivative (up to constant multiples, the only such function) |
| a^x | a^x \ln a | For any base a > 0 |
| \ln x | 1/x | For x > 0 |
| \log_a x | 1/(x \ln a) | Change of base |
| \sin x | \cos x | |
| \cos x | -\sin x | Note the negative |
| \tan x | \sec^2 x | |
| \arcsin x | 1/\sqrt{1-x^2} | Via inverse function rule |
| \arctan x | 1/(1+x^2) | Via inverse function rule |

Proof: (\ln x)' = 1/x.

\lim_{h\to 0} \frac{\ln(x+h) - \ln x}{h} = \lim_{h\to 0} \frac{1}{h}\ln\!\left(1 + \frac{h}{x}\right) = \frac{1}{x}\lim_{u\to 0}\frac{\ln(1+u)}{u} = \frac{1}{x} \cdot 1 = \frac{1}{x}

using \lim_{u\to 0}\ln(1+u)/u = 1 (from 01 3.2). \square


4. The Chain Rule

4.1 Statement and Proof

The chain rule handles composites - functions of functions. If y = f(g(x)), then:

\frac{dy}{dx} = f'(g(x)) \cdot g'(x) \qquad \text{(Leibniz: } \frac{dy}{dx} = \frac{dy}{du}\cdot\frac{du}{dx}\text{)}

Proof (informal but rigorous enough for most purposes). Let u = g(x) and \Delta u = g(x+h) - g(x). Then:

\frac{\Delta y}{\Delta x} = \frac{\Delta y}{\Delta u} \cdot \frac{\Delta u}{\Delta x}

Taking h \to 0: if g'(x) exists and f'(u) exists at u = g(x), the limit yields the chain rule. (The formal proof handles the case \Delta u = 0 separately via an auxiliary function.)

4.2 Worked Examples

Example 1. y = \sin(x^2). Let u = x^2, y = \sin u.

y' = \cos(x^2) \cdot 2x

Example 2. y = e^{-x^2/2} (Gaussian kernel).

y' = e^{-x^2/2} \cdot (-x) = -x\,e^{-x^2/2}

Example 3 (triple composition). y = \ln(\sin(e^x)).

y' = \frac{1}{\sin(e^x)} \cdot \cos(e^x) \cdot e^x = \frac{e^x \cos(e^x)}{\sin(e^x)}

The pattern: differentiate outer, keep inner untouched, multiply by derivative of inner - repeat inward.

4.3 Chain Rule and Backpropagation

Every neural network is a deeply composed function:

\mathcal{L} = \ell(f_n(f_{n-1}(\cdots f_1(\mathbf{x})\cdots)))

Backpropagation is the chain rule applied to a computation graph. The gradient of the loss with respect to any parameter is a product of local derivatives along the path from that parameter to the output. The chain rule guarantees this product equals the true gradient - the theoretical justification for gradient-based learning.

Forward reference. The multivariable chain rule (Jacobians, backpropagation for vector-valued functions) is the canonical topic of 05-Multivariate Calculus. Here we develop the single-variable foundation only.
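
As a concrete single-variable illustration, the sketch below runs a forward pass through \mathcal{L} = (\sigma(wx) - y)^2 and then multiplies the local derivatives in reverse - the scalar version of backpropagation. The variable names and values are illustrative, not tied to any framework:

import math

def forward_backward(w, x, y):
    # Forward pass: store intermediate values.
    z = w * x                       # z = wx
    a = 1.0 / (1.0 + math.exp(-z))  # a = sigma(z)
    L = (a - y) ** 2                # L = (a - y)^2
    # Backward pass: chain rule, one local derivative per edge.
    dL_da = 2.0 * (a - y)
    da_dz = a * (1.0 - a)           # sigma'(z) = sigma(z)(1 - sigma(z))
    dz_dw = x
    dL_dw = dL_da * da_dz * dz_dw   # product of local derivatives
    return L, dL_dw

print(forward_backward(w=0.5, x=2.0, y=1.0))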


5. Implicit Differentiation and Related Rates

5.1 Implicit Differentiation

Not every curve is a function y = f(x). The unit circle x^2 + y^2 = 1 defines y implicitly. To find dy/dx, differentiate both sides with respect to x, treating y as a function of x:

\frac{d}{dx}[x^2 + y^2] = \frac{d}{dx}[1] \implies 2x + 2y\frac{dy}{dx} = 0 \implies \frac{dy}{dx} = -\frac{x}{y}

General procedure:

  1. Differentiate both sides w.r.t. x; apply the chain rule wherever y appears (treat y as y(x)).
  2. Collect all dy/dx terms on one side.
  3. Solve algebraically for dy/dx.

Example. Derive (\arctan x)' = 1/(1+x^2).

Let y = \arctan x, so \tan y = x. Differentiate implicitly:

\sec^2 y \cdot \frac{dy}{dx} = 1 \implies \frac{dy}{dx} = \cos^2 y = \frac{1}{1 + \tan^2 y} = \frac{1}{1+x^2}. \quad \square

For AI: Implicit differentiation underlies the derivation of softmax Jacobians and the inverse function theorem applied to normalizing flows - generative models that learn invertible mappings.

5.2 Related Rates

Related rates problems ask: if two quantities are linked by an equation, how fast does one change relative to the other? Differentiate the linking equation w.r.t. time tt.

Example. The cross-entropy loss \mathcal{L} = -\log p and the logit z are linked by the sigmoid: p = \sigma(z) = 1/(1+e^{-z}). Rate of change of loss w.r.t. logit:

\frac{d\mathcal{L}}{dz} = \frac{d\mathcal{L}}{dp} \cdot \frac{dp}{dz} = \frac{-1}{p} \cdot p(1-p) = -(1-p) = p - 1

This is the gradient passed to the logit layer during backpropagation - related rates in disguise.
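
The identity is easy to sanity-check numerically. A minimal sketch assuming the positive-class loss \mathcal{L} = -\log \sigma(z) (the value of z is arbitrary):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(z):
    return -math.log(sigmoid(z))   # L = -log p with p = sigma(z)

z, h = 0.7, 1e-6
numeric = (loss(z + h) - loss(z - h)) / (2 * h)  # centered difference
analytic = sigmoid(z) - 1                        # dL/dz = p - 1
print(numeric, analytic)                         # agree to ~1e-9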

5.3 Linear Approximation

The tangent line at x = a approximates f locally:

f(x) \approx f(a) + f'(a)(x-a)

This linearization is the first-order Taylor approximation. It underlies gradient descent: the loss surface near \theta looks like a hyperplane, and gradient descent steps in the direction of steepest descent on that plane.

Forward reference. The full Taylor series (all higher-order terms, remainder, convergence) is the canonical topic of 04-Series-and-Sequences.


6. Higher-Order Derivatives and Concavity

6.1 Definition and Notation

The second derivative is the derivative of the derivative:

f''(x) = \frac{d}{dx}[f'(x)] = \frac{d^2f}{dx^2}

Higher-order derivatives: f^{(n)}(x) = \frac{d^n f}{dx^n}.

Geometric meaning:

  • f'(x) > 0: f is increasing at x
  • f'(x) < 0: f is decreasing at x
  • f''(x) > 0: f' is increasing -> f is concave up (bowl shape)
  • f''(x) < 0: f' is decreasing -> f is concave down (dome shape)

6.2 Inflection Points

A point x_0 is an inflection point if f'' changes sign at x_0. The concavity flips - from up to down or vice versa. Necessary condition: f''(x_0) = 0 (but not sufficient - check for a sign change).

Example. f(x) = x^3: f''(x) = 6x changes sign at x = 0, so x = 0 is an inflection point. Non-example. f(x) = x^4: f''(x) = 12x^2 \geq 0 equals zero at x = 0 but does not change sign - not an inflection point.

6.3 Significance for Optimization

The second derivative determines the local curvature of the loss surface:

  • Positive curvature (f'' > 0): a minimum is possible here; gradient descent converges toward it.
  • Negative curvature (f'' < 0): a maximum; gradient descent moves away from such regions.
  • Zero curvature: saddle point candidate - common in high-dimensional loss landscapes.

The Hessian (matrix of second partial derivatives, covered in 05-Multivariate Calculus) generalizes f'' to multi-parameter models. Positive definite Hessians certify local minima.

Recall: Positive definite matrices were defined in 07-Positive-Definite-Matrices - a matrix H is PD iff \mathbf{v}^\top H \mathbf{v} > 0 for all nonzero \mathbf{v}.

6.4 Taylor Coefficients Foreshadowed

The n-th derivative at a point gives the n-th Taylor coefficient:

a_n = \frac{f^{(n)}(a)}{n!}

So f'' controls the quadratic correction to the linear approximation - exactly the second-order term Adam exploits by tracking gradient variance as a proxy for curvature.

Forward reference. The full derivation of Taylor series using higher-order derivatives is in 04-Series-and-Sequences.


7. Extrema, Critical Points, and the Mean Value Theorem

7.1 Critical Points and the First Derivative Test

A critical point of f is any x_0 where f'(x_0) = 0 or f'(x_0) does not exist.

First Derivative Test. If f' changes sign at x_0:

  • - \to +: local minimum
  • + \to -: local maximum
  • No sign change: saddle point / inflection

Example. f(x) = x^3 - 3x. Critical points: f'(x) = 3x^2 - 3 = 0 \Rightarrow x = \pm 1.

  • x = -1: f' changes + \to - -> local max, f(-1) = 2
  • x = 1: f' changes - \to + -> local min, f(1) = -2

7.2 Second Derivative Test

If f'(x_0) = 0:

  • f''(x_0) > 0: local minimum
  • f''(x_0) < 0: local maximum
  • f''(x_0) = 0: inconclusive - higher-order test needed

For AI: In a scalar model, the second derivative test tells us whether a parameter has converged to a local minimum or is at a saddle. In neural networks this extends to the Hessian eigenspectrum - if all eigenvalues are positive, we've found a local minimum.

7.3 Global Extrema on Closed Intervals

Recall: The Extreme Value Theorem (EVT) from 01-Limits-and-Continuity guarantees that a continuous function on [a,b] attains its maximum and minimum.

Algorithm for global extrema on [a,b]:

  1. Find all critical points in (a,b) where f'(x) = 0 or f' is undefined.
  2. Evaluate f at the critical points and at the endpoints a, b.
  3. The largest value is the global max; the smallest is the global min.

7.4 The Mean Value Theorem

Theorem (MVT). If f is continuous on [a,b] and differentiable on (a,b), there exists c \in (a,b) such that:

f'(c) = \frac{f(b) - f(a)}{b - a}

Geometrically: there is a point where the tangent slope equals the average (secant) slope.

Proof. Apply Rolle's Theorem (MVT with f(a) = f(b)) to g(x) = f(x) - \frac{f(b)-f(a)}{b-a}(x-a).

For optimization: The MVT bounds how much a function can change along a gradient descent path. If |f'| \leq L everywhere, then |f(x) - f(y)| \leq L|x-y| - a Lipschitz bound used to set safe learning rates.

Forward reference. Critical point analysis extends to multivariable functions via gradient = 0 and Hessian definiteness in 05-Multivariate Calculus and 08-Optimization.


8. Activation Function Derivatives

Recall: Continuity of activation functions (whether ReLU, GELU, sigmoid are continuous at the origin) was analyzed in 01-Limits-and-Continuity 10. Here we derive their derivatives - essential for backpropagation.

8.1 Sigmoid

\sigma(x) = \frac{1}{1+e^{-x}}

Derivative. Using the quotient rule with u = 1, v = 1+e^{-x}:

\sigma'(x) = \frac{e^{-x}}{(1+e^{-x})^2} = \sigma(x)(1-\sigma(x))

This self-referential form means the gradient can be computed from the forward-pass output - no second exponential needed. Peak value: \sigma'(0) = 1/4.

Vanishing gradient. As |x| \to \infty, \sigma(x) \to 0 or 1, so \sigma'(x) \to 0. Products of many near-zero gradients -> exponential decay through layers.

8.2 Tanh

\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}

Derivative. Using \tanh(x) = 2\sigma(2x) - 1 or the quotient rule directly:

\tanh'(x) = 1 - \tanh^2(x) = \text{sech}^2(x)

Peak value: \tanh'(0) = 1 - four times that of sigmoid. Still vanishes for large |x|.

8.3 ReLU and Variants

\text{ReLU}(x) = \max(0, x), \qquad \text{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}

Undefined at x = 0; in practice the subgradient 0 is used. No vanishing gradient for active units (x > 0) - the key advantage over sigmoid/tanh.

Leaky ReLU: \text{LReLU}(x) = \max(\alpha x, x) with \alpha \approx 0.01; the derivative is \alpha for x < 0.

8.4 GELU

The Gaussian Error Linear Unit:

\text{GELU}(x) = x \cdot \Phi(x), \qquad \Phi(x) = \frac{1}{2}\left[1 + \text{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right]

Derivative (product rule):

\text{GELU}'(x) = \Phi(x) + x \cdot \phi(x), \qquad \phi(x) = \frac{1}{\sqrt{2\pi}}e^{-x^2/2}

where \phi is the standard normal PDF. GELU is smooth everywhere, approximating ReLU for large positive x and gating negative values by their Gaussian probability. Used in GPT-2, BERT, and most modern transformers.
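
These closed-form derivatives can be cross-checked against centered differences. A minimal sketch (math.erf gives the exact GELU; the evaluation point is an arbitrary choice):

import numpy as np
from math import erf, sqrt, pi

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def gelu(x):    return x * 0.5 * (1.0 + erf(x / sqrt(2.0)))

def num_deriv(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)   # centered difference

x = 0.3
checks = [
    ("sigmoid", sigmoid, sigmoid(x) * (1 - sigmoid(x))),
    ("tanh",    np.tanh, 1 - np.tanh(x) ** 2),
    ("gelu",    gelu,    0.5 * (1 + erf(x / sqrt(2))) + x * np.exp(-x**2 / 2) / sqrt(2 * pi)),
]
for name, f, analytic in checks:
    print(name, num_deriv(f, x), analytic)   # the two columns should match to ~1e-9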

8.5 Derivative Summary Table

| Activation | Formula | Derivative | Range of f' | Used In |
| --- | --- | --- | --- | --- |
| Sigmoid \sigma | 1/(1+e^{-x}) | \sigma(1-\sigma) | (0, 1/4] | Logistic regression, old RNNs |
| Tanh | (e^x-e^{-x})/(e^x+e^{-x}) | 1-\tanh^2 | (0, 1] | LSTMs, NLP |
| ReLU | \max(0,x) | \mathbf{1}[x>0] | \{0, 1\} | CNNs, ResNets |
| GELU | x\Phi(x) | \Phi(x)+x\phi(x) | \approx(-0.17, 1] | Transformers |
| Softplus | \ln(1+e^x) | \sigma(x) | (0, 1) | Smooth ReLU |

9. Numerical Differentiation

Recall: 01 previewed the gradient as a limit in 01 11. Here we develop the full numerical treatment - the canonical home for this topic.

9.1 Finite Difference Approximations

Forward difference (first order, O(h) error):

f'(x) \approx \frac{f(x+h) - f(x)}{h}

Taylor expansion: f(x+h) = f(x) + hf'(x) + \frac{h^2}{2}f''(x) + \cdots, so the error is \frac{h}{2}f''(x) + O(h^2).

Centered difference (second order, O(h^2) error):

f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}

Expanding both terms and subtracting, the f'' contributions cancel, leaving error \frac{h^2}{6}f'''(x) + O(h^4). Centered differences are strongly preferred.

Second derivative via finite differences:

f''(x) \approx \frac{f(x+h) - 2f(x) + f(x-h)}{h^2}

9.2 Optimal Step Size

Two competing errors:

  • Truncation error: O(h^2) - decreases as h \to 0
  • Round-off error: floating-point arithmetic contributes \varepsilon_{\text{mach}}/h - increases as h \to 0

Balancing these (minimize the total error h^2 + \varepsilon_{\text{mach}}/h) gives the optimum:

h^* \approx \varepsilon_{\text{mach}}^{1/3} \approx 10^{-5.3} \quad (\text{for float64, } \varepsilon_{\text{mach}} \approx 10^{-16})

In practice: h \approx 10^{-5} for float64 centered differences.
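
The trade-off is easy to observe empirically. A minimal sketch (the function and evaluation point are arbitrary choices):

import numpy as np

f, df = np.sin, np.cos                # true derivative known in closed form
x = 1.0
for h in [1e-1, 1e-3, 1e-5, 1e-8, 1e-12]:
    centered = (f(x + h) - f(x - h)) / (2 * h)
    print(f"h={h:.0e}  error={abs(centered - df(x)):.2e}")
# The error shrinks like h^2 until h ~ 1e-5, then grows again as round-off dominates.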

9.3 Gradient Checking

Gradient checking verifies an analytical gradient implementation by comparing it to a numerical estimate:

import numpy as np

def grad_check(f, x, analytic_grad, h=1e-5):
    # Compare an analytic gradient against a centered-difference estimate.
    numerical_grad = np.zeros_like(x)
    for i in range(len(x)):
        x_plus = x.copy();  x_plus[i] += h
        x_minus = x.copy(); x_minus[i] -= h
        numerical_grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
    # Relative error; the 1e-8 guards against division by zero.
    rel_error = np.linalg.norm(analytic_grad - numerical_grad) / \
                (np.linalg.norm(analytic_grad) + np.linalg.norm(numerical_grad) + 1e-8)
    return rel_error  # < 1e-5 typically acceptable

Libraries like PyTorch provide torch.autograd.gradcheck() implementing exactly this. Every custom backward pass should be verified.

9.4 Automatic Differentiation

Numerical differentiation approximates derivatives; automatic differentiation (AD) computes them exactly (up to floating-point) by applying the chain rule to elementary operations in forward or reverse mode.

  • Forward mode AD: accumulates derivative alongside function evaluation - efficient for few inputs, many outputs
  • Reverse mode AD (backpropagation): accumulates gradient from output backward - efficient for many inputs, one output (the loss)

All modern deep learning frameworks (PyTorch, JAX, TensorFlow) implement reverse-mode AD. Numerical differentiation is used for checking, not training.


10. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
| --- | --- | --- | --- |
| 1 | (fg)' = f'g' | Product rule is (fg)' = f'g + fg' | Always use the product rule |
| 2 | (f/g)' = f'/g' | Quotient rule is (f'g - fg')/g^2 | Derive it from the product rule |
| 3 | Forgetting the chain rule: (e^{x^2})' = e^{x^2} | Outer derivative times inner derivative needed | (e^{x^2})' = 2x\,e^{x^2} |
| 4 | (\ln x)' = 1/\ln x | Confused with change of base; correct is 1/x | Memorize: (\ln x)' = 1/x |
| 5 | (\sin^2 x)' = 2\sin x | Missing chain rule factor (\sin x)' = \cos x | (\sin^2 x)' = 2\sin x\cos x = \sin 2x |
| 6 | Assuming continuous implies differentiable | True direction: differentiable -> continuous (not the converse) | ReLU is continuous but not differentiable at 0 |
| 7 | f''(x_0) = 0 means saddle/inflection | Not sufficient - need a sign change of f'' | Check the sign of f'' on both sides |
| 8 | Optimal h for finite differences is h = 10^{-16} | Catastrophic cancellation - round-off dominates | Use h \approx 10^{-5} for float64 |
| 9 | Applying L'Hôpital when the limit is not 0/0 or \infty/\infty | L'Hôpital only applies to indeterminate forms | Check the form first |
| 10 | Chain rule direction error: (f \circ g)' = f' \circ g' | Must evaluate f' at g(x): (f\circ g)'(x) = f'(g(x))g'(x) | Track the composition carefully |
| 11 | Assuming \text{ReLU}'(0) = 1 or 0.5 | ReLU is not differentiable at 0; subgradient 0 is standard | Use the subgradient convention |
| 12 | Gradient of \|Ax - b\|^2 as 2A(Ax-b) | Matrix derivative rules needed; correct form is 2A^\top(Ax-b) | See 05-Multivariate Calculus |

11. Exercises

Exercise 1 - Basic rules. Differentiate: (a) f(x) = 3x^4 - 2x^2 + 7, (b) g(x) = \sqrt{x}\,e^x, (c) h(x) = \frac{x^2+1}{x-1}.

Exercise 2 - Chain rule. Differentiate: (a) \sin(x^3), (b) \ln(\cos x), (c) (1+x^2)^{10}, (d) e^{\sin x}.

Exercise 3 - Implicit differentiation. Given x^3 + y^3 = 6xy (folium of Descartes), find dy/dx.

Exercise 4 - Activation derivatives. (a) Derive \sigma'(x) = \sigma(x)(1-\sigma(x)) from scratch. (b) Verify numerically using centered differences at x \in \{-2, 0, 2\} with h = 10^{-5}.

Exercise 5 - Critical points. Find and classify all critical points of f(x) = x^4 - 4x^3 + 6x^2. Identify the global extrema on [-1, 4].

Exercise 6 - MVT application. Prove that |\sin a - \sin b| \leq |a - b| for all a, b \in \mathbb{R} using the MVT.

Exercise 7 - Gradient checking. Implement centered finite differences to approximate the derivative of f(x) = x\ln(x + e^{-x}) at x = 1. Compare to the analytic derivative. Experiment with h \in \{10^{-1}, 10^{-5}, 10^{-10}, 10^{-15}\} and explain the behavior.

Exercise 8 - Backpropagation. A 2-layer scalar network: z_1 = w_1 x, a_1 = \sigma(z_1), z_2 = w_2 a_1, \hat{y} = z_2, \mathcal{L} = \frac{1}{2}(\hat{y} - y)^2. Derive \partial\mathcal{L}/\partial w_1 and \partial\mathcal{L}/\partial w_2 using the chain rule. Verify numerically.


12. Why This Matters for AI (2026 Perspective)

| Concept | AI/ML Impact |
| --- | --- |
| Derivative definition | Formal basis of the gradient - every optimizer (SGD, Adam, Adagrad) descends along -f'(\theta) |
| Chain rule | Backpropagation: gradient of the loss through composed layers; foundational for all neural network training |
| Product/quotient rules | Attention score derivatives; gating mechanisms in LSTMs and GRUs |
| Activation derivatives | \sigma', ReLU', GELU' determine gradient magnitude at each neuron; control vanishing/exploding gradients |
| Second derivative / concavity | Hessian approximations in Adam (diagonal), K-FAC (block), Shampoo (full); Newton's method in optimization |
| Critical points | Saddle points dominate high-dimensional loss landscapes; flat minima -> better generalization |
| MVT and Lipschitz bounds | Learning rate theory: \eta \leq 1/L ensures descent for L-smooth losses |
| Implicit differentiation | Normalizing flows, invertible networks, implicit models (DEQ, neural ODEs) |
| Numerical differentiation | Gradient checking in unit tests; finite-difference Hessian in small-scale studies |
| Automatic differentiation | PyTorch autograd, JAX grad, TensorFlow GradientTape - all implement reverse-mode AD |
| Linear approximation | Gradient descent step as a linear model of the loss surface near the current parameters |
| GELU derivative | Used in GPT-2, BERT, T5, and effectively every transformer post-2018 |

Conceptual Bridge

Looking back. This section built on the definition of a limit (01) - the derivative is a limit of a difference quotient. The continuity of activation functions established in 01 is a prerequisite for their differentiability here (differentiable implies continuous, but not conversely).

Looking forward. The chain rule derived here for scalar functions becomes the multivariable chain rule in 05-Multivariate Calculus - the chain rule expressed through Jacobians is the mathematical heart of backpropagation for vector-valued functions. The linear approximation introduced in 5.3 becomes the full Taylor series in 04-Series-and-Sequences, which justifies optimizer update rules. Critical point analysis extends to gradient = 0 conditions in 08-Optimization.

Position in the curriculum:

CHAPTER 4 - CALCULUS FUNDAMENTALS


  01 Limits and Continuity 
           (limit definition, L'Hôpital, continuity of activations)
         
  02 Derivatives and Differentiation   YOU ARE HERE 
           (rules, chain rule, activation f', numerical diff)
         
  03 Integration 
           (FTC links derivatives <-> integrals)
         
  04 Series and Sequences 
           (Taylor series uses all derivatives from 02)
         
  05 Multivariate Calculus 
         (multivariable chain rule, Jacobians, full backprop)



<- Back to Calculus Fundamentals | Next: Integration ->


Appendix A: Extended Worked Examples - Basic Rules

A.1 Power Rule - Negative and Fractional Exponents

The power rule \frac{d}{dx}x^n = nx^{n-1} applies to all real n:

Negative integer: \frac{d}{dx}x^{-3} = -3x^{-4} = -3/x^4.

Fractional: \frac{d}{dx}x^{2/3} = \frac{2}{3}x^{-1/3}. Note: not defined at x = 0.

Irrational: \frac{d}{dx}x^\pi = \pi x^{\pi-1}.

Proof for n \in \mathbb{Z}^+ (inductive, using the product rule). Base case n = 1: (x)' = 1 from the definition. Inductive step: assume (x^k)' = kx^{k-1}. Then (x^{k+1})' = (x \cdot x^k)' = 1 \cdot x^k + x \cdot kx^{k-1} = (k+1)x^k. \square

For general real n: write x^n = e^{n\ln x} (for x > 0) and apply the chain rule: \frac{d}{dx}e^{n\ln x} = e^{n\ln x} \cdot \frac{n}{x} = x^n \cdot \frac{n}{x} = nx^{n-1}.

A.2 Product Rule - Triple Product

For three functions, (uvw)' = u'vw + uv'w + uvw'.

Proof. Write uvw = (uv)w and apply the product rule twice: (uv)'w + (uv)w' = (u'v + uv')w + uvw' = u'vw + uv'w + uvw'.

Example. (x^2 \sin x \ln x)' = 2x\sin x\ln x + x^2\cos x\ln x + x^2\sin x \cdot \frac{1}{x}.

A.3 Quotient Rule - Deriving from Product Rule

\left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}

Proof. Let h = f/g, so f = gh. By the product rule: f' = g'h + gh'. Solving: h' = (f' - g'h)/g = (f' - g'f/g)/g = (f'g - fg')/g^2. \square

Mnemonic ("lo d-hi minus hi d-lo over lo-lo"): \frac{d}{dx}\frac{\text{hi}}{\text{lo}} = \frac{\text{lo}\cdot d(\text{hi}) - \text{hi}\cdot d(\text{lo})}{\text{lo}^2}.

A.4 Logarithmic Differentiation

When f(x) is a product or quotient of many terms, take \ln first:

y = \frac{x^3\sqrt{x+1}}{(x^2+2)^4} \implies \ln y = 3\ln x + \tfrac{1}{2}\ln(x+1) - 4\ln(x^2+2)

Differentiate implicitly:

\frac{y'}{y} = \frac{3}{x} + \frac{1}{2(x+1)} - \frac{8x}{x^2+2}

Multiply by y. Logarithmic differentiation is also the only elementary route to y = x^x:

\ln y = x\ln x \implies \frac{y'}{y} = \ln x + 1 \implies y' = x^x(\ln x + 1)

Appendix B: Extended Chain Rule Examples and Patterns

B.1 Identifying the Composition Structure

Every application of the chain rule begins with identifying:

  • Outer function f: what is applied last
  • Inner function g: what is the argument of the outer function

| Expression | Outer f(u) | Inner u = g(x) | Derivative |
| --- | --- | --- | --- |
| e^{-x^2} | e^u | -x^2 | -2x\,e^{-x^2} |
| \sin^3 x | u^3 | \sin x | 3\sin^2 x \cos x |
| \ln(x^2+1) | \ln u | x^2+1 | 2x/(x^2+1) |
| \sqrt{\tan x} | \sqrt{u} | \tan x | \sec^2 x / (2\sqrt{\tan x}) |
| (1+e^x)^{-1} | u^{-1} | 1+e^x | -e^x/(1+e^x)^2 |

Note: (1+e^x)^{-1} = \sigma(-x) - the sigmoid of -x.

B.2 Nested Chain Rule - Four Levels Deep

y = \sin(\ln(\sqrt{x^2+1})). Work from the outside in.

Let u_1 = \ln(\sqrt{x^2+1}), u_2 = \sqrt{x^2+1}, u_3 = x^2+1.

y' = \cos(u_1) \cdot \frac{1}{u_2} \cdot \frac{1}{2\sqrt{u_3}} \cdot 2x = \cos(\ln\sqrt{x^2+1}) \cdot \frac{x}{x^2+1}

B.3 Scalar Backpropagation Step-by-Step

Consider the output node \mathcal{L} = -\log\sigma(z) where z = \mathbf{w}^\top\mathbf{x} (binary cross-entropy, scalar model).

Chain rule cascade:

\frac{d\mathcal{L}}{dz} = \frac{d\mathcal{L}}{d\sigma}\cdot\frac{d\sigma}{dz} = -\frac{1}{\sigma(z)}\cdot\sigma(z)(1-\sigma(z)) = -(1-\sigma(z)) = \sigma(z) - 1

Gradient w.r.t. weight w_j:

\frac{d\mathcal{L}}{dw_j} = \frac{d\mathcal{L}}{dz}\cdot\frac{dz}{dw_j} = (\sigma(z)-1)\cdot x_j

This is the gradient used in logistic regression's SGD update. The chain rule reduces the gradient to a simple product of the prediction error (\sigma(z)-1) and the input feature x_j.
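
A short numerical check of this formula, as a sketch (the random weights and features are generated only for illustration):

import numpy as np

rng = np.random.default_rng(0)
w, x = rng.normal(size=3), rng.normal(size=3)

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
def loss(w):    return -np.log(sigmoid(w @ x))   # L = -log sigma(w^T x)

analytic = (sigmoid(w @ x) - 1) * x              # (sigma(z) - 1) x_j

h, numeric = 1e-6, np.zeros_like(w)
for j in range(len(w)):
    e = np.zeros_like(w); e[j] = h
    numeric[j] = (loss(w + e) - loss(w - e)) / (2 * h)
print(np.max(np.abs(analytic - numeric)))        # ~1e-9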

B.4 The Derivative as a Linear Map

Abstractly, the derivative f'(a) is the best linear approximation to f near a. More precisely, f'(a) is the unique number such that:

f(a + h) = f(a) + f'(a)\,h + o(h) \quad \text{as } h \to 0

where o(h) denotes terms that shrink faster than h. This abstract definition extends directly to vector functions (the Fréchet derivative) - in that setting, f'(a) becomes a linear map (matrix), which is the Jacobian.


Appendix C: Deep Dive - Differentiability and Regularity

C.1 The Formal epsilon-delta Definition of Differentiability

f is differentiable at a if there exists a number L such that:

\lim_{h \to 0} \frac{f(a+h) - f(a)}{h} = L

i.e., for all \varepsilon > 0, there exists \delta > 0 such that 0 < |h| < \delta implies:

\left|\frac{f(a+h)-f(a)}{h} - L\right| < \varepsilon

We write f'(a) = L.

C.2 Non-Differentiable Points - A Taxonomy

Type 1 - Corner: f(x) = |x|. Left derivative: -1. Right derivative: +1. The limit does not exist.

Type 2 - Cusp: f(x) = x^{2/3}. At x = 0: f'(x) = \frac{2}{3}x^{-1/3} \to \pm\infty. The tangent line becomes vertical - the derivative limit is \pm\infty, not finite.

Type 3 - Vertical tangent: Same mechanism as the cusp; the function is continuous but the difference quotient diverges.

Type 4 - Oscillatory discontinuity of the derivative: f(x) = x^2 \sin(1/x) for x \neq 0, f(0) = 0. The derivative exists at 0 (f'(0) = 0) but f'(x) is not continuous at 0 - f is differentiable but not C^1.

For AI - ReLU. ReLU has a corner at x = 0: left derivative 0, right derivative 1. In practice, the subgradient \partial \text{ReLU}(0) = [0,1] (any value in the interval) is used; most frameworks return 0.

C.3 Smoothness Classes

| Class | Condition | Examples |
| --- | --- | --- |
| C^0 | Continuous | ReLU, \lvert x \rvert |
| C^1 | Derivative exists and is continuous | Softplus, GELU |
| C^2 | Second derivative exists and is continuous | e^x, \sin x, polynomials |
| C^\infty | All derivatives exist | e^x, \sin x, \cos x |
| C^\omega | Real analytic (equals its Taylor series) | e^x, \sin x |

For AI: Smooth activation functions (C^\infty) are required for second-order optimizers (Newton, L-BFGS) and for second-order sensitivity analysis. Rectified activations (C^0 only) work for SGD/Adam but complicate curvature estimation.

C.4 Differentiability vs. Continuity - Summary

Differentiable at x_0  =>  Continuous at x_0      (implication holds)
Continuous at x_0  -/->  Differentiable at x_0    (no implication)

  • Differentiable -> Continuous: proven via \lim_{h\to 0}[f(a+h)-f(a)] = \lim_{h\to 0}\frac{f(a+h)-f(a)}{h}\cdot h = f'(a)\cdot 0 = 0.
  • Continuous -> Differentiable: FALSE. Counterexample: |x| is continuous everywhere but not differentiable at 0.
  • Differentiable -> Continuously differentiable: FALSE. Counterexample: x^2\sin(1/x) above.

Appendix D: Advanced Topics - Inverse Function Theorem and Derivatives of Inverses

D.1 Inverse Function Theorem

If f is differentiable at a and f'(a) \neq 0, then f^{-1} is differentiable at b = f(a) and:

(f^{-1})'(b) = \frac{1}{f'(f^{-1}(b))} = \frac{1}{f'(a)}

Intuition. If f multiplies tangent slopes by f'(a), then f^{-1} divides by f'(a). In Leibniz notation: if y = f(x), then dx/dy = 1/(dy/dx).

Geometric meaning. The graph of f^{-1} is the reflection of the graph of f over the line y = x. The slope of the tangent to f at (a, f(a)) is f'(a); the slope of the reflected tangent to f^{-1} at (f(a), a) is 1/f'(a).

D.2 Derivatives of Inverse Trigonometric Functions

Using the inverse function theorem and implicit differentiation:

\arcsin: Let y = \arcsin x, so \sin y = x. Differentiate: \cos y \cdot y' = 1. Since \cos y = \sqrt{1-\sin^2 y} = \sqrt{1-x^2}:

(\arcsin x)' = \frac{1}{\sqrt{1-x^2}}, \quad x \in (-1,1)

\arccos: Similarly, (\arccos x)' = -1/\sqrt{1-x^2}.

\arctan: Let y = \arctan x, so \tan y = x. Differentiate: \sec^2 y \cdot y' = 1. Since \sec^2 y = 1 + \tan^2 y = 1 + x^2:

(\arctan x)' = \frac{1}{1+x^2}

For AI. The softplus function \text{softplus}(x) = \ln(1+e^x) is the antiderivative of the sigmoid: \text{softplus}'(x) = \sigma(x). This connects the inverse-function perspective to activation function design.

D.3 Generalized Power Rule via Logarithmic Differentiation

For y = [f(x)]^{g(x)} (variable base, variable exponent):

\ln y = g(x)\ln f(x) \implies \frac{y'}{y} = g'(x)\ln f(x) + g(x)\frac{f'(x)}{f(x)}

y' = [f(x)]^{g(x)}\left[g'(x)\ln f(x) + \frac{g(x)f'(x)}{f(x)}\right]

Special cases:

  • g(x) = n (constant): gives the power rule y' = nf^{n-1}f'
  • f(x) = e (base e): gives the exponential derivative y' = e^{g(x)}g'(x)

Appendix E: Optimization Theory Foundations

E.1 First and Second Derivative Tests - Full Treatment

First Derivative Test (complete statement):

Let f be continuous on (a,b) and x_0 \in (a,b) a critical point.

  • If f'(x) > 0 for x \in (a, x_0) and f'(x) < 0 for x \in (x_0, b): local maximum at x_0.
  • If f'(x) < 0 for x \in (a, x_0) and f'(x) > 0 for x \in (x_0, b): local minimum at x_0.
  • If f' does not change sign: neither (saddle or inflection).

Second Derivative Test:

If f'(x_0) = 0:

  • f''(x_0) > 0 -> local min (concave up)
  • f''(x_0) < 0 -> local max (concave down)
  • f''(x_0) = 0 -> inconclusive; examine higher-order derivatives or use the first derivative test

Inconclusive case - higher-order test. If the first k-1 derivatives vanish at x_0 and f^{(k)}(x_0) \neq 0:

  • k odd: neither max nor min (inflection-type)
  • k even, f^{(k)}(x_0) > 0: local min
  • k even, f^{(k)}(x_0) < 0: local max

Example. f(x) = x^4 at x_0 = 0: f'(0) = f''(0) = f'''(0) = 0, f^{(4)}(0) = 24 > 0, k = 4 even -> local (and global) min.

E.2 Rolle's Theorem

Theorem. If f is continuous on [a,b], differentiable on (a,b), and f(a) = f(b), then there exists c \in (a,b) with f'(c) = 0.

Proof. By the EVT, f attains its maximum and minimum on [a,b]. If both are attained at the endpoints, then f is constant and f' \equiv 0. Otherwise an interior maximum or minimum c exists; at an interior extremum, f'(c) = 0 (since f'(c) \neq 0 would imply f is locally monotone, contradicting the extremum). \square

Rolle's Theorem is the key lemma for the MVT.

E.3 Gradient Descent - Single Variable View

The gradient descent update for minimizing f(\theta):

\theta_{t+1} = \theta_t - \eta f'(\theta_t)

Convergence (quadratic f). Let f(\theta) = \frac{1}{2}L\theta^2 (the 1D version of an L-smooth quadratic). Then:

f(\theta_{t+1}) = \frac{1}{2}L(\theta_t - \eta L\theta_t)^2 = (1-\eta L)^2 f(\theta_t)

This converges if |1 - \eta L| < 1, i.e., 0 < \eta < 2/L. The optimal \eta = 1/L gives one-step convergence: f(\theta_1) = 0.

For general smooth f (Taylor + MVT argument):

If |f''| \leq L everywhere (Lipschitz gradient), then:

f(\theta - \eta f'(\theta)) \leq f(\theta) - \eta\left(1 - \frac{\eta L}{2}\right)|f'(\theta)|^2

The right side decreases as long as \eta < 2/L and f'(\theta) \neq 0 - guaranteeing descent at each step.
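
The step-size condition can be seen directly on the quadratic case; the sketch below iterates \theta_{t+1} = (1 - \eta L)\theta_t for a few illustrative learning rates (L and the starting point are arbitrary):

L = 4.0                                   # f(theta) = 0.5 * L * theta^2, f'(theta) = L * theta
for eta in [0.1, 1.0 / L, 1.9 / L, 2.1 / L]:
    theta = 1.0
    for _ in range(20):
        theta = theta - eta * L * theta   # gradient descent step
    print(f"eta={eta:.3f}  theta after 20 steps = {theta:.3e}")
# eta = 1/L converges in one step; eta just below 2/L oscillates but shrinks;
# eta above 2/L diverges.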


Appendix F: Numerical Differentiation - Advanced Analysis

F.1 Error Analysis in Detail

Forward difference error. Taylor expand f(x+h):

f(x+h) = f(x) + hf'(x) + \frac{h^2}{2}f''(x) + \frac{h^3}{6}f'''(x) + \cdots

Solve for f'(x):

f'(x) = \frac{f(x+h) - f(x)}{h} - \frac{h}{2}f''(x) - \frac{h^2}{6}f'''(x) - \cdots

The leading error term is -\frac{h}{2}f''(x), so the truncation error is O(h).

Centered difference error. Also expand f(x-h):

f(x-h) = f(x) - hf'(x) + \frac{h^2}{2}f''(x) - \frac{h^3}{6}f'''(x) + \cdots

Subtract: f(x+h) - f(x-h) = 2hf'(x) + \frac{h^3}{3}f'''(x) + \cdots

f'(x) = \frac{f(x+h) - f(x-h)}{2h} - \frac{h^2}{6}f'''(x) - \cdots

The leading error is -\frac{h^2}{6}f'''(x), so the truncation error is O(h^2) - one order better.

Round-off error. In floating-point arithmetic, f(x+h) is computed with absolute error \sim \varepsilon_{\text{mach}}|f(x)|. The round-off contribution to the centered difference is:

\text{round-off error} \approx \frac{2\varepsilon_{\text{mach}}|f(x)|}{2h} = \frac{\varepsilon_{\text{mach}}|f(x)|}{h}

Total error: E(h) \approx \frac{h^2}{6}|f'''(x)| + \frac{\varepsilon_{\text{mach}}|f(x)|}{h}

Minimize over h: dE/dh = \frac{h}{3}|f'''| - \frac{\varepsilon_{\text{mach}}|f|}{h^2} = 0 gives:

h^* = \left(\frac{3\varepsilon_{\text{mach}}|f|}{|f'''|}\right)^{1/3} \approx \varepsilon_{\text{mach}}^{1/3}

For float64 (\varepsilon_{\text{mach}} \approx 2.2 \times 10^{-16}): h^* \approx 6 \times 10^{-6}.

F.2 Richardson Extrapolation

Given two centered-difference approximations at step sizes h and h/2, eliminate the leading O(h^2) error term:

f'(x) \approx \frac{4 \cdot D(h/2) - D(h)}{3}, \quad D(h) = \frac{f(x+h)-f(x-h)}{2h}

This Richardson-extrapolated formula has error O(h^4) - a massive improvement. Recursive application yields Romberg differentiation with arbitrarily high accuracy.
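
A minimal sketch of the extrapolation (the test function \sin and the step size are illustrative):

import numpy as np

def D(f, x, h):
    # Centered difference, O(h^2) truncation error.
    return (f(x + h) - f(x - h)) / (2 * h)

def richardson(f, x, h):
    # Combine D(h) and D(h/2) so the O(h^2) terms cancel.
    return (4 * D(f, x, h / 2) - D(f, x, h)) / 3

x, h = 1.0, 1e-2
print(abs(D(np.sin, x, h) - np.cos(x)))           # ~1e-5
print(abs(richardson(np.sin, x, h) - np.cos(x)))  # several orders smaller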

F.3 Complex-Step Derivative

An alternative that avoids catastrophic cancellation: compute f at x + ih (a complex argument):

f'(x) \approx \frac{\text{Im}[f(x + ih)]}{h}

For analytic f, this involves no subtractive cancellation, so the round-off error stays at O(\varepsilon_{\text{mach}}) and very small steps h \approx 10^{-20} can be used without accuracy loss. Used in aerospace and scientific computing.
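
A sketch of the complex-step formula (the test function must be analytic; exp is used here as an illustration):

import numpy as np

def complex_step(f, x, h=1e-20):
    # f'(x) ~ Im f(x + ih) / h; no subtraction, so no cancellation.
    return np.imag(f(x + 1j * h)) / h

x = 1.5
print(complex_step(np.exp, x) - np.exp(x))   # ~0, accurate to machine precision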


Appendix G: Activation Functions - Extended Analysis

G.1 Vanishing Gradient Problem - Quantitative Analysis

In a network with L layers using sigmoid activations, the gradient at the first layer involves a product of sigmoid derivatives, one per layer:

\frac{\partial \mathcal{L}}{\partial w_1} = \frac{\partial \mathcal{L}}{\partial a_L} \cdot \prod_{l=2}^{L} \sigma'(z_l)

Since \sigma'(z) \leq 1/4 for all z, the product is bounded:

\left|\prod_{l=2}^{L}\sigma'(z_l)\right| \leq \left(\frac{1}{4}\right)^{L-1}

For a 10-layer network: \leq (1/4)^9 \approx 4 \times 10^{-6} - the gradient is effectively zero. This is the vanishing gradient problem that made training deep sigmoid networks impractical before ReLU.

ReLU advantage. For active ReLU units (z > 0): \text{ReLU}'(z) = 1. The product of L active ReLU derivatives is exactly 1 - no decay. This is why deep residual networks with ReLU can be trained with hundreds of layers.
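
The decay is easy to demonstrate directly. A sketch with random pre-activations (the standard-normal distribution is an illustrative choice):

import numpy as np

rng = np.random.default_rng(0)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

for depth in [5, 10, 20]:
    z = rng.normal(size=depth)                # one pre-activation per layer
    factors = sigmoid(z) * (1 - sigmoid(z))   # each factor is sigma'(z_l) <= 1/4
    print(depth, np.prod(factors))            # shrinks roughly like (1/4)^depth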

G.2 Exploding Gradients

The opposite pathology: if weights are initialized too large, the product \prod_l f'_l(z_l) can exceed 1 exponentially in L. Gradients explode, leading to NaN loss.

Mitigations: gradient clipping (\|g\| \leq c), weight initialization (Xavier/He), batch normalization, residual connections.

G.3 Softmax Derivative

The softmax function maps \mathbf{z} \in \mathbb{R}^K to a probability vector:

\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}} = p_i

Partial derivatives:

\frac{\partial p_i}{\partial z_j} = \begin{cases} p_i(1-p_i) & i = j \\ -p_ip_j & i \neq j \end{cases}

The Jacobian of softmax is J_{ij} = p_i(\delta_{ij} - p_j), or in matrix form: J = \text{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^\top.

For backprop. Combined with the cross-entropy loss \mathcal{L} = -\sum_i y_i\log p_i:

\frac{\partial \mathcal{L}}{\partial z_j} = p_j - y_j

The gradient is simply the difference between the predicted probability and the true label - remarkably clean due to the cancellation between softmax and cross-entropy.
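
A sketch that builds the softmax Jacobian and checks the softmax-plus-cross-entropy shortcut against centered differences (the logits and one-hot label are illustrative):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # shift for numerical stability
    return e / e.sum()

z = np.array([1.0, -0.5, 2.0])
p = softmax(z)
J = np.diag(p) - np.outer(p, p)      # J_ij = p_i (delta_ij - p_j)

y = np.array([0.0, 0.0, 1.0])        # one-hot label
grad = p - y                         # d(cross-entropy)/dz

def L(z): return -np.sum(y * np.log(softmax(z)))
h, num = 1e-6, np.zeros_like(z)
for j in range(len(z)):
    step = np.zeros_like(z); step[j] = h
    num[j] = (L(z + step) - L(z - step)) / (2 * h)
print(np.max(np.abs(grad - num)))    # ~1e-10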

G.4 GELU vs ReLU - Gradient Behavior

| Property | ReLU | GELU |
| --- | --- | --- |
| Differentiable | No (at 0) | Yes (C^\infty) |
| Dead neurons | Yes (gradient = 0 for x < 0 always) | No (small gradient for x < 0) |
| Gradient at x = 0 | 0 (subgradient) | 0.5 (smooth) |
| Computation | Trivial (\max) | Requires erf evaluation |
| Used in | ResNets, older models | GPT, BERT, modern transformers |

The GELU's smooth gradient in the negative regime allows small updates even for negative pre-activations, which may contribute to the empirical finding that GELU outperforms ReLU on large-scale language tasks.


Appendix H: Historical Context and Mathematical Development

H.1 The Calculus War

The derivative was developed independently in the 1660s-1680s by Newton and Leibniz - one of the most famous priority disputes in mathematics.

| Year | Newton | Leibniz |
| --- | --- | --- |
| 1666 | Develops "method of fluxions" (the fluxion is the rate of change of a fluent) | - |
| 1675 | - | Introduces dy/dx notation |
| 1684 | - | Publishes the first calculus paper |
| 1687 | Publishes Principia (uses fluents, not derivatives explicitly) | - |
| 1693 | - | Publishes d/dx systematically |
| 1704 | Publishes "Of the Quadrature of Curves" | - |
| 1711 | Royal Society initiates plagiarism inquiry | - |

Modern consensus: independent discovery. Newton's notation (\dot{x} for time derivatives) persists in physics; Leibniz's dy/dx dominates mathematics and engineering.

H.2 Rigorous Foundations - 19th Century

The limit-based definition of the derivative was not rigorously formulated until the 1820s:

  • Cauchy (1821): Formalized limits and gave the first rigorous definition of the derivative as a limit of a difference quotient.
  • Weierstrass (1872): Constructed a continuous-everywhere, differentiable-nowhere function - proving that "most" continuous functions are not differentiable.
  • Riemann (1854): Formalized integration; the FTC was proved rigorously.

H.3 Connection to Modern AI

| Year | Development | Calculus Foundation |
| --- | --- | --- |
| 1986 | Rumelhart et al. popularize backpropagation | Chain rule on computation graphs |
| 1989 | LeCun's convolutional backprop | Chain rule through spatial filters |
| 2012 | AlexNet - deep ReLU networks | Avoids vanishing gradient (8.1) |
| 2014 | Adam optimizer | Adaptive learning rates, second-moment gradient estimate |
| 2017 | Transformer attention | Softmax Jacobian (G.3) in attention scores |
| 2018 | BERT (GELU activation) | Smooth gradient, C^\infty activations |
| 2022+ | JAX's functional AD | Efficient forward/reverse mode composition |
| 2023+ | LoRA / DoRA (PEFT) | Low-rank gradient updates - rank of Jacobians |
| 2024+ | Mechanistic interpretability | Gradient attribution to individual features |

Appendix I: Applications in Machine Learning - Deep Dives

I.1 Gradient Descent and Its Variants

All first-order optimization algorithms are based on the derivative. The basic update:

\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)

In the scalar case (\theta \in \mathbb{R}): \theta_{t+1} = \theta_t - \eta \mathcal{L}'(\theta_t).

Momentum (Polyak 1964):

v_{t+1} = \beta v_t + \mathcal{L}'(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta v_{t+1}

Accumulates a running average of gradients, dampening oscillations in high-curvature directions.

Adam (Kingma & Ba, 2015):

m_t = \beta_1 m_{t-1} + (1-\beta_1)\mathcal{L}'(\theta_t)

v_t = \beta_2 v_{t-1} + (1-\beta_2)[\mathcal{L}'(\theta_t)]^2

\hat{m}_t = m_t/(1-\beta_1^t), \quad \hat{v}_t = v_t/(1-\beta_2^t)

\theta_{t+1} = \theta_t - \eta\,\hat{m}_t/(\sqrt{\hat{v}_t}+\varepsilon)

m_t is a first moment (mean) estimate; v_t is a second moment (uncentered variance) estimate. Dividing by \sqrt{\hat{v}_t} adapts the step size per parameter: large gradient -> small effective step; small gradient -> larger effective step. This diagonal approximation to the inverse curvature is why Adam often outperforms vanilla SGD.
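
A scalar sketch of these updates (the hyperparameters shown are the commonly cited defaults; the quadratic objective is an arbitrary illustration):

import math

def adam_minimize(grad, theta, steps=200, eta=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # first moment (mean)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (uncentered variance)
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Minimize f(theta) = (theta - 3)^2, so grad(theta) = 2(theta - 3).
print(adam_minimize(lambda th: 2 * (th - 3.0), theta=0.0))   # close to 3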

I.2 Automatic Differentiation - Under the Hood

Forward mode. Alongside each function value f(x), propagate the derivative f'(x) using dual numbers (a, b), where arithmetic follows (a_1, b_1) \cdot (a_2, b_2) = (a_1 a_2, a_1 b_2 + a_2 b_1) (the product rule built in).
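
A minimal dual-number sketch, implementing only the operations needed for one example (illustrative, not a full forward-mode AD system):

import math

class Dual:
    # Pair (value, derivative); arithmetic applies the differentiation rules.
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def sin(d):
    return Dual(math.sin(d.val), math.cos(d.val) * d.dot)   # chain rule

# f(x) = x sin(x); seeding dot = 1 gives f'(x) alongside f(x).
x = Dual(2.0, 1.0)
y = x * sin(x)
print(y.val, y.dot)   # f(2) and f'(2) = sin(2) + 2 cos(2)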

Reverse mode. In the forward pass, record a computation graph (tape). In the backward pass, traverse the graph in reverse, accumulating partial derivatives using the chain rule at each node.

Cost comparison:

  • Forward mode AD: accumulates the derivative alongside the function evaluation - O(n) passes for n inputs, so O(n) for a scalar output, O(nm) for m outputs
  • Reverse mode AD (backpropagation): accumulates the gradient from the output backward - one backward pass, so O(1) passes for a scalar output regardless of n

For neural network training (n = millions of parameters, scalar loss \mathcal{L}): reverse mode needs O(1) passes vs. O(n) for forward mode - the reason backpropagation is computationally feasible.

I.3 Derivative-Based Interpretability

Saliency maps. For an image classification model f, the gradient \partial f(x)/\partial x_i measures how sensitive the output is to pixel i - used as a first-order feature importance score.

Input × Gradient. x_i \cdot \partial f/\partial x_i - a refinement that accounts for the magnitude of the feature, not just the sensitivity.

Integrated Gradients (Sundararajan et al. 2017). Computes the path integral of the gradient from a baseline to the input:

\text{IG}_i(x) = (x_i - x_i^0)\int_0^1 \frac{\partial f(x^0 + t(x-x^0))}{\partial x_i}\,dt

Satisfies sensitivity and completeness axioms - considered one of the most principled gradient-based attribution methods.
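In practice the path integral is approximated by a Riemann sum over interpolation steps between the baseline and the input. A hedged sketch, assuming f_grad returns the gradient of the scalar model output (here a toy analytic gradient stands in for autodiff):

import numpy as np

def integrated_gradients(f_grad, x, baseline, steps=50):
    """Approximate IG_i(x) with a midpoint Riemann sum along the straight path."""
    alphas = (np.arange(steps) + 0.5) / steps      # midpoints in (0, 1)
    total = np.zeros_like(x)
    for a in alphas:
        total += f_grad(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

# Toy model f(x) = sum(x^2), so grad f = 2x
ig = integrated_gradients(lambda z: 2 * z, np.array([1.0, 2.0]), np.zeros(2))
print(ig, ig.sum())   # completeness: ig.sum() == f(x) - f(baseline) = 5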


Appendix J: Extended Exercises and Practice Problems

J.1 Warm-Up Drill - Compute Derivatives

Differentiate each function (answers at end of section):

  1. f(x)=x53x3+2x7f(x) = x^5 - 3x^3 + 2x - 7
  2. f(x)=x21x2+1f(x) = \frac{x^2 - 1}{x^2 + 1}
  3. f(x)=(2x+1)10f(x) = (2x+1)^{10}
  4. f(x)=x2exf(x) = x^2 e^{-x}
  5. f(x)=ln(sinx)f(x) = \ln(\sin x)
  6. f(x)=arctan(ex)f(x) = \arctan(e^x)
  7. f(x)=xsinxf(x) = x^{\sin x}
  8. f(x)=x+1x2+3f(x) = \frac{\sqrt{x+1}}{x^2+3}

Answers:

  1. 5x49x2+25x^4 - 9x^2 + 2
  2. 4x(x2+1)2\frac{4x}{(x^2+1)^2}
  3. 20(2x+1)920(2x+1)^9
  4. xex(2x)xe^{-x}(2-x)
  5. cotx\cot x
  6. ex1+e2x\frac{e^x}{1+e^{2x}}
  7. xsinx(cosxlnx+sinxx)x^{\sin x}\left(\cos x\ln x + \frac{\sin x}{x}\right)
  8. (x2+3)/(2x+1)2xx+1(x2+3)2\frac{(x^2+3)/(2\sqrt{x+1}) - 2x\sqrt{x+1}}{(x^2+3)^2}

J.2 Conceptual Questions

  1. True or False. If f(c)=0f'(c) = 0, then ff has a local extremum at cc. (False - consider f(x)=x3f(x) = x^3 at c=0c=0.)

  2. Proof. Assume f is differentiable with a differentiable inverse f^{-1}. Apply the chain rule (f \circ g)'(x) = f'(g(x))g'(x) to the identity f(f^{-1}(x)) = x to show that (f^{-1})'(f(x)) = 1/f'(x) wherever f'(x) \neq 0.

  3. Explain. Why does the gradient descent step η=2/L\eta = 2/L (with L=maxfL = \max|f''|) lead to oscillatory behavior for a quadratic, while η=1/L\eta = 1/L gives smooth convergence?

  4. Derivation. Starting from σ(x)=1/(1+ex)\sigma(x) = 1/(1+e^{-x}), derive the update rule for the gradient of binary cross-entropy loss L=[ylogσ(z)+(1y)log(1σ(z))]\mathcal{L} = -[y\log\sigma(z) + (1-y)\log(1-\sigma(z))] with respect to zz.

    Solution. L/z=σ(z)y\partial\mathcal{L}/\partial z = \sigma(z) - y.

  5. Analysis. Centered finite differences have O(h^2) error, so why do gradient-checking routines that work in single precision often use h = 10^{-3} or even 10^{-2} rather than the float64-optimal h^* \approx 10^{-5}?

    Answer. With 32-bit floating point (εmach107\varepsilon_{\text{mach}} \approx 10^{-7}), the optimal step is h107/3102.3h^* \approx 10^{-7/3} \approx 10^{-2.3}. The larger step avoids round-off without sacrificing much truncation accuracy in float32.

J.3 Computational Exercises

Exercise J.1. Implement the forward, backward, and centered finite difference formulas for f(x)f'(x). Plot the absolute error vs. hh for f(x)=sin(x)f(x) = \sin(x) at x=1x = 1, using h[1016,101]h \in [10^{-16}, 10^{-1}] on a log scale. Observe the optimal hh region.
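A possible starting scaffold for Exercise J.1 (matplotlib is assumed to be available; the step grid is illustrative):

import numpy as np
import matplotlib.pyplot as plt

f, df, x = np.sin, np.cos, 1.0
hs = np.logspace(-16, -1, 200)

forward  = (f(x + hs) - f(x)) / hs
backward = (f(x) - f(x - hs)) / hs
centered = (f(x + hs) - f(x - hs)) / (2 * hs)

for est, label in [(forward, "forward"), (backward, "backward"), (centered, "centered")]:
    plt.loglog(hs, np.abs(est - df(x)), label=label)
plt.xlabel("h"); plt.ylabel("absolute error"); plt.legend()
plt.show()   # note the V-shape: truncation error on the right, round-off on the left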

Exercise J.2. Implement a simple scalar backpropagation for the function f(x)=σ(ln(x2+1))f(x) = \sigma(\ln(x^2 + 1)):

  • Forward pass: compute u1=x2+1u_1 = x^2+1, u2=lnu1u_2 = \ln u_1, y=σ(u2)y = \sigma(u_2)
  • Backward pass: compute y/x\partial y/\partial x analytically via chain rule
  • Verify with centered finite differences

Exercise J.3. For the sigmoid function, plot: (a) σ(x)\sigma(x), (b) σ(x)=σ(1σ)\sigma'(x) = \sigma(1-\sigma), (c) σ(x)\sigma''(x). Identify: the inflection point of σ\sigma, the maximum of σ\sigma', and the zeros of σ\sigma''.


Appendix K: Notation Reference and Quick-Reference Tables

K.1 Derivative Notation - Comparison

| Notation | Author | Reads as | Common usage |
|---|---|---|---|
| f'(x) | Lagrange | "f prime of x" | Pure mathematics |
| \frac{df}{dx} | Leibniz | "df by dx" | Engineering, physics |
| \dot{f} | Newton | "f dot" | Physics (time derivatives) |
| Df(x) | Euler/operator | "D of f" | Differential equations |
| \partial f/\partial x | - | "partial f partial x" | Multivariable calculus |
| \nabla_x f | Hamilton | "nabla f" or "gradient" | Machine learning, optimization |

In this curriculum: f(x)f'(x) for scalar single-variable, θL\nabla_\theta \mathcal{L} for multi-parameter ML context.

K.2 Differentiation Rules Quick Reference

| Rule | Formula | Notes |
|---|---|---|
| Constant | (c)' = 0 | |
| Identity | (x)' = 1 | |
| Power | (x^n)' = nx^{n-1} | All real n |
| Constant multiple | (cf)' = cf' | |
| Sum | (f+g)' = f' + g' | |
| Product | (fg)' = f'g + fg' | |
| Quotient | (f/g)' = (f'g - fg')/g^2 | |
| Chain | (f\circ g)'(x) = f'(g(x))g'(x) | |
| Exponential | (e^x)' = e^x | |
| Natural log | (\ln x)' = 1/x | x > 0 |
| Sine | (\sin x)' = \cos x | |
| Cosine | (\cos x)' = -\sin x | |
| Arctangent | (\arctan x)' = 1/(1+x^2) | |
| Inverse | (f^{-1})'(y) = 1/f'(f^{-1}(y)) | |

K.3 Second Derivative Test Quick Reference

| Condition | Classification |
|---|---|
| f'(x_0)=0, f''(x_0)>0 | Local minimum |
| f'(x_0)=0, f''(x_0)<0 | Local maximum |
| f'(x_0)=0, f''(x_0)=0 | Inconclusive |
| f''(x_0)>0 (any x_0) | Concave up |
| f''(x_0)<0 (any x_0) | Concave down |
| f'' changes sign at x_0 | Inflection point |

K.4 Finite Difference Error Summary

| Method | Formula | Error order | Optimal h (float64) |
|---|---|---|---|
| Forward | [f(x+h)-f(x)]/h | O(h) | \approx 10^{-8} |
| Backward | [f(x)-f(x-h)]/h | O(h) | \approx 10^{-8} |
| Centered | [f(x+h)-f(x-h)]/(2h) | O(h^2) | \approx 10^{-5.3} |
| Second derivative | [f(x+h)-2f(x)+f(x-h)]/h^2 | O(h^2) | \approx 10^{-5} |
| Richardson | Combines two centered differences | O(h^4) | larger h acceptable |

Appendix L: Implicit Differentiation - Extended Examples

L.1 Folium of Descartes

The curve x3+y3=3axyx^3 + y^3 = 3axy (with a=1a = 1) is the folium of Descartes, defined implicitly. Differentiating:

3x2+3y2dydx=3y+3xdydx3x^2 + 3y^2\frac{dy}{dx} = 3y + 3x\frac{dy}{dx} dydx(3y23x)=3y3x2    dydx=yx2y2x\frac{dy}{dx}(3y^2 - 3x) = 3y - 3x^2 \implies \frac{dy}{dx} = \frac{y - x^2}{y^2 - x}

Horizontal tangents: dy/dx=0y=x2dy/dx = 0 \Rightarrow y = x^2. Substituting into x3+y3=3xyx^3 + y^3 = 3xy: x3+x6=3x3x62x3=0x=0x^3 + x^6 = 3x^3 \Rightarrow x^6 - 2x^3 = 0 \Rightarrow x = 0 or x=21/3x = 2^{1/3}.

L.2 Ellipse and Normal Lines

The ellipse x2a2+y2b2=1\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1. Differentiating:

2xa2+2yb2dydx=0    dydx=b2xa2y\frac{2x}{a^2} + \frac{2y}{b^2}\frac{dy}{dx} = 0 \implies \frac{dy}{dx} = -\frac{b^2 x}{a^2 y}

The normal line (perpendicular to tangent) at (x0,y0)(x_0, y_0) has slope a2y0b2x0\frac{a^2 y_0}{b^2 x_0}.

L.3 Implicit Differentiation for Normalizing Flows

In normalizing flows, a bijective function z=f(x)\mathbf{z} = f(\mathbf{x}) is used. The change-of-variables formula for densities requires the determinant of the Jacobian:

pX(x)=pZ(f(x))detzxp_X(\mathbf{x}) = p_Z(f(\mathbf{x})) \left|\det\frac{\partial \mathbf{z}}{\partial \mathbf{x}}\right|

When ff is defined implicitly via g(x,z)=0g(\mathbf{x}, \mathbf{z}) = 0 (e.g., in continuous normalizing flows via an ODE), the implicit function theorem guarantees the Jacobian exists and computes it as:

zx=(gz)1gx\frac{\partial \mathbf{z}}{\partial \mathbf{x}} = -\left(\frac{\partial g}{\partial \mathbf{z}}\right)^{-1}\frac{\partial g}{\partial \mathbf{x}}

This is implicit differentiation generalized to the multivariable setting - a key tool in modern generative models (FFJORD, Neural ODEs).


Appendix M: Mean Value Theorem - Consequences and Applications

M.1 Monotonicity via Derivatives

Theorem. If f is continuous on [a,b] and f'(x) > 0 for all x \in (a,b), then f is strictly increasing on [a,b].

Proof. Let x1<x2x_1 < x_2 in [a,b][a,b]. By MVT, there exists c(x1,x2)c \in (x_1, x_2) with:

f(x2)f(x1)=f(c)(x2x1)>0f(x_2) - f(x_1) = f'(c)(x_2 - x_1) > 0

since f(c)>0f'(c) > 0 and x2x1>0x_2 - x_1 > 0. Thus f(x2)>f(x1)f(x_2) > f(x_1). \square

Corollary. f0f' \equiv 0 on (a,b)(a,b) iff ff is constant on [a,b][a,b].

M.2 Error Bounds via MVT

Theorem (MVT error bound). If f(x)M|f'(x)| \leq M for all x[a,b]x \in [a,b], then:

f(x)f(y)Mxyfor all x,y[a,b]|f(x) - f(y)| \leq M|x-y| \quad \text{for all } x, y \in [a,b]

This is the Lipschitz condition with constant MM.

Application - gradient descent learning rate. If fL|f''| \leq L (second derivative bounded), then f(x)f(y)Lxy|f'(x) - f'(y)| \leq L|x-y| - the gradient is LL-Lipschitz. The descent lemma states:

f ⁣(θ1Lf(θ))f(θ)12Lf(θ)2f\!\left(\theta - \frac{1}{L}f'(\theta)\right) \leq f(\theta) - \frac{1}{2L}|f'(\theta)|^2

ensuring a sufficient decrease of 12Lf(θ)2\frac{1}{2L}|f'(\theta)|^2 per step.
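As a quick numerical sanity check of the descent lemma, the sketch below compares both sides for f(x) = x^2 + \sin x, whose second derivative satisfies |f''| \leq 3; the test points are arbitrary.

import numpy as np

f  = lambda x: x**2 + np.sin(x)
fp = lambda x: 2*x + np.cos(x)
L  = 3.0                          # |f''(x)| = |2 - sin x| <= 3

rng = np.random.default_rng(0)
for theta in rng.uniform(-5, 5, size=5):
    lhs = f(theta - fp(theta) / L)
    rhs = f(theta) - fp(theta)**2 / (2 * L)
    assert lhs <= rhs + 1e-12     # sufficient decrease holds at every point
print("descent lemma verified on 5 random points")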

M.3 Cauchy Mean Value Theorem

Theorem. If f,gf, g are continuous on [a,b][a,b], differentiable on (a,b)(a,b), and g(x)0g'(x) \neq 0, then:

f(b)f(a)g(b)g(a)=f(c)g(c)\frac{f(b) - f(a)}{g(b) - g(a)} = \frac{f'(c)}{g'(c)}

for some c(a,b)c \in (a,b).

Application. L'Hôpital's Rule (01) follows from the Cauchy MVT: for f(a) = g(a) = 0,

f(x)g(x)=f(x)f(a)g(x)g(a)=f(c)g(c)xaf(a)g(a)\frac{f(x)}{g(x)} = \frac{f(x)-f(a)}{g(x)-g(a)} = \frac{f'(c)}{g'(c)} \xrightarrow{x \to a} \frac{f'(a)}{g'(a)}

provided the last limit exists.


Appendix N: Parametric and Polar Curves

N.1 Derivatives of Parametric Curves

A parametric curve (x(t),y(t))(x(t), y(t)) defines yy as an implicit function of xx via the parameter tt. The derivative:

dydx=dy/dtdx/dt=y(t)x(t)\frac{dy}{dx} = \frac{dy/dt}{dx/dt} = \frac{y'(t)}{x'(t)}

provided x(t)0x'(t) \neq 0.

Second derivative:

d2ydx2=ddx(dydx)=ddt(y(t)x(t))x(t)\frac{d^2y}{dx^2} = \frac{d}{dx}\left(\frac{dy}{dx}\right) = \frac{\frac{d}{dt}\left(\frac{y'(t)}{x'(t)}\right)}{x'(t)}

Example. Cycloid: x=tsintx = t - \sin t, y=1costy = 1 - \cos t.

dx/dt=1costdx/dt = 1-\cos t, dy/dt=sintdy/dt = \sin t.

dydx=sint1cost\frac{dy}{dx} = \frac{\sin t}{1-\cos t}

At t = \pi/2: dy/dx = 1/1 = 1, i.e. a 45° tangent. As t \to 0: dy/dx = \sin t/(1-\cos t) \approx t/(t^2/2) = 2/t \to \infty - a vertical tangent.

N.2 Applications to Attention Curves

In transformer attention, the attention weight for query-key pair (q,k)(q,k) is:

α=eqk/djeqkj/d\alpha = \frac{e^{q \cdot k / \sqrt{d}}}{\sum_{j} e^{q \cdot k_j / \sqrt{d}}}

The derivative of attention weight with respect to the scale factor β=1/d\beta = 1/\sqrt{d}:

dαidβ=αi(qkijαjqkj)=αi(qkiEα[qk])\frac{d\alpha_i}{d\beta} = \alpha_i\left(q\cdot k_i - \sum_j \alpha_j\, q\cdot k_j\right) = \alpha_i\left(q\cdot k_i - \mathbb{E}_\alpha[q\cdot k]\right)

This is the derivative of a softmax composition - it shows that attention sharpens (moves weight to higher-scored keys) as β\beta increases, consistent with the "temperature" interpretation from 01.
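A small numerical check of this formula against a centered finite difference; the scores s_j = q \cdot k_j and the scale \beta are illustrative values.

import numpy as np

def softmax(z):
    z = z - z.max()                    # stability shift (see Appendix O.3)
    e = np.exp(z)
    return e / e.sum()

s = np.array([1.0, 2.5, -0.5])         # illustrative scores q . k_j
beta = 0.7                             # scale factor 1/sqrt(d)

alpha = softmax(beta * s)
analytic = alpha * (s - np.dot(alpha, s))   # alpha_i (s_i - E_alpha[s])

h = 1e-6
numeric = (softmax((beta + h) * s) - softmax((beta - h) * s)) / (2 * h)
print(np.max(np.abs(analytic - numeric)))   # ~1e-10: the formulas agree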


Appendix O: Numerically Stable Derivative Computations

O.1 Sigmoid: Numerically Stable Computation

Naive σ(x)=1/(1+ex)\sigma(x) = 1/(1+e^{-x}) overflows for large negative xx (exe^{|x|} can be > float64 max 10308\approx 10^{308} for x>710|x| > 710).

Stable implementation:

import numpy as np

def sigmoid(x):
    """Numerically stable sigmoid (np.where selects the safe branch per element)."""
    return np.where(x >= 0,
                    1 / (1 + np.exp(-x)),          # safe for x >= 0: e^{-x} <= 1
                    np.exp(x) / (1 + np.exp(x)))   # safe for x < 0:  e^{x}  <= 1

Both branches are equivalent; the first is used for x0x \geq 0 (where exe^{-x} is safe), the second for x<0x < 0 (where exe^x is safe).

Derivative - stable form:

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1 - s)

No additional overflow risk once σ(x)\sigma(x) is stably computed.

O.2 Log-Sigmoid: Avoiding Catastrophic Cancellation

For the log-sigmoid logσ(x)=log(1+ex)\log\sigma(x) = -\log(1+e^{-x}):

  • For large positive xx: log(1+ex)ex\log(1+e^{-x}) \approx e^{-x} - no cancellation.
  • For large negative xx: log(1+ex)x\log(1+e^{-x}) \approx -x - use this directly.

Stable:

def log_sigmoid(x):
    return -np.logaddexp(0, -x)  # = -log(e^0 + e^{-x})

np.logaddexp(a, b) computes log(ea+eb)\log(e^a + e^b) stably by extracting the maximum:

log(ea+eb)=max(a,b)+log(1+e(ab))\log(e^a + e^b) = \max(a,b) + \log(1 + e^{-(|a-b|)})

O.3 Softmax: Subtract Maximum

Naive softmax overflows for large logits. Standard fix:

softmax(z)i=ezizmaxjezjzmax\text{softmax}(\mathbf{z})_i = \frac{e^{z_i - z_{\max}}}{\sum_j e^{z_j - z_{\max}}}

Subtracting the maximum shifts all logits to 0\leq 0, making ezizmax(0,1]e^{z_i - z_{\max}} \in (0,1]. The softmax value is unchanged since the normalization cancels the factor ezmaxe^{-z_{\max}}.
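A minimal implementation of the max-shift trick, a sketch in the same numpy style as O.1:

import numpy as np

def softmax(z):
    """Numerically stable softmax: shift by the max logit before exponentiating."""
    shifted = z - np.max(z)        # every entry <= 0, so np.exp cannot overflow
    e = np.exp(shifted)
    return e / e.sum()

print(softmax(np.array([1000.0, 1001.0, 1002.0])))   # naive e^1000 would overflow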

These numerical stability techniques are directly connected to the limits limx(1+ex)1=1\lim_{x\to\infty}(1+e^{-x})^{-1} = 1 and limxex=0\lim_{x\to-\infty}e^x = 0 from 01.


Appendix P: Derivatives in Probability and Statistics

P.1 Score Function and Fisher Information

In statistics, the score function is the derivative of the log-likelihood with respect to the parameter θ\theta:

s(θ;x)=θlogp(x;θ)s(\theta; x) = \frac{\partial}{\partial\theta}\log p(x;\theta)

Properties:

  • Eθ[s(θ;X)]=0\mathbb{E}_\theta[s(\theta;X)] = 0 (score has zero mean)
  • Varθ[s(θ;X)]=I(θ)\text{Var}_\theta[s(\theta;X)] = \mathcal{I}(\theta) (Fisher information)

Fisher information:

I(θ)=E[(logp(x;θ)θ)2]=E[2logp(x;θ)θ2]\mathcal{I}(\theta) = \mathbb{E}\left[\left(\frac{\partial\log p(x;\theta)}{\partial\theta}\right)^2\right] = -\mathbb{E}\left[\frac{\partial^2\log p(x;\theta)}{\partial\theta^2}\right]

The second equality (Fisher's identity) relates the variance of the first derivative to the expected value of the second derivative - a deep connection between information geometry and calculus.
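A Monte Carlo sketch verifying both identities for X \sim N(\theta, 1), where the score is x - \theta and the Fisher information equals 1; the sample size is arbitrary.

import numpy as np

theta = 2.0
rng = np.random.default_rng(0)
x = rng.normal(theta, 1.0, size=200_000)

score = x - theta                 # d/dtheta log p(x; theta) for N(theta, 1)
print(score.mean())               # ~0: the score has zero mean
print(score.var())                # ~1: Var[score] = Fisher information
second = -np.ones_like(x)         # d^2/dtheta^2 log p(x; theta) = -1
print(-second.mean())             # 1: -E[second derivative] matches Var[score]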

P.2 KL Divergence Gradient

The KL divergence KL(qp)=q(x)logq(x)p(x)dx\text{KL}(q\|p) = \int q(x)\log\frac{q(x)}{p(x)}dx appears throughout ML. Its derivative with respect to a parameter θ\theta of p(;θ)p(\cdot;\theta):

θKL(qpθ)=q(x)θlogp(x;θ)dx=Eq[θlogp(x;θ)]\frac{\partial}{\partial\theta}\text{KL}(q\|p_\theta) = -\int q(x)\frac{\partial}{\partial\theta}\log p(x;\theta)\,dx = -\mathbb{E}_q\left[\frac{\partial}{\partial\theta}\log p(x;\theta)\right]

This is the negative expected score under qq. In variational inference (ELBO optimization), this derivative structure underlies the "reparameterization trick" for training VAEs.

P.3 Maximum Likelihood - First Order Conditions

For i.i.d. data {x1,,xn}\{x_1,\ldots,x_n\}, MLE maximizes:

θ^=argmaxθi=1nlogp(xi;θ)\hat{\theta} = \arg\max_\theta \sum_{i=1}^n \log p(x_i;\theta)

The first-order condition θilogp(xi;θ)=0\frac{\partial}{\partial\theta}\sum_i\log p(x_i;\theta) = 0 defines the MLE. For the Gaussian with unknown mean: i(xiμ)=0μ^=xˉ\sum_i(x_i - \mu) = 0 \Rightarrow \hat{\mu} = \bar{x}.

For neural networks, MLE is equivalent to minimizing cross-entropy loss - the derivative-based optimization (SGD, Adam) maximizes the likelihood of the training data.


Appendix Q: Special Topics - Subgradients and Non-Smooth Analysis

Q.1 Subgradients

For a convex function ff that may not be differentiable, a subgradient at x0x_0 is any gRg \in \mathbb{R} satisfying:

f(x)f(x0)+g(xx0)for all xf(x) \geq f(x_0) + g(x - x_0) \quad \text{for all } x

The subdifferential f(x0)\partial f(x_0) is the set of all subgradients.

Examples:

  • f(x)=xf(x) = |x|: f(0)=[1,1]\partial f(0) = [-1, 1]; f(x)={sign(x)}\partial f(x) = \{\text{sign}(x)\} for x0x \neq 0.
  • f(x)=ReLU(x)=max(0,x)f(x) = \text{ReLU}(x) = \max(0,x): f(0)=[0,1]\partial f(0) = [0, 1]; f(x)={0}\partial f(x) = \{0\} for x<0x<0; f(x)={1}\partial f(x) = \{1\} for x>0x>0.

Q.2 Subgradient Descent

For non-smooth convex ff, the subgradient descent update uses any gtf(θt)g_t \in \partial f(\theta_t):

θt+1=θtηtgt\theta_{t+1} = \theta_t - \eta_t g_t

Unlike gradient descent for smooth functions, subgradient descent does NOT necessarily decrease ff at each step. Convergence requires diminishing step sizes: tηt=\sum_t \eta_t = \infty, tηt2<\sum_t \eta_t^2 < \infty (Robbins-Monro conditions from 01).

Q.3 Proximal Operators

For non-smooth regularization (L1 sparsity), instead of subgradient descent, the proximal gradient method is often used. The proximal operator of hh at vv with step size η\eta:

proxηh(v)=argminx[h(x)+12ηxv2]\text{prox}_{\eta h}(v) = \arg\min_x\left[h(x) + \frac{1}{2\eta}\|x-v\|^2\right]

For h(x) = \lambda|x|: \text{prox}_{\eta h}(v) = \text{sign}(v)\max(|v|-\eta\lambda, 0) - the soft-thresholding operator. It solves the proximal subproblem for the L1 penalty exactly and is the building block of proximal gradient methods (ISTA/FISTA) for LASSO regression.
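A short sketch of the soft-thresholding operator and one ISTA-style proximal gradient step; the function names are illustrative.

import numpy as np

def soft_threshold(v, t):
    """prox of t*|.|: shrink toward zero and clip at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient_step(theta, grad_smooth, eta, lam):
    """One ISTA step for  smooth(theta) + lam * ||theta||_1."""
    return soft_threshold(theta - eta * grad_smooth(theta), eta * lam)

print(soft_threshold(np.array([-3.0, -0.2, 0.5, 2.0]), 1.0))
# [-2. -0.  0.  1.]  -- small entries are zeroed out (sparsity)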


Appendix R: Key Proofs in Full

R.1 Proof: Product Rule

Theorem. If ff and gg are differentiable at xx, then (fg)(x)=f(x)g(x)+f(x)g(x)(fg)'(x) = f'(x)g(x) + f(x)g'(x).

Proof.

limh0f(x+h)g(x+h)f(x)g(x)h\lim_{h\to 0}\frac{f(x+h)g(x+h) - f(x)g(x)}{h}

Add and subtract f(x+h)g(x)f(x+h)g(x):

=limh0f(x+h)g(x+h)f(x+h)g(x)+f(x+h)g(x)f(x)g(x)h= \lim_{h\to 0}\frac{f(x+h)g(x+h) - f(x+h)g(x) + f(x+h)g(x) - f(x)g(x)}{h} =limh0[f(x+h)g(x+h)g(x)h+g(x)f(x+h)f(x)h]= \lim_{h\to 0}\left[f(x+h)\cdot\frac{g(x+h)-g(x)}{h} + g(x)\cdot\frac{f(x+h)-f(x)}{h}\right] =limh0f(x+h)g(x)+g(x)f(x)= \lim_{h\to 0}f(x+h) \cdot g'(x) + g(x) \cdot f'(x)

Since ff is differentiable at xx, it is continuous there, so limh0f(x+h)=f(x)\lim_{h\to 0}f(x+h) = f(x):

=f(x)g(x)+g(x)f(x).= f(x)g'(x) + g(x)f'(x). \quad \square

R.2 Proof: Chain Rule (via auxiliary function)

Theorem. If gg is differentiable at xx and ff is differentiable at g(x)g(x), then (fg)(x)=f(g(x))g(x)(f\circ g)'(x) = f'(g(x))g'(x).

Proof. Define the auxiliary function:

ϕ(k)={f(g(x)+k)f(g(x))kk0f(g(x))k=0\phi(k) = \begin{cases} \frac{f(g(x)+k) - f(g(x))}{k} & k \neq 0 \\ f'(g(x)) & k = 0 \end{cases}

By differentiability of ff at g(x)g(x): limk0ϕ(k)=f(g(x))\lim_{k\to 0}\phi(k) = f'(g(x)), so ϕ\phi is continuous at k=0k=0.

Let k=g(x+h)g(x)k = g(x+h) - g(x). Then for h0h \neq 0:

f(g(x+h))f(g(x))h=ϕ(k)g(x+h)g(x)h\frac{f(g(x+h)) - f(g(x))}{h} = \phi(k) \cdot \frac{g(x+h)-g(x)}{h}

(This holds even when k=0k = 0 for some values of hh, since both sides are 0 in that case.)

Taking h0h \to 0: k0k \to 0 (by continuity of gg), so ϕ(k)f(g(x))\phi(k) \to f'(g(x)), and (g(x+h)g(x))/hg(x)(g(x+h)-g(x))/h \to g'(x):

(fg)(x)=f(g(x))g(x).(f\circ g)'(x) = f'(g(x)) \cdot g'(x). \quad \square

R.3 Proof: Differentiable Implies Continuous

Theorem. If ff is differentiable at aa, then ff is continuous at aa.

Proof.

limxa[f(x)f(a)]=limxaf(x)f(a)xa(xa)=f(a)0=0\lim_{x\to a}[f(x) - f(a)] = \lim_{x\to a}\frac{f(x)-f(a)}{x-a}\cdot(x-a) = f'(a)\cdot 0 = 0

Therefore limxaf(x)=f(a)\lim_{x\to a}f(x) = f(a), which is the definition of continuity at aa. \square

R.4 Proof: (sinx)=cosx(\sin x)' = \cos x

Proof. Using the sine addition formula and two fundamental limits from 01:

sin(x+h)sinxh=sinxcosh+cosxsinhsinxh\frac{\sin(x+h)-\sin x}{h} = \frac{\sin x\cos h + \cos x\sin h - \sin x}{h} =sinxcosh1h+cosxsinhh= \sin x\cdot\frac{\cos h - 1}{h} + \cos x\cdot\frac{\sin h}{h}

As h \to 0: (\cos h - 1)/h \to 0, since (\cos h - 1)/h = -\frac{\sin h}{h}\cdot\frac{\sin h}{1+\cos h} \to 1\cdot 0 = 0, and \sin h/h \to 1:

(sinx)=sinx0+cosx1=cosx.(\sin x)' = \sin x\cdot 0 + \cos x\cdot 1 = \cos x. \quad \square

Appendix S: Worked Examples - Chain Rule in Neural Networks

S.1 Single-Layer Sigmoid Network - Full Backprop

Setup. Input xx, single weight ww, bias bb, sigmoid activation, MSE loss:

z=wx+b,a=σ(z),y^=a,L=12(ay)2z = wx + b, \quad a = \sigma(z), \quad \hat{y} = a, \quad \mathcal{L} = \tfrac{1}{2}(a - y)^2

Forward pass (left to right):

| Variable | Value |
|---|---|
| z | wx + b |
| a | \sigma(z) |
| \mathcal{L} | \frac{1}{2}(a-y)^2 |

Backward pass (chain rule, right to left):

La=ay\frac{\partial\mathcal{L}}{\partial a} = a - y Lz=Laaz=(ay)σ(z)=(ay)a(1a)\frac{\partial\mathcal{L}}{\partial z} = \frac{\partial\mathcal{L}}{\partial a}\cdot\frac{\partial a}{\partial z} = (a-y)\cdot\sigma'(z) = (a-y)\cdot a(1-a) Lw=Lzzw=(ay)a(1a)x\frac{\partial\mathcal{L}}{\partial w} = \frac{\partial\mathcal{L}}{\partial z}\cdot\frac{\partial z}{\partial w} = (a-y)\cdot a(1-a)\cdot x Lb=Lzzb=(ay)a(1a)1\frac{\partial\mathcal{L}}{\partial b} = \frac{\partial\mathcal{L}}{\partial z}\cdot\frac{\partial z}{\partial b} = (a-y)\cdot a(1-a)\cdot 1

The structure: error signal (ay)(a-y) is modulated by the local gradient σ(z)=a(1a)\sigma'(z) = a(1-a) before being distributed to the weight ww (scaled by input xx) and bias bb.
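These four formulas are easy to check numerically. A sketch comparing the analytic \partial\mathcal{L}/\partial w against a centered finite difference; the values of w, b, x, y are arbitrary.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(w, b, x, y):
    return 0.5 * (sigmoid(w * x + b) - y) ** 2

w, b, x, y = 0.7, -0.3, 2.0, 1.0
a = sigmoid(w * x + b)

dL_dw = (a - y) * a * (1 - a) * x      # chain rule, as derived above
dL_db = (a - y) * a * (1 - a)

h = 1e-6
numeric = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)
print(dL_dw, numeric)                  # agree to ~1e-10
print(dL_db)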

S.2 Two-Layer ReLU Network

Input xx, hidden layer h1=ReLU(w1x+b1)h_1 = \text{ReLU}(w_1 x + b_1), output y^=w2h1+b2\hat{y} = w_2 h_1 + b_2:

L=12(y^y)2\mathcal{L} = \tfrac{1}{2}(\hat{y}-y)^2

Gradients:

Lw2=(y^y)h1\frac{\partial\mathcal{L}}{\partial w_2} = (\hat{y}-y)\cdot h_1 Lb2=(y^y)\frac{\partial\mathcal{L}}{\partial b_2} = (\hat{y}-y) δ1=L(w1x+b1)=(y^y)w21[w1x+b1>0]\delta_1 = \frac{\partial\mathcal{L}}{\partial(w_1 x+b_1)} = (\hat{y}-y)\cdot w_2\cdot\mathbf{1}[w_1 x+b_1 > 0] Lw1=δ1x,Lb1=δ1\frac{\partial\mathcal{L}}{\partial w_1} = \delta_1\cdot x, \qquad \frac{\partial\mathcal{L}}{\partial b_1} = \delta_1

The indicator 1[]\mathbf{1}[\cdot] is the ReLU subgradient: if the pre-activation was negative, the gradient is blocked (dead neuron).

S.3 Gradient Flow Visualization

BACKPROPAGATION GRADIENT FLOW

  INPUT                                                      LOSS
    x --> [ z = wx+b ] --> [ a = sigma(z) ] --> [ L = (a-y)^2 ]

          -------------- forward pass -------------->
          <------------- backward pass --------------

    ∂L/∂w = (a-y)*a(1-a)*x                      ∂L/∂a = a - y

  Each node multiplies its local gradient into the flowing signal.
  This IS the chain rule: ∂L/∂w = (∂L/∂a)(∂a/∂z)(∂z/∂w)


S.4 Gradient Tape - PyTorch Analogy

In PyTorch, the computation graph is built dynamically during the forward pass. .backward() traverses it in reverse, applying the chain rule at each node:

import torch
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
z = w * x + 1.0          # z = 7.0
a = torch.sigmoid(z)      # a ~= 0.999
L = 0.5 * (a - 0.5) ** 2  # L

L.backward()              # Reverse-mode AD: computes all partialL/partialw, partialL/partialx
print(w.grad)             # partialL/partialw = (a-0.5)*a*(1-a)*x
print(x.grad)             # partialL/partialx = (a-0.5)*a*(1-a)*w

This is the chain rule automated - the "tape" records (z,a,L)(z, a, L) in the forward pass, then .backward() applies our formulas in reverse.


Appendix T: Summary and Connections

T.1 Core Theorem Summary

| Theorem | Statement | Key Condition | Consequence |
|---|---|---|---|
| Differentiable -> Continuous | f'(a) exists -> f continuous at a | Differentiability | Justifies assuming continuity in proofs |
| Power Rule | (x^n)' = nx^{n-1} | n \in \mathbb{R} | Foundation of all polynomial differentiation |
| Product Rule | (fg)' = f'g+fg' | f, g differentiable | Essential for activation analysis |
| Chain Rule | (f\circ g)'=f'(g)\cdot g' | Both differentiable | Backbone of backpropagation |
| Mean Value Theorem | \exists c: f'(c) = (f(b)-f(a))/(b-a) | Continuous on [a,b], differentiable on (a,b) | Lipschitz bounds, descent lemmas |
| First Derivative Test | Sign change of f' classifies extrema | f'(x_0)=0 | Optimization - finding local minima |
| Second Derivative Test | Sign of f'' at critical point | f'(x_0)=0, f'' exists | Confirms min/max; generalizes to PD Hessian |
| Inverse Function Theorem | (f^{-1})'=1/f'(f^{-1}) | f'(a)\neq 0 | Derives inverse trig derivatives; flows |

T.2 Dependency Graph - All Concepts in This Section

DERIVATIVE DEPENDENCY MAP

  Limit (01) --> Derivative Definition
                        |
        +---------------+-----------------+
        |               |                 |
  Constant/Power    Product Rule      Chain Rule
  Sum/Difference    Quotient Rule         |
        |               |                 |
        +---------------+-----------------+
                        |
              Differentiation Rules
                        |
      +-----------------+--------------------+
      |                 |                    |
  Higher Derivatives   Implicit Diff.   Activation derivatives
  Concavity/Inflection Related Rates    sigma', ReLU', GELU'
      |                                      |
  Critical Points /                  Vanishing Gradients
  Extrema / MVT                      Backpropagation
      |
  Gradient Descent Theory --> 08-Optimization


T.3 What to Review if You're Stuck

| Struggling with | Review |
|---|---|
| Chain rule applications | 4.1-4.2, Appendix B |
| Activation function derivatives | 8.1-8.5, Appendix G |
| Critical point classification | 7.1-7.3 |
| Numerical differentiation step size | 9.1-9.2, Appendix F |
| Backpropagation derivation | 4.3, Appendix S |
| Implicit differentiation | 5.1, Appendix L |
| Why ReLU avoids vanishing gradients | Appendix G.1 |
| Softmax gradient computation | Appendix G.3 |
| Adam optimizer intuition | Appendix I.1 |
| Gradient checking in code | 9.3 |

Appendix U: Practice Problems - Solutions

U.1 Solutions: Selected Section 11 Exercises

Exercise 1 - Basic rules.

(a) f(x)=3x42x2+7f(x) = 3x^4 - 2x^2 + 7:

f(x)=12x34xf'(x) = 12x^3 - 4x

(b) g(x)=xex=x1/2exg(x) = \sqrt{x}\,e^x = x^{1/2}e^x. Product rule:

g(x)=12x1/2ex+x1/2ex=ex ⁣(12x+x)=ex1+2x2xg'(x) = \tfrac{1}{2}x^{-1/2}e^x + x^{1/2}e^x = e^x\!\left(\frac{1}{2\sqrt{x}} + \sqrt{x}\right) = e^x\cdot\frac{1+2x}{2\sqrt{x}}

(c) h(x)=(x2+1)/(x1)h(x) = (x^2+1)/(x-1). Quotient rule (f=x2+1f=x^2+1, g=x1g=x-1):

h(x)=2x(x1)(x2+1)(1)(x1)2=2x22xx21(x1)2=x22x1(x1)2h'(x) = \frac{2x(x-1) - (x^2+1)(1)}{(x-1)^2} = \frac{2x^2-2x-x^2-1}{(x-1)^2} = \frac{x^2-2x-1}{(x-1)^2}

Exercise 2 - Chain rule.

(a) (sin(x3))=cos(x3)3x2(\sin(x^3))' = \cos(x^3)\cdot 3x^2

(b) (ln(cosx))=sinxcosx=tanx(\ln(\cos x))' = \frac{-\sin x}{\cos x} = -\tan x

(c) ((1+x2)10)=10(1+x2)92x=20x(1+x2)9((1+x^2)^{10})' = 10(1+x^2)^9\cdot 2x = 20x(1+x^2)^9

(d) (esinx)=esinxcosx(e^{\sin x})' = e^{\sin x}\cos x

Exercise 3 - Implicit differentiation: x3+y3=6xyx^3 + y^3 = 6xy.

Differentiate: 3x2+3y2y=6y+6xy3x^2 + 3y^2 y' = 6y + 6x y'

Collect yy': y(3y26x)=6y3x2y'(3y^2 - 6x) = 6y - 3x^2

y=6y3x23y26x=2yx2y22xy' = \frac{6y - 3x^2}{3y^2 - 6x} = \frac{2y - x^2}{y^2 - 2x}

Exercise 4 - Sigmoid derivative.

(a) Derivation:

σ(x)=ddx(1+ex)1=(1+ex)2(ex)=ex(1+ex)2\sigma'(x) = \frac{d}{dx}(1+e^{-x})^{-1} = -(1+e^{-x})^{-2}\cdot(-e^{-x}) = \frac{e^{-x}}{(1+e^{-x})^2}

Factor: =11+exex1+ex=σ(x)(1σ(x))= \frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}} = \sigma(x)\cdot(1-\sigma(x)). \square

Exercise 6 - MVT proof: sinasinbab|\sin a - \sin b| \leq |a-b|.

The sine function is differentiable with (sinx)=cosx1|(\sin x)' | = |\cos x| \leq 1 for all xx. By the MVT, for any aba \neq b there exists cc between aa and bb with:

sinasinbab=cosc\frac{\sin a - \sin b}{a - b} = \cos c

Therefore:

sinasinb=coscab1ab=ab.|\sin a - \sin b| = |\cos c||a-b| \leq 1\cdot|a-b| = |a-b|. \quad \square

(The inequality also holds for a=ba = b: 000 \leq 0.)

Exercise 8 - Backpropagation chain rule:

Forward: z1=w1xz_1 = w_1 x, a1=σ(z1)a_1 = \sigma(z_1), z2=w2a1z_2 = w_2 a_1, L=12(z2y)2\mathcal{L} = \frac{1}{2}(z_2-y)^2.

Lz2=z2y=w2a1y\frac{\partial\mathcal{L}}{\partial z_2} = z_2 - y = w_2 a_1 - y Lw2=(z2y)a1\frac{\partial\mathcal{L}}{\partial w_2} = (z_2-y)\cdot a_1 La1=(z2y)w2\frac{\partial\mathcal{L}}{\partial a_1} = (z_2-y)\cdot w_2 Lz1=(z2y)w2σ(z1)=(z2y)w2a1(1a1)\frac{\partial\mathcal{L}}{\partial z_1} = (z_2-y)\cdot w_2\cdot\sigma'(z_1) = (z_2-y)\cdot w_2\cdot a_1(1-a_1) Lw1=(z2y)w2a1(1a1)x\frac{\partial\mathcal{L}}{\partial w_1} = (z_2-y)\cdot w_2\cdot a_1(1-a_1)\cdot x

This reproduces the four-factor chain: loss sensitivity × downstream weight × local gate × input.


Appendix V: Connections to Later Chapters

V.1 What Integration Needs from This Section

The next section (03-Integration) uses:

  • Antiderivatives: integration reverses differentiation. Every differentiation formula gives an integration formula: (sinx)=cosxcosxdx=sinx+C(\sin x)' = \cos x \Rightarrow \int\cos x\,dx = \sin x + C.
  • Fundamental Theorem of Calculus (Part 2): if F=fF' = f, then abf(x)dx=F(b)F(a)\int_a^b f(x)\,dx = F(b) - F(a). This requires knowing derivatives of common functions.
  • Substitution rule: the reverse chain rule. If u=g(x)u = g(x), f(g(x))g(x)dx=f(u)du\int f(g(x))g'(x)\,dx = \int f(u)\,du.

Every derivative formula in 3.5 has a corresponding integral formula. Master both sides.

V.2 What Series Needs from This Section

04-Series-and-Sequences uses:

  • Higher-order derivatives: Taylor coefficients an=f(n)(a)/n!a_n = f^{(n)}(a)/n! require computing many derivatives.
  • Linear approximation: the first-order Taylor expansion f(x)f(a)+f(a)(xa)f(x) \approx f(a) + f'(a)(x-a) is a direct extension of 5.3.
  • Error bounds: the Taylor remainder involves f(n+1)f^{(n+1)} at some intermediate point (MVT applied to integrals).

V.3 What Multivariate Calculus Needs from This Section

05-Multivariate Calculus extends every concept here:

| 02 Concept | 05 Extension |
|---|---|
| Derivative f'(x) | Gradient \nabla f(\mathbf{x}) |
| Chain rule df/dx = (df/du)(du/dx) | Chain rule with Jacobians: \mathbf{J}_{f\circ g} = \mathbf{J}_f \mathbf{J}_g |
| Second derivative f''(x) | Hessian \mathbf{H} = \nabla^2 f |
| Inflection point | Saddle point in \mathbb{R}^n |
| Linear approximation f(x)\approx f(a)+f'(a)(x-a) | f(\mathbf{x})\approx f(\mathbf{a})+\nabla f(\mathbf{a})^\top(\mathbf{x}-\mathbf{a}) |
| f'(a) = 0 at extremum | \nabla f(\mathbf{a}) = \mathbf{0} at extremum |

The scalar theory developed here is the essential foundation. Every formula in 05 will look familiar with vector notation substituted.

V.4 What Optimization Needs from This Section

08-Optimization uses directly:

  • Gradient descent theory (7.3 MVT, descent lemma)
  • Critical point classification (saddles dominate in high dimensions)
  • Chain rule for backpropagation
  • Adam optimizer - second moment as curvature estimate
  • Gradient checking (9.3) for debugging

Appendix W: Glossary

| Term | Definition |
|---|---|
| Derivative | f'(a) = \lim_{h\to 0}[f(a+h)-f(a)]/h - instantaneous rate of change |
| Differentiable | The derivative exists and is finite at the point |
| Smooth (C^\infty) | All derivatives of all orders exist and are continuous |
| Critical point | x_0 where f'(x_0) = 0 or f'(x_0) undefined |
| Local minimum | f(x_0) \leq f(x) for all x near x_0 |
| Concave up | f''(x) > 0 - bowl shape, gradient is increasing |
| Inflection point | Where f'' changes sign |
| Subgradient | Generalized derivative for non-smooth convex functions |
| Lipschitz constant | L such that \lvert f(x)-f(y)\rvert \leq L\lvert x-y\rvert for all x, y |
| Forward difference | (f(x+h)-f(x))/h - O(h) approximation to f'(x) |
| Centered difference | (f(x+h)-f(x-h))/(2h) - O(h^2) approximation to f'(x) |
| Automatic differentiation | Exact derivative computation by applying the chain rule to elementary ops |
| Backpropagation | Reverse-mode AD on a neural network computation graph |
| Vanishing gradient | Product of many small derivatives -> exponential decay |
| Score function | \nabla_\theta \log p(x;\theta) - derivative of the log-likelihood |

Appendix X: Additional Examples - Differentiating Through Compositions

X.1 Softplus as Smooth ReLU

The softplus function f(x)=ln(1+ex)f(x) = \ln(1+e^x) is a smooth approximation to ReLU.

Derivative:

f(x)=ex1+ex=11+ex=σ(x)f'(x) = \frac{e^x}{1+e^x} = \frac{1}{1+e^{-x}} = \sigma(x)

Softplus's derivative is the sigmoid! This means:

  • Softplus is always increasing (σ(x)>0\sigma(x) > 0 for all xx)
  • Its second derivative: f(x)=σ(x)=σ(x)(1σ(x))>0f''(x) = \sigma'(x) = \sigma(x)(1-\sigma(x)) > 0 - always concave up
  • Softplus approaches ReLU as temperature T0T \to 0: Tln(1+ex/T)max(0,x)T\ln(1+e^{x/T}) \to \max(0,x)

X.2 SiLU / Swish

Swish: swish(x)=xσ(x)\text{swish}(x) = x\sigma(x).

Derivative (product rule):

swish(x)=σ(x)+xσ(x)=σ(x)+xσ(x)(1σ(x))=σ(x)(1+x(1σ(x)))\text{swish}'(x) = \sigma(x) + x\sigma'(x) = \sigma(x) + x\sigma(x)(1-\sigma(x)) = \sigma(x)(1 + x(1-\sigma(x)))

At x=0x = 0: swish(0)=σ(0)(1+0)=1/2\text{swish}'(0) = \sigma(0)(1+0) = 1/2.

Swish is non-monotone: has a local minimum at x1.28x \approx -1.28 (where swish(x)=0\text{swish}'(x) = 0). This property, combined with self-gating, is hypothesized to improve gradient flow in deep networks.

X.3 Mish

Mish: mish(x)=xtanh(softplus(x))=xtanh(ln(1+ex))\text{mish}(x) = x\tanh(\text{softplus}(x)) = x\tanh(\ln(1+e^x)).

Derivative (product rule + chain rule):

mish(x)=tanh(softplus(x))+xsech2(softplus(x))σ(x)\text{mish}'(x) = \tanh(\text{softplus}(x)) + x \cdot \text{sech}^2(\text{softplus}(x)) \cdot \sigma(x)

Mish is smooth, non-monotone (like Swish), and has shown improvements over ReLU in some image classification benchmarks. Its derivative involves three activations composed - a showcase for the chain rule.

X.4 Gradient of Cross-Entropy Loss - Full Derivation

For KK-class classification with softmax output p=softmax(z)\mathbf{p} = \text{softmax}(\mathbf{z}) and one-hot label y\mathbf{y}:

L=k=1Kyklogpk\mathcal{L} = -\sum_{k=1}^K y_k \log p_k Lzj=kyklogpkzj=kyk1pkpkzj\frac{\partial\mathcal{L}}{\partial z_j} = -\sum_k y_k \frac{\partial\log p_k}{\partial z_j} = -\sum_k y_k \frac{1}{p_k}\frac{\partial p_k}{\partial z_j}

Using the softmax Jacobian from G.3:

=kyk1pkpk(δkjpj)=kyk(δkjpj)= -\sum_k y_k \frac{1}{p_k}p_k(\delta_{kj} - p_j) = -\sum_k y_k(\delta_{kj} - p_j) =yj+pjkyk=pjyj= -y_j + p_j\sum_k y_k = p_j - y_j

(since kyk=1\sum_k y_k = 1 for a one-hot vector). This elegant result - gradient = predicted probability minus true probability - is why the combined softmax + cross-entropy layer is computationally convenient and numerically stable.
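The p - y result can be confirmed numerically; a sketch comparing the analytic gradient with centered finite differences on random logits (5 classes, arbitrary seed).

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.default_rng(0)
z = rng.normal(size=5)                 # logits
y = np.eye(5)[2]                       # one-hot label for class 2

analytic = softmax(z) - y              # gradient = p - y

h = 1e-6
numeric = np.array([
    (cross_entropy(z + h * e, y) - cross_entropy(z - h * e, y)) / (2 * h)
    for e in np.eye(5)
])
print(np.max(np.abs(analytic - numeric)))   # ~1e-10: gradients agree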


Appendix Y: Connecting to the Broader Curriculum

Y.1 Prerequisites Review

Before using the material in this section, ensure you are solid on:

  • Limit definition (01 2): limh0(...)\lim_{h\to 0}(...) is the backbone of the derivative definition
  • Fundamental limits (01 3): limh0sin(h)/h=1\lim_{h\to 0}\sin(h)/h = 1 is used to derive (sinx)=cosx(\sin x)' = \cos x
  • Continuity (01 6): differentiability implies continuity; you need continuity to apply EVT and FTC
  • Algebra: polynomial factoring, rational expression manipulation - needed for quotient rule applications
  • Exponential/log identities: ea+b=eaebe^{a+b} = e^a e^b, ln(ab)=lna+lnb\ln(ab) = \ln a + \ln b - used in logarithmic differentiation

Y.2 Key Formulas from This Section (Memorize These)

(xn)=nxn1(ex)=ex(lnx)=1x\boxed{(x^n)' = nx^{n-1}} \qquad \boxed{(e^x)' = e^x} \qquad \boxed{(\ln x)' = \frac{1}{x}} (fg)=fg+fg(fg)=fgfgg2(fg)=f(g)g\boxed{(fg)' = f'g + fg'} \qquad \boxed{\left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}} \qquad \boxed{(f\circ g)' = f'(g)\cdot g'} σ(x)=σ(x)(1σ(x))ReLU(x)=1[x>0]\boxed{\sigma'(x) = \sigma(x)(1-\sigma(x))} \qquad \boxed{\text{ReLU}'(x) = \mathbf{1}[x>0]} f(x)f(x+h)f(xh)2h+O(h2)\boxed{f'(x) \approx \frac{f(x+h)-f(x-h)}{2h} + O(h^2)}

These formulas appear throughout the rest of the curriculum, in optimization, probability, and all applied sections.


This section is part of the Math for LLMs curriculum. For issues or contributions, see CONTRIBUTING.md.