Proof Techniques


"A proof is a finite sequence of logical steps that establishes a truth beyond any possible doubt. Unlike experiments, which can always be overturned by a new observation, a mathematical proof is eternal - valid for all cases, for all time."

Overview

A proof technique is a general strategy for establishing that a mathematical statement is true - not probably true, not true in every case we checked, but necessarily true as a consequence of axioms, definitions, and previously established results. Proof techniques are the algorithms of mathematical reasoning: given a goal, which technique do you apply? Given a structure, which argument do you use?

For AI practitioners, proof techniques are not optional academic exercises. Every convergence theorem for gradient descent, every generalisation bound in PAC learning, every correctness argument for attention mechanisms, every hardness result in complexity theory - all rest on specific proof strategies. Without fluency in these strategies, theoretical machine learning papers are inaccessible, and the foundations of why models work (or fail) remain opaque.

This chapter covers the complete landscape of proof techniques, from elementary direct proofs through probabilistic existence arguments and analytic convergence proofs, with emphasis on the patterns that recur throughout ML theory.

Prerequisites

  • Propositional and predicate logic (from Sets and Logic module)
  • Basic set operations (union, intersection, complement)
  • Familiarity with functions, injectivity, surjectivity
  • Elementary number theory concepts (even, odd, prime, divisibility)
  • Basic probability (for Sections 10, 13)

Companion Notebooks

| Notebook | Description |
| --- | --- |
| theory.ipynb | Interactive demonstrations: proof verification, induction chains, probabilistic arguments |
| exercises.ipynb | Practice problems: direct proofs, contradiction, induction, counting, ML proof patterns |

Learning Objectives

After completing this section, you will:

  • Identify and apply the correct proof technique for any mathematical statement
  • Write rigorous direct proofs, proofs by contradiction, and proofs by contrapositive
  • Master mathematical induction (ordinary, strong, and structural)
  • Understand and apply the probabilistic method for existence proofs
  • Use counting arguments, pigeonhole principle, and double counting
  • Write ε-δ proofs for limits, continuity, and convergence
  • Recognise and apply standard ML proof patterns: union bound, concentration inequalities, PAC learning, reduction proofs
  • Identify common proof errors and evaluate the validity of AI-generated proofs
  • Read and understand proofs in theoretical ML papers


1. Intuition

1.1 What Are Proof Techniques?

A proof technique is a general strategy for establishing that a mathematical statement is true beyond any possible doubt. Unlike empirical evidence (which can always be overturned by a new counterexample) or intuition (which is frequently wrong), a mathematical proof is an absolute guarantee - valid for all cases, for all time.

Proof techniques are the algorithms of mathematical reasoning: given a goal, which technique do you apply? Given a structure, which argument do you use? Mastering proof techniques means mastering the ability to move from "I believe this is true" to "I can prove this is true and explain exactly why."

For AI: proofs establish correctness of algorithms, validity of bounds, convergence of optimisers, and soundness of theoretical guarantees. Without proof techniques, theoretical ML is inaccessible.

1.2 Why Proof Techniques Matter for AI

| AI Domain | Proof Technique Required | Example |
| --- | --- | --- |
| Convergence proofs | Telescoping, induction, ε-δ | Proving gradient descent converges to a stationary point |
| Generalisation bounds | Probabilistic method, union bound | PAC learning, VC dimension, Rademacher complexity |
| Algorithm correctness | Direct proof, induction | Proving BPE terminates; proving softmax is well-defined |
| Complexity theory | Reduction proofs, contradiction | Showing neural network verification is NP-hard |
| LLM reasoning evaluation | All techniques | Evaluating whether an LLM's "proof" is logically valid |
| Formal verification | All techniques | Lean, Coq, Isabelle verify AI system properties |

1.3 The Landscape of Proof Techniques

+==============================================================+
|                    PROOF TECHNIQUES                          |
+==============================================================+
|                                                              |
|  Direct Methods                                              |
|  +-- Direct proof (assume P; derive Q)                       |
|  +-- Proof by construction (exhibit the object)              |
|  +-- Proof by exhaustion (check all cases)                   |
|                                                              |
|  Indirect Methods                                            |
|  +-- Contrapositive (prove ¬Q -> ¬P instead of P -> Q)       |
|  +-- Contradiction (assume ¬P; derive ⊥)                     |
|  +-- Proof by cases (partition; prove each)                  |
|                                                              |
|  Inductive Methods                                           |
|  +-- Mathematical induction (base + step)                    |
|  +-- Strong induction (use all previous cases)               |
|  +-- Structural induction (induct on recursive structure)    |
|                                                              |
|  Probabilistic Methods                                       |
|  +-- Probabilistic method (show Pr[property] > 0)            |
|  +-- Expectation argument (E[X] implies existence)           |
|  +-- Lovasz Local Lemma (avoid rare bad events)              |
|                                                              |
|  Algebraic Methods                                           |
|  +-- Counting arguments (double counting; bijection)         |
|  +-- Pigeonhole principle (n+1 items in n bins)              |
|  +-- Generating functions (encode combinatorics)             |
|                                                              |
|  Analytic Methods                                            |
|  +-- ε-δ arguments (limits; continuity; convergence)         |
|  +-- Compactness arguments (Heine-Borel; sequential)         |
|  +-- Fixed point arguments (Banach; Brouwer; Kakutani)       |
|                                                              |
+==============================================================+

1.4 The Structure of a Mathematical Statement

Every theorem has the form: "If [hypotheses], then [conclusion]."

Formally: $H_1 \land H_2 \land \dots \land H_n \to C$

  • Hypotheses: conditions assumed to hold; the "given" part
  • Conclusion: what must be shown to follow; the "prove" part

A proof is a finite sequence of logically valid steps from hypotheses to conclusion. Each step follows from axioms, definitions, previously proved results, or hypotheses.

Key skill: identifying what you are allowed to assume and what you must derive.

+===================================================+
|          ANATOMY OF A THEOREM                     |
+===================================================+
|                                                   |
|  "If  n is even,  then  n^2 is even"              |
|       ---------        ----------                 |
|       hypothesis       conclusion                 |
|       (assume this)    (derive this)              |
|                                                   |
|  Proof = chain of valid steps:                    |
|                                                   |
|  Assume n is even                                 |
|    -> n = 2k for some integer k      (definition)  |
|    -> n^2 = (2k)^2 = 4k^2              (algebra)     |
|    -> n^2 = 2(2k^2)                   (factor)      |
|    -> n^2 is even                     (definition)  |
|                                                   |
|  Each arrow: justified by a named rule.           |
|                                                   |
+===================================================+

1.5 Reading and Writing Proofs

Reading a proof: identify hypotheses -> identify conclusion -> trace each step -> check each logical inference -> verify completeness.

Writing a proof: state what you are proving -> state your strategy -> execute strategy -> end with explicit conclusion.

Common proof structure markers:

| Marker | Purpose |
| --- | --- |
| "Assume..." / "Suppose..." | Introduce hypothesis or assumption |
| "Let..." | Introduce a variable or object |
| "Since..." / "Because..." / "By..." | Justify a step |
| "Therefore..." / "Thus..." / "Hence..." | Conclude a step |
| "We have shown..." / "This completes the proof" | Explicit closure |
| "QED" / "□" / "∎" | End of proof |
| "Case 1: ... Case 2: ..." | Proof by cases |
| "Base case: ... Inductive step: ..." | Induction |

1.6 Historical Timeline

| Date | Milestone | Significance |
| --- | --- | --- |
| ~300 BCE | Euclid's Elements | First systematic proofs; 465 propositions from 5 postulates |
| ~250 BCE | Archimedes | Method of exhaustion; proto-integration |
| 9th-13th c. | Arab mathematicians | Algebraic proofs; al-Khwarizmi |
| 1821 | Cauchy | Rigorous ε-δ definitions of limits |
| 1860s | Weierstrass | Formal epsilon-delta proofs |
| 1874 | Cantor | $\mathbb{R}$ is uncountable; the diagonal argument followed in 1891 |
| 1900 | Hilbert | Programme to formalise all mathematics |
| 1901 | Russell | Paradox via self-reference |
| 1931 | Gödel | Incompleteness; limits of formal systems |
| 1976 | Appel & Haken | Four-colour theorem; computer-assisted proof |
| 1995 | Wiles | Fermat's Last Theorem; 129-page proof |
| 2000s- | Coq, Lean, Isabelle | Machine-verified proofs |
| 2024 | AlphaProof (DeepMind) | LLM + Lean; silver-medal-level performance at IMO 2024 |

2. Direct Proof

2.1 The Strategy

To prove $P \to Q$:

  1. Assume $P$ is true
  2. Through a sequence of logical steps - each justified by definitions, axioms, or previously proved results - derive $Q$
  3. Conclude that $Q$ follows from $P$

This is the most natural proof technique. Try it first for any implication. It works best when the hypothesis gives you something concrete to work with and the conclusion has a clear definition to aim for.

2.2 Template

+==============================================+
|          DIRECT PROOF TEMPLATE               |
+==============================================+
|                                              |
|  Proof:                                      |
|    Assume [P].                               |
|    [Step 1: apply definition/theorem]        |
|    [Step 2: algebraic or logical step]       |
|    ...                                       |
|    [Final step: arrive at Q]                 |
|    Therefore [Q].  []                         |
|                                              |
+==============================================+

2.3 Worked Example - Even Times Even Is Even

Theorem. If $m$ and $n$ are both even integers, then $mn$ is even.

Proof.

Assume $m$ and $n$ are both even.

By definition of even: $\exists j \in \mathbb{Z}: m = 2j$ and $\exists k \in \mathbb{Z}: n = 2k$.

Then:

$$mn = (2j)(2k) = 4jk = 2(2jk)$$

Since $2jk \in \mathbb{Z}$, $mn$ is of the form $2 \times (\text{integer})$.

Therefore $mn$ is even. $\square$

Dissection: Every step names its justification - definition of even, integer closure under multiplication, definition of even again. No hand-waving.

2.4 Worked Example - If n^2 Is Even Then n Is Even (Direct Proof Fails)

Theorem. If $n^2$ is even, then $n$ is even.

Attempted direct proof: Assume $n^2$ is even. Then $n^2 = 2k$ for some $k$. So $n = \sqrt{2k}$... stuck. The square root of an even number is not obviously an integer, let alone obviously even.

Lesson: When direct proof stalls, switch technique. This theorem is proved cleanly by contrapositive (Section 4).

2.5 Worked Example - Sum of Continuous Functions Is Continuous

Theorem. If $f$ and $g$ are continuous at $a$, then $f + g$ is continuous at $a$.

Proof.

Assume $f$ and $g$ are continuous at $a$. Let $\varepsilon > 0$.

We need to find $\delta > 0$ such that $|x - a| < \delta$ implies $|(f+g)(x) - (f+g)(a)| < \varepsilon$.

By continuity of $f$: $\exists \delta_1 > 0: |x - a| < \delta_1 \implies |f(x) - f(a)| < \varepsilon/2$.

By continuity of $g$: $\exists \delta_2 > 0: |x - a| < \delta_2 \implies |g(x) - g(a)| < \varepsilon/2$.

Let $\delta = \min(\delta_1, \delta_2)$.

Then $|x - a| < \delta$ implies:

$$|(f+g)(x) - (f+g)(a)| = |f(x) + g(x) - f(a) - g(a)| \leq |f(x) - f(a)| + |g(x) - g(a)| < \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon$$

Therefore $f + g$ is continuous at $a$. $\square$

Structure: Uses both hypotheses; constructs $\delta$ from $\delta_1$ and $\delta_2$; the $\varepsilon/2$ trick ensures the sum stays below $\varepsilon$ after applying the triangle inequality.

2.6 Direct Proof in AI Contexts

Theorem. The output of a self-attention head lies in the convex hull of the value vectors.

Proof.

Let $\alpha_{ij}$ be the attention weights and $V_j$ the value vectors.

  • $\alpha_{ij} \geq 0$ for all $j$ (softmax outputs are non-negative)
  • $\sum_j \alpha_{ij} = 1$ (softmax outputs sum to 1)
  • $O_i = \sum_j \alpha_{ij} V_j$ (output is a weighted sum)

By definition of convex hull: a point is in $\text{conv}(\{V_j\})$ if and only if it is a convex combination (non-negative weights summing to 1) of the $V_j$.

Therefore $O_i \in \text{conv}(\{V_j\})$. $\square$

Theorem. Cross-entropy loss is convex in the logits.

Proof.

$$L(z) = -z_y + \log \sum_v \exp(z_v)$$

  • $-z_y$ is linear in $z$, hence convex
  • $\log \sum_v \exp(z_v)$ is the log-sum-exp function, which is convex (standard result for $\log \circ \operatorname{sum} \circ \exp$)
  • A sum of convex functions is convex

Therefore $L$ is convex. $\square$
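Neither proof requires computation, but both claims are easy to sanity-check numerically. The sketch below (plain NumPy; the shapes, seeds, and names are illustrative assumptions, not part of the proofs) verifies that softmax weights form a convex combination and spot-checks the convexity inequality for cross-entropy at random points.

```python
# Numerical illustration (not a proof) of the two claims above.
import numpy as np

rng = np.random.default_rng(0)

# --- Claim 1: an attention output is a convex combination of the value vectors.
scores = rng.normal(size=5)                      # one query against five keys
alpha = np.exp(scores) / np.exp(scores).sum()    # softmax weights
V = rng.normal(size=(5, 3))                      # five value vectors in R^3
O = alpha @ V                                    # attention output

assert np.all(alpha >= 0) and np.isclose(alpha.sum(), 1.0)
# O is, by construction, sum_j alpha_j V_j: a convex combination of the rows of V.

# --- Claim 2: cross-entropy in the logits satisfies the convexity inequality
#     L(lam*z1 + (1-lam)*z2) <= lam*L(z1) + (1-lam)*L(z2) at random test points.
def cross_entropy(z, y):
    return -z[y] + np.log(np.sum(np.exp(z)))

y = 2
for _ in range(1000):
    z1, z2 = rng.normal(size=7), rng.normal(size=7)
    lam = rng.uniform()
    lhs = cross_entropy(lam * z1 + (1 - lam) * z2, y)
    rhs = lam * cross_entropy(z1, y) + (1 - lam) * cross_entropy(z2, y)
    assert lhs <= rhs + 1e-9

print("convex-hull and convexity checks passed")
```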


3. Proof by Construction

3.1 The Strategy

To prove $\exists x\, P(x)$: exhibit a specific $x$ and verify that $P(x)$ holds.

A constructive proof provides an algorithm - the proof itself tells you how to find the object. This stands in contrast to non-constructive existence proofs (Section 5), which prove something exists without showing you how to find it.

Constructive proofs are strictly more informative: they tell you not just that something exists, but what it is.

3.2 Template

+==============================================+
|        CONSTRUCTIVE PROOF TEMPLATE           |
+==============================================+
|                                              |
|  Proof:                                      |
|    [Define or describe object x explicitly]  |
|    We claim x satisfies [P].                 |
|    [Verify each required property of x]      |
|    Therefore ∃x satisfying P.  []                   |
|                                              |
+==============================================+

3.3 Worked Example - Prime Between n and 2n

Theorem (Bertrand's Postulate). For every $n \geq 1$, there exists a prime $p$ with $n < p \leq 2n$.

This is proved by careful counting (Chebyshev's and later Erdős's proof): one bounds the central binomial coefficient $\binom{2n}{n}$ from above and below, and shows that if no prime lay in $(n, 2n]$ the two bounds would conflict. Strictly speaking, this establishes existence by ruling out the alternative rather than by exhibiting the prime; it is constructive only in the weak sense that a finite search over $(n, 2n]$ is guaranteed to succeed.

3.4 Worked Example - Constructing a Bijection

Theorem. $|\mathbb{Z}| = |\mathbb{N}|$ (the integers and natural numbers have the same cardinality).

Proof (by construction). Exhibit an explicit bijection $f: \mathbb{N} \to \mathbb{Z}$:

$$f(n) = \begin{cases} 0 & \text{if } n = 0 \\ k & \text{if } n = 2k - 1 \text{ for } k \geq 1 \\ -k & \text{if } n = 2k \text{ for } k \geq 1 \end{cases}$$

This maps: $0 \mapsto 0,\; 1 \mapsto 1,\; 2 \mapsto -1,\; 3 \mapsto 2,\; 4 \mapsto -2,\; 5 \mapsto 3,\; 6 \mapsto -3, \dots$

Verify injective: Different natural numbers map to different integers (each integer appears exactly once in the sequence above).

Verify surjective: Every integer $z \in \mathbb{Z}$ is hit:

  • $z = 0$: $f(0) = 0$ OK
  • $z > 0$: $f(2z - 1) = z$ OK
  • $z < 0$: $f(2|z|) = z$ OK

Therefore $f$ is a bijection; $|\mathbb{Z}| = |\mathbb{N}|$. $\square$
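Because the proof is constructive, it can be run. A minimal sketch implementing the stated $f$ and checking injectivity and surjectivity on a finite window (a finite check illustrates the construction; the proof itself is the case analysis above):

```python
# Implement the constructed bijection f: N -> Z and check it on {0, ..., 2N}.

def f(n: int) -> int:
    if n == 0:
        return 0
    if n % 2 == 1:            # n = 2k - 1  =>  f(n) = k = (n + 1) // 2
        return (n + 1) // 2
    return -(n // 2)          # n = 2k      =>  f(n) = -k

N = 1000
values = [f(n) for n in range(2 * N + 1)]

assert len(set(values)) == len(values)               # injective on this window
assert set(values) == set(range(-N, N + 1))          # hits every integer in [-N, N]
print("f maps {0,...,2N} bijectively onto {-N,...,N}")
```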

3.5 Worked Example - Neural Network Approximation

Theorem. For the target function $f(x) = \mathbb{1}[x > 0.5]$, there exists a one-hidden-layer ReLU network approximating $f$ to within $\varepsilon = 0.01$ on $[0, 1] \setminus (0.49, 0.51)$.

Proof (by construction).

A first attempt with a single hidden unit,

$$h(x) = \frac{\text{ReLU}(Mx - M/2)}{M}$$

(weights $W_1 = [M]$, $b_1 = [-M/2]$, $W_2 = [1/M]$, $b_2 = 0$), gives $h(x) = 0 = f(x)$ for $x \leq 0.5$, but only the ramp $h(x) = x - 0.5$ for $x > 0.5$: at $x = 0.51$ it outputs $0.01$ while $f(0.51) = 1$, an error of $0.99$. One unit is not enough, so use two ReLU units to build a step:

$$g(x) = \text{ReLU}(M(x - 0.5)) - \text{ReLU}(M(x - 0.5) - 1)$$

This is a one-hidden-layer network with two hidden units. It satisfies $g(x) = 0$ for $x \leq 0.5$, $g(x) = 1$ for $x \geq 0.5 + 1/M$, and rises linearly in between: a step function with transition width $1/M$. Taking $M = 100$, the transition zone is $(0.5, 0.51) \subset (0.49, 0.51)$, so the approximation error outside $(0.49, 0.51)$ is $0 < 0.01$. $\square$
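A quick numerical check of the construction (an illustration rather than part of the proof; $M = 100$ and the grid are the values used in the text):

```python
# Evaluate the two-unit ReLU step and measure its error outside (0.49, 0.51).
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def g(x, M=100.0):
    # One hidden layer, two ReLU units, linear readout with weights (+1, -1).
    return relu(M * (x - 0.5)) - relu(M * (x - 0.5) - 1.0)

def f_target(x):
    return (x > 0.5).astype(float)

x = np.linspace(0.0, 1.0, 100_001)
outside = (x <= 0.49) | (x >= 0.51)        # region where the bound must hold
err = np.abs(g(x) - f_target(x))

print("max error outside (0.49, 0.51):", err[outside].max())   # 0.0 <= 0.01
print("max error inside the transition zone:", err[~outside].max())
```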

3.6 Constructive vs Non-Constructive - A Key Distinction

| | Constructive | Non-Constructive |
| --- | --- | --- |
| What it gives | The object itself | Proof that it exists |
| Algorithm? | Yes - the proof is the algorithm | No - existence without method |
| Informativeness | Very high | Moderate |
| Typical technique | Direct construction | Contradiction, probabilistic method |
| AI relevance | Algorithms, architectures | Performance bounds, impossibility results |

Example - Normal Numbers:

  • Constructive: "Let $x = 0.123456789101112\dots$" (Champernowne's constant). This is provably normal in base 10 by construction.
  • Non-constructive: "A normal number must exist because the set of non-normal numbers has Lebesgue measure zero." Proves existence without exhibiting the number.

AI relevance: Constructive proofs correspond to implementable algorithms. Non-constructive proofs establish performance bounds without telling you how to achieve them.


4. Proof by Contrapositive

4.1 The Strategy

Logical equivalence: $(P \to Q) \equiv (\lnot Q \to \lnot P)$

To prove $P \to Q$, you may equivalently prove $\lnot Q \to \lnot P$. This is the contrapositive.

When to use: when the negation of $Q$ gives more useful information than $P$ itself. Often, $\lnot Q$ hands you a concrete object to work with, while $P$ is too abstract.

Critical distinction from contradiction: Contrapositive proves a specific alternate implication $\lnot Q \to \lnot P$. Contradiction assumes $\lnot P$ and derives any false statement $\bot$. They are different techniques (see Section 5.7).

4.2 Template

+======================================================+
|         CONTRAPOSITIVE PROOF TEMPLATE                |
+======================================================+
|                                                      |
|  Proof (by contrapositive):                          |
|    We prove the contrapositive: assume ¬Q.              |
|    [Derive ¬P through logical steps]                    |
|    Therefore ¬P.                                        |
|    By contrapositive equivalence, P -> Q.  []          |
|                                                      |
+======================================================+

4.3 Worked Example - If n^2 Is Even Then n Is Even

Theorem. For $n \in \mathbb{Z}$, if $n^2$ is even then $n$ is even.

Contrapositive: If $n$ is odd, then $n^2$ is odd.

Proof (by contrapositive).

Assume $n$ is odd. Then $\exists k \in \mathbb{Z}: n = 2k + 1$.

$$n^2 = (2k+1)^2 = 4k^2 + 4k + 1 = 2(2k^2 + 2k) + 1$$

Since $2k^2 + 2k \in \mathbb{Z}$, $n^2$ is of the form $2m + 1$; therefore $n^2$ is odd.

By contrapositive equivalence: if $n^2$ is even, then $n$ is even. $\square$

Compare with Section 2.4: The direct proof got stuck at $n = \sqrt{2k}$. The contrapositive transforms "even $\to$ even" (hard direction) into "odd $\to$ odd" (easy direction with concrete algebraic manipulation).

4.4 Worked Example - Irrational Product

Theorem. If $x$ is irrational, $y$ is rational, and $xy$ is rational, then $y = 0$.

Contrapositive: If $x$ is irrational and $y$ is a non-zero rational, then $xy$ is irrational.

Proof (by contrapositive).

Assume $x$ is irrational and $y$ is rational with $y \neq 0$. Suppose, for contradiction, that $xy$ is rational: $xy = p/q$ with $q \neq 0$. Since $y$ is a non-zero rational, write $y = r/s$ with $r \neq 0$. Then $x = (p/q)/(r/s) = ps/(qr)$, which is rational - contradicting the assumption that $x$ is irrational. Therefore $xy$ is irrational.

Notice that the inner argument is itself a small proof by contradiction: the contrapositive reformulation alone does not simplify matters, and the theorem is arguably proved most naturally by contradiction from the start. Technique selection requires judgment - not every theorem is best proved by contrapositive.

4.5 Contrapositive in AI Contexts

Theorem. If a neural network $f_\theta$ is not Lipschitz continuous, then it is not robust to adversarial examples.

Contrapositive: If $f_\theta$ is robust to adversarial examples (all perturbations $\|\delta\| < r$ change the output by less than $\varepsilon$), then $f_\theta$ is Lipschitz with constant $L = \varepsilon / r$.

This direction is often easier to prove: robustness directly gives the Lipschitz bound.

Theorem. If $\nabla L(\theta) \neq 0$, then $\theta$ is not a local minimum.

Contrapositive: If $\theta$ is a local minimum, then $\nabla L(\theta) = 0$.

This is the standard necessary condition for optimality, and the contrapositive direction is the one typically proved: at a local minimum, any directional derivative must be non-negative, which forces $\nabla L = 0$.

4.6 Recognising When to Use Contrapositive

| Sign | Explanation |
| --- | --- |
| Conclusion $Q$ involves a negation | "x is not...", "cannot be...", "there is no..." |
| Negating $Q$ gives concrete structure | $\lnot Q$ hands you an object to work with |
| Direct proof gets stuck | Reformulating as $\lnot Q \to \lnot P$ opens new paths |
| Hypothesis $P$ has existential content | $\lnot P$ may be simpler than $P$ |

Test: Write out $\lnot P$ and $\lnot Q$. If $\lnot Q \to \lnot P$ seems more natural to argue, use contrapositive.


5. Proof by Contradiction

5.1 The Strategy

To prove $P$: assume $\lnot P$; derive a contradiction - a statement known to be false, or both a statement $C$ and its negation $\lnot C$ simultaneously.

Valid by the law of excluded middle: $P \lor \lnot P$. If $\lnot P$ leads to a contradiction $\bot$, then $P$ must hold.

When to use: when $P$ has the form "X does not exist" or "X is impossible"; when no positive construction is available; when the assumption $\lnot P$ has powerful consequences to exploit.

5.2 Template

+======================================================+
|         CONTRADICTION PROOF TEMPLATE                 |
+======================================================+
|                                                      |
|  Proof (by contradiction):                           |
|    Assume for contradiction that ¬P.                    |
|    [Derive consequences of ¬P]                          |
|    [Arrive at statement C and ¬C simultaneously]        |
|    This is a contradiction.                          |
|    Therefore P must hold.  []                         |
|                                                      |
+======================================================+

5.3 Worked Example - √2 Is Irrational

Theorem. $\sqrt{2} \notin \mathbb{Q}$.

Proof (by contradiction).

Assume for contradiction that $\sqrt{2} \in \mathbb{Q}$.

Then $\sqrt{2} = p/q$ for some $p, q \in \mathbb{Z}$, $q \neq 0$, with $\gcd(p, q) = 1$ (fraction in lowest terms).

Squaring: $2 = p^2/q^2$, so $p^2 = 2q^2$.

$p^2$ is even. By the result of Section 4.3, $p$ is even. So $p = 2k$ for some $k \in \mathbb{Z}$.

Substituting: $(2k)^2 = 2q^2 \implies 4k^2 = 2q^2 \implies q^2 = 2k^2$.

$q^2$ is even. By the same result, $q$ is even.

But both $p$ and $q$ even contradicts $\gcd(p, q) = 1$.

Contradiction. Therefore $\sqrt{2} \notin \mathbb{Q}$. $\square$

Note how the proof builds on Section 4.3: The irrationality of $\sqrt{2}$ uses the lemma "$n^2$ even $\implies$ $n$ even" as a sub-result. This is typical - complex proofs assemble simpler proven facts.

5.4 Worked Example - Infinitely Many Primes (Euclid)

Theorem. There are infinitely many primes.

Proof (by contradiction).

Assume for contradiction there are only finitely many primes: $p_1, p_2, \dots, p_n$.

Construct $N = p_1 \times p_2 \times \dots \times p_n + 1$.

$N > 1$. Every integer greater than 1 has a prime factor. Let $p$ be a prime factor of $N$.

$p$ must be one of $p_1, \dots, p_n$ (by assumption, these are all the primes).

But $p \mid N$ and $p \mid p_1 p_2 \cdots p_n$, so $p \mid (N - p_1 p_2 \cdots p_n) = 1$.

No prime divides 1. Contradiction.

Therefore there are infinitely many primes. $\square$

5.5 Worked Example - No Largest Real Number

Theorem. There is no largest real number.

Proof (by contradiction).

Assume for contradiction that $\exists M \in \mathbb{R}$ with $M \geq x$ for all $x \in \mathbb{R}$.

Consider $M + 1 \in \mathbb{R}$.

$M + 1 > M$, contradicting $M$ being the largest.

Contradiction. Therefore no largest real number exists. $\square$

5.6 Non-Constructive Existence via Contradiction

One can prove "$\exists x\, P(x)$" by contradiction: assume $\forall x\, \lnot P(x)$; derive a contradiction. This proves existence without constructing the object.

Example: The Intermediate Value Theorem proves $\exists c \in (a, b): f(c) = y$ without constructing $c$.

AI relevance: Existence of optimal parameters $\theta^*$ minimising the loss can be proved via compactness (if the loss is continuous on a compact set) - but the proof gives no algorithm for finding $\theta^*$. This is why optimisation theory is a separate discipline from existence theory.

5.7 Contradiction vs Contrapositive - Clarifying the Difference

These two techniques are frequently confused. They are not the same:

| | Contrapositive | Contradiction |
| --- | --- | --- |
| Goal | Prove $P \to Q$ | Prove $P$ (or $P \to Q$) |
| Assume | $\lnot Q$ | $\lnot P$ (or $P \land \lnot Q$) |
| Derive | $\lnot P$ (specifically) | Any contradiction $\bot$ |
| Conclusion | $P \to Q$ (by logical equivalence) | $P$ (since $\lnot P$ led to $\bot$) |
| When best | $\lnot Q \to \lnot P$ is natural | $\lnot P$ has powerful consequences |

Contrapositive is more structured (you know exactly what to derive: $\lnot P$). Contradiction is more flexible (any absurdity suffices), but can make proofs harder to follow.

5.8 Contradiction in AI Contexts

Theorem. Cross-entropy loss $L(z, y) = -z_y + \log \sum_v \exp(z_v)$ attains no minimum: its infimum over the logits $z$ is 0, but no finite $z$ achieves it.

Proof (by contradiction).

Assume for contradiction that some $z^*$ minimises $L$, with minimum value $m = L(z^*)$.

First, $L(z, y) > 0$ for every finite $z$: with more than one vocabulary entry, $\sum_v \exp(z_v) > \exp(z_y)$, so $\log \sum_v \exp(z_v) > z_y$ and hence $L > 0$. In particular $m > 0$.

Now set $z_v = 0$ for all $v \neq y$ and let $z_y = T$ vary:

$$L = -T + \log(e^{T} + C) = \log\!\left(1 + C e^{-T}\right), \quad \text{where } C = |V| - 1$$

As $T \to +\infty$, $L \to 0^+$. So for $T$ large enough, $L < m$, contradicting the minimality of $m$.

Therefore no minimiser exists: the infimum 0 is approached but never attained. $\square$
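A numerical illustration of the argument (plain NumPy; the vocabulary size $|V| = 10$ is an arbitrary choice for the demo): push the correct-class logit up and watch the loss shrink toward, but never reach, 0.

```python
# The loss tracks log(1 + (|V|-1) * exp(-T)), so any claimed minimum m > 0
# is eventually undercut by taking T large enough.
import numpy as np

def cross_entropy(z, y):
    return -z[y] + np.log(np.sum(np.exp(z)))

V, y = 10, 0
for T in [1.0, 5.0, 10.0, 20.0, 30.0]:
    z = np.zeros(V)
    z[y] = T
    L = cross_entropy(z, y)
    print(f"T = {T:5.1f}   L = {L:.3e}   (always > 0)")
```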

Theorem. A single ReLU hidden unit cannot represent the XOR function.

Proof (by contradiction).

Assume a network with one ReLU hidden unit computes XOR. The hidden unit computes $h(x) = \text{ReLU}(w^T x + b)$ and the output is $f(x) = v \cdot h(x) + c$. XOR on $\{(0,0), (0,1), (1,0), (1,1)\}$ requires $f(0,0) = 0$, $f(0,1) = 1$, $f(1,0) = 1$, $f(1,1) = 0$.

But $f$ depends on $x$ only through the scalar $s = w^T x$, and $s \mapsto v \cdot \text{ReLU}(s + b) + c$ is monotone (non-decreasing if $v \geq 0$, non-increasing if $v \leq 0$). Writing $s_{ab} = w^T(a, b)$, we have $s_{00} + s_{11} = s_{01} + s_{10}$, so at least one of $s_{01}, s_{10}$ lies between $s_{00}$ and $s_{11}$. Monotonicity then forces the corresponding output to lie between $f(0,0) = 0$ and $f(1,1) = 0$, i.e. to equal 0 - but XOR requires it to be 1. Contradiction. $\square$


6. Proof by Cases (Exhaustion)

6.1 The Strategy

To prove $P$, partition the universe into exhaustive, mutually exclusive cases $C_1, C_2, \dots, C_n$ such that $C_1 \lor C_2 \lor \dots \lor C_n$ covers all possibilities. Prove $P$ holds in each case separately.

Valid because: $(C_1 \to P) \land (C_2 \to P) \land \dots \land (C_n \to P)$ together with $(C_1 \lor C_2 \lor \dots \lor C_n)$ yield $P$.

When to use: when no single argument covers all cases, but the problem has a natural partition (parity, sign, relative order, etc.).

6.2 Template

+======================================================+
|          PROOF BY CASES TEMPLATE                     |
+======================================================+
|                                                      |
|  Proof (by cases):                                   |
|    Let x be arbitrary. We consider all cases.        |
|                                                      |
|    Case 1: [C_1 holds]                                |
|      [Prove P under assumption C_1]                   |
|                                                      |
|    Case 2: [C_2 holds]                                |
|      [Prove P under assumption C_2]                   |
|                                                      |
|    ...                                                 |
|    Since C_1 ∨ C_2 ∨ ... ∨ C_n is exhaustive, P.  []              |
|                                                      |
+======================================================+

6.3 Worked Example - Triangle Inequality for Absolute Value

Theorem. For all $a, b \in \mathbb{R}$: $|a + b| \leq |a| + |b|$.

Proof (by cases).

Case 1: $a + b \geq 0$.

$|a + b| = a + b \leq |a| + |b|$, since $a \leq |a|$ and $b \leq |b|$.

Case 2: $a + b < 0$.

$|a + b| = -(a + b) = (-a) + (-b) \leq |a| + |b|$, since $-a \leq |a|$ and $-b \leq |b|$.

Both cases yield $|a + b| \leq |a| + |b|$. $\square$

6.4 Worked Example - n^2 mod 4

Theorem. For all $n \in \mathbb{Z}$, $n^2 \bmod 4 \in \{0, 1\}$.

Proof (by cases on $n \bmod 4$).

Case 1: $n \equiv 0 \pmod{4}$. Then $n = 4k$ and $n^2 = 16k^2 \equiv 0 \pmod{4}$.

Case 2: $n \equiv 1 \pmod{4}$. Then $n = 4k+1$ and $n^2 = 16k^2 + 8k + 1 \equiv 1 \pmod{4}$.

Case 3: $n \equiv 2 \pmod{4}$. Then $n = 4k+2$ and $n^2 = 16k^2 + 16k + 4 \equiv 0 \pmod{4}$.

Case 4: $n \equiv 3 \pmod{4}$. Then $n = 4k+3$ and $n^2 = 16k^2 + 24k + 9 \equiv 1 \pmod{4}$.

All cases give $n^2 \bmod 4 \in \{0, 1\}$. $\square$
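Because the case split is finite, it can be checked mechanically - the code below mirrors the exhaustion over the four residue classes (the extra sweep over a range of integers is just a sanity check):

```python
# Proof by exhaustion, executed: squares of the four residue classes mod 4.
residues = {(r * r) % 4 for r in range(4)}
assert residues <= {0, 1}

assert all((n * n) % 4 in {0, 1} for n in range(-1000, 1001))
print("n^2 mod 4 takes only the values:", residues)
```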

6.5 Cases in AI - Activation Function Analysis

Theorem. The ReLU function $\sigma(x) = \max(0, x)$ is convex.

Proof (by cases).

For convexity we must show $\sigma(\lambda x + (1-\lambda)y) \leq \lambda \sigma(x) + (1-\lambda) \sigma(y)$ for all $\lambda \in [0,1]$.

Let $z = \lambda x + (1-\lambda) y$.

Case 1: $z \leq 0$.

$\sigma(z) = 0 \leq \lambda \sigma(x) + (1-\lambda)\sigma(y)$, since the right-hand side is $\geq 0$.

Case 2: $z > 0$.

$$\sigma(z) = z = \lambda x + (1-\lambda)y \leq \lambda \max(0,x) + (1-\lambda)\max(0,y) = \lambda \sigma(x) + (1-\lambda) \sigma(y)$$

(using $x \leq \max(0,x)$ and $y \leq \max(0,y)$).

Both cases confirm convexity. $\square$

6.6 Without Loss of Generality (WLOG)

When cases are symmetric, you can reduce the number by stating "without loss of generality" (WLOG).

Example: Proving a property about $\max(a,b)$: WLOG assume $a \geq b$; then $\max(a,b) = a$. The case $b > a$ is symmetric with the roles swapped.

Caution: WLOG is only valid when the cases are truly symmetric under relabelling. A common mistake is claiming WLOG when the problem is not symmetric.


7. Mathematical Induction

7.1 The Strategy

To prove $P(n)$ holds for all $n \geq n_0$ (typically $n_0 = 0$ or $1$):

  1. Base case: Prove $P(n_0)$.
  2. Inductive step: Prove $\forall k \geq n_0: P(k) \implies P(k+1)$.

Then $P(n)$ holds for all $n \geq n_0$, by the well-ordering principle (or equivalently, the axiom of induction for $\mathbb{N}$).

Induction is the foundational proof technique for discrete structures and is heavily used in analysing algorithms, data structures, and recursive models.

7.2 Template

+======================================================+
|         INDUCTION PROOF TEMPLATE                     |
+======================================================+
|                                                      |
|  Proof (by induction on n):                          |
|                                                      |
|    Base case (n = n_0):                               |
|      [Verify P(n_0) directly]                         |
|                                                      |
|    Inductive step:                                   |
|      Assume P(k) for some k ≥ n_0.  (IH)                |
|      [Prove P(k+1) using the inductive              |
|       hypothesis P(k)]                               |
|                                                      |
|    By induction, P(n) for all n ≥ n_0.  []               |
|                                                      |
+======================================================+

Important: In the inductive step, you assume $P(k)$ (the "inductive hypothesis," IH) and prove $P(k+1)$. This is not circular - you are proving a conditional statement.

7.3 Worked Example - Sum Formula

Theorem. $\displaystyle\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$ for all $n \geq 1$.

Proof (by induction on $n$).

Base case ($n = 1$):

LHS: $\sum_{i=1}^{1} i = 1$. RHS: $\frac{1 \cdot 2}{2} = 1$. OK

Inductive step:

Assume $\sum_{i=1}^{k} i = \frac{k(k+1)}{2}$ for some $k \geq 1$. (IH)

Then:

$$\sum_{i=1}^{k+1} i = \left(\sum_{i=1}^{k} i\right) + (k+1) = \frac{k(k+1)}{2} + (k+1) \quad \text{(by IH)}$$

$$= \frac{k(k+1) + 2(k+1)}{2} = \frac{(k+1)(k+2)}{2}$$

This is $\frac{(k+1)((k+1)+1)}{2}$, which is $P(k+1)$. OK

By induction, $P(n)$ holds for all $n \geq 1$. $\square$

7.4 Worked Example - Power Set Cardinality

Theorem. A set with $n$ elements has $2^n$ subsets.

Proof (by induction on $n$).

Base case ($n = 0$):

$\varnothing$ has one subset: $\varnothing$ itself. $2^0 = 1$. OK

Inductive step:

Assume every set with $k$ elements has $2^k$ subsets. (IH)

Let $S$ be a set with $k + 1$ elements. Pick $x \in S$ and let $T = S \setminus \{x\}$, so $|T| = k$.

Every subset of $S$ either contains $x$ or does not:

  • Subsets not containing $x$: exactly the subsets of $T$. There are $2^k$ of these (by IH).
  • Subsets containing $x$: formed by taking a subset of $T$ and adding $x$. There are $2^k$ of these.

Total: $2^k + 2^k = 2 \cdot 2^k = 2^{k+1}$. OK

By induction, $|S| = n \implies |\mathcal{P}(S)| = 2^n$ for all $n \geq 0$. $\square$

7.5 Induction for Neural Network Depth

Theorem. A ReLU network with $d$ hidden layers can represent a function with at most $2^d$ linear regions (given one hidden unit per layer).

Proof (by induction on $d$).

Base case ($d = 1$):

One ReLU hidden unit: $h(x) = \max(0, wx + b)$. This produces at most 2 linear pieces (one where the input is $\leq -b/w$ and one where it is $> -b/w$). $2^1 = 2$. OK

Inductive step:

Assume a network with $k$ layers (one unit each) has $\leq 2^k$ linear regions. (IH)

Adding layer $k+1$ with one ReLU: the new ReLU unit can at most double the number of regions by "folding" each existing piece. The fold happens at the ReLU's breakpoint, splitting at most all existing regions in two.

Maximum new regions: $2 \cdot 2^k = 2^{k+1}$. OK

By induction, the bound is $2^d$ for $d$ layers. $\square$

Remark: With $w$ units per layer, the bound becomes $O(w^d)$, explaining the exponential expressiveness of depth - a key insight behind deep learning.
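The bound can be probed empirically. A minimal sketch (an illustration of the inductive bound, not a proof; the random weights, interval, and piece-counting heuristic are all assumptions of the demo) builds a depth-$d$ chain of single-unit ReLU layers and counts the linear pieces of its input-output map:

```python
# Count linear pieces of a random 1-unit-per-layer ReLU chain; never exceeds 2^d.
import numpy as np

rng = np.random.default_rng(1)

def random_chain(depth):
    params = [(rng.normal(), rng.normal()) for _ in range(depth)]   # (weight, bias)
    w_out, b_out = rng.normal(), rng.normal()
    def net(x):
        h = x
        for w, b in params:
            h = np.maximum(0.0, w * h + b)
        return w_out * h + b_out
    return net

def count_linear_pieces(f, lo=-3.0, hi=3.0, n=200_001, tol=1e-6):
    x = np.linspace(lo, hi, n)
    slopes = np.diff(f(x)) / (x[1] - x[0])
    changed = np.abs(np.diff(slopes)) > tol
    # a single kink can flag two adjacent grid intervals, so count runs of flags
    starts = changed & ~np.concatenate(([False], changed[:-1]))
    return int(starts.sum()) + 1

for d in range(1, 6):
    pieces = count_linear_pieces(random_chain(d))
    print(f"depth {d}: {pieces} linear pieces (bound 2^{d} = {2**d})")
    assert pieces <= 2 ** d
```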

7.6 Common Induction Mistakes

| Mistake | Example | Why It Fails |
| --- | --- | --- |
| Forgetting the base case | "Assume $P(k)$, prove $P(k+1)$" - but never check $P(1)$ | The chain has no starting anchor |
| Wrong base case | Prove $P(0)$ but claim the result for $n \geq 1$ | May miss that $P(1)$ needs separate attention |
| Assuming what you prove | "Assume $P(k+1)$" instead of proving it | Circular reasoning |
| Not using the IH | Proving $P(k+1)$ from scratch, never invoking $P(k)$ | Valid but not induction (just direct proof) |
| Wrong inductive "direction" | Assume $P(k)$, prove $P(k-1)$ | Proves the claim for $n \leq n_0$, not $n \geq n_0$ |

8. Strong Induction

8.1 The Strategy

Strong induction (also called complete induction or course-of-values induction) strengthens the inductive hypothesis: instead of assuming only $P(k)$, you assume $P(n_0) \land P(n_0 + 1) \land \dots \land P(k)$ - that is, $P(j)$ for all $n_0 \leq j \leq k$ - and then prove $P(k+1)$.

Logical equivalence: Strong induction is equivalent to ordinary induction on the statement "for all $j \leq n$, $P(j)$." It is not a stronger proof system, just a more convenient packaging.

When to use: when $P(k+1)$ depends on $P(k-1)$, $P(k-3)$, or some earlier case - not just the immediately preceding one.

8.2 Template

+======================================================+
|         STRONG INDUCTION TEMPLATE                    |
+======================================================+
|                                                      |
|  Proof (by strong induction on n):                   |
|                                                      |
|    Base case(s) (n = n_0, ..., n_0+r):                  |
|      [Verify P(n_0), ..., P(n_0+r) directly]            |
|                                                      |
|    Inductive step:                                   |
|      Assume P(j) for all n_0 ≤ j ≤ k.  (SIH)              |
|      [Prove P(k+1) using any/all of                 |
|       P(n_0), P(n_0+1), ..., P(k)]                     |
|                                                      |
|    By strong induction, P(n) for all n ≥ n_0.  []        |
|                                                      |
+======================================================+

Note: Strong induction often requires multiple base cases because the inductive step may reach back several positions.

8.3 Worked Example - Fundamental Theorem of Arithmetic

Theorem. Every integer $n \geq 2$ can be written as a product of primes.

Proof (by strong induction on $n$).

Base case ($n = 2$):

$2$ is prime, so it is (trivially) a product of primes. OK

Inductive step:

Assume every integer $j$ with $2 \leq j \leq k$ can be written as a product of primes. (SIH)

Consider $k + 1$. Two cases:

Case 1: $k+1$ is prime. Then it is a product of one prime. OK

Case 2: $k+1$ is composite. Then $k+1 = a \cdot b$ for some $2 \leq a, b \leq k$.

By the SIH, $a = p_1 \cdots p_r$ and $b = q_1 \cdots q_s$ are products of primes.

Therefore $k+1 = p_1 \cdots p_r \cdot q_1 \cdots q_s$, a product of primes. OK

By strong induction, every $n \geq 2$ is a product of primes. $\square$

Why standard induction fails: $P(k+1)$ for composite $k+1$ requires $P(a)$ and $P(b)$ where $a, b < k+1$, but not necessarily $P(k)$ specifically.
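The proof is effectively a recursive algorithm: to factor a number, either it is prime or it splits into smaller factors that the recursion (the SIH) handles. A minimal sketch of that recursion (the helper names are illustrative):

```python
# Recursive prime factorisation mirroring the strong-induction argument.
import math

def smallest_divisor(n: int) -> int:
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n                        # no divisor <= sqrt(n): n is prime

def prime_factors(n: int) -> list:
    assert n >= 2
    d = smallest_divisor(n)
    if d == n:                      # Case 1 of the proof: n is prime
        return [n]
    return prime_factors(d) + prime_factors(n // d)   # Case 2: recurse on a, b (SIH)

for n in [2, 12, 97, 360, 2**10 * 3**4]:
    fs = prime_factors(n)
    assert all(smallest_divisor(p) == p for p in fs)  # every factor is prime
    assert math.prod(fs) == n
    print(n, "=", " * ".join(map(str, fs)))
```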

8.4 Worked Example - Binary Representation

Theorem. Every positive integer has a binary representation: $n = \sum_{i} a_i 2^i$ with $a_i \in \{0, 1\}$.

Proof (by strong induction on $n$).

Base case ($n = 1$): $1 = 2^0$, so $a_0 = 1$. OK

Inductive step:

Assume all $1 \leq j \leq k$ have binary representations. (SIH)

Consider $k + 1$.

Case 1: $k+1$ is even. Then $k+1 = 2m$ where $1 \leq m \leq k$. By the SIH, $m = \sum_i a_i 2^i$. So $k+1 = \sum_i a_i 2^{i+1}$, a valid binary representation (shift all digits left one position). OK

Case 2: $k+1$ is odd. Then $k+1 = 2m + 1$ where $1 \leq m \leq k$ (for $k \geq 1$). By the SIH, $m = \sum_i a_i 2^i$. So $k+1 = 1 + \sum_i a_i 2^{i+1}$, i.e. the shifted representation with $a_0 = 1$. OK

By strong induction, every positive integer has a binary representation. $\square$

AI relevance: Binary representations are fundamental to digital computation. This proof legitimises the binary encoding that underlies all neural network implementations.

8.5 Strong Induction in AI - Recursive Network Correctness

Theorem. A tree-structured recursive neural network correctly computes the representation of any tree of depth $d \geq 0$ if (a) it correctly handles leaves and (b) whenever it correctly handles subtrees, its composition function correctly handles their parent.

Proof (by strong induction on $d$).

Base case ($d = 0$): Leaf nodes. By assumption (a), the network correctly computes their representations. OK

Inductive step: Assume the network correctly represents all trees of depth $\leq k$. (SIH)

A tree of depth $k+1$ has a root whose children are roots of subtrees of depth $\leq k$. By the SIH, these subtrees are correctly represented. By assumption (b), the composition function correctly computes the parent's representation from its children's representations. OK

By strong induction, the network correctly handles all trees. $\square$


9. Structural Induction

9.1 The Strategy

Structural induction generalises mathematical induction from natural numbers to inductively defined structures: lists, trees, formulas, programs, and other recursive data types.

The idea: an inductively defined set has base constructors (creating atomic elements) and recursive constructors (building larger structures from smaller ones). Structural induction mirrors this:

  1. Base case: Prove $P$ for each base constructor.
  2. Inductive step: For each recursive constructor, assume $P$ for all sub-structures (structural inductive hypothesis, SIH); prove $P$ for the constructed structure.

9.2 Template

+======================================================+
|       STRUCTURAL INDUCTION TEMPLATE                  |
+======================================================+
|                                                      |
|  Proof (by structural induction):                    |
|                                                      |
|    Base case: [For each base constructor c_0]         |
|      [Verify P(c_0)]                                 |
|                                                      |
|    Inductive step: [For each recursive               |
|                     constructor C(x_1, ..., x_m)]       |
|      Assume P(x_1), ..., P(x_m).  (SIH)                |
|      [Prove P(C(x_1, ..., x_m))]                       |
|                                                      |
|    By structural induction, P holds for all          |
|    elements of the inductively defined set.  []       |
|                                                      |
+======================================================+

9.3 Induction on Lists

Definition (Lists). The set of lists over a type $A$ is inductively defined:

  • Base: $[]$ (the empty list) is a list.
  • Recursive: If $x \in A$ and $\ell$ is a list, then $x :: \ell$ (cons) is a list.

Theorem. $\text{length}(\text{append}(\ell_1, \ell_2)) = \text{length}(\ell_1) + \text{length}(\ell_2)$,

where $\text{append}([], \ell_2) = \ell_2$ and $\text{append}(x :: \ell_1', \ell_2) = x :: \text{append}(\ell_1', \ell_2)$.

Proof (by structural induction on $\ell_1$).

Base case ($\ell_1 = []$):

$\text{length}(\text{append}([], \ell_2)) = \text{length}(\ell_2) = 0 + \text{length}(\ell_2) = \text{length}([]) + \text{length}(\ell_2)$. OK

Inductive step ($\ell_1 = x :: \ell_1'$):

Assume $\text{length}(\text{append}(\ell_1', \ell_2)) = \text{length}(\ell_1') + \text{length}(\ell_2)$. (SIH)

$$\text{length}(\text{append}(x :: \ell_1', \ell_2)) = \text{length}(x :: \text{append}(\ell_1', \ell_2)) = 1 + \text{length}(\text{append}(\ell_1', \ell_2)) = 1 + \text{length}(\ell_1') + \text{length}(\ell_2) \quad \text{(by SIH)}$$

$$= \text{length}(x :: \ell_1') + \text{length}(\ell_2)$$

OK By structural induction, the result holds for all lists $\ell_1$. $\square$

9.4 Induction on Binary Trees

Definition (Binary trees).

  • Base: $\text{Leaf}$ is a binary tree.
  • Recursive: If $L$ and $R$ are binary trees, then $\text{Node}(L, R)$ is a binary tree.

Theorem. A binary tree with $n$ internal nodes has $n + 1$ leaves.

Proof (by structural induction).

Base case ($\text{Leaf}$):

0 internal nodes, 1 leaf. $0 + 1 = 1$. OK

Inductive step ($\text{Node}(L, R)$):

Assume $L$ has $n_L$ internal nodes and $n_L + 1$ leaves, and $R$ has $n_R$ internal nodes and $n_R + 1$ leaves. (SIH)

$\text{Node}(L, R)$ has $n_L + n_R + 1$ internal nodes (the $+1$ is the new root).

Leaves: $(n_L + 1) + (n_R + 1) = n_L + n_R + 2 = (n_L + n_R + 1) + 1$. OK

By structural induction, a tree with $n$ internal nodes has $n + 1$ leaves. $\square$

AI relevance: Decision trees, parse trees in NLP, and hierarchical attention structures are all binary/n-ary trees. Properties proved by structural induction apply directly.

9.5 Induction on Computation Graphs

Definition (Computation graph for automatic differentiation).

  • Base: An input variable $x_i$ is a computation node.
  • Recursive: If $v_1, \dots, v_k$ are computation nodes and $f$ is a differentiable function, then $v = f(v_1, \dots, v_k)$ is a computation node.

Theorem (Correctness of backpropagation). For any output node $v$ and any input $x_i$, the backpropagation algorithm correctly computes $\partial v / \partial x_i$.

Proof sketch (by structural induction on computation nodes).

Base case: $v = x_i$. Then $\partial x_i / \partial x_i = 1$ and $\partial x_i / \partial x_j = 0$ for $j \neq i$. Backpropagation initialises these correctly. OK

Inductive step: $v = f(v_1, \dots, v_k)$.

Assume backpropagation correctly computes $\partial v_j / \partial x_i$ for all $j$ and all $i$. (SIH)

By the chain rule:

$$\frac{\partial v}{\partial x_i} = \sum_{j=1}^{k} \frac{\partial f}{\partial v_j} \cdot \frac{\partial v_j}{\partial x_i}$$

Backpropagation computes this sum: the local gradients $\partial f / \partial v_j$ are computed by the node, and the $\partial v_j / \partial x_i$ are correctly computed by the SIH. Their products and sum are correctly accumulated. OK

By structural induction, backpropagation is correct for all computation graphs. $\square$
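A small computational companion: build a two-node computation graph by hand, apply the chain rule exactly as in the inductive step, and compare against finite differences. The example function and all names are illustrative assumptions, not a real autodiff library.

```python
# Chain rule on a hand-built computation graph vs central finite differences.
import numpy as np

def forward(x1, x2):
    u = x1 * x2               # node u = f1(x1, x2)
    return np.sin(u) + x1**2  # node v = f2(u, x1)

def chain_rule_grad(x1, x2):
    u = x1 * x2
    dv_du, dv_dx1_local = np.cos(u), 2 * x1   # local gradients at node v
    du_dx1, du_dx2 = x2, x1                   # local gradients at node u
    dv_dx1 = dv_du * du_dx1 + dv_dx1_local    # sum over incoming paths (the proof's sum)
    dv_dx2 = dv_du * du_dx2
    return dv_dx1, dv_dx2

x1, x2, h = 0.7, -1.3, 1e-6
g_exact = chain_rule_grad(x1, x2)
g_fd = ((forward(x1 + h, x2) - forward(x1 - h, x2)) / (2 * h),
        (forward(x1, x2 + h) - forward(x1, x2 - h)) / (2 * h))
print("chain rule:  ", g_exact)
print("finite diff: ", g_fd)
assert np.allclose(g_exact, g_fd, atol=1e-5)
```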

9.6 When to Use Which Induction Variant

| Variant | Domain | Inductive Hypothesis | Use When |
| --- | --- | --- | --- |
| Standard | $\mathbb{N}$ | $P(k)$ | $P(k+1)$ depends only on $P(k)$ |
| Strong | $\mathbb{N}$ | $P(n_0), \dots, P(k)$ | $P(k+1)$ depends on earlier cases |
| Structural | Inductive types | $P$ for all sub-structures | Domain is recursively defined |
| Transfinite | Ordinals | $P(\alpha)$ for all $\alpha < \beta$ | Infinite/uncountable well-ordered sets |

10. The Probabilistic Method

10.1 The Strategy

To prove that an object with property $P$ exists, define a probability distribution over a space of candidate objects. Show that the probability of $P$ is strictly positive: $\Pr[P] > 0$. Then an object with property $P$ must exist (otherwise the probability would be 0).

This technique, pioneered by Paul Erdős, is profoundly non-constructive: it proves existence without exhibiting a specific example. Yet it is one of the most powerful tools in combinatorics, computer science, and - increasingly - machine learning theory.

10.2 Template

+======================================================+
|       PROBABILISTIC METHOD TEMPLATE                  |
+======================================================+
|                                                      |
|  Proof (by the probabilistic method):                |
|                                                      |
|    Define a probability space Ω over candidates.          |
|    Let X be a random candidate drawn from Ω.              |
|    Compute E[cost(X)] or Pr[P(X)].                   |
|                                                      |
|    Show that E[cost(X)] < threshold                  |
|      (or Pr[P(X)] > 0).                              |
|                                                      |
|    Therefore ∃ x ∈ Ω with cost(x) < threshold              |
|      (or satisfying P).  []                           |
|                                                      |
+======================================================+

Key principle: If a random variable has expected value $\mu$, then there exists an outcome $\leq \mu$ (and an outcome $\geq \mu$). This is the first moment method.

10.3 Worked Example - Ramsey Lower Bound

Theorem (Erdős, 1947). $R(k, k) > 2^{k/2}$ (the diagonal Ramsey number exceeds $2^{k/2}$).

Proof (by the probabilistic method).

Consider the complete graph $K_n$ on $n$ vertices. Colour each edge red or blue independently with probability $1/2$.

For any set $S$ of $k$ vertices: $\Pr[\text{all } \binom{k}{2} \text{ edges in } S \text{ are red}] = 2^{-\binom{k}{2}}$.

By the union bound over all $\binom{n}{k}$ choices of $S$:

$$\Pr[\exists \text{ monochromatic } K_k] \leq 2 \binom{n}{k} 2^{-\binom{k}{2}}$$

(the factor 2 accounts for red or blue).

Using $\binom{n}{k} \leq n^k / k!$:

$$\Pr[\exists \text{ monochromatic } K_k] \leq 2 \cdot \frac{n^k}{k!} \cdot 2^{-k(k-1)/2}$$

For $n = \lfloor 2^{k/2} \rfloor$: $n^k \approx 2^{k^2/2}$ and $2^{-k(k-1)/2} = 2^{-k^2/2 + k/2}$. So:

$$\Pr \leq \frac{2 \cdot 2^{k/2}}{k!} < 1 \quad \text{for } k \geq 3$$

Since the probability is $< 1$, there exists a 2-colouring with no monochromatic $K_k$. Therefore $R(k,k) > 2^{k/2}$. $\square$
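A quick numerical look at the final inequality (just evaluating the bound from the proof): the union-bound value $2 \cdot 2^{k/2} / k!$ drops below 1 at $k = 3$ and shrinks rapidly afterwards.

```python
# Evaluate the union bound 2 * 2^(k/2) / k! for small k.
import math

for k in range(2, 11):
    bound = 2 * 2 ** (k / 2) / math.factorial(k)
    print(f"k = {k:2d}   union bound = {bound:.4f}   < 1: {bound < 1}")
```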

10.4 Random Hyperplane Rounding (AI Application)

Theorem (Goemans-Williamson, 1995). The random hyperplane rounding algorithm achieves a 0.878-approximation for MAX-CUT.

Proof sketch (probabilistic method).

Solve the SDP relaxation: assign unit vectors $v_i \in \mathbb{R}^n$ to vertices. For each edge $(i,j)$, the relaxation value is $\frac{1}{2}(1 - v_i \cdot v_j)$.

Rounding: Choose a random hyperplane through the origin (uniformly random normal $r$). Assign vertex $i$ to side $+$ if $r \cdot v_i \geq 0$, else to side $-$.

Edge $(i,j)$ is cut iff $\text{sign}(r \cdot v_i) \neq \text{sign}(r \cdot v_j)$.

$$\Pr[(i,j) \text{ is cut}] = \frac{\arccos(v_i \cdot v_j)}{\pi}$$

The key inequality:

$$\frac{\arccos(v_i \cdot v_j)}{\pi} \geq 0.878 \cdot \frac{1 - v_i \cdot v_j}{2}$$

Therefore: $\mathbb{E}[\text{cut value}] \geq 0.878 \cdot \text{SDP value} \geq 0.878 \cdot \text{OPT}$.

By linearity of expectation, there exists a hyperplane achieving a cut of value $\geq 0.878 \cdot \text{OPT}$. $\square$

AI connection: Approximation algorithms via random rounding are conceptually related to random projection methods (the Johnson-Lindenstrauss lemma, locality-sensitive hashing) used in approximate nearest-neighbour search for embedding retrieval.
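The key inequality is easy to verify numerically over the whole range of inner products (an illustration of the inequality, not of the full algorithm, which would require an SDP solver):

```python
# Check arccos(rho)/pi >= 0.878 * (1 - rho)/2 for all rho in [-1, 1].
import numpy as np

rho = np.linspace(-1.0, 1.0, 200_001)
lhs = np.arccos(rho) / np.pi
rhs = 0.878 * (1.0 - rho) / 2.0
assert np.all(lhs >= rhs - 1e-12)

mask = rho < 1.0
print("worst-case ratio:", (lhs[mask] / ((1.0 - rho[mask]) / 2.0)).min())
# prints ~0.8786, the Goemans-Williamson constant, so 0.878 is safe
```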

10.5 The Probabilistic Method in Deep Learning Theory

Theorem (Random initialisation has non-zero gradients - informal). For a randomly initialised deep network, with probability 1 the gradient $\nabla_\theta L$ is non-zero at initialisation.

Proof sketch.

The set of parameter values where $\nabla L = 0$ exactly is a measure-zero subset of $\mathbb{R}^p$ (it is the zero set of a non-trivial analytic function of the parameters, assuming the loss is analytic in them). A continuous distribution over parameters assigns probability zero to any measure-zero set. Therefore, with probability 1, the initialisation lies outside this set. $\square$

This is why random initialisation "works": it almost surely places us at a point where gradient descent can make progress.


11. Counting and Combinatorial Arguments

11.1 The Strategy

Prove identities or existence results by counting the same set in two different ways (double counting), or by establishing bijections between sets, or by applying combinatorial inequalities (pigeonhole principle, inclusion-exclusion).

11.2 Double Counting

Principle: If you count the elements of a set $S$ in two different ways and get expressions $A$ and $B$, then $A = B$.

Theorem (Handshaking Lemma). In any graph $G = (V, E)$: $\sum_{v \in V} \deg(v) = 2|E|$.

Proof (by double counting).

Count the set $S = \{(v, e) : v \in V, e \in E, v \text{ is an endpoint of } e\}$.

Method 1: Each vertex $v$ contributes $\deg(v)$ pairs. Total: $\sum_v \deg(v)$.

Method 2: Each edge $e = \{u, w\}$ contributes exactly 2 pairs: $(u, e)$ and $(w, e)$. Total: $2|E|$.

Same set, two counts: $\sum_v \deg(v) = 2|E|$. $\square$

AI relevance: Graph neural networks (GNNs) aggregate messages along edges. The handshaking lemma ensures that the total number of message-passing operations equals $2|E|$, which determines the computational cost of one GNN layer.
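The double count is trivially checkable on any concrete graph (a random graph here, purely as illustration):

```python
# Handshaking lemma on a random graph: the degree sum always equals 2|E|.
import itertools
import random

random.seed(0)
n = 50
edges = [e for e in itertools.combinations(range(n), 2) if random.random() < 0.1]

deg = [0] * n
for u, w in edges:
    deg[u] += 1
    deg[w] += 1

assert sum(deg) == 2 * len(edges)
print("sum of degrees:", sum(deg), "  2|E|:", 2 * len(edges))
```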

11.3 The Pigeonhole Principle

Principle: If $n$ items are placed into $m$ containers and $n > m$, then some container holds more than one item.

Generalised: If $n$ items are placed into $m$ containers, some container holds $\geq \lceil n/m \rceil$ items.

Theorem. Among any 13 real numbers, there are two whose difference lies within $1/12$ of an integer.

Proof (by pigeonhole).

Partition $[0, 1)$ into 12 intervals: $[0, 1/12), [1/12, 2/12), \dots, [11/12, 1)$.

Given 13 numbers, consider their fractional parts. By pigeonhole (13 numbers, 12 intervals), at least two fractional parts lie in the same interval, so they differ by less than $1/12$. The difference of the two original numbers therefore lies within $1/12$ of an integer. $\square$

11.4 Pigeonhole in AI - Collision Arguments

Theorem. Any hash function $h: \{0,1\}^n \to \{0,1\}^m$ with $m < n$ has collisions.

Proof (by pigeonhole).

$|\text{domain}| = 2^n > 2^m = |\text{range}|$. By pigeonhole, $\exists x \neq y: h(x) = h(y)$. $\square$

Corollary for embeddings: An embedding $f: \mathcal{X} \to \mathbb{R}^d$ stored at finite precision with $b$ bits per dimension can take at most $2^{db}$ distinct values; if $|\mathcal{X}| > 2^{db}$, then $f$ cannot be injective.

Theorem. Any tokeniser mapping arbitrary strings to token sequences of bounded length over a fixed finite vocabulary must have collisions (multiple strings mapping to the same token sequence), since there are more strings than bounded-length token sequences.

11.5 Inclusion-Exclusion

Principle.

\left|\bigcup_{i=1}^n A_i\right| = \sum_i |A_i| - \sum_{i<j} |A_i \cap A_j| + \sum_{i<j<k} |A_i \cap A_j \cap A_k| - \cdots + (-1)^{n+1} |A_1 \cap \cdots \cap A_n|

AI application - Estimating vocabulary coverage:

Given n documents with vocabularies V_1, \dots, V_n:

|V_1 \cup \cdots \cup V_n| = \sum |V_i| - \sum |V_i \cap V_j| + \cdots

In practice, the higher-order terms are estimated by sampling. This is used in corpus analysis for NLP.
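
A minimal sketch of the exact identity on three made-up document vocabularies; real corpus analysis would estimate the intersection terms by sampling rather than enumerating them.

```python
# Sketch: inclusion-exclusion on three toy document vocabularies (illustrative data).
from itertools import combinations

vocabs = [
    {"the", "model", "learns", "fast"},
    {"the", "model", "overfits"},
    {"the", "data", "learns"},
]

total = 0
for k in range(1, len(vocabs) + 1):
    sign = (-1) ** (k + 1)                      # +, -, +, ... alternating signs
    for group in combinations(vocabs, k):
        total += sign * len(set.intersection(*group))

assert total == len(set.union(*vocabs))         # matches the direct union size
print(total)                                    # 6 distinct words
```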

11.6 Bijective Proofs

Principle: To prove |A| = |B|, exhibit an explicit bijection f: A \to B.

Theorem. The number of subsets of \{1, \dots, n\} of even size equals the number of subsets of odd size (for n \geq 1).

Proof (by bijection).

Fix the element 1. Define f(S) = S \triangle \{1\} (symmetric difference: add 1 if absent, remove it if present).

f maps even-sized subsets to odd-sized subsets and vice versa. f is its own inverse (f \circ f = \text{id}), so it is a bijection. \square

Therefore: \sum_{k \text{ even}} \binom{n}{k} = \sum_{k \text{ odd}} \binom{n}{k} = 2^{n-1}.
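
The bijection can be checked exhaustively for small n. A minimal sketch (n = 4 is an arbitrary choice):

```python
# Sketch: S -> S symmetric-difference {1} pairs even-sized with odd-sized subsets.
from itertools import combinations

n = 4
universe = range(1, n + 1)
subsets = [frozenset(c) for k in range(n + 1) for c in combinations(universe, k)]

f = {S: S ^ frozenset({1}) for S in subsets}               # symmetric difference with {1}

assert all(f[f[S]] == S for S in subsets)                  # involution, hence a bijection
assert all(len(S) % 2 != len(f[S]) % 2 for S in subsets)   # parity always flips

evens = sum(len(S) % 2 == 0 for S in subsets)
odds = len(subsets) - evens
assert evens == odds == 2 ** (n - 1)                       # both counts equal 2^{n-1}
```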


12. Epsilon-Delta and Analytic Proofs

12.1 The \epsilon-\delta Framework

In analysis, many proofs require showing that a quantity can be made arbitrarily small. The standard framework:

\forall \varepsilon > 0,\ \exists \delta > 0: \text{[condition on } \delta\text{]} \implies \text{[conclusion within } \varepsilon\text{]}

The key challenge is constructing \delta as a function of \varepsilon (and sometimes other parameters).

12.2 Convergence of Gradient Descent (Fixed Step Size)

Theorem. Let f: \mathbb{R}^d \to \mathbb{R} be convex and L-smooth (\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\| for all x, y). Gradient descent with step size \eta = 1/L satisfies:

f(x_T) - f(x^*) \leq \frac{L \|x_0 - x^*\|^2}{2T}

Proof.

By L-smoothness, for any x, y:

f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \|y - x\|^2

Set y = x_{t+1} = x_t - \eta \nabla f(x_t) and x = x_t:

f(x_{t+1}) \leq f(x_t) - \eta \|\nabla f(x_t)\|^2 + \frac{L \eta^2}{2} \|\nabla f(x_t)\|^2 = f(x_t) - \eta\left(1 - \frac{L\eta}{2}\right) \|\nabla f(x_t)\|^2

With \eta = 1/L: f(x_{t+1}) \leq f(x_t) - \frac{1}{2L} \|\nabla f(x_t)\|^2, so the loss never increases.

By convexity, f(x_t) \leq f(x^*) + \nabla f(x_t)^T (x_t - x^*). Substituting this into the per-step bound and expanding

\|x_{t+1} - x^*\|^2 = \|x_t - x^*\|^2 - \frac{2}{L} \nabla f(x_t)^T (x_t - x^*) + \frac{1}{L^2} \|\nabla f(x_t)\|^2

gives, after rearranging,

f(x_{t+1}) - f(x^*) \leq \frac{L}{2}\left(\|x_t - x^*\|^2 - \|x_{t+1} - x^*\|^2\right)

Summing over t = 0, \dots, T-1 telescopes the right-hand side to at most \frac{L}{2}\|x_0 - x^*\|^2. Since the per-step values f(x_{t+1}) - f(x^*) are non-increasing, the last one is at most their average:

f(x_T) - f(x^*) \leq \frac{1}{T}\sum_{t=0}^{T-1}\left(f(x_{t+1}) - f(x^*)\right) \leq \frac{L \|x_0 - x^*\|^2}{2T}

This is an O(1/T) convergence rate. \square

Reading this proof: Note the structure - establish a per-step improvement bound, then telescope the sum. This pattern recurs throughout optimisation theory.
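
The bound can also be checked numerically. A minimal sketch on an illustrative convex quadratic f(x) = \frac{1}{2} x^T A x, where the smoothness constant L is the largest eigenvalue of A and the minimiser is x^* = 0:

```python
# Sketch: check f(x_T) - f(x*) <= L * ||x_0 - x*||^2 / (2T) on a convex quadratic.
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)               # symmetric positive definite Hessian (illustrative)
L = np.linalg.eigvalsh(A).max()       # smoothness constant = largest eigenvalue
x_star = np.zeros(5)                  # minimiser of f(x) = 0.5 * x^T A x

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x0 = rng.normal(size=5)
x, T = x0.copy(), 200
for _ in range(T):
    x = x - (1.0 / L) * grad(x)       # gradient descent with step size 1/L

bound = L * np.linalg.norm(x0 - x_star) ** 2 / (2 * T)
assert f(x) - f(x_star) <= bound + 1e-12
print(f"gap {f(x) - f(x_star):.3e} <= bound {bound:.3e}")
```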

12.3 PAC-Learning Bound (Finite Hypothesis Classes)

Theorem. Let \mathcal{H} be a finite hypothesis class. For any distribution \mathcal{D}, any target h^* \in \mathcal{H}, and any \varepsilon, \delta > 0: if m \geq \frac{1}{\varepsilon} \ln \frac{|\mathcal{H}|}{\delta}, then with probability \geq 1 - \delta, any h \in \mathcal{H} consistent with m i.i.d. samples has error \leq \varepsilon.

Proof.

Fix a "bad" hypothesis h with \text{err}(h) > \varepsilon. The probability it is consistent with one sample:

\Pr[h(x) = h^*(x)] \leq 1 - \varepsilon

Over m i.i.d. samples:

\Pr[h \text{ consistent with all } m \text{ samples}] \leq (1 - \varepsilon)^m \leq e^{-\varepsilon m}

Union bound over all bad hypotheses (\leq |\mathcal{H}| of them):

\Pr[\exists \text{ bad consistent } h] \leq |\mathcal{H}| \cdot e^{-\varepsilon m}

Set this \leq \delta: |\mathcal{H}| e^{-\varepsilon m} \leq \delta \iff m \geq \frac{1}{\varepsilon} \ln \frac{|\mathcal{H}|}{\delta}. \square

Significance: This is the foundational bound in computational learning theory. It connects sample complexity to hypothesis class size logarithmically - explaining why large model families need more data.
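
A minimal sketch of the sample-complexity formula itself, showing the logarithmic dependence on |\mathcal{H}| (the class sizes and the choices \varepsilon = 0.05, \delta = 0.01 are illustrative):

```python
# Sketch: realisable-case sample complexity m >= (1/eps) * ln(|H| / delta).
import math

def pac_sample_size(H_size: int, eps: float, delta: float) -> int:
    return math.ceil((1.0 / eps) * math.log(H_size / delta))

for H_size in (10, 10**3, 10**6, 10**9):
    print(f"|H| = {H_size:>10}  ->  m >= {pac_sample_size(H_size, eps=0.05, delta=0.01)}")
# |H| grows by factors of 1000, but m grows only additively: logarithmic dependence.
```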

12.4 Uniform Convergence via \epsilon-Nets

Theorem (\epsilon-net argument). For a hypothesis class with VC-dimension d, m = O\left(\frac{d + \ln(1/\delta)}{\varepsilon^2}\right) samples suffice for uniform convergence of empirical risk to true risk within \varepsilon.

Proof idea.

The \epsilon-net approach discretises the infinite hypothesis class: find a finite \epsilon-net \mathcal{N}_\varepsilon (of size determined by the VC dimension) such that every h \in \mathcal{H} is close to some h' \in \mathcal{N}_\varepsilon. Apply the finite class bound to \mathcal{N}_\varepsilon.

The key technical tool is the symmetrisation argument (Vapnik-Chervonenkis): replace the true risk with an empirical risk on a "ghost sample," then bound the probability of large deviations using Rademacher complexity or growth function bounds.

This is the theoretical foundation of generalisation bounds in machine learning.


13. Proof Patterns in Machine Learning

13.1 Overview

ML proofs recurrently use a small number of structural patterns. Recognising these patterns helps you read and construct proofs more efficiently.

+======================================================+
|       ML PROOF PATTERN LANDSCAPE                     |
+======================================================+
|                                                      |
|  Bound-Based    ---- Union bound + per-element       |
|                       probability control             |
|                                                      |
|  Descent-Based  ---- Per-step improvement            |
|                       + telescoping sum               |
|                                                      |
|  Convexity      ---- Jensen's inequality /            |
|                       definition verification         |
|                                                      |
|  Symmetry       ---- Rademacher / permutation         |
|                       arguments                       |
|                                                      |
|  Information    ---- Data processing inequality /     |
|                       mutual information bounds       |
|                                                      |
+======================================================+

13.2 The Union Bound Pattern

Structure: To bound \Pr[\text{bad event}], decompose into sub-events and apply:

\Pr\left[\bigcup_i A_i\right] \leq \sum_i \Pr[A_i]

Used in: PAC-learning, uniform convergence, multi-task bounds, reward model error analysis.

Pro tip: If the union bound is too loose (sum exceeds 1), try:

  • Chernoff/Hoeffding bounds for tighter individual terms
  • \epsilon-net arguments to reduce the number of terms
  • Symmetrisation to avoid direct union bounding
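
A Monte Carlo sketch of the inequality, and of the looseness the pro tip warns about. The events below - deviations of a coin-flip mean beyond several thresholds - are illustrative and deliberately nested, so the union bound overcounts:

```python
# Sketch: Monte Carlo comparison of Pr[union of A_i] with the union bound sum of Pr[A_i].
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 20_000
thresholds = [0.05, 0.07, 0.10]                       # illustrative deviation thresholds

flips = rng.integers(0, 2, size=(trials, n))          # trials x n fair coin flips
dev = np.abs(flips.mean(axis=1) - 0.5)                # deviation of each empirical mean

events = np.stack([dev > t for t in thresholds])      # A_i = {deviation exceeds t_i}
p_union = events.any(axis=0).mean()                   # estimated Pr[A_1 u A_2 u A_3]
p_sum = events.mean(axis=1).sum()                     # sum of individual Pr[A_i]

print(f"Pr[union] ~ {p_union:.3f}  <=  sum of Pr[A_i] ~ {p_sum:.3f}")
assert p_union <= p_sum + 1e-12
```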

13.3 The Descent Lemma Pattern

Structure:

  1. Show f(x_{t+1}) \leq f(x_t) - c \cdot g(x_t) for some positive quantity g.
  2. Telescope: \sum_{t=0}^{T-1} g(x_t) \leq \frac{f(x_0) - f(x_T)}{c} \leq \frac{f(x_0) - f^*}{c}.
  3. Conclude: \min_t g(x_t) \leq \frac{f(x_0) - f^*}{cT}.

Used in: Gradient descent convergence, SGD analysis, Adam convergence, policy gradient convergence.

13.4 The Convexity Verification Pattern

Structure: To prove f is convex, verify one of:

  1. Definition: f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y)
  2. First-order: f(y) \geq f(x) + \nabla f(x)^T(y-x) for all x, y
  3. Second-order: \nabla^2 f(x) \succeq 0 for all x (Hessian is PSD)

Example - Cross-entropy is convex in logits:

L(z) = -z_y + \log \sum_v e^{z_v}

\frac{\partial^2 L}{\partial z_i \partial z_j} = \text{diag}(p) - pp^T \quad \text{where } p_i = \frac{e^{z_i}}{\sum_v e^{z_v}}

The Hessian \text{diag}(p) - pp^T is the covariance matrix of a categorical distribution with probabilities p - always PSD, so the loss is convex in the logits. \square
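
A minimal numerical sketch of the second-order check: form \text{diag}(p) - pp^T for an arbitrary logit vector and confirm all eigenvalues are non-negative.

```python
# Sketch: the cross-entropy Hessian diag(p) - p p^T is PSD for any logits z.
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=6)                     # arbitrary logits
p = np.exp(z - z.max())
p /= p.sum()                               # softmax probabilities

H = np.diag(p) - np.outer(p, p)            # Hessian of L(z) = -z_y + logsumexp(z)
eigvals = np.linalg.eigvalsh(H)
assert eigvals.min() >= -1e-12             # non-negative up to floating-point error
print(eigvals)
```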

13.5 The Symmetrisation Argument

Structure: To bound \sup_{h \in \mathcal{H}} |\hat{R}(h) - R(h)| (empirical vs true risk):

  1. Introduce a "ghost" sample S' of equal size.
  2. Replace R(h) with \hat{R}_{S'}(h) (close with high probability by concentration).
  3. \sup_h |\hat{R}_S(h) - \hat{R}_{S'}(h)| is now a function of two finite samples.
  4. Introduce Rademacher variables \sigma_i \in \{-1, +1\}: swapping S and S' entries is equivalent.
  5. Bound the Rademacher complexity \mathcal{R}_m(\mathcal{H}).

This converts an infinite supremum into a computable quantity and is the basis of Rademacher complexity generalisation bounds.
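
A minimal sketch of the quantity that step 5 asks for: a Monte Carlo estimate of the empirical Rademacher complexity of a tiny, illustrative class of threshold classifiers on a fixed sample (all sizes and thresholds are arbitrary choices).

```python
# Sketch: Monte Carlo estimate of empirical Rademacher complexity for a tiny class
# of threshold classifiers h_c(x) = sign(x - c) on a fixed sample.
import numpy as np

rng = np.random.default_rng(0)
m = 50
xs = rng.uniform(0, 1, size=m)                                    # fixed sample S
thresholds = np.linspace(0, 1, 11)
H = np.stack([np.where(xs > c, 1.0, -1.0) for c in thresholds])   # |H| x m predictions

draws = 2_000
sigma = rng.choice([-1.0, 1.0], size=(draws, m))                  # Rademacher variables
# For each sigma draw, take the sup over hypotheses of (1/m) * sum_i sigma_i * h(x_i).
sups = (sigma @ H.T / m).max(axis=1)
print("empirical Rademacher complexity ~", sups.mean())
```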

13.6 Proof Strategy Selection Guide

| I want to show... | Technique | Key tool |
| --- | --- | --- |
| Algorithm converges | Descent lemma | Per-step bound + telescope |
| Model generalises | Union bound / Rademacher | PAC / VC / Rademacher |
| Loss is well-behaved | Convexity verification | Hessian PSD check |
| Architecture has capacity | Structural induction | Induction on depth/width |
| Existence of good model | Probabilistic method | Random construction |
| Lower bound on complexity | Contradiction / counting | Information-theoretic |
| Two quantities are equal | Bijection / double counting | Combinatorial identity |
| Gradient computation is correct | Structural induction | Induction on computation graph |

14. Common Mistakes and Pitfalls

14.1 Affirming the Consequent

Fallacy: From P \to Q and Q, conclude P.

Example: "If a model overfits, training loss is low. Training loss is low. Therefore the model overfits." - Invalid. Low training loss could be from a well-generalising model.

Correct reasoning: P \to Q and Q tells you nothing about P. Only P \to Q and \lnot Q lets you conclude \lnot P (modus tollens).

14.2 Denying the Antecedent

Fallacy: From P \to Q and \lnot P, conclude \lnot Q.

Example: "If learning rate is too high, training diverges. Learning rate is not too high. Therefore training does not diverge." - Invalid. Training can diverge for other reasons (exploding gradients, numerical issues).

14.3 Confusing Necessary and Sufficient Conditions

| | Necessary | Sufficient |
| --- | --- | --- |
| Form | Q \to P | P \to Q |
| Meaning | P is required for Q | P guarantees Q |
| Contrapositive | \lnot P \to \lnot Q | \lnot Q \to \lnot P |

Example: "Gradient = 0 is necessary for a minimum, but not sufficient." (Saddle points also have zero gradient.)

14.4 Proof by Example (Invalid)

Verifying P(n) for specific values of n does not prove \forall n\, P(n).

Famous counterexample: n^2 + n + 41 is prime for n = 0, 1, 2, \dots, 39 - but 40^2 + 40 + 41 = 41^2 is not prime.

In ML: "The model works on the test set" does not prove "the model works on all inputs." This is precisely why generalisation theory exists.
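
The counterexample is easy to verify directly; a minimal sketch:

```python
# Sketch: n^2 + n + 41 is prime for n = 0..39 but composite at n = 40.
def is_prime(k: int) -> bool:
    return k >= 2 and all(k % d for d in range(2, int(k ** 0.5) + 1))

assert all(is_prime(n * n + n + 41) for n in range(40))
assert not is_prime(40 * 40 + 40 + 41)     # 1681 = 41 * 41
```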

14.5 Circular Reasoning

Using the conclusion as a premise (possibly through a chain of implications that loops back).

Example: "Neural networks are universal approximators because they can approximate any function because they are universal approximators."

Detection: Trace the chain of reasoning. If any step assumes (even implicitly) what you are trying to prove, the argument is circular.

14.6 Misusing Induction

"All horses are the same colour" fallacy:

"Claim: Any group of nn horses are the same colour.

Base: n = 1. One horse has one colour. OK

Inductive step: Assume any k horses are the same colour. Given k+1 horses \{h_1, \dots, h_{k+1}\}:

  • \{h_1, \dots, h_k\}: same colour by IH.
  • \{h_2, \dots, h_{k+1}\}: same colour by IH.
  • These overlap in \{h_2, \dots, h_k\}, so all k+1 are the same colour."

The bug: When k = 1, the overlap \{h_2, \dots, h_k\} = \varnothing - there is no overlap! The inductive step fails at k = 1.

Lesson: Always check that the inductive step works at the boundary (k = n_0).

14.7 Quantifier Errors

| Written | Intended | Status |
| --- | --- | --- |
| \forall \varepsilon > 0\ \exists N: \dots | For each \varepsilon, find N | Correct (convergence) |
| \exists N\ \forall \varepsilon > 0: \dots | One N works for all \varepsilon | Wrong (much stronger) |

In ML context:

  • "\forall \varepsilon, \exists m: with m samples, error < \varepsilon" - correct PAC statement.
  • "\exists m, \forall \varepsilon: with m samples, error < \varepsilon" - would mean finite samples give zero error, which is false.

The order of quantifiers matters. Swapping \forall and \exists changes the meaning entirely.


15. Practice Exercises

15.1 Direct Proof Exercises

Exercise 1. Prove: the sum of two odd integers is even.

Exercise 2. Prove: if a | b and b | c, then a | c (divisibility is transitive).

Exercise 3. Prove: \text{ReLU}(x) + \text{ReLU}(-x) = |x|.

15.2 Contrapositive Exercises

Exercise 4. Prove by contrapositive: if n^3 is odd, then n is odd.

Exercise 5. Prove by contrapositive: if f is not Lipschitz, then f' is unbounded (for differentiable f on \mathbb{R}).

15.3 Contradiction Exercises

Exercise 6. Prove by contradiction: \sqrt{3} is irrational.

Exercise 7. Prove by contradiction: there is no smallest positive rational number.

Exercise 8. Prove: \log_2 3 is irrational.

15.4 Induction Exercises

Exercise 9. Prove by induction: \sum_{i=0}^{n} 2^i = 2^{n+1} - 1.

Exercise 10. Prove by induction: a binary tree of height h has at most 2^{h+1} - 1 nodes.

Exercise 11. Prove by strong induction: every integer n \geq 2 can be expressed as a sum of distinct powers of 2 (binary representation with no repeats).

15.5 Cases and Construction Exercises

Exercise 12. Prove by cases: \max(a, b) = \frac{a + b + |a - b|}{2}.

Exercise 13. Construct a bijection between \{0, 1\}^n (binary strings of length n) and \mathcal{P}(\{1, \dots, n\}) (subsets of \{1, \dots, n\}).

15.6 Counting and Probabilistic Exercises

Exercise 14. Use the pigeonhole principle to show: in any group of 5 integers, some pair has the same remainder mod 4.

Exercise 15. Use the handshaking lemma to prove: in any graph, the number of vertices with odd degree is even.

15.7 ML-Specific Exercises

Exercise 16. Prove that softmax output lies in the probability simplex: \sum_i \text{softmax}(z)_i = 1 and \text{softmax}(z)_i > 0 for all i.

Exercise 17. Prove by induction on network depth: the composition of d Lipschitz functions with constants L_1, \dots, L_d is Lipschitz with constant \prod_{i=1}^d L_i.

Exercise 18. Prove by contradiction: a continuous function on a closed bounded interval [a, b] attains its maximum (use sequences and the Bolzano-Weierstrass theorem).

15.8 Solution Sketches

Exercise 1 solution sketch:

Let a = 2k + 1, b = 2m + 1. Then a + b = 2k + 2m + 2 = 2(k + m + 1). Since k + m + 1 \in \mathbb{Z}, the sum is even. \square

Exercise 6 solution sketch:

Assume \sqrt{3} = p/q with \gcd(p,q) = 1. Then 3q^2 = p^2, so 3 | p^2, hence 3 | p (since 3 is prime). Write p = 3k: 3q^2 = 9k^2 \implies q^2 = 3k^2, so 3 | q. But 3 | p and 3 | q contradicts \gcd(p,q) = 1. \square

Exercise 9 solution sketch:

Base (n = 0): 2^0 = 1 = 2^1 - 1. OK

IH: Assume \sum_{i=0}^{k} 2^i = 2^{k+1} - 1.

\sum_{i=0}^{k+1} 2^i = (2^{k+1} - 1) + 2^{k+1} = 2 \cdot 2^{k+1} - 1 = 2^{k+2} - 1. OK \square

Exercise 16 solution sketch:

Positivity: e^{z_i} > 0 for all z_i \in \mathbb{R}, and \sum_v e^{z_v} > 0, so \text{softmax}(z)_i = e^{z_i} / \sum_v e^{z_v} > 0.

Sum to 1: \sum_i \text{softmax}(z)_i = \sum_i e^{z_i} / \sum_v e^{z_v} = (\sum_i e^{z_i}) / (\sum_v e^{z_v}) = 1. \square
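
A quick numerical check of this solution sketch (the random logits are arbitrary; subtracting the maximum is a standard stability trick, not part of the proof):

```python
# Sketch: numerical check that softmax outputs lie in the probability simplex.
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(scale=5.0, size=10)       # arbitrary logits
p = np.exp(z - z.max())                  # max-subtraction for numerical stability only
p /= p.sum()

assert np.all(p > 0)                     # strict positivity
assert np.isclose(p.sum(), 1.0)          # components sum to 1
```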


16. Why It Matters for AI

16.1 Proofs as Guarantees

Machine learning operates in a landscape of empirical results. Proofs provide the theoretical guardrails:

| What proofs guarantee | Example |
| --- | --- |
| Correctness | Backpropagation correctly computes gradients (structural induction, 9.5) |
| Convergence | Gradient descent reaches the optimum at rate O(1/T) (analytic proof, 12.2) |
| Generalisation | PAC bounds on sample complexity (union bound + counting, 12.3) |
| Expressiveness | Universal approximation - networks can represent any continuous function |
| Impossibility | No Free Lunch - no single algorithm dominates on all problems |
| Robustness | Lipschitz bounds ↔ adversarial robustness (contrapositive, 4.5) |

Without proofs, we are running experiments in the dark. With proofs, we know why things work (or fail) and when guarantees hold.

16.2 Proofs and Algorithms

There is a deep connection between proofs and algorithms (the Curry-Howard correspondence in its broadest sense):

| Proof concept | Algorithm concept |
| --- | --- |
| Constructive existence proof | Algorithm that finds the object |
| Inductive proof | Recursive algorithm |
| Case analysis | Branching / conditional logic |
| Contradiction | Exception handling / invariant violation |
| Bound derivation | Complexity analysis |

When you prove a convergence rate of O(1/T), you are simultaneously describing the algorithm (gradient descent) and its guarantee. When you prove existence by construction, you are writing a program.

16.3 Proof Literacy for ML Practitioners

You don't need to prove new theorems to benefit from proof literacy. Knowing how to read proofs lets you:

  1. Understand assumptions: Every theorem has hypotheses. Knowing them tells you when the guarantee applies to your model and when it doesn't.

  2. Spot gaps in arguments: "This architecture is better because it has more parameters" - is that a proof? What's the implicit theorem? What are the hidden assumptions?

  3. Debug theoretical claims: Papers sometimes contain proof errors. Proof literacy lets you verify claims before building on them.

  4. Transfer results: If you understand why a theorem holds (not just that it holds), you can adapt it to new settings.

16.4 Modern Frontiers - AI for Proofs

The field is coming full circle: AI systems are now being used to discover and verify proofs:

| System | Achievement |
| --- | --- |
| AlphaProof (DeepMind, 2024) | Solved IMO-level problems via reinforcement learning + formal verification |
| Lean / Mathlib | Formal proof library with growing ML-generated contributions |
| Llemma (2023) | Open language model specialised for mathematical reasoning |
| Autoformalization | Translating informal proofs into formal (verifiable) proofs via LLMs |
| ATP systems | Automated theorem provers (Vampire, E, Z3) integrated with ML heuristics |

The ability to formalise and verify proofs is becoming a key AI capability, and understanding proof techniques is essential for building and evaluating these systems.


17. Conceptual Bridge

17.1 From Proof Strategy to ML Reasoning

This chapter sits underneath the rest of the curriculum in a quiet but important way. Later topics will look like they are about vectors, derivatives, optimization, probability, or model architecture, but the moment a theorem appears, the same proof patterns return:

  • direct proof for algebraic identities and gradient derivations
  • contradiction for impossibility results and lower bounds
  • induction for sequence length, layer depth, and recursive structure
  • construction for algorithms, witnesses, and counterexamples

Proof techniques are therefore not separate from applied ML mathematics. They are the reasoning templates that make later chapters readable.

17.2 What Comes Next

As you move into linear algebra, calculus, probability, and optimization, try to read every result with three questions in mind: what is assumed, what is being concluded, and why this proof strategy fits the shape of the claim. That habit turns proof literacy into practical engineering judgment, because it helps you distinguish empirical patterns from actual guarantees.


References

  1. Velleman, D.J. How to Prove It: A Structured Approach. Cambridge University Press, 3rd ed., 2019.
  2. Hammack, R. Book of Proof. Virginia Commonwealth University, 3rd ed., 2018. (Free online)
  3. Alon, N. & Spencer, J.H. The Probabilistic Method. Wiley, 4th ed., 2016.
  4. Shalev-Shwartz, S. & Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
  5. Boyd, S. & Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.
  6. Goemans, M.X. & Williamson, D.P. "Improved approximation algorithms for maximum cut..." JACM 42(6), 1995.
  7. Erdős, P. "Some remarks on the theory of graphs." Bulletin of the AMS 53, 1947.
  8. Nesterov, Y. Introductory Lectures on Convex Optimization. Springer, 2004.
  9. Trinh, T. et al. "Solving olympiad geometry without human demonstrations." Nature 625, 2024.

