Math for LLMs

Concentration Inequalities

Probability Theory / Concentration Inequalities

"The tendency of random quantities to concentrate near their expectations is one of the most powerful and pervasive phenomena in all of mathematics."

  • Stéphane Boucheron, Gábor Lugosi & Pascal Massart, Concentration Inequalities (2013)

Overview

A random variable $X$ has an expectation $\mathbb{E}[X]$, but individual realisations scatter around it. Concentration inequalities answer the quantitative question: how far from $\mathbb{E}[X]$ can $X$ stray, and with what probability? These are not mere abstract bounds - they are the mathematical engine behind almost every theoretical guarantee in machine learning.

Every time a practitioner asks "how many training examples do I need to achieve 95% accuracy with 99% confidence?", the answer is a concentration inequality. Every time a theorist proves that a learning algorithm generalises from training data to unseen data, the proof uses a concentration inequality. The PAC learning framework, VC theory, Rademacher complexity, and modern neural tangent kernel theory all rest on the same probabilistic foundation built in this section.

The central insight is a hierarchy: weaker assumptions yield looser bounds, stronger assumptions yield tighter ones. Markov's inequality needs only non-negativity and a finite mean. Chebyshev's needs a finite variance. Hoeffding's needs bounded support. Chernoff's needs a moment generating function. The practitioner's art is matching the right bound to the right problem.

Prerequisites

Companion Notebooks

| Notebook | Description |
| --- | --- |
| theory.ipynb | Interactive derivations: Markov, Chebyshev, Hoeffding, Chernoff, McDiarmid, PAC bounds, Rademacher complexity |
| exercises.ipynb | 10 graded exercises, from tail-bound verification to VC dimension and Rademacher estimation |

Learning Objectives

After completing this section, you will be able to:

  1. State and prove Markov's inequality and identify when it is tight
  2. Derive Chebyshev's inequality from Markov and apply the $k$-sigma rule quantitatively
  3. Define sub-Gaussian random variables and prove the sub-Gaussian tail bound
  4. State and prove Hoeffding's inequality for bounded independent random variables
  5. Apply the Chernoff method to derive tail bounds via MGF optimisation
  6. State Bernstein's inequality and explain when it improves on Hoeffding
  7. Apply McDiarmid's inequality to functions of independent random variables
  8. Use the union bound with covering arguments to control events over hypothesis classes
  9. Derive the PAC generalisation bound for finite hypothesis classes
  10. Define VC dimension and state the generalisation bound with Sauer-Shelah lemma
  11. Define Rademacher complexity and interpret the Rademacher generalisation bound
  12. Compute required sample sizes for given $(\varepsilon, \delta)$-PAC guarantees

1. Intuition

1.1 What Is Concentration?

A random variable $X$ has a mean $\mu = \mathbb{E}[X]$. But knowing the mean alone tells us nothing about how a single draw is distributed around it. Concentration inequalities fill this gap by bounding the probability that $X$ deviates far from $\mu$.

The fundamental question is: given what we know about $X$ (its mean, variance, boundedness, or MGF), what is the tightest bound we can place on $P(|X - \mu| \geq t)$?

This matters deeply in practice. In machine learning, we observe the empirical mean of a loss function on training data and want to know how well it approximates the true (population) mean. If $X_1, \ldots, X_n$ are i.i.d. losses with mean $\mu$ and we observe $\bar{X}_n = \frac{1}{n}\sum_i X_i$, concentration inequalities tell us how close $\bar{X}_n$ is to $\mu$ with high probability. This is the foundation of statistical learning theory.

The sample mean concentrates around the true mean: as $n$ grows, the deviation $|\bar{X}_n - \mu|$ becomes increasingly small with high probability. This is the content of the Law of Large Numbers - but concentration inequalities give quantitative, finite-$n$ bounds, not just asymptotic guarantees.

For AI: Every evaluation metric in ML (accuracy, F1, BLEU, ROUGE, perplexity) is estimated from a finite test set. Concentration inequalities tell us how many test examples we need to trust the estimate. HELM benchmarks for LLMs, for instance, implicitly rely on Hoeffding-type bounds for the confidence intervals around model comparisons.
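The test-set point above can be made concrete. The following is a minimal sketch (not part of the companion notebooks; the helper name is ours) of a distribution-free confidence interval for an accuracy estimate, using the two-sided Hoeffding bound $P(|\bar{X}_n - \mu| \geq t) \leq 2e^{-2nt^2}$ for $[0,1]$-valued outcomes derived later in this section:

```python
import math

def hoeffding_interval(acc_hat: float, n: int, delta: float = 0.05):
    """Two-sided, distribution-free CI for a mean of [0, 1]-bounded scores.

    Solving 2 * exp(-2 n t^2) = delta for t gives the half-width below.
    """
    t = math.sqrt(math.log(2 / delta) / (2 * n))
    return max(0.0, acc_hat - t), min(1.0, acc_hat + t)

# An accuracy of 73.2% measured on 1000 questions carries a roughly ±4.3% band:
lo, hi = hoeffding_interval(0.732, n=1000)
```

The half-width shrinks as $1/\sqrt{n}$: quadrupling the test set halves the interval.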

1.2 The Hierarchy of Bounds

Concentration inequalities form a clear hierarchy. Moving down requires stronger assumptions but yields exponentially tighter bounds:

CONCENTRATION INEQUALITY HIERARCHY
========================================================================

  Assumption             Inequality        Tail bound form
  ----------------------------------------------------------------------
  E[X] < inf, X >= 0     Markov            P(X >= t) <= mu/t
  (polynomial decay)

  Var(X) < inf           Chebyshev         P(|X - mu| >= t) <= sigma^2/t^2
  (polynomial decay)

  E[e^{sX}] < inf        Chernoff method   P(X >= t) <= min_s e^{-st} M(s)
  (general exponential)

  X in [a, b] a.s.       Hoeffding         P(Xbar - mu >= t) <= exp(-2nt^2/(b-a)^2)
  (sub-Gaussian)         (exponential decay)

  Var known + bounded    Bernstein         exp(-nt^2/(2 sigma^2 + 2ct/3))
  (variance-aware)       (tighter for small t)

  f(X_1, ..., X_n) with  McDiarmid         exp(-2t^2 / sum_i c_i^2)
  bounded differences    (functions of independent variables)

========================================================================

The key insight: moment-based bounds (Markov, Chebyshev) give polynomial tails $O(1/t^k)$, while MGF-based bounds give exponential tails $O(e^{-ct^2})$. Exponential tails are vastly superior - a Gaussian has $P(|X| \geq 3\sigma) \approx 0.003$, while Chebyshev gives only $1/9 \approx 0.11$.
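A quick numerical check of the polynomial-versus-exponential gap, written as a throwaway sketch (the helper names are ours): compare the distribution-free Chebyshev tail with the exact Gaussian tail at a few multiples of $\sigma$.

```python
import math

def chebyshev_tail(k: float) -> float:
    """Distribution-free bound on P(|X - mu| >= k sigma)."""
    return 1.0 / k**2

def gaussian_tail(k: float) -> float:
    """Exact P(|Z| >= k) for Z ~ N(0, 1), via the complementary error function."""
    return math.erfc(k / math.sqrt(2))

# Chebyshev stays valid for any finite-variance law, but its looseness
# for thin-tailed distributions grows rapidly with k.
ratio = {k: chebyshev_tail(k) / gaussian_tail(k) for k in (2, 3, 5)}
```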

1.3 Historical Timeline

| Year | Person | Contribution |
| --- | --- | --- |
| 1867 | Chebyshev | Proved the inequality bearing his name; used it to prove the weak LLN |
| 1899 | Markov (student of Chebyshev) | Simpler proof via indicator argument; Markov's inequality |
| 1952 | Herman Chernoff | MGF-based exponential tail bounds for sums of independent variables |
| 1963 | Wassily Hoeffding | Tight bounds for bounded variables; modern form used everywhere |
| 1971 | Vapnik & Chervonenkis | VC dimension theory; combinatorial approach to generalisation |
| 1984 | Leslie Valiant | PAC learning framework - a theoretical model for machine learning |
| 1989 | Colin McDiarmid | Bounded differences inequality for general functions of independent variables |
| 2002 | Bartlett & Mendelson | Rademacher complexity - data-dependent alternative to VC theory |
| 2013 | Boucheron, Lugosi & Massart | Comprehensive monograph unifying modern concentration theory |

1.4 Why Concentration Matters for ML

Generalisation: A neural network achieves 99% training accuracy. Does it generalise? Concentration inequalities bound $|R(h) - \hat{R}(h)|$ - the gap between true and empirical risk. They tell us when we can trust training metrics.

SGD analysis: Stochastic gradient descent uses a mini-batch estimate $\hat{g}$ of the true gradient $g$. Concentration inequalities bound $\|\hat{g} - g\|_2$, determining the noise level and required step sizes for convergence.

Confidence intervals: When evaluating an LLM on 1000 benchmark questions with accuracy 73.2%, concentration bounds tell us the confidence interval around this estimate.

Random features: The Rahimi-Recht random features method approximates kernel matrices using random projections. How many random features are needed? The answer is a Hoeffding bound.

Dropout: Dropout randomly zeroes activations. The kept activations are bounded, so McDiarmid's inequality bounds how far the network output deviates from its expectation over the dropout mask.


2. Markov's and Chebyshev's Inequalities

2.1 Markov's Inequality

Theorem (Markov's Inequality). Let $X \geq 0$ be a non-negative random variable with $\mathbb{E}[X] < \infty$. For any $t > 0$:

$$P(X \geq t) \leq \frac{\mathbb{E}[X]}{t}$$

Proof. This follows immediately from the pointwise inequality $\mathbf{1}[X \geq t] \leq X/t$, valid for $X \geq 0$ and $t > 0$:

$$P(X \geq t) = \mathbb{E}\big[\mathbf{1}[X \geq t]\big] \leq \mathbb{E}\!\left[\frac{X}{t}\right] = \frac{\mathbb{E}[X]}{t}$$

This proof is a master class in the indicator trick: write the probability as the expectation of an indicator, then bound the indicator by the random variable itself.

Tightness. Markov's bound is tight. Fix $0 < \mu \leq t$ and let $X = t$ with probability $\mu/t$ and $X = 0$ otherwise. Then $\mathbb{E}[X] = \mu$ and $P(X \geq t) = \mu/t$ - exactly matching the bound.

Extensions. Markov applies to any non-negative, non-decreasing transformation of $X$: for any such $\phi$ with $\phi(t) > 0$,

$$P(X \geq t) \leq P(\phi(X) \geq \phi(t)) \leq \frac{\mathbb{E}[\phi(X)]}{\phi(t)}$$

Setting $\phi(x) = x^k$ gives the $k$-th moment bound $P(X \geq t) \leq \mathbb{E}[X^k]/t^k$; setting $\phi(x) = e^{sx}$ gives the Chernoff method.

For AI: Markov's inequality underlies gradient-clipping analysis. If $\mathbb{E}[\|g\|_2] \leq G$, then $P(\|g\|_2 \geq cG) \leq 1/c$ - so gradient norms exceed $10G$ on at most 10% of steps.
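Markov's bound is easy to sanity-check by simulation. A minimal sketch with $\operatorname{Exp}(1)$ draws (non-negative, mean 1); the true tail $e^{-t}$ sits far below the bound $\mu/t$, illustrating how loose - yet always valid - Markov is:

```python
import random

random.seed(0)
xs = [random.expovariate(1.0) for _ in range(100_000)]  # non-negative, mean ~1
mean = sum(xs) / len(xs)

# For each threshold t: (empirical tail fraction, Markov bound mean/t).
check = {t: (sum(x >= t for x in xs) / len(xs), mean / t) for t in (2.0, 5.0, 10.0)}
```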

2.2 Chebyshev's Inequality

Theorem (Chebyshev's Inequality). For any random variable $X$ with $\mathbb{E}[X] = \mu$ and $\operatorname{Var}(X) = \sigma^2 < \infty$, and any $t > 0$:

$$P(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2}$$

Equivalently, setting $t = k\sigma$: $P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$.

Proof. Apply Markov to the non-negative variable $Y = (X - \mu)^2$:

$$P(|X - \mu| \geq t) = P\big((X-\mu)^2 \geq t^2\big) \leq \frac{\mathbb{E}[(X-\mu)^2]}{t^2} = \frac{\sigma^2}{t^2}$$

The $k$-sigma rule. Chebyshev guarantees that at least $1 - 1/k^2$ of the probability mass lies within $k\sigma$ of the mean, for any distribution with finite variance:

| $k$ | Chebyshev bound | Gaussian actual |
| --- | --- | --- |
| 2 | $\geq 75\%$ within $2\sigma$ | $\approx 95.4\%$ |
| 3 | $\geq 88.9\%$ within $3\sigma$ | $\approx 99.7\%$ |
| 4 | $\geq 93.75\%$ within $4\sigma$ | $\approx 99.994\%$ |
| 5 | $\geq 96\%$ within $5\sigma$ | $\approx 99.99994\%$ |

For AI: Chebyshev gives a distribution-free reading of batch normalisation. If activations have $\operatorname{Var}(X) = \sigma^2$, then at most a $1/k^2$ fraction of activations lies beyond $k\sigma$ of the mean - a quantitative handle on the benefit of scale normalisation.

Recall: The variance $\operatorname{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$ and its properties were developed fully in Section 04, Expectation and Moments. We use it here as a plug-in parameter.
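The $k$-sigma table is easy to regenerate. A small sketch (helper names are ours) comparing the distribution-free Chebyshev coverage guarantee with exact Gaussian coverage:

```python
import math

def chebyshev_coverage(k: float) -> float:
    """Guaranteed mass within k sigma of the mean, for any finite-variance law."""
    return 1.0 - 1.0 / k**2

def gaussian_coverage(k: float) -> float:
    """Exact P(|Z| <= k) for Z ~ N(0, 1)."""
    return math.erf(k / math.sqrt(2))

# Chebyshev's guarantee always sits below the Gaussian truth, as it must.
rows = [(k, chebyshev_coverage(k), gaussian_coverage(k)) for k in (2, 3, 4, 5)]
```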

2.3 One-Sided Chebyshev (Cantelli's Inequality)

Chebyshev bounds both tails symmetrically. Cantelli's inequality gives a tighter one-sided bound:

Theorem (Cantelli). For any $t > 0$:

$$P(X - \mu \geq t) \leq \frac{\sigma^2}{\sigma^2 + t^2}$$

Proof. For any $s > 0$, apply Markov to the non-negative variable $(X - \mu + s)^2$:

$$P(X - \mu \geq t) = P(X - \mu + s \geq t + s) \leq P\big((X - \mu + s)^2 \geq (t+s)^2\big) \leq \frac{\mathbb{E}[(X-\mu+s)^2]}{(t+s)^2} = \frac{\sigma^2 + s^2}{(t+s)^2}$$

Optimising over $s$: setting the derivative in $s$ to zero gives $s = \sigma^2/t$, which yields $\sigma^2/(\sigma^2 + t^2)$.

Comparison with Chebyshev. For large $t$ the two essentially agree: Chebyshev's $\sigma^2/t^2$ and Cantelli's $\sigma^2/(\sigma^2 + t^2) \approx \sigma^2/t^2$. The gain is at small and moderate $t$: Cantelli is strictly below $\sigma^2/t^2$ for every $t > 0$, and at $t = \sigma$ it still gives the non-trivial bound $1/2$ where Chebyshev's one-sided reading is vacuous.
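To see the one-sided gain concretely, a two-function sketch (names ours; $\sigma^2 = 1$):

```python
def chebyshev_one_tail(t: float, sigma2: float = 1.0) -> float:
    """Best one-sided statement extractable from Chebyshev (capped at 1)."""
    return min(1.0, sigma2 / t**2)

def cantelli(t: float, sigma2: float = 1.0) -> float:
    """One-sided Cantelli bound on P(X - mu >= t)."""
    return sigma2 / (sigma2 + t**2)

# Cantelli is strictly smaller everywhere, and still informative at t = sigma.
vals = [(t, chebyshev_one_tail(t), cantelli(t)) for t in (0.5, 1.0, 2.0, 5.0)]
```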

2.4 Limitations of Moment-Based Bounds

Markov and Chebyshev give only polynomial tails. The Gaussian has $P(X - \mu \geq 3\sigma) \approx 0.0013$, but Chebyshev gives only $1/9 \approx 0.111$ - an overestimate by a factor of roughly 85. For $t = 10\sigma$, Chebyshev gives $0.01$, while the true Gaussian tail probability is $\approx 10^{-23}$.

Why the gap? Chebyshev uses only the first two moments. A distribution may concentrate all of its deviation mass exactly at $\mu \pm t$ while matching the given $(\mu, \sigma^2)$ pair - that is the worst case Chebyshev must guard against. Real distributions with bounded support or thin tails concentrate far more.

When Chebyshev is the right tool:

  • Distribution-free bounds (don't know the family)
  • Heavy-tailed distributions (Pareto, $t$-distribution with low degrees of freedom)
  • Quick estimates without assuming sub-Gaussianity

3. Sub-Gaussian Random Variables

3.1 Definition and MGF Condition

Sub-Gaussian random variables are those whose tails decay at least as fast as a Gaussian. The key condition is on the MGF.

Definition. A mean-zero random variable $X$ is $\sigma^2$-sub-Gaussian if for all $t \in \mathbb{R}$:

$$\mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2}$$

The parameter $\sigma^2$ is the sub-Gaussian parameter (also called the variance proxy). Note: $\sigma^2$ need not equal $\operatorname{Var}(X)$, though $\operatorname{Var}(X) \leq \sigma^2$ always holds (expand both sides of the MGF condition to second order at $t = 0$).

Why this condition? Recall from Section 04 that the MGF of $\mathcal{N}(0, \sigma^2)$ is exactly $e^{\sigma^2 t^2/2}$. The sub-Gaussian condition says the MGF of $X$ is dominated by that of a Gaussian with variance $\sigma^2$. This immediately implies Gaussian-like tail bounds.

Equivalent characterisations (for mean-zero $X$, up to constant factors in the parameter):

  1. MGF condition: $\mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2}$ for all $t$
  2. Tail condition: $P(|X| \geq t) \leq 2e^{-t^2/(2\sigma^2)}$ for all $t \geq 0$
  3. Moment condition: $\mathbb{E}[X^{2k}] \leq (2k-1)!!\,\sigma^{2k}$ for all $k \geq 1$

3.2 Examples of Sub-Gaussian Variables

Gaussian. $X \sim \mathcal{N}(0, \sigma^2)$ is $\sigma^2$-sub-Gaussian. The MGF equals $e^{\sigma^2 t^2/2}$ exactly - the Gaussian is the equality case.

Bounded random variable. If $X \in [a, b]$ almost surely with $\mathbb{E}[X] = \mu$, then $X - \mu$ is $\frac{(b-a)^2}{4}$-sub-Gaussian. This is Hoeffding's lemma, proved in Section 4.

Rademacher variable. $\varepsilon \in \{-1, +1\}$ with equal probability. Then $\mathbb{E}[e^{t\varepsilon}] = \cosh(t) \leq e^{t^2/2}$, so $\varepsilon$ is $1$-sub-Gaussian. Rademacher variables reappear in Rademacher complexity (Section 10).

Centred Bernoulli. $X = \operatorname{Bern}(p) - p \in \{-p, 1-p\}$. Since $X$ lies in an interval of length $1$, it is $\frac{1}{4}$-sub-Gaussian by Hoeffding's lemma.

Non-example. Heavy-tailed distributions are not sub-Gaussian. A $t$-distribution with any finite degrees of freedom has $\mathbb{E}[e^{tX}] = \infty$ for every $t \neq 0$, and a Pareto distribution is not sub-Gaussian for any choice of parameters.
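The Rademacher claim $\cosh(t) \leq e^{t^2/2}$ can be spot-checked numerically on a grid (a throwaway sketch; the inequality itself follows by comparing the even Taylor coefficients $1/(2k)!$ against $1/(2^k k!)$):

```python
import math

ts = [x / 10 for x in range(-50, 51)]          # grid over [-5, 5]
# Gap between the 1-sub-Gaussian MGF bound and the Rademacher MGF cosh(t).
gaps = [math.exp(t**2 / 2) - math.cosh(t) for t in ts]
# Every gap is non-negative, with equality exactly at t = 0.
```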

3.3 The Sub-Gaussian Tail Bound

Theorem. If $X$ is $\sigma^2$-sub-Gaussian with $\mathbb{E}[X] = 0$, then for all $t \geq 0$:

$$P(X \geq t) \leq e^{-t^2/(2\sigma^2)}$$

Proof (Chernoff method). For any $s > 0$:

$$P(X \geq t) = P(e^{sX} \geq e^{st}) \leq \frac{\mathbb{E}[e^{sX}]}{e^{st}} \leq \frac{e^{\sigma^2 s^2/2}}{e^{st}} = e^{\sigma^2 s^2/2 - st}$$

Optimise over $s$: $\frac{\partial}{\partial s}(\sigma^2 s^2/2 - st) = 0 \Rightarrow s^* = t/\sigma^2$. Substituting:

$$P(X \geq t) \leq e^{\sigma^2 (t/\sigma^2)^2/2 - t \cdot t/\sigma^2} = e^{t^2/(2\sigma^2) - t^2/\sigma^2} = e^{-t^2/(2\sigma^2)}$$

This proof pattern - Markov + MGF bound + optimisation - is the Chernoff method, and it recurs throughout this section.

3.4 Closure Under Sums

Theorem. If $X_1, \ldots, X_n$ are independent and mean-zero, with each $X_i$ being $\sigma_i^2$-sub-Gaussian, then $S = \sum_i X_i$ is $\big(\sum_i \sigma_i^2\big)$-sub-Gaussian.

Proof. By independence:

$$\mathbb{E}[e^{tS}] = \prod_{i=1}^n \mathbb{E}[e^{tX_i}] \leq \prod_{i=1}^n e^{\sigma_i^2 t^2/2} = e^{(\sum_i \sigma_i^2)\, t^2/2}$$

Corollary. For $n$ i.i.d. $\sigma^2$-sub-Gaussian variables, $S$ is $n\sigma^2$-sub-Gaussian, so the sample mean $\bar{X}_n = S/n$ is $\sigma^2/n$-sub-Gaussian and satisfies $P(\bar{X}_n \geq t) \leq e^{-nt^2/(2\sigma^2)}$.

For AI: Mini-batch gradient estimation. If each sample's contribution to the gradient is $\sigma^2$-sub-Gaussian, then the mini-batch average over $m$ samples is $\sigma^2/m$-sub-Gaussian - the noise's variance proxy decreases as $1/m$.
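The $1/n$ scaling of the variance proxy can be watched directly. A small simulation sketch (helper name and parameters ours) with Rademacher summands, each 1-sub-Gaussian; quadrupling $n$ should roughly halve the typical deviation of the mean:

```python
import random

random.seed(1)

def dev_q95(n: int, trials: int = 2000) -> float:
    """Empirical 95th percentile of |mean of n Rademacher draws|."""
    devs = sorted(abs(sum(random.choice((-1, 1)) for _ in range(n))) / n
                  for _ in range(trials))
    return devs[int(0.95 * trials)]

d100, d400 = dev_q95(100), dev_q95(400)   # proxy std 0.1 vs 0.05
```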


4. Hoeffding's Inequality

4.1 Hoeffding's Lemma

Hoeffding's lemma is the core technical ingredient: it establishes that any bounded, mean-zero random variable is sub-Gaussian.

Lemma (Hoeffding, 1963). Let $X$ be a random variable with $\mathbb{E}[X] = 0$ and $X \in [a, b]$ almost surely. Then for all $t \in \mathbb{R}$:

$$\mathbb{E}[e^{tX}] \leq \exp\!\left(\frac{t^2(b-a)^2}{8}\right)$$

In other words, $X$ is $\frac{(b-a)^2}{4}$-sub-Gaussian.

Proof sketch. Since $e^{tx}$ is convex in $x$, on the bounded interval $[a, b]$ it lies below the chord from $(a, e^{ta})$ to $(b, e^{tb})$:

$$e^{tx} \leq \frac{b-x}{b-a}\, e^{ta} + \frac{x-a}{b-a}\, e^{tb}$$

Taking expectations (using $\mathbb{E}[X] = 0$, so $\mathbb{E}[(b-X)/(b-a)] = b/(b-a)$ and $\mathbb{E}[(X-a)/(b-a)] = -a/(b-a)$):

$$\mathbb{E}[e^{tX}] \leq \frac{b}{b-a}\, e^{ta} - \frac{a}{b-a}\, e^{tb} = e^{g(u)}$$

where $u = t(b-a)$ and $g(u) = -pu + \log(1 - p + pe^u)$ with $p = -a/(b-a)$. Expanding $g$ via Taylor series around $u = 0$ and bounding $g''(u) \leq 1/4$ (the maximum, attained at $p = 1/2$) gives $g(u) \leq u^2/8 = t^2(b-a)^2/8$.

Why $(b-a)^2/8$? The factor $1/8$ comes from the worst case $p = 1/2$ (mean centred in the interval). For a $\{0,1\}$ Bernoulli minus its mean, $b - a = 1$ and the sub-Gaussian parameter is $1/4$.
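Hoeffding's lemma can be spot-checked numerically for centred Bernoulli variables, the case used most below (a throwaway sketch; with $b - a = 1$ the claimed bound is $e^{t^2/8}$):

```python
import math

def centred_bern_mgf(t: float, p: float) -> float:
    """MGF of Bern(p) - p, a mean-zero variable supported on {-p, 1-p}."""
    return (1 - p) * math.exp(-t * p) + p * math.exp(t * (1 - p))

# Check MGF <= exp(t^2 / 8) over a grid of (p, t) pairs.
ok = all(centred_bern_mgf(t, p) <= math.exp(t**2 / 8) + 1e-12
         for p in (0.1, 0.3, 0.5, 0.9)
         for t in (x / 4 for x in range(-20, 21)))
```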

4.2 Hoeffding's Inequality

Theorem (Hoeffding, 1963). Let $X_1, \ldots, X_n$ be independent random variables with $\mathbb{E}[X_i] = \mu_i$ and $X_i \in [a_i, b_i]$ almost surely. Then for all $t > 0$:

$$P\!\left(\sum_{i=1}^n (X_i - \mu_i) \geq t\right) \leq \exp\!\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right)$$

Proof. Let $Y_i = X_i - \mu_i$ (mean-zero, $Y_i \in [a_i - \mu_i, b_i - \mu_i]$). By Hoeffding's lemma, each $Y_i$ is $\frac{(b_i - a_i)^2}{4}$-sub-Gaussian. By independence and closure under sums (Section 3.4), $\sum_i Y_i$ is $\frac{\sum_i (b_i-a_i)^2}{4}$-sub-Gaussian. Applying the sub-Gaussian tail bound:

$$P\!\left(\sum_i Y_i \geq t\right) \leq \exp\!\left(-\frac{t^2}{2 \cdot \frac{\sum_i (b_i-a_i)^2}{4}}\right) = \exp\!\left(-\frac{2t^2}{\sum_i (b_i-a_i)^2}\right)$$

4.3 Two-Sided Hoeffding and Sample Complexity

For i.i.d. $X_i \in [a, b]$ with mean $\mu$, the two-sided version follows by applying the one-sided bound to both $X_i$ and $-X_i$ and taking a union bound over the two tails:

$$P(|\bar{X}_n - \mu| \geq t) \leq 2\exp\!\left(-\frac{2nt^2}{(b-a)^2}\right)$$

Reading the bound: The probability of a large deviation decays exponentially in $nt^2$. Doubling the sample size squares the exponential factor; halving the allowed deviation quadruples the required $n$.

4.4 Required Sample Size

One of the most practically useful consequences: how large must $n$ be to guarantee $|\bar{X}_n - \mu| \leq \varepsilon$ with probability at least $1 - \delta$?

Setting $2\exp(-2n\varepsilon^2/(b-a)^2) \leq \delta$ and solving for $n$:

$$n \geq \frac{(b-a)^2 \log(2/\delta)}{2\varepsilon^2}$$

Example. Coin flips ($X_i \in \{0,1\}$, $b - a = 1$): to achieve accuracy $\varepsilon = 0.01$ with confidence $\delta = 0.05$:

$$n \geq \frac{\log(2/0.05)}{2 \times (0.01)^2} = \frac{\log 40}{0.0002} \approx 18{,}445$$

So roughly 18,500 coin flips are needed to estimate a probability to within $\pm 0.01$ with 95% confidence.

For AI: Benchmark evaluation. Measuring LLM accuracy to within $\pm 1\%$ at 95% confidence on a binary task ($b-a=1$) is exactly the calculation above: Hoeffding requires $n \geq 18{,}445$ examples. This explains why serious LLM evaluations (MMLU, HumanEval, BIG-bench) use thousands of questions.
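The sample-size formula is worth having as a one-liner. A sketch (the helper name is ours) returning the smallest integer $n$ satisfying the bound:

```python
import math

def hoeffding_sample_size(eps: float, delta: float, value_range: float = 1.0) -> int:
    """Smallest n with 2 exp(-2 n eps^2 / range^2) <= delta."""
    return math.ceil(value_range**2 * math.log(2 / delta) / (2 * eps**2))

n_coin = hoeffding_sample_size(eps=0.01, delta=0.05)   # the coin-flip example
n_easy = hoeffding_sample_size(eps=0.05, delta=0.05)   # 5% accuracy is ~25x cheaper
```

The quadratic dependence on $1/\varepsilon$ dominates: relaxing the accuracy from 1% to 5% cuts the budget by a factor of 25.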


5. Chernoff Bounds

5.1 The Chernoff Method

The Chernoff method is a general technique for deriving exponential tail bounds using the MGF. It is more powerful than Hoeffding when the MGF has a known closed form.

The Chernoff method. For any random variable $X$ and any threshold $a$:

$$P(X \geq a) \leq \inf_{s > 0} e^{-sa}\, \mathbb{E}[e^{sX}] = \inf_{s > 0} e^{-sa} M_X(s)$$

Proof. For any $s > 0$: $P(X \geq a) = P(e^{sX} \geq e^{sa}) \leq e^{-sa}\, \mathbb{E}[e^{sX}]$ by Markov. Take the infimum over $s$.

The infimum over $s$ gives the tightest bound of this form. The optimal $s^*$ satisfies $M_X'(s^*)/M_X(s^*) = a$, i.e. the mean of $X$ under the exponentially tilted distribution equals $a$.

Recall: $M_X(t) = \mathbb{E}[e^{tX}]$ was developed in Section 04, Expectation and Moments. The key property used here is $M_{X+Y}(t) = M_X(t)\, M_Y(t)$ for independent $X, Y$.

5.2 Chernoff for Bernoulli Sums

Let $X_1, \ldots, X_n \sim \operatorname{Bern}(p)$ i.i.d. and $S = \sum_{i=1}^n X_i \sim \operatorname{Bin}(n, p)$ with mean $\mu = np$.

The MGF of $S$ is $M_S(t) = (1 - p + pe^t)^n$.

Upper Chernoff bound. For $\delta > 0$:

$$P(S \geq (1+\delta)\mu) \leq \left(\frac{e^\delta}{(1+\delta)^{1+\delta}}\right)^\mu$$

Derivation. Apply the Chernoff method with $a = (1+\delta)\mu$:

$$P(S \geq (1+\delta)\mu) \leq \inf_{s>0} e^{-s(1+\delta)\mu}\, M_S(s)$$

Substituting $M_S(s) = (1-p+pe^s)^n \leq e^{\mu(e^s-1)}$ (using $1 + x \leq e^x$ with $x = p(e^s - 1)$):

$$P(S \geq (1+\delta)\mu) \leq \inf_{s>0} \exp\big(\mu(e^s - 1) - s(1+\delta)\mu\big) = \inf_{s>0} \exp\big(\mu\,(e^s - 1 - s(1+\delta))\big)$$

Optimising: the exponent is minimised at $s^* = \log(1+\delta)$, where $e^{s^*} - 1 = \delta$, giving $\exp\big(\mu(\delta - (1+\delta)\log(1+\delta))\big)$ - exactly the stated bound.

5.3 Multiplicative Chernoff Form

For $\delta \in (0, 1]$, a simpler but slightly looser form:

$$P(S \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3}$$

This follows from the elementary inequality $\delta - (1+\delta)\log(1+\delta) \leq -\delta^2/3$ for $\delta \in (0, 1]$.

Comparison with Hoeffding. Hoeffding applied to $\operatorname{Bin}(n,p)$ (range $b-a=1$) gives:

$$P(S \geq (1+\delta)\mu) = P\big(\bar{X}_n \geq p(1+\delta)\big) \leq \exp\big(-2n(p\delta)^2\big)$$

Chernoff gives $\exp(-\mu\delta^2/3) = \exp(-np\,\delta^2/3)$. For small $p$ (rare events), Hoeffding's exponent scales as $np^2$ while Chernoff's scales as $np = \mu$ - one factor of $p$ instead of two - which makes Chernoff far tighter.
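The rare-event gap is dramatic in numbers. A sketch (parameter choices ours) comparing the two bounds for a 50% overshoot of a count with $p = 10^{-3}$ and $n = 10^6$, so $\mu = 1000$:

```python
import math

def hoeffding_overshoot(n: int, p: float, delta: float) -> float:
    """Hoeffding bound on P(S >= (1 + delta) n p) for a Bin(n, p) count."""
    return math.exp(-2 * n * (p * delta) ** 2)

def chernoff_overshoot(n: int, p: float, delta: float) -> float:
    """Multiplicative Chernoff bound (e^d / (1+d)^(1+d))^mu with mu = n p."""
    mu = n * p
    return (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mu

h = hoeffding_overshoot(10**6, 1e-3, 0.5)   # nearly vacuous: exp(-0.5)
c = chernoff_overshoot(10**6, 1e-3, 0.5)    # astronomically small
```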

5.4 Lower Tail Chernoff

For the lower tail, for $\delta \in (0, 1)$:

$$P(S \leq (1-\delta)\mu) \leq e^{-\mu\delta^2/2}$$

Note the factor $1/2$ vs $1/3$ in the exponent - the lower-tail bound is slightly tighter.

Application: balls into bins. When $n$ items are assigned to $m$ bins uniformly at random, the expected load per bin is $\mu = n/m$. Chernoff bounds (combined with a union bound over bins) control the maximum load - crucial for hashing and load-balancing proofs in distributed systems.
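A quick balls-into-bins simulation (parameters and seed ours) shows the concentration: with $\mu = 100$ expected items per bin, even the most loaded of 100 bins stays within a modest factor of $\mu$, as the Chernoff-plus-union-bound argument predicts:

```python
import random

random.seed(2)
n_items, m_bins = 10_000, 100          # expected load mu = 100 per bin
loads = [0] * m_bins
for _ in range(n_items):
    loads[random.randrange(m_bins)] += 1

# Chernoff: P(a given bin >= 1.5 mu) <= exp(-mu * 0.25 / 3) ~ 2e-4;
# the union bound over 100 bins leaves overload probability around 2%.
max_load = max(loads)
```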


6. Bernstein's Inequality

6.1 Sub-Exponential Random Variables

Some random variables have heavier tails than sub-Gaussian but still lighter than arbitrary - these are sub-exponential.

Definition. A mean-zero random variable $X$ is sub-exponential with parameters $(\nu^2, b)$ if:

$$\mathbb{E}[e^{tX}] \leq e^{\nu^2 t^2/2} \quad \text{for all } |t| \leq 1/b$$

Sub-exponential tails decay as $e^{-t/b}$ for large $t$ - exponentially, which is slower than the Gaussian rate $e^{-t^2/(2\sigma^2)}$.

Examples:

  • $\operatorname{Exp}(1)$ (centred): sub-exponential; its MGF blows up at $t = 1$, so no sub-Gaussian bound is possible
  • $\chi^2_k$: sub-exponential (a sum of squared Gaussians)
  • Any bounded variable is also sub-exponential (sub-Gaussian implies sub-exponential)

6.2 Bernstein's Inequality

Theorem (Bernstein, 1924/1946). Let $X_1, \ldots, X_n$ be independent mean-zero variables with $|X_i| \leq c$ and $\sum_{i=1}^n \mathbb{E}[X_i^2] = \nu^2$. Then for all $t > 0$:

$$P\!\left(\sum_{i=1}^n X_i \geq t\right) \leq \exp\!\left(-\frac{t^2/2}{\nu^2 + ct/3}\right)$$

Intuition. The bound interpolates between two regimes:

  • Small $t$ ($t \ll \nu^2/c$): the denominator is $\approx \nu^2$, giving $e^{-t^2/(2\nu^2)}$ - sub-Gaussian behaviour with variance $\nu^2$
  • Large $t$ ($t \gg \nu^2/c$): the denominator is $\approx ct/3$, giving $e^{-3t/(2c)}$ - an exponential tail

6.3 Bernstein vs Hoeffding

For $n$ i.i.d. mean-zero variables with $|X_i| \leq c$ and per-sample second moment $s^2 = \mathbb{E}[X_i^2]$ (so $\nu^2 = ns^2$), write both bounds for the sample-mean deviation $P(\bar{X}_n \geq t)$:

| Bound | Formula | Better when |
| --- | --- | --- |
| Hoeffding | $\exp\!\big(-2nt^2/(2c)^2\big) = \exp\!\big(-nt^2/(2c^2)\big)$ | no variance information |
| Bernstein | $\exp\!\big(-\frac{nt^2/2}{s^2 + ct/3}\big)$ | $s^2 \ll c^2$ (small variance) |

When Bernstein wins. If $s^2 \ll c^2$ (small variance despite a large range), Bernstein gives $\approx \exp(-nt^2/(2s^2))$ for small $t$, while Hoeffding gives $\exp(-nt^2/(2c^2))$. Since $s^2 \ll c^2$, Bernstein is exponentially tighter.

Example. Suppose $X_i \in [-1, 1]$ but $\operatorname{Var}(X_i) = 0.01$. Hoeffding is stuck with the worst-case proxy $((b-a)/2)^2 = 1$, while Bernstein uses the true variance $0.01$ - a 100x improvement in the small-$t$ exponent.
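Plugging concrete numbers into both sample-mean bounds (helper names and the choice $n = 10^4$, $t = 0.02$ are ours) makes the gap vivid:

```python
import math

def hoeffding_mean_tail(n: int, t: float, c: float = 1.0) -> float:
    """Hoeffding for P(mean >= t), X_i in [-c, c]: exp(-n t^2 / (2 c^2))."""
    return math.exp(-n * t**2 / (2 * c**2))

def bernstein_mean_tail(n: int, t: float, var: float, c: float = 1.0) -> float:
    """Bernstein for P(mean >= t): exp(-(n t^2 / 2) / (var + c t / 3))."""
    return math.exp(-(n * t**2 / 2) / (var + c * t / 3))

n, t = 10_000, 0.02
h = hoeffding_mean_tail(n, t)              # exp(-2): not even below 0.13
b = bernstein_mean_tail(n, t, var=0.01)    # exp(-120): effectively zero
```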

6.4 Application to Gradient Noise

In SGD, the stochastic gradient $\hat{g}_t = \frac{1}{m}\sum_{i \in B_t} \nabla \ell(w_t; x_i)$ estimates the true gradient $g_t = \nabla \mathcal{L}(w_t)$.

If each per-sample gradient is bounded, $\|\nabla\ell(w; x)\|_2 \leq G$, and the per-sample gradient variance is $\sigma_g^2$, a coordinate-wise Bernstein bound combined with a union bound gives (as a sketch):

$$P\big(\|\hat{g}_t - g_t\|_2 \geq t\big) \leq d \cdot \exp\!\left(-\frac{mt^2/2}{d\sigma_g^2 + Gt/3}\right)$$

where the factor $d$ comes from the union bound over coordinates. This shows why variance-reduction methods (SVRG, SAGA) that shrink $\sigma_g^2$ improve convergence more dramatically than shrinking $G$ alone.


7. McDiarmid's Inequality

7.1 Bounded Differences Condition

McDiarmid's inequality generalises Hoeffding from sums to general functions.

Definition (Bounded Differences). A function $f: \mathcal{X}^n \to \mathbb{R}$ satisfies the bounded differences condition with constants $c_1, \ldots, c_n > 0$ if for all $i \in [n]$ and all $x_1, \ldots, x_n, x_i' \in \mathcal{X}$:

$$|f(x_1, \ldots, x_i, \ldots, x_n) - f(x_1, \ldots, x_i', \ldots, x_n)| \leq c_i$$

Intuitively: changing the $i$-th coordinate, by any amount, changes $f$ by at most $c_i$.

7.2 McDiarmid's Inequality

Theorem (McDiarmid, 1989). Let $X_1, \ldots, X_n$ be independent random variables and let $f$ satisfy the bounded differences condition with constants $c_1, \ldots, c_n$. Then for all $t > 0$:

$$P\big(f(X_1, \ldots, X_n) - \mathbb{E}[f] \geq t\big) \leq \exp\!\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right)$$

Proof sketch (martingale method). Define the Doob martingale $Z_k = \mathbb{E}[f \mid X_1, \ldots, X_k]$, with $Z_0 = \mathbb{E}[f]$ and $Z_n = f$. By the bounded differences assumption, each increment $D_k = Z_k - Z_{k-1}$ lies in an interval of length $c_k$. Applying Azuma's inequality (concentration for bounded martingale differences) yields the result.

7.3 Relationship to Hoeffding

Hoeffding's inequality is a special case of McDiarmid's. For $f(X_1, \ldots, X_n) = \frac{1}{n}\sum_{i=1}^n X_i$ with $X_i \in [a_i, b_i]$, changing $X_i$ changes $f$ by at most $(b_i - a_i)/n$, so $c_i = (b_i - a_i)/n$. Then:

$$\sum_i c_i^2 = \frac{\sum_i (b_i - a_i)^2}{n^2}$$

Substituting into McDiarmid recovers exactly Hoeffding's bound for the sample mean.

7.4 Applications to ML Stability

Empirical risk stability. Let $\mathcal{D} = \{z_1, \ldots, z_n\}$ be a training set and $\hat{R}(\mathcal{D}) = \frac{1}{n}\sum_{i=1}^n \ell(h, z_i)$ the empirical risk, with $\ell \in [0, 1]$. Swapping one training example changes $\hat{R}$ by at most $1/n$. McDiarmid gives:

$$P\big(|\hat{R} - \mathbb{E}[\hat{R}]| \geq t\big) \leq 2\exp(-2nt^2)$$

Dropout. With dropout probability $p$, each neuron's activation $a_i$ is multiplied by $\mathbf{1}[\text{kept}]/(1-p)$. Toggling whether the $i$-th neuron is kept changes the layer output by at most $c_i = |w_i| \cdot a_i^{\max}/(1-p)$, so McDiarmid bounds the deviation of the output from its expectation under dropout noise.

Data augmentation. If changing the augmentation applied to a single training example changes the training loss by at most $c$, McDiarmid bounds the sensitivity of the trained model's loss to the augmentation choices.
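The $1/n$ bounded-differences constant for the empirical risk can be verified mechanically. A sketch with synthetic losses (the brute-force check subsamples indices to stay fast; all names are ours):

```python
import random

random.seed(3)
n = 1000
losses = [random.random() for _ in range(n)]        # per-example losses in [0, 1]
total = sum(losses)
risk = total / n

# Replace example i's loss by the extremes 0 and 1 and record the change in risk;
# the bounded-differences constant c_i = 1/n caps every such change.
worst_change = max(abs((total - losses[i] + new) / n - risk)
                   for i in range(0, n, 50) for new in (0.0, 1.0))
```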


8. The Union Bound and Covering Arguments

8.1 Union Bound

Lemma (Boole's Inequality / Union Bound). For any events $A_1, \ldots, A_k$ (not necessarily independent):

$$P\!\left(\bigcup_{i=1}^k A_i\right) \leq \sum_{i=1}^k P(A_i)$$

Proof. By induction from the two-event case: $P(A \cup B) = P(A) + P(B) - P(A \cap B) \leq P(A) + P(B)$.

The union bound is exact when the events are disjoint and pessimistic otherwise. Despite its looseness, it is indispensable because it requires no assumptions about the dependence between events.

For AI: The union bound appears in virtually every multi-class or multi-hypothesis analysis. PAC theory for a finite hypothesis class $\mathcal{H}$ asks for the probability that any $h \in \mathcal{H}$ is bad - a union bound over all $|\mathcal{H}|$ hypotheses.

8.2 Multiple Hypothesis Testing

Consider $m$ statistical tests, each run at individual significance level $\alpha$. If we want the probability that at least one false positive occurs to be at most $\delta$, the Bonferroni correction sets each test's level to $\delta/m$.

In ML: When training $m$ models and reporting the best, the union bound says the reported result can be a false positive with probability up to $m\alpha$. This is the multiple-comparisons problem underlying claims like "our method beats 10 baselines on 5 datasets."

Balancing union bound and concentration. For $k$ bad events $A_i$, each with $P(A_i) \leq \varepsilon$:

$$P(\exists i: A_i) \leq k\varepsilon$$

To make this $\leq \delta$, we need each $P(A_i) \leq \delta/k$. In PAC theory, this costs an extra $\log k$ factor in the required sample size.
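The Bonferroni arithmetic and the $\log k$ cost in one small sketch (helper names ours):

```python
import math

def bonferroni_level(delta: float, m: int) -> float:
    """Per-test significance level guaranteeing family-wise error <= delta."""
    return delta / m

def extra_samples_factor(k: int, delta: float) -> float:
    """Ratio log(k/delta) / log(1/delta): the union bound's sample-size cost
    when concentration gives failure probability exp(-(const) * n)."""
    return math.log(k / delta) / math.log(1 / delta)

alpha = bonferroni_level(0.05, 20)          # 0.0025 per test for 20 comparisons
cost = extra_samples_factor(10**6, 0.05)    # ~5.6x more samples for 10^6 events
```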

8.3 Covering Numbers and $\varepsilon$-Nets

The union bound can only be applied to finite collections of events. For continuous hypothesis spaces (e.g., all linear classifiers, all neural networks), we need to discretise first.

Definition. An $\varepsilon$-net of a set $\mathcal{H}$ under metric $d$ is a finite set $\mathcal{N}_\varepsilon \subseteq \mathcal{H}$ such that for every $h \in \mathcal{H}$ there exists $\hat{h} \in \mathcal{N}_\varepsilon$ with $d(h, \hat{h}) \leq \varepsilon$.

The covering number $\mathcal{N}(\mathcal{H}, d, \varepsilon)$ is the size of the smallest $\varepsilon$-net.

Covering number of a ball. The $\ell_2$-ball $\{\mathbf{w}: \|\mathbf{w}\|_2 \leq R\}$ in $\mathbb{R}^d$ has covering number:

$$\mathcal{N}(\mathcal{B}_R, \ell_2, \varepsilon) \leq \left(\frac{2R}{\varepsilon} + 1\right)^d \leq \left(\frac{3R}{\varepsilon}\right)^d \quad (\varepsilon \leq R)$$

This grows polynomially in $1/\varepsilon$ but exponentially in $d$ - the curse of dimensionality for covering.
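In log form the trade-off is easy to tabulate (a sketch; the helper name is ours):

```python
import math

def log_covering_bound(d: int, R: float, eps: float) -> float:
    """Upper bound d * log(3 R / eps) on log N(B_R, l2, eps), valid for eps <= R."""
    return d * math.log(3 * R / eps)

# Linear growth in dimension, only logarithmic in the resolution 1/eps:
by_dim = [log_covering_bound(d, R=1.0, eps=0.1) for d in (10, 100, 1000)]
by_eps = [log_covering_bound(10, R=1.0, eps=e) for e in (0.1, 0.01, 0.001)]
```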

8.4 The Net Argument

The standard approach for continuous hypothesis classes:

  1. Build an ε\varepsilon-net Nε\mathcal{N}_\varepsilon with Nε=N(H,d,ε)|\mathcal{N}_\varepsilon| = \mathcal{N}(\mathcal{H}, d, \varepsilon)
  2. Apply concentration + union bound over the net: control suphNεR^(h)R(h)\sup_{h \in \mathcal{N}_\varepsilon} |\hat{R}(h) - R(h)|
  3. Use Lipschitz continuity to extend from net to all of H\mathcal{H}

The final bound pays logN(H,d,ε)\log \mathcal{N}(\mathcal{H}, d, \varepsilon) extra compared to a single hypothesis, plus the approximation error ε\varepsilon from the net.

Johnson-Lindenstrauss Lemma. As an application of covering, if m=O(log(n)/ε2)m = O(\log(n)/\varepsilon^2) random projections are used, nn points in Rd\mathbb{R}^d embed into Rm\mathbb{R}^m with (1±ε)(1\pm\varepsilon) distortion of all pairwise distances, with high probability. This is proved using the union bound over all (n2)\binom{n}{2} pairs.
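A numerical check of the JL statement (a sketch; the constant 24 in the choice of m is deliberately conservative, not the optimal JL constant):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, d, eps = 50, 1000, 0.3
m = int(np.ceil(24 * np.log(n) / eps**2))  # conservative JL target dimension

X = rng.normal(size=(n, d))
P = rng.normal(size=(d, m)) / np.sqrt(m)   # random Gaussian projection R^d -> R^m
Y = X @ P

# Compare all pairwise squared distances before and after projection.
ratios = [np.sum((Y[i] - Y[j])**2) / np.sum((X[i] - X[j])**2)
          for i, j in combinations(range(n), 2)]
print(m, min(ratios), max(ratios))  # all ratios within (1 - eps, 1 + eps)
```

Each ratio is a scaled chi-squared variable with m degrees of freedom, so its fluctuations are of order √(2/m); the union bound over all pairs is what forces m to grow with log n.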


9. PAC Learning and Generalisation Bounds

9.1 The PAC Framework

The Probably Approximately Correct (PAC) framework, introduced by Valiant (1984), provides a formal model for machine learning under uncertainty.

Setup: A learner observes nn i.i.d. examples {(x(i),y(i))}\{(x^{(i)}, y^{(i)})\} from an unknown distribution D\mathcal{D} over X×Y\mathcal{X} \times \mathcal{Y}. The learner selects a hypothesis hh from a hypothesis class H\mathcal{H}.

True risk: R(h)=P(x,y)D(h(x)y)=E[1[h(x)y]]R(h) = P_{(x,y) \sim \mathcal{D}}(h(x) \neq y) = \mathbb{E}[\mathbf{1}[h(x) \neq y]]

Empirical risk: R^(h)=1ni=1n1[h(x(i))y(i)]\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[h(x^{(i)}) \neq y^{(i)}]

PAC guarantee: A learning algorithm A\mathcal{A} PAC-learns H\mathcal{H} if for any ε,δ>0\varepsilon, \delta > 0 and any distribution D\mathcal{D}, given a training sample SS of nn0(ε,δ)n \geq n_0(\varepsilon, \delta) i.i.d. examples, with probability at least 1δ1-\delta over the draw of SS:

R(A(S))minhHR(h)+εR(\mathcal{A}(S)) \leq \min_{h \in \mathcal{H}} R(h) + \varepsilon

The sample complexity n0(ε,δ)n_0(\varepsilon, \delta) is the central quantity: how many examples are needed?

9.2 Finite Hypothesis Class

Theorem (realisable case). Let H\mathcal{H} be a finite hypothesis class containing some hh^* with R(h)=0R(h^*) = 0. If ERM returns any h^\hat{h} with R^(h^)=0\hat{R}(\hat{h}) = 0 and

nlog(H/δ)εn \geq \frac{\log(|\mathcal{H}|/\delta)}{\varepsilon}

then with probability at least 1δ1-\delta, the selected hypothesis satisfies R(h^)εR(\hat{h}) \leq \varepsilon. (The agnostic case requires the slower 1/ε21/\varepsilon^2 rate; see Section 9.3.)

Proof. Consider the event "bad hh": Bh={R(h)>ε but R^(h)=0}B_h = \{R(h) > \varepsilon \text{ but } \hat{R}(h) = 0\}.

For a fixed hh with R(h)>εR(h) > \varepsilon, each i.i.d. example avoids the error set of hh independently, so P(Bh)=(1R(h))n(1ε)nenεP(B_h) = (1-R(h))^n \leq (1-\varepsilon)^n \leq e^{-n\varepsilon} (using 1xex1 - x \leq e^{-x}).

By union bound: P(hH:Bh)HenεP(\exists h \in \mathcal{H}: B_h) \leq |\mathcal{H}| \cdot e^{-n\varepsilon}.

Setting this δ\leq \delta and solving: nlog(H/δ)εn \geq \frac{\log(|\mathcal{H}|/\delta)}{\varepsilon}. In the agnostic (non-realisable) case, Hoeffding replaces the exact computation above and the rate degrades to 1/ε21/\varepsilon^2 (Section 9.3).
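The realisable and agnostic sample complexities, side by side, as plain arithmetic (the function names are ours, for illustration):

```python
import math

def n_realisable(M, eps, delta):
    # n >= log(M/delta) / eps: a consistent hypothesis has true risk <= eps w.p. 1 - delta
    return math.ceil(math.log(M / delta) / eps)

def n_agnostic(M, eps, delta):
    # n >= log(2M/delta) / (2 eps^2): uniform convergence to within eps (Section 9.3)
    return math.ceil(math.log(2 * M / delta) / (2 * eps**2))

print(n_realisable(1000, 0.05, 0.05))  # 199
print(n_agnostic(1000, 0.05, 0.05))    # 2120
```

The realisable setting is dramatically cheaper: the 1/ε rate versus 1/ε² is an order of magnitude at ε = 0.05.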

9.3 Uniform Convergence

Theorem (Uniform Convergence). For finite H\mathcal{H} and losses in [0,1][0, 1], for nlog(2H/δ)2ε2n \geq \frac{\log(2|\mathcal{H}|/\delta)}{2\varepsilon^2}:

P ⁣(suphHR^(h)R(h)>ε)δP\!\left(\sup_{h \in \mathcal{H}} |\hat{R}(h) - R(h)| > \varepsilon\right) \leq \delta

Proof. Fix any hHh \in \mathcal{H}. By Hoeffding: P(R^(h)R(h)>ε)2e2nε2P(|\hat{R}(h) - R(h)| > \varepsilon) \leq 2e^{-2n\varepsilon^2}.

By union bound over all H|\mathcal{H}| hypotheses: P(suphR^(h)R(h)>ε)2He2nε2δP(\sup_h |\hat{R}(h) - R(h)| > \varepsilon) \leq 2|\mathcal{H}| e^{-2n\varepsilon^2} \leq \delta.

Consequence: If uniform convergence holds, then for ERM h^=argminhR^(h)\hat{h} = \arg\min_h \hat{R}(h):

R(h^)R^(h^)+εR^(h)+εR(h)+2εR(\hat{h}) \leq \hat{R}(\hat{h}) + \varepsilon \leq \hat{R}(h^*) + \varepsilon \leq R(h^*) + 2\varepsilon

The ERM generalises to within 2ε2\varepsilon of the best hypothesis.

9.4 VC Dimension

For infinite hypothesis classes, we need a different complexity measure.

Definition. A hypothesis class H\mathcal{H} shatters a set C={x1,,xk}C = \{x_1, \ldots, x_k\} if for every labelling y{0,1}ky \in \{0,1\}^k, there exists hHh \in \mathcal{H} with h(xi)=yih(x_i) = y_i for all ii.

VC dimension dVC(H)d_{VC}(\mathcal{H}) is the size of the largest set shattered by H\mathcal{H}.

Examples:

  • Threshold classifiers on R\mathbb{R}: dVC=1d_{VC} = 1 (can shatter any single point but not two)
  • Linear classifiers (halfspaces) in Rd\mathbb{R}^d: dVC=d+1d_{VC} = d + 1
  • Degree-kk polynomial classifiers in Rd\mathbb{R}^d: dVC=(d+kk)d_{VC} = \binom{d+k}{k}
  • Circles in R2\mathbb{R}^2: dVC=3d_{VC} = 3
  • Neural networks with WW weights: O(WlogW)O(W \log W) for piecewise-linear activations (see Appendix D.3)

Sauer-Shelah Lemma. The growth function ΠH(n)=maxx1,,xn{(h(x1),,h(xn)):hH}\Pi_\mathcal{H}(n) = \max_{x_1,\ldots,x_n} |\{(h(x_1),\ldots,h(x_n)): h \in \mathcal{H}\}| satisfies:

ΠH(n)i=0d(ni)(end)d\Pi_\mathcal{H}(n) \leq \sum_{i=0}^{d} \binom{n}{i} \leq \left(\frac{en}{d}\right)^d

for nd=dVC(H)n \geq d = d_{VC}(\mathcal{H}). This says: an nn-point set can be labelled in at most (en/d)d(en/d)^d ways by H\mathcal{H}, even if H=|\mathcal{H}| = \infty.
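The transition from exponential to polynomial growth is visible numerically; a small sketch:

```python
import math

def sauer_bound(n, d):
    # Sauer-Shelah: Pi_H(n) <= sum_{i=0}^{d} C(n, i)
    return sum(math.comb(n, i) for i in range(d + 1))

d = 5
assert sauer_bound(d, d) == 2**d  # at or below the VC dimension: full shattering
for n in (10, 100, 1000):
    # polynomial in n, vastly below 2^n once n >> d
    print(n, sauer_bound(n, d), (math.e * n / d)**d)
```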

9.5 Generalisation Bound with VC Dimension

Theorem (VC Generalisation Bound). For any δ>0\delta > 0, with probability at least 1δ1-\delta over nn i.i.d. training examples, every hHh \in \mathcal{H} satisfies:

R(h)R^(h)+d(log(2n/d)+1)+log(4/δ)2nR(h) \leq \hat{R}(h) + \sqrt{\frac{d(\log(2n/d) + 1) + \log(4/\delta)}{2n}}

where d=dVC(H)d = d_{VC}(\mathcal{H}).

Sample complexity. Setting the right-hand side excess to ε\varepsilon:

n=O ⁣(dlog(1/ε)+log(1/δ)ε2)n = O\!\left(\frac{d \log(1/\varepsilon) + \log(1/\delta)}{\varepsilon^2}\right), which refined (chaining) analyses improve to O ⁣(d+log(1/δ)ε2)O\!\left(\frac{d + \log(1/\delta)}{\varepsilon^2}\right)

The log(H)\log(|\mathcal{H}|) from the finite case is replaced by dVCd_{VC}. For a dd-dimensional linear classifier (dVC=d+1d_{VC} = d+1), need n=O(d/ε2)n = O(d/\varepsilon^2) - linear in dimension, not exponential.
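Plugging concrete numbers into the bound above (the helper name and the choices of n, d are ours, for illustration):

```python
import math

def vc_gap(n, d, delta):
    # the square-root term in the VC generalisation bound of Section 9.5
    return math.sqrt((d * (math.log(2 * n / d) + 1) + math.log(4 / delta)) / (2 * n))

print(round(vc_gap(10_000, 100, 0.05), 3))     # 0.178: a meaningful guarantee
print(round(vc_gap(10_000, 10_000, 0.05), 3))  # 0.92: close to vacuous when d ~ n
```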

For AI (2026): Modern transformers have billions of parameters, implying enormous VC dimension. Yet they generalise. This "paradox" is partly resolved by:

  1. Implicit regularisation: gradient descent finds minimum-norm solutions
  2. Overparameterisation: interpolation threshold, double descent (covered in Section04)
  3. PAC-Bayes: data-dependent bounds that are tighter for specific algorithms

10. Rademacher Complexity

10.1 Definition

Rademacher complexity measures the richness of a function class on specific data, making it data-dependent and often tighter than VC bounds.

Definition. Given a sample S=(x(1),,x(n))S = (x^{(1)}, \ldots, x^{(n)}) and a function class F\mathcal{F}, the empirical Rademacher complexity is:

R^S(F)=Eσ ⁣[supfF1ni=1nσif(x(i))]\hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\sigma}}\!\left[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \sigma_i f(x^{(i)})\right]

where σ=(σ1,,σn)\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_n) with σiUniform({1,+1})\sigma_i \sim \operatorname{Uniform}(\{-1, +1\}) i.i.d. (Rademacher variables).

The Rademacher complexity is Rn(F)=ES[R^S(F)]\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_S[\hat{\mathfrak{R}}_S(\mathcal{F})].

Interpretation. R^S(F)\hat{\mathfrak{R}}_S(\mathcal{F}) measures how well F\mathcal{F} can fit random ±1\pm 1 labels on the training points. If the class is rich enough to fit any labelling, R^S1\hat{\mathfrak{R}}_S \approx 1. If it can only fit structured labels, R^S0\hat{\mathfrak{R}}_S \approx 0.

10.2 Rademacher Generalisation Bound

Theorem. For any function class F\mathcal{F} with values in [0,1][0, 1], with probability at least 1δ1-\delta:

R(f)R^(f)+2Rn(F)+log(1/δ)2nR(f) \leq \hat{R}(f) + 2\mathfrak{R}_n(\mathcal{F}) + \sqrt{\frac{\log(1/\delta)}{2n}}

for all fFf \in \mathcal{F} simultaneously.

Comparison to VC. The VC bound has d/n\sqrt{d/n}; the Rademacher bound has Rn(F)\mathfrak{R}_n(\mathcal{F}). For specific data distributions where the function class is "effectively smaller," Rademacher can be significantly tighter.

10.3 Rademacher of Linear Classifiers

For linear classifiers F={xw,x:w2B}\mathcal{F} = \{\mathbf{x} \mapsto \langle \mathbf{w}, \mathbf{x} \rangle : \|\mathbf{w}\|_2 \leq B\} on data with x(i)2C\|\mathbf{x}^{(i)}\|_2 \leq C, the supremum has the closed form supw2Bw,v=Bv2\sup_{\|\mathbf{w}\|_2 \leq B} \langle \mathbf{w}, \mathbf{v} \rangle = B\|\mathbf{v}\|_2 (Cauchy-Schwarz), so:

R^S(F)=BnEσ[i=1nσix(i)2]Bni=1nx(i)22BCn\hat{\mathfrak{R}}_S(\mathcal{F}) = \frac{B}{n}\mathbb{E}_\sigma\!\left[\left\|\sum_{i=1}^n \sigma_i \mathbf{x}^{(i)}\right\|_2\right] \leq \frac{B}{n}\sqrt{\sum_{i=1}^n \|\mathbf{x}^{(i)}\|_2^2} \leq \frac{BC}{\sqrt{n}}

where the middle step uses Jensen's inequality together with E[σiσj]=1[i=j]\mathbb{E}[\sigma_i \sigma_j] = \mathbf{1}[i = j].

This is O(1/n)O(1/\sqrt{n}), independent of the ambient dimension dd! This explains why linear classifiers generalise well even in high dimensions, as long as the norm w2\|\mathbf{w}\|_2 is controlled.
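Since the supremum has a closed form, the empirical Rademacher complexity of this class can be estimated by Monte Carlo, as in Exercise 8. A sketch on synthetic Gaussian data (function name ours):

```python
import numpy as np

def emp_rademacher_linear(X, B=1.0, T=2000, seed=0):
    # closed form: sup_{||w|| <= B} <w, v> = B ||v||_2, so
    # R_hat = (B/n) E_sigma || sum_i sigma_i x_i ||_2, estimated over T sign draws
    rng = np.random.default_rng(seed)
    sigma = rng.choice([-1.0, 1.0], size=(T, X.shape[0]))
    return B * np.linalg.norm(sigma @ X, axis=1).mean() / X.shape[0]

rng = np.random.default_rng(1)
ests = {}
for n in (100, 400, 1600):
    X = rng.normal(size=(n, 20))
    C = np.linalg.norm(X, axis=1).max()
    ests[n] = emp_rademacher_linear(X)
    print(n, round(ests[n], 4), round(C / np.sqrt(n), 4))  # estimate vs BC/sqrt(n)
```

Quadrupling n roughly halves the estimate, confirming the 1/√n rate; the BC/√n bound sits above the estimate because it uses the worst-case norm C.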

10.4 Rademacher vs VC

| Property | VC Dimension | Rademacher Complexity |
| --- | --- | --- |
| Distribution | Distribution-free | Data-dependent |
| Calculation | Combinatorial | Expectation over random signs |
| Estimation | Hard (often NP-hard) | Monte Carlo approximation possible |
| Tightness | Worst-case | Often tighter in practice |
| Handles regression | No (classification) | Yes (arbitrary losses) |
| Modern NNs | Vacuous (too large dd) | Can be finite for sparse/low-rank models |

For AI: Rademacher complexity gives finite (non-vacuous) generalisation bounds for attention heads with bounded query/key norm, explaining why transformers with weight decay generalise despite huge parameter counts.


11. ML Applications

11.1 SGD and Gradient Concentration

In stochastic gradient descent, at each step tt we compute a mini-batch gradient:

g^t=1miBtθ(θt;x(i))\hat{g}_t = \frac{1}{m}\sum_{i \in B_t} \nabla_\theta \ell(\theta_t; x^{(i)})

where BtB_t is a random mini-batch of size mm. The true gradient is gt=θL(θt)=E[θ]g_t = \nabla_\theta \mathcal{L}(\theta_t) = \mathbb{E}[\nabla_\theta \ell].

Concentration of gradient noise. Assuming each coordinate of (θ;x)\nabla\ell(\theta; x) lies in [G,G][-G, G] almost surely, Hoeffding applied coordinate-wise plus a union bound over the dd coordinates gives:

P(g^tgtε)2dexp ⁣(mε22G2)P(\|\hat{g}_t - g_t\|_\infty \geq \varepsilon) \leq 2d \cdot \exp\!\left(-\frac{m\varepsilon^2}{2G^2}\right)

Implication for learning rate. For SGD to make progress, we need the gradient noise g^tgt\|\hat{g}_t - g_t\|_\infty to be smaller than the signal. Concentration tells us: with probability 1δ1-\delta, the noise is bounded by G2log(2d/δ)/mG\sqrt{2\log(2d/\delta)/m}. This justifies decaying learning-rate schedules such as η1/t\eta \propto 1/\sqrt{t} in theory.
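The 1/√m scaling of the noise is easy to verify on synthetic per-sample "gradients" (a stand-in for real ∇ℓ values, drawn uniformly from [−G, G] with G = 1):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, trials = 100_000, 10, 200
grads = rng.uniform(-1.0, 1.0, size=(N, d))   # synthetic per-sample gradients, G = 1
full = grads.mean(axis=0)                     # the "true" gradient

noise = {}
for m in (16, 256, 4096):
    idx = rng.integers(0, N, size=(trials, m))                # random mini-batches
    err = np.abs(grads[idx].mean(axis=1) - full).max(axis=1)  # ||g_hat - g||_inf
    noise[m] = err.mean()
    print(m, round(noise[m], 4))  # roughly 4x smaller each time m grows 16x
```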

Variance reduction. Methods like SVRG (Stochastic Variance-Reduced Gradient) periodically compute the full gradient and use it as a control variate, so the effective noise shrinks as the iterates approach the optimum. Bernstein's inequality quantifies the benefit: convergence can be linear (not just O(1/t)O(1/\sqrt{t})) when the variance is controlled.

11.2 Confidence Intervals via Hoeffding

Suppose we evaluate an LLM on a benchmark: out of nn binary questions (correct/incorrect), we observe accuracy p^=k/n\hat{p} = k/n.

By Hoeffding's inequality (with Xi{0,1}X_i \in \{0,1\}, ba=1b-a = 1), with probability at least 1δ1-\delta:

p^plog(2/δ)2n|\hat{p} - p| \leq \sqrt{\frac{\log(2/\delta)}{2n}}

This gives the Hoeffding confidence interval: p[p^log(2/δ)2n,p^+log(2/δ)2n]p \in \left[\hat{p} - \sqrt{\frac{\log(2/\delta)}{2n}}, \hat{p} + \sqrt{\frac{\log(2/\delta)}{2n}}\right].

Numerical example. For n=1000n = 1000 questions and δ=0.05\delta = 0.05:

CI half-width=log(40)20003.6920000.043\text{CI half-width} = \sqrt{\frac{\log(40)}{2000}} \approx \sqrt{\frac{3.69}{2000}} \approx 0.043

So a measured accuracy of 73.2% on 1000 questions gives 95% CI [68.9%,77.5%][68.9\%, 77.5\%] - a ±4.3%\pm 4.3\% interval. This explains why claiming "Model A (73.2%) outperforms Model B (72.8%)" on 1000 questions has no statistical support.
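A small helper implementing this interval (the two accuracies are the ones from the example above):

```python
import math

def hoeffding_ci(p_hat, n, delta=0.05):
    # two-sided Hoeffding interval for a mean of {0,1}-valued outcomes
    half = math.sqrt(math.log(2 / delta) / (2 * n))
    return max(0.0, p_hat - half), min(1.0, p_hat + half)

a = hoeffding_ci(0.732, 1000)   # model A: roughly (0.689, 0.775)
b = hoeffding_ci(0.728, 1000)   # model B: roughly (0.685, 0.771)
print(a, b)  # the intervals overlap almost entirely: no basis for a ranking
```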

11.3 Random Features and Kernel Approximation

The Rahimi-Recht random features method approximates the RBF kernel k(x,x)=exx2/(2σ2)k(x, x') = e^{-\|x-x'\|^2/(2\sigma^2)} as:

k(x,x)1Dj=1Dϕj(x)ϕj(x),ϕj(x)=2cos(ωjx+bj)k(x, x') \approx \frac{1}{D}\sum_{j=1}^D \phi_j(x)\phi_j(x'), \quad \phi_j(x) = \sqrt{2}\cos(\omega_j^\top x + b_j)

where ωjN(0,I/σ2)\omega_j \sim \mathcal{N}(0, I/\sigma^2) and bjU[0,2π]b_j \sim \mathcal{U}[0, 2\pi].

How many features? The approximation error is bounded by Hoeffding:

P ⁣(1Dj=1Dzjk(x,x)ε)2eDε2/8P\!\left(\left|\frac{1}{D}\sum_{j=1}^D z_j - k(x,x')\right| \geq \varepsilon\right) \leq 2e^{-D\varepsilon^2/8}

where zj=ϕj(x)ϕj(x)[2,2]z_j = \phi_j(x)\phi_j(x') \in [-2, 2]. For ε=0.01\varepsilon = 0.01 and δ=0.01\delta = 0.01: D8log(2/0.01)/0.012424,000D \geq 8\log(2/0.01)/0.01^2 \approx 424{,}000.

In practice, D=1000D = 1000--1000010000 suffices with ε0.1\varepsilon \approx 0.1, suggesting the bound is conservative. This is common: worst-case bounds are loose, actual concentrations are tighter.
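A quick sanity check of the random-features approximation (the dimensions and D are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 5, 2000, 1.0
x, xp = rng.normal(size=d), rng.normal(size=d)

omega = rng.normal(scale=1.0 / sigma, size=(D, d))   # omega_j ~ N(0, I / sigma^2)
b = rng.uniform(0.0, 2 * np.pi, size=D)
phi = lambda v: np.sqrt(2.0) * np.cos(omega @ v + b)

approx = phi(x) @ phi(xp) / D
exact = np.exp(-np.sum((x - xp) ** 2) / (2 * sigma**2))
print(abs(approx - exact))  # usually a few times 1e-2 or less at D = 2000
```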

11.4 Concentration in High Dimensions

High-dimensional geometry exhibits surprising concentration phenomena. For xN(0,Id)\mathbf{x} \sim \mathcal{N}(0, I_d):

Chi-squared concentration. x22χd2\|\mathbf{x}\|_2^2 \sim \chi^2_d with mean dd and variance 2d2d. Bernstein gives:

P ⁣(x22d1t)2exp ⁣(d8min(t2,t))P\!\left(\left|\frac{\|\mathbf{x}\|_2^2}{d} - 1\right| \geq t\right) \leq 2\exp\!\left(-\frac{d}{8}\min(t^2, t)\right)

So x2/d1\|\mathbf{x}\|_2 / \sqrt{d} \to 1 in probability - all Gaussian vectors in high dimensions have nearly the same norm d\sqrt{d}.
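The concentration of norms is striking even at modest dimension; a short simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
spread = {}
for d in (10, 100, 10_000):
    r = np.linalg.norm(rng.normal(size=(1000, d)), axis=1) / np.sqrt(d)
    spread[d] = r.std()
    print(d, round(r.mean(), 3), round(r.std(), 4))  # mean near 1, spread ~ 1/sqrt(2d)
```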

Concentration of inner products. For independent x,yN(0,Id/d)\mathbf{x}, \mathbf{y} \sim \mathcal{N}(0, I_d/d) (unit-variance):

P(x,yt)2edt2/2P(|\langle \mathbf{x}, \mathbf{y} \rangle| \geq t) \leq 2e^{-dt^2/2}

Nearly-orthogonal random vectors: with d=512d = 512 (typical embedding dimension), the typical inner product is of order 1/d0.0441/\sqrt{d} \approx 0.044, and x,y<0.1|\langle \mathbf{x}, \mathbf{y}\rangle| < 0.1 holds with probability about 0.98 (the bound above guarantees at least 12e2.560.851 - 2e^{-2.56} \approx 0.85).

For AI (attention). In transformers with dk=64d_k = 64 (key dimension), random query-key inner products concentrate around 0. This justifies the 1/dk1/\sqrt{d_k} scaling in attention: without it, softmax inputs would have variance dkd_k, causing near-zero gradients through the softmax saturation.
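A sketch of the scaling argument with synthetic unit-variance queries and keys:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=(10_000, d_k))   # unit-variance query entries
k = rng.normal(size=(10_000, d_k))   # unit-variance key entries
scores = (q * k).sum(axis=1)         # raw attention logits q . k
print(scores.var(), (scores / np.sqrt(d_k)).var())  # ~d_k unscaled, ~1 after scaling
```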

Preview: Law of Large Numbers and CLT

The concentration results above provide finite-sample bounds. The LLN says Xˉnμ\bar{X}_n \to \mu as nn \to \infty; the CLT says the deviation n(Xˉnμ)/σ\sqrt{n}(\bar{X}_n - \mu)/\sigma converges to N(0,1)\mathcal{N}(0,1). These are the asymptotic counterparts of Hoeffding's inequality.

-> Full treatment: Section06 Stochastic Processes


12. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
| --- | --- | --- | --- |
| 1 | Applying Markov to a signed random variable | Markov requires X0X \geq 0; for signed XX, P(Xt)P(X \geq t) can exceed E[X]/t\mathbb{E}[X]/t | Apply Markov to the non-negative variable X\lvert X \rvert, or use Chebyshev |
| 2 | Using Chebyshev for exponential tails | Chebyshev gives 1/k21/k^2 decay, vastly looser than sub-Gaussian ek2/2e^{-k^2/2} | If data is bounded or Gaussian, use Hoeffding/Chernoff |
| 3 | Forgetting the factor of 2 in two-sided Hoeffding | One-sided: e2nt2/(ba)2e^{-2nt^2/(b-a)^2}; two-sided has factor 2 in front | Always write "two-sided" explicitly and include the factor |
| 4 | Confusing sub-Gaussian parameter with variance | Sub-Gaussian parameter σ2\sigma^2 satisfies Var(X)σ2\operatorname{Var}(X) \leq \sigma^2 but is not equal | Var(X)=M(0)\operatorname{Var}(X) = M''(0); sub-Gaussian parameter can be larger |
| 5 | Applying union bound over infinite classes directly | Union bound only works for countable (finite/countably infinite) collections | Use ε\varepsilon-net covering argument first, then union bound over the net |
| 6 | Computing VC dimension by counting parameters | VC dimension is not the number of parameters; it depends on the function class structure | For linear classifiers in Rd\mathbb{R}^d: dVC=d+1d_{VC} = d+1, not number of weights |
| 7 | Ignoring the log(1/δ)\log(1/\delta) term in sample complexity | Setting δ=0.01\delta = 0.01 adds log(100)4.6\log(100) \approx 4.6 - small, but not negligible | Always compute the full bound including log(1/δ)\log(1/\delta) |
| 8 | Claiming Chernoff applies to dependent variables | Chernoff requires independence (product of MGFs) | For dependent variables, use McDiarmid or martingale concentration |
| 9 | Using Hoeffding when variance is known to be small | Bernstein gives an exponentially tighter bound for small-variance data | Check if Var(Xi)(ba)2/4\operatorname{Var}(X_i) \ll (b-a)^2/4; if so, use Bernstein |
| 10 | Interpreting VC generalisation bounds as practically tight | VC bounds are often vacuous for deep networks; they describe worst-case behaviour | Use PAC-Bayes or Rademacher bounds, which can be data-dependent and tighter |

13. Exercises

**Exercise 1 (★)** (Markov and Chebyshev) For XExp(1)X \sim \operatorname{Exp}(1) and YN(0,1)Y \sim \mathcal{N}(0,1): (a) Compute P(X3)P(X \geq 3) exactly and via Markov's inequality. What is the ratio? (b) Compute P(Y3)P(|Y| \geq 3) exactly and via Chebyshev. What is the ratio? (c) For which kk does Chebyshev give P(Yk)0.05P(|Y| \geq k) \leq 0.05? Compare to the exact Gaussian answer. (d) Verify both bounds numerically with N=106N = 10^6 samples.

**Exercise 2 (★)** (Hoeffding Sample Complexity) A spam filter classifies emails. On nn test emails, it achieves accuracy p^\hat{p}. (a) Using Hoeffding, find the smallest nn such that with 99% confidence, p^p0.02|\hat{p} - p| \leq 0.02. (b) How does the required nn change if we only require p^p0.05|\hat{p} - p| \leq 0.05? (c) If n=500n = 500 and p^=0.92\hat{p} = 0.92, give the 95% Hoeffding confidence interval for pp. (d) Compare the Hoeffding CI to the Normal-approximation CI p^±1.96p^(1p^)/n\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n}.

**Exercise 3 (★)** (Chernoff Bounds for Binomial) Let SBin(n,p)S \sim \operatorname{Bin}(n, p) with n=1000n = 1000, p=0.1p = 0.1 (so μ=100\mu = 100). (a) Compute P(S130)P(S \geq 130) exactly and via the multiplicative Chernoff bound (δ=0.3\delta = 0.3). (b) Compute P(S130)P(S \geq 130) via Hoeffding. Compare to Chernoff - which is tighter? (c) Compute P(S70)P(S \leq 70) via the lower-tail Chernoff bound and compare to the exact value. (d) Plot all three bounds (Hoeffding, upper Chernoff, lower Chernoff) as a function of the threshold.

**Exercise 4 (★★)** (Bernstein: Gradient Noise in SGD) Assume each sample gradient gi[G,G]dg_i \in [-G, G]^d with E[gi]=g\mathbb{E}[g_i] = g and per-coordinate variance σk2\sigma_k^2. (a) Apply Bernstein (with union bound) to bound P(g^gε)P(\|\hat{g} - g\|_\infty \geq \varepsilon) for mini-batch size mm. (b) Compare to the Hoeffding-based bound. Under what condition does Bernstein win? (c) For G=1G = 1, d=100d = 100, σk2=0.01\sigma_k^2 = 0.01, find the mm needed for P(g^g0.1)0.05P(\|\hat{g} - g\|_\infty \geq 0.1) \leq 0.05 via both bounds. (d) Verify numerically: generate synthetic gradients and measure empirical exceedance probability.

**Exercise 5 (★★)** (McDiarmid on Empirical Risk) Let R^(h,D)=1ni=1n(h,zi)\hat{R}(h, \mathcal{D}) = \frac{1}{n}\sum_{i=1}^n \ell(h, z_i) with [0,1]\ell \in [0, 1] and ziz_i i.i.d. (a) Show that swapping one training example changes R^\hat{R} by at most 1/n1/n. (b) Apply McDiarmid's inequality to bound P(R^E[R^]t)P(|\hat{R} - \mathbb{E}[\hat{R}]| \geq t). (c) Compare this to the direct Hoeffding application. Are they the same? Why? (d) What if [0,L]\ell \in [0, L] instead of [0,1][0, 1]? How does the bound scale?

**Exercise 6 (★★)** (PAC Bounds for Finite Class) Consider H={h1,,h1000}\mathcal{H} = \{h_1, \ldots, h_{1000}\} with 1000 decision rules for binary classification. (a) How many training examples are needed for uniform convergence with ε=0.05\varepsilon = 0.05, δ=0.05\delta = 0.05? (b) ERM achieves training error 0. Using the PAC bound, what is the worst-case true error with 95% confidence? (c) If we use n=5000n = 5000 examples and observe R^(h^)=0.03\hat{R}(\hat{h}) = 0.03, bound R(h^)R(\hat{h}) with 99% confidence. (d) How does the bound change if H=106|\mathcal{H}| = 10^6 (e.g., a lookup table over binary features)?

**Exercise 7 (★★★)** (VC Dimension and Generalisation) Consider linear classifiers in Rd\mathbb{R}^d: H={sign(wx+b):wRd,bR}\mathcal{H} = \{\operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b): \mathbf{w} \in \mathbb{R}^d, b \in \mathbb{R}\}. (a) Show by construction that d+1d+1 points can be shattered (exhibit a configuration). (b) Show that no set of d+2d+2 points can be shattered (Radon's theorem). (c) Write the VC generalisation bound for d=100d = 100, n=10000n = 10000, δ=0.05\delta = 0.05. (d) Compare to the Rademacher bound for the same class with w21\|\mathbf{w}\|_2 \leq 1 and x21\|\mathbf{x}\|_2 \leq 1.

**Exercise 8 (★★★)** (Rademacher Complexity: Monte Carlo Estimation) Given a dataset {x(1),,x(n)}Rd\{x^{(1)}, \ldots, x^{(n)}\} \subset \mathbb{R}^d and linear function class F={xwx:w21}\mathcal{F} = \{\mathbf{x} \mapsto \mathbf{w}^\top \mathbf{x}: \|\mathbf{w}\|_2 \leq 1\}: (a) Show that R^S(F)=1nEσ ⁣[iσix(i)2]\hat{\mathfrak{R}}_S(\mathcal{F}) = \frac{1}{n}\mathbb{E}_\sigma\!\left[\left\|\sum_i \sigma_i x^{(i)}\right\|_2\right]. (b) Implement a Monte Carlo estimator using T=1000T = 1000 random sign vectors σ\sigma. (c) On the MNIST training set (or a synthetic dataset), compute R^S\hat{\mathfrak{R}}_S and the resulting generalisation bound. (d) How does R^S\hat{\mathfrak{R}}_S change as you vary nn? Verify the O(1/n)O(1/\sqrt{n}) scaling.


14. Why This Matters for AI (2026 Perspective)

| Concept | AI/ML Connection | Current Importance |
| --- | --- | --- |
| Hoeffding confidence intervals | LLM benchmark evaluation: how many test questions for reliable rankings? | Critical - used implicitly in all leaderboard comparisons (MMLU, HELM, LMSYS) |
| PAC-Bayes bounds | Tightest known generalisation bounds for overparameterised models; connects to flat minima and sharpness-aware minimisation (SAM) | Active research frontier, 2023-2026 |
| VC dimension | Classical foundation; vacuous for LLMs with 10910^9+ parameters | Conceptually important; practically replaced by norm-based bounds |
| Rademacher complexity | Data-dependent bounds; finite for norm-constrained transformers | Used in theoretical analysis of attention, LoRA rank bounds |
| McDiarmid stability | Algorithmic stability theory: output doesn't change much when one training point changes | Foundation of differential privacy; relevant to RLHF data sensitivity |
| Concentration in high dims | dk\sqrt{d_k} scaling in attention; JL lemma for random projections in retrieval | Directly explains transformer architectural choices (FlashAttention, MLA) |
| Chernoff for random graphs | Theoretical analysis of dropout, random initialisation, random feature networks | Relevant to lottery ticket hypothesis, neural architecture search |
| Union bound + covering | Uniform convergence for function classes; foundation of all finite-sample learning theory | Every theoretical ML paper uses this framework |
| Sub-Gaussian gradients | Convergence theory for Adam, SGD; required sample complexity for fine-tuning | Used in LoRA theoretical analysis and RLHF convergence guarantees |
| Bernstein vs Hoeffding | Variance-aware bounds; relevant when gradient variance is small (near optima) | Foundation of variance reduction methods (SVRG, SAG, SAGA) |

15. Conceptual Bridge

Looking back. This section builds on Section04 Expectation and Moments, which established E[X]\mathbb{E}[X], Var(X)\operatorname{Var}(X), and the MGF MX(t)=E[etX]M_X(t) = \mathbb{E}[e^{tX}]. The MGF is the critical bridge: Hoeffding's lemma uses convexity of etxe^{tx}; the Chernoff method optimises over MX(s)M_X(s); Bernstein's inequality exploits the variance E[X2]\mathbb{E}[X^2]. Jensen's inequality (from Section04) underlies both Hoeffding's lemma proof and the information-theoretic interpretation of Chernoff bounds.

Looking forward. The Law of Large Numbers says Xˉnμ\bar{X}_n \to \mu almost surely. This is the qualitative counterpart of Hoeffding's quantitative bound: Hoeffding gives the rate at which Xˉn\bar{X}_n concentrates. The Central Limit Theorem (CLT) says the shape of the distribution of Xˉn\bar{X}_n converges to Gaussian - explaining why sub-Gaussian bounds are the right model for sums of bounded variables. Both are developed in Section06 Stochastic Processes.

The PAC-statistics connection. PAC learning theory is, at its core, applied concentration theory. The "probably" in PAC = concentration inequality. The "approximately correct" = ε\varepsilon tolerance. The sample complexity n=O((d+log(1/δ))/ε2)n = O((d + \log(1/\delta))/\varepsilon^2) is exactly what concentration inequalities give. Chapter 7 (Statistics) will build on this: confidence intervals are Hoeffding bounds, hypothesis tests are Chernoff bounds, and MLE theory uses Bernstein to prove asymptotic normality.

CONCENTRATION INEQUALITIES IN THE CURRICULUM
========================================================================

  Section04 Expectation and Moments
  +- E[X], Var(X), MGF M_X(t)  -------------------------- input
  +- Jensen's inequality         -------------------------- used in proofs
  +- Markov/Chebyshev preview    -------------------------- previewed there

         v

  Section05 Concentration Inequalities   <--- YOU ARE HERE
  +- Markov / Chebyshev            (moment-based, polynomial tails)
  +- Sub-Gaussian / Hoeffding      (bounded, exponential tails)
  +- Chernoff / Bernstein          (MGF-based, variance-aware)
  +- McDiarmid                     (functions of independent variables)
  +- Union bound + covering        (infinite hypothesis classes)
  +- PAC learning / VC dimension   (generalisation theory)
  +- Rademacher complexity         (data-dependent bounds)

         v                                      v

  Section06 Stochastic Processes         Chapter 7: Statistics
  +- Weak LLN (via Chebyshev)      +- Confidence intervals
  +- Strong LLN                    +- Hypothesis testing
  +- CLT (limit of Hoeffding)      +- MLE consistency (Bernstein)

========================================================================

Appendix A: Summary of Key Bounds

| Inequality | Conditions | Bound | Rate |
| --- | --- | --- | --- |
| Markov | X0X \geq 0, E[X]=μ\mathbb{E}[X] = \mu | P(Xt)μ/tP(X \geq t) \leq \mu/t | O(1/t)O(1/t) |
| Chebyshev | Var(X)=σ2\operatorname{Var}(X) = \sigma^2 | P(Xμt)σ2/t2P(\lvert X-\mu\rvert \geq t) \leq \sigma^2/t^2 | O(1/t2)O(1/t^2) |
| Cantelli | Var(X)=σ2\operatorname{Var}(X) = \sigma^2 | P(Xμt)σ2/(σ2+t2)P(X-\mu \geq t) \leq \sigma^2/(\sigma^2+t^2) | O(1/t2)O(1/t^2) |
| Sub-Gaussian | XX σ2\sigma^2-sub-G | P(Xt)et2/(2σ2)P(X \geq t) \leq e^{-t^2/(2\sigma^2)} | O(et2)O(e^{-t^2}) |
| Hoeffding | Xi[ai,bi]X_i \in [a_i,b_i] i.i.d. | P(Xˉμt)exp(2n2t2/(biai)2)P(\bar{X}-\mu \geq t) \leq \exp(-2n^2t^2/\sum(b_i-a_i)^2) | O(ent2)O(e^{-nt^2}) |
| Chernoff (mult.) | SBin(n,p)S \sim \operatorname{Bin}(n,p), δ(0,1)\delta \in (0,1) | P(S(1+δ)μ)eμδ2/3P(S \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3} | O(eμ)O(e^{-\mu}) |
| Bernstein | Xic\lvert X_i\rvert \leq c, variance ν2\nu^2 | P(Xˉμt)exp ⁣(nt22(ν2+ct/3))P(\bar{X}-\mu \geq t) \leq \exp\!\left(-\frac{nt^2}{2(\nu^2+ct/3)}\right) | O(ent2/ν2)O(e^{-nt^2/\nu^2}) |
| McDiarmid | ff with cic_i-bounded diff. | P(fE[f]t)e2t2/ci2P(f - \mathbb{E}[f] \geq t) \leq e^{-2t^2/\sum c_i^2} | O(et2)O(e^{-t^2}) |

Appendix B: Sample Complexity Table

Given: i.i.d. data in [0,1][0,1], want P(Xˉnμε)1δP(|\bar{X}_n - \mu| \leq \varepsilon) \geq 1-\delta.

| ε\varepsilon | δ\delta | Hoeffding nn | Normal approx nn |
| --- | --- | --- | --- |
| 0.10 | 0.10 | 150 | 68 |
| 0.05 | 0.10 | 600 | 271 |
| 0.05 | 0.05 | 738 | 385 |
| 0.01 | 0.05 | 18,445 | 9,604 |
| 0.01 | 0.01 | 26,492 | 16,588 |

(Hoeffding column: n=log(2/δ)/(2ε2)n = \lceil \log(2/\delta)/(2\varepsilon^2) \rceil; Normal column: n=(z1δ/2/(2ε))2n = \lceil (z_{1-\delta/2}/(2\varepsilon))^2 \rceil at the worst case p=1/2p = 1/2.)

Hoeffding is distribution-free (works for worst case); Normal approximation assumes CLT holds.
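Both columns can be recomputed directly from their formulas (Hoeffding: n = ⌈log(2/δ)/(2ε²)⌉; Normal approximation at the worst case p = 1/2):

```python
import math
from statistics import NormalDist

def n_hoeffding(eps, delta):
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

def n_normal(eps, delta):
    z = NormalDist().inv_cdf(1 - delta / 2)
    return math.ceil((z / (2 * eps)) ** 2)   # worst case p = 1/2, so sd <= 1/2

for eps, delta in [(0.10, 0.10), (0.05, 0.10), (0.05, 0.05), (0.01, 0.05), (0.01, 0.01)]:
    print(eps, delta, n_hoeffding(eps, delta), n_normal(eps, delta))
```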



Appendix C: Detailed Proofs and Derivations

C.1 Proof of Hoeffding's Lemma in Full

We prove: if E[X]=0\mathbb{E}[X] = 0 and X[a,b]X \in [a, b] a.s., then E[etX]et2(ba)2/8\mathbb{E}[e^{tX}] \leq e^{t^2(b-a)^2/8}.

Step 1: Convexity bound. Since etxe^{tx} is convex in xx, for x[a,b]x \in [a, b]:

etxbxbaeta+xabaetbe^{tx} \leq \frac{b - x}{b - a} e^{ta} + \frac{x - a}{b - a} e^{tb}

Step 2: Take expectations. Let p=aba[0,1]p = \frac{-a}{b-a} \in [0,1] (since a0ba \leq 0 \leq b after centering):

E[etX]bE[X]baeta+E[X]abaetb=bbaeta+abaetb\mathbb{E}[e^{tX}] \leq \frac{b - \mathbb{E}[X]}{b-a} e^{ta} + \frac{\mathbb{E}[X] - a}{b-a} e^{tb} = \frac{b}{b-a} e^{ta} + \frac{-a}{b-a} e^{tb} =(1p)eta+petb=eta[(1p)+pet(ba)]= (1-p) e^{ta} + p e^{tb} = e^{ta}\left[(1-p) + p e^{t(b-a)}\right]

Step 3: Exponential bound. Let h=t(ba)h = t(b-a) and ϕ(h)=log[(1p)+peh]ph\phi(h) = \log[(1-p) + pe^h] - ph. Then:

E[etX]eta+pheϕ(h)=eta+pt(ba)eϕ(h)\mathbb{E}[e^{tX}] \leq e^{ta + ph} \cdot e^{\phi(h)} = e^{ta + p \cdot t(b-a)} \cdot e^{\phi(h)}

Note: ta+pt(ba)=ta+t(a/(ba))(ba)=tata=0ta + p \cdot t(b-a) = ta + t(-a/(b-a))(b-a) = ta - ta = 0 (since p=a/(ba)p = -a/(b-a)). So:

E[etX]eϕ(h)\mathbb{E}[e^{tX}] \leq e^{\phi(h)}

Step 4: Bound ϕ(h)\phi(h). We have ϕ(0)=0\phi(0) = 0 and ϕ(0)=0\phi'(0) = 0. By Taylor's theorem:

ϕ(h)=ϕ(0)+hϕ(0)+h22ϕ(ξ)\phi(h) = \phi(0) + h\phi'(0) + \frac{h^2}{2}\phi''(\xi)

for some ξ[0,h]\xi \in [0, h]. Computing: ϕ(h)=peh(1p)((1p)+peh)214\phi''(h) = \frac{pe^h(1-p)}{((1-p)+pe^h)^2} \leq \frac{1}{4}

(since xy(x+y)2/4xy \leq (x+y)^2/4 for x,y0x, y \geq 0, applied to x=(1p)x = (1-p), y=pehy = pe^h).

Therefore: ϕ(h)h28=t2(ba)28\phi(h) \leq \frac{h^2}{8} = \frac{t^2(b-a)^2}{8}. \blacksquare
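A quick numerical check of the lemma for X ~ Uniform[−1, 1], where the MGF is sinh(t)/t and (b−a)²/8 = 1/2:

```python
import math

# X ~ Uniform[-1, 1]: E[e^{tX}] = sinh(t)/t, and the lemma's bound is e^{t^2/2}
for t in (0.5, 1.0, 2.0, 4.0):
    mgf = math.sinh(t) / t
    bound = math.exp(t**2 / 2)
    assert mgf <= bound
    print(t, round(mgf, 4), round(bound, 4))
```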

C.2 Proof of McDiarmid's Inequality via Azuma's Inequality

Azuma's Inequality. If (Z0,Z1,,Zn)(Z_0, Z_1, \ldots, Z_n) is a martingale with ZkZk1ck|Z_k - Z_{k-1}| \leq c_k a.s., then:

P(ZnZ0t)exp ⁣(2t2k=1nck2)P(Z_n - Z_0 \geq t) \leq \exp\!\left(-\frac{2t^2}{\sum_{k=1}^n c_k^2}\right)

Doob martingale construction. Given independent X1,,XnX_1, \ldots, X_n and ff with bounded differences cic_i, define:

Zk=E[f(X1,,Xn)X1,,Xk]Z_k = \mathbb{E}[f(X_1, \ldots, X_n) \mid X_1, \ldots, X_k]

Then:

  • Z0=E[f]Z_0 = \mathbb{E}[f] (constant, before observing anything)
  • Zn=f(X1,,Xn)Z_n = f(X_1, \ldots, X_n) (the function value itself)
  • (Zk)(Z_k) is a martingale by the tower property

Bounding differences. For each kk:

ZkZk1=E[fX1,,Xk]E[fX1,,Xk1]Z_k - Z_{k-1} = \mathbb{E}[f \mid X_1,\ldots,X_k] - \mathbb{E}[f \mid X_1,\ldots,X_{k-1}]

Since changing XkX_k changes ff by at most ckc_k and Zk1Z_{k-1} doesn't depend on XkX_k, we have ZkZk1ck|Z_k - Z_{k-1}| \leq c_k.

Applying Azuma to this Doob martingale gives McDiarmid's inequality. \blacksquare

C.3 Proof of the Finite-Class PAC Bound (Agnostic Case)

Setup. H\mathcal{H} finite with H=M|\mathcal{H}| = M. Losses h(i)=(h,z(i))[0,1]\ell_h^{(i)} = \ell(h, z^{(i)}) \in [0,1]. True risk R(h)=E[h]R(h) = \mathbb{E}[\ell_h], empirical risk R^(h)=1nih(i)\hat{R}(h) = \frac{1}{n}\sum_i \ell_h^{(i)}.

Goal. Show P(suphR(h)R^(h)>ε)δP(\sup_h |R(h) - \hat{R}(h)| > \varepsilon) \leq \delta when nlog(2M/δ)2ε2n \geq \frac{\log(2M/\delta)}{2\varepsilon^2}.

Step 1. Fix hh. By Hoeffding: P(R(h)R^(h)>ε)2e2nε2P(|R(h) - \hat{R}(h)| > \varepsilon) \leq 2e^{-2n\varepsilon^2}.

Step 2. Union bound over all MM hypotheses:

P(h:R(h)R^(h)>ε)hH2e2nε2=2Me2nε2P(\exists h: |R(h) - \hat{R}(h)| > \varepsilon) \leq \sum_{h \in \mathcal{H}} 2e^{-2n\varepsilon^2} = 2M e^{-2n\varepsilon^2}

Step 3. Set 2Me2nε2δ2Me^{-2n\varepsilon^2} \leq \delta:

nlog(2M/δ)2ε2n \geq \frac{\log(2M/\delta)}{2\varepsilon^2}

ERM generalisation. Under uniform convergence, the ERM h^=argminhR^(h)\hat{h} = \arg\min_h \hat{R}(h) satisfies:

R(h^)R^(h^)+εR^(h)+εR(h)+2εR(\hat{h}) \leq \hat{R}(\hat{h}) + \varepsilon \leq \hat{R}(h^*) + \varepsilon \leq R(h^*) + 2\varepsilon

where h=argminhR(h)h^* = \arg\min_h R(h) is the Bayes-optimal hypothesis in H\mathcal{H}. \blacksquare


Appendix D: VC Theory in Depth

D.1 The Growth Function

Definition. The growth function ΠH(n)\Pi_\mathcal{H}(n) counts the maximum number of distinct labellings achievable by H\mathcal{H} on any nn points:

ΠH(n)=maxx(1),,x(n)X{(h(x(1)),,h(x(n))):hH}\Pi_\mathcal{H}(n) = \max_{x^{(1)},\ldots,x^{(n)} \in \mathcal{X}} \left|\{(h(x^{(1)}),\ldots,h(x^{(n)})): h \in \mathcal{H}\}\right|

If H\mathcal{H} can shatter nn points, ΠH(n)=2n\Pi_\mathcal{H}(n) = 2^n. The VC dimension d=dVCd = d_{VC} is the largest nn for which ΠH(n)=2n\Pi_\mathcal{H}(n) = 2^n.

D.2 Sauer-Shelah Lemma

Lemma. For dVC(H)=dd_{VC}(\mathcal{H}) = d and ndn \geq d:

ΠH(n)k=0d(nk)\Pi_\mathcal{H}(n) \leq \sum_{k=0}^d \binom{n}{k}

Proof idea. By induction on nn and dd. Split: either x(n)x^{(n)} makes no difference to the labelling count, or it doubles some labellings. Careful counting bounds the total.

Consequence. For ndn \geq d: ΠH(n)(en/d)d\Pi_\mathcal{H}(n) \leq (en/d)^d. This transitions from exponential 2n2^n (when class is rich) to polynomial (en/d)d(en/d)^d (once nn exceeds VC dimension).

D.3 VC Dimension Examples

| Hypothesis Class | VC Dimension |
| --- | --- |
| Thresholds on \mathbb{R} | 1 |
| Intervals on \mathbb{R} | 2 |
| Halfspaces in \mathbb{R}^d | d+1 |
| Axis-aligned rectangles in \mathbb{R}^2 | 4 |
| Convex polygons in \mathbb{R}^2 | \infty |
| Degree-k polynomial thresholds on \mathbb{R} | k+1 |
| Neural nets with W weights | O(W \log W) |
| Kernel classifiers (universal kernel) | \infty |

Why VC = d+1 for halfspaces. In \mathbb{R}^d, any d+1 affinely independent points can be shattered by halfspaces. But for any d+2 points, Radon's theorem splits them into two groups whose convex hulls intersect, and no halfspace can separate those groups. Hence d_{VC} = d+1.
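The interval entries in the table above can be verified by brute force: intervals on \mathbb{R} shatter 2 points but not 3 (the labelling (1, 0, 1) is unreachable):

```python
def intervals_shatter(points):
    """Brute-force: do intervals [a, b] realise all 2^n labellings of the points?"""
    xs = sorted(points)
    # Candidate endpoints between/around the points cover every distinct labelling.
    cuts = [xs[0] - 1.0]
    cuts += [(xs[i] + xs[i + 1]) / 2.0 for i in range(len(xs) - 1)]
    cuts += [xs[-1] + 1.0]
    realised = {tuple(int(a <= x <= b) for x in xs) for a in cuts for b in cuts}
    return len(realised) == 2 ** len(xs)

two_points = intervals_shatter([0.0, 1.0])         # shattered: VC dim >= 2
three_points = intervals_shatter([0.0, 1.0, 2.0])  # not shattered: VC dim < 3
```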

D.4 Fundamental Theorem of Statistical Learning

Theorem. For binary classification, the following are equivalent:

  1. \mathcal{H} is PAC-learnable
  2. ERM is a successful learner for \mathcal{H}
  3. \mathcal{H} has finite VC dimension
  4. \mathcal{H} has the uniform convergence property

This theorem, proved through Sauer-Shelah and the symmetrisation technique (doubling the sample and introducing Rademacher variables), is the bedrock of statistical learning theory.


Appendix E: Advanced Topics in Concentration

E.1 Talagrand's Inequality

Talagrand's inequality (1995) strengthens McDiarmid for "self-bounding" functions - those where the effect of each coordinate is bounded by the function value itself.

Self-bounding functions. f: \mathcal{X}^n \to \mathbb{R}_{\geq 0} is self-bounding if there exist f_i: \mathcal{X}^{n-1} \to \mathbb{R} such that:

0 \leq f(\mathbf{x}) - f_i(\mathbf{x}^{-i}) \leq 1 \quad \text{and} \quad \sum_{i=1}^n (f(\mathbf{x}) - f_i(\mathbf{x}^{-i})) \leq f(\mathbf{x})

Talagrand's bound. For self-bounding f:

P(f \geq \mathbb{E}[f] + t) \leq e^{-t^2/(2(\mathbb{E}[f]+t/3))}

This is a variance-adaptive McDiarmid - analogous to Bernstein vs Hoeffding.

Applications: sums of Bernoulli variables, the size of random matchings, empirical risk when the model fits the data.

E.2 Concentration for Heavy-Tailed Distributions

When distributions are heavy-tailed, sub-Gaussian bounds don't apply. Modern techniques include:

Catoni's estimator. For estimating the mean of a distribution with finite variance \sigma^2 (but potentially infinite MGF), Catoni's estimator achieves:

P(|\hat{\mu}_\psi - \mu| \geq t) \leq 2\exp\!\left(-\frac{nt^2}{2\sigma^2 + 2t\sigma}\right) \approx e^{-nt^2/(2\sigma^2)}

using a truncated exponential influence function. This gives Bernstein-like rates without boundedness.

Median of means. Partition the n samples into k groups of n/k. Compute the group means and take their median. The median is sub-Gaussian with parameter \sigma^2 k/n - achieving sub-Gaussian rates for distributions with only two moments.
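A minimal sketch of the median-of-means estimator. The contaminated Gaussian sample below is an illustrative stand-in for heavy tails (the sizes and outlier values are assumptions, not from the text):

```python
import numpy as np

def median_of_means(x, k):
    """Split x into k groups, average each group, take the median of the means."""
    groups = np.array_split(x, k)
    return float(np.median([g.mean() for g in groups]))

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=100_000)
x[:3] = 1e6                # three gross outliers (heavy-tail stand-in)
rng.shuffle(x)

naive = float(x.mean())            # dragged far from 1.0 by the outliers
robust = median_of_means(x, k=20)  # outliers corrupt at most 3 of 20 groups
```

The naive mean is pulled tens of units away from the true mean 1.0, while the median of means is essentially unaffected.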

For AI: Training data for LLMs often has heavy-tailed length distributions (documents follow power laws). Heavy-tail-robust mean estimation is relevant for unbiased training.

E.3 Matrix Concentration Inequalities

Scalar concentration extends to matrices. For many AI applications, we care about concentration of random matrices.

Matrix Hoeffding. For independent random PSD matrices X_1, \ldots, X_n \in \mathbb{R}^{d \times d} with 0 \preceq X_i \preceq cI and \mathbb{E}[\sum X_i] = M, for t > 0:

P\!\left(\lambda_{\max}\!\left(\sum_i X_i - M\right) \geq t\right) \leq d \cdot e^{-t^2/(2nc^2)}

The extra factor d (the matrix dimension) accounts for the eigenvalue structure.

Application: random features. When approximating a kernel matrix K_{ij} = k(x^{(i)}, x^{(j)}) by random features, the approximation error in spectral norm concentrates by matrix Hoeffding, justifying the use of D = O(d^2/\varepsilon^2) random features for an \varepsilon-accurate Gram matrix approximation.

Matrix Bernstein. For independent zero-mean random matrices X_i with \|X_i\|_2 \leq R and \sigma^2 = \|\sum_i \mathbb{E}[X_i^2]\|_2:

P\!\left(\left\|\sum_i X_i\right\|_2 \geq t\right) \leq 2d \cdot \exp\!\left(-\frac{t^2/2}{\sigma^2 + Rt/3}\right)

Used in randomised linear algebra (randomised SVD, Nystrom approximation).

E.4 PAC-Bayes Theory

PAC-Bayes bounds combine the PAC framework with Bayesian prior knowledge.

Setup. Given a prior P over \mathcal{H} (fixed before seeing data) and a posterior Q (chosen after seeing data):

PAC-Bayes Theorem (McAllester, 1999). For any prior P and \delta > 0, with probability \geq 1-\delta:

\mathbb{E}_{h \sim Q}[R(h)] \leq \mathbb{E}_{h \sim Q}[\hat{R}(h)] + \sqrt{\frac{D_{\mathrm{KL}}(Q \| P) + \log(1/\delta)}{2(n-1)}}

Why this is powerful: The KL term D_{\mathrm{KL}}(Q\|P) replaces \log|\mathcal{H}| (which can be infinite). For neural networks, if the posterior concentrates near the prior (small KL), the bound is tight. Flat minima (low sharpness) correspond to posteriors that don't drift far from the prior.

Connection to SAM. Sharpness-Aware Minimisation (SAM, 2021) minimises a PAC-Bayes-inspired upper bound on generalisation error by seeking flat minima - directly operationalising PAC-Bayes in practice.


Appendix F: Worked Examples

F.1 Coin Flip Estimation

Problem. We flip a biased coin n times and observe a fraction \hat{p} of heads. How many flips are needed to estimate p within \pm 0.05 with 99% confidence?

Hoeffding solution. With X_i \in \{0,1\}, b-a=1:

n \geq \frac{\log(2/0.01)}{2(0.05)^2} = \frac{\log 200}{0.005} = \frac{5.298}{0.005} \approx 1060

Normal approximation. n \geq (2.576/0.05)^2 \cdot 0.25 = 664. Less conservative because it uses the Gaussian CLT approximation (valid for large n).

Interpretation. About 1000-1100 flips are needed for reliable probability estimation. This is why A/B tests often require thousands of users: smaller samples give confidence intervals too wide to detect meaningful differences.
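Both sample-size calculations can be reproduced in a few lines:

```python
import math

def hoeffding_n(eps, delta):
    """Distribution-free sample size: n >= log(2/delta) / (2 eps^2)."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def normal_n(eps, z=2.576):
    """CLT-based size with worst-case variance p(1-p) = 1/4; z = 2.576 for 99%."""
    return math.ceil((z / eps) ** 2 * 0.25)

n_hoeffding = hoeffding_n(0.05, 0.01)   # distribution-free: 1060
n_normal = normal_n(0.05)               # CLT-based: 664
```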

F.2 Multi-Armed Bandit via Chernoff

Problem. A recommendation system has K = 100 arms (content types). Each arm a has an unknown click-through rate p_a. We want to identify the best arm within \varepsilon = 0.01 with \delta = 0.01.

Naive approach. Pull each arm n_0 times. By Hoeffding + union bound:

P(\exists a: |\hat{p}_a - p_a| > \varepsilon) \leq 2K e^{-2n_0\varepsilon^2}

Set this \leq \delta: n_0 \geq \frac{\log(2K/\delta)}{2\varepsilon^2} = \frac{\log(20000)}{0.0002} \approx 49{,}500. Total pulls: about 4{,}950{,}000.

Adaptive strategy (UCB). Upper Confidence Bound algorithms concentrate exploration on promising arms, achieving total pulls O(K\log n/\varepsilon^2) - much smaller in practice.

F.3 Neural Network Generalisation: PAC-Bayes in Practice

Setup. 2-layer ReLU network, 100 hidden units, trained on 50,000 MNIST examples. Training error 0.3%, test error 1.2%. Standard VC bound: vacuous (parameter count \gg n).

PAC-Bayes approach. Use a Gaussian prior P = \mathcal{N}(0, \sigma_0^2 I) and posterior Q = \mathcal{N}(\hat{\theta}, \sigma^2 I) centred at the trained weights \hat{\theta}. Compute:

D_{\mathrm{KL}}(Q\|P) = \frac{\|\hat{\theta}\|_2^2}{2\sigma_0^2} + d\log(\sigma_0/\sigma) + \frac{d\sigma^2}{2\sigma_0^2} - \frac{d}{2}

With careful tuning of \sigma, this can give non-vacuous bounds (< 50% error guarantee) even for deep networks - unlike VC theory.

F.4 Random Projection (Johnson-Lindenstrauss)

Problem. We have n = 10000 points in \mathbb{R}^{1000}. We want to project to \mathbb{R}^m preserving all pairwise distances within a factor (1 \pm \varepsilon) for \varepsilon = 0.1.

JL Lemma. For m = O(\log n / \varepsilon^2) random projection dimensions (entries from \mathcal{N}(0, 1/m)):

m \geq \frac{4 + 2\log(n^2)}{\varepsilon^2/2 - \varepsilon^3/3} \approx \frac{8\log n}{\varepsilon^2} = \frac{8 \ln(10000)}{0.01} \approx 7400

Note that m \approx 7400 exceeds the ambient dimension 1000, so for these particular parameters projection gains nothing. The target dimension depends only on \log n and \varepsilon, not on the ambient dimension, so JL pays off when the ambient dimension is much larger (say 10^5 or 10^6) or when a coarser \varepsilon is acceptable.

Proof sketch. A random Gaussian matrix R \in \mathbb{R}^{m \times d} with i.i.d. \mathcal{N}(0,1) entries satisfies, for any fixed pair of points x, y:

P\!\left(\left|\|Rx - Ry\|_2^2/m - \|x-y\|_2^2\right| > \varepsilon\|x-y\|_2^2\right) \leq 2e^{-m\varepsilon^2/8}

by chi-squared concentration. A union bound over \binom{n}{2} pairs gives the JL lemma.

For AI: Locality-sensitive hashing, approximate nearest-neighbour search (used in RAG retrieval), and low-rank adaptation (LoRA) all rely on Johnson-Lindenstrauss-type arguments to justify dimensionality reduction.
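A quick empirical check of the JL distortion guarantee; n is smaller than in the worked example so the check runs fast (all sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 1000, 2400

X = rng.normal(size=(n, d))
R = rng.normal(size=(m, d)) / np.sqrt(m)   # entries N(0, 1/m): E||Rv||^2 = ||v||^2
Y = X @ R.T

def sq_dists(A):
    """All pairwise squared Euclidean distances via the Gram-matrix identity."""
    g = (A * A).sum(axis=1)
    return g[:, None] + g[None, :] - 2.0 * A @ A.T

iu = np.triu_indices(n, k=1)
ratios = sq_dists(Y)[iu] / sq_dists(X)[iu]
max_distortion = float(np.max(np.abs(ratios - 1.0)))
```

Every pairwise squared distance survives the projection to within a small relative distortion.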


Appendix G: Computing Concentration Bounds in Practice

G.1 Choosing the Right Bound

DECISION TREE: WHICH BOUND TO USE
========================================================================

  Do you know E[X]? --- No --> Can't bound the tail
       |
       Yes
       |
  Is X >= 0? --- Yes --> Markov: P(X >= t) <= E[X]/t
       |
       No
       |
  Do you know Var(X)? --- Yes --> Chebyshev: P(|X - mu| >= t) <= sigma^2/t^2
       |
       No
       |
  Is X bounded in [a,b]? --- Yes --> Hoeffding (exponential bound),
       |                             or Bernstein if Var(X) is also known
       |
       No
       |
  Is the MGF finite? --- Yes --> Chernoff method (optimise over s)
       |
       No
       |
  Heavy tails --> median of means, Catoni's estimator, or bootstrap

========================================================================

G.2 Python Recipe for Hoeffding CIs

import numpy as np

def hoeffding_ci(x, delta=0.05, lo=0.0, hi=1.0):
    """Two-sided Hoeffding confidence interval for i.i.d. data bounded in [lo, hi]."""
    x = np.asarray(x)
    n = len(x)
    x_bar = x.mean()
    # Radius solves 2 exp(-2 n r^2 / (hi - lo)^2) = delta.
    radius = (hi - lo) * np.sqrt(np.log(2 / delta) / (2 * n))
    return x_bar - radius, x_bar + radius

def hoeffding_sample_size(epsilon, delta, lo=0.0, hi=1.0):
    """Required n for epsilon-accuracy at 1-delta confidence."""
    return int(np.ceil((hi - lo)**2 * np.log(2 / delta) / (2 * epsilon**2)))

G.3 Interpreting Generalisation Gaps

When a model achieves training error 1% and test error 5%:

  • Generalisation gap = 4%
  • Is this within bounds? With n = 10000, \delta = 0.05: Hoeffding gives \sqrt{\log(40)/20000} \approx 1.4\% per hypothesis
  • For |\mathcal{H}| = 10^6 hypotheses: add \sqrt{\log(10^6)/20000} \approx 2.6\%
  • Total bound: \approx 4\% - in line with the observed gap

In practice, neural networks occupy a tiny corner of their hypothesis class (due to inductive bias of SGD), so effective complexity is much smaller than the VC dimension suggests.
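The back-of-envelope numbers above can be reproduced directly:

```python
import math

n, delta = 10_000, 0.05
per_hypothesis = math.sqrt(math.log(2 / delta) / (2 * n))   # ~1.4% Hoeffding radius
union_extra = math.sqrt(math.log(10 ** 6) / (2 * n))        # ~2.6% for |H| = 1e6
total = per_hypothesis + union_extra                        # ~4%
```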


Appendix H: Self-Assessment Checklist

Before moving to Section 06 (Stochastic Processes), verify you can:

  • State Markov's inequality from memory and prove it with the indicator trick
  • Derive Chebyshev from Markov by applying Markov to (X-\mu)^2
  • Prove Hoeffding's lemma using convexity of e^{tx} and Taylor expansion
  • Apply the Chernoff method: Markov + MGF bound + optimise over s
  • State McDiarmid's inequality and identify c_i for a given function
  • Compute sample complexity for given (\varepsilon, \delta) using Hoeffding
  • Define VC dimension and compute it for halfspaces (= d+1 in \mathbb{R}^d)
  • Write the PAC generalisation bound with |\mathcal{H}| or d_{VC}
  • Define Rademacher complexity and explain its interpretation
  • Identify when Bernstein beats Hoeffding (small variance case)
  • Give a non-example for sub-Gaussian (a heavy-tailed distribution)
  • Explain why VC bounds are vacuous for LLMs and what PAC-Bayes offers

Appendix I: Connections to Information Theory

I.1 Kullback-Leibler Divergence and Exponential Families

The Chernoff method has a deep information-theoretic interpretation. The optimal exponent in the Chernoff bound equals the KL divergence under the tilted distribution.

For X_1, \ldots, X_n i.i.d. with distribution P, and target \bar{x} = a > \mathbb{E}[X]:

\lim_{n \to \infty} \frac{1}{n}\log P(\bar{X}_n \geq a) = -D_{\mathrm{KL}}(P_a \| P)

where P_a is the exponential tilt of P that puts its mean at a: P_a(x) \propto e^{s^* x} P(x) with \mathbb{E}_{P_a}[X] = a.

This is Cramer's theorem - the foundation of large deviations theory. The probability of a rare event decays exponentially at rate given by the KL divergence to the closest distribution that makes the event typical.

For AI: This connection explains why cross-entropy loss (KL divergence from data to model) is the right loss for language models: minimising cross-entropy = maximising the probability of the training data = minimising the Chernoff exponent for generalisation error.

I.2 Entropy and Sauer-Shelah

The growth function \Pi_\mathcal{H}(n) counts the number of distinct behaviours of \mathcal{H} on n points. The log growth rate:

h(\mathcal{H}) = \lim_{n \to \infty} \frac{\log_2 \Pi_\mathcal{H}(n)}{n}

is the combinatorial entropy of \mathcal{H}. Sauer-Shelah shows:

  • If d_{VC} < \infty: h(\mathcal{H}) = 0 (subexponential growth, polynomial in n)
  • If d_{VC} = \infty: h(\mathcal{H}) = 1 (full exponential growth 2^n)

This binary distinction - polynomial vs exponential growth - is precisely what separates PAC-learnable from non-learnable hypothesis classes.


Appendix J: Connections to Optimisation

J.1 Convergence of SGD via Hoeffding

The standard convergence analysis of SGD for convex functions uses:

  1. Descent lemma: \mathcal{L}(\theta_{t+1}) \leq \mathcal{L}(\theta_t) - \eta g_t^\top(\theta_t - \theta^*) + \frac{\eta^2 L}{2}\|g_t\|_2^2

  2. Gradient concentration: \|\hat{g}_t - g_t\|_2 \leq \varepsilon with high probability (Hoeffding)

  3. Telescoping: Sum over T steps and apply a Hoeffding union bound over all T steps (adding \log T to the sample complexity)

The result: after T = O(G^2/(\varepsilon^2\eta^2)) steps, SGD achieves \mathcal{L}(\theta_T) - \mathcal{L}(\theta^*) \leq \varepsilon with high probability.

J.2 Generalisation and Algorithmic Stability

Definition (Uniform Stability). Algorithm \mathcal{A} is \beta-uniformly stable if for any two datasets S, S' differing in one example:

\sup_z |\ell(\mathcal{A}(S), z) - \ell(\mathcal{A}(S'), z)| \leq \beta

Theorem (Bousquet-Elisseeff, 2002). If \mathcal{A} is \beta-stable with \beta \leq c/n for some constant c, and losses are bounded in [0, 1], then with probability \geq 1-\delta:

R(\mathcal{A}(S)) \leq \hat{R}(\mathcal{A}(S)) + 2\beta + (4n\beta + 1)\sqrt{\frac{\log(1/\delta)}{2n}}

SGD is stable. For L-smooth convex losses, SGD with step size \eta for T steps is 2\eta TL/n-uniformly stable. Setting \eta = O(1/\sqrt{T}) gives stability O(\sqrt{T}/n) - small when T \ll n^2.

This explains why early stopping helps: more SGD steps = less stable = worse generalisation.


Appendix K: Concentration Phenomena in Transformer Architectures

K.1 Attention Score Concentration

In scaled dot-product attention, query-key inner products Q_i K_j^\top / \sqrt{d_k} determine the attention weights. For queries and keys with i.i.d. \mathcal{N}(0, 1) entries (the standard initialisation scale), the scaled inner products satisfy approximately:

P(|Q_i K_j^\top| / \sqrt{d_k} \geq t) \lesssim 2e^{-t^2/2}

Without the \sqrt{d_k} scaling, the softmax input Q_i K_j^\top has standard deviation O(\sqrt{d_k}), pushing softmax into its saturating regime (near-uniform or near-one-hot). The scaling ensures standard deviation O(1), keeping gradients flowing.

Formal justification. With d_k-dimensional queries and keys, Q_i K_j^\top = \sum_{l=1}^{d_k} Q_{il} K_{jl}, where each summand has mean zero and variance 1 (a product of independent standard normals is strictly sub-exponential rather than sub-Gaussian). The sum therefore has variance d_k and, by the CLT, is approximately \mathcal{N}(0, d_k). After scaling, Q_i K_j^\top / \sqrt{d_k} has standard deviation \approx 1, which keeps softmax in its responsive range.
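A quick simulation of this variance calculation (the head dimension and sample count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, trials = 64, 100_000

q = rng.normal(size=(trials, d_k))   # i.i.d. N(0, 1) entries
k = rng.normal(size=(trials, d_k))

raw = (q * k).sum(axis=1)            # unscaled logits, sd ~ sqrt(d_k)
scaled = raw / np.sqrt(d_k)          # what attention actually feeds to softmax

raw_sd = float(raw.std())            # ~ sqrt(64) = 8
scaled_sd = float(scaled.std())      # ~ 1
```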

K.2 Concentration in Residual Networks

Deep residual networks x_{l+1} = x_l + f_l(x_l) accumulate residual perturbations. If each residual block f_l has bounded output \|f_l(x)\|_2 \leq c, then by Azuma's inequality applied to the martingale-like sequence x_l:

P(\|x_L - x_0\|_2 \geq t) \leq \exp\!\left(-\frac{t^2}{2Lc^2}\right)

This says: even with L = 100 layers, if each residual block adds a small perturbation (c \ll 1), the total drift is bounded. LayerNorm and weight decay enforce exactly this: they keep \|f_l(x)\|_2 small, ensuring the representation doesn't drift exponentially with depth.

K.3 LoRA and Low-Rank Concentration

LoRA (Low-Rank Adaptation) approximates weight updates as \Delta W = AB where A \in \mathbb{R}^{d \times r}, B \in \mathbb{R}^{r \times k} with small rank r. The resulting hypothesis class has reduced complexity:

Rademacher complexity of LoRA. For the function class \{x \mapsto W_0 x + AB x : \|A\|_F \leq s_A, \|B\|_F \leq s_B\}:

\mathfrak{R}_n(\mathcal{F}_{\text{LoRA}}) \leq \frac{s_A s_B \sqrt{r} \cdot C}{\sqrt{n}}

where C = \max_i \|x^{(i)}\|_2. The key point: the Rademacher complexity scales as \sqrt{r}, not \sqrt{dk} (full fine-tuning). With r \ll \min(d,k), LoRA has a provably smaller generalisation gap than full fine-tuning, explaining why it doesn't overfit despite being used on small datasets.


Appendix L: Practical Guidelines for ML Practitioners

L.1 When to Report Confidence Intervals

Always report when:

  • Sample size n < 10000 (Hoeffding CI is meaningful)
  • Comparing two systems (the CI overlap tells you if the difference is significant)
  • Benchmarking LLMs on standard evaluations

Minimum information to report: point estimate, sample size, CI method (Hoeffding vs Normal approximation), confidence level.

L.2 Common Benchmarking Mistakes

| Mistake | Statistical Problem | Fix |
| --- | --- | --- |
| "Model A (73.2%) > Model B (72.8%) on 1000 questions" | 95% Hoeffding CI is about ±4.3%, so the difference is insignificant | Need n \geq 19{,}000 for a ±1% CI |
| "Best of 10 model variants is significant" | Multiple comparisons: 10 tests at 0.05 allow up to a 0.5 false-positive rate | Bonferroni: run each test at 0.05/10 = 0.005 |
| "Averaging over 5 datasets gives a reliable comparison" | Each dataset is a separate test | Report a CI for each; meta-analysis requires care |
| "Our method wins on 8/10 tasks" | Sign test: \geq 9/10 wins needed for p < 0.05 | Use a paired t-test or Wilcoxon signed-rank |
| "1000-epoch training curve shows improvement" | Training-curve loss is not test performance | Report test error at convergence, with a CI |

L.3 Required Sample Sizes (Reference Table)

For binary outcomes (accuracy), using Hoeffding with 95% confidence (\delta = 0.05):

| Desired precision (\varepsilon) | Required n |
| --- | --- |
| ±10% (0.10) | 185 |
| ±5% (0.05) | 738 |
| ±3% (0.03) | 2,050 |
| ±2% (0.02) | 4,612 |
| ±1% (0.01) | 18,445 |
| ±0.5% (0.005) | 73,778 |

Note: These are conservative (distribution-free). The normal approximation gives roughly half as many, valid when n\hat{p}(1-\hat{p}) \geq 5.
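The table can be regenerated with the helper from Appendix G.2:

```python
import math

def hoeffding_sample_size(epsilon, delta=0.05, lo=0.0, hi=1.0):
    """n >= (hi - lo)^2 log(2/delta) / (2 eps^2), as in Appendix G.2."""
    return math.ceil((hi - lo) ** 2 * math.log(2 / delta) / (2 * epsilon ** 2))

table = {eps: hoeffding_sample_size(eps)
         for eps in (0.10, 0.05, 0.03, 0.02, 0.01, 0.005)}
```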

L.4 Interpreting Generalisation Bounds for Deep Learning

Current theoretical bounds are often loose for deep networks. Practitioners should interpret them as:

  1. Order-of-magnitude guidance: A bound of O(\sqrt{d/n}) correctly predicts that the gap shrinks as 1/\sqrt{n} with more data, but may overestimate the absolute gap by 10-100x
  2. Relative comparisons: PAC-Bayes bounds tend to rank models by generalisation risk correctly even when absolute values are imprecise
  3. Design principles: Results like "norm-constrained models generalise better" translate directly to regularisation practice (weight decay, gradient clipping, dropout)
  4. Not absolute guarantees: A theoretical bound of 40% error doesn't mean the model has 40% test error - it means the theory alone can't rule out 40% test error

The theory is most useful as a framework for thinking, not as a numerical oracle.


Appendix M: References and Further Reading

Primary Sources

  1. Boucheron, Lugosi & Massart (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press. - The definitive modern treatment.

  2. Hoeffding, W. (1963). Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, 58(301), 13-30.

  3. McDiarmid, C. (1989). On the Method of Bounded Differences. Surveys in Combinatorics, 141, 148-188.

  4. Vapnik, V. & Chervonenkis, A. (1971). On the Uniform Convergence of Relative Frequencies to Their Probabilities. Theory of Probability and Its Applications, 16(2), 264-280.

  5. Valiant, L. (1984). A Theory of the Learnable. Communications of the ACM, 27(11), 1134-1142.

Learning Theory Textbooks

  1. Shalev-Shwartz & Ben-David (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press. - Best introduction to PAC learning for ML practitioners.

  2. Mohri, Rostamizadeh & Talwalkar (2018). Foundations of Machine Learning (2nd ed.). MIT Press. - Covers Rademacher complexity and generalisation theory in depth.

  3. Wainwright, M. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press. - Advanced treatment, covers matrix concentration and modern techniques.

Deep Learning Theory

  1. Zhang, Bengio et al. (2017). Understanding Deep Learning Requires Rethinking Generalization. ICLR 2017. - Showed memorisation capacity, challenging classical VC theory.

  2. Dziugaite & Roy (2017). Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many Parameters. UAI 2017. - First non-vacuous PAC-Bayes bounds for deep nets.

  3. Belkin et al. (2019). Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off. PNAS 2019. - Double descent, overparameterisation.

  4. Foret et al. (2021). Sharpness-Aware Minimization for Efficiently Improving Generalization. ICLR 2021. - SAM as PAC-Bayes operationalisation.


Appendix N: Extended Worked Problems

N.1 Bounding the Maximum of Random Variables

Problem. Let X_1, \ldots, X_n \sim \mathcal{N}(0,1) i.i.d. What is \mathbb{E}[\max_i X_i], approximately?

Answer. Each X_i is 1-sub-Gaussian. By the union bound:

P(\max_i X_i \geq t) \leq n \cdot P(X_1 \geq t) \leq n e^{-t^2/2}

Setting n e^{-t^2/2} = 1 gives t^* = \sqrt{2\log n}. So \max_i X_i = O(\sqrt{\log n}) with high probability.

More precisely, \mathbb{E}[\max_i X_i] = \sqrt{2\log n}\,(1 - O(\log\log n/\log n)). For n = 1000 the leading term is \sqrt{2 \cdot 6.9} \approx 3.7; the lower-order correction brings the expectation down to \approx 3.2, matching Monte Carlo.

For AI: The maximum unscaled attention logit in a head with n tokens and d_k-dimensional keys satisfies \max_j Q K_j^\top \approx \sqrt{2 d_k \log n}. The softmax of this concentrates on one or a few tokens for large d_k - explaining why attention becomes "sharp" (spiky) in deeper layers where d_k is large.
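The \sqrt{2\log n} formula from N.1 can be checked by simulation (trial counts chosen here for speed):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 5000

maxima = rng.normal(size=(trials, n)).max(axis=1)
mc_mean = float(maxima.mean())                # Monte Carlo E[max], ~3.2 for n = 1000
leading_term = float(np.sqrt(2 * np.log(n)))  # ~3.72; overshoots by the correction term
```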

N.2 Symmetric Rounding

Problem. An algorithm rounds each of n = 100 numbers independently up or down by at most 0.5. What is the probability the total rounding error exceeds 10?

Solution. Each error E_i \in [-0.5, 0.5] with \mathbb{E}[E_i] = 0. The sum S = \sum E_i has ranges b_i - a_i = 1, so by Hoeffding:

P(S \geq 10) \leq \exp\!\left(-\frac{2t^2}{\sum_i (b_i - a_i)^2}\right) = \exp\!\left(-\frac{2 \cdot 10^2}{100 \cdot 1^2}\right) = e^{-2} \approx 0.135

The two-sided version: P(|S| \geq 10) \leq 2e^{-2} \approx 0.27.
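Both the bound and the (much smaller) true tail can be checked by simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 100, 50_000, 10.0

S = rng.uniform(-0.5, 0.5, size=(trials, n)).sum(axis=1)
empirical = float(np.mean(np.abs(S) >= t))                   # true two-sided tail, ~5e-4
hoeffding = float(2 * np.exp(-2 * t ** 2 / (n * 1.0 ** 2)))  # 2 e^{-2} ~ 0.27
```

The Hoeffding bound is valid but loose here: the actual tail (the sum has standard deviation \sqrt{100/12} \approx 2.9, so 10 is a 3.5-sigma event) is orders of magnitude smaller.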

N.3 PAC Bound for a Neural Network Classifier

Problem. A student trains a 3-layer neural network on n = 5000 examples and achieves 2% training error. Using the finite-class PAC bound with |\mathcal{H}| = 2^{10^6} (very conservatively, assuming 1 bit per parameter for 10^6 parameters), what does the bound say about test error?

Solution. The PAC bound (agnostic, finite class):

R(\hat{h}) \leq \hat{R}(\hat{h}) + \sqrt{\frac{\log(2|\mathcal{H}|/\delta)}{2n}}

With \delta = 0.05 and |\mathcal{H}| = 2^{10^6}:

\sqrt{\frac{\log(2 \cdot 2^{10^6}/0.05)}{10000}} = \sqrt{\frac{10^6 \ln 2 + \ln 40}{10000}} \approx \sqrt{\frac{693150}{10000}} \approx 8.3

Test error bound: 0.02 + 8.3 \approx 8.32 - a completely vacuous bound (it exceeds 100%). This illustrates why VC/parameter-counting bounds are useless for deep networks. PAC-Bayes or Rademacher complexity (using \|\mathbf{w}\|_F rather than parameter count) gives non-vacuous results.

N.4 Proving Generalisation for 1-Nearest Neighbour

Problem. For k-nearest-neighbour classification with k = 1: (a) Show that the leave-one-out error is an unbiased estimate of the true error. (b) Use McDiarmid to bound the deviation of the leave-one-out estimate from its expectation.

Solution. (a) The leave-one-out (LOO) error \hat{R}_{\text{LOO}} = \frac{1}{n}\sum_i \mathbf{1}[\hat{y}_{-i} \neq y^{(i)}], where \hat{y}_{-i} is the prediction on x^{(i)} using all other training points. By symmetry of the i.i.d. draw, \mathbb{E}[\hat{R}_{\text{LOO}}] = R_{n-1} (the true error of 1-NN trained on n-1 examples).

(b) Swapping one training example changes \hat{R}_{\text{LOO}} by at most 2/n (in the idealised accounting where it affects at most 2 LOO predictions). By McDiarmid:

P(|\hat{R}_{\text{LOO}} - \mathbb{E}[\hat{R}_{\text{LOO}}]| \geq t) \leq 2\exp\!\left(-\frac{2t^2}{\sum_i (2/n)^2}\right) = 2\exp\!\left(-\frac{nt^2}{2}\right)

N.5 Confidence Region for Gradient Descent Convergence

Problem. SGD is run for T iterations with step size \eta = 1/\sqrt{T}. Each gradient estimate is sub-Gaussian with parameter \sigma^2. With what probability does SGD achieve \mathcal{L}(\theta_T) - \mathcal{L}^* \leq \varepsilon?

Solution. Standard SGD analysis shows: \mathbb{E}[\mathcal{L}(\theta_T) - \mathcal{L}^*] \leq \frac{R^2 + \sigma^2}{\sqrt{T}} where R = \|\theta_0 - \theta^*\|_2.

For the high-probability version: at each step t, P(\|g_t - \hat{g}_t\|_2 \geq u) \leq 2e^{-mu^2/(2\sigma^2)} by sub-Gaussianity (batch size m). Taking a union bound over all T steps and using an Azuma-type argument for the algorithm trajectory:

P(\mathcal{L}(\theta_T) - \mathcal{L}^* \geq \varepsilon) \leq \delta

when T = O\left(\frac{R^2 \log(1/\delta) + \sigma^2 \log(T/\delta)}{\varepsilon^2}\right).


Appendix O: Visualisation Notes

O.1 Plotting Concentration Inequalities

When visualising tail bounds, always show:

  1. The exact tail probability (computed numerically) as the ground truth
  2. The bound as a curve above the exact probability
  3. The ratio (bound/exact) to show how loose the bound is
TAIL BOUND TIGHTNESS - Gaussian(0,1)
========================================================================

  t   | P(X >= t) exact | Markov* | Chebyshev | Hoeffding | Bound/Exact
  --------------------------------------------------------------------
  1   | 0.159  | N/A    | 1.000    | 0.607     |    6.3x / 3.8x
  2   | 0.023  | N/A    | 0.250    | 0.135     |     11x / 5.9x
  3   | 0.0013 | N/A    | 0.111    | 0.011     |     85x / 8.5x
  4   | 3.2e-5 | N/A    | 0.0625   | 3.4e-4    |   1953x / 11x
  5   | 2.9e-7 | N/A    | 0.040    | 3.7e-6    | 138,000x / 13x

  *Markov requires X >= 0; not directly applicable to the symmetric Gaussian

========================================================================

At t = 5\sigma, Chebyshev overestimates by a factor of 138,000. The Hoeffding (sub-Gaussian) bound is tight in the exponent - the Gaussian is exactly the sub-Gaussian equality case - though it still loses a polynomially growing factor against the exact tail.
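The exact column and the bound columns can be regenerated with the complementary error function:

```python
import math

def gaussian_tail(t):
    """Exact N(0,1) upper tail P(X >= t) via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2))

# t -> (exact tail, Chebyshev bound 1/t^2, sub-Gaussian bound e^{-t^2/2})
rows = {t: (gaussian_tail(t), 1.0 / t ** 2, math.exp(-t ** 2 / 2))
        for t in (1, 2, 3, 4, 5)}
```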

O.2 Sample Complexity Curves

The sample complexity n(\varepsilon, \delta) = \frac{\log(2/\delta)}{2\varepsilon^2} has a characteristic shape:

  • O(1/\varepsilon^2): doubling precision requires 4x more data
  • O(\log(1/\delta)): going from 95% to 99.9% confidence multiplies n by only \log(2/0.001)/\log(2/0.05) = 7.6/3.7 \approx 2.1 - logarithmically cheap

This 1/\varepsilon^2 scaling is fundamental: it appears in:

  • Hoeffding (statistical estimation)
  • Monte Carlo (variance \sigma^2, error \sigma/\sqrt{n})
  • PAC learning (generalisation gap)
  • Stochastic optimisation (SGD convergence rate)

All are manifestations of the same underlying concentration phenomenon.


Appendix P: Glossary of Key Terms

| Term | Definition | Section |
| --- | --- | --- |
| Concentration inequality | A bound on P(X - \mathbb{E}[X] \geq t) or P(\lvert X - \mathbb{E}[X]\rvert \geq t) | Section 1 |
| Sub-Gaussian random variable | X with \mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2} for all t | Section 3.1 |
| Sub-Gaussian parameter | \sigma^2: the proxy variance in the sub-Gaussian definition | Section 3.1 |
| Rademacher variable | \sigma \in \{-1, +1\} with equal probability | Section 3.2 |
| Hoeffding's lemma | Bounded RV X \in [a,b] with zero mean \Rightarrow \mathbb{E}[e^{tX}] \leq e^{t^2(b-a)^2/8} | Section 4.1 |
| Chernoff method | Tail bound via P(X \geq t) \leq e^{-st}\mathbb{E}[e^{sX}], optimised over s | Section 5.1 |
| Bounded differences | \lvert f(\cdots x_i \cdots) - f(\cdots x_i' \cdots)\rvert \leq c_i for each coordinate i | - |
| Union bound | P(\bigcup A_i) \leq \sum P(A_i) | Section 8.1 |
| Covering number | Smallest \varepsilon-net size for a set under a metric | Section 8.3 |
| Shattering | \mathcal{H} shatters C if it can realise all 2^{\lvert C\rvert} labellings of C | - |
| VC dimension | Size of the largest shattered set | Section 9.4 |
| Growth function | \Pi_\mathcal{H}(n): max labellings of n points by \mathcal{H} | App D.1 |
| PAC learning | Probably approximately correct: R(\hat{h}) \leq \min_h R(h) + \varepsilon w.p. \geq 1-\delta | Section 9.1 |
| Uniform convergence | \sup_h \lvert\hat{R}(h) - R(h)\rvert \leq \varepsilon with high probability | - |
| Rademacher complexity | \mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_\sigma[\sup_f \frac{1}{n}\sum \sigma_i f(x^{(i)})] | Section 10.1 |
| PAC-Bayes | Generalisation bound using D_{\mathrm{KL}}(\text{posterior}\,\Vert\,\text{prior}) | App E.4 |
| Algorithmic stability | Output insensitive to swapping one training example | App J.2 |
| Exponential tilt | Twisted distribution P_s(x) \propto e^{sx}P(x) used in Chernoff | App I.1 |

Appendix Q: The Method of Types and Large Deviations

Q.1 Types and Empirical Distributions

The method of types is an elegant combinatorial approach to concentration that complements the analytical MGF-based approach.

Definition. For a sequence x^n = (x_1, \ldots, x_n) \in \mathcal{X}^n, the type (empirical distribution) is:

P_{x^n}(a) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[x_i = a], \quad a \in \mathcal{X}

The set of all sequences of type Q is the type class T_n(Q).

Key counting result. For a finite alphabet |\mathcal{X}| = k:

|T_n(Q)| = \binom{n}{nQ(x_1), \ldots, nQ(x_k)} \approx e^{nH(Q)}

where H(Q) = -\sum_{x} Q(x)\log Q(x) is the entropy of Q. This means there are approximately e^{nH(Q)} sequences with empirical distribution Q.

Q.2 Connection to Chernoff Bounds

For i.i.d. X_i \sim P and empirical distribution P_{x^n}:

P(P_{X^n} = Q) \leq e^{-n D_{\mathrm{KL}}(Q \| P)}

This is the method-of-types bound. Applying it to the event \bar{X}_n \geq a:

P(\bar{X}_n \geq a) \leq (n+1)^{|\mathcal{X}|} e^{-n D_{\mathrm{KL}}(Q^* \| P)}

where Q^* is the type closest to P (in KL) with mean \geq a. For large n, the polynomial factor (n+1)^{|\mathcal{X}|} is negligible, and this matches the Chernoff bound with the KL-divergence interpretation.

Q.3 Cramer's Theorem

Theorem (Cramér, 1938). For i.i.d. X_1, \ldots, X_n with distribution P:

\lim_{n\to\infty} \frac{1}{n} \log P(\bar{X}_n \geq a) = -I(a)

where I(a) = \sup_{t} \{ta - \log M_X(t)\} is the rate function (the Legendre transform of the log-MGF), and I(a) = D_{\mathrm{KL}}(P_a \| P), where P_a is the exponentially tilted distribution with mean a.

This is the fundamental result of large deviations theory: the probability of a rare event decays exponentially, at a rate given by the KL distance to the "closest" distribution that makes the event typical.
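For a Bernoulli(p) variable, the supremum defining I(a) can be evaluated numerically and compared against its KL form — a small sketch (grid search over t; the values p = 0.5, a = 0.75 are illustrative):

```python
import math

def rate_function_bernoulli(a, p, t_max=10.0, steps=10000):
    """I(a) = sup_t { t*a - log M_X(t) } for X ~ Bernoulli(p), via a grid over t >= 0."""
    best = 0.0
    for i in range(steps):
        t = t_max * i / steps
        best = max(best, t * a - math.log(1 - p + p * math.exp(t)))
    return best

def kl_bernoulli(a, p):
    """D_KL(Ber(a) || Ber(p)) in nats."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

p, a = 0.5, 0.75
print(rate_function_bernoulli(a, p))   # Legendre-transform form of I(a)
print(kl_bernoulli(a, p))              # KL form of I(a) - the two agree
```

The optimal tilt here is t^* = \log\frac{a(1-p)}{(1-a)p}, which lies inside the grid; both computations give the same rate.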

For AI: Language model perplexity is \exp(\text{cross-entropy}) = \exp(H(\text{true dist}) + D_{\mathrm{KL}}(\text{true dist}\|\text{model})), so low perplexity means the KL term is small: the model is close to the true distribution and unusual tokens are not overly penalised. Cramér's theorem ties rare-event probabilities to exactly this KL divergence, one reason cross-entropy is a natural loss.


Appendix R: Generalisation in the Overparameterised Regime

R.1 The Classical Puzzle

Classical theory (VC, Rademacher) says: more parameters = larger hypothesis class = worse generalisation. Yet in practice, overparameterised models (d \gg n) often generalise better than their smaller counterparts.

The resolution involves multiple mechanisms:

1. Implicit regularisation. Gradient descent on overparameterised linear models converges to the minimum-\ell_2-norm solution. Among all interpolating solutions, this has the smallest Rademacher complexity.

2. Double descent. As model capacity increases, test error peaks near the interpolation threshold (d \approx n) and then falls again in the overparameterised regime. For the minimum-norm interpolating solution, the variance contribution to the generalisation error is approximately:

\text{MSE}_{\text{var}} \approx \frac{\sigma^2 n}{d - n} = \sigma^2 \cdot \frac{n/d}{1 - n/d}

which blows up as d \to n from above and decreases towards zero as d/n \to \infty (the overparameterised regime).

3. Benign overfitting. Bartlett et al. (2020) showed that for linear models with Gaussian data, interpolation (zero training error) can generalise well when most singular values of the data matrix are small - the model "memorises" noise in directions orthogonal to the signal.
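The double-descent shape shows up in a minimal simulation (all dimensions, noise levels, and trial counts below are illustrative, not from any cited experiment): fit the minimum-norm interpolator \hat{\beta} = X^+ y just past the threshold and deep in the overparameterised regime, and compare parameter errors.

```python
import numpy as np

def min_norm_param_error(n, d, sigma=0.5, trials=100, seed=0):
    """Average ||beta_hat - beta||^2 for the minimum-l2-norm interpolator beta_hat = X^+ y."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        beta = rng.standard_normal(d)
        beta /= np.linalg.norm(beta)              # unit-norm signal
        X = rng.standard_normal((n, d))
        y = X @ beta + sigma * rng.standard_normal(n)
        beta_hat = np.linalg.pinv(X) @ y          # min-norm solution among interpolators (d > n)
        errs.append(float(np.sum((beta_hat - beta) ** 2)))
    return float(np.mean(errs))

n = 40
near_threshold = min_norm_param_error(n, d=46)    # variance term ~ sigma^2 n/(d-n) is large
overparam = min_norm_param_error(n, d=400)        # variance term is small; bias dominates
print(near_threshold, overparam)
```

Just past the threshold the noise is fit through a nearly singular design, inflating the error; far past it the interpolator spreads the noise thinly across many directions.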

R.2 Neural Tangent Kernel (NTK) Theory

Wide neural networks in the "lazy training" (NTK) regime behave like kernel methods. The generalisation bound becomes:

R(\hat{h}) \leq \hat{R}(\hat{h}) + O\!\left(\sqrt{\frac{\hat{\mathbf{y}}^\top (K_{nn})^{-1} \hat{\mathbf{y}} + \log(1/\delta)}{n}}\right)

where K_{nn} is the NTK kernel matrix. The key quantity \hat{\mathbf{y}}^\top K_{nn}^{-1} \hat{\mathbf{y}} is the effective complexity - analogous to Rademacher complexity but computed from the kernel. For networks trained with weight decay, this remains bounded even as depth and width grow.

R.3 PAC-Bayes for Overparameterised Models

The most successful approach for non-vacuous bounds on deep networks uses PAC-Bayes. The bound:

\mathbb{E}_{h\sim Q}[R(h)] \leq \mathbb{E}_{h\sim Q}[\hat{R}(h)] + \sqrt{\frac{D_{\mathrm{KL}}(Q\|P) + \log(1/\delta)}{2(n-1)}}

applies to a stochastic classifier h \sim Q. For weight-perturbed networks (Gaussian posterior \mathcal{N}(\hat{\theta}, \sigma^2 I) against a Gaussian prior \mathcal{N}(\theta_0, \sigma_0^2 I), with \sigma \approx \sigma_0):

D_{\mathrm{KL}}(Q\|P) \approx \frac{\|\hat{\theta} - \theta_0\|_2^2}{2\sigma_0^2}

This is small when the fine-tuned weights \hat{\theta} don't stray far from the initialisation \theta_0. SAM (Sharpness-Aware Minimisation) explicitly seeks flat minima, where a broad posterior around \hat{\theta} keeps both the KL term and the perturbed empirical risk small.
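The closed-form KL between isotropic Gaussians makes the approximation concrete — a small sketch (dimension and scales are illustrative; when the posterior and prior variances match, the mean term is the whole KL):

```python
import numpy as np

def kl_isotropic_gaussian(theta_hat, theta0, sigma, sigma0):
    """KL( N(theta_hat, sigma^2 I) || N(theta0, sigma0^2 I) ) in nats."""
    d = theta_hat.size
    ratio = sigma ** 2 / sigma0 ** 2
    return 0.5 * (d * (ratio - np.log(ratio) - 1)
                  + np.sum((theta_hat - theta0) ** 2) / sigma0 ** 2)

rng = np.random.default_rng(1)
theta0 = rng.standard_normal(1000)                       # "pre-trained" weights
theta_hat = theta0 + 0.01 * rng.standard_normal(1000)    # fine-tuning stays near init
kl = kl_isotropic_gaussian(theta_hat, theta0, 0.05, 0.05)
mean_term = np.sum((theta_hat - theta0) ** 2) / (2 * 0.05 ** 2)
print(kl, mean_term)   # identical when sigma == sigma0
```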


Appendix S: Concentration in Continuous-Time Settings

S.1 Azuma's Inequality for Martingales

We've seen Azuma's inequality used to prove McDiarmid. Here's the full statement:

Theorem (Azuma-Hoeffding). Let (Z_0, Z_1, \ldots, Z_n) be a martingale (or supermartingale) with Z_0 = 0 and |Z_k - Z_{k-1}| \leq c_k a.s. Then:

P(Z_n \geq t) \leq \exp\!\left(-\frac{t^2}{2\sum_{k=1}^n c_k^2}\right)

Proof. Apply the Chernoff method: P(Z_n \geq t) \leq e^{-st} \mathbb{E}[e^{sZ_n}]. For a martingale, \mathbb{E}[e^{sZ_k} \mid Z_1, \ldots, Z_{k-1}] \leq e^{sZ_{k-1}} \cdot e^{s^2c_k^2/2} by Hoeffding's lemma (conditionally on the past, the increment Z_k - Z_{k-1} \in [-c_k, c_k] is bounded with mean zero). Telescoping: \mathbb{E}[e^{sZ_n}] \leq e^{s^2\sum_k c_k^2/2}. Optimise: s = t/\sum_k c_k^2. \blacksquare
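A quick simulation checks the bound on the simplest bounded-increment martingale, a ±1 random walk (step count, trial count, and threshold are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 100, 20000, 25.0

# Z_n = sum of n independent +/-1 steps: a martingale with increments c_k = 1
Z_n = rng.choice([-1, 1], size=(trials, n)).sum(axis=1)
empirical_tail = float(np.mean(Z_n >= t))
azuma = float(np.exp(-t ** 2 / (2 * n)))    # exp(-t^2 / (2 sum_k c_k^2)) with c_k = 1
print(empirical_tail, azuma)                # the empirical frequency sits below the bound
```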

S.2 Doob's Optional Stopping

Theorem. If (M_n) is a martingale and \tau is a bounded stopping time, then \mathbb{E}[M_\tau] = \mathbb{E}[M_0].

Application to SGD. Define M_t = \|\theta_t - \theta^*\|_2^2 - t \cdot C\eta^2\sigma^2 where C is a constant. Under mild conditions, M_t is a supermartingale. Azuma's inequality applied to the stopped process gives high-probability guarantees on convergence time.

This is the mathematical foundation of high-probability SGD convergence - stronger than the usual in-expectation guarantees.


Appendix T: Summary Diagrams

T.1 The Proof Technique Map

PROOF TECHNIQUES FOR CONCENTRATION INEQUALITIES
========================================================================

  Core tool: Markov's inequality
  -------------------------------------------------------------------

  Apply Markov to (X-\mu)^2       --> Chebyshev
  Apply Markov to e^{sX}          --> Chernoff method
       +-- with Hoeffding's lemma  --> Hoeffding's inequality
       +-- with Bernstein's lemma  --> Bernstein's inequality
       +-- with Binomial MGF       --> Chernoff for Bin(n,p)

  Apply Azuma to Doob martingale  --> McDiarmid's inequality
  Union bound over hypothesis class --> PAC generalisation bound
  Union bound + covering number    --> Infinite-class PAC
  Rademacher symmetrisation        --> Rademacher complexity bound

  -------------------------------------------------------------------
  Common pattern: P(X \geq t) \leq E[e^{sX}] / e^{st}  (all MGF methods)
========================================================================

T.2 Bound Quality Comparison

TAIL PROBABILITY COMPARISON - X = mean of n=100 Bernoulli(0.5), t=0.1
========================================================================

  Method           Bound          Notes
  -----------------------------------------------------------------
  Exact (CLT)      ~0.046         Normal approx; valid for large n
  Hoeffding        0.135          ~3x exact; the tightest bound here
  Chebyshev        0.250          ~5x exact
  Markov           0.500          ~11x exact (applied to |X - 0.5|, via E|X - 0.5| <= sigma = 0.05)

  Same setup, t=0.2:
  Exact            ~0.000046
  Hoeffding        0.00034        ~7x exact
  Chebyshev        0.0625         ~1360x exact

========================================================================

Appendix U: Proofs of Sub-Gaussian Closure and Examples

U.1 Cosh Inequality for Rademacher Variables

Claim. For \varepsilon \in \{-1, +1\} uniform: \mathbb{E}[e^{t\varepsilon}] = \cosh(t) \leq e^{t^2/2}.

Proof. \cosh(t) = \sum_{k=0}^\infty t^{2k}/(2k)! \leq \sum_{k=0}^\infty t^{2k}/(2^k k!) = \sum_{k=0}^\infty (t^2/2)^k/k! = e^{t^2/2}, using (2k)! \geq 2^k k!.

Interpretation. Rademacher variables are the "most concentrated" bounded variables in [-1,1]: their MGF is exactly \cosh(t), slightly below the sub-Gaussian bound e^{t^2/2}. (The factor of 2 multiplying Rademacher complexity in generalisation bounds is a separate matter: it comes from the symmetrisation step, which introduces a ghost sample.)

U.2 Sub-Gaussian Parameter for Bounded Variables

Claim. If X \in [a,b] with \mathbb{E}[X] = 0, then X is \frac{(b-a)^2}{4}-sub-Gaussian.

The parameter \frac{(b-a)^2}{4} is achieved when X = \frac{b-a}{2}\varepsilon with \varepsilon Rademacher - a two-point distribution at the endpoints. This is the "hardest" distribution on [a,b] for Hoeffding's lemma.

Tightness example. For X \in \{-1, +1\} uniform: variance = 1 and sub-Gaussian parameter = \frac{(1-(-1))^2}{4} = 1. They agree! For the Rademacher distribution, sub-Gaussian parameter = variance; for other distributions on [-1,1], sub-Gaussian parameter \geq variance.

U.3 Sub-Gaussian Characterisation via Tail Probabilities

The following are equivalent for a mean-zero random variable X:

  1. (MGF) \mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2} for all t
  2. (Tail) P(|X| \geq t) \leq 2e^{-t^2/(2\sigma^2)} for all t \geq 0
  3. (Moments) \|X\|_{L^p} = (\mathbb{E}[|X|^p])^{1/p} \leq \sigma\sqrt{p} for all p \geq 1

The equivalence (up to constants in \sigma) is non-trivial. The key step: (2) \Rightarrow (1) expands \mathbb{E}[e^{tX}] = 1 + \sum_{k=1}^\infty \frac{t^k \mathbb{E}[X^k]}{k!} and bounds each moment \mathbb{E}[|X|^k] by integrating the tail condition.
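For a standard Gaussian, condition (3) can be checked in closed form via E|X|^p = 2^{p/2}\,\Gamma((p+1)/2)/\sqrt{\pi} — a sketch (the range of p is illustrative):

```python
import math

def gaussian_abs_moment(p):
    """E|X|^p for X ~ N(0,1), via the closed-form Gamma expression."""
    return 2 ** (p / 2) * math.gamma((p + 1) / 2) / math.sqrt(math.pi)

for p in range(1, 11):
    lp_norm = gaussian_abs_moment(p) ** (1 / p)
    # The L^p norm grows like sqrt(p), staying below the sub-Gaussian ceiling sigma*sqrt(p)
    print(p, round(lp_norm, 3), round(math.sqrt(p), 3))
```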


Appendix V: Exercises with Solutions

V.1 Markov and Chebyshev Calculations

Problem. X \sim \text{Poisson}(5). Bound P(X \geq 12) using: (a) Markov's inequality (\mathbb{E}[X] = 5) (b) Chebyshev (\operatorname{Var}(X) = 5, so t = 12 - 5 = 7, i.e. k = 7/\sqrt{5} \approx 3.13 standard deviations) (c) The exact value (Poisson CDF)

Solution. (a) Markov: P(X \geq 12) \leq 5/12 \approx 0.417 (b) Chebyshev: P(X \geq 12) \leq P(|X - 5| \geq 7) \leq 5/49 \approx 0.102 (c) Exact: P(X \geq 12) = 1 - \sum_{k=0}^{11} e^{-5}5^k/k! \approx 0.0055

Chebyshev is roughly 19x loose; Markov roughly 76x. Both bounds hold but are very conservative.
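The three numbers can be reproduced directly, with the exact tail obtained by summing the Poisson pmf:

```python
import math

def poisson_tail(lam, k):
    """P(X >= k) for X ~ Poisson(lam)."""
    return 1 - sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))

lam, k = 5, 12
markov = lam / k                        # Markov on the non-negative variable X
chebyshev = lam / (k - lam) ** 2        # Chebyshev with deviation t = k - mean
exact = poisson_tail(lam, k)
print(markov, chebyshev, exact)
```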

V.2 Hoeffding Bound Computation

Problem. We estimate the click-through rate p of an ad by showing it to n users and recording clicks X_i \in \{0,1\}. We observe \hat{p} = 0.08 from n = 2000 users. Give a 99% Hoeffding confidence interval.

Solution. With \delta = 0.01 and b - a = 1:

\varepsilon = \sqrt{\frac{\log(2/0.01)}{2 \cdot 2000}} = \sqrt{\frac{\log 200}{4000}} = \sqrt{\frac{5.298}{4000}} = \sqrt{0.001325} \approx 0.0364

99% CI: [0.08 - 0.0364, 0.08 + 0.0364] = [0.0436, 0.1164].

Interpretation: we're 99% confident the true CTR is between 4.4% and 11.6% - a very wide interval! To tighten to \pm 1%: need n \geq \frac{\log 200}{2(0.01)^2} \approx 26{,}490 users.

V.3 Chernoff vs Hoeffding

Problem. S \sim \text{Bin}(500, 0.02), so \mu = 10. Bound P(S \geq 20).

Hoeffding: P(S \geq 20) = P(\bar{X} \geq 0.04) \leq \exp(-2 \cdot 500 \cdot (0.04-0.02)^2) = e^{-0.4} \approx 0.67. Useless!

Chernoff: \delta = (20-10)/10 = 1. P(S \geq 20) \leq e^{-10 \cdot 1^2/3} = e^{-3.33} \approx 0.036.

Exact: P(S \geq 20) \approx 0.0035 (Poisson approximation: e^{-10}\sum_{k=20}^\infty 10^k/k!).

Chernoff is roughly 10x loose; Hoeffding is essentially vacuous (a bound above 0.5 says almost nothing). For rare events with p \ll 1 and large n, Chernoff is the right tool.
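All three numbers are easy to reproduce, with the exact binomial tail computed via math.comb:

```python
import math

n, p, k = 500, 0.02, 20
mu = n * p

hoeffding = math.exp(-2 * n * (k / n - p) ** 2)          # bound on P(mean >= k/n)
chernoff = math.exp(-mu * ((k - mu) / mu) ** 2 / 3)      # multiplicative Chernoff
exact = sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))
print(hoeffding, chernoff, exact)
```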

V.4 McDiarmid on Dropout

Problem. A network output f(\mathbf{z}_1, \ldots, \mathbf{z}_L) depends on L = 10 layers of activations. Dropout independently zeroes each activation, changing the contribution of neuron j in layer l by at most c_{lj} = |w_{lj}| \cdot a_{lj}^{\max} / (1-p). If \sum_{l,j} c_{lj}^2 = 4, bound P(|f - \mathbb{E}[f]| \geq 0.5).

Solution. By McDiarmid: P(|f - \mathbb{E}[f]| \geq 0.5) \leq 2\exp(-2 \cdot 0.25/4) = 2e^{-0.125} \approx 1.76. The bound is vacuous! Either the c_{lj} are too large, or t is too small relative to the noise.

With larger networks and better-controlled weight norms (\sum c_{lj}^2 = 0.1): 2e^{-2 \cdot 0.25/0.1} = 2e^{-5} \approx 0.013. Now meaningful: the dropout output deviates by more than 0.5 less than 1.3% of the time.


Appendix W: Notation Reference for This Section

Symbol — Meaning
P(A) — Probability of event A
\mu = \mathbb{E}[X] — Mean of X
\sigma^2 = \operatorname{Var}(X) — Variance of X
M_X(t) = \mathbb{E}[e^{tX}] — Moment generating function
\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i — Sample mean
R(h) — True (population) risk of hypothesis h
\hat{R}(h) — Empirical risk of h on training data
\mathcal{H} — Hypothesis class
d_{VC} — VC dimension of \mathcal{H}
\Pi_\mathcal{H}(n) — Growth function
\mathfrak{R}_n(\mathcal{F}) — Rademacher complexity
\hat{\mathfrak{R}}_S(\mathcal{F}) — Empirical Rademacher complexity
\varepsilon — Accuracy parameter (how close to optimal)
\delta — Confidence parameter (failure probability)
c_i — Bounded-differences constant for coordinate i
\varepsilon-net — Covering set with spacing \varepsilon
\mathcal{N}(\mathcal{H}, d, \varepsilon) — Covering number: size of smallest \varepsilon-net
D_{\mathrm{KL}}(P\|Q) — KL divergence from Q to P
I(a) = \sup_t(ta - \log M(t)) — Cramér rate function
\sigma_i — Rademacher random variable in \{-1,+1\}
B_\delta(h) — Ball of radius \delta around hypothesis h

Appendix X: More on Rademacher Complexity

X.1 Computing Rademacher Complexity for Common Classes

Halfspaces with margin. For \mathcal{H}_\gamma = \{\mathbf{x} \mapsto \operatorname{sign}(\mathbf{w}^\top \mathbf{x}) : \|\mathbf{w}\|_2 \leq 1, \gamma\text{-margin}\} on data with \|\mathbf{x}^{(i)}\|_2 \leq 1:

\mathfrak{R}_n(\mathcal{H}_\gamma) \leq \frac{1}{\gamma\sqrt{n}}

The margin \gamma effectively replaces the dimension d! This is why SVMs with large margins generalise well in high dimensions - the complexity term is O(1/(\gamma\sqrt{n})), independent of d, rather than the O(\sqrt{d/n}) of an unconstrained linear class.

Two-layer neural network. For \mathcal{F} = \{\mathbf{x} \mapsto \mathbf{v}^\top \sigma(W\mathbf{x}) : \|W\|_F \leq B_W, \|\mathbf{v}\|_1 \leq B_v\}:

\mathfrak{R}_n(\mathcal{F}) \leq \frac{B_v B_W C \sqrt{2\log(2d)}}{\sqrt{n}}

where C = \max_i \|\mathbf{x}^{(i)}\|_2. The \sqrt{\log(2d)} factor comes from a finite-class (Massart) step in the proof; the \sqrt{n} denominator gives the expected O(1/\sqrt{n}) rate.

X.2 Rademacher Complexity and Mutual Information

A deep connection: the empirical Rademacher complexity \hat{\mathfrak{R}}_S(\mathcal{F}) can be bounded via the mutual information between the training labels and the class predictions. For bounded function classes:

\hat{\mathfrak{R}}_S(\mathcal{F}) \leq \sqrt{\frac{2 I(Y^n; f(X^n))}{n}}

where I(Y^n; f(X^n)) is the mutual information between labels and class predictions. This connects concentration theory to information theory - a richer theory actively developed (2020-2026) under the name "information-theoretic generalisation bounds".

X.3 Monte Carlo Estimation of Rademacher Complexity

In practice, \hat{\mathfrak{R}}_S(\mathcal{F}) can be estimated by Monte Carlo:

import numpy as np

def estimate_rademacher(X, W_norm=1.0, T=1000):
    """Monte Carlo estimate of Rademacher complexity for ||w||_2 <= W_norm."""
    n, d = X.shape
    rademachers = []
    for _ in range(T):
        sigma = np.random.choice([-1, 1], size=n)
        # The supremum over ||w||_2 <= W_norm is attained at w proportional to X^T sigma
        w_opt = X.T @ sigma
        w_norm = np.linalg.norm(w_opt)
        if w_norm > 0:
            w_opt = w_opt * W_norm / w_norm
        rademachers.append((sigma @ (X @ w_opt)) / n)
    return float(np.mean(rademachers))

For a dataset of n = 500 MNIST digits (784-dimensional) with W_\text{norm} = 1: empirical \hat{\mathfrak{R}}_S \approx 0.038. Rademacher bound: test error \leq train error + 2 \cdot 0.038 + \sqrt{\log(20)/1000} \approx train error + 0.076 + 0.055. For 2% train error: test error \leq 13.1\% - loose but not vacuous!
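On synthetic Gaussian data (shapes and values below are illustrative, not the MNIST figures), the same Monte Carlo estimate can be checked against the closed-form upper bound \hat{\mathfrak{R}}_S \leq W \sqrt{\sum_i \|x^{(i)}\|_2^2}/n that follows from Cauchy-Schwarz and Jensen:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 500, 20, 2000
X = rng.standard_normal((n, d))

# sup_{||w||_2 <= 1} (1/n) sum_i sigma_i <w, x_i> = ||X^T sigma||_2 / n
draws = [np.linalg.norm(X.T @ rng.choice([-1, 1], size=n)) / n for _ in range(T)]
estimate = float(np.mean(draws))
upper = float(np.sqrt(np.sum(X ** 2)) / n)   # Jensen: E||X^T sigma|| <= sqrt(sum_i ||x_i||^2)
print(estimate, upper)                       # the estimate sits just below the Jensen bound
```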


Appendix Y: Bounds in the Context of Modern ML Systems

Y.1 LLM Benchmark Evaluation Standards

As of 2026, serious LLM evaluations use concentration-aware practices:

HELM (Holistic Evaluation of Language Models): Uses 1000+ examples per scenario, reporting bootstrap confidence intervals. Assumes bounded loss and applies Hoeffding-style reasoning.

BIG-bench: 200+ tasks, each with 100-10000 examples. Standard errors (\sigma/\sqrt{n}) are reported, which matches the leading term of a Hoeffding CI.

LiveBench: Uses human pairwise comparisons (Elo-style). The Elo update rule has a concentration interpretation: the rating gap needed for significance scales as O(1/\sqrt{n_{\text{comparisons}}}).

Statistical significance in model comparisons. With n test examples and two models achieving accuracies \hat{p}_1 and \hat{p}_2:

  • Paired comparison: use McNemar's test (exact)
  • Unpaired: significance threshold |\hat{p}_1 - \hat{p}_2| \geq 2\sqrt{\frac{\log(4/\delta)}{2n}} (Hoeffding + union bound)
  • For n = 1000, \delta = 0.05: need a difference \geq 2 \times 0.047 \approx 9.4\% for significance
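The threshold computation is a one-liner (a sketch of the union-bound arithmetic over two Hoeffding CIs):

```python
import math

def min_significant_gap(n, delta):
    """Hoeffding gap threshold for comparing two accuracies on independent test sets of size n."""
    return 2 * math.sqrt(math.log(4 / delta) / (2 * n))

print(min_significant_gap(1000, 0.05))   # ~0.094: sub-10-point gaps need more than 1000 examples
```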

Y.2 Fine-Tuning Sample Complexity via PAC-Bayes

When fine-tuning a pretrained LLM on task-specific data:

Prior: the pre-trained weights \theta_0 from language-model pre-training on internet-scale data.
Posterior: the fine-tuned weights \hat{\theta} on n task examples.
PAC-Bayes bound:

R(\hat{\theta}) \leq \hat{R}(\hat{\theta}) + \sqrt{\frac{\|\hat{\theta} - \theta_0\|_2^2/(2\sigma_0^2) + \log(1/\delta)}{2(n-1)}}

The \|\hat{\theta} - \theta_0\|_2^2 term is small when fine-tuning doesn't move far from pre-training - exactly what LoRA enforces by constraining the update to \Delta W = AB. With rank r \ll \min(d,k):

\|\Delta W\|_F^2 = \|AB\|_F^2 \leq \|A\|_F^2 \|B\|_F^2

so the KL term is controlled by the norms of the two low-rank factors, which involve only r(d+k) parameters. PAC-Bayes thus helps explain why LoRA can generalise better than full fine-tuning on small datasets.
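The Frobenius-norm submultiplicativity used above is easy to verify numerically (the LoRA-style shapes and scale 0.02 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 256, 128, 8
A = 0.02 * rng.standard_normal((d, r))   # low-rank factors of the update Delta W = A B
B = 0.02 * rng.standard_normal((r, k))

lhs = float(np.linalg.norm(A @ B, "fro"))
rhs = float(np.linalg.norm(A, "fro") * np.linalg.norm(B, "fro"))
print(lhs, rhs)                          # ||AB||_F <= ||A||_F ||B||_F
```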

Y.3 Privacy-Preserving ML via Concentration

Differential Privacy (DP) and concentration inequalities are deeply connected. An algorithm A is (\varepsilon, \delta)-DP if, for any two datasets D, D' differing in one training example and any output set S, P(A(D) \in S) \leq e^{\varepsilon} P(A(D') \in S) + \delta.

DP-SGD (Abadi et al., 2016): Clips each per-example gradient to norm C, then adds Gaussian noise \mathcal{N}(0, \sigma^2 C^2 I). The privacy guarantee follows from the Gaussian mechanism; the utility (convergence) analysis uses concentration inequalities on the noisy gradients.

McDiarmid connection: DP is essentially algorithmic stability with c_i = C/n (each example changes the model by at most C/n in some norm). McDiarmid's inequality gives the same O(\exp(-nt^2/C^2)) flavour of bound for both privacy and stability.


Appendix Z: Complete Theorem Index

This appendix lists every named result in this section for quick reference.

Moment-Based Inequalities

Theorem 2.1 (Markov). X \geq 0, \mathbb{E}[X] = \mu: P(X \geq t) \leq \mu/t.

Theorem 2.2 (Chebyshev). \operatorname{Var}(X) = \sigma^2: P(|X-\mu| \geq t) \leq \sigma^2/t^2.

Theorem 2.3 (Cantelli). One-sided: P(X - \mu \geq t) \leq \sigma^2/(\sigma^2 + t^2).

Sub-Gaussian Inequalities

Definition 3.1 (Sub-Gaussian). X is \sigma^2-sub-Gaussian if \mathbb{E}[e^{tX}] \leq e^{\sigma^2 t^2/2} for all t.

Theorem 3.2 (Sub-Gaussian tail). P(X \geq t) \leq e^{-t^2/(2\sigma^2)}.

Theorem 3.3 (Sub-Gaussian sum). Independent \sigma_i^2-sub-Gaussian variables: the sum is \sum_i \sigma_i^2-sub-Gaussian.

Hoeffding-Type Inequalities

Lemma 4.1 (Hoeffding's lemma). X \in [a,b], \mathbb{E}[X] = 0: \mathbb{E}[e^{tX}] \leq e^{t^2(b-a)^2/8}.

Theorem 4.2 (Hoeffding). Independent X_i \in [a_i,b_i]: P(\sum_i (X_i-\mu_i) \geq t) \leq e^{-2t^2/\sum_i (b_i-a_i)^2}.

Theorem 4.3 (Two-sided Hoeffding). P(|\bar{X}_n-\mu| \geq t) \leq 2e^{-2nt^2/(b-a)^2}.

Chernoff Bounds

Theorem 5.1 (Chernoff method). P(X \geq t) \leq \inf_{s>0} e^{-st} M_X(s).

Theorem 5.2 (Chernoff upper tail). S \sim \operatorname{Bin}(n,p), \delta > 0: P(S \geq (1+\delta)\mu) \leq (e^\delta/(1+\delta)^{1+\delta})^\mu.

Theorem 5.3 (Chernoff multiplicative). P(S \geq (1+\delta)\mu) \leq e^{-\mu\delta^2/3} for \delta \in (0,1).

Theorem 5.4 (Chernoff lower tail). P(S \leq (1-\delta)\mu) \leq e^{-\mu\delta^2/2}.

Bernstein and McDiarmid

Theorem 6.2 (Bernstein). |X_i| \leq c with total variance \nu^2: P(\sum_i X_i \geq t) \leq \exp\!\left(-\frac{t^2/2}{\nu^2 + ct/3}\right).

Theorem 7.2 (McDiarmid). f with c_i-bounded differences: P(f - \mathbb{E}[f] \geq t) \leq e^{-2t^2/\sum_i c_i^2}.

Theorem S.1 (Azuma). Martingale with |Z_k - Z_{k-1}| \leq c_k: P(Z_n \geq t) \leq e^{-t^2/(2\sum_k c_k^2)}.

Statistical Learning Theory

Lemma 8.1 (Union bound). P(\bigcup_i A_i) \leq \sum_i P(A_i).

Lemma 8.3 (Covering number). \ell_2-ball: \mathcal{N}(\mathcal{B}_R, \ell_2, \varepsilon) \leq (3R/\varepsilon)^d.

Theorem 9.2 (Finite-class PAC). n \geq \log(|\mathcal{H}|/\delta)/(2\varepsilon^2) \Rightarrow uniform convergence.

Lemma 9.4 (Sauer-Shelah). \Pi_\mathcal{H}(n) \leq (en/d)^d for n \geq d = d_{VC}.

Theorem 9.5 (VC bound). R(h) \leq \hat{R}(h) + \sqrt{(d\log(2n/d)+\log(4/\delta))/(2n)}.

Theorem 10.2 (Rademacher bound). R(f) \leq \hat{R}(f) + 2\mathfrak{R}_n(\mathcal{F}) + \sqrt{\log(1/\delta)/(2n)}.

Theorem E.4 (PAC-Bayes). \mathbb{E}_Q[R(h)] \leq \mathbb{E}_Q[\hat{R}(h)] + \sqrt{(D_{\mathrm{KL}}(Q\|P)+\log(1/\delta))/(2(n-1))}.

Theorem J.2 (Stability). \beta-stable algorithm: R \leq \hat{R} + 2\beta + (4n\beta+1)\sqrt{\log(1/\delta)/(2n)}.


This completes Section05 Concentration Inequalities. The material here is used extensively in Chapter 7: Statistics for confidence intervals and hypothesis tests, and throughout learning theory.

<- Back to Chapter 6: Probability Theory | Next: Stochastic Processes ->


Supplementary: Concentration in High Dimensions

The Curse and Blessing of Dimensionality

High-dimensional geometry is counterintuitive. In \mathbb{R}^d with large d:

  • Almost all the volume of a ball lies near its surface
  • Independent random unit vectors are nearly orthogonal: |\langle u, v \rangle| \approx 1/\sqrt{d} w.h.p.
  • The norm of a Gaussian vector concentrates: \|\mathbf{x}\|_2 \approx \sqrt{d} for \mathbf{x} \sim \mathcal{N}(0, I_d)

These facts follow directly from concentration inequalities applied to sums of squares.

Gaussian norm concentration. Let \mathbf{x} \sim \mathcal{N}(0, I_d). Then \|\mathbf{x}\|_2^2 = \sum_{i=1}^d X_i^2 where X_i \stackrel{iid}{\sim} \mathcal{N}(0,1). By Bernstein's inequality (since X_i^2 - 1 is sub-exponential):

P\!\left(\left|\,\|\mathbf{x}\|_2 - \sqrt{d}\,\right| \geq t\right) \leq 2e^{-ct^2}

for a universal constant c > 0.

Implication for dot products. For two independent \mathbf{x}, \mathbf{y} \sim \mathcal{N}(0, I_d/d):

P\!\left(\left|\langle \mathbf{x}, \mathbf{y}\rangle\right| \geq t/\sqrt{d}\right) \leq 2e^{-ct^2}

This is why scaled dot-product attention divides by \sqrt{d_k} - it keeps the logits O(1) despite d_k-dimensional embeddings.
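Both facts can be checked by simulation (the dimension d = 1024 and trial counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 1024, 2000

x = rng.standard_normal((trials, d))
norms = np.linalg.norm(x, axis=1)
print(float(norms.mean()), float(np.sqrt(d)), float(norms.std()))   # mean ~ sqrt(d), fluctuations O(1)

u = rng.standard_normal((trials, d)) / np.sqrt(d)   # x, y ~ N(0, I_d/d)
v = rng.standard_normal((trials, d)) / np.sqrt(d)
dots = np.einsum("ij,ij->i", u, v)
print(float(np.abs(dots).mean()), 1 / np.sqrt(d))   # |<x, y>| on the 1/sqrt(d) scale
```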

Geometry of Softmax

Let \mathbf{q}, \mathbf{k}_1, \ldots, \mathbf{k}_n \in \mathbb{R}^{d_k} be the query and keys in a transformer. The attention weight on key j is:

\alpha_j = \frac{e^{\mathbf{q}^\top \mathbf{k}_j / \sqrt{d_k}}}{\sum_{\ell} e^{\mathbf{q}^\top \mathbf{k}_\ell / \sqrt{d_k}}}

Without the 1/\sqrt{d_k} scaling, the logits \mathbf{q}^\top \mathbf{k}_j have variance d_k (for unit-variance random vectors), pushing the softmax into near-zero-gradient regions - a form of implicit dimension dependence that concentration inequalities make precise.
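A two-line check of the variance claim (d_k and the number of keys are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n_keys = 512, 1000

q = rng.standard_normal(d_k)
K = rng.standard_normal((n_keys, d_k))
logits = K @ q
print(float(logits.std()), float((logits / np.sqrt(d_k)).std()))   # ~sqrt(d_k) vs ~1
```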

Covering Numbers in Transformer Weight Space

The number of parameters in a transformer layer is O(d^2). For LoRA with rank r, the effective dimensionality is O(dr). By the \varepsilon-covering number bound:

\log \mathcal{N}(\mathcal{F}_{\text{LoRA}}, \varepsilon, \|\cdot\|) = O\!\left(dr \log(1/\varepsilon)\right)

This converts directly to a PAC generalisation bound showing that LoRA generalises with O(\sqrt{dr \log n / n}) excess risk - substantially better than the O(\sqrt{d^2 \log n / n}) of full fine-tuning. Concentration inequalities thus provide the theoretical foundation for parameter-efficient fine-tuning.

Summary: Why This Section Matters

Concentration inequalities are the engine behind every finite-sample guarantee in machine learning. They explain:

  1. Why more data helps: bounds shrink as 1/\sqrt{n}
  2. Why simpler models generalise: bounds grow with \sqrt{\log |\mathcal{H}|} or \sqrt{d_{\text{VC}}}
  3. Why high dimensions are manageable: Gaussian concentration keeps norms stable
  4. Why LoRA works: the Rademacher complexity of low-rank updates is provably small
  5. Why LLM benchmarks need many examples: CIs on accuracy require n \propto 1/\varepsilon^2

The progression Markov -> Chebyshev -> sub-Gaussian -> Chernoff -> McDiarmid is not merely a tour of tricks - it is a systematic strengthening of assumptions that yields exponentially tighter guarantees, mirroring the progress from "some data" to "structured data" in modern ML.


End of Section05 Concentration Inequalities

Next: Section06 Stochastic Processes - where concentration inequalities meet martingales in continuous time.

Back: Section04 Expectation and Moments - the moment-based tools (MGF, Jensen) used throughout this section.

Chapter: Section06 Probability Theory README