Math for LLMs

Estimation Theory


<- Back to Chapter 7: Statistics | Next: Hypothesis Testing ->


"The problem of statistical estimation is one of the most fundamental in all of science: given observations drawn from some unknown process, what can we infer about that process?"

  • Sir Ronald A. Fisher, Statistical Methods and Scientific Inference (1956)

Overview

Estimation theory answers the central question of applied statistics: given a finite, noisy sample from an unknown population, how do we recover the population's parameters? This inversion - from data to model - is not merely an academic exercise. It is the mathematical foundation of every machine learning training algorithm ever written. When a language model minimises cross-entropy loss on a text corpus, it is performing maximum likelihood estimation. When a researcher reports model accuracy with error bars, those bars are frequentist confidence intervals. When the natural gradient optimiser adjusts step sizes using curvature information, it is inverting the Fisher information matrix.

The discipline divides into three interconnected pillars. Point estimation asks: given a sample, what single value should we report for the unknown parameter? We want estimators that are unbiased (correct on average), consistent (converge to the truth as sample size grows), and efficient (achieve the minimum possible variance). The Cramer-Rao lower bound establishes a fundamental limit - no unbiased estimator can have variance smaller than the reciprocal of Fisher information, a quantity measuring how much data "inform" the parameter. Maximum likelihood estimation is the workhorse of both classical statistics and modern machine learning: find the parameter value that makes the observed data most probable. It is consistent, asymptotically efficient, and invariant under reparametrisation - properties that make it the default choice across science. Confidence intervals extend point estimates to ranges, quantifying the uncertainty inherent in estimation from finite samples.

This section builds the complete theory systematically, from formal definitions through asymptotic theory, culminating in concrete ML applications: MLE as cross-entropy, Fisher information in natural gradient descent, bootstrap confidence intervals for benchmark evaluation, and Fisher information matrices in elastic weight consolidation for catastrophic forgetting prevention.

Prerequisites

Companion Notebooks

Notebook         Description
theory.ipynb     Interactive derivations: MLE for Gaussian/Bernoulli/Poisson, Fisher information visualisation, CRB verification, asymptotic normality simulation, bootstrap CI, natural gradient
exercises.ipynb  10 graded exercises from Bernoulli MLE and bias proofs through Fisher information for neural network layers and Chinchilla scaling law estimation

Learning Objectives

After completing this section, you will:

  • Define an estimator as a statistic and explain why it is a random variable with a sampling distribution
  • Decompose the MSE of any estimator into bias-squared plus variance, and construct examples demonstrating the trade-off
  • State and prove the Cramer-Rao lower bound using the Cauchy-Schwarz inequality
  • Compute the Fisher information for Bernoulli, Gaussian, Poisson, and Exponential models
  • Derive the MLE for scalar and multivariate Gaussian parameters from first principles
  • Prove that MLE of Gaussian variance is biased and state the corrected estimator
  • Explain why minimising NLL/cross-entropy loss is equivalent to performing MLE
  • Derive confidence intervals using both the pivoting method and the asymptotic normal approximation
  • Construct bootstrap confidence intervals and explain when they outperform asymptotic CIs
  • State the asymptotic normality theorem for MLE and explain its O(1/\sqrt{n}) convergence rate
  • Apply the delta method to derive asymptotic distributions of transformed estimators
  • Connect Fisher information to natural gradient descent, K-FAC, and elastic weight consolidation



1. Intuition

1.1 What Is Statistical Estimation?

Suppose you flip a coin 100 times and observe 63 heads. The coin's true bias p - the probability of heads on any single flip - is unknown. What is your best guess for p? The obvious answer is \hat{p} = 0.63. But why is this the right answer? What makes it "best"? Could a different estimate be better in some sense? And how confident should you be - could the true p be 0.5 (a fair coin) despite observing 63 heads?

These are precisely the questions estimation theory addresses. The statistical estimation problem has three ingredients:

  1. An unknown parameter \theta \in \Theta governing some probability distribution p(\mathbf{x}; \theta)
  2. Observed data \mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)} drawn independently from p(\cdot; \theta)
  3. An estimator \hat{\theta} = T(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}): a function of the data that produces a guess for \theta

The key insight, often overlooked by beginners, is that the estimator is a random variable. Before we collect data, \hat{\theta} is uncertain - its value depends on which random sample we happen to observe. If we repeated the experiment with a fresh sample, we would get a different \hat{\theta}. The sampling distribution of \hat{\theta} - the distribution of \hat{\theta} across all possible samples - is the object of study. A good estimator has a sampling distribution tightly concentrated around the true \theta.

For AI: This perspective matters directly. When you train a neural network on a dataset \mathcal{D} and obtain weights \hat{\boldsymbol{\theta}}, those weights are a realisation of an estimator - a function of the random training data. Training on a different shuffle or subsample gives a different \hat{\boldsymbol{\theta}}. The variation you see across training runs, seeds, and dataset splits is the sampling distribution of the MLE manifesting in practice.
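The sampling-distribution idea is easy to see in simulation. A minimal sketch (all parameters are illustrative): repeat the 100-flip experiment many times and look at the spread of the resulting \hat{p} values.

```python
import random
import statistics

random.seed(0)
true_p, n_flips, n_experiments = 0.63, 100, 5000

# One p_hat per simulated experiment: together they trace out the
# sampling distribution of the MLE p_hat = (#heads)/n.
p_hats = [sum(random.random() < true_p for _ in range(n_flips)) / n_flips
          for _ in range(n_experiments)]

mean_p_hat = statistics.mean(p_hats)                       # close to true_p: unbiased
sd_p_hat = statistics.stdev(p_hats)                        # spread of the estimator
theoretical_sd = (true_p * (1 - true_p) / n_flips) ** 0.5  # sqrt(p(1-p)/n)
```

The Monte Carlo spread matches the binomial formula sqrt(p(1-p)/n) closely, which is the quantity the confidence intervals of Section 1.2 are built from.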

1.2 The Estimation Landscape

Estimation theory spans three major frameworks, each answering a different question:

Point estimation produces a single value \hat{\theta} for the unknown parameter. The challenge is specifying what "best" means. Three criteria dominate:

  • Unbiasedness: \mathbb{E}[\hat{\theta}] = \theta - on average, the estimator is correct
  • Minimum variance: among unbiased estimators, prefer the one with smallest \operatorname{Var}(\hat{\theta})
  • Consistency: \hat{\theta}_n \xrightarrow{P} \theta as n \to \infty - with enough data, the estimator converges to the truth

These criteria can conflict. The Cramer-Rao lower bound establishes a hard floor on variance for unbiased estimators, and the estimators that achieve this floor are called efficient. Maximum likelihood estimation (MLE) is the dominant method: it is consistent, asymptotically efficient, and invariant under smooth reparametrisations.

Interval estimation goes beyond a point to a range [\hat{\theta}_L, \hat{\theta}_U] with a coverage guarantee: in repeated experiments, a 1 - \alpha fraction of such intervals will contain the true \theta. Frequentist confidence intervals capture estimation uncertainty without requiring a prior distribution on \theta.

Asymptotic theory studies estimator behaviour as n \to \infty. The flagship result is the asymptotic normality of MLE: under regularity conditions,

\sqrt{n}\left(\hat{\theta}_{\text{MLE}} - \theta^*\right) \xrightarrow{d} \mathcal{N}\!\left(0,\, \mathcal{I}(\theta^*)^{-1}\right)

where \mathcal{I}(\theta^*) is the Fisher information. This single result simultaneously justifies MLE as an estimation method, provides the basis for asymptotic confidence intervals, and connects estimation theory to information geometry.

For AI: The three frameworks map directly to ML practice. Point estimation = model training (find \hat{\boldsymbol{\theta}}). Interval estimation = evaluation with error bars (report accuracy \pm 95% CI). Asymptotic theory = understanding why larger datasets produce more reliable models (the 1/\sqrt{n} convergence rate).

1.3 Historical Timeline

ESTIMATION THEORY - HISTORICAL TIMELINE
========================================================================

  1795  Gauss (age 18) uses least squares to predict asteroid orbits
        - the first systematic estimation method

  1805  Legendre publishes least squares formally (Gauss priority dispute)

  1809  Gauss derives least squares from the assumption of Gaussian errors
        - first connection between MLE and squared-error minimisation

  1894  Pearson introduces method of moments - systematic moment matching

  1912  Fisher introduces "maximum likelihood" as a term; begins systematic
        study of estimation properties

  1922  Fisher: "On the Mathematical Foundations of Theoretical Statistics"
        - defines sufficiency, efficiency, consistency; derives properties of MLE

  1925  Fisher introduces Fisher information and the information inequality
        (precursor to Cramer-Rao)

  1945  Cramer proves the Cramer-Rao lower bound (independently of Rao)

  1945  Rao independently proves the same bound + Rao-Blackwell theorem

  1947  Lehmann & Scheffe: completeness and UMVUE theory

  1950s Wald develops statistical decision theory - unified framework for
        estimation as minimising expected loss

  1979  Efron introduces the bootstrap - non-parametric confidence intervals
        without distributional assumptions

  1998  Amari: natural gradient via Fisher information geometry
        - direct connection to modern ML optimisation

  2015  Martens & Grosse: K-FAC - tractable approximation of FIM for DNNs

  2017  Kirkpatrick et al.: Elastic Weight Consolidation (EWC) uses FIM
        to prevent catastrophic forgetting in continual learning

  2022  Hoffmann et al. (Chinchilla): scaling laws estimated via MLE on
        (N, D, L) data - estimation theory at trillion-parameter scale

========================================================================

1.4 Why Estimation Theory Matters for AI

Every major component of a modern ML pipeline rests on estimation theory:

Training objectives. Minimising cross-entropy loss is equivalent to performing MLE. The connection is exact: \arg\min_{\boldsymbol{\theta}} \sum_i -\log p(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta}} \prod_i p(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}). Every training run of GPT-4, LLaMA-3, Gemini, or any neural network trained with NLL loss is performing MLE on the conditional distribution.
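A minimal numerical sketch of this equivalence for a Bernoulli model (the data below are made up): minimising the NLL by brute-force grid search recovers the closed-form MLE, the success fraction.

```python
import math

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]           # 7 successes out of 10
closed_form = sum(data) / len(data)              # Bernoulli MLE: success fraction

def nll(p):
    # Negative log-likelihood = cross-entropy of the data under Bern(p).
    return -sum(math.log(p if x == 1 else 1 - p) for x in data)

# Crude, dependency-free minimisation of the NLL over a fine grid.
grid = [i / 10000 for i in range(1, 10000)]
numerical = min(grid, key=nll)                   # lands on the MLE
```

The grid minimiser and the closed form agree: minimising cross-entropy and maximising likelihood are the same optimisation.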

Evaluation uncertainty. When a benchmark reports "Model A achieves 73.2% accuracy", this is a point estimate. How uncertain is it? With n = 1000 test examples, the 95% CI is approximately \pm 2.8\% - meaning Model A and a Model B with 73.5% accuracy are statistically indistinguishable. Estimation theory makes this precise.
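The quoted half-width follows from the normal approximation. A small sketch, assuming the standard Wald interval for a proportion:

```python
import math

def wald_halfwidth(acc, n, z=1.96):
    # Half-width of the asymptotic (Wald) 95% CI for a proportion:
    # z * sqrt(acc * (1 - acc) / n).
    return z * math.sqrt(acc * (1 - acc) / n)

hw = wald_halfwidth(0.732, 1000)   # roughly 0.027, i.e. about +/- 2.8%
```

Since the 0.3-point gap between 73.2% and 73.5% is far smaller than the half-width, the two models are indeed statistically indistinguishable at this test-set size.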

Second-order optimisation. K-FAC and natural gradient incorporate curvature information (Adam uses a diagonal second-moment estimate of the gradients as a cheap proxy). The natural gradient multiplies the gradient by \mathcal{I}(\boldsymbol{\theta})^{-1}, where \mathcal{I} is the Fisher information matrix. This reparametrises the optimisation in the geometry of the statistical manifold, making gradient steps invariant to reparametrisation. K-FAC (Martens & Grosse, 2015) approximates \mathcal{I}^{-1} tractably for deep networks.

Continual learning. Elastic weight consolidation (Kirkpatrick et al., 2017) protects important weights during sequential task learning. "Importance" is measured by the diagonal of the FIM: parameters with high Fisher information are those to which the likelihood is most sensitive, and perturbing them most damages performance.

Calibration. A well-calibrated model's confidence scores match empirical frequencies: when it says "80% confident", it is right 80% of the time. Temperature scaling - dividing logits by a scalar \tau - is an MLE problem: find the \tau that maximises likelihood on a held-out calibration set.
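A toy sketch of temperature scaling as a one-dimensional MLE. The logits and labels are invented, and a simple grid search stands in for a proper optimiser:

```python
import math

def nll(tau, batch):
    # NLL of softmax(logits / tau) over (logits, label) pairs,
    # computed with a log-sum-exp for numerical stability.
    total = 0.0
    for logits, label in batch:
        scaled = [z / tau for z in logits]
        m = max(scaled)
        lse = m + math.log(sum(math.exp(z - m) for z in scaled))
        total -= scaled[label] - lse          # -log softmax probability of label
    return total

# Hypothetical overconfident model: big logit gaps, one prediction wrong.
calib = [([4.0, 0.0, 0.0], 0), ([5.0, 0.0, 0.0], 1), ([3.0, 0.0, 0.0], 0)]

taus = [t / 100 for t in range(50, 501)]      # grid over tau in [0.5, 5.0]
tau_star = min(taus, key=lambda t: nll(t, calib))
```

Because the toy model is overconfident, the fitted temperature comes out above 1, softening the predicted distributions.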


2. The Formal Estimation Problem

2.1 Statistical Model and Parametric Family

Definition 2.1 (Parametric Statistical Model). A parametric statistical model is a collection of probability distributions indexed by a parameter:

\mathcal{M} = \{p(\cdot\,;\boldsymbol{\theta}) : \boldsymbol{\theta} \in \Theta\}

where \Theta \subseteq \mathbb{R}^k is the parameter space. The model is:

  • Correctly specified if the true data-generating distribution p^* satisfies p^* = p(\cdot\,;\boldsymbol{\theta}^*) for some \boldsymbol{\theta}^* \in \Theta
  • Misspecified if p^* \notin \mathcal{M} - the model family does not contain the truth (most practical ML models are misspecified)
  • Identifiable if different parameters give different distributions: \boldsymbol{\theta} \neq \boldsymbol{\theta}' \Rightarrow p(\cdot\,;\boldsymbol{\theta}) \neq p(\cdot\,;\boldsymbol{\theta}')

Identifiability is necessary for consistent estimation - you cannot consistently estimate a parameter that leaves the data distribution unchanged.

Standard examples of parametric families:

Model                                   Parameter(s)                 Parameter space                          Notes
\operatorname{Bern}(p)                  p                            (0,1)                                    Single binary outcome
\mathcal{N}(\mu, \sigma^2)              (\mu, \sigma^2)              \mathbb{R} \times \mathbb{R}_{>0}        Location-scale family
\mathcal{N}(\boldsymbol{\mu}, \Sigma)   (\boldsymbol{\mu}, \Sigma)   \mathbb{R}^d \times \mathbb{S}^d_{++}    Multivariate Gaussian
\operatorname{Poisson}(\lambda)         \lambda                      \mathbb{R}_{>0}                          Count data
\operatorname{Exp}(\lambda)             \lambda                      \mathbb{R}_{>0}                          Time-to-event
\operatorname{Beta}(\alpha, \beta)      (\alpha, \beta)              \mathbb{R}_{>0}^2                        Probabilities

Non-examples (non-identifiable):

  • \mathcal{N}(\mu_1 + \mu_2, 1) with parameter (\mu_1, \mu_2): we can only estimate \mu_1 + \mu_2, not the individual components. The model is not identifiable.
  • A neural network with one hidden layer of width 2 using \tanh: swapping the two hidden neurons (together with their weights) gives the same function, so the parameterisation is not identifiable (though the function class may be).

The iid assumption. Throughout this section, observations \mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)} are assumed independent and identically distributed (iid) from p(\cdot\,;\boldsymbol{\theta}^*). The iid assumption enables the joint log-likelihood to factorise as a sum:

\log p(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)};\boldsymbol{\theta}) = \sum_{i=1}^n \log p(\mathbf{x}^{(i)};\boldsymbol{\theta})

This factorisation is what makes MLE both computationally and theoretically tractable. In practice, the assumption is violated whenever data points are correlated (time series, text sequences, spatially clustered data), and estimation theory has extensions to handle dependent data.

2.2 Point Estimators

Definition 2.2 (Estimator). An estimator of \boldsymbol{\theta} is any measurable function \hat{\boldsymbol{\theta}}_n = T(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}) of the sample. The key points:

  1. \hat{\boldsymbol{\theta}}_n is a random variable - before observing data, it is uncertain
  2. A specific value T(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}) = t after observing data is called an estimate (not estimator)
  3. The sampling distribution of \hat{\boldsymbol{\theta}}_n is the distribution of T across all possible samples of size n

This distinction - estimator (random variable) vs. estimate (specific realisation) - is pedantic but consequential. Confidence interval coverage is a property of the estimator (random), not the estimate (fixed). When we say "95% CI", we mean that the random interval contains \boldsymbol{\theta}^* with probability 95%, not that this specific computed interval has a 95% chance of containing \boldsymbol{\theta}^* (after observing data, \boldsymbol{\theta}^* is either in the interval or not - there is no remaining randomness).

Examples of estimators for the Gaussian mean \mu:

Let x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2).

  • Sample mean: \hat{\mu}_1 = \frac{1}{n}\sum_{i=1}^n x^{(i)} - unbiased, consistent, efficient
  • Constant zero: \hat{\mu}_2 = 0 - biased unless \mu = 0, inconsistent
  • First observation: \hat{\mu}_3 = x^{(1)} - unbiased! but inconsistent (variance \sigma^2 regardless of n)
  • Trimmed mean: \hat{\mu}_4 = mean of the middle 90% - unbiased for the symmetric Gaussian and robust to outliers, though less efficient than \bar{x}
  • Constant c: \hat{\mu}_5 = c - bias c - \mu, inconsistent; can have lower MSE than the sample mean when |\mu - c| is small (illustrates the bias-variance trade-off)

The existence of unbiased but inconsistent estimators (\hat{\mu}_3) and consistent but biased estimators (such as the 1/n variance MLE of Section 2.4) shows these properties are logically independent.
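A quick simulation contrasting three of the estimators above (all parameters are arbitrary):

```python
import random
import statistics

random.seed(1)
mu, sigma, n, reps = 2.0, 1.0, 50, 4000

def mse(estimates, target):
    # Empirical mean squared error of a list of estimates.
    return statistics.mean((e - target) ** 2 for e in estimates)

means, firsts = [], []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.mean(sample))   # unbiased, Var = sigma^2/n
    firsts.append(sample[0])                # unbiased, Var = sigma^2 (inconsistent)

mse_mean = mse(means, mu)    # ~ sigma^2/n  = 0.02
mse_first = mse(firsts, mu)  # ~ sigma^2    = 1.0
mse_zero = mu ** 2           # constant 0: pure squared bias = 4.0
```

The sample mean dominates here; the first-observation estimator is unbiased yet never improves with n, and the constant is hopeless unless \mu happens to be near it.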

2.3 The Loss Framework: MSE, Bias, and Variance

To compare estimators, we need a criterion. The most widely used is Mean Squared Error (MSE):

Definition 2.3 (MSE, Bias, Variance). For an estimator \hat{\theta} of a scalar parameter \theta:

\operatorname{MSE}(\hat{\theta}) = \mathbb{E}\left[(\hat{\theta} - \theta)^2\right]

The bias-variance decomposition is the central identity of estimation theory:

\operatorname{MSE}(\hat{\theta}) = \operatorname{Bias}^2(\hat{\theta}) + \operatorname{Var}(\hat{\theta})

Proof:

\operatorname{MSE}(\hat{\theta}) = \mathbb{E}\left[(\hat{\theta} - \theta)^2\right]

Add and subtract \mathbb{E}[\hat{\theta}]:

= \mathbb{E}\left[\left((\hat{\theta} - \mathbb{E}[\hat{\theta}]) + (\mathbb{E}[\hat{\theta}] - \theta)\right)^2\right]

= \mathbb{E}\left[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\right] + 2(\mathbb{E}[\hat{\theta}] - \theta)\underbrace{\mathbb{E}[\hat{\theta} - \mathbb{E}[\hat{\theta}]]}_{=0} + (\mathbb{E}[\hat{\theta}] - \theta)^2

= \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta})^2 \qquad \square

where \operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta.

Geometric interpretation:

BIAS-VARIANCE DECOMPOSITION - GEOMETRIC VIEW
========================================================================

  True parameter θ* = 0 (target)

  Case 1: Low bias, low variance (good estimator)
    ---    Estimates cluster tightly around θ*
    ----   MSE ≈ small
    ---

  Case 2: Low bias, high variance (unbiased but noisy)
        -             -
    -       -   x   -       -
        -             -
    Centred on θ* but spread out.  MSE = Var is large.

  Case 3: High bias, low variance (biased but stable)
              ---
             ----    x  (θ*)
              ---
    Centred off target.  MSE = Bias^2 + small Var.

  Case 4: High bias, high variance (worst case)
    -           -
            -       x  (θ*)
        -       -
    Both centred off target AND spread out.

  Key trade-off: Reducing variance by shrinkage introduces bias.
  Increasing bias via regularisation often reduces variance more than
  it increases bias-squared, giving lower MSE overall.

========================================================================

The bias-variance trade-off in ML is the exact same phenomenon: regularisation (L2, dropout, early stopping) introduces bias (the model is pulled away from the MLE toward a constrained region) but reduces variance (the model is less sensitive to the particular training sample), often reducing test MSE overall.
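The decomposition is easy to verify numerically. A sketch using the biased 1/n Gaussian variance estimator (parameters illustrative):

```python
import random
import statistics

random.seed(2)
sigma2, n, reps = 4.0, 10, 20000

estimates = []
for _ in range(reps):
    xs = [random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    xbar = statistics.mean(xs)
    estimates.append(sum((x - xbar) ** 2 for x in xs) / n)  # 1/n variance estimator

# Empirical MSE, bias, and variance of the estimator across replications.
mse = statistics.mean((e - sigma2) ** 2 for e in estimates)
bias = statistics.mean(estimates) - sigma2                  # theory: -sigma2/n = -0.4
variance = statistics.pvariance(estimates)
# Identity check: MSE = Bias^2 + Var (exact for the empirical moments).
```

The empirical identity holds to floating-point precision, and the measured bias matches the -\sigma^2/n formula derived in Section 2.4.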

Minimax estimators. MSE is not the only loss criterion. The minimax estimator minimises the worst-case risk: \min_T \max_\theta \mathbb{E}[(\hat{\theta} - \theta)^2]. The James-Stein estimator (1961) famously showed that in \mathbb{R}^k for k \geq 3, the sample mean is inadmissible - there exists an estimator with strictly lower MSE at every \theta. This is the Stein paradox: shrinkage toward the origin (a biased estimator!) uniformly dominates the unbiased sample mean in 3+ dimensions.
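The Stein paradox can be watched directly. A simulation sketch using the positive-part James-Stein estimator on a single observation x \sim \mathcal{N}(\theta, I_k); the dimension, true mean, and repetition count are illustrative:

```python
import random

random.seed(3)
k, reps = 10, 5000
theta = [0.5] * k                                 # true mean vector

sqerr_mle, sqerr_js = 0.0, 0.0
for _ in range(reps):
    x = [random.gauss(t, 1.0) for t in theta]     # "sample mean" with n = 1
    norm2 = sum(v * v for v in x)
    shrink = max(0.0, 1.0 - (k - 2) / norm2)      # positive-part James-Stein factor
    js = [shrink * v for v in x]
    sqerr_mle += sum((a - t) ** 2 for a, t in zip(x, theta))
    sqerr_js += sum((a - t) ** 2 for a, t in zip(js, theta))

risk_mle = sqerr_mle / reps   # ~ k (total MSE of the unbiased estimator)
risk_js = sqerr_js / reps     # strictly smaller: shrinkage wins
```

The shrunken (biased) estimator has lower total MSE than the unbiased one, exactly as the inadmissibility result predicts.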

2.4 Examples of Biased and Unbiased Estimators

Example 1: Sample mean is unbiased.

Let x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} p(\cdot;\theta) with \mathbb{E}[x^{(i)}] = \mu. Then:

\mathbb{E}\!\left[\bar{x}\right] = \mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^n x^{(i)}\right] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[x^{(i)}] = \frac{n\mu}{n} = \mu

The sample mean is an unbiased estimator of \mu for any distribution with finite mean.

Example 2: MLE of Gaussian variance is biased.

Let x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2). The MLE of \sigma^2 is:

\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (x^{(i)} - \bar{x})^2

Computing the bias: Using x^{(i)} - \bar{x} = (x^{(i)} - \mu) - (\bar{x} - \mu):

\sum_{i=1}^n (x^{(i)} - \bar{x})^2 = \sum_{i=1}^n (x^{(i)} - \mu)^2 - n(\bar{x} - \mu)^2

Taking expectations:

\mathbb{E}\!\left[\sum_{i=1}^n (x^{(i)} - \bar{x})^2\right] = n\sigma^2 - n \cdot \frac{\sigma^2}{n} = (n-1)\sigma^2

Therefore:

\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \frac{(n-1)\sigma^2}{n} = \sigma^2 - \frac{\sigma^2}{n}

The bias is -\sigma^2/n - the MLE underestimates the variance. The fix is Bessel's correction: s^2 = \frac{1}{n-1}\sum_{i=1}^n (x^{(i)} - \bar{x})^2 is unbiased.

Recall from Section 01: We saw Bessel's correction s^2 = \frac{1}{n-1}\sum(x^{(i)}-\bar{x})^2 introduced as the "standard" sample variance formula. Now we understand why: MLE gives the 1/n version, which is biased. The n-1 denominator corrects for the one degree of freedom lost in estimating \mu by \bar{x}.
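A Monte Carlo sketch of the bias and its Bessel correction (parameters illustrative):

```python
import random
import statistics

random.seed(4)
sigma2, n, reps = 1.0, 5, 50000

mle_vals, bessel_vals = [], []
for _ in range(reps):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    xbar = statistics.mean(xs)
    ss = sum((x - xbar) ** 2 for x in xs)
    mle_vals.append(ss / n)            # biased: E = (n-1)/n * sigma2 = 0.8
    bessel_vals.append(ss / (n - 1))   # unbiased: E = sigma2 = 1.0

mean_mle = statistics.mean(mle_vals)
mean_bessel = statistics.mean(bessel_vals)
```

With n = 5 the bias is large (a factor of 4/5), which is why the correction matters most for small samples.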

Example 3: MLE is biased but consistent.

The Gaussian variance MLE \hat{\sigma}^2_{\text{MLE}} has bias -\sigma^2/n \to 0 as n \to \infty. It is asymptotically unbiased and consistent. This illustrates that bias can be acceptable if it vanishes with sample size.

Example 4: When bias reduces MSE.

Consider estimating \mu with the shrinkage estimator \hat{\mu}_c = c\,\bar{x} for some c \in (0,1). Then:

\operatorname{MSE}(\hat{\mu}_c) = (c-1)^2 \mu^2 + c^2 \frac{\sigma^2}{n}

The optimum is c^* = \frac{\mu^2}{\mu^2 + \sigma^2/n}, which gives lower MSE than c = 1 (the unbiased sample mean) whenever \sigma^2/n > 0. Note that c^* depends on the unknown \mu, so it is an oracle benchmark rather than a usable estimator; practical shrinkage rules estimate the shrinkage level from the data. The key insight: when |\mu| is small relative to \sigma/\sqrt{n}, shrinking toward zero introduces a little bias but a large variance reduction.
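A sketch checking the closed-form optimum against a grid search. The values of \mu, \sigma^2, n are arbitrary, and c^* uses the true \mu, so this is a diagnostic rather than an estimator:

```python
mu, sigma2, n = 0.5, 1.0, 10

def mse_shrink(c):
    # MSE(c) = (c-1)^2 mu^2 + c^2 sigma^2/n, from the formula above.
    return (c - 1) ** 2 * mu ** 2 + c ** 2 * sigma2 / n

c_star = mu ** 2 / (mu ** 2 + sigma2 / n)   # closed-form optimum

# Grid search should land on (essentially) the same c.
grid = [i / 1000 for i in range(1001)]
c_grid = min(grid, key=mse_shrink)
```

Here c^* ≈ 0.714 and the shrunken estimator's MSE is well below that of c = 1, matching the bias-variance argument.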


3. Properties of Estimators

3.1 Bias and Unbiasedness

Definition 3.1 (Bias). The bias of an estimator \hat{\theta} for parameter \theta is:

\operatorname{Bias}(\hat{\theta};\theta) = \mathbb{E}[\hat{\theta}] - \theta

The estimator is unbiased if \operatorname{Bias}(\hat{\theta};\theta) = 0 for all \theta \in \Theta, and asymptotically unbiased if \operatorname{Bias}(\hat{\theta}_n;\theta) \to 0 as n \to \infty.

Why unbiasedness is desirable - but not sacred. An unbiased estimator is correct on average over infinite repetitions of the experiment. This is a natural property to want. However, unbiasedness alone is insufficient:

  • The estimator \hat{\theta} = x^{(1)} (use only the first observation) is unbiased but throws away n-1 observations - clearly wasteful
  • For some parameters no unbiased estimator exists at all: for X \sim \operatorname{Bin}(n, p), there is no unbiased estimator of 1/p, since \mathbb{E}[T(X)] = \sum_k T(k)\binom{n}{k}p^k(1-p)^{n-k} is a polynomial in p, and no polynomial equals 1/p on (0,1)
  • The Stein phenomenon (Section 2.3) shows that in 3+ dimensions, unbiased estimators can be uniformly dominated by biased ones

Bias correction. When an estimator \hat{\theta} has known bias b(\theta) = \mathbb{E}[\hat{\theta}] - \theta, we can correct it: \hat{\theta}_{\text{corr}} = \hat{\theta} - b(\hat{\theta}). This is the idea behind Bessel's correction (n-1 vs n in the variance formula) and jackknife bias reduction.

For AI: Batch normalisation computes \hat{\mu}_B = \frac{1}{B}\sum_{i \in \text{batch}} x_i as the batch mean during training. This is an unbiased estimator of the feature mean. During inference, PyTorch uses the running mean accumulated over training, which is a biased estimator of the current data mean if the distribution has shifted - illustrating that unbiasedness depends on whether the estimation scenario matches the training scenario.

3.2 Consistency

Definition 3.2 (Consistency). An estimator \hat{\theta}_n is:

  • Weakly consistent if \hat{\theta}_n \xrightarrow{P} \theta for all \theta \in \Theta (convergence in probability)
  • Strongly consistent if \hat{\theta}_n \xrightarrow{a.s.} \theta for all \theta \in \Theta (almost-sure convergence)

Weak consistency: for every \varepsilon > 0, P(|\hat{\theta}_n - \theta| > \varepsilon) \to 0 as n \to \infty.

Sufficient conditions for consistency:

  • If \operatorname{Bias}(\hat{\theta}_n;\theta) \to 0 and \operatorname{Var}(\hat{\theta}_n) \to 0 as n \to \infty, then \hat{\theta}_n is weakly consistent. (By Chebyshev + the bias decomposition: P(|\hat{\theta}_n - \theta| > \varepsilon) \leq \operatorname{MSE}(\hat{\theta}_n)/\varepsilon^2 \to 0.)

Example: Sample mean consistency.

\bar{x}_n = \frac{1}{n}\sum_{i=1}^n x^{(i)} satisfies \operatorname{Bias} = 0 and \operatorname{Var}(\bar{x}_n) = \sigma^2/n \to 0. Therefore \bar{x}_n \xrightarrow{P} \mu. This is exactly the Weak Law of Large Numbers - consistency of the sample mean is the LLN.

Example: MLE of Gaussian variance is consistent.

\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum(x^{(i)} - \bar{x})^2 has bias -\sigma^2/n \to 0 and variance O(1/n), so it is consistent even though it is biased for finite n.

Inconsistent estimator example: \hat{\theta} = x^{(1)} (take only the first observation) has \operatorname{Var} = \sigma^2 regardless of n. It never concentrates - it is inconsistent.
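Consistency vs. inconsistency in simulation (sample sizes and repetition count are illustrative):

```python
import random
import statistics

random.seed(5)
mu, reps = 0.0, 2000

def estimator_sd(n):
    # Monte Carlo standard deviation of the sample-mean estimator at size n.
    return statistics.stdev(
        statistics.mean(random.gauss(mu, 1.0) for _ in range(n))
        for _ in range(reps))

sd_10, sd_1000 = estimator_sd(10), estimator_sd(1000)  # shrinks like 1/sqrt(n)
# The first-observation estimator never concentrates: spread stays ~ sigma.
sd_first = statistics.stdev(random.gauss(mu, 1.0) for _ in range(reps))
```

The sample mean's spread collapses as n grows while the first-observation estimator's spread stays put - the picture behind weak consistency.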

Why consistency matters: Consistency is the minimal requirement for an estimator to be scientifically useful. A method that doesn't converge to the right answer with infinite data is fundamentally broken. Consistency does not guarantee that the estimator is good for small nn - convergence can be arbitrarily slow - but it at least ensures the method is correct in the large-data limit.

For AI: The "double descent" phenomenon in over-parameterised neural networks challenges classical bias-variance thinking. A massively over-parameterised model (d \gg n) can still be consistent for the Bayes-optimal predictor if trained with appropriate implicit regularisation (e.g., gradient descent from zero initialisation). This is an active area connecting statistical estimation theory to modern deep learning theory.

3.3 Efficiency

Among all unbiased estimators of \theta, which has the smallest variance? This question is answered by the Cramer-Rao bound (Section 4), but we can define efficiency as a property here.

Definition 3.3 (Efficiency and Relative Efficiency).

  • An unbiased estimator \hat{\theta} is efficient if its variance equals the Cramer-Rao lower bound: \operatorname{Var}(\hat{\theta}) = 1/\mathcal{I}(\theta).
  • The relative efficiency of two unbiased estimators \hat{\theta}_1 and \hat{\theta}_2 is e(\hat{\theta}_1, \hat{\theta}_2) = \operatorname{Var}(\hat{\theta}_2)/\operatorname{Var}(\hat{\theta}_1). Values greater than 1 mean \hat{\theta}_1 is more efficient.

Examples:

  1. Gaussian mean: The sample mean \bar{x} is efficient - it achieves the CRB \sigma^2/n. The relative efficiency of the sample median vs. the mean is 2/\pi \approx 0.637 for Gaussian data - the median throws away ~36% of the efficiency.

  2. Gaussian variance: The unbiased sample variance s^2 is not efficient: \operatorname{Var}(s^2) = 2\sigma^4/(n-1), strictly above the CRB 2\sigma^4/n. No unbiased estimator attains the bound in this model, though the gap vanishes as n \to \infty.
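The 2/\pi figure can be reproduced by simulation (n and the repetition count are illustrative):

```python
import random
import statistics

random.seed(6)
n, reps = 201, 4000   # odd n so the median is a single order statistic

means, medians = [], []
for _ in range(reps):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    means.append(statistics.mean(xs))
    medians.append(statistics.median(xs))

# e(median, mean) = Var(mean)/Var(median) -> 2/pi ~ 0.637 for large n.
rel_eff = statistics.pvariance(means) / statistics.pvariance(medians)
```

The Monte Carlo ratio lands near 0.64: the median needs roughly 1.57 times as many Gaussian observations to match the mean's precision.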

Asymptotic efficiency. For large n, MLE achieves the CRB asymptotically - it is asymptotically efficient. This is one of its key advantages.

For AI: The sample mean is the most efficient estimator of the mean for Gaussian data. Batch gradient descent using all n samples computes the exact gradient (equivalent to efficient estimation), while SGD uses a single sample or mini-batch (less efficient, higher variance, but much faster per update). The trade-off between statistical efficiency and computational efficiency is fundamental to modern ML training.

3.4 Sufficiency and the Rao-Blackwell Theorem

A sufficient statistic captures all the information in the data that is relevant to estimating \theta.

Definition 3.4 (Sufficient Statistic). A statistic T = T(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}) is sufficient for \theta if the conditional distribution p(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)} \mid T = t; \theta) does not depend on \theta for any t.

Intuitively: once you know T, the raw data \mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)} provide no additional information about \theta.

Fisher-Neyman Factorisation Theorem. T is sufficient for \theta if and only if the likelihood factors as:

p(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)};\theta) = g(T(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}), \theta) \cdot h(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)})

where g depends on the data only through T, and h does not depend on \theta.

Examples of sufficient statistics:

Model                                        Sufficient statistic T
\operatorname{Bern}(p), n samples            T = \sum_{i=1}^n x^{(i)} (total successes)
\mathcal{N}(\mu, \sigma^2), \sigma^2 known   T = \bar{x} (sample mean)
\mathcal{N}(\mu, \sigma^2), both unknown     T = (\bar{x}, \sum(x^{(i)} - \bar{x})^2) (mean and SS)
\operatorname{Poisson}(\lambda)              T = \sum_{i=1}^n x^{(i)} (total count)
\operatorname{Exp}(\lambda)                  T = \sum_{i=1}^n x^{(i)} (total time)

Rao-Blackwell Theorem. If \hat{\theta} is any unbiased estimator and T is a sufficient statistic, then \tilde{\theta} = \mathbb{E}[\hat{\theta} \mid T] satisfies:

\operatorname{Var}(\tilde{\theta}) \leq \operatorname{Var}(\hat{\theta})

with equality iff \hat{\theta} is already a function of T. The "Rao-Blackwellised" estimator is at least as good.

Proof sketch: By the law of total variance, \operatorname{Var}(\hat{\theta}) = \operatorname{Var}(\mathbb{E}[\hat{\theta} \mid T]) + \mathbb{E}[\operatorname{Var}(\hat{\theta} \mid T)] \geq \operatorname{Var}(\tilde{\theta}).
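A concrete Rao-Blackwell sketch for Bernoulli data: start from the wasteful unbiased estimator x^{(1)} and condition on the sufficient statistic T = \sum_i x^{(i)}; by exchangeability \mathbb{E}[x^{(1)} \mid T = t] = t/n, i.e. the sample mean. A simulation (parameters illustrative) checks the variance ordering:

```python
import random
import statistics

random.seed(7)
p, n, reps = 0.3, 20, 20000

crude, rb = [], []
for _ in range(reps):
    xs = [1 if random.random() < p else 0 for _ in range(n)]
    crude.append(xs[0])        # unbiased, Var = p(1-p)
    rb.append(sum(xs) / n)     # E[x1 | T] = T/n: the Rao-Blackwellised version

var_crude = statistics.pvariance(crude)   # ~ p(1-p)      = 0.21
var_rb = statistics.pvariance(rb)         # ~ p(1-p)/n    = 0.0105
```

Conditioning on the sufficient statistic cuts the variance by a factor of n while preserving unbiasedness, exactly as the theorem promises.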

For AI: Sufficient statistics are the foundation of exponential families, which include the Gaussian, Bernoulli, Poisson, Gamma, and most common distributions. The natural parameterisation of exponential families (used in generalised linear models and variational inference) exploits sufficient statistics to simplify computation. In the VAE (Kingma & Welling, 2014), the encoder q_\phi(\mathbf{z} \mid \mathbf{x}) outputs sufficient statistics (\boldsymbol{\mu}_\phi(\mathbf{x}), \boldsymbol{\sigma}_\phi(\mathbf{x})) of the Gaussian posterior approximation.

3.5 Completeness and the Lehmann-Scheffe Theorem

Definition 3.5 (Complete Statistic). A sufficient statistic TT is complete if for any measurable function gg: E[g(T);θ]=0\mathbb{E}[g(T); \theta] = 0 for all θΘ\theta \in \Theta implies g(T)=0g(T) = 0 a.s. for all θ\theta.

Completeness means the only function of TT that has zero expectation everywhere is the zero function - there are no "hidden" unbiasedness conditions.

Lehmann-Scheffe Theorem. If TT is a complete sufficient statistic and θ~=h(T)\tilde{\theta} = h(T) is unbiased for θ\theta, then θ~\tilde{\theta} is the unique minimum variance unbiased estimator (UMVUE) of θ\theta.

The UMVUE is the "best possible" unbiased estimator - it has the lowest variance among all unbiased estimators, and it is unique.

Example: For N(μ,σ2)\mathcal{N}(\mu, \sigma^2) with σ2\sigma^2 known, T=xˉT = \bar{x} is complete sufficient, and xˉ\bar{x} itself is unbiased for μ\mu. By Lehmann-Scheffe, xˉ\bar{x} is the UMVUE of μ\mu - there is no unbiased estimator with smaller variance. This also confirms that xˉ\bar{x} achieves the CRB σ2/n\sigma^2/n.


4. Fisher Information and the Cramer-Rao Bound

4.1 The Score Function

The score function is the fundamental quantity linking estimation and information theory.

Definition 4.1 (Score Function). For a single observation xx from p(x;θ)p(x;\theta), the score is:

s(θ;x)=θlogp(x;θ)s(\theta; x) = \frac{\partial}{\partial \theta} \log p(x; \theta)

For a sample of nn iid observations, the total score is:

Sn(θ)=i=1nθlogp(x(i);θ)S_n(\theta) = \sum_{i=1}^n \frac{\partial}{\partial \theta} \log p(x^{(i)};\theta)

The MLE solves Sn(θ^)=0S_n(\hat{\theta}) = 0 (the score equation).

Zero-mean property of the score:

E ⁣[s(θ;X)]=logp(x;θ)θp(x;θ)dx=p(x;θ)θp(x;θ)p(x;θ)dx=θp(x;θ)dx=θ(1)=0\mathbb{E}\!\left[s(\theta; X)\right] = \int \frac{\partial \log p(x;\theta)}{\partial \theta} p(x;\theta)\, dx = \int \frac{\frac{\partial p(x;\theta)}{\partial \theta}}{p(x;\theta)} p(x;\theta)\, dx = \frac{\partial}{\partial \theta}\int p(x;\theta)\, dx = \frac{\partial}{\partial \theta}(1) = 0

under regularity conditions permitting interchange of differentiation and integration.

Interpretation: The score measures how sensitively the log-likelihood responds to perturbations of θ\theta. At the true parameter value, the score has zero mean - on average, the likelihood is at a stationary point. But it has positive variance, measuring how much information the data carries about θ\theta.
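Both properties can be checked by simulation. A stdlib-Python sketch for $\operatorname{Bern}(p)$ with $p = 0.3$ (the value, sample count, and seed are illustrative): the sampled score averages to zero, and its variance approaches $\mathcal{I}(p) = 1/(p(1-p)) \approx 4.76$, previewing the definition of Fisher information below.

```python
import random
random.seed(0)

p, trials = 0.3, 100000
scores = []
for _ in range(trials):
    x = 1 if random.random() < p else 0
    scores.append(x / p - (1 - x) / (1 - p))  # d/dp log Bern(x; p)

mean = sum(scores) / trials
var = sum((s - mean) ** 2 for s in scores) / trials
print(mean)   # ~ 0: the zero-mean property
print(var)    # ~ 1/(p(1-p)) = 4.7619
```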

SCORE FUNCTION - GEOMETRIC INTUITION
========================================================================

  Log-likelihood ℓ(θ) = Σ_i log p(x_i; θ) as a function of θ:

         ℓ(θ)
          |
          |         +-----+
          |       +-+     +-+
          |     +-+         +-+
          |---+-+             +-+----
          |                        θ
                    ^
                 θ* (true parameter)
                 slope = score = 0 at maximum

  Score s(θ; data) = dℓ/dθ is the slope of the log-likelihood.
  High variance of score at θ* means the data sharply identifies
  θ* (the likelihood is "peaky"). Low variance means the data
  are relatively uninformative about θ.

========================================================================

4.2 Fisher Information

Definition 4.2 (Fisher Information). The Fisher information for parameter θ\theta in model p(x;θ)p(x;\theta) is:

I(θ)=E ⁣[(logp(X;θ)θ)2]=Var ⁣(logp(X;θ)θ)\mathcal{I}(\theta) = \mathbb{E}\!\left[\left(\frac{\partial \log p(X;\theta)}{\partial \theta}\right)^2\right] = \operatorname{Var}\!\left(\frac{\partial \log p(X;\theta)}{\partial \theta}\right)

(The second equality uses E[s]=0\mathbb{E}[s] = 0.)

Alternative form (under regularity conditions):

I(θ)=E ⁣[2logp(X;θ)θ2]\mathcal{I}(\theta) = -\mathbb{E}\!\left[\frac{\partial^2 \log p(X;\theta)}{\partial \theta^2}\right]

Proof of equivalence: Differentiating E[s(θ;X)]=0\mathbb{E}[s(\theta;X)] = 0 with respect to θ\theta:

0=θlogpθpdx=[2logpθ2p+logpθpθ]dx0 = \frac{\partial}{\partial\theta}\int \frac{\partial \log p}{\partial \theta} p\, dx = \int \left[\frac{\partial^2 \log p}{\partial \theta^2} p + \frac{\partial \log p}{\partial \theta} \cdot \frac{\partial p}{\partial \theta}\right] dx =E ⁣[2logpθ2]+E ⁣[(logpθ)2]= \mathbb{E}\!\left[\frac{\partial^2 \log p}{\partial \theta^2}\right] + \mathbb{E}\!\left[\left(\frac{\partial \log p}{\partial\theta}\right)^2\right]

Therefore I(θ)=E ⁣[2logpθ2]\mathcal{I}(\theta) = -\mathbb{E}\!\left[\frac{\partial^2 \log p}{\partial\theta^2}\right]. \square

The second-derivative form has intuitive content: I(θ)\mathcal{I}(\theta) is the expected (negative) curvature of the log-likelihood. High curvature = the likelihood has a sharp peak = the data strongly identifies θ\theta. Low curvature = flat likelihood = data is relatively uninformative.

Fisher information for an iid sample of nn observations:

In(θ)=nI1(θ)\mathcal{I}_n(\theta) = n \cdot \mathcal{I}_1(\theta)

Information accumulates linearly with sample size - doubling data doubles information, halving uncertainty (in variance terms: 1/In=(1/I1)/n1/\mathcal{I}_n = (1/\mathcal{I}_1)/n).

Computed examples:

Bernoulli(p)(p): logp(x;p)=xlogp+(1x)log(1p)\log p(x;p) = x\log p + (1-x)\log(1-p). Score: s=x/p(1x)/(1p)s = x/p - (1-x)/(1-p). Score squared: s2=(xp)2p2(1p)2s^2 = \frac{(x-p)^2}{p^2(1-p)^2}.

I(p)=E[s2]=Var(X)p2(1p)2=p(1p)p2(1p)2=1p(1p)\mathcal{I}(p) = \mathbb{E}[s^2] = \frac{\operatorname{Var}(X)}{p^2(1-p)^2} = \frac{p(1-p)}{p^2(1-p)^2} = \frac{1}{p(1-p)}

N(μ,σ2)\mathcal{N}(\mu, \sigma^2) with σ2\sigma^2 known: logp(x;μ)=(xμ)22σ2+const\log p(x;\mu) = -\frac{(x-\mu)^2}{2\sigma^2} + \text{const}. Score: s=(xμ)/σ2s = (x-\mu)/\sigma^2.

I(μ)=E ⁣[(Xμ)2σ4]=Var(X)σ4=σ2σ4=1σ2\mathcal{I}(\mu) = \mathbb{E}\!\left[\frac{(X-\mu)^2}{\sigma^4}\right] = \frac{\operatorname{Var}(X)}{\sigma^4} = \frac{\sigma^2}{\sigma^4} = \frac{1}{\sigma^2}

Poisson(λ)(\lambda): logp(x;λ)=xlogλλlog(x!)\log p(x;\lambda) = x\log\lambda - \lambda - \log(x!). Score: s=x/λ1s = x/\lambda - 1.

I(λ)=E ⁣[(Xλ1)2]=Var(X)λ2=λλ2=1λ\mathcal{I}(\lambda) = \mathbb{E}\!\left[\left(\frac{X}{\lambda} - 1\right)^2\right] = \frac{\operatorname{Var}(X)}{\lambda^2} = \frac{\lambda}{\lambda^2} = \frac{1}{\lambda}

Pattern: In each of these one-parameter families, the parameter is the mean of $X$ and $\mathcal{I}(\theta) = 1/\operatorname{Var}(X)$ - the information per observation is the reciprocal of the observation variance. Consequently the MLE (the sample mean) has variance $\operatorname{Var}(X)/n = 1/(n\mathcal{I}(\theta))$, exactly the Cramer-Rao bound.

Multivariate Fisher Information Matrix. For θRk\boldsymbol{\theta} \in \mathbb{R}^k:

I(θ)jl=E ⁣[logpθjlogpθl]=E ⁣[2logpθjθl]\mathcal{I}(\boldsymbol{\theta})_{jl} = \mathbb{E}\!\left[\frac{\partial \log p}{\partial \theta_j} \cdot \frac{\partial \log p}{\partial \theta_l}\right] = -\mathbb{E}\!\left[\frac{\partial^2 \log p}{\partial \theta_j \partial \theta_l}\right]

In matrix form: I(θ)=E[ss]=E[Hlogp]\mathcal{I}(\boldsymbol{\theta}) = \mathbb{E}[\mathbf{s}\mathbf{s}^\top] = -\mathbb{E}[H_{\log p}] where HlogpH_{\log p} is the Hessian of the log-likelihood.

The FIM is always positive semi-definite (as an outer product expectation), and positive definite under identifiability.

4.3 The Cramer-Rao Lower Bound

The Cramer-Rao Lower Bound (CRB) is the fundamental limit on estimation accuracy.

Theorem 4.3 (Cramer-Rao Lower Bound). Let θ^\hat{\theta} be any unbiased estimator of θ\theta based on nn iid observations. Under regularity conditions:

Var(θ^)1nI(θ)\operatorname{Var}(\hat{\theta}) \geq \frac{1}{n\mathcal{I}(\theta)}

For the multivariate case with unbiased estimator θ^\hat{\boldsymbol{\theta}} of θRk\boldsymbol{\theta} \in \mathbb{R}^k:

Cov(θ^)1nI(θ)1\operatorname{Cov}(\hat{\boldsymbol{\theta}}) \succeq \frac{1}{n}\mathcal{I}(\boldsymbol{\theta})^{-1}

(meaning Cov(θ^)1nI(θ)1\operatorname{Cov}(\hat{\boldsymbol{\theta}}) - \frac{1}{n}\mathcal{I}(\boldsymbol{\theta})^{-1} is positive semidefinite).

Proof (scalar case, single observation):

We need to show Var(θ^)I(θ)1\operatorname{Var}(\hat{\theta}) \cdot \mathcal{I}(\theta) \geq 1.

Since θ^\hat{\theta} is unbiased: E[θ^]=θ\mathbb{E}[\hat{\theta}] = \theta. Differentiating with respect to θ\theta:

1=θθ^(x)p(x;θ)dx=θ^(x)logp(x;θ)θp(x;θ)dx=E[θ^s(θ;X)]1 = \frac{\partial}{\partial\theta}\int \hat{\theta}(x)\, p(x;\theta)\, dx = \int \hat{\theta}(x)\, \frac{\partial \log p(x;\theta)}{\partial\theta}\, p(x;\theta)\, dx = \mathbb{E}[\hat{\theta} \cdot s(\theta;X)]

Now apply the Cauchy-Schwarz inequality to E[θ^s]\mathbb{E}[\hat{\theta} \cdot s]:

1=E[θ^s]=E[(θ^θ)s](since E[s]=0)1 = \mathbb{E}[\hat{\theta} \cdot s] = \mathbb{E}[(\hat{\theta} - \theta) \cdot s] \quad (\text{since } \mathbb{E}[s] = 0)

By Cauchy-Schwarz: E[(θ^θ)s]2E[(θ^θ)2]E[s2]\mathbb{E}[(\hat{\theta} - \theta) \cdot s]^2 \leq \mathbb{E}[(\hat{\theta}-\theta)^2] \cdot \mathbb{E}[s^2]

1Var(θ^)I(θ)1 \leq \operatorname{Var}(\hat{\theta}) \cdot \mathcal{I}(\theta) Var(θ^)1I(θ)\operatorname{Var}(\hat{\theta}) \geq \frac{1}{\mathcal{I}(\theta)} \qquad \square

When is the bound tight? The CRB is achieved (with equality in Cauchy-Schwarz) if and only if (θ^θ)=c(θ)s(θ;X)(\hat{\theta} - \theta) = c(\theta) \cdot s(\theta; X) for some function c(θ)c(\theta). This happens precisely when the model is a one-parameter exponential family and the parameter of interest is its mean-value parameter; in that case the efficient estimator coincides with the MLE.

Biased estimator CRB. For biased estimators with E[θ^]=θ+b(θ)\mathbb{E}[\hat{\theta}] = \theta + b(\theta):

Var(θ^)(1+b(θ))2I(θ)\operatorname{Var}(\hat{\theta}) \geq \frac{(1 + b'(\theta))^2}{\mathcal{I}(\theta)}

This shows that introducing bias (via regularisation) can actually reduce the CRB - a biased estimator faces a different, potentially less demanding bound.

Examples of efficient estimators:

| Model | Parameter | Efficient estimator | $\operatorname{Var}$ | CRB |
| --- | --- | --- | --- | --- |
| $\mathcal{N}(\mu,\sigma^2)$, $\sigma^2$ known | $\mu$ | $\bar{x}$ | $\sigma^2/n$ | $\sigma^2/n$ ✓ |
| $\operatorname{Bern}(p)$ | $p$ | $\bar{x}$ | $p(1-p)/n$ | $p(1-p)/n$ ✓ |
| $\operatorname{Poisson}(\lambda)$ | $\lambda$ | $\bar{x}$ | $\lambda/n$ | $\lambda/n$ ✓ |
| $\operatorname{Exp}(\lambda)$ | $\lambda$ | $n/\sum x^{(i)}$ | $\lambda^2/n$ (asymptotic) | $\lambda^2/n$ ✓ |

(The exponential row is asymptotic: $1/\bar{x}$ is slightly biased in finite samples but attains the bound as $n \to \infty$.)

The sample mean is efficient for all these single-parameter exponential families. This is not a coincidence - it follows from the structure of exponential families.
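A simulation sketch of efficiency (stdlib Python; the constants and seed are illustrative): across repeated Gaussian samples, the variance of $\bar{x}$ matches the Cramer-Rao bound $\sigma^2/n$.

```python
import random
random.seed(0)

mu, sigma, n, trials = 2.0, 1.5, 25, 20000
means = []
for _ in range(trials):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(sum(xs) / n)

m = sum(means) / trials
v = sum((x - m) ** 2 for x in means) / trials
crb = sigma ** 2 / n      # 1/(n I(mu)) with I(mu) = 1/sigma^2
print(v, crb)             # empirical variance of x-bar sits at the bound
```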

4.4 Fisher Information Matrix in Practice

For a neural network with parameters θRp\boldsymbol{\theta} \in \mathbb{R}^p defining a conditional distribution p(yx;θ)p(y|\mathbf{x};\boldsymbol{\theta}), the FIM is:

I(θ)=E(x,y)pdata ⁣[θlogp(yx;θ)θlogp(yx;θ)]\mathcal{I}(\boldsymbol{\theta}) = \mathbb{E}_{(\mathbf{x},y)\sim p_{\text{data}}}\!\left[\nabla_{\boldsymbol{\theta}} \log p(y|\mathbf{x};\boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}} \log p(y|\mathbf{x};\boldsymbol{\theta})^\top\right]

This is an outer product of gradient vectors, making it a $p \times p$ PSD matrix. For modern LLMs with $p \sim 10^{11}$ parameters, the full matrix is completely intractable: it has $p^2 \sim 10^{22}$ entries, far beyond any storage.

Empirical FIM. In practice, approximate by sampling:

I^(θ)=1ni=1nθlogp(y(i)x(i);θ)θlogp(y(i)x(i);θ)\hat{\mathcal{I}}(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^n \nabla_{\boldsymbol{\theta}} \log p(y^{(i)}|\mathbf{x}^{(i)};\boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}} \log p(y^{(i)}|\mathbf{x}^{(i)};\boldsymbol{\theta})^\top

This is the average outer product of per-example gradients; its diagonal - the mean elementwise squared gradient - is exactly what adaptive optimisers such as Adam accumulate.

Natural gradient (Amari, 1998). The ordinary gradient θL\nabla_{\boldsymbol{\theta}} \mathcal{L} is the steepest ascent direction in Euclidean parameter space Rp\mathbb{R}^p. But parameters do not live in Euclidean space - they live in the statistical manifold of distributions, where the natural metric is the Fisher information. The natural gradient is:

~θL=I(θ)1θL\tilde{\nabla}_{\boldsymbol{\theta}} \mathcal{L} = \mathcal{I}(\boldsymbol{\theta})^{-1} \nabla_{\boldsymbol{\theta}} \mathcal{L}

This is the steepest ascent direction in the metric defined by the FIM. Natural gradient is invariant to reparametrisation: if we change from θ\boldsymbol{\theta} to ϕ=g(θ)\boldsymbol{\phi} = g(\boldsymbol{\theta}), the natural gradient update gives the same parameter change.
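The invariance claim can be illustrated in one dimension. A hedged sketch for a Bernoulli model (the logit parametrisation, starting point, and step size are illustrative): a natural-gradient step computed in the $p$-parametrisation and one computed in the logit parametrisation $\theta = \log\frac{p}{1-p}$ move the model to the same place, to first order in the step size.

```python
import math

# one natural-gradient step on the average Bernoulli log-likelihood,
# in two parametrisations of the same model
xbar, eta = 0.9, 1e-3     # sufficient statistic and step size (illustrative)
p0 = 0.2
theta0 = math.log(p0 / (1 - p0))   # logit parametrisation

# p-space: score = (xbar-p)/(p(1-p)), Fisher = 1/(p(1-p))
#   -> natural step = eta * (xbar - p)
p_new = p0 + eta * (xbar - p0)

# theta-space: score = xbar - p, Fisher = p(1-p)
#   -> natural step = eta * (xbar - p) / (p(1-p))
theta_new = theta0 + eta * (xbar - p0) / (p0 * (1 - p0))
p_from_theta = 1 / (1 + math.exp(-theta_new))

print(p_new, p_from_theta)   # agree to first order: parametrisation-invariant
```

Preconditioning by $\mathcal{I}^{-1}$ cancels the Jacobian factors that make the ordinary gradient parametrisation-dependent.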

K-FAC approximation (Martens & Grosse, 2015). Full FIM inversion costs O(p3)O(p^3). K-FAC approximates I1\mathcal{I}^{-1} by assuming layer-wise independence and Kronecker factorisation of each layer's FIM: I[l]A[l]G[l]\mathcal{I}^{[l]} \approx A^{[l]} \otimes G^{[l]}, where A[l]=E[a[l1](a[l1])]A^{[l]} = \mathbb{E}[\mathbf{a}^{[l-1]}(\mathbf{a}^{[l-1]})^\top] (input covariance) and G[l]=E[δ[l](δ[l])]G^{[l]} = \mathbb{E}[\delta^{[l]}(\delta^{[l]})^\top] (gradient covariance). Inverting the Kronecker product: (AG)1=A1G1(A \otimes G)^{-1} = A^{-1} \otimes G^{-1}, reducing cost from O(p3)O(p^3) to O(din3+dout3)O(d_{\text{in}}^3 + d_{\text{out}}^3) per layer.

Jeffreys prior. In Bayesian statistics (a preview of Section 4), the Jeffreys prior $p(\theta) \propto \sqrt{\mathcal{I}(\theta)}$ is the "uninformative" prior that is invariant to reparametrisation. For $\theta = p$ in Bernoulli: $\mathcal{I}(p) = 1/(p(1-p))$, so $p_J(p) \propto p^{-1/2}(1-p)^{-1/2}$, which is $\operatorname{Beta}(1/2, 1/2)$.

Preview: MAP Estimation. Adding a prior $p(\boldsymbol{\theta})$ and finding $\arg\max_{\boldsymbol{\theta}} [p(\mathcal{D}|\boldsymbol{\theta})\, p(\boldsymbol{\theta})]$ gives the Maximum A Posteriori (MAP) estimate - MLE regularised by the log-prior. MAP with a Gaussian prior $\mathcal{N}(0, \lambda^{-1}I)$ gives L2-regularised MLE. The full Bayesian treatment of priors, posteriors, and conjugate families is in Section 4, Bayesian Inference.


5. Maximum Likelihood Estimation

5.1 The Likelihood Principle

Definition 5.1 (Likelihood Function). Given observed data x(1),,x(n)\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)} and a parametric model p(;θ)p(\cdot;\boldsymbol{\theta}), the likelihood function is:

L(θ;x(1),,x(n))=i=1np(x(i);θ)L(\boldsymbol{\theta}; \mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)}) = \prod_{i=1}^n p(\mathbf{x}^{(i)};\boldsymbol{\theta})

The log-likelihood is:

(θ)=logL(θ)=i=1nlogp(x(i);θ)\ell(\boldsymbol{\theta}) = \log L(\boldsymbol{\theta}) = \sum_{i=1}^n \log p(\mathbf{x}^{(i)};\boldsymbol{\theta})

The Maximum Likelihood Estimator (MLE) is:

θ^MLE=argmaxθΘ(θ)=argmaxθΘi=1nlogp(x(i);θ)\hat{\boldsymbol{\theta}}_{\text{MLE}} = \arg\max_{\boldsymbol{\theta} \in \Theta} \ell(\boldsymbol{\theta}) = \arg\max_{\boldsymbol{\theta} \in \Theta} \sum_{i=1}^n \log p(\mathbf{x}^{(i)};\boldsymbol{\theta})

The likelihood is not a probability. L(θ;x)L(\boldsymbol{\theta}; \mathbf{x}) is a function of θ\boldsymbol{\theta} for fixed data x\mathbf{x}, not a probability distribution over θ\boldsymbol{\theta}. It does not integrate to 1 over θ\boldsymbol{\theta}. The statement "L(θ)=0.3L(\boldsymbol{\theta}) = 0.3" means "the data has probability density 0.3 under parameter θ\boldsymbol{\theta}", not "there is a 30% chance θ\boldsymbol{\theta} is the true parameter".

Why log? Three reasons:

  1. Computational: i=1npi\prod_{i=1}^n p_i underflows to zero for large nn (already 0.51000103010.5^{1000} \approx 10^{-301} sits at the edge of double precision, and slightly longer products are exactly 0); sums of logs are numerically stable
  2. Mathematical: sums are easier to differentiate and optimise than products
  3. Statistical: log is a monotone transformation, so argmaxL=argmaxlogL\arg\max L = \arg\max \log L
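The computational point is easy to demonstrate in stdlib Python (2000 observations chosen so the raw product underflows double precision, while the log-likelihood stays perfectly finite):

```python
import math

probs = [0.5] * 2000                        # per-observation likelihoods
product = 1.0
for q in probs:
    product *= q                            # 2^-2000 underflows float64 to exactly 0.0
loglik = sum(math.log(q) for q in probs)    # stable: 2000 * log(0.5) ~ -1386.29

print(product)   # 0.0
print(loglik)
```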

The likelihood principle states that all information about θ\boldsymbol{\theta} in the data is contained in the likelihood function. Two datasets with proportional likelihoods should lead to the same inferences about θ\boldsymbol{\theta}. This is a philosophical principle - frequentists may not accept it fully (confidence intervals depend on the sampling procedure, not just the likelihood), but it underlies MLE and Bayesian inference alike.

5.2 Deriving MLEs: Scalar Parameters

The standard procedure for finding the MLE:

  1. Write the log-likelihood (θ)=ilogp(x(i);θ)\ell(\theta) = \sum_i \log p(x^{(i)};\theta)
  2. Differentiate and set (θ)=0\ell'(\theta) = 0 (the score equation)
  3. Verify it is a maximum (second derivative <0< 0 or global structure)
  4. Check boundary cases (the maximum might be at Θ\Theta's boundary)

Bernoulli MLE. Let x(1),,x(n)iidBern(p)x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \operatorname{Bern}(p).

(p)=i=1n[x(i)logp+(1x(i))log(1p)]=Slogp+(nS)log(1p)\ell(p) = \sum_{i=1}^n \left[x^{(i)}\log p + (1-x^{(i)})\log(1-p)\right] = S\log p + (n-S)\log(1-p)

where S=i=1nx(i)S = \sum_{i=1}^n x^{(i)} is the total number of successes. Setting (p)=0\ell'(p) = 0:

SpnS1p=0    S(1p)=p(nS)    p^MLE=Sn=xˉ\frac{S}{p} - \frac{n-S}{1-p} = 0 \implies S(1-p) = p(n-S) \implies \hat{p}_{\text{MLE}} = \frac{S}{n} = \bar{x}

The MLE is the sample proportion - the intuitive answer.

Poisson MLE. Let x(1),,x(n)iidPoisson(λ)x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \operatorname{Poisson}(\lambda).

(λ)=i=1n[x(i)logλλlog(x(i)!)]=Slogλnλ+const\ell(\lambda) = \sum_{i=1}^n \left[x^{(i)}\log\lambda - \lambda - \log(x^{(i)}!)\right] = S\log\lambda - n\lambda + \text{const}

Setting (λ)=S/λn=0\ell'(\lambda) = S/\lambda - n = 0: λ^MLE=S/n=xˉ\hat{\lambda}_{\text{MLE}} = S/n = \bar{x}.

The MLE of the Poisson rate is the sample mean - natural, because E[X]=λ\mathbb{E}[X] = \lambda, so estimating the mean estimates the rate.

Exponential MLE. Let x(1),,x(n)iidExp(λ)x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \operatorname{Exp}(\lambda), i.e., p(x;λ)=λeλxp(x;\lambda) = \lambda e^{-\lambda x} for x>0x > 0.

(λ)=nlogλλi=1nx(i)\ell(\lambda) = n\log\lambda - \lambda\sum_{i=1}^n x^{(i)}

Setting (λ)=n/λx(i)=0\ell'(\lambda) = n/\lambda - \sum x^{(i)} = 0: λ^MLE=n/x(i)=1/xˉ\hat{\lambda}_{\text{MLE}} = n/\sum x^{(i)} = 1/\bar{x}.

The MLE of the rate is the reciprocal of the sample mean - since E[X]=1/λ\mathbb{E}[X] = 1/\lambda, this is natural.

Uniform MLE. Let x(1),,x(n)iidU(0,θ)x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \mathcal{U}(0,\theta), p(x;θ)=1/θp(x;\theta) = 1/\theta for x[0,θ]x \in [0,\theta].

L(\theta) = \theta^{-n} \cdot \mathbb{1}[\theta \geq \max_i x^{(i)}]

The likelihood is zero for θ<maxix(i)\theta < \max_i x^{(i)} (some observation would be outside [0,θ][0,\theta]) and decreasing in θ\theta for θmaxix(i)\theta \geq \max_i x^{(i)}. The MLE is at the boundary: θ^MLE=maxix(i)\hat{\theta}_{\text{MLE}} = \max_i x^{(i)}.

This is an example where the score equation gives no solution (the log-likelihood has no interior critical point); the MLE is found by reasoning about the likelihood's shape. Also, the MLE here is biased: E[maxix(i)]=nθ/(n+1)<θ\mathbb{E}[\max_i x^{(i)}] = n\theta/(n+1) < \theta.
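The bias is visible in a short simulation (stdlib Python; $\theta = 5$, $n = 10$, and the seed are illustrative): the average of $\max_i x^{(i)}$ over many repetitions lands near $n\theta/(n+1)$, systematically below $\theta$.

```python
import random
random.seed(0)

theta, n, trials = 5.0, 10, 50000
mles = [max(random.uniform(0, theta) for _ in range(n)) for _ in range(trials)]
avg = sum(mles) / trials

print(avg)                    # ~ n*theta/(n+1) = 4.545..., below theta = 5
print(n * theta / (n + 1))
```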

5.3 Deriving MLEs: Multivariate Gaussian

Let x(1),,x(n)iidN(μ,Σ)\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(n)} \overset{\text{iid}}{\sim} \mathcal{N}(\boldsymbol{\mu}, \Sigma) with μRd\boldsymbol{\mu} \in \mathbb{R}^d and ΣS++d\Sigma \in \mathbb{S}^d_{++}.

The log-likelihood is:

(μ,Σ)=n2logdet(Σ)12i=1n(x(i)μ)Σ1(x(i)μ)nd2log(2π)\ell(\boldsymbol{\mu}, \Sigma) = -\frac{n}{2}\log\det(\Sigma) - \frac{1}{2}\sum_{i=1}^n (\mathbf{x}^{(i)} - \boldsymbol{\mu})^\top \Sigma^{-1}(\mathbf{x}^{(i)} - \boldsymbol{\mu}) - \frac{nd}{2}\log(2\pi)

MLE of μ\boldsymbol{\mu}: Taking the gradient with respect to μ\boldsymbol{\mu} and setting to zero:

μ=i=1nΣ1(x(i)μ)=0    μ^MLE=1ni=1nx(i)=xˉ\nabla_{\boldsymbol{\mu}} \ell = \sum_{i=1}^n \Sigma^{-1}(\mathbf{x}^{(i)} - \boldsymbol{\mu}) = 0 \implies \hat{\boldsymbol{\mu}}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n \mathbf{x}^{(i)} = \bar{\mathbf{x}}

MLE of Σ\Sigma: Substituting μ^=xˉ\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}} and differentiating with respect to Σ1\Sigma^{-1} (using matrix calculus identities AlogdetA=A\nabla_A \log\det A = A^{-\top} and Atr(BA)=B\nabla_A \operatorname{tr}(BA) = B^\top):

Σ1=n2Σ12i=1n(x(i)xˉ)(x(i)xˉ)=0\frac{\partial \ell}{\partial \Sigma^{-1}} = \frac{n}{2}\Sigma - \frac{1}{2}\sum_{i=1}^n (\mathbf{x}^{(i)} - \bar{\mathbf{x}})(\mathbf{x}^{(i)} - \bar{\mathbf{x}})^\top = 0 Σ^MLE=1ni=1n(x(i)xˉ)(x(i)xˉ)\hat{\Sigma}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}^{(i)} - \bar{\mathbf{x}})(\mathbf{x}^{(i)} - \bar{\mathbf{x}})^\top

Bias of Σ^MLE\hat{\Sigma}_{\text{MLE}}: By the same calculation as in the scalar case, E[Σ^MLE]=n1nΣ\mathbb{E}[\hat{\Sigma}_{\text{MLE}}] = \frac{n-1}{n}\Sigma. The unbiased estimator is S=1n1i(x(i)xˉ)(x(i)xˉ)S = \frac{1}{n-1}\sum_i (\mathbf{x}^{(i)} - \bar{\mathbf{x}})(\mathbf{x}^{(i)} - \bar{\mathbf{x}})^\top.

For AI: Fitting a Gaussian to activations or embeddings is done in numerous ML methods. The empirical covariance of a layer's activations, used in whitening transforms, weight initialisation (Xavier/He initialisation matches the second moment), and covariance-regularised fine-tuning, all use this MLE formula. The multivariate Gaussian MLE is also the building block for Gaussian mixture models (EM algorithm) and linear discriminant analysis.
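A stdlib-Python sketch of the 2-D case (the target covariance $[[1, 0.8], [0.8, 1]]$, sample size, and seed are illustrative): applying the MLE formulas to simulated correlated Gaussians recovers the mean vector and covariance matrix.

```python
import random
random.seed(0)

n = 100000
data = []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    data.append((z1, 0.8 * z1 + 0.6 * z2))   # true cov = [[1, 0.8], [0.8, 1]]

mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
cxx = sum((x - mx) ** 2 for x, _ in data) / n       # 1/n normaliser: the (biased) MLE
cyy = sum((y - my) ** 2 for _, y in data) / n
cxy = sum((x - mx) * (y - my) for x, y in data) / n

print(mx, my)          # ~ (0, 0)
print(cxx, cxy, cyy)   # ~ 1.0, 0.8, 1.0
```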

5.4 MLE as Cross-Entropy Minimisation

This connection is perhaps the most important in all of applied ML.

Theorem 5.4. For a classification model p(yx;θ)p(y|\mathbf{x};\boldsymbol{\theta}) and iid data {(x(i),y(i))}i=1n\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^n:

θ^MLE=argmaxθi=1nlogp(y(i)x(i);θ)=argminθ[1ni=1nlogp(y(i)x(i);θ)]\hat{\boldsymbol{\theta}}_{\text{MLE}} = \arg\max_{\boldsymbol{\theta}} \sum_{i=1}^n \log p(y^{(i)}|\mathbf{x}^{(i)};\boldsymbol{\theta}) = \arg\min_{\boldsymbol{\theta}} \left[-\frac{1}{n}\sum_{i=1}^n \log p(y^{(i)}|\mathbf{x}^{(i)};\boldsymbol{\theta})\right]

The right-hand side is the negative log-likelihood (NLL) or cross-entropy loss:

LCE(θ)=1ni=1nlogp(y(i)x(i);θ)\mathcal{L}_{\text{CE}}(\boldsymbol{\theta}) = -\frac{1}{n}\sum_{i=1}^n \log p(y^{(i)}|\mathbf{x}^{(i)};\boldsymbol{\theta})

Connection to KL divergence: Let $p_{\text{data}} = \frac{1}{n}\sum_i \delta_{(\mathbf{x}^{(i)}, y^{(i)})}$ denote the empirical distribution over $(\mathbf{x}, y)$ pairs. Then:

LCE=1nilogp(y(i)x(i);θ)=H(pdata)+DKL(pdatapθ)\mathcal{L}_{\text{CE}} = -\frac{1}{n}\sum_i \log p(y^{(i)}|\mathbf{x}^{(i)};\boldsymbol{\theta}) = H(p_{\text{data}}) + D_{\mathrm{KL}}(p_{\text{data}} \| p_\theta)

Since H(pdata)H(p_{\text{data}}) does not depend on θ\boldsymbol{\theta}:

argminθLCE=argminθDKL(pdatapθ)\arg\min_{\boldsymbol{\theta}} \mathcal{L}_{\text{CE}} = \arg\min_{\boldsymbol{\theta}} D_{\mathrm{KL}}(p_{\text{data}} \| p_{\boldsymbol{\theta}})

MLE minimises the KL divergence from the data distribution to the model. This is the information-theoretic interpretation of MLE.
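The decomposition can be verified on a toy categorical pair (the three-class probabilities below are illustrative): cross-entropy equals entropy plus KL divergence, term for term.

```python
import math

p_data = [0.5, 0.3, 0.2]    # empirical class frequencies
q_model = [0.6, 0.2, 0.2]   # model probabilities

ce = -sum(p * math.log(q) for p, q in zip(p_data, q_model))   # cross-entropy
h = -sum(p * math.log(p) for p in p_data)                     # entropy H(p_data)
kl = sum(p * math.log(p / q) for p, q in zip(p_data, q_model))

print(ce, h + kl)   # identical: CE = H + KL, so argmin CE = argmin KL
```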

For language models: An LLM defines pθ(xtx1,,xt1)p_{\boldsymbol{\theta}}(x_t | x_1, \ldots, x_{t-1}). Training by NLL is MLE:

θ^=argminθ1Tt=1Tlogpθ(xtx<t)\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}} -\frac{1}{T}\sum_{t=1}^T \log p_{\boldsymbol{\theta}}(x_t | x_{<t})

where the sum is over all tokens in the training corpus. This objective is used verbatim in every large language model - GPT-4 (OpenAI, 2023), LLaMA-3 (Meta, 2024), Gemini (Google, 2023), Claude (Anthropic, 2024). The perplexity exp(LCE)\exp(\mathcal{L}_{\text{CE}}) is the standard evaluation metric.

Label smoothing as regularised MLE. Standard MLE uses hard labels y(i){0,1}Ky^{(i)} \in \{0,1\}^K (one-hot). Label smoothing (Szegedy et al., 2016) uses soft targets y~(i)=(1ε)y(i)+ε/K\tilde{y}^{(i)} = (1-\varepsilon)y^{(i)} + \varepsilon/K, adding a small weight to incorrect classes. This is equivalent to regularising MLE by mixing the empirical distribution with a uniform prior - reducing overconfidence and improving calibration.

5.5 Properties of MLE

Under regularity conditions (the "Cramer-Rao conditions" on the model), MLE satisfies:

1. Consistency: θ^MLEPθ\hat{\boldsymbol{\theta}}_{\text{MLE}} \xrightarrow{P} \boldsymbol{\theta}^*. The proof uses the fact that the expected log-likelihood E[(θ)/n]\mathbb{E}[\ell(\boldsymbol{\theta})/n] is uniquely maximised at the true parameter θ\boldsymbol{\theta}^* (its gap from the maximum is a KL divergence, strictly positive under identifiability), and the LLN ensures (θ)/nE[logp(X;θ)]\ell(\boldsymbol{\theta})/n \to \mathbb{E}[\log p(X;\boldsymbol{\theta})] uniformly.

2. Asymptotic normality:

n(θ^MLEθ)dN ⁣(0,I(θ)1)\sqrt{n}\left(\hat{\boldsymbol{\theta}}_{\text{MLE}} - \boldsymbol{\theta}^*\right) \xrightarrow{d} \mathcal{N}\!\left(\mathbf{0},\, \mathcal{I}(\boldsymbol{\theta}^*)^{-1}\right)

For large nn: θ^MLEN(θ,1nI(θ)1)\hat{\boldsymbol{\theta}}_{\text{MLE}} \approx \mathcal{N}(\boldsymbol{\theta}^*, \frac{1}{n}\mathcal{I}(\boldsymbol{\theta}^*)^{-1}).

3. Asymptotic efficiency: The asymptotic covariance 1nI1\frac{1}{n}\mathcal{I}^{-1} achieves the CRB. No consistent estimator can have smaller asymptotic variance (at every θ\boldsymbol{\theta}). MLE is asymptotically optimal.

4. Invariance under reparametrisation: If θ^\hat{\boldsymbol{\theta}} is the MLE of θ\boldsymbol{\theta}, then g(θ^)g(\hat{\boldsymbol{\theta}}) is the MLE of g(θ)g(\boldsymbol{\theta}) for any function gg (not required to be injective). This is the invariance property of MLE.

Examples of invariance:

  • MLE of $\sigma$ is $\hat{\sigma} = \sqrt{\hat{\sigma}^2_{\text{MLE}}}$ (not $s$; note $s$ is biased for $\sigma$ even though $s^2$ is unbiased for $\sigma^2$)
  • MLE of 1/λ1/\lambda in Exponential is xˉ\bar{x} (consistent with λ^=1/xˉ\hat{\lambda} = 1/\bar{x})
  • MLE of p2p^2 in Bernoulli is p^2=xˉ2\hat{p}^2 = \bar{x}^2 (biased, but the MLE)

Regularity conditions (Cramer-Rao conditions). These sufficient conditions guarantee the properties above:

  1. The parameter space Θ\Theta is an open subset of Rk\mathbb{R}^k
  2. The model is identifiable
  3. The support {x:p(x;θ)>0}\{x: p(x;\boldsymbol{\theta}) > 0\} does not depend on θ\boldsymbol{\theta} (excludes Uniform)
  4. The log-likelihood is three times differentiable in θ\boldsymbol{\theta}
  5. The Fisher information matrix is positive definite at θ\boldsymbol{\theta}^*

When these fail (as in the Uniform example), MLE may still be the natural estimator but its properties differ.

5.6 Numerical MLE

For complex models, the score equation (θ)=0\ell'(\boldsymbol{\theta}) = 0 has no closed-form solution and must be solved numerically.

Gradient ascent on log-likelihood. The simplest approach:

θt+1=θt+ηθ(θt)\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \eta \nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta}_t)

For neural networks trained with mini-batches, this becomes stochastic gradient ascent on the mini-batch log-likelihood - exactly SGD on the NLL loss, with the sign flipped because the objective is negated.

Newton-Raphson / Fisher scoring. Use second-order information:

θt+1=θt[H(θt)]1(θt)\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - [H_\ell(\boldsymbol{\theta}_t)]^{-1} \nabla \ell(\boldsymbol{\theta}_t)

where HH_\ell is the Hessian of the log-likelihood. Replacing the Hessian with the expected negative Hessian (Fisher information) gives Fisher scoring:

θt+1=θt+1nI(θt)1(θt)\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \frac{1}{n}\mathcal{I}(\boldsymbol{\theta}_t)^{-1} \nabla \ell(\boldsymbol{\theta}_t)

This is exactly the natural gradient update. Newton-Raphson converges quadratically near the maximum - much faster than gradient ascent - but requires inverting a k×kk \times k matrix per step.

EM algorithm. For models with latent variables (mixtures, HMMs, VAEs), the EM algorithm alternates between:

  • E-step: compute Q(θ;θt)=Ezx;θt[logp(x,z;θ)]Q(\boldsymbol{\theta};\boldsymbol{\theta}_t) = \mathbb{E}_{z|\mathbf{x};\boldsymbol{\theta}_t}[\log p(\mathbf{x},z;\boldsymbol{\theta})]
  • M-step: θt+1=argmaxθQ(θ;θt)\boldsymbol{\theta}_{t+1} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta};\boldsymbol{\theta}_t)

EM guarantees that (θt+1)(θt)\ell(\boldsymbol{\theta}_{t+1}) \geq \ell(\boldsymbol{\theta}_t) - the log-likelihood is non-decreasing. EM is a special case of variational inference (developed in Section 4, Bayesian Inference).

Numerical pitfalls:

  • Overflow/underflow: compute =logpi\ell = \sum \log p_i, not logpi\log \prod p_i (avoid multiplying probabilities)
  • Log-sum-exp trick: for softmax-based likelihoods, logkezk=zmax+logkezkzmax\log \sum_k e^{z_k} = z_{\max} + \log\sum_k e^{z_k - z_{\max}}
  • Flat log-likelihoods: when I(θ)0\mathcal{I}(\boldsymbol{\theta}) \approx 0 near the solution, gradient methods converge extremely slowly; natural gradient methods help
  • Local maxima: the log-likelihood may be multimodal for mixture models and neural networks; multiple restarts are necessary
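The log-sum-exp trick from the list above, in stdlib Python (logits around 1000 chosen to force overflow in the naive form):

```python
import math

z = [1000.0, 1001.0, 1002.0]   # logits far beyond exp()'s float64 range

def logsumexp(zs):
    m = max(zs)                 # subtract the max before exponentiating
    return m + math.log(sum(math.exp(v - m) for v in zs))

try:
    naive = math.log(sum(math.exp(v) for v in z))
except OverflowError:
    naive = None                # math.exp(1000) overflows float64

print(naive)          # None: the naive form fails
print(logsumexp(z))   # ~ 1002.4076: the shifted form is exact
```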

6. Method of Moments

6.1 The Method of Moments

The method of moments (MoM), introduced by Karl Pearson (1894), is the oldest systematic estimation procedure. The idea is simple: match sample moments to population moments and solve for θ\boldsymbol{\theta}.

Definition 6.1 (Method of Moments Estimator). For a kk-parameter model, the MoM estimator solves the system:

μ^j=E[Xj;θ],j=1,,k\hat{\mu}_j = \mathbb{E}[X^j;\boldsymbol{\theta}], \quad j = 1, \ldots, k

where μ^j=1ni=1n(x(i))j\hat{\mu}_j = \frac{1}{n}\sum_{i=1}^n (x^{(i)})^j is the jj-th sample moment.

Example: Normal distribution. For N(μ,σ2)\mathcal{N}(\mu, \sigma^2) with 2 parameters:

  • 1st moment equation: xˉ=E[X]=μ    μ^=xˉ\bar{x} = \mathbb{E}[X] = \mu \implies \hat{\mu} = \bar{x}
  • 2nd moment equation: 1nxi2=E[X2]=μ2+σ2    σ^2=1nxi2xˉ2=1n(xixˉ)2\frac{1}{n}\sum x_i^2 = \mathbb{E}[X^2] = \mu^2 + \sigma^2 \implies \hat{\sigma}^2 = \frac{1}{n}\sum x_i^2 - \bar{x}^2 = \frac{1}{n}\sum(x_i - \bar{x})^2

For the Gaussian, MoM and MLE coincide.

Example: Gamma distribution. XGamma(α,β)X \sim \text{Gamma}(\alpha, \beta) with mean α/β\alpha/\beta and variance α/β2\alpha/\beta^2.

  • μ^1=xˉ=α/β\hat{\mu}_1 = \bar{x} = \alpha/\beta
  • μ^2μ^12=s2=α/β2\hat{\mu}_2 - \hat{\mu}_1^2 = s^2 = \alpha/\beta^2

Solving: β^=xˉ/s2\hat{\beta} = \bar{x}/s^2, α^=xˉ2/s2\hat{\alpha} = \bar{x}^2/s^2. Unlike the Gaussian case, the Gamma MLE requires numerical optimisation, making MoM attractive for quick estimation.
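A sketch of Gamma moment-matching in stdlib Python (shape 3 and rate 2 are illustrative; note `random.gammavariate` takes shape and scale, so scale $= 1/\text{rate}$): the two moment equations recover both parameters without any optimisation.

```python
import random
random.seed(0)

alpha, rate, n = 3.0, 2.0, 200000
xs = [random.gammavariate(alpha, 1 / rate) for _ in range(n)]

m1 = sum(xs) / n                        # sample mean  ~ alpha/rate
s2 = sum((x - m1) ** 2 for x in xs) / n # sample variance ~ alpha/rate^2

beta_hat = m1 / s2        # rate estimate
alpha_hat = m1 ** 2 / s2  # shape estimate
print(alpha_hat, beta_hat)   # ~ 3.0, 2.0
```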

Example: Beta distribution. XBeta(α,β)X \sim \text{Beta}(\alpha, \beta) with mean α/(α+β)\alpha/(\alpha+\beta) and variance αβ/[(α+β)2(α+β+1)]\alpha\beta/[(\alpha+\beta)^2(\alpha+\beta+1)].

Setting m=xˉm = \bar{x} and v=s2v = s^2, solving the moment equations:

α^=m(m(1m)v1),β^=(1m)(m(1m)v1)\hat{\alpha} = m\left(\frac{m(1-m)}{v} - 1\right), \quad \hat{\beta} = (1-m)\left(\frac{m(1-m)}{v} - 1\right)

This requires v<m(1m)v < m(1-m) (the variance must be less than the theoretical maximum for the Beta).

Properties of MoM:

  • Consistent: by the LLN, sample moments converge to population moments; moment equations are continuous, so MoM estimators are consistent
  • Asymptotically normal: by the multivariate CLT and delta method
  • Inefficient: generally does not achieve the CRB; MLE has smaller asymptotic variance when it exists in closed form

6.2 Generalised Method of Moments

The Generalised Method of Moments (GMM), introduced by Hansen (1982), allows for more moment conditions than parameters.

Suppose we have mm moment conditions E[g(x;θ)]=0\mathbb{E}[g(\mathbf{x};\boldsymbol{\theta})] = \mathbf{0}, where g:Rd×ΘRmg: \mathbb{R}^d \times \Theta \to \mathbb{R}^m and mkm \geq k (over-identified system).

Sample moment conditions: g^n(θ)=1ni=1ng(x(i);θ)\hat{\mathbf{g}}_n(\boldsymbol{\theta}) = \frac{1}{n}\sum_{i=1}^n g(\mathbf{x}^{(i)};\boldsymbol{\theta})

When m>km > k, we cannot set all mm conditions to zero simultaneously. The GMM estimator minimises the weighted quadratic form:

θ^GMM=argminθg^n(θ)Wg^n(θ)\hat{\boldsymbol{\theta}}_{\text{GMM}} = \arg\min_{\boldsymbol{\theta}} \hat{\mathbf{g}}_n(\boldsymbol{\theta})^\top W \hat{\mathbf{g}}_n(\boldsymbol{\theta})

for some positive-definite weighting matrix WW. The optimal weight matrix (minimising asymptotic variance) is W=S1W^* = S^{-1} where S=Var(g(x;θ))S = \operatorname{Var}(g(\mathbf{x};\boldsymbol{\theta}^*)).

GMM is widely used in econometrics. In ML, it appears implicitly when models are estimated by matching distributional moments rather than maximising likelihood.

6.3 MoM vs MLE

| Property | Method of Moments | Maximum Likelihood |
| --- | --- | --- |
| Computation | Often closed-form | May require numerical optimisation |
| Asymptotic efficiency | Generally inefficient | Efficient (achieves CRB) |
| Consistency | Yes | Yes (under regularity conditions) |
| Robustness | Depends only on moments | Can be sensitive to tail misspecification |
| Applicability | Works even when likelihood is intractable | Requires tractable likelihood |
| Parameter constraints | Can produce invalid estimates (e.g., $\hat{\sigma}^2 < 0$) | Respects constraints if optimised on $\Theta$ |

When to prefer MoM:

  • The likelihood is intractable (latent variable models without EM)
  • Speed matters more than efficiency
  • Only the first few moments are of interest
  • The distributional assumption may be wrong (moments are more robust)

When to prefer MLE:

  • Full likelihood is tractable and the model is well-specified
  • Maximum statistical efficiency is needed
  • Regularity conditions are satisfied
  • A standard reference distribution is being fitted

7. Confidence Intervals

7.1 The Frequentist Interpretation

Definition 7.1 (Confidence Interval). A random interval $[\hat{\theta}_L(\mathbf{X}), \hat{\theta}_U(\mathbf{X})]$ is a $(1-\alpha)$ confidence interval for $\theta$ if:

$$P_\theta\!\left(\hat{\theta}_L(\mathbf{X}) \leq \theta \leq \hat{\theta}_U(\mathbf{X})\right) \geq 1-\alpha \quad \text{for all } \theta \in \Theta$$

The key is that $\hat{\theta}_L$ and $\hat{\theta}_U$ are random (functions of the data $\mathbf{X}$), while $\theta$ is fixed (unknown but not random in the frequentist framework).

The correct interpretation: if we repeat the experiment many times and construct a 95% CI each time, 95% of those intervals will contain the true $\theta$. A specific computed interval either contains $\theta$ or it does not - there is no 95% probability attached to that particular interval after observing data.

Common misconceptions:

| Incorrect statement | Why it's wrong |
| --- | --- |
| "There is a 95% probability that $\theta \in [2.1, 3.4]$." | After computing the interval, $\theta$ is either in it or not - no probability remains |
| "95% of the data lies within the CI." | CIs are about the parameter, not the data distribution |
| "If I repeat the experiment, my estimate will be in this CI 95% of the time." | The CI is about the parameter being covered, not about future estimates |
| "A wider CI means I'm 95% more confident." | Width affects precision, but all 95% CIs have the same coverage probability |

The Bayesian analogue - "there is a 95% probability that $\theta$ is in this interval" - is a credible interval and requires a prior distribution on $\theta$. The full treatment is in Section 04, Bayesian Inference.

7.2 Exact Confidence Intervals via Pivoting

The pivotal method constructs CIs by finding a pivot: a function of data and parameter whose distribution is known and does not depend on $\theta$.

Gaussian mean, $\sigma^2$ known. Let $x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$.

The pivot is $Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1)$.

$$P\!\left(-z_{\alpha/2} \leq \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \leq z_{\alpha/2}\right) = 1 - \alpha$$

Solving for $\mu$: the $(1-\alpha)$ CI is $\left[\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\; \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right]$.

Gaussian mean, $\sigma^2$ unknown. Replacing $\sigma$ with the sample standard deviation $s$ changes the distribution: $T = \frac{\bar{x}-\mu}{s/\sqrt{n}} \sim t_{n-1}$ (Student's $t$ with $n-1$ degrees of freedom).

$(1-\alpha)$ CI: $\left[\bar{x} - t_{n-1,\alpha/2}\frac{s}{\sqrt{n}},\; \bar{x} + t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}\right]$

As $n \to \infty$, $t_{n-1,\alpha/2} \to z_{\alpha/2}$ (Student's $t$ converges to the standard normal).

STUDENT'S t DISTRIBUTION vs. NORMAL
========================================================================

  p(t)
   |
   |            --- Normal(0,1)
   |          ==== t(df=5)
   |        - - - t(df=1) [Cauchy]
   |
   |          +-----+         Heavier tails of t distribution
   |         ++     ++        reflect additional uncertainty
    |        ++       ++       from estimating σ.
   |      --+         +--
   |--------------------------- t
       -3  -2  -1   0   1   2   3

  For df ≥ 30, t ≈ Normal - practical "large sample" threshold.

========================================================================

Gaussian variance CI. The pivot is $(n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$, giving the CI:

$$\left[\frac{(n-1)s^2}{\chi^2_{n-1, \alpha/2}},\; \frac{(n-1)s^2}{\chi^2_{n-1, 1-\alpha/2}}\right]$$

Note this CI is asymmetric because the $\chi^2$ distribution is skewed.
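The two pivots are easy to apply by hand. A pure-Python sketch using table critical values for $n = 10$ ($t_{9,0.025} = 2.262$, $\chi^2_9$ quantiles 2.700 and 19.023); the function names and sample are mine:

```python
import math

# 95% critical values for n = 10 (df = 9), from standard tables.
T_CRIT_9 = 2.262        # t_{9, 0.025}
CHI2_9_LOWER = 2.700    # chi^2_{9} 0.025 quantile
CHI2_9_UPPER = 19.023   # chi^2_{9} 0.975 quantile

def mean_ci_t(xs, t_crit):
    """CI for a Gaussian mean, sigma unknown, via the t pivot."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    half = t_crit * math.sqrt(s2 / n)
    return xbar - half, xbar + half

def var_ci_chi2(xs, chi2_lower, chi2_upper):
    """CI for sigma^2 via the pivot (n-1) s^2 / sigma^2 ~ chi^2_{n-1}."""
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)  # (n-1) * s^2
    return ss / chi2_upper, ss / chi2_lower

xs = [3.1, 1.4, 2.5, 4.2, 1.8, 3.7, 2.1, 4.8, 1.2, 3.3]
print(mean_ci_t(xs, T_CRIT_9))                      # approx (1.94, 3.68)
print(var_ci_chi2(xs, CHI2_9_LOWER, CHI2_9_UPPER))  # approx (0.69, 4.89)
```

Note the variance interval sits asymmetrically around $s^2 \approx 1.47$, exactly as the skewed $\chi^2$ pivot predicts.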

7.3 Asymptotic Confidence Intervals

When exact pivots are unavailable, use the asymptotic normality of the MLE: for large $n$, $\hat{\theta}_{\text{MLE}} \approx \mathcal{N}(\theta^*, \frac{1}{n}\mathcal{I}(\theta^*)^{-1})$.

Wald interval. Plugging in the MLE for the unknown $\theta^*$ in the Fisher information:

$$\hat{\theta}_{\text{MLE}} \pm z_{\alpha/2} \cdot \frac{1}{\sqrt{n\mathcal{I}(\hat{\theta}_{\text{MLE}})}}$$

For Bernoulli with $n$ observations: $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$.

Limitations of the Wald interval near boundaries. For $p$ near 0 or 1, the Wald interval can extend outside $[0,1]$. The Wilson score interval is better-behaved:

$$\frac{\hat{p} + z^2/(2n)}{1 + z^2/n} \pm \frac{z\sqrt{\hat{p}(1-\hat{p})/n + z^2/(4n^2)}}{1 + z^2/n}$$

where $z = z_{\alpha/2}$. This is the standard CI for proportions in ML evaluation.
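The boundary behaviour is easy to see numerically. A small sketch (function names mine) comparing the two intervals for a proportion near zero:

```python
import math

def wald_ci(p_hat, n, z=1.96):
    """Normal-approximation (Wald) CI for a proportion."""
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson_ci(p_hat, n, z=1.96):
    """Wilson score CI for a proportion; always stays inside [0, 1]."""
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half

# p_hat = 0.02, n = 50: the Wald interval dips below zero, Wilson does not.
print(wald_ci(0.02, 50))
print(wilson_ci(0.02, 50))
```

For $\hat{p} = 0.02$, $n = 50$ the Wald lower endpoint is negative (an impossible value for a probability), while the Wilson interval stays strictly inside $(0, 1)$.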

Likelihood ratio interval. Based on the likelihood ratio statistic $\Lambda = 2[\ell(\hat{\theta}) - \ell(\theta_0)]$, which satisfies $\Lambda \xrightarrow{d} \chi^2_1$ under $H_0: \theta = \theta_0$. The CI is the set of $\theta_0$ values that pass the test:

$$\text{CI}_{1-\alpha} = \{\theta : 2[\ell(\hat{\theta}) - \ell(\theta)] \leq \chi^2_{1,\alpha}\}$$

This is often more accurate than the Wald interval for small $n$.
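For a Bernoulli parameter the inversion can be done by a simple grid scan over $p$. A sketch (assumes $0 < k < n$; names mine):

```python
import math

def bernoulli_loglik(p, k, n):
    """Log-likelihood of k successes in n Bernoulli trials (0 < p < 1)."""
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def lr_ci_bernoulli(k, n, chi2_crit=3.841, grid=20000):
    """Likelihood-ratio CI by grid inversion: keep the p values whose LR
    statistic 2[l(p_hat) - l(p)] stays below chi2_{1, alpha}.
    chi2_crit = 3.841 gives a 95% interval. Assumes 0 < k < n."""
    l_hat = bernoulli_loglik(k / n, k, n)
    inside = [i / grid for i in range(1, grid)
              if 2.0 * (l_hat - bernoulli_loglik(i / grid, k, n)) <= chi2_crit]
    return min(inside), max(inside)

print(lr_ci_bernoulli(20, 100))  # roughly (0.13, 0.29)
```

Unlike the Wald interval, this interval is not symmetric about $\hat{p} = 0.2$; it follows the skew of the likelihood.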

7.4 Bootstrap Confidence Intervals

The bootstrap (Efron, 1979) is a non-parametric resampling method for constructing CIs without distributional assumptions.

Non-parametric bootstrap algorithm:

  1. From the original sample $\mathcal{X} = \{x^{(1)}, \ldots, x^{(n)}\}$, draw $B$ bootstrap samples $\mathcal{X}^*_1, \ldots, \mathcal{X}^*_B$ (each of size $n$, with replacement)
  2. Compute the statistic $\hat{\theta}^*_b = T(\mathcal{X}^*_b)$ for each bootstrap sample
  3. Estimate the sampling distribution of $\hat{\theta}$ by the empirical distribution of $\{\hat{\theta}^*_1, \ldots, \hat{\theta}^*_B\}$

Percentile bootstrap CI: Use the $(\alpha/2)$ and $(1-\alpha/2)$ quantiles of $\{\hat{\theta}^*_b\}$ as CI endpoints.

BCa (bias-corrected and accelerated) bootstrap: Adjusts for bias in the bootstrap distribution and for skewness (the "acceleration"). Typically more accurate than the percentile method for small $n$.

When to use bootstrap:

  • The sampling distribution of $\hat{\theta}$ is non-standard (no closed-form pivot)
  • The distributional assumptions for parametric CIs are suspect
  • The statistic is complex (e.g., ratio of two MLEs, AUC, BLEU score)
  • $n$ is moderate and asymptotic approximations are poor

Bootstrap for ML evaluation (example). To compute a 95% CI for test accuracy:

  1. Let the test set be $\{(x^{(i)}, y^{(i)})\}_{i=1}^n$; accuracy $\hat{a} = \frac{1}{n}\sum \mathbb{1}[\hat{y}^{(i)} = y^{(i)}]$
  2. Bootstrap: resample with replacement $B = 2000$ times, compute accuracy on each bootstrap test set
  3. CI = [2.5th percentile, 97.5th percentile] of bootstrap accuracies
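The recipe above as a minimal pure-Python sketch (the 0/1 correctness vector and function names are my own illustration):

```python
import random

def percentile_bootstrap_ci(values, stat, b=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample with replacement, recompute the
    statistic b times, return the alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(values)
    stats = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)]) for _ in range(b)
    )
    return stats[int(b * alpha / 2)], stats[int(b * (1 - alpha / 2)) - 1]

# 850 correct answers out of 1000: accuracy 0.85.
correct = [1] * 850 + [0] * 150
lo, hi = percentile_bootstrap_ci(correct, lambda xs: sum(xs) / len(xs))
print(lo, hi)  # roughly (0.83, 0.87)
```

For a plain proportion this reproduces what the Wilson interval gives directly; the point of the bootstrap is that the same ten lines work unchanged for BLEU, AUC, or any other statistic.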

7.5 Confidence Intervals in ML Evaluation

Understanding CI coverage for ML evaluation prevents erroneous conclusions.

Binary accuracy. Test accuracy $\hat{a}$ with $n$ examples and $k$ correct. For moderate $n$ and $\hat{a}$ away from 0 and 1, the Wilson score CI is well-approximated by the normal-approximation form:

$$\text{CI}_{95\%} \approx \hat{a} \pm 1.96\sqrt{\hat{a}(1-\hat{a})/n}$$

For $n = 1000$, $\hat{a} = 0.85$: CI = $[0.828, 0.872]$ - width $\pm 2.2\%$. For $n = 100$: CI = $[0.780, 0.920]$ - width $\pm 7\%$.

This shows why benchmark results on $n < 500$ test examples are statistically unreliable.

Comparing two models. If Model A achieves $\hat{a}_A = 0.852$ and Model B achieves $\hat{a}_B = 0.847$ on the same $n = 1000$ test examples, is A significantly better? Treating the two accuracies as independent, the difference $\Delta = 0.005$ has standard error $\approx \sqrt{\hat{a}_A(1-\hat{a}_A)/n + \hat{a}_B(1-\hat{a}_B)/n} \approx 0.016$, giving a 95% CI for $\Delta$ of $[-0.026, +0.036]$, which includes zero. The models are not statistically distinguishable.

McNemar's test is more powerful than comparing independent CIs when models are evaluated on the same examples (the paired structure reduces variance). See Section 03, Hypothesis Testing, for the full treatment.

Multiple benchmark correction. When evaluating on $k$ benchmarks, the probability of at least one spuriously significant result rises. This is the multiple comparison problem - addressed fully in Section 03. For reporting CIs, Bonferroni correction replaces $\alpha$ with $\alpha/k$, widening each CI to maintain family-wise coverage.


8. Asymptotic Theory

8.1 Convergence Modes

Before stating the asymptotic normality theorem, we review the convergence modes used.

Almost-sure (a.s.) convergence ($\xrightarrow{a.s.}$): $P(\lim_{n\to\infty} X_n = X) = 1$. The sequence converges on every sample path except a set of probability zero. This is the strongest mode; the Strong LLN gives $\bar{x}_n \xrightarrow{a.s.} \mu$.

Convergence in probability ($\xrightarrow{P}$): For every $\varepsilon > 0$, $P(|X_n - X| > \varepsilon) \to 0$. Weaker than a.s. convergence. The Weak LLN gives $\bar{x}_n \xrightarrow{P} \mu$. Sufficient for consistency.

Convergence in distribution ($\xrightarrow{d}$): $F_{X_n}(t) \to F_X(t)$ at every continuity point of $F_X$. The weakest mode; the CLT gives $\sqrt{n}(\bar{x}_n - \mu)/\sigma \xrightarrow{d} \mathcal{N}(0,1)$. Used for asymptotic normality.

Key theorems for manipulating convergence:

Slutsky's Theorem. If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} c$ (a constant), then:

  • $X_n + Y_n \xrightarrow{d} X + c$
  • $X_n Y_n \xrightarrow{d} cX$
  • $X_n / Y_n \xrightarrow{d} X/c$ (if $c \neq 0$)

Continuous Mapping Theorem. If $X_n \xrightarrow{P} X$ and $g$ is continuous, then $g(X_n) \xrightarrow{P} g(X)$. The same holds for $\xrightarrow{d}$.

Application: The MLE $\hat{\theta} \xrightarrow{P} \theta^*$ (consistency) implies $\mathcal{I}(\hat{\theta}) \xrightarrow{P} \mathcal{I}(\theta^*)$ by the continuous mapping theorem (assuming $\mathcal{I}$ is continuous). Then by Slutsky: $\sqrt{n\mathcal{I}(\hat{\theta})}(\hat{\theta} - \theta^*) \xrightarrow{d} \mathcal{N}(0,1)$.

8.2 The Delta Method

The delta method extends the CLT to smooth functions of asymptotically normal estimators.

Theorem 8.2 (Delta Method - Scalar). Suppose $\sqrt{n}(\hat{\theta} - \theta^*) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$ and $g$ is differentiable at $\theta^*$ with $g'(\theta^*) \neq 0$. Then:

$$\sqrt{n}\left(g(\hat{\theta}) - g(\theta^*)\right) \xrightarrow{d} \mathcal{N}\!\left(0,\, [g'(\theta^*)]^2 \sigma^2\right)$$

Proof sketch: By differentiability, $g(\hat{\theta}) \approx g(\theta^*) + g'(\theta^*)(\hat{\theta} - \theta^*)$. Multiply through by $\sqrt{n}$: $\sqrt{n}(g(\hat{\theta}) - g(\theta^*)) \approx g'(\theta^*) \cdot \sqrt{n}(\hat{\theta} - \theta^*) \xrightarrow{d} g'(\theta^*)\,\mathcal{N}(0,\sigma^2) = \mathcal{N}(0, [g'(\theta^*)]^2\sigma^2)$.

Examples:

  1. CI for log-odds. For Bernoulli with $\hat{p} = \bar{x}$, the log-odds are $\theta = \log[p/(1-p)]$. The MLE is $\hat{\theta} = \log[\hat{p}/(1-\hat{p})]$. The asymptotic variance: $g(p) = \log[p/(1-p)]$, $g'(p) = 1/[p(1-p)]$. So:
$$\operatorname{Var}(\hat{\theta}) \approx \frac{1}{n} \cdot \frac{1}{[p(1-p)]^2} \cdot p(1-p) = \frac{1}{np(1-p)}$$
  2. CI for standard deviation. Given $\hat{\sigma}^2_{\text{MLE}} \approx \mathcal{N}(\sigma^2, 2\sigma^4/n)$ (asymptotic variance of the variance estimator), the delta method with $g(v) = \sqrt{v}$, $g'(v) = 1/(2\sqrt{v})$:
$$\operatorname{Var}(\hat{\sigma}) \approx \frac{1}{4\sigma^2} \cdot \frac{2\sigma^4}{n} = \frac{\sigma^2}{2n}$$
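Example 1 can be checked by simulation: the empirical variance of the log-odds MLE should match $1/(np(1-p))$. A pure-Python sketch (the parameter values are my own arbitrary choices):

```python
import math
import random

def logodds_variance_check(p=0.3, n=400, trials=4000, seed=42):
    """Compare the empirical variance of log(p_hat / (1 - p_hat)) with
    the delta-method prediction 1/(n p (1-p)). Keeps only samples with
    0 < k < n so the log-odds are defined (a near-certain event here)."""
    rng = random.Random(seed)
    ests = []
    while len(ests) < trials:
        k = sum(rng.random() < p for _ in range(n))
        if 0 < k < n:
            ests.append(math.log(k / (n - k)))  # log-odds of p_hat
    mean = sum(ests) / trials
    emp = sum((e - mean) ** 2 for e in ests) / trials
    return emp, 1.0 / (n * p * (1.0 - p))

emp_var, delta_var = logodds_variance_check()
print(emp_var, delta_var)  # the two agree to within a few percent
```

The agreement improves as $n$ grows, since the delta method is a first-order (large-$n$) approximation.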

Multivariate delta method. If $\sqrt{n}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}^*) \xrightarrow{d} \mathcal{N}(\mathbf{0}, \Sigma)$ and $g: \mathbb{R}^k \to \mathbb{R}^m$ is differentiable:

$$\sqrt{n}(g(\hat{\boldsymbol{\theta}}) - g(\boldsymbol{\theta}^*)) \xrightarrow{d} \mathcal{N}\!\left(\mathbf{0},\, J_g \Sigma J_g^\top\right)$$

where $J_g$ is the $m \times k$ Jacobian of $g$ at $\boldsymbol{\theta}^*$.

8.3 Asymptotic Normality of MLE

Theorem 8.3 (Asymptotic Normality of MLE). Under the Cramer-Rao regularity conditions, if $\hat{\boldsymbol{\theta}}_n$ is the MLE based on $n$ iid observations:

$$\sqrt{n}\left(\hat{\boldsymbol{\theta}}_n - \boldsymbol{\theta}^*\right) \xrightarrow{d} \mathcal{N}\!\left(\mathbf{0},\, \mathcal{I}(\boldsymbol{\theta}^*)^{-1}\right)$$

Proof sketch for the scalar case. Taylor-expand the score equation around $\theta^*$:

$$0 = \frac{1}{\sqrt{n}} S_n(\hat{\theta}) \approx \frac{1}{\sqrt{n}} S_n(\theta^*) + \frac{1}{n} S_n'(\theta^*) \cdot \sqrt{n}(\hat{\theta} - \theta^*)$$

Rearranging:

$$\sqrt{n}(\hat{\theta} - \theta^*) \approx -\frac{\frac{1}{\sqrt{n}}S_n(\theta^*)}{\frac{1}{n}S_n'(\theta^*)}$$

Numerator: $\frac{1}{\sqrt{n}}S_n(\theta^*) = \frac{1}{\sqrt{n}}\sum_i s(\theta^*; x^{(i)})$. By the CLT (scores are iid with mean 0 and variance $\mathcal{I}(\theta^*)$): $\frac{1}{\sqrt{n}}S_n(\theta^*) \xrightarrow{d} \mathcal{N}(0, \mathcal{I}(\theta^*))$.

Denominator: $\frac{1}{n}S_n'(\theta^*) = \frac{1}{n}\sum_i \frac{\partial^2 \log p}{\partial\theta^2}\big|_{\theta^*}$. By the LLN: $\frac{1}{n}S_n'(\theta^*) \xrightarrow{P} \mathbb{E}\!\left[\frac{\partial^2 \log p}{\partial\theta^2}\right] = -\mathcal{I}(\theta^*)$.

By Slutsky: $\sqrt{n}(\hat{\theta} - \theta^*) \xrightarrow{d} -\frac{\mathcal{N}(0, \mathcal{I}(\theta^*))}{-\mathcal{I}(\theta^*)} = \mathcal{N}(0, \mathcal{I}(\theta^*)^{-1})$. $\square$

Implications for practice:

  • Large sample CI: For $n \geq 50$ (rule of thumb), the $(1-\alpha)$ CI is $\hat{\theta} \pm z_{\alpha/2}/\sqrt{n\mathcal{I}(\hat{\theta})}$
  • Convergence rate: The MLE error is $O(1/\sqrt{n})$ - quadrupling the data halves the standard error
  • Efficiency: The asymptotic covariance $\mathcal{I}^{-1}/n$ achieves the CRB - no consistent estimator is asymptotically better
  • Standard errors in ML: The standard errors reported in logistic regression, GLMs, and survival models are all based on this theorem, using the observed or expected Fisher information

8.4 Misspecified Models

In practice, the model $\{p(\cdot;\boldsymbol{\theta})\}$ rarely contains the true data-generating distribution $p^*$. What happens when the model is misspecified?

Theorem 8.4 (MLE under Misspecification). Under regularity conditions, even when $p^* \notin \mathcal{M}$, the MLE converges to:

$$\hat{\boldsymbol{\theta}}_{\text{MLE}} \xrightarrow{P} \boldsymbol{\theta}^{**} = \arg\min_{\boldsymbol{\theta} \in \Theta} D_{\mathrm{KL}}(p^* \| p(\cdot;\boldsymbol{\theta}))$$

The "pseudo-true parameter" $\boldsymbol{\theta}^{**}$ is the closest point in the model family to the truth in KL divergence. The asymptotic distribution under misspecification is the sandwich estimator:

$$\sqrt{n}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}^{**}) \xrightarrow{d} \mathcal{N}\!\left(\mathbf{0},\, \mathcal{I}^{-1} V \mathcal{I}^{-1}\right)$$

where $V = \operatorname{Var}[s(\boldsymbol{\theta}^{**}; X)]$ is the actual variance of the score (not equal to $\mathcal{I}$ when misspecified). Under correct specification, $V = \mathcal{I}$ and the sandwich collapses to $\mathcal{I}^{-1}$.

Implications for LLM training: Language models trained by NLL minimisation are nearly always misspecified (the model cannot perfectly represent natural language). The trained model converges to the KL-minimising distribution within the model class. This is why language models exhibit characteristic failure modes related to the model architecture's inductive biases - they converge to the nearest representable distribution, not the true distribution.


9. Applications in Machine Learning

9.1 MLE Is Cross-Entropy Training

We derived in Section 5.4 that MLE = cross-entropy minimisation. Here we examine the consequences.

NLL loss for common model types:

| Task | Model $p(y\mid\mathbf{x};\boldsymbol{\theta})$ | NLL loss | Standard name |
| --- | --- | --- | --- |
| Binary classification | $\operatorname{Bern}(\sigma(\mathbf{w}^\top\mathbf{x}))$ | $-y\log\hat{p} - (1-y)\log(1-\hat{p})$ | Binary cross-entropy |
| Multi-class | $\operatorname{Cat}(\operatorname{softmax}(\mathbf{W}\mathbf{x}))$ | $-\sum_k y_k\log\hat{p}_k$ | Categorical cross-entropy |
| Regression | $\mathcal{N}(\mathbf{w}^\top\mathbf{x}, \sigma^2)$ | $\frac{(y-\hat{y})^2}{2\sigma^2}$ | MSE (up to constant) |
| Language model | $\prod_t p_\theta(x_t\mid x_{<t})$ | $-\frac{1}{T}\sum_t\log p_\theta(x_t\mid x_{<t})$ | NLL / perplexity |

Temperature scaling for calibration. After training a classifier, its softmax probabilities are often overconfident (Guo et al., 2017). Temperature scaling finds $T^* = \arg\min_T \mathcal{L}_{\text{NLL}}(\boldsymbol{\theta}, T)$ where the logits are divided by $T$. This is a 1-parameter MLE problem on a calibration set, provably maintaining accuracy while improving Expected Calibration Error (ECE).
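Since the problem is one-dimensional and the calibration NLL is unimodal in $T$, a simple search suffices. A pure-Python sketch (the toy logits and helper names are mine):

```python
import math

def nll(logits, labels, temp):
    """Mean NLL of integer labels under temperature-scaled softmax."""
    total = 0.0
    for z, y in zip(logits, labels):
        scaled = [v / temp for v in z]
        m = max(scaled)  # log-sum-exp stabilisation
        log_norm = m + math.log(sum(math.exp(v - m) for v in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)

def fit_temperature(logits, labels, lo=0.05, hi=10.0, iters=100):
    """1-parameter MLE for the temperature by ternary search on the NLL."""
    for _ in range(iters):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if nll(logits, labels, a) < nll(logits, labels, b):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)

# Overconfident toy classifier: logit margin 4.0, but only 80% correct.
logits = [(4.0, 0.0)] * 10
labels = [0] * 8 + [1] * 2
t_star = fit_temperature(logits, labels)
print(t_star)  # approx 4 / ln 4 = 2.885, so that sigmoid(4 / T*) = 0.8
```

At the optimum the scaled confidence matches the empirical accuracy (0.8), which is exactly the calibration effect; the argmax class never changes, so accuracy is untouched.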

Label smoothing. Standard MLE on one-hot targets drives $\log p(y|\mathbf{x}) \to 0$ (i.e., $\hat{p} \to 1$) for the correct class, making the model overconfident. Label smoothing (Szegedy et al., 2016) uses targets $\tilde{y}_k = (1-\varepsilon)\mathbb{1}[k=y] + \varepsilon/K$ for small $\varepsilon$ (typically 0.1). This regularises MLE toward a mixture of the data distribution and the uniform distribution - reducing overconfidence without sacrificing accuracy.

9.2 Natural Gradient and Fisher Information

The problem with Euclidean gradient descent. The ordinary gradient $\nabla_{\boldsymbol{\theta}} \mathcal{L}$ is the direction of steepest ascent in Euclidean parameter space. But this is parameterisation-dependent: rescaling parameters changes the gradient direction, so different parameterisations of the same model family give different gradient descent trajectories.

Information geometry. The space of probability distributions has a natural Riemannian metric - the Fisher information metric $g_{jk} = \mathcal{I}(\boldsymbol{\theta})_{jk}$. In this curved space, the steepest ascent direction is:

$$\tilde{\nabla}_{\boldsymbol{\theta}} \mathcal{L} = \mathcal{I}(\boldsymbol{\theta})^{-1} \nabla_{\boldsymbol{\theta}} \mathcal{L}$$

Natural gradient descent (Amari, 1998):

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\, \mathcal{I}(\boldsymbol{\theta}_t)^{-1} \nabla_{\boldsymbol{\theta}} \mathcal{L}$$

Properties:

  • Parameterisation-invariant: changing parameterisation gives the same update in the function space
  • Adaptive: steps are small in directions of high curvature (where the loss changes rapidly), large in flat directions
  • Connection to Newton's method: natural gradient = Newton's method when the Hessian equals the FIM (true for GLMs at the maximum)
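A one-parameter illustration (my own toy example, not from the notes): fitting a Bernoulli mean in logit space, where the gradient of the mean log-likelihood is $(\bar{y} - p)$ and the Fisher information is $p(1-p)$. Preconditioning by $\mathcal{I}(\theta)^{-1}$ takes large steps exactly where the sigmoid is flat:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_bernoulli_logit(y_mean, lr=0.5, steps=60, natural=True):
    """Gradient ascent on the Bernoulli log-likelihood over the logit
    theta, with p = sigmoid(theta). The score is (y_mean - p); natural
    gradient divides it by the Fisher information I(theta) = p(1-p)."""
    theta = 0.0
    for _ in range(steps):
        p = sigmoid(theta)
        grad = y_mean - p
        if natural:
            grad /= p * (1.0 - p)  # I(theta)^{-1} preconditioning
        theta += lr * grad
    return sigmoid(theta)

p_nat = fit_bernoulli_logit(0.95, natural=True)
p_van = fit_bernoulli_logit(0.95, natural=False)
print(p_nat, p_van)  # natural gradient reaches 0.95 far sooner
```

The natural-gradient update here is damped Fisher scoring, which converges in a handful of steps; the plain gradient crawls because $(\bar{y} - p)$ vanishes as $p$ approaches the boundary.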

K-FAC (Martens & Grosse, 2015). Full FIM inversion is $O(p^3)$ - impossible for neural networks. K-FAC approximates $\mathcal{I}^{-1}$ layer-wise using a Kronecker factorisation:

$$\mathcal{I}^{[l]} \approx A^{[l-1]} \otimes G^{[l]}$$

where $A^{[l-1]} = \mathbb{E}[\mathbf{a}^{[l-1]}(\mathbf{a}^{[l-1]})^\top]$ and $G^{[l]} = \mathbb{E}[\boldsymbol{\delta}^{[l]}(\boldsymbol{\delta}^{[l]})^\top]$. This reduces the cost from $O(p^3)$ to $O(d_{\text{in}}^3 + d_{\text{out}}^3)$ per layer, making natural gradient feasible for moderate-sized networks.

Shampoo (Gupta et al., 2018) is a related second-order optimiser that maintains full-matrix preconditioners per parameter group, used in Google-scale training.

9.3 Confidence Intervals for Model Evaluation

The evaluation uncertainty problem. A model is evaluated on a test set of $n$ examples. The reported accuracy $\hat{a}$ is an estimate of the true accuracy $a^*$. Without a CI, it is impossible to determine whether two models are genuinely different or differ only by chance.

Standard practice for ML papers:

  • Wilson score CI for binary metrics (accuracy, F1 per class)
  • Bootstrap CI for non-standard metrics (BLEU, ROUGE, perplexity)
  • Correct for multiple benchmarks using Bonferroni or BH correction

Confidence interval for AUC. The Area Under the ROC Curve is a function of ranks, not a simple proportion. Its asymptotic distribution follows the Wilcoxon-Mann-Whitney form, giving a CI:

$$\widehat{\text{AUC}} \pm z_{\alpha/2} \sqrt{\frac{\widehat{\text{AUC}}(1-\widehat{\text{AUC}}) + (n_+-1)(Q_1-\widehat{\text{AUC}}^2) + (n_--1)(Q_2-\widehat{\text{AUC}}^2)}{n_+ n_-}}$$

For complex metrics like this, the bootstrap is typically preferred.

LLM benchmark confidence. Major benchmarks (MMLU, HumanEval, HellaSwag) have 1000-14000 questions. For MMLU ($n \approx 14000$), the 95% Wilson CI for an accuracy of 70% is approximately $\pm 0.76\%$. For HumanEval ($n = 164$), the CI is $\pm 7.0\%$ - models within 7 percentage points cannot be reliably ranked.

9.4 Estimating Scaling Laws

Scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) describe how model performance $L$ depends on model size $N$ (parameters), training tokens $D$, and compute $C$. These are regression problems - a direct application of estimation theory.

Chinchilla scaling law (Hoffmann et al., 2022). The loss function is modelled as:

$$L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$

where $E, A, B, \alpha, \beta$ are estimated by nonlinear least squares (MLE under a Gaussian noise model) on $(N, D, L)$ observations from hundreds of training runs. The fitted parameters ($\alpha \approx 0.34$, $\beta \approx 0.28$) are MLE estimates with associated uncertainty.
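A minimal sketch of the fitting idea on synthetic, noise-free data. The parameter values below are illustrative (loosely Chinchilla-like), and the grid-over-exponents plus linear-solve strategy is my own stand-in for full nonlinear least squares:

```python
def solve3(m, v):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    a = [row[:] + [v[i]] for i, row in enumerate(m)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= f * a[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (a[r][3] - sum(a[r][c] * x[c] for c in range(r + 1, 3))) / a[r][r]
    return x

def fit_scaling_law(data, alphas, betas):
    """Grid over (alpha, beta); for each pair the model is linear in
    (E, A, B), solved from three anchor points, then scored by SSE
    over all (N, D, L) observations."""
    best = None
    for al in alphas:
        for be in betas:
            anchors = [data[0], data[len(data) // 2], data[-1]]
            m = [[1.0, n ** -al, d ** -be] for n, d, _ in anchors]
            e, a, b = solve3(m, [l for _, _, l in anchors])
            sse = sum((l - (e + a * n ** -al + b * d ** -be)) ** 2
                      for n, d, l in data)
            if best is None or sse < best[0]:
                best = (sse, al, be, e, a, b)
    return best[1:]

# Synthetic runs from L = 1.69 + 406/N^0.34 + 411/D^0.28 (no noise).
def truth(n, d):
    return 1.69 + 406.0 * n ** -0.34 + 411.0 * d ** -0.28

data = [(n, d, truth(n, d)) for n in (1e7, 1e8, 1e9, 1e10)
        for d in (1e9, 1e10, 1e11)]
grid = [0.20 + 0.02 * i for i in range(15)]
al, be, e, a, b = fit_scaling_law(data, grid, grid)
print(al, be, e)  # recovers alpha = 0.34, beta = 0.28 and E = 1.69
```

On real, noisy runs the Chinchilla fit uses a robust loss and continuous optimisation over all five parameters; this sketch only shows why the problem is ordinary parametric estimation.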

The optimal allocation. For a fixed compute budget $C = 6ND$, the optimal $(N^*, D^*)$ is found by constrained optimisation of the fitted loss model. Chinchilla showed that GPT-3 and other earlier large models (e.g., Gopher) were undertrained relative to their model size - the optimal allocation uses smaller models trained on more tokens.

For AI: The Chinchilla scaling-law computation is MLE applied at the scale of frontier AI research. The "parameters" being estimated ($\alpha, \beta, A, B, E$) are 5 numbers that determine the optimal architecture and training strategy for models costing millions of dollars to train.

9.5 Fisher Information in Fine-Tuning

Elastic Weight Consolidation (EWC, Kirkpatrick et al., 2017). In continual learning, a model trained sequentially on tasks $T_1, T_2, \ldots$ tends to forget earlier tasks - catastrophic forgetting. EWC prevents this by penalising changes to weights that were important for task $T_1$:

$$\mathcal{L}_{\text{EWC}}(\boldsymbol{\theta}) = \mathcal{L}_{T_2}(\boldsymbol{\theta}) + \frac{\lambda}{2} \sum_j F_j (\theta_j - \theta^*_{1,j})^2$$

where $F_j$ is the $j$-th diagonal element of the Fisher information matrix computed on task $T_1$'s data: $F_j = \frac{1}{n}\sum_i \left(\partial \log p(y^{(i)}\mid\mathbf{x}^{(i)};\boldsymbol{\theta}^*_1)/\partial\theta_j\right)^2$.

Parameters with high $F_j$ are those to which the task-1 likelihood is most sensitive - changing them most damages task-1 performance. EWC protects these parameters with strong quadratic penalties.
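For a logistic model the per-example score is $(y - p)\,\mathbf{x}$, so the diagonal Fisher is just the mean squared per-parameter gradient. A tiny sketch (the toy data and function names are mine):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def diagonal_fisher(w, data):
    """Empirical diagonal Fisher F_j = (1/n) sum_i (d log p_i / d w_j)^2
    for logistic regression; the per-example score is (y - p) * x."""
    f = [0.0] * len(w)
    for x, y in data:
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        for j, xj in enumerate(x):
            f[j] += ((y - p) * xj) ** 2
    return [v / len(data) for v in f]

def ewc_penalty(w, w_star, fisher, lam):
    """EWC quadratic penalty (lambda / 2) * sum_j F_j (w_j - w*_j)^2."""
    return 0.5 * lam * sum(fj * (wj - wsj) ** 2
                           for fj, wj, wsj in zip(fisher, w, w_star))

# Feature 0 carries the task-1 signal; feature 1 is always zero, so its
# Fisher entry vanishes and EWC leaves that weight free to move.
task1 = [((1.0, 0.0), 1), ((-1.0, 0.0), 0), ((0.5, 0.0), 1), ((-0.5, 0.0), 0)]
w_star = [1.2, 0.4]
fisher = diagonal_fisher(w_star, task1)
print(fisher)                                         # second entry is 0.0
print(ewc_penalty(w_star, w_star, fisher, lam=10.0))  # 0.0 at w = w*
```

During task-2 training, `ewc_penalty` would be added to the new task's loss, anchoring only the weights the task-1 likelihood actually depends on.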

LoRA as low-rank MLE (Hu et al., 2022). LoRA fine-tunes large models by constraining weight updates to be low-rank: $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d,k)$. This is MLE with a rank constraint on the parameter update - the MLE over a restricted parameter manifold. The rank $r$ controls the trade-off between expressivity and the number of estimated parameters.

Optimal Brain Damage (LeCun et al., 1990). Neural network pruning by removing parameters $\theta_j$ that, according to a second-order Taylor expansion, increase the loss least. The per-parameter importance is approximately $\frac{1}{2} H_{jj} \theta_j^2$, where $H_{jj}$ is the diagonal Hessian. Replacing $H_{jj}$ with $F_{jj}$ (diagonal FIM) recovers FIM-based pruning - estimation theory meets efficient neural architecture.


10. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
| --- | --- | --- | --- |
| 1 | "The 95% CI means there is a 95% probability that $\theta \in [a,b]$." | After observing data, $\theta$ is either in $[a,b]$ or not - no randomness remains. The probability refers to the procedure, not this interval. | Say "In repeated experiments, 95% of such intervals cover $\theta$." Use Bayesian credible intervals if a posterior probability statement is needed. |
| 2 | Confusing bias with MSE, treating "unbiased" as synonymous with "accurate." | Low bias says nothing about variance or overall error. An unbiased estimator with high variance can have larger MSE than a biased estimator with low variance. | Always decompose MSE = Bias^2 + Var and consider both terms. |
| 3 | Using $\hat{\sigma}^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$ as the unbiased variance estimator. | The $1/n$ estimator is the MLE and is biased; $\mathbb{E}[\hat{\sigma}^2] = (n-1)\sigma^2/n$. | Use $s^2 = \frac{1}{n-1}\sum(x_i - \bar{x})^2$ (Bessel's correction) for unbiased estimation. |
| 4 | Treating the MLE as automatically unbiased. | MLE is consistent and asymptotically efficient but not generally unbiased. The Gaussian variance MLE, the Uniform maximum MLE, and many others are biased. | Compute $\mathbb{E}[\hat{\theta}_{\text{MLE}}]$ explicitly or use the delta method for composite parameters. |
| 5 | Applying the CRB to biased estimators using the unbiased form $\operatorname{Var} \geq 1/\mathcal{I}$. | The CRB for biased estimators is $\operatorname{Var}(\hat{\theta}) \geq (1+b'(\theta))^2/\mathcal{I}(\theta)$. Using the unbiased form gives an incorrect lower bound. | For biased estimators, always use the biased-estimator CRB, or switch to MSE rather than variance comparisons. |
| 6 | Using the Wald CI for proportions near 0 or 1 (e.g., $\hat{p} = 0.02$, $n = 50$). | The Wald interval $\hat{p} \pm z\sqrt{\hat{p}(1-\hat{p})/n}$ can extend below 0 or above 1, and has poor coverage near the boundary. | Use the Wilson score interval or the Clopper-Pearson exact interval for proportions near 0 or 1. |
| 7 | Treating the Fisher information (expected curvature) as identical to the Hessian of the log-likelihood (observed curvature). | $\mathcal{I}(\theta) = -\mathbb{E}[H_{\log p}]$ is the expected Hessian. For a specific sample, the observed Hessian $-H_{\log p}(\hat{\theta})$ differs; the two agree only in expectation. | Use the observed Hessian when computational speed is needed; the expected FIM for theoretical comparisons. Both give consistent estimates of $\mathcal{I}(\theta^*)$. |
| 8 | Applying the MLE invariance property in reverse: assuming $g(\hat{\theta})$ is unbiased for $g(\theta)$ because $\hat{\theta}$ is unbiased for $\theta$. | MLE invariance says the MLE of $g(\theta)$ is $g(\hat{\theta}_{\text{MLE}})$, but says nothing about bias. If $g$ is nonlinear, Jensen's inequality shows $\mathbb{E}[g(\hat{\theta})] \neq g(\mathbb{E}[\hat{\theta}]) = g(\theta)$ in general. | Use the delta method to compute the asymptotic bias of $g(\hat{\theta})$ and add a bias correction if needed. |
| 9 | Applying asymptotic ($n \to \infty$) CI formulas for small $n$ (e.g., $n = 15$). | Asymptotic normality of the MLE typically requires $n \geq 30$-$100$ depending on the model. For small $n$, the $t$-distribution, exact intervals, or the bootstrap should be used. | Use the $t$-distribution CI for a Gaussian mean with small $n$; bootstrap CIs for non-standard statistics; exact binomial CIs for small count data. |
| 10 | Using the sum of likelihoods $\sum_i L(\theta; x_i)$ instead of the product $\prod_i L(\theta; x_i)$ (or the sum of log-likelihoods). | The likelihood function is the joint probability of the data - a product for independent observations, not a sum. Maximising the sum gives a different, generally inconsistent estimator. | Always write $\ell(\theta) = \sum_i \log p(x^{(i)};\theta)$ (log-likelihood = sum of log densities). |

11. Exercises

Exercise 1 [*] MLE for Bernoulli and Poisson.

(a) Given $n = 10$ coin flips with outcomes $\{H, H, T, H, T, T, H, H, H, T\}$ (6 heads, 4 tails), compute $\hat{p}_{\text{MLE}}$ and its standard error $\sqrt{\hat{p}(1-\hat{p})/n}$.

(b) For the Poisson model $\operatorname{Poisson}(\lambda)$ with $n$ iid observations summing to $S$, derive the MLE $\hat{\lambda}$ by solving the score equation and verify it is a maximum (second derivative < 0).

(c) Using the MLE from (b) and the Fisher information $\mathcal{I}(\lambda) = 1/\lambda$, verify the MLE achieves the CRB: $\operatorname{Var}(\hat{\lambda}) = \lambda/n = 1/(n\mathcal{I}(\lambda))$.

Exercise 2 [*] Bias of the MLE variance estimator.

Let $x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$ with both parameters unknown.

(a) Prove that $\hat{\sigma}^2_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n (x^{(i)} - \bar{x})^2$ has $\mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \frac{n-1}{n}\sigma^2$. (Hint: write $x^{(i)} - \bar{x}$ in terms of $x^{(i)} - \mu$ and $\bar{x} - \mu$.)

(b) Show that the corrected estimator $s^2 = \frac{n}{n-1}\hat{\sigma}^2_{\text{MLE}}$ is unbiased.

(c) By MLE invariance, $\hat{\sigma}_{\text{MLE}} = \sqrt{\hat{\sigma}^2_{\text{MLE}}}$. Is this an unbiased estimator of $\sigma$? If not, what is the sign of its bias? Use Jensen's inequality.

Exercise 3 [*] Cramer-Rao bound verification.

For $n$ iid observations from $\operatorname{Exp}(\lambda)$ with $p(x;\lambda) = \lambda e^{-\lambda x}$:

(a) Compute the score $s(\lambda; x) = \partial \log p(x;\lambda)/\partial \lambda$ and verify $\mathbb{E}[s] = 0$.

(b) Compute the Fisher information $\mathcal{I}(\lambda)$ using both the variance-of-score and negative-expected-Hessian formulas and verify they are equal.

(c) Show that $\hat{\lambda} = n/\sum_{i=1}^n x^{(i)}$ is the MLE and compute $\operatorname{Var}(\hat{\lambda})$ using the delta method applied to $g(\bar{x}) = 1/\bar{x}$ with $\operatorname{Var}(\bar{x}) = (1/\lambda^2)/n$. Verify that $\operatorname{Var}(\hat{\lambda}) = \lambda^2/n = 1/(n\mathcal{I}(\lambda))$ - the CRB is achieved.

Exercise 4 [**] Method of Moments for the Beta distribution.

Let $x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \operatorname{Beta}(\alpha, \beta)$ with moments $\mathbb{E}[X] = \alpha/(\alpha+\beta)$ and $\operatorname{Var}(X) = \alpha\beta/[(\alpha+\beta)^2(\alpha+\beta+1)]$.

(a) Derive the MoM estimators $\hat{\alpha}$ and $\hat{\beta}$ in terms of $\bar{x}$ and $s^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$.

(b) Show that the MoM estimates are only valid when $s^2 < \bar{x}(1-\bar{x})$. What is the statistical interpretation of this constraint?

(c) Generate $n = 200$ samples from $\operatorname{Beta}(2, 5)$ and compute both the MoM and the MLE (via numerical optimisation). Compare the estimates and their standard errors. Which is more efficient?

Exercise 5 [**] Bootstrap confidence interval.

Let $\mathbf{x} = \{3.1, 1.4, 2.5, 4.2, 1.8, 3.7, 2.1, 4.8, 1.2, 3.3\}$ (a sample of size 10).

(a) Compute the sample median $\tilde{x}$ and the 95% asymptotic CI for the median using the fact that, for a distribution with density $f(\tilde{x})$ at the true median, $\operatorname{Var}(\text{sample median}) \approx \frac{1}{4nf(\tilde{x})^2}$.

(b) Implement the non-parametric bootstrap with $B = 2000$ resamples and compute the percentile bootstrap CI for the median. Compare its width to that of the asymptotic CI.

(c) Is the bootstrap CI or the asymptotic CI more reliable here, and why?
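A minimal sketch for part (b) in NumPy; the helper name and the seed are arbitrary choices, and the statistic is assumed to accept an `axis` argument:

```python
import numpy as np

def percentile_bootstrap_ci(data, stat=np.median, B=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample with replacement, take empirical quantiles."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # B resamples of size n; the statistic is computed row-wise
    stats = stat(rng.choice(data, size=(B, n), replace=True), axis=1)
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

x = np.array([3.1, 1.4, 2.5, 4.2, 1.8, 3.7, 2.1, 4.8, 1.2, 3.3])
lo, hi = percentile_bootstrap_ci(x)
print(f"median = {np.median(x):.2f}, 95% bootstrap CI = ({lo:.2f}, {hi:.2f})")
```

Comparing the resulting interval to the asymptotic one requires a density estimate $f(\tilde{x})$ for the latter, which is part of the point of the exercise.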

Exercise 6 [**] Asymptotic normality simulation.

Verify the asymptotic normality of the Bernoulli MLE empirically.

(a) For true parameter $p = 0.3$ and sample sizes $n \in \{10, 30, 100, 500\}$, simulate $M = 5000$ MLEs $\hat{p}^{(m)} = \bar{x}^{(m)}$ and plot their distributions.

(b) For each $n$, standardise: $Z^{(m)} = \frac{\hat{p}^{(m)} - 0.3}{\sqrt{0.3 \cdot 0.7/n}}$ and overlay the $\mathcal{N}(0,1)$ PDF. At what $n$ does the approximation appear adequate?

(c) Compute the empirical coverage of the 95% Wald CI at each $n$. At what $n$ does coverage first reach 94%-96%? Does the CRB hold: is $n \cdot \operatorname{Var}(\hat{p})$ always $\geq 1/\mathcal{I}(0.3) = p(1-p)$?
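A sketch of the coverage computation in part (c), using NumPy (the seed and the 1.96 normal quantile for a 95% Wald CI are the only choices made here):

```python
import numpy as np

rng = np.random.default_rng(0)
p, M = 0.3, 5000
for n in [10, 30, 100, 500]:
    x = rng.binomial(1, p, size=(M, n))        # M Bernoulli datasets of size n
    p_hat = x.mean(axis=1)                     # M MLEs
    se = np.sqrt(p_hat * (1 - p_hat) / n)      # Wald standard error
    covered = (p_hat - 1.96 * se <= p) & (p <= p_hat + 1.96 * se)
    print(f"n={n:4d}  coverage={covered.mean():.3f}  n*Var(p_hat)={n * p_hat.var():.3f}")
```

Note that for small $n$ the estimate $\hat{p}$ can be exactly 0 or 1, collapsing the Wald interval to a point - one reason Wald coverage is poor at small samples.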

Exercise 7 [***] Fisher information for a neural network layer.

Consider a one-layer softmax classifier $p(y|\mathbf{x};\mathbf{W}) = \operatorname{softmax}(\mathbf{W}\mathbf{x})_y$ for $y \in \{1,\ldots,K\}$, where $\mathbf{W} \in \mathbb{R}^{K \times d}$.

(a) Derive the score $\nabla_{\mathbf{W}} \log p(y|\mathbf{x};\mathbf{W}) = (\mathbf{e}_y - \hat{\mathbf{p}}) \mathbf{x}^\top$ where $\hat{\mathbf{p}} = \operatorname{softmax}(\mathbf{W}\mathbf{x})$.

(b) Show that the Fisher information matrix (as a $Kd \times Kd$ matrix) is:

$$\mathcal{I}(\mathbf{W}) = \mathbb{E}_{\mathbf{x}}\!\left[(\operatorname{Diag}(\hat{\mathbf{p}}) - \hat{\mathbf{p}}\hat{\mathbf{p}}^\top) \otimes \mathbf{x}\mathbf{x}^\top\right]$$

(c) Identify the two factors in the K-FAC approximation: $A = \mathbb{E}[\mathbf{x}\mathbf{x}^\top]$ (input covariance) and $G = \mathbb{E}[\operatorname{Diag}(\hat{\mathbf{p}}) - \hat{\mathbf{p}}\hat{\mathbf{p}}^\top]$ (output covariance). Simulate this for $d = 10$, $K = 5$, and compare $G \otimes A$ (the ordering matching (b)) to the full FIM.
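A sketch of the simulation in (c). The weight matrix, input distribution, and sample count are illustrative assumptions; the Kronecker ordering follows the FIM in (b), $\mathbb{E}[(\operatorname{Diag}(\hat{\mathbf{p}}) - \hat{\mathbf{p}}\hat{\mathbf{p}}^\top) \otimes \mathbf{x}\mathbf{x}^\top]$, so the K-FAC approximant is formed as $G \otimes A$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, n = 10, 5, 5000
W = rng.normal(size=(K, d))            # fixed weights (arbitrary)
X = rng.normal(size=(n, d))            # inputs for the expectation over x

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

P = softmax(X @ W.T)                   # (n, K) predicted distributions

# Full FIM: average of (Diag(p) - p p^T) kron (x x^T) over samples
F = np.zeros((K * d, K * d))
G = np.zeros((K, K))
A = np.zeros((d, d))
for p, x in zip(P, X):
    Gx = np.diag(p) - np.outer(p, p)
    F += np.kron(Gx, np.outer(x, x))
    G += Gx
    A += np.outer(x, x)
F, G, A = F / n, G / n, A / n

kfac = np.kron(G, A)                   # K-FAC: E[G_x] kron E[x x^T]
rel_err = np.linalg.norm(F - kfac) / np.linalg.norm(F)
print(f"relative Frobenius error of K-FAC: {rel_err:.3f}")
```

The error is nonzero because K-FAC assumes the output factor and $\mathbf{x}\mathbf{x}^\top$ are uncorrelated; the exercise asks how good that assumption is here.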

Exercise 8 [***] Chinchilla scaling law estimation.

The Chinchilla model (Hoffmann et al., 2022) estimates language model loss as $L(N,D) = E + A/N^\alpha + B/D^\beta$.

(a) Given synthetic training-run data (simulate 50 runs with $(N, D)$ pairs and losses with added Gaussian noise), fit the 5 parameters $(E, A, B, \alpha, \beta)$ by nonlinear least squares using scipy.optimize.curve_fit.

(b) Compute 95% CIs for each parameter using the covariance matrix returned by curve_fit (which estimates $\sigma^2(J^\top J)^{-1}$, where $J$ is the Jacobian at the solution).

(c) For a fixed compute budget $C = 6ND$ (FLOPs), find the optimal $(N^*, D^*)$ by minimising $L(N, C/(6N))$ over $N$. How sensitive is this optimal allocation to uncertainty in $\alpha$ and $\beta$?
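A sketch of (a)-(b) with scipy.optimize.curve_fit. The synthetic ranges, noise level, and initial guess are assumptions; the ground-truth constants are set near the published Chinchilla estimates ($\alpha \approx 0.34$, $\beta \approx 0.28$) for concreteness:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

def chinchilla(X, E, A, B, alpha, beta):
    N, D = X
    return E + A / N**alpha + B / D**beta

# Synthetic runs - illustrative, not Hoffmann et al.'s data
true = dict(E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28)
N = 10 ** rng.uniform(7, 10, size=50)        # 1e7 .. 1e10 parameters
D = 10 ** rng.uniform(8, 11, size=50)        # 1e8 .. 1e11 tokens
L = chinchilla((N, D), **true) + rng.normal(0, 0.01, size=50)

p0 = [2.0, 300.0, 300.0, 0.3, 0.3]           # rough initial guess
popt, pcov = curve_fit(chinchilla, (N, D), L, p0=p0, maxfev=20000)
se = np.sqrt(np.diag(pcov))                  # from sigma^2 (J^T J)^{-1}
for name, est, s in zip(true, popt, se):
    print(f"{name:5s}  {est:8.3f}  +/- {1.96 * s:.3f}")
```

The 95% CI half-widths are $1.96 \times$ the square roots of the diagonal of pcov, exactly the asymptotic-normality recipe from the main text.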


12. Why This Matters for AI (2026 Perspective)

  • MLE = cross-entropy minimisation - Every language model (GPT-4, LLaMA-3, Gemini, Claude) is trained by NLL minimisation; MLE provides the theoretical foundation for why this converges to the right distribution.
  • Fisher information matrix - K-FAC (Martens & Grosse, 2015) enables tractable second-order optimisation for deep networks; Shampoo (Google) uses full-matrix FIM approximations in large-scale training.
  • Natural gradient - Invariant to reparametrisation; theoretically optimal steepest direction in distribution space; practical approximations (K-FAC, SOAP) achieve closer-to-optimal convergence rates.
  • Asymptotic normality of MLE - Justifies confidence intervals for model parameters; provides the basis for the Laplace approximation in Bayesian neural networks (Section04).
  • Cramer-Rao bound - Fundamental limit on estimation from data; explains why larger datasets always help (error decays as $\sim 1/\sqrt{n}$); motivates data-efficient fine-tuning methods.
  • Confidence intervals - Benchmark evaluation without CIs is scientifically misleading; Wilson CIs for accuracy, bootstrap CIs for composite metrics; the HELM/OpenLLM leaderboards report point estimates without CIs - a known limitation.
  • Bias-variance decomposition - Theoretical foundation for regularisation in ML: L2/L1 regularisation adds bias to reduce variance; early stopping, dropout, and weight decay all operate on this trade-off.
  • Sufficient statistics - Exponential-family sufficient statistics underlie variational autoencoders (the encoder outputs $\boldsymbol{\mu}$, $\boldsymbol{\sigma}$ of the Gaussian posterior) and normalising flows.
  • Scaling law estimation - Chinchilla (Hoffmann et al., 2022) fits $(N, D, L)$ data by nonlinear MLE; the estimated scaling exponents ($\alpha \approx 0.34$, $\beta \approx 0.28$) determine optimal model sizes at every compute budget.
  • Elastic weight consolidation - A FIM-based penalty prevents catastrophic forgetting in continual learning; applied in fine-tuning pipelines to protect pre-trained capabilities while adapting to new tasks.
  • Temperature calibration - MLE on a calibration set finds the scalar $T$ that minimises NLL; well-calibrated models produce reliable uncertainty estimates for downstream decision-making.
  • Bootstrap evaluation - Non-parametric CI construction for arbitrary metrics (BLEU, pass@k, ECE); essential for comparing models with complex evaluation protocols.

13. Conceptual Bridge

Estimation theory is the bridge between probability and action. Probability theory (Chapter 6) defines random variables, distributions, and their properties - it answers "if we know the model, what data should we expect?" Estimation theory inverts this: "given data, what can we infer about the model?" This inversion from data to parameters is the core operation of every machine learning training algorithm.

The section you have just completed builds on Descriptive Statistics (Section01) in a crucial way: sample statistics like $\bar{x}$ and $s^2$ were introduced there as data summaries; here they are elevated to estimators - random variables with sampling distributions, bias, variance, and convergence properties. Bessel's correction (the $n-1$ denominator) was motivated in Section01 by practicality; here it is derived from the requirement of unbiasedness.

Looking forward to Hypothesis Testing (Section03): the confidence intervals of this section have a duality with hypothesis tests - rejecting $H_0: \theta = \theta_0$ at level $\alpha$ is equivalent to $\theta_0$ falling outside the $(1-\alpha)$ CI. The sampling distributions developed here (Gaussian, Student's $t$, chi-squared, $F$) are the exact distributions that hypothesis tests use.

The next major extension is Bayesian Inference (Section04), which reframes the estimation problem by treating $\boldsymbol{\theta}$ as a random variable with a prior distribution $p(\boldsymbol{\theta})$. The posterior $p(\boldsymbol{\theta}|\mathcal{D}) \propto L(\boldsymbol{\theta};\mathcal{D})\, p(\boldsymbol{\theta})$ combines the likelihood (covered here) with the prior. MAP estimation - maximising the posterior - is MLE regularised by the log-prior. L2 regularisation is MAP with a Gaussian prior; L1 regularisation is MAP with a Laplace prior. The full treatment of conjugate priors, credible intervals, and MCMC is in Section04.

ESTIMATION THEORY IN THE CURRICULUM
========================================================================

  Chapter 6 - Probability Theory
  |  Section01 Random Variables (distributions, CDF/PDF)
  |  Section04 Expectation and Moments (E[X], Var(X), CLT, LLN)
  |  Section03 Joint Distributions (Bayes' theorem)
  +------------------------------------------+
                                             | prerequisites
  Chapter 7 - Statistics                     v
  |  Section01 Descriptive Statistics    (data summaries -> estimators)
  |       sample mean, variance,
  |       covariance, correlation
  |
  +--> Section02 ESTIMATION THEORY (this section)
  |       - Point estimators: bias, variance, MSE
  |       - Cramer-Rao bound, Fisher information
  |       - MLE: derivation, properties, examples
  |       - Asymptotic normality, delta method
  |       - Confidence intervals (frequentist)
  |       - MLE = cross-entropy training
  |
  +--> Section03 Hypothesis Testing      (duality: CIs <-> tests)
  |       p-values, power, t/z/chi^2 tests, A/B testing
  |
  +--> Section04 Bayesian Inference      (extension: add prior to likelihood)
  |       posterior = likelihood x prior, MAP = regularised MLE
  |       conjugate priors, MCMC, variational inference
  |
  +--> Section05 Time Series             (sequential estimation, Kalman filter)
  |
  +--> Section06 Regression Analysis     (applied MLE: OLS, Ridge, Lasso)

  Chapter 8 - Optimisation         (natural gradient uses FIM)
  Chapter 9 - Information Theory   (FIM and Shannon information)

========================================================================

The Fisher information matrix is the central object connecting estimation theory outward: it bounds estimation variance (CRB), defines the geometry of natural gradient optimisation (Ch8), underpins the Laplace approximation in Bayesian inference (Section04), and is related to the Shannon capacity of statistical experiments (Ch9). Understanding the FIM deeply is understanding the mathematical infrastructure of modern ML training.


End of Section02 Estimation Theory

<- Back to Chapter 7: Statistics | Next: Hypothesis Testing ->


Appendix A: Exponential Families

Exponential families are the most important class of parametric models in statistics, unifying the Gaussian, Bernoulli, Poisson, Exponential, Gamma, Beta, and dozens of other distributions under one framework.

Definition A.1 (Exponential Family). A parametric family $\{p(x;\boldsymbol{\eta})\}$ is an exponential family if it can be written in the form:

$$p(x;\boldsymbol{\eta}) = h(x) \exp\!\left(\boldsymbol{\eta}^\top T(x) - A(\boldsymbol{\eta})\right)$$

where:

  • $\boldsymbol{\eta} \in \mathbb{R}^k$ is the natural (canonical) parameter
  • $T(x) \in \mathbb{R}^k$ is the sufficient statistic (vector of natural statistics)
  • $h(x) \geq 0$ is the base measure
  • $A(\boldsymbol{\eta}) = \log \int h(x)\, e^{\boldsymbol{\eta}^\top T(x)}\, dx$ is the log-partition function (ensures normalisation)

The log-partition function encodes all moments: $\nabla_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \mathbb{E}[T(X)]$ and $\nabla^2_{\boldsymbol{\eta}} A(\boldsymbol{\eta}) = \operatorname{Cov}(T(X))$.

Verification examples:

Bernoulli. $p(x;p) = p^x(1-p)^{1-x}$. Write as $\exp(x\log(p/(1-p)) + \log(1-p))$. Natural parameter $\eta = \log(p/(1-p))$ (the log-odds), sufficient statistic $T(x) = x$, $A(\eta) = \log(1 + e^\eta)$.

Gaussian. $\mathcal{N}(\mu,\sigma^2)$: natural parameters $\boldsymbol{\eta} = (\mu/\sigma^2, -1/(2\sigma^2))$, sufficient statistics $T(x) = (x, x^2)$.

Key properties of exponential families for MLE:

  1. MLE score equation: $\hat{\boldsymbol{\eta}}_{\text{MLE}}$ satisfies $\mathbb{E}_{\hat{\boldsymbol{\eta}}}[T(X)] = \frac{1}{n}\sum_i T(x^{(i)})$ - the MLE moment-matches the sufficient statistics.

  2. MLE = MoM: For exponential families, MLE and the method of moments (in the sufficient statistics) coincide - both set $\mathbb{E}[T] = \hat{T}$.

  3. Efficient MLE: In exponential families the MLE of the mean parameter $\mathbb{E}[T(X)]$ attains the CRB exactly; for other parametrisations efficiency holds asymptotically.

  4. Log-likelihood is concave: For minimal exponential families, $\ell(\boldsymbol{\eta})$ is strictly concave - the MLE is the unique global maximum, and Newton-Raphson converges from any starting point.

  5. Complete sufficient statistics: $T(x^{(1)}, \ldots, x^{(n)}) = \sum_i T(x^{(i)})$ is a complete sufficient statistic, so the MLE, when unbiased, is the UMVUE by Lehmann-Scheffe.

Importance for ML: The categorical distribution (softmax output of neural networks), Gaussian (regression, VAE latent space), Bernoulli (binary classification), and Poisson (count data, Poisson regression) are all exponential families. The VAE encoder outputs the parameters $(\boldsymbol{\mu}, \boldsymbol{\sigma})$ of the Gaussian posterior. Generalised linear models (GLMs, of which logistic regression is a special case) are defined precisely as: exponential family response + linear natural parameter $\boldsymbol{\eta} = \boldsymbol{W}\mathbf{x}$.
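The moment identities $\nabla A = \mathbb{E}[T]$ and $\nabla^2 A = \operatorname{Cov}(T)$ can be checked numerically for the Bernoulli case; a small sketch using finite differences (the value of $\eta$ is arbitrary, and the step size is chosen for numerical stability):

```python
import numpy as np

# Bernoulli in natural form: A(eta) = log(1 + e^eta), T(x) = x
eta = 0.7                                # arbitrary natural parameter
p = 1 / (1 + np.exp(-eta))               # mean parameter: E[T] = sigmoid(eta)

A = lambda e: np.log1p(np.exp(e))        # log-partition function
h = 1e-4                                 # finite-difference step
dA = (A(eta + h) - A(eta - h)) / (2 * h)             # ~ A'(eta)
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2  # ~ A''(eta)

print(dA, p)              # A'(eta)  should equal E[T] = p
print(d2A, p * (1 - p))   # A''(eta) should equal Var(T) = p(1-p)
```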


Appendix B: Sufficient Statistics - Detailed Examples

B.1 The Exponential Distribution

For $x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \operatorname{Exp}(\lambda)$:

$$p(\mathbf{x};\lambda) = \lambda^n \exp\!\left(-\lambda \sum_{i=1}^n x^{(i)}\right) \cdot \mathbb{1}[x^{(i)} \geq 0 \;\, \forall i]$$

By the factorisation theorem, $T = \sum_i x^{(i)}$ (the total waiting time) is a sufficient statistic: the likelihood factors as $g(T,\lambda) \cdot h(\mathbf{x})$ with $g(T,\lambda) = \lambda^n e^{-\lambda T}$ and $h(\mathbf{x}) = \prod_i \mathbb{1}[x^{(i)}\geq 0]$.

MLE: $\hat{\lambda} = n/T = 1/\bar{x}$, which depends on the data only through $T$ - consistent with sufficiency.

B.2 Order Statistics and the Uniform Distribution

For $x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \mathcal{U}(0,\theta)$:

$$p(\mathbf{x};\theta) = \frac{1}{\theta^n} \prod_{i=1}^n \mathbb{1}[0 \leq x^{(i)} \leq \theta] = \frac{1}{\theta^n}\,\mathbb{1}[\max_i x^{(i)} \leq \theta]\,\mathbb{1}[\min_i x^{(i)} \geq 0]$$

The sufficient statistic is $T = X_{(n)} = \max_i x^{(i)}$ (the maximum order statistic). The MLE $\hat{\theta} = X_{(n)}$ is a function of $T$, as expected.

Bias correction. Since $\mathbb{E}[X_{(n)}] = n\theta/(n+1)$, the unbiased estimator is $\tilde{\theta} = \frac{n+1}{n}X_{(n)}$. Rao-Blackwell recovers the same estimator: start from the crude unbiased estimator $\tilde{\theta}' = 2x^{(1)}$ (using only the first observation; $\mathbb{E}[2x^{(1)}] = \theta$) and condition on $T$. Given $X_{(n)} = t$, the first observation equals $t$ with probability $1/n$ and is otherwise uniform on $(0,t)$, so $\mathbb{E}[x^{(1)} \mid T = t] = \frac{t}{n} + \frac{n-1}{n}\cdot\frac{t}{2} = \frac{(n+1)t}{2n}$, giving $\mathbb{E}[\tilde{\theta}' \mid T] = \frac{n+1}{n}X_{(n)} = \tilde{\theta}$, so the Rao-Blackwell estimator is indeed $\tilde{\theta}$.


Appendix C: The EM Algorithm - Connecting MLE to Latent Variable Models

Many models of interest (Gaussian mixture models, HMMs, topic models, VAEs) have latent variables $Z$ in addition to observed data $X$. The complete-data log-likelihood $\log p(X, Z; \boldsymbol{\theta})$ would be easy to maximise if $Z$ were observed, but we only observe $X$.

The EM algorithm maximises the marginal log-likelihood $\ell(\boldsymbol{\theta}) = \log p(X;\boldsymbol{\theta}) = \log \sum_Z p(X,Z;\boldsymbol{\theta})$ by iterating two steps:

E-step: Compute the expected complete-data log-likelihood under the current posterior over $Z$:

$$Q(\boldsymbol{\theta};\boldsymbol{\theta}^{(t)}) = \mathbb{E}_{Z \mid X;\boldsymbol{\theta}^{(t)}}\left[\log p(X, Z;\boldsymbol{\theta})\right]$$

M-step: Maximise:

$$\boldsymbol{\theta}^{(t+1)} = \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta};\boldsymbol{\theta}^{(t)})$$

Why EM guarantees $\ell(\boldsymbol{\theta}^{(t+1)}) \geq \ell(\boldsymbol{\theta}^{(t)})$: the marginal log-likelihood is lower-bounded by the evidence lower bound (ELBO),

$$\ell(\boldsymbol{\theta}) \geq Q(\boldsymbol{\theta};\boldsymbol{\theta}^{(t)}) + H(\boldsymbol{\theta}^{(t)}),$$

where $H$ is the entropy of the posterior $p(Z \mid X;\boldsymbol{\theta}^{(t)})$, a term not involving $\boldsymbol{\theta}$, and the bound is tight at $\boldsymbol{\theta} = \boldsymbol{\theta}^{(t)}$. The M-step increases $Q$, so $\ell(\boldsymbol{\theta}^{(t+1)}) \geq Q(\boldsymbol{\theta}^{(t+1)};\boldsymbol{\theta}^{(t)}) + H \geq Q(\boldsymbol{\theta}^{(t)};\boldsymbol{\theta}^{(t)}) + H = \ell(\boldsymbol{\theta}^{(t)})$; the next E-step re-tightens the bound at $\boldsymbol{\theta}^{(t+1)}$.

Gaussian Mixture Model (GMM). For a $K$-component GMM with means $\boldsymbol{\mu}_k$, covariances $\Sigma_k$, and mixing weights $\pi_k$:

  • E-step: Compute responsibilities $r_{ik} = \dfrac{\pi_k \mathcal{N}(\mathbf{x}^{(i)};\boldsymbol{\mu}_k,\Sigma_k)}{\sum_j \pi_j \mathcal{N}(\mathbf{x}^{(i)};\boldsymbol{\mu}_j,\Sigma_j)}$
  • M-step: Update parameters:

$$\hat{\pi}_k = \frac{\sum_i r_{ik}}{n}, \quad \hat{\boldsymbol{\mu}}_k = \frac{\sum_i r_{ik}\mathbf{x}^{(i)}}{\sum_i r_{ik}}, \quad \hat{\Sigma}_k = \frac{\sum_i r_{ik}(\mathbf{x}^{(i)}-\hat{\boldsymbol{\mu}}_k)(\mathbf{x}^{(i)}-\hat{\boldsymbol{\mu}}_k)^\top}{\sum_i r_{ik}}$$

Each M-step is a weighted MLE for a Gaussian - exactly the closed-form solutions from Section 5.3, but with fractional weights.

For AI: The EM algorithm is the algorithmic prototype for variational inference in VAEs (Section04). In the VAE, the E-step is replaced by the encoder $q_\phi(\mathbf{z}|\mathbf{x})$ (an amortised posterior approximation) and the M-step by joint gradient ascent on the ELBO with respect to both encoder parameters $\phi$ and decoder parameters $\boldsymbol{\theta}$.


Appendix D: Cramer-Rao Bound - Multivariate Proof and Geometry

Multivariate CRB (full proof). Let $\hat{\boldsymbol{\theta}}$ be an unbiased estimator of $\boldsymbol{\theta} \in \mathbb{R}^k$. Define the score vector $\mathbf{s}(\boldsymbol{\theta};\mathbf{x}) = \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x};\boldsymbol{\theta}) \in \mathbb{R}^k$.

Since $\hat{\boldsymbol{\theta}}$ is unbiased, $\mathbb{E}[\hat{\boldsymbol{\theta}}] = \boldsymbol{\theta}$. Differentiating with respect to $\theta_j$ (and using $\mathbb{E}[s_j] = 0$):

$$\mathbf{e}_j = \frac{\partial}{\partial\theta_j}\mathbb{E}[\hat{\boldsymbol{\theta}}] = \mathbb{E}[\hat{\boldsymbol{\theta}}\, s_j(\boldsymbol{\theta};\mathbf{x})] = \mathbb{E}[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})\, s_j]$$

where $\mathbf{e}_j$ is the $j$-th standard basis vector. Stacking over $j$: $I_k = \mathbb{E}[(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})\mathbf{s}^\top]$ (the $k \times k$ identity).

By the matrix Cauchy-Schwarz inequality (in the Loewner ordering):

$$\begin{pmatrix} \operatorname{Cov}(\hat{\boldsymbol{\theta}}) & I_k \\ I_k & \mathcal{I}(\boldsymbol{\theta}) \end{pmatrix} \succeq 0$$

The Schur complement condition gives $\operatorname{Cov}(\hat{\boldsymbol{\theta}}) - \mathcal{I}(\boldsymbol{\theta})^{-1} \succeq 0$, i.e., $\operatorname{Cov}(\hat{\boldsymbol{\theta}}) \succeq \mathcal{I}(\boldsymbol{\theta})^{-1}$. $\square$

Information-geometric interpretation. The statistical manifold $\mathcal{M} = \{p(\cdot;\boldsymbol{\theta}) : \boldsymbol{\theta} \in \Theta\}$ is a Riemannian manifold under the Fisher-Rao metric $g_{jl} = \mathcal{I}(\boldsymbol{\theta})_{jl}$. The CRB states that the covariance of any unbiased estimator is at least as large as the inverse metric - the "volume" of estimation uncertainty is bounded below by the curvature of the statistical manifold.

This geometric view motivates the natural gradient: steepest descent in the Fisher-Rao metric on the statistical manifold corresponds to the update $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\, \mathcal{I}^{-1}\nabla\ell$.


Appendix E: The Score Test, Wald Test, and Likelihood Ratio Test

These three test statistics form the classical trinity of asymptotic tests in statistics. While their full treatment belongs in Section03 Hypothesis Testing, their derivations from MLE theory belong here.

For testing $H_0: \boldsymbol{\theta} = \boldsymbol{\theta}_0$:

Score (Rao) test: $\Lambda_S = \frac{1}{n}\mathbf{s}_n(\boldsymbol{\theta}_0)^\top \mathcal{I}(\boldsymbol{\theta}_0)^{-1} \mathbf{s}_n(\boldsymbol{\theta}_0) \xrightarrow{d} \chi^2_k$ under $H_0$.

Uses only the restricted model - does not require fitting the unrestricted MLE.

Wald test: $\Lambda_W = n(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)^\top \mathcal{I}(\hat{\boldsymbol{\theta}})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) \xrightarrow{d} \chi^2_k$ under $H_0$.

Uses only the unrestricted MLE.

Likelihood ratio test: $\Lambda_{LR} = 2[\ell(\hat{\boldsymbol{\theta}}) - \ell(\boldsymbol{\theta}_0)] \xrightarrow{d} \chi^2_k$ under $H_0$.

Requires fitting both the restricted and unrestricted models. Convergence to the chi-squared distribution is Wilks' theorem (1938).

Asymptotic equivalence. Under $H_0$, $\Lambda_S = \Lambda_W = \Lambda_{LR} + O_P(n^{-1/2})$ - all three are first-order equivalent. They differ in finite samples; the LRT is generally most accurate for small $n$.

These tests are developed fully in Section03 Hypothesis Testing, including the Neyman-Pearson lemma, p-values, and power calculations.



Appendix F: Sampling Distributions of Classical Statistics

Understanding the exact (finite-sample) distributions of key estimators is essential for constructing valid hypothesis tests and confidence intervals when nn is small.

F.1 Distribution of the Sample Mean

Let $x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$.

$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x^{(i)} \sim \mathcal{N}\!\left(\mu, \frac{\sigma^2}{n}\right)$$

This is an exact result (not asymptotic) - a sum of independent Gaussians is Gaussian. The standardised version $Z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1)$ gives exact z-tests and z-intervals.

F.2 Distribution of the Sample Variance: Chi-Squared

Theorem F.1. Let $x^{(1)}, \ldots, x^{(n)} \overset{\text{iid}}{\sim} \mathcal{N}(\mu, \sigma^2)$. Then:

$$\frac{(n-1)s^2}{\sigma^2} = \frac{\sum_{i=1}^n (x^{(i)} - \bar{x})^2}{\sigma^2} \sim \chi^2_{n-1}$$

and this quantity is independent of $\bar{x}$.

Proof sketch. Write $\sum(x^{(i)}-\bar{x})^2/\sigma^2 = \sum(z^{(i)}-\bar{z})^2$ where $z^{(i)} = (x^{(i)}-\mu)/\sigma \overset{\text{iid}}{\sim} \mathcal{N}(0,1)$. The quadratic form $\sum(z^{(i)}-\bar{z})^2$ equals $\mathbf{z}^\top(I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top)\mathbf{z}$, the squared norm of the projection of $\mathbf{z}$ onto the $(n-1)$-dimensional subspace orthogonal to $\mathbf{1}$. By the spectral theorem for Gaussian quadratic forms, this has a $\chi^2_{n-1}$ distribution. Independence from $\bar{x}$ (a function of $\mathbf{1}^\top\mathbf{z}$) follows from the orthogonality.

The $n-1$ degrees of freedom arise because estimating $\mu$ by $\bar{x}$ removes one degree of freedom from the $n$ deviations $x^{(i)}-\bar{x}$ (they sum to zero, so only $n-1$ are free).

F.3 Student's $t$ Distribution

Definition. If $Z \sim \mathcal{N}(0,1)$ and $V \sim \chi^2_\nu$ independently, then $T = Z/\sqrt{V/\nu} \sim t_\nu$.

Application to the Gaussian mean. With $Z = (\bar{x}-\mu)/(\sigma/\sqrt{n}) \sim \mathcal{N}(0,1)$ and $V = (n-1)s^2/\sigma^2 \sim \chi^2_{n-1}$ (independent of $Z$):

$$T = \frac{\bar{x}-\mu}{s/\sqrt{n}} = \frac{Z}{\sqrt{V/(n-1)}} \sim t_{n-1}$$

This is exact for all $n \geq 2$ - no large-sample approximation needed. The $t$-distribution is symmetric, bell-shaped, heavier-tailed than the Gaussian, and converges to $\mathcal{N}(0,1)$ as $\nu \to \infty$.

Quantiles comparison (two-sided level $\alpha$):

  alpha     z_{alpha/2}    t_{9, alpha/2}    t_{29, alpha/2}
  10%       1.645          1.833             1.699
  5%        1.960          2.262             2.045
  1%        2.576          3.250             2.756

The $t$-distribution with small df requires wider intervals to achieve the same coverage - reflecting the additional uncertainty from estimating $\sigma$.

F.4 The $F$ Distribution

Definition. If $U \sim \chi^2_m$ and $V \sim \chi^2_n$ independently, then $F = (U/m)/(V/n) \sim F_{m,n}$.

Application. Comparing variances: $\frac{s_1^2}{s_2^2}\cdot\frac{\sigma_2^2}{\sigma_1^2} \sim F_{n_1-1, n_2-1}$. To test equality of variances $H_0: \sigma_1^2 = \sigma_2^2$, use $F_{\text{obs}} = s_1^2/s_2^2 \sim F_{n_1-1, n_2-1}$ under $H_0$.

The $F$-distribution also underlies ANOVA (testing equality of multiple means), regression significance tests, and model comparison - fully developed in Section03 Hypothesis Testing and Section06 Regression Analysis.


Appendix G: Numerical Example - Fitting a Gaussian Mixture Model via EM

This worked example demonstrates MLE for a latent variable model.

Setup. Suppose we observe $n = 200$ observations believed to come from a mixture of two Gaussians: $p(x;\boldsymbol{\theta}) = \pi_1 \mathcal{N}(x;\mu_1,\sigma_1^2) + (1-\pi_1)\mathcal{N}(x;\mu_2,\sigma_2^2)$.

The parameters are $\boldsymbol{\theta} = (\pi_1, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2)$.

Direct MLE is hard. The log-likelihood is $\ell(\boldsymbol{\theta}) = \sum_{i=1}^n \log\left[\pi_1\mathcal{N}(x^{(i)};\mu_1,\sigma_1^2) + (1-\pi_1)\mathcal{N}(x^{(i)};\mu_2,\sigma_2^2)\right]$, which contains a log of a sum - non-convex, with no closed form.

EM solution. Introduce latent indicators $Z^{(i)} \in \{1,2\}$ indicating which component generated $x^{(i)}$.

E-step: Compute responsibilities

$$r^{(i)}_1 = \frac{\pi_1\mathcal{N}(x^{(i)};\mu_1,\sigma_1^2)}{\pi_1\mathcal{N}(x^{(i)};\mu_1,\sigma_1^2) + (1-\pi_1)\mathcal{N}(x^{(i)};\mu_2,\sigma_2^2)}, \quad r^{(i)}_2 = 1 - r^{(i)}_1$$

M-step: Update parameters (weighted MLEs for each component):

$$\hat{\pi}_1 = \frac{\sum_i r^{(i)}_1}{n}, \quad \hat{\mu}_k = \frac{\sum_i r^{(i)}_k x^{(i)}}{\sum_i r^{(i)}_k}, \quad \hat{\sigma}_k^2 = \frac{\sum_i r^{(i)}_k (x^{(i)} - \hat{\mu}_k)^2}{\sum_i r^{(i)}_k}$$

Convergence. EM guarantees $\ell(\boldsymbol{\theta}^{(t+1)}) \geq \ell(\boldsymbol{\theta}^{(t)})$ at every iteration, converging to a local maximum. Multiple random initialisations are needed to find the global maximum.
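A minimal NumPy sketch of this EM loop on synthetic data. The mixture parameters, the crude min/max initialisation, and the fixed iteration count are illustrative assumptions; a production fit would monitor the log-likelihood for convergence and restart from several initialisations:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: mixture of N(-2, 1) and N(3, 1.5^2), weights 0.4 / 0.6 (illustrative)
n = 200
z = rng.random(n) < 0.4
x = np.where(z, rng.normal(-2.0, 1.0, n), rng.normal(3.0, 1.5, n))

def norm_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Crude initialisation from the data range
pi1, mu1, mu2, v1, v2 = 0.5, x.min(), x.max(), x.var(), x.var()
for _ in range(100):
    # E-step: responsibilities of component 1
    a = pi1 * norm_pdf(x, mu1, v1)
    b = (1 - pi1) * norm_pdf(x, mu2, v2)
    r1 = a / (a + b)
    # M-step: weighted Gaussian MLEs
    pi1 = r1.mean()
    mu1 = (r1 * x).sum() / r1.sum()
    mu2 = ((1 - r1) * x).sum() / (1 - r1).sum()
    v1 = (r1 * (x - mu1) ** 2).sum() / r1.sum()
    v2 = ((1 - r1) * (x - mu2) ** 2).sum() / (1 - r1).sum()

print(f"pi1={pi1:.2f}  mu1={mu1:.2f}  mu2={mu2:.2f}")
```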

For AI: The EM algorithm is the prototype for alternating optimisation with latent variables. In NLP preprocessing pipelines it is used to fit GMMs, HMMs, and topic models; in deep learning, its bound-tighten-then-maximise structure reappears as the ELBO objective of variational autoencoders (Section04).


Appendix H: Regularised MLE and the Bias-Variance Trade-Off in Practice

Ridge regression (L2-regularised linear regression) is the canonical example of regularised MLE.

Setup. For linear regression $y^{(i)} = \boldsymbol{\theta}^\top \mathbf{x}^{(i)} + \varepsilon^{(i)}$ with $\varepsilon^{(i)} \overset{\text{iid}}{\sim} \mathcal{N}(0,\sigma^2)$, the MLE is OLS: $\hat{\boldsymbol{\theta}}_{\text{OLS}} = (X^\top X)^{-1}X^\top\mathbf{y}$.

Ridge regression: Add an L2 penalty:

$$\hat{\boldsymbol{\theta}}_{\text{ridge}}(\lambda) = \arg\min_{\boldsymbol{\theta}} \lVert \mathbf{y} - X\boldsymbol{\theta} \rVert_2^2 + \lambda\lVert\boldsymbol{\theta}\rVert_2^2 = (X^\top X + \lambda I)^{-1}X^\top\mathbf{y}$$

Bias: $\operatorname{Bias}(\hat{\boldsymbol{\theta}}_{\text{ridge}}) = -\lambda(X^\top X + \lambda I)^{-1}\boldsymbol{\theta}^* \neq \mathbf{0}$ - ridge is biased.

Variance: $\operatorname{Cov}(\hat{\boldsymbol{\theta}}_{\text{ridge}}) = \sigma^2(X^\top X + \lambda I)^{-1}X^\top X(X^\top X + \lambda I)^{-1} \preceq \sigma^2(X^\top X)^{-1} = \operatorname{Cov}(\hat{\boldsymbol{\theta}}_{\text{OLS}})$.

Ridge has smaller covariance than OLS - introducing bias reduces variance. The bias-variance trade-off: for well-chosen $\lambda$, the MSE of ridge is below the MSE of OLS.
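The trade-off can be illustrated numerically; a sketch comparing the Monte Carlo parameter MSE of OLS and ridge (the problem sizes, noise level, and $\lambda$ are illustrative assumptions, with $\lambda$ chosen near the $k\sigma^2/\lVert\boldsymbol{\theta}^*\rVert_2^2$ rule of thumb discussed below):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, lam = 50, 20, 2.0, 5.0     # illustrative sizes; lam not tuned
theta_star = rng.normal(size=d)
X = rng.normal(size=(n, d))

def fit(X, y, lam):
    """Ridge solution; lam = 0 recovers OLS."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Monte Carlo estimate of E||theta_hat - theta*||^2 over fresh noise draws
mse_ols, mse_ridge, trials = 0.0, 0.0, 500
for _ in range(trials):
    y = X @ theta_star + rng.normal(0, sigma, n)
    mse_ols += np.sum((fit(X, y, 0.0) - theta_star) ** 2)
    mse_ridge += np.sum((fit(X, y, lam) - theta_star) ** 2)
print(f"OLS MSE ~ {mse_ols/trials:.3f},  ridge MSE ~ {mse_ridge/trials:.3f}")
```

Under this setup the biased ridge estimator ends up with the lower total MSE, exactly the trade-off described above.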

MAP interpretation. Ridge is MAP estimation with the Gaussian prior $\boldsymbol{\theta} \sim \mathcal{N}(\mathbf{0}, (\sigma^2/\lambda)\, I)$. The prior pulls $\boldsymbol{\theta}$ toward zero, introducing bias toward zero. The full development of MAP (adding a prior to MLE) is in Section04 Bayesian Inference.

Optimal $\lambda$ selection. Cross-validation estimates the prediction MSE for each $\lambda$ and selects the minimiser - the bias-variance trade-off resolved from a prediction perspective. Alternatively, an analytical rule of thumb: with $k$ parameters, the MSE-minimising choice is approximately $\lambda^* = k\sigma^2/\lVert\boldsymbol{\theta}^*\rVert_2^2$ (it depends on the unknown $\boldsymbol{\theta}^*$, so it must be estimated).

Connection to weight decay in LLMs. AdamW (Loshchilov & Hutter, 2019) applies a decoupled weight-decay term $\lambda\lVert\boldsymbol{\theta}\rVert_2^2$ alongside the Adam update. This is regularised MLE with a Gaussian prior on all parameters - the standard training procedure for frontier language models. The weight-decay coefficient $\lambda$ (typically 0.1) controls the strength of the prior, determining how strongly parameters are regularised toward zero.


Appendix I: Non-Parametric Estimation

When the parametric model is uncertain or clearly wrong, non-parametric estimation makes minimal distributional assumptions.

Empirical Distribution Function (EDF). Given a sample $x^{(1)}, \ldots, x^{(n)}$:

$$\hat{F}_n(t) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}[x^{(i)} \leq t]$$

The EDF is the non-parametric MLE of the CDF $F$. By the Glivenko-Cantelli theorem, $\sup_t |\hat{F}_n(t) - F(t)| \xrightarrow{a.s.} 0$ - the EDF converges uniformly to the true CDF.

The Dvoretzky-Kiefer-Wolfowitz (DKW) inequality gives a finite-sample bound:

$$P\!\left(\sup_t |\hat{F}_n(t) - F(t)| > \varepsilon\right) \leq 2e^{-2n\varepsilon^2}$$

This is a non-parametric analogue of the Cramer-Rao story - it bounds estimation error for the CDF without any parametric assumption.

Kernel density estimation. Estimating the PDF $f$ non-parametrically:

$$\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^n K\!\left(\frac{x - x^{(i)}}{h}\right)$$

where $K$ is a kernel (e.g., Gaussian) and $h$ is the bandwidth. The bias-variance trade-off applies: small $h$ -> low bias, high variance; large $h$ -> high bias, low variance. The bandwidth minimising the mean integrated squared error scales as $h^* \propto n^{-1/5}$.

Bootstrap as non-parametric MLE. The bootstrap resamples from the EDF $\hat{F}_n$ - effectively replacing the unknown $F$ with its non-parametric MLE. Bootstrap CI validity follows from the closeness of $\hat{F}_n$ to $F$ (Glivenko-Cantelli) and the continuity of the statistic of interest.
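A quick numerical check of the DKW inequality, assuming standard-normal data (the choices of $F$, $n$, and $\varepsilon$ are illustrative, and the supremum is approximated on a grid, which slightly understates it):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, eps, trials = 100, 0.15, 2000
grid = np.linspace(-4, 4, 400)
F_true = norm.cdf(grid)                    # true N(0,1) CDF on the grid

exceed = 0
for _ in range(trials):
    x = np.sort(rng.normal(size=n))
    # EDF on the grid: fraction of samples <= t, via searchsorted
    F_hat = np.searchsorted(x, grid, side="right") / n
    if np.max(np.abs(F_hat - F_true)) > eps:
        exceed += 1

dkw_bound = 2 * np.exp(-2 * n * eps**2)
print(f"empirical exceedance {exceed/trials:.4f}  vs  DKW bound {dkw_bound:.4f}")
```

The empirical exceedance frequency should sit at or below the DKW bound, up to Monte Carlo error.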



Appendix J: Selected Proofs and Derivations

J.1 Asymptotic Variance of the Sample Median

For a distribution with PDF f and CDF F, let m denote the population median (F(m) = 1/2). The sample median \tilde{x}_n satisfies:

\sqrt{n}(\tilde{x}_n - m) \xrightarrow{d} \mathcal{N}\!\left(0, \frac{1}{4f(m)^2}\right)

Proof sketch. The asymptotic variance of the p-th quantile estimator \hat{q}_p is p(1-p)/[nf(q_p)^2]. For p = 1/2: (1/2)(1/2)/[n f(m)^2] = 1/(4nf(m)^2). (The sample median is also the MLE of the location parameter of the double-exponential, i.e. Laplace, distribution.)

Comparison to mean (for Gaussian data). For \mathcal{N}(\mu,\sigma^2): f(\mu) = 1/(\sigma\sqrt{2\pi}), so the median's asymptotic variance is \pi\sigma^2/(2n) vs. \sigma^2/n for the mean. The relative efficiency is 2/\pi \approx 0.637 - the median needs roughly 57\% more samples (1/0.637 \approx 1.57) to match the precision of the mean on Gaussian data. However, for heavy-tailed distributions (e.g., Cauchy), the mean has infinite variance while the median remains consistent - robustness matters.
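The \pi/2 efficiency gap is easy to check by simulation; a seeded sketch, assuming standard-normal data:

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 201, 20000
samples = rng.normal(size=(reps, n))           # true mu = 0, sigma = 1
var_mean = samples.mean(axis=1).var()          # ~ sigma^2 / n
var_median = np.median(samples, axis=1).var()  # ~ pi * sigma^2 / (2n)
ratio = var_median / var_mean                  # approaches pi/2 ~ 1.571
```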

J.2 The Neyman-Scott Paradox - Inconsistency of MLE

MLE is not always consistent. The Neyman-Scott example (1948) provides a famous cautionary case.

Setup. Observe n pairs (x_j^{(1)}, x_j^{(2)}) for j = 1, \ldots, n, where x_j^{(k)} \overset{\text{iid}}{\sim} \mathcal{N}(\mu_j, \sigma^2). The parameters are (\mu_1, \ldots, \mu_n, \sigma^2) - n+1 parameters for 2n observations.

MLE. For each pair: \hat{\mu}_j = (x_j^{(1)} + x_j^{(2)})/2. For \sigma^2: \hat{\sigma}^2_{\text{MLE}} = \frac{1}{2n}\sum_{j=1}^n \left[(x_j^{(1)} - \hat{\mu}_j)^2 + (x_j^{(2)} - \hat{\mu}_j)^2\right] = \frac{1}{2n}\sum_j \frac{(x_j^{(1)}-x_j^{(2)})^2}{2}.

Inconsistency. \mathbb{E}[\hat{\sigma}^2_{\text{MLE}}] = \sigma^2/2 - the MLE underestimates \sigma^2 by a factor of 2 no matter how large n grows. The bias does not vanish because the number of parameters grows with n.

Root cause. The regularity conditions for MLE consistency require the number of parameters to stay fixed as n \to \infty. When nuisance parameters grow with n (the "incidental parameters" problem), MLE can fail. The fix: use a conditional or marginal likelihood that eliminates the nuisance parameters - here, working with the differences x_j^{(1)} - x_j^{(2)} (whose distribution is free of the \mu_j) yields the consistent estimator \frac{1}{2n}\sum_j (x_j^{(1)}-x_j^{(2)})^2.

For AI. This paradox is relevant to neural network generalisation. A model with p \gg n parameters (the overparameterised regime) has more parameters than observations, so classical MLE theory breaks down - yet in practice, overparameterised networks often generalise well due to implicit regularisation (the inductive bias of gradient descent). This remains an active research area.
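The factor-of-2 bias is easy to reproduce numerically; a seeded sketch (the uniform range for the nuisance means is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 100_000, 1.0
mu = rng.uniform(-5, 5, size=n)                # n nuisance means mu_j
x1 = rng.normal(mu, sigma)                     # first observation of each pair
x2 = rng.normal(mu, sigma)                     # second observation
mu_hat = (x1 + x2) / 2                         # per-pair MLE of mu_j
sigma2_mle = np.mean((x1 - mu_hat)**2 + (x2 - mu_hat)**2) / 2
# sigma2_mle converges to sigma^2 / 2 = 0.5, not 1.0.
sigma2_diff = np.mean((x1 - x2)**2) / 2        # difference-based estimator: consistent
```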

J.3 Locally Most Powerful Tests and the Neyman-Pearson Lemma

(Preview - full treatment in Section 03: Hypothesis Testing)

The Neyman-Pearson lemma provides the most powerful test for simple hypotheses H_0: \theta = \theta_0 vs. H_1: \theta = \theta_1. Reject H_0 on the region:

\{x : L(\theta_1;x)/L(\theta_0;x) > c\}

where c is chosen to control type-I error at level \alpha. The test depends on the data only through the likelihood ratio L(\theta_1;x)/L(\theta_0;x), and no other test of the same level can be more powerful. The connection to MLE: the most powerful test is built from the likelihood, and the MLE is the point that maximises that same likelihood.

For composite alternatives H_1: \theta \in \Theta_1, the likelihood ratio test statistic \Lambda = 2[\ell(\hat{\theta}) - \ell(\hat{\theta}_{\text{restricted}})] \xrightarrow{d} \chi^2_k (where k is the number of constraints imposed by H_0) is the standard generalisation, and connects to the Wald and score tests via asymptotic equivalence.
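A minimal seeded check of the \chi^2 calibration for the simple case \mathcal{N}(\mu, 1) with H_0: \mu = 0, where \Lambda reduces exactly to n\bar{x}^2:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps, alpha = 50, 20000, 0.05
x = rng.normal(0.0, 1.0, size=(reps, n))   # data generated UNDER H_0: mu = 0
xbar = x.mean(axis=1)
# Lambda = 2[l(mu_hat) - l(0)] = n * xbar^2 for N(mu, 1); k = 1 restriction.
lam = n * xbar**2
crit = 3.841                               # chi^2_1 upper 5% quantile
rejection_rate = np.mean(lam > crit)       # should be close to alpha = 0.05
```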


Appendix K: Reference Tables

K.1 Common MLE Formulas

| Distribution | Parameters | Log-likelihood (up to constants) | MLE |
| --- | --- | --- | --- |
| \operatorname{Bern}(p) | p \in (0,1) | S\log p + (n-S)\log(1-p), with S = \sum_i x_i | \hat{p} = S/n |
| \operatorname{Poisson}(\lambda) | \lambda > 0 | S\log\lambda - n\lambda | \hat{\lambda} = S/n = \bar{x} |
| \operatorname{Exp}(\lambda) | \lambda > 0 | n\log\lambda - \lambda\sum x_i | \hat{\lambda} = n/\sum x_i = 1/\bar{x} |
| \mathcal{N}(\mu,\sigma^2) | \mu \in \mathbb{R}, \sigma^2 > 0 | -n\log\sigma - \frac{\sum(x_i-\mu)^2}{2\sigma^2} | \hat{\mu} = \bar{x}, \hat{\sigma}^2 = \frac{1}{n}\sum(x_i-\bar{x})^2 |
| \mathcal{U}(0,\theta) | \theta > 0 | -n\log\theta for \theta \geq x_{(n)}; -\infty otherwise | \hat{\theta} = x_{(n)} = \max_i x_i |
| \operatorname{Gamma}(\alpha,\beta) | \alpha,\beta > 0 | n\alpha\log\beta - n\log\Gamma(\alpha) + (\alpha-1)\sum\log x_i - \beta\sum x_i | No closed form; requires Newton-Raphson |
| \operatorname{Beta}(\alpha,\beta) | \alpha,\beta > 0 | n\log\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} + (\alpha-1)\sum\log x_i + (\beta-1)\sum\log(1-x_i) | No closed form |

K.2 Fisher Information Summary

| Distribution | Parameter | \mathcal{I}(\theta) | Efficient MLE | \operatorname{Var}_{\text{CRB}} = 1/(n\mathcal{I}) |
| --- | --- | --- | --- | --- |
| \operatorname{Bern}(p) | p | \frac{1}{p(1-p)} | \bar{x} | p(1-p)/n |
| \mathcal{N}(\mu,\sigma^2), \sigma^2 known | \mu | 1/\sigma^2 | \bar{x} | \sigma^2/n |
| \mathcal{N}(\mu,\sigma^2), \mu known | \sigma^2 | 1/(2\sigma^4) | \frac{1}{n}\sum(x_i-\mu)^2 | 2\sigma^4/n |
| \operatorname{Poisson}(\lambda) | \lambda | 1/\lambda | \bar{x} | \lambda/n |
| \operatorname{Exp}(\lambda) | \lambda | 1/\lambda^2 | 1/\bar{x} | \lambda^2/n |
| \operatorname{Categorical}(p_1,\ldots,p_K) | p_k (one of K-1 free) | \frac{1}{p_k} + \frac{1}{p_K} (marginal) | n_k/n | p_k(1-p_k)/n (multinomial variance) |

K.3 Confidence Interval Reference

| Parameter | Setting | Pivot | CI formula |
| --- | --- | --- | --- |
| Gaussian mean \mu | \sigma^2 known | Z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1) | \bar{x} \pm z_{\alpha/2}\sigma/\sqrt{n} |
| Gaussian mean \mu | \sigma^2 unknown | T = \frac{\bar{x}-\mu}{s/\sqrt{n}} \sim t_{n-1} | \bar{x} \pm t_{n-1,\alpha/2}\, s/\sqrt{n} |
| Gaussian variance \sigma^2 | \mu unknown | (n-1)s^2/\sigma^2 \sim \chi^2_{n-1} | \left[\frac{(n-1)s^2}{\chi^2_{n-1,\alpha/2}}, \frac{(n-1)s^2}{\chi^2_{n-1,1-\alpha/2}}\right] |
| Bernoulli p | large n | (\hat{p}-p)/\sqrt{\hat{p}(1-\hat{p})/n} \approx \mathcal{N}(0,1) | Wald: \hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}; Wilson score preferred near 0 or 1 |
| General MLE \hat{\theta} | large n | \sqrt{n\mathcal{I}(\hat{\theta})}(\hat{\theta}-\theta) \xrightarrow{d} \mathcal{N}(0,1) | \hat{\theta} \pm z_{\alpha/2}/\sqrt{n\mathcal{I}(\hat{\theta})} |
| Any statistic | non-parametric | bootstrap resampling | percentile or BCa bootstrap |



Appendix L: Worked Examples - End-to-End Estimation

L.1 Estimating the Parameters of a Sentiment Classifier

Problem. A BERT-based sentiment classifier is evaluated on a test set of n = 500 examples. It classifies 418 correctly (\hat{a} = 0.836). Report (a) the point estimate with standard error, (b) the 95% Wilson CI, and (c) whether the accuracy differs significantly from a baseline of 80%.

Step 1: Point estimate and SE.

\hat{a} = 418/500 = 0.836. Standard error: \text{SE} = \sqrt{\hat{a}(1-\hat{a})/n} = \sqrt{0.836 \times 0.164/500} = \sqrt{0.000274} \approx 0.0166.

Step 2: Wilson score CI.

With z = z_{0.025} = 1.96, n = 500, \hat{a} = 0.836:

\text{CI} = \left[\frac{0.836 + 1.96^2/(2 \cdot 500)}{1 + 1.96^2/500} \pm \frac{1.96\sqrt{0.836 \times 0.164/500 + 1.96^2/(4 \cdot 500^2)}}{1 + 1.96^2/500}\right]

Numerically: centre = (0.836 + 0.00384)/1.00768 \approx 0.8334; margin \approx 0.0324. CI \approx [0.801, 0.866].

Step 3: Comparison to baseline 80%.

The null hypothesis H_0: a^* = 0.80 corresponds to checking whether 0.80 lies inside the CI. It does not: 0.80 falls just below the lower endpoint 0.801. Equivalently, a z-test with the Wald SE gives z = (0.836 - 0.80)/0.0166 \approx 2.17 > 1.96; using the standard error under the null, \sqrt{0.80 \times 0.20/500} \approx 0.0179, gives z \approx 2.01 > 1.96. Either way, the improvement is statistically significant at \alpha = 0.05.

Statistical interpretation. The classifier significantly outperforms the 80% baseline, but the uncertainty spans from 80.1% to 86.6% - real-world performance could plausibly lie anywhere in this range, highlighting the importance of CIs even for significant results.
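The arithmetic above can be wrapped in a small helper (wilson_ci is an illustrative name, not a library function):

```python
import numpy as np

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    margin = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - margin, centre + margin

lo, hi = wilson_ci(418, 500)   # lo ~ 0.801: the 80% baseline falls just outside
```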

L.2 MLE for a Gaussian Mixture - Worked Iteration

Problem. Two clusters are observed: \{1.1, 1.3, 0.9, 1.2\} and \{5.8, 6.2, 5.5, 6.1\} (but we don't know the assignments). Fit a 2-component Gaussian mixture.

Initialisation. \hat{\pi}_1 = 0.5, \hat{\mu}_1 = 2.0, \hat{\sigma}_1^2 = 1.0, \hat{\mu}_2 = 5.0, \hat{\sigma}_2^2 = 1.0.

E-step (iteration 1). For x = 1.1:

r_1 = \frac{0.5 \cdot \mathcal{N}(1.1;2.0,1.0)}{0.5 \cdot \mathcal{N}(1.1;2.0,1.0) + 0.5 \cdot \mathcal{N}(1.1;5.0,1.0)} = \frac{0.5 \times 0.266}{0.5 \times 0.266 + 0.5 \times 0.0002} \approx 0.9993

For x = 5.8: r_1 \approx 0.001. The algorithm already strongly assigns each point to the correct cluster.

M-step (iteration 1). With \sum_i r^{(i)}_1 \approx 4.0 (the four small values):

\hat{\mu}_1 \approx (1.1+1.3+0.9+1.2)/4 = 1.125, \quad \hat{\mu}_2 \approx (5.8+6.2+5.5+6.1)/4 = 5.9

After 2-3 iterations, the algorithm converges to the obvious clusters. The key insight: EM separates cluster assignment (E-step soft responsibilities r_{ik}) from parameter estimation (M-step weighted MLEs), alternating until convergence.
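The full EM loop for this toy example fits in a few lines of NumPy; a sketch assuming the eight observations above:

```python
import numpy as np

def norm_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu)**2 / var) / np.sqrt(2 * np.pi * var)

x = np.array([1.1, 1.3, 0.9, 1.2, 5.8, 6.2, 5.5, 6.1])
mu = np.array([2.0, 5.0])        # initial means
var = np.array([1.0, 1.0])       # initial variances
pis = np.array([0.5, 0.5])       # initial mixing weights

for _ in range(5):
    # E-step: responsibilities r[i, k] proportional to pi_k * N(x_i; mu_k, var_k)
    w = pis * norm_pdf(x[:, None], mu, var)
    r = w / w.sum(axis=1, keepdims=True)
    # M-step: weighted MLEs
    nk = r.sum(axis=0)
    pis = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu)**2).sum(axis=0) / nk
# mu converges to ~[1.125, 5.9], matching the hand computation above
```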

L.3 Bootstrap CI for Median Income - ML Context

Problem. A dataset of n = 50 yearly salaries (thousands) for ML engineers is collected. Sample statistics: \tilde{x} = 145, s = 42. Construct a 95% bootstrap CI for the median.

Why bootstrap? An analytic CI for the median requires knowing the PDF at the median - unknown without assuming a distribution. The salary distribution is likely right-skewed (a few outliers earn much more), making parametric CIs unreliable.

Bootstrap procedure (runnable sketch; assumes x is a length-50 NumPy array of the salaries):

import numpy as np

rng = np.random.default_rng(0)
B = 2000
medians = np.empty(B)
for b in range(B):
    x_star = rng.choice(x, size=len(x), replace=True)  # resample from the EDF
    medians[b] = np.median(x_star)
ci_95 = np.percentile(medians, [2.5, 97.5])

Typical result: \text{CI}_{95} \approx [132, 158] thousand. The CI is slightly asymmetric (the right tail is longer due to income skewness), which the bootstrap correctly captures and a symmetric Gaussian CI would miss.

For AI. The same procedure applies directly to ML engineer compensation surveys, which inform hiring benchmarks and compensation strategy at AI companies. Statistically sound salary benchmarks call for bootstrap CIs because income distributions are strongly non-Gaussian.


Appendix M: Glossary of Key Terms

| Term | Formal definition | Intuition |
| --- | --- | --- |
| Estimator | A measurable function T(\mathbf{x}^{(1)},\ldots,\mathbf{x}^{(n)}) of the data | A rule for computing a guess from observed data |
| Estimate | A specific value T(x^{(1)},\ldots,x^{(n)}) = t after observing data | The actual number produced by the estimator |
| Sampling distribution | The distribution of \hat{\theta} across all possible samples of size n | How much the estimate varies from experiment to experiment |
| Bias | \mathbb{E}[\hat{\theta}] - \theta | How far off the estimator is on average |
| Variance | \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2] | How much the estimator fluctuates around its mean |
| MSE | \mathbb{E}[(\hat{\theta}-\theta)^2] = \text{Bias}^2 + \text{Var} | Total estimation error |
| Consistency | \hat{\theta}_n \xrightarrow{P} \theta as n \to \infty | Converges to the right answer with more data |
| Efficiency | \operatorname{Var}(\hat{\theta}) = 1/(n\mathcal{I}(\theta)) | Achieves the minimum possible variance |
| Sufficient statistic | T such that p(\mathbf{x} \mid T;\theta) does not depend on \theta | T retains all the information the data carry about \theta |
| Fisher information | \mathcal{I}(\theta) = \mathbb{E}[s^2(\theta;X)] = -\mathbb{E}[\ell''(\theta;X)] | How much the data inform \theta; curvature of the log-likelihood |
| CRB | \operatorname{Var}(\hat{\theta}) \geq 1/(n\mathcal{I}(\theta)) for unbiased \hat{\theta} | Hard lower bound on variance; no unbiased estimator can beat it |
| MLE | \hat{\theta} = \arg\max_\theta \sum_i \log p(x^{(i)};\theta) | Parameter that makes the observed data most probable |
| Asymptotic normality | \sqrt{n}(\hat{\theta}_{\text{MLE}}-\theta^*) \xrightarrow{d} \mathcal{N}(0,\mathcal{I}^{-1}) | MLE is approximately normal for large n |
| Confidence interval | Random interval covering \theta with probability 1-\alpha | Uncertainty quantification for frequentist estimation |
| Natural parameter | \boldsymbol{\eta} in p(x;\boldsymbol{\eta}) = h(x)\exp(\boldsymbol{\eta}^\top T(x) - A(\boldsymbol{\eta})) | Parameterisation of exponential families |
| Natural gradient | \mathcal{I}(\boldsymbol{\theta})^{-1}\nabla_{\boldsymbol{\theta}}\mathcal{L} | Steepest ascent in the Fisher-Rao metric on the statistical manifold |
| Misspecification | p^* \notin \{p(\cdot;\theta):\theta \in \Theta\} | True distribution is not in the model family |
| Pseudo-true parameter | \theta^{**} = \arg\min_\theta D_{\mathrm{KL}}(p^*\|p(\cdot;\theta)) | Closest point in the model to the truth under KL divergence |

Appendix N: Further Reading and References

Primary References

  1. Casella, G. & Berger, R.L. (2002). Statistical Inference (2nd ed.). Duxbury. - The definitive graduate textbook on classical estimation theory; covers all topics in this section at full rigor.

  2. Lehmann, E.L. & Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer. - Advanced treatment of UMVUE theory, Rao-Blackwell, and sufficiency.

  3. Fisher, R.A. (1922). "On the Mathematical Foundations of Theoretical Statistics." Philosophical Transactions of the Royal Society A, 222, 309-368. - The foundational paper defining sufficiency, efficiency, consistency, and MLE.

  4. Cramer, H. (1946). Mathematical Methods of Statistics. Princeton University Press. - Original proof of the CRB.

  5. Rao, C.R. (1945). "Information and the Accuracy Attainable in the Estimation of Statistical Parameters." Bulletin of the Calcutta Mathematical Society, 37, 81-91. - Independent CRB proof and Rao-Blackwell theorem.

  6. Efron, B. (1979). "Bootstrap Methods: Another Look at the Jackknife." Annals of Statistics, 7(1), 1-26. - Original bootstrap paper.

ML-Specific References

  1. Amari, S. (1998). "Natural Gradient Works Efficiently in Learning." Neural Computation, 10(2), 251-276. - Foundation of natural gradient methods.

  2. Martens, J. & Grosse, R. (2015). "Optimizing Neural Networks with Kronecker-factored Approximate Curvature." ICML. - K-FAC for tractable FIM approximation in deep networks.

  3. Kirkpatrick, J. et al. (2017). "Overcoming Catastrophic Forgetting in Neural Networks." PNAS, 114(13), 3521-3526. - EWC using FIM for continual learning.

  4. Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS. - Chinchilla scaling laws via nonlinear MLE.

  5. Hu, E. et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR. - Low-rank MLE for efficient fine-tuning.

  6. Guo, C. et al. (2017). "On Calibration of Modern Neural Networks." ICML. - Temperature scaling as MLE on calibration set.

  7. Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press. - Chapter 5 covers MLE in the context of deep learning at the appropriate depth for ML practitioners.

Advanced References

  1. van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge. - Graduate-level asymptotic theory; proofs of asymptotic normality, delta method, semiparametric theory.

  2. Wasserman, L. (2004). All of Statistics. Springer. - Excellent graduate-level survey connecting classical statistics to modern methods.

  3. Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer. - Chapters 1-3 cover MLE, MAP, and Bayesian estimation with ML motivation.


This section is part of the Math for LLMs curriculum. For corrections or contributions, see CONTRIBUTING.md.


Appendix O: Connections to Information Theory

The Fisher information matrix has a deep connection to Shannon information theory, previewing material from Chapter 9 - Information Theory.

O.1 Fisher Information as Local Curvature of KL Divergence

The KL divergence between distributions at nearby parameter values satisfies:

D_{\mathrm{KL}}(p(\cdot;\boldsymbol{\theta}) \| p(\cdot;\boldsymbol{\theta}+\mathrm{d}\boldsymbol{\theta})) = \frac{1}{2}\mathrm{d}\boldsymbol{\theta}^\top \mathcal{I}(\boldsymbol{\theta})\, \mathrm{d}\boldsymbol{\theta} + O(\|\mathrm{d}\boldsymbol{\theta}\|^3)

The Fisher information matrix is the Hessian of the KL divergence with respect to the second argument, evaluated at the point where both arguments agree. This identifies \mathcal{I} as the local curvature of the divergence surface.

Consequence: the Fisher matrix converts parameter-space displacements into distribution-space distances. The natural gradient step \mathcal{I}^{-1}\nabla\mathcal{L} is the direction of steepest ascent of the objective per unit of KL divergence moved in distribution space - which is why natural gradient updates are invariant to how the model is parameterised.
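A numeric check of the second-order expansion for the Bernoulli family, where \mathcal{I}(p) = 1/(p(1-p)):

```python
import numpy as np

def kl_bern(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p, d = 0.3, 1e-3
fisher = 1 / (p * (1 - p))        # I(p) for Bernoulli
quadratic = 0.5 * fisher * d**2   # (1/2) * I(p) * d^2
exact = kl_bern(p, p + d)
# exact / quadratic -> 1 as d -> 0; the discrepancy is the O(d^3) remainder
```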

O.2 Cramer-Rao and Channel Capacity

Shannon's channel capacity theorem and the Cramer-Rao bound are related through the van Trees inequality (a Bayesian version of the CRB for random \theta):

\operatorname{MSE}(\hat{\theta}) \geq \frac{1}{n\mathcal{I}(\theta) + \mathcal{I}_{\text{prior}}}

where \mathcal{I}_{\text{prior}} is the Fisher information of the prior. As the prior becomes uninformative (\mathcal{I}_{\text{prior}} \to 0), this recovers the standard CRB. The connection: information accumulates (Fisher information adds) across observations and from the prior, exactly as Shannon information adds in a noisy channel.

O.3 Sufficient Statistics and Data Compression

A sufficient statistic T satisfies the data processing inequality: processing data through T cannot increase Fisher information. In fact, for a sufficient statistic, \mathcal{I}_T(\theta) = \mathcal{I}_{\mathbf{X}}(\theta) - no information is lost. This is the estimation-theory analogue of lossless compression: a sufficient statistic compresses the data without losing any information about \theta.

Minimal sufficient statistics provide the maximum compression while retaining all information - analogous to the minimum description length (MDL) principle in information theory.


Appendix P: Practice Problems

P.1 Identification Problems

For each of the following, state whether the model is identifiable. If not, identify the unidentifiable combination.

  1. X \sim \mathcal{N}(\mu_1 - \mu_2, 1) with parameters (\mu_1, \mu_2) \in \mathbb{R}^2.
  2. p(x|\theta) = \theta e^{-\theta x} for x > 0, with \theta > 0. (Identifiable?)
  3. A two-layer ReLU network f(x) = \text{ReLU}(w_1 x) + \text{ReLU}(w_2 x) with parameters (w_1, w_2). Is the function class identifiable?

P.2 Computational Problems

  1. MLE and invariance. For Poisson data with n = 20 and \sum x_i = 48: (a) find \hat{\lambda}; (b) find the MLE of e^{-\lambda} (the probability of observing zero events); (c) find the MLE of 1/\lambda (mean inter-arrival time).

  2. CRB for a transformed parameter. For X \sim \mathcal{N}(\mu, 1) with n observations, derive the CRB for estimating g(\mu) = \mu^2 using the biased-estimator form of the CRB. Show that the MLE \hat{\mu}^2 = \bar{x}^2 does not achieve this bound for finite n, but does so asymptotically.

  3. Bootstrap vs. asymptotic. For n = 20 observations from \mathcal{N}(\mu, 1) with \bar{x} = 3.2: (a) compute the exact 95% t-CI; (b) simulate the bootstrap CI with B = 1000; (c) compare coverage by repeating 1000 times and computing the empirical coverage of each CI type.

P.3 Conceptual Problems

  1. Stein paradox. For d = 1, the sample mean \bar{x} is the admissible MLE. For d = 3, the James-Stein estimator \tilde{\boldsymbol{\mu}} = \left(1 - \frac{d-2}{n\lVert\bar{\mathbf{x}}\rVert_2^2}\right)\bar{\mathbf{x}} dominates \bar{\mathbf{x}} in MSE. Simulate this for d = 3, n = 1, true \boldsymbol{\mu} = (2,2,2), \sigma = 1, and verify \text{MSE}(\tilde{\boldsymbol{\mu}}) < \text{MSE}(\bar{\mathbf{x}}).

  2. Model misspecification. A logistic regression model is fitted to data from a probit model (p(y=1|\mathbf{x}) = \Phi(\mathbf{w}^\top\mathbf{x})). To what does the MLE converge? Explain using the misspecified MLE theorem and the KL divergence interpretation.

  3. FIM for temperature scaling. For a K-class classifier with logits \mathbf{z} and temperature T, the softmax is p_k = e^{z_k/T}/\sum_j e^{z_j/T}. Derive the Fisher information \mathcal{I}(T) for a single example and explain why Newton-Raphson temperature calibration converges in 1-2 steps.


Section 02: Estimation Theory - Chapter 7: Statistics - Math for LLMs curriculum

Lines: ~2000 | Theory notebook: 50+ cells | Exercises: 8 graded problems