# Chapter 7 — Statistics
"Statistics is the grammar of science. Without it, data is just noise; with it, data becomes evidence."
## Overview
Statistics is the discipline of drawing principled conclusions from data. Where probability theory asks "given a model, what data should we expect?", statistics inverts the question: "given data, what can we infer about the underlying model?" This inversion — inference from observations — is the foundation of every machine learning training algorithm.
This chapter builds statistical reasoning from data summarisation (§01), through classical estimation and hypothesis testing (§02–§03), Bayesian inference (§04), time-series analysis (§05), and regression (§06). Every concept is grounded in its ML application: MLE underpins cross-entropy training, confidence intervals govern model evaluation, Bayesian inference enables uncertainty quantification in neural networks, and regression is the blueprint for supervised learning.
The conceptual arc: summarise data (§01) → estimate parameters from data (§02) → test hypotheses about data (§03) → update beliefs from data (§04) → model sequential data (§05) → model relationships between variables (§06).
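As a first concrete taste of the MLE link (developed in §02), here is a minimal NumPy sketch with invented labels and predictions: the average Bernoulli negative log-likelihood is exactly the binary cross-entropy loss.

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])              # observed binary labels (invented)
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])    # model-predicted P(y = 1) (invented)

# Bernoulli log-likelihood of the labels under the model
log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Binary cross-entropy as usually written in ML
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Minimising cross-entropy is maximising the likelihood (up to the 1/n factor)
assert np.isclose(-log_lik / len(y), bce)
```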
## Subsection Map
| # | Subsection | What It Covers | Canonical Topics |
|---|---|---|---|
| 01 | Descriptive Statistics | Summarising and characterising datasets before modelling | Mean, median, mode, variance, standard deviation, skewness, kurtosis, quantiles, IQR, correlation, covariance matrices, data visualisation, outlier detection |
| 02 | Estimation Theory | Inferring population parameters from samples; MLE and method of moments | Point estimation, bias, variance, MSE, consistency, efficiency, Cramér-Rao bound, MLE derivation, method of moments, confidence intervals, Fisher information |
| 03 | Hypothesis Testing | Formal decision-making under uncertainty; p-values, power, and error rates | Null/alternative hypotheses, Type I/II errors, p-values, significance level, power, t-tests, z-tests, chi-squared tests, ANOVA, multiple testing correction, A/B testing |
| 04 | Bayesian Inference | Treating parameters as random variables; posterior computation and uncertainty | Prior, likelihood, posterior, conjugate priors, MAP estimation, MCMC posterior sampling, variational inference, credible intervals, Bayesian model comparison |
| 05 | Time Series | Modelling and forecasting sequential, temporally dependent data | Stationarity, autocorrelation, AR/MA/ARMA/ARIMA models, spectral analysis, seasonal decomposition, Kalman filter, forecasting |
| 06 | Regression Analysis | Modelling relationships between variables; the blueprint for supervised learning | Simple and multiple linear regression, OLS derivation, Gauss-Markov theorem, regularisation (Ridge/Lasso), logistic regression, GLMs, model diagnostics |
## Reading Order and Dependencies
```
01-Descriptive-Statistics   (foundation: summarise before modelling)
        ↓
02-Estimation-Theory        (core inference: MLE, confidence intervals, Fisher info)
        ↓
03-Hypothesis-Testing       (decisions: p-values, power, A/B testing)
        ↓
04-Bayesian-Inference       (probabilistic view: posterior, MAP, MCMC)
        ↓
05-Time-Series              (sequential data: AR/MA, forecasting, Kalman)
        ↓
06-Regression-Analysis      (supervised learning blueprint: OLS, Ridge, Lasso)
        ↓
Chapter 8 — Optimization    (gradient methods, convexity, training algorithms)
```
## What Belongs Where — Canonical Homes
| Topic | Canonical Home | Preview Only In |
|---|---|---|
| Mean, median, mode, quantiles | §01 | §02 (sample mean as estimator) |
| Variance, standard deviation (sample) | §01 | §06 (residual variance) |
| Correlation and covariance (empirical) | §01 | §06 (design matrix structure) |
| Skewness, kurtosis, distribution shape | §01 | — |
| Outlier detection methods | §01 | — |
| Point estimators: bias, variance, MSE | §02 | — |
| Cramér-Rao lower bound, Fisher information | §02 | §04 (Laplace approximation) |
| MLE derivation | §02 | §04 (MAP as regularised MLE) |
| Method of moments | §02 | — |
| Confidence intervals (frequentist) | §02 | §03 (duality with tests) |
| Asymptotic normality of MLE | §02 | — |
| Null/alternative hypotheses, p-values | §03 | — |
| Type I/II errors, power, significance | §03 | — |
| t-test, z-test, chi-squared, ANOVA | §03 | — |
| Multiple testing correction | §03 | — |
| A/B testing framework | §03 | — |
| Bayes' theorem (prior × likelihood) | §04 | Ch6 §03 (general derivation); §02 (preview) |
| Conjugate priors | §04 | — |
| MAP estimation | §04 | §02 (as regularised MLE) |
| Posterior predictive distribution | §04 | — |
| Credible intervals | §04 | — |
| Variational inference | §04 | — |
| Stationarity, ACF/PACF | §05 | — |
| AR/MA/ARIMA models | §05 | — |
| Kalman filter | §05 | — |
| Spectral density, Fourier in time series | §05 | — |
| OLS derivation ($\hat{\beta} = (X^\top X)^{-1} X^\top y$) | §06 | — |
| Gauss-Markov theorem, BLUE | §06 | — |
| Ridge and Lasso regularisation | §06 | — |
| Logistic regression, GLMs | §06 | — |
| Residual analysis, model diagnostics | §06 | — |
## Overlap Danger Zones
1. Descriptive Statistics ↔ Probability Theory
- §01 computes sample statistics from data (empirical mean, sample variance, sample correlation).
- Ch6§04 defines population parameters (expected value, variance, covariance of a distribution).
- §01 must not re-derive the expectation operator; it applies the sample analogues and should backward-reference Ch6 §04 for the population versions (the plug-in vs. unbiased distinction is sketched below).
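A two-line illustration of the sample/population split, assuming NumPy (data invented): the default `np.var` divides by $n$ (the plug-in formula), while `ddof=1` gives the $n-1$ denominator of the unbiased sample variance that §02 analyses as an estimator.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative data

var_plugin = np.var(x)            # divides by n: plug-in (biased) estimator
var_sample = np.var(x, ddof=1)    # divides by n-1: unbiased sample variance

print(var_plugin, var_sample)     # 4.0 vs ~4.57
```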
2. MLE ↔ MAP ↔ Bayesian Inference
- §02 derives MLE from the likelihood principle and treats it as a point estimate.
- §04 introduces MAP as MLE regularised by a prior, then extends to full posterior inference.
- §02 may note that adding a prior gives MAP, but the full treatment of priors, posteriors, and conjugacy belongs in §04 (the identity is spelled out below).
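The precise sense in which MAP is "regularised MLE", stated here for orientation (the derivation itself belongs to §04):

$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\,\bigl[\log p(\mathcal{D} \mid \theta) + \log p(\theta)\bigr]$$

With a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$, the $\log p(\theta)$ term is $-\lVert\theta\rVert^2 / (2\tau^2)$ plus a constant, so MAP is MLE with an L2 (ridge) penalty; a Laplace prior gives an L1 (lasso) penalty instead. This is the statistical reading of the weight-decay rows in the ML Concept Map below.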
3. Confidence Intervals ↔ Credible Intervals
- §02 defines frequentist confidence intervals: in repeated experiments, 95% of such intervals contain the true parameter.
- §04 defines Bayesian credible intervals: the posterior probability that the parameter lies in the interval is 95%.
- The philosophical distinction belongs in §04; §02 covers only the frequentist construction (both constructions are sketched below).
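To make the contrast concrete, a hedged SciPy sketch (counts and level invented): a normal-approximation 95% confidence interval for a Bernoulli rate next to the 95% credible interval from a uniform Beta(1, 1) prior. The two intervals nearly coincide numerically here; the zone's point is that their interpretations differ.

```python
import numpy as np
from scipy import stats

k, n = 42, 100                 # illustrative: 42 successes in 100 trials
p_hat = k / n

# Frequentist 95% CI (normal approximation): a claim about the procedure
se = np.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# Bayesian 95% credible interval: Beta(1,1) prior -> Beta(1+k, 1+n-k) posterior
cred = stats.beta(1 + k, 1 + n - k).interval(0.95)

print(ci)    # ~ (0.323, 0.517)
print(cred)  # ~ (0.326, 0.517)
```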
4. Hypothesis Testing ↔ Bayesian Model Comparison
- §03 is the canonical home for p-values, power, and classical tests.
- §04 covers Bayes factors and Bayesian model selection as the Bayesian analogue.
- §03 may note the Bayesian alternative briefly; §04 covers it in depth.
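For reference, the classical side of this split in its most common ML guise, the A/B test of §03, as a hedged SciPy sketch (metric values simulated; the effect size is invented):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated per-user metric under two model variants (invented effect size)
a = rng.normal(loc=0.50, scale=0.10, size=200)
b = rng.normal(loc=0.53, scale=0.10, size=200)

# Welch's two-sample t-test; does not assume equal variances
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")  # reject H0 at level 0.05 if p < 0.05
```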
5. Regression ↔ Optimisation
- §06 derives OLS and establishes regression as a statistical model (Gauss-Markov, residual assumptions).
- Ch8 covers gradient-based optimisation, convexity, and SGD — the algorithmic machinery for fitting models.
- §06 may solve OLS by the normal equations (closed form) without invoking gradient descent, as sketched below; the algorithmic treatment belongs in Ch8.
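A minimal sketch of that closed-form route (data invented; no gradient descent involved): build the design matrix, then solve the normal equations $X^\top X \beta = X^\top y$ directly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data from y = 2 + 3x + noise
n = 100
x = rng.uniform(0.0, 1.0, size=n)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.1, size=n)

X = np.column_stack([np.ones(n), x])     # design matrix with intercept column

# Normal equations: solve (X^T X) beta = X^T y rather than inverting X^T X
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)                              # approximately [2, 3]
```

Solving the linear system is both cheaper and numerically safer than forming $(X^\top X)^{-1}$ explicitly.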
6. Regression ↔ Descriptive Statistics
- §01 computes the empirical correlation coefficient as a summary statistic.
- §06 derives the regression slope and explains how it relates to the correlation coefficient ($\hat{\beta}_1 = r \cdot s_y / s_x$).
- §01 must not derive OLS; §06 must backward-reference §01 for empirical correlation (the identity is verified below).
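That relationship, $\hat{\beta}_1 = r \cdot s_y / s_x$, can be verified in a few lines (data simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 1.5 * x + rng.normal(size=500)       # invented linear relationship

r = np.corrcoef(x, y)[0, 1]              # empirical correlation (the §01 object)
slope_from_r = r * np.std(y, ddof=1) / np.std(x, ddof=1)
slope_ols = np.polyfit(x, y, 1)[0]       # least-squares slope (the §06 object)

assert np.isclose(slope_from_r, slope_ols)
```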
## Key Cross-Chapter Dependencies
From Chapter 6 — Probability Theory:
- §01 (Random Variables, CDF/PDF) → §02 likelihood functions and sampling distributions
- §02 (Common Distributions) → §02/§03 named test statistics (t, chi-squared, F distributions)
- §03 (Bayes' theorem, conditional distributions) → §04 Bayesian inference
- §04 (Expectation, variance) → §02 estimator properties (bias, variance, MSE)
- §05 (CLT, LLN) → §02 asymptotic normality of MLE; §03 large-sample tests
- §06 (Stochastic Processes) → §05 time-series modelling (stationary processes, ACF)
- §07 (Markov Chains, MH) → §04 MCMC posterior sampling
From Chapter 3 — Advanced Linear Algebra:
- SVD → §02 Fisher information geometry; §06 OLS via pseudoinverse
- Positive definite matrices → §06 covariance matrix of estimators; ridge regression
Into Chapter 8 — Optimization:
- §02 (MLE) → cross-entropy and NLL as loss functions
- §06 (OLS) → normal equations as a linear system; extends to gradient descent
- §04 (ELBO, variational inference) → variational autoencoder training objective
Into Chapter 9 — Information Theory:
- §02 (Fisher information) → Cramér-Rao and the information-theoretic view of estimation
- §04 (KL divergence as posterior approximation criterion) → variational inference
## ML Concept Map
| ML Concept | Statistics Foundation | Section |
|---|---|---|
| Cross-entropy loss | Negative log-likelihood (MLE objective) | §02 |
| Weight decay / L2 regularisation | Ridge regression; MAP with Gaussian prior | §04, §06 |
| L1 / sparsity regularisation | Lasso regression; MAP with Laplace prior | §04, §06 |
| Confidence in model evaluation | Confidence intervals for test accuracy | §02, §03 |
| A/B testing for model comparison | Two-sample hypothesis test | §03 |
| Bayesian neural networks | Posterior over weights; variational inference | §04 |
| Dropout as Bayesian approximation | MC Dropout ↔ approximate posterior sampling | §04 |
| Batch normalisation | Sample mean/variance of activations | §01 |
| Layer normalisation | Sample statistics within layers | §01 |
| Early stopping | Train/validation loss as statistical estimators | §02 |
| Calibration (softmax temperature) | Posterior predictive calibration | §04 |
| Anomaly / OOD detection | Hypothesis testing; statistical distance | §03 |
| Data drift detection | Two-sample tests on feature distributions | §03 |
| Time-series forecasting | ARIMA, Kalman filter | §05 |
| Transformer positional encoding | Spectral methods, Fourier features | §05 |
| Linear probe evaluation | Logistic regression on embeddings | §06 |
| LoRA / low-rank adaptation | Regression on low-dimensional subspaces | §06 |
| Reward modelling (RLHF) | Logistic regression (Bradley-Terry model) | §06 |
## Prerequisites
Before starting this chapter, ensure you are comfortable with:
- Probability distributions — PDFs, CDFs, named distributions (Gaussian, Bernoulli, Poisson, Beta) — Chapter 6 §01–§02
- Expectation and variance — $\mathbb{E}[X]$, $\operatorname{Var}(X)$, covariance — Chapter 6 §04
- Bayes' theorem — $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$ — Chapter 6 §03
- Central limit theorem — the standardised sample mean converges in distribution to a Gaussian — Chapter 6 §05
- Matrix algebra — matrix inverse, SVD, positive definiteness — Chapter 3
- Calculus — derivatives, optimisation (for MLE derivations) — Chapter 4
← Previous Chapter: Probability Theory | Next Chapter: Optimization →