
Section 6.3 Joint Distributions

"Probability is not about the odds, but about the belief in the possibility of something happening."

  • Nassim Nicholas Taleb

Overview

Real-world phenomena are never isolated. A language model's next token depends on every preceding token. A patient's diagnosis depends on dozens of correlated symptoms. A stock price today depends on yesterday's price, the broader market, and macroeconomic variables. To reason about multiple quantities together - to model dependencies, compute conditionals, and propagate uncertainty - we need joint distributions.

This section develops the full theory of multivariate probability: how to define probability distributions over pairs and vectors of random variables, how to extract marginals and conditionals, and when independence permits us to factorise complexity. We give a complete treatment of the multivariate Gaussian, the most important distribution in machine learning, and show how the chain rule of probability underlies the autoregressive factorisation that powers every modern large language model.

The ideas here are the probabilistic backbone of Naive Bayes classifiers, Gaussian Mixture Models, Variational Autoencoders, and the attention mechanism in transformers.

Prerequisites

Companion Notebooks

Notebook          Description
theory.ipynb      Interactive exploration of joint distributions, MVN geometry, conditionals, and ML applications
exercises.ipynb   10 graded exercises from joint PMF verification through VAE reparameterisation

Learning Objectives

After completing this section, you will:

  • Define joint PMF, PDF, and CDF and verify they satisfy probability axioms
  • Compute marginal and conditional distributions from a joint distribution
  • Apply the chain rule of probability and recognise it as the foundation of autoregressive generation
  • State three equivalent characterisations of independence and give counterexamples showing that pairwise independence does not imply mutual independence
  • Explain conditional independence and its role in Naive Bayes and graphical models
  • Define the covariance matrix, verify it is positive semi-definite, and interpret eigenvectors geometrically
  • Derive the conditional distribution of a multivariate Gaussian using the Schur complement
  • Apply the change-of-variables formula with Jacobians to transform densities
  • Implement Naive Bayes classification and understand when the independence assumption helps vs hurts
  • Connect the chain rule factorisation p(\mathbf{x}) = \prod_t p(x_t \mid x_{<t}) to GPT-style language models

Table of Contents


1. Intuition and Motivation

1.1 Why Joint Distributions Matter

A univariate distribution answers one question: what is the probability that a single random variable takes a particular value? But the world is multivariate. We want to know: what is the probability that a language model generates token A and then token B? What is the probability that a patient has disease D given symptom S and test result T? What is the probability that two neurons in a neural network fire together?

To answer these questions, we need joint distributions - probability distributions over two or more random variables simultaneously. The joint distribution is the most complete probabilistic description of a system: from it, we can derive every single-variable distribution (by marginalising), every conditional distribution (by dividing), and every notion of dependence (by comparing the joint to the product of marginals).

The central tension in probabilistic machine learning is between expressiveness and tractability. A fully general joint distribution over n binary variables requires 2^n - 1 parameters - exponential in n. A fully factored (independence) model requires only n parameters but ignores all dependencies. The art of probabilistic modelling lies in choosing structured assumptions - conditional independence, low-rank covariance, mixture models - that capture the important dependencies while remaining computationally feasible.

For AI: Every language model's output distribution is a joint distribution over token sequences. The reason transformers are tractable is that they use the chain rule to factor p(x_1, \ldots, x_T) = \prod_{t=1}^T p(x_t \mid x_1, \ldots, x_{t-1}), converting an intractable V^T-dimensional joint into T tractable conditionals.

1.2 From Univariate to Multivariate

When we have a single random variable X, its distribution lives on the real line \mathbb{R}. The CDF F_X(x) = P(X \leq x) is a function of one variable, and the density f_X(x) integrates to 1 over \mathbb{R}.

When we have two random variables X and Y, their joint distribution lives on the plane \mathbb{R}^2. The joint CDF is F_{X,Y}(x,y) = P(X \leq x, Y \leq y), a function of two variables. The joint density f_{X,Y}(x,y) integrates to 1 over \mathbb{R}^2. For n variables \mathbf{X} = (X_1, \ldots, X_n), the joint distribution lives on \mathbb{R}^n.

UNIVARIATE vs MULTIVARIATE PROBABILITY
========================================================================

  Univariate:           X \in \mathbb{R}
  -------------
  PMF:  p(x) = P(X=x)                  [discrete]
  PDF:  f(x),  \int f(x)dx = 1            [continuous]
  CDF:  F(x) = P(X \leq x)

  Bivariate:            (X,Y) \in \mathbb{R}^2
  -------------
  PMF:  p(x,y) = P(X=x, Y=y)           [discrete]
  PDF:  f(x,y),  \iint f(x,y)dxdy = 1      [continuous]
  CDF:  F(x,y) = P(X \leq x, Y \leq y)

  Multivariate:         X = (X_1,...,X_n) \in \mathbb{R}^n
  -------------
  PDF:  f(x_1,...,x_n),  \int...\int f(x)dx_1...dx_n = 1
  CDF:  F(x) = P(X_1 \leq x_1, ..., X_n \leq x_n)

========================================================================

The key new concept is dependence: knowing the value of X can change our belief about Y. If X and Y are independent, knowing X tells us nothing about Y, and the joint distribution factorises as f_{X,Y}(x,y) = f_X(x) \cdot f_Y(y). Independence is a very special, restrictive assumption - it is almost never exactly true in practice, but it is often a useful approximation.

1.3 The Independence Assumption in ML

Many machine learning algorithms make explicit independence assumptions to achieve tractability:

  • Naive Bayes: Assumes features are conditionally independent given the class label. Despite being wrong for virtually all real datasets, this assumption produces competitive classifiers because it massively reduces the number of parameters.
  • Mean-field variational inference: Assumes the posterior over all latent variables factorises as q(\mathbf{z}) = \prod_i q_i(z_i). This is the canonical tractability trick in VAEs and Bayesian neural networks.
  • i.i.d. data assumption: Most supervised learning theory assumes training examples (x_i, y_i) are drawn independently from the same distribution. This is violated for time series, graphs, and correlated datasets - and the violation matters for generalisation bounds.
  • Residual stream in transformers: Attention heads in the same layer see the same residual stream, making their outputs dependent. Recent work on multi-head attention analyses this dependence structure.

Understanding exactly which independence assumptions are made - and when they break - is essential for diagnosing failure modes in learned models.

1.4 Historical Context

The study of joint distributions has a rich history tied to the development of statistics and probability theory.

  • 1888 - Francis Galton introduced correlation while studying the joint distribution of parents' and children's heights. His "regression to the mean" was originally an observation about the conditional mean.
  • 1896 - Karl Pearson formalised the correlation coefficient and the bivariate normal distribution, laying the groundwork for multivariate statistics.
  • 1933 - Andrey Kolmogorov provided the axiomatic foundation for probability theory, defining joint distributions via product measures on product spaces.
  • 1959 - Abe Sklar proved his eponymous theorem, showing any joint distribution can be decomposed into marginals and a copula - separating the dependence structure from the marginal distributions.
  • 1989 - Judea Pearl developed graphical models (Bayesian networks), providing a language for specifying conditional independence structures among many variables. His work underpins modern probabilistic AI.
  • 2013 - Diederik Kingma and Max Welling introduced VAEs, where the core idea is the joint distribution p(\mathbf{x}, \mathbf{z}) = p(\mathbf{z}) p(\mathbf{x} \mid \mathbf{z}) over observed data and latent variables.
  • 2017 - "Attention Is All You Need" made the chain rule factorisation \prod_t p(x_t \mid x_{<t}) the dominant paradigm for language modelling at scale.

2. Formal Definitions

2.1 Joint PMF (Discrete Case)

Let X and Y be discrete random variables taking values in countable sets \mathcal{X} and \mathcal{Y} respectively.

Definition: The joint probability mass function of (X, Y) is

p_{X,Y}(x, y) = P(X = x, Y = y) \quad \text{for all } x \in \mathcal{X}, y \in \mathcal{Y}

Axioms: A valid joint PMF must satisfy:

  1. Non-negativity: p_{X,Y}(x, y) \geq 0 for all (x, y)
  2. Normalisation: \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y) = 1

Example 1 - Fair dice: Roll two fair 6-sided dice. Let X = value of die 1, Y = value of die 2. Then p_{X,Y}(i,j) = 1/36 for all i, j \in \{1,\ldots,6\}. The joint distribution is uniform over the 6 \times 6 grid.

Example 2 - Dependent variables: Let X \sim \text{Bernoulli}(0.4) and Y = X (perfect dependence). Then:

p_{X,Y}(0,0) = 0.6, \quad p_{X,Y}(1,1) = 0.4, \quad p_{X,Y}(0,1) = p_{X,Y}(1,0) = 0

Example 3 - Joint table:

            Y=0     Y=1     Y=2
   X=0      0.1     0.2     0.1
   X=1      0.3     0.2     0.1

This is a valid joint PMF: all entries non-negative, sum = 1.0.

Non-example 1: p(0,0) = 0.3, p(0,1) = 0.4, p(1,0) = 0.2, p(1,1) = 0.2 - sum is 1.1, not valid.

Non-example 2: p(0,0) = 0.5, p(1,1) = 0.5, p(0,1) = -0.1, p(1,0) = 0.1 - negative entry, not valid.

Extension to n variables: The joint PMF of (X_1, \ldots, X_n) is

p(x_1, \ldots, x_n) = P(X_1 = x_1, \ldots, X_n = x_n)

with \sum_{x_1} \cdots \sum_{x_n} p(x_1, \ldots, x_n) = 1.
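A minimal NumPy sketch (the array layout and variable names are ours) that checks both axioms for the table in Example 3 and, anticipating Section 3.1, reads off the marginals and tests the factorisation condition:

```python
import numpy as np

# Joint PMF from Example 3: rows index X in {0, 1}, columns index Y in {0, 1, 2}
p_xy = np.array([[0.1, 0.2, 0.1],
                 [0.3, 0.2, 0.1]])

# Axiom 1 (non-negativity) and Axiom 2 (normalisation)
assert np.all(p_xy >= 0) and np.isclose(p_xy.sum(), 1.0)

# Marginals: sum out the other variable
p_x = p_xy.sum(axis=1)                         # [0.4, 0.6]
p_y = p_xy.sum(axis=0)                         # [0.4, 0.4, 0.2]

# Independence would require p(x, y) = p(x) p(y) for every cell
print(np.allclose(p_xy, np.outer(p_x, p_y)))   # False: X and Y are dependent
```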

2.2 Joint PDF (Continuous Case)

Let X and Y be continuous random variables.

Definition: A function f_{X,Y}: \mathbb{R}^2 \to \mathbb{R}_{\geq 0} is a joint probability density function for (X,Y) if for every (measurable) set A \subseteq \mathbb{R}^2:

P((X,Y) \in A) = \iint_A f_{X,Y}(x,y) \, dx \, dy

Axioms:

  1. f_{X,Y}(x,y) \geq 0 for all (x,y)
  2. \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dx \, dy = 1

Important: Unlike probabilities, densities can exceed 1 - only integrals of densities are probabilities.

Example 1 - Uniform on unit square: f_{X,Y}(x,y) = 1 for (x,y) \in [0,1]^2, zero elsewhere. Check: \int_0^1 \int_0^1 1 \, dx \, dy = 1. [ok]

Example 2 - Standard bivariate normal (uncorrelated):

f_{X,Y}(x,y) = \frac{1}{2\pi} \exp\!\left(-\frac{x^2 + y^2}{2}\right)

This factors as f_X(x) \cdot f_Y(y) - the variables are independent standard normals.

Example 3 - Correlated bivariate normal (correlation \rho):

f_{X,Y}(x,y) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\!\left(-\frac{x^2 - 2\rho xy + y^2}{2(1-\rho^2)}\right)

Non-example: f(x,y) = 2(x+y) - 1 on [0,1]^2 - this can be negative (at (0,0) it equals -1), so it is not a valid PDF.

Joint PDF in \mathbb{R}^n: For a random vector \mathbf{X} = (X_1, \ldots, X_n)^\top, the joint PDF f_{\mathbf{X}}: \mathbb{R}^n \to \mathbb{R}_{\geq 0} satisfies

P(\mathbf{X} \in A) = \int_A f_{\mathbf{X}}(\mathbf{x}) \, d\mathbf{x}

where d\mathbf{x} = dx_1 \cdots dx_n.

2.3 Joint CDF

Definition: The joint cumulative distribution function of (X_1, \ldots, X_n) is

F(x_1, \ldots, x_n) = P(X_1 \leq x_1, \ldots, X_n \leq x_n)

Properties:

  1. Monotone: F is non-decreasing in each argument
  2. Limits: F(x_1, \ldots, x_n) \to 1 as all x_i \to \infty; F \to 0 if any x_i \to -\infty
  3. Right-continuous: F is right-continuous in each argument
  4. Recovery of the PDF: For continuous variables, f(x_1, \ldots, x_n) = \frac{\partial^n F}{\partial x_1 \cdots \partial x_n}

Computing probabilities from the CDF (bivariate):

P(a < X \leq b, c < Y \leq d) = F(b,d) - F(a,d) - F(b,c) + F(a,c)

This inclusion-exclusion formula generalises to n dimensions with 2^n terms.

2.4 Mixed Discrete-Continuous Distributions

Some important distributions are mixed: discrete in one component, continuous in another.

Example - Spike-and-slab prior (used in LoRA): Let Z \sim \text{Bernoulli}(\pi) be a binary inclusion indicator and W \mid Z=1 \sim \mathcal{N}(0, \sigma_1^2), W \mid Z=0 = 0 (point mass). The joint distribution over (Z, W) is:

  • P(Z=0, W=0) = 1-\pi (discrete atom)
  • For w \neq 0: f(Z=1, W=w) = \pi \cdot \frac{1}{\sigma_1\sqrt{2\pi}} e^{-w^2/(2\sigma_1^2)} (continuous)

Example - Gaussian Mixture Model (GMM): Let Z \in \{1, \ldots, K\} be a discrete cluster label with P(Z=k) = \pi_k, and \mathbf{X} \mid Z=k \sim \mathcal{N}(\boldsymbol{\mu}_k, \Sigma_k). The joint p(Z, \mathbf{X}) is mixed. Marginalising out Z gives the mixture density:

f(\mathbf{x}) = \sum_{k=1}^K \pi_k \, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \Sigma_k)

This is one of the most widely used density estimation models in unsupervised learning.


3. Marginal and Conditional Distributions

3.1 Marginalisation

Given a joint distribution over (X, Y), we obtain the marginal distribution of X by "integrating out" Y - summing or integrating over all possible values of Y.

Discrete:

p_X(x) = \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y)

Continuous:

f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) \, dy

Intuition: The marginal p_X answers: "What is the probability that X=x, regardless of what Y is doing?" We collapse the joint table by summing along rows (to get the X marginal) or down columns (to get the Y marginal).

Example - From joint table:

            Y=0     Y=1     Y=2     p_X(x)
   X=0      0.1     0.2     0.1     0.4
   X=1      0.3     0.2     0.1     0.6
   p_Y(y)   0.4     0.4     0.2     1.0

Marginalisation in n dimensions: To obtain the marginal of \mathbf{X}_A (a subset of variables indexed by A), integrate over all other variables \mathbf{X}_{A^c}:

f_{\mathbf{X}_A}(\mathbf{x}_A) = \int f_{\mathbf{X}}(\mathbf{x}_A, \mathbf{x}_{A^c}) \, d\mathbf{x}_{A^c}

For AI: Training a language model means learning the joint distribution p(x_1, \ldots, x_T) over token sequences. At inference time, we often want p(x_t \mid x_1, \ldots, x_{t-1}) - a conditional derived from this joint. Computing the marginal p(x_t) = \sum_{x_1,\ldots,x_{t-1}} p(x_1,\ldots,x_t) would require summing over V^{t-1} sequences - intractable. This is why the chain rule factorisation is essential.

3.2 Conditional Distributions

The conditional distribution of Y given X = x captures how our belief about Y changes when we learn the value of X.

Discrete:

p_{Y \mid X}(y \mid x) = P(Y = y \mid X = x) = \frac{p_{X,Y}(x, y)}{p_X(x)}, \quad \text{provided } p_X(x) > 0

Continuous:

f_{Y \mid X}(y \mid x) = \frac{f_{X,Y}(x, y)}{f_X(x)}, \quad \text{provided } f_X(x) > 0

Verification: The conditional distribution is a valid probability distribution in y:

\sum_y p_{Y \mid X}(y \mid x) = \sum_y \frac{p_{X,Y}(x,y)}{p_X(x)} = \frac{p_X(x)}{p_X(x)} = 1 \quad \checkmark

Interpretation: f_{Y \mid X}(y \mid x) is a function of y for fixed x. As x varies, we get a family of distributions - the conditional distribution is a map from values of x to distributions on \mathcal{Y}.

Example - Bivariate normal: For (X, Y) jointly Gaussian with correlation \rho:

Y \mid X = x \sim \mathcal{N}\!\left(\mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X), \; \sigma_Y^2(1-\rho^2)\right)

The conditional mean is linear in x - this is the linear regression formula. The conditional variance \sigma_Y^2(1-\rho^2) is never larger than the marginal variance \sigma_Y^2 (conditioning reduces uncertainty), with equality iff \rho = 0.

For AI: Every output of a neural network is a parameterised conditional distribution. A classifier outputs p(y \mid \mathbf{x}; \theta); a language model outputs p(x_t \mid x_{<t}; \theta); a diffusion model computes p(\mathbf{x}_{t-1} \mid \mathbf{x}_t; \theta). All of probabilistic deep learning is the art of learning flexible conditional distributions from data.

3.3 Chain Rule of Probability

The chain rule (or product rule) expresses any joint distribution as a product of conditional distributions:

p(x_1, x_2) = p(x_1) \cdot p(x_2 \mid x_1)

p(x_1, x_2, x_3) = p(x_1) \cdot p(x_2 \mid x_1) \cdot p(x_3 \mid x_1, x_2)

In general, for any ordering of variables:

p(x_1, \ldots, x_n) = \prod_{i=1}^n p(x_i \mid x_1, \ldots, x_{i-1})

Proof: By induction. Base case n=2: p(x_1, x_2) = p(x_2 \mid x_1) p(x_1) is the definition of conditional probability. Inductive step: p(x_1,\ldots,x_n) = p(x_n \mid x_1,\ldots,x_{n-1}) \cdot p(x_1,\ldots,x_{n-1}), and apply the inductive hypothesis.

CHAIN RULE ORDERINGS FOR p(A, B, C)
========================================================================

  Forward:   p(A) \cdot p(B|A) \cdot p(C|A,B)
  Reverse:   p(C) \cdot p(B|C) \cdot p(A|B,C)
  Mixed:     p(B) \cdot p(A|B) \cdot p(C|A,B)

  All three represent the same joint distribution p(A,B,C).
  Graphical models choose the ordering that exploits conditional
  independence to reduce the complexity of each conditional.

========================================================================

For AI - Autoregressive language models: The chain rule directly motivates the autoregressive factorisation:

p(x_1, x_2, \ldots, x_T) = p(x_1) \prod_{t=2}^T p(x_t \mid x_1, \ldots, x_{t-1})

A GPT model learns one neural network to represent all T conditional distributions simultaneously (via causal masking). Each forward pass computes all T conditionals in parallel during training, while at inference time they are evaluated sequentially. The chain rule is not a modelling choice - it is an exact identity.
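A toy sketch of this idea, using an invented three-token vocabulary and made-up bigram probabilities (each conditional depends only on the previous token): sampling proceeds one conditional at a time, and the log joint probability is the sum of the log conditionals, exactly as the chain rule states.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["A", "B", "C"]
p_first = np.array([0.5, 0.3, 0.2])      # p(x_1), illustrative values
p_next = np.array([[0.1, 0.6, 0.3],      # p(x_t | x_{t-1} = A)
                   [0.4, 0.2, 0.4],      # p(x_t | x_{t-1} = B)
                   [0.3, 0.3, 0.4]])     # p(x_t | x_{t-1} = C)

def sample_sequence(T):
    """Sample x_1..x_T sequentially, one conditional at a time."""
    seq = [rng.choice(3, p=p_first)]
    for _ in range(T - 1):
        seq.append(rng.choice(3, p=p_next[seq[-1]]))
    return [vocab[i] for i in seq]

def log_joint(tokens):
    """log p(x_1..x_T) = log p(x_1) + sum_t log p(x_t | x_{t-1})."""
    idx = [vocab.index(t) for t in tokens]
    logp = np.log(p_first[idx[0]])
    for prev, cur in zip(idx, idx[1:]):
        logp += np.log(p_next[prev, cur])
    return logp

print(sample_sequence(5), log_joint(["A", "B", "A"]))
```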

3.4 Bayes' Theorem for Random Variables

For discrete random variables:

p_{X \mid Y}(x \mid y) = \frac{p_{Y \mid X}(y \mid x) \cdot p_X(x)}{\sum_{x'} p_{Y \mid X}(y \mid x') \cdot p_X(x')}

For continuous random variables (replace sums with integrals):

f_{X \mid Y}(x \mid y) = \frac{f_{Y \mid X}(y \mid x) \cdot f_X(x)}{\int f_{Y \mid X}(y \mid x') \cdot f_X(x') \, dx'}

In Bayesian terminology:

\underbrace{p(\theta \mid \mathcal{D})}_{\text{posterior}} = \frac{\underbrace{p(\mathcal{D} \mid \theta)}_{\text{likelihood}} \cdot \underbrace{p(\theta)}_{\text{prior}}}{\underbrace{p(\mathcal{D})}_{\text{evidence}}}

Example - Medical diagnosis: Let \theta = "patient has disease", D = "test is positive".

  • Prior: p(\theta=1) = 0.01; Sensitivity: p(D=1 \mid \theta=1) = 0.95; FPR: p(D=1 \mid \theta=0) = 0.05

p(\theta=1 \mid D=1) = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99} \approx 0.161

Despite a 95% sensitive test, a positive result gives only a 16% posterior probability - because the disease is rare. This is the base rate fallacy, a consequence of not properly using the joint distribution.
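The computation in one short snippet (numbers taken from the example above):

```python
prior = 0.01      # p(disease)
sens = 0.95       # p(positive | disease)
fpr = 0.05        # p(positive | no disease)

# Evidence p(positive) is the marginal over both values of theta
evidence = sens * prior + fpr * (1 - prior)
posterior = sens * prior / evidence          # Bayes' theorem
print(round(posterior, 3))                   # 0.161
```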

Continuous Bayes - Gaussian-Gaussian: Prior \theta \sim \mathcal{N}(\mu_0, \sigma_0^2), likelihood X \mid \theta \sim \mathcal{N}(\theta, \sigma^2). The posterior is:

\theta \mid X = x \sim \mathcal{N}\!\left(\frac{\sigma^2 \mu_0 + \sigma_0^2 x}{\sigma^2 + \sigma_0^2}, \; \frac{\sigma^2 \sigma_0^2}{\sigma^2 + \sigma_0^2}\right)

The posterior mean is a weighted average of the prior mean and the observation - a fundamental result in Bayesian estimation.


4. Independence and Conditional Independence

4.1 Independence - Three Equivalent Characterisations

X and Y are independent, written X \perp\!\!\!\perp Y, if and only if any of the following equivalent conditions holds:

Characterisation 1 (Factorisation):

p_{X,Y}(x,y) = p_X(x) \cdot p_Y(y) \quad \text{for all } x, y

Characterisation 2 (Conditional = Marginal):

p_{Y \mid X}(y \mid x) = p_Y(y) \quad \text{for all } x, y

Characterisation 3 (CDF Factorisation):

F_{X,Y}(x,y) = F_X(x) \cdot F_Y(y) \quad \text{for all } x, y

Proof of equivalence (1) <-> (2): If p_{X,Y} = p_X p_Y, then p_{Y \mid X}(y \mid x) = p_{X,Y}/p_X = p_Y. Conversely, p_{X,Y} = p_{Y \mid X} p_X = p_Y p_X.

Independence of functions: If X \perp\!\!\!\perp Y, then g(X) \perp\!\!\!\perp h(Y) for any measurable functions g, h.

Caution: Independence requires factorisation everywhere in the support. A single point where p_{X,Y}(x,y) \neq p_X(x)p_Y(y) breaks independence.

4.2 Pairwise vs Mutual Independence

For three or more variables, we must distinguish between pairwise and mutual independence.

  • (X_1, X_2, X_3) are pairwise independent if X_i \perp\!\!\!\perp X_j for all pairs (i,j)
  • (X_1, X_2, X_3) are mutually independent if p(x_1, x_2, x_3) = p(x_1)p(x_2)p(x_3) for all (x_1, x_2, x_3)

Mutual independence implies pairwise, but NOT vice versa.

Classic counterexample (Bernstein, 1946): Let X_1, X_2 \sim \text{Bernoulli}(1/2) independently, and X_3 = X_1 \oplus X_2 (XOR). Then all pairs are independent (pairwise), but P(X_3=1 \mid X_1=1, X_2=1) = 0 \neq 1/2 = P(X_3=1) - NOT mutually independent.

General definition: (X_1, \ldots, X_n) are mutually independent iff for every subset S \subseteq \{1, \ldots, n\}:

p\!\left(\bigcap_{i \in S} \{X_i = x_i\}\right) = \prod_{i \in S} p(X_i = x_i)

This requires 2^n - n - 1 factorisation conditions - all subsets of size \geq 2.
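A quick simulation of the Bernstein XOR construction (variable names are ours) confirming that each pair factorises while the triple does not:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
x3 = x1 ^ x2                               # XOR of two fair coins

def factorises(a, b):
    """Empirical check of p(a=1, b=1) = p(a=1) p(b=1)."""
    return np.isclose(np.mean((a == 1) & (b == 1)), a.mean() * b.mean(), atol=0.01)

print(factorises(x1, x2), factorises(x1, x3), factorises(x2, x3))   # all True

# Mutual independence fails: conditioning on x1 = x2 = 1 pins x3 down completely
print(x3[(x1 == 1) & (x2 == 1)].mean())    # 0.0, not the marginal 0.5
```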

4.3 Conditional Independence

X and Y are conditionally independent given Z, written X \perp\!\!\!\perp Y \mid Z, if:

p(x, y \mid z) = p(x \mid z) \cdot p(y \mid z) \quad \text{for all } x, y, z

Equivalently: p(x \mid y, z) = p(x \mid z) - once Z is known, additional knowledge of Y provides no information about X.

Conditional independence does not imply marginal independence, and vice versa.

Example 1 (CI but not marginal): Z selects which coin; X and Y are independent flips of that coin. Given Z: X \perp\!\!\!\perp Y \mid Z. Marginally: X and Y are correlated (via the shared coin).

Example 2 (Marginal but not CI): X, Y are independent fair coin flips; Z = X \oplus Y. Then X \perp\!\!\!\perp Y, but X \not\perp\!\!\!\perp Y \mid Z=0 (knowing the XOR makes the flips correlated).

CONDITIONAL INDEPENDENCE STRUCTURES
========================================================================

  Chain:    X -> Z -> Y      X \perp\perp Y | Z    (Z blocks the path)
  Fork:     X <- Z -> Y      X \perp\perp Y | Z    (Z explains the dependence)
  Collider: X -> Z <- Y      X \perp\perp Y        (marginally independent)
                             X \not\perp\perp Y | Z  (conditioning creates dependence)
                             - Berkson's paradox / selection bias

========================================================================

For AI:

  • Naive Bayes: p(\mathbf{x} \mid y) = \prod_i p(x_i \mid y) - features are conditionally independent given the class
  • Causal masking: In transformers, position t is conditionally independent of future tokens given the context x_{<t}

4.4 d-Separation Preview

Berkson's paradox (collider bias) illustrates how conditioning can create dependence. Suppose acceptance to a selective university depends on both academic ability A and athletic skill S, where A \perp\!\!\!\perp S marginally. Among admitted students (conditioning on admission), A and S are negatively correlated - high academic ability reduces the need for athletic skill to justify admission.

This is the collider structure A \to Z \leftarrow S with Z conditioned on. It appears throughout ML as selection bias: training on filtered data creates spurious dependencies between filtered-out features.

Forward reference: The systematic framework for reading off all conditional independence relationships from a directed acyclic graph (d-separation, Bayesian networks, Markov blankets) is developed in the graphical models literature; see Pearl (2000), Causality.


5. Covariance and Correlation

5.1 Covariance

Definition: The covariance of X and Y is:

\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]

Forward reference: The derivations of expectations using LOTUS are in Section 6.4 Expectation and Variance. Here we use these as defined.

Sign interpretation:

  • \text{Cov}(X,Y) > 0: X and Y tend to move in the same direction
  • \text{Cov}(X,Y) < 0: X and Y tend to move in opposite directions
  • \text{Cov}(X,Y) = 0: uncorrelated - no linear relationship

Key properties:

  1. Symmetry: \text{Cov}(X,Y) = \text{Cov}(Y,X)
  2. Bilinearity: \text{Cov}(aX+bZ, Y) = a\,\text{Cov}(X,Y) + b\,\text{Cov}(Z,Y)
  3. Self-covariance: \text{Cov}(X,X) = \text{Var}(X)
  4. Independence implies zero covariance: X \perp\!\!\!\perp Y \Rightarrow \text{Cov}(X,Y) = 0
  5. Variance of sum: \text{Var}(X+Y) = \text{Var}(X) + 2\text{Cov}(X,Y) + \text{Var}(Y)

Caution: Property 4 does NOT reverse. \text{Cov}(X,Y) = 0 does NOT imply X \perp\!\!\!\perp Y.

Example - Zero covariance, strong dependence: Let X \sim \text{Uniform}(-1, 1) and Y = X^2. Then \mathbb{E}[X] = 0 and \mathbb{E}[XY] = \mathbb{E}[X^3] = 0 (odd moments of a symmetric distribution vanish), so \text{Cov}(X,Y) = 0. But Y is a deterministic function of X - perfectly dependent. Covariance only measures linear dependence.
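A short numerical confirmation of this example (sample size is arbitrary): the empirical covariance is near zero even though Y is a deterministic function of X.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1_000_000)
y = x**2                                   # perfectly dependent on x

print(np.cov(x, y)[0, 1])                  # ~0: no linear relationship
print(np.var(y - x**2))                    # 0: y is fully determined by x
```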

5.2 Pearson Correlation Coefficient

The Pearson correlation coefficient normalises covariance to [-1, 1]:

\rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y}

Geometric interpretation: \rho is the cosine of the angle between centred random variables viewed as vectors in L^2:

\rho(X,Y) = \cos\theta, \quad \theta = \angle(X - \mathbb{E}[X], \, Y - \mathbb{E}[Y])

Properties:

  • |\rho(X,Y)| \leq 1 (Cauchy-Schwarz)
  • \rho = 1: perfect positive linear relationship
  • \rho = -1: perfect negative linear relationship
  • \rho = 0: uncorrelated (no linear relationship - but possibly a nonlinear one)
  • Scale invariance: \rho(aX + b, cY + d) = \text{sign}(ac) \cdot \rho(X, Y)

5.3 Covariance Matrix

For a random vector \mathbf{X} = (X_1, \ldots, X_n)^\top, the covariance matrix is:

\Sigma = \text{Cov}(\mathbf{X}) = \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^\top]

where \boldsymbol{\mu} = \mathbb{E}[\mathbf{X}]. Entry (i,j) is \Sigma_{ij} = \text{Cov}(X_i, X_j); the diagonal entries are \Sigma_{ii} = \text{Var}(X_i).

Key properties:

  1. Symmetry: \Sigma = \Sigma^\top
  2. Positive semi-definiteness: \mathbf{v}^\top \Sigma \mathbf{v} \geq 0 for all \mathbf{v} \in \mathbb{R}^n

Proof of PSD: \mathbf{v}^\top \Sigma \mathbf{v} = \mathbb{E}[(\mathbf{v}^\top(\mathbf{X}-\boldsymbol{\mu}))^2] \geq 0 since the expectation of a squared quantity is non-negative. QED

  3. Transformation: \text{Cov}(A\mathbf{X} + \mathbf{b}) = A \Sigma A^\top

Eigendecomposition: \Sigma = Q \Lambda Q^\top where Q is orthogonal (eigenvectors) and \Lambda = \text{diag}(\lambda_1, \ldots, \lambda_n) with \lambda_i \geq 0. The eigenvectors are the principal axes of the distribution; the eigenvalues are the variances along those axes.

For AI:

  • PCA finds the eigendecomposition of the sample covariance matrix
  • Adam's second-moment estimate approximates the diagonal of \mathbb{E}[g_t g_t^\top]
  • Whitening applies \Sigma^{-1/2} to decorrelate features - the idea that motivates BatchNorm's simpler per-feature normalisation
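A short sketch tying these properties together, using the same 2x2 covariance matrix as the geometry box in Section 6.2 (sample size and seed are arbitrary): the eigendecomposition gives the principal axes, and multiplying by \Sigma^{-1/2} whitens the samples.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])

# Eigendecomposition: eigenvectors = principal axes, eigenvalues = variances along them
eigvals, Q = np.linalg.eigh(Sigma)
print(eigvals)                               # ~[1.44, 5.56]

# Sample X ~ N(0, Sigma) via the Cholesky factor, then whiten with Sigma^{-1/2}
L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((100_000, 2)) @ L.T
W = Q @ np.diag(eigvals**-0.5) @ Q.T         # symmetric inverse square root of Sigma
Z = X @ W
print(np.cov(Z.T).round(2))                  # ~identity: decorrelated, unit variance
```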

5.4 Pitfalls: Anscombe, Zero Covariance, Causation

Anscombe's quartet (1973): Four datasets with nearly identical means, variances, and \rho \approx 0.816, yet completely different visual patterns - linear, curved, linear with an outlier, vertical cluster. Lesson: always visualise. Summary statistics can be identical for very different distributions.

Zero covariance \neq independence: Covariance only captures linear dependence. For general dependence, use mutual information I(X;Y) = \mathbb{E}\left[\log \frac{p(X,Y)}{p(X)p(Y)}\right] - which is zero if and only if X \perp\!\!\!\perp Y.

Correlation \neq causation: Ice cream sales and drowning rates are positively correlated (confounded by summer). In ML: spurious correlations in training data are learned as shortcuts (Geirhos et al., 2020). Causal invariance methods (IRM, DRO) aim to learn causal rather than associative features.


6. Multivariate Gaussian Distribution

6.1 Definition and Parameterisation

The multivariate Gaussian (MVN) is the most important distribution in machine learning.

Definition: \mathbf{X} \in \mathbb{R}^n follows \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma) with density:

f(\mathbf{x}) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)

where \boldsymbol{\mu} \in \mathbb{R}^n is the mean vector and \Sigma \in \mathbb{R}^{n \times n} is the covariance matrix (symmetric positive definite).

Parameters: n for \boldsymbol{\mu} plus n(n+1)/2 for \Sigma, giving n(n+3)/2 in total.

Natural parameterisation (precision form): Using \Lambda = \Sigma^{-1} (the precision matrix) and \boldsymbol{\eta} = \Lambda\boldsymbol{\mu}:

f(\mathbf{x}) \propto \exp\!\left(-\frac{1}{2}\mathbf{x}^\top \Lambda \mathbf{x} + \boldsymbol{\eta}^\top \mathbf{x}\right)

Special cases:

  • \Sigma = \sigma^2 I: isotropic - all dimensions independent, equal variance
  • \Sigma = \text{diag}(\sigma_1^2, \ldots, \sigma_n^2): diagonal - independent, heteroscedastic
  • \Sigma unrestricted (dense): full covariance - captures all pairwise correlations

6.2 Geometric Interpretation

The squared Mahalanobis distance (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) measures distance in the \Sigma metric. Level sets (contours of constant density) are ellipsoids in \mathbb{R}^n.

The eigendecomposition \Sigma = Q \Lambda Q^\top reveals:

  • Eigenvectors of \Sigma (columns of Q): principal axes of the ellipsoid
  • Eigenvalues \lambda_i: squared semi-axis lengths - larger = more spread in that direction

MVN GEOMETRY (2D EXAMPLE)
========================================================================

  \Sigma = [[4, 2],     Eigenvalues:  \lambda_1 \approx 5.56,  \lambda_2 \approx 1.44
            [2, 3]]     Correlation:  \rho = 2/\sqrt{4 \cdot 3} \approx 0.577

        ^ x_2
        |    /v_1  (long axis, proportional to \sqrt{\lambda_1})
        |   /  _.-'
        |  .-'  ellipse centred at \mu
        |  `-.
        |      `-. v_2 (short axis)
        +--------------------> x_1

  After whitening Z = \Sigma^{-1/2}(X - \mu): level sets become circles.

========================================================================

6.3 Marginals of MVN

Theorem: Any marginal of an MVN is also Gaussian.

Let \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma) with block partition:

\mathbf{X} = \begin{pmatrix}\mathbf{X}_1 \\ \mathbf{X}_2\end{pmatrix}, \quad \boldsymbol{\mu} = \begin{pmatrix}\boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2\end{pmatrix}, \quad \Sigma = \begin{pmatrix}\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}

Then: \mathbf{X}_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_{11}) and \mathbf{X}_2 \sim \mathcal{N}(\boldsymbol{\mu}_2, \Sigma_{22}).

Important: The converse is FALSE - marginals being Gaussian does not imply the joint is Gaussian. Non-Gaussian copulas can yield Gaussian marginals.

6.4 Conditionals of MVN - Schur Complement

Theorem: Under the same partition:

\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2 \sim \mathcal{N}(\boldsymbol{\mu}_{1|2}, \Sigma_{1|2})

where:

\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)

\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}

The matrix \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} is the Schur complement of \Sigma_{22} in \Sigma.

Intuition:

  • Conditional mean = prior mean + a correction proportional to the deviation of \mathbf{x}_2 from its mean
  • Conditional covariance does NOT depend on \mathbf{x}_2 - unique to Gaussians
  • \Sigma_{1|2} \preceq \Sigma_{11}: conditioning never increases uncertainty

Applications: Gaussian process regression (predictive distribution at test points), Bayesian linear regression (posterior over weights), and the Kalman filter (state update step - the Kalman gain is K = \Sigma_{12}\Sigma_{22}^{-1}).
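A minimal numerical sketch of the Schur complement formula (it reuses the 3-dimensional mean and covariance from Exercise 6 below; the conditioning value 3.5 is arbitrary), with a crude Monte Carlo check that keeps only samples with X_3 close to 3.5:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])

# Partition: block 1 = (X1, X2), block 2 = (X3); condition on X3 = 3.5
S11, S12 = Sigma[:2, :2], Sigma[:2, 2:]
S21, S22 = Sigma[2:, :2], Sigma[2:, 2:]
x2 = np.array([3.5])

mu_cond = mu[:2] + S12 @ np.linalg.solve(S22, x2 - mu[2:])   # [1.0, 2.25]
Sigma_cond = S11 - S12 @ np.linalg.solve(S22, S21)           # [[4.0, 2.0], [2.0, 2.5]]
print(mu_cond, Sigma_cond)

# Monte Carlo check: keep draws with X3 near 3.5 and compare empirical moments
X = rng.multivariate_normal(mu, Sigma, size=500_000)
keep = np.abs(X[:, 2] - 3.5) < 0.05
print(X[keep, :2].mean(axis=0), np.cov(X[keep, :2].T).round(2))
```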

6.5 Affine Transformations

Theorem: If \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma) and \mathbf{Y} = A\mathbf{X} + \mathbf{b}, then:

\mathbf{Y} \sim \mathcal{N}(A\boldsymbol{\mu} + \mathbf{b}, \; A\Sigma A^\top)

Applications:

  • Whitening: A = \Sigma^{-1/2} -> \mathbf{Y} \sim \mathcal{N}(\mathbf{0}, I)
  • PCA projection: A = Q_k^\top (top k eigenvectors) -> a k-dimensional projection
  • Reparameterisation trick: Sample \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, I), set \mathbf{z} = \boldsymbol{\mu} + L\boldsymbol{\varepsilon} where LL^\top = \Sigma - allows gradients to flow through the sampling step in VAEs

6.6 Maximum Entropy Property

Theorem: Among all distributions on \mathbb{R}^n with fixed mean \boldsymbol{\mu} and covariance \Sigma, the MVN has maximum differential entropy:

h(\mathcal{N}(\boldsymbol{\mu}, \Sigma)) = \frac{1}{2}\log\left((2\pi e)^n |\Sigma|\right)

Implication for AI: Gaussian priors/posteriors are the maximum-entropy (least informative) choice given mean and covariance constraints. This justifies Gaussian weight initialisation (no preferred direction), Gaussian latent priors in VAEs (maximum-entropy latent space), and the Laplace approximation to the Bayesian posterior.


7. Transformations of Random Vectors

7.1 Change of Variables and Jacobians

Given a random vector \mathbf{X} with PDF f_{\mathbf{X}} and a differentiable bijection \mathbf{g}: \mathbb{R}^n \to \mathbb{R}^n, the PDF of \mathbf{Y} = \mathbf{g}(\mathbf{X}) is:

f_{\mathbf{Y}}(\mathbf{y}) = f_{\mathbf{X}}(\mathbf{g}^{-1}(\mathbf{y})) \cdot \left|\det J_{\mathbf{g}^{-1}}(\mathbf{y})\right|

where J_{\mathbf{g}^{-1}} is the Jacobian of the inverse map. The Jacobian determinant measures how much \mathbf{g} stretches or compresses volume - probability mass must be conserved.

Example 1 - Polar coordinates: (X_1, X_2) \sim \mathcal{N}(\mathbf{0}, I), transform to (R, \Theta). The Jacobian factor is |\det J| = r, giving:

f_{R,\Theta}(r,\theta) = \frac{r}{2\pi} e^{-r^2/2}

So R has the Rayleigh distribution, \Theta \sim \text{Uniform}(0, 2\pi), and R \perp\!\!\!\perp \Theta.

Example 2 - Box-Muller transform: U_1, U_2 \sim \text{Uniform}(0,1) independent. Define:

X_1 = \sqrt{-2\log U_1} \cos(2\pi U_2), \quad X_2 = \sqrt{-2\log U_1} \sin(2\pi U_2)

Then X_1, X_2 \sim \mathcal{N}(0,1) independently - a practical algorithm for Gaussian sampling.
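A direct implementation of the transform (sample size is arbitrary); the moment and correlation checks are consistent with two independent standard normals.

```python
import numpy as np

rng = np.random.default_rng(0)
u1, u2 = rng.random(1_000_000), rng.random(1_000_000)

# Box-Muller: two independent uniforms -> two independent standard normals
r = np.sqrt(-2.0 * np.log(u1))
x1 = r * np.cos(2.0 * np.pi * u2)
x2 = r * np.sin(2.0 * np.pi * u2)

print(x1.mean().round(3), x1.std().round(3))        # ~0, ~1
print(np.corrcoef(x1, x2)[0, 1].round(3))           # ~0
```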

Example 3 - Log-normal: X \sim \mathcal{N}(\mu, \sigma^2), Y = e^X. Then:

f_Y(y) = \frac{1}{y\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(\log y - \mu)^2}{2\sigma^2}\right), \quad y > 0

For AI - Normalising flows: Flows learn a chain \mathbf{g} = \mathbf{g}_K \circ \cdots \circ \mathbf{g}_1 from a simple base distribution to a complex target. The change-of-variables formula gives exact log-likelihoods:

\log f_{\mathbf{Y}}(\mathbf{y}) = \log f_{\mathbf{Z}}(\mathbf{z}) - \sum_{k=1}^K \log\left|\det J_{\mathbf{g}_k}\right|

Efficient flows (RealNVP, Glow) use triangular Jacobians with O(n) determinant computation.

7.2 Sum of Random Variables and Convolution

If X \perp\!\!\!\perp Y and Z = X + Y:

f_Z(z) = (f_X * f_Y)(z) = \int_{-\infty}^{\infty} f_X(x) f_Y(z-x) \, dx

Stability under convolution:

  • \mathcal{N}(\mu_1, \sigma_1^2) * \mathcal{N}(\mu_2, \sigma_2^2) = \mathcal{N}(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2)
  • \text{Poisson}(\lambda_1) * \text{Poisson}(\lambda_2) = \text{Poisson}(\lambda_1+\lambda_2)
  • \text{Gamma}(\alpha_1, \beta) * \text{Gamma}(\alpha_2, \beta) = \text{Gamma}(\alpha_1+\alpha_2, \beta)

Forward reference: The Central Limit Theorem - repeated convolution of any distribution with finite variance converges (after standardisation) to a Gaussian - is proved in Section 6.6 Stochastic Processes via MGF factorisation.

7.3 Copulas

A copula separates the dependence structure from the marginals.

Sklar's Theorem (1959): For any joint CDF F_{X,Y} with continuous marginals F_X and F_Y, there exists a unique copula C: [0,1]^2 \to [0,1] such that:

F_{X,Y}(x,y) = C(F_X(x), F_Y(y))

The copula C is the joint CDF of (F_X(X), F_Y(Y)) - the probability-integral-transformed variables, which have \text{Uniform}(0,1) marginals.

Common copulas:

  • Independence: C(u,v) = uv
  • Gaussian copula: Dependence structure of a bivariate normal with arbitrary marginals
  • Clayton: Lower tail dependence (joint extremes at low values more likely)
  • Gumbel: Upper tail dependence

For AI: Copulas model the dependence structure of generated text distributions independently of the individual token marginals. In risk modelling, asset returns are often fitted with Gaussian marginals but non-Gaussian tail dependence via t-copulas.

7.4 Order Statistics

Given n i.i.d. samples X_1, \ldots, X_n from CDF F with density f, the sorted values X_{(1)} \leq \cdots \leq X_{(n)} are the order statistics. The density of the k-th order statistic is:

f_{X_{(k)}}(x) = \frac{n!}{(k-1)!(n-k)!} F(x)^{k-1}(1-F(x))^{n-k}f(x)

Minimum: f_{X_{(1)}}(x) = n(1-F(x))^{n-1}f(x)

Maximum: f_{X_{(n)}}(x) = nF(x)^{n-1}f(x)

For AI - Top-k sampling: Language model top-k sampling selects among the k highest-probability tokens - analysable via the joint distribution of the k largest order statistics of the logit vector. Top-p (nucleus) sampling is related to the quantile of the cumulative distribution.
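A minimal sketch of top-k sampling over a toy logit vector (vocabulary size, k, and the random logits are invented for illustration): keep the k largest logits - the top k order statistics - renormalise, and sample.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.standard_normal(50_000)       # toy logits over a 50k-token vocabulary

def top_k_sample(logits, k=50):
    """Restrict to the k largest logits, renormalise with softmax, then sample."""
    top = np.argpartition(logits, -k)[-k:]           # indices of the k largest logits
    p = np.exp(logits[top] - logits[top].max())
    p /= p.sum()
    return top[rng.choice(k, p=p)]

print(top_k_sample(logits))
```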


8. ML Applications

8.1 Naive Bayes Classifier

Model: Given a class label Y \in \{1, \ldots, K\} and features \mathbf{x} = (x_1, \ldots, x_d):

p(\mathbf{x} \mid Y = k) = \prod_{i=1}^d p(x_i \mid Y = k)

Features are assumed conditionally independent given the class - this reduces the parameter count from exponential to linear in d.

Decision rule:

\hat{y} = \arg\max_k \left[\log p(Y=k) + \sum_{i=1}^d \log p(x_i \mid Y=k)\right]

Variants: Gaussian NB (continuous features), Multinomial NB (word counts), Bernoulli NB (binary features).

Why it works despite being wrong: The independence assumption is almost always false, yet Naive Bayes is competitive because:

  1. Classification only requires the sign of the log-posterior ratio to be correct - much weaker than accurate probability estimation
  2. The independence assumption acts as a regulariser with limited data
  3. The decision boundary is a hyperplane - equivalent to logistic regression with NB features
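A compact Gaussian Naive Bayes sketch implementing the factorised likelihood and decision rule above (the class name, toy blobs, and all numbers are ours; in practice a library implementation such as scikit-learn's would normally be used):

```python
import numpy as np

class ToyGaussianNB:
    """Gaussian Naive Bayes: p(x | y=k) = prod_i N(x_i; mu_ki, var_ki)."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior = np.array([(y == k).mean() for k in self.classes])
        self.mu = np.array([X[y == k].mean(axis=0) for k in self.classes])
        self.var = np.array([X[y == k].var(axis=0) for k in self.classes])
        return self

    def predict(self, X):
        # log p(y=k) + sum_i log N(x_i; mu_ki, var_ki) for each class k
        log_post = np.log(self.prior) + np.stack([
            -0.5 * (np.log(2 * np.pi * self.var[k])
                    + (X - self.mu[k]) ** 2 / self.var[k]).sum(axis=1)
            for k in range(len(self.classes))], axis=1)
        return self.classes[np.argmax(log_post, axis=1)]

# Toy data: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)
print((ToyGaussianNB().fit(X, y).predict(X) == y).mean())   # training accuracy
```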

8.2 Gaussian Mixture Models

A GMM is a joint distribution p(Z, \mathbf{X}) where:

  • Z \in \{1, \ldots, K\} is a latent cluster indicator with P(Z=k) = \pi_k
  • \mathbf{X} \mid Z=k \sim \mathcal{N}(\boldsymbol{\mu}_k, \Sigma_k)

The marginal mixture density: p(\mathbf{x}) = \sum_{k=1}^K \pi_k \, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \Sigma_k)

EM algorithm:

  • E-step: r_{ik} = P(Z=k \mid \mathbf{x}_i) \propto \pi_k \mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_k, \Sigma_k) (Bayes' theorem applied to the joint)
  • M-step: Update parameters by maximising the expected complete-data log-likelihood

For AI: GMMs model speaker diarisation, sound source separation, and codebook structure in VQ-VAEs. Mechanistic interpretability research (Elhage et al., 2022) has found that transformer residual stream activations often decompose into Gaussian clusters across different input contexts.
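One EM iteration for a GMM, written out as a sketch (no numerical safeguards such as covariance regularisation; the initialisation and data are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, Sigmas):
    """One EM iteration for a Gaussian mixture."""
    K = len(pis)
    # E-step: responsibilities r_ik = p(Z=k | x_i), Bayes' theorem on the joint p(Z, X)
    r = np.stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                  for k in range(K)], axis=1)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted maximum likelihood updates
    Nk = r.sum(axis=0)
    pis = Nk / len(X)
    mus = [(r[:, k:k+1] * X).sum(axis=0) / Nk[k] for k in range(K)]
    Sigmas = [(r[:, k:k+1] * (X - mus[k])).T @ (X - mus[k]) / Nk[k] for k in range(K)]
    return pis, mus, Sigmas

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (300, 2)), rng.normal(4, 1, (300, 2))])
params = ([0.5, 0.5], [X[0], X[-1]], [np.eye(2), np.eye(2)])
for _ in range(20):
    params = em_step(X, *params)
print(params[1])    # the two means should approach (0, 0) and (4, 4)
```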

8.3 VAE Encoder as Conditional Distribution

A VAE defines the generative joint distribution p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{x} \mid \mathbf{z}) \cdot p(\mathbf{z}) with prior p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, I).

The encoder approximates the intractable posterior p_\theta(\mathbf{z} \mid \mathbf{x}) with:

q_\phi(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}_\phi(\mathbf{x}), \text{diag}(\sigma^2_\phi(\mathbf{x})))

ELBO:

\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}[\log p_\theta(\mathbf{x} \mid \mathbf{z})] - D_{\text{KL}}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z}))

Reparameterisation trick: Write \mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\varepsilon} with \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, I). This moves the stochasticity into \boldsymbol{\varepsilon} (which does not depend on \phi), so gradients flow through \boldsymbol{\mu}_\phi and \boldsymbol{\sigma}_\phi - the affine transformation property of the MVN applied to enable gradient-based learning.
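A NumPy sketch of the distributional identity behind the trick (gradients themselves would come from an autodiff framework; this only checks that z = mu + L eps has the intended mean and covariance, reusing the mu and Sigma of Exercise 7 below):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
L = np.linalg.cholesky(Sigma)        # L L^T = Sigma

# Reparameterisation: all randomness lives in eps ~ N(0, I); z is a deterministic,
# differentiable function of (mu, L) given eps
eps = rng.standard_normal((100_000, 2))
z = mu + eps @ L.T

print(z.mean(axis=0).round(2))       # ~[ 1. -1.]
print(np.cov(z.T).round(2))          # ~[[2. 1.] [1. 2.]]
```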

8.4 Attention as Joint Distribution

For a query \mathbf{q}_t and keys \mathbf{k}_1, \ldots, \mathbf{k}_T:

\alpha_{ts} = \frac{\exp(\mathbf{q}_t^\top \mathbf{k}_s / \sqrt{d})}{\sum_{s'} \exp(\mathbf{q}_t^\top \mathbf{k}_{s'} / \sqrt{d})} = p(\text{attend to position } s \mid \text{query } t)

This defines a conditional distribution over positions given the query. The output is the conditional expectation:

\mathbf{o}_t = \sum_s \alpha_{ts} \mathbf{v}_s = \mathbb{E}_{s \sim p(\cdot \mid t)}[\mathbf{v}_s]

Each row of the attention weight matrix \{\alpha_{ts}\}_{t,s} is a conditional distribution over key positions; combined with a distribution over query positions, it defines a joint distribution over (query, key) position pairs. Causal masking encodes x_t \perp\!\!\!\perp x_{s'} \mid x_{<t} for s' > t - the autoregressive conditional independence assumption.
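A small sketch of causal attention with random queries, keys, and values (the shapes and toy sequence length are ours): each row of the masked softmax is a conditional distribution over earlier positions, and the output is its expectation over the values.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8
Q, K, V = rng.standard_normal((3, T, d))

scores = Q @ K.T / np.sqrt(d)
# Causal mask: query t may attend only to positions s <= t
scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -np.inf)

# Row-wise softmax: alpha[t] is the conditional distribution p(attend to s | query t)
alpha = np.exp(scores - scores.max(axis=-1, keepdims=True))
alpha /= alpha.sum(axis=-1, keepdims=True)
print(alpha.sum(axis=-1))            # each row sums to 1

out = alpha @ V                      # conditional expectation of the values
```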

8.5 Autoregressive Factorisation

The chain rule gives the autoregressive factorisation used by every modern language model:

p(x_1, \ldots, x_T) = \prod_{t=1}^T p(x_t \mid x_1, \ldots, x_{t-1})

Why this factorisation?

  1. Exact: A mathematical identity - no approximation
  2. Tractable: Each conditional is a distribution over the vocabulary \mathcal{V} (~50K-150K tokens) - manageable
  3. Sample-able: Sequential sampling from the conditionals
  4. Trainable: Cross-entropy loss on each position; all positions trained in parallel with teacher forcing

KV cache: Storing (\mathbf{k}_s, \mathbf{v}_s) for s < t avoids recomputation - a direct consequence of the sequential conditional structure. The cache grows linearly in sequence length, motivating context-length-efficient architectures (sliding window, linear attention, Mamba).


9. Common Mistakes

  1. Confusing p(X,Y) with p(X)p(Y). Why it's wrong: the second assumes independence. Fix: always verify factorisation before assuming independence.
  2. Marginalising over the wrong variable. Why it's wrong: summing out X gives p_Y, not p_X. Fix: identify which variable to sum or integrate out.
  3. Concluding X \perp\!\!\!\perp Y from \text{Cov}(X,Y) = 0. Why it's wrong: zero covariance rules out only linear dependence. Fix: compare p(X,Y) with p(X)p(Y); use mutual information.
  4. Confusing marginal and conditional independence. Why it's wrong: X \perp\!\!\!\perp Y and X \perp\!\!\!\perp Y \mid Z are independent statements. Fix: draw the causal structure; distinguish fork, chain, and collider.
  5. Wrong block structure in the conditional MVN. Why it's wrong: swapping \Sigma_{11} and \Sigma_{22} gives the wrong result. Fix: put the conditioning variable consistently in \mathbf{x}_2.
  6. Assuming Gaussian marginals imply a Gaussian joint. Why it's wrong: non-Gaussian copulas can yield Gaussian marginals. Fix: only the MVN guarantees that both marginals and conditionals are Gaussian.
  7. Using \det J_{\mathbf{g}} instead of \det J_{\mathbf{g}^{-1}} in the change-of-variables formula. Why it's wrong: the density of \mathbf{Y} = \mathbf{g}(\mathbf{X}) transforms with the absolute Jacobian determinant of the inverse map. Fix: use \left|\det J_{\mathbf{g}^{-1}}(\mathbf{y})\right|, or equivalently 1/\left|\det J_{\mathbf{g}}(\mathbf{x})\right|.
  8. Confusing pairwise and mutual independence. Why it's wrong: pairwise does not imply mutual (Bernstein XOR example). Fix: check all subset factorisations for mutual independence.
  9. Using \rho as a complete measure of dependence. Why it's wrong: Anscombe's quartet has the same \rho for very different distributions. Fix: always visualise; use mutual information for general dependence.
  10. Forgetting the normalising constant in continuous Bayes. Why it's wrong: the posterior must integrate to 1. Fix: compute the evidence \int p(\mathcal{D} \mid \theta)p(\theta)\,d\theta or use VI/MCMC.
  11. Conditioning on a collider and expecting independence. Why it's wrong: conditioning on a collider creates dependence (Berkson). Fix: draw the causal graph; identify colliders before conditioning.
  12. Comparing chain rule terms from different orderings. Why it's wrong: different orderings give different factorisations of the same joint. Fix: fix the ordering, or compare p(x_1,\ldots,x_n) directly.

10. Exercises

Exercise 1 * - Joint PMF Verification. Given the joint PMF table below, verify it is valid, compute all marginal distributions, and determine whether X \perp\!\!\!\perp Y.

            Y=0     Y=1     Y=2
   X=0      a       0.1     0.15
   X=1      0.2     0.1     b

(a) Find a, b such that p_X(0) = 0.4 and the table sums to 1. (b) Compute the marginals p_X and p_Y. (c) Test independence via the factorisation condition. (d) Compute P(X=1 \mid Y=1).

Exercise 2 * - Continuous Marginals and Conditionals. Let (X,Y) have joint PDF f(x,y) = c \cdot xy on [0,2] \times [0,1], zero elsewhere.

(a) Find c. (b) Compute the marginals f_X and f_Y. (c) Are X and Y independent? (d) Compute f_{Y \mid X}(y \mid x).

Exercise 3 * - Chain Rule and Autoregressive Factorisation. A simplified language model over vocabulary \{A, B, C\} has: p(A) = 0.5, p(B \mid A) = 0.4, p(C \mid A) = 0.3, p(A \mid B) = 0.5.

(a) Compute p(A, B) using the chain rule in forward order. (b) Compute p(B, A) in reverse order. Verify p(A,B) = p(B,A). (c) Implement a simple bigram language model and sample 100 two-token sequences.

Exercise 4 * - Independence Testing with Simulation. (a) Generate 10^5 pairs (X, Y) where X \sim \mathcal{N}(0,1) and Y = X^2 + \varepsilon, \varepsilon \sim \mathcal{N}(0, 0.1). Compute the Pearson \rho. (b) Estimate the mutual information \hat{I}(X;Y) via a binned histogram. (c) Explain why \rho \approx 0 but \hat{I}(X;Y) \gg 0.

Exercise 5 ** - Covariance Matrix Properties. Let \mathbf{X} \sim \mathcal{N}(\mathbf{0}, \Sigma) with \Sigma = \begin{pmatrix} 4 & 2 \\ 2 & 3 \end{pmatrix}.

(a) Verify \Sigma is PSD by computing its eigenvalues. (b) Compute the Cholesky factor L and sample from \mathbf{X} via \mathbf{X} = L\boldsymbol{\varepsilon}. (c) Compute \rho(X_1, X_2). (d) Compute the distribution of Y = X_1 + X_2 using the affine transformation property.

Exercise 6 ** - MVN Conditional Distribution. Let \mathbf{X} = (X_1, X_2, X_3)^\top \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma) with \boldsymbol{\mu} = (1, 2, 3)^\top and \Sigma = \begin{pmatrix} 4 & 2 & 0 \\ 2 & 3 & 1 \\ 0 & 1 & 2 \end{pmatrix}.

(a) Compute the conditional distribution (X_1, X_2) \mid X_3 = 3.5 via the Schur complement formula. (b) Verify numerically by sampling 10^4 draws, filtering to X_3 \in [3.4, 3.6], and computing the empirical conditional moments.

Exercise 7 *** - Reparameterisation Trick. (a) Show that if \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, I) and \mathbf{z} = \boldsymbol{\mu} + L\boldsymbol{\varepsilon}, then \mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, LL^\top). (b) For \boldsymbol{\mu} = [1, -1] and \Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, sample 10^4 points and verify the empirical mean and covariance. (c) Compute \partial \mathbf{z} / \partial \boldsymbol{\mu} and \partial \mathbf{z} / \partial L. Explain why the gradients exist and why this is essential for VAE training.

Exercise 8 *** - Naive Bayes vs Logistic Regression. (a) Implement Gaussian Naive Bayes for a 2-class, 2-feature problem. Fit via MLE. (b) Implement logistic regression via gradient descent. (c) Compare the decision boundaries on data with independent features. (d) Repeat with strongly correlated features. Analyse which model performs better and why, connecting to the conditional independence assumption.


11. Why This Matters for AI (2026 Perspective)

  • Chain rule factorisation: Every autoregressive model (GPT, LLaMA, Gemini, Claude) uses \prod_t p(x_t \mid x_{<t}) - the exact chain rule identity.
  • Conditional distributions: The core output of every neural network - a classifier outputs p(y \mid \mathbf{x}); LLMs output p(x_t \mid x_{<t}); diffusion models output p(\mathbf{x}_{t-1} \mid \mathbf{x}_t).
  • MVN conditionals (Schur complement): Gaussian process regression (used in Bayesian optimisation for hyperparameter search), Kalman filtering (robot navigation, sensor fusion).
  • Reparameterisation trick: VAEs (Kingma & Welling, 2013), continuous normalising flows, diffusion model variational bounds - all require differentiable sampling from Gaussian distributions.
  • Conditional independence: Naive Bayes (competitive for text classification), mean-field VI (VAE diagonal covariance), causal masking in attention (the AR factorisation independence assumption).
  • Covariance matrix / PSD: Adam, Shampoo, K-FAC (second-order optimisers) approximate or use the parameter covariance. BatchNorm/LayerNorm normalise activations using empirical statistics.
  • Berkson's paradox (collider bias): Selection bias in training data (models trained on filtered web data), benchmark contamination, shortcut learning (Geirhos et al., 2020).
  • Copulas: Modelling tail dependence in financial risk; structured dependence between multi-modal outputs; dependence modelling beyond correlation.
  • Normalising flows / Jacobians: RealNVP, Glow, Neural ODEs - generative models using exact change of variables for tractable log-likelihoods and invertible generation.
  • GMMs: Speaker diarisation, VQ-VAE codebooks (discrete latents via k-means on Gaussian clusters), mechanistic interpretability (superposition hypothesis).
  • Maximum entropy MVN: Justifies Gaussian weight initialisation (Xavier/Kaiming), Gaussian latent priors in VAEs, and the Laplace approximation to Bayesian neural network posteriors.

12. Conceptual Bridge

This section sits at the heart of probabilistic machine learning. We arrived here from the foundations of single-variable probability (Section 6.1) and named distributions (Section 6.2), and we have now built the full machinery for reasoning about multiple variables simultaneously. The chain rule of probability - a simple product of conditional distributions - is not merely an abstract identity: it is the operating principle of every language model in existence.

The multivariate Gaussian is the cornerstone of continuous multivariate modelling. Its remarkable closure properties (marginals Gaussian, conditionals Gaussian, affine transforms Gaussian) make it analytically tractable in a way no other distribution is. The Schur complement formula for conditional distributions appears in Gaussian process regression, Bayesian linear regression, the Kalman filter, and Gaussian belief propagation - it is one of the most frequently used results in applied probabilistic modelling.

Looking forward, the next section, Section 6.4 Expectation and Variance, develops the theory of moments in depth - deriving \mathbb{E}[X], \text{Var}(X), and higher moments using LOTUS, the law of total expectation, and the law of total variance. Many quantities referenced here (covariance as \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y], the MVN conditional mean) will be derived properly there. Section 6.5 Concentration Inequalities develops tail bounds that require the joint distribution structure built here. Section 6.6 Stochastic Processes proves the Central Limit Theorem via the convolution of joint distributions.

POSITION IN THE PROBABILITY THEORY CHAPTER
========================================================================

  Section 6.1 Introduction        Section 6.2 Distributions     Section 6.3 Joint [HERE]
  ----------------          ------------------       ----------------
  Axioms, CDF, PMF,    ->    Bernoulli, Gaussian, ->   Joint PMF/PDF,
  PDF, single RVs           Gamma, Beta, ...         Marginals, Conditionals,
                            Exp. family               Chain rule, MVN,
                                                      Transformations

        v                         v                         v
  Section 6.4 Expectation         Section 6.5 Concentration      Section 6.6 CLT/LLN
  -----------------         ------------------       ----------------
  E[X], Var(X), LOTUS,      Markov, Chebyshev,       LLN, CLT proof
  Total expectation/var     Hoeffding, PAC bounds     via joint dist.

                                  v
                      Section 6.7 Markov Chains
                      ------------------
                      Transition matrices,
                      Steady state, MCMC

========================================================================

The ideas in this section - joint distributions, conditional independence, the multivariate Gaussian, the chain rule - are not just mathematical tools. They are the conceptual vocabulary of modern probabilistic AI: the language in which generative models, Bayesian inference, and information-theoretic learning are expressed. Every time a GPT model generates a token, every time a VAE encodes an image, every time a diffusion model denoises - the machinery is joint distributions.


<- Back to Probability Theory | Next: Expectation and Variance ->


Appendix A: Worked Calculations - Bivariate Gaussian

This appendix provides complete, step-by-step calculations for the bivariate Gaussian that are essential for building intuition.

A.1 Deriving the Marginals from the Joint

The bivariate normal with zero means and correlation ρ\rho has joint density:

f(x,y)=12π1ρ2exp ⁣(x22ρxy+y22(1ρ2))f(x,y) = \frac{1}{2\pi\sqrt{1-\rho^2}} \exp\!\left(-\frac{x^2 - 2\rho xy + y^2}{2(1-\rho^2)}\right)

To find fX(x)f_X(x), integrate over yy:

fX(x)=f(x,y)dy=12π1ρ2exp ⁣(x22ρxy+y22(1ρ2))dyf_X(x) = \int_{-\infty}^\infty f(x,y)\,dy = \frac{1}{2\pi\sqrt{1-\rho^2}} \int_{-\infty}^\infty \exp\!\left(-\frac{x^2 - 2\rho xy + y^2}{2(1-\rho^2)}\right)dy

Complete the square in yy. The exponent (x22ρxy+y2)/(2(1ρ2))-(x^2 - 2\rho xy + y^2)/(2(1-\rho^2)) can be rewritten:

y22ρxy+x2=(yρx)2+x2(1ρ2)y^2 - 2\rho xy + x^2 = (y - \rho x)^2 + x^2(1-\rho^2)

So the exponent becomes:

(yρx)22(1ρ2)x22-\frac{(y-\rho x)^2}{2(1-\rho^2)} - \frac{x^2}{2}

Pulling the xx-dependent factor outside:

fX(x)=ex2/22π1ρ2exp ⁣((yρx)22(1ρ2))dyf_X(x) = \frac{e^{-x^2/2}}{2\pi\sqrt{1-\rho^2}} \int_{-\infty}^\infty \exp\!\left(-\frac{(y-\rho x)^2}{2(1-\rho^2)}\right)dy

The integral is 2π(1ρ2)\sqrt{2\pi(1-\rho^2)} (Gaussian integral), so:

fX(x)=ex2/22π1ρ22π(1ρ2)=ex2/22π=ϕ(x)f_X(x) = \frac{e^{-x^2/2}}{2\pi\sqrt{1-\rho^2}} \cdot \sqrt{2\pi(1-\rho^2)} = \frac{e^{-x^2/2}}{\sqrt{2\pi}} = \phi(x)

Thus XN(0,1)X \sim \mathcal{N}(0,1), confirming the marginal is standard normal regardless of ρ\rho. By symmetry, YN(0,1)Y \sim \mathcal{N}(0,1) as well.

A.2 Deriving the Conditional from the Joint

Dividing f(x,y)f(x,y) by fX(x)f_X(x):

fYX(yx)=f(x,y)fX(x)=12π(1ρ2)exp ⁣((yρx)22(1ρ2))f_{Y \mid X}(y \mid x) = \frac{f(x,y)}{f_X(x)} = \frac{1}{\sqrt{2\pi(1-\rho^2)}} \exp\!\left(-\frac{(y-\rho x)^2}{2(1-\rho^2)}\right)

This is N(ρx,1ρ2)\mathcal{N}(\rho x, 1-\rho^2) - confirming the conditional mean E[YX=x]=ρx\mathbb{E}[Y \mid X=x] = \rho x and conditional variance Var(YX=x)=1ρ2\text{Var}(Y \mid X=x) = 1 - \rho^2.

Interpretation of the conditional mean: The slope of the regression line E[YX=x]=ρx\mathbb{E}[Y \mid X=x] = \rho x equals the correlation coefficient when both variables are standardised. For unstandardised (X,Y)(X,Y) with means (μX,μY)(\mu_X, \mu_Y) and standard deviations (σX,σY)(\sigma_X, \sigma_Y):

E[YX=x]=μY+ρσYσX(xμX)\mathbb{E}[Y \mid X=x] = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X)

The regression coefficient β=ρσY/σX\beta = \rho \sigma_Y / \sigma_X is the slope of the best linear predictor of YY from XX.

A.3 Independence Condition for Bivariate Normal

XX and YY are independent in the bivariate normal iff ρ=0\rho = 0.

Proof: If ρ=0\rho = 0, then f(x,y)=12πe(x2+y2)/2=fX(x)fY(y)f(x,y) = \frac{1}{2\pi} e^{-(x^2+y^2)/2} = f_X(x) f_Y(y) - direct factorisation. Conversely, if f(x,y)=fX(x)fY(y)f(x,y) = f_X(x)f_Y(y) for all x,yx,y, then Cov(X,Y)=0\text{Cov}(X,Y) = 0, which forces ρ=0\rho = 0. QED

This is a special property of the multivariate Gaussian: for MVN, uncorrelated implies independent. This does NOT hold for general distributions.
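A quick Monte Carlo sanity check of A.1-A.3 can make these closed forms concrete. The following is a minimal numpy sketch (the correlation value, sample size, and conditioning window are illustrative choices, not part of the derivation):

```python
import numpy as np

# Sample a standard bivariate normal with correlation rho, then compare the
# empirical conditional of Y near X = x0 against N(rho * x0, 1 - rho**2).
rng = np.random.default_rng(0)
rho = 0.7                                    # illustrative correlation
cov = np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=500_000)

x, y = xy[:, 0], xy[:, 1]
x0 = 1.0
mask = np.abs(x - x0) < 0.05                 # crude conditioning on X ~= x0

print("empirical  E[Y | X~=1]:", y[mask].mean())   # ~ rho * x0 = 0.7
print("empirical Var[Y | X~=1]:", y[mask].var())   # ~ 1 - rho**2 = 0.51
print("marginal   Var[Y]      :", y.var())         # ~ 1, regardless of rho
```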


Appendix B: The Multivariate Gaussian - Derivation of the Schur Complement Formula

We derive the conditional distribution X1X2=x2\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2 from first principles.

Setup: X=(X1,X2)N(μ,Σ)\mathbf{X} = (\mathbf{X}_1^\top, \mathbf{X}_2^\top)^\top \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma) with the block partition as in Section 6.4.

Strategy: Construct a new variable Z=X1Σ12Σ221X2\mathbf{Z} = \mathbf{X}_1 - \Sigma_{12}\Sigma_{22}^{-1}\mathbf{X}_2 that is independent of X2\mathbf{X}_2.

Step 1: Compute Cov(Z,X2)\text{Cov}(\mathbf{Z}, \mathbf{X}_2):

Cov(Z,X2)=Cov(X1,X2)Σ12Σ221Cov(X2,X2)=Σ12Σ12Σ221Σ22=0\text{Cov}(\mathbf{Z}, \mathbf{X}_2) = \text{Cov}(\mathbf{X}_1, \mathbf{X}_2) - \Sigma_{12}\Sigma_{22}^{-1}\text{Cov}(\mathbf{X}_2, \mathbf{X}_2) = \Sigma_{12} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{22} = 0

Step 2: Since (Z,X2)(\mathbf{Z}, \mathbf{X}_2) is jointly Gaussian (linear transformation) with zero cross-covariance, they are independent.

Step 3: Since Z ⁣ ⁣ ⁣X2\mathbf{Z} \perp\!\!\!\perp \mathbf{X}_2, conditioning on X2=x2\mathbf{X}_2 = \mathbf{x}_2 does not change the distribution of Z\mathbf{Z}:

ZX2=x2N(E[Z],Cov(Z))\mathbf{Z} \mid \mathbf{X}_2 = \mathbf{x}_2 \sim \mathcal{N}(\mathbb{E}[\mathbf{Z}], \text{Cov}(\mathbf{Z}))

where E[Z]=μ1Σ12Σ221μ2\mathbb{E}[\mathbf{Z}] = \boldsymbol{\mu}_1 - \Sigma_{12}\Sigma_{22}^{-1}\boldsymbol{\mu}_2 and Cov(Z)=Σ11Σ12Σ221Σ21\text{Cov}(\mathbf{Z}) = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.

Step 4: Since X1=Z+Σ12Σ221X2\mathbf{X}_1 = \mathbf{Z} + \Sigma_{12}\Sigma_{22}^{-1}\mathbf{X}_2, conditioning on X2=x2\mathbf{X}_2 = \mathbf{x}_2:

X1X2=x2N ⁣(μ1+Σ12Σ221(x2μ2),  Σ11Σ12Σ221Σ21)\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2 \sim \mathcal{N}\!\left(\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2), \; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right)

This completes the derivation. QED

Key insight: The Schur complement Σ11Σ12Σ221Σ21\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} is the covariance of the residual X1X^1\mathbf{X}_1 - \hat{\mathbf{X}}_1 where X^1=μ1+Σ12Σ221(X2μ2)\hat{\mathbf{X}}_1 = \boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{X}_2 - \boldsymbol{\mu}_2) is the best linear unbiased predictor (BLUP) of X1\mathbf{X}_1 from X2\mathbf{X}_2.
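The conditional formula translates directly into code. Below is a minimal numpy sketch; the function name mvn_conditional, the index-list interface, and the small numerical example (which matches Appendix U.1) are our illustrative choices:

```python
import numpy as np

def mvn_conditional(mu, Sigma, idx1, idx2, x2):
    """Parameters of X1 | X2 = x2 for a jointly Gaussian vector.

    mu, Sigma  : mean vector and covariance of the full vector
    idx1, idx2 : index lists defining the partition (X1, X2)
    x2         : observed value of X2
    Returns (conditional mean, conditional covariance).
    """
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    # gain = S12 @ S22^{-1}, computed via a linear solve rather than an inverse
    gain = np.linalg.solve(S22.T, S12.T).T
    cond_mean = mu1 + gain @ (np.asarray(x2) - mu2)
    cond_cov = S11 - gain @ S12.T            # Schur complement
    return cond_mean, cond_cov

# Example: Sigma = [[4, 2], [2, 3]], zero mean, condition on X2 = 1.5.
mu = np.zeros(2)
Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])
m, C = mvn_conditional(mu, Sigma, idx1=[0], idx2=[1], x2=[1.5])
print(m, C)   # mean = (2/3) * 1.5 = 1.0, variance = 8/3
```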


Appendix C: Independence Tests in Practice

Given nn samples from (X,Y)(X, Y), how do we test independence?

C.1 Pearson's Chi-Squared Test (Discrete)

For discrete variables XX and YY, form the contingency table with observed counts OijO_{ij} (number of samples with X=i,Y=jX=i, Y=j). Under independence, expected counts are:

Eij=(row totali)(col totalj)nE_{ij} = \frac{(\text{row total}_i)(\text{col total}_j)}{n}

The test statistic:

χ2=i,j(OijEij)2EijH0χ2((r1)(c1))\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \overset{H_0}{\sim} \chi^2((r-1)(c-1))

where r,cr, c are the numbers of rows and columns. Reject independence at level α\alpha if χ2>χ(r1)(c1),1α2\chi^2 > \chi^2_{(r-1)(c-1), 1-\alpha}.

C.2 Pearson Correlation Test (Continuous, Linear)

For continuous (X,Y)(X,Y), test H0:ρ=0H_0: \rho = 0 vs H1:ρ0H_1: \rho \neq 0. The test statistic:

t=ρ^n21ρ^2H0tn2t = \frac{\hat{\rho}\sqrt{n-2}}{\sqrt{1-\hat{\rho}^2}} \overset{H_0}{\sim} t_{n-2}

Limitations: Only detects linear dependence. Fails for Y=X2Y = X^2 (zero correlation, strong dependence).

C.3 Kolmogorov-Smirnov Test for Independence

Compare the empirical joint CDF to the product of empirical marginals. The test statistic:

Dn=supx,yFn(x,y)FXn(x)FYn(y)D_n = \sup_{x,y} |F_n(x,y) - F_X^n(x) F_Y^n(y)|

Under independence, Dnd0D_n \overset{d}{\to} 0 at rate 1/n1/\sqrt{n}.

C.4 Mutual Information Test

Estimate I^(X;Y)\hat{I}(X;Y) via binned histograms or kernel density estimation. Under independence, 2nI^(X;Y)dχ(r1)(c1)22n\hat{I}(X;Y) \overset{d}{\to} \chi^2_{(r-1)(c-1)}. More powerful than χ2\chi^2 for nonlinear dependence.

For AI model diagnostics: Testing whether two layers' activations are conditionally independent given the input is a key step in mechanistic interpretability. High mutual information between distant layers suggests information bypasses intermediate layers - a sign of skip connections or residual pathways.
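A short scipy sketch of C.1 and C.2 follows; the contingency-table counts and the simulated continuous data are made up purely for illustration:

```python
import numpy as np
from scipy.stats import chi2_contingency, pearsonr

rng = np.random.default_rng(0)

# C.1: chi-squared test on a 2x2 contingency table of observed counts.
counts = np.array([[200, 300],
                   [400, 100]])
chi2_stat, p_value, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2_stat:.1f}, dof = {dof}, p = {p_value:.2e}")

# C.2: Pearson correlation test, and its blind spot for Y = X^2.
x = rng.normal(size=10_000)
r_lin, p_lin = pearsonr(x, 2 * x + rng.normal(size=x.size))   # linear dependence
r_sq, p_sq = pearsonr(x, x**2)                                # nonlinear dependence
print(f"linear:    r = {r_lin:.3f}, p = {p_lin:.1e}")
print(f"quadratic: r = {r_sq:.3f}, p = {p_sq:.2f}   # near zero despite dependence")
```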


Appendix D: Multivariate Change of Variables - Additional Examples

D.1 Bivariate Transformation: Sum and Difference

Let X,YX, Y be independent N(0,1)\mathcal{N}(0,1). Define S=X+YS = X + Y, D=XYD = X - Y.

Jacobian calculation:

(x,y)(s,d)=x/sx/dy/sy/d=1/21/21/21/2=1/2\frac{\partial(x,y)}{\partial(s,d)} = \begin{vmatrix} \partial x/\partial s & \partial x/\partial d \\ \partial y/\partial s & \partial y/\partial d \end{vmatrix} = \begin{vmatrix} 1/2 & 1/2 \\ 1/2 & -1/2 \end{vmatrix} = -1/2

So detJ=1/2|\det J| = 1/2 and:

fS,D(s,d)=fX,Y ⁣(s+d2,sd2)12=12πe(s2+d2)/412=14πe(s2+d2)/4f_{S,D}(s,d) = f_{X,Y}\!\left(\frac{s+d}{2}, \frac{s-d}{2}\right) \cdot \frac{1}{2} = \frac{1}{2\pi} e^{-(s^2+d^2)/4} \cdot \frac{1}{2} = \frac{1}{4\pi} e^{-(s^2+d^2)/4}

This factors as fS(s)fD(d)f_S(s) f_D(d) where S,DN(0,2)S, D \sim \mathcal{N}(0,2) independently.

Takeaway: The sum and difference of independent standard normals are themselves independent normals - a non-obvious result following directly from the change of variables.
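A two-line empirical check of this result (numpy sketch; the sample size is arbitrary):

```python
import numpy as np

# For independent X, Y ~ N(0,1), the sum S = X + Y and difference D = X - Y
# should be uncorrelated Gaussians with variance 2.
rng = np.random.default_rng(1)
x = rng.normal(size=1_000_000)
y = rng.normal(size=1_000_000)
s, d = x + y, x - y
print(np.var(s), np.var(d))        # both ~ 2
print(np.corrcoef(s, d)[0, 1])     # ~ 0; jointly Gaussian, hence independent
```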

D.2 Exponential to Gamma via Convolution

For X1,,XniidExp(λ)X_1, \ldots, X_n \overset{\text{iid}}{\sim} \text{Exp}(\lambda), the sum Sn=X1++XnGamma(n,λ)S_n = X_1 + \cdots + X_n \sim \text{Gamma}(n, \lambda).

Proof by MGF: MSn(t)=i=1nMXi(t)=(λλt)nM_{S_n}(t) = \prod_{i=1}^n M_{X_i}(t) = \left(\frac{\lambda}{\lambda-t}\right)^n which is the MGF of Gamma(n,λ)\text{Gamma}(n,\lambda).

Connection to Poisson process: The time until the nn-th event in a Poisson(λ)(\lambda) process is exactly Gamma(n,λ)\text{Gamma}(n, \lambda) - a direct consequence of inter-arrival times being iid Exp(λ)(\lambda).


Appendix E: Copulas - Detailed Treatment

E.1 Gaussian Copula Construction

The Gaussian copula with correlation matrix RR is constructed as follows:

  1. Generate z=(Z1,,Zd)N(0,R)\mathbf{z} = (Z_1, \ldots, Z_d)^\top \sim \mathcal{N}(\mathbf{0}, R)
  2. Apply the probability integral transform: Ui=Φ(Zi)U_i = \Phi(Z_i) where Φ\Phi is the standard normal CDF
  3. The resulting (U1,,Ud)(U_1, \ldots, U_d) has the Gaussian copula with correlation RR

  4. To generate samples with arbitrary marginals, invert the marginal CDFs: Xi=Fi1(Ui)X_i = F_i^{-1}(U_i)
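A minimal numpy/scipy sketch of steps 1-4, using exponential marginals purely as an example (the correlation value is arbitrary, and the double-argsort trick below is just a quick way to compute ranks for a Spearman-style check):

```python
import numpy as np
from scipy.stats import norm, expon

rng = np.random.default_rng(0)
R = np.array([[1.0, 0.8],
              [0.8, 1.0]])                                    # copula correlation

z = rng.multivariate_normal(mean=np.zeros(2), cov=R, size=100_000)  # step 1
u = norm.cdf(z)                                               # step 2: prob. integral transform
x = expon(scale=1.0).ppf(u)                                   # step 4: invert target marginals

print("marginal means:", x.mean(axis=0))                      # ~ 1.0 each (Exp(1))
ranks0 = np.argsort(np.argsort(x[:, 0]))
ranks1 = np.argsort(np.argsort(x[:, 1]))
print("rank correlation:", np.corrcoef(ranks0, ranks1)[0, 1]) # inherited from the copula
```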

Properties:

  • Captures linear (Pearson) correlation at the copula level
  • Tail dependence: the Gaussian copula has zero tail dependence - extreme events in different variables are asymptotically independent even when there is correlation in the bulk

The Global Financial Crisis (2008): The widespread use of the Gaussian copula for CDO pricing was criticised as a key model failure. The Gaussian copula underestimates joint tail events - the probability that many mortgages default simultaneously during a systemic crisis. The correct model should use a copula with positive tail dependence (e.g., t-copula or Clayton copula).

E.2 Measuring Dependence via Copulas

Kendall's tau: A rank-based correlation measure:

τ=P[(X1X2)(Y1Y2)>0]P[(X1X2)(Y1Y2)<0]\tau = P[(X_1-X_2)(Y_1-Y_2) > 0] - P[(X_1-X_2)(Y_1-Y_2) < 0]

Unlike Pearson ρ\rho, Kendall's τ\tau is invariant to monotone transformations of XX and YY - it is purely a property of the copula.

Spearman's rho: ρS=ρ(rank(X),rank(Y))\rho_S = \rho(\text{rank}(X), \text{rank}(Y)) - also invariant to monotone marginal transformations.

For the Gaussian copula with linear correlation ρ\rho: τ=(2/π)arcsin(ρ)\tau = (2/\pi)\arcsin(\rho) and ρS=(6/π)arcsin(ρ/2)\rho_S = (6/\pi)\arcsin(\rho/2).


Appendix F: Gaussian Mixture Models - EM Algorithm Details

F.1 Complete Data Log-Likelihood

If we observed both xi\mathbf{x}_i and latent assignments ziz_i, the complete data log-likelihood is:

c(θ)=i=1nk=1K1[zi=k][logπk+logN(xi;μk,Σk)]\ell_c(\theta) = \sum_{i=1}^n \sum_{k=1}^K \mathbf{1}[z_i=k] \left[\log\pi_k + \log\mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_k, \Sigma_k)\right]

This is easy to maximise - just compute cluster-conditional MLEs.

F.2 E-Step: Soft Assignments

Since ziz_i is latent, replace 1[zi=k]\mathbf{1}[z_i=k] by its posterior expectation:

rik=E[1[zi=k]xi,θ]=P(zi=kxi)=πkN(xi;μk,Σk)jπjN(xi;μj,Σj)r_{ik} = \mathbb{E}[\mathbf{1}[z_i=k] \mid \mathbf{x}_i, \theta] = P(z_i=k \mid \mathbf{x}_i) = \frac{\pi_k \mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_j, \Sigma_j)}

This is Bayes' theorem applied to the joint distribution over (zi,xi)(z_i, \mathbf{x}_i).

F.3 M-Step: Update Parameters

Maximise the expected complete data log-likelihood:

πknew=irikn,μknew=irikxiirik,Σknew=irik(xiμknew)(xiμknew)irik\pi_k^{\text{new}} = \frac{\sum_i r_{ik}}{n}, \quad \boldsymbol{\mu}_k^{\text{new}} = \frac{\sum_i r_{ik} \mathbf{x}_i}{\sum_i r_{ik}}, \quad \Sigma_k^{\text{new}} = \frac{\sum_i r_{ik} (\mathbf{x}_i - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}_i - \boldsymbol{\mu}_k^{\text{new}})^\top}{\sum_i r_{ik}}

Each M-step is a weighted MLE - the soft assignments rikr_{ik} play the role of fractional cluster memberships.

F.4 Convergence

The EM algorithm monotonically increases the marginal log-likelihood (θ)=ilogkπkN(xi;μk,Σk)\ell(\theta) = \sum_i \log \sum_k \pi_k \mathcal{N}(\mathbf{x}_i; \boldsymbol{\mu}_k, \Sigma_k) at each iteration. It converges to a local maximum (not necessarily global). In practice:

  • Random restarts help escape poor local optima
  • K-means++ initialisation provides a good starting point
  • The BIC/AIC criteria are used to select the number of components KK
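Putting the E-step, M-step, and the practical points above together, here is a minimal illustrative EM sketch in numpy/scipy. The function name em_gmm, the jitter term, and the crude initialisation are our choices; a production implementation would work in log space, check convergence, and use k-means++ initialisation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Minimal EM for a Gaussian mixture (soft assignments + weighted MLEs)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)]              # crude initialisation
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * K)

    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] = P(z_i = k | x_i)
        dens = np.column_stack([
            pi[k] * multivariate_normal(mu[k], Sigma[k]).pdf(X) for k in range(K)
        ])
        r = dens / dens.sum(axis=1, keepdims=True)

        # M-step: weighted MLEs
        Nk = r.sum(axis=0)
        pi = Nk / n
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma

# Toy usage: two well-separated 2D clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(5.0, 1.0, size=(200, 2))])
pi, mu, Sigma = em_gmm(X, K=2)
print(pi)
print(mu)
```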

Appendix G: The Reparameterisation Trick - Formal Derivation

G.1 The Problem

To train a VAE, we need:

ϕEqϕ(zx)[f(z)]=ϕqϕ(zx)f(z)dz\nabla_\phi \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}[f(\mathbf{z})] = \nabla_\phi \int q_\phi(\mathbf{z} \mid \mathbf{x}) f(\mathbf{z}) \, d\mathbf{z}

The naive approach - swapping gradient and expectation - fails because the distribution qϕq_\phi depends on ϕ\phi:

ϕqϕ(zx)f(z)dzEqϕ[ϕf(z)]\int \nabla_\phi q_\phi(\mathbf{z} \mid \mathbf{x}) f(\mathbf{z}) \, d\mathbf{z} \neq \mathbb{E}_{q_\phi}[\nabla_\phi f(\mathbf{z})]

G.2 The Reparameterisation

Write z=gϕ(ε,x)\mathbf{z} = g_\phi(\boldsymbol{\varepsilon}, \mathbf{x}) where εp(ε)\boldsymbol{\varepsilon} \sim p(\boldsymbol{\varepsilon}) does not depend on ϕ\phi. For the diagonal Gaussian encoder:

z=μϕ(x)+σϕ(x)ε,εN(0,I)\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, I)

Then:

Eqϕ(zx)[f(z)]=Ep(ε)[f(gϕ(ε,x))]\mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}[f(\mathbf{z})] = \mathbb{E}_{p(\boldsymbol{\varepsilon})}[f(g_\phi(\boldsymbol{\varepsilon}, \mathbf{x}))]

Now the expectation is over p(ε)p(\boldsymbol{\varepsilon}) which does NOT depend on ϕ\phi, so:

ϕEqϕ(zx)[f(z)]=Ep(ε) ⁣[ϕf(gϕ(ε,x))]\nabla_\phi \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}[f(\mathbf{z})] = \mathbb{E}_{p(\boldsymbol{\varepsilon})}\!\left[\nabla_\phi f(g_\phi(\boldsymbol{\varepsilon}, \mathbf{x}))\right]

This gradient can be estimated by Monte Carlo: sample ε(1),,ε(L)\boldsymbol{\varepsilon}^{(1)}, \ldots, \boldsymbol{\varepsilon}^{(L)} and compute 1Llϕf(gϕ(ε(l),x))\frac{1}{L}\sum_l \nabla_\phi f(g_\phi(\boldsymbol{\varepsilon}^{(l)}, \mathbf{x})). In practice, L=1L=1 often suffices.
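A hand-computed numerical check of this identity for a toy case with a known answer (numpy sketch; in practice an autodiff framework computes the pathwise gradient for you):

```python
import numpy as np

# For f(z) = z^2 and q = N(mu, sigma^2): E_q[f(z)] = mu^2 + sigma^2,
# so the exact gradients are d/dmu = 2*mu and d/dsigma = 2*sigma.
rng = np.random.default_rng(0)
mu, sigma = 0.5, 1.3
eps = rng.normal(size=1_000_000)
z = mu + sigma * eps                       # z = g(phi, eps)

grad_mu = np.mean(2 * z * 1.0)             # df/dz * dz/dmu, with dz/dmu = 1
grad_sigma = np.mean(2 * z * eps)          # df/dz * dz/dsigma, with dz/dsigma = eps

print(grad_mu, "vs exact", 2 * mu)         # ~ 1.0
print(grad_sigma, "vs exact", 2 * sigma)   # ~ 2.6
```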

G.3 Why the Affine Transformation Property is Key

The reparameterisation z=μ+Lε\mathbf{z} = \boldsymbol{\mu} + L\boldsymbol{\varepsilon} works because:

  1. zN(μ,LL)\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, LL^\top) by the affine transformation property (Section 6.5)
  2. The Jacobian of the transformation is LL - invertible and differentiable
  3. Gradients flow through μ\boldsymbol{\mu} and LL as if they were deterministic variables

This trick generalises beyond Gaussians: any distribution expressible as a deterministic transformation of a fixed noise distribution admits reparameterisation. The Gumbel-softmax trick (Jang et al., 2017) extends this to Categorical distributions.


Appendix H: Common Distributions as Joint Distributions

Many standard distributions arise naturally as marginals of joint distributions over multiple variables.

H.1 Chi-Squared as a Sum of Squared Normals

If Z1,,ZkiidN(0,1)Z_1, \ldots, Z_k \overset{\text{iid}}{\sim} \mathcal{N}(0,1), then Q=i=1kZi2χ2(k)Q = \sum_{i=1}^k Z_i^2 \sim \chi^2(k).

Proof via MGF: MQ(t)=i=1kMZi2(t)M_Q(t) = \prod_{i=1}^k M_{Z_i^2}(t). For Z2Z^2 where ZN(0,1)Z \sim \mathcal{N}(0,1):

MZ2(t)=E[etZ2]=112t,t<1/2M_{Z^2}(t) = \mathbb{E}[e^{tZ^2}] = \frac{1}{\sqrt{1-2t}}, \quad t < 1/2

So MQ(t)=(12t)k/2M_Q(t) = (1-2t)^{-k/2} - the MGF of Gamma(k/2,1/2)=χ2(k)\text{Gamma}(k/2, 1/2) = \chi^2(k).
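An empirical check of this result (numpy/scipy sketch; the degrees of freedom and sample size are arbitrary):

```python
import numpy as np
from scipy.stats import chi2, kstest

# A sum of k squared standard normals should follow chi^2(k).
rng = np.random.default_rng(0)
k, n = 5, 200_000
Q = (rng.normal(size=(n, k)) ** 2).sum(axis=1)

print("sample mean:", Q.mean(), " theory:", k)        # E[chi2(k)] = k
print("sample var :", Q.var(),  " theory:", 2 * k)    # Var[chi2(k)] = 2k
# p-value should look like a draw from Uniform(0,1): no evidence against chi2(k)
print(kstest(Q, chi2(df=k).cdf))
```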

For AI: The chi-squared distribution appears in:

  • Wald tests for neural network weights (are weights significantly different from zero?)
  • Goodness-of-fit tests for probability distributions learned by generative models
  • The χ2\chi^2 independence test for contingency tables (Appendix C)

H.2 Student-t as Ratio of Normal to Chi-Squared

If ZN(0,1)Z \sim \mathcal{N}(0,1) and Vχ2(ν)V \sim \chi^2(\nu) independently, then:

T=ZV/νtνT = \frac{Z}{\sqrt{V/\nu}} \sim t_\nu

Intuition: The Student-t arises when we estimate the variance from data (replacing the known σ2\sigma^2 with the sample variance s^2\hat{s}^2). The heavier tails reflect additional uncertainty from estimating σ2\sigma^2.

For AI: The t-SNE visualisation algorithm uses the Student-t distribution (with ν=1\nu=1, i.e., Cauchy) in the low-dimensional space to model cluster structure. The heavier tails push apart nearby clusters more aggressively than a Gaussian, creating cleaner cluster separation in 2D.

H.3 F-Distribution as Ratio of Chi-Squareds

If V1χ2(d1)V_1 \sim \chi^2(d_1) and V2χ2(d2)V_2 \sim \chi^2(d_2) independently, then:

F=V1/d1V2/d2F(d1,d2)F = \frac{V_1/d_1}{V_2/d_2} \sim F(d_1, d_2)

Used in ANOVA and comparing variances between groups.


Appendix I: Expectation and Covariance - Key Identities

This appendix collects identities used throughout the section. Full derivations are in Section6.4.

I.1 Law of Total Expectation

E[X]=EY[E[XY]]\mathbb{E}[X] = \mathbb{E}_Y[\mathbb{E}[X \mid Y]]

Application: E[X]=EZ[E[XZ]]=kP(Z=k)μk\mathbb{E}[\mathbf{X}] = \mathbb{E}_Z[\mathbb{E}[\mathbf{X} \mid Z]] = \sum_k P(Z=k) \boldsymbol{\mu}_k for a GMM - the overall mean is the mixture-weighted average of component means.

I.2 Law of Total Variance

Var(X)=E[Var(XY)]+Var(E[XY])\text{Var}(X) = \mathbb{E}[\text{Var}(X \mid Y)] + \text{Var}(\mathbb{E}[X \mid Y])

Decomposition: Total variance = average within-group variance + between-group variance (from ANOVA). For a GMM: Var(X)=kπkΣk+kπk(μkμ)(μkμ)\text{Var}(\mathbf{X}) = \sum_k \pi_k \Sigma_k + \sum_k \pi_k (\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^\top where μ=kπkμk\boldsymbol{\mu} = \sum_k \pi_k \boldsymbol{\mu}_k.

I.3 Bilinearity of Covariance for Vectors

For random vectors X\mathbf{X} and Y\mathbf{Y} and matrices A,BA, B:

Cov(AX,BY)=ACov(X,Y)B\text{Cov}(A\mathbf{X}, B\mathbf{Y}) = A \, \text{Cov}(\mathbf{X}, \mathbf{Y}) \, B^\top

Special case: Cov(AX)=AΣXA\text{Cov}(A\mathbf{X}) = A \Sigma_X A^\top - the covariance under a linear transformation.

I.4 Independence and Zero Covariance for MVN

For jointly Gaussian random variables: Cov(Xi,Xj)=0Xi ⁣ ⁣ ⁣Xj\text{Cov}(X_i, X_j) = 0 \Leftrightarrow X_i \perp\!\!\!\perp X_j.

This is unique to the Gaussian - in general, zero covariance is weaker than independence.


Appendix J: Graphical Models - Overview

A Bayesian network (directed graphical model) is a directed acyclic graph (DAG) where:

  • Nodes = random variables
  • Edges = direct dependencies
  • Each node XiX_i has a conditional distribution p(Xiparents(Xi))p(X_i \mid \text{parents}(X_i))

The joint distribution factorises as:

p(x1,,xn)=i=1np(xipa(xi))p(x_1, \ldots, x_n) = \prod_{i=1}^n p(x_i \mid \text{pa}(x_i))

This is the chain rule of probability, with the graph structure determining which conditioning variables appear in each factor.

Conditional independence from the graph: X ⁣ ⁣ ⁣YZX \perp\!\!\!\perp Y \mid Z in the joint distribution iff XX and YY are d-separated by ZZ in the DAG - all paths from XX to YY are "blocked" by ZZ.

COMMON GRAPHICAL MODEL PATTERNS
========================================================================

  Naive Bayes:           Y -> X_1, X_2, X_3
                         (features conditionally independent given the class Y)

  LDA:                   alpha -> theta -> z_i -> w_i <- beta
                         (words drawn from per-document topic mixtures)

  Hidden Markov Model:   z_1 -> z_2 -> z_3 -> ...,  with  z_t -> x_t
                         (observations conditionally independent given hidden states)

  VAE:                   z ~ p(z),  x ~ p(x | z);  encoder q(z | x) approximates the posterior

  Kalman Filter:         z_{t-1} -> z_t -> z_{t+1},  with  z_t -> x_t
                         (linear-Gaussian hidden state dynamics)

========================================================================

For AI: Modern large language models can be understood as extremely deep Bayesian networks where each token's probability is conditioned on all previous tokens. The attention pattern (which tokens attend to which) defines an implicit graphical structure that changes with each input.


Appendix K: Numerical Stability in MVN Computations

K.1 Cholesky Decomposition for Sampling

Never invert Σ\Sigma directly for sampling. Instead:

  1. Compute the Cholesky factor LL where Σ=LL\Sigma = LL^\top (stable, O(n3)O(n^3))
  2. Sample εN(0,I)\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, I)
  3. Return x=μ+Lε\mathbf{x} = \boldsymbol{\mu} + L\boldsymbol{\varepsilon}

Direct inversion Σ1\Sigma^{-1} amplifies numerical errors; Cholesky is numerically stable.

K.2 Log-Determinant for Density Evaluation

Computing logΣ\log|\Sigma| naively via log(det(Sigma)) underflows or overflows for large nn, because the determinant itself can be astronomically small or large. Instead:

  1. Compute Cholesky Σ=LL\Sigma = LL^\top
  2. logΣ=2ilogLii\log|\Sigma| = 2\sum_i \log L_{ii} (sum of log-diagonal elements of LL)

This is numerically stable because Lii>0L_{ii} > 0 and we sum logs rather than multiply small numbers.

K.3 Mahalanobis Distance

The Mahalanobis distance d2=(xμ)Σ1(xμ)d^2 = (\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}) should be computed as:

  1. Solve Lv=(xμ)L\mathbf{v} = (\mathbf{x} - \boldsymbol{\mu}) via forward substitution
  2. d2=v2d^2 = \|\mathbf{v}\|^2

This avoids forming Σ1\Sigma^{-1} explicitly.

K.4 Conditioning on Nearly Degenerate Σ22\Sigma_{22}

When Σ22\Sigma_{22} is nearly singular (nearly degenerate observations), add a small diagonal regulariser: Σ22+ϵI\Sigma_{22} + \epsilon I for ϵ106\epsilon \sim 10^{-6}, known as jitter in the Gaussian process literature. The resulting conditional covariance is Σ11Σ12(Σ22+ϵI)1Σ21\Sigma_{11} - \Sigma_{12}(\Sigma_{22} + \epsilon I)^{-1}\Sigma_{21}, which remains PSD.
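A compact numpy/scipy sketch tying K.1-K.4 together; the function name, the jitter value, and the example covariance are illustrative:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def mvn_sample_logpdf(mu, Sigma, x, jitter=1e-6, seed=0):
    """Cholesky-based sampling, Mahalanobis distance, and log-density."""
    d = len(mu)
    L = cholesky(Sigma + jitter * np.eye(d), lower=True)   # Sigma ~ L L^T  (K.4 jitter)

    # K.1: sampling without inverting Sigma
    rng = np.random.default_rng(seed)
    sample = mu + L @ rng.normal(size=d)

    # K.3: Mahalanobis distance via a triangular solve, no explicit inverse
    v = solve_triangular(L, x - mu, lower=True)
    maha2 = v @ v

    # K.2: log-determinant from the Cholesky diagonal
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    logpdf = -0.5 * (maha2 + logdet + d * np.log(2 * np.pi))
    return sample, maha2, logpdf

mu = np.zeros(2)
Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])
print(mvn_sample_logpdf(mu, Sigma, x=np.array([1.0, -1.0])))
```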


Appendix L: Summary Tables

L.1 Key Joint Distribution Formulas

| Quantity | Discrete | Continuous |
| --- | --- | --- |
| Joint | p(x,y) | f(x,y) |
| Marginal of X | \sum_y p(x,y) | \int f(x,y)\,dy |
| Conditional Y \mid X = x | p(x,y)/p_X(x) | f(x,y)/f_X(x) |
| Independence condition | p(x,y) = p_X(x)p_Y(y) | f(x,y) = f_X(x)f_Y(y) |
| \text{Cov}(X,Y) | \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] | Same |
| Bayes | p(x \mid y) \propto p(y \mid x)\,p_X(x) | Same |

L.2 Multivariate Gaussian Cheat Sheet

| Property | Formula |
| --- | --- |
| Density | (2\pi)^{-n/2}\lvert\Sigma\rvert^{-1/2}\exp(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})) |
| Marginal \mathbf{X}_1 | \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_{11}) |
| Conditional \mathbf{X}_1 \mid \mathbf{X}_2 | \mathcal{N}(\boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2-\boldsymbol{\mu}_2),\ \Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}) |
| Affine transform A\mathbf{X}+\mathbf{b} | \mathcal{N}(A\boldsymbol{\mu}+\mathbf{b}, A\Sigma A^\top) |
| Entropy | \frac{1}{2}\log((2\pi e)^n \lvert\Sigma\rvert) |
| \text{KL}(p \Vert q) for Gaussians | \frac{1}{2}[\text{tr}(\Sigma_q^{-1}\Sigma_p) + (\boldsymbol{\mu}_q-\boldsymbol{\mu}_p)^\top\Sigma_q^{-1}(\boldsymbol{\mu}_q-\boldsymbol{\mu}_p) - n + \log\frac{\lvert\Sigma_q\rvert}{\lvert\Sigma_p\rvert}] |

L.3 Independence: A Hierarchy

Mutual independence
       v (implies; converse fails in general)
Pairwise independence
       v (implies; converse fails in general)
Zero covariance

For the multivariate Gaussian, uncorrelated = pairwise independent = mutually independent (if the full vector is jointly Gaussian).


Appendix M: Detailed Exercises with Full Solutions

M.1 Computing the Bivariate Normal Conditional - Full Worked Example

Problem: (X,Y)N ⁣((13),(4229))(X,Y) \sim \mathcal{N}\!\left(\begin{pmatrix}1\\3\end{pmatrix}, \begin{pmatrix}4 & 2\\ 2 & 9\end{pmatrix}\right). Find YX=2Y \mid X = 2.

Step 1: Identify parameters. μX=1\mu_X = 1, μY=3\mu_Y = 3, σX2=4\sigma_X^2 = 4, σY2=9\sigma_Y^2 = 9, σXY=2\sigma_{XY} = 2.

Step 2: Apply the conditional formula.

μYX=2=μY+σXYσX2(xμX)=3+24(21)=3.5\mu_{Y|X=2} = \mu_Y + \sigma_{XY}\sigma_X^{-2}(x - \mu_X) = 3 + \frac{2}{4}(2 - 1) = 3.5 σYX2=σY2σXY2σX2=944=8\sigma^2_{Y|X} = \sigma_Y^2 - \sigma_{XY}^2\sigma_X^{-2} = 9 - \frac{4}{4} = 8

Answer: YX=2N(3.5,8)Y \mid X = 2 \sim \mathcal{N}(3.5, 8).

Verification: The conditional variance 8 < 9 = marginal variance - conditioning reduced uncertainty, as expected.

M.2 Independence Test via Joint Table - Full Example

Problem: Given the joint PMF table, test independence.

|  | Y=0 | Y=1 | p_X |
| --- | --- | --- | --- |
| X=0 | 0.2 | 0.3 | 0.5 |
| X=1 | 0.4 | 0.1 | 0.5 |
| p_Y | 0.6 | 0.4 | 1.0 |

Independence check: For independence, we need p(x,y)=pX(x)pY(y)p(x,y) = p_X(x) \cdot p_Y(y) for all (x,y)(x,y).

Check (X=0,Y=0)(X=0, Y=0): pX(0)pY(0)=0.5×0.6=0.300.20p_X(0) \cdot p_Y(0) = 0.5 \times 0.6 = 0.30 \neq 0.20. FAIL.

The variables are NOT independent. In fact:

p(Y=0X=0)=0.2/0.5=0.40.6=p(Y=0)p(Y=0 \mid X=0) = 0.2/0.5 = 0.4 \neq 0.6 = p(Y=0)

Knowing X=0X=0 decreases our probability estimate for Y=0Y=0 - the variables are negatively correlated.

Covariance calculation:

E[XY]=000.2+010.3+100.4+110.1=0.1\mathbb{E}[XY] = 0 \cdot 0 \cdot 0.2 + 0 \cdot 1 \cdot 0.3 + 1 \cdot 0 \cdot 0.4 + 1 \cdot 1 \cdot 0.1 = 0.1 E[X]=0.5,E[Y]=0.4\mathbb{E}[X] = 0.5, \quad \mathbb{E}[Y] = 0.4 Cov(X,Y)=0.10.5×0.4=0.1<0\text{Cov}(X,Y) = 0.1 - 0.5 \times 0.4 = -0.1 < 0

Negative covariance confirms negative association.

M.3 Change of Variables - Log-Normal Full Derivation

Problem: XN(μ,σ2)X \sim \mathcal{N}(\mu, \sigma^2). Find the density of Y=eXY = e^X.

Step 1: Find the inverse: x=g1(y)=logyx = g^{-1}(y) = \log y for y>0y > 0.

Step 2: Compute the Jacobian: dxdy=1y\frac{dx}{dy} = \frac{1}{y}.

Step 3: Apply the formula:

f_Y(y) = f_X(g^{-1}(y)) \cdot |g^{-1}'(y)| = \frac{1}{\sigma\sqrt{2\pi}} e^{-(\log y - \mu)^2/(2\sigma^2)} \cdot \frac{1}{y}

Moments of the log-normal:

E[Y]=eμ+σ2/2,Var(Y)=(eσ21)e2μ+σ2\mathbb{E}[Y] = e^{\mu + \sigma^2/2}, \quad \text{Var}(Y) = (e^{\sigma^2}-1)e^{2\mu+\sigma^2}

For AI: Model parameter magnitudes after training often follow approximately log-normal distributions. If WLog-NormalW \sim \text{Log-Normal}, then logWN\log|W| \sim \mathcal{N} - transforming to log-scale makes analysis easier. Weight pruning thresholds are often set based on the log-normal quantiles.


Appendix N: The Gaussian Process as a Joint Distribution

A Gaussian process (GP) is a joint distribution over function values at infinitely many input points - the continuous generalisation of the MVN.

Definition: fGP(m(),k(,))f \sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot)) if for any finite collection of inputs x1,,xn\mathbf{x}_1, \ldots, \mathbf{x}_n:

(f(x1),,f(xn))N(m,K)(f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)) \sim \mathcal{N}(\mathbf{m}, K)

where mi=m(xi)m_i = m(\mathbf{x}_i) and Kij=k(xi,xj)K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) is the kernel (covariance function).

GP regression uses the conditional MVN formula (Section 6.4) to make predictions. Given observations y=f(X)+ε\mathbf{y} = f(\mathbf{X}) + \boldsymbol{\varepsilon} with noise εN(0,σ2I)\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 I), the predictive distribution at new points X\mathbf{X}_* is:

fyN(μ,Σ)f_* \mid \mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*)

where:

μ=K(X,X)[K(X,X)+σ2I]1y\boldsymbol{\mu}_* = K(\mathbf{X}_*, \mathbf{X})[K(\mathbf{X},\mathbf{X}) + \sigma^2 I]^{-1}\mathbf{y} Σ=K(X,X)K(X,X)[K(X,X)+σ2I]1K(X,X)\Sigma_* = K(\mathbf{X}_*,\mathbf{X}_*) - K(\mathbf{X}_*,\mathbf{X})[K(\mathbf{X},\mathbf{X}) + \sigma^2 I]^{-1}K(\mathbf{X},\mathbf{X}_*)

This is exactly the Schur complement conditional formula applied to the GP's joint distribution over observed and unobserved function values.
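A minimal numpy sketch of these predictive equations; the kernel choice, the helper names rbf_kernel and gp_posterior, and the toy data are our illustrative assumptions:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, x') = variance * exp(-|x - x'|^2 / (2 l^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def gp_posterior(X, y, X_star, noise=0.1):
    """GP predictive mean and covariance via the conditional MVN formula above."""
    Knn = rbf_kernel(X, X) + noise**2 * np.eye(len(X))
    Ksn = rbf_kernel(X_star, X)
    Kss = rbf_kernel(X_star, X_star)
    alpha = np.linalg.solve(Knn, y)                  # (K_nn + sigma^2 I)^{-1} y
    mean = Ksn @ alpha
    cov = Kss - Ksn @ np.linalg.solve(Knn, Ksn.T)    # Schur complement
    return mean, cov

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)
X_star = np.linspace(-3, 3, 5)[:, None]
mean, cov = gp_posterior(X, y, X_star)
print(mean)                      # roughly tracks sin(X_star)
print(np.sqrt(np.diag(cov)))     # predictive standard deviations
```

In practice one would factorise K_nn + sigma^2 I with a Cholesky decomposition (Appendix K) instead of calling solve twice.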

For AI: GPs are used for Bayesian hyperparameter optimisation (BoTorch, GPyTorch) in training large models. The acquisition function (UCB, EI) is computed analytically from the GP's mean and variance - enabled by the closed-form MVN conditional.


Appendix O: Information Theory and Joint Distributions

O.1 Joint Entropy

For jointly distributed (X,Y)(X,Y):

H(X,Y)=x,yp(x,y)logp(x,y)(discrete)H(X,Y) = -\sum_{x,y} p(x,y) \log p(x,y) \quad \text{(discrete)} =f(x,y)logf(x,y)dxdy(continuous)= -\int f(x,y) \log f(x,y) \, dx\,dy \quad \text{(continuous)}

Chain rule of entropy: H(X,Y)=H(X)+H(YX)=H(Y)+H(XY)H(X,Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)

where H(YX)=EX[H(YX=x)]H(Y \mid X) = \mathbb{E}_X[H(Y \mid X=x)] is the conditional entropy.

O.2 Mutual Information

The mutual information measures the information shared between XX and YY:

I(X;Y)=H(X)H(XY)=H(Y)H(YX)=H(X)+H(Y)H(X,Y)I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X,Y)

Key properties:

  • I(X;Y)0I(X;Y) \geq 0 (non-negativity)
  • I(X;Y)=0X ⁣ ⁣ ⁣YI(X;Y) = 0 \Leftrightarrow X \perp\!\!\!\perp Y (zero iff independent)
  • I(X;Y)=DKL(p(X,Y)p(X)p(Y))I(X;Y) = D_{\text{KL}}(p(X,Y) \| p(X)p(Y)) - KL divergence from joint to product of marginals

For AI: Mutual information is used in:

  • Feature selection: keep features with high I(feature;label)I(\text{feature}; \text{label})
  • Information bottleneck theory: compress representations to maximise I(Z;Y)I(\mathbf{Z}; Y) while minimising I(X;Z)I(\mathbf{X}; \mathbf{Z})
  • Estimating how much a layer's representation retains information about the input vs the task

O.3 KL Divergence Between Multivariate Gaussians

For p=N(μ1,Σ1)p = \mathcal{N}(\boldsymbol{\mu}_1, \Sigma_1) and q=N(μ2,Σ2)q = \mathcal{N}(\boldsymbol{\mu}_2, \Sigma_2):

DKL(pq)=12[tr(Σ21Σ1)+(μ2μ1)Σ21(μ2μ1)n+logΣ2Σ1]D_{\text{KL}}(p \| q) = \frac{1}{2}\left[\text{tr}(\Sigma_2^{-1}\Sigma_1) + (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)^\top \Sigma_2^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) - n + \log\frac{|\Sigma_2|}{|\Sigma_1|}\right]

VAE special case (p=qϕ(zx)=N(μ,diag(σ2))p = q_\phi(\mathbf{z}\mid\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2)), q=N(0,I)q = \mathcal{N}(\mathbf{0},I)):

DKL(pq)=12d(μd2+σd2logσd21)D_{\text{KL}}(p \| q) = \frac{1}{2}\sum_d (\mu_d^2 + \sigma_d^2 - \log\sigma_d^2 - 1)
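Both formulas are easy to implement and to cross-check against each other. A minimal numpy sketch (function names are ours; for high dimensions one would use slogdet or a Cholesky factor as in Appendix K rather than det directly):

```python
import numpy as np

def kl_mvn(mu1, Sigma1, mu2, Sigma2):
    """D_KL( N(mu1, Sigma1) || N(mu2, Sigma2) ), following the formula above."""
    n = len(mu1)
    Sigma2_inv = np.linalg.inv(Sigma2)
    diff = mu2 - mu1
    return 0.5 * (np.trace(Sigma2_inv @ Sigma1)
                  + diff @ Sigma2_inv @ diff
                  - n
                  + np.log(np.linalg.det(Sigma2) / np.linalg.det(Sigma1)))

def kl_diag_gauss_to_std_normal(mu, sigma):
    """Closed-form VAE regulariser: D_KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

mu = np.array([0.3, -0.2])
sigma = np.array([0.9, 1.2])
print(kl_diag_gauss_to_std_normal(mu, sigma))
print(kl_mvn(mu, np.diag(sigma**2), np.zeros(2), np.eye(2)))   # identical value
```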

Appendix P: Notation Reference for This Section

| Symbol | Meaning |
| --- | --- |
| p_{X,Y}(x,y) | Joint PMF of discrete (X,Y) |
| f_{X,Y}(x,y) | Joint PDF of continuous (X,Y) |
| F_{X,Y}(x,y) | Joint CDF |
| p_X(x), f_X(x) | Marginal PMF/PDF of X |
| p_{Y\mid X}(y\mid x) | Conditional PMF of Y given X=x |
| X \perp\!\!\!\perp Y | X and Y are (marginally) independent |
| X \perp\!\!\!\perp Y \mid Z | X and Y are conditionally independent given Z |
| \boldsymbol{\mu} | Mean vector (bold lowercase) |
| \Sigma | Covariance matrix (uppercase sigma) |
| \Lambda = \Sigma^{-1} | Precision matrix |
| \Sigma_{12} | Off-diagonal block of covariance matrix |
| \Sigma_{1\mid 2} | Conditional covariance (Schur complement) |
| \mathcal{N}(\boldsymbol{\mu}, \Sigma) | Multivariate Gaussian |
| \text{Cov}(X,Y) | Covariance of X and Y |
| \rho(X,Y) | Pearson correlation coefficient |
| I(X;Y) | Mutual information |
| H(X,Y) | Joint entropy |
| H(Y\mid X) | Conditional entropy |

Last updated: 2026 - covers Section6.3 Joint Distributions as scoped by the Chapter README.


Appendix Q: Proofs of Core Independence Results

Q.1 Proof that Independence Implies Zero Covariance

Theorem: If X ⁣ ⁣ ⁣YX \perp\!\!\!\perp Y, then Cov(X,Y)=0\text{Cov}(X,Y) = 0.

Proof:

Cov(X,Y)=E[XY]E[X]E[Y]\text{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]

When X ⁣ ⁣ ⁣YX \perp\!\!\!\perp Y, the joint density factorises: f(x,y)=fX(x)fY(y)f(x,y) = f_X(x)f_Y(y). Then:

E[XY]=xyfX(x)fY(y)dxdy=(xfX(x)dx)(yfY(y)dy)=E[X]E[Y]\mathbb{E}[XY] = \int\int xy \cdot f_X(x)f_Y(y)\,dx\,dy = \left(\int x f_X(x)\,dx\right)\left(\int y f_Y(y)\,dy\right) = \mathbb{E}[X]\mathbb{E}[Y]

Therefore Cov(X,Y)=E[X]E[Y]E[X]E[Y]=0\text{Cov}(X,Y) = \mathbb{E}[X]\mathbb{E}[Y] - \mathbb{E}[X]\mathbb{E}[Y] = 0. QED

Why the converse fails: Zero covariance says E[XY]=E[X]E[Y]\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y] - a single numerical condition. Independence says p(x,y)=pX(x)pY(y)p(x,y) = p_X(x)p_Y(y) for ALL (x,y)(x,y) - infinitely many conditions. A single number cannot encode the information required for independence.

Q.2 Proof that MVN Uncorrelated Implies Independent

Theorem: For jointly Gaussian (X1,,Xn)N(μ,Σ)(X_1, \ldots, X_n) \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma): Cov(Xi,Xj)=0\text{Cov}(X_i, X_j) = 0 for all iji \neq j implies (X1,,Xn)(X_1, \ldots, X_n) are mutually independent.

Proof: If Σ=diag(σ12,,σn2)\Sigma = \text{diag}(\sigma_1^2, \ldots, \sigma_n^2) (all off-diagonal entries zero), then:

f(x)=1(2π)n/2Σ1/2e12(xμ)Σ1(xμ)f(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})}

With diagonal Σ\Sigma: Σ1=diag(1/σ12,,1/σn2)\Sigma^{-1} = \text{diag}(1/\sigma_1^2, \ldots, 1/\sigma_n^2) and Σ=iσi2|\Sigma| = \prod_i \sigma_i^2. So:

f(x)=i=1n1σi2πe(xiμi)2/(2σi2)=i=1nfXi(xi)f(\mathbf{x}) = \prod_{i=1}^n \frac{1}{\sigma_i\sqrt{2\pi}} e^{-(x_i-\mu_i)^2/(2\sigma_i^2)} = \prod_{i=1}^n f_{X_i}(x_i)

The joint density factorises as a product of marginals - mutual independence. QED

Q.3 Proof of the Cauchy-Schwarz Inequality for Correlation

Theorem: Cov(X,Y)2Var(X)Var(Y)|\text{Cov}(X,Y)|^2 \leq \text{Var}(X)\text{Var}(Y), implying ρ(X,Y)1|\rho(X,Y)| \leq 1.

Proof: For any tRt \in \mathbb{R}:

0E[(XE[X]+t(YE[Y]))2]=Var(X)+2tCov(X,Y)+t2Var(Y)0 \leq \mathbb{E}[(X - \mathbb{E}[X] + t(Y - \mathbb{E}[Y]))^2] = \text{Var}(X) + 2t\text{Cov}(X,Y) + t^2\text{Var}(Y)

This is a non-negative quadratic in tt, so its discriminant is non-positive:

4Cov(X,Y)24Var(X)Var(Y)04\text{Cov}(X,Y)^2 - 4\text{Var}(X)\text{Var}(Y) \leq 0

Therefore Cov(X,Y)2Var(X)Var(Y)\text{Cov}(X,Y)^2 \leq \text{Var}(X)\text{Var}(Y), i.e. ρ(X,Y)1|\rho(X,Y)| \leq 1. QED

Equality: ρ=1|\rho| = 1 iff YE[Y]=t(XE[X])Y - \mathbb{E}[Y] = t(X - \mathbb{E}[X]) almost surely - i.e., YY is a linear function of XX.


Appendix R: Continuous Bayes - Integration Examples

R.1 Beta-Bernoulli Conjugacy via Integration

Setup: Prior pBeta(α,β)p \sim \text{Beta}(\alpha, \beta), likelihood XpBernoulli(p)X \mid p \sim \text{Bernoulli}(p).

Joint distribution:

f(p,x)=px(1p)1xpα1(1p)β1B(α,β)f(p, x) = p^x(1-p)^{1-x} \cdot \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha,\beta)}

Marginal of X=1X=1:

P(X=1)=01ppα1(1p)β1B(α,β)dp=B(α+1,β)B(α,β)=αα+βP(X=1) = \int_0^1 p \cdot \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha,\beta)}\,dp = \frac{B(\alpha+1,\beta)}{B(\alpha,\beta)} = \frac{\alpha}{\alpha+\beta}

This is the prior predictive probability - integrating out the unknown pp.

Posterior: Applying Bayes' theorem for continuous pp:

f(pX=1)=ppα1(1p)β1B(α,β)P(X=1)=pα(1p)β1B(α+1,β)f(p \mid X=1) = \frac{p \cdot \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha,\beta)}}{P(X=1)} = \frac{p^\alpha (1-p)^{\beta-1}}{B(\alpha+1,\beta)}

This is Beta(α+1,β)\text{Beta}(\alpha+1, \beta) - one success increments α\alpha by 1. [ok]

R.2 Continuous Bayes for Signal in Noise

Setup: Observe noisy signal Y=θ+εY = \theta + \varepsilon where εN(0,σ2)\varepsilon \sim \mathcal{N}(0, \sigma^2) (known) and prior θN(μ0,σ02)\theta \sim \mathcal{N}(\mu_0, \sigma_0^2).

Likelihood: f(yθ)=N(y;θ,σ2)f(y \mid \theta) = \mathcal{N}(y; \theta, \sigma^2)

Posterior:

f(θy)f(yθ)f(θ)exp ⁣((yθ)22σ2(θμ0)22σ02)f(\theta \mid y) \propto f(y \mid \theta) f(\theta) \propto \exp\!\left(-\frac{(y-\theta)^2}{2\sigma^2} - \frac{(\theta-\mu_0)^2}{2\sigma_0^2}\right)

Completing the square in θ\theta, the exponent becomes:

12(1σ2+1σ02)(θy/σ2+μ0/σ021/σ2+1/σ02)2-\frac{1}{2}\left(\frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\left(\theta - \frac{y/\sigma^2 + \mu_0/\sigma_0^2}{1/\sigma^2 + 1/\sigma_0^2}\right)^2

So the posterior is:

θyN ⁣(y/σ2+μ0/σ021/σ2+1/σ02,  11/σ2+1/σ02)\theta \mid y \sim \mathcal{N}\!\left(\frac{y/\sigma^2 + \mu_0/\sigma_0^2}{1/\sigma^2 + 1/\sigma_0^2},\; \frac{1}{1/\sigma^2 + 1/\sigma_0^2}\right)

Interpretation: The posterior mean is a precision-weighted average of the observation yy and the prior mean μ0\mu_0. More precise information (smaller variance) gets higher weight. When σ02\sigma_0^2 \to \infty (flat prior): posterior mean y\to y (MLE). When σ2\sigma^2 \to \infty (very noisy observation): posterior mean μ0\to \mu_0 (prior dominates).

This formula is the Kalman filter update equation for the 1D case.
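A minimal numpy sketch of this precision-weighted update (the function name and numerical values are illustrative):

```python
import numpy as np

def gaussian_posterior_1d(y, sigma2, mu0, sigma0_2):
    """Posterior N(theta | y) for the signal-in-noise model above (R.2)."""
    prec = 1.0 / sigma2 + 1.0 / sigma0_2              # posterior precision
    post_mean = (y / sigma2 + mu0 / sigma0_2) / prec  # precision-weighted average
    post_var = 1.0 / prec
    return post_mean, post_var

# Precise prior, noisy observation: the posterior mean stays near the prior mean.
print(gaussian_posterior_1d(y=5.0, sigma2=4.0, mu0=0.0, sigma0_2=0.25))
# Nearly flat prior: the posterior mean approaches the observation (the MLE).
print(gaussian_posterior_1d(y=5.0, sigma2=4.0, mu0=0.0, sigma0_2=1e6))
```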


Appendix S: From Joint Distributions to Variational Inference

Variational inference (VI) approximates intractable posteriors by solving an optimisation problem over a tractable family of distributions.

Goal: Approximate p(zx)=p(z,x)/p(x)p(\mathbf{z} \mid \mathbf{x}) = p(\mathbf{z}, \mathbf{x}) / p(\mathbf{x}) with qϕ(z)Qq_\phi(\mathbf{z}) \in \mathcal{Q}.

Evidence Lower Bound (ELBO):

logp(x)=Eqϕ(z) ⁣[logp(z,x)qϕ(z)]+DKL(qϕ(z)p(zx))\log p(\mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z})}\!\left[\log \frac{p(\mathbf{z}, \mathbf{x})}{q_\phi(\mathbf{z})}\right] + D_{\text{KL}}(q_\phi(\mathbf{z}) \| p(\mathbf{z} \mid \mathbf{x}))

Since DKL0D_{\text{KL}} \geq 0, we have logp(x)L(ϕ)\log p(\mathbf{x}) \geq \mathcal{L}(\phi) where:

L(ϕ)=Eqϕ(z) ⁣[logp(z,x)qϕ(z)]=Eqϕ[logp(xz)]DKL(qϕ(z)p(z))\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(\mathbf{z})}\!\left[\log \frac{p(\mathbf{z}, \mathbf{x})}{q_\phi(\mathbf{z})}\right] = \mathbb{E}_{q_\phi}[\log p(\mathbf{x} \mid \mathbf{z})] - D_{\text{KL}}(q_\phi(\mathbf{z}) \| p(\mathbf{z}))

Connection to joint distributions: The ELBO decomposes as:

  • Reconstruction: how well qϕq_\phi places mass where p(xz)p(\mathbf{x} \mid \mathbf{z}) is large
  • Regularisation: DKL(qϕp(z))D_{\text{KL}}(q_\phi \| p(\mathbf{z})) keeps qϕq_\phi close to the prior joint distribution p(z)p(\mathbf{z})

Maximising the ELBO is equivalent to minimising DKL(qϕ(z)p(zx))D_{\text{KL}}(q_\phi(\mathbf{z}) \| p(\mathbf{z} \mid \mathbf{x})) - finding the best joint approximation in Q\mathcal{Q}.

Mean-field VI: The simplest tractable family: qϕ(z)=iqϕi(zi)q_\phi(\mathbf{z}) = \prod_i q_{\phi_i}(z_i). This assumes all latent variables are independent - a strong assumption that leads to underestimated posterior variances (VI is over-confident). VAEs use this assumption in the encoder: the output qϕ(zx)=dN(zd;μd,σd2)q_\phi(\mathbf{z} \mid \mathbf{x}) = \prod_d \mathcal{N}(z_d; \mu_d, \sigma_d^2) is a diagonal (factored) Gaussian.

For AI (2026): Flow-based posteriors (normalising flows applied to the variational distribution) relax the mean-field assumption while remaining tractable. Diffusion-based VI methods (Geffner et al., 2023) use learned denoising processes as flexible approximate posteriors, achieving near-exact inference for structured joint distributions.


Appendix T: Extended ML Case Studies

T.1 Case Study: Transformer Attention as Conditional Expectation

We show in detail how the self-attention operation is a conditional expectation under a joint distribution over sequence positions.

Setting up the joint distribution: Define a joint distribution over "query position" TT and "key position" SS with:

p(S=sT=t)=αts=softmax ⁣(qtksd)sp(S = s \mid T = t) = \alpha_{ts} = \text{softmax}\!\left(\frac{\mathbf{q}_t^\top \mathbf{k}_s}{\sqrt{d}}\right)_s

This is a well-defined conditional distribution: αts0\alpha_{ts} \geq 0 and sαts=1\sum_s \alpha_{ts} = 1.

The attention output as conditional expectation:

ot=sαtsvs=ESp(T=t)[vS]\mathbf{o}_t = \sum_s \alpha_{ts} \mathbf{v}_s = \mathbb{E}_{S \sim p(\cdot \mid T=t)}[\mathbf{v}_S]

This is the conditional mean of the value vector under the attention distribution - exactly the conditional expectation formula E[YX=x]\mathbb{E}[Y \mid X = x] from Section 3.2.

Multi-head attention as mixture: Each head hh has its own attention distribution ph(ST)p_h(S \mid T). The output is:

ot=WO[ot(1)ot(H)]\mathbf{o}_t = W_O \left[\mathbf{o}_t^{(1)} \| \cdots \| \mathbf{o}_t^{(H)}\right]

where ot(h)=ESph(T=t)[vS(h)]\mathbf{o}_t^{(h)} = \mathbb{E}_{S \sim p_h(\cdot \mid T=t)}[\mathbf{v}_S^{(h)}]. Different heads can specialise in different conditional distributions - syntax, semantics, coreference, etc.

KV cache as sufficient statistic: The key insight is that once (ks,vs)(\mathbf{k}_s, \mathbf{v}_s) are computed for position ss, they fully determine ph(S=sT=t)p_h(S=s \mid T=t) and vS(h)\mathbf{v}_S^{(h)} for any future query position t>st > s. The KV cache stores exactly these sufficient statistics - the conditional distributions of future positions given past keys and values.
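The conditional-expectation reading can be made explicit in a few lines of numpy. This is a single-head sketch with random queries, keys, and values and no causal mask; the shapes and variable names are illustrative:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

# alpha[t, s] = p(S = s | T = t) and o_t = E_{S ~ p(. | T = t)}[ v_S ].
rng = np.random.default_rng(0)
T, d = 6, 8                                      # sequence length, head dimension
Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))

alpha = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # each row is a distribution over S
assert np.allclose(alpha.sum(axis=-1), 1.0)      # valid conditional distributions

O = alpha @ V                                    # row t is the conditional mean of v_S
print(O.shape)                                   # (T, d)
```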

T.2 Case Study: VAE as Latent Variable Model

The VAE is a latent variable model with joint distribution:

pθ(x,z)=pθ(xz)p(z)p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{x} \mid \mathbf{z}) \cdot p(\mathbf{z})

The chain of joint distributions:

  1. Prior: zp(z)=N(0,I)\mathbf{z} \sim p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, I) - maximum entropy distribution over latent space
  2. Generative joint: pθ(x,z)p_\theta(\mathbf{x}, \mathbf{z}) - the model's joint distribution
  3. Approximate posterior: qϕ(zx)=N(μϕ(x),diag(σϕ2(x)))q_\phi(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}_\phi(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}_\phi^2(\mathbf{x})))
  4. Marginal likelihood: pθ(x)=pθ(xz)p(z)dzp_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x} \mid \mathbf{z}) p(\mathbf{z})\,d\mathbf{z} - intractable; ELBO is a lower bound

Reparameterisation implements the affine transformation: The encoder outputs (μϕ,σϕ)(\boldsymbol{\mu}_\phi, \boldsymbol{\sigma}_\phi). Sampling z=μϕ+σϕε\mathbf{z} = \boldsymbol{\mu}_\phi + \boldsymbol{\sigma}_\phi \odot \boldsymbol{\varepsilon} with εN(0,I)\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, I) uses the affine transformation property: zN(μϕ,diag(σϕ2))\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}_\phi, \text{diag}(\boldsymbol{\sigma}_\phi^2)).

Training objective decomposition:

L=Eqϕ(zx)[logpθ(xz)]reconstructionDKL(qϕ(zx)p(z))regularisation\mathcal{L} = \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]}_{\text{reconstruction}} - \underbrace{D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))}_{\text{regularisation}}

The second term, computed in closed form (Appendix O.3), enforces that the approximate posterior qϕ(zx)q_\phi(\mathbf{z} \mid \mathbf{x}) stays close to the prior - preventing the encoder from ignoring the prior and using an arbitrary encoding.

T.3 Case Study: Gaussian Process Regression as Conditional MVN

Setting: We observe noisy evaluations y=f(X)+ε\mathbf{y} = f(\mathbf{X}) + \boldsymbol{\varepsilon} at training inputs X\mathbf{X}. We want to predict f=f(X)f_* = f(\mathbf{X}_*) at test points.

Joint prior:

(f,y)N ⁣(0,(KKnKnKnn+σ2I))(f_*, \mathbf{y}) \sim \mathcal{N}\!\left(\mathbf{0}, \begin{pmatrix} K_{**} & K_{*n} \\ K_{n*} & K_{nn} + \sigma^2 I \end{pmatrix}\right)

where Kij=k(xi,xj)K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) is the kernel matrix.

Prediction = Conditional MVN: Apply the Schur complement formula (Section 6.4):

fyN(μ,Σ)f_* \mid \mathbf{y} \sim \mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*)

with μ=Kn(Knn+σ2I)1y\boldsymbol{\mu}_* = K_{*n}(K_{nn} + \sigma^2 I)^{-1}\mathbf{y} and Σ=KKn(Knn+σ2I)1Kn\Sigma_* = K_{**} - K_{*n}(K_{nn} + \sigma^2 I)^{-1}K_{n*}.

The posterior mean μ\boldsymbol{\mu}_* is a weighted combination of training outputs - the weights Kn(Knn+σ2I)1K_{*n}(K_{nn} + \sigma^2 I)^{-1} are the GP analog of the Kalman gain. The posterior variance Σ\Sigma_* quantifies predictive uncertainty - it is reduced compared to the prior KK_{**} because we have data.

Connection to Bayesian linear regression: With a linear kernel k(x,x)=xxk(\mathbf{x},\mathbf{x}') = \mathbf{x}^\top \mathbf{x}', GP regression reduces to Bayesian linear regression with an isotropic prior on weights. The MVN conditional formula gives the posterior over weights and the predictive distribution simultaneously.


Appendix U: Numerical Examples - Covariance Matrix Computations

U.1 Computing the Schur Complement by Hand

Problem:

Σ=(4223),Σ11=4,Σ12=Σ21=2,Σ22=3\Sigma = \begin{pmatrix}4 & 2\\ 2 & 3\end{pmatrix}, \quad \Sigma_{11} = 4, \quad \Sigma_{12} = \Sigma_{21} = 2, \quad \Sigma_{22} = 3

Schur complement: Σ12=Σ11Σ12Σ221Σ21=42(1/3)2=44/3=8/3\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = 4 - 2 \cdot (1/3) \cdot 2 = 4 - 4/3 = 8/3

Interpretation: The conditional variance of X1X_1 given X2X_2 is 8/32.67<48/3 \approx 2.67 < 4 - conditioning reduces variance. The reduction 4/34/3 is the variance explained by X2X_2.

Correlation from covariance: ρ=2/43=2/120.577\rho = 2/\sqrt{4 \cdot 3} = 2/\sqrt{12} \approx 0.577.

Conditional distribution for X1X2=x2X_1 \mid X_2 = x_2 (with μ1=μ2=0\mu_1 = \mu_2 = 0):

X1X2=x2N ⁣(23x2,83)X_1 \mid X_2 = x_2 \sim \mathcal{N}\!\left(\frac{2}{3}x_2, \frac{8}{3}\right)

U.2 Eigendecomposition of a 2x2 Covariance Matrix

For Σ=(4223)\Sigma = \begin{pmatrix}4 & 2\\ 2 & 3\end{pmatrix}:

Characteristic polynomial: λ27λ+8=0\lambda^2 - 7\lambda + 8 = 0

Eigenvalues: λ1,2=(7±4932)/2=(7±17)/2\lambda_{1,2} = (7 \pm \sqrt{49-32})/2 = (7 \pm \sqrt{17})/2

Numerically: λ15.56\lambda_1 \approx 5.56, λ21.44\lambda_2 \approx 1.44.

Eigenvectors: For λ1\lambda_1: (4λ1)v1+2v2=0v1(2,λ14)(2,1.56)(4-\lambda_1)v_1 + 2v_2 = 0 \Rightarrow v_1 \propto (2, \lambda_1-4) \propto (2, 1.56), normalised: (0.789,0.614)\approx (0.789, 0.614).

Geometric interpretation: The principal axis of the distribution lies at angle \arctan(0.614/0.789) \approx 37.9^\circ from the x_1-axis. Variance along the long axis is \lambda_1 \approx 5.56; along the short axis, \lambda_2 \approx 1.44.

U.3 PSD Check via Cholesky

For Σ=(4223)\Sigma = \begin{pmatrix}4 & 2\\ 2 & 3\end{pmatrix}:

Cholesky: Σ=LL\Sigma = LL^\top where L=(2012)L = \begin{pmatrix}2 & 0\\ 1 & \sqrt{2}\end{pmatrix}.

Verify: LL=(2012)(2102)=(4223)LL^\top = \begin{pmatrix}2&0\\1&\sqrt{2}\end{pmatrix}\begin{pmatrix}2&1\\0&\sqrt{2}\end{pmatrix} = \begin{pmatrix}4&2\\2&3\end{pmatrix} [ok]

Since Cholesky exists (all diagonal entries 2,2>02, \sqrt{2} > 0), Σ\Sigma is SPD.

Sampling: Generate εN(0,I)\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0},I), then x=Lε\mathbf{x} = L\boldsymbol{\varepsilon} has covariance LIL=ΣL I L^\top = \Sigma.
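The three hand calculations in U.1-U.3 can be confirmed in a few lines of numpy (the sign-fixing of the eigenvector is a convention we add so the printed angle matches the text):

```python
import numpy as np

Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])

# U.1: Schur complement of the (2,2) block
schur = Sigma[0, 0] - Sigma[0, 1] / Sigma[1, 1] * Sigma[1, 0]
print(schur)                                   # 8/3 ~ 2.667

# U.2: eigendecomposition (eigh returns eigenvalues in ascending order)
vals, vecs = np.linalg.eigh(Sigma)
print(vals)                                    # ~ [1.44, 5.56]
v = vecs[:, -1] * np.sign(vecs[0, -1])         # top eigenvector, fixed sign
print(np.degrees(np.arctan2(v[1], v[0])))      # ~ 37.9 degrees

# U.3: Cholesky factor and PSD check
L = np.linalg.cholesky(Sigma)
print(L)                                       # [[2, 0], [1, sqrt(2)]]
print(np.allclose(L @ L.T, Sigma))             # True
```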


Appendix V - Worked Problems: Marginals, Conditionals, and Independence Tests

This appendix provides fully worked solutions to ten representative problems at the level of the exercises. Work through each problem before reading the solution.


V.1 Marginalising a Triangular Joint PDF

Problem. Let (X,Y)(X, Y) have joint PDF

fX,Y(x,y)=6x1[0yx1].f_{X,Y}(x, y) = 6x \cdot \mathbf{1}[0 \le y \le x \le 1].

Find (a) fX(x)f_X(x), (b) fY(y)f_Y(y), (c) fXY(xy)f_{X|Y}(x|y), (d) E[XY=y]\mathbb{E}[X|Y=y].

Solution.

(a) Marginal of XX. For 0x10 \le x \le 1, integrate over y[0,x]y \in [0, x]:

fX(x)=0x6xdy=6xx=6x2.f_X(x) = \int_0^x 6x \, dy = 6x \cdot x = 6x^2.

Check the normalisation before accepting this: integrating the candidate marginal over [0,1][0,1] gives

016x2dx=613=21.\int_0^1 6x^2 \, dx = 6 \cdot \tfrac{1}{3} = 2 \ne 1.

The given density must be renormalised. The normalisation constant cc satisfies

c010xxdydx=c01x2dx=c3=1    c=3.c \int_0^1 \int_0^x x \, dy \, dx = c \int_0^1 x^2 \, dx = \tfrac{c}{3} = 1 \implies c = 3.

So the correct joint is fX,Y(x,y)=3x1[0yx1]f_{X,Y}(x,y) = 3x \cdot \mathbf{1}[0 \le y \le x \le 1] and fX(x)=3x2f_X(x) = 3x^2.

(b) Marginal of YY. For 0y10 \le y \le 1, integrate over x[y,1]x \in [y, 1]:

fY(y)=y13xdx=31y22=3(1y2)2.f_Y(y) = \int_y^1 3x \, dx = 3 \cdot \tfrac{1-y^2}{2} = \tfrac{3(1-y^2)}{2}.

Check: 013(1y2)2dy=32[113]=3223=1\int_0^1 \tfrac{3(1-y^2)}{2} \, dy = \tfrac{3}{2}[1 - \tfrac{1}{3}] = \tfrac{3}{2} \cdot \tfrac{2}{3} = 1. [ok]

(c) Conditional fXY(xy)f_{X|Y}(x|y). By definition,

fXY(xy)=fX,Y(x,y)fY(y)=3x3(1y2)2=2x1y2,x[y,1].f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} = \frac{3x}{\tfrac{3(1-y^2)}{2}} = \frac{2x}{1-y^2}, \quad x \in [y, 1].

(d) Conditional mean E[XY=y]\mathbb{E}[X|Y=y].

E[XY=y]=y1x2x1y2dx=21y2y1x2dx=21y21y33=2(1y3)3(1y2).\mathbb{E}[X|Y=y] = \int_y^1 x \cdot \frac{2x}{1-y^2} \, dx = \frac{2}{1-y^2} \int_y^1 x^2 \, dx = \frac{2}{1-y^2} \cdot \frac{1-y^3}{3} = \frac{2(1-y^3)}{3(1-y^2)}.

Factor: 1y3=(1y)(1+y+y2)1 - y^3 = (1-y)(1+y+y^2) and 1y2=(1y)(1+y)1-y^2 = (1-y)(1+y), so

E[XY=y]=2(1+y+y2)3(1+y).\mathbb{E}[X|Y=y] = \frac{2(1+y+y^2)}{3(1+y)}.

V.2 Checking Independence via Factorisation

Problem. For each of the following joint PMFs, determine whether XYX \perp Y.

(a)

| X \backslash Y | 0 | 1 |
| --- | --- | --- |
| 0 | 0.3 | 0.2 |
| 1 | 0.1 | 0.4 |

(b)

| X \backslash Y | 0 | 1 |
| --- | --- | --- |
| 0 | 0.12 | 0.18 |
| 1 | 0.28 | 0.42 |

Solution.

(a) Marginals: P(X=0)=0.5P(X=0) = 0.5, P(X=1)=0.5P(X=1) = 0.5; P(Y=0)=0.4P(Y=0) = 0.4, P(Y=1)=0.6P(Y=1) = 0.6. Check (0,0)(0,0): P(X=0)P(Y=0)=0.5×0.4=0.200.30=P(X=0,Y=0)P(X=0)P(Y=0) = 0.5 \times 0.4 = 0.20 \ne 0.30 = P(X=0,Y=0). Not independent.

(b) Marginals: P(X=0)=0.30P(X=0) = 0.30, P(X=1)=0.70P(X=1) = 0.70; P(Y=0)=0.40P(Y=0) = 0.40, P(Y=1)=0.60P(Y=1) = 0.60. Check all four cells:

  • (0,0)(0,0): 0.30×0.40=0.120.30 \times 0.40 = 0.12 [ok]
  • (0,1)(0,1): 0.30×0.60=0.180.30 \times 0.60 = 0.18 [ok]
  • (1,0)(1,0): 0.70×0.40=0.280.70 \times 0.40 = 0.28 [ok]
  • (1,1)(1,1): 0.70×0.60=0.420.70 \times 0.60 = 0.42 [ok]

Independent. [ok]


V.3 Covariance Computation

Problem. Suppose (X,Y)(X, Y) uniform on the unit disk {(x,y):x2+y21}\{(x,y) : x^2 + y^2 \le 1\}. The joint PDF is f(x,y)=1πf(x,y) = \tfrac{1}{\pi}. Compute Cov(X,Y)\text{Cov}(X, Y).

Solution.

By symmetry of the unit disk under (x,y)(x,y)(x,y) \mapsto (-x, y) and (x,y)(x,y)(x,y) \mapsto (x, -y):

  • E[X]=0\mathbb{E}[X] = 0 and E[Y]=0\mathbb{E}[Y] = 0 (odd integrand over symmetric domain).
  • E[XY]=1πx2+y21xydxdy=0\mathbb{E}[XY] = \tfrac{1}{\pi} \iint_{x^2+y^2\le 1} xy \, dx \, dy = 0 (same symmetry argument: xyxy is antisymmetric under xxx \mapsto -x).

Therefore Cov(X,Y)=E[XY]E[X]E[Y]=000=0\text{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] = 0 - 0 \cdot 0 = 0.

However, XX and YY are not independent: if X=0.9X = 0.9 then YY must lie in [10.81,10.81]=[0.436,0.436][-\sqrt{1 - 0.81}, \sqrt{1 - 0.81}] = [-0.436, 0.436], a much tighter range than without conditioning. This is a canonical example of zero covariance ⇏\not\Rightarrow independence.


V.4 Bivariate Normal: Conditional Mean and Variance

Problem. (X1,X2)N(0,Σ)(X_1, X_2) \sim \mathcal{N}(\mathbf{0}, \Sigma) with Σ=(4339)\Sigma = \begin{pmatrix} 4 & 3 \\ 3 & 9 \end{pmatrix}.

Find the distribution of X1X2=x2X_1 | X_2 = x_2.

Solution.

Using the Schur complement formula with μ1=μ2=0\mu_1 = \mu_2 = 0:

μ12=μ1+Σ12Σ221(x2μ2)=0+319x2=x23.\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2) = 0 + 3 \cdot \frac{1}{9} \cdot x_2 = \frac{x_2}{3}. σ122=Σ11Σ12Σ221Σ21=43193=41=3.\sigma^2_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} = 4 - 3 \cdot \frac{1}{9} \cdot 3 = 4 - 1 = 3.

Therefore X1X2=x2N ⁣(x23,3)X_1 | X_2 = x_2 \sim \mathcal{N}\!\left(\tfrac{x_2}{3},\, 3\right).

The correlation is ρ=3/49=3/6=1/2\rho = 3 / \sqrt{4 \cdot 9} = 3/6 = 1/2, and the conditional mean μ12=ρσ1σ2x2=1223x2=x23\mu_{1|2} = \rho \cdot \tfrac{\sigma_1}{\sigma_2} \cdot x_2 = \tfrac{1}{2} \cdot \tfrac{2}{3} \cdot x_2 = \tfrac{x_2}{3} confirms our result.


V.5 Reparameterisation Trick Gradient

Problem. The ELBO of a VAE contains the term Ezqϕ(zx)[f(z)]\mathbb{E}_{z \sim q_\phi(z|x)}[f(z)] where qϕ(zx)=N(μϕ(x),σϕ2(x)I)q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x) I). Show that ϕEqϕ[f(z)]=EεN(0,I)[ϕf(μϕ+σϕε)]\nabla_\phi \mathbb{E}_{q_\phi}[f(z)] = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0,I)}[\nabla_\phi f(\mu_\phi + \sigma_\phi \odot \varepsilon)].

Solution.

The key insight is to push the gradient inside the expectation. With the direct parameterisation, qϕq_\phi depends on ϕ\phi, making ϕEqϕ[f(z)]\nabla_\phi \mathbb{E}_{q_\phi}[f(z)] require differentiating through the distribution - the REINFORCE estimator does this but has high variance.

The reparameterisation trick decouples the stochasticity from the parameters. Write

z=g(ϕ,ε)=μϕ+σϕε,εN(0,I).z = g(\phi, \varepsilon) = \mu_\phi + \sigma_\phi \odot \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I).

Then

Ezqϕ[f(z)]=EεN(0,I)[f(g(ϕ,ε))].\mathbb{E}_{z \sim q_\phi}[f(z)] = \mathbb{E}_{\varepsilon \sim \mathcal{N}(0,I)}[f(g(\phi, \varepsilon))].

Now ε\varepsilon is drawn from a fixed distribution independent of ϕ\phi. Assuming sufficient regularity to swap gradient and expectation:

ϕEqϕ[f(z)]=ϕEε[f(g(ϕ,ε))]=Eε[ϕf(g(ϕ,ε))]=Eε[ϕf(μϕ+σϕε)].\nabla_\phi \mathbb{E}_{q_\phi}[f(z)] = \nabla_\phi \mathbb{E}_{\varepsilon}[f(g(\phi,\varepsilon))] = \mathbb{E}_{\varepsilon}[\nabla_\phi f(g(\phi,\varepsilon))] = \mathbb{E}_{\varepsilon}[\nabla_\phi f(\mu_\phi + \sigma_\phi \odot \varepsilon)].

The gradient flows through the deterministic function gg via the chain rule:

ϕf(g(ϕ,ε))=zf(z)gϕ=zf(z)(Idiag(ε))\nabla_\phi f(g(\phi,\varepsilon)) = \nabla_z f(z) \cdot \frac{\partial g}{\partial \phi} = \nabla_z f(z) \cdot \begin{pmatrix} I \\ \text{diag}(\varepsilon) \end{pmatrix}

where the first block is \partial g/\partial \mu_\phi = I and the second is \partial g/\partial \sigma_\phi = \text{diag}(\varepsilon); gradients with respect to the encoder weights then follow by the chain rule through \mu_\phi(x) and \sigma_\phi(x). This low-variance estimator is what enables training VAEs with standard backpropagation.


V.6 Change of Variables: Log-Normal

Problem. If XN(μ,σ2)X \sim \mathcal{N}(\mu, \sigma^2), find the PDF of Y=eXY = e^X.

Solution.

The transformation g(x)=exg(x) = e^x is strictly increasing, so its inverse is g1(y)=lnyg^{-1}(y) = \ln y with derivative ddyg1(y)=1y\frac{d}{dy}g^{-1}(y) = \frac{1}{y}. By the univariate change-of-variables formula:

fY(y)=fX(lny)ddylny=12πσexp ⁣((lnyμ)22σ2)1y,y>0.f_Y(y) = f_X(\ln y) \cdot \left|\frac{d}{dy}\ln y\right| = \frac{1}{\sqrt{2\pi}\sigma} \exp\!\left(-\frac{(\ln y - \mu)^2}{2\sigma^2}\right) \cdot \frac{1}{y}, \quad y > 0.

This is the log-normal distribution. Its mean and variance are

E[Y]=eμ+σ2/2,Var(Y)=(eσ21)e2μ+σ2.\mathbb{E}[Y] = e^{\mu + \sigma^2/2}, \quad \text{Var}(Y) = (e^{\sigma^2}-1)e^{2\mu+\sigma^2}.

AI relevance. Log-normal distributions model quantities that are products of many small independent factors - learning rate schedules, weight magnitudes after multiplicative updates, and token probability products in sequence models. When σ21\sigma^2 \gg 1 the distribution has a heavy right tail; this explains why gradient clipping is necessary: the product of many near-1 weights can occasionally explode.


V.7 Order Statistics of Uniform

Problem. Let U(1)U(2)U(n)U_{(1)} \le U_{(2)} \le \cdots \le U_{(n)} be the order statistics of nn iid Uniform(0,1)\text{Uniform}(0,1) random variables. Find fU(k)(u)f_{U_{(k)}}(u).

Solution.

For the kk-th order statistic, we need exactly k1k-1 values below uu, one value at uu, and nkn-k values above uu. This is a multinomial counting argument:

fU(k)(u)=n!(k1)!(nk)!uk1(1u)nk1,u[0,1].f_{U_{(k)}}(u) = \frac{n!}{(k-1)!(n-k)!} \cdot u^{k-1} \cdot (1-u)^{n-k} \cdot 1, \quad u \in [0,1].

Recognising the kernel: U(k)Beta(k,nk+1)U_{(k)} \sim \text{Beta}(k, n-k+1). This is a remarkable result connecting order statistics to the Beta distribution.

Moments:

E[U(k)]=kn+1,Var(U(k))=k(nk+1)(n+1)2(n+2).\mathbb{E}[U_{(k)}] = \frac{k}{n+1}, \quad \text{Var}(U_{(k)}) = \frac{k(n-k+1)}{(n+1)^2(n+2)}.

AI relevance. Order statistics underlie Top-kk sampling in autoregressive generation. When sampling the next token by selecting from the kk highest-probability tokens, the threshold is the kk-th order statistic of the probability vector (in decreasing order). The Beta distribution for U(k)U_{(k)} gives the distribution of this adaptive threshold.
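An empirical check of the Beta(k, n-k+1) result (numpy/scipy sketch; the values of n, k, and the number of trials are arbitrary):

```python
import numpy as np
from scipy.stats import beta

# The k-th order statistic of n iid Uniform(0,1) draws should follow Beta(k, n-k+1).
rng = np.random.default_rng(0)
n, k, trials = 10, 3, 200_000
u_k = np.sort(rng.uniform(size=(trials, n)), axis=1)[:, k - 1]

print("sample mean:", u_k.mean(), " theory:", k / (n + 1))
print("sample var :", u_k.var(),  " theory:", beta(k, n - k + 1).var())
```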


V.8 Conditional Independence: Chain vs Fork vs Collider

Problem. In each graphical model below, state whether XYZX \perp Y | Z and whether XYX \perp Y (marginal independence).

(a) Chain: XZYX \to Z \to Y

(b) Fork: XZYX \leftarrow Z \rightarrow Y

(c) Collider: XZYX \to Z \leftarrow Y

Solution.

(a) Chain. The joint is p(x,z,y)=p(x)p(zx)p(yz)p(x,z,y) = p(x)p(z|x)p(y|z).

  • Marginal: p(x,y)=zp(x)p(zx)p(yz)p(x,y) = \sum_z p(x)p(z|x)p(y|z), generally not p(x)p(y)p(x)p(y). Not marginally independent.
  • Conditional: p(x,yz)=p(x,z,y)/p(z)=p(xz)p(yz)p(x,y|z) = p(x,z,y)/p(z) = p(x|z)p(y|z) after noting p(xz)=p(x)p(zx)/p(z)p(x|z) = p(x)p(z|x)/p(z). XYZX \perp Y | Z. [ok]

(b) Fork. The joint is p(z)p(xz)p(yz)p(z)p(x|z)p(y|z).

  • Marginal: p(x,y)=zp(z)p(xz)p(yz)p(x,y) = \sum_z p(z)p(x|z)p(y|z), generally not p(x)p(y)p(x)p(y) because both share cause ZZ. Not marginally independent.
  • Conditional: p(x,yz)=p(xz)p(yz)p(x,y|z) = p(x|z)p(y|z). XYZX \perp Y | Z. [ok]

(c) Collider. The joint is p(x)p(y)p(zx,y)p(x)p(y)p(z|x,y).

  • Marginal: p(x,y)=p(x)p(y)zp(zx,y)=p(x)p(y)p(x,y) = p(x)p(y)\sum_z p(z|x,y) = p(x)p(y). XYX \perp Y (marginal). [ok]
  • Conditional: knowing ZZ creates dependence between XX and YY (explaining away / Berkson's paradox). X⊥̸YZX \not\perp Y | Z in general.

Summary table:

| Structure | X \perp Y (marginal)? | X \perp Y \mid Z? |
|-----------|-----------------------|-------------------|
| Chain     | No                    | Yes               |
| Fork      | No                    | Yes               |
| Collider  | Yes                   | No                |

The collider reversal is the most counterintuitive result in probabilistic graphical models and has real-world consequences: conditioning on a collider (e.g., selecting on a trait influenced by two independent factors) induces spurious correlations between otherwise independent causes. In neural architecture search, selecting architectures that achieve high validation accuracy (collider: architecture traits + dataset difficulty) can create apparent anti-correlations between design choices that are truly independent.
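A minimal simulation of the collider case, assuming NumPy: XX and YY are independent fair coins and Z=max(X,Y)Z = \max(X, Y) records whether at least one came up heads; conditioning on Z=1Z = 1 induces a negative correlation (explaining away). The sample size and seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
x = rng.integers(0, 2, n)      # X ~ Bernoulli(1/2)
y = rng.integers(0, 2, n)      # Y ~ Bernoulli(1/2), independent of X
z = np.maximum(x, y)           # collider: Z = 1 if X or Y equals 1

# Marginally, X and Y are (essentially) uncorrelated...
print(np.corrcoef(x, y)[0, 1])               # ~ 0

# ...but conditioning on the collider Z = 1 makes them negatively correlated.
mask = z == 1
print(np.corrcoef(x[mask], y[mask])[0, 1])   # ~ -0.5 (explaining away)
```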


Appendix W - Notation Reference and Quick Formulas

W.1 Notation Used in This Section

| Symbol | Meaning |
|--------|---------|
| f_{X,Y}(x,y) | Joint PDF of (X,Y) |
| p_{X,Y}(x,y) | Joint PMF of (X,Y) |
| F_{X,Y}(x,y) | Joint CDF: P(X \le x, Y \le y) |
| f_X(x) | Marginal PDF of X |
| f_{Y \mid X}(y \mid x) | Conditional PDF of Y given X = x |
| X \perp Y | X and Y are independent |
| X \perp Y \mid Z | X and Y are conditionally independent given Z |
| \boldsymbol{\mu} | Mean vector of a multivariate distribution |
| \Sigma | Covariance matrix |
| \Sigma_{12} | Off-diagonal block \text{Cov}(X_1, X_2) |
| \Sigma_{1 \mid 2} | Conditional covariance (Schur complement) |
| \rho_{XY} | Pearson correlation coefficient |
| \phi(\cdot) | Standard normal PDF |
| C(u,v) | Copula function |
| \odot | Element-wise (Hadamard) product |

W.2 Master Formula Sheet

Joint, marginal, conditional (continuous):

f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy

f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}

f_{X,Y}(x,y) = f_{Y|X}(y|x) \cdot f_X(x)

Chain rule (nn variables):

p(x1,,xn)=i=1np(xix1,,xi1)p(x_1, \ldots, x_n) = \prod_{i=1}^n p(x_i \mid x_1, \ldots, x_{i-1})

Independence:

XY    fX,Y(x,y)=fX(x)fY(y) a.e.X \perp Y \iff f_{X,Y}(x,y) = f_X(x)f_Y(y) \text{ a.e.}

Covariance and correlation:

\text{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]

\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}} \in [-1,1]

Multivariate Gaussian:

\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)

f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)

MVN conditional (Schur complement):

\mathbf{X}_1 \mid \mathbf{X}_2=\mathbf{x}_2 \sim \mathcal{N}(\boldsymbol{\mu}_{1|2},\, \Sigma_{1|2})

\boldsymbol{\mu}_{1|2} = \boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2)

\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}
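A short sketch of these conditional formulas, assuming NumPy; the 3-dimensional mean vector, covariance matrix, and observed value of X_2 are illustrative choices.

```python
import numpy as np

# Illustrative mean and covariance for X = (X1, X2), with dim(X1) = 1, dim(X2) = 2.
mu = np.array([1.0, 0.0, -1.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

mu1, mu2 = mu[:1], mu[1:]
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]

x2 = np.array([0.5, -0.5])                    # observed value of X2
S22_inv = np.linalg.inv(S22)                  # requires Sigma_22 to be positive definite

mu_cond = mu1 + S12 @ S22_inv @ (x2 - mu2)    # mu_{1|2}
Sigma_cond = S11 - S12 @ S22_inv @ S21        # Schur complement Sigma_{1|2}
print(mu_cond, Sigma_cond)
```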

MVN affine transform:

AX+bN(Aμ+b,  AΣA)A\mathbf{X} + \mathbf{b} \sim \mathcal{N}(A\boldsymbol{\mu}+\mathbf{b},\; A\Sigma A^\top)

Change of variables (bivariate):

fY(y)=fX(g1(y))detJg1(y)f_\mathbf{Y}(\mathbf{y}) = f_\mathbf{X}(g^{-1}(\mathbf{y})) \cdot |\det J_{g^{-1}}(\mathbf{y})|

Convolution:

fX+Y(z)=fX(x)fY(zx)dxf_{X+Y}(z) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(z-x)\, dx

Reparameterisation trick:

\mathbf{z} = \boldsymbol{\mu}_\phi + \boldsymbol{\sigma}_\phi \odot \boldsymbol{\varepsilon}, \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, I)

\nabla_\phi \mathbb{E}_{q_\phi}[f(\mathbf{z})] = \mathbb{E}_{\boldsymbol{\varepsilon}}[\nabla_\phi f(\boldsymbol{\mu}_\phi + \boldsymbol{\sigma}_\phi \odot \boldsymbol{\varepsilon})]

Sklar's Theorem:

FX,Y(x,y)=C(FX(x),FY(y))F_{X,Y}(x,y) = C(F_X(x), F_Y(y))

W.3 Key Inequalities

| Inequality | Statement | Use Case |
|------------|-----------|----------|
| Cauchy-Schwarz | \lvert \text{Cov}(X,Y) \rvert \le \sqrt{\text{Var}(X)\,\text{Var}(Y)} | Proves \lvert \rho_{XY} \rvert \le 1 |
| Chebyshev | P(\lvert X - \mu \rvert \ge k\sigma) \le 1/k^2 | Variance-only tail bound; weak law of large numbers |
| Jensen | f(\mathbb{E}[X]) \le \mathbb{E}[f(X)] for convex f | ELBO derivation, log inequalities |
| Markov | P(X \ge t) \le \mathbb{E}[X]/t for X \ge 0 | Simplest tail bound |
| Data Processing | I(X;Y) \ge I(X; g(Y)) for any function g | Information is not increased by processing |

W.4 Common Mistakes - Quick Reference

| # | Mistake | Fix |
|---|---------|-----|
| 1 | Confusing joint CDF and joint PDF | F(x,y) = P(X \le x, Y \le y); the PDF is the mixed partial \partial^2 F / \partial x \, \partial y |
| 2 | \text{Cov} = 0 \Rightarrow independent | Only true for jointly Gaussian variables; counterexample: uniform on the unit disk |
| 3 | Forgetting the Jacobian in change of variables | Always multiply by \lvert \det J_{g^{-1}}(\mathbf{y}) \rvert |
| 4 | Assuming Gaussian marginals give a jointly Gaussian vector | X, Y Gaussian does not imply (X,Y) jointly Gaussian |
| 5 | Conditioning on zero-probability events without a limiting argument | Use the conditional density f_{Y \mid X}(y \mid x) = f_{X,Y}(x,y) / f_X(x) |
| 6 | Applying the Schur formula without checking \Sigma_{22} \succ 0 | Verify positive definiteness before inverting |
| 7 | Confusing marginal and conditional independence | Collider: marginally independent, conditionally dependent |
| 8 | Forgetting the log-determinant term in the MVN log-likelihood | \log f = -\tfrac{d}{2}\log 2\pi - \tfrac{1}{2}\log \lvert \Sigma \rvert - \tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}) |