Statistics | Math for LLMs

Exercises Notebook

Converted from exercises.ipynb for web reading.

Descriptive Statistics — Exercises

10 exercises from basic EDA through Batch Norm and Adam momentum analysis.

Format        | Description
Problem       | Markdown cell with task description
Your Solution | Code cell with scaffolding
Solution      | Code cell with reference solution and checks

Difficulty Levels

Level | Exercises | Focus
★     | 1–3       | Core mechanics
★★    | 4–6       | Deeper theory
★★★   | 7–10      | AI / ML applications

Topic Map

Topic                                  | Exercise
Central tendency & spread              | 1
Outlier detection                      | 2
ECDF & Q-Q plot                        | 3
Robust statistics under contamination  | 4
Covariance matrix & Mahalanobis        | 5
Anscombe's Quartet                     | 6
Batch Norm vs Layer Norm               | 7
Adam momentum as sample statistics     | 8

Additional applied exercises: Exercise 9: Robust scaling under outliers; Exercise 10: Descriptive drift dashboard.

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
import numpy.linalg as la
from scipy import stats

try:
    import matplotlib.pyplot as plt
    plt.style.use('seaborn-v0_8-whitegrid')
    plt.rcParams['figure.figsize'] = [10, 5]
    HAS_MPL = True
except ImportError:
    HAS_MPL = False

np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)

def header(title):
    print('\n' + '='*len(title))
    print(title)
    print('='*len(title))

def check_close(name, got, expected, tol=1e-4):
    ok = np.allclose(got, expected, atol=tol, rtol=tol)
    print(f"{'PASS' if ok else 'FAIL'}{name}")
    if not ok:
        print('  expected:', expected)
        print('  got     :', got)
    return ok

def check_true(name, cond):
    print(f"{'PASS' if cond else 'FAIL'}{name}")
    return cond

def five_number_summary(x):
    q1, q3 = np.percentile(x, [25, 75])
    return np.array([x.min(), q1, np.median(x), q3, x.max()])

print('Setup complete.')

Exercise 1 ★ — Central Tendency and Spread

Given the dataset representing API response times (seconds):

$$x = \{0.12, 0.15, 0.09, 0.87, 0.11, 0.14, 0.13, 0.10, 0.11, 0.12\}$$

(a) Compute the mean, median, and mode.

(b) Compute the sample variance (Bessel's correction), standard deviation, and IQR.

(c) Compute skewness $g_1$ and excess kurtosis $g_2$.

(d) Remove the outlier $0.87$ and recompute mean and median. By what factor does the mean change? The median?

(e) State which measure of central tendency best represents the 'typical' response time.

Code cell 5

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 6

# Solution
x = np.array([0.12, 0.15, 0.09, 0.87, 0.11, 0.14, 0.13, 0.10, 0.11, 0.12])

# (a) Central tendency
mean_x   = np.mean(x)
median_x = np.median(x)
mode_x   = stats.mode(x).mode

# (b) Spread
var_x = np.var(x, ddof=1)
std_x = np.std(x, ddof=1)
q1, q3 = np.percentile(x, [25, 75])
iqr_x = q3 - q1

# (c) Shape
skew_x = stats.skew(x)
kurt_x = stats.kurtosis(x)

# (d) Remove outlier
x_clean      = x[x < 0.5]
mean_clean   = np.mean(x_clean)
median_clean = np.median(x_clean)
mean_factor  = mean_x / mean_clean
median_factor = median_x / median_clean

header('Exercise 1: Central Tendency and Spread')
print(f'(a) Mean = {mean_x:.4f}s, Median = {median_x:.4f}s, Mode = {mode_x:.2f}s')
print(f'(b) Var = {var_x:.6f}, Std = {std_x:.4f}s, IQR = {iqr_x:.4f}s')
print(f'(c) Skewness g₁ = {skew_x:.3f} (right-skewed), Kurtosis g₂ = {kurt_x:.3f}')
print(f'(d) Mean with outlier: {mean_x:.4f}s → without: {mean_clean:.4f}s (factor: {mean_factor:.2f}x)')
print(f'    Median with:        {median_x:.4f}s → without: {median_clean:.4f}s (factor: {median_factor:.2f}x)')
print(f'(e) Median ({median_clean:.4f}s) best represents typical response time.')
print(f'    Mean ({mean_x:.4f}s) is inflated {(mean_factor-1)*100:.0f}% by the 0.87s outlier.')

check_true('Mean > median (right-skewed)', mean_x > median_x)
check_true('Outlier removal changes the mean by >30%', mean_factor > 1.3)
check_true('Outlier removal barely changes median', median_factor < 1.05)
print('\nTakeaway: API response time distributions are right-skewed — always report '
      'median latency (P50) and P95/P99, not mean latency.')
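
The takeaway above recommends reporting P50/P95/P99 rather than the mean; as a minimal follow-up sketch (reusing the `x` array from the solution cell above), the tail percentiles come straight from `np.percentile`:

# Percentile-based latency reporting (reuses `x` from the solution above).
p50, p95, p99 = np.percentile(x, [50, 95, 99])
print(f'P50 = {p50:.3f}s | P95 = {p95:.3f}s | P99 = {p99:.3f}s | mean = {x.mean():.3f}s')
# With a heavy right tail the mean sits above P50 but well below P95/P99,
# so a single mean figure hides exactly the latencies users complain about.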

Exercise 2 ★ — Outlier Detection

Generate 200 samples from $\mathcal{N}(50, 100)$ (mean 50, std 10) and inject 5 outliers at $\{5, 10, 100, 110, 120\}$.

(a) Compute the five-number summary.

(b) Identify outliers using Tukey fences ($1.5 \times \text{IQR}$).

(c) Identify outliers using the modified Z-score with MAD (threshold 3.5).

(d) Report true positives, false positives, and false negatives for each method.

Code cell 8

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 9

# Solution
np.random.seed(42)
x_base = np.random.normal(50, 10, 200)
injected = np.array([5.0, 10.0, 100.0, 110.0, 120.0])
x = np.concatenate([x_base, injected])
true_outliers = np.concatenate([np.zeros(200, dtype=bool), np.ones(5, dtype=bool)])

# (a) Five-number summary
fns = five_number_summary(x)

# (b) Tukey fences
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
flag_tukey = (x < q1 - 1.5*iqr) | (x > q3 + 1.5*iqr)

# (c) Modified Z-score
med = np.median(x)
mad = np.median(np.abs(x - med))
mz = 0.6745 * (x - med) / mad
flag_mad = np.abs(mz) > 3.5

def report(flag, true):
    tp = np.sum(flag & true)
    fp = np.sum(flag & ~true)
    fn = np.sum(~flag & true)
    return tp, fp, fn

header('Exercise 2: Outlier Detection')
print(f'(a) Five-number summary: min={fns[0]:.1f}, Q1={fns[1]:.1f}, '
      f'median={fns[2]:.1f}, Q3={fns[3]:.1f}, max={fns[4]:.1f}')
print(f'    IQR = {iqr:.2f}')
print()
print(f"{'Method':<20} {'TP':>4} {'FP':>4} {'FN':>4}")
print('-' * 35)
for name, flag in [('Tukey fences', flag_tukey), ('Modified Z-score', flag_mad)]:
    tp, fp, fn = report(flag, true_outliers)
    print(f'{name:<20} {tp:>4} {fp:>4} {fn:>4}')

check_true('Tukey detects all 5 injected outliers', report(flag_tukey, true_outliers)[0] == 5)
check_true('Modified Z detects all 5', report(flag_mad, true_outliers)[0] == 5)
print('\nTakeaway: Modified Z-score (MAD-based) is preferred for ML datasets — '
      'it is robust to contamination that inflates the standard deviation.')

Exercise 3 ★ — ECDF and Q-Q Plot

Generate 200 samples from a Student-$t$ distribution with $\nu = 3$ degrees of freedom.

(a) Compute and plot the ECDF $\hat{F}_n(x)$.

(b) Construct a Q-Q plot against the Gaussian distribution.

(c) Compute the sample excess kurtosis $g_2$. The theoretical excess kurtosis of a Student-$t$ distribution is $6/(\nu-4)$, which requires $\nu > 4$; with $\nu = 3$ the theoretical kurtosis is infinite. What does the sample kurtosis look like for large $n$?

(d) Verify Glivenko–Cantelli: compute $\sup_x |\hat{F}_n(x) - F_t(x)|$ and confirm it decreases as $n$ increases.

Code cell 11

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 12

# Solution
np.random.seed(42)
nu = 3
n = 200
x = np.random.standard_t(nu, n)

# (a) ECDF
x_sorted = np.sort(x)
ecdf_vals = np.arange(1, n+1) / n

# (b)/(c) Shape statistics
g1 = stats.skew(x)
g2 = stats.kurtosis(x)  # excess kurtosis

# (d) Glivenko-Cantelli vs t(nu) CDF
x_eval = np.linspace(-10, 10, 2000)
F_hat = np.mean(x[:, None] <= x_eval, axis=0)
F_true = stats.t.cdf(x_eval, df=nu)
sup_error = np.max(np.abs(F_hat - F_true))

# Sample kurtosis for different n
kurt_by_n = {}
for n_test in [100, 500, 2000, 10000]:
    samples = np.random.standard_t(nu, n_test)
    kurt_by_n[n_test] = stats.kurtosis(samples)

header('Exercise 3: ECDF and Q-Q Plot')
print(f'(b/c) Sample skewness g₁ = {g1:.3f} (theory: 0 for t-distribution)')
print(f'      Sample excess kurtosis g₂ = {g2:.3f}')
print(f'      t(3) has infinite theoretical kurtosis (ν<4)')
print(f'\nSample kurtosis by n (t(3) with infinite true kurtosis):')
for n_test, k in kurt_by_n.items():
    print(f'  n={n_test:6d}: g₂ = {k:.2f}')
print(f'\n(d) Glivenko-Cantelli: sup|F̂_n - F| = {sup_error:.4f}')

check_true('ECDF has n values', len(ecdf_vals) == n)
check_true('ECDF is monotone', np.all(np.diff(ecdf_vals) >= 0))
check_true('ECDF ends at 1', np.isclose(ecdf_vals[-1], 1.0))
check_true('Sup error < 0.1', sup_error < 0.1)
check_true('Skewness near 0', abs(g1) < 0.5)
print('\nTakeaway: The t(3) distribution has infinite kurtosis — sample '
      'kurtosis grows unboundedly with n. Gradient distributions in deep '
      'networks often have this heavy-tailed character.')
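
The reference solution computes the sup error only at n = 200; part (d) also asks to confirm the error shrinks as n grows. A short add-on (sample sizes chosen for illustration), reusing `nu`, `x_eval`, and `F_true` from above:

# Part (d) continued: sup|F_hat_n - F| should decrease, roughly like 1/sqrt(n).
for n_test in [200, 1000, 5000, 20000]:
    s = np.random.standard_t(nu, n_test)
    F_hat_n = np.mean(s[:, None] <= x_eval, axis=0)
    err = np.max(np.abs(F_hat_n - F_true))
    print(f'n={n_test:6d}: sup error = {err:.4f}   (1/sqrt(n) = {1/np.sqrt(n_test):.4f})')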

Exercise 4 ★★ — Robust Statistics Under Contamination

Compare the resilience of mean/median/trimmed-mean and std/MAD/IQR under increasing contamination.

Start with 95 samples from $\mathcal{N}(0,1)$. Add contamination drawn from $\mathcal{N}(20,1)$.

(a) For contamination fractions $\alpha \in \{0, 0.05, 0.10, 0.25, 0.40\}$: compute mean, median, 10%-trimmed mean, std, MAD×1.4826, IQR/1.349.

(b) Which location estimator is closest to 0 across all $\alpha$?

(c) Replace a single observation with $10^6$ and measure the change in each statistic. Verify the influence is bounded for median/MAD and unbounded for mean/std.

Code cell 14

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 15

# Solution
np.random.seed(42)

def contaminated_sample(alpha, n=100, seed=42):
    rng = np.random.default_rng(seed)
    k = int(alpha * n)
    clean = rng.normal(0, 1, n - k)
    outliers = rng.normal(20, 1, k)
    return np.concatenate([clean, outliers])

header('Exercise 4: Robust Statistics Under Contamination')
print(f"{'α':>6} {'Mean':>8} {'Median':>8} {'TrimMean':>10} {'Std':>8} {'MAD*c':>8} {'IQR/c':>8}")
print('-' * 60)

alphas = [0.0, 0.05, 0.10, 0.25, 0.40]
for alpha in alphas:
    x = contaminated_sample(alpha)
    mean_e   = np.mean(x)
    med_e    = np.median(x)
    trim_e   = stats.trim_mean(x, 0.10)
    std_e    = np.std(x, ddof=1)
    mad_e    = np.median(np.abs(x - np.median(x))) * 1.4826
    q1, q3  = np.percentile(x, [25, 75])
    iqr_e    = (q3 - q1) / 1.349
    print(f'{alpha:>6.2f} {mean_e:>8.3f} {med_e:>8.3f} {trim_e:>10.3f} '
          f'{std_e:>8.3f} {mad_e:>8.3f} {iqr_e:>8.3f}')

# (c) Influence of single extreme observation
x_base = contaminated_sample(0.0)
x_extreme = x_base.copy()
x_extreme[0] = 1e6

print(f'\n(c) Effect of replacing one obs with 10^6:')
print(f'  Mean:   {np.mean(x_base):.4f}{np.mean(x_extreme):.1f} (change: {np.mean(x_extreme)-np.mean(x_base):.1f})')
print(f'  Median: {np.median(x_base):.4f}{np.median(x_extreme):.4f} (change: {np.median(x_extreme)-np.median(x_base):.6f})')
print(f'  MAD:    {np.median(np.abs(x_base-np.median(x_base)))*1.4826:.4f} → '
         f'{np.median(np.abs(x_extreme-np.median(x_extreme)))*1.4826:.4f}')

check_true('Mean changes dramatically with outlier',
           abs(np.mean(x_extreme) - np.mean(x_base)) > 1000)
check_true('Median barely changes with outlier',
           abs(np.median(x_extreme) - np.median(x_base)) < 0.1)
print('\nTakeaway: For gradient aggregation in federated learning, use '
      'coordinate-wise median or geometric median instead of mean — '
      'achieves 50% breakdown point against Byzantine workers.')
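
The takeaway mentions coordinate-wise median aggregation for federated learning; here is a minimal sketch with synthetic worker gradients (the worker counts and the attack value are illustrative) showing why the median aggregate survives a Byzantine worker while the mean does not.

# Coordinate-wise median vs mean aggregation under one Byzantine worker.
rng = np.random.default_rng(0)
honest    = rng.normal(0.5, 0.1, size=(9, 4))   # 9 honest workers, 4-dim gradients
byzantine = np.full((1, 4), 1e3)                # 1 malicious worker sends huge values
grads = np.vstack([honest, byzantine])

agg_mean   = grads.mean(axis=0)                 # dragged towards the attacker
agg_median = np.median(grads, axis=0)           # stays near the honest value ~0.5
print('mean aggregate  :', np.round(agg_mean, 2))
print('median aggregate:', np.round(agg_median, 2))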

Exercise 5 ★★ — Covariance Matrix and Mahalanobis Distance

Context: The Mahalanobis distance $d_M(\mathbf{x}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})}$ accounts for the scale and correlation structure of the data, unlike Euclidean distance.

(a) Generate $n=300$ samples from a bivariate Gaussian with mean $\boldsymbol{\mu}=[2, 5]^\top$ and covariance $\Sigma = \begin{pmatrix}4 & 3 \\ 3 & 9\end{pmatrix}$.

(b) Compute the sample covariance matrix $\hat{\Sigma}$ (with Bessel's correction). Verify it is PSD.

(c) Compute the Mahalanobis distance from each sample to the sample mean. Under a bivariate Gaussian, $d_M^2 \sim \chi^2(2)$. Verify that ~95% of distances satisfy $d_M^2 \le 5.991$.

(d) Compare: flag outliers using (i) Euclidean distance > 3 std, (ii) Mahalanobis $d_M^2 > \chi^2_{0.975}(2) = 7.378$. Which method correctly captures the correlation structure?

(e) Compute the sample correlation matrix from $\hat{\Sigma}$.

Code cell 17

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 18

# Solution
np.random.seed(42)

mu_true = np.array([2.0, 5.0])
Sigma_true = np.array([[4.0, 3.0], [3.0, 9.0]])
n = 300

# (a)
X = np.random.multivariate_normal(mu_true, Sigma_true, n)

# (b)
mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X.T, ddof=1)
eigenvalues = np.linalg.eigvalsh(Sigma_hat)
is_psd = np.all(eigenvalues >= -1e-10)

# (c)
def mahalanobis(X, mu, Sigma):
    diff = X - mu
    Sigma_inv = np.linalg.inv(Sigma)
    return np.einsum('ni,ij,nj->n', diff, Sigma_inv, diff)

dm2 = mahalanobis(X, mu_hat, Sigma_hat)
frac_inside = np.mean(dm2 <= 5.991)   # chi2(2) 95th percentile

# (d)
# (i) Per-dimension z-score threshold (correlation-blind), standing in for "Euclidean > 3 std"
z = (X - mu_hat) / X.std(axis=0, ddof=1)
outliers_euclidean = np.where(np.max(np.abs(z), axis=1) > 3)[0]
outliers_mahal = np.where(dm2 > 7.378)[0]  # chi2(2) 97.5th percentile

# (e)
std_hat = np.sqrt(np.diag(Sigma_hat))
R = Sigma_hat / np.outer(std_hat, std_hat)

header('Exercise 5: Covariance Matrix and Mahalanobis Distance')
print(f'Sample mean:  {mu_hat}')
print(f'True mean:    {mu_true}')
print(f'Sample Sigma:\n{Sigma_hat}')
print(f'True Sigma:\n{Sigma_true}')
check_true('Sigma_hat is PSD', is_psd)
check_true('~95% inside chi2 ellipse', 0.90 <= frac_inside <= 0.99)
print(f'Fraction inside 95% ellipse: {frac_inside:.3f}  (expected ~0.95)')
print(f'Euclidean outliers: {len(outliers_euclidean)} | Mahalanobis outliers: {len(outliers_mahal)}')
check_close('R diagonal == 1', np.diag(R), np.ones(2))
check_true('|R[0,1]| < 1', abs(R[0, 1]) < 1)
print(f'Sample correlation: {R[0,1]:.4f}  (true: {3/np.sqrt(4*9):.4f})')
print('\nTakeaway: Mahalanobis distance rotates/scales the space by Sigma^{-1/2}, '
      'making correlation structure irrelevant — equivalent to Euclidean distance '
      'after whitening. Used in anomaly detection and multi-output normalisation.')
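
To make the whitening claim in the takeaway concrete, a short check (built on the variables above) forms $\hat{\Sigma}^{-1/2}$ from the eigendecomposition and verifies that squared Euclidean distance in the whitened space matches the Mahalanobis distances already computed:

# Whitening check: Euclidean distance after multiplying by Sigma_hat^{-1/2}
# equals the Mahalanobis distance computed above.
evals, evecs = np.linalg.eigh(Sigma_hat)
Sigma_inv_sqrt = evecs @ np.diag(evals**-0.5) @ evecs.T
Z_white = (X - mu_hat) @ Sigma_inv_sqrt
d_euclid_sq = np.sum(Z_white**2, axis=1)
check_close('Squared Euclidean after whitening == Mahalanobis^2', d_euclid_sq, dm2)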

Exercise 6 ★★ — Anscombe's Quartet

Context: Four datasets with (nearly) identical summary statistics but very different visual structures — a canonical demonstration that descriptive statistics alone are insufficient.

(a) Construct the four Anscombe datasets (use the exact values below).

(b) For each dataset compute: mean of x, mean of y, variance of x, variance of y, Pearson correlation $r$, and the OLS slope $\hat{\beta}_1 = r \cdot s_y / s_x$.

(c) Verify that all four datasets have the same summary statistics to within 0.01.

(d) Identify which dataset violates each OLS assumption:

  • Linearity (non-linear relationship)
  • No outlier / high-leverage point
  • Constant variance (homoskedasticity)

(e) Compute the residual standard error $s_e = \sqrt{\frac{1}{n-2}\sum_i e_i^2}$ for each dataset.

Code cell 20

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 21

# Solution
x1 = np.array([10,8,13,9,11,14,6,4,12,7,5], dtype=float)
x2 = x1.copy(); x3 = x1.copy()
x4 = np.array([8,8,8,8,8,8,8,19,8,8,8], dtype=float)
y1 = np.array([8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68])
y2 = np.array([9.14,8.14,8.74,8.77,9.26,8.10,6.13,3.10,9.13,7.26,4.74])
y3 = np.array([7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73])
y4 = np.array([6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.50,5.56,7.91,6.89])
datasets = [('I', x1, y1), ('II', x2, y2), ('III', x3, y3), ('IV', x4, y4)]

def summarise(x, y):
    n = len(x)
    mx, my = x.mean(), y.mean()
    vx = x.var(ddof=1); vy = y.var(ddof=1)
    r = np.corrcoef(x, y)[0, 1]
    slope = r * np.sqrt(vy / vx)
    intercept = my - slope * mx
    residuals = y - (slope * x + intercept)
    rse = np.sqrt(np.sum(residuals**2) / (n - 2))
    return {'mean_x': mx, 'mean_y': my, 'var_x': vx, 'var_y': vy,
            'r': r, 'slope': slope, 'rse': rse}

summaries = {name: summarise(x, y) for name, x, y in datasets}

header('Exercise 6: Anscombe Quartet')
print(f'  {"Dataset":<6} {"mean_x":>8} {"mean_y":>8} {"var_x":>8} {"var_y":>8} {"r":>8} {"slope":>8} {"rse":>8}')
for name, s in summaries.items():
    print(f'  {name:<6} {s["mean_x"]:>8.3f} {s["mean_y"]:>8.3f} '
          f'{s["var_x"]:>8.3f} {s["var_y"]:>8.3f} '
          f'{s["r"]:>8.4f} {s["slope"]:>8.4f} {s["rse"]:>8.4f}')

for i, key in enumerate(['mean_x','mean_y','var_x','var_y','r','slope']):
    v = [summaries[n][key] for n in ['I','II','III','IV']]
    check_true(f'All {key} within 0.01', max(v)-min(v) < 0.01)

print('\nViolations:')
print('  Dataset II  — non-linear (quadratic) relationship')
print('  Dataset III — outlier/high-leverage point (x=13,y=12.74)')
print('  Dataset IV  — all x identical except one high-leverage point')
print('\nTakeaway: Always visualise your data. Identical summary statistics '
      'can mask radically different structures. Same principle applies '
      'when comparing model outputs by aggregate metrics.')
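
Since the takeaway is "always visualise", here is a minimal plotting sketch for the four datasets, guarded on matplotlib availability in the same way as the setup cell (panel layout and the x-range of the fitted line are arbitrary choices):

# Scatter each dataset with its (identical) OLS fit; the structural differences are obvious by eye.
if HAS_MPL:
    fig, axes = plt.subplots(1, 4, figsize=(14, 3.5), sharey=True)
    xx = np.linspace(3, 20, 50)
    for ax, (name, xs, ys) in zip(axes, datasets):
        s = summaries[name]
        intercept = s['mean_y'] - s['slope'] * s['mean_x']
        ax.scatter(xs, ys)
        ax.plot(xx, s['slope'] * xx + intercept, color='C1')
        ax.set_title(f'Dataset {name} (r = {s["r"]:.2f})')
    plt.tight_layout()
    plt.show()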

Exercise 7 ★★★ — Batch Norm, Layer Norm, and RMS Norm from Scratch

Context: Modern neural network normalisation layers are direct applications of sample mean and variance. In training, each normalisation type computes different statistics over different axes of the activation tensor.

Given a batch of activations $X \in \mathbb{R}^{B \times d}$ ($B$ = batch size, $d$ = hidden dim):

(a) Implement Batch Norm: normalise each feature across the batch dimension.

$$\hat{x}_{b,j} = \frac{x_{b,j} - \hat{\mu}_j}{\sqrt{\hat{\sigma}^2_j + \varepsilon}}$$

where $\hat{\mu}_j = \frac{1}{B}\sum_b x_{b,j}$ (no Bessel correction in BN).

(b) Implement Layer Norm: normalise each sample across the feature dimension.

$$\hat{x}_{b,j} = \frac{x_{b,j} - \mu_b}{\sqrt{\sigma^2_b + \varepsilon}}$$

where $\mu_b = \frac{1}{d}\sum_j x_{b,j}$.

(c) Implement RMS Norm (used in LLaMA/Mistral):

$$\hat{x}_{b,j} = \frac{x_{b,j}}{\text{RMS}(\mathbf{x}_b) + \varepsilon}, \quad \text{RMS}(\mathbf{x}) = \sqrt{\frac{1}{d}\sum_j x_j^2}$$

(d) Verify: BN output has zero mean and unit variance per feature across the batch; LN output has zero mean and unit variance per sample across features.

(e) Demonstrate the BN failure mode: with batch size $B=1$, the per-feature variance is zero.

Code cell 23

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 24

# Solution
np.random.seed(42)
B, d = 32, 64
X = np.random.normal(loc=3.0, scale=2.0, size=(B, d))
eps = 1e-5

def batch_norm(X, eps=1e-5):
    mu = X.mean(axis=0, keepdims=True)        # (1, d)
    var = X.var(axis=0, keepdims=True)         # (1, d) — ddof=0 in BN
    return (X - mu) / np.sqrt(var + eps)

def layer_norm(X, eps=1e-5):
    mu = X.mean(axis=1, keepdims=True)         # (B, 1)
    var = X.var(axis=1, keepdims=True)         # (B, 1)
    return (X - mu) / np.sqrt(var + eps)

def rms_norm(X, eps=1e-5):
    rms = np.sqrt((X**2).mean(axis=1, keepdims=True))  # (B, 1)
    return X / (rms + eps)

X_BN = batch_norm(X)
X_LN = layer_norm(X)
X_RN = rms_norm(X)

header('Exercise 7: Normalisation Layers')
# (d) Verify BN: per-feature mean~0, std~1 across batch
check_true('BN: per-feature mean ~ 0', np.allclose(X_BN.mean(axis=0), 0, atol=1e-6))
check_true('BN: per-feature std ~ 1',  np.allclose(X_BN.std(axis=0),  1, atol=1e-4))
# (d) Verify LN: per-sample mean~0, std~1 across features
check_true('LN: per-sample mean ~ 0', np.allclose(X_LN.mean(axis=1), 0, atol=1e-6))
check_true('LN: per-sample std ~ 1',  np.allclose(X_LN.std(axis=1),  1, atol=1e-4))
# RMS Norm: per-sample RMS ~ 1
rms_after = np.sqrt((X_RN**2).mean(axis=1))
check_true('RMSNorm: per-sample RMS ~ 1', np.allclose(rms_after, 1, atol=1e-4))

# (e) BN failure at batch_size=1
X1 = X[:1, :]   # single sample
var_b1 = X1.var(axis=0)
print(f'\nBN with B=1 — per-feature variance: mean={var_b1.mean():.2e} '
      f'(all zero → division by eps only)')
check_true('BN B=1 variance is 0', np.allclose(var_b1, 0))

print('\nNormalisation axis summary:')
print('  Batch Norm : normalises EACH FEATURE across the batch (axis=0)')
print('  Layer Norm : normalises EACH SAMPLE across features  (axis=1)')
print('  RMS Norm   : like LN but omits mean subtraction (faster, LLaMA)')
print('\nTakeaway: All three normalisation methods are direct applications '
      'of sample mean and variance. BN requires large batch; LN/RMSNorm '
      'work per-token — hence preferred in autoregressive LLM inference.')
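
One detail the exercise omits: production RMSNorm (Zhang & Sennrich, 2019) also carries a learnable per-feature gain. A minimal sketch extending the function above; the gain initialisation to ones and the placement of eps inside the square root follow common practice and are assumptions here, not part of the exercise:

# RMSNorm with a learnable per-feature gain g (initialised to ones, learned during training).
def rms_norm_with_gain(X, g, eps=1e-5):
    rms = np.sqrt((X**2).mean(axis=1, keepdims=True) + eps)   # eps inside the sqrt here
    return g * X / rms

g = np.ones(d)
X_RG = rms_norm_with_gain(X, g)
check_true('Gain of ones matches plain RMSNorm (up to eps placement)',
           np.allclose(X_RG, X_RN, atol=1e-3))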

Exercise 8 ★★★ — Adam as Running Sample Statistics

Context: The Adam optimiser maintains running estimates of the first moment (mean) and second moment (uncentred variance) of the gradient, with bias correction to account for the zero-initialised accumulators:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$

(a) Implement a scalar Adam accumulator that returns $(\hat{m}_t, \hat{v}_t, \theta_t)$ for a stream of gradients, with $\beta_1=0.9$, $\beta_2=0.999$, $\eta=0.001$, $\varepsilon=10^{-8}$.

(b) Simulate $T=200$ gradient steps where the true gradient is $g_t \sim \mathcal{N}(0.5,\,1^2)$ (a noisy gradient with positive signal). Plot $\hat{m}_t$ and $\sqrt{\hat{v}_t}$ over time.

(c) Show that $\hat{m}_t$ converges to the true gradient mean $\mathbb{E}[g_t] = 0.5$.

(d) Compare bias-corrected $\hat{m}_t$ vs uncorrected $m_t$ during the first 20 steps. Quantify the bias: $\text{bias}_t = |\hat{m}_t - m_t|$.

(e) The effective learning rate is $\eta_\text{eff} = \eta / \sqrt{\hat{v}_t}$. Show that for large gradients (high curvature), $\eta_\text{eff}$ is automatically reduced.

Code cell 26

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 27

# Solution
np.random.seed(42)
T = 200
beta1, beta2, eta, eps_adam = 0.9, 0.999, 0.001, 1e-8
grads = np.random.normal(0.5, 1.0, T)

def adam_step(g, m, v, t, beta1=0.9, beta2=0.999, eta=0.001, eps=1e-8):
    m_new = beta1 * m + (1 - beta1) * g
    v_new = beta2 * v + (1 - beta2) * g**2
    m_hat = m_new / (1 - beta1**t)
    v_hat = v_new / (1 - beta2**t)
    step  = eta * m_hat / (np.sqrt(v_hat) + eps)
    return m_hat, v_hat, m_new, v_new, step

m_hat_hist, v_hat_hist = [], []
m_raw_hist, bias_hist  = [], []
m, v = 0.0, 0.0
for t, g in enumerate(grads, 1):
    m_hat, v_hat, m, v, step = adam_step(g, m, v, t)
    m_hat_hist.append(m_hat)
    v_hat_hist.append(v_hat)
    m_raw_hist.append(m)
    bias_hist.append(abs(m_hat - m))

m_hat_arr = np.array(m_hat_hist)
v_hat_arr = np.array(v_hat_hist)

header('Exercise 8: Adam as Running Sample Statistics')
check_true('m_hat converges to E[g]=0.5 (within 0.15)', abs(m_hat_arr[-1] - 0.5) < 0.15)
check_true('v_hat converges to E[g^2] ~ 1.25', abs(v_hat_arr[-1] - 1.25) < 0.3)

# (d) bias during first 20 steps
max_early_bias = max(bias_hist[:20])
late_bias = np.mean(bias_hist[180:])
check_true('Early bias is significant (>0.05)', max_early_bias > 0.05)
check_true('Late bias is small (<0.005)', late_bias < 0.005)
print(f'Max bias in steps 1-20: {max_early_bias:.4f}')
print(f'Mean bias in steps 181-200: {late_bias:.6f}')

# (e) effective learning rate
eta_eff = eta / (np.sqrt(v_hat_arr) + eps_adam)
print(f'\nEffective learning rate range: [{eta_eff.min():.6f}, {eta_eff.max():.6f}]')
print(f'Nominal learning rate: {eta}')
print(f'eta_eff < eta whenever sqrt(v_hat) > 1: '
      f'{(eta_eff < eta).mean()*100:.1f}% of steps')

print('\nBias correction significance:')
for t in [1, 2, 5, 10, 20, 50]:
    bc1 = 1 - beta1**t
    bc2 = 1 - beta2**t
    print(f'  t={t:3d}: 1-beta1^t={bc1:.4f}  1-beta2^t={bc2:.6f}')

print('\nTakeaway: Adam is a sample statistics engine. m_t is an EWMA of '
      'gradients (first moment); v_t is an EWMA of squared gradients (second '
      'moment). Bias correction = accounting for the cold-start: both accumulators '
      'start at 0, pulling estimates toward 0 in early steps.')
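
Part (e) can be made more explicit by running the same accumulator on a stream of much larger gradients and comparing effective learning rates; the $\mathcal{N}(5, 1)$ stream below is an illustrative stand-in for a high-curvature direction:

# Part (e) demo: larger gradients -> larger v_hat -> smaller effective learning rate.
big_grads = np.random.normal(5.0, 1.0, T)        # same noise level, 10x the signal
m_b, v_b = 0.0, 0.0
for t, g in enumerate(big_grads, 1):
    _, v_hat_b, m_b, v_b, _ = adam_step(g, m_b, v_b, t)

eta_eff_small = eta / (np.sqrt(v_hat_arr[-1]) + eps_adam)
eta_eff_big   = eta / (np.sqrt(v_hat_b) + eps_adam)
print(f'eta_eff with g ~ N(0.5, 1): {eta_eff_small:.6f}')
print(f'eta_eff with g ~ N(5, 1)  : {eta_eff_big:.6f}')
check_true('Effective LR shrinks for large gradients', eta_eff_big < eta_eff_small)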

Exercise 9 ★★★ — Robust Scaling Under Outliers

A feature contains mostly standard-normal values, but a few extreme outliers appear after a logging bug.

Part (a): Compare mean/std scaling with median/IQR robust scaling.

Part (b): Show how much the outliers affect each center and scale estimate.

Part (c): Compute the fraction of non-outlier points that land inside $[-2, 2]$ after each scaling.

Part (d): Explain why robust scaling can stabilize gradient-based ML pipelines.

Code cell 29

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 30

# Solution
np.random.seed(42)
x_clean = np.random.randn(1000)
outliers = np.array([15, 18, -14, 22, -20], dtype=float)
x = np.concatenate([x_clean, outliers])

mean, std = x.mean(), x.std(ddof=0)
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

z_standard = (x - mean) / std
z_robust = (x - median) / iqr
clean_standard = z_standard[:len(x_clean)]
clean_robust = z_robust[:len(x_clean)]

header('Exercise 9: Robust Scaling Under Outliers')
print(f'mean={mean:.3f}, std={std:.3f}')
print(f'median={median:.3f}, IQR={iqr:.3f}')
print(f'clean-only mean={x_clean.mean():.3f}, clean-only std={x_clean.std():.3f}')
print(f'fraction clean inside [-2,2] standard: {(np.abs(clean_standard)<=2).mean():.3f}')
print(f'fraction clean inside [-2,2] robust:   {(np.abs(clean_robust)<=2).mean():.3f}')
check_true('Outliers inflate standard deviation more than IQR',
           std / x_clean.std() > iqr / (np.percentile(x_clean, 75) - np.percentile(x_clean, 25)))
check_true('Robust center stays close to clean center', abs(median - np.median(x_clean)) < 0.05)
print('Takeaway: robust summaries protect preprocessing from a small number of extreme values.')
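
If scikit-learn is available, its `RobustScaler` implements the same median/IQR recipe; the cross-check below is optional and guarded by a try/except in the same spirit as the setup cell (it assumes scikit-learn's default 25–75% quantile range):

# Optional cross-check against scikit-learn's RobustScaler (median centre, IQR scale).
try:
    from sklearn.preprocessing import RobustScaler
    z_sklearn = RobustScaler().fit_transform(x.reshape(-1, 1)).ravel()
    check_close('RobustScaler matches manual median/IQR scaling', z_sklearn, z_robust)
except ImportError:
    print('scikit-learn not installed - skipping RobustScaler cross-check.')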

Exercise 10 ★★★ — Descriptive Drift Dashboard

A production feature has a baseline sample and a current sample.

Part (a): Compute mean shift, variance ratio, and KS statistic.

Part (b): Build a small drift decision rule using thresholds.

Part (c): Compare a pure mean shift with a tail-only shift.

Part (d): Explain why drift monitoring uses several descriptive statistics, not one number.

Code cell 32

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 33

# Solution
from scipy import stats

np.random.seed(42)
baseline = np.random.normal(0, 1, 3000)
current_mean = np.random.normal(0.35, 1, 3000)
current_tail = np.concatenate([np.random.normal(0, 1, 2850), np.random.normal(3, 0.5, 150)])

def drift_report(base, cur):
    mean_shift = cur.mean() - base.mean()
    var_ratio = cur.var(ddof=1) / base.var(ddof=1)
    ks_stat, ks_p = stats.ks_2samp(base, cur)
    alert = abs(mean_shift) > 0.2 or var_ratio > 1.25 or ks_p < 0.01
    return mean_shift, var_ratio, ks_stat, ks_p, alert

header('Exercise 10: Descriptive Drift Dashboard')
for name, sample in [('mean shift', current_mean), ('tail shift', current_tail)]:
    ms, vr, ks, p, alert = drift_report(baseline, sample)
    print(f'{name:>10}: mean_shift={ms:.3f}, var_ratio={vr:.3f}, KS={ks:.3f}, p={p:.2e}, alert={alert}')
    check_true(f'{name} triggers drift alert', alert)
print('Takeaway: mean, variance, and distributional tests catch different failure modes.')
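
Part (c) hinges on the fact that a tail-only shift barely moves the mean; a small add-on showing that an upper-tail quantile (P99 here, an illustrative choice) reacts strongly to the tail shift even when the mean shift stays modest:

# Tail-quantile drift check: P99 moves far more than the mean under a tail-only shift.
p99_base = np.percentile(baseline, 99)
for name, sample in [('mean shift', current_mean), ('tail shift', current_tail)]:
    mean_shift = sample.mean() - baseline.mean()
    p99_shift = np.percentile(sample, 99) - p99_base
    print(f'{name:>10}: mean shift = {mean_shift:.3f} | P99 shift = {p99_shift:.3f}')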

What to Review After Finishing

  • Central tendency — mean minimises L2 loss, median minimises L1; know when each is appropriate (see the numeric check after this list)
  • Bessel's correction — why $n-1$ in the denominator of sample variance; unbiasedness vs consistency
  • Breakdown point — median has 50%, mean has 0%; MAD attains the maximal 50% breakdown point among scale estimators
  • Pearson vs Spearman — Pearson measures linear association; Spearman measures monotone association (rank-based)
  • Mahalanobis distance — accounts for correlation; equals Euclidean after whitening
  • Anscombe's Quartet — identical summaries, radically different structure; always visualise
  • Normalisation layers — BN normalises over batch (axis=0); LN/RMSNorm normalise per-token (axis=1); BN fails at B=1
  • Adam — bias-corrected EWMA of first and second gradient moments; bias correction is critical in early training
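
The numeric check referenced in the first bullet: a brute-force grid search over candidate centres (the grid resolution is arbitrary) confirming that the mean minimises squared loss and the median minimises absolute loss.

# Grid-search check: mean = argmin of L2 loss, median = argmin of L1 loss.
rng = np.random.default_rng(0)
data = rng.exponential(1.0, 501)                      # skewed data makes the gap visible
grid = np.linspace(data.min(), data.max(), 4001)      # candidate centres
l2 = ((data[:, None] - grid)**2).sum(axis=0)
l1 = np.abs(data[:, None] - grid).sum(axis=0)
print(f'argmin L2 = {grid[l2.argmin()]:.3f}  vs mean   = {data.mean():.3f}')
print(f'argmin L1 = {grid[l1.argmin()]:.3f}  vs median = {np.median(data):.3f}')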

References

  1. Anscombe, F.J. (1973). Graphs in Statistical Analysis. American Statistician, 27(1), 17-21.
  2. Kingma, D.P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. ICLR 2015. arXiv:1412.6980.
  3. Ba, J. et al. (2016). Layer Normalization. arXiv:1607.06450.
  4. Zhang, B. & Sennrich, R. (2019). Root Mean Square Layer Normalization. NeurIPS 2019. arXiv:1910.07467.
  5. Huber, P.J. (1981). Robust Statistics. Wiley.
  6. Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley.