Statistics | Math for LLMs

Exercises Notebook

Converted from exercises.ipynb for web reading.

Descriptive Statistics — Exercises

10 exercises from basic EDA through Batch Norm and Adam momentum analysis.

Format        | Description
Problem       | Markdown cell with task description
Your Solution | Code cell with scaffolding
Solution      | Code cell with reference solution and checks

Difficulty Levels

Level | Exercises | Focus
★     | 1–3       | Core mechanics
★★    | 4–6       | Deeper theory
★★★   | 7–10      | AI / ML applications

Topic Map

Topic                                  | Exercise
Central tendency & spread              | 1
Outlier detection                      | 2
ECDF & Q-Q plot                        | 3
Robust statistics under contamination  | 4
Covariance matrix & Mahalanobis        | 5
Anscombe's Quartet                     | 6
Batch Norm vs Layer Norm               | 7
Adam momentum as sample statistics     | 8

Additional applied exercises: Exercise 9: Robust scaling under outliers; Exercise 10: Descriptive drift dashboard.

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
import numpy.linalg as la
from scipy import stats

try:
    import matplotlib.pyplot as plt
    plt.style.use('seaborn-v0_8-whitegrid')
    plt.rcParams['figure.figsize'] = [10, 5]
    HAS_MPL = True
except ImportError:
    HAS_MPL = False

np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)

def header(title):
    print('\n' + '='*len(title))
    print(title)
    print('='*len(title))

def check_close(name, got, expected, tol=1e-4):
    ok = np.allclose(got, expected, atol=tol, rtol=tol)
    print(f"{'PASS' if ok else 'FAIL'}{name}")
    if not ok:
        print('  expected:', expected)
        print('  got     :', got)
    return ok

def check_true(name, cond):
    print(f"{'PASS' if cond else 'FAIL'}{name}")
    return cond

def five_number_summary(x):
    q1, q3 = np.percentile(x, [25, 75])
    return np.array([x.min(), q1, np.median(x), q3, x.max()])

print('Setup complete.')

Exercise 1 ★ — Central Tendency and Spread

Given the dataset representing API response times (seconds):

$$x = \{0.12, 0.15, 0.09, 0.87, 0.11, 0.14, 0.13, 0.10, 0.11, 0.12\}$$

(a) Compute the mean, median, and mode.

(b) Compute the sample variance (Bessel's correction), standard deviation, and IQR.

(c) Compute skewness $g_1$ and excess kurtosis $g_2$.

(d) Remove the outlier $0.87$ and recompute mean and median. By what factor does the mean change? The median?

(e) State which measure of central tendency best represents the 'typical' response time.

Code cell 5

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 6

# Solution
x = np.array([0.12, 0.15, 0.09, 0.87, 0.11, 0.14, 0.13, 0.10, 0.11, 0.12])

# (a) Central tendency
mean_x   = np.mean(x)
median_x = np.median(x)
mode_x   = stats.mode(x).mode

# (b) Spread
var_x = np.var(x, ddof=1)
std_x = np.std(x, ddof=1)
q1, q3 = np.percentile(x, [25, 75])
iqr_x = q3 - q1

# (c) Shape
skew_x = stats.skew(x)
kurt_x = stats.kurtosis(x)

# (d) Remove outlier
x_clean      = x[x < 0.5]
mean_clean   = np.mean(x_clean)
median_clean = np.median(x_clean)
mean_factor  = mean_x / mean_clean
median_factor = median_x / median_clean

header('Exercise 1: Central Tendency and Spread')
print(f'(a) Mean = {mean_x:.4f}s, Median = {median_x:.4f}s, Mode = {mode_x:.2f}s')
print(f'(b) Var = {var_x:.6f}, Std = {std_x:.4f}s, IQR = {iqr_x:.4f}s')
print(f'(c) Skewness g₁ = {skew_x:.3f} (right-skewed), Kurtosis g₂ = {kurt_x:.3f}')
print(f'(d) Mean with outlier: {mean_x:.4f}s → without: {mean_clean:.4f}s (factor: {mean_factor:.2f}x)')
print(f'    Median with:        {median_x:.4f}s → without: {median_clean:.4f}s (factor: {median_factor:.2f}x)')
print(f'(e) Median ({median_clean:.4f}s) best represents typical response time.')
print(f'    Mean ({mean_x:.4f}s) is inflated {(mean_factor-1)*100:.0f}% by the 0.87s outlier.')

check_true('Mean > median (right-skewed)', mean_x > median_x)
check_true('Outlier removal changes the mean by >30%', mean_factor > 1.3)
check_true('Outlier removal barely changes median', median_factor < 1.05)
print('\nTakeaway: API response time distributions are right-skewed — always report '
      'median latency (P50) and P95/P99, not mean latency.')
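
The takeaway above recommends reporting P50/P95/P99 rather than the mean; as a minimal follow-up sketch (reusing the `x` array from the solution cell above), the tail percentiles come straight from `np.percentile`:

# Percentile-based latency reporting (reuses `x` from the solution above).
p50, p95, p99 = np.percentile(x, [50, 95, 99])
print(f'P50 = {p50:.3f}s | P95 = {p95:.3f}s | P99 = {p99:.3f}s | mean = {x.mean():.3f}s')
# With a heavy right tail the mean sits above P50 but well below P95/P99,
# so a single mean figure hides exactly the latencies users complain about.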

Exercise 2 ★ — Outlier Detection

Generate 200 samples from $\mathcal{N}(50, 100)$ (mean 50, std 10) and inject 5 outliers at $\{5, 10, 100, 110, 120\}$.

(a) Compute the five-number summary.

(b) Identify outliers using Tukey fences ($1.5 \times \text{IQR}$).

(c) Identify outliers using the modified Z-score with MAD (threshold 3.5).

(d) Report true positives, false positives, and false negatives for each method.

Code cell 8

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 9

# Solution
np.random.seed(42)
x_base = np.random.normal(50, 10, 200)
injected = np.array([5.0, 10.0, 100.0, 110.0, 120.0])
x = np.concatenate([x_base, injected])
true_outliers = np.concatenate([np.zeros(200, dtype=bool), np.ones(5, dtype=bool)])

# (a) Five-number summary
fns = five_number_summary(x)

# (b) Tukey fences
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
flag_tukey = (x < q1 - 1.5*iqr) | (x > q3 + 1.5*iqr)

# (c) Modified Z-score
med = np.median(x)
mad = np.median(np.abs(x - med))
mz = 0.6745 * (x - med) / mad
flag_mad = np.abs(mz) > 3.5

def report(flag, true):
    tp = np.sum(flag & true)
    fp = np.sum(flag & ~true)
    fn = np.sum(~flag & true)
    return tp, fp, fn

header('Exercise 2: Outlier Detection')
print(f'(a) Five-number summary: min={fns[0]:.1f}, Q1={fns[1]:.1f}, '
      f'median={fns[2]:.1f}, Q3={fns[3]:.1f}, max={fns[4]:.1f}')
print(f'    IQR = {iqr:.2f}')
print()
print(f"{'Method':<20} {'TP':>4} {'FP':>4} {'FN':>4}")
print('-' * 35)
for name, flag in [('Tukey fences', flag_tukey), ('Modified Z-score', flag_mad)]:
    tp, fp, fn = report(flag, true_outliers)
    print(f'{name:<20} {tp:>4} {fp:>4} {fn:>4}')

check_true('Tukey detects all 5 injected outliers', report(flag_tukey, true_outliers)[0] == 5)
check_true('Modified Z detects all 5', report(flag_mad, true_outliers)[0] == 5)
print('\nTakeaway: Modified Z-score (MAD-based) is preferred for ML datasets — '
      'it is robust to contamination that inflates the standard deviation.')

Exercise 3 ★ — ECDF and Q-Q Plot

Generate 200 samples from a Student-$t$ distribution with $\nu = 3$ degrees of freedom.

(a) Compute and plot the ECDF $\hat{F}_n(x)$.

(b) Construct a Q-Q plot against the Gaussian distribution.

(c) Compute the sample excess kurtosis $g_2$. The theoretical excess kurtosis of a Student-$t$ distribution is $6/(\nu-4)$, which requires $\nu > 4$; with $\nu = 3$ the theoretical kurtosis is infinite. What does the sample kurtosis look like for large $n$?

(d) Verify Glivenko–Cantelli: compute $\sup_x |\hat{F}_n(x) - F_t(x)|$ and confirm it decreases as $n$ increases.

Code cell 11

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 12

# Solution
np.random.seed(42)
nu = 3
n = 200
x = np.random.standard_t(nu, n)

# (a) ECDF
x_sorted = np.sort(x)
ecdf_vals = np.arange(1, n+1) / n

# (b)/(c) Shape statistics
g1 = stats.skew(x)
g2 = stats.kurtosis(x)  # excess kurtosis

# (d) Glivenko-Cantelli vs t(nu) CDF
x_eval = np.linspace(-10, 10, 2000)
F_hat = np.mean(x[:, None] <= x_eval, axis=0)
F_true = stats.t.cdf(x_eval, df=nu)
sup_error = np.max(np.abs(F_hat - F_true))

# Sample kurtosis for different n
kurt_by_n = {}
for n_test in [100, 500, 2000, 10000]:
    samples = np.random.standard_t(nu, n_test)
    kurt_by_n[n_test] = stats.kurtosis(samples)

header('Exercise 3: ECDF and Q-Q Plot')
print(f'(b/c) Sample skewness g₁ = {g1:.3f} (theory: 0 for t-distribution)')
print(f'      Sample excess kurtosis g₂ = {g2:.3f}')
print(f'      t(3) has infinite theoretical kurtosis (ν<4)')
print(f'\nSample kurtosis by n (t(3) with infinite true kurtosis):')
for n_test, k in kurt_by_n.items():
    print(f'  n={n_test:6d}: g₂ = {k:.2f}')
print(f'\n(d) Glivenko-Cantelli: sup|F̂_n - F| = {sup_error:.4f}')

check_true('ECDF has n values', len(ecdf_vals) == n)
check_true('ECDF is monotone', np.all(np.diff(ecdf_vals) >= 0))
check_true('ECDF ends at 1', np.isclose(ecdf_vals[-1], 1.0))
check_true('Sup error < 0.1', sup_error < 0.1)
check_true('Skewness near 0', abs(g1) < 0.5)
print('\nTakeaway: The t(3) distribution has infinite kurtosis — sample '
      'kurtosis grows unboundedly with n. Gradient distributions in deep '
      'networks often have this heavy-tailed character.')
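
The reference solution computes the sup error only at n = 200; part (d) also asks to confirm the error shrinks as n grows. A short add-on (sample sizes chosen for illustration), reusing `nu`, `x_eval`, and `F_true` from above:

# Part (d) continued: sup|F_hat_n - F| should decrease, roughly like 1/sqrt(n).
for n_test in [200, 1000, 5000, 20000]:
    s = np.random.standard_t(nu, n_test)
    F_hat_n = np.mean(s[:, None] <= x_eval, axis=0)
    err = np.max(np.abs(F_hat_n - F_true))
    print(f'n={n_test:6d}: sup error = {err:.4f}   (1/sqrt(n) = {1/np.sqrt(n_test):.4f})')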

Exercise 4 ★★ — Robust Statistics Under Contamination

Compare the resilience of mean/median/trimmed-mean and std/MAD/IQR under increasing contamination.

Start with 95 samples from $\mathcal{N}(0,1)$. Add contamination drawn from $\mathcal{N}(20,1)$.

(a) For contamination fractions $\alpha \in \{0, 0.05, 0.10, 0.25, 0.40\}$: compute mean, median, 10%-trimmed mean, std, MAD×1.4826, IQR/1.349.

(b) Which location estimator is closest to 0 across all $\alpha$?

(c) Replace a single observation with $10^6$ and measure the change in each statistic. Verify the influence is bounded for median/MAD and unbounded for mean/std.

Code cell 14

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 15

# Solution
np.random.seed(42)

def contaminated_sample(alpha, n=100, seed=42):
    rng = np.random.default_rng(seed)
    k = int(alpha * n)
    clean = rng.normal(0, 1, n - k)
    outliers = rng.normal(20, 1, k)
    return np.concatenate([clean, outliers])

header('Exercise 4: Robust Statistics Under Contamination')
print(f"{'α':>6} {'Mean':>8} {'Median':>8} {'TrimMean':>10} {'Std':>8} {'MAD*c':>8} {'IQR/c':>8}")
print('-' * 60)

alphas = [0.0, 0.05, 0.10, 0.25, 0.40]
for alpha in alphas:
    x = contaminated_sample(alpha)
    mean_e   = np.mean(x)
    med_e    = np.median(x)
    trim_e   = stats.trim_mean(x, 0.10)
    std_e    = np.std(x, ddof=1)
    mad_e    = np.median(np.abs(x - np.median(x))) * 1.4826
    q1, q3  = np.percentile(x, [25, 75])
    iqr_e    = (q3 - q1) / 1.349
    print(f'{alpha:>6.2f} {mean_e:>8.3f} {med_e:>8.3f} {trim_e:>10.3f} '
          f'{std_e:>8.3f} {mad_e:>8.3f} {iqr_e:>8.3f}')

# (c) Influence of single extreme observation
x_base = contaminated_sample(0.0)
x_extreme = x_base.copy()
x_extreme[0] = 1e6

print(f'\n(c) Effect of replacing one obs with 10^6:')
print(f'  Mean:   {np.mean(x_base):.4f}{np.mean(x_extreme):.1f} (change: {np.mean(x_extreme)-np.mean(x_base):.1f})')
print(f'  Median: {np.median(x_base):.4f}{np.median(x_extreme):.4f} (change: {np.median(x_extreme)-np.median(x_base):.6f})')
print(f'  MAD:    {np.median(np.abs(x_base-np.median(x_base)))*1.4826:.4f} → '
         f'{np.median(np.abs(x_extreme-np.median(x_extreme)))*1.4826:.4f}')

check_true('Mean changes dramatically with outlier',
           abs(np.mean(x_extreme) - np.mean(x_base)) > 1000)
check_true('Median barely changes with outlier',
           abs(np.median(x_extreme) - np.median(x_base)) < 0.1)
print('\nTakeaway: For gradient aggregation in federated learning, use '
      'coordinate-wise median or geometric median instead of mean — '
      'achieves 50% breakdown point against Byzantine workers.')
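
The takeaway mentions coordinate-wise median aggregation for federated learning; here is a minimal sketch with synthetic worker gradients (the worker counts and the attack value are illustrative) showing why the median aggregate survives a Byzantine worker while the mean does not.

# Coordinate-wise median vs mean aggregation under one Byzantine worker.
rng = np.random.default_rng(0)
honest    = rng.normal(0.5, 0.1, size=(9, 4))   # 9 honest workers, 4-dim gradients
byzantine = np.full((1, 4), 1e3)                # 1 malicious worker sends huge values
grads = np.vstack([honest, byzantine])

agg_mean   = grads.mean(axis=0)                 # dragged towards the attacker
agg_median = np.median(grads, axis=0)           # stays near the honest value ~0.5
print('mean aggregate  :', np.round(agg_mean, 2))
print('median aggregate:', np.round(agg_median, 2))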

Exercise 5 ★★ — Covariance Matrix and Mahalanobis Distance

Context: The Mahalanobis distance $d_M(\mathbf{x}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})}$ accounts for the scale and correlation structure of the data, unlike Euclidean distance.

(a) Generate $n=300$ samples from a bivariate Gaussian with mean $\boldsymbol{\mu}=[2, 5]^\top$ and covariance $\Sigma = \begin{pmatrix}4 & 3 \\ 3 & 9\end{pmatrix}$.

(b) Compute the sample covariance matrix $\hat{\Sigma}$ (with Bessel's correction). Verify it is PSD.

(c) Compute the Mahalanobis distance from each sample to the sample mean. Under a bivariate Gaussian, $d_M^2 \sim \chi^2(2)$. Verify that ~95% of distances satisfy $d_M^2 \le 5.991$.

(d) Compare: flag outliers using (i) Euclidean distance > 3 std, (ii) Mahalanobis $d_M^2 > \chi^2_{0.975}(2) = 7.378$. Which method correctly captures the correlation structure?

(e) Compute the sample correlation matrix from $\hat{\Sigma}$.

Code cell 17

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 18

# Solution
np.random.seed(42)

mu_true = np.array([2.0, 5.0])
Sigma_true = np.array([[4.0, 3.0], [3.0, 9.0]])
n = 300

# (a)
X = np.random.multivariate_normal(mu_true, Sigma_true, n)

# (b)
mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X.T, ddof=1)
eigenvalues = np.linalg.eigvalsh(Sigma_hat)
is_psd = np.all(eigenvalues >= -1e-10)

# (c)
def mahalanobis(X, mu, Sigma):
    diff = X - mu
    Sigma_inv = np.linalg.inv(Sigma)
    return np.einsum('ni,ij,nj->n', diff, Sigma_inv, diff)

dm2 = mahalanobis(X, mu_hat, Sigma_hat)
frac_inside = np.mean(dm2 <= 5.991)   # chi2(2) 95th percentile

# (d)
# (i) Per-dimension z-score threshold (correlation-blind), standing in for "Euclidean > 3 std"
z = (X - mu_hat) / X.std(axis=0, ddof=1)
outliers_euclidean = np.where(np.max(np.abs(z), axis=1) > 3)[0]
outliers_mahal = np.where(dm2 > 7.378)[0]  # chi2(2) 97.5th percentile

# (e)
std_hat = np.sqrt(np.diag(Sigma_hat))
R = Sigma_hat / np.outer(std_hat, std_hat)

header('Exercise 5: Covariance Matrix and Mahalanobis Distance')
print(f'Sample mean:  {mu_hat}')
print(f'True mean:    {mu_true}')
print(f'Sample Sigma:\n{Sigma_hat}')
print(f'True Sigma:\n{Sigma_true}')
check_true('Sigma_hat is PSD', is_psd)
check_true('~95% inside chi2 ellipse', 0.90 <= frac_inside <= 0.99)
print(f'Fraction inside 95% ellipse: {frac_inside:.3f}  (expected ~0.95)')
print(f'Euclidean outliers: {len(outliers_euclidean)} | Mahalanobis outliers: {len(outliers_mahal)}')
check_close('R diagonal == 1', np.diag(R), np.ones(2))
check_true('|R[0,1]| < 1', abs(R[0, 1]) < 1)
print(f'Sample correlation: {R[0,1]:.4f}  (true: {3/np.sqrt(4*9):.4f})')
print('\nTakeaway: Mahalanobis distance rotates/scales the space by Sigma^{-1/2}, '
      'making correlation structure irrelevant — equivalent to Euclidean distance '
      'after whitening. Used in anomaly detection and multi-output normalisation.')
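
To make the whitening claim in the takeaway concrete, a short check (built on the variables above) forms $\hat{\Sigma}^{-1/2}$ from the eigendecomposition and verifies that squared Euclidean distance in the whitened space matches the Mahalanobis distances already computed:

# Whitening check: Euclidean distance after multiplying by Sigma_hat^{-1/2}
# equals the Mahalanobis distance computed above.
evals, evecs = np.linalg.eigh(Sigma_hat)
Sigma_inv_sqrt = evecs @ np.diag(evals**-0.5) @ evecs.T
Z_white = (X - mu_hat) @ Sigma_inv_sqrt
d_euclid_sq = np.sum(Z_white**2, axis=1)
check_close('Squared Euclidean after whitening == Mahalanobis^2', d_euclid_sq, dm2)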

Exercise 6 ★★ — Anscombe's Quartet

Context: Four datasets with (nearly) identical summary statistics but very different visual structures — a canonical demonstration that descriptive statistics alone are insufficient.

(a) Construct the four Anscombe datasets (use the exact values below).

(b) For each dataset compute: mean of x, mean of y, variance of x, variance of y, Pearson correlation $r$, and the OLS slope $\hat{\beta}_1 = r \cdot s_y / s_x$.

(c) Verify that all four datasets have the same summary statistics to within 0.01.

(d) Identify which dataset violates each OLS assumption:

  • Linearity (non-linear relationship)
  • No outlier / high-leverage point
  • Constant variance (homoskedasticity)

(e) Compute the residual standard error $s_e = \sqrt{\frac{1}{n-2}\sum_i e_i^2}$ for each dataset.

Code cell 20

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 21

# Solution
x1 = np.array([10,8,13,9,11,14,6,4,12,7,5], dtype=float)
x2 = x1.copy(); x3 = x1.copy()
x4 = np.array([8,8,8,8,8,8,8,19,8,8,8], dtype=float)
y1 = np.array([8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68])
y2 = np.array([9.14,8.14,8.74,8.77,9.26,8.10,6.13,3.10,9.13,7.26,4.74])
y3 = np.array([7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73])
y4 = np.array([6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.50,5.56,7.91,6.89])
datasets = [('I', x1, y1), ('II', x2, y2), ('III', x3, y3), ('IV', x4, y4)]

def summarise(x, y):
    n = len(x)
    mx, my = x.mean(), y.mean()
    vx = x.var(ddof=1); vy = y.var(ddof=1)
    r = np.corrcoef(x, y)[0, 1]
    slope = r * np.sqrt(vy / vx)
    intercept = my - slope * mx
    residuals = y - (slope * x + intercept)
    rse = np.sqrt(np.sum(residuals**2) / (n - 2))
    return {'mean_x': mx, 'mean_y': my, 'var_x': vx, 'var_y': vy,
            'r': r, 'slope': slope, 'rse': rse}

summaries = {name: summarise(x, y) for name, x, y in datasets}

header('Exercise 6: Anscombe Quartet')
print(f'  {"Dataset":<6} {"mean_x":>8} {"mean_y":>8} {"var_x":>8} {"var_y":>8} {"r":>8} {"slope":>8} {"rse":>8}')
for name, s in summaries.items():
    print(f'  {name:<6} {s["mean_x"]:>8.3f} {s["mean_y"]:>8.3f} '
          f'{s["var_x"]:>8.3f} {s["var_y"]:>8.3f} '
          f'{s["r"]:>8.4f} {s["slope"]:>8.4f} {s["rse"]:>8.4f}')

for i, key in enumerate(['mean_x','mean_y','var_x','var_y','r','slope']):
    v = [summaries[n][key] for n in ['I','II','III','IV']]
    check_true(f'All {key} within 0.01', max(v)-min(v) < 0.01)

print('\nViolations:')
print('  Dataset II  — non-linear (quadratic) relationship')
print('  Dataset III — outlier/high-leverage point (x=13,y=12.74)')
print('  Dataset IV  — all x identical except one high-leverage point')
print('\nTakeaway: Always visualise your data. Identical summary statistics '
      'can mask radically different structures. Same principle applies '
      'when comparing model outputs by aggregate metrics.')
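
Since the takeaway is "always visualise", here is a minimal plotting sketch for the four datasets, guarded on matplotlib availability in the same way as the setup cell (panel layout and the x-range of the fitted line are arbitrary choices):

# Scatter each dataset with its (identical) OLS fit; the structural differences are obvious by eye.
if HAS_MPL:
    fig, axes = plt.subplots(1, 4, figsize=(14, 3.5), sharey=True)
    xx = np.linspace(3, 20, 50)
    for ax, (name, xs, ys) in zip(axes, datasets):
        s = summaries[name]
        intercept = s['mean_y'] - s['slope'] * s['mean_x']
        ax.scatter(xs, ys)
        ax.plot(xx, s['slope'] * xx + intercept, color='C1')
        ax.set_title(f'Dataset {name} (r = {s["r"]:.2f})')
    plt.tight_layout()
    plt.show()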

Exercise 7 ★★★ — Batch Norm, Layer Norm, and RMS Norm from Scratch

Context: Modern neural network normalisation layers are direct applications of sample mean and variance. In training, each normalisation type computes different statistics over different axes of the activation tensor.

Given a batch of activations $X \in \mathbb{R}^{B \times d}$ ($B$ = batch size, $d$ = hidden dim):

(a) Implement Batch Norm: normalise each feature across the batch dimension.

$$\hat{x}_{b,j} = \frac{x_{b,j} - \hat{\mu}_j}{\sqrt{\hat{\sigma}^2_j + \varepsilon}}$$

where $\hat{\mu}_j = \frac{1}{B}\sum_b x_{b,j}$ (no Bessel correction in BN).

(b) Implement Layer Norm: normalise each sample across the feature dimension.

$$\hat{x}_{b,j} = \frac{x_{b,j} - \mu_b}{\sqrt{\sigma^2_b + \varepsilon}}$$

where $\mu_b = \frac{1}{d}\sum_j x_{b,j}$.

(c) Implement RMS Norm (used in LLaMA/Mistral):

$$\hat{x}_{b,j} = \frac{x_{b,j}}{\text{RMS}(\mathbf{x}_b) + \varepsilon}, \quad \text{RMS}(\mathbf{x}) = \sqrt{\frac{1}{d}\sum_j x_j^2}$$

(d) Verify: BN output has zero mean and unit variance per feature across the batch; LN output has zero mean and unit variance per sample across features.

(e) Demonstrate the BN failure mode: with batch size $B=1$, the per-feature variance is zero.

Code cell 23

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 24

# Solution
np.random.seed(42)
B, d = 32, 64
X = np.random.normal(loc=3.0, scale=2.0, size=(B, d))
eps = 1e-5

def batch_norm(X, eps=1e-5):
    mu = X.mean(axis=0, keepdims=True)        # (1, d)
    var = X.var(axis=0, keepdims=True)         # (1, d) — ddof=0 in BN
    return (X - mu) / np.sqrt(var + eps)

def layer_norm(X, eps=1e-5):
    mu = X.mean(axis=1, keepdims=True)         # (B, 1)
    var = X.var(axis=1, keepdims=True)         # (B, 1)
    return (X - mu) / np.sqrt(var + eps)

def rms_norm(X, eps=1e-5):
    rms = np.sqrt((X**2).mean(axis=1, keepdims=True))  # (B, 1)
    return X / (rms + eps)

X_BN = batch_norm(X)
X_LN = layer_norm(X)
X_RN = rms_norm(X)

header('Exercise 7: Normalisation Layers')
# (d) Verify BN: per-feature mean~0, std~1 across batch
check_true('BN: per-feature mean ~ 0', np.allclose(X_BN.mean(axis=0), 0, atol=1e-6))
check_true('BN: per-feature std ~ 1',  np.allclose(X_BN.std(axis=0),  1, atol=1e-4))
# (d) Verify LN: per-sample mean~0, std~1 across features
check_true('LN: per-sample mean ~ 0', np.allclose(X_LN.mean(axis=1), 0, atol=1e-6))
check_true('LN: per-sample std ~ 1',  np.allclose(X_LN.std(axis=1),  1, atol=1e-4))
# RMS Norm: per-sample RMS ~ 1
rms_after = np.sqrt((X_RN**2).mean(axis=1))
check_true('RMSNorm: per-sample RMS ~ 1', np.allclose(rms_after, 1, atol=1e-4))

# (e) BN failure at batch_size=1
X1 = X[:1, :]   # single sample
var_b1 = X1.var(axis=0)
print(f'\nBN with B=1 — per-feature variance: mean={var_b1.mean():.2e} '
      f'(all zero → division by eps only)')
check_true('BN B=1 variance is 0', np.allclose(var_b1, 0))

print('\nNormalisation axis summary:')
print('  Batch Norm : normalises EACH FEATURE across the batch (axis=0)')
print('  Layer Norm : normalises EACH SAMPLE across features  (axis=1)')
print('  RMS Norm   : like LN but omits mean subtraction (faster, LLaMA)')
print('\nTakeaway: All three normalisation methods are direct applications '
      'of sample mean and variance. BN requires large batch; LN/RMSNorm '
      'work per-token — hence preferred in autoregressive LLM inference.')
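
One detail the exercise omits: production RMSNorm (Zhang & Sennrich, 2019) also carries a learnable per-feature gain. A minimal sketch extending the function above; the gain initialisation to ones and the placement of eps inside the square root follow common practice and are assumptions here, not part of the exercise:

# RMSNorm with a learnable per-feature gain g (initialised to ones, learned during training).
def rms_norm_with_gain(X, g, eps=1e-5):
    rms = np.sqrt((X**2).mean(axis=1, keepdims=True) + eps)   # eps inside the sqrt here
    return g * X / rms

g = np.ones(d)
X_RG = rms_norm_with_gain(X, g)
check_true('Gain of ones matches plain RMSNorm (up to eps placement)',
           np.allclose(X_RG, X_RN, atol=1e-3))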

Exercise 8 ★★★ — Adam as Running Sample Statistics

Context: The Adam optimiser maintains running estimates of the first moment (mean) and second moment (uncentred variance) of the gradient, with bias correction to account for the zero-initialised accumulators:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$

(a) Implement a scalar Adam accumulator that returns $(\hat{m}_t, \hat{v}_t, \theta_t)$ for a stream of gradients, with $\beta_1=0.9$, $\beta_2=0.999$, $\eta=0.001$, $\varepsilon=10^{-8}$.

(b) Simulate $T=200$ gradient steps where the true gradient is $g_t \sim \mathcal{N}(0.5,\,1^2)$ (a noisy gradient with positive signal). Plot $\hat{m}_t$ and $\sqrt{\hat{v}_t}$ over time.

(c) Show that $\hat{m}_t$ converges to the true gradient mean $\mathbb{E}[g_t] = 0.5$.

(d) Compare bias-corrected $\hat{m}_t$ vs uncorrected $m_t$ during the first 20 steps. Quantify the bias: $\text{bias}_t = |\hat{m}_t - m_t|$.

(e) The effective learning rate is $\eta_\text{eff} = \eta / \sqrt{\hat{v}_t}$. Show that for large gradients (high curvature), $\eta_\text{eff}$ is automatically reduced.

Code cell 26

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 27

# Solution
np.random.seed(42)
T = 200
beta1, beta2, eta, eps_adam = 0.9, 0.999, 0.001, 1e-8
grads = np.random.normal(0.5, 1.0, T)

def adam_step(g, m, v, t, beta1=0.9, beta2=0.999, eta=0.001, eps=1e-8):
    m_new = beta1 * m + (1 - beta1) * g
    v_new = beta2 * v + (1 - beta2) * g**2
    m_hat = m_new / (1 - beta1**t)
    v_hat = v_new / (1 - beta2**t)
    step  = eta * m_hat / (np.sqrt(v_hat) + eps)
    return m_hat, v_hat, m_new, v_new, step

m_hat_hist, v_hat_hist = [], []
m_raw_hist, bias_hist  = [], []
m, v = 0.0, 0.0
for t, g in enumerate(grads, 1):
    m_hat, v_hat, m, v, step = adam_step(g, m, v, t)
    m_hat_hist.append(m_hat)
    v_hat_hist.append(v_hat)
    m_raw_hist.append(m)
    bias_hist.append(abs(m_hat - m))

m_hat_arr = np.array(m_hat_hist)
v_hat_arr = np.array(v_hat_hist)

header('Exercise 8: Adam as Running Sample Statistics')
check_true('m_hat converges to E[g]=0.5 (within 0.15)', abs(m_hat_arr[-1] - 0.5) < 0.15)
check_true('v_hat converges to E[g^2] ~ 1.25', abs(v_hat_arr[-1] - 1.25) < 0.3)

# (d) bias during first 20 steps
max_early_bias = max(bias_hist[:20])
late_bias = np.mean(bias_hist[180:])
check_true('Early bias is significant (>0.05)', max_early_bias > 0.05)
check_true('Late bias is small (<0.005)', late_bias < 0.005)
print(f'Max bias in steps 1-20: {max_early_bias:.4f}')
print(f'Mean bias in steps 181-200: {late_bias:.6f}')

# (e) effective learning rate
eta_eff = eta / (np.sqrt(v_hat_arr) + eps_adam)
print(f'\nEffective learning rate range: [{eta_eff.min():.6f}, {eta_eff.max():.6f}]')
print(f'Nominal learning rate: {eta}')
print(f'eta_eff < eta whenever sqrt(v_hat) > 1: '
      f'{(eta_eff < eta).mean()*100:.1f}% of steps')

print('\nBias correction significance:')
for t in [1, 2, 5, 10, 20, 50]:
    bc1 = 1 - beta1**t
    bc2 = 1 - beta2**t
    print(f'  t={t:3d}: 1-beta1^t={bc1:.4f}  1-beta2^t={bc2:.6f}')

print('\nTakeaway: Adam is a sample statistics engine. m_t is an EWMA of '
      'gradients (first moment); v_t is an EWMA of squared gradients (second '
      'moment). Bias correction = accounting for the cold-start: both accumulators '
      'start at 0, pulling estimates toward 0 in early steps.')
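
Part (e) can be made more explicit by running the same accumulator on a stream of much larger gradients and comparing effective learning rates; the $\mathcal{N}(5, 1)$ stream below is an illustrative stand-in for a high-curvature direction:

# Part (e) demo: larger gradients -> larger v_hat -> smaller effective learning rate.
big_grads = np.random.normal(5.0, 1.0, T)        # same noise level, 10x the signal
m_b, v_b = 0.0, 0.0
for t, g in enumerate(big_grads, 1):
    _, v_hat_b, m_b, v_b, _ = adam_step(g, m_b, v_b, t)

eta_eff_small = eta / (np.sqrt(v_hat_arr[-1]) + eps_adam)
eta_eff_big   = eta / (np.sqrt(v_hat_b) + eps_adam)
print(f'eta_eff with g ~ N(0.5, 1): {eta_eff_small:.6f}')
print(f'eta_eff with g ~ N(5, 1)  : {eta_eff_big:.6f}')
check_true('Effective LR shrinks for large gradients', eta_eff_big < eta_eff_small)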

Exercise 9 ★★★ — Robust Scaling Under Outliers

A feature contains mostly standard-normal values, but a few extreme outliers appear after a logging bug.

Part (a): Compare mean/std scaling with median/IQR robust scaling.

Part (b): Show how much the outliers affect each center and scale estimate.

Part (c): Compute the fraction of non-outlier points that land inside $[-2, 2]$ after each scaling.

Part (d): Explain why robust scaling can stabilize gradient-based ML pipelines.

Code cell 29

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 30

# Solution
np.random.seed(42)
x_clean = np.random.randn(1000)
outliers = np.array([15, 18, -14, 22, -20], dtype=float)
x = np.concatenate([x_clean, outliers])

mean, std = x.mean(), x.std(ddof=0)
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

z_standard = (x - mean) / std
z_robust = (x - median) / iqr
clean_standard = z_standard[:len(x_clean)]
clean_robust = z_robust[:len(x_clean)]

header('Exercise 9: Robust Scaling Under Outliers')
print(f'mean={mean:.3f}, std={std:.3f}')
print(f'median={median:.3f}, IQR={iqr:.3f}')
print(f'clean-only mean={x_clean.mean():.3f}, clean-only std={x_clean.std():.3f}')
print(f'fraction clean inside [-2,2] standard: {(np.abs(clean_standard)<=2).mean():.3f}')
print(f'fraction clean inside [-2,2] robust:   {(np.abs(clean_robust)<=2).mean():.3f}')
check_true('Outliers inflate standard deviation more than IQR',
           std / x_clean.std() > iqr / (np.percentile(x_clean, 75) - np.percentile(x_clean, 25)))
check_true('Robust center stays close to clean center', abs(median - np.median(x_clean)) < 0.05)
print('Takeaway: robust summaries protect preprocessing from a small number of extreme values.')
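
If scikit-learn is available, its `RobustScaler` implements the same median/IQR recipe; the cross-check below is optional and guarded by a try/except in the same spirit as the setup cell (it assumes scikit-learn's default 25–75% quantile range):

# Optional cross-check against scikit-learn's RobustScaler (median centre, IQR scale).
try:
    from sklearn.preprocessing import RobustScaler
    z_sklearn = RobustScaler().fit_transform(x.reshape(-1, 1)).ravel()
    check_close('RobustScaler matches manual median/IQR scaling', z_sklearn, z_robust)
except ImportError:
    print('scikit-learn not installed - skipping RobustScaler cross-check.')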

Exercise 10 ★★★ — Descriptive Drift Dashboard

A production feature has a baseline sample and a current sample.

Part (a): Compute mean shift, variance ratio, and KS statistic.

Part (b): Build a small drift decision rule using thresholds.

Part (c): Compare a pure mean shift with a tail-only shift.

Part (d): Explain why drift monitoring uses several descriptive statistics, not one number.

Code cell 32

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 33

# Solution
from scipy import stats

np.random.seed(42)
baseline = np.random.normal(0, 1, 3000)
current_mean = np.random.normal(0.35, 1, 3000)
current_tail = np.concatenate([np.random.normal(0, 1, 2850), np.random.normal(3, 0.5, 150)])

def drift_report(base, cur):
    mean_shift = cur.mean() - base.mean()
    var_ratio = cur.var(ddof=1) / base.var(ddof=1)
    ks_stat, ks_p = stats.ks_2samp(base, cur)
    alert = abs(mean_shift) > 0.2 or var_ratio > 1.25 or ks_p < 0.01
    return mean_shift, var_ratio, ks_stat, ks_p, alert

header('Exercise 10: Descriptive Drift Dashboard')
for name, sample in [('mean shift', current_mean), ('tail shift', current_tail)]:
    ms, vr, ks, p, alert = drift_report(baseline, sample)
    print(f'{name:>10}: mean_shift={ms:.3f}, var_ratio={vr:.3f}, KS={ks:.3f}, p={p:.2e}, alert={alert}')
    check_true(f'{name} triggers drift alert', alert)
print('Takeaway: mean, variance, and distributional tests catch different failure modes.')
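
Part (c) hinges on the fact that a tail-only shift barely moves the mean; a small add-on showing that an upper-tail quantile (P99 here, an illustrative choice) reacts strongly to the tail shift even when the mean shift stays modest:

# Tail-quantile drift check: P99 moves far more than the mean under a tail-only shift.
p99_base = np.percentile(baseline, 99)
for name, sample in [('mean shift', current_mean), ('tail shift', current_tail)]:
    mean_shift = sample.mean() - baseline.mean()
    p99_shift = np.percentile(sample, 99) - p99_base
    print(f'{name:>10}: mean shift = {mean_shift:.3f} | P99 shift = {p99_shift:.3f}')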

What to Review After Finishing

  • Central tendency — mean minimises L2 loss, median minimises L1; know when each is appropriate (see the numeric check after this list)
  • Bessel's correction — why $n-1$ in the denominator of sample variance; unbiasedness vs consistency
  • Breakdown point — median has 50%, mean has 0%; MAD attains the maximal 50% breakdown point among scale estimators
  • Pearson vs Spearman — Pearson measures linear association; Spearman measures monotone association (rank-based)
  • Mahalanobis distance — accounts for correlation; equals Euclidean after whitening
  • Anscombe's Quartet — identical summaries, radically different structure; always visualise
  • Normalisation layers — BN normalises over batch (axis=0); LN/RMSNorm normalise per-token (axis=1); BN fails at B=1
  • Adam — bias-corrected EWMA of first and second gradient moments; bias correction is critical in early training
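
The numeric check referenced in the first bullet: a brute-force grid search over candidate centres (the grid resolution is arbitrary) confirming that the mean minimises squared loss and the median minimises absolute loss.

# Grid-search check: mean = argmin of L2 loss, median = argmin of L1 loss.
rng = np.random.default_rng(0)
data = rng.exponential(1.0, 501)                      # skewed data makes the gap visible
grid = np.linspace(data.min(), data.max(), 4001)      # candidate centres
l2 = ((data[:, None] - grid)**2).sum(axis=0)
l1 = np.abs(data[:, None] - grid).sum(axis=0)
print(f'argmin L2 = {grid[l2.argmin()]:.3f}  vs mean   = {data.mean():.3f}')
print(f'argmin L1 = {grid[l1.argmin()]:.3f}  vs median = {np.median(data):.3f}')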

References

  1. Anscombe, F.J. (1973). Graphs in Statistical Analysis. American Statistician, 27(1), 17-21.
  2. Kingma, D.P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. ICLR 2015. arXiv:1412.6980.
  3. Ba, J. et al. (2016). Layer Normalization. arXiv:1607.06450.
  4. Zhang, B. & Sennrich, R. (2019). Root Mean Square Layer Normalization. NeurIPS 2019. arXiv:1910.07467.
  5. Huber, P.J. (1981). Robust Statistics. Wiley.
  6. Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley.