Exercises Notebook — Math for LLMs

Converted from exercises.ipynb for web reading.

Hypothesis Testing — Exercises

10 exercises covering the complete hypothesis testing curriculum.

| Format | Description |
| --- | --- |
| Problem | Markdown cell with task description |
| Your Solution | Code cell with scaffolding |
| Solution | Code cell with reference solution and checks |

Difficulty Levels

| Level | Exercises | Focus |
| --- | --- | --- |
| ★ | 1-3 | Core mechanics: t-test, chi-squared, power |
| ★★ | 4-6 | Theory: NP Lemma, multiple testing, permutation |
| ★★★ | 7-10 | AI applications: SPRT, KS drift detection |

Topic Map

| Topic | Exercise |
| --- | --- |
| One-sample t-test | 1 |
| Chi-squared goodness-of-fit | 2 |
| Power and sample size | 3 |
| Neyman-Pearson Lemma | 4 |
| Multiple testing correction | 5 |
| Permutation test | 6 |
| Sequential A/B testing (SPRT) | 7 |
| KS-based data drift detector | 8 |

Two additional applied exercises extend the set: Exercise 9 (Benjamini-Hochberg FDR) and Exercise 10 (paired permutation benchmark test).

Code cell 2

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

try:
    import seaborn as sns
    sns.set_theme(style="whitegrid", palette="colorblind")
    HAS_SNS = True
except ImportError:
    plt.style.use("seaborn-v0_8-whitegrid")
    HAS_SNS = False

mpl.rcParams.update({
    "figure.figsize":    (10, 6),
    "figure.dpi":         120,
    "font.size":           13,
    "axes.titlesize":      15,
    "axes.labelsize":      13,
    "xtick.labelsize":     11,
    "ytick.labelsize":     11,
    "legend.fontsize":     11,
    "legend.framealpha":   0.85,
    "lines.linewidth":      2.0,
    "axes.spines.top":     False,
    "axes.spines.right":   False,
    "savefig.bbox":       "tight",
    "savefig.dpi":         150,
})
np.random.seed(42)
print("Plot setup complete.")

Code cell 3

import numpy as np
import numpy.linalg as la
from scipy import stats

try:
    import matplotlib.pyplot as plt
    import matplotlib as mpl
    mpl.rcParams.update({'figure.figsize': (10, 6), 'figure.dpi': 110,
                         'axes.spines.top': False, 'axes.spines.right': False})
    COLORS = {'primary': '#0077BB', 'secondary': '#EE7733',
              'tertiary': '#009988', 'error': '#CC3311',
              'neutral': '#555555', 'highlight': '#EE3377'}
    HAS_MPL = True
except ImportError:
    HAS_MPL = False

np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)

def header(title):
    print('\n' + '=' * len(title))
    print(title)
    print('=' * len(title))

def check_close(name, got, expected, tol=1e-6):
    ok = np.allclose(got, expected, atol=tol, rtol=tol)
    print(f"{'PASS' if ok else 'FAIL'}{name}")
    if not ok:
        print(f'  expected: {expected}')
        print(f'  got:      {got}')
    return ok

def check_true(name, cond):
    print(f"{'PASS' if cond else 'FAIL'}{name}")
    return cond

def cohen_h(p1, p2):
    """Cohen's h effect size for two proportions."""
    return 2*np.arcsin(p1**0.5) - 2*np.arcsin(p2**0.5)

print('Setup complete.')

Exercise 1 ★ — One-Sample t-Test from Scratch

An LLM is tested for latency compliance. The SLA requires mean latency ≤ 45 ms. On 20 requests: mean = 47.3 ms, sample std = 8.1 ms.

(a) State $H_0$ and $H_1$. Is this one-sided or two-sided?

(b) Compute the t-statistic: $t = (\bar{X} - \mu_0)/(S/\sqrt{n})$.

(c) Find the critical value $t_{19,\,0.05}$ (one-sided) from the t-distribution.

(d) Compute the exact one-sided p-value.

(e) State your conclusion. Compute Cohen's $d = (\bar{X} - \mu_0)/S$.

Code cell 5

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 6

# Solution
def one_sample_t_test(xbar, s, n, mu0):
    t_stat = (xbar - mu0) / (s / n**0.5)
    df = n - 1
    # One-sided upper: P(T > t | H0)
    p_one_sided = stats.t.sf(t_stat, df)
    return t_stat, df, p_one_sided

xbar, s, n, mu0 = 47.3, 8.1, 20, 45.0
t_stat, df, p_one = one_sample_t_test(xbar, s, n, mu0)
t_crit = stats.t.ppf(0.95, df)  # one-sided alpha=0.05
d_cohen = (xbar - mu0) / s

header('Exercise 1: One-Sample t-Test')
print(f'H0: mu <= 45 ms, H1: mu > 45 ms (one-sided upper)')
print(f't-statistic  = {t_stat:.4f}')
print(f'df           = {df}')
print(f'Critical val = {t_crit:.4f}')
print(f'p-value      = {p_one:.4f}')
print(f"Cohen's d    = {d_cohen:.4f}")

check_close('t-stat', t_stat, (47.3-45)/(8.1/20**0.5))
check_close('df', df, 19)
check_true('p > 0.05 (fail to reject H0)', p_one > 0.05)
check_true('t < t_crit', t_stat < t_crit)

# Approximate cross-check with scipy on a simulated sample; the draw only
# roughly matches the summary statistics above, so its p-value differs.
np.random.seed(42)
data = np.random.normal(47.3, 8.1, 20)
t_scipy, p_scipy = stats.ttest_1samp(data, mu0, alternative='greater')
print(f'scipy one-sided p on simulated sample: {p_scipy:.4f} (approximate reference)')
print(f'\nTakeaway: t = {t_stat:.2f} < t_crit = {t_crit:.2f} and p = {p_one:.2f} > 0.05, '
      f'so we fail to reject H0 — with n=20 there is insufficient evidence of an SLA violation '
      f"(d = {d_cohen:.2f}, a small effect).")
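For reference, the hand computation behind these numbers:

$$t = \frac{47.3 - 45}{8.1/\sqrt{20}} = \frac{2.3}{1.811} \approx 1.27, \qquad t_{19,\,0.05} \approx 1.729, \qquad p = P(T_{19} > 1.27) \approx 0.11, \qquad d = \frac{2.3}{8.1} \approx 0.28.$$

Since $t < t_{19,\,0.05}$, the data do not reach the one-sided 5% threshold.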

Exercise 2 ★ — Chi-Squared Goodness-of-Fit

A text classifier should output uniformly across 5 categories. On 500 test examples, observed counts: [87, 113, 95, 102, 103].

(a) State $H_0$ and compute expected counts.

(b) Compute the chi-squared statistic $\chi^2 = \sum_i (O_i - E_i)^2/E_i$.

(c) Compute the p-value with $df = k - 1 = 4$.

(d) Reject or fail to reject $H_0$ at $\alpha = 0.05$? (Critical value: $\chi^2_{4,\,0.05} = 9.49$)

(e) Compute Cramér's $V = \sqrt{\chi^2/(n(k-1))}$ and interpret.

Code cell 8

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 9

# Solution
def chi_squared_gof(observed):
    n = observed.sum()
    k = len(observed)
    expected = np.full(k, n / k)
    chi2_stat = ((observed - expected)**2 / expected).sum()
    df = k - 1
    p_value = stats.chi2.sf(chi2_stat, df)
    cramer_v = (chi2_stat / (n * (k - 1)))**0.5
    return chi2_stat, p_value, cramer_v

observed = np.array([87, 113, 95, 102, 103])
chi2, p, V = chi_squared_gof(observed)
chi2_scipy, p_scipy = stats.chisquare(observed)

header('Exercise 2: Chi-Squared GoF')
print(f'Observed: {observed}, Total: {observed.sum()}')
print(f'Expected: {np.full(5, 100.0)} (uniform)')
print(f'chi2 = {chi2:.4f}, df = 4, p = {p:.4f}')
print(f"Cramér V = {V:.4f} (small-medium effect)")

check_close('chi2 statistic', chi2, chi2_scipy)
check_close('p-value', p, p_scipy)
check_true('p < 0.05 (reject uniform H0)', p < 0.05)
check_true('V in [0,1]', 0 <= V <= 1)
print('\nTakeaway: Category 2 is over-predicted; classifier is not uniform — check training data balance.')
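Worked arithmetic for part (b), with expected counts of 100 per category:

$$\chi^2 = \frac{169 + 169 + 25 + 4 + 9}{100} = 3.76 < 9.49 = \chi^2_{4,\,0.05}, \qquad V = \sqrt{\frac{3.76}{500 \cdot 4}} \approx 0.043.$$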

Exercise 3 ★ — Power and Sample Size

You want to detect that model A has higher accuracy than model B: $p_A = 0.88$, $p_B = 0.85$.

(a) Compute Cohen's $h$ for this effect.

(b) Compute the required $n$ per group for 80% power at $\alpha = 0.05$ (two-sided).

(c) What is the power if only $n = 2{,}000$ per group is available?

(d) Plot the power curve as a function of $n$ from 500 to 8000.

(e) Find the $n$ that gives 95% power.

Code cell 11

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 12

# Solution
def power_two_prop(p1, p2, n, alpha=0.05):
    """Approximate power of the two-sided two-proportion z-test."""
    pbar = (p1 + p2) / 2
    z_a = stats.norm.ppf(1 - alpha/2)
    se_h0 = (2 * pbar * (1 - pbar) / n)**0.5      # SE under H0 (pooled)
    se_h1 = (p1*(1-p1)/n + p2*(1-p2)/n)**0.5      # SE under H1 (unpooled)
    delta = abs(p1 - p2)
    # Normal approximation: probability of landing in either rejection region
    power = (stats.norm.sf((z_a * se_h0 - delta) / se_h1)
             + stats.norm.cdf((-z_a * se_h0 - delta) / se_h1))
    return power

def n_required(p1, p2, alpha=0.05, target_power=0.80):
    z_a = stats.norm.ppf(1 - alpha/2)
    z_b = stats.norm.ppf(target_power)
    pbar = (p1 + p2) / 2
    num = (z_a * (2*pbar*(1-pbar))**0.5 + z_b * (p1*(1-p1) + p2*(1-p2))**0.5)**2
    return int(np.ceil(num / (p1 - p2)**2))

p_A, p_B = 0.88, 0.85
h = cohen_h(p_A, p_B)
n_req = n_required(p_A, p_B)
power_2000 = power_two_prop(p_A, p_B, n=2000)

# Find n for 95% power
n_search = np.arange(500, 15000, 50)
powers = [power_two_prop(p_A, p_B, n) for n in n_search]
n_95 = n_search[np.argmax(np.array(powers) >= 0.95)]

header('Exercise 3: Power and Sample Size')
print(f"Cohen's h = {h:.4f}")
print(f'n required for 80% power: {n_req}')
print(f'Power at n=2000: {power_2000:.4f}')
print(f'n for 95% power: {n_95}')

if HAS_MPL:
    fig, ax = plt.subplots(figsize=(9, 5))
    ax.plot(n_search, powers, color=COLORS['primary'], lw=2)
    ax.axhline(0.80, color=COLORS['error'], linestyle='--', label='80% power')
    ax.axhline(0.95, color=COLORS['secondary'], linestyle='--', label='95% power')
    ax.axvline(n_req, color=COLORS['error'], linestyle=':', alpha=0.7)
    ax.axvline(n_95, color=COLORS['secondary'], linestyle=':', alpha=0.7)
    ax.set_xlabel('n per group'); ax.set_ylabel('Power')
    ax.set_title(f'Power curve: pA={p_A}, pB={p_B}, α=0.05')
    ax.legend(); plt.tight_layout(); plt.show()

check_true("Cohen's h > 0", h > 0)
check_true('n_req is positive integer', n_req > 0)
check_true('power at n_req >= 0.78', power_two_prop(p_A, p_B, n_req) >= 0.78)
print('\nTakeaway: a 3% accuracy gap needs ~2,000 examples per model (~4,000 total) '
      'for 80% power — larger than many benchmark test sets!')
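The sample-size formula implemented in `n_required` makes the $1/\delta^2$ scaling explicit: with $\delta = p_A - p_B$ and $\bar{p} = (p_A + p_B)/2$,

$$n = \frac{\left(z_{1-\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_{\text{power}}\sqrt{p_A(1-p_A) + p_B(1-p_B)}\right)^2}{\delta^2},$$

so halving the detectable gap quadruples the required sample size.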

Exercise 4 ★★ — Neyman-Pearson Lemma for Exponential Distribution

$X_1, \ldots, X_n \overset{iid}{\sim} \text{Exp}(\lambda)$. Test $H_0: \lambda = \lambda_0$ vs. $H_1: \lambda = \lambda_1 > \lambda_0$.

(a) Write the likelihood ratio $\Lambda(\mathbf{x}) = \mathcal{L}(\lambda_1)/\mathcal{L}(\lambda_0)$ and simplify.

(b) Show that rejecting when $\Lambda > k$ is equivalent to rejecting when $\bar{X} < c$.

(c) Find $c$ in terms of $\alpha$, $\lambda_0$, $n$ using the fact that $2\lambda_0 n\bar{X} \sim \chi^2_{2n}$.

(d) Verify: for $\lambda_0 = 1$, $n = 10$, $\alpha = 0.05$, confirm the test has correct size.

(e) Sanity-check usefulness: the test's power should exceed that of a randomized test of the same size, whose power is only $\alpha$.

Code cell 14

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 15

# Solution
def np_critical_value_exp(lambda0, n, alpha):
    # 2*lambda0*n*Xbar ~ chi2(2n)
    # Reject when Xbar < c, i.e., 2*lambda0*n*Xbar < 2*lambda0*n*c
    # P(chi2(2n) < chi2_alpha) = alpha -> chi2_lo = chi2(2n).ppf(alpha)
    chi2_lo = stats.chi2.ppf(alpha, df=2*n)
    c = chi2_lo / (2 * lambda0 * n)
    return c

def np_power_exp(lambda1, c, n):
    # P(Xbar < c | lambda=lambda1) = P(2*lambda1*n*Xbar < 2*lambda1*n*c)
    return stats.chi2.cdf(2*lambda1*n*c, df=2*n)

lambda0, n_exp, alpha = 1.0, 10, 0.05
c = np_critical_value_exp(lambda0, n_exp, alpha)

# Verify size by simulation
np.random.seed(42)
size_sim = np.mean([np.random.exponential(1/lambda0, n_exp).mean() < c for _ in range(50000)])

# Power at lambda1 = 2.0
lambda1_test = 2.0
power_theory = np_power_exp(lambda1_test, c, n_exp)
power_sim = np.mean([np.random.exponential(1/lambda1_test, n_exp).mean() < c for _ in range(50000)])

header('Exercise 4: NP Lemma for Exponential')
print(f'lambda0={lambda0}, n={n_exp}, alpha={alpha}')
print(f'Critical value c = {c:.6f}')
print(f'Simulated size: {size_sim:.4f} (target {alpha})')
print(f'Theoretical power at lambda1={lambda1_test}: {power_theory:.4f}')
print(f'Simulated power: {power_sim:.4f}')

check_true(f'Size ≈ alpha (within 0.01)', abs(size_sim - alpha) < 0.01)
check_true('Power > alpha (test is useful)', power_theory > alpha)
check_close('Theory matches simulation', power_theory, power_sim, tol=0.01)

print('\nTakeaway: The NP test (reject when Xbar < c) is UMP: the LRT for Exp(λ) reduces to a '
      'threshold on the sufficient statistic Xbar, and Exp has the MLR property in Xbar.')
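The derivation behind parts (a)-(b): the exponential likelihood is $\mathcal{L}(\lambda) = \lambda^n e^{-\lambda \sum_i x_i}$, so

$$\Lambda(\mathbf{x}) = \frac{\mathcal{L}(\lambda_1)}{\mathcal{L}(\lambda_0)} = \left(\frac{\lambda_1}{\lambda_0}\right)^{n} e^{-(\lambda_1 - \lambda_0)\sum_i x_i}.$$

Since $\lambda_1 > \lambda_0$, $\Lambda$ is strictly decreasing in $\sum_i x_i$, so $\Lambda > k$ is equivalent to $\bar{X} < c$ for some constant $c$ — the threshold form the code calibrates via $2\lambda_0 n \bar{X} \sim \chi^2_{2n}$.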

Exercise 5 ★★ — Multiple Testing Correction

A model is compared to baseline on 50 benchmarks (45 true nulls, 5 true alternatives).

(a) Simulate p-values: nulls $\sim \mathcal{U}(0,1)$, alternatives $\sim \text{Beta}(0.2, 1)$.

(b) Count discoveries with no correction at $\alpha = 0.05$.

(c) Apply Bonferroni correction and count discoveries.

(d) Apply BH correction at $q = 0.05$ and count discoveries.

(e) Across 1000 simulations, estimate empirical FWER and FDR for each method. Plot and compare.

Code cell 17

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 18

# Solution
def simulate_pvalues(m0=45, m1=5, seed=None):
    if seed is not None:
        np.random.seed(seed)
    pvals_null = np.random.uniform(0, 1, m0)
    pvals_alt  = np.random.beta(0.2, 1.0, m1)
    pvals = np.concatenate([pvals_null, pvals_alt])
    truth = np.array([False]*m0 + [True]*m1)
    return pvals, truth

def bh_reject(pvals, alpha=0.05):
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order]
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = ranked <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.where(passed)[0])
        reject[order[:k+1]] = True
    return reject

def apply_corrections(pvals, alpha=0.05):
    r_none = pvals < alpha
    r_bonf = pvals < alpha / len(pvals)
    r_bh = bh_reject(pvals, alpha=alpha)
    return {'none': np.array(r_none), 'bonf': np.array(r_bonf), 'bh': np.array(r_bh)}

# Single run
np.random.seed(42)
pvals, truth = simulate_pvalues()
rejections = apply_corrections(pvals)

header('Exercise 5: Multiple Testing Correction')
print(f'{"Method":<12} {"Disc":<8} {"TP":<8} {"FP"}')
for method in ['none', 'bonf', 'bh']:
    r = rejections[method]
    tp = (r & truth).sum(); fp = (r & ~truth).sum()
    print(f'{method:<12} {r.sum():<8} {tp:<8} {fp}')

# 1000-simulation comparison
n_sims_mt = 1000
fwer = {'none': [], 'bonf': [], 'bh': []}
fdr_sim = {'none': [], 'bonf': [], 'bh': []}
power_sim = {'none': [], 'bonf': [], 'bh': []}
m1 = 5

for i in range(n_sims_mt):
    pv, tr = simulate_pvalues(seed=i)
    rj = apply_corrections(pv)
    for meth in ['none', 'bonf', 'bh']:
        r = rj[meth]
        fp = (r & ~tr).sum(); tp = (r & tr).sum(); R = r.sum()
        fwer[meth].append(fp > 0)
        fdr_sim[meth].append(fp / max(R, 1))
        power_sim[meth].append(tp / m1)

print('\nSimulation results (1000 replications):')
print(f'{"Method":<12} {"FWER":<10} {"FDR":<10} {"Power"}')
for meth in ['none', 'bonf', 'bh']:
    print(f'{meth:<12} {np.mean(fwer[meth]):.3f}      '
          f'{np.mean(fdr_sim[meth]):.3f}      {np.mean(power_sim[meth]):.3f}')

check_true('Bonferroni FWER <= 0.06', np.mean(fwer['bonf']) <= 0.06)
check_true('BH FDR <= 0.06', np.mean(fdr_sim['bh']) <= 0.06)
check_true('BH power >= Bonferroni power', np.mean(power_sim['bh']) >= np.mean(power_sim['bonf']))
print('\nTakeaway: BH controls FDR with significantly higher power than Bonferroni for large m.')
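The step-up rule implemented in `bh_reject`: sort the p-values $p_{(1)} \leq \cdots \leq p_{(m)}$, find $k^* = \max\{k : p_{(k)} \leq k\alpha/m\}$, and reject the hypotheses with the $k^*$ smallest p-values. Under independence this controls $\text{FDR} \leq (m_0/m)\,\alpha$.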

Exercise 6 ★★ — Permutation Test for Two-Sample Means

Two LLMs are evaluated on 30 shared test prompts. Model A: $\mathcal{N}(0.72, 0.1^2)$, Model B: $\mathcal{N}(0.68, 0.1^2)$.

(a) Compute the observed mean difference.

(b) Implement a permutation test with $B = 10{,}000$ permutations.

(c) Compute the permutation p-value (two-sided).

(d) Compare to a Welch t-test p-value.

(e) Repeat 500 times under $H_0$ ($\mu_A = \mu_B = 0.70$) and verify Type I error $\leq 0.06$.

Code cell 20

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 21

# Solution
def permutation_test(x, y, B=10_000):
    obs_diff = x.mean() - y.mean()
    combined = np.concatenate([x, y])
    n1 = len(x)
    perm_diffs = np.empty(B)
    for i in range(B):
        perm = np.random.permutation(combined)
        perm_diffs[i] = perm[:n1].mean() - perm[n1:].mean()
    p_value = (np.abs(perm_diffs) >= np.abs(obs_diff)).mean()
    return obs_diff, p_value

np.random.seed(42)
scores_A = np.random.normal(0.72, 0.10, 30)
scores_B = np.random.normal(0.68, 0.10, 30)

obs_diff, p_perm = permutation_test(scores_A, scores_B)
_, p_welch = stats.ttest_ind(scores_A, scores_B, equal_var=False)

# Type I error simulation
np.random.seed(99)
n_sims_perm = 500
false_pos = 0
for _ in range(n_sims_perm):
    a = np.random.normal(0.70, 0.10, 30)
    b = np.random.normal(0.70, 0.10, 30)
    _, pp = permutation_test(a, b, B=999)
    if pp < 0.05:
        false_pos += 1

type1_rate = false_pos / n_sims_perm

header('Exercise 6: Permutation Test')
print(f'Observed diff: {obs_diff:.4f}')
print(f'Permutation p: {p_perm:.4f}')
print(f'Welch t p:     {p_welch:.4f}')
print(f'Type I error (H0 true): {type1_rate:.3f}')

check_true('p-perm and p-welch agree within 0.08', abs(p_perm - p_welch) < 0.08)
check_true(f'Type I error <= 0.07', type1_rate <= 0.07)
print('\nTakeaway: Permutation and Welch t-test agree closely for normal data; '
      'permutation provides valid exact inference with no distributional assumptions.')
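One refinement worth knowing: the solution uses the plug-in proportion $\frac{1}{B}\#\{b : |T_b| \geq |T_{\text{obs}}|\}$, which can be exactly zero. The exact finite-sample convention counts the observed statistic among the permutations,

$$p = \frac{1 + \#\{b : |T_b| \geq |T_{\text{obs}}|\}}{B + 1},$$

which is never zero and yields a valid test at any $B$.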

Exercise 7 ★★★ — Sequential A/B Test with SPRT

Two variants are tested on streaming requests: $H_0: p_A = p_B = 0.80$ vs. $H_1: p_A = 0.85,\ p_B = 0.80$. Set $\alpha = 0.05$, $\beta = 0.20$.

(a) Derive and compute the SPRT log-likelihood ratio increment per observation.

(b) Compute the Wald stopping boundaries $A$ and $B$.

(c) Simulate one sequential path and plot $\ell_n$ vs. $n$ with boundaries.

(d) Estimate expected stopping time under $H_0$ and $H_1$ (1000 simulations each).

(e) Verify empirical Type I error $\leq 0.05$ over 2000 runs under $H_0$.

Code cell 23

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 24

# Solution
def sprt_boundaries(alpha, beta):
    A = np.log((1 - beta) / alpha)
    B = np.log(beta / (1 - alpha))
    return A, B

def sprt_run(p_true, p0, p1, A, B, max_n=5000):
    llr = 0.0
    for n in range(1, max_n + 1):
        x = np.random.binomial(1, p_true)
        if x == 1:
            llr += np.log(p1 / p0)
        else:
            llr += np.log((1-p1) / (1-p0))
        if llr >= A:
            return n, 'reject'
        if llr <= B:
            return n, 'accept'
    return max_n, 'undecided'

p0_s, p1_s = 0.80, 0.85
alpha_s, beta_s = 0.05, 0.20
A_bound, B_bound = sprt_boundaries(alpha_s, beta_s)

# One example path under H1
np.random.seed(42)
llr_path = [0.0]
for xi in np.random.binomial(1, p1_s, 3000):
    d = np.log(p1_s/p0_s) if xi == 1 else np.log((1-p1_s)/(1-p0_s))
    llr_path.append(llr_path[-1] + d)
    if llr_path[-1] >= A_bound or llr_path[-1] <= B_bound:
        break

if HAS_MPL:
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(llr_path, color=COLORS['primary'], lw=1.5)
    ax.axhline(A_bound, color=COLORS['error'], linestyle='--', label=f'Reject H0 ({A_bound:.2f})')
    ax.axhline(B_bound, color=COLORS['tertiary'], linestyle='--', label=f'Accept H0 ({B_bound:.2f})')
    ax.set_xlabel('n'); ax.set_ylabel('Log-LR')
    ax.set_title(f'SPRT path (stopped at n={len(llr_path)-1})')
    ax.legend(); plt.tight_layout(); plt.show()

# Simulation study
np.random.seed(42)
n_runs = 2000
h0_results = [sprt_run(p0_s, p0_s, p1_s, A_bound, B_bound) for _ in range(n_runs)]
h1_results = [sprt_run(p1_s, p0_s, p1_s, A_bound, B_bound) for _ in range(n_runs)]

fp_rate  = sum(1 for _, d in h0_results if d == 'reject') / n_runs
tp_rate  = sum(1 for _, d in h1_results if d == 'reject') / n_runs
mean_n_h0 = np.mean([n for n, _ in h0_results])
mean_n_h1 = np.mean([n for n, _ in h1_results])

header('Exercise 7: SPRT Sequential A/B Test')
print(f'Boundaries: A={A_bound:.3f}, B={B_bound:.3f}')
print(f'Type I error: {fp_rate:.4f} (target <= {alpha_s})')
print(f'Power:        {tp_rate:.4f} (target >= {1-beta_s})')
print(f'Mean n under H0: {mean_n_h0:.0f}')
print(f'Mean n under H1: {mean_n_h1:.0f}')

check_true(f'Type I error <= alpha+0.01', fp_rate <= alpha_s + 0.01)
check_true(f'Power >= 1-beta-0.02', tp_rate >= (1-beta_s) - 0.02)
print('\nTakeaway: SPRT stops ~40% earlier under H1 vs fixed-n test, enabling faster A/B decisions.')
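The quantities behind parts (a)-(b): each Bernoulli observation contributes a log-likelihood-ratio increment

$$z_i = x_i \log\frac{p_1}{p_0} + (1 - x_i)\log\frac{1-p_1}{1-p_0},$$

here $\log(0.85/0.80) \approx +0.061$ per success and $\log(0.15/0.20) \approx -0.288$ per failure. Wald's boundaries are $A = \log\frac{1-\beta}{\alpha} = \log 16 \approx 2.773$ and $B = \log\frac{\beta}{1-\alpha} \approx -1.558$.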

Exercise 8 ★★★ — KS-Based Data Drift Detector

Build a drift detection system for an LLM serving pipeline. Reference: embedding norms $\sim \mathcal{N}(10, 1^2)$ from 5000 training samples.

(a) Simulate reference data and 3 production batches: no drift, mean drift (+0.5), severe drift (+2).

(b) Apply the two-sample KS test to each batch vs. reference.

(c) Apply BH correction across the 3 batch comparisons.

(d) Implement a sliding-window detector: alert if the last 3 consecutive days all have KS $p < 0.1$.

(e) Show KS is more sensitive than the t-test for scale-only drift (same mean, $\sigma: 1 \to 1.5$).

Code cell 26

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 27

# Solution
def ks_drift_test(reference, production):
    ks_stat, p_value = stats.ks_2samp(reference, production)
    return ks_stat, p_value

def sliding_window_alert(p_values, window=3, threshold=0.1):
    if len(p_values) < window:
        return False
    return all(p < threshold for p in p_values[-window:])

np.random.seed(42)
n_ref, n_prod = 5000, 1000
reference = np.random.normal(10.0, 1.0, n_ref)

batches = {
    'No drift':    np.random.normal(10.0, 1.0, n_prod),
    'Mean +0.5':   np.random.normal(10.5, 1.0, n_prod),
    'Severe +2.0': np.random.normal(12.0, 1.0, n_prod),
}

ks_stats, p_vals = [], []
for name, batch in batches.items():
    ks, p = ks_drift_test(reference, batch)
    ks_stats.append(ks); p_vals.append(p)


def bh_adjust(pvals, alpha=0.05):
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order]
    q_ranked = np.empty(m)
    running = 1.0
    for i in range(m - 1, -1, -1):
        running = min(running, ranked[i] * m / (i + 1))
        q_ranked[i] = running
    q_vals = np.empty(m)
    q_vals[order] = q_ranked
    reject = q_vals <= alpha
    return reject, q_vals

# BH correction
reject_bh, q_vals = bh_adjust(p_vals, alpha=0.05)

header('Exercise 8: KS Drift Detection')
for i, (name, _) in enumerate(batches.items()):
    print(f'  {name:<18}: KS={ks_stats[i]:.4f}, p={p_vals[i]:.6f}, '
          f'q={q_vals[i]:.4f}, alert={"YES" if reject_bh[i] else "no"}')

# Sliding window demo (7 daily batches)
daily_pvals = []
for day in range(7):
    mu = 10.0 if day < 4 else 10.8
    batch_d = np.random.normal(mu, 1.0, 500)
    _, p_d = ks_drift_test(reference, batch_d)
    daily_pvals.append(p_d)

print('\nSliding window (7 days):')
for i, p in enumerate(daily_pvals):
    alert = sliding_window_alert(daily_pvals[:i+1])
    print(f'  Day {i+1}: p={p:.4f}  {"ALERT" if alert else ""}')

# Scale-only drift: KS vs t-test
batch_scale = np.random.normal(10.0, 1.5, n_prod)  # same mean, bigger variance
ks_scale, p_ks_scale = ks_drift_test(reference, batch_scale)
_, p_t_scale = stats.ttest_ind(reference, batch_scale)

print('\nScale-only drift (sigma: 1 -> 1.5, mean unchanged):')
print(f'  KS p={p_ks_scale:.6f}  (detects scale change)')
print(f'  t-test p={p_t_scale:.4f}  (misses scale change)')

check_true('KS detects mean drift', p_vals[1] < 0.05)
check_true('KS detects severe drift', p_vals[2] < 0.001)
check_true('KS detects scale drift, t-test misses', p_ks_scale < 0.05 and p_t_scale > 0.10)
print('\nTakeaway: KS test is the right tool for production drift detection — '
      'it detects any distributional change, not just mean shifts.')
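The statistic behind `stats.ks_2samp` is the maximum gap between the two empirical CDFs,

$$D_{n,m} = \sup_x \left|\hat{F}_n(x) - \hat{G}_m(x)\right|,$$

which is why it responds to any distributional change — location, scale, or shape — while the t-test only targets a mean shift.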

Exercise 9 ★★★ — Benjamini-Hochberg FDR Control

A benchmark report tests 100 metrics, with a mixture of null and non-null effects.

Part (a): Implement Benjamini-Hochberg at target FDR $q = 0.1$.

Part (b): Compare discoveries with Bonferroni correction.

Part (c): Estimate the realized false discovery proportion in simulation.

Part (d): Explain why FDR is often more useful than family-wise error for large evaluation suites.

Code cell 29

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 30

# Solution
from scipy import stats

def bh_reject(pvals, q):
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order]
    thresh = q * np.arange(1, m + 1) / m
    passed = ranked <= thresh
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.where(passed)[0])
        reject[order[:k+1]] = True
    return reject

np.random.seed(42)  # reproducibility for the simulated z-scores
m = 100
m_alt = 25
z_null = np.random.randn(m - m_alt)
z_alt = np.random.randn(m_alt) + 2.5
z = np.concatenate([z_null, z_alt])
truth_alt = np.array([False]*(m-m_alt) + [True]*m_alt)
pvals = stats.norm.sf(z)  # one-sided upper-tail p-values (numerically stable)
rej_bh = bh_reject(pvals, 0.1)
rej_bonf = pvals <= 0.05 / m
fdp = ((rej_bh) & (~truth_alt)).sum() / max(rej_bh.sum(), 1)

header('Exercise 9: Benjamini-Hochberg FDR')
print(f'BH discoveries: {rej_bh.sum()}, FDP={fdp:.3f}')
print(f'Bonferroni discoveries: {rej_bonf.sum()}')
check_true('BH is at least as powerful as Bonferroni in this draw', rej_bh.sum() >= rej_bonf.sum())
check_true('Realized FDP is reasonable in this simulation', fdp <= 0.25)
print('Takeaway: BH accepts controlled false discoveries to gain power across many metrics.')
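For part (d): with $V$ false discoveries out of $R$ total rejections, $\text{FDR} = \mathbb{E}[V/\max(R,1)]$, whereas $\text{FWER} = P(V \geq 1)$. For a 100-metric suite, demanding $V = 0$ with high probability (FWER control) sacrifices most true discoveries; tolerating a small expected fraction of false ones (FDR control) keeps power usable.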

Exercise 10 ★★★ — Permutation Test for Benchmark Difference

Two models are evaluated on the same examples, so correctness outcomes are paired.

Part (a): Compute the observed paired accuracy difference.

Part (b): Use a sign-flip permutation test for the null of exchangeable paired differences.

Part (c): Compare with an unpaired test and explain why pairing matters.

Part (d): Connect the test to model-comparison claims on leaderboards.

Code cell 32

# Your Solution
print("Write your solution here, then compare with the reference solution below.")

Code cell 33

# Solution
from scipy import stats

np.random.seed(42)  # reproducibility
n = 800
base = (np.random.rand(n) < 0.76).astype(int)
improved = base.copy()
# The improved model fixes 15% of the baseline's errors and newly breaks 1%
# of its correct answers, for a net accuracy gain of roughly +2.8 pp.
# (The original 8%/3% rates gave a slightly NEGATIVE expected gain:
#  0.24*0.08 - 0.76*0.03 < 0, so the "detects improvement" check could not pass.)
flip_up = np.random.rand(n) < 0.15
flip_down = np.random.rand(n) < 0.01
improved[(base == 0) & flip_up] = 1
improved[(base == 1) & flip_down] = 0
d = improved - base
obs = d.mean()

B = 5000
signs = np.random.choice([-1, 1], size=(B, n))
perm_stats = (signs * d).mean(axis=1)
p_perm = np.mean(np.abs(perm_stats) >= abs(obs))
_, p_unpaired = stats.ttest_ind(improved, base, equal_var=False)

header('Exercise 10: Paired Permutation Test')
print(f'paired accuracy difference: {obs:.4f}')
print(f'paired permutation p-value: {p_perm:.4g}')
print(f'unpaired t-test p-value: {p_unpaired:.4g}')
check_true('Paired test detects the improvement', p_perm < 0.05)
print('Takeaway: paired tests use example-level correlation and are the right default for model-vs-model evaluation.')
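Why sign-flipping is the right null for part (b): under $H_0$ the paired differences $d_i$ are symmetric about zero, so $(s_1 d_1, \ldots, s_n d_n)$ has the same distribution as $(d_1, \ldots, d_n)$ for any signs $s_i \in \{-1, +1\}$. The permutation p-value compares the observed mean difference to this sign-flip distribution:

$$p = \frac{1}{B}\sum_{b=1}^{B} \mathbf{1}\left\{\left|\tfrac{1}{n}\textstyle\sum_i s_i^{(b)} d_i\right| \geq |\bar{d}|\right\}.$$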

What to Review After Finishing

  • Exercise 1 — Can you derive the t-statistic formula from first principles?
  • Exercise 2 — Do you understand why $df = k - 1$ (not $k$) for chi-squared GoF?
  • Exercise 3 — Can you explain why the required $n$ scales as $1/\delta^2$?
  • Exercise 4 — Can you state the NP Lemma and prove why the LRT is optimal?
  • Exercise 5 — Why does BH have higher power than Bonferroni? When would you choose Bonferroni?
  • Exercise 6 — When does the permutation test differ substantially from the t-test?
  • Exercise 7 — Why does optional stopping (peeking) inflate Type I error, and how does SPRT fix this? (A minimal simulation after this list illustrates the inflation.)
  • Exercise 8 — What is the trade-off between sensitivity and specificity in a drift detector?
  • Exercise 9 — Can you connect Benjamini-Hochberg FDR control to large-scale benchmark evaluation?
  • Exercise 10 — Why is the paired (sign-flip) permutation test the right default for model-vs-model comparison?
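The Exercise 7 review question deserves a demonstration. Below is a minimal sketch (with hypothetical settings: a z-test on $\mathcal{N}(0,1)$ data, peeking every 10 observations up to $n = 200$) showing how stopping at the first significant look inflates the Type I error far above the nominal 5%:

import numpy as np
from scipy import stats

np.random.seed(0)
alpha, n_max, look_every, n_sims = 0.05, 200, 10, 2000
z_crit = stats.norm.ppf(1 - alpha / 2)

rejections = 0
for _ in range(n_sims):
    x = np.random.normal(0.0, 1.0, n_max)   # H0 is true: the mean really is 0
    for n in range(look_every, n_max + 1, look_every):
        z = x[:n].mean() * n**0.5           # z-statistic with known sigma = 1
        if abs(z) > z_crit:                 # peek: stop at first "significant" look
            rejections += 1
            break

print(f'Type I error with peeking: {rejections / n_sims:.3f} (nominal {alpha})')

With 20 looks the false-positive rate lands well above 0.05 (typically around 0.2-0.3). SPRT's boundaries are calibrated for exactly this kind of continuous monitoring, which is why the empirical Type I error in Exercise 7 stays at $\alpha$.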

References

  1. Casella & Berger — Statistical Inference (2001), Ch. 8-9
  2. Lehmann & Romano — Testing Statistical Hypotheses (2005)
  3. Benjamini & Hochberg — 'Controlling the false discovery rate' (JRSS-B, 1995)
  4. Wald — Sequential Analysis (1947)
  5. Koehn — 'Statistical significance tests for MT' (EMNLP 2004)
  6. Johari et al. — 'Peeking at A/B Tests' (KDD 2017)
