Exercises Notebook
Converted from exercises.ipynb for web reading.
Hypothesis Testing — Exercises
10 exercises covering the complete hypothesis testing curriculum.
| Format | Description |
|---|---|
| Problem | Markdown cell with task description |
| Your Solution | Code cell with scaffolding |
| Solution | Code cell with reference solution and checks |
Difficulty Levels
| Level | Exercises | Focus |
|---|---|---|
| ★ | 1-3 | Core mechanics: t-test, chi-squared, power |
| ★★ | 4-6 | Theory: NP Lemma, multiple testing, permutation |
| ★★★ | 7-10 | AI applications: SPRT, KS drift detection |
Topic Map
| Topic | Exercise |
|---|---|
| One-sample t-test | 1 |
| Chi-squared goodness-of-fit | 2 |
| Power and sample size | 3 |
| Neyman-Pearson Lemma | 4 |
| Multiple testing correction | 5 |
| Permutation test | 6 |
| Sequential A/B testing (SPRT) | 7 |
| KS-based data drift detector | 8 |
Additional applied exercises: Exercise 9: Benjamini-Hochberg FDR; Exercise 10: Paired permutation benchmark test.
Code cell 2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
try:
import seaborn as sns
sns.set_theme(style="whitegrid", palette="colorblind")
HAS_SNS = True
except ImportError:
plt.style.use("seaborn-v0_8-whitegrid")
HAS_SNS = False
mpl.rcParams.update({
"figure.figsize": (10, 6),
"figure.dpi": 120,
"font.size": 13,
"axes.titlesize": 15,
"axes.labelsize": 13,
"xtick.labelsize": 11,
"ytick.labelsize": 11,
"legend.fontsize": 11,
"legend.framealpha": 0.85,
"lines.linewidth": 2.0,
"axes.spines.top": False,
"axes.spines.right": False,
"savefig.bbox": "tight",
"savefig.dpi": 150,
})
np.random.seed(42)
print("Plot setup complete.")
Code cell 3
import numpy as np
import numpy.linalg as la
from scipy import stats
try:
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams.update({'figure.figsize': (10, 6), 'figure.dpi': 110,
'axes.spines.top': False, 'axes.spines.right': False})
COLORS = {'primary': '#0077BB', 'secondary': '#EE7733',
'tertiary': '#009988', 'error': '#CC3311',
'neutral': '#555555', 'highlight': '#EE3377'}
HAS_MPL = True
except ImportError:
HAS_MPL = False
np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)
def header(title):
print('\n' + '=' * len(title))
print(title)
print('=' * len(title))
def check_close(name, got, expected, tol=1e-6):
ok = np.allclose(got, expected, atol=tol, rtol=tol)
print(f"{'PASS' if ok else 'FAIL'} — {name}")
if not ok:
print(f' expected: {expected}')
print(f' got: {got}')
return ok
def check_true(name, cond):
print(f"{'PASS' if cond else 'FAIL'} — {name}")
return cond
def cohen_h(p1, p2):
"""Cohen's h effect size for two proportions."""
return 2*np.arcsin(p1**0.5) - 2*np.arcsin(p2**0.5)
print('Setup complete.')
Exercise 1 ★ — One-Sample t-Test from Scratch
An LLM is tested for latency compliance. The SLA requires mean latency ≤ 45 ms. On 20 requests: mean = 47.3 ms, sample std = 8.1 ms.
(a) State $H_0$ and $H_1$. Is this one-sided or two-sided?
(b) Compute the t-statistic: $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$.
(c) Find the critical value $t_{0.95,\,19}$ (one-sided, $\alpha = 0.05$) from the $t$-distribution.
(d) Compute the exact one-sided p-value.
(e) State your conclusion. Compute Cohen's $d = (\bar{x} - \mu_0)/s$.
Code cell 5
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 6
# Solution
def one_sample_t_test(xbar, s, n, mu0):
t_stat = (xbar - mu0) / (s / n**0.5)
df = n - 1
# One-sided upper: P(T > t | H0)
p_one_sided = stats.t.sf(t_stat, df)
return t_stat, df, p_one_sided
xbar, s, n, mu0 = 47.3, 8.1, 20, 45.0
t_stat, df, p_one = one_sample_t_test(xbar, s, n, mu0)
t_crit = stats.t.ppf(0.95, df) # one-sided alpha=0.05
d_cohen = (xbar - mu0) / s
header('Exercise 1: One-Sample t-Test')
print(f'H0: mu <= 45 ms, H1: mu > 45 ms (one-sided upper)')
print(f't-statistic = {t_stat:.4f}')
print(f'df = {df}')
print(f'Critical val = {t_crit:.4f}')
print(f'p-value = {p_one:.4f}')
print(f"Cohen's d = {d_cohen:.4f}")
check_close('t-stat', t_stat, (47.3-45)/(8.1/20**0.5))
check_close('df', df, 19)
check_true('p > 0.05 (fail to reject H0)', p_one > 0.05)
check_true('t < t_crit', t_stat < t_crit)
# Cross-check with scipy on data simulated near the summary statistics
# (the `alternative` keyword requires SciPy >= 1.6)
np.random.seed(42)
data = np.random.normal(47.3, 8.1, 20)
t_scipy, p_scipy = stats.ttest_1samp(data, mu0, alternative='greater')
print(f'scipy one-sided p (simulated data): {p_scipy:.4f} (reference only)')
print(f'\nTakeaway: t = {t_stat:.2f} < t_crit = {t_crit:.2f} (p = {p_one:.3f}), so we cannot '
      f'conclude the 45 ms SLA is violated at α=0.05; the effect (d = {d_cohen:.2f}) is small '
      f'and n = 20 gives little power to detect it.')
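An optional cross-check, not part of the original exercise: the test decision at $\alpha = 0.05$ should agree with a one-sided 95% confidence bound for $\mu$. A minimal sketch using the quantities defined above:
# Optional cross-check: the test decision matches a one-sided confidence bound.
# We fail to reject H0 (mu <= 45) exactly when the 95% lower bound for mu is below 45.
lcb = xbar - stats.t.ppf(0.95, df) * s / n**0.5
print(f'95% lower confidence bound for mu: {lcb:.2f} ms')
check_true('CI decision matches test decision', (lcb < mu0) == (p_one > 0.05))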
Exercise 2 ★ — Chi-Squared Goodness-of-Fit
A text classifier should output uniformly across 5 categories.
On 500 test examples, observed counts: [87, 113, 95, 102, 103].
(a) State $H_0$ and compute expected counts.
(b) Compute the chi-squared statistic $\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$.
(c) Compute the p-value with $df = k - 1$.
(d) Reject or fail to reject at $\alpha = 0.05$? (Critical: $\chi^2_{0.95,\,4} \approx 9.488$)
(e) Compute Cramér's $V = \sqrt{\chi^2 / (n(k-1))}$ and interpret.
Code cell 8
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 9
# Solution
def chi_squared_gof(observed):
n = observed.sum()
k = len(observed)
expected = np.full(k, n / k)
chi2_stat = ((observed - expected)**2 / expected).sum()
df = k - 1
p_value = stats.chi2.sf(chi2_stat, df)
cramer_v = (chi2_stat / (n * (k - 1)))**0.5
return chi2_stat, p_value, cramer_v
observed = np.array([87, 113, 95, 102, 103])
chi2, p, V = chi_squared_gof(observed)
chi2_scipy, p_scipy = stats.chisquare(observed)
header('Exercise 2: Chi-Squared GoF')
print(f'Observed: {observed}, Total: {observed.sum()}')
print(f'Expected: {np.full(5, 100.0)} (uniform)')
print(f'chi2 = {chi2:.4f}, df = 4, p = {p:.4f}')
print(f"Cramér V = {V:.4f} (small-medium effect)")
check_close('chi2 statistic', chi2, chi2_scipy)
check_close('p-value', p, p_scipy)
check_true('p < 0.05 (reject uniform H0)', p < 0.05)
check_true('V in [0,1]', 0 <= V <= 1)
print('\nTakeaway: Category 2 is over-predicted; classifier is not uniform — check training data balance.')
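As an optional follow-up (a sketch, not part of the reference solution), standardized residuals $(O_i - E_i)/\sqrt{E_i}$ show how much each category contributes to $\chi^2$; values beyond roughly ±2 would single out a category, and here none do:
# Optional: per-category standardized residuals; none exceed +/-2 here.
expected = np.full(len(observed), observed.sum() / len(observed))
residuals = (observed - expected) / np.sqrt(expected)
for i, r in enumerate(residuals, start=1):
    print(f'  Category {i}: standardized residual = {r:+.2f}')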
Exercise 3 ★ — Power and Sample Size
You want to detect that model A has higher accuracy than model B: $p_A = 0.88$, $p_B = 0.85$.
(a) Compute Cohen's $h = 2\arcsin\sqrt{p_A} - 2\arcsin\sqrt{p_B}$ for this effect.
(b) Compute the required $n$ per group for 80% power at $\alpha = 0.05$ (two-sided).
(c) What is the power if only $n = 2000$ per group is available?
(d) Plot the power curve as a function of $n$ from 500 to 8000.
(e) Find the $n$ that gives 95% power.
Code cell 11
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 12
# Solution
def power_two_prop(p1, p2, n, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    pbar = (p1 + p2) / 2
    z_a = stats.norm.ppf(1 - alpha/2)
    se_h0 = (2 * pbar * (1-pbar) / n)**0.5    # SE under H0 (pooled)
    se_h1 = (p1*(1-p1)/n + p2*(1-p2)/n)**0.5  # SE under H1
    delta = abs(p1 - p2)
    power = (stats.norm.sf((z_a*se_h0 - delta) / se_h1)
             + stats.norm.cdf((-z_a*se_h0 - delta) / se_h1))
    return power
def n_required(p1, p2, alpha=0.05, target_power=0.80):
z_a = stats.norm.ppf(1 - alpha/2)
z_b = stats.norm.ppf(target_power)
pbar = (p1 + p2) / 2
num = (z_a * (2*pbar*(1-pbar))**0.5 + z_b * (p1*(1-p1) + p2*(1-p2))**0.5)**2
return int(np.ceil(num / (p1 - p2)**2))
p_A, p_B = 0.88, 0.85
h = cohen_h(p_A, p_B)
n_req = n_required(p_A, p_B)
power_2000 = power_two_prop(p_A, p_B, n=2000)
# Find n for 95% power
n_search = np.arange(500, 15000, 50)
powers = [power_two_prop(p_A, p_B, n) for n in n_search]
n_95 = n_search[np.argmax(np.array(powers) >= 0.95)]
header('Exercise 3: Power and Sample Size')
print(f"Cohen's h = {h:.4f}")
print(f'n required for 80% power: {n_req}')
print(f'Power at n=2000: {power_2000:.4f}')
print(f'n for 95% power: {n_95}')
if HAS_MPL:
fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(n_search, powers, color=COLORS['primary'], lw=2)
ax.axhline(0.80, color=COLORS['error'], linestyle='--', label='80% power')
ax.axhline(0.95, color=COLORS['secondary'], linestyle='--', label='95% power')
ax.axvline(n_req, color=COLORS['error'], linestyle=':', alpha=0.7)
ax.axvline(n_95, color=COLORS['secondary'], linestyle=':', alpha=0.7)
ax.set_xlabel('n per group'); ax.set_ylabel('Power')
ax.set_title(f'Power curve: pA={p_A}, pB={p_B}, α=0.05')
ax.legend(); plt.tight_layout(); plt.show()
check_true("Cohen's h > 0", h > 0)
check_true('n_req is positive integer', n_req > 0)
check_true('power at n_req >= 0.78', power_two_prop(p_A, p_B, n_req) >= 0.78)
print('\nTakeaway: detecting a 3% accuracy gap needs ~2,000 examples per model (~4,000 total) '
      'for 80% power, more than many benchmark test sets provide!')
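The analytic power formula is a normal approximation; a quick Monte Carlo sketch (2000 replications, parameters as above; not part of the original exercise) checks it at $n = 2000$:
# Monte Carlo check of the analytic power at n = 2000.
np.random.seed(0)
n_check, n_mc, rej = 2000, 2000, 0
z975 = stats.norm.ppf(0.975)
for _ in range(n_mc):
    x1 = np.random.binomial(n_check, p_A)
    x2 = np.random.binomial(n_check, p_B)
    pp = (x1 + x2) / (2 * n_check)              # pooled proportion under H0
    se = (2 * pp * (1 - pp) / n_check) ** 0.5
    z = (x1 - x2) / n_check / se
    rej += abs(z) > z975
print(f'Monte Carlo power at n=2000: {rej/n_mc:.3f} (analytic: {power_2000:.3f})')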
Exercise 4 ★★ — Neyman-Pearson Lemma for Exponential Distribution
$X_1, \dots, X_n \overset{\text{iid}}{\sim} \text{Exp}(\lambda)$. Test $H_0: \lambda = \lambda_0$ vs. $H_1: \lambda = \lambda_1$ with $\lambda_1 > \lambda_0$.
(a) Write the likelihood ratio $\Lambda(x) = L(\lambda_1)/L(\lambda_0)$ and simplify.
(b) Show that rejecting when $\Lambda(x) > k$ is equivalent to rejecting when $\bar{X} < c$.
(c) Find $c$ in terms of $\lambda_0$, $n$, $\alpha$, using the fact that $2\lambda_0 n \bar{X} \sim \chi^2_{2n}$ under $H_0$.
(d) Verify: for $\lambda_0 = 1$, $n = 10$, $\alpha = 0.05$, confirm the test has correct size.
(e) Compare the theoretical power at $\lambda_1 = 2$ to a simulated estimate, and explain why the NP test is UMP here.
Code cell 14
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 15
# Solution
def np_critical_value_exp(lambda0, n, alpha):
# 2*lambda0*n*Xbar ~ chi2(2n)
# Reject when Xbar < c, i.e., 2*lambda0*n*Xbar < 2*lambda0*n*c
# P(chi2(2n) < chi2_alpha) = alpha -> chi2_lo = chi2(2n).ppf(alpha)
chi2_lo = stats.chi2.ppf(alpha, df=2*n)
c = chi2_lo / (2 * lambda0 * n)
return c
def np_power_exp(lambda1, c, n):
# P(Xbar < c | lambda=lambda1) = P(2*lambda1*n*Xbar < 2*lambda1*n*c)
return stats.chi2.cdf(2*lambda1*n*c, df=2*n)
lambda0, n_exp, alpha = 1.0, 10, 0.05
c = np_critical_value_exp(lambda0, n_exp, alpha)
# Verify size by simulation
np.random.seed(42)
size_sim = np.mean([np.random.exponential(1/lambda0, n_exp).mean() < c for _ in range(50000)])
# Power at lambda1 = 2.0
lambda1_test = 2.0
power_theory = np_power_exp(lambda1_test, c, n_exp)
power_sim = np.mean([np.random.exponential(1/lambda1_test, n_exp).mean() < c for _ in range(50000)])
header('Exercise 4: NP Lemma for Exponential')
print(f'lambda0={lambda0}, n={n_exp}, alpha={alpha}')
print(f'Critical value c = {c:.6f}')
print(f'Simulated size: {size_sim:.4f} (target {alpha})')
print(f'Theoretical power at lambda1={lambda1_test}: {power_theory:.4f}')
print(f'Simulated power: {power_sim:.4f}')
check_true(f'Size ≈ alpha (within 0.01)', abs(size_sim - alpha) < 0.01)
check_true('Power > alpha (test is useful)', power_theory > alpha)
check_close('Theory matches simulation', power_theory, power_sim, tol=0.01)
print('\nTakeaway: The NP test (reject when Xbar < c) is UMP: the LRT for Exp(λ) reduces to a '
'threshold on the sufficient statistic Xbar, and Exp has the MLR property in Xbar.')
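For reference, the algebra behind parts (a)-(c):

$$\Lambda(x) = \frac{\prod_{i=1}^n \lambda_1 e^{-\lambda_1 x_i}}{\prod_{i=1}^n \lambda_0 e^{-\lambda_0 x_i}} = \left(\frac{\lambda_1}{\lambda_0}\right)^{n} \exp\!\Big(-(\lambda_1 - \lambda_0)\sum_{i} x_i\Big).$$

With $\lambda_1 > \lambda_0$, $\Lambda$ is strictly decreasing in $\sum_i x_i$, so $\Lambda(x) > k \iff \bar{X} < c$. Under $H_0$, each $2\lambda_0 X_i \sim \chi^2_2$, hence $2\lambda_0 n \bar{X} \sim \chi^2_{2n}$, which gives $c = \chi^2_{2n,\alpha} / (2\lambda_0 n)$, matching np_critical_value_exp above.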
Exercise 5 ★★ — Multiple Testing Correction
A model is compared to baseline on 50 benchmarks (45 true nulls, 5 true alternatives).
(a) Simulate p-values: nulls $\sim \text{Uniform}(0,1)$, alternatives $\sim \text{Beta}(0.2, 1)$.
(b) Count discoveries with no correction at $\alpha = 0.05$.
(c) Apply Bonferroni correction and count discoveries.
(d) Apply BH correction at $q = 0.05$ and count discoveries.
(e) Across 1000 simulations, estimate empirical FWER, FDR, and power for each method and compare.
Code cell 17
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 18
# Solution
def simulate_pvalues(m0=45, m1=5, seed=None):
if seed is not None:
np.random.seed(seed)
pvals_null = np.random.uniform(0, 1, m0)
pvals_alt = np.random.beta(0.2, 1.0, m1)
pvals = np.concatenate([pvals_null, pvals_alt])
truth = np.array([False]*m0 + [True]*m1)
return pvals, truth
def bh_reject(pvals, alpha=0.05):
pvals = np.asarray(pvals)
m = len(pvals)
order = np.argsort(pvals)
ranked = pvals[order]
thresholds = alpha * np.arange(1, m + 1) / m
passed = ranked <= thresholds
reject = np.zeros(m, dtype=bool)
if passed.any():
k = np.max(np.where(passed)[0])
reject[order[:k+1]] = True
return reject
def apply_corrections(pvals, alpha=0.05):
r_none = pvals < alpha
r_bonf = pvals < alpha / len(pvals)
r_bh = bh_reject(pvals, alpha=alpha)
return {'none': np.array(r_none), 'bonf': np.array(r_bonf), 'bh': np.array(r_bh)}
# Single run
np.random.seed(42)
pvals, truth = simulate_pvalues()
rejections = apply_corrections(pvals)
header('Exercise 5: Multiple Testing Correction')
print(f'{"Method":<12} {"Disc":<8} {"TP":<8} {"FP"}')
for method in ['none', 'bonf', 'bh']:
r = rejections[method]
tp = (r & truth).sum(); fp = (r & ~truth).sum()
print(f'{method:<12} {r.sum():<8} {tp:<8} {fp}')
# 1000-simulation comparison
n_sims_mt = 1000
fwer = {'none': [], 'bonf': [], 'bh': []}
fdr_sim = {'none': [], 'bonf': [], 'bh': []}
power_sim = {'none': [], 'bonf': [], 'bh': []}
m1 = 5
for i in range(n_sims_mt):
pv, tr = simulate_pvalues(seed=i)
rj = apply_corrections(pv)
for meth in ['none', 'bonf', 'bh']:
r = rj[meth]
fp = (r & ~tr).sum(); tp = (r & tr).sum(); R = r.sum()
fwer[meth].append(fp > 0)
fdr_sim[meth].append(fp / max(R, 1))
power_sim[meth].append(tp / m1)
print('\nSimulation results (1000 replications):')
print(f'{"Method":<12} {"FWER":<10} {"FDR":<10} {"Power"}')
for meth in ['none', 'bonf', 'bh']:
print(f'{meth:<12} {np.mean(fwer[meth]):.3f} '
f'{np.mean(fdr_sim[meth]):.3f} {np.mean(power_sim[meth]):.3f}')
check_true('Bonferroni FWER <= 0.06', np.mean(fwer['bonf']) <= 0.06)
check_true('BH FDR <= 0.06', np.mean(fdr_sim['bh']) <= 0.06)
check_true('BH power >= Bonferroni power', np.mean(power_sim['bh']) >= np.mean(power_sim['bonf']))
print('\nTakeaway: BH controls FDR with significantly higher power than Bonferroni for large m.')
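If statsmodels is installed (it is not imported in the setup cell, so this is an optional sketch), its multipletests routine can cross-check the hand-rolled BH implementation on the single-run p-values:
# Optional cross-check of bh_reject against statsmodels, if available.
try:
    from statsmodels.stats.multitest import multipletests
    rej_sm = multipletests(pvals, alpha=0.05, method='fdr_bh')[0]
    print('statsmodels fdr_bh agrees with bh_reject:',
          np.array_equal(rej_sm, rejections['bh']))
except ImportError:
    print('statsmodels not installed; skipping cross-check.')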
Exercise 6 ★★ — Permutation Test for Two-Sample Means
Two LLMs are evaluated on 30 shared test prompts. Model A scores $\sim \mathcal{N}(0.72, 0.10^2)$, Model B scores $\sim \mathcal{N}(0.68, 0.10^2)$.
(a) Compute the observed mean difference.
(b) Implement a permutation test with $B = 10{,}000$ permutations.
(c) Compute the permutation p-value (two-sided).
(d) Compare to a Welch t-test p-value.
(e) Repeat 500 times under $H_0$ (both models $\sim \mathcal{N}(0.70, 0.10^2)$) and verify Type I error $\approx 0.05$.
Code cell 20
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 21
# Solution
def permutation_test(x, y, B=10_000):
obs_diff = x.mean() - y.mean()
combined = np.concatenate([x, y])
n1 = len(x)
perm_diffs = np.empty(B)
for i in range(B):
perm = np.random.permutation(combined)
perm_diffs[i] = perm[:n1].mean() - perm[n1:].mean()
p_value = (np.abs(perm_diffs) >= np.abs(obs_diff)).mean()
return obs_diff, p_value
np.random.seed(42)
scores_A = np.random.normal(0.72, 0.10, 30)
scores_B = np.random.normal(0.68, 0.10, 30)
obs_diff, p_perm = permutation_test(scores_A, scores_B)
_, p_welch = stats.ttest_ind(scores_A, scores_B, equal_var=False)
# Type I error simulation
np.random.seed(99)
n_sims_perm = 500
false_pos = 0
for _ in range(n_sims_perm):
a = np.random.normal(0.70, 0.10, 30)
b = np.random.normal(0.70, 0.10, 30)
_, pp = permutation_test(a, b, B=999)
if pp < 0.05:
false_pos += 1
type1_rate = false_pos / n_sims_perm
header('Exercise 6: Permutation Test')
print(f'Observed diff: {obs_diff:.4f}')
print(f'Permutation p: {p_perm:.4f}')
print(f'Welch t p: {p_welch:.4f}')
print(f'Type I error (H0 true): {type1_rate:.3f}')
check_true('p-perm and p-welch agree within 0.08', abs(p_perm - p_welch) < 0.08)
check_true(f'Type I error <= 0.07', type1_rate <= 0.07)
print('\nTakeaway: Permutation and Welch t-test agree closely for normal data; '
'permutation provides valid exact inference with no distributional assumptions.')
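SciPy ≥ 1.7 ships a built-in stats.permutation_test; as an optional cross-check (a sketch, guarded in case the installed SciPy predates it):
# Optional cross-check with scipy's built-in permutation test (SciPy >= 1.7).
try:
    res = stats.permutation_test(
        (scores_A, scores_B),
        lambda x, y: np.mean(x) - np.mean(y),
        permutation_type='independent',
        n_resamples=9999,
        random_state=42,
    )
    print(f'scipy permutation p: {res.pvalue:.4f} (hand-rolled: {p_perm:.4f})')
except AttributeError:
    print('scipy.stats.permutation_test unavailable; skipping cross-check.')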
Exercise 7 ★★★ — Sequential A/B Test with SPRT
Two variants are tested on streaming requests: $H_0: p = 0.80$ vs. $H_1: p = 0.85$. Set $\alpha = 0.05$, $\beta = 0.20$.
(a) Derive and compute the SPRT log-likelihood ratio increment per observation.
(b) Compute the Wald stopping boundaries $A = \log\frac{1-\beta}{\alpha}$ and $B = \log\frac{\beta}{1-\alpha}$.
(c) Simulate one sequential path and plot the cumulative log-likelihood ratio vs. $n$ with the boundaries.
(d) Estimate the expected stopping time under $H_0$ and $H_1$ (2000 simulations each).
(e) Verify the empirical Type I error $\le \alpha$ over 2000 runs under $H_0$.
Code cell 23
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 24
# Solution
def sprt_boundaries(alpha, beta):
A = np.log((1 - beta) / alpha)
B = np.log(beta / (1 - alpha))
return A, B
def sprt_run(p_true, p0, p1, A, B, max_n=5000):
llr = 0.0
for n in range(1, max_n + 1):
x = np.random.binomial(1, p_true)
if x == 1:
llr += np.log(p1 / p0)
else:
llr += np.log((1-p1) / (1-p0))
if llr >= A:
return n, 'reject'
if llr <= B:
return n, 'accept'
return max_n, 'undecided'
p0_s, p1_s = 0.80, 0.85
alpha_s, beta_s = 0.05, 0.20
A_bound, B_bound = sprt_boundaries(alpha_s, beta_s)
# One example path under H1
np.random.seed(42)
llr_path = [0.0]
for xi in np.random.binomial(1, p1_s, 3000):
d = np.log(p1_s/p0_s) if xi == 1 else np.log((1-p1_s)/(1-p0_s))
llr_path.append(llr_path[-1] + d)
if llr_path[-1] >= A_bound or llr_path[-1] <= B_bound:
break
if HAS_MPL:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(llr_path, color=COLORS['primary'], lw=1.5)
ax.axhline(A_bound, color=COLORS['error'], linestyle='--', label=f'Reject H0 ({A_bound:.2f})')
ax.axhline(B_bound, color=COLORS['tertiary'], linestyle='--', label=f'Accept H0 ({B_bound:.2f})')
ax.set_xlabel('n'); ax.set_ylabel('Log-LR')
ax.set_title(f'SPRT path (stopped at n={len(llr_path)-1})')
ax.legend(); plt.tight_layout(); plt.show()
# Simulation study
np.random.seed(42)
n_runs = 2000
h0_results = [sprt_run(p0_s, p0_s, p1_s, A_bound, B_bound) for _ in range(n_runs)]
h1_results = [sprt_run(p1_s, p0_s, p1_s, A_bound, B_bound) for _ in range(n_runs)]
fp_rate = sum(1 for _, d in h0_results if d == 'reject') / n_runs
tp_rate = sum(1 for _, d in h1_results if d == 'reject') / n_runs
mean_n_h0 = np.mean([n for n, _ in h0_results])
mean_n_h1 = np.mean([n for n, _ in h1_results])
header('Exercise 7: SPRT Sequential A/B Test')
print(f'Boundaries: A={A_bound:.3f}, B={B_bound:.3f}')
print(f'Type I error: {fp_rate:.4f} (target <= {alpha_s})')
print(f'Power: {tp_rate:.4f} (target >= {1-beta_s})')
print(f'Mean n under H0: {mean_n_h0:.0f}')
print(f'Mean n under H1: {mean_n_h1:.0f}')
check_true(f'Type I error <= alpha+0.01', fp_rate <= alpha_s + 0.01)
check_true(f'Power >= 1-beta-0.02', tp_rate >= (1-beta_s) - 0.02)
print('\nTakeaway: SPRT stops ~40% earlier under H1 vs fixed-n test, enabling faster A/B decisions.')
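Wald's classical approximation for the expected stopping time provides a closed-form sanity check on the simulated means; a sketch using the boundaries and error probabilities defined above:
# Wald's approximation: E[N] ~ (P(hit A)*A + P(hit B)*B) / E[per-step LLR].
def llr_mean(p_true, p0, p1):
    return p_true * np.log(p1/p0) + (1 - p_true) * np.log((1-p1)/(1-p0))
e_n_h0 = (alpha_s * A_bound + (1 - alpha_s) * B_bound) / llr_mean(p0_s, p0_s, p1_s)
e_n_h1 = ((1 - beta_s) * A_bound + beta_s * B_bound) / llr_mean(p1_s, p0_s, p1_s)
print(f'Wald approx E[N | H0]: {e_n_h0:.0f} (simulated: {mean_n_h0:.0f})')
print(f'Wald approx E[N | H1]: {e_n_h1:.0f} (simulated: {mean_n_h1:.0f})')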
Exercise 8 ★★★ — KS-Based Data Drift Detector
Build a drift detection system for an LLM serving pipeline. Reference: embedding norms from 5000 training samples.
(a) Simulate reference data and 3 production batches: no drift, mean drift (+0.5), severe drift (+2).
(b) Apply two-sample KS test to each batch vs. reference.
(c) Apply BH correction across the 3 batch comparisons.
(d) Implement a sliding-window detector: alert if the last 3 consecutive days all have KS $p < 0.1$.
(e) Show KS is more sensitive than the t-test for scale-only drift (same mean, $\sigma: 1.0 \to 1.5$).
Code cell 26
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 27
# Solution
def ks_drift_test(reference, production):
ks_stat, p_value = stats.ks_2samp(reference, production)
return ks_stat, p_value
def sliding_window_alert(p_values, window=3, threshold=0.1):
if len(p_values) < window:
return False
return all(p < threshold for p in p_values[-window:])
np.random.seed(42)
n_ref, n_prod = 5000, 1000
reference = np.random.normal(10.0, 1.0, n_ref)
batches = {
'No drift': np.random.normal(10.0, 1.0, n_prod),
'Mean +0.5': np.random.normal(10.5, 1.0, n_prod),
'Severe +2.0': np.random.normal(12.0, 1.0, n_prod),
}
ks_stats, p_vals = [], []
for name, batch in batches.items():
ks, p = ks_drift_test(reference, batch)
ks_stats.append(ks); p_vals.append(p)
def bh_adjust(pvals, alpha=0.05):
pvals = np.asarray(pvals, dtype=float)
m = len(pvals)
order = np.argsort(pvals)
ranked = pvals[order]
q_ranked = np.empty(m)
running = 1.0
for i in range(m - 1, -1, -1):
running = min(running, ranked[i] * m / (i + 1))
q_ranked[i] = running
q_vals = np.empty(m)
q_vals[order] = q_ranked
reject = q_vals <= alpha
return reject, q_vals
# BH correction
reject_bh, q_vals = bh_adjust(p_vals, alpha=0.05)
header('Exercise 8: KS Drift Detection')
for i, (name, _) in enumerate(batches.items()):
print(f' {name:<18}: KS={ks_stats[i]:.4f}, p={p_vals[i]:.6f}, '
f'q={q_vals[i]:.4f}, alert={"YES" if reject_bh[i] else "no"}')
# Sliding window demo (7 daily batches)
daily_pvals = []
for day in range(7):
mu = 10.0 if day < 4 else 10.8
batch_d = np.random.normal(mu, 1.0, 500)
_, p_d = ks_drift_test(reference, batch_d)
daily_pvals.append(p_d)
print('\nSliding window (7 days):')
for i, p in enumerate(daily_pvals):
alert = sliding_window_alert(daily_pvals[:i+1])
print(f' Day {i+1}: p={p:.4f} {"ALERT" if alert else ""}')
# Scale-only drift: KS vs t-test
batch_scale = np.random.normal(10.0, 1.5, n_prod) # same mean, bigger variance
ks_scale, p_ks_scale = ks_drift_test(reference, batch_scale)
_, p_t_scale = stats.ttest_ind(reference, batch_scale)
print('\nScale-only drift (sigma: 1 -> 1.5, mean unchanged):')
print(f' KS p={p_ks_scale:.6f} (detects scale change)')
print(f' t-test p={p_t_scale:.4f} (misses scale change)')
check_true('KS detects mean drift', p_vals[1] < 0.05)
check_true('KS detects severe drift', p_vals[2] < 0.001)
check_true('KS detects scale drift, t-test misses', p_ks_scale < 0.05 and p_t_scale > 0.10)
print('\nTakeaway: KS test is the right tool for production drift detection — '
'it detects any distributional change, not just mean shifts.')
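For intuition, the two-sample KS test's large-sample rejection threshold has a closed form, $D_{\text{crit}} = c(\alpha)\sqrt{(n_1+n_2)/(n_1 n_2)}$ with $c(\alpha) = \sqrt{-\ln(\alpha/2)/2}$; a quick sketch applies it to the three batches above:
# Asymptotic KS critical value at alpha = 0.05 (c(alpha) ~ 1.358).
c_alpha = np.sqrt(-np.log(0.05 / 2) / 2)
d_crit = c_alpha * np.sqrt((n_ref + n_prod) / (n_ref * n_prod))
print(f'Asymptotic D_crit (alpha=0.05): {d_crit:.4f}')
for name, ks in zip(batches.keys(), ks_stats):
    print(f'  {name:<18}: D={ks:.4f} -> {"drift" if ks > d_crit else "no drift"}')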
Exercise 9 ★★★ — Benjamini-Hochberg FDR Control
A benchmark report tests 100 metrics, with a mixture of null and non-null effects.
(a) Implement Benjamini-Hochberg at target FDR $q = 0.1$.
(b) Compare discoveries with Bonferroni correction.
(c) Estimate the realized false discovery proportion in simulation.
(d) Explain why FDR is often more useful than family-wise error for large evaluation suites.
Code cell 29
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 30
# Solution
from scipy import stats
def bh_reject(pvals, q):
pvals = np.asarray(pvals)
m = len(pvals)
order = np.argsort(pvals)
ranked = pvals[order]
thresh = q * np.arange(1, m + 1) / m
passed = ranked <= thresh
reject = np.zeros(m, dtype=bool)
if passed.any():
k = np.max(np.where(passed)[0])
reject[order[:k+1]] = True
return reject
np.random.seed(42)
m = 100
m_alt = 25
z_null = np.random.randn(m - m_alt)
z_alt = np.random.randn(m_alt) + 2.5
z = np.concatenate([z_null, z_alt])
truth_alt = np.array([False]*(m-m_alt) + [True]*m_alt)
pvals = stats.norm.sf(z)  # one-sided upper-tail p-values
rej_bh = bh_reject(pvals, 0.1)
rej_bonf = pvals <= 0.05 / m
fdp = ((rej_bh) & (~truth_alt)).sum() / max(rej_bh.sum(), 1)
header('Exercise 9: Benjamini-Hochberg FDR')
print(f'BH discoveries: {rej_bh.sum()}, FDP={fdp:.3f}')
print(f'Bonferroni discoveries: {rej_bonf.sum()}')
check_true('BH is at least as powerful as Bonferroni in this draw', rej_bh.sum() >= rej_bonf.sum())
check_true('Realized FDP is reasonable in this simulation', fdp <= 0.25)
print('Takeaway: BH accepts controlled false discoveries to gain power across many metrics.')
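A single draw gives a noisy FDP; a short replication loop (a sketch extending part (c)) estimates the mean FDP, which is the quantity BH actually controls, here at $\pi_0 q = 0.75 \times 0.1 = 0.075$ under independence:
# Estimate E[FDP] over replications: BH controls this at pi0*q <= 0.1.
fdps = []
for rep in range(500):
    rng = np.random.default_rng(rep)
    z_rep = np.concatenate([rng.standard_normal(m - m_alt),
                            rng.standard_normal(m_alt) + 2.5])
    p_rep = stats.norm.sf(z_rep)
    rej_rep = bh_reject(p_rep, 0.1)
    fdps.append((rej_rep & ~truth_alt).sum() / max(rej_rep.sum(), 1))
print(f'Mean FDP over 500 replications: {np.mean(fdps):.3f} (target <= 0.1)')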
Exercise 10 ★★★ — Permutation Test for Benchmark Difference
Two models are evaluated on the same examples, so correctness outcomes are paired.
(a) Compute the observed paired accuracy difference.
(b) Use a sign-flip permutation test for the null of exchangeable paired differences.
(c) Compare with an unpaired test and explain why pairing matters.
(d) Connect the test to model-comparison claims on leaderboards.
Code cell 32
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 33
# Solution
from scipy import stats
np.random.seed(42)
n = 800
base = (np.random.rand(n) < 0.76).astype(int)
improved = base.copy()
# Improved model fixes 25% of the baseline's errors and breaks 2% of its correct
# answers; net expected gain = 0.24*0.25 - 0.76*0.02 ~ +0.045 accuracy.
flip_up = np.random.rand(n) < 0.25
flip_down = np.random.rand(n) < 0.02
improved[(base == 0) & flip_up] = 1
improved[(base == 1) & flip_down] = 0
d = improved - base
obs = d.mean()
B = 5000
signs = np.random.choice([-1, 1], size=(B, n))
perm_stats = (signs * d).mean(axis=1)
p_perm = np.mean(np.abs(perm_stats) >= abs(obs))
_, p_unpaired = stats.ttest_ind(improved, base, equal_var=False)
header('Exercise 10: Paired Permutation Test')
print(f'paired accuracy difference: {obs:.4f}')
print(f'paired permutation p-value: {p_perm:.4g}')
print(f'unpaired t-test p-value: {p_unpaired:.4g}')
check_true('Paired test detects the improvement', p_perm < 0.05)
print('Takeaway: paired tests use example-level correlation and are the right default for model-vs-model evaluation.')
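For paired binary outcomes, the classical alternative is McNemar's test, which uses only the discordant pairs; a sketch using SciPy's exact binomial test (stats.binomtest, SciPy ≥ 1.7):
# McNemar-style exact test: only discordant pairs are informative under H0.
n01 = int(((base == 0) & (improved == 1)).sum())  # baseline wrong, improved right
n10 = int(((base == 1) & (improved == 0)).sum())  # baseline right, improved wrong
p_mcnemar = stats.binomtest(n01, n01 + n10, 0.5).pvalue
print(f'Discordant pairs: {n01} up vs {n10} down; McNemar exact p = {p_mcnemar:.4g}')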
What to Review After Finishing
- Exercise 1 — Can you derive the t-statistic formula from first principles?
- Exercise 2 — Do you understand why $df = k - 1$ (not $k$) for chi-squared GoF?
- Exercise 3 — Can you explain why the required $n$ scales as $1/h^2$? (See the quick check after this list.)
- Exercise 4 — Can you state the NP Lemma and prove why LRT is optimal?
- Exercise 5 — Why does BH have higher power than Bonferroni? When would you choose Bonferroni?
- Exercise 6 — When does the permutation test differ substantially from the t-test?
- Exercise 7 — Why does optional stopping (peeking) inflate Type I error, and how does SPRT fix this?
- Exercise 8 — What is the trade-off between sensitivity and specificity in a drift detector?
- Exercise 9 — Benjamini-Hochberg FDR: can you connect the computation to statistical ML practice?
- Exercise 10 — Paired permutation benchmark test: can you connect the computation to statistical ML practice?
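A quick numeric check of the Exercise 3 scaling claim (a sketch reusing n_required from the Exercise 3 solution): halving the accuracy gap should roughly quadruple the required sample size.
# n scales roughly as 1/h^2: halving the gap ~quadruples the required n.
n_gap3 = n_required(0.88, 0.85)    # 3% gap
n_gap15 = n_required(0.88, 0.865)  # 1.5% gap
print(f'n (3% gap): {n_gap3}, n (1.5% gap): {n_gap15}, ratio: {n_gap15/n_gap3:.1f}')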
References
- Casella & Berger — Statistical Inference (2001), Ch. 8-9
- Lehmann & Romano — Testing Statistical Hypotheses (2005)
- Benjamini & Hochberg — 'Controlling the false discovery rate' (JRSS-B, 1995)
- Wald — Sequential Analysis (1947)
- Koehn — 'Statistical significance tests for MT' (EMNLP 2004)
- Johari et al. — 'Peeking at A/B Tests' (KDD 2017)