Exercises Notebook
Converted from exercises.ipynb for web reading.
Hypothesis Testing — Exercises
10 exercises covering the complete hypothesis testing curriculum.
| Format | Description |
|---|---|
| Problem | Markdown cell with task description |
| Your Solution | Code cell with scaffolding |
| Solution | Code cell with reference solution and checks |
Difficulty Levels
| Level | Exercises | Focus |
|---|---|---|
| ★ | 1-3 | Core mechanics: t-test, chi-squared, power |
| ★★ | 4-6 | Theory: NP Lemma, multiple testing, permutation |
| ★★★ | 7-10 | AI applications: SPRT, KS drift detection |
Topic Map
| Topic | Exercise |
|---|---|
| One-sample t-test | 1 |
| Chi-squared goodness-of-fit | 2 |
| Power and sample size | 3 |
| Neyman-Pearson Lemma | 4 |
| Multiple testing correction | 5 |
| Permutation test | 6 |
| Sequential A/B testing (SPRT) | 7 |
| KS-based data drift detector | 8 |
Additional applied exercises: Exercise 9: Benjamini-Hochberg FDR; Exercise 10: Paired permutation benchmark test.
Code cell 2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
try:
import seaborn as sns
sns.set_theme(style="whitegrid", palette="colorblind")
HAS_SNS = True
except ImportError:
plt.style.use("seaborn-v0_8-whitegrid")
HAS_SNS = False
mpl.rcParams.update({
"figure.figsize": (10, 6),
"figure.dpi": 120,
"font.size": 13,
"axes.titlesize": 15,
"axes.labelsize": 13,
"xtick.labelsize": 11,
"ytick.labelsize": 11,
"legend.fontsize": 11,
"legend.framealpha": 0.85,
"lines.linewidth": 2.0,
"axes.spines.top": False,
"axes.spines.right": False,
"savefig.bbox": "tight",
"savefig.dpi": 150,
})
np.random.seed(42)
print("Plot setup complete.")
Code cell 3
import numpy as np
import numpy.linalg as la
from scipy import stats
try:
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams.update({'figure.figsize': (10, 6), 'figure.dpi': 110,
'axes.spines.top': False, 'axes.spines.right': False})
COLORS = {'primary': '#0077BB', 'secondary': '#EE7733',
'tertiary': '#009988', 'error': '#CC3311',
'neutral': '#555555', 'highlight': '#EE3377'}
HAS_MPL = True
except ImportError:
HAS_MPL = False
np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)
def header(title):
print('\n' + '=' * len(title))
print(title)
print('=' * len(title))
def check_close(name, got, expected, tol=1e-6):
ok = np.allclose(got, expected, atol=tol, rtol=tol)
print(f"{'PASS' if ok else 'FAIL'} — {name}")
if not ok:
print(f' expected: {expected}')
print(f' got: {got}')
return ok
def check_true(name, cond):
print(f"{'PASS' if cond else 'FAIL'} — {name}")
return cond
def cohen_h(p1, p2):
"""Cohen's h effect size for two proportions."""
return 2*np.arcsin(p1**0.5) - 2*np.arcsin(p2**0.5)
print('Setup complete.')
Exercise 1 ★ — One-Sample t-Test from Scratch
An LLM is tested for latency compliance. The SLA requires mean latency ≤ 45 ms. On 20 requests: mean = 47.3 ms, sample std = 8.1 ms.
(a) State $H_0$ and $H_1$. Is this one-sided or two-sided?
(b) Compute the t-statistic: $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$.
(c) Find the critical value $t_{0.95,\,19}$ (one-sided, $\alpha = 0.05$) from the $t$-distribution.
(d) Compute the exact one-sided p-value.
(e) State your conclusion. Compute Cohen's $d = (\bar{x} - \mu_0)/s$.
Code cell 5
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 6
# Solution
def one_sample_t_test(xbar, s, n, mu0):
t_stat = (xbar - mu0) / (s / n**0.5)
df = n - 1
# One-sided upper: P(T > t | H0)
p_one_sided = stats.t.sf(t_stat, df)
return t_stat, df, p_one_sided
xbar, s, n, mu0 = 47.3, 8.1, 20, 45.0
t_stat, df, p_one = one_sample_t_test(xbar, s, n, mu0)
t_crit = stats.t.ppf(0.95, df) # one-sided alpha=0.05
d_cohen = (xbar - mu0) / s
header('Exercise 1: One-Sample t-Test')
print(f'H0: mu <= 45 ms, H1: mu > 45 ms (one-sided upper)')
print(f't-statistic = {t_stat:.4f}')
print(f'df = {df}')
print(f'Critical val = {t_crit:.4f}')
print(f'p-value = {p_one:.4f}')
print(f"Cohen's d = {d_cohen:.4f}")
check_close('t-stat', t_stat, (47.3-45)/(8.1/20**0.5))
check_close('df', df, 19)
check_true('p > 0.05 (fail to reject H0)', p_one > 0.05)
check_true('t < t_crit', t_stat < t_crit)
# Cross-check with scipy on data simulated near the summary statistics
# (the `alternative` keyword requires SciPy >= 1.6)
np.random.seed(42)
data = np.random.normal(47.3, 8.1, 20)
t_scipy, p_scipy = stats.ttest_1samp(data, mu0, alternative='greater')
print(f'scipy one-sided p (simulated data): {p_scipy:.4f} (reference only)')
print(f'\nTakeaway: t = {t_stat:.2f} < t_crit = {t_crit:.2f} (p = {p_one:.3f}), so we cannot '
      f'conclude the 45 ms SLA is violated at α=0.05; the effect (d = {d_cohen:.2f}) is small '
      f'and n = 20 gives little power to detect it.')
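An optional cross-check, not part of the original exercise: the test decision at $\alpha = 0.05$ should agree with a one-sided 95% confidence bound for $\mu$. A minimal sketch using the quantities defined above:
# Optional cross-check: the test decision matches a one-sided confidence bound.
# We fail to reject H0 (mu <= 45) exactly when the 95% lower bound for mu is below 45.
lcb = xbar - stats.t.ppf(0.95, df) * s / n**0.5
print(f'95% lower confidence bound for mu: {lcb:.2f} ms')
check_true('CI decision matches test decision', (lcb < mu0) == (p_one > 0.05))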
Exercise 2 ★ — Chi-Squared Goodness-of-Fit
A text classifier should output uniformly across 5 categories.
On 500 test examples, observed counts: [87, 113, 95, 102, 103].
(a) State $H_0$ and compute expected counts.
(b) Compute the chi-squared statistic $\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$.
(c) Compute the p-value with $df = k - 1$.
(d) Reject or fail to reject at $\alpha = 0.05$? (Critical: $\chi^2_{0.95,\,4} \approx 9.488$)
(e) Compute Cramér's $V = \sqrt{\chi^2 / (n(k-1))}$ and interpret.
Code cell 8
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 9
# Solution
def chi_squared_gof(observed):
n = observed.sum()
k = len(observed)
expected = np.full(k, n / k)
chi2_stat = ((observed - expected)**2 / expected).sum()
df = k - 1
p_value = stats.chi2.sf(chi2_stat, df)
cramer_v = (chi2_stat / (n * (k - 1)))**0.5
return chi2_stat, p_value, cramer_v
observed = np.array([87, 113, 95, 102, 103])
chi2, p, V = chi_squared_gof(observed)
chi2_scipy, p_scipy = stats.chisquare(observed)
header('Exercise 2: Chi-Squared GoF')
print(f'Observed: {observed}, Total: {observed.sum()}')
print(f'Expected: {np.full(5, 100.0)} (uniform)')
print(f'chi2 = {chi2:.4f}, df = 4, p = {p:.4f}')
print(f"Cramér V = {V:.4f} (small-medium effect)")
check_close('chi2 statistic', chi2, chi2_scipy)
check_close('p-value', p, p_scipy)
check_true('p < 0.05 (reject uniform H0)', p < 0.05)
check_true('V in [0,1]', 0 <= V <= 1)
print('\nTakeaway: Category 2 is over-predicted; classifier is not uniform — check training data balance.')
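As an optional follow-up (a sketch, not part of the reference solution), standardized residuals $(O_i - E_i)/\sqrt{E_i}$ show how much each category contributes to $\chi^2$; values beyond roughly ±2 would single out a category, and here none do:
# Optional: per-category standardized residuals; none exceed +/-2 here.
expected = np.full(len(observed), observed.sum() / len(observed))
residuals = (observed - expected) / np.sqrt(expected)
for i, r in enumerate(residuals, start=1):
    print(f'  Category {i}: standardized residual = {r:+.2f}')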
Exercise 3 ★ — Power and Sample Size
You want to detect that model A has higher accuracy than model B: $p_A = 0.88$, $p_B = 0.85$.
(a) Compute Cohen's $h = 2\arcsin\sqrt{p_A} - 2\arcsin\sqrt{p_B}$ for this effect.
(b) Compute the required $n$ per group for 80% power at $\alpha = 0.05$ (two-sided).
(c) What is the power if only $n = 2000$ per group is available?
(d) Plot the power curve as a function of $n$ from 500 to 8000.
(e) Find the $n$ that gives 95% power.
Code cell 11
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 12
# Solution
def power_two_prop(p1, p2, n, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    pbar = (p1 + p2) / 2
    z_a = stats.norm.ppf(1 - alpha/2)
    se_h0 = (2 * pbar * (1-pbar) / n)**0.5    # SE under H0 (pooled)
    se_h1 = (p1*(1-p1)/n + p2*(1-p2)/n)**0.5  # SE under H1
    delta = abs(p1 - p2)
    power = (stats.norm.sf((z_a*se_h0 - delta) / se_h1)
             + stats.norm.cdf((-z_a*se_h0 - delta) / se_h1))
    return power
def n_required(p1, p2, alpha=0.05, target_power=0.80):
z_a = stats.norm.ppf(1 - alpha/2)
z_b = stats.norm.ppf(target_power)
pbar = (p1 + p2) / 2
num = (z_a * (2*pbar*(1-pbar))**0.5 + z_b * (p1*(1-p1) + p2*(1-p2))**0.5)**2
return int(np.ceil(num / (p1 - p2)**2))
p_A, p_B = 0.88, 0.85
h = cohen_h(p_A, p_B)
n_req = n_required(p_A, p_B)
power_2000 = power_two_prop(p_A, p_B, n=2000)
# Find n for 95% power
n_search = np.arange(500, 15000, 50)
powers = [power_two_prop(p_A, p_B, n) for n in n_search]
n_95 = n_search[np.argmax(np.array(powers) >= 0.95)]
header('Exercise 3: Power and Sample Size')
print(f"Cohen's h = {h:.4f}")
print(f'n required for 80% power: {n_req}')
print(f'Power at n=2000: {power_2000:.4f}')
print(f'n for 95% power: {n_95}')
if HAS_MPL:
fig, ax = plt.subplots(figsize=(9, 5))
ax.plot(n_search, powers, color=COLORS['primary'], lw=2)
ax.axhline(0.80, color=COLORS['error'], linestyle='--', label='80% power')
ax.axhline(0.95, color=COLORS['secondary'], linestyle='--', label='95% power')
ax.axvline(n_req, color=COLORS['error'], linestyle=':', alpha=0.7)
ax.axvline(n_95, color=COLORS['secondary'], linestyle=':', alpha=0.7)
ax.set_xlabel('n per group'); ax.set_ylabel('Power')
ax.set_title(f'Power curve: pA={p_A}, pB={p_B}, α=0.05')
ax.legend(); plt.tight_layout(); plt.show()
check_true("Cohen's h > 0", h > 0)
check_true('n_req is positive integer', n_req > 0)
check_true('power at n_req >= 0.78', power_two_prop(p_A, p_B, n_req) >= 0.78)
print('\nTakeaway: detecting a 3% accuracy gap needs ~2,000 examples per model (~4,000 total) '
      'for 80% power, more than many benchmark test sets provide!')
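The analytic power formula is a normal approximation; a quick Monte Carlo sketch (2000 replications, parameters as above; not part of the original exercise) checks it at $n = 2000$:
# Monte Carlo check of the analytic power at n = 2000.
np.random.seed(0)
n_check, n_mc, rej = 2000, 2000, 0
z975 = stats.norm.ppf(0.975)
for _ in range(n_mc):
    x1 = np.random.binomial(n_check, p_A)
    x2 = np.random.binomial(n_check, p_B)
    pp = (x1 + x2) / (2 * n_check)              # pooled proportion under H0
    se = (2 * pp * (1 - pp) / n_check) ** 0.5
    z = (x1 - x2) / n_check / se
    rej += abs(z) > z975
print(f'Monte Carlo power at n=2000: {rej/n_mc:.3f} (analytic: {power_2000:.3f})')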
Exercise 4 ★★ — Neyman-Pearson Lemma for Exponential Distribution
$X_1, \dots, X_n \overset{\text{iid}}{\sim} \text{Exp}(\lambda)$. Test $H_0: \lambda = \lambda_0$ vs. $H_1: \lambda = \lambda_1$ with $\lambda_1 > \lambda_0$.
(a) Write the likelihood ratio $\Lambda(x) = L(\lambda_1)/L(\lambda_0)$ and simplify.
(b) Show that rejecting when $\Lambda(x) > k$ is equivalent to rejecting when $\bar{X} < c$.
(c) Find $c$ in terms of $\lambda_0$, $n$, $\alpha$, using the fact that $2\lambda_0 n \bar{X} \sim \chi^2_{2n}$ under $H_0$.
(d) Verify: for $\lambda_0 = 1$, $n = 10$, $\alpha = 0.05$, confirm the test has correct size.
(e) Compare the theoretical power at $\lambda_1 = 2$ to a simulated estimate, and explain why the NP test is UMP here.
Code cell 14
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 15
# Solution
def np_critical_value_exp(lambda0, n, alpha):
# 2*lambda0*n*Xbar ~ chi2(2n)
# Reject when Xbar < c, i.e., 2*lambda0*n*Xbar < 2*lambda0*n*c
# P(chi2(2n) < chi2_alpha) = alpha -> chi2_lo = chi2(2n).ppf(alpha)
chi2_lo = stats.chi2.ppf(alpha, df=2*n)
c = chi2_lo / (2 * lambda0 * n)
return c
def np_power_exp(lambda1, c, n):
# P(Xbar < c | lambda=lambda1) = P(2*lambda1*n*Xbar < 2*lambda1*n*c)
return stats.chi2.cdf(2*lambda1*n*c, df=2*n)
lambda0, n_exp, alpha = 1.0, 10, 0.05
c = np_critical_value_exp(lambda0, n_exp, alpha)
# Verify size by simulation
np.random.seed(42)
size_sim = np.mean([np.random.exponential(1/lambda0, n_exp).mean() < c for _ in range(50000)])
# Power at lambda1 = 2.0
lambda1_test = 2.0
power_theory = np_power_exp(lambda1_test, c, n_exp)
power_sim = np.mean([np.random.exponential(1/lambda1_test, n_exp).mean() < c for _ in range(50000)])
header('Exercise 4: NP Lemma for Exponential')
print(f'lambda0={lambda0}, n={n_exp}, alpha={alpha}')
print(f'Critical value c = {c:.6f}')
print(f'Simulated size: {size_sim:.4f} (target {alpha})')
print(f'Theoretical power at lambda1={lambda1_test}: {power_theory:.4f}')
print(f'Simulated power: {power_sim:.4f}')
check_true(f'Size ≈ alpha (within 0.01)', abs(size_sim - alpha) < 0.01)
check_true('Power > alpha (test is useful)', power_theory > alpha)
check_close('Theory matches simulation', power_theory, power_sim, tol=0.01)
print('\nTakeaway: The NP test (reject when Xbar < c) is UMP: the LRT for Exp(λ) reduces to a '
'threshold on the sufficient statistic Xbar, and Exp has the MLR property in Xbar.')
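For reference, the algebra behind parts (a)-(c):

$$\Lambda(x) = \frac{\prod_{i=1}^n \lambda_1 e^{-\lambda_1 x_i}}{\prod_{i=1}^n \lambda_0 e^{-\lambda_0 x_i}} = \left(\frac{\lambda_1}{\lambda_0}\right)^{n} \exp\!\Big(-(\lambda_1 - \lambda_0)\sum_{i} x_i\Big).$$

With $\lambda_1 > \lambda_0$, $\Lambda$ is strictly decreasing in $\sum_i x_i$, so $\Lambda(x) > k \iff \bar{X} < c$. Under $H_0$, each $2\lambda_0 X_i \sim \chi^2_2$, hence $2\lambda_0 n \bar{X} \sim \chi^2_{2n}$, which gives $c = \chi^2_{2n,\alpha} / (2\lambda_0 n)$, matching np_critical_value_exp above.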
Exercise 5 ★★ — Multiple Testing Correction
A model is compared to baseline on 50 benchmarks (45 true nulls, 5 true alternatives).
(a) Simulate p-values: nulls $\sim \text{Uniform}(0,1)$, alternatives $\sim \text{Beta}(0.2, 1)$.
(b) Count discoveries with no correction at $\alpha = 0.05$.
(c) Apply Bonferroni correction and count discoveries.
(d) Apply BH correction at $q = 0.05$ and count discoveries.
(e) Across 1000 simulations, estimate empirical FWER, FDR, and power for each method and compare.
Code cell 17
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 18
# Solution
def simulate_pvalues(m0=45, m1=5, seed=None):
if seed is not None:
np.random.seed(seed)
pvals_null = np.random.uniform(0, 1, m0)
pvals_alt = np.random.beta(0.2, 1.0, m1)
pvals = np.concatenate([pvals_null, pvals_alt])
truth = np.array([False]*m0 + [True]*m1)
return pvals, truth
def bh_reject(pvals, alpha=0.05):
pvals = np.asarray(pvals)
m = len(pvals)
order = np.argsort(pvals)
ranked = pvals[order]
thresholds = alpha * np.arange(1, m + 1) / m
passed = ranked <= thresholds
reject = np.zeros(m, dtype=bool)
if passed.any():
k = np.max(np.where(passed)[0])
reject[order[:k+1]] = True
return reject
def apply_corrections(pvals, alpha=0.05):
r_none = pvals < alpha
r_bonf = pvals < alpha / len(pvals)
r_bh = bh_reject(pvals, alpha=alpha)
return {'none': np.array(r_none), 'bonf': np.array(r_bonf), 'bh': np.array(r_bh)}
# Single run
np.random.seed(42)
pvals, truth = simulate_pvalues()
rejections = apply_corrections(pvals)
header('Exercise 5: Multiple Testing Correction')
print(f'{"Method":<12} {"Disc":<8} {"TP":<8} {"FP"}')
for method in ['none', 'bonf', 'bh']:
r = rejections[method]
tp = (r & truth).sum(); fp = (r & ~truth).sum()
print(f'{method:<12} {r.sum():<8} {tp:<8} {fp}')
# 1000-simulation comparison
n_sims_mt = 1000
fwer = {'none': [], 'bonf': [], 'bh': []}
fdr_sim = {'none': [], 'bonf': [], 'bh': []}
power_sim = {'none': [], 'bonf': [], 'bh': []}
m1 = 5
for i in range(n_sims_mt):
pv, tr = simulate_pvalues(seed=i)
rj = apply_corrections(pv)
for meth in ['none', 'bonf', 'bh']:
r = rj[meth]
fp = (r & ~tr).sum(); tp = (r & tr).sum(); R = r.sum()
fwer[meth].append(fp > 0)
fdr_sim[meth].append(fp / max(R, 1))
power_sim[meth].append(tp / m1)
print('\nSimulation results (1000 replications):')
print(f'{"Method":<12} {"FWER":<10} {"FDR":<10} {"Power"}')
for meth in ['none', 'bonf', 'bh']:
print(f'{meth:<12} {np.mean(fwer[meth]):.3f} '
f'{np.mean(fdr_sim[meth]):.3f} {np.mean(power_sim[meth]):.3f}')
check_true('Bonferroni FWER <= 0.06', np.mean(fwer['bonf']) <= 0.06)
check_true('BH FDR <= 0.06', np.mean(fdr_sim['bh']) <= 0.06)
check_true('BH power >= Bonferroni power', np.mean(power_sim['bh']) >= np.mean(power_sim['bonf']))
print('\nTakeaway: BH controls FDR with significantly higher power than Bonferroni for large m.')
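If statsmodels is installed (it is not imported in the setup cell, so this is an optional sketch), its multipletests routine can cross-check the hand-rolled BH implementation on the single-run p-values:
# Optional cross-check of bh_reject against statsmodels, if available.
try:
    from statsmodels.stats.multitest import multipletests
    rej_sm = multipletests(pvals, alpha=0.05, method='fdr_bh')[0]
    print('statsmodels fdr_bh agrees with bh_reject:',
          np.array_equal(rej_sm, rejections['bh']))
except ImportError:
    print('statsmodels not installed; skipping cross-check.')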
Exercise 6 ★★ — Permutation Test for Two-Sample Means
Two LLMs are evaluated on 30 shared test prompts. Model A scores $\sim \mathcal{N}(0.72, 0.10^2)$, Model B scores $\sim \mathcal{N}(0.68, 0.10^2)$.
(a) Compute the observed mean difference.
(b) Implement a permutation test with $B = 10{,}000$ permutations.
(c) Compute the permutation p-value (two-sided).
(d) Compare to a Welch t-test p-value.
(e) Repeat 500 times under $H_0$ (both models $\sim \mathcal{N}(0.70, 0.10^2)$) and verify Type I error $\approx 0.05$.
Code cell 20
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 21
# Solution
def permutation_test(x, y, B=10_000):
obs_diff = x.mean() - y.mean()
combined = np.concatenate([x, y])
n1 = len(x)
perm_diffs = np.empty(B)
for i in range(B):
perm = np.random.permutation(combined)
perm_diffs[i] = perm[:n1].mean() - perm[n1:].mean()
p_value = (np.abs(perm_diffs) >= np.abs(obs_diff)).mean()
return obs_diff, p_value
np.random.seed(42)
scores_A = np.random.normal(0.72, 0.10, 30)
scores_B = np.random.normal(0.68, 0.10, 30)
obs_diff, p_perm = permutation_test(scores_A, scores_B)
_, p_welch = stats.ttest_ind(scores_A, scores_B, equal_var=False)
# Type I error simulation
np.random.seed(99)
n_sims_perm = 500
false_pos = 0
for _ in range(n_sims_perm):
a = np.random.normal(0.70, 0.10, 30)
b = np.random.normal(0.70, 0.10, 30)
_, pp = permutation_test(a, b, B=999)
if pp < 0.05:
false_pos += 1
type1_rate = false_pos / n_sims_perm
header('Exercise 6: Permutation Test')
print(f'Observed diff: {obs_diff:.4f}')
print(f'Permutation p: {p_perm:.4f}')
print(f'Welch t p: {p_welch:.4f}')
print(f'Type I error (H0 true): {type1_rate:.3f}')
check_true('p-perm and p-welch agree within 0.08', abs(p_perm - p_welch) < 0.08)
check_true(f'Type I error <= 0.07', type1_rate <= 0.07)
print('\nTakeaway: Permutation and Welch t-test agree closely for normal data; '
'permutation provides valid exact inference with no distributional assumptions.')
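SciPy ≥ 1.7 ships a built-in stats.permutation_test; as an optional cross-check (a sketch, guarded in case the installed SciPy predates it):
# Optional cross-check with scipy's built-in permutation test (SciPy >= 1.7).
try:
    res = stats.permutation_test(
        (scores_A, scores_B),
        lambda x, y: np.mean(x) - np.mean(y),
        permutation_type='independent',
        n_resamples=9999,
        random_state=42,
    )
    print(f'scipy permutation p: {res.pvalue:.4f} (hand-rolled: {p_perm:.4f})')
except AttributeError:
    print('scipy.stats.permutation_test unavailable; skipping cross-check.')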
Exercise 7 ★★★ — Sequential A/B Test with SPRT
Two variants are tested on streaming requests: $H_0: p = 0.80$ vs. $H_1: p = 0.85$. Set $\alpha = 0.05$, $\beta = 0.20$.
(a) Derive and compute the SPRT log-likelihood ratio increment per observation.
(b) Compute the Wald stopping boundaries $A = \log\frac{1-\beta}{\alpha}$ and $B = \log\frac{\beta}{1-\alpha}$.
(c) Simulate one sequential path and plot the cumulative log-likelihood ratio vs. $n$ with the boundaries.
(d) Estimate the expected stopping time under $H_0$ and $H_1$ (2000 simulations each).
(e) Verify the empirical Type I error $\le \alpha$ over 2000 runs under $H_0$.
Code cell 23
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 24
# Solution
def sprt_boundaries(alpha, beta):
A = np.log((1 - beta) / alpha)
B = np.log(beta / (1 - alpha))
return A, B
def sprt_run(p_true, p0, p1, A, B, max_n=5000):
llr = 0.0
for n in range(1, max_n + 1):
x = np.random.binomial(1, p_true)
if x == 1:
llr += np.log(p1 / p0)
else:
llr += np.log((1-p1) / (1-p0))
if llr >= A:
return n, 'reject'
if llr <= B:
return n, 'accept'
return max_n, 'undecided'
p0_s, p1_s = 0.80, 0.85
alpha_s, beta_s = 0.05, 0.20
A_bound, B_bound = sprt_boundaries(alpha_s, beta_s)
# One example path under H1
np.random.seed(42)
llr_path = [0.0]
for xi in np.random.binomial(1, p1_s, 3000):
d = np.log(p1_s/p0_s) if xi == 1 else np.log((1-p1_s)/(1-p0_s))
llr_path.append(llr_path[-1] + d)
if llr_path[-1] >= A_bound or llr_path[-1] <= B_bound:
break
if HAS_MPL:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(llr_path, color=COLORS['primary'], lw=1.5)
ax.axhline(A_bound, color=COLORS['error'], linestyle='--', label=f'Reject H0 ({A_bound:.2f})')
ax.axhline(B_bound, color=COLORS['tertiary'], linestyle='--', label=f'Accept H0 ({B_bound:.2f})')
ax.set_xlabel('n'); ax.set_ylabel('Log-LR')
ax.set_title(f'SPRT path (stopped at n={len(llr_path)-1})')
ax.legend(); plt.tight_layout(); plt.show()
# Simulation study
np.random.seed(42)
n_runs = 2000
h0_results = [sprt_run(p0_s, p0_s, p1_s, A_bound, B_bound) for _ in range(n_runs)]
h1_results = [sprt_run(p1_s, p0_s, p1_s, A_bound, B_bound) for _ in range(n_runs)]
fp_rate = sum(1 for _, d in h0_results if d == 'reject') / n_runs
tp_rate = sum(1 for _, d in h1_results if d == 'reject') / n_runs
mean_n_h0 = np.mean([n for n, _ in h0_results])
mean_n_h1 = np.mean([n for n, _ in h1_results])
header('Exercise 7: SPRT Sequential A/B Test')
print(f'Boundaries: A={A_bound:.3f}, B={B_bound:.3f}')
print(f'Type I error: {fp_rate:.4f} (target <= {alpha_s})')
print(f'Power: {tp_rate:.4f} (target >= {1-beta_s})')
print(f'Mean n under H0: {mean_n_h0:.0f}')
print(f'Mean n under H1: {mean_n_h1:.0f}')
check_true(f'Type I error <= alpha+0.01', fp_rate <= alpha_s + 0.01)
check_true(f'Power >= 1-beta-0.02', tp_rate >= (1-beta_s) - 0.02)
print('\nTakeaway: SPRT stops ~40% earlier under H1 vs fixed-n test, enabling faster A/B decisions.')
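Wald's classical approximation for the expected stopping time provides a closed-form sanity check on the simulated means; a sketch using the boundaries and error probabilities defined above:
# Wald's approximation: E[N] ~ (P(hit A)*A + P(hit B)*B) / E[per-step LLR].
def llr_mean(p_true, p0, p1):
    return p_true * np.log(p1/p0) + (1 - p_true) * np.log((1-p1)/(1-p0))
e_n_h0 = (alpha_s * A_bound + (1 - alpha_s) * B_bound) / llr_mean(p0_s, p0_s, p1_s)
e_n_h1 = ((1 - beta_s) * A_bound + beta_s * B_bound) / llr_mean(p1_s, p0_s, p1_s)
print(f'Wald approx E[N | H0]: {e_n_h0:.0f} (simulated: {mean_n_h0:.0f})')
print(f'Wald approx E[N | H1]: {e_n_h1:.0f} (simulated: {mean_n_h1:.0f})')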
Exercise 8 ★★★ — KS-Based Data Drift Detector
Build a drift detection system for an LLM serving pipeline. Reference: embedding norms from 5000 training samples.
(a) Simulate reference data and 3 production batches: no drift, mean drift (+0.5), severe drift (+2).
(b) Apply two-sample KS test to each batch vs. reference.
(c) Apply BH correction across the 3 batch comparisons.
(d) Implement a sliding-window detector: alert if the last 3 consecutive days all have KS $p < 0.1$.
(e) Show KS is more sensitive than the t-test for scale-only drift (same mean, $\sigma: 1.0 \to 1.5$).
Code cell 26
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 27
# Solution
def ks_drift_test(reference, production):
ks_stat, p_value = stats.ks_2samp(reference, production)
return ks_stat, p_value
def sliding_window_alert(p_values, window=3, threshold=0.1):
if len(p_values) < window:
return False
return all(p < threshold for p in p_values[-window:])
np.random.seed(42)
n_ref, n_prod = 5000, 1000
reference = np.random.normal(10.0, 1.0, n_ref)
batches = {
'No drift': np.random.normal(10.0, 1.0, n_prod),
'Mean +0.5': np.random.normal(10.5, 1.0, n_prod),
'Severe +2.0': np.random.normal(12.0, 1.0, n_prod),
}
ks_stats, p_vals = [], []
for name, batch in batches.items():
ks, p = ks_drift_test(reference, batch)
ks_stats.append(ks); p_vals.append(p)
def bh_adjust(pvals, alpha=0.05):
pvals = np.asarray(pvals, dtype=float)
m = len(pvals)
order = np.argsort(pvals)
ranked = pvals[order]
q_ranked = np.empty(m)
running = 1.0
for i in range(m - 1, -1, -1):
running = min(running, ranked[i] * m / (i + 1))
q_ranked[i] = running
q_vals = np.empty(m)
q_vals[order] = q_ranked
reject = q_vals <= alpha
return reject, q_vals
# BH correction
reject_bh, q_vals = bh_adjust(p_vals, alpha=0.05)
header('Exercise 8: KS Drift Detection')
for i, (name, _) in enumerate(batches.items()):
print(f' {name:<18}: KS={ks_stats[i]:.4f}, p={p_vals[i]:.6f}, '
f'q={q_vals[i]:.4f}, alert={"YES" if reject_bh[i] else "no"}')
# Sliding window demo (7 daily batches)
daily_pvals = []
for day in range(7):
mu = 10.0 if day < 4 else 10.8
batch_d = np.random.normal(mu, 1.0, 500)
_, p_d = ks_drift_test(reference, batch_d)
daily_pvals.append(p_d)
print('\nSliding window (7 days):')
for i, p in enumerate(daily_pvals):
alert = sliding_window_alert(daily_pvals[:i+1])
print(f' Day {i+1}: p={p:.4f} {"ALERT" if alert else ""}')
# Scale-only drift: KS vs t-test
batch_scale = np.random.normal(10.0, 1.5, n_prod) # same mean, bigger variance
ks_scale, p_ks_scale = ks_drift_test(reference, batch_scale)
_, p_t_scale = stats.ttest_ind(reference, batch_scale)
print('\nScale-only drift (sigma: 1 -> 1.5, mean unchanged):')
print(f' KS p={p_ks_scale:.6f} (detects scale change)')
print(f' t-test p={p_t_scale:.4f} (misses scale change)')
check_true('KS detects mean drift', p_vals[1] < 0.05)
check_true('KS detects severe drift', p_vals[2] < 0.001)
check_true('KS detects scale drift, t-test misses', p_ks_scale < 0.05 and p_t_scale > 0.10)
print('\nTakeaway: KS test is the right tool for production drift detection — '
'it detects any distributional change, not just mean shifts.')
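For intuition, the two-sample KS test's large-sample rejection threshold has a closed form, $D_{\text{crit}} = c(\alpha)\sqrt{(n_1+n_2)/(n_1 n_2)}$ with $c(\alpha) = \sqrt{-\ln(\alpha/2)/2}$; a quick sketch applies it to the three batches above:
# Asymptotic KS critical value at alpha = 0.05 (c(alpha) ~ 1.358).
c_alpha = np.sqrt(-np.log(0.05 / 2) / 2)
d_crit = c_alpha * np.sqrt((n_ref + n_prod) / (n_ref * n_prod))
print(f'Asymptotic D_crit (alpha=0.05): {d_crit:.4f}')
for name, ks in zip(batches.keys(), ks_stats):
    print(f'  {name:<18}: D={ks:.4f} -> {"drift" if ks > d_crit else "no drift"}')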
Exercise 9 ★★★ — Benjamini-Hochberg FDR Control
A benchmark report tests 100 metrics, with a mixture of null and non-null effects.
(a) Implement Benjamini-Hochberg at target FDR $q = 0.1$.
(b) Compare discoveries with Bonferroni correction.
(c) Estimate the realized false discovery proportion in simulation.
(d) Explain why FDR is often more useful than family-wise error for large evaluation suites.
Code cell 29
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 30
# Solution
from scipy import stats
def bh_reject(pvals, q):
pvals = np.asarray(pvals)
m = len(pvals)
order = np.argsort(pvals)
ranked = pvals[order]
thresh = q * np.arange(1, m + 1) / m
passed = ranked <= thresh
reject = np.zeros(m, dtype=bool)
if passed.any():
k = np.max(np.where(passed)[0])
reject[order[:k+1]] = True
return reject
np.random.seed(42)
m = 100
m_alt = 25
z_null = np.random.randn(m - m_alt)
z_alt = np.random.randn(m_alt) + 2.5
z = np.concatenate([z_null, z_alt])
truth_alt = np.array([False]*(m-m_alt) + [True]*m_alt)
pvals = stats.norm.sf(z)  # one-sided upper-tail p-values
rej_bh = bh_reject(pvals, 0.1)
rej_bonf = pvals <= 0.05 / m
fdp = ((rej_bh) & (~truth_alt)).sum() / max(rej_bh.sum(), 1)
header('Exercise 9: Benjamini-Hochberg FDR')
print(f'BH discoveries: {rej_bh.sum()}, FDP={fdp:.3f}')
print(f'Bonferroni discoveries: {rej_bonf.sum()}')
check_true('BH is at least as powerful as Bonferroni in this draw', rej_bh.sum() >= rej_bonf.sum())
check_true('Realized FDP is reasonable in this simulation', fdp <= 0.25)
print('Takeaway: BH accepts controlled false discoveries to gain power across many metrics.')
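A single draw gives a noisy FDP; a short replication loop (a sketch extending part (c)) estimates the mean FDP, which is the quantity BH actually controls, here at $\pi_0 q = 0.75 \times 0.1 = 0.075$ under independence:
# Estimate E[FDP] over replications: BH controls this at pi0*q <= 0.1.
fdps = []
for rep in range(500):
    rng = np.random.default_rng(rep)
    z_rep = np.concatenate([rng.standard_normal(m - m_alt),
                            rng.standard_normal(m_alt) + 2.5])
    p_rep = stats.norm.sf(z_rep)
    rej_rep = bh_reject(p_rep, 0.1)
    fdps.append((rej_rep & ~truth_alt).sum() / max(rej_rep.sum(), 1))
print(f'Mean FDP over 500 replications: {np.mean(fdps):.3f} (target <= 0.1)')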
Exercise 10 ★★★ — Permutation Test for Benchmark Difference
Two models are evaluated on the same examples, so correctness outcomes are paired.
(a) Compute the observed paired accuracy difference.
(b) Use a sign-flip permutation test for the null of exchangeable paired differences.
(c) Compare with an unpaired test and explain why pairing matters.
(d) Connect the test to model-comparison claims on leaderboards.
Code cell 32
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 33
# Solution
from scipy import stats
np.random.seed(42)
n = 800
base = (np.random.rand(n) < 0.76).astype(int)
improved = base.copy()
# Improved model fixes 25% of the baseline's errors and breaks 2% of its correct
# answers; net expected gain = 0.24*0.25 - 0.76*0.02 ~ +0.045 accuracy.
flip_up = np.random.rand(n) < 0.25
flip_down = np.random.rand(n) < 0.02
improved[(base == 0) & flip_up] = 1
improved[(base == 1) & flip_down] = 0
d = improved - base
obs = d.mean()
B = 5000
signs = np.random.choice([-1, 1], size=(B, n))
perm_stats = (signs * d).mean(axis=1)
p_perm = np.mean(np.abs(perm_stats) >= abs(obs))
_, p_unpaired = stats.ttest_ind(improved, base, equal_var=False)
header('Exercise 10: Paired Permutation Test')
print(f'paired accuracy difference: {obs:.4f}')
print(f'paired permutation p-value: {p_perm:.4g}')
print(f'unpaired t-test p-value: {p_unpaired:.4g}')
check_true('Paired test detects the improvement', p_perm < 0.05)
print('Takeaway: paired tests use example-level correlation and are the right default for model-vs-model evaluation.')
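For paired binary outcomes, the classical alternative is McNemar's test, which uses only the discordant pairs; a sketch using SciPy's exact binomial test (stats.binomtest, SciPy ≥ 1.7):
# McNemar-style exact test: only discordant pairs are informative under H0.
n01 = int(((base == 0) & (improved == 1)).sum())  # baseline wrong, improved right
n10 = int(((base == 1) & (improved == 0)).sum())  # baseline right, improved wrong
p_mcnemar = stats.binomtest(n01, n01 + n10, 0.5).pvalue
print(f'Discordant pairs: {n01} up vs {n10} down; McNemar exact p = {p_mcnemar:.4g}')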
What to Review After Finishing
- Exercise 1 — Can you derive the t-statistic formula from first principles?
- Exercise 2 — Do you understand why $df = k - 1$ (not $k$) for chi-squared GoF?
- Exercise 3 — Can you explain why the required $n$ scales as $1/h^2$? (See the quick check after this list.)
- Exercise 4 — Can you state the NP Lemma and prove why LRT is optimal?
- Exercise 5 — Why does BH have higher power than Bonferroni? When would you choose Bonferroni?
- Exercise 6 — When does the permutation test differ substantially from the t-test?
- Exercise 7 — Why does optional stopping (peeking) inflate Type I error, and how does SPRT fix this?
- Exercise 8 — What is the trade-off between sensitivity and specificity in a drift detector?
- Exercise 9 — Benjamini-Hochberg FDR: can you connect the computation to statistical ML practice?
- Exercise 10 — Paired permutation benchmark test: can you connect the computation to statistical ML practice?
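A quick numeric check of the Exercise 3 scaling claim (a sketch reusing n_required from the Exercise 3 solution): halving the accuracy gap should roughly quadruple the required sample size.
# n scales roughly as 1/h^2: halving the gap ~quadruples the required n.
n_gap3 = n_required(0.88, 0.85)    # 3% gap
n_gap15 = n_required(0.88, 0.865)  # 1.5% gap
print(f'n (3% gap): {n_gap3}, n (1.5% gap): {n_gap15}, ratio: {n_gap15/n_gap3:.1f}')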
References
- Casella & Berger — Statistical Inference (2001), Ch. 8-9
- Lehmann & Romano — Testing Statistical Hypotheses (2005)
- Benjamini & Hochberg — 'Controlling the false discovery rate' (JRSS-B, 1995)
- Wald — Sequential Analysis (1947)
- Koehn — 'Statistical significance tests for MT' (EMNLP 2004)
- Johari et al. — 'Peeking at A/B Tests' (KDD 2017)