Exercises Notebook
Converted from `exercises.ipynb` for web reading.
Expectation and Moments — Exercises
10 exercises covering LOTUS, linearity, variance, conditional expectation, MGFs, Jensen's inequality, bias-variance decomposition, and Adam.
| Format | Description |
|---|---|
| Problem | Markdown cell with task description |
| Your Solution | Code cell with scaffolding |
| Solution | Code cell with reference solution and checks |
Difficulty Levels
| Level | Exercises | Focus |
|---|---|---|
| ★ | 1–3 | Core mechanics |
| ★★ | 4–6 | Deeper theory |
| ★★★ | 7–10 | AI / ML applications |
Topic Map
| Topic | Exercise |
|---|---|
| LOTUS | 1 |
| Linearity of expectation | 2 |
| MGF and Gaussian moments | 3 |
| Tower property | 4 |
| Gamma MGF and cumulants | 5 |
| Jensen and KL divergence | 6 |
| Bias-variance decomposition | 7 |
| Adam as moment estimation | 8 |
| Mini-batch gradient variance | 9 |
| Control variates | 10 |
Code cell 2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
try:
import seaborn as sns
sns.set_theme(style="whitegrid", palette="colorblind")
HAS_SNS = True
except ImportError:
plt.style.use("seaborn-v0_8-whitegrid")
HAS_SNS = False
mpl.rcParams.update({
"figure.figsize": (10, 6),
"figure.dpi": 120,
"font.size": 13,
"axes.titlesize": 15,
"axes.labelsize": 13,
"xtick.labelsize": 11,
"ytick.labelsize": 11,
"legend.fontsize": 11,
"legend.framealpha": 0.85,
"lines.linewidth": 2.0,
"axes.spines.top": False,
"axes.spines.right": False,
"savefig.bbox": "tight",
"savefig.dpi": 150,
})
np.random.seed(42)
print("Plot setup complete.")
Code cell 3
import numpy as np
from scipy import stats
from scipy.integrate import quad
try:
import matplotlib.pyplot as plt
HAS_MPL = True
except ImportError:
HAS_MPL = False
np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)
def header(title):
print('\n' + '=' * len(title))
print(title)
print('=' * len(title))
def check_close(name, got, expected, tol=1e-4):
ok = np.allclose(got, expected, atol=tol, rtol=tol)
print(f"{'PASS' if ok else 'FAIL'} — {name}")
if not ok:
print(f' expected: {expected}')
print(f' got : {got}')
return ok
def check_true(name, cond):
print(f"{'PASS' if cond else 'FAIL'} — {name}")
return cond
print('Setup complete.')
Exercise 1 ★ — LOTUS: Computing Expectations Without Finding Distributions
Let $X \sim \text{Gamma}(\alpha, \beta)$ with PDF
$$f_X(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x}, \quad x > 0,$$
with $\alpha = 3$, $\beta = 2$.
(a) Compute $E[X]$ by numerical integration (LOTUS with $g(x) = x$). Check against the theoretical value $\alpha/\beta$.
(b) Compute $E[X^2]$ and $\text{Var}(X) = E[X^2] - (E[X])^2$. Check against $\alpha/\beta^2$.
(c) Compute $E[\log X]$ using LOTUS. The theoretical value is $\psi(\alpha) - \log\beta$, where $\psi$ is the digamma function.
(d) Compute $E[1/X]$ and verify $E[1/X] = \beta/(\alpha - 1)$ for $\alpha > 1$.
(e) Verify all answers by Monte Carlo simulation with $10^5$ samples.
Code cell 5
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 6
# Solution
from scipy.integrate import quad
from scipy.special import gamma as gamma_fn, digamma
import numpy as np
from scipy import stats
alpha, beta = 3.0, 2.0
def gamma_pdf(x, a, b):
return b**a / gamma_fn(a) * x**(a-1) * np.exp(-b*x)
# (a) E[X]
E_X, _ = quad(lambda x: x * gamma_pdf(x, alpha, beta), 0, np.inf)
# (b) E[X^2] and Var(X)
E_X2, _ = quad(lambda x: x**2 * gamma_pdf(x, alpha, beta), 0, np.inf)
Var_X = E_X2 - E_X**2
# (c) E[log X]
E_logX, _ = quad(lambda x: np.log(x) * gamma_pdf(x, alpha, beta), 0, np.inf)
E_logX_theory = digamma(alpha) - np.log(beta)
# (d) E[1/X]
E_invX, _ = quad(lambda x: (1.0/x) * gamma_pdf(x, alpha, beta), 0, np.inf)
E_invX_theory = beta / (alpha - 1) # valid for alpha > 1
# (e) Monte Carlo
N = 100000
np.random.seed(42)
samples = np.random.gamma(alpha, 1.0/beta, N)  # np.random.gamma takes (shape, scale); scale = 1/rate
header('Exercise 1: LOTUS — Gamma Distribution Moments')
check_close('E[X] theory (alpha/beta)', E_X, alpha/beta)
check_close('E[X] vs MC', E_X, samples.mean(), tol=0.02)
check_close('Var(X) theory (alpha/beta^2)', Var_X, alpha/beta**2)
check_close('Var(X) vs MC', Var_X, samples.var(), tol=0.02)
check_close('E[log X] theory (digamma(alpha)-log(beta))', E_logX, E_logX_theory)
check_close('E[log X] vs MC', E_logX, np.log(samples).mean(), tol=0.02)
check_close('E[1/X] theory (beta/(alpha-1))', E_invX, E_invX_theory)
check_close('E[1/X] vs MC', E_invX, (1.0/samples).mean(), tol=0.01)
print('\nTakeaway: LOTUS lets us compute E[g(X)] directly from f_X — no need to find the distribution of g(X).')
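For reference, the part (d) value can also be obtained in closed form from the LOTUS integral, using the Gamma-function identity $\Gamma(\alpha - 1) = \Gamma(\alpha)/(\alpha - 1)$:
$$E[1/X] = \int_0^{\infty} \frac{1}{x}\,\frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x}\,dx = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \int_0^{\infty} x^{\alpha - 2} e^{-\beta x}\,dx = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \cdot \frac{\Gamma(\alpha - 1)}{\beta^{\alpha - 1}} = \frac{\beta}{\alpha - 1}, \quad \alpha > 1.$$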
Exercise 2 ★ — Linearity of Expectation Without Independence
Let $X \sim \mathcal{N}(0, 1)$ and define $Y = X^2$. Note that $X$ and $Y$ are not independent.
(a) Compute $E[X + Y]$ using linearity: $E[X + Y] = E[X] + E[X^2]$. Verify by Monte Carlo.
(b) Compute $E[3X - 2Y + 5]$ using linearity. Verify by Monte Carlo.
(c) Compute $E[XY]$. Is this equal to $E[X]\,E[Y]$? Explain why or why not.
(d) Using the indicator variable trick: for i.i.d. $U_1, \dots, U_{100} \sim \text{Uniform}(0, 1)$, let $N = \sum_{i=1}^{100} \mathbf{1}\{U_i > 0.7\}$. Compute $E[N]$ by linearity and verify by simulation.
(e) Verify by simulation that $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)$. Does $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$ hold for this dependent pair? Why?
Code cell 8
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 9
# Solution
import numpy as np
np.random.seed(42)
N = 200000
X = np.random.normal(0, 1, N)
Y = X**2
# Theory: E[X]=0, E[Y]=E[X^2]=Var(X)+(E[X])^2=1, E[XY]=E[X^3]=0 (odd moment of N(0,1))
# (a)
E_X_plus_Y_linearity = 0 + 1 # E[X] + E[X^2] = 0 + 1 = 1
E_X_plus_Y_mc = (X + Y).mean()
# (b)
E_combo_linearity = 3*0 - 2*1 + 5 # = 3
E_combo_mc = (3*X - 2*Y + 5).mean()
# (c): E[XY] = E[X^3] = 0 (odd moment of symmetric distribution)
E_XY = (X * Y).mean()
E_X_times_E_Y = X.mean() * Y.mean() # 0 * 1 = 0
# Coincidentally both are ~0 here, but for different reasons
# E[XY] = E[X^3] = 0 by symmetry; E[X]*E[Y] = 0*1 = 0
# They agree but X and Y are NOT independent (proven by P(Y<1) vs P(Y<1|X>1.5))
# (d) indicator trick: E[N] = 100 * P(X_i > 0.7) = 100 * 0.3 = 30
E_N_linearity = 100 * 0.3
counts = np.array([(np.random.uniform(0, 1, 100) > 0.7).sum() for _ in range(20000)])
E_N_sim = counts.mean()  # average number of successes over 20000 simulated trials
# (e) General identity: Var(X+Y) = Var(X) + Var(Y) + 2*Cov(X,Y).
# Here Cov(X, X^2) = E[X^3] = 0 (odd moment of a symmetric distribution), so
# Var(X)+Var(Y) happens to equal Var(X+Y) even though X and Y are dependent.
Var_X_val = X.var()
Var_Y_val = Y.var()
Var_XpY_decomp = Var_X_val + Var_Y_val
Var_XpY_direct = (X + Y).var()
Cov_XY = np.cov(X, Y)[0, 1]
header('Exercise 2: Linearity of Expectation')
check_close('E[X+Y]: linearity vs MC', E_X_plus_Y_linearity, E_X_plus_Y_mc, tol=0.02)
check_close('E[3X-2Y+5]: linearity vs MC', E_combo_linearity, E_combo_mc, tol=0.02)
print(f'\nE[XY] = {E_XY:.4f} (E[X]*E[Y] = {E_X_times_E_Y:.4f} — agree here, but X,Y still dependent!)')
check_close('E[N] indicator trick', E_N_linearity, 30.0)
check_close('E[N]: linearity vs simulation', E_N_linearity, E_N_sim, tol=0.02)
print(f'\nVariance for the dependent pair (X, Y=X^2):')
print(f'  Var(X)+Var(Y)           = {Var_XpY_decomp:.4f}')
print(f'  Var(X+Y)                = {Var_XpY_direct:.4f}')
print(f'  Cov(X,Y)                = {Cov_XY:.4f} (zero here: E[X^3]=0 by symmetry)')
print(f'  Var(X)+Var(Y)+2Cov(X,Y) = {Var_XpY_decomp + 2*Cov_XY:.4f}')
check_close('Var(X+Y) = Var(X)+Var(Y)+2Cov(X,Y)', Var_XpY_direct, Var_XpY_decomp + 2*Cov_XY, tol=0.02)
print('\nTakeaway: Linearity of expectation always holds; Var(X+Y)=Var(X)+Var(Y) needs zero covariance (independence is sufficient but, as this pair shows, not necessary).')
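For reference, the indicator-variable computation from part (d) uses only linearity (no independence needed):
$$E[N] = E\Big[\sum_{i=1}^{100} \mathbf{1}\{U_i > 0.7\}\Big] = \sum_{i=1}^{100} P(U_i > 0.7) = 100 \times 0.3 = 30.$$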
Exercise 3 ★ — Moments of the Gaussian via MGF Differentiation
For $X \sim \mathcal{N}(\mu, \sigma^2)$, the MGF is
$$M_X(t) = \exp\!\big(\mu t + \tfrac{1}{2}\sigma^2 t^2\big).$$
(a) Differentiate analytically to derive:
- $E[X] = M_X'(0)$ — verify it equals $\mu$
- $E[X^2] = M_X''(0)$ — verify it equals $\mu^2 + \sigma^2$
- $E[X^3] = M_X'''(0)$ — derive the formula
(b) Implement `gaussian_mgf(t, mu, sigma)` and use numerical differentiation to extract the first 4 moments.
(c) From the moments, compute the skewness $\gamma_1 = \mu_3/\mu_2^{3/2}$ and excess kurtosis $\gamma_2 = \mu_4/\mu_2^2 - 3$. Verify: for the Gaussian, $\gamma_1 = 0$ and $\gamma_2 = 0$.
(d) Repeat for $X \sim \text{Exp}(\lambda)$ with MGF $M_X(t) = \lambda/(\lambda - t)$ for $t < \lambda$. Verify skewness $= 2$ and excess kurtosis $= 6$.
Code cell 11
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 12
# Solution
import numpy as np
mu, sigma = 2.0, 1.5
def gaussian_mgf(t, mu, sigma):
return np.exp(mu*t + 0.5*sigma**2*t**2)
def numerical_moments(mgf_fn, n_moments=4, h=5e-3):
    """k-th raw moment = k-th derivative of the MGF at t=0, via central finite differences.
    h must not be too small: the k-th difference is divided by h**k, so floating-point
    roundoff in the MGF values would swamp the 3rd and 4th derivatives for tiny h."""
    moments = []
    for k in range(1, n_moments+1):
        # 5-point central stencil: f^(k)(0) ≈ (1/h^k) * sum_i c_i * f(i*h)
ts = np.array([-2*h, -h, 0, h, 2*h])
vals = np.array([mgf_fn(t) for t in ts])
if k == 1:
d = (-vals[4] + 8*vals[3] - 8*vals[1] + vals[0]) / (12*h)
elif k == 2:
d = (-vals[4] + 16*vals[3] - 30*vals[2] + 16*vals[1] - vals[0]) / (12*h**2)
elif k == 3:
d = (vals[4] - 2*vals[3] + 2*vals[1] - vals[0]) / (2*h**3)
elif k == 4:
d = (vals[4] - 4*vals[3] + 6*vals[2] - 4*vals[1] + vals[0]) / h**4
moments.append(d)
return np.array(moments)
moments = numerical_moments(lambda t: gaussian_mgf(t, mu, sigma))
# Central moments from raw moments
m1 = moments[0] # E[X]
m2 = moments[1] # E[X^2]
m3 = moments[2] # E[X^3]
m4 = moments[3] # E[X^4]
mu2 = m2 - m1**2 # Var(X)
mu3 = m3 - 3*m1*mu2 - m1**3
mu4 = m4 - 4*m1*m3 + 6*m1**2*m2 - 3*m1**4
skewness = mu3 / mu2**1.5
ex_kurt = mu4 / mu2**2 - 3
# Exponential
lam = 2.0
def exp_mgf(t, lam):
return lam / (lam - t)
moments_exp = numerical_moments(lambda t: exp_mgf(t, lam), h=5e-3)  # moderate h keeps the 4th difference above roundoff
m1e = moments_exp[0]; m2e = moments_exp[1]
m3e = moments_exp[2]; m4e = moments_exp[3]
mu2e = m2e - m1e**2
mu3e = m3e - 3*m1e*mu2e - m1e**3
mu4e = m4e - 4*m1e*m3e + 6*m1e**2*m2e - 3*m1e**4
skew_exp = mu3e / mu2e**1.5
kurt_exp = mu4e / mu2e**2 - 3
header('Exercise 3: Gaussian and Exponential Moments via MGF')
print(f'Gaussian N({mu}, {sigma}^2):')
check_close('E[X] = mu', moments[0], mu)
check_close('E[X^2] = mu^2+sigma^2', moments[1], mu**2 + sigma**2)
check_close('Skewness = 0', skewness, 0.0, tol=0.01)
check_close('Excess kurtosis = 0', ex_kurt, 0.0, tol=0.05)
print(f'\nExponential(lam={lam}):')
check_close('E[X] = 1/lam', moments_exp[0], 1/lam, tol=0.01)
check_close('Var(X) = 1/lam^2', mu2e, 1/lam**2, tol=0.01)
check_close('Skewness = 2', skew_exp, 2.0, tol=0.1)
check_close('Excess kurtosis = 6', kurt_exp, 6.0, tol=0.3)
print('\nTakeaway: MGF encodes all moments; differentiation extracts them without integration.')
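For reference, the analytic differentiation asked for in part (a), using $M(t) = e^{\mu t + \sigma^2 t^2/2}$:
$$M'(t) = (\mu + \sigma^2 t)\,M(t) \;\Rightarrow\; M'(0) = \mu, \qquad M''(t) = \big[\sigma^2 + (\mu + \sigma^2 t)^2\big]\,M(t) \;\Rightarrow\; M''(0) = \mu^2 + \sigma^2,$$
and one further derivative gives $M'''(0) = \mu^3 + 3\mu\sigma^2$.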
Exercise 4 ★★ — Tower Property and Law of Total Variance
A customer service centre receives calls. The number of calls per hour $N$ follows a Poisson distribution with rate $\Lambda$, where $\Lambda$ itself is random: $\Lambda \sim \text{Gamma}(\alpha, \beta)$ with $\alpha = 2$, $\beta = 0.5$.
(a) Use the tower property to compute $E[N]$:
$$E[N] = E\big[E[N \mid \Lambda]\big] = E[\Lambda].$$
(b) Use the law of total variance to compute $\text{Var}(N)$:
$$\text{Var}(N) = E\big[\text{Var}(N \mid \Lambda)\big] + \text{Var}\big(E[N \mid \Lambda]\big).$$
Note: for $N \mid \Lambda \sim \text{Poisson}(\Lambda)$, $E[N \mid \Lambda] = \text{Var}(N \mid \Lambda) = \Lambda$.
(c) Verify both results by simulation: draw $\Lambda_i \sim \text{Gamma}(\alpha, \beta)$, then $N_i \sim \text{Poisson}(\Lambda_i)$.
(d) The marginal distribution of $N$ (Poisson with a Gamma-distributed rate) is Negative Binomial. Verify this by comparing the sample PMF to $\text{NegBin}\big(r = \alpha,\; p = \beta/(\beta + 1)\big)$.
Code cell 14
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 15
# Solution
import numpy as np
from scipy import stats
np.random.seed(42)
alpha, beta = 2.0, 0.5
# (a)
E_N = alpha / beta # = 4
# (b)
E_Var_N_given_L = alpha / beta # E[Lambda]
Var_E_N_given_L = alpha / beta**2 # Var(Lambda)
Var_N = E_Var_N_given_L + Var_E_N_given_L
# (c)
N_sim = 100000
Lambda_samples = np.random.gamma(alpha, 1.0/beta, N_sim) # scale = 1/beta
N_samples = np.array([np.random.poisson(lam) for lam in Lambda_samples])
# (d) Negative Binomial: r=alpha, p=beta/(beta+1)
r = alpha
p = beta / (beta + 1)
max_k = 30
ks = np.arange(max_k)
negbin_pmf = stats.nbinom.pmf(ks, r, p)
sample_pmf, _ = np.histogram(N_samples, bins=np.arange(max_k+1), density=False)
sample_pmf = sample_pmf / N_sim
tvd = np.abs(sample_pmf - negbin_pmf).sum() / 2
header('Exercise 4: Tower Property and Law of Total Variance')
check_close('E[N] = alpha/beta = 4', E_N, 4.0)
check_close('E[N] simulation', E_N, N_samples.mean(), tol=0.05)
check_close('Var(N) = alpha/beta + alpha/beta^2', Var_N, alpha/beta + alpha/beta**2)
check_close('Var(N) simulation', Var_N, N_samples.var(), tol=0.5)
print(f'\nLaw of total variance decomposition:')
print(f' E[Var(N|Lambda)] = {E_Var_N_given_L:.2f} ("within" component)')
print(f' Var(E[N|Lambda]) = {Var_E_N_given_L:.2f} ("between" component)')
print(f' Total Var(N) = {Var_N:.2f}')
print(f' Note: Var(N) > E[N] = {E_N} — overdispersion vs Poisson!')
print(f'\nMarginal ~ NegBin({r}, {p:.3f}):')
print(f' Total variation distance vs sample: {tvd:.4f} (should be small)')
check_true('TVD < 0.02 (close to NegBin)', tvd < 0.02)
print('\nTakeaway: Tower property: E[Y] = E[E[Y|X]] — marginalise by conditioning then averaging.')
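For reference, plugging the problem's numbers ($\alpha = 2$, $\beta = 0.5$) into the two identities above:
$$E[N] = E\big[E[N \mid \Lambda]\big] = E[\Lambda] = \frac{\alpha}{\beta} = 4, \qquad \text{Var}(N) = E[\Lambda] + \text{Var}(\Lambda) = \frac{\alpha}{\beta} + \frac{\alpha}{\beta^2} = 4 + 8 = 12.$$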
Exercise 5 ★★ — MGF of the Gamma Distribution and Cumulants
The Gamma distribution (shape $\alpha$, rate $\beta$) has MGF:
$$M_X(t) = \left(\frac{\beta}{\beta - t}\right)^{\alpha}, \quad t < \beta.$$
(a) Implement `gamma_mgf(t, alpha, beta)` and verify it numerically against Monte Carlo estimates of $E[e^{tX}]$.
(b) Compute the first four cumulants using the cumulant generating function $K(t) = \log M_X(t)$:
- $\kappa_1 = K'(0) = \alpha/\beta$ (mean)
- $\kappa_2 = K''(0) = \alpha/\beta^2$ (variance)
- $\kappa_3 = K'''(0) = 2\alpha/\beta^3$ (third cumulant)
- $\kappa_4 = K^{(4)}(0) = 6\alpha/\beta^4$ (fourth cumulant)
(c) From the cumulants, compute skewness $\kappa_3/\kappa_2^{3/2}$ and excess kurtosis $\kappa_4/\kappa_2^2$. Verify: skewness $= 2/\sqrt{\alpha}$, excess kurtosis $= 6/\alpha$.
(d) Verify all results by Monte Carlo with $\alpha = 3$, $\beta = 2$, $N = 10^5$.
Code cell 17
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 18
# Solution
import numpy as np
np.random.seed(42)
alpha, beta = 3.0, 2.0
def gamma_mgf(t, alpha, beta):
return (beta / (beta - t))**alpha
def gamma_cgf(t, alpha, beta):
return alpha * np.log(beta) - alpha * np.log(beta - t)
# (a) Verify
N_mc = 100000
samples = np.random.gamma(alpha, 1.0/beta, N_mc)
# (b) Cumulants via finite differences of K(t)
h = 5e-3  # moderate step: the k-th finite difference divides by h**k, so too-small h amplifies roundoff
ts_fd = np.array([-2*h, -h, 0, h, 2*h])
K_vals = np.array([gamma_cgf(t, alpha, beta) for t in ts_fd])
kappa1 = (-K_vals[4] + 8*K_vals[3] - 8*K_vals[1] + K_vals[0]) / (12*h)
kappa2 = (-K_vals[4] + 16*K_vals[3] - 30*K_vals[2] + 16*K_vals[1] - K_vals[0]) / (12*h**2)
kappa3 = (K_vals[4] - 2*K_vals[3] + 2*K_vals[1] - K_vals[0]) / (2*h**3)
kappa4 = (K_vals[4] - 4*K_vals[3] + 6*K_vals[2] - 4*K_vals[1] + K_vals[0]) / h**4
# (c)
skewness = kappa3 / kappa2**1.5
ex_kurt = kappa4 / kappa2**2
header('Exercise 5: Gamma MGF and Cumulants')
for t in [0.1, 0.5, 1.0]:
mgf_theory = gamma_mgf(t, alpha, beta)
mgf_mc = np.exp(t * samples).mean()
check_close(f'MGF(t={t}): theory vs MC', mgf_theory, mgf_mc, tol=0.01)
print(f'\nCumulants:')
check_close('kappa1 = alpha/beta', kappa1, alpha/beta)
check_close('kappa2 = alpha/beta^2', kappa2, alpha/beta**2)
check_close('kappa3 = 2*alpha/beta^3', kappa3, 2*alpha/beta**3)
check_close('kappa4 = 6*alpha/beta^4', kappa4, 6*alpha/beta**4)
print(f'\nSkewness and kurtosis:')
check_close('Skewness = 2/sqrt(alpha)', skewness, 2/np.sqrt(alpha), tol=0.01)
check_close('Excess kurtosis = 6/alpha', ex_kurt, 6/alpha, tol=0.05)
print(f'\nMC verification: skewness={((samples-samples.mean())**3).mean()/samples.std()**3:.3f}, '
f'kurtosis={(((samples-samples.mean())**4).mean()/samples.std()**4 - 3):.3f}')
print('\nTakeaway: CGF K(t)=log M(t) has a simpler form; its derivatives give cumulants directly.')
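For reference, the closed-form cumulants checked above follow directly from the CGF:
$$K(t) = \alpha\log\beta - \alpha\log(\beta - t), \qquad K^{(k)}(t) = \frac{\alpha\,(k-1)!}{(\beta - t)^{k}} \;\Rightarrow\; \kappa_k = K^{(k)}(0) = \frac{\alpha\,(k-1)!}{\beta^{k}},$$
which gives $\kappa_1 = \alpha/\beta$, $\kappa_2 = \alpha/\beta^2$, $\kappa_3 = 2\alpha/\beta^3$, $\kappa_4 = 6\alpha/\beta^4$.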
Exercise 6 ★★ — Jensen's Inequality and the KL Divergence
(a) Use Jensen's inequality to prove $E[\log X] \le \log E[X]$ for any positive random variable $X$. Verify numerically for $X \sim \text{Exp}(1)$.
(b) Implement `kl_divergence(p, q)` for discrete distributions and verify $D_{\mathrm{KL}}(p \,\|\, q) \ge 0$, with equality iff $p = q$.
(c) Show that the KL divergence is not symmetric: compute $D_{\mathrm{KL}}(p \,\|\, q)$ and $D_{\mathrm{KL}}(q \,\|\, p)$ for two specific distributions and verify they differ.
(d) For two Gaussians $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$, the KL divergence has a closed form:
$$D_{\mathrm{KL}}(p \,\|\, q) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}.$$
Verify by Monte Carlo and by numerical integration.
(e) Derive the ELBO inequality: for any distribution $q(z)$ and joint $p(x, z)$,
$$\log p(x) \;\ge\; E_{q(z)}[\log p(x, z)] - E_{q(z)}[\log q(z)].$$
Verify numerically for a simple Gaussian model.
Code cell 20
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 21
# Solution
import numpy as np
from scipy import stats
from scipy.integrate import quad
np.random.seed(42)
# (a)
X = np.random.exponential(1.0, 200000)
E_logX = np.log(X).mean() # theory: -gamma ≈ -0.5772
log_EX = np.log(X.mean()) # log(E[X]) = log(1) = 0
# (b)
def kl_divergence(p, q, eps=1e-10):
p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
mask = p > eps
return np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps)))
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])
# (d)
def gaussian_kl(mu1, sigma1, mu2, sigma2):
return (np.log(sigma2/sigma1)
+ (sigma1**2 + (mu1-mu2)**2) / (2*sigma2**2)
- 0.5)
mu1, sigma1 = 1.0, 1.5
mu2, sigma2 = 2.0, 2.0
kl_closed = gaussian_kl(mu1, sigma1, mu2, sigma2)
# Numerical integration verification
def kl_integrand(x):
p = stats.norm.pdf(x, mu1, sigma1)
q = stats.norm.pdf(x, mu2, sigma2)
if p < 1e-300 or q < 1e-300:
return 0.0
return p * np.log(p / q)
kl_numerical, _ = quad(kl_integrand, mu1 - 8*sigma1, mu1 + 8*sigma1)
# MC verification
samples_p = np.random.normal(mu1, sigma1, 500000)
log_p = stats.norm.logpdf(samples_p, mu1, sigma1)
log_q = stats.norm.logpdf(samples_p, mu2, sigma2)
kl_mc = (log_p - log_q).mean()
# (e) ELBO: log p(x) >= E_q[log p(x,z)] - E_q[log q(z)] = ELBO
# Simple model: p(z) = N(0,1), p(x|z) = N(z, 1), q(z) = N(mu_q, sigma_q^2)
# Marginal: p(x) = N(0, 2) (integrate out z)
x_obs = 1.5
log_px = stats.norm.logpdf(x_obs, 0, np.sqrt(2)) # true log p(x)
mu_q, sigma_q = 0.75, 0.7 # approximate posterior q(z)
z_samples = np.random.normal(mu_q, sigma_q, 100000)
log_pxz = stats.norm.logpdf(x_obs, z_samples, 1.0) + stats.norm.logpdf(z_samples, 0, 1.0)
log_qz = stats.norm.logpdf(z_samples, mu_q, sigma_q)
elbo = (log_pxz - log_qz).mean()
header('Exercise 6: Jensen\'s Inequality and KL Divergence')
print(f'(a) Jensen: E[log X] = {E_logX:.4f} <= log(E[X]) = {log_EX:.4f}')
check_true('E[log X] <= log(E[X])', E_logX <= log_EX)
check_close('KL(p||p) = 0', kl_divergence(p, p), 0.0, tol=1e-8)
check_true('KL(p||q) > 0', kl_divergence(p, q) > 0)
print(f'\n(c) Asymmetry: KL(p||q)={kl_divergence(p,q):.4f}, KL(q||p)={kl_divergence(q,p):.4f}')
check_true('KL is asymmetric', abs(kl_divergence(p,q) - kl_divergence(q,p)) > 0.01)
print(f'\n(d) Gaussian KL: closed={kl_closed:.4f}, numerical={kl_numerical:.4f}, MC={kl_mc:.4f}')
check_close('closed == numerical', kl_closed, kl_numerical, tol=0.01)
check_close('closed == MC', kl_closed, kl_mc, tol=0.01)
print(f'\n(e) ELBO: log p(x) = {log_px:.4f}, ELBO = {elbo:.4f}')
check_true('ELBO <= log p(x)', elbo <= log_px + 0.01)
print('\nTakeaway: KL >= 0 by Jensen; the ELBO is a lower bound on log p(x) for any q.')
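For reference, the non-negativity checked in part (b) is a one-line Jensen argument (applied to the concave $\log$, over the support of $p$):
$$-D_{\mathrm{KL}}(p \,\|\, q) = E_{p}\!\left[\log\frac{q(X)}{p(X)}\right] \;\le\; \log E_{p}\!\left[\frac{q(X)}{p(X)}\right] = \log\!\!\sum_{x:\,p(x) > 0}\! q(x) \;\le\; \log 1 = 0.$$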
Exercise 7 ★★★ — Bias-Variance Decomposition
Consider estimating $f(x) = \sin(2\pi x)$ from noisy observations $y = f(x) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$.
(a) Implement `bias_variance_decomposition(degree, n_train, sigma, n_datasets, x_test)` that:
- Generates `n_datasets` independent training sets of size `n_train`
- Fits a degree-`d` polynomial to each
- Returns bias², variance, and noise at each test point
(b) Verify the decomposition: $E[(y - \hat{f}(x))^2] = \text{Bias}^2 + \text{Variance} + \sigma^2$.
(c) Plot (or print) how bias² and variance change as the degree goes from 1 to 12 (with $n_{\text{train}} = 15$, $\sigma = 0.3$). Identify the optimal degree.
(d) Show the effect of increasing $n_{\text{train}}$: fit a degree-6 polynomial for $n_{\text{train}} \in \{10, 20, 50, 100\}$. Which component dominates at small $n$? At large $n$?
Code cell 23
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 24
# Solution
import numpy as np
np.random.seed(42)
def true_fn(x):
return np.sin(2 * np.pi * x)
def poly_predict(x_train, y_train, x_test, degree):
Phi_tr = np.column_stack([x_train**k for k in range(degree+1)])
Phi_te = np.column_stack([x_test**k for k in range(degree+1)])
w, _, _, _ = np.linalg.lstsq(Phi_tr, y_train, rcond=None)
return Phi_te @ w
def bias_variance_decomp(degree, n_train, sigma, n_datasets=300, x_test=None):
if x_test is None:
x_test = np.linspace(0, 1, 50)
f_test = true_fn(x_test)
preds = []
for _ in range(n_datasets):
x_tr = np.linspace(0, 1, n_train)
y_tr = true_fn(x_tr) + np.random.normal(0, sigma, n_train)
pred = poly_predict(x_tr, y_tr, x_test, degree)
preds.append(pred)
preds = np.array(preds) # (n_datasets, len(x_test))
mean_pred = preds.mean(axis=0)
bias2 = np.mean((mean_pred - f_test)**2)
variance = np.mean(preds.var(axis=0))
noise = sigma**2
    # Expected prediction error against fresh noisy targets y = f(x) + eps,
    # so that MSE = Bias^2 + Variance + Noise
    y_test = f_test + np.random.normal(0, sigma, size=preds.shape)
    mse = np.mean((preds - y_test)**2)  # average over datasets and x_test
    return bias2, variance, noise, mse
header('Exercise 7: Bias-Variance Decomposition')
# (b) Verify decomposition
sigma = 0.3
bias2, var, noise, mse = bias_variance_decomp(4, 20, sigma)
print(f'Degree=4, n=20:')
print(f' Bias^2={bias2:.4f}, Var={var:.4f}, Noise={noise:.4f}')
print(f' Bias^2+Var+Noise={bias2+var+noise:.4f}, MSE={mse:.4f}')
check_close('MSE = Bias^2 + Var + Noise', mse, bias2+var+noise, tol=0.02)
# (c) Sweep degrees
print(f'\nDegree sweep (n=15, sigma=0.3):')
print(f'{"Degree":8s} {"Bias^2":10s} {"Variance":10s} {"Total MSE":10s}')
best_mse = float('inf'); best_d = 0
for d in range(1, 13):  # degrees 1 through 12
b2, v, ns, mse_d = bias_variance_decomp(d, 15, sigma)
print(f' d={d:2d}: bias2={b2:.4f} var={v:.4f} mse={b2+v+ns:.4f}')
if b2+v+ns < best_mse:
best_mse = b2+v+ns; best_d = d
print(f' => Optimal degree: {best_d}')
# (d) Effect of n
print(f'\nEffect of n (degree=6, sigma=0.3):')
for n in [10, 20, 50, 100]:
b2, v, ns, mse_n = bias_variance_decomp(6, n, sigma)
dom = 'bias dominates' if b2 > v else 'variance dominates'
print(f' n={n:4d}: bias2={b2:.4f}, var={v:.4f} ({dom})')
print('\nTakeaway: Bias decreases with complexity; variance increases; optimal = minimum MSE.')
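For reference, the decomposition verified in part (b), with expectation over training sets $D$ and independent test noise $\varepsilon$ (the cross terms vanish because $\varepsilon$ has mean zero and is independent of $D$):
$$E_{D,\varepsilon}\big[(y - \hat f_D(x))^2\big] = \underbrace{\big(E_D[\hat f_D(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{E_D\big[(\hat f_D(x) - E_D[\hat f_D(x)])^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}.$$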
Exercise 8 ★★★ — Adam as Adaptive Moment Estimation
Adam tracks running estimates of the first and second moments of the gradient:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2,$$
with bias corrections $\hat{m}_t = m_t / (1 - \beta_1^t)$, $\hat{v}_t = v_t / (1 - \beta_2^t)$.
(a) Implement `adam_optimizer(grad_fn, theta0, lr, n_steps)` and minimise $f(\theta) = (\theta_1 - 3)^2 + 10(\theta_2 + 1)^2$.
(b) Show that without bias correction, Adam converges slowly at first (it underestimates the moments). Quantify the initialisation bias at $t = 1$: $m_1 = (1 - \beta_1)\, g_1$ vs $\hat{m}_1 = g_1$.
(c) Implement SGD for comparison. Show Adam converges faster on this problem.
(d) Analyse the effective step $\eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$: show that its magnitude is bounded by approximately $\eta$ per parameter, regardless of gradient scale.
(e) Demonstrate Adam's robustness to gradient scale: multiply the gradient by 100 and show Adam still converges, while SGD with the same learning rate diverges.
Code cell 26
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 27
# Solution
import numpy as np
np.random.seed(42)
def f(theta):
return (theta[0] - 3)**2 + 10*(theta[1] + 1)**2
def grad_f(theta, scale=1.0):
return scale * np.array([2*(theta[0]-3), 20*(theta[1]+1)])
def adam(grad_fn, theta0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=500):
theta = theta0.copy().astype(float)
m = np.zeros_like(theta)
v = np.zeros_like(theta)
losses = [f(theta)]
for t in range(1, n_steps+1):
g = grad_fn(theta)
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g**2
m_hat = m / (1 - beta1**t) # bias correction
v_hat = v / (1 - beta2**t) # bias correction
theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
losses.append(f(theta))
return theta, np.array(losses)
def sgd(grad_fn, theta0, lr=0.01, n_steps=500):
theta = theta0.copy().astype(float)
losses = [f(theta)]
for _ in range(n_steps):
g = grad_fn(theta)
theta -= lr * g
losses.append(f(theta))
return theta, np.array(losses)
theta0 = np.array([0.0, 0.0])
theta_adam, losses_adam = adam(grad_f, theta0)
theta_sgd, losses_sgd = sgd(grad_f, theta0, lr=0.01)
header('Exercise 8: Adam as Moment Estimation')
check_close('Adam converges: theta -> [3, -1]', theta_adam, np.array([3.0, -1.0]), tol=0.01)
check_close('SGD converges: theta -> [3, -1]', theta_sgd, np.array([3.0, -1.0]), tol=0.01)
# (b) Bias correction analysis
beta1, beta2 = 0.9, 0.999
g1 = grad_f(theta0)
m1_no_corr = (1 - beta1) * g1
m1_corr = m1_no_corr / (1 - beta1**1)
print(f'\n(b) Bias correction at t=1:')
print(f' True gradient g1 = {g1}')
print(f' m1 without correction = {m1_no_corr} (only {100*(1-beta1):.0f}% of g1)')
print(f' m1 with correction = {m1_corr} (= g1, fully corrected)')
# (c) Convergence comparison
print(f'\n(c) Adam vs SGD (500 steps):')
adam_steps_to_1e3 = next((i for i,l in enumerate(losses_adam) if l < 1e-3), None)
sgd_steps_to_1e3 = next((i for i,l in enumerate(losses_sgd) if l < 1e-3), None)
print(f' Adam: {adam_steps_to_1e3} steps to loss < 1e-3')
print(f' SGD: {sgd_steps_to_1e3} steps to loss < 1e-3')
# (d) Effective learning rate
eps = 1e-8
theta_test = np.array([1.0, 0.5])
g = grad_f(theta_test)
eff_lr = 0.01 * g / (np.sqrt(g**2) + eps)  # with a steady gradient, m_hat/sqrt(v_hat) ≈ sign(g), so |step| ≈ lr
print(f'\n(d) Effective step size (sign of gradient, |step| ≈ lr):')
print(f' gradient: {g}')
print(f' effective step: {eff_lr}')
print(f' |step| ≈ lr = 0.01: {np.allclose(np.abs(eff_lr), 0.01, atol=0.001)}')
# (e) Gradient scale robustness
theta_adam_scaled, losses_adam_scaled = adam(lambda t: grad_f(t, scale=100), theta0)
theta_sgd_scaled, losses_sgd_scaled = sgd(lambda t: grad_f(t, scale=100), theta0, lr=0.01)
print(f'\n(e) With gradient scale=100:')
print(f' Adam final loss: {losses_adam_scaled[-1]:.6f} (converges!)')
print(f' SGD final loss: {losses_sgd_scaled[-1]:.6f} (diverges: nan or very large)')
check_close('Adam robust to scale', theta_adam_scaled, np.array([3.0, -1.0]), tol=0.05)
print('\nTakeaway: Adam estimates E[g] and E[g^2] per-parameter; effective LR ≈ lr regardless of gradient scale.')
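For reference, why the bias correction works: assuming the gradient distribution is roughly stationary with $E[g_i] = g$, unrolling the recursion with $m_0 = 0$ gives
$$m_t = (1 - \beta_1)\sum_{i=1}^{t}\beta_1^{\,t-i} g_i \;\Rightarrow\; E[m_t] = g\,(1 - \beta_1)\sum_{i=1}^{t}\beta_1^{\,t-i} = g\,(1 - \beta_1^{t}),$$
so dividing by $1 - \beta_1^{t}$ makes $\hat m_t$ unbiased; the same argument applies to $v_t$ with $\beta_2$ and $E[g^2]$.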
Exercise 9 ★★★ — Mini-Batch Gradient Variance
Let per-example gradients $g_1, \dots, g_B$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. The mini-batch gradient is $\bar{g}_B = \frac{1}{B}\sum_{i=1}^{B} g_i$.
(a) Derive $\text{Var}(\bar{g}_B) = \sigma^2 / B$.
(b) Verify this scaling by simulation for several batch sizes.
(c) Explain the trade-off between gradient noise and compute efficiency.
Code cell 29
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 30
# Solution
mu, sigma = 0.3, 2.0
batch_sizes = [1, 4, 16, 64, 256]
n_trials = 20000
header('Exercise 9: Mini-Batch Gradient Variance')
for B in batch_sizes:
batches = np.random.normal(mu, sigma, size=(n_trials, B))
means = batches.mean(axis=1)
empirical_var = means.var(ddof=0)
theory_var = sigma**2 / B
print(f'B={B:3d}: empirical Var={empirical_var:.4f}, theory={theory_var:.4f}')
check_close(f'variance scales as 1/B for B={B}', empirical_var, theory_var, tol=0.06)
print('Takeaway: larger batches reduce estimator variance, but with diminishing returns proportional to 1/B.')
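For reference, the scaling checked above follows from independence and $\text{Var}(aX) = a^2\,\text{Var}(X)$:
$$\text{Var}(\bar g_B) = \text{Var}\Big(\frac{1}{B}\sum_{i=1}^{B} g_i\Big) = \frac{1}{B^2}\sum_{i=1}^{B}\text{Var}(g_i) = \frac{\sigma^2}{B}.$$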
Exercise 10 ★★★ — Control Variates for Variance Reduction
Estimate $E[e^X]$ for $X \sim \mathcal{N}(0, 1)$. Use $C = X$ as a control variate with known mean $E[C] = 0$.
(a) Implement the naive Monte Carlo estimator.
(b) Estimate the optimal control coefficient $b^* = \text{Cov}(Y, C) / \text{Var}(C)$.
(c) Compare estimator variance with and without the control variate.
(d) Explain why variance reduction matters for policy gradients and variational inference.
Code cell 32
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 33
# Solution
n = 200000
X = np.random.randn(n)
Y = np.exp(X)
C = X
b_star = np.cov(Y, C, ddof=0)[0, 1] / np.var(C)
Y_cv = Y - b_star * (C - 0.0)
true_value = np.exp(0.5)
header('Exercise 10: Control Variates')
print(f'True E[e^X]: {true_value:.4f}')
print(f'Naive estimate: {Y.mean():.4f}')
print(f'Control-variate estimate: {Y_cv.mean():.4f}')
print(f'Optimal b*: {b_star:.4f}')
print(f'Naive sample variance: {Y.var():.4f}')
print(f'Control-variate sample variance: {Y_cv.var():.4f}')
check_true('Control variate reduces variance', Y_cv.var() < Y.var())
check_close('Control-variate estimate is close to truth', Y_cv.mean(), true_value, tol=0.02)
print('Takeaway: subtracting a correlated zero-mean term keeps the expectation fixed while reducing Monte Carlo noise.')
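For reference, why $b^* = \text{Cov}(Y, C)/\text{Var}(C)$ is optimal: the control-variate estimator $Y_{\mathrm{cv}} = Y - b\,(C - E[C])$ has the same mean as $Y$, and
$$\text{Var}(Y_{\mathrm{cv}}) = \text{Var}(Y) - 2b\,\text{Cov}(Y, C) + b^2\,\text{Var}(C),$$
which is minimised at $b^* = \text{Cov}(Y, C)/\text{Var}(C)$, giving $\text{Var}(Y_{\mathrm{cv}}) = (1 - \rho_{YC}^2)\,\text{Var}(Y)$.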
What to Review After Finishing
- LOTUS: $E[g(X)] = \int g(x)\, f_X(x)\,dx$ — no need to find the distribution of $g(X)$
- Linearity: $E[aX + bY + c] = a\,E[X] + b\,E[Y] + c$ — no independence needed
- Variance formula: $\text{Var}(X) = E[X^2] - (E[X])^2$
- Tower property: $E[Y] = E\big[E[Y \mid X]\big]$
- Law of total variance: $\text{Var}(Y) = E[\text{Var}(Y \mid X)] + \text{Var}(E[Y \mid X])$
- MGF: $M_X(t) = E[e^{tX}]$; the $k$-th derivative at 0 gives the $k$-th raw moment
- Jensen: $\varphi$ convex $\Rightarrow \varphi(E[X]) \le E[\varphi(X)]$; proves $D_{\mathrm{KL}} \ge 0$
- Bias-variance: expected test error $= \text{Bias}^2 + \text{Variance} + \sigma^2$
- Adam: $m_t \approx E[g]$, $v_t \approx E[g^2]$; bias correction matters at small $t$
- Mini-batch gradient variance - can you explain the probabilistic calculation and the ML relevance?
- Control variates - can you connect the computation to model behavior?
References
- DeGroot & Schervish, Probability and Statistics (4th ed.) — Chapters 4–5
- Bishop, PRML (2006) — Appendix B (probability review)
- Goodfellow, Bengio & Courville, Deep Learning (2016) — Chapter 3
- Kingma & Ba, Adam: A Method for Stochastic Optimization (2015) — arXiv:1412.6980
- Hastie, Tibshirani & Friedman, ESL (2009) — §7.3 (bias-variance tradeoff)
- Belkin et al., Reconciling modern ML and the bias-variance tradeoff (2019) — double descent