Exercises Notebook
Converted from `exercises.ipynb` for web reading.
§6.2 Common Distributions — Exercises
10 exercises covering discrete and continuous distributions, MGFs, exponential family, conjugate priors, and ML applications (cross-entropy, VAE KL, language model sampling).
| Format | Description |
|---|---|
| Problem | Markdown cell with task description |
| Your Solution | Code cell with scaffolding |
| Solution | Code cell with reference solution and checks |
Difficulty Levels
| Level | Exercises | Focus |
|---|---|---|
| ★ | 1–3 | PMFs, CDFs, simulation |
| ★★ | 4–6 | MGFs, exponential family, conjugate priors |
| ★★★ | 7–10 | ML: softmax cross-entropy, VAE KL divergence, posterior predictive, temperature scaling |
Topic Map
| Topic | Exercise |
|---|---|
| Binomial and Poisson | 1 |
| Beta distribution | 2 |
| Gaussian moments and CLT | 3 |
| Moment generating functions | 4 |
| Exponential family | 5 |
| Beta-Binomial conjugate | 6 |
| Softmax cross-entropy | 7 |
| VAE KL divergence | 8 |
| Beta-Binomial posterior predictive | 9 |
| Temperature-scaled categorical | 10 |
Code cell 2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
try:
import seaborn as sns
sns.set_theme(style="whitegrid", palette="colorblind")
HAS_SNS = True
except ImportError:
plt.style.use("seaborn-v0_8-whitegrid")
HAS_SNS = False
mpl.rcParams.update({
"figure.figsize": (10, 6),
"figure.dpi": 120,
"font.size": 13,
"axes.titlesize": 15,
"axes.labelsize": 13,
"xtick.labelsize": 11,
"ytick.labelsize": 11,
"legend.fontsize": 11,
"legend.framealpha": 0.85,
"lines.linewidth": 2.0,
"axes.spines.top": False,
"axes.spines.right": False,
"savefig.bbox": "tight",
"savefig.dpi": 150,
})
np.random.seed(42)
print("Plot setup complete.")
Code cell 3
import numpy as np
from scipy import stats
from scipy.special import gamma, gammaln
import math
try:
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams['figure.figsize'] = [10, 5]
HAS_MPL = True
except ImportError:
HAS_MPL = False
np.set_printoptions(precision=6, suppress=True)
np.random.seed(42)
def header(title):
print('\n' + '=' * len(title))
print(title)
print('=' * len(title))
def check_close(name, got, expected, tol=1e-8):
ok = np.allclose(got, expected, atol=tol, rtol=tol)
print(f"{'PASS' if ok else 'FAIL'} — {name}")
if not ok:
print(f' expected: {expected}')
print(f' got : {got}')
return ok
def check_true(name, cond):
print(f"{'PASS' if cond else 'FAIL'} — {name}")
return cond
print('Setup complete.')
Exercise 1 ★ — Binomial and Poisson: Token Error Rates
A language model makes transcription errors with probability $p = 0.02$ per token. A document has $n = 500$ tokens.
Part (a): Model the number of errors $X \sim \text{Binomial}(n, p)$. Compute $P(X=0)$, $P(X \le 5)$, $P(X > 15)$, $E[X]$, $\text{Var}(X)$.
Part (b): Approximate using $Y \sim \text{Poisson}(\lambda = np = 10)$. Compute the same quantities and compare.
Part (c): Compute the total variation distance $\text{TVD} = \frac{1}{2}\sum_k |P(X=k) - P(Y=k)|$.
Part (d): Simulate 50,000 documents. Empirically verify $P(X \le 5)$.
Code cell 5
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 6
# Solution
import numpy as np
from scipy import stats
np.random.seed(42)
n, p = 500, 0.02
lam = n * p # = 10
binom_rv = stats.binom(n, p)
pois_rv = stats.poisson(lam)
# Part (a)
p_zero_b = binom_rv.pmf(0)
p_le5_b = binom_rv.cdf(5)
p_gt15_b = binom_rv.sf(15)
e_x_b = binom_rv.mean()
var_x_b = binom_rv.var()
# Part (b)
p_zero_p = pois_rv.pmf(0)
p_le5_p = pois_rv.cdf(5)
p_gt15_p = pois_rv.sf(15)
# Part (c): TVD
k_vals = np.arange(0, 50)
tvd = 0.5 * np.sum(np.abs(binom_rv.pmf(k_vals) - pois_rv.pmf(k_vals)))
# Part (d): Simulation
n_sim = 50000
errors = stats.binom.rvs(n, p, size=n_sim)
p_le5_sim = np.mean(errors <= 5)
header('Exercise 1: Binomial and Poisson Approximation')
print(f'Binomial: P(X=0)={p_zero_b:.6f}, P(X<=5)={p_le5_b:.6f}, P(X>15)={p_gt15_b:.6f}')
print(f'Poisson: P(Y=0)={p_zero_p:.6f}, P(Y<=5)={p_le5_p:.6f}, P(Y>15)={p_gt15_p:.6f}')
print(f'Binom E[X]={e_x_b:.4f}, Var={var_x_b:.4f}')
print(f'Pois E[Y]={pois_rv.mean():.4f}, Var={pois_rv.var():.4f}')
check_close('Binom E[X] = np', e_x_b, n*p)
check_close('Binom Var = np(1-p)', var_x_b, n*p*(1-p))
print(f'TVD(Binom, Poisson) = {tvd:.6f}')
check_true('TVD < 0.01', tvd < 0.01)
print(f'Simulation P(X<=5) = {p_le5_sim:.4f}, Analytical = {p_le5_b:.4f}')
check_true('Simulation within 0.005', abs(p_le5_sim - p_le5_b) < 0.005)
print('\nTakeaway: For rare events (np fixed, n large), the Poisson closely approximates')
print('the Binomial — a standard upper bound on the TVD is min(p, np^2).')
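As a quick sanity check on that bound, a minimal sketch (it reuses `n`, `p`, and `tvd` from the solution cell above):
# Upper bound quoted in the takeaway: TVD <= min(p, n*p^2).
bound = min(p, n * p**2)
print(f'TVD = {tvd:.6f} <= min(p, np^2) = {bound:.6f}')
check_true('TVD within the min(p, np^2) bound', tvd <= bound)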
Exercise 2 ★ — Beta Distribution: Model Confidence Calibration
A model's confidence score for a class is modelled as $\theta \sim \text{Beta}(\alpha, \beta)$.
Part (a): Compute the mean, variance, and mode for $(\alpha, \beta) \in \{(2,5), (5,2), (5,5), (0.5, 0.5)\}$.
Part (b): For $\theta \sim \text{Beta}(3, 7)$, compute $P(\theta > 0.5)$ and $P(0.2 \le \theta \le 0.4)$.
Part (c): Show numerically that $\text{Beta}(1,1) = \text{Uniform}(0,1)$ by comparing CDFs at 10 test points.
Part (d): The effective sample size of a $\text{Beta}(\alpha, \beta)$ prior is $\alpha + \beta$. After observing 20 successes out of 30 trials, starting from $\text{Beta}(2, 2)$, compute the posterior and its 95% credible interval.
Code cell 8
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 9
# Solution
import numpy as np
from scipy import stats
header('Exercise 2: Beta Distribution')
# Part (a)
configs = [(2, 5), (5, 2), (5, 5), (0.5, 0.5)]
print('Beta distribution statistics:')
for a, b in configs:
rv = stats.beta(a, b)
mean = rv.mean()
var = rv.var()
mode = (a-1)/(a+b-2) if (a > 1 and b > 1) else 'boundary'
print(f' Beta({a},{b}): mean={mean:.4f}, var={var:.6f}, mode={mode}')
# Part (b)
a, b = 3, 7
rv = stats.beta(a, b)
p_gt_half = rv.sf(0.5)
p_interval = rv.cdf(0.4) - rv.cdf(0.2)
print(f'\nBeta(3,7):')
print(f' P(theta>0.5) = {p_gt_half:.6f}')
print(f' P(0.2<=theta<=0.4) = {p_interval:.6f}')
check_true('P(theta>0.5) < 0.1', p_gt_half < 0.1)
# Part (c): Beta(1,1) = Uniform(0,1)
test_pts = np.linspace(0.1, 0.9, 10)
beta11_cdf = stats.beta(1, 1).cdf(test_pts)
uniform_cdf = stats.uniform.cdf(test_pts)
check_close('Beta(1,1) = Uniform CDF', beta11_cdf, uniform_cdf)
# Part (d): Bayesian update
alpha_prior, beta_prior = 2, 2
successes, failures = 20, 10
alpha_post = alpha_prior + successes
beta_post = beta_prior + failures
post_rv = stats.beta(alpha_post, beta_post)
ci_low, ci_high = post_rv.ppf(0.025), post_rv.ppf(0.975)
print(f'\nBayesian Update:')
print(f' Prior: Beta({alpha_prior},{beta_prior}), mean={alpha_prior/(alpha_prior+beta_prior):.4f}')
print(f' Data: {successes}/{successes+failures} successes')
print(f' Posterior: Beta({alpha_post},{beta_post}), mean={post_rv.mean():.4f}')
print(f' MLE: {successes/(successes+failures):.4f}')
print(f' 95% Credible Interval: [{ci_low:.4f}, {ci_high:.4f}]')
check_true('Posterior mean > prior mean (data is informative)', post_rv.mean() > alpha_prior/(alpha_prior+beta_prior))
check_true('95% CI contains MLE', ci_low <= successes/(successes+failures) <= ci_high)
print('\nTakeaway: Beta is the conjugate prior for Binomial — posterior update is just')
print('adding observations to the prior hyperparameters (pseudo-counts).')
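To make the effective-sample-size point in part (d) concrete, a small sketch comparing two prior strengths on the same data (the Beta(30,30) prior is an illustrative choice, not part of the exercise):
# Same data (20 successes, 10 failures) under priors of different effective
# sample size alpha+beta: the stronger prior shrinks the posterior mean harder.
for a0, b0 in [(2, 2), (30, 30)]:
    post = stats.beta(a0 + 20, b0 + 10)
    print(f'Prior Beta({a0},{b0}), ESS={a0+b0}: posterior mean = {post.mean():.4f}')
print(f'MLE = {20/30:.4f} — the weaker prior stays closer to the MLE.')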
Exercise 3 ★ — Gaussian: Xavier Initialisation and the CLT
Part (a): For $X \sim \mathcal{N}(0, \sigma^2)$ with $\sigma = 2$, compute $E[X^2]$, $E[X^4]$, and $E[X^6]$ using the moment formula for centred Gaussians: $E[X^{2k}] = (2k-1)!!\,\sigma^{2k}$, where $(2k-1)!! = 1 \cdot 3 \cdots (2k-1)$.
Part (b): Xavier initialisation sets $\sigma_w^2 = 1/n_{\text{in}}$. For a layer with $n_{\text{in}} = 512$ inputs, compute the std of the output $z = \sum_i w_i x_i$ where $w_i \sim \mathcal{N}(0, \sigma_w^2)$ and $x_i \sim \mathcal{N}(0, 1)$ independently. Show that Xavier achieves $\text{Var}(z) = 1$.
Part (c): Demonstrate the CLT: if $X_i \sim \text{Uniform}(0,1)$ iid, then $(\bar{X}_n - 1/2)/(1/\sqrt{12n}) \to \mathcal{N}(0,1)$. Verify via KS test for $n \in \{1, 5, 30\}$.
Code cell 11
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 12
# Solution
import numpy as np
from scipy import stats
np.random.seed(42)
header('Exercise 3: Gaussian Moments and Xavier Init')
# Part (a): Gaussian even moments
sigma = 2.0
# E[X^{2k}] = (2k-1)!! * sigma^{2k}
E_X2 = sigma**2 # k=1: 1!! * sigma^2 = sigma^2
E_X4 = 3 * sigma**4 # k=2: 3!! * sigma^4 = 3*sigma^4
E_X6 = 15 * sigma**6 # k=3: 5!! * sigma^6 = 15*sigma^6
# Verify numerically
X = np.random.normal(0, sigma, 1000000)
print('Gaussian moments:')
check_close('E[X^2]', (X**2).mean(), E_X2, tol=0.1)
check_close('E[X^4]', (X**4).mean(), E_X4, tol=1.0)
check_close('E[X^6]', (X**6).mean(), E_X6, tol=20.0)
# Part (b): Xavier initialisation
n_in = 512
sigma_w = np.sqrt(1.0 / n_in) # Xavier
# z = sum_i w_i * x_i; Var(w_i*x_i) = Var(w_i)*Var(x_i) = sigma_w^2 * 1
# Var(z) = n_in * sigma_w^2 = n_in * (1/n_in) = 1
var_z_theory = n_in * sigma_w**2
print(f'\nXavier init: sigma_w={sigma_w:.6f}')
print(f'Var(z) = n_in * sigma_w^2 = {var_z_theory:.6f}')
check_close('Xavier: Var(z) = 1', var_z_theory, 1.0)
# Verify with simulation
n_trials = 10000
W = np.random.normal(0, sigma_w, (n_trials, n_in))
X = np.random.normal(0, 1, (n_trials, n_in))
Z = (W * X).sum(axis=1)
print(f'Simulated: mean={Z.mean():.4f}, var={Z.var():.4f}')
check_close('Xavier simulated Var(z) ≈ 1', Z.var(), 1.0, tol=0.05)
# Part (c): CLT demonstration
print('\nCLT: Uniform -> Normal')
for n in [1, 5, 30]:
samples = np.random.uniform(0, 1, (50000, n)).mean(axis=1)
z_scores = (samples - 0.5) / (1 / np.sqrt(12*n))
ks_stat, p_val = stats.kstest(z_scores, 'norm')
print(f' n={n:>2}: KS stat={ks_stat:.4f}, p={p_val:.4f}')
print('\nTakeaway: Xavier init preserves variance through layers — preventing vanishing/')
print('exploding activations. This relies on the fact that Var(wX) = Var(w)*Var(X) for')
print('independent w, X both centred at 0.')
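A minimal sketch of the "through layers" claim (assumptions: purely linear layers with fresh Xavier weights at every depth, no nonlinearity, so the variance should stay near 1 throughout):
# Propagate activations through stacked linear layers with Xavier-scaled weights.
rng = np.random.default_rng(0)
width, depth, batch = 256, 6, 1000
h = rng.normal(0.0, 1.0, (batch, width))
for layer in range(depth):
    W = rng.normal(0.0, np.sqrt(1.0 / width), (width, width))
    h = h @ W
    print(f'layer {layer+1}: activation var = {h.var():.4f}')
check_close('Variance preserved after 6 layers', h.var(), 1.0, tol=0.1)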
Exercise 4 ★★ — Moment Generating Functions
Part (a): Compute the first four moments of $X \sim \text{Poisson}(\lambda = 3)$ using the MGF $M_X(t) = \exp(\lambda(e^t - 1))$. Recall that $E[X^n] = M_X^{(n)}(0)$.
Part (b): Use the product rule for MGFs to show that if $X$ and $Y \sim \text{Exp}(\lambda)$ are independent, then $X + Y \sim \text{Gamma}(2, \lambda)$. Verify numerically via a KS test.
Part (c): Write a function empirical_mgf(samples, t) that estimates $\hat{M}(t) = \frac{1}{n}\sum_{i=1}^{n} e^{t X_i}$ from samples. Compare the empirical MGF of $\mathcal{N}(2, 9)$ against its theoretical value $e^{\mu t + \sigma^2 t^2 / 2}$ at $t \in \{0.1, 0.2, 0.3\}$.
Code cell 14
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 15
# Solution
import numpy as np
from scipy import stats
np.random.seed(42)
header('Exercise 4: Moment Generating Functions')
# Part (a): Poisson moments via numerical differentiation of MGF
lam = 3.0
def mgf_pois(t):
return np.exp(lam * (np.exp(t) - 1))
# Numerical derivatives at t=0
eps = 1e-6
E_X1 = (mgf_pois(eps) - mgf_pois(-eps)) / (2*eps)
E_X2_num = (mgf_pois(eps) - 2*mgf_pois(0) + mgf_pois(-eps)) / eps**2
# Analytical: for Poisson(lambda)
# E[X] = lambda
# E[X^2] = lambda^2 + lambda
# E[X^3] = lambda^3 + 3*lambda^2 + lambda
# E[X^4] = lambda^4 + 6*lambda^3 + 7*lambda^2 + lambda
E_X1_th = lam
E_X2_th = lam**2 + lam
E_X3_th = lam**3 + 3*lam**2 + lam
E_X4_th = lam**4 + 6*lam**3 + 7*lam**2 + lam
# Verify vs simulation
n_sim = 500000
X = stats.poisson(lam).rvs(n_sim)
print('Poisson moments (simulation vs theory):')
check_close('E[X] = lam', X.mean(), E_X1_th, tol=0.05)
check_close('E[X^2]', (X**2).mean(), E_X2_th, tol=0.3)
check_close('E[X^3]', (X**3).mean(), E_X3_th, tol=1.0)
# Part (b): Exp + Exp ~ Gamma(2, lambda) via MGF product rule
# M_Exp(t) = lambda/(lambda-t)
# M_{X+Y}(t) = [lambda/(lambda-t)]^2 = M_Gamma(2,lambda)(t)
lambda_e = 2.0
X_exp = stats.expon(scale=1/lambda_e).rvs(50000)
Y_exp = stats.expon(scale=1/lambda_e).rvs(50000)
Z = X_exp + Y_exp
gamma_rv = stats.gamma(2, scale=1/lambda_e)
ks_stat, p_val = stats.kstest(Z, gamma_rv.cdf)
print(f'\nExp+Exp ~ Gamma(2,lambda): KS p={p_val:.4f}')
check_true('Exp+Exp ~ Gamma(2,lambda) (p>0.05)', p_val > 0.05)
# Part (c): Empirical MGF
def empirical_mgf(samples, t):
return np.mean(np.exp(t * samples))
mu_g, sigma_g = 2.0, 3.0
X_g = np.random.normal(mu_g, sigma_g, 200000)
print('\nEmpirical vs theoretical MGF for N(2,9):')
for t in [0.1, 0.2, 0.3]:
emp = empirical_mgf(X_g, t)
theory = np.exp(mu_g*t + sigma_g**2*t**2/2)
print(f' t={t}: emp={emp:.4f}, theory={theory:.4f}, rel_err={abs(emp-theory)/theory:.4f}')
check_close(f'MGF at t={t}', emp, theory, tol=0.02)
print('\nTakeaway: The MGF uniquely determines a distribution. Its derivatives at t=0 give')
print('all moments, and the product rule for independent sums is a powerful tool.')
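The solution differentiates the MGF numerically only to second order; here is a sketch of third- and fourth-derivative central-difference stencils for the remaining two moments (the step size h = 1e-2 is a judgment call trading truncation error against rounding):
# Higher-order central differences of the MGF at t=0 recover E[X^3], E[X^4].
h = 1e-2
f = mgf_pois
d3 = (f(2*h) - 2*f(h) + 2*f(-h) - f(-2*h)) / (2 * h**3)
d4 = (f(2*h) - 4*f(h) + 6*f(0) - 4*f(-h) + f(-2*h)) / h**4
check_close('E[X^3] via MGF stencil', d3, E_X3_th, tol=0.05)
check_close('E[X^4] via MGF stencil', d4, E_X4_th, tol=0.5)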
Exercise 5 ★★ — Exponential Family: Bernoulli and Gaussian
Part (a): Write the $\text{Bernoulli}(p)$ distribution in canonical exponential family form:
$p(x \mid \eta) = h(x)\exp\big(\eta\, T(x) - A(\eta)\big)$
Identify $\eta$, $T(x)$, $A(\eta)$, and $h(x)$.
Part (b): Verify the log-partition theorem for Bernoulli: $A'(\eta) = E[T(X)] = p$ and $A''(\eta) = \text{Var}(T(X)) = p(1-p)$.
Part (c): Show that $A'(\eta) = \sigma(\eta)$ where $\sigma$ is the sigmoid function. This connects maximum likelihood estimation to logistic regression.
Part (d): Write $\mathcal{N}(\mu, \sigma^2)$ in exponential family form with natural parameters $\eta_1 = \mu/\sigma^2$ and $\eta_2 = -1/(2\sigma^2)$. Verify that $\partial A/\partial \eta_1 = E[X] = \mu$.
Code cell 17
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 18
# Solution
import numpy as np
from scipy import stats
header('Exercise 5: Exponential Family')
# Part (a): Bernoulli as exponential family
# p(x|p) = exp(x*log(p/(1-p)) + log(1-p))
# eta = log(p/(1-p)) [log-odds / logit]
# T(x) = x [sufficient statistic]
# A(eta) = log(1 + e^eta) [log-partition]
# h(x) = 1 [base measure]
p = 0.7
eta = np.log(p / (1 - p)) # logit(p)
A_eta = np.log(1 + np.exp(eta))
print(f'Bernoulli exponential family:')
print(f' eta = logit(p) = {eta:.4f}')
print(f' A(eta) = log(1+e^eta) = {A_eta:.4f}')
print(f' A(eta) = -log(1-p) = {-np.log(1-p):.4f}')
check_close('A(eta) = -log(1-p)', A_eta, -np.log(1-p))
# Part (b)+(c): A'(eta) = sigmoid(eta) = p
def A_bernoulli(eta):
return np.log(1 + np.exp(eta))
eps = 1e-7
dA = (A_bernoulli(eta+eps) - A_bernoulli(eta-eps)) / (2*eps)
sigmoid_eta = 1 / (1 + np.exp(-eta))
print(f'\nLog-partition theorem:')
print(f' A prime(eta) = {dA:.6f}')
print(f' sigmoid(eta) = {sigmoid_eta:.6f}')
print(f' p = {p:.6f}')
check_close('A prime(eta) = sigmoid(eta) = p', dA, p)
# Part (d): Gaussian exponential family
mu, sigma = 2.0, 1.5
eta1 = mu / sigma**2
eta2 = -1 / (2 * sigma**2)
def A_gaussian(eta1, eta2):
return -eta1**2 / (4*eta2) - 0.5 * np.log(-2*eta2)
eps = 1e-7
dA_deta1 = (A_gaussian(eta1+eps, eta2) - A_gaussian(eta1-eps, eta2)) / (2*eps)
print(f'\nGaussian exponential family:')
print(f' dA/d_eta1 = {dA_deta1:.6f} (E[X] = mu = {mu:.6f})')
check_close('dA/d_eta1 = E[X] = mu', dA_deta1, mu)
print('\nTakeaway: The exponential family unifies all common distributions. The log-partition')
print('theorem connects MLE to moment matching — the basis of GLMs and natural gradients.')
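Part (b)'s second claim, $A''(\eta) = \text{Var}(T(X)) = p(1-p)$, can be checked the same way; a minimal sketch reusing `A_bernoulli` and `eta` from above:
# Second derivative of the log-partition at eta gives Var(T(X)) = p(1-p).
h = 1e-4
d2A = (A_bernoulli(eta + h) - 2 * A_bernoulli(eta) + A_bernoulli(eta - h)) / h**2
check_close("A''(eta) = p(1-p)", d2A, p * (1 - p), tol=1e-4)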
Exercise 6 ★★ — Dirichlet-Categorical: Topic Model Simulation
Part (a): Starting from a $\text{Dir}(1, 1, 1, 1)$ prior over $K = 4$ topics, update the distribution after observing word counts $n = (15, 8, 3, 4)$. Compute the posterior mean and the MAP estimate.
Part (b): Sample 2000 probability vectors from $\text{Dir}(\alpha \mathbf{1})$ for $\alpha = 0.1$, $1$, $10$. Report the mean entropy across samples in each case.
Part (c): Implement a single-step LDA document update: given a document with word-topic assignments, update the topic distribution using the Dirichlet conjugate update (see the sketch after the reference solution).
Part (d): Show that the posterior predictive probability of observing category $k$ next, given counts $n_k$ from a $\text{Dir}(\alpha)$ prior, is $\frac{\alpha_k + n_k}{\alpha_0 + n}$ where $\alpha_0 = \sum_k \alpha_k$.
Code cell 20
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 21
# Solution
import numpy as np
from scipy import stats
np.random.seed(42)
header('Exercise 6: Dirichlet-Categorical Conjugate Prior')
K = 4
alpha_prior = np.ones(K)
counts = np.array([15, 8, 3, 4])
n = counts.sum()
# Part (a)
alpha_post = alpha_prior + counts
alpha0_post = alpha_post.sum()
post_mean = alpha_post / alpha0_post
post_mode = (alpha_post - 1) / (alpha0_post - K) # valid since alpha_k > 1
print('Dirichlet-Categorical update:')
print(f' Prior: Dir({alpha_prior})')
print(f' Counts: {counts}')
print(f' Posterior: Dir({alpha_post})')
print(f' Posterior mean: {post_mean.round(4)}')
print(f' MAP estimate: {post_mode.round(4)}')
print(f' MLE (counts/n): {(counts/n).round(4)}')
check_close('Post mean sums to 1', post_mean.sum(), 1.0)
check_close('MAP sums to 1', post_mode.sum(), 1.0)
# Part (b): Entropy comparison
def entropy_samples(samples):
p = samples + 1e-12
return -np.sum(p * np.log(p), axis=1).mean()
max_entropy = np.log(K)
print('\nDirichlet entropy vs concentration:')
for alpha_scale in [0.1, 1.0, 10.0]:
alpha = np.ones(K) * alpha_scale
samples = np.random.dirichlet(alpha, 2000)
mean_H = entropy_samples(samples)
print(f' alpha={alpha_scale}: mean H={mean_H:.4f} (max={max_entropy:.4f})')
check_true('Concentrated (alpha=10) has higher mean entropy than sparse (alpha=0.1)',
entropy_samples(np.random.dirichlet(np.ones(K)*10, 2000)) >
entropy_samples(np.random.dirichlet(np.ones(K)*0.1, 2000)))
# Part (d): Posterior predictive
# P(x_{n+1}=k | n) = (alpha_k + n_k) / (alpha_0 + n)
pred_prob = alpha_post / (alpha_prior.sum() + n)
print(f'\nPosterior predictive: {pred_prob.round(4)}')
check_close('Predictive probs sum to 1', pred_prob.sum(), 1.0)
print('\nTakeaway: Dirichlet-Categorical conjugacy is the backbone of LDA topic models.')
print('The posterior predictive with Laplace (alpha=1) smoothing avoids zero-probability')
print('tokens — crucial for language model generalisation.')
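Part (c) has no code in the reference solution; below is a minimal sketch of a single conjugate update for one document, where the assignment vector `z_assign` is hypothetical data invented for illustration:
# Part (c) sketch: one Dirichlet update for a document's topic distribution.
# z_assign holds a hypothetical per-token topic assignment (values in 0..K-1).
z_assign = np.array([0, 0, 2, 1, 0, 3, 1, 0, 2, 0])
topic_counts = np.bincount(z_assign, minlength=K)
alpha_doc = np.ones(K)                  # symmetric Dir(1) document prior
alpha_doc_post = alpha_doc + topic_counts
theta_draw = np.random.dirichlet(alpha_doc_post)
print(f'Topic counts: {topic_counts}, posterior: Dir({alpha_doc_post})')
print(f'One posterior draw of theta: {theta_draw.round(4)}')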
Exercise 7 ★★★ — Softmax Cross-Entropy: Loss and Gradient
The cross-entropy loss for multi-class classification is
$L(z, y) = -\log \text{softmax}(z)_y = -z_y + \log \sum_k e^{z_k}$
Part (a): Implement softmax(z) with numerical stability (subtract max). Implement cross_entropy_loss(z, y) where $y$ is the true class index.
Part (b): Compute the gradient $\partial L/\partial z$. Show that it equals $p - e_y$, where $p = \text{softmax}(z)$ and $e_y$ is the one-hot vector for class $y$.
Part (c): Verify your gradient via finite differences.
Part (d): Show that label smoothing with parameter $\varepsilon$ replaces the one-hot target with $\tilde{y}_k = (1-\varepsilon)\,\mathbb{1}[k = y] + \varepsilon/K$. Compute the smoothed loss for $\varepsilon = 0.1$.
Code cell 23
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 24
# Solution
import numpy as np
header('Exercise 7: Softmax Cross-Entropy')
def softmax(z):
z = np.asarray(z, dtype=float)
z = z - z.max() # numerical stability
e = np.exp(z)
return e / e.sum()
def cross_entropy_loss(z, y):
p = softmax(z)
return -np.log(p[y] + 1e-12)
def ce_gradient(z, y):
K = len(z)
p = softmax(z)
e_y = np.zeros(K)
e_y[y] = 1.0
return p - e_y # gradient w.r.t. logits
z = np.array([2.0, 1.0, 0.5, -0.5, -1.0])
y = 0
p = softmax(z)
loss = cross_entropy_loss(z, y)
grad = ce_gradient(z, y)
print(f'softmax(z) = {p.round(4)}')
print(f'CE loss = {loss:.4f}')
print(f'gradient = {grad.round(4)}')
check_close('p sums to 1', p.sum(), 1.0)
# Verify gradient via finite differences
eps = 1e-7
grad_fd = np.zeros(len(z))
for i in range(len(z)):
z_plus = z.copy(); z_plus[i] += eps
z_minus = z.copy(); z_minus[i] -= eps
grad_fd[i] = (cross_entropy_loss(z_plus, y) - cross_entropy_loss(z_minus, y)) / (2*eps)
print('\nGradient verification:')
check_close('Analytical gradient = finite diff', grad, grad_fd)
# Label smoothing
K = len(z)
eps_smooth = 0.1
y_smooth = np.ones(K) * eps_smooth / K
y_smooth[y] += 1 - eps_smooth
loss_smooth = -np.sum(y_smooth * np.log(p + 1e-12))
print(f'\nLabel smoothing (eps={eps_smooth}):')
print(f' Hard CE loss: {loss:.4f}')
print(f' Smoothed CE loss: {loss_smooth:.4f}')
check_true('Label-smoothed loss > hard loss (entropy penalty)', loss_smooth > loss)
print('\nTakeaway: The CE gradient p - e_y has an elegant form: it pushes the model')
print('probability toward 1 for the correct class and toward 0 for others.')
print('Label smoothing prevents over-confident predictions (regularisation).')
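One follow-up worth verifying: under label smoothing the gradient keeps the same form, becoming $p - \tilde{y}$ (the derivation mirrors the $p - e_y$ case, since $\sum_k \tilde{y}_k = 1$). A finite-difference sketch of that claim, reusing `z`, `p`, and `y_smooth` from the solution:
# With a smoothed target, dL/dz = p - y_smooth (same derivation as p - e_y).
def smooth_loss(logits):
    return -np.sum(y_smooth * np.log(softmax(logits) + 1e-12))
fd = np.zeros(len(z))
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += 1e-7; zm[i] -= 1e-7
    fd[i] = (smooth_loss(zp) - smooth_loss(zm)) / 2e-7
check_close('Smoothed gradient = p - y_smooth', fd, p - y_smooth, tol=1e-5)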
Exercise 8 ★★★ — VAE KL Divergence: Closed-Form Regulariser
In a Variational Autoencoder (VAE), the encoder outputs $\mu$ and $\log \sigma^2$ that parameterise $q(z \mid x) = \mathcal{N}(\mu, \text{diag}(\sigma^2))$. The prior is $p(z) = \mathcal{N}(0, I)$.
Part (a): Derive and implement the closed-form KL divergence:
$\text{KL}(q \,\|\, p) = \frac{1}{2}\sum_{d=1}^{D}\left(\mu_d^2 + \sigma_d^2 - \log \sigma_d^2 - 1\right)$
Part (b): Verify that KL = 0 when $\mu = 0$ and $\sigma^2 = 1$.
Part (c): Compute KL for $\mu = (1, -0.5, 2)$, $\log \sigma^2 = (0, -0.5, 0.5)$.
Part (d): Show numerically that this closed form matches the Monte Carlo estimate $\frac{1}{N}\sum_i \left[\log q(z_i) - \log p(z_i)\right]$ using samples $z_i \sim q$.
Part (e): Plot how KL varies as $\mu$ ranges from $-3$ to $3$ (with $\sigma^2 = 1$) and as $\sigma^2$ ranges from $0.1$ to $5$ (with $\mu = 0$).
Code cell 26
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 27
# Solution
import numpy as np
from scipy import stats
header('Exercise 8: VAE KL Divergence')
def kl_gaussian(mu, log_var):
"""KL(N(mu, diag(sigma^2)) || N(0, I)), per-dimension then summed."""
sigma2 = np.exp(log_var)
return 0.5 * np.sum(mu**2 + sigma2 - log_var - 1)
# Part (b): KL = 0 at prior
mu_zero = np.zeros(3)
lv_zero = np.zeros(3)
kl_zero = kl_gaussian(mu_zero, lv_zero)
print(f'KL at mu=0, log_var=0: {kl_zero:.8f}')
check_close('KL = 0 at prior', kl_zero, 0.0)
# Part (c)
mu_c = np.array([1.0, -0.5, 2.0])
lv_c = np.array([0.0, -0.5, 0.5])
kl_c = kl_gaussian(mu_c, lv_c)
print(f'KL = {kl_c:.4f}')
# Per-dimension contributions
per_d = 0.5 * (mu_c**2 + np.exp(lv_c) - lv_c - 1)
print(f'Per-dim contributions: {per_d.round(4)}')
check_close('Sum of per-dim = total KL', per_d.sum(), kl_c)
# Part (d): Monte Carlo verification
D = 3
N = 200000
mu_mc = np.array([1.0, 0.5, -0.5])
lv_mc = np.array([0.0, 0.5, -0.3])
sigma_mc = np.exp(0.5 * lv_mc)
# Sample from q
Z = np.random.normal(0, 1, (N, D)) * sigma_mc + mu_mc
# log q(z) - log p(z)
log_q = stats.norm.logpdf(Z, loc=mu_mc, scale=sigma_mc).sum(axis=1)
log_p = stats.norm.logpdf(Z, loc=0, scale=1).sum(axis=1)
kl_mc = (log_q - log_p).mean()
kl_closed = kl_gaussian(mu_mc, lv_mc)
print(f'\nKL Monte Carlo: {kl_mc:.4f}')
print(f'KL closed-form: {kl_closed:.4f}')
check_close('MC ≈ closed-form', kl_mc, kl_closed, tol=0.05)
try:
import matplotlib.pyplot as plt
HAS_MPL = True
except ImportError:
HAS_MPL = False
# Part (e): KL as function of mu and sigma^2
if HAS_MPL:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
ax = axes[0]
mu_range = np.linspace(-3, 3, 200)
kl_vs_mu = [kl_gaussian(np.array([m]), np.array([0.0])) for m in mu_range]
ax.plot(mu_range, kl_vs_mu, 'steelblue', lw=2)
ax.axhline(0, color='red', ls='--', alpha=0.5)
ax.set_xlabel('μ (σ²=1)'); ax.set_ylabel('KL'); ax.set_title('KL vs Mean')
ax = axes[1]
sigma2_range = np.linspace(0.1, 5, 200)
lv_range = np.log(sigma2_range)
kl_vs_var = [kl_gaussian(np.array([0.0]), np.array([lv])) for lv in lv_range]
ax.plot(sigma2_range, kl_vs_var, 'firebrick', lw=2)
ax.axhline(0, color='blue', ls='--', alpha=0.5)
ax.set_xlabel('σ² (μ=0)'); ax.set_ylabel('KL'); ax.set_title('KL vs Variance')
plt.suptitle('VAE Regularisation: KL(N(μ,σ²)||N(0,1))', fontsize=13)
plt.tight_layout(); plt.show()
print('\nTakeaway: The closed-form KL enables efficient ELBO optimisation in VAEs.')
print('The KL term is 0 only at the prior (mu=0, sigma^2=1), pushing the encoder')
print('toward a compact, unit-variance latent space while the reconstruction term')
print('pulls it toward informative encodings.')
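Two quick consistency checks on the closed form, a sketch following directly from the per-dimension formula $\frac{1}{2}(\mu^2 + \sigma^2 - \log \sigma^2 - 1)$:
# With sigma^2 = 1 the KL reduces to mu^2/2; with mu = 0 it is minimised
# (and equal to 0) exactly at sigma^2 = 1.
for m in [0.5, 1.0, 2.0]:
    check_close(f'KL(mu={m}, var=1) = mu^2/2',
                kl_gaussian(np.array([m]), np.array([0.0])), m**2 / 2)
lvs = np.linspace(-2, 2, 401)
kls = np.array([kl_gaussian(np.array([0.0]), np.array([lv])) for lv in lvs])
check_true('KL over variance minimised at sigma^2 = 1 (log_var = 0)',
           abs(lvs[kls.argmin()]) < 1e-9)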
Exercise 9 ★★★ — Beta-Binomial Posterior Predictive
A classifier has been evaluated on $n = 40$ examples and got $k = 31$ correct. Put a $\text{Beta}(2, 2)$ prior on its unknown accuracy $\theta$.
Part (a): Compute the posterior distribution for $\theta$.
Part (b): Compute the posterior mean and a 95% credible interval.
Part (c): Compute the posterior predictive probability that the next label is correct.
Part (d): Compare this Bayesian estimate with the raw empirical accuracy.
Code cell 29
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 30
# Solution
from scipy import stats
a0, b0 = 2, 2
n, k = 40, 31
apost = a0 + k
bpost = b0 + n - k
posterior = stats.beta(apost, bpost)
posterior_mean = apost / (apost + bpost)
ci = posterior.ppf([0.025, 0.975])
raw_acc = k / n
header('Exercise 9: Beta-Binomial Posterior Predictive')
print(f'Posterior: Beta({apost}, {bpost})')
print(f'Raw accuracy: {raw_acc:.4f}')
print(f'Posterior mean / predictive P(next correct): {posterior_mean:.4f}')
print(f'95% credible interval: [{ci[0]:.4f}, {ci[1]:.4f}]')
check_true('Posterior mean is shrunk toward prior mean 0.5', 0.5 < posterior_mean < raw_acc)
check_true('Credible interval contains posterior mean', ci[0] < posterior_mean < ci[1])
print('Takeaway: conjugacy turns counts into calibrated uncertainty, not just a point estimate.')
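A Monte Carlo sketch of the predictive claim in part (c): draw $\theta$ from the posterior, then one Bernoulli outcome per draw, and compare the hit rate to the posterior mean (numpy is imported here so the cell stands alone):
# MC check that P(next correct) equals the posterior mean of theta.
import numpy as np
np.random.seed(0)
theta_samples = posterior.rvs(size=200_000)
next_correct = np.random.rand(200_000) < theta_samples
print(f'MC predictive = {next_correct.mean():.4f}, closed form = {posterior_mean:.4f}')
check_true('MC within 0.005 of closed form',
           abs(next_correct.mean() - posterior_mean) < 0.005)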
Exercise 10 ★★★ — Temperature Scaling as a Categorical Distribution Family
Let logits $z$ define a temperature-scaled categorical distribution $p_T(k) = \text{softmax}(z/T)_k$.
Part (a): Compute $p_T$ for several temperatures.
Part (b): Verify that entropy increases with temperature for the given logits.
Part (c): Compute the negative log-likelihood of a target class under each $p_T$.
Part (d): Connect this to calibration and decoding in language models.
Code cell 32
# Your Solution
print("Write your solution here, then compare with the reference solution below.")
Code cell 33
# Solution
import numpy as np
def softmax_T(z, T):
z = np.asarray(z, dtype=float) / T
z -= z.max()
p = np.exp(z)
return p / p.sum()
def entropy_nat(p):
return -np.sum(p * np.log(p + 1e-15))
z = np.array([2.2, 1.0, 0.3, -0.5])
target = 0
temps = np.array([0.5, 0.8, 1.0, 1.5, 2.0])
entropies = []
nlls = []
header('Exercise 10: Temperature Scaling Categorical Family')
for T in temps:
p = softmax_T(z, T)
H = entropy_nat(p)
nll = -np.log(p[target])
entropies.append(H)
nlls.append(nll)
print(f'T={T:.1f}: p={p.round(4)}, entropy={H:.4f}, target NLL={nll:.4f}')
check_true('Entropy is nondecreasing here as T rises', np.all(np.diff(entropies) > -1e-10))
check_true('Top-class NLL rises as distribution is softened', nlls[-1] > nlls[0])
print('Takeaway: temperature is not a new distribution; it is a reparameterized categorical that controls sharpness.')
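For part (d)'s calibration angle, a minimal sketch that grid-searches $T$ to minimise mean NLL; the batch of logits and labels below is hypothetical data invented for illustration:
# Calibration sketch: pick T minimising mean NLL on held-out (logits, label) pairs.
batch_logits = np.array([[2.2, 1.0, 0.3, -0.5],
                         [0.5, 1.8, 0.2, 0.1],
                         [1.1, 0.9, 1.0, 0.8]])
labels = np.array([0, 1, 2])
T_grid = np.linspace(0.3, 3.0, 28)
mean_nll = [np.mean([-np.log(softmax_T(zz, T)[yy] + 1e-15)
                     for zz, yy in zip(batch_logits, labels)]) for T in T_grid]
T_best = T_grid[int(np.argmin(mean_nll))]
print(f'Best temperature on this toy batch: T = {T_best:.2f}')
Fitting $T$ on real validation logits this way is the standard post-hoc temperature-scaling recipe for classifier calibration; in decoding, the same $T$ instead trades off sharpness against diversity of samples.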
What to Review After Finishing
- Binomial → Poisson approximation: when is TVD small?
- Beta distribution: mode formula and edge cases ($\alpha \le 1$ or $\beta \le 1$)
- Gaussian moment formula: $E[X^{2k}] = (2k-1)!!\,\sigma^{2k}$
- MGF uniqueness and product rule for independent sums
- Exponential family: natural parameters, sufficient statistics, log-partition theorem
- Dirichlet-Categorical conjugacy: counting update rule
- Softmax gradient: derive $p - e_y$ analytically
- VAE KL: closed-form derivation and connection to ELBO
- Beta-Binomial posterior predictive: the probabilistic calculation and its ML relevance
- Temperature-scaled categorical: connecting the computation to model behaviour
References
- Murphy, K.P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.
- Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer.
- Blei, D.M., Kucukelbir, A., McAuliffe, J.D. (2017). Variational Inference: A Review for Statisticians. Journal of the American Statistical Association.
- Wainwright, M.J., Jordan, M.I. (2008). Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning.