Theory Notebook
Converted from theory.ipynb for web reading.
Regression Analysis — Theory Notebook
Regression turns feature-response relationships into a concrete language for prediction, uncertainty, and regularization.
This notebook is the interactive companion to notes.md. It moves from conditional modeling and OLS geometry through diagnostics, regularized regression, GLMs, and machine-learning applications.
| Coverage | What You Will Do |
|---|---|
| OLS foundations | Derive slopes, solve normal equations, verify projection geometry |
| Diagnostics | Inspect leverage, multicollinearity, residual structure, and heteroscedasticity |
| Regularization | Compare OLS, Ridge, and Lasso under instability and sparsity |
| GLMs | Fit logistic and Poisson models and interpret their linear predictors |
| ML links | Use regression-style models for baselines, probes, calibration, and structured adaptation |
Code cell 2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
try:
import seaborn as sns
sns.set_theme(style="whitegrid", palette="colorblind")
HAS_SNS = True
except ImportError:
plt.style.use("seaborn-v0_8-whitegrid")
HAS_SNS = False
mpl.rcParams.update({
"figure.figsize": (10, 6),
"figure.dpi": 120,
"font.size": 13,
"axes.titlesize": 15,
"axes.labelsize": 13,
"xtick.labelsize": 11,
"ytick.labelsize": 11,
"legend.fontsize": 11,
"legend.framealpha": 0.85,
"lines.linewidth": 2.0,
"axes.spines.top": False,
"axes.spines.right": False,
"savefig.bbox": "tight",
"savefig.dpi": 150,
})
np.random.seed(42)
print("Plot setup complete.")
Code cell 3
import numpy as np
import numpy.linalg as la
from scipy import stats
from sklearn.linear_model import (LinearRegression, Ridge, Lasso,
LogisticRegression, PoissonRegressor, HuberRegressor)
from sklearn.metrics import mean_squared_error, accuracy_score, brier_score_loss
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
try:
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams.update({
"figure.figsize": (10, 6),
"figure.dpi": 120,
"font.size": 12,
"axes.titlesize": 14,
"axes.labelsize": 12,
"xtick.labelsize": 10,
"ytick.labelsize": 10,
})
HAS_MPL = True
except ImportError:
HAS_MPL = False
try:
import seaborn as sns
if HAS_MPL:
sns.set_theme(style="whitegrid", palette="colorblind")
HAS_SNS = True
except ImportError:
if HAS_MPL:
plt.style.use("seaborn-v0_8-whitegrid")
HAS_SNS = False
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
print("Setup complete.")
Code cell 4
COLORS = {
"primary": "#0077BB",
"secondary": "#EE7733",
"tertiary": "#009988",
"error": "#CC3311",
"neutral": "#555555",
"highlight": "#EE3377",
}
def header(title):
print("\n" + "=" * len(title))
print(title)
print("=" * len(title))
def report(name, cond):
print(f"{'PASS' if cond else 'FAIL'} - {name}")
return cond
def add_intercept(X):
X = np.asarray(X, dtype=float)
if X.ndim == 1:
X = X[:, None]
return np.column_stack([np.ones(X.shape[0]), X])
def ols_beta(X, y):
X = np.asarray(X, dtype=float)
y = np.asarray(y, dtype=float)
return la.lstsq(X, y, rcond=None)[0]
def rss(y, fitted):
return float(np.sum((np.asarray(y) - np.asarray(fitted)) ** 2))
def tss(y):
y = np.asarray(y, dtype=float)
return float(np.sum((y - y.mean()) ** 2))
def r2_score_manual(y, fitted):
return 1.0 - rss(y, fitted) / tss(y)
def adjusted_r2(y, fitted, p):
n = len(y)
return 1.0 - (rss(y, fitted) / (n - p)) / (tss(y) / (n - 1))
def hat_matrix(X):
X = np.asarray(X, dtype=float)
return X @ la.pinv(X.T @ X) @ X.T
def prediction_interval(X, y, x_star, alpha=0.05):
beta = ols_beta(X, y)
fitted = X @ beta
resid = y - fitted
n, p = X.shape
s2 = rss(y, fitted) / (n - p)
xtx_inv = la.pinv(X.T @ X)
x_star = np.asarray(x_star, dtype=float)
mean = float(x_star @ beta)
mean_se = np.sqrt(s2 * x_star @ xtx_inv @ x_star)
tcrit = stats.t.ppf(1 - alpha / 2, n - p)
ci = (mean - tcrit * mean_se, mean + tcrit * mean_se)
pred_se = np.sqrt(s2 * (1 + x_star @ xtx_inv @ x_star))
pi = (mean - tcrit * pred_se, mean + tcrit * pred_se)
return mean, ci, pi
def hc0_se(X, resid):
X = np.asarray(X, dtype=float)
resid = np.asarray(resid, dtype=float)
xtx_inv = la.pinv(X.T @ X)
meat = X.T @ (resid[:, None] ** 2 * X)
cov = xtx_inv @ meat @ xtx_inv
return np.sqrt(np.diag(cov))
def soft_threshold(z, lam):
return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
def sigmoid(z):
return 1.0 / (1.0 + np.exp(-z))
def vif_scores(X):
X = np.asarray(X, dtype=float)
out = []
for j in range(X.shape[1]):
target = X[:, j]
others = np.delete(X, j, axis=1)
beta = ols_beta(add_intercept(others), target)
fitted = add_intercept(others) @ beta
r2 = np.clip(r2_score_manual(target, fitted), 0.0, 0.999999999)
out.append(1.0 / (1.0 - r2))
return np.array(out)
print("Helpers ready.")
1. Intuition
1.1 From Correlation to Conditional Modeling
Correlation summarizes pairwise association. Regression asks how the relationship changes once we condition on several predictors simultaneously.
Code cell 6
# === 1.1 From Correlation to Conditional Modeling ===
X1 = rng.normal(size=400)
X2 = 0.8 * X1 + rng.normal(scale=0.5, size=400)
y = 3.0 * X1 - 2.0 * X2 + rng.normal(scale=0.7, size=400)
simple_slope = np.cov(X1, y, ddof=1)[0, 1] / np.var(X1, ddof=1)
X_multi = add_intercept(np.column_stack([X1, X2]))
beta_multi = ols_beta(X_multi, y)
header("Association vs conditional effect")
print(f"corr(X1, y): {np.corrcoef(X1, y)[0, 1]:.4f}")
print(f"Simple-regression slope for X1 only: {simple_slope:.4f}")
print(f"Multiple-regression coefficient for X1: {beta_multi[1]:.4f}")
print(f"Multiple-regression coefficient for X2: {beta_multi[2]:.4f}")
report("Conditional slope differs from raw association", abs(simple_slope - beta_multi[1]) > 0.25)
1.2 Why Regression Is the Blueprint for Supervised Learning
The same linear predictor can drive very different tasks depending on how we map it to the response.
Code cell 8
# === 1.2 Why Regression Is the Blueprint for Supervised Learning ===
eta = np.linspace(-3, 3, 7)
identity = eta
prob = sigmoid(eta)
rate = np.exp(eta)
header("One linear predictor, multiple supervised tasks")
print("eta ", eta)
print("identity ", np.round(identity, 4))
print("sigmoid ", np.round(prob, 4))
print("exp(rate) ", np.round(rate, 4))
report("Sigmoid stays in [0, 1]", np.all((prob >= 0) & (prob <= 1)))
report("Exponential link stays positive", np.all(rate > 0))
1.3 When Linear Structure Works and When It Breaks
A linear fit can be excellent when the true relationship is close to linear, and badly misspecified when curvature matters.
Code cell 10
# === 1.3 When Linear Structure Works and When It Breaks ===
x = np.linspace(-2.5, 2.5, 200)
y_linear = 1.0 + 2.0 * x + rng.normal(scale=0.35, size=len(x))
y_curve = 1.0 + 0.8 * x + 1.4 * x**2 + rng.normal(scale=0.35, size=len(x))
X_lin = add_intercept(x)
beta_linear = ols_beta(X_lin, y_linear)
beta_bad = ols_beta(X_lin, y_curve)
poly = PolynomialFeatures(degree=2, include_bias=True)
X_poly = poly.fit_transform(x[:, None])
beta_poly = ols_beta(X_poly, y_curve)
mse_linear_on_linear = mean_squared_error(y_linear, X_lin @ beta_linear)
mse_linear_on_curve = mean_squared_error(y_curve, X_lin @ beta_bad)
mse_poly_on_curve = mean_squared_error(y_curve, X_poly @ beta_poly)
header("Linear structure: match vs mismatch")
print(f"Linear data, linear fit MSE : {mse_linear_on_linear:.4f}")
print(f"Curved data, linear fit MSE : {mse_linear_on_curve:.4f}")
print(f"Curved data, quadratic fit MSE : {mse_poly_on_curve:.4f}")
report("Quadratic basis improves nonlinear fit", mse_poly_on_curve < mse_linear_on_curve)
if HAS_MPL:
fig, ax = plt.subplots(1, 2, figsize=(12, 4.5))
ax[0].scatter(x, y_linear, s=12, alpha=0.5, color=COLORS["primary"])
ax[0].plot(x, X_lin @ beta_linear, color=COLORS["secondary"], lw=2)
ax[0].set_title("Linear signal")
ax[0].set_xlabel("x")
ax[0].set_ylabel("y")
ax[1].scatter(x, y_curve, s=12, alpha=0.5, color=COLORS["primary"])
ax[1].plot(x, X_lin @ beta_bad, color=COLORS["error"], lw=2, label="linear fit")
ax[1].plot(x, X_poly @ beta_poly, color=COLORS["tertiary"], lw=2, label="quadratic fit")
ax[1].legend()
ax[1].set_title("Curvature breaks plain linearity")
ax[1].set_xlabel("x")
ax[1].set_ylabel("y")
plt.tight_layout()
plt.show()
1.4 Historical Timeline
Regression developed by combining ideas from geometry and probability with practical prediction needs.
Code cell 12
# === 1.4 Historical Timeline ===
timeline = [
(1805, "Legendre", "least squares"),
(1809, "Gauss", "Gaussian error connection"),
(1970, "Hoerl & Kennard", "Ridge regression"),
(1972, "Nelder & Wedderburn", "GLMs"),
(1996, "Tibshirani", "Lasso"),
(2019, "interpretability literature", "linear probes"),
(2022, "PEFT literature", "low-rank adaptation at scale"),
]
header("Selected regression milestones")
for year, person, contribution in timeline:
print(f"{year}: {person} - {contribution}")
report("Timeline is in chronological order", all(timeline[i][0] < timeline[i + 1][0] for i in range(len(timeline) - 1)))
2. Formal Definitions
2.1 Response, Predictors, and Conditional Expectation
The regression target is a conditional pattern such as the mean response given predictors.
Code cell 14
# === 2.1 Response, Predictors, and Conditional Expectation ===
x = rng.uniform(-2, 2, size=500)
y = 1.5 + 1.2 * x + rng.normal(scale=0.6, size=len(x))
bins = np.linspace(-2, 2, 9)
which = np.digitize(x, bins) - 1
bin_means_x = []
bin_means_y = []
for j in range(len(bins) - 1):
mask = which == j
if mask.sum() > 5:
bin_means_x.append(x[mask].mean())
bin_means_y.append(y[mask].mean())
beta = ols_beta(add_intercept(x), y)
header("Conditional mean as the regression target")
print(f"Estimated intercept: {beta[0]:.4f}")
print(f"Estimated slope : {beta[1]:.4f}")
report("Binned conditional means track the fitted line", np.corrcoef(bin_means_x, bin_means_y)[0, 1] > 0.95)
2.2 Design Matrix, Linear Predictor, and Feature Coding
The design matrix stores the feature columns that define the model space.
Code cell 16
# === 2.2 Design Matrix, Linear Predictor, and Feature Coding ===
length = np.array([20, 50, 20, 80], dtype=float)
gpu_h100 = np.array([0, 0, 1, 1], dtype=float)
interaction = length * gpu_h100
X = np.column_stack([np.ones(len(length)), length, gpu_h100, interaction])
columns = ["intercept", "length", "gpu_h100", "length_x_gpu_h100"]
header("Design matrix example")
print("Columns:", columns)
print(X)
print("Shape:", X.shape)
report("Intercept column is all ones", np.allclose(X[:, 0], 1.0))
2.3 Fitted Values, Residuals, and Sum-of-Squares Quantities
Fitted values and residuals organize both geometry and model evaluation.
Code cell 18
# === 2.3 Fitted Values, Residuals, and Sum-of-Squares Quantities ===
x = np.linspace(0, 5, 20)
y = 2.0 + 0.7 * x + rng.normal(scale=0.5, size=len(x))
X = add_intercept(x)
beta = ols_beta(X, y)
fitted = X @ beta
resid = y - fitted
RSS = rss(y, fitted)
TSS = tss(y)
ESS = float(np.sum((fitted - y.mean()) ** 2))
header("TSS = ESS + RSS decomposition")
print(f"RSS: {RSS:.6f}")
print(f"ESS: {ESS:.6f}")
print(f"TSS: {TSS:.6f}")
report("Explained + residual equals total", np.allclose(RSS + ESS, TSS))
report("Residuals sum to zero with intercept", np.allclose(resid.sum(), 0.0))
2.4 Classical Linear Model Assumptions
The assumptions are separable: different failures break different parts of the classical theory.
Code cell 20
# === 2.4 Classical Linear Model Assumptions ===
x = np.linspace(0, 4, 300)
sigma_homo = 0.5 * np.ones_like(x)
sigma_hetero = 0.2 + 0.35 * x
y_homo = 1.0 + 1.1 * x + rng.normal(scale=sigma_homo)
y_hetero = 1.0 + 1.1 * x + rng.normal(scale=sigma_hetero)
fit_homo = add_intercept(x) @ ols_beta(add_intercept(x), y_homo)
fit_hetero = add_intercept(x) @ ols_beta(add_intercept(x), y_hetero)
resid_homo = y_homo - fit_homo
resid_hetero = y_hetero - fit_hetero
mid = np.median(x)
header("Homoscedastic vs heteroscedastic residual spread")
print("Low-x residual var (homo / hetero):",
np.var(resid_homo[x <= mid]), np.var(resid_hetero[x <= mid]))
print("High-x residual var (homo / hetero):",
np.var(resid_homo[x > mid]), np.var(resid_hetero[x > mid]))
report("Heteroscedastic data shows growing variance", np.var(resid_hetero[x > mid]) > 2 * np.var(resid_hetero[x <= mid]))
2.5 Examples, Non-Examples, and Edge Cases
Rank and dimensionality determine whether OLS has a unique solution.
Code cell 22
# === 2.5 Examples, Non-Examples, and Edge Cases ===
X_full = np.array([[1, 0], [1, 1], [1, 2], [1, 3]], dtype=float)
X_collinear = np.array([[1, 1, 2], [1, 2, 4], [1, 3, 6], [1, 4, 8]], dtype=float)
X_high_dim = rng.normal(size=(6, 10))
header("Ranks of common regression design cases")
print(f"Full-rank design rank : {la.matrix_rank(X_full)} / {X_full.shape[1]}")
print(f"Collinear design rank : {la.matrix_rank(X_collinear)} / {X_collinear.shape[1]}")
print(f"High-dimensional design : {X_high_dim.shape[0]} rows, {X_high_dim.shape[1]} cols")
report("Collinear matrix loses rank", la.matrix_rank(X_collinear) < X_collinear.shape[1])
report("p > n example is high dimensional", X_high_dim.shape[1] > X_high_dim.shape[0])
3. Core Theory I: OLS and Inference
3.1 Simple Linear Regression and Its Link to Correlation
With a single predictor, the OLS slope is the covariance of x and y divided by the variance of x.
Code cell 24
# === 3.1 Simple Linear Regression and Its Link to Correlation ===
x = rng.normal(size=300)
y = 0.8 + 1.7 * x + rng.normal(scale=0.8, size=len(x))
cov_xy = np.cov(x, y, ddof=1)[0, 1]
var_x = np.var(x, ddof=1)
slope_cov = cov_xy / var_x
beta = ols_beta(add_intercept(x), y)
correlation_form = np.corrcoef(x, y)[0, 1] * np.std(y, ddof=1) / np.std(x, ddof=1)
header("Simple-regression slope formulas")
print(f"Slope from covariance/variance : {slope_cov:.6f}")
print(f"Slope from OLS : {beta[1]:.6f}")
print(f"Slope from correlation formula : {correlation_form:.6f}")
report("All slope formulas agree", np.allclose([slope_cov, beta[1], correlation_form], slope_cov))
3.2 Matrix OLS and the Normal Equations
The normal equations define the least-squares solution in multiple regression.
Code cell 26
# === 3.2 Matrix OLS and the Normal Equations ===
X = add_intercept(rng.normal(size=(120, 3)))
true_beta = np.array([1.0, 2.0, -1.5, 0.5])
y = X @ true_beta + rng.normal(scale=0.4, size=X.shape[0])
beta_lstsq = ols_beta(X, y)
beta_normal = la.solve(X.T @ X, X.T @ y)
normal_residual = X.T @ (y - X @ beta_normal)
header("Normal equations")
print("beta from lstsq :", np.round(beta_lstsq, 4))
print("beta from formula :", np.round(beta_normal, 4))
print("X^T residual :", np.round(normal_residual, 8))
report("Closed-form and lstsq agree", np.allclose(beta_lstsq, beta_normal))
report("Normal equations hold", np.allclose(normal_residual, 0.0, atol=1e-8))
3.3 Projection Geometry, the Hat Matrix, and Orthogonality
OLS projects the response vector onto the column space of the design matrix.
Code cell 28
# === 3.3 Projection Geometry, the Hat Matrix, and Orthogonality ===
X = add_intercept(np.linspace(-2, 2, 25))
y = 1.2 + 0.9 * X[:, 1] + rng.normal(scale=0.3, size=X.shape[0])
H = hat_matrix(X)
fitted = H @ y
resid = y - fitted
leverage = np.diag(H)
header("Projection geometry")
print(f"Trace(H) = {np.trace(H):.6f}")
print(f"Max |H^2 - H| = {np.max(np.abs(H @ H - H)):.2e}")
print(f"Max |X^T e| = {np.max(np.abs(X.T @ resid)):.2e}")
print(f"Leverage sum = {leverage.sum():.6f}")
report("H is idempotent", np.allclose(H @ H, H))
report("Residual is orthogonal to model space", np.allclose(X.T @ resid, 0.0, atol=1e-8))
report("Leverages sum to number of parameters", np.allclose(leverage.sum(), X.shape[1]))
3.4 Gauss-Markov Theorem
Under the classical assumptions, OLS has the smallest variance among all linear unbiased estimators.
Code cell 30
# === 3.4 Gauss-Markov Theorem ===
X = add_intercept(np.linspace(-1, 1, 12))
H = hat_matrix(X)
M = np.eye(len(X)) - H
B = rng.normal(size=(X.shape[1], X.shape[0]))
C = B @ M
beta_true = np.array([0.5, 1.3])
ols_draws = []
alt_draws = []
for _ in range(4000):
y = X @ beta_true + rng.normal(scale=0.7, size=X.shape[0])
beta_ols = ols_beta(X, y)
beta_alt = beta_ols + C @ y
ols_draws.append(beta_ols)
alt_draws.append(beta_alt)
ols_draws = np.array(ols_draws)
alt_draws = np.array(alt_draws)
ols_cov = np.cov(ols_draws.T)
alt_cov = np.cov(alt_draws.T)
header("Empirical Gauss-Markov check")
print("Max |C X|:", np.max(np.abs(C @ X)))
print("Mean OLS estimate:", np.round(ols_draws.mean(axis=0), 4))
print("Mean alternative estimate:", np.round(alt_draws.mean(axis=0), 4))
print("trace Var(OLS) :", np.trace(ols_cov))
print("trace Var(alternative):", np.trace(alt_cov))
report("Alternative estimator is approximately unbiased", np.allclose(alt_draws.mean(axis=0), beta_true, atol=0.07))
report("OLS has smaller variance trace", np.trace(ols_cov) < np.trace(alt_cov))
3.5 Sampling Distribution, Standard Errors, Confidence Intervals, and Prediction Intervals
Coefficient uncertainty and predictive uncertainty are related but not the same.
Code cell 32
# === 3.5 Sampling Distribution, Standard Errors, Confidence Intervals, and Prediction Intervals ===
x = np.linspace(0, 4, 40)
X = add_intercept(x)
y = 1.0 + 1.5 * x + rng.normal(scale=0.6, size=len(x))
beta = ols_beta(X, y)
fitted = X @ beta
resid = y - fitted
n, p = X.shape
s2 = rss(y, fitted) / (n - p)
xtx_inv = la.pinv(X.T @ X)
se_beta = np.sqrt(np.diag(s2 * xtx_inv))
tcrit = stats.t.ppf(0.975, n - p)
ci_slope = (beta[1] - tcrit * se_beta[1], beta[1] + tcrit * se_beta[1])
mean_star, ci_mean, pi_star = prediction_interval(X, y, np.array([1.0, 2.5]))
header("Coefficient vs mean-response vs prediction intervals")
print(f"Slope estimate: {beta[1]:.4f}")
print(f"Slope 95% CI: ({ci_slope[0]:.4f}, {ci_slope[1]:.4f})")
print(f"Predicted mean at x*=2.5: {mean_star:.4f}")
print(f"95% CI for mean response: ({ci_mean[0]:.4f}, {ci_mean[1]:.4f})")
print(f"95% prediction interval : ({pi_star[0]:.4f}, {pi_star[1]:.4f})")
report("Prediction interval is wider than mean-response CI", (pi_star[1] - pi_star[0]) > (ci_mean[1] - ci_mean[0]))
4. Core Theory II: Model Assessment and Diagnostics
4.1 R^2, Adjusted R^2, Residual Standard Error, and Goodness of Fit
Goodness-of-fit summaries are useful, but only when we remember what they do and do not measure.
Code cell 34
# === 4.1 R^2, Adjusted R^2, Residual Standard Error, and Goodness of Fit ===
x1 = rng.normal(size=250)
x2 = rng.normal(size=250)
y = 1.0 + 2.0 * x1 + rng.normal(scale=1.0, size=250)
X1 = add_intercept(x1)
X2 = add_intercept(np.column_stack([x1, x2]))
b1 = ols_beta(X1, y)
b2 = ols_beta(X2, y)
fit1 = X1 @ b1
fit2 = X2 @ b2
header("Goodness of fit across nested models")
print(f"Model 1 R^2 : {r2_score_manual(y, fit1):.4f}")
print(f"Model 2 R^2 : {r2_score_manual(y, fit2):.4f}")
print(f"Model 1 adjusted R^2: {adjusted_r2(y, fit1, X1.shape[1]):.4f}")
print(f"Model 2 adjusted R^2: {adjusted_r2(y, fit2, X2.shape[1]):.4f}")
print(f"Model 1 residual SE : {np.sqrt(rss(y, fit1) / (len(y) - X1.shape[1])):.4f}")
print(f"Model 2 residual SE : {np.sqrt(rss(y, fit2) / (len(y) - X2.shape[1])):.4f}")
report("R^2 cannot decrease when adding predictors", r2_score_manual(y, fit2) >= r2_score_manual(y, fit1))
4.2 Residual Diagnostics for Linearity, Variance, and Normality
Residuals should look like leftover noise, not hidden structure.
Code cell 36
# === 4.2 Residual Diagnostics for Linearity, Variance, and Normality ===
x = np.linspace(-2, 2, 220)
y = 1.0 + 1.5 * x + 1.8 * x**2 + rng.normal(scale=0.45, size=len(x))
X = add_intercept(x)
beta = ols_beta(X, y)
fitted = X @ beta
resid = y - fitted
qq_theory, qq_sample = stats.probplot(resid, dist="norm", fit=False)
header("Residual diagnostics")
print(f"corr(residual, x^2): {np.corrcoef(resid, x**2)[0, 1]:.4f}")
print(f"Residual skewness : {stats.skew(resid):.4f}")
print(f"Residual kurtosis : {stats.kurtosis(resid):.4f}")
report("Curvature remains in residuals", abs(np.corrcoef(resid, x**2)[0, 1]) > 0.4)
if HAS_MPL:
fig, ax = plt.subplots(1, 2, figsize=(12, 4.5))
ax[0].scatter(fitted, resid, s=14, alpha=0.6, color=COLORS["primary"])
ax[0].axhline(0.0, color=COLORS["neutral"], lw=1.5)
ax[0].set_title("Residuals vs fitted")
ax[0].set_xlabel("fitted")
ax[0].set_ylabel("residual")
ax[1].scatter(qq_theory, qq_sample, s=14, alpha=0.6, color=COLORS["secondary"])
ax[1].plot([min(qq_theory), max(qq_theory)], [min(qq_theory), max(qq_theory)], color=COLORS["neutral"], lw=1.5)
ax[1].set_title("QQ-style residual check")
ax[1].set_xlabel("theoretical quantiles")
ax[1].set_ylabel("sample quantiles")
plt.tight_layout()
plt.show()
4.3 Leverage, Influence, and Cook's Distance
High leverage and large residuals combine to create influential observations.
Code cell 38
# === 4.3 Leverage, Influence, and Cook's Distance ===
x = np.array([0.0, 0.2, 0.5, 0.8, 1.1, 1.4, 1.8, 5.0])
y = np.array([1.0, 1.2, 1.6, 1.9, 2.1, 2.4, 2.7, 6.8])
X = add_intercept(x)
beta = ols_beta(X, y)
fitted = X @ beta
resid = y - fitted
H = hat_matrix(X)
h = np.diag(H)
s2 = rss(y, fitted) / (len(y) - X.shape[1])
cooks = (resid**2 / (X.shape[1] * s2)) * (h / (1 - h) ** 2)
most_influential = int(np.argmax(cooks))
header("Influence diagnostics")
print("Leverage:", np.round(h, 4))
print("Cook's distance:", np.round(cooks, 4))
print(f"Most influential index: {most_influential}")
report("Edge point has highest leverage", np.argmax(h) == len(x) - 1)
report("Same point is most influential", most_influential == len(x) - 1)
4.4 Multicollinearity, Condition Number, and VIF
Collinearity damages coefficient stability even when prediction remains acceptable.
Code cell 40
# === 4.4 Multicollinearity, Condition Number, and VIF ===
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.05, size=300)
x3 = rng.normal(size=300)
X_no_intercept = np.column_stack([x1, x2, x3])
X = add_intercept(X_no_intercept)
y = 1.0 + 2.5 * x1 - 2.4 * x2 + 0.6 * x3 + rng.normal(scale=0.5, size=300)
beta = ols_beta(X, y)
condition = la.cond(StandardScaler().fit_transform(X_no_intercept))
vifs = vif_scores(StandardScaler().fit_transform(X_no_intercept))
header("Multicollinearity diagnostics")
print("OLS coefficients:", np.round(beta, 4))
print(f"Condition number: {condition:.2f}")
print("VIFs:", np.round(vifs, 2))
report("At least one VIF is very large", np.max(vifs) > 50)
4.5 Heteroscedasticity, Weighted Least Squares, and Robust Standard Errors
When error variance changes with the predictors, OLS coefficients can stay unbiased while inference becomes fragile.
Code cell 42
# === 4.5 Heteroscedasticity, Weighted Least Squares, and Robust Standard Errors ===
x = np.linspace(0.2, 4.0, 250)
sigma = 0.25 + 0.35 * x
y = 1.2 + 0.9 * x + rng.normal(scale=sigma)
X = add_intercept(x)
beta_ols = ols_beta(X, y)
fit_ols = X @ beta_ols
resid_ols = y - fit_ols
classic_se = np.sqrt(np.diag((rss(y, fit_ols) / (len(y) - X.shape[1])) * la.pinv(X.T @ X)))
robust_se = hc0_se(X, resid_ols)
W = np.diag(1 / sigma**2)
beta_wls = la.solve(X.T @ W @ X, X.T @ W @ y)
header("OLS vs WLS under heteroscedasticity")
print("OLS coefficients:", np.round(beta_ols, 4))
print("WLS coefficients:", np.round(beta_wls, 4))
print("Classical SEs:", np.round(classic_se, 4))
print("Robust HC0 SEs:", np.round(robust_se, 4))
report("Robust slope SE differs from classical SE", abs(robust_se[1] - classic_se[1]) > 0.01)
5. Core Theory III: Regularized and Generalized Regression
5.1 Ridge Regression as Stabilized OLS and Shrinkage
Ridge adds an L2 penalty to stabilize estimation when predictors are redundant or noisy.
Code cell 44
# === 5.1 Ridge Regression as Stabilized OLS and Shrinkage ===
X_raw = rng.normal(size=(140, 6))
X_raw[:, 1] = X_raw[:, 0] + 0.15 * rng.normal(size=140)
true_beta = np.array([2.2, 2.0, 0.0, 0.0, 0.7, 0.0])
y = X_raw @ true_beta + rng.normal(scale=1.0, size=140)
scaler = StandardScaler()
X = scaler.fit_transform(X_raw)
ols = ols_beta(add_intercept(X), y)[1:]
lambdas = np.logspace(-3, 2, 8)
ridge_betas = []
for lam in lambdas:
ridge_beta = la.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
ridge_betas.append(ridge_beta)
ridge_betas = np.array(ridge_betas)
header("Ridge shrinkage path")
print("OLS coefficient norm:", la.norm(ols))
print("Largest-lambda ridge norm:", la.norm(ridge_betas[-1]))
report("Ridge shrinks coefficient norm", la.norm(ridge_betas[-1]) < la.norm(ols))
if HAS_MPL:
fig, ax = plt.subplots()
for j in range(ridge_betas.shape[1]):
ax.plot(lambdas, ridge_betas[:, j], marker='o', label=f'beta{j+1}')
ax.set_xscale('log')
ax.set_title('Ridge coefficient paths')
ax.set_xlabel('lambda')
ax.set_ylabel('coefficient value')
plt.tight_layout()
plt.show()
5.2 Lasso Regression and Sparse Feature Selection
The L1 penalty encourages exact zeros and yields sparse fitted models.
Code cell 46
# === 5.2 Lasso Regression and Sparse Feature Selection ===
X_raw = rng.normal(size=(160, 8))
X_raw[:, 2] = X_raw[:, 1] + 0.2 * rng.normal(size=160)
true_beta = np.array([2.5, 0.0, 0.0, -1.8, 0.0, 0.7, 0.0, 0.0])
y = X_raw @ true_beta + rng.normal(scale=0.8, size=160)
X = StandardScaler().fit_transform(X_raw)
alphas = [0.01, 0.05, 0.1, 0.2]
coefs = []
for alpha in alphas:
model = Lasso(alpha=alpha, max_iter=20000)
model.fit(X, y)
coefs.append(model.coef_)
coefs = np.array(coefs)
header("Lasso sparsity")
for alpha, coef in zip(alphas, coefs):
print(f"alpha={alpha:>4}: nonzero={np.sum(np.abs(coef) > 1e-6)} coef={np.round(coef, 3)}")
report("Higher alpha yields at least as much sparsity", np.sum(np.abs(coefs[-1]) > 1e-6) <= np.sum(np.abs(coefs[0]) > 1e-6))
5.3 Logistic Regression as a Bernoulli GLM
Logistic regression maps a linear predictor through the sigmoid to model probabilities.
Code cell 48
# === 5.3 Logistic Regression as a Bernoulli GLM ===
X = rng.normal(size=(500, 2))
logits = -0.4 + 1.6 * X[:, 0] - 1.1 * X[:, 1]
probs = sigmoid(logits)
y = rng.binomial(1, probs)
clf = LogisticRegression(C=1e6, solver='lbfgs', max_iter=2000)
clf.fit(X, y)
pred_prob = clf.predict_proba(X[:5])[:, 1]
odds_ratio_x1 = np.exp(clf.coef_[0, 0])
header("Logistic regression")
print("Estimated intercept:", np.round(clf.intercept_, 4))
print("Estimated coefficients:", np.round(clf.coef_[0], 4))
print("First five predicted probabilities:", np.round(pred_prob, 4))
print(f"Odds ratio for x1: {odds_ratio_x1:.4f}")
report("Predicted probabilities stay in [0, 1]", np.all((pred_prob >= 0) & (pred_prob <= 1)))
5.4 Poisson Regression for Counts and Rates
Poisson regression uses a log link so that fitted means stay positive.
Code cell 50
# === 5.4 Poisson Regression for Counts and Rates ===
x = rng.normal(size=450)
exposure = rng.uniform(0.5, 3.0, size=450)
mu = exposure * np.exp(0.3 + 0.7 * x)
y = rng.poisson(mu)
X = np.column_stack([x, np.log(exposure)])
model = PoissonRegressor(alpha=1e-8, max_iter=3000)
model.fit(X, y)
pred_counts = model.predict(X[:5])
rate_ratio = np.exp(model.coef_[0])
header("Poisson regression with exposure-like feature")
print("Estimated intercept:", round(model.intercept_, 4))
print("Estimated coefficients:", np.round(model.coef_, 4))
print("First five predicted counts:", np.round(pred_counts, 3))
print(f"Multiplicative count change per +1 in x: {rate_ratio:.4f}")
report("Predicted counts are positive", np.all(pred_counts > 0))
5.5 Generalized Linear Models
GLMs keep the same linear predictor but change the response distribution and link function.
Code cell 52
# === 5.5 Generalized Linear Models ===
eta = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
identity_mean = eta
logistic_mean = sigmoid(eta)
poisson_mean = np.exp(eta)
header("GLM link comparison")
print("eta :", eta)
print("identity mean :", np.round(identity_mean, 4))
print("logit^{-1}(eta):", np.round(logistic_mean, 4))
print("exp(eta) :", np.round(poisson_mean, 4))
report("Gaussian identity can be negative", np.any(identity_mean < 0))
report("Poisson mean remains positive", np.all(poisson_mean > 0))
6. Advanced Topics
6.1 Basis Expansions, Polynomial Regression, and Spline Intuition
A nonlinear relationship can still be modeled by regression once the features are expanded.
Code cell 54
# === 6.1 Basis Expansions, Polynomial Regression, and Spline Intuition ===
x = np.linspace(-2.5, 2.5, 180)
y = np.sin(1.5 * x) + 0.2 * x + rng.normal(scale=0.18, size=len(x))
X_linear = add_intercept(x)
X_poly = PolynomialFeatures(degree=3, include_bias=True).fit_transform(x[:, None])
knots = np.array([-1.0, 0.0, 1.0])
X_hinge = np.column_stack([np.ones(len(x)), x] + [np.maximum(0.0, x - k) for k in knots])
fit_linear = X_linear @ ols_beta(X_linear, y)
fit_poly = X_poly @ ols_beta(X_poly, y)
fit_hinge = X_hinge @ ols_beta(X_hinge, y)
header("Basis expansions")
print(f"Linear MSE : {mean_squared_error(y, fit_linear):.4f}")
print(f"Cubic basis MSE: {mean_squared_error(y, fit_poly):.4f}")
print(f"Hinge basis MSE: {mean_squared_error(y, fit_hinge):.4f}")
report("At least one expanded basis beats plain linear fit", min(mean_squared_error(y, fit_poly), mean_squared_error(y, fit_hinge)) < mean_squared_error(y, fit_linear))
6.2 High-Dimensional Regression and the p > n Regime
When predictors outnumber observations, there are many interpolating solutions and the inductive bias chooses among them.
Code cell 56
# === 6.2 High-Dimensional Regression and the p > n Regime ===
n, p = 25, 60
X_train = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = np.array([2.0, -1.5, 1.0, 0.5, -0.8])
y_train = X_train @ beta_true + rng.normal(scale=0.15, size=n)
X_test = rng.normal(size=(400, p))
y_test = X_test @ beta_true + rng.normal(scale=0.15, size=400)
beta_min_norm = la.pinv(X_train) @ y_train
beta_ridge = la.solve(X_train.T @ X_train + 2.0 * np.eye(p), X_train.T @ y_train)
train_mse_min = mean_squared_error(y_train, X_train @ beta_min_norm)
train_mse_ridge = mean_squared_error(y_train, X_train @ beta_ridge)
test_mse_min = mean_squared_error(y_test, X_test @ beta_min_norm)
test_mse_ridge = mean_squared_error(y_test, X_test @ beta_ridge)
header("p > n regime")
print(f"Minimum-norm train MSE: {train_mse_min:.6f}")
print(f"Ridge train MSE : {train_mse_ridge:.6f}")
print(f"Minimum-norm test MSE: {test_mse_min:.6f}")
print(f"Ridge test MSE : {test_mse_ridge:.6f}")
print(f"||beta_min_norm||_2 : {la.norm(beta_min_norm):.4f}")
print(f"||beta_ridge||_2 : {la.norm(beta_ridge):.4f}")
report("Minimum-norm interpolation nearly fits training data", train_mse_min < 0.05)
6.3 Robust Regression and Outlier-Resistant Losses
Outlier-resistant losses reduce the influence of extreme residuals.
Code cell 58
# === 6.3 Robust Regression and Outlier-Resistant Losses ===
x = rng.uniform(-2, 2, size=160)
y = 1.0 + 1.8 * x + rng.normal(scale=0.35, size=len(x))
y_with_outliers = y.copy()
y_with_outliers[[10, 60, 120]] += np.array([8.0, -7.0, 9.0])
X = x[:, None]
ols = LinearRegression().fit(X, y_with_outliers)
huber = HuberRegressor().fit(X, y_with_outliers)
clean_mask = np.ones(len(x), dtype=bool)
clean_mask[[10, 60, 120]] = False
header("Robust vs ordinary regression")
print(f"OLS slope : {ols.coef_[0]:.4f}")
print(f"Huber slope : {huber.coef_[0]:.4f}")
print(f"Clean-data MSE, OLS : {mean_squared_error(y[clean_mask], ols.predict(X[clean_mask])):.4f}")
print(f"Clean-data MSE, Huber: {mean_squared_error(y[clean_mask], huber.predict(X[clean_mask])):.4f}")
report("Huber is closer to the clean slope", abs(huber.coef_[0] - 1.8) < abs(ols.coef_[0] - 1.8))
6.4 Model Selection, Cross-Validation, and Information Criteria
Choosing complexity requires a penalty, validation strategy, or both.
Code cell 60
# === 6.4 Model Selection, Cross-Validation, and Information Criteria ===
x = np.linspace(-2, 2, 160)
y = 1.0 + 1.2 * x - 0.8 * x**2 + 0.5 * x**3 + rng.normal(scale=0.45, size=len(x))
kf = KFold(n_splits=5, shuffle=True, random_state=42)
results = []
for degree in range(1, 6):
poly = PolynomialFeatures(degree=degree, include_bias=True)
X = poly.fit_transform(x[:, None])
cv_mse = []
for train_idx, test_idx in kf.split(X):
beta = ols_beta(X[train_idx], y[train_idx])
pred = X[test_idx] @ beta
cv_mse.append(mean_squared_error(y[test_idx], pred))
beta_full = ols_beta(X, y)
fit_full = X @ beta_full
rss_full = rss(y, fit_full)
n = len(y)
k = X.shape[1]
aic = n * np.log(rss_full / n) + 2 * k
bic = n * np.log(rss_full / n) + k * np.log(n)
results.append((degree, np.mean(cv_mse), aic, bic))
header("Model-selection criteria across polynomial degree")
for degree, cv_mse, aic, bic in results:
print(f"degree={degree}: CV MSE={cv_mse:.4f}, AIC={aic:.2f}, BIC={bic:.2f}")
best_cv = min(results, key=lambda t: t[1])[0]
best_bic = min(results, key=lambda t: t[3])[0]
report("Cross-validation and BIC both produce a finite optimum", 1 <= best_cv <= 5 and 1 <= best_bic <= 5)
6.5 Omitted-Variable Bias, Extrapolation, and Causal Caution
A fitted coefficient can absorb missing structure, and a linear fit can become unreliable outside the observed region.
Code cell 62
# === 6.5 Omitted-Variable Bias, Extrapolation, and Causal Caution ===
Z = rng.normal(size=500)
X = 0.7 * Z + rng.normal(scale=0.6, size=500)
Y = 2.0 * X + 3.0 * Z + rng.normal(scale=0.5, size=500)
beta_short = ols_beta(add_intercept(X), Y)
beta_full = ols_beta(add_intercept(np.column_stack([X, Z])), Y)
x_train = np.linspace(0, 1, 80)
y_train = 1.0 + 0.5 * x_train + 2.0 * x_train**2 + rng.normal(scale=0.1, size=len(x_train))
linear_extrap = ols_beta(add_intercept(x_train), y_train)
x_star = 2.0
pred_star = np.array([1.0, x_star]) @ linear_extrap
true_star = 1.0 + 0.5 * x_star + 2.0 * x_star**2
header("Bias from omission and danger from extrapolation")
print(f"Slope on X alone : {beta_short[1]:.4f}")
print(f"Slope on X controlling Z: {beta_full[1]:.4f}")
print(f"Linear extrapolated prediction at x=2.0: {pred_star:.4f}")
print(f"Underlying nonlinear value at x=2.0 : {true_star:.4f}")
report("Omitted variable changes the slope materially", abs(beta_short[1] - beta_full[1]) > 0.5)
report("Extrapolation error is large", abs(pred_star - true_star) > 1.0)
7. Applications in Machine Learning
7.1 Linear Regression Baselines for Tabular and Structured Prediction
A strong linear baseline is often the fastest way to discover whether a task is mostly about representation quality or about model complexity.
Code cell 64
# === 7.1 Linear Regression Baselines for Tabular and Structured Prediction ===
X = rng.normal(size=(1200, 6))
true_beta = np.array([1.5, -2.0, 0.8, 0.0, 0.6, 0.0])
y = X @ true_beta + rng.normal(scale=1.1, size=1200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
lin = LinearRegression().fit(X_train, y_train)
mean_pred = np.full_like(y_test, y_train.mean())
lin_pred = lin.predict(X_test)
header("Tabular baseline check")
print(f"Mean-baseline test MSE : {mean_squared_error(y_test, mean_pred):.4f}")
print(f"Linear-model test MSE : {mean_squared_error(y_test, lin_pred):.4f}")
print("Estimated coefficients:", np.round(lin.coef_, 3))
report("Linear baseline beats mean baseline", mean_squared_error(y_test, lin_pred) < mean_squared_error(y_test, mean_pred))
7.2 Logistic Regression Baselines, Calibration, and Probability Outputs
Logistic regression is still a reliable probabilistic baseline when features already carry the important structure.
Code cell 66
# === 7.2 Logistic Regression Baselines, Calibration, and Probability Outputs ===
X = rng.normal(size=(1400, 3))
logits = -0.3 + 1.4 * X[:, 0] - 0.9 * X[:, 1] + 0.7 * X[:, 2]
prob = sigmoid(logits)
y = rng.binomial(1, prob)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = LogisticRegression(C=1.0, solver='lbfgs', max_iter=2000)
clf.fit(X_train, y_train)
p_test = clf.predict_proba(X_test)[:, 1]
acc = accuracy_score(y_test, (p_test >= 0.5).astype(int))
brier = brier_score_loss(y_test, p_test)
bins = np.linspace(0, 1, 6)
indices = np.digitize(p_test, bins) - 1
calibration_rows = []
for b in range(len(bins) - 1):
mask = indices == b
if mask.sum() > 0:
calibration_rows.append((mask.sum(), p_test[mask].mean(), y_test[mask].mean()))
header("Logistic baseline and simple calibration view")
print(f"Test accuracy : {acc:.4f}")
print(f"Brier score : {brier:.4f}")
for count, p_mean, y_mean in calibration_rows:
print(f"bin count={count:>3d}, mean predicted={p_mean:.3f}, empirical positive rate={y_mean:.3f}")
report("Probabilities are well-formed", np.all((p_test >= 0) & (p_test <= 1)))
7.3 Linear Probes on Embeddings and Hidden States
A linear probe tests whether a representation makes a target linearly recoverable.
Code cell 68
# === 7.3 Linear Probes on Embeddings and Hidden States ===
emb = rng.normal(size=(900, 20))
probe_direction = rng.normal(size=20)
labels = (emb @ probe_direction + 0.3 * rng.normal(size=900) > 0).astype(int)
labels_shuffled = rng.permutation(labels)
E_train, E_test, y_train, y_test = train_test_split(emb, labels, test_size=0.3, random_state=42)
_, _, ys_train, ys_test = train_test_split(emb, labels_shuffled, test_size=0.3, random_state=42)
probe = LogisticRegression(C=1.0, solver='lbfgs', max_iter=2000)
probe.fit(E_train, y_train)
probe_shuffled = LogisticRegression(C=1.0, solver='lbfgs', max_iter=2000)
probe_shuffled.fit(E_train, ys_train)
acc_true = accuracy_score(y_test, probe.predict(E_test))
acc_shuffle = accuracy_score(ys_test, probe_shuffled.predict(E_test))
header("Linear probes on embeddings")
print(f"Probe accuracy on real labels : {acc_true:.4f}")
print(f"Probe accuracy on shuffled labels: {acc_shuffle:.4f}")
report("Representation contains linearly decodable signal", acc_true > acc_shuffle + 0.2)
7.4 Ridge, Lasso, Weight Decay, and Sparse or Low-Rank Adaptation
Modern regularization and adaptation methods often repackage classical shrinkage ideas.
Code cell 70
# === 7.4 Ridge, Lasso, Weight Decay, and Sparse or Low-Rank Adaptation ===
X = rng.normal(size=(240, 12))
true_beta = np.zeros(12)
true_beta[:4] = np.array([2.0, -1.6, 1.2, 0.8])
y = X @ true_beta + rng.normal(scale=1.0, size=240)
X_std = StandardScaler().fit_transform(X)
ridge = Ridge(alpha=5.0).fit(X_std, y)
lasso = Lasso(alpha=0.15, max_iter=20000).fit(X_std, y)
A = rng.normal(size=(16, 2))
B = rng.normal(size=(2, 10))
delta = A @ B + 0.05 * rng.normal(size=(16, 10))
U, S, Vt = la.svd(delta, full_matrices=False)
rank2 = (U[:, :2] * S[:2]) @ Vt[:2]
rank2_error = la.norm(delta - rank2, ord='fro')
full_error = la.norm(delta, ord='fro')
header("Regularization and structured adaptation")
print(f"Ridge coefficient norm: {la.norm(ridge.coef_):.4f}")
print(f"Lasso nonzero count : {np.sum(np.abs(lasso.coef_) > 1e-6)}")
print(f"Rank-2 approximation relative error: {rank2_error / full_error:.4f}")
report("Lasso is sparser than Ridge", np.sum(np.abs(lasso.coef_) > 1e-6) < len(ridge.coef_))
report("Low-rank approximation captures most structure", rank2_error / full_error < 0.25)
7.5 GLMs in Production Systems
Binary events, counts, and rates are common operational targets, so logistic and Poisson models remain useful production tools.
Code cell 72
# === 7.5 GLMs in Production Systems ===
traffic = rng.normal(size=(800, 2))
ctr_logits = -1.0 + 0.9 * traffic[:, 0] + 0.6 * traffic[:, 1]
ctr_labels = rng.binomial(1, sigmoid(ctr_logits))
ctr_model = LogisticRegression(C=1.0, solver='lbfgs', max_iter=2000).fit(traffic, ctr_labels)
ops_x = rng.normal(size=800)
ops_exposure = rng.uniform(0.8, 2.5, size=800)
incident_mu = ops_exposure * np.exp(-0.2 + 0.8 * ops_x)
incident_counts = rng.poisson(incident_mu)
incident_model = PoissonRegressor(alpha=1e-8, max_iter=3000).fit(np.column_stack([ops_x, np.log(ops_exposure)]), incident_counts)
sample_ctr = ctr_model.predict_proba(np.array([[0.5, -0.1], [1.2, 0.9]]))[:, 1]
sample_incidents = incident_model.predict(np.array([[0.2, np.log(1.0)], [1.0, np.log(2.0)]]))
header("GLMs in production-style targets")
print("Sample CTR predictions :", np.round(sample_ctr, 4))
print("Sample incident predictions :", np.round(sample_incidents, 4))
report("CTR outputs are probabilities", np.all((sample_ctr >= 0) & (sample_ctr <= 1)))
report("Incident predictions are positive", np.all(sample_incidents > 0))
8. Common Mistakes
| # | Mistake | Why it is wrong | Fix |
|---|---|---|---|
| 1 | Interpreting OLS coefficients without checking multicollinearity | Correlated predictors inflate variance and flip signs | Compute VIF; consider Ridge or variable selection |
| 2 | Using R² to compare models with different numbers of predictors | R² never decreases as predictors are added | Use adjusted R², AIC, BIC, or cross-validation |
| 3 | Ignoring residual plots | Violations of linearity or homoscedasticity are invisible from summary stats alone | Always check residuals-vs-fitted and QQ plot |
| 4 | Forgetting to standardize before Ridge/Lasso | The penalty acts on raw coefficient sizes, so features on different scales are shrunk unevenly | Standardize features to zero mean and unit variance first (see the sketch after this table) |
| 5 | Using Lasso when groups of correlated features matter | Lasso tends to keep one feature from a correlated group and arbitrarily drop the rest | Use Elastic Net or Group Lasso for correlated features |
| 6 | Predicting outside the training range and trusting the result | Extrapolation assumes the linear relationship continues | Report wider uncertainty or refuse prediction far from training data |
| 7 | Using accuracy as the loss for logistic regression training | Accuracy is non-differentiable; gradient is zero almost everywhere | Minimize log-loss (binary cross-entropy) |
| 8 | Treating confidence intervals and prediction intervals as the same | PI is wider because it includes observation noise | Compute both; use PI when making individual-level predictions |
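To make mistake 4 concrete, here is a minimal sketch with hypothetical feature scales chosen purely for illustration: two equally informative signals, one stored on a numerically tiny scale, fed to Lasso before and after standardization.
# Mistake 4 demo: without standardization the L1 penalty shrinks features unevenly,
# and the small-scale feature (which needs a large coefficient) is silently zeroed out.
rng_demo = np.random.default_rng(0)
z1 = rng_demo.normal(size=400)
z2 = rng_demo.normal(size=400)
X_raw_demo = np.column_stack([z1, z2 / 100.0])        # second column stored on a 1/100 scale
y_demo = 2.0 * z1 + 2.0 * z2 + rng_demo.normal(scale=0.5, size=400)
lasso_raw = Lasso(alpha=0.5, max_iter=20000).fit(X_raw_demo, y_demo)
lasso_std = Lasso(alpha=0.5, max_iter=20000).fit(StandardScaler().fit_transform(X_raw_demo), y_demo)
print("Raw-feature Lasso coefficients         :", np.round(lasso_raw.coef_, 3))
print("Standardized-feature Lasso coefficients:", np.round(lasso_std.coef_, 3))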
What to Notice After Running This Notebook
- OLS is best understood as projection before it is understood as algebra.
- Diagnostics matter because the wrong mean structure can survive pretty coefficient tables.
- Regularization changes the geometry of estimation long before it changes the buzzwords around a model.
- Logistic and Poisson regression are not side quests; they are the core GLM bridge from classical statistics to modern ML losses.
- Linear probes, calibration heads, and weight-decay-style penalties are direct descendants of the regression ideas in this section.