Theory Notebook
Converted from theory.ipynb for web reading.
Regression Analysis — Theory Notebook
Regression turns feature-response relationships into a concrete language for prediction, uncertainty, and regularization.
This notebook is the interactive companion to notes.md. It moves from conditional modeling and OLS geometry through diagnostics, regularized regression, GLMs, and machine-learning applications.
| Coverage | What You Will Do |
|---|---|
| OLS foundations | Derive slopes, solve normal equations, verify projection geometry |
| Diagnostics | Inspect leverage, multicollinearity, residual structure, and heteroscedasticity |
| Regularization | Compare OLS, Ridge, and Lasso under instability and sparsity |
| GLMs | Fit logistic and Poisson models and interpret their linear predictors |
| ML links | Use regression-style models for baselines, probes, calibration, and structured adaptation |
Code cell 2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
try:
import seaborn as sns
sns.set_theme(style="whitegrid", palette="colorblind")
HAS_SNS = True
except ImportError:
plt.style.use("seaborn-v0_8-whitegrid")
HAS_SNS = False
mpl.rcParams.update({
"figure.figsize": (10, 6),
"figure.dpi": 120,
"font.size": 13,
"axes.titlesize": 15,
"axes.labelsize": 13,
"xtick.labelsize": 11,
"ytick.labelsize": 11,
"legend.fontsize": 11,
"legend.framealpha": 0.85,
"lines.linewidth": 2.0,
"axes.spines.top": False,
"axes.spines.right": False,
"savefig.bbox": "tight",
"savefig.dpi": 150,
})
np.random.seed(42)
print("Plot setup complete.")
Code cell 3
import numpy as np
import numpy.linalg as la
from scipy import stats
from sklearn.linear_model import (LinearRegression, Ridge, Lasso,
LogisticRegression, PoissonRegressor, HuberRegressor)
from sklearn.metrics import mean_squared_error, accuracy_score, brier_score_loss
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
try:
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams.update({
"figure.figsize": (10, 6),
"figure.dpi": 120,
"font.size": 12,
"axes.titlesize": 14,
"axes.labelsize": 12,
"xtick.labelsize": 10,
"ytick.labelsize": 10,
})
HAS_MPL = True
except ImportError:
HAS_MPL = False
try:
import seaborn as sns
if HAS_MPL:
sns.set_theme(style="whitegrid", palette="colorblind")
HAS_SNS = True
except ImportError:
if HAS_MPL:
plt.style.use("seaborn-v0_8-whitegrid")
HAS_SNS = False
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
print("Setup complete.")
Code cell 4
COLORS = {
"primary": "#0077BB",
"secondary": "#EE7733",
"tertiary": "#009988",
"error": "#CC3311",
"neutral": "#555555",
"highlight": "#EE3377",
}
def header(title):
print("\n" + "=" * len(title))
print(title)
print("=" * len(title))
def report(name, cond):
print(f"{'PASS' if cond else 'FAIL'} - {name}")
return cond
def add_intercept(X):
X = np.asarray(X, dtype=float)
if X.ndim == 1:
X = X[:, None]
return np.column_stack([np.ones(X.shape[0]), X])
def ols_beta(X, y):
X = np.asarray(X, dtype=float)
y = np.asarray(y, dtype=float)
return la.lstsq(X, y, rcond=None)[0]
def rss(y, fitted):
return float(np.sum((np.asarray(y) - np.asarray(fitted)) ** 2))
def tss(y):
y = np.asarray(y, dtype=float)
return float(np.sum((y - y.mean()) ** 2))
def r2_score_manual(y, fitted):
return 1.0 - rss(y, fitted) / tss(y)
def adjusted_r2(y, fitted, p):
n = len(y)
return 1.0 - (rss(y, fitted) / (n - p)) / (tss(y) / (n - 1))
def hat_matrix(X):
X = np.asarray(X, dtype=float)
return X @ la.pinv(X.T @ X) @ X.T
def prediction_interval(X, y, x_star, alpha=0.05):
beta = ols_beta(X, y)
fitted = X @ beta
resid = y - fitted
n, p = X.shape
s2 = rss(y, fitted) / (n - p)
xtx_inv = la.pinv(X.T @ X)
x_star = np.asarray(x_star, dtype=float)
mean = float(x_star @ beta)
mean_se = np.sqrt(s2 * x_star @ xtx_inv @ x_star)
tcrit = stats.t.ppf(1 - alpha / 2, n - p)
ci = (mean - tcrit * mean_se, mean + tcrit * mean_se)
pred_se = np.sqrt(s2 * (1 + x_star @ xtx_inv @ x_star))
pi = (mean - tcrit * pred_se, mean + tcrit * pred_se)
return mean, ci, pi
def hc0_se(X, resid):
X = np.asarray(X, dtype=float)
resid = np.asarray(resid, dtype=float)
xtx_inv = la.pinv(X.T @ X)
meat = X.T @ (resid[:, None] ** 2 * X)
cov = xtx_inv @ meat @ xtx_inv
return np.sqrt(np.diag(cov))
def soft_threshold(z, lam):
return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
def sigmoid(z):
return 1.0 / (1.0 + np.exp(-z))
def vif_scores(X):
X = np.asarray(X, dtype=float)
out = []
for j in range(X.shape[1]):
target = X[:, j]
others = np.delete(X, j, axis=1)
beta = ols_beta(add_intercept(others), target)
fitted = add_intercept(others) @ beta
r2 = np.clip(r2_score_manual(target, fitted), 0.0, 0.999999999)
out.append(1.0 / (1.0 - r2))
return np.array(out)
print("Helpers ready.")
1. Intuition
1.1 From Correlation to Conditional Modeling
Correlation summarizes pairwise association. Regression asks how the relationship changes once we condition on several predictors simultaneously.
Code cell 6
# === 1.1 From Correlation to Conditional Modeling ===
X1 = rng.normal(size=400)
X2 = 0.8 * X1 + rng.normal(scale=0.5, size=400)
y = 3.0 * X1 - 2.0 * X2 + rng.normal(scale=0.7, size=400)
simple_slope = np.cov(X1, y, ddof=1)[0, 1] / np.var(X1, ddof=1)
X_multi = add_intercept(np.column_stack([X1, X2]))
beta_multi = ols_beta(X_multi, y)
header("Association vs conditional effect")
print(f"corr(X1, y): {np.corrcoef(X1, y)[0, 1]:.4f}")
print(f"Simple-regression slope for X1 only: {simple_slope:.4f}")
print(f"Multiple-regression coefficient for X1: {beta_multi[1]:.4f}")
print(f"Multiple-regression coefficient for X2: {beta_multi[2]:.4f}")
report("Conditional slope differs from raw association", abs(simple_slope - beta_multi[1]) > 0.25)
1.2 Why Regression Is the Blueprint for Supervised Learning
The same linear predictor can drive very different tasks depending on how we map it to the response.
Code cell 8
# === 1.2 Why Regression Is the Blueprint for Supervised Learning ===
eta = np.linspace(-3, 3, 7)
identity = eta
prob = sigmoid(eta)
rate = np.exp(eta)
header("One linear predictor, multiple supervised tasks")
print("eta ", eta)
print("identity ", np.round(identity, 4))
print("sigmoid ", np.round(prob, 4))
print("exp(rate) ", np.round(rate, 4))
report("Sigmoid stays in [0, 1]", np.all((prob >= 0) & (prob <= 1)))
report("Exponential link stays positive", np.all(rate > 0))
1.3 When Linear Structure Works and When It Breaks
A linear fit can be excellent when the true relationship is close to linear, and badly misspecified when curvature matters.
Code cell 10
# === 1.3 When Linear Structure Works and When It Breaks ===
x = np.linspace(-2.5, 2.5, 200)
y_linear = 1.0 + 2.0 * x + rng.normal(scale=0.35, size=len(x))
y_curve = 1.0 + 0.8 * x + 1.4 * x**2 + rng.normal(scale=0.35, size=len(x))
X_lin = add_intercept(x)
beta_linear = ols_beta(X_lin, y_linear)
beta_bad = ols_beta(X_lin, y_curve)
poly = PolynomialFeatures(degree=2, include_bias=True)
X_poly = poly.fit_transform(x[:, None])
beta_poly = ols_beta(X_poly, y_curve)
mse_linear_on_linear = mean_squared_error(y_linear, X_lin @ beta_linear)
mse_linear_on_curve = mean_squared_error(y_curve, X_lin @ beta_bad)
mse_poly_on_curve = mean_squared_error(y_curve, X_poly @ beta_poly)
header("Linear structure: match vs mismatch")
print(f"Linear data, linear fit MSE : {mse_linear_on_linear:.4f}")
print(f"Curved data, linear fit MSE : {mse_linear_on_curve:.4f}")
print(f"Curved data, quadratic fit MSE : {mse_poly_on_curve:.4f}")
report("Quadratic basis improves nonlinear fit", mse_poly_on_curve < mse_linear_on_curve)
if HAS_MPL:
fig, ax = plt.subplots(1, 2, figsize=(12, 4.5))
ax[0].scatter(x, y_linear, s=12, alpha=0.5, color=COLORS["primary"])
ax[0].plot(x, X_lin @ beta_linear, color=COLORS["secondary"], lw=2)
ax[0].set_title("Linear signal")
ax[0].set_xlabel("x")
ax[0].set_ylabel("y")
ax[1].scatter(x, y_curve, s=12, alpha=0.5, color=COLORS["primary"])
ax[1].plot(x, X_lin @ beta_bad, color=COLORS["error"], lw=2, label="linear fit")
ax[1].plot(x, X_poly @ beta_poly, color=COLORS["tertiary"], lw=2, label="quadratic fit")
ax[1].legend()
ax[1].set_title("Curvature breaks plain linearity")
ax[1].set_xlabel("x")
ax[1].set_ylabel("y")
plt.tight_layout()
plt.show()
1.4 Historical Timeline
Regression developed by combining ideas from geometry and probability with practical prediction needs.
Code cell 12
# === 1.4 Historical Timeline ===
timeline = [
(1805, "Legendre", "least squares"),
(1809, "Gauss", "Gaussian error connection"),
(1970, "Hoerl & Kennard", "Ridge regression"),
(1972, "Nelder & Wedderburn", "GLMs"),
(1996, "Tibshirani", "Lasso"),
(2019, "interpretability literature", "linear probes"),
(2022, "PEFT literature", "low-rank adaptation at scale"),
]
header("Selected regression milestones")
for year, person, contribution in timeline:
print(f"{year}: {person} - {contribution}")
report("Timeline is in chronological order", all(timeline[i][0] < timeline[i + 1][0] for i in range(len(timeline) - 1)))
2. Formal Definitions
2.1 Response, Predictors, and Conditional Expectation
The regression target is a conditional pattern such as the mean response given predictors.
Code cell 14
# === 2.1 Response, Predictors, and Conditional Expectation ===
x = rng.uniform(-2, 2, size=500)
y = 1.5 + 1.2 * x + rng.normal(scale=0.6, size=len(x))
bins = np.linspace(-2, 2, 9)
which = np.digitize(x, bins) - 1
bin_means_x = []
bin_means_y = []
for j in range(len(bins) - 1):
mask = which == j
if mask.sum() > 5:
bin_means_x.append(x[mask].mean())
bin_means_y.append(y[mask].mean())
beta = ols_beta(add_intercept(x), y)
header("Conditional mean as the regression target")
print(f"Estimated intercept: {beta[0]:.4f}")
print(f"Estimated slope : {beta[1]:.4f}")
report("Binned conditional means track the fitted line", np.corrcoef(bin_means_x, bin_means_y)[0, 1] > 0.95)
2.2 Design Matrix, Linear Predictor, and Feature Coding
The design matrix stores the feature columns that define the model space.
Code cell 16
# === 2.2 Design Matrix, Linear Predictor, and Feature Coding ===
length = np.array([20, 50, 20, 80], dtype=float)
gpu_h100 = np.array([0, 0, 1, 1], dtype=float)
interaction = length * gpu_h100
X = np.column_stack([np.ones(len(length)), length, gpu_h100, interaction])
columns = ["intercept", "length", "gpu_h100", "length_x_gpu_h100"]
header("Design matrix example")
print("Columns:", columns)
print(X)
print("Shape:", X.shape)
report("Intercept column is all ones", np.allclose(X[:, 0], 1.0))
2.3 Fitted Values, Residuals, and Sum-of-Squares Quantities
Fitted values and residuals organize both geometry and model evaluation.
Code cell 18
# === 2.3 Fitted Values, Residuals, and Sum-of-Squares Quantities ===
x = np.linspace(0, 5, 20)
y = 2.0 + 0.7 * x + rng.normal(scale=0.5, size=len(x))
X = add_intercept(x)
beta = ols_beta(X, y)
fitted = X @ beta
resid = y - fitted
RSS = rss(y, fitted)
TSS = tss(y)
ESS = float(np.sum((fitted - y.mean()) ** 2))
header("TSS = ESS + RSS decomposition")
print(f"RSS: {RSS:.6f}")
print(f"ESS: {ESS:.6f}")
print(f"TSS: {TSS:.6f}")
report("Explained + residual equals total", np.allclose(RSS + ESS, TSS))
report("Residuals sum to zero with intercept", np.allclose(resid.sum(), 0.0))
2.4 Classical Linear Model Assumptions
The assumptions are separable: different failures break different parts of the classical theory.
Code cell 20
# === 2.4 Classical Linear Model Assumptions ===
x = np.linspace(0, 4, 300)
sigma_homo = 0.5 * np.ones_like(x)
sigma_hetero = 0.2 + 0.35 * x
y_homo = 1.0 + 1.1 * x + rng.normal(scale=sigma_homo)
y_hetero = 1.0 + 1.1 * x + rng.normal(scale=sigma_hetero)
fit_homo = add_intercept(x) @ ols_beta(add_intercept(x), y_homo)
fit_hetero = add_intercept(x) @ ols_beta(add_intercept(x), y_hetero)
resid_homo = y_homo - fit_homo
resid_hetero = y_hetero - fit_hetero
mid = np.median(x)
header("Homoscedastic vs heteroscedastic residual spread")
print("Low-x residual var (homo / hetero):",
np.var(resid_homo[x <= mid]), np.var(resid_hetero[x <= mid]))
print("High-x residual var (homo / hetero):",
np.var(resid_homo[x > mid]), np.var(resid_hetero[x > mid]))
report("Heteroscedastic data shows growing variance", np.var(resid_hetero[x > mid]) > 2 * np.var(resid_hetero[x <= mid]))
2.5 Examples, Non-Examples, and Edge Cases
Rank and dimensionality determine whether OLS has a unique solution.
Code cell 22
# === 2.5 Examples, Non-Examples, and Edge Cases ===
X_full = np.array([[1, 0], [1, 1], [1, 2], [1, 3]], dtype=float)
X_collinear = np.array([[1, 1, 2], [1, 2, 4], [1, 3, 6], [1, 4, 8]], dtype=float)
X_high_dim = rng.normal(size=(6, 10))
header("Ranks of common regression design cases")
print(f"Full-rank design rank : {la.matrix_rank(X_full)} / {X_full.shape[1]}")
print(f"Collinear design rank : {la.matrix_rank(X_collinear)} / {X_collinear.shape[1]}")
print(f"High-dimensional design : {X_high_dim.shape[0]} rows, {X_high_dim.shape[1]} cols")
report("Collinear matrix loses rank", la.matrix_rank(X_collinear) < X_collinear.shape[1])
report("p > n example is high dimensional", X_high_dim.shape[1] > X_high_dim.shape[0])
3. Core Theory I: OLS and Inference
3.1 Simple Linear Regression and Its Link to Correlation
With a single predictor, the OLS slope is the covariance of x and y divided by the variance of x.
Code cell 24
# === 3.1 Simple Linear Regression and Its Link to Correlation ===
x = rng.normal(size=300)
y = 0.8 + 1.7 * x + rng.normal(scale=0.8, size=len(x))
cov_xy = np.cov(x, y, ddof=1)[0, 1]
var_x = np.var(x, ddof=1)
slope_cov = cov_xy / var_x
beta = ols_beta(add_intercept(x), y)
correlation_form = np.corrcoef(x, y)[0, 1] * np.std(y, ddof=1) / np.std(x, ddof=1)
header("Simple-regression slope formulas")
print(f"Slope from covariance/variance : {slope_cov:.6f}")
print(f"Slope from OLS : {beta[1]:.6f}")
print(f"Slope from correlation formula : {correlation_form:.6f}")
report("All slope formulas agree", np.allclose([slope_cov, beta[1], correlation_form], slope_cov))
3.2 Matrix OLS and the Normal Equations
The normal equations define the least-squares solution in multiple regression.
Code cell 26
# === 3.2 Matrix OLS and the Normal Equations ===
X = add_intercept(rng.normal(size=(120, 3)))
true_beta = np.array([1.0, 2.0, -1.5, 0.5])
y = X @ true_beta + rng.normal(scale=0.4, size=X.shape[0])
beta_lstsq = ols_beta(X, y)
beta_normal = la.solve(X.T @ X, X.T @ y)
normal_residual = X.T @ (y - X @ beta_normal)
header("Normal equations")
print("beta from lstsq :", np.round(beta_lstsq, 4))
print("beta from formula :", np.round(beta_normal, 4))
print("X^T residual :", np.round(normal_residual, 8))
report("Closed-form and lstsq agree", np.allclose(beta_lstsq, beta_normal))
report("Normal equations hold", np.allclose(normal_residual, 0.0, atol=1e-8))
3.3 Projection Geometry, the Hat Matrix, and Orthogonality
OLS projects the response vector onto the column space of the design matrix.
Code cell 28
# === 3.3 Projection Geometry, the Hat Matrix, and Orthogonality ===
X = add_intercept(np.linspace(-2, 2, 25))
y = 1.2 + 0.9 * X[:, 1] + rng.normal(scale=0.3, size=X.shape[0])
H = hat_matrix(X)
fitted = H @ y
resid = y - fitted
leverage = np.diag(H)
header("Projection geometry")
print(f"Trace(H) = {np.trace(H):.6f}")
print(f"Max |H^2 - H| = {np.max(np.abs(H @ H - H)):.2e}")
print(f"Max |X^T e| = {np.max(np.abs(X.T @ resid)):.2e}")
print(f"Leverage sum = {leverage.sum():.6f}")
report("H is idempotent", np.allclose(H @ H, H))
report("Residual is orthogonal to model space", np.allclose(X.T @ resid, 0.0, atol=1e-8))
report("Leverages sum to number of parameters", np.allclose(leverage.sum(), X.shape[1]))
3.4 Gauss-Markov Theorem
Under the classical assumptions, OLS has the smallest variance among all linear unbiased estimators.
Code cell 30
# === 3.4 Gauss-Markov Theorem ===
X = add_intercept(np.linspace(-1, 1, 12))
H = hat_matrix(X)
M = np.eye(len(X)) - H
B = rng.normal(size=(X.shape[1], X.shape[0]))
C = B @ M
beta_true = np.array([0.5, 1.3])
ols_draws = []
alt_draws = []
for _ in range(4000):
y = X @ beta_true + rng.normal(scale=0.7, size=X.shape[0])
beta_ols = ols_beta(X, y)
beta_alt = beta_ols + C @ y
ols_draws.append(beta_ols)
alt_draws.append(beta_alt)
ols_draws = np.array(ols_draws)
alt_draws = np.array(alt_draws)
ols_cov = np.cov(ols_draws.T)
alt_cov = np.cov(alt_draws.T)
header("Empirical Gauss-Markov check")
print("Max |C X|:", np.max(np.abs(C @ X)))
print("Mean OLS estimate:", np.round(ols_draws.mean(axis=0), 4))
print("Mean alternative estimate:", np.round(alt_draws.mean(axis=0), 4))
print("trace Var(OLS) :", np.trace(ols_cov))
print("trace Var(alternative):", np.trace(alt_cov))
report("Alternative estimator is approximately unbiased", np.allclose(alt_draws.mean(axis=0), beta_true, atol=0.07))
report("OLS has smaller variance trace", np.trace(ols_cov) < np.trace(alt_cov))
3.5 Sampling Distribution, Standard Errors, Confidence Intervals, and Prediction Intervals
Coefficient uncertainty and predictive uncertainty are related but not the same.
Code cell 32
# === 3.5 Sampling Distribution, Standard Errors, Confidence Intervals, and Prediction Intervals ===
x = np.linspace(0, 4, 40)
X = add_intercept(x)
y = 1.0 + 1.5 * x + rng.normal(scale=0.6, size=len(x))
beta = ols_beta(X, y)
fitted = X @ beta
resid = y - fitted
n, p = X.shape
s2 = rss(y, fitted) / (n - p)
xtx_inv = la.pinv(X.T @ X)
se_beta = np.sqrt(np.diag(s2 * xtx_inv))
tcrit = stats.t.ppf(0.975, n - p)
ci_slope = (beta[1] - tcrit * se_beta[1], beta[1] + tcrit * se_beta[1])
mean_star, ci_mean, pi_star = prediction_interval(X, y, np.array([1.0, 2.5]))
header("Coefficient vs mean-response vs prediction intervals")
print(f"Slope estimate: {beta[1]:.4f}")
print(f"Slope 95% CI: ({ci_slope[0]:.4f}, {ci_slope[1]:.4f})")
print(f"Predicted mean at x*=2.5: {mean_star:.4f}")
print(f"95% CI for mean response: ({ci_mean[0]:.4f}, {ci_mean[1]:.4f})")
print(f"95% prediction interval : ({pi_star[0]:.4f}, {pi_star[1]:.4f})")
report("Prediction interval is wider than mean-response CI", (pi_star[1] - pi_star[0]) > (ci_mean[1] - ci_mean[0]))
4. Core Theory II: Model Assessment and Diagnostics
4.1 R^2, Adjusted R^2, Residual Standard Error, and Goodness of Fit
Goodness-of-fit summaries are useful, but only when we remember what they do and do not measure.
Code cell 34
# === 4.1 R^2, Adjusted R^2, Residual Standard Error, and Goodness of Fit ===
x1 = rng.normal(size=250)
x2 = rng.normal(size=250)
y = 1.0 + 2.0 * x1 + rng.normal(scale=1.0, size=250)
X1 = add_intercept(x1)
X2 = add_intercept(np.column_stack([x1, x2]))
b1 = ols_beta(X1, y)
b2 = ols_beta(X2, y)
fit1 = X1 @ b1
fit2 = X2 @ b2
header("Goodness of fit across nested models")
print(f"Model 1 R^2 : {r2_score_manual(y, fit1):.4f}")
print(f"Model 2 R^2 : {r2_score_manual(y, fit2):.4f}")
print(f"Model 1 adjusted R^2: {adjusted_r2(y, fit1, X1.shape[1]):.4f}")
print(f"Model 2 adjusted R^2: {adjusted_r2(y, fit2, X2.shape[1]):.4f}")
print(f"Model 1 residual SE : {np.sqrt(rss(y, fit1) / (len(y) - X1.shape[1])):.4f}")
print(f"Model 2 residual SE : {np.sqrt(rss(y, fit2) / (len(y) - X2.shape[1])):.4f}")
report("R^2 cannot decrease when adding predictors", r2_score_manual(y, fit2) >= r2_score_manual(y, fit1))
4.2 Residual Diagnostics for Linearity, Variance, and Normality
Residuals should look like leftover noise, not hidden structure.
Code cell 36
# === 4.2 Residual Diagnostics for Linearity, Variance, and Normality ===
x = np.linspace(-2, 2, 220)
y = 1.0 + 1.5 * x + 1.8 * x**2 + rng.normal(scale=0.45, size=len(x))
X = add_intercept(x)
beta = ols_beta(X, y)
fitted = X @ beta
resid = y - fitted
qq_theory, qq_sample = stats.probplot(resid, dist="norm", fit=False)
header("Residual diagnostics")
print(f"corr(residual, x^2): {np.corrcoef(resid, x**2)[0, 1]:.4f}")
print(f"Residual skewness : {stats.skew(resid):.4f}")
print(f"Residual kurtosis : {stats.kurtosis(resid):.4f}")
report("Curvature remains in residuals", abs(np.corrcoef(resid, x**2)[0, 1]) > 0.4)
if HAS_MPL:
fig, ax = plt.subplots(1, 2, figsize=(12, 4.5))
ax[0].scatter(fitted, resid, s=14, alpha=0.6, color=COLORS["primary"])
ax[0].axhline(0.0, color=COLORS["neutral"], lw=1.5)
ax[0].set_title("Residuals vs fitted")
ax[0].set_xlabel("fitted")
ax[0].set_ylabel("residual")
ax[1].scatter(qq_theory, qq_sample, s=14, alpha=0.6, color=COLORS["secondary"])
ax[1].plot([min(qq_theory), max(qq_theory)], [min(qq_theory), max(qq_theory)], color=COLORS["neutral"], lw=1.5)
ax[1].set_title("QQ-style residual check")
ax[1].set_xlabel("theoretical quantiles")
ax[1].set_ylabel("sample quantiles")
plt.tight_layout()
plt.show()
4.3 Leverage, Influence, and Cook's Distance
High leverage and large residuals combine to create influential observations.
Code cell 38
# === 4.3 Leverage, Influence, and Cook's Distance ===
x = np.array([0.0, 0.2, 0.5, 0.8, 1.1, 1.4, 1.8, 5.0])
y = np.array([1.0, 1.2, 1.6, 1.9, 2.1, 2.4, 2.7, 6.8])
X = add_intercept(x)
beta = ols_beta(X, y)
fitted = X @ beta
resid = y - fitted
H = hat_matrix(X)
h = np.diag(H)
s2 = rss(y, fitted) / (len(y) - X.shape[1])
cooks = (resid**2 / (X.shape[1] * s2)) * (h / (1 - h) ** 2)
most_influential = int(np.argmax(cooks))
header("Influence diagnostics")
print("Leverage:", np.round(h, 4))
print("Cook's distance:", np.round(cooks, 4))
print(f"Most influential index: {most_influential}")
report("Edge point has highest leverage", np.argmax(h) == len(x) - 1)
report("Same point is most influential", most_influential == len(x) - 1)
4.4 Multicollinearity, Condition Number, and VIF
Collinearity damages coefficient stability even when prediction remains acceptable.
Code cell 40
# === 4.4 Multicollinearity, Condition Number, and VIF ===
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.05, size=300)
x3 = rng.normal(size=300)
X_no_intercept = np.column_stack([x1, x2, x3])
X = add_intercept(X_no_intercept)
y = 1.0 + 2.5 * x1 - 2.4 * x2 + 0.6 * x3 + rng.normal(scale=0.5, size=300)
beta = ols_beta(X, y)
condition = la.cond(StandardScaler().fit_transform(X_no_intercept))
vifs = vif_scores(StandardScaler().fit_transform(X_no_intercept))
header("Multicollinearity diagnostics")
print("OLS coefficients:", np.round(beta, 4))
print(f"Condition number: {condition:.2f}")
print("VIFs:", np.round(vifs, 2))
report("At least one VIF is very large", np.max(vifs) > 50)
4.5 Heteroscedasticity, Weighted Least Squares, and Robust Standard Errors
When error variance changes with the predictors, OLS coefficients can stay unbiased while inference becomes fragile.
Code cell 42
# === 4.5 Heteroscedasticity, Weighted Least Squares, and Robust Standard Errors ===
x = np.linspace(0.2, 4.0, 250)
sigma = 0.25 + 0.35 * x
y = 1.2 + 0.9 * x + rng.normal(scale=sigma)
X = add_intercept(x)
beta_ols = ols_beta(X, y)
fit_ols = X @ beta_ols
resid_ols = y - fit_ols
classic_se = np.sqrt(np.diag((rss(y, fit_ols) / (len(y) - X.shape[1])) * la.pinv(X.T @ X)))
robust_se = hc0_se(X, resid_ols)
W = np.diag(1 / sigma**2)
beta_wls = la.solve(X.T @ W @ X, X.T @ W @ y)
header("OLS vs WLS under heteroscedasticity")
print("OLS coefficients:", np.round(beta_ols, 4))
print("WLS coefficients:", np.round(beta_wls, 4))
print("Classical SEs:", np.round(classic_se, 4))
print("Robust HC0 SEs:", np.round(robust_se, 4))
report("Robust slope SE differs from classical SE", abs(robust_se[1] - classic_se[1]) > 0.01)
5. Core Theory III: Regularized and Generalized Regression
5.1 Ridge Regression as Stabilized OLS and Shrinkage
Ridge adds an L2 penalty to stabilize estimation when predictors are redundant or noisy.
Code cell 44
# === 5.1 Ridge Regression as Stabilized OLS and Shrinkage ===
X_raw = rng.normal(size=(140, 6))
X_raw[:, 1] = X_raw[:, 0] + 0.15 * rng.normal(size=140)
true_beta = np.array([2.2, 2.0, 0.0, 0.0, 0.7, 0.0])
y = X_raw @ true_beta + rng.normal(scale=1.0, size=140)
scaler = StandardScaler()
X = scaler.fit_transform(X_raw)
ols = ols_beta(add_intercept(X), y)[1:]
lambdas = np.logspace(-3, 2, 8)
ridge_betas = []
for lam in lambdas:
ridge_beta = la.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
ridge_betas.append(ridge_beta)
ridge_betas = np.array(ridge_betas)
header("Ridge shrinkage path")
print("OLS coefficient norm:", la.norm(ols))
print("Largest-lambda ridge norm:", la.norm(ridge_betas[-1]))
report("Ridge shrinks coefficient norm", la.norm(ridge_betas[-1]) < la.norm(ols))
if HAS_MPL:
fig, ax = plt.subplots()
for j in range(ridge_betas.shape[1]):
ax.plot(lambdas, ridge_betas[:, j], marker='o', label=f'beta{j+1}')
ax.set_xscale('log')
ax.set_title('Ridge coefficient paths')
ax.set_xlabel('lambda')
ax.set_ylabel('coefficient value')
plt.tight_layout()
plt.show()
5.2 Lasso Regression and Sparse Feature Selection
The L1 penalty encourages exact zeros and yields sparse fitted models.
Code cell 46
# === 5.2 Lasso Regression and Sparse Feature Selection ===
X_raw = rng.normal(size=(160, 8))
X_raw[:, 2] = X_raw[:, 1] + 0.2 * rng.normal(size=160)
true_beta = np.array([2.5, 0.0, 0.0, -1.8, 0.0, 0.7, 0.0, 0.0])
y = X_raw @ true_beta + rng.normal(scale=0.8, size=160)
X = StandardScaler().fit_transform(X_raw)
alphas = [0.01, 0.05, 0.1, 0.2]
coefs = []
for alpha in alphas:
model = Lasso(alpha=alpha, max_iter=20000)
model.fit(X, y)
coefs.append(model.coef_)
coefs = np.array(coefs)
header("Lasso sparsity")
for alpha, coef in zip(alphas, coefs):
print(f"alpha={alpha:>4}: nonzero={np.sum(np.abs(coef) > 1e-6)} coef={np.round(coef, 3)}")
report("Higher alpha yields at least as much sparsity", np.sum(np.abs(coefs[-1]) > 1e-6) <= np.sum(np.abs(coefs[0]) > 1e-6))
5.3 Logistic Regression as a Bernoulli GLM
Logistic regression maps a linear predictor through the sigmoid to model probabilities.
Code cell 48
# === 5.3 Logistic Regression as a Bernoulli GLM ===
X = rng.normal(size=(500, 2))
logits = -0.4 + 1.6 * X[:, 0] - 1.1 * X[:, 1]
probs = sigmoid(logits)
y = rng.binomial(1, probs)
clf = LogisticRegression(C=1e6, solver='lbfgs', max_iter=2000)
clf.fit(X, y)
pred_prob = clf.predict_proba(X[:5])[:, 1]
odds_ratio_x1 = np.exp(clf.coef_[0, 0])
header("Logistic regression")
print("Estimated intercept:", np.round(clf.intercept_, 4))
print("Estimated coefficients:", np.round(clf.coef_[0], 4))
print("First five predicted probabilities:", np.round(pred_prob, 4))
print(f"Odds ratio for x1: {odds_ratio_x1:.4f}")
report("Predicted probabilities stay in [0, 1]", np.all((pred_prob >= 0) & (pred_prob <= 1)))
5.4 Poisson Regression for Counts and Rates
Poisson regression uses a log link so that fitted means stay positive.
Code cell 50
# === 5.4 Poisson Regression for Counts and Rates ===
x = rng.normal(size=450)
exposure = rng.uniform(0.5, 3.0, size=450)
mu = exposure * np.exp(0.3 + 0.7 * x)
y = rng.poisson(mu)
X = np.column_stack([x, np.log(exposure)])
model = PoissonRegressor(alpha=1e-8, max_iter=3000)
model.fit(X, y)
pred_counts = model.predict(X[:5])
rate_ratio = np.exp(model.coef_[0])
header("Poisson regression with exposure-like feature")
print("Estimated intercept:", round(model.intercept_, 4))
print("Estimated coefficients:", np.round(model.coef_, 4))
print("First five predicted counts:", np.round(pred_counts, 3))
print(f"Multiplicative count change per +1 in x: {rate_ratio:.4f}")
report("Predicted counts are positive", np.all(pred_counts > 0))
5.5 Generalized Linear Models
GLMs keep the same linear predictor but change the response distribution and link function.
Code cell 52
# === 5.5 Generalized Linear Models ===
eta = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
identity_mean = eta
logistic_mean = sigmoid(eta)
poisson_mean = np.exp(eta)
header("GLM link comparison")
print("eta :", eta)
print("identity mean :", np.round(identity_mean, 4))
print("logit^{-1}(eta):", np.round(logistic_mean, 4))
print("exp(eta) :", np.round(poisson_mean, 4))
report("Gaussian identity can be negative", np.any(identity_mean < 0))
report("Poisson mean remains positive", np.all(poisson_mean > 0))
6. Advanced Topics
6.1 Basis Expansions, Polynomial Regression, and Spline Intuition
A nonlinear relationship can still be modeled by regression once the features are expanded.
Code cell 54
# === 6.1 Basis Expansions, Polynomial Regression, and Spline Intuition ===
x = np.linspace(-2.5, 2.5, 180)
y = np.sin(1.5 * x) + 0.2 * x + rng.normal(scale=0.18, size=len(x))
X_linear = add_intercept(x)
X_poly = PolynomialFeatures(degree=3, include_bias=True).fit_transform(x[:, None])
knots = np.array([-1.0, 0.0, 1.0])
X_hinge = np.column_stack([np.ones(len(x)), x] + [np.maximum(0.0, x - k) for k in knots])
fit_linear = X_linear @ ols_beta(X_linear, y)
fit_poly = X_poly @ ols_beta(X_poly, y)
fit_hinge = X_hinge @ ols_beta(X_hinge, y)
header("Basis expansions")
print(f"Linear MSE : {mean_squared_error(y, fit_linear):.4f}")
print(f"Cubic basis MSE: {mean_squared_error(y, fit_poly):.4f}")
print(f"Hinge basis MSE: {mean_squared_error(y, fit_hinge):.4f}")
report("At least one expanded basis beats plain linear fit", min(mean_squared_error(y, fit_poly), mean_squared_error(y, fit_hinge)) < mean_squared_error(y, fit_linear))
6.2 High-Dimensional Regression and the p > n Regime
When predictors outnumber observations, there are many interpolating solutions and the inductive bias chooses among them.
Code cell 56
# === 6.2 High-Dimensional Regression and the p > n Regime ===
n, p = 25, 60
X_train = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = np.array([2.0, -1.5, 1.0, 0.5, -0.8])
y_train = X_train @ beta_true + rng.normal(scale=0.15, size=n)
X_test = rng.normal(size=(400, p))
y_test = X_test @ beta_true + rng.normal(scale=0.15, size=400)
beta_min_norm = la.pinv(X_train) @ y_train
beta_ridge = la.solve(X_train.T @ X_train + 2.0 * np.eye(p), X_train.T @ y_train)
train_mse_min = mean_squared_error(y_train, X_train @ beta_min_norm)
train_mse_ridge = mean_squared_error(y_train, X_train @ beta_ridge)
test_mse_min = mean_squared_error(y_test, X_test @ beta_min_norm)
test_mse_ridge = mean_squared_error(y_test, X_test @ beta_ridge)
header("p > n regime")
print(f"Minimum-norm train MSE: {train_mse_min:.6f}")
print(f"Ridge train MSE : {train_mse_ridge:.6f}")
print(f"Minimum-norm test MSE: {test_mse_min:.6f}")
print(f"Ridge test MSE : {test_mse_ridge:.6f}")
print(f"||beta_min_norm||_2 : {la.norm(beta_min_norm):.4f}")
print(f"||beta_ridge||_2 : {la.norm(beta_ridge):.4f}")
report("Minimum-norm interpolation nearly fits training data", train_mse_min < 0.05)
6.3 Robust Regression and Outlier-Resistant Losses
Outlier-resistant losses reduce the influence of extreme residuals.
Code cell 58
# === 6.3 Robust Regression and Outlier-Resistant Losses ===
x = rng.uniform(-2, 2, size=160)
y = 1.0 + 1.8 * x + rng.normal(scale=0.35, size=len(x))
y_with_outliers = y.copy()
y_with_outliers[[10, 60, 120]] += np.array([8.0, -7.0, 9.0])
X = x[:, None]
ols = LinearRegression().fit(X, y_with_outliers)
huber = HuberRegressor().fit(X, y_with_outliers)
clean_mask = np.ones(len(x), dtype=bool)
clean_mask[[10, 60, 120]] = False
header("Robust vs ordinary regression")
print(f"OLS slope : {ols.coef_[0]:.4f}")
print(f"Huber slope : {huber.coef_[0]:.4f}")
print(f"Clean-data MSE, OLS : {mean_squared_error(y[clean_mask], ols.predict(X[clean_mask])):.4f}")
print(f"Clean-data MSE, Huber: {mean_squared_error(y[clean_mask], huber.predict(X[clean_mask])):.4f}")
report("Huber is closer to the clean slope", abs(huber.coef_[0] - 1.8) < abs(ols.coef_[0] - 1.8))
6.4 Model Selection, Cross-Validation, and Information Criteria
Choosing complexity requires a penalty, validation strategy, or both.
Code cell 60
# === 6.4 Model Selection, Cross-Validation, and Information Criteria ===
x = np.linspace(-2, 2, 160)
y = 1.0 + 1.2 * x - 0.8 * x**2 + 0.5 * x**3 + rng.normal(scale=0.45, size=len(x))
kf = KFold(n_splits=5, shuffle=True, random_state=42)
results = []
for degree in range(1, 6):
poly = PolynomialFeatures(degree=degree, include_bias=True)
X = poly.fit_transform(x[:, None])
cv_mse = []
for train_idx, test_idx in kf.split(X):
beta = ols_beta(X[train_idx], y[train_idx])
pred = X[test_idx] @ beta
cv_mse.append(mean_squared_error(y[test_idx], pred))
beta_full = ols_beta(X, y)
fit_full = X @ beta_full
rss_full = rss(y, fit_full)
n = len(y)
k = X.shape[1]
aic = n * np.log(rss_full / n) + 2 * k
bic = n * np.log(rss_full / n) + k * np.log(n)
results.append((degree, np.mean(cv_mse), aic, bic))
header("Model-selection criteria across polynomial degree")
for degree, cv_mse, aic, bic in results:
print(f"degree={degree}: CV MSE={cv_mse:.4f}, AIC={aic:.2f}, BIC={bic:.2f}")
best_cv = min(results, key=lambda t: t[1])[0]
best_bic = min(results, key=lambda t: t[3])[0]
report("Cross-validation and BIC both produce a finite optimum", 1 <= best_cv <= 5 and 1 <= best_bic <= 5)
6.5 Omitted-Variable Bias, Extrapolation, and Causal Caution
A fitted coefficient can absorb missing structure, and a linear fit can become unreliable outside the observed region.
Code cell 62
# === 6.5 Omitted-Variable Bias, Extrapolation, and Causal Caution ===
Z = rng.normal(size=500)
X = 0.7 * Z + rng.normal(scale=0.6, size=500)
Y = 2.0 * X + 3.0 * Z + rng.normal(scale=0.5, size=500)
beta_short = ols_beta(add_intercept(X), Y)
beta_full = ols_beta(add_intercept(np.column_stack([X, Z])), Y)
x_train = np.linspace(0, 1, 80)
y_train = 1.0 + 0.5 * x_train + 2.0 * x_train**2 + rng.normal(scale=0.1, size=len(x_train))
linear_extrap = ols_beta(add_intercept(x_train), y_train)
x_star = 2.0
pred_star = np.array([1.0, x_star]) @ linear_extrap
true_star = 1.0 + 0.5 * x_star + 2.0 * x_star**2
header("Bias from omission and danger from extrapolation")
print(f"Slope on X alone : {beta_short[1]:.4f}")
print(f"Slope on X controlling Z: {beta_full[1]:.4f}")
print(f"Linear extrapolated prediction at x=2.0: {pred_star:.4f}")
print(f"Underlying nonlinear value at x=2.0 : {true_star:.4f}")
report("Omitted variable changes the slope materially", abs(beta_short[1] - beta_full[1]) > 0.5)
report("Extrapolation error is large", abs(pred_star - true_star) > 1.0)
7. Applications in Machine Learning
7.1 Linear Regression Baselines for Tabular and Structured Prediction
A strong linear baseline is often the fastest way to discover whether a task is mostly about representation quality or about model complexity.
Code cell 64
# === 7.1 Linear Regression Baselines for Tabular and Structured Prediction ===
X = rng.normal(size=(1200, 6))
true_beta = np.array([1.5, -2.0, 0.8, 0.0, 0.6, 0.0])
y = X @ true_beta + rng.normal(scale=1.1, size=1200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
lin = LinearRegression().fit(X_train, y_train)
mean_pred = np.full_like(y_test, y_train.mean())
lin_pred = lin.predict(X_test)
header("Tabular baseline check")
print(f"Mean-baseline test MSE : {mean_squared_error(y_test, mean_pred):.4f}")
print(f"Linear-model test MSE : {mean_squared_error(y_test, lin_pred):.4f}")
print("Estimated coefficients:", np.round(lin.coef_, 3))
report("Linear baseline beats mean baseline", mean_squared_error(y_test, lin_pred) < mean_squared_error(y_test, mean_pred))
7.2 Logistic Regression Baselines, Calibration, and Probability Outputs
Logistic regression is still a reliable probabilistic baseline when features already carry the important structure.
Code cell 66
# === 7.2 Logistic Regression Baselines, Calibration, and Probability Outputs ===
X = rng.normal(size=(1400, 3))
logits = -0.3 + 1.4 * X[:, 0] - 0.9 * X[:, 1] + 0.7 * X[:, 2]
prob = sigmoid(logits)
y = rng.binomial(1, prob)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = LogisticRegression(C=1.0, solver='lbfgs', max_iter=2000)
clf.fit(X_train, y_train)
p_test = clf.predict_proba(X_test)[:, 1]
acc = accuracy_score(y_test, (p_test >= 0.5).astype(int))
brier = brier_score_loss(y_test, p_test)
bins = np.linspace(0, 1, 6)
indices = np.digitize(p_test, bins) - 1
calibration_rows = []
for b in range(len(bins) - 1):
mask = indices == b
if mask.sum() > 0:
calibration_rows.append((mask.sum(), p_test[mask].mean(), y_test[mask].mean()))
header("Logistic baseline and simple calibration view")
print(f"Test accuracy : {acc:.4f}")
print(f"Brier score : {brier:.4f}")
for count, p_mean, y_mean in calibration_rows:
print(f"bin count={count:>3d}, mean predicted={p_mean:.3f}, empirical positive rate={y_mean:.3f}")
report("Probabilities are well-formed", np.all((p_test >= 0) & (p_test <= 1)))
7.3 Linear Probes on Embeddings and Hidden States
A linear probe tests whether a representation makes a target linearly recoverable.
Code cell 68
# === 7.3 Linear Probes on Embeddings and Hidden States ===
emb = rng.normal(size=(900, 20))
probe_direction = rng.normal(size=20)
labels = (emb @ probe_direction + 0.3 * rng.normal(size=900) > 0).astype(int)
labels_shuffled = rng.permutation(labels)
E_train, E_test, y_train, y_test = train_test_split(emb, labels, test_size=0.3, random_state=42)
_, _, ys_train, ys_test = train_test_split(emb, labels_shuffled, test_size=0.3, random_state=42)
probe = LogisticRegression(C=1.0, solver='lbfgs', max_iter=2000)
probe.fit(E_train, y_train)
probe_shuffled = LogisticRegression(C=1.0, solver='lbfgs', max_iter=2000)
probe_shuffled.fit(E_train, ys_train)
acc_true = accuracy_score(y_test, probe.predict(E_test))
acc_shuffle = accuracy_score(ys_test, probe_shuffled.predict(E_test))
header("Linear probes on embeddings")
print(f"Probe accuracy on real labels : {acc_true:.4f}")
print(f"Probe accuracy on shuffled labels: {acc_shuffle:.4f}")
report("Representation contains linearly decodable signal", acc_true > acc_shuffle + 0.2)
7.4 Ridge, Lasso, Weight Decay, and Sparse or Low-Rank Adaptation
Modern regularization and adaptation methods often repackage classical shrinkage ideas.
Code cell 70
# === 7.4 Ridge, Lasso, Weight Decay, and Sparse or Low-Rank Adaptation ===
X = rng.normal(size=(240, 12))
true_beta = np.zeros(12)
true_beta[:4] = np.array([2.0, -1.6, 1.2, 0.8])
y = X @ true_beta + rng.normal(scale=1.0, size=240)
X_std = StandardScaler().fit_transform(X)
ridge = Ridge(alpha=5.0).fit(X_std, y)
lasso = Lasso(alpha=0.15, max_iter=20000).fit(X_std, y)
A = rng.normal(size=(16, 2))
B = rng.normal(size=(2, 10))
delta = A @ B + 0.05 * rng.normal(size=(16, 10))
U, S, Vt = la.svd(delta, full_matrices=False)
rank2 = (U[:, :2] * S[:2]) @ Vt[:2]
rank2_error = la.norm(delta - rank2, ord='fro')
full_error = la.norm(delta, ord='fro')
header("Regularization and structured adaptation")
print(f"Ridge coefficient norm: {la.norm(ridge.coef_):.4f}")
print(f"Lasso nonzero count : {np.sum(np.abs(lasso.coef_) > 1e-6)}")
print(f"Rank-2 approximation relative error: {rank2_error / full_error:.4f}")
report("Lasso is sparser than Ridge", np.sum(np.abs(lasso.coef_) > 1e-6) < len(ridge.coef_))
report("Low-rank approximation captures most structure", rank2_error / full_error < 0.25)
7.5 GLMs in Production Systems
Binary events, counts, and rates are common operational targets, so logistic and Poisson models remain useful production tools.
Code cell 72
# === 7.5 GLMs in Production Systems ===
traffic = rng.normal(size=(800, 2))
ctr_logits = -1.0 + 0.9 * traffic[:, 0] + 0.6 * traffic[:, 1]
ctr_labels = rng.binomial(1, sigmoid(ctr_logits))
ctr_model = LogisticRegression(C=1.0, solver='lbfgs', max_iter=2000).fit(traffic, ctr_labels)
ops_x = rng.normal(size=800)
ops_exposure = rng.uniform(0.8, 2.5, size=800)
incident_mu = ops_exposure * np.exp(-0.2 + 0.8 * ops_x)
incident_counts = rng.poisson(incident_mu)
incident_model = PoissonRegressor(alpha=1e-8, max_iter=3000).fit(np.column_stack([ops_x, np.log(ops_exposure)]), incident_counts)
sample_ctr = ctr_model.predict_proba(np.array([[0.5, -0.1], [1.2, 0.9]]))[:, 1]
sample_incidents = incident_model.predict(np.array([[0.2, np.log(1.0)], [1.0, np.log(2.0)]]))
header("GLMs in production-style targets")
print("Sample CTR predictions :", np.round(sample_ctr, 4))
print("Sample incident predictions :", np.round(sample_incidents, 4))
report("CTR outputs are probabilities", np.all((sample_ctr >= 0) & (sample_ctr <= 1)))
report("Incident predictions are positive", np.all(sample_incidents > 0))
8. Common Mistakes
| # | Mistake | Why it is wrong | Fix |
|---|---|---|---|
| 1 | Interpreting OLS coefficients without checking multicollinearity | Correlated predictors inflate variance and flip signs | Compute VIF; consider Ridge or variable selection |
| 2 | Using R² to compare models with different numbers of predictors | R² never decreases as predictors are added | Use adjusted R², AIC, BIC, or cross-validation |
| 3 | Ignoring residual plots | Violations of linearity or homoscedasticity are invisible from summary stats alone | Always check residuals-vs-fitted and QQ plot |
| 4 | Forgetting to standardize before Ridge/Lasso | The penalty acts on raw coefficient sizes, so features on different scales are shrunk unevenly | Standardize features to zero mean and unit variance first (see the sketch after this table) |
| 5 | Using Lasso when groups of correlated features matter | Lasso tends to keep one feature from a correlated group and arbitrarily drop the rest | Use Elastic Net or Group Lasso for correlated features |
| 6 | Predicting outside the training range and trusting the result | Extrapolation assumes the linear relationship continues | Report wider uncertainty or refuse prediction far from training data |
| 7 | Using accuracy as the loss for logistic regression training | Accuracy is non-differentiable; gradient is zero almost everywhere | Minimize log-loss (binary cross-entropy) |
| 8 | Treating confidence intervals and prediction intervals as the same | PI is wider because it includes observation noise | Compute both; use PI when making individual-level predictions |
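To make mistake 4 concrete, here is a minimal sketch with hypothetical feature scales chosen purely for illustration: two equally informative signals, one stored on a numerically tiny scale, fed to Lasso before and after standardization.
# Mistake 4 demo: without standardization the L1 penalty shrinks features unevenly,
# and the small-scale feature (which needs a large coefficient) is silently zeroed out.
rng_demo = np.random.default_rng(0)
z1 = rng_demo.normal(size=400)
z2 = rng_demo.normal(size=400)
X_raw_demo = np.column_stack([z1, z2 / 100.0])        # second column stored on a 1/100 scale
y_demo = 2.0 * z1 + 2.0 * z2 + rng_demo.normal(scale=0.5, size=400)
lasso_raw = Lasso(alpha=0.5, max_iter=20000).fit(X_raw_demo, y_demo)
lasso_std = Lasso(alpha=0.5, max_iter=20000).fit(StandardScaler().fit_transform(X_raw_demo), y_demo)
print("Raw-feature Lasso coefficients         :", np.round(lasso_raw.coef_, 3))
print("Standardized-feature Lasso coefficients:", np.round(lasso_std.coef_, 3))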
What to Notice After Running This Notebook
- OLS is best understood as projection before it is understood as algebra.
- Diagnostics matter because the wrong mean structure can survive pretty coefficient tables.
- Regularization changes the geometry of estimation long before it changes the buzzwords around a model.
- Logistic and Poisson regression are not side quests; they are the core GLM bridge from classical statistics to modern ML losses.
- Linear probes, calibration heads, and weight-decay-style penalties are direct descendants of the regression ideas in this section.