ML MATH MAP

Math for LLMs

A precise mapping from mathematical concepts to their concrete roles in modern ML systems. Each entry cites the specific model, paper, or algorithm where the mathematics is load-bearing — not merely present.


How to Read This Document

Each section identifies a mathematical domain, lists its core concepts, and maps each concept to: (a) the ML context, (b) the specific operation or formula, and (c) a concrete example from current practice. This is a research-grade reference, not a survey.


1. Linear Algebra → ML

1.1 Matrix Multiplication

| Formula | ML operation | Role / cost | Example |
|---|---|---|---|
| $\mathbf{y} = W\mathbf{x} + \mathbf{b}$ | Fully-connected layer | Forward pass | Every dense layer in every neural network |
| $\text{Attention} = \text{softmax}(QK^\top/\sqrt{d_k})V$ | Self-attention | $O(n^2 d)$ | Transformers: GPT-4, LLaMA-3, Gemini |
| $H^{(l+1)} = \sigma(\hat{A} H^{(l)} W^{(l)})$ | Graph convolution | Message passing | GCN (Kipf & Welling, 2017) |
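A minimal single-head sketch of the attention row above in NumPy. The toy shapes (`n = 4`, `d = 8`) and the absence of masking, batching, and multiple heads are simplifying assumptions, not features of any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability (log-sum-exp trick).
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, the O(n^2 d) operation from the table.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) pairwise similarities
    return softmax(scores, axis=-1) @ V  # (n, d_v) weighted mix of values

rng = np.random.default_rng(0)
n, d = 4, 8                              # toy sequence length and head dimension
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(attention(Q, K, V).shape)          # (4, 8)
```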

1.2 Eigendecomposition

| Concept | ML role | Where |
|---|---|---|
| $A\mathbf{v} = \lambda \mathbf{v}$ | Stability of RNN hidden state | $\rho(W_h) < 1$ prevents exploding gradients |
| Top eigenvector of $X^\top X$ | First principal component | PCA for dimensionality reduction |
| Eigenvalues of graph Laplacian $L$ | Spectral graph convolution | ChebNet, spectral GNNs |
| Hessian spectrum $\nabla^2 \mathcal{L}$ | Loss landscape sharpness | Sharpness-aware minimisation (Foret et al., 2021) |
| Neural Tangent Kernel eigenvalues | Training speed of wide networks | NTK theory (Jacot et al., 2018) |
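A sketch of the PCA row above: the first principal direction is the top eigenvector of $X^\top X$, recovered here by power iteration. The data shape, seed, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

def top_eigenpair(M, iters=500):
    # Power iteration: repeatedly apply M and renormalise; for a symmetric M
    # this converges to the eigenvector of the largest eigenvalue.
    v = np.random.default_rng(0).standard_normal(M.shape[0])
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v, v @ M @ v                  # eigenvector, Rayleigh-quotient eigenvalue

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 5))
X -= X.mean(axis=0)                      # centre the data first
v, lam = top_eigenpair(X.T @ X)          # first principal direction of X
print(lam, np.linalg.eigvalsh(X.T @ X)[-1])   # the two values should agree closely
```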

1.3 SVD

| Concept | ML role | Where |
|---|---|---|
| $A = U\Sigma V^\top$ | Low-rank weight decomposition | LoRA (Hu et al., 2022): $\Delta W = BA$ |
| Eckart–Young theorem | Optimal rank-$k$ approximation | Matrix factorisation for recommenders |
| Pseudoinverse $A^\dagger$ | Least-squares solution | Normal equations; ridge regression |
| Singular value spectrum | Weight matrix health | WeightWatcher (Martin & Mahoney, 2021) |
| Randomised SVD | Scalable approximation | Halko, Martinsson & Tropp (2011) |
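Two of the rows above sketched in NumPy: the Eckart–Young truncated SVD, and a LoRA-style low-rank update $\Delta W = BA$. The matrix sizes, rank, and init scale are illustrative; initialising $B = 0$ and $A$ to small random values follows the LoRA paper's convention.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))        # stand-in for a frozen weight matrix

# Eckart-Young: the rank-k truncated SVD is the best rank-k approximation of W
# in Frobenius norm, with error sqrt(sum of the discarded sigma_i^2).
k = 8
U, S, Vt = np.linalg.svd(W, full_matrices=False)
W_k = (U[:, :k] * S[:k]) @ Vt[:k]
print(np.linalg.norm(W - W_k), np.sqrt((S[k:] ** 2).sum()))   # equal

# LoRA-style update: learn two thin factors so Delta W = B A has rank <= r.
r = 8
B = np.zeros((64, r))                    # B starts at zero ...
A = 0.01 * rng.standard_normal((r, 64))  # ... A starts small, so Delta W = 0 at init
W_adapted = W + B @ A                    # adaptation begins exactly at the frozen W
print(np.allclose(W_adapted, W))         # True before any training
```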

1.4 Norms and Regularisation

| Norm | Regulariser | Effect | Used in |
|---|---|---|---|
| $\lVert \boldsymbol{\theta} \rVert_2^2$ | L2 / weight decay | Penalises large weights | AdamW, all modern LLMs |
| $\lVert \boldsymbol{\theta} \rVert_1$ | L1 / Lasso | Induces sparsity | Sparse fine-tuning |
| $\lVert W \rVert_2 = \sigma_{\max}(W)$ | Spectral normalisation | Lipschitz constraint | GANs (Miyato et al., 2018) |
| $\lVert A \rVert_*$ (nuclear) | Nuclear norm | Low-rank inductive bias | Matrix completion |
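Each norm in the table computed directly, with illustrative shapes; the point worth noticing is that the spectral and nuclear norms are both functions of the singular value spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.standard_normal(10)          # a parameter vector
W = rng.standard_normal((20, 30))        # a weight matrix

l2_squared = np.sum(theta ** 2)          # ||theta||_2^2, the weight-decay penalty
l1 = np.sum(np.abs(theta))               # ||theta||_1, the sparsity-inducing penalty

sigma = np.linalg.svd(W, compute_uv=False)
spectral = sigma.max()                   # ||W||_2 = sigma_max, bounds the Lipschitz constant
nuclear = sigma.sum()                    # ||W||_*, convex surrogate for low rank
print(l2_squared, l1, spectral, nuclear)
```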

2. Calculus → ML

2.1 Chain Rule = Backpropagation

The chain rule is not merely used in backpropagation — it is backpropagation.

$$\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{[L]}} \cdot \prod_{k=l+1}^{L} \frac{\partial \mathbf{a}^{[k]}}{\partial \mathbf{a}^{[k-1]}} \cdot \frac{\partial \mathbf{a}^{[l]}}{\partial W^{[l]}}$$

Every automatic differentiation framework (PyTorch, JAX, TensorFlow) is, at its core, a reverse-mode mechanisation of this identity.
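A hand-rolled instance of the identity above for a two-layer network with a squared-error loss, checked against a finite difference. The tiny sizes and the tanh activation are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(3), rng.standard_normal(2)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((2, 4))

def loss(W1, W2):
    a1 = np.tanh(W1 @ x)                       # hidden activations
    return 0.5 * np.sum((W2 @ a1 - y) ** 2)    # squared-error loss

# Chain rule applied by hand: dL/dW1 = (dL/da2)(da2/da1)(da1/dz1)(dz1/dW1).
a1 = np.tanh(W1 @ x)
delta2 = W2 @ a1 - y                           # dL/da2
delta1 = (W2.T @ delta2) * (1 - a1 ** 2)       # back through W2, then through tanh'
grad_W1 = np.outer(delta1, x)                  # dz1/dW1 contributes x

# Finite-difference check of one entry.
eps = 1e-6
W1_pert = W1.copy(); W1_pert[0, 0] += eps
print(grad_W1[0, 0], (loss(W1_pert, W2) - loss(W1, W2)) / eps)   # nearly equal
```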

2.2 Key Derivatives in Practice

| Function | Derivative | Why it matters |
|---|---|---|
| $\sigma(x) = 1/(1+e^{-x})$ | $\sigma(x)(1-\sigma(x))$ | LSTM gate updates; vanishes for large $\lvert x \rvert$ |
| $\tanh(x)$ | $1 - \tanh^2(x)$ | RNN hidden states; range $(-1,1)$ |
| $\text{ReLU}(x) = \max(0,x)$ | $\mathbb{1}[x > 0]$ | Avoids vanishing gradient for $x > 0$ |
| $\text{GELU}(x) = x\Phi(x)$ | $\Phi(x) + x\phi(x)$ | GPT-2, BERT, modern transformers |
| $\text{softmax}(\mathbf{z})_i$ | $s_i(\delta_{ij} - s_j)$ | Cross-entropy gradient; numerically use log-sum-exp |
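A quick numerical sketch of the last row: computing softmax through log-sum-exp avoids overflow, and the Jacobian $s_i(\delta_{ij} - s_j)$ has columns summing to zero because the softmax outputs sum to one. The example logits are arbitrary.

```python
import numpy as np

def log_softmax(z):
    # Log-sum-exp trick: shift by max(z) so exp() cannot overflow.
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

z = np.array([1000.0, 1001.0, 1002.0])   # naive softmax would overflow here
print(np.exp(log_softmax(z)))            # ~[0.090, 0.245, 0.665]

# Jacobian J_ij = s_i (delta_ij - s_j) = diag(s) - s s^T.
s = np.exp(log_softmax(np.array([0.5, -1.0, 2.0])))
J = np.diag(s) - np.outer(s, s)
print(J.sum(axis=0))                     # ~0 in every column: softmax always sums to 1
```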

2.3 Gradient Flow and Vanishing/Exploding

For a depth-$L$ network: $\frac{\partial \mathcal{L}}{\partial W^{[1]}} \propto \prod_{l=2}^{L} W^{[l]} \cdot \sigma'(\mathbf{z}^{[l]})$.

| Condition | Effect | Fix |
|---|---|---|
| $\lVert W \sigma' \rVert < 1$ repeated | Vanishing gradient | Residual connections, LSTM gates, ReLU |
| $\lVert W \sigma' \rVert > 1$ repeated | Exploding gradient | Gradient clipping $\lVert \mathbf{g} \rVert \le c$ |
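A sketch of global-norm gradient clipping from the "Fix" column; the per-layer gradients and the threshold are made up for illustration, and deep learning frameworks ship equivalent utilities.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Rescale every gradient by one shared factor so the global L2 norm is at
    # most max_norm; the update direction is preserved, only its length shrinks.
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

grads = [np.full((3, 3), 10.0), np.full(5, -7.0)]      # pretend per-layer gradients
clipped, before = clip_by_global_norm(grads, max_norm=1.0)
after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(before, after)                                    # ~33.8 -> 1.0
```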

2.4 Second-Order Methods

| Concept | Formula | ML application |
|---|---|---|
| Hessian $H = \nabla^2 \mathcal{L}$ | $(H)_{ij} = \partial^2 \mathcal{L}/\partial\theta_i\partial\theta_j$ | Newton's method; Fisher information matrix |
| Gauss–Newton approximation | $H \approx J^\top J$ | K-FAC (Martens & Grosse, 2015) |
| Sharpness $\lambda_{\max}(H)$ | Largest Hessian eigenvalue | SAM (Foret et al., 2021); flat minima generalise better |
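A sketch of estimating the sharpness $\lambda_{\max}(H)$ without ever forming the Hessian: power iteration on Hessian-vector products. Here the loss is a toy quadratic with a known Hessian and the HVP is a finite difference of the gradient; in practice the HVP would come from a second backward pass through autograd.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])       # toy quadratic loss 0.5 * theta^T A theta

def grad(theta):
    return A @ theta                         # its exact gradient

def hvp(theta, v, eps=1e-4):
    # Hessian-vector product via a central difference of the gradient:
    # H v ~ (grad(theta + eps v) - grad(theta - eps v)) / (2 eps).
    return (grad(theta + eps * v) - grad(theta - eps * v)) / (2 * eps)

def sharpness(theta, iters=100):
    # Power iteration using only HVPs; the Rayleigh quotient approximates lambda_max(H).
    v = np.random.default_rng(0).standard_normal(theta.shape)
    for _ in range(iters):
        v = hvp(theta, v)
        v /= np.linalg.norm(v)
    return v @ hvp(theta, v)

theta = np.zeros(2)
print(sharpness(theta), np.linalg.eigvalsh(A)[-1])   # both ~3.618
```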

3. Probability → ML

3.1 Probabilistic View of Supervised Learning

| Framework | Objective | Equivalent to |
|---|---|---|
| MLE | $\max \sum_i \log p(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta})$ | Minimising cross-entropy loss |
| MAP | $\max \sum_i \log p(y^{(i)} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}) + \log p(\boldsymbol{\theta})$ | L2 reg. (Gaussian prior), L1 reg. (Laplace prior) |
| ELBO | $\mathbb{E}_{q}[\log p(\mathbf{x} \mid \mathbf{z})] - D_\text{KL}(q(\mathbf{z}\mid\mathbf{x}) \,\Vert\, p(\mathbf{z}))$ | VAE training objective (Kingma & Welling, 2014) |
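A small numerical check of the first row: the negative log-likelihood of a categorical (softmax) model is exactly the cross-entropy against one-hot targets. The logits and labels are random toy data.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)                 # numerically stable
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 3))                      # 4 examples, 3 classes
y = np.array([0, 2, 1, 2])                                # observed labels

# MLE objective: mean negative log-likelihood of the observed labels.
nll = -log_softmax(logits)[np.arange(4), y].mean()

# Cross-entropy with one-hot targets gives the identical number.
onehot = np.eye(3)[y]
ce = -(onehot * log_softmax(logits)).sum(axis=-1).mean()
print(nll, ce)                                            # equal
```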

3.2 Distributions in Active Use (2026)

| Distribution | Where | Formula |
|---|---|---|
| $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ | Weight init (He/Xavier), VAE latent, diffusion | $p(\mathbf{x}) \propto \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top \Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$ |
| Categorical / softmax | LLM output token distribution | $p(x_i) = e^{z_i}/\sum_j e^{z_j}$ |
| Bernoulli | Dropout mask, binary classification | $P(X=1) = p$ |
| Dirichlet | Topic models, LDA, concentration prior | Conjugate to the Categorical |

3.3 Bayes' Theorem in ML

$$p(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(\mathcal{D})}$$

  • Likelihood $p(\mathcal{D} \mid \boldsymbol{\theta})$: the loss function
  • Prior $p(\boldsymbol{\theta})$: regularisation
  • Posterior $p(\boldsymbol{\theta} \mid \mathcal{D})$: what we actually want
  • Evidence $p(\mathcal{D})$: intractable; approximated by ELBO, Laplace, MCMC

4. Information Theory → ML

| Concept | Formula | ML role |
|---|---|---|
| Cross-entropy | $H(p,q) = -\sum p \log q$ | The classification loss; training objective for all LLMs |
| KL divergence | $D_\text{KL}(p \,\Vert\, q) = \sum p \log(p/q)$ | VAE regulariser; knowledge distillation; RLHF KL penalty |
| Mutual information | $I(X;Y) = H(X) - H(X\mid Y)$ | InfoNCE loss; contrastive learning (SimCLR, CLIP) |
| Perplexity | $\exp\left(-\frac{1}{T}\sum_t \log p(x_t\mid x_{<t})\right)$ | LLM evaluation; lower = better language model |
| Bits-back coding | $-\mathbb{E}_q[\log p(\mathbf{x}\mid\mathbf{z})] + D_\text{KL}(q \,\Vert\, p)$ | VAE ELBO reinterpreted as compression |
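The first two rows and the perplexity row computed for toy distributions; the probabilities are made up, and the check confirms the identity $H(p,q) = H(p) + D_\text{KL}(p \,\Vert\, q)$.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])            # "true" distribution
q = np.array([0.5, 0.3, 0.2])            # model distribution

entropy = -(p * np.log(p)).sum()
cross_entropy = -(p * np.log(q)).sum()
kl = (p * np.log(p / q)).sum()
print(np.isclose(cross_entropy, entropy + kl))    # H(p, q) = H(p) + KL(p || q)

# Perplexity: exponentiated mean negative log-likelihood over a token sequence.
token_probs = np.array([0.5, 0.25, 0.125, 0.25])  # p(x_t | x_<t) at each step
perplexity = np.exp(-np.mean(np.log(token_probs)))
print(perplexity)                                 # 4.0: as uncertain as a uniform 4-way choice
```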

5. Optimisation → ML

5.1 Gradient Descent Variants

| Algorithm | Update rule | Used in |
|---|---|---|
| SGD + momentum | $\mathbf{v}_t = \beta\mathbf{v}_{t-1} + \nabla\mathcal{L}$; $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta\mathbf{v}_t$ | Vision models |
| Adam (Kingma & Ba, 2015) | Adaptive per-parameter $\eta$; bias-corrected moments | Default for LLMs |
| AdamW (Loshchilov & Hutter, 2019) | Adam + decoupled weight decay | GPT-3, LLaMA, all frontier LLMs |
| Muon (2024) | Orthogonalised Nesterov momentum | GPT-4o-scale training |
| SOAP (2024) | Shampoo + Adam preconditioner | State-of-the-art efficiency |
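A single-tensor sketch of the AdamW row; the hyperparameter values are common defaults rather than prescriptions. The detail worth seeing is that the weight decay term is applied to the parameters directly instead of being folded into the gradient.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # Adam: exponential moving averages of the gradient and its square,
    # with bias correction for the zero-initialised moments.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: shrink theta directly rather than adding an L2
    # term to grad; this is the difference between Adam+L2 and AdamW.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

theta, m, v = np.ones(4), np.zeros(4), np.zeros(4)
for t in range(1, 101):
    grad = 2 * theta                     # gradient of the toy loss sum(theta^2)
    theta, m, v = adamw_step(theta, grad, m, v, t)
print(theta)                             # moving toward the minimum at 0
```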

5.2 Learning Rate Schedules

| Schedule | Formula | Used in |
|---|---|---|
| Cosine annealing | $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max}-\eta_{\min})\left(1+\cos\frac{\pi t}{T}\right)$ | GPT, LLaMA pretraining |
| Linear warmup | $\eta_t = \eta_{\max} \cdot t / T_\text{warm}$ | All large models (first 1–4K steps) |
| WSD (warmup–stable–decay) | Constant phase + sharp cosine decay | Mistral, Phi-3 |
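A sketch combining the first two rows: linear warmup followed by cosine decay. All specific values (peak learning rate, warmup length, total steps) are placeholder assumptions, not recommendations.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=2000, total_steps=100_000):
    # Linear warmup from 0 to max_lr, then cosine annealing down to min_lr.
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in (0, 1000, 2000, 50_000, 100_000):
    print(step, round(lr_at(step), 7))   # 0 -> warmup peak -> decay to min_lr
```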

6. Curriculum Map — Math to Model

This table maps each repository chapter to the specific models and papers that use it as load-bearing mathematics.

| Chapter | Core math | Primary models / papers |
|---|---|---|
| 02 Linear Algebra Basics | Matrix ops, rank, projections | Every neural network |
| 03 Advanced Linear Algebra | SVD, eigenvalues | LoRA, PCA, WeightWatcher |
| 04 Calculus Fundamentals | Derivatives, chain rule | Backpropagation (Rumelhart et al., 1986) |
| 05 Multivariate Calculus | Jacobian, Hessian | Adam, K-FAC, SAM |
| 06 Probability Theory | Distributions, Bayes | VAE, DDPM, Bayesian deep learning |
| 07 Statistics | MLE, MAP, hypothesis tests | Training objectives, model selection |
| 08 Optimisation | Convexity, GD, constraints | All training algorithms |
| 09 Information Theory | Entropy, KL, MI | Cross-entropy loss, RLHF, contrastive learning |
| 10 Numerical Methods | Condition number, stability | Mixed precision, numerical autograd |
| 11 Graph Theory | Laplacian, random walks | GCN, GAT, Node2Vec |
| 12 Functional Analysis | Hilbert spaces, RKHS | SVMs, kernel methods, NTK theory |
| 13 ML-Specific Math | Attention math, normalisation | Transformers (Vaswani et al., 2017) |
| 14 Math for Specific Models | RNN/LSTM, CNN, GAN | Sequence models, generative models |

This map is updated with each new section added to the curriculum. For the definitive reference on mathematics for ML, see: Goodfellow, Bengio & Courville (2016); Bishop (2006); Shalev-Shwartz & Ben-David (2014).