Mathematics for AI/ML/LLM — Cheatsheet
Every formula that actually matters. No filler.
Organized by how you encounter them in practice.
1 · Linear Algebra
Vectors
| Operation | Formula | Why It Matters |
|---|---|---|
| Dot Product | a · b = ∑_i a_i b_i | Attention scores, similarity |
| Cosine Similarity | cos θ = (a · b) / (‖a‖ ‖b‖) | Embeddings, RAG retrieval, semantic search |
| L2 Norm | ‖a‖_2 = √(∑_i a_i²) | Weight decay, distance metrics |
| L1 Norm | ‖a‖_1 = ∑_i ∣a_i∣ | Sparsity, L1 regularization |
| Projection | proj_b(a) = ((a · b) / ‖b‖²) b | Orthogonal decomposition, Gram–Schmidt process |
Matrices
| Operation | Formula | NumPy / PyTorch |
|---|---|---|
| Matrix Multiply | (AB)_ij = ∑_k A_ik B_kj | A @ B |
| Transpose | (Aᵀ)_ij = A_ji | A.T |
| Inverse | A A⁻¹ = I | np.linalg.inv(A) |
| Trace | tr(A) = ∑_i A_ii | torch.trace(A) |
| Hadamard (element-wise) | (A ⊙ B)_ij = A_ij B_ij | A * B |
Decompositions
Eigendecomposition — Used in: PCA, spectral clustering, graph analysis
Av = λv,  A = PDP⁻¹
SVD — Used in: PCA, LoRA, matrix compression, recommender systems
A = UΣVᵀ
- U (m×m): left singular vectors
- Σ (m×n): singular values on the diagonal
- Vᵀ (n×n): right singular vectors
- Low-rank approximation: A_k = U_k Σ_k V_kᵀ (keep the top-k singular values)
PCA via SVD: Center data X, compute SVD, project onto top-k columns of V.
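A minimal NumPy sketch of PCA via SVD, following the recipe above (variable names and shapes are illustrative; X is assumed to be an (n_samples, n_features) array):

```python
import numpy as np

def pca_svd(X, k):
    """Project data onto its top-k principal components via SVD."""
    X_centered = X - X.mean(axis=0)                     # center each feature
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]                                 # top-k right singular vectors
    return X_centered @ components.T                    # (n_samples, k) projection

X = np.random.randn(100, 10)
X_reduced = pca_svd(X, k=3)
```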
Matrix Properties That Matter
| Property | Meaning | Where You See It |
|---|---|---|
| Positive Definite | xᵀAx > 0 for all x ≠ 0 | Covariance matrices, convex loss |
| Orthogonal | AᵀA = I | Rotation matrices, SVD components |
| Symmetric | A=AT | Covariance, Hessians, kernels |
| Rank | rank(A)= # linearly independent rows/cols | LoRA exploits low rank |
2 · Calculus
Derivatives You Need to Know
| Function | Derivative | Used In |
|---|---|---|
| x^n | n·x^(n−1) | Polynomial features |
| e^x | e^x | Softmax, exponential LR decay |
| ln(x) | 1/x | Log-likelihood, cross-entropy |
| σ(x) = 1/(1+e^(−x)) | σ(x)(1−σ(x)) | Sigmoid activation, logistic regression |
| tanh(x) | 1−tanh²(x) | RNN/LSTM gates |
Rules That Drive Backpropagation
| Rule | Formula |
|---|---|
| Chain Rule | (f∘g)′=f′(g(x))⋅g′(x) |
| Product Rule | (fg)′=f′g+fg′ |
| Sum Rule | (f+g)′=f′+g′ |
The chain rule IS backpropagation. Every layer computes local gradient × upstream gradient.
Multivariate Calculus
Gradient — direction of steepest ascent, used in every optimizer:
∇f = [∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_n]ᵀ
Jacobian (f:Rn→Rm) — used in normalizing flows, diffusion models:
J_ij = ∂f_i / ∂x_j
Hessian — second-order info, used in loss landscape analysis, Newton's method:
H_ij = ∂²f / (∂x_i ∂x_j)
- H≻0 (positive definite) → local minimum
- H≺0 (negative definite) → local maximum
- Mixed eigenvalues → saddle point (common in deep learning!)
3 · Probability & Statistics
Core Rules
| Formula | Name |
|---|---|
| P(A∪B) = P(A) + P(B) − P(A∩B) | Addition |
| P(A∩B) = P(A) · P(B∣A) | Multiplication |
| P(A∣B) = P(B∣A) P(A) / P(B) | Bayes' Theorem |
Bayes is everywhere: Naive Bayes, MAP estimation, Bayesian neural nets, posterior inference.
Key Distributions
| Distribution | PDF / PMF | Mean | Variance | Used In |
|---|---|---|---|---|
| Bernoulli | P(X=1) = p | p | p(1−p) | Binary classification |
| Categorical | P(X=k) = p_k | — | — | Multi-class output |
| Gaussian | (1/(σ√(2π))) e^(−(x−μ)²/(2σ²)) | μ | σ² | Weight init, noise, VAE |
| Multivariate Gaussian | (1/((2π)^(n/2) ∣Σ∣^(1/2))) e^(−½(x−μ)ᵀΣ⁻¹(x−μ)) | μ | Σ | Latent spaces, GMMs |
Estimation
MLE — maximize likelihood of observed data:
θ̂_MLE = argmax_θ ∑_{i=1}^n log p(x_i ∣ θ)
MAP — MLE + prior belief (= regularization!):
θ̂_MAP = argmax_θ [ ∑_{i=1}^n log p(x_i ∣ θ) + log p(θ) ]
Gaussian prior on θ → L2 regularization. Laplace prior → L1 regularization.
Expectation & Variance
E[X] = ∑_x x · P(x),  Var(X) = E[X²] − (E[X])²
Covariance
Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)],  ρ = Cov(X, Y) / (σ_X σ_Y)
4 · Activation Functions
| Function | Formula | Derivative | Used In |
|---|---|---|---|
| ReLU | max(0, x) | 1 if x > 0, 0 if x ≤ 0 | CNNs, default choice |
| Leaky ReLU | max(αx, x) | 1 if x > 0, α if x ≤ 0 | Avoids dead neurons |
| GELU | x · Φ(x) ≈ x · σ(1.702x) | (complex) | GPT, BERT, modern transformers |
| SiLU / Swish | x · σ(x) | σ(x) + x·σ(x)(1−σ(x)) | LLaMA, modern LLMs |
| Sigmoid | 1/(1+e^(−x)) | σ(x)(1−σ(x)) | Gates (LSTM), binary output |
| Tanh | (e^x − e^(−x)) / (e^x + e^(−x)) | 1−tanh²(x) | RNN hidden states |
| Softmax | e^(z_i) / ∑_j e^(z_j) | s_i(δ_ij − s_j) | Classification output, attention |
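A small PyTorch sketch of a numerically stable softmax: subtracting the row max before exponentiating leaves the result unchanged (softmax is shift-invariant) but prevents overflow. In practice you would simply call F.softmax.

```python
import torch

def stable_softmax(z, dim=-1):
    z = z - z.max(dim=dim, keepdim=True).values   # shift logits; result is unchanged
    e = torch.exp(z)
    return e / e.sum(dim=dim, keepdim=True)

logits = torch.tensor([[1000.0, 1001.0, 999.0]])
print(stable_softmax(logits))                     # no overflow, rows sum to 1
```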
5 · Loss Functions
Classification
Cross-Entropy (multi-class) — THE standard classification loss:
L = −∑_{i=1}^C y_i log(ŷ_i)
Binary Cross-Entropy:
L = −[y log(ŷ) + (1−y) log(1−ŷ)]
Focal Loss — handles class imbalance (used in object detection):
L = −α_t (1−ŷ_t)^γ log(ŷ_t)
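A hedged PyTorch sketch of binary focal loss matching the formula above; the alpha and gamma values are common defaults, not canonical:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss; down-weights easy examples via (1 - p_t)^gamma."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # = -log(p_t)
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
loss = focal_loss(logits, targets)
```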
Regression
MSE:
L = (1/n) ∑_{i=1}^n (y_i − ŷ_i)²
MAE / L1 Loss:
L = (1/n) ∑_{i=1}^n ∣y_i − ŷ_i∣
Huber Loss — smooth transition between MSE and MAE:
L_δ = ½(y−ŷ)² if ∣y−ŷ∣ ≤ δ,  otherwise δ∣y−ŷ∣ − ½δ²
Contrastive & Embedding Losses
Contrastive Loss (SimCLR / CLIP):
L = −log [ exp(sim(z_i, z_j)/τ) / ∑_{k≠i} exp(sim(z_i, z_k)/τ) ]
Where sim = cosine similarity, τ = temperature
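A simplified in-batch contrastive loss sketch (CLIP-style; SimCLR's NT-Xent differs in how the negatives are pooled). It assumes z1[i] and z2[i] are embeddings of two views of the same example:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.07):
    """In-batch contrastive loss: z1[i] should match z2[i] against all other z2."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / tau                 # cosine similarities / temperature
    labels = torch.arange(z1.size(0))        # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
loss = info_nce(z1, z2)
```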
Triplet Loss:
L = max(0, ‖f(a)−f(p)‖² − ‖f(a)−f(n)‖² + α)
- a = anchor, p = positive, n = negative, α = margin
6 · Optimization
Gradient Descent Variants
Vanilla GD:
θ_{t+1} = θ_t − η ∇L(θ_t)
SGD with Momentum:
v_t = γ v_{t−1} + η ∇L(θ_t),  θ_{t+1} = θ_t − v_t
Adam (the default optimizer for most deep learning):
m_t = β_1 m_{t−1} + (1−β_1) g_t   (1st moment)
v_t = β_2 v_{t−1} + (1−β_2) g_t²   (2nd moment)
m̂_t = m_t / (1−β_1^t),  v̂_t = v_t / (1−β_2^t)   (bias correction)
θ_{t+1} = θ_t − η m̂_t / (√v̂_t + ϵ)
Defaults: β_1 = 0.9, β_2 = 0.999, ϵ = 10⁻⁸
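A from-scratch sketch of a single Adam update for one parameter tensor, for reference only (in practice, use torch.optim.Adam / AdamW):

```python
import torch

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. Initialize m = v = torch.zeros_like(param) and start t at 1."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)                     # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (v_hat.sqrt() + eps)
    return param, m, v
```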
AdamW (Adam with decoupled weight decay — used in LLM training):
θ_{t+1} = θ_t − η ( m̂_t / (√v̂_t + ϵ) + λ θ_t )
Key difference from Adam: weight decay λθt is applied directly, not through the gradient.
Gradient Clipping
By norm (prevents exploding gradients — standard in LLM training):
ĝ = g if ‖g‖ ≤ c,  otherwise c · g / ‖g‖
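In PyTorch this is a single call between backward() and step(); a max norm around 1.0 is a common (illustrative) choice:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss = model(torch.randn(4, 10)).pow(2).mean()

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale g if its norm exceeds 1.0
optimizer.step()
```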
Learning Rate Schedules
Cosine Annealing (used in most modern LLM training):
η_t = η_min + ½(η_max − η_min)(1 + cos(πt / T))
Warmup + Cosine Decay (the LLM standard):
η_t = η_max · t / T_warmup if t < T_warmup,  cosine decay if t ≥ T_warmup
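A minimal LambdaLR-based sketch of linear warmup followed by cosine decay (the step counts and learning rate are illustrative):

```python
import math
import torch

def warmup_cosine(step, warmup_steps=1000, total_steps=10000):
    """Multiplier on the base LR: linear warmup, then cosine decay toward 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_cosine)
```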
Regularization
| Method | Effect | Formula |
|---|---|---|
| L2 / Weight Decay | Small weights | L + λ∑θ_i² |
| L1 | Sparse weights | L + λ∑∣θ_i∣ |
| Dropout | Random neuron masking | h = mask ⊙ f(x) / (1−p) |
Convexity
f convex ⟺ f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y)
- Convex → single global minimum (guaranteed convergence)
- Deep learning losses are non-convex → saddle points, local minima
7 · Normalization
Batch Normalization
x̂_i = (x_i − μ_B) / √(σ_B² + ϵ),  y_i = γ x̂_i + β
Normalizes across the batch dimension. Used in CNNs.
Layer Normalization
x̂_i = (x_i − μ_L) / √(σ_L² + ϵ),  y_i = γ x̂_i + β
Normalizes across the feature dimension. Used in original Transformers, BERT.
RMSNorm (Root Mean Square Normalization)
x̂_i = (x_i / RMS(x)) · γ,  RMS(x) = √((1/n) ∑_{i=1}^n x_i²)
No mean subtraction, no bias β. Faster than LayerNorm. Used in LLaMA, GPT-4, modern LLMs.
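A minimal RMSNorm module sketch matching the formula above (eps value is a common default):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))   # learned scale, no bias

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.gamma

x = torch.randn(2, 16, 512)
y = RMSNorm(512)(x)
```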
8 · Transformer & Attention Math
Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
- Q = XW_Q (queries), K = XW_K (keys), V = XW_V (values)
- d_k = key dimension. Scaling by √d_k prevents softmax saturation.
- Complexity: O(n²d) where n = sequence length
Why √d_k?
If q, k have components with variance 1, then q · k has variance d_k.
Dividing by √d_k restores unit variance → softmax gets reasonable gradients.
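A compact sketch of scaled dot-product attention with an optional additive mask; recent PyTorch versions also provide F.scaled_dot_product_attention as a fused equivalent:

```python
import math
import torch

def attention(Q, K, V, mask=None):
    """Q, K, V: (batch, heads, seq_len, d_k). mask: additive, 0 or -inf entries."""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # (B, H, n, n)
    if mask is not None:
        scores = scores + mask
    weights = torch.softmax(scores, dim=-1)
    return weights @ V
```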
Multi-Head Attention
MultiHead(Q, K, V) = Concat(head_1, …, head_h) W_O
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
- h heads, each with d_k = d_model / h
- Lets the model attend to different representation subspaces
Causal (Autoregressive) Masking
mask_ij = 0 if i ≥ j,  −∞ if i < j
Applied before softmax: softmax(QKᵀ/√d_k + mask) V
Prevents token i from attending to future tokens j>i. Used in GPT, LLaMA, all decoder models.
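A sketch of building the additive causal mask with torch.triu; it plugs directly into the attention function sketched above:

```python
import torch

def causal_mask(n):
    """(n, n) additive mask: -inf strictly above the diagonal, 0 on and below it."""
    mask = torch.full((n, n), float("-inf"))
    return torch.triu(mask, diagonal=1)   # keeps -inf above the diagonal, zeros the rest

print(causal_mask(4))
```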
Transformer Block
Input
→ RMSNorm / LayerNorm
→ Multi-Head Self-Attention + Residual
→ RMSNorm / LayerNorm
→ Feed-Forward Network + Residual
Output
Feed-Forward Network (FFN):
FFN(x) = W_2 · GELU(W_1 x + b_1) + b_2
- W_1: d_model → d_ff (typically d_ff = 4 × d_model)
- W_2: d_ff → d_model
SwiGLU FFN (used in LLaMA, PaLM, modern LLMs):
SwiGLU(x) = (SiLU(W_1 x) ⊙ W_3 x) W_2
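A minimal SwiGLU FFN module sketch; d_ff here is just an example (LLaMA-style models typically shrink the hidden size to keep the parameter count comparable to a 4×d GELU FFN):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)   # value projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

x = torch.randn(2, 16, 512)
y = SwiGLUFFN(512, 2048)(x)
```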
Residual Connection
output=x+Sublayer(x)
Enables gradient flow through deep networks (50+ layers in LLMs).
9 · LLM-Specific Math
Positional Encoding — Sinusoidal (Original Transformer)
PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Rotary Position Embedding (RoPE) — Used in LLaMA, GPT-NeoX
Rotates query and key vectors based on position:
f(x_m, m) = R_m x_m,  R_m = [[cos mθ, −sin mθ], [sin mθ, cos mθ]]
Key property: ⟨f(q,m),f(k,n)⟩ depends only on q, k, and relative position m−n.
Tokenization (BPE)
Byte-Pair Encoding greedily merges the most frequent adjacent token pair:
pair* = argmax_{(a,b)} count(a, b)
Repeat until vocabulary size reached. Subword tokenization balances vocabulary size vs sequence length.
Language Model Probability
P(x_1, …, x_T) = ∏_{t=1}^T P(x_t ∣ x_1, …, x_{t−1})
Training objective (minimize negative log-likelihood):
L = −(1/T) ∑_{t=1}^T log P(x_t ∣ x_{<t})
Temperature Scaling
P(x_i) = exp(z_i / τ) / ∑_j exp(z_j / τ)
- τ→0: greedy (picks highest logit)
- τ=1: standard sampling
- τ>1: more random / creative
Top-k and Top-p (Nucleus) Sampling
- Top-k: Sample from the k highest-probability tokens only
- Top-p: Sample from the smallest set of tokens whose cumulative probability satisfies ∑ P(x_i) ≥ p
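A sketch combining temperature, top-k, and top-p filtering on a single logits vector (the thresholds and vocabulary size are illustrative):

```python
import torch

def sample(logits, temperature=0.8, top_k=50, top_p=0.9):
    """Sample one token id from a (vocab_size,) logits vector."""
    logits = logits / temperature
    # Top-k: keep only the k largest logits
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")
    # Top-p: keep the smallest prefix of sorted tokens with cumulative prob >= p
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = cumulative > top_p
    cutoff[1:] = cutoff[:-1].clone()     # shift so the token that crosses p is kept
    cutoff[0] = False
    probs[sorted_idx[cutoff]] = 0.0
    probs = probs / probs.sum()
    return torch.multinomial(probs, num_samples=1).item()

next_id = sample(torch.randn(32000))
```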
Scaling Laws (Chinchilla)
L(N, D) ≈ A·N^(−α) + B·D^(−β) + L_∞
- N = number of parameters, D = number of training tokens
- Chinchilla-optimal: D≈20×N (train on 20 tokens per parameter)
LoRA (Low-Rank Adaptation)
W′ = W_0 + ΔW = W_0 + BA
- W_0 ∈ R^(d×d) (frozen pretrained weights)
- B ∈ R^(d×r), A ∈ R^(r×d) where r ≪ d
- Only B and A are trained. Parameters: 2dr instead of d²
- Example: d=4096, r=16 → 99.2% fewer trainable parameters
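A minimal LoRA linear-layer sketch: the pretrained weight is frozen and only the low-rank A and B train. The alpha/r scaling and the zero-init of B follow the original paper; the hyperparameter values are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=16, alpha=32):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)            # frozen pretrained weight W0
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))      # zero init => starts as a no-op update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(4096, 4096, r=16)
y = layer(torch.randn(2, 4096))
```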
Quantization
Uniform quantization (maps float → int):
x_q = round(x / s) + z,  x_dequant = s · (x_q − z)
- s = scale factor, z = zero point
- INT8: 4× memory reduction, INT4: 8× memory reduction
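A toy sketch of asymmetric uniform INT8 quantization of a single tensor. This uses one per-tensor scale and zero point; real quantization libraries typically work per-channel or per-group.

```python
import torch

def quantize_int8(x):
    """Map a float tensor to int8 plus (scale, zero_point) for dequantization."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = qmin - torch.round(x.min() / scale)
    x_q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
    return x_q, scale, zero_point

def dequantize(x_q, scale, zero_point):
    return scale * (x_q.float() - zero_point)

x = torch.randn(1000)
x_q, s, z = quantize_int8(x)
max_error = (x - dequantize(x_q, s, z)).abs().max()   # bounded by ~scale/2
```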
KV Cache
During autoregressive generation, cache Key and Value matrices to avoid recomputation:
- Without cache: O(n²) per token
- With cache: O(n) per new token
- Memory: 2 × n_layers × n_heads × d_head × seq_len × precision (bytes per value)
10 · Information Theory
| Concept | Formula | Used In |
|---|---|---|
| Entropy | H(X) = −∑ p(x) log p(x) | Uncertainty measure, decision trees |
| Cross-Entropy | H(p, q) = −∑ p(x) log q(x) | THE classification loss |
| KL Divergence | D_KL(p ∥ q) = ∑ p(x) log(p(x)/q(x)) | VAE loss, knowledge distillation |
| Perplexity | PPL = exp(−(1/T) ∑ log P(x_t ∣ x_{<t})) | LLM evaluation metric |
| Mutual Info | I(X; Y) = H(X) − H(X∣Y) | Feature selection, info bottleneck |
Cross-entropy = Entropy + KL Divergence: H(p, q) = H(p) + D_KL(p ∥ q)
Minimizing cross-entropy loss = minimizing KL divergence from true distribution.
11 · Embeddings & Similarity
Embedding Lookup
embed(x) = E[x, :],  E ∈ R^(∣V∣ × d)
One-hot × embedding matrix = row lookup. ∣V∣ = vocab size, d = embedding dim.
Similarity Metrics
| Metric | Formula | Range | Used In |
|---|---|---|---|
| Cosine Similarity | (a · b) / (‖a‖ ‖b‖) | [−1, 1] | RAG, semantic search, CLIP |
| Dot Product | a · b | (−∞, ∞) | Attention scores |
| Euclidean (L2) | ‖a − b‖_2 | [0, ∞) | k-NN, clustering |
Approximate Nearest Neighbor
For retrieval at scale (millions of vectors):
- HNSW: Hierarchical Navigable Small World graphs
- IVF: Inverted File Index — cluster then search within clusters
- PQ: Product Quantization — compress vectors, approximate distance
12 · Generative Models
VAE (Variational Autoencoder)
ELBO (Evidence Lower Bound):
log p(x) ≥ E_{q(z∣x)}[log p(x∣z)] − D_KL(q(z∣x) ∥ p(z))
- First term: reconstruction quality
- Second term: regularize latent space toward prior p(z)=N(0,I)
Reparameterization Trick (enables backprop through sampling):
z = μ + σ ⊙ ϵ,  ϵ ∼ N(0, I)
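A sketch of the reparameterization step inside a VAE forward pass; having the encoder predict log-variance (rather than σ directly) is a common convention assumed here:

```python
import torch

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, sigma^2) while keeping gradients flowing to mu and logvar."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)          # the noise carries no gradient
    return mu + std * eps

mu, logvar = torch.zeros(8, 32), torch.zeros(8, 32)
z = reparameterize(mu, logvar)

# KL term of the ELBO for a diagonal Gaussian posterior vs N(0, I):
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
```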
GAN (Generative Adversarial Network)
min_G max_D  E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))]
- Optimal discriminator: D*(x) = p_data(x) / (p_data(x) + p_g(x))
- At equilibrium: generator distribution = data distribution
Diffusion Models (DDPM)
Forward process — add Gaussian noise over T steps:
q(x_t ∣ x_{t−1}) = N(x_t; √(1−β_t) x_{t−1}, β_t I)
Direct jump to any timestep:
q(x_t ∣ x_0) = N(x_t; √ᾱ_t x_0, (1−ᾱ_t) I),  where ᾱ_t = ∏_{s=1}^t (1−β_s)
Reverse process — neural network learns to denoise:
p_θ(x_{t−1} ∣ x_t) = N(x_{t−1}; μ_θ(x_t, t), σ_t² I)
Training loss — predict the noise:
L = E_{t, x_0, ϵ}[ ‖ϵ − ϵ_θ(x_t, t)‖² ]
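A sketch of one DDPM training step under the noise-prediction loss; the model, the alpha_bar schedule, and the shapes are illustrative placeholders:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, alpha_bar):
    """Sample a timestep, noise x0 to x_t, and regress the model output onto the noise."""
    B = x0.size(0)
    t = torch.randint(0, alpha_bar.size(0), (B,))                  # random timestep per example
    a = alpha_bar[t].view(B, *([1] * (x0.dim() - 1)))              # broadcast over data dims
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps                     # closed-form forward jump
    return F.mse_loss(model(x_t, t), eps)
```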
13 · Alignment & RLHF
Reward Modeling
Train reward model rϕ on human preferences:
P(response_w ≻ response_l) = σ(r_ϕ(x, y_w) − r_ϕ(x, y_l))
Bradley-Terry model: probability that response w is preferred over l.
PPO Objective (Proximal Policy Optimization)
L_PPO = E_t[ min(r_t(θ) A_t, clip(r_t(θ), 1−ϵ, 1+ϵ) A_t) ]
r_t(θ) = π_θ(a_t ∣ s_t) / π_θold(a_t ∣ s_t)
With KL penalty to stay close to base model:
objective = E[r(x, y)] − β · D_KL(π_θ ∥ π_ref)
DPO (Direct Preference Optimization)
Skips the reward model entirely:
L_DPO = −E[ log σ( β log(π_θ(y_w∣x) / π_ref(y_w∣x)) − β log(π_θ(y_l∣x) / π_ref(y_l∣x)) ) ]
- Simpler than PPO pipeline (no reward model, no RL loop)
- Optimizes the same KL-constrained objective as PPO-based RLHF, assuming a Bradley-Terry preference model
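A sketch of the DPO loss given per-sequence log-probabilities from the policy and the frozen reference model (the function and argument names are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Inputs are summed log-probs of the chosen (w) and rejected (l) responses."""
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()

loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
```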
SFT (Supervised Fine-Tuning)
Standard next-token prediction on instruction-response pairs:
L_SFT = −∑_{t=1}^T log P_θ(y_t ∣ x, y_{<t})
14 · Backpropagation
For layer l with z^l = W^l a^(l−1) + b^l and a^l = σ(z^l):
δ^l = ((W^(l+1))ᵀ δ^(l+1)) ⊙ σ′(z^l)
∂L/∂W^l = δ^l (a^(l−1))ᵀ,  ∂L/∂b^l = δ^l
Vanishing / Exploding Gradients
After n layers: ∂L/∂W^1 ∝ ∏_{l=1}^n W^l · σ′(z^l)
| Problem | Cause | Solution |
|---|---|---|
| Vanishing | ∥W⋅σ′∥<1 repeated | Residual connections, LSTM gates, ReLU |
| Exploding | ∥W⋅σ′∥>1 repeated | Gradient clipping, proper initialization |
Weight Initialization
| Method | Variance | Best For |
|---|---|---|
| Xavier / Glorot | 2 / (n_in + n_out) | Sigmoid, Tanh |
| He / Kaiming | 2 / n_in | ReLU and variants |
15 · Model-Specific Math
CNN — Convolution
(f ∗ g)(t) = ∑_τ f(τ) · g(t−τ)
Output size: ⌊(n + 2p − k) / s⌋ + 1
- n = input size, k = kernel size, p = padding, s = stride
RNN / LSTM
RNN:
h_t = tanh(W_h h_{t−1} + W_x x_t + b)
LSTM gates:
f_t = σ(W_f [h_{t−1}, x_t] + b_f)   (forget gate)
i_t = σ(W_i [h_{t−1}, x_t] + b_i)   (input gate)
c̃_t = tanh(W_c [h_{t−1}, x_t] + b_c)   (candidate)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t   (cell state)
o_t = σ(W_o [h_{t−1}, x_t] + b_o)   (output gate)
h_t = o_t ⊙ tanh(c_t)   (hidden state)
GNN (Graph Convolutional Network)
H^(l+1) = σ(D̂^(−1/2) Â D̂^(−1/2) H^(l) W^(l))
Where Â = A + I (adjacency + self-loops), D̂ = degree matrix of Â.
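A dense-adjacency sketch of one GCN layer matching the propagation rule above (illustrative only; real GNN libraries use sparse operations):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.linear = nn.Linear(d_in, d_out, bias=False)

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0))                      # add self-loops
        D_inv_sqrt = torch.diag(A_hat.sum(dim=1).pow(-0.5))   # D^(-1/2)
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
        return torch.relu(A_norm @ self.linear(H))

A = (torch.rand(5, 5) > 0.5).float()
A = ((A + A.T) > 0).float()            # make the adjacency symmetric
H = torch.randn(5, 16)
out = GCNLayer(16, 32)(H, A)
```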
16 · PyTorch Quick Reference
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tensors (example shapes)
batch, seq_len, d_model = 32, 128, 512
x = torch.randn(batch, seq_len, d_model)
x.shape, x.dtype, x.device

# Linear algebra
A, B = torch.randn(4, 4), torch.randn(4, 4)
A @ B
torch.linalg.svd(A)
torch.linalg.eig(A)
torch.linalg.norm(x, dim=-1)

# Autograd
x = torch.tensor([2.0], requires_grad=True)
y = x**2 + 3*x
y.backward()
x.grad   # dy/dx = 2x + 3 = 7

# Common layers
vocab_size, num_heads, d_in, d_out = 32000, 8, d_model, d_model
nn.Linear(d_in, d_out)
nn.Embedding(vocab_size, d_model)
nn.MultiheadAttention(d_model, num_heads)
nn.LayerNorm(d_model)
nn.Dropout(p=0.1)

# Losses and similarity
nn.CrossEntropyLoss()
nn.MSELoss()
nn.BCEWithLogitsLoss()
a, b = torch.randn(8, 64), torch.randn(8, 64)
F.cosine_similarity(a, b)

# Optimizers and LR schedules
model = nn.Linear(d_in, d_out)
torch.optim.Adam(model.parameters(), lr=1e-4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)
torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=1e-3, total_steps=10000)
17 · Numbers to Know
| Metric | Value | Context |
|---|---|---|
| GPT-3 parameters | 175B | d_model = 12288, 96 layers, 96 heads |
| LLaMA-2 70B | 70B | d_model = 8192, 80 layers, 64 heads |
| Typical batch size (LLM) | 1-4M tokens | Per gradient step |
| Typical learning rate (LLM) | 1e-4 to 3e-4 | With cosine decay |
| Chinchilla-optimal tokens | 20×N | For N parameters |
| Float16 memory per param | 2 bytes | 70B model ≈ 140 GB |
| INT4 memory per param | 0.5 bytes | 70B model ≈ 35 GB |
| Attention FLOPs | 2n²d | n = seq len, d = dim |
| FFN FLOPs | 16nd² | Dominates for short sequences |
This cheatsheet covers Ch.01–25 of the Math for AI/ML/LLM curriculum.
See also: Notation Guide · ML Math Map · Interview Prep