NOTATION GUIDE

Authority: This document is the single source of truth for all mathematical notation used throughout this repository. Every section, notebook, and exercise MUST conform to these conventions. When conventions conflict with intuition, this guide wins.


Design Principles

This guide follows the conventions of Goodfellow, Bengio & Courville (2016) Deep Learning, Horn & Johnson (2013) Matrix Analysis, and the notation standards of NeurIPS/ICML/ICLR. Where conventions conflict, the choice that minimises ambiguity in an ML context is adopted.

Three rules that override everything else:

  1. A symbol's meaning must be determinable from context within three lines of its first use.
  2. The same symbol never denotes two different objects in the same section.
  3. Prefer established ML-paper conventions over personal preference.

1. Object Hierarchy

1.1 Type-to-Notation Map

| Object | Notation style | LaTeX command | Example |
|---|---|---|---|
| Scalar | Italic lowercase | `a`, `\alpha` | $x \in \mathbb{R}$ |
| Vector | Bold lowercase | `\mathbf{x}` | $\mathbf{x} \in \mathbb{R}^n$ |
| Matrix | Uppercase italic | `A`, `W` | $A \in \mathbb{R}^{m \times n}$ |
| Tensor (order $\ge 3$) | Calligraphic | `\mathcal{X}` | $\mathcal{X} \in \mathbb{R}^{d_1 \times \cdots \times d_k}$ |
| Random variable | Uppercase italic | `X`, `Z` | $X \sim \mathcal{N}(0,1)$ |
| Set / space | Calligraphic | `\mathcal{D}`, `\mathcal{H}` | $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}$ |
| Number field | Blackboard bold | `\mathbb{R}`, `\mathbb{C}` | $\mathbb{R}^n$, $\mathbb{C}^{m \times n}$ |
| Distribution | Calligraphic | `\mathcal{N}`, `\mathcal{U}` | $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ |

Do: `\mathbf{x}` for vectors. Do not: `\vec{x}` (physics convention), `\underline{x}` (typewriter convention), or plain $x$ when the object is a vector. All three are errors in this codebase.
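
To see the conventions working together, here is a minimal illustrative LaTeX fragment (not taken from any section file):

```latex
% Illustrative only: one object of each type, typeset per the table above.
Let $x \in \mathbb{R}$ be a scalar, $\mathbf{x} \in \mathbb{R}^n$ a vector,
and $W \in \mathbb{R}^{m \times n}$ a weight matrix. A third-order tensor is
$\mathcal{X} \in \mathbb{R}^{d_1 \times d_2 \times d_3}$, a random variable
$X \sim \mathcal{N}(0, 1)$, and a dataset
$\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$.
```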

1.2 Indexing

| What | Notation | LaTeX | Meaning |
|---|---|---|---|
| Vector component | $x_i$ | `x_i` | $i$-th scalar, 1-indexed by default |
| Matrix entry | $A_{ij}$ | `A_{ij}` | row $i$, column $j$ |
| Row $i$ | $A_{i,:}$ | `A_{i,:}` | as a row vector |
| Column $j$ | $A_{:,j}$ | `A_{:,j}` | as a column vector |
| Data sample | $\mathbf{x}^{(i)}$ | `\mathbf{x}^{(i)}` | parentheses distinguish from power |
| Layer index | $\mathbf{h}^{[l]}$ | `\mathbf{h}^{[l]}` | square brackets distinguish from data sample |
| Time step | $\mathbf{h}_t$ | `\mathbf{h}_t` | subscript, used in sequences |

Convention: superscript $(i)$ = data sample, superscript $[l]$ = network layer, subscript $t$ = time step. These three scopes must never share the same superscript/subscript style.
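
The mapping from this notation to array code is mechanical except for the index base: the math is 1-indexed, NumPy is 0-indexed. A sketch (variable names are illustrative):

```python
import numpy as np

A = np.arange(12.0).reshape(3, 4)   # A in R^{3x4}

a_23 = A[1, 2]      # math entry A_{23} (1-indexed) -> code A[1, 2] (0-indexed)
row_2 = A[1, :]     # A_{2,:}, shape (4,)
col_3 = A[:, 2]     # A_{:,3}, shape (3,)

X = np.arange(20.0).reshape(5, 4)   # toy design matrix: 5 samples, 4 features
x_1 = X[0]                          # data sample x^{(1)}
```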


2. Number Fields and Spaces

| Symbol | LaTeX | Meaning |
|---|---|---|
| $\mathbb{N}$ | `\mathbb{N}` | $\{0, 1, 2, \ldots\}$ (ISO 80000-2; 0 is included) |
| $\mathbb{Z}$ | `\mathbb{Z}` | All integers $\{\ldots, -1, 0, 1, \ldots\}$ |
| $\mathbb{R}$ | `\mathbb{R}` | Real line |
| $\mathbb{R}^n$ | `\mathbb{R}^n` | Euclidean $n$-space (column vectors by default) |
| $\mathbb{R}^{m \times n}$ | `\mathbb{R}^{m \times n}` | Real $m \times n$ matrices |
| $\mathbb{C}$ | `\mathbb{C}` | Complex numbers |
| $\mathbb{S}^n$ | `\mathbb{S}^n` | $n \times n$ real symmetric matrices |
| $\mathbb{S}^n_+$ | `\mathbb{S}^n_+` | Positive semidefinite (PSD) cone |
| $\mathbb{S}^n_{++}$ | `\mathbb{S}^n_{++}` | Strictly positive definite matrices |
| $[n]$ | `[n]` | Index set $\{1, 2, \ldots, n\}$ |
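
A minimal sketch, assuming NumPy, of how membership in $\mathbb{S}^n_+$ and $\mathbb{S}^n_{++}$ can be checked numerically (the tolerance is an arbitrary choice):

```python
import numpy as np

def is_symmetric(A, tol=1e-10):
    return np.allclose(A, A.T, atol=tol)

def in_psd_cone(A, tol=1e-10):
    """A in S^n_+ : symmetric with all eigenvalues >= 0."""
    return is_symmetric(A) and np.all(np.linalg.eigvalsh(A) >= -tol)

def is_positive_definite(A, tol=1e-10):
    """A in S^n_{++} : symmetric with all eigenvalues > 0."""
    return is_symmetric(A) and np.all(np.linalg.eigvalsh(A) > tol)

B = np.array([[2.0, 1.0], [1.0, 2.0]])   # eigenvalues 1 and 3
assert in_psd_cone(B) and is_positive_definite(B)
```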

3. Linear Algebra

3.1 Vectors

| Symbol | LaTeX | Meaning |
|---|---|---|
| $\mathbf{x}$ | `\mathbf{x}` | Column vector (default orientation) |
| $\mathbf{x}^\top$ | `\mathbf{x}^\top` | Transpose (use `^\top`, never `^T`, in display math) |
| $\mathbf{e}_i$ | `\mathbf{e}_i` | $i$-th standard basis vector |
| $\mathbf{0}$ | `\mathbf{0}` | Zero vector |
| $\mathbf{1}$ | `\mathbf{1}` | All-ones vector |
| $\langle \mathbf{x}, \mathbf{y} \rangle$ | `\langle \mathbf{x}, \mathbf{y} \rangle` | Inner product (preferred over dot) |
| $\mathbf{x} \odot \mathbf{y}$ | `\mathbf{x} \odot \mathbf{y}` | Hadamard (element-wise) product |

Transpose: Use `^\top` in all display equations. Reserve `^T` for inline code comments only. Rationale: in `^T` the $T$ reads as a variable, so $x^T$ is ambiguous between a transpose and $x$ raised to the power $T$.
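
For reference, a sketch of the vector operations in NumPy (illustrative values):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

inner = x @ y           # <x, y> = 32.0
hadamard = x * y        # x (Hadamard) y = [4., 10., 18.]
e_2 = np.eye(3)[1]      # standard basis vector e_2 (0-indexed row 1)
ones = np.ones(3)       # all-ones vector

# Note: NumPy 1-D arrays carry no orientation (x.T == x); the column-vector
# convention lives in the math, not in the array.
```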

3.2 Matrices

| Symbol | LaTeX | Meaning |
|---|---|---|
| $I_n$ | `I_n` | $n \times n$ identity (subscript when size matters) |
| $A^\top$ | `A^\top` | Matrix transpose |
| $A^{-1}$ | `A^{-1}` | Matrix inverse (only for square, non-singular $A$) |
| $A^\dagger$ | `A^\dagger` | Moore-Penrose pseudoinverse |
| $A^*$ | `A^*` | Conjugate transpose (Hermitian adjoint) |
| $\operatorname{tr}(A)$ | `\operatorname{tr}(A)` | Trace (use `\operatorname`, not `\text`) |
| $\det(A)$ | `\det(A)` | Determinant |
| $\operatorname{rank}(A)$ | `\operatorname{rank}(A)` | Rank |
| $\operatorname{diag}(\mathbf{x})$ | `\operatorname{diag}(\mathbf{x})` | Diagonal matrix from vector |
| $A \otimes B$ | `A \otimes B` | Kronecker product |
| $A \succ 0$ | `A \succ 0` | Positive definite |
| $A \succeq 0$ | `A \succeq 0` | Positive semidefinite |
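
The operations above map directly onto NumPy; a sketch with an illustrative matrix:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

t = np.trace(A)                    # tr(A) = 7.0
d = np.linalg.det(A)               # det(A) = 11.0
r = np.linalg.matrix_rank(A)       # rank(A) = 2
A_inv = np.linalg.inv(A)           # A^{-1} (square, non-singular only)
A_dag = np.linalg.pinv(A)          # Moore-Penrose pseudoinverse (any shape)
D = np.diag(np.array([1.0, 2.0]))  # diag(x)
K = np.kron(A, np.eye(2))          # Kronecker product A x I_2

# A > 0 in practice: np.linalg.cholesky succeeds iff A is symmetric PD
L = np.linalg.cholesky(A)
assert np.allclose(L @ L.T, A)
```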

3.3 Norms

| Symbol | LaTeX | Definition | Use case |
|---|---|---|---|
| $\lVert \mathbf{x} \rVert_2$ | `\lVert \mathbf{x} \rVert_2` | $\sqrt{\sum_i x_i^2}$ | Default vector norm |
| $\lVert \mathbf{x} \rVert_1$ | `\lVert \mathbf{x} \rVert_1` | $\sum_i \lvert x_i \rvert$ | Sparsity, L1 regularisation |
| $\lVert \mathbf{x} \rVert_p$ | `\lVert \mathbf{x} \rVert_p` | $(\sum_i \lvert x_i \rvert^p)^{1/p}$ | General $\ell^p$ norm |
| $\lVert \mathbf{x} \rVert_\infty$ | `\lVert \mathbf{x} \rVert_\infty` | $\max_i \lvert x_i \rvert$ | Chebyshev / max norm |
| $\lVert A \rVert_F$ | `\lVert A \rVert_F` | $\sqrt{\sum_{i,j} A_{ij}^2}$ | Frobenius norm |
| $\lVert A \rVert_2$ | `\lVert A \rVert_2` | $\sigma_{\max}(A)$ | Spectral norm (operator norm) |
| $\lVert A \rVert_*$ | `\lVert A \rVert_*` | $\sum_i \sigma_i(A)$ | Nuclear norm (trace norm) |

Norm delimiters: Always use \lVert \rVert for norms and \lvert \rvert for absolute values — never bare \| or |. This ensures correct spacing in all renderers.
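
All of the norms in the table are available through `np.linalg.norm`; a sketch:

```python
import numpy as np

x = np.array([3.0, -4.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])

n2 = np.linalg.norm(x)            # ||x||_2 = 5.0 (default)
n1 = np.linalg.norm(x, 1)         # ||x||_1 = 7.0
ninf = np.linalg.norm(x, np.inf)  # ||x||_inf = 4.0
nF = np.linalg.norm(A, 'fro')     # Frobenius norm
nspec = np.linalg.norm(A, 2)      # spectral norm = sigma_max(A)
nnuc = np.linalg.norm(A, 'nuc')   # nuclear norm = sum of singular values
```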

3.4 Decompositions

| Name | Canonical form | Conditions |
|---|---|---|
| Eigendecomposition | $A = Q \Lambda Q^{-1}$ | $A$ diagonalisable |
| Spectral theorem | $A = Q \Lambda Q^\top$ | $A$ real symmetric |
| SVD | $A = U \Sigma V^\top$ | $A \in \mathbb{R}^{m \times n}$; always exists |
| QR | $A = QR$ | $Q$ orthogonal, $R$ upper triangular |
| Cholesky | $A = LL^\top$ | $A \succ 0$ |
| LU | $A = LU$ | Square, non-singular (pivoting may be required: $PA = LU$) |

SVD convention: URm×mU \in \mathbb{R}^{m \times m}, ΣRm×n\Sigma \in \mathbb{R}^{m \times n}, VRn×nV \in \mathbb{R}^{n \times n}. Singular values σ1σ20\sigma_1 \ge \sigma_2 \ge \cdots \ge 0 are always in non-increasing order. Left singular vectors = columns of UU; right singular vectors = columns of VV.
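
A sketch, assuming NumPy, that checks the stated shapes and ordering (note that `np.linalg.svd` returns $V^\top$, not $V$):

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(5, 3))   # A in R^{5x3}
U, s, Vt = np.linalg.svd(A)                        # full SVD; Vt is V^T
assert U.shape == (5, 5) and Vt.shape == (3, 3)
assert np.all(s[:-1] >= s[1:])                     # sigma_1 >= sigma_2 >= ...

# NumPy returns the singular values as a 1-D array; rebuild Sigma in R^{5x3}:
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)
assert np.allclose(U @ Sigma @ Vt, A)
```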


4. Calculus and Optimisation

4.1 Derivatives

| Symbol | LaTeX | Meaning |
|---|---|---|
| $\frac{df}{dx}$ | `\frac{df}{dx}` | Derivative of scalar $f$ w.r.t. scalar $x$ |
| $f'(x)$ | `f'(x)` | Prime notation (univariate functions only) |
| $\frac{\partial f}{\partial x_i}$ | `\frac{\partial f}{\partial x_i}` | Partial derivative |
| $\nabla_{\mathbf{x}} f$ | `\nabla_{\mathbf{x}} f` | Gradient: $\mathbf{g} \in \mathbb{R}^n$ with $g_i = \partial f/\partial x_i$ |
| $J_f(\mathbf{x})$ | `J_f(\mathbf{x})` | Jacobian of $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$; $J \in \mathbb{R}^{m \times n}$ |
| $H_f(\mathbf{x})$ | `H_f(\mathbf{x})` | Hessian: $(H)_{ij} = \partial^2 f / \partial x_i \partial x_j$ |
| $\frac{\partial \mathcal{L}}{\partial \theta}$ | `\frac{\partial \mathcal{L}}{\partial \theta}` | Gradient of loss w.r.t. parameters (standard ML form) |

Gradient orientation: fRn\nabla f \in \mathbb{R}^n is always a column vector. This is the convention of Nocedal & Wright (2006) and all major ML frameworks. The gradient points in the direction of steepest ascent.
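
A common way to validate an analytic gradient under this convention is a central-difference check; a sketch for $f(\mathbf{x}) = \mathbf{x}^\top A \mathbf{x}$, whose gradient is $(A + A^\top)\mathbf{x}$:

```python
import numpy as np

A = np.array([[2.0, 1.0], [0.0, 3.0]])
f = lambda x: x @ A @ x
grad_f = lambda x: (A + A.T) @ x              # analytic gradient

def numerical_grad(f, x, h=1e-6):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)   # central difference
    return g

x = np.array([1.0, -2.0])
assert np.allclose(grad_f(x), numerical_grad(f, x), atol=1e-5)
```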

4.2 Optimisation Symbols

| Symbol | LaTeX | Meaning |
|---|---|---|
| $\arg\min_{\mathbf{x}} f(\mathbf{x})$ | `\arg\min_{\mathbf{x}} f(\mathbf{x})` | Minimiser of $f$ |
| $\arg\max_{\mathbf{x}} f(\mathbf{x})$ | `\arg\max_{\mathbf{x}} f(\mathbf{x})` | Maximiser of $f$ |
| $f^*$ | `f^*` | Optimal (minimum) value |
| $\mathbf{x}^*$ | `\mathbf{x}^*` | Optimal solution |
| $\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda})$ | `\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda})` | Lagrangian |
| $\boldsymbol{\lambda}$ | `\boldsymbol{\lambda}` | Lagrange multipliers (vector; use bold) |
| $\text{s.t.}$ | `\text{s.t.}` | Subject to |
| $\eta$ | `\eta` | Learning rate |
| $\theta_t$ | `\theta_t` | Parameters at step $t$ |
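
Combining $\eta$ and $\theta_t$ gives the canonical update $\theta_{t+1} = \theta_t - \eta \nabla_{\theta} \mathcal{L}(\theta_t)$; a minimal sketch on a least-squares loss (step size and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

theta = np.zeros(3)                         # theta_0
eta = 0.1                                   # learning rate eta
for t in range(500):
    grad = X.T @ (X @ theta - y) / len(y)   # gradient of (1/2n)||X theta - y||^2
    theta = theta - eta * grad              # theta_{t+1} = theta_t - eta * grad

assert np.allclose(theta, w_true, atol=1e-3)
```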

5. Probability and Statistics

5.1 Core Symbols

| Symbol | LaTeX | Meaning |
|---|---|---|
| $P(A)$ | `P(A)` | Probability of event $A$ |
| $p(\mathbf{x})$ | `p(\mathbf{x})` | Probability density / mass function |
| $P(A \mid B)$ | `P(A \mid B)` | Conditional probability (use `\mid`, not a bare pipe) |
| $X \sim p$ | `X \sim p` | $X$ is distributed according to $p$ |
| $X \perp\!\!\!\perp Y$ | `X \perp\!\!\!\perp Y` | Statistical independence |
| $\mathbb{E}[X]$ | `\mathbb{E}[X]` | Expectation (blackboard bold `\mathbb{E}`) |
| $\mathbb{E}_{\mathbf{x} \sim p}[f(\mathbf{x})]$ | `\mathbb{E}_{\mathbf{x} \sim p}[f(\mathbf{x})]` | Expectation specifying the distribution |
| $\operatorname{Var}(X)$ | `\operatorname{Var}(X)` | Variance |
| $\operatorname{Cov}(X, Y)$ | `\operatorname{Cov}(X, Y)` | Covariance |
| $\Sigma$ | `\Sigma` | Covariance matrix (capital sigma, not `\sum`) |

Conditional probability delimiter: Always P(A \mid B) with \mid. Never P(A|B) — the bare pipe | has incorrect spacing in LaTeX renderers.
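
In code, $\mathbb{E}_{\mathbf{x} \sim p}[f(\mathbf{x})]$ is typically a Monte Carlo average over samples drawn from $p$; a sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=100_000)   # x ~ N(0, 1)

# E[X^2] = Var(X) + E[X]^2 = 1 for a standard normal
estimate = np.mean(samples**2)
assert abs(estimate - 1.0) < 0.05
```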

5.2 Standard Distributions

| Distribution | Notation | LaTeX |
|---|---|---|
| Normal | $\mathcal{N}(\mu, \sigma^2)$ | `\mathcal{N}(\mu, \sigma^2)` |
| Multivariate normal | $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ | `\mathcal{N}(\boldsymbol{\mu}, \Sigma)` |
| Uniform | $\mathcal{U}(a, b)$ | `\mathcal{U}(a, b)` |
| Bernoulli | $\operatorname{Bern}(p)$ | `\operatorname{Bern}(p)` |
| Categorical | $\operatorname{Cat}(\mathbf{p})$ | `\operatorname{Cat}(\mathbf{p})` |
| Dirichlet | $\operatorname{Dir}(\boldsymbol{\alpha})$ | `\operatorname{Dir}(\boldsymbol{\alpha})` |

Covariance matrix: The second argument of N(μ,Σ)\mathcal{N}(\boldsymbol{\mu}, \Sigma) is the covariance matrix, not the precision matrix. When using precision, write N(μ,Λ1)\mathcal{N}(\boldsymbol{\mu}, \Lambda^{-1}) explicitly.
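
The same distinction matters in code: samplers expect the covariance matrix, so a precision matrix must be inverted first. A sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])     # covariance, not precision

x = rng.multivariate_normal(mu, Sigma, size=50_000)
assert np.allclose(np.cov(x.T), Sigma, atol=0.05)

Lambda = np.linalg.inv(Sigma)                  # precision matrix
# To sample from N(mu, Lambda^{-1}), pass np.linalg.inv(Lambda) as covariance.
```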


6. Information Theory

| Symbol | LaTeX | Definition | Notes |
|---|---|---|---|
| $H(X)$ | `H(X)` | $-\sum_x p(x) \log p(x)$ | Entropy; nats if $\log = \ln$, bits if $\log = \log_2$ |
| $H(X \mid Y)$ | `H(X \mid Y)` | $H(X,Y) - H(Y)$ | Conditional entropy |
| $D_{\mathrm{KL}}(p \| q)$ | `D_{\mathrm{KL}}(p \| q)` | $\sum_x p(x) \log \frac{p(x)}{q(x)}$ | Non-negative; zero iff $p = q$ |
| $H(p, q)$ | `H(p, q)` | $-\sum_x p(x) \log q(x)$ | Cross-entropy; $H(p,q) = H(p) + D_{\mathrm{KL}}(p \| q)$ |
| $I(X; Y)$ | `I(X; Y)` | $H(X) - H(X \mid Y)$ | Mutual information; symmetric |
| $\operatorname{PPL}$ | `\operatorname{PPL}` | $\exp\!\left(-\frac{1}{T}\sum_{t=1}^T \log p(x_t \mid x_{<t})\right)$ | Perplexity; LLM evaluation |

KL divergence notation: Write $D_{\mathrm{KL}}(p \| q)$ with `\mathrm{KL}` and the double bar `\|`. The expectation is taken under $p$, so $p$ plays the role of the true (data) distribution and $q$ the approximation. This follows Kullback & Leibler (1951) and Cover & Thomas (2006).
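
The identities in the table are straightforward to verify numerically for discrete distributions; a sketch using the natural log (so all quantities are in nats):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

H_p = -np.sum(p * np.log(p))             # entropy H(p)
kl_pq = np.sum(p * np.log(p / q))        # D_KL(p || q)
cross = -np.sum(p * np.log(q))           # cross-entropy H(p, q)

assert kl_pq >= 0
assert np.isclose(cross, H_p + kl_pq)    # H(p,q) = H(p) + D_KL(p||q)

# Perplexity from per-token log-probabilities (toy values):
logp = np.log(np.array([0.25, 0.1, 0.5]))
ppl = np.exp(-np.mean(logp))
```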


7. Machine Learning Specifics

7.1 Data and Model Parameters

| Symbol | LaTeX | Meaning |
|---|---|---|
| $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^n$ | `\mathcal{D}` | Dataset of $n$ examples |
| $n$ | `n` | Number of training samples |
| $d$ | `d` | Input dimension (number of features) |
| $\boldsymbol{\theta}$ | `\boldsymbol{\theta}` | All trainable parameters (vector or set) |
| $\mathbf{W}^{[l]}$ | `\mathbf{W}^{[l]}` | Weight matrix at layer $l$ |
| $\mathbf{b}^{[l]}$ | `\mathbf{b}^{[l]}` | Bias vector at layer $l$ |
| $\hat{y}$ | `\hat{y}` | Predicted output |
| $\mathcal{L}(\boldsymbol{\theta})$ | `\mathcal{L}(\boldsymbol{\theta})` | Loss function |
| $f_{\boldsymbol{\theta}}(\mathbf{x})$ | `f_{\boldsymbol{\theta}}(\mathbf{x})` | Model with parameters $\boldsymbol{\theta}$ |

7.2 Neural Network Layers

| Symbol | Meaning |
|---|---|
| $\mathbf{z}^{[l]} = \mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}$ | Pre-activation at layer $l$ |
| $\mathbf{a}^{[l]} = \sigma(\mathbf{z}^{[l]})$ | Post-activation at layer $l$ |
| $\sigma$ | Generic activation function |
| $\delta^{[l]}$ | Error signal (backprop gradient) at layer $l$ |
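
One step of the layer recursion, as a sketch (shapes follow the column-vector convention; the sigmoid stands in for the generic $\sigma$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
a_prev = rng.normal(size=(4, 1))   # a^{[l-1]} in R^{4x1}
W = rng.normal(size=(3, 4))        # W^{[l]} in R^{3x4}
b = np.zeros((3, 1))               # b^{[l]} in R^{3x1}

z = W @ a_prev + b                 # pre-activation z^{[l]}
a = sigmoid(z)                     # post-activation a^{[l]} = sigma(z^{[l]})
```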

7.3 Transformer-Specific

| Symbol | Meaning |
|---|---|
| $Q, K, V$ | Query, key, value matrices |
| $d_k$ | Key/query dimension (scaling: $\sqrt{d_k}$) |
| $d_{\mathrm{model}}$ | Model/embedding dimension |
| $h$ | Number of attention heads |
| $\operatorname{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ | Softmax (written with explicit denominator on first use) |
| $\operatorname{Attention}(Q,K,V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$ | Scaled dot-product attention |
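
The attention formula translates almost line-for-line into code; a single-head sketch assuming NumPy (shapes: $T$ tokens, $d_k$ dimensions):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # QK^T / sqrt(d_k), shape (T, T)
    return softmax(scores, axis=-1) @ V

T, d_k = 5, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(T, d_k)) for _ in range(3))
out = attention(Q, K, V)                      # shape (T, d_k)
```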

8. Greek Letters — Reserved Meanings

The following assignments are fixed across the entire repository. Do not reuse these symbols for other quantities without explicit redefinition.

| Letter | Reserved use | Context |
|---|---|---|
| $\alpha$ | Learning rate (alt.), significance level | Optimisation, statistics |
| $\beta_1, \beta_2$ | Adam moment decay rates | Optimisation |
| $\gamma$ | Discount factor (RL), gradient scaling | RL, normalisation |
| $\eta$ | Learning rate (primary) | All optimisers |
| $\lambda$ | Eigenvalue; regularisation strength | Linear algebra, regularisation |
| $\mu$ | Mean; dual variable | Statistics, optimisation |
| $\sigma$ | Standard deviation; sigmoid activation | Statistics, neural networks |
| $\sigma_i$ | $i$-th singular value | SVD |
| $\tau$ | Temperature (sampling); time constant | LLM decoding, RNNs |
| $\boldsymbol{\theta}$ | All model parameters | Every ML model |
| $\phi$ | Encoder / feature-map parameters | VAE, kernels |
| $\psi$ | Auxiliary / decoder parameters | VAE |
| $\omega$ | Angular frequency; weights (secondary) | Signal processing |
| $\Sigma$ | Covariance matrix; singular-value matrix | Probability, SVD |
| $\Lambda$ | Diagonal eigenvalue matrix | Spectral decomposition |
| $\Phi$ | Feature / design matrix | Kernel methods |

9. Common LaTeX Pitfalls

| Wrong | Correct | Reason |
|---|---|---|
| `|x|` for a norm | `\lVert x \rVert` | Bare bars give incorrect spacing for norms |
| bare `|` inside a probability | `\mid` | `P(A|B)` renders with wrong spacing |
| `^T` for transpose | `^\top` | `T` looks like a variable name |
| `\text{tr}` | `\operatorname{tr}` | `\operatorname` gives correct operator spacing; `\text` does not |
| `\mathit{ReLU}` | `\operatorname{ReLU}` | Named functions use `\operatorname` |
| `...` | `\ldots` or `\cdots` | `\ldots` for baseline dots, `\cdots` for centred |
| `\mathbb{E}_{x}` | `\mathbb{E}_{\mathbf{x} \sim p}` | Specify the distribution when not clear from context |
| `\sum` for a covariance matrix | `\Sigma` | `\sum` is a summation sign |

This guide is versioned with the repository. If you find an inconsistency between this guide and any section file, the guide takes precedence — open an issue or PR to correct the section file.