**Authority:** This document is the single source of truth for all mathematical notation used throughout this repository. Every section, notebook, and exercise MUST conform to these conventions. When conventions conflict with intuition, this guide wins.

## Design Principles

This guide follows the conventions of Goodfellow, Bengio & Courville (2016) *Deep Learning*, Horn & Johnson (2013) *Matrix Analysis*, and the notation standards of NeurIPS/ICML/ICLR. Where conventions conflict, the choice that minimises ambiguity in an ML context is adopted.

Three rules that override everything else:

1. A symbol's meaning must be determinable from context within three lines of its first use.
2. The same symbol never denotes two different objects in the same section.
3. Prefer established ML-paper conventions over personal preference.
## 1. Object Hierarchy

### 1.1 Type-to-Notation Map

| Object | Notation style | LaTeX command | Example |
|---|---|---|---|
| Scalar | Italic lowercase | `a`, `\alpha` | $x \in \mathbb{R}$ |
| Vector | Bold lowercase | `\mathbf{x}` | $\mathbf{x} \in \mathbb{R}^n$ |
| Matrix | Uppercase italic | `A`, `W` | $A \in \mathbb{R}^{m \times n}$ |
| Tensor (order ≥ 3) | Calligraphic | `\mathcal{X}` | $\mathcal{X} \in \mathbb{R}^{d_1 \times \cdots \times d_k}$ |
| Random variable | Uppercase italic | `X`, `Z` | $X \sim \mathcal{N}(0, 1)$ |
| Set / space | Calligraphic | `\mathcal{D}`, `\mathcal{H}` | $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}$ |
| Number field | Blackboard bold | `\mathbb{R}`, `\mathbb{C}` | $\mathbb{R}^n$, $\mathbb{C}^{m \times n}$ |
| Distribution | Calligraphic | `\mathcal{N}`, `\mathcal{U}` | $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ |

**Do:** `\mathbf{x}` for vectors.
**Do not:** `\vec{x}` (physics convention), `\underline{x}` (typewriter convention), or plain $x$ when $x$ is a vector; these are all errors in this codebase.
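For example, a prose-plus-math line that follows the map end to end might read as below; the surrounding sentence is illustrative filler, only the commands are normative:

```latex
% One declaration per object type, using the commands from the table
Let $x \in \mathbb{R}$ be a scalar, $\mathbf{x} \in \mathbb{R}^n$ a vector,
and $A \in \mathbb{R}^{m \times n}$ a matrix; higher-order tensors are
calligraphic, $\mathcal{X} \in \mathbb{R}^{d_1 \times \cdots \times d_k}$.
Random variables are uppercase italic, $X \sim \mathcal{N}(0, 1)$, and the
dataset is a set, $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$.
```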
### 1.2 Indexing

| What | Notation | LaTeX | Meaning |
|---|---|---|---|
| Vector component | $x_i$ | `x_i` | $i$-th scalar, 1-indexed by default |
| Matrix entry | $A_{ij}$ | `A_{ij}` | row $i$, column $j$ |
| Row $i$ | $A_{i,:}$ | `A_{i,:}` | as a row vector |
| Column $j$ | $A_{:,j}$ | `A_{:,j}` | as a column vector |
| Data sample | $\mathbf{x}^{(i)}$ | `\mathbf{x}^{(i)}` | parentheses distinguish it from a power |
| Layer index | $\mathbf{h}^{[l]}$ | `\mathbf{h}^{[l]}` | square brackets distinguish it from a data index |
| Time step | $\mathbf{h}_t$ | `\mathbf{h}_t` | subscript, used in sequences |

**Convention:** $(i)$ = data sample, $[l]$ = network layer, $t$ = time. These three scopes must never share the same superscript/subscript style.
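Putting all three scopes in one expression; the recurrent update $f$ here is invented purely for illustration:

```latex
% Sample (i), layer [l], and time t each keep their own index style:
% the layer-l hidden state at time t for the i-th training sequence.
$\mathbf{h}^{[l]}_t = f\bigl(\mathbf{h}^{[l]}_{t-1},\, \mathbf{x}^{(i)}_t\bigr)$
```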
## 2. Number Fields and Spaces

| Symbol | LaTeX | Meaning |
|---|---|---|
| $\mathbb{N}$ | `\mathbb{N}` | Naturals $\{0, 1, 2, \ldots\}$ (ISO 80000-2; 0 is included) |
| $\mathbb{Z}$ | `\mathbb{Z}` | All integers $\{\ldots, -1, 0, 1, \ldots\}$ |
| $\mathbb{R}$ | `\mathbb{R}` | Real line |
| $\mathbb{R}^n$ | `\mathbb{R}^n` | Euclidean $n$-space (column vectors by default) |
| $\mathbb{R}^{m \times n}$ | `\mathbb{R}^{m \times n}` | Real $m \times n$ matrices |
| $\mathbb{C}$ | `\mathbb{C}` | Complex numbers |
| $\mathbb{S}^n$ | `\mathbb{S}^n` | $n \times n$ real symmetric matrices |
| $\mathbb{S}^n_+$ | `\mathbb{S}^n_+` | Positive semidefinite (PSD) cone |
| $\mathbb{S}^n_{++}$ | `\mathbb{S}^n_{++}` | Strictly positive definite matrices |
| $[n]$ | `[n]` | Index set $\{1, 2, \ldots, n\}$ |
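A line that exercises several of these spaces at once; the PSD inequality is a standard fact, included only to show the commands in context:

```latex
% Membership in the PSD cone, spelled out with the commands above
For $A \in \mathbb{S}^n_{+}$, $\mathbf{x}^\top A \mathbf{x} \ge 0$
for all $\mathbf{x} \in \mathbb{R}^n$; per-sample indices range over $[n]$.
```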
## 3. Linear Algebra

### 3.1 Vectors

| Symbol | LaTeX | Meaning |
|---|---|---|
| $\mathbf{x}$ | `\mathbf{x}` | Column vector (default orientation) |
| $\mathbf{x}^\top$ | `\mathbf{x}^\top` | Transpose; use `^\top`, never `^T`, in display math |
| $\mathbf{e}_i$ | `\mathbf{e}_i` | $i$-th standard basis vector |
| $\mathbf{0}$ | `\mathbf{0}` | Zero vector |
| $\mathbf{1}$ | `\mathbf{1}` | All-ones vector |
| $\langle \mathbf{x}, \mathbf{y} \rangle$ | `\langle \mathbf{x}, \mathbf{y} \rangle` | Inner product (preferred over the dot notation) |
| $\mathbf{x} \odot \mathbf{y}$ | `\mathbf{x} \odot \mathbf{y}` | Hadamard (element-wise) product |
**Transpose:** Use `^\top` in all display equations; reserve `^T` for inline code comments only. Rationale: when a variable named $T$ is in scope (a random variable, a sequence length), $x^T$ reads as a power rather than a transpose.
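As a quick composition check, the inner-product and transpose notations agree (standard identity):

```latex
% Inner product written two equivalent ways, both using ^\top
$\langle \mathbf{x}, \mathbf{y} \rangle
  = \mathbf{x}^\top \mathbf{y}
  = \sum_{i=1}^{n} x_i y_i$
```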
### 3.2 Matrices

| Symbol | LaTeX | Meaning |
|---|---|---|
| $I_n$ | `I_n` | $n \times n$ identity (subscript when the size matters) |
| $A^\top$ | `A^\top` | Matrix transpose |
| $A^{-1}$ | `A^{-1}` | Matrix inverse (only for square, non-singular $A$) |
| $A^\dagger$ | `A^\dagger` | Moore–Penrose pseudoinverse |
| $A^*$ | `A^*` | Conjugate transpose (Hermitian adjoint) |
| $\operatorname{tr}(A)$ | `\operatorname{tr}(A)` | Trace; use `\operatorname`, not `\text` |
| $\det(A)$ | `\det(A)` | Determinant |
| $\operatorname{rank}(A)$ | `\operatorname{rank}(A)` | Rank |
| $\operatorname{diag}(\mathbf{x})$ | `\operatorname{diag}(\mathbf{x})` | Diagonal matrix built from a vector |
| $A \otimes B$ | `A \otimes B` | Kronecker product |
| $A \succ 0$ | `A \succ 0` | Positive definite |
| $A \succeq 0$ | `A \succeq 0` | Positive semidefinite |
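The trace identities below combine several of these operators; they are standard facts, shown here only to exercise the notation:

```latex
% Trace identities typeset with \operatorname, per the table
$\operatorname{tr}(A^\top) = \operatorname{tr}(A), \qquad
 \operatorname{tr}(ABC) = \operatorname{tr}(BCA) = \operatorname{tr}(CAB)$
```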
### 3.3 Norms

| Symbol | LaTeX | Definition | Use case |
|---|---|---|---|
| $\lVert \mathbf{x} \rVert_2$ | `\lVert \mathbf{x} \rVert_2` | $\sqrt{\sum_i x_i^2}$ | Default vector norm |
| $\lVert \mathbf{x} \rVert_1$ | `\lVert \mathbf{x} \rVert_1` | $\sum_i \lvert x_i \rvert$ | Sparsity, L1 regularisation |
| $\lVert \mathbf{x} \rVert_p$ | `\lVert \mathbf{x} \rVert_p` | $\bigl(\sum_i \lvert x_i \rvert^p\bigr)^{1/p}$ | General $\ell_p$ norm |
| $\lVert \mathbf{x} \rVert_\infty$ | `\lVert \mathbf{x} \rVert_\infty` | $\max_i \lvert x_i \rvert$ | Chebyshev / max norm |
| $\lVert A \rVert_F$ | `\lVert A \rVert_F` | $\sqrt{\sum_{i,j} A_{ij}^2}$ | Frobenius norm |
| $\lVert A \rVert_2$ | `\lVert A \rVert_2` | $\sigma_{\max}(A)$ | Spectral norm (operator norm) |
| $\lVert A \rVert_*$ | `\lVert A \rVert_*` | $\sum_i \sigma_i(A)$ | Nuclear norm (trace norm) |

**Norm delimiters:** Always use `\lVert \rVert` for norms and `\lvert \rvert` for absolute values; never bare `\|` or `|`. This ensures correct spacing in all renderers.
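A regularised loss typeset with the mandated delimiters; the specific squared-error-plus-L1 objective is an invented example, not a repository convention:

```latex
% A regularised loss with \lVert \rVert delimiters throughout
% (the objective itself is illustrative only)
$\mathcal{L}(\boldsymbol{\theta})
  = \frac{1}{n} \sum_{i=1}^{n} \bigl(y^{(i)} - \hat{y}^{(i)}\bigr)^2
  + \lambda \lVert \boldsymbol{\theta} \rVert_1$
```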
### 3.4 Decompositions

| Name | Canonical form | Conditions |
|---|---|---|
| Eigendecomposition | $A = Q \Lambda Q^{-1}$ | $A$ diagonalisable |
| Spectral theorem | $A = Q \Lambda Q^\top$ | $A$ real symmetric |
| SVD | $A = U \Sigma V^\top$ | $A \in \mathbb{R}^{m \times n}$; always exists |
| QR | $A = QR$ | $Q$ orthogonal, $R$ upper triangular |
| Cholesky | $A = L L^\top$ | $A \succ 0$ |
| LU | $A = LU$ | Square, non-singular |

**SVD convention:** $U \in \mathbb{R}^{m \times m}$, $\Sigma \in \mathbb{R}^{m \times n}$, $V \in \mathbb{R}^{n \times n}$. Singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$ are always in non-increasing order. Left singular vectors are the columns of $U$; right singular vectors are the columns of $V$.
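One payoff of the ordering convention: the matrix norms of Section 3.3 read directly off the singular values (standard identities, stated here for illustration):

```latex
% Spectral, Frobenius, and nuclear norms in terms of singular values
$\lVert A \rVert_2 = \sigma_1, \qquad
 \lVert A \rVert_F = \sqrt{\textstyle\sum_i \sigma_i^2}, \qquad
 \lVert A \rVert_* = \textstyle\sum_i \sigma_i$
```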
## 4. Calculus and Optimisation

### 4.1 Derivatives

| Symbol | LaTeX | Meaning |
|---|---|---|
| $\frac{df}{dx}$ | `\frac{df}{dx}` | Derivative of scalar $f$ w.r.t. scalar $x$ |
| $f'(x)$ | `f'(x)` | Prime notation; only for univariate functions |
| $\frac{\partial f}{\partial x_i}$ | `\frac{\partial f}{\partial x_i}` | Partial derivative |
| $\nabla_{\mathbf{x}} f$ | `\nabla_{\mathbf{x}} f` | Gradient: $\mathbf{g} \in \mathbb{R}^n$ with $g_i = \partial f / \partial x_i$ |
| $J_f(\mathbf{x})$ | `J_f(\mathbf{x})` | Jacobian of $f: \mathbb{R}^n \to \mathbb{R}^m$; $J \in \mathbb{R}^{m \times n}$ |
| $H_f(\mathbf{x})$ | `H_f(\mathbf{x})` | Hessian: $(H)_{ij} = \partial^2 f / \partial x_i \, \partial x_j$ |
| $\frac{\partial \mathcal{L}}{\partial \theta}$ | `\frac{\partial \mathcal{L}}{\partial \theta}` | Gradient of the loss w.r.t. parameters; standard ML form |

**Gradient orientation:** $\nabla f \in \mathbb{R}^n$ is always a column vector. This is the convention of Nocedal & Wright (2006) and of all major ML frameworks. The gradient points in the direction of steepest ascent.
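A consequence worth typing out: for scalar-valued $f$ the Jacobian has a single row, so it is the transpose of the column gradient:

```latex
% For f: R^n -> R, the Jacobian is 1 x n: the transposed gradient
$J_f(\mathbf{x}) = \bigl(\nabla_{\mathbf{x}} f\bigr)^\top
  \in \mathbb{R}^{1 \times n}$
```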
### 4.2 Optimisation Symbols

| Symbol | LaTeX | Meaning |
|---|---|---|
| $\arg\min_{\mathbf{x}} f(\mathbf{x})$ | `\arg\min_{\mathbf{x}}` | Minimiser of $f$ |
| $\arg\max_{\mathbf{x}} f(\mathbf{x})$ | `\arg\max_{\mathbf{x}}` | Maximiser of $f$ |
| $f^*$ | `f^*` | Optimal (minimum) value |
| $\mathbf{x}^*$ | `\mathbf{x}^*` | Optimal solution |
| $\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda})$ | `\mathcal{L}(\mathbf{x}, \boldsymbol{\lambda})` | Lagrangian |
| $\boldsymbol{\lambda}$ | `\boldsymbol{\lambda}` | Lagrange multipliers (a vector, so bold) |
| s.t. | `\text{s.t.}` | Subject to |
| $\eta$ | `\eta` | Learning rate |
| $\theta_t$ | `\theta_t` | Parameters at step $t$ |
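These symbols compose into the canonical gradient-descent update, written here with the bold parameter vector of Section 7.1:

```latex
% Canonical gradient-descent step in this notation
$\boldsymbol{\theta}_{t+1}
  = \boldsymbol{\theta}_t
  - \eta \, \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}_t)$
```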
## 5. Probability and Statistics

### 5.1 Core Symbols

| Symbol | LaTeX | Meaning |
|---|---|---|
| $P(A)$ | `P(A)` | Probability of event $A$ |
| $p(\mathbf{x})$ | `p(\mathbf{x})` | Probability density / mass function |
| $P(A \mid B)$ | `P(A \mid B)` | Conditional probability; use `\mid`, not a bare pipe |
| $X \sim p$ | `X \sim p` | $X$ is distributed according to $p$ |
| $X \perp\!\!\!\perp Y$ | `X \perp\!\!\!\perp Y` | Statistical independence |
| $\mathbb{E}[X]$ | `\mathbb{E}[X]` | Expectation; use blackboard-bold $\mathbb{E}$ |
| $\mathbb{E}_{\mathbf{x} \sim p}[f(\mathbf{x})]$ | `\mathbb{E}_{\mathbf{x} \sim p}[f(\mathbf{x})]` | Expectation specifying the distribution |
| $\operatorname{Var}(X)$ | `\operatorname{Var}(X)` | Variance |
| $\operatorname{Cov}(X, Y)$ | `\operatorname{Cov}(X, Y)` | Covariance |
| $\Sigma$ | `\Sigma` | Covariance matrix; capital sigma, not `\sum` |

**Conditional probability delimiter:** Always `P(A \mid B)` with `\mid`, never `P(A|B)`; the bare pipe has incorrect spacing in LaTeX renderers.
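For example, Bayes' rule with `\mid` on every conditioning bar:

```latex
% Bayes' rule, typeset per the delimiter rule above
$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$
```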
### 5.2 Standard Distributions

| Distribution | Notation | LaTeX |
|---|---|---|
| Normal | $\mathcal{N}(\mu, \sigma^2)$ | `\mathcal{N}(\mu, \sigma^2)` |
| Multivariate normal | $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ | `\mathcal{N}(\boldsymbol{\mu}, \Sigma)` |
| Uniform | $\mathcal{U}(a, b)$ | `\mathcal{U}(a, b)` |
| Bernoulli | $\operatorname{Bern}(p)$ | `\operatorname{Bern}(p)` |
| Categorical | $\operatorname{Cat}(\mathbf{p})$ | `\operatorname{Cat}(\mathbf{p})` |
| Dirichlet | $\operatorname{Dir}(\boldsymbol{\alpha})$ | `\operatorname{Dir}(\boldsymbol{\alpha})` |

**Covariance matrix:** The second argument of $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ is the covariance matrix, not the precision matrix. When parameterising by precision, write $\mathcal{N}(\boldsymbol{\mu}, \Lambda^{-1})$ explicitly.
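Written out under this convention, the multivariate normal density reads as follows (a standard formula; $d$ here is the dimension of $\mathbf{x}$):

```latex
% Multivariate normal density, covariance parameterisation
$p(\mathbf{x}) = (2\pi)^{-d/2} \det(\Sigma)^{-1/2}
  \exp\!\Bigl( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top
    \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \Bigr)$
```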
## 6. Information Theory

| Symbol | LaTeX | Definition | Unit / notes |
|---|---|---|---|
| $H(X)$ | `H(X)` | $-\sum_x p(x) \log p(x)$ | nats (log = ln) or bits (log = log₂) |
| $H(X \mid Y)$ | `H(X \mid Y)` | $H(X, Y) - H(Y)$ | Conditional entropy |
| $D_{\mathrm{KL}}(p \| q)$ | `D_{\mathrm{KL}}(p \| q)` | $\sum_x p(x) \log \frac{p(x)}{q(x)}$ | Non-negative; zero iff $p = q$ |
| $H(p, q)$ | `H(p, q)` | $-\sum_x p(x) \log q(x)$ | Cross-entropy; $H(p, q) = H(p) + D_{\mathrm{KL}}(p \| q)$ |
| $I(X; Y)$ | `I(X; Y)` | $H(X) - H(X \mid Y)$ | Mutual information; symmetric |
| $\operatorname{PPL}$ | `\operatorname{PPL}` | $\exp\bigl(-\frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid x_{<t})\bigr)$ | Perplexity; LLM evaluation |

**KL divergence notation:** Write $D_{\mathrm{KL}}(p \| q)$ with `\mathrm{KL}` and the double bar `\|`. Read it as "the KL divergence from $q$ to $p$"; $p$ is the reference distribution. This is the convention of Kullback & Leibler (1951) and Cover & Thomas (2006).
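The cross-entropy identity from the table, typeset per this rule:

```latex
% Cross-entropy = entropy + KL, using \mathrm{KL} and the double bar
$H(p, q) = -\sum_x p(x) \log q(x) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$
```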
## 7. Machine Learning Specifics

### 7.1 Data and Model Parameters

| Symbol | LaTeX | Meaning |
|---|---|---|
| $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$ | `\mathcal{D}` | Dataset of $n$ examples |
| $n$ | `n` | Number of training samples |
| $d$ | `d` | Input dimension (number of features) |
| $\boldsymbol{\theta}$ | `\boldsymbol{\theta}` | All trainable parameters (vector or set) |
| $\mathbf{W}^{[l]}$ | `\mathbf{W}^{[l]}` | Weight matrix at layer $l$ |
| $\mathbf{b}^{[l]}$ | `\mathbf{b}^{[l]}` | Bias vector at layer $l$ |
| $\hat{y}$ | `\hat{y}` | Predicted output |
| $\mathcal{L}(\boldsymbol{\theta})$ | `\mathcal{L}(\boldsymbol{\theta})` | Loss function |
| $f_{\boldsymbol{\theta}}(\mathbf{x})$ | `f_{\boldsymbol{\theta}}(\mathbf{x})` | Model with parameters $\boldsymbol{\theta}$ |
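These symbols combine into the usual empirical-risk objective; note that the per-sample loss $\ell$ is an assumption of this sketch, not a symbol reserved by this guide:

```latex
% Empirical risk over \mathcal{D}; \ell is a per-sample loss
% (an assumption of this sketch, not a reserved symbol)
$\mathcal{L}(\boldsymbol{\theta})
  = \frac{1}{n} \sum_{i=1}^{n}
    \ell\bigl( f_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}),\, y^{(i)} \bigr)$
```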
### 7.2 Neural Network Layers

| Symbol | Meaning |
|---|---|
| $\mathbf{z}^{[l]} = \mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}$ | Pre-activation at layer $l$ |
| $\mathbf{a}^{[l]} = \sigma(\mathbf{z}^{[l]})$ | Post-activation at layer $l$ |
| $\sigma$ | Generic activation function |
| $\boldsymbol{\delta}^{[l]}$ | Error signal (backprop gradient) at layer $l$ |
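For example, the backprop recursion for the error signal combines the layer bracket of Section 1.2 with the Hadamard product of Section 3.1; this is a standard identity, stated here only to exercise the notation:

```latex
% Backprop error recursion: layer brackets plus the Hadamard product
$\boldsymbol{\delta}^{[l]}
  = \bigl( \mathbf{W}^{[l+1]} \bigr)^\top \boldsymbol{\delta}^{[l+1]}
    \odot \sigma'\bigl( \mathbf{z}^{[l]} \bigr)$
```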
### 7.3 Transformer-Specific

| Symbol | Meaning |
|---|---|
| $Q, K, V$ | Query, key, and value matrices |
| $d_k$ | Key/query dimension (scaling factor: $\sqrt{d_k}$) |
| $d_{\text{model}}$ | Model/embedding dimension |
| $h$ | Number of attention heads |
| $\operatorname{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ | Softmax; always written with an explicit denominator on first use |
| $\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\bigl(\frac{Q K^\top}{\sqrt{d_k}}\bigr) V$ | Scaled dot-product attention |
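For copy-paste convenience, the display-math source for the attention equation exactly as mandated above:

```latex
% Display-math source for scaled dot-product attention, per this guide
\[
  \operatorname{Attention}(Q, K, V)
    = \operatorname{softmax}\!\left( \frac{Q K^\top}{\sqrt{d_k}} \right) V
\]
```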
## 8. Greek Letters — Reserved Meanings

The following assignments are fixed across the entire repository. Do not reuse these symbols for other quantities without explicit redefinition.

| Letter | Reserved use | Context |
|---|---|---|
| $\alpha$ | Learning rate (alt.), significance level | Optimisation, statistics |
| $\beta_1, \beta_2$ | Adam moment decay rates | Optimisation |
| $\gamma$ | Discount factor (RL); normalisation scale parameter | RL, normalisation |
| $\eta$ | Learning rate (primary) | All optimisers |
| $\lambda$ | Eigenvalue; regularisation strength | Linear algebra, regularisation |
| $\mu$ | Mean; dual variable | Statistics, optimisation |
| $\sigma$ | Standard deviation; sigmoid activation | Statistics, neural networks |
| $\sigma_i$ | $i$-th singular value | SVD |
| $\tau$ | Temperature (sampling); time constant | LLM decoding, RNNs |
| $\theta$ | All model parameters | Every ML model |
| $\phi$ | Encoder / feature-map parameters | VAE, kernels |
| $\psi$ | Auxiliary / decoder parameters | VAE |
| $\omega$ | Angular frequency; weights (secondary) | Signal processing |
| $\Sigma$ | Covariance matrix; singular-value matrix | Probability, SVD |
| $\Lambda$ | Diagonal eigenvalue matrix | Spectral decomposition |
| $\Phi$ | Feature / design matrix | Kernel methods |
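Several reserved symbols meet in the temperature-scaled softmax common in LLM decoding ($\tau$ from this table, softmax from Section 7.3):

```latex
% Temperature-scaled softmax: tau is the sampling temperature
$\operatorname{softmax}(\mathbf{z} / \tau)_i
  = \frac{e^{z_i / \tau}}{\sum_j e^{z_j / \tau}}$
```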
## 9. Common LaTeX Pitfalls

| Wrong | Correct | Reason |
|---|---|---|
| `\|x\|` for a norm | `\lVert x \rVert` | Bare bars give incorrect spacing for norms |
| bare `\|` inside a probability | `\mid` | `P(A\|B)` renders with wrong spacing |
| `^T` for transpose | `^\top` | `T` looks like a variable name |
| `\text{tr}` | `\operatorname{tr}` | `\text` is for words; `\operatorname` gives correct operator font and spacing |
| `\mathit{ReLU}` | `\operatorname{ReLU}` | Named functions use `\operatorname` |
| `...` | `\ldots` or `\cdots` | `\ldots` sits on the baseline, `\cdots` is centred |
| `\mathbb{E}_{x}` | `\mathbb{E}_{\mathbf{x} \sim p}` | Specify the distribution when it is not clear from context |
| `\sum` for a covariance matrix | `\Sigma` | `\sum` is a summation sign |
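The "Correct" forms gathered in one line, for quick reference:

```latex
% The corrected forms from the table combined in a single expression
$\mathbb{E}_{\mathbf{x} \sim p}\bigl[\lVert \mathbf{x} \rVert_2\bigr], \quad
 P(A \mid B), \quad \mathbf{x}^\top \mathbf{y}, \quad
 \operatorname{tr}(A), \quad \operatorname{ReLU}(z), \quad
 x_1, \ldots, x_n, \quad \Sigma$
```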
This guide is versioned with the repository. If you find an inconsistency between this
guide and any section file, the guide takes precedence — open an issue or PR to correct
the section file.