<- Sets and Logic | Next: Summation and Product Notation ->
"A neural network is a composition of functions. Understanding functions - what they preserve, what they destroy, how they compose, where they break - is understanding neural networks at the deepest possible level."
Overview
A function is the most fundamental concept in mathematics after sets. It is a rule that assigns to each input exactly one output. Every mathematical operation you have ever performed - adding, multiplying, differentiating, integrating - is a function. Every layer in a neural network is a function. The entire model from tokens in to probabilities out is a single (very complex) function. Training is optimisation over a space of functions. Inference is function evaluation.
This chapter develops function theory from first principles through to the frontier concepts needed to understand modern LLMs and AI systems at a mathematical level. We cover the formal definition, types (injective, surjective, bijective), composition, inverses, special classes (linear, affine, activation functions, convex, Lipschitz), higher-order functions and functionals, continuity, differentiability, measure-theoretic foundations, the systematic role of functions in machine learning, and the category-theoretic perspective that unifies it all.
Every concept is connected to how it appears in real neural network computation, with emphasis on transformer architectures, attention mechanisms, and the training pipeline.
Prerequisites
- Set theory fundamentals (Chapter 02: Sets and Logic)
- Basic algebra and coordinate geometry
- Familiarity with vectors and matrices (helpful but not required yet)
- Python/NumPy basics for code examples
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive code: function visualisation, composition chains, activation functions, Lipschitz demos |
| exercises.ipynb | Practice problems: injectivity proofs, composition calculations, inverse construction, approximation |
Learning Objectives
After completing this section, you will:
- Understand functions formally as subsets of Cartesian products satisfying totality and uniqueness
- Distinguish domain, codomain, range, image, and preimage - and know why each matters for AI
- Classify functions as injective, surjective, bijective, monotone, or periodic
- Compose functions and understand associativity, non-commutativity, and the identity function
- Construct inverses (left, right, two-sided, pseudo-inverse) and connect invertibility to information preservation
- Master special function classes: linear, affine, multilinear, polynomial, activation functions, convex, Lipschitz
- Work with higher-order functions, functionals, operators, currying, and function spaces
- Understand continuity (pointwise, uniform) and its role in gradient-based optimisation
- Apply derivatives, Jacobians, and the chain rule to function composition (backpropagation)
- Connect functions to measure theory: measurable functions, Lebesgue integration, probability distributions
- See every component of a neural network as a specific type of function with specific mathematical properties
- Appreciate the category-theoretic perspective: morphisms, functors, natural transformations
Table of Contents
- Intuition
- Formal Definitions
- Types of Functions
- Function Composition
- Inverse Functions
- Special Classes of Functions
- Higher-Order Functions and Functionals
- Continuity and Limits
- Differentiable Functions and Derivatives
- Measure-Theoretic View of Functions
- Functions in Machine Learning
- Category Theory Perspective
- Common Mistakes and Misconceptions
- Exercises and Practice Problems
- Why This Matters for AI/LLMs
- Conceptual Bridge
1. Intuition
1.1 What Is a Function?
A function is a rule that assigns to each input exactly one output. No ambiguity, no missing outputs, no contradictions. If you give a function the same input twice, you get the same output both times. This is the single most important mathematical concept you will ever encounter.
The word "function" comes from the Latin functio, meaning "performance" or "execution". A function performs a transformation: it takes something and produces something else, reliably and deterministically.
The machine metaphor. Think of a function as a machine with an input slot and an output slot. You drop an object into the input slot; the machine does something to it; exactly one object comes out of the output slot. The same input always produces the same output. The machine never refuses an input (within its declared domain), and it never produces two outputs for a single input.
+--------------+
input --->| f(x) = 2x |---> output
x = 3 | | f(3) = 6
+--------------+
Same input, same output - always:
f(3) = 6, f(3) = 6, f(3) = 6, ...
Formal view. A function $f: A \to B$ is a subset of the Cartesian product $A \times B$ such that every element of $A$ appears as the first element of exactly one ordered pair. If $(a, b) \in f$ and $(a, b') \in f$, then $b = b'$. This definition reduces the concept of a function to the concept of a set - and since set theory is the foundation of all mathematics, this means functions are built on the most basic level of the mathematical hierarchy.
The word "mapping". The word "mapping" is a synonym for "function" that emphasises the geometric or structural idea. When we say "$f$ maps $A$ to $B$", we are emphasising that $f$ takes each point in one space and places it in another space. In linear algebra, we often say "linear map" rather than "linear function". In topology, we say "continuous map". The word "map" carries the same formal definition as "function".
For AI and LLMs. Every layer of a neural network is a function. Every activation function is a function. Every loss function is a function. The entire model - from raw text input to probability distribution over vocabulary - is a single (complicated) function. Training is optimisation: finding the best function within a parameterised family. Inference is function evaluation: computing the output for a given input. Understanding functions deeply is not a prerequisite for AI - it is understanding AI.
1.2 Why Functions Are Central to AI
Every component of a modern AI system is a function. Here is a systematic mapping:
| AI Component | Mathematical Function | Type Signature |
|---|---|---|
| Neural network layer | Maps vectors to vectors | $f: \mathbb{R}^n \to \mathbb{R}^m$ |
| Embedding layer | Maps discrete tokens to continuous vectors | $E: V \to \mathbb{R}^d$ |
| Attention mechanism | Maps query, key, value triples to output | $\mathrm{Attn}: \mathbb{R}^{n \times d} \times \mathbb{R}^{n \times d} \times \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$ |
| Loss function | Maps parameters to scalar loss | $L: \mathbb{R}^p \to \mathbb{R}$ |
| Tokeniser | Maps strings to token sequences | $T: \Sigma^* \to V^*$ |
| Probability head | Maps logits to probability distribution | $\sigma: \mathbb{R}^{\lvert V\rvert} \to \Delta^{\lvert V\rvert - 1}$ |
| Activation function | Elementwise nonlinearity | $\sigma: \mathbb{R} \to \mathbb{R}$ (applied elementwise) |
| Optimiser step | Maps current parameters to updated parameters | $U: \mathbb{R}^p \to \mathbb{R}^p$ |
| Learning rate scheduler | Maps step count to learning rate | $\eta: \mathbb{N} \to \mathbb{R}_{>0}$ |
| Positional encoding | Maps position index to vector | $PE: \mathbb{N} \to \mathbb{R}^d$ |
Everything that happens inside an LLM - from the moment text enters as bytes to the moment a probability distribution over the next token is produced - is a composition of functions. Understanding the mathematical properties of these functions (continuity, differentiability, Lipschitz constants, injectivity, equivariance) directly determines what the model can learn, how stable training is, and what the model can represent.
1.3 Functions as Transformations
The geometric viewpoint reveals what functions do to space. A function $f: \mathbb{R}^n \to \mathbb{R}^m$ takes every point in $n$-dimensional space and moves it to a point in $m$-dimensional space. Collections of points - lines, surfaces, regions - are transformed into new collections.
Linear functions preserve structure. Under a linear function, straight lines map to straight lines (or to points). The origin maps to the origin. Parallelism is preserved. Grid lines remain evenly spaced. Linear functions can rotate, reflect, stretch, compress, and project, but they cannot bend or fold.
Linear transformation of a grid:
Before (identity): After (rotation + stretch):
+---+---+---+ / / / /
| | | | / / / /
+---+---+---+ / / / /
| | | | / / / /
+---+---+---+ / / / /
| | | |
+---+---+---+
Grid lines remain straight; origin stays fixed.
Nonlinear functions bend, fold, and warp space. Under a nonlinear function, straight lines can become curves, parallel lines can converge or diverge, and the structure of space is fundamentally altered. This is exactly what enables neural networks to learn complex decision boundaries - if all functions were linear, a neural network would be no more powerful than a single matrix multiplication.
Invertible functions (bijections) transform space without destroying any information. Every point in the output space came from exactly one point in the input space, so you can always "undo" the transformation and recover the original. This is the principle behind normalising flows: a chain of bijective transformations that transforms a simple distribution (like Gaussian) into a complex one (like the data distribution), with the guarantee that the transformation can be reversed.
Non-invertible functions (many-to-one) destroy information. Multiple input points map to the same output point. Once the function is applied, you cannot tell which input was used. ReLU is a simple example: both $-1$ and $-5$ map to $0$, so if you only see the output $0$, you cannot recover the input. This information destruction is actually useful - it's a form of compression, throwing away irrelevant details and keeping what matters.
Deep networks compose many simple functions. Each individual function may be simple (an affine map followed by an elementwise nonlinearity), but the composition of many such functions creates a global transformation that is highly nonlinear and enormously expressive. This is the fundamental insight of deep learning: complexity emerges from the composition of simplicity.
1.4 The Language of Functions in Mathematics
Functions appear everywhere in mathematics, but under different names depending on the context. Each name emphasises a different aspect of the same underlying concept:
| Name | Emphasis | Example |
|---|---|---|
| Function | General concept; rule from inputs to outputs | $f(x) = x^2$, $\sin x$ |
| Map / Mapping | Geometric/structural view; moving points between spaces | Linear map $x \mapsto Ax$ |
| Morphism | Structure-preserving; respects algebraic properties | Group homomorphism |
| Operator | Function acting on functions (or function spaces) | Differentiation $d/dx$, integration $\int$ |
| Functional | Function whose input is a function, output is a scalar | Definite integral $f \mapsto \int_a^b f(x)\,dx$ |
| Transformation | Emphasis on changing/deforming objects | Linear transformation, Fourier transform |
| Kernel | Function of two arguments used in integral operators or ML | RBF kernel $k(x, y) = e^{-\lVert x - y\rVert^2 / 2\sigma^2}$ |
| Distribution | Generalised function (Schwartz); or probability function | Dirac delta $\delta(x)$; Gaussian PDF |
The key insight is that all of these are instances of the same concept. A linear map is a function that happens to satisfy linearity. An operator is a function whose domain is a function space. A functional is an operator that happens to return scalars. Understanding that these are all functions - with the same formal definition, the same composition rules, the same invertibility questions - unifies vast swaths of mathematics under a single framework.
1.5 Historical Timeline
The concept of a function evolved over centuries, from implicit geometric constructions to the fully abstract modern definition:
| Date | Contributor | Development |
|---|---|---|
| ~300 BCE | Euclid | Geometric constructions implicitly define functions (compass-and-straightedge); proportionality as functional relationship |
| 1694 | Leibniz | First explicit use of the word "function" (functio) in a mathematical context; described quantities depending on a curve |
| 1748 | Euler | Function as analytic expression; introduced the $f(x)$ notation; classified functions as algebraic or transcendental |
| 1837 | Dirichlet | Modern definition: function as arbitrary rule assigning outputs to inputs, not necessarily given by a formula. The Dirichlet function $D(x) = 1$ if $x \in \mathbb{Q}$, $0$ otherwise, has no formula but is a valid function |
| 1870s | Cantor | Function as set of ordered pairs; foundation in set theory; functions between infinite sets |
| 1879 | Frege | Functions in formal logic; predicate as function mapping objects to truth values |
| 1888 | Dedekind | Morphism concept; structure-preserving maps between number systems |
| 1900s | Hilbert | Operator theory; functions on infinite-dimensional spaces; functional analysis foundations |
| 1920s | Banach | Normed function spaces (Banach spaces); fixed-point theorem; foundation of functional analysis |
| 1936 | Church | Lambda calculus: functions as fundamental computational objects; every computation is function evaluation |
| 1945 | Eilenberg & Mac Lane | Category theory: morphisms (functions) as primary objects; composition as fundamental operation |
| 1960s-1980s | Various | Functional programming languages (ML, Haskell, Lisp); functions as first-class values in computation |
| 1986 | Rumelhart, Hinton, Williams | Backpropagation popularised: neural network as differentiable function composition; chain rule as algorithm |
| 1989 | Cybenko, Hornik | Universal Approximation Theorem: neural networks can approximate any continuous function |
| 2017 | Vaswani et al. | Transformer architecture: attention as specific function class; self-attention as permutation-equivariant function |
| 2020-2026 | Scaling era | Functions at unprecedented scale: GPT-4, LLaMA, Gemini - trillions of parameters defining single functions from text to text |
The trend is clear: the concept of a function has become increasingly abstract and increasingly central. From Euler's "formula in $x$" to Dirichlet's "arbitrary rule" to Church's "computational object" to modern AI's "parameterised differentiable map" - each step expanded what counts as a function and what we can do with functions.
1.6 Pipeline Position in AI
Every stage of the LLM pipeline is a function. Here is the complete chain:
Input Data (text, tokens, images)
  ↓  [Tokeniser T: Σ* → V*]
Token Sequences (discrete integer IDs)
  ↓  [Embedding E: V^n → ℝ^{n×d}]
Embedding Matrices (continuous vectors per token)
  ↓  [Positional encoding PE: ℝ^{n×d} → ℝ^{n×d}]
Position-Aware Embeddings
  ↓  [Transformer layer 1: f_1: ℝ^{n×d} → ℝ^{n×d}]
  ↓  [Transformer layer 2: f_2: ℝ^{n×d} → ℝ^{n×d}]
  ↓  [...]
  ↓  [Transformer layer L: f_L: ℝ^{n×d} → ℝ^{n×d}]
Contextual Representations (hidden states)
  ↓  [Layer norm LN: ℝ^{n×d} → ℝ^{n×d}]
Normalised Representations
  ↓  [LM head (linear projection) W: ℝ^d → ℝ^{|V|}]
Logits (unnormalised scores per vocabulary token)
  ↓  [Softmax σ: ℝ^{|V|} → Δ^{|V|-1}]
Probability Distribution (simplex)
  ↓  [Sampling function sample: Δ^{|V|-1} → V]
Output Token (single discrete token)

Functions everywhere. The entire pipeline is one composed function:
sample ∘ σ ∘ W ∘ LN ∘ f_L ∘ ... ∘ f_1 ∘ PE ∘ E ∘ T
Each arrow in this pipeline is a function with specific mathematical properties:
- T (tokeniser): discrete, non-differentiable, deterministic
- E (embedding): linear (lookup is matrix multiply with one-hot), differentiable
- PE (positional encoding): additive, fixed or learned, differentiable
- f_1 through f_L (transformer layers): nonlinear, differentiable, permutation-equivariant, residual
- LN (layer norm): nonlinear, differentiable, scale-invariant
- W (LM head): linear, differentiable
- \sigma (softmax): nonlinear, differentiable, maps to probability simplex
- sample: stochastic (not a deterministic function in the classical sense; involves randomness)
Training optimises the parameters of the differentiable functions (E, f_1 through f_L, W) to minimise a loss function L over the training data. The loss function itself is a function mapping the entire parameter vector to a single scalar.
2. Formal Definitions
2.1 Function as a Set of Ordered Pairs
Definition. A function from set $A$ to set $B$, written $f: A \to B$, is a subset $f \subseteq A \times B$ satisfying two conditions:
- Totality (every input has an output): For every $a \in A$, there exists some $b \in B$ such that $(a, b) \in f$.
- Uniqueness (single-valued): If $(a, b) \in f$ and $(a, b') \in f$, then $b = b'$.
Totality says the function is defined everywhere on its domain. Uniqueness says the function does not assign two different outputs to the same input. Together they guarantee: for each $a \in A$, there is exactly one $b \in B$ such that $(a, b) \in f$. We write this unique $b$ as $f(a)$.
Example. Consider $A = \{1, 2, 3\}$ and $B = \{x, y\}$. Then:
- $f = \{(1, x), (2, x), (3, y)\}$ is a valid function (total, single-valued). Note: two inputs map to the same output ($f(1) = f(2) = x$) - this is allowed.
- $g = \{(1, x), (3, y)\}$ is NOT a function (not total: input 2 has no output).
- $h = \{(1, x), (1, y), (2, x), (3, y)\}$ is NOT a function (not single-valued: input 1 maps to both $x$ and $y$).
Relation vs function. A relation from to is any subset of , without the totality or uniqueness requirements. Every function is a relation, but not every relation is a function. A relation that satisfies uniqueness but not totality is called a partial function.
Relations \supseteq Partial Functions \supseteq Functions \supseteq Bijections
(any subset (unique but (unique and (unique, total,
of A \times B) not total) total) one-to-one, onto)
For AI. When we write a neural network layer $f(x) = \mathrm{ReLU}(Wx + b)$ with $W \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$, we are defining a function from $\mathbb{R}^n$ to $\mathbb{R}^m$. Totality is guaranteed: for any input vector $x \in \mathbb{R}^n$, the matrix multiplication $Wx$ is defined, and $\mathrm{ReLU}$ is defined on all of $\mathbb{R}^m$. Uniqueness is guaranteed: matrix multiplication is deterministic. During training with dropout, the computation is stochastic - but we model this as sampling a different function at each step, not as a single multi-valued function.
2.2 Domain, Codomain, and Range
Three fundamental sets are associated with every function $f: A \to B$:
| Concept | Symbol | Definition | Example: $f: \mathbb{R} \to \mathbb{R}$, $f(x) = x^2$ |
|---|---|---|---|
| Domain | $\mathrm{dom}(f) = A$ | The set of all valid inputs | $\mathbb{R}$ (all real numbers) |
| Codomain | $\mathrm{cod}(f) = B$ | The "target" set where outputs live | $\mathbb{R}$ (all real numbers) |
| Range (Image) | $\mathrm{range}(f) = f(A)$ | The set of outputs actually produced | $[0, \infty)$ (non-negative reals) |
The distinction between codomain and range is subtle but crucial:
- The codomain is part of the function's definition - it is the set where outputs are declared to live.
- The range is a property of the function - it is the set of outputs that are actually achieved.
- Always: $\mathrm{range}(f) \subseteq \mathrm{cod}(f)$.
- A function is surjective (onto) if and only if $\mathrm{range}(f) = \mathrm{cod}(f)$.
Why the distinction matters for AI. Consider a neural network's output layer:
- Domain: $\mathbb{R}^d$ (any hidden representation is a valid input).
- Codomain: $\mathbb{R}^{|V|}$ (the output space has dimension equal to vocabulary size).
- Range: typically a strict subset of $\mathbb{R}^{|V|}$ - not every logit vector will actually be produced.
When we apply softmax, the codomain becomes the probability simplex $\Delta^{|V|-1}$. The range is a strict subset: softmax can never output exactly 0 or exactly 1 for any component, because $e^{z_i} > 0$ for every finite $z_i$. (A numerical check follows the diagram below.)
Domain (\mathbb{R}^d) Codomain (\mathbb{R}|V|)
+-------------+ +-----------------+
| | | |
| All hidden |---f(x)------->| All possible |
| states | | logit vectors |
| | | |
| | | +-------------+ |
| | | | Range: | |
| | | | Actually | |
| | | | produced | |
| | | | logits | |
| | | +-------------+ |
+-------------+ +-----------------+
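A minimal NumPy check of the codomain/range distinction for softmax: the output always lies on the simplex, but no coordinate ever reaches exactly 0 or 1. The specific logit values below are just an illustration.

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; this does not change the output (see Section 3.1).
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([10.0, -20.0, 3.0, 0.0])
p = softmax(logits)
print(p)                           # components strictly between 0 and 1
print(p.sum())                     # ~1.0: the output lies on the probability simplex
print(np.all((p > 0) & (p < 1)))   # True: exactly 0 or exactly 1 is never attained
```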
2.3 Image and Preimage
Image. For a function $f: A \to B$ and a subset $S \subseteq A$, the image of $S$ under $f$ is:
$$f(S) = \{f(x) : x \in S\} \subseteq B$$
Preimage (Inverse Image). For a subset $T \subseteq B$, the preimage of $T$ under $f$ is:
$$f^{-1}(T) = \{x \in A : f(x) \in T\} \subseteq A$$
Critical note: The preimage notation $f^{-1}(T)$ does NOT require $f$ to have an inverse function. Even for non-invertible functions, the preimage of a set is always well-defined.
Properties of images and preimages:
| Property | Image | Preimage |
|---|---|---|
| Union | $f(S_1 \cup S_2) = f(S_1) \cup f(S_2)$ | $f^{-1}(T_1 \cup T_2) = f^{-1}(T_1) \cup f^{-1}(T_2)$ |
| Intersection | $f(S_1 \cap S_2) \subseteq f(S_1) \cap f(S_2)$ ($\subseteq$ only!) | $f^{-1}(T_1 \cap T_2) = f^{-1}(T_1) \cap f^{-1}(T_2)$ |
| Complement | No simple rule | $f^{-1}(B \setminus T) = A \setminus f^{-1}(T)$ |
| Subset | $S_1 \subseteq S_2 \implies f(S_1) \subseteq f(S_2)$ | $T_1 \subseteq T_2 \implies f^{-1}(T_1) \subseteq f^{-1}(T_2)$ |
The key asymmetry. Preimage preserves ALL set operations (unions, intersections, complements). Image only preserves unions. Image preserves intersections if and only if $f$ is injective.
Proof that image does not preserve intersection in general.
Let $f(x) = x^2$, $S_1 = \{-1\}$, $S_2 = \{1\}$.
- $S_1 \cap S_2 = \emptyset$, so $f(S_1 \cap S_2) = \emptyset$.
- $f(S_1) = \{1\}$, $f(S_2) = \{1\}$, so $f(S_1) \cap f(S_2) = \{1\}$.
- $\emptyset \ne \{1\}$, so $f(S_1 \cap S_2) \ne f(S_1) \cap f(S_2)$. QED
For AI. Preimages appear naturally in classification. The preimage $f^{-1}(\{c\})$ is the decision region for class $c$ - the set of all inputs that the classifier assigns to class $c$. Understanding the geometry of these preimages (connected? convex? simply connected?) is central to understanding what the classifier has learned.
2.4 Equality of Functions
Definition. Two functions $f: A \to B$ and $g: C \to D$ are equal, written $f = g$, if and only if:
- They have the same domain: $A = C$
- They have the same codomain: $B = D$
- They agree on all inputs: $f(x) = g(x)$ for all $x \in A$
This is the principle of extensionality: two functions are equal if and only if they produce the same output for every input. How they compute their output is irrelevant - only the input-output mapping matters.
Example. The functions:
$$f(x) = (x + 1)^2 \qquad \text{and} \qquad g(x) = x^2 + 2x + 1$$
are equal as functions $\mathbb{R} \to \mathbb{R}$, because $f(x) = g(x)$ for all $x \in \mathbb{R}$.
For AI. Two neural networks with completely different weights can compute the same function (e.g., by permuting neurons - see Section 11.4). Conversely, two architectures that "look" the same but have different weights compute different functions. Function equality is about the input-output mapping, not the implementation.
2.5 Partial Functions
A partial function $f: A \rightharpoonup B$ satisfies uniqueness but not necessarily totality. There may be elements of $A$ for which $f$ is undefined.
The domain of definition: $\mathrm{dom}(f) = \{a \in A : f(a) \text{ is defined}\} \subseteq A$.
Examples in mathematics:
- Division: $(x, y) \mapsto x/y$ is partial on $\mathbb{R} \times \mathbb{R}$ - undefined at $y = 0$
- Logarithm: $x \mapsto \ln x$ is partial on $\mathbb{R}$ - undefined for $x \le 0$
- Square root: $x \mapsto \sqrt{x}$ is partial on $\mathbb{R}$ (as a real-valued function) - undefined for $x < 0$
Examples in AI:
- Masked attention: In causal (autoregressive) attention, token $i$ cannot attend to token $j > i$. The attention weight function is "undefined" for future positions - the corresponding logits are set to $-\infty$ before softmax, giving weight 0.
- Sparse attention: Only a subset of positions are attended to; the attention function is partial.
- Variable-length sequences: A function defined on sequences up to some maximum length $n_{\max}$ is partial on the space of all sequences.
Making partial functions total. Two standard approaches:
- Restrict the domain: change $A$ to $\mathrm{dom}(f)$ - only elements where $f$ is defined. $f$ becomes total on the smaller domain.
- Extend the codomain: add a special "undefined" value $\perp$ to $B$ and set $f(a) = \perp$ for undefined inputs.
In neural network implementations, the extension approach is standard: masked attention sets logits to $-\infty$ (a stand-in for "undefined"), and after softmax, these become 0, contributing nothing to the weighted average.
3. Types of Functions
3.1 Injective Functions (One-to-One)
Definition. A function $f: A \to B$ is injective (or one-to-one) if different inputs always produce different outputs:
$$x_1 \ne x_2 \implies f(x_1) \ne f(x_2)$$
Equivalently (contrapositive): $f(x_1) = f(x_2) \implies x_1 = x_2$.
Intuition. An injective function preserves distinctness. No two inputs "collide" at the same output. If you see the output, you can determine which input produced it.
Injective: NOT injective:
A B A B
1 ----> a 1 ----> a
2 ----> b 2 ----+
3 ----> c 3 ----+> b
d (not hit) c (not hit)
Each output comes from Two inputs map to
at most one input the same output (b)
Proof technique for injectivity. Assume $f(x_1) = f(x_2)$ and derive $x_1 = x_2$.
Example 1. Prove that $f(x) = 2x + 1$ is injective.
Proof. Suppose $f(x_1) = f(x_2)$. Then $2x_1 + 1 = 2x_2 + 1$, so $2x_1 = 2x_2$, so $x_1 = x_2$. QED
Example 2. Prove that $f(x) = e^x$ is injective.
Proof. Suppose $e^{x_1} = e^{x_2}$. Taking $\ln$ of both sides: $x_1 = x_2$. QED
Example 3. Prove that $f(x) = x^2$ is NOT injective on $\mathbb{R}$.
Proof. Counterexample: $f(-1) = f(1) = 1$ but $-1 \ne 1$. QED
Proof technique to disprove injectivity: find two distinct inputs with the same output (a single counterexample suffices).
For AI. Injectivity means no information is lost. If a layer is injective, the original input can (in principle) be recovered from the output. Non-injective layers destroy information.
| Component | Injective? | Why |
|---|---|---|
| Token embedding | Yes | Each token maps to a unique vector |
| ReLU activation | No | All negative values map to 0 |
| Sigmoid activation | Yes | Strictly monotone increasing |
| Softmax () | No | Adding constant to all inputs gives same output |
| Max pooling | No | Different inputs with same max give same output |
| Average pooling | No | Different inputs with same mean give same output |
| Invertible ResNet layers | Yes | Designed to be bijective |
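A quick numerical illustration of the table, assuming standard NumPy-style definitions of ReLU, sigmoid, and softmax (these helpers are written here for the example, not taken from a library):

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# ReLU is not injective: distinct negative inputs collide at 0.
print(relu(-1.0) == relu(-7.0))            # True -> sign information is destroyed

# Sigmoid is strictly increasing, hence injective: distinct inputs stay distinct.
print(sigmoid(-1.0) == sigmoid(-7.0))      # False

# Softmax is not injective: adding a constant to every logit gives the same output.
z = np.array([1.0, 2.0, 3.0])
print(np.allclose(softmax(z), softmax(z + 100.0)))   # True
```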
3.2 Surjective Functions (Onto)
Definition. A function $f: A \to B$ is surjective (or onto) if every element of the codomain is achieved:
$$\forall b \in B, \ \exists a \in A \text{ such that } f(a) = b$$
Equivalently: $\mathrm{range}(f) = B$. Every element of $B$ is "hit" by at least one input.
Surjective: NOT surjective:
A B A B
1 ----> a 1 ----> a
2 ---+ 2 ----> b
3 ---+> b 3 ----> c
4 ----> c d <- never reached
Every output is hit Element d in codomain
by at least one input is never produced
Proof technique for surjectivity. Take an arbitrary $b \in B$ and explicitly find an $a \in A$ with $f(a) = b$.
Example 1. Prove that $f: \mathbb{R} \to \mathbb{R}$, $f(x) = 2x + 1$ is surjective.
Proof. Let $y \in \mathbb{R}$. We need $x$ with $f(x) = y$: solve $2x + 1 = y$ to get $x = \frac{y - 1}{2}$. Since $y \in \mathbb{R}$, we have $x \in \mathbb{R}$, and $f(x) = 2 \cdot \frac{y-1}{2} + 1 = y$. QED
Example 2. Prove that $f(x) = x^2$ is NOT surjective as $f: \mathbb{R} \to \mathbb{R}$.
Proof. Take $y = -1$. For any $x \in \mathbb{R}$, $x^2 \ge 0 > -1$, so no $x$ maps to $-1$. QED
Note on codomain choice. The same formula $f(x) = x^2$ gives:
- $f: \mathbb{R} \to \mathbb{R}$ - NOT surjective (negative numbers not in range)
- $f: \mathbb{R} \to [0, \infty)$ - surjective (every non-negative number is a perfect square of something)
Changing the codomain changes whether the function is surjective.
For AI - surjectivity means the output space is fully utilised:
| Component | Surjective? | Why |
|---|---|---|
| Softmax onto the open simplex $\mathrm{int}(\Delta^{|V|-1})$ | Yes | Any strictly positive distribution achievable |
| ReLU: $\mathbb{R} \to \mathbb{R}$ | No | Negative outputs never produced |
| ReLU: $\mathbb{R} \to [0, \infty)$ | Yes | Every non-negative value achievable |
| Tanh: $\mathbb{R} \to (-1, 1)$ | Yes | Approaches $\pm 1$ but never reaches them; onto the open interval |
| Linear $x \mapsto Ax$: $\mathbb{R}^n \to \mathbb{R}^m$ | Only if $\mathrm{rank}(A) = m$ | Range is the column space of $A$ |
3.3 Bijective Functions (One-to-One and Onto)
Definition. A function $f: A \to B$ is bijective if it is both injective AND surjective. Every element of $B$ is the image of exactly one element of $A$.
Bijective (perfect pairing):
A B
1 ----> a
2 ----> b
3 ----> c
Every element of A maps to a unique element of B.
Every element of B is hit exactly once.
No information created, no information destroyed.
Key property. A function has a two-sided inverse if and only if it is bijective.
Bijections and cardinality. A bijection $f: A \to B$ proves that $|A| = |B|$ (same cardinality). This is the formal definition of "same size" for sets, including infinite sets. For example, $f(n) = 2n$ is a bijection from $\mathbb{N}$ to the even numbers, proving that the natural numbers and the even numbers have the same cardinality - a result that seems paradoxical until you see the bijection.
For AI - where bijections are critical:
| Application | How Bijection Is Used |
|---|---|
| Normalising flows | Chain of bijections transforms simple distribution to complex one. Change of variables: $p_Y(y) = p_X(f^{-1}(y)) \cdot \lvert \det J_{f^{-1}}(y)\rvert$ |
| Reversible architectures | Invertible ResNets (iRevNet, Glow) save memory by recomputing activations during backprop |
| Permutation matrices | Reordering tokens is a bijection on indices; attention is permutation-equivariant |
| Tokeniser / Detokeniser | Bijection between token IDs and token strings (within vocabulary) |
| Encryption | Encryption and decryption are inverse bijections; information-preserving |
3.4 Summary Table
| Property | Condition | Injective? | Surjective? | Invertible? | Example |
|---|---|---|---|---|---|
| Injective only | Each output hit at most once | YES | NO | Left inverse only | $f: \mathbb{R} \to \mathbb{R}$, $f(x) = e^x$ |
| Surjective only | Each output hit at least once | NO | YES | Right inverse only | $f: \mathbb{R} \to \mathbb{R}$, $f(x) = x^3 - x$ |
| Bijective | Each output hit exactly once | YES | YES | Two-sided inverse | $f: \mathbb{R} \to \mathbb{R}$, $f(x) = 2x + 1$ |
| Neither | Neither condition holds | NO | NO | None | $f: \mathbb{R} \to \mathbb{R}$, $f(x) = x^2$ |
3.5 Monotone Functions
Definition. A function $f: \mathbb{R} \to \mathbb{R}$ is:
- Strictly increasing: $x_1 < x_2 \implies f(x_1) < f(x_2)$
- Non-decreasing (weakly increasing): $x_1 < x_2 \implies f(x_1) \le f(x_2)$
- Strictly decreasing: $x_1 < x_2 \implies f(x_1) > f(x_2)$
- Non-increasing (weakly decreasing): $x_1 < x_2 \implies f(x_1) \ge f(x_2)$
A function is monotone if it is either non-decreasing or non-increasing.
Key theorem. Every strictly monotone function is injective.
Proof. Suppose $f$ is strictly increasing and $f(x_1) = f(x_2)$. If $x_1 < x_2$, then $f(x_1) < f(x_2)$ - contradiction. If $x_1 > x_2$, then $f(x_1) > f(x_2)$ - contradiction. So $x_1 = x_2$. QED
Monotonicity in AI:
| Function | Monotone? | Type | Consequence |
|---|---|---|---|
| Sigmoid | Strictly increasing | On all of $\mathbb{R}$ | Injective; preserves input order |
| Tanh | Strictly increasing | On all of $\mathbb{R}$ | Injective; preserves order |
| ReLU | Non-decreasing (not strict) | Flat for $x \le 0$ | Not injective; loses sign info |
| GELU | NOT monotone | Dip near $x \approx -0.75$ | Not order-preserving |
| Leaky ReLU ($\alpha > 0$) | Strictly increasing | On all of $\mathbb{R}$ | Injective; all info preserved |
| SiLU/Swish | NOT monotone | Dip near $x \approx -1.3$ | Similar to GELU |
| Cross-entropy | Strictly decreasing in $p_{\text{correct}}$ | Correct class prob | Higher confidence -> lower loss |
Why monotonicity matters for optimisation. If the loss function is monotone in some parameter direction, that direction has no local minima - the loss either always decreases or always increases. Non-monotone loss landscapes create local minima and saddle points that can trap gradient descent.
3.6 Periodic Functions
Definition. A function $f: \mathbb{R} \to \mathbb{R}$ is periodic with period $T > 0$ if:
$$f(x + T) = f(x) \quad \text{for all } x$$
The smallest such $T$ is the fundamental period.
Classical examples:
- $\sin(x)$ and $\cos(x)$: period $2\pi$
- $\tan(x)$: period $\pi$
- $e^{ix} = \cos x + i\sin x$: period $2\pi$ (Euler's formula)
Key property. Periodic functions on $\mathbb{R}$ are never injective (since $f(x) = f(x + T)$ with $x \ne x + T$). However, they can be injective when restricted to a suitable interval, e.g. $\sin$ restricted to $[-\pi/2, \pi/2]$.
Periodic functions in modern AI:
1. Sinusoidal Positional Encoding (Vaswani et al. 2017):
$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
Each dimension is a periodic function of position with a different frequency. Low dimensions have short periods (capture local position); high dimensions have long periods (capture global position). The periods range from $2\pi$ to $10000 \cdot 2\pi$.
Why sinusoidal? For any fixed offset $k$, $PE(pos + k)$ can be written as a linear function of $PE(pos)$: with $\omega_i = 10000^{-2i/d}$,
$$\begin{pmatrix} \sin(\omega_i(pos + k)) \\ \cos(\omega_i(pos + k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{pmatrix}$$
This means relative position information is accessible via a linear transformation - the model can learn to attend to "3 positions back" using a simple linear projection.
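A minimal NumPy implementation of the sinusoidal encoding above (function name and shapes chosen for the example):

```python
import numpy as np

def sinusoidal_pe(num_positions, d_model, base=10000.0):
    """Sinusoidal positional encoding: each column is a periodic function of position."""
    positions = np.arange(num_positions)[:, None]             # shape (n, 1)
    dims = np.arange(0, d_model, 2)[None, :]                   # even dimension indices 2i
    angles = positions / np.power(base, dims / d_model)        # one frequency per dimension pair
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(num_positions=128, d_model=64)
print(pe.shape)      # (128, 64)
print(pe[:8, 0])     # lowest dimension: oscillates quickly with position (short period)
print(pe[:8, -2])    # highest sin dimension: nearly constant over 8 positions (long period)
```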
2. Rotary Position Embedding (RoPE, Su et al. 2021):
RoPE rotates the query and key vectors by position-dependent angles:
$$q'_m = R_m q_m, \qquad k'_n = R_n k_n$$
where $R_m$ is a block-diagonal rotation matrix with angles $m\theta_i$, $\theta_i = 10000^{-2i/d}$. Each $2 \times 2$ block rotates one pair of coordinates by angle $m\theta_i$ - a periodic function of position with period $2\pi/\theta_i$.
Key property of RoPE: The dot product $\langle R_m q, R_n k \rangle$ depends only on $q$, $k$, and the relative position $m - n$ (not the absolute positions). This gives the model a natural notion of relative distance.
3. Fourier features in NeRF (Neural Radiance Fields):
Periodic functions at multiple frequencies encode position information at multiple scales, enabling the network to represent high-frequency details.
4. Function Composition
4.1 Definition and Notation
Definition. Given functions $f: A \to B$ and $g: B \to C$, the composition of $g$ with $f$ is the function:
$$g \circ f: A \to C, \qquad (g \circ f)(x) = g(f(x))$$
Read: "g composed with f" or "g after f". The function $f$ is applied first, then $g$. Note the order: in $g \circ f$, the rightmost function ($f$) is applied first.
Requirement. The composition $g \circ f$ is only defined when the codomain of $f$ matches the domain of $g$, i.e., the outputs of $f$ must be valid inputs for $g$.
f g
A ------> B ------> C
g \circ f
A ----------------> C
(g \circ f)(x) = g(f(x))
For AI. This requirement means the output dimension of layer $k$ must match the input dimension of layer $k+1$. Dimension mismatches between layers are among the most common neural network implementation bugs.
4.2 Associativity of Composition
Theorem. Function composition is associative: for $f: A \to B$, $g: B \to C$, $h: C \to D$:
$$h \circ (g \circ f) = (h \circ g) \circ f$$
Proof. For any $x \in A$:
$$(h \circ (g \circ f))(x) = h((g \circ f)(x)) = h(g(f(x))) = (h \circ g)(f(x)) = ((h \circ g) \circ f)(x)$$
Both sides equal $h(g(f(x)))$. Since this holds for all $x \in A$, the functions are equal by extensionality. QED
Consequence. We can write $h \circ g \circ f$ without ambiguity - no parentheses needed.
For AI. Associativity means we can group layers however we want. A 96-layer transformer can be viewed as:
- 96 individual layers
- Two groups of 48 layers
- Three groups of 32 layers
The computed function is the same regardless of grouping. This is why we can meaningfully talk about "the first 10 layers" or "the last 3 layers" as coherent sub-computations. It also allows pipeline parallelism: split a model across GPUs at any layer boundary, and the overall function is unchanged.
4.3 The Identity Function
Definition. The identity function on set $A$ is:
$$\mathrm{id}_A: A \to A, \qquad \mathrm{id}_A(x) = x$$
Properties:
- $f \circ \mathrm{id}_A = f$ for any $f: A \to B$ (right identity)
- $\mathrm{id}_B \circ f = f$ for any $f: A \to B$ (left identity)
Proof of right identity. For any $x \in A$: $(f \circ \mathrm{id}_A)(x) = f(\mathrm{id}_A(x)) = f(x)$. QED
For AI - Residual Connections. The identity function appears in residual connections (He et al. 2016):
$$y = x + F(x)$$
The output is the input plus a learned residual. If $F(x) \approx 0$, the block approximates the identity. This design allows the network to learn modifications to the identity, rather than learning the entire transformation from scratch.
Why this helps gradient flow: The Jacobian of a residual block is:
$$\frac{\partial y}{\partial x} = I + \frac{\partial F}{\partial x}$$
The identity term $I$ ensures gradients always flow through, even if $\frac{\partial F}{\partial x}$ is small. Without residual connections, gradients must pass through every layer multiplicatively - if any layer's Jacobian has small singular values, gradients vanish. With residual connections, there is always a "skip path" carrying the identity gradient.
Pre-LN Transformer block (GPT-2, LLaMA, etc.):
$$x \leftarrow x + \mathrm{Attn}(\mathrm{LN}(x)), \qquad x \leftarrow x + \mathrm{FFN}(\mathrm{LN}(x))$$
Two residual connections per block: one around attention, one around FFN. Each has the form $x \mapsto x + F(\mathrm{LN}(x))$.
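A small numerical sketch of the Jacobian claim above, using a toy residual branch $F(x) = \tanh(Wx)$ with a deliberately small $W$ (all names here are illustrative, not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W = 0.01 * rng.standard_normal((d, d))      # a deliberately "weak" residual branch

def residual_block(x):
    # y = x + F(x): the skip connection carries the identity
    return x + np.tanh(W @ x)

def numerical_jacobian(f, x, eps=1e-6):
    J = np.zeros((f(x).size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size); e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

x = rng.standard_normal(d)
J = numerical_jacobian(residual_block, x)
# Even though F is tiny, the Jacobian stays close to the identity: gradients always have a path through.
print(np.round(J, 3))
print(np.allclose(J, np.eye(d), atol=0.05))   # True for this small W
```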
4.4 Composition and Function Properties
Theorem. Composition preserves injectivity, surjectivity, and bijectivity:
| If is... | and is... | then is... | Proof sketch |
|---|---|---|---|
| Injective | Injective | Injective | $g(f(x_1)) = g(f(x_2)) \implies f(x_1) = f(x_2) \implies x_1 = x_2$ |
| Surjective | Surjective | Surjective | For $c \in C$: $g$ surjective $\implies \exists b$ with $g(b) = c$; $f$ surjective $\implies \exists a$ with $f(a) = b$ |
| Bijective | Bijective | Bijective | Combine both rows above |
Full proof of injectivity preservation. Suppose $f: A \to B$ and $g: B \to C$ are both injective. Let $(g \circ f)(x_1) = (g \circ f)(x_2)$, i.e., $g(f(x_1)) = g(f(x_2))$. Since $g$ is injective, $f(x_1) = f(x_2)$. Since $f$ is injective, $x_1 = x_2$. QED
Converse results (partial converses):
- If $g \circ f$ is injective, then $f$ must be injective (but $g$ need not be).
- If $g \circ f$ is surjective, then $g$ must be surjective (but $f$ need not be).
Proof that $g \circ f$ injective implies $f$ injective. Suppose $f(x_1) = f(x_2)$. Then $g(f(x_1)) = g(f(x_2))$, i.e., $(g \circ f)(x_1) = (g \circ f)(x_2)$. Since $g \circ f$ is injective, $x_1 = x_2$. QED
For AI. If every layer is injective, the entire network is injective (no information loss). If any single layer is non-injective, the composition is non-injective. Since ReLU is non-injective (maps all negatives to 0), standard ReLU networks are non-injective - they destroy information about the sign of pre-activations.
4.5 Neural Networks as Function Composition
A deep neural network is exactly a composition of simpler functions. For a transformer with $L$ layers:
$$f_{\text{model}} = f_{\text{head}} \circ f_L \circ f_{L-1} \circ \cdots \circ f_2 \circ f_1 \circ f_{\text{embed}}$$
where each transformer layer $f_\ell$ is itself a composition of an attention sub-block and an FFN sub-block, each wrapped in a residual connection:
$$f_\ell = f_{\text{FFN-residual}} \circ f_{\text{attn-residual}}$$
expanding further:
$$f_\ell(x) = x' + \mathrm{FFN}(\mathrm{LN}(x')), \qquad \text{where } x' = x + \mathrm{Attn}(\mathrm{LN}(x))$$
The total function is a composition of hundreds of sub-functions, each parameterised by learnable weights. All are differentiable (except sampling at inference), enabling end-to-end gradient-based training via the chain rule.
Counting the depth. A single transformer layer involves roughly:
- 2 layer norms (each: subtract mean, divide by std, scale, shift)
- 4 linear projections (Q, K, V, O)
- 1 softmax
- 2 more linear projections (FFN up and down)
- 1 activation function
- 2 residual additions
That is approximately 12 elementary functions per transformer layer. A 96-layer model therefore composes on the order of $96 \times 12 \approx 1{,}150$ elementary functions. The fact that gradient-based optimisation works over such deep compositions is remarkable and depends critically on residual connections and careful normalisation.
4.6 Non-Commutativity of Composition
Composition is NOT commutative. In general, $g \circ f \ne f \circ g$.
Example. Let $f(x) = x + 1$ and $g(x) = 2x$:
$$(g \circ f)(x) = 2(x + 1) = 2x + 2, \qquad (f \circ g)(x) = 2x + 1$$
These are different: for instance $(g \circ f)(3) = 8$ but $(f \circ g)(3) = 7$. (A numeric check follows the list below.)
When does composition commute? Two functions $f$ and $g$ commute (i.e., $g \circ f = f \circ g$) in special cases:
- $f = \mathrm{id}$ or $g = \mathrm{id}$ (trivially)
- $f$ and $g$ are both powers of the same function: $h^m \circ h^n = h^n \circ h^m = h^{m+n}$
- $f$ and $g$ are both linear: if $f(x) = Ax$ and $g(x) = Bx$, then $g \circ f = f \circ g$ iff $AB = BA$ (matrices commute)
- Diagonal matrices always commute with each other
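A short check of both claims - the scalar example and the matrix condition (the particular shear matrices are chosen just for the demo):

```python
import numpy as np

f = lambda x: x + 1.0     # shift
g = lambda x: 2.0 * x     # scale

print(g(f(3.0)))   # (g o f)(3) = 2*(3+1) = 8.0
print(f(g(3.0)))   # (f o g)(3) = 2*3+1  = 7.0 -> composition does not commute

# Linear maps commute exactly when the matrices commute.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[1.0, 0.0], [1.0, 1.0]])
print(np.allclose(A @ B, B @ A))              # False: these two shears do not commute
D1, D2 = np.diag([2.0, 3.0]), np.diag([5.0, 7.0])
print(np.allclose(D1 @ D2, D2 @ D1))          # True: diagonal matrices always commute
```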
For AI. The order of operations matters enormously:
| Ordering | Effect |
|---|---|
| Pre-LN: $x + F(\mathrm{LN}(x))$ | More stable training; used in GPT-2, LLaMA, most modern LLMs |
| Post-LN: $\mathrm{LN}(x + F(x))$ | Original Transformer; training can be unstable without warmup |
| BatchNorm before activation | Standard in CNNs |
| BatchNorm after activation | Different normalisation statistics; rarely used |
| Dropout before softmax | Drops attention logits; called "attention dropout" |
| Dropout after softmax | Drops attention weights; different regularisation effect |
The shift from Post-LN to Pre-LN significantly improved training stability for deep transformers - same components, different composition order. This is a direct consequence of non-commutativity.
4.7 Iterated Composition and Fixed Points
Definition. The $n$-fold composition of $f: X \to X$ with itself:
$$f^n = \underbrace{f \circ f \circ \cdots \circ f}_{n \text{ times}}$$
with $f^0 = \mathrm{id}_X$.
Fixed point. A point $x^* \in X$ is a fixed point of $f$ if $f(x^*) = x^*$. Fixed points are invariant under iteration: $f^n(x^*) = x^*$ for all $n$.
Banach Fixed Point Theorem (Contraction Mapping Theorem). Let $(X, d)$ be a complete metric space and $f: X \to X$ a contraction, meaning there exists $q \in [0, 1)$ such that:
$$d(f(x), f(y)) \le q \, d(x, y) \quad \text{for all } x, y \in X$$
Then:
- $f$ has a unique fixed point $x^*$
- For any starting point $x_0$, the sequence $x_{n+1} = f(x_n)$ converges to $x^*$
- Rate of convergence: $d(x_n, x^*) \le \frac{q^n}{1 - q} \, d(x_1, x_0)$
Proof sketch. The sequence $(x_n)$ is Cauchy because $d(x_{n+1}, x_n) \le q^n d(x_1, x_0)$. Since $X$ is complete, it converges to some $x^*$. By continuity of $f$: $f(x^*) = f(\lim x_n) = \lim f(x_n) = \lim x_{n+1} = x^*$. Uniqueness: if $f(x^*) = x^*$ and $f(y^*) = y^*$, then $d(x^*, y^*) = d(f(x^*), f(y^*)) \le q \, d(x^*, y^*)$, so $(1 - q) \, d(x^*, y^*) \le 0$, giving $x^* = y^*$. QED
For AI: Deep Equilibrium Models (DEQ, Bai et al. 2019). Instead of stacking $L$ copies of a layer (using $O(L)$ memory for activations), DEQ finds the fixed point of a single layer:
$$z^* = f_\theta(z^*, x)$$
This is equivalent to an infinitely deep weight-tied network that has converged.
| Property | Standard Deep Network | DEQ |
|---|---|---|
| Memory | $O(L)$ - store all activations | $O(1)$ - store only $z^*$ |
| Computation | $L$ forward passes | Iterate until convergence |
| Backprop | Through all layers | Implicit differentiation at the fixed point |
| Convergence guarantee | N/A | Guaranteed if $f_\theta$ is a contraction |
The Banach theorem guarantees convergence if the layer is a contraction - which can be enforced via spectral normalisation (constraining $\|W\|_2 < 1$).
Example: iterating cosine. $f(x) = \cos(x)$ on $[0, 1]$ is a contraction (since $|f'(x)| = |\sin x| \le \sin(1) \approx 0.84 < 1$ there).
Starting from $x_0 = 1$: $x_1 = 0.5403$, $x_2 = 0.8576$, $x_3 = 0.6543$, $\ldots$
- Converges to the Dottie number: $x^* \approx 0.739085$, where $\cos(x^*) = x^*$.
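The iteration takes three lines of NumPy:

```python
import numpy as np

x = 1.0
for _ in range(50):
    x = np.cos(x)                     # iterate x_{n+1} = cos(x_n)
print(x)                              # ~0.7390851332 (the Dottie number)
print(np.isclose(np.cos(x), x))       # True: a fixed point of cos
```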
5. Inverse Functions
5.1 Left and Right Inverses
Not every function has a full (two-sided) inverse. The general picture involves left and right inverses:
Definition. For :
- A left inverse of $f$ is a function $g: B \to A$ such that $g \circ f = \mathrm{id}_A$ (i.e., $g(f(a)) = a$ for all $a \in A$)
- A right inverse of $f$ is a function $h: B \to A$ such that $f \circ h = \mathrm{id}_B$ (i.e., $f(h(b)) = b$ for all $b \in B$)
- A two-sided inverse is a function that is both a left and right inverse
Key theorems connecting inverses to injectivity/surjectivity:
| Statement | Condition |
|---|---|
| has a left inverse | is injective |
| has a right inverse | is surjective (requires Axiom of Choice) |
| has a two-sided inverse | is bijective |
| If both left and right inverses exist | They are equal and unique |
Proof that injective implies left inverse exists. Suppose $f: A \to B$ is injective with $A \ne \emptyset$. Fix some $a_0 \in A$. Define $g: B \to A$ by:
$$g(b) = \begin{cases} a & \text{if } b = f(a) \text{ for some } a \in A \text{ (unique by injectivity)} \\ a_0 & \text{if } b \notin \mathrm{range}(f) \end{cases}$$
Then for any $a \in A$: $g(f(a)) = a$ (by the first case), so $g \circ f = \mathrm{id}_A$. QED
Proof that left and right inverse, if both exist, are equal. Suppose $g$ is a left inverse ($g \circ f = \mathrm{id}_A$) and $h$ is a right inverse ($f \circ h = \mathrm{id}_B$). Then:
$$g = g \circ \mathrm{id}_B = g \circ (f \circ h) = (g \circ f) \circ h = \mathrm{id}_A \circ h = h$$
So $g = h$. QED
For AI. The distinction between left and right inverses appears in:
- Encoder-decoder architectures: The encoder $E$ maps high-dimensional input to a low-dimensional bottleneck. The decoder $D$ maps back. We want $D(E(x)) \approx x$ (reconstruct the original), making $D$ an approximate left inverse of $E$. But $E(D(z)) \ne z$ in general (not every bottleneck vector corresponds to a real input).
- Projection layers: A dimension-reducing map $\mathbb{R}^n \to \mathbb{R}^m$ ($m < n$, full rank) is surjective but not injective, so it has a right inverse (the pseudo-inverse provides one) but no left inverse - collapsed dimensions cannot be recovered.
5.2 The Inverse Function
Definition. If $f: A \to B$ is bijective, the inverse function $f^{-1}: B \to A$ is defined by:
$$f^{-1}(b) = a \iff f(a) = b$$
Properties of inverse functions:
| Property | Statement |
|---|---|
| Composition with original | $f^{-1} \circ f = \mathrm{id}_A$ and $f \circ f^{-1} = \mathrm{id}_B$ |
| Inverse of inverse | $(f^{-1})^{-1} = f$ |
| Bijectivity | $f^{-1}$ is itself bijective |
| Graph reflection | The graph of $f^{-1}$ is the graph of $f$ reflected across the line $y = x$ |
| Continuity | If $f$ is continuous and bijective (on appropriate spaces), $f^{-1}$ is continuous |
Common inverse pairs used in AI:
| Function | Inverse | Domain of Inverse | AI Use |
|---|---|---|---|
| $e^x$ | $\ln(y)$ | $(0, \infty)$ | Log-probabilities; LogSoftmax |
| $\sigma(x) = \frac{1}{1 + e^{-x}}$ | $\mathrm{logit}(p) = \ln\frac{p}{1-p}$ | $(0, 1)$ | Logit-space operations; probability calibration |
| $x \mapsto Ax$ ($A$ invertible) | $y \mapsto A^{-1}y$ | $\mathbb{R}^n$ | Solving linear systems; whitening transforms |
| $x \mapsto \frac{x - \mu}{\sigma}$ ($\sigma > 0$) | $z \mapsto \sigma z + \mu$ | $\mathbb{R}$ | Undoing normalisation; de-standardisation |
| $\tanh(x)$ | $\mathrm{artanh}(y) = \frac{1}{2}\ln\frac{1+y}{1-y}$ | $(-1, 1)$ | Inverse activation in normalising flows |
5.3 Inverse of a Composition
Theorem (Socks-and-Shoes Lemma). If $f: A \to B$ and $g: B \to C$ are both bijective, then:
$$(g \circ f)^{-1} = f^{-1} \circ g^{-1}$$
The inverse of a composition reverses the order.
Proof. We verify that $f^{-1} \circ g^{-1}$ is the inverse of $g \circ f$:
$$(f^{-1} \circ g^{-1}) \circ (g \circ f) = f^{-1} \circ (g^{-1} \circ g) \circ f = f^{-1} \circ \mathrm{id}_B \circ f = f^{-1} \circ f = \mathrm{id}_A$$
Similarly: $(g \circ f) \circ (f^{-1} \circ g^{-1}) = \mathrm{id}_C$. QED
Name origin. You put on socks first, then shoes. To undo: you remove shoes first, then socks. The inverse reverses the order.
Generalisation. For bijections $f_1, f_2, \ldots, f_K$:
$$(f_K \circ \cdots \circ f_2 \circ f_1)^{-1} = f_1^{-1} \circ f_2^{-1} \circ \cdots \circ f_K^{-1}$$
For AI: Normalising Flows. A normalising flow is a composition of bijections:
$$f = f_K \circ f_{K-1} \circ \cdots \circ f_1$$
By the socks-and-shoes lemma, the inverse is:
$$f^{-1} = f_1^{-1} \circ f_2^{-1} \circ \cdots \circ f_K^{-1}$$
Sampling (generation): start with noise $z$ and apply $f$ (forward pass through all layers in order $f_1, f_2, \ldots, f_K$).
Likelihood computation (training): given data $x$, apply $f^{-1}$ (inverse pass through layers in reverse order $f_K^{-1}, \ldots, f_1^{-1}$), and use the change-of-variables formula:
$$\log p_X(x) = \log p_Z(f^{-1}(x)) + \sum_{i=1}^{K} \log\left|\det J_{f_i^{-1}}\right|$$
The order matters: forward pass and inverse pass traverse the layers in opposite directions.
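A tiny sketch of the order-reversal rule, using two one-dimensional bijections as stand-ins for flow layers (the functions are chosen only for the demo):

```python
import numpy as np

f1 = lambda x: 2.0 * x + 1.0;   f1_inv = lambda y: (y - 1.0) / 2.0
f2 = np.tanh;                    f2_inv = np.arctanh

forward = lambda z: f2(f1(z))               # f = f2 o f1
inverse = lambda x: f1_inv(f2_inv(x))       # f^{-1} = f1^{-1} o f2^{-1}  (order reversed)
wrong   = lambda x: f2_inv(f1_inv(x))       # same pieces, wrong order

z = 0.3
x = forward(z)
print(np.isclose(inverse(x), z))   # True: socks-and-shoes order recovers z
print(np.isclose(wrong(x), z))     # False: order matters
```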
5.4 Pseudo-Inverse (Moore-Penrose)
When is not bijective (e.g., a non-square matrix, or a square but singular matrix), no true inverse exists. The Moore-Penrose pseudo-inverse provides the "best substitute".
Definition. For a matrix $A \in \mathbb{R}^{m \times n}$, the pseudo-inverse $A^+ \in \mathbb{R}^{n \times m}$ is the unique matrix satisfying the four Penrose conditions:
- $A A^+ A = A$ (acts like an inverse when it can)
- $A^+ A A^+ = A^+$ (acts like an inverse when it can)
- $(A A^+)^\top = A A^+$ (symmetric projection onto the column space)
- $(A^+ A)^\top = A^+ A$ (symmetric projection onto the row space)
Via SVD. If $A = U \Sigma V^\top$ (singular value decomposition), then:
$$A^+ = V \Sigma^+ U^\top$$
where $\Sigma^+$ is obtained by taking the reciprocal of each non-zero singular value and transposing. If $\sigma_i = 0$, then $\sigma_i^+ = 0$.
Geometric interpretation. $x = A^+ b$ gives the minimum-norm least-squares solution to $Ax = b$ (see the NumPy check after the table below):
- If $Ax = b$ has a unique solution: $A^+ b$ is that solution (same as $A^{-1}b$ when $A$ is invertible).
- If $Ax = b$ has multiple solutions (underdetermined): $A^+ b$ is the solution with smallest $\|x\|_2$.
- If $Ax = b$ has no exact solution (overdetermined): $A^+ b$ minimises $\|Ax - b\|_2$, and among all minimisers, it is the one with smallest $\|x\|_2$.
Special cases:
| Matrix Type | Pseudo-Inverse |
|---|---|
| Square invertible ($m = n$, full rank) | $A^+ = A^{-1}$ |
| Tall full-rank ($m > n$, $\mathrm{rank} = n$) - overdetermined | $A^+ = (A^\top A)^{-1} A^\top$ (a left inverse) |
| Wide full-rank ($m < n$, $\mathrm{rank} = m$) - underdetermined | $A^+ = A^\top (A A^\top)^{-1}$ (a right inverse) |
| Rank-deficient | Use the SVD formula $A^+ = V \Sigma^+ U^\top$ |
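NumPy exposes the Moore-Penrose pseudo-inverse as `np.linalg.pinv`; the snippet below checks the two geometric claims (least-squares in the overdetermined case, minimum-norm exact solution in the underdetermined case):

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined: more equations than unknowns (tall A) -> least-squares solution.
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)
x = np.linalg.pinv(A) @ b
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # True

# Underdetermined: wide A -> pinv picks the minimum-norm solution among infinitely many.
A = rng.standard_normal((3, 10))
b = rng.standard_normal(3)
x = np.linalg.pinv(A) @ b
print(np.allclose(A @ x, b))   # True: an exact solution
print(np.linalg.norm(x))       # smallest possible norm among all exact solutions
```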
For AI. The pseudo-inverse appears in:
- Linear regression: closed-form solution $\hat{\beta} = X^+ y$ (which equals $(X^\top X)^{-1}X^\top y$ when $X$ has full column rank)
- LoRA (Low-Rank Adaptation): weight update $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, with $r \ll \min(d, k)$; analysing the effective update involves pseudo-inverses of the low-rank factors
- Weight initialisation analysis: understanding the effective rank and conditioning of weight matrices
- Gradient computation: when computing gradients through rank-deficient Jacobians
5.5 Invertibility and Information Preservation
Theorem. A function is bijective (invertible) if and only if it preserves all information - i.e., the input can be recovered from the output.
Linear case. For a linear map $f(x) = Ax$ with $A \in \mathbb{R}^{m \times n}$:
| Property | Condition | Meaning |
|---|---|---|
| Injective | $\ker(A) = \{0\}$; $\mathrm{rank}(A) = n$; columns linearly independent | Nothing in the kernel; no information destroyed |
| Surjective | $\mathrm{rank}(A) = m$; columns span $\mathbb{R}^m$ | Full range; all outputs reachable |
| Bijective | $m = n$ and $\mathrm{rank}(A) = n$ ($\det A \ne 0$) | Square and full rank |
The dimension argument. By the rank-nullity theorem ($\mathrm{rank}(A) + \dim\ker(A) = n$):
- If $m < n$ (more inputs than outputs): $\mathrm{rank}(A) \le m < n$, so $\dim\ker(A) > 0$, so $f$ is NOT injective. The layer necessarily compresses.
- If $m > n$ (fewer inputs than outputs): $\mathrm{rank}(A) \le n < m$, so $f$ is NOT surjective. Not all outputs are reachable.
- Only when $m = n$ can $f$ be bijective (and even then, only if $\mathrm{rank}(A) = n$).
Consequences for neural network architecture:
Information Flow in a Network
d=768 d=768 d=768 d=768
+--------+ +--------+ +--------+ +--------+
| Layer 1|--->| Layer 2|--->| Layer 3|--->| Layer 4|
+--------+ +--------+ +--------+ +--------+
Same dimension throughout -> bijection is POSSIBLE
(Transformer: d_model stays constant through all layers)
d=784 d=256 d=64 d=10
+--------+ +--------+ +--------+ +--------+
|Compress|--->|Compress|--->|Compress|--->| Output |
+--------+ +--------+ +--------+ +--------+
Decreasing dimension -> NOT injective -> information lost
(Classifier: must compress to number of classes)
Autoencoders exploit this: the encoder maps high dimension to low (non-injective, compresses, loses information), and the decoder maps low to high (non-surjective, reconstructs). The bottleneck forces the network to learn a compact representation preserving the most important information.
Invertible neural networks (iRevNet, Glow, NICE) maintain throughout and use carefully designed bijective layers (coupling layers, additive/affine transforms), enabling exact likelihood computation and perfect reconstruction.
6. Special Classes of Functions
6.1 Linear Functions
Definition. A function $f: V \to W$ between vector spaces is linear if it preserves vector addition and scalar multiplication:
- Additivity: $f(x + y) = f(x) + f(y)$ for all $x, y \in V$
- Homogeneity: $f(cx) = c\,f(x)$ for all $c \in \mathbb{R}$, $x \in V$
These two conditions can be combined into one: $f(\alpha x + \beta y) = \alpha f(x) + \beta f(y)$ (preservation of linear combinations).
Key consequences:
- $f(0) = 0$ (the zero vector must map to the zero vector)
- $f$ is completely determined by its action on a basis: if $\{e_1, \ldots, e_n\}$ is a basis for $V$, then $f(x) = f\!\left(\sum_i x_i e_i\right) = \sum_i x_i f(e_i)$
- Every linear function $f: \mathbb{R}^n \to \mathbb{R}^m$ can be represented as matrix multiplication: $f(x) = Ax$ where $A \in \mathbb{R}^{m \times n}$
The kernel (null space). $\ker(f) = \{x \in V : f(x) = 0\}$ - the set of vectors that $f$ maps to zero.
Rank-Nullity Theorem. For $f: V \to W$ with $V$ finite-dimensional:
$$\dim(\mathrm{range}(f)) + \dim(\ker(f)) = \dim(V)$$
This is a conservation law: dimension is neither created nor destroyed, just redistributed between the range and the kernel.
For AI. Linear layers (without bias) are the backbone of neural networks:
- The attention Q, K, V projections are linear maps: $Q = XW_Q$, $K = XW_K$, $V = XW_V$
- The FFN up/down projections are linear maps
- The LM head (hidden state -> logits) is a linear map
- LoRA approximates weight updates with low-rank linear maps: $\Delta W = BA$ where $\mathrm{rank}(\Delta W) \le r$
Without nonlinear activation functions, composing any number of linear layers gives a single linear layer: $W_2(W_1 x) = (W_2 W_1)x$, and $W_2 W_1$ is just another matrix. This is why activations are essential - they break the linearity; see the sketch below.
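A minimal NumPy check of the collapse claim, and of how a single ReLU breaks it:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2, W3 = (rng.standard_normal((8, 8)) for _ in range(3))
x = rng.standard_normal(8)

# Three stacked linear layers (no activations)...
deep = W3 @ (W2 @ (W1 @ x))
# ...are exactly one linear layer with matrix W3 W2 W1.
single = (W3 @ W2 @ W1) @ x
print(np.allclose(deep, single))   # True: depth buys nothing without nonlinearity

# Insert a nonlinearity between layers and the collapse no longer happens.
relu = lambda v: np.maximum(0.0, v)
print(np.allclose(W3 @ relu(W2 @ relu(W1 @ x)), single))   # False (generically)
```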
6.2 Affine Functions
Definition. A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is affine if:
$$f(x) = Ax + b$$
where $A \in \mathbb{R}^{m \times n}$ is a matrix and $b \in \mathbb{R}^m$ is a bias vector.
Affine = linear + translation. An affine function is a linear function followed by a translation (shift by $b$). It preserves affine combinations: $f\!\left(\sum_i \lambda_i x_i\right) = \sum_i \lambda_i f(x_i)$ whenever $\sum_i \lambda_i = 1$ (preserves weighted averages where weights sum to 1).
Key difference from linear:
- Linear: $f(0) = 0$ always. Lines through the origin map to lines through the origin.
- Affine: $f(0) = b$, which is non-zero in general. Lines map to lines, but the origin can shift.
Composition of affine functions is affine:
$$g(f(x)) = A_2(A_1 x + b_1) + b_2 = (A_2 A_1)x + (A_2 b_1 + b_2)$$
Result is affine with matrix $A_2 A_1$ and bias $A_2 b_1 + b_2$.
Critical consequence. Composing any number of affine functions (without nonlinear activations) gives a single affine function. A 100-layer network with only affine layers is equivalent to a single affine layer. This is why nonlinear activation functions are essential - they prevent the entire network from collapsing into a single affine transformation.
For AI. Nearly every "linear layer" in a neural network is actually affine: . The bias term allows the function to shift the output away from the origin, which is necessary for representing functions that do not pass through zero. Some architectures omit the bias (e.g., some attention projections in LLaMA), but most include it.
6.3 Multilinear Functions
Definition. A function $f: V_1 \times V_2 \times \cdots \times V_k \to W$ is multilinear ($k$-linear) if it is linear in each argument separately, with the others held fixed:
$$f(x_1, \ldots, \alpha x_i + \beta x_i', \ldots, x_k) = \alpha f(x_1, \ldots, x_i, \ldots, x_k) + \beta f(x_1, \ldots, x_i', \ldots, x_k)$$
The most important case is bilinear ($k = 2$):
$$f(\alpha x + \beta x', y) = \alpha f(x, y) + \beta f(x', y), \qquad f(x, \alpha y + \beta y') = \alpha f(x, y) + \beta f(x, y')$$
Key property. A bilinear function is linear in each argument but NOT linear overall:
$$f(2x, 2y) = 4 f(x, y) \ne 2 f(x, y)$$
Common bilinear operations in AI:
| Operation | Bilinear Map | Variables |
|---|---|---|
| Dot product | $(x, y) \mapsto x^\top y$ | Linear in $x$, linear in $y$ |
| Matrix-vector product | $(A, x) \mapsto Ax$ | Linear in $A$, linear in $x$ |
| Attention scores | $(q, k) \mapsto q^\top k$ | Linear in $q$, linear in $k$ |
| Outer product | $(x, y) \mapsto xy^\top$ | Linear in $x$, linear in $y$ |
| Bilinear form | $(x, y) \mapsto x^\top A y$ | Linear in $x$, linear in $y$ |
For AI: Attention is bilinear (before softmax). The raw attention score between query $q_i$ and key $k_j$ is:
$$s_{ij} = \frac{q_i^\top k_j}{\sqrt{d_k}}$$
This is bilinear: linear in $q_i$ (doubling $q_i$ doubles the score) and linear in $k_j$ (doubling $k_j$ doubles the score). The softmax that follows makes the full attention mechanism nonlinear.
6.4 Polynomial Functions
Definition. A polynomial function $p: \mathbb{R} \to \mathbb{R}$ has the form:
$$p(x) = a_n x^n + a_{n-1} x^{n-1} + \cdots + a_1 x + a_0$$
The degree of the polynomial is $n$ (the highest power with non-zero coefficient).
Weierstrass Approximation Theorem. Every continuous function on a closed interval $[a, b]$ can be uniformly approximated by polynomials. That is, for any continuous $f: [a, b] \to \mathbb{R}$ and any $\epsilon > 0$, there exists a polynomial $p$ with $|f(x) - p(x)| < \epsilon$ for all $x \in [a, b]$.
This is a classical universal approximation result - but it requires the degree (number of terms) to grow, and in general, the required degree grows quickly for complex functions.
Multivariate polynomials. A polynomial in $n$ variables:
$$p(x_1, \ldots, x_n) = \sum_{\alpha} c_\alpha \, x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_n^{\alpha_n}$$
where $\alpha = (\alpha_1, \ldots, \alpha_n)$ is a multi-index.
Connection to neural networks (depth separation). Telgarsky (2016) showed that there exist functions expressible by deep ReLU networks with polynomial width that require exponential width when computed by much shallower networks. This is analogous to the fact that high-degree polynomials can represent functions that low-degree polynomials cannot.
Why neural networks don't use polynomial activations. If the activation function were a polynomial of degree $d$, then composing $L$ affine layers with polynomial activations gives a polynomial of degree $d^L$. While high-degree polynomials are universal approximators, they have catastrophic extrapolation behaviour: like $x^n$, they explode for $|x| > 1$ and collapse towards 0 for $|x| < 1$. ReLU networks avoid this by producing piecewise linear functions, which extrapolate more gracefully.
6.5 Activation Functions
Activation functions are the nonlinear components that give neural networks their expressive power. Without them, any deep network collapses to a single affine transformation.
Detailed analysis of key activation functions:
Sigmoid: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
| Property | Value |
|---|---|
| Range | $(0, 1)$ |
| Derivative | $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ |
| Max derivative | $1/4$ at $x = 0$ |
| Monotone | Strictly increasing |
| Injective | Yes |
| Lipschitz constant | $1/4$ |
| Bounded | Yes |
| Symmetric | About $(0, \tfrac{1}{2})$: $\sigma(-x) = 1 - \sigma(x)$ |
| Saturation | Gradients vanish for $|x| \gtrsim 5$ |
Problem. Max derivative is $1/4$. After $n$ sigmoid layers, the gradient contribution from the activations is at most $(1/4)^n$. For $n = 10$: $(1/4)^{10} \approx 10^{-6}$. This is the vanishing gradient problem - the reason sigmoid was abandoned as a hidden-layer activation in deep networks; see the numerical check below.
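A quick NumPy check of the best-case bound (every pre-activation sitting at 0, where the sigmoid derivative is largest):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
dsigmoid = lambda x: sigmoid(x) * (1.0 - sigmoid(x))   # maximum value 0.25 at x = 0

for n in [1, 5, 10, 20]:
    print(n, dsigmoid(0.0) ** n)   # 0.25, ~9.8e-04, ~9.5e-07, ~9.1e-13
```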
Tanh: $\tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}$
| Property | Value |
|---|---|
| Range | $(-1, 1)$ |
| Derivative | $\tanh'(x) = 1 - \tanh^2(x)$ |
| Max derivative | $1$ at $x = 0$ |
| Monotone | Strictly increasing |
| Zero-centred | Yes: $\tanh(0) = 0$ and $\tanh(-x) = -\tanh(x)$ |
| Saturation | Still saturates for large $|x|$ |
ReLU (Rectified Linear Unit): $\mathrm{ReLU}(x) = \max(0, x)$
| Property | Value |
|---|---|
| Range | $[0, \infty)$ |
| Derivative | $1$ for $x > 0$; $0$ for $x < 0$; undefined at $x = 0$ |
| Monotone | Non-decreasing (not strict) |
| Injective | No (all negatives map to 0) |
| Lipschitz constant | $1$ |
| Bounded | No (unbounded above) |
| Computation | Extremely cheap: just a comparison |
| Dying ReLU | If pre-activation is always negative, gradient is always 0 - neuron is "dead" |
| Sparsity | Roughly 50% of neurons output 0, creating sparse representations |
Why ReLU dominates. Despite losing information (not injective) and not being differentiable at 0, ReLU works extremely well because:
- Gradient is exactly 1 for positive inputs - no vanishing gradient
- Creates piecewise linear functions - efficient to compute and optimise
- Induces sparsity - only active neurons contribute
- Extremely cheap to compute
GELU (Gaussian Error Linear Units, Hendrycks & Gimpel 2016): $\mathrm{GELU}(x) = x\,\Phi(x)$
where $\Phi$ is the standard normal CDF. Approximation: $\mathrm{GELU}(x) \approx 0.5x\left(1 + \tanh\!\left(\sqrt{2/\pi}\,(x + 0.044715x^3)\right)\right)$.
| Property | Value |
|---|---|
| Range | $\approx [-0.17, \infty)$ |
| Monotone | No (slight dip near $x \approx -0.75$) |
| Smooth | Yes ($C^\infty$ - infinitely differentiable) |
| Used in | GPT-2, BERT, most transformer models before the SwiGLU era |
SwiGLU (Shazeer 2020, used in LLaMA, PaLM, Gemini):
$$\mathrm{SwiGLU}(x) = \mathrm{Swish}(W_1 x) \odot (W_2 x)$$
where $\mathrm{Swish}(z) = z\,\sigma(z)$ (the Swish/SiLU activation) and $\odot$ is elementwise multiplication; a third matrix projects the gated result back down.
| Property | Value |
|---|---|
| Gated | Yes - one path gates the other via sigmoid |
| Parameters | Requires 3 weight matrices instead of 2 (50% more parameters per FFN) |
| Used in | LLaMA 1/2/3, PaLM, Gemini, Mistral, most 2023-2026 LLMs |
| Why preferred | Empirically better than GELU for large-scale language models |
Mish (Misra 2019): $\mathrm{Mish}(x) = x \tanh(\mathrm{softplus}(x)) = x \tanh(\ln(1 + e^x))$
| Property | Value |
|---|---|
| Smooth | Yes ($C^\infty$) |
| Non-monotone | Slight negative dip (similar to GELU) |
| Self-regularising | Bounded below, unbounded above |
| Used in | YOLOv4, some vision models |
Comprehensive comparison table:
| Activation | Formula | Range | Smooth? | Monotone? | Injective? | Lipschitz | Vanishing grad? |
|---|---|---|---|---|---|---|---|
| Sigmoid | $\frac{1}{1+e^{-x}}$ | $(0, 1)$ | Yes | Yes | Yes | $1/4$ | Yes (severe) |
| Tanh | $\frac{e^x - e^{-x}}{e^x + e^{-x}}$ | $(-1, 1)$ | Yes | Yes | Yes | $1$ | Yes (moderate) |
| ReLU | $\max(0, x)$ | $[0, \infty)$ | No (at $x = 0$) | Yes (weak) | No | $1$ | No (but dying neurons) |
| Leaky ReLU | $\max(\alpha x, x)$ | $\mathbb{R}$ | No (at $x = 0$) | Yes | Yes | $1$ | No |
| GELU | $x\,\Phi(x)$ | $\approx [-0.17, \infty)$ | Yes | No | No | $\approx 1.1$ | No |
| SiLU/Swish | $x\,\sigma(x)$ | $\approx [-0.28, \infty)$ | Yes | No | No | $\approx 1.1$ | No |
| Mish | $x\tanh(\ln(1+e^x))$ | $\approx [-0.31, \infty)$ | Yes | No | No | $\approx 1.1$ | No |
| Softplus | $\ln(1 + e^x)$ | $(0, \infty)$ | Yes | Yes | Yes | $1$ | No |
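A compact NumPy implementation of the activations in the table, useful for reproducing the ranges and the non-monotone dips (the GELU here uses the tanh approximation stated earlier; all helper names are local to this sketch):

```python
import numpy as np

x = np.linspace(-4, 4, 9)

sigmoid  = lambda x: 1.0 / (1.0 + np.exp(-x))
relu     = lambda x: np.maximum(0.0, x)
leaky    = lambda x, a=0.01: np.where(x > 0, x, a * x)
silu     = lambda x: x * sigmoid(x)                                     # a.k.a. Swish
softplus = lambda x: np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))    # numerically stable form
gelu     = lambda x: 0.5 * x * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3)))

for name, f in [("sigmoid", sigmoid), ("relu", relu), ("leaky", leaky),
                ("silu", silu), ("softplus", softplus), ("gelu", gelu)]:
    print(f"{name:9s}", np.round(f(x), 3))

# Non-monotonicity of SiLU/GELU: the output rises as the input decreases below the dip.
print(silu(-1.0), silu(-2.0))   # ~-0.269, ~-0.238: smaller input, larger output
```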
6.6 Convex Functions
Definition. A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if for all $x, y \in \mathbb{R}^n$ and $\lambda \in [0, 1]$:
$$f(\lambda x + (1 - \lambda)y) \le \lambda f(x) + (1 - \lambda)f(y)$$
Geometrically: the line segment connecting any two points on the graph lies above (or on) the graph. The function "curves upward".
f(x)
| /
| / Line segment (always above curve)
| */*
| / *
|/ * <- f curves below the line
/ * (this is convexity)
/----*--------- x
Three equivalent characterisations (for twice-differentiable $f$):
- Definition: $f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y)$ for all $\lambda \in [0, 1]$
- First-order condition: $f(y) \ge f(x) + \nabla f(x)^\top (y - x)$ (function lies above every tangent plane)
- Second-order condition: $\nabla^2 f(x) \succeq 0$ (Hessian is positive semi-definite everywhere)
Strict convexity: replace $\le$ with $<$ (for $x \ne y$ and $\lambda \in (0, 1)$). Strict convexity implies a unique global minimum (when a minimum exists).
Strong convexity: $f(y) \ge f(x) + \nabla f(x)^\top(y - x) + \frac{\mu}{2}\|y - x\|^2$ for some $\mu > 0$. Strong convexity implies:
- Unique global minimum
- Gradient descent (with a suitable step size) converges at a geometric rate - the error decays exponentially in the number of steps
- The Hessian satisfies $\nabla^2 f(x) \succeq \mu I$
Why convexity matters for optimisation:
| Property | Convex | Non-convex |
|---|---|---|
| Local minima | Every local minimum is global | Can have many local minima |
| Gradient descent | Guaranteed to converge to global min | May converge to local min or saddle |
| Uniqueness | Strict convex -> unique minimum | Multiple minima possible |
| Rate of convergence | $O(1/t)$ (or $O(1/t^2)$ with acceleration) | No general guarantees |
For AI. The loss landscape of neural networks is non-convex in the parameters $\theta$. However:
- Cross-entropy loss is convex in the logits $z$ (for fixed true labels): $L(z) = -z_y + \log\sum_j e^{z_j}$ is convex because log-sum-exp is convex
- The $L_2$ regularisation term $\lambda\|\theta\|^2$ is strongly convex in $\theta$
- Adding $L_2$ regularisation improves the conditioning of the loss landscape
Despite non-convexity, SGD finds good solutions in practice. This is one of the central mysteries of deep learning theory.
6.7 Lipschitz Functions
Definition. A function $f: X \to Y$ (between metric spaces) is Lipschitz continuous with constant $L \ge 0$ if:
$$d_Y(f(x_1), f(x_2)) \le L \cdot d_X(x_1, x_2) \quad \text{for all } x_1, x_2 \in X$$
In $\mathbb{R}^n$ with the Euclidean norm: $\|f(x_1) - f(x_2)\| \le L\|x_1 - x_2\|$.
Intuition. A Lipschitz function cannot change its output faster than $L$ times the change in input. The constant $L$ is a bound on the "slope" or "sensitivity" of the function.
The smallest valid $L$ is the Lipschitz constant:
$$L_f = \sup_{x_1 \ne x_2} \frac{\|f(x_1) - f(x_2)\|}{\|x_1 - x_2\|}$$
For differentiable functions: $L_f = \sup_x \|J_f(x)\|_2$ (supremum of the operator norm of the Jacobian).
Hierarchy:
$$\text{Contraction} \subset \text{Lipschitz} \subset \text{Uniformly continuous} \subset \text{Continuous}$$
(Contraction: $L < 1$. Lipschitz: some finite $L$.)
Lipschitz constants for common AI functions:
| Function | Lipschitz Constant | Notes |
|---|---|---|
| ReLU | $1$ | $\lvert\max(0, x) - \max(0, y)\rvert \le \lvert x - y\rvert$ |
| Sigmoid | $1/4$ | Max derivative at $x = 0$ |
| Tanh | $1$ | Max derivative at $x = 0$ |
| Softmax | $\le 1$ | In the $\ell_2$ norm |
| Linear $x \mapsto Ax$ | $\|A\|_2 = \sigma_{\max}(A)$ | Largest singular value |
| Layer Norm | No single global constant (approximately $O(1)$ on typical inputs) | Depends on the input distribution |
| Affine $x \mapsto Ax + b$ | $\|A\|_2$ | Bias does not affect the Lipschitz constant |
Lipschitz constant of a composition. If $f$ has Lipschitz constant $L_f$ and $g$ has Lipschitz constant $L_g$, then $g \circ f$ has Lipschitz constant at most $L_g L_f$:
$$\|g(f(x)) - g(f(y))\| \le L_g \|f(x) - f(y)\| \le L_g L_f \|x - y\|$$
For a deep network $f = f_n \circ \cdots \circ f_1$ with $n$ layers, each with Lipschitz constant $L_i$:
$$L_f \le \prod_{i=1}^{n} L_i$$
If each layer has Lipschitz constant $L_i > 1$, the bound grows exponentially with depth - a small input perturbation could be amplified exponentially. This is exactly the phenomenon behind adversarial examples: a tiny, imperceptible change to the input causes a large change in the output.
Spectral normalisation (Miyato et al. 2018). To control the Lipschitz constant of each layer, divide the weight matrix by its largest singular value:
$$\hat{W} = \frac{W}{\sigma_{\max}(W)}$$
This ensures $\sigma_{\max}(\hat{W}) = 1$, so each layer has Lipschitz constant $1$ (when composed with a 1-Lipschitz activation like ReLU), and the overall network has Lipschitz constant at most $1$.
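A minimal sketch of spectral normalisation using an explicit SVD (practical implementations usually estimate $\sigma_{\max}$ with power iteration, but the SVD makes the idea transparent):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))

sigma_max = np.linalg.svd(W, compute_uv=False)[0]   # largest singular value = Lipschitz constant of x -> Wx
W_sn = W / sigma_max                                 # spectrally normalised weight

print(sigma_max)                                     # roughly 16 for a 64x64 Gaussian matrix
print(np.linalg.svd(W_sn, compute_uv=False)[0])      # 1.0: the layer is now 1-Lipschitz

# Empirical check of the Lipschitz bound on a random pair of points.
x, y = rng.standard_normal(64), rng.standard_normal(64)
print(np.linalg.norm(W_sn @ x - W_sn @ y) <= np.linalg.norm(x - y) + 1e-12)   # True
```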
Applications of Lipschitz constraints in AI:
- Wasserstein GANs (WGAN): The discriminator must be 1-Lipschitz; enforced via spectral normalisation or gradient penalty
- Certified robustness: A network with Lipschitz constant $L$ changes its output by at most $L\epsilon$ under an input perturbation of size $\epsilon$; if the classification margin exceeds this bound, no adversarial example of that size can flip the prediction
- Training stability: Constraining layer Lipschitz constants prevents gradient explosion
- ODE-based models: Neural ODEs require Lipschitz dynamics for well-posedness (existence and uniqueness of solutions)
7. Higher-Order Functions and Functionals
7.1 Functions as Arguments
Definition. A higher-order function is a function that takes one or more functions as arguments, returns a function as output, or both.
In mathematical notation, if $\mathcal{F}$ is a set of functions, a higher-order function has a form such as $H: \mathcal{F} \to \mathcal{G}$ (function in, function out), $H: \mathcal{F} \to \mathbb{R}$ (function in, number out), or $H: X \to \mathcal{F}$ (value in, function out).
Formal framework. Let $B^A$ denote the set of all functions from $A$ to $B$. A higher-order function is simply a function whose domain or codomain includes such function spaces.
Examples in mathematics:
- Derivative operator: $D: C^1(\mathbb{R}) \to C^0(\mathbb{R})$, $D(f) = f'$. Takes a differentiable function, returns its derivative.
- Integration operator: $I: C([a, b]) \to \mathbb{R}$, $I(f) = \int_a^b f(x)\,dx$. Takes a continuous function, returns a number.
- Composition operator: $C_g(f) = g \circ f$. Takes a function, returns a function.
Examples in programming (Python/JAX):
```python
# map applies a function to each element
map(f, [x1, x2, x3])        # -> [f(x1), f(x2), f(x3)]

# JAX grad: takes a function, returns its gradient function
grad_f = jax.grad(f)        # grad: (R^n -> R)   ->  (R^n -> R^n)

# JAX vmap: takes a function, returns its batched version
batched_f = jax.vmap(f)     # vmap: (R^n -> R^m) ->  (R^{B x n} -> R^{B x m})
```
7.2 Functionals (Functions of Functions)
Definition. A functional is a function from a function space to the real numbers:
$$J: \mathcal{F} \to \mathbb{R}$$
where $\mathcal{F}$ is a set of functions. A functional maps an entire function to a single number.
Key examples:
| Functional | Definition | What It Measures |
|---|---|---|
| Definite integral | $J(f) = \int_a^b f(x)\,dx$ | Area under the curve |
| $L^p$ norm | $J(f) = \left(\int \lvert f(x)\rvert^p\,dx\right)^{1/p}$ | Overall size of the function |
| Evaluation | $J(f) = f(x_0)$ | Value at a specific point |
| Supremum norm | $J(f) = \sup_x \lvert f(x)\rvert$ | Largest value attained |
| Entropy | $J(p) = -\int p(x)\log p(x)\,dx$ | Uncertainty/information |
| KL divergence | $J(p) = \int p(x)\log\frac{p(x)}{q(x)}\,dx$ (for fixed $q$) | Distance between distributions |
Loss functions as functionals. In machine learning, the loss function is a functional on the space of models:

$$L(f_\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell\big(f_\theta(x_i), y_i\big)$$

This maps each model $f_\theta$ to a single number measuring how well $f_\theta$ fits the data. Training is the process of minimising this functional over the space of representable functions.
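A small sketch of this idea in Python: the hypothetical helper `mse_functional` takes a *function* as its argument and returns a single number, exactly as a functional should (the data and candidate models are arbitrary examples):

```python
import numpy as np

def mse_functional(f, xs, ys):
    """L(f): takes a function f and returns a single number."""
    preds = np.array([f(x) for x in xs])
    return float(np.mean((preds - ys) ** 2))

xs = np.linspace(-1, 1, 50)
ys = 3 * xs + 0.5

# Two candidate models (elements of the hypothesis class {x -> a*x + b})
f_good = lambda x: 3 * x + 0.5
f_bad = lambda x: -x

print(mse_functional(f_good, xs, ys))  # ~0.0
print(mse_functional(f_bad, xs, ys))   # much larger
```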
7.3 Operators (Functions on Function Spaces)
Definition. An operator is a function from one function space to another: $T : \mathcal{F} \to \mathcal{G}$.
Unlike functionals (which output numbers), operators output functions.
Linear operators. An operator $T$ is linear if $T(\alpha f + \beta g) = \alpha T(f) + \beta T(g)$. These are the infinite-dimensional analogue of matrices.
Key examples of operators:
| Operator | Definition | Type |
|---|---|---|
| Derivative | $T(f) = f'$ | Linear |
| Integration | $T(f)(x) = \int_a^x f(t)\,dt$ | Linear |
| Fourier transform | $T(f)(\omega) = \int f(x)\,e^{-i\omega x}\,dx$ | Linear |
| Convolution | $T(f) = f * g$ | Linear (for fixed $g$) |
| Composition | $T(f) = g \circ f$ | Generally nonlinear |
| Softmax (continuous form) | $T(f)(x) = e^{f(x)} / \int e^{f(t)}\,dt$ | Nonlinear |
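To make the operator idea concrete, here is a hedged NumPy sketch: a function is represented by its samples on a grid, and the derivative operator becomes a finite-difference matrix acting on those samples, which makes its linearity easy to verify (grid size and example functions are arbitrary choices):

```python
import numpy as np

# Represent a function by its samples on a grid; the derivative operator
# becomes a (central-difference) matrix acting on those samples -> it is linear.
n, h = 200, 1.0 / 200
x = np.linspace(0, 1, n)
D = (np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)) / (2 * h)

f = np.sin(2 * np.pi * x)
g = x ** 2

# Linearity: D(2f + 3g) == 2 D(f) + 3 D(g)
lhs = D @ (2 * f + 3 * g)
rhs = 2 * (D @ f) + 3 * (D @ g)
print(np.allclose(lhs, rhs))  # True - the operator is linear
```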
For AI: Neural networks as operators. Recent work treats neural networks as operators between function spaces:
- Neural Operators (FNO, DeepONet): Learn mappings between function spaces directly, e.g., mapping initial conditions to PDE solutions
- Attention as an operator: Self-attention can be viewed as a nonlinear operator on a sequence of vectors: $\mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$, where $n$ varies (the operator handles variable-length inputs)
7.4 Calculus of Variations
Problem. Find the function $f^*$ that minimises (or maximises) a functional $J : \mathcal{F} \to \mathbb{R}$.
This is optimisation over function spaces rather than over finite-dimensional parameter vectors. The classical framework is the calculus of variations.
Euler-Lagrange equation. Consider the functional:

$$J[y] = \int_a^b L\big(x, y(x), y'(x)\big)\,dx$$

where $L$ is the Lagrangian. The function $y$ that minimises $J$ satisfies:

$$\frac{\partial L}{\partial y} - \frac{d}{dx}\frac{\partial L}{\partial y'} = 0$$
This is the Euler-Lagrange equation - an ODE whose solutions are the critical points of the functional.
Example: shortest path. The length of a curve $y(x)$ from $(a, y_a)$ to $(b, y_b)$ is:

$$J[y] = \int_a^b \sqrt{1 + y'(x)^2}\,dx$$

The Euler-Lagrange equation gives $y'' = 0$, meaning the shortest path is a straight line $y = mx + c$. (In Euclidean geometry.)
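A quick numerical sanity check of this claim (endpoints, perturbation, and grid resolution are arbitrary illustrative choices): the straight line has a smaller length functional than a perturbed curve with the same endpoints.

```python
import numpy as np

def curve_length(y, x):
    """J[y]: approximate the curve length as the sum of segment lengths."""
    return np.sum(np.sqrt(np.diff(x) ** 2 + np.diff(y) ** 2))

x = np.linspace(0, 1, 1000)
straight = 2 * x                                 # straight line from (0, 0) to (1, 2)
wiggly = 2 * x + 0.1 * np.sin(6 * np.pi * x)     # same endpoints, perturbed

print(curve_length(straight, x))  # ~ sqrt(5) ~ 2.236
print(curve_length(wiggly, x))    # strictly larger
```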
Functional derivative. The analogue of a gradient for functionals is written $\frac{\delta J}{\delta f(x)}$.
This measures how $J$ changes when $f$ is perturbed at the point $x$.
Connection to machine learning. Training a neural network by gradient descent on parameters is a finite-dimensional approximation to functional optimisation:
Functional view:  min_{f in F}       L(f)          <- optimise over all functions
                         ^
                         |  parameterise f = f_theta
                         v
Parameter view:   min_{theta in R^p} L(f_theta)    <- optimise over parameters
The "lottery ticket hypothesis" and neural architecture search can be seen as trying to find the right parameterisation of the function space.
7.5 Transformations in AI Frameworks
Modern deep learning frameworks extensively use higher-order functions. Here is a conceptual map:
JAX's transformation system:
| Transform | Type Signature | Description |
|---|---|---|
| grad | $(\mathbb{R}^n \to \mathbb{R}) \to (\mathbb{R}^n \to \mathbb{R}^n)$ | Returns gradient function |
| vmap | $(\mathbb{R}^n \to \mathbb{R}^m) \to (\mathbb{R}^{B \times n} \to \mathbb{R}^{B \times m})$ | Vectorises (auto-batches) |
| jit | $(A \to B) \to (A \to B)$ | Compiles for speed |
| pmap | $(\mathbb{R}^n \to \mathbb{R}^m) \to (\mathbb{R}^{D \times n} \to \mathbb{R}^{D \times m})$ | Parallelises across $D$ devices |
The power of composability: These transforms compose: `jit(vmap(grad(f)))`.
This gives a compiled, batched gradient function - three layers of higher-order functions applied in sequence. Each transform takes a function and returns a new, enhanced function.
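A minimal JAX sketch of this composition (the `loss` function is a made-up example):

```python
import jax
import jax.numpy as jnp

def loss(w):                      # R^n -> R
    return jnp.sum(jnp.tanh(w) ** 2)

# Three higher-order functions applied in sequence:
fast_batched_grad = jax.jit(jax.vmap(jax.grad(loss)))   # (B, n) -> (B, n)

W = jnp.ones((8, 5))              # a batch of 8 parameter vectors
print(fast_batched_grad(W).shape) # (8, 5)
```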
Autodiff as a higher-order function. Automatic differentiation (the engine behind backpropagation) is fundamentally a higher-order function:

$$\text{AD} : (\mathbb{R}^n \to \mathbb{R}^m) \mapsto (\mathbb{R}^n \to \mathbb{R}^{m \times n})$$

It takes any differentiable function and returns its Jacobian (or gradient, for scalar outputs). The reverse-mode variant computes a full gradient at roughly the cost of a small constant number of forward passes, regardless of the number of parameters - this is why training networks with billions of parameters is feasible.
8. Continuity and Limits
This chapter only needs the core idea of continuity: functions can preserve nearby structure instead of tearing inputs apart with abrupt jumps. The full story belongs to Limits and Continuity, where epsilon-delta reasoning, theorem statements, and examples are treated systematically.
8.1 The Epsilon-Delta Definition
A function $f$ is continuous at $x_0$ when sufficiently small input changes force small output changes:

$$\forall \epsilon > 0\;\exists \delta > 0 : |x - x_0| < \delta \implies |f(x) - f(x_0)| < \epsilon$$
You do not need to master epsilon-delta proofs yet. What matters in this chapter is the interpretation: continuity is a structural property of a function, not a plotting aesthetic. It tells us whether a function respects local neighborhoods.
8.2 Types of Continuity
Ordinary continuity is pointwise: the allowed $\delta$ may depend on the location $x_0$. Uniform continuity is stronger: one choice of $\delta$ works across the whole set. Lipschitz continuity is stronger still and gives an explicit bound on how much outputs can move relative to inputs.
For AI, this hierarchy matters because robustness, sensitivity, and optimization stability are all function-shape questions. When we say a layer is "well behaved," we usually mean something in this family.
8.3 Continuity in Higher Dimensions
For $f : \mathbb{R}^n \to \mathbb{R}^m$, continuity is stated with norms:

$$\|x - x_0\| < \delta \implies \|f(x) - f(x_0)\| < \epsilon$$
This is the version that matters for neural networks, where inputs, hidden states, and parameters all live in high-dimensional spaces. A small step in parameter space should not cause an uncontrolled jump in loss unless the model includes genuinely discontinuous operations.
8.4 Limits and Convergence
Limits describe what a function approaches near a point, even if the function value at that point is missing or different. Continuity can be summarized as "the function value agrees with the limiting value."
This perspective is especially useful for activation functions and optimization trajectories: sigmoid and tanh saturate toward limiting values, while ReLU is exactly linear on the positive side. We will return to these ideas in Limits and Continuity.
8.5 Discontinuities and Their Consequences
Not every useful AI function is smooth or even continuous. argmax, hard thresholding, rounding, and hard attention all introduce abrupt changes. That is why so much modern ML relies on soft relaxations such as softmax, sigmoid gates, or temperature-controlled approximations.
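A tiny numeric illustration of the contrast (values chosen arbitrarily): a perturbation of size 0.001 flips the argmax, while the softmax output barely moves.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z1 = np.array([1.0, 0.999])
z2 = np.array([0.999, 1.0])        # a tiny perturbation of z1

print(np.argmax(z1), np.argmax(z2))   # 0 then 1: the output jumps
print(softmax(z1), softmax(z2))       # both ~[0.50, 0.50]: moves continuously
```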
The important bridge for this chapter is simple: continuity is one of the first major properties that distinguishes function classes in a way that directly affects trainability and robustness.
9. Differentiable Functions and Derivatives
Differentiability is the next major refinement: not only does the function avoid jumps, it admits a meaningful local linear approximation. The detailed machinery lives in Derivatives and Differentiation, Partial Derivatives and Gradients, Chain Rule and Backpropagation, and Automatic Differentiation.
9.1 Derivative as a Linear Approximation
For a scalar function, the derivative at $x_0$ is the limit

$$f'(x_0) = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h}$$
when that limit exists. In higher dimensions, the same idea becomes the gradient or Jacobian: the best linear map that approximates the function near a point. This "best local linear model" viewpoint is the right mental model for optimization.
9.2 Smoothness Classes
Analysts organize functions by regularity: continuous functions ($C^0$), continuously differentiable functions ($C^1$), twice differentiable functions ($C^2$), and so on. In ML terms, this is the difference between kinked activations such as ReLU and smoother choices such as GELU or softplus.
You do not need the full taxonomy yet. The important lesson is that stronger regularity assumptions buy stronger theorems, but many practical models deliberately work at the edge of those assumptions because piecewise-linear functions are computationally convenient.
9.3 The Chain Rule and Backpropagation
Differentiation becomes operationally important because neural networks are compositions:

$$f = f_L \circ f_{L-1} \circ \cdots \circ f_1$$
The chain rule tells us how derivatives of compositions combine, and backpropagation is the computational form of that rule. This is the main reason the function-composition viewpoint earlier in the chapter matters so much.
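A small JAX check that the derivative of a composition is the product of the layer derivatives (the two "layers" here are toy scalar functions):

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.tanh(x)          # layer 1
g = lambda h: h ** 2 + 3.0 * h     # layer 2
net = lambda x: g(f(x))            # composition g o f

x0 = 0.7
manual = jax.grad(g)(f(x0)) * jax.grad(f)(x0)   # chain rule: g'(f(x)) * f'(x)
auto = jax.grad(net)(x0)                        # what backpropagation computes

print(manual, auto)                # identical (up to floating-point error)
```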
9.4 Gradient Problems in Deep Networks
Composing many layers can make gradients shrink, blow up, or become numerically unstable. Vanishing and exploding gradients are not separate mysteries; they are consequences of how many Jacobian-like factors are multiplied together through depth.
Residual connections, normalization, initialization schemes, and gradient clipping are all function-shaping interventions designed to keep this derivative information usable.
9.5 Automatic Differentiation
Modern frameworks do not symbolically differentiate full formulas by hand. They build a computation graph and apply local derivative rules mechanically. That is automatic differentiation, and it is what turns abstract calculus into a practical training system.
For this section, the main bridge is conceptual: once functions are composed from simple pieces, their derivatives can also be assembled systematically. That idea powers nearly all large-scale deep learning.
10. Measure-Theoretic View of Functions
Functions eventually need to interact with probability, integration, and "almost everywhere" reasoning. This chapter only needs a preview so the vocabulary will feel familiar later. The closest developed companions in the current curriculum are Introduction to Probability and Random Variables, Expectation and Moments, and Integration; the dedicated measure-theory chapter sequence is still mostly scaffolded.
10.1 Why Measure Theory?
Probability theory needs a rigorous way to say which subsets count as events, which functions are valid random variables, and when an expectation is well defined. Measure theory supplies that language.
From the function viewpoint, the key upgrade is that we stop asking only "what output does this function produce?" and start asking "how does this function interact with a measure on its domain?"
10.2 Measurable Functions
A function $f : (X, \mathcal{A}) \to (Y, \mathcal{B})$ between measurable spaces is measurable when preimages of measurable sets remain measurable: $f^{-1}(B) \in \mathcal{A}$ for every $B \in \mathcal{B}$. This is the measure-theoretic analogue of the way continuity preserves open-set structure under preimages.
That condition is what lets us push probabilities through a function and talk about the distribution of outputs in a mathematically controlled way.
10.3 Random Variables as Measurable Functions
A random variable is not fundamentally a mysterious object that "has a distribution." It is a measurable function
$X : \Omega \to \mathbb{R}$ from outcomes to values. The distribution is the pushforward of the underlying probability measure $P$ through $X$. This perspective becomes especially helpful when we later treat generative models as maps that transform one distribution into another.
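A quick Monte Carlo sketch of a pushforward (sample count and seed are arbitrary): samples of $X \sim \mathcal{N}(0, 1)$ pushed through $x \mapsto x^2$ should have the mean and variance of a chi-squared distribution with 1 degree of freedom.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)     # samples of X ~ N(0, 1) on the outcome side
y = x ** 2                         # pushforward through the measurable map x -> x^2

# Y should follow a chi-squared distribution with 1 degree of freedom:
print(y.mean(), y.var())           # ~1.0 and ~2.0, the chi-squared(1) mean and variance
```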
10.4 Lebesgue Integration
Expectation is integration with respect to a probability measure. Lebesgue integration is the framework that makes this precise for very general classes of functions, far beyond the simple Riemann-integrable cases you first see in basic calculus.
For ML, this matters because loss expectations, likelihoods, densities, and interchange-of-limit arguments all rest on integration theory, even when the notation looks innocently familiar.
10.5 Almost Everywhere and Null Sets
Many theorems in analysis and ML care only about what happens outside sets of measure zero. A property that holds almost everywhere may fail on a negligible exceptional set and still be enough for integration, optimization, and probability arguments.
This is why ReLU's non-differentiability at a single point is usually harmless in practice: the bad set is tiny in the measure-theoretic sense that later theory makes precise.
11. Functions in Machine Learning
11.1 Universal Approximation Theorems
The fundamental question. Can neural networks represent any function? The universal approximation theorem says yes - with sufficient width or depth.
Width version (Cybenko 1989, Hornik 1991). For any continuous function $f : K \to \mathbb{R}$ on a compact set $K \subset \mathbb{R}^n$ and any $\epsilon > 0$, there exists a single-hidden-layer network:

$$\hat{f}(x) = \sum_{i=1}^{N} c_i\,\sigma(w_i^\top x + b_i)$$

with $N$ sufficiently large (but finite), such that $\sup_{x \in K} |f(x) - \hat{f}(x)| < \epsilon$.
Requirements on $\sigma$:
- Original (Cybenko): $\sigma$ must be sigmoidal (monotone, $\sigma(x) \to 1$ as $x \to +\infty$, $\sigma(x) \to 0$ as $x \to -\infty$)
- Extended (Hornik): Any continuous, non-polynomial $\sigma$ works
- ReLU: Specific constructions show that ReLU networks are universal approximators
Depth version (Lu et al. 2017). ReLU networks with width $n + 4$ (where $n$ is the input dimension) but arbitrary depth are universal approximators. Width $n$ is NOT sufficient - there exist continuous functions that cannot be approximated by arbitrarily deep but narrow (width-$n$) ReLU networks.
What the theorem does NOT say:
- It does NOT guarantee that gradient descent can find the approximation
- It does NOT bound the number of parameters needed (could be astronomically large)
- It does NOT say anything about generalisation (fitting the training data is not the same as generalising)
- It does NOT say the approximation is efficient
Depth separation results. There exist functions that:
- Can be computed by ReLU networks of some depth $k$ with polynomial width
- Require exponential width to approximate with networks of substantially smaller depth (Telgarsky 2016, Eldan & Shamir 2016)
This justifies deeper architectures: depth provides exponential efficiency gains in representation.
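A direct, training-free construction in NumPy illustrates the width version of the theorem: a one-hidden-layer ReLU network that piecewise-linearly interpolates a target function at a chosen set of knots (the target function, knot count, and helper names are illustrative assumptions):

```python
import numpy as np

# Target function and knots for a piecewise-linear ReLU approximation
f = lambda x: np.sin(2 * np.pi * x)
knots = np.linspace(0.0, 1.0, 9)            # 9 knots -> 8 linear pieces
slopes = np.diff(f(knots)) / np.diff(knots)

# One hidden ReLU unit per knot: fhat(x) = f(knot_0) + sum_i c_i * ReLU(x - knot_i)
c = np.concatenate([[slopes[0]], np.diff(slopes)])

def fhat(x):
    return f(knots[0]) + np.sum(c * np.maximum(0.0, x[:, None] - knots[:-1]), axis=1)

xs = np.linspace(0.0, 1.0, 200)
print("max |f - fhat| on [0,1]:", np.max(np.abs(f(xs) - fhat(xs))))  # shrinks as knots increase
```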
11.2 Loss Functions as Function Compositions
The training loss decomposes as a composition of functions:
Input x -> Model f_\theta(x) -> Prediction yhat -> Loss ell(yhat, y) -> Scalar L
Common loss functions and their properties:
| Loss | Formula | Domain -> Range | Convex in pred? | Bounded? |
|---|---|---|---|---|
| MSE | $\ell(\hat{y}, y) = (\hat{y} - y)^2$ | $\mathbb{R} \times \mathbb{R} \to [0, \infty)$ | Yes (strongly) | No |
| Cross-entropy | $\ell(\hat{p}, y) = -\log \hat{p}_y$ | $\Delta^{K-1} \times \{1, \dots, K\} \to [0, \infty)$ | Yes | No ($\to \infty$ as $\hat{p}_y \to 0$) |
| Hinge | $\ell(\hat{y}, y) = \max(0, 1 - y\hat{y})$ | $\mathbb{R} \times \{-1, 1\} \to [0, \infty)$ | Yes | No |
| Huber | MSE if $\lvert e \rvert < \delta$, MAE otherwise | $\mathbb{R} \times \mathbb{R} \to [0, \infty)$ | Yes | No |
| KL divergence | $\ell(p, q) = \sum_i p_i \log \frac{p_i}{q_i}$ | $\Delta \times \Delta \to [0, \infty)$ | Yes (in $q$) | No |
| Focal | $\ell(\hat{p}, y) = -(1 - \hat{p}_y)^\gamma \log \hat{p}_y$ | $\Delta^{K-1} \times \{1, \dots, K\} \to [0, \infty)$ | Depends on $\gamma$ | No |
Important subtlety: convexity in predictions vs. parameters. Cross-entropy is convex in the logits (the input to softmax), but the overall loss is non-convex in the parameters $\theta$ because $f_\theta(x)$ is a non-convex function of $\theta$.
11.3 Hypothesis Classes and Function Spaces
Definition. A hypothesis class is the set of functions that a learning algorithm can select from:

$$\mathcal{H} = \{f_\theta : \theta \in \Theta\}$$

The choice of architecture determines $\mathcal{H}$:
| Architecture | Hypothesis Class | Properties |
|---|---|---|
| Linear regression | $\{x \mapsto w^\top x + b\}$ | Finite-dim, convex, small |
| Polynomial degree $d$ | $\{x \mapsto \sum_{\lvert\alpha\rvert \le d} a_\alpha x^\alpha\}$ | Finite-dim, convex in the coefficients |
| 1-hidden-layer MLP | $\{x \mapsto \sum_i c_i\,\sigma(w_i^\top x + b_i)\}$ | Finite-dim, non-convex |
| Deep network | Very complex function class | Finite-dim, highly non-convex |
| Kernel methods | $\{x \mapsto \sum_i \alpha_i\,k(x_i, x)\}$ | Can be infinite-dim, convex |
| Transformer | See below | Finite-dim, non-convex |
The hypothesis class of a transformer. A transformer with $L$ layers, $H$ heads, hidden dimension $d$, and vocabulary $V$ defines:

$$\mathcal{H} = \{f_\theta : V^* \to \Delta^{|V|-1} \mid \theta \in \mathbb{R}^p\}$$
This is a subset of all functions from variable-length token sequences to distributions over the vocabulary. The subset is determined by the architecture and parameter count.
Bias-variance tradeoff through function space lens:
- Small $\mathcal{H}$ (few parameters): High bias (can't fit complex functions), low variance (stable across datasets)
- Large $\mathcal{H}$ (many parameters): Low bias (can fit anything), but potentially high variance
- Modern deep learning paradox: Over-parameterised networks (parameters $\gg$ training examples) have low bias AND low variance due to implicit regularisation by SGD
11.4 Attention as a Function
Self-attention defines a function $\mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$, where $n$ is sequence length and $d$ is hidden dimension:

$$\text{Attention}(X) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Decomposing the function:
- Linear projections: $Q = XW_Q$, $K = XW_K$, $V = XW_V$ (affine functions)
- Attention scores: $S = QK^\top / \sqrt{d_k}$ (bilinear function of $Q$ and $K$)
- Softmax: $A = \text{softmax}(S)$, applied row-wise (nonlinear)
- Weighted sum: $\text{output} = AV$ (bilinear function of $A$ and $V$)
Properties of attention as a function:
| Property | Attention |
|---|---|
| Linear? | No (softmax makes it nonlinear) |
| Injective? | Generally yes (for distinct inputs) |
| Equivariant? | Yes - permuting input rows permutes output rows |
| Lipschitz? | Only on bounded input domains; the constant grows with sequence length |
| Differentiable? | Yes ($C^\infty$ in the inputs) |
| Input size | Variable ($n$ can change) |
Attention is not a fixed function. Unlike a typical neural network layer with fixed computational graph, attention's computation depends on the input through the attention pattern $A$. This makes it a data-dependent function: the effective "weight matrix" changes with every input.
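A minimal single-head self-attention in NumPy, matching the decomposition above (dimensions, initialisation, and helper names are arbitrary illustrative choices; no masking or multi-head logic):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: R^{n x d} -> R^{n x d}."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # linear projections
    S = Q @ K.T / np.sqrt(K.shape[-1])        # bilinear attention scores
    A = softmax(S, axis=-1)                   # row-wise softmax (nonlinear)
    return A @ V                              # weighted sum

rng = np.random.default_rng(0)
n, d = 6, 16                                  # works for any sequence length n
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (6, 16)
```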
11.5 Tokenisation and Embedding as Functions
Tokenisation is a function from strings to token sequences:

$$\text{tokenise} : \Sigma^* \to V^*$$

where $\Sigma$ is the character alphabet and $V$ is the token vocabulary. For BPE/SentencePiece tokenisation, this function is:
- Deterministic (same input always gives same output)
- Not injective in general (some tokenisers normalise text, losing information)
- Not surjective (not all token sequences correspond to valid text)
Embedding is a learned function from discrete tokens to continuous vectors:

$$E : V \to \mathbb{R}^d, \quad E(t) = W_E[t]$$

where $W_E \in \mathbb{R}^{|V| \times d}$ is the embedding matrix. This is:
- Injective (distinct tokens get distinct vectors, assuming no two rows of the embedding matrix coincide)
- Not continuous (the domain is discrete, so continuity doesn't apply)
- A lookup table, not a "computation" - but mathematically it's a function
Complete pipeline as function composition:
String ---> Tokens ---> Embeddings ---> + Position ---> Transformer ---> Logits ---> Probs
Tokenise Embed PosEnc f_\theta LM Head Softmax
11.6 The Transformer Block as a Function
A single Pre-LN transformer block (as used in GPT-2+, LLaMA, etc.) defines a residual function:

$$h = x + \text{Attn}(\text{LN}_1(x)), \qquad y = h + \text{FFN}(\text{LN}_2(h))$$
Breaking this down:
| Step | Function | Type |
|---|---|---|
| 1. Layer Norm | $u = \text{LN}_1(x)$ | Nonlinear (normalisation) |
| 2. Self-Attention | $a = \text{Attn}(u)$ | Nonlinear (softmax), data-dependent |
| 3. Residual Add | $h = x + a$ | Linear in $x$ and in the attention output |
| 4. Layer Norm | $v = \text{LN}_2(h)$ | Nonlinear |
| 5. FFN | $m = W_2\,\phi(W_1 v + b_1) + b_2$ (GELU, or SwiGLU variant) | Nonlinear (activation function) |
| 6. Residual Add | $y = h + m$ | Linear |
The full transformer stacks $L$ such blocks:

$$f_\theta = \text{Block}_L \circ \text{Block}_{L-1} \circ \cdots \circ \text{Block}_1$$

Parameter count and function expressiveness. For a transformer with hidden dimension $d$, $L$ layers, $H$ heads, FFN dimension $d_{\text{ff}}$, and vocabulary $V$, the parameter count is roughly:

$$p \approx |V|\,d + L\left(4d^2 + 2d\,d_{\text{ff}}\right)$$

For GPT-3 (175B): $d = 12288$, $L = 96$, $H = 96$, giving $p \approx 1.75 \times 10^{11}$ parameters. Each parameter is a coordinate in the hypothesis space $\mathcal{H}$.
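A back-of-envelope check of this estimate in Python (the formula ignores biases, layer-norm parameters, and head-splitting details, and assumes the usual $d_{\text{ff}} = 4d$ FFN width with a ~50k vocabulary; the helper name is illustrative):

```python
# Rough parameter count: embeddings + per-layer (attention + FFN) weights.
# Ignores biases, layer-norm parameters, and output-head details.
def approx_params(d, n_layers, d_ff, vocab):
    embed = vocab * d
    attn_per_layer = 4 * d * d          # W_Q, W_K, W_V, W_O
    ffn_per_layer = 2 * d * d_ff        # W_1, W_2
    return embed + n_layers * (attn_per_layer + ffn_per_layer)

# GPT-3-scale configuration: d = 12288, 96 layers, d_ff = 4*d, ~50k vocabulary
print(f"{approx_params(12288, 96, 4 * 12288, 50257):.3e}")   # ~1.75e+11
```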
12. Category Theory Perspective
12.1 Categories and Morphisms
Definition. A category consists of:
- A collection of objects
- For each pair of objects $A, B$, a set of morphisms $\text{Hom}(A, B)$ (arrows from $A$ to $B$)
- A composition operation: if $f : A \to B$ and $g : B \to C$, then $g \circ f : A \to C$
- An identity morphism $\text{id}_A$ for each object $A$
Subject to:
- Associativity: $h \circ (g \circ f) = (h \circ g) \circ f$
- Identity: $f \circ \text{id}_A = f$ and $\text{id}_B \circ f = f$
This should look familiar. The axioms of a category are precisely the axioms of function composition from Section 4!
Key categories:
| Category | Objects | Morphisms |
|---|---|---|
| Set | Sets | Functions |
| Vect | Vector spaces | Linear maps |
| Top | Topological spaces | Continuous maps |
| Meas | Measurable spaces | Measurable functions |
| Grp | Groups | Group homomorphisms |
For AI. A neural network layer is a morphism in the category of parameterised differentiable functions:
- Objects: $\mathbb{R}^n$ for various $n$ (Euclidean spaces of various dimensions)
- Morphisms: differentiable functions
- Composition: layered network construction
12.2 Functors: Maps Between Categories
Definition. A functor $F : \mathcal{C} \to \mathcal{D}$ is a "structure-preserving map" between categories:
- $F$ maps objects: $A \mapsto F(A)$
- $F$ maps morphisms: $(f : A \to B) \mapsto (F(f) : F(A) \to F(B))$
- $F$ preserves composition: $F(g \circ f) = F(g) \circ F(f)$
- $F$ preserves identities: $F(\text{id}_A) = \text{id}_{F(A)}$
Intuition. A functor is a "homomorphism of categories" - it maps the entire structure (objects + arrows + composition) from one category to another.
Examples relevant to AI:
| Functor | From | To | Maps |
|---|---|---|---|
| Forgetful | Vect | Set | A vector space to its underlying set (forget the linear structure) |
| Free | Set | Vect | A set to the free vector space on it |
| Probability | Meas | Set | A measurable space to its set of probability measures |
| Dual space | Vect | Vect | $V \mapsto V^*$ (contravariant) |
For AI: JAX transforms as functors. The vmap transform in JAX can be viewed as a functor:
- It maps functions to batched functions: $f : \mathbb{R}^n \to \mathbb{R}^m$ becomes $\text{vmap}(f) : \mathbb{R}^{B \times n} \to \mathbb{R}^{B \times m}$
- It preserves composition: $\text{vmap}(g \circ f) = \text{vmap}(g) \circ \text{vmap}(f)$
- It preserves identity: $\text{vmap}(\text{id}) = \text{id}$
Similarly, grad is a (contravariant-like) functor from scalar-valued functions to vector-valued functions.
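A quick JAX check of the two functor-like laws for vmap, with arbitrary example functions:

```python
import jax
import jax.numpy as jnp

f = lambda x: x ** 2 + 1.0
g = lambda y: jnp.sin(y)

xs = jnp.linspace(0.0, 1.0, 5)

# Preserves composition: vmap(g o f) == vmap(g) o vmap(f)
lhs = jax.vmap(lambda x: g(f(x)))(xs)
rhs = jax.vmap(g)(jax.vmap(f)(xs))
print(jnp.allclose(lhs, rhs))                         # True

# Preserves identity: vmap(id) == id on batched inputs
print(jnp.allclose(jax.vmap(lambda x: x)(xs), xs))    # True
```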
12.3 Natural Transformations
Definition. A natural transformation $\eta : F \Rightarrow G$ between functors $F, G : \mathcal{C} \to \mathcal{D}$ assigns to each object $A$ a morphism $\eta_A : F(A) \to G(A)$ such that the following diagram commutes for every morphism $f : A \to B$:
F(A) ---- \eta_A -----> G(A)
| |
F(f)| |G(f)
| |
v v
F(B) ---- \eta_B -----> G(B)
Commutativity: $\eta_B \circ F(f) = G(f) \circ \eta_A$.
Intuition. A natural transformation is a "systematic" way of converting one construction into another, one that works uniformly across all objects and is compatible with all morphisms.
For AI. Natural transformations appear in:
- Activation functions: The ReLU activation defines a natural transformation from the identity functor to itself (in a suitable category of parameterised functions). It transforms each layer's output in a way that commutes with the layer's parameterisation.
- Normalisation layers: Layer norm transforms representations in a way that is "natural" - it doesn't depend on the specific basis chosen for the hidden dimension.
12.4 Isomorphisms and Equivalences
Definition. A morphism $f : A \to B$ is an isomorphism if there exists $g : B \to A$ with $g \circ f = \text{id}_A$ and $f \circ g = \text{id}_B$.
In different categories, isomorphisms have different names:
| Category | Isomorphism Called | Meaning |
|---|---|---|
| Set | Bijection | One-to-one correspondence |
| Vect | Linear isomorphism | Invertible linear map |
| Top | Homeomorphism | Continuous bijection with continuous inverse |
| Smooth manifolds | Diffeomorphism | Smooth bijection with smooth inverse |
| Groups | Group isomorphism | Structure-preserving bijection |
For AI: Diffeomorphisms in normalising flows. A normalising flow is a sequence of diffeomorphisms $f_1, \dots, f_K$ in the category of smooth manifolds. Each $f_i$ is:
- Smooth (differentiable)
- Bijective (invertible)
- Has a smooth inverse
The composition $f = f_K \circ \cdots \circ f_1$ is also a diffeomorphism, and the change-of-variables formula allows exact likelihood computation:

$$\log p_X(x) = \log p_Z\big(f(x)\big) + \log\left|\det \frac{\partial f}{\partial x}\right|$$
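A 1-D illustration of the change-of-variables formula (the affine map and base distribution are arbitrary illustrative choices; the convention here is $z = f(x)$ with a standard normal base density):

```python
import numpy as np

# 1-D flow: x = 2z + 1, so f(x) = (x - 1) / 2 and |df/dx| = 1/2.
# Change of variables: log p_X(x) = log p_Z(f(x)) + log |df/dx|.
def log_normal_pdf(z):
    return -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)

x = 0.3
z = (x - 1.0) / 2.0
log_px = log_normal_pdf(z) + np.log(0.5)

# Sanity check: X = 2Z + 1 is N(1, 4), so compare to that density directly.
log_px_direct = -0.5 * ((x - 1.0) ** 2) / 4.0 - 0.5 * np.log(2 * np.pi * 4.0)
print(log_px, log_px_direct)   # equal
```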
12.5 The Yoneda Lemma and Representations
Yoneda Lemma (informal). An object is completely determined by its relationships (morphisms) to all other objects. More precisely:
Natural transformations from the representable functor $\text{Hom}(A, -)$ to any functor $F$ are in bijection with elements of $F(A)$.
Consequence. Two objects are isomorphic if and only if they have the same "relationship pattern" to all other objects: $A \cong B$ if and only if $\text{Hom}(A, C) \cong \text{Hom}(B, C)$ (naturally) for all objects $C$.
For AI (conceptual analogy). The Yoneda perspective suggests that a representation (embedding) is good if it captures all the relationships of the represented object:
- A word embedding is a "compressed" version of the word's relationships to all other words
- Two words are "the same" (isomorphic) if they have the same relationship to all contexts
- This is precisely the distributional hypothesis: "a word is characterised by the company it keeps" (Firth 1957)
- Word2Vec, GloVe, and contextual embeddings all operationalise this principle: embeddings derive from co-occurrence patterns (relationships to other words)
The Yoneda perspective also connects to representation learning more broadly: a good representation of an object captures all the relevant "morphisms" (transformations, relationships) between it and everything else.
13. Common Mistakes and Misconceptions
13.1 Confusing Function with Formula
Mistake: Treating the formula "$f(x) = x^2$" as the function, rather than understanding that the function is the mapping between a specified domain and codomain defined by $x \mapsto x^2$.
Why it matters: The same formula can define different functions depending on domain and codomain:
- $f : \mathbb{R} \to \mathbb{R}$, $f(x) = x^2$ - NOT injective
- $f : [0, \infty) \to \mathbb{R}$, $f(x) = x^2$ - injective
- $f : \mathbb{R} \to [0, \infty)$, $f(x) = x^2$ - surjective
- $f : [0, \infty) \to [0, \infty)$, $f(x) = x^2$ - bijective
These are four different functions with the same formula.
13.2 Confusing Codomain and Range
Mistake: Assuming the codomain equals the range.
The codomain is part of the function's definition (what it can in principle output). The range is what it actually outputs. For $f : \mathbb{R} \to \mathbb{R}$, $f(x) = x^2$:
- Codomain: $\mathbb{R}$ (all reals)
- Range: $[0, \infty)$ (non-negative reals only)
- The function is NOT surjective because range $\subsetneq$ codomain
In AI: a classifier has codomain $\mathbb{R}^K$ (the logits), but after softmax the actual outputs lie in $\text{int}(\Delta^{K-1})$ (the interior of the probability simplex) - a strict subset.
13.3 Assuming Composition is Commutative
Mistake: Assuming $f \circ g = g \circ f$.
Composition is almost never commutative. In neural networks:
- Pre-LN: $x + \text{Attn}(\text{LN}(x))$ - normalise then attend
- Post-LN: $\text{LN}(x + \text{Attn}(x))$ - attend then normalise
- These produce different results and have different training dynamics
13.4 Forgetting That Linear Means f(0) = 0
Mistake: Calling $f(x) = Wx + b$ a "linear function".
It is affine, not linear (unless $b = 0$). A truly linear function satisfies $f(x + y) = f(x) + f(y)$, $f(\alpha x) = \alpha f(x)$, and in particular $f(0) = 0$. The term "linear layer" in deep learning is a misnomer - PyTorch's nn.Linear is actually an affine layer.
13.5 Assuming Differentiable Implies Smooth
Mistake: Thinking once-differentiable ($C^1$) means infinitely differentiable ($C^\infty$).
ReLU is not even $C^1$ (not differentiable at $0$). A function such as $f(x) = x\lvert x\rvert$ is $C^1$ but not $C^2$. For optimisation, $C^1$ is sufficient for gradient descent, but $C^2$ is needed for second-order methods.
13.6 Ignoring Domain Restrictions
Mistake: Applying a function without checking that the input lies in its domain - for example, computing $\log x$ at $x = 0$.
In AI, this appears as:
- `log(0)` in the cross-entropy loss when a prediction is exactly 0 - fix: use `log(p + eps)` or `log_softmax`
- Division by zero in normalisation when the variance is 0 - fix: add a small `eps` to the denominator
- `sqrt(0)` has an infinite gradient - fix: use `sqrt(x + eps)`
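A tiny NumPy illustration of the first failure mode and its fix (the `eps` value is an arbitrary choice):

```python
import numpy as np

p = np.array([1.0, 0.0, 0.0])        # a prediction that is exactly one-hot
y = 1                                # but the true class has predicted probability 0

print(-np.log(p[y]))                 # inf - the cross-entropy blows up
eps = 1e-9
print(-np.log(p[y] + eps))           # ~20.7 - large but finite and usable
```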
13.7 Confusing Bijectivity with Invertibility
Mistake: Trying to "invert" a non-bijective function.
The Moore-Penrose pseudo-inverse $A^+$ is NOT a true inverse - it satisfies $AA^+A = A$, but $A^+A$ and $AA^+$ do not equal the identity in general. In AI:
- You cannot perfectly reconstruct an input from a lower-dimensional embedding (dimensionality reduction is not injective)
- An autoencoder with bottleneck cannot be bijective
- Pooling layers (max pool, average pool) are not injective - information is lost
14. Exercises and Practice Problems
14.1 Foundational Exercises
Exercise 1 (Injectivity). Prove that the function $f : \mathbb{R}^n \to \mathbb{R}^m$ defined by $f(x) = Ax$ is injective if and only if $\ker(A) = \{0\}$.
Hint: Injective means $f(x) = f(y) \Rightarrow x = y$. Use the fact that $Ax = Ay \iff A(x - y) = 0$.
Exercise 2 (Composition). Let $f : \mathbb{R}^n \to \mathbb{R}^m$ and $g : \mathbb{R}^m \to \mathbb{R}^k$ be linear functions represented by matrices $A$ and $B$ respectively. Show that $g \circ f$ is represented by the matrix $BA$.
Exercise 3 (Surjectivity of softmax). Show that $\text{softmax} : \mathbb{R}^n \to \Delta^{n-1}$ is surjective onto the interior of the simplex but NOT surjective onto the boundary.
Hint: For interior points, use $z_i = \log p_i$ (plus any constant). For the boundary, note that softmax outputs are always strictly positive.
Exercise 4 (Inverse of sigmoid). Derive the inverse of the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$.
Solution sketch: Set $y = \frac{1}{1 + e^{-x}}$ and solve for $x$, giving $x = \log\frac{y}{1 - y}$. This is the logit function: $\sigma^{-1}(y) = \text{logit}(y) = \log\frac{y}{1 - y}$.
14.2 Intermediate Exercises
Exercise 5 (Lipschitz constant of composition). A network has 3 layers: a ReLU activation (Lipschitz constant 1), a weight matrix $W_1$ with $\|W_1\|_2 = \sigma_1$, and another weight matrix $W_2$ with $\|W_2\|_2 = \sigma_2$. What is the upper bound on the network's Lipschitz constant?
Answer: $1 \cdot \sigma_1 \cdot \sigma_2 = \sigma_1 \sigma_2$.
Exercise 6 (Residual connection Jacobian). For a residual block $y = x + f(x)$ where $f$ is differentiable:
(a) Write the Jacobian $\frac{\partial y}{\partial x}$ in terms of $J_f$. (b) Show that if $\|J_f\| < 1$, then the Jacobian $I + J_f$ is invertible. (c) What does this imply about the information flow through the residual block?
Exercise 7 (Convexity). Prove that the cross-entropy loss is convex in the logits $z$.
Hint: Compute the Hessian and show it is positive semi-definite. The Hessian of log-sum-exp is $\text{diag}(p) - pp^\top$ where $p = \text{softmax}(z)$.
14.3 Advanced Exercises
Exercise 8 (Universal approximation). Construct a ReLU network with one hidden layer that exactly represents the function $f(x) = |x|$ for $x \in \mathbb{R}$.
Solution: $|x| = \text{ReLU}(x) + \text{ReLU}(-x)$. This is a network with 2 hidden units, weights $w_1 = 1, w_2 = -1$, biases $b_1 = b_2 = 0$, output weights $c_1 = c_2 = 1$.
Exercise 9 (Measure theory). Let $X \sim \mathcal{N}(0, 1)$ and $Y = X^2$. Find the pushforward measure (the distribution of $Y$).
Hint: $Y$ follows a chi-squared distribution with 1 degree of freedom. Use the change-of-variables formula for non-injective functions by splitting the domain into $(-\infty, 0)$ and $[0, \infty)$.
Exercise 10 (Category theory). Verify that neural network layers with composition form a category: (a) Define the objects and morphisms. (b) Verify associativity of composition. (c) Define the identity morphism and verify the identity laws. (d) Is this category a subcategory of Set? Of Vect? Why or why not?
15. Why This Matters for AI/LLMs
15.1 Everything Is a Function
The central thesis of this chapter: every component of a modern AI system is a function, and the properties of those functions (injectivity, surjectivity, linearity, continuity, differentiability, Lipschitz constants) directly determine the system's capabilities and limitations.
+-----------------------------------------------------------------+
| THE AI SYSTEM AS A FUNCTION |
| |
| String ---> Tokens ---> Embeddings ---> Hidden ---> Logits ---> P |
| f_1 f_2 f_3 f_4 f_5 |
| |
| f_1: Tokenise (deterministic, not injective in general) |
| f_2: Embed (injective lookup, discrete -> continuous) |
| f_3: Transform (nonlinear, differentiable, high-dimensional) |
| f_4: LM Head (linear projection, dimension reduction) |
| f_5: Softmax (continuous, surjective onto int(\Delta), bounded) |
| |
| Full system: f = f_5 \circ f_4 \circ f_3 \circ f_2 \circ f_1 |
| Training: find \theta* = argmin_\theta E[L(f_\theta(x), y)] |
| Inference: sample from f_\theta*(context) |
+-----------------------------------------------------------------+
15.2 The Function Properties That Matter Most
| Property | Why It Matters | Where It Appears |
|---|---|---|
| Differentiability | Enables gradient-based training | Everywhere (backpropagation) |
| Lipschitz continuity | Controls sensitivity and robustness | All layers (adversarial robustness) |
| Injectivity | Preserves information | Invertible networks, normalising flows |
| Composition | Builds complex from simple | Deep networks = composition of layers |
| Linearity | Efficient, well-understood | Attention projections, FFN |
| Convexity | Guarantees global optimum | Loss (in logits), regularisation |
| Bilinearity | Captures interactions | Attention scores, dot products |
| Measurability | Enables probabilistic reasoning | Random variables, generative models |
15.3 Scale and the Function Perspective
The remarkable discovery of the scaling laws era (2020-present): the loss achieved by the function class indexed by a few hyperparameters (depth, width, data $D$, compute $C$, parameter count $N$) follows predictable relationships:

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}$$

where $L$ is the loss, $C$ is compute, $D$ is data, and $N$ is parameter count. These are power laws - the loss is a specific functional form of the "size" of the function class.
From a function theory perspective:
- More parameters -> larger hypothesis class -> can approximate more complex functions
- More data -> better estimation of the target function -> lower approximation error
- More compute -> more SGD steps -> better search through the hypothesis class
- The power-law form suggests deep structural regularities in the relationship between function class capacity and approximation quality
16. Conceptual Bridge
16.1 Looking Ahead
The mathematical tools from this chapter appear throughout the rest of this curriculum:
| Topic | Function Concepts Used |
|---|---|
| Linear Algebra | Linear maps, matrix representations of functions |
| Calculus | Derivatives (local linear approximations of functions) |
| Optimisation | Minimising loss functionals over parameterised function spaces |
| Probability | Random variables as measurable functions, distributions as pushforwards |
| Information Theory | Entropy and mutual information as functionals on distributions |
| Backpropagation | Chain rule for compositions of differentiable functions |
16.2 The Meta-Insight
Mathematics is, in a deep sense, the study of functions and their properties. Linear algebra studies linear functions. Calculus studies differentiable functions. Topology studies continuous functions. Measure theory studies measurable functions. Category theory studies all structure-preserving functions (morphisms) at once.
The bridge forward is this: later chapters will often look like they are about matrices, gradients, probability distributions, or optimization algorithms, but underneath they are still about maps between spaces, compositions of transformations, and constraints on what those transformations can preserve. Understanding functions - truly understanding them - is understanding the language in which all of AI is written.
References
- Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function." Mathematics of Control, Signals and Systems, 2(4), 303-314.
- Hornik, K. (1991). "Approximation capabilities of multilayer feedforward networks." Neural Networks, 4(2), 251-257.
- Hendrycks, D. & Gimpel, K. (2016). "Gaussian Error Linear Units (GELUs)." arXiv:1606.08415.
- Shazeer, N. (2020). "GLU Variants Improve Transformer." arXiv:2002.05202.
- Miyato, T. et al. (2018). "Spectral Normalization for Generative Adversarial Networks." ICLR 2018.
- Telgarsky, M. (2016). "Benefits of depth in neural networks." COLT 2016.
- Eldan, R. & Shamir, O. (2016). "The Power of Depth for Feedforward Neural Networks." COLT 2016.
- He, K. et al. (2016). "Deep Residual Learning for Image Recognition." CVPR 2016.
- Kingma, D.P. & Dhariwal, P. (2018). "Glow: Generative Flow with Invertible 1\times1 Convolutions." NeurIPS 2018.
- Jang, E. et al. (2017). "Categorical Reparameterization with Gumbel-Softmax." ICLR 2017.
- Lu, Z. et al. (2017). "The Expressive Power of Neural Networks: A View from the Width." NeurIPS 2017.
- Penrose, R. (1955). "A generalized inverse for matrices." Mathematical Proceedings of the Cambridge Philosophical Society, 51(3), 406-413.
- Mac Lane, S. (1998). Categories for the Working Mathematician. Springer.
- Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361.
- Misra, D. (2019). "Mish: A Self Regularized Non-Monotonic Activation Function." arXiv:1908.08681.
- Bai, S. et al. (2019). "Deep Equilibrium Models." NeurIPS 2019.