Math for LLMs

Interpolation and Approximation

Numerical Methods / Interpolation and Approximation


"The art of interpolation is the art of making the most of what you know to say something useful about what you don't."

  • Attributed to Carl Friedrich Gauss

Overview

Given a set of data points, interpolation constructs a function that passes exactly through every point, while approximation finds the "best" function from some class that fits the data in a least-squares or minimax sense. These are among the oldest problems in numerical mathematics, yet they remain central to modern AI: positional encodings in transformers use sinusoidal interpolation, kernel methods are approximation schemes, and neural networks themselves are universal approximators whose expressive power is quantified by approximation theory.

The critical insight is that the choice of basis determines everything. Monomials (1, x, x^2, \ldots) are algebraically simple but numerically catastrophic (Vandermonde matrices are notoriously ill-conditioned). Chebyshev polynomials, derived from trigonometry, yield interpolation nodes that minimize the worst-case size of the node polynomial - they are the basis that tames Runge's phenomenon. Splines trade global accuracy for local control, achieving smooth, stable interpolation at the cost of piecewise complexity. Each basis encodes a different prior about the function being approximated.

This section covers polynomial interpolation, Chebyshev theory, splines, least-squares approximation, Fourier approximation, and radial basis functions, with explicit connections to ML applications including kernel methods, positional encodings, and neural function approximation.

Prerequisites

  • Real analysis: continuity, differentiability, Taylor's theorem (Section01-Mathematical-Foundations)
  • Linear algebra: matrix factorization, least squares, condition numbers (Section02-Linear-Algebra-Basics, Section10-02)
  • Floating-point arithmetic: rounding error, numerical stability (Section10-01)
  • Norms and inner products (Section02-Linear-Algebra-Basics)

Companion Notebooks

| Notebook | Description |
|---|---|
| theory.ipynb | Interactive derivations: Lagrange interpolation, Chebyshev nodes, spline construction, least-squares fitting, Fourier series |
| exercises.ipynb | 10 graded exercises from Vandermonde conditioning to neural tangent kernel approximation |

Learning Objectives

After completing this section, you will:

  • Construct polynomial interpolants using Lagrange and Newton divided-difference forms
  • Explain Runge's phenomenon and why equispaced nodes are dangerous for high-degree interpolation
  • Derive Chebyshev nodes as the optimal interpolation points minimizing the node polynomial
  • Implement natural and clamped cubic splines by solving a tridiagonal linear system
  • Apply least-squares polynomial fitting via QR decomposition and the normal equations
  • Compute discrete Fourier series and use FFT for fast polynomial evaluation
  • Understand radial basis function (RBF) interpolation and its connection to kernel methods
  • Connect approximation theory to neural network universality and positional encodings in transformers

1. Intuition and Motivation

1.1 The Interpolation Problem

Interpolation: Given n+1 data pairs (x_0, y_0), (x_1, y_1), \ldots, (x_n, y_n) with distinct nodes x_i, find a function p such that p(x_i) = y_i for all i.

Approximation: Find a function p from some class \mathcal{F} that minimizes an error \|f - p\| over \mathcal{F}, possibly without passing through every data point.

The fundamental tension:

  • Interpolation (zero training error) risks overfitting and instability (Runge's phenomenon)
  • Approximation (nonzero training error) trades exactness for stability and generalization

This is exactly the bias-variance tradeoff in machine learning, seen through the lens of function approximation.

For AI: Every neural network layer computes a learned nonlinear function. Understanding which function classes can be represented and how efficiently is approximation theory. The expressive power of ReLU networks, the smoothness of GELU activations, and the positional encoding design in transformers all draw from this theory.

FUNCTION APPROXIMATION LANDSCAPE
========================================================================

  Data:  (x_0,y_0), ..., (x_n,y_n)
                      |
              What class of f?
               +------------+
          Global              Local
        +------+           +------+
     Polynomial          Piecewise
     (Lagrange,          (Splines,
      Chebyshev)          B-splines)
                               |
                         +-----------+
                      Smooth         Nonsmooth
                     (cubic)          (linear,
                                      B-spline)

  Special structures:
  - Trigonometric: periodic functions, FFT
  - RBF/kernel:    scattered data, RKHS, GP
  - Neural:        composition, ReLU networks

========================================================================

1.2 Why Approximation Matters for AI

1. Positional encodings: The original transformer (Vaswani et al., 2017) uses sinusoidal positional encodings \text{PE}(pos, 2i) = \sin(pos / 10000^{2i/d}). This is a trigonometric interpolation scheme allowing the model to interpolate positions not seen during training.

2. Kernel methods and SVMs: The kernel trick k(x, x') = \phi(x)^\top \phi(x') computes inner products in a feature space implicitly. The representer theorem shows that the optimal solution lies in the span of the training data - exactly the RBF interpolant form.

3. Neural network expressiveness: The universal approximation theorem states that a single hidden layer with enough neurons can approximate any continuous function on a compact set. The rate of approximation depends on the function's smoothness - exactly what approximation theory quantifies.

4. Learned embeddings: Word embeddings, graph node embeddings, and molecule representations are all continuous approximations of discrete objects - the embedding space provides an interpolation structure.

5. Physics-informed neural networks (PINNs): These solve PDEs by fitting a neural network as the solution function, treating the problem as function approximation subject to differential equation constraints.

1.3 Historical Timeline

INTERPOLATION AND APPROXIMATION: HISTORICAL TIMELINE
========================================================================

  1700  - Newton's divided differences (1711): systematic polynomial interp.
  1795  - Gauss: least-squares method for orbit of Ceres
  1800  - Legendre: method of least squares (1805); orthogonal polynomials
  1853  - Chebyshev: polynomials with equioscillation, minimax approximation
  1900  - Runge: demonstrates instability of high-degree equispaced interp. (1901)
  1906  - Lebesgue: Lebesgue constant quantifies interpolation error
  1946  - Schoenberg: B-splines introduced for smoothing problems
  1965  - Cooley-Tukey: Fast Fourier Transform algorithm
  1970  - de Boor: stable B-spline evaluation algorithms
  1985  - Chui, Mallat: wavelet theory as multi-resolution approximation
  1989  - Cybenko, Hornik: universal approximation theorems for neural nets
  2017  - Vaswani et al.: sinusoidal positional encodings in transformers
  2018  - Jacot et al.: neural tangent kernel as linearized approximation
  2020  - Tancik et al.: Fourier features for neural radiance fields (NeRF)
  2021  - RoPE (Su et al.): rotary positional encodings for LLMs (LLaMA)

========================================================================

2. Polynomial Interpolation

2.1 Lagrange Form

Theorem (Existence and Uniqueness): For any n+1 distinct nodes x_0, x_1, \ldots, x_n and values y_0, y_1, \ldots, y_n, there exists a unique polynomial p_n \in \mathcal{P}_n (degree \leq n) satisfying p_n(x_i) = y_i for all i.

Proof sketch: The interpolation conditions p_n(x_i) = y_i give a linear system Va = y where V is the Vandermonde matrix with V_{ij} = x_i^j. Since the x_i are distinct, V is invertible (its determinant is \prod_{i > j}(x_i - x_j) \neq 0). Uniqueness follows. \square

Lagrange basis polynomials: Define

\ell_j(x) = \prod_{\substack{k=0 \\ k \neq j}}^{n} \frac{x - x_k}{x_j - x_k}

Each \ell_j is a degree-n polynomial satisfying \ell_j(x_i) = \delta_{ij} (Kronecker delta). The Lagrange interpolating polynomial is:

p_n(x) = \sum_{j=0}^{n} y_j \ell_j(x)

Barycentric Lagrange Form: The naive Lagrange form requires O(n^2) operations per evaluation. The barycentric form is numerically stable and requires only O(n) per evaluation after O(n^2) preprocessing:

p_n(x) = \frac{\sum_{j=0}^{n} \frac{w_j}{x - x_j} y_j}{\sum_{j=0}^{n} \frac{w_j}{x - x_j}}, \quad w_j = \frac{1}{\prod_{k \neq j}(x_j - x_k)}

The division by the denominator (which equals 1 by Lagrange interpolation of the constant function f \equiv 1) provides automatic normalization and excellent numerical stability - it is one of the most stable algorithms in numerical analysis.

For AI: The attention mechanism in transformers can be viewed as a weighted sum \sum_j \alpha_j v_j where the attention weights \alpha_j = \text{softmax}(q^\top k_j / \sqrt{d}) play a role analogous to the Lagrange basis weights \ell_j(x). Both select how much each "stored value" contributes to the output.

2.2 Newton Divided Differences

The Newton form expresses p_n in a basis adapted for sequential addition of data points:

p_n(x) = [y_0] + [y_0, y_1](x - x_0) + [y_0, y_1, y_2](x - x_0)(x - x_1) + \cdots

Divided differences are defined recursively:

[y_i] = y_i, \quad [y_i, y_{i+1}, \ldots, y_{i+k}] = \frac{[y_{i+1}, \ldots, y_{i+k}] - [y_i, \ldots, y_{i+k-1}]}{x_{i+k} - x_i}

Key properties:

  1. Symmetric: [y_0, y_1, \ldots, y_n] is symmetric in all its arguments
  2. Derivative connection: [y_0, y_1, \ldots, y_n] = f^{(n)}(\xi) / n! for some \xi in the convex hull of \{x_0, \ldots, x_n\}
  3. Incremental: Adding a new data point (x_{n+1}, y_{n+1}) extends the Newton form by one term - an O(n) update vs O(n^2) for recomputing Lagrange

Newton divided difference table:

x_0  |  y_0
     |         [y_0, y_1]
x_1  |  y_1               [y_0, y_1, y_2]
     |         [y_1, y_2]               [y_0, y_1, y_2, y_3]
x_2  |  y_2               [y_1, y_2, y_3]
     |         [y_2, y_3]
x_3  |  y_3

Horner's rule evaluates the Newton form in O(n) operations:

p = coeffs[n]
for i in range(n-1, -1, -1):
    p = p * (x - nodes[i]) + coeffs[i]
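
As shown below, the divided-difference table and Horner evaluation together fit in a few lines. A minimal sketch (newton_coeffs and newton_eval are our names, not from a library):

import numpy as np

def newton_coeffs(x, y):
    """Divided differences c[k] = [y_0, ..., y_k], computed in place."""
    x = np.asarray(x, dtype=float)
    c = np.array(y, dtype=float)
    for k in range(1, len(x)):
        # kth-order differences: ([..y_i] - [..y_{i-1}]) / (x_i - x_{i-k})
        c[k:] = (c[k:] - c[k-1:-1]) / (x[k:] - x[:-k])
    return c

def newton_eval(c, x_nodes, t):
    """Evaluate the Newton-form interpolant at t by Horner's rule."""
    p = c[-1]
    for k in range(len(c) - 2, -1, -1):
        p = p * (t - x_nodes[k]) + c[k]
    return p

x = np.array([0.0, 1.0, 2.0, 3.0])
y = x**3                               # interpolating a cubic recovers it exactly
c = newton_coeffs(x, y)
print(newton_eval(c, x, 1.5))          # 3.375 = 1.5^3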

2.3 The Interpolation Error Theorem

Theorem: Let f \in C^{n+1}[a, b] and p_n be its interpolant at distinct nodes x_0, \ldots, x_n \in [a,b]. Then for any x \in [a,b], there exists \xi_x \in (a,b) such that:

f(x) - p_n(x) = \frac{f^{(n+1)}(\xi_x)}{(n+1)!} \omega_{n+1}(x)

where \omega_{n+1}(x) = \prod_{i=0}^{n}(x - x_i) is the node polynomial.

Proof sketch: Define g(t) = f(t) - p_n(t) - \lambda \omega_{n+1}(t) where \lambda is chosen so that g(x) = 0. Then g has n+2 zeros: x_0, \ldots, x_n, x. By Rolle's theorem, g' has n+1 zeros, g'' has n zeros, ..., and g^{(n+1)} has one zero \xi_x. Setting g^{(n+1)}(\xi_x) = 0 gives \lambda = f^{(n+1)}(\xi_x)/(n+1)!. \square

Implications:

  1. The error is controlled by \|\omega_{n+1}\|_\infty = \max_{x \in [a,b]} |\omega_{n+1}(x)| - choosing nodes to minimize this is Chebyshev's contribution
  2. Smooth functions (f^{(n+1)} small) are approximated well
  3. High-degree polynomials require f to have many bounded derivatives

Lebesgue constant: The interpolation error can also be bounded by:

\|f - p_n\|_\infty \leq (1 + \Lambda_n) \|f - p_n^*\|_\infty

where p_n^* is the best polynomial approximant of degree n and \Lambda_n = \max_{x \in [a,b]} \sum_j |\ell_j(x)| is the Lebesgue constant. The Lebesgue constant quantifies how much the interpolant amplifies errors in the data y_i.

  • Equispaced nodes: \Lambda_n \sim 2^{n+1} / (e n \ln n) - exponential growth (Runge's phenomenon)
  • Chebyshev nodes: \Lambda_n \sim \frac{2}{\pi} \ln(n+1) + 0.52 - logarithmic growth (nearly optimal)
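
A quick way to see these growth rates numerically is to estimate \Lambda_n by brute force on a fine grid. A minimal sketch (lebesgue_constant is our helper name):

import numpy as np

def lebesgue_constant(nodes, n_grid=5000):
    """Estimate max_x sum_j |l_j(x)| on a fine grid over [-1, 1]."""
    xs = np.linspace(-1, 1, n_grid)
    total = np.zeros_like(xs)
    for j, xj in enumerate(nodes):
        lj = np.ones_like(xs)
        for k, xk in enumerate(nodes):
            if k != j:
                lj *= (xs - xk) / (xj - xk)
        total += np.abs(lj)
    return total.max()

n = 15
equi = np.linspace(-1, 1, n + 1)
cheb = np.cos((2 * np.arange(n + 1) + 1) * np.pi / (2 * (n + 1)))
# Equispaced grows exponentially (hundreds at n = 15); Chebyshev stays ~2.3
print(lebesgue_constant(equi), lebesgue_constant(cheb))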

2.4 Runge's Phenomenon

Runge's phenomenon (1901): The interpolant of f(x) = 1/(1+25x^2) at n+1 equispaced nodes on [-1,1] diverges as n \to \infty, even though f is infinitely differentiable!

The reason is in the error formula: the node polynomial \omega_{n+1}(x) for equispaced nodes is extremely large near the boundary of [-1, 1], and although each derivative f^{(n+1)} is bounded, the bounds grow quickly with n. The product in the error formula therefore grows without bound near the endpoints.

RUNGE'S PHENOMENON: EQUISPACED vs CHEBYSHEV NODES
========================================================================

  f(x) = 1/(1+25x^2),  degree n = 15 (16 interpolation points)

  Equispaced nodes (uniformly spaced on [-1, 1]):
  - Huge oscillations near +/-1
  - Max error orders of magnitude larger than max|f| = 1
  - Lebesgue constant Lambda_15: in the hundreds (exponential growth)

  Chebyshev nodes x_k = cos((2k+1)pi/(2n+2)):
  - Near-optimal approximation; error decays steadily with n
  - Lebesgue constant Lambda_15 ~ 2.3 (logarithmic growth)

  Moral: Node placement is as important as degree!

========================================================================

For AI: Equispaced sampling is the default in practice (uniform time steps, uniform grid). Understanding why this can fail for high-degree polynomial interpolation is essential context for:

  • Time-series models that use polynomial basis functions
  • Finite element methods in physics-informed neural networks
  • Feature engineering with polynomial features (always use regularization or orthogonal polynomials)

3. Chebyshev Polynomials and Optimal Nodes

3.1 Chebyshev Polynomials

The Chebyshev polynomials of the first kind are defined via the remarkable trigonometric identity:

T_n(\cos\theta) = \cos(n\theta), \quad \theta \in [0, \pi]

Equivalently, on [-1, 1] they satisfy the recurrence:

T_0(x) = 1, \quad T_1(x) = x, \quad T_n(x) = 2x T_{n-1}(x) - T_{n-2}(x)
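
A quick sanity check that the recurrence reproduces the trigonometric definition, as a minimal sketch (chebyshev_T is our helper name):

import numpy as np

def chebyshev_T(n, x):
    """Evaluate T_n(x) by the three-term recurrence."""
    t_prev, t = np.ones_like(x), x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t = t, 2 * x * t - t_prev
    return t

x = np.linspace(-1, 1, 101)
print(np.allclose(chebyshev_T(6, x), np.cos(6 * np.arccos(x))))  # True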

Key properties:

| Property | Formula |
|---|---|
| Boundedness | \|T_n\|_\infty = 1 on [-1,1] |
| Roots (Chebyshev points of the 1st kind) | x_k = \cos\left(\frac{(2k+1)\pi}{2n}\right), k = 0, \ldots, n-1 |
| Extrema (Chebyshev points of the 2nd kind) | x_k = \cos\left(\frac{k\pi}{n}\right), k = 0, \ldots, n |
| Orthogonality | \int_{-1}^{1} T_m(x) T_n(x) \frac{dx}{\sqrt{1-x^2}} = \frac{\pi}{2} \delta_{mn} (for m, n \geq 1) |
| Leading coefficient | T_n(x) = 2^{n-1} x^n + \text{lower order} |
| Minimax property | Among all monic polynomials of degree n, T_n(x)/2^{n-1} has the smallest \infty-norm on [-1,1] |

Why the trigonometric definition works: T_n(\cos\theta) = \cos(n\theta) is periodic in \theta and stays in [-1,1]. The change of variables x = \cos\theta maps [0,\pi] to [-1,1], compressing the uniform distribution in \theta to a non-uniform (arcsine) distribution in x that weights the endpoints more heavily.

Chebyshev series: Any continuous function f: [-1,1] \to \mathbb{R} can be expanded as:

f(x) = \sum_{k=0}^{\infty} c_k T_k(x), \quad c_k = \frac{2}{\pi} \int_{-1}^{1} f(x) T_k(x) \frac{dx}{\sqrt{1-x^2}}

(with c_0 divided by 2). For analytic f, the Chebyshev series converges exponentially - faster than any fixed power of 1/n.

For AI: Chebyshev polynomials appear in:

  • Spectral graph neural networks: ChebNet approximates the graph spectral filter g(\Lambda) by a Chebyshev polynomial in \Lambda, avoiding eigendecomposition
  • Polynomial activation functions: Some neural architectures use Chebyshev polynomials as activation functions for efficient function approximation
  • Numerical PDE solvers: Spectral methods using Chebyshev expansions achieve exponential accuracy for smooth solutions

3.2 Chebyshev Nodes

The Optimal Node Placement Problem: Which n+1 nodes x_0, \ldots, x_n \in [-1,1] minimize \max_{x \in [-1,1]} |\omega_{n+1}(x)|, where \omega_{n+1}(x) = \prod_i (x - x_i)?

Answer: The Chebyshev points of the first kind (roots of T_{n+1}):

x_k = \cos\left(\frac{(2k+1)\pi}{2(n+1)}\right), \quad k = 0, 1, \ldots, n

With these nodes, \omega_{n+1}(x) = T_{n+1}(x) / 2^n and:

\max_{x \in [-1,1]} |\omega_{n+1}(x)| = \frac{1}{2^n}

This is the smallest possible value (by the minimax property of T_{n+1}).

Error bound with Chebyshev nodes:

\|f - p_n\|_\infty \leq \frac{\|f^{(n+1)}\|_\infty}{2^n (n+1)!}

Compare with equispaced nodes, where \|\omega_{n+1}\|_\infty \sim n! h^{n+1} - much larger for large n.

Rescaling to [a,b]: The Chebyshev nodes on [a,b] are:

x_k = \frac{a+b}{2} + \frac{b-a}{2} \cos\left(\frac{(2k+1)\pi}{2(n+1)}\right)
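
A one-function sketch generating these nodes (chebyshev_nodes is our name; first-kind points assumed):

import numpy as np

def chebyshev_nodes(n, a=-1.0, b=1.0):
    """n+1 Chebyshev points of the first kind, rescaled to [a, b]."""
    k = np.arange(n + 1)
    x = np.cos((2 * k + 1) * np.pi / (2 * (n + 1)))
    return 0.5 * (a + b) + 0.5 * (b - a) * x

print(chebyshev_nodes(4, 0.0, 1.0))   # 5 nodes clustered toward 0 and 1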

3.3 Minimax Approximation

The minimax problem asks: among all polynomials p \in \mathcal{P}_n, find the p^* minimizing \|f - p\|_\infty.

Chebyshev's theorem (equioscillation theorem): p^* is the best approximation if and only if there exist at least n+2 points x_0 < x_1 < \cdots < x_{n+1} in [a,b] where the error equioscillates:

f(x_k) - p^*(x_k) = (-1)^k E, \quad |E| = \|f - p^*\|_\infty

Remez algorithm: An iterative algorithm that finds the minimax polynomial by exchanging equioscillation points. Converges quadratically.

For smooth functions: The minimax polynomial approximation converges exponentially for analytic functions - meaning \|f - p_n^*\|_\infty \leq C \rho^{-n} for some \rho > 1 depending on the domain of analyticity.

3.4 Clenshaw-Curtis Quadrature Connection

The Chebyshev points of the second kind (extrema of T_n) are central to Clenshaw-Curtis quadrature (see Section10-05). The key insight: integrating the Chebyshev interpolant exactly gives highly accurate quadrature weights. This is treated in detail in the Numerical Integration section.


4. Spline Interpolation

4.1 Piecewise Polynomial Interpolation

Motivation: High-degree global polynomial interpolation is unstable (Runge). Instead, use a low-degree polynomial on each subinterval - piecewise polynomials or splines.

Definition: Given nodes a = x_0 < x_1 < \cdots < x_n = b (breakpoints), a spline of degree k is a function S: [a,b] \to \mathbb{R} such that:

  1. On each interval [x_i, x_{i+1}], S is a polynomial of degree \leq k
  2. S \in C^{k-1}[a,b] - it has k-1 continuous derivatives globally

Piecewise linear interpolation (k=1): Connect adjacent data points with straight lines. Simple, O(n) construction, but only C^0 - the derivative is discontinuous at the nodes. Error: O(h^2) where h = \max_i (x_{i+1} - x_i).

Piecewise cubic (k=3): The sweet spot - 4 degrees of freedom per interval; the interpolation and C^2 continuity conditions at the interior nodes leave exactly 2 free parameters, fixed by the boundary conditions. Error: O(h^4).

4.2 Cubic Splines: Derivation and Construction

Natural cubic spline: Given n+1 data points, find a piecewise cubic S \in C^2[a,b] satisfying:

  1. S(x_i) = y_i for i = 0, \ldots, n (interpolation)
  2. S''(x_0) = S''(x_n) = 0 (natural boundary conditions - zero curvature at the endpoints)

Derivation: On [x_i, x_{i+1}] with h_i = x_{i+1} - x_i, let M_i = S''(x_i) be the unknown second derivatives. Since S is cubic on each interval, S'' is linear:

S''(x) = M_i \frac{x_{i+1} - x}{h_i} + M_{i+1} \frac{x - x_i}{h_i}

Integrating twice and applying S(x_i) = y_i, S(x_{i+1}) = y_{i+1}:

S(x) = M_i \frac{(x_{i+1}-x)^3}{6h_i} + M_{i+1} \frac{(x-x_i)^3}{6h_i} + \left(y_i - \frac{M_i h_i^2}{6}\right)\frac{x_{i+1}-x}{h_i} + \left(y_{i+1} - \frac{M_{i+1} h_i^2}{6}\right)\frac{x-x_i}{h_i}

Imposing C^1 continuity (matching first derivatives at the interior nodes):

h_{i-1} M_{i-1} + 2(h_{i-1} + h_i) M_i + h_i M_{i+1} = 6 \left(\frac{y_{i+1}-y_i}{h_i} - \frac{y_i - y_{i-1}}{h_{i-1}}\right), \quad i = 1, \ldots, n-1

This gives a tridiagonal linear system for M_1, \ldots, M_{n-1} (with M_0 = M_n = 0 for the natural boundary conditions). Tridiagonal systems are solved in O(n) with the Thomas algorithm - spectacularly efficient.

CUBIC SPLINE SYSTEM (tridiagonal)
========================================================================

  | 2(h_0+h_1)    h_1                                 | | M_1     |   | r_1     |
  |   h_1      2(h_1+h_2)    h_2                      | | M_2     |   | r_2     |
  |              h_2      2(h_2+h_3)   h_3            | | M_3     | = | r_3     |
  |                  .          .          .          | |  ...    |   |  ...    |
  |                      h_{n-2}  2(h_{n-2}+h_{n-1})  | | M_{n-1} |   | r_{n-1} |

  r_i = 6[ (y_{i+1} - y_i)/h_i - (y_i - y_{i-1})/h_{i-1} ]  (second differences of y)

  Solved by the Thomas algorithm in O(n) operations.

========================================================================
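
A minimal sketch of the construction (natural_spline_M is our name): assemble the tridiagonal system above and solve it with the Thomas algorithm.

import numpy as np

def natural_spline_M(x, y):
    """Second derivatives M_0..M_n of the natural cubic spline (M_0 = M_n = 0),
    computed by the Thomas algorithm in O(n)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    h = np.diff(x)
    n = len(x) - 1
    a = h[:-1]                                # sub-diagonal  h_1 .. h_{n-2}
    b = 2 * (h[:-1] + h[1:])                  # main diagonal
    c = h[1:].copy()                          # super-diagonal
    r = 6 * (np.diff(y[1:]) / h[1:] - np.diff(y[:-1]) / h[:-1])
    for i in range(1, n - 1):                 # forward elimination
        w = a[i - 1] / b[i - 1]
        b[i] -= w * c[i - 1]
        r[i] -= w * r[i - 1]
    M = np.zeros(n - 1)
    M[-1] = r[-1] / b[-1]
    for i in range(n - 3, -1, -1):            # back substitution
        M[i] = (r[i] - c[i] * M[i + 1]) / b[i]
    return np.concatenate(([0.0], M, [0.0]))

x = np.linspace(0, np.pi, 9)
M = natural_spline_M(x, np.sin(x))            # interior M_i approximate -sin(x_i)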

Error analysis: For the cubic spline interpolant:

\|f - S\|_\infty \leq \frac{5}{384} h^4 \|f^{(4)}\|_\infty

where h = \max h_i. The O(h^4) convergence makes cubic splines highly accurate for modest grid spacing.

Boundary condition options:

  • Natural: S''(a) = S''(b) = 0 (minimizes \int |S''|^2)
  • Clamped: S'(a) = f'(a), S'(b) = f'(b) (uses derivative information)
  • Not-a-knot: S''' continuous at x_1 and x_{n-1} (used when derivatives are unknown)
  • Periodic: S'(a) = S'(b), S''(a) = S''(b) (for periodic data)

Minimum curvature property: The natural cubic spline minimizes \int_a^b |S''(x)|^2 dx among all twice-differentiable interpolants - it is the smoothest interpolant in this sense. (The same variational principle, in two dimensions, underlies thin plate splines.)

For AI:

  • Neural ODEs use cubic spline interpolation to densify trajectory predictions
  • Temporal point processes use spline bases to model intensity functions
  • The minimum curvature property connects to L2 regularization: the penalty \int |S''|^2 is the RKHS norm of the cubic smoothing-spline kernel, and Matern-type kernels induce closely related Sobolev norms

4.3 B-Splines

B-splines (basis splines) provide a numerically stable, local basis for the spline space. Rather than working with the second-derivative representation, B-splines use recursively defined local basis functions.

Definition (Cox-de Boor recursion): Given a knot vector t_0 \leq t_1 \leq \cdots \leq t_{n+p+1}, the B-spline basis functions of degree p are:

B_{i,0}(x) = \begin{cases} 1 & t_i \leq x < t_{i+1} \\ 0 & \text{otherwise} \end{cases}

B_{i,p}(x) = \frac{x - t_i}{t_{i+p} - t_i} B_{i,p-1}(x) + \frac{t_{i+p+1} - x}{t_{i+p+1} - t_{i+1}} B_{i+1,p-1}(x)

(terms with a zero denominator are taken to be zero).

Key properties:

  • Local support: B_{i,p}(x) = 0 outside [t_i, t_{i+p+1}] - changing one control point affects only a local region
  • Partition of unity: \sum_i B_{i,p}(x) = 1 for all x - the curve is a convex combination of control points
  • Non-negativity: B_{i,p}(x) \geq 0
  • Convex hull property: The spline curve lies within the convex hull of its control points

B-spline curve: S(x) = \sum_{i=0}^{n} c_i B_{i,p}(x), where the c_i are control point coefficients.
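
The recursion translates directly into code. A minimal (unoptimized) sketch with 0/0 terms treated as 0 by convention (bspline_basis is our name; de Boor's algorithm in Appendix D is the efficient alternative):

import numpy as np

def bspline_basis(i, p, t, x):
    """B_{i,p}(x) by the Cox-de Boor recursion."""
    if p == 0:
        return 1.0 if t[i] <= x < t[i + 1] else 0.0
    left = right = 0.0
    if t[i + p] > t[i]:
        left = (x - t[i]) / (t[i + p] - t[i]) * bspline_basis(i, p - 1, t, x)
    if t[i + p + 1] > t[i + 1]:
        right = (t[i + p + 1] - x) / (t[i + p + 1] - t[i + 1]) * bspline_basis(i + 1, p - 1, t, x)
    return left + right

t = np.arange(8.0)                    # uniform knot vector
print(sum(bspline_basis(i, 2, t, 3.5) for i in range(5)))  # 1.0 (partition of unity)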

For AI: B-splines appear in:

  • Computer graphics: NURBS (Non-Uniform Rational B-Splines) represent curved surfaces in 3D rendering and geometry processing
  • Kolmogorov-Arnold Networks (KAN): Replace MLP neurons with learned B-spline functions - a direct application of spline approximation theory in neural networks
  • Time-series modeling: Spline bases for smooth covariate functions in survival analysis and temporal models

4.4 Tension Splines and Shape Preservation

Standard cubic splines can overshoot - producing local extrema not present in the data. Monotone splines (the Fritsch-Carlson algorithm) adjust slopes to preserve monotonicity:

  1. Compute slopes d_i from divided differences
  2. If d_i has the opposite sign to the adjacent differences, set d_i = 0
  3. Rescale slopes to satisfy the \alpha_i^2 + \beta_i^2 \leq 9 condition

For AI: Monotone spline constraints appear in calibration curves, cumulative distribution function modeling, and isotonic regression.


5. Least-Squares Approximation

5.1 Polynomial Least Squares

Given m data points (x_1, y_1), \ldots, (x_m, y_m) with m > n+1, find the degree-n polynomial minimizing:

\min_{p \in \mathcal{P}_n} \sum_{i=1}^{m} (y_i - p(x_i))^2 = \min_{c \in \mathbb{R}^{n+1}} \|Vc - y\|_2^2

where V \in \mathbb{R}^{m \times (n+1)} is the Vandermonde matrix V_{ij} = x_i^j.

The Vandermonde conditioning problem: Even for moderate n, \kappa(V) = \sigma_{\max}(V)/\sigma_{\min}(V) grows like O(\rho^n) for some \rho > 1. Solving the normal equations V^\top V c = V^\top y squares the condition number: \kappa(V^\top V) = \kappa(V)^2.

Solution via QR decomposition: Compute V = QR (thin QR), then c = R^{-1} Q^\top y. This is stable and avoids squaring the condition number:

\kappa_{\text{QR solve}} = \kappa(V) \quad \text{vs} \quad \kappa_{\text{normal eq.}} = \kappa(V)^2

For AI: This is the same condition-number-squaring problem as in Section10-02. Linear regression with polynomial features should always use QR or SVD, never the normal equations with the Vandermonde matrix directly.
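
A small experiment illustrating the squaring effect, as a hedged sketch (exact condition numbers depend on the node set and degree):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 1.0 / (1 + 25 * x**2) + 1e-8 * rng.standard_normal(50)

V = np.vander(x, 16, increasing=True)           # degree-15 Vandermonde
print(f"cond(V)     = {np.linalg.cond(V):.2e}")
print(f"cond(V^T V) = {np.linalg.cond(V.T @ V):.2e}")   # roughly the square

c_normal = np.linalg.solve(V.T @ V, V.T @ y)    # normal equations (unstable)
Q, R = np.linalg.qr(V)                          # thin QR (stable)
c_qr = np.linalg.solve(R, Q.T @ y)
print(f"residual, normal eq.: {np.linalg.norm(V @ c_normal - y):.2e}")
print(f"residual, QR:         {np.linalg.norm(V @ c_qr - y):.2e}")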

5.2 Orthogonal Polynomials and Gram-Schmidt

The Vandermonde ill-conditioning arises because the monomial basis \{1, x, x^2, \ldots\} is nearly linearly dependent for large n and typical node distributions. The fix: use a basis of polynomials orthogonal with respect to the weight function.

Orthogonal polynomial construction (three-term recurrence): For a weight function w(x) > 0 on [a,b]:

\phi_0(x) = 1, \quad \phi_1(x) = x - \alpha_0

\phi_{k+1}(x) = (x - \alpha_k) \phi_k(x) - \beta_k \phi_{k-1}(x)

where the coefficients are:

\alpha_k = \frac{\langle x \phi_k, \phi_k \rangle}{\langle \phi_k, \phi_k \rangle}, \quad \beta_k = \frac{\langle \phi_k, \phi_k \rangle}{\langle \phi_{k-1}, \phi_{k-1} \rangle}

and \langle f, g \rangle = \int_a^b f(x) g(x) w(x)\, dx.

Standard families:

| Weight w(x) | Interval | Polynomials |
|---|---|---|
| 1 | [-1, 1] | Legendre P_n |
| (1-x^2)^{-1/2} | [-1, 1] | Chebyshev T_n |
| e^{-x} | [0, \infty) | Laguerre L_n |
| e^{-x^2} | (-\infty, \infty) | Hermite H_n |

Discrete orthogonal polynomials: For a finite set of nodes with weights w_i, the discrete inner product \langle f, g \rangle = \sum_i w_i f(x_i) g(x_i) gives discrete orthogonal polynomials. Using these as the basis makes the least-squares problem diagonal: \hat{c}_k = \langle f, \phi_k \rangle / \langle \phi_k, \phi_k \rangle.
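
One way to see the payoff: compare the conditioning of the monomial design matrix against a Chebyshev one (a sketch; the exact numbers vary with the grid and degree):

import numpy as np

x = np.linspace(-1, 1, 200)

V_mono = np.vander(x, 21, increasing=True)            # monomial basis, degree 20
V_cheb = np.polynomial.chebyshev.chebvander(x, 20)    # Chebyshev basis, degree 20
# The monomial design matrix is astronomically ill-conditioned;
# the Chebyshev one stays modest on [-1, 1].
print(np.linalg.cond(V_mono), np.linalg.cond(V_cheb))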

For AI: Hermite polynomials appear in quantum mechanics (as the harmonic-oscillator wavefunction basis) and in the evaluation of Gaussian integrals. The connection to attention: Hermite expansions are a standard tool for analyzing kernels induced by Gaussian random features, including random-feature approximations of softmax attention.

5.3 Minimax vs Least-Squares

Minimax (Chebyshev) approximation: Minimizes \|f - p\|_\infty = \max_x |f(x) - p(x)|.

  • Equioscillation at n+2n+2 points (Chebyshev's theorem)
  • Found by the Remez algorithm
  • Best for: applications where the worst case matters (safety-critical, numerical tables)

Least-squares (L^2) approximation: Minimizes \|f - p\|_2^2 = \int |f - p|^2 w\, dx.

  • Found by orthogonal projection: c_k = \langle f, \phi_k \rangle / \|\phi_k\|^2
  • Best for: statistical fitting, machine learning (minimizes MSE)

Comparison:

| Criterion | Minimax | Least-Squares |
|---|---|---|
| Error norm | L^\infty | L^2 |
| Algorithm | Remez (iterative) | QR / orthogonal projection |
| Sensitivity to outliers | High (driven by the worst point) | Lower |
| Computation | O(n^2) per Remez step | O(mn) one-shot |
| ML analogy | Max-margin (SVM) | MSE (regression) |

6. Fourier Approximation

6.1 Trigonometric Interpolation

For periodic functions on [0, 2\pi], the natural interpolating functions are trigonometric polynomials. Given n equispaced nodes x_j = 2\pi j / n for j = 0, \ldots, n-1, the trigonometric interpolant is:

p(x) = \sum_{k=-n/2}^{n/2-1} c_k e^{ikx}, \quad c_k = \frac{1}{n} \sum_{j=0}^{n-1} y_j e^{-ikx_j}

The coefficients c_k are exactly the Discrete Fourier Transform (DFT) of the data y_j.
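
A minimal sketch of this correspondence using NumPy's FFT (same sign convention as above; indices past n/2 hold the negative frequencies):

import numpy as np

n = 16
xj = 2 * np.pi * np.arange(n) / n
y = np.exp(np.sin(xj))                 # a smooth periodic test function
c = np.fft.fft(y) / n                  # c_k = (1/n) sum_j y_j e^{-ik x_j}
print(np.abs(c[:5]))                   # coefficients decay rapidly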

Error for periodic functions: For a function with Fourier series f(x) = \sum_k \hat{f}_k e^{ikx}:

\|f - p_n\|_\infty \leq 2\sum_{|k| > n/2} |\hat{f}_k|

For smooth periodic functions the Fourier coefficients decay rapidly (exponentially for analytic functions), so this error is very small - spectral accuracy.

6.2 Discrete Fourier Transform

Definition: The DFT of a vector y \in \mathbb{C}^n is:

Y_k = \sum_{j=0}^{n-1} y_j \omega_n^{-jk}, \quad k = 0, 1, \ldots, n-1

where \omega_n = e^{2\pi i/n} is the primitive n-th root of unity.

Matrix form: Y = F_n y where (F_n)_{jk} = \omega_n^{-jk}.

The matrix F_n is unitary up to scaling (F_n^* F_n = n I), so the inverse DFT (IDFT) is:

y_j = \frac{1}{n} \sum_{k=0}^{n-1} Y_k \omega_n^{jk}

Properties:

  • Parseval's theorem: \|Y\|^2 = n \|y\|^2 (energy conservation)
  • Convolution theorem: the DFT of a circular convolution is the pointwise product of the DFTs
  • Real signals: If y_j \in \mathbb{R}, then Y_{n-k} = \overline{Y_k} (conjugate symmetry)

For AI: The convolution theorem is the mathematical foundation of efficient CNN arithmetic: a convolution in the spatial domain equals pointwise multiplication in the frequency domain. FFT-based convolution has complexity O(n \log n) vs O(n^2) for direct convolution.
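
A minimal sketch of FFT-based linear convolution (fft_convolve is our name; zero-padding avoids the circular wrap-around the theorem assumes):

import numpy as np

def fft_convolve(x, h):
    """Linear convolution via FFT with zero-padding to length len(x)+len(h)-1."""
    n = len(x) + len(h) - 1
    return np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)

x = np.random.default_rng(1).standard_normal(1000)
h = np.exp(-0.5 * np.linspace(-3, 3, 31)**2)          # Gaussian filter
print(np.allclose(fft_convolve(x, h), np.convolve(x, h)))  # True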

6.3 Fast Fourier Transform

Naive DFT: O(n^2) - computing each of the n output values requires summing n terms.

Cooley-Tukey FFT (1965): Exploits the factorization of n (here n = 2^m) to reduce the complexity to O(n \log n).

Radix-2 decimation-in-time: Split y into even- and odd-indexed samples:

Y_k = \underbrace{\sum_{j=0}^{n/2-1} y_{2j} \omega_n^{-2jk}}_{E_k} + \omega_n^{-k} \underbrace{\sum_{j=0}^{n/2-1} y_{2j+1} \omega_n^{-2jk}}_{O_k}

Since \omega_n^{-2jk} = \omega_{n/2}^{-jk}, both E and O are DFTs of size n/2. The butterfly pattern:

DFT BUTTERFLY STRUCTURE (n=8)
========================================================================

  Input (bit-reversed order)           Output (natural order)
  y_0 ---+                                Y_0
  y_4 ---+-- butterfly stages --------->  Y_1
  y_2 ---+                                Y_2
  y_6 ---+-- butterfly stages --------->  Y_3
  y_1 ---+                                Y_4
  y_5 ---+-- butterfly stages --------->  Y_5
  y_3 ---+                                Y_6
  y_7 ---+-- butterfly stages --------->  Y_7

  log2(8) = 3 stages x 4 butterflies = 12 butterfly operations,
  vs 64 multiply-adds for the naive DFT.  Speedup: O(n^2) -> O(n log n).

========================================================================

Complexity: T(n) = 2T(n/2) + O(n) solves to T(n) = O(n \log n). For n = 10^6, the FFT requires \sim 2 \times 10^7 operations vs 10^{12} for the naive DFT - a speedup of about 50{,}000\times.
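
The recursion above is only a few lines of code. A minimal radix-2 sketch (fft_radix2 is our name; the input length must be a power of 2):

import numpy as np

def fft_radix2(y):
    """Recursive radix-2 Cooley-Tukey DFT."""
    n = len(y)
    if n == 1:
        return y.astype(complex)
    E = fft_radix2(y[0::2])                              # even-indexed half
    O = fft_radix2(y[1::2])                              # odd-indexed half
    tw = np.exp(-2j * np.pi * np.arange(n // 2) / n) * O # twiddle factors
    return np.concatenate([E + tw, E - tw])

y = np.random.default_rng(2).standard_normal(64)
print(np.allclose(fft_radix2(y), np.fft.fft(y)))  # True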

For AI: FFT is used in:

  • Spectral convolution layers (FNet, S4, Hyena)
  • Long-range attention in Performer (via random Fourier features)
  • Frequency-domain training of neural networks

6.4 Aliasing and the Nyquist Theorem

Aliasing: When sampling at rate f_s, frequencies above f_s/2 (the Nyquist frequency) are indistinguishable from lower frequencies: a frequency f with f_s/2 < f < f_s aliases to a frequency of magnitude |f - f_s| < f_s/2.

Nyquist-Shannon sampling theorem: A signal with bandwidth B (no frequencies above B) can be perfectly reconstructed from samples taken at rate f_s \geq 2B.

For AI: Aliasing matters in:

  • Audio processing (e.g., sample rate 44.1 kHz supports up to 22 kHz - just above human hearing)
  • Image super-resolution: downsampling without anti-aliasing introduces artifacts
  • Periodic positional encodings: the frequency range [104,1][10^{-4}, 1] in transformer PEs avoids aliasing for typical sequence lengths

7. Radial Basis Functions and Kernel Methods

7.1 RBF Interpolation

For scattered data (non-uniform nodes in \mathbb{R}^d), polynomial interpolation becomes problematic. Radial basis function (RBF) interpolation uses:

s(x) = \sum_{j=1}^{n} c_j \phi(\|x - x_j\|), \quad x \in \mathbb{R}^d

where \phi: [0, \infty) \to \mathbb{R} is a radial function. Common choices:

| Name | \phi(r) | Smoothness |
|---|---|---|
| Gaussian | e^{-\varepsilon^2 r^2} | C^\infty |
| Multiquadric | \sqrt{1 + \varepsilon^2 r^2} | C^\infty |
| Inverse multiquadric | (1 + \varepsilon^2 r^2)^{-1/2} | C^\infty |
| Thin plate spline | r^2 \log r | C^1 |
| Matern-3/2 | (1 + \sqrt{3}r/\ell)\exp(-\sqrt{3}r/\ell) | C^2 |
| Wendland | piecewise polynomial, compact support | C^{2k} |

Interpolation system: The coefficients c solve the linear system:

\Phi c = y, \quad \Phi_{ij} = \phi(\|x_i - x_j\|)

For positive definite RBFs (Gaussian, Matern), \Phi is positive definite - a unique solution is guaranteed.

For AI: The Gaussian RBF kernel is the most common kernel in kernel machines (SVMs, Gaussian processes). The shape parameter \varepsilon (or lengthscale \ell) controls the smoothness/locality tradeoff - the analogue of a bandwidth hyperparameter in kernel density estimation.
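
A minimal 1D sketch of the interpolation system (rbf_interpolate is our name; the eps value is an arbitrary choice):

import numpy as np

def rbf_interpolate(x_nodes, y_nodes, x_eval, eps=2.0):
    """Gaussian RBF interpolation: solve Phi c = y, then evaluate s(x)."""
    Phi = np.exp(-(eps * (x_nodes[:, None] - x_nodes[None, :]))**2)
    c = np.linalg.solve(Phi, y_nodes)
    K = np.exp(-(eps * (x_eval[:, None] - x_nodes[None, :]))**2)
    return K @ c

x = np.linspace(-1, 1, 12)
y = 1.0 / (1 + 25 * x**2)
xe = np.linspace(-1, 1, 200)
s = rbf_interpolate(x, y, xe)
print(np.max(np.abs(s - 1.0 / (1 + 25 * xe**2))))   # error between the nodes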

7.2 Connection to Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Space (RKHS): For a symmetric positive definite kernel k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}, the RKHS \mathcal{H}_k is the completion of \text{span}\{k(\cdot, x) : x \in \mathcal{X}\} with the inner product satisfying the reproducing property:

\langle f, k(\cdot, x) \rangle_{\mathcal{H}} = f(x)

Representer theorem: The minimizer of \min_{f \in \mathcal{H}} \sum_i (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2 has the form:

f^* = \sum_{j=1}^{n} c_j k(\cdot, x_j)

This is exactly the RBF interpolant form! The RKHS norm \|f\|_{\mathcal{H}} acts as a regularizer controlling smoothness.

For AI: The representer theorem is foundational for:

  • Kernel SVMs: the decision boundary is a linear combination of kernel functions centered at the support vectors
  • Gaussian process regression: the posterior mean is the (regularized) RKHS interpolant, and the posterior variance quantifies the interpolation uncertainty
  • Neural tangent kernel (NTK): infinite-width neural networks trained by gradient descent converge to kernel regression with the NTK

7.3 Gaussian Processes as Bayesian Approximation

A Gaussian process f \sim \mathcal{GP}(\mu, k) is a distribution over functions where any finite collection (f(x_1), \ldots, f(x_n)) is multivariate Gaussian with mean (\mu(x_1), \ldots, \mu(x_n)) and covariance matrix K_{ij} = k(x_i, x_j).

GP posterior: Given observations y_i = f(x_i) + \varepsilon_i with \varepsilon_i \sim \mathcal{N}(0, \sigma^2):

f(x^*) \mid \mathbf{y} \sim \mathcal{N}(\mu^*, \sigma^{*2})

\mu^* = k^{*\top}(K + \sigma^2 I)^{-1} \mathbf{y}

\sigma^{*2} = k(x^*, x^*) - k^{*\top}(K + \sigma^2 I)^{-1} k^*

where k^*_i = k(x^*, x_i).

The posterior mean \mu^* is precisely the kernel ridge regression solution - Bayesian inference gives both a point estimate and uncertainty quantification "for free".
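
A minimal sketch of these posterior formulas with a Gaussian kernel (gp_posterior, ell, and sigma are our names/choices):

import numpy as np

def gp_posterior(x_train, y_train, x_test, ell=0.3, sigma=0.1):
    """GP posterior mean and variance with a unit-variance Gaussian kernel."""
    def k(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)
    K = k(x_train, x_train) + sigma**2 * np.eye(len(x_train))
    Ks = k(x_test, x_train)
    mu = Ks @ np.linalg.solve(K, y_train)
    # k(x*, x*) = 1 for this kernel; subtract the explained variance
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, var

x = np.linspace(0, 1, 10)
mu, var = gp_posterior(x, np.sin(2 * np.pi * x), np.linspace(0, 1, 50))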

For AI: Gaussian processes are used for:

  • Bayesian optimization: Surrogate model for expensive black-box functions (hyperparameter tuning)
  • Neural architecture search (NAS): GP surrogate for network performance prediction
  • Uncertainty quantification: GP posterior variance estimates prediction confidence

8. Applications in Machine Learning

8.1 Positional Encodings: Sinusoidal Interpolation

The original transformer uses sinusoidal positional encodings (Vaswani et al., 2017):

\text{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad \text{PE}(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)

Approximation theory perspective: These are trigonometric interpolation functions. The frequencies span [10^{-4}, 1] - about 4 decades - covering both local (high-frequency) and global (low-frequency) position information.

Why sinusoidal? For any offset k, \text{PE}(pos+k) is a linear function of \text{PE}(pos) - the offset acts through a fixed rotation matrix. This allows the model to attend to relative positions.
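
A minimal sketch of the encoding (sinusoidal_pe is our name):

import numpy as np

def sinusoidal_pe(n_pos, d):
    """Sinusoidal positional encodings, shape (n_pos, d)."""
    pos = np.arange(n_pos)[:, None]
    freqs = 1.0 / 10000 ** (np.arange(0, d, 2) / d)   # one frequency per sin/cos pair
    pe = np.zeros((n_pos, d))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe

pe = sinusoidal_pe(128, 64)   # 128 positions, model dimension 64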

RoPE (Rotary Position Embedding, Su et al., 2021): Used in LLaMA, Mistral, and GPT-NeoX. Applies a position-dependent rotation to query and key vectors:

q_m = R_m q, \quad k_n = R_n k, \quad R_m = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}

The inner product q_m^\top k_n = q^\top R_m^\top R_n k = q^\top R_{n-m} k depends only on the relative position m-n - exact relative position encoding via complex multiplication.

For AI: The interpolation challenge arises when extending context length. A model trained with RoPE on 2048 tokens needs to generalize to 8192 - trigonometric extrapolation, not interpolation. Methods like YaRN (Peng et al., 2023) modify the frequency range so that longer contexts become interpolation over the position indices.

8.2 Neural Networks as Universal Approximators

Cybenko's theorem (1989): Any continuous function f: [0,1]^d \to \mathbb{R} can be approximated uniformly by a single-hidden-layer network:

f(x) \approx \sum_{j=1}^{N} c_j \sigma(a_j^\top x + b_j)

for any continuous sigmoidal \sigma. As N \to \infty, the approximation error goes to zero.

Modern variants:

  • Barron's theorem (1993): Functions with finite Barron norm C_f = \int |\hat{f}(\omega)| \|\omega\| d\omega < \infty can be approximated with O(1/N) squared L^2 error using N neurons - independent of the dimension d. This avoids the curse of dimensionality for this function class.
  • ReLU depth separation: There are functions that deep ReLU networks of depth L represent compactly but that require exponentially many neurons in a shallow network - depth increases expressiveness exponentially
  • Approximation vs optimization: Universal approximation guarantees that good weights exist; finding them by gradient descent is a separate (hard) problem

KAN (Kolmogorov-Arnold Networks, 2024): Replace the node-wise nonlinearity with edge-wise learned univariate functions (represented as B-splines):

\text{KAN}(x) = \sum_{j} \phi_{L,j}\left(\cdots \sum_i \phi_{1,ij}(x_i) \cdots\right)

Based on the Kolmogorov-Arnold representation theorem: every continuous function of n variables can be written as a composition of univariate functions and addition. KANs implement this directly with learnable splines.

8.3 Feature Maps and Kernel Approximation

Random Fourier Features (Rahimi-Recht, 2007): A shift-invariant kernel k(x, y) = k(x-y) can be written (Bochner's theorem) as:

k(x - y) = \mathbb{E}_{\omega \sim p(\omega)}[e^{i\omega^\top(x-y)}]

where p(\omega) is the spectral density. Approximating the expectation with D random frequencies:

k(x, y) \approx z(x)^\top z(y), \quad z(x) = \sqrt{\frac{2}{D}} \begin{pmatrix} \cos(\omega_1^\top x + b_1) \\ \vdots \\ \cos(\omega_D^\top x + b_D) \end{pmatrix}

with \omega_j \sim p and b_j \sim \text{Uniform}[0, 2\pi]. This turns a kernel method into a linear model in feature space - O(nD) instead of O(n^2) for the kernel matrix.
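
A minimal sketch for the Gaussian kernel, whose spectral density is itself Gaussian (rff_features is our name; gamma parameterizes k(x,y) = exp(-gamma ||x-y||^2)):

import numpy as np

def rff_features(X, D=500, gamma=1.0, seed=0):
    """Random Fourier features for the Gaussian kernel exp(-gamma ||x-y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))  # omega ~ spectral density
    b = rng.uniform(0, 2 * np.pi, D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = np.random.default_rng(1).standard_normal((5, 3))
Z = rff_features(X)
K_true = np.exp(-np.sum((X[:, None] - X[None, :])**2, axis=-1))  # gamma = 1
print(np.max(np.abs(Z @ Z.T - K_true)))   # small, decays like 1/sqrt(D)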

For AI: Random Fourier features are the foundation of:

  • Performer (Choromanski et al., 2020): Approximates softmax attention with O(n)O(n) complexity using random feature maps
  • Neural Tangent Kernel implementations: Computing NTK efficiently for large networks

9. Common Mistakes

| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Using equispaced nodes for high-degree polynomial interpolation | Runge's phenomenon: error grows exponentially near the endpoints | Use Chebyshev nodes or splines |
| 2 | Solving least-squares via the normal equations with a Vandermonde matrix | Condition number is squared: \kappa(V^\top V) = \kappa(V)^2 grows like \rho^{2n} | Use QR decomposition or an orthogonal polynomial basis |
| 3 | Extrapolating polynomial interpolants | Polynomial interpolants diverge outside [x_0, x_n] unless the function is actually polynomial | Use local models (splines) or explicit extrapolation assumptions |
| 4 | Setting spline boundary conditions to natural (S''=0) when the data is not "natural" | Introduces artificial inflection points if the true function has nonzero curvature at the endpoints | Use clamped or not-a-knot conditions when derivative information is available |
| 5 | Confusing interpolation degree with approximation quality | Higher-degree interpolation is not always more accurate (Runge) | Match the basis to the data's smoothness and node count |
| 6 | Forgetting that the DFT assumes periodic extension | A discontinuity at the boundary causes the Gibbs phenomenon - oscillations near the jump | Apply windowing (Hann, Hamming) or use a non-uniform FFT |
| 7 | Using a Gaussian RBF with too small a shape parameter \varepsilon | A nearly flat basis makes \Phi nearly singular: the condition number explodes as \varepsilon \to 0 | Tune \varepsilon via cross-validation or use stable algorithms (Contour-Pade) |
| 8 | Treating the Lebesgue constant as a purely theoretical concept | It directly bounds error amplification: if \Lambda_n is large, small errors in y_i cause large errors in p_n | Check the Lebesgue constant for your node choice |
| 9 | Assuming neural network universality implies easy learning | The UAT says an approximating network exists, not that gradient descent finds it | Understand the approximation-optimization gap |
| 10 | Misinterpreting sinusoidal PE frequencies as sampling frequencies | The 10000^{2i/d} factor controls the wavelength (period), not a sampling rate | Think in wavelengths: smallest = 2\pi, largest = 2\pi \times 10000 |
| 11 | Using a fixed-\varepsilon RBF for high-dimensional data | RBF interpolation is ill-conditioned in high dimensions without careful parameter selection | Use compactly supported RBFs (Wendland) or regularized regression |
| 12 | Evaluating Lagrange basis polynomials naively | Overflow/underflow and cancellation in \prod_{k \neq j}(x - x_k) for large n | Use the barycentric form: numerically stable and O(n) per evaluation |

10. Exercises

Exercise 1 * - Newton Divided Differences. Implement the Newton divided difference algorithm and evaluate the interpolating polynomial using Horner's rule. (a) Compute the divided difference table for (0,1), (1,3), (2,7), (3,13). (b) Implement newton_interp(nodes, vals, x) using the divided difference table and Horner's method. (c) Verify the result matches the Lagrange interpolant.

Exercise 2 * - Runge's Phenomenon. Demonstrate Runge's phenomenon numerically and show that Chebyshev nodes tame it. (a) For f(x) = 1/(1+25x^2), compute the interpolation error at n = 5, 10, 15, 20 with equispaced nodes. (b) Repeat with Chebyshev nodes. Plot both error curves on a log scale. (c) Compute the Lebesgue constant for both node sets at n = 15.

Exercise 3 * - Natural Cubic Spline. Build a cubic spline solver from scratch using the tridiagonal system. (a) Derive the tridiagonal system for the second derivatives M_i. (b) Implement the Thomas algorithm for tridiagonal systems. (c) Test on f(x) = \sin(\pi x) with 10 equispaced nodes. Plot the spline and its error.

Exercise 4 ** - Vandermonde Conditioning and QR. Show that QR decomposition is numerically superior to the normal equations for polynomial fitting. (a) Construct the Vandermonde matrix for n = 15 equispaced nodes. Compute \kappa(V) and \kappa(V^\top V). (b) Fit a degree-15 polynomial to noisy data by: (i) normal equations, (ii) QR factorization. (c) Compare the residuals and coefficient sensitivity to noise.

Exercise 5 ** - Chebyshev Approximation. Compute the Chebyshev series and compare it to the minimax polynomial. (a) Compute the first 20 Chebyshev coefficients of f(x) = |x| using the DCT. (b) Show that the truncated Chebyshev series achieves near-minimax approximation. (c) For n = 10, plot: exact f, the Chebyshev approximation, and the best polynomial approximation. Compute the difference in \infty-norm.

Exercise 6 ** - FFT and the Convolution Theorem. Use the FFT to compute a convolution efficiently. (a) Implement a 1D discrete convolution via FFT: fft_convolve(x, h). (b) Verify correctness against np.convolve(x, h) for a Gaussian filter. (c) Measure and compare the runtime of FFT vs direct convolution for n = 1000, 10000.

Exercise 7 ** - RBF Interpolation and Kernel Ridge Regression. Connect RBF interpolation to kernel methods. (a) Implement Gaussian RBF interpolation for 1D scattered data. (b) Show that as the regularization \lambda \to 0, kernel ridge regression approaches RBF interpolation. (c) Demonstrate the shape parameter effect: for \varepsilon = 0.1, 1, 10, plot the RBF interpolant and the condition number of \Phi.

Exercise 8 *** - Positional Encoding Interpolation. Analyze sinusoidal positional encodings from an approximation theory perspective. (a) Implement transformer sinusoidal PEs for dimension d = 64. (b) Show that \text{PE}(pos + k) is a linear function of \text{PE}(pos) (derive the rotation matrix). (c) Simulate context-length extrapolation: train a simple model on positions \{0,\ldots,511\} and evaluate on \{512,\ldots,2047\}. Compare original PEs, linearly interpolated PEs, and NTK-scaled PEs.


11. Why This Matters for AI (2026 Perspective)

| Concept | AI/LLM Impact |
|---|---|
| Chebyshev nodes and minimax | ChebNet spectral GNNs avoid explicit eigendecomposition; used in molecular property prediction |
| Cubic splines | Neural ODEs use spline interpolation for continuous-time dynamics; used in physics-informed networks |
| B-splines | KAN networks (2024) parameterize learnable functions as B-splines; an emerging alternative to MLPs |
| FFT and convolution theorem | FNet, S4, Hyena use FFT-based long-range mixing as an efficient attention alternative (O(n \log n)) |
| Random Fourier features | Performer achieves O(n) attention via kernel approximation; NTK analysis of neural nets |
| Sinusoidal interpolation | RoPE positional encodings in LLaMA, Mistral, GPT-4 class models |
| Context length extrapolation | YaRN, LongRoPE interpolate position indices to extend 4K-trained models to 128K+ tokens |
| Gaussian processes | Bayesian optimization for hyperparameter search in AutoML; uncertainty quantification in autonomous vehicles |
| RKHS and representer theorem | Foundation for kernel SVMs, support vector regression, and theoretical analysis of neural networks |
| Universal approximation | Theoretical basis for depth/width scaling laws; understanding which architectures can represent which functions |

12. Conceptual Bridge

Looking back: This section builds on floating-point arithmetic (Section10-01) - all numerical interpolation is subject to rounding, and the condition numbers of Vandermonde and RBF matrices determine how much rounding error is amplified. From numerical linear algebra (Section10-02), we use QR decomposition for stable least-squares fitting and tridiagonal solvers for spline construction. The optimization perspective (Section10-03) underlies minimax approximation (Remez algorithm) and least-squares fitting. Earlier linear algebra (Section02) provided the theory of least squares, orthogonality, and projections.

Looking forward: The next section (Section10-05: Numerical Integration) uses interpolation directly - quadrature rules are derived by integrating interpolating polynomials. Gaussian quadrature nodes are the zeros of orthogonal polynomials. Clenshaw-Curtis quadrature uses Chebyshev nodes and the DCT for efficient weight computation. Everything built here - Chebyshev theory, orthogonal polynomials, FFT - feeds directly into numerical integration.

The connection to the broader curriculum:

  • Graph theory (Section11): Spectral graph convolutions approximate the spectral filter via Chebyshev polynomials in the graph Laplacian eigenvalues
  • Probability (Section07): Gaussian processes are probabilistic interpolants; Fourier analysis of stochastic processes uses spectral density
  • Optimization (Section08): The minimax approximation problem is solved by Chebyshev equioscillation, connecting to constrained optimization and duality

INTERPOLATION AND APPROXIMATION IN THE CURRICULUM
========================================================================

  Section10-01 Floating-Point        Section02 Linear Algebra
  (rounding errors)                  (least squares, QR)
           |                                  |
           +----------------+-----------------+
                            v
                 Section10-04 INTERPOLATION
                 AND APPROXIMATION
                 +------------------+
                 | Lagrange / Newton|
                 | Chebyshev / Runge|
                 | Cubic splines    |
                 | Least squares    |
                 | FFT / DFT        |
                 | RBF / RKHS       |
                 +--------+---------+
                          |
         +----------------+----------------+
         v                v                v
   Section10-05      Section11 GNNs     AI Apps
   Numerical         (ChebNet,          (RoPE, KAN,
   Integration       SpectralGCN)       Performer)

========================================================================

Key insight: The choice of basis is the choice of inductive bias. Monomials assume no structure; Chebyshev assumes bounded derivatives; splines assume local smoothness; trigonometric functions assume periodicity; RBFs assume isotropy. Neural networks learn their basis from data - but the theoretical guarantees come from understanding which function classes can be represented and approximated efficiently.


Appendix A: Barycentric Lagrange Interpolation - Algorithm and Analysis

The barycentric form is the preferred implementation of polynomial interpolation.

A.1 Derivation

Define the node polynomial \ell(x) = \prod_{j=0}^{n}(x - x_j) and barycentric weights w_j = 1/\prod_{k \neq j}(x_j - x_k).

Then \ell_j(x) = \ell(x) \cdot w_j / (x - x_j) and:

p(x) = \ell(x) \sum_{j=0}^{n} \frac{w_j}{x - x_j} y_j

Since \sum_j \ell_j(x) = 1 implies \ell(x) \sum_j w_j/(x-x_j) = 1, dividing the two expressions gives:

p(x) = \frac{\sum_{j=0}^{n} \frac{w_j}{x - x_j} y_j}{\sum_{j=0}^{n} \frac{w_j}{x - x_j}}

A.2 Numerical Stability

Theorem (Higham 2004): The barycentric formula is backward stable: it computes the exact interpolant of slightly perturbed data \tilde{y}_j = y_j (1 + \delta_j) with |\delta_j| \leq O(n \varepsilon_{\text{mach}}).

Practical implications:

  1. Pre-compute weights wjw_j once: O(n2)O(n^2) preprocessing
  2. Evaluate at any xx: O(n)O(n) per evaluation
  3. Adding a new node (xn+1,yn+1)(x_{n+1}, y_{n+1}): O(n)O(n) update to weights
  4. No division by (x)\ell(x) - the denominator is the normalizing sum, never computed separately

A.3 Barycentric Weights for Special Nodes

Equispaced nodes on [1,1][-1,1]: wj=(1)j(nj)w_j = (-1)^j \binom{n}{j} - grows like 2n2^n.

Chebyshev nodes of the 2nd kind xj=cos(jπ/n)x_j = \cos(j\pi/n): wj=(1)jδjw_j = (-1)^j \delta_j where δ0=δn=1/2\delta_0 = \delta_n = 1/2, δj=1\delta_j = 1 otherwise.

Chebyshev nodes of the 1st kind xj=cos((2j+1)π/(2n+2))x_j = \cos((2j+1)\pi/(2n+2)): wj=(1)jsin((2j+1)π/(2n+2))w_j = (-1)^j \sin((2j+1)\pi/(2n+2)).

The Chebyshev weights are computed in O(n)O(n) and are moderate in size - explaining their superior numerical properties.

import numpy as np

def barycentric_weights_cheb2(n):
    """Barycentric weights for Chebyshev points of the 2nd kind."""
    w = np.ones(n + 1)
    w[1::2] = -1          # alternating signs first, ...
    w[0] *= 0.5           # ... then halve the endpoint weights
    w[n] *= 0.5
    return w

def barycentric_eval(x, nodes, values, weights):
    """Evaluate the barycentric interpolant at a scalar point x."""
    diffs = x - nodes
    # If x coincides with a node, return the data value directly
    exact = np.where(diffs == 0)[0]
    if len(exact) > 0:
        return values[exact[0]]
    w_over_d = weights / diffs
    return (w_over_d @ values) / w_over_d.sum()

Appendix B: Divided Difference Properties and Connections

B.1 The Divided Difference as a Derivative

Theorem: If f \in C^n[a,b] and x_0, \ldots, x_n \in [a,b] are distinct, then:

[y_0, y_1, \ldots, y_n] = \frac{f^{(n)}(\xi)}{n!}

for some \xi in the convex hull of \{x_0, \ldots, x_n\}.

Proof: Apply Rolle's theorem n times, as in the interpolation error theorem. The n-th divided difference telescopes to f^{(n)}(\xi)/n!. \square

Corollary: For equally spaced nodes x_j = x_0 + jh:

[y_0, y_1, \ldots, y_n] = \frac{\Delta^n y_0}{n! h^n}

where \Delta^n is the n-th forward difference operator: \Delta^0 y_j = y_j, \Delta y_j = y_{j+1} - y_j, \Delta^n y_j = \Delta(\Delta^{n-1} y_j).

B.2 Newton's Forward Difference Formula

For equispaced nodes x_j = x_0 + jh, with x = x_0 + sh:

p_n(x) = \sum_{k=0}^{n} \binom{s}{k} \Delta^k y_0

where \binom{s}{k} = s(s-1)\cdots(s-k+1)/k! is the generalized binomial coefficient. This is Newton's forward difference interpolation formula - the precursor to finite difference schemes in PDE solvers.

B.3 Connection to Finite Difference Schemes

The divided difference is exactly the finite difference approximation to the derivative:

[y_0, y_1] = \frac{y_1 - y_0}{x_1 - x_0} \approx f'(x_0), \quad [y_0, y_1, y_2] = \frac{[y_1,y_2] - [y_0,y_1]}{x_2 - x_0} \approx \frac{f''(\xi)}{2}

For equispaced nodes: [y_0, y_1] = \Delta y_0 / h = f'(x_0) + O(h) (the first-order forward difference).

For AI: Numerical differentiation (Section10-03) uses finite differences - they are divided differences with equispaced nodes. The second divided difference gives the second-derivative approximation needed for Newton's method and Hessian-vector products.


Appendix C: Chebyshev Polynomial Deep Dive

C.1 Chebyshev Expansion Coefficients via DCT

The Chebyshev coefficients of f on [-1,1]:

c_k = \frac{2}{\pi} \int_0^\pi f(\cos\theta) \cos(k\theta)\, d\theta

can be computed via the Discrete Cosine Transform (DCT) at the Chebyshev nodes:

c_k \approx \frac{2}{n} \sum_{j=0}^{n-1} f(\cos\theta_j) \cos(k\theta_j), \quad \theta_j = \frac{\pi(2j+1)}{2n}

This is a DCT-II computation, achievable in O(n \log n) via the FFT.

Algorithm:

  1. Evaluate f at the n Chebyshev nodes \cos\theta_j
  2. Apply the DCT-II (via FFT, or scipy.fft.dct)
  3. Normalize: divide all coefficients by n, then halve c_0 (series convention)
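
A minimal sketch of this algorithm using scipy.fft.dct (chebyshev_coeffs is our name):

import numpy as np
from scipy.fft import dct

def chebyshev_coeffs(f, n):
    """First n Chebyshev coefficients of f via a DCT-II at Chebyshev points."""
    theta = np.pi * (2 * np.arange(n) + 1) / (2 * n)
    fx = f(np.cos(theta))
    c = dct(fx, type=2) / n    # scipy DCT-II: 2 * sum_j f_j cos(pi k (2j+1) / (2n))
    c[0] /= 2                  # series convention: c_0 halved
    return c

c = chebyshev_coeffs(np.exp, 20)
x = 0.3                        # check: sum_k c_k T_k(x) recovers exp(x)
print(abs(sum(ck * np.cos(k * np.arccos(x)) for k, ck in enumerate(c)) - np.exp(x)))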

C.2 Clenshaw Recurrence for Evaluation

To evaluate S(x) = \sum_{k=0}^n c_k T_k(x) without computing each T_k(x) explicitly:

def clenshaw(x, c):
    """Evaluate sum_k c[k] T_k(x) by the Clenshaw recurrence."""
    b1, b2 = 0.0, 0.0
    for ck in c[:0:-1]:               # c[n], c[n-1], ..., c[1]
        b1, b2 = 2 * x * b1 - b2 + ck, b1
    return x * b1 - b2 + c[0]

This is the Clenshaw algorithm - stable O(n) evaluation of a Chebyshev series.

C.3 Chebyshev Approximation Convergence Rates

For functions with various smoothness:

| Smoothness | Coefficient decay \|c_k\| | Best approximation E_n |
|---|---|---|
| Analytic in an ellipse E_\rho | O(\rho^{-k}) | O(\rho^{-n}) |
| f \in C^p[-1,1] (f^{(p)} of bounded variation) | O(k^{-p-1}) | O(n^{-p}) |
| Lipschitz continuous | O(k^{-1}) | O(n^{-1}) |
| Discontinuous | O(k^{-1}) at best (Gibbs) | no uniform convergence |

For AI: The exponential convergence for analytic functions is why spectral methods achieve machine precision with far fewer points than finite differences. This matters for PINN solvers - use spectral collocation (Chebyshev) not finite differences when the solution is smooth.


Appendix D: B-Spline Implementation Details

D.1 De Boor's Algorithm

Evaluating a B-spline S(x)=iciBi,p(x)S(x) = \sum_i c_i B_{i,p}(x) at a point xx in knot interval [tj,tj+1)[t_j, t_{j+1}):

d[i] = c[i]  for i = j-p, j-p+1, ..., j

for r = 1, 2, ..., p:
    for i = j, j-1, ..., j-p+r:
        alpha = (x - t[i]) / (t[i+p-r+1] - t[i])
        d[i] = (1-alpha) * d[i-1] + alpha * d[i]

result = d[j]

This is the de Boor algorithm - O(p2)O(p^2) per evaluation, numerically stable.
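A runnable version of the recurrence, with our own indexing, cross-checked against scipy.interpolate.BSpline:

import numpy as np
from scipy.interpolate import BSpline

def de_boor(x, t, c, p):
    """Evaluate sum_i c_i B_{i,p}(x) by de Boor's recurrence (a sketch)."""
    j = np.searchsorted(t, x, side='right') - 1    # interval with t[j] <= x < t[j+1]
    d = [c[i] for i in range(j - p, j + 1)]        # the p+1 active control points
    for r in range(1, p + 1):
        for i in range(p, r - 1, -1):              # local index into d
            k = j - p + i                          # global control-point index
            alpha = (x - t[k]) / (t[k + p - r + 1] - t[k])
            d[i] = (1 - alpha) * d[i - 1] + alpha * d[i]
    return d[p]

p = 3
t = np.array([0, 0, 0, 0, 1, 2, 3, 4, 4, 4, 4], dtype=float)  # clamped knots
c = np.array([0.0, 1.0, 3.0, 2.0, 4.0, 1.0, 2.0])             # 7 control points
x = 1.7
print(de_boor(x, t, c, p), BSpline(t, c, p)(x))               # should agree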

D.2 Knot Vectors

The knot vector T=(t0t1tn+p+1)T = (t_0 \leq t_1 \leq \cdots \leq t_{n+p+1}) controls the spline:

  • Uniform: ti=i/nt_i = i/n - equal spacing, best for smooth periodic data
  • Clamped/open: t0==tp=0t_0 = \cdots = t_p = 0, tn+1==tn+p+1=1t_{n+1} = \cdots = t_{n+p+1} = 1 - spline passes through first and last control points
  • Multiple knots: repeating a knot reduces continuity - a knot of multiplicity m gives C^{p-m} continuity there (multiplicity p+1 permits a jump)

D.3 B-Splines in KAN Architecture

Kolmogorov-Arnold Networks (KAN, Liu et al. 2024) represent each network edge \phi_{l,i,j} as a learnable B-spline:

ϕ(x)=wbb(x)+wsspline(x)\phi(x) = w_b \cdot b(x) + w_s \cdot \text{spline}(x)

where b(x)=x/(1+ex)b(x) = x/(1+e^{-x}) (SiLU) is a base function and spline(x)=iciBi(x)\text{spline}(x) = \sum_i c_i B_i(x) is a B-spline with learnable coefficients cic_i and optionally adaptive grid.

Training: The grid is fixed initially; coefficients cic_i are learned by gradient descent. Periodically, the grid is updated to better cover the range of activations - grid extension.

Advantages over MLP: KANs can represent functions with simple structure (e.g., f(x,y)=sin(x+y)f(x,y) = \sin(x+y)) with far fewer parameters than MLPs. The spline representation is interpretable - you can read off the learned function shape.


Appendix E: Spline Interpolation - Error Analysis and Conditioning

E.1 Cubic Spline Error Bounds

Theorem (Hall-Meyer, 1976): Let f \in C^4[a,b] and S be the complete (clamped) cubic spline interpolant. Then:

f(k)S(k)Ckh4kf(4),k=0,1,2,3\|f^{(k)} - S^{(k)}\|_\infty \leq C_k h^{4-k} \|f^{(4)}\|_\infty, \quad k = 0, 1, 2, 3

with constants C_0 = 5/384, C_1 = 1/24, C_2 = 3/8; for k = 3 the bound holds with a constant depending on the mesh ratio h_{\max}/h_{\min}.

Practical significance:

  • Halving the mesh spacing hh reduces position error by 24=162^4 = 16
  • The O(h^4) rate is the best possible for piecewise cubics - and the spline attains it despite being only C^2, with coefficients coupled through a global linear system

E.2 Condition Number of the Spline System

The tridiagonal spline system has coefficients built from the local mesh widths h_i. For equispaced nodes with h_i = h:

A=h(4114114)/3A = h \begin{pmatrix} 4 & 1 & & \\ 1 & 4 & 1 & \\ & \ddots & \ddots & \ddots \\ & & 1 & 4 \end{pmatrix} / 3

This is a diagonally dominant tridiagonal matrix with κ(A)=O(1)\kappa(A) = O(1) - the condition number is bounded independent of nn! This is why cubic spline computation is numerically stable.
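A quick check with our own toy construction (the h/3 scaling cancels in the condition number, so tridiag(1, 4, 1) suffices):

import numpy as np

# kappa_2 of tridiag(1, 4, 1): eigenvalues lie in (2, 6), so cond -> 3
for n in [10, 100, 1000]:
    A = 4*np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
    print(n, np.linalg.cond(A))   # stays near 3 regardless of n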

E.3 Why Natural Splines Minimize Curvature

Proof of minimum curvature property: Let gg be any twice-differentiable interpolant. Write g=S+hg = S + h where SS is the natural cubic spline. Then h(xi)=0h(x_i) = 0 for all nodes, and S(a)=S(b)=0S''(a) = S''(b) = 0.

ab(g)2dx=ab(S)2dx+2abShdx+ab(h)2dx\int_a^b (g'')^2 dx = \int_a^b (S'')^2 dx + 2\int_a^b S'' h'' dx + \int_a^b (h'')^2 dx

The cross term: integrate by parts twice, using S=0S'''' = 0 on each interval (cubic), h(xi)=0h(x_i) = 0 at nodes, and S(a)=S(b)=0S''(a) = S''(b) = 0. The cross term vanishes, and (h)2dx0\int (h'')^2 dx \geq 0, so (g)2dx(S)2dx\int (g'')^2 dx \geq \int (S'')^2 dx. \square


Appendix F: The Fast Fourier Transform - Detailed Analysis

F.1 Cooley-Tukey Algorithm Complexity

The radix-2 FFT recursion T(n)=2T(n/2)+cnT(n) = 2T(n/2) + cn with T(1)=0T(1) = 0:

  • Unrolling: T(n)=cnlog2nT(n) = cn \log_2 n multiplications and additions
  • For n=220106n = 2^{20} \approx 10^6: 20×106\sim 20 \times 10^6 operations vs 1012\sim 10^{12} for naive DFT

General n: When n = p_1^{a_1} \cdots p_k^{a_k}, mixed-radix FFT costs O(n \sum_i a_i p_i), which is O(n \log n) whenever the prime factors p_i are bounded. For prime n, Rader's algorithm reduces the DFT to a convolution of size n-1. The "FFT is fast" claim requires n with small prime factors - choose n = 2^k in practice.

F.2 Numerical Issues in FFT

Rounding error: The FFT error is O(log2nεmach)O(\log_2 n \cdot \varepsilon_{\text{mach}}) - logarithmic growth, much better than direct DFT's O(nεmach)O(n \varepsilon_{\text{mach}}).

Sign conventions: The sign in e^{\pm 2\pi i jk/n} differs between fields. NumPy uses e^{-2\pi i jk/n} for the forward transform np.fft.fft (the engineering convention).

Bit-reversal permutation: The in-place Cooley-Tukey FFT requires reordering inputs by bit-reversal. For output-order kk, the input index is the bit-reversal of kk in log2n\log_2 n bits.

F.3 FFT Applications in Deep Learning

Spectral convolution (S4, Hyena): The state-space model layer computes:

y=uhy = u * h

where hh is a learned convolutional kernel. Using FFT:

y=IFFT(FFT(u)FFT(h))y = \text{IFFT}(\text{FFT}(u) \odot \text{FFT}(h))

Complexity: O(LlogL)O(L \log L) vs O(L2)O(L^2) for direct attention. For sequence length L=8192L = 8192, this is a 10×\sim 10\times speedup.
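A minimal numpy sketch of the identity (this is circular convolution; S4/Hyena-style causal convolution additionally zero-pads to length 2L):

import numpy as np

rng = np.random.default_rng(0)
L = 16
u, h = rng.standard_normal(L), rng.standard_normal(L)

y_fft = np.fft.ifft(np.fft.fft(u) * np.fft.fft(h)).real     # O(L log L)
y_direct = np.array([sum(u[k] * h[(i - k) % L] for k in range(L))
                     for i in range(L)])                    # O(L^2) reference
print(np.abs(y_fft - y_direct).max())                       # ~1e-15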

FNet (Lee-Thorp et al., 2021): Replace self-attention with 2D FFT:

x = torch.fft.fft2(x, norm='ortho').real

Unparameterized - no learned weights - yet achieves 92-97% of BERT accuracy on GLUE. Demonstrates that long-range mixing doesn't require learned attention patterns.


Appendix G: Radial Basis Functions - Stability and Algorithms

G.1 The Shape Parameter Dilemma

For Gaussian RBFs ϕ(r)=eε2r2\phi(r) = e^{-\varepsilon^2 r^2}, the interpolation matrix Φij=eε2xixj2\Phi_{ij} = e^{-\varepsilon^2 \|x_i - x_j\|^2}:

  • Large ε\varepsilon (narrow): ΦI\Phi \approx I (well-conditioned), but poor approximation quality
  • Small \varepsilon (flat): near-flat Gaussians, \kappa(\Phi) blows up extremely rapidly as \varepsilon \to 0 (catastrophically ill-conditioned)

The uncertainty principle for RBFs: You cannot simultaneously have a well-conditioned system and high approximation accuracy. This is Schaback's uncertainty principle.
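A toy experiment (our own 1D setup) makes the dilemma concrete:

import numpy as np

# Conditioning of the Gaussian RBF interpolation matrix vs shape parameter
x = np.linspace(-1, 1, 20)
r2 = (x[:, None] - x[None, :])**2
for eps in [10.0, 3.0, 1.0, 0.3]:
    Phi = np.exp(-eps**2 * r2)
    print(f"eps={eps:4.1f}  cond(Phi) = {np.linalg.cond(Phi):.2e}")  # explodes as eps shrinks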

Solutions:

  1. Regularized RBF: Minimize i(s(xi)yi)2+λsH2\sum_i (s(x_i) - y_i)^2 + \lambda \|s\|_{\mathcal{H}}^2 - ridge regression with RBF kernel
  2. Compactly supported RBFs (Wendland): Sparse Φ\Phi, cheap to solve, but less smooth
  3. Contour-Pade algorithm: Reformulate using contour integration, avoiding catastrophic cancellation

G.2 Matern Kernels and Sobolev Spaces

The Matern kernel of order ν\nu corresponds to the RKHS being the Sobolev space Hν+d/2(Rd)H^{\nu + d/2}(\mathbb{R}^d):

kν(r)=21νΓ(ν)(2νr)νKν(2νr)k_\nu(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu} r}{\ell}\right)^\nu K_\nu\left(\frac{\sqrt{2\nu} r}{\ell}\right)

where KνK_\nu is the modified Bessel function. Special cases:

  • ν=1/2\nu = 1/2: k(r)=er/k(r) = e^{-r/\ell} (exponential, C0C^0 functions)
  • ν=3/2\nu = 3/2: k(r)=(1+3r/)e3r/k(r) = (1 + \sqrt{3}r/\ell)e^{-\sqrt{3}r/\ell} (C2C^2 functions)
  • ν=5/2\nu = 5/2: k(r)=(1+5r/+5r2/(32))e5r/k(r) = (1 + \sqrt{5}r/\ell + 5r^2/(3\ell^2))e^{-\sqrt{5}r/\ell} (C4C^4 functions)
  • ν\nu \to \infty: Gaussian (CC^\infty functions)

For ML: The Matern-5/2 kernel is the default in most GP libraries (scikit-learn, GPyTorch) because it provides enough smoothness without the extreme sensitivity of the Gaussian kernel.
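For reference, the two most common special cases as code (a sketch, with the lengthscale \ell as a parameter):

import numpy as np

def matern32(r, ell=1.0):
    """Matern-3/2: (1 + sqrt(3) r/ell) exp(-sqrt(3) r/ell)."""
    s = np.sqrt(3.0) * r / ell
    return (1.0 + s) * np.exp(-s)

def matern52(r, ell=1.0):
    """Matern-5/2: (1 + sqrt(5) r/ell + 5 r^2/(3 ell^2)) exp(-sqrt(5) r/ell)."""
    s = np.sqrt(5.0) * r / ell
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

r = np.linspace(0.0, 3.0, 4)
print(matern32(r), matern52(r))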

G.3 Sparse Gaussian Processes

Inducing point methods: For nn data points and mnm \ll n inducing points z\mathbf{z}:

p(fy)q(f)=p(fu)q(u)dup(f | \mathbf{y}) \approx q(f) = \int p(f | \mathbf{u}) q(\mathbf{u}) d\mathbf{u}

where u=f(z)\mathbf{u} = f(\mathbf{z}) are function values at inducing points. Complexity: O(nm2)O(nm^2) instead of O(n3)O(n^3).

Variational inducing methods (SVGP, Hensman et al.): Optimize inducing point locations jointly with model parameters via variational lower bound. Enables stochastic mini-batch training for GPs - analogous to stochastic gradient descent for neural networks.


Appendix H: Fourier Features and Neural Tangent Kernel

H.1 Bochner's Theorem and Random Fourier Features

Bochner's theorem: A continuous, shift-invariant kernel k(x,y)=k(xy)k(x,y) = k(x-y) is positive definite if and only if it is the Fourier transform of a non-negative measure p(ω)p(\omega):

k(Δ)=eiωΔp(ω)dωk(\Delta) = \int e^{i\omega^\top \Delta} p(\omega) d\omega

Random Fourier features approximate this as a Monte Carlo integral:

k(x,y)z(x)z(y),z(x)=2/D[cos(ω1x+b1),,cos(ωDx+bD)]k(x, y) \approx z(x)^\top z(y), \quad z(x) = \sqrt{2/D}[\cos(\omega_1^\top x + b_1), \ldots, \cos(\omega_D^\top x + b_D)]

where ωjp(ω)\omega_j \sim p(\omega) and bjUniform[0,2π]b_j \sim \text{Uniform}[0, 2\pi].

Convergence: With DD random features, the approximation error satisfies:

supx,yk(x,y)z(x)z(y)=O(1/D)\sup_{x,y} |k(x,y) - z(x)^\top z(y)| = O(1/\sqrt{D})

with high probability.
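A minimal sketch for the Gaussian kernel k(x,y) = e^{-\|x-y\|^2/2}, whose spectral measure p(\omega) is the standard normal (dimensions here are our own choices):

import numpy as np

rng = np.random.default_rng(0)
d, D = 5, 4000
W = rng.standard_normal((D, d))          # omega_j ~ N(0, I_d)
b = rng.uniform(0.0, 2*np.pi, D)

def z(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, y = rng.standard_normal(d), rng.standard_normal(d)
print(np.exp(-np.sum((x - y)**2) / 2), z(x) @ z(y))   # agree to O(1/sqrt(D))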

H.2 The Neural Tangent Kernel

An infinite-width neural network trained by gradient descent with squared loss implements kernel regression with the neural tangent kernel (NTK):

kNTK(x,x)=Eθinit[θf(x;θ)θf(x;θ)]k_{\text{NTK}}(x, x') = \mathbb{E}_{\theta \sim \text{init}} \left[\nabla_\theta f(x;\theta)^\top \nabla_\theta f(x';\theta)\right]
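For a finite-width network the kernel can be sampled empirically from parameter gradients; a one-initialization numpy sketch for a toy architecture of our choosing:

import numpy as np

# One Monte Carlo sample of the NTK for f(x) = v^T relu(Wx)/sqrt(m);
# the expectation above would average this over random initializations.
rng = np.random.default_rng(0)
m, d = 4096, 3
W = rng.standard_normal((m, d))
v = rng.standard_normal(m)

def param_grad(x):
    h = W @ x
    gv = np.maximum(h, 0.0) / np.sqrt(m)                     # df/dv
    gW = ((v * (h > 0))[:, None] * x[None, :]) / np.sqrt(m)  # df/dW
    return np.concatenate([gv, gW.ravel()])

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
print(param_grad(x1) @ param_grad(x2))   # ~ k_NTK(x1, x2) at this width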

Key results:

  1. At initialization, kNTKk_{\text{NTK}} is fixed (independent of θ\theta)
  2. For infinite width, kNTKk_{\text{NTK}} stays constant throughout training
  3. The trained model converges to the kernel regression solution: f(x)=kNTK(x,X)(kNTK(X,X))1yf(x) = k_{\text{NTK}}(x, X)(k_{\text{NTK}}(X,X))^{-1} y

Implications:

  • Infinite-width networks are not superior to kernel SVMs (they implement kernel regression)
  • The expressiveness advantage of neural networks comes from finite width and feature learning
  • NTK theory breaks down for finite-width networks with large learning rates (the "feature learning regime")

For AI: The NTK framework explains:

  • Why overparameterized networks can fit any training data (kernel interpolation)
  • Why neural scaling laws exist (larger networks have richer NTK kernels)
  • The transition from "kernel regime" to "feature learning regime" in modern LLM training

Appendix I: Practical Implementation Guide

I.1 Python Ecosystem for Interpolation

from scipy import interpolate

# Lagrange/Barycentric interpolation
interp = interpolate.BarycentricInterpolator(nodes, values)
y_new = interp(x_new)

# Cubic spline
cs = interpolate.CubicSpline(x, y, bc_type='natural')
y_pred = cs(x_new)
y_deriv = cs(x_new, 1)  # First derivative

# B-spline fitting
from scipy.interpolate import make_interp_spline
bspline = make_interp_spline(x, y, k=3)  # cubic
y_pred = bspline(x_new)

# Smoothing spline (least-squares B-spline)
tck = interpolate.splrep(x, y, s=0.1)  # s = smoothing factor
y_pred = interpolate.splev(x_new, tck)

# RBF interpolation
rbf = interpolate.RBFInterpolator(X_2d, y, kernel='gaussian', epsilon=1.0)
y_new = rbf(X_test)

I.2 NumPy/SciPy FFT

import numpy as np
from scipy.fft import fft, ifft, fftfreq, dct, idct

# FFT of a real signal
n = 1024
t = np.linspace(0, 1, n, endpoint=False)
y = np.sin(2*np.pi*5*t) + 0.5*np.sin(2*np.pi*13*t)

Y = fft(y)
freqs = fftfreq(n, d=1/n)  # Frequencies in cycles/unit

# Power spectral density
psd = np.abs(Y)**2 / n

# Inverse FFT
y_recovered = np.real(ifft(Y))

# DCT for Chebyshev coefficients (f_values sampled at first-kind Chebyshev nodes)
coeffs = dct(f_values, type=2) / n  # c_k = (2/n) sum_j f_j cos(k*theta_j)
coeffs[0] /= 2                      # series: coeffs[0] + sum_{k>=1} coeffs[k] T_k

I.3 Chebyshev Interpolation with Chebfun-style Code

def cheb_nodes(n, a=-1, b=1):
    """Chebyshev nodes of the 2nd kind on [a,b]."""
    k = np.arange(n+1)
    x = np.cos(np.pi * k / n)  # on [-1, 1]
    return (a + b)/2 + (b - a)/2 * x

def cheb_coeffs(f_values):
    """Chebyshev expansion coefficients via DCT-I.

    f_values must be sampled at the 2nd-kind nodes x_k = cos(pi*k/n),
    in the order returned by cheb_nodes (from x = 1 down to x = -1).
    """
    from scipy.fft import dct
    n = len(f_values) - 1
    c = dct(f_values, type=1) / n  # DCT-I; no reversal needed for this ordering
    c[0] /= 2; c[-1] /= 2
    return c

def cheb_eval(c, x):
    """Clenshaw algorithm for Chebyshev series evaluation."""
    n = len(c) - 1
    b2, b1 = 0.0, 0.0
    for k in range(n, 0, -1):
        b2, b1 = b1, 2*x*b1 - b2 + c[k]
    return x*b1 - b2 + c[0]

Appendix J: Worked Examples - Interpolation in Practice

J.1 Fitting a Binding Energy Curve

In computational chemistry, the binding energy E(r)E(r) as a function of inter-atomic distance rr is often known at a few points from DFT calculations. We want a smooth interpolant for geometry optimization.

Problem:  r (Å):  1.8   2.0   2.2   2.5   3.0   4.0
          E (eV): -2.5  -3.1  -3.2  -3.0  -2.7  -2.3

Method: Cubic spline with not-a-knot BC
Result: Minimum at r* ≈ 2.24 Å, E* ≈ -3.21 eV
Derivatives: E'(r*) ≈ 0 (equilibrium), E''(r*) > 0 (stable)

This is used in molecular dynamics force fields and materials design ML models.

J.2 Positional Encoding Extrapolation

Problem: A model trained with sinusoidal PEs on positions {0,,511}\{0, \ldots, 511\} needs to handle position 1024.

Analysis: At dimension ii, the PE period is Ti=2π100002i/dT_i = 2\pi \cdot 10000^{2i/d}:

  • Small i (high frequency): T_0 = 2\pi \approx 6.3 - position 1024 wraps through many full cycles, so its phase was seen repeatedly during training
  • Large i (low frequency): T_{d-1} = 2\pi \times 10000 \approx 62832 - the period far exceeds 512, so position 1024 lies in an unexplored part of the first cycle

Position-interpolation solution (PI, the mechanism underlying YaRN): Scale position indices by s = L'/L (training length / target length):

PEYaRN(pos,i)=PE(poss,i)\text{PE}_{\text{YaRN}}(pos, i) = \text{PE}(pos \cdot s, i)

This maps position 1024 in the target context to position 1024 \times (512/1024) = 512 in the training range - guaranteed to have been seen - at the cost of reduced positional resolution.

J.3 Kernel Method vs Neural Network

For a 1D regression problem with n=100n = 100 points from f(x)=sin(3x)+0.1εf(x) = \sin(3x) + 0.1\varepsilon:

| Method | Parameters | Training error | Test error | Time |
| --- | --- | --- | --- | --- |
| Degree-5 polynomial (QR) | 6 | 0.09 | 0.11 | <1 ms |
| Cubic spline (natural) | 100 | 0.0 (exact) | 0.10 | <1 ms |
| Gaussian RBF (\varepsilon = 1) | 100 | 0.0 (exact) | 0.09 | 10 ms |
| Gaussian RBF + ridge (\lambda = 10^{-3}) | 100 | 0.008 | 0.09 | 10 ms |
| 2-layer MLP (100 units) | 10200 | 0.002 | 0.10 | 1 s |

For smooth 1D functions with n<1000n < 1000, splines and kernel methods outperform neural networks in accuracy and efficiency. Neural networks win for high-dimensional data, compositional structure, and when priors about the function class are unknown.


Appendix K: Connections to Other Chapters

| Topic | Connection | Reference |
| --- | --- | --- |
| Eigenvalue decomposition | Spectral methods diagonalize differentiation/integration operators | Section03-Advanced-Linear-Algebra |
| Fourier series | Spectral analysis of signals, convolutional layers | Section07-Probability (spectral density) |
| Taylor series | Polynomial approximation of smooth functions | Section01-Mathematical-Foundations |
| Optimization | Remez algorithm, RKHS optimization (representer theorem) | Section08-Optimization, Section10-03 |
| Graph theory (spectral GNNs) | Chebyshev polynomial filters on graph Laplacian | Section11-Graph-Theory |
| Numerical integration | Gaussian quadrature uses zeros of orthogonal polynomials | Section10-05 |
| Floating-point | Vandermonde ill-conditioning; Chebyshev stability | Section10-01 |
| Linear algebra | QR for least squares; tridiagonal solver for splines | Section10-02 |

Appendix L: Lebesgue Constants and Interpolation Stability

L.1 Definition and Computation

The Lebesgue constant for a set of nodes {x0,,xn}\{x_0, \ldots, x_n\} is:

Λn=maxx[a,b]j=0nj(x)\Lambda_n = \max_{x \in [a,b]} \sum_{j=0}^{n} |\ell_j(x)|

It measures the worst-case amplification of errors:

pnΛny,fpn(1+Λn)En\|p_n\|_\infty \leq \Lambda_n \|y\|_\infty, \quad \|f - p_n\|_\infty \leq (1 + \Lambda_n) E_n^*

where En=infpPnfpE_n^* = \inf_{p \in \mathcal{P}_n} \|f - p\|_\infty is the best approximation error.

Growth rates:

| Node set | \Lambda_n growth |
| --- | --- |
| Equispaced on [-1,1] | \sim 2^n / (e\,n \ln n) (exponential) |
| Chebyshev points (1st kind) | \leq (2/\pi) \ln(n+1) + 1 (logarithmic) |
| Chebyshev points (2nd kind) | \leq (2/\pi) \ln(n) + 0.974 (logarithmic) |
| Optimal nodes | \geq (2/\pi) \ln(n+1) - 1 (Chebyshev is optimal up to a constant) |
| Gauss-Legendre nodes | O(\sqrt{n}) |

Practical consequence: At n=30n = 30, equispaced Λ30107\Lambda_{30} \approx 10^7 while Chebyshev Λ304.3\Lambda_{30} \approx 4.3. An error of 10610^{-6} in the data values would cause a 10\sim 10 error in the interpolant with equispaced nodes but only 4×1064 \times 10^{-6} with Chebyshev.

L.2 Lebesgue Function Visualization

LEBESGUE FUNCTION  sum_j |l_j(x)|   (n=10)
========================================================================

  20 -                    Equispaced nodes
     |
  15 -
     |
  10 -
     |
   5 -
     |
   1 ----------------------------------- Chebyshev (≈2.0 everywhere)
     -1                                 1

  Equispaced Lebesgue function peaks near +/-1 (endpoint Runge zones).
  Chebyshev function is nearly constant across [-1,1].

========================================================================

Appendix M: Multidimensional Interpolation

M.1 Tensor Product Methods

For functions on [a,b]d[a,b]^d, a tensor product approach uses one-dimensional bases in each dimension:

p(x1,,xd)=i1=0nid=0nci1,,idϕi1(x1)ϕid(xd)p(x_1, \ldots, x_d) = \sum_{i_1=0}^{n} \cdots \sum_{i_d=0}^{n} c_{i_1,\ldots,i_d} \phi_{i_1}(x_1) \cdots \phi_{i_d}(x_d)

Curse of dimensionality: Requires (n+1)d(n+1)^d basis functions - exponential in dd. For n=10n=10, d=10d=10: 11102.6×101011^{10} \approx 2.6 \times 10^{10} coefficients.

Sparse grids (Smolyak): Use only tensor products of 1D bases where i1++idn+d1i_1 + \cdots + i_d \leq n + d - 1. Complexity: O(n(logn)d1)O(n (\log n)^{d-1}) - much better. Used in high-dimensional integration and interpolation (quantum chemistry, finance).

M.2 Scattered Data in High Dimensions

For scattered data in Rd\mathbb{R}^d with d>3d > 3, polynomial interpolation and tensor product methods fail. Alternatives:

RBF interpolation: s(x)=jcjϕ(xxj)s(x) = \sum_j c_j \phi(\|x - x_j\|) - mesh-free, works for any dd. But condition number grows exponentially with nn for Gaussian RBF.

Nearest-neighbor interpolation: s(x) = y_{j^*} where j^* = \arg\min_j \|x - x_j\|. Discontinuous; O(\log n) per query after KD-tree preprocessing.

Inverse distance weighting (IDW): s(x)=jwj(x)yj/jwj(x)s(x) = \sum_j w_j(x) y_j / \sum_j w_j(x) where wj(x)=1/xxjpw_j(x) = 1/\|x - x_j\|^p. Continuous but not smooth.

For AI: High-dimensional function approximation is precisely the task of neural networks. The kernel trick (kernel regression) extends RBF to high dimensions through the kernel matrix Kij=k(xi,xj)K_{ij} = k(x_i, x_j) - but computing and inverting this n×nn \times n matrix is O(n3)O(n^3).

M.3 Triangulation-Based Methods

For scattered 2D data, Delaunay triangulation partitions the domain into non-overlapping triangles. Piecewise linear or cubic interpolation on each triangle:

  • Linear (barycentric): C0C^0, O(1)O(1) per evaluation after triangulation
  • Clough-Tocher cubic: C1C^1, splits each triangle into 3 subtriangles

scipy.interpolate.LinearNDInterpolator and CloughTocher2DInterpolator implement these.
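A minimal usage sketch (synthetic data of our own):

import numpy as np
from scipy.interpolate import LinearNDInterpolator, CloughTocher2DInterpolator

rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(300, 2))
vals = np.sin(3*pts[:, 0]) * np.cos(3*pts[:, 1])

lin = LinearNDInterpolator(pts, vals)        # C^0, piecewise linear
ct = CloughTocher2DInterpolator(pts, vals)   # C^1, piecewise cubic
true = np.sin(0.6) * np.cos(0.9)
print(lin(0.2, 0.3), ct(0.2, 0.3), true)     # CT is typically closer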


Appendix N: Approximation Theory Connections to Deep Learning

N.1 Width vs Depth in Approximation

Shallow networks (width NN, depth 1):

ffNL2=O(N2s/d)\|f - f_N\|_{L^2} = O(N^{-2s/d})

for ff in the Sobolev space Ws,2(Rd)W^{s,2}(\mathbb{R}^d) - polynomial rate, cursed by dimension.

Deep ReLU networks: For functions with compositional structure f=f1f2fLf = f_1 \circ f_2 \circ \cdots \circ f_L where each f:RdRd+1f_\ell: \mathbb{R}^{d_\ell} \to \mathbb{R}^{d_{\ell+1}} is dd_\ell-dimensional:

ffNL=O(N2s/dmax)\|f - f_N\|_{L^\infty} = O(N^{-2s/d_{\max}})

where dmax=maxdd_{\max} = \max_\ell d_\ell - the approximation rate depends on the intrinsic dimension of the composition, not the ambient dimension dd.

Implication: Deep networks exploit compositional structure to avoid the curse of dimensionality. A function like f(x)=sin(x1+x2)ex3x4f(x) = \sin(x_1 + x_2) \cdot e^{x_3 x_4} has intrinsic dimension 2 even though d=4d = 4 - deep networks can approximate it with O(Ns)O(N^{-s}) error.

N.2 The Kolmogorov-Arnold Representation Theorem

Theorem (Kolmogorov, 1957): Every continuous function f:[0,1]nRf: [0,1]^n \to \mathbb{R} can be written as:

f(x1,,xn)=q=02nΦq(p=1nϕq,p(xp))f(x_1, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q\left(\sum_{p=1}^n \phi_{q,p}(x_p)\right)

for continuous functions ϕq,p:[0,1]R\phi_{q,p}: [0,1] \to \mathbb{R} and Φq:RR\Phi_q: \mathbb{R} \to \mathbb{R}.

Significance: Every multivariate continuous function can be written using only univariate functions and addition. No curse of dimensionality in this representation.

Practical issue: The inner functions ϕq,p\phi_{q,p} may be nowhere differentiable - impossible to learn efficiently with gradient methods. The theorem is an existence result, not a constructive algorithm.

KAN circumvents this: By restricting to learned B-splines (differentiable, finitely parameterized), KAN trades the full generality of Kolmogorov for a practical learnable version.

N.3 Function Classes and Approximation Rates

APPROXIMATION RATE BY FUNCTION CLASS AND METHOD
========================================================================

  Function class             Best method            Rate
  ------------------------   --------------------   ------------------------
  Polynomial of deg <= k     Exact polynomial       0 (exact)
  Analytic in ellipse E_rho  Chebyshev truncation   O(rho^{-n}) (exponential)
  C^p on [a,b]               Poly of degree n       O(n^{-p})
  Sobolev W^{s,2}            Spline                 O(h^s)
  Barron class               Shallow network        O(N^{-1/2}) (dim-free!)
  Compositional              Deep network           O(N^{-2s/d_max})
  Arbitrary continuous       Universal approx.      eps-approx, no rate

  N = number of neurons, n = polynomial degree, h = mesh size

========================================================================

Appendix O: Summary Reference Tables

O.1 Interpolation Methods Comparison

| Method | Nodes | Error order | Stability | AI use |
| --- | --- | --- | --- | --- |
| Lagrange (equispaced) | Equispaced | O(h^{n+1}), but the constant blows up | Poor (Runge) | Avoid |
| Lagrange (Chebyshev) | Chebyshev | O(2^{-n}) for analytic f | Good (\Lambda_n = O(\log n)) | ChebNet |
| Newton divided diff. | Any | Same as Lagrange | Moderate | Incremental fitting |
| Cubic spline (natural) | Equispaced | O(h^4) | Excellent | Neural ODEs |
| B-spline | Arbitrary knots | O(h^{p+1}) | Excellent | KAN |
| Trigonometric | Equispaced | Exponential (smooth periodic) | Good | FFT, PE |
| RBF (Gaussian) | Scattered | Exponential (analytic) | Sensitive to \varepsilon | Kernels, GP |
| Polynomial LS (QR) | Any | Best L^2 approx | Good | Feature eng. |

O.2 Key Theorems Quick Reference

| Theorem | Statement | Location |
| --- | --- | --- |
| Existence/uniqueness | Unique degree-n interpolant through n+1 distinct nodes | Section2.1 |
| Interpolation error | f - p_n = f^{(n+1)}(\xi) \omega_{n+1} / (n+1)! | Section2.3 |
| Chebyshev minimax | T_n / 2^{n-1} has smallest \infty-norm among monic degree-n polys | Section3.1 |
| Equioscillation | Best approx iff error equioscillates at \geq n+2 points | Section3.3 |
| Spline min. curvature | Natural cubic spline minimizes \int (S'')^2 dx among interpolants | SectionE.3 |
| Representer theorem | Regularized RKHS problem solved by finite expansion | Section7.2 |
| Universal approximation | Shallow networks dense in C([0,1]^d) | Section8.2 |
| Nyquist theorem | Reconstruct B-band-limited signal from rate \geq 2B samples | Section6.4 |
| Bochner's theorem | PD shift-invariant kernels = FT of non-negative measures | SectionH.1 |
| Kolmogorov-Arnold | All f \in C([0,1]^n) = composition of univariates | SectionN.2 |

Appendix P: Extended Exercises with Solutions

P.1 Barycentric Interpolation - Full Implementation

import numpy as np

def cheb2_nodes(n, a=-1.0, b=1.0):
    """Chebyshev nodes of the 2nd kind (n+1 nodes) on [a,b]."""
    k = np.arange(n + 1)
    x = np.cos(np.pi * k / n)  # on [-1, 1]
    return 0.5*(a+b) + 0.5*(b-a)*x

def barycentric_weights(nodes):
    """Compute barycentric weights for given nodes."""
    n = len(nodes) - 1
    w = np.ones(n + 1)
    for j in range(n + 1):
        for k in range(n + 1):
            if k != j:
                w[j] *= (nodes[j] - nodes[k])
    return 1.0 / w

def cheb2_weights(n):
    """Barycentric weights for Chebyshev 2nd kind nodes (O(n) formula)."""
    w = np.ones(n + 1)
    w[1::2] = -1.0   # alternate signs first...
    w[0] *= 0.5      # ...then halve the endpoint weights
    w[n] *= 0.5      # (this order is also correct when n is odd)
    return w

def bary_eval(x, nodes, values, weights):
    """Evaluate barycentric interpolant at scalar x."""
    diffs = x - nodes
    mask = diffs == 0.0
    if np.any(mask):
        return values[mask][0]
    w_d = weights / diffs
    return np.dot(w_d, values) / w_d.sum()

# Example: interpolate f(x) = 1/(1+25x^2) with Chebyshev nodes
n = 20
nodes = cheb2_nodes(n)
f = lambda x: 1.0 / (1 + 25*x**2)
values = f(nodes)
weights = cheb2_weights(n)

x_test = np.linspace(-1, 1, 500)
y_interp = np.array([bary_eval(xi, nodes, values, weights) for xi in x_test])
y_exact = f(x_test)
print(f"Max error (Chebyshev n=20): {np.abs(y_interp - y_exact).max():.2e}")

P.2 Cubic Spline - Full Implementation

def natural_cubic_spline(x, y):
    """
    Compute natural cubic spline data for nodes x and values y.
    Returns: (M, h) - second derivatives M at the nodes and mesh widths h.
    """
    n = len(x) - 1
    h = np.diff(x)

    # RHS of the tridiagonal system: scaled second differences of the data
    dd = np.zeros(n - 1)
    for i in range(1, n):
        dd[i-1] = 6 * ((y[i+1] - y[i])/h[i] - (y[i] - y[i-1])/h[i-1])

    # Tridiagonal matrix: diagonal, sub/super diagonal
    diag = 2*(h[:-1] + h[1:])
    off = h[1:-1]

    # Thomas algorithm (tridiagonal solver)
    n_sys = n - 1
    c_diag = diag.copy()
    c_rhs = dd.copy()

    # Forward sweep
    for i in range(1, n_sys):
        m = off[i-1] / c_diag[i-1]
        c_diag[i] -= m * off[i-1]
        c_rhs[i] -= m * c_rhs[i-1]

    # Back substitution
    M = np.zeros(n + 1)  # Second derivatives at nodes
    M[n_sys] = c_rhs[n_sys-1] / c_diag[n_sys-1]
    for i in range(n_sys-2, -1, -1):
        M[i+1] = (c_rhs[i] - off[i] * M[i+2]) / c_diag[i]
    # M[0] = M[n] = 0 (natural BC)

    return M, h

def eval_spline(x_eval, x_nodes, y_nodes, M, h):
    """Evaluate natural cubic spline at x_eval."""
    i = np.searchsorted(x_nodes, x_eval, side='right') - 1
    i = np.clip(i, 0, len(x_nodes) - 2)
    hi = h[i]
    S = (M[i] * (x_nodes[i+1] - x_eval)**3 / (6*hi) +
         M[i+1] * (x_eval - x_nodes[i])**3 / (6*hi) +
         (y_nodes[i] - M[i]*hi**2/6) * (x_nodes[i+1]-x_eval)/hi +
         (y_nodes[i+1] - M[i+1]*hi**2/6) * (x_eval-x_nodes[i])/hi)
    return S

P.3 Chebyshev Spectral Differentiation Matrix

A key tool in spectral methods is the differentiation matrix DD such that (Df)jf(xj)(Df)_j \approx f'(x_j) at Chebyshev nodes:

def cheb_diff_matrix(n):
    """
    Chebyshev spectral differentiation matrix for n+1 nodes.
    D @ f_values approximates f' at Chebyshev nodes.
    """
    if n == 0:
        return np.array([[0.]])
    x = np.cos(np.pi * np.arange(n+1) / n)
    c = np.ones(n+1)
    c[0] = 2; c[n] = 2
    c *= (-1)**np.arange(n+1)

    D = np.zeros((n+1, n+1))
    for i in range(n+1):
        for j in range(n+1):
            if i != j:
                D[i, j] = c[i] / (c[j] * (x[i] - x[j]))

    # Diagonal: sum condition
    D -= np.diag(D.sum(axis=1))
    return D

# Example: differentiate f(x) = sin(pi*x) on [-1,1]
n = 32
D = cheb_diff_matrix(n)
x = np.cos(np.pi * np.arange(n+1) / n)
f_vals = np.sin(np.pi * x)
f_prime_approx = D @ f_vals
f_prime_exact = np.pi * np.cos(np.pi * x)
print(f"Spectral diff error: {np.abs(f_prime_approx - f_prime_exact).max():.2e}")

Appendix Q: Advanced Topics in Function Approximation

Q.1 Pade Approximants

Definition: A Pade approximant [m/n](x)[m/n](x) is a rational function Pm(x)/Qn(x)P_m(x)/Q_n(x) that agrees with the Taylor series of ff up to order m+nm+n:

f(x)Pm(x)Qn(x)=O(xm+n+1)f(x) - \frac{P_m(x)}{Q_n(x)} = O(x^{m+n+1})

Advantages over polynomials:

  • Can represent poles and asymptotic behavior
  • Often much more accurate than same-degree Taylor polynomials
  • Diagonal Pade [n/n][n/n] often converges to analytic functions in larger domains

Example: The exponential function:

ex[2/2](x)=1+x/2+x2/121x/2+x2/12e^x \approx [2/2](x) = \frac{1 + x/2 + x^2/12}{1 - x/2 + x^2/12}

This is the foundation of Pade-based matrix exponential algorithms (used in ODE solvers and graph neural networks for diffusion operators).
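A quick sanity check of this formula against the degree-4 Taylor polynomial, which uses the same amount of local information at 0:

import numpy as np

x = 0.5
pade22 = (1 + x/2 + x**2/12) / (1 - x/2 + x**2/12)
taylor4 = 1 + x + x**2/2 + x**3/6 + x**4/24
print(abs(np.exp(x) - pade22), abs(np.exp(x) - taylor4))   # Pade error is several times smaller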

Q.2 Wavelet Approximation

Wavelets provide multi-resolution approximation: approximate a function at coarse scale first, then add fine-scale details.

The Daubechies wavelet basis {ψj,k(x)=2j/2ψ(2jxk)}\{\psi_{j,k}(x) = 2^{j/2}\psi(2^j x - k)\} provides:

  • Orthogonal decomposition
  • Local support: ψj,k\psi_{j,k} has support 2j\sim 2^{-j} around 2jk2^{-j}k
  • Vanishing moments: xpψ(x)dx=0\int x^p \psi(x) dx = 0 for p=0,,p0p = 0, \ldots, p_0
  • Adaptive sparsity: Functions with local irregularities (edges, singularities) represented sparsely

For AI: Wavelets appear in:

  • Image compression (JPEG 2000 uses wavelet decomposition)
  • Multi-scale neural architectures (UNet uses hierarchical scales)
  • Graph wavelets for multi-resolution graph signals

Q.3 The NUFFT and Non-Uniform Sampling

The standard FFT assumes uniform sampling. For non-uniform samples {xj}\{x_j\}, the Non-Uniform FFT (NUFFT) computes:

Fk=j=1nfje2πikxj,k=n/2,,n/21F_k = \sum_{j=1}^{n} f_j e^{-2\pi i k x_j}, \quad k = -n/2, \ldots, n/2-1

in O(nlogn)O(n \log n) by:

  1. Convolving point masses at xjx_j with a Gaussian kernel onto a uniform grid
  2. Applying FFT on the uniform grid
  3. Deconvolving the Gaussian in frequency domain

For AI: NUFFT is used in:

  • MRI reconstruction: k-space data is sampled along non-uniform trajectories
  • Molecular dynamics: Computing electrostatic interactions via particle-mesh Ewald (PME)
  • Astronomy: Radio telescope arrays measure non-uniform Fourier samples

Appendix R: Reproducible Experiments

All code in this section is reproducible with numpy, scipy, and matplotlib. The key experiments:

# Experiment 1: Runge's phenomenon comparison
import numpy as np
from scipy.interpolate import BarycentricInterpolator

f = lambda x: 1/(1+25*x**2)
x_plot = np.linspace(-1, 1, 500)

for n in [5, 10, 15]:
    # Equispaced
    nodes_eq = np.linspace(-1, 1, n+1)
    p_eq = BarycentricInterpolator(nodes_eq, f(nodes_eq))

    # Chebyshev
    nodes_ch = np.cos(np.pi * np.arange(n+1) / n)
    p_ch = BarycentricInterpolator(nodes_ch, f(nodes_ch))

    err_eq = np.abs(f(x_plot) - p_eq(x_plot)).max()
    err_ch = np.abs(f(x_plot) - p_ch(x_plot)).max()
    print(f"n={n}: equispaced error={err_eq:.2e}, Chebyshev error={err_ch:.2e}")

# Experiment 2: Lebesgue constants
for n in range(1, 20):
    nodes = np.cos(np.pi * np.arange(n+1) / n)  # Chebyshev 2nd kind
    x_test = np.linspace(-1, 1, 1000)
    # Compute Lebesgue function: sum of |ell_j(x)|
    weights = np.ones(n+1)
    weights[1::2] = -1.0
    weights[0] *= 0.5; weights[n] *= 0.5   # halve endpoints after the sign flip
    lebesgue = []
    for x in x_test:
        diffs = x - nodes
        if np.any(diffs == 0):
            lebesgue.append(1.0)
        else:
            w_d = weights / diffs
            # Lagrange basis values: ell_j = w_j/diff_j / sum(w_k/diff_k)
            total = w_d.sum()
            lebesgue.append(np.abs(w_d / total).sum())
    print(f"n={n}: Lambda_n = {max(lebesgue):.4f}")

Appendix S: Historical Notes on Interpolation

S.1 Newton's Contribution

Isaac Newton introduced the method of divided differences in "Methodus Differentialis" (1711). His motivation was astronomical: computing planetary positions at arbitrary times from a table of observed positions at discrete times. The divided difference table allowed him to update the interpolant incrementally as new observations came in - an O(n)O(n) update, remarkable for 1711.

S.2 Chebyshev's Insight

Pafnuty Chebyshev (1853) asked: which monic polynomial of degree nn has the smallest maximum absolute value on [1,1][-1,1]? His answer - Tn(x)/2n1T_n(x)/2^{n-1} - was the first systematic use of the minimax criterion in approximation theory. Chebyshev was motivated by mechanism design: he wanted to minimize the deviation from ideal motion in steam engine linkages.

The connection between the trigonometric definition Tn(cosθ)=cos(nθ)T_n(\cos\theta) = \cos(n\theta) and the polynomial form was noticed by Chebyshev himself - using the addition formula for cosines iteratively.

S.3 Runge's Warning

Carl Runge published his famous counterexample in 1901. The function 1/(1+25x2)1/(1+25x^2) is analytic everywhere in the complex plane except at ±i/5\pm i/5 - which lie inside the unit circle. This means the function's Taylor series has radius of convergence 1/5<11/5 < 1, and polynomial interpolation at equispaced nodes tries to capture behavior that is "barely analytic" in the relevant region.

Runge's result stood as a warning against indiscriminately increasing the polynomial degree on equispaced nodes. The modern response (Chebyshev nodes) emerged gradually through the 20th century.

S.4 Cooley-Tukey and the FFT

The Cooley-Tukey FFT algorithm (1965) is often cited as one of the most important algorithms of the 20th century. Its impact was immediate: it made digital signal processing practical, enabled the digital revolution in audio/communications, and later became the backbone of scientific computing.

Remarkably, the core idea (exploiting DFT factorizations) was known to Gauss in 1805 - unpublished in his lifetime, surfacing only later in his collected works. The algorithm was independently rediscovered multiple times before Cooley and Tukey's landmark paper.

S.5 The Representer Theorem and RKHS

The representer theorem (Kimeldorf-Wahba, 1971) established that regularized function estimation over an RKHS has a finite-dimensional solution - transforming an infinite-dimensional optimization into a finite linear algebra problem. This unified spline smoothing, kernel regression, and support vector machines under one mathematical framework. The same theorem, applied to the neural tangent kernel, explains why overparameterized neural networks trained by gradient descent converge to global minima that generalize well.


Appendix T: Rational Approximation and Pade Theory

T.1 Pade Table

For a function with Taylor series f(x)=k=0ckxkf(x) = \sum_{k=0}^\infty c_k x^k, the Pade approximant [m/n][m/n] satisfies:

f(x)Qn(x)Pm(x)=O(xm+n+1)f(x) \cdot Q_n(x) - P_m(x) = O(x^{m+n+1})

The Pade table arranges these approximants in a grid by (m,n)(m, n):

PADE TABLE for e^x
========================================================================

  n\m |  0                 1                    2
  ----+-------------------------------------------------------------
   0  |  1                 1+x                  1+x+x^2/2
   1  |  1/(1-x)           (2+x)/(2-x)          (6+4x+x^2)/(6-2x)
   2  |  1/(1-x+x^2/2)     (6+2x)/(6-4x+x^2)    (12+6x+x^2)/(12-6x+x^2)

  Diagonal [n/n] entries converge fastest for analytic functions.

========================================================================

Convergence: A Taylor series converges only inside the disk |x| < r reaching the nearest singularity. For meromorphic functions, diagonal Pade approximants [n/n] can converge far beyond that disk, because the poles of Q_n move to mimic the poles of f.

T.2 Matrix Pade and the Matrix Exponential

The matrix exponential eAe^A is computed using Pade approximants in most numerical libraries. The (m,m)(m,m) diagonal Pade:

eA[Dm(A)]1Nm(A)e^A \approx [D_m(A)]^{-1} N_m(A)

where NmN_m and DmD_m are matrix polynomials of degree mm. Combined with scaling and squaring (using eA=(eA/2s)2se^A = (e^{A/2^s})^{2^s}):

  1. Scale: A=A/2sA' = A/2^s with A1\|A'\| \leq 1
  2. Compute Pade approximant eA[m/m](A)e^{A'} \approx [m/m](A')
  3. Square: eA(eA)2se^A \approx (e^{A'})^{2^s}

This is used in scipy.linalg.expm and is essential for:

  • Continuous-time Markov chain transition matrices
  • Neural ODE solvers (computing etAe^{tA} exactly)
  • Graph diffusion for GNNs (heat kernel etLe^{-tL})
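A toy sketch of the three steps with the [2/2] approximant (scipy's expm uses higher-order Pade with error-controlled scaling; this is only illustrative):

import numpy as np
from scipy.linalg import expm

def expm_pade22(A):
    """Scaling-and-squaring with the [2/2] Pade approximant (toy accuracy)."""
    s = max(0, int(np.ceil(np.log2(np.linalg.norm(A, 1) / 0.5))))  # step 1: scale
    As = A / 2**s
    I = np.eye(A.shape[0])
    N = I + As/2 + As @ As / 12
    D = I - As/2 + As @ As / 12
    E = np.linalg.solve(D, N)     # step 2: [2/2](A') = D^{-1} N
    for _ in range(s):
        E = E @ E                 # step 3: undo the scaling by squaring
    return E

A = np.array([[0.0, 1.0], [-1.0, 0.0]])        # generator of 2D rotations
print(np.abs(expm_pade22(A) - expm(A)).max())  # small, but not machine precision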

Appendix U: Connection to Modern LLM Architectures

U.1 RoPE as Trigonometric Interpolation

Rotary Position Embedding (RoPE) represents positions as rotations in 2D subspaces of the embedding:

qm=RΘmq,kn=RΘnkq_m = R_\Theta^m q, \quad k_n = R_\Theta^n k

where RΘmR_\Theta^m is a block-diagonal rotation matrix with blocks (cosmθisinmθisinmθicosmθi)\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}.

The attention score:

qmkn=qRΘmnkq_m^\top k_n = q^\top R_\Theta^{m-n} k

depends only on relative position mnm-n. The frequencies θi=100002(i1)/d\theta_i = 10000^{-2(i-1)/d} match the sinusoidal PE frequencies - this is the same trigonometric interpolation scheme.
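A minimal numpy sketch of the rotation and its relative-position property (the helper and dimensions are our own):

import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each 2D subspace of x by pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2D block
    ang = pos * theta
    c, s = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * c - x[1::2] * s
    out[1::2] = x[0::2] * s + x[1::2] * c
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
# Scores depend only on m - n: (7, 3) and (9, 5) give the same value
print(rope(q, 7) @ rope(k, 3), rope(q, 9) @ rope(k, 5))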

LongRoPE (2024): For context length extension, scales frequencies non-uniformly:

  • High-frequency dimensions (small ii): scale by λlong>1\lambda_{\text{long}} > 1 (extend range)
  • Low-frequency dimensions (large ii): scale by λshort1\lambda_{\text{short}} \approx 1 (minimal change)

This is essentially frequency-domain interpolation of the positional encoding.

U.2 Flash Attention and Numerical Integration

The attention matrix Aij=softmax(qikj/d)A_{ij} = \text{softmax}(q_i^\top k_j / \sqrt{d}) can be viewed as a matrix of weights for a kernel regression:

Attention(Q,K,V)=AV=jAijvj\text{Attention}(Q, K, V) = A V = \sum_j A_{ij} v_j

This is a discrete approximation to the integral:

y(x)=k(x,z)v(z)μ(dz)y(x) = \int k(x, z) v(z) \mu(dz)

where k(x,z)=eq(x)k(z)/eq(x)k(z)μ(dz)k(x, z) = e^{q(x)^\top k(z)} / \int e^{q(x)^\top k(z')} \mu(dz') is a normalized kernel. Understanding attention as kernel regression connects to RBF interpolation theory - the "query point" qiq_i selects a weighted combination of "function values" vjv_j based on "proximity" qikjq_i^\top k_j.

U.3 NeRF and Neural Radiance Fields

Neural Radiance Fields (NeRF, Mildenhall et al. 2020) represent a 3D scene as:

fθ:(x,y,z,θ,ϕ)(RGB,σ)f_\theta: (x, y, z, \theta, \phi) \to (\text{RGB}, \sigma)

The key challenge: standard MLPs cannot represent high-frequency spatial details. Fourier feature encoding solves this:

γ(x)=[sin(20πx),cos(20πx),,sin(2Lπx),cos(2Lπx)]\gamma(x) = [\sin(2^0\pi x), \cos(2^0\pi x), \ldots, \sin(2^L\pi x), \cos(2^L\pi x)]

This is a frequency embedding - a set of L+1 octave-spaced Fourier basis functions per coordinate. The Fourier features lift the input to a space where the target function is smooth, enabling efficient approximation by the MLP.
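As code (a sketch; the interleaving order of the sin/cos features is immaterial to the MLP):

import numpy as np

def gamma(x, L=6):
    """NeRF-style frequency encoding of a scalar coordinate."""
    freqs = (2.0 ** np.arange(L + 1)) * np.pi   # 2^0*pi, ..., 2^L*pi as above
    return np.concatenate([np.sin(freqs * x), np.cos(freqs * x)])

print(gamma(0.37).shape)   # (2(L+1),) features per scalar coordinate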

Connection to RFF: Tancik et al. (2020) showed that Fourier features are equivalent to sampling the random Fourier feature map at specific fixed frequencies rather than random ones - targeted at the frequencies of the scene being rendered.


Appendix V: Error Analysis Summary and Complexity Reference

V.1 Computational Complexity

| Operation | Complexity | Notes |
| --- | --- | --- |
| Lagrange interpolation setup | O(n^2) | Computing the n+1 basis polynomials |
| Barycentric weights setup | O(n^2) | General nodes; O(n) for Chebyshev |
| Barycentric evaluation | O(n) per point | Most efficient form |
| Newton divided differences | O(n^2) | Table construction |
| Newton evaluation (Horner) | O(n) per point | |
| Cubic spline setup | O(n) | Thomas algorithm on tridiagonal system |
| Cubic spline evaluation | O(\log n) | Binary search + O(1) eval |
| FFT of length n | O(n \log n) | Cooley-Tukey, n = 2^k |
| Least-squares QR | O(mn^2) | m data points, degree n |
| RBF interpolation setup | O(n^3) | Dense system solve |
| GP posterior | O(n^3) | Cholesky of the n \times n kernel matrix |
| Sparse GP (SVGP) | O(nm^2) | m inducing points |

V.2 Error Summary

| Method | Error bound | When tight |
| --- | --- | --- |
| Lagrange (equispaced, degree n) | Can grow exponentially in n | Runge function, large n |
| Lagrange (Chebyshev, degree n) | \lVert f^{(n+1)} \rVert_\infty / (2^n (n+1)!) | Analytic f |
| Cubic spline | (5/384) h^4 \lVert f^{(4)} \rVert_\infty | Smooth f, equispaced h |
| Least-squares (degree n) | (1 + \Lambda_n) E_n^*(f) | Depends on node Lebesgue constant |
| Trigonometric (n terms) | 2 \sum_{\lvert k \rvert > n} \lvert \hat{f}_k \rvert | Smooth periodic f |
| RBF (Gaussian) | Exponential in n for analytic f | Sensitive to shape parameter \varepsilon |

V.3 Choosing the Right Method

DECISION TREE: WHICH INTERPOLATION METHOD?
========================================================================

  Data type?
  +-- Periodic function on [0, 2pi]?
  |   +-- USE: Trigonometric interpolation (FFT)
  |       Best accuracy for smooth periodic functions
  |
  +-- Scattered data in R^d (d >= 2)?
  |   +-- Need uncertainty estimate?
  |   |   +-- USE: Gaussian Process regression
  |   +-- No uncertainty needed?
  |       +-- USE: RBF interpolation (Matern kernel)
  |
  +-- 1D data on [a,b]?
      +-- Few points (n <= 30), need exact interpolation?
      |   +-- USE: Barycentric Lagrange (Chebyshev nodes)
      |
      +-- Many points (n > 30) or need local control?
      |   +-- USE: Cubic splines (natural or not-a-knot BC)
      |
      +-- More data than parameters (m > n)?
          +-- Need derivative information?
          |   +-- USE: Smoothing spline
          +-- Polynomial features?
              +-- USE: QR least-squares (NOT normal equations)

========================================================================

Appendix W: Notation Reference

Throughout this section, the following notation is used consistently:

| Symbol | Meaning |
| --- | --- |
| x_j or t_j | Interpolation nodes / knots |
| y_j | Data values f(x_j) |
| n | Degree of polynomial interpolant (using n+1 nodes) |
| \ell_j(x) | j-th Lagrange basis polynomial |
| w_j | Barycentric weight for node x_j |
| \omega_{n+1}(x) | Node polynomial \prod_{j=0}^n (x - x_j) |
| \Lambda_n | Lebesgue constant |
| T_n(x) | Chebyshev polynomial of degree n |
| [f_0, f_1, \ldots, f_k] | k-th divided difference |
| S(x) | Spline function |
| M_i | Second derivative of the spline at node i: S''(x_i) |
| h_i | Mesh spacing x_{i+1} - x_i |
| B_{i,p}(x) | i-th B-spline basis function of degree p |
| E_n(f) | Best polynomial approximation error \inf_{p \in \mathcal{P}_n} \lVert f - p \rVert_\infty |
| \hat{f}_k | k-th Fourier coefficient |
| c_k | k-th Chebyshev coefficient |
| k(x,y) | Kernel function |
| \mathcal{H}_k | Reproducing kernel Hilbert space with kernel k |
| \phi(r) | Radial basis function evaluated at r = \lVert x - x_j \rVert |
| \varepsilon | RBF shape parameter |
| \ell | RBF lengthscale |