"It is not the duty of mathematics to explain nature, but to describe it with the greatest possible accuracy. Yet Fourier's discovery - that any periodic motion is a superposition of simple oscillations - turned out to be both a perfect description and the deepest explanation."
- paraphrasing Hermann Weyl
Overview
A Fourier series decomposes a periodic function into an infinite sum of sinusoidal harmonics. What seems like a purely mathematical curiosity - writing a square wave as a sum of sines - turns out to be one of the most powerful analytical tools in science and engineering. Every sound you hear, every image you see, every signal your phone transmits, and every positional encoding in a large language model has Fourier series at its mathematical core.
This section develops the theory from scratch. We begin with the geometric idea: functions on $[-\pi, \pi]$ form an inner product space, and the trigonometric functions form an orthonormal basis in that space. Computing a Fourier series is simply projecting a function onto each basis vector. This geometric picture immediately explains why the formulas look the way they do, and why the theory is so clean.
We then turn to the hard question: when does the series actually converge back to $f$? The answer - Dirichlet's theorem, $L^2$ convergence, and the Parseval identity - is where the real mathematics lies. We examine the Gibbs phenomenon (the persistent 9% overshoot at jump discontinuities that cannot be removed by adding more terms), and close with the applications that connect directly to modern AI: RoPE positional encodings in LLaMA-3, the spectral bias of neural networks, and random Fourier features for kernel approximation.
Prerequisites
- Complex exponentials - $e^{i\theta} = \cos\theta + i\sin\theta$, Euler's formula (Section 01 Mathematical Foundations)
- Integration - definite integrals, integration by parts, improper integrals (Section 04 Calculus Fundamentals)
- Inner products and orthogonality - abstract inner product spaces, orthonormal bases (Section 12-02 Hilbert Spaces)
- Series convergence - pointwise, uniform, and $L^2$ convergence (Section 01 Mathematical Foundations)
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive computation of Fourier series, convergence animations, Gibbs phenomenon, RoPE derivation |
| exercises.ipynb | 10 graded problems from computing coefficients to proving Parseval's theorem to implementing RoPE |
Learning Objectives
After completing this section, you will:
- Compute real and complex Fourier coefficients for standard periodic functions (square wave, sawtooth, triangle wave)
- State and apply the Dirichlet conditions for pointwise convergence of a Fourier series
- Explain the difference between pointwise, uniform, and $L^2$ convergence in the context of Fourier series
- Derive Parseval's identity and use it to evaluate infinite series (e.g., $\sum_{n=1}^{\infty} 1/n^2 = \pi^2/6$)
- Describe the Gibbs phenomenon and explain why it cannot be eliminated by adding more terms
- Relate the smoothness of $f$ to the decay rate of its Fourier coefficients
- Derive the RoPE positional encoding formula from the complex Fourier basis
- Explain the spectral bias phenomenon and its implications for neural network training dynamics
- Prove the orthonormality of $\{e^{inx}/\sqrt{2\pi}\}_{n \in \mathbb{Z}}$ in $L^2[-\pi, \pi]$
- Apply Fejer's theorem to understand Cesaro summation and convergence acceleration
Table of Contents
- 1. Intuition
- 2. Formal Definitions
- 3. Convergence Theory
- 4. Parseval's Theorem and Energy
- 5. Properties of Fourier Coefficients
- 6. Fourier Series on General Intervals
- 7. Applications in Machine Learning
- 8. Common Mistakes
- 9. Exercises
- 10. Why This Matters for AI (2026 Perspective)
- 11. Conceptual Bridge
1. Intuition
1.1 What Is a Fourier Series?
Strike a guitar string and you hear a pitch - the fundamental frequency. Listen more carefully and you hear overtones: frequencies at twice, three times, four times the fundamental. The rich timbre of a guitar versus a piano playing the same note comes entirely from the relative amplitudes of these overtones. A Fourier series is the precise mathematical statement of this physical observation: every periodic function is a (possibly infinite) weighted sum of sinusoids at integer multiples of a fundamental frequency.
More precisely, if $f$ is a $2\pi$-periodic function satisfying mild regularity conditions, then:
$$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \left( a_n \cos nx + b_n \sin nx \right)$$
where the coefficients $a_n$ and $b_n$ are uniquely determined by $f$. The term with $n = 1$ is the fundamental harmonic, and the terms with $n \geq 2$ are the overtones or harmonics.
The miracle is how general this is. The function does not have to be smooth. The square wave - which jumps discontinuously between $-1$ and $+1$ - has a Fourier series:
$$f(x) = \frac{4}{\pi} \sum_{k=0}^{\infty} \frac{\sin((2k+1)x)}{2k+1}$$
Summing 100 terms gives a function that is nearly indistinguishable from the square wave - except near the jumps, where a small overshoot persists no matter how many terms you include. This is the Gibbs phenomenon, and we will study it carefully.
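To see this concretely, here is a minimal NumPy sketch (assuming only `numpy`) that sums the first few hundred odd harmonics of the square wave and measures the maximum overshoot; the function name is ours:

```python
import numpy as np

def square_partial_sum(x, n_terms):
    """Partial sum of (4/pi) * sum_k sin((2k+1)x)/(2k+1), the square wave series."""
    return (4 / np.pi) * sum(np.sin((2 * k + 1) * x) / (2 * k + 1)
                             for k in range(n_terms))

x = np.linspace(-np.pi, np.pi, 20001)
for n_terms in (10, 100, 1000):
    overshoot = square_partial_sum(x, n_terms).max() - 1.0
    print(f"terms={n_terms:5d}  overshoot above 1: {overshoot:.4f}")
# The overshoot stays near 0.179 (about 9% of the jump height 2) for every N.
```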
Non-example: Not every function has a convergent Fourier series in the pointwise sense. A function that is not in $L^1[-\pi, \pi]$ (e.g., $f(x) = 1/x$ near $x = 0$) does not even have well-defined Fourier coefficients. A continuous function can have a Fourier series that diverges at a single point (Du Bois-Reymond, 1876). The correct general framework is $L^2$ convergence, not pointwise convergence.
1.2 Why It Matters for AI
Fourier series are not an abstract curiosity - they are embedded in the architecture of modern LLMs, CNNs, and audio models:
For AI:
- Positional encodings in Transformers (Vaswani et al., 2017) use $\sin(p\,\omega_k)$ and $\cos(p\,\omega_k)$ - these are Fourier basis functions at geometrically spaced frequencies $\omega_k$.
- Rotary Position Embedding (RoPE) (Su et al., 2021), used in LLaMA-3, GPT-NeoX, and Mistral, interprets token positions as rotations in the complex plane - directly using complex Fourier basis vectors $e^{im\theta_k}$.
- Spectral bias (Rahaman et al., 2019): neural networks learn low-frequency Fourier components of the target function first and high-frequency components last. This governs convergence speed, generalization, and the effectiveness of data augmentation.
- Random Fourier features (Rahimi & Recht, 2007): kernel machines (SVMs, GPs) can be approximated in time linear in the dataset size by projecting data onto random Fourier basis functions sampled from the kernel's spectral density.
1.3 Historical Timeline
FOURIER ANALYSIS - HISTORICAL MILESTONES
========================================================================
1748 Euler studies vibrating strings; writes trigonometric series
1807 Fourier presents "Theorie de la chaleur" to Paris Academy;
claims every function has a trigonometric series expansion
(claim rejected - Laplace and Lagrange are skeptical)
1822 Fourier publishes "Theorie analytique de la chaleur"
1829 Dirichlet proves convergence for piecewise smooth functions;
gives first rigorous proof of pointwise convergence
1854 Riemann extends the theory; defines the Riemann integral partly
for the purposes of studying Fourier series
1873 Du Bois-Reymond exhibits a continuous function whose Fourier
series diverges at a point - Fourier's original claim was wrong
1900 Fejer proves Cesaro summability: every continuous function's
Fourier series converges in the Cesaro sense
1907 Riesz-Fischer theorem: completeness of L2 puts Parseval's
     identity on rigorous footing
1966 Carleson proves pointwise convergence a.e. for L2 functions
     (Abel Prize, 2006)
2017 Vaswani et al. use Fourier basis for transformer PE
2021 Su et al. introduce RoPE; now standard in LLaMA-3, Mistral
2022 FNet (Lee-Thorp) replaces attention with Fourier transforms
2023 Spectral analysis of LLM weight matrices goes mainstream
========================================================================
1.4 The Geometric Picture
The key insight that makes Fourier series transparent is to think of functions as vectors in an infinite-dimensional inner product space.
On the interval $[-\pi, \pi]$, define the inner product of two functions $f$ and $g$ by:
$$\langle f, g \rangle = \int_{-\pi}^{\pi} f(x)\,\overline{g(x)}\,dx$$
The functions $e_n(x) = \frac{e^{inx}}{\sqrt{2\pi}}$, $n \in \mathbb{Z}$, form an orthonormal set under this inner product:
$$\langle e_m, e_n \rangle = \delta_{mn}$$
where $\delta_{mn}$ is the Kronecker delta. This is easy to verify: if $m \neq n$, the integral of $e^{i(m-n)x}$ over a full period is zero.
Now the Fourier series is just the expansion of $f$ in this orthonormal basis. The coefficient is the projection of $f$ onto the basis vector $e_n$:
$$\langle f, e_n \rangle = \int_{-\pi}^{\pi} f(x)\,\frac{e^{-inx}}{\sqrt{2\pi}}\,dx$$
This is identical to the formula for coordinates in any orthonormal basis: $v = \sum_n \langle v, e_n \rangle\,e_n$. The Fourier series is not a formula to be memorized - it is the inevitable consequence of projecting onto an orthonormal basis.
For AI: This geometric view directly explains RoPE. In RoPE, each attention head dimension pair is treated as a 2D vector and rotated by angle $m\theta_i$ for token position $m$. This rotation is multiplication by $e^{im\theta_i}$ - a Fourier basis vector. The inner product between rotated queries and keys then depends only on their relative position $m - n$, making the positional encoding relative rather than absolute.
1.5 Three Equivalent Representations
Every Fourier series can be written in three equivalent forms. Each has advantages in different contexts.
Real (sine-cosine) form:
$$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} \left( a_n \cos nx + b_n \sin nx \right)$$
Best for: real-valued functions where you want to see real amplitudes explicitly.
Complex exponential form:
$$f(x) = \sum_{n=-\infty}^{\infty} c_n\,e^{inx}$$
Best for: computation, derivations, and AI applications (RoPE, FNet). More compact; the complex exponentials are eigenfunctions of differentiation.
Amplitude-phase form:
$$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} A_n \cos(nx - \phi_n)$$
where $A_n = \sqrt{a_n^2 + b_n^2}$, $\phi_n = \operatorname{atan2}(b_n, a_n)$. Best for: signal processing applications where amplitude and phase are the physically meaningful quantities.
Conversion between forms: $c_n = \frac{a_n - i b_n}{2}$ for $n > 0$, $c_0 = \frac{a_0}{2}$, and $c_{-n} = \overline{c_n}$ for real $f$. These identities follow immediately from Euler's formula $e^{inx} = \cos nx + i \sin nx$.
2. Formal Definitions
2.1 The Space $L^2[-\pi, \pi]$
The natural domain for Fourier series is the space of square-integrable functions on $[-\pi, \pi]$:
$$L^2[-\pi, \pi] = \left\{ f : [-\pi, \pi] \to \mathbb{C} \;\middle|\; \int_{-\pi}^{\pi} |f(x)|^2\,dx < \infty \right\}$$
This space carries the inner product:
$$\langle f, g \rangle = \int_{-\pi}^{\pi} f(x)\,\overline{g(x)}\,dx$$
and induced norm $\|f\|_2 = \sqrt{\langle f, f \rangle}$. Two functions that agree except on a set of measure zero are identified. Under this identification, $L^2[-\pi, \pi]$ is a complete inner product space - a Hilbert space. The full abstract theory is in Section 12-02 Hilbert Spaces; here we use only the concrete definition.
Examples in $L^2[-\pi, \pi]$:
- Every bounded measurable function (e.g., square wave, triangle wave)
- $f(x) = |x|^{-1/4}$ (integrable singularity; $\int_{-\pi}^{\pi} |x|^{-1/2}\,dx < \infty$)
- $f(x) = |x|^{-\alpha}$ for any $\alpha < 1/2$
Non-examples:
- $f(x) = 1/x$ near $0$: $\int |x|^{-2}\,dx = \infty$, so $f \notin L^2$
- $f(x) = e^{1/|x|}$ near $0$: grows too fast
2.2 The Trigonometric System
Definition (Trigonometric System): The collection $\{e^{inx}\}_{n \in \mathbb{Z}}$ is called the complex trigonometric system on $[-\pi, \pi]$.
Theorem (Orthonormality): The rescaled functions $e_n(x) = \frac{e^{inx}}{\sqrt{2\pi}}$ satisfy $\langle e_m, e_n \rangle = \delta_{mn}$, using the unweighted inner product $\langle f, g \rangle = \int_{-\pi}^{\pi} f\,\overline{g}\,dx$.
Proof. For $m \neq n$:
$$\langle e_m, e_n \rangle = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{i(m-n)x}\,dx = \frac{1}{2\pi} \cdot \frac{e^{i(m-n)x}}{i(m-n)} \Bigg|_{-\pi}^{\pi} = 0$$
since $e^{i(m-n)\pi} = e^{-i(m-n)\pi}$, so the boundary terms cancel. For $m = n$: $\langle e_n, e_n \rangle = \frac{1}{2\pi} \int_{-\pi}^{\pi} 1\,dx = 1$.
The real trigonometric system $\{1, \cos nx, \sin nx\}$ is similarly orthogonal, with norms $\|1\|_2^2 = 2\pi$ and $\|\cos nx\|_2^2 = \|\sin nx\|_2^2 = \pi$ for $n \geq 1$.
Completeness (stated): The trigonometric system is complete in $L^2[-\pi, \pi]$: for every $f \in L^2$ and every $\epsilon > 0$, there exists a trigonometric polynomial $P$ such that $\|f - P\|_2 < \epsilon$. The proof uses the Weierstrass approximation theorem and is in Section 12-02.
2.3 Fourier Coefficients
Definition (Fourier Coefficients): For $f \in L^2[-\pi, \pi]$, the complex Fourier coefficients of $f$ are:
$$c_n = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x)\,e^{-inx}\,dx, \qquad n \in \mathbb{Z}$$
The real Fourier coefficients are:
$$a_n = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \cos nx\,dx, \qquad b_n = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \sin nx\,dx$$
Relations between forms: For real $f$: $a_n = 2\,\mathrm{Re}\,c_n$ and $b_n = -2\,\mathrm{Im}\,c_n$ for $n \geq 1$, and $c_{-n} = \overline{c_n}$ (Hermitian symmetry).
Standard examples - complete computations:
(i) Square wave: $f(x) = -1$ for $-\pi < x < 0$, $f(x) = +1$ for $0 < x < \pi$.
Since $f$ is odd, $a_n = 0$ for all $n$. For $b_n$:
$$b_n = \frac{2}{\pi} \int_0^{\pi} \sin nx\,dx = \frac{2}{\pi} \cdot \frac{1 - (-1)^n}{n} = \begin{cases} \dfrac{4}{\pi n} & n \text{ odd} \\ 0 & n \text{ even} \end{cases}$$
So $f(x) = \frac{4}{\pi} \sum_{k=0}^{\infty} \frac{\sin((2k+1)x)}{2k+1}$.
(ii) Sawtooth wave: $f(x) = x$ for $-\pi < x < \pi$, extended periodically.
$f$ is odd, so $a_n = 0$. For $b_n$, integrate by parts:
$$b_n = \frac{2}{\pi} \int_0^{\pi} x \sin nx\,dx = \frac{2}{\pi} \left[ -\frac{x \cos nx}{n} \Big|_0^{\pi} + \frac{1}{n} \int_0^{\pi} \cos nx\,dx \right] = \frac{2(-1)^{n+1}}{n}$$
So $f(x) = 2 \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n} \sin nx$.
(iii) Triangle wave: $f(x) = |x|$ for $-\pi \leq x \leq \pi$.
$f$ is even, so $b_n = 0$. For $a_n$ ($n \geq 1$):
$$a_n = \frac{2}{\pi} \int_0^{\pi} x \cos nx\,dx = \frac{2}{\pi} \cdot \frac{(-1)^n - 1}{n^2} = \begin{cases} -\dfrac{4}{\pi n^2} & n \text{ odd} \\ 0 & n \text{ even} \end{cases}$$
And $a_0 = \frac{2}{\pi} \int_0^{\pi} x\,dx = \pi$. So $|x| = \frac{\pi}{2} - \frac{4}{\pi} \sum_{k=0}^{\infty} \frac{\cos((2k+1)x)}{(2k+1)^2}$.
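These closed forms are easy to sanity-check numerically. A small sketch using `scipy.integrate.quad` (assuming `numpy` and `scipy`; the helper name `c` is ours) compares quadrature-based coefficients of the sawtooth against the derived formula $c_n = \frac{(-1)^{n+1}}{in}$:

```python
import numpy as np
from scipy.integrate import quad

def c(f, n):
    """Complex coefficient c_n = (1/2pi) * integral_{-pi}^{pi} f(x) e^{-inx} dx."""
    re, _ = quad(lambda x: f(x) * np.cos(n * x), -np.pi, np.pi)
    im, _ = quad(lambda x: -f(x) * np.sin(n * x), -np.pi, np.pi)
    return (re + 1j * im) / (2 * np.pi)

sawtooth = lambda x: x
for n in range(1, 5):
    print(n, c(sawtooth, n), (-1) ** (n + 1) / (1j * n))  # numeric vs closed form
```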
Non-examples: Not every sequence is the Fourier coefficient sequence of an $L^2$ function. By Parseval's identity (Section 4.2), we need $\sum_n |c_n|^2 < \infty$. The sequence $c_n = 1$ for all $n$ is not a valid Fourier coefficient sequence.
2.4 Partial Sums and the Dirichlet Kernel
Definition: The $N$-th partial sum of the Fourier series of $f$ is:
$$S_N f(x) = \sum_{n=-N}^{N} c_n\,e^{inx}$$
A crucial observation: the partial sum can be written as a convolution:
$$S_N f(x) = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(t)\,D_N(x - t)\,dt$$
where the Dirichlet kernel is:
$$D_N(u) = \sum_{n=-N}^{N} e^{inu} = \frac{\sin\left((N + \tfrac{1}{2})u\right)}{\sin(u/2)}$$
The closed form follows by summing the geometric series and simplifying using $e^{iu/2} - e^{-iu/2} = 2i \sin(u/2)$.
Key properties of $D_N$:
- $\frac{1}{2\pi} \int_{-\pi}^{\pi} D_N(u)\,du = 1$ (normalized)
- $D_N$ oscillates, with a peak of height $2N + 1$ near $u = 0$
- $D_N$ does NOT stay non-negative (unlike an approximate identity), causing convergence issues
The failure of $D_N$ to remain non-negative is precisely what allows the Gibbs phenomenon (Section 3.4) and prevents uniform convergence at discontinuities.
3. Convergence Theory
3.1 Pointwise Convergence: Dirichlet's Theorem
Theorem (Dirichlet, 1829): Let $f$ be a $2\pi$-periodic function that is piecewise smooth on $[-\pi, \pi]$ (i.e., $f$ and $f'$ are piecewise continuous). Then the Fourier series of $f$ converges for every $x$, and:
$$\lim_{N \to \infty} S_N f(x) = \frac{f(x^+) + f(x^-)}{2}$$
At points of continuity, $f(x^+) = f(x^-) = f(x)$, so the series converges to $f(x)$. At a jump discontinuity, the series converges to the average of the left and right limits.
Proof sketch. Write $S_N f(x) - \frac{f(x^+) + f(x^-)}{2}$ as an integral involving $D_N$ and the function $g(t) = \frac{f(x+t) - f(x^+) + f(x-t) - f(x^-)}{2\sin(t/2)}$. The key step: $g$ is integrable if $f$ is piecewise smooth (the singularity at $t = 0$ is removable). Then apply the Riemann-Lebesgue Lemma (Section 5.5): $\int g(t) \sin\left((N + \tfrac{1}{2})t\right) dt \to 0$ as $N \to \infty$ for any integrable $g$.
Dirichlet conditions (sufficient, not necessary):
- $f$ is bounded and has finitely many maxima, minima, and discontinuities on $[-\pi, \pi]$
- $f$ has left and right derivatives everywhere
What Dirichlet's theorem does NOT say:
- It does not guarantee uniform convergence (fails at jumps due to Gibbs)
- It does not apply to all continuous functions (Du Bois-Reymond's example)
- It says nothing about convergence rate
3.2 $L^2$ Convergence and Completeness
The $L^2$ convergence theorem is stronger and cleaner than the pointwise result:
Theorem ($L^2$ Convergence): For every $f \in L^2[-\pi, \pi]$:
$$\lim_{N \to \infty} \|f - S_N f\|_2 = 0$$
Equivalently, $f = \sum_n c_n e^{inx}$ as an $L^2$ limit. This follows directly from the completeness of the trigonometric system in $L^2$ (stated in Section 2.2).
Bessel's Inequality (preliminary to Parseval): For any $f \in L^2$:
$$\sum_{n=-\infty}^{\infty} |c_n|^2 \leq \frac{1}{2\pi}\,\|f\|_2^2$$
Proof. Expand $0 \leq \|f - S_N f\|_2^2 = \|f\|_2^2 - 2\pi \sum_{|n| \leq N} |c_n|^2$. This shows the partial sums of $\sum |c_n|^2$ are bounded by $\frac{1}{2\pi}\|f\|_2^2$. Taking $N \to \infty$ gives Bessel.
Equality holds (Bessel becomes Parseval) exactly when the system is complete.
3.3 Uniform Convergence
Theorem: If $f \in C^1$ (continuously differentiable) and $2\pi$-periodic, then $S_N f \to f$ uniformly.
Proof sketch. Integrate by parts: $c_n(f') = in\,c_n(f)$, so $|c_n(f)| = \frac{|c_n(f')|}{|n|}$ for $n \neq 0$. By Cauchy-Schwarz and Bessel, $\sum_{n \neq 0} |c_n(f)| \leq \left( \sum_{n \neq 0} |c_n(f')|^2 \right)^{1/2} \left( \sum_{n \neq 0} \frac{1}{n^2} \right)^{1/2} < \infty$, so the series converges uniformly by the Weierstrass M-test. For $f \in C^2$, $|c_n| = O(1/n^2)$, giving faster uniform convergence.
Key point: Continuity alone is insufficient for uniform convergence. The counterexample (Du Bois-Reymond) is a continuous function whose Fourier series diverges at a single point.
3.4 The Gibbs Phenomenon
The Gibbs phenomenon is one of the most important practical facts about Fourier series.
Observation: Near a jump discontinuity at $x_0$, the partial sum $S_N f$ overshoots the function value by approximately $0.09$ times the jump height - about 9% of the jump - regardless of how large $N$ is.
Precise statement for the square wave: For $f = \operatorname{sgn}(x)$ with jump height $2$:
$$\lim_{N \to \infty} \max_x S_N f(x) = \frac{2}{\pi} \int_0^{\pi} \frac{\sin t}{t}\,dt = \frac{2}{\pi}\,\mathrm{Si}(\pi) \approx 1.179$$
So the overshoot is $\approx 0.179$, which is $\approx 9\%$ of the total jump of $2$.
Why it persists: The Dirichlet kernel $D_N$ has a tall central spike but also oscillating side lobes of alternating sign. These side lobes cannot be eliminated by taking $N$ larger - they just become narrower and move closer to the discontinuity.
For AI: The Gibbs phenomenon is why "ringing" artifacts appear when you sharply truncate a frequency spectrum (e.g., in audio compression, image filtering, or when a language model encounters out-of-distribution high-frequency tokens). Windowing functions (Section 20-03) are the engineering fix.
Remedy - Fejer's theorem: Instead of taking partial sums, take their Cesaro means $\sigma_N f = \frac{1}{N+1} \sum_{k=0}^{N} S_k f$. Fejer proved these converge everywhere for continuous $f$ and do NOT exhibit Gibbs overshoot.
3.5 Fejer's Theorem and Cesaro Summability
Theorem (Fejer, 1900): Let $f$ be continuous and $2\pi$-periodic. Then $\sigma_N f \to f$ uniformly as $N \to \infty$.
The Cesaro means use the Fejer kernel $F_N$:
$$\sigma_N f(x) = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(t)\,F_N(x - t)\,dt, \qquad F_N(u) = \frac{1}{N+1} \sum_{k=0}^{N} D_k(u) = \frac{1}{N+1} \left( \frac{\sin\left(\frac{(N+1)u}{2}\right)}{\sin(u/2)} \right)^2$$
Unlike the Dirichlet kernel, the Fejer kernel is non-negative. This is why Cesaro summation fixes the Gibbs phenomenon - the averaging eliminates the negative side lobes.
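A short numerical sketch (`numpy` only, variable names ours) contrasting the two summation methods on the square wave: the partial sums overshoot, while the Cesaro means stay within the range of $f$:

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 2000)
N = 200  # number of partial sums to average

# terms[k] is the k-th term of the square-wave series evaluated on the grid
terms = np.array([(4 / np.pi) * np.sin((2 * k + 1) * x) / (2 * k + 1)
                  for k in range(N)])
S = np.cumsum(terms, axis=0)                                  # partial sums
sigma = np.cumsum(S, axis=0) / np.arange(1, N + 1)[:, None]   # Cesaro means

print("max S_N    :", S[-1].max())      # ~1.18 : Gibbs overshoot
print("max sigma_N:", sigma[-1].max())  # <= 1  : no overshoot
```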
For AI: Fejer's theorem is the prototype for windowing in signal processing. Multiplying a signal by a window function before taking the Fourier transform is equivalent to using a smoother summation kernel, which reduces spectral leakage and ringing (-> Section 20-03).
4. Parseval's Theorem and Energy
4.1 Bessel's Inequality
We proved Bessel's inequality in Section 3.2: $\sum_n |c_n|^2 \leq \frac{1}{2\pi}\|f\|_2^2$. This says the "energy" in the frequency representation is at most the energy in the time representation. Equality requires completeness.
4.2 Parseval's Identity
Theorem (Parseval's Identity): For $f \in L^2[-\pi, \pi]$ with Fourier coefficients $c_n$:
$$\frac{1}{2\pi} \int_{-\pi}^{\pi} |f(x)|^2\,dx = \sum_{n=-\infty}^{\infty} |c_n|^2$$
In real form: $\frac{1}{\pi} \int_{-\pi}^{\pi} |f(x)|^2\,dx = \frac{a_0^2}{2} + \sum_{n=1}^{\infty} (a_n^2 + b_n^2)$.
Proof. By completeness, $\|f - S_N f\|_2 \to 0$. Then:
$$\frac{1}{2\pi}\|f\|_2^2 = \lim_{N \to \infty} \frac{1}{2\pi}\|S_N f\|_2^2 = \lim_{N \to \infty} \sum_{|n| \leq N} |c_n|^2 = \sum_{n=-\infty}^{\infty} |c_n|^2$$
using orthonormality of $\{e^{inx}/\sqrt{2\pi}\}$ for the second equality.
Physical interpretation: The total energy (power) of a signal equals the sum of energies in each frequency component. Fourier analysis is an energy-preserving change of basis.
More general form: For $f, g \in L^2$:
$$\frac{1}{2\pi} \int_{-\pi}^{\pi} f(x)\,\overline{g(x)}\,dx = \sum_{n=-\infty}^{\infty} c_n(f)\,\overline{c_n(g)}$$
This is the polarization identity version of Parseval.
4.3 Applications: Evaluating Infinite Series
Parseval's identity is a powerful tool for evaluating series that have no elementary closed form.
Example 1 - Basel problem via the sawtooth wave: Apply Parseval to $f(x) = x$ on $[-\pi, \pi]$:
$$\frac{1}{\pi} \int_{-\pi}^{\pi} x^2\,dx = \frac{2\pi^2}{3}, \qquad b_n = \frac{2(-1)^{n+1}}{n}$$
Parseval gives $\frac{2\pi^2}{3} = \sum_{n=1}^{\infty} \frac{4}{n^2}$, so $\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}$.
Example 2 - $\sum_{k=0}^{\infty} \frac{1}{(2k+1)^2}$: Apply Parseval to the square wave. The non-zero Fourier coefficients are $b_n = \frac{4}{\pi n}$ for odd $n$. Then:
$$\frac{1}{\pi} \int_{-\pi}^{\pi} 1\,dx = 2 = \sum_{k=0}^{\infty} \frac{16}{\pi^2 (2k+1)^2}$$
giving $\sum_{k=0}^{\infty} \frac{1}{(2k+1)^2} = \frac{\pi^2}{8}$.
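Both identities can be confirmed numerically with a truncated sum; a minimal sketch (`numpy` only):

```python
import numpy as np

# Sawtooth: (1/pi) * integral x^2 dx = 2*pi^2/3 must equal sum b_n^2, b_n = 2(-1)^{n+1}/n
n = np.arange(1, 200001)
print(2 * np.pi**2 / 3, np.sum((2 / n) ** 2))   # both ~6.5797
# Basel sum recovered by dividing out the factor 4:
print(np.pi**2 / 6, np.sum(1.0 / n**2))         # both ~1.6449
```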
5. Properties of Fourier Coefficients
5.1 Linearity, Shift, and Scaling
Let $\hat{f}(n)$ denote the $n$-th Fourier coefficient of $f$.
| Property | Statement | Proof idea |
|---|---|---|
| Linearity | $\widehat{\alpha f + \beta g}(n) = \alpha \hat{f}(n) + \beta \hat{g}(n)$ | Integral is linear |
| Shift | $\widehat{f(\cdot - a)}(n) = e^{-ina}\,\hat{f}(n)$ | Change of variable |
| Conjugation | $\widehat{\overline{f}}(n) = \overline{\hat{f}(-n)}$ | Conjugate the integral |
| Reflection | $\widehat{f(-\cdot)}(n) = \hat{f}(-n)$ | Change of variable |
The shift property is the Fourier-series version of the time-shift property of the Fourier transform. It says: shifting a signal in time multiplies its Fourier coefficients by a complex exponential $e^{-ina}$ - i.e., introduces a linear phase. This is fundamental in signal alignment and is the mechanism behind relative positional encodings in transformers.
For AI - RoPE connection: In RoPE, the key and query vectors at positions $m$ and $n$ are $R_m q$ and $R_n k$, where $R_m$ is a rotation matrix. The dot product satisfies $(R_m q)^\top (R_n k) = q^\top R_{n-m}\,k$ - it depends only on the relative position $n - m$. This is exactly the Fourier shift property: shifting both signals by the same amount leaves their inner product unchanged.
5.2 Differentiation and Integration
Differentiation in frequency space:
$$\widehat{f'}(n) = in\,\hat{f}(n)$$
Proof. Integrate by parts: $\frac{1}{2\pi} \int_{-\pi}^{\pi} f'(x)\,e^{-inx}\,dx = \frac{1}{2\pi} \left[ f(x)\,e^{-inx} \right]_{-\pi}^{\pi} + \frac{in}{2\pi} \int_{-\pi}^{\pi} f(x)\,e^{-inx}\,dx$. By periodicity, the boundary term vanishes, leaving $in\,\hat{f}(n)$.
Iterated differentiation: $\widehat{f^{(k)}}(n) = (in)^k\,\hat{f}(n)$.
Implication for smoothness: A function $f \in C^k$ (k-times continuously differentiable and periodic) has $\hat{f}(n) = o(|n|^{-k})$ as $|n| \to \infty$. The smoother $f$ is, the faster its Fourier coefficients decay. This is the quantitative content of Section 5.4.
Integration: $\widehat{F}(n) = \frac{\hat{f}(n)}{in}$ for $n \neq 0$, where $F(x) = \int_0^x f(t)\,dt$ and the mean $\hat{f}(0)$ is assumed zero so that $F$ is periodic.
5.3 Even and Odd Functions
Even functions ($f(-x) = f(x)$) have $b_n = 0$ for all $n$, so the Fourier series contains only cosines: $f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} a_n \cos nx$ - a cosine series. The coefficients are $a_n = \frac{2}{\pi} \int_0^{\pi} f(x) \cos nx\,dx$.
Odd functions ($f(-x) = -f(x)$) have $a_n = 0$ for all $n$, so the Fourier series contains only sines: $f(x) = \sum_{n=1}^{\infty} b_n \sin nx$ - a sine series. The coefficients are $b_n = \frac{2}{\pi} \int_0^{\pi} f(x) \sin nx\,dx$.
For AI - DCT connection: The Discrete Cosine Transform (DCT) used in JPEG and MP3 compression is the even-extension Fourier series. The DCT coefficients are exactly the Fourier coefficients of the even extension of the signal. This is why the DCT achieves better energy compaction than the DFT for real signals.
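A quick illustration of energy compaction using `scipy.fft.dct` (assuming `scipy` is installed; the toy signal below is an illustrative choice and compresses even better than a typical image block):

```python
import numpy as np
from scipy.fft import dct

x = np.linspace(0, 1, 64)
signal = x**2                                # a smooth, slowly varying signal
coeffs = dct(signal, type=2, norm='ortho')   # DCT-II with orthonormal scaling

energy = np.cumsum(coeffs**2) / np.sum(coeffs**2)
print("energy in first 4 of 64 coefficients:", round(energy[3], 4))  # ~0.99+
```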
5.4 Smoothness and Spectral Decay
The relationship between regularity and spectral decay is fundamental to both signal processing and machine learning:
| Regularity of $f$ | Decay of $\hat{f}(n)$ | Practical implication |
|---|---|---|
| $f \in L^1$ | $\hat{f}(n) \to 0$ (by Riemann-Lebesgue) | All coefficients vanish asymptotically |
| piecewise smooth, with a discontinuity | $O(1/n)$ | Slow decay (Gibbs) |
| $C^1$ (continuously diff.) | $o(1/n)$, typically $O(1/n^2)$ | Faster decay, no Gibbs |
| $C^k$ | $o(n^{-k})$ | Super-algebraic if $k$ large |
| analytic | $O(e^{-\alpha n})$ for some $\alpha > 0$ | Exponential decay |
For AI - Spectral bias (Rahaman et al., 2019): Neural networks learn the target function's Fourier decomposition from lowest to highest frequency. Formally, if $f^*$ is the target function, a network trained by gradient descent first approximates the coefficients $\hat{f^*}(n)$ for small $|n|$ (low frequencies) and only later approximates high-frequency components. This implies:
- Benefit: Implicit regularization toward smooth (low-frequency) solutions -> good generalization
- Cost: Slow convergence on high-frequency targets; requires more data for texture-rich images
- Fix: Fourier feature embeddings (RFF, NeRF's positional encoding) inject high-frequency components explicitly - see the sketch after this list
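As a concrete illustration of the fix, here is a minimal Fourier feature mapping in the style of Tancik et al. (2020); the function name and the bandwidth value are illustrative choices, not from the paper:

```python
import numpy as np

def fourier_features(x, B):
    """Map inputs x to [cos(2*pi*B*x), sin(2*pi*B*x)] features."""
    proj = 2 * np.pi * np.outer(x, B)
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1)

rng = np.random.default_rng(0)
B = rng.normal(scale=10.0, size=16)   # random frequencies; the scale sets the bandwidth
X = fourier_features(np.linspace(0, 1, 5), B)
print(X.shape)                        # (5, 32): inputs now carry high-frequency content
```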
5.5 Riemann-Lebesgue Lemma
Theorem (Riemann-Lebesgue): If $f \in L^1[-\pi, \pi]$, then $\hat{f}(n) \to 0$ as $|n| \to \infty$.
Proof sketch. For a step function: direct computation. For general $f \in L^1$: approximate by step functions and use the bound $|\hat{f}(n) - \hat{g}(n)| \leq \frac{1}{2\pi}\|f - g\|_{L^1}$.
Significance: This says every $L^1$ function has Fourier coefficients that vanish at high frequencies. The rate of decay depends on smoothness (Section 5.4), but the qualitative statement holds for all integrable functions. This is the mathematical foundation of the claim "smooth signals are compressible in the Fourier domain."
6. Fourier Series on General Intervals
6.1 Extension to Period $T$
For a $T$-periodic function $f$, the Fourier series on $[0, T]$ is obtained by the change of variables $x \mapsto 2\pi x / T$:
$$f(x) = \sum_{n=-\infty}^{\infty} c_n\,e^{2\pi i n x / T}$$
with coefficients:
$$c_n = \frac{1}{T} \int_0^T f(x)\,e^{-2\pi i n x / T}\,dx$$
The fundamental frequency is $\omega_0 = 2\pi / T$, and the $n$-th harmonic has frequency $n\omega_0$.
For AI - Transformer context length: In a transformer with context length $L$, sinusoidal positional encodings use frequencies $\omega_k = 10000^{-2k/d}$ for $k = 0, \ldots, d/2 - 1$. This covers a geometric range of frequencies - analogous to sampling the Fourier series of a long interval at geometrically spaced harmonics. Longer context requires lower minimum frequencies, which is why RoPE's base limits effective context length and why extended-context models (LLaMA-3-128K, Mistral) modify the base upward from $10^4$.
6.2 Half-Range Expansions
If $f$ is defined on $[0, L]$ (not the full period), we can extend it to $[-L, L]$ in two ways:
Even extension: Extend by $f(-x) = f(x)$ -> leads to a cosine series on $[0, L]$:
$$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} a_n \cos\frac{n\pi x}{L}, \qquad a_n = \frac{2}{L} \int_0^L f(x) \cos\frac{n\pi x}{L}\,dx$$
Odd extension: Extend by $f(-x) = -f(x)$ -> leads to a sine series on $[0, L]$:
$$f(x) = \sum_{n=1}^{\infty} b_n \sin\frac{n\pi x}{L}, \qquad b_n = \frac{2}{L} \int_0^L f(x) \sin\frac{n\pi x}{L}\,dx$$
Application: Solving the heat equation on $[0, L]$ with Dirichlet boundary conditions ($u(0, t) = u(L, t) = 0$) uses the sine series expansion. With Neumann conditions ($u_x(0, t) = u_x(L, t) = 0$), use the cosine series.
7. Applications in Machine Learning
7.1 Sinusoidal Positional Encodings in Transformers
The original Transformer (Vaswani et al., 2017) uses positional encodings:
$$PE(p, 2k) = \sin(p\,\omega_k), \qquad PE(p, 2k+1) = \cos(p\,\omega_k), \qquad \omega_k = 10000^{-2k/d}$$
This assigns each position $p$ a $d$-dimensional vector whose components are sine and cosine values at geometrically spaced frequencies $\omega_k$.
Why this works: The set of frequencies forms a geometric progression spanning from $\omega_0 = 1$ (very high frequency, distinguishes adjacent tokens) down to $\omega_{d/2-1} \approx 10^{-4}$ (very low frequency, distinguishes widely separated tokens). This is exactly the strategy of a Fourier series on a long interval: use many harmonics at different scales.
Limitation: These are absolute positional encodings - the embedding for position 5 is fixed regardless of context. This creates problems for generalization to longer sequences than seen in training. A minimal implementation follows.
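A minimal NumPy sketch of these encodings (the helper name is ours), with a numerical check that inner products between encodings depend only on the position offset:

```python
import numpy as np

def sinusoidal_pe(num_positions, d_model):
    """Sinusoidal encodings: PE(p, 2k) = sin(p*w_k), PE(p, 2k+1) = cos(p*w_k)."""
    p = np.arange(num_positions)[:, None]                   # positions, shape (P, 1)
    w = 10000.0 ** (-np.arange(0, d_model, 2) / d_model)    # geometric frequencies
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(p * w)
    pe[:, 1::2] = np.cos(p * w)
    return pe

pe = sinusoidal_pe(128, 64)
# <PE(p), PE(q)> = sum_k cos((p - q) w_k): it depends only on the offset p - q.
print(np.isclose(pe[10] @ pe[20], pe[30] @ pe[40]))  # True
```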
7.2 Rotary Positional Encoding (RoPE)
RoPE (Su et al., 2021), now standard in LLaMA-2/3, Mistral, Falcon, and GPT-NeoX, encodes position as a rotation of the query and key vectors.
Construction: For a $d$-dimensional query vector $q$, split into pairs $(q_{2i}, q_{2i+1})$. Rotate each pair by angle $m\theta_i$, where $m$ is the token position and $\theta_i = 10000^{-2i/d}$:
$$\begin{pmatrix} q'_{2i} \\ q'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}$$
This is multiplication by $e^{im\theta_i}$ in the complex representation $z_i = q_{2i} + i\,q_{2i+1}$.
The key property: The attention score between query at position $m$ and key at position $n$ is:
$$\langle R_m q, R_n k \rangle = q^\top R_{n-m}\,k$$
The score depends only on $n - m$ - the relative position. This is the Fourier shift theorem: multiplying by $e^{im\theta}$ and taking a conjugate product gives sensitivity to relative displacement.
Extended context: The maximum effective context length is determined by the lowest frequency $\theta_{d/2-1} \approx 1/10000$. Models like LLaMA-3-128K extend context by scaling positions (Position Interpolation, Chen et al., 2023) or by increasing the base from 10000 to 500000 (Roziere et al., 2023). A minimal implementation follows.
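A minimal NumPy sketch of RoPE (rotating even/odd dimension pairs; the function name `rope` is ours), with a check of the relative-position property:

```python
import numpy as np

def rope(x, m, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by angle m * theta_i, theta_i = base^(-2i/d)."""
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)
    ang = m * theta
    out = np.empty_like(x)
    out[0::2] = x[0::2] * np.cos(ang) - x[1::2] * np.sin(ang)
    out[1::2] = x[0::2] * np.sin(ang) + x[1::2] * np.cos(ang)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
# The attention score depends only on the relative offset m - n:
print(np.isclose(rope(q, 5) @ rope(k, 9), rope(q, 105) @ rope(k, 109)))  # True
```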
7.3 Spectral Bias of Neural Networks
The phenomenon (Rahaman et al., 2019): Neural networks trained with gradient descent learn a biased decomposition of the target function: low-frequency Fourier components are learned first, high-frequency components last.
Mathematical statement: Let $f^*$ be the target function with Fourier decomposition $f^*(x) = \sum_n \hat{f^*}(n)\,e^{inx}$. A network trained on a finite sample first converges on the low-$|n|$ components (the error in $\hat{f^*}(n)$ decreases for small $|n|$ first).
Consequences for AI:
- Good: Networks converge to smooth solutions -> implicit regularization -> good out-of-distribution generalization for smooth targets
- Bad: Learning high-frequency signals (sharp edges, fine texture) requires more data and training time
- NeRF / SIREN fix: Sinusoidal activations (Sitzmann et al., 2020) or Fourier feature mappings (Tancik et al., 2020) inject high-frequency components, overcoming spectral bias for 3D scene representation
Mechanism: The NTK (Neural Tangent Kernel) of a standard MLP has a spectrum that decays with frequency. High-frequency components of the target correspond to small-eigenvalue directions of the NTK, which converge slowly under gradient descent.
7.4 Random Fourier Features and Kernel Approximation
The problem: Kernel methods (SVMs, GPs) require storing and computing an $n \times n$ kernel matrix - $O(n^2)$ memory and up to $O(n^3)$ computation. For large $n$, this is infeasible.
The solution (Rahimi & Recht, 2007): Any shift-invariant kernel $k(x, y) = k(x - y)$ can be written, by Bochner's theorem, as the Fourier transform of a non-negative measure $\mu$:
$$k(x - y) = \int e^{i \omega^\top (x - y)}\,d\mu(\omega)$$
Sampling frequencies $\omega_1, \ldots, \omega_D \sim \mu$ and defining the feature map:
$$z(x) = \sqrt{\frac{2}{D}} \left[ \cos(\omega_j^\top x + b_j) \right]_{j=1}^{D}, \qquad b_j \sim \mathrm{Uniform}[0, 2\pi]$$
gives $z(x)^\top z(y) \approx k(x - y)$. This reduces kernel computation to $O(nD)$ - linear in the dataset size.
Connection to Fourier series: $z(x)^\top z(y)$ is a finite Fourier expansion with randomly sampled frequencies. The approximation quality improves as $D$ increases, with concentration bounds showing the error is uniformly $O(1/\sqrt{D})$ with high probability.
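A compact sketch of the construction for the Gaussian kernel (assuming `numpy`; the function name, `sigma`, and `D` are illustrative choices):

```python
import numpy as np

def rff(X, D, sigma=1.0, seed=0):
    """Random Fourier features for k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], D))  # spectral density N(0, I/sigma^2)
    b = rng.uniform(0.0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
Z = rff(X, D=5000)
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)  # exact kernel
print("max abs error:", np.abs(Z @ Z.T - K).max())  # shrinks like O(1/sqrt(D))
```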
8. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Using $c_n = \int_{-\pi}^{\pi} f(x)\,e^{-inx}\,dx$ without the $\frac{1}{2\pi}$ factor | Missing normalization; coefficients will be off by $2\pi$ | Always check which convention your source uses; in this repo we use $c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f\,e^{-inx}\,dx$ |
| 2 | Assuming the Fourier series always converges pointwise | False: Du Bois-Reymond exhibited a continuous function whose series diverges at a point | State the convergence type explicitly; use $L^2$ convergence for most theory |
| 3 | Assuming the sum at a discontinuity equals $f(x_0)$ | The series converges to the average $\frac{f(x_0^+) + f(x_0^-)}{2}$ at a jump | Apply Dirichlet's theorem: check for discontinuities before claiming convergence to $f(x_0)$ |
| 4 | Writing $c_{-n} = c_n$ for a real function | Correct relation is $c_{-n} = \overline{c_n}$ (conjugate, not equal) | For real $f$: $c_{-n} = \overline{c_n}$; $c_n$ is only real if $f$ is real and even |
| 5 | Applying differentiation term-by-term without checking conditions | $\widehat{f'}(n) = in\,\hat{f}(n)$ requires $f$ to be absolutely continuous and periodic | Check that $f$ is absolutely continuous before differentiating term-by-term |
| 6 | Confusing Parseval's theorem with Bessel's inequality | Bessel gives $\leq$; Parseval gives $=$ - they differ | Parseval requires completeness of the trigonometric system; Bessel holds for any orthonormal set |
| 7 | Claiming $\lvert c_n \rvert = O(1/n)$ decay for a smooth function | $O(1/n)$ decay indicates a jump discontinuity; smooth functions decay much faster | Match the decay rate to the smoothness class (Section 5.4) |
| 8 | Forgetting to subtract the mean before computing sine/cosine coefficients | $\frac{a_0}{2}$ is the DC component; it appears in the $\frac{a_0}{2}$ term, not in $a_n, b_n$ for $n \geq 1$ | Always compute $a_0$ separately; the series is $\frac{a_0}{2} + \sum_{n \geq 1}(a_n \cos nx + b_n \sin nx)$ |
| 9 | Using the complex form with $c_n$ but real form coefficients | Complex form requires $c_n = \frac{a_n - i b_n}{2}$; real form uses $a_n, b_n$ - these are different | Pick one form and use it consistently; convert via $c_n = \frac{a_n - i b_n}{2}$ |
| 10 | Claiming the Gibbs phenomenon goes away as $N \to \infty$ | The overshoot height stays at ~9% of the jump; only its width shrinks | Gibbs is a permanent feature of the partial sum near discontinuities; use windowing to mitigate |
9. Exercises
Exercise 1 (*): Compute the complex Fourier coefficients of the triangle wave $f(x) = |x|$ on $[-\pi, \pi]$ and write out the first five non-zero terms of the Fourier series. Verify that $f$ is even, so $b_n = 0$.
Exercise 2 (*): Prove from first principles that the functions $\left\{ \frac{1}{\sqrt{2\pi}}, \frac{\cos nx}{\sqrt{\pi}}, \frac{\sin nx}{\sqrt{\pi}} \right\}_{n \geq 1}$ form an orthonormal set in $L^2[-\pi, \pi]$ with the inner product $\langle f, g \rangle = \int_{-\pi}^{\pi} f g\,dx$.
Exercise 3 (*): Using Parseval's identity applied to the square wave, show that $\sum_{k=0}^{\infty} \frac{1}{(2k+1)^2} = \frac{\pi^2}{8}$. Then use $\sum_n \frac{1}{n^2} = \sum_{n \text{ odd}} \frac{1}{n^2} + \frac{1}{4} \sum_n \frac{1}{n^2}$ to recover the Basel sum $\frac{\pi^2}{6}$.
Exercise 4 (**): Let $f(x) = e^{ax}$ for $x \in (-\pi, \pi)$ with $a$ real. (a) Compute $c_n$. (b) Apply Parseval's identity to obtain a formula for $\sum_{n=-\infty}^{\infty} \frac{1}{a^2 + n^2}$. (c) Verify your answer for $a = 1$ numerically.
Exercise 5 (**): Prove the Riemann-Lebesgue Lemma: if $f \in L^1[-\pi, \pi]$, then $\hat{f}(n) \to 0$ as $|n| \to \infty$. (Hint: first prove it for step functions, then approximate general $f \in L^1$.)
Exercise 6 (**): Consider $f(x) = x^2$ on $[-\pi, \pi]$. (a) Find all Fourier coefficients. (b) Show that $S_N f \to f$ uniformly. (c) Set $x = \pi$ in the resulting series to derive $\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}$.
Exercise 7 (***): Implement RoPE from scratch in Python. (a) Given query and key vectors of dimension $d$ and sequence length $L$, construct the rotation matrices for each position using $\theta_i = 10000^{-2i/d}$. (b) Compute the rotated attention scores for all pairs $(m, n)$. (c) Verify that the score depends only on $m - n$ by checking several values of $(m, n)$.
Exercise 8 (***): Implement Random Fourier Features for the RBF kernel. (a) For the Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$, show that the spectral density is the Gaussian $\mathcal{N}(0, \sigma^{-2} I)$. (b) Implement the feature map using $D$ random frequency samples. (c) On a synthetic dataset in $\mathbb{R}^2$, compare the exact kernel matrix with the approximation and plot the approximation error as a function of $D$.
10. Why This Matters for AI (2026 Perspective)
| Concept | AI Impact | Concrete System |
|---|---|---|
| Fourier basis vectors | Foundation of positional encodings in all transformer variants | RoPE in LLaMA-3, Gemma-2, Mistral-7B, Falcon |
| Complex Fourier form | Rotation = position; enables relative PE via shift theorem | RoPE (Su et al., 2021), xPos, YaRN |
| Parseval's identity | Energy preservation: Fourier is a unitary transform; spectral analysis of embeddings | WeightWatcher spectral analysis of LLM health |
| Spectral bias (NTK spectrum decay) | Networks learn smooth functions first; governs training dynamics | SIREN (Sitzmann et al., 2020), NeRF frequency encoding |
| Gibbs phenomenon | Ringing in over-compressed audio/images; sudden context-length failure modes | JPEG compression artifacts, LLM boundary token issues |
| Fourier completeness | Any periodic signal can be exactly represented; digital audio encoding | MP3, AAC, Ogg Vorbis compression standards |
| Random Fourier features | $O(nD)$ kernel approximation replacing the $O(n^2)$ kernel matrix | Large-scale SVM/GP inference (Rahimi & Recht, 2007) |
11. Conceptual Bridge
Looking backward: Fourier series is built on three pillars from earlier chapters: the inner product structure of (from Section 12-02 Hilbert Spaces), complex exponentials (from Section 01 Mathematical Foundations), and convergence of series (from real analysis in Section 04 Calculus). If those foundations feel shaky, revisit them before proceeding.
Looking forward: Fourier series handles periodic functions on a bounded interval. The Fourier Transform (Section 20-02) extends this to aperiodic signals on the entire real line by taking the period $T \to \infty$. The discrete, computationally feasible version is the DFT and FFT (Section 20-03). The Convolution Theorem (Section 20-04) shows why Fourier analysis is so powerful: convolution in time becomes multiplication in frequency. Finally, Wavelets (Section 20-05) overcome Fourier's fundamental limitation - the inability to localize in both time and frequency simultaneously.
POSITION IN THE FOURIER ANALYSIS CHAPTER
========================================================================
Section 20-01 Fourier Series -- YOU ARE HERE
v (take T -> infty)
Section 20-02 Fourier Transform
v (discretize: N samples)
Section 20-03 DFT and FFT
v (mult. in freq <-> conv. in time)
Section 20-04 Convolution Theorem
v (localize in time AND freq)
Section 20-05 Wavelets
Prerequisites: Forward pointers:
Section 01 Mathematical Foundations -> Section 20-02 (continuous FT)
Section 04 Calculus Fundamentals -> Section 20-03 (discrete FT)
Section 12-02 Hilbert Spaces -> Section 12-03 (kernel methods via Bochner)
========================================================================
<- Back to Fourier Analysis | Next: Fourier Transform ->
Appendix A: Extended Examples and Computations
A.1 Full Derivation: Fourier Series of Common Waveforms
This appendix provides complete, step-by-step derivations for the Fourier coefficients of the most important periodic waveforms. These are not presented as exercises - they are reference derivations that you should work through once and understand deeply. Every step is motivated.
A.1.1 Rectangular Pulse Train
Define a pulse of width $2a$ centered at $0$, repeated with period $2\pi$:
$$f(x) = \begin{cases} 1 & |x| < a \\ 0 & a \leq |x| \leq \pi \end{cases}$$
Complex coefficients: for $n \neq 0$:
$$c_n = \frac{1}{2\pi} \int_{-a}^{a} e^{-inx}\,dx = \frac{\sin(na)}{\pi n}$$
For $n = 0$: $c_0 = \frac{a}{\pi}$ (the duty cycle).
Key observation: The envelope of $|c_n|$ versus $n$ follows a $\mathrm{sinc}$ function. The first zero crossing occurs at $n = \pi/a$. A narrower pulse ($a$ small) -> wider spectrum (more high frequencies needed to represent the sharp edges). This is the time-frequency trade-off in action.
Spectrum width: The "bandwidth" (distance to first spectral null) is $\pi/a$ in normalized frequency. Narrow pulses are "broadband"; wide pulses (large $a$, approaching a constant) have most energy concentrated near $n = 0$.
A.1.2 Full-Wave Rectified Sine
$f(x) = |\sin x|$ is even and non-negative. Since $f$ is even, $b_n = 0$ for all $n$. For $a_n$:
$$a_n = \frac{2}{\pi} \int_0^{\pi} \sin x \cos nx\,dx$$
Using the product-to-sum formula $\sin x \cos nx = \frac{1}{2}\left[ \sin((n+1)x) - \sin((n-1)x) \right]$:
$$\int_0^{\pi} \sin x \cos nx\,dx = \frac{1 + (-1)^n}{1 - n^2} \qquad (n \neq 1)$$
For $n$ odd: $a_n = 0$ (and $a_1 = 0$ by direct computation).
For $n$ even:
$$a_n = \frac{2}{\pi} \cdot \frac{2}{1 - n^2} = -\frac{4}{\pi (n^2 - 1)}$$
So $|\sin x| = \frac{2}{\pi} - \frac{4}{\pi} \sum_{k=1}^{\infty} \frac{\cos 2kx}{4k^2 - 1}$.
Application: The full-wave rectified sine is used in AC-to-DC conversion. Its spectrum has no odd harmonics - only even harmonics - which is why its Fourier representation converges faster (coefficients decay as $1/n^2$) than the square wave (which has $1/n$ decay).
A.1.3 The Sawtooth Wave (Staircase Convergence)
$f(x) = x$ on $(-\pi, \pi)$, with $f(\pm\pi) = 0$ (we assign the average at the jumps). We showed $b_n = \frac{2(-1)^{n+1}}{n}$.
Let us verify: at $x = \pi/2$, $f(\pi/2) = \pi/2$. The series gives:
$$2 \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n} \sin\frac{n\pi}{2} = 2\left( 1 - \frac{1}{3} + \frac{1}{5} - \cdots \right) = 2 \cdot \frac{\pi}{4} = \frac{\pi}{2} \;\checkmark$$
This uses the Leibniz formula $\sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1} = \frac{\pi}{4}$.
Convergence rate: For the sawtooth, $|b_n| = \frac{2}{n}$, so $\|f - S_N f\|_2^2 = \pi \sum_{n > N} b_n^2 \approx \frac{4\pi}{N}$ (from the integral test). Convergence is slow: to reduce the $L^2$ error below $\epsilon$, we need $N \sim 1/\epsilon^2$ terms.
A.2 The Heat Equation: Fourier's Original Application
The problem that motivated Fourier's entire theory is the heat equation on a rod:
$$\frac{\partial u}{\partial t} = \alpha\,\frac{\partial^2 u}{\partial x^2}, \qquad 0 < x < L,\; t > 0$$
with boundary conditions $u(0, t) = u(L, t) = 0$ (zero temperature at the ends) and initial condition $u(x, 0) = f(x)$.
Solution by separation of variables:
Assume $u(x, t) = X(x)\,T(t)$. Substituting: $\frac{T'(t)}{\alpha T(t)} = \frac{X''(x)}{X(x)} = -\lambda$ (both sides must equal the same constant). This gives:
$$X'' + \lambda X = 0, \qquad T' + \alpha \lambda T = 0$$
The boundary condition forces $X_n(x) = \sin\frac{n\pi x}{L}$ with eigenvalues $\lambda_n = \left( \frac{n\pi}{L} \right)^2$. The corresponding time solution is $T_n(t) = e^{-\alpha (n\pi/L)^2 t}$.
Superposition: The general solution is:
$$u(x, t) = \sum_{n=1}^{\infty} b_n \sin\frac{n\pi x}{L}\,e^{-\alpha (n\pi/L)^2 t}$$
The coefficients are determined by the initial condition:
$$f(x) = u(x, 0) = \sum_{n=1}^{\infty} b_n \sin\frac{n\pi x}{L}$$
This is exactly the sine series (half-range expansion, Section 6.2). So $b_n = \frac{2}{L} \int_0^L f(x) \sin\frac{n\pi x}{L}\,dx$.
Physical insight: Each mode decays at rate $\alpha (n\pi/L)^2$. High-frequency modes (large $n$) decay much faster than low-frequency modes. At long times, the solution looks like just the fundamental mode. This is Fourier's original discovery: heat diffusion is a low-pass filter in frequency space.
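A short numerical sketch of this solution formula (`numpy` only; the initial condition $f(x) = x(\pi - x)$, the mode count, and the function name are illustrative choices):

```python
import numpy as np

def heat_solution(f, x, t, L=np.pi, alpha=1.0, n_modes=100):
    """u(x, t) = sum_n b_n sin(n pi x / L) exp(-alpha (n pi / L)^2 t) on [0, L]."""
    s = np.linspace(0.0, L, 2001)
    u = np.zeros_like(x)
    for n in range(1, n_modes + 1):
        b_n = (2.0 / L) * np.trapz(f(s) * np.sin(n * np.pi * s / L), s)
        u += b_n * np.sin(n * np.pi * x / L) * np.exp(-alpha * (n * np.pi / L) ** 2 * t)
    return u

x = np.linspace(0.0, np.pi, 101)
f = lambda s: s * (np.pi - s)
for t in (0.0, 0.1, 1.0):
    print(f"t={t}: peak temperature {heat_solution(f, x, t).max():.4f}")
```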
For AI: This is the origin of the spectral bias observation. Gradient descent on a neural network behaves like a heat-equation-style flow in function space: each spectral component of the error decays exponentially at its own rate, set by the corresponding kernel eigenvalue. For standard networks the large eigenvalues sit at low frequencies, so low-frequency structure is fitted first and high-frequency detail last. The spectral bias of neural networks has exactly the same mathematical structure as heat diffusion - an eigenmode expansion with mode-dependent exponential decay.
A.3 Dirichlet's Kernel: Detailed Analysis
Understanding why the Dirichlet kernel causes problems requires a careful look at its behavior.
Closed form derivation:
$$D_N(u) = \sum_{n=-N}^{N} e^{inu} = e^{-iNu}\,\frac{e^{i(2N+1)u} - 1}{e^{iu} - 1} = \frac{\sin\left((N + \tfrac{1}{2})u\right)}{\sin(u/2)}$$
Properties:
- $D_N(0) = 2N + 1$ (the value at the origin)
- $\frac{1}{2\pi} \int_{-\pi}^{\pi} D_N(u)\,du = 1$ (normalized)
- $D_N$ has zeros at $u = \frac{2\pi k}{2N + 1}$ for $k = 1, \ldots, N$
- $D_N$ alternates sign between consecutive zeros: it is NOT a non-negative approximate identity
- $\frac{1}{2\pi} \int_{-\pi}^{\pi} |D_N(u)|\,du \sim \frac{4}{\pi^2} \ln N$ (the $L^1$ norm grows logarithmically)
The logarithmic growth of $\|D_N\|_{L^1}$ is the root cause of convergence problems. A proper approximate identity would have bounded $L^1$ norms - the Dirichlet kernel fails this, which is exactly why pointwise convergence can fail for continuous functions.
Contrast with the Fejer kernel:
$$F_N(u) = \frac{1}{N+1} \left( \frac{\sin\left( \frac{(N+1)u}{2} \right)}{\sin(u/2)} \right)^2$$
The Fejer kernel satisfies: $F_N \geq 0$, $\frac{1}{2\pi} \int_{-\pi}^{\pi} F_N = 1$, and for any $\delta > 0$: $F_N \to 0$ uniformly on $\delta \leq |u| \leq \pi$. These three properties make $F_N$ a genuine approximate identity, guaranteeing uniform convergence for continuous $f$.
A.4 Complex Analysis Connection: Laurent Series
The Fourier series of a function on $[-\pi, \pi]$ is closely related to the Laurent series in complex analysis. If $f$ has Fourier series $\sum_n c_n e^{inx}$, define $z = e^{ix}$ (a point on the unit circle $|z| = 1$). Then:
$$f = \sum_{n=-\infty}^{\infty} c_n z^n$$
This is a Laurent series in $z$ centered at the origin, evaluated on the unit circle. The Fourier coefficient $c_n$ is the $n$-th coefficient of this Laurent series.
Analyticity and series convergence: If $f$ extends analytically to an annulus $r < |z| < R$ (with $r < 1 < R$), then the Laurent series converges absolutely and uniformly on the unit circle, and the Fourier coefficients decay exponentially: $|c_n| \leq C \rho^{|n|}$ for some $\rho < 1$. This is the deep reason why analytic functions have exponentially decaying Fourier coefficients (Section 5.4).
For AI: The connection between Fourier analysis and analytic continuation underlies the theory of analytic functions of neural networks and the frequency-domain analysis of attention patterns. When LLM researchers analyze learned positional encodings in the complex plane, they are (often implicitly) using this Laurent series picture.
A.5 Numerical Experiments: Convergence Visualization
The following experiments (implemented in theory.ipynb) illustrate the key convergence phenomena:
Experiment 1 - Convergence speed: Compute partial sums $S_N f$ for $N = 1, 2, 4, \ldots, 256$ for the square wave. Plot the error $\|f - S_N f\|_2$ as a function of $N$. Observe the $O(N^{-1/2})$ decay implied by the $O(1/n)$ coefficients.
Experiment 2 - Gibbs phenomenon: Plot $S_N f$ for the square wave. Zoom in near $x = 0$. Measure the overshoot height and verify it is $\approx 9\%$ of the jump height 2.
Experiment 3 - Cesaro means fix Gibbs: Plot $\sigma_N f$ (Cesaro means) alongside $S_N f$. Observe that the Gibbs overshoot is absent in the Cesaro means.
Experiment 4 - Decay rate vs smoothness: Compare the coefficient decay rate for: (a) square wave (discontinuous): $|c_n| \sim 1/n$; (b) triangle wave (continuous, piecewise linear): $|c_n| \sim 1/n^2$; (c) smooth bump function: $|c_n|$ decays super-polynomially. Plot $\log|c_n|$ vs $\log n$ to see the exponent. (A condensed sketch follows this list.)
Experiment 5 - RoPE implementation: Implement RoPE as in Exercise 7. Visualize the rotation matrices for positions $m = 0, \ldots, 7$ and a fixed frequency index $i$. Verify that consecutive positions differ by a fixed rotation angle $\theta_i$, confirming the Fourier basis interpretation.
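A condensed version of Experiment 4 (`numpy` only): estimate the decay exponent of $|c_n|$ from FFT coefficients, restricted to odd $n$ since the even coefficients of both waveforms vanish:

```python
import numpy as np

N = 4096
x = np.linspace(-np.pi, np.pi, N, endpoint=False)
waveforms = {"square": np.sign(x), "triangle": np.abs(x)}

for name, f in waveforms.items():
    c = np.fft.fft(f) / N                 # low-index FFT bins approximate |c_n|
    n = np.arange(1, 200, 2)              # odd n only: even coefficients vanish
    slope = np.polyfit(np.log(n), np.log(np.abs(c[n])), 1)[0]
    print(f"{name:8s} decay exponent ~ {slope:+.2f}")  # ~ -1 (square), ~ -2 (triangle)
```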
Appendix B: Advanced Topics
B.1 Multidimensional Fourier Series
On a $d$-dimensional torus $\mathbb{T}^d = [-\pi, \pi]^d$, the Fourier series generalizes naturally. For $f \in L^2(\mathbb{T}^d)$ and $\mathbf{n} \in \mathbb{Z}^d$:
$$f(\mathbf{x}) = \sum_{\mathbf{n} \in \mathbb{Z}^d} c_{\mathbf{n}}\,e^{i \mathbf{n} \cdot \mathbf{x}}, \qquad c_{\mathbf{n}} = \frac{1}{(2\pi)^d} \int_{\mathbb{T}^d} f(\mathbf{x})\,e^{-i \mathbf{n} \cdot \mathbf{x}}\,d\mathbf{x}$$
The orthonormality relation is $\frac{1}{(2\pi)^d} \int_{\mathbb{T}^d} e^{i(\mathbf{m} - \mathbf{n}) \cdot \mathbf{x}}\,d\mathbf{x} = \delta_{\mathbf{m}\mathbf{n}}$, and Parseval generalizes to $\frac{1}{(2\pi)^d} \int |f|^2 = \sum_{\mathbf{n}} |c_{\mathbf{n}}|^2$.
For AI - 2D images: Images are functions on a 2D torus (with periodic boundary). The 2D Fourier series (or DFT) represents an image as a sum of 2D complex exponentials $e^{i(n_1 x + n_2 y)}$. Low-frequency components near the origin represent slowly varying regions; high-frequency components represent edges and textures. JPEG compression discards high-frequency 2D DCT coefficients - this is the 2D cosine series applied to $8 \times 8$ blocks.
B.2 Distributions and the Dirac Delta
For rigorous treatment of impulse-like functions in Fourier analysis, we need distributions (generalized functions). The Dirac delta $\delta$ is defined by:
$$\langle \delta, \phi \rangle = \phi(0) \quad \text{for every smooth test function } \phi$$
Its Fourier coefficients are $c_n = \frac{1}{2\pi}$, so its Fourier series is:
$$\delta(x) = \frac{1}{2\pi} \sum_{n=-\infty}^{\infty} e^{inx}$$
This is the completeness relation of the Fourier system: the identity operator decomposes as a sum over all frequency components. In functional analysis, this is written as $I = \sum_n |e_n\rangle\langle e_n|$ (in Dirac notation).
Verification: For any smooth $\phi$, pairing the series with $\phi$ term by term gives $\frac{1}{2\pi} \sum_n \int \phi(x)\,e^{inx}\,dx = \sum_n \hat{\phi}(-n) = \phi(0)$ by the inversion formula.
Comb function: The Dirac comb $\sum_{k \in \mathbb{Z}} \delta(x - 2\pi k)$ has Fourier coefficients $c_n = \frac{1}{2\pi}$ for all $n$ - its Fourier series is the same as $\delta$. This might seem paradoxical, but it reflects the periodization built into the series context: on $[-\pi, \pi]$, the comb and the single delta coincide.
B.3 Pointwise Convergence: The Full Story (Carleson's Theorem)
The question of pointwise convergence of Fourier series remained open for 120 years after Dirichlet's 1829 result.
Timeline of the convergence problem:
- Dirichlet (1829): Piecewise smooth functions -> pointwise convergence everywhere
- Du Bois-Reymond (1876): Exhibited a continuous function whose Fourier series diverges at one point
- Kolmogorov (1923): Exhibited an $L^1$ function whose Fourier series diverges everywhere!
- Carleson (1966): Proved that for every $f \in L^2$, the Fourier series converges pointwise almost everywhere
- Hunt (1968): Extended to $L^p$ for any $p > 1$
Carleson's theorem (which won him the Abel Prize in 2006) resolved the central question: $L^2$ functions have Fourier series converging a.e. The proof is one of the most difficult in analysis, involving a refined version of Calderon-Zygmund theory.
Practical implication: For all functions we encounter in ML (bounded measurable, piecewise smooth, $L^2$), Fourier series converge pointwise almost everywhere. The pathological examples (Kolmogorov's $L^1$ function, Du Bois-Reymond's continuous function) are carefully engineered constructions that do not arise in practice.
B.4 Sobolev Spaces and Regularity via Fourier Series
Definition (Sobolev spaces): For $s \geq 0$, the Sobolev space $H^s[-\pi, \pi]$ consists of functions whose Fourier coefficients satisfy:
$$\|f\|_{H^s}^2 = \sum_{n=-\infty}^{\infty} (1 + n^2)^s\,|\hat{f}(n)|^2 < \infty$$
The parameter $s$ measures smoothness: $s = 1$ means approximately "once weakly differentiable in $L^2$"; for $s > 1/2$, $H^s$ consists of continuous functions (Sobolev embedding theorem).
Fourier characterization of regularity: A function is in $H^s$ if and only if $(1 + |n|)^s\,\hat{f}(n) \in \ell^2$. This makes Sobolev regularity entirely readable from the Fourier spectrum.
For AI - Spectral bias quantified: The spectral bias (Section 7.3) says networks approximate $f$ more easily if $s$ is large (smooth target). Formally, a well-tuned estimator trained on $n$ samples achieves risk of order $n^{-2s/(2s+d)}$ when targeting $f \in H^s$ - the standard minimax rate for nonparametric regression. Higher $s$ -> faster convergence -> fewer samples needed.
For AI - Kernel smoothness: The RKHS of a kernel is often a Sobolev space $H^s$. The RBF kernel corresponds to infinitely smooth functions ($s = \infty$ in an informal sense), while the Matern kernel with regularity parameter $\nu$ corresponds to $H^{\nu + d/2}$ in dimension $d$. This is why the choice of kernel implicitly assumes a smoothness level for the target function.
B.5 Fejer-Riesz Theorem and Spectral Factorization
Theorem (Fejer-Riesz): Any trigonometric polynomial $p(\theta) = \sum_{n=-N}^{N} p_n e^{in\theta}$ that is non-negative on $[-\pi, \pi]$ can be written as $p(\theta) = |q(e^{i\theta})|^2$, where $q(z) = \sum_{n=0}^{N} q_n z^n$ is an analytic polynomial.
Spectral factorization: Given a non-negative spectral density $S(\omega)$ (such as the power spectral density of a stationary process), we can find a causal filter $H$ such that $S(\omega) = |H(\omega)|^2$. This is the foundation of the Wiener filter (used in denoising) and Cholesky-like factorizations in probability theory.
For AI - Connection to SSMs: State Space Models (S4, Mamba) parameterize their convolution kernels through a spectral factorization: the kernel's frequency response is constrained to be that of a stable, causal linear system. The Fejer-Riesz theorem guarantees that a non-negative spectral density can always be factored into such a causal part.
B.6 The Discrete Cosine Transform and Signal Compression
The Discrete Cosine Transform (DCT) of Type II is defined as:
$$X_k = \sum_{n=0}^{N-1} x_n \cos\left( \frac{\pi k (2n + 1)}{2N} \right), \qquad k = 0, \ldots, N-1$$
This is the even-extension Fourier series (cosine series) applied to a finite sequence. The DCT has two critical properties for signal compression:
1. Energy compaction: For natural signals (images, audio), the DCT concentrates most of the signal energy in the first few low-frequency coefficients. The DCT of a typical image block has 95%+ of its energy in the top-left coefficients.
2. Decorrelation: The DCT approximately diagonalizes the covariance matrix of natural signals, making the coefficients nearly uncorrelated. This allows independent quantization of each coefficient.
JPEG compression pipeline:
- Divide image into $8 \times 8$ blocks
- Apply 2D DCT to each block
- Quantize coefficients (coarser quantization for high frequencies)
- Run-length encode the quantized sparse coefficients
- Huffman encode
The quality parameter in JPEG controls the quantization step sizes - higher quality means finer quantization of the high-frequency DCT coefficients.
DCT vs DFT: The DCT avoids the "spectral leakage" (Gibbs phenomenon) of the DFT because the even extension of a finite sequence is continuous at the boundaries. This is why JPEG uses DCT rather than DFT.
B.7 Uniform Convergence: Weierstrass M-Test Application
Let us prove uniform convergence of the triangle wave Fourier series rigorously.
Claim: The Fourier series of the triangle wave, $\frac{\pi}{2} - \frac{4}{\pi} \sum_{k=0}^{\infty} \frac{\cos((2k+1)x)}{(2k+1)^2}$, converges uniformly.
Proof via Weierstrass M-Test: We need $|u_k(x)| \leq M_k$ with $\sum_k M_k < \infty$, where $u_k(x) = \frac{4}{\pi} \cdot \frac{\cos((2k+1)x)}{(2k+1)^2}$.
Then $M_k = \frac{4}{\pi (2k+1)^2}$ and $\sum_{k=0}^{\infty} M_k = \frac{4}{\pi} \sum_{k=0}^{\infty} \frac{1}{(2k+1)^2} = \frac{4}{\pi} \cdot \frac{\pi^2}{8} = \frac{\pi}{2} < \infty$.
Since the series of bounds converges, the Weierstrass M-test gives uniform convergence.
Contrast with square wave: For the square wave, the coefficients are $|b_n| = \frac{4}{\pi n}$ for odd $n$, and $\sum_{n \text{ odd}} \frac{1}{n}$ diverges (harmonic series). So the Weierstrass M-test fails - consistent with non-uniform convergence near the jumps.
B.8 Connection to Functional Analysis: Completeness Proof Sketch
Here is a sketch of why the trigonometric system is complete in - a result we stated but did not prove in Section 2.2.
Step 1: Stone-Weierstrass theorem: Continuous periodic functions can be uniformly approximated by trigonometric polynomials (polynomials in $e^{ix}$ and $e^{-ix}$).
Step 2: Density: Continuous functions are dense in $L^2[-\pi, \pi]$ (this is a standard result in measure theory).
Step 3: Combining: For any $f \in L^2$ and $\epsilon > 0$, find continuous $g$ with $\|f - g\|_2 < \epsilon/2$, then find a trigonometric polynomial $P$ with $\|g - P\|_\infty < \frac{\epsilon}{2\sqrt{2\pi}}$, giving $\|g - P\|_2 < \epsilon/2$. The triangle inequality gives $\|f - P\|_2 < \epsilon$.
Full proof: See Section 12-02 Hilbert Spaces where the abstract completeness theorem is proved and the trigonometric system is given as the primary example.
Appendix C: Problem-Solving Strategies
C.1 Computing Fourier Coefficients: A Systematic Approach
Follow this checklist when computing Fourier coefficients:
1. Check parity first. Is $f$ even? -> only $a_n$ terms. Odd? -> only $b_n$ terms. Neither? -> compute all.
2. Check for known waveform. Square wave, sawtooth, triangle wave, rectangular pulse - look up or recall the result.
3. Split piecewise functions. Write $\int_{-\pi}^{\pi} = \int_{-\pi}^{0} + \int_{0}^{\pi}$ if $f$ is defined piecewise.
4. Use integration by parts for polynomial-times-trig integrands such as $\int x^m \cos nx\,dx$. Set $u = x^m$, $dv = \cos nx\,dx$. Repeat until done.
5. Check boundary terms. For periodic $f$: boundary terms in integration by parts vanish. For non-periodic problems (PDEs), keep them.
6. Use the differentiation theorem as a shortcut: if you know $\widehat{f'}(n)$, then $\hat{f}(n) = \frac{\widehat{f'}(n)}{in}$ for $n \neq 0$.
7. Verify via Parseval. Compute $\sum_n |c_n|^2$ and check it equals $\frac{1}{2\pi} \int_{-\pi}^{\pi} |f|^2\,dx$ - this catches algebra errors.
C.2 Determining Convergence Type
| Condition | Conclusion |
|---|---|
| $f$ piecewise smooth | Pointwise convergence to $\frac{f(x^+) + f(x^-)}{2}$ (Dirichlet) |
| $f$ continuous + piecewise $C^1$ | Uniform convergence |
| $f \in L^2$ | $\|f - S_N f\|_2 \to 0$ (mean-square) |
| $f \in C^k$, periodic | Uniform convergence; coefficients $o(n^{-k})$ |
| $f$ analytic | Exponential coefficient decay; exponentially fast uniform convergence |
C.3 Recognizing When Fourier Series Appear in ML
Watch for these signatures of Fourier analysis in ML papers and code:
- $\sin$ and $\cos$ at geometrically spaced frequencies in positional encodings -> Fourier basis
- "Frequency domain" or "spectral" in the title -> likely uses DFT or FT
- Complex multiplication or rotation in attention -> RoPE or a relative-PE variant
- "Low-pass filter" or "high-pass filter" -> filter = convolution = Fourier operation
- Derivative penalties (e.g., $\|\nabla f\|_2^2$) or weight decay in regularization -> Sobolev norm regularization
- "Spectral density" or "power spectral density" -> autocorrelation -> Wiener-Khinchin -> Fourier
- State space model (SSM, S4, Mamba) -> the convolution kernel is parameterized in frequency domain
Appendix D: Deep Dive - Proofs and Derivations
D.1 Proof of Dirichlet's Theorem (Complete)
We prove the central convergence result: for $f$ piecewise smooth and $2\pi$-periodic, $S_N f(x) \to \frac{f(x^+) + f(x^-)}{2}$.
Setup. Write $S_N f(x) = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(t)\,D_N(x - t)\,dt$. By periodicity, we may center the integral at $x$:
$$S_N f(x) = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x + t)\,D_N(t)\,dt$$
Since $\frac{1}{2\pi} \int_{-\pi}^{\pi} D_N(t)\,dt = 1$, we can write:
$$S_N f(x) - \frac{f(x^+) + f(x^-)}{2} = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left[ f(x + t) - \frac{f(x^+) + f(x^-)}{2} \right] D_N(t)\,dt$$
Split the integral. Since $D_N$ is even, combining the contributions from $t > 0$ and $t < 0$:
$$S_N f(x) - \frac{f(x^+) + f(x^-)}{2} = \frac{1}{2\pi} \int_0^{\pi} \left[ f(x+t) - f(x^+) + f(x-t) - f(x^-) \right] D_N(t)\,dt$$
Key function. Define:
$$g(t) = \frac{f(x+t) - f(x^+) + f(x-t) - f(x^-)}{2 \sin(t/2)}$$
We need $g \in L^1[0, \pi]$. Near $t = 0$: the numerator behaves like $O(t)$ (by piecewise smoothness), and $\sin(t/2) \approx t/2$, giving $g = O(1)$ - bounded near $0$. Away from $0$: $g$ is bounded since $f$ is bounded and $\sin(t/2)$ is bounded away from zero. So $g \in L^1$.
Apply Riemann-Lebesgue. Using $D_N(t) = \frac{\sin((N + 1/2)t)}{\sin(t/2)}$, the expression becomes:
$$\frac{1}{\pi} \int_0^{\pi} g(t) \sin\left( (N + \tfrac{1}{2}) t \right) dt$$
By the Riemann-Lebesgue lemma (proved below), this $\to 0$ as $N \to \infty$.
Riemann-Lebesgue Lemma (proof): For $g \in L^1$:
$$\int g(t)\,e^{-i\lambda t}\,dt \to 0 \quad \text{as } |\lambda| \to \infty$$
Proof for step functions: If $g = \mathbb{1}_{[a, b]}$, then $\int_a^b e^{-i\lambda t}\,dt = \frac{e^{-i\lambda a} - e^{-i\lambda b}}{i\lambda} \to 0$. By linearity, this holds for any step function. For general $g \in L^1$, approximate by step functions and use the bound $\left| \int (g - h)\,e^{-i\lambda t}\,dt \right| \leq \|g - h\|_{L^1}$.
D.2 Parseval's Identity: Detailed Proof
Theorem. For $f \in L^2[-\pi, \pi]$ with complex Fourier coefficients $c_n$:
$$\frac{1}{2\pi} \int_{-\pi}^{\pi} |f|^2\,dx = \sum_{n=-\infty}^{\infty} |c_n|^2$$
Proof (assuming completeness). Consider the partial sum error:
$$\|f - S_N f\|_2^2 = \|f\|_2^2 - 2\,\mathrm{Re}\,\langle f, S_N f \rangle + \|S_N f\|_2^2$$
The cross-term satisfies $\langle f, S_N f \rangle = \|S_N f\|_2^2$ (by the projection property: $S_N f$ is the orthogonal projection of $f$ onto the span of $\{e^{inx}\}_{|n| \leq N}$, hence the best approximation from that subspace).
So $\|f - S_N f\|_2^2 = \|f\|_2^2 - \|S_N f\|_2^2 = \|f\|_2^2 - 2\pi \sum_{|n| \leq N} |c_n|^2 \geq 0$ (Bessel).
By completeness, $\|f - S_N f\|_2 \to 0$, so:
$$\|f\|_2^2 = \lim_{N \to \infty} 2\pi \sum_{|n| \leq N} |c_n|^2 = 2\pi \sum_{n=-\infty}^{\infty} |c_n|^2$$
This gives the result.
Remark: The key step is $\|f\|_2^2 = \|S_N f\|_2^2 + \|f - S_N f\|_2^2$. This is the Pythagorean theorem in $L^2$: $f$ decomposes orthogonally into $S_N f$ (the projection) and $f - S_N f$ (the residual). No analogy - it IS the Pythagorean theorem, in an infinite-dimensional space.
D.3 Gibbs Phenomenon: Precise Quantification
Let $f = \operatorname{sgn}(x)$ (square wave). We compute the exact overshoot at the jump $x = 0$.
The partial sum (with harmonics up to $2K + 1$) has its first local maximum at $x_K = \frac{\pi}{2(K+1)}$, the first zero of its derivative to the right of the jump.
Computing:
$$S f(x_K) = \frac{4}{\pi} \sum_{k=0}^{K} \frac{\sin((2k+1)x_K)}{2k+1}$$
This is a Riemann sum for $\frac{2}{\pi} \int_0^{\pi} \frac{\sin t}{t}\,dt$ with step $2 x_K$. As $K \to \infty$:
$$\lim_{K \to \infty} S f(x_K) = \frac{2}{\pi} \int_0^{\pi} \frac{\sin t}{t}\,dt = \frac{2}{\pi}\,\mathrm{Si}(\pi)$$
The jump in $f$ is from $-1$ to $+1$ (height 2). The overshoot above the limiting value $1$ is $\frac{2}{\pi}\mathrm{Si}(\pi) - 1 \approx 0.179$, which is $\approx 8.9\%$ of the jump height.
Numerical value:
$$\mathrm{Si}(\pi) = \int_0^{\pi} \frac{\sin t}{t}\,dt \approx 1.85194, \qquad \frac{2}{\pi}\,\mathrm{Si}(\pi) \approx 1.17898$$
Why $\approx 9\%$ is universal: The integral $\frac{2}{\pi}\mathrm{Si}(\pi)$ does not depend on the specific piecewise smooth function - only on the jump height. Every jump discontinuity in any piecewise smooth function gives a Gibbs overshoot of this universal fraction.
D.4 The Weierstrass Approximation Theorem and Completeness
Weierstrass Approximation Theorem (1885): Every continuous function on a closed interval $[a, b]$ can be uniformly approximated by algebraic polynomials.
Trigonometric variant: Every continuous $2\pi$-periodic function can be uniformly approximated by trigonometric polynomials $\sum_{|n| \leq N} c_n e^{inx}$.
Proof sketch (Fejer's approach): The Cesaro means $\sigma_N f$ are trigonometric polynomials (each $S_k f$ is a trigonometric polynomial of degree $k$, and their average is also one). Fejer proved $\sigma_N f \to f$ uniformly for continuous $f$.
Why completeness follows: Given $f \in L^2$ and $\epsilon > 0$:
- Find continuous $g$ with $\|f - g\|_2 < \epsilon/2$ (density of $C$ in $L^2$)
- Find trigonometric polynomial $P$ with $\|g - P\|_\infty < \frac{\epsilon}{2\sqrt{2\pi}}$ (Weierstrass/Fejer)
- Then $\|g - P\|_2 \leq \sqrt{2\pi}\,\|g - P\|_\infty < \epsilon/2$
- Triangle inequality: $\|f - P\|_2 < \epsilon$
D.5 Fourier Series and Operator Theory
The Fourier series establishes a fundamental connection between function analysis and operator theory.
The operator perspective: Consider the differentiation operator $D = \frac{d}{dx}$ on periodic functions. The eigenfunctions of $D$ are exactly the complex exponentials:
$$D\,e^{inx} = in\,e^{inx}$$
with eigenvalue $in$. The Fourier series is the eigenfunction expansion of $D$. Every linear differential operator with constant coefficients:
$$L = \sum_{k=0}^{m} a_k\,\frac{d^k}{dx^k}$$
is diagonalized by the Fourier basis:
$$L\,e^{inx} = p(in)\,e^{inx}, \qquad p(z) = \sum_{k=0}^{m} a_k z^k$$
where $p$ is the symbol of the operator.
For AI: Attention is (in the linear case) an operator on sequence positions. A linear attention matrix with entries $A(m, n) = K(m - n)$ (for a shift-invariant kernel $K$) is diagonalized by the Fourier basis - its eigenvectors are the complex exponentials and its eigenvalues are the Fourier coefficients of $K$. This is why spectral analysis of attention patterns connects directly to Fourier theory, and why RoPE (which multiplies by $e^{im\theta}$) is so natural in this framework.
Compact operators: If the eigenvalues $p(in) \to 0$ as $|n| \to \infty$, then $L$ is a compact operator. The inverse problem (recovering $f$ from $Lf$) is then ill-posed (small changes in $Lf$ can cause large changes in $f$) - this is the mathematical underpinning of regularization in machine learning.
Appendix E: Summary Tables
E.1 Fourier Coefficient Reference
| Function on $[-\pi, \pi]$ | $c_n$ (complex) | Notes |
|---|---|---|
| $f = 1$ (constant) | $c_0 = 1$, $c_n = 0$ ($n \neq 0$) | DC component only |
| $\cos kx$ ($k \geq 1$) | $c_{\pm k} = \frac{1}{2}$, rest $0$ | Single frequency |
| $\sin kx$ ($k \geq 1$) | $c_k = \frac{1}{2i}$, $c_{-k} = -\frac{1}{2i}$, rest $0$ | Single frequency |
| $\operatorname{sgn}(x)$ (square wave) | $c_n = \frac{2}{i\pi n}$ ($n$ odd), $0$ (even/zero) | $1/n$ decay |
| $x$ (sawtooth) | $c_n = \frac{(-1)^{n+1}}{in}$ ($n \neq 0$), $c_0 = 0$ | $1/n$ decay |
| $\lvert x \rvert$ (triangle) | $c_0 = \frac{\pi}{2}$, $c_n = \frac{(-1)^n - 1}{\pi n^2}$ ($n \neq 0$) | $1/n^2$ decay |
| analytic periodic $f$ | $\lvert c_n \rvert \leq C \rho^{\lvert n \rvert}$, $\rho < 1$ | Exponential decay |
| $\delta$ (Dirac) | $c_n = \frac{1}{2\pi}$ (all $n$) | No decay (distributional) |
E.2 Properties Reference Table
| Property | Time domain | Frequency domain |
|---|---|---|
| Linearity | $\alpha f + \beta g$ | $\alpha \hat{f}(n) + \beta \hat{g}(n)$ |
| Shift | $f(x - a)$ | $e^{-ina}\,\hat{f}(n)$ |
| Modulation | $e^{ikx} f(x)$ | $\hat{f}(n - k)$ |
| Differentiation | $f'(x)$ | $in\,\hat{f}(n)$ |
| Integration | $\int_0^x f(t)\,dt$ | $\frac{\hat{f}(n)}{in}$ ($n \neq 0$) |
| Reflection | $f(-x)$ | $\hat{f}(-n)$ |
| Conjugation | $\overline{f(x)}$ | $\overline{\hat{f}(-n)}$ |
| Convolution | $(f * g)(x) = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(t)\,g(x - t)\,dt$ | $\hat{f}(n)\,\hat{g}(n)$ |
| Multiplication | $f(x)\,g(x)$ | $\sum_k \hat{f}(k)\,\hat{g}(n - k)$ (convolution in freq) |
E.3 Convergence Summary
| Type | Condition | Rate | Gibbs? |
|---|---|---|---|
| $L^2$ | $f \in L^2$ (always) | $\|f - S_N f\|_2 \to 0$ | N/A |
| Pointwise | $f$ piecewise smooth | $O(1/N)$ error at smooth points | Yes at jumps |
| Uniform | $f \in C^1$, periodic | $O(N^{-1/2})$ in $\|\cdot\|_\infty$ | No |
| Superpolynomial | $f \in C^\infty$, periodic | Faster than any $N^{-k}$ | No |
| Exponential | $f$ analytic | $O(\rho^N)$, $\rho < 1$ | No |
| Cesaro | $f$ continuous | Uniform (Fejer) | No |
Appendix F: Complete Worked Problems
F.1 Problem: Full Solution of the Sine Series for $f(x) = x(\pi - x)$ on $[0, \pi]$
This is a common PDE boundary value problem. We want the sine series of $f(x) = x(\pi - x)$ on $[0, \pi]$ (odd extension):
$$b_n = \frac{2}{\pi} \int_0^{\pi} x(\pi - x) \sin nx\,dx$$
Step 1: Split the integral.
$$b_n = \frac{2}{\pi} \left[ \pi \int_0^{\pi} x \sin nx\,dx - \int_0^{\pi} x^2 \sin nx\,dx \right]$$
Step 2: Compute $\int_0^{\pi} x \sin nx\,dx$ by parts.
Set $u = x$, $dv = \sin nx\,dx$:
$$\int_0^{\pi} x \sin nx\,dx = \left[ -\frac{x \cos nx}{n} \right]_0^{\pi} + \frac{1}{n} \int_0^{\pi} \cos nx\,dx = \frac{\pi (-1)^{n+1}}{n}$$
Step 3: Compute $\int_0^{\pi} x^2 \sin nx\,dx$ by parts twice.
First integration by parts: $u = x^2$, $dv = \sin nx\,dx$:
$$\int_0^{\pi} x^2 \sin nx\,dx = \left[ -\frac{x^2 \cos nx}{n} \right]_0^{\pi} + \frac{2}{n} \int_0^{\pi} x \cos nx\,dx$$
Second integration by parts:
$$\int_0^{\pi} x \cos nx\,dx = \left[ \frac{x \sin nx}{n} \right]_0^{\pi} - \frac{1}{n} \int_0^{\pi} \sin nx\,dx = \frac{(-1)^n - 1}{n^2}$$
So:
$$\int_0^{\pi} x^2 \sin nx\,dx = \frac{\pi^2 (-1)^{n+1}}{n} + \frac{2\left( (-1)^n - 1 \right)}{n^3}$$
Step 4: Combine.
$$b_n = \frac{2}{\pi} \left[ \frac{\pi^2 (-1)^{n+1}}{n} - \frac{\pi^2 (-1)^{n+1}}{n} - \frac{2\left( (-1)^n - 1 \right)}{n^3} \right] = \frac{4\left( 1 - (-1)^n \right)}{\pi n^3} = \begin{cases} \dfrac{8}{\pi n^3} & n \text{ odd} \\ 0 & n \text{ even} \end{cases}$$
Result:
$$x(\pi - x) = \frac{8}{\pi} \sum_{k=0}^{\infty} \frac{\sin((2k+1)x)}{(2k+1)^3}$$
Verification via Parseval:
$$\frac{2}{\pi} \int_0^{\pi} x^2 (\pi - x)^2\,dx = \frac{2}{\pi} \cdot \frac{\pi^5}{30} = \frac{\pi^4}{15}$$
Parseval: $\sum_n b_n^2 = \frac{64}{\pi^2} \sum_{k=0}^{\infty} \frac{1}{(2k+1)^6} = \frac{64}{\pi^2} \cdot \frac{\pi^6}{960} = \frac{\pi^4}{15}$, giving an exact match. (Verifiable!)
Application to heat equation: The solution of $u_t = u_{xx}$ on $[0, \pi]$ with $u(0, t) = u(\pi, t) = 0$ and $u(x, 0) = x(\pi - x)$ is:
$$u(x, t) = \frac{8}{\pi} \sum_{k=0}^{\infty} \frac{\sin((2k+1)x)}{(2k+1)^3}\,e^{-(2k+1)^2 t}$$
At $t = 0$: this reduces to the Fourier series of $x(\pi - x)$. At large $t$: the dominant term is $\frac{8}{\pi} \sin(x)\,e^{-t}$ - the solution relaxes toward zero via the fundamental mode.
F.2 Problem: Fourier Series of $e^{ikx}$ - The Modulation Property
This trivial-seeming example illuminates the modulation property.
$f(x) = e^{ikx}$ is already a single Fourier mode, with $c_k = 1$ and $c_n = 0$ for $n \neq k$.
Now consider $e^{imx} f(x)$ for integer $m$. By the modulation property:
$$\widehat{e^{imx} f}(n) = \hat{f}(n - m)$$
So $e^{imx} f$ has all energy concentrated at frequency $k + m$. This is the frequency shift or modulation property: multiplying by $e^{imx}$ shifts the spectrum by $m$.
Application to RoPE in detail: In RoPE, the query at position $m$ for dimension pair $j$ is:
$$q_j' = q_j\,e^{im\theta_j}$$
This is multiplication by a complex exponential - i.e., modulation by frequency $\theta_j$. The inner product with the key at position $n$:
$$\sum_j q_j'\,\overline{k_j'} = \sum_j q_j\,\overline{k_j}\,e^{i(m-n)\theta_j}$$
The score depends only on $m - n$ - relative position - by the modulation-then-conjugate-product structure. This is the Fourier basis shift theorem instantiated in transformer attention.
F.3 Problem: Convergence Rate Comparison
Compare the convergence rates of the Fourier series for four functions:
Function A: Square wave (1 discontinuity per period)
- $|c_n| \sim 1/n$ -> $\|f - S_N f\|_2 \sim N^{-1/2}$
- Needs $N \sim 10^4$ terms for $\sim 1\%$ relative $L^2$ error
Function B: Triangle wave (continuous, corners = 1 discontinuity in derivative)
- $|c_n| \sim 1/n^2$ -> $\|f - S_N f\|_2 \sim N^{-3/2}$
- Needs $N \sim 20$ terms for $\sim 1\%$ relative error
Function C: Smooth periodic bump
- $f$ is analytic -> $|c_n|$ decays exponentially, $|c_n| \leq C \rho^n$
- Needs only $N \sim \log(1/\epsilon)$ terms for relative error $\epsilon$
Function D: $f(x) = x^2$ on $[-\pi, \pi]$ (continuous when periodized, with corners at $\pm\pi$)
- $|a_n| = 4/n^2$ -> similar to triangle wave
- Fast convergence
Lesson: The algebraic convergence rates for discontinuous or non-smooth functions make Fourier truncation expensive. This motivates wavelets (Section 20-05), which can represent localized features (edges) efficiently without paying the Gibbs penalty.
F.4 Problem: Convolution in Fourier Space
Setup: Let $f(x) = \cos x$ and $g(x) = \cos 2x$. Compute $(f * g)(x) = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(t)\,g(x - t)\,dt$ using the convolution theorem.
Using the convolution theorem: $\widehat{f * g}(n) = \hat{f}(n)\,\hat{g}(n)$.
Fourier coefficients of $f$: $c_{\pm 1} = \frac{1}{2}$, rest zero.
Fourier coefficients of $g$: $c_{\pm 2} = \frac{1}{2}$, rest zero.
Product $\hat{f}(n)\,\hat{g}(n)$: This is zero for all $n$, since the non-zero frequencies of $f$ ($n = \pm 1$) and $g$ ($n = \pm 2$) are disjoint!
So $f * g \equiv 0$ - the convolution of $f$ and $g$ is identically zero.
Direct verification: $\frac{1}{2\pi} \int_{-\pi}^{\pi} \cos t \cos(2(x - t))\,dt = 0$ for every $x$: expanding $\cos(2x - 2t)$ leaves only integrals of $\cos t \cos 2t$ and $\cos t \sin 2t$ over a full period. Each integral is zero by orthogonality.
Lesson: The convolution of two functions with non-overlapping spectra is zero. This is the frequency-domain multiplication interpretation of convolution. In signal processing, this means two signals at different frequencies don't interfere - the superposition principle holds in frequency space.
For AI: This is why attention heads can specialize to different frequency ranges. If head $A$ attends to low-frequency positional patterns and head $B$ attends to high-frequency syntactic patterns, their convolution-like operations in frequency space are orthogonal.
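The same computation, done numerically via FFT-based circular convolution (`numpy` only; the grid size and normalization are ours):

```python
import numpy as np

N = 4096
x = np.linspace(-np.pi, np.pi, N, endpoint=False)
f, g = np.cos(x), np.cos(2 * x)

# Normalized circular convolution (f*g)(x) = (1/2pi) * integral f(t) g(x - t) dt.
# ifft(fft(f) * fft(g)) is the circular convolution sum; dividing by N applies
# the factor (1/2pi) * dt with dt = 2pi/N.
conv = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real / N
print("max |f*g|:", np.abs(conv).max())   # ~1e-16: the spectra {+-1}, {+-2} are disjoint
```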
F.5 Problem: Parseval and the Isoperimetric Inequality
The isoperimetric inequality - that among all closed curves of length $L$, the circle encloses the maximum area - has an elegant proof via Fourier series.
Setup: Parameterize a smooth closed curve of length $L$ by arclength, rescaled to the parameter $t \in [0, 2\pi]$: $(x(t), y(t))$ with period $2\pi$. The constraint is $x'(t)^2 + y'(t)^2 = \left( \frac{L}{2\pi} \right)^2$.
Area via Green's theorem: $A = \frac{1}{2} \oint (x\,dy - y\,dx)$.
Fourier expand: $x(t) = \sum_n a_n e^{int}$, $y(t) = \sum_n b_n e^{int}$.
Apply Parseval to the arc length constraint: $\frac{1}{2\pi} \int_0^{2\pi} \left( x'^2 + y'^2 \right) dt = \left( \frac{L}{2\pi} \right)^2$ -> after Parseval:
$$\sum_n n^2 \left( |a_n|^2 + |b_n|^2 \right) = \frac{L^2}{4\pi^2}$$
Bound the area: In Fourier coefficients, $A = 2\pi \left| \sum_n n\,\mathrm{Im}\!\left( a_n \overline{b_n} \right) \right|$. Using $2 |a_n| |b_n| \leq |a_n|^2 + |b_n|^2$ and $|n| \leq n^2$:
$$A \leq \pi \sum_n n^2 \left( |a_n|^2 + |b_n|^2 \right) = \frac{L^2}{4\pi}$$
Equality holds when only the $n = \pm 1$ terms are non-zero - i.e., when the curve is a circle.
Conclusion: $A \leq \frac{L^2}{4\pi}$, with equality for a circle. This is the isoperimetric inequality, proved entirely via Parseval's theorem.
For AI: Isoperimetric inequalities in function spaces underlie capacity control in learning theory. The VC dimension and Rademacher complexity measure the "surface area to volume ratio" of hypothesis classes in high-dimensional function spaces - and Fourier analysis provides the tools to compute these quantities.
Appendix G: In-Depth AI Applications
G.1 Transformer Positional Encodings: A Unified View
To understand the deep connection between Fourier series and transformer positional encodings, let us develop the theory carefully.
The fundamental problem in attention. The self-attention mechanism is permutation-equivariant: permuting the input tokens permutes the output, but the attention weights treat all positions symmetrically. Without positional encodings, "the cat sat on the mat" and "the mat sat on the cat" would produce the same attention weights.
The sinusoidal solution (Vaswani et al., 2017). Add a position-dependent vector to the $p$-th token embedding before attention. Use:
$$PE(p, 2k) = \sin(p\,\omega_k), \qquad PE(p, 2k+1) = \cos(p\,\omega_k), \qquad \omega_k = 10000^{-2k/d}$$
The key property. For any fixed offset $\Delta$, there exists a linear map $T_\Delta$ (independent of $p$) such that $PE(p + \Delta) = T_\Delta\,PE(p)$. This is because:
$$\begin{pmatrix} \sin((p+\Delta)\omega_k) \\ \cos((p+\Delta)\omega_k) \end{pmatrix} = \begin{pmatrix} \cos(\Delta\omega_k) & \sin(\Delta\omega_k) \\ -\sin(\Delta\omega_k) & \cos(\Delta\omega_k) \end{pmatrix} \begin{pmatrix} \sin(p\,\omega_k) \\ \cos(p\,\omega_k) \end{pmatrix}$$
Each dimension pair transforms via a rotation by $\Delta\omega_k$ - a Fourier shift in 2D.
Why different frequencies? At frequency $\omega_k$:
- $k = 0$: $\omega_0 = 1$, period $2\pi$ - distinguishes adjacent tokens
- mid-range $k$: periods of tens to hundreds of positions - distinguishes tokens that far apart
- $k = d/2 - 1$: $\omega \approx 10^{-4}$, period $\approx 2\pi \times 10^4$ - distinguishes tokens up to distance $\sim 60{,}000$
This is exactly the geometric progression of harmonics in a Fourier series on a long interval.
The Fourier basis interpretation. Define $\phi(p) = PE(p, \cdot)$. The inner product $\langle \phi(p), \phi(q) \rangle = \sum_k \cos((p - q)\,\omega_k)$ depends only on $p - q$. This is the autocorrelation function of the positional encoding - it measures how "similar" two positions are based on their relative offset.
Limitation. These encodings are absolute (the encoding of position 5 is the same regardless of whether the sequence has length 10 or 1000). For sequences longer than those seen in training, the high-frequency components ($k$ small) still work, but the model has never seen the low-frequency components ($k$ large) vary much. This causes length generalization failure.
G.2 RoPE: Relative Positions via Complex Multiplication
The RoPE construction. Instead of adding $PE(p)$ to the embedding, RoPE rotates the query and key vectors:
$$q_m = R_m\,q, \qquad k_n = R_n\,k$$
where $R_m$ is the block-diagonal rotation matrix:
$$R_m = \mathrm{diag}\left( R(m\theta_0), R(m\theta_1), \ldots, R(m\theta_{d/2-1}) \right), \qquad R(\alpha) = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix}$$
Computing the attention score.
$$q_m^\top k_n = (R_m q)^\top (R_n k) = q^\top R_m^\top R_n\,k = q^\top R_{n-m}\,k$$
since $R_m^\top R_n = R_{n-m}$ (rotation by $(n - m)\theta_i$ in each block). The score is a function of $n - m$ only - position-independent in the absolute sense, position-sensitive in the relative sense.
The complex representation. In complex notation, the $i$-th pair maps to the complex number $z_i = q_{2i} + i\,q_{2i+1}$. The RoPE operation is:
$$z_i \mapsto e^{im\theta_i}\,z_i$$
This is multiplication by the unit complex number $e^{im\theta_i}$ - a phase rotation of the complex Fourier coefficient at frequency $\theta_i$.
Long-context extrapolation. The base $\theta_i = 10000^{-2i/d}$ means the maximum period is $2\pi \times 10000 \approx 63{,}000$ positions. For positions beyond this, the encoding wraps around (becomes periodic) and position information is lost. This is why standard RoPE models struggle at contexts beyond tens of thousands of tokens.
Fix - Extended RoPE: Scale $m \mapsto m/s$, where $s$ is the ratio of new context length to training context length (Position Interpolation, Chen et al., 2023). Alternatively, YaRN (Peng et al., 2023) uses a non-uniform frequency scaling based on Fourier analysis of the empirical attention patterns.
G.3 Spectral Bias: Mathematical Theory
The spectral bias (or frequency principle) of neural networks has a rigorous formulation in terms of the neural tangent kernel (NTK) and Fourier analysis.
The NTK framework. For a wide neural network trained with gradient descent, the NTK $\Theta(x, x')$ is approximately constant during training. The training dynamics are the gradient flow
$$\frac{d f_t}{dt} = -\eta\, \Theta\, (f_t - y).$$
This is a kernel regression problem in function space, where $\Theta$ acts as the kernel.
Spectral decomposition. Let $\{\lambda_k, \phi_k\}$ be the eigenvalues/eigenfunctions of $\Theta$ in $L^2$. The projection of the target $y$ onto eigenfunction $\phi_k$ has coefficient $c_k = \langle y, \phi_k \rangle$. Under gradient descent:
$$\langle f_t, \phi_k \rangle = c_k \big(1 - e^{-\eta \lambda_k t}\big),$$
where $c_k$ is the target coefficient. The coefficient at eigenfrequency $k$ converges at rate $\eta\lambda_k$ - fast if $\lambda_k$ is large, slow if it is small (simulated in the sketch after the list below).
The spectral bias. For standard MLPs with Gaussian initialization and ReLU activations, the NTK has eigenvalues that decrease with frequency (high-frequency eigenfunctions have small eigenvalues). Therefore:
- Low-frequency Fourier components have large eigenvalues -> learned quickly
- High-frequency components have small eigenvalues -> learned slowly
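The per-mode dynamics can be simulated directly. A minimal sketch, assuming a power-law NTK spectrum $\lambda_k \sim k^{-2}$ (an illustrative choice, not a derived value):

```python
import numpy as np

# Per-mode learning: <f_t, phi_k> = c_k * (1 - exp(-eta * lambda_k * t)).
eta = 0.1
k = np.arange(1, 9)
lam = k ** -2.0                    # assumed NTK eigenvalue decay with frequency
c = np.ones_like(lam)              # target with equal energy in every mode

for t in [10, 100, 1000]:
    learned = c * (1 - np.exp(-eta * lam * t))
    print(f"t={t:5d}:", np.round(learned, 2))
# Low-frequency modes (small k, large lambda) saturate first;
# high-frequency modes are learned orders of magnitude later.
```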
Quantitative: the Barron approximation theorem. A function $f$ with Barron norm $C_f = \int \|\omega\|\,|\hat f(\omega)|\,d\omega$ can be approximated to $L^2$ accuracy $\epsilon$ with $O(C_f^2/\epsilon^2)$ neurons. The Barron norm penalizes high-frequency components: if $f$ has large high-frequency content ($|\hat f(\omega)|$ large for large $\|\omega\|$), then $C_f$ is large and the network needs more parameters.
Implications for practice:
- Overfitting to noise: High-frequency noise (which lives in the small-$\lambda_k$ eigenmodes) is learned last -> early stopping acts as a frequency low-pass filter
- Data augmentation: Adding geometric augmentations (rotations, flips) destroys high-frequency features -> forces the network to rely on low-frequency structure -> better generalization
- NeRF: Without positional encoding, networks can't represent the high-frequency variation in scene appearance -> Fourier feature encodings inject high frequencies explicitly
G.4 SIREN Networks: Replacing ReLU with Sine
SIREN (Sitzmann et al., 2020). Instead of using ReLU activations, SIREN uses sinusoidal activations:
$$\phi(x) = \sin\big(\omega_0\,(Wx + b)\big).$$
The derivative is again a SIREN (differentiation turns sines into cosines, which are just phase-shifted sines). This is the key property: the derivative of a SIREN is also a SIREN with the same architecture.
Fourier interpretation. The output of a SIREN is a sum of sinusoids at multiple frequencies and phases. The composition of sine activations creates a rich set of Fourier components. The Fourier spectrum of a SIREN output can represent high-frequency content that a standard ReLU network would struggle to learn.
Mathematical property. For a single-layer SIREN with $N$ neurons, the output is:
$$f(x) = \sum_{j=1}^{N} a_j \sin(\omega_j x + \varphi_j),$$
a sum of Fourier basis functions with learned frequencies, amplitudes, and phases. Unlike a standard Fourier series with fixed frequencies $n = 1, 2, 3, \dots$, SIREN can choose the frequencies freely - making it more expressive.
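A minimal forward-pass sketch of such a single-layer SIREN (random, untrained weights; $\omega_0 = 30$ follows the initialization scale suggested in the SIREN paper):

```python
import numpy as np

# One hidden layer of sine units: f(x) = sum_j a_j * sin(omega0 * (w_j * x + b_j))
rng = np.random.default_rng(0)
omega0, n_hidden = 30.0, 256

w = rng.uniform(-1, 1, n_hidden)                 # input weights (1D input)
b = rng.uniform(-1, 1, n_hidden)                 # phases
a = rng.standard_normal(n_hidden) / n_hidden     # linear read-out amplitudes

def siren(x):
    # each hidden unit is a sinusoid with learned frequency omega0 * w_j and phase b_j
    return np.sin(omega0 * (np.outer(x, w) + b)) @ a

print(siren(np.linspace(-1, 1, 5)))              # a superposition of learned-frequency sinusoids
```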
Applications: Implicit neural representations (NeRF, video), solving PDEs (physics-informed neural networks), generative modeling of images and 3D shapes.
G.5 The Spectral Analysis of LLM Attention Patterns
Recent research (2023-2024) has revealed striking spectral structure in LLM attention patterns:
Attention as a filter. In a transformer layer, the attention mechanism computes:
$$y_m = \sum_n A_{mn} v_n, \qquad A_{mn} = \mathrm{softmax}_n\big(q_m^{\top} k_n / \sqrt{d}\big).$$
If $A_{mn}$ depends only on $m - n$ (which is approximately true for well-trained models with relative PE), then this is a convolution with kernel $a(\delta) = A_{m, m-\delta}$. In frequency space:
$$\hat y(\omega) = \hat a(\omega)\, \hat v(\omega).$$
The attention pattern is a frequency filter. Different heads learn different filters:
- Local heads: $a(\delta)$ large only for small $|\delta|$ -> low-pass filter (smooth the sequence)
- Global heads: $a(\delta)$ relatively uniform -> passes essentially only the DC component (compute the mean)
- Periodic heads: $a(\delta)$ large when $\delta$ is a multiple of some period $p$ -> comb filter peaked at $\omega = 2\pi/p$ and its multiples (detect regular patterns); see the numerical sketch after this list
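The filtering view is easy to probe numerically. A sketch with two synthetic relative-position kernels (both invented for illustration):

```python
import numpy as np

# Frequency response |a_hat(omega)| of two attention kernels a(delta).
N = 256
delta = np.arange(N)
a_local = np.exp(-delta / 2.0)                   # mass at small offsets: local head
a_local /= a_local.sum()
a_periodic = (delta % 16 == 0).astype(float)     # mass at multiples of 16: periodic head
a_periodic /= a_periodic.sum()

for name, a in [("local", a_local), ("periodic", a_periodic)]:
    H = np.abs(np.fft.rfft(a))
    print(name, "-> strongest frequency bins:", np.argsort(H)[-3:])
# local: energy concentrated near bin 0 (low-pass);
# periodic: equal peaks at every multiple of bin 16 (comb filter).
```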
Empirical findings:
- In BERT, some heads function as "copy heads" (attend to identical tokens) and "position heads" (attend to positions with specific relative offsets)
- In GPT-2/3, induction heads attend strongly to previous occurrences of the same token
- The frequency-domain view reveals why attention heads naturally factorize into frequency bands
WeightWatcher (Martin & Mahoney, 2021). The singular value spectra of weight matrices in LLMs follow power laws with exponent $\alpha$ related to model quality. Healthy models have $\alpha$ roughly in $[2, 6]$. Overfit or poorly trained models have $\alpha$ outside this range. This is a spectral analysis of the weight matrices - the spectral density tells us about the effective rank and implicit regularization.
G.6 Circular Convolution and LLMs
The circular convolution theorem (proved in Section 20-04) states that in the DFT domain, elementwise multiplication corresponds to circular convolution in the time domain. This has a direct connection to autoregressive language models.
Autoregressive language models as circular filters. A causal language model computes the representation at position $t$ by attending to all previous tokens. When the attention weights depend only on relative offset, this is a convolution $y = h * x$ in the frequency domain:
$$\hat y(\omega) = \hat h(\omega)\, \hat x(\omega),$$
where $h$ is a one-sided (causal) filter. The causal constraint means $h(k) = 0$ for $k < 0$ - only attending to past tokens. In frequency space, causality forces the real and imaginary parts of $\hat h(\omega)$ to be related by the Hilbert transform (the Kramers-Kronig relations).
State space models. Models like S4, H3, and Mamba parameterize the causal convolution kernel directly in frequency space (via a diagonal state space matrix). The output is:
$$\hat y(\omega) = \hat K(\omega)\, \hat u(\omega),$$
where $\hat K$ is the learned frequency response. This is exactly the convolution theorem applied to sequence modeling, and it enables $O(N \log N)$ computation instead of $O(N^2)$ attention.
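A sketch of this frequency-domain computation, with a decaying random kernel standing in for a learned SSM kernel:

```python
import numpy as np

# Causal convolution y[t] = sum_{s<=t} K[s] * u[t-s] in O(N log N) via FFT.
rng = np.random.default_rng(0)
N = 1024
u = rng.standard_normal(N)                        # input sequence
K = rng.standard_normal(N) * 0.9 ** np.arange(N)  # decaying causal kernel (stand-in)

# Zero-pad to 2N so circular convolution equals linear (causal) convolution
y = np.fft.irfft(np.fft.rfft(K, 2 * N) * np.fft.rfft(u, 2 * N), 2 * N)[:N]

# O(N^2) reference implementation
y_naive = np.array([np.dot(K[:t + 1][::-1], u[:t + 1]) for t in range(N)])
print(np.allclose(y, y_naive))                    # True
```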
Appendix H: Connections to Probability and Statistics
H.1 Characteristic Functions as Fourier Transforms of Distributions
The characteristic function of a random variable $X$ is:
$$\varphi_X(t) = \mathbb{E}\big[e^{itX}\big].$$
This is the Fourier transform of the probability measure $\mu_X$! The characteristic function exists for all random variables (unlike the moment generating function) and uniquely determines the distribution.
Fourier inversion of densities. If $X$ has density $p$:
$$\varphi_X(t) = \int_{-\infty}^{\infty} e^{itx}\, p(x)\, dx.$$
The density is recovered by the inverse Fourier transform: $p(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\varphi_X(t)\, dt$.
Central limit theorem via characteristic functions. Let $X_1, X_2, \dots$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. The standardized sum $S_n = \frac{1}{\sigma\sqrt{n}}\sum_{k=1}^{n}(X_k - \mu)$ has characteristic function:
$$\varphi_{S_n}(t) = \left[\varphi\!\left(\tfrac{t}{\sqrt{n}}\right)\right]^n,$$
where $\varphi$ is the characteristic function of the standardized summands. Expanding for small arguments, $\varphi(t/\sqrt{n}) = 1 - \frac{t^2}{2n} + o(1/n)$, and substituting:
$$\varphi_{S_n}(t) \to e^{-t^2/2}.$$
The Gaussian $e^{-t^2/2}$ is the Fourier transform of the standard normal density. The CLT is a convergence in Fourier space to the Gaussian fixed point.
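A numerical illustration with a deliberately skewed distribution (exponential; sample sizes are arbitrary):

```python
import numpy as np

# Empirical characteristic function of a standardized sum -> exp(-t^2/2).
rng = np.random.default_rng(0)
n, reps = 50, 20_000
X = rng.exponential(1.0, (reps, n))        # mean 1, variance 1, heavily skewed
S = (X.sum(axis=1) - n) / np.sqrt(n)       # standardized sums

t = np.linspace(-3, 3, 7)
phi_emp = np.exp(1j * np.outer(t, S)).mean(axis=1)
print(np.max(np.abs(phi_emp - np.exp(-t**2 / 2))))   # ~0.02; shrinks as n grows
```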
For AI - Training stability. The distribution of gradient updates in neural network training follows a CLT when the batch size is large: the stochastic gradient is a sum of $B$ per-sample gradients, so its distribution is approximately Gaussian for large batch size $B$. This is why large-batch training behaves differently from small-batch training - the gradient noise becomes more Gaussian (less heavy-tailed) as $B$ grows.
H.2 Autocorrelation and Power Spectral Density
For a stationary stochastic process $X(t)$ with mean zero, the autocorrelation function is:
$$R(\tau) = \mathbb{E}\big[X(t)\, X(t+\tau)\big].$$
The power spectral density (PSD) is the Fourier transform of $R$:
$$S(\omega) = \int_{-\infty}^{\infty} R(\tau)\, e^{-i\omega\tau}\, d\tau.$$
This is the Wiener-Khinchin theorem: the PSD and autocorrelation are a Fourier pair.
Interpretation of PSD:
- $S(\omega) \ge 0$ for all $\omega$ (non-negative by Bochner's theorem)
- $\frac{1}{2\pi}\int S(\omega)\, d\omega = R(0) = \mathbb{E}[X^2]$ (total power)
- $S(\omega)$ tells how much of the signal's power is concentrated near frequency $\omega$
White noise: $S(\omega) = \sigma^2$ (constant) -> $R(\tau) = \sigma^2\delta(\tau)$ (uncorrelated across all lags). White noise has equal power at all frequencies - it is the "maximally random" process.
1/f noise (pink noise): $S(\omega) \propto 1/|\omega|$ -> found in LLM token frequency distributions (Zipf's law), financial returns, EEG signals. The $1/f$ spectrum is exactly at the boundary between stationary ($S$ integrable) and non-stationary processes.
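The theorem can be verified in a few lines with the DFT (white noise as the test signal):

```python
import numpy as np

# Wiener-Khinchin, discretely: periodogram == DFT of the circular autocorrelation.
rng = np.random.default_rng(0)
N = 4096
x = rng.standard_normal(N)

psd = np.abs(np.fft.fft(x)) ** 2 / N                  # periodogram estimate
R = np.fft.ifft(np.abs(np.fft.fft(x)) ** 2).real / N  # circular autocorrelation
psd_from_R = np.fft.fft(R).real                       # Fourier transform of R

print(np.allclose(psd, psd_from_R))                   # True: a Fourier pair
```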
For AI: The power spectral density of the hidden states in LLMs has been studied empirically. Models trained on natural language show 1/f-like spectra in their activations, consistent with the fractal/scale-free structure of language. This suggests that efficient LLM architectures should process information at multiple timescales simultaneously - exactly what multi-head attention (with different positional frequency ranges per head) achieves.
H.3 Fourier Analysis and the Sampling Theorem
The Shannon-Nyquist Sampling Theorem connects continuous Fourier analysis to discrete signal representation:
Theorem (Shannon, 1949; Nyquist, 1928): A bandlimited signal $f$ with $\hat f(\omega) = 0$ for $|\omega| > \Omega$ (bandwidth $\Omega$) can be perfectly reconstructed from its samples $f(n\pi/\Omega)$:
$$f(t) = \sum_{n=-\infty}^{\infty} f\!\left(\frac{n\pi}{\Omega}\right)\operatorname{sinc}\!\left(\frac{\Omega t}{\pi} - n\right), \qquad \operatorname{sinc}(x) = \frac{\sin \pi x}{\pi x}.$$
Proof sketch. Define the sampled signal $f_s(t) = \sum_n f(n\pi/\Omega)\,\delta(t - n\pi/\Omega)$. Its Fourier transform is the periodized spectrum: $\hat f_s(\omega) = \frac{\Omega}{\pi}\sum_k \hat f(\omega - 2k\Omega)$. Since $f$ is bandlimited to $[-\Omega, \Omega]$, the copies don't overlap (no aliasing). Applying an ideal low-pass filter with cutoff $\Omega$ recovers $\hat f$, and inverse FT gives $f$.
Aliasing: If $f$ has frequency content above $\Omega$ but is sampled at spacing $\pi/\Omega$, the high-frequency copies in $\hat f_s$ overlap, corrupting the spectrum. This is aliasing - high-frequency components masquerade as low-frequency ones.
For AI - Tokenization: Tokenization is analogous to sampling. A byte-pair-encoding tokenizer samples text at a rate of roughly one token per word. If the "semantic content" of text has structure at finer granularity than one word, the tokenizer introduces aliasing: some semantic distinctions are lost. This is why character-level models can capture morphological patterns that word-level models miss.
Anti-aliasing in CNNs: The standard strided convolution in CNNs (stride-2 for downsampling) can alias: it samples the feature map at half the rate without first applying a low-pass filter. Zhang (2019) showed that adding a blur filter before strided convolutions dramatically improves shift-equivariance. This is the CNN version of anti-aliasing.
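A 1D sketch of blur-then-subsample, with a near-Nyquist test tone (the binomial kernel [1, 2, 1]/4 is one standard choice):

```python
import numpy as np

def downsample_naive(x):
    return x[::2]                                # stride-2: aliases high frequencies

def downsample_blurpool(x):
    blur = np.array([1.0, 2.0, 1.0]) / 4.0       # binomial low-pass kernel
    return np.convolve(x, blur, mode="same")[::2]

t = np.arange(64)
x = np.sin(2 * np.pi * 0.45 * t)                 # 0.45 cycles/sample, near Nyquist (0.5)
# Naive stride-2 folds 0.45 into a strong false 0.1 cycles/sample component;
# the blur attenuates it first, so the alias is ~40x weaker.
print(np.std(downsample_naive(x)), np.std(downsample_blurpool(x)))
```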
Appendix I: Implementation Details
I.1 Numerical Computation of Fourier Coefficients
Direct formula. For a function given by $N$ samples $f(x_j)$ at equally spaced points $x_j = -\pi + 2\pi j/N$:
$$c_n \approx \frac{1}{N}\sum_{j=0}^{N-1} f(x_j)\, e^{-i n x_j}.$$
This approximates the integral $c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\, e^{-inx}\, dx$ using the rectangle rule.
Accuracy. The approximation error is exponentially small for analytic functions and $O(N^{-k})$ for $C^k$ functions. For discontinuous functions, the error decays only like $O(1/N)$, no matter how fine the sampling.
Using NumPy:
import numpy as np

N = 256
f = np.sign  # example: the square wave sign(x) on [-pi, pi)

# N equispaced samples of f on [-pi, pi)
x = np.linspace(-np.pi, np.pi, N, endpoint=False)
f_samples = f(x)
# Fourier coefficients (approximate); the (-1)^n factor accounts for the grid
# starting at -pi rather than 0, since e^{-i n x_j} = (-1)^n e^{-2 pi i n j / N}
c = np.fft.fft(f_samples) / N * (-1.0) ** np.arange(N)
# c[n] corresponds to the n-th coefficient for n = 0, 1, ..., N-1
# Negative frequencies: c[N-n] corresponds to c_{-n}
Aliasing in discrete computation. The DFT computes only the coefficients $|n| \le N/2$. For a bandlimited function with $c_n = 0$ for $|n| > N/2$, this is exact. Otherwise, the $n$-th computed coefficient is actually $\sum_k c_{n+kN}$ (the sum of all aliased copies).
I.2 Plotting Spectra Correctly
Common mistakes when plotting Fourier spectra:
- Frequency axis: NumPy's FFT output has frequencies $k/N$ in units of cycles/sample. Use `np.fft.fftfreq(N)` to get the correct axis.
- Centering: Use `np.fft.fftshift` to rearrange the output so that zero frequency is in the center.
- One-sided vs two-sided: For real signals, plot only $n \ge 0$ (the two-sided spectrum is symmetric). Use `np.fft.rfft` for real signals - it returns only the positive-frequency half.
- dB scale: Power spectra are often plotted in dB: $10\log_{10}|c_n|^2$ (equivalently $20\log_{10}|c_n|$). This makes it easier to see components spanning many orders of magnitude.
- Parseval check: Always verify $\frac{1}{N}\sum_j |f_j|^2 = \sum_n |c_n|^2$. If these don't match, there's a normalization error.
I.3 RoPE Implementation Reference
import numpy as np

def rope_embed(x, positions, base=10000):
"""
Apply RoPE to input tensor x.
x: shape (seq_len, d_model) - query or key vectors
positions: integer positions (seq_len,)
Returns: rotated x, same shape
"""
d = x.shape[-1]
# Frequencies: theta_i = base^{-2i/d} for i = 0, ..., d/2 - 1
i = np.arange(0, d, 2)
freqs = base ** (-i / d) # shape (d/2,)
# Angles: m * theta_i
angles = positions[:, None] * freqs[None, :] # (seq_len, d/2)
# Build rotation: cos(angle) and sin(angle)
cos = np.cos(angles) # (seq_len, d/2)
sin = np.sin(angles) # (seq_len, d/2)
# Rotate pairs (x_{2i}, x_{2i+1})
x_even = x[:, 0::2] # shape (seq_len, d/2)
x_odd = x[:, 1::2]
x_rot_even = x_even * cos - x_odd * sin
x_rot_odd = x_even * sin + x_odd * cos
# Interleave back
x_rot = np.empty_like(x)
x_rot[:, 0::2] = x_rot_even
x_rot[:, 1::2] = x_rot_odd
return x_rot
Verification: For any query $q$ and key $k$ (shape `(d_model,)`), the attention score at positions $(m, n)$ depends only on the relative offset $n - m$:
score(m, n) = rope_embed(q[None, :], np.array([m]))[0] @ rope_embed(k[None, :], np.array([n]))[0]
score(m, n) == score(m + delta, n + delta)  # for any integer delta
This confirms that the attention score depends only on relative position.
Appendix J: Fourier Series in Diverse Mathematical Contexts
J.1 Fourier Series and Number Theory
The connection between Fourier analysis and number theory is profound and historically important.
Dirichlet characters. For a modulus $q$, a Dirichlet character $\chi$ is a completely multiplicative arithmetic function with period $q$. These are precisely the characters of the group $(\mathbb{Z}/q\mathbb{Z})^{\times}$. Dirichlet's proof that there are infinitely many primes in any arithmetic progression $a, a+q, a+2q, \dots$ with $\gcd(a, q) = 1$ uses Fourier analysis on the group $(\mathbb{Z}/q\mathbb{Z})^{\times}$ - a finite version of Fourier series.
The Poisson summation formula. For a "nice" function $f$ (e.g., Schwartz class), with $\hat f(k) = \int_{-\infty}^{\infty} f(x)\, e^{-2\pi i kx}\, dx$:
$$\sum_{n\in\mathbb{Z}} f(n) = \sum_{k\in\mathbb{Z}} \hat f(k).$$
This connects sums over integers to sums over their Fourier transforms. Applications include:
- Jacobi theta functions: $\theta(t) = \sum_{n\in\mathbb{Z}} e^{-\pi n^2 t}$ satisfies the functional equation $\theta(1/t) = \sqrt{t}\,\theta(t)$ by Poisson summation (verified numerically after this list)
- Riemann zeta function: The functional equation $\xi(s) = \xi(1-s)$ is proved via Poisson summation
- Quadratic forms: The number of ways to represent an integer as a sum of squares is expressed via theta functions
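Poisson summation can be checked numerically for the Gaussian, which simultaneously demonstrates the theta functional equation (the value of $t$ is an arbitrary test choice):

```python
import numpy as np

# Poisson summation for f(x) = exp(-pi t x^2), f_hat(k) = t^{-1/2} exp(-pi k^2 / t).
t = 0.37
n = np.arange(-50, 51)                        # both sums converge extremely fast
lhs = np.sum(np.exp(-np.pi * t * n ** 2))     # sum of f over the integers
rhs = np.sum(np.exp(-np.pi * n ** 2 / t)) / np.sqrt(t)   # sum of f_hat
print(lhs, rhs, np.isclose(lhs, rhs))         # equal: theta(t) = theta(1/t)/sqrt(t)
```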
For AI: Poisson summation appears in the theory of neural network generalization. The neural tangent kernel at infinite width is related to the Fourier transform of the network architecture, and bounds on generalization involve the Poisson summation formula applied to the empirical Fourier spectrum of the training data.
J.2 Fourier Series on Non-Euclidean Spaces
Fourier analysis on groups. The Fourier series on $[-\pi, \pi]$ is really harmonic analysis on the circle group $\mathbb{T} = \mathbb{R}/2\pi\mathbb{Z}$. The generalization to a compact group $G$ gives:
$$f(g) = \sum_{\rho \in \hat G} d_\rho \operatorname{tr}\!\big(\hat f(\rho)\,\rho(g)\big),$$
where $\hat G$ is the set of irreducible unitary representations of $G$ and $d_\rho$ is the dimension of representation $\rho$. For $G = \mathbb{T}$: all irreducible representations are 1-dimensional ($\rho_n(\theta) = e^{in\theta}$), and we recover the ordinary Fourier series.
For AI - Equivariant networks: Group-equivariant neural networks (Cohen & Welling, 2016) build in symmetries (rotations, reflections) by working with group-Fourier analysis. A convolution network on $\mathbb{R}^2$ is equivariant to translations; lifting to $SE(2)$ (rigid motions) gives equivariance to rotation and translation simultaneously. The mathematical foundation is the group Fourier transform - Fourier series on a non-abelian group.
Fourier analysis on graphs. The graph Laplacian $L = D - A$ has eigenvectors $u_1, \dots, u_n$ with eigenvalues $0 = \lambda_1 \le \dots \le \lambda_n$. The graph Fourier transform is $\hat f = U^{\top} f$. Graph convolution (used in GCN, ChebNet - covered in Section 11-04) is multiplication in this graph Fourier domain:
$$g * f = U\,\hat g(\Lambda)\,U^{\top} f.$$
This is exactly the convolution theorem applied to graphs, replacing the standard Fourier basis with the Laplacian eigenvectors $u_k$.
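A minimal spectral-filtering sketch on a cycle graph (where the Laplacian eigenvectors are exactly the DFT basis, tying back to ordinary Fourier series):

```python
import numpy as np

# Graph Fourier transform and a heat-kernel low-pass filter on a 16-node cycle.
n = 16
A = np.zeros((n, n))
idx = np.arange(n)
A[idx, (idx + 1) % n] = A[(idx + 1) % n, idx] = 1    # cycle adjacency
L = np.diag(A.sum(axis=1)) - A                       # graph Laplacian

lam, U = np.linalg.eigh(L)                           # graph frequencies / basis

f = np.random.default_rng(0).standard_normal(n)      # signal on the nodes
f_hat = U.T @ f                                      # graph Fourier transform
g_hat = np.exp(-2.0 * lam)                           # low-pass spectral filter
f_smooth = U @ (g_hat * f_hat)                       # filter = multiply in the spectrum
print(np.round(f_smooth, 3))
```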
J.3 Fourier Series and Quantum Mechanics
The mathematical structure of quantum mechanics is built on Fourier analysis.
Position and momentum representations. A quantum state $|\psi\rangle$ can be represented in the position basis: $\psi(x) = \langle x|\psi\rangle$ (the wavefunction), or in the momentum basis: $\tilde\psi(p) = \langle p|\psi\rangle$. The two representations are related by the Fourier transform:
$$\tilde\psi(p) = \frac{1}{\sqrt{2\pi\hbar}}\int_{-\infty}^{\infty} \psi(x)\, e^{-ipx/\hbar}\, dx.$$
The Heisenberg uncertainty principle $\Delta x\,\Delta p \ge \hbar/2$ is exactly the Fourier uncertainty principle in disguise, with $\psi$ and $\tilde\psi$ a Fourier pair.
For AI: The mathematical structure of quantum mechanics (Hilbert space, operators, eigenvalues) and the mathematical structure of attention (inner products, softmax, value aggregation) are strikingly parallel. Several researchers have proposed "quantum attention" mechanisms, and the Fourier analysis connecting position and momentum representations underlies proposals for "quantum-inspired" sequence modeling.
J.4 The Ergodic Theorem and Fourier Analysis
Birkhoff's ergodic theorem. For an ergodic measure-preserving transformation $T$ of a probability space $(X, \mu)$ and $f \in L^1(\mu)$:
$$\lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1} f(T^n x) = \int_X f\, d\mu \qquad \text{for $\mu$-a.e. } x.$$
The time average equals the space average.
Connection to Fourier series. For the rotation $Tx = x + \alpha$ on the circle (with $\alpha/2\pi$ irrational), Birkhoff's theorem gives:
$$\frac{1}{N}\sum_{n=0}^{N-1} f(x + n\alpha) \to \frac{1}{2\pi}\int_0^{2\pi} f.$$
In Fourier space: $\frac{1}{N}\sum_{n=0}^{N-1} e^{ik(x+n\alpha)} = e^{ikx}\cdot\frac{1}{N}\cdot\frac{e^{ikN\alpha}-1}{e^{ik\alpha}-1}$. For irrational $\alpha/2\pi$ and $k \ne 0$: this tends to $0$ (equidistribution) -> only the $k = 0$ term survives -> the time average equals $c_0$, the mean of $f$.
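Equidistribution is easy to see numerically. A sketch with the golden-ratio rotation and one test observable:

```python
import numpy as np

# Birkhoff averages for the irrational rotation x -> x + alpha on the circle.
alpha = 2 * np.pi * (np.sqrt(5) - 1) / 2       # golden-ratio rotation (irrational)
f = lambda x: np.cos(3 * x) ** 2               # observable with space average 1/2
x0 = 0.7

for N in [10, 1_000, 100_000]:
    orbit = x0 + alpha * np.arange(N)
    print(N, f(orbit).mean())                  # -> 0.5, the space average
```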
For AI: The ergodic theorem underpins the mixing of training data. In language model training, a sequence of tokens is generated by a stationary stochastic process (natural language). Ergodicity says that sampling from a long enough sequence gives a representative sample of the distribution - justifying the use of random windowed crops for training, rather than requiring the full corpus at each step.
Appendix K: Connections Between Sections
K.1 Why This Section is the Foundation
All subsequent sections in this chapter build on the foundations laid here:
-> Section 20-02 (Fourier Transform): Take the period $T \to \infty$ in the Fourier series. The discrete sum $\sum_n c_n e^{2\pi i n t/T}$ becomes the integral $\frac{1}{2\pi}\int \hat f(\omega)\, e^{i\omega t}\, d\omega$. Every property of the Fourier series has an analogue in the Fourier transform - the derivations are nearly identical. The key new feature is the continuous spectrum and the $L^2(\mathbb{R})$ setting.
-> Section 20-03 (DFT and FFT): Discretize the Fourier transform: sample at $N$ points and compute $N$ frequency coefficients. The DFT is the Fourier series applied to the samples viewed as a periodic sequence of period $N$. The FFT is the efficient $O(N\log N)$ algorithm for computing it.
-> Section 20-04 (Convolution Theorem): The property $\widehat{f * g} = \hat f \cdot \hat g$ (convolution in time = multiplication in frequency) is the single most important theorem in signal processing. It follows directly from the orthogonality of the Fourier basis and is proved here in the series setting; the full treatment with applications to CNNs is in Section 20-04.
-> Section 20-05 (Wavelets): Wavelets are a replacement for Fourier series that provides both time and frequency localization. The Haar wavelet is a piecewise constant basis - like the Fourier basis, but localized in time. The mathematical framework of multiresolution analysis is the generalization of the Fourier series framework (nested subspaces, orthonormal bases) to time-localized expansions.
K.2 What to Read Next
After this section, you have two natural paths:
Path A - Depth first: Continue with Section 20-02 (Fourier Transform) to see how the theory extends to aperiodic signals and the full real line. This path eventually leads to the uncertainty principle and the deeper function space theory.
Path B - Applications first: Jump to Section 20-04 (Convolution Theorem) to see immediately how Fourier series powers CNNs and sequence models. The general convolution theorem is stated for the Fourier transform (Section 20-02), so dip into that section as needed.
For the AI-focused reader: The most direct path from this section to ML applications is:
- Section 20-01 (this section) - Fourier series fundamentals
- Section 20-03 (DFT/FFT) - computational implementation
- Section 20-04 (Convolution Theorem) - CNNs and sequence models
- Section 20-02 (Fourier Transform) - theory for spectral methods
- Section 20-05 (Wavelets) - multi-scale models and compression
Appendix L: The Fourier Series in Numerical Practice
L.1 Accuracy Benchmarks: Theoretical vs. Observed
When implementing Fourier series approximations numerically, understanding the gap between theoretical and observed accuracy is essential.
Experiment: Square wave approximation accuracy.
For the square wave with $N$-term Fourier series $S_N$:
- Theoretical error: $\|f - S_N\|_{L^2} = O(N^{-1/2})$
- Observed: Plot $\|f - S_N\|_2$ vs $N$ on a log-log scale. Slope should be $-1/2$, confirming $O(N^{-1/2})$ decay.
- At jump: $\max_x |f(x) - S_N(x)| \approx 9\%$ of the jump height regardless of $N$ (Gibbs)
Experiment: Triangle wave approximation accuracy.
For the triangle wave with $c_n = O(1/n^2)$ (odd $n$):
- Theoretical error: $\|f - S_N\|_{L^2} = O(N^{-3/2})$
- Observed: slope $-3/2$ on log-log plot, confirming $O(N^{-3/2})$ decay
Experiment: Smooth function accuracy.
For an analytic $2\pi$-periodic function such as $f(x) = e^{\cos x}$:
- Theoretical: $|c_n| \le C e^{-\alpha |n|}$ -> $\|f - S_N\|_{L^2} = O(e^{-\alpha N})$
- Observed: semi-log plot of $\|f - S_N\|_2$ vs $N$ is linear (exponential decay)
- Essentially machine-precision accuracy for $N \gtrsim 20$ (code sketch below)
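A sketch reproducing the square-wave and smooth-function experiments via FFT truncation (grid size and truncation orders are arbitrary choices):

```python
import numpy as np

# L2 error of the N-term Fourier partial sum: square wave vs analytic exp(cos x).
M = 4096
x = np.linspace(-np.pi, np.pi, M, endpoint=False)
k = np.rint(np.fft.fftfreq(M) * M).astype(int)     # integer wavenumbers

def l2_error(f_vals, N):
    C = np.fft.fft(f_vals)                         # M * c_n (up to a phase)
    C[np.abs(k) > N] = 0                           # truncate to |n| <= N
    S_N = np.fft.ifft(C).real                      # partial sum on the grid
    return np.sqrt(np.mean((f_vals - S_N) ** 2))

square, smooth = np.sign(x), np.exp(np.cos(x))
for N in [8, 32, 128]:
    print(N, l2_error(square, N), l2_error(smooth, N))
# Square-wave error shrinks like N^{-1/2}; the smooth error is already
# near machine precision by N ~ 16 (exponential decay).
```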
Takeaway: The rule of thumb is "smooth = exponentially good, nonsmooth = algebraically poor." For discontinuous signals, truncated Fourier series are often a poor choice - JPEG-style block transforms or wavelets are better.
L.2 The Fast Fourier Transform Preview
The DFT (fully treated in Section 20-03) computes exactly the same Fourier coefficients as the Fourier series formula, but for a finite sequence and in $O(N\log N)$ time via the FFT.
The connection: For samples $f_j = f(x_j)$ at equispaced points $x_j = 2\pi j/N$:
$$F_k = \sum_{j=0}^{N-1} f_j\, e^{-2\pi i jk/N}.$$
Comparing with $c_k = \frac{1}{2\pi}\int_0^{2\pi} f(x)\, e^{-ikx}\, dx$ (rectangle rule with $\Delta x = 2\pi/N$):
$$c_k \approx \frac{F_k}{N}.$$
So the DFT is just the Fourier series computation via a rectangle-rule quadrature, scaled by $N$.
The FFT (Cooley-Tukey, 1965) computes all $N$ DFT values in $O(N\log N)$ operations instead of the naive $O(N^2)$. For $N = 10^6$: naive DFT requires $\sim 10^{12}$ operations; the FFT requires $\sim 2\times 10^7$ - a factor of 50,000 speedup.
This speedup is what made digital signal processing, audio compression, and modern AI practically feasible. The STFT and mel-spectrogram preprocessing in Whisper, the frequency-domain convolutions in FNet, and the state-space model computations in S4/Mamba all rely on this speedup. Full treatment: Section 20-03 Discrete Fourier Transform and FFT.
L.3 Fourier Series vs. Polynomial Approximation
Both Fourier series and polynomial approximation (Taylor, Chebyshev, Legendre) approximate functions by basis expansions. Understanding when to use each is important.
| Aspect | Fourier Series | Polynomial (Chebyshev) |
|---|---|---|
| Best for | Periodic functions, global frequency analysis | Non-periodic functions on $[-1, 1]$ |
| Convergence for smooth $f$ | Exponential (analytic), algebraic ($C^k$) | Exponential (analytic), algebraic ($C^k$) |
| At discontinuities | Gibbs phenomenon (9% overshoot) | Gibbs-type oscillations; Runge phenomenon for equispaced nodes |
| Complexity for $N$ terms | $O(N\log N)$ via FFT | $O(N^2)$ naive (or $O(N\log N)$ for Chebyshev via DCT) |
| AI applications | RoPE, spectrograms, convolutions | Polynomial activations, rational networks |
| Sparse representation | Works for bandlimited signals | Works for smooth functions |
Key insight: For periodic and global structure, use Fourier series. For localized and non-periodic structure, use wavelets (Section 20-05) or local polynomial bases. The choice of approximation basis should match the intrinsic structure of the function.
L.4 When Fourier Series Fail: Pathological Examples
Understanding the failure modes of Fourier series deepens intuition.
Example 1 - Du Bois-Reymond's construction (1876): There exists a continuous $2\pi$-periodic function whose Fourier series diverges at $x = 0$. The construction exploits the slow, logarithmic growth of the Dirichlet kernel norms $\|D_N\|_{L^1} \sim \log N$: build $f$ so that the partial sums $S_N f(0)$ grow without bound.
Example 2 - Kolmogorov's function (1923): There exists an $L^1$ function whose Fourier series diverges everywhere. This shows that $L^1$ is too weak for pointwise Fourier convergence.
Example 3 - Weierstrass nowhere-differentiable function:
$$W(x) = \sum_{n=0}^{\infty} a^n \cos(b^n \pi x), \qquad 0 < a < 1, \quad ab \ge 1.$$
This is a Fourier series (cosine series) that converges uniformly (geometric series bounds), but the sum is nowhere differentiable. The coefficients decay geometrically but the frequencies grow geometrically - the function has "fractal" structure with self-similar behavior at all scales.
For AI: Nowhere-differentiable functions are the worst case for neural network optimization. The loss landscape of a neural network restricted to certain parameter subspaces can exhibit Weierstrass-like behavior - highly oscillatory, with no gradient information about the global minimum. This motivates smooth parameterizations (Adam's momentum, gradient clipping) and understanding of the loss landscape geometry.
Appendix M: Quick Reference
M.1 Key Formulas at a Glance
- Synthesis: $f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty}\big(a_n\cos nx + b_n\sin nx\big) = \sum_{n=-\infty}^{\infty} c_n e^{inx}$
- Analysis: $a_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f\cos nx\,dx$, $\; b_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f\sin nx\,dx$, $\; c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f\, e^{-inx}\,dx$
- Parseval: $\frac{1}{2\pi}\int_{-\pi}^{\pi}|f|^2\,dx = \sum_n |c_n|^2$
- Shift: $f(x - a) \leftrightarrow c_n e^{-ina}$; Derivative: $f' \leftrightarrow in\,c_n$
M.2 Decision Tree: Which Tool to Use
Is your signal periodic?
+-- YES -> Fourier series (Section 20-01)
| +-- Need to compute? -> DFT/FFT (Section 20-03)
| +-- Need to convolve? -> Convolution theorem (Section 20-04)
+-- NO -> Fourier transform (Section 20-02)
+-- Bandlimited? -> Sampling theorem
+-- Need time localization? -> Wavelets (Section 20-05)
+-- Aperiodic transient? -> STFT (Section 20-03)
Is your signal discrete?
+-- YES -> DFT/FFT (Section 20-03)
+-- NO -> Fourier transform (Section 20-02)
Do you need time AND frequency localization?
+-- YES -> Wavelets (Section 20-05)
Appendix N: Fourier Series and Modern Transformers - Extended Analysis
N.1 YaRN and NTK-Aware Frequency Scaling
The original RoPE with base $10000$ runs out of positional expressiveness beyond $\sim 2\pi\cdot 10^4 \approx 63{,}000$ tokens. Several methods have been proposed to extend it:
Linear Position Interpolation (PI) - Chen et al., 2023. Scale all positions by $s = L_{\text{new}}/L_{\text{train}}$: position $m$ becomes $m/s$. This compresses the angles to keep $m\theta_i/s$ within the trained range. Equivalently, reduce all frequencies: $\theta_i \mapsto \theta_i/s$.
Issue with linear interpolation: The lowest-frequency components (large $i$, which change slowly) are barely affected in absolute terms, but the highest-frequency components (small $i$, which rotate rapidly) are compressed most aggressively. High-frequency positional information (distinguishing adjacent tokens) degrades.
NTK-RoPE - Bloc97, 2023. Scale the base rather than the positions: $10000 \mapsto 10000 \cdot s^{d/(d-2)}$. This modifies low-frequency components more than high-frequency ones - a non-uniform scaling motivated by the NTK analysis.
YaRN - Peng et al., 2023. Combines three regimes: (i) high frequencies: unchanged (they extrapolate well); (ii) middle frequencies: a blended, NTK-style interpolation; (iii) low frequencies: full linear interpolation. The boundaries between these frequency ranges are tuned per model. YaRN achieves strong performance at 128K context while fine-tuning on only about 0.1% of the original pretraining data.
The Fourier interpretation: All these methods are attempting to find the right frequency allocation - which Fourier basis vectors should be responsible for which positional scales. The ideal is to have each frequency range cover roughly equal logarithmic bandwidth, so no scale is under- or over-represented.
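The scaling rules are one-liners on the frequency vector. A sketch comparing PI with NTK-aware base scaling (the extension ratio $s$ is a test value):

```python
import numpy as np

# How PI and NTK-aware scaling reshape the RoPE frequencies theta_i.
d, base, s = 128, 10000.0, 4.0                 # s = new_context / train_context
i = np.arange(d // 2)
theta = base ** (-2 * i / d)

theta_pi = theta / s                           # Position Interpolation: uniform
theta_ntk = (base * s ** (d / (d - 2))) ** (-2 * i / d)   # NTK-aware: scale the base

print(theta_ntk[0] / theta[0])                 # = 1: highest frequency untouched
print(theta_ntk[-1] / theta[-1])               # ~ 1/s: lowest frequency fully interpolated
```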
N.2 Frequency Analysis of Layer Norms
LayerNorm as a high-pass filter. LayerNorm subtracts the mean and divides by the standard deviation:
$$\mathrm{LN}(x) = \gamma \odot \frac{x - \mu(x)}{\sigma(x)} + \beta.$$
The mean subtraction is a DC removal in signal processing terms: it removes the $n = 0$ (constant) Fourier component. The division by $\sigma(x)$ is a gain normalization.
Implication: After LayerNorm, the token representation has zero mean. This concentrates the representational capacity in the non-DC Fourier components, reinforcing the tendency of transformers to represent relative (not absolute) positional information.
RMSNorm (used in LLaMA) only divides by the RMS (root mean square), without mean subtraction. It does not remove the DC component - preserving more absolute magnitude information. The trade-off: slightly less stable training vs. richer representation.
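The DC-removal claim is directly checkable (vector size and offset are arbitrary):

```python
import numpy as np

# Mean subtraction (LayerNorm) zeroes the DC Fourier coefficient; RMSNorm keeps it.
rng = np.random.default_rng(0)
x = rng.standard_normal(64) + 3.0              # vector with a large DC offset
ln = (x - x.mean()) / x.std()                  # LayerNorm core (no learned gain/bias)
rms = x / np.sqrt((x ** 2).mean())             # RMSNorm: no mean subtraction

print(np.abs(np.fft.fft(ln)[0]))               # ~0: DC component removed
print(np.abs(np.fft.fft(rms)[0]))              # large: DC component preserved
```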
N.3 Fourier Analysis of Embedding Matrices
What does the Fourier transform of a token embedding matrix reveal?
Let $E \in \mathbb{R}^{V\times d}$ be the embedding matrix for vocabulary size $V$ and embedding dimension $d$. For a fixed dimension $j$, the column $E_{:,j} \in \mathbb{R}^V$ gives the $j$-th coordinate of all token embeddings.
If we sort tokens by frequency rank (most common first) and look at $E_{:,j}$ as a function of token rank, we can ask: what is its Fourier transform?
Empirical finding (2024): The embedding vectors of tokens sorted by frequency show 1/f-like spectra in dimension vs. frequency rank - consistent with the Zipf distribution of token frequencies. This suggests that the embedding space has a natural "frequency ordering" that mirrors the frequency distribution of the training corpus.
For interpretability: The Fourier components of the embedding matrix at different "token frequencies" correspond to different semantic clusters. Low-frequency components (affecting the most common tokens) encode universal semantic primitives (noun/verb, positive/negative). High-frequency components encode rare, specific meanings.
N.4 The Fourier Perspective on Attention Sinks
The "attention sink" phenomenon (Xiao et al., 2023): In all transformer models, the first token (often <BOS> or the period .) receives disproportionate attention from most heads, even when semantically irrelevant.
Fourier explanation: The attention sink is the DC component (zero frequency) of the attention pattern. In signal processing, a low-pass filter preserves the DC component. If attention heads are implicitly low-pass filtering the sequence, they must "send" the removed energy somewhere - and the first token becomes the "drain" for this energy.
StreamingLLM (Xiao et al., 2023): By always keeping the first 4 tokens in the KV cache (the "attention sinks"), models can be extended to arbitrarily long sequences without performance degradation. This is a practical application of the Fourier insight: preserve the DC reference point.
N.5 Complete Worked Example: Sinusoidal PE Computation
Let $d_{\text{model}} = 8$ (small for illustration), positions $m = 0, 1, 2, 3, 4$.
Frequencies: $\theta_i = 10000^{-2i/8}$ for $i = 0, 1, 2, 3$:
$$\theta_0 = 1.0, \quad \theta_1 = 0.1, \quad \theta_2 = 0.01, \quad \theta_3 = 0.001.$$
Positional encoding matrix $PE[m, (2i, 2i+1)] = (\sin m\theta_i, \cos m\theta_i)$:
| pos | sin(pos·1.0) | cos(pos·1.0) | sin(pos·0.1) | cos(pos·0.1) | sin(pos·0.01) | cos(pos·0.01) | sin(pos·0.001) | cos(pos·0.001) |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 |
| 1 | 0.841 | 0.540 | 0.100 | 0.995 | 0.010 | 1.000 | 0.001 | 1.000 |
| 2 | 0.909 | -0.416 | 0.199 | 0.980 | 0.020 | 1.000 | 0.002 | 1.000 |
| 3 | 0.141 | -0.990 | 0.296 | 0.955 | 0.030 | 1.000 | 0.003 | 1.000 |
| 4 | -0.757 | -0.654 | 0.389 | 0.921 | 0.040 | 1.000 | 0.004 | 1.000 |
Observations:
- Dimension 0-1 ($\theta_0 = 1.0$): Oscillates rapidly - completes a full cycle in $2\pi \approx 6.3$ positions. Distinguishes adjacent tokens.
- Dimension 2-3 ($\theta_1 = 0.1$): Oscillates at 1/10th the rate. Distinguishes tokens up to ~63 apart.
- Dimension 4-5 ($\theta_2 = 0.01$): Very slow oscillation. Distinguishes tokens up to ~628 apart.
- Dimension 6-7 ($\theta_3 = 0.001$): Extremely slow. Distinguishes tokens up to ~6283 apart.
This is the Fourier series multi-resolution strategy: use harmonics at multiple scales to encode position information at all scales simultaneously. Each dimension pair is a Fourier basis function at a different frequency - the entire positional encoding is a multi-frequency Fourier expansion of the position index.
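The table can be regenerated in a few lines:

```python
import numpy as np

# Reproduce the PE table above (d_model = 8, positions 0..4).
d = 8
theta = 10000.0 ** (-2 * np.arange(d // 2) / d)    # [1.0, 0.1, 0.01, 0.001]
m = np.arange(5)[:, None]
pe = np.empty((5, d))
pe[:, 0::2] = np.sin(m * theta)
pe[:, 1::2] = np.cos(m * theta)
print(np.round(pe, 3))
```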
Appendix O: Further Reading and References
O.1 Primary References
Textbooks (Classical):
- Korner, T.W. (1988). Fourier Analysis. Cambridge University Press. - The most readable rigorous treatment; includes Gibbs phenomenon, convergence theory, and applications to number theory.
- Katznelson, Y. (2004). An Introduction to Harmonic Analysis, 3rd ed. Cambridge. - Definitive graduate reference.
- Stein, E. & Weiss, G. (1971). Introduction to Fourier Analysis on Euclidean Spaces. Princeton. - The multidimensional extension.
- Tolstov, G.P. (1962). Fourier Series (Silverman translation). Dover. - Inexpensive, detailed, complete derivations of all standard results.
Textbooks (Applied):
- Oppenheim, A. & Willsky, A. (1997). Signals and Systems, 2nd ed. Prentice Hall. - Standard engineering reference for Fourier methods in signal processing.
- Bracewell, R.N. (2000). The Fourier Transform and Its Applications, 3rd ed. McGraw-Hill. - Intuitive engineering treatment with excellent visualizations.
O.2 Key ML Papers Using Fourier Methods
| Paper | Year | Contribution | Fourier connection |
|---|---|---|---|
| Vaswani et al., "Attention Is All You Need" | 2017 | Transformer; sinusoidal PE | Fourier basis for position |
| Rahimi & Recht, "Random Features for Large-Scale Kernel Machines" | 2007 | Random Fourier features | Bochner's theorem |
| Rahaman et al., "On the Spectral Bias of Neural Networks" | 2019 | Spectral bias phenomenon | NTK Fourier decomposition |
| Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" | 2021 | RoPE | Complex Fourier rotation |
| Lee-Thorp et al., "FNet: Mixing Tokens with Fourier Transforms" | 2021 | FNet | DFT replaces attention |
| Sitzmann et al., "Implicit Neural Representations with Periodic Activations" | 2020 | SIREN | Fourier basis activations |
| Tancik et al., "Fourier Features Let Networks Learn High Frequency Functions" | 2020 | Fourier feature encoding | NeRF; overcomes spectral bias |
| Gu et al., "Efficiently Modeling Long Sequences with Structured State Spaces" | 2021 | S4 model | Frequency domain parameterization |
| Peng et al., "YaRN: Efficient Context Window Extension of LLMs" | 2023 | Extended context | RoPE frequency scaling |
| Xiao et al., "Efficient Streaming Language Models with Attention Sinks" | 2023 | StreamingLLM | DC component preservation |
O.3 Connections to Other Sections in This Curriculum
| Topic | Where to find it | Connection to Fourier Series |
|---|---|---|
| Hilbert spaces, completeness | Section 12-02 Hilbert Spaces | The abstract framework for Fourier series |
| Inner products on function spaces | Section 12-01 Normed Spaces | The inner product $\langle f, g\rangle$ that defines Fourier coefficients |
| Kernel methods, RKHS | Section 12-03 Kernel Methods | Bochner's theorem; RFFs |
| Complex numbers, series | Section 01 Mathematical Foundations | $e^{i\theta}$, Euler's formula |
| Gradient descent, loss landscape | Section 08 Optimization | NTK, spectral bias |
| Spectral graph convolution | Section 11-04 Spectral Graph Theory | Graph Fourier transform |
| Probability characteristic functions | Section 06 Probability Theory | Bochner, CLT via characteristic functions |
Appendix P: Common Error Patterns in Code
When implementing Fourier series and Fourier-based methods in Python, these are the most frequent bugs:
P.1 Normalization Factor Confusion
Bug: Using np.fft.fft(x) without dividing by N.
import numpy as np

N = 1024
x = np.random.randn(N)  # example signal

# WRONG: FFT uses a convention without the 1/N normalization
c_wrong = np.fft.fft(x)       # These are N * c_n, not c_n!
# CORRECT: divide by N to get Fourier coefficients
c_correct = np.fft.fft(x) / N
Why: NumPy's FFT convention is $F_k = \sum_j x_j\, e^{-2\pi i jk/N}$ (no $1/N$ factor). The mathematical Fourier series coefficient is $c_n = F_n/N$.
P.2 Frequency Axis Misidentification
# WRONG: assuming output index k corresponds to frequency k
freqs = np.arange(N) # Wrong!
# CORRECT: use fftfreq
freqs = np.fft.fftfreq(N, d=1.0/sample_rate) # In Hz
# or for [-pi, pi] interval:
freqs = np.fft.fftfreq(N) * N # integer wavenumbers -N/2 to N/2
Why: NumPy FFT output index $k$ corresponds to frequency $k/N$ cycles/sample, not $k$ Hz.
P.3 Forgetting to Handle Real FFT Symmetry
For real inputs, $c_{-n} = \overline{c_n}$ (Hermitian symmetry). The positive-frequency half is sufficient:
# For real input, use rfft (returns only positive frequencies):
c = np.fft.rfft(x) / N # Length N//2 + 1
freqs = np.fft.rfftfreq(N) # Corresponding frequencies
# To recover the full spectrum, remember to double all |c_n|^2 for n > 0:
power_onesided = np.abs(c)**2
power_onesided[1:-1] *= 2 # Double all except DC and Nyquist
P.4 Aliasing from Insufficient Sampling
# DANGER: if the signal has frequency content above Nyquist
t = np.linspace(0, 1, 100, endpoint=False)  # 100 samples per second
x = np.sin(2 * np.pi * 80 * t)              # 80 Hz signal
# Nyquist frequency = 50 Hz; 80 Hz is ABOVE Nyquist.
# The DFT shows a false component at 100 - 80 = 20 Hz (the alias).
# ALWAYS check: max frequency in signal < sample_rate / 2
P.5 Misinterpreting Parseval in Discrete vs. Continuous Setting
Continuous Parseval: $\frac{1}{2\pi}\int_{-\pi}^{\pi} |f(x)|^2\, dx = \sum_{n=-\infty}^{\infty} |c_n|^2$.
Discrete Parseval (DFT): $\frac{1}{N}\sum_{k=0}^{N-1} |x_k|^2 = \sum_{n=0}^{N-1} |c_n|^2$, where $c_n = X_n/N$.
# Check Parseval numerically:
import numpy as np
N = 1024
x = np.random.randn(N)
X = np.fft.fft(x)
lhs = np.mean(np.abs(x)**2)      # (1/N) * sum |x[k]|^2
rhs = np.sum(np.abs(X / N)**2)   # sum |c_n|^2, with c_n = X[n]/N
assert np.isclose(lhs, rhs), f"Parseval failed: {lhs} != {rhs}"
This completes the Section 20-01 Fourier Series reference notes. Continue with Section 20-02 Fourier Transform -> to see how the period-$T$ Fourier series becomes the continuous Fourier transform as $T \to \infty$.
Appendix Q: Historical Vignettes
Q.1 Fourier's Audacity and the Reception of His Theory
Joseph Fourier (1768-1830) was not primarily a mathematician. He was a physicist and administrator who, as Napoleon's prefect of Isere, oversaw the draining of malarial swamps and the construction of roads. His mathematical masterpiece grew from a practical problem: how does heat conduct through a solid body?
When Fourier presented his theory to the Paris Academy in 1807, the reaction was hostile. The paper was rejected - or rather, shelved - on the grounds that the claim "every function can be represented as a trigonometric series" was both wrong (they thought) and insufficiently rigorous (it was). The judges included Laplace, Lagrange, and Monge - the greatest mathematicians of the age.
Lagrange's objection was mathematical: a sum of smooth functions (sines and cosines) cannot converge to a discontinuous function. This was a perfectly reasonable objection in 1807, before the modern theory of pointwise vs. $L^2$ convergence had been developed. Fourier's answer - essentially, "but it does!" - was more right than Lagrange, but not for the reasons Fourier gave.
Fourier persisted. In 1822, he published his masterpiece Theorie analytique de la chaleur, which remains one of the great works of mathematical physics. His key equations - the heat equation, the Fourier series, and the Fourier integral - were all present, though not rigorously proved.
The rigorous proof came from Dirichlet in 1829. The controversy about pointwise convergence of the Fourier series of a continuous function was not finally resolved until Carleson in 1966 - 159 years after Fourier's original claim. Mathematics sometimes takes time to catch up with correct intuitions.
Q.2 Dirichlet's Proof: The Birth of Rigorous Analysis
Peter Gustav Lejeune Dirichlet (1805-1859) was a student of Fourier's. His 1829 paper giving the first rigorous proof of pointwise convergence is often cited as one of the founding documents of modern analysis - not just because of the result, but because of the method.
Dirichlet's proof introduced two innovations that transformed mathematics:
-
The modern definition of function. Before Dirichlet, a "function" meant something expressible by a formula. Dirichlet explicitly worked with functions that could behave differently on different parts of their domain, introducing what we now recognize as the modern concept. His paper is one of the first places where the contemporary idea of function (as any rule assigning outputs to inputs) appears clearly.
-
Careful convergence arguments. Dirichlet's proof carefully tracked when limits could be exchanged and when they could not - a concern that Cauchy had introduced but that Dirichlet applied systematically to Fourier series. This careful accounting of convergence, rather than treating series as formal objects, is the hallmark of 19th-century rigor.
In this sense, Fourier series didn't just give us a useful computational tool. They provoked the development of the rigorous analysis that underlies all of modern mathematics and machine learning theory.
Q.3 Carleson's Theorem: The 159-Year-Old Problem
For 137 years after Dirichlet's theorem (1829), the following question remained open:
Does the Fourier series of every continuous function converge pointwise everywhere?
- 1873: Du Bois-Reymond showed NO for continuous functions at one point
- 1923: Kolmogorov showed NO for $L^1$ functions everywhere (!)
- But for the natural space $L^2$...?
In 1966, Lennart Carleson proved: YES, the Fourier series of every $L^2$ function converges pointwise almost everywhere. This was one of the most celebrated theorems in 20th-century analysis, and it featured prominently in the citation for Carleson's 2006 Abel Prize.
The proof uses a remarkable decomposition of the Dirichlet kernel into "tiles" in the time-frequency plane, combined with a sophisticated stopping-time argument. It is considered one of the most difficult proofs in all of harmonic analysis - spanning 30+ pages of dense mathematics.
For our purposes, the practical upshot is: every $L^2$ function (which includes every function we encounter in ML) has its Fourier series converging pointwise almost everywhere. The theoretical pathologies of convergence failure are measure-zero exceptions.
Appendix R: Self-Assessment Checklist
Use this checklist to verify understanding before moving to Section 20-02.
R.1 Computational Mastery
- Can you compute the Fourier coefficients of the square wave, sawtooth, and triangle wave from scratch without looking them up?
- Can you use the differentiation theorem ($c_n(f') = in\,c_n(f)$) to derive the triangle wave coefficients from the square wave coefficients?
- Can you apply Parseval's identity to evaluate $\sum_{n=1}^{\infty} 1/n^2 = \pi^2/6$ (via the Fourier series of $f(x) = x$)?
- Can you identify the decay rate of Fourier coefficients ($1/n$, $1/n^2$, exponential) from a description of the function?
- Can you write the partial sum $S_N f$ as a convolution with the Dirichlet kernel $D_N$?
R.2 Theoretical Understanding
- Can you state Dirichlet's theorem and identify what happens at a jump discontinuity?
- Can you explain the Gibbs phenomenon and why it persists at any truncation order?
- Can you state Parseval's identity and explain its physical meaning (energy conservation)?
- Can you explain why $L^2$ convergence is the "correct" convergence for Fourier series?
- Can you explain why the Fejer kernel (Cesaro means) fixes the Gibbs phenomenon while the Dirichlet kernel doesn't?
- Can you explain the Riemann-Lebesgue lemma and its consequence: smooth signals have rapidly decaying Fourier coefficients?
R.3 AI Connections
- Can you derive the sinusoidal PE formula from the Fourier basis perspective?
- Can you explain RoPE in terms of complex Fourier multiplication and the shift theorem?
- Can you explain the spectral bias phenomenon and its consequences for neural network training?
- Can you explain why Random Fourier Features approximate kernel methods and how Bochner's theorem is involved?
- Can you explain the attention sink phenomenon using Fourier analysis (DC component)?
R.4 Connections to Other Sections
- Do you understand that the Fourier Transform (Section 20-02) is the $T \to \infty$ limit of the Fourier series?
- Do you understand that the DFT (Section 20-03) is the Fourier series for a finite discrete sequence?
- Do you understand that the convolution theorem (Section 20-04) is a consequence of Fourier series orthogonality ($\langle e^{imx}, e^{inx}\rangle = 2\pi\,\delta_{mn}$)?
- Do you understand that wavelets (Section 20-05) overcome the time-localization failure of Fourier series?
If you cannot answer YES to all items in R.1 and R.2, and at least 4 items in R.3 and R.4, review the corresponding sections before proceeding.
End of Section 20-01 Fourier Series
<- Back to Fourier Analysis | Next: Fourier Transform ->
Appendix S: Summary of Key Results
S.1 The Core Theorems
Theorem 1 (Dirichlet, 1829). If $f$ is $2\pi$-periodic and piecewise smooth, then for every $x$:
$$S_N f(x) \to \frac{f(x^+) + f(x^-)}{2}.$$
Theorem 2 ($L^2$ convergence). For every $f \in L^2[-\pi, \pi]$:
$$\|S_N f - f\|_{L^2} \to 0.$$
Theorem 3 (Parseval-Bessel). For $f \in L^2[-\pi, \pi]$:
$$\frac{1}{2\pi}\int_{-\pi}^{\pi} |f(x)|^2\, dx = \sum_{n=-\infty}^{\infty} |c_n|^2.$$
Theorem 4 (Fejer, 1900). For every $f \in C(\mathbb{T})$ (continuous and $2\pi$-periodic):
$$\sigma_N f = \frac{1}{N+1}\sum_{k=0}^{N} S_k f \to f \quad \text{uniformly}.$$
Theorem 5 (Riemann-Lebesgue). For $f \in L^1[-\pi, \pi]$:
$$c_n \to 0 \quad \text{as } |n| \to \infty.$$
Theorem 6 (Carleson, 1966). For every $f \in L^2[-\pi, \pi]$:
$$S_N f(x) \to f(x) \quad \text{for almost every } x.$$
S.2 The Most Important Formula
Among all the formulas in this section, one stands out as the most consequential for both mathematics and AI - the analysis/synthesis pair:
$$c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\, e^{-inx}\, dx \qquad\longleftrightarrow\qquad f(x) = \sum_{n=-\infty}^{\infty} c_n\, e^{inx}.$$
Every other result in this section - convergence theorems, Parseval, the shift property, RoPE, spectral bias - is a consequence or application of this pair.