"The series may seem paradoxical - an infinite number of terms summing to a finite value. But this is exactly the miracle that makes calculus possible."
- Augustin-Louis Cauchy, on the foundations of analysis
Overview
Sequences and series are the mathematics of infinity made tractable. A sequence is an ordered list without end; a series is the sum of such a list. The central question - does an infinite process converge to something finite? - is one of the deepest and most practically consequential questions in mathematics.
For machine learning, this is not abstract. Every gradient descent trajectory is a sequence of parameter vectors. Every optimizer update rule is derived via Taylor series truncation. Every attention mechanism approximation exploits the Taylor expansion of the exponential function. Every positional encoding scheme (RoPE, ALiBi, sinusoidal) draws on Fourier series theory. The Laplace approximation - used in Bayesian deep learning - is a quadratic Taylor expansion of the log-posterior. Understanding why these approximations work, when they fail, and how accurate they are requires the theory in this section.
This section builds from sequences through convergence tests through power series to Taylor series, culminating in concrete ML applications with rigorous error analysis.
Prerequisites
- Limits and continuity - 04/01: epsilon-delta definition, limit laws, continuity
- Derivatives - 04/02: differentiation rules, higher-order derivatives
- Integration - 04/03: definite integrals, improper integrals, p-integral test
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Interactive: convergence visualization, partial sums, Taylor approximation error, ML applications |
| exercises.ipynb | 10 graded exercises from geometric series to linear attention approximations |
Learning Objectives
After completing this section, you will be able to:
- Define a sequence formally and determine convergence using the epsilon-N definition
- State and prove the Monotone Convergence Theorem for sequences
- Determine convergence of infinite series using all standard tests
- Distinguish absolute from conditional convergence and explain the rearrangement theorem
- Compute the radius and interval of convergence of a power series
- Differentiate and integrate power series term-by-term within the radius of convergence
- Derive Maclaurin series for $e^x$, $\sin x$, $\cos x$, $\ln(1+x)$, and $\frac{1}{1-x}$
- Bound approximation error using the Lagrange remainder
- Apply series theory to ML: softmax temperature, linear attention, GELU, Adam, Laplace
- Recognise Fourier series as orthogonal function expansion and connect to positional encodings
Table of Contents
- 1. Intuition
- 2. Sequences
- 3. Infinite Series
- 4. Convergence Tests
- 5. Power Series
- 6. Taylor and Maclaurin Series
- 7. Key Series in Machine Learning
- 8. Fourier Series (Preview)
- 9. Common Mistakes
- 10. Exercises
- 11. Why This Matters for AI (2026 Perspective)
- Conceptual Bridge
- Appendix A: Extended Convergence Proofs
- Appendix B: Uniform Convergence
- Appendix C: Complex Power Series
- Appendix D: Generating Functions
- Appendix E: Multivariable Taylor (Preview)
- Appendix F: Fourier Series Extended
- Appendix G: ML Derivations in Full
- Appendix H: Numerical Summation Methods
- Appendix I: Glossary and Reference Tables
1. Intuition
1.1 What Sequences and Series Represent
A sequence is a function from the natural numbers to the real numbers: an infinite, ordered list $a_1, a_2, a_3, \dots$ The key question is always: where does this list go? Does it home in on a specific value, grow without bound, or oscillate forever?
A series adds up all the terms: $\sum_{n=1}^{\infty} a_n = a_1 + a_2 + a_3 + \cdots$. This is more subtle. An infinite sum might converge to a finite value - or it might grow forever. The sum $\sum_{n=1}^{\infty} (1/2)^n$ converges (to $1$). The sum $\sum_{n=1}^{\infty} 1/n$ - the harmonic series - diverges to infinity, even though each term goes to zero. This counterintuitive behavior is why convergence tests exist.
```text
SEQUENCES vs. SERIES - VISUAL INTUITION

SEQUENCE {a_n}: where do the terms go?
  a_n = 1/n:     1, 1/2, 1/3, 1/4, ... -> 0        (converges)
  a_n = n:       1, 2, 3, 4, ...       -> infinity (diverges)
  a_n = (-1)^n:  -1, 1, -1, 1, ...     -> ?        (oscillates, diverges)

SERIES Sigma a_n: where does the cumulative sum go?
  Sigma (1/2)^n: 1/2 + 1/4 + 1/8 + ... -> 1        (converges, geometric)
  Sigma 1/n:     1 + 1/2 + 1/3 + ...   -> infinity (diverges! harmonic)
  Sigma 1/n^2:   1 + 1/4 + 1/9 + ...   -> pi^2/6   (converges, Basel problem)

Key insight: a_n -> 0 is NECESSARY but NOT SUFFICIENT for Sigma a_n to converge.
```
The fundamental distinction between sequences and series is that sequences ask about the terms themselves; series ask about the cumulative totals. A sequence can converge to zero while its series diverges - the harmonic series is the canonical example.
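The contrast can be checked numerically. A minimal sketch in plain Python (the `partial_sums` helper is illustrative, not from any library): both sequences of terms go to zero, but only one cumulative sum settles down.

```python
# Partial sums: the terms of both series go to 0, but only one series converges.
def partial_sums(terms, n):
    total, sums = 0.0, []
    for k in range(1, n + 1):
        total += terms(k)
        sums.append(total)
    return sums

geometric = partial_sums(lambda k: 0.5 ** k, 50)   # sum_{k>=1} (1/2)^k = 1
harmonic  = partial_sums(lambda k: 1.0 / k, 50)    # grows like ln(n) + 0.5772...

print(f"geometric S_50 = {geometric[-1]:.12f}")    # within 2^-50 of 1
print(f"harmonic  S_50 = {harmonic[-1]:.4f}")      # ~= 4.5 and still climbing
```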
1.2 Historical Context
The history of series is a history of infinity being tamed:
| Era | Mathematician | Achievement |
|---|---|---|
| ~250 BCE | Archimedes | Geometric series sum for quadrature of parabola |
| 1350 | Nicole Oresme | First proof that the harmonic series diverges |
| 1600s | Newton | Binomial series $(1+x)^{\alpha}$; power series for $\sin x$, $\cos x$, $\ln(1+x)$ |
| 1735 | Euler | Basel problem: $\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}$ |
| 1748 | Euler | Introductio: systematic treatment of infinite series |
| 1821 | Cauchy | Rigorous convergence theory; ratio test; uniform convergence concept |
| 1826 | Abel | Abel's theorem on power series at boundary |
| 1837 | Dirichlet | Fourier series convergence conditions; conditional vs. absolute convergence |
| 1854 | Riemann | Rearrangement theorem: conditionally convergent series can be rearranged to any sum |
| 1715 | Taylor | Taylor series already known since Gregory (1668), but named after Brook Taylor's 1715 publication |
The key conceptual revolution was Cauchy's insistence on rigor. Before him, mathematicians manipulated series freely, often getting wrong answers. The definition of convergence as a limit of partial sums made the subject precise and allowed the rich theory we use today.
1.3 Why Series Are Central to AI
Series appear throughout modern machine learning - often hidden in plain sight:
| ML Context | Series Connection |
|---|---|
| Softmax function | $\mathrm{softmax}(z)_i = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}}$ - the denominator is a sum over all classes; as $T \to 0$, only the max survives (argmax); as $T \to \infty$, uniform distribution (Taylor expansion of $e^x$ dominates) |
| Linear attention | Approximates $e^{q \cdot k} \approx 1 + q \cdot k$ (first-order Taylor) to reduce complexity from $O(n^2)$ to $O(n)$ |
| GELU activation | Defined via the error function, approximated by a Taylor-Hermite expansion |
| Adam optimizer | Bias correction uses the geometric series $\sum_k \beta^k$; step size derivation uses Taylor expansion |
| Positional encodings | Sinusoidal (original Transformer), RoPE - both are Fourier series evaluated at discrete positions |
| Laplace approximation | Quadratic Taylor expansion of $\log p(\theta \mid \mathcal{D})$ around the MAP estimate |
| Neural ODEs | Euler method is first-order Taylor; higher-order Runge-Kutta methods use more Taylor terms |
| Residual networks | $x_{l+1} = x_l + f_l(x_l)$ - the stacked skip connections behave like the partial sums of a Taylor series for the exponential of the network function |
2. Sequences
2.1 Formal Definition
Definition: A sequence is a function $a: \mathbb{N} \to \mathbb{R}$. We write the sequence as $(a_n)_{n=1}^{\infty}$ or $\{a_n\}$, where $a_n = a(n)$ is the $n$-th term.
This formal view - sequence as function - is the right way to think about it. Everything we know about functions (domain, range, composition) applies to sequences.
Examples:
- $a_n = \frac{1}{n}$: terms $1, \frac12, \frac13, \frac14, \dots$ - decreasing, converges to $0$
- $a_n = (-1)^n$: terms $-1, 1, -1, 1, \dots$ - oscillates, does not converge
- $a_n = \frac{n}{n+1}$: terms $\frac12, \frac23, \frac34, \dots$ - converges to $1$
- $a_n = \frac{1}{n!}$: terms decrease rapidly to $0$ (Stirling: $n! \sim \sqrt{2\pi n}\,(n/e)^n$)
- $a_n = r^n$ for $|r| < 1$: geometric decay, converges to $0$
Non-examples (not standard sequences):
- $a_n = \frac{1}{n-1}$ for $n \in \mathbb{N}$: undefined at $n = 1$; need to restrict the domain to $n \ge 2$
- $a_n = \sin n$: well-defined for all $n$, but its limit behavior is subtle (it doesn't converge)
2.2 Convergence: epsilon-N Definition
Definition: A sequence $(a_n)$ converges to the limit $L \in \mathbb{R}$ if for every $\varepsilon > 0$ there exists $N \in \mathbb{N}$ such that for all $n \ge N$:
$$|a_n - L| < \varepsilon$$
We write $\lim_{n \to \infty} a_n = L$ or $a_n \to L$. If no such $L$ exists, the sequence diverges.
Reading the definition: For any desired accuracy $\varepsilon$ (no matter how small), there is a threshold $N$ beyond which every term is within $\varepsilon$ of $L$. The threshold may depend on $\varepsilon$ - tighter accuracy requires waiting longer.
Worked example: Prove $\lim_{n \to \infty} \frac{1}{n} = 0$.
Proof: Given $\varepsilon > 0$, choose $N = \lfloor 1/\varepsilon \rfloor + 1$ (the smallest integer $> 1/\varepsilon$). Then for all $n \ge N$:
$$\left|\frac{1}{n} - 0\right| = \frac{1}{n} \le \frac{1}{N} < \varepsilon$$
Therefore $\lim_{n \to \infty} \frac{1}{n} = 0$.
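The proof can be sanity-checked numerically. A small sketch (the `threshold` helper is hypothetical, chosen to mirror the proof's choice of $N$):

```python
import math

# Epsilon-N in action for a_n = 1/n -> 0: for each epsilon, pick the threshold N
# from the proof and verify |a_n - 0| < epsilon on a finite stretch beyond it.
def threshold(eps):
    return math.floor(1 / eps) + 1   # any integer N > 1/eps works

for eps in [0.1, 0.01, 0.001]:
    N = threshold(eps)
    assert all(abs(1 / n) < eps for n in range(N, N + 1000))
    print(f"eps = {eps}: every term from n = {N} onward is within eps of 0")
```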
Uniqueness: If $(a_n)$ converges, its limit is unique. (Proof: if $a_n \to L$ and $a_n \to M$, then for any $\varepsilon > 0$, $|L - M| \le |L - a_n| + |a_n - M| < 2\varepsilon$ for large $n$; since $\varepsilon$ is arbitrary, $L = M$.)
Limit laws for sequences: If $a_n \to L$ and $b_n \to M$, then:
- $a_n + b_n \to L + M$, $a_n b_n \to LM$, and $c\,a_n \to cL$ for any constant $c$
- $a_n / b_n \to L/M$ provided $M \ne 0$ and eventually $b_n \ne 0$
- $f(a_n) \to f(L)$ if $f$ is continuous at $L$ (sequential characterization of continuity)
Squeeze theorem for sequences: If $a_n \le b_n \le c_n$ and $a_n \to L$ and $c_n \to L$, then $b_n \to L$.
2.3 Bounded and Monotone Sequences
Definition: A sequence is:
- Bounded above if there exists $M$ such that $a_n \le M$ for all $n$
- Bounded below if there exists $m$ such that $a_n \ge m$ for all $n$
- Bounded if both: $|a_n| \le M$ for some $M$ and all $n$
- Monotone increasing if $a_{n+1} \ge a_n$ for all $n$ (strictly increasing if $a_{n+1} > a_n$)
- Monotone decreasing if $a_{n+1} \le a_n$ for all $n$
Theorem (Monotone Convergence Theorem - MCT for sequences): Every bounded monotone sequence converges.
More precisely: If $(a_n)$ is monotone increasing and bounded above, then $\lim_{n\to\infty} a_n = \sup_n a_n$.
Proof sketch: Let $L = \sup_n a_n$, which exists by the completeness of $\mathbb{R}$ (every non-empty set bounded above has a supremum). Given $\varepsilon > 0$, by definition of supremum, there exists $a_N$ with $a_N > L - \varepsilon$. Since the sequence is increasing, $a_n \ge a_N > L - \varepsilon$ for all $n \ge N$. Also $a_n \le L$. So $|a_n - L| < \varepsilon$ for all $n \ge N$.
Key consequence: To prove convergence without knowing the limit, it suffices to show the sequence is monotone and bounded. This is widely used in analysis to establish existence of limits.
Example: $a_n = \left(1 + \frac{1}{n}\right)^n$ is increasing and bounded above by $3$ (provable by AM-GM). Therefore it converges. Its limit is defined to be $e$.
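A quick numerical look at this sequence (plain Python, no assumptions beyond the standard library): it creeps up toward $e$ from below, exactly as the MCT predicts.

```python
import math

# The MCT guarantees a_n = (1 + 1/n)^n converges: it is increasing and bounded
# above by 3. Numerically it approaches e from below.
ns = [1, 10, 100, 1000, 10**6]
seq = [(1 + 1 / n) ** n for n in ns]
for n, a in zip(ns, seq):
    print(f"n = {n:>7}: (1 + 1/n)^n = {a:.6f}")
print(f"e           = {math.e:.6f}")

assert all(x < y for x, y in zip(seq, seq[1:]))   # increasing
assert seq[-1] < math.e < 3                       # bounded above
```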
For ML: Gradient descent sequences are not monotone in parameter space, but the loss sequence is often bounded below (losses are non-negative) and may be monotone decreasing. The MCT guarantees loss convergence for many convex problems.
2.4 Subsequences and Bolzano-Weierstrass
Definition: A subsequence of $(a_n)$ is a sequence of the form $(a_{n_k})$ where $n_1 < n_2 < n_3 < \cdots$ is a strictly increasing sequence of indices.
Theorem: If $a_n \to L$, then every subsequence also converges to $L$.
Useful consequence: If a sequence has two subsequences converging to different limits, the original sequence diverges. This proves $(-1)^n$ diverges: the subsequences $a_{2k} = 1$ and $a_{2k+1} = -1$ converge to $1$ and $-1$ respectively.
Theorem (Bolzano-Weierstrass): Every bounded sequence has a convergent subsequence.
Proof idea (bisection): Let $(a_n)$ be bounded in $[A, B]$. Bisect the interval into $[A, \frac{A+B}{2}]$ and $[\frac{A+B}{2}, B]$. At least one half contains infinitely many terms - pick that half and pick any term in it as $a_{n_1}$. Repeat: bisect the chosen half, pick the half with infinitely many terms, pick $a_{n_2}$ from it (with $n_2 > n_1$). The resulting intervals have lengths $(B-A)/2^k$, so the chosen terms form a Cauchy sequence and hence converge.
For ML: Bolzano-Weierstrass underlies the proof that compact sets in always contain convergent subsequences - which is why SGD on compact parameter spaces always has convergent subsequences (though full convergence requires additional conditions).
2.5 Important Sequences in ML
The following sequences appear constantly in machine learning. Understanding their convergence is practically important:
```text
ML SEQUENCES - CONVERGENCE ANALYSIS

Learning rate schedules:
  Step decay:  alpha_t = alpha_0 * gamma^(t/s)          geometric, -> 0
  Cosine:      alpha_t = alpha_0/2 * (1 + cos(pi*t/T))  bounded, periodic
  Warmup:      alpha_t = t*alpha_0/T for t <= T         linear ramp, then decay
  Polynomial:  alpha_t = alpha_0/(1 + beta*t)^p         -> 0 if p > 0

Exponential moving average (EMA / Adam momentum):
  m_t = beta*m_{t-1} + (1-beta)*g_t
      = (1-beta) Sigma_{k=0}^{t-1} beta^k * g_{t-k}     geometric series weights!
  As t -> infinity: weights sum to 1 (geometric series identity)
  Bias-corrected: m_hat_t = m_t / (1 - beta^t)          corrects for initial zero

Convergence of gradient norms in Adam:
  Required for convergence: Sigma alpha_t   = infinity  (enough total movement)
  And:                      Sigma alpha_t^2 < infinity  (steps shrink fast enough)
  This is the Robbins-Monro condition from stochastic approximation.
```
Exponential moving average as geometric series: The Adam first moment unfolds to:
$$m_t = (1-\beta)\sum_{k=0}^{t-1} \beta^k\, g_{t-k}$$
For $\beta \in (0,1)$ and finite $t$: the weights form a geometric series summing to $1 - \beta^t$. The bias correction $\hat{m}_t = \frac{m_t}{1 - \beta^t}$ normalizes this to $1$, giving a proper weighted average of past gradients. As $t \to \infty$, $\beta^t \to 0$ and the correction becomes negligible.
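The unrolled weights can be checked directly; a minimal sketch (the variable names are illustrative, not from any optimizer library):

```python
# EMA / Adam first moment: the unrolled weight on gradient g_{t-k} is
# (1 - beta) * beta^k, a geometric series whose partial sum is 1 - beta^t.
beta, t = 0.9, 20
weights = [(1 - beta) * beta ** k for k in range(t)]
total = sum(weights)
print(f"sum of weights after t = {t} steps: {total:.6f}")   # = 1 - beta^t
print(f"1 - beta^t                        : {1 - beta**t:.6f}")

# Bias correction divides by (1 - beta^t), making the weights a proper average:
corrected = [w / (1 - beta ** t) for w in weights]
print(f"sum of corrected weights          : {sum(corrected):.6f}")
```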
3. Infinite Series
3.1 Partial Sums and Convergence
Definition: Given a sequence $(a_n)$, the $N$-th partial sum is:
$$S_N = \sum_{n=1}^{N} a_n = a_1 + a_2 + \cdots + a_N$$
The infinite series $\sum_{n=1}^{\infty} a_n$ converges to $S$ if the sequence of partial sums converges to $S$:
$$\sum_{n=1}^{\infty} a_n = \lim_{N \to \infty} S_N = S$$
If $(S_N)$ diverges, the series diverges.
Properties (inherited from sequence limit laws):
- Linearity: $\sum (c\,a_n + d\,b_n) = c \sum a_n + d \sum b_n$ (if both converge)
- Tail behavior: $\sum_{n=1}^{\infty} a_n$ converges iff $\sum_{n=N}^{\infty} a_n$ converges for any fixed $N$ (convergence is about the tail, not finitely many initial terms)
- Term rearrangement: For absolutely convergent series, any rearrangement converges to the same sum. For conditionally convergent series, rearrangement can change or destroy convergence (Riemann's theorem).
3.2 Geometric Series
The geometric series is the most important series in mathematics:
Theorem: The geometric series $\sum_{n=0}^{\infty} r^n$ converges if and only if $|r| < 1$, and:
$$\sum_{n=0}^{\infty} r^n = \frac{1}{1-r}$$
Proof: The $N$-th partial sum satisfies $S_N = 1 + r + r^2 + \cdots + r^{N-1}$. Multiply by $r$: $rS_N = r + r^2 + \cdots + r^N$. Subtract: $(1-r)S_N = 1 - r^N$, so $S_N = \frac{1-r^N}{1-r}$ (for $r \ne 1$). As $N \to \infty$: if $|r| < 1$ then $r^N \to 0$, so $S_N \to \frac{1}{1-r}$. If $|r| \ge 1$, $r^N$ does not go to zero, so $(S_N)$ diverges.
Shifted and scaled versions:
$$\sum_{n=1}^{\infty} r^n = \frac{r}{1-r}, \qquad \sum_{n=0}^{\infty} a\,r^n = \frac{a}{1-r}, \qquad \sum_{n=1}^{\infty} n\,r^{n-1} = \frac{1}{(1-r)^2}$$
The last identity comes from differentiating term-by-term.
ML applications of geometric series:
- Discounted future rewards (RL): $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$. The total return is a geometric series with ratio $\gamma < 1$; it converges to at most $\frac{r_{\max}}{1-\gamma}$ when rewards are bounded by $r_{\max}$.
- Adam bias correction: $m_t = (1-\beta)\sum_{k=0}^{t-1}\beta^k g_{t-k}$. Sum of weights: $(1-\beta)\cdot\frac{1-\beta^t}{1-\beta} = 1 - \beta^t$. Correction factor: $\frac{1}{1-\beta^t}$.
- Infinite-horizon value function: $V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$, which is the Bellman equation's foundation.
- Attention denominator as geometric series: For scalar queries and keys with $e^{q k_j} = \rho^j$ (constant ratio $\rho$), $\sum_{j=0}^{n-1} e^{q k_j} = \frac{1-\rho^n}{1-\rho}$ - the denominator is a finite geometric-like sum.
3.3 The Harmonic Series Diverges
The harmonic series $\sum_{n=1}^{\infty} \frac{1}{n}$ diverges despite $\frac{1}{n} \to 0$.
Proof (Oresme's grouping argument, ~1350): Group terms in blocks of doubling size:
$$1 + \frac12 + \left(\frac13 + \frac14\right) + \left(\frac15 + \cdots + \frac18\right) + \left(\frac19 + \cdots + \frac1{16}\right) + \cdots \ge 1 + \frac12 + \frac12 + \frac12 + \frac12 + \cdots$$
Each block contributes at least $\frac12$. Adding infinitely many blocks of $\frac12$ gives infinity.
Why this matters: The harmonic series illustrates that $a_n \to 0$ alone does not guarantee convergence. Every convergence test addresses this problem: it must establish decay faster than harmonic decay.
The harmonic series in ML: The divergence of $\sum \frac{1}{t}$ is why the learning rate schedule $\alpha_t = \frac{\alpha_0}{t}$ satisfies the Robbins-Monro condition $\sum \alpha_t = \infty$ - it decreases slowly enough to ensure the optimizer explores sufficiently. The companion condition $\sum \alpha_t^2 < \infty$ holds since $\sum \frac{1}{t^2} < \infty$ (p-series with $p = 2$).
3.4 p-Series
The p-series is $\sum_{n=1}^{\infty} \frac{1}{n^p}$ for $p > 0$.
Theorem: The p-series converges if and only if $p > 1$.
| $p$ | Series | Sum | Converges? |
|---|---|---|---|
| $1$ | $\sum \frac{1}{n}$ (harmonic) | $\infty$ | No |
| $2$ | $\sum \frac{1}{n^2}$ (Basel) | $\frac{\pi^2}{6}$ | Yes |
| $3$ | $\sum \frac{1}{n^3}$ (Apéry) | $\zeta(3) \approx 1.202$ | Yes |
| $\frac12$ | $\sum \frac{1}{\sqrt{n}}$ | $\infty$ | No |
| $1$ (log factor) | $\sum \frac{1}{n \ln n}$ | $\infty$ | No (barely!) |
Proof: By the integral test (proven in 4.2): $\sum \frac{1}{n^p}$ converges iff $\int_1^{\infty} x^{-p}\,dx$ converges. This integral equals $\frac{x^{1-p}}{1-p}\Big|_1^{\infty}$ for $p \ne 1$. For $p > 1$: $x^{1-p} \to 0$ as $x \to \infty$, so the integral equals $\frac{1}{p-1}$ (converges). For $p < 1$: $x^{1-p} \to \infty$, so the integral diverges. For $p = 1$: $\int_1^{\infty} \frac{dx}{x} = \ln x \Big|_1^{\infty} = \infty$.
3.5 Telescoping Series
A telescoping series has a form that allows consecutive cancellation: if $a_n = b_n - b_{n+1}$, then $S_N = b_1 - b_{N+1}$.
Example: $\sum_{n=1}^{\infty} \frac{1}{n(n+1)} = 1$.
Proof via partial fractions: $\frac{1}{n(n+1)} = \frac{1}{n} - \frac{1}{n+1}$. Then $S_N = \left(1 - \frac12\right) + \left(\frac12 - \frac13\right) + \cdots + \left(\frac{1}{N} - \frac{1}{N+1}\right) = 1 - \frac{1}{N+1} \to 1$.
Another example: $\sum_{n=1}^{\infty} \ln\frac{n+1}{n}$ diverges (telescopes to $\ln(N+1) \to \infty$).
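The telescoping identity gives an exact formula for the partial sum, which a short check confirms (the helper `S` is illustrative):

```python
# Telescoping: the partial sum of sum 1/(n(n+1)) collapses to exactly 1 - 1/(N+1).
def S(N):
    return sum(1.0 / (n * (n + 1)) for n in range(1, N + 1))

for N in [10, 100, 1000]:
    print(f"S_{N} = {S(N):.6f}, predicted 1 - 1/(N+1) = {1 - 1/(N+1):.6f}")
```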
4. Convergence Tests
The fundamental challenge: given a series $\sum a_n$, does it converge? The convergence tests below address different structural properties of $a_n$.
```text
CONVERGENCE TEST SELECTION GUIDE

a_n -> 0?  -> NO -> DIVERGES (divergence test)

Is a_n = f(n) with f decreasing, positive?
  -> Use INTEGRAL TEST: compare to integral_1^inf f(x) dx

Is a_n comparable to a known series b_n?
  -> a_n <= C*b_n and Sigma b_n converges -> CONVERGES
  -> a_n >= C*b_n and Sigma b_n diverges  -> DIVERGES (comparison)
  -> lim a_n/b_n = L in (0, infinity)     -> same behavior (limit comparison)

Does a_n involve factorials or exponentials?
  -> Try RATIO TEST: L = lim |a_{n+1}/a_n|
     L < 1 -> converges; L > 1 -> diverges; L = 1 -> inconclusive

Does a_n involve n-th powers?
  -> Try ROOT TEST: L = lim |a_n|^(1/n)
     L < 1 -> converges; L > 1 -> diverges; L = 1 -> inconclusive

Does the series alternate signs?
  -> Try ALTERNATING SERIES TEST: a_n decreasing, -> 0 -> converges
```
4.1 Divergence Test
Theorem (Divergence Test): If $\sum a_n$ converges, then $a_n \to 0$.
Equivalently (contrapositive): If $a_n \not\to 0$, then $\sum a_n$ diverges.
Proof: If $S_N \to S$, then $a_N = S_N - S_{N-1} \to S - S = 0$.
Important: The converse is FALSE. $a_n \to 0$ does NOT imply $\sum a_n$ converges. The harmonic series is the canonical counterexample.
Examples:
- $\sum \frac{n}{n+1}$: since $\frac{n}{n+1} \to 1 \ne 0$, diverges immediately.
- $\sum (-1)^n$: since $(-1)^n$ does not converge to $0$, diverges.
- $\sum \frac{1}{n}$: $\frac{1}{n} \to 0$, so the test is inconclusive. Must use another test.
4.2 Integral Test
Theorem (Integral Test): Suppose $f: [1, \infty) \to \mathbb{R}$ is continuous, positive, and decreasing, with $f(n) = a_n$. Then:
$$\sum_{n=1}^{\infty} a_n \text{ converges} \iff \int_1^{\infty} f(x)\,dx \text{ converges}$$
Proof sketch: Since $f$ is decreasing, $a_{n+1} \le f(x) \le a_n$ for $x \in [n, n+1]$, so:
$$a_{n+1} \le \int_n^{n+1} f(x)\,dx \le a_n$$
Summing: $S_{N+1} - a_1 \le \int_1^{N+1} f(x)\,dx \le S_N$. As $N \to \infty$, the partial sums and the integral bound each other, so both converge or both diverge.
Applications:
- p-series: $\int_1^{\infty} x^{-p}\,dx$ converges iff $p > 1$ -> p-series converges iff $p > 1$
- $\sum \frac{1}{n \ln n}$: $\int_2^{\infty} \frac{dx}{x \ln x} = \ln \ln x \Big|_2^{\infty} = \infty$ -> diverges
- $\sum \frac{1}{n (\ln n)^2}$: $\int_2^{\infty} \frac{dx}{x (\ln x)^2} = -\frac{1}{\ln x}\Big|_2^{\infty} = \frac{1}{\ln 2}$ -> converges
Error bound from integral test: If the series converges,
$$\int_{N+1}^{\infty} f(x)\,dx \;\le\; \sum_{n=N+1}^{\infty} a_n \;\le\; \int_N^{\infty} f(x)\,dx$$
This gives a computable bound on the tail error after $N$ terms.
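The tail bound is easy to verify against a series whose sum is known exactly; a minimal sketch using the Basel sum $\sum 1/n^2 = \pi^2/6$:

```python
import math

# Integral-test tail bound for sum 1/n^2: with f(x) = 1/x^2,
# int_{N+1}^inf f = 1/(N+1) and int_N^inf f = 1/N bracket the true tail.
N = 100
partial = sum(1.0 / n**2 for n in range(1, N + 1))
true_tail = math.pi**2 / 6 - partial
print(f"true tail after {N} terms = {true_tail:.8f}")
print(f"integral-test bounds      = [{1/(N+1):.8f}, {1/N:.8f}]")
```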
4.3 Comparison and Limit Comparison Tests
Theorem (Direct Comparison): Suppose $0 \le a_n \le b_n$ for all sufficiently large $n$.
- If $\sum b_n$ converges, then $\sum a_n$ converges.
- If $\sum a_n$ diverges, then $\sum b_n$ diverges.
Proof: For the first part: partial sums of $\sum a_n$ are bounded above by the (convergent) sum of $\sum b_n$, and they're increasing, so they converge by MCT.
Theorem (Limit Comparison): Suppose $a_n, b_n > 0$ and $\lim_{n\to\infty} \frac{a_n}{b_n} = L$ with $0 < L < \infty$. Then $\sum a_n$ and $\sum b_n$ either both converge or both diverge.
Why: For large $n$, $a_n \approx L\,b_n$, so $a_n$ and $b_n$ have the same order of magnitude.
Examples:
- $\sum \frac{1}{n^2 + 1}$: compare to $\sum \frac{1}{n^2}$ (p-series, converges). Limit comparison: $\frac{1/(n^2+1)}{1/n^2} = \frac{n^2}{n^2+1} \to 1$. Conclusion: converges.
- $\sum \frac{n+1}{n^3 + 2}$: dominant term is $\frac{n}{n^3} = \frac{1}{n^2}$. Limit comparison with $\frac{1}{n^2}$: ratio $\to 1$. Converges.
4.4 Ratio Test
Theorem (Ratio Test, d'Alembert): Let $L = \lim_{n\to\infty} \left|\frac{a_{n+1}}{a_n}\right|$.
- If $L < 1$: $\sum a_n$ converges absolutely
- If $L > 1$ (or $L = \infty$): diverges
- If $L = 1$: inconclusive
Proof (convergence case): If $L < 1$, pick $r$ with $L < r < 1$. Then $|a_{n+1}| \le r|a_n|$ for all large $n$. By induction, $|a_n| \le C\,r^n$ for some constant $C$. So $\sum |a_n| \le C \sum r^n < \infty$ (geometric series).
Best for: Factorials ($n!$), exponentials ($c^n$), or products involving $n$.
- : ratio -> converges
- : ratio -> converges
4.5 Root Test
Theorem (Root Test, Cauchy): Let $L = \lim_{n\to\infty} |a_n|^{1/n}$.
- If $L < 1$: converges absolutely
- If $L > 1$: diverges
- If $L = 1$: inconclusive
Proof: Same structure as ratio test - if $L < 1$, eventually $|a_n|^{1/n} \le r < 1$, so $|a_n| \le r^n$, a convergent geometric series.
Best for: Terms of the form $(b_n)^n$.
Example: $\sum \left(\frac{n}{2n+1}\right)^n$: root test gives $L = \lim \frac{n}{2n+1} = \frac12 < 1$ -> converges.
Relationship to ratio test: If the ratio test limit exists, the root test limit also exists and equals the same value. The root test is stronger (applicable when ratio test fails), but harder to compute.
4.6 Alternating Series Test
Definition: An alternating series has the form $\sum (-1)^{n+1} b_n$ where $b_n \ge 0$.
Theorem (Alternating Series Test / Leibniz criterion): If:
- $b_{n+1} \le b_n$ for all $n$ (decreasing)
- $b_n \to 0$
Then $\sum (-1)^{n+1} b_n$ converges.
Proof idea: Even and odd partial sums are monotone and bounded, converging to a common limit.
Error bound: The error in approximating the sum by $S_N$ satisfies $|S - S_N| \le b_{N+1}$.
Key example: The alternating harmonic series $\sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n} = \ln 2$.
This is conditionally convergent - it converges but $\sum \left|\frac{(-1)^{n+1}}{n}\right| = \sum \frac{1}{n}$ diverges (harmonic series).
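The Leibniz error bound can be watched in action on the alternating harmonic series (the helper `S` is illustrative):

```python
import math

# Alternating harmonic series converges to ln 2; the Leibniz bound guarantees
# |S - S_N| <= b_{N+1} = 1/(N+1) for the partial sum with N terms.
def S(N):
    return sum((-1) ** (n + 1) / n for n in range(1, N + 1))

for N in [10, 100, 1000]:
    err = abs(S(N) - math.log(2))
    print(f"N = {N:>4}: error = {err:.6f}, bound 1/(N+1) = {1/(N+1):.6f}")
```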
4.7 Absolute vs. Conditional Convergence
Definition: A series $\sum a_n$ is:
- Absolutely convergent if $\sum |a_n|$ converges
- Conditionally convergent if $\sum a_n$ converges but $\sum |a_n|$ diverges
Theorem: Absolute convergence implies convergence.
Proof: $0 \le a_n + |a_n| \le 2|a_n|$, so $\sum (a_n + |a_n|)$ converges by comparison. Then $\sum a_n = \sum (a_n + |a_n|) - \sum |a_n|$ converges as a difference of convergent series.
Riemann Rearrangement Theorem: If $\sum a_n$ is conditionally convergent, then for any $S \in \mathbb{R}$ (or $\pm\infty$), there exists a rearrangement of the terms that converges to $S$.
This shocking result means: the value of a conditionally convergent series depends on the order of summation! This is not a curiosity - it has direct implications for:
- Stochastic gradient descent: the order in which mini-batches are processed affects the path (though not the limit for convergent algorithms)
- Floating-point summation: the order of floating-point additions affects the accumulated rounding error
For ML: Practically, all series in ML are absolutely convergent (finite sums, or convergent geometric/p-series). But the theory of conditional convergence explains why naive summation of large arrays can lose precision.
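Floating-point non-associativity is easy to trigger directly; a minimal sketch in plain Python (the specific values are chosen to make the rounding loss visible):

```python
import math

# Floating-point addition is not associative: the order of summation changes
# the result, a finite echo of the rearrangement phenomenon.
terms = [1e16, 1.0, -1e16]

left_to_right = (terms[0] + terms[1]) + terms[2]   # 1.0 is absorbed: 1e16 + 1 == 1e16
safe_order    = (terms[0] + terms[2]) + terms[1]   # cancel the big terms first

print(left_to_right)        # 0.0 - the small term was lost
print(safe_order)           # 1.0 - the correct sum
print(math.fsum(terms))     # 1.0 - fsum tracks rounding error exactly
```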
5. Power Series
5.1 Definition and Convergence Interval
Definition: A power series centered at $a$ is a series of the form:
$$\sum_{n=0}^{\infty} c_n (x - a)^n = c_0 + c_1(x-a) + c_2(x-a)^2 + \cdots$$
where $c_n$ are the coefficients and $a$ is the center. When $a = 0$ we get a Maclaurin (or standard) power series $\sum c_n x^n$.
A power series is fundamentally different from the series we've studied: it is a function of $x$, not just a number. For each value of $x$, we get a numerical series that may or may not converge.
Theorem (convergence trichotomy): For any power series $\sum c_n (x-a)^n$, exactly one of three things holds:
- The series converges only at $x = a$
- The series converges for all $x \in \mathbb{R}$
- There exists $R > 0$ (the radius of convergence) such that the series converges absolutely for $|x - a| < R$ and diverges for $|x - a| > R$
The interval $(a - R,\, a + R)$ is the interval of convergence (open, possibly including one or both endpoints).
5.2 Radius of Convergence
Theorem (Radius of Convergence via Ratio Test): If $\lim_{n\to\infty} \left|\frac{c_{n+1}}{c_n}\right| = L$, then $R = \frac{1}{L}$ (with $R = \infty$ if $L = 0$ and $R = 0$ if $L = \infty$).
Derivation: Apply the ratio test to $\sum c_n (x-a)^n$:
$$\lim_{n\to\infty} \left|\frac{c_{n+1}(x-a)^{n+1}}{c_n (x-a)^n}\right| = L\,|x - a|$$
Ratio test converges when $L|x-a| < 1$, i.e., $|x - a| < \frac{1}{L}$.
Hadamard formula (root test version):
$$\frac{1}{R} = \limsup_{n\to\infty} |c_n|^{1/n}$$
This formula always works, even when the ratio test limit doesn't exist.
Key cases:
- $R = 0$: converges only at center (e.g., $\sum n!\,x^n$)
- $R = \infty$: converges for all $x$ (e.g., $\sum \frac{x^n}{n!}$, the exponential series)
- $0 < R < \infty$: converges in $(a-R, a+R)$ or possibly with endpoints included (e.g., $\sum x^n$ with $R = 1$, geometric series)
Examples:
$\sum \frac{x^n}{n!}$: ratio $\frac{|x|}{n+1} \to 0$ for all $x$. So $R = \infty$.
$\sum n!\,x^n$: ratio $(n+1)|x| \to \infty$ unless $x = 0$. So $R = 0$.
$\sum \frac{x^n}{n}$: ratio $\frac{n}{n+1}|x| \to |x|$, so $R = 1$. Interval: $(-1, 1)$ - check endpoints separately.
5.3 Behavior at Endpoints
The ratio test is inconclusive at (endpoints). Endpoints must be checked individually using series convergence tests.
Example: has .
- At : - harmonic series, diverges
- At : - alternating harmonic series, converges (to )
So the interval of convergence is .
Example: has .
- At : - p-series with , converges
- At : - absolutely convergent (compare to ), converges
Interval of convergence: (closed).
5.4 Term-by-Term Differentiation and Integration
Theorem: If $f(x) = \sum_{n=0}^{\infty} c_n (x-a)^n$ has radius of convergence $R > 0$, then:
- Term-by-term differentiation: $f'(x) = \sum_{n=1}^{\infty} n\,c_n (x-a)^{n-1}$, with radius of convergence $R$
- Term-by-term integration: $\int f(x)\,dx = C + \sum_{n=0}^{\infty} \frac{c_n}{n+1}(x-a)^{n+1}$, with radius of convergence $R$
Both operations preserve the radius of convergence (though endpoint behavior may change).
Why this works: Inside the interval of convergence, power series converge uniformly on compact subsets (see Appendix B), which allows interchange of limit and derivative/integral. This is the key technical condition.
Application - deriving new series from known ones:
From $\frac{1}{1-x} = \sum_{n=0}^{\infty} x^n$ (geometric series, $|x| < 1$):
Differentiate: $\frac{1}{(1-x)^2} = \sum_{n=1}^{\infty} n\,x^{n-1}$
Integrate from $0$ to $x$: $-\ln(1-x) = \sum_{n=1}^{\infty} \frac{x^n}{n}$, so $\ln(1-x) = -\sum_{n=1}^{\infty} \frac{x^n}{n}$
Substitute $x \to -x^2$: $\frac{1}{1+x^2} = \sum_{n=0}^{\infty} (-1)^n x^{2n}$; integrating gives $\arctan x = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{2n+1}$
This is the Taylor series for $\arctan x$, derived purely from the geometric series plus integration.
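Both derived series can be checked against the standard library inside the interval of convergence; a quick sketch at $x = 0.5$:

```python
import math

# Series derived from the geometric series, evaluated well inside |x| < 1:
#   ln(1-x)   = -sum_{n>=1} x^n / n
#   arctan(x) =  sum_{n>=0} (-1)^n x^(2n+1) / (2n+1)
x, N = 0.5, 60
ln_series = -sum(x ** n / n for n in range(1, N + 1))
atan_series = sum((-1) ** n * x ** (2 * n + 1) / (2 * n + 1) for n in range(N))

print(f"series ln(1-0.5)  = {ln_series:.12f}  vs math.log(0.5)  = {math.log(0.5):.12f}")
print(f"series arctan(0.5) = {atan_series:.12f} vs math.atan(0.5) = {math.atan(0.5):.12f}")
```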
5.5 Uniqueness of Power Series Representations
Theorem: If $f(x) = \sum c_n (x-a)^n$ for all $x$ in some interval around $a$, then the coefficients are uniquely determined:
$$c_n = \frac{f^{(n)}(a)}{n!}$$
Proof: Set $x = a$: $c_0 = f(a)$. Differentiate: $f'(x) = \sum_{n \ge 1} n\,c_n (x-a)^{n-1}$, so $c_1 = f'(a)$. Differentiate again: $f''(a) = 2c_2$, so $c_2 = \frac{f''(a)}{2}$. Inductively: $c_n = \frac{f^{(n)}(a)}{n!}$.
Consequence: The Taylor series of $f$ at $a$ is the only power series centered at $a$ that can represent $f$. If you find any power series for $f$, it must be the Taylor series.
6. Taylor and Maclaurin Series
6.1 Deriving the Taylor Series Formula
The question: given a smooth function $f$, can we represent it as a power series near a point $a$? If so, the coefficients must be $c_n = \frac{f^{(n)}(a)}{n!}$ (by 5.5). This leads to:
Definition (Taylor Series): The Taylor series of $f$ at $a$ is:
$$f(x) \stackrel{?}{=} \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x - a)^n$$
The question mark is crucial: even if $f$ has derivatives of all orders, the Taylor series may not converge to $f$. The conditions for equality are given in 6.4.
Intuition via polynomial fitting: The $n$-th Taylor polynomial $T_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}(x-a)^k$ is the unique polynomial of degree $\le n$ that matches $f$ and its first $n$ derivatives at $a$:
$$T_n(a) = f(a), \quad T_n'(a) = f'(a), \quad \dots, \quad T_n^{(n)}(a) = f^{(n)}(a)$$
As $n \to \infty$, we hope $T_n(x) \to f(x)$ - this is the convergence question.
As , we hope - this is the convergence question.
```text
TAYLOR POLYNOMIAL APPROXIMATION

f(x) = sin(x) at a = 0:
  T_1(x) = x                               (tangent line)
  T_3(x) = x - x^3/6                       (cubic approximation)
  T_5(x) = x - x^3/6 + x^5/120             (quintic approximation)
  T_7(x) = x - x^3/6 + x^5/120 - x^7/5040

Error |sin(x) - T_5(x)| at x = 0.1: ~= 2.0 x 10^-11  (tiny!)
Error |sin(x) - T_5(x)| at x = 2.0: ~= 2.4 x 10^-2   (small but visible)
Error |sin(x) - T_5(x)| at x = pi:  ~= 5.2 x 10^-1   (large near pi!)

The polynomial is most accurate near x = 0 (the expansion point).
```
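The error figures above can be reproduced in a few lines (the helper `T5` is just the quintic polynomial):

```python
import math

# Error of the degree-5 Taylor polynomial of sin at 0: accuracy degrades
# rapidly as x moves away from the expansion point.
def T5(x):
    return x - x**3 / 6 + x**5 / 120

for x in [0.1, 2.0, math.pi]:
    print(f"x = {x:.4f}: |sin(x) - T5(x)| = {abs(math.sin(x) - T5(x)):.2e}")
```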
6.2 Standard Maclaurin Series
The most important Maclaurin series (Taylor series at ):
Exponential: $e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$, $R = \infty$
Derivation: $f^{(n)}(x) = e^x$ for all $n$, so $f^{(n)}(0) = 1$, giving $c_n = \frac{1}{n!}$.
Sine: $\sin x = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{(2n+1)!} = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots$, $R = \infty$
Cosine: $\cos x = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n}}{(2n)!} = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots$, $R = \infty$
Note: $\sin$ has only odd powers; $\cos$ has only even powers - reflecting their symmetry properties.
Natural logarithm: $\ln(1+x) = \sum_{n=1}^{\infty} \frac{(-1)^{n+1} x^n}{n} = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots$, $-1 < x \le 1$
Geometric series: $\frac{1}{1-x} = \sum_{n=0}^{\infty} x^n$, $|x| < 1$
Binomial series: $(1+x)^{\alpha} = \sum_{n=0}^{\infty} \binom{\alpha}{n} x^n$, $|x| < 1$ (for any $\alpha \in \mathbb{R}$)
where $\binom{\alpha}{n} = \frac{\alpha(\alpha-1)\cdots(\alpha-n+1)}{n!}$ are generalized binomial coefficients.
Arctangent: $\arctan x = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{2n+1}$, $|x| \le 1$
Interesting consequence: At $x = 1$: $\frac{\pi}{4} = 1 - \frac13 + \frac15 - \frac17 + \cdots$ (Leibniz formula for $\pi$ - converges very slowly).
6.3 Lagrange Remainder and Error Bounds
Theorem (Taylor's Theorem with Lagrange Remainder): If $f$ has $n+1$ continuous derivatives on an interval containing $a$ and $x$, then:
$$f(x) = T_n(x) + R_n(x)$$
where the Lagrange remainder is:
$$R_n(x) = \frac{f^{(n+1)}(c)}{(n+1)!}(x - a)^{n+1}$$
for some $c$ strictly between $a$ and $x$.
Error bound: If $|f^{(n+1)}(t)| \le M$ for all $t$ between $a$ and $x$, then:
$$|R_n(x)| \le \frac{M\,|x - a|^{n+1}}{(n+1)!}$$
Proof sketch: Apply the Cauchy Mean Value Theorem or integrate the $(n+1)$-th derivative $n+1$ times.
Worked example: Bound the error in approximating $e^x$ by $T_3(x)$ at $x = 0.5$, expanding at $a = 0$.
$R_3(0.5) = \frac{e^c}{4!}(0.5)^4$ for some $c \in (0, 0.5)$. On $[0, 0.5]$: $e^c \le e^{0.5} < 2$, so $|R_3(0.5)| \le \frac{2 \cdot (0.5)^4}{24} \approx 0.0052$.
True value: $e^{0.5} \approx 1.64872$; $T_3(0.5) = 1 + 0.5 + \frac{0.25}{2} + \frac{0.125}{6} \approx 1.64583$. Error $\approx 0.0029$ (even smaller than our bound, as expected).
For ML: The Lagrange remainder quantifies how many Taylor terms are needed for a desired approximation accuracy. This determines, e.g., how many terms of the linear attention approximation are needed for a given accuracy trade-off.
6.4 Analytic Functions and Convergence Domain
A function $f$ is analytic at $a$ if its Taylor series at $a$ converges to $f$ in some neighborhood of $a$. All functions we encounter in ML are analytic wherever they are smooth.
However: A function can be infinitely differentiable (smooth) without being analytic. The classic example:
$$f(x) = \begin{cases} e^{-1/x^2} & x \ne 0 \\ 0 & x = 0 \end{cases}$$
This function has $f^{(n)}(0) = 0$ for all $n$, so its Taylor series is the zero series - which does NOT converge to $f$ (except at $x = 0$).
For ML functions: All standard activation functions and loss functions are analytic on their domains. The Taylor series of $e^x$, $\sin x$, and $\cos x$ converge to the function on all of $\mathbb{R}$; those of $\ln(1+x)$ and $\arctan x$ converge on appropriate intervals.
Domain of convergence: For $\ln(1+u)$, the Taylor series has $R = 1$ - it converges in $(-1, 1]$ but not for $u > 1$. Computing $\mathrm{softplus}(x) = \ln(1 + e^x)$ for large $x$ via the truncated Taylor series in $u = e^x$ will give wrong results. This is why implementations use the formula $\ln(1 + e^x) = x + \ln(1 + e^{-x})$ for large $x$ (log-sum-exp trick).
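A minimal sketch of the stable form (the `softplus` helper is illustrative; `math.log1p` computes $\ln(1+u)$ accurately for small $u$). Naively calling `math.exp(1000.0)` would overflow, while the shifted identity never exponentiates a large number:

```python
import math

# Numerically stable softplus: ln(1 + e^x) = max(x, 0) + ln(1 + e^-|x|).
# The shifted exponential e^-|x| is always <= 1, so nothing can overflow.
def softplus(x):
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

print(softplus(1000.0))    # ~= 1000.0 (the ln(1 + e^-1000) correction underflows to 0)
print(softplus(0.0))       # = ln 2 ~= 0.693147
print(softplus(-1000.0))   # ~= 0.0
```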
6.5 Algebraic Manipulation of Taylor Series
Taylor series can be combined algebraically to produce new series without recomputing from scratch.
Substitution: Replace $x$ with a multiple, power, or function:
- $e^{-x^2} = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n}}{n!}$ (substitute $-x^2$ in $e^x$)
- $\frac{1}{1+x^2} = \sum_{n=0}^{\infty} (-1)^n x^{2n}$ (substitute $-x^2$ in $\frac{1}{1-x}$)
Multiplication: Cauchy product of two absolutely convergent series:
$$\left(\sum_{n=0}^{\infty} a_n x^n\right)\left(\sum_{n=0}^{\infty} b_n x^n\right) = \sum_{n=0}^{\infty}\left(\sum_{k=0}^{n} a_k b_{n-k}\right) x^n$$
Composition: Substitute one series into another and collect terms by power of $x$.
Example - softmax Taylor expansion: Near $T \to \infty$ (high temperature), $e^{z_i/T} \approx 1 + \frac{z_i}{T}$. To first order:
$$\mathrm{softmax}(z/T)_i \approx \frac{1 + z_i/T}{n + \sum_j z_j/T} \approx \frac{1}{n}\left(1 + \frac{z_i - \bar{z}}{T}\right)$$
where $\bar{z}$ is the mean logit. As $T \to \infty$, this $\to \frac{1}{n}$ (uniform). The correction term shows that even at high temperature, the softmax is not exactly uniform - it favors larger $z_i$ by $O(1/T)$.
7. Key Series in Machine Learning
7.1 Softmax and Temperature Scaling
The softmax function with temperature $T$:
$$\mathrm{softmax}(z/T)_i = \frac{e^{z_i/T}}{\sum_{j=1}^{n} e^{z_j/T}}$$
Low temperature ($T \to 0$): Let $z_{\max} = \max_j z_j$. Factor out $e^{z_{\max}/T}$:
$$\mathrm{softmax}(z/T)_i = \frac{e^{(z_i - z_{\max})/T}}{\sum_j e^{(z_j - z_{\max})/T}}$$
As $T \to 0$: terms with $z_i < z_{\max}$ have $e^{(z_i - z_{\max})/T} \to 0$; the term with $z_i = z_{\max}$ (the max) stays at $1$. So $\mathrm{softmax} \to$ one-hot on the argmax. This is the temperature limit used in hard attention and Gumbel-Softmax.
High temperature ($T \to \infty$): Taylor expand $e^{z_i/T} \approx 1 + \frac{z_i}{T}$:
$$\mathrm{softmax}(z/T)_i \approx \frac{1 + z_i/T}{n + \sum_j z_j/T} \to \frac{1}{n}$$
The distribution becomes uniform. This is used in knowledge distillation (Hinton 2015): training the student on soft targets with high temperature captures the "dark knowledge" in the teacher's probability distribution.
Series representation of the denominator: For the standard softmax, $Z = \sum_j e^{z_j}$ is the partition function - a finite sum here, but in theory for continuous distributions it becomes a genuine integral or series.
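Both temperature limits are visible in a few lines; a minimal sketch (the `softmax` helper uses the max-shift for stability, as in 7.5):

```python
import math

# Softmax temperature limits: T -> 0 approaches one-hot on the argmax,
# T -> infinity approaches the uniform distribution 1/n.
def softmax(z, T):
    m = max(z)
    w = [math.exp((zi - m) / T) for zi in z]
    s = sum(w)
    return [wi / s for wi in w]

z = [2.0, 1.0, 0.5]
for T in [0.01, 1.0, 100.0]:
    print(f"T = {T:>6}: {[round(p, 4) for p in softmax(z, T)]}")
```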
7.2 GELU via Hermite Polynomials
The GELU (Gaussian Error Linear Unit) activation is:
$$\mathrm{GELU}(x) = x\,\Phi(x) = x \cdot \frac{1}{2}\left(1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$$
where $\Phi$ is the standard Gaussian CDF. This has no closed-form Taylor series at $0$ in elementary functions, but the approximation:
$$\mathrm{GELU}(x) \approx \frac{x}{2}\left(1 + \tanh\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715\,x^3\right)\right)\right)$$
uses a polynomial approximation inside $\tanh$. This is the approximation used in practice (GPT-2, BERT).
Taylor series of erf: Since $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt$, substitute the series for $e^{-t^2}$ and integrate term-by-term:
$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{n!\,(2n+1)}$$
Near $0$:
$$\Phi(x) = \frac{1}{2} + \frac{x}{\sqrt{2\pi}} - \frac{x^3}{6\sqrt{2\pi}} + \cdots$$
So: $\mathrm{GELU}(x) \approx \frac{x}{2} + \frac{x^2}{\sqrt{2\pi}} - \frac{x^4}{6\sqrt{2\pi}}$ near $0$.
Connection to Hermite polynomials: The derivatives of GELU at $0$ involve Hermite polynomials - the orthogonal polynomials with respect to the Gaussian weight $e^{-x^2/2}$. This connection explains why GELU interacts well with Gaussian-distributed inputs.
7.3 Optimizer Update Rules as Taylor Approximations
Newton's method (second-order optimization) applies a quadratic Taylor approximation to the loss:
$$L(\theta) \approx L(\theta_t) + \nabla L(\theta_t)^{\top}(\theta - \theta_t) + \frac{1}{2}(\theta - \theta_t)^{\top} H\,(\theta - \theta_t)$$
Minimizing over $\theta$: $\theta_{t+1} = \theta_t - H^{-1}\nabla L(\theta_t)$ (Newton step). This is exact for quadratic losses.
SGD discards the Hessian ($H \approx \frac{1}{\alpha} I$): $\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t)$ - first-order Taylor only.
Adam's bias correction via geometric series:
The Adam first moment is $m_t = (1-\beta)\sum_{k=0}^{t-1} \beta^k g_{t-k}$. Its expectation:
$$\mathbb{E}[m_t] = (1-\beta)\sum_{k=0}^{t-1} \beta^k\,\mathbb{E}[g_{t-k}] \approx \mathbb{E}[g_t]\,(1 - \beta^t)$$
using the geometric series partial sum $\sum_{k=0}^{t-1}\beta^k = \frac{1-\beta^t}{1-\beta}$. The bias correction $\hat{m}_t = \frac{m_t}{1-\beta^t}$ restores $\mathbb{E}[\hat{m}_t] \approx \mathbb{E}[g_t]$.
Gradient clipping via Taylor: For very large gradients, the loss is approximated by its first-order Taylor expansion, and clipping bounds the step size to stay within a trust region where the linear approximation is valid.
7.4 Linear Attention via Taylor Expansion of exp
Standard self-attention has complexity $O(n^2 d)$ because it computes:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$
The key operation is $e^{q_i \cdot k_j}$ for all $n^2$ pairs.
Linear attention (Katharopoulos et al., 2020) replaces $e^{q \cdot k}$ with its first-order Taylor approximation:
$$e^{q \cdot k} \approx 1 + q \cdot k = \phi(q) \cdot \phi(k)$$
where $\phi(x) = [1;\, x]$ (feature map). Then:
$$\sum_j \big(\phi(q_i) \cdot \phi(k_j)\big)\,v_j = \phi(q_i) \cdot \left(\sum_j \phi(k_j)\,v_j^{\top}\right)$$
The sums $\sum_j \phi(k_j)\,v_j^{\top}$ and $\sum_j \phi(k_j)$ can be computed once in $O(n)$, reducing overall complexity to $O(n d^2)$.
Error of first-order approximation: By Taylor's theorem, $|e^t - (1+t)| \le \frac{t^2}{2} e^{|t|}$. For small $t = q \cdot k$ (as when $q, k$ have small norm), the error is $O(t^2)$ - small in high dimension after the $\frac{1}{\sqrt{d}}$ scaling.
Higher-order approximations use $e^t \approx 1 + t + \frac{t^2}{2}$, giving Performer-style feature-map approximations via the polynomial kernel $(1 + q \cdot k)^p$.
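A toy check of the first-order kernel approximation, assuming small-norm queries and keys (pure Python, hypothetical vectors; real implementations operate on full matrices):

```python
import math

# First-order Taylor of the attention kernel: exp(q.k) ~= 1 + q.k = phi(q).phi(k)
# with phi(x) = [1, x]. For small-norm q and k, the two kernels nearly agree.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q  = [0.1, -0.05]
ks = [[0.2, 0.1], [-0.1, 0.05], [0.0, 0.15]]

exact  = [math.exp(dot(q, k)) for k in ks]
linear = [1 + dot(q, k) for k in ks]          # = phi(q).phi(k)
errors = [abs(e - l) for e, l in zip(exact, linear)]

print("exact :", [round(v, 5) for v in exact])
print("linear:", [round(v, 5) for v in linear])
print("max error:", max(errors))              # O(t^2/2) with t = q.k
```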
7.5 Log-Sum-Exp and Numerical Stability
The LogSumExp operation appears in softmax computation:
$$\mathrm{LSE}(z) = \log \sum_{j=1}^{n} e^{z_j}$$
Numerical problem: For large $z_j$ (e.g., $z_j = 1000$), $e^{z_j}$ overflows float32 (max $\approx 3.4 \times 10^{38}$).
Stable computation: Let $M = \max_j z_j$. Then:
$$\mathrm{LSE}(z) = M + \log \sum_{j=1}^{n} e^{z_j - M}$$
Each $e^{z_j - M} \le 1$, so no overflow.
Taylor analysis: Why does this work? Write $z_j = M + \delta_j$ where $\delta_j \le 0$:
$$\mathrm{LSE}(z) = M + \log\left(1 + \sum_{j \ne j^*} e^{\delta_j}\right)$$
For the dominant term ($j^*$, the max): contribution is $1$. Other terms are exponentially small: $e^{\delta_j} = e^{-(M - z_j)}$. The Taylor expansion $\log(1+u) \approx u$ shows the correction is small when all other $z_j$ are much smaller than $M$.
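The stable form in a few lines (the `logsumexp` helper is illustrative; a naive `math.exp(1000.0)` would raise an overflow):

```python
import math

# Stable LogSumExp: LSE(z) = max(z) + ln(sum exp(z_j - max)); every shifted
# exponential is <= 1, so nothing overflows even for z_j ~ 1000.
def logsumexp(z):
    m = max(z)
    return m + math.log(sum(math.exp(zi - m) for zi in z))

z = [1000.0, 1001.0]
print(logsumexp(z))                       # = 1001 + ln(1 + e^-1)
print(1001 + math.log1p(math.exp(-1)))    # same value, computed directly
```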
7.6 Laplace Approximation
Setup: Given a posterior $p(\theta \mid \mathcal{D})$, we want a Gaussian approximation around the MAP estimate $\hat{\theta} = \arg\max_{\theta} \log p(\theta \mid \mathcal{D})$.
Method: Taylor-expand $\log p(\theta \mid \mathcal{D})$ around $\hat{\theta}$ (where the gradient is zero):
$$\log p(\theta \mid \mathcal{D}) \approx \log p(\hat{\theta} \mid \mathcal{D}) - \frac{1}{2}(\theta - \hat{\theta})^{\top} \Lambda\,(\theta - \hat{\theta})$$
where $\Lambda = -\nabla^2 \log p(\theta \mid \mathcal{D})\big|_{\hat{\theta}}$ (negative Hessian at MAP). The second-order Taylor expansion gives:
$$p(\theta \mid \mathcal{D}) \approx \mathcal{N}\left(\theta;\ \hat{\theta},\ \Lambda^{-1}\right)$$
This is the Laplace approximation - a Gaussian centered at the MAP with covariance equal to the inverse of the negative log-posterior Hessian.
Where Taylor truncation is valid: The approximation is accurate when the log-posterior is nearly quadratic near $\hat{\theta}$, i.e., when the third-order remainder is small. This holds when the posterior is concentrated (many data points, by the Bernstein-von Mises theorem).
Applications in ML (2026):
- Bayesian neural networks: Laplace approximation used in Laplace-Redux (Immer et al., 2021), PostHoc Laplace (Daxberger et al., 2021)
- Uncertainty quantification: the covariance $\Lambda^{-1}$ estimates predictive uncertainty
- Last-layer Laplace: apply Laplace approximation only to the last linear layer (computationally tractable for large models)
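A 1-D sketch of the method on an assumed toy example (a Gamma-shaped integrand, so the exact normalizer is known): the Laplace estimate of the normalizing constant improves as the "posterior" concentrates, and in this case reduces to Stirling's approximation.

```python
import math

# 1-D Laplace approximation of Z = int_0^inf theta^a e^-theta dtheta = Gamma(a+1).
# Log-integrand h(theta) = a*ln(theta) - theta has mode theta* = a and
# curvature |h''(theta*)| = a / theta*^2 = 1/a.
def laplace_Z(a):
    theta_star = a
    h = a * math.log(theta_star) - theta_star
    curvature = a / theta_star**2
    return math.exp(h) * math.sqrt(2 * math.pi / curvature)

for a in [5, 20, 100]:
    ratio = laplace_Z(a) / math.gamma(a + 1)
    print(f"a = {a:>3}: Laplace / true = {ratio:.4f}")
# The ratio -> 1 as a grows (the integrand becomes more nearly Gaussian).
```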
8. Fourier Series (Preview)
Scope note: The full theory of Fourier series - convergence theorems, Parseval's identity, Fourier transform, and their ML applications - belongs in a dedicated treatment. Here we give the essential concepts needed to understand positional encodings and frequency decomposition in transformers.
8.1 Periodic Functions as Trigonometric Series
A Fourier series represents a periodic function $f$ with period $2\pi$ as an infinite sum of sines and cosines:
$$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty}\big(a_n \cos nx + b_n \sin nx\big)$$
Unlike Taylor series (which expand around a single point using polynomials), Fourier series use global basis functions - sine and cosine waves of different frequencies - to represent the function over its entire period.
The key difference from Taylor series:
| Feature | Taylor Series | Fourier Series |
|---|---|---|
| Basis functions | (powers) | (sinusoids) |
| Expansion point | Local (around ) | Global (over the full period) |
| Best for | Smooth, analytic functions | Periodic functions, possibly discontinuous |
| Convergence | In a neighborhood of | In norm over the period |
| Derivatives | Easy (power rule) | Easy (differentiation rotates sin/cos) |
Key property: The derivative of a Fourier series is obtained by differentiating term-by-term:
$$f'(x) = \sum_{n=1}^{\infty}\big(-n\,a_n \sin nx + n\,b_n \cos nx\big)$$
Differentiation multiplies each frequency-$n$ component by $n$ - high-frequency components dominate the derivative.
8.2 Orthogonality and Fourier Coefficients
The key property that makes Fourier series work is orthogonality: different basis functions are orthogonal with respect to the inner product $\langle f, g \rangle = \int_{-\pi}^{\pi} f(x)\,g(x)\,dx$:
$$\int_{-\pi}^{\pi} \cos(mx)\cos(nx)\,dx = \pi\,\delta_{mn}, \quad \int_{-\pi}^{\pi} \sin(mx)\sin(nx)\,dx = \pi\,\delta_{mn}, \quad \int_{-\pi}^{\pi} \sin(mx)\cos(nx)\,dx = 0$$
These orthogonality relations are provable using the integration formulas for products of sines and cosines (from 03-Integration).
Deriving the Fourier coefficients: Multiply both sides of the Fourier series by $\cos(mx)$ and integrate over the period. All terms vanish by orthogonality except the $n = m$ term:
$$a_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\cos(nx)\,dx, \qquad b_n = \frac{1}{\pi}\int_{-\pi}^{\pi} f(x)\sin(nx)\,dx$$
These are exactly the projection coefficients of $f$ onto the orthogonal basis - the Fourier series is the best $L^2$ approximation of $f$ in the sinusoidal basis.
Complex form: Using Euler's formula $e^{inx} = \cos nx + i\sin nx$:
$$f(x) = \sum_{n=-\infty}^{\infty} c_n e^{inx}, \qquad c_n = \frac{1}{2\pi}\int_{-\pi}^{\pi} f(x)\,e^{-inx}\,dx$$
The complex coefficients encode both amplitude and phase of the $n$-th frequency component.
8.3 ML Relevance: Positional Encodings and RoPE
Original Transformer positional encodings (Vaswani et al., 2017) use sinusoidal functions evaluated at position $pos$ and dimension $i$:
$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
This is a Fourier-like encoding: position is represented by a superposition of sinusoids at different frequencies $\omega_i = 10000^{-2i/d}$.
Why sinusoidal? Any relative shift $k$ in position corresponds to a linear transformation of the encoding vector - provable because the Fourier basis diagonalizes the shift operator. This allows the attention mechanism to learn relative positions.
Rotary Position Embeddings (RoPE) (Su et al., 2021; used in LLaMA, GPT-NeoX) rotate the query and key vectors in the complex plane by angle $m\theta_i$ at position $m$.
The dot product then depends only on $m - n$ (relative position). This mirrors the Fourier multiplication theorem: convolution in position space <-> pointwise multiplication in frequency space.
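The relative-position property is easy to verify numerically. A minimal pure-Python sketch for a single frequency (the values of `theta`, `q`, and `k` are arbitrary illustrative choices, not from the text):

```python
import cmath

def rope_rotate(z, pos, theta):
    """Rotate a complex 'feature pair' z by angle pos*theta (single-frequency RoPE)."""
    return z * cmath.exp(1j * pos * theta)

theta = 0.5
q, k = complex(1.0, 2.0), complex(0.5, -1.0)

def score(m, n):
    """Attention score Re(q_m * conj(k_n)) after rotating q to position m, k to n."""
    qm = rope_rotate(q, m, theta)
    kn = rope_rotate(k, n, theta)
    return (qm * kn.conjugate()).real

s1 = score(3, 1)   # relative offset 2
s2 = score(10, 8)  # same relative offset 2, different absolute positions
assert abs(s1 - s2) < 1e-12
```

The assertion holds because $q e^{im\theta}\,\overline{k e^{in\theta}} = q\bar{k}\,e^{i(m-n)\theta}$ depends only on $m - n$.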
Full Fourier theory: Fourier transform, Parseval's theorem, Gibbs phenomenon, and non-periodic functions are treated in the Probability chapter (characteristic functions) and an advanced analysis appendix.
9. Common Mistakes
| # | Mistake | Why It's Wrong | Fix |
|---|---|---|---|
| 1 | Concluding convergence from $a_n \to 0$ | Necessary but not sufficient: harmonic series has $a_n \to 0$ but diverges | Use a proper convergence test (ratio, comparison, integral) |
| 2 | Using ratio test when the ratio $\to 1$ | $L = 1$ is the inconclusive case: both convergent and divergent p-series have ratio $\to 1$ | Switch to integral test or comparison test |
| 3 | Forgetting to check endpoints separately | Ratio/root tests are strict inequalities: they say nothing at $\lvert x - a\rvert = R$ | Test each endpoint with a separate convergence test |
| 4 | Rearranging conditionally convergent series | Riemann's theorem: rearrangement can change the sum to any value | Only rearrange absolutely convergent series freely |
| 5 | Applying a Taylor series outside its radius of convergence | The series may diverge or converge to the wrong value | Check $\lvert x - a\rvert < R$ before substituting |
| 6 | Confusing the Taylor series existing with converging to $f$ | A smooth function may have a Taylor series that doesn't equal $f$ (e.g., $e^{-1/x^2}$) | Check analyticity; use Lagrange remainder to bound error |
| 7 | Differentiating a power series and changing the radius of convergence | Radius is preserved under differentiation/integration - only endpoint behavior can change | Re-check endpoints only; radius stays the same |
| 8 | Treating $\sum (-1)^{n+1}/n$ as interchangeable with $\sum 1/n$ | The alternating series converges to $\ln 2$; the non-alternating diverges entirely | Absolute and conditional convergence are fundamentally different |
| 9 | Applying the integral test to non-monotone functions | The test requires $f$ to be eventually decreasing and positive | Verify monotonicity; if $f$ is not monotone, use comparison instead |
| 10 | Assuming $\sum a_n$ and $\sum b_n$ both diverge implies $\sum (a_n + b_n)$ diverges | Both can diverge while their sum converges: $a_n = 1/n$, $b_n = -1/n$ | Convergence of sums is NOT additive for divergent series |
| 11 | Using $e^x \approx 1 + x$ for $x$ not small | First-order Taylor has error $O(x^2)$: for $x = 1$, the error is $e - 2 \approx 0.72$ | Check that $\lvert x\rvert$ is small, and bound the remainder |
| 12 | Confusing $R$ (radius of convergence) with the interval of convergence | $R$ is the radius; the interval is $(a - R, a + R)$, and endpoints need separate analysis | Always state both $R$ and the interval (with endpoints explicitly checked) |
10. Exercises
Exercise 1 - Geometric Series Compute the sum of each geometric series or state that it diverges: (a) (b) (c) (d) In reinforcement learning, the discounted return is where (constant reward) and . Compute .
Exercise 2 - Convergence Tests Determine whether each series converges or diverges. State which test you use: (a) (b) (c) (d) (e)
Exercise 3 - Radius of Convergence Find the radius of convergence and the interval of convergence (including endpoint checks) for each power series: (a) (b) (c) (d)
Exercise 4 - Taylor Series Derivation (a) Derive the Maclaurin series for from scratch (compute coefficients via ). (b) Use the series for to derive the series for . (c) Using integration of the series for from 0 to , express as a power series. (d) How many terms are needed to approximate to within ?
Exercise 5 - Lagrange Remainder and Error Bounds (a) Find the 4th-order Taylor polynomial for at . (b) Use the Lagrange remainder to bound . (c) Compute the actual error and verify your bound holds. (d) How many terms of the Taylor series for are needed to compute to 10 decimal places?
Exercise 6 - GELU Taylor Analysis (a) Show that by integrating term-by-term. (b) Write out the first 4 terms of the GELU series: . (c) Compare the GELU Taylor approximation to (ReLU-like) and for . (d) At what does the 5th-order approximation exceed 1% relative error?
Exercise 7 - Linear Attention Approximation (a) Show that to first order, and bound the error in terms of . (b) If queries and keys are unit vectors in (so ), bound the maximum first-order approximation error. (c) Write the standard attention formula and show how the linear approximation reduces complexity from to . (d) Would using the second-order approximation improve accuracy? What is the trade-off?
Exercise 8 - Geometric Series in Adam (a) Show that the Adam first moment (for ) satisfies the recurrence . (b) Compute assuming all gradients have the same mean (i.e., ). Show it equals . (c) Why is the bias correction needed in early training ( small)? (d) For , at what step is the bias correction less than 1% (i.e., )?
11. Why This Matters for AI (2026 Perspective)
| Concept | AI/ML Application | Where in Modern Systems |
|---|---|---|
| Geometric series | Discounted return in RL; Adam bias correction | PPO, SAC, A3C; all Adam-family optimizers |
| Convergence tests | Proving SGD convergence; Robbins-Monro conditions | Convergence proofs in NeurIPS/ICML papers |
| Taylor series (1st order) | Linear attention; Performer approximation | Linformer, Performer, FNet |
| Taylor series (2nd order) | Newton-CG optimizer; K-FAC; Gauss-Newton | Natural gradient methods, second-order optimizers |
| Lagrange remainder | Quantifying approximation error in linearized attention | Guarantees in approximation papers |
| Power series / radius of convergence | Convergence radius of optimizer iterates; spectral radius | Convergence theory for RNNs, ResNets |
| Maclaurin series for | Deriving the exponential map on Lie groups (rotation matrices in equivariant networks) | SE(3)-equivariant networks, FrameDiff |
| Log-sum-exp stability | Numerically stable softmax in all transformer implementations | PyTorch, JAX - used in every attention block |
| Fourier series / sinusoidal PE | Position encoding in original Transformer; period-length encoding | Vaswani et al. 2017; long-context models |
| RoPE (Rotary PE) | Relative position encoding in LLaMA, GPT-NeoX, Falcon | Most open-source LLMs >= 2023 |
| Laplace approximation | Bayesian uncertainty quantification for LLMs | Laplace-Redux; PostHoc Laplace for safety |
| Alternating series (MCMC) | Controlling numerical error in annealed importance sampling | Bayesian inference; diffusion model likelihoods |
| Euler's formula / complex exponentials | Complex rotations in RoPE; frequency analysis of loss landscapes | Mechanistic interpretability of sinusoidal features |
| Binomial series | Legendre polynomials; polynomial feature maps in kernels | Polynomial attention, spherical harmonics |
Conceptual Bridge
Where you came from: This section is the culmination of single-variable calculus. Limits (04/01) are the foundation - sequence convergence IS a limit. Derivatives (04/02) provide the Taylor coefficients $f^{(n)}(a)/n!$. Integration (04/03) allows deriving new power series from old ones via term-by-term integration. Every idea in this section rests on the three preceding sections.
What this unlocks forward: Series are the bridge from single-variable to infinite-dimensional mathematics:
-
-> 05 Multivariate Calculus: The multivariable Taylor theorem uses the same structure: $f(\theta + \Delta) \approx f(\theta) + \nabla f(\theta)^{\top}\Delta + \tfrac{1}{2}\Delta^{\top} H \Delta$, where $H$ is the Hessian matrix. Every gradient descent convergence proof uses the second-order Taylor expansion.
-
-> 06 Probability and Statistics: Generating functions encode probability distributions as power series. Characteristic functions are Fourier transforms of probability densities. The cumulant generating function has Taylor coefficients equal to the cumulants (mean, variance, skewness, kurtosis).
-
-> Functional Analysis (advanced): Functions in can be expanded in any orthonormal basis (Legendre polynomials, Fourier series, wavelets). This is the infinite-dimensional generalization of expressing a vector in an orthonormal basis. Neural networks implicitly learn such basis decompositions.
POSITION IN THE CURRICULUM
04/01 Limits -> epsilon-N convergence used in 04/04 sequence theory
04/02 Derivatives -> Taylor coefficients $f^{(n)}(a)/n!$ in 04/04
04/03 Integration -> term-by-term integration; integral test
04/04 SERIES AND SEQUENCES <- YOU ARE HERE
-> 05 Multivariable Taylor theorem
(gradient descent convergence)
-> 06 Probability: generating functions,
characteristic functions, CLT proof
-> Advanced: Fourier analysis, functional
analysis, spectral methods in ML
The single most important takeaway: Taylor series and the geometric series are not just abstract tools - they are the mathematical engine behind every major optimization algorithm and attention approximation in modern AI. Every time a researcher writes $e^x \approx 1 + x$, they are using the first-order Maclaurin approximation with a remainder bounded by the Lagrange error formula.
Appendix A: Extended Convergence Proofs
A.1 Proof of the Integral Test
Theorem: Let $f: [1, \infty) \to \mathbb{R}$ be continuous, positive, and decreasing, with $f(n) = a_n$. Then $\sum_{n=1}^{\infty} a_n$ converges iff $\int_1^{\infty} f(x)\,dx < \infty$.
Proof: Since $f$ is decreasing, for $n - 1 \le x \le n$:
$$f(n) \le f(x) \le f(n-1)$$
Integrating over $[n-1, n]$:
$$a_n \le \int_{n-1}^{n} f(x)\,dx \le a_{n-1}$$
Sum from $n = 2$ to $N$:
$$\sum_{n=2}^{N} a_n \le \int_1^{N} f(x)\,dx \le \sum_{n=1}^{N-1} a_n$$
If $\int_1^\infty f < \infty$: The partial sums are bounded above by $a_1 + \int_1^\infty f$. Since the sequence of partial sums is increasing (terms are positive) and bounded above, $\sum a_n$ converges by MCT.
Conversely, if $\sum a_n$ converges: Then $\int_1^N f \le \sum_{n=1}^{N-1} a_n \le \sum_{n=1}^{\infty} a_n < \infty$. Since $\int_1^N f$ is increasing in $N$ and bounded, it converges.
A.2 Proof of the Ratio Test
Theorem: If $L = \lim_{n\to\infty} \lvert a_{n+1}/a_n \rvert < 1$, then $\sum a_n$ converges absolutely.
Proof: Choose $r$ with $L < r < 1$. Since $\lvert a_{n+1}/a_n\rvert \to L$, there exists $N$ such that $\lvert a_{n+1}/a_n \rvert \le r$ for all $n \ge N$. Then for $n > N$:
$$\lvert a_n \rvert \le \lvert a_N \rvert\, r^{\,n-N}$$
So the tail is dominated by a convergent geometric series:
$$\sum_{n=N}^{\infty} \lvert a_n \rvert \le \lvert a_N \rvert \sum_{k=0}^{\infty} r^k = \frac{\lvert a_N\rvert}{1 - r} < \infty$$
Since the tail converges, so does the full series.
If $L > 1$: Then $\lvert a_{n+1}\rvert > \lvert a_n\rvert$ eventually, so $a_n \not\to 0$, violating the necessary condition $a_n \to 0$. The series diverges.
A.3 Proof of the Alternating Series Test (Leibniz)
Theorem: If $a_n > 0$, $a_{n+1} \le a_n$, and $a_n \to 0$, then $\sum_{n=1}^{\infty} (-1)^{n+1} a_n$ converges.
Proof: Consider even and odd partial sums:
$$S_{2n} = (a_1 - a_2) + (a_3 - a_4) + \cdots + (a_{2n-1} - a_{2n})$$
Each parenthesis is $\ge 0$ (since $(a_n)$ is decreasing), so $(S_{2n})$ is increasing.
Also: $S_{2n} = a_1 - (a_2 - a_3) - \cdots - a_{2n} \le a_1$. So $(S_{2n})$ is bounded above.
By MCT, $S_{2n} \to S$ for some $S$.
Now $S_{2n+1} = S_{2n} + a_{2n+1} \to S + 0 = S$ (since $a_n \to 0$).
Both subsequences of partial sums converge to $S$, so $S_n \to S$.
Error bound proof: Note $S_{2n} \le S \le S_{2n+1}$ (odd sums overshoot from above). So:
$$\lvert S - S_n \rvert \le \lvert S_{n+1} - S_n \rvert = a_{n+1}$$
Appendix B: Uniform Convergence
B.1 Pointwise vs. Uniform Convergence
Definition (Pointwise convergence): A sequence of functions $f_n$ converges pointwise to $f$ on $D$ if for each fixed $x \in D$:
$$\lim_{n\to\infty} f_n(x) = f(x)$$
Definition (Uniform convergence): $f_n$ converges uniformly to $f$ on $D$ if:
$$\lim_{n\to\infty}\, \sup_{x \in D}\, \lvert f_n(x) - f(x) \rvert = 0$$
Uniform convergence is stronger: the same $N(\epsilon)$ works for ALL $x$ simultaneously.
The critical distinction:
POINTWISE vs. UNIFORM CONVERGENCE
f_n(x) = x^n on [0,1]:
At x = 0.5: (0.5)^n -> 0 (converges)
At x = 1.0: 1^n = 1 -> 1 (converges)
Pointwise limit: f(x) = 0 for x in [0,1), f(1) = 1
But: sup_{x in [0,1]} |x^n - f(x)| = sup_{x in [0,1)} x^n = 1 for all n
-> NOT uniformly convergent on [0,1]
On [0, 0.9]: sup x^n = 0.9^n -> 0 -> uniformly convergent here.
B.2 Why Uniform Convergence Enables Term-by-Term Operations
Theorem: If $f_n \to f$ uniformly on $[a, b]$ and each $f_n$ is continuous:
- $f$ is continuous
- $\lim_{n\to\infty} \int_a^b f_n(x)\,dx = \int_a^b f(x)\,dx$ (limit and integral interchange)
Theorem: If $f_n \to f$ pointwise and $f_n' \to g$ uniformly, then $f' = g$ (limit and derivative interchange).
These theorems are exactly what justify term-by-term differentiation and integration of power series inside the radius of convergence: the partial sums converge uniformly on any closed interval $[a - \rho, a + \rho]$ with $\rho < R$.
B.3 Failure Without Uniform Convergence
Without uniform convergence, swapping limits and integrals leads to errors:
Example: $f_n(x) = 2nx\,e^{-nx^2}$ on $[0, 1]$.
- Pointwise: $f_n(x) \to 0$ for every $x \in [0, 1]$ (compute via L'Hôpital or squeeze)
- But: $\int_0^1 f_n(x)\,dx = 1 - e^{-n} \to 1 \ne 0 = \int_0^1 \lim_n f_n(x)\,dx$
The limit and integral don't commute! Convergence is NOT uniform ($\sup_x f_n(x) = \sqrt{2n/e} \to \infty$).
For ML: This is why convergence arguments for neural network training must be careful about the order of limits. The Dominated Convergence Theorem (from integration theory, 03/Appendix O) provides the right condition under which limits and integrals commute.
Appendix C: Complex Power Series and Euler's Formula
C.1 Extending Power Series to Complex Numbers
Power series work equally well for complex $z$. The exponential series $e^z = \sum_{n=0}^{\infty} z^n/n!$ converges for all $z \in \mathbb{C}$ (same ratio test argument).
Euler's formula: Setting $z = i\theta$ (purely imaginary):
$$e^{i\theta} = \sum_{n=0}^{\infty} \frac{(i\theta)^n}{n!}$$
Using $i^0 = 1$, $i^1 = i$, $i^2 = -1$, $i^3 = -i$, $i^4 = 1$ (period 4), the even terms collect into $\cos\theta$ and the odd terms into $i\sin\theta$:
$$e^{i\theta} = \cos\theta + i\sin\theta$$
Euler's identity ($\theta = \pi$): $e^{i\pi} = -1$, giving $e^{i\pi} + 1 = 0$.
This remarkable identity connects the five most fundamental constants of mathematics: $e$, $i$, $\pi$, $1$, $0$.
C.2 Rotations as Complex Exponentials
Multiplying by $e^{i\theta}$ rotates a complex number by angle $\theta$:
$$re^{i\phi} \cdot e^{i\theta} = re^{i(\phi + \theta)}$$
This is the mathematical foundation of RoPE (Rotary Position Embeddings):
In RoPE, each attention head dimension pair $(x_{2i}, x_{2i+1})$ is treated as a complex number $z_i = x_{2i} + i\,x_{2i+1}$. Multiplying by $e^{im\theta_i}$ (rotation by position $m$ times frequency $\theta_i$) encodes position:
$$z_i \mapsto z_i\, e^{im\theta_i}$$
The inner product $\langle q_m, k_n\rangle$ depends only on the relative position $m - n$ - the rotation encodes position via complex multiplication, making it equivariant to sequence shifts.
C.3 Hyperbolic Functions
The hyperbolic functions emerge from the even and odd parts of $e^x$:
$$\cosh x = \frac{e^x + e^{-x}}{2} = \sum_{n=0}^{\infty}\frac{x^{2n}}{(2n)!}, \qquad \sinh x = \frac{e^x - e^{-x}}{2} = \sum_{n=0}^{\infty}\frac{x^{2n+1}}{(2n+1)!}$$
The tanh activation function: $\tanh x = \sinh x / \cosh x$. Its Taylor series at 0:
$$\tanh x = x - \frac{x^3}{3} + \frac{2x^5}{15} - \cdots$$
This shows tanh is approximately linear ($\tanh x \approx x$) for small $x$ and saturates to $\pm 1$ for large $\lvert x\rvert$.
Appendix D: Generating Functions
D.1 Ordinary Generating Functions
A generating function packages a sequence $(a_n)$ as a formal power series:
$$G(x) = \sum_{n=0}^{\infty} a_n x^n$$
The sequence can be recovered as $a_n = G^{(n)}(0)/n!$ (or via coefficient extraction $[x^n]\,G(x)$). The power of generating functions is that algebraic operations on $G$ correspond to combinatorial operations on the sequence.
Key examples:
| Sequence | Generating function | Notes |
|---|---|---|
| $1, 1, 1, \dots$ | $\frac{1}{1-x}$ | Geometric series |
| $\binom{m}{n}$ (fixed $m$) | $(1+x)^m$ | Binomial coefficients |
| Fibonacci: $F_n$ | $\frac{x}{1 - x - x^2}$ | Closed form via partial fractions |
| $0, 1, 2, 3, \dots$ | $\frac{x}{(1-x)^2}$ | Derived by differentiating geometric series |
Operations:
- Shift: if $G(x) = \sum a_n x^n$, then $xG(x)$ generates the sequence shifted right by 1
- Differentiation: $G'(x) = \sum n a_n x^{n-1}$ (generates the sequence $n a_n$, shifted)
- Multiplication: $G(x)H(x)$ generates $c_n = \sum_{k=0}^{n} a_k b_{n-k}$ (convolution)
D.2 Moment Generating Functions (MGF) in ML
The moment generating function of a random variable $X$:
$$M_X(t) = \mathbb{E}\!\left[e^{tX}\right] = \sum_{n=0}^{\infty} \frac{t^n}{n!}\,\mathbb{E}[X^n]$$
The Taylor coefficients encode the moments: $\mathbb{E}[X^n] = M_X^{(n)}(0)$.
For Gaussian $X \sim \mathcal{N}(\mu, \sigma^2)$: $M_X(t) = e^{\mu t + \sigma^2 t^2/2}$. This gives:
- $M_X'(0) = \mu$ (mean)
- $M_X''(0) = \mu^2 + \sigma^2$ (second moment)
- Variance: $M_X''(0) - M_X'(0)^2 = \sigma^2$
Cumulant generating function: $K_X(t) = \log M_X(t) = \mu t + \sigma^2 t^2/2$ for Gaussian. Taylor coefficients of $K_X$ are the cumulants: $\kappa_1$ (mean), $\kappa_2$ (variance), $\kappa_3$ (related to skewness), $\kappa_4$ (related to excess kurtosis). For Gaussian, $\kappa_n = 0$ for $n \ge 3$ - a Gaussian is completely characterized by its first two cumulants. This is why many normality tests check higher cumulants.
Appendix E: Multivariable Taylor (Preview)
Forward reference: Full treatment in 05 Multivariate Calculus.
The Taylor series extends naturally to functions $f: \mathbb{R}^n \to \mathbb{R}$. Expanding around $\theta_0$:
$$f(\theta) \approx f(\theta_0) + \nabla f(\theta_0)^{\top}(\theta - \theta_0) + \frac{1}{2}(\theta - \theta_0)^{\top} H(\theta_0)\,(\theta - \theta_0)$$
where $\nabla f$ is the gradient vector and $H$ is the Hessian matrix.
Reading the terms:
- 0th order: $f(\theta_0)$ - the function value at the expansion point
- 1st order: $\nabla f(\theta_0)^{\top}(\theta - \theta_0)$ - the linear (gradient) term; this is the basis of SGD
- 2nd order: $\frac{1}{2}(\theta - \theta_0)^{\top} H (\theta - \theta_0)$ - the curvature term; this is why quadratic forms appear in optimization
Gradient descent as first-order Taylor: Setting $\theta = \theta_0 - \eta\,\nabla f(\theta_0)$ in the 1st-order expansion:
$$f(\theta) \approx f(\theta_0) - \eta\,\lVert \nabla f(\theta_0)\rVert^2$$
The loss decreases by $\eta\lVert\nabla f\rVert^2$ per step - valid when $\eta$ is small enough that the linear approximation holds. The "right" $\eta$ is determined by the second-order term.
Newton's method as second-order Taylor: Minimize the 2nd-order Taylor model exactly:
$$\theta^* = \theta_0 - H(\theta_0)^{-1}\,\nabla f(\theta_0)$$
For a quadratic $f$, Newton's method is exact in one step. This is why L-BFGS and other quasi-Newton methods (which approximate $H^{-1}$) converge much faster than SGD on smooth losses.
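The one-step exactness on quadratics is easy to verify numerically. A minimal 1-D sketch (the quadratic `f` and step size `eta` are illustrative choices):

```python
def f(x):  return (x - 3.0) ** 2 + 1.0   # quadratic with minimizer x* = 3
def df(x): return 2.0 * (x - 3.0)        # gradient
d2f = 2.0                                # constant Hessian of a quadratic

x0 = 0.0
x_newton = x0 - df(x0) / d2f             # one Newton step: lands exactly at x* = 3

eta, x_gd = 0.1, x0                      # gradient descent needs many steps
for _ in range(50):
    x_gd -= eta * df(x_gd)

assert abs(x_newton - 3.0) < 1e-12       # exact after a single step
assert abs(x_gd - 3.0) < 1e-3            # still approaching after 50 steps
```

Gradient descent here contracts the error by a factor $1 - 2\eta = 0.8$ per step, a geometric sequence; Newton uses the curvature to jump straight to the minimizer.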
Appendix F: Fourier Series Extended
F.1 Parseval's Identity
For a function $f$ with Fourier series $f(x) = \frac{a_0}{2} + \sum_{n\ge 1}(a_n\cos nx + b_n\sin nx)$:
$$\frac{1}{\pi}\int_{-\pi}^{\pi} f(x)^2\,dx = \frac{a_0^2}{2} + \sum_{n=1}^{\infty}\left(a_n^2 + b_n^2\right)$$
Interpretation: The $L^2$ norm squared of $f$ equals the sum of squared Fourier coefficients. This is an infinite-dimensional Pythagorean theorem - the orthonormal sinusoidal basis satisfies the same relationship as a standard orthonormal basis in $\mathbb{R}^n$.
For ML: In the theory of neural networks as function approximators, the Fourier perspective shows that learning high-frequency functions requires exponentially more training data (the "spectral bias" of neural networks - they learn low frequencies first).
F.2 Gibbs Phenomenon
For a discontinuous function (like a square wave), the Fourier partial sum $S_N$ overshoots near the discontinuity by approximately 9% of the jump, regardless of $N$.
This Gibbs phenomenon does not disappear as $N \to \infty$ - it is an intrinsic property of Fourier series near discontinuities. The partial sum does converge to the function's average value $\frac{f(x^+) + f(x^-)}{2}$ at the discontinuity.
For ML: Neural networks trained on discontinuous target functions (e.g., step functions, indicator functions) exhibit analogous ringing artifacts near edges. Regularization and smooth activations mitigate this.
F.3 Convergence in $L^2$ vs. Pointwise
Theorem (Riesz-Fischer): Every square-integrable function equals the limit of its Fourier partial sums in $L^2$ norm:
$$\lVert f - S_N \rVert_{L^2} \to 0 \quad \text{as } N \to \infty$$
However, pointwise convergence is more subtle - Carleson's theorem (1966) states that the Fourier series of any $L^2$ function converges pointwise almost everywhere, but this is a very deep result.
The $L^2$ convergence is sufficient for ML applications - it means the truncated Fourier sum is the best $L^2$ approximation of the function in the sinusoidal basis, analogous to how the truncated SVD is the best low-rank approximation (Eckart-Young theorem).
Appendix G: ML Derivations in Full
G.1 Full Derivation: Adam Bias Correction
The Adam update rule for the first moment:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$
Unrolling the recurrence:
$$m_t = (1 - \beta_1)\sum_{i=1}^{t} \beta_1^{\,t-i}\, g_i$$
(since $m_0 = 0$). Taking expectation with $\mathbb{E}[g_i] = g$ (true gradient, stationary assumption):
$$\mathbb{E}[m_t] = (1 - \beta_1)\, g \sum_{i=1}^{t}\beta_1^{\,t-i} = g\,(1 - \beta_1^t)$$
The geometric series sum $\sum_{k=0}^{t-1}\beta_1^k = \frac{1 - \beta_1^t}{1 - \beta_1}$ is the key step.
Bias correction: $\hat{m}_t = m_t / (1 - \beta_1^t)$ ensures $\mathbb{E}[\hat{m}_t] = g$. For $\beta_1 = 0.9$:
- $t = 1$: $1 - \beta_1 = 0.1$, correction factor = 10 (large!)
- $t = 10$: $1 - \beta_1^{10} \approx 0.65$, correction $\approx 1.54$
- $t = 100$: $1 - \beta_1^{100} \approx 0.99997$, correction negligible
Similarly for the second moment with correction $\hat{v}_t = v_t / (1 - \beta_2^t)$.
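The derivation can be checked numerically. A minimal sketch assuming a constant gradient `g` (so the expectation is trivial; `beta1 = 0.9` as in the text):

```python
beta1 = 0.9
g = 1.0        # assume every gradient equals g, so E[g_i] = g exactly
m = 0.0        # m_0 = 0
for t in range(1, 11):
    m = beta1 * m + (1 - beta1) * g

# geometric series: after t steps, m_t = (1 - beta1**t) * g
assert abs(m - (1 - beta1 ** 10) * g) < 1e-12

m_hat = m / (1 - beta1 ** 10)   # bias-corrected estimate
assert abs(m_hat - g) < 1e-12   # unbiased after correction
```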
G.2 Full Derivation: Linear Attention Complexity
Standard attention for $n$ tokens and $d$-dimensional heads:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$
Cost: $QK^{\top}$ is $O(n^2 d)$; applying the $n \times n$ weight matrix to $V$ is $O(n^2 d)$. Total: $O(n^2 d)$.
Linear attention with $e^{q\cdot k} \approx 1 + q\cdot k$ (first-order Taylor):
The key insight: define $S = \sum_j v_j k_j^{\top}$ (a $d \times d$ matrix) and $z = \sum_j k_j$. These are computed ONCE in $O(nd^2)$. Then for each query:
$$\text{out}(q) = \frac{\sum_j (1 + q^{\top}k_j)\,v_j}{\sum_j (1 + q^{\top}k_j)} = \frac{\sum_j v_j + Sq}{n + q^{\top}z}$$
Cost per query: $O(d^2)$. Total: $O(nd^2)$. For fixed $d$, this is $O(n)$ vs. $O(n^2)$.
Trade-off: The first-order Taylor approximation introduces error $O\!\left((q\cdot k)^2\right)$ per term. For unit vectors in $\mathbb{R}^d$: $\lvert q\cdot k\rvert \le 1$, so the per-weight error is at most $e - 2 \approx 0.72$. In practice, the error is managed by choosing large $d$ and using random feature approximations (Performer).
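A toy numerical check of the first-order weight approximation (the vectors below are illustrative values with small dot products, the regime where the $O((q\cdot k)^2)$ error is negligible):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# toy data: n = 4 keys, d = 2, deliberately small dot products
keys = [[0.1, 0.0], [0.0, 0.2], [-0.1, 0.1], [0.05, -0.05]]
q = [0.2, -0.1]

# exact (unnormalized) attention weights exp(q.k)
w_exact = [math.exp(dot(q, k)) for k in keys]
# first-order Taylor weights 1 + q.k
w_linear = [1 + dot(q, k) for k in keys]

max_err = max(abs(a - b) for a, b in zip(w_exact, w_linear))
assert max_err < 0.01   # quadratic remainder is tiny when |q.k| is small
```

With $\lvert q\cdot k\rvert \le 0.03$ here, the remainder $e^u - (1+u) \approx u^2/2$ is below $5\times 10^{-4}$; as the dot products grow toward 1, the error grows toward $e - 2$.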
G.3 Full Derivation: Softmax Temperature and Entropy
For logits $z_1, \dots, z_n$ and temperature $T$, let $p_i = \dfrac{e^{z_i/T}}{\sum_j e^{z_j/T}}$.
Entropy $H(p) = -\sum_i p_i \log p_i$:
As $T \to 0$: $p \to$ one-hot at $\arg\max_i z_i$, $H \to 0$ (minimum entropy).
As $T \to \infty$: $p \to (1/n, \dots, 1/n)$, $H \to \log n$ (maximum entropy). Proof via Taylor: for large $T$, $e^{z_i/T} \approx 1 + z_i/T$, so
$$p_i \approx \frac{1 + z_i/T}{n\,(1 + \bar{z}/T)} \approx \frac{1}{n}\left(1 + \frac{z_i - \bar{z}}{T}\right)$$
where $\bar{z}$ is the mean logit.
So entropy increases with temperature, controlled by the $O(1/T)$ correction term.
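A quick numerical illustration of the two limits (the logits `z` are arbitrary illustrative values):

```python
import math

def softmax(z, T):
    m = max(z)                                   # shift for numerical stability
    e = [math.exp((x - m) / T) for x in z]
    s = sum(e)
    return [x / s for x in e]

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

z = [2.0, 1.0, 0.0]
h_cold = entropy(softmax(z, 0.1))    # near one-hot: entropy near 0
h_warm = entropy(softmax(z, 1.0))
h_hot  = entropy(softmax(z, 100.0))  # near uniform: entropy near log(3)

assert h_cold < h_warm < h_hot
assert abs(h_hot - math.log(3)) < 0.01
```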
Appendix H: Numerical Summation Methods
H.1 Euler-Maclaurin Formula
The Euler-Maclaurin formula connects sums to integrals with correction terms:
$$\sum_{k=a}^{b} f(k) = \int_a^b f(x)\,dx + \frac{f(a) + f(b)}{2} + \sum_{j=1}^{m}\frac{B_{2j}}{(2j)!}\left(f^{(2j-1)}(b) - f^{(2j-1)}(a)\right) + R_m$$
where $B_{2j}$ are Bernoulli numbers ($B_2 = \frac{1}{6}$, $B_4 = -\frac{1}{30}$, $B_6 = \frac{1}{42}$, $B_8 = -\frac{1}{30}$).
Application: Summing $\sum_{k=1}^{n} 1/k^2$ (Basel-like): the formula gives an asymptotic expansion of the partial sums, which is how Euler computed $\zeta(2)$ to many digits before proving the exact value $\pi^2/6$.
For ML: Euler-Maclaurin underlies the Euler method for ODEs (Neural ODEs), and is the bridge between discrete (sequence) and continuous (integral) formulations of the same problem.
H.2 Kahan Summation Algorithm
Naive summation of $n$ floating-point numbers accumulates rounding error $O(n\epsilon)$, where $\epsilon$ is machine epsilon (~$1.2 \times 10^{-7}$ for float32). Kahan summation reduces this to $O(\epsilon)$:
def kahan_sum(arr):
total = 0.0
c = 0.0 # compensation
for x in arr:
y = x - c # corrected term
t = total + y # new sum (may lose low bits)
c = (t - total) - y # lost low bits
total = t
return total
This is why torch.sum() and numpy.sum() use pairwise summation (not Kahan but better than naive) for numerical stability.
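A quick comparison of naive and compensated summation (the `kahan_sum` definition from above is repeated so the snippet is self-contained):

```python
def kahan_sum(arr):
    total, c = 0.0, 0.0
    for x in arr:
        y = x - c              # corrected term
        t = total + y          # new sum (may lose low bits)
        c = (t - total) - y    # recover the lost low bits
        total = t
    return total

def naive_sum(arr):
    total = 0.0
    for x in arr:
        total += x
    return total

data = [0.1] * 10**6
reference = 10**6 * 0.1        # one rounded multiply: near-exact here

err_naive = abs(naive_sum(data) - reference)
err_kahan = abs(kahan_sum(data) - reference)
assert err_kahan <= err_naive  # compensation removes almost all drift
assert err_kahan < 1e-9
```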
H.3 Richardson Extrapolation
If $A(h)$ approximates $A$ with error $A(h) = A + Ch^p + O(h^{p+1})$, then combining two approximations eliminates the leading error term:
$$A \approx \frac{2^p\,A(h/2) - A(h)}{2^p - 1}$$
Example: Trapezoidal rule has error $O(h^2)$. Richardson extrapolation on two trapezoidal estimates gives Simpson's rule ($O(h^4)$). Repeated extrapolation gives Romberg integration with $O(h^{2k})$ accuracy.
For series, Richardson extrapolation can accelerate convergence of slowly converging sequences - a technique used in numerical analysis for computing mathematical constants to high precision.
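A minimal sketch of Richardson extrapolation applied to the trapezoidal rule (the integrand $e^x$ on $[0, 1]$ is an illustrative choice, with exact value $e - 1$):

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n subintervals."""
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return s * h

f, a, b = math.exp, 0.0, 1.0
exact = math.e - 1.0

t1 = trapezoid(f, a, b, 8)
t2 = trapezoid(f, a, b, 16)                 # half the step size
richardson = (4 * t2 - t1) / 3              # eliminates the O(h^2) term (p = 2)

# the extrapolated estimate is far more accurate than either input
assert abs(richardson - exact) < abs(t2 - exact) / 100
```

With $p = 2$ the combination $(4A(h/2) - A(h))/3$ is exactly composite Simpson's rule.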
Appendix I: Glossary and Reference Tables
I.1 Key Definitions
| Term | Definition |
|---|---|
| Sequence | Function $a: \mathbb{N} \to \mathbb{R}$; an ordered list $a_1, a_2, a_3, \dots$ |
| Converges (sequence) | $\lim_{n\to\infty} a_n$ exists and is finite |
| Bounded sequence | $\exists B: \lvert a_n\rvert \le B$ for all $n$ |
| Monotone sequence | Either always increasing ($a_{n+1} \ge a_n$) or always decreasing |
| Infinite series | $\sum_{n=1}^{\infty} a_n = \lim_{N\to\infty}\sum_{n=1}^{N} a_n$ |
| Absolutely convergent | $\sum \lvert a_n\rvert$ converges |
| Conditionally convergent | $\sum a_n$ converges but $\sum \lvert a_n\rvert$ diverges |
| Power series | $\sum a_n (x-a)^n$ - a series whose terms are polynomial monomials |
| Radius of convergence | Largest $R$ such that $\sum a_n(x-a)^n$ converges absolutely for $\lvert x-a\rvert < R$ |
| Taylor series | $\sum \frac{f^{(n)}(a)}{n!}(x-a)^n$ - power series coefficients from derivatives |
| Maclaurin series | Taylor series at $a = 0$ |
| Analytic function | Function that equals its Taylor series in a neighborhood of each point |
| Uniform convergence | $\sup_x \lvert f_n(x) - f(x)\rvert \to 0$ |
| Fourier series | Expansion of a periodic function in the sinusoidal basis |
| Lagrange remainder | $R_n(x) = \frac{f^{(n+1)}(c)}{(n+1)!}(x-a)^{n+1}$ - exact error in Taylor approximation |
I.2 Standard Series Table
| Series | Sum | Domain |
|---|---|---|
| $\sum_{n=0}^{\infty} x^n$ | $\frac{1}{1-x}$ | $\lvert x\rvert < 1$ |
| $\sum_{n=1}^{\infty} n x^{n-1}$ | $\frac{1}{(1-x)^2}$ | $\lvert x\rvert < 1$ |
| $\sum_{n=0}^{\infty} \frac{x^n}{n!}$ | $e^x$ | All $x$ |
| $\sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{(2n+1)!}$ | $\sin x$ | All $x$ |
| $\sum_{n=0}^{\infty} \frac{(-1)^n x^{2n}}{(2n)!}$ | $\cos x$ | All $x$ |
| $\sum_{n=1}^{\infty} \frac{(-1)^{n+1} x^n}{n}$ | $\ln(1+x)$ | $-1 < x \le 1$ |
| $\sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{2n+1}$ | $\arctan x$ | $\lvert x\rvert \le 1$ |
| $\sum_{n=0}^{\infty} \frac{x^{2n+1}}{(2n+1)!}$ | $\sinh x$ | All $x$ |
| $\sum_{n=0}^{\infty} \frac{x^{2n}}{(2n)!}$ | $\cosh x$ | All $x$ |
| $\sum_{n=1}^{\infty} \frac{1}{n^2}$ | $\frac{\pi^2}{6}$ (Basel problem) | - |
| $\sum_{n=0}^{\infty} \frac{(-1)^n}{2n+1}$ | $\frac{\pi}{4}$ (Leibniz) | - |
I.3 Convergence Test Decision Table
| Test | Best For | Condition for Convergence | Condition for Divergence | Inconclusive |
|---|---|---|---|---|
| Divergence | Any series (first check) | - | $\lim a_n \ne 0$ | $\lim a_n = 0$ |
| Geometric | $\sum ar^n$ | $\lvert r\rvert < 1$ | $\lvert r\rvert \ge 1$ | - |
| p-series | $\sum 1/n^p$ | $p > 1$ | $p \le 1$ | - |
| Integral | Decreasing positive $f$ | $\int_1^\infty f < \infty$ | $\int_1^\infty f = \infty$ | $f$ not monotone |
| Comparison | Series close to a known one | $a_n \le b_n$ and $\sum b_n$ converges | $a_n \ge b_n$ and $\sum b_n$ diverges | - |
| Limit Comparison | $a_n/b_n \to c$ | $c \in (0, \infty)$ -> same behavior as $\sum b_n$ | same | $c = 0$ or $c = \infty$ |
| Ratio | Factorials, exponentials | $L < 1$ | $L > 1$ | $L = 1$ |
| Root | $n$-th powers | $L < 1$ | $L > 1$ | $L = 1$ |
| Alternating | $\sum (-1)^n a_n$ | $a_n$ decreasing, $a_n \to 0$ | - | $a_n$ not decreasing |
I.4 Self-Assessment Checklist
After completing this section:
Sequences
- Can give an epsilon-N proof that a concrete sequence converges (e.g., $1/n \to 0$)
- Can state and use Monotone Convergence Theorem
- Can explain Bolzano-Weierstrass and its ML relevance
Series and Convergence Tests
- Can prove harmonic series diverges (Oresme)
- Can apply all 7 convergence tests and know when each applies
- Can explain absolute vs. conditional convergence and Riemann rearrangement
Power Series
- Can find radius of convergence via ratio test
- Can differentiate and integrate power series term-by-term
- Can derive series from geometric series via integration
Taylor Series
- Can derive Taylor series for , , ,
- Can bound Taylor approximation error via Lagrange remainder
- Can manipulate series algebraically (substitution, multiplication, composition)
ML Applications
- Can derive Adam bias correction using geometric series
- Can explain linear attention as first-order Taylor approximation
- Can explain softmax temperature behavior as $T \to 0$ and $T \to \infty$
- Can describe Laplace approximation as quadratic Taylor expansion
Appendix J: Worked Solutions to Selected Exercises
J.1 Exercise 1 - Geometric Series
(a) : geometric series with , . Since :
(b) : ratio . Diverges.
(c) : geometric with , :
(d) .
J.2 Exercise 3 - Radius of Convergence
(a) : geometric series with ratio . Converges when , i.e., . So .
- At : diverges.
- At : diverges (oscillates).
- Interval of convergence: .
(b) : ratio test: . So .
- At : - alternating harmonic, converges to .
- At : - harmonic, diverges.
- Interval of convergence: .
(c) : ratio for any . So ; converges only at .
(d) : ratio . So , centered at .
- At (): converges (compare to p-series ).
- At (): converges absolutely.
- Interval of convergence: .
J.3 Exercise 5 - Lagrange Remainder
(a) : derivatives at :
- ,
- ,
- ,
- ,
- ,
(b) At : . On : . So:
(c) True: . . Error (less than our bound).
(d) Need . For : . So 9 terms suffice.
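A numerical check of a Lagrange bound, using $f(x) = e^x$ at $a = 0$ as an illustrative function (not necessarily the one in the exercise above):

```python
import math

def taylor_exp(x, n):
    """Degree-n Taylor polynomial of e^x at a = 0."""
    return sum(x ** k / math.factorial(k) for k in range(n + 1))

x, n = 0.5, 4
approx = taylor_exp(x, n)
actual_error = abs(math.exp(x) - approx)

# Lagrange bound: |R_n| <= M |x|^(n+1) / (n+1)!, with M = max of e^c on [0, x]
M = math.exp(x)
bound = M * x ** (n + 1) / math.factorial(n + 1)

assert actual_error <= bound   # the true error never exceeds the bound
```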
Appendix K: Historical Deep Dives
K.1 The Basel Problem (Euler, 1735)
The Basel problem asks: what is $\sum_{n=1}^{\infty} \frac{1}{n^2}$?
This series was known to converge (p-series with $p = 2$), but its exact value was unknown until Euler's stunning proof in 1735.
Euler's argument (not fully rigorous by modern standards, but correct):
The function $\frac{\sin x}{x}$ has zeros at $x = \pm\pi, \pm 2\pi, \pm 3\pi, \dots$ So it factors as:
$$\frac{\sin x}{x} = \prod_{n=1}^{\infty}\left(1 - \frac{x^2}{n^2\pi^2}\right)$$
Expanding the left side via the Taylor series for $\sin x$:
$$\frac{\sin x}{x} = 1 - \frac{x^2}{6} + \frac{x^4}{120} - \cdots$$
Expanding the right side (infinite product), the coefficient of $x^2$ is:
$$-\sum_{n=1}^{\infty}\frac{1}{n^2\pi^2}$$
Equating: $-\frac{1}{6} = -\frac{1}{\pi^2}\sum_{n=1}^{\infty}\frac{1}{n^2}$, so $\sum_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6}$.
This approach - identifying a function by its zeros and comparing Taylor series - is now a standard technique in complex analysis. Euler's original argument was made rigorous by Weierstrass's factorization theorem (1876).
K.2 Ramanujan's Enigmatic Formulas
Srinivasa Ramanujan (1887-1920) discovered series identities that seemed impossible:
$$\frac{1}{\pi} = \frac{2\sqrt{2}}{9801}\sum_{k=0}^{\infty}\frac{(4k)!\,(1103 + 26390k)}{(k!)^4\,396^{4k}}$$
This series was discovered empirically and later proven using the theory of elliptic integrals and modular forms. It converges incredibly fast - each term adds about 8 decimal digits of $\pi$.
Modern use: The Chudnovsky algorithm (1988), based on a Ramanujan-like formula, is used for computing to trillions of digits and is implemented in the mpmath Python library.
K.3 Taylor vs. Fourier: Two Ways to Decompose a Function
The Taylor and Fourier series represent two fundamentally different decomposition strategies:
| | Taylor | Fourier |
|---|---|---|
| Basis | Powers $(x-a)^n$ | Sinusoids $e^{inx}$ |
| Locality | Local: describes $f$ near $x = a$ | Global: describes $f$ over $[-\pi, \pi]$ |
| Best for | Analytic functions, approximation | Periodic functions, signal processing |
| Convergence | In disk $\lvert x-a\rvert < R$ | In $L^2$ over the period |
| Derivatives | Change polynomial degrees | Multiply by $in$ (frequency) |
| ML use | Optimizer derivation, GELU, attention | Positional encoding, spectral analysis |
Both are special cases of harmonic analysis on different groups:
- Taylor series: the local structure near a point (polynomial approximation)
- Fourier series: the circle group $S^1$ (periodic structure)
- Fourier transform (on $\mathbb{R}$): the full translation group
Appendix L: Advanced Convergence Topics
L.1 Cauchy Sequences and Completeness
Definition: A sequence $(a_n)$ is a Cauchy sequence if for every $\epsilon > 0$, there exists $N$ such that for all $m, n \ge N$: $\lvert a_m - a_n\rvert < \epsilon$.
Intuitively: the terms get arbitrarily close to each other as $n \to \infty$.
Theorem (Completeness of $\mathbb{R}$): Every Cauchy sequence in $\mathbb{R}$ converges. (Equivalently: $\mathbb{R}$ is a complete metric space.)
This is the formal statement of the fact that $\mathbb{R}$ has "no gaps". It is what distinguishes $\mathbb{R}$ from $\mathbb{Q}$: the sequence $1, 1.4, 1.41, 1.414, \dots$ (rational approximations to $\sqrt{2}$) is Cauchy in $\mathbb{Q}$ but does not converge in $\mathbb{Q}$ (since $\sqrt{2} \notin \mathbb{Q}$). It does converge in $\mathbb{R}$.
For ML: Gradient descent sequences are sometimes analyzed as Cauchy sequences - if parameter updates become arbitrarily small, the sequence is Cauchy in the complete space and therefore converges (to some local minimum or saddle point).
L.2 Absolute Convergence and Rearrangements
Theorem (Absolute convergence -> any rearrangement converges to same sum): If $\sum a_n$ converges absolutely to $S$, then any rearrangement also converges to $S$.
Proof sketch: For any $\epsilon > 0$, choose $N$ so that $\sum_{n > N} \lvert a_n\rvert < \epsilon$. Any rearrangement that has absorbed the first $N$ terms (in possibly different order) agrees with the original sum to within $\epsilon$.
Theorem (Riemann Rearrangement): If $\sum a_n$ converges conditionally, then for any $S \in \mathbb{R}$ (or $\pm\infty$), there exists a rearrangement converging to $S$.
Proof idea: Separate positive and negative parts: $a_n^+ = \max(a_n, 0)$ and $a_n^- = \max(-a_n, 0)$. Conditional convergence implies $\sum a_n^+ = \sum a_n^- = \infty$ (both diverge). To get sum $S$: greedily add positive terms until the partial sum exceeds $S$, then negative terms until below $S$, and repeat. Since $a_n \to 0$, the oscillations decrease and the series converges to $S$.
Concrete example: The alternating harmonic series $1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \cdots = \ln 2$.
A rearrangement (two positives, one negative): $1 + \frac{1}{3} - \frac{1}{2} + \frac{1}{5} + \frac{1}{7} - \frac{1}{4} + \cdots = \frac{3}{2}\ln 2$.
This is not a curiosity - it shows that the value of a conditionally convergent series is an artifact of the order of summation.
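Both sums can be observed numerically. A minimal sketch of the partial sums (the block structure follows the two-positives-one-negative pattern):

```python
import math

# original order: 1 - 1/2 + 1/3 - ...  -> ln 2
s = sum((-1) ** (n + 1) / n for n in range(1, 200001))

# rearranged: two positive terms (odd denominators), then one negative (even)
t, pos, neg = 0.0, 1, 2
for _ in range(100000):
    t += 1.0 / pos + 1.0 / (pos + 2) - 1.0 / neg
    pos += 4
    neg += 2

assert abs(s - math.log(2)) < 1e-4            # same terms, original order
assert abs(t - 1.5 * math.log(2)) < 1e-4      # same terms, different sum!
```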
L.3 Power Series at the Boundary: Abel's Theorem
Theorem (Abel): If $\sum a_n R^n$ converges (where $R$ is the radius of convergence), then:
$$\lim_{x \to R^-}\sum_{n=0}^{\infty} a_n x^n = \sum_{n=0}^{\infty} a_n R^n$$
In other words: if the power series converges at an endpoint, the function is continuous there from the inside.
Example: $\ln(1+x) = \sum_{n=1}^{\infty}\frac{(-1)^{n+1}x^n}{n}$ for $\lvert x\rvert < 1$. This series converges at $x = 1$ (alternating harmonic). By Abel's theorem:
$$\ln 2 = 1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \cdots$$
This justifies the formula from the power series perspective.
L.4 Summability Methods
Sometimes a divergent series can be assigned a meaningful sum via a summability method:
Cesàro summability: assign the limit of the averages of the partial sums, $\lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^{N} S_k$. The series $1 - 1 + 1 - 1 + \cdots$ has partial sums $1, 0, 1, 0, \dots$ with average $\to \frac{1}{2}$. So it is Cesàro summable to $\frac{1}{2}$.
Abel summability: assign $\lim_{x \to 1^-}\sum a_n x^n$. For $1 - 1 + 1 - \cdots$: $\sum (-1)^n x^n = \frac{1}{1+x} \to \frac{1}{2}$ as $x \to 1^-$. Abel sum $= \frac{1}{2}$.
For ML: The value $-\frac{1}{12}$ appears in zeta function regularization, used in string theory and occasionally in neural network theory. The regularized sum $1 + 2 + 3 + \cdots = -\frac{1}{12}$ (via analytic continuation of the Riemann zeta function, $\zeta(-1)$) appears in Casimir effect calculations and dimensional regularization in physics.
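Cesàro summability of Grandi's series $1 - 1 + 1 - \cdots$ is easy to verify numerically:

```python
# Grandi's series: partial sums oscillate 1, 0, 1, 0, ...
N = 100000
partial, running = 0, 0.0
for n in range(N):
    partial += (-1) ** n      # S_1, S_2, ... = 1, 0, 1, 0, ...
    running += partial
cesaro_mean = running / N     # average of the first N partial sums

assert abs(cesaro_mean - 0.5) < 1e-4   # Cesàro sum = 1/2
```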
Appendix M: Worked Examples - Taylor Series Applications
M.1 Computing to High Precision
The series $e = \sum_{n=0}^{\infty}\frac{1}{n!}$ converges extremely fast (ratio test: ratio $\frac{1}{n+1} \to 0$).
Error after $n$ terms:
$$R_n = \sum_{k=n+1}^{\infty}\frac{1}{k!} < \frac{2}{(n+1)!}$$
For $n = 10$: error $< 2/11! \approx 5 \times 10^{-8}$. For $n = 17$: error $< 2/18! \approx 3 \times 10^{-16}$.
This explains why float64 (15-16 significant digits) is enough: just 17 terms suffice.
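A minimal sketch of the factorial-series computation of $e$ (building each term from the previous one avoids recomputing factorials):

```python
import math

total, term = 1.0, 1.0        # term 0: 1/0! = 1
for n in range(1, 18):
    term /= n                 # 1/n! from 1/(n-1)!
    total += term

# 17 terms already reach float64 accuracy (up to rounding)
assert abs(total - math.e) < 1e-13
```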
M.2 Taylor Series for Numerical Integration
Integrals like have no elementary antiderivative, but the series gives:
This alternating series has error bounded by the first omitted term. For 10 significant digits:
So 10 terms give:
True value: .
M.3 L'Hôpital's Rule via Taylor Series
Limits of the form are often cleaner via Taylor series:
The Taylor series approach reveals the rate of vanishing explicitly - it shows not just that the limit exists but also the leading behavior, which L'Hôpital's rule alone does not.
For ML: This technique is used to analyze the behavior of loss functions near saddle points and to derive learning rate bounds from second-order Taylor expansions.
M.4 Stability of Softmax Computed via log-sum-exp
Standard softmax: softmax(x)[i] = exp(x[i]) / sum(exp(x)).
For $x = [1000, 1001, 1002]$ in float32: exp(1002) overflows. Result: nan.
Log-sum-exp stable version: subtract $m = \max_i x_i$:
$$\text{softmax}(x)_i = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}}$$
Numerically, with $m = 1002$ the shifted exponents are $e^{-2} \approx 0.135$, $e^{-1} \approx 0.368$, $e^{0} = 1$. Sum $\approx 1.503$.
Softmax values: $\approx [0.090, 0.245, 0.665]$.
The Taylor connection: $\log\sum_j e^{x_j} = m + \log(1 + u)$ with $u = \sum_{j \ne \arg\max} e^{x_j - m}$, and $\log(1+u) \approx u$ (first-order Taylor of $\log(1+u)$ for small $u$) - valid when the max logit dominates by a large margin.
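A minimal sketch of the stable computation in pure Python, using the large logits from the example:

```python
import math

def softmax_stable(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]   # largest exponent is exp(0) = 1
    s = sum(exps)
    return [e / s for e in exps]

x = [1000.0, 1001.0, 1002.0]   # naive math.exp(1002) would overflow
p = softmax_stable(x)

assert all(math.isfinite(v) for v in p)
assert abs(sum(p) - 1.0) < 1e-12
assert p[2] > p[1] > p[0]      # ordering of probabilities follows the logits
```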
Appendix N: Connections to Advanced Topics
N.1 The $\zeta$ Function and the Riemann Hypothesis
The Riemann zeta function extends the p-series to complex $s$:
$$\zeta(s) = \sum_{n=1}^{\infty}\frac{1}{n^s}$$
This converges for $\operatorname{Re}(s) > 1$ (by the real p-series result). Riemann's 1859 paper showed $\zeta$ can be analytically continued to all of $\mathbb{C}$ (except $s = 1$) via a functional equation.
Special values:
- $\zeta(2) = \pi^2/6$ (Basel problem)
- $\zeta(4) = \pi^4/90$ (values at even integers involve Bernoulli numbers)
- $\zeta(-1) = -1/12$ (regularized value; related to $1 + 2 + 3 + \cdots$ by analytic continuation)
The Riemann Hypothesis: all non-trivial zeros of $\zeta$ have $\operatorname{Re}(s) = \frac{1}{2}$. Unsolved since 1859, it is one of the Millennium Prize Problems (worth \$1M).
Connections to ML: Spectral methods in deep learning analysis (neural tangent kernel theory) involve the spectrum of infinite-width limits, which connect to -function theory. The random matrix theory of neural network weight spectra has connections to the distribution of Riemann zeta zeros.
N.2 Spectral Bias of Neural Networks
Theorem (Rahaman et al., 2019): Neural networks with smooth activations learn low-frequency components of the target function before high-frequency components.
Fourier analysis explanation: If the target function has Fourier decomposition into frequency components, a neural network trained with gradient flow learns each frequency component at a rate governed by the corresponding eigenvalue of the NTK (Neural Tangent Kernel).
For standard activations, the NTK spectrum decays with frequency, so low frequencies are learned first. This spectral bias (also called the "frequency principle") has practical implications:
- High-frequency functions (sharp edges, textures) require more training
- Random Fourier features can overcome this for specific tasks
- This is why position encodings must explicitly inject high-frequency sinusoids (the transformer couldn't learn high-frequency positional patterns from data alone)
N.3 Nonlinear Series: Fixed-Point Theorems
Many iterative algorithms in ML produce sequences converging to a fixed point. The key tool is:
Banach Fixed-Point Theorem (Contraction Mapping): If $T$ is a contraction on a complete metric space (i.e., $d(T(x), T(y)) \le \gamma\, d(x, y)$ for some $\gamma < 1$), then $T$ has a unique fixed point $x^*$, and the iteration $x_{n+1} = T(x_n)$ converges to $x^*$ at geometric rate $\gamma^n$.
Application - Bellman equations: The Bellman backup operator $T$ in reinforcement learning satisfies $\lVert TV - TV'\rVert_\infty \le \gamma\,\lVert V - V'\rVert_\infty$ with contraction rate $\gamma$ (the discount factor). By Banach's theorem, value iteration $V_{k+1} = TV_k$ converges to the unique optimal value function $V^*$.
The convergence rate is itself a geometric series - each policy evaluation step reduces the error by a factor $\gamma$.
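A minimal sketch of value iteration on a toy two-state MDP (the states, rewards, transitions, and $\gamma$ are illustrative choices), showing the geometric contraction:

```python
gamma = 0.9
r = [1.0, 0.0]       # reward in each state
nxt = [1, 0]         # deterministic transition: state 0 -> 1, state 1 -> 0

def bellman(v):
    """Bellman backup for a fixed policy on the 2-state chain."""
    return [r[s] + gamma * v[nxt[s]] for s in range(2)]

# exact fixed point: v0 = 1 + g*v1, v1 = g*v0  =>  v0 = 1/(1-g^2), v1 = g/(1-g^2)
v_star = [1 / (1 - gamma ** 2), gamma / (1 - gamma ** 2)]

v = [0.0, 0.0]
for _ in range(200):
    v = bellman(v)

err = max(abs(a - b) for a, b in zip(v, v_star))
assert err < (gamma ** 200) * 10   # error shrinks by a factor gamma per backup
```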
N.4 Matrix Exponential and Lie Groups
For a square matrix $A$, the matrix exponential is defined by the same power series:
$$e^A = \sum_{k=0}^{\infty}\frac{A^k}{k!}$$
This series always converges (the operator norm satisfies $\lVert A^k\rVert \le \lVert A\rVert^k$, so the series is bounded by the scalar exponential $e^{\lVert A\rVert}$).
Applications in ML:
- Equivariant networks: The group of rotations $SO(3)$ has a Lie algebra $\mathfrak{so}(3)$ (skew-symmetric matrices). The exponential map $\exp: \mathfrak{so}(3) \to SO(3)$ parameterizes rotations via the power series. Used in SE(3)-equivariant networks (FrameDiff, AlphaFold2 structure module).
- Continuous normalizing flows: The flow $\frac{dz}{dt} = f_\theta(z, t)$ for a learned $f_\theta$; the trace of the Jacobian $\operatorname{tr}\!\left(\frac{\partial f_\theta}{\partial z}\right)$ controls the log-determinant of the transformation.
- Mamba / state space models: The discrete-time recurrence $h_{t+1} = \bar{A} h_t + \bar{B} x_t$ is derived from continuous matrices $(A, B)$ via the matrix exponential with zero-order hold ($\bar{A} = \exp(\Delta A)$).
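The Lie-algebra picture can be checked directly from the series. A minimal sketch (the $2 \times 2$ case and the 30-term truncation are illustrative choices): exponentiating the skew-symmetric generator $\begin{pmatrix} 0 & -\theta \\ \theta & 0 \end{pmatrix}$ via the truncated power series reproduces the rotation matrix by angle $\theta$.

```python
# Matrix exponential via its truncated power series, applied to a 2x2
# skew-symmetric generator: exp([[0,-t],[t,0]]) = [[cos t, -sin t],
#                                                  [sin t,  cos t]].
import math
import numpy as np

def expm_series(A, n_terms=30):
    """Truncated power series exp(A) ~ sum_{k < n_terms} A^k / k!."""
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, n_terms):
        term = term @ A / k          # builds A^k / k! incrementally
        result = result + term
    return result

theta = 0.7
G = np.array([[0.0, -theta], [theta, 0.0]])  # element of so(2)
R_series = expm_series(G)
R_exact = np.array([[math.cos(theta), -math.sin(theta)],
                    [math.sin(theta),  math.cos(theta)]])
print(np.max(np.abs(R_series - R_exact)))  # tiny truncation error
```

Note that the result is orthogonal ($R R^\top = I$), as the exponential of a skew-symmetric matrix must be.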
Appendix O: Final Reference and Checklist
O.1 Essential Formulas at a Glance
Geometric series: $\sum_{n=0}^{\infty} ar^n = \frac{a}{1-r}$ for $|r| < 1$
p-series criterion: $\sum_{n=1}^{\infty} \frac{1}{n^p}$ converges $\iff p > 1$
Radius of convergence: $R = \lim_{n\to\infty} \left|\frac{a_n}{a_{n+1}}\right|$ (ratio method); $\frac{1}{R} = \limsup_{n\to\infty} |a_n|^{1/n}$ (root method)
Taylor series: $f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x-a)^n$
Lagrange error bound: $|R_n(x)| \le \frac{M}{(n+1)!}|x-a|^{n+1}$ where $M = \max_{t \in [a,x]} |f^{(n+1)}(t)|$
Euler's formula: $e^{i\theta} = \cos\theta + i\sin\theta$
Key Maclaurin series:
- $e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}$ (all $x$)
- $\sin x = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{(2n+1)!}$ (all $x$)
- $\cos x = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n}}{(2n)!}$ (all $x$)
- $\ln(1+x) = \sum_{n=1}^{\infty} \frac{(-1)^{n+1} x^n}{n}$ ($-1 < x \le 1$)
- $\frac{1}{1-x} = \sum_{n=0}^{\infty} x^n$ ($|x| < 1$)
Adam bias correction (geometric series): $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$, $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$
Linear attention (Taylor approximation): $e^{q^\top k} \approx 1 + q^\top k$
Laplace approximation: $p(\theta \mid \mathcal{D}) \approx \mathcal{N}(\theta^*, H^{-1})$ where $H = -\nabla^2 \log p(\theta \mid \mathcal{D})\big|_{\theta = \theta^*}$
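The Adam bias-correction line is easy to verify numerically. A minimal sketch (the constant-gradient setup is an illustrative assumption): with gradient fixed at $g$, the EMA gives $m_t = (1 - \beta^t)\, g$ by the geometric series, so dividing by $1 - \beta^t$ recovers $g$ exactly.

```python
# With a constant gradient g, m_t = (1-beta) * sum_{i<t} beta^i * g
# = (1 - beta^t) * g by the geometric series, so m_t / (1 - beta^t)
# removes the initialization bias toward zero.
beta = 0.9
g = 3.0
m = 0.0
for t in range(1, 11):
    m = beta * m + (1 - beta) * g
    m_hat = m / (1 - beta ** t)   # bias-corrected estimate
print(m, m_hat)  # m is still biased toward 0; m_hat recovers g
```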
O.2 Final Self-Assessment
After completing this entire section, you should be able to:
Prove from scratch:
- Harmonic series diverges (Oresme grouping)
- Ratio test (convergence case)
- Alternating series test (Leibniz)
- Taylor's theorem with Lagrange remainder
- Adam bias correction derivation using geometric series
Compute fluently:
- Sums of geometric and telescoping series
- Radius and interval of convergence for power series
- Taylor/Maclaurin series for standard functions
- Approximation errors using Lagrange remainder
- New series via substitution, differentiation, integration
Explain to a colleague:
- Why conditional convergence allows Riemann rearrangement
- How linear attention reduces complexity using Taylor series
- Why softmax becomes uniform at high temperature
- What the Laplace approximation approximates and when it's valid
- How RoPE uses complex exponentials for relative position encoding
Appendix P: Connections to Adjacent Sections
P.1 <- 04/01 Limits and Continuity
The entire theory of sequences is the theory of limits restricted to integer inputs. Every result about sequence convergence is a special case of the limit definition from 01:
- $\lim_{n \to \infty} a_n = L$ is the epsilon-delta limit at infinity with $x$ replaced by the integer index $n$ and $\delta$ replaced by a threshold $N$
- The Squeeze Theorem for sequences is the Squeeze Theorem from 01 restricted to integers
- Continuous functions preserve sequence limits: $a_n \to a \implies f(a_n) \to f(a)$ - this is the sequential characterization of continuity from 01
Formal link: A function $f$ is continuous at $a$ if and only if for every sequence $x_n \to a$, we have $f(x_n) \to f(a)$. This makes sequences a fundamental tool for studying continuous functions.
P.2 <- 04/02 Derivatives and Differentiation
Taylor series are the direct children of differentiation:
- The $n$-th Taylor coefficient is $\frac{f^{(n)}(a)}{n!}$, requiring derivatives of all orders at $a$
- The Lagrange remainder involves the $(n+1)$-th derivative $f^{(n+1)}(\xi)$
- L'Hôpital's rule (for limits) is superseded by Taylor series analysis, which gives the limit value AND the rate of approach
- The derivative of a power series is a power series (via term-by-term differentiation) - justified by uniform convergence
P.3 <- 04/03 Integration
Integration provides two key tools for series theory:
- Integral test: for positive decreasing $f$ with $a_n = f(n)$, $\sum_{n=1}^{\infty} a_n$ converges iff $\int_1^{\infty} f(x)\,dx$ converges - directly using the improper integrals from 03
- Term-by-term integration: $\int_0^x \sum_n a_n t^n \, dt = \sum_n \frac{a_n x^{n+1}}{n+1}$ - used to derive $\ln(1+x)$ from $\frac{1}{1+x}$ and $\arctan x$ from $\frac{1}{1+x^2}$
The connection is deep: both integration and summation are instances of the same "linear functional on functions" idea from functional analysis.
P.4 -> 05 Multivariate Calculus
The multivariable Taylor theorem (the most important formula in optimization theory) is the direct extension of the univariate Taylor series:

$$f(x + h) \approx f(x) + \nabla f(x)^\top h + \frac{1}{2} h^\top \nabla^2 f(x)\, h$$
Every gradient descent convergence proof, every Newton method derivation, and every trust region method is an application of this formula. The radius of convergence concept extends to a "trust region" around the expansion point.
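The second-order model can be checked numerically. A minimal sketch (the test function $e^x \sin y$ is an arbitrary choice, not from the text): the error of the quadratic model is $O(\|h\|^3)$, so halving the displacement should shrink the error by roughly $2^3 = 8$.

```python
# Verify the O(||h||^3) remainder of the second-order Taylor model:
# halving h shrinks the model error by roughly a factor of 8.
import numpy as np

def f(v):
    x, y = v
    return np.exp(x) * np.sin(y)

def grad(v):
    x, y = v
    return np.array([np.exp(x) * np.sin(y), np.exp(x) * np.cos(y)])

def hess(v):
    x, y = v
    return np.array([[np.exp(x) * np.sin(y),  np.exp(x) * np.cos(y)],
                     [np.exp(x) * np.cos(y), -np.exp(x) * np.sin(y)]])

x0 = np.array([0.3, 0.5])
h = np.array([0.1, -0.05])

def taylor2_error(h):
    model = f(x0) + grad(x0) @ h + 0.5 * h @ hess(x0) @ h
    return abs(f(x0 + h) - model)

e1, e2 = taylor2_error(h), taylor2_error(h / 2)
print(e1 / e2)  # close to 8 = 2^3
```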
-> Full treatment: 05 Multivariate Calculus
P.5 -> 06 Probability and Statistics
Generating functions, characteristic functions, and moment generating functions are all power series / Fourier transforms in disguise:
- MGF: $M_X(t) = \mathbb{E}[e^{tX}] = \sum_{n=0}^{\infty} \frac{\mathbb{E}[X^n]}{n!} t^n$
- PGF: $G_X(s) = \mathbb{E}[s^X] = \sum_{k=0}^{\infty} P(X = k)\, s^k$ (discrete distributions)
- Characteristic function: $\varphi_X(t) = \mathbb{E}[e^{itX}]$ (Fourier transform of the density)
The Central Limit Theorem proof via characteristic functions uses the Taylor expansion of $\varphi_X$ around $t = 0$ and the fact that the Gaussian characteristic function is $e^{-t^2/2}$.
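The Taylor step in that proof can be written out explicitly (a sketch for i.i.d. $X_i$ standardized to mean 0 and variance 1, with $S_n = X_1 + \cdots + X_n$):

```latex
\varphi_X(t) = 1 + it\,\mathbb{E}[X] - \frac{t^2}{2}\,\mathbb{E}[X^2] + o(t^2)
             = 1 - \frac{t^2}{2} + o(t^2),
\qquad
\varphi_{S_n/\sqrt{n}}(t)
 = \left[\varphi_X\!\left(\tfrac{t}{\sqrt{n}}\right)\right]^n
 = \left[1 - \frac{t^2}{2n} + o\!\left(\tfrac{1}{n}\right)\right]^n
 \;\xrightarrow{\,n \to \infty\,}\; e^{-t^2/2}.
```

The final limit is the $(1 + x/n)^n \to e^x$ pattern from the theory of sequences, applied to the quadratic Taylor coefficient.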
-> Full treatment: 06 Probability and Statistics
Appendix Q: Additional Exercises with Full Solutions
Q.1 Power Series Manipulation
Problem: Starting from $\frac{1}{1-x} = \sum_{n=0}^{\infty} x^n$ ($|x| < 1$), derive: (a) $\sum_{n=1}^{\infty} n x^{n-1} = \frac{1}{(1-x)^2}$ (b) $\sum_{n=1}^{\infty} n x^n = \frac{x}{(1-x)^2}$ (c) $\sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n} = \ln 2$ (alternating harmonic series)
Solutions:
(a) Differentiate term-by-term (valid for $|x| < 1$): $\frac{d}{dx}\frac{1}{1-x} = \frac{1}{(1-x)^2} = \sum_{n=1}^{\infty} n x^{n-1}$.
(b) Multiply (a) by $x$: $\sum_{n=1}^{\infty} n x^n = \frac{x}{(1-x)^2}$.
(c) From $\ln(1+x) = \sum_{n=1}^{\infty} \frac{(-1)^{n+1} x^n}{n}$ (derived by integrating $\frac{1}{1+x} = \sum_{n=0}^{\infty} (-x)^n$). At $x = 1$: the series converges by the alternating series test ($\frac{1}{n}$ is decreasing to 0). By Abel's theorem (continuity at the boundary), $\sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n} = \ln 2$.
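Both closed forms can be sanity-checked numerically. A minimal sketch (the choice $x = 0.5$ and the averaging trick for the slowly converging alternating series are illustrative, not from the text):

```python
# Check (b) at x = 0.5 and (c) against ln 2.
import math

x = 0.5
partial_b = sum(n * x**n for n in range(1, 60))
closed_b = x / (1 - x)**2          # = 2.0 at x = 0.5
# (c) converges slowly; averaging two consecutive partial sums damps
# the alternating oscillation (a standard acceleration trick).
s = [sum((-1)**(n + 1) / n for n in range(1, N + 1)) for N in (1000, 1001)]
approx_ln2 = (s[0] + s[1]) / 2
print(abs(partial_b - closed_b), abs(approx_ln2 - math.log(2)))
```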
Q.2 Convergence Test Practice
Determine convergence of $\sum_{n=1}^{\infty} \frac{(-2)^n}{n}$.
Solution: Rewrite as $\sum_{n=1}^{\infty} (-1)^n \frac{2^n}{n}$. The terms alternate in sign, so we first try the absolute value: $\sum_{n=1}^{\infty} \frac{2^n}{n}$.
Ratio test: $\lim_{n\to\infty} \frac{2^{n+1}/(n+1)}{2^n/n} = \lim_{n\to\infty} \frac{2n}{n+1} = 2 > 1$.
So $\sum \frac{2^n}{n}$ diverges - the series does not converge absolutely. Check conditional convergence: the alternating series test needs $b_n = \frac{2^n}{n} \to 0$? No - $\frac{2^n}{n} \to \infty$. So $a_n \not\to 0$, violating the necessary condition for convergence. The series diverges.
Q.3 Full Taylor Expansion of $\arctan x$
Derive: $\arctan x = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{2n+1} = x - \frac{x^3}{3} + \frac{x^5}{5} - \cdots$ for $|x| \le 1$.
Solution: Start from $\frac{1}{1+t^2} = \sum_{n=0}^{\infty} (-1)^n t^{2n}$ (geometric series with $r = -t^2$, valid for $|t| < 1$).
Integrate from 0 to $x$: $\arctan x = \int_0^x \frac{dt}{1+t^2} = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{2n+1}$
Valid for $|x| < 1$. At $x = 1$: $\sum_{n=0}^{\infty} \frac{(-1)^n}{2n+1}$ converges by the alternating series test. At $x = -1$: also converges. By Abel's theorem, the formula holds on $[-1, 1]$.
Leibniz formula for $\pi$: Setting $x = 1$: $\frac{\pi}{4} = 1 - \frac{1}{3} + \frac{1}{5} - \frac{1}{7} + \cdots$
This converges, but VERY slowly (error after $n$ terms $\approx \frac{1}{2n}$; need $n \approx 10^{10}$ for 10 digits). In practice, $\pi$ is computed via faster-converging formulas (Machin's: $\frac{\pi}{4} = 4\arctan\frac{1}{5} - \arctan\frac{1}{239}$).
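The speed difference is dramatic even at small term counts. A minimal sketch (term counts chosen for illustration): Machin's formula feeds small arguments into the arctan series, so its terms shrink like $(1/5)^{2n}$ rather than $1/(2n+1)$.

```python
# Compare Leibniz vs Machin for computing pi from the arctan series.
import math

def arctan_series(x, n_terms):
    """Partial sum of arctan x = sum (-1)^n x^(2n+1) / (2n+1)."""
    return sum((-1)**n * x**(2 * n + 1) / (2 * n + 1) for n in range(n_terms))

pi_leibniz = 4 * arctan_series(1.0, 1000)                       # 1000 terms
pi_machin = 4 * (4 * arctan_series(1 / 5, 15)
                 - arctan_series(1 / 239, 15))                  # 15 terms each
print(abs(pi_leibniz - math.pi), abs(pi_machin - math.pi))
# Leibniz: error ~ 1e-3 after 1000 terms; Machin: near machine precision
```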
Appendix R: Series in Numerical Computing
R.1 Float Arithmetic and Series Truncation
Every floating-point number is represented with finite precision. When computing a power series numerically, two errors compete:
- Truncation error: stopping at $N$ terms instead of summing to infinity. Bounded by the Lagrange remainder.
- Rounding error: each term is rounded to the nearest float. Accumulates with the number of terms $N$.
The optimal $N$ balances these - adding more terms reduces truncation error but eventually increases rounding error. For float64 with machine epsilon $\varepsilon \approx 2.2 \times 10^{-16}$:
Optimal $N$ for $e^x$ near $x = 1$: truncation error $\approx \frac{1}{(N+1)!}$; rounding error $\approx N\varepsilon$. Setting them equal: $\frac{1}{(N+1)!} \approx N\varepsilon$. For $x = 1$: optimal $N \approx 17$ (matching the roughly 17 significant digits of float64).
For large $|x|$: The terms $\frac{x^n}{n!}$ grow in magnitude before shrinking, and for negative $x$ they alternate in sign. Catastrophic cancellation: many large positive and negative terms cancel. Solution: use $e^x = (e^{x/2})^2$ (argument reduction) to bring the argument near 0 before applying the series.
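The failure and the fix are both easy to demonstrate. A minimal sketch (the choice $x = -20$ and the halving threshold $0.5$ are illustrative): the naive series for $e^{-20}$ cancels terms of magnitude $\sim 10^7$ down to a result of $\sim 10^{-9}$, destroying all accuracy, while recursive argument reduction stays accurate.

```python
# Naive Taylor summation of exp(-20) vs. recursive argument reduction
# exp(x) = exp(x/2)^2 applied until the argument is small.
import math

def exp_naive(x, n_terms=150):
    """Direct partial sum of the Maclaurin series for exp."""
    total, term = 0.0, 1.0
    for n in range(n_terms):
        total += term
        term *= x / (n + 1)       # next term x^(n+1) / (n+1)!
    return total

def exp_reduced(x):
    """Halve the argument until it is small, then square back up."""
    if abs(x) > 0.5:
        half = exp_reduced(x / 2)
        return half * half
    return exp_naive(x, 30)

x = -20.0
rel_err_naive = abs(exp_naive(x) - math.exp(x)) / math.exp(x)
rel_err_reduced = abs(exp_reduced(x) - math.exp(x)) / math.exp(x)
print(rel_err_naive, rel_err_reduced)  # naive is wildly wrong
```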
R.2 Compensated Summation (Kahan Algorithm)
When summing many floating-point numbers, naive summation accumulates error. The Kahan compensated sum tracks the lost bits:
```python
def kahan_sum(values):
    s = 0.0
    c = 0.0  # running compensation (lost low-order bits)
    for v in values:
        y = v - c        # compensate the next term
        t = s + y        # new sum (low bits of y may be lost)
        c = (t - s) - y  # recover what was lost
        s = t
    return s
```
This reduces the error to $O(\varepsilon)$ regardless of $N$ (versus $O(N\varepsilon)$ for naive summation). Libraries like NumPy use pairwise summation (divide-and-conquer), achieving error $O(\varepsilon \log N)$ - a good balance of speed and accuracy.
For ML: In mixed-precision training (float16 weights, float32 accumulation), the summation algorithm in the optimizer matters. Kahan summation in the gradient accumulation step significantly reduces precision loss.
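A quick usage sketch of the algorithm above (the test input, a million copies of 0.1, is an illustrative choice; `kahan_sum` is repeated here so the snippet is self-contained):

```python
# Naive vs. compensated summation on a sum with a known exact value.
def kahan_sum(values):
    s, c = 0.0, 0.0
    for v in values:
        y = v - c
        t = s + y
        c = (t - s) - y
        s = t
    return s

values = [0.1] * 10**6          # exact decimal sum: 100000
naive = 0.0
for v in values:
    naive += v                  # accumulates a rounding error per add
kahan = kahan_sum(values)
print(abs(naive - 1e5), abs(kahan - 1e5))  # Kahan error is far smaller
```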
R.3 Euler's Transform for Alternating Series Acceleration
The alternating harmonic series $\sum_{n=0}^{\infty} \frac{(-1)^n}{n+1} = \ln 2$ converges slowly. Euler's transform accelerates convergence:

$$\sum_{n=0}^{\infty} (-1)^n a_n = \sum_{k=0}^{\infty} \frac{(-1)^k \Delta^k a_0}{2^{k+1}}$$

where $\Delta^k a_0$ are forward finite differences: $\Delta a_n = a_{n+1} - a_n$, $\Delta^k a_n = \Delta^{k-1} a_{n+1} - \Delta^{k-1} a_n$.
For $a_n = \frac{1}{n+1}$ (alternating harmonic):
- $\Delta a_0 = -\frac{1}{2}$, $\Delta^2 a_0 = \frac{1}{3}$, and in general $\Delta^k a_0 = \frac{(-1)^k}{k+1}$
The transformed series $\sum_{k=0}^{\infty} \frac{1}{(k+1)2^{k+1}}$ converges geometrically - much faster. This is used in high-precision computation of $\ln 2$, $\pi$, and other constants.
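The acceleration can be measured directly. A minimal sketch (20 terms for both sums is an illustrative choice), computing $\Delta^k a_0$ via the binomial expansion $\Delta^k a_0 = \sum_j (-1)^{k-j} \binom{k}{j} a_j$:

```python
# Euler's transform on the alternating harmonic series vs. the direct
# partial sum, both with the same number of terms.
import math

a = lambda n: 1.0 / (n + 1)

def forward_diff(k):
    """Delta^k a_0 = sum_j (-1)^(k-j) * C(k, j) * a_j."""
    return sum((-1)**(k - j) * math.comb(k, j) * a(j) for j in range(k + 1))

n_terms = 20
direct = sum((-1)**n * a(n) for n in range(n_terms))
euler = sum((-1)**k * forward_diff(k) / 2**(k + 1) for k in range(n_terms))
err_direct = abs(direct - math.log(2))
err_euler = abs(euler - math.log(2))
print(err_direct, err_euler)   # roughly 1e-2 vs. 1e-8
```

The transformed error shrinks like $2^{-k}$, consistent with the geometric factor in the transform.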
Appendix S: Topic Map and Reading Guide
S.1 Minimum Viable Reading
For an ML practitioner who needs only the most practically relevant material:
- 1.3: Why series matter for AI (overview table)
- 3.2: Geometric series (Adam bias correction, discounted rewards)
- 4.4: Ratio test (to understand radius of convergence)
- 5.2: Radius of convergence (when to trust a Taylor approximation)
- 6.1-6.3: Taylor series derivation and Lagrange error bounds
- 7.1-7.6: All ML applications in full
- 8.3: RoPE and positional encodings
S.2 Theoretical Deep Dive
For a reader interested in rigorous analysis:
- 2.2: epsilon-N definition of sequence convergence (in full)
- 2.3: Monotone Convergence Theorem (proof)
- 4: All convergence tests (with proofs in Appendix A)
- 5: Power series theory, radius of convergence, uniform convergence (App. B)
- 6.4: When does the Taylor series converge to ? (Analytic functions)
- Appendix L: Cauchy sequences, Riemann rearrangement, Abel's theorem
- Appendix N: Zeta function, spectral bias, Banach fixed-point theorem
S.3 Prerequisites by Section
| Section | Requires |
|---|---|
| 2 Sequences | 04/01 Limits (epsilon-delta definition) |
| 3 Infinite Series | 2 (convergence of sequences) |
| 4 Convergence Tests | 3.1 (partial sums), 04/03 Integration (integral test) |
| 5 Power Series | 4 (convergence tests), especially ratio test |
| 6 Taylor Series | 04/02 Derivatives (all orders), 5 (power series) |
| 7 ML Applications | 6 (Taylor series), 04/03 (integration) |
| 8 Fourier Preview | 04/03 (integration, orthogonality integrals) |
| Appendix B | 5 (uniform convergence motivation) |
| Appendix C | Complex numbers (basic) |
| Appendix D | 6 (generating functions), basic probability |
| Appendix N | 6 (Taylor), 04/01 (limits), some complex analysis |