Exercises Notebook
Converted from exercises.ipynb for web reading.
Series and Sequences - Exercises
10 graded exercises covering the full section arc, from core calculus mechanics to ML-facing applications.
| Format | Description |
|---|---|
| Problem | Markdown cell with task description |
| Your Solution | Code cell for learner work |
| Solution | Reference solution with checks |
Difficulty: ★ straightforward -> ★★ moderate -> ★★★ challenging.
Code cell 2
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
try:
import seaborn as sns
sns.set_theme(style="whitegrid", palette="colorblind")
HAS_SNS = True
except ImportError:
plt.style.use("seaborn-v0_8-whitegrid")
HAS_SNS = False
mpl.rcParams.update({
"figure.figsize": (10, 6),
"figure.dpi": 120,
"font.size": 13,
"axes.titlesize": 15,
"axes.labelsize": 13,
"xtick.labelsize": 11,
"ytick.labelsize": 11,
"legend.fontsize": 11,
"legend.framealpha": 0.85,
"lines.linewidth": 2.0,
"axes.spines.top": False,
"axes.spines.right": False,
"savefig.bbox": "tight",
"savefig.dpi": 150,
})
np.random.seed(42)
print("Plot setup complete.")
Code cell 3
import numpy as np
import numpy.linalg as la
from scipy import integrate, special, stats
from math import factorial
import matplotlib.patches as patches
COLORS = {
"primary": "#0077BB",
"secondary": "#EE7733",
"tertiary": "#009988",
"error": "#CC3311",
"neutral": "#555555",
"highlight": "#EE3377",
}
HAS_MPL = True
np.set_printoptions(precision=8, suppress=True)
np.random.seed(42)
def header(title):
print("\n" + "=" * len(title))
print(title)
print("=" * len(title))
def check_true(name, cond):
ok = bool(cond)
print(f"{'PASS' if ok else 'FAIL'} - {name}")
return ok
def check_close(name, got, expected, tol=1e-8):
ok = np.allclose(got, expected, atol=tol, rtol=tol)
print(f"{'PASS' if ok else 'FAIL'} - {name}: got {got}, expected {expected}")
return ok
def centered_diff(f, x, h=1e-6):
return (f(x + h) - f(x - h)) / (2 * h)
def forward_diff(f, x, h=1e-6):
return (f(x + h) - f(x)) / h
def backward_diff(f, x, h=1e-6):
return (f(x) - f(x - h)) / h
def grad_check(f, x, analytic_grad, h=1e-6):
x = np.asarray(x, dtype=float)
analytic_grad = np.asarray(analytic_grad, dtype=float)
numeric_grad = np.zeros_like(x, dtype=float)
for idx in np.ndindex(x.shape):
x_plus = x.copy(); x_minus = x.copy()
x_plus[idx] += h; x_minus[idx] -= h
numeric_grad[idx] = (f(x_plus) - f(x_minus)) / (2 * h)
denom = la.norm(analytic_grad) + la.norm(numeric_grad) + 1e-12
return la.norm(analytic_grad - numeric_grad) / denom
def check(name, got, expected, tol=1e-8):
return check_close(name, got, expected, tol=tol)
print("Chapter helper setup complete.")
Exercise 1 (★): Geometric Series
The geometric series $\sum_{n=0}^{\infty} r^n = \frac{1}{1-r}$ for $|r| < 1$ is the foundation of Adam bias correction and reinforcement learning returns.
(a) Compute $\sum_{n=0}^{\infty} (2/3)^n$ using the closed form.
(b) Verify numerically that the partial sum $S_N = \frac{1-r^{N+1}}{1-r} \to \frac{1}{1-r}$ as $N \to \infty$.
(c) Compute the discounted return $G_0 = \sum_{t=0}^{\infty} \gamma^t$ for $\gamma = 0.95$ (constant reward = 1).
(d) For $r = -1/4$, compute $\sum_{n=0}^{\infty} r^n$ exactly, then verify with 1000 partial terms.
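For reference, the closed form used in part (b) follows from the telescoping trick: subtracting $rS_N$ from $S_N$ cancels every middle term,

$$S_N = \sum_{n=0}^{N} r^n, \qquad (1-r)\,S_N = 1 - r^{N+1} \;\Rightarrow\; S_N = \frac{1-r^{N+1}}{1-r},$$

and since $r^{N+1} \to 0$ for $|r| < 1$, $S_N \to \frac{1}{1-r}$.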
Code cell 5
# Your Solution
# Exercise 1 - learner workspace
# Write your solution here, then run the reference solution below to compare.
print("Learner workspace ready for Exercise 1.")
Code cell 6
# Solution
# Exercise 1 - reference solution
r = 2/3
# (a) Closed form: 1/(1-r)
sum_a = 1 / (1 - r)
# (b) Partial sum: (1 - r^(N+1))/(1 - r)
def partial_geom_sum(r, N):
return (1 - r**(N+1)) / (1 - r)
S_100 = partial_geom_sum(r, 100)
# (c) G0 = 1/(1-gamma)
gamma = 0.95
G0 = 1 / (1 - gamma)
# (d)
r_alt = -1/4
sum_d = 1 / (1 - r_alt) # = 4/5 = 0.8
partial_d = sum(r_alt**n for n in range(1000))
header('Exercise 1: Geometric Series')
print(f'(a) sum (2/3)^n = {sum_a:.8f}')
check_close('(a) = 3.0', sum_a, 3.0)
check_close('(b) S_100 approx 3', S_100, 3.0, tol=1e-6)
print(f'(c) G0 = 1/(1-0.95) = {G0:.4f}')
check_close('(c) G0 = 20', G0, 20.0)
check_close('(d) exact = 0.8', sum_d, 0.8)
check_close('(d) partial sum converges to 0.8', partial_d, 0.8, tol=1e-6)
print("\nTakeaway: The geometric series formula 1/(1-r) is the foundation "
"of Adam's bias correction and discounted RL returns.")
print("Exercise 1 solution complete.")
Exercise 2 (★): Convergence Tests
Determine whether each series converges or diverges. Implement a function that computes partial sums and visually confirms convergence or divergence.
(a) $\sum_{n=1}^{\infty} \frac{1}{n^{1.5}}$ — apply the p-series test
(b) $\sum_{n=1}^{\infty} \frac{n}{e^n}$ — apply the ratio test
(c) $\sum_{n=1}^{\infty} \frac{(-1)^n}{\sqrt{n}}$ — apply the alternating series test
(d) $\sum_{n=1}^{\infty} \frac{n!}{2^n}$ — apply the ratio test
For each, compute the first 100 partial sums and state the verdict.
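As a reminder, the ratio test computes $L$ from consecutive terms:

$$L = \lim_{n\to\infty}\left|\frac{a_{n+1}}{a_n}\right|, \qquad L < 1 \Rightarrow \text{converges absolutely}, \quad L > 1 \Rightarrow \text{diverges}, \quad L = 1 \Rightarrow \text{inconclusive}.$$

The $L = 1$ case is why (a) needs the p-series test rather than the ratio test.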
Code cell 8
# Your Solution
# Exercise 2 - learner workspace
# Write your solution here, then run the reference solution below to compare.
print("Learner workspace ready for Exercise 2.")
Code cell 9
# Solution
# Exercise 2 - reference solution
def partial_sums(a_fn, N):
s = 0.0
ps = []
for n in range(1, N+1):
s += a_fn(n)
ps.append(s)
return ps
from math import factorial, exp, sqrt
a_fn_a = lambda n: 1 / n**1.5
a_fn_b = lambda n: n / exp(n)
a_fn_c = lambda n: (-1)**n / sqrt(n)
a_fn_d = lambda n: factorial(n) / 2**n
S_a = partial_sums(a_fn_a, 100)
S_b = partial_sums(a_fn_b, 30)
S_c = partial_sums(a_fn_c, 100)
S_d = partial_sums(a_fn_d, 20)
# Ratio test for (b)
ratios_b = [abs(a_fn_b(n+1)/a_fn_b(n)) for n in range(1, 20)]
# Ratio test for (d)
ratios_d = [abs(a_fn_d(n+1)/a_fn_d(n)) for n in range(1, 15)]
header('Exercise 2: Convergence Tests')
print(f'(a) sum 1/n^1.5: S_100 = {S_a[-1]:.6f} (converges, p=1.5>1)')
check_true('(a) converges (S_100 < 5)', S_a[-1] < 5 and S_a[-1] > 2)
print(f'(b) sum n/e^n: S_30 = {S_b[-1]:.8f}, ratio -> {ratios_b[-1]:.6f} < 1 (converges)')
check_true('(b) ratio test L < 1', ratios_b[-1] < 0.5)
print(f'(c) sum (-1)^n/sqrt(n): S_100 = {S_c[-1]:.6f} (converges by AST)')
check_true('(c) converges (|S| < 1)', abs(S_c[-1]) < 1)
print(f'(d) sum n!/2^n: S_20 last term = {a_fn_d(20):.2e} -> inf (diverges by ratio L->inf)')
check_true('(d) diverges (ratio -> inf)', ratios_d[-1] > 5)
print('\nTakeaway: The ratio test is most powerful for factorials; '
'p-series works directly; alternating series test handles signed series.')
print("Exercise 2 solution complete.")
Exercise 3 (★): Radius of Convergence
The radius of convergence $R$ determines where a power series $\sum_n c_n x^n$ is valid.
(a) Find $R$ for $\sum_{n=0}^{\infty} \frac{x^n}{3^n}$ using the ratio method.
(b) Find $R$ and the full interval of convergence for $\sum_{n=1}^{\infty} \frac{(-1)^n x^n}{n}$. Check both endpoints.
(c) Show that $\sum_{n=1}^{\infty} n!\, x^n$ has $R = 0$.
(d) For the power series for $\ln(1+x) = \sum_{n=1}^{\infty} \frac{(-1)^{n+1} x^n}{n}$, verify numerically that the series converges at $x = 1$ but diverges at $x = -1$.
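The ratio method used in (a)-(c) comes from applying the ratio test to $\sum_n c_n x^n$: successive terms shrink exactly when

$$\lim_{n\to\infty}\left|\frac{c_{n+1}\,x^{n+1}}{c_n\,x^n}\right| < 1 \iff |x| < R = \lim_{n\to\infty}\left|\frac{c_n}{c_{n+1}}\right|.$$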
Code cell 11
# Your Solution
# Exercise 3 - learner workspace
# Write your solution here, then run the reference solution below to compare.
print("Learner workspace ready for Exercise 3.")
Code cell 12
# Solution
# Exercise 3 - reference solution
def radius_ratio(coeffs):
"""R = lim |c_n / c_{n+1}|"""
ratios = [abs(coeffs[n]) / abs(coeffs[n+1])
for n in range(len(coeffs)-1) if coeffs[n+1] != 0]
return ratios[-1]
# (a) c_n = 1/3^n; R = lim |(1/3^n)/(1/3^(n+1))| = 3
c_a = [1/3**n for n in range(30)]
R_a = radius_ratio(c_a)
# (b) c_n = (-1)^n/n; R = lim |(-1)^n/n| / |(-1)^(n+1)/(n+1)| = (n+1)/n -> 1
c_b = [(-1)**n / n for n in range(1, 2001)]  # many terms: ratio (n+1)/n -> 1 only slowly
R_b = radius_ratio(c_b)
# (c) c_n = n!; R = lim |c_n/c_{n+1}| = lim n!/(n+1)! = lim 1/(n+1) = 0, so R = 0
c_c = [float(factorial(n)) for n in range(1, 15)]
R_c_ratios = [c_c[n]/c_c[n+1] for n in range(len(c_c)-1)]
# (d) ln(1+x) series
def log1p_series(x, N):
return sum((-1)**(n+1) * x**n / n for n in range(1, N+1))
S_x1 = log1p_series(1.0, 1000) # should converge to ln(2)
S_xm1 = log1p_series(-1.0, 1000) # should diverge (harmonic)
header('Exercise 3: Radius of Convergence')
print(f'(a) R = {R_a:.4f} (should be 3.0)')
check_close('(a) R = 3', R_a, 3.0)
print(f'(b) R = {R_b:.4f} (should be 1.0)')
check_close('(b) R = 1', R_b, 1.0, tol=1e-3)  # (n+1)/n converges to 1 slowly, so loosen tolerance
print(f'(c) ratios c_n/c_(n+1) -> {R_c_ratios[-1]:.6f} (tend to 0, so R = 0)')
check_true('(c) R -> 0', R_c_ratios[-1] < 0.1)
print(f'(d) At x=1: S_1000 = {S_x1:.8f}, ln(2) = {np.log(2):.8f}')
check_close('(d) ln(1+1) via series', S_x1, np.log(2), tol=1e-3)
print(f'(d) At x=-1: S_1000 = {S_xm1:.4f} (harmonic series, diverges to -inf)')
check_true('(d) x=-1 diverges (S < -5)', S_xm1 < -5)
print('\nTakeaway: Radius of convergence from ratio test; must check endpoints separately.')
print("Exercise 3 solution complete.")
Exercise 4 (★★): Taylor Series Derivation
Derive and verify the Maclaurin series for elementary functions.
(a) Derive the series for $e^{-x}$ by computing $f^{(n)}(0)$.
(b) Starting from $e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}$, substitute $x \to x^2$ to get $e^{x^2}$, then $x \to -x^2$ to get $e^{-x^2}$.
(c) Integrate the series for $e^{-t^2}$ term-by-term from 0 to $x$ to get the series for $\int_0^x e^{-t^2}\,dt$.
(d) How many terms are needed to compute $\int_0^{0.5} e^{-t^2}\,dt$ to within $10^{-8}$?
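Part (c) in one line: inside the radius of convergence a power series may be integrated term by term,

$$\int_0^x e^{-t^2}\,dt = \int_0^x \sum_{n=0}^{\infty} \frac{(-1)^n t^{2n}}{n!}\,dt = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{n!\,(2n+1)}.$$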
Code cell 14
# Your Solution
# Exercise 4 - learner workspace
# Write your solution here, then run the reference solution below to compare.
print("Learner workspace ready for Exercise 4.")
Code cell 15
# Solution
# Exercise 4 - reference solution
import numpy as np
from math import factorial
# (a) e^(-x) = sum (-1)^n x^n / n!
def taylor_exp_neg_x(x, N):
return sum((-1)**n * x**n / factorial(n) for n in range(N))
# (b) e^(x^2) = sum x^(2n)/n! and e^(-x^2) = sum (-1)^n x^(2n)/n!
def taylor_exp_x_squared(x, N):
return sum(x**(2*n) / factorial(n) for n in range(N))
def taylor_exp_neg_x_squared(x, N):
return sum((-1)**n * x**(2*n) / factorial(n) for n in range(N))
# (c) Integrate e^(-t^2) = sum (-1)^n t^(2n)/n! term by term:
# integral_0^x = sum (-1)^n x^(2n+1) / (n! * (2n+1))
def integral_exp_neg_t_squared(x, N):
return sum((-1)**n * x**(2*n+1) / (factorial(n) * (2*n+1)) for n in range(N))
# (d) True value = sqrt(pi)/2 * erf(0.5)
x0 = 0.5
try:
from scipy.special import erf
true_val = np.sqrt(np.pi)/2 * erf(x0)
except ImportError:  # SciPy unavailable; fall back to a precomputed approximation
true_val = 0.46128 # approx
# Find minimum N
for N in range(1, 30):
approx = integral_exp_neg_t_squared(x0, N)
err = abs(approx - true_val)
if err < 1e-8:
min_N = N
break
header('Exercise 4: Taylor Series Derivation')
check_close('(a) e^(-0.5)', taylor_exp_neg_x(0.5, 20), np.exp(-0.5))
check_close('(b) e^(0.5^2)', taylor_exp_x_squared(0.5, 20), np.exp(0.25))
check_close('(b) e^(-0.5^2)', taylor_exp_neg_x_squared(0.5, 20), np.exp(-0.25))
print(f'(c) integral_0^0.5 e^(-t^2) dt (N=10): '
f'{integral_exp_neg_t_squared(0.5, 10):.10f}')
check_close('(c) matches true value', integral_exp_neg_t_squared(0.5, 10), true_val)
print(f'(d) Minimum N for error < 1e-8: N = {min_N}')
check_true('(d) error < 1e-8', abs(integral_exp_neg_t_squared(x0, min_N) - true_val) < 1e-8)
print('\nTakeaway: Substitution and term-by-term integration produce new series; '
'alternating series error bound = first omitted term.')
print("Exercise 4 solution complete.")
Exercise 5 (★★): Lagrange Remainder and Error Bounds
(a) Write the 4th-order Taylor polynomial $T_4(x)$ for $\ln(1+x)$ at $a = 0$.
(b) Use the Lagrange remainder to bound the error $|\ln(1.2) - T_4(0.2)|$. (Hint: $f^{(5)}(x) = \frac{4!}{(1+x)^5}$; bound $M$ on $[0, 0.2]$.)
(c) Compute the actual error and confirm the bound holds.
(d) How many terms of the series are needed to compute $\ln(1.1)$ to 10 decimal places? (Hint: alternating series error $\le$ first omitted term.)
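The Lagrange form of the remainder, for some $\xi$ between $a$ and $x$:

$$R_n(x) = f(x) - T_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\,(x-a)^{n+1}, \qquad |R_n(x)| \le \frac{M}{(n+1)!}\,|x-a|^{n+1},$$

where $M$ bounds $|f^{(n+1)}|$ on the interval between $a$ and $x$.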
Code cell 17
# Your Solution
# Exercise 5 - learner workspace
# Write your solution here, then run the reference solution below to compare.
print("Learner workspace ready for Exercise 5.")
Code cell 18
# Solution
# Exercise 5 - reference solution
import numpy as np
from math import factorial
# (a) T4(x) = x - x^2/2 + x^3/3 - x^4/4
def T4_log(x):
return x - x**2/2 + x**3/3 - x**4/4
# (b) f^(5)(x) = 4!/(1+x)^5; M = 24 at x=0 (max on [0,0.2])
x0 = 0.2
n = 4
M = factorial(4) # = 24, since f^(5)(x) = 4!/(1+x)^5 <= 24 on [0, 0.2]
lagrange_bound = M * x0**(n+1) / factorial(n+1) # M * 0.2^5 / 5!
# (c) Actual error
actual_err = abs(np.log(1.2) - T4_log(x0))
# (d) Alternating series: |error| <= |first omitted term| = x^(N+1)/(N+1)
x_target = 0.1
target_tol = 5e-11 # 10 decimal places
min_N = 1
while x_target**(min_N+1) / (min_N+1) >= target_tol:
min_N += 1
header('Exercise 5: Lagrange Remainder')
print(f'(a) T4(0.2) = {T4_log(x0):.10f}')
print(f' ln(1.2) = {np.log(1.2):.10f}')
print(f'(b) Lagrange bound = {lagrange_bound:.4e}')
print(f'(c) Actual error = {actual_err:.4e}')
check_true('(c) actual error <= Lagrange bound', actual_err <= lagrange_bound)
print(f'(d) Min N for 10 decimal places: N = {min_N}')
# Verify
approx_ln11 = sum((-1)**(n+1) * 0.1**n / n for n in range(1, min_N+1))
check_close('(d) ln(1.1) to 10 places', approx_ln11, np.log(1.1), tol=1e-10)
print('\nTakeaway: Lagrange remainder gives guaranteed error bounds; '
'for alternating series, error <= first omitted term (often tighter).')
print("Exercise 5 solution complete.")
Exercise 6 (★★): GELU Taylor Analysis
The GELU activation is $\operatorname{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the Gaussian CDF; equivalently, $\operatorname{GELU}(x) = \frac{x}{2}\bigl(1 + \operatorname{erf}(x/\sqrt{2})\bigr)$.
(a) Using $e^{-t^2} = \sum_{n=0}^{\infty} \frac{(-1)^n t^{2n}}{n!}$ and term-by-term integration, derive the Maclaurin series for $\operatorname{erf}(x)$.
(b) Compute the first 4 non-zero terms of the GELU series via $\operatorname{GELU}(x) = \frac{x}{2}\bigl(1 + \operatorname{erf}(x/\sqrt{2})\bigr)$.
(c) Compare GELU, its Taylor approximation (4 terms), and ReLU for $x \in [-3, 3]$.
(d) At what $|x|$ does the 4-term approximation exceed 1% relative error?
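The target identity for part (a), obtained by integrating the exponential series term by term (the same step as Exercise 4c) and multiplying by the normalizing constant:

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt = \frac{2}{\sqrt{\pi}} \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{n!\,(2n+1)}.$$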
Code cell 20
# Your Solution
# Exercise 6 - learner workspace
# Write your solution here, then run the reference solution below to compare.
print("Learner workspace ready for Exercise 6.")
Code cell 21
# Solution
# Exercise 6 - reference solution
import numpy as np
from math import factorial, sqrt, pi
try:
from scipy.special import erf as scipy_erf
HAS_SCIPY = True
except ImportError:
HAS_SCIPY = False
# (a) erf(x) = (2/sqrt(pi)) * sum (-1)^n x^(2n+1) / (n! * (2n+1))
def erf_series(x, N):
coeff = 2 / sqrt(pi)
return coeff * sum((-1)**n * x**(2*n+1) / (factorial(n) * (2*n+1)) for n in range(N))
# (b) GELU(x) = x * 0.5 * (1 + erf(x/sqrt(2)))
def gelu_series(x, N):
return x * 0.5 * (1 + erf_series(x / sqrt(2), N))
# (c) Comparison
x_arr = np.linspace(-3, 3, 300)
if HAS_SCIPY:
gelu_exact = x_arr * 0.5 * (1 + scipy_erf(x_arr / sqrt(2)))
else:
gelu_exact = np.vectorize(lambda x: gelu_series(x, 30))(x_arr)
gelu_4terms = np.vectorize(lambda x: gelu_series(x, 4))(x_arr)
relu = np.maximum(0, x_arr)
if HAS_MPL:
plt.figure(figsize=(9, 5))
plt.plot(x_arr, gelu_exact, 'b-', lw=2, label='GELU (exact)')
plt.plot(x_arr, gelu_4terms, 'r--', lw=2, label='GELU (4-term Taylor)')
plt.plot(x_arr, relu, 'g:', lw=2, label='ReLU')
plt.legend(); plt.title('GELU vs Taylor Approximation vs ReLU')
plt.xlabel('x'); plt.show()
# (d) Find 1% crossover
x_test = np.linspace(0, 3, 1000)
gelu_ex = np.vectorize(lambda x: gelu_series(x, 30))(x_test)
gelu_4 = np.vectorize(lambda x: gelu_series(x, 4))(x_test)
rel_err = np.abs(gelu_ex - gelu_4) / (np.abs(gelu_ex) + 1e-10)
crossover = x_test[rel_err > 0.01][0] if (rel_err > 0.01).any() else float('inf')
header('Exercise 6: GELU Taylor Analysis')
check_close('(a) erf(0.5)', erf_series(0.5, 15),
scipy_erf(0.5) if HAS_SCIPY else 0.5205, tol=1e-6)
check_close('(b) GELU(1.0)', gelu_series(1.0, 15),
            0.8413447461, tol=1e-4)  # GELU(1) = Phi(1) ~ 0.8413447461; x_arr's grid skips x = 1.0 exactly
print(f'(d) 1% relative error crossover at |x| ≈ {crossover:.2f}')
check_true('(d) crossover > 1', crossover > 1.0)
print('\nTakeaway: GELU is smooth everywhere; its Taylor series works well for |x| < ~2.')
print("Exercise 6 solution complete.")
Exercise 7 (★★★): Linear Attention via Taylor Approximation
Standard attention computes $\operatorname{softmax}\!\bigl(QK^\top/\sqrt{d}\bigr)V$, which costs $O(n^2 d)$ for sequence length $n$. A key idea in linear attention is to approximate $e^{q^\top k}$ using a feature map $\phi$.
(a) Show that the 1st-order Taylor expansion of $e^{q^\top k}$ gives $1 + q^\top k$. What is the error (2nd-order remainder)?
(b) Using the approximation $e^{q^\top k} \approx 1 + q^\top k$, rewrite un-normalized attention as $\sum_j (1 + q^\top k_j)\, v_j$. Show this decomposes into $\sum_j v_j + q^\top \bigl(\sum_j k_j v_j^\top\bigr)$ — meaning $\sum_j v_j$ and $K^\top V$ can be precomputed once and reused for every query.
(c) Simulate: generate random $Q, K, V$, compute exact softmax attention and the linear approximation, and measure the approximation error.
(d) How does approximation error scale with sequence length $n$ and dimension $d$?
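For part (a), expanding $e^{s}$ about $s = 0$ with the Lagrange remainder gives, for some $\xi$ between $0$ and $s$,

$$e^{s} = 1 + s + \frac{e^{\xi}}{2}\,s^{2},$$

so with $s = q^\top k$ the linear approximation's error is $O(s^2)$ and degrades quickly once attention scores leave the small-score regime.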
Code cell 23
# Your Solution
# Exercise 7 - learner workspace
# Write your solution here, then run the reference solution below to compare.
print("Learner workspace ready for Exercise 7.")
Code cell 24
# Solution
# Exercise 7 - reference solution
import numpy as np
def softmax(x, axis=-1):
x = x - x.max(axis=axis, keepdims=True)
e = np.exp(x)
return e / e.sum(axis=axis, keepdims=True)
def exact_attention(Q, K, V):
d = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d)
weights = softmax(scores, axis=-1)
return weights @ V
def linear_attention(Q, K, V):
    # (a)/(b): e^s ~ 1 + s with s = q^T k / sqrt(d), matching the scaled
    # scores used by exact_attention => unnorm attn = sum_j (1 + s_j) v_j
    n, d = Q.shape
    Qs = Q / np.sqrt(d)  # same 1/sqrt(d) scaling as exact_attention
    # Kernel: phi(x) = [1, x] (concat constant 1)
    # Un-normalized: sum_j v_j + Qs @ (K^T V)
    sum_v = V.sum(axis=0, keepdims=True)  # (1, d)
    KtV = K.T @ V                         # (d, d)
    unnorm = sum_v + Qs @ KtV             # (n, d)
    # Normalize by the kernel weights: sum_j (1 + s_j) = n + q^T sum(k_j)/sqrt(d)
    sum_k = K.sum(axis=0, keepdims=True)  # (1, d)
    norm = n + (Qs @ sum_k.T)             # (n, 1)
    return unnorm / norm
n, d = 32, 16
np.random.seed(42)
Q = np.random.randn(n, d) / np.sqrt(d)
K = np.random.randn(n, d) / np.sqrt(d)
V = np.random.randn(n, d)
exact = exact_attention(Q, K, V)
linear = linear_attention(Q, K, V)
err = np.abs(exact - linear).max()
# (d) Scale with n and d
errors = {}
for nn in [16, 32, 64]:
Q2 = np.random.randn(nn, d) / np.sqrt(d)
K2 = np.random.randn(nn, d) / np.sqrt(d)
V2 = np.random.randn(nn, d)
e2 = np.abs(exact_attention(Q2, K2, V2) - linear_attention(Q2, K2, V2)).mean()
errors[nn] = e2
header('Exercise 7: Linear Attention via Taylor')
print(f'Max abs error (n=32, d=16): {err:.4f}')
check_true('exact and linear have same shape', exact.shape == linear.shape)
print('Error vs n:', {k: f'{v:.4f}' for k, v in errors.items()})
check_true('error < 0.5 for small n, d', err < 0.5)
print('\nTakeaway: 1st-order Taylor of exp enables O(nd^2) attention, but accuracy '
'degrades for large sequence lengths and sharp attention distributions.')
print("Exercise 7 solution complete.")
Exercise 8 (★★★): Adam Bias Correction via Geometric Series
Adam maintains exponential moving averages: $m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$. Starting from $m_0 = 0$, unrolling gives $m_t = (1-\beta_1) \sum_{i=1}^{t} \beta_1^{t-i}\, g_i$.
(a) If all gradients equal $g$ (constant), show that $m_t = g\,(1-\beta_1^t)$. (Use the geometric series formula.)
(b) Hence $m_t$ underestimates $g$ by the factor $1-\beta_1^t$, so the bias-corrected estimate is $\hat{m}_t = m_t / (1-\beta_1^t)$. Verify numerically for $\beta_1 = 0.9$, $t \in \{1, 5, 10, 100\}$.
(c) Plot the bias factor $1-\beta^t$ for $\beta \in \{0.9, 0.99, 0.999\}$ over $t = 1, \dots, 50$. Explain why $\beta = 0.999$ requires many more steps to 'warm up'.
(d) Implement a minimal Adam update step and verify loss decreases on the quadratic $f(\theta) = \|\theta\|^2$.
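For part (a), substitute the constant gradient $g_i = g$ into the unrolled sum and apply the geometric series formula:

$$m_t = (1-\beta_1) \sum_{i=1}^{t} \beta_1^{t-i}\, g = g\,(1-\beta_1)\,\frac{1-\beta_1^{t}}{1-\beta_1} = g\,\bigl(1-\beta_1^{t}\bigr).$$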
Code cell 26
# Your Solution
# Exercise 8 - learner workspace
# Write your solution here, then run the reference solution below to compare.
print("Learner workspace ready for Exercise 8.")
Code cell 27
# Solution
# Exercise 8 - reference solution
import numpy as np
# (a) Geometric series: sum_{i=1}^t beta^(t-i) = (1-beta^t)/(1-beta)
# m_t = (1-beta1) * g * (1-beta1^t)/(1-beta1) = g*(1-beta1^t)
def bias_factor(beta, t):
return 1 - beta**t
beta1 = 0.9
print('(b) Bias factors for beta1=0.9:')
for t in [1, 5, 10, 100]:
bf = bias_factor(beta1, t)
print(f' t={t:3d}: 1-beta^t = {bf:.6f}, correction = {1/bf:.4f}')
if HAS_MPL:
t_arr = np.arange(1, 51)
plt.figure(figsize=(9, 4))
for beta, color in zip([0.9, 0.99, 0.999], ['blue', 'orange', 'red']):
plt.plot(t_arr, [bias_factor(beta, t) for t in t_arr],
color=color, label=f'beta={beta}')
plt.axhline(0.99, ls='--', color='gray', label='99% warmup')
plt.xlabel('Step t'); plt.ylabel('Bias factor 1-β^t')
plt.title('Adam Bias Correction Warmup'); plt.legend(); plt.show()
# (d) Minimal Adam on f(theta) = ||theta||^2
def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
m = beta1 * m + (1 - beta1) * grad
v = beta2 * v + (1 - beta2) * grad**2
m_hat = m / (1 - beta1**t)
v_hat = v / (1 - beta2**t)
theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
return theta, m, v
theta = np.array([3.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
losses = []
for t in range(1, 201):
grad = 2 * theta # gradient of ||theta||^2
theta, m, v = adam_step(theta, grad, m, v, t)
losses.append(np.sum(theta**2))
header('Exercise 8: Adam Bias Correction')
check_close('(a) bias_factor(0.9, 1)', bias_factor(0.9, 1), 0.1)
check_close('(a) bias_factor(0.9, 100)', bias_factor(0.9, 100), 1.0, tol=1e-5)
check_true('(d) loss decreased', losses[-1] < losses[0])
check_true('(d) converged to near 0', losses[-1] < 1e-4)
print(f'(d) Initial loss: {losses[0]:.4f}, Final loss: {losses[-1]:.2e}')
print('\nTakeaway: Adam bias correction = dividing by (1-beta^t), derived directly '
'from geometric series; large beta requires many warmup steps.')
print("Exercise 8 solution complete.")
Exercise 9 (★★★): Neumann Series for an Inverse
If $\|A\| < 1$, then

$$(I - A)^{-1} = \sum_{k=0}^{\infty} A^k = I + A + A^2 + \cdots$$

Implement truncated Neumann sums and verify that the approximation improves as the number of terms grows.
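Why truncation converges: multiplying the partial sum by $(I-A)$ telescopes,

$$(I-A)\sum_{k=0}^{K} A^{k} = I - A^{K+1},$$

so the truncation error is governed by $\|A^{K+1}\| \le \|A\|^{K+1}$, which decays geometrically when $\|A\| < 1$.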
Code cell 29
# Your Solution
# Exercise 9 - learner workspace
# Build partial sums I + A + ... + A^K and compare with np.linalg.inv(I-A).
print("Learner workspace ready for Exercise 9.")
Code cell 30
# Solution
# Exercise 9 - Neumann inverse series
header("Exercise 9: Neumann series")
A = np.array([[0.20, 0.05], [-0.10, 0.15]])
I = np.eye(2)
true_inv = la.inv(I - A)
def neumann_inverse(A, K):
S = np.eye(A.shape[0])
power = np.eye(A.shape[0])
for _ in range(1, K + 1):
power = power @ A
S = S + power
return S
prev = np.inf
for K in [1, 2, 4, 8, 16]:
approx = neumann_inverse(A, K)
err = la.norm(approx - true_inv, ord='fro')
print(f"K={K:2d}: Frobenius error={err:.3e}")
check_true("error is decreasing", err < prev)
prev = err
print("Takeaway: many iterative solvers and residual refinements are controlled infinite series.")
Exercise 10 (★★★): Taylor Approximation of GELU
The GELU activation can be written $\operatorname{GELU}(x) = x\,\Phi(x)$, where $\Phi$ is the standard normal CDF. Approximate GELU near zero by fitting a polynomial and compare the approximation error on a small interval.
Code cell 32
# Your Solution
# Exercise 10 - learner workspace
# Fit a polynomial approximation to GELU around zero and inspect its error.
print("Learner workspace ready for Exercise 10.")
Code cell 33
# Solution
# Exercise 10 - GELU polynomial approximation
header("Exercise 10: GELU polynomial approximation")
def gelu(x):
return x * stats.norm.cdf(x)
xs_fit = np.linspace(-1.0, 1.0, 201)
ys_fit = gelu(xs_fit)
coef = np.polyfit(xs_fit, ys_fit, deg=5)
poly = np.poly1d(coef)
xs_test = np.linspace(-1.25, 1.25, 251)
err = np.max(np.abs(poly(xs_test) - gelu(xs_test)))
print("degree-5 coefficients:", coef)
print(f"max error on [-1.25,1.25]: {err:.6e}")
check_true("degree-5 local approximation is accurate", err < 2e-3)
print("Takeaway: activation approximations are series ideas made computational.")
What to Review After Finishing
- Geometric series — closed form and partial sum
- Convergence tests — know when to apply ratio, integral, comparison, alternating tests
- Radius of convergence — formula $R = \lim_{n\to\infty} |c_n / c_{n+1}|$; behavior at endpoints
- Taylor series — derivation from $c_n = f^{(n)}(a)/n!$; standard series table
- Lagrange remainder — guaranteeing error bounds, not just computing approximations
- ML connections — Adam (geometric series), linear attention (Taylor of exp), GELU (erf Taylor)
References
- Rudin, Principles of Mathematical Analysis, Ch. 3 (sequences and series) and Ch. 8 (power series)
- Apostol, Calculus Vol. 1, Ch. 11–12 (power series, Taylor series)
- Kingma & Ba, Adam: A Method for Stochastic Optimization (2015) — geometric series in §2
- Katharopoulos et al., Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (2020) — linear attention via Taylor of exp
- Hendrycks & Gimpel, Gaussian Error Linear Units (GELUs) (2016)