Training at scale is the mathematics of keeping the same language-model objective useful when the model, data, batch size, memory footprint, communication pattern, and wall-clock budget are all large enough to break naive code.
Overview
The previous section derived the next-token probability objective. This section asks what changes when that objective is optimized over billions of parameters and trillions of training tokens. The answer is not a new loss. The answer is a stack of constraints:
loss math -> optimizer state -> batch schedule -> memory budget -> parallelism -> communication -> checkpoint/restart -> validation signal
The central skill is quantitative accounting. You should be able to estimate memory before launching, compute effective batch size from the training configuration, reason about why ZeRO/FSDP helps, explain the pipeline bubble, compute approximate training FLOPs, and debug a run whose loss or throughput looks wrong.
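To make "quantitative accounting" concrete before the detailed pages, here is a minimal back-of-the-envelope sketch. The formulas are the standard approximations this chapter works through (effective batch size from micro-batch, accumulation, and data parallelism; the roughly 6 FLOPs per parameter per token rule of thumb; about 16 bytes per parameter for unsharded mixed-precision AdamW state). Every numeric value below is a made-up assumption for illustration, not a recommended configuration.

```python
# Back-of-the-envelope accounting for a hypothetical training run.
# All numbers below are illustrative assumptions, not a real configuration.

n_params = 7e9           # model size (parameters)
n_tokens = 1.4e12        # total training tokens
micro_batch = 4          # sequences per GPU per forward/backward pass
grad_accum = 8           # gradient accumulation steps
dp_degree = 64           # data-parallel replicas
seq_len = 4096           # tokens per sequence

# Effective batch size in tokens: micro-batch * accumulation * data parallelism * sequence length.
eff_batch_tokens = micro_batch * grad_accum * dp_degree * seq_len

# Rough compute estimate: ~6 FLOPs per parameter per token (forward + backward).
train_flops = 6 * n_params * n_tokens

# Unsharded mixed-precision AdamW keeps roughly:
#   2 bytes (bf16/fp16 weights) + 2 (grads) + 4 (fp32 master copy) + 4 + 4 (Adam moments)
# = ~16 bytes per parameter, before counting activations.
state_bytes_per_param = 16
state_gib = n_params * state_bytes_per_param / 2**30

print(f"effective batch: {eff_batch_tokens:,} tokens per optimizer step")
print(f"training compute: {train_flops:.2e} FLOPs")
print(f"param + grad + optimizer state: {state_gib:,.0f} GiB (unsharded)")
```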
Prerequisites
- Cross-entropy and next-token probability from 05-Language-Model-Probability
- Gradients, vector norms, and stochastic optimization
- Matrix multiplication shapes from attention and MLP blocks
- Basic probability for mini-batch gradient variance
- Familiarity with GPU memory and distributed workers is useful, but not required
Companion Notebooks
| Notebook | Purpose |
|---|---|
| theory.ipynb | Executable demos for AdamW, clipping, schedules, memory accounting, ZeRO stages, pipeline bubbles, FLOPs, MFU, and checkpoint overhead. |
| exercises.ipynb | Ten practice problems that make you compute the quantities used in real training-plan reviews. |
Learning Objectives
After this section, you should be able to:
- Explain why training memory is larger than inference memory.
- Derive Adam and AdamW updates with bias correction.
- Compute effective batch size under data parallelism and gradient accumulation.
- Build warmup and cosine learning-rate schedules.
- Estimate parameter, gradient, optimizer, and activation memory.
- Compare data, tensor, pipeline, sequence, and sharded parallelism.
- Compute pipeline bubble fraction and approximate all-reduce cost.
- Estimate training FLOPs and interpret MFU (previewed, together with the pipeline bubble, in the sketch after this list).
- Explain the practical meaning of Kaplan-style and Chinchilla-style scaling laws.
- Build a debugging checklist for loss spikes, bad resumes, low throughput, and masking errors.
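Two of these objectives reduce to one-line formulas worth previewing. The sketch below uses the standard GPipe-style bubble estimate (p - 1) / (m + p - 1) and defines MFU as achieved model FLOP/s divided by aggregate hardware peak; every numeric value (model size, step time, GPU count, peak throughput, pipeline depth) is an illustrative assumption, not a measurement.

```python
# Illustrative values only: a hypothetical model, cluster, and pipeline.

n_params = 7e9               # model parameters
tokens_per_step = 4e6        # tokens processed per optimizer step
step_time_s = 5.0            # assumed measured wall-clock time per step, seconds
peak_flops_per_s = 3.12e14   # assumed bf16 peak per GPU, FLOP/s
n_gpus = 256                 # assumed cluster size

# Achieved model FLOP/s from the ~6 FLOPs-per-parameter-per-token rule of thumb.
achieved_flops_per_s = 6 * n_params * tokens_per_step / step_time_s

# Model FLOPs utilization: achieved divided by aggregate hardware peak.
mfu = achieved_flops_per_s / (peak_flops_per_s * n_gpus)

# GPipe-style pipeline bubble: fraction of time pipeline stages sit idle.
pipeline_stages = 8   # p
micro_batches = 32    # m, microbatches per optimizer step
bubble_fraction = (pipeline_stages - 1) / (micro_batches + pipeline_stages - 1)

print(f"achieved: {achieved_flops_per_s:.2e} FLOP/s, MFU: {mfu:.1%}")
print(f"pipeline bubble fraction: {bubble_fraction:.1%}")
```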
Study Flow
- Read the pages in order and pause after each page to restate the main definition or theorem.
- Run theory.ipynb when you want to check the formulas numerically.
- Use exercises.ipynb after the reading path, not before it.
- Return to this overview page when you need the chapter-level navigation.