Training at scale is the mathematics of keeping the same language-model objective useful when the model, data, batch size, memory footprint, communication pattern, and wall-clock budget are all large enough to break naive code.
Overview
The previous section derived the next-token probability objective. This section asks what changes when that objective is optimized over billions of parameters and trillions of training tokens. The answer is not a new loss. The answer is a stack of constraints:
loss math -> optimizer state -> batch schedule -> memory budget -> parallelism -> communication -> checkpoint/restart -> validation signal
The central skill is quantitative accounting. You should be able to estimate memory before launching, compute effective batch size from the training configuration, reason about why ZeRO/FSDP helps, explain the pipeline bubble, compute approximate training FLOPs, and debug a run whose loss or throughput looks wrong.
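To make "quantitative accounting" concrete before the detailed pages, here is a minimal back-of-the-envelope sketch. The formulas are the standard approximations this chapter works through (effective batch size from micro-batch, accumulation, and data parallelism; the roughly 6 FLOPs per parameter per token rule of thumb; about 16 bytes per parameter for unsharded mixed-precision AdamW state). Every numeric value below is a made-up assumption for illustration, not a recommended configuration.

```python
# Back-of-the-envelope accounting for a hypothetical training run.
# All numbers below are illustrative assumptions, not a real configuration.

n_params = 7e9           # model size (parameters)
n_tokens = 1.4e12        # total training tokens
micro_batch = 4          # sequences per GPU per forward/backward pass
grad_accum = 8           # gradient accumulation steps
dp_degree = 64           # data-parallel replicas
seq_len = 4096           # tokens per sequence

# Effective batch size in tokens: micro-batch * accumulation * data parallelism * sequence length.
eff_batch_tokens = micro_batch * grad_accum * dp_degree * seq_len

# Rough compute estimate: ~6 FLOPs per parameter per token (forward + backward).
train_flops = 6 * n_params * n_tokens

# Unsharded mixed-precision AdamW keeps roughly:
#   2 bytes (bf16/fp16 weights) + 2 (grads) + 4 (fp32 master copy) + 4 + 4 (Adam moments)
# = ~16 bytes per parameter, before counting activations.
state_bytes_per_param = 16
state_gib = n_params * state_bytes_per_param / 2**30

print(f"effective batch: {eff_batch_tokens:,} tokens per optimizer step")
print(f"training compute: {train_flops:.2e} FLOPs")
print(f"param + grad + optimizer state: {state_gib:,.0f} GiB (unsharded)")
```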
Prerequisites
- Cross-entropy and next-token probability from 05-Language-Model-Probability
- Gradients, vector norms, and stochastic optimization
- Matrix multiplication shapes from attention and MLP blocks
- Basic probability for mini-batch gradient variance
- Familiarity with GPU memory and distributed workers is useful, but not required
Companion Notebooks
| Notebook | Purpose |
|---|---|
| theory.ipynb | Executable demos for AdamW, clipping, schedules, memory accounting, ZeRO stages, pipeline bubbles, FLOPs, MFU, and checkpoint overhead. |
| exercises.ipynb | Ten practice problems that make you compute the quantities used in real training-plan reviews. |
Learning Objectives
After this section, you should be able to:
- Explain why training memory is larger than inference memory.
- Derive Adam and AdamW updates with bias correction.
- Compute effective batch size under data parallelism and gradient accumulation.
- Build warmup and cosine learning-rate schedules.
- Estimate parameter, gradient, optimizer, and activation memory.
- Compare data, tensor, pipeline, sequence, and sharded parallelism.
- Compute pipeline bubble fraction and approximate all-reduce cost.
- Estimate training FLOPs and interpret MFU (previewed, together with the pipeline bubble, in the sketch after this list).
- Explain the practical meaning of Kaplan-style and Chinchilla-style scaling laws.
- Build a debugging checklist for loss spikes, bad resumes, low throughput, and masking errors.
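Two of these objectives reduce to one-line formulas worth previewing. The sketch below uses the standard GPipe-style bubble estimate (p - 1) / (m + p - 1) and defines MFU as achieved model FLOP/s divided by aggregate hardware peak; every numeric value (model size, step time, GPU count, peak throughput, pipeline depth) is an illustrative assumption, not a measurement.

```python
# Illustrative values only: a hypothetical model, cluster, and pipeline.

n_params = 7e9               # model parameters
tokens_per_step = 4e6        # tokens processed per optimizer step
step_time_s = 5.0            # assumed measured wall-clock time per step, seconds
peak_flops_per_s = 3.12e14   # assumed bf16 peak per GPU, FLOP/s
n_gpus = 256                 # assumed cluster size

# Achieved model FLOP/s from the ~6 FLOPs-per-parameter-per-token rule of thumb.
achieved_flops_per_s = 6 * n_params * tokens_per_step / step_time_s

# Model FLOPs utilization: achieved divided by aggregate hardware peak.
mfu = achieved_flops_per_s / (peak_flops_per_s * n_gpus)

# GPipe-style pipeline bubble: fraction of time pipeline stages sit idle.
pipeline_stages = 8   # p
micro_batches = 32    # m, microbatches per optimizer step
bubble_fraction = (pipeline_stages - 1) / (micro_batches + pipeline_stages - 1)

print(f"achieved: {achieved_flops_per_s:.2e} FLOP/s, MFU: {mfu:.1%}")
print(f"pipeline bubble fraction: {bubble_fraction:.1%}")
```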
Study Flow
- Read the pages in order and pause after each page to restate the main definition or theorem.
- Run theory.ipynb when you want to check the formulas numerically.
- Use exercises.ipynb after the reading path, not before it.
- Return to this overview page when you need the chapter-level navigation.