Training at scale is the mathematics of keeping the same language-model objective useful when the model, data, batch size, memory footprint, communication pattern, and wall-clock budget are all large enough to break naive code.

Overview

The previous section derived the next-token probability objective. This section asks what changes when that objective is optimized over billions of parameters and trillions of training tokens. The answer is not a new loss. The answer is a stack of constraints:

loss math -> optimizer state -> batch schedule -> memory budget -> parallelism -> communication -> checkpoint/restart -> validation signal

The central skill is quantitative accounting. You should be able to estimate memory before launching, compute effective batch size from the training configuration, reason about why ZeRO/FSDP helps, explain the pipeline bubble, compute approximate training FLOPs, and debug a run whose loss or throughput looks wrong.
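
A minimal back-of-envelope sketch of that accounting, in plain Python. Every configuration value below (micro-batch size, accumulation steps, rank count, parameter count, token budget, hardware peak, and the 40% utilization figure) is an illustrative assumption, not a number prescribed by this chapter:

```python
# Back-of-envelope accounting for a hypothetical training configuration.
# Every number below is an illustrative assumption, not a value from this chapter.

micro_batch_size = 4        # sequences per GPU per forward/backward pass (assumed)
grad_accum_steps = 8        # micro-batches accumulated before an optimizer step (assumed)
data_parallel_ranks = 64    # GPUs contributing gradients to each step (assumed)
seq_len = 4096              # tokens per sequence (assumed)

# Effective batch size: everything that contributes to one optimizer step.
effective_batch_sequences = micro_batch_size * grad_accum_steps * data_parallel_ranks
effective_batch_tokens = effective_batch_sequences * seq_len

# Approximate training compute: C ≈ 6ND (about 2ND forward, 4ND backward).
N = 7e9     # parameter count (assumed)
D = 1e12    # training tokens (assumed)
C = 6 * N * D

# MFU is achieved FLOP/s divided by hardware peak; inverting it gives a
# wall-clock estimate at an assumed 40% utilization.
peak_flops_per_gpu = 312e12   # assumed bf16 peak for the accelerator
num_gpus = data_parallel_ranks
seconds = C / (0.40 * peak_flops_per_gpu * num_gpus)

print(f"effective batch: {effective_batch_sequences} sequences, {effective_batch_tokens:,} tokens")
print(f"training FLOPs: {C:.2e}, wall clock at 40% MFU: {seconds / 86400:.1f} days")
```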

Prerequisites

  • Cross-entropy and next-token probability from 05-Language-Model-Probability
  • Gradients, vector norms, and stochastic optimization
  • Matrix multiplication shapes from attention and MLP blocks
  • Basic probability for mini-batch gradient variance
  • Familiarity with GPU memory and distributed workers is useful, but not required

Companion Notebooks

  • theory.ipynb: Executable demos for AdamW, clipping, schedules, memory accounting, ZeRO stages, pipeline bubbles, FLOPs, MFU, and checkpoint overhead.
  • exercises.ipynb: Ten practice problems that make you compute the quantities used in real training-plan reviews.

Learning Objectives

After this section, you should be able to:

  • Explain why training memory is larger than inference memory.
  • Derive Adam and AdamW updates with bias correction (an update-rule sketch follows this list).
  • Compute effective batch size under data parallelism and gradient accumulation.
  • Build warmup and cosine learning-rate schedules (sketched after this list).
  • Estimate parameter, gradient, optimizer, and activation memory (sketched after this list).
  • Compare data, tensor, pipeline, sequence, and sharded parallelism.
  • Compute pipeline bubble fraction and approximate all-reduce cost (sketched after this list).
  • Estimate training FLOPs with C ≈ 6ND and interpret MFU.
  • Explain the practical meaning of Kaplan-style and Chinchilla-style scaling laws.
  • Build a debugging checklist for loss spikes, bad resumes, low throughput, and masking errors.
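
For the AdamW objective, a minimal NumPy sketch of a single update. The hyperparameter values (learning rate, betas, eps, weight decay) are assumed defaults, not values this chapter prescribes:

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update on a NumPy array (hyperparameter values are assumed defaults)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad**2       # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1**t)                  # bias correction; t counts steps from 1
    v_hat = v / (1 - beta2**t)
    # Decoupled weight decay: applied to the parameter directly, not folded into the gradient.
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v

# Tiny usage example on a toy parameter vector.
p = np.zeros(3)
m = np.zeros_like(p)
v = np.zeros_like(p)
for t in range(1, 4):
    grad = np.array([0.1, -0.2, 0.3])           # pretend gradient
    p, m, v = adamw_step(p, grad, m, v, t)
```

In a real run you would use a framework implementation such as torch.optim.AdamW; the sketch only exposes the moment estimates, bias correction, and decoupled weight decay.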
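
The warmup-plus-cosine schedule fits in a few lines; the warmup length, total steps, and learning-rate bounds below are illustrative assumptions:

```python
import math

def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=2000, total_steps=100_000):
    """Linear warmup to max_lr, then cosine decay to min_lr (all values assumed)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps           # linear warmup from near zero
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)                           # hold at min_lr after total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Spot-check a few points of the schedule.
for s in (0, 1000, 2000, 50_000, 100_000):
    print(s, f"{lr_at_step(s):.2e}")
```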
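
For the memory objective, a rough static-state estimate under one common mixed-precision layout (bf16 weights and gradients, fp32 master weights plus two fp32 Adam moments). The layout and the 7B parameter count are assumptions for illustration:

```python
# Rough per-GPU memory for mixed-precision training with Adam-style optimizer state.
# Assumed layout: bf16 params and grads, fp32 master weights plus two fp32 moments.
N = 7e9                          # parameters (assumed model size)
bytes_params = 2 * N             # bf16 weights
bytes_grads  = 2 * N             # bf16 gradients
bytes_optim  = (4 + 4 + 4) * N   # fp32 master copy + Adam m and v

static_gb = (bytes_params + bytes_grads + bytes_optim) / 1e9
print(f"static training state (no activations, no sharding): {static_gb:.0f} GB")
# ZeRO/FSDP shards the optimizer and gradient terms (and optionally the parameters)
# across data-parallel ranks, which is why it fits where plain data parallelism
# does not. Activation memory is extra and depends on batch size, sequence length,
# and whether activation checkpointing is used.
```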
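
For the pipeline and communication objective, a sketch of the standard bubble-fraction formula and a ring all-reduce time estimate. Stage count, micro-batch count, model size, and link bandwidth are all assumed values:

```python
# Pipeline bubble fraction and a simple ring all-reduce time estimate.
# All values are illustrative assumptions.
pipeline_stages = 8       # p
micro_batches   = 32      # m
bubble_fraction = (pipeline_stages - 1) / (micro_batches + pipeline_stages - 1)

# Ring all-reduce moves roughly 2 * (k-1)/k * message_bytes over the slowest link.
grad_bytes = 2 * 7e9      # bf16 gradients for an assumed 7B-parameter model
ranks = 64
link_bandwidth = 100e9    # bytes/s, assumed effective inter-node bandwidth
allreduce_seconds = 2 * (ranks - 1) / ranks * grad_bytes / link_bandwidth

print(f"bubble fraction: {bubble_fraction:.1%}")
print(f"approx. all-reduce time per step: {allreduce_seconds * 1e3:.0f} ms")
```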

Study Flow

  1. Read the pages in order and pause after each page to restate the main definition or theorem.
  2. Run theory.ipynb when you want to check the formulas numerically.
  3. Use exercises.ipynb after the reading path, not before it.
  4. Return to this overview page when you need the chapter-level navigation.
