"A corpus is not a pile of files; it is a reproducible sampling distribution over tokens."
Overview
Full dataset assembly combines accepted records into deterministic, balanced, token-accounted corpora for training and validation. In an LLM training run, data is not an inert pile of text; it is the empirical distribution that defines the examples, losses, risks, and capabilities the model will see.
This section is written as LaTeX Markdown. Inline mathematics uses `$...$`, and display equations use `$$...$$`. The goal is to connect data engineering decisions to mathematical objects such as records, token sequences, filters, hashes, mixture weights, and empirical expectations.
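As a minimal sketch of what "sampling distribution" means here, assume $K$ sources with per-source empirical distributions $p_k$ and mixture weights $w_k$. These symbols are illustrative assumptions; the chapter's own notation may differ.

$$
p_{\text{mix}}(x) = \sum_{k=1}^{K} w_k \, p_k(x), \qquad w_k \ge 0, \qquad \sum_{k=1}^{K} w_k = 1 .
$$

Under this assumption, a total budget of $T$ tokens allots roughly $w_k T$ tokens to source $k$, and a training loss is an empirical expectation $\frac{1}{N} \sum_{i=1}^{N} \ell(x_i)$ over records $x_i$ drawn from $p_{\text{mix}}$.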
The scope is deliberately narrow: this chapter owns the training-data pipeline. Tokenizer design, GPU training systems, benchmark methodology, alignment objectives, and production MLOps each have their own canonical chapters. Here we study the data objects that those later systems consume.
Prerequisites
Companion Notebooks
| Notebook | Description |
|---|---|
| theory.ipynb | Executable demonstrations for full dataset assembly |
| exercises.ipynb | Graded practice for full dataset assembly |
Learning Objectives
After completing this section, you will be able to:
- Define source sets, mixture weights, token budgets, shards, and split assignments
- Build source manifests with checksums and version pins
- Implement weighted and stratified sampling over multiple data sources
- Compute token budgets and source proportions
- Create deterministic train/validation/test splits
- Explain sequence packing, document-boundary masks, and packed-shard statistics
- Verify assembled corpora through shard counts, token counts, and loader smoke tests
- Connect assembly decisions to curriculum and training stability
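
Two of these objectives, deterministic splits and token budgets, can be previewed with a short Python sketch. It is an illustration under stated assumptions, not the chapter's reference implementation: the function names `assign_split` and `token_budgets`, the `salt` parameter, and the default split fractions are all hypothetical.

```python
import hashlib


def assign_split(record_id: str, val_frac: float = 0.01,
                 test_frac: float = 0.01, salt: str = "v1") -> str:
    """Map a record ID to a split with a stable hash (hypothetical helper).

    The result depends only on the ID and the salt, not on file order,
    so reruns of the pipeline reproduce the same assignment.
    """
    digest = hashlib.sha256(f"{salt}:{record_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "validation"
    return "train"


def token_budgets(weights: dict, total_tokens: int) -> dict:
    """Split a total token budget across sources in proportion to their weights."""
    total_w = sum(weights.values())
    if total_w <= 0:
        raise ValueError("mixture weights must sum to a positive value")
    budgets = {name: int(total_tokens * w / total_w) for name, w in weights.items()}
    # Assign rounding leftovers to the largest-weight source so budgets sum exactly.
    leftover = total_tokens - sum(budgets.values())
    budgets[max(weights, key=weights.get)] += leftover
    return budgets


if __name__ == "__main__":
    print(assign_split("doc-000123"))
    print(token_budgets({"web": 6, "code": 3, "books": 1}, 1000))
```

Hashing the record ID rather than drawing random numbers is one common way to keep split assignment stable when sources are added or reordered; changing the assumed `salt` deliberately produces a fresh assignment.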
Study Flow
- Read the pages in order and pause after each page to restate the main definition or theorem.
- Run theory.ipynb when you want to check the formulas numerically.
- Use exercises.ipynb after the reading path, not before it.
- Return to this overview page when you need the chapter-level navigation.