Notes - Math for LLMs Tutorial


"Filtering is not cleaning; it is choosing which errors the model is allowed to learn from."

Overview

Quality checks estimate which records increase effective training signal and which records inject noise, risk, or distributional distortion. In an LLM training run, data is not an inert pile of text; it is the empirical distribution that defines the examples, losses, risks, and capabilities the model will see.

This section is written as LaTeX Markdown. Inline mathematics uses $...$, and display equations use `$$...$$`. The goal is to connect data engineering decisions to mathematical objects such as records $r_i$, token sequences $x_{1:T}$, filters $f(x)$, hashes $h(x)$, mixture weights $\boldsymbol{\alpha}$, and empirical expectations.
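To make these objects concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the chapter's own implementation: the toy records, the length-based filter `f`, and the normalization used inside the hash `h` are all hypothetical choices made for this example. It shows the acceptance rate as an empirical expectation of $f(x)$ over the records, and exact-duplicate detection through equal values of $h(x)$.

```python
import hashlib

# Toy records r_i; the texts are made up for illustration.
records = [
    "The gradient of the loss is computed by backpropagation.",
    "click here click here click here click here",
    "The gradient of the loss is computed by backpropagation.",
    "ok",
]

def f(x: str, min_chars: int = 10) -> bool:
    """Hypothetical filter f(x): accept records with at least min_chars characters."""
    return len(x) >= min_chars

def h(x: str) -> str:
    """Hypothetical hash h(x): SHA-256 of lowercased, whitespace-normalized text."""
    normalized = " ".join(x.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Empirical expectation of f over the records, i.e. the acceptance rate.
acceptance_rate = sum(f(x) for x in records) / len(records)

# Records with equal hashes collapse to a single key (exact duplicates).
unique_hashes = {h(x) for x in records}

print(acceptance_rate)     # 0.75
print(len(unique_hashes))  # 3
```

Only the one-word record fails the length filter, so three of four records are accepted. The first and third records hash to the same value, which is why three distinct hashes remain.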

The scope is deliberately narrow: this chapter owns the training-data pipeline. Tokenizer design, GPU training systems, benchmark methodology, alignment objectives, and production MLOps each have their own canonical chapters. Here we study the data objects that those later systems consume.

Prerequisites

Companion Notebooks

| Notebook | Description |
| --- | --- |
| theory.ipynb | Executable demonstrations for quality checks |
| exercises.ipynb | Graded practice for quality checks |

Learning Objectives

After completing this section, you will be able to:

- Define quality scores, filter functions, acceptance rates, and filter cascades
- Implement length, repetition, language, and character-ratio filters
- Explain model-based quality filtering and threshold calibration
- Analyze the tradeoff between quality, toxicity, diversity, and coverage
- Detect PII-like and secret-like patterns with conservative regex audits
- Summarize filter behavior by source, language, length, and time slice
- Design human audit rubrics and filter ablation reports
- Connect quality filtering to effective token count $D_{\mathrm{eff}}$
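As a preview of several of these objectives, the sketch below chains a few heuristic filters into a cascade and reports the conditional acceptance rate at each stage. Everything in it is an assumption made for illustration: the function names, the thresholds, the simplified email-like regex, and the sample records are hypothetical and are not taken from the companion notebooks.

```python
import re
from typing import Callable, List, Tuple

Filter = Callable[[str], bool]

def length_ok(x: str, min_words: int = 3, max_words: int = 1000) -> bool:
    """Length filter on whitespace-separated word count (thresholds are illustrative)."""
    return min_words <= len(x.split()) <= max_words

def repetition_ok(x: str, max_repeat_frac: float = 0.5) -> bool:
    """Reject if the most frequent word exceeds max_repeat_frac of all words."""
    words = x.lower().split()
    if not words:
        return False
    top = max(words.count(w) for w in set(words))
    return top / len(words) <= max_repeat_frac

def char_ratio_ok(x: str, min_alpha_frac: float = 0.6) -> bool:
    """Require a minimum fraction of alphabetic or whitespace characters."""
    if not x:
        return False
    good = sum(c.isalpha() or c.isspace() for c in x)
    return good / len(x) >= min_alpha_frac

# Deliberately simple email-like pattern for auditing, not a full address validator.
EMAIL_LIKE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def no_email_like(x: str) -> bool:
    return EMAIL_LIKE.search(x) is None

def run_cascade(
    records: List[str], stages: List[Tuple[str, Filter]]
) -> Tuple[List[str], List[Tuple[str, float]]]:
    """Apply filters in order; return survivors and per-stage conditional acceptance rates."""
    survivors = list(records)
    rates = []
    for name, flt in stages:
        kept = [x for x in survivors if flt(x)]
        rates.append((name, len(kept) / len(survivors) if survivors else 0.0))
        survivors = kept
    return survivors, rates

records = [
    "Attention weights are computed with a softmax over scaled dot products.",
    "buy buy buy buy now",
    "hi",
    "Contact me at jane.doe@example.com for the dataset details.",
    "#### 1234 5678 $$$$ %%%% 0000",
]
stages = [
    ("length", length_ok),
    ("repetition", repetition_ok),
    ("char_ratio", char_ratio_ok),
    ("no_email_like", no_email_like),
]

survivors, rates = run_cascade(records, stages)
overall_rate = len(survivors) / len(records)
for name, rate in rates:
    print(f"{name}: {rate:.2f}")
print(f"overall: {overall_rate:.2f}")
```

Each stage here removes exactly one record, so only the first record survives. The overall acceptance rate equals the product of the per-stage conditional rates, which is one reason stage order matters when reading a cascade report: a stage's rate is measured only on the records that earlier stages let through.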

Study Flow

1. Read the pages in order and pause after each page to restate the main definition or theorem.
2. Run theory.ipynb when you want to check the formulas numerically.
3. Use exercises.ipynb after the reading path, not before it.
4. Return to this overview page when you need the chapter-level navigation.
