"Before an LLM predicts text, a tokenizer decides what the text is allowed to be made of."
Overview
Tokenization is the first mathematical map in an LLM pipeline. It converts raw text into integer ids, and those ids determine embedding lookups, attention positions, loss targets, context-window usage, retrieval chunks, special-token boundaries, and serving cost.
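To make that map concrete, here is a minimal sketch of a character-level tokenizer, assuming a toy vocabulary built from a sample string (the names and data are illustrative, not taken from the companion notebooks): `encode` sends text to integer ids and `decode` inverts it.

```python
# Minimal character-level tokenizer: a bijection between characters and ids.
# Illustrative sketch only; the vocabulary and names are hypothetical.
text = "banana band"
vocab = sorted(set(text))                  # alphabet observed -> vocabulary
to_id = {ch: i for i, ch in enumerate(vocab)}
to_ch = {i: ch for ch, i in to_id.items()}

def encode(s):
    return [to_id[ch] for ch in s]         # text -> token ids

def decode(ids):
    return "".join(to_ch[i] for i in ids)  # token ids -> text

ids = encode("banana")
assert decode(ids) == "banana"             # the round trip holds
print(ids)
```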
The key tradeoff is simple but deep: larger vocabularies usually shorten sequences, while smaller vocabularies share pieces more aggressively. BPE, unigram tokenization, WordPiece, byte fallback, and special-token design are different answers to that tradeoff.
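To see the tradeoff numerically, the sketch below performs one BPE-style merge step on a toy word: the most frequent adjacent pair becomes a single new vocabulary entry, so the vocabulary grows by one and the encoded sequence gets shorter. This is a minimal sketch of a single merge, not a full BPE trainer.

```python
from collections import Counter

# One toy BPE merge step, sketch only: count adjacent pairs,
# then merge every occurrence of the most frequent pair.
tokens = list("banana")                    # start from characters: 6 tokens
pairs = Counter(zip(tokens, tokens[1:]))   # frequencies of adjacent pairs
best = max(pairs, key=pairs.get)           # most frequent pair, e.g. ('a', 'n')

merged, i = [], 0
while i < len(tokens):
    if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
        merged.append(tokens[i] + tokens[i + 1])  # new vocabulary entry
        i += 2
    else:
        merged.append(tokens[i])
        i += 1

print(best, merged)  # ('a', 'n') ['b', 'an', 'an', 'a']: 6 tokens -> 4
```

One merge bought one extra vocabulary entry and cut the sequence from 6 tokens to 4; repeating this greedily is exactly how BPE trades vocabulary size for sequence length.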
This section uses Markdown with LaTeX math in `$...$` delimiters. The companion notebooks implement small tokenizers from scratch so the math is visible without external packages.
## Prerequisites
## Companion Notebooks
| Notebook | Description |
|---|---|
| `theory.ipynb` | Executable BPE, unigram, Viterbi, entropy, fertility, special-token, and context-cost demonstrations. |
| `exercises.ipynb` | Ten checked exercises for tokenizer mechanics and LLM cost reasoning. |
## Learning Objectives
After completing this section, you will be able to:
- Define alphabets, vocabularies, token ids, encoders, and decoders.
- Explain why tokenization is a mathematical compression map, not only text cleanup.
- Train a tiny BPE tokenizer and apply learned merges to new strings.
- Use dynamic programming to find a best unigram segmentation (see the Viterbi sketch after this list).
- Compare BPE, unigram, and WordPiece segmentation criteria.
- Compute compression ratio, token entropy, context cost, and vocabulary parameter cost.
- Diagnose multilingual fertility and tokenization tax.
- Explain how special tokens affect chat templates, padding masks, tools, and safety boundaries.
- Design round-trip and boundary tests for a tokenizer.
- Explain why tokenizer changes usually require retraining embeddings or careful migration.
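For the dynamic-programming objective above, here is a minimal Viterbi sketch, assuming a hypothetical toy vocabulary with hand-picked log-probabilities (not the one trained in the notebooks): it returns the segmentation that maximizes the sum of token log-probabilities.

```python
import math

# Toy unigram Viterbi segmentation, sketch only: the vocabulary and
# probabilities below are hypothetical illustrations.
logp = {"un": math.log(0.1), "related": math.log(0.05),
        "u": math.log(0.01), "n": math.log(0.01), "r": math.log(0.01),
        "e": math.log(0.02), "l": math.log(0.01), "a": math.log(0.02),
        "t": math.log(0.02), "d": math.log(0.01)}

def viterbi(text):
    # best[i] = (score of the best segmentation of text[:i], backpointer)
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - 8), i):        # cap token length at 8
            piece = text[j:i]
            if piece in logp and best[j][0] + logp[piece] > best[i][0]:
                best[i] = (best[j][0] + logp[piece], j)
    # follow backpointers to recover the winning segmentation
    pieces, i = [], len(text)
    while i > 0:
        j = best[i][1]
        pieces.append(text[j:i])
        i = j
    return pieces[::-1]

print(viterbi("unrelated"))  # ['un', 'related'] beats character-by-character
```

The table `best` is exactly the dynamic program: each prefix stores its best score, so the full segmentation never re-explores the exponentially many splits.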
## Study Flow
- Read the pages in order and pause after each page to restate the main definition or theorem.
- Run `theory.ipynb` when you want to check the formulas numerically.
- Use `exercises.ipynb` after the reading path, not before it.
- Return to this overview page when you need the chapter-level navigation.