"Before an LLM predicts text, a tokenizer decides what the text is allowed to be made of."

Overview

Tokenization is the first mathematical map in an LLM pipeline. It converts raw text into integer ids, and those ids determine embedding lookups, attention positions, loss targets, context-window usage, retrieval chunks, special-token boundaries, and serving cost.

The key tradeoff is simple but deep: larger vocabularies usually shorten sequences, while smaller vocabularies share pieces more aggressively. BPE, unigram tokenization, WordPiece, byte fallback, and special-token design are different answers to that tradeoff.
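The BPE answer to that tradeoff can be sketched in a few lines: repeatedly merge the most frequent adjacent symbol pair until a merge budget is spent. Below is a minimal training loop over a hypothetical toy word-frequency corpus (the word list and merge count are illustrative, not from the notebooks):

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a {word: frequency} dict (toy sketch)."""
    # Represent each word as a tuple of single-character symbols.
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Rewrite every word, replacing the best pair with its merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

merges = train_bpe({"low": 5, "lower": 2, "lowest": 2}, num_merges=3)
# -> [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Each learned merge effectively grows the vocabulary by one token, which is exactly the sequence-length-versus-vocabulary-size tradeoff described above.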

This section uses LaTeX Markdown with inline `$...$` and display `$$...$$` math. The companion notebooks implement small tokenizers from scratch so the math is visible without external packages.
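As a preview of the kind of measurement the notebooks make visible, here is a minimal sketch of two of the chapter's cost metrics, compression ratio and empirical token entropy. The function name and the example tokenization are illustrative, not taken from the notebooks:

```python
import math
from collections import Counter

def tokenizer_metrics(text, tokens):
    """Compression ratio and empirical token entropy for one tokenized string."""
    # Compression ratio: characters of raw text per token produced.
    compression = len(text) / len(tokens)
    # Empirical entropy: H = -sum p(t) * log2 p(t) over observed token frequencies.
    counts = Counter(tokens)
    n = len(tokens)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return compression, entropy

ratio, H = tokenizer_metrics("the cat sat", ["the", " cat", " sat"])
# three distinct tokens, each with probability 1/3, so H = log2(3)
```

Higher compression means fewer tokens per character, which directly lowers context-window usage and serving cost.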

Prerequisites

Companion Notebooks

| Notebook | Description |
| --- | --- |
| `theory.ipynb` | Executable BPE, unigram, Viterbi, entropy, fertility, special-token, and context-cost demonstrations. |
| `exercises.ipynb` | Ten checked exercises for tokenizer mechanics and LLM cost reasoning. |

Learning Objectives

After completing this section, you will be able to:

  • Define alphabets, vocabularies, token ids, encoders, and decoders.
  • Explain why tokenization is a mathematical compression map, not only text cleanup.
  • Train a tiny BPE tokenizer and apply learned merges to new strings.
  • Use dynamic programming to find a best unigram segmentation.
  • Compare BPE, unigram, and WordPiece segmentation criteria.
  • Compute compression ratio, token entropy, context cost, and vocabulary parameter cost.
  • Diagnose multilingual fertility and tokenization tax.
  • Explain how special tokens affect chat templates, padding masks, tools, and safety boundaries.
  • Design round-trip and boundary tests for a tokenizer.
  • Explain why tokenizer changes usually require retraining embeddings or careful migration.
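The dynamic-programming objective above can be sketched concretely: given a unigram vocabulary with log-probabilities, Viterbi picks the segmentation whose token log-probs sum highest. The log-prob table here is a hypothetical toy example, not one trained by the notebooks:

```python
import math

def viterbi_segment(text, logprobs):
    """Best unigram segmentation of `text` under a {token: log-prob} table."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i] = max log-prob of any segmentation of text[:i]
    best[0] = 0.0
    back = [0] * (n + 1)          # back[i] = start index of the last token ending at i
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in logprobs and best[j] + logprobs[piece] > best[i]:
                best[i] = best[j] + logprobs[piece]
                back[i] = j
    # Walk the backpointers to recover the winning token sequence.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

vocab = {"un": -2.0, "happy": -3.0, "u": -4.0, "n": -4.0, "unhappy": -8.0}
segmentation = viterbi_segment("unhappy", vocab)
# -> ['un', 'happy']  (score -5.0 beats the single token 'unhappy' at -8.0)
```

This is the same DP the unigram objective in the third learning objective refers to; BPE instead applies a fixed merge order, which is why the two can segment the same string differently.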

Study Flow

  1. Read the pages in order and pause after each page to restate the main definition or theorem.
  2. Run theory.ipynb when you want to check the formulas numerically.
  3. Use exercises.ipynb after the reading path, not before it.
  4. Return to this overview page when you need the chapter-level navigation.
