Math for LLMs / Tokenization Math
Tokenization Math, Part 3: LLM System Effects to References

7. LLM System Effects

LLM System Effects develops the tokenizer concepts needed before embeddings, attention, language-model probability, and serving tradeoffs can be understood correctly.

7.1 Special tokens

Purpose. Special tokens cover BOS, EOS, padding masks, role markers, and tool delimiters. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

$$\operatorname{params}_{\mathrm{embed}} = |\mathcal{V}|\, d_{\mathrm{model}}.$$

Operational definition.

Special tokens are vocabulary entries with control meaning rather than ordinary lexical meaning. They must be protected from accidental splitting.

Worked reading.

A chat template may reserve tokens for system, user, assistant, tool call, end-of-message, padding, or beginning-of-sequence boundaries.
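Since every special token is one more vocabulary entry, it adds one embedding row of size d_model. A minimal sketch of the parameter count above (the vocabulary and model sizes below are illustrative, not tied to any particular model):

```python
def embed_params(vocab_size: int, d_model: int) -> int:
    # params_embed = |V| * d_model: one embedding row per token id.
    return vocab_size * d_model

# Reserving 8 special tokens grows the embedding matrix by 8 rows.
base = embed_params(32_000, 4096)
with_specials = embed_params(32_000 + 8, 4096)
extra = with_specials - base  # 8 * 4096 = 32768 extra parameters
```

The same arithmetic applies to the output projection when input and output embeddings are untied, so the true cost of a reserved token is often two rows, not one.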

| Tokenizer object | Mathematical role | LLM consequence |
| --- | --- | --- |
| alphabet $\Sigma$ | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary $\mathcal{V}$ | finite token set | embedding and output-logit dimensions |
| encoder $E$ | maps text to ids | prompt length, training examples, costs |
| decoder $D$ | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |

Examples:

  1. BOS/EOS.
  2. padding and masks.
  3. role delimiters in chat models.

Non-examples:

  1. ordinary word pieces.
  2. strings that can be merged by BPE.

Derivation habit.

  1. State the raw alphabet and any normalization step.
  2. State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
  3. Compute token count before making cost, context, or memory claims.
  4. Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
  5. Treat special tokens as protected control symbols, not ordinary text pieces.

Implementation lens.

A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.

The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.

For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.

For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
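The debugging habit above can be sketched with a toy segmenter that reports piece, id, and character offsets for each unit; a real tokenizer exposes a similar view. The regex split here is a stand-in for illustration, not a real subword algorithm:

```python
import re

def debug_tokenize(text: str) -> list[tuple[str, int, tuple[int, int]]]:
    """Toy segmenter: return (piece, id, (start, end)) for every run of
    whitespace or non-whitespace, assigning ids in order of first appearance."""
    rows: list[tuple[str, int, tuple[int, int]]] = []
    vocab: dict[str, int] = {}
    for m in re.finditer(r"\s+|\S+", text):
        piece = m.group()
        tid = vocab.setdefault(piece, len(vocab))
        rows.append((piece, tid, m.span()))
    return rows

# Edge cases worth printing: leading spaces, repeated newlines, numbers.
for piece, tid, span in debug_tokenize("  hello\n\nworld 42"):
    print(repr(piece), tid, span)
```

Because the pattern covers every character, concatenating the pieces reconstructs the input, which is exactly the round-trip property a byte-level tokenizer promises.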

7.2 Attention cost

Purpose. Attention cost explains why token length drives quadratic compute. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

$$\operatorname{attention\ cost} \propto n_{\mathrm{tokens}}^{2}.$$

Operational definition.

Tokenization is compression under constraints: it trades vocabulary size against sequence length and distribution balance.

Worked reading.

A lower token count improves context efficiency, but a huge vocabulary increases embedding and output-layer cost.
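Under the proportionality above, changing the token count scales the quadratic attention term by the squared ratio. A minimal sketch:

```python
def attention_cost_ratio(n_before: int, n_after: int) -> float:
    # attention cost ∝ n_tokens^2, so relative cost is the squared ratio.
    return (n_after / n_before) ** 2

# Doubling the token count quadruples the quadratic attention term.
print(attention_cost_ratio(1024, 2048))  # 4.0

# A prompt that tokenizes 3x worse costs 9x in this term.
print(attention_cost_ratio(100, 300))  # 9.0
```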

Examples:

  1. characters per token.
  2. tokens per word.
  3. entropy of token frequencies.

Non-examples:

  1. judging cost by words alone.
  2. ignoring sequence-length effects in attention.

7.3 Numeracy and spelling

Purpose. Numeracy and spelling focus on why digit and character segmentation matters. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

$$E(x) = (t_1, \ldots, t_n), \qquad t_i \in \{0, \ldots, |\mathcal{V}| - 1\}.$$

Operational definition.

This concept controls how raw text becomes the sequence of discrete units optimized by an LLM.

Worked reading.

The practical question is always how the choice changes ids, sequence length, reversibility, or downstream loss.
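As a toy illustration of why digit segmentation matters, a per-digit scheme and a grouped scheme give the same number string very different sequence lengths. The chunking function below is hypothetical, not any production tokenizer:

```python
def digit_tokens(number: str, chunk: int = 1) -> list[str]:
    """Toy digit segmenter: split a digit string into fixed-size chunks.
    chunk=1 mimics per-digit tokenization; chunk=3 mimics grouped digits."""
    return [number[i:i + chunk] for i in range(0, len(number), chunk)]

print(digit_tokens("1234567"))      # 7 tokens, one per digit
print(digit_tokens("1234567", 3))   # 3 tokens: ['123', '456', '7']
```

Per-digit segmentation gives the model a consistent positional view of each digit, which tends to help arithmetic; grouped digits save sequence length but make place value harder to recover.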

Examples:

  1. BPE pieces.
  2. unigram pieces.
  3. special tokens.

Non-examples:

  1. raw text passed directly to attention.
  2. word counts used as token counts.

7.4 Retrieval chunking

Purpose. Retrieval chunking focuses on why chunk size should be token-aware. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

$$D(E(x)) = x \quad \text{for a lossless tokenizer on its supported domain}.$$

Operational definition.

This concept controls how raw text becomes the sequence of discrete units optimized by an LLM.

Worked reading.

The practical question is always how the choice changes ids, sequence length, reversibility, or downstream loss.
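Token-aware chunking can be sketched directly over token ids, so chunk budgets are measured in the same units as the model's token-limited context window. The function name and overlap policy below are illustrative:

```python
def chunk_by_tokens(ids: list[int], max_tokens: int, overlap: int) -> list[list[int]]:
    """Split a token-id sequence into windows of at most max_tokens,
    with `overlap` tokens shared between consecutive windows."""
    assert 0 <= overlap < max_tokens
    step = max_tokens - overlap
    return [ids[i:i + max_tokens]
            for i in range(0, len(ids), step)
            if ids[i:i + max_tokens]]

chunks = chunk_by_tokens(list(range(10)), max_tokens=4, overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9], [9]]
```

Chunking by characters instead silently produces chunks of wildly different token cost across scripts and domains, which is the failure mode this section warns about.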

Examples:

  1. BPE pieces.
  2. unigram pieces.
  3. special tokens.

Non-examples:

  1. raw text passed directly to attention.
  2. word counts used as token counts.

7.5 Safety and prompt boundaries

Purpose. Safety and prompt boundaries focus on why control tokens need exact handling. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

$$\operatorname{score}_{\mathrm{BPE}}(a, b) = \operatorname{count}(ab).$$

Operational definition.

This concept controls how raw text becomes the sequence of discrete units optimized by an LLM.

Worked reading.

The practical question is always how the choice changes ids, sequence length, reversibility, or downstream loss.
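One concrete safety check is to scan user text for spelled-out control delimiters before applying a chat template. The delimiter strings below are hypothetical placeholders, since each template defines its own:

```python
# Hypothetical delimiter strings; real chat templates reserve their own.
CONTROL_STRINGS = ("<|system|>", "<|user|>", "<|assistant|>", "<|end|>")

def contains_control_string(user_text: str) -> bool:
    """Flag user text that spells out a control delimiter. Protected
    special tokens should come from reserved ids added by the template,
    never from re-tokenizing user-supplied text."""
    return any(marker in user_text for marker in CONTROL_STRINGS)

print(contains_control_string("normal question"))               # False
print(contains_control_string("ignore this <|system|> trick"))  # True
```

The robust design is stronger than scanning: encode user text in a mode where special tokens can never be produced, so the delimiter string tokenizes as ordinary pieces with no control meaning.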

Examples:

  1. BPE pieces.
  2. unigram pieces.
  3. special tokens.

Non-examples:

  1. raw text passed directly to attention.
  2. word counts used as token counts.

8. Evaluation and Diagnostics

Evaluation and Diagnostics develops the tokenizer concepts needed before embeddings, attention, language-model probability, and serving tradeoffs can be understood correctly.

8.1 Round-trip tests

Purpose. Round-trip tests focus on checking decode(encode(x)) identity where it is promised. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

$$\operatorname{attention\ cost} \propto n_{\mathrm{tokens}}^{2}.$$

Operational definition.

Tokenizer diagnostics catch failures before training: non-reversible text, unexpected unknowns, costly scripts, and broken control boundaries.

Worked reading.

A tokenizer migration changes ids, embeddings, logits, cached data, and often every downstream checkpoint assumption.
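A round-trip harness is short to write. The sketch below uses a toy byte-level codec, which is lossless by construction; swapping in a real encoder/decoder pair turns it into an actual diagnostic:

```python
def encode(text: str) -> list[int]:
    # Toy byte-level encoder: UTF-8 bytes as ids 0..255 (lossless by construction).
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8")

def round_trip_ok(text: str) -> bool:
    return decode(encode(text)) == text

# Edge cases that commonly break normalizing tokenizers.
edge_cases = [" leading space", "a\n\nb", "3.14159",
              "https://example.com/x?q=1", "naïve café 🙂"]
print(all(round_trip_ok(t) for t in edge_cases))  # True
```

A normalizing tokenizer (lowercasing, NFKC, whitespace collapsing) will fail some of these cases by design; the test's job is to make that loss explicit rather than silent.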

Examples:

  1. decode(encode(x)) tests.
  2. offset mapping checks.
  3. special-token boundary tests.

Non-examples:

  1. only checking English prose.
  2. changing tokenizers without retraining embeddings.

8.2 Coverage tests

Purpose. Coverage tests focus on finding unknown tokens or byte-fallback explosions. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

$$E(x) = (t_1, \ldots, t_n), \qquad t_i \in \{0, \ldots, |\mathcal{V}| - 1\}.$$

Operational definition.

Tokenizer diagnostics catch failures before training: non-reversible text, unexpected unknowns, costly scripts, and broken control boundaries.

Worked reading.

A tokenizer migration changes ids, embeddings, logits, cached data, and often every downstream checkpoint assumption.
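A coverage test can be as simple as counting pieces that miss the vocabulary and would trigger UNK or byte fallback. The names below are illustrative:

```python
def coverage_report(pieces: list[str], vocab: set[str]) -> dict[str, float]:
    """Fraction of pieces covered by the vocabulary vs falling back
    (to UNK or to byte-level pieces, depending on the tokenizer)."""
    fallback = sum(1 for p in pieces if p not in vocab)
    return {"pieces": len(pieces),
            "fallback": fallback,
            "fallback_rate": fallback / max(len(pieces), 1)}

vocab = {"the", "cat", "sat"}
print(coverage_report(["the", "cat", "zül"], vocab))
```

Run this per script and per domain: a low global fallback rate can hide a near-100% rate on one language, which is exactly the explosion this subsection is about.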

Examples:

  1. decode(encode(x)) tests.
  2. offset mapping checks.
  3. special-token boundary tests.

Non-examples:

  1. only checking English prose.
  2. changing tokenizers without retraining embeddings.

8.3 Fertility dashboards

Purpose. Fertility dashboards focus on comparing groups, domains, and scripts. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

$$D(E(x)) = x \quad \text{for a lossless tokenizer on its supported domain}.$$

Operational definition.

Tokenization is compression under constraints: it trades vocabulary size against sequence length and distribution balance.

Worked reading.

A lower token count improves context efficiency, but a huge vocabulary increases embedding and output-layer cost.
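The three example metrics (characters per token, tokens per word, and token-frequency entropy) can all be computed from one segmented sample. A minimal sketch, with the segmentation supplied by the caller:

```python
import math
from collections import Counter

def fertility_stats(text: str, tokens: list[str]) -> dict[str, float]:
    """Dashboard metrics for one (text, segmentation) pair."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Shannon entropy of the token-frequency distribution, in bits.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "chars_per_token": len(text) / len(tokens),
        "tokens_per_word": len(tokens) / max(len(text.split()), 1),
        "token_entropy_bits": entropy,
    }

stats = fertility_stats("the cat sat", ["the", " cat", " sat"])
print(stats)
```

Comparing these numbers across languages or domains surfaces the cost asymmetries that word counts hide.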

Examples:

  1. characters per token.
  2. tokens per word.
  3. entropy of token frequencies.

Non-examples:

  1. judging cost by words alone.
  2. ignoring sequence-length effects in attention.

8.4 Boundary tests

Purpose. Boundary tests focus on URLs, code, numbers, whitespace, and emoji-like symbols. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

$$\operatorname{score}_{\mathrm{BPE}}(a, b) = \operatorname{count}(ab).$$

Operational definition.

Tokenizer diagnostics catch failures before training: non-reversible text, unexpected unknowns, costly scripts, and broken control boundaries.

Worked reading.

A tokenizer migration changes ids, embeddings, logits, cached data, and often every downstream checkpoint assumption.
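A useful boundary check is that token offsets tile the text exactly: contiguous, in order, and covering every character, including whitespace. A minimal sketch:

```python
def check_offsets(text: str, spans: list[tuple[int, int]]) -> bool:
    """Offsets must tile the text: each span starts where the previous
    one ended, and the final span ends at the end of the text."""
    pos = 0
    for start, end in spans:
        if start != pos or end < start:
            return False
        pos = end
    return pos == len(text)

print(check_offsets("a  b", [(0, 1), (1, 3), (3, 4)]))  # True
print(check_offsets("a  b", [(0, 1), (2, 4)]))          # False: a space was dropped
```

Tokenizers that strip or merge whitespace legitimately fail this strict tiling check; the point is to know which behavior you have before building span-alignment logic on top of the offsets.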

Examples:

  1. decode(encode(x)) tests.
  2. offset mapping checks.
  3. special-token boundary tests.

Non-examples:

  1. only checking English prose.
  2. changing tokenizers without retraining embeddings.

8.5 Tokenizer migration tests

Purpose. Tokenizer migration tests focus on why changing tokenizers invalidates checkpoints. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

$$x^{*} = \arg\max_{v_1, \ldots, v_k \,:\, v_1 \cdots v_k = x} \; \sum_{i=1}^{k} \log p(v_i).$$

Operational definition.

Tokenizer diagnostics catch failures before training: non-reversible text, unexpected unknowns, costly scripts, and broken control boundaries.

Worked reading.

A tokenizer migration changes ids, embeddings, logits, cached data, and often every downstream checkpoint assumption.
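A migration test can diff two encoders over a probe set: any string whose ids change invalidates cached datasets and checkpoints keyed to the old ids. The two toy encoders below stand in for real tokenizers:

```python
from typing import Callable

def migration_diff(probe: list[str],
                   enc_old: Callable[[str], list[int]],
                   enc_new: Callable[[str], list[int]]) -> list[str]:
    """Return probe strings whose token ids differ between tokenizers."""
    return [s for s in probe if enc_old(s) != enc_new(s)]

def old_encode(s: str) -> list[int]:
    return [ord(c) for c in s]        # toy: one id per character

def new_encode(s: str) -> list[int]:
    return list(s.encode("utf-8"))    # toy: one id per UTF-8 byte

print(migration_diff(["abc", "naïve"], old_encode, new_encode))  # ['naïve']
```

ASCII-only probes agree between the two toy schemes, which is exactly why a probe set must include non-ASCII, whitespace, and code: an "all green" diff on English prose proves very little.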

Examples:

  1. decode(encode(x)) tests.
  2. offset mapping checks.
  3. special-token boundary tests.

Non-examples:

  1. only checking English prose.
  2. changing tokenizers without retraining embeddings.

9. Common Mistakes

| # | Mistake | Why it is wrong | Fix |
| --- | --- | --- | --- |
| 1 | Assuming words are tokens | Modern LLMs usually use subword or byte-level tokens. | Inspect actual token ids before estimating cost or behavior. |
| 2 | Ignoring reversibility | Normalization can make decode-encode behavior lossy. | State whether the tokenizer is byte-level reversible or normalized. |
| 3 | Changing tokenizers after training | Embeddings and output heads are tied to token ids. | Treat tokenizer choice as part of the checkpoint. |
| 4 | Comparing context windows by characters | Models attend over tokens, not characters. | Measure tokens per sample and fertility. |
| 5 | Forgetting special tokens | Control tokens change sequence boundaries and masks. | Reserve and test special ids explicitly. |
| 6 | Assuming all languages pay the same token cost | Scripts and training-data frequency affect fertility. | Audit multilingual fertility and bytes-per-token. |
| 7 | Using unknown tokens silently | UNK loses information and can hide coverage failures. | Prefer byte fallback or explicit coverage reports. |
| 8 | Treating BPE merges as globally optimal | BPE is greedy and merge-order dependent. | Use diagnostics and compare with unigram/WordPiece behavior. |
| 9 | Chunking retrieval by characters only | The model budget is token-limited. | Chunk by token count with overlap in token space. |
| 10 | Ignoring whitespace | Whitespace handling changes ids, offsets, and detokenization. | Test leading spaces, newlines, tabs, and code blocks. |

10. Exercises

  1. (*) Run two BPE merges by hand on a tiny corpus.

    • (a) State the tokenizer object involved.
    • (b) Compute the small numeric or string example.
    • (c) Explain the LLM training or serving consequence.
  2. (*) Compute characters-per-token before and after a merge.

    • (a) State the tokenizer object involved.
    • (b) Compute the small numeric or string example.
    • (c) Explain the LLM training or serving consequence.
  3. (*) Compute embedding parameter cost for two vocabulary sizes.

    • (a) State the tokenizer object involved.
    • (b) Compute the small numeric or string example.
    • (c) Explain the LLM training or serving consequence.
  4. (**) Find the best unigram segmentation with dynamic programming.

    • (a) State the tokenizer object involved.
    • (b) Compute the small numeric or string example.
    • (c) Explain the LLM training or serving consequence.
  5. (**) Compute token entropy from token counts.

    • (a) State the tokenizer object involved.
    • (b) Compute the small numeric or string example.
    • (c) Explain the LLM training or serving consequence.
  6. (**) Compare fertility for two short multilingual examples.

    • (a) State the tokenizer object involved.
    • (b) Compute the small numeric or string example.
    • (c) Explain the LLM training or serving consequence.
  7. (**) Design a round-trip test for whitespace and code.

    • (a) State the tokenizer object involved.
    • (b) Compute the small numeric or string example.
    • (c) Explain the LLM training or serving consequence.
  8. (***) Compute attention-cost growth when token count doubles.

    • (a) State the tokenizer object involved.
    • (b) Compute the small numeric or string example.
    • (c) Explain the LLM training or serving consequence.
  9. (***) Identify which strings must be protected as special tokens.

    • (a) State the tokenizer object involved.
    • (b) Compute the small numeric or string example.
    • (c) Explain the LLM training or serving consequence.
  10. (***) Explain why tokenizer migration changes a trained checkpoint.

    • (a) State the tokenizer object involved.
    • (b) Compute the small numeric or string example.
    • (c) Explain the LLM training or serving consequence.

11. Why This Matters for AI

| Concept | AI impact |
| --- | --- |
| Token ids | Define the rows of embedding matrices and the columns of output logits. |
| Vocabulary size | Controls embedding parameters, softmax cost, and rare-piece coverage. |
| Sequence length | Controls attention compute, memory, context-window utilization, and API cost. |
| Subword segmentation | Determines how words, names, code, numbers, and rare strings are decomposed. |
| Byte fallback | Improves robustness to arbitrary text and reduces unknown-token failures. |
| Special tokens | Encode conversation roles, tools, padding, sequence boundaries, and safety delimiters. |
| Fertility | Reveals fairness and cost differences across languages, domains, and scripts. |
| Round-trip behavior | Protects data pipelines from silent corruption before training or inference. |

12. Conceptual Bridge

The backward bridge is information theory: tokenization is compression with a finite codebook, but unlike pure compression it must also support neural prediction, stable ids, and clean detokenization.

The forward bridge is embedding space. Once a tokenizer emits ids, each id selects a row of an embedding matrix. Attention, next-token probability, scaling laws, RAG chunking, and serving cost all inherit the tokenizer's sequence length and boundary choices.

+-----------+      +--------------+      +--------------+      +------------+
| raw text  | ---> | token ids    | ---> | embeddings   | ---> | attention  |
| bytes     |      | finite vocab |      | vector rows  |      | positions  |
+-----------+      +--------------+      +--------------+      +------------+

The practical habit is to inspect tokens before trusting intuition. If the model behaves strangely on numbers, names, code, or multilingual text, the tokenizer is one of the first places to look.
