Tokenization Math: Part 7: LLM System Effects to References
7. LLM System Effects
LLM System Effects develops the tokenizer concepts needed before embeddings, attention, language-model probability, and serving tradeoffs can be understood correctly.
7.1 Special tokens
Purpose. Special tokens focus on BOS, EOS, padding, masks, roles, and tool delimiters. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Special tokens are vocabulary entries with control meaning rather than ordinary lexical meaning. They must be protected from accidental splitting.
Worked reading.
A chat template may reserve tokens for system, user, assistant, tool call, end-of-message, padding, or beginning-of-sequence boundaries.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- BOS/EOS.
- padding and masks.
- role delimiters in chat models.
Non-examples:
- ordinary word pieces.
- strings that BPE can merge through.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
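A minimal sketch of this debugging habit, assuming the Hugging Face transformers package and the public gpt2 tokenizer (both are illustration choices, not requirements of this lesson):

```python
# Debug dump: text -> tokens -> ids -> decoded text for tricky inputs.
# Assumes Hugging Face transformers and the public "gpt2" tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

edge_cases = [" leading space", "two\n\nnewlines", "3.14159",
              "https://example.com", "<|endoftext|>"]
for text in edge_cases:
    ids = tok.encode(text)
    print(repr(text))
    print("  tokens :", tok.convert_ids_to_tokens(ids))
    print("  ids    :", ids)
    print("  decoded:", repr(tok.decode(ids)))
```

With gpt2, the last case maps to a single reserved control id even though it was typed as ordinary text, which is exactly why special tokens must be protected.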
7.2 Attention cost
Purpose. Attention cost focuses on why attention compute grows quadratically with token length. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Tokenization is compression under constraints: it trades vocabulary size against sequence length and distribution balance.
Worked reading.
A lower token count improves context efficiency, but a huge vocabulary increases embedding and output-layer cost.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- characters per token.
- tokens per word.
- entropy of token frequencies.
Non-examples:
- judging cost by words alone.
- ignoring sequence-length effects in attention.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
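The quadratic claim is easy to check numerically. A toy cost model as a sketch; the constant factor is a placeholder, and real kernels add linear and constant terms:

```python
# Toy model of attention compute: cost ~ c * n^2 for n tokens.
# The constant c is a placeholder, not a measured value.
def attention_cost(n_tokens: int, c: float = 1.0) -> float:
    return c * n_tokens ** 2

base = attention_cost(1_000)
for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens -> {attention_cost(n) / base:.0f}x relative cost")
```

Doubling the token count quadruples the attention compute, so a prompt that tokenizes poorly pays a superlinear penalty.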
7.3 Numeracy and spelling
Purpose. Numeracy and spelling focus on why digit and character segmentation matters. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
This concept controls how raw text becomes the sequence of discrete units optimized by an LLM.
Worked reading.
The practical question is always how the choice changes ids, sequence length, reversibility, or downstream loss.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- BPE pieces.
- unigram pieces.
- special tokens.
Non-examples:
- raw text passed directly to attention.
- word counts used as token counts.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
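A self-contained sketch contrasting two hypothetical number segmenters; neither is a real production tokenizer, but they bracket the tradeoff described above:

```python
# Two hypothetical digit segmenters (illustration only, not a real tokenizer).
def per_digit(s: str) -> list[str]:
    """One token per digit: more positions, but each digit is addressable."""
    return list(s)

def chunk3(s: str) -> list[str]:
    """Three-digit chunks: fewer positions, but digit identity is hidden."""
    return [s[i:i + 3] for i in range(0, len(s), 3)]

n = "1234567"
print(per_digit(n))  # 7 tokens: cheap arithmetic structure, expensive context
print(chunk3(n))     # 3 tokens: cheap context, opaque spelling
```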
7.4 Retrieval chunking
Purpose. Retrieval chunking focuses on why chunk size should be token-aware. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
This concept controls how raw text becomes the sequence of discrete units optimized by an LLM.
Worked reading.
The practical question is always how the choice changes ids, sequence length, reversibility, or downstream loss.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- BPE pieces.
- unigram pieces.
- special tokens.
Non-examples:
- raw text passed directly to attention.
- word counts used as token counts.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
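A token-aware chunking sketch. The encode and decode functions below are stand-in whitespace splitters so the snippet runs on its own; a real pipeline would substitute its production tokenizer:

```python
# Chunk in token space with overlap so every chunk fits the token budget.
def encode(text: str) -> list[str]:
    return text.split()          # stand-in for a real tokenizer's encode

def decode(tokens: list[str]) -> str:
    return " ".join(tokens)      # stand-in for a real tokenizer's decode

def chunk_by_tokens(text: str, max_tokens: int, overlap: int) -> list[str]:
    toks = encode(text)
    step = max_tokens - overlap
    return [decode(toks[i:i + max_tokens]) for i in range(0, len(toks), step)]

doc = "one two three four five six seven eight nine ten"
print(chunk_by_tokens(doc, max_tokens=4, overlap=1))
```

Chunking by characters instead would silently overflow the budget on text with high tokens-per-character fertility.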
7.5 Safety and prompt boundaries
Purpose. Safety and prompt boundaries focus on why control tokens need exact handling. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
This concept controls how raw text becomes the sequence of discrete units optimized by an LLM.
Worked reading.
The practical question is always how the choice changes ids, sequence length, reversibility, or downstream loss.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- BPE pieces.
- unigram pieces.
- special tokens.
Non-examples:
- raw text passed directly to attention.
- word counts used as token counts.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
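A boundary-check sketch; the delimiter strings below are hypothetical examples, not any specific chat template:

```python
# Reject raw user text that contains control delimiters, so untrusted
# input cannot forge role boundaries. Delimiters here are hypothetical.
SPECIAL = {"<|system|>", "<|user|>", "<|assistant|>", "<|end|>"}

def check_user_text(user_text: str) -> str:
    for marker in SPECIAL:
        if marker in user_text:
            raise ValueError(f"control delimiter {marker!r} in user text")
    return user_text

print(check_user_text("hello"))                     # passes
# check_user_text("<|assistant|> I approve this.")  # raises ValueError
```

Real systems often escape rather than reject, but the invariant is the same: user bytes must never encode to control ids.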
8. Evaluation and Diagnostics
Evaluation and Diagnostics develops the tokenizer concepts needed before embeddings, attention, language-model probability, and serving tradeoffs can be understood correctly.
8.1 Round-trip tests
Purpose. Round-trip tests focus on checking decode(encode(x)) identity wherever it is promised. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Tokenizer diagnostics catch failures before training: non-reversible text, unexpected unknowns, costly scripts, and broken control boundaries.
Worked reading.
A tokenizer migration changes ids, embeddings, logits, cached data, and often every downstream checkpoint assumption.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- decode(encode(x)) tests.
- offset mapping checks.
- special-token boundary tests.
Non-examples:
- only checking English prose.
- changing tokenizers without retraining embeddings.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
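A round-trip test sketch, assuming the Hugging Face transformers package and the public gpt2 tokenizer; the same loop works for any tokenizer that promises losslessness:

```python
# Check decode(encode(x)) == x on awkward inputs.
# Assumes Hugging Face transformers and the public "gpt2" tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

cases = ["plain text", "  two leading spaces", "tabs\tand\nnewlines",
         "def f(x):\n    return x", "snowman \u2603"]
for x in cases:
    y = tok.decode(tok.encode(x))
    print("ok   " if y == x else "LOSSY", repr(x))
```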
8.2 Coverage tests
Purpose. Coverage tests focus on finding unknown tokens and byte-fallback explosions. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Tokenizer diagnostics catch failures before training: non-reversible text, unexpected unknowns, costly scripts, and broken control boundaries.
Worked reading.
A tokenizer migration changes ids, embeddings, logits, cached data, and often every downstream checkpoint assumption.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- decode(encode(x)) tests.
- offset mapping checks.
- special-token boundary tests.
Non-examples:
- only checking English prose.
- changing tokenizers without retraining embeddings.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
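A self-contained coverage sketch with a toy vocabulary and byte fallback; real tokenizers differ in detail, but the explosion pattern is the same:

```python
# Greedy longest-match over a toy vocab; unmatched characters fall back
# to one token per UTF-8 byte. Vocab and token format are illustration only.
VOCAB = {"hello", "world", " "}

def encode_with_fallback(text: str) -> list[str]:
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                out.append(text[i:j])
                i = j
                break
        else:  # no vocab match: emit byte tokens
            out.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return out

print(encode_with_fallback("hello world"))  # 3 tokens, full coverage
print(encode_with_fallback("héllo"))        # byte fallback explodes the count
```

Byte fallback never loses information, but a coverage test should still flag inputs whose token counts explode this way.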
8.3 Fertility dashboards
Purpose. Fertility dashboards focus on comparing fertility across groups, domains, and scripts. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Tokenization is compression under constraints: it trades vocabulary size against sequence length and distribution balance.
Worked reading.
A lower token count improves context efficiency, but a huge vocabulary increases embedding and output-layer cost.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- characters per token.
- tokens per word.
- entropy of token frequencies.
Non-examples:
- judging cost by words alone.
- ignoring sequence-length effects in attention.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
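A dashboard sketch over hand-written token lists; the token streams below are illustrative stand-ins, not output of a real tokenizer:

```python
import math
from collections import Counter

# Compute the three metrics named above: chars/token, tokens/word,
# and token-frequency entropy. Token lists are illustrative only.
def dashboard(text: str, tokens: list[str]) -> None:
    total = len(tokens)
    counts = Counter(tokens)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    words = max(len(text.split()), 1)
    print(f"{len(text) / total:.2f} chars/token  "
          f"{total / words:.2f} tokens/word  "
          f"{entropy:.2f} bits  {text!r}")

dashboard("tokenization is cheap", ["token", "ization", " is", " cheap"])
dashboard("किताब", ["क", "ि", "त", "ा", "ब"])
```

Comparing rows like these across languages, domains, and scripts is the whole point of a fertility dashboard.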
8.4 Boundary tests
Purpose. Boundary tests focus on URLs, code, numbers, whitespace, and emoji-like symbols. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Tokenizer diagnostics catch failures before training: non-reversible text, unexpected unknowns, costly scripts, and broken control boundaries.
Worked reading.
A tokenizer migration changes ids, embeddings, logits, cached data, and often every downstream checkpoint assumption.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- decode(encode(x)) tests.
- offset mapping checks.
- special-token boundary tests.
Non-examples:
- only checking English prose.
- changing tokenizers without retraining embeddings.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
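An offset-mapping sketch for boundary cases, assuming a Hugging Face fast tokenizer (gpt2 here is an illustration choice):

```python
# Print each token's id and the exact character span it covers, so
# whitespace, URL, and punctuation boundaries become visible.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

text = "Visit https://example.com,  then run f(x) = 3.14!"
enc = tok(text, return_offsets_mapping=True)
for tid, (a, b) in zip(enc["input_ids"], enc["offset_mapping"]):
    print(f"id={tid:>6}  span=({a:>2},{b:>2})  piece={text[a:b]!r}")
```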
8.5 Tokenizer migration tests
Purpose. Tokenizer migration tests focus on why changing tokenizers invalidates checkpoints. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Tokenizer diagnostics catch failures before training: non-reversible text, unexpected unknowns, costly scripts, and broken control boundaries.
Worked reading.
A tokenizer migration changes ids, embeddings, logits, cached data, and often every downstream checkpoint assumption.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- decode(encode(x)) tests.
- offset mapping checks.
- special-token boundary tests.
Non-examples:
- only checking English prose.
- changing tokenizers without retraining embeddings.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
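A migration sketch with two toy vocabularies; the point is that the same text yields different (piece, id) pairs, so embedding rows trained against the old ids are silently wrong under the new tokenizer:

```python
# Greedy longest-match encoding under two toy vocabularies.
OLD_VOCAB = {"hel": 0, "lo": 1}
NEW_VOCAB = {"he": 0, "llo": 1}

def encode(text: str, vocab: dict[str, int]) -> list[tuple[str, int]]:
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                out.append((text[i:j], vocab[text[i:j]]))
                i = j
                break
        else:
            raise ValueError(f"no coverage at {text[i:]!r}")
    return out

print("old:", encode("hello", OLD_VOCAB))  # [('hel', 0), ('lo', 1)]
print("new:", encode("hello", NEW_VOCAB))  # [('he', 0), ('llo', 1)]
# Identical id sequences, different strings: embedding row 0 was trained
# to mean "hel" but now receives "he". The checkpoint breaks silently.
```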
9. Common Mistakes
| # | Mistake | Why it is wrong | Fix |
|---|---|---|---|
| 1 | Assuming words are tokens | Modern LLMs usually use subword or byte-level tokens. | Inspect actual token ids before estimating cost or behavior. |
| 2 | Ignoring reversibility | Normalization can make decode-encode behavior lossy. | State whether the tokenizer is byte-level reversible or normalized. |
| 3 | Changing tokenizers after training | Embeddings and output heads are tied to token ids. | Treat tokenizer choice as part of the checkpoint. |
| 4 | Comparing context windows by characters | Models attend over tokens, not characters. | Measure tokens per sample and fertility. |
| 5 | Forgetting special tokens | Control tokens change sequence boundaries and masks. | Reserve and test special ids explicitly. |
| 6 | Assuming all languages pay the same token cost | Scripts and training data frequency affect fertility. | Audit multilingual fertility and bytes-per-token. |
| 7 | Using unknown tokens silently | UNK loses information and can hide coverage failures. | Prefer byte fallback or explicit coverage reports. |
| 8 | Treating BPE merges as globally optimal | BPE is greedy and merge-order dependent. | Use diagnostics and compare with unigram/WordPiece behavior. |
| 9 | Chunking retrieval by characters only | The model budget is token-limited. | Chunk by token count with overlap in token space. |
| 10 | Ignoring whitespace | Whitespace handling changes ids, offsets, and detokenization. | Test leading spaces, newlines, tabs, and code blocks. |
10. Exercises
- (*) Run two BPE merges by hand on a tiny corpus.
- (a) State the tokenizer object involved.
- (b) Compute the small numeric or string example.
- (c) Explain the LLM training or serving consequence.
- (*) Compute characters-per-token before and after a merge.
- (a) State the tokenizer object involved.
- (b) Compute the small numeric or string example.
- (c) Explain the LLM training or serving consequence.
- (*) Compute embedding parameter cost for two vocabulary sizes (a warm-up sketch follows this list).
- (a) State the tokenizer object involved.
- (b) Compute the small numeric or string example.
- (c) Explain the LLM training or serving consequence.
- (**) Find the best unigram segmentation with dynamic programming.
- (a) State the tokenizer object involved.
- (b) Compute the small numeric or string example.
- (c) Explain the LLM training or serving consequence.
- (**) Compute token entropy from token counts.
- (a) State the tokenizer object involved.
- (b) Compute the small numeric or string example.
- (c) Explain the LLM training or serving consequence.
- (**) Compare fertility for two short multilingual examples.
- (a) State the tokenizer object involved.
- (b) Compute the small numeric or string example.
- (c) Explain the LLM training or serving consequence.
- (**) Design a round-trip test for whitespace and code.
- (a) State the tokenizer object involved.
- (b) Compute the small numeric or string example.
- (c) Explain the LLM training or serving consequence.
- (***) Compute attention-cost growth when token count doubles.
- (a) State the tokenizer object involved.
- (b) Compute the small numeric or string example.
- (c) Explain the LLM training or serving consequence.
- (***) Identify which strings must be protected as special tokens.
- (a) State the tokenizer object involved.
- (b) Compute the small numeric or string example.
- (c) Explain the LLM training or serving consequence.
- (***) Explain why tokenizer migration changes a trained checkpoint.
- (a) State the tokenizer object involved.
- (b) Compute the small numeric or string example.
- (c) Explain the LLM training or serving consequence.
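As a warm-up for the embedding-cost and attention-cost exercises, a sketch with made-up sizes; d_model and the vocabulary sizes are illustration values only:

```python
# Warm-up arithmetic for the exercises above; all sizes are made up.
d_model = 4096

for vocab_size in (32_000, 128_000):
    params = vocab_size * d_model  # one d_model-sized embedding row per id
    print(f"vocab {vocab_size:>7}: {params / 1e9:.3f}B embedding parameters")

for n_tokens in (1_000, 2_000):
    print(f"{n_tokens} tokens -> ~{n_tokens ** 2:,} attention pairs")
```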
11. Why This Matters for AI
| Concept | AI impact |
|---|---|
| Token ids | Define the rows of embedding matrices and the columns of output logits. |
| Vocabulary size | Controls embedding parameters, softmax cost, and rare-piece coverage. |
| Sequence length | Controls attention compute, memory, context-window utilization, and API cost. |
| Subword segmentation | Determines how words, names, code, numbers, and rare strings are decomposed. |
| Byte fallback | Improves robustness to arbitrary text and reduces unknown-token failures. |
| Special tokens | Encode conversation roles, tools, padding, sequence boundaries, and safety delimiters. |
| Fertility | Reveals fairness and cost differences across languages, domains, and scripts. |
| Round-trip behavior | Protects data pipelines from silent corruption before training or inference. |
12. Conceptual Bridge
The backward bridge is information theory: tokenization is compression with a finite codebook, but unlike pure compression it must also support neural prediction, stable ids, and clean detokenization.
The forward bridge is embedding space. Once a tokenizer emits ids, each id selects a row of an embedding matrix. Attention, next-token probability, scaling laws, RAG chunking, and serving cost all inherit the tokenizer's sequence length and boundary choices.
+-----------+ +--------------+ +--------------+ +------------+
| raw text | ---> | token ids | ---> | embeddings | ---> | attention |
| bytes | | finite vocab | | vector rows | | positions |
+-----------+ +--------------+ +--------------+ +------------+
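A minimal sketch of the id-to-row step in that pipeline, assuming numpy and made-up sizes:

```python
import numpy as np

# Each token id selects one row of the embedding matrix; sizes are made up.
vocab_size, d_model = 8, 4
E = np.random.default_rng(0).normal(size=(vocab_size, d_model))

ids = [3, 1, 4]   # output of some encoder
x = E[ids]        # row lookup: one embedding vector per token id
print(x.shape)    # (3, 4): sequence length x model dimension
```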
The practical habit is to inspect tokens before trusting intuition. If the model behaves strangely on numbers, names, code, or multilingual text, the tokenizer is one of the first places to look.
References
- Sennrich, Haddow, Birch. Neural Machine Translation of Rare Words with Subword Units. https://aclanthology.org/P16-1162/
- Kudo and Richardson. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer. https://arxiv.org/abs/1808.06226
- Google. SentencePiece repository. https://github.com/google/sentencepiece
- Hugging Face. Tokenization algorithms. https://huggingface.co/docs/transformers/tokenizer_summary
- OpenAI. tiktoken tokenizer library. https://github.com/openai/tiktoken