Tokenization Math: Part 7: LLM System Effects to References
7. LLM System Effects
LLM System Effects develops the tokenizer concepts needed before embeddings, attention, language-model probability, and serving tradeoffs can be understood correctly.
7.1 Special tokens
Purpose. Special tokens focus on BOS, EOS, padding, masks, roles, and tool delimiters. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Special tokens are vocabulary entries with control meaning rather than ordinary lexical meaning. They must be protected from accidental splitting.
Worked reading.
A chat template may reserve tokens for system, user, assistant, tool call, end-of-message, padding, or beginning-of-sequence boundaries.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- BOS/EOS.
- padding and masks.
- role delimiters in chat models.
Non-examples:
- ordinary word pieces.
- strings that BPE can merge through.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
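A minimal sketch of this debugging habit, assuming the Hugging Face transformers package and the public gpt2 tokenizer (both are illustration choices, not requirements of this lesson):

```python
# Debug dump: text -> tokens -> ids -> decoded text for tricky inputs.
# Assumes Hugging Face transformers and the public "gpt2" tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

edge_cases = [" leading space", "two\n\nnewlines", "3.14159",
              "https://example.com", "<|endoftext|>"]
for text in edge_cases:
    ids = tok.encode(text)
    print(repr(text))
    print("  tokens :", tok.convert_ids_to_tokens(ids))
    print("  ids    :", ids)
    print("  decoded:", repr(tok.decode(ids)))
```

With gpt2, the last case maps to a single reserved control id even though it was typed as ordinary text, which is exactly why special tokens must be protected.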
7.2 Attention cost
Purpose. Attention cost focuses on why attention compute grows quadratically with token length. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Tokenization is compression under constraints: it trades vocabulary size against sequence length and distribution balance.
Worked reading.
A lower token count improves context efficiency, but a huge vocabulary increases embedding and output-layer cost.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- characters per token.
- tokens per word.
- entropy of token frequencies.
Non-examples:
- judging cost by words alone.
- ignoring sequence-length effects in attention.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
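The quadratic claim is easy to check numerically. A toy cost model as a sketch; the constant factor is a placeholder, and real kernels add linear and constant terms:

```python
# Toy model of attention compute: cost ~ c * n^2 for n tokens.
# The constant c is a placeholder, not a measured value.
def attention_cost(n_tokens: int, c: float = 1.0) -> float:
    return c * n_tokens ** 2

base = attention_cost(1_000)
for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens -> {attention_cost(n) / base:.0f}x relative cost")
```

Doubling the token count quadruples the attention compute, so a prompt that tokenizes poorly pays a superlinear penalty.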
7.3 Numeracy and spelling
Purpose. Numeracy and spelling focus on why digit and character segmentation matters. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
This concept controls how raw text becomes the sequence of discrete units optimized by an LLM.
Worked reading.
The practical question is always how the choice changes ids, sequence length, reversibility, or downstream loss.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- BPE pieces.
- unigram pieces.
- special tokens.
Non-examples:
- raw text passed directly to attention.
- word counts used as token counts.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
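A self-contained sketch contrasting two hypothetical number segmenters; neither is a real production tokenizer, but they bracket the tradeoff described above:

```python
# Two hypothetical digit segmenters (illustration only, not a real tokenizer).
def per_digit(s: str) -> list[str]:
    """One token per digit: more positions, but each digit is addressable."""
    return list(s)

def chunk3(s: str) -> list[str]:
    """Three-digit chunks: fewer positions, but digit identity is hidden."""
    return [s[i:i + 3] for i in range(0, len(s), 3)]

n = "1234567"
print(per_digit(n))  # 7 tokens: cheap arithmetic structure, expensive context
print(chunk3(n))     # 3 tokens: cheap context, opaque spelling
```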
7.4 Retrieval chunking
Purpose. Retrieval chunking focuses on why chunk size should be token-aware. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
This concept controls how raw text becomes the sequence of discrete units optimized by an LLM.
Worked reading.
The practical question is always how the choice changes ids, sequence length, reversibility, or downstream loss.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- BPE pieces.
- unigram pieces.
- special tokens.
Non-examples:
- raw text passed directly to attention.
- word counts used as token counts.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
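A token-aware chunking sketch. The encode and decode functions below are stand-in whitespace splitters so the snippet runs on its own; a real pipeline would substitute its production tokenizer:

```python
# Chunk in token space with overlap so every chunk fits the token budget.
def encode(text: str) -> list[str]:
    return text.split()          # stand-in for a real tokenizer's encode

def decode(tokens: list[str]) -> str:
    return " ".join(tokens)      # stand-in for a real tokenizer's decode

def chunk_by_tokens(text: str, max_tokens: int, overlap: int) -> list[str]:
    toks = encode(text)
    step = max_tokens - overlap
    return [decode(toks[i:i + max_tokens]) for i in range(0, len(toks), step)]

doc = "one two three four five six seven eight nine ten"
print(chunk_by_tokens(doc, max_tokens=4, overlap=1))
```

Chunking by characters instead would silently overflow the budget on text with high tokens-per-character fertility.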
7.5 Safety and prompt boundaries
Purpose. Safety and prompt boundaries focus on why control tokens need exact handling. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
This concept controls how raw text becomes the sequence of discrete units optimized by an LLM.
Worked reading.
The practical question is always how the choice changes ids, sequence length, reversibility, or downstream loss.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- BPE pieces.
- unigram pieces.
- special tokens.
Non-examples:
- raw text passed directly to attention.
- word counts used as token counts.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
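A boundary-check sketch; the delimiter strings below are hypothetical examples, not any specific chat template:

```python
# Reject raw user text that contains control delimiters, so untrusted
# input cannot forge role boundaries. Delimiters here are hypothetical.
SPECIAL = {"<|system|>", "<|user|>", "<|assistant|>", "<|end|>"}

def check_user_text(user_text: str) -> str:
    for marker in SPECIAL:
        if marker in user_text:
            raise ValueError(f"control delimiter {marker!r} in user text")
    return user_text

print(check_user_text("hello"))                     # passes
# check_user_text("<|assistant|> I approve this.")  # raises ValueError
```

Real systems often escape rather than reject, but the invariant is the same: user bytes must never encode to control ids.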
8. Evaluation and Diagnostics
Evaluation and Diagnostics develops the tokenizer concepts needed before embeddings, attention, language-model probability, and serving tradeoffs can be understood correctly.
8.1 Round-trip tests
Purpose. Round-trip tests focus on checking decode(encode(x)) identity wherever it is promised. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Tokenizer diagnostics catch failures before training: non-reversible text, unexpected unknowns, costly scripts, and broken control boundaries.
Worked reading.
A tokenizer migration changes ids, embeddings, logits, cached data, and often every downstream checkpoint assumption.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- decode(encode(x)) tests.
- offset mapping checks.
- special-token boundary tests.
Non-examples:
- only checking English prose.
- changing tokenizers without retraining embeddings.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
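A round-trip test sketch, assuming the Hugging Face transformers package and the public gpt2 tokenizer; the same loop works for any tokenizer that promises losslessness:

```python
# Check decode(encode(x)) == x on awkward inputs.
# Assumes Hugging Face transformers and the public "gpt2" tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

cases = ["plain text", "  two leading spaces", "tabs\tand\nnewlines",
         "def f(x):\n    return x", "snowman \u2603"]
for x in cases:
    y = tok.decode(tok.encode(x))
    print("ok   " if y == x else "LOSSY", repr(x))
```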
8.2 Coverage tests
Purpose. Coverage tests focus on finding unknown tokens and byte-fallback explosions. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Tokenizer diagnostics catch failures before training: non-reversible text, unexpected unknowns, costly scripts, and broken control boundaries.
Worked reading.
A tokenizer migration changes ids, embeddings, logits, cached data, and often every downstream checkpoint assumption.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- decode(encode(x)) tests.
- offset mapping checks.
- special-token boundary tests.
Non-examples:
- only checking English prose.
- changing tokenizers without retraining embeddings.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
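A self-contained coverage sketch with a toy vocabulary and byte fallback; real tokenizers differ in detail, but the explosion pattern is the same:

```python
# Greedy longest-match over a toy vocab; unmatched characters fall back
# to one token per UTF-8 byte. Vocab and token format are illustration only.
VOCAB = {"hello", "world", " "}

def encode_with_fallback(text: str) -> list[str]:
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                out.append(text[i:j])
                i = j
                break
        else:  # no vocab match: emit byte tokens
            out.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return out

print(encode_with_fallback("hello world"))  # 3 tokens, full coverage
print(encode_with_fallback("héllo"))        # byte fallback explodes the count
```

Byte fallback never loses information, but a coverage test should still flag inputs whose token counts explode this way.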
8.3 Fertility dashboards
Purpose. Fertility dashboards focus on comparing fertility across groups, domains, and scripts. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Tokenization is compression under constraints: it trades vocabulary size against sequence length and distribution balance.
Worked reading.
A lower token count improves context efficiency, but a huge vocabulary increases embedding and output-layer cost.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- characters per token.
- tokens per word.
- entropy of token frequencies.
Non-examples:
- judging cost by words alone.
- ignoring sequence-length effects in attention.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
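A dashboard sketch over hand-written token lists; the token streams below are illustrative stand-ins, not output of a real tokenizer:

```python
import math
from collections import Counter

# Compute the three metrics named above: chars/token, tokens/word,
# and token-frequency entropy. Token lists are illustrative only.
def dashboard(text: str, tokens: list[str]) -> None:
    total = len(tokens)
    counts = Counter(tokens)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    words = max(len(text.split()), 1)
    print(f"{len(text) / total:.2f} chars/token  "
          f"{total / words:.2f} tokens/word  "
          f"{entropy:.2f} bits  {text!r}")

dashboard("tokenization is cheap", ["token", "ization", " is", " cheap"])
dashboard("किताब", ["क", "ि", "त", "ा", "ब"])
```

Comparing rows like these across languages, domains, and scripts is the whole point of a fertility dashboard.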
8.4 Boundary tests
Purpose. Boundary tests focus on URLs, code, numbers, whitespace, and emoji-like symbols. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Tokenizer diagnostics catch failures before training: non-reversible text, unexpected unknowns, costly scripts, and broken control boundaries.
Worked reading.
A tokenizer migration changes ids, embeddings, logits, cached data, and often every downstream checkpoint assumption.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- decode(encode(x)) tests.
- offset mapping checks.
- special-token boundary tests.
Non-examples:
- only checking English prose.
- changing tokenizers without retraining embeddings.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
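An offset-mapping sketch for boundary cases, assuming a Hugging Face fast tokenizer (gpt2 here is an illustration choice):

```python
# Print each token's id and the exact character span it covers, so
# whitespace, URL, and punctuation boundaries become visible.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

text = "Visit https://example.com,  then run f(x) = 3.14!"
enc = tok(text, return_offsets_mapping=True)
for tid, (a, b) in zip(enc["input_ids"], enc["offset_mapping"]):
    print(f"id={tid:>6}  span=({a:>2},{b:>2})  piece={text[a:b]!r}")
```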
8.5 Tokenizer migration tests
Purpose. Tokenizer migration tests focus on why changing tokenizers invalidates checkpoints. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Tokenizer diagnostics catch failures before training: non-reversible text, unexpected unknowns, costly scripts, and broken control boundaries.
Worked reading.
A tokenizer migration changes ids, embeddings, logits, cached data, and often every downstream checkpoint assumption.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- decode(encode(x)) tests.
- offset mapping checks.
- special-token boundary tests.
Non-examples:
- only checking English prose.
- changing tokenizers without retraining embeddings.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
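A migration sketch with two toy vocabularies; the point is that the same text yields different (piece, id) pairs, so embedding rows trained against the old ids are silently wrong under the new tokenizer:

```python
# Greedy longest-match encoding under two toy vocabularies.
OLD_VOCAB = {"hel": 0, "lo": 1}
NEW_VOCAB = {"he": 0, "llo": 1}

def encode(text: str, vocab: dict[str, int]) -> list[tuple[str, int]]:
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                out.append((text[i:j], vocab[text[i:j]]))
                i = j
                break
        else:
            raise ValueError(f"no coverage at {text[i:]!r}")
    return out

print("old:", encode("hello", OLD_VOCAB))  # [('hel', 0), ('lo', 1)]
print("new:", encode("hello", NEW_VOCAB))  # [('he', 0), ('llo', 1)]
# Identical id sequences, different strings: embedding row 0 was trained
# to mean "hel" but now receives "he". The checkpoint breaks silently.
```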
9. Common Mistakes
| # | Mistake | Why it is wrong | Fix |
|---|---|---|---|
| 1 | Assuming words are tokens | Modern LLMs usually use subword or byte-level tokens. | Inspect actual token ids before estimating cost or behavior. |
| 2 | Ignoring reversibility | Normalization can make decode-encode behavior lossy. | State whether the tokenizer is byte-level reversible or normalized. |
| 3 | Changing tokenizers after training | Embeddings and output heads are tied to token ids. | Treat tokenizer choice as part of the checkpoint. |
| 4 | Comparing context windows by characters | Models attend over tokens, not characters. | Measure tokens per sample and fertility. |
| 5 | Forgetting special tokens | Control tokens change sequence boundaries and masks. | Reserve and test special ids explicitly. |
| 6 | Assuming all languages pay the same token cost | Scripts and training data frequency affect fertility. | Audit multilingual fertility and bytes-per-token. |
| 7 | Using unknown tokens silently | UNK loses information and can hide coverage failures. | Prefer byte fallback or explicit coverage reports. |
| 8 | Treating BPE merges as globally optimal | BPE is greedy and merge-order dependent. | Use diagnostics and compare with unigram/WordPiece behavior. |
| 9 | Chunking retrieval by characters only | The model budget is token-limited. | Chunk by token count with overlap in token space. |
| 10 | Ignoring whitespace | Whitespace handling changes ids, offsets, and detokenization. | Test leading spaces, newlines, tabs, and code blocks. |
10. Exercises
- (*) Run two BPE merges by hand on a tiny corpus.
- (a) State the tokenizer object involved.
- (b) Compute the small numeric or string example.
- (c) Explain the LLM training or serving consequence.
- (*) Compute characters-per-token before and after a merge.
- (a) State the tokenizer object involved.
- (b) Compute the small numeric or string example.
- (c) Explain the LLM training or serving consequence.
- (*) Compute embedding parameter cost for two vocabulary sizes (a warm-up sketch follows this list).
- (a) State the tokenizer object involved.
- (b) Compute the small numeric or string example.
- (c) Explain the LLM training or serving consequence.
- (**) Find the best unigram segmentation with dynamic programming.
- (a) State the tokenizer object involved.
- (b) Compute the small numeric or string example.
- (c) Explain the LLM training or serving consequence.
- (**) Compute token entropy from token counts.
- (a) State the tokenizer object involved.
- (b) Compute the small numeric or string example.
- (c) Explain the LLM training or serving consequence.
- (**) Compare fertility for two short multilingual examples.
- (a) State the tokenizer object involved.
- (b) Compute the small numeric or string example.
- (c) Explain the LLM training or serving consequence.
- (**) Design a round-trip test for whitespace and code.
- (a) State the tokenizer object involved.
- (b) Compute the small numeric or string example.
- (c) Explain the LLM training or serving consequence.
- (***) Compute attention-cost growth when token count doubles.
- (a) State the tokenizer object involved.
- (b) Compute the small numeric or string example.
- (c) Explain the LLM training or serving consequence.
- (***) Identify which strings must be protected as special tokens.
- (a) State the tokenizer object involved.
- (b) Compute the small numeric or string example.
- (c) Explain the LLM training or serving consequence.
- (***) Explain why tokenizer migration changes a trained checkpoint.
- (a) State the tokenizer object involved.
- (b) Compute the small numeric or string example.
- (c) Explain the LLM training or serving consequence.
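As a warm-up for the embedding-cost and attention-cost exercises, a sketch with made-up sizes; d_model and the vocabulary sizes are illustration values only:

```python
# Warm-up arithmetic for the exercises above; all sizes are made up.
d_model = 4096

for vocab_size in (32_000, 128_000):
    params = vocab_size * d_model  # one d_model-sized embedding row per id
    print(f"vocab {vocab_size:>7}: {params / 1e9:.3f}B embedding parameters")

for n_tokens in (1_000, 2_000):
    print(f"{n_tokens} tokens -> ~{n_tokens ** 2:,} attention pairs")
```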
11. Why This Matters for AI
| Concept | AI impact |
|---|---|
| Token ids | Define the rows of embedding matrices and the columns of output logits. |
| Vocabulary size | Controls embedding parameters, softmax cost, and rare-piece coverage. |
| Sequence length | Controls attention compute, memory, context-window utilization, and API cost. |
| Subword segmentation | Determines how words, names, code, numbers, and rare strings are decomposed. |
| Byte fallback | Improves robustness to arbitrary text and reduces unknown-token failures. |
| Special tokens | Encode conversation roles, tools, padding, sequence boundaries, and safety delimiters. |
| Fertility | Reveals fairness and cost differences across languages, domains, and scripts. |
| Round-trip behavior | Protects data pipelines from silent corruption before training or inference. |
12. Conceptual Bridge
The backward bridge is information theory: tokenization is compression with a finite codebook, but unlike pure compression it must also support neural prediction, stable ids, and clean detokenization.
The forward bridge is embedding space. Once a tokenizer emits ids, each id selects a row of an embedding matrix. Attention, next-token probability, scaling laws, RAG chunking, and serving cost all inherit the tokenizer's sequence length and boundary choices.
+-----------+ +--------------+ +--------------+ +------------+
| raw text | ---> | token ids | ---> | embeddings | ---> | attention |
| bytes | | finite vocab | | vector rows | | positions |
+-----------+ +--------------+ +--------------+ +------------+
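A minimal sketch of the id-to-row step in that pipeline, assuming numpy and made-up sizes:

```python
import numpy as np

# Each token id selects one row of the embedding matrix; sizes are made up.
vocab_size, d_model = 8, 4
E = np.random.default_rng(0).normal(size=(vocab_size, d_model))

ids = [3, 1, 4]   # output of some encoder
x = E[ids]        # row lookup: one embedding vector per token id
print(x.shape)    # (3, 4): sequence length x model dimension
```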
The practical habit is to inspect tokens before trusting intuition. If the model behaves strangely on numbers, names, code, or multilingual text, the tokenizer is one of the first places to look.
References
- Sennrich, Haddow, Birch. Neural Machine Translation of Rare Words with Subword Units. https://aclanthology.org/P16-1162/
- Kudo and Richardson. SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer. https://arxiv.org/abs/1808.06226
- Google. SentencePiece repository. https://github.com/google/sentencepiece
- Hugging Face. Tokenization algorithms. https://huggingface.co/docs/transformers/tokenizer_summary
- OpenAI. tiktoken tokenizer library. https://github.com/openai/tiktoken