Tokenization Math, Part 1: Intuition through Byte Pair Encoding
1. Intuition
Intuition develops the tokenizer concepts needed before embeddings, attention, language-model probability, and serving tradeoffs can be understood correctly.
1.1 Text is not the model input
Purpose. Text is not the model input focuses on why neural networks consume integer ids. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
A transformer receives integer token ids, not raw strings. The tokenizer decides the discrete sequence over which embeddings, attention, and next-token probabilities are defined.
Worked reading.
If the text lowering becomes [low, er, ing], the model predicts over those pieces. If it becomes [lower, ing], the prediction problem changes.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- token embeddings.
- next-token loss.
- attention over token positions.
Non-examples:
- raw Unicode text inside a matrix multiply.
- a word-level assumption for every language.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
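The round-trip check in the habit list can be sketched with a toy vocabulary. The greedy longest-match segmenter below is purely illustrative (real BPE and unigram tokenizers segment differently), but the contract decode(encode(x)) == x is the same.

```python
# Toy fixed vocabulary; ids are just positions in this list (illustrative).
VOCAB = ["low", "er", "ing", "l", "o", "w", "e", "i", "n", "g", "r", " "]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
ID_TO_TOKEN = {i: tok for tok, i in TOKEN_TO_ID.items()}

def encode(text: str) -> list[int]:
    """Greedy longest-match segmentation into known pieces (a sketch)."""
    ids, pos = [], 0
    while pos < len(text):
        # Try the longest vocabulary piece that matches at this position.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos = end
                break
        else:
            raise ValueError(f"no piece covers {text[pos]!r}")
    return ids

def decode(ids: list[int]) -> str:
    return "".join(ID_TO_TOKEN[i] for i in ids)

ids = encode("lowering")      # [low, er, ing] -> [0, 1, 2]
assert decode(ids) == "lowering"
```

Printing the tokens, ids, and decoded text for edge cases (leading spaces, newlines, URLs) is exactly the debugging habit described below.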
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
1.2 The tokenizer as a learned compression map
Purpose. The tokenizer as a learned compression map focuses on why segmentation changes sequence length. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Tokenization is compression under constraints: it trades vocabulary size against sequence length and distribution balance.
Worked reading.
A lower token count improves context efficiency, but a huge vocabulary increases embedding and output-layer cost.
Examples:
- characters per token.
- tokens per word.
- entropy of token frequencies.
Non-examples:
- judging cost by words alone.
- ignoring sequence-length effects in attention.
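The metrics named above can be computed directly. The token list below is a hypothetical segmentation used only to exercise the formulas; it does not come from any particular tokenizer.

```python
import math
from collections import Counter

def chars_per_token(text: str, tokens: list[str]) -> float:
    # Higher is better for context efficiency: more text per position.
    return len(text) / len(tokens)

def token_entropy_bits(tokens: list[str]) -> float:
    # Entropy of the empirical token distribution; measures balance.
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

tokens = ["low", "er", "ing", " ", "low", "er"]   # hypothetical segmentation
text = "".join(tokens)                            # "lowering lower", 14 chars
print(chars_per_token(text, tokens))              # 14 / 6, about 2.33
```

A bigger vocabulary typically raises chars-per-token (fewer, longer pieces) while spending more rows in the embedding table, which is exactly the tradeoff this section describes.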
1.3 Open vocabulary pressure
Purpose. Open vocabulary pressure focuses on why word-level vocabularies fail. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Natural language is open-vocabulary: new words, names, typos, and code identifiers appear constantly, so no fixed word-level vocabulary can cover every input.
Worked reading.
A word-level vocabulary must either grow without bound or map unseen words to an unknown token; subword pieces avoid both failures by composing rare words from frequent fragments.
Examples:
- newly coined words and product names.
- typos and rare inflections.
- code identifiers and URLs.
Non-examples:
- a closed command set with fixed keywords.
- a fixed label vocabulary for classification.
1.4 Tokenization tax
Purpose. Tokenization tax focuses on why some languages and strings cost more tokens. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
The tokenization tax is the extra token count a string pays for the same content: text in scripts or domains the vocabulary covers poorly fragments into more, shorter pieces.
Worked reading.
The same sentence rendered in an underrepresented script can tokenize into several times as many pieces, shrinking usable context and raising per-request cost.
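A crude probe of this tax, assuming a byte-level base alphabet so that byte count is a lower bound on token count before any merges are applied:

```python
ascii_text = "hello world"
cyrillic_text = "привет мир"   # similar meaning, non-Latin script

# Bytes per character: a floor on per-character token cost for a
# byte-level tokenizer before any merges compress the sequence.
tax = len(cyrillic_text.encode("utf-8")) / len(cyrillic_text)

assert len(ascii_text.encode("utf-8")) == len(ascii_text)  # 1 byte per char
assert tax > 1.0   # each Cyrillic letter needs 2 UTF-8 bytes
```

Merges learned mostly from English data shrink the English sequence much further, so the realized tax on underrepresented scripts is usually larger than this byte-level floor suggests.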
Examples:
- BPE pieces.
- unigram pieces.
- special tokens.
Non-examples:
- raw text passed directly to attention.
- word counts used as token counts.
1.5 Where tokenization affects LLM behavior
Purpose. Where tokenization affects LLM behavior focuses on context, cost, arithmetic, safety, retrieval. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Tokenization effects surface wherever token counts or boundaries matter: context-window budgeting, billing, digit-level arithmetic, safety filtering, and retrieval chunking.
Worked reading.
A number split into irregular digit groups is harder to manipulate arithmetically, and a retrieval chunker that counts words instead of tokens will overfill the context window.
2. Formal Definitions
Formal Definitions develops the tokenizer concepts needed before embeddings, attention, language-model probability, and serving tradeoffs can be understood correctly.
2.1 Alphabet, strings, and byte sequences
Purpose. Alphabet, strings, and byte sequences focuses on the raw domain the tokenizer reads from. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
The raw alphabet determines whether the tokenizer starts from bytes, Unicode code points, characters, or pre-tokenized pieces.
Worked reading.
Byte-level systems can represent arbitrary input without an unknown character because every string can be encoded as bytes.
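This reversibility is easy to verify with Python's built-in UTF-8 codec:

```python
# Any Unicode string encodes to UTF-8 bytes (values 0-255), so a
# byte-level tokenizer never needs an <unk> symbol.
text = "naïve café 日本語 🙂"
raw = text.encode("utf-8")          # bytes: the atomic alphabet
assert all(0 <= b <= 255 for b in raw)
assert raw.decode("utf-8") == text  # exactly reversible

# One character is not one byte: 🙂 alone takes 4 bytes in UTF-8.
assert len("🙂".encode("utf-8")) == 4
```

The last assertion is the pitfall listed under non-examples: character count and byte count diverge as soon as the text leaves ASCII.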
Examples:
- UTF-8 byte sequences.
- ASCII text.
- mixed code and natural language.
Non-examples:
- assuming one character equals one byte.
- dropping unsupported symbols silently.
2.2 Vocabulary and token ids
Purpose. Vocabulary and token ids focuses on the finite piece set and its integer id map. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
The vocabulary is a finite set of pieces with stable integer ids. Neural embeddings make those ids trainable vectors.
Worked reading.
With vocabulary size V and model width d, the input embedding table has V × d parameters; an untied output projection adds another d × V.
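A quick worked instance of that count, with illustrative sizes (not taken from any specific model):

```python
V = 50_000   # vocabulary size (illustrative)
d = 4_096    # model width (illustrative)

# Input embedding table: one d-dimensional row per vocabulary id.
embedding_params = V * d
print(embedding_params)   # 204800000, roughly 0.2B parameters

# An untied output projection doubles the vocabulary-dependent cost.
untied_total = 2 * embedding_params
```

Doubling V to shorten sequences therefore adds hundreds of millions of parameters at this width, which is why vocabulary size is a serving-cost decision, not a preprocessing detail.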
Examples:
- embedding lookup.
- output softmax.
- reserved special ids.
Non-examples:
- renumbering tokens after training.
- adding pieces without resizing embeddings.
2.3 Tokenizer and detokenizer functions
Purpose. Tokenizer and detokenizer functions focuses on the encode and decode pair. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
The encoder maps a string to a sequence of token ids; the detokenizer maps ids back to a string. Together they form the interface between raw text and the model.
Worked reading.
If encode and decode disagree about details such as leading spaces, generated text will not concatenate back the way the training data did.
2.4 Lossless versus lossy normalization
Purpose. Lossless versus lossy normalization focuses on why reversible byte tokenization matters. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
A pipeline is lossless when decode(encode(x)) returns x for every input; normalization steps such as lowercasing or Unicode compatibility folding make it lossy.
Worked reading.
Lossy normalization may be harmless for classification, but it breaks code and exact-string tasks, where whitespace and the precise characters carry meaning.
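Python's stdlib unicodedata shows the lossless/lossy contrast directly: NFKC folds compatibility characters and cannot be undone, while NFC leaves them alone.

```python
import unicodedata

original = "ﬁle"                 # starts with U+FB01, LATIN SMALL LIGATURE FI
normalized = unicodedata.normalize("NFKC", original)

assert normalized == "file"      # the ligature information is gone
assert normalized != original    # so the map is not invertible

# NFC only applies canonical composition; the ligature survives untouched.
assert unicodedata.normalize("NFC", original) == original
```

A tokenizer that applies NFKC before encoding can therefore never promise decode(encode(x)) == x; byte-level pipelines avoid the problem by skipping destructive normalization entirely.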
2.5 Pre-tokenization boundaries
Purpose. Pre-tokenization boundaries focuses on how whitespace and regex choices constrain merges. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Pre-tokenization splits text into coarse spans, typically with whitespace and regex rules, before subword learning; merges never cross these span boundaries.
Worked reading.
A rule that attaches the leading space to a word makes " the" and "the" distinct tokens, which changes how words are modeled at sequence starts versus mid-sentence.
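A minimal sketch of such a pre-tokenizer. The regex below is a simplified stand-in, not any production rule (GPT-2's actual pattern needs the third-party regex module for Unicode categories):

```python
import re

# Letters or digits with an optional leading space, punctuation runs
# with an optional leading space, or remaining whitespace.
PRETOKEN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^A-Za-z0-9\s]+|\s+")

pieces = PRETOKEN.findall("Hello, world! 42")
print(pieces)   # ['Hello', ',', ' world', '!', ' 42']
```

Because merges learned later can never cross these boundaries, the choice of pattern silently caps how long any learned piece can be, for example preventing a merged "world!" token here.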
3. Byte Pair Encoding
Byte Pair Encoding develops the tokenizer concepts needed before embeddings, attention, language-model probability, and serving tradeoffs can be understood correctly.
3.1 BPE merge objective
Purpose. BPE merge objective focuses on frequency-based pair replacement. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
BPE starts from small symbols and repeatedly merges the most frequent adjacent pair. The learned merge order becomes a compression table.
Worked reading.
If l o is the most frequent pair, BPE creates lo; later it may create low if lo w becomes frequent.
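One merge step can be sketched on a toy corpus; the corpus and helper names are illustrative, not from any library.

```python
from collections import Counter

# Toy corpus as symbol lists; "lot" breaks the (l,o) vs (o,w) frequency tie.
corpus = [list("low"), list("lower"), list("lowest"), list("low"), list("lot")]

def most_frequent_pair(words):
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return max(pairs.items(), key=lambda kv: kv[1])[0]

def apply_merge(word, pair):
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])   # fuse the pair in place
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

pair = most_frequent_pair(corpus)               # ('l', 'o'), 5 occurrences
corpus = [apply_merge(w, pair) for w in corpus]
```

After this step every "lo" is one symbol, so the next round can discover (lo, w) and eventually build "low", exactly the progression in the worked reading above.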
Examples:
- GPT-style byte-level tokenizers.
- subword NMT.
- domain-specific vocabulary learning.
Non-examples:
- probabilistically summing all segmentations.
- longest-match WordPiece with continuation markers.
3.2 Greedy training loop
Purpose. Greedy training loop focuses on why merges are sequential decisions. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
BPE training is a greedy loop: count adjacent pairs, merge the most frequent pair everywhere it occurs, record the merge, and repeat until the vocabulary budget is reached.
Worked reading.
Each merge changes the pair statistics seen by the next step, so merges are sequential decisions: an early merge can enable or block later ones.
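BPE's greedy training loop can be sketched on a toy corpus; function names and the corpus are illustrative.

```python
from collections import Counter

def merge_word(word, pair):
    """Fuse every occurrence of the adjacent pair inside one word."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

def train_bpe(raw_words, num_merges):
    words = [list(w) for w in raw_words]
    merges = []                       # rank order = position in this list
    for _ in range(num_merges):
        pairs = Counter((a, b) for w in words for a, b in zip(w, w[1:]))
        if not pairs:
            break
        best = max(pairs.items(), key=lambda kv: kv[1])[0]
        merges.append(best)
        words = [merge_word(w, best) for w in words]
    return merges

merges = train_bpe(["low", "low", "lower", "lot"], num_merges=2)
print(merges)   # [('l', 'o'), ('lo', 'w')]
```

The second merge (lo, w) only exists because the first merge created the "lo" symbol, which is the sequential-decision point this subsection makes.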
3.3 Encoding with learned merges
Purpose. Encoding with learned merges focuses on applying ranks to new text. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Encoding replays the learned merges on new text: start from atomic symbols and repeatedly apply the applicable merge with the lowest rank, that is, the one learned earliest.
Worked reading.
Rank order matters: applying a later merge before an earlier one can yield a segmentation the model never saw during training.
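Applying learned ranks to new text can be sketched with a hypothetical three-entry merge table; real tables have tens of thousands of entries.

```python
# Hypothetical merge table: pair -> rank (lower rank = learned earlier).
MERGE_RANKS = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}

def bpe_encode(word):
    symbols = list(word)
    while len(symbols) > 1:
        # Collect every adjacent pair that has a learned rank.
        ranked = [
            (MERGE_RANKS[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in MERGE_RANKS
        ]
        if not ranked:
            break                      # no applicable merges remain
        _, i = min(ranked)             # lowest rank wins, as in training
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

print(bpe_encode("lower"))   # ['low', 'er']
```

Note that (e, r) is present from the start, yet (l, o) fires first because its rank is lower; that rank discipline is what keeps encoding canonical.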
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- BPE pieces.
- unigram pieces.
- special tokens.
Non-examples:
- raw text passed directly to attention.
- word counts used as token counts.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with
decode(encode(x))when the pipeline promises losslessness. - Treat special tokens as protected control symbols, not ordinary text pieces.
3.4 Byte-level BPE
Purpose. Byte-level BPE focuses on avoiding unknown characters. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
The raw alphabet determines whether the tokenizer starts from bytes, Unicode code points, characters, or pre-tokenized pieces.
Worked reading.
Byte-level systems can represent arbitrary input without an unknown-character token, because every string has a UTF-8 byte encoding; the trade-off is that non-ASCII text starts from several base symbols per character before merges recover longer pieces.
Examples:
- UTF-8 byte sequences.
- ASCII text.
- mixed code and natural language.
Non-examples:
- assuming one character equals one byte.
- dropping unsupported symbols silently.
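The claims above can be checked with nothing but Python's built-in UTF-8 codec; no tokenizer library is assumed:

```python
# Byte-level alphabet: every string maps to UTF-8 bytes, so at most
# 256 base ids cover all input and no unknown-character token is needed.
text = "café 😀"
ids = list(text.encode("utf-8"))

print(len(text), len(ids))  # 6 characters, 10 bytes
assert len(text) == 6 and len(ids) == 10

# Round trip is lossless.
assert bytes(ids).decode("utf-8") == text

# One character is not one byte: 'é' takes 2 bytes, '😀' takes 4.
assert len("é".encode("utf-8")) == 2
assert len("😀".encode("utf-8")) == 4
```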
3.5 BPE limitations
Purpose. BPE limitations focuses on greedy segmentation and brittle numeric splits. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Because merges apply greedily by learned rank, the segmentation of a string reflects training-corpus frequencies, not linguistic or numeric structure, and a substring's split can change with its context.
Worked reading.
Similar strings can tokenize very differently: digit strings split inconsistently, a leading space produces a different token than the bare word, and rare strings shatter into many short pieces. Each of these changes ids, sequence length, and downstream loss.
Examples:
- digit strings that split inconsistently across similar numbers.
- near-duplicate pieces such as "word" and " word".
- rare words shattering into many single-character pieces.
Non-examples:
- a segmentation chosen to minimize the number of pieces globally.
- tokenization aware of arithmetic or morphology.
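A toy rank table makes the numeric brittleness concrete. The ranks below are illustrative, not from any trained model; the point is that the same digit suffix can segment differently depending on what precedes it:

```python
# Illustrative merge ranks (lower rank = applied first).
ranks = {("2", "0"): 0, ("0", "2"): 1, ("2", "3"): 2}

def bpe_encode(word: str) -> list[str]:
    pieces = list(word)
    while True:
        # Greedy: always apply the best-ranked adjacent pair first.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(pieces, pieces[1:]))
            if pair in ranks
        ]
        if not candidates:
            return pieces
        _, i = min(candidates)
        pieces[i:i + 2] = [pieces[i] + pieces[i + 1]]

print(bpe_encode("2023"))  # → ['20', '23']
print(bpe_encode("1023"))  # → ['1', '02', '3']
```

The shared suffix "023" splits one way after "2" and another way after "1", so the model never sees a stable representation of the digits themselves.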