Tokenization Math: Part 2 - Unigram and SentencePiece (Section 4) to Information and Cost (Section 6)

4. Unigram and SentencePiece

Unigram and SentencePiece develops the tokenizer concepts needed before embeddings, attention, language-model probability, and serving tradeoffs can be understood correctly.

4.1 Unigram token probability model

Purpose. The unigram token probability model focuses on probabilistic segmentation with piece probabilities p(v). In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

x^*=\arg\max_{v_1,\ldots,v_k:\,v_1\cdots v_k=x}\sum_{i=1}^k\log p(v_i).

Operational definition.

This concept controls how raw text becomes the sequence of discrete units optimized by an LLM.

Worked reading.

The practical question is always how the choice changes ids, sequence length, reversibility, or downstream loss.
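
As a concrete illustration of the objective above, the following sketch scores every segmentation of a short string under a tiny hand-written piece table and keeps the highest-scoring one. The piece probabilities and the example string are invented for illustration; a real unigram tokenizer learns p(v) from data and uses dynamic programming rather than enumeration.

```python
import math
from functools import lru_cache

# Toy unigram piece probabilities (invented for illustration).
PIECES = {"a": 0.20, "b": 0.20, "ab": 0.35, "aba": 0.05}

def best_segmentation(text: str):
    """Return (pieces, log probability) of the best segmentation by enumeration."""
    @lru_cache(maxsize=None)
    def solve(i: int):
        if i == len(text):
            return (), 0.0
        best_pieces, best_lp = None, float("-inf")
        for j in range(i + 1, len(text) + 1):
            piece = text[i:j]
            if piece not in PIECES:
                continue
            tail, tail_lp = solve(j)
            if tail is None:
                continue                      # no way to finish from position j
            lp = math.log(PIECES[piece]) + tail_lp
            if lp > best_lp:
                best_pieces, best_lp = (piece,) + tail, lp
        return best_pieces, best_lp
    return solve(0)

print(best_segmentation("abab"))  # (('ab', 'ab'), log 0.35 + log 0.35)
```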

Tokenizer object        | Mathematical role     | LLM consequence
alphabet Σ              | atomic input symbols  | bytes, characters, or normalized symbols
vocabulary V            | finite token set      | embedding and output-logit dimensions
encoder E               | maps text to ids      | prompt length, training examples, costs
decoder D               | maps ids back to text | detokenization and round-trip safety
merge/probability table | segmentation rule     | subword boundaries and rare-string handling

Examples:

  1. BPE pieces.
  2. unigram pieces.
  3. special tokens.

Non-examples:

  1. raw text passed directly to attention.
  2. word counts used as token counts.

Derivation habit.

  1. State the raw alphabet and any normalization step.
  2. State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
  3. Compute token count before making cost, context, or memory claims.
  4. Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
  5. Treat special tokens as protected control symbols, not ordinary text pieces.

Implementation lens.

A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.

The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
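
A minimal version of that habit, assuming the Hugging Face transformers library with any fast tokenizer; the checkpoint name and the edge-case strings below are placeholders, not part of the lesson:

```python
from transformers import AutoTokenizer  # assumes the transformers library is installed

# Placeholder checkpoint; any fast tokenizer that supports offset mapping works.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")

edge_cases = [" leading space", "line\n\nbreaks", "3.14159",
              "https://example.com/a?b=c", "def f(x): return x*2",
              "naïve café 東京", "<|im_start|>user"]

for text in edge_cases:
    enc = tok(text, return_offsets_mapping=True, add_special_tokens=False)
    ids = enc["input_ids"]
    print("text   :", repr(text))
    print("tokens :", tok.convert_ids_to_tokens(ids))
    print("ids    :", ids)
    print("decoded:", repr(tok.decode(ids)))
    print("offsets:", enc["offset_mapping"])
    print("round trip ok:", tok.decode(ids) == text)
    print()
```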

For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.

For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.

4.2 Viterbi segmentation

Purpose. Viterbi segmentation focuses on dynamic programming for best token path. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

\operatorname{best}(i)=\max_{j<i:\,x_{j+1..i}\in\mathcal{V}}\big[\operatorname{best}(j)+\log p(x_{j+1..i})\big],\qquad \operatorname{best}(0)=0.

Operational definition.

Given token probabilities, Viterbi dynamic programming finds the highest-probability segmentation of a string.

Worked reading.

For the string "abab", a unigram model compares [ab, ab], [a, b, ab], [aba, b], and the other valid paths by summing log probabilities; Viterbi finds the best path without enumerating all of them.
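
A sketch of that comparison done with dynamic programming rather than enumeration, reusing the toy piece table from 4.1 (the probabilities are invented): best[i] stores the best log probability over segmentations of the first i characters.

```python
import math

PIECES = {"a": 0.20, "b": 0.20, "ab": 0.35, "aba": 0.05}  # toy probabilities
MAX_LEN = max(len(p) for p in PIECES)

def viterbi_segment(text: str):
    n = len(text)
    best = [float("-inf")] * (n + 1)   # best[i] = best log prob of text[:i]
    back = [None] * (n + 1)            # back[i] = start index of the last piece
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_LEN), i):
            piece = text[j:i]
            if piece in PIECES and best[j] + math.log(PIECES[piece]) > best[i]:
                best[i] = best[j] + math.log(PIECES[piece])
                back[i] = j
    if n > 0 and back[n] is None:
        return None, float("-inf")     # no segmentation with this vocabulary
    pieces, i = [], n
    while i > 0:                        # walk the backpointers to recover pieces
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]

print(viterbi_segment("abab"))  # (['ab', 'ab'], log 0.35 + log 0.35)
```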

Tokenizer object        | Mathematical role     | LLM consequence
alphabet Σ              | atomic input symbols  | bytes, characters, or normalized symbols
vocabulary V            | finite token set      | embedding and output-logit dimensions
encoder E               | maps text to ids      | prompt length, training examples, costs
decoder D               | maps ids back to text | detokenization and round-trip safety
merge/probability table | segmentation rule     | subword boundaries and rare-string handling

Examples:

  1. SentencePiece unigram decoding.
  2. best-path segmentation.
  3. subword sampling baseline.

Non-examples:

  1. frequency-only pair merging.
  2. a regex split with no scoring.

Derivation habit.

  1. State the raw alphabet and any normalization step.
  2. State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
  3. Compute token count before making cost, context, or memory claims.
  4. Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
  5. Treat special tokens as protected control symbols, not ordinary text pieces.

Implementation lens.

A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.

The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.

For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.

For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.

4.3 Forward probabilities

Purpose. Forward probabilities focus on summing over all segmentations. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

p(x)=\sum_{v_1,\ldots,v_k:\,v_1\cdots v_k=x}\ \prod_{i=1}^k p(v_i).

Operational definition.

This concept controls how raw text becomes the sequence of discrete units optimized by an LLM.

Worked reading.

The practical question is always how the choice changes ids, sequence length, reversibility, or downstream loss.
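
A sketch of the forward pass under the same toy piece table (invented probabilities): instead of taking the maximum over paths, it sums path probabilities, which is the marginal quantity the unigram trainer needs.

```python
PIECES = {"a": 0.20, "b": 0.20, "ab": 0.35, "aba": 0.05}  # toy probabilities
MAX_LEN = max(len(p) for p in PIECES)

def forward_prob(text: str) -> float:
    """Total probability of text summed over every valid segmentation."""
    n = len(text)
    alpha = [0.0] * (n + 1)   # alpha[i] = sum of path probabilities for text[:i]
    alpha[0] = 1.0
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_LEN), i):
            piece = text[j:i]
            if piece in PIECES:
                alpha[i] += alpha[j] * PIECES[piece]
    return alpha[n]

# The best single path from the Viterbi example is 0.35 * 0.35 = 0.1225;
# the forward sum is strictly larger because it also counts the other paths.
print(forward_prob("abab"))
```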

Tokenizer object        | Mathematical role     | LLM consequence
alphabet Σ              | atomic input symbols  | bytes, characters, or normalized symbols
vocabulary V              | finite token set      | embedding and output-logit dimensions
encoder E               | maps text to ids      | prompt length, training examples, costs
decoder D               | maps ids back to text | detokenization and round-trip safety
merge/probability table | segmentation rule     | subword boundaries and rare-string handling

Examples:

  1. BPE pieces.
  2. unigram pieces.
  3. special tokens.

Non-examples:

  1. raw text passed directly to attention.
  2. word counts used as token counts.

Derivation habit.

  1. State the raw alphabet and any normalization step.
  2. State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
  3. Compute token count before making cost, context, or memory claims.
  4. Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
  5. Treat special tokens as protected control symbols, not ordinary text pieces.

Implementation lens.

A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.

The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.

For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.

For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.

4.4 EM intuition

Purpose. EM intuition focuses on soft counts for token pieces. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

c(v)=\sum_{s\in\mathcal{S}(x)}P(s\mid x)\,n_v(s),\qquad \mathcal{S}(x)=\text{segmentations of }x,\quad n_v(s)=\text{count of }v\text{ in }s.

Operational definition.

This concept controls how raw text becomes the sequence of discrete units optimized by an LLM.

Worked reading.

The practical question is always how the choice changes ids, sequence length, reversibility, or downstream loss.
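
A small sketch of the E-step idea on a toy corpus, with invented probabilities and explicit enumeration of segmentations for clarity; a real trainer computes the same expected counts with forward-backward sums instead of listing paths.

```python
from collections import Counter

PIECES = {"a": 0.20, "b": 0.20, "ab": 0.35, "aba": 0.05}  # toy starting probabilities

def segmentations(text, vocab):
    """Yield every way to split text into vocabulary pieces."""
    if not text:
        yield []
        return
    for j in range(1, len(text) + 1):
        if text[:j] in vocab:
            for tail in segmentations(text[j:], vocab):
                yield [text[:j]] + tail

def path_prob(seg, probs):
    p = 1.0
    for piece in seg:
        p *= probs[piece]
    return p

def e_step(corpus, probs):
    """Expected piece counts: each segmentation contributes its posterior weight."""
    soft_counts = Counter()
    for text in corpus:
        segs = list(segmentations(text, probs))
        total = sum(path_prob(s, probs) for s in segs)
        if total == 0:
            continue                       # text not segmentable with this vocabulary
        for s in segs:
            weight = path_prob(s, probs) / total
            for piece in s:
                soft_counts[piece] += weight
    return soft_counts

def m_step(soft_counts):
    total = sum(soft_counts.values())
    return {piece: c / total for piece, c in soft_counts.items()}

probs = dict(PIECES)
for _ in range(3):                          # a few EM sweeps on a toy corpus
    probs = m_step(e_step(["abab", "aab"], probs))
print(probs)
```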

Tokenizer object        | Mathematical role     | LLM consequence
alphabet Σ              | atomic input symbols  | bytes, characters, or normalized symbols
vocabulary V            | finite token set      | embedding and output-logit dimensions
encoder E               | maps text to ids      | prompt length, training examples, costs
decoder D               | maps ids back to text | detokenization and round-trip safety
merge/probability table | segmentation rule     | subword boundaries and rare-string handling

Examples:

  1. BPE pieces.
  2. unigram pieces.
  3. special tokens.

Non-examples:

  1. raw text passed directly to attention.
  2. word counts used as token counts.

Derivation habit.

  1. State the raw alphabet and any normalization step.
  2. State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
  3. Compute token count before making cost, context, or memory claims.
  4. Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
  5. Treat special tokens as protected control symbols, not ordinary text pieces.

Implementation lens.

A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.

The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.

For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.

For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.

4.5 Subword regularization

Purpose. Subword regularization focuses on sampling multiple valid segmentations. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

P(s\mid x)\propto\Big(\prod_{v\in s}p(v)\Big)^{\alpha},\qquad s\ \text{a valid segmentation of }x,\ \alpha\in(0,1].

Operational definition.

This concept controls how raw text becomes the sequence of discrete units optimized by an LLM.

Worked reading.

The practical question is always how the choice changes ids, sequence length, reversibility, or downstream loss.
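
A sketch of segmentation sampling with a smoothing exponent alpha, using the same invented toy table; real SentencePiece training samples over an n-best list or the full lattice, but the weighting idea is the same.

```python
import random

PIECES = {"a": 0.20, "b": 0.20, "ab": 0.35, "aba": 0.05}  # toy probabilities

def segmentations(text, vocab):
    if not text:
        yield []
        return
    for j in range(1, len(text) + 1):
        if text[:j] in vocab:
            for tail in segmentations(text[j:], vocab):
                yield [text[:j]] + tail

def sample_segmentation(text, probs, alpha=0.5):
    """Sample a segmentation with probability proportional to P(path) ** alpha."""
    segs = list(segmentations(text, probs))
    scores = []
    for seg in segs:
        p = 1.0
        for piece in seg:
            p *= probs[piece]
        scores.append(p ** alpha)           # alpha < 1 flattens the distribution
    return random.choices(segs, weights=scores, k=1)[0]

random.seed(0)
for _ in range(5):
    print(sample_segmentation("abab", PIECES, alpha=0.5))  # varies across draws
```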

Tokenizer object        | Mathematical role     | LLM consequence
alphabet Σ              | atomic input symbols  | bytes, characters, or normalized symbols
vocabulary V            | finite token set      | embedding and output-logit dimensions
encoder E               | maps text to ids      | prompt length, training examples, costs
decoder D               | maps ids back to text | detokenization and round-trip safety
merge/probability table | segmentation rule     | subword boundaries and rare-string handling

Examples:

  1. BPE pieces.
  2. unigram pieces.
  3. special tokens.

Non-examples:

  1. raw text passed directly to attention.
  2. word counts used as token counts.

Derivation habit.

  1. State the raw alphabet and any normalization step.
  2. State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
  3. Compute token count before making cost, context, or memory claims.
  4. Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
  5. Treat special tokens as protected control symbols, not ordinary text pieces.

Implementation lens.

A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.

The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.

For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.

For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.

5. WordPiece

WordPiece develops the tokenizer concepts needed before embeddings, attention, language-model probability, and serving tradeoffs can be understood correctly.

5.1 WordPiece merge score

Purpose. WordPiece merge score focuses on association-style pair scoring. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

\operatorname{score}(a,b)=\frac{\operatorname{count}(ab)}{\operatorname{count}(a)\,\operatorname{count}(b)}.

Operational definition.

WordPiece builds subwords using an association-style score and commonly distinguishes word starts from continuation pieces.

Worked reading.

A longest-match encoder chooses the longest valid vocabulary piece at each position, using continuation markers inside words.
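
A sketch of the pair-scoring criterion on a tiny, invented word-frequency table; it also shows how the ranking can differ from plain BPE, which would rank pairs by raw pair count alone.

```python
from collections import Counter

# Toy corpus of words split into characters, with invented frequencies.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w"): 6, ("n", "e", "w", "e", "r"): 3}

def pair_scores(corpus):
    unit_counts, pair_counts = Counter(), Counter()
    for word, freq in corpus.items():
        for u in word:
            unit_counts[u] += freq
        for a, b in zip(word, word[1:]):
            pair_counts[(a, b)] += freq
    # WordPiece-style score: count(ab) / (count(a) * count(b));
    # plain BPE would instead merge the pair with the largest count(ab).
    return {p: c / (unit_counts[p[0]] * unit_counts[p[1]])
            for p, c in pair_counts.items()}

scores = pair_scores(corpus)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 4))   # ('l', 'o') wins despite lower raw count
```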

Tokenizer object        | Mathematical role     | LLM consequence
alphabet Σ              | atomic input symbols  | bytes, characters, or normalized symbols
vocabulary V            | finite token set      | embedding and output-logit dimensions
encoder E               | maps text to ids      | prompt length, training examples, costs
decoder D               | maps ids back to text | detokenization and round-trip safety
merge/probability table | segmentation rule     | subword boundaries and rare-string handling

Examples:

  1. BERT tokenization.
  2. continuation pieces.
  3. greedy longest match.

Non-examples:

  1. byte fallback.
  2. unigram path sampling.

Derivation habit.

  1. State the raw alphabet and any normalization step.
  2. State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
  3. Compute token count before making cost, context, or memory claims.
  4. Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
  5. Treat special tokens as protected control symbols, not ordinary text pieces.

Implementation lens.

A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.

The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.

For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.

For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.

5.2 Continuation markers

Purpose. Continuation markers focus on why word-initial pieces and inside-word pieces differ. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

\operatorname{fertility}(x)=\frac{\#\operatorname{tokens}(x)}{\#\operatorname{words}(x)}.

Operational definition.

This concept controls how raw text becomes the sequence of discrete units optimized by an LLM.

Worked reading.

The practical question is always how the choice changes ids, sequence length, reversibility, or downstream loss.
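
A sketch of how BERT-style continuation markers let the decoder rebuild words; the "##" convention and the example pieces are illustrative.

```python
def join_wordpieces(pieces):
    """Rebuild text from BERT-style pieces, where '##' marks a continuation."""
    words, current = [], ""
    for p in pieces:
        if p.startswith("##"):
            current += p[2:]          # continuation: glue onto the current word
        else:
            if current:
                words.append(current)
            current = p               # word-initial piece starts a new word
    if current:
        words.append(current)
    return " ".join(words)

# 'playing' split as a word-initial piece plus a continuation piece:
print(join_wordpieces(["play", "##ing", "the", "new", "##er", "game"]))
# -> "playing the newer game"
```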

Tokenizer object        | Mathematical role     | LLM consequence
alphabet Σ              | atomic input symbols  | bytes, characters, or normalized symbols
vocabulary V            | finite token set      | embedding and output-logit dimensions
encoder E               | maps text to ids      | prompt length, training examples, costs
decoder D               | maps ids back to text | detokenization and round-trip safety
merge/probability table | segmentation rule     | subword boundaries and rare-string handling

Examples:

  1. BPE pieces.
  2. unigram pieces.
  3. special tokens.

Non-examples:

  1. raw text passed directly to attention.
  2. word counts used as token counts.

Derivation habit.

  1. State the raw alphabet and any normalization step.
  2. State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
  3. Compute token count before making cost, context, or memory claims.
  4. Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
  5. Treat special tokens as protected control symbols, not ordinary text pieces.

Implementation lens.

A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.

The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.

For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.

For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.

5.3 Greedy longest-match encoding

Purpose. Greedy longest-match encoding focuses on BERT-style deterministic segmentation. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

\operatorname{params}_{\mathrm{embed}}=|\mathcal{V}|\,d_{\mathrm{model}}.

Operational definition.

This concept controls how raw text becomes the sequence of discrete units optimized by an LLM.

Worked reading.

The practical question is always how the choice changes ids, sequence length, reversibility, or downstream loss.
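
A sketch of greedy longest-match encoding over a tiny, invented vocabulary with "##" continuation pieces and an [UNK] fallback:

```python
VOCAB = {"play", "##ing", "##ed", "un", "##play", "##able", "the", "[UNK]"}

def wordpiece_encode(word, vocab, max_len=20):
    """Greedy longest match: take the longest vocabulary piece at each position."""
    pieces, start = [], 0
    while start < len(word):
        end, found = min(len(word), start + max_len), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # inside-word pieces carry the marker
            if candidate in vocab:
                found = candidate
                break
            end -= 1
        if found is None:
            return ["[UNK]"]                   # whole word falls back to unknown
        pieces.append(found)
        start = end
    return pieces

print(wordpiece_encode("unplayable", VOCAB))  # ['un', '##play', '##able']
print(wordpiece_encode("playing", VOCAB))     # ['play', '##ing']
print(wordpiece_encode("zzz", VOCAB))         # ['[UNK]']
```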

Tokenizer object        | Mathematical role     | LLM consequence
alphabet Σ              | atomic input symbols  | bytes, characters, or normalized symbols
vocabulary V            | finite token set      | embedding and output-logit dimensions
encoder E               | maps text to ids      | prompt length, training examples, costs
decoder D               | maps ids back to text | detokenization and round-trip safety
merge/probability table | segmentation rule     | subword boundaries and rare-string handling

Examples:

  1. BPE pieces.
  2. unigram pieces.
  3. special tokens.

Non-examples:

  1. raw text passed directly to attention.
  2. word counts used as token counts.

Derivation habit.

  1. State the raw alphabet and any normalization step.
  2. State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
  3. Compute token count before making cost, context, or memory claims.
  4. Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
  5. Treat special tokens as protected control symbols, not ordinary text pieces.

Implementation lens.

A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.

The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.

For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.

For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.

5.4 WordPiece versus BPE

Purpose. WordPiece versus BPE focuses on score criterion and encoding behavior. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

\operatorname{attention\ cost}\propto n_{\mathrm{tokens}}^2.

Operational definition.

WordPiece builds subwords using an association-style score and commonly distinguishes word starts from continuation pieces.

Worked reading.

A longest-match encoder chooses the longest valid vocabulary piece at each position, using continuation markers inside words.

Tokenizer object        | Mathematical role     | LLM consequence
alphabet Σ              | atomic input symbols  | bytes, characters, or normalized symbols
vocabulary V            | finite token set      | embedding and output-logit dimensions
encoder E               | maps text to ids      | prompt length, training examples, costs
decoder D               | maps ids back to text | detokenization and round-trip safety
merge/probability table | segmentation rule     | subword boundaries and rare-string handling

Examples:

  1. BERT tokenization.
  2. continuation pieces.
  3. greedy longest match.

Non-examples:

  1. byte fallback.
  2. unigram path sampling.

Derivation habit.

  1. State the raw alphabet and any normalization step.
  2. State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
  3. Compute token count before making cost, context, or memory claims.
  4. Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
  5. Treat special tokens as protected control symbols, not ordinary text pieces.

Implementation lens.

A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.

The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.

For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.

For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.

5.5 Unknown-token risk

Purpose. Unknown-token risk focuses on why byte fallback changes robustness. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

E(x)=(t_1,\ldots,t_n),\qquad t_i\in\{0,\ldots,|\mathcal{V}|-1\}.

Operational definition.

This concept controls how raw text becomes the sequence of discrete units optimized by an LLM.

Worked reading.

The practical question is always how the choice changes ids, sequence length, reversibility, or downstream loss.
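
A sketch of the byte-fallback idea with an invented pseudo-token format: characters outside the vocabulary become one token per UTF-8 byte instead of a single unknown token, so no information is irreversibly lost.

```python
def encode_with_byte_fallback(text, vocab):
    """Characters missing from the vocabulary are emitted as per-byte pseudo-tokens
    rather than collapsing the whole span to an unknown token."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens

VOCAB = set("abcdefghijklmnopqrstuvwxyz ")
print(encode_with_byte_fallback("naïve 日本", VOCAB))
# 'ï' -> <0xC3><0xAF>, each CJK character -> three byte tokens, everything recoverable
```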

Tokenizer object        | Mathematical role     | LLM consequence
alphabet Σ              | atomic input symbols  | bytes, characters, or normalized symbols
vocabulary V            | finite token set      | embedding and output-logit dimensions
encoder E               | maps text to ids      | prompt length, training examples, costs
decoder D               | maps ids back to text | detokenization and round-trip safety
merge/probability table | segmentation rule     | subword boundaries and rare-string handling

Examples:

  1. BPE pieces.
  2. unigram pieces.
  3. special tokens.

Non-examples:

  1. raw text passed directly to attention.
  2. word counts used as token counts.

Derivation habit.

  1. State the raw alphabet and any normalization step.
  2. State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
  3. Compute token count before making cost, context, or memory claims.
  4. Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
  5. Treat special tokens as protected control symbols, not ordinary text pieces.

Implementation lens.

A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.

The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.

For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.

For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.

6. Information and Cost

Information and Cost develops the tokenizer concepts needed before embeddings, attention, language-model probability, and serving tradeoffs can be understood correctly.

6.1 Compression ratio

Purpose. Compression ratio focuses on characters per token and bytes per token. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

\operatorname{fertility}(x)=\frac{\#\operatorname{tokens}(x)}{\#\operatorname{words}(x)}.

Operational definition.

Tokenization is compression under constraints: it trades vocabulary size against sequence length and distribution balance.

Worked reading.

A lower token count improves context efficiency, but a huge vocabulary increases embedding and output-layer cost.
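
A sketch of the two ratios on an invented example, assuming whitespace word splitting and character counts as the size measure:

```python
def compression_stats(text, tokens):
    """Characters per token and tokens per word (fertility) for one tokenization."""
    n_chars, n_tokens = len(text), len(tokens)
    n_words = len(text.split()) or 1
    return {"chars_per_token": n_chars / n_tokens,
            "tokens_per_word": n_tokens / n_words}

text = "internationalization matters"
print(compression_stats(text, ["internationalization", "matters"]))              # coarse pieces
print(compression_stats(text, ["intern", "ation", "al", "ization", "matters"]))  # finer pieces
```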

Tokenizer object        | Mathematical role     | LLM consequence
alphabet Σ              | atomic input symbols  | bytes, characters, or normalized symbols
vocabulary V            | finite token set      | embedding and output-logit dimensions
encoder E               | maps text to ids      | prompt length, training examples, costs
decoder D               | maps ids back to text | detokenization and round-trip safety
merge/probability table | segmentation rule     | subword boundaries and rare-string handling

Examples:

  1. characters per token.
  2. tokens per word.
  3. entropy of token frequencies.

Non-examples:

  1. judging cost by words alone.
  2. ignoring sequence-length effects in attention.

Derivation habit.

  1. State the raw alphabet and any normalization step.
  2. State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
  3. Compute token count before making cost, context, or memory claims.
  4. Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
  5. Treat special tokens as protected control symbols, not ordinary text pieces.

Implementation lens.

A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.

The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.

For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.

For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.

6.2 Token entropy

Purpose. Token entropy focuses on distributional balance of token ids. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

H(T)=-\sum_{v\in\mathcal{V}}p(v)\log_2 p(v).

Operational definition.

Tokenization is compression under constraints: it trades vocabulary size against sequence length and distribution balance.

Worked reading.

A lower token count improves context efficiency, but a huge vocabulary increases embedding and output-layer cost.
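
A sketch of the entropy computation on invented id streams, showing how a balanced distribution scores higher than a skewed one:

```python
import math
from collections import Counter

def token_entropy(token_ids):
    """Shannon entropy (bits) of the empirical token-id distribution."""
    counts = Counter(token_ids)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

balanced = [0, 1, 2, 3] * 25          # four ids used evenly   -> 2.0 bits
skewed   = [0] * 97 + [1, 2, 3]       # one id dominates       -> about 0.24 bits
print(token_entropy(balanced), token_entropy(skewed))
```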

Tokenizer object        | Mathematical role     | LLM consequence
alphabet Σ              | atomic input symbols  | bytes, characters, or normalized symbols
vocabulary V            | finite token set      | embedding and output-logit dimensions
encoder E               | maps text to ids      | prompt length, training examples, costs
decoder D               | maps ids back to text | detokenization and round-trip safety
merge/probability table | segmentation rule     | subword boundaries and rare-string handling

Examples:

  1. characters per token.
  2. tokens per word.
  3. entropy of token frequencies.

Non-examples:

  1. judging cost by words alone.
  2. ignoring sequence-length effects in attention.

Derivation habit.

  1. State the raw alphabet and any normalization step.
  2. State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
  3. Compute token count before making cost, context, or memory claims.
  4. Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
  5. Treat special tokens as protected control symbols, not ordinary text pieces.

Implementation lens.

A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.

The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.

For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.

For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.

6.3 Vocabulary parameter cost

Purpose. Vocabulary parameter cost focuses on embedding and output matrix scaling. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

\operatorname{params}_{\mathrm{embed}}=|\mathcal{V}|\,d_{\mathrm{model}}.

Operational definition.

The vocabulary is a finite set of pieces with stable integer ids. Neural embeddings make those ids trainable vectors.

Worked reading.

With vocabulary size |\mathcal{V}| and model width d_{\mathrm{model}}, the input embedding table has |\mathcal{V}|\,d_{\mathrm{model}} parameters.
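
A sketch of the arithmetic with illustrative sizes rather than any specific model; treating the output projection as a second, untied matrix of the same shape is an assumption stated in the code, since some models tie input and output embeddings.

```python
def vocab_param_cost(vocab_size, d_model, tied=False):
    """Parameters spent on the token embedding table plus the output projection."""
    embed = vocab_size * d_model                       # input embedding rows
    unembed = 0 if tied else vocab_size * d_model      # output logits matrix, unless tied
    return embed + unembed

# Illustrative sizes, not any particular model:
print(vocab_param_cost(32_000, 4096))               # ~262M parameters, untied
print(vocab_param_cost(128_000, 4096))              # ~1.05B parameters, untied
print(vocab_param_cost(128_000, 4096, tied=True))   # ~524M parameters with weight tying
```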

Tokenizer object        | Mathematical role     | LLM consequence
alphabet Σ              | atomic input symbols  | bytes, characters, or normalized symbols
vocabulary V            | finite token set      | embedding and output-logit dimensions
encoder E               | maps text to ids      | prompt length, training examples, costs
decoder D               | maps ids back to text | detokenization and round-trip safety
merge/probability table | segmentation rule     | subword boundaries and rare-string handling

Examples:

  1. embedding lookup.
  2. output softmax.
  3. reserved special ids.

Non-examples:

  1. renumbering tokens after training.
  2. adding pieces without resizing embeddings.

Derivation habit.

  1. State the raw alphabet and any normalization step.
  2. State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
  3. Compute token count before making cost, context, or memory claims.
  4. Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
  5. Treat special tokens as protected control symbols, not ordinary text pieces.

Implementation lens.

A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.

The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.

For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.

For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.

6.4 Context window efficiency

Purpose. Context window efficiency focuses on how token count changes usable context. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

E(x)=(t_1,\ldots,t_n),\qquad t_i\in\{0,\ldots,|\mathcal{V}|-1\}.

Operational definition.

Tokenization is compression under constraints: it trades vocabulary size against sequence length and distribution balance.

Worked reading.

A lower token count improves context efficiency, but a huge vocabulary increases embedding and output-layer cost.
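
A sketch of the back-of-envelope arithmetic with invented numbers for the window size, reserved tokens, and characters-per-token rates:

```python
def usable_characters(context_tokens, reserved_tokens, chars_per_token):
    """Rough count of prompt characters that fit once special and output tokens are reserved."""
    return (context_tokens - reserved_tokens) * chars_per_token

# Same 8k window, different tokenizer efficiency (illustrative rates):
print(usable_characters(8192, 512, 4.2))   # efficient tokenization of English prose
print(usable_characters(8192, 512, 1.8))   # inefficient tokenization, e.g. an unseen script
```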

Tokenizer object        | Mathematical role     | LLM consequence
alphabet Σ              | atomic input symbols  | bytes, characters, or normalized symbols
vocabulary V            | finite token set      | embedding and output-logit dimensions
encoder E               | maps text to ids      | prompt length, training examples, costs
decoder D               | maps ids back to text | detokenization and round-trip safety
merge/probability table | segmentation rule     | subword boundaries and rare-string handling

Examples:

  1. characters per token.
  2. tokens per word.
  3. entropy of token frequencies.

Non-examples:

  1. judging cost by words alone.
  2. ignoring sequence-length effects in attention.

Derivation habit.

  1. State the raw alphabet and any normalization step.
  2. State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
  3. Compute token count before making cost, context, or memory claims.
  4. Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
  5. Treat special tokens as protected control symbols, not ordinary text pieces.

Implementation lens.

A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.

The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.

For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.

For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.

6.5 Multilingual fertility

Purpose. Multilingual fertility focuses on tokens per word across languages or scripts. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.

\operatorname{fertility}(x)=\frac{\#\operatorname{tokens}(x)}{\#\operatorname{words}(x)}.

Operational definition.

Tokenization is compression under constraints: it trades vocabulary size against sequence length and distribution balance.

Worked reading.

A lower token count improves context efficiency, but a huge vocabulary increases embedding and output-layer cost.
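
A sketch of the measurement loop; the tokenizer here is a deliberately crude placeholder (four characters per token) and the sample strings are illustrative, so a real measurement would plug in the production tokenizer's encode function.

```python
def fertility(text, tokenize):
    """Tokens per whitespace word for one text under a given tokenizer."""
    words = text.split()
    return len(tokenize(text)) / max(len(words), 1)

# Placeholder tokenizer: pretend every 4 characters become one token.
toy_tokenize = lambda s: [s[i:i + 4] for i in range(0, len(s), 4)]

samples = {"en": "the cat sat on the mat",
           "de": "Donaudampfschifffahrtsgesellschaft",
           "sw": "habari za asubuhi rafiki yangu"}
for lang, text in samples.items():
    print(lang, round(fertility(text, toy_tokenize), 2))  # higher = more tokens per word
```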

Tokenizer object        | Mathematical role     | LLM consequence
alphabet Σ              | atomic input symbols  | bytes, characters, or normalized symbols
vocabulary V            | finite token set      | embedding and output-logit dimensions
encoder E               | maps text to ids      | prompt length, training examples, costs
decoder D               | maps ids back to text | detokenization and round-trip safety
merge/probability table | segmentation rule     | subword boundaries and rare-string handling

Examples:

  1. characters per token.
  2. tokens per word.
  3. entropy of token frequencies.

Non-examples:

  1. judging cost by words alone.
  2. ignoring sequence-length effects in attention.

Derivation habit.

  1. State the raw alphabet and any normalization step.
  2. State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
  3. Compute token count before making cost, context, or memory claims.
  4. Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
  5. Treat special tokens as protected control symbols, not ordinary text pieces.

Implementation lens.

A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.

The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.

For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.

For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
