Tokenization Math, Part 1: Intuition through Byte Pair Encoding
1. Intuition
Intuition develops the tokenizer concepts needed before embeddings, attention, language-model probability, and serving tradeoffs can be understood correctly.
1.1 Text is not the model input
Purpose. Text is not the model input focuses on why neural networks consume integer ids. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
A transformer receives integer token ids, not raw strings. The tokenizer decides the discrete sequence over which embeddings, attention, and next-token probabilities are defined.
Worked reading.
If the text lowering becomes [low, er, ing], the model predicts over those pieces. If it becomes [lower, ing], the prediction problem changes.
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- token embeddings.
- next-token loss.
- attention over token positions.
Non-examples:
- raw Unicode text inside a matrix multiply.
- a word-level assumption for every language.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with decode(encode(x)) when the pipeline promises losslessness.
- Treat special tokens as protected control symbols, not ordinary text pieces.
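The round-trip check in the habit list can be sketched with a toy vocabulary. The greedy longest-match segmenter below is purely illustrative (real BPE and unigram tokenizers segment differently), but the contract decode(encode(x)) == x is the same.

```python
# Toy fixed vocabulary; ids are just positions in this list (illustrative).
VOCAB = ["low", "er", "ing", "l", "o", "w", "e", "i", "n", "g", "r", " "]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
ID_TO_TOKEN = {i: tok for tok, i in TOKEN_TO_ID.items()}

def encode(text: str) -> list[int]:
    """Greedy longest-match segmentation into known pieces (a sketch)."""
    ids, pos = [], 0
    while pos < len(text):
        # Try the longest vocabulary piece that matches at this position.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos = end
                break
        else:
            raise ValueError(f"no piece covers {text[pos]!r}")
    return ids

def decode(ids: list[int]) -> str:
    return "".join(ID_TO_TOKEN[i] for i in ids)

ids = encode("lowering")      # [low, er, ing] -> [0, 1, 2]
assert decode(ids) == "lowering"
```

Printing the tokens, ids, and decoded text for edge cases (leading spaces, newlines, URLs) is exactly the debugging habit described below.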
Implementation lens.
A tokenizer is not a preprocessing detail that can be swapped freely. The embedding matrix, output projection, cached datasets, labels, special-token masks, and generation stop conditions all depend on the exact integer ids.
The most useful debugging habit is to print the text, tokens, ids, decoded text, and offsets for edge cases: leading spaces, repeated newlines, numbers, URLs, code, mixed scripts, and delimiter strings used by chat templates.
For model training, the tokenizer changes the effective curriculum. A tokenizer that splits common words into many pieces makes the model spend more positions modeling spelling-level structure. A tokenizer that memorizes very long pieces may save context but increase vocabulary cost and reduce compositional sharing.
For inference, the tokenizer changes both latency and price because attention cost grows with token length. A prompt that is short in words can still be expensive if it tokenizes poorly.
1.2 The tokenizer as a learned compression map
Purpose. The tokenizer as a learned compression map focuses on why segmentation changes sequence length. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Tokenization is compression under constraints: it trades vocabulary size against sequence length and distribution balance.
Worked reading.
A lower token count improves context efficiency, but a huge vocabulary increases embedding and output-layer cost.
Examples:
- characters per token.
- tokens per word.
- entropy of token frequencies.
Non-examples:
- judging cost by words alone.
- ignoring sequence-length effects in attention.
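The metrics named above can be computed directly. The token list below is a hypothetical segmentation used only to exercise the formulas; it does not come from any particular tokenizer.

```python
import math
from collections import Counter

def chars_per_token(text: str, tokens: list[str]) -> float:
    # Higher is better for context efficiency: more text per position.
    return len(text) / len(tokens)

def token_entropy_bits(tokens: list[str]) -> float:
    # Entropy of the empirical token distribution; measures balance.
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

tokens = ["low", "er", "ing", " ", "low", "er"]   # hypothetical segmentation
text = "".join(tokens)                            # "lowering lower", 14 chars
print(chars_per_token(text, tokens))              # 14 / 6, about 2.33
```

A bigger vocabulary typically raises chars-per-token (fewer, longer pieces) while spending more rows in the embedding table, which is exactly the tradeoff this section describes.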
1.3 Open vocabulary pressure
Purpose. Open vocabulary pressure focuses on why word-level vocabularies fail. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Natural language is open-vocabulary: new words, names, typos, and code identifiers appear constantly, so no fixed word-level vocabulary can cover every input.
Worked reading.
A word-level vocabulary must either grow without bound or map unseen words to an unknown token; subword pieces avoid both failures by composing rare words from frequent fragments.
Examples:
- newly coined words and product names.
- typos and rare inflections.
- code identifiers and URLs.
Non-examples:
- a closed command set with fixed keywords.
- a fixed label vocabulary for classification.
1.4 Tokenization tax
Purpose. Tokenization tax focuses on why some languages and strings cost more tokens. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
The tokenization tax is the extra token count a string pays for the same content: text in scripts or domains the vocabulary covers poorly fragments into more, shorter pieces.
Worked reading.
The same sentence rendered in an underrepresented script can tokenize into several times as many pieces, shrinking usable context and raising per-request cost.
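A crude probe of this tax, assuming a byte-level base alphabet so that byte count is a lower bound on token count before any merges are applied:

```python
ascii_text = "hello world"
cyrillic_text = "привет мир"   # similar meaning, non-Latin script

# Bytes per character: a floor on per-character token cost for a
# byte-level tokenizer before any merges compress the sequence.
tax = len(cyrillic_text.encode("utf-8")) / len(cyrillic_text)

assert len(ascii_text.encode("utf-8")) == len(ascii_text)  # 1 byte per char
assert tax > 1.0   # each Cyrillic letter needs 2 UTF-8 bytes
```

Merges learned mostly from English data shrink the English sequence much further, so the realized tax on underrepresented scripts is usually larger than this byte-level floor suggests.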
Examples:
- BPE pieces.
- unigram pieces.
- special tokens.
Non-examples:
- raw text passed directly to attention.
- word counts used as token counts.
1.5 Where tokenization affects LLM behavior
Purpose. Where tokenization affects LLM behavior focuses on context, cost, arithmetic, safety, retrieval. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Tokenization effects surface wherever token counts or boundaries matter: context-window budgeting, billing, digit-level arithmetic, safety filtering, and retrieval chunking.
Worked reading.
A number split into irregular digit groups is harder to manipulate arithmetically, and a retrieval chunker that counts words instead of tokens will overfill the context window.
2. Formal Definitions
Formal Definitions develops the tokenizer concepts needed before embeddings, attention, language-model probability, and serving tradeoffs can be understood correctly.
2.1 Alphabet, strings, and byte sequences
Purpose. Alphabet, strings, and byte sequences focuses on the raw domain the tokenizer reads from. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
The raw alphabet determines whether the tokenizer starts from bytes, Unicode code points, characters, or pre-tokenized pieces.
Worked reading.
Byte-level systems can represent arbitrary input without an unknown character because every string can be encoded as bytes.
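This reversibility is easy to verify with Python's built-in UTF-8 codec:

```python
# Any Unicode string encodes to UTF-8 bytes (values 0-255), so a
# byte-level tokenizer never needs an <unk> symbol.
text = "naïve café 日本語 🙂"
raw = text.encode("utf-8")          # bytes: the atomic alphabet
assert all(0 <= b <= 255 for b in raw)
assert raw.decode("utf-8") == text  # exactly reversible

# One character is not one byte: 🙂 alone takes 4 bytes in UTF-8.
assert len("🙂".encode("utf-8")) == 4
```

The last assertion is the pitfall listed under non-examples: character count and byte count diverge as soon as the text leaves ASCII.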
Examples:
- UTF-8 byte sequences.
- ASCII text.
- mixed code and natural language.
Non-examples:
- assuming one character equals one byte.
- dropping unsupported symbols silently.
2.2 Vocabulary and token ids
Purpose. Vocabulary and token ids focuses on the finite piece set and its integer id map. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
The vocabulary is a finite set of pieces with stable integer ids. Neural embeddings make those ids trainable vectors.
Worked reading.
With vocabulary size V and model width d, the input embedding table has V × d parameters; an untied output projection adds another d × V.
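A quick worked instance of that count, with illustrative sizes (not taken from any specific model):

```python
V = 50_000   # vocabulary size (illustrative)
d = 4_096    # model width (illustrative)

# Input embedding table: one d-dimensional row per vocabulary id.
embedding_params = V * d
print(embedding_params)   # 204800000, roughly 0.2B parameters

# An untied output projection doubles the vocabulary-dependent cost.
untied_total = 2 * embedding_params
```

Doubling V to shorten sequences therefore adds hundreds of millions of parameters at this width, which is why vocabulary size is a serving-cost decision, not a preprocessing detail.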
Examples:
- embedding lookup.
- output softmax.
- reserved special ids.
Non-examples:
- renumbering tokens after training.
- adding pieces without resizing embeddings.
2.3 Tokenizer and detokenizer functions
Purpose. Tokenizer and detokenizer functions focuses on the encode and decode pair. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
The encoder maps a string to a sequence of token ids; the detokenizer maps ids back to a string. Together they form the interface between raw text and the model.
Worked reading.
If encode and decode disagree about details such as leading spaces, generated text will not concatenate back the way the training data did.
2.4 Lossless versus lossy normalization
Purpose. Lossless versus lossy normalization focuses on why reversible byte tokenization matters. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
A pipeline is lossless when decode(encode(x)) returns x for every input; normalization steps such as lowercasing or Unicode compatibility folding make it lossy.
Worked reading.
Lossy normalization may be harmless for classification, but it breaks code and exact-string tasks, where whitespace and the precise characters carry meaning.
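Python's stdlib unicodedata shows the lossless/lossy contrast directly: NFKC folds compatibility characters and cannot be undone, while NFC leaves them alone.

```python
import unicodedata

original = "ﬁle"                 # starts with U+FB01, LATIN SMALL LIGATURE FI
normalized = unicodedata.normalize("NFKC", original)

assert normalized == "file"      # the ligature information is gone
assert normalized != original    # so the map is not invertible

# NFC only applies canonical composition; the ligature survives untouched.
assert unicodedata.normalize("NFC", original) == original
```

A tokenizer that applies NFKC before encoding can therefore never promise decode(encode(x)) == x; byte-level pipelines avoid the problem by skipping destructive normalization entirely.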
2.5 Pre-tokenization boundaries
Purpose. Pre-tokenization boundaries focuses on how whitespace and regex choices constrain merges. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Pre-tokenization splits text into coarse spans, typically with whitespace and regex rules, before subword learning; merges never cross these span boundaries.
Worked reading.
A rule that attaches the leading space to a word makes " the" and "the" distinct tokens, which changes how words are modeled at sequence starts versus mid-sentence.
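A minimal sketch of such a pre-tokenizer. The regex below is a simplified stand-in, not any production rule (GPT-2's actual pattern needs the third-party regex module for Unicode categories):

```python
import re

# Letters or digits with an optional leading space, punctuation runs
# with an optional leading space, or remaining whitespace.
PRETOKEN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^A-Za-z0-9\s]+|\s+")

pieces = PRETOKEN.findall("Hello, world! 42")
print(pieces)   # ['Hello', ',', ' world', '!', ' 42']
```

Because merges learned later can never cross these boundaries, the choice of pattern silently caps how long any learned piece can be, for example preventing a merged "world!" token here.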
3. Byte Pair Encoding
Byte Pair Encoding develops the tokenizer concepts needed before embeddings, attention, language-model probability, and serving tradeoffs can be understood correctly.
3.1 BPE merge objective
Purpose. BPE merge objective focuses on frequency-based pair replacement. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
BPE starts from small symbols and repeatedly merges the most frequent adjacent pair. The learned merge order becomes a compression table.
Worked reading.
If l o is the most frequent pair, BPE creates lo; later it may create low if lo w becomes frequent.
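One merge step can be sketched on a toy corpus; the corpus and helper names are illustrative, not from any library.

```python
from collections import Counter

# Toy corpus as symbol lists; "lot" breaks the (l,o) vs (o,w) frequency tie.
corpus = [list("low"), list("lower"), list("lowest"), list("low"), list("lot")]

def most_frequent_pair(words):
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return max(pairs.items(), key=lambda kv: kv[1])[0]

def apply_merge(word, pair):
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])   # fuse the pair in place
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

pair = most_frequent_pair(corpus)               # ('l', 'o'), 5 occurrences
corpus = [apply_merge(w, pair) for w in corpus]
```

After this step every "lo" is one symbol, so the next round can discover (lo, w) and eventually build "low", exactly the progression in the worked reading above.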
Examples:
- GPT-style byte-level tokenizers.
- subword NMT.
- domain-specific vocabulary learning.
Non-examples:
- probabilistically summing all segmentations.
- longest-match WordPiece with continuation markers.
3.2 Greedy training loop
Purpose. Greedy training loop focuses on why merges are sequential decisions. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
BPE training is a greedy loop: count adjacent pairs, merge the most frequent pair everywhere it occurs, record the merge, and repeat until the vocabulary budget is reached.
Worked reading.
Each merge changes the pair statistics seen by the next step, so merges are sequential decisions: an early merge can enable or block later ones.
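BPE's greedy training loop can be sketched on a toy corpus; function names and the corpus are illustrative.

```python
from collections import Counter

def merge_word(word, pair):
    """Fuse every occurrence of the adjacent pair inside one word."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

def train_bpe(raw_words, num_merges):
    words = [list(w) for w in raw_words]
    merges = []                       # rank order = position in this list
    for _ in range(num_merges):
        pairs = Counter((a, b) for w in words for a, b in zip(w, w[1:]))
        if not pairs:
            break
        best = max(pairs.items(), key=lambda kv: kv[1])[0]
        merges.append(best)
        words = [merge_word(w, best) for w in words]
    return merges

merges = train_bpe(["low", "low", "lower", "lot"], num_merges=2)
print(merges)   # [('l', 'o'), ('lo', 'w')]
```

The second merge (lo, w) only exists because the first merge created the "lo" symbol, which is the sequential-decision point this subsection makes.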
3.3 Encoding with learned merges
Purpose. Encoding with learned merges focuses on applying ranks to new text. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Encoding replays the learned merges on new text: start from atomic symbols and repeatedly apply the applicable merge with the lowest rank, that is, the one learned earliest.
Worked reading.
Rank order matters: applying a later merge before an earlier one can yield a segmentation the model never saw during training.
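Applying learned ranks to new text can be sketched with a hypothetical three-entry merge table; real tables have tens of thousands of entries.

```python
# Hypothetical merge table: pair -> rank (lower rank = learned earlier).
MERGE_RANKS = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}

def bpe_encode(word):
    symbols = list(word)
    while len(symbols) > 1:
        # Collect every adjacent pair that has a learned rank.
        ranked = [
            (MERGE_RANKS[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in MERGE_RANKS
        ]
        if not ranked:
            break                      # no applicable merges remain
        _, i = min(ranked)             # lowest rank wins, as in training
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

print(bpe_encode("lower"))   # ['low', 'er']
```

Note that (e, r) is present from the start, yet (l, o) fires first because its rank is lower; that rank discipline is what keeps encoding canonical.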
| Tokenizer object | Mathematical role | LLM consequence |
|---|---|---|
| alphabet | atomic input symbols | bytes, characters, or normalized symbols |
| vocabulary | finite token set | embedding and output-logit dimensions |
| encoder | maps text to ids | prompt length, training examples, costs |
| decoder | maps ids back to text | detokenization and round-trip safety |
| merge/probability table | segmentation rule | subword boundaries and rare-string handling |
Examples:
- BPE pieces.
- unigram pieces.
- special tokens.
Non-examples:
- raw text passed directly to attention.
- word counts used as token counts.
Derivation habit.
- State the raw alphabet and any normalization step.
- State whether the model uses BPE, unigram, WordPiece, byte fallback, or a hybrid.
- Compute token count before making cost, context, or memory claims.
- Check reversibility with
decode(encode(x))when the pipeline promises losslessness. - Treat special tokens as protected control symbols, not ordinary text pieces.
3.4 Byte-level BPE
Purpose. Byte-level BPE focuses on avoiding unknown characters. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
The raw alphabet determines whether the tokenizer starts from bytes, Unicode code points, characters, or pre-tokenized pieces.
Worked reading.
Byte-level systems can represent arbitrary input without an unknown-character token, because every string has a UTF-8 byte encoding; the trade-off is that non-ASCII text starts from several base symbols per character before merges recover longer pieces.
Examples:
- UTF-8 byte sequences.
- ASCII text.
- mixed code and natural language.
Non-examples:
- assuming one character equals one byte.
- dropping unsupported symbols silently.
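The claims above can be checked with nothing but Python's built-in UTF-8 codec; no tokenizer library is assumed:

```python
# Byte-level alphabet: every string maps to UTF-8 bytes, so at most
# 256 base ids cover all input and no unknown-character token is needed.
text = "café 😀"
ids = list(text.encode("utf-8"))

print(len(text), len(ids))  # 6 characters, 10 bytes
assert len(text) == 6 and len(ids) == 10

# Round trip is lossless.
assert bytes(ids).decode("utf-8") == text

# One character is not one byte: 'é' takes 2 bytes, '😀' takes 4.
assert len("é".encode("utf-8")) == 2
assert len("😀".encode("utf-8")) == 4
```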
3.5 BPE limitations
Purpose. BPE limitations focuses on greedy segmentation and brittle numeric splits. In LLM systems this choice affects integer ids, sequence length, embedding rows, loss targets, and serving cost.
Operational definition.
Because merges apply greedily by learned rank, the segmentation of a string reflects training-corpus frequencies, not linguistic or numeric structure, and a substring's split can change with its context.
Worked reading.
Similar strings can tokenize very differently: digit strings split inconsistently, a leading space produces a different token than the bare word, and rare strings shatter into many short pieces. Each of these changes ids, sequence length, and downstream loss.
Examples:
- digit strings that split inconsistently across similar numbers.
- near-duplicate pieces such as "word" and " word".
- rare words shattering into many single-character pieces.
Non-examples:
- a segmentation chosen to minimize the number of pieces globally.
- tokenization aware of arithmetic or morphology.
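A toy rank table makes the numeric brittleness concrete. The ranks below are illustrative, not from any trained model; the point is that the same digit suffix can segment differently depending on what precedes it:

```python
# Illustrative merge ranks (lower rank = applied first).
ranks = {("2", "0"): 0, ("0", "2"): 1, ("2", "3"): 2}

def bpe_encode(word: str) -> list[str]:
    pieces = list(word)
    while True:
        # Greedy: always apply the best-ranked adjacent pair first.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(pieces, pieces[1:]))
            if pair in ranks
        ]
        if not candidates:
            return pieces
        _, i = min(candidates)
        pieces[i:i + 2] = [pieces[i] + pieces[i + 1]]

print(bpe_encode("2023"))  # → ['20', '23']
print(bpe_encode("1023"))  # → ['1', '02', '3']
```

The shared suffix "023" splits one way after "2" and another way after "1", so the model never sees a stable representation of the digits themselves.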