Tokenization is the process of breaking text into smaller units called tokens, which serve as the fundamental input to natural language processing (NLP) systems and large language models (LLMs). These tokens can be words, subwords, characters, or even individual bytes depending on the tokenization strategy used. Because machine learning models operate on numerical data rather than raw text, tokenization acts as the bridge between human-readable language and the numerical representations that models consume.
Every modern language model, from BERT and GPT-4 to Claude and LLaMA, relies on a tokenizer to convert input text into a sequence of integer IDs. Each ID maps to a specific entry in the model's vocabulary, which is then transformed into a dense vector via an embedding layer. The quality and design of the tokenizer directly influence how well a model understands and generates text.
Tokenization is not merely a preprocessing step; it shapes nearly every aspect of a language model's behavior, including sequence length, context-window efficiency, multilingual coverage, and API cost.
A poorly designed tokenizer can waste context window space, introduce artifacts into generated text, and create systematic disadvantages for underrepresented languages.
Tokenization strategies fall along a spectrum from coarse-grained (full words) to fine-grained (individual bytes). Each approach involves trade-offs between vocabulary size, sequence length, and coverage.
The simplest approach splits text on whitespace and punctuation, treating each word as a separate token. Early NLP systems commonly used word-level tokenization.
Advantages: Tokens are semantically meaningful and easy to interpret. Each token corresponds directly to a word that humans recognize.
Disadvantages: The vocabulary must include every word the model might encounter. English alone has hundreds of thousands of distinct word forms when accounting for conjugations, plurals, and compounds. Rare words, proper nouns, and misspellings produce unknown tokens. For morphologically rich languages like Finnish, Turkish, or German (with its long compound words), word-level tokenization is particularly impractical.
At the opposite extreme, character-level tokenization treats each individual character (letter, digit, punctuation mark) as a token. The vocabulary is extremely small, often fewer than 300 entries for English.
Advantages: Zero out-of-vocabulary problems, since any text can be represented as a sequence of known characters. Very compact vocabulary.
Disadvantages: Sequences become very long. The sentence "Tokenization is important" requires roughly 25 character tokens instead of 3 word tokens. Longer sequences mean higher computational costs, and the model must learn to compose meaning from individual characters, which is a harder learning problem.
Subword tokenization occupies the middle ground and has become the dominant approach in modern NLP. It breaks text into units that are larger than characters but often smaller than words. Common words remain intact as single tokens, while rare words are decomposed into recognizable subword pieces.
For example, the word "tokenization" might be split into ["token", "ization"], while a common word like "the" stays as a single token. This balances vocabulary size, sequence length, and coverage. All major LLMs released since 2017 use some form of subword tokenization.
Byte-level tokenization operates on raw bytes (values 0 through 255) rather than Unicode characters. Since any digital text is ultimately a sequence of bytes, this approach can handle any language, encoding, or even binary data without special preprocessing. GPT-2 introduced byte-level BPE, which starts from a base vocabulary of 256 byte values and builds up subword tokens from there. More recent research into byte-level models, such as Meta's Byte Latent Transformer (BLT), explores operating directly on bytes without a fixed tokenizer at all [1].
Several specific algorithms have become standard in the field. Each takes a different approach to learning which subword units to include in the vocabulary.
Byte Pair Encoding was originally a data compression algorithm described by Philip Gage in 1994. In 2015, Rico Sennrich, Barry Haddow, and Alexandra Birch adapted it for neural machine translation in their paper "Neural Machine Translation of Rare Words with Subword Units," published at ACL 2016 [2]. This paper demonstrated that BPE could effectively handle rare and unknown words by decomposing them into subword units.
How BPE works:

1. Initialize the vocabulary with all individual characters (or bytes) in the training corpus.
2. Count the frequency of every adjacent pair of symbols in the corpus.
3. Merge the most frequent pair into a single new symbol and add it to the vocabulary.
4. Repeat steps 2 and 3 until the target vocabulary size is reached.
For example, suppose the pair ("t", "h") is the most frequent. After merging, every occurrence of "t" followed by "h" becomes the single token "th". Then perhaps ("th", "e") becomes the next most frequent pair, producing "the". The process continues until the desired vocabulary size is reached.
BPE is deterministic: given the same training data and the same number of merges, it always produces the same vocabulary. It is the most widely used tokenization algorithm in modern LLMs, employed by the GPT model family, LLaMA 3, and many others.
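The merge loop described above can be sketched in a few lines of Python. This is a toy illustration over an invented three-word corpus, not a production implementation:

```python
from collections import Counter

def pair_counts(corpus):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Fuse every occurrence of `pair` into a single symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is a tuple of characters, mapped to its frequency.
corpus = {tuple("the"): 5, tuple("then"): 2, tuple("this"): 3}
merges = []
for _ in range(2):
    pairs = pair_counts(corpus)
    best = max(pairs, key=pairs.get)  # most frequent pair wins
    merges.append(best)
    corpus = merge_pair(corpus, best)
print(merges)  # [('t', 'h'), ('th', 'e')]
```

On this corpus, ("t", "h") is merged first (10 occurrences), then ("th", "e"), matching the worked example above.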
WordPiece was developed by Mike Schuster and Kaisuke Nakajima at Google in 2012, originally for Japanese and Korean voice search systems [3]. It was later adopted for BERT and other Google NLP models.
WordPiece is similar to BPE but differs in how it selects which pairs to merge. Instead of choosing the most frequent pair, WordPiece selects the pair whose merger maximizes the likelihood of the training data under a unigram language model. In practice, this means it computes a score for each candidate merge:
`score(pair) = freq(pair) / (freq(first) * freq(second))`
The pair with the highest score is merged. This tends to favor merges that combine individually rare symbols that frequently appear together, rather than simply the globally most common pair.
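A quick numeric sketch of this scoring rule (the frequency counts below are made up for illustration):

```python
def merge_score(pair_freq, first_freq, second_freq):
    """WordPiece-style merge score: pair frequency normalized by part frequencies."""
    return pair_freq / (first_freq * second_freq)

# A pair of individually rare symbols that almost always co-occur...
rare_but_bound = merge_score(50, 60, 1000)
# ...outscores a globally frequent pair built from very common symbols.
frequent_parts = merge_score(500, 2000, 3000)
print(rare_but_bound > frequent_parts)  # True
```

Under pure frequency (as in BPE), the second pair would be merged first; under the WordPiece score, the first one is.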
WordPiece uses a distinctive notation: continuation tokens are prefixed with "##" to indicate they are not the start of a word. For instance, the word "embedding" might be tokenized as ["em", "##bed", "##ding"]. This convention makes it straightforward to reconstruct the original text by joining tokens and removing the "##" prefixes.
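The reconstruction step is mechanical; a minimal sketch:

```python
def wordpiece_detokenize(tokens):
    """Rejoin WordPiece tokens: a leading '##' marks a continuation piece."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]   # glue onto the previous word
        else:
            words.append(tok)
    return " ".join(words)

print(wordpiece_detokenize(["em", "##bed", "##ding", "layer"]))  # embedding layer
```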
The Unigram model takes a fundamentally different approach from BPE and WordPiece. Rather than starting small and iteratively merging, it starts with a large initial vocabulary and prunes it down. Taku Kudo introduced this method in his 2018 paper "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates" [4].
How the Unigram model works:

1. Initialize a large seed vocabulary (e.g., all frequent substrings in the training corpus).
2. Fit a unigram language model over the current vocabulary, typically via the EM algorithm.
3. For each token, compute how much the corpus likelihood would drop if that token were removed.
4. Prune the tokens whose removal hurts the likelihood least.
5. Repeat steps 2 through 4 until the vocabulary shrinks to the target size.
A notable feature of the Unigram model is that it can produce multiple valid segmentations for the same input, each with a different probability. This enables subword regularization, where different segmentations are sampled during training to improve the model's robustness.
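Finding the most probable segmentation can be illustrated with a small Viterbi search over a toy vocabulary (the token probabilities below are invented for the example):

```python
import math

# Hypothetical unigram vocabulary: token -> probability.
vocab = {
    "token": 0.05, "ization": 0.01, "t": 0.02, "o": 0.02,
    "ken": 0.005, "iz": 0.004, "ation": 0.02, "to": 0.01,
}

def best_segmentation(word):
    """Viterbi search for the most probable segmentation under a unigram model."""
    n = len(word)
    # best[i] = (log-prob of the best segmentation of word[:i], backpointer)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in vocab and best[start][0] > -math.inf:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Walk the backpointers to recover the winning tokens.
    tokens, i = [], n
    while i > 0:
        start = best[i][1]
        tokens.append(word[start:i])
        i = start
    return list(reversed(tokens))

print(best_segmentation("tokenization"))  # ['token', 'ization']
```

Here ["token", "ization"] beats alternatives like ["to", "ken", "ization"] because fewer, higher-probability pieces yield a higher total log-probability.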
SentencePiece, also by Taku Kudo (with John Richardson), is not a tokenization algorithm per se but a library and framework that implements both BPE and the Unigram model [5]. It was published at EMNLP 2018. What distinguishes SentencePiece from other implementations is its treatment of text:
SentencePiece is used by many prominent models including the original LLaMA (1 and 2), T5, ALBERT, XLNet, and Google's Gemini family (which uses SentencePiece with the Unigram model).
tiktoken is OpenAI's open-source tokenizer library, written in Rust with Python bindings for speed. It implements byte-level BPE and is designed specifically for OpenAI's model family. tiktoken is between 3x and 6x faster than comparable open-source tokenizers [6].
OpenAI has released several tokenizer encodings through tiktoken:
| Encoding | Vocabulary Size | Models | Release |
|---|---|---|---|
| r50k_base | ~50,257 | GPT-2, early GPT-3 | 2019 |
| p50k_base | ~50,281 | Codex, text-davinci-002/003 | 2022 |
| cl100k_base | ~100,256 | GPT-3.5-turbo, GPT-4 | 2023 |
| o200k_base | ~200,015 | GPT-4o, o3-mini, o4-mini | 2024 |
| o200k_harmony | ~201,088 | GPT-5 | 2025 |
The progression from 50K to 200K tokens reflects a broader industry trend toward larger vocabularies. Larger vocabularies produce shorter token sequences (reducing compute in self-attention) and improve multilingual coverage, at the cost of a larger embedding matrix.
Tokenizer training is a separate step from model training. It typically occurs before the language model itself is trained, using a large representative corpus of text.
The training process varies by algorithm, but the general workflow is:

1. Collect a large, representative training corpus.
2. Normalize and (optionally) pre-tokenize the text.
3. Run the chosen algorithm (BPE merging, WordPiece scoring, or Unigram pruning) until the target vocabulary size is reached.
4. Serialize the resulting vocabulary, plus merge rules if any, for use at inference time.
For BPE-based tokenizers, the output of training is an ordered list of merge rules. During tokenization, these merges are applied in order to the input text. The order matters: earlier merges take precedence. For example, if merge rule 100 is ("t", "o") and merge rule 500 is ("to", "ken"), then "token" is first processed by merging "t"+"o" into "to", and later (if applicable) "to"+"ken" into "token".
The total number of merge rules equals the final vocabulary size minus the base vocabulary size. A tokenizer with a 100K vocabulary and a 256-byte base vocabulary has approximately 99,744 merge rules.
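Applying an ordered merge list can be sketched as follows (a simplified version that applies each rule in sequence; the intermediate rules producing "ken" are invented so the example is complete):

```python
def apply_merges(word, merges):
    """Apply BPE merge rules in order; earlier rules take precedence."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # fuse the pair in place
            else:
                i += 1
    return symbols

# Rule order matters: "to" and "ken" must exist before ("to", "ken") can apply.
merges = [("t", "o"), ("k", "e"), ("ke", "n"), ("to", "ken")]
print(apply_merges("token", merges))  # ['token']
```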
Many tokenizers apply a pre-tokenization step before running the subword algorithm. This typically involves:

- Splitting text into word-like chunks along whitespace and punctuation boundaries.
- Keeping letter runs, digit runs, and symbol runs as separate chunks.
- Handling common contractions (e.g., "it's" → "it" + "'s") as dedicated cases.
GPT-4's tokenizer, for instance, uses a complex regex pattern that handles contractions, letter sequences, digit sequences, and whitespace. This pre-tokenization prevents merges from spanning across word boundaries in most cases, which helps maintain linguistically meaningful tokens.
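A simplified, GPT-2-style pattern (not OpenAI's actual regex, which is more elaborate and uses Unicode categories) illustrates the idea:

```python
import re

# Simplified pre-tokenization: contractions, letter runs, digit runs,
# symbol runs, and whitespace are kept as separate chunks.
PATTERN = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?\d+| ?[^\sA-Za-z\d]+|\s+")

chunks = PATTERN.findall("Hello world, it's 2024!")
print(chunks)  # ['Hello', ' world', ',', ' it', "'s", ' 2024', '!']
```

Because subword merges run within each chunk, no merge can span a word boundary, which is exactly the property described above.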
SentencePiece is notable for skipping pre-tokenization entirely, treating the raw input stream as a flat sequence.
Vocabulary size is one of the most important hyperparameters in tokenizer design. It directly affects model architecture, training efficiency, and inference cost.
| Model | Tokenizer | Algorithm | Vocabulary Size |
|---|---|---|---|
| GPT-2 | tiktoken (r50k_base) | Byte-level BPE | 50,257 |
| BERT | WordPiece | WordPiece | 30,522 |
| LLaMA 2 | SentencePiece | BPE | 32,000 |
| LLaMA 3 | tiktoken-based | Byte-level BPE | 128,256 |
| GPT-4 | tiktoken (cl100k_base) | Byte-level BPE | 100,256 |
| GPT-4o | tiktoken (o200k_base) | Byte-level BPE | ~200,015 |
| Gemini / Gemma 3 | SentencePiece | Unigram | 262,144 |
| Claude (Anthropic) | Proprietary | Not publicly disclosed | Not publicly disclosed |
| Mistral | SentencePiece | BPE | 32,768 |
Small vocabularies (30K to 50K) produce longer token sequences but have smaller embedding matrices. They may struggle with multilingual text and code.
Large vocabularies (100K to 260K+) produce shorter sequences (reducing attention computation) and handle diverse languages better, but increase the size of the embedding and output projection layers. Gemini 3's 262K vocabulary, for example, requires computing a softmax over 262,144 logits at each generation step, roughly 8x more than LLaMA 2's 32K vocabulary [7].
The trend since 2023 has been clearly toward larger vocabularies. LLaMA jumped from 32K to 128K between versions 2 and 3. OpenAI doubled from 100K to 200K with GPT-4o. This growth is driven by improved multilingual coverage and the observation that the computational savings from shorter sequences outweigh the cost of larger embedding tables.
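The embedding cost of a larger vocabulary is easy to quantify. Assuming a hypothetical hidden dimension of 4,096, the input embedding matrix alone scales as follows (an untied output projection adds a similar amount):

```python
d_model = 4096  # hypothetical hidden dimension, for illustration only

for name, vocab_size in [("LLaMA 2", 32_000), ("GPT-4o", 200_000), ("Gemma 3", 262_144)]:
    params = vocab_size * d_model  # one d_model-sized row per vocabulary entry
    print(f"{name}: {params / 1e9:.2f}B embedding parameters")
```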
Commercial LLM APIs universally measure usage in tokens, making tokenization directly relevant to cost.
When you send a prompt to an API like OpenAI's or Anthropic's, the text is tokenized, and you are charged based on the number of input tokens and output tokens. Input tokens (your prompt) and output tokens (the model's response) are typically priced separately, with output tokens costing more because they require the computationally expensive autoregressive generation process.
As of early 2026, representative pricing per million tokens [8]:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-5.2 | $1.75 | $14.00 |
| Claude Sonnet 4.5 | $3.00 | $15.00 |
| Claude Haiku 4.5 | $1.00 | $5.00 |
| Claude Opus 4.5 | $5.00 | $25.00 |
| Gemini 2.5 Pro | $1.25 | $10.00 |
Because different tokenizers produce different numbers of tokens for the same text, comparing API prices across providers requires understanding the tokenization efficiency of each model. A model that tokenizes your text into fewer tokens may be cheaper even if its per-token price is higher.
For English text, a rough rule of thumb is that 1 million tokens is approximately 750,000 words. For code, the ratio can vary significantly depending on the programming language, formatting, and comment density. Tokenization efficiency drops sharply for non-English text (see the multilingual section below), meaning the same content in Chinese or Arabic can cost significantly more to process than its English equivalent.
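Putting token counts and prices together is simple arithmetic; a sketch using the GPT-4o rates from the table above:

```python
def api_cost(input_tokens, output_tokens, price_in, price_out):
    """Dollar cost of one request, given per-million-token prices."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# A 50K-token prompt with a 10K-token response at GPT-4o rates ($2.50 / $10.00):
cost = api_cost(50_000, 10_000, price_in=2.50, price_out=10.00)
print(f"${cost:.3f}")  # $0.225
```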
A common point of confusion is the relationship between tokens and words. They are not the same thing.
For English text processed by modern subword tokenizers, the typical ratio is approximately 1.3 tokens per word, though this varies by content type [9]. Common short words like "the", "is", and "a" are usually single tokens. Longer or less common words may be split into 2 to 4 subword tokens. Punctuation marks, spaces (in some tokenizers), and special formatting characters each consume their own tokens.
OpenAI's documentation suggests a useful approximation: 1 token is roughly 4 characters or 0.75 words in English [9].
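That rule of thumb yields a one-line estimator (English-only; exact counts require the model's actual tokenizer):

```python
def estimate_tokens(text):
    """Rough token estimate from the ~4 characters-per-token rule of thumb."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("Tokenization is important"))  # 25 chars -> ~6 tokens
```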
Here is an example of how GPT-4's tokenizer (cl100k_base) processes a sentence:
| Text | Tokens | Token Count |
|---|---|---|
| "Hello world" | ["Hello", " world"] | 2 |
| "Tokenization" | ["Token", "ization"] | 2 |
| "GPT-4 is great" | ["GPT", "-", "4", " is", " great"] | 5 |
| "antidisestablishmentarianism" | ["ant", "idis", "establishment", "arian", "ism"] | 5 |
| "こんにちは" (Japanese greeting) | ["こんにちは"] | 1 |
| "人工智能" (Chinese: AI) | ["人工", "智能"] | 2 |
Note how the long English word "antidisestablishmentarianism" is broken into five meaningful subword pieces, while the common word "Hello" is kept intact. The CJK examples show that character-dense languages can be tokenized relatively efficiently by modern tokenizers with large vocabularies.
Beyond the tokens derived from text, every tokenizer includes a set of special tokens that carry structural or control information. These tokens are never produced by splitting normal text; they are inserted by the tokenizer or the model framework to delimit sequences and convey metadata.
| Token | Name | Purpose |
|---|---|---|
| `<BOS>` or `<s>` | Beginning of Sequence | Marks the start of an input sequence. Used by autoregressive models like GPT and LLaMA to signal where generation begins. |
| `<EOS>` or `</s>` | End of Sequence | Marks the end of a sequence. The model is trained to output this token when it has finished generating. |
| `<PAD>` | Padding | Used to pad shorter sequences to the same length within a batch. Attention masks ensure that PAD tokens are ignored during computation. |
| `<UNK>` | Unknown | Represents tokens not in the vocabulary. Modern subword tokenizers rarely produce UNK tokens because they can decompose any input into known subwords or bytes. |
| `[CLS]` | Classification | Used by BERT at the start of every input. The final hidden state at this position serves as the aggregate sequence representation for classification tasks. |
| `[SEP]` | Separator | Used by BERT to separate two segments (e.g., a question and a passage in question answering). |
| `[MASK]` | Mask | Used in masked language modeling (BERT's pre-training objective). A percentage of input tokens are replaced with `[MASK]`, and the model predicts the originals. |
Modern chat-oriented models also use special tokens to delimit roles in a conversation. For example, OpenAI's chat format uses tokens like `<|im_start|>` and `<|im_end|>` to mark the boundaries of system, user, and assistant messages. LLaMA uses `<|begin_of_text|>`, `<|start_header_id|>`, and related tokens. These structural tokens are critical for the model to correctly interpret multi-turn conversations.
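A sketch of how such role markers frame a conversation (illustrative layout only; each provider's exact template differs):

```python
def render_chat(messages):
    """Render (role, content) pairs with ChatML-style delimiter tokens."""
    return "\n".join(
        f"<|im_start|>{role}\n{content}<|im_end|>" for role, content in messages
    )

print(render_chat([("system", "You are helpful."), ("user", "Hi!")]))
```

In a real pipeline the delimiter strings map to dedicated single token IDs, so the model can never confuse them with ordinary text.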
Tokenization works reasonably well for English and other Latin-script languages with clear word boundaries, but significant challenges arise with other writing systems.
Chinese, Japanese, and Korean present unique difficulties:

- These scripts do not separate words with spaces, so there are no obvious boundaries for pre-tokenization.
- Their character inventories are large (tens of thousands of Han characters), straining fixed vocabularies.
- Each CJK character occupies three bytes in UTF-8, so byte-level tokenizers can split a single character across multiple tokens.
Arabic, Hebrew, Devanagari, Thai, and other scripts each bring their own complexities:

- Thai, like the CJK scripts, is written without spaces between words.
- Arabic and Hebrew are morphologically rich, attaching articles, prepositions, and pronouns directly to words.
- Devanagari and other Indic scripts use combining vowel signs that byte-level tokenizers can split awkwardly.
Research has documented that tokenization creates systematic unfairness between languages [10]. A text that costs $1 to process in English might cost $2 to $3 in Chinese or Japanese and potentially much more in low-resource languages like Tigrinya or Burmese. This "tokenization tax" means that users of non-English languages get less content per dollar and use more of their context window for the same amount of information.
Larger vocabularies partially address this problem. GPT-4o's o200k_base encoding includes roughly 10x more Cyrillic tokens than GPT-4's cl100k_base (4,660 vs. approximately 435), substantially improving efficiency for Russian and other Cyrillic-script languages [7]. Similarly, LLaMA 3's expansion from 32K to 128K tokens was motivated in part by the need for better multilingual coverage.
The following table summarizes the tokenization approaches used by the most prominent LLM families as of early 2026:
| Model Family | Tokenizer Library | Algorithm | Vocab Size | Pre-tokenization | Notable Features |
|---|---|---|---|---|---|
| GPT-4 | tiktoken (cl100k_base) | Byte-level BPE | ~100K | Regex-based | Strong English and code performance |
| GPT-4o / GPT-5 | tiktoken (o200k_base / Harmony) | Byte-level BPE | ~200K | Regex-based | 2x vocabulary over GPT-4; much better multilingual |
| Claude (Anthropic) | Proprietary | Not disclosed | Not disclosed | Not disclosed | Anthropic provides token counting via API |
| LLaMA 2 | SentencePiece | BPE | 32,000 | None (SentencePiece) | Relatively small vocabulary |
| LLaMA 3 / 3.2 | Custom (tiktoken-compatible) | Byte-level BPE | 128,256 | Regex-based | 4x vocabulary increase over LLaMA 2 |
| Gemini / Gemma 3 | SentencePiece | Unigram | 262,144 | None (SentencePiece) | Largest vocabulary among major models |
| BERT | WordPiece | WordPiece | 30,522 | Whitespace + punctuation | Uses ## prefix for continuation tokens |
| T5 | SentencePiece | Unigram | 32,000 | None (SentencePiece) | Treats spaces as tokens |
| Mistral / Mixtral | SentencePiece | BPE | 32,768 | None (SentencePiece) | Compact vocabulary, efficient for European languages |
Tokenization research remains active, with several notable developments in the 2025 to 2026 timeframe.
Vocabulary sizes have grown roughly 8x in just two years, from 32K (LLaMA 2, 2023) to 262K (Gemini 3, 2025). This expansion is driven by the need for better multilingual support and the finding that the computational savings from shorter sequences generally outweigh the overhead of larger embedding tables.
Researchers have also proposed refinements to standard BPE, from alternative merge-selection criteria to better handling of digit and whitespace sequences, where naive frequency-based merges produce awkward tokens.
A growing line of research questions whether fixed tokenizers are necessary at all. Byte- and patch-level architectures such as Meta's Byte Latent Transformer operate directly on raw bytes, learning to group them dynamically rather than relying on a precomputed vocabulary. These approaches promise to eliminate the fairness issues inherent in fixed vocabularies and remove the need for a separate tokenizer training step.
ADAT (NeurIPS 2024) introduced the idea of iteratively refining the vocabulary based on model feedback during training, rather than fixing it beforehand. This dynamic approach allows the tokenizer to adapt to the specific data distribution the model is learning from.
There is growing interest in training specialized tokenizers for specific domains. Code-focused models, biomedical NLP systems, and legal text processors can benefit from tokenizers trained on domain-specific corpora, where the standard web-text-trained vocabulary may be inefficient.
Several tools exist for counting tokens before sending text to an API:
- OpenAI's tiktoken: `pip install tiktoken`, then use `tiktoken.encoding_for_model("gpt-4o")` to get the right encoding for a specific model.
- Any tokenizer loaded through Hugging Face's `transformers` library can count tokens via `tokenizer.encode(text)`.

When training a custom tokenizer, the vocabulary size is a key decision. A common starting point for English-only models is 32K to 50K tokens. Multilingual models typically need 100K or more. The right size depends on the training data distribution, target languages, and computational budget. Larger vocabularies improve tokenization efficiency but increase model parameter count.
A language model can only be used with the tokenizer it was trained with. Swapping in a different tokenizer at inference time will produce nonsensical results because the token IDs will map to the wrong embeddings. If you fine-tune a model, you generally keep the same tokenizer unless you are adding new tokens (e.g., for a specialized domain), in which case the embedding matrix must be resized accordingly.
Tokenization has evolved significantly alongside the field of NLP, moving from simple word- and character-level schemes to the subword methods that dominate today, and it continues to evolve toward larger vocabularies and tokenizer-free architectures.