Tokenization
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v7 · 8,098 words
Tokenization is the process of breaking text into smaller units called tokens, which serve as the fundamental input to natural language processing (NLP) systems and large language models (LLMs). These tokens can be words, subwords, characters, or even individual bytes depending on the tokenization strategy used. Because machine learning models operate on numerical data rather than raw text, tokenization acts as the bridge between human-readable language and the numerical representations that models consume.
Every modern language model, from BERT and GPT-4 to Claude and LLaMA, relies on a tokenizer to convert input text into a sequence of integer IDs. Each ID maps to a specific entry in the model's vocabulary, which is then transformed into a dense vector via an embedding layer. The quality and design of the tokenizer directly influences how well a model understands and generates text. Andrej Karpathy, in his widely watched 2024 lecture on building the GPT tokenizer, observed that tokenization is the source of many oddities in language model behavior, from poor arithmetic to spelling failures and unequal treatment of languages [13].
Because tokens are the units in which input is consumed, output is generated, context windows are measured, and pricing is calculated, tokenization sits at the intersection of several practical concerns: model accuracy, computational efficiency, and the cost of using commercial APIs. Decisions made when designing or training a tokenizer have lasting effects that cannot easily be undone after a model is trained.
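Tokenization is not merely a preprocessing step; it shapes nearly every aspect of a language model's behavior. The choice of tokenizer determines how many tokens a given text occupies (and therefore how much fits in the context window and what it costs to process), how rare words, numbers, and source code are represented, and how efficiently languages other than English are encoded.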
A poorly designed tokenizer can waste context window space, introduce artifacts into generated text, expose strange failure modes such as glitch tokens (see below), and create systematic disadvantages for underrepresented languages.
Tokenization strategies fall along a spectrum from coarse-grained (full words) to fine-grained (individual bytes). Each approach involves trade-offs between vocabulary size, sequence length, and coverage.
The simplest approach splits text on whitespace and punctuation, treating each word as a separate token. Early NLP systems commonly used word-level tokenization. The Word2Vec and GloVe embedding models of the mid-2010s typically operated on vocabularies of 100,000 to 400,000 word forms.
Advantages: tokens are semantically meaningful and easy to interpret. Each token corresponds directly to a word that humans recognize, which simplifies linguistic analysis and downstream tasks like named entity recognition.
Disadvantages: the vocabulary must include every word the model might encounter. English alone has hundreds of thousands of distinct word forms when accounting for conjugations, plurals, and compounds. Rare words, proper nouns, and misspellings produce unknown tokens. For morphologically rich languages like Finnish, Turkish, or German (with its long compound words), word-level tokenization is particularly impractical. A German compound such as "Donaudampfschifffahrtsgesellschaftskapitän" cannot reasonably exist in a fixed word vocabulary, yet its meaning is fully compositional from familiar parts.
At the opposite extreme, character-level tokenization treats each individual character (letter, digit, punctuation mark) as a token. The vocabulary is extremely small, often fewer than 300 entries for English plus common punctuation.
Advantages: zero out-of-vocabulary problems, since any text can be represented as a sequence of known characters. Very compact vocabulary. Strong robustness to spelling errors and rare words.
Disadvantages: sequences become very long. The sentence "Tokenization is important" requires roughly 25 character tokens instead of 3 word tokens. Longer sequences mean higher computational costs, and the model must learn to compose meaning from individual characters, which is a harder learning problem. Pure character models also struggle with multibyte Unicode characters in some scripts.
Subword tokenization occupies the middle ground and has become the dominant approach in modern NLP. It breaks text into units that are larger than characters but often smaller than words. Common words remain intact as single tokens, while rare words are decomposed into recognizable subword pieces.
For example, the word "tokenization" might be split into ["token", "ization"], while a common word like "the" stays as a single token. This balances vocabulary size, sequence length, and coverage. Nearly all major LLMs released since 2017 use some form of subword tokenization.
Byte-level tokenization operates on raw bytes (values 0 through 255) rather than Unicode characters. Since any digital text is ultimately a sequence of bytes, this approach can handle any language, encoding, or even binary data without special preprocessing. GPT-2 introduced byte-level BPE, which starts from a base vocabulary of 256 byte values and builds up subword tokens from there. Pure byte-level models without a learned vocabulary, such as Google's ByT5 [15] and Meta's Byte Latent Transformer (BLT) [1], explore the further extreme of operating directly on bytes without a fixed tokenizer at all.
Several specific algorithms have become standard in the field. Each takes a different approach to learning which subword units to include in the vocabulary.
Byte pair encoding was originally a data compression algorithm described by Philip Gage in February 1994 in The C Users Journal article "A New Algorithm for Data Compression" [16]. The algorithm iteratively replaced the most frequent pair of bytes in a sequence with a single unused byte value, achieving compression comparable to Lempel-Ziv methods of the era. In 2015, Rico Sennrich, Barry Haddow, and Alexandra Birch adapted it for neural machine translation in their paper "Neural Machine Translation of Rare Words with Subword Units," first posted to arXiv as 1508.07909 in August 2015 and published at ACL 2016 [2]. This paper demonstrated that BPE could effectively handle rare and unknown words by decomposing them into subword units, improving WMT 2015 English-German and English-Russian results by up to 1.3 BLEU over a back-off dictionary baseline.
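How BPE works: training starts from a base vocabulary of individual characters (or bytes), counts every adjacent pair of symbols in the corpus, merges the most frequent pair into a new symbol, records that merge as a rule, and repeats until the vocabulary reaches the target size.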
For example, suppose the pair ("t", "h") is the most frequent. After merging, every occurrence of "t" followed by "h" becomes the single token "th". Then perhaps ("th", "e") becomes the next most frequent pair, producing "the". The process continues until the desired vocabulary size is reached.
A toy training example on the four words low, lower, newest, and widest illustrates the merge sequence (a slash marks a word boundary, which merges never cross):
step 0: l o w / l o w e r / n e w e s t / w i d e s t
step 1: lo w / lo w e r / n e w e s t / w i d e s t (merge l+o)
step 2: low / low e r / n e w e s t / w i d e s t (merge lo+w)
step 3: low / low e r / n e w es t / w i d es t (merge e+s)
step 4: low / low e r / n e w est / w i d est (merge es+t)
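The merge loop can be sketched in a few lines of Python. This is a minimal illustration rather than a production trainer: the function names train_bpe and merge_pair are hypothetical, the corpus is the four toy words low, lower, newest, and widest, and ties between equally frequent pairs are broken by first-seen order, which is an assumption (real implementations choose their own tie-breaking rule).

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs inside each word (never across words).
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Assumption: ties go to the pair that was counted first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges, corpus


merges, corpus = train_bpe(["low", "lower", "newest", "widest"], num_merges=4)
print(merges)
print(list(corpus))
```

With this tie-breaking rule the learned merges are l+o, lo+w, e+s, and es+t, in that order. A different tie-breaking rule would be equally valid BPE but could learn the merges in a different order.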
BPE is deterministic: given the same training data and the same number of merges, it always produces the same vocabulary. It is the most widely used tokenization algorithm in modern LLMs, employed by the GPT model family, LLaMA 3, Mistral, DeepSeek, and many others.
Byte-level BPE is the variant introduced by GPT-2 in 2019 and now standard across OpenAI models, the LLaMA 3 family, and most code-focused LLMs. Instead of starting from a base vocabulary of Unicode characters (which can run to well over a hundred thousand codepoints once CJK, emoji, and other scripts are included), byte-level BPE always starts from the 256 possible byte values. Any Unicode codepoint is represented as a sequence of one to four UTF-8 bytes, and merges learn to recombine those byte sequences into useful tokens.
This design has three important properties. First, the base vocabulary is fixed at 256, so no Unicode character is ever truly out of vocabulary. Second, the same tokenizer can encode any sequence of bytes whatsoever, including binary data, mojibake, and characters the model has never seen. Third, ASCII text remains compactly represented because each ASCII character is one byte. The trade-off is that non-ASCII scripts may need multiple bytes per character before any merges are learned, which is one source of multilingual inefficiency for English-dominant training corpora.
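A short sketch makes the byte-level starting point concrete. The helper name utf8_byte_ids is illustrative and not part of any tokenizer library; it only shows the raw byte values a byte-level BPE tokenizer would begin from, before any merges are applied.

```python
def utf8_byte_ids(text: str) -> list[int]:
    """Return the UTF-8 byte values (0-255) that encode `text`."""
    return list(text.encode("utf-8"))


for sample in ["A", "é", "中", "🙂"]:
    ids = utf8_byte_ids(sample)
    # ASCII needs one byte; other characters need two to four.
    print(repr(sample), len(ids), ids)

# The mapping is lossless: the bytes decode back to the original string.
assert bytes(utf8_byte_ids("naïve 中文")).decode("utf-8") == "naïve 中文"
```

The one-byte ASCII case and the three-byte CJK case illustrate why unmerged non-ASCII text starts out several times longer than English text of the same character count.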
WordPiece was developed by Mike Schuster and Kaisuke Nakajima at Google in 2012, originally for Japanese and Korean voice search systems and described in their ICASSP 2012 paper "Japanese and Korean Voice Search" [3]. It was later adopted for BERT and other Google NLP models including DistilBERT, ELECTRA, and the early Multilingual BERT.
WordPiece is similar to BPE but differs in how it selects which pairs to merge. Instead of choosing the most frequent pair, WordPiece selects the pair whose merger maximizes the likelihood of the training data under a unigram language model. In practice, this means it computes a score for each candidate merge:
score(pair) = freq(pair) / (freq(first) * freq(second))
The pair with the highest score is merged. This tends to favor merges that combine individually rare symbols that frequently appear together, rather than simply the globally most common pair.
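The difference from BPE's frequency rule can be seen on a small example. The sketch below applies the score formula above to a hypothetical word-frequency table; the function name wordpiece_pair_scores and the toy counts are assumptions made for illustration, not part of any library.

```python
from collections import Counter


def wordpiece_pair_scores(word_freqs):
    """Score each adjacent pair as freq(pair) / (freq(first) * freq(second))."""
    symbol_freq = Counter()
    pair_freq = Counter()
    for word, freq in word_freqs.items():
        # Non-initial characters carry the "##" continuation prefix.
        symbols = [word[0]] + ["##" + ch for ch in word[1:]]
        for symbol in symbols:
            symbol_freq[symbol] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_freq[pair] += freq
    scores = {
        pair: freq / (symbol_freq[pair[0]] * symbol_freq[pair[1]])
        for pair, freq in pair_freq.items()
    }
    return scores, pair_freq


# Hypothetical corpus counts.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
scores, pair_freq = wordpiece_pair_scores(word_freqs)

most_frequent = max(pair_freq, key=pair_freq.get)
best_scored = max(scores, key=scores.get)
print("most frequent pair:", most_frequent)
print("highest-scoring pair:", best_scored)
```

In this toy corpus the most frequent pair is ("##u", "##g"), which is what plain BPE would merge first, while the highest-scoring pair is ("##g", "##s"), because "##s" almost never appears without "##g" before it.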
WordPiece uses a distinctive notation: continuation tokens are prefixed with "##" to indicate they are not the start of a word. For instance, the word "embedding" might be tokenized as ["em", "##bed", "##ding"]. This convention makes it straightforward to reconstruct the original text by joining tokens and removing the "##" prefixes. The ## marker has become one of the most recognizable artifacts of the BERT era of NLP.
The Unigram model takes a fundamentally different approach from BPE and WordPiece. Rather than starting small and iteratively merging, it starts with a large initial vocabulary and prunes it down. Taku Kudo introduced this method in his 2018 paper "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates," published at ACL 2018 [4].
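How the Unigram model works: training starts from a large seed vocabulary of candidate subwords, estimates a probability for each one, measures how much the corpus likelihood would drop if each candidate were removed, discards the least useful candidates, and repeats until the vocabulary shrinks to the target size. Single characters are always kept so that any input can still be segmented.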
A notable feature of the Unigram model is that it can produce multiple valid segmentations for the same input, each with a different probability. This enables subword regularization, where different segmentations are sampled during training to improve the model's robustness. Subword regularization has been shown to improve machine translation BLEU scores by 1 to 2 points on low-resource language pairs [4].
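Choosing the most probable segmentation is a shortest-path problem that can be solved with Viterbi-style dynamic programming. The sketch below assumes a tiny hand-written table of subword probabilities (the values and the function name best_segmentation are hypothetical); a trained Unigram model would supply learned probabilities instead.

```python
import math


def best_segmentation(text, probs):
    """Return the segmentation of `text` with the highest product of subword probabilities."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-probability of text[:i]
    back = [0] * (n + 1)                # start index of the last piece in that best path
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in probs and best_score[start] > -math.inf:
                score = best_score[start] + math.log(probs[piece])
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the string to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Hypothetical subword probabilities (they sum to 1.0).
toy_probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.15, "hu": 0.10, "ug": 0.20, "hug": 0.40}
print(best_segmentation("hugs", toy_probs))
```

Here "hugs" is segmented as ["hug", "s"] because that path has probability 0.40 × 0.15, higher than alternatives such as "h" + "ug" + "s". Subword regularization would sample from the lower-probability alternatives instead of always taking the best path.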
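SentencePiece, also by Taku Kudo (with John Richardson), is not a tokenization algorithm per se but a library and framework that implements both BPE and the Unigram model. It was published as a system demonstration paper at EMNLP 2018 [5]. What distinguishes SentencePiece from other implementations is its treatment of text: it operates directly on raw text without language-specific pre-tokenization, and it encodes whitespace as an ordinary symbol (the meta-character "▁"), so the original string can be recovered exactly from the tokens.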
SentencePiece is used by many prominent models including the original LLaMA (1 and 2), Mistral, T5, ALBERT, XLNet, and Google's Gemini family (which uses SentencePiece with the Unigram algorithm).
tiktoken is OpenAI's open-source tokenizer library, written in Rust with Python bindings for speed. It implements byte-level BPE and is designed specifically for OpenAI's model family. According to OpenAI's own benchmarks, tiktoken is approximately 3 to 6 times faster than a comparable open-source tokenizer, with measured throughput of roughly 600 MiB/s on cached multilingual text [6][17]. Independent Rust implementations such as rs-bpe have since reported better performance than tiktoken, particularly on adversarial worst-case inputs where pre-tokenization can become quadratic [17].
OpenAI has released several tokenizer encodings through tiktoken:
| Encoding | Vocabulary Size | Models | Release |
|---|---|---|---|
| r50k_base | ~50,257 | GPT-2, early GPT-3 | 2019 |
| p50k_base | ~50,281 | Codex, text-davinci-002/003 | 2022 |
| cl100k_base | ~100,256 | GPT-3.5-turbo, GPT-4 | 2023 |
| o200k_base | ~200,015 | GPT-4o, o3-mini, o4-mini | 2024 |
| o200k_harmony | ~201,088 | GPT-5 | 2025 |
The progression from 50K to 200K tokens reflects a broader industry trend toward larger vocabularies. Larger vocabularies produce shorter token sequences (reducing compute in self-attention) and improve multilingual coverage, at the cost of a larger embedding matrix and output projection layer.
The regex pre-tokenization patterns also evolved between encodings. The cl100k_base pattern is approximately:
(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+
The o200k_base pattern revises this to better handle modifier letters, marks, and other Unicode categories such as \p{Lo}, \p{Lm}, and \p{M}, which substantially improves coverage for non-Latin scripts.
Tokenizer training is a separate step from model training. It typically occurs before the language model itself is trained, using a large representative corpus of text. The trained tokenizer is then frozen and reused for the entire life of the model.
The training process varies by algorithm, but the general workflow is:
The choice of training corpus profoundly affects the resulting tokenizer. A tokenizer trained primarily on English web text will allocate most of its vocabulary to English subwords, leaving non-English text fragmented into many short tokens. Tokenizers used by multilingual models are typically trained on data sampled to give a fair share of vocabulary slots to each target language. DeepSeek-V3, for example, modified its pretokenizer and training data specifically to optimize multilingual compression efficiency [18].
For BPE-based tokenizers, the output of training is an ordered list of merge rules. During tokenization, these merges are applied in order to the input text. The order matters: earlier merges take precedence. For example, if merge rule 100 is ("t", "o") and merge rule 500 is ("to", "ken"), then "token" is first processed by merging "t"+"o" into "to", and later (if applicable) "to"+"ken" into "token".
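Applying learned merges at encoding time can be sketched as follows. The merge list here is hypothetical and far shorter than a real one, and apply_merges is an illustrative name; the point is only that, at each step, the adjacent pair with the earliest (lowest-ranked) merge rule is merged first.

```python
def apply_merges(word, merges):
    """Tokenize `word` by repeatedly applying the earliest-learned applicable merge."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        # Pick the adjacent pair whose merge rule was learned earliest.
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no remaining pair has a merge rule
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge rules, listed in the order they were learned.
toy_merges = [("t", "o"), ("k", "e"), ("ke", "n"), ("to", "ken")]
print(apply_merges("token", toy_merges))   # ['token']
print(apply_merges("tokens", toy_merges))  # ['token', 's']
print(apply_merges("ten", toy_merges))     # ['t', 'e', 'n']
```

Note that ("to", "ken") can only fire after "ken" itself has been built by earlier merges, which is why the order of the rule list matters.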
The total number of merge rules equals the final vocabulary size minus the base vocabulary size. A tokenizer with a 100K vocabulary and a 256-byte base vocabulary has approximately 99,744 merge rules.
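Many tokenizers apply a pre-tokenization step before running the subword algorithm. This typically involves splitting the text on whitespace, punctuation, or a regular-expression pattern, so that merges are learned and applied only within each resulting piece.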
GPT-4's tokenizer, for instance, uses a complex regex pattern that handles contractions, letter sequences, digit sequences, and whitespace. This pre-tokenization prevents merges from spanning across word boundaries in most cases, which helps maintain linguistically meaningful tokens. SentencePiece is notable for skipping pre-tokenization entirely, treating the raw input stream as a flat sequence and instead encoding spaces as visible meta-characters.
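The effect of regex pre-tokenization can be approximated with Python's standard library. The pattern below is a deliberately simplified, ASCII-only stand-in written for this example (the standard re module does not support Unicode property classes such as \p{L}), so it should be read as a sketch of the idea rather than as any model's actual pattern.

```python
import re

# Simplified, ASCII-only approximation of a GPT-style pre-tokenization pattern.
SIMPLE_PATTERN = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"  # common English contractions
    r"| ?[A-Za-z]+"           # a word, with an optional leading space
    r"|\d{1,3}"               # digits in groups of at most three
    r"| ?[^\sA-Za-z\d]+"      # a run of punctuation or symbols
    r"|\s+(?!\S)"             # whitespace that is not followed by a non-space
    r"|\s+"                   # any remaining whitespace
)


def pre_tokenize(text: str) -> list[str]:
    """Split text into chunks; subword merges would then run inside each chunk."""
    return SIMPLE_PATTERN.findall(text)


print(pre_tokenize("Hello world, it's 2024!"))
```

On this input the chunks are "Hello", " world", ",", " it", "'s", " ", "202", "4", and "!". Because merges run separately inside each chunk, no learned token can span, for example, the end of "world" and the following comma.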
The SuperBPE work at COLM 2025 questions whether word-boundary pre-tokenization is desirable at all [11]. By relaxing the boundary in a curriculum (first learning subwords, then learning multi-word "superword" tokens), SuperBPE achieves up to 33 percent fewer tokens for the same vocabulary size and improves average benchmark performance by 4.0 percent across 30 downstream tasks, including 8.2 percent on MMLU. The related BoundlessBPE paper (also COLM 2025) eliminates the boundary constraint entirely [19].
Using the Hugging Face tokenizers library, training a custom BPE tokenizer for a corpus of text files is straightforward [20]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Start from an empty BPE model; symbols never seen in training map to [UNK].
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Split on whitespace and punctuation before merges are learned.
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=32_000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
# Learn the merges from the corpus files, then save the tokenizer as JSON.
tokenizer.train(files=["corpus_part1.txt", "corpus_part2.txt"], trainer=trainer)
tokenizer.save("my-tokenizer.json")
For large in-memory iterators or streaming data, train_from_iterator accepts any Python iterable yielding strings.
Vocabulary size is one of the most important hyperparameters in tokenizer design. It directly affects model architecture, training efficiency, and inference cost.
| Model | Tokenizer | Algorithm | Vocabulary Size |
|---|---|---|---|
| GPT-2 | tiktoken (r50k_base) | Byte-level BPE | 50,257 |
| BERT | WordPiece | WordPiece | 30,522 |
| LLaMA 2 | SentencePiece | BPE | 32,000 |
| LLaMA 3 | tiktoken-based | Byte-level BPE | 128,256 |
| GPT-4 | tiktoken (cl100k_base) | Byte-level BPE | 100,256 |
| GPT-4o | tiktoken (o200k_base) | Byte-level BPE | ~200,015 |
| Gemini / Gemma 3 | SentencePiece | Unigram | 262,144 |
| Claude (Anthropic) | Proprietary | Not publicly disclosed | Not publicly disclosed |
| Mistral | SentencePiece | BPE | 32,768 |
| Qwen (Alibaba) | tiktoken-based | Byte-level BPE | 151,643 |
| DeepSeek-V3 | Custom | Byte-level BPE | 128,000 |
Small vocabularies (30K to 50K) produce longer token sequences but have smaller embedding matrices. They may struggle with multilingual text and code.
Large vocabularies (100K to 260K+) produce shorter sequences (reducing attention computation) and handle diverse languages better, but increase the size of the embedding and output projection layers. The 262K Gemini / Gemma 3 vocabulary, for example, requires a softmax over 262,144 logits for every generated token, roughly 8x as many as LLaMA 2's 32K vocabulary [7].
The trend since 2023 has been clearly toward larger vocabularies. LLaMA jumped from 32K to 128K between versions 2 and 3, and at the same time switched from a SentencePiece BPE tokenizer to a tiktoken-based byte-level BPE tokenizer compatible with the OpenAI ecosystem [21]. OpenAI doubled from 100K to 200K with GPT-4o. This growth is driven by improved multilingual coverage and the observation that the computational savings from shorter sequences generally outweigh the cost of larger embedding tables.
Doubling the vocabulary roughly doubles the parameter count of the embedding matrix (which has shape vocab_size x hidden_dim) and the output projection (often tied to the embedding). For a model with a hidden size of 4096, increasing vocabulary from 32K to 128K adds approximately (128,000 - 32,000) * 4096 = 393 million parameters to each of the input and output layers. In smaller models, the embedding matrix can become a substantial fraction of the total parameter count.
However, larger vocabularies shorten sequences, and self-attention scales quadratically with sequence length. For a 32K vocabulary that fragments multilingual text into roughly twice as many tokens as a 128K vocabulary, the FLOPs spent in attention layers can grow by a factor of four, which often dominates the embedding-table cost in deep transformer stacks. Empirically, the move to larger vocabularies has been a net win for nearly all frontier models since 2023.
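The two sides of this trade-off are simple enough to check numerically. The helper names below are illustrative, and the sequence lengths in the second call are hypothetical; only the vocabulary sizes and hidden size come from the example above.

```python
def embedding_params(vocab_size: int, hidden_dim: int) -> int:
    """Parameter count of one vocab_size x hidden_dim embedding matrix."""
    return vocab_size * hidden_dim


def attention_cost_ratio(tokens_a: int, tokens_b: int) -> float:
    """Relative cost of the quadratic attention term for two sequence lengths."""
    return (tokens_a / tokens_b) ** 2


# Growing the vocabulary from 32K to 128K at hidden size 4096.
extra = embedding_params(128_000, 4096) - embedding_params(32_000, 4096)
print(f"{extra:,} extra parameters per matrix")  # 393,216,000

# Hypothetical: the smaller vocabulary needs 2,000 tokens where the larger needs 1,000.
print(attention_cost_ratio(2_000, 1_000))  # 4.0
```

The embedding cost is paid once per matrix, while the attention cost is paid in every layer and on every sequence, which is the intuition behind the claim that shorter sequences usually win.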
Commercial LLM APIs universally measure usage in tokens, making tokenization directly relevant to cost.
When you send a prompt to an API like OpenAI's or Anthropic's, the text is tokenized, and you are charged based on the number of input tokens and output tokens. Input tokens (your prompt) and output tokens (the model's response) are typically priced separately, with output tokens costing more because they require the computationally expensive autoregressive generation process.
As of early 2026, representative pricing per million tokens [8]:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-5.2 | $1.75 | $14.00 |
| Claude Sonnet 4.5 | $3.00 | $15.00 |
| Claude Haiku 4.5 | $1.00 | $5.00 |
| Claude Opus 4.5 | $5.00 | $25.00 |
| Gemini 2.5 Pro | $1.25 | $10.00 |
Because different tokenizers produce different numbers of tokens for the same text, comparing API prices across providers requires understanding the tokenization efficiency of each model. A model that tokenizes your text into fewer tokens may be cheaper even if its per-token price is higher.
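A small calculation illustrates the point. The token counts and prices below are hypothetical values chosen for the example and do not describe any particular provider; request_cost is an illustrative helper name.

```python
def request_cost(num_tokens: int, usd_per_million: float) -> float:
    """Cost in US dollars of processing `num_tokens` at a per-million-token price."""
    return num_tokens / 1_000_000 * usd_per_million


# Hypothetical: the same document under two different tokenizers and price lists.
cost_a = request_cost(1_000_000, 2.50)  # cheaper per token, but more tokens
cost_b = request_cost(800_000, 3.00)    # pricier per token, but fewer tokens

print(f"Provider A: ${cost_a:.2f}")
print(f"Provider B: ${cost_b:.2f}")
```

In this made-up case provider B is cheaper overall ($2.40 versus $2.50) despite its higher per-token price, because its tokenizer produces 20 percent fewer tokens for the same text.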
For cost estimation and prompt-engineering work, it is common to count tokens locally before making an API call. The standard tools are:
# OpenAI: tiktoken
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
print(len(enc.encode("Hello, world!"))) # 4
# Anthropic: count_tokens endpoint (free of charge)
from anthropic import Anthropic
client = Anthropic()
count = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    messages=[{"role": "user", "content": "Hello, world!"}],
)
print(count.input_tokens)
# Hugging Face: tokenizer.encode
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
print(len(tok.encode("Hello, world!")))
Anthropic exposes a free messages.count_tokens endpoint that accepts the same payload shape as a messages.create call, including system prompts, tools, images, and PDFs [22]. The returned count is an estimate that may differ from the actual billing count by a small amount.
For English text, a rough rule of thumb is that 1 million tokens is approximately 750,000 words. For code, the ratio can vary significantly depending on the programming language, formatting, and comment density. Tokenization efficiency drops sharply for non-English text (see the multilingual section below), meaning the same content in Chinese or Arabic can cost significantly more to process than its English equivalent. Qwen's documentation cites a rule of thumb of 3 to 4 characters per token for English versus 1.5 to 1.8 characters per token for Chinese under its tokenizer, despite Chinese having a much higher information density per character [23].
A common point of confusion is the relationship between tokens and words. They are not the same thing.
For English text processed by modern subword tokenizers, the typical ratio is approximately 1.3 tokens per word, though this varies by content type [9]. Common short words like "the", "is", and "a" are usually single tokens. Longer or less common words may be split into 2 to 4 subword tokens. Punctuation marks, spaces (in some tokenizers), and special formatting characters each consume their own tokens.
OpenAI's documentation suggests a useful approximation: 1 token is roughly 4 characters or 0.75 words in English [9].
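The table below gives illustrative splits for several strings in the style of GPT-4's tokenizer (cl100k_base); exact boundaries are best confirmed by running the tokenizer itself: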
| Text | Tokens | Token Count |
|---|---|---|
| "Hello world" | ["Hello", " world"] | 2 |
| "Tokenization" | ["Token", "ization"] | 2 |
| "GPT-4 is great" | ["GPT", "-", "4", " is", " great"] | 5 |
| "antidisestablishmentarianism" | ["ant", "idis", "establishment", "arian", "ism"] | 5 |
| "こんにちは" (Japanese greeting) | ["こんにちは"] | 1 |
| "人工智能" (Chinese: AI) | ["人工", "智能"] | 2 |
Note how the long English word "antidisestablishmentarianism" is broken into five meaningful subword pieces, while the common word "Hello" is kept intact. The CJK examples show that character-dense languages can be tokenized relatively efficiently by modern tokenizers with large vocabularies.
Beyond the tokens derived from text, every tokenizer includes a set of special tokens that carry structural or control information. These tokens are never produced by splitting normal text; they are inserted by the tokenizer or the model framework to delimit sequences and convey metadata.
| Token | Name | Purpose |
|---|---|---|
| <BOS> or <s> | Beginning of Sequence | Marks the start of an input sequence. Used by autoregressive models like GPT and LLaMA to signal where generation begins. |
| <EOS> or </s> | End of Sequence | Marks the end of a sequence. The model is trained to output this token when it has finished generating. |
| <PAD> | Padding | Used to pad shorter sequences to the same length within a batch. Attention masks ensure that PAD tokens are ignored during computation. |
| <UNK> | Unknown | Represents tokens not in the vocabulary. Modern subword tokenizers rarely produce UNK tokens because they can decompose any input into known subwords or bytes. |
| [CLS] | Classification | Used by BERT at the start of every input. The final hidden state at this position serves as the aggregate sequence representation for classification tasks. |
| [SEP] | Separator | Used by BERT to separate two segments (e.g., a question and a passage in question answering). |
| [MASK] | Mask | Used in masked language modeling (BERT's pre-training objective). A percentage of input tokens are replaced with [MASK], and the model predicts the originals. |
| <\|endoftext\|> | End of text | Used by GPT-2 and successors to mark document boundaries during pretraining. |
| <\|im_start\|>, <\|im_end\|> | Chat role markers | OpenAI's ChatML format delimiters for system, user, and assistant turns. |
| <\|begin_of_text\|>, <\|eot_id\|> | LLaMA 3 markers | Begin-of-text and end-of-turn special tokens introduced in LLaMA 3. |
Modern chat-oriented models also use special tokens to delimit roles in a conversation. For example, OpenAI's chat format uses tokens like <|im_start|> and <|im_end|> to mark the boundaries of system, user, and assistant messages. LLaMA 3 uses <|begin_of_text|>, <|start_header_id|>, <|end_header_id|>, and <|eot_id|>. These structural tokens are critical for the model to correctly interpret multi-turn conversations, and prompt-injection attacks frequently target these markers in an attempt to confuse the model about role boundaries.
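As a rough illustration of how role markers frame a conversation, the sketch below lays out messages in a ChatML-style text form. The function name render_chatml is hypothetical, and the layout is a simplified assumption: in a deployed system the markers are inserted as dedicated special-token IDs by the serving stack, not typed as ordinary text.

```python
def render_chatml(messages: list[dict[str, str]]) -> str:
    """Lay out chat messages as ChatML-style text with role markers around each turn."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    return "".join(parts)


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a token?"},
]
print(render_chatml(conversation))
```

Because the markers carry structural meaning, tokenizer front ends generally avoid turning marker-like strings inside user content into real special tokens; that separation is what role-confusion attacks try to defeat.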
Many modern tokenizers also reserve large blocks of unused special-token slots for future use. LLaMA 3, for example, defines hundreds of reserved tokens (e.g., <|reserved_special_token_0|> through <|reserved_special_token_250|>) so that follow-up models can introduce new structural markers without breaking compatibility.
One of the most surprising consequences of how tokenizers are trained is the existence of glitch tokens: vocabulary entries that the model behaves strangely on, often refusing to repeat them, hallucinating in response to them, or producing unexpected outputs.
The phenomenon was popularized in February 2023 by Jessica Rumbelow and Matthew Watkins, working as part of the SERI MATS research program, in their post "SolidGoldMagikarp (plus, prompt generation)" on LessWrong and the AI Alignment Forum [24]. They documented that asking GPT-2 and GPT-3 models to repeat back specific strings such as SolidGoldMagikarp, TheNitromeFan, guiActiveUn, and Smartstocks produced unrelated or nonsensical responses. For example, ChatGPT in early 2023 would respond to a request to repeat "SolidGoldMagikarp" by repeating the unrelated word "distribute."
The root cause is a mismatch between the corpus used to train the tokenizer and the corpus used to train the language model itself. Many of the GPT-2 era anomalous tokens were apparently scraped from Reddit threads, the r/counting subreddit, online gaming logs, and e-commerce backends. They appeared frequently enough in the tokenizer training corpus to earn vocabulary slots, but were filtered out (or appeared too rarely) in the larger language-model training corpus. Their associated embedding vectors were therefore never updated meaningfully during pretraining, leaving them in a near-random state.
Glitch tokens have been documented in nearly every major frontier model family. Subsequent work by Land and others has identified "unreachable tokens" in GPT-4o, vocabulary entries that the BPE merge rules cannot produce from any input string, and so will never appear during inference [25]. Robust modern tokenizers and training pipelines try to detect and either remove or downweight such tokens before final model release.
The way digits are tokenized has a measurable impact on a model's arithmetic ability, a finding that has accelerated changes in how new tokenizers are designed.
LLaMA 1, LLaMA 2, and PaLM all tokenize numbers digit-by-digit, treating each of the ten digits 0 through 9 as its own single token. The earlier GPT-2 and GPT-3 vocabularies, by contrast, learned multi-digit tokens during BPE training for many numbers but not for all of them. As a result, the integer 480 might be a single token while 481 is split into ["4", "81"] and 482 is split into ["48", "2"]. This irregular grouping creates inconsistent input representations for arithmetic problems. GPT-3.5 and GPT-4 are more regular: the cl100k_base vocabulary contains a token for every one-, two-, and three-digit number, and longer numbers are chunked from left to right in groups of up to three digits, so 1234567 becomes ["123", "456", "7"]. That grouping is consistent, but it does not line up with place value.
A 2024 study by Singh and Strouse demonstrated that GPT-3.5 and GPT-4 perform measurably better on multi-digit arithmetic when numbers are written with comma separators (e.g., 1,234,567 instead of 1234567) because commas force a more consistent right-to-left tokenization that aligns the digits of the addends and the answer [14]. LLaMA's per-digit tokenization avoids this issue and was a significant factor in research showing that fine-tuned LLaMA could outperform GPT-4 on certain arithmetic tasks [26]. Newer OpenAI tokenizers and the GPT-4o family have moved toward more constrained digit-grouping rules.
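The alignment issue is easy to see with a small sketch. It assumes a simplified tokenizer that splits a digit string into chunks of at most three digits; the function names are illustrative, and real tokenizers may group digits differently.

```python
def chunk_left_to_right(digits: str, size: int = 3) -> list[str]:
    """Group digits from the left, as a greedy left-to-right tokenizer would."""
    return [digits[i:i + size] for i in range(0, len(digits), size)]


def chunk_right_to_left(digits: str, size: int = 3) -> list[str]:
    """Group digits from the right, matching thousands separators."""
    first = len(digits) % size
    chunks = [digits[:first]] if first else []
    chunks += [digits[i:i + size] for i in range(first, len(digits), size)]
    return chunks


print(chunk_left_to_right("1234567"))  # ['123', '456', '7']
print(chunk_right_to_left("1234567"))  # ['1', '234', '567']
```

With right-to-left grouping, every chunk of one number lines up with the same place values (units, thousands, millions) in another number, which is the alignment that writing 1,234,567 with commas induces.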
Similar effects have been documented for date and time data, where inconsistent tokenization of numeric components leads to systematic errors in temporal reasoning tasks [14].
Programming languages place specific demands on tokenizers that differ from natural language. Indentation in Python encodes block structure; tabs and runs of spaces are semantically meaningful; identifiers can mix camelCase, snake_case, and dot notation; and the same lexical tokens (such as braces and brackets) may appear with high frequency in tightly nested patterns.
StarCoder and StarCoder2, Code LLaMA, DeepSeek-Coder, and other code-focused LLMs train their BPE tokenizers on heavily code-weighted corpora and add special token treatment for whitespace runs. The GPT-4 tokenizer, for example, encodes runs of spaces (such as four spaces or eight spaces) as single tokens, which dramatically reduces the token count of indented Python source. StarCoder2 additionally includes special repository and filename markers to help the model distinguish code within the same project from code across projects [27]. Code LLaMA uses byte-level BPE with byte-fallback and no Unicode normalization to faithfully preserve every character of source code.
The practical effect is that a well-designed code tokenizer can fit substantially more source code into a fixed context window than a general-purpose tokenizer. A 1,000-line Python file might be 4,000 tokens for a code-trained tokenizer but 8,000 or more for an English-text-trained tokenizer that handles indentation poorly.
Tokenization works reasonably well for English and other Latin-script languages with clear word boundaries, but significant challenges arise with other writing systems.
Chinese, Japanese, and Korean present unique difficulties:
For models targeting Chinese-language users, tokenizer designers explicitly augment the vocabulary with Chinese characters and common bigrams. Qwen's 151,643-token vocabulary contains roughly 25,000 Chinese tokens, while DeepSeek-V3's 128,000-token vocabulary contains roughly 35,000 [18][23].
Arabic, Hebrew, Devanagari, Thai, and other scripts each bring their own complexities:
A 2025 evaluation found Hindi to have the lowest single-token retention rate across all evaluated tokenizers, indicating pronounced fragmentation [28]. The same study introduced the Single Token Retention Rate (STRR) as a complement to fertility, capturing how many words are preserved as single tokens rather than just the average tokens per word.
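Both metrics are straightforward to compute for any tokenizer exposed as a function from a word to a list of tokens. The sketch below uses a deliberately tiny stand-in tokenizer (toy_tokenize, a hypothetical helper that keeps known words whole and otherwise falls back to characters), and it assumes STRR is the fraction of words kept as exactly one token, following the description above.

```python
from typing import Callable

Tokenize = Callable[[str], list[str]]


def fertility(words: list[str], tokenize: Tokenize) -> float:
    """Average number of tokens produced per word."""
    return sum(len(tokenize(word)) for word in words) / len(words)


def single_token_retention_rate(words: list[str], tokenize: Tokenize) -> float:
    """Fraction of words that survive tokenization as exactly one token."""
    return sum(1 for word in words if len(tokenize(word)) == 1) / len(words)


# Hypothetical stand-in tokenizer: whole word if known, else one token per character.
TOY_VOCAB = {"the", "cat"}


def toy_tokenize(word: str) -> list[str]:
    return [word] if word in TOY_VOCAB else list(word)


sample_words = ["the", "cat", "sat"]
print(fertility(sample_words, toy_tokenize))
print(single_token_retention_rate(sample_words, toy_tokenize))
```

For the three sample words the fertility is 5/3 and the retention rate is 2/3. The two numbers can diverge: a tokenizer may keep most words whole yet shatter a few into many pieces, which raises fertility without lowering the retention rate much.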
Research has documented that tokenization creates systematic unfairness between languages. The 2023 NeurIPS paper "Language Model Tokenizers Introduce Unfairness Between Languages" by Aleksandar Petrov, Emanuele La Malfa, Philip H. S. Torr, and Adel Bibi found that the same text translated into different languages can produce token counts that differ by up to 15x, even for tokenizers explicitly designed to be multilingual [10]. Character-level and byte-level models reduced but did not eliminate the disparity, with differences of more than 4x still observed for some language pairs.
A text that costs $1 to process in English might cost $2 to $3 in Chinese or Japanese and potentially much more in low-resource languages like Tigrinya or Burmese. This "tokenization tax" means that users of non-English languages get less content per dollar and use more of their context window for the same amount of information.
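The arithmetic behind the tokenization tax is simple enough to sketch. The price and token counts below are hypothetical placeholders chosen for round numbers, not measurements or real API prices.

```python
def cost_usd(n_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of processing n_tokens at a flat per-million-token price."""
    return n_tokens / 1_000_000 * usd_per_million_tokens


def effective_context(window_tokens: int, premium: float) -> float:
    """English-equivalent content that fits in a window at a given token premium."""
    return window_tokens / premium


# Hypothetical placeholders: the same document in two languages.
PRICE_PER_MILLION = 2.00
ENGLISH_TOKENS = 500_000
OTHER_TOKENS = 1_250_000

premium = OTHER_TOKENS / ENGLISH_TOKENS
english_cost = cost_usd(ENGLISH_TOKENS, PRICE_PER_MILLION)
other_cost = cost_usd(OTHER_TOKENS, PRICE_PER_MILLION)
print(f"premium: {premium:.2f}x, cost: ${english_cost:.2f} vs ${other_cost:.2f}")
print(f"a 128,000-token window holds about "
      f"{effective_context(128_000, premium):,.0f} English-equivalent tokens")
```

The same premium applies twice: once to the bill and once to the share of the context window consumed by a given amount of content.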
Larger vocabularies partially address this problem. GPT-4o's o200k_base encoding includes roughly 10x more Cyrillic tokens than GPT-4's cl100k_base (4,660 vs. approximately 435), substantially improving efficiency for Russian and other Cyrillic-script languages [7]. Similarly, LLaMA 3's expansion from 32K to 128K tokens was motivated in part by the need for better multilingual coverage. Gemma 3's adoption of the Gemini 2.0 tokenizer was framed by Google as more balanced for non-English languages: it accepts a small increase in English token counts in exchange for substantially better Chinese, Japanese, and Korean coverage [29].
The following table summarizes the tokenization approaches used by the most prominent LLM families as of early 2026:
| Model Family | Tokenizer Library | Algorithm | Vocab Size | Pre-tokenization | Notable Features |
|---|---|---|---|---|---|
| GPT-4 | tiktoken (cl100k_base) | Byte-level BPE | ~100K | Regex-based | Strong English and code performance |
| GPT-4o / GPT-5 | tiktoken (o200k_base / Harmony) | Byte-level BPE | ~200K | Regex-based | 2x vocabulary over GPT-4; much better multilingual |
| Claude (Anthropic) | Proprietary | Not disclosed | Not disclosed | Not disclosed | Anthropic provides token counting via API |
| LLaMA 2 | SentencePiece | BPE | 32,000 | None (SentencePiece) | Relatively small vocabulary |
| LLaMA 3 / 3.2 | Custom (tiktoken-compatible) | Byte-level BPE | 128,256 | Regex-based | 4x vocabulary increase over LLaMA 2 |
| Gemini / Gemma 3 | SentencePiece | Unigram | 262,144 | None (SentencePiece) | Largest vocabulary among major models |
| BERT | WordPiece | WordPiece | 30,522 | Whitespace + punctuation | Uses ## prefix for continuation tokens |
| T5 | SentencePiece | Unigram | 32,000 | None (SentencePiece) | Treats spaces as tokens |
| Mistral / Mixtral | SentencePiece | BPE | 32,768 | None (SentencePiece) | Compact vocabulary, byte-fallback for OOV |
| Qwen 2.5/3 | tiktoken-based | Byte-level BPE | 151,643 | Regex-based | Extends cl100k_base with Chinese tokens |
| DeepSeek-V3 | Custom | Byte-level BPE | 128,000 | Custom regex | Optimized for multilingual compression |
| StarCoder2 | Custom | Byte-level BPE | ~49,000 | Code-aware | Repository and filename special tokens |
A growing line of research questions whether fixed tokenizers are necessary at all, motivated in part by the multilingual fairness problem and in part by glitch-token pathologies.
ByT5 was introduced by Google researchers in 2021 in the paper "ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models" [15]. It is a byte-level adaptation of mT5 that operates directly on UTF-8 bytes with no learned vocabulary at all. The authors found that for the same parameter budget, a deeper encoder relative to the decoder works best, and the resulting models outperform mT5 on classification and generation tasks while being more robust to orthographic noise (typos, character substitutions, casing changes). The cost is that byte sequences are roughly 4x longer than equivalent subword sequences, which raises attention costs.
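Operating on UTF-8 bytes means the "tokenizer" reduces to an encode and decode step. The sketch below illustrates the idea only; the `offset` parameter is a hypothetical stand-in for the handful of IDs a real model reserves for special tokens, and this is not ByT5's actual implementation.

```python
from typing import List


def bytes_to_ids(text: str, offset: int = 0) -> List[int]:
    """Map text to one integer ID per UTF-8 byte, shifted by `offset`."""
    return [b + offset for b in text.encode("utf-8")]


def ids_to_text(ids: List[int], offset: int = 0) -> str:
    """Invert bytes_to_ids: undo the shift and decode the bytes as UTF-8."""
    return bytes(i - offset for i in ids).decode("utf-8")


ids = bytes_to_ids("café")
print(ids)               # one ID per byte; "é" contributes two
print(ids_to_text(ids))  # round-trips to the original string
```

No input is ever out of vocabulary, since every string is a byte sequence, but the ID sequence is as long as the byte count rather than the word or subword count.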
MEGABYTE, introduced by Lili Yu, Daniel Simig, Colin Flaherty, Armen Aghajanyan, Luke Zettlemoyer, and Mike Lewis at Meta in May 2023 (arxiv 2305.07185), tackles the byte-sequence-length problem by chunking bytes into fixed-size patches and applying a hierarchical transformer [30]. A small "local" transformer operates within each patch, while a large "global" transformer operates between patches. This decomposition makes attention sub-quadratic in total sequence length and enables modeling of sequences over one million bytes.
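A back-of-the-envelope count shows why the decomposition helps. The sketch below counts only pairwise attention interactions and ignores model width, feed-forward layers, and constant factors; the patch size is an arbitrary illustrative choice, not a setting taken from the paper.

```python
def full_attention_pairs(seq_len: int) -> int:
    """Pairwise interactions for plain self-attention over every byte."""
    return seq_len * seq_len


def patched_attention_pairs(seq_len: int, patch_size: int) -> int:
    """Global attention across patches plus local attention within each patch.

    Assumes seq_len is an exact multiple of patch_size.
    """
    n_patches = seq_len // patch_size
    global_pairs = n_patches * n_patches
    local_pairs = n_patches * patch_size * patch_size
    return global_pairs + local_pairs


SEQ_LEN = 1_000_000  # one million bytes
PATCH_SIZE = 100     # illustrative choice

full = full_attention_pairs(SEQ_LEN)
patched = patched_attention_pairs(SEQ_LEN, PATCH_SIZE)
print(f"full: {full:,} pairs, patched: {patched:,} pairs, ratio: {full // patched:,}x")
```

The global term shrinks as patches grow while the local term grows, so the total is smallest at an intermediate patch size.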
MambaByte applies the Mamba state-space architecture to byte sequences [31]. Because Mamba has fixed-size hidden state regardless of sequence length, the long-sequence cost of byte-level modeling becomes manageable. Junxiong Wang and colleagues showed that MambaByte is competitive with and sometimes outperforms subword transformer baselines, while inheriting the robustness benefits of byte-level modeling. Speculative decoding gives an additional 2.6x inference speedup.
Meta's Byte Latent Transformer, posted as arxiv 2412.09871 in December 2024, takes a different approach to chunking bytes [1]. Instead of fixed patches, BLT segments byte sequences dynamically based on the entropy of the next byte. Predictable runs (like long English words) get grouped into long patches, while complex regions (rare tokens, numbers, code) get shorter patches and more compute. BLT is the first byte-level model demonstrated to match the performance of LLaMA 3 at the 8 billion parameter / 4 trillion training-byte scale, while eliminating the tokenizer entirely. The Meta team has open-sourced training and inference code in the facebookresearch/blt repository.
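Entropy-based segmentation can be sketched with a toy estimator. The code below is not BLT's implementation: it assumes a simple bigram count over the input itself as the source of next-byte entropies, and the threshold rule is a simplified illustration of starting a new patch where the next byte is hard to predict.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def next_byte_entropies(data: bytes) -> List[float]:
    """Entropy in bits of each byte given the previous one (toy bigram estimate)."""
    following: Dict[int, Counter] = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        following[prev][nxt] += 1
    entropies = [8.0]  # first byte has no context: assume maximum uncertainty
    for prev in data[:-1]:
        counts = following[prev]
        total = sum(counts.values())
        entropies.append(
            -sum(c / total * math.log2(c / total) for c in counts.values())
        )
    return entropies


def entropy_patches(data: bytes, threshold: float) -> List[bytes]:
    """Start a new patch wherever the estimated entropy exceeds `threshold`."""
    patches: List[bytes] = []
    start = 0
    for i, h in enumerate(next_byte_entropies(data)):
        if i > start and h > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches


text = b"the cat sat on the mat, the cat sat on the hat"
print(entropy_patches(text, threshold=1.0))
```

Raising the threshold yields fewer, longer patches; lowering it yields more, shorter ones, which is the knob that trades compute against granularity in this kind of scheme.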
If tokenizer-free approaches scale further, several practical pain points of fixed-vocabulary models simply disappear: there are no glitch tokens, no multilingual fairness gap from vocabulary allocation, no need to retrain a tokenizer when adding new domains, and no asymmetric digit grouping for arithmetic. The trade-off is engineering: byte-level inference requires hierarchical or state-space architectures to remain compute-efficient, and the existing tooling ecosystem assumes fixed-vocabulary tokenizers at every layer.
Researchers continue to refine BPE itself rather than abandon tokenization entirely.
Several tools exist for counting tokens before sending text to an API:
- tiktoken (OpenAI): pip install tiktoken and use tiktoken.encoding_for_model("gpt-4o") to get the right encoding for a specific model.
- Anthropic: use the messages.count_tokens endpoint for Claude models.
- Hugging Face: the transformers library can count tokens via tokenizer.encode(text) or tokenizer(text)["input_ids"].

When training a custom tokenizer, the vocabulary size is a key decision. A common starting point for English-only models is 32K to 50K tokens. Multilingual models typically need 100K or more. The right size depends on the training data distribution, target languages, and computational budget. Larger vocabularies improve tokenization efficiency but increase model parameter count and the cost of the final softmax.
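The parameter cost of a larger vocabulary can be estimated directly. The sketch below assumes a hypothetical hidden size of 4,096 and counts only the input embedding table and, when weights are untied, the output projection.

```python
def vocab_parameters(vocab_size: int, d_model: int, tied: bool = False) -> int:
    """Parameters in the embedding table plus the output projection.

    With tied weights the two matrices are shared and counted once.
    """
    embedding = vocab_size * d_model
    return embedding if tied else 2 * embedding


D_MODEL = 4096  # hypothetical hidden size, for illustration only

for vocab_size in (32_000, 128_256, 262_144):
    params = vocab_parameters(vocab_size, D_MODEL)
    print(f"vocab {vocab_size:>7,}: {params / 1e9:.2f}B parameters (untied)")
```

Because this cost is fixed regardless of depth, it weighs far more heavily on small models than on large ones.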
A language model can only be used with the tokenizer it was trained with. Swapping in a different tokenizer at inference time will produce nonsensical results because the token IDs will map to the wrong embeddings. If you fine-tune a model, you generally keep the same tokenizer unless you are adding new tokens (e.g., for a specialized domain), in which case the embedding matrix must be resized accordingly and the new rows initialized (typically as the mean of related existing rows).
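The resize-and-initialize step can be sketched without any framework. The function name and the choice of "related" rows below are hypothetical; real libraries offer their own resizing utilities, and this plain-list version shows only the mean-initialization idea.

```python
from typing import List

Matrix = List[List[float]]


def extend_embeddings(embeddings: Matrix, related_ids: List[List[int]]) -> Matrix:
    """Return a copy of `embeddings` with one new row per entry in `related_ids`.

    Each new row is the mean of the existing rows listed for it, for example
    the rows of the subword pieces the new token used to be split into.
    """
    dim = len(embeddings[0])
    extended = [row[:] for row in embeddings]  # leave the original untouched
    for ids in related_ids:
        extended.append(
            [sum(embeddings[i][k] for i in ids) / len(ids) for k in range(dim)]
        )
    return extended


# Toy 3-token vocabulary with 2-dimensional embeddings.
old = [[0.0, 2.0], [2.0, 4.0], [10.0, 10.0]]
# One new token, initialized from existing tokens 0 and 1.
new = extend_embeddings(old, related_ids=[[0, 1]])
print(len(new), new[-1])
```

Existing rows keep their IDs and values, so the model's behavior on old tokens is unchanged, and the new row starts from a plausible point instead of random noise.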
Reusing a pretrained tokenizer for a new model from scratch (sometimes called "tokenizer recycling") is technically possible but can introduce subtle issues if the new training corpus has very different statistics from the corpus the tokenizer was trained on, leading to underused or overused tokens [33].
For practitioners who want to build intuition for how BPE works under the hood, Andrej Karpathy's minbpe repository and the accompanying "Let's build the GPT Tokenizer" video lecture (released February 2024 as part of the Neural Networks: Zero to Hero series) provide a step-by-step from-scratch implementation of byte-level BPE in roughly 200 lines of Python [13]. The tutorial walks through training, encoding, decoding, and reproducing the GPT-2 and GPT-4 regex pre-tokenizers, and is widely cited as the most accessible introduction to the topic.
Tokenization has evolved significantly alongside the field of NLP:
[1] Meta AI, "Byte Latent Transformer: Patches Scale Better Than Tokens," arxiv 2412.09871, December 2024. Available: https://arxiv.org/abs/2412.09871
[2] R. Sennrich, B. Haddow, and A. Birch, "Neural Machine Translation of Rare Words with Subword Units," Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 1715-1725, 2016 (arxiv 1508.07909, August 2015). Available: https://aclanthology.org/P16-1162/
[3] M. Schuster and K. Nakajima, "Japanese and Korean Voice Search," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5149-5152, 2012. Available: https://research.google/pubs/japanese-and-korean-voice-search/
[4] T. Kudo, "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates," Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018 (arxiv 1804.10959). Available: https://arxiv.org/abs/1804.10959
[5] T. Kudo and J. Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing," Proceedings of EMNLP 2018 System Demonstrations, pp. 66-71, 2018 (arxiv 1808.06226). Available: https://arxiv.org/abs/1808.06226
[6] OpenAI, "tiktoken: A fast BPE tokeniser for use with OpenAI's models," GitHub repository, 2023. Available: https://github.com/openai/tiktoken
[7] Tokenization efficiency benchmarks and vocabulary analysis, LLM Calculator, 2025. Available: https://llm-calculator.com/blog/tokenization-performance-benchmark/
[8] API pricing as of early 2026 from official provider documentation: OpenAI (https://openai.com/pricing), Anthropic (https://docs.anthropic.com/en/docs/about-claude/pricing).
[9] OpenAI, "What are tokens and how to count them?" OpenAI Help Center. Available: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
[10] A. Petrov, E. La Malfa, P. H. S. Torr, and A. Bibi, "Language Model Tokenizers Introduce Unfairness Between Languages," Advances in Neural Information Processing Systems (NeurIPS) 2023 (arxiv 2305.15425). Available: https://arxiv.org/abs/2305.15425
[11] Recent BPE improvements presented at COLM 2025 and early 2026 publications, including SuperBPE (arxiv 2503.13423) and LiteToken. Available: https://arxiv.org/abs/2503.13423
[12] ByteFlow authors, "ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer," March 2026. Available: https://arxiv.org/abs/2603.03583
[13] A. Karpathy, "Let's build the GPT Tokenizer," Neural Networks: Zero to Hero, February 2024, and the minbpe GitHub repository. Available: https://github.com/karpathy/minbpe and https://karpathy.ai/zero-to-hero.html
[14] A. K. Singh and D. Strouse, "Tokenization counts: the impact of tokenization on arithmetic in frontier LLMs," arxiv 2402.14903, 2024. Available: https://arxiv.org/abs/2402.14903
[15] L. Xue, A. Barua, N. Constant, R. Al-Rfou, S. Narang, M. Kale, A. Roberts, and C. Raffel, "ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models," Transactions of the Association for Computational Linguistics, 2022 (arxiv 2105.13626). Available: https://arxiv.org/abs/2105.13626
[16] P. Gage, "A New Algorithm for Data Compression," The C Users Journal, vol. 12, no. 2, pp. 23-38, February 1994. Available: https://web.archive.org/web/20160326135237/http://www.pennelynn.com/Documents/CUJ/HTML/94HTML/19940045.HTM
[17] Performance benchmarks comparing tiktoken with rs-bpe and other Rust implementations. Available: https://github.com/openai/tiktoken and https://crates.io/crates/bpe
[18] DeepSeek-AI, "DeepSeek-V3 Technical Report," arxiv 2412.19437, December 2024. Available: https://arxiv.org/abs/2412.19437
[19] BoundlessBPE authors, "Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier," arxiv 2504.00178, COLM 2025. Available: https://arxiv.org/abs/2504.00178
[20] Hugging Face, "tokenizers: Fast State-of-the-Art Tokenizers," GitHub repository. Available: https://github.com/huggingface/tokenizers
[21] Meta AI, "Llama 3 model documentation," Hugging Face. Available: https://huggingface.co/docs/transformers/en/model_doc/llama3
[22] Anthropic, "Token counting," Claude API Documentation. Available: https://docs.claude.com/en/docs/build-with-claude/token-counting
[23] Qwen team, "Tokenization Note," GitHub repository. Available: https://github.com/QwenLM/Qwen/blob/main/tokenization_note.md
[24] J. Rumbelow and M. Watkins, "SolidGoldMagikarp (plus, prompt generation)," AI Alignment Forum, February 2023. Available: https://www.alignmentforum.org/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation
[25] S. Land, "Unreachable tokens in GPT-4o," Token Contributions, 2024. Available: https://tokencontributions.substack.com/p/unreachable-tokens-in-gpt-4o
[26] T. Liu and B. K. H. Low, "Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks," arxiv 2305.14201, 2023. Available: https://arxiv.org/abs/2305.14201
[27] BigCode Project, "StarCoder2 and The Stack v2," 2024. Available: https://huggingface.co/blog/starcoder
[28] Authors of the STRR study, "Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation," arxiv 2510.09947, 2025. Available: https://arxiv.org/abs/2510.09947
[29] Google Developers Blog, "Gemma explained: What's new in Gemma 3," 2025. Available: https://developers.googleblog.com/gemma-explained-whats-new-in-gemma-3/
[30] L. Yu, D. Simig, C. Flaherty, A. Aghajanyan, L. Zettlemoyer, and M. Lewis, "MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers," arxiv 2305.07185, May 2023. Available: https://arxiv.org/abs/2305.07185
[31] J. Wang et al., "MambaByte: Token-free Selective State Space Model," arxiv 2401.13660, 2024. Available: https://arxiv.org/abs/2401.13660
[32] J. Hayase, A. Liu, et al., "Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?" arxiv 2407.16607, 2024. Available: https://arxiv.org/abs/2407.16607
[33] C. Arnett, "wHy DoNt YoU jUsT uSe ThE lLaMa ToKeNiZeR??" Hugging Face blog, 2024. Available: https://huggingface.co/blog/catherinearnett/dangers-of-tokenizer-recycling