Tokenization is the process of breaking text into smaller units called tokens, which serve as the fundamental input to natural language processing (NLP) systems and large language models (LLMs). These tokens can be words, subwords, characters, or even individual bytes depending on the tokenization strategy used. Because machine learning models operate on numerical data rather than raw text, tokenization acts as the bridge between human-readable language and the numerical representations that models consume.
Every modern language model, from BERT and GPT-4 to Claude and LLaMA, relies on a tokenizer to convert input text into a sequence of integer IDs. Each ID maps to a specific entry in the model's vocabulary, which is then transformed into a dense vector via an embedding layer. The quality and design of the tokenizer directly influence how well a model understands and generates text.
Tokenization is not merely a preprocessing step; it shapes nearly every aspect of a language model's behavior, including sequence length, context-window efficiency, multilingual coverage, and API cost.
A poorly designed tokenizer can waste context window space, introduce artifacts into generated text, and create systematic disadvantages for underrepresented languages.
Tokenization strategies fall along a spectrum from coarse-grained (full words) to fine-grained (individual bytes). Each approach involves trade-offs between vocabulary size, sequence length, and coverage.
The simplest approach splits text on whitespace and punctuation, treating each word as a separate token. Early NLP systems commonly used word-level tokenization.
Advantages: Tokens are semantically meaningful and easy to interpret. Each token corresponds directly to a word that humans recognize.
Disadvantages: The vocabulary must include every word the model might encounter. English alone has hundreds of thousands of distinct word forms when accounting for conjugations, plurals, and compounds. Rare words, proper nouns, and misspellings produce unknown tokens. For morphologically rich languages like Finnish, Turkish, or German (with its long compound words), word-level tokenization is particularly impractical.
At the opposite extreme, character-level tokenization treats each individual character (letter, digit, punctuation mark) as a token. The vocabulary is extremely small, often fewer than 300 entries for English.
Advantages: Zero out-of-vocabulary problems, since any text can be represented as a sequence of known characters. Very compact vocabulary.
Disadvantages: Sequences become very long. The sentence "Tokenization is important" requires roughly 25 character tokens instead of 3 word tokens. Longer sequences mean higher computational costs, and the model must learn to compose meaning from individual characters, which is a harder learning problem.
Subword tokenization occupies the middle ground and has become the dominant approach in modern NLP. It breaks text into units that are larger than characters but often smaller than words. Common words remain intact as single tokens, while rare words are decomposed into recognizable subword pieces.
For example, the word "tokenization" might be split into ["token", "ization"], while a common word like "the" stays as a single token. This balances vocabulary size, sequence length, and coverage. All major LLMs released since 2017 use some form of subword tokenization.
Byte-level tokenization operates on raw bytes (values 0 through 255) rather than Unicode characters. Since any digital text is ultimately a sequence of bytes, this approach can handle any language, encoding, or even binary data without special preprocessing. GPT-2 introduced byte-level BPE, which starts from a base vocabulary of 256 byte values and builds up subword tokens from there. More recent research into byte-level models, such as Meta's Byte Latent Transformer (BLT), explores operating directly on bytes without a fixed tokenizer at all [1].
Several specific algorithms have become standard in the field. Each takes a different approach to learning which subword units to include in the vocabulary.
Byte Pair Encoding was originally a data compression algorithm described by Philip Gage in 1994. In 2015, Rico Sennrich, Barry Haddow, and Alexandra Birch adapted it for neural machine translation in their paper "Neural Machine Translation of Rare Words with Subword Units," published at ACL 2016 [2]. This paper demonstrated that BPE could effectively handle rare and unknown words by decomposing them into subword units.
How BPE works:

1. Initialize the vocabulary with all individual characters (or bytes) in the training corpus.
2. Count the frequency of every adjacent pair of symbols in the corpus.
3. Merge the most frequent pair into a single new symbol and add it to the vocabulary.
4. Repeat steps 2 and 3 until the target vocabulary size is reached.
For example, suppose the pair ("t", "h") is the most frequent. After merging, every occurrence of "t" followed by "h" becomes the single token "th". Then perhaps ("th", "e") becomes the next most frequent pair, producing "the". The process continues until the desired vocabulary size is reached.
BPE is deterministic: given the same training data and the same number of merges, it always produces the same vocabulary. It is the most widely used tokenization algorithm in modern LLMs, employed by the GPT model family, LLaMA 3, and many others.
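The merge loop described above can be sketched in a few lines of Python. This is a toy illustration over an invented three-word corpus, not a production implementation:

```python
from collections import Counter

def pair_counts(corpus):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Fuse every occurrence of `pair` into a single symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is a tuple of characters, mapped to its frequency.
corpus = {tuple("the"): 5, tuple("then"): 2, tuple("this"): 3}
merges = []
for _ in range(2):
    pairs = pair_counts(corpus)
    best = max(pairs, key=pairs.get)  # most frequent pair wins
    merges.append(best)
    corpus = merge_pair(corpus, best)
print(merges)  # [('t', 'h'), ('th', 'e')]
```

On this corpus, ("t", "h") is merged first (10 occurrences), then ("th", "e"), matching the worked example above.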
WordPiece was developed by Mike Schuster and Kaisuke Nakajima at Google in 2012, originally for Japanese and Korean voice search systems [3]. It was later adopted for BERT and other Google NLP models.
WordPiece is similar to BPE but differs in how it selects which pairs to merge. Instead of choosing the most frequent pair, WordPiece selects the pair whose merger maximizes the likelihood of the training data under a unigram language model. In practice, this means it computes a score for each candidate merge:
`score(pair) = freq(pair) / (freq(first) * freq(second))`
The pair with the highest score is merged. This tends to favor merges that combine individually rare symbols that frequently appear together, rather than simply the globally most common pair.
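A quick numeric sketch of this scoring rule (the frequency counts below are made up for illustration):

```python
def merge_score(pair_freq, first_freq, second_freq):
    """WordPiece-style merge score: pair frequency normalized by part frequencies."""
    return pair_freq / (first_freq * second_freq)

# A pair of individually rare symbols that almost always co-occur...
rare_but_bound = merge_score(50, 60, 1000)
# ...outscores a globally frequent pair built from very common symbols.
frequent_parts = merge_score(500, 2000, 3000)
print(rare_but_bound > frequent_parts)  # True
```

Under pure frequency (as in BPE), the second pair would be merged first; under the WordPiece score, the first one is.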
WordPiece uses a distinctive notation: continuation tokens are prefixed with "##" to indicate they are not the start of a word. For instance, the word "embedding" might be tokenized as ["em", "##bed", "##ding"]. This convention makes it straightforward to reconstruct the original text by joining tokens and removing the "##" prefixes.
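The reconstruction step is mechanical; a minimal sketch:

```python
def wordpiece_detokenize(tokens):
    """Rejoin WordPiece tokens: a leading '##' marks a continuation piece."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]   # glue onto the previous word
        else:
            words.append(tok)
    return " ".join(words)

print(wordpiece_detokenize(["em", "##bed", "##ding", "layer"]))  # embedding layer
```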
The Unigram model takes a fundamentally different approach from BPE and WordPiece. Rather than starting small and iteratively merging, it starts with a large initial vocabulary and prunes it down. Taku Kudo introduced this method in his 2018 paper "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates" [4].
How the Unigram model works:

1. Initialize a large seed vocabulary (e.g., all frequent substrings in the training corpus).
2. Fit a unigram language model over the current vocabulary, typically via the EM algorithm.
3. For each token, compute how much the corpus likelihood would drop if that token were removed.
4. Prune the tokens whose removal hurts the likelihood least.
5. Repeat steps 2 through 4 until the vocabulary shrinks to the target size.
A notable feature of the Unigram model is that it can produce multiple valid segmentations for the same input, each with a different probability. This enables subword regularization, where different segmentations are sampled during training to improve the model's robustness.
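Finding the most probable segmentation can be illustrated with a small Viterbi search over a toy vocabulary (the token probabilities below are invented for the example):

```python
import math

# Hypothetical unigram vocabulary: token -> probability.
vocab = {
    "token": 0.05, "ization": 0.01, "t": 0.02, "o": 0.02,
    "ken": 0.005, "iz": 0.004, "ation": 0.02, "to": 0.01,
}

def best_segmentation(word):
    """Viterbi search for the most probable segmentation under a unigram model."""
    n = len(word)
    # best[i] = (log-prob of the best segmentation of word[:i], backpointer)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in vocab and best[start][0] > -math.inf:
                score = best[start][0] + math.log(vocab[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    # Walk the backpointers to recover the winning tokens.
    tokens, i = [], n
    while i > 0:
        start = best[i][1]
        tokens.append(word[start:i])
        i = start
    return list(reversed(tokens))

print(best_segmentation("tokenization"))  # ['token', 'ization']
```

Here ["token", "ization"] beats alternatives like ["to", "ken", "ization"] because fewer, higher-probability pieces yield a higher total log-probability.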
SentencePiece, also by Taku Kudo (with John Richardson), is not a tokenization algorithm per se but a library and framework that implements both BPE and the Unigram model [5]. It was published at EMNLP 2018. What distinguishes SentencePiece from other implementations is its treatment of text:
SentencePiece is used by many prominent models including the original LLaMA (1 and 2), T5, ALBERT, XLNet, and Google's Gemini family (which uses SentencePiece with the Unigram model).
tiktoken is OpenAI's open-source tokenizer library, written in Rust with Python bindings for speed. It implements byte-level BPE and is designed specifically for OpenAI's model family. tiktoken is between 3x and 6x faster than comparable open-source tokenizers [6].
OpenAI has released several tokenizer encodings through tiktoken:
| Encoding | Vocabulary Size | Models | Release |
|---|---|---|---|
| r50k_base | ~50,257 | GPT-2, early GPT-3 | 2019 |
| p50k_base | ~50,281 | Codex, text-davinci-002/003 | 2022 |
| cl100k_base | ~100,256 | GPT-3.5-turbo, GPT-4 | 2023 |
| o200k_base | ~200,015 | GPT-4o, o3-mini, o4-mini | 2024 |
| o200k_harmony | ~201,088 | GPT-5 | 2025 |
The progression from 50K to 200K tokens reflects a broader industry trend toward larger vocabularies. Larger vocabularies produce shorter token sequences (reducing compute in self-attention) and improve multilingual coverage, at the cost of a larger embedding matrix.
Tokenizer training is a separate step from model training. It typically occurs before the language model itself is trained, using a large representative corpus of text.
The training process varies by algorithm, but the general workflow is:

1. Collect a large, representative training corpus.
2. Normalize and (optionally) pre-tokenize the text.
3. Run the chosen algorithm (BPE merging, WordPiece scoring, or Unigram pruning) until the target vocabulary size is reached.
4. Serialize the resulting vocabulary, plus merge rules if any, for use at inference time.
For BPE-based tokenizers, the output of training is an ordered list of merge rules. During tokenization, these merges are applied in order to the input text. The order matters: earlier merges take precedence. For example, if merge rule 100 is ("t", "o") and merge rule 500 is ("to", "ken"), then "token" is first processed by merging "t"+"o" into "to", and later (if applicable) "to"+"ken" into "token".
The total number of merge rules equals the final vocabulary size minus the base vocabulary size. A tokenizer with a 100K vocabulary and a 256-byte base vocabulary has approximately 99,744 merge rules.
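Applying an ordered merge list can be sketched as follows (a simplified version that applies each rule in sequence; the intermediate rules producing "ken" are invented so the example is complete):

```python
def apply_merges(word, merges):
    """Apply BPE merge rules in order; earlier rules take precedence."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # fuse the pair in place
            else:
                i += 1
    return symbols

# Rule order matters: "to" and "ken" must exist before ("to", "ken") can apply.
merges = [("t", "o"), ("k", "e"), ("ke", "n"), ("to", "ken")]
print(apply_merges("token", merges))  # ['token']
```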
Many tokenizers apply a pre-tokenization step before running the subword algorithm. This typically involves:

- Splitting text into word-like chunks along whitespace and punctuation boundaries.
- Keeping letter runs, digit runs, and symbol runs as separate chunks.
- Handling common contractions (e.g., "it's" → "it" + "'s") as dedicated cases.
GPT-4's tokenizer, for instance, uses a complex regex pattern that handles contractions, letter sequences, digit sequences, and whitespace. This pre-tokenization prevents merges from spanning across word boundaries in most cases, which helps maintain linguistically meaningful tokens.
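A simplified, GPT-2-style pattern (not OpenAI's actual regex, which is more elaborate and uses Unicode categories) illustrates the idea:

```python
import re

# Simplified pre-tokenization: contractions, letter runs, digit runs,
# symbol runs, and whitespace are kept as separate chunks.
PATTERN = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?\d+| ?[^\sA-Za-z\d]+|\s+")

chunks = PATTERN.findall("Hello world, it's 2024!")
print(chunks)  # ['Hello', ' world', ',', ' it', "'s", ' 2024', '!']
```

Because subword merges run within each chunk, no merge can span a word boundary, which is exactly the property described above.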
SentencePiece is notable for skipping pre-tokenization entirely, treating the raw input stream as a flat sequence.
Vocabulary size is one of the most important hyperparameters in tokenizer design. It directly affects model architecture, training efficiency, and inference cost.
| Model | Tokenizer | Algorithm | Vocabulary Size |
|---|---|---|---|
| GPT-2 | tiktoken (r50k_base) | Byte-level BPE | 50,257 |
| BERT | WordPiece | WordPiece | 30,522 |
| LLaMA 2 | SentencePiece | BPE | 32,000 |
| LLaMA 3 | tiktoken-based | Byte-level BPE | 128,256 |
| GPT-4 | tiktoken (cl100k_base) | Byte-level BPE | 100,256 |
| GPT-4o | tiktoken (o200k_base) | Byte-level BPE | ~200,015 |
| Gemini / Gemma 3 | SentencePiece | Unigram | 262,144 |
| Claude (Anthropic) | Proprietary | Not publicly disclosed | Not publicly disclosed |
| Mistral | SentencePiece | BPE | 32,768 |
Small vocabularies (30K to 50K) produce longer token sequences but have smaller embedding matrices. They may struggle with multilingual text and code.
Large vocabularies (100K to 260K+) produce shorter sequences (reducing attention computation) and handle diverse languages better, but increase the size of the embedding and output projection layers. Gemini 3's 262K vocabulary, for example, requires computing a softmax over 262,144 logits at each generation step, roughly 8x more than LLaMA 2's 32K vocabulary [7].
The trend since 2023 has been clearly toward larger vocabularies. LLaMA jumped from 32K to 128K between versions 2 and 3. OpenAI doubled from 100K to 200K with GPT-4o. This growth is driven by improved multilingual coverage and the observation that the computational savings from shorter sequences outweigh the cost of larger embedding tables.
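The embedding cost of a larger vocabulary is easy to quantify. Assuming a hypothetical hidden dimension of 4,096, the input embedding matrix alone scales as follows (an untied output projection adds a similar amount):

```python
d_model = 4096  # hypothetical hidden dimension, for illustration only

for name, vocab_size in [("LLaMA 2", 32_000), ("GPT-4o", 200_000), ("Gemma 3", 262_144)]:
    params = vocab_size * d_model  # one d_model-sized row per vocabulary entry
    print(f"{name}: {params / 1e9:.2f}B embedding parameters")
```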
Commercial LLM APIs universally measure usage in tokens, making tokenization directly relevant to cost.
When you send a prompt to an API like OpenAI's or Anthropic's, the text is tokenized, and you are charged based on the number of input tokens and output tokens. Input tokens (your prompt) and output tokens (the model's response) are typically priced separately, with output tokens costing more because they require the computationally expensive autoregressive generation process.
As of early 2026, representative pricing per million tokens [8]:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-5.2 | $1.75 | $14.00 |
| Claude Sonnet 4.5 | $3.00 | $15.00 |
| Claude Haiku 4.5 | $1.00 | $5.00 |
| Claude Opus 4.5 | $5.00 | $25.00 |
| Gemini 2.5 Pro | $1.25 | $10.00 |
Because different tokenizers produce different numbers of tokens for the same text, comparing API prices across providers requires understanding the tokenization efficiency of each model. A model that tokenizes your text into fewer tokens may be cheaper even if its per-token price is higher.
For English text, a rough rule of thumb is that 1 million tokens is approximately 750,000 words. For code, the ratio can vary significantly depending on the programming language, formatting, and comment density. Tokenization efficiency drops sharply for non-English text (see the multilingual section below), meaning the same content in Chinese or Arabic can cost significantly more to process than its English equivalent.
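Putting token counts and prices together is simple arithmetic; a sketch using the GPT-4o rates from the table above:

```python
def api_cost(input_tokens, output_tokens, price_in, price_out):
    """Dollar cost of one request, given per-million-token prices."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# A 50K-token prompt with a 10K-token response at GPT-4o rates ($2.50 / $10.00):
cost = api_cost(50_000, 10_000, price_in=2.50, price_out=10.00)
print(f"${cost:.3f}")  # $0.225
```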
A common point of confusion is the relationship between tokens and words. They are not the same thing.
For English text processed by modern subword tokenizers, the typical ratio is approximately 1.3 tokens per word, though this varies by content type [9]. Common short words like "the", "is", and "a" are usually single tokens. Longer or less common words may be split into 2 to 4 subword tokens. Punctuation marks, spaces (in some tokenizers), and special formatting characters each consume their own tokens.
OpenAI's documentation suggests a useful approximation: 1 token is roughly 4 characters or 0.75 words in English [9].
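That rule of thumb yields a one-line estimator (English-only; exact counts require the model's actual tokenizer):

```python
def estimate_tokens(text):
    """Rough token estimate from the ~4 characters-per-token rule of thumb."""
    return max(1, round(len(text) / 4))

print(estimate_tokens("Tokenization is important"))  # 25 chars -> ~6 tokens
```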
Here is an example of how GPT-4's tokenizer (cl100k_base) processes a sentence:
| Text | Tokens | Token Count |
|---|---|---|
| "Hello world" | ["Hello", " world"] | 2 |
| "Tokenization" | ["Token", "ization"] | 2 |
| "GPT-4 is great" | ["GPT", "-", "4", " is", " great"] | 5 |
| "antidisestablishmentarianism" | ["ant", "idis", "establishment", "arian", "ism"] | 5 |
| "こんにちは" (Japanese greeting) | ["こんにちは"] | 1 |
| "人工智能" (Chinese: AI) | ["人工", "智能"] | 2 |
Note how the long English word "antidisestablishmentarianism" is broken into five meaningful subword pieces, while the common word "Hello" is kept intact. The CJK examples show that character-dense languages can be tokenized relatively efficiently by modern tokenizers with large vocabularies.
Beyond the tokens derived from text, every tokenizer includes a set of special tokens that carry structural or control information. These tokens are never produced by splitting normal text; they are inserted by the tokenizer or the model framework to delimit sequences and convey metadata.
| Token | Name | Purpose |
|---|---|---|
| `<BOS>` or `<s>` | Beginning of Sequence | Marks the start of an input sequence. Used by autoregressive models like GPT and LLaMA to signal where generation begins. |
| `<EOS>` or `</s>` | End of Sequence | Marks the end of a sequence. The model is trained to output this token when it has finished generating. |
| `<PAD>` | Padding | Used to pad shorter sequences to the same length within a batch. Attention masks ensure that PAD tokens are ignored during computation. |
| `<UNK>` | Unknown | Represents tokens not in the vocabulary. Modern subword tokenizers rarely produce UNK tokens because they can decompose any input into known subwords or bytes. |
| `[CLS]` | Classification | Used by BERT at the start of every input. The final hidden state at this position serves as the aggregate sequence representation for classification tasks. |
| `[SEP]` | Separator | Used by BERT to separate two segments (e.g., a question and a passage in question answering). |
| `[MASK]` | Mask | Used in masked language modeling (BERT's pre-training objective). A percentage of input tokens are replaced with `[MASK]`, and the model predicts the originals. |
Modern chat-oriented models also use special tokens to delimit roles in a conversation. For example, OpenAI's chat format uses tokens like `<|im_start|>` and `<|im_end|>` to mark the boundaries of system, user, and assistant messages. LLaMA uses `<|begin_of_text|>`, `<|start_header_id|>`, and related tokens. These structural tokens are critical for the model to correctly interpret multi-turn conversations.
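A sketch of how such role markers frame a conversation (illustrative layout only; each provider's exact template differs):

```python
def render_chat(messages):
    """Render (role, content) pairs with ChatML-style delimiter tokens."""
    return "\n".join(
        f"<|im_start|>{role}\n{content}<|im_end|>" for role, content in messages
    )

print(render_chat([("system", "You are helpful."), ("user", "Hi!")]))
```

In a real pipeline the delimiter strings map to dedicated single token IDs, so the model can never confuse them with ordinary text.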
Tokenization works reasonably well for English and other Latin-script languages with clear word boundaries, but significant challenges arise with other writing systems.
Chinese, Japanese, and Korean present unique difficulties:

- These scripts do not separate words with spaces, so there are no obvious boundaries for pre-tokenization.
- Their character inventories are large (tens of thousands of Han characters), straining fixed vocabularies.
- Each CJK character occupies three bytes in UTF-8, so byte-level tokenizers can split a single character across multiple tokens.
Arabic, Hebrew, Devanagari, Thai, and other scripts each bring their own complexities:

- Thai, like the CJK scripts, is written without spaces between words.
- Arabic and Hebrew are morphologically rich, attaching articles, prepositions, and pronouns directly to words.
- Devanagari and other Indic scripts use combining vowel signs that byte-level tokenizers can split awkwardly.
Research has documented that tokenization creates systematic unfairness between languages [10]. A text that costs $1 to process in English might cost $2 to $3 in Chinese or Japanese and potentially much more in low-resource languages like Tigrinya or Burmese. This "tokenization tax" means that users of non-English languages get less content per dollar and use more of their context window for the same amount of information.
Larger vocabularies partially address this problem. GPT-4o's o200k_base encoding includes roughly 10x more Cyrillic tokens than GPT-4's cl100k_base (4,660 vs. approximately 435), substantially improving efficiency for Russian and other Cyrillic-script languages [7]. Similarly, LLaMA 3's expansion from 32K to 128K tokens was motivated in part by the need for better multilingual coverage.
The following table summarizes the tokenization approaches used by the most prominent LLM families as of early 2026:
| Model Family | Tokenizer Library | Algorithm | Vocab Size | Pre-tokenization | Notable Features |
|---|---|---|---|---|---|
| GPT-4 | tiktoken (cl100k_base) | Byte-level BPE | ~100K | Regex-based | Strong English and code performance |
| GPT-4o / GPT-5 | tiktoken (o200k_base / Harmony) | Byte-level BPE | ~200K | Regex-based | 2x vocabulary over GPT-4; much better multilingual |
| Claude (Anthropic) | Proprietary | Not disclosed | Not disclosed | Not disclosed | Anthropic provides token counting via API |
| LLaMA 2 | SentencePiece | BPE | 32,000 | None (SentencePiece) | Relatively small vocabulary |
| LLaMA 3 / 3.2 | Custom (tiktoken-compatible) | Byte-level BPE | 128,256 | Regex-based | 4x vocabulary increase over LLaMA 2 |
| Gemini / Gemma 3 | SentencePiece | Unigram | 262,144 | None (SentencePiece) | Largest vocabulary among major models |
| BERT | WordPiece | WordPiece | 30,522 | Whitespace + punctuation | Uses ## prefix for continuation tokens |
| T5 | SentencePiece | Unigram | 32,000 | None (SentencePiece) | Treats spaces as tokens |
| Mistral / Mixtral | SentencePiece | BPE | 32,768 | None (SentencePiece) | Compact vocabulary, efficient for European languages |
Tokenization research remains active, with several notable developments in the 2025 to 2026 timeframe.
Vocabulary sizes have grown roughly 8x in just two years, from 32K (LLaMA 2, 2023) to 262K (Gemini 3, 2025). This expansion is driven by the need for better multilingual support and the finding that the computational savings from shorter sequences generally outweigh the overhead of larger embedding tables.
Researchers have also proposed refinements to standard BPE, from alternative merge-selection criteria to better handling of digit and whitespace sequences, where naive frequency-based merges produce awkward tokens.
A growing line of research questions whether fixed tokenizers are necessary at all. Byte- and patch-level architectures such as Meta's Byte Latent Transformer operate directly on raw bytes, learning to group them dynamically rather than relying on a precomputed vocabulary. These approaches promise to eliminate the fairness issues inherent in fixed vocabularies and remove the need for a separate tokenizer training step.
ADAT (NeurIPS 2024) introduced the idea of iteratively refining the vocabulary based on model feedback during training, rather than fixing it beforehand. This dynamic approach allows the tokenizer to adapt to the specific data distribution the model is learning from.
There is growing interest in training specialized tokenizers for specific domains. Code-focused models, biomedical NLP systems, and legal text processors can benefit from tokenizers trained on domain-specific corpora, where the standard web-text-trained vocabulary may be inefficient.
Several tools exist for counting tokens before sending text to an API:
- OpenAI's tiktoken: `pip install tiktoken`, then use `tiktoken.encoding_for_model("gpt-4o")` to get the right encoding for a specific model.
- Any tokenizer loaded through Hugging Face's `transformers` library can count tokens via `tokenizer.encode(text)`.

When training a custom tokenizer, the vocabulary size is a key decision. A common starting point for English-only models is 32K to 50K tokens. Multilingual models typically need 100K or more. The right size depends on the training data distribution, target languages, and computational budget. Larger vocabularies improve tokenization efficiency but increase model parameter count.
A language model can only be used with the tokenizer it was trained with. Swapping in a different tokenizer at inference time will produce nonsensical results because the token IDs will map to the wrong embeddings. If you fine-tune a model, you generally keep the same tokenizer unless you are adding new tokens (e.g., for a specialized domain), in which case the embedding matrix must be resized accordingly.
Tokenization has evolved significantly alongside the field of NLP, moving from simple word- and character-level schemes to the subword methods that dominate today, and it continues to evolve toward larger vocabularies and tokenizer-free architectures.