Byte pair encoding (BPE) is a subword tokenization algorithm that has become the dominant method for splitting text into tokens in modern large language models. Originally introduced as a data compression technique by Philip Gage in 1994, BPE was adapted for natural language processing (NLP) by Rico Sennrich, Barry Haddow, and Alexandra Birch in 2015. Today, BPE and its variants power the tokenizers behind the GPT family, LLaMA, Mistral, Falcon, and most other transformer-based language models.
The core idea behind BPE is simple: start with individual characters (or bytes) and iteratively merge the most frequent adjacent pair into a new token. After thousands of merges, the resulting vocabulary captures common words as single tokens while still being able to represent rare or unseen words as sequences of smaller subword units. This balance between vocabulary compactness and coverage is what makes BPE so effective for neural language modeling.
Philip Gage published "A New Algorithm for Data Compression" in the February 1994 issue of C Users Journal. The algorithm he described was straightforward: scan a block of data for the most frequently occurring pair of adjacent bytes, replace every occurrence of that pair with a single unused byte value, and record the substitution in a lookup table. Repeat this process until no byte pair occurs frequently enough to justify replacement, or until all 256 byte values are in use.
The original BPE algorithm operated on raw byte sequences and was designed purely for lossless data compression. Decompression required reversing the substitutions using the lookup table. Gage reported that BPE achieved compression ratios competitive with the Lempel-Ziv-Welch (LZW) algorithm while being simpler to implement.
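The inner loop Gage described can be sketched in a few lines of Python. This is an illustrative, unoptimized sketch of one substitution pass, not Gage's original C implementation:

```python
from collections import Counter

def compress_step(data: bytes):
    """One pass of Gage-style byte pair encoding: replace the most frequent
    adjacent byte pair with a single unused byte value."""
    pairs = Counter(zip(data, data[1:]))
    if not pairs:
        return data, None
    (a, b), count = pairs.most_common(1)[0]
    used = set(data)
    unused = next((v for v in range(256) if v not in used), None)
    if unused is None or count < 2:
        return data, None  # nothing worth replacing
    out = bytearray()
    i = 0
    while i < len(data):
        if i + 1 < len(data) and data[i] == a and data[i + 1] == b:
            out.append(unused)  # substitute the pair with the spare byte
            i += 2
        else:
            out.append(data[i])
            i += 1
    return bytes(out), (unused, (a, b))

data = b"ababab_abab"
compressed, sub = compress_step(data)
print(len(data), "->", len(compressed))  # five 'ab' pairs collapse to one byte each
```

Decompression simply replays the recorded substitutions in reverse, expanding each spare byte back into its original pair.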
The compression variant of BPE never gained widespread adoption because more sophisticated algorithms (such as gzip and bzip2) offered better compression ratios. However, the core principle of iteratively merging frequent pairs would prove transformative when applied to a different problem two decades later.
In August 2015, Rico Sennrich, Barry Haddow, and Alexandra Birch posted a paper to arXiv titled "Neural Machine Translation of Rare Words with Subword Units," which was later presented at the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016) in Berlin. This paper adapted the BPE algorithm from a data compression tool into a subword segmentation method for neural machine translation (NMT).
The motivation was practical. Neural machine translation systems at the time operated over fixed vocabularies, typically containing 30,000 to 50,000 of the most common words. Any word outside this vocabulary was mapped to a special unknown token (<UNK>), which the model could not translate. This was especially problematic for morphologically rich languages like German, where compound words and inflected forms create an enormous number of distinct word types.
Sennrich, Haddow, and Birch recognized that many translation problems could be decomposed into subword-level operations. Named entities can be transliterated character by character. Compound words can be translated compositionally. Cognates and loanwords share subword structure across languages. BPE provided a principled way to learn these subword units automatically from data.
Their approach worked as follows: each word in the training corpus is split into a sequence of characters, with a special end-of-word symbol appended; the most frequent adjacent pair of symbols is merged into a new symbol; and the merge step is repeated until a fixed number of merge operations has been performed.
The number of merge operations became a hyperparameter controlling the vocabulary size. More merges produced larger vocabularies with longer tokens; fewer merges produced smaller vocabularies that split words more aggressively.
The results were strong. On the WMT 2015 English-to-German and English-to-Russian translation tasks, BPE-based subword models improved BLEU scores by up to 1.3 points over back-off dictionary baselines, while eliminating the need for the <UNK> token entirely. The reference implementation, subword-nmt, became widely adopted in the NMT community.
The BPE training process builds a vocabulary of subword tokens by learning merge rules from a training corpus. Here is a step-by-step walkthrough using a small example.
Suppose the training corpus contains the following words with their frequencies:
| Word | Frequency |
|---|---|
| hug | 10 |
| pug | 5 |
| pun | 12 |
| bun | 4 |
| hugs | 5 |
The initial vocabulary consists of all unique characters: {b, g, h, n, p, s, u}. Each word is represented as a sequence of characters:
("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)
Count every adjacent pair across all words, weighted by word frequency:
| Pair | Frequency |
|---|---|
| (h, u) | 15 |
| (u, g) | 20 |
| (p, u) | 17 |
| (u, n) | 16 |
| (b, u) | 4 |
| (g, s) | 5 |
The pair (u, g) has the highest frequency (20), so it is merged into the new token ug. The vocabulary becomes {b, g, h, n, p, s, u, ug}, and the word representations update:
("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
Recount pair frequencies and merge the next most frequent pair. The pair (u, n) now has frequency 16, so it merges into un. The vocabulary becomes {b, g, h, n, p, s, u, ug, un}:
("h" "ug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("h" "ug" "s", 5)
After recounting, the next merge combines (h, ug) into hug (frequency 15), then (p, un) into pun (frequency 12), and so on. Each merge adds one new token to the vocabulary.
Training stops after a predetermined number of merges. The final vocabulary size equals the number of initial characters plus the number of merge operations. For example, GPT-2 performed 50,000 merges on top of a 256-byte base vocabulary, yielding a vocabulary of 50,257 tokens (including one special end-of-text token).
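The training walkthrough above can be sketched in Python. This is an illustrative toy implementation of the merge loop, not any production tokenizer:

```python
from collections import Counter

def pair_counts(words):
    """Count adjacent symbol pairs across all words, weighted by frequency."""
    counts = Counter()
    for symbols, freq in words:
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def apply_merge(words, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged = pair[0] + pair[1]
    result = []
    for symbols, freq in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        result.append((out, freq))
    return result

# The toy corpus from the walkthrough: (symbol sequence, word frequency)
corpus = [(list("hug"), 10), (list("pug"), 5), (list("pun"), 12),
          (list("bun"), 4), (list("hugs"), 5)]
merges = []
for _ in range(4):
    counts = pair_counts(corpus)
    best = max(counts, key=counts.get)  # greedy: most frequent pair wins
    merges.append(best)
    corpus = apply_merge(corpus, best)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug'), ('p', 'un')]
```

Running this reproduces the merge order from the tables above: (u, g) first, then (u, n), then (h, ug).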
Once the merge rules have been learned during training, tokenizing new text at inference time follows a deterministic process: the input is split into individual characters (or bytes), and the merge rules are then applied, in the order they were learned, until no rule matches any remaining adjacent pair.
This process is greedy and deterministic. Given the same input and the same merge table, BPE always produces the same tokenization. The order of the merge rules matters: a merge learned earlier in training takes priority over one learned later.
For example, if the merge table contains (u, g) -> ug as rule 1 and (h, ug) -> hug as rule 2, then the word "hug" is tokenized by first merging "u" and "g" into "ug", then merging "h" and "ug" into "hug", producing a single token.
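A minimal sketch of this inference procedure, assuming the two-rule merge table from the example:

```python
def bpe_encode(word, merges):
    """Apply learned merge rules, in training order, to a single word."""
    symbols = list(word)
    for a, b in merges:  # earlier rules take priority over later ones
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the pair in place
            else:
                i += 1
    return symbols

merges = [("u", "g"), ("h", "ug")]
print(bpe_encode("hug", merges))   # ['hug']
print(bpe_encode("hugs", merges))  # ['hug', 's']
print(bpe_encode("bug", merges))   # ['b', 'ug']
```

Note that an unseen word like "bug" still tokenizes cleanly into known subword units, which is exactly the coverage property that motivated subword tokenization.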
A significant refinement to BPE was introduced in the GPT-2 paper, "Language Models are Unsupervised Multitask Learners" (Radford et al., 2019). Standard character-level BPE starts with a base vocabulary of Unicode characters, which can be very large depending on the training data. If the training data includes Chinese, Japanese, Korean, Arabic, and emoji, the base character set alone could contain tens of thousands of symbols.
Byte-level BPE solves this by operating on raw UTF-8 bytes instead of Unicode characters. The base vocabulary is always exactly 256 entries (one per possible byte value), regardless of what languages or scripts appear in the data. Any text encoded in UTF-8 can be tokenized without ever producing an unknown token.
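A quick way to see why byte-level BPE never needs an unknown token: every UTF-8 string decomposes into values in the range 0 to 255, which is exactly the base vocabulary. The string below is just an example:

```python
# Byte-level BPE's base vocabulary is the 256 possible byte values, so any
# UTF-8 string is representable before a single merge is applied.
text = "café 🙂"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)
# 'é' occupies two bytes and the emoji four, but every value fits in 0-255.
```

The accented character and the emoji expand into multi-byte sequences, yet nothing falls outside the 256-entry base vocabulary; merges learned during training then recombine frequent byte sequences into longer tokens.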
The GPT-2 implementation also introduced an important preprocessing step: a regex-based pre-tokenizer that splits input text into chunks before BPE merges are applied. The regex pattern ensures that merges never cross certain boundaries. For instance, letters, digits, and punctuation are kept in separate groups, preventing the tokenizer from creating tokens that combine a letter with a punctuation mark. This pre-tokenization pattern was refined for GPT-4 but the fundamental approach remains the same.
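The effect of such a pre-tokenizer can be illustrated with a simplified, ASCII-only pattern. The real GPT-2 pattern uses Unicode character classes and contraction handling via the third-party `regex` module; this standard-library sketch only mimics the idea:

```python
import re

# Simplified GPT-2-style pre-tokenization: an optional leading space followed
# by a run of letters, digits, or punctuation, plus standalone whitespace runs.
pattern = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

chunks = pattern.findall("Hello, world! 42")
print(chunks)  # ['Hello', ',', ' world', '!', ' 42']
```

BPE merges are then applied within each chunk independently, so no learned token can span a letter/punctuation or letter/digit boundary.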
Byte-level BPE has become the standard approach for modern language models. It guarantees complete coverage of any input text, keeps the base vocabulary small, and lets the merge rules handle the task of discovering meaningful subword units.
tiktoken is OpenAI's open-source BPE tokenizer library, written in Rust with Python bindings. Released in late 2022, tiktoken is the official tokenizer for OpenAI's API models.
Key characteristics of tiktoken:

- A Rust core with Python bindings, making it significantly faster than comparable pure-Python BPE implementations.
- Built-in support for the encoding schemes used by OpenAI's models, loaded by name via tiktoken.get_encoding.
- An educational submodule (tiktoken._educational) that demonstrates BPE training on small text samples.

The encoding schemes used by tiktoken correspond to different model generations:
| Encoding | Vocabulary size | Models |
|---|---|---|
| r50k_base | ~50,257 | GPT-2, early GPT-3 |
| p50k_base | ~50,257 | text-davinci-002, text-davinci-003 |
| cl100k_base | ~100,256 | GPT-3.5-turbo, GPT-4 |
| o200k_base | ~200,000 | GPT-4o, o1, o3 |
The progression from 50K to 200K tokens reflects a broader trend: larger vocabularies encode text more efficiently (fewer tokens per word on average), which directly reduces inference cost and latency.
Nearly every prominent large language model uses some form of BPE tokenization, though the specific implementations differ.
The GPT series has used byte-level BPE since GPT-2. GPT-2 and GPT-3 shared the same tokenizer with 50,257 tokens. GPT-3.5 and GPT-4 moved to the cl100k_base encoding with roughly 100,000 tokens, which improved handling of code, whitespace, and non-English text. GPT-4o introduced the o200k_base encoding with approximately 200,000 tokens, further improving multilingual efficiency. The pre-tokenization regex pattern was updated between GPT-2 and GPT-4 to better handle whitespace runs and digit sequences.
LLaMA 1 and LLaMA 2 used SentencePiece-based BPE with a vocabulary of 32,000 tokens. The tokenizer operated at the Unicode code point level rather than the byte level. This worked well for English and other Latin-script languages but sometimes produced suboptimal segmentations for Chinese, Japanese, and other non-Latin scripts. LLaMA 3 switched to a tiktoken-based byte-level BPE tokenizer with a vocabulary of 128,256 tokens, dramatically improving multilingual performance and reducing the average number of tokens needed to encode the same text.
Mistral 7B uses a byte-fallback BPE tokenizer with a vocabulary of 32,000 tokens. The byte-fallback mechanism ensures that characters not present in the vocabulary are encoded as individual bytes rather than mapped to an unknown token. Later versions (Mistral v0.3) expanded the vocabulary slightly to 32,768 tokens. Mistral Nemo, a 12B parameter model, uses a larger vocabulary specifically designed for improved multilingual performance.
Gemma uses a SentencePiece BPE tokenizer with a vocabulary of 256,128 tokens, one of the largest among open-weight models. This large vocabulary was derived as a subset of the tokenizer used by Google's proprietary Gemini models. The tokenizer splits digits into individual tokens and preserves whitespace explicitly, which helps with code generation and text formatting tasks.
DeepSeek models use byte-level BPE. DeepSeek-V2 had a vocabulary of approximately 100,000 tokens. DeepSeek-V3 expanded this to 128,000 tokens, with an improved pre-tokenizer that better handles punctuation and line breaks.
The following table summarizes the tokenizer type and vocabulary size for major language models:
| Model | Tokenizer type | Vocabulary size | Year |
|---|---|---|---|
| BERT | WordPiece | 30,522 | 2018 |
| GPT-2 | Byte-level BPE | 50,257 | 2019 |
| GPT-3 | Byte-level BPE | 50,257 | 2020 |
| T5 | SentencePiece (Unigram) | 32,000 | 2019 |
| LLaMA 1 | SentencePiece BPE | 32,000 | 2023 |
| LLaMA 2 | SentencePiece BPE | 32,000 | 2023 |
| Mistral 7B | Byte-fallback BPE | 32,000 | 2023 |
| Falcon | BPE | 65,024 | 2023 |
| GPT-3.5-turbo | Byte-level BPE | ~100,256 | 2023 |
| GPT-4 | Byte-level BPE | ~100,256 | 2023 |
| DeepSeek-V2 | Byte-level BPE | ~100,000 | 2024 |
| LLaMA 3 | Byte-level BPE (tiktoken) | 128,256 | 2024 |
| DeepSeek-V3 | Byte-level BPE | 128,000 | 2024 |
| GPT-4o | Byte-level BPE | ~200,000 | 2024 |
| Gemma | SentencePiece BPE | 256,128 | 2024 |
The trend toward larger vocabularies is driven by the need for better multilingual support and more efficient encoding. A larger vocabulary means that common words and subwords across more languages are captured as single tokens, reducing the total number of tokens needed for a given piece of text.
WordPiece is a subword tokenization algorithm first described by Mike Schuster and Kaisuke Nakajima in 2012 for Japanese and Korean voice search at Google. It is most widely known as the tokenizer for BERT and related models such as DistilBERT, MobileBERT, and ELECTRA.
WordPiece and BPE are structurally similar: both start with individual characters and iteratively merge pairs to build a vocabulary. The critical difference is in how they select which pair to merge.
BPE merges whichever pair occurs most frequently in the training data. WordPiece merges the pair that maximizes the likelihood of the training data under a language model. In practice, this means WordPiece computes a score for each candidate pair:
score(x, y) = frequency(xy) / (frequency(x) * frequency(y))
This score measures how much more often two tokens appear together than would be expected if they were independent. A high score indicates that the pair co-occurs far more than chance would predict, suggesting it forms a meaningful unit.
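A small numerical illustration of this score, using made-up counts: a pair whose parts rarely occur apart beats a more frequent pair whose parts also appear independently.

```python
def wordpiece_score(freq_xy, freq_x, freq_y):
    """How much more often x and y co-occur than independence would predict."""
    return freq_xy / (freq_x * freq_y)

# Hypothetical counts for illustration (not from any real corpus):
tight = wordpiece_score(10, 11, 10)    # parts almost always occur together
loose = wordpiece_score(50, 500, 400)  # frequent parts, weak association
print(tight, loose)  # the tightly bound pair scores far higher
```

Under pure BPE the second pair would win (50 occurrences versus 10); under the WordPiece criterion the first pair wins because its co-occurrence is far above chance.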
At inference time, WordPiece also differs from BPE. While BPE applies merge rules in order, WordPiece uses a longest-match-first (maximum matching) strategy: it scans the input word from left to right and greedily matches the longest token in the vocabulary.
Another difference is notation. WordPiece prefixes continuation subwords with ## to indicate that they are not the start of a new word. For example, "tokenization" might be split into ["token", "##ization"].
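A minimal sketch of longest-match-first inference with `##` continuation markers. The vocabulary here is a toy example, not BERT's:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation with ## continuation marks."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            # Continuation pieces carry the ## prefix when looked up.
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1  # shrink the candidate and try again
        else:
            return ["[UNK]"]  # no subword in the vocabulary matched
        start = end
    return tokens

vocab = {"token", "##ization", "##ize", "##s"}
print(wordpiece_tokenize("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_tokenize("tokens", vocab))        # ['token', '##s']
```

Unlike BPE, no merge table is consulted at inference time; only vocabulary membership and match length matter.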
| Feature | BPE | WordPiece |
|---|---|---|
| Merge criterion | Most frequent pair | Pair maximizing likelihood |
| Inference strategy | Apply merge rules in order | Longest match first |
| Subword marker | Word-boundary/space marker on word-initial tokens (varies by implementation) | ## prefix on continuation tokens |
| Primary models | GPT family, LLaMA, Mistral | BERT, DistilBERT, ELECTRA |
| Deterministic | Yes | Yes |
The Unigram language model tokenizer, proposed by Taku Kudo in 2018 in the paper "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates," takes a fundamentally different approach from BPE.
Where BPE builds its vocabulary bottom-up by starting with characters and merging pairs, Unigram works top-down. It begins with a large initial vocabulary of candidate subword tokens (often all substrings up to a certain length, plus all characters). It then iteratively removes the tokens whose removal causes the smallest increase in the overall loss (negative log-likelihood) of the training data under a unigram language model. This pruning continues until the vocabulary reaches a target size.
Because Unigram assigns a probability to each token in the vocabulary, it can produce multiple valid segmentations for the same input string and select the one with the highest probability. During training, it can even sample different segmentations stochastically, which acts as a form of data augmentation. BPE, by contrast, always produces a single deterministic segmentation.
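The highest-probability segmentation can be found with a simple Viterbi search over all ways to split the word. The following is an illustrative sketch with made-up log-probabilities, not Kudo's implementation:

```python
import math

def unigram_segment(word, logp):
    """Viterbi search for the most probable segmentation of `word` under a
    unigram model (token -> log-probability)."""
    n = len(word)
    # best[i] = (best log-prob of word[:i], start index of the last token)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp and best[start][0] + logp[piece] > best[end][0]:
                best[end] = (best[start][0] + logp[piece], start)
    # Backtrack from the end to recover the winning token sequence.
    tokens, pos = [], n
    while pos > 0:
        start = best[pos][1]
        tokens.append(word[start:pos])
        pos = start
    return tokens[::-1]

# Hypothetical vocabulary with log-probabilities (illustrative values only):
logp = {"un": -2.0, "i": -3.0, "gram": -2.5, "u": -4.0, "n": -4.0}
print(unigram_segment("unigram", logp))  # ['un', 'i', 'gram']
```

Because each candidate segmentation has a well-defined probability, the same machinery also supports sampling alternative segmentations for subword regularization.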
SentencePiece is a tokenization library, also developed by Taku Kudo (with John Richardson), that provides a unified framework for both BPE and Unigram tokenization. The key innovation of SentencePiece is that it treats the input text as a raw stream of characters (or bytes) without assuming any whitespace-based word boundaries. Whitespace is not discarded but is instead converted into a special Unicode character (U+2581, rendered as ▁), which becomes part of the token vocabulary. This design makes SentencePiece language-independent and particularly effective for languages like Chinese, Japanese, and Thai that do not use spaces between words.
SentencePiece also handles the full pipeline from raw text to token IDs and back, making tokenization fully reversible without any external preprocessing.
| Feature | BPE | Unigram | SentencePiece |
|---|---|---|---|
| Vocabulary construction | Bottom-up (merge pairs) | Top-down (prune tokens) | Framework for BPE or Unigram |
| Segmentation | Deterministic | Probabilistic | Depends on algorithm |
| Whitespace handling | Pre-tokenization required | Pre-tokenization required | Treats whitespace as a token |
| Subword sampling | Not natively supported | Supported | Supported |
| Language independence | Requires pre-tokenizer | Requires pre-tokenizer | Yes (language-independent) |
| Used by | GPT family, LLaMA 3 | T5, ALBERT | LLaMA 1/2, Gemma, many others |
One limitation of standard BPE is its deterministic nature: the same word always receives the same segmentation. In 2020, Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita proposed BPE-Dropout (published at ACL 2020), which addresses this by randomly skipping some merge operations during training. At each step of the merge process, each applicable merge rule is dropped with some probability p (typically 0.1). This produces different segmentations of the same word across different training epochs, forcing the model to learn from multiple subword decompositions.
BPE-Dropout improved translation quality by up to 2.3 BLEU points compared to standard BPE and up to 0.9 BLEU points compared to Unigram-based subword regularization. At inference time, standard BPE (without dropout) is used, so the tokenization remains deterministic. SentencePiece includes support for BPE-Dropout as a training option.
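The merge-skipping idea can be sketched by adding a dropout check to a standard BPE merge loop. This is an illustrative sketch; the merge table and dropout rate below are toy values:

```python
import random

def bpe_encode_dropout(word, merges, p=0.1, rng=random):
    """BPE inference where each applicable merge is skipped with probability
    p, in the spirit of BPE-Dropout (Provilkov et al., 2020)."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            # Skip an otherwise-applicable merge with probability p.
            if symbols[i] == a and symbols[i + 1] == b and rng.random() >= p:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

merges = [("u", "g"), ("h", "ug"), ("hug", "s")]
random.seed(0)
# Different calls can yield different segmentations of the same word.
print(bpe_encode_dropout("hugs", merges, p=0.5))
print(bpe_encode_dropout("hugs", merges, p=0.0))  # p=0 recovers standard BPE
```

With p=0 the function reduces to ordinary deterministic BPE, which is what is used at inference time.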
BPE tokenizers are typically trained on data that skews heavily toward English or other high-resource languages. This creates a disparity in how efficiently different languages are encoded, a phenomenon measured by token fertility: the average number of tokens produced per word in a given language.
For English, common words like "the," "and," or "information" are typically encoded as single tokens. For morphologically rich or low-resource languages, the same semantic content may require 3 to 10 times as many tokens. This disparity has several consequences: users of per-token-priced APIs pay more for the same content, a fixed context window holds less of their text, and generation requires more forward passes and is therefore slower.
The primary solution is to train the tokenizer on more balanced multilingual data and to increase the vocabulary size. LLaMA 3's expansion from 32,000 to 128,256 tokens was motivated in part by the need for better multilingual coverage. Similarly, GPT-4o's o200k_base encoding roughly doubles the vocabulary of its predecessor, providing better representation for non-English languages.
Measurements of tokenizers trained primarily on English data have found token fertility ratios of up to 10 to 15 times for the most disadvantaged languages. To maintain comparable fertility rates, vocabularies in multilingual setups are typically 3 to 8 times larger than in monolingual English models.
BPE tokenization introduces several well-documented artifacts and edge cases that can affect model behavior in unexpected ways.
Trailing whitespace changes tokenization. The string "once upon a " (with a trailing space) tokenizes differently from "once upon a time" because the space becomes a separate token. In the second case, the space is merged with "time" into a single token " time". This means that adding or removing trailing whitespace in a prompt can affect the probability distribution over the next token.
Numbers tokenize inconsistently. The number 1000 might be a single token, while 1,000 becomes three tokens (1, ,, 000), and 10000 might split into 100 and 00. This fragmentation makes arithmetic and numerical reasoning harder for language models because the same quantity can have very different token representations depending on formatting. Some newer tokenizers address this by splitting all digits into individual tokens.
Different cases of the same word are treated as completely separate tokens. In the GPT-2 tokenizer, "hello" maps to token 31373, "Hello" maps to token 15496, and "HELLO" is split into multiple tokens. The model must learn separately that these all refer to the same word.
The same substring may tokenize differently depending on its position within a word. BPE learns merge rules based on the contexts it observes during training, so word-initial, word-internal, and word-final substrings may follow different merge paths. For instance, "un" at the start of "unfortunately" might merge differently than "un" at the end of "bun".
Programming languages, especially whitespace-sensitive ones like Python, present challenges. The GPT-2 tokenizer encoded each indentation space as a separate token, which meant that a four-space indent consumed four tokens. Later tokenizers (cl100k_base and beyond) addressed this by allowing whitespace runs to be merged into single tokens.
Emoji sequences can explode into many tokens because each emoji is encoded as a multi-byte UTF-8 sequence, and complex emoji (such as those with skin tone modifiers or zero-width joiners) may consist of many code points. A single visual emoji character can require a dozen or more tokens.
The choice of tokenizer and vocabulary size has direct consequences for model performance, training cost, and inference speed.
A larger vocabulary captures more common subwords and whole words as individual tokens, which means shorter token sequences for the same amount of text. Shorter sequences reduce the computational cost of self-attention, which scales quadratically with sequence length. However, a larger vocabulary also increases the size of the embedding matrix and the output projection layer, which adds parameters and memory.
In practice, modern models have trended toward larger vocabularies. The move from 50K (GPT-2) to 100K (GPT-4) to 200K (GPT-4o) tokens reflects empirical findings that the efficiency gains from shorter sequences outweigh the parameter cost of larger embedding tables, especially at the scale of models with tens or hundreds of billions of parameters.
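The parameter cost of this trend is easy to estimate. The arithmetic below assumes a hypothetical model dimension of 8192 and tied input/output embedding matrices:

```python
# Back-of-envelope embedding cost of growing the vocabulary.
# d_model = 8192 is a hypothetical model dimension chosen for illustration.
d_model = 8192
embed_params = {v: v * d_model for v in (50_257, 100_256, 200_000)}
for vocab, params in embed_params.items():
    print(f"{vocab:>7} tokens -> {params / 1e9:.2f}B embedding parameters")
```

Even at a 200K vocabulary, the embedding table is under 2B parameters at this dimension, which is a small fraction of a model with tens or hundreds of billions of parameters, while the sequence-length savings apply to every attention layer.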
A more efficient tokenizer means that the same amount of raw text produces fewer tokens. Because language models are trained on a fixed token budget (a set number of tokens processed during training), a more efficient tokenizer effectively allows the model to "see" more raw text within that budget. This can improve downstream performance simply by increasing the effective data coverage.
Transformer-based models generate text one token at a time. Reducing the number of tokens needed to express a given piece of text directly reduces the number of forward passes required and thus the time and compute cost of generation. For API-based models where pricing is per-token, this also translates to lower costs for end users.
The regex pattern used for pre-tokenization (splitting text into chunks before applying BPE merges) also affects model quality. GPT-2's pattern separated letters, digits, and punctuation into distinct groups. GPT-4's pattern refined this further, particularly for handling whitespace runs and digit sequences more cleanly. Poor pre-tokenization can lead to tokens that cross natural boundaries (such as a token that includes both a letter and a punctuation mark), which can confuse the model.
Several open-source implementations of BPE tokenization are widely used:
| Tool | Language | Description |
|---|---|---|
| subword-nmt | Python | Original BPE for NLP implementation by Sennrich et al. |
| SentencePiece | C++ | Google's language-independent tokenizer supporting BPE and Unigram |
| tiktoken | Rust/Python | OpenAI's fast BPE tokenizer for GPT models |
| tokenizers | Rust/Python | Hugging Face's fast tokenizer library supporting BPE, WordPiece, and Unigram |
| minbpe | Python | Andrej Karpathy's minimal educational BPE implementation |
Andrej Karpathy's minbpe project is particularly notable as an educational resource. It provides clean, minimal implementations of both basic BPE and the GPT-4 regex-based BPE tokenizer in under 300 lines of Python, making the algorithm accessible for study and experimentation.
BPE has also been the subject of formal analysis. A 2023 paper by Zouhar and Meister, "A Formal Perspective on Byte-Pair Encoding," provided theoretical foundations for understanding BPE's behavior, including analysis of its compression properties and the relationship between the training corpus and the resulting vocabulary.
Additional work in 2024 examined BPE from an information-theoretic perspective, studying how well BPE vocabularies approximate optimal compression for natural language text. These analyses have shown that while BPE is a greedy algorithm and does not guarantee globally optimal tokenizations, it performs well in practice because natural language exhibits strong local regularities that greedy merging can exploit effectively.
Research continues on improving BPE and subword tokenization more broadly.