Byte-Pair Encoding
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v7 ยท 4,995 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v7 ยท 4,995 words
Add missing citations, update stale details, or suggest a clearer explanation.
Byte-pair encoding (BPE) is a subword tokenization algorithm that has become the dominant method for splitting text into tokens in modern large language models. Originally introduced as a data compression technique by Philip Gage in 1994,[1] BPE was adapted for natural language processing (NLP) by Rico Sennrich, Barry Haddow, and Alexandra Birch in 2015 for neural machine translation.[2] Today, BPE and its close variants power the tokenizers behind the GPT family (via tiktoken), LLaMA, Mistral, Gemma, Claude, and most other transformer-based language models.
The core idea behind BPE is simple: start with individual characters (or bytes) and iteratively merge the most frequent adjacent pair into a new token. After thousands or hundreds of thousands of merges, the resulting vocabulary captures common words as single tokens while still being able to represent rare or unseen words as sequences of smaller subword units. This balance between vocabulary compactness and coverage is what makes BPE so effective for neural language modeling.[3]
BPE sits between two extremes. Character-level tokenization produces very long sequences but never encounters out-of-vocabulary (OOV) words. Word-level tokenization produces compact sequences but explodes the vocabulary and cannot handle unseen words. Subword tokenization with BPE strikes a middle ground by learning a vocabulary of variable-length pieces directly from data, with frequencies driving the granularity.
Philip Gage published "A New Algorithm for Data Compression" in the February 1994 issue of C Users Journal.[1] The algorithm he described was straightforward: scan a block of data for the most frequently occurring pair of adjacent bytes, replace every occurrence of that pair with a single unused byte value, and record the substitution in a lookup table. Repeat this process until no byte pair occurs frequently enough to justify replacement, or until all 256 byte values are in use.
The original BPE algorithm operated on raw byte sequences and was designed purely for lossless data compression. Decompression reversed the substitutions using the lookup table. Gage reported compression ratios competitive with Lempel-Ziv-Welch (LZW) but with simpler implementation. The compression variant never gained widespread adoption because gzip and bzip2 offered better ratios, but the core principle of iteratively merging frequent pairs would prove transformative when applied to a different problem two decades later.
In August 2015, Rico Sennrich, Barry Haddow, and Alexandra Birch of the University of Edinburgh posted a paper to arXiv titled "Neural Machine Translation of Rare Words with Subword Units" (arXiv:1508.07909),[2] which was later presented at the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016) in Berlin. This paper adapted the BPE algorithm from a data compression tool into a subword segmentation method for neural machine translation (NMT).
The motivation was practical. Neural machine translation systems at the time operated over fixed vocabularies, typically containing 30,000 to 50,000 of the most common words. Any word outside this vocabulary was mapped to a special unknown token (<UNK>), which the model could not translate. This was especially problematic for morphologically rich languages like German, where compound words and inflected forms create an enormous number of distinct word types.
Sennrich, Haddow, and Birch recognized that many translation problems decompose into subword-level operations: named entities transliterate character by character; compound words translate compositionally; cognates and loanwords share subword structure across languages. BPE provided a principled way to learn these subword units automatically from data. On WMT 2015 English-to-German and English-to-Russian, BPE-based subword models improved BLEU scores by up to 1.3 points over back-off dictionary baselines while eliminating the need for <UNK> entirely.[2] The reference implementation, subword-nmt,[4] was widely adopted in the NMT community and set the stage for tokenizer choices in later language models.
The BPE training algorithm builds a vocabulary of subword tokens by learning merge rules from a training corpus.
Training begins by initializing a base vocabulary. Two common choices are:
A pre-tokenization step typically splits the input into word-like chunks (by whitespace, punctuation, and digit boundaries) before BPE merges are applied. This prevents the algorithm from learning merges that span across natural boundaries (such as a token combining a letter and a punctuation mark).
After initialization, BPE repeatedly:
The output of training is two artifacts: the vocabulary (the set of all tokens) and the merges list (the ordered list of merge rules, which is what makes tokenization deterministic at inference time).[6]
Training terminates after a predetermined number of merge operations, set so that the final vocabulary reaches a target size. The final vocabulary size equals the number of initial tokens plus the number of merge operations. For example, GPT-2 performed 50,000 merges on top of a 256-byte base vocabulary, yielding a vocabulary of 50,257 tokens (including one special end-of-text token).[5]
Once the merge rules have been learned, tokenizing new text follows a deterministic process:
Given the same input and the same merge table, BPE always produces the same tokenization. The order of merge rules matters: a merge learned earlier in training takes priority over one learned later.[6]
Suppose the training corpus contains the following words with their frequencies:
| Word | Frequency |
|---|---|
| hug | 10 |
| pug | 5 |
| pun | 12 |
| bun | 4 |
| hugs | 5 |
The initial vocabulary consists of all unique characters: {b, g, h, n, p, s, u}. Each word is represented as a sequence of characters:
("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)
Count every adjacent pair, weighted by word frequency:
| Pair | Frequency |
|---|---|
| (h, u) | 15 |
| (u, g) | 20 |
| (p, u) | 17 |
| (u, n) | 16 |
| (b, u) | 4 |
| (g, s) | 5 |
The pair (u, g) has the highest frequency (20), so it merges into the new token ug. The vocabulary becomes {b, g, h, n, p, s, u, ug}, and the words update:
("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)
Recount and merge the next most frequent pair. (u, n) now has frequency 16, so it merges into un:
("h" "ug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("h" "ug" "s", 5)
The next merge might combine (h, ug) into hug (frequency 15), then (hug, s) into hugs (frequency 5), and so on. Each merge adds one new token to the vocabulary, and the recorded merges list (u+g, u+n, h+ug, ...) is what is later applied at inference time.
A significant refinement was introduced in the GPT-2 paper, "Language Models are Unsupervised Multitask Learners" (Radford et al., 2019).[5] Standard character-level BPE starts with a base vocabulary of Unicode characters, which can be very large depending on the training data. If the training data includes Chinese, Japanese, Korean, Arabic, and emoji, the base character set alone could contain tens of thousands of symbols.
Byte-level BPE solves this by operating on raw UTF-8 bytes instead of Unicode characters. The base vocabulary is always exactly 256 entries (one per possible byte value), regardless of what languages or scripts appear in the data. Any text encoded in UTF-8 can be tokenized without ever producing an unknown token, because every byte is in the base vocabulary by construction.[5]
The GPT-2 implementation also introduced an important preprocessing step: a regex-based pre-tokenizer that splits input text into chunks before BPE merges are applied. The regex pattern ensures that merges never cross certain boundaries (letters, digits, and punctuation are kept in separate groups). This pre-tokenization pattern was refined for GPT-4 (the cl100k_base encoding), for example, it became case-insensitive for English contractions like 's and 't, treated runs of two or more digits specially, and grouped runs of whitespace into single tokens to densify Python indentation.[7]
Byte-level BPE has become the standard for modern language models: it guarantees complete coverage of any input text, keeps the base vocabulary small, and lets the merge rules handle the task of discovering meaningful subword units.
WordPiece is a related but distinct subword tokenization algorithm first described by Mike Schuster and Kaisuke Nakajima in 2012 for Japanese and Korean voice search at Google.[8] It is most widely known as the tokenizer for BERT and related models such as DistilBERT, MobileBERT, and ELECTRA.[9]
WordPiece and BPE are structurally similar: both start with individual characters and iteratively merge pairs to build a vocabulary. The critical difference is in how they select which pair to merge. BPE merges whichever pair occurs most frequently in the training data. WordPiece instead merges the pair that maximizes the likelihood of the training data under a unigram language model, which in practice means scoring each candidate pair as:
score(x, y) = frequency(xy) / (frequency(x) * frequency(y))
This score measures how much more often two tokens appear together than would be expected if they were independent. At inference time, WordPiece also differs: while BPE applies merge rules in order, WordPiece uses a longest-match-first (maximum matching) strategy that scans the input word from left to right and greedily matches the longest token in the vocabulary. WordPiece also prefixes continuation subwords with ## to indicate that they are not the start of a new word (e.g., "tokenization" might split into ["token", "##ization"]).[10]
The Unigram language model tokenizer, proposed by Taku Kudo in 2018 in "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates" (arXiv:1804.10959),[11] takes a fundamentally different approach. Where BPE builds its vocabulary bottom-up by starting with characters and merging pairs, Unigram works top-down. It begins with a large initial vocabulary of candidate subword tokens (often all substrings up to a certain length, plus all characters). It then iteratively removes the tokens whose removal causes the smallest increase in the overall negative log-likelihood of the training data under a unigram language model. This pruning continues until the vocabulary reaches a target size.
Because Unigram assigns a probability to each token in the vocabulary, it can produce multiple valid segmentations for the same input string and select the one with the highest probability. During training it can even sample different segmentations stochastically, which acts as a form of data augmentation (subword regularization). BPE, by contrast, always produces a single deterministic segmentation. Unigram is the default algorithm used by Google's T5 and is also offered by SentencePiece alongside BPE.[12]
| Tool | Language | Description | Source |
|---|---|---|---|
subword-nmt | Python | Original BPE-for-NLP implementation by Sennrich et al. | github.com/rsennrich/subword-nmt |
| tiktoken | Rust + Python | OpenAI's fast byte-level BPE tokenizer for GPT-2/3/4/4o | github.com/openai/tiktoken |
| SentencePiece | C++ | Google's language-independent tokenizer supporting BPE and Unigram | github.com/google/sentencepiece |
| Hugging Face tokenizers | Rust + Python | Multi-algorithm fast tokenizer library (BPE, WordPiece, Unigram) | github.com/huggingface/tokenizers |
minbpe | Python | Andrej Karpathy's minimal educational BPE implementation | github.com/karpathy/minbpe |
tiktoken is OpenAI's open-source BPE tokenizer library, written in Rust with Python bindings. Released in late 2022, tiktoken is the official tokenizer for OpenAI's API models.[13] Key characteristics include:
tiktoken._educational) that demonstrates BPE training on small text samples.The encoding schemes used by tiktoken correspond to different model generations:
| Encoding | Vocabulary size | Models |
|---|---|---|
r50k_base | ~50,257 | GPT-2, early GPT-3 |
p50k_base | ~50,281 | text-davinci-002, text-davinci-003 |
cl100k_base | ~100,256 | GPT-3.5-turbo, GPT-4 |
o200k_base | ~200,000 | GPT-4o, o1, o3 |
The progression from 50K to 200K tokens reflects larger vocabularies encoding text more efficiently, reducing inference cost and latency, especially for non-English text and code.
SentencePiece is a tokenization library by Taku Kudo and John Richardson (Google) that provides a unified framework for BPE and Unigram tokenization (arXiv:1808.06226).[12] Its key innovation is treating input text as a raw stream of characters (or bytes) without assuming whitespace-based word boundaries: whitespace is converted into a special Unicode character (โ) and joins the vocabulary. This makes SentencePiece language-independent and effective for Chinese, Japanese, and Thai. The library handles the full pipeline from raw text to token IDs and back, making tokenization fully reversible. It was used by LLaMA 1, LLaMA 2, and Gemma (BPE mode) and by T5 (Unigram mode).
The tokenizers library from Hugging Face is a Rust-backed Python library that supports BPE, WordPiece, and Unigram in a unified API. It is the tokenizer backend used by most models in the transformers library and includes utilities for training, saving, loading, and converting tokenizers between formats.[14]
minbpe, by Andrej Karpathy, is a minimal, educational implementation of BPE in pure Python.[15] It accompanies his February 20, 2024 lecture "Let's build the GPT Tokenizer" (a 2h13m YouTube video) in which Karpathy builds BPE from scratch and replicates the GPT-4 regex pre-tokenizer.[16] The repo includes a basic BPE class, a regex-based class mirroring GPT-4's pattern, and a GPT4Tokenizer that loads tiktoken's published merges, all in under a few hundred lines of Python.
Nearly every prominent large language model uses some form of BPE-family tokenization, though the specific implementations differ.
| Model | Tokenizer | Vocabulary size | Year | Notes |
|---|---|---|---|---|
| BERT | WordPiece | 30,522 | 2018 | English uncased base |
| GPT-2 | Byte-level BPE | 50,257 | 2019 | First major byte-level BPE |
| GPT-3 | Byte-level BPE | 50,257 | 2020 | Same tokenizer as GPT-2 |
| T5 | SentencePiece Unigram | 32,000 | 2019 | Unigram, not BPE |
| LLaMA 1 | SentencePiece BPE | 32,000 | 2023 | Character-level base |
| LLaMA 2 | SentencePiece BPE | 32,000 | 2023 | Same tokenizer as LLaMA 1 |
| Mistral 7B | Byte-fallback BPE | 32,000 | 2023 | SentencePiece with byte fallback |
| Mistral 7B v0.3 | Byte-fallback BPE | 32,768 | 2024 | Extended vocab |
| GPT-3.5-turbo | Byte-level BPE (cl100k_base) | ~100,256 | 2023 | tiktoken |
| GPT-4 | Byte-level BPE (cl100k_base) | ~100,256 | 2023 | Same as GPT-3.5-turbo |
| Mistral Nemo (Tekken) | tiktoken-style byte-level BPE | 131,072 | 2024 | Tekken; trained on 100+ languages |
| LLaMA 3 | tiktoken byte-level BPE | 128,256 | 2024 | Move away from SentencePiece |
| Claude | BPE (private) | ~65K (estimated) | 2023+ | Anthropic has not released full spec |
| GPT-4o | Byte-level BPE (o200k_base) | ~200,000 | 2024 | Improved multilingual coverage |
| Gemma | SentencePiece BPE | 256,128 | 2024 | Subset of Gemini's vocab |
The GPT series has used byte-level BPE since GPT-2.[5] GPT-2 and GPT-3 shared the same tokenizer with 50,257 tokens. GPT-3.5 and GPT-4 moved to the cl100k_base encoding with roughly 100,000 tokens, which improved handling of code, whitespace, and non-English text. GPT-4o introduced the o200k_base encoding with approximately 200,000 tokens, further improving multilingual efficiency. The pre-tokenization regex pattern was updated between GPT-2 and GPT-4 to better handle whitespace runs and digit sequences.[7]
BERT uses WordPiece, not BPE, but the algorithms are close cousins.[9][10] BERT-base has a vocabulary of 30,522 WordPiece tokens with continuation pieces marked by ##. Many later "BERT-style" encoder models (DistilBERT, RoBERTa, ELECTRA) either reuse this vocabulary or use byte-level BPE in BERT's place.
LLaMA 1 and LLaMA 2 used SentencePiece BPE with a vocabulary of 32,000 tokens.[17] The tokenizer operated at the Unicode code point level rather than the byte level, with a byte-fallback escape for unknown characters. This worked well for English but sometimes produced suboptimal segmentations for Chinese, Japanese, and other non-Latin scripts. LLaMA 3 switched to a tiktoken-based byte-level BPE tokenizer with a vocabulary of 128,256 tokens, dramatically improving multilingual performance and reducing the average number of tokens needed to encode the same text.[18]
Mistral 7B originally used a byte-fallback BPE tokenizer via SentencePiece with a 32,000-token vocabulary.[19] The byte-fallback mechanism encodes any character missing from the vocabulary as individual bytes rather than <UNK>. Mistral 7B v0.3 expanded the vocabulary to 32,768. With Mistral Nemo and later models, Mistral introduced Tekken, a tiktoken-based byte-level BPE tokenizer with 131,072 tokens trained on 100+ languages, reportedly ~30% more efficient than the prior SentencePiece tokenizer on Chinese, French, German, Spanish, Russian, and source code.[19]
Gemma uses a SentencePiece BPE tokenizer with a vocabulary of 256,128 tokens, one of the largest among open-weight models.[20] This large vocabulary was derived as a subset of the tokenizer used by Google's proprietary Gemini models. The tokenizer splits digits into individual tokens and preserves whitespace explicitly, which helps with code generation and structured text.
Claude uses a byte-pair encoding tokenizer, but Anthropic has not publicly released a tokenizer file in the way OpenAI did with tiktoken. Reverse-engineering by third parties has estimated Claude's vocabulary at roughly 65,000 tokens, with a large fraction overlapping with GPT-4's cl100k_base merges; Anthropic released an official token-counting endpoint via its API in late 2024, but the underlying vocabulary is not redistributed.[21]
The choice of tokenizer and vocabulary size has direct consequences for model performance, training cost, and inference speed.
A larger vocabulary captures more common subwords and whole words as individual tokens, which means shorter token sequences for the same amount of text. Shorter sequences reduce the computational cost of self-attention, which scales quadratically with sequence length. However, a larger vocabulary also increases the size of the embedding matrix and the output projection layer, which adds parameters and memory.
In practice, modern models have trended toward larger vocabularies. The move from 50K (GPT-2) to 100K (GPT-4) to 200K (GPT-4o) tokens reflects empirical findings that the efficiency gains from shorter sequences outweigh the parameter cost of larger embedding tables, especially at the scale of models with tens or hundreds of billions of parameters.[7]
Numbers tokenize inconsistently. Under GPT-2's BPE, the number 1000 is a single token, while 1,000 becomes three tokens (1, ,, 000), and 10000 may split as 100 and 00. The string 0.99 typically splits into separate tokens for the digit and the decimal, with the result depending on surrounding context. This fragmentation makes arithmetic and numerical reasoning harder for language models, because the same quantity can have very different token representations depending on formatting. Newer tokenizers address this by splitting all digits into individual tokens (Gemma) or by using regex pre-tokenization rules that group digits in chunks of two or three (GPT-4 cl100k_base).[7]
BPE tokenizers are typically trained on English-heavy data, creating a disparity in how efficiently different languages are encoded, a phenomenon measured by token fertility: the average number of tokens produced per word. Common English words like "the" or "information" are usually single tokens. The same semantic content in morphologically rich or low-resource languages may require 3-10 times as many tokens, with consequences for cost (per-token API pricing), latency (one forward pass per generated token), context window capacity, and quality (less spare capacity for reasoning).
The primary remedies are training on more balanced multilingual data and increasing the vocabulary size. LLaMA 3's jump from 32,000 to 128,256 tokens, GPT-4o's o200k_base (~200K), Tekken (131K), and Gemma's 256K vocabulary all reflect this trajectory.[7][18][19][20]
A more efficient tokenizer produces fewer tokens for the same raw text. Because models are trained on a fixed token budget, this effectively allows the model to "see" more raw text. At inference time, fewer tokens means fewer forward passes per response, lowering latency and per-request cost.
Trailing whitespace changes tokenization. The string "once upon a " (with a trailing space) tokenizes differently from "once upon a time" because the space becomes a separate token. In the second case, the space is merged with "time" into a single token " time". This means adding or removing trailing whitespace in a prompt can shift the next-token distribution.
Different cases of the same word are also treated as completely separate tokens. In the GPT-2 tokenizer, "hello", "Hello", and "HELLO" are different tokens or sequences of tokens, and the model must learn separately that they refer to the same word.[16]
BPE has no native notion of edit distance or typo tolerance. A misspelled word ("teh" instead of "the") tokenizes into entirely different pieces, sometimes single characters, even though the intended meaning is identical. This can affect performance on user-generated text and is one reason why robust LLMs implicitly learn to "self-correct" typos at the token level. Similarly, the same substring may tokenize differently depending on its position within a word; word-initial, word-internal, and word-final positions can follow different merge paths.
Because of the digit and whitespace quirks above, BPE tokenizers historically struggle with arithmetic and code. In GPT-2's tokenizer, each indentation space was a separate token, which meant a four-space Python indent consumed four tokens. Later tokenizers (GPT-4 cl100k_base, Tekken, Gemma) introduced whitespace-run merges and digit-grouping rules to address this, substantially improving code-generation efficiency.[7][16][19]
In January 2023, AI safety researchers Jessica Rumbelow and Matthew Watkins (then in SERI-MATS) discovered a class of anomalous tokens in GPT-2 and GPT-3 that produced bizarre behavior when included in prompts. They published their findings in a now-famous Alignment Forum / LessWrong post, "SolidGoldMagikarp (plus, prompt generation)".[22] The tokens included strings like SolidGoldMagikarp, TheNitromeFan, cloneembedreportprint, guiActiveUn, and Smartstocks. Asked to repeat SolidGoldMagikarp, GPT-3 returned distribute; TheNitromeFan came back as 182; guiActiveUn as reception. The cause was a mismatch between the tokenizer and the training data: these strings were present in the BPE merges (originating from web-crawl data that included a Reddit dataset where these usernames or strings were common) but appeared rarely or not at all in the model's training corpus, leaving the model with untrained embeddings for those vocabulary entries. The episode became the standard illustration that tokenizer and model are trained on different data and that mismatches can produce hard-to-debug failure modes, later generalized as "glitch tokens" across many models.[22]
| Approach | Vocabulary | Sequence length | OOV behavior | Multilingual |
|---|---|---|---|---|
| Character-level | Hundreds to thousands (Unicode) or 256 (bytes) | Very long | None: every character is in vocab | Excellent (esp. byte-level) |
| Word-level | Hundreds of thousands; closed | Short | <UNK> for any unseen word | Poor; needs separate vocab per language |
| BPE / WordPiece / Unigram subword | Typically 30K-256K | Medium | None: decomposes into known subpieces | Good, scales with vocab size |
Character-level and byte-level models avoid tokenization artifacts entirely but produce sequences 4-8 times longer than subword sequences, multiplying the cost of self-attention. Word-level models produce the shortest sequences but cannot handle rare or novel words and require enormous closed vocabularies. Subword BPE is the practical sweet spot: it handles arbitrary text robustly, keeps sequences reasonably short, and adapts naturally to the statistics of the training data.
| Feature | BPE | WordPiece |
|---|---|---|
| Merge criterion | Most frequent pair | Pair maximizing data likelihood |
| Inference strategy | Apply merge rules in order | Longest match first |
| Continuation marker | Often none (GPT-style); sometimes a leading โ (SentencePiece) | ## prefix on continuation tokens |
| Primary models | GPT family, LLaMA, Mistral, Claude | BERT, DistilBERT, ELECTRA |
| Determinism | Yes | Yes |
| Feature | BPE | Unigram |
|---|---|---|
| Vocabulary construction | Bottom-up (merge pairs) | Top-down (prune tokens) |
| Segmentation | Deterministic | Probabilistic (chooses highest-likelihood) |
| Subword regularization | Not natively (see BPE-Dropout) | Built-in (subword sampling) |
| Primary models | GPT family, LLaMA, Mistral, Claude | T5, ALBERT |
| Library default | tiktoken, sentencepiece (BPE mode) | sentencepiece (Unigram mode) |
One limitation of standard BPE is its determinism: the same word always receives the same segmentation. BPE-Dropout, proposed by Provilkov, Emelianenko, and Voita at ACL 2020 (arXiv:1910.13267),[23] addresses this by randomly skipping some merge operations during training with probability p (typically 0.1). This produces different segmentations of the same word across different training epochs, forcing the model to learn from multiple subword decompositions. BPE-Dropout improved translation quality by up to 2.3 BLEU points over standard BPE and up to 0.9 BLEU points over Unigram-based subword regularization. At inference time, standard BPE (without dropout) is used, so tokenization remains deterministic. SentencePiece includes support for BPE-Dropout as a training option.
A 2023 paper by Zouhar, Meister, et al., "A Formal Perspective on Byte-Pair Encoding" (arXiv:2306.16837),[24] provided theoretical foundations for BPE's behavior, including its compression properties and the relationship between training corpus and learned vocabulary. While BPE is a greedy algorithm and does not guarantee globally optimal tokenizations, it performs well in practice because natural language exhibits strong local regularities that greedy merging can exploit.
Active research areas in tokenization include vocabulary-free and byte-level models (ByT5, CANINE, and more recent byte-level language models that skip tokenization but produce much longer sequences); adaptive tokenization that adjusts dynamically per input to reduce the multilingual fertility gap; learned pre-tokenization (e.g., "Boundless BPE") that removes the dependence on hand-crafted regex patterns; and token fusion / supertoken methods that add a small number of composite tokens to an existing vocabulary to achieve compression rates surpassing much larger vocabularies.