Byte-Pair Encoding

Large Language Models Machine Learning Natural Language Processing

26 min read

Updated Jun 21, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 21, 2026

Fact-checked

In review queue

Sources

24 citations

Revision

v10 · 5,288 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Byte-pair encoding (BPE) is a subword tokenization algorithm that splits text into tokens by starting from individual characters or bytes and iteratively merging the most frequent adjacent pair into a new token, and it is the dominant tokenization method behind modern large language models. Originally introduced as a data compression technique by Philip Gage in 1994,^[1] BPE was adapted for natural language processing (NLP) by Rico Sennrich, Barry Haddow, and Alexandra Birch in 2015 for neural machine translation.^[2] Today, BPE and its close variants power the tokenizers behind the GPT family (via tiktoken), LLaMA, Mistral, Gemma, Claude, and most other transformer-based language models.

The core idea behind BPE is simple: start with individual characters (or bytes) and iteratively merge the most frequent adjacent pair into a new token. After thousands or hundreds of thousands of merges, the resulting vocabulary captures common words as single tokens while still being able to represent rare or unseen words as sequences of smaller subword units. This balance between vocabulary compactness and coverage is what makes BPE so effective for neural language modeling.^[3]

BPE sits between two extremes. Character-level tokenization produces very long sequences but never encounters out-of-vocabulary (OOV) words. Word-level tokenization produces compact sequences but explodes the vocabulary and cannot handle unseen words. Subword tokenization with BPE strikes a middle ground by learning a vocabulary of variable-length pieces directly from data, with frequencies driving the granularity. As Sennrich, Haddow, and Birch put it in the paper that brought BPE to NLP, "Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem."^[2]

What problem does BPE solve?

The central challenge BPE addresses is the open-vocabulary problem: natural language contains an effectively unbounded set of word forms (names, compounds, inflections, typos, code, numbers), but neural networks operate over a fixed, finite vocabulary. A pure word-level vocabulary must either grow enormous or fall back to a single unknown token (<UNK>) for anything unseen, which discards information the model cannot recover. A pure character-level or byte-level vocabulary avoids unknowns entirely but stretches sequences to many times their length, and self-attention cost grows quadratically with sequence length. BPE resolves this tension by learning, directly from the training corpus, a vocabulary of variable-length subword pieces: frequent words become single tokens, while rare or novel strings decompose into known subword units, so no input is ever truly out of vocabulary.^[3]

History

Gage 1994: data compression origin

Philip Gage published "A New Algorithm for Data Compression" in the February 1994 issue of C Users Journal.^[1] The algorithm he described was straightforward: scan a block of data for the most frequently occurring pair of adjacent bytes, replace every occurrence of that pair with a single unused byte value, and record the substitution in a lookup table. Repeat this process until no byte pair occurs frequently enough to justify replacement, or until all 256 byte values are in use.

The original BPE algorithm operated on raw byte sequences and was designed purely for lossless data compression. Decompression reversed the substitutions using the lookup table. Gage reported compression ratios competitive with Lempel-Ziv-Welch (LZW) but with simpler implementation. The compression variant never gained widespread adoption because gzip and bzip2 offered better ratios, but the core principle of iteratively merging frequent pairs would prove transformative when applied to a different problem two decades later.

Sennrich, Haddow, and Birch 2016: NLP adoption

In August 2015 (the arXiv preprint was submitted on August 31, 2015), Rico Sennrich, Barry Haddow, and Alexandra Birch of the University of Edinburgh posted a paper to arXiv titled "Neural Machine Translation of Rare Words with Subword Units" (arXiv:1508.07909),^[2] which was later presented at the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016) in Berlin (pages 1715-1725). This paper adapted the BPE algorithm from a data compression tool into a subword segmentation method for neural machine translation (NMT).

The motivation was practical. Neural machine translation systems at the time operated over fixed vocabularies, typically containing 30,000 to 50,000 of the most common words. Any word outside this vocabulary was mapped to a special unknown token (<UNK>), which the model could not translate. This was especially problematic for morphologically rich languages like German, where compound words and inflected forms create an enormous number of distinct word types.

Sennrich, Haddow, and Birch recognized that many translation problems decompose into subword-level operations: named entities transliterate character by character; compound words translate compositionally; cognates and loanwords share subword structure across languages. BPE provided a principled way to learn these subword units automatically from data. On WMT 2015, BPE-based subword models improved BLEU by 1.1 points on English-to-German and 1.3 points on English-to-Russian over a back-off dictionary baseline, while eliminating the need for <UNK> entirely.^[2] The reference implementation, subword-nmt,^[4] was widely adopted in the NMT community and set the stage for tokenizer choices in later language models.

How does the BPE algorithm work?

The BPE training algorithm builds a vocabulary of subword tokens by learning merge rules from a training corpus.

Vocabulary initialization

Training begins by initializing a base vocabulary. Two common choices are:

Character-level: every distinct Unicode character that appears in the training corpus becomes an initial token, plus a special end-of-word marker.
Byte-level: the base vocabulary is always exactly 256 entries, one per possible byte value, regardless of the corpus. This is the choice used by GPT-2 and later GPT models.^[5]

A pre-tokenization step typically splits the input into word-like chunks (by whitespace, punctuation, and digit boundaries) before BPE merges are applied. This prevents the algorithm from learning merges that span across natural boundaries (such as a token combining a letter and a punctuation mark).

Iterative pair merging

After initialization, BPE repeatedly:

Counts the frequency of every adjacent token pair across the corpus, weighted by word frequency.
Identifies the pair with the highest count.
Adds a new token equal to the concatenation of that pair to the vocabulary.
Replaces every occurrence of the pair with the new token in the corpus.
Records the merge as a rule with its priority (the order in which it was learned).

The output of training is two artifacts: the vocabulary (the set of all tokens) and the merges list (the ordered list of merge rules, which is what makes tokenization deterministic at inference time).^[6]

Stop condition

Training terminates after a predetermined number of merge operations, set so that the final vocabulary reaches a target size. The final vocabulary size equals the number of initial tokens plus the number of merge operations. For example, GPT-2 performed 50,000 merges on top of a 256-byte base vocabulary, yielding a vocabulary of 50,257 tokens (the 256 base bytes plus 50,000 merged tokens plus one special end-of-text token).^[5]

Encoding new text

Once the merge rules have been learned, tokenizing new text follows a deterministic process:

Split the input text into individual characters (or bytes), applying any pre-tokenization regex.
Apply the learned merge rules in the exact order they were learned during training, greedily.
Continue applying merges until no more applicable rules remain.

Given the same input and the same merge table, BPE always produces the same tokenization. The order of merge rules matters: a merge learned earlier in training takes priority over one learned later.^[6]

Worked example

Suppose the training corpus contains the following words with their frequencies:

Word	Frequency
hug	10
pug	5
pun	12
bun	4
hugs	5

The initial vocabulary consists of all unique characters: {b, g, h, n, p, s, u}. Each word is represented as a sequence of characters:

("h" "u" "g", 10), ("p" "u" "g", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "u" "g" "s", 5)

Count every adjacent pair, weighted by word frequency:

Pair	Frequency
(h, u)	15
(u, g)	20
(p, u)	17
(u, n)	16
(b, u)	4
(g, s)	5

The pair (u, g) has the highest frequency (20), so it merges into the new token ug. The vocabulary becomes {b, g, h, n, p, s, u, ug}, and the words update:

("h" "ug", 10), ("p" "ug", 5), ("p" "u" "n", 12), ("b" "u" "n", 4), ("h" "ug" "s", 5)

Recount and merge the next most frequent pair. (u, n) now has frequency 16, so it merges into un:

("h" "ug", 10), ("p" "ug", 5), ("p" "un", 12), ("b" "un", 4), ("h" "ug" "s", 5)

The next merge might combine (h, ug) into hug (frequency 15), then (hug, s) into hugs (frequency 5), and so on. Each merge adds one new token to the vocabulary, and the recorded merges list (u+g, u+n, h+ug, ...) is what is later applied at inference time.

Variants

Byte-level BPE (GPT-2)

A significant refinement was introduced in the GPT-2 paper, "Language Models are Unsupervised Multitask Learners" (Radford et al., 2019).^[5] Standard character-level BPE starts with a base vocabulary of Unicode characters, which can be very large depending on the training data. If the training data includes Chinese, Japanese, Korean, Arabic, and emoji, the base character set alone could contain tens of thousands of symbols.

Byte-level BPE solves this by operating on raw UTF-8 bytes instead of Unicode characters. The base vocabulary is always exactly 256 entries (one per possible byte value), regardless of what languages or scripts appear in the data. Any text encoded in UTF-8 can be tokenized without ever producing an unknown token, because every byte is in the base vocabulary by construction.^[5]

The GPT-2 implementation also introduced an important preprocessing step: a regex-based pre-tokenizer that splits input text into chunks before BPE merges are applied. The regex pattern ensures that merges never cross certain boundaries (letters, digits, and punctuation are kept in separate groups). This pre-tokenization pattern was refined for GPT-4 (the cl100k_base encoding), for example, it became case-insensitive for English contractions like 's and 't, treated runs of two or more digits specially, and grouped runs of whitespace into single tokens to densify Python indentation.^[7]

Byte-level BPE has become the standard for modern language models: it guarantees complete coverage of any input text, keeps the base vocabulary small, and lets the merge rules handle the task of discovering meaningful subword units.

WordPiece (BERT, DistilBERT, ELECTRA)

WordPiece is a related but distinct subword tokenization algorithm first described by Mike Schuster and Kaisuke Nakajima in 2012 for Japanese and Korean voice search at Google.^[8] It is most widely known as the tokenizer for BERT and related models such as DistilBERT, MobileBERT, and ELECTRA.^[9]

WordPiece and BPE are structurally similar: both start with individual characters and iteratively merge pairs to build a vocabulary. The critical difference is in how they select which pair to merge. BPE merges whichever pair occurs most frequently in the training data. WordPiece instead merges the pair that maximizes the likelihood of the training data under a unigram language model, which in practice means scoring each candidate pair as:

score(x, y) = frequency(xy) / (frequency(x) * frequency(y))

This score measures how much more often two tokens appear together than would be expected if they were independent. At inference time, WordPiece also differs: while BPE applies merge rules in order, WordPiece uses a longest-match-first (maximum matching) strategy that scans the input word from left to right and greedily matches the longest token in the vocabulary. WordPiece also prefixes continuation subwords with ## to indicate that they are not the start of a new word (e.g., "tokenization" might split into ["token", "##ization"]).^[10]

Unigram language model (T5, ALBERT)

The Unigram language model tokenizer, proposed by Taku Kudo in 2018 in "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates" (arXiv:1804.10959),^[11] takes a fundamentally different approach. Where BPE builds its vocabulary bottom-up by starting with characters and merging pairs, Unigram works top-down. It begins with a large initial vocabulary of candidate subword tokens (often all substrings up to a certain length, plus all characters). It then iteratively removes the tokens whose removal causes the smallest increase in the overall negative log-likelihood of the training data under a unigram language model. This pruning continues until the vocabulary reaches a target size.

Because Unigram assigns a probability to each token in the vocabulary, it can produce multiple valid segmentations for the same input string and select the one with the highest probability. During training it can even sample different segmentations stochastically, which acts as a form of data augmentation (subword regularization). BPE, by contrast, always produces a single deterministic segmentation. Unigram is the default algorithm used by Google's T5 and is also offered by SentencePiece alongside BPE.^[12]

Implementations

Tool	Language	Description	Source
`subword-nmt`	Python	Original BPE-for-NLP implementation by Sennrich et al.	github.com/rsennrich/subword-nmt
tiktoken	Rust + Python	OpenAI's fast byte-level BPE tokenizer for GPT-2/3/4/4o	github.com/openai/tiktoken
SentencePiece	C++	Google's language-independent tokenizer supporting BPE and Unigram	github.com/google/sentencepiece
Hugging Face tokenizers	Rust + Python	Multi-algorithm fast tokenizer library (BPE, WordPiece, Unigram)	github.com/huggingface/tokenizers
`minbpe`	Python	Andrej Karpathy's minimal educational BPE implementation	github.com/karpathy/minbpe

tiktoken (OpenAI)

tiktoken is OpenAI's open-source BPE tokenizer library, written in Rust with Python bindings. OpenAI describes it simply: "tiktoken is a fast BPE tokeniser for use with OpenAI's models."^[7] Released in late 2022, tiktoken is the official tokenizer for OpenAI's API models.^[13] Key characteristics include:

It is between 3 and 6 times faster than a comparable open-source tokenizer, measured by OpenAI on 1 GB of text using the GPT-2 tokenizer.^[7]
The Rust core releases Python's Global Interpreter Lock (GIL) during computation, enabling parallel execution across threads.
It supports multiple encoding schemes corresponding to different model families.
It includes an educational submodule (tiktoken._educational) that demonstrates BPE training on small text samples.

The encoding schemes used by tiktoken correspond to different model generations:

Encoding	Vocabulary size	Models
`r50k_base`	~50,257	GPT-2, early GPT-3
`p50k_base`	~50,281	text-davinci-002, text-davinci-003
`cl100k_base`	~100,256	GPT-3.5-turbo, GPT-4
`o200k_base`	~200,000	GPT-4o, o1, o3

The progression from 50K to 200K tokens reflects larger vocabularies encoding text more efficiently, reducing inference cost and latency, especially for non-English text and code.

SentencePiece (Google)

SentencePiece is a tokenization library by Taku Kudo and John Richardson (Google) that provides a unified framework for BPE and Unigram tokenization (arXiv:1808.06226).^[12] Its key innovation is treating input text as a raw stream of characters (or bytes) without assuming whitespace-based word boundaries: whitespace is converted into a special Unicode character (▁) and joins the vocabulary. This makes SentencePiece language-independent and effective for Chinese, Japanese, and Thai. The library handles the full pipeline from raw text to token IDs and back, making tokenization fully reversible. It was used by LLaMA 1, LLaMA 2, and Gemma (BPE mode) and by T5 (Unigram mode).

Hugging Face tokenizers

The tokenizers library from Hugging Face is a Rust-backed Python library that supports BPE, WordPiece, and Unigram in a unified API. It is the tokenizer backend used by most models in the transformers library and includes utilities for training, saving, loading, and converting tokenizers between formats.^[14]

minBPE (Karpathy)

minbpe, by Andrej Karpathy, is a minimal, educational implementation of BPE in pure Python.^[15] It accompanies his February 20, 2024 lecture "Let's build the GPT Tokenizer" (a 2h13m YouTube video) in which Karpathy builds BPE from scratch and replicates the GPT-4 regex pre-tokenizer.^[16] The repo includes a basic BPE class, a regex-based class mirroring GPT-4's pattern, and a GPT4Tokenizer that loads tiktoken's published merges, all in under a few hundred lines of Python.

Which major models use BPE?

Nearly every prominent large language model uses some form of BPE-family tokenization, though the specific implementations differ.

Model	Tokenizer	Vocabulary size	Year	Notes
BERT	WordPiece	30,522	2018	English uncased base
GPT-2	Byte-level BPE	50,257	2019	First major byte-level BPE
GPT-3	Byte-level BPE	50,257	2020	Same tokenizer as GPT-2
T5	SentencePiece Unigram	32,000	2019	Unigram, not BPE
LLaMA 1	SentencePiece BPE	32,000	2023	Character-level base
LLaMA 2	SentencePiece BPE	32,000	2023	Same tokenizer as LLaMA 1
Mistral 7B	Byte-fallback BPE	32,000	2023	SentencePiece with byte fallback
Mistral 7B v0.3	Byte-fallback BPE	32,768	2024	Extended vocab
GPT-3.5-turbo	Byte-level BPE (`cl100k_base`)	~100,256	2023	tiktoken
GPT-4	Byte-level BPE (`cl100k_base`)	~100,256	2023	Same as GPT-3.5-turbo
Mistral Nemo (Tekken)	tiktoken-style byte-level BPE	131,072	2024	Tekken; trained on 100+ languages
LLaMA 3	tiktoken byte-level BPE	128,256	2024	Move away from SentencePiece
Claude	BPE (private)	~65K (estimated)	2023+	Anthropic has not released full spec
GPT-4o	Byte-level BPE (`o200k_base`)	~200,000	2024	Improved multilingual coverage
Gemma	SentencePiece BPE	256,128	2024	Subset of Gemini's vocab

GPT family (OpenAI)

The GPT series has used byte-level BPE since GPT-2.^[5] GPT-2 and GPT-3 shared the same tokenizer with 50,257 tokens. GPT-3.5 and GPT-4 moved to the cl100k_base encoding with roughly 100,000 tokens, which improved handling of code, whitespace, and non-English text. GPT-4o introduced the o200k_base encoding with approximately 200,000 tokens, further improving multilingual efficiency. The pre-tokenization regex pattern was updated between GPT-2 and GPT-4 to better handle whitespace runs and digit sequences.^[7]

BERT (Google)

BERT uses WordPiece, not BPE, but the algorithms are close cousins.^[9]^[10] BERT-base has a vocabulary of 30,522 WordPiece tokens with continuation pieces marked by ##. Many later "BERT-style" encoder models (DistilBERT, RoBERTa, ELECTRA) either reuse this vocabulary or use byte-level BPE in BERT's place.

LLaMA (Meta)

LLaMA 1 and LLaMA 2 used SentencePiece BPE with a vocabulary of 32,000 tokens.^[17] The tokenizer operated at the Unicode code point level rather than the byte level, with a byte-fallback escape for unknown characters. This worked well for English but sometimes produced suboptimal segmentations for Chinese, Japanese, and other non-Latin scripts. LLaMA 3 switched to a tiktoken-based byte-level BPE tokenizer with a vocabulary of 128,256 tokens, dramatically improving multilingual performance and reducing the average number of tokens needed to encode the same text.^[18]

Mistral

Mistral 7B originally used a byte-fallback BPE tokenizer via SentencePiece with a 32,000-token vocabulary.^[19] The byte-fallback mechanism encodes any character missing from the vocabulary as individual bytes rather than <UNK>. Mistral 7B v0.3 expanded the vocabulary to 32,768. With Mistral Nemo and later models, Mistral introduced Tekken, a tiktoken-based byte-level BPE tokenizer with 131,072 tokens trained on 100+ languages, reportedly ~30% more efficient than the prior SentencePiece tokenizer on Chinese, French, German, Spanish, Russian, and source code.^[19]

Gemma (Google)

Gemma uses a SentencePiece BPE tokenizer with a vocabulary of 256,128 tokens, one of the largest among open-weight models.^[20] This large vocabulary was derived as a subset of the tokenizer used by Google's proprietary Gemini models. The tokenizer splits digits into individual tokens and preserves whitespace explicitly, which helps with code generation and structured text.

Claude (Anthropic)

Claude uses a byte-pair encoding tokenizer, but Anthropic has not publicly released a tokenizer file in the way OpenAI did with tiktoken. Reverse-engineering by third parties has estimated Claude's vocabulary at roughly 65,000 tokens, with a large fraction overlapping with GPT-4's cl100k_base merges; Anthropic released an official token-counting endpoint via its API in late 2024, but the underlying vocabulary is not redistributed.^[21]

Properties and tradeoffs

The choice of tokenizer and vocabulary size has direct consequences for model performance, training cost, and inference speed.

Vocabulary size vs. sequence length

A larger vocabulary captures more common subwords and whole words as individual tokens, which means shorter token sequences for the same amount of text. Shorter sequences reduce the computational cost of self-attention, which scales quadratically with sequence length. However, a larger vocabulary also increases the size of the embedding matrix and the output projection layer, which adds parameters and memory.

In practice, modern models have trended toward larger vocabularies. The move from 50K (GPT-2) to 100K (GPT-4) to 200K (GPT-4o) tokens reflects empirical findings that the efficiency gains from shorter sequences outweigh the parameter cost of larger embedding tables, especially at the scale of models with tens or hundreds of billions of parameters.^[7]

Numerical encoding quirks

Numbers tokenize inconsistently. Under GPT-2's BPE, the number 1000 is a single token, while 1,000 becomes three tokens (1, ,, 000), and 10000 may split as 100 and 00. The string 0.99 typically splits into separate tokens for the digit and the decimal, with the result depending on surrounding context. This fragmentation makes arithmetic and numerical reasoning harder for language models, because the same quantity can have very different token representations depending on formatting. Newer tokenizers address this by splitting all digits into individual tokens (Gemma) or by using regex pre-tokenization rules that group digits in chunks of two or three (GPT-4 cl100k_base).^[7]

Multilingual coverage and token fertility

BPE tokenizers are typically trained on English-heavy data, creating a disparity in how efficiently different languages are encoded, a phenomenon measured by token fertility: the average number of tokens produced per word. Common English words like "the" or "information" are usually single tokens. The same semantic content in morphologically rich or low-resource languages may require 3-10 times as many tokens, with consequences for cost (per-token API pricing), latency (one forward pass per generated token), context window capacity, and quality (less spare capacity for reasoning).

The primary remedies are training on more balanced multilingual data and increasing the vocabulary size. LLaMA 3's jump from 32,000 to 128,256 tokens, GPT-4o's o200k_base (~200K), Tekken (131K), and Gemma's 256K vocabulary all reflect this trajectory.^[7]^[18]^[19]^[20]

Inference cost and training data efficiency

A more efficient tokenizer produces fewer tokens for the same raw text. Because models are trained on a fixed token budget, this effectively allows the model to "see" more raw text. At inference time, fewer tokens means fewer forward passes per response, lowering latency and per-request cost.

What are the limitations of BPE?

Whitespace and case sensitivity

Trailing whitespace changes tokenization. The string "once upon a " (with a trailing space) tokenizes differently from "once upon a time" because the space becomes a separate token. In the second case, the space is merged with "time" into a single token " time". This means adding or removing trailing whitespace in a prompt can shift the next-token distribution.

Different cases of the same word are also treated as completely separate tokens. In the GPT-2 tokenizer, "hello", "Hello", and "HELLO" are different tokens or sequences of tokens, and the model must learn separately that they refer to the same word.^[16]

Spelling errors and OOV decomposition

BPE has no native notion of edit distance or typo tolerance. A misspelled word ("teh" instead of "the") tokenizes into entirely different pieces, sometimes single characters, even though the intended meaning is identical. This can affect performance on user-generated text and is one reason why robust LLMs implicitly learn to "self-correct" typos at the token level. Similarly, the same substring may tokenize differently depending on its position within a word; word-initial, word-internal, and word-final positions can follow different merge paths.

Math and code tokenization issues

Because of the digit and whitespace quirks above, BPE tokenizers historically struggle with arithmetic and code. In GPT-2's tokenizer, each indentation space was a separate token, which meant a four-space Python indent consumed four tokens. Later tokenizers (GPT-4 cl100k_base, Tekken, Gemma) introduced whitespace-run merges and digit-grouping rules to address this, substantially improving code-generation efficiency.^[7]^[16]^[19]

Glitch tokens (e.g., " SolidGoldMagikarp")

In January 2023, AI safety researchers Jessica Rumbelow and Matthew Watkins (then in SERI-MATS) discovered a class of anomalous tokens in GPT-2 and GPT-3 that produced bizarre behavior when included in prompts. They published their findings on February 5, 2023 in a now-famous Alignment Forum / LessWrong post, "SolidGoldMagikarp (plus, prompt generation)".^[22] The tokens included strings like SolidGoldMagikarp, TheNitromeFan, cloneembedreportprint, guiActiveUn, and Smartstocks. Asked to repeat SolidGoldMagikarp, GPT-3 returned distribute; TheNitromeFan came back as 182; guiActiveUn as reception; Smartstocks as Followers. The cause was a mismatch between the tokenizer and the training data: these strings were present in the BPE merges (originating from web-crawl data that included a Reddit dataset where these usernames or strings were common) but appeared rarely or not at all in the model's training corpus, leaving the model with untrained embeddings for those vocabulary entries. The authors noted that the anomalous tokens "tend to cluster near the centroid" of the embedding space, which helps explain their erratic behavior. The episode became the standard illustration that tokenizer and model are trained on different data and that mismatches can produce hard-to-debug failure modes, later generalized as "glitch tokens" across many models.^[22]

Comparison to other tokenizers

Character-level and word-level

Approach	Vocabulary	Sequence length	OOV behavior	Multilingual
Character-level	Hundreds to thousands (Unicode) or 256 (bytes)	Very long	None: every character is in vocab	Excellent (esp. byte-level)
Word-level	Hundreds of thousands; closed	Short	`<UNK>` for any unseen word	Poor; needs separate vocab per language
BPE / WordPiece / Unigram subword	Typically 30K-256K	Medium	None: decomposes into known subpieces	Good, scales with vocab size

Character-level and byte-level models avoid tokenization artifacts entirely but produce sequences 4-8 times longer than subword sequences, multiplying the cost of self-attention. Word-level models produce the shortest sequences but cannot handle rare or novel words and require enormous closed vocabularies. Subword BPE is the practical sweet spot: it handles arbitrary text robustly, keeps sequences reasonably short, and adapts naturally to the statistics of the training data.

How does BPE differ from WordPiece?

Feature	BPE	WordPiece
Merge criterion	Most frequent pair	Pair maximizing data likelihood
Inference strategy	Apply merge rules in order	Longest match first
Continuation marker	Often none (GPT-style); sometimes a leading `▁` (SentencePiece)	`##` prefix on continuation tokens
Primary models	GPT family, LLaMA, Mistral, Claude	BERT, DistilBERT, ELECTRA
Determinism	Yes	Yes

How does BPE differ from Unigram?

Feature	BPE	Unigram
Vocabulary construction	Bottom-up (merge pairs)	Top-down (prune tokens)
Segmentation	Deterministic	Probabilistic (chooses highest-likelihood)
Subword regularization	Not natively (see BPE-Dropout)	Built-in (subword sampling)
Primary models	GPT family, LLaMA, Mistral, Claude	T5, ALBERT
Library default	`tiktoken`, `sentencepiece` (BPE mode)	`sentencepiece` (Unigram mode)

BPE-Dropout

One limitation of standard BPE is its determinism: the same word always receives the same segmentation. BPE-Dropout, proposed by Provilkov, Emelianenko, and Voita at ACL 2020 (arXiv:1910.13267),^[23] addresses this by randomly skipping some merge operations during training with probability p (typically 0.1). This produces different segmentations of the same word across different training epochs, forcing the model to learn from multiple subword decompositions. BPE-Dropout improved translation quality by up to 2.3 BLEU points over standard BPE and up to 0.9 BLEU points over Unigram-based subword regularization. At inference time, standard BPE (without dropout) is used, so tokenization remains deterministic. SentencePiece includes support for BPE-Dropout as a training option.

Formal analyses and future directions

A 2023 paper by Zouhar, Meister, et al., "A Formal Perspective on Byte-Pair Encoding" (arXiv:2306.16837),^[24] provided theoretical foundations for BPE's behavior, including its compression properties and the relationship between training corpus and learned vocabulary. While BPE is a greedy algorithm and does not guarantee globally optimal tokenizations, it performs well in practice because natural language exhibits strong local regularities that greedy merging can exploit.

Active research areas in tokenization include vocabulary-free and byte-level models (ByT5, CANINE, and more recent byte-level language models that skip tokenization but produce much longer sequences); adaptive tokenization that adjusts dynamically per input to reduce the multilingual fertility gap; learned pre-tokenization (e.g., "Boundless BPE") that removes the dependence on hand-crafted regex patterns; and token fusion / supertoken methods that add a small number of composite tokens to an existing vocabulary to achieve compression rates surpassing much larger vocabularies.

References

Gage, P. (1994). "A New Algorithm for Data Compression." *C Users Journal*, 12(2):23-38. Archived copy: https://www.derczynski.com/papers/archive/BPE_Gage.pdf ↩
Sennrich, R., Haddow, B., & Birch, A. (2016). "Neural Machine Translation of Rare Words with Subword Units." *Proceedings of ACL 2016*, 1715-1725. arXiv:1508.07909 (submitted August 31, 2015). https://arxiv.org/abs/1508.07909 ↩
Hugging Face. "Summary of the tokenizers." https://huggingface.co/docs/transformers/en/tokenizer_summary ↩
Sennrich, R. "subword-nmt: Unsupervised Word Segmentation for Neural Machine Translation and Text Generation." GitHub repository. https://github.com/rsennrich/subword-nmt ↩
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). "Language Models are Unsupervised Multitask Learners." OpenAI Technical Report (GPT-2). https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf ↩
Hugging Face. "Byte-Pair Encoding tokenization." https://huggingface.co/learn/llm-course/chapter6/5 ↩
OpenAI. "tiktoken: a fast BPE tokeniser for use with OpenAI's models." GitHub repository (includes encoding definitions, the 3-6x speed benchmark, and pre-tokenizer regex patterns). https://github.com/openai/tiktoken ↩
Schuster, M. & Nakajima, K. (2012). "Japanese and Korean Voice Search." *Proceedings of ICASSP 2012*, 5149-5152. https://research.google/pubs/japanese-and-korean-voice-search/ ↩
Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." *Proceedings of NAACL-HLT 2019*. arXiv:1810.04805. https://arxiv.org/abs/1810.04805 ↩
Hugging Face. "WordPiece tokenization." https://huggingface.co/learn/llm-course/chapter6/6 ↩
Kudo, T. (2018). "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates." *Proceedings of ACL 2018*. arXiv:1804.10959. https://arxiv.org/abs/1804.10959 ↩
Kudo, T. & Richardson, J. (2018). "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing." *Proceedings of EMNLP 2018: System Demonstrations*. arXiv:1808.06226. https://arxiv.org/abs/1808.06226 ↩
OpenAI Cookbook. "How to count tokens with tiktoken." https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken ↩
Hugging Face. "tokenizers" library documentation. https://huggingface.co/docs/tokenizers/index ↩
Karpathy, A. "minbpe: Minimal, clean code for the (byte-level) Byte Pair Encoding (BPE) algorithm." GitHub repository. https://github.com/karpathy/minbpe ↩
Karpathy, A. (2024, February 20). "Let's build the GPT Tokenizer." YouTube (2h13m). https://www.youtube.com/watch?v=zduSFxRajkE ↩
Touvron, H., et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971. https://arxiv.org/abs/2302.13971 ↩
Meta AI. "Introducing Meta Llama 3: The most capable openly available LLM to date." (April 2024). https://ai.meta.com/blog/meta-llama-3/ ↩
Mistral AI. "Concept Deep Dive: Tokenization." Mistral AI documentation. https://docs.mistral.ai/guides/tokenization/ ↩
Gemma Team, Google DeepMind. (2024). "Gemma: Open Models Based on Gemini Research and Technology." Technical report. https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf ↩
Rando, J. "anthropic-tokenizer: Approximation of the Claude 3 tokenizer by inspecting generation stream." GitHub repository. https://github.com/javirandor/anthropic-tokenizer ↩
Rumbelow, J. & Watkins, M. ("mwatkins"). (2023, February 5). "SolidGoldMagikarp (plus, prompt generation)." LessWrong / AI Alignment Forum. https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation ↩
Provilkov, I., Emelianenko, D., & Voita, E. (2020). "BPE-Dropout: Simple and Effective Subword Regularization." *Proceedings of ACL 2020*. arXiv:1910.13267. https://arxiv.org/abs/1910.13267 ↩
Zouhar, V., Meister, C., Gastaldi, J. L., Du, L., Vieira, T., Sachan, M., & Cotterell, R. (2023). "A Formal Perspective on Byte-Pair Encoding." *Findings of ACL 2023*. arXiv:2306.16837. https://arxiv.org/abs/2306.16837 ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

9 revisions by 1 contributors · full history

Suggest edit