WordPiece
Last reviewed
Apr 30, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 3,341 words
WordPiece is a data-driven subword tokenization algorithm that builds a fixed-size vocabulary of word pieces by iteratively merging the symbol pair whose combination produces the largest increase in the language model likelihood of the training corpus. It was introduced by Mike Schuster and Kaisuke Nakajima at Google in their 2012 ICASSP paper Japanese and Korean voice search, originally as a way to build pronunciation inventories for languages whose orthography lacks clear word boundaries. The algorithm later became the standard subword model for several important neural systems: it powered the Google Neural Machine Translation (GNMT) system described by Wu et al. in 2016, and it became widely known after Jacob Devlin and colleagues used it for the BERT pre-training corpus in 2018. Most of the BERT-family models, including DistilBERT, MobileBERT, ELECTRA, and multilingual BERT, continue to ship with WordPiece vocabularies.
WordPiece is closely related to Byte pair encoding (BPE) and to the unigram language model tokenizer that ships in Google's SentencePiece library, but the three algorithms differ in important ways. BPE merges the most frequent pair of symbols on each step. WordPiece merges the pair that scores highest under a likelihood criterion, equivalent to the ratio of the pair frequency to the product of the individual frequencies. The unigram language model approach used by SentencePiece starts with a large seed vocabulary and prunes pieces that contribute least to the data likelihood, so the build direction is opposite to that of WordPiece. These distinctions matter in practice because the resulting vocabularies, although similar in size, tokenize the same text in slightly different ways, which can affect downstream task transfer.
WordPiece was first described in Mike Schuster and Kaisuke Nakajima, Japanese and Korean voice search, presented at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) in 2012, pages 5149 to 5152. The paper addresses a problem that English speech recognition does not face in the same way: Japanese is normally written without spaces between words, and Korean uses Eojeol units that are themselves morphologically complex. A traditional speech recognizer needs a closed pronunciation lexicon, so the authors had to choose what counted as a "word" before they could build language and acoustic models. They wanted an inventory that would be small enough to be tractable, large enough to cover the long tail of names and rare terms that show up in voice queries, and learnable directly from text rather than from a hand-curated dictionary.
The authors propose a procedure they call the WordPieceModel. Starting from an inventory of basic Unicode characters (about 22,000 for Japanese and 11,000 for Korean), they train a unigram language model on a large text corpus using the current inventory, then add a new word piece to the inventory by combining two existing pieces in a way that maximizes the likelihood of the training data. They iterate until the inventory reaches a target size or the marginal likelihood gain falls below a threshold. The greedy character of the algorithm means that each step is a local optimum rather than a global one, but in practice it produces inventories that fit the long-tail distribution of voice queries well. The resulting word pieces include common morphemes, frequent word stems, full common words, and a sprinkling of useful character n-grams.
Schuster and Nakajima's original motivation was speech recognition for languages without spaces, but the same algorithm turned out to be useful any time a model needs an open vocabulary at a fixed budget. That is exactly the situation that arose four years later when Google's translation team replaced their phrase-based system with a sequence-to-sequence neural model.
The second major appearance of WordPiece was in Yonghui Wu et al., Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, posted on arXiv in September 2016 (arXiv:1609.08144). GNMT used a deep stacked LSTM encoder and decoder with attention, and the authors needed a way to handle rare words and morphological variation in many language pairs without ballooning the softmax. Their answer was to encode both the source and the target sides of the corpus with WordPiece. The paper says the system divides words into a limited set of common sub-word units called wordpieces for both input and output, and that this provides a balance between the flexibility of character-level models and the efficiency of word-level models. It also notes that wordpieces naturally handle the translation of rare words.
The GNMT paper reports that vocabulary sizes between 8,000 and 32,000 wordpieces gave both good BLEU scores and fast decoding. The released production models used a shared source and target vocabulary of 32,000 wordpieces. The training procedure followed the same maximum-likelihood criterion as the 2012 paper, applied at scale to the WMT and Google internal translation corpora.
GNMT was deployed in Google Translate starting in late 2016, beginning with Chinese to English. The deployment made WordPiece by far the most widely used subword model in production at the time, although the broader research community largely associated subword tokenization with Sennrich, Haddow, and Birch's 2016 BPE paper for neural machine translation. Both approaches were in active use across different teams.
The paper that made WordPiece a household name in NLP was Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, posted on arXiv in October 2018 and published at NAACL 2019. The BERT paper says only a few words about its tokenizer: it uses WordPiece embeddings (citing Wu et al. 2016) with a 30,000 token vocabulary. The released checkpoints actually ship with a vocabulary of 30,522 WordPieces for the English uncased model and 28,996 for the English cased model. By convention, the non-initial pieces of a word carry the prefix ##, so the word playing becomes play ##ing and a less common word like embeddings may become em ##bed ##ding ##s.
BERT's success made the BERT vocabularies a de facto standard in 2019 and 2020. Researchers built models that either used the released BERT vocabulary directly, retrained a fresh WordPiece vocabulary on a new corpus, or borrowed the WordPiece tokenizer for a different transformer architecture. A non-exhaustive list of BERT-family models that use WordPiece is shown in the table below.
| Model | Year | WordPiece vocabulary size | Notes |
|---|---|---|---|
| BERT-base, uncased | 2018 | 30,522 | English Wikipedia plus BookCorpus |
| BERT-base, cased | 2018 | 28,996 | Cased English |
| BERT multilingual cased (mBERT) | 2018 | 119,547 | Top 104 Wikipedia languages |
| DistilBERT | 2019 | 30,522 | Inherits BERT vocabulary |
| ALBERT | 2019 | 30,000 | Uses SentencePiece (unigram) instead of WordPiece; listed for contrast |
| ELECTRA | 2020 | 30,522 | Inherits BERT vocabulary |
| MobileBERT | 2020 | 30,522 | Inherits BERT vocabulary |
| TinyBERT | 2019 | 30,522 | Inherits BERT vocabulary |
The multilingual BERT vocabulary is built from the 104 largest Wikipedias using exponentially smoothed weighting with a smoothing factor of 0.7, so high-resource languages such as English are under-sampled and low-resource languages such as Icelandic are over-sampled. The README for the open-source BERT release says English is sampled roughly 100 times more than Icelandic after smoothing, compared to roughly 1,000 times more before smoothing.
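As a rough check on those figures, the sketch below applies exponent-0.7 smoothing to two corpus sizes that differ by a factor of 1,000. The sizes, the language codes, and the helper name `smoothed_sampling_probs` are illustrative assumptions, not values or code from the BERT release.

```python
def smoothed_sampling_probs(sizes, alpha=0.7):
    """Sampling probabilities proportional to size ** alpha (hypothetical helper)."""
    weights = {lang: n ** alpha for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Made-up sizes, chosen only so that the raw ratio is 1,000 to 1.
sizes = {"en": 1_000_000, "is": 1_000}
probs = smoothed_sampling_probs(sizes)

ratio_before = sizes["en"] / sizes["is"]
ratio_after = probs["en"] / probs["is"]   # equals 1000 ** 0.7
print(f"before smoothing: {ratio_before:.0f}x, after smoothing: {ratio_after:.0f}x")
```

Under these assumptions the ratio after smoothing is 1000 raised to the power 0.7, which is on the order of 100 and so consistent with the README's rounded figure.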
The WordPiece training algorithm has the same skeleton as BPE: start from a small character-level vocabulary, then iteratively grow it by merging adjacent symbol pairs in the corpus. The difference is the merge criterion. Google never open-sourced its training code, so the published descriptions are best guesses based on the 2012 paper, the GNMT paper, and the Hugging Face reverse-engineering documented in the transformers tokenizer summary and the LLM Course chapter on WordPiece.
Before training, the corpus is pre-tokenized into words using whitespace and punctuation rules. Each word is split into its individual characters. To preserve the information that a character was the second, third, or later character of a word, every non-initial character is given the WordPiece prefix ## (in BERT's convention; other implementations use ▁ or no prefix at all). The character c at the start of a word becomes the vocabulary item c, and the same character in the middle of a word becomes the distinct vocabulary item ##c. A word like word is therefore initially split as w ##o ##r ##d. Special tokens such as [PAD], [UNK], [CLS], [SEP], and [MASK] are added at the start of the vocabulary.
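The initial split can be sketched in a few lines. This is a minimal illustration of the BERT-style ## convention described above; the function names and the choice to sort the character pieces are assumptions made for the example.

```python
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def initial_split(word, prefix="##"):
    """Split a word into characters, marking every non-initial character."""
    return [ch if i == 0 else prefix + ch for i, ch in enumerate(word)]

def initial_vocab(words):
    """Special tokens first, then every distinct character-level piece."""
    pieces = {piece for word in words for piece in initial_split(word)}
    return SPECIAL_TOKENS + sorted(pieces)

print(initial_split("word"))  # ['w', '##o', '##r', '##d']
```

Note that `o` and `##o` are distinct vocabulary items here, which is why a character can be known at the start of a word and unknown in the middle of one.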
For every adjacent pair (x, y) in the corpus, WordPiece computes a score:
score(x, y) = freq(xy) / (freq(x) * freq(y))
It then merges the pair with the highest score. Intuitively, the score asks how much more often the pair occurs together than would be expected if x and y were independent unigrams. Ranking pairs this way is a practical stand-in for the criterion stated in the 2012 paper, which picks the merge that most increases the log-likelihood of the training corpus under a unigram language model once the merged token is added to the vocabulary. BPE, in contrast, would simply pick the pair with the highest raw frequency freq(xy).
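One way to read the score is as a rescaled probability ratio. The derivation below assumes unigram and pair probabilities are estimated as relative frequencies over a common total count N; the symbol N is introduced here for illustration and does not appear in the sources.

```latex
\[
P(x) = \frac{\mathrm{freq}(x)}{N}, \qquad
P(xy) = \frac{\mathrm{freq}(xy)}{N}
\quad\Longrightarrow\quad
\frac{P(xy)}{P(x)\,P(y)}
  = N \cdot \frac{\mathrm{freq}(xy)}{\mathrm{freq}(x)\,\mathrm{freq}(y)}
  = N \cdot \mathrm{score}(x, y).
\]
```

Because N is the same for every candidate pair, ranking pairs by score is the same as ranking them by this ratio, whose logarithm is the pointwise mutual information of x and y.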
The Hugging Face LLM Course works through a small example with the corpus ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5). After the initial split with ## prefixes, the most frequent pair is ("##u", "##g") with frequency 20, but the score under WordPiece is only 1/36 because ##u is itself very frequent. The pair ("##g", "##s") has a lower frequency of 5 but a higher score of 1/20, so WordPiece merges ##g and ##s first, producing the new piece ##gs. BPE on the same corpus would have merged ("##u", "##g") first.
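The arithmetic in this example can be reproduced with a short script. It is a sketch of the scoring step only, using exact fractions so the scores print as 1/36 and 1/20; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction

corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

def initial_split(word):
    return [ch if i == 0 else "##" + ch for i, ch in enumerate(word)]

piece_freq = Counter()
pair_freq = Counter()
for word, count in corpus.items():
    pieces = initial_split(word)
    for piece in pieces:
        piece_freq[piece] += count
    for left, right in zip(pieces, pieces[1:]):
        pair_freq[(left, right)] += count

# WordPiece score: freq(xy) / (freq(x) * freq(y)), kept as an exact fraction.
scores = {
    pair: Fraction(freq, piece_freq[pair[0]] * piece_freq[pair[1]])
    for pair, freq in pair_freq.items()
}

bpe_choice = max(pair_freq, key=pair_freq.get)      # highest raw frequency
wordpiece_choice = max(scores, key=scores.get)      # highest score
print("BPE would merge:", bpe_choice, "with frequency", pair_freq[bpe_choice])
print("WordPiece would merge:", wordpiece_choice, "with score", scores[wordpiece_choice])
```

Every pair that contains ##u ends up with the same score of 1/36, which is why the only pair without it wins.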
Merging stops when the vocabulary reaches a predefined target size or when the best available score falls below a threshold. The original 2012 paper used a likelihood-gain threshold; modern implementations more commonly fix the target size up front, since downstream model architectures need a fixed embedding table.
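Putting the pieces together, a fixed-target-size training loop might look like the sketch below. It follows the publicly described score, not Google's unreleased code. The function name `train_wordpiece` and the tie-breaking rule (first pair encountered wins) are assumptions, so merges after the first may differ from other walkthroughs.

```python
from collections import Counter

def train_wordpiece(word_counts, target_size):
    """Grow a vocabulary by repeatedly merging the best-scoring adjacent pair."""
    # Each word starts as characters, with "##" on every non-initial character.
    splits = {
        w: [c if i == 0 else "##" + c for i, c in enumerate(w)]
        for w in word_counts
    }
    vocab = sorted({p for pieces in splits.values() for p in pieces})
    merges = []
    while len(vocab) < target_size:
        piece_freq, pair_freq = Counter(), Counter()
        for w, n in word_counts.items():
            pieces = splits[w]
            for p in pieces:
                piece_freq[p] += n
            for pair in zip(pieces, pieces[1:]):
                pair_freq[pair] += n
        if not pair_freq:
            break  # every word is already a single piece
        # Assumption: ties go to the first pair seen; real trainers may differ.
        best = max(
            pair_freq,
            key=lambda pr: pair_freq[pr] / (piece_freq[pr[0]] * piece_freq[pr[1]]),
        )
        # The right-hand piece is never word-initial, so it always starts with "##".
        merged = best[0] + best[1][2:]
        merges.append(merged)
        if merged not in vocab:
            vocab.append(merged)
        # Rewrite every word so the chosen pair becomes a single piece.
        for w, pieces in splits.items():
            out, i = [], 0
            while i < len(pieces):
                if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(pieces[i])
                    i += 1
            splits[w] = out
    return vocab, merges

toy_corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
vocab, merges = train_wordpiece(toy_corpus, target_size=10)
print(merges)  # the first merge is '##gs'
```

A threshold-based stopping rule, as in the 2012 paper, would replace the `while` condition with a check on the best score; it is left out here to keep the sketch short.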
At inference time, BERT-style WordPiece does not store the merge rules. Instead it stores only the final vocabulary and runs a greedy longest-match-first algorithm against the input. Given a word, the encoder repeatedly finds the longest prefix that appears in the vocabulary and emits that piece. The remainder of the word, prefixed with ##, is then re-encoded by the same procedure. If no prefix of the remaining string is in the vocabulary, the entire original word is emitted as [UNK], not as a partial sequence. This is one of the more visible behavioral differences from BPE, which would emit the unknown character as <unk> and continue with the rest of the word.
For example, suppose training on the corpus above has produced the vocabulary b, h, p, ##g, ##n, ##s, ##u, ##gs, hu, and hug. The word hugs tokenizes as [hug, ##s] because hug is the longest prefix in the vocabulary. The word bugs becomes [b, ##u, ##gs]. A word like mug, whose initial character m is not in the vocabulary at all, tokenizes as [[UNK]]. So does bum: b and ##u match, but ##m is missing, so the result is [[UNK]] rather than [b, ##u, [UNK]] or any other partial decomposition.
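The longest-match procedure can be sketched as follows. The function name `wordpiece_encode` is hypothetical, and the vocabulary in the code is an assumed toy vocabulary of ten pieces rather than the output of any particular trainer.

```python
def wordpiece_encode(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first encoding of a single pre-tokenized word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # one missing piece turns the whole word into [UNK]
        pieces.append(match)
        start = end
    return pieces

# Assumed toy vocabulary for illustration.
vocab = {"b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu", "hug"}
for word in ["hugs", "bugs", "mug", "bum"]:
    print(word, wordpiece_encode(word, vocab))
```

With this vocabulary, hugs and bugs split into known pieces, while mug and bum each collapse to a single [UNK] because one required piece is missing.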
Four names dominate discussions of subword tokenization in modern NLP: BPE, WordPiece, the unigram language model, and SentencePiece (which is a library that hosts BPE or unigram rather than a separate algorithm). The table below summarizes the differences.
| Algorithm | Direction | Merge or prune criterion | Stores | Used by |
|---|---|---|---|---|
| BPE (Sennrich et al. 2016) | Bottom-up, grow | Most frequent pair | Merge rules | GPT, GPT-2 (byte-level), RoBERTa, BART, Llama family |
| WordPiece (Schuster & Nakajima 2012) | Bottom-up, grow | Pair maximizing likelihood gain freq(xy) / (freq(x)·freq(y)) | Final vocabulary only | BERT, DistilBERT, ELECTRA, mBERT, MobileBERT |
| Unigram LM (Kudo 2018) | Top-down, prune | Remove pieces whose deletion least decreases corpus likelihood | Vocabulary with probabilities | T5, ALBERT, XLNet, mBART (via SentencePiece) |
| SentencePiece (Kudo & Richardson 2018) | Either (BPE or unigram), trained on raw text | Inherits from BPE or unigram | Vocabulary, optional merges | Multilingual models, T5 family, XLNet, ALBERT |
A few cross-cutting points are worth pulling out. BPE's bottom-up frequency criterion and WordPiece's likelihood criterion produce vocabularies that overlap substantially in the most common pieces but diverge on the long tail. WordPiece tends to keep more morpheme-like fragments, since pairs whose individual parts are rare get a high score even when the pair itself is not especially frequent. BPE tends to keep more highly frequent surface n-grams. The unigram LM approach is conceptually closer to a probabilistic model: it learns a probability for each piece and at decoding time picks the most likely segmentation under the unigram model, which makes it easy to sample alternative segmentations for subword regularization.
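To make the contrast with greedy longest-match concrete, the sketch below finds the most likely segmentation of a word under a unigram model by dynamic programming. The piece probabilities are invented for illustration, and the function name `best_segmentation` is hypothetical; this is not SentencePiece's implementation.

```python
import math

def best_segmentation(word, piece_logprob):
    """Most likely split of `word` under a unigram model (Viterbi-style search)."""
    n = len(word)
    # best[i] = (score of the best split of word[:i], start index of its last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in piece_logprob and best[start][0] > -math.inf:
                score = best[start][0] + piece_logprob[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # no sequence of known pieces covers the word
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

# Invented probabilities, for illustration only.
probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.1,
         "hu": 0.1, "ug": 0.2, "gs": 0.05, "hug": 0.3}
logp = {piece: math.log(p) for piece, p in probs.items()}
print(best_segmentation("hugs", logp))
```

Unlike the greedy matcher, this search compares whole segmentations, so a shorter first piece can win when the remainder is much more probable.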
SentencePiece is sometimes loosely called "WordPiece" in conversation, but the comparison in the official SentencePiece README clarifies that the BPE algorithm used in WordPiece is slightly different from the original BPE, and that SentencePiece itself implements BPE and unigram, not WordPiece. SentencePiece can be configured to mimic WordPiece behavior closely, for example by choosing a bottom-up BPE training and an ##-style continuation marker, but it is not a bit-exact reimplementation. The Hugging Face tokenizers library does include a WordPiece model with greedy longest-match inference, exposed through the BertWordPieceTokenizer class and used by the BertTokenizerFast interface in transformers to load BERT checkpoints, although its trainer is built on BPE rather than on the likelihood-based criterion.
There is no single canonical reference implementation of WordPiece because Google did not open-source the original training code. Several implementations have grown up around the algorithm, each with slightly different defaults.
| Implementation | Maintainer | Training | Inference |
|---|---|---|---|
| tensor2tensor SubwordTextEncoder | Google Brain | Heuristic likelihood-based merger close to WordPiece | Greedy longest-match |
| TensorFlow Text BertTokenizer and WordpieceTokenizer | Google (TensorFlow) | Wraps a separate trainer such as wordpiece_tokenizer_learner | Greedy longest-match |
| Hugging Face tokenizers (BertWordPieceTokenizer) | Hugging Face | BPE-based trainer with a target vocabulary size, emitting a WordPiece-format vocabulary | Greedy longest-match, ## continuation |
| Hugging Face transformers (BertTokenizer, BertTokenizerFast) | Hugging Face | Loads pretrained vocabulary | Greedy longest-match |
| SentencePiece (with BPE config) | Google | BPE on raw text | Approximation of WordPiece behavior |
The Hugging Face LLM Course cautions that even Hugging Face's own train_new_from_iterator does not reproduce WordPiece exactly, since the underlying tokenizers library uses BPE for training, not WordPiece. For new projects that want to reproduce BERT-style behavior, the most common path is to load the released BERT vocabulary directly rather than train a fresh one.
WordPiece's main strength is the same as that of any good subword tokenizer: it gives a fixed-size vocabulary that can still represent any word, including rare names, technical jargon, and compound forms in morphologically rich languages such as Turkish, Finnish, and German. The likelihood-gain criterion produces pieces that often line up with morphemes, which is convenient for downstream tasks like named entity recognition where subword consistency matters. Because the vocabulary is fixed and small (typically 30,000 to 50,000 pieces for monolingual models, around 120,000 for highly multilingual ones), the embedding table and softmax stay tractable.
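The effect of vocabulary size on the embedding table is simple arithmetic. The sketch below uses the vocabulary sizes quoted earlier in this article and assumes a hidden width of 768, the BERT-base value, which is not stated in the text above.

```python
def embedding_parameters(vocab_size, hidden_size):
    """Number of weights in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size

HIDDEN_SIZE = 768  # assumption: BERT-base hidden width

for name, vocab_size in [("BERT-base uncased", 30_522), ("mBERT cased", 119_547)]:
    count = embedding_parameters(vocab_size, HIDDEN_SIZE)
    print(f"{name}: {count:,} embedding weights")
```

Under that assumption the monolingual table has about 23 million weights and the multilingual one about 92 million, which illustrates why vocabulary size is treated as a budget.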
The algorithm has a few real weaknesses. Greedy longest-match decoding is deterministic, but it sometimes produces non-intuitive splits: closely related forms of the same English word (a plural, a capitalized variant, a compound) can tokenize in unrelated ways when one of them happens to make a longer match available, and the splits do not always respect linguistic morphology. The fall-back to [UNK] for an entire word when a single internal character is missing is harsh, and modern pipelines work around it by either guaranteeing every byte is in the vocabulary (the byte-level BPE approach used by GPT-2) or by adding a character-level fallback.
WordPiece also inherits the downsides of pre-tokenization. The standard BERT pipeline assumes the input has been pre-split into rough words on whitespace and punctuation before WordPiece sees it. For Chinese, Japanese, Thai, and other languages without obvious word boundaries, this assumption breaks, and the original BERT release had to ship a separate Chinese tokenizer that splits at every character before WordPiece runs. SentencePiece-based unigram tokenizers handle this case more gracefully because they operate directly on raw text.
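A simplified pre-tokenizer along these lines is sketched below. It is an approximation for illustration, not BERT's actual code. The function names are hypothetical, and the single CJK code point range is a deliberate simplification.

```python
import unicodedata

def is_cjk(ch):
    """True for the main CJK Unified Ideographs block (a simplification)."""
    return 0x4E00 <= ord(ch) <= 0x9FFF

def pre_tokenize(text):
    """Split on whitespace; make each punctuation mark or CJK character its own token."""
    tokens, current = [], []

    def flush():
        if current:
            tokens.append("".join(current))
            current.clear()

    for ch in text:
        if ch.isspace():
            flush()
        elif is_cjk(ch) or unicodedata.category(ch).startswith("P"):
            flush()
            tokens.append(ch)
        else:
            current.append(ch)
    flush()
    return tokens

print(pre_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```

Each resulting token is then handed to the WordPiece encoder independently, which is why scripts without spaces need the character-level split to avoid treating a whole sentence as one word.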
Inconsistent subword segmentations of closely related surface forms (for example, a word and its capitalized or inflected variant) can also affect transfer learning, since downstream models that rely on consistent surface forms (for example, span extraction in question answering) are more brittle when their tokenizer splits similar strings in dissimilar ways.
WordPiece is a compromise between word-level tokenizers (which produce short sequences but huge vocabularies and unknown-word problems) and character-level or byte-level tokenizers (which have tiny vocabularies but very long sequences). Two notable character-level or byte-level alternatives have been published since BERT.
ByT5 (Xue et al. 2021) is a variant of T5 that operates on UTF-8 bytes rather than SentencePiece tokens. The approach gives up short sequences, and needs a transformer able to handle much longer ones, in exchange for the simplicity of having no learned tokenizer at all. ByT5 is more robust to noise and to languages that the original SentencePiece vocabulary did not cover well.
CANINE (Clark et al. 2021) skips the pre-tokenizer entirely and operates on Unicode codepoints, using downsampling inside the encoder to keep the sequence manageable. The CANINE paper presents itself explicitly as an alternative to WordPiece for multilingual settings where vocabulary mismatch is a recurring problem.
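The sequence-length trade-off is easy to see by counting units for the same string. The sketch below compares whitespace-separated words, Unicode code points (the unit CANINE consumes), and UTF-8 bytes (the unit ByT5 consumes). The sample string is arbitrary, and the models' own special-token handling is not reproduced.

```python
# Escapes keep the example encoding-independent; the string is "naïve 東京".
text = "na\u00efve \u6771\u4eac"

words = text.split()                   # word-level view
codepoints = [ord(ch) for ch in text]  # character-level view (Unicode code points)
byte_ids = list(text.encode("utf-8"))  # byte-level view (values 0 to 255)

print("words:", len(words))
print("code points:", len(codepoints))
print("UTF-8 bytes:", len(byte_ids))
```

The byte view is the longest because each non-ASCII character expands to two or three bytes, but it never needs an unknown token: any string round-trips through at most 256 distinct values.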
Neither byte-level nor character-level approaches have displaced WordPiece in production. They show up in research and in models that need to support languages or scripts that WordPiece-trained checkpoints handle poorly. WordPiece, BPE, and the unigram LM remain the three workhorse subword schemes in modern NLP.
WordPiece is now nearly fifteen years old, and the bulk of new pre-training projects launched since 2022 have moved to byte-level BPE (for Llama-style decoder-only models) or to SentencePiece unigram (for T5-style encoder-decoder models). The original BERT vocabularies are still among the most-downloaded tokenizers on Hugging Face Hub, however, and the BERT-family models that ship with WordPiece (DistilBERT, MobileBERT, ELECTRA, mBERT, and a long tail of domain-specific BERTs) remain heavily used in production for classification, retrieval, named entity recognition, and other encoder-only tasks.
WordPiece also lives on through SentencePiece's BPE mode, since several teams use SentencePiece as a more flexible drop-in replacement for the original WordPiece pipeline. The phrase "WordPiece tokenizer" is sometimes used loosely in modern documentation to mean any greedy longest-match subword tokenizer with a ## continuation prefix, even when the underlying training algorithm is BPE.