WordPiece

Updated Jun 23, 2026

WordPiece is a subword tokenization algorithm that builds a fixed-size vocabulary of word pieces by repeatedly merging the symbol pair whose combination most increases the likelihood of the training corpus under a language model. Unlike Byte pair encoding (BPE), which merges the single most frequent pair on each step, WordPiece merges the pair that scores highest on freq(xy) / (freq(x) * freq(y)): the pair that occurs together far more often than its two parts would predict if they were independent ^[1]^[7]. It was introduced by Mike Schuster and Kaisuke Nakajima at Google in their 2012 ICASSP paper Japanese and Korean voice search ^[1], and it became the standard tokenizer of BERT, whose released English model ships a 30,522-piece WordPiece vocabulary ^[3]. WordPiece marks every non-initial piece of a word with the prefix ##, so the word playing tokenizes as play ##ing ^[7].

WordPiece originated as a way to build pronunciation inventories for languages whose orthography lacks clear word boundaries, then became the standard subword model for several important neural systems. It powered the Google Neural Machine Translation (GNMT) system described by Wu et al. in 2016, which used a shared source and target vocabulary of 32,000 wordpieces ^[2], and it became widely known after Jacob Devlin and colleagues used it to tokenize the pre-training data for BERT in 2018 ^[3]. Many BERT-family models, including DistilBERT, MobileBERT, ELECTRA, and multilingual BERT, continue to ship with WordPiece vocabularies.

WordPiece is closely related to BPE and to the unigram language model tokenizer that ships in Google's SentencePiece library, but the three algorithms differ in important ways. BPE merges the most frequent pair of symbols on each step. WordPiece merges the pair that scores highest under a likelihood criterion, equivalent to the ratio of the pair frequency to the product of the individual frequencies ^[7]^[8]. The unigram language model approach used by SentencePiece starts with a large seed vocabulary and prunes pieces that contribute least to the data likelihood, so the build direction is opposite to that of WordPiece ^[5]^[6]. These distinctions matter in practice because the resulting vocabularies, although similar in size, tokenize the same text in slightly different ways, which can affect downstream task transfer.

When was WordPiece created, and why?

WordPiece was first described in Mike Schuster and Kaisuke Nakajima, Japanese and Korean voice search, presented at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) in 2012, pages 5149 to 5152 ^[1]. The paper deals with a problem that does not arise in the same way in English speech recognition: Japanese is normally written without spaces between words, and Korean uses Eojeol units that are themselves morphologically complex. A traditional speech recognizer needs a closed pronunciation lexicon, so the authors had to choose what counted as a "word" before they could build language and acoustic models. They wanted an inventory that would be small enough to be tractable, large enough to cover the long tail of names and rare terms that show up in voice queries, and learnable directly from text rather than from a hand-curated dictionary.

The authors propose a procedure they call the WordPieceModel, which, in their words, "learns word units from large amounts of text automatically and incrementally by running a greedy algorithm" ^[1]. Starting from an inventory of basic Unicode characters (about 22,000 for Japanese and 11,000 for Korean), they train a unigram language model on a large text corpus using the current inventory, then add a new word piece to the inventory by combining two existing pieces in a way that maximizes the likelihood of the training data ^[1]. They iterate until the inventory reaches a target size or the marginal likelihood gain falls below a threshold. Because the algorithm is greedy, each merge is only locally optimal and the final inventory is not guaranteed to be globally optimal, but in practice it produces inventories that fit the long-tail distribution of voice queries well. The resulting word pieces include common morphemes, frequent word stems, full common words, and a sprinkling of useful character n-grams.

Schuster and Nakajima's original motivation was speech recognition for languages without spaces, but the same algorithm turned out to be useful any time a model needs an open vocabulary at a fixed budget. That is exactly the situation that arose four years later when Google's translation team replaced their phrase-based system with a sequence-to-sequence neural model.

How was WordPiece used in Google Neural Machine Translation?

The second major appearance of WordPiece was in Yonghui Wu et al., Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation, posted on arXiv in September 2016 (arXiv:1609.08144) ^[2]. GNMT used a deep stacked LSTM encoder and decoder with attention, and the authors needed a way to handle rare words and morphological variation in many language pairs without ballooning the softmax. Their answer was to encode both the source and the target sides of the corpus with WordPiece. The paper says the system divides words "into a limited set of common sub-word units ("wordpieces") for both input and output," and that "this method provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models" ^[2]. It also notes that wordpieces naturally handle the translation of rare words.

The GNMT paper reports that a total vocabulary of between 8,000 and 32,000 wordpieces gave both good BLEU scores and fast decoding ^[2]. The released production models used a shared source and target vocabulary of 32,000 wordpieces ^[2]. The training procedure followed the same maximum-likelihood criterion as the 2012 paper, applied at scale to the WMT and Google internal translation corpora.

GNMT was deployed in Google Translate starting in late 2016, beginning with Chinese to English. The deployment made WordPiece by far the most widely used subword model in production at the time, although the broader research community largely associated subword tokenization with Sennrich, Haddow, and Birch's 2016 BPE paper for neural machine translation ^[4]. Both approaches were in active use across different teams.

How did WordPiece become the standard tokenizer for BERT?

The paper that made WordPiece a household name in NLP was Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, posted on arXiv in October 2018 and published at NAACL 2019 ^[3]. The BERT paper says only a few words about its tokenizer: it uses "WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary" ^[3]. The released checkpoints actually ship with a vocabulary of 30,522 WordPieces for the English uncased model and 28,996 for the English cased model ^[3]^[10]. By convention, the non-initial pieces of a word carry the prefix ##, so the word playing becomes play ##ing and a less common word like embeddings may become em ##bed ##ding ##s ^[7].

BERT's success made the BERT vocabularies a de facto standard in 2019 and 2020. Researchers built models that either used the released BERT vocabulary directly, retrained a fresh WordPiece vocabulary on a new corpus, or borrowed the WordPiece tokenizer for a different transformer architecture. A non-exhaustive list of BERT-family models that use WordPiece is shown in the table below.

Model	Year	WordPiece vocabulary size	Notes
BERT-base, uncased	2018	30,522	English Wikipedia plus BookCorpus
BERT-base, cased	2018	28,996	Cased English
BERT multilingual cased (mBERT)	2018	119,547	Top 104 Wikipedia languages
DistilBERT	2019	30,522	Inherits BERT vocabulary
ALBERT	2019	30,000	Not WordPiece: uses a SentencePiece (unigram) vocabulary; listed for contrast
ELECTRA	2020	30,522	Inherits BERT vocabulary
MobileBERT	2020	30,522	Inherits BERT vocabulary
TinyBERT	2019	30,522	Inherits BERT vocabulary

The multilingual BERT vocabulary is built from the 104 largest Wikipedias using exponentially smoothed weighting with a smoothing factor of 0.7, so high-resource languages such as English are under-sampled and low-resource languages such as Icelandic are over-sampled ^[10]. The README for the open-source BERT release says English is sampled roughly 100 times more than Icelandic after smoothing, compared to roughly 1,000 times more before smoothing ^[10].
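
The effect of the smoothing factor can be checked with a few lines of arithmetic. The sketch below assumes the README's description (raise each language's sampling probability to the power 0.7, then renormalize) and uses hypothetical corpus sizes with a 1,000:1 ratio; the function name and the sizes are illustrative, not taken from the BERT code.

```python
def smoothed_sampling_probs(sizes, s=0.7):
    """Raise each language's share of the data to the power s, then renormalize."""
    total = sum(sizes.values())
    weights = {lang: (n / total) ** s for lang, n in sizes.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

# Hypothetical corpus sizes with a 1,000:1 ratio (not real Wikipedia counts).
sizes = {"en": 1_000_000, "is": 1_000}
probs = smoothed_sampling_probs(sizes)

ratio_before = sizes["en"] / sizes["is"]
ratio_after = probs["en"] / probs["is"]
print(f"before: {ratio_before:.0f}x, after: {ratio_after:.0f}x")  # before: 1000x, after: 126x
```

With an exact 1,000:1 starting ratio the smoothed ratio is 1000^0.7, about 126, which is the same order of magnitude as the "roughly 100 times" figure quoted in the README.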

How does the WordPiece algorithm work?

The WordPiece training algorithm has the same skeleton as BPE: start from a small character-level vocabulary, then iteratively grow it by merging adjacent symbol pairs in the corpus. The difference is the merge criterion. Google never open-sourced its training code, so the published descriptions are best guesses based on the 2012 paper, the GNMT paper, and the Hugging Face reverse-engineering documented in the transformers tokenizer summary and the LLM Course chapter on WordPiece. As the Hugging Face LLM Course puts it, "Google never open-sourced its implementation of the training algorithm of WordPiece, so what follows is our best guess based on the published literature" ^[7].

Initialization

Before training, the corpus is pre-tokenized into words using whitespace and punctuation rules. Each word is split into its individual characters. To preserve the information that a character was the second, third, or later character of a word, every non-initial character is given the WordPiece prefix ## (BERT's convention; other implementations instead mark word starts, for example with the ▁ character, U+2581 "lower one eighth block", or use no marker at all) ^[7]. The character c at the start of a word becomes the vocabulary item c, and the same character in the middle of a word becomes the distinct vocabulary item ##c. A word like word is therefore initially split as w ##o ##r ##d ^[7]. Special tokens such as [PAD], [UNK], [CLS], [SEP], and [MASK] are added at the start of the vocabulary.
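
A minimal sketch of this initialization step follows. It assumes the words have already been pre-tokenized, and it uses the toy word counts from the Hugging Face course as input; the helper names split_word and initial_vocab are illustrative and do not come from any library.

```python
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def split_word(word):
    """Split a word into characters, marking every non-initial one with ##."""
    return [ch if i == 0 else "##" + ch for i, ch in enumerate(word)]

def initial_vocab(word_freqs):
    """Special tokens first, then every distinct character-level symbol."""
    symbols = set()
    for word in word_freqs:
        symbols.update(split_word(word))
    return SPECIAL_TOKENS + sorted(symbols)

# Toy pre-tokenized corpus: word -> count.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

print(split_word("word"))  # ['w', '##o', '##r', '##d']
vocab = initial_vocab(word_freqs)
print(vocab[5:])  # ['##g', '##n', '##s', '##u', 'b', 'h', 'p']
```

Note that u never appears without the prefix in this corpus, so only ##u enters the vocabulary.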

Merge selection

For every adjacent pair (x, y) in the corpus, WordPiece computes a score:

score(x, y) = freq(xy) / (freq(x) * freq(y))

It then merges the pair with the highest score. Intuitively, the score asks how much more often the pair occurs together than would be expected if x and y were independent unigrams ^[7]. Up to a constant normalizing factor, this is the ratio inside the pointwise mutual information of x and y, and it is usually presented as a proxy for the criterion stated in the 2012 paper: choose the merge that most increases the log-likelihood of the training corpus under a unigram language model ^[1]. BPE, in contrast, would simply pick the pair with the highest raw frequency freq(xy) ^[8].

The Hugging Face LLM Course works through a small example with the corpus ("hug", 10), ("pug", 5), ("pun", 12), ("bun", 4), ("hugs", 5) ^[7]. After the initial split with ## prefixes, the most frequent pair is ("##u", "##g") with frequency 20, but the score under WordPiece is only 1/36 because ##u is itself very frequent. The pair ("##g", "##s") has a lower frequency of 5 but a higher score of 1/20, so WordPiece merges ##g and ##s first, producing the new piece ##gs ^[7]. BPE on the same corpus would have merged ("##u", "##g") first.
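
The numbers in this example can be reproduced with a short script. This is a sketch of the scoring rule as described above, not Google's implementation; the helper names are illustrative, and exact fractions are used so the scores print as 1/36 and 1/20.

```python
from collections import Counter
from fractions import Fraction

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

def split_word(word):
    """Character-level split with ## on non-initial characters."""
    return [ch if i == 0 else "##" + ch for i, ch in enumerate(word)]

# Count how often each symbol and each adjacent pair occurs, weighted by word count.
symbol_freq, pair_freq = Counter(), Counter()
for word, freq in word_freqs.items():
    pieces = split_word(word)
    for piece in pieces:
        symbol_freq[piece] += freq
    for pair in zip(pieces, pieces[1:]):
        pair_freq[pair] += freq

# WordPiece score: freq(xy) / (freq(x) * freq(y)).
scores = {
    pair: Fraction(f, symbol_freq[pair[0]] * symbol_freq[pair[1]])
    for pair, f in pair_freq.items()
}

bpe_choice = max(pair_freq, key=pair_freq.get)   # highest raw frequency
wordpiece_choice = max(scores, key=scores.get)   # highest score

print(bpe_choice, pair_freq[bpe_choice])          # ('##u', '##g') 20
print(wordpiece_choice, scores[wordpiece_choice])  # ('##g', '##s') 1/20
print(scores[("##u", "##g")])                      # 1/36
```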

Termination

Merging stops when the vocabulary reaches a predefined target size or when the best available score falls below a threshold. The original 2012 paper used a likelihood-gain threshold; modern implementations more commonly fix the target size up front, since downstream model architectures need a fixed embedding table.
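
Putting the three steps together, a training loop with a fixed target size might look like the sketch below. It is an illustration of the published description rather than a reference implementation: special tokens and the likelihood-gain threshold are omitted, ties are broken by whichever pair is seen first (an arbitrary choice), and all function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def split_word(word):
    return [ch if i == 0 else "##" + ch for i, ch in enumerate(word)]

def best_pair(splits, word_freqs):
    """Return the adjacent pair with the highest WordPiece score, or None."""
    symbol_freq, pair_freq = Counter(), Counter()
    for word, freq in word_freqs.items():
        pieces = splits[word]
        for piece in pieces:
            symbol_freq[piece] += freq
        for pair in zip(pieces, pieces[1:]):
            pair_freq[pair] += freq
    if not pair_freq:
        return None  # every word is already a single piece
    # max() keeps the first pair seen when scores tie.
    return max(
        pair_freq,
        key=lambda p: Fraction(pair_freq[p], symbol_freq[p[0]] * symbol_freq[p[1]]),
    )

def merge_pair(pieces, pair):
    """Replace each occurrence of the pair with the merged piece."""
    merged = pair[0] + pair[1][2:]  # drop the ## of the right-hand piece
    out, i = [], 0
    while i < len(pieces):
        if i + 1 < len(pieces) and (pieces[i], pieces[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(pieces[i])
            i += 1
    return out

def train_wordpiece(word_freqs, target_size):
    splits = {w: split_word(w) for w in word_freqs}
    vocab = sorted({p for pieces in splits.values() for p in pieces})
    while len(vocab) < target_size:  # termination: fixed vocabulary budget
        pair = best_pair(splits, word_freqs)
        if pair is None:
            break
        splits = {w: merge_pair(pieces, pair) for w, pieces in splits.items()}
        vocab.append(pair[0] + pair[1][2:])
    return vocab

word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
vocab = train_wordpiece(word_freqs, target_size=9)
print(vocab[-2:])  # ['##gs', 'hu']
```

The toy corpus starts with seven symbols, so a target of nine allows exactly two merges. The second merge is a six-way tie at 1/36, which is why the tie-breaking rule matters for reproducibility.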

Inference (encoding new text)

At inference time, BERT-style WordPiece does not store the merge rules. Instead it stores only the final vocabulary and runs a greedy longest-match-first algorithm against each word: it "finds the longest subword that is in the vocabulary, then splits on it" ^[7]. Concretely, the encoder finds the longest prefix of the word that appears in the vocabulary and emits that piece, then prefixes the remainder with ## and repeats the same procedure on it. If at any point no prefix of the remaining string is in the vocabulary, the entire original word is emitted as [UNK], not as a partial sequence ^[7]. This is one of the more visible behavioral differences from BPE, which marks only the characters missing from its vocabulary as unknown and still tokenizes the rest of the word.

For example, with the toy vocabulary trained on the corpus above, the word hugs tokenizes as [hug, ##s] because hug is the longest prefix in the vocabulary. A WordPiece-encoded bugs becomes [b, ##u, ##gs]. A word like mug, whose initial character m is not in the vocabulary at all, tokenizes as [[UNK]]; so does bum, whose final piece ##m is missing, rather than [b, ##u, [UNK]] or any other partial decomposition ^[7].
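
The longest-match procedure can be sketched in a few lines. The vocabulary below is the one the Hugging Face course reports for its toy example after three merges and is taken as given here; encode_word is an illustrative name, not a library function.

```python
def encode_word(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first encoding of a single pre-tokenized word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest candidate first, then shrink it from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary as reported in the Hugging Face course example.
vocab = {"b", "h", "p", "##g", "##n", "##s", "##u", "##gs", "hu", "hug"}

print(encode_word("hugs", vocab))  # ['hug', '##s']
print(encode_word("bugs", vocab))  # ['b', '##u', '##gs']
print(encode_word("mug", vocab))   # ['[UNK]']
print(encode_word("bum", vocab))   # ['[UNK]']
```

Returning early on the first miss is what makes the fallback all-or-nothing: bum matches b and ##u before failing on ##m, yet none of the matched pieces survive.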

How does WordPiece differ from BPE and Unigram?

Three subword algorithms dominate modern NLP: BPE, WordPiece, and the unigram language model. SentencePiece, often listed alongside them, is really a library that hosts BPE or unigram. The table below summarizes the differences.

Algorithm	Direction	Merge or prune criterion	Stores	Used by
BPE (Sennrich et al. 2016)	Bottom-up, grow	Most frequent pair	Merge rules	GPT, GPT-2 (byte-level), RoBERTa, BART, Llama family
WordPiece (Schuster & Nakajima 2012)	Bottom-up, grow	Pair maximizing likelihood gain `freq(xy) / (freq(x)*freq(y))`	Final vocabulary only	BERT, DistilBERT, ELECTRA, mBERT, MobileBERT
Unigram LM (Kudo 2018)	Top-down, prune	Remove pieces whose deletion least decreases corpus likelihood	Vocabulary with probabilities	T5, ALBERT, XLNet, mBART (via SentencePiece)
SentencePiece (Kudo & Richardson 2018)	Either (BPE or unigram), trained on raw text	Inherits from BPE or unigram	Vocabulary, optional merges	Multilingual models, T5 family, XLNet, ALBERT

A few cross-cutting points are worth pulling out. BPE's bottom-up frequency criterion and WordPiece's likelihood criterion produce vocabularies that overlap substantially in the most common pieces but diverge on the long tail. WordPiece tends to keep more morpheme-like fragments, since a pair whose individual parts are rare gets a high score even when the pair itself is not especially frequent ^[7]. BPE tends to keep more highly frequent surface n-grams. The unigram LM approach is conceptually closer to a probabilistic model: it learns a probability for each piece and at encoding time picks the most likely segmentation under the unigram model, which makes it easy to sample alternative segmentations for subword regularization ^[5].
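
To make the contrast with greedy longest-match concrete, the sketch below picks the most likely segmentation under a unigram model with a small dynamic program (Viterbi-style). The piece probabilities are invented for illustration and are not normalized or taken from any trained model; best_segmentation is a hypothetical helper, not the SentencePiece API.

```python
import math

def best_segmentation(word, piece_probs):
    """Return the segmentation maximizing the product of piece probabilities."""
    n = len(word)
    # best[i] = (best log-probability of word[:i], start index of its last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in piece_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(piece_probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # no segmentation covers the word
    # Walk the back-pointers from the end of the word.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]

# Hypothetical piece probabilities, for illustration only.
piece_probs = {
    "h": 0.05, "u": 0.1, "g": 0.05, "s": 0.1,
    "hu": 0.05, "ug": 0.1, "gs": 0.05, "hug": 0.2,
}

print(best_segmentation("hugs", piece_probs))  # ['hug', 's']
```

Because every candidate segmentation has a score, the same table could be used to sample lower-ranked segmentations instead of always taking the best one, which is the idea behind subword regularization.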

SentencePiece is sometimes loosely called "WordPiece" in conversation, but the comparison in the official SentencePiece README notes that the BPE algorithm used in WordPiece is slightly different from the original BPE, and SentencePiece itself implements BPE and unigram, not WordPiece ^[9]. SentencePiece's BPE mode is sometimes used to approximate WordPiece behavior, but it marks word starts with ▁ rather than continuations with ##, and it is not a bit-exact reimplementation. The Hugging Face tokenizers library provides a WordPiece model with BERT-compatible greedy longest-match encoding, exposed through the BertWordPieceTokenizer class, and the BertTokenizerFast class in transformers uses it to load BERT checkpoints ^[8]. Its trainer, however, is described in the Hugging Face LLM Course as using BPE for training rather than the WordPiece scoring rule ^[7].

What are the main WordPiece implementations?

There is no single canonical reference implementation of WordPiece because Google did not open-source the original training code ^[7]. Several implementations have grown up around the algorithm, each with slightly different defaults.

Implementation	Maintainer	Training	Inference
`tensor2tensor` SubwordTextEncoder	Google Brain	Heuristic likelihood-based merger close to WordPiece	Greedy longest-match
TensorFlow Text `BertTokenizer` and `WordpieceTokenizer`	Google	Wraps a separate trainer such as `wordpiece_tokenizer_learner`	Greedy longest-match
Hugging Face `tokenizers` (`BertWordPieceTokenizer`)	Hugging Face	Trains to a target vocabulary size; per the LLM Course, uses BPE for training rather than the `freq(xy)/(freq(x)*freq(y))` score ^[7]	Greedy longest-match, `##` continuation
Hugging Face `transformers` (`BertTokenizer`, `BertTokenizerFast`)	Hugging Face	Loads pretrained vocabulary	Greedy longest-match
SentencePiece (with BPE config)	Google	BPE on raw text	Approximation of WordPiece behavior

The Hugging Face LLM Course cautions that even Hugging Face's own train_new_from_iterator does not reproduce WordPiece exactly, since the underlying tokenizers library uses BPE for training, not WordPiece ^[7]. For new projects that want to reproduce BERT-style behavior, the most common path is to load the released BERT vocabulary directly rather than train a fresh one.

What are the strengths and weaknesses of WordPiece?

WordPiece's main strength is the same as that of any good subword tokenizer: it gives a fixed-size vocabulary that can still represent any word, including rare names, technical jargon, and compound forms in morphologically rich languages such as Turkish, Finnish, and German. The likelihood-gain criterion produces pieces that often line up with morphemes, which is convenient for downstream tasks like named entity recognition where subword consistency matters. Because the vocabulary is fixed and small (typically 30,000 to 50,000 pieces for monolingual models, around 120,000 for highly multilingual ones ^[10]), the embedding table and softmax stay tractable.

The algorithm has a few real weaknesses. Greedy longest-match encoding is deterministic, so a given pre-tokenized word always splits the same way, but the splits are sometimes non-intuitive: closely related forms of a word can be cut at different points depending on which longer pieces happen to be in the vocabulary, and the splits do not always respect linguistic morphology. The fall-back to [UNK] for an entire word when a single character is missing is harsh, and modern pipelines work around it by either guaranteeing every byte is in the vocabulary (the byte-level BPE approach used by GPT-2) or by adding a character-level fallback ^[7].

WordPiece also inherits the downsides of pre-tokenization. The standard BERT pipeline assumes the input has been pre-split into rough words on whitespace and punctuation before WordPiece sees it. For Chinese, Japanese, Thai, and other languages without obvious word boundaries, this assumption breaks. The BERT release works around it for Chinese by adding whitespace around every character in the CJK Unicode range before WordPiece runs, so Chinese is effectively tokenized character by character ^[10]. SentencePiece-based unigram tokenizers handle this case more gracefully because they operate directly on raw text ^[6].

Inconsistent segmentation of closely related surface forms (for example, cased and uncased or inflected variants of the same word) can also affect transfer learning, since downstream models that rely on consistent surface forms (for example, span extraction in question answering) are more brittle when related words are tokenized inconsistently.

How does WordPiece relate to character-level and byte-level tokenization?

WordPiece is a compromise between word-level tokenizers (which produce short sequences but huge vocabularies and unknown-word problems) and character-level or byte-level tokenizers (which have tiny vocabularies but very long sequences). Two notable character-level or byte-level alternatives have been published since BERT.

ByT5 (Xue et al. 2021) is a variant of T5 that operates on UTF-8 bytes rather than SentencePiece tokens ^[11]. The approach removes the learned vocabulary entirely, at the cost of much longer input sequences for the transformer to process. ByT5 is more robust to noise and to languages that the original SentencePiece vocabulary did not cover well.

CANINE (Clark et al. 2021) skips the pre-tokenizer entirely and operates on Unicode codepoints, using downsampling inside the encoder to keep the sequence manageable ^[12]. The CANINE paper presents itself explicitly as an alternative to WordPiece for multilingual settings where vocabulary mismatch is a recurring problem.
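
The sequence-length cost of these two approaches is easy to see with plain Python. The sketch below only counts units for an arbitrary example string; it does not reproduce either model's actual input IDs or special tokens.

```python
# Example string written with escapes so the code points are unambiguous:
# "na" + U+00EF + "ve" + space + two CJK characters.
text = "na\u00efve \u6771\u4eac"

words = text.split()                    # whitespace words, the units a WordPiece pipeline starts from
code_points = [ord(ch) for ch in text]  # CANINE-style input: one unit per Unicode code point
utf8_bytes = list(text.encode("utf-8")) # ByT5-style input: one unit per UTF-8 byte

print(len(words), len(code_points), len(utf8_bytes))  # 2 8 13
```

Non-ASCII characters widen the gap: U+00EF takes two bytes and each CJK character takes three, so byte-level input grows fastest on exactly the scripts where subword vocabularies tend to be weakest.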

Neither byte-level nor character-level approaches have displaced WordPiece in production. They show up in research and in models that need to support languages or scripts that WordPiece-trained checkpoints handle poorly. WordPiece, BPE, and the unigram LM remain the three workhorse subword schemes in modern NLP.

Is WordPiece still used today?

WordPiece is now nearly fifteen years old, and most new pre-training projects launched since 2022 have moved to byte-level BPE (for Llama-style decoder-only models) or to SentencePiece unigram (for T5-style encoder-decoder models). The original BERT vocabularies remain among the most-downloaded tokenizers on the Hugging Face Hub, however, and the BERT-family models that ship with WordPiece (DistilBERT, MobileBERT, ELECTRA, mBERT, and a long tail of domain-specific BERTs) remain heavily used in production for classification, retrieval, named entity recognition, and other encoder-only tasks.

WordPiece also lives on through SentencePiece's BPE mode, since several teams use SentencePiece as a more flexible drop-in replacement for the original WordPiece pipeline ^[9]. The phrase "WordPiece tokenizer" is sometimes used loosely in modern documentation to mean any greedy longest-match subword tokenizer with a ## continuation prefix, even when the underlying training algorithm is BPE.

References

1. Schuster, M., and Nakajima, K. (2012). Japanese and Korean voice search. *2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)*, Kyoto, Japan, March 25-30, 2012, pp. 5149-5152.
2. Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., et al. (2016). Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144.
3. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. *Proceedings of NAACL-HLT 2019*, pp. 4171-4186. Also arXiv:1810.04805.
4. Sennrich, R., Haddow, B., and Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. *Proceedings of ACL 2016*.
5. Kudo, T. (2018). Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. *Proceedings of ACL 2018*.
6. Kudo, T., and Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. *EMNLP 2018: System Demonstrations*, pp. 66-71.
7. Hugging Face. WordPiece tokenization. *LLM Course, Chapter 6.6*. https://huggingface.co/learn/llm-course/en/chapter6/6.
8. Hugging Face. Tokenization algorithms (Tokenizer summary). *transformers documentation*. https://huggingface.co/docs/transformers/tokenizer_summary.
9. Google Research. SentencePiece (open-source repository). https://github.com/google/sentencepiece.
10. Google Research. Multilingual BERT README. https://github.com/google-research/bert/blob/master/multilingual.md.
11. Xue, L., Barua, A., Constant, N., Al-Rfou, R., Narang, S., Kale, M., Roberts, A., and Raffel, C. (2021). ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. arXiv:2105.13626.
12. Clark, J. H., Garrette, D., Turc, I., and Wieting, J. (2021). CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. *Transactions of the Association for Computational Linguistics*. arXiv:2103.06874.
