SentencePiece
Last reviewed
Apr 28, 2026
Sources
18 citations
Review status
Source-backed
Revision
v1 · 3,510 words
SentencePiece is an open-source subword tokenization library and detokenizer developed at Google and originally introduced by Taku Kudo and John Richardson in their 2018 EMNLP system demonstration paper [1]. The library is language-independent and operates directly on raw Unicode text, removing the need for any external pre-tokenization step. SentencePiece supports two main subword segmentation algorithms: byte-pair encoding (BPE) and a unigram language model introduced in Kudo's separate ACL 2018 paper on subword regularization [2]. The library is released under the Apache 2.0 license and is hosted on GitHub at github.com/google/sentencepiece [3].
SentencePiece has become a foundational piece of modern natural language processing infrastructure. It powers the tokenizers used by T5, ALBERT, XLNet, mT5, mBART, NLLB, Marian NMT, Llama 1 and 2, Mistral, and Gemma, among many other large-scale language models. The library has also been ported into the Hugging Face Tokenizers ecosystem, where its design choices influenced a generation of fast Rust-based tokenizers.
Most early subword tokenizers, including the influential BPE implementation released by Sennrich, Haddow, and Birch in 2016 for neural machine translation, expected pre-tokenized text as input [4]. The standard pipeline for English or other space-separated languages typically involved a chain of tools such as the Moses tokenizer, which would split punctuation from words, normalize quote characters, and apply language-specific rules before any subword model saw the text. This worked acceptably for European languages but introduced several real problems.
First, languages without explicit word boundaries, including Chinese, Japanese, Thai, and many Southeast Asian scripts, do not have whitespace separating words. Pre-tokenization for these languages requires external segmenters such as MeCab for Japanese or Stanford Word Segmenter for Chinese, each adding installation complexity and producing slightly different segmentations across versions. Second, end-to-end models that learn directly from raw text cannot easily integrate language-specific tokenizers without breaking the goal of language independence. Third, detokenization in this multistage pipeline is lossy because spacing and punctuation conventions cannot always be reconstructed from the token sequence alone.
SentencePiece was designed to address these issues with a single deliberate choice: treat whitespace as a regular character. The library encodes the space character as the meta symbol U+2581 (the LOWER ONE EIGHTH BLOCK character, written ▁ and rendered much like an underscore in many displays) and then trains its subword model on the raw Unicode text. Because the space symbol is part of the vocabulary and appears as a prefix on subword tokens that follow whitespace, detokenization is reduced to concatenating the pieces and replacing U+2581 with a regular space. The reconstruction is exact and language-agnostic, and it requires no auxiliary metadata beyond the model file itself [1].
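The whitespace convention described above can be sketched in a few lines of plain Python. This is a toy illustration rather than the library's implementation: the greedy longest-match segmenter, the helper names, and the tiny vocabulary are all assumptions made for the example, and the leading meta symbol mimics a dummy prefix added before the first word.

```python
# Toy illustration of the whitespace convention; not the SentencePiece implementation.
META = "\u2581"  # U+2581, the meta symbol that stands in for a space


def to_meta(text: str) -> str:
    """Prefix the text with the meta symbol and map every space to it."""
    return META + text.replace(" ", META)


def toy_segment(marked: str, vocab: set) -> list:
    """Greedy longest-match segmentation over a hypothetical vocabulary."""
    pieces, i = [], 0
    while i < len(marked):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(marked), i, -1):
            if marked[i:j] in vocab or j == i + 1:
                pieces.append(marked[i:j])
                i = j
                break
    return pieces


def detokenize(pieces: list) -> str:
    """Concatenate pieces, map meta symbols back to spaces, drop the leading one."""
    return "".join(pieces).replace(META, " ")[1:]


vocab = {META + "Hello", META + "wor", "ld", ",", "!"}  # hypothetical vocabulary
pieces = toy_segment(to_meta("Hello, world!"), vocab)
print(pieces)              # pieces that follow whitespace carry the meta symbol
print(detokenize(pieces))  # Hello, world!
```

Because the space is stored inside the pieces themselves, the round trip needs no information about where the original word boundaries were.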
The second motivation was packaging. The authors wanted a tokenizer that ships as a single self-contained model file, can be loaded into any process without additional dictionaries or scripts, and can be reproduced bit-for-bit across machines. The .model file is encoded in Protocol Buffers format and contains the vocabulary, the merge rules or unigram probabilities, the normalization rules, and a small amount of bookkeeping metadata.
SentencePiece supports four model types selected at training time via the --model_type flag: bpe, unigram, word, and char. The first two are the practically important options; the latter two exist mostly as baselines or for special cases.
The BPE algorithm in SentencePiece is a faithful implementation of the original byte-pair encoding compression algorithm proposed by Gage in 1994 and adapted to subword segmentation by Sennrich et al. in 2016 [4]. Training proceeds bottom-up: the corpus is first split into individual Unicode characters, then the most frequent adjacent pair of symbols is merged into a new symbol, and the process repeats until the vocabulary reaches the target size. The ordered list of merges is stored in the model. At inference time, the same merge list is applied greedily to new input, in the order it was learned during training, which guarantees deterministic segmentation.
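The bottom-up training loop and the ordered application of merges can be sketched as follows. This is a minimal pure-Python sketch, not the library's C++ implementation: it works on a hypothetical word-frequency table, ignores the whitespace meta symbol, and breaks ties by first occurrence, which is an assumption of this example.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words: dict, num_merges: int) -> list:
    """Learn an ordered merge list from a {word: frequency} table."""
    corpus = {tuple(w): f for w, f in words.items()}  # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        corpus = {merge_pair(s, best): f for s, f in corpus.items()}
    return merges


def apply_bpe(word: str, merges: list) -> list:
    """Segment a new word by replaying the merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus statistics.
words = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(words, num_merges=4)
print(merges)                       # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(apply_bpe("lowest", merges))  # ['low', 'est']
```

The word "lowest" never appears in the toy corpus, yet it is segmented into two learned pieces, which is the property that makes subword vocabularies robust to unseen words.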
SentencePiece's BPE differs from the original Sennrich implementation in two visible ways. The first is that it operates on raw text where whitespace has been mapped to U+2581, so merges can cross what used to be word boundaries if such merges are common in the training data. In practice this is rare because cross-word character combinations are not frequent enough to outrank within-word combinations during early merges. The second difference is that the SentencePiece BPE encoder is implemented in C++ with a priority-queue-based merge step, which is substantially faster than the Python reference implementation that shipped with the original paper.
When byte_fallback=True is set, BPE training adds 256 byte-level tokens (one for each value 0x00 through 0xFF) to the vocabulary so that any character that the trained merges cannot produce can still be represented as a sequence of UTF-8 bytes. This option is used by Llama and Gemma to avoid out-of-vocabulary tokens entirely while keeping the merged vocabulary at a manageable size.
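The effect of byte fallback can be illustrated with a small sketch. The code below is a toy model of the idea, not the library's encoder: it works character by character over a hypothetical vocabulary, and it assumes byte pieces are spelled in the `<0xNN>` form, which is how they appear in vocabularies such as Llama's.

```python
def encode_with_byte_fallback(text: str, vocab: set) -> list:
    """Emit known characters as-is and unknown ones as UTF-8 byte pieces."""
    pieces = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        else:
            # One piece per UTF-8 byte, e.g. <0xF0>.
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


def decode_with_byte_fallback(pieces: list) -> str:
    """Reassemble the byte stream and decode it as UTF-8."""
    out = bytearray()
    for piece in pieces:
        if len(piece) == 6 and piece.startswith("<0x") and piece.endswith(">"):
            out.append(int(piece[3:5], 16))
        else:
            out.extend(piece.encode("utf-8"))
    return out.decode("utf-8")


vocab = {"h", "i", " "}  # hypothetical vocabulary without any emoji
pieces = encode_with_byte_fallback("hi \U0001F600", vocab)
print(pieces)  # ['h', 'i', ' ', '<0xF0>', '<0x9F>', '<0x98>', '<0x80>']
print(decode_with_byte_fallback(pieces) == "hi \U0001F600")  # True
```

The trade-off is sequence length: a character outside the vocabulary costs one token per UTF-8 byte instead of a single unknown token, but no information is lost.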
The unigram subword model, introduced by Kudo in 2018 [2], takes the opposite top-down approach. Training begins with a large seed vocabulary of candidate substrings drawn from the corpus, often built using suffix arrays or simple heuristics, and then iteratively prunes pieces whose removal causes the smallest decrease in the marginal likelihood of the data. The probability of a sentence under the unigram model is the product of the probabilities of the pieces in the chosen segmentation, marginalized over all valid segmentations using a forward-backward style computation during training and a Viterbi search at inference time.
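The inference-time Viterbi search mentioned above can be sketched with a short dynamic program. This is a minimal sketch under stated assumptions: the piece probabilities are made-up, unnormalized toy values, and the function name is hypothetical rather than part of the SentencePiece API.

```python
import math


def viterbi_segment(text: str, logp: dict):
    """Return the highest-scoring segmentation and its log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the string.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1], best[n]


# Hypothetical toy piece probabilities (not normalized).
probs = {"unigram": 0.01, "uni": 0.2, "gram": 0.2, "un": 0.1, "i": 0.01,
         "u": 0.01, "n": 0.01, "g": 0.01, "r": 0.01, "a": 0.01, "m": 0.01}
logp = {piece: math.log(p) for piece, p in probs.items()}

pieces, score = viterbi_segment("unigram", logp)
print(pieces)  # ['uni', 'gram']
```

The score of a segmentation is the sum of piece log-probabilities, which is the log of the product described in the text; here "uni" + "gram" (0.2 × 0.2 = 0.04) outranks the single piece "unigram" (0.01).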
The unigram model has several attractive properties. Because it is probabilistic, multiple valid segmentations exist for any input, and the most probable one is selected at inference. The library can also sample alternative segmentations from the posterior over tokenizations, which is the basis of subword regularization. Vocabularies trained with the unigram method tend to contain fewer suffix-only or prefix-only fragments and more linguistically meaningful pieces such as common morphemes, although this is a tendency rather than a guarantee.
Unigram training is more expensive than BPE training for the same target vocabulary size because each iteration requires a likelihood computation over a candidate set that is often several times larger than the final vocabulary. In return, unigram models often perform slightly better in neural machine translation BLEU evaluations on low-resource language pairs, especially when subword regularization is applied during training [2].
The word model splits on whitespace only and uses each unique word as a token, capped at the requested vocabulary size with the rest going to a single unknown token. The char model uses each Unicode codepoint as its own token. Both options are mostly used for ablation studies and for very small models where subword granularity is not needed.
SentencePiece's design emphasizes a small set of properties that together make it convenient for production use:
- Raw text can be passed directly to spm_train without language-specific preprocessing, including text in scripts without word boundaries.
- The trained tokenizer is a single .model file, typically a few hundred kilobytes to a few megabytes depending on vocabulary size.
- byte_fallback=True ensures that any input string can be encoded, even if it contains characters that the trained vocabulary does not cover, by emitting raw UTF-8 byte tokens for the unknown spans.
- The vocabulary can include control symbols (<s>, </s>, <unk>, <pad>) and user-defined symbols (such as <mask> or task-specific markers) that are guaranteed to be tokenized as single units regardless of frequency.

SentencePiece exposes a small set of command-line tools and bindings for several languages. The basic training, encoding, and decoding workflow looks as follows in shell.
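```shell
# Train a 32,000-piece BPE model on corpus.txt.
spm_train --input=corpus.txt \
  --model_prefix=mymodel \
  --vocab_size=32000 \
  --model_type=bpe

# Encode text to pieces, then decode the pieces back to text.
spm_encode --model=mymodel.model < text.txt > tokens.txt
spm_decode --model=mymodel.model < tokens.txt > reconstructed.txt
```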
The Python bindings, installed with pip install sentencepiece, expose the same operations through a class-based API.
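```python
import sentencepiece as spm

# Train a BPE model; this writes mymodel.model and mymodel.vocab.
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='mymodel',
    vocab_size=32000,
    model_type='bpe',
    byte_fallback=True,
)

# Load the trained model and round-trip a sentence.
sp = spm.SentencePieceProcessor(model_file='mymodel.model')
pieces = sp.encode_as_pieces('Hello, world!')  # subword strings
ids = sp.encode_as_ids('Hello, world!')        # vocabulary ids
text = sp.decode_pieces(pieces)                # back to plain text
```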
For the unigram model, the same code can be used with model_type='unigram', and the runtime additionally exposes the sample_encode_as_pieces method, which draws a tokenization from the posterior. Its nbest_size argument limits how many candidate segmentations are considered, and alpha sets the temperature of the sampling distribution; together they are the entry point to subword regularization during training of downstream models.
The library is written in C++ and ships with official Python bindings; bindings for other languages, including Java, JavaScript via WebAssembly, Rust, and Go, are available through community projects. The shipped Python wheels include precompiled binaries for Linux, macOS, and Windows on common architectures.
SentencePiece tokenizers underpin a large fraction of widely used language models. The following table summarizes notable models, their tokenizer choice within the SentencePiece family, and approximate vocabulary size, based on each model's published configuration or paper.
| Model | Tokenizer type | Vocabulary size | Notes |
|---|---|---|---|
| T5 | SentencePiece unigram | 32,000 | English-centric, 2019 [5] |
| mT5 | SentencePiece unigram | 250,000 | 101 languages, 2020 [6] |
| ALBERT | SentencePiece unigram | 30,000 | 2019 [7] |
| XLNet | SentencePiece | 32,000 | 2019 [8] |
| mBART | SentencePiece | 250,000 | Multilingual denoising autoencoder |
| NLLB | SentencePiece | 256,000 | 200+ languages, 2022 [9] |
| Marian NMT | SentencePiece (default) | varies | Used widely for production MT |
| Llama 1 | SentencePiece BPE | 32,000 | byte_fallback enabled, 2023 [10] |
| Llama 2 | SentencePiece BPE | 32,000 | Same tokenizer family as Llama 1, 2023 |
| Mistral 7B | SentencePiece BPE | 32,000 | byte_fallback enabled, 2023 |
| Gemma | SentencePiece | 256,000 | Subset of Gemini's tokenizer, 2024 [11] |
| PaLM, PaLM 2 | SentencePiece | 256,000 | byte_fallback enabled |
A notable contrast appears with Llama 3, which switched to a tiktoken-style byte-level BPE tokenizer with a vocabulary of 128,000 instead of continuing the SentencePiece tradition of Llama 1 and 2. This reflects a broader trend among frontier laboratories: while SentencePiece remains the standard for many open-weights model families and for multilingual systems, the most recent generation of frontier English-centric models often prefers byte-level BPE for additional speed and for cleaner handling of arbitrary byte sequences. The Hugging Face Tokenizers library is sometimes used as a drop-in replacement that can load SentencePiece models while running entirely in Rust.
BERT does not use SentencePiece; it uses a WordPiece tokenizer, although WordPiece and SentencePiece's unigram method share conceptual ancestry in likelihood-based vocabulary selection. Many GPT family models from OpenAI use a custom byte-level BPE that predates and is independent of SentencePiece.
Subword regularization is the training-time technique that gives the unigram model its main practical advantage over greedy BPE [2]. The idea is straightforward. Instead of always feeding the model the single most likely segmentation of each input sentence, the training pipeline samples a different segmentation each epoch, drawn from the posterior over tokenizations under the trained unigram model. The model therefore sees the same content under many different surface forms, which has a regularizing effect similar to dropout on token sequences.
In the original ACL 2018 paper, Kudo reported BLEU improvements of roughly one to three points on low-resource and out-of-domain neural machine translation tasks, with the largest gains appearing where data was scarce or where domain shift between training and test was most severe [2]. The technique requires no architectural changes to the downstream model. It only changes how the input is tokenized at each training step. Two hyperparameters control the sampling: nbest_size, which truncates the posterior to the top n candidates before sampling, and alpha, the temperature applied to the truncated distribution. Setting nbest_size=-1 and a small positive alpha such as 0.1 corresponds to drawing from the full posterior with mild temperature smoothing.
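The roles of nbest_size and alpha can be made concrete with a brute-force sketch. This is an illustration under stated assumptions, not the library's sampler: it enumerates every segmentation of a short string (feasible only for toy inputs), uses made-up piece probabilities, and treats any non-positive nbest_size as "keep all candidates". Each candidate is weighted by its probability raised to the power alpha.

```python
import math
import random


def all_segmentations(text: str, logp: dict):
    """Yield (pieces, log-probability) for every segmentation the vocabulary covers."""
    if not text:
        yield [], 0.0
        return
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in logp:
            for rest, score in all_segmentations(text[end:], logp):
                yield [piece] + rest, logp[piece] + score


def sample_segmentation(text, logp, nbest_size=-1, alpha=0.1, rng=random):
    """Sample a segmentation with weight P(segmentation) ** alpha."""
    candidates = sorted(all_segmentations(text, logp),
                        key=lambda c: c[1], reverse=True)
    if nbest_size > 0:
        candidates = candidates[:nbest_size]  # truncate to the n best
    weights = [math.exp(alpha * score) for _, score in candidates]
    pieces = [p for p, _ in candidates]
    return rng.choices(pieces, weights=weights, k=1)[0]


# Hypothetical toy piece probabilities (not normalized).
probs = {"unigram": 0.01, "uni": 0.2, "gram": 0.2, "un": 0.1, "i": 0.01,
         "u": 0.01, "n": 0.01, "g": 0.01, "r": 0.01, "a": 0.01, "m": 0.01}
LOGP = {piece: math.log(p) for piece, p in probs.items()}

rng = random.Random(0)
for _ in range(3):
    # Each call may return a different segmentation of the same string.
    print(sample_segmentation("unigram", LOGP, nbest_size=-1, alpha=0.1, rng=rng))
```

A small alpha flattens the distribution so that unlikely segmentations are drawn more often, while a large alpha concentrates the samples on the most probable segmentation; nbest_size=1 removes the randomness entirely.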
For BPE, an analogous technique called BPE-dropout was introduced by Provilkov et al. in 2020 [12]. BPE-dropout randomly skips merges with a small probability during encoding, which produces alternative segmentations even though the underlying vocabulary is fixed and ordered. It leaves the BPE training algorithm unchanged and acts only at encode time. SentencePiece provides this behavior for BPE models through its sampling-based encoding interface, where the alpha parameter acts as the merge dropout probability.
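The idea of skipping merges at encode time can be sketched as follows. This is a simplified single-pass variant written for illustration, not the exact procedure from the BPE-dropout paper or from any library: each place where a learned merge could apply is skipped independently with probability p, and the merge list is a hypothetical one.

```python
import random


def merge_with_dropout(symbols: list, pair: tuple, p: float, rng) -> list:
    """Apply one merge rule, skipping each matching position with probability p."""
    out, i = [], 0
    while i < len(symbols):
        matches = i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair
        if matches and rng.random() >= p:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def encode_bpe_dropout(word: str, merges: list, p: float = 0.1, rng=None) -> list:
    """Replay the ordered merge list with random skips."""
    rng = rng or random.Random()
    symbols = list(word)
    for pair in merges:
        symbols = merge_with_dropout(symbols, pair, p, rng)
    return symbols


# Hypothetical ordered merge list, as a BPE trainer might produce.
merges = [("e", "s"), ("es", "t"), ("l", "o"), ("lo", "w")]

print(encode_bpe_dropout("lowest", merges, p=0.0))  # ['low', 'est']
print(encode_bpe_dropout("lowest", merges, p=1.0))  # ['l', 'o', 'w', 'e', 's', 't']
print(encode_bpe_dropout("lowest", merges, p=0.3, rng=random.Random(0)))
```

With p=0 the encoder reduces to ordinary deterministic BPE, and with p=1 it degenerates to character-level output; values in between expose the downstream model to a mix of coarse and fine segmentations.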
The table below summarizes the main subword tokenization libraries and methods used in modern NLP, with their distinguishing characteristics.
| Library or method | Primary algorithms | Input expectation | Notable users |
|---|---|---|---|
| Sennrich BPE (subword-nmt) | BPE | Pre-tokenized text | Early NMT systems, original BPE work |
| WordPiece | Likelihood-greedy BPE-like | Pre-tokenized text | BERT, early Transformer encoders |
| SentencePiece BPE | BPE | Raw text | Llama 1/2, Mistral, Gemma |
| SentencePiece unigram | Unigram LM | Raw text | T5, mT5, ALBERT, XLNet, NLLB |
| Hugging Face Tokenizers | BPE, WordPiece, Unigram | Raw or pre-tokenized | Hugging Face Transformers ecosystem |
| tiktoken | Byte-level BPE | Raw bytes | OpenAI GPT models, Llama 3 (variant) |
| Tokenmonster | Approximate BPE | Raw bytes | Community |
The practical comparison usually comes down to four axes. Speed of encoding favors tiktoken and the Rust-based Hugging Face Tokenizers, both of which can be several times faster than SentencePiece on long documents. Vocabulary control favors SentencePiece, which offers the most explicit options for normalization, byte fallback, and user-defined symbols. Multilingual coverage favors SentencePiece because almost every major multilingual model has used it. Training reproducibility favors SentencePiece because the single self-contained model file makes bit-for-bit reproduction straightforward.
SentencePiece is distributed under the Apache 2.0 license, which permits commercial use, modification, and redistribution provided that copyright notices are preserved and a copy of the license is included. The library has been continuously maintained since its 2018 release, with releases on PyPI and on the GitHub repository. As of this article's April 2026 review, the project has accumulated more than ten thousand GitHub stars and is downloaded tens of millions of times per month from PyPI alone.
The library has been integrated into many higher-level frameworks. Hugging Face Transformers ships SentencePiece-backed tokenizer classes for every model that was originally trained with SentencePiece, including T5Tokenizer, AlbertTokenizer, XLNetTokenizer, MarianTokenizer, MBartTokenizer, NllbTokenizer, and the LlamaTokenizer and GemmaTokenizer families. Many of these have a corresponding fast version backed by the Hugging Face Tokenizers library that can load the same SentencePiece model file but run encoding in Rust. The fast versions are usually preferred at inference time because they support multi-threaded batching, while the slow Python-bound versions remain useful for debugging because they expose the underlying SentencePiece processor directly.
SentencePiece has a number of practical limits, most of which stem from its design vintage rather than from algorithmic flaws.
Very large vocabularies, beyond roughly 500,000 pieces, become slow and memory-hungry to train. Both BPE and unigram training keep the candidate counts and statistics in memory, which means RAM scales with vocabulary size. For multilingual systems that target hundreds of languages with extremely large vocabularies, training is often distributed across machines using sharded corpora and approximate counting.
Unigram training is computationally more expensive than BPE training, by roughly a factor of two to four for the same target vocabulary size, depending on the corpus and the seed vocabulary heuristic.
Detokenization assumes the U+2581 whitespace convention. If a downstream system mixes tokens from a SentencePiece tokenizer with tokens from a different tokenizer that uses a different whitespace convention, detokenization can produce unexpected output. This becomes relevant in tool-augmented systems that splice token streams from multiple sources.
Modern frontier large language models are increasingly moving to byte-level BPE implemented in tiktoken or in the Hugging Face Tokenizers library, primarily for speed and for cleaner handling of arbitrary input bytes including emoji sequences and control characters. SentencePiece's unigram model in particular is rare in the latest generation of public frontier model releases, although it remains the dominant choice for multilingual systems.
Serialized model files are tied to the SentencePiece library version's protobuf schema. Forward and backward compatibility have been good in practice across releases, but very old models can occasionally need a migration step when loaded with newer versions, and newer options such as byte fallback cannot be loaded by older library versions. Finally, the library does not directly support learned tokenizers that are jointly trained with the downstream model, nor does it expose hooks for online vocabulary expansion.
SentencePiece's design has been influential well beyond its own codebase. The Hugging Face Tokenizers library, written in Rust with Python bindings, generalizes SentencePiece's approach into a modular pipeline of normalizer, pre-tokenizer, model, and post-processor. It supports BPE, WordPiece, and unigram models, serializes tokenizers in its own JSON format rather than SentencePiece's Protocol Buffers format, and can import most SentencePiece .model files. The library is several times faster than SentencePiece on encoding workloads and is the default tokenizer in the Hugging Face Transformers ecosystem.
The tiktoken library released by OpenAI in 2022 takes a different path. It uses byte-level BPE with a Rust core, and it is heavily optimized for the encoding patterns specific to GPT-family models. It does not implement unigram models or expose a training entry point in the released version. Its main appeal is raw speed.
Related research directions include character-level and byte-level models that aim to dispense with subword tokenization entirely, such as Charformer, ByT5, and CANINE. These approaches eliminate the tokenization step at the cost of longer sequences and additional computation in the model itself, and they have not displaced subword tokenization in mainstream use.