SentencePiece
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 4,664 words
SentencePiece is an open-source subword tokenization library and detokenizer developed at Google and originally introduced by Taku Kudo and John Richardson in their 2018 EMNLP system demonstration paper.[1] The library is language-independent and operates directly on raw Unicode text, removing the need for any external pre-tokenization step. SentencePiece supports two main subword segmentation algorithms: byte-pair encoding (BPE), and a unigram language model introduced in Kudo's separate ACL 2018 paper on subword regularization.[2] The library is released under the Apache 2.0 license and is hosted on GitHub at github.com/google/sentencepiece.[3]
SentencePiece has become a foundational piece of modern natural language processing infrastructure. It powers the tokenizers used by T5, ALBERT, XLNet, mT5, mBART, NLLB, Marian NMT, Llama 1 and 2, Mistral 7B, Gemma, and PaLM, among other large-scale language models.[1][3][11][17][18] Its design choices also influenced a generation of fast Rust-based tokenizers in the Hugging Face Tokenizers ecosystem, which can load SentencePiece model files directly.[15]
Most early subword tokenizers, including the influential BPE implementation released by Sennrich, Haddow, and Birch in 2016 for neural machine translation, expected pre-tokenized text as input.[4] The standard pipeline for English or other space-separated languages typically involved a chain of tools such as the Moses tokenizer, which would split punctuation from words, normalize quote characters, and apply language-specific rules before any subword model saw the text. This worked acceptably for European languages but introduced several real problems.
First, Chinese, Japanese, Thai, and several other languages are written without whitespace between words. Pre-tokenization for these languages requires external segmenters such as MeCab for Japanese or the Stanford Word Segmenter for Chinese, each adding installation complexity and producing slightly different segmentations across versions.[1] Second, end-to-end models that learn directly from raw text cannot easily integrate language-specific tokenizers without breaking the goal of language independence. Third, detokenization in this multistage pipeline is lossy because spacing and punctuation conventions cannot always be reconstructed from the token sequence alone.
A second motivation was engineering. Earlier subword toolkits relied on multiple Python scripts, shell pipelines, and ancillary lookup tables that were hard to package inside a production model server. The authors of SentencePiece wanted a tokenizer that ships as a single self-contained model file, can be loaded into any process without additional dictionaries or scripts, and can be reproduced bit-for-bit across machines.[1][3] The model file is encoded in Protocol Buffers format and contains the vocabulary, the merge rules or unigram probabilities, the normalization rules, and bookkeeping metadata. Eliminating the external pre-tokenization step also made segmentation a pure function of the trained model file, which can be checked into source control and shipped along with the trained weights.[1]
SentencePiece was developed primarily by Taku Kudo and John Richardson at Google in Tokyo, with the first public release on GitHub in 2017 and the academic publication appearing as a system demonstration at EMNLP 2018.[1][3] Taku Kudo had already been a well-known figure in Japanese natural language processing for more than a decade as the original author of MeCab, the most widely used Japanese morphological analyzer, and of CRF++, an early and influential conditional random field implementation. His prior work in Japanese segmentation strongly shaped the design choices that made SentencePiece language-agnostic from the start: the same researcher who had spent years building MeCab understood firsthand why depending on a language-specific external segmenter for multilingual text was a structural problem.[1]
The unigram language model that became one of SentencePiece's two flagship algorithms was introduced by Kudo earlier in 2018 in a separate ACL paper on subword regularization, where it was used to provide stochastic alternative segmentations for training neural translation models.[2] The SentencePiece library bundled this algorithm together with BPE in a single C++ implementation, with the explicit goal of making both methods available behind a common interface.
The system demonstration paper for the library itself, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing," appeared in the proceedings of EMNLP 2018 system demonstrations and on arXiv as preprint 1808.06226.[1] John Richardson co-authored the demonstration paper and contributed substantially to the library's engineering, including its Python and TensorFlow bindings.[1][3]
SentencePiece's design rests on a few deliberate principles that distinguish it from earlier subword tokenizers and that have been broadly adopted by its successors.[1][3]
The first principle is that the tokenizer must operate on raw text rather than pre-tokenized input. The library encodes the space character as the meta symbol U+2581 (LOWER ONE EIGHTH BLOCK) and then trains its subword model on the raw Unicode text.[1] Because the space symbol is part of the vocabulary and appears as a prefix on subword tokens that follow whitespace, detokenization is reduced to concatenating the pieces and replacing U+2581 with a regular space. This eliminates the need for language-specific tokenizers and makes the model file the sole source of segmentation behavior.
The second principle is lossless detokenization. The library is designed so that for any input text, encoding followed by decoding produces output that is byte-for-byte identical to the input, subject only to the configured Unicode normalization.[1] This contrasts with earlier pipelines where punctuation splitting and whitespace handling discarded information that could not be recovered at decoding time.
The third principle is language independence. Any UTF-8 text corpus can be fed to the trainer without language-specific preprocessing or dictionaries, including text in scripts without word boundaries.[1] This is essential for multilingual models that share a single tokenizer across many languages, including modern systems such as mT5 (101 languages) and NLLB (more than 200 languages).[6][9]
The fourth principle is self-containment. The trained model is a single Protocol Buffers serialized file containing all information needed to reproduce segmentation, with no external dictionaries or language-specific resources.[3] The fifth principle is determinism: given a fixed model file and fixed normalization, identical input strings always produce identical token sequences, across operating systems and across releases.[3] This is a critical property for large-model training reproducibility, where any drift in the input tokenization would silently change loss values across machines.
SentencePiece supports four model types selected at training time via the --model_type flag: bpe, unigram, word, and char. The first two are the practically important options; the latter two exist mostly as baselines or for special cases.[3]
The BPE algorithm in SentencePiece is a faithful implementation of the original byte-pair encoding compression algorithm proposed by Philip Gage in 1994 and adapted to subword segmentation by Sennrich, Haddow, and Birch in 2016.[4] Training proceeds bottom-up: the corpus is first split into individual Unicode characters, then the most frequent adjacent pair of symbols is merged into a new symbol, and the process repeats until the vocabulary reaches the target size. The ordered list of merges is stored in the model, and at inference time the same merge list is applied greedily to new input, guaranteeing deterministic segmentation.[4]
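The bottom-up procedure described above can be sketched in a few lines of plain Python. This is a toy illustration, not SentencePiece's implementation: it works on a hypothetical word-frequency table rather than a raw corpus, recounts pairs from scratch at every step, and breaks ties between equally frequent pairs lexicographically, which is an assumption made here only to keep the output deterministic.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(word_freqs, num_merges):
    """Learn an ordered merge list from a {word: frequency} table."""
    # Start from individual characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair first; ties broken lexicographically (toy assumption).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = {merge_pair(symbols, best): freq for symbols, freq in corpus.items()}
    return merges

def encode(word, merges):
    """Apply the learned merges, in training order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Hypothetical toy corpus; "\u2581" is the whitespace meta symbol.
WORD_FREQS = {"\u2581low": 5, "\u2581lower": 2, "\u2581newest": 6, "\u2581widest": 3}
MERGES = train_bpe(WORD_FREQS, 3)
```

Calling `encode("\u2581lowest", MERGES)` then segments an unseen word using only merges learned from the toy table. A production trainer avoids the full recount on every iteration, which is where the priority queue mentioned in this article comes in.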
SentencePiece's BPE differs from the original Sennrich implementation in two visible ways. First, it operates on raw text where whitespace has been mapped to U+2581, so the word boundary is just another symbol and merges could in principle span it. With the trainer's default settings this does not happen, because the split_by_whitespace option keeps pieces from crossing whitespace; disabling that option allows multi-word pieces when they are frequent enough in the training data.[1] Second, the SentencePiece BPE encoder is implemented in C++ with a priority-queue-based merge step, substantially faster than the Python reference implementation that shipped with the original paper.[3]
When byte_fallback=True is set, BPE training adds 256 byte-level tokens (one for each value 0x00 through 0xFF) to the vocabulary so that any character that the trained merges cannot produce can still be represented as a sequence of UTF-8 bytes.[3] This option is used by Llama 1 and 2 and by Gemma to avoid out-of-vocabulary tokens entirely while keeping the merged vocabulary at a manageable size.[10][11]
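A minimal sketch of the idea, assuming a character-level toy vocabulary: any character missing from the vocabulary is spelled out as its UTF-8 bytes, written here in the `<0xNN>` form that byte-fallback vocabularies commonly use. The function names are hypothetical, and a real encoder applies the fallback to whole unknown spans after running its subword model rather than character by character.

```python
def encode_with_byte_fallback(text, vocab):
    """Keep characters found in `vocab`; spell out all others as UTF-8 byte pieces."""
    pieces = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        else:
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces

def decode_with_byte_fallback(pieces):
    """Reassemble text, turning <0xNN> pieces back into raw bytes before decoding."""
    buf = bytearray()
    for piece in pieces:
        if len(piece) == 6 and piece.startswith("<0x") and piece.endswith(">"):
            buf.append(int(piece[3:5], 16))
        else:
            buf.extend(piece.encode("utf-8"))
    return buf.decode("utf-8")

# Hypothetical vocabulary that lacks the euro sign.
TOY_VOCAB = set("\u2581hi")
PIECES = encode_with_byte_fallback("\u2581hi\u20ac", TOY_VOCAB)
```

Because every possible byte value has its own piece, no input can fall outside the vocabulary; the cost is that rare characters expand into several tokens each.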
The unigram subword model, introduced by Kudo in 2018,[2] takes the opposite top-down approach. Training begins with a large seed vocabulary of candidate substrings drawn from the corpus, often built using suffix arrays or simple heuristics, and then iteratively prunes pieces whose removal causes the smallest decrease in the marginal likelihood of the data. Under the unigram model, the probability of one segmentation is the product of the probabilities of its pieces, and the probability of a sentence is the sum of that product over all valid segmentations. Training computes this marginal with a forward-backward pass, while inference uses a Viterbi search to return the single most probable segmentation.[2]
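The inference step can be illustrated with a short dynamic program. The sketch below assumes a hypothetical toy vocabulary with made-up piece probabilities (they are not normalized and do not come from any trained model) and finds the segmentation whose summed log-probability is highest.

```python
import math

def viterbi_segment(text, logp):
    """Return (pieces, score) for the highest-scoring segmentation of `text`."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: best score for text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last piece in that path
    best[0] = 0.0
    max_len = max(len(piece) for piece in logp)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the string to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best[n]

# Made-up piece probabilities, for illustration only.
TOY_PROBS = {
    "\u2581hello": 0.005, "\u2581he": 0.1, "llo": 0.1, "\u2581": 0.1,
    "he": 0.05, "ll": 0.05, "h": 0.05, "e": 0.05, "l": 0.05, "o": 0.05,
}
TOY_LOGP = {piece: math.log(p) for piece, p in TOY_PROBS.items()}
BEST_PIECES, BEST_SCORE = viterbi_segment("\u2581hello", TOY_LOGP)
```

With these numbers the two-piece split scores higher than the single whole-word piece, because 0.1 × 0.1 exceeds 0.005; changing the toy probabilities changes the winner.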
The unigram model has several attractive properties. Because it is probabilistic, multiple valid segmentations exist for any input, and the library can sample alternative segmentations from the posterior, which is the basis of subword regularization.[2] Vocabularies trained with the unigram method tend to contain fewer suffix-only or prefix-only fragments and more linguistically meaningful pieces such as common morphemes. Unigram training is more expensive than BPE training for the same target vocabulary size, but unigram models often perform slightly better in BLEU evaluations on low-resource language pairs, especially when subword regularization is applied during training.[2][3]
Subword regularization is the training-time technique that gives the unigram model its main practical advantage over greedy BPE.[2] Instead of always feeding the model the single most likely segmentation of each input sentence, the training pipeline samples a different segmentation each epoch, drawn from the posterior over tokenizations under the trained unigram model. The model therefore sees the same content under many different surface forms, which has a regularizing effect similar to dropout on token sequences.
In the original ACL 2018 paper, Kudo reported BLEU improvements of roughly one to three points on low-resource and out-of-domain neural machine translation tasks.[2] The technique requires no architectural changes to the downstream model. Two hyperparameters control the sampling: nbest_size, which truncates the posterior to the top n candidates, and alpha, the temperature applied to the truncated distribution.[2] For BPE, an analogous technique called BPE-dropout was introduced by Provilkov, Emelianenko, and Voita in 2020;[12] SentencePiece provides it natively for BPE models through the same encode-time sampling switch, where alpha is interpreted as the probability of dropping a merge.
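How nbest_size and alpha interact can be shown with a brute-force sketch. It enumerates every segmentation of a short string under a hypothetical toy vocabulary, keeps the top nbest_size candidates, and samples one with weight proportional to its probability raised to the power alpha. The enumeration is exponential and only suitable for illustration; the library itself uses more efficient search and sampling.

```python
import math
import random

def all_segmentations(text, logp):
    """Yield (pieces, log-probability) for every segmentation the vocabulary allows."""
    if not text:
        yield [], 0.0
        return
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in logp:
            for tail, score in all_segmentations(text[end:], logp):
                yield [head] + tail, logp[head] + score

def sample_segmentation(text, logp, nbest_size, alpha, rng):
    """Sample one of the `nbest_size` best segmentations with weight P(x) ** alpha."""
    ranked = sorted(all_segmentations(text, logp), key=lambda c: -c[1])
    candidates = ranked[:nbest_size]
    weights = [math.exp(alpha * score) for _, score in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0][0]

# Made-up piece probabilities, for illustration only.
TOY_LOGP = {piece: math.log(p) for piece, p in {
    "\u2581hello": 0.005, "\u2581he": 0.1, "llo": 0.1, "\u2581": 0.1,
    "he": 0.05, "ll": 0.05, "h": 0.05, "e": 0.05, "l": 0.05, "o": 0.05,
}.items()}

rng = random.Random(0)
SAMPLES = [sample_segmentation("\u2581hello", TOY_LOGP, 8, 0.5, rng) for _ in range(20)]
```

Setting nbest_size to 1 always returns the single best segmentation, while a small alpha flattens the distribution so that less likely segmentations are drawn more often.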
The word model splits on whitespace only and uses each unique word as a token, capped at the requested vocabulary size with the rest going to a single unknown token. The char model uses each Unicode codepoint as its own token.[3] Both options are mostly used for ablation studies and for very small models where subword granularity is not needed. They are also occasionally useful when a downstream model has a custom architecture for character or word inputs that does not require a subword vocabulary.
▁ whitespace symbol
The single most distinctive design choice of SentencePiece, and the one that the library is most often recognized by visually, is its replacement of the space character with the Unicode code point U+2581, displayed as ▁ (LOWER ONE EIGHTH BLOCK).[1] Before training begins, every whitespace character in the input is mapped to this symbol; during inference, the same mapping is applied to incoming text. The marker therefore appears as a prefix on the first subword that follows any whitespace, making word boundaries explicit and learnable as part of the token vocabulary itself.
This choice has several practical consequences. Detokenization becomes trivial: the tokens are concatenated, and every ▁ is replaced by a regular space. The normalized text is recovered exactly, with punctuation left attached as written; runs of consecutive spaces also survive when whitespace collapsing is turned off, whereas with default settings they are collapsed to a single space before encoding. There is no need for a separate detokenizer step that knows about language-specific punctuation rules. Token sequences also become directly inspectable in a way that older subword schemes did not allow, because the boundary information is visible in the tokens themselves.[1]
The choice of U+2581 rather than a more familiar symbol such as an underscore was made to ensure that the meta symbol does not collide with any character that might appear in the source text. The LOWER ONE EIGHTH BLOCK character is extremely rare in natural language text. When inspecting tokens, the symbol is often shown in terminals and editors with a small visible underscore-like glyph, which is why it is sometimes informally called "the underscore" in tutorials despite not being one. So that a word at the start of a sequence is segmented the same way as the same word after a space, SentencePiece's add_dummy_prefix flag, which is on by default, inserts a leading ▁ automatically.[3]
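The escaping and detokenization steps are simple enough to sketch directly. The helper names below are hypothetical and the piece list is written by hand rather than produced by a trained model; only the handling of the meta symbol and the dummy prefix is illustrated.

```python
WS = "\u2581"  # the meta symbol, LOWER ONE EIGHTH BLOCK

def escape_whitespace(text, add_dummy_prefix=True):
    """Optionally prepend a space, then map every space to the meta symbol."""
    if add_dummy_prefix:
        text = " " + text
    return text.replace(" ", WS)

def detokenize(pieces, add_dummy_prefix=True):
    """Concatenate pieces, map the meta symbol back to spaces, drop the dummy prefix."""
    text = "".join(pieces).replace(WS, " ")
    if add_dummy_prefix and text.startswith(" "):
        text = text[1:]
    return text

ESCAPED = escape_whitespace("Hello world")
# A hand-written segmentation of ESCAPED, standing in for real model output.
RESTORED = detokenize([WS + "Hello", WS + "wor", "ld"])
```

Since the boundary marker travels inside the pieces, the detokenizer needs no knowledge of the language or of how the text was split.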
SentencePiece training takes a raw text corpus, a target vocabulary size, and a model type, and produces a single self-contained .model file along with a human-readable .vocab file that lists every piece and its score.[3]
The trainer reads input either from plain text files passed via --input or from standard input. Each line of input is treated as one sentence; the library does not perform sentence splitting on its own. Lines longer than --max_sentence_length are skipped by default, and --input_sentence_size caps how many input sentences are used, important for very large corpora where training on every sentence is unnecessary.[3]
The training pipeline first applies Unicode normalization, defaulting to NFKC normalization, which folds compatibility characters such as full-width digits and ligatures into their canonical forms.[3] Whitespace is then collapsed and mapped to ▁, control symbols and any user-defined symbols are reserved, and the corpus is passed to the chosen model trainer.
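The effect of this preprocessing can be approximated with Python's standard library. The sketch below is an assumption-laden stand-in: it uses plain NFKC from `unicodedata`, whereas SentencePiece applies its own compiled normalization rules, and the function name is hypothetical.

```python
import unicodedata

WS = "\u2581"  # whitespace meta symbol

def normalize_for_training(line):
    """Approximate the preprocessing: NFKC fold, collapse whitespace, map spaces to the meta symbol."""
    text = unicodedata.normalize("NFKC", line)
    text = " ".join(text.split())          # collapse runs and trim the ends
    return WS + text.replace(" ", WS)      # leading meta symbol acts as the dummy prefix

# Full-width letters and digits, a doubled space, and the "fi" ligature (U+FB01).
RAW = "\uff34\uff4f\uff4b\uff59\uff4f  \uff11\uff12\uff13 \ufb01le"
NORMALIZED = normalize_for_training(RAW)
```

Folding these variants before training means the vocabulary does not spend pieces on visually equivalent forms of the same character.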
For BPE, the trainer counts adjacent symbol pairs across the corpus, repeatedly merges the most frequent pair into a new symbol, and updates the counts efficiently using a priority queue. For unigram, the trainer initializes a large candidate vocabulary, runs an expectation-maximization loop that estimates each piece's probability, and prunes the lowest-scoring pieces in batches until the vocabulary reaches the target size.[2][3] Training is multithreaded; on modern hardware, training a 32,000-piece BPE vocabulary on a few gigabytes of text typically takes minutes, while a comparable unigram vocabulary takes several times longer.
SentencePiece ships with a small set of command-line tools and bindings for several languages.[3] The basic training, encoding, and decoding workflow looks as follows in shell.
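```shell
# Train a BPE model; writes mymodel.model and mymodel.vocab.
spm_train --input=corpus.txt \
  --model_prefix=mymodel \
  --vocab_size=32000 \
  --model_type=bpe

# Encode raw text into pieces, then decode the pieces back into text.
spm_encode --model=mymodel.model < text.txt > tokens.txt
spm_decode --model=mymodel.model < tokens.txt > reconstructed.txt
```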
The Python bindings, installed with pip install sentencepiece, expose the same operations through a class-based API.
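```python
import sentencepiece as spm

# Train a BPE model; writes mymodel.model and mymodel.vocab.
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='mymodel',
    vocab_size=32000,
    model_type='bpe',
    byte_fallback=True,
)

# Load the trained model, then encode and decode.
sp = spm.SentencePieceProcessor(model_file='mymodel.model')
pieces = sp.encode_as_pieces('Hello, world!')  # subword strings
ids = sp.encode_as_ids('Hello, world!')        # integer ids
text = sp.decode_pieces(pieces)                # back to text
```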
For the unigram model, the same code can be used with model_type='unigram', and the runtime additionally exposes the sample_encode_as_pieces method, which draws a tokenization from the posterior. Its nbest_size argument limits how many candidate segmentations are considered, and alpha controls how peaked the sampling distribution is; together they are the entry point to subword regularization during training of downstream models.[2]
The library ships with bindings for C++, Python, Java, JavaScript via WebAssembly, and TensorFlow ops, with Rust and Go bindings available through community projects.[3] The shipped Python wheels include precompiled binaries for Linux, macOS, and Windows on common architectures. The TensorFlow ops package, tensorflow-text, integrates SentencePiece as a native graph operation, which allows tokenization inside a TensorFlow graph without round-tripping to Python.[3]
SentencePiece tokenizers underpin a large fraction of widely used language models. The following table summarizes notable models, their tokenizer choice within the SentencePiece family, and approximate vocabulary size, based on each model's published configuration or paper.
| Model | Tokenizer type | Vocabulary size | Notes |
|---|---|---|---|
| T5 | SentencePiece unigram | 32,000 | English-centric encoder-decoder, 2019[5] |
| mT5 | SentencePiece unigram | 250,000 | 101 languages, 2020[6] |
| ALBERT | SentencePiece unigram | 30,000 | Lite BERT, 2019[7] |
| XLNet | SentencePiece | 32,000 | Autoregressive pretraining, 2019[8] |
| mBART | SentencePiece | 250,000 | Multilingual denoising autoencoder |
| NLLB | SentencePiece | 256,000 | 200+ languages, 2022[9] |
| Marian NMT | SentencePiece (default) | varies | Production MT framework |
| Llama 1 | SentencePiece BPE | 32,000 | byte_fallback enabled, 2023[10] |
| Llama 2 | SentencePiece BPE | 32,000 | Same tokenizer family as Llama 1, 2023[17] |
| Mistral 7B | SentencePiece BPE | 32,000 | byte_fallback enabled, 2023[18] |
| Gemma | SentencePiece | 256,000 | Subset of Gemini's tokenizer, 2024[11] |
| PaLM | SentencePiece | 256,000 | 540B parameter dense model, 2022[19] |
| PaLM 2 | SentencePiece | larger than PaLM | Tokenizer expanded for multilingual coverage[20] |
| DeepSeek-LLM, DeepSeek-V2 | Byte-level BPE (Hugging Face Tokenizers) | 100,000+ | Not SentencePiece; listed for comparison[21] |
| Whisper | Adapted from GPT-2 BPE | 50,257 / 51,865 | Custom multilingual tokenizer, not SentencePiece[22] |
T5 introduced the text-to-text framing of NLP tasks and used SentencePiece with a unigram model and a 32,000 piece vocabulary trained on a filtered subset of Common Crawl; the choice of unigram over BPE was motivated by the smoother vocabulary statistics and the natural support for subword regularization.[5] mT5, the multilingual extension covering 101 languages, expanded the SentencePiece unigram vocabulary to 250,000 pieces to accommodate the wider Unicode coverage required for global text.[6] ALBERT, a 2019 lighter-weight variant of BERT, used SentencePiece with a unigram model and a 30,000 piece vocabulary, in contrast to BERT's WordPiece.[7] XLNet, the 2019 autoregressive Transformer that competed with BERT on GLUE-style benchmarks, also adopted SentencePiece with a 32,000 piece vocabulary.[8]
Google's PaLM, a 540 billion parameter dense decoder-only model introduced in 2022, used SentencePiece with a 256,000 piece vocabulary that included extensive coverage of code tokens and multilingual text.[19] PaLM 2, the follow-up that appeared in 2023, used a further expanded SentencePiece vocabulary to support stronger multilingual reasoning.[20]
Meta's first Llama release in February 2023 used a SentencePiece BPE tokenizer with a 32,000 piece vocabulary and byte_fallback=True, trained on a mostly English-centric corpus.[10] Llama 2, released in July 2023, used the same tokenizer family, retaining the SentencePiece BPE setup with a 32,000 piece vocabulary and byte fallback.[17]
Llama 3, announced in April 2024, departed from this lineage by switching to a tiktoken-style byte-level BPE tokenizer with a 128,000 piece vocabulary.[16][23] The new tokenizer is implemented through OpenAI's tiktoken library rather than SentencePiece, and the larger vocabulary results in roughly fifteen percent fewer tokens per text for English on average and disproportionately fewer tokens for code and non-English languages.[23] The shift reflects a broader pattern across frontier English-centric models toward larger, byte-level BPE vocabularies for speed and uniform byte handling.
Mistral AI's first widely released model, Mistral 7B in 2023, used a SentencePiece BPE tokenizer with a 32,000 piece vocabulary, byte fallback enabled, and a setup very similar to Llama 1's.[18] Later Mistral models such as Mixtral retained the SentencePiece BPE family. Google DeepMind's Gemma family of open-weights models, released in 2024, uses a SentencePiece tokenizer with a 256,000 piece vocabulary that is a subset of the tokenizer used by Gemini, with byte fallback enabled.[11] DeepSeek's open-weights releases, including DeepSeek-LLM and the DeepSeek-V2 series, are sometimes grouped with these models, but the DeepSeek LLM report describes a byte-level BPE tokenizer built with the Hugging Face Tokenizers library, with a vocabulary on the order of 100,000 pieces.[21]
OpenAI's Whisper speech-to-text model, released in 2022, is sometimes mistakenly listed alongside SentencePiece users but actually uses a tokenizer adapted from the GPT-2 byte-level BPE vocabulary; the multilingual Whisper variant extends the vocabulary to 51,865 pieces.[22] BERT likewise does not use SentencePiece; it uses a WordPiece tokenizer, although WordPiece and SentencePiece's unigram method share conceptual ancestry in likelihood-based vocabulary selection.[13][14] Many GPT family models from OpenAI use a custom byte-level BPE that was developed independently of SentencePiece.
tiktoken, released as open source by OpenAI in late 2022, is the byte-level BPE tokenizer used by GPT-2, GPT-3, GPT-3.5, GPT-4, and later OpenAI models.[16] The two libraries solve overlapping problems but make different tradeoffs.
tiktoken operates on raw UTF-8 bytes rather than Unicode code points. Every input is first split into bytes, and BPE merges are learned over byte sequences. This avoids any need for Unicode normalization and guarantees that any byte string can be encoded, no matter how unusual.[16] SentencePiece, by contrast, operates on normalized Unicode text with NFKC normalization by default and uses the byte_fallback option to handle any character outside the trained vocabulary as raw UTF-8 byte tokens. Both approaches reliably encode arbitrary input, but tiktoken's byte-first design handles unusual byte sequences more cleanly without configuration.
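The difference between the two base alphabets is easy to see in plain Python. This sketch uses only built-in string operations and does not call either library; the helper names are hypothetical.

```python
def as_code_points(text):
    """The units a code-point-based tokenizer starts from."""
    return [f"U+{ord(ch):04X}" for ch in text]

def as_utf8_bytes(text):
    """The units a byte-level tokenizer starts from."""
    return list(text.encode("utf-8"))

SAMPLE = "\u00e9\u65e5"  # 'é' followed by '日'
CODE_POINTS = as_code_points(SAMPLE)
BYTES = as_utf8_bytes(SAMPLE)
```

Two code points become five bytes here. A byte-level base alphabet is fixed at 256 symbols regardless of script, while a code-point alphabet grows with the character inventory of the training corpus, which is the gap byte fallback is meant to close.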
tiktoken is implemented in Rust with a thin Python wrapper and is widely benchmarked as several times faster than SentencePiece on encoding workloads. It does not implement unigram language models or expose a training entry point in its public release; users who want to train a new tokenizer on their own corpus must use a different library.[16] SentencePiece is a full training and inference library and is the standard choice when a model team needs to build a new tokenizer.
In practice, the choice now divides along ecosystem lines. SentencePiece remains the default for open-weights multilingual models, Google's models, and the bulk of academic releases. tiktoken-style byte-level BPE has become the default for OpenAI's GPT family, for Llama 3 and later, and for several other frontier English-centric models where raw encoding speed and byte-level cleanliness are prioritized.[16][23]
The Hugging Face Tokenizers library, often imported as tokenizers, is a Rust-based library with Python bindings that implements BPE, WordPiece, and unigram models in a single modular pipeline.[15] Its design generalizes SentencePiece's architecture by exposing the normalizer, pre-tokenizer, model, and post-processor as separate stages that can be combined in arbitrary configurations.
Hugging Face Tokenizers can load most SentencePiece .model files directly, and offers a fast Rust-based encoding path that is several times faster than SentencePiece.[15] In the Hugging Face Transformers ecosystem, most SentencePiece-based tokenizers ship with a corresponding "fast" version backed by Tokenizers that supports multi-threaded batch encoding. This is typically the preferred inference path for production workloads.
The most important practical distinction is that Tokenizers is a library of tokenization algorithms with a unified interface, whereas SentencePiece is a specific tokenization library with its own algorithms and file format. Many model teams train their tokenizer with SentencePiece, then convert the resulting model file into the Tokenizers fast format for production serving, gaining both the trained quality and the Rust-based serving speed.[15]
SentencePiece has several practical limits, most of which stem from its design vintage rather than algorithmic flaws.
Very large vocabularies, beyond roughly 500,000 pieces, become slow and memory-hungry to train.[3] Both BPE and unigram training keep candidate counts in memory, so RAM scales with vocabulary size. Unigram training is computationally more expensive than BPE training, by roughly a factor of two to four for the same target vocabulary size, depending on the corpus.[2][3]
Detokenization assumes the ▁ whitespace convention. If a downstream system mixes tokens from a SentencePiece tokenizer with tokens from a different tokenizer that uses a different whitespace convention, detokenization can produce unexpected output, which becomes relevant in tool-augmented systems that splice token streams from multiple sources.
Modern frontier large language models are increasingly moving to byte-level BPE implemented in tiktoken or in the Hugging Face Tokenizers library, primarily for speed and for cleaner handling of arbitrary input bytes including emoji sequences and control characters.[16][23] SentencePiece's unigram model in particular is rare in the latest generation of public frontier model releases, although it remains the dominant choice for multilingual systems.
Serialized model files follow the protobuf schema of the SentencePiece version that wrote them. Forward and backward compatibility have been good in practice, but very old models can occasionally need a migration step when loaded with newer versions, and models that use newer options such as byte fallback cannot be loaded by older library versions.[3] Finally, the library does not directly support tokenizers that are jointly trained with the downstream model. Research directions that build on or supersede SentencePiece include character-level and byte-level models such as ByT5, Charformer, and CANINE, which dispense with subword tokenization entirely at the cost of longer sequences and additional model computation.
SentencePiece is distributed under the Apache 2.0 license, which permits commercial use, modification, and redistribution provided that copyright notices are preserved.[3] The library has been continuously maintained since its first public release in 2017, with regular releases on PyPI and on GitHub, and is downloaded tens of millions of times per month from PyPI alone, reflecting its position as a default dependency in the open NLP and open-weights model ecosystem.[3]
Hugging Face Transformers ships SentencePiece-backed tokenizer classes for every model originally trained with SentencePiece, including T5Tokenizer, AlbertTokenizer, XLNetTokenizer, MarianTokenizer, MBartTokenizer, NllbTokenizer, and the LlamaTokenizer and GemmaTokenizer families.[15] Many have a corresponding fast version backed by the Hugging Face Tokenizers library that can load the same SentencePiece model file but run encoding in Rust.