mT5
Last reviewed
May 2, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 3,486 words
mT5 (multilingual T5) is a transformer-based encoder-decoder language model released by Google Research in October 2020. It is a multilingual extension of the original T5 (Text-to-Text Transfer Transformer), pre-trained on a new web corpus called mC4 that covers 101 languages drawn from Common Crawl. The model was introduced in the paper "mT5: A massively multilingual pre-trained text-to-text transformer" by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel, published as arXiv:2010.11934 and presented at NAACL-HLT 2021.
The goal of mT5 was to take the recipe that worked well for English in T5 and stretch it across more than a hundred languages without changing the architecture, the training objective, or any task-specific machinery. Every task is still framed the same way it was in T5: feed the model some text, get some text back. The catch is that now the input might be in Swahili, the output in Burmese, and the model is supposed to do something reasonable with both. To make this work the authors built a 250,000-token SentencePiece vocabulary, sampled languages with a temperature-based scheme that boosts low-resource languages, and trained five model sizes ranging from 300 million to 13 billion parameters. The largest checkpoint, mT5-XXL, set new state-of-the-art results on cross-lingual benchmarks like XNLI, PAWS-X, TyDiQA-GoldP, MLQA, XQuAD, and WikiLingua at the time of release.
The checkpoints, code, and the mC4 dataset were released under the Apache 2.0 license, and the Hugging Face ports of mT5-Small, Base, Large, XL, and XXL became default starting points for multilingual fine-tuning research for several years. By 2026, decoder-only multilingual models (Llama 3, Qwen 2, Gemma, Aya) have largely replaced mT5 for general chat, but mT5 remains influential as a pre-training reference and is still used for translation, summarization, and other sequence-to-sequence tasks where the encoder-decoder shape matters.
The predecessor to mT5 is T5, introduced by Raffel et al. in 2019 ("Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"). T5 framed every NLP task as a text-to-text problem and was pre-trained on the C4 dataset (Colossal Clean Crawled Corpus), a roughly 750 GB English-only corpus extracted from one snapshot of Common Crawl. The original T5 family ranged from 60 million parameters (T5-Small) to 11 billion (T5-11B), and the larger checkpoints were competitive with the best specialized models of the time on SQuAD and the GLUE/SuperGLUE benchmarks.
T5's English focus left an obvious gap. Most of the world's text is not in English, and a sizeable research community had been building multilingual encoders to fill that gap. The two best-known examples in 2020 were mBERT (multilingual BERT, trained on Wikipedia in 104 languages) and XLM-R (XLM-RoBERTa, trained by Conneau et al. at Facebook AI on 2.5 TB of CommonCrawl data covering 100 languages). Both were encoder-only models, useful for classification and span extraction but awkward for generation tasks like translation or summarization. The authors of mT5 wanted to combine T5's clean text-to-text framework, a larger and more diverse multilingual corpus than mBERT or XLM-R had used, and the kind of scale that was working for monolingual models.
The biggest engineering contribution in the mT5 paper was not the model. It was mC4, the multilingual extension of C4. C4 was built by running a heuristic cleaning pipeline (deduplication, language detection, removal of bad lines, Lorem Ipsum filtering, NSFW word lists) on a single Common Crawl dump. mC4 generalized that pipeline to any language CLD3 could detect, then ran it across the 71 monthly Common Crawl dumps that had been released at the time. The result was a corpus of 6.3 trillion tokens. Keeping every language with at least 10,000 pages of clean text gave 107 CLD3 language codes; six of those are script variants of a language already on the list (for example, Russian written in Latin script), which leaves 101 distinct languages.
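A few of the cleaning heuristics named above can be sketched in a handful of lines. This is a minimal illustration, not the mC4 pipeline itself: the blocklist, the minimum line length, and the function name `clean_pages` are hypothetical, and language detection is left out because it needs an external detector such as CLD3.

```python
import hashlib

# Hypothetical values, chosen only for illustration.
BAD_WORDS = {"badword"}
MIN_LINE_CHARS = 20


def clean_pages(pages):
    """Yield cleaned page texts after page-level filters and line deduplication."""
    seen_lines = set()  # hashes of every line kept so far, across all pages
    for text in pages:
        lowered = text.lower()
        # Placeholder-text filter: drop pages containing "lorem ipsum".
        if "lorem ipsum" in lowered:
            continue
        # Word-list filter: drop pages containing any blocklisted word.
        if any(word in BAD_WORDS for word in lowered.split()):
            continue
        kept = []
        for line in text.splitlines():
            line = line.strip()
            # Bad-line filter: drop very short lines (menus, buttons, stubs).
            if len(line) < MIN_LINE_CHARS:
                continue
            # Deduplication: keep only the first occurrence of each line.
            digest = hashlib.sha1(line.encode("utf-8")).hexdigest()
            if digest in seen_lines:
                continue
            seen_lines.add(digest)
            kept.append(line)
        if kept:
            yield "\n".join(kept)
```

In a multilingual setting the same function would run once per detected language, so that deduplication and word lists stay language-specific.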
The distribution across languages is highly uneven, which is the central challenge in any multilingual pre-training setup. English is by far the largest single language in the corpus, and the gap between the highest- and lowest-resource languages spans more than four orders of magnitude. To keep the model from collapsing into a glorified English model, the authors sampled training examples from each language with a probability proportional to a temperature-scaled language frequency:
p(L) ∝ |L|^alpha
where |L| is the number of training examples in language L and alpha = 0.3. Lower alpha means flatter sampling (more equal weight to small languages). The 0.3 value was chosen by sweeping alpha on the development sets of XNLI and TyDiQA. Higher alpha favored high-resource languages such as English; lower alpha pushed up scores in low-resource languages. The value 0.3 was a reasonable compromise between the two.
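The effect of the exponent is easy to see numerically. The sketch below normalizes the weights so they sum to 1; the language counts are made-up round numbers for illustration, not mC4 statistics, and `sampling_probs` is a hypothetical helper name.

```python
def sampling_probs(counts, alpha=0.3):
    """Return p(L) proportional to count(L) ** alpha, normalized to sum to 1."""
    weights = {lang: n ** alpha for lang, n in counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


# Hypothetical example counts spanning four orders of magnitude.
counts = {"en": 1_000_000_000, "sw": 1_000_000, "yo": 100_000}

raw = sampling_probs(counts, alpha=1.0)   # sample in proportion to data size
flat = sampling_probs(counts, alpha=0.3)  # the value used for mT5

for lang in counts:
    print(f"{lang}: alpha=1.0 -> {raw[lang]:.4f}, alpha=0.3 -> {flat[lang]:.4f}")
```

With these toy counts, the English share falls from roughly 99.9% to roughly 84%, while the smallest language rises from about 0.01% to about 5%. Setting alpha to 0 would give every language the same probability.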
The 101-language list is dominated by major Indo-European languages but includes a long tail of African, South Asian, and Southeast Asian languages that had little to no pre-training coverage before. Igbo, Sinhala, Khmer, Yoruba, and Lao are all in the corpus, alongside very small entries like Hawaiian and Luxembourgish. mC4 is hosted on the TensorFlow Datasets catalog and on Hugging Face Datasets under allenai/c4 with a multilingual variant.
Architecturally, mT5 is identical to T5 v1.1, the slightly cleaned-up T5 release Google published a few months before the multilingual paper. T5 v1.1 differs from the original 2019 T5 in three small ways: it uses a GeGLU activation in the feed-forward sublayer instead of ReLU, it does not share the input embedding and output projection matrices, and it does not pretrain on any supervised data (only the unsupervised denoising objective). All three carry over into mT5.
The model is a standard encoder-decoder transformer, the same shape Vaswani et al. proposed in 2017 in "Attention Is All You Need". The encoder reads the source sequence and produces contextual representations; the decoder generates the target sequence one token at a time, attending back to the encoder. There are no language-specific embeddings, no language IDs as input tokens by default, and no expert routing or mixture-of-experts blocks. The only things distinguishing mT5 from T5 v1.1 are its much larger vocabulary and its multilingual training data.
Like T5, mT5 is pre-trained with a span corruption objective. Roughly 15% of the input tokens are masked out, and contiguous masked tokens are replaced by a single sentinel token (<extra_id_0>, <extra_id_1>, and so on). The decoder is trained to reconstruct the masked spans in order, separated by the same sentinels. So an example input like:
The <extra_id_0> jumped over the <extra_id_1> dog.
would have the target:
<extra_id_0> quick brown fox <extra_id_1> lazy <extra_id_2>
The span corruption objective is monolingual in any single example: a Spanish sentence will be corrupted and reconstructed as Spanish. There is no parallel translation supervision in pre-training. Cross-lingual ability emerges entirely from the shared parameters and the shared vocabulary.
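The corruption step in the example above can be reproduced with a short function. This is a simplified sketch: it splits on whitespace instead of using SentencePiece, and it takes the spans as an argument, whereas real pre-training samples them at random. The name `span_corrupt` is hypothetical.

```python
def span_corrupt(tokens, spans):
    """Build (input, target) strings for T5-style span corruption.

    `spans` is a sorted list of non-overlapping (start, end) token index
    pairs, with `end` exclusive. Each span is replaced by one sentinel.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # copy the untouched tokens
        inputs.append(sentinel)              # one sentinel replaces the span
        targets.append(sentinel)             # the target repeats the sentinel...
        targets.extend(tokens[start:end])    # ...followed by the dropped tokens
        cursor = end
    inputs.extend(tokens[cursor:])
    # A final sentinel closes the target sequence.
    targets.append(f"<extra_id_{len(spans)}>")
    return " ".join(inputs), " ".join(targets)


tokens = "The quick brown fox jumped over the lazy dog.".split()
source, target = span_corrupt(tokens, [(1, 4), (7, 8)])
print(source)  # The <extra_id_0> jumped over the <extra_id_1> dog.
print(target)  # <extra_id_0> quick brown fox <extra_id_1> lazy <extra_id_2>
```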
mT5 was trained on Cloud TPU v3 pods. The paper reports 1 trillion training tokens for all sizes, sampled from mC4 according to the temperature scheme above. Sequence length was 1024 for the encoder and 229 for the decoder. The XXL variant required a TPU v3-1024 slice (1024 cores) and took several weeks to train. All five model sizes were released as official Google Research checkpoints.
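The 1 trillion figure can be sanity-checked with simple arithmetic. Only the encoder length of 1024 comes from the text above; the step count and batch size are assumptions not stated in this article, taken from the commonly cited setup of 1 million steps with 1024 sequences per batch.

```python
steps = 1_000_000    # assumption: number of pre-training steps
batch_size = 1024    # assumption: sequences per batch
encoder_len = 1024   # encoder sequence length, from the text above

input_tokens = steps * batch_size * encoder_len
print(f"{input_tokens:,}")  # 1,048,576,000,000
```

Under those assumptions the encoder sees about 1.05 trillion input tokens, which matches the reported figure once rounded.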
mT5 was released in five sizes. The parameter counts are noticeably larger than the equivalent T5 sizes because of the 250,000-token vocabulary, which contributes a 250K x d_model embedding matrix on each end. For the smaller models that vocabulary table is the dominant chunk of the parameter count.
| Variant | Parameters | Encoder layers | Decoder layers | d_model | d_ff | Heads | Embedding params |
|---|---|---|---|---|---|---|---|
| mT5-Small | 300M | 8 | 8 | 512 | 1024 | 6 | ~256M (vocab) |
| mT5-Base | 580M | 12 | 12 | 768 | 2048 | 12 | ~384M (vocab) |
| mT5-Large | 1.2B | 24 | 24 | 1024 | 2816 | 16 | ~512M (vocab) |
| mT5-XL | 3.7B | 24 | 24 | 2048 | 5120 | 32 | ~1024M (vocab) |
| mT5-XXL | 13B | 24 | 24 | 4096 | 10240 | 64 | ~2048M (vocab) |
In mT5-Small, the embedding matrix accounts for more than 80% of all parameters. That ratio drops as you scale up; at XXL the embeddings are around 16% of the total. The paper points this out because it complicates apples-to-apples comparisons against monolingual T5 of "the same size": mT5-Small has a similar parameter budget to T5-Base but most of those parameters are spent on the vocabulary, not on the transformer stack.
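The percentages above follow directly from the table. The sketch below recomputes them, assuming a vocabulary of exactly 250,000 entries (the real table may be padded slightly) and counting the input embeddings and the output projection separately because they are not shared. `embedding_share` is a hypothetical helper name.

```python
VOCAB_SIZE = 250_000  # approximation of the vocabulary size

# (total parameters, d_model) per variant, taken from the table above.
VARIANTS = {
    "mT5-Small": (300e6, 512),
    "mT5-Base": (580e6, 768),
    "mT5-Large": (1.2e9, 1024),
    "mT5-XL": (3.7e9, 2048),
    "mT5-XXL": (13e9, 4096),
}


def embedding_share(total_params, d_model, vocab_size=VOCAB_SIZE):
    """Fraction of parameters held in the two vocabulary-sized matrices."""
    # Untied input embeddings and output projection: two vocab x d_model tables.
    embedding_params = 2 * vocab_size * d_model
    return embedding_params / total_params


shares = {name: embedding_share(*spec) for name, spec in VARIANTS.items()}
for name, share in shares.items():
    print(f"{name}: {share:.0%} of parameters in embeddings")
```

The share shrinks with every step up in size, because the embedding tables grow linearly with d_model while the transformer stack grows much faster.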
All five variants are available on Hugging Face under google/mt5-small, google/mt5-base, google/mt5-large, google/mt5-xl, and google/mt5-xxl.
mT5 uses a SentencePiece unigram language model tokenizer with a vocabulary of 250,000 subword pieces. The tokenizer was trained on a balanced sample of mC4 with the same temperature-scaled distribution used for pre-training. SentencePiece operates on raw Unicode without any language-specific preprocessing, so the same tokenizer handles whitespace-segmented languages like English, languages written without spaces between words like Chinese, and morphologically complex languages like Finnish and Turkish.
250K is much larger than English-only T5's 32K vocabulary and matches XLM-R's 250K (a deliberately chosen parity). The reason multilingual models need bigger vocabularies is mechanical: with a fixed budget of subword units, every language you add either takes pieces from other languages or causes the tokenizer to fall back on character-level fragments for rare words. A bigger vocabulary buys breathing room, at the cost of having the embedding matrix dominate the parameter count for smaller variants.
A known issue is that languages with rich morphology (Hindi, Tamil, Sinhala) or many compound words (German, Finnish) tend to get tokenized into more, smaller subword pieces than the equivalent meaning in English. Effective sequence length is therefore unevenly distributed across languages, and downstream performance correlates with both training data volume and tokenizer-induced sequence length.
The mT5 paper evaluates on six multilingual benchmarks and reports cross-lingual transfer (train on English-only data, evaluate on every language) as well as in-language fine-tuning where supervised data is available.
| Benchmark | Task | Languages | mT5-XXL (zero-shot, English fine-tune) |
|---|---|---|---|
| XNLI | Cross-lingual natural language inference | 15 | 87.8 (avg accuracy) |
| PAWS-X | Cross-lingual paraphrase identification | 7 | 91.5 |
| MLQA | Multilingual extractive QA | 7 | 76.0 / 57.9 (F1 / EM avg) |
| XQuAD | Cross-lingual SQuAD | 11 | 82.5 / 66.0 (F1 / EM avg) |
| TyDiQA-GoldP | Information-seeking QA in 11 typologically diverse languages | 9 (gold passage) | 81.0 / 65.3 (F1 / EM avg) |
| WikiLingua | Cross-lingual abstractive summarization | 18 | varies, mT5-XXL leads |
At release, mT5-XXL outperformed prior multilingual encoders (XLM-R Large, mBERT) on every classification and QA benchmark by a meaningful margin, in many cases by 5 to 10 absolute points. On TyDiQA-GoldP it surpassed even strong English-only models like the original T5-XXL when restricted to the English subset, a result that tells you something about the value of cross-lingual transfer when the source language has abundant data.
A specific finding from the paper that gets cited often: mT5-Small and mT5-Base, which are smaller than or roughly the same size as XLM-R Large in raw parameter count, perform worse than XLM-R on many classification tasks. mT5 only starts dominating at Large and above. The paper interprets this as evidence that encoder-decoder models need more total parameters than encoder-only models to reach the same quality on discriminative tasks, presumably because roughly half the parameters sit in a decoder that contributes little to those tasks. So mT5 is most attractive when you actually need to generate text.
In the cross-lingual transfer setting (English fine-tuning, zero-shot evaluation in other languages), mT5 showed gaps similar to mBERT and XLM-R: high-resource European languages perform close to English, while low-resource languages like Swahili and Burmese lag substantially. The XXL variant narrows that gap but does not eliminate it. Scaling the model is not a substitute for higher-quality data in the target language.
The phrase "curse of multilinguality" was coined by Conneau et al. in the XLM-R paper. The observation is that, holding model capacity fixed, adding more languages to a multilingual pre-training run improves average performance up to a point, then starts to hurt average performance because the parameters get spread too thin. The optimal number of languages depends on the parameter budget; XLM-R found a peak around 100 languages for their largest model.
mT5 sits inside the same dynamic. The paper notes that for mT5-Small, performance per language is meaningfully worse than what you would get from a hypothetical monolingual model of the same size. As you scale up, the gap closes. mT5-XXL with 13 billion parameters is large enough that the dilution penalty becomes small, at least for the languages with abundant pretraining data. For genuinely low-resource languages, even XXL underperforms a monolingual model of the same size.
No published paper has cleanly resolved the curse of multilinguality. Mixture-of-experts approaches (where each token is routed to a subset of language-specific experts) help. Bigger models help. Better data curation helps. But there is no free lunch: spending parameters on Bengali means fewer parameters available for English, and vice versa. mT5's contribution to the discussion was empirical evidence that the curse is real but scalable, in the sense that big enough models can pay the tax without their downstream quality dropping below useful thresholds.
A cluster of multilingual T5 derivatives followed mT5 over the next few years.
ByT5 (Xue et al. 2021, arXiv:2105.13626) is a token-free version of T5 that operates directly on UTF-8 bytes instead of subword pieces. The motivation was twofold: avoid the embedding-table tax that dominated mT5-Small, and remove the assumption that a learned subword vocabulary is the right unit for every language. ByT5 trains slower per step (sequences are much longer in bytes) but is more robust to typos, code-switching, and morphologically rich languages. It was released in five sizes matching mT5.
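The byte-level idea is simple enough to show directly. In the sketch below, each UTF-8 byte becomes one ID, shifted by a small offset that reserves the lowest IDs for special tokens. The offset of 3 is an assumption for illustration, and the function names are hypothetical.

```python
def to_byte_ids(text, offset=3):
    """Encode text as one ID per UTF-8 byte, shifted past reserved special IDs."""
    return [byte + offset for byte in text.encode("utf-8")]


def from_byte_ids(ids, offset=3):
    """Invert to_byte_ids: undo the shift and decode the bytes as UTF-8."""
    return bytes(i - offset for i in ids).decode("utf-8")


english = "hello"
hindi = "नमस्ते"

for word in (english, hindi):
    ids = to_byte_ids(word)
    print(f"{word!r}: {len(word)} characters -> {len(ids)} byte IDs")
```

The Hindi word needs several bytes per character, which illustrates why byte-level sequences get long for non-Latin scripts even though no vocabulary table is required.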
mT5x is the JAX/Flax reimplementation of T5 and mT5 inside Google's t5x framework. The original mT5 was trained in Mesh TensorFlow; t5x is the modern successor. All recent T5-family models from Google use t5x.
umT5 (Chung et al. 2023, "UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining") is a follow-on multilingual T5 with a more careful language sampling strategy called UniMax that caps the number of times any single language can be repeated in training. umT5 was released in the same five sizes as mT5 and improved low-resource language performance. The Google Research models repo lists umT5 as the recommended replacement for mT5 in 2024 and beyond.
MADLAD-400 (Kudugunta et al. 2023) is a family of T5-style multilingual machine translation models at 3B and 7B parameters, trained on a 3 trillion token corpus covering 419 languages, far beyond mT5's 101. MADLAD-400 was released as both a translation-specific and a general-purpose pre-trained model, and it remains one of the broadest-coverage open multilingual models as of 2026.
NLLB (No Language Left Behind, Meta AI 2022) is not a T5 derivative, but it sits in the same problem space. NLLB-200 is a 54B-parameter mixture-of-experts encoder-decoder trained for translation across 200 languages, with a strong focus on the long tail of African and other low-resource languages.
Aya, Aya 23, and Aya Expanse are instruction-tuned multilingual models from Cohere for AI, released between 2024 and 2025. The original Aya model (Aya 101) was fine-tuned from mT5-XXL and covers 101 languages. Aya 23 and Aya Expanse are decoder-only (they descend from Cohere's Command-R series) rather than encoder-decoder, and they cover 23 widely spoken languages with strong instruction-following. Aya Expanse 32B is the most capable of the family as of 2026.
For general use, decoder-only LLMs released after 2023 have largely displaced mT5. Llama 3, Qwen 2 and 2.5, Gemma 2, Mistral Nemo, Gemini, and the GPT family all support a respectable subset of languages out of the box, and their instruction-tuned variants tend to outperform mT5 on chat-style tasks even without explicit multilingual fine-tuning. The catch is that the language coverage in any specific decoder-only model is usually narrower than mT5's 101, and quality drops faster on unseen languages because the pre-training data is heavily English-weighted. For encoder-decoder workflows specifically (translation, summarization with high faithfulness requirements, structured-output tasks), mT5 and its descendants still see active use.
All five mT5 checkpoints are released under the Apache 2.0 license. The mC4 dataset is also Apache 2.0 in distribution, though the underlying Common Crawl text is subject to its own collection of source-website licenses.
The canonical model checkpoints live on Hugging Face under the google organization. Inference works out of the box with the transformers library:
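```python
# Requires the transformers and sentencepiece packages.
from transformers import T5Tokenizer, MT5ForConditionalGeneration

# Load the tokenizer and the pre-trained (not fine-tuned) base checkpoint.
tok = T5Tokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
```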
Fine-tuning is the most common way mT5 is used in practice, since the raw pre-trained checkpoints are not instruction-tuned and produce mostly noisy text without task-specific adaptation. The original mT5 GitHub repo at google-research/multilingual-t5 was archived in favor of the broader t5x repo in 2023.
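Because mT5 keeps T5's text-to-text framing, fine-tuning data is usually flattened into input and target strings first. The sketch below shows one way to do that. The function name `to_text_pair`, the task prefix, and the field layout are hypothetical choices for illustration, not a format prescribed by the mT5 paper.

```python
def to_text_pair(task, fields, label):
    """Format one supervised example as (input_text, target_text)."""
    # Each field becomes "name: value"; the task name is prepended as a prefix.
    parts = [f"{name}: {value}" for name, value in fields.items()]
    return f"{task} " + " ".join(parts), str(label)


# Hypothetical natural language inference example.
source_text, target_text = to_text_pair(
    "xnli",
    {"premise": "A man is playing a guitar.", "hypothesis": "A man is making music."},
    "entailment",
)
print(source_text)
print(target_text)
```

The resulting string pairs can then be tokenized and passed to the model as inputs and labels. Classification labels are emitted as plain words, so the decoder generates the label text instead of predicting a class index.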
The mT5 paper has been cited several thousand times as of 2026. Its influence shows up in three places. The mC4 pipeline became a template: subsequent multilingual datasets (CulturaX, MADLAD-400, Glot500) follow some variant of it with language-aware filtering and temperature-based sampling. The alpha = 0.3 temperature sampling became a default starting point for almost every multilingual pre-training paper that came after. And the practical packaging on Hugging Face turned mT5 into the workhorse multilingual encoder-decoder for academic and applied NLP work between 2021 and 2023.
mT5 was also one of the early demonstrations that multilingual pre-training could be done with the same recipe as monolingual pre-training, just at larger scale. That observation looks obvious in retrospect but it was not at the time: the prevailing intuition was that multilingual models needed special architectural tricks like language adapters, language embeddings, or language-specific output heads to cope with the heterogeneity of the data. mT5 showed that at sufficient scale, a single homogeneous architecture trained on a temperature-balanced mixture works well enough to be competitive across most tasks.
The mT5 paper itself is candid about the model's weaknesses. mT5 was not instruction-tuned and sampling text from the raw checkpoint produces low-quality output, which limits its use as a general-purpose generator. Performance on the smallest 30 or so languages in mC4 is much worse than performance on the top 20, and for many of those low-resource languages a domain-matched monolingual model of the same parameter count would likely beat mT5.
A practical issue that surfaced in the years after release is that mT5's tokenizer can produce extremely fragmented sequences for non-Latin scripts, particularly Indic and Southeast Asian scripts, because even after temperature scaling the tokenizer's training sample still skewed toward Latin-script languages. ByT5 and the byte-level approach were partly motivated by this issue. Relatedly, the 250K vocabulary is too small in retrospect to give every covered language adequate subword coverage. Later models like umT5 and MADLAD-400 use larger vocabularies (up to 1 million pieces in some experimental setups) to address this.