# mT5

> Source: https://aiwiki.ai/wiki/mt5
> Updated: 2026-06-24
> Categories: Large Language Models, Natural Language Processing, Open Source AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**mT5** (multilingual [T5](/wiki/t5)) is a [transformer](/wiki/transformer)-based encoder-decoder [language model](/wiki/large_language_model) released by [Google](/wiki/google) Research in October 2020 that covers 101 languages in a single model, pre-trained on a Common Crawl corpus called mC4 and released in five sizes from 300 million to 13 billion parameters. It is a [multilingual](/wiki/multilingual_model) extension of the original T5 (Text-to-Text Transfer Transformer) that keeps T5's text-to-text framework unchanged: every task, in any language, is cast as feeding the model text and reading text back. The model was introduced in the paper "mT5: A massively multilingual pre-trained text-to-text transformer" by Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and [Colin Raffel](/wiki/colin_raffel), posted as arXiv:2010.11934 on 22 October 2020 and presented at NAACL-HLT 2021. [1] The authors summarize the contribution plainly: "we introduce mT5, a multilingual variant of T5 that was pre-trained on a new Common Crawl-based dataset covering 101 languages." [1]

The goal of mT5 was to take the recipe that worked well for English in T5 and stretch it across more than a hundred languages without changing the architecture, the training objective, or any task-specific machinery. Every task is still framed the same way it was in T5: feed the model some text, get some text back. The catch is that now the input might be in Swahili, the output in Burmese, and the model is supposed to do something reasonable with both. To make this work the authors built a 250,000-token [SentencePiece](/wiki/sentencepiece) vocabulary, sampled languages with a temperature-based scheme that boosts low-resource languages (exponent alpha = 0.3), and trained five model sizes ranging from 300 million to 13 billion parameters. [1] The largest checkpoint, mT5-XXL with 13 billion parameters, set new state-of-the-art results on cross-lingual benchmarks like XNLI, PAWS-X, TyDiQA-GoldP, MLQA, XQuAD, and WikiLingua at the time of release. [1][11]

The checkpoints, code, and the mC4 dataset were released under the [Apache 2.0 license](/wiki/apache_license_2_0), and the [Hugging Face](/wiki/hugging_face) ports of mT5-Small, Base, Large, XL, and XXL became default starting points for multilingual fine-tuning research for several years. [11][12] By 2026, decoder-only multilingual models ([Llama 3](/wiki/llama_3), [Qwen](/wiki/qwen) 2, [Gemma](/wiki/gemma), [Aya](/wiki/aya)) have largely replaced mT5 for general chat, but mT5 remains influential as a pre-training reference and is still used for translation, summarization, and other sequence-to-sequence tasks where the encoder-decoder shape matters. The mT5 paper had accumulated roughly 2,935 citations on Semantic Scholar by 2026 (including about 297 highly influential citations), a measure of how widely the recipe spread. [13]

## What problem was mT5 built to solve?

The predecessor to mT5 is [T5](/wiki/t5), introduced by Raffel et al. in 2019 ("Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"). [2] T5 framed every NLP task as a text-to-text problem and was pre-trained on the [C4 dataset](/wiki/c4_dataset) (Colossal Clean Crawled Corpus), a roughly 750 GB English-only corpus extracted from one snapshot of [Common Crawl](/wiki/common_crawl). The original T5 family ranged from 60 million parameters (T5-Small) to 11 billion (T5-11B), and the larger checkpoints were competitive with the best specialized models of the time on [SQuAD](/wiki/squad) and the GLUE/SuperGLUE benchmarks. [2]

T5's English focus left an obvious gap. Most of the world's text is not in English, and a sizeable research community had been building multilingual encoders to fill that gap. The two best-known examples in 2020 were [mBERT](/wiki/mbert) (multilingual [BERT](/wiki/bert), trained on Wikipedia in 104 languages) and [XLM-R](/wiki/xlm_r) (XLM-RoBERTa, trained by Conneau et al. at Facebook AI on 2.5 TB of CommonCrawl data covering 100 languages). [3][9] Both were encoder-only models, useful for classification and span extraction but awkward for generation tasks like translation or summarization. The authors of mT5 wanted to combine T5's clean text-to-text framework, a larger and more diverse multilingual corpus than mBERT or XLM-R had used, and the kind of scale that was working for monolingual models. [1]

## How many languages does mT5 cover, and what is the mC4 dataset?

The biggest engineering contribution in the mT5 paper was arguably not the model but mC4, the multilingual extension of C4. [1] C4 was built by running a heuristic cleaning pipeline (deduplication, language detection, removal of bad lines, Lorem Ipsum filtering, NSFW word lists) on a single Common Crawl dump. mC4 generalized that pipeline to any language CLD3 could detect, then ran it across 71 monthly Common Crawl dumps. [1] Languages with fewer than 10,000 pages of clean text were dropped, which left 107 language labels as defined by CLD3. Six of those labels are script variants of languages already in the list, giving the 101 languages the paper reports. [1]

The distribution across languages is severely uneven, which is the central challenge in any multilingual pre-training setup. English is by far the largest single language in the corpus, and the gap between the highest and lowest resource languages spans several orders of magnitude. To keep the model from collapsing into a glorified English model, the authors sampled training batches with a probability proportional to a temperature-scaled language frequency:

```
p(L) ∝ |L|^alpha
```

where |L| is the number of examples in language L and alpha = 0.3. [1] Lower alpha means flatter sampling (more equal weight to small languages). The 0.3 value was chosen by sweeping alpha on the development sets of XNLI and TyDiQA. Higher alpha favored results on high-resource languages such as English; lower alpha pushed up scores in low-resource languages. A value of 0.3 balanced the two reasonably well. [1]
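As a rough illustration of the sampling rule, the sketch below normalizes |L|^alpha into probabilities. The function name and the corpus sizes are made-up placeholders for this example, not code or statistics from the paper.

```python
def sampling_probs(sizes, alpha=0.3):
    """Return p(L) proportional to |L|**alpha, normalized to sum to 1."""
    weights = {lang: n ** alpha for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

# Hypothetical corpus sizes (number of examples); NOT real mC4 counts.
toy_sizes = {"en": 1_000_000, "sw": 1_000}

raw = sampling_probs(toy_sizes, alpha=1.0)   # proportional to corpus size
flat = sampling_probs(toy_sizes, alpha=0.3)  # temperature-flattened

print(f"alpha=1.0: sw share = {raw['sw']:.4f}")
print(f"alpha=0.3: sw share = {flat['sw']:.4f}")
```

With these toy numbers, the smaller language's share of sampled examples rises from about 0.1% under proportional sampling to about 11% with alpha = 0.3.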

The 101-language list is dominated by major Indo-European languages but includes a long tail of African, South Asian, and Southeast Asian languages that had little to no pre-training coverage before. Igbo, Sinhala, Khmer, Yoruba, and Lao are all in the corpus, alongside very small entries like Hawaiian and Luxembourgish. [11] mC4 is hosted on the [TensorFlow](/wiki/tensorflow) Datasets catalog and on Hugging Face Datasets as the multilingual variant of `allenai/c4`.

## How does mT5 differ from T5 architecturally?

Architecturally, mT5 is identical to T5 v1.1, the slightly cleaned-up T5 release Google published a few months before the multilingual paper. T5 v1.1 differs from the original 2019 T5 in three small ways: it uses a GeGLU activation in the feed-forward sublayer instead of ReLU, it does not share the input embedding and output projection matrices, and it is pre-trained without any supervised data (only the unsupervised denoising objective). All three carry over into mT5. [1]

The model is a standard encoder-decoder [transformer](/wiki/transformer) with [attention](/wiki/attention), the same shape Vaswani et al. proposed in 2017 in ["Attention Is All You Need"](/wiki/attention_is_all_you_need). The encoder reads the source sequence and produces contextual representations; the decoder generates the target sequence one token at a time, attending back to the encoder. There are no language-specific embeddings, no language IDs as input tokens by default, and no expert routing or mixture-of-experts blocks. The only thing distinguishing mT5 from T5 is its much larger vocabulary and the multilingual training data. [1]

### What is mT5's training objective?

Like T5, mT5 is pre-trained with a [span corruption](/wiki/span_corruption) objective. Roughly 15% of the input tokens are masked out, and contiguous masked tokens are replaced by a single sentinel token (`<extra_id_0>`, `<extra_id_1>`, and so on). The decoder is trained to reconstruct the masked spans in order, separated by the same sentinels. So an example input like:

```
The <extra_id_0> jumped over the <extra_id_1> dog.
```

would have the target:

```
<extra_id_0> quick brown fox <extra_id_1> lazy <extra_id_2>
```

The span corruption objective is monolingual in any single example: a Spanish sentence will be corrupted and reconstructed as Spanish. There is no parallel translation supervision in pre-training. Cross-lingual ability emerges entirely from the shared parameters and the shared vocabulary. [1]
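The input and target construction described above can be sketched in a few lines. This is a simplified illustration: whitespace-separated words stand in for SentencePiece tokens, the spans are chosen by hand instead of being sampled at random, and `span_corrupt` is a hypothetical helper name.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel and build the target.

    `spans` are half-open, sorted, non-overlapping index ranges. Real
    pre-training samples them randomly; here they are passed in by hand.
    """
    inputs, targets = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted tokens
        inputs.append(sentinel)              # one sentinel per dropped span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the dropped tokens themselves
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return " ".join(inputs), " ".join(targets)

words = "The quick brown fox jumped over the lazy dog.".split()
source, target = span_corrupt(words, [(1, 4), (7, 8)])
print(source)
print(target)
```

Run on the example sentence, this reproduces the input and target strings shown above.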

### How was mT5 trained?

mT5 was trained on Cloud [TPU](/wiki/tpu) v3 pods. The paper reports 1 trillion training tokens for all sizes, sampled from mC4 according to the temperature scheme above. Sequence length was 1024 for the encoder and 229 for the decoder. The XXL variant required a TPU v3-1024 slice (1024 cores) and took several weeks to train. All five model sizes were released as official Google Research checkpoints. [1][12]

## What are the mT5 model sizes?

mT5 was released in five sizes. The parameter counts are noticeably larger than the equivalent T5 sizes because of the 250,000-token vocabulary, which contributes a 250K x d_model embedding matrix on each end. For the smaller models that vocabulary table is the dominant chunk of the parameter count. The official Google Research repository lists the five sizes as mT5-Small (300 million parameters), mT5-Base (580 million), mT5-Large (1.2 billion), mT5-XL (3.7 billion), and mT5-XXL (13 billion). [12]

| Variant | Parameters | Encoder layers | Decoder layers | d_model | d_ff | Heads | Embedding params |
|---------|------------|----------------|----------------|---------|------|-------|-------------------|
| mT5-Small | 300M | 8 | 8 | 512 | 1024 | 6 | ~256M (vocab) |
| mT5-Base | 580M | 12 | 12 | 768 | 2048 | 12 | ~384M (vocab) |
| mT5-Large | 1.2B | 24 | 24 | 1024 | 2816 | 16 | ~512M (vocab) |
| mT5-XL | 3.7B | 24 | 24 | 2048 | 5120 | 32 | ~1024M (vocab) |
| mT5-XXL | 13B | 24 | 24 | 4096 | 10240 | 64 | ~2048M (vocab) |

In mT5-Small, the embedding matrix accounts for more than 80% of all parameters. That ratio drops as you scale up; at XXL the embeddings are around 16% of the total. The paper points this out because it complicates apples-to-apples comparisons against monolingual T5 of "the same size": mT5-Small has a similar parameter budget to T5-Base but most of those parameters are spent on the vocabulary, not on the transformer stack. [1]
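The embedding shares quoted here follow from simple arithmetic on the table. The sketch below assumes a vocabulary of exactly 250,000 entries and two untied vocabulary matrices (input embedding and output projection), and uses the rounded parameter totals from the table, so the results are approximate.

```python
VOCAB = 250_000  # vocabulary size as quoted in the text

# (total parameters, d_model) per variant, taken from the table above
variants = {
    "mT5-Small": (300e6, 512),
    "mT5-Base": (580e6, 768),
    "mT5-Large": (1.2e9, 1024),
    "mT5-XL": (3.7e9, 2048),
    "mT5-XXL": (13e9, 4096),
}

def embedding_share(total, d_model, vocab=VOCAB):
    """Fraction of parameters held in the two vocabulary matrices."""
    embed = 2 * vocab * d_model  # untied: one matrix on each end
    return embed / total

for name, (total, d_model) in variants.items():
    print(f"{name}: {embedding_share(total, d_model):.0%}")
```

Under these assumptions the share falls from roughly 85% for mT5-Small to roughly 16% for mT5-XXL.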

All five variants are available on [Hugging Face](/wiki/hugging_face) under `google/mt5-small`, `google/mt5-base`, `google/mt5-large`, `google/mt5-xl`, and `google/mt5-xxl`. [11]

## How does mT5 handle 101 languages with one tokenizer?

mT5 uses a [SentencePiece](/wiki/sentencepiece) unigram language model tokenizer with a vocabulary of 250,000 subword pieces. [1] The tokenizer was trained on a balanced sample of mC4 with the same temperature-scaled distribution used for pre-training. SentencePiece operates on raw Unicode without any language-specific preprocessing, so the same tokenizer handles whitespace-segmented languages like English, languages written without spaces between words like Chinese, and complex morphological systems like Finnish and Turkish.

250K is much larger than English-only T5's 32K vocabulary and matches XLM-R's 250K (a deliberately chosen parity). [1][3] The reason multilingual models need bigger vocabularies is mechanical: with a fixed budget of subword units, every language you add either takes pieces from other languages or causes the tokenizer to fall back on character-level fragments for rare words. A bigger vocabulary buys breathing room, at the cost of having the embedding matrix dominate the parameter count for smaller variants.

A known issue is that languages with rich morphology (Hindi, Tamil, Sinhala) or many compound words (German, Finnish) tend to get tokenized into more, smaller subword pieces than the equivalent meaning in English. Effective sequence length is therefore unevenly distributed across languages, and downstream performance correlates with both training data volume and tokenizer-induced sequence length.

## How well does mT5 perform on multilingual benchmarks?

The mT5 paper evaluates on six multilingual benchmarks and reports cross-lingual transfer (train on English-only data, evaluate on every language) as well as in-language fine-tuning where supervised data is available. Several of these benchmarks (XNLI, PAWS-X, TyDiQA, MLQA, XQuAD) are aggregated in the [XTREME](/wiki/xtreme) cross-lingual evaluation suite. [10] The headline scores for mT5-XXL reported in the official Google Research repository are as follows. [12]

| Benchmark | Task | Languages | mT5-XXL score |
|-----------|------|-----------|---------------|
| XNLI | Cross-lingual natural language inference | 15 | 85.0 (zero-shot avg accuracy); 87.8 (translate-train) |
| PAWS-X | Cross-lingual paraphrase identification | 7 | 90.0 (avg accuracy) |
| MLQA | Multilingual extractive QA | 7 | 76.0 (F1 avg) |
| XQuAD | Cross-lingual SQuAD | 11 | 82.5 (F1 avg) |
| TyDiQA-GoldP | Information-seeking QA in typologically diverse languages | 9 (gold passage) | 82.0 (F1 avg) |
| WikiLingua | Cross-lingual abstractive summarization | 18 | mT5-XXL leads at release |

At release, mT5-XXL outperformed prior multilingual encoders (XLM-R Large, mBERT) on every classification and QA benchmark by a meaningful margin, in many cases by 5 to 10 absolute points. [1] The paper notes that on TyDiQA-GoldP the gap between the three fine-tuning regimes (zero-shot cross-lingual transfer, translate-train, and in-language multitask) shrinks as the model grows, so the largest model benefits least from in-language supervision. On named-entity recognition (WikiAnn NER) mT5-XXL reached 69.2, close to the prior best of 70.1. [1]

A specific finding from the paper that gets cited often: mT5-Small and mT5-Base, which are smaller than or roughly the same size as XLM-R Large in raw parameter count, perform worse than XLM-R on many classification tasks. mT5 only starts dominating at Large and above. [1] The paper interprets this as evidence that encoder-decoder models have a higher capacity floor than encoder-only models, presumably because roughly half the parameters sit in a decoder that contributes little to discriminative tasks. So mT5 is most attractive when you actually need to generate text.

### How well does mT5 transfer across languages zero-shot?

In the cross-lingual transfer setting (English fine-tuning, zero-shot evaluation in other languages), mT5 showed gaps similar to mBERT and XLM-R: high-resource European languages perform close to English, while low-resource languages like Swahili and Burmese lag substantially. The XXL variant narrows that gap but does not eliminate it. Scaling the model is not a substitute for higher-quality data in the target language. [1]
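In this setting the same text-to-text formatting is applied to the English training examples and to the non-English evaluation examples; only the language of the content changes. The sketch below illustrates the idea for an NLI task. The prompt layout, the label strings, and the `to_text_pair` helper are hypothetical choices made for this example, not the exact format used in the paper.

```python
# Hypothetical label vocabulary; the model emits one of these strings as text.
LABELS = ["entailment", "neutral", "contradiction"]

def to_text_pair(premise, hypothesis, label_id=None):
    """Cast an NLI example as (input text, target text)."""
    source = f"nli premise: {premise} hypothesis: {hypothesis}"
    target = LABELS[label_id] if label_id is not None else None
    return source, target

# English example used for fine-tuning (label known).
train_src, train_tgt = to_text_pair("A man is sleeping.", "A person is awake.", 2)

# Spanish example seen only at evaluation time (label left to the model).
eval_src, eval_tgt = to_text_pair("Un hombre duerme.", "Una persona está despierta.")

print(train_src, "->", train_tgt)
print(eval_src)
```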

## What is accidental translation in mT5?

A distinctive contribution of the mT5 paper is its analysis of "accidental translation," a failure mode unique to generative multilingual models in the zero-shot setting. Because mT5 is fine-tuned on English-only labeled data and then asked to generate in another language, it sometimes slips back into producing English (or a mix of languages) instead of the target language. The paper states the problem directly: it describes "a simple technique to prevent 'accidental translation' in the zero-shot setting, where a generative model chooses to (partially) translate its prediction into the wrong language." [1]

The effect is most pronounced for the smaller variants (mT5-Small and mT5-Base) and for certain languages, and it does not appear in encoder-only models like XLM-R because those models never generate free text; they only classify or extract spans. The fix the authors propose is to mix a small fraction of the unsupervised mC4 span-corruption task back into the fine-tuning stage, which keeps the model anchored to multilingual generation and largely eliminates the wrong-language outputs. [1] Accidental translation has since become a standard reference point in work on zero-shot cross-lingual generation.
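The mixing idea can be sketched as a data stream that occasionally inserts an unsupervised multilingual example between task examples. This is a minimal illustration under stated assumptions: `mixed_stream` is a hypothetical helper, the data are placeholders, and the mixing rate is an arbitrary value chosen for the demo, not a setting taken from the paper.

```python
import random

def mixed_stream(task_examples, unsup_examples, unsup_rate=0.01, seed=0):
    """Yield fine-tuning examples, occasionally inserting an unsupervised one.

    `unsup_rate` is the chance of inserting a span-corruption example before
    each task example. The default is illustrative, not a value from the paper.
    """
    rng = random.Random(seed)
    for example in task_examples:
        if rng.random() < unsup_rate:
            yield ("span_corruption", rng.choice(unsup_examples))
        yield ("task", example)

# Placeholder data standing in for English task data and multilingual mC4 text.
english_task = [f"english example {i}" for i in range(1000)]
multilingual_text = ["texto en español", "maandishi ya Kiswahili"]

stream = list(mixed_stream(english_task, multilingual_text, unsup_rate=0.05))
n_unsup = sum(1 for kind, _ in stream if kind == "span_corruption")
print(f"{n_unsup} unsupervised examples mixed into {len(english_task)} task examples")
```

In a real setup the unsupervised examples would be passed through span corruption before being fed to the model, so the decoder keeps seeing non-English targets during fine-tuning.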

## What is the curse of multilinguality?

The phrase "curse of multilinguality" was coined by Conneau et al. in the XLM-R paper. [3] The observation is that, holding model capacity fixed, adding more languages to a multilingual pre-training run improves average performance up to a point, then starts to hurt average performance because the parameters get spread too thin. The optimal number of languages depends on the parameter budget; XLM-R found a peak around 100 languages for their largest model. [3]

mT5 sits inside the same dynamic. The paper notes that for mT5-Small, performance per language is meaningfully worse than what you would get from a hypothetical monolingual model of the same size. As you scale up, the gap closes. mT5-XXL with 13 billion parameters is large enough that the dilution penalty becomes small, at least for the languages with abundant pretraining data. For genuinely low-resource languages, even XXL underperforms a monolingual model of the same size. [1]

No published paper has cleanly resolved the curse of multilinguality. Mixture-of-experts approaches (where each token is routed to a subset of language-specific experts) help. Bigger models help. Better data curation helps. But there is no free lunch: spending parameters on Bengali means fewer parameters available for English, and vice versa. mT5's contribution to the discussion was empirical evidence that the curse is real but scalable, in the sense that big enough models can pay the tax without their downstream quality dropping below useful thresholds.

## What models did mT5 lead to?

A cluster of multilingual T5 derivatives followed mT5 over the next few years.

[**ByT5**](/wiki/byt5) (Xue et al. 2021, arXiv:2105.13626) is a token-free version of T5 that operates directly on UTF-8 bytes instead of subword pieces. [4] The motivation was twofold: avoid the embedding-table tax that dominated mT5-Small, and remove the assumption that a learned subword vocabulary is the right unit for every language. The ByT5 paper reports that "byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation." [4] ByT5 trains slower per step (sequences are much longer in bytes) but is more robust to typos, code-switching, and morphologically rich languages. It was released in five sizes matching mT5.
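The sequence-length cost of working on bytes is easy to see with plain UTF-8 encoding. The sketch below only counts bytes; it does not reproduce ByT5's actual tokenizer, which also reserves a few special ids, and the sample strings are arbitrary greetings chosen for illustration.

```python
def byte_tokens(text):
    """Return the UTF-8 byte values a byte-level model would consume."""
    return list(text.encode("utf-8"))

samples = {"English": "hello", "Hindi": "नमस्ते", "Chinese": "你好"}

for name, text in samples.items():
    n_points = len(text)              # Unicode code points
    n_bytes = len(byte_tokens(text))  # model input length in bytes
    print(f"{name}: {n_points} code points -> {n_bytes} bytes")
```

ASCII text costs one byte per character, while Devanagari and Chinese characters take three bytes each in UTF-8, which is one reason byte-level sequences are much longer for many non-Latin scripts.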

**mT5x** is the JAX/Flax reimplementation of T5 and mT5 inside Google's t5x framework. The original mT5 was trained in Mesh TensorFlow; t5x is the modern successor. All recent T5-family models from Google use t5x.

**umT5** (Chung et al. 2023, "UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining") is a follow-on multilingual T5 with a more careful language sampling strategy called UniMax that "explicitly caps the number of repeats over each language's corpus" to give more uniform coverage of head languages while reducing overfitting on tail languages. [5] umT5 was trained on a refreshed mC4 corpus of 29 trillion characters across 107 languages and released in the same five sizes as mT5, improving low-resource language performance. [5] The Google Research models repo lists umT5 as the recommended replacement for mT5 in 2024 and beyond.
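The repeat cap can be sketched as a budget allocation: languages are visited from smallest to largest, each is offered an equal share of the remaining budget, and the share is clipped at a fixed number of passes over that language's corpus. This is a minimal sketch based on the description above; the published algorithm may differ in details, and the function name, character counts, budget, and epoch cap are all hypothetical.

```python
def unimax_probs(char_counts, budget, max_epochs):
    """UniMax-style allocation: uniform shares with a per-language repeat cap."""
    langs = sorted(char_counts, key=char_counts.get)  # smallest corpus first
    remaining = float(budget)
    alloc = {}
    for i, lang in enumerate(langs):
        fair_share = remaining / (len(langs) - i)  # split what is left evenly
        cap = char_counts[lang] * max_epochs       # at most `max_epochs` passes
        alloc[lang] = min(fair_share, cap)
        remaining -= alloc[lang]
    total = sum(alloc.values())
    return {lang: a / total for lang, a in alloc.items()}

# Hypothetical character counts; NOT real corpus statistics.
toy = {"en": 1_000_000, "sw": 1_000, "haw": 100}
probs = unimax_probs(toy, budget=300_000, max_epochs=4)

for lang, p in probs.items():
    print(f"{lang}: {p:.4f}")
```

In this toy run the two small languages hit their caps, and the budget they cannot absorb flows to the large language instead of forcing many repeated passes over tiny corpora.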

[**MADLAD-400**](/wiki/madlad_400) (Kudugunta et al. 2023) is a family of T5-style multilingual machine translation models, with 3B and 7B parameter variants, trained on a 3 trillion token corpus covering 419 languages, far beyond mT5's 101. [6] MADLAD-400 was released as both a translation-specific and a general-purpose pre-trained model, and it remains one of the broadest-coverage open multilingual models as of 2026.

## How does mT5 compare to other multilingual models?

[**NLLB**](/wiki/nllb) (No Language Left Behind, [Meta AI](/wiki/meta_ai) 2022) is not a T5 derivative, but it sits in the same problem space. NLLB-200 is a 54B-parameter mixture-of-experts encoder-decoder trained for translation across 200 languages, with a strong focus on the long tail of African and other low-resource languages. [7]

[**Aya**](/wiki/aya), Aya 23, and Aya Expanse are instruction-tuned multilingual models from [Cohere](/wiki/cohere) for AI, released between 2024 and 2025. [8] The original Aya model (Aya 101) is itself a direct descendant of mT5: it is an instruction-fine-tuned mT5-XXL covering 101 languages. [8] Aya 23 and Aya Expanse switched to a decoder-only design built on Cohere's Command series and cover a narrower set of 23 widely spoken languages with strong instruction-following. Aya Expanse 32B is the most capable of the family as of 2026.

For general use, decoder-only LLMs released after 2023 have largely displaced mT5. [Llama 3](/wiki/llama_3), [Qwen](/wiki/qwen) 2 and 2.5, [Gemma](/wiki/gemma) 2, [Mistral](/wiki/mistral) Nemo, [Gemini](/wiki/gemini), and the [GPT](/wiki/gpt) family all support a respectable subset of languages out of the box, and their instruction-tuned variants tend to outperform mT5 on chat-style tasks even without explicit multilingual fine-tuning. The catch is that the language coverage in any specific decoder-only model is usually narrower than mT5's 101, and quality drops faster on unseen languages because the pre-training data is heavily English-weighted. For encoder-decoder workflows specifically (translation, summarization with high faithfulness requirements, structured-output tasks), mT5 and its descendants still see active use.

## Is mT5 open source, and how is it used?

All five mT5 checkpoints are released under the Apache 2.0 license. [11][12] The mC4 dataset is also Apache 2.0 in distribution, though the underlying Common Crawl text is subject to its own collection of source-website licenses.

The canonical model checkpoints live on [Hugging Face](/wiki/hugging_face) under the `google` organization. Loading them works out of the box with the `transformers` library (the tokenizer also needs the `sentencepiece` package):

```python
from transformers import T5Tokenizer, MT5ForConditionalGeneration

# Download (or load from cache) the tokenizer and weights for mT5-Base.
tok = T5Tokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
```

Fine-tuning is the most common way mT5 is used in practice, since the raw pre-trained checkpoints are not instruction-tuned and produce mostly noisy text without task-specific adaptation. As the official Hugging Face model card warns, "mT5 was only pre-trained on mC4 excluding any supervised training. Therefore, this model has to be fine-tuned before it is useable on a downstream task." [11] The original mT5 GitHub repo at `google-research/multilingual-t5` was archived (read-only) on 17 February 2024 in favor of the broader t5x repo. [12]

## Why was mT5 influential?

The mT5 paper has been cited roughly 2,935 times on Semantic Scholar as of 2026, including about 297 highly influential citations. [13] Its influence shows up in three places. The mC4 pipeline became a template: subsequent multilingual datasets (CulturaX, MADLAD-400, Glot500) follow some variant of it with language-aware filtering and temperature-based sampling. The alpha = 0.3 temperature sampling became a default starting point for almost every multilingual pre-training paper that came after, until UniMax-style repeat capping began to replace it. [5] And the practical packaging on Hugging Face turned mT5 into the workhorse multilingual encoder-decoder for academic and applied NLP work between 2021 and 2023.

mT5 was also one of the early demonstrations that multilingual pre-training could be done with the same recipe as monolingual pre-training, just at larger scale. That observation looks obvious in retrospect but it was not at the time: the prevailing intuition was that multilingual models needed special architectural tricks like language adapters, language embeddings, or language-specific output heads to cope with the heterogeneity of the data. mT5 showed that at sufficient scale, a single homogeneous architecture trained on a temperature-balanced mixture works well enough to be competitive across most tasks.

## Limitations

The mT5 paper itself is candid about the model's weaknesses. mT5 was not instruction-tuned and sampling text from the raw checkpoint produces low-quality output, which limits its use as a general-purpose generator. [11] Performance on the smallest 30 or so languages in mC4 is much worse than performance on the top 20, and for many of those low-resource languages a domain-matched monolingual model of the same parameter count would likely beat mT5. [1]

A practical issue that surfaced in the years after release is that mT5's tokenizer can produce extremely fragmented sequences for non-Latin scripts, particularly Indic and Southeast Asian scripts, because the tokenizer's training distribution was dominated by Latin-script languages. ByT5 and the byte-level approach were partly motivated by this issue. [4] Relatedly, the 250K vocabulary is too small in retrospect to give every covered language adequate subword coverage. Later models like umT5 and MADLAD-400 use refreshed corpora and improved sampling to address this. [5][6]

## See also

- [T5](/wiki/t5)
- [C4 dataset](/wiki/c4_dataset)
- [ByT5](/wiki/byt5)
- [MADLAD-400](/wiki/madlad_400)
- [NLLB](/wiki/nllb)
- [Aya](/wiki/aya)
- [XLM-R](/wiki/xlm_r)
- [mBERT](/wiki/mbert)
- [Multilingual model](/wiki/multilingual_model)
- [SentencePiece](/wiki/sentencepiece)
- [Transformer](/wiki/transformer)
- [Span corruption](/wiki/span_corruption)
- [Hugging Face](/wiki/hugging_face)

## References

1. Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, Colin Raffel. "mT5: A massively multilingual pre-trained text-to-text transformer." NAACL-HLT 2021. arXiv:2010.11934 (22 October 2020). https://arxiv.org/abs/2010.11934
2. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." Journal of Machine Learning Research 21(140), 2020. arXiv:1910.10683.
3. Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzman, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov. "Unsupervised Cross-lingual Representation Learning at Scale." ACL 2020. arXiv:1911.02116. (XLM-R)
4. Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel. "ByT5: Towards a token-free future with pre-trained byte-to-byte models." Transactions of the Association for Computational Linguistics 10, 2022. arXiv:2105.13626. https://arxiv.org/abs/2105.13626
5. Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, Noah Constant. "UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining." ICLR 2023. (umT5)
6. Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A. Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, Orhan Firat. "MADLAD-400: A Multilingual And Document-Level Large Audited Dataset." NeurIPS Datasets and Benchmarks 2023.
7. NLLB Team et al. "No Language Left Behind: Scaling Human-Centered Machine Translation." 2022. arXiv:2207.04672.
8. Ahmet Ustun et al. "Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model." 2024. arXiv:2402.07827.
9. Telmo Pires, Eva Schlinger, Dan Garrette. "How multilingual is Multilingual BERT?" ACL 2019. arXiv:1906.01502. (mBERT analysis)
10. Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, Melvin Johnson. "XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization." ICML 2020. (XNLI, PAWS-X, TyDiQA, MLQA, XQuAD aggregate evaluation)
11. Hugging Face model cards: `google/mt5-small`, `google/mt5-base`, `google/mt5-large`, `google/mt5-xl`, `google/mt5-xxl`. https://huggingface.co/google/mt5-xxl
12. Google Research GitHub repository `google-research/multilingual-t5` (archived read-only 17 February 2024). https://github.com/google-research/multilingual-t5
13. Semantic Scholar entry for "mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer" (Xue et al.), citation metrics as of 2026. https://www.semanticscholar.org/paper/74276a37bfa50f90dfae37f767b2b67784bd402a

