XLM-RoBERTa
Last reviewed
May 31, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 2,659 words
XLM-RoBERTa (often abbreviated XLM-R) is a multilingual masked language model developed by Facebook AI Research and released in late 2019. It applies the training recipe of RoBERTa to a single transformer encoder shared across 100 languages, pretrained on 2.5 terabytes of filtered web text drawn from CommonCrawl. The model was introduced in the paper "Unsupervised Cross-lingual Representation Learning at Scale" by Alexis Conneau, Kartikay Khandelwal, and colleagues, and it became one of the most widely used multilingual encoders in natural language processing, surpassing earlier systems such as multilingual BERT (mBERT) and the original XLM by large margins on cross-lingual benchmarks.
XLM-RoBERTa is distributed through Hugging Face under the FacebookAI organization, with the two principal checkpoints being FacebookAI/xlm-roberta-base and FacebookAI/xlm-roberta-large. Both are released under the permissive MIT license. The model is designed to be fine-tuned on downstream tasks such as text classification, named entity recognition, and extractive question answering, and it is especially valued for its ability to transfer knowledge learned in high-resource languages to languages with little labeled data.
Multilingual language models aim to build a single set of weights that represents text in many languages at once, so that a single model can serve dozens or hundreds of languages and so that supervision available in one language can benefit others. Before XLM-RoBERTa, the dominant multilingual encoders were mBERT, which Google trained on Wikipedia in 104 languages, and XLM, which Facebook trained using both monolingual data and a translation-based objective on parallel corpora. Both showed that cross-lingual transfer was possible, but their quality on low-resource languages remained limited, and their training corpora were modest in size.
The central thesis of XLM-RoBERTa is that scaling matters: by combining a much larger and more balanced multilingual corpus, a larger shared vocabulary, longer training, and the streamlined RoBERTa objective, a multilingual model can match or exceed strong monolingual baselines while covering a hundred languages. The result challenged the prevailing assumption that supporting many languages necessarily came at a steep cost to per-language accuracy. XLM-RoBERTa demonstrated that, with enough capacity and data, a single model could be competitive with dedicated monolingual models on the GLUE and XNLI benchmarks while still handling languages as different as English, Swahili, Urdu, and Chinese.
A practical convenience of the model is that it does not require any language identifier at inference time. Unlike some earlier XLM variants that needed a language-embedding or language tensor to indicate which language the input was written in, XLM-RoBERTa infers the language directly from the token sequence. This makes it simpler to deploy in settings where the input language is unknown or mixed.
XLM-RoBERTa is an encoder-only transformer, architecturally identical to RoBERTa, which in turn is a refinement of BERT. It stacks layers of multi-head self-attention and position-wise feed-forward networks, with learned positional embeddings and a GELU activation. There is no decoder and no autoregressive generation: the model produces contextual embeddings for every input token, which are then consumed by task-specific heads added during fine-tuning.
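Several design choices inherited from RoBERTa distinguish XLM-RoBERTa from mBERT: the next-sentence-prediction task is removed, masking is applied dynamically rather than fixed once during preprocessing, and training uses more data with larger batches.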
The most important architectural departure from the original XLM is the tokenizer. XLM-RoBERTa uses a single shared SentencePiece model trained directly on raw text, rather than language-specific preprocessing and byte-pair encoding. SentencePiece operates on Unicode characters without requiring language-aware word segmentation, which is well suited to a corpus that mixes scripts and writing systems and to languages such as Chinese, Japanese, and Thai that do not delimit words with spaces. The shared vocabulary of about 250,000 pieces is the same for every language, so the model has a unified input and output space across all 100 languages.
Because its implementation matches RoBERTa exactly, XLM-RoBERTa exposes the standard set of task heads in libraries such as Hugging Face Transformers, including heads for masked language modeling, sequence classification, token classification, multiple choice, and question answering. The model accepts the usual special tokens, with <s> marking the start of a sequence, </s> separating segments, and <mask> denoting positions to be predicted.
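As a rough illustration of the masked-language-modeling head and the `<mask>` token, the sketch below uses the Hugging Face Transformers `pipeline` API with the base checkpoint named earlier. The example sentences and the helper name `fill_masks` are invented for illustration, and no particular predictions are claimed. Note that no language identifier is passed anywhere.

```python
MODEL_ID = "FacebookAI/xlm-roberta-base"
MASK_TOKEN = "<mask>"

# Illustrative inputs in three languages; the same model handles all of them
# and no language identifier is supplied.
EXAMPLES = [
    f"The capital of France is {MASK_TOKEN}.",
    f"La capitale de la France est {MASK_TOKEN}.",
    f"Die Hauptstadt von Frankreich ist {MASK_TOKEN}.",
]


def fill_masks(texts, top_k=3):
    """Return the top_k (token, score) guesses for the single mask in each text."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    return {
        text: [(r["token_str"], r["score"]) for r in fill(text, top_k=top_k)]
        for text in texts
    }
```

Calling `fill_masks(EXAMPLES)` downloads the checkpoint on first use and returns, for each sentence, the candidate tokens with their scores. Each input here contains exactly one mask, which keeps the pipeline output a flat list per sentence.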
The headline ingredient behind XLM-RoBERTa is its training corpus. The authors built a new multilingual dataset, later known as CC-100, by processing CommonCrawl web snapshots with the CCNet pipeline. CCNet deduplicates documents, applies a language-identification model to route text into per-language buckets, and filters for quality using a language model trained on Wikipedia, keeping text that resembles well-formed prose. The resulting collection totals about 2.5 terabytes of clean text across 100 languages, which is orders of magnitude larger than the Wikipedia dumps used to train mBERT. For high-resource languages such as English, the increase is large; for many low-resource languages, CommonCrawl supplies far more text than Wikipedia alone ever could, which is a major reason the model performs so well on those languages.
Because the natural distribution of web text is extremely skewed toward a handful of languages, the authors rebalanced the training mixture. They sampled languages according to a multinomial distribution with an exponent of alpha equal to 0.3, which upsamples low-resource languages and downsamples high-resource ones relative to their raw token counts. This same factor governs how the shared SentencePiece vocabulary is learned, so that smaller languages receive enough representation in the subword inventory.
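The effect of the exponent can be seen in a small sketch. It assumes that each language with raw corpus share p_i is sampled with probability proportional to p_i raised to the power alpha. The token counts are invented for illustration and are not CC-100 statistics, and the function name is hypothetical; only the value alpha = 0.3 comes from the text.

```python
def rebalanced_probs(token_counts, alpha=0.3):
    """Return sampling probabilities q_i proportional to p_i ** alpha."""
    total = sum(token_counts.values())
    # Raw share p_i of each language in the corpus.
    raw = {lang: n / total for lang, n in token_counts.items()}
    # Exponentiate, then renormalise; alpha < 1 flattens the distribution.
    scaled = {lang: p ** alpha for lang, p in raw.items()}
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}


# Invented token counts for a high-, medium-, and low-resource language.
counts = {"en": 1_000_000, "ur": 10_000, "sw": 1_000}
probs = rebalanced_probs(counts)
for lang, q in probs.items():
    raw_share = counts[lang] / sum(counts.values())
    print(f"{lang}: raw share {raw_share:.4f} -> sampling probability {q:.4f}")
```

With alpha below 1 the ranking of languages is preserved, but the smallest languages are drawn far more often than their raw share would suggest, and the largest one correspondingly less often.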
The pretraining objective is multilingual masked language modeling. Following the RoBERTa convention, roughly 15 percent of subword tokens in each input are selected for prediction, and the model is trained to recover them from the surrounding bidirectional context. This is a form of self-supervised learning: no human labels are required, only large quantities of raw text. A point that distinguishes XLM-RoBERTa from the original XLM is that it uses only monolingual data and the plain MLM objective. It deliberately does not use the Translation Language Modeling (TLM) objective from XLM, which required aligned parallel sentence pairs. Dropping the dependence on parallel data made it possible to scale to far more text and more languages.
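A simplified sketch of how masked positions might be chosen is shown below. It is not the authors' implementation: whitespace-split words stand in for SentencePiece pieces, every selected position is replaced with `<mask>`, and the function name is hypothetical. BERT-style recipes also leave some selected tokens unchanged or swap in random ones, a refinement omitted here.

```python
import random

MASK = "<mask>"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Hide about mask_prob of the tokens behind <mask>.

    Returns (masked_tokens, labels). labels[i] holds the original token at a
    selected position and None elsewhere, where no loss would be computed.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Select at least one position so short inputs still yield a target.
    n_select = max(1, round(len(tokens) * mask_prob))
    selected = set(rng.sample(range(len(tokens)), n_select))
    masked = [MASK if i in selected else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in selected else None for i, tok in enumerate(tokens)]
    return masked, labels


pieces = "The model is trained on raw text in one hundred languages at once".split()
masked, labels = mask_tokens(pieces, seed=0)
print(masked)
print(labels)
```

Calling the function again with a different seed selects different positions, which is the idea behind dynamic masking: the same sentence yields different prediction targets each time it is seen.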
Training was carried out at large scale. The authors report performing 1.5 million optimizer updates with a batch size of 8,192 sequences, using 500 NVIDIA V100 GPUs with 32 GB of memory each. This combination of a very large corpus, large batches, and a long schedule mirrors the RoBERTa finding that BERT-style models are substantially undertrained, and that simply training longer on more data yields large gains.
The original release included two model sizes, both shipped with the same 100-language SentencePiece vocabulary of roughly 250,000 tokens. The base model matches the dimensions of BERT-base, and the large model matches BERT-large. The table below summarizes the published configurations.
| Model | Layers | Hidden size | Attention heads | Parameters | Vocabulary | Languages |
|---|---|---|---|---|---|---|
| XLM-R Base (xlm-roberta-base) | 12 | 768 | 12 | ~270M | ~250k | 100 |
| XLM-R Large (xlm-roberta-large) | 24 | 1024 | 16 | ~550M | ~250k | 100 |
The large multilingual vocabulary is one reason these models have more parameters than their monolingual counterparts: the embedding and output projection matrices for 250,000 subwords are sizeable, so a meaningful fraction of the parameter budget is spent on the vocabulary rather than on the transformer layers themselves.
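A back-of-the-envelope calculation makes the point concrete. It uses only the rounded figures from the table (about 250,000 subwords, hidden sizes of 768 and 1024, totals of roughly 270M and 550M) and assumes that the input embedding and the output projection share a single matrix, so the vocabulary is counted once. The function name is hypothetical.

```python
VOCAB_SIZE = 250_000  # approximate shared vocabulary size from the text

# name -> (hidden size, approximate total parameters), rounded as in the table
CONFIGS = {
    "xlm-roberta-base": (768, 270_000_000),
    "xlm-roberta-large": (1024, 550_000_000),
}


def embedding_share(hidden_size, total_params, vocab_size=VOCAB_SIZE):
    """Return (token-embedding parameter count, fraction of all parameters)."""
    # One vector of length hidden_size per subword in the vocabulary.
    embedding_params = vocab_size * hidden_size
    return embedding_params, embedding_params / total_params


for name, (hidden, total) in CONFIGS.items():
    params, share = embedding_share(hidden, total)
    print(f"{name}: {params / 1e6:.0f}M embedding parameters, {share:.0%} of the total")
```

Under these rounded figures the token embeddings account for roughly 192M parameters in the base model and 256M in the large model, that is, about 71 percent and 47 percent of the respective totals.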
In 2021, the same lab released two much larger successors in the paper "Larger-Scale Transformers for Multilingual Masked Language Modeling" by Naman Goyal and colleagues. XLM-R XL has about 3.5 billion parameters and XLM-R XXL about 10.7 billion. These models pushed cross-lingual accuracy higher still, improving over XLM-R Large by 1.8 and 2.4 percentage points respectively in average XNLI accuracy, and they reached performance competitive with multilingual T5. The XL and XXL checkpoints are also available on Hugging Face as facebook/xlm-roberta-xl and facebook/xlm-roberta-xxl, though their size makes them considerably more demanding to run than the base and large models that remain the default choices for most users.
XLM-RoBERTa is evaluated primarily on cross-lingual transfer, where a model is fine-tuned on labeled data in one language, typically English, and then evaluated on test sets in other languages without any additional training in those languages. The flagship benchmark for this setting is XNLI, a cross-lingual natural language inference dataset covering 15 languages.
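The evaluation protocol can be sketched in a few lines. The code assumes a classifier that has already been fine-tuned on English data and is exposed as a plain callable; the function names, the toy data, and the stand-in predictor are all hypothetical and serve only to show how per-language accuracies are averaged.

```python
def accuracy(predictions, gold):
    """Fraction of examples whose predicted label equals the gold label."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equally long")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def zero_shot_report(predict, test_sets):
    """Score one English-fine-tuned classifier on every target language.

    predict: callable mapping a list of (premise, hypothesis) pairs to labels.
    test_sets: {language: (pairs, gold_labels)}. No training happens here.
    """
    per_language = {
        lang: accuracy(predict(pairs), gold)
        for lang, (pairs, gold) in test_sets.items()
    }
    # Benchmark-style summary: unweighted mean over languages.
    average = sum(per_language.values()) / len(per_language)
    return per_language, average


# Toy stand-in for a fine-tuned model: it always answers "entailment".
def always_entailment(pairs):
    return ["entailment"] * len(pairs)


toy_sets = {
    "sw": ([("a", "b"), ("c", "d")], ["entailment", "neutral"]),
    "ur": ([("e", "f")], ["entailment"]),
}
per_language, average = zero_shot_report(always_entailment, toy_sets)
print(per_language, average)
```

The average is taken over languages rather than over examples, so each language counts equally regardless of the size of its test set.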
In the zero-shot cross-lingual transfer setting, where the model sees English training data only, XLM-R Large reaches about 80.9 percent average accuracy on XNLI, an improvement of 14.6 percentage points over multilingual BERT, which scores around 66.3 percent. When the model is instead trained on machine-translated training sets in all target languages, the "translate-train-all" setting, XLM-R Large reaches about 83.6 percent average accuracy, a new state of the art at the time of publication. The gains are largest on low-resource languages: relative to the earlier XLM model, XLM-R improved XNLI accuracy by 15.7 points on Swahili and 11.4 points on Urdu, languages for which mBERT and XLM had been comparatively weak.
The improvements extend beyond inference. On the MLQA multilingual question-answering benchmark, XLM-R improved average F1 by about 13 points over mBERT, and on cross-lingual named entity recognition it improved F1 by about 2.4 points. Perhaps the most consequential finding for the field was that these multilingual gains did not come at the expense of monolingual quality: XLM-R Large was shown to be competitive with strong monolingual models on the English GLUE benchmark, undermining the long-held belief that a model spread across a hundred languages must necessarily be much weaker in any single one.
A caveat that accompanies these numbers is that the largest gains accrue to the lower- and medium-resource languages, where more and cleaner data made the biggest difference. For very high-resource languages, the advantage over a well-trained monolingual model is smaller, and in some cases a dedicated monolingual model can still edge out the multilingual one.
XLM-RoBERTa is a general-purpose encoder, and like BERT and RoBERTa it is meant to be adapted to a specific task rather than used out of the box for generation. The most common applications include text classification, named entity recognition, and extractive question answering. For example, the published checkpoint xlm-roberta-large-finetuned-conll03-english fine-tunes the large model for English NER on the CoNLL-2003 data.
The model is particularly attractive when an organization needs to support many languages without training and maintaining a separate model for each, and when labeled training data is concentrated in a few languages but the application must serve a global audience. Its zero-shot transfer ability means a classifier trained on English data can often perform respectably in dozens of other languages immediately.
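As an illustration of using a fine-tuned derivative, the sketch below runs the CoNLL-2003 NER checkpoint through the Transformers `pipeline` API. The Hub identifier is an assumption (the checkpoint name from the text placed under the FacebookAI organization), and the helper names and score threshold are hypothetical.

```python
# Assumed Hub ID: the checkpoint name from the text under the FacebookAI organization.
NER_MODEL_ID = "FacebookAI/xlm-roberta-large-finetuned-conll03-english"


def filter_entities(raw_entities, min_score=0.5):
    """Keep confident spans from the pipeline's list of result dicts."""
    return [
        (e["word"], e["entity_group"], float(e["score"]))
        for e in raw_entities
        if e["score"] >= min_score
    ]


def extract_entities(text, min_score=0.5):
    """Return (entity text, entity type, score) triples found in text."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    ner = pipeline(
        "token-classification",
        model=NER_MODEL_ID,
        aggregation_strategy="simple",  # merge subword pieces into whole spans
    )
    return filter_entities(ner(text), min_score)
```

Calling `extract_entities` downloads the large checkpoint on first use. Keeping `filter_entities` separate allows the post-processing to be checked without loading the model.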
The checkpoints are mature and well integrated into the ecosystem. The base and large models are downloaded many millions of times per month on Hugging Face, and thousands of fine-tuned derivatives have been published, making XLM-RoBERTa one of the most heavily reused multilingual backbones in production NLP.
XLM-RoBERTa is not a text generator. As an encoder trained with masked language modeling, it is suited to understanding tasks that consume an entire input to make a decision, such as classification, tagging, and span extraction. It cannot produce fluent free-form text the way autoregressive models such as GPT-style systems can, and attempting to use it for open-ended generation is a misuse of the architecture.
The model is also subject to the curse of multilinguality, a phenomenon the authors characterized precisely. For a model of fixed capacity, adding more languages first helps, because related languages reinforce one another through positive transfer, but beyond a certain point the per-language capacity becomes too diluted and performance degrades across the board, including on high-resource languages. XLM-RoBERTa pushes this frontier outward by using a large model and a large vocabulary, but the trade-off does not disappear; it is the reason that scaling up to the XL and XXL sizes continued to yield improvements.
Coverage is uneven across the 100 languages. Despite the rebalanced sampling, high-resource languages still dominate the underlying web data, so quality is generally strongest for them and weaker for the lowest-resource languages and scripts. Languages not among the supported 100 are unlikely to be handled well.
Like other models pretrained on uncurated web crawls, XLM-RoBERTa can reflect biases and undesirable content present in its training data, and these biases can surface in fine-tuned systems. The official model documentation cautions that such biases may affect downstream predictions and that they tend to carry over even after task-specific fine-tuning. The original release also predates many later refinements in multilingual modeling, and for some applications newer or larger multilingual encoders and instruction-tuned multilingual generators may be preferable.
XLM-RoBERTa sits at the confluence of two research lines from Facebook AI Research and the broader transformer literature. The first is the lineage of bidirectional encoders: BERT introduced masked language modeling and bidirectional pretraining in 2018, and RoBERTa showed in 2019 that BERT was undertrained and that removing the next-sentence-prediction task, training on more data with larger batches, and using dynamic masking produced substantial gains. The second is the lineage of cross-lingual pretraining: multilingual BERT showed that a single BERT could be trained on many languages at once, and the original XLM, also from Conneau and collaborators, formalized cross-lingual pretraining and introduced the Translation Language Modeling objective that used parallel data.
XLM-RoBERTa, released in November 2019 and published at ACL 2020, fused these threads. It took the RoBERTa training methodology, removed the dependence on parallel data, and scaled the corpus and vocabulary dramatically using the CC-100 dataset built with CCNet. The empirical message was that careful scaling of data and model size could deliver a multilingual model that was strong everywhere, rather than one that traded breadth for depth.
The model's influence is visible in much of the multilingual NLP that followed. The CC-100 corpus became a standard resource for training other multilingual models. The XL and XXL successors extended the same recipe to billions of parameters. And a large family of derivative models, including multilingual sentence encoders and domain-adapted variants, took XLM-RoBERTa as their starting point. Even as instruction-tuned multilingual generative models have grown prominent, XLM-RoBERTa remains a default encoder for multilingual classification, tagging, and retrieval, valued for being compact, well understood, openly licensed, and reliable across a wide range of languages.