XLM-RoBERTa

Updated Jun 27, 2026

XLM-RoBERTa (often abbreviated XLM-R) is a multilingual masked language model developed by Facebook AI Research (now Meta AI) and released in November 2019. It applies the RoBERTa pretraining recipe to a single shared transformer encoder covering 100 languages, pretrained on more than 2.5 terabytes of filtered CommonCrawl web text (the corpus later released as CC-100) ^[1]^[3]. Introduced in the paper "Unsupervised Cross-lingual Representation Learning at Scale" by Alexis Conneau, Kartikay Khandelwal, and colleagues, it shipped in Base (~270M parameters) and Large (~550M parameters) sizes and substantially outperformed multilingual BERT (mBERT), improving average accuracy on the XNLI cross-lingual benchmark by 14.6 percentage points ^[1]^[2].

XLM-RoBERTa became one of the most widely used multilingual encoders in natural language processing, surpassing earlier systems such as mBERT and the original XLM by large margins on cross-lingual benchmarks ^[1]. It is distributed through Hugging Face under the FacebookAI organization, with the two principal checkpoints being FacebookAI/xlm-roberta-base and FacebookAI/xlm-roberta-large, both released under the permissive MIT license ^[3]^[4]. The model is designed to be fine-tuned on downstream tasks such as text classification, named entity recognition, and extractive question answering, and it is especially valued for its ability to transfer knowledge learned in high-resource languages to languages with little labeled data ^[3].

What is XLM-RoBERTa?

Multilingual language models aim to build a single set of weights that represents text in many languages at once, so that one model can serve dozens or hundreds of languages and so that supervision available in one language can benefit others. Before XLM-RoBERTa, the dominant multilingual encoders were mBERT, which Google trained on Wikipedia in 104 languages, and XLM, which Facebook trained using both monolingual data and a translation-based objective on parallel corpora. Both showed that cross-lingual transfer was possible, but their quality on low-resource languages remained limited, and their training corpora were modest in size.

The central thesis of XLM-RoBERTa is that scaling matters: by combining a much larger and more balanced multilingual corpus, a larger shared vocabulary, longer training, and the streamlined RoBERTa objective, a multilingual model can match or exceed strong monolingual baselines while covering a hundred languages. As the authors put it, the work "shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks" ^[1]^[2]. The result challenged the prevailing assumption that supporting many languages necessarily came at a steep cost to per-language accuracy. XLM-RoBERTa demonstrated that, with enough capacity and data, a single model could be competitive with dedicated monolingual models on the GLUE and XNLI benchmarks while still handling languages as different as English, Swahili, Urdu, and Chinese ^[1]^[2].

A practical convenience of the model is that it does not require any language identifier at inference time. Unlike some earlier XLM variants that needed a language-embedding or language tensor to indicate which language the input was written in, XLM-RoBERTa infers the language directly from the token sequence ^[5]. This makes it simpler to deploy in settings where the input language is unknown or mixed.
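As a minimal sketch of what this looks like in practice, the snippet below sends sentences in three languages through the same fill-mask pipeline without passing any language tag. It assumes the Hugging Face `transformers` library is installed and that the `FacebookAI/xlm-roberta-base` checkpoint can be downloaded; the example sentences and the helper name `top_predictions` are illustrative, not part of the original release.

```python
CHECKPOINT = "FacebookAI/xlm-roberta-base"

# Illustrative inputs: one <mask> per sentence, and no language tag or
# language ID is attached to any of them.
EXAMPLES = [
    "The capital of France is <mask>.",
    "La capitale de la France est <mask>.",
    "Die Hauptstadt von Frankreich ist <mask>.",
]


def top_predictions(sentences, top_k=3):
    """Return the top_k fill-mask candidates for each sentence."""
    # Imported lazily so this module loads even where transformers is absent.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=CHECKPOINT)
    results = {}
    for sentence in sentences:
        # With a single <mask>, the pipeline returns a list of candidate dicts.
        candidates = unmasker(sentence, top_k=top_k)
        results[sentence] = [c["token_str"] for c in candidates]
    return results
```

Calling `top_predictions(EXAMPLES)` downloads the checkpoint on first use. The same pipeline object handles every sentence, which is the point: the language is never declared by the caller.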

What architecture does XLM-RoBERTa use?

XLM-RoBERTa is an encoder-only transformer, architecturally identical to RoBERTa, which in turn is a refinement of BERT ^[3]^[8]. It stacks layers of multi-head self-attention and position-wise feed-forward networks, with learned positional embeddings and a GELU activation. There is no decoder and no autoregressive generation: the model produces contextual embeddings for every input token, which are then consumed by task-specific heads added during fine-tuning.

Several design choices inherited from RoBERTa distinguish XLM-RoBERTa from mBERT:

It drops BERT's next-sentence-prediction objective and trains purely with masked language modeling ^[3]^[8].
It uses a much larger subword vocabulary, roughly 250,000 tokens, compared with around 110,000 for mBERT, which reduces the rate at which rare words in low-resource languages are fragmented into many pieces ^[1].
It uses dynamic masking and large training batches in the RoBERTa style ^[8].

The most important architectural departure from the original XLM is the tokenizer. XLM-RoBERTa uses a single shared SentencePiece model trained directly on raw text, rather than language-specific preprocessing and byte-pair encoding ^[1]. SentencePiece operates on Unicode characters without requiring language-aware word segmentation, which is well suited to a corpus that mixes scripts and writing systems and to languages such as Chinese, Japanese, and Thai that do not delimit words with spaces. The shared vocabulary of about 250,000 pieces is the same for every language, so the model has a unified input and output space across all 100 languages ^[1].

Because its implementation matches RoBERTa exactly, XLM-RoBERTa exposes the standard set of task heads in libraries such as Hugging Face Transformers, including heads for masked language modeling, sequence classification, token classification, multiple choice, and question answering ^[5]. The model accepts the usual special tokens, with <s> marking the start of a sequence, </s> separating segments, and <mask> denoting positions to be predicted.
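The special-token layout can be illustrated with a small sketch. The functions below are hypothetical helpers, not library code, and the input pieces are placeholders rather than real SentencePiece output. The doubled `</s>` between two segments follows the RoBERTa convention as documented for the Hugging Face tokenizer, which is assumed here.

```python
BOS, EOS, MASK = "<s>", "</s>", "<mask>"


def wrap_single(pieces):
    """Layout for one sequence: <s> X </s>."""
    return [BOS, *pieces, EOS]


def wrap_pair(pieces_a, pieces_b):
    """Layout for a sequence pair: <s> A </s></s> B </s>."""
    return [BOS, *pieces_a, EOS, EOS, *pieces_b, EOS]


# Placeholder pieces; a real tokenizer would produce subword units.
single = wrap_single(["Paris", "is", MASK])
pair = wrap_pair(["A", "man", "sleeps"], ["He", "is", "awake"])
```

In practice the tokenizer inserts these tokens automatically, so the sketch is mainly useful for reading tokenized output or debugging sequence-pair inputs.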

How was XLM-RoBERTa trained?

The headline ingredient behind XLM-RoBERTa is its training corpus. The authors built a new multilingual dataset, later known as CC-100, by processing CommonCrawl web snapshots with the CCNet pipeline ^[1]^[7]. CCNet deduplicates documents, applies a language-identification model to route text into per-language buckets, and filters for quality using a language model trained on Wikipedia, keeping text that resembles well-formed prose ^[7]. The resulting collection totals about 2.5 terabytes of clean text across 100 languages, which is orders of magnitude larger than the Wikipedia dumps used to train mBERT ^[3]. The increase is large even for high-resource languages such as English, and for many low-resource languages CommonCrawl supplies far more text than Wikipedia alone could, which is a major reason the model performs so well on those languages.

Because the natural distribution of web text is extremely skewed toward a handful of languages, the authors rebalanced the training mixture. They sampled languages from a multinomial distribution in which each language's share of the raw data is raised to the power α = 0.3 and then renormalized, which upsamples low-resource languages and downsamples high-resource ones relative to their raw token counts ^[1]. This same factor governs how the shared SentencePiece vocabulary is learned, so that smaller languages receive enough representation in the subword inventory.
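The rebalancing can be sketched in a few lines. Following the formulation used in the XLM line of work, each language's raw share p_i is raised to the power α and renormalized, giving q_i = p_i^α / Σ_j p_j^α. The corpus sizes below are hypothetical and chosen only to make the effect visible; they are not CC-100 statistics.

```python
def sampling_probs(token_counts, alpha=0.3):
    """Exponentially smoothed language sampling: q_i proportional to p_i ** alpha."""
    total = sum(token_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in token_counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical token counts (NOT the real CC-100 numbers).
counts = {"en": 1_000_000, "sw": 1_000}

# Raw shares, i.e. what sampling proportional to corpus size would give.
raw = {lang: n / sum(counts.values()) for lang, n in counts.items()}

# Smoothed shares with alpha = 0.3, as described in the text.
probs = sampling_probs(counts)
```

With α = 1 the function returns the raw shares unchanged, and as α approaches 0 the distribution approaches uniform, so 0.3 sits much closer to the uniform end.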

The pretraining objective is multilingual masked language modeling. Following the RoBERTa convention, roughly 15 percent of subword tokens in each input are selected for prediction, and the model is trained to recover them from the surrounding bidirectional context ^[3]. This is a form of self-supervised learning: no human labels are required, only large quantities of raw text. A point that distinguishes XLM-RoBERTa from the original XLM is that it uses only monolingual data and the plain MLM objective. It deliberately does not use the Translation Language Modeling (TLM) objective from XLM, which required aligned parallel sentence pairs. Dropping the dependence on parallel data made it possible to scale to far more text and more languages ^[1].
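A simplified sketch of the masking step is shown below. It selects about 15 percent of positions and replaces each with `<mask>`, returning the original tokens as prediction targets. This is an illustration under stated assumptions rather than the training code: BERT-style implementations also leave some selected tokens unchanged or swap in random tokens, which is omitted here, and the function name and whitespace tokenization are hypothetical.

```python
import random

MASK = "<mask>"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Mask roughly mask_prob of the tokens; return (masked_tokens, labels)."""
    if not tokens:
        return [], {}
    rng = rng or random.Random()
    # Always mask at least one position so short inputs still give a target.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), n_to_mask)
    masked = list(tokens)
    labels = {}  # position -> original token the model must recover
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, labels


# Toy input split on whitespace; real inputs are SentencePiece subwords.
tokens = "the model learns to recover hidden words from context".split()
masked, labels = mask_tokens(tokens, rng=random.Random(0))
```

Because the positions are drawn afresh on each call, the same sentence receives a different mask every time it is seen, which is the idea behind dynamic masking.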

Training was carried out at large scale. The authors report performing 1.5 million optimizer updates with a batch size of 8,192 sequences, using 500 NVIDIA V100 GPUs with 32 GB of memory each ^[1]. This combination of a very large corpus, large batches, and a long schedule mirrors the RoBERTa finding that BERT-style models are substantially undertrained, and that simply training longer on more data yields large gains ^[8].

What sizes and variants of XLM-RoBERTa are there?

The original release included two model sizes, both shipped with the same 100-language SentencePiece vocabulary of roughly 250,000 tokens. The base model matches the dimensions of BERT-base, and the large model matches BERT-large. The table below summarizes the published configurations ^[1]^[3]^[4].

Model	Layers	Hidden size	Attention heads	Parameters	Vocabulary	Languages
XLM-R Base (`xlm-roberta-base`)	12	768	12	~270M	~250k	100
XLM-R Large (`xlm-roberta-large`)	24	1024	16	~550M	~250k	100

The large multilingual vocabulary is one reason these models have more parameters than their monolingual counterparts: the embedding and output projection matrices for 250,000 subwords are sizeable, so a meaningful fraction of the parameter budget is spent on the vocabulary rather than on the transformer layers themselves.
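A back-of-the-envelope calculation makes the point concrete. The sketch below uses the rounded figures from the table above (about 250,000 vocabulary entries, 270M and 550M total parameters) and counts only the input embedding matrix, so the resulting shares are rough estimates rather than exact accounting.

```python
VOCAB_SIZE = 250_000  # approximate vocabulary size from the table

# Rounded figures from the table; total_params are approximate.
CONFIGS = {
    "xlm-roberta-base": {"hidden": 768, "total_params": 270_000_000},
    "xlm-roberta-large": {"hidden": 1024, "total_params": 550_000_000},
}


def embedding_params(vocab_size, hidden):
    """Parameters in a vocab_size x hidden token-embedding matrix."""
    return vocab_size * hidden


def embedding_share(name):
    """Fraction of the (approximate) parameter count spent on token embeddings."""
    cfg = CONFIGS[name]
    return embedding_params(VOCAB_SIZE, cfg["hidden"]) / cfg["total_params"]


for name in CONFIGS:
    print(f"{name}: {embedding_share(name):.0%} of parameters in token embeddings")
```

On these rounded numbers the embedding matrix accounts for roughly 70 percent of the base model and a little under half of the large model, which is why the base model in particular is much larger than a monolingual encoder of the same depth and width.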

In 2021, the same lab released two much larger successors in the paper "Larger-Scale Transformers for Multilingual Masked Language Modeling" by Naman Goyal and colleagues ^[6]. XLM-R XL has about 3.5 billion parameters and XLM-R XXL about 10.7 billion. These models pushed cross-lingual accuracy higher still, improving over XLM-R Large by 1.8 and 2.4 percentage points respectively in average XNLI accuracy, and they reached performance competitive with multilingual T5 ^[6]. The XL and XXL checkpoints are also available on Hugging Face as facebook/xlm-roberta-xl and facebook/xlm-roberta-xxl, though their size makes them considerably more demanding to run than the base and large models that remain the default choices for most users.

How does XLM-R compare to mBERT?

XLM-RoBERTa is evaluated primarily on cross-lingual transfer, where a model is fine-tuned on labeled data in one language, typically English, and then evaluated on test sets in other languages without any additional training in those languages. The flagship benchmark for this setting is XNLI, a cross-lingual natural language inference dataset covering 15 languages.
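The evaluation protocol itself is simple to express. The sketch below scores one fixed classifier on per-language test sets and reports the unweighted mean across languages. Everything in it is a toy stand-in: the `always_entailment` stub replaces a model fine-tuned on English, and the placeholder examples are not XNLI data.

```python
def accuracy(predict, examples):
    """Fraction of (inputs, label) examples that predict() labels correctly."""
    correct = sum(1 for inputs, label in examples if predict(inputs) == label)
    return correct / len(examples)


def zero_shot_report(predict, test_sets):
    """Score one fixed model on every language; no per-language training."""
    per_language = {lang: accuracy(predict, ex) for lang, ex in test_sets.items()}
    average = sum(per_language.values()) / len(per_language)
    return per_language, average


def always_entailment(inputs):
    """Toy stub standing in for a classifier fine-tuned on English only."""
    return "entailment"


# Placeholder (premise, hypothesis) pairs; NOT real benchmark data.
toy_test_sets = {
    "fr": [(("p1", "h1"), "entailment"), (("p2", "h2"), "neutral")],
    "sw": [(("p3", "h3"), "entailment")],
}

per_language, average = zero_shot_report(always_entailment, toy_test_sets)
```

The average is taken over languages rather than over examples, so every language counts equally regardless of the size of its test set.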

In the zero-shot cross-lingual transfer setting, where the model sees English training data only, XLM-R Large reaches about 80.9 percent average accuracy on XNLI, an improvement of 14.6 percentage points over multilingual BERT, which scores around 66.3 percent ^[1]. The paper reports this gain as "+14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER" over mBERT ^[1]^[2]. When the model is instead trained on machine-translated training sets in all target languages, the "translate-train-all" setting, XLM-R Large reached a new state of the art of about 83.6 percent average accuracy at the time of publication ^[1]. The gains are largest on low-resource languages: relative to the earlier XLM model, XLM-R improved XNLI accuracy by 15.7 points on Swahili and 11.4 points on Urdu, languages for which mBERT and XLM had been comparatively weak ^[1]^[2].

The improvements extend beyond inference. On the MLQA multilingual question-answering benchmark, XLM-R improved average F1 by about 13 points over mBERT, and on cross-lingual named entity recognition it improved F1 by about 2.4 points ^[1]. Perhaps the most consequential finding for the field was that these multilingual gains did not come at the expense of monolingual quality. The authors state that the work demonstrates "the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks" ^[1]^[2]. That result undermined the long-held belief that a model spread across a hundred languages must necessarily be much weaker in any single one.

A caveat that accompanies these numbers is that the largest gains accrue to the lower- and medium-resource languages, where more and cleaner data made the biggest difference. For very high-resource languages, the advantage over a well-trained monolingual model is smaller, and in some cases a dedicated monolingual model can still edge out the multilingual one.

What is XLM-RoBERTa used for?

XLM-RoBERTa is a general-purpose encoder, and like BERT and RoBERTa it is meant to be adapted to a specific task rather than used out of the box for generation ^[3]. The most common applications include:

Text and document classification across many languages, including sentiment analysis, topic labeling, and intent detection, where labeled data may exist in only one or a few languages but the deployed system must handle many.
Token-level tasks such as named entity recognition and part-of-speech tagging, including multilingual and cross-lingual settings. A widely used checkpoint, xlm-roberta-large-finetuned-conll03-english, fine-tunes the large model for English NER on the CoNLL-2003 data.
Extractive question answering, where the model locates answer spans within a passage, as in multilingual benchmarks such as MLQA.
Sentence and passage representation for retrieval and semantic similarity, often after additional contrastive fine-tuning. Several popular multilingual sentence-embedding models in the Sentence-Transformers ecosystem are built on an XLM-RoBERTa backbone.

The model is particularly attractive when an organization needs to support many languages without training and maintaining a separate model for each, and when labeled training data is concentrated in a few languages but the application must serve a global audience. Its zero-shot transfer ability means a classifier trained on English data can often perform respectably in dozens of other languages immediately.
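A minimal sketch of the starting point for such a classifier is shown below. It assumes the Hugging Face `transformers` library and the `FacebookAI/xlm-roberta-base` checkpoint; the three-way sentiment label set and the helper names are hypothetical. The classification head created this way is randomly initialized and must be fine-tuned on labeled data before its predictions mean anything.

```python
CHECKPOINT = "FacebookAI/xlm-roberta-base"
LABELS = ["negative", "neutral", "positive"]  # hypothetical label set


def label_maps(labels):
    """Build the id<->label dictionaries stored in the model config."""
    id2label = dict(enumerate(labels))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def load_classifier(labels=LABELS):
    """Load the pretrained encoder with a fresh sequence-classification head."""
    # Imported lazily so this module loads even where transformers is absent.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    id2label, label2id = label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    model = AutoModelForSequenceClassification.from_pretrained(
        CHECKPOINT,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

After fine-tuning on English examples, the same tokenizer and model can be applied unchanged to text in other supported languages, which is how the zero-shot transfer described above is typically exercised.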

The checkpoints are mature and well integrated into the ecosystem. The base and large models are downloaded many millions of times per month on Hugging Face, and thousands of fine-tuned derivatives have been published, making XLM-RoBERTa one of the most heavily reused multilingual backbones in production NLP ^[3]^[4].

What are the limitations of XLM-RoBERTa?

XLM-RoBERTa is not a text generator. As an encoder trained with masked language modeling, it is suited to understanding tasks that consume an entire input to make a decision, such as classification, tagging, and span extraction ^[3]. It cannot produce fluent free-form text the way autoregressive, GPT-style models can, and using it for open-ended generation is a misuse of the architecture.

The model is also subject to the curse of multilinguality, a phenomenon the authors characterized precisely. For a model of fixed capacity, adding more languages first helps, because related languages reinforce one another through positive transfer, but beyond a certain point the per-language capacity becomes too diluted and performance degrades across the board, including on high-resource languages ^[1]. XLM-RoBERTa pushes this frontier outward by using a large model and a large vocabulary, but the trade-off does not disappear; it is the reason that scaling up to the XL and XXL sizes continued to yield improvements ^[6].

Coverage is uneven across the 100 languages. Despite the rebalanced sampling, high-resource languages still dominate the underlying web data, so quality is generally strongest for them and weaker for the lowest-resource languages and scripts. Languages not among the supported 100 are unlikely to be handled well.

Like other models pretrained on uncurated web crawls, XLM-RoBERTa can reflect biases and undesirable content present in its training data, and these biases can surface in fine-tuned systems. The official model documentation cautions that such biases may affect downstream predictions and that they tend to carry over even after task-specific fine-tuning ^[4]. The original release also predates many later refinements in multilingual modeling, and for some applications newer or larger multilingual encoders and instruction-tuned multilingual generators may be preferable.

How does XLM-RoBERTa relate to other models?

XLM-RoBERTa sits at the confluence of two research lines from Facebook AI Research and the broader transformer literature. The first is the lineage of bidirectional encoders: BERT introduced masked language modeling and bidirectional pretraining in 2018, and RoBERTa showed in 2019 that BERT was undertrained and that removing the next-sentence-prediction task, training on more data with larger batches, and using dynamic masking produced substantial gains ^[8]. The second is the lineage of cross-lingual pretraining: multilingual BERT showed that a single BERT could be trained on many languages at once, and the original XLM, also from Conneau and collaborators, formalized cross-lingual pretraining and introduced the Translation Language Modeling objective that used parallel data.

XLM-RoBERTa, released in November 2019 and published at ACL 2020, fused these threads ^[1]^[2]. It took the RoBERTa training methodology, removed the dependence on parallel data, and scaled the corpus and vocabulary dramatically using the CC-100 dataset built with CCNet ^[1]^[7]. The empirical message was that careful scaling of data and model size could deliver a multilingual model that was strong everywhere, rather than one that traded breadth for depth.

The model's influence is visible in much of the multilingual NLP that followed. The CC-100 corpus became a standard resource for training other multilingual models. The XL and XXL successors extended the same recipe to billions of parameters ^[6]. And a large family of derivative models, including multilingual sentence encoders and domain-adapted variants, took XLM-RoBERTa as their starting point. Even as instruction-tuned multilingual generative models have grown prominent, XLM-RoBERTa remains a default encoder for multilingual classification, tagging, and retrieval, valued for being compact, well understood, openly licensed, and reliable across a wide range of languages.

ELI5: XLM-RoBERTa explained simply

Imagine one very well-read reader who has studied books, news, and web pages in 100 different languages all at once, instead of learning each language separately. That is the idea behind XLM-RoBERTa. Researchers fed it a huge pile of internet text (about 2.5 terabytes) covering a hundred languages and had it play a giant fill-in-the-blank game: hide a word, then guess it from the words around it. By doing this billions of times, the model learned how language works in general, not just in English. The handy payoff is that if you teach it a task using English examples (say, spotting whether a review is positive), it can often do that same task in Swahili, Hindi, or Thai without being shown any examples in those languages. It does not write essays or chat; it reads and understands text so that other tools built on top of it can sort, label, and answer questions in many languages at once.

References

1. Conneau, Alexis; Khandelwal, Kartikay; Goyal, Naman; Chaudhary, Vishrav; Wenzek, Guillaume; Guzman, Francisco; Grave, Edouard; Ott, Myle; Zettlemoyer, Luke; Stoyanov, Veselin. "Unsupervised Cross-lingual Representation Learning at Scale." arXiv preprint arXiv:1911.02116, 2019/2020. https://arxiv.org/abs/1911.02116
2. Conneau, Alexis et al. "Unsupervised Cross-lingual Representation Learning at Scale." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020. https://aclanthology.org/2020.acl-main.747/
3. Facebook AI. "xlm-roberta-base model card." Hugging Face, 2019. https://huggingface.co/FacebookAI/xlm-roberta-base
4. Facebook AI. "xlm-roberta-large model card." Hugging Face, 2019. https://huggingface.co/FacebookAI/xlm-roberta-large
5. Hugging Face. "XLM-RoBERTa." Transformers documentation, 2024. https://huggingface.co/docs/transformers/model_doc/xlm-roberta
6. Goyal, Naman; Du, Jingfei; Ott, Myle; Anantharaman, Giri; Conneau, Alexis. "Larger-Scale Transformers for Multilingual Masked Language Modeling." arXiv preprint arXiv:2105.00572, 2021. https://arxiv.org/abs/2105.00572
7. Wenzek, Guillaume; Lachaux, Marie-Anne; Conneau, Alexis; Chaudhary, Vishrav; Guzman, Francisco; Joulin, Armand; Grave, Edouard. "CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data." arXiv preprint arXiv:1911.00359, 2019. https://arxiv.org/abs/1911.00359
8. Liu, Yinhan et al. "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv preprint arXiv:1907.11692, 2019. https://arxiv.org/abs/1907.11692
