# RoBERTa

> Source: https://aiwiki.ai/wiki/roberta
> Updated: 2026-06-21
> Categories: Deep Learning, Machine Learning, Natural Language Processing, Transformer Models
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

RoBERTa (Robustly Optimized BERT Pretraining Approach) is an open-source [natural language processing](/wiki/natural_language_processing) model released in July 2019 by researchers at Facebook AI (now [Meta AI](/wiki/meta_ai)) and the University of Washington. It reproduces and retrains [BERT](/wiki/bert) with an improved recipe and shows that BERT was "significantly undertrained."[1] Using the same [transformer](/wiki/transformer) encoder architecture as BERT, RoBERTa trains longer on roughly ten times more data (over 160GB of text) with larger batches, drops the next sentence prediction objective, and adds dynamic masking. As a result, it matched or exceeded every model published after BERT up to that point, reaching a score of 88.5 on the public [GLUE](/wiki/glue_benchmark) leaderboard.[1] The paper, "RoBERTa: A Robustly Optimized BERT Pretraining Approach" by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, was posted to arXiv on July 26, 2019 (arXiv:1907.11692) and has since accumulated over 27,000 citations, making it one of the most referenced works in modern NLP.[1]

The paper's central conclusion is stated plainly in its abstract: "We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it."[1] In other words, RoBERTa's gains came not from a new architecture but from a better training procedure applied to the same model.

## Quick facts

| Attribute | Value |
|---|---|
| Full name | Robustly Optimized BERT Pretraining Approach |
| Type | Bidirectional [transformer encoder](/wiki/transformer) (encoder-only) |
| Developer | Facebook AI Research; University of Washington |
| Released | July 26, 2019 (arXiv:1907.11692) |
| Architecture | Identical to BERT (no architectural changes) |
| Sizes | RoBERTa-base (125M params), RoBERTa-large (355M params) |
| Pretraining data | Over 160GB of English text across five corpora |
| Pretraining objective | Masked language modeling only (no NSP) |
| Tokenizer | Byte-level BPE, 50,265 tokens |
| GLUE test score | 88.5 (state of the art at release)[1] |
| License | MIT (released with code in [fairseq](/wiki/fairseq)) |
| Citations | Over 27,000 (as of 2025) |

## Background and Motivation

When Google released BERT in October 2018, it represented a major advance in [transfer learning](/wiki/transfer_learning) for NLP. BERT introduced a bidirectional [transformer](/wiki/transformer) encoder pretrained with two objectives: [masked language modeling](/wiki/masked_language_model) (MLM) and next sentence prediction (NSP). It achieved state-of-the-art results on a wide range of benchmarks, including [GLUE](/wiki/glue_benchmark), [SQuAD](/wiki/squad), and others.[2]

However, shortly after BERT's release, several competing models appeared that claimed architectural superiority. [XLNet](/wiki/xlnet) (Yang et al., 2019) introduced permutation-based language modeling to capture bidirectional context without masking.[3] Other approaches proposed novel pretraining objectives or architectural modifications.

The Facebook AI team noticed that these follow-up studies often changed multiple variables simultaneously, making it difficult to isolate which factors actually contributed to performance improvements. The authors hypothesized that BERT's original training recipe was suboptimal and that better tuning alone could recover much of the gap between BERT and its successors. RoBERTa was designed as a careful replication study to test this hypothesis.[1] The paper frames this motivation directly, noting that hyperparameter choices "have significant impact on the final results" and that the work "raise[s] questions about the source of recently reported improvements."[1]

## What did RoBERTa change from BERT?

The paper summarizes its modifications as four concrete changes to the BERT pretraining recipe: "(1) training the model longer, with bigger batches, over more data; (2) removing the next sentence prediction objective; (3) training on longer sequences; and (4) dynamically changing the masking pattern applied to the training data."[1] RoBERTa also collected a new dataset, CC-News, to control for the effect of training-set size, and switched to a byte-level [Byte Pair Encoding](/wiki/byte_pair_encoding) tokenizer.[1]

The authors list their contributions as: "(1) We present a set of important BERT design choices and training strategies and introduce alternatives that lead to better downstream task performance; (2) We use a novel dataset, CC-News, and confirm that using more data for pretraining further improves performance on downstream tasks; (3) Our training improvements show that masked language model pretraining, under the right design choices, is competitive with all other recently published methods."[1]

The following sections describe each change in detail.

## Key Changes from BERT

RoBERTa uses the same underlying [architecture](/wiki/neural_network) as BERT but modifies several aspects of the pretraining procedure.[1]

### Dynamic Masking

BERT's original implementation applies a single static masking pattern to each training example during data preprocessing. This means the model sees the same masked version of each sentence every time it encounters that sentence during training.[2]

RoBERTa replaces this with dynamic masking, where a new masking pattern is generated each time a sequence is fed to the model. For the static-masking baseline, the authors duplicated the training data 10 times so that each sequence was masked in 10 different ways; over 40 epochs of training, each sequence was therefore seen with the same mask four times. Experiments in the paper showed that dynamic masking performs comparably to or slightly better than static masking, and the authors note that it matters more when pretraining for more steps or on larger datasets.[1]

| Masking Strategy | SQuAD 2.0 (F1) | MNLI-m (Acc) | SST-2 (Acc) |
|---|---|---|---|
| Static (BERT reimplementation) | 78.3 | 84.3 | 92.5 |
| Dynamic | 78.7 | 84.0 | 92.9 |
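
The difference between the two strategies can be sketched in a few lines of Python. This is an illustration rather than the fairseq implementation: the 15% masking rate and the 80/10/10 split between the mask token, a random token, and the unchanged token follow BERT's published masking scheme, while the function name `mask_tokens`, the toy vocabulary, and the seeding are assumptions made for the example.

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (masked_tokens, labels) using BERT-style 80/10/10 replacement.

    labels[i] holds the original token at positions selected for prediction
    and None everywhere else.
    """
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK               # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split() * 4
vocab = sorted(set(tokens))

# Static masking: the mask is drawn once during preprocessing and reused.
static_masked, _ = mask_tokens(tokens, vocab, random.Random(0))
static_epochs = [static_masked for _ in range(3)]

# Dynamic masking: a fresh mask is drawn every time the sequence is served.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, vocab, rng)[0] for _ in range(3)]
```

In the static case every epoch presents the identical masked sequence, whereas the dynamic case presents a different corruption of the same text on each pass.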

### Removal of Next Sentence Prediction

BERT's pretraining included a next sentence prediction (NSP) objective, where the model learned to predict whether two segments came from consecutive sentences in the original text. The NSP task was intended to help with downstream tasks that involve reasoning about sentence pairs, such as natural language inference.[2]

The RoBERTa authors tested four different input format configurations:[1]

| Configuration | Description | SQuAD 1.1/2.0 | MNLI-m | SST-2 | RACE |
|---|---|---|---|---|---|
| Segment-pair + NSP | Two segments from same or different documents, with NSP loss | 90.4 / 78.7 | 84.0 | 92.9 | 64.2 |
| Sentence-pair + NSP | Two natural sentences, with NSP loss | 88.7 / 76.2 | 82.9 | 92.1 | 63.0 |
| Full-sentences (no NSP) | Full sentences packed from one or more documents, no NSP loss | 90.4 / 79.1 | 84.7 | 92.5 | 64.8 |
| Doc-sentences (no NSP) | Full sentences from a single document only, no NSP loss | 90.6 / 79.7 | 84.8 | 92.7 | 65.6 |

The results showed that removing the NSP loss matched or slightly improved downstream task performance. The "full-sentences" format, which packs consecutive full sentences from one or more documents up to the 512-token limit without NSP, was adopted for the final RoBERTa model because the "doc-sentences" format introduced variable batch sizes that complicated training.[1]
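
A minimal sketch of the "full-sentences" packing is shown below. It assumes the text has already been split into sentences and tokenized into lists of integer ids; the function name `pack_full_sentences` and the separator id are hypothetical placeholders. The paper describes adding an extra separator token when a sequence crosses a document boundary, which the sketch imitates.

```python
def pack_full_sentences(documents, max_len=512, sep_id=2):
    """Greedily pack whole sentences into sequences of at most max_len tokens.

    documents: list of documents, each a list of sentences, each a list of token ids.
    Sequences may cross document boundaries; sep_id marks each boundary.
    """
    sequences, current = [], []
    for doc_index, doc in enumerate(documents):
        if doc_index > 0 and current and len(current) < max_len:
            current.append(sep_id)  # separator between documents
        for sentence in doc:
            sentence = sentence[:max_len]  # truncate sentences longer than the limit
            if current and len(current) + len(sentence) > max_len:
                sequences.append(current)  # sentence does not fit: start a new sequence
                current = []
            current.extend(sentence)
    if current:
        sequences.append(current)
    return sequences

# Toy example with a 6-token limit instead of 512.
docs = [[[1, 1, 1], [2, 2]], [[3, 3, 3, 3]]]
packed = pack_full_sentences(docs, max_len=6, sep_id=0)
```

Because every sequence is filled toward the same token limit, batches keep a roughly constant number of tokens, which is the practical advantage over the "doc-sentences" format noted above.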

### Larger Mini-Batches

BERT-large was originally trained with a [batch size](/wiki/batch_size) of 256 sequences for 1 million steps.[2] The RoBERTa authors explored the effect of increasing the batch size while keeping the total number of training tokens constant.[1]

| Batch Size | Steps | Learning Rate | Perplexity | MNLI-m | SST-2 |
|---|---|---|---|---|---|
| 256 | 1M | 1e-4 | 3.99 | 84.7 | 92.7 |
| 2K | 125K | 7e-4 | 3.68 | 85.2 | 92.9 |
| 8K | 31K | 1e-3 | 3.77 | 84.6 | 92.8 |

Training with a batch size of 2K sequences and 125K steps produced the best results, improving both perplexity and downstream task performance. Larger batches of 8K sequences also performed well and offered computational advantages through increased parallelism. The final RoBERTa model used a batch size of 8K sequences.[1]
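
The three settings in the table are comparable because they process about the same number of training sequences. The short check below makes that arithmetic explicit; reading "2K" and "8K" as 2,048 and 8,192 sequences is an assumption of the sketch.

```python
def sequences_seen(batch_size, steps):
    """Total training sequences processed: batch size times number of updates."""
    return batch_size * steps

# "2K" and "8K" are read here as 2,048 and 8,192 sequences (an assumption).
settings = {"256": (256, 1_000_000), "2K": (2_048, 125_000), "8K": (8_192, 31_000)}
baseline = sequences_seen(*settings["256"])

for name, (batch_size, steps) in settings.items():
    seen = sequences_seen(batch_size, steps)
    print(f"batch {name:>3}: {seen / 1e6:.0f}M sequences, {seen / baseline:.2f}x the 256-batch run")
```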

### More Training Data

BERT was pretrained on a combination of [BookCorpus](/wiki/bookcorpus) (approximately 800 million words) and English Wikipedia (approximately 2,500 million words), totaling roughly 16GB of uncompressed text.[2] RoBERTa expanded the pretraining corpus significantly by adding three additional datasets, bringing the total to over 160GB.[1]

### Longer Training

The original BERT-large model was trained for 1 million steps with a batch size of 256.[2] RoBERTa was trained for 500K steps with a batch size of 8K, resulting in the model processing significantly more tokens overall. The authors observed that performance continued to improve as training progressed, with the best results at 500K steps.[1]

| Configuration | Data | Batch Size | Steps | SQuAD 1.1/2.0 (F1) | MNLI-m | SST-2 |
|---|---|---|---|---|---|---|
| RoBERTa (Books + Wiki) | 16GB | 8K | 100K | 93.6 / 87.3 | 89.0 | 95.3 |
| + additional data | 160GB | 8K | 100K | 94.0 / 87.7 | 89.3 | 95.6 |
| + pretrain longer (300K) | 160GB | 8K | 300K | 94.4 / 88.7 | 90.0 | 96.1 |
| + pretrain longer (500K) | 160GB | 8K | 500K | 94.6 / 89.4 | 90.2 | 96.4 |

This table illustrates the cumulative effect of each modification, showing steady gains from additional data and longer training.[1]

### Byte-Level BPE Tokenizer

BERT uses a WordPiece tokenizer with a vocabulary of approximately 30,000 tokens.[2] RoBERTa switched to a byte-level [Byte Pair Encoding](/wiki/byte_pair_encoding) (BPE) tokenizer, similar to the one used in [GPT-2](/wiki/gpt-2).[10] This tokenizer has a vocabulary size of 50,265 subword units.[1]

Byte-level BPE operates on raw bytes rather than Unicode characters, which means it can encode any input text without producing unknown tokens. The base vocabulary consists of 256 byte tokens, supplemented by 50,000 learned merge operations and special tokens. While this larger vocabulary adds approximately 15 million parameters to the base model and 20 million to the large model, it allows input to be encoded without additional preprocessing or tokenization heuristics. The RoBERTa authors found that byte-level BPE performed slightly worse on some individual tasks but was comparable overall, with the advantage of greater robustness to diverse inputs.[1]
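
The two claims in this section, full coverage of any input and the cost of the larger embedding table, can be illustrated with a small sketch. It shows only the byte-level base alphabet, not the learned merges, and the BERT vocabulary size of 30,522 is an assumption taken from the released uncased BERT checkpoints rather than a figure stated above.

```python
def to_byte_symbols(text):
    """Map text to its UTF-8 bytes, the base alphabet of a byte-level BPE."""
    return list(text.encode("utf-8"))

for sample in ["RoBERTa", "naïve", "日本語", "🤖"]:
    symbols = to_byte_symbols(sample)
    assert all(0 <= b < 256 for b in symbols)  # every symbol is one of 256 byte values
    print(sample, len(sample), "chars ->", len(symbols), "byte symbols")

# Extra embedding parameters from the larger vocabulary.
ROBERTA_VOCAB = 50_265
BERT_VOCAB = 30_522  # assumption: vocabulary size of the released uncased BERT checkpoints
for name, hidden in [("base", 768), ("large", 1024)]:
    extra = (ROBERTA_VOCAB - BERT_VOCAB) * hidden
    print(f"{name}: about {extra / 1e6:.1f}M additional embedding parameters")
```

Any string decomposes into symbols from the 256-entry base alphabet, so no unknown-token fallback is needed; the learned merges then combine frequent byte sequences into longer subword units.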

## Paper Details

| Detail | Value |
|---|---|
| Title | RoBERTa: A Robustly Optimized BERT Pretraining Approach |
| Authors | Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov |
| Affiliation | Facebook AI Research; University of Washington |
| Published | July 26, 2019 (arXiv:1907.11692) |
| Conference Submission | Submitted to ICLR 2020 (not accepted) |
| Framework | Implemented in [fairseq](/wiki/fairseq) (Facebook AI Research Sequence-to-Sequence Toolkit) |
| Citations | Over 27,000 (as of 2025) |

Although the paper was not accepted at ICLR 2020, its practical impact has been enormous. Reviewers noted that the findings, while significant, were relatively straightforward (more data helps, longer training helps). Nevertheless, the systematic investigation and the resulting pretrained models became widely adopted across the NLP community.[1]

## Architecture

RoBERTa does not introduce any architectural changes compared to BERT. It uses the standard [transformer encoder](/wiki/transformer) architecture. The two primary variants follow the same configurations as BERT-base and BERT-large.[1]

### Model Variants

| Specification | RoBERTa-base | RoBERTa-large |
|---|---|---|
| Transformer layers | 12 | 24 |
| Hidden size | 768 | 1,024 |
| [Attention](/wiki/attention) heads | 12 | 16 |
| Feed-forward dimension | 3,072 | 4,096 |
| Parameters | 125M | 355M |
| Vocabulary size | 50,265 | 50,265 |
| Max sequence length | 512 | 512 |
| Tokenizer | Byte-level BPE | Byte-level BPE |

The key insight of RoBERTa is that the architecture did not need to change; better training procedures alone were sufficient to achieve substantial performance improvements.[1]
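
The parameter counts in the table can be approximately reproduced from the layer dimensions. The sketch below is a back-of-the-envelope count, not an official accounting: the 514 position embeddings, the single token-type embedding, and the pooler layer are assumptions taken from the Hugging Face configuration of these checkpoints, and the masked language modeling head is left out.

```python
def encoder_params(layers, hidden, ffn, vocab=50_265, positions=514, type_vocab=1):
    """Approximate parameter count of a BERT-style encoder (weights and biases)."""
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab + positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm  # Q, K, V, output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

base = encoder_params(layers=12, hidden=768, ffn=3_072)
large = encoder_params(layers=24, hidden=1_024, ffn=4_096)
print(f"RoBERTa-base:  {base / 1e6:.1f}M parameters")
print(f"RoBERTa-large: {large / 1e6:.1f}M parameters")
```

Under these assumptions the totals round to the 125M and 355M figures quoted for the two variants.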

## Training Data

RoBERTa was pretrained on five English-language datasets totaling over 160GB of uncompressed text. This represents a tenfold increase over the 16GB of text used to train the original BERT.[1]

| Dataset | Size | Description |
|---|---|---|
| BookCorpus | ~5GB | Collection of over 11,000 unpublished books from various genres, originally used in BERT pretraining |
| English Wikipedia | ~11GB | Full text of English Wikipedia articles (excluding lists, tables, and headers), also used in BERT pretraining |
| CC-News | 76GB | A dataset of 63 million English news articles crawled from CommonCrawl News between September 2016 and February 2019 |
| OpenWebText | 38GB | An open-source recreation of the WebText corpus described by Radford et al. (2019), consisting of web content from URLs shared on Reddit with at least three upvotes |
| Stories | 31GB | A subset of CommonCrawl data filtered to match the style of Winograd schema stories, as introduced by Trinh and Le (2018) |

The CC-News, OpenWebText, and Stories datasets were the three additions beyond what BERT originally used. The diversity of these sources, spanning books, encyclopedic text, news articles, web content, and narrative stories, helped the model learn a broader range of language patterns.[1]
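
As a quick consistency check, the sizes in the table can be summed to recover the "over 160GB" total and the roughly tenfold increase over BERT. The per-corpus figures for BookCorpus and Wikipedia are the approximate values listed above.

```python
corpora_gb = {
    "BookCorpus": 5,
    "English Wikipedia": 11,
    "CC-News": 76,
    "OpenWebText": 38,
    "Stories": 31,
}

roberta_total = sum(corpora_gb.values())
bert_total = corpora_gb["BookCorpus"] + corpora_gb["English Wikipedia"]

print(f"RoBERTa corpus: {roberta_total}GB")
print(f"BERT corpus:    {bert_total}GB")
print(f"Ratio:          {roberta_total / bert_total:.1f}x")
```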

## Training Configuration

The full RoBERTa model was trained using the [Adam](/wiki/adam_optimizer) optimizer with the following hyperparameters.[1]

| Hyperparameter | RoBERTa-base | RoBERTa-large |
|---|---|---|
| Peak learning rate | 6e-4 | 4e-4 |
| Batch size (sequences) | 8,000 | 8,000 |
| Training steps | 500,000 | 500,000 |
| Warmup steps | 24,000 | 30,000 |
| Adam epsilon | 1e-6 | 1e-6 |
| Adam beta_2 | 0.98 | 0.98 |
| Weight decay | 0.01 | 0.01 |
| Dropout | 0.1 | 0.1 |
| Learning rate schedule | Linear decay | Linear decay |
| Max sequence length | 512 | 512 |

Training was conducted on 1,024 NVIDIA V100 GPUs using mixed-precision floating-point arithmetic. The DGX-1 machines used each contained 8 V100 GPUs with 32GB of memory, and peak memory usage was approximately 18GB per GPU. The training process took roughly one day for 100K steps, which implies roughly five days for the full 500K-step run at the same throughput.[1]

Notably, RoBERTa changed Adam's beta_2 parameter from BERT's default of 0.999 to 0.98, which the authors found to improve training stability with large batch sizes.[1]
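
The learning rate schedule implied by the table can be sketched as a simple function of the step count. The table specifies only the warmup length, the peak rate, and linear decay; the linear shape of the warmup and the decay all the way to zero at the final step are assumptions of this sketch, as is the function name.

```python
def learning_rate(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup from zero to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

# RoBERTa-base settings from the table above.
for step in (0, 12_000, 24_000, 262_000, 500_000):
    lr = learning_rate(step, peak_lr=6e-4, warmup_steps=24_000, total_steps=500_000)
    print(f"step {step:>7}: lr = {lr:.2e}")
```

The same function covers RoBERTa-large by passing a peak rate of 4e-4 and 30,000 warmup steps.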

## Benchmark Results

### GLUE Benchmark

The [General Language Understanding Evaluation](/wiki/glue_benchmark) (GLUE) benchmark is a collection of nine natural language understanding tasks. RoBERTa achieved state-of-the-art results on the GLUE leaderboard at the time of publication, with an overall test score of 88.5.[1] The paper reports this milestone directly: "When trained for longer over additional data, our model achieves a score of 88.5 on the public GLUE leaderboard, matching the 88.4 reported by Yang et al."[1] Notably, RoBERTa's leaderboard submission relied only on single-task finetuning yet set new state-of-the-art results on 4 of the 9 GLUE tasks (MNLI, QNLI, RTE, and STS-B), whereas many competing submissions depended on multi-task finetuning and ensembles.[1]

The following table shows development set results for single-task, single-model fine-tuning.

| Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B |
|---|---|---|---|---|---|---|---|---|
| BERT-large | 86.6 | 92.3 | 91.3 | 70.4 | 93.2 | 88.0 | 60.6 | 90.0 |
| XLNet-large | 89.8 | 93.9 | 91.8 | 83.8 | 95.6 | 89.2 | 63.6 | 91.8 |
| RoBERTa-large | 90.2 | 94.7 | 92.2 | 86.6 | 96.4 | 90.9 | 68.0 | 92.4 |

RoBERTa outperformed both BERT-large and XLNet-large on every GLUE task shown in the development-set table. The improvements were especially notable on CoLA (68.0 vs. 60.6 for BERT), RTE (86.6 vs. 70.4 for BERT), and MNLI (90.2 vs. 86.6 for BERT).[1]

On the GLUE test set, RoBERTa achieved an average score of 88.5, the highest on the leaderboard at the time and narrowly ahead of the 88.4 reported for XLNet's multi-task ensemble submission.[1]

### SQuAD

The [Stanford Question Answering Dataset](/wiki/squad) (SQuAD) evaluates reading comprehension. RoBERTa was evaluated on both SQuAD v1.1 and SQuAD v2.0, using only the provided SQuAD training data without any additional question answering datasets.[1]

| Model | SQuAD v1.1 EM | SQuAD v1.1 F1 | SQuAD v2.0 EM | SQuAD v2.0 F1 |
|---|---|---|---|---|
| BERT-large | 84.1 | 90.9 | 79.0 | 81.8 |
| XLNet-large | 89.0 | 94.5 | 86.1 | 88.8 |
| RoBERTa-large | 88.9 | 94.6 | 86.5 | 89.4 |

RoBERTa achieved results comparable to XLNet on SQuAD v1.1 (94.6 F1 vs. 94.5 F1) and slightly better on SQuAD v2.0 (89.4 F1 vs. 88.8 F1). Both models significantly outperformed the original BERT-large baseline.[1]

### RACE

The [RACE](/wiki/race_benchmark) (ReAding Comprehension from Examinations) benchmark consists of multiple-choice reading comprehension questions collected from English exams for Chinese middle and high school students.

| Model | Middle | High | Overall |
|---|---|---|---|
| BERT-large | 76.6 | 70.1 | 72.0 |
| XLNet-large | 85.4 | 80.2 | 81.7 |
| RoBERTa-large | 86.5 | 81.3 | 83.2 |

RoBERTa achieved the best overall accuracy on RACE at 83.2%, surpassing XLNet-large (81.7%) and BERT-large (72.0%) by significant margins.[1]

## How does RoBERTa compare with other transformer models?

RoBERTa emerged during a period of rapid development in pretrained language models. The following table provides a high-level comparison with other prominent models from the same era.

| Feature | [BERT](/wiki/bert)-large | RoBERTa-large | [XLNet](/wiki/xlnet)-large | [ALBERT](/wiki/albert)-xxlarge | [ELECTRA](/wiki/electra)-large |
|---|---|---|---|---|---|
| Year | 2018 | 2019 | 2019 | 2019 | 2020 |
| Organization | Google | Facebook AI | Google/CMU | Google | Google/Stanford |
| Parameters | 340M | 355M | 340M | 235M | 335M |
| Pretraining Objective | MLM + NSP | MLM only | Permutation LM | MLM + SOP | Replaced Token Detection |
| Tokenizer | WordPiece (30K) | Byte-level BPE (50K) | SentencePiece | SentencePiece (30K) | WordPiece (30K) |
| Training Data | 16GB | 160GB | ~126GB | 16GB | ~126GB (XLNet data) |
| Masking | Static | Dynamic | N/A (permutation) | Static | N/A (replacement) |
| NSP Task | Yes | No | No | No (uses SOP) | No |
| GLUE Dev (MNLI) | 86.6 | 90.2 | 89.8 | 90.8 | 90.9 |
| GLUE Dev (SST-2) | 93.2 | 96.4 | 95.6 | 96.9 | 96.9 |
| GLUE Dev (CoLA) | 60.6 | 68.0 | 63.6 | 71.4 | 69.1 |
| SQuAD v2.0 (F1) | 81.8 | 89.4 | 88.8 | N/A | 88.1 |

Several observations emerge from this comparison:

- **RoBERTa vs. BERT:** With identical architecture, RoBERTa demonstrated that better training procedures and more data could close the gap with newer models. The MNLI improvement from 86.6 to 90.2 and CoLA improvement from 60.6 to 68.0 were substantial.[1]
- **RoBERTa vs. XLNet:** Despite XLNet's more complex permutation language modeling approach, RoBERTa matched or exceeded XLNet on most benchmarks while using a simpler MLM objective.[3]
- **RoBERTa vs. ALBERT:** ALBERT achieved competitive or better scores with significantly fewer parameters (235M vs. 355M) through cross-layer parameter sharing and factorized embedding parameterization.[4]
- **RoBERTa vs. ELECTRA:** ELECTRA, published in 2020, achieved comparable or better scores using a more sample-efficient replaced token detection objective, matching RoBERTa's performance with less than one-quarter of its pretraining compute.[5]

## Why did RoBERTa show BERT was undertrained?

The central finding of the RoBERTa paper was that BERT was "significantly undertrained."[1] The authors supported this claim through a systematic ablation study that isolated the contribution of each change.[1]

### Insufficient Training Duration

BERT-large was trained for 1 million steps with a batch size of 256.[2] When the RoBERTa authors retrained the same BERT architecture on the same BookCorpus and Wikipedia data with much larger batches (8K batch size for 100K steps, roughly three times as many training sequences as BERT's 256 × 1M), they already observed improvements. After more data was added, extending training to 300K and then 500K steps with the larger batch size yielded further gains on every benchmark reported.[1]

### Suboptimal Data Volume

BERT's 16GB training corpus was limited in both size and diversity. When RoBERTa added CC-News, OpenWebText, and Stories to reach 160GB, performance improved even at the same number of training steps. The additional data provided the model with a broader distribution of language patterns and factual knowledge.[1]

### Unnecessary NSP Objective

The next sentence prediction task, an auxiliary loss trained alongside masked language modeling in BERT, was shown to be unnecessary. Removing it matched or slightly improved downstream task performance, so RoBERTa trains on the masked language modeling objective alone.[1]

### Static Masking Limitations

BERT's static masking fixed the masks during data preprocessing, so each training example was seen with only a small, fixed set of masking patterns that repeated across epochs. Dynamic masking provided more diverse training signals, which became increasingly important over longer training runs.[1]

### Undertuned Hyperparameters

The change in Adam's beta_2 from 0.999 to 0.98, the use of larger learning rates with large batches, and other hyperparameter adjustments all contributed to better training dynamics.[1]

Taken together, these findings demonstrated that the gap between BERT and subsequent models like XLNet was not primarily due to architectural innovation but rather to undertrained baselines.[1]

## What is RoBERTa used for?

RoBERTa has been widely adopted for a variety of NLP tasks through [fine-tuning](/wiki/fine_tuning).

### Text Classification

RoBERTa serves as a strong backbone for [sentiment analysis](/wiki/sentiment_analysis), topic classification, spam detection, and other document-level classification tasks. Its strong performance on SST-2 (96.4% accuracy) reflects its ability to capture nuanced language patterns relevant to sentiment.[1]

### Named Entity Recognition

[Named entity recognition](/wiki/named_entity_recognition) (NER) benefits from RoBERTa's contextual representations. Fine-tuned RoBERTa models have been used for identifying persons, organizations, locations, and domain-specific entities in various domains including biomedical text and legal documents.

### Question Answering

RoBERTa's strong SQuAD results translate to effective question answering systems. The model can be fine-tuned on extractive QA datasets to locate answer spans within passages.
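
Extractive question answering heads typically output one start score and one end score per token, and the answer is the span with the best combined score. The sketch below shows that decoding step in isolation, assuming a fine-tuned model has already produced the two score lists; the function name `best_span`, the length cap, and the example scores are hypothetical.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e] with s <= e."""
    best = None
    for s, s_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length cap.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

# Made-up scores for a seven-token passage.
tokens = ["RoBERTa", "was", "released", "in", "July", "2019", "."]
start_logits = [0.1, 0.0, 0.2, 0.5, 3.0, 1.0, 0.0]
end_logits = [0.0, 0.1, 0.3, 0.2, 1.5, 3.5, 0.1]

start, end, _ = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
```

Restricting the search to spans where the end is not before the start rules out invalid answers even when the highest end score precedes the highest start score.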

### Natural Language Inference

Tasks like MNLI, which require determining whether a hypothesis is entailed by, contradicted by, or neutral with respect to a premise, are well-served by RoBERTa's deep bidirectional representations.

### Semantic Similarity

RoBERTa's performance on STS-B (92.4 Spearman correlation) makes it effective for paraphrase detection, duplicate question identification, and semantic search applications.[1]

## Variants and Extensions

### RoBERTa-base and RoBERTa-large

The two primary variants correspond to BERT-base and BERT-large configurations. RoBERTa-base (125M parameters, 12 layers) is suitable for resource-constrained environments, while RoBERTa-large (355M parameters, 24 layers) provides the best performance.[1]

Development set results for both variants on the [GLUE benchmark](/wiki/glue_benchmark):

| Task | RoBERTa-base | RoBERTa-large |
|---|---|---|
| MNLI | 87.6 | 90.2 |
| QNLI | 92.8 | 94.7 |
| QQP | 91.9 | 92.2 |
| RTE | 78.7 | 86.6 |
| SST-2 | 94.8 | 96.4 |
| MRPC | 90.2 | 90.9 |
| CoLA | 63.6 | 68.0 |
| STS-B | 91.2 | 92.4 |

### XLM-RoBERTa

XLM-RoBERTa (Conneau et al., 2020) extended RoBERTa's training approach to the multilingual setting. Pretrained on 2.5TB of filtered CommonCrawl data spanning 100 languages, XLM-RoBERTa achieved strong results on cross-lingual benchmarks.[6] It outperformed multilingual BERT (mBERT) by +14.6% average accuracy on XNLI and +13% average F1 on MLQA, while also performing competitively with monolingual models on English benchmarks.[6] XLM-RoBERTa demonstrated that the RoBERTa training recipe could scale effectively across languages.[6]

### DistilRoBERTa

DistilRoBERTa applies [knowledge distillation](/wiki/knowledge_distillation) to RoBERTa, producing a smaller, faster model that retains much of RoBERTa's performance. With 82 million parameters (roughly 66% of RoBERTa-base), DistilRoBERTa offers a practical option for deployment in latency-sensitive applications.[12]

## Is RoBERTa open source and how do I use it?

RoBERTa was released as open source alongside the paper, with code and pretrained weights distributed through the [fairseq](/wiki/fairseq) toolkit under the MIT license, and the paper explicitly states "We release our models and code."[1][11] RoBERTa models are also available through the [Hugging Face](/wiki/hugging_face) Transformers library under the `FacebookAI` organization.[12] The primary model identifiers are:

| Model | Hugging Face ID | Parameters |
|---|---|---|
| RoBERTa-base | `FacebookAI/roberta-base` | 125M |
| RoBERTa-large | `FacebookAI/roberta-large` | 355M |
| RoBERTa-large (MNLI) | `FacebookAI/roberta-large-mnli` | 355M |
| XLM-RoBERTa-base | `FacebookAI/xlm-roberta-base` | 278M |
| XLM-RoBERTa-large | `FacebookAI/xlm-roberta-large` | 559M |

These models can be loaded with a few lines of Python code using the Transformers library:

```python
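from transformers import AutoTokenizer, AutoModel

# Download (or load from the local cache) the byte-level BPE tokenizer
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-large")
# Load the pretrained encoder weights without any task-specific head
model = AutoModel.from_pretrained("FacebookAI/roberta-large")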
```

The Hugging Face model hub also hosts numerous community-contributed RoBERTa models fine-tuned for specific tasks, including sentiment analysis, NER, question answering, and domain-specific applications in biomedical, legal, and financial text.[12]

## Legacy and Influence

RoBERTa's impact on the NLP field extends well beyond its benchmark scores.

### Establishing Training Best Practices

The paper's most lasting contribution may be its demonstration that training procedure matters as much as, if not more than, model architecture. This insight influenced the entire field, leading researchers to invest more effort in hyperparameter tuning, data curation, and training duration before proposing new architectures.[1]

### Influence on Subsequent Models

RoBERTa's training methodology directly influenced several important subsequent models:

- **[BART](/wiki/bart)** (Lewis et al., 2019): Also from Facebook AI, BART uses a similar encoder architecture to RoBERTa but adds a decoder for sequence-to-sequence tasks. On NLU benchmarks, BART performs comparably to RoBERTa.[8]
- **[DeBERTa](/wiki/deberta)** (He et al., 2020): Microsoft's DeBERTa built upon the RoBERTa training recipe while introducing disentangled attention. Trained on half the data, DeBERTa outperformed RoBERTa on MNLI by +0.9%, SQuAD v2.0 by +2.3%, and RACE by +3.6%.[7]
- **[ELECTRA](/wiki/electra)** (Clark et al., 2020): While introducing a different pretraining objective, ELECTRA used RoBERTa as a primary baseline and demonstrated that replaced token detection could achieve RoBERTa-level performance with less than a quarter of the compute.[5]
- **[Longformer](/wiki/longformer)** (Beltagy et al., 2020): From the [Allen Institute for AI](/wiki/ai2), Longformer initialized from the RoBERTa checkpoint and extended it to handle long documents with up to 4,096 tokens using sparse attention.[9]

### Widespread Adoption

RoBERTa became a default choice for many NLP practitioners and researchers as a strong pretrained encoder. Its availability through Hugging Face, combined with its consistent performance across tasks, made it a standard baseline in hundreds of research papers. Many shared task competitions and industrial NLP systems adopted RoBERTa as their starting point for fine-tuning.

### The "Scaling" Lesson

Before RoBERTa, the NLP community often attributed performance gains to architectural novelty. RoBERTa shifted this narrative by showing that scaling data, compute, and training duration could be equally or more important. This insight foreshadowed the broader "[scaling laws](/wiki/scaling_laws)" findings that would later emerge in the context of large language models like [GPT-3](/wiki/gpt-3) and beyond.

## Limitations

Despite its strengths, RoBERTa has several limitations:

- **Computational cost:** Training RoBERTa required 1,024 V100 GPUs, making it expensive to reproduce. At the time of publication, this represented a significant barrier to entry for academic labs.[1]
- **English only:** The original RoBERTa models were trained exclusively on English text. While XLM-RoBERTa addressed multilingual needs, the original English checkpoints are not suitable for non-English tasks.[6]
- **Encoder only:** As a bidirectional encoder, RoBERTa cannot directly generate text. It is not suitable for tasks like text summarization, translation, or open-ended generation without additional architectural components.
- **Fixed sequence length:** RoBERTa's maximum input length is 512 tokens, which limits its application to long documents without modifications like those introduced by Longformer.[9]
- **No architectural innovation:** While this was the paper's deliberate point, it also means RoBERTa does not address any of BERT's structural limitations, such as its inability to model token dependencies beyond the 512-token input window.

## See Also

- [BERT](/wiki/bert)
- [XLNet](/wiki/xlnet)
- [ALBERT](/wiki/albert)
- [ELECTRA](/wiki/electra)
- [DeBERTa](/wiki/deberta)
- [Transformer](/wiki/transformer)
- [Masked Language Modeling](/wiki/masked_language_model)
- [GLUE](/wiki/glue_benchmark)
- [SQuAD](/wiki/squad)
- [Hugging Face](/wiki/hugging_face)

## References

1. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv:1907.11692. https://arxiv.org/abs/1907.11692
2. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: [Pre-training](/wiki/pre-training) of Deep Bidirectional Transformers for Language Understanding." Proceedings of NAACL-HLT 2019. https://arxiv.org/abs/1810.04805
3. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., & Le, Q. V. (2019). "XLNet: Generalized Autoregressive Pretraining for Language Understanding." [NeurIPS](/wiki/neurips) 2019. https://arxiv.org/abs/1906.08237
4. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2020). "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations." ICLR 2020. https://arxiv.org/abs/1909.11942
5. Clark, K., Luong, M.-T., Le, Q. V., & Manning, C. D. (2020). "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators." ICLR 2020. https://arxiv.org/abs/2003.10555
6. Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzman, F., Grave, E., Ott, M., Zettlemoyer, L., & Stoyanov, V. (2020). "Unsupervised Cross-lingual Representation Learning at Scale." ACL 2020. https://arxiv.org/abs/1911.02116
7. He, P., Liu, X., Gao, J., & Chen, W. (2021). "DeBERTa: Decoding-enhanced BERT with Disentangled Attention." ICLR 2021. https://arxiv.org/abs/2006.03654
8. Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2020). "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension." ACL 2020. https://arxiv.org/abs/1910.13461
9. Beltagy, I., Peters, M. E., & Cohan, A. (2020). "Longformer: The Long-Document Transformer." arXiv:2004.05150. https://arxiv.org/abs/2004.05150
10. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). "Language Models are Unsupervised Multitask Learners." [OpenAI](/wiki/openai) Technical Report.
11. Facebook AI Research. "fairseq: A Fast, Extensible Toolkit for Sequence Modeling." GitHub. https://github.com/facebookresearch/fairseq
12. Hugging Face. "RoBERTa documentation." https://huggingface.co/docs/transformers/model_doc/roberta

