RoBERTa (Robustly Optimized BERT Pretraining Approach) is a natural language processing model developed by researchers at Facebook AI (now Meta AI) in 2019. Introduced in the paper "RoBERTa: A Robustly Optimized BERT Pretraining Approach" by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, RoBERTa demonstrated that BERT had been significantly undertrained. By carefully tuning hyperparameters, removing the next sentence prediction (NSP) objective, training on more data with larger batches for longer durations, and applying dynamic masking, RoBERTa matched or surpassed the performance of all models published after BERT at the time of its release. The paper was posted to arXiv on July 26, 2019 (arXiv:1907.11692) and has since accumulated over 25,000 citations, making it one of the most referenced works in modern NLP.
When Google released BERT in October 2018, it represented a major advance in transfer learning for NLP. BERT introduced a bidirectional transformer encoder pretrained with two objectives: masked language modeling (MLM) and next sentence prediction (NSP). It achieved state-of-the-art results on a wide range of benchmarks, including GLUE, SQuAD, and others.
However, shortly after BERT's release, several competing models appeared that claimed architectural superiority. XLNet (Yang et al., 2019) introduced permutation-based language modeling to capture bidirectional context without masking. Other approaches proposed novel pretraining objectives or architectural modifications.
The Facebook AI team noticed that these follow-up studies often changed multiple variables simultaneously, making it difficult to isolate which factors actually contributed to performance improvements. The authors hypothesized that BERT's original training recipe was suboptimal and that better tuning alone could recover much of the gap between BERT and its successors. RoBERTa was designed as a careful replication study to test this hypothesis.
| Detail | Value |
|---|---|
| Title | RoBERTa: A Robustly Optimized BERT Pretraining Approach |
| Authors | Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov |
| Affiliation | Facebook AI Research; University of Washington |
| Published | July 26, 2019 (arXiv:1907.11692) |
| Conference Submission | Submitted to ICLR 2020 (not accepted) |
| Framework | Implemented in fairseq (Facebook AI Research Sequence-to-Sequence Toolkit) |
| Citations | Over 25,000 (as of 2025) |
Although the paper was not accepted at ICLR 2020, its practical impact has been enormous. Reviewers noted that the findings, while significant, were relatively straightforward (more data helps, longer training helps). Nevertheless, the systematic investigation and the resulting pretrained models became widely adopted across the NLP community.
RoBERTa uses the same underlying architecture as BERT but modifies several aspects of the pretraining procedure. The following sections describe each change in detail.
BERT's original implementation applies a single static masking pattern to each training example during data preprocessing. This means the model sees the same masked version of each sentence every time it encounters that sentence during training.
RoBERTa replaces this with dynamic masking, where a new masking pattern is generated each time a sequence is fed to the model. (For their static baseline, the authors duplicated the training data 10 times so that each sequence was masked in 10 different ways over 40 epochs of training; each sequence was therefore still seen with the same mask four times.) Experiments in the paper showed that dynamic masking performs comparably to or slightly better than static masking, with the advantage becoming more pronounced over longer training runs.
| Masking Strategy | SQuAD 2.0 (F1) | MNLI-m (Acc) | SST-2 (Acc) |
|---|---|---|---|
| Static (BERT reimplementation) | 78.3 | 84.3 | 92.5 |
| Dynamic | 78.7 | 84.0 | 92.9 |
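As a rough illustration (not the fairseq implementation), dynamic masking can be sketched as re-sampling a BERT-style 80/10/10 mask every time a sequence is drawn from the data. `dynamic_mask` is a hypothetical helper; `MASK_ID` is RoBERTa's `<mask>` token id (50264 in the released vocabulary):

```python
import random

MASK_ID = 50264          # <mask> token id in RoBERTa's released vocabulary
VOCAB_SIZE = 50265

def dynamic_mask(token_ids, mask_prob=0.15, seed=None):
    """Return a freshly masked copy of token_ids using BERT's 80/10/10 rule.

    Called every time a sequence is fed to the model, so each pass sees a new
    pattern, unlike static masking, which fixes the pattern at preprocessing.
    """
    rng = random.Random(seed)
    masked = list(token_ids)
    for i in range(len(masked)):
        if rng.random() < mask_prob:
            r = rng.random()
            if r < 0.8:                      # 80%: replace with <mask>
                masked[i] = MASK_ID
            elif r < 0.9:                    # 10%: replace with a random token
                masked[i] = rng.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token unchanged
    return masked

tokens = list(range(100, 120))
epoch1 = dynamic_mask(tokens, seed=1)
epoch2 = dynamic_mask(tokens, seed=2)   # a different pattern on the next pass
```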
BERT's pretraining included a next sentence prediction (NSP) objective, where the model learned to predict whether two segments came from consecutive sentences in the original text. The NSP task was intended to help with downstream tasks that involve reasoning about sentence pairs, such as natural language inference.
The RoBERTa authors tested four different input format configurations:
| Configuration | Description | SQuAD 1.1/2.0 | MNLI-m | SST-2 | RACE |
|---|---|---|---|---|---|
| Segment-pair + NSP | Two segments from same or different documents, with NSP loss | 90.4 / 78.7 | 84.0 | 92.9 | 64.2 |
| Sentence-pair + NSP | Two natural sentences, with NSP loss | 88.7 / 76.2 | 82.9 | 92.1 | 63.0 |
| Full-sentences (no NSP) | Full sentences packed from one or more documents, no NSP loss | 90.4 / 79.1 | 84.7 | 92.5 | 64.8 |
| Doc-sentences (no NSP) | Full sentences from a single document only, no NSP loss | 90.6 / 79.7 | 84.7 | 92.7 | 65.6 |
The results showed that removing the NSP loss matched or slightly improved downstream task performance. The "full-sentences" format, which packs consecutive full sentences from one or more documents up to the 512-token limit without NSP, was adopted for the final RoBERTa model because the "doc-sentences" format introduced variable batch sizes that complicated training.
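A minimal sketch of the "full-sentences" packing scheme, assuming pre-tokenized sentences and a separator id of 2 (RoBERTa's `</s>`). Unlike the real pipeline, this sketch does not split sentences that exceed the limit:

```python
SEP_ID = 2          # </s>: used here to mark document boundaries (assumed id)
MAX_LEN = 512

def pack_full_sentences(documents, max_len=MAX_LEN):
    """Pack tokenized sentences into sequences of at most max_len tokens.

    Sentences are taken in order and may cross document boundaries, with an
    extra separator inserted between documents (the "full-sentences" format).
    """
    sequences, current = [], []
    for doc in documents:
        for sent in doc:
            if len(current) + len(sent) > max_len:
                sequences.append(current)
                current = []
            current.extend(sent)
        if current:
            current.append(SEP_ID)   # document boundary marker
    if current:
        sequences.append(current)
    return sequences
```

Because every sequence is filled close to the 512-token limit, batches keep a fixed shape; the "doc-sentences" variant stops at document boundaries, which leaves sequences of varying length and forces dynamic batch sizes.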
BERT-base was originally trained with a batch size of 256 sequences for 1 million steps. The RoBERTa authors explored, using the base configuration, the effect of increasing the batch size while holding the total number of training tokens roughly constant.
| Batch Size | Steps | Learning Rate | Perplexity | MNLI-m | SST-2 |
|---|---|---|---|---|---|
| 256 | 1M | 1e-4 | 3.99 | 84.7 | 92.7 |
| 2K | 125K | 7e-4 | 3.68 | 85.2 | 92.9 |
| 8K | 31K | 1e-3 | 3.77 | 84.6 | 92.8 |
Training with a batch size of 2K sequences and 125K steps produced the best results, improving both perplexity and downstream task performance. Larger batches of 8K sequences also performed well and offered computational advantages through increased parallelism. The final RoBERTa model used a batch size of 8K sequences.
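Assuming "8K" means 8,192 sequences, the global batch can be related to per-GPU micro-batches and gradient-accumulation steps; the helper below is illustrative, not taken from fairseq:

```python
def accumulation_steps(global_batch, n_gpus, per_gpu_batch):
    """Number of gradient-accumulation micro-steps needed per optimizer step
    to reach a target global batch size."""
    micro_total = n_gpus * per_gpu_batch
    if global_batch % micro_total:
        raise ValueError("global batch must divide evenly into micro-batches")
    return global_batch // micro_total

# With 1,024 GPUs and 8 sequences per GPU, an 8,192-sequence batch fits in a
# single micro-step; on 8 GPUs the same effective batch needs 128 accumulations.
steps_1024 = accumulation_steps(8_192, 1_024, 8)   # -> 1
steps_8 = accumulation_steps(8_192, 8, 8)          # -> 128
```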
BERT was pretrained on a combination of BookCorpus (approximately 800 million words) and English Wikipedia (approximately 2,500 million words), totaling roughly 16GB of uncompressed text. RoBERTa expanded the pretraining corpus significantly by adding three additional datasets, bringing the total to over 160GB.
The original BERT-large model was trained for 1 million steps with a batch size of 256. RoBERTa was trained for 500K steps with a batch size of 8K, resulting in the model processing significantly more tokens overall. The authors observed that performance continued to improve as training progressed, with the best results at 500K steps.
| Configuration | Data | Batch Size | Steps | SQuAD 1.1/2.0 (F1) | MNLI-m | SST-2 |
|---|---|---|---|---|---|---|
| RoBERTa (Books + Wiki) | 16GB | 8K | 100K | 93.6 / 87.3 | 89.0 | 95.3 |
| + additional data | 160GB | 8K | 100K | 94.0 / 87.7 | 89.3 | 95.6 |
| + pretrain longer (300K) | 160GB | 8K | 300K | 94.4 / 88.7 | 90.0 | 96.1 |
| + pretrain longer (500K) | 160GB | 8K | 500K | 94.6 / 89.4 | 90.2 | 96.4 |
This table illustrates the cumulative effect of each modification, showing steady gains from additional data and longer training.
BERT uses a WordPiece tokenizer with a vocabulary of approximately 30,000 tokens. RoBERTa switched to a byte-level Byte Pair Encoding (BPE) tokenizer, similar to the one used in GPT-2. This tokenizer has a vocabulary size of 50,265 subword units.
Byte-level BPE operates on raw bytes rather than Unicode characters, which means it can encode any input text without producing unknown tokens. The base vocabulary consists of 256 byte tokens, supplemented by 50,000 learned merge operations and special tokens. While this larger vocabulary adds roughly 15 million parameters to the base model (about 20 million to the large model), it eliminates the need for any text preprocessing or tokenization heuristics. The RoBERTa authors found that byte-level BPE performed slightly worse on some individual tasks but was comparable overall, with the advantage of greater robustness to diverse inputs.
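The "no unknown tokens" property follows directly from UTF-8: any string decomposes into bytes 0 through 255, which form the tokenizer's base alphabet. A tiny illustration of the byte fallback only (the learned merges are not shown):

```python
def byte_fallback(text):
    """Map text to its UTF-8 byte sequence: the 256-symbol base alphabet of a
    byte-level BPE tokenizer, guaranteeing no out-of-vocabulary input."""
    return list(text.encode("utf-8"))

# Accented letters and emoji reduce to bytes 0-255 like everything else,
# so a byte-level tokenizer never needs an <unk> token.
ids = byte_fallback("café 😀")
assert all(0 <= b <= 255 for b in ids)
```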
RoBERTa does not introduce any architectural changes compared to BERT. It uses the standard transformer encoder architecture. The two primary variants follow the same configurations as BERT-base and BERT-large.
| Specification | RoBERTa-base | RoBERTa-large |
|---|---|---|
| Transformer layers | 12 | 24 |
| Hidden size | 768 | 1,024 |
| Attention heads | 12 | 16 |
| Feed-forward dimension | 3,072 | 4,096 |
| Parameters | 125M | 355M |
| Vocabulary size | 50,265 | 50,265 |
| Max sequence length | 512 | 512 |
| Tokenizer | Byte-level BPE | Byte-level BPE |
The key insight of RoBERTa is that the architecture did not need to change; better training procedures alone were sufficient to achieve substantial performance improvements.
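The quoted parameter counts can be sanity-checked with a back-of-envelope formula for a BERT-style encoder. The 514 position embeddings follow Hugging Face's RoBERTa configuration, and minor terms (token-type embeddings, pooler head) are omitted:

```python
def encoder_params(layers=12, hidden=768, ffn=3072, vocab=50265, max_pos=514):
    """Approximate parameter count for a BERT-style transformer encoder,
    with RoBERTa-base defaults. Minor terms are deliberately ignored."""
    # Token embeddings, position embeddings, and the embedding LayerNorm
    embeddings = vocab * hidden + max_pos * hidden + 2 * hidden
    # Per-layer self-attention: Q, K, V, and output projections (with biases)
    attention = 4 * (hidden * hidden + hidden)
    # Per-layer feed-forward: two linear layers (with biases)
    feed_forward = hidden * ffn + ffn + ffn * hidden + hidden
    # Two LayerNorms per layer, each with scale and bias vectors
    layer_norms = 2 * 2 * hidden
    return embeddings + layers * (attention + feed_forward + layer_norms)

total = encoder_params()   # roughly 124M, close to the quoted 125M
```

The same formula with the large configuration (24 layers, hidden size 1,024, feed-forward 4,096) lands near 354M, close to the quoted 355M.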
RoBERTa was pretrained on five English-language datasets totaling over 160GB of uncompressed text. This represents a tenfold increase over the 16GB of text used to train the original BERT.
| Dataset | Size | Description |
|---|---|---|
| BookCorpus | ~5GB | Collection of over 11,000 unpublished books from various genres, originally used in BERT pretraining |
| English Wikipedia | ~11GB | Full text of English Wikipedia articles (excluding lists, tables, and headers), also used in BERT pretraining |
| CC-News | 76GB | A dataset of 63 million English news articles crawled from CommonCrawl News between September 2016 and February 2019 |
| OpenWebText | 38GB | An open-source recreation of the WebText corpus described by Radford et al. (2019), consisting of web content from URLs shared on Reddit with at least three upvotes |
| Stories | 31GB | A subset of CommonCrawl data filtered to match the style of Winograd schema stories, as introduced by Trinh and Le (2018) |
The CC-News, OpenWebText, and Stories datasets were the three additions beyond what BERT originally used. The diversity of these sources, spanning books, encyclopedic text, news articles, web content, and narrative stories, helped the model learn a broader range of language patterns.
The full RoBERTa model was trained using the Adam optimizer with the following hyperparameters.
| Hyperparameter | RoBERTa-base | RoBERTa-large |
|---|---|---|
| Peak learning rate | 6e-4 | 4e-4 |
| Batch size (sequences) | 8,000 | 8,000 |
| Training steps | 500,000 | 500,000 |
| Warmup steps | 24,000 | 30,000 |
| Adam epsilon | 1e-6 | 1e-6 |
| Adam beta_2 | 0.98 | 0.98 |
| Weight decay | 0.01 | 0.01 |
| Dropout | 0.1 | 0.1 |
| Learning rate schedule | Linear decay | Linear decay |
| Max sequence length | 512 | 512 |
Training was conducted on 1,024 NVIDIA V100 GPUs using mixed-precision floating-point arithmetic. The DGX-1 machines used each contained 8 V100 GPUs with 32GB of memory, and peak memory usage was approximately 18GB per GPU. The training process took roughly one day for 100K steps, meaning the full 500K-step training run took several days.
Notably, RoBERTa changed Adam's beta_2 parameter from BERT's default of 0.999 to 0.98, which the authors found to improve training stability with large batch sizes.
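The schedule in the table (linear warmup, then linear decay) can be written as a small function. The values below are RoBERTa-large's; decaying to exactly zero at step 500K is an assumption of this sketch:

```python
def learning_rate(step, peak=4e-4, warmup=30_000, total=500_000):
    """Linear warmup to the peak learning rate, then linear decay to zero
    (RoBERTa-large values: peak 4e-4, 30K warmup steps, 500K total steps)."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))

lr_mid_warmup = learning_rate(15_000)   # halfway through warmup: 2e-4
lr_at_peak = learning_rate(30_000)      # peak: 4e-4
lr_end = learning_rate(500_000)         # fully decayed: 0.0
```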
The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine natural language understanding tasks. RoBERTa achieved state-of-the-art results on the GLUE leaderboard at the time of publication, with an overall test score of 88.5.
The following table shows development set results for single-task, single-model fine-tuning.
| Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B |
|---|---|---|---|---|---|---|---|---|
| BERT-large | 86.6 | 92.3 | 91.3 | 70.4 | 93.2 | 88.0 | 60.6 | 90.0 |
| XLNet-large | 89.8 | 93.9 | 91.8 | 83.8 | 95.6 | 89.2 | 63.6 | 91.8 |
| RoBERTa-large | 90.2 | 94.7 | 92.2 | 86.6 | 96.4 | 90.9 | 68.0 | 92.4 |
RoBERTa outperformed both BERT-large and XLNet-large on every single GLUE task in the development set. The improvements were especially notable on CoLA (68.0 vs. 60.6 for BERT), RTE (86.6 vs. 70.4 for BERT), and MNLI (90.2 vs. 86.6 for BERT).
On the GLUE test set, RoBERTa achieved an average score of 88.5, which was the highest score on the leaderboard at the time, matching the performance achieved by XLNet's multi-task ensemble submission.
The Stanford Question Answering Dataset (SQuAD) evaluates reading comprehension. RoBERTa was evaluated on both SQuAD v1.1 and SQuAD v2.0, using only the provided SQuAD training data without any additional question answering datasets.
| Model | SQuAD v1.1 EM | SQuAD v1.1 F1 | SQuAD v2.0 EM | SQuAD v2.0 F1 |
|---|---|---|---|---|
| BERT-large | 84.1 | 90.9 | 79.0 | 81.8 |
| XLNet-large | 89.0 | 94.5 | 86.1 | 88.8 |
| RoBERTa-large | 88.9 | 94.6 | 86.5 | 89.4 |
RoBERTa achieved results comparable to XLNet on SQuAD v1.1 (94.6 F1 vs. 94.5 F1) and slightly better on SQuAD v2.0 (89.4 F1 vs. 88.8 F1). Both models significantly outperformed the original BERT-large baseline.
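SQuAD's F1 metric scores token overlap between the predicted and gold answer spans. A simplified version looks like this (the official evaluation script additionally strips punctuation and articles before comparing):

```python
def squad_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer span,
    the metric reported alongside exact match (EM) on SQuAD."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Count overlapping tokens, respecting multiplicity
    common, remaining = 0, list(gold_tokens)
    for tok in pred_tokens:
        if tok in remaining:
            remaining.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

score = squad_f1("the cat sat", "cat sat")   # precision 2/3, recall 1 -> 0.8
```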
The RACE (ReAding Comprehension from Examinations) benchmark consists of multiple-choice reading comprehension questions collected from English exams for Chinese middle and high school students.
| Model | Middle | High | Overall |
|---|---|---|---|
| BERT-large | 76.6 | 70.1 | 72.0 |
| XLNet-large | 85.4 | 80.2 | 81.7 |
| RoBERTa-large | 86.5 | 81.3 | 83.2 |
RoBERTa achieved the best overall accuracy on RACE at 83.2%, surpassing XLNet-large (81.7%) and BERT-large (72.0%) by significant margins.
RoBERTa emerged during a period of rapid development in pretrained language models. The following table provides a high-level comparison with other prominent models from the same era.
| Feature | BERT-large | RoBERTa-large | XLNet-large | ALBERT-xxlarge | ELECTRA-large |
|---|---|---|---|---|---|
| Year | 2018 | 2019 | 2019 | 2019 | 2020 |
| Organization | Google | Facebook AI | Google/CMU | Google | Google/Stanford |
| Parameters | 340M | 355M | 340M | 235M | 335M |
| Pretraining Objective | MLM + NSP | MLM only | Permutation LM | MLM + SOP | Replaced Token Detection |
| Tokenizer | WordPiece (30K) | Byte-level BPE (50K) | SentencePiece | SentencePiece (30K) | WordPiece (30K) |
| Training Data | 16GB | 160GB | ~126GB | 16GB | 16GB |
| Masking | Static | Dynamic | N/A (permutation) | Static | N/A (replacement) |
| NSP Task | Yes | No | No | No (uses SOP) | No |
| GLUE Dev (MNLI) | 86.6 | 90.2 | 89.8 | 90.8 | 90.9 |
| GLUE Dev (SST-2) | 93.2 | 96.4 | 95.6 | 96.9 | 96.9 |
| GLUE Dev (CoLA) | 60.6 | 68.0 | 63.6 | 71.4 | 69.1 |
| SQuAD v2.0 (F1) | 81.8 | 89.4 | 88.8 | N/A | 88.1 |
Two observations emerge from this comparison: RoBERTa's gains over BERT came from more data and longer training rather than from architectural changes, and later models such as ALBERT and ELECTRA reached comparable or better GLUE scores while training on only 16GB of text by changing the pretraining objective.
The central finding of the RoBERTa paper was that BERT was "significantly undertrained." The authors supported this claim through a systematic ablation study that isolated the contribution of each change.
BERT-large was trained for 1 million steps with a batch size of 256. When the RoBERTa authors trained the same architecture on the same BookCorpus and Wikipedia data, but with much larger batches (8K sequences for 100K steps, a substantially larger total token budget), they already observed improvements. Extending training to 300K and then 500K steps with the larger batch size yielded further gains on every benchmark.
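A back-of-envelope comparison of token budgets, assuming every sequence is the full 512 tokens (an upper bound, since packed sequences can be shorter) and that "8K" means 8,192 sequences:

```python
SEQ_LEN = 512   # maximum sequence length; actual sequences may be shorter

def token_budget(batch_size, steps, seq_len=SEQ_LEN):
    """Upper-bound count of training tokens processed."""
    return batch_size * steps * seq_len

bert = token_budget(256, 1_000_000)       # ~131 billion tokens
roberta = token_budget(8_192, 500_000)    # ~2.1 trillion tokens
ratio = roberta / bert                    # RoBERTa sees roughly 16x more tokens
```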
BERT's 16GB training corpus was limited in both size and diversity. When RoBERTa added CC-News, OpenWebText, and Stories to reach 160GB, performance improved even at the same number of training steps. The additional data provided the model with a broader distribution of language patterns and factual knowledge.
The next sentence prediction task was shown to be unnecessary and potentially harmful. Removing it, together with the switch to full-sentence inputs, meant training examples consisted of longer contiguous spans of text and let the model focus entirely on the masked language modeling objective, which proved more beneficial for downstream tasks.
BERT's static masking meant the model saw the same masking pattern for each training example in every epoch. Dynamic masking provided more diverse training signals, which became increasingly important over longer training runs.
The change in Adam's beta_2 from 0.999 to 0.98, the use of larger learning rates with large batches, and other hyperparameter adjustments all contributed to better training dynamics.
Taken together, these findings demonstrated that the gap between BERT and subsequent models like XLNet was not primarily due to architectural innovation but rather to undertrained baselines.
RoBERTa has been widely adopted for a variety of NLP tasks through fine-tuning.
RoBERTa serves as a strong backbone for sentiment analysis, topic classification, spam detection, and other document-level classification tasks. Its strong performance on SST-2 (96.4% accuracy) reflects its ability to capture nuanced language patterns relevant to sentiment.
Named entity recognition (NER) benefits from RoBERTa's contextual representations. Fine-tuned RoBERTa models have been used for identifying persons, organizations, locations, and domain-specific entities in various domains including biomedical text and legal documents.
RoBERTa's strong SQuAD results translate to effective question answering systems. The model can be fine-tuned on extractive QA datasets to locate answer spans within passages.
Tasks like MNLI, which require determining whether a hypothesis is entailed by, contradicted by, or neutral with respect to a premise, are well-served by RoBERTa's deep bidirectional representations.
RoBERTa's performance on STS-B (92.4 Spearman correlation) makes it effective for paraphrase detection, duplicate question identification, and semantic search applications.
The two primary variants correspond to BERT-base and BERT-large configurations. RoBERTa-base (125M parameters, 12 layers) is suitable for resource-constrained environments, while RoBERTa-large (355M parameters, 24 layers) provides the best performance.
Development set results for both variants on the GLUE benchmark:
| Task | RoBERTa-base | RoBERTa-large |
|---|---|---|
| MNLI | 87.6 | 90.2 |
| QNLI | 92.8 | 94.7 |
| QQP | 91.9 | 92.2 |
| RTE | 78.7 | 86.6 |
| SST-2 | 94.8 | 96.4 |
| MRPC | 90.2 | 90.9 |
| CoLA | 63.6 | 68.0 |
| STS-B | 91.2 | 92.4 |
XLM-RoBERTa (Conneau et al., 2020) extended RoBERTa's training approach to the multilingual setting. Pretrained on 2.5TB of filtered CommonCrawl data spanning 100 languages, XLM-RoBERTa achieved strong results on cross-lingual benchmarks. It outperformed multilingual BERT (mBERT) by +14.6% average accuracy on XNLI and +13% average F1 on MLQA, while also performing competitively with monolingual models on English benchmarks. XLM-RoBERTa demonstrated that the RoBERTa training recipe could scale effectively across languages.
DistilRoBERTa applies knowledge distillation to RoBERTa, producing a smaller, faster model that retains much of RoBERTa's performance. With 82 million parameters (roughly 66% of RoBERTa-base), DistilRoBERTa offers a practical option for deployment in latency-sensitive applications.
RoBERTa models are available through the Hugging Face Transformers library under the FacebookAI organization. The primary model identifiers are:
| Model | Hugging Face ID | Parameters |
|---|---|---|
| RoBERTa-base | FacebookAI/roberta-base | 125M |
| RoBERTa-large | FacebookAI/roberta-large | 355M |
| RoBERTa-large (MNLI) | FacebookAI/roberta-large-mnli | 355M |
| XLM-RoBERTa-base | FacebookAI/xlm-roberta-base | 278M |
| XLM-RoBERTa-large | FacebookAI/xlm-roberta-large | 559M |
These models can be loaded with a few lines of Python code using the Transformers library:
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-large")
model = AutoModel.from_pretrained("FacebookAI/roberta-large")
```
The Hugging Face model hub also hosts numerous community-contributed RoBERTa models fine-tuned for specific tasks, including sentiment analysis, NER, question answering, and domain-specific applications in biomedical, legal, and financial text.
RoBERTa's impact on the NLP field extends well beyond its benchmark scores.
The paper's most lasting contribution may be its demonstration that training procedure matters as much as, if not more than, model architecture. This insight influenced the entire field, leading researchers to invest more effort in hyperparameter tuning, data curation, and training duration before proposing new architectures.
RoBERTa's training methodology directly influenced several important subsequent models, most directly the multilingual XLM-RoBERTa and the distilled DistilRoBERTa, and more broadly the many encoders that adopted its no-NSP, large-batch, more-data recipe.
RoBERTa became a default choice for many NLP practitioners and researchers as a strong pretrained encoder. Its availability through Hugging Face, combined with its consistent performance across tasks, made it a standard baseline in hundreds of research papers. Many shared task competitions and industrial NLP systems adopted RoBERTa as their starting point for fine-tuning.
Before RoBERTa, the NLP community often attributed performance gains to architectural novelty. RoBERTa shifted this narrative by showing that scaling data, compute, and training duration could be equally or more important. This insight foreshadowed the broader "scaling laws" findings that would later emerge in the context of large language models like GPT-3 and beyond.
Despite its strengths, RoBERTa has several limitations: