BART (language model)
Last reviewed
Apr 28, 2026
Sources
13 citations
Review status
Source-backed
Revision
v1 · 4,018 words
BART (an acronym for Bidirectional and Auto-Regressive Transformers) is a transformer-based encoder-decoder language model introduced by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer at Facebook AI Research in October 2019, with the formal publication appearing at the annual meeting of the Association for Computational Linguistics (ACL) in 2020 [1][2]. The model is designed as a denoising autoencoder for sequence-to-sequence pretraining: text is corrupted by an arbitrary noising function, and a single transformer is trained to reconstruct the original input from the corrupted version. By coupling a bidirectional encoder in the style of BERT with a left-to-right autoregressive decoder in the style of GPT, BART unified two of the dominant pretraining paradigms of the late 2010s into one architecture that can be fine-tuned for both natural language understanding and natural language generation tasks.
BART was particularly influential as a stepping stone between encoder-only and decoder-only pretraining. Earlier work had shown that encoder-only masked language models like BERT excelled at classification and span extraction, while decoder-only causal models like the original GPT excelled at open-ended text generation, but neither family was ideal for sequence-to-sequence tasks like summarization, abstractive question answering, and machine translation. BART showed that a single denoising autoencoder could match the strongest encoder-only models on the GLUE and SQuAD benchmarks while simultaneously setting new state-of-the-art results on abstractive summarization (CNN/DailyMail and XSum), dialogue (ConvAI2), and abstractive question answering (ELI5) [1]. The two main released checkpoints, BART-base with about 140 million parameters and BART-large with about 406 million parameters, became workhorse models in natural language processing pipelines for several years.
A follow-up paper by Liu et al. (2020) introduced a multilingual extension, mBART, which applied the BART denoising recipe to large monolingual corpora in 25 languages (and later 50 languages in mBART-50), producing one of the first general-purpose pretrained sequence-to-sequence models for low-resource and unsupervised machine translation [3]. BART and its variants were rapidly absorbed into the Hugging Face Transformers library and have remained among the most downloaded encoder-decoder checkpoints. Although BART has since been largely superseded by T5 and the Flan-T5 family for generic encoder-decoder pretraining, and by decoder-only large language model families for general text generation, fine-tuned BART variants (especially facebook/bart-large-cnn for news summarization and facebook/bart-large-mnli for zero-shot text classification) remain in active production use.
The two years preceding BART had been dominated by two distinct pretraining recipes for transformer language models. The first, exemplified by BERT (Devlin et al., 2018), used an encoder-only architecture trained with masked language modeling: roughly 15 percent of input tokens were replaced with a [MASK] symbol, and the model was trained to predict the original tokens from surrounding context [4]. The masked language modeling objective produced exceptionally strong representations for classification and span-extraction tasks but was awkward for generation, because the model never learned to produce a coherent left-to-right sequence.
The second recipe, exemplified by the original GPT (Radford et al., 2018) and later GPT-2 and GPT-3, used a decoder-only architecture trained with a standard left-to-right language modeling objective. The decoder produced fluent text autoregressively but, because each position could attend only to past tokens, its representations were generally weaker for tasks requiring bidirectional context. Several subsequent attempts tried to bridge the two camps. UniLM combined three attention masks within a single transformer. XLNet introduced permutation language modeling. MASS trained a sequence-to-sequence model to reconstruct masked spans. Each improved on either understanding or generation, but none simultaneously matched the best encoder-only model on understanding tasks while also setting new state-of-the-art results on generation tasks.
The BART authors argued that the most natural way to unify the two recipes was to build a standard sequence-to-sequence transformer (a bidirectional encoder coupled with a causal decoder) and train it as a denoising autoencoder. Because the encoder is bidirectional, the model can learn rich contextual representations of the input. Because the decoder is autoregressive, the model can also generate fluent output text. Because the corruption function applied to the input is arbitrary, the same architecture supports any noising scheme one might want to study. This generality let the BART team systematically compare pretraining objectives within a single fixed architecture.
BART uses a standard transformer encoder-decoder architecture, almost identical to the original transformer of Vaswani et al. (2017), with two minor modifications that follow GPT: the ReLU activations in the feed-forward sublayers are replaced with GeLU (Gaussian Error Linear Units), and the model parameters are initialized from the normal distribution N(0, 0.02). Relative to BERT, the architecture differs in two further ways. First, each layer of the BART decoder performs cross-attention over the final hidden layer of the encoder, exactly as in the original transformer for translation. Second, BERT applies an additional feed-forward network before predicting masked tokens, which BART omits.
BART's encoder is bidirectional: every token can attend to every other token in the input. Its decoder is causal: each token can attend only to itself, all earlier decoder tokens, and the full encoder hidden state. The encoder and decoder do not share parameters. This combination yields roughly 10 percent more parameters than a comparably sized encoder-only BERT, because BART maintains a complete decoder stack rather than just a small prediction head.
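The three attention patterns described above can be written down as boolean masks. The following is a minimal NumPy sketch, not BART's actual implementation; the function names are hypothetical, and `True` means "this query position may attend to this key position".

```python
import numpy as np


def encoder_self_mask(src_len: int) -> np.ndarray:
    """Bidirectional encoder: every input token sees every input token."""
    return np.ones((src_len, src_len), dtype=bool)


def decoder_self_mask(tgt_len: int) -> np.ndarray:
    """Causal decoder: position i sees itself and earlier positions only."""
    return np.tril(np.ones((tgt_len, tgt_len), dtype=bool))


def cross_attention_mask(tgt_len: int, src_len: int) -> np.ndarray:
    """Cross-attention: every decoder position sees the full encoder output."""
    return np.ones((tgt_len, src_len), dtype=bool)


if __name__ == "__main__":
    # Rows are query positions, columns are key positions.
    print(decoder_self_mask(4).astype(int))
```

Padding masks, which a real batch would also need, are left out to keep the contrast between the three patterns visible.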
The two principal released checkpoints have the following dimensions:
| Variant | Encoder layers | Decoder layers | Hidden size | Attention heads | Feed-forward size | Total parameters |
|---|---|---|---|---|---|---|
| BART-base | 6 | 6 | 768 | 12 | 3,072 | approximately 140 million |
| BART-large | 12 | 12 | 1,024 | 16 | 4,096 | approximately 406 million |
Both variants use a tied input/output embedding matrix, the same byte-pair encoding (BPE) vocabulary as GPT-2 (about 50,000 subword tokens), and a maximum input length of 1,024 tokens. Each model is roughly comparable in parameter count to the BERT model of the same name. In the large case this holds despite the extra decoder stack, because BERT-large uses a 24-layer encoder whereas BART-large divides its 24 layers into 12 encoder and 12 decoder layers [1].
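The parameter totals in the table can be roughly reproduced from the listed dimensions. The sketch below is a back-of-the-envelope count under stated assumptions: a vocabulary of 50,265 entries (the text above only says "about 50,000"), learned positional embeddings of 1,024 positions for each of the encoder and decoder, biases on every linear layer, and layer normalization after each sublayer. The function name is hypothetical.

```python
def estimate_bart_params(layers: int, d_model: int, d_ff: int,
                         vocab: int = 50_265, max_pos: int = 1_024) -> int:
    """Approximate parameter count for `layers` encoder + `layers` decoder layers."""
    attn = 4 * (d_model * d_model + d_model)       # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff + d_ff + d_model      # two linear layers with biases
    norm = 2 * d_model                             # layer-norm scale and shift
    enc_layer = attn + ffn + 2 * norm              # self-attention + FFN
    dec_layer = 2 * attn + ffn + 3 * norm          # self-attn + cross-attn + FFN
    # Token embeddings are tied with the output projection, so counted once;
    # positional embeddings are assumed separate for encoder and decoder.
    embeddings = vocab * d_model + 2 * max_pos * d_model
    return layers * (enc_layer + dec_layer) + embeddings


if __name__ == "__main__":
    print(f"base:  {estimate_bart_params(6, 768, 3072) / 1e6:.1f}M")
    print(f"large: {estimate_bart_params(12, 1024, 4096) / 1e6:.1f}M")
```

Under these assumptions the estimate lands at about 139 million for the base dimensions and about 406 million for the large dimensions, in line with the table.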
The central idea of BART is that pretraining is a denoising task: a corruption function is applied to a chunk of text, the corrupted version is fed to the encoder, and the decoder is trained to autoregressively produce the original uncorrupted text. The training objective is the cross-entropy loss between the decoder's predictions and the original tokens, summed across all positions. This is identical to the standard sequence-to-sequence training objective used in neural machine translation, with the corruption function playing the role of the source language and the original document playing the role of the target.
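As a concrete illustration of this objective, the sketch below computes the summed cross-entropy between decoder predictions and the original tokens using NumPy. It assumes the decoder has already produced one row of logits per target position (teacher forcing); the function name and array shapes are illustrative choices rather than anything specified by the paper.

```python
import numpy as np


def denoising_loss(logits: np.ndarray, target_ids: np.ndarray) -> float:
    """Summed cross-entropy over all target positions.

    logits:     [target_len, vocab_size] decoder scores, computed from the
                corrupted input plus the previous original tokens.
    target_ids: [target_len] ids of the original, uncorrupted tokens.
    """
    logits = np.asarray(logits, dtype=np.float64)
    target_ids = np.asarray(target_ids, dtype=np.int64)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to each original token and sum.
    picked = log_probs[np.arange(target_ids.shape[0]), target_ids]
    return float(-picked.sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(denoising_loss(rng.normal(size=(5, 10)), np.array([1, 4, 2, 9, 0])))
```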
A major contribution of the BART paper was a controlled empirical comparison of several different corruption functions, all trained at the same scale and evaluated on the same downstream tasks. Five noising functions were studied:
| Noising function | Description | Effect |
|---|---|---|
| Token masking | Replace random tokens (about 15%) with a special [MASK] symbol, as in BERT. | Forces the model to predict the original token from surrounding context, but does not require predicting the number of tokens missing from a span. |
| Token deletion | Delete random tokens from the input entirely. | The model must decide both which positions are missing and what tokens belong there. Deletion implicitly tests positional reasoning. |
| Text infilling | Sample a number of text spans whose lengths are drawn from a Poisson distribution with mean 3. Replace each span with a single [MASK] symbol. Zero-length spans correspond to insertions of a [MASK] between tokens. | Generalizes token masking (length-1 spans), while zero-length spans insert a [MASK] where nothing is missing. The model must predict the missing tokens and the number of missing tokens. |
| Sentence permutation | Split the document on full stops and shuffle the resulting sentences into a random order. | Forces the model to learn document-level coherence and reordering, useful for tasks that depend on discourse structure. |
| Document rotation | Pick a token uniformly at random and rotate the document so it begins with that token. | Trains the model to identify the true beginning of the document. |
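Four of the five noising functions in the table are simple enough to sketch in a few lines each. The code below is an illustrative sketch, not the original training code: the `<mask>` string, the 15 percent deletion rate, whitespace tokens, and the naive split on full stops are all assumptions made for the example.

```python
import random

MASK = "<mask>"  # assumed mask symbol for illustration


def token_masking(tokens, p=0.15, seed=0):
    """Replace each token with MASK independently with probability p."""
    rng = random.Random(seed)
    return [MASK if rng.random() < p else t for t in tokens]


def token_deletion(tokens, p=0.15, seed=0):
    """Drop each token independently with probability p (rate is an assumption)."""
    rng = random.Random(seed)
    return [t for t in tokens if rng.random() >= p]


def sentence_permutation(text, seed=0):
    """Split on full stops and shuffle the sentences."""
    rng = random.Random(seed)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return " ".join(s + "." for s in sentences)


def document_rotation(tokens, seed=0):
    """Rotate the token list so it starts at a uniformly chosen token."""
    if not tokens:
        return []
    k = random.Random(seed).randrange(len(tokens))
    return tokens[k:] + tokens[:k]


if __name__ == "__main__":
    doc = "BART is a denoising autoencoder. It has an encoder. It has a decoder."
    print(token_masking(doc.split()))
    print(token_deletion(doc.split()))
    print(sentence_permutation(doc))
    print(document_rotation(doc.split()))
```

Note the difference the table points out: masking preserves sequence length, so positions of the missing tokens are visible, whereas deletion does not.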
The authors compared these five functions, alone and in combination, on a battery of downstream tasks (SQuAD, MNLI, ELI5, XSum, ConvAI2, and CNN/DailyMail). The headline finding was that text infilling consistently performed best across the suite of tasks, that sentence permutation contributed modest additional gains on tasks involving long documents, and that document rotation performed poorly and was therefore excluded from the final recipe. Token masking and token deletion were not kept as separate objectives; text infilling already covers single-token masking as its length-1 case and, like deletion, hides how many tokens are missing. The final BART pretraining recipe combined text infilling (masking about 30 percent of the tokens in each document) with permutation of all sentences [1].
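Text infilling is the least obvious of the noising functions, so a small sketch may help. The version below samples span lengths from a Poisson distribution with mean 3, replaces each span with a single mask, and stops once about 30 percent of the tokens have been masked. It is a simplified illustration under stated assumptions (uniform span starts, rejection of overlapping spans, a retry cap, the `<mask>` string), not a reproduction of the original training code.

```python
import numpy as np

MASK = "<mask>"  # assumed mask symbol for illustration


def text_infilling(tokens, mask_ratio=0.3, poisson_lambda=3.0, seed=0, max_tries=1000):
    """Replace Poisson-length spans with a single MASK each."""
    rng = np.random.default_rng(seed)
    n = len(tokens)
    budget = int(round(n * mask_ratio))   # number of original tokens to hide
    covered = [False] * n                 # True where a token is already masked
    span_len = {}                         # span start index -> span length (>= 1)
    inserts = set()                       # positions receiving a zero-length span
    tries = 0
    while budget > 0 and tries < max_tries:
        tries += 1
        length = min(int(rng.poisson(poisson_lambda)), budget)
        start = int(rng.integers(0, n - length + 1))
        if length == 0:
            inserts.add(start)            # zero-length span: insert a MASK only
            continue
        if any(covered[start:start + length]):
            continue                      # overlaps an earlier span; resample
        for i in range(start, start + length):
            covered[i] = True
        span_len[start] = length
        budget -= length

    out, i = [], 0
    while i < n:
        if i in span_len:                 # a masked span starts here
            out.append(MASK)
            i += span_len[i]
            continue
        if i in inserts:
            out.append(MASK)
        out.append(tokens[i])
        i += 1
    if n in inserts:
        out.append(MASK)
    return out


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog near the river bank".split()
    print(" ".join(text_infilling(words)))
```

Because each span collapses to one mask regardless of its length, the corrupted input does not reveal how many tokens the decoder has to produce.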
BART-large was pretrained on the same 160 GB of text used for RoBERTa, a curated mixture of news articles (CC-NEWS), books (BookCorpus), web text (OpenWebText), and stories (CC-Stories). Training used the Adam optimizer, a batch size of 8,000 sequences, and a similar schedule to RoBERTa, running for about 500,000 update steps. The training compute was therefore comparable to RoBERTa, allowing direct comparison of the two pretraining objectives at a controlled scale.
A key advantage of BART over encoder-only or decoder-only models is that it can be fine-tuned for a wide range of downstream tasks with minimal architectural surgery. The paper proposed several adaptation patterns, all of which involve loading the pretrained encoder and decoder and then fine-tuning the entire model on labeled data for the target task.
Sequence classification. For tasks like GLUE, the same input is fed to both the encoder and the decoder, and the final hidden state of a special end-of-sequence token in the decoder is passed to a small classification head. This approach lets the decoder attend to the full input via cross-attention while still producing a single representation for classification.
Token classification. For span-level tasks like SQuAD answer extraction, the decoder is again fed the full input, and the top hidden state at each position is used to predict the label of the corresponding input token.
Sequence generation. For abstractive tasks like summarization, dialogue, and abstractive question answering, fine-tuning is identical to pretraining: the input document is fed to the encoder, and the decoder is trained autoregressively to produce the target sequence.
Machine translation. For translation into English, the entire BART model (encoder and decoder together) is used as a single pretrained decoder: BART's encoder embedding layer is replaced with a small randomly initialized encoder that maps the source language into an input BART can denoise into English. Training proceeds in two stages. The new encoder is trained first while most of BART's parameters are frozen, then the entire system is fine-tuned together for a small number of iterations. This approach uses BART as a strong pretrained denoising prior over English text.
BART-large was evaluated on a wide range of natural language understanding and generation benchmarks. The results below are from the original paper and represent the state of the art at the time of publication in 2019 and early 2020 [1].
On discriminative tasks, BART performed comparably to RoBERTa, the strongest encoder-only model at the time, despite using a sequence-to-sequence rather than encoder-only architecture. On the GLUE benchmark, BART achieved an average score competitive with RoBERTa across the eight subtasks. On SQuAD 1.1, BART reached an exact-match score of 88.8 and an F1 score of 94.6, comparable to RoBERTa's 88.9 EM and 94.6 F1. The result demonstrated that adding an autoregressive decoder did not hurt performance on classic understanding tasks, at the cost of roughly 10 percent more parameters than an equivalently sized BERT-style encoder.
BART set new state-of-the-art results on two widely used abstractive summarization benchmarks. On CNN/DailyMail, a dataset of news articles paired with multi-sentence highlight summaries, BART-large achieved ROUGE-1 of 44.16, ROUGE-2 of 21.28, and ROUGE-L of 40.90, improving on the previous best system by about 1 ROUGE point. On XSum, a more aggressively abstractive single-sentence summarization dataset built from BBC articles, BART achieved ROUGE-1 of 45.14, ROUGE-2 of 22.27, and ROUGE-L of 37.25, improving on the previous best by roughly 6 ROUGE points (a much larger margin than on CNN/DailyMail, because XSum favors more abstractive systems).
On ELI5, a long-form question-answering dataset built from the Explain Like I'm Five subreddit, BART achieved a ROUGE-L of 24.3, improving on the previous state-of-the-art system by about 1.2 ROUGE-L points. On ConvAI2, a persona-based dialogue benchmark, BART obtained lower perplexity and higher unigram F1 than the prior systems it was compared against.
For machine translation, BART was tested on WMT16 Romanian-English translation, with the small randomly initialized encoder placed in front of the pretrained BART decoder. The system improved on a strong back-translation baseline by 1.1 BLEU points, demonstrating that BART pretraining transferred meaningfully even to a task that did not directly resemble its denoising objective.
The BART recipe spawned a small ecosystem of derivative models, several of which are still widely used in production systems as of 2026.
| Model | Released | Parameters | Notes |
|---|---|---|---|
| BART-base | 2019 | approximately 140M | 6+6 encoder/decoder layers, hidden 768. Used for low-latency fine-tuning. |
| BART-large | 2019 | approximately 406M | 12+12 encoder/decoder layers, hidden 1024. The flagship model. |
| facebook/bart-large-cnn | 2019 | approximately 406M | BART-large fine-tuned on CNN/DailyMail. The most widely deployed pretrained summarizer on Hugging Face. |
| facebook/bart-large-xsum | 2019 | approximately 406M | BART-large fine-tuned on XSum for one-sentence abstractive summarization. |
| facebook/bart-large-mnli | 2019 | approximately 406M | BART-large fine-tuned on MultiNLI. Heavily used for zero-shot text classification via the natural language inference framing of Yin et al. |
| mBART-25 | 2020 | approximately 610M | Multilingual BART pretrained on monolingual corpora in 25 languages. Strong on low-resource and document-level translation. |
| mBART-50 | 2020 | approximately 610M | Extension of mBART to 50 languages, including additional fine-tuning recipes for many-to-many translation. |
| DistilBART variants | 2020 | approximately 200-300M | Hugging Face distillations of BART summarizers using "shrink and fine-tune" (SFT). |
| BARTScore | 2021 | n/a | Not a model variant per se; a popular metric for text generation that uses BART log-likelihoods to score candidate outputs. |
| PLBART | 2021 | approximately 140M | Programming Language BART, pretrained on source code in addition to text, for code summarization and translation tasks. |
DistilBART was particularly important for production deployments. The Hugging Face team showed that one could keep all 12 encoder layers of BART-large and reduce the decoder to 6 or 3 layers (or apply an analogous reduction to the encoder), initialize the smaller student from a subset of the teacher's layers, and then fine-tune it directly on the same target task, the "shrink and fine-tune" procedure [5]. The same work compared this against training on teacher-generated pseudo-labels and against knowledge distillation with soft targets. The resulting models had roughly half to three quarters of the teacher's parameter count and almost the full quality, making them tractable to serve in production.
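The "shrink" step needs a rule for which teacher layers the student inherits. The sketch below picks evenly spaced layers and always keeps the first and last; this selection rule is an assumption chosen for illustration, and the released DistilBART checkpoints may have used a different mapping. The function name is hypothetical.

```python
def select_teacher_layers(teacher_layers: int, student_layers: int) -> list[int]:
    """Return indices of teacher layers to copy into a smaller student stack.

    Layers are spread evenly across the teacher so that the student keeps
    both the lowest and the highest layer (an illustrative heuristic).
    """
    if not 1 <= student_layers <= teacher_layers:
        raise ValueError("student must have between 1 and teacher_layers layers")
    if student_layers == 1:
        return [0]
    step = (teacher_layers - 1) / (student_layers - 1)
    return [round(i * step) for i in range(student_layers)]


if __name__ == "__main__":
    # A 12-layer teacher decoder shrunk to 6 and to 3 layers.
    print(select_teacher_layers(12, 6))  # [0, 2, 4, 7, 9, 11]
    print(select_teacher_layers(12, 3))  # [0, 6, 11]
```

After copying the selected layers, the student is fine-tuned on the task data as usual; no distillation loss is involved in this variant.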
mBART, introduced by Liu et al. (2020), pretrained the same architecture on a 1.4 TB multilingual corpus assembled from Common Crawl (CC25), spanning 25 languages drawn from major language families including Romance, Germanic, Slavic, Sino-Tibetan, Indo-Aryan, Semitic, and Japonic [3]. Compared with prior monolingual or encoder-only multilingual pretraining, mBART produced especially strong gains on low-resource translation, with improvements of up to 12 BLEU points over a strong baseline on language pairs with under 10 million sentence pairs of supervision. mBART was later extended to mBART-50 with an additional 25 languages and adapted into a many-to-many multilingual translation model.
BART occupies a distinct position in the early history of pretrained transformer language models, sitting between encoder-only and decoder-only systems. The table below compares BART with its most direct contemporaries.
| Property | BART | BERT | GPT-3 | T5 |
|---|---|---|---|---|
| Architecture | Bidirectional encoder + autoregressive decoder | Bidirectional encoder only | Autoregressive decoder only | Bidirectional encoder + autoregressive decoder |
| Lab | Facebook AI Research | Google Research | OpenAI | Google Research |
| Released | October 2019 | October 2018 | May/June 2020 | October 2019 |
| Pretraining objective | Denoising autoencoder (text infilling + sentence permutation) | Masked language modeling + next sentence prediction | Causal (left-to-right) language modeling | Denoising span corruption (replace spans with sentinel tokens) |
| Largest released parameter count | 406M (BART-large) | 340M (BERT-large) | 175B (GPT-3) | 11B (T5-11B) |
| Pretraining data | 160 GB (same as RoBERTa) | 16 GB (Wikipedia + BookCorpus) | 570 GB filtered web | 750 GB (C4 corpus) |
| Strength on classification | Strong (matches RoBERTa) | Strong | Weak (only via in-context learning) | Strong |
| Strength on generation | Very strong (state of the art on summarization at release) | Weak (no autoregressive decoder) | Very strong (open-ended) | Very strong |
| Designed primary use | Sequence-to-sequence tasks | Discriminative tasks | Few-shot in-context learning | Unified text-to-text |
| Tokenizer | GPT-2 BPE (about 50K) | WordPiece (about 30K) | GPT-2 BPE (about 50K) | SentencePiece (about 32K) |
The key conceptual difference between BART and T5 is the framing of pretraining. BART is a denoising autoencoder: the input is a corrupted text, the target is the original text, and the loss is computed only on the target sequence. T5, in contrast, is framed as a unified text-to-text problem: every task (including pretraining) is cast as mapping an input string to an output string with task-specific prefixes such as summarize: or translate English to German:. The pretraining corruption in T5 also differs in detail: T5 replaces spans with unique sentinel tokens (<extra_id_0>, <extra_id_1>, etc.) and trains the decoder to emit the missing spans separated by the same sentinels, rather than reconstructing the entire original document.
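The difference in target format is easiest to see on a single example. The sketch below takes one sentence and a fixed set of spans to corrupt, then builds both a BART-style pair (single mask per span, full original as target) and a T5-style pair (numbered sentinels, spans only as target, closed by a final sentinel). The function name, the `<mask>` string, and the hand-picked spans are assumptions for illustration; BART's sentence permutation is left out.

```python
def build_pairs(tokens, spans):
    """Build BART-style and T5-style (input, target) pairs.

    spans: sorted, non-overlapping (start, length) pairs with length >= 1.
    """
    bart_in, t5_in, t5_tgt = [], [], []
    cursor = 0
    for k, (start, length) in enumerate(spans):
        kept = tokens[cursor:start]
        sentinel = f"<extra_id_{k}>"
        bart_in += kept + ["<mask>"]              # every span becomes the same mask
        t5_in += kept + [sentinel]                # every span gets its own sentinel
        t5_tgt += [sentinel] + tokens[start:start + length]
        cursor = start + length
    bart_in += tokens[cursor:]
    t5_in += tokens[cursor:]
    t5_tgt.append(f"<extra_id_{len(spans)}>")     # closing sentinel
    return {
        "bart_input": " ".join(bart_in),
        "bart_target": " ".join(tokens),          # reconstruct the whole document
        "t5_input": " ".join(t5_in),
        "t5_target": " ".join(t5_tgt),            # emit only the missing spans
    }


if __name__ == "__main__":
    sentence = "Thank you for inviting me to your party last week".split()
    for name, text in build_pairs(sentence, [(2, 2), (8, 1)]).items():
        print(f"{name:12s} {text}")
```

The BART target is always as long as the original document, while the T5 target grows only with the amount of corrupted text.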
The key difference between BART and BERT is that BART's autoregressive decoder makes it natural to fine-tune for free-form generation, while BERT requires either a separately trained decoder or awkward iterative procedures for generation. Conversely, BART has a small parameter overhead relative to BERT for the same encoder size, which is the price of carrying around a decoder.
The key difference between BART and GPT-style decoder-only models is the bidirectional encoder. Bidirectional attention over the input gives BART much stronger representations for tasks where understanding the full input precisely is critical (classification, span extraction, summarization of long documents). Decoder-only models only acquired comparable performance on these tasks much later, after scaling to tens or hundreds of billions of parameters and adopting in-context learning.
BART was published as an arXiv preprint in October 2019 and accepted to ACL 2020, where it won wide attention and has since accumulated tens of thousands of citations. The paper's most cited contribution outside of the BART model itself was the controlled comparison of pretraining objectives, which became an experimental template for many follow-up papers studying span corruption, denoising, and self-supervised learning. The finding that text infilling consistently outperformed simpler masking schemes agreed with concurrent work on span-based corruption, such as T5, and informed many subsequent encoder-decoder systems.
The model was integrated into the Hugging Face Transformers library shortly after release and quickly became one of the most downloaded checkpoints on the hub. The facebook/bart-large-cnn checkpoint receives several million downloads per month and remains a default summarization model in many applied NLP pipelines, particularly for news, legal, and medical document summarization. The facebook/bart-large-mnli checkpoint is similarly central to many zero-shot text classification systems, using the natural language inference framing in which a candidate label is converted into a hypothesis and the input text is treated as the premise.
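The natural language inference framing can be sketched without loading any model. In the code below, each candidate label is turned into a hypothesis, an entailment score is requested for every (premise, hypothesis) pair, and the scores are normalized with a softmax across labels. The `entailment_logit` callable stands in for a real NLI model such as facebook/bart-large-mnli; the toy scorer, the template string, and the function names are assumptions for illustration only.

```python
import math
from typing import Callable, Dict, Sequence


def zero_shot_classify(
    text: str,
    labels: Sequence[str],
    entailment_logit: Callable[[str, str], float],
    template: str = "This example is {}.",
) -> Dict[str, float]:
    """Score each label by how strongly `text` entails its hypothesis."""
    hypotheses = [template.format(label) for label in labels]
    logits = [entailment_logit(text, h) for h in hypotheses]
    # Softmax over the entailment logits of all candidate labels.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


def toy_entailment_logit(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in scorer: counts hypothesis words found in the premise."""
    words = hypothesis.lower().rstrip(".").split()
    return float(sum(w in premise.lower() for w in words))


if __name__ == "__main__":
    scores = zero_shot_classify(
        "The team won the football match",
        ["football", "cooking"],
        toy_entailment_logit,
    )
    print(scores)
```

Swapping the toy scorer for a function that returns the entailment logit of a fine-tuned NLI model yields a single-label zero-shot classifier; multi-label use would instead score each label independently.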
BART also strongly influenced the design of subsequent encoder-decoder language models. T5, released the same month, pursued a related denoising approach within a unified text-to-text framework and scaled the encoder-decoder architecture up to 11 billion parameters. BigBird and Longformer-Encoder-Decoder applied sparse attention patterns to extend BART-style models to long documents. PEGASUS introduced a summarization-specific pretraining objective (gap-sentence generation) and outperformed BART on several news summarization benchmarks. The MarianMT and OPUS-MT translation models, although trained on parallel data rather than with a denoising objective, are implemented in Hugging Face Transformers with modeling code closely based on BART's.
By 2026, the broader frontier of language modeling has shifted decisively toward decoder-only large language models trained at scales hundreds or thousands of times larger than BART-large. Models like GPT-4, Claude, Gemini, Llama 3, Mistral, and DeepSeek dominate the discourse on capabilities, and their general-purpose generation quality far exceeds anything BART could produce, especially for instruction following and multi-step reasoning. For new pretraining of encoder-decoder systems, T5, Flan-T5, and UL2 are typically preferred because their unified text-to-text framework integrates more cleanly with instruction-tuning recipes.
Despite this, BART variants remain in active production use. Fine-tuned summarization models like bart-large-cnn and bart-large-xsum have been exercised on real-world traffic for years: their failure modes, including hallucination, are well characterized, they follow input length constraints reliably, and they run cheaply on commodity GPUs. The bart-large-mnli checkpoint has become a de facto standard for zero-shot text classification in production systems where an LLM with a few-shot prompt is overkill in latency, cost, or privacy footprint. BART's modest size (406M parameters) also makes it tractable to fine-tune on a single GPU for domain-specific summarization or generation tasks, an attractive operating point for many applied teams. As a foundational reference, BART remains the canonical example of a denoising-autoencoder sequence-to-sequence model, and its influence on the broader practice of treating pretraining as denoising rather than as next-token prediction is hard to overstate.
BART shares many limitations with other transformer-based language models of its generation. The maximum input length of 1,024 tokens is a hard ceiling that limits its ability to summarize long documents (full books, lengthy legal contracts, academic papers) without chunking. The model can hallucinate plausible but incorrect facts, especially in summarization where the training objective rewards fluency more strongly than factuality. It has limited multilingual capability outside of mBART. Its vocabulary is fixed at pretraining time, and any added tokens need further training before their embeddings are useful. BART's pretraining corpus also reflects the biases of the 160 GB English text mixture used for RoBERTa, and its world knowledge is frozen at the corpus cutoff in mid-2019.