BART (language model)


Updated Jun 21, 2026


BART (an acronym for Bidirectional and Auto-Regressive Transformers) is a transformer-based encoder-decoder language model introduced by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer at Facebook AI Research in October 2019, with the formal publication appearing at the Association for Computational Linguistics annual meeting (ACL) in 2020 ^[1]^[2]. It is a denoising autoencoder for sequence-to-sequence pretraining: text is corrupted by an arbitrary noising function, and a single encoder-decoder transformer is trained to reconstruct the original text from the corrupted version. The authors describe BART succinctly as being "trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text" ^[1]. By coupling a bidirectional encoder in the style of BERT with a left-to-right autoregressive decoder in the style of GPT, BART unified two of the dominant pretraining paradigms of the late 2010s into one architecture that can be fine-tuned for both natural language understanding and natural language generation tasks.

BART matched the performance of RoBERTa on the GLUE and SQuAD understanding benchmarks while setting new state-of-the-art results on a range of generation tasks, with the paper reporting gains of up to 6 ROUGE points on abstractive summarization and a 1.1 BLEU increase over a back-translation system on machine translation ^[1]. The two main released checkpoints, BART-base with about 140 million parameters and BART-large with about 406 million parameters, both became staple workhorse models in natural language processing pipelines for several years. As of 2026 the fine-tuned facebook/bart-large-cnn summarizer and facebook/bart-large-mnli zero-shot classifier remain among the most downloaded encoder-decoder checkpoints on Hugging Face, drawing roughly 1.56 million and 3.23 million downloads per month respectively ^[8]^[9].

BART was particularly influential as a stepping stone between encoder-only and decoder-only pretraining. Earlier work had shown that encoder-only masked language models like BERT excelled at classification and span extraction, while decoder-only causal models like the original GPT excelled at open-ended text generation, but neither family was ideal for sequence-to-sequence tasks like summarization, abstractive question answering, and machine translation. BART showed that a single denoising autoencoder could match the strongest encoder-only models on the GLUE and SQuAD benchmarks while simultaneously setting new state-of-the-art results on abstractive summarization (CNN/DailyMail and XSum), abstractive dialogue (ConvAI2), and abstractive question answering (ELI5) ^[1].

A follow-up paper by Liu et al. (2020) introduced a multilingual extension, mBART, which applied the BART denoising recipe to large monolingual corpora in 25 languages (and later 50 languages in mBART-50), producing one of the first general-purpose pretrained sequence-to-sequence models for low-resource and unsupervised machine translation ^[3]. BART and its variants were rapidly absorbed into the Hugging Face Transformers library. Although BART has since been largely superseded by T5 and the Flan-T5 family for generic encoder-decoder pretraining, and by decoder-only large language model families for general text generation, fine-tuned BART variants (especially facebook/bart-large-cnn for news summarization and facebook/bart-large-mnli for zero-shot text classification) remain in active production use.

What problem was BART designed to solve?

The two years preceding BART had been dominated by two distinct pretraining recipes for transformer language models. The first, exemplified by BERT (Devlin et al., 2018), used an encoder-only architecture trained with masked language modeling: roughly 15 percent of input tokens were replaced with a [MASK] symbol, and the model was trained to predict the original tokens from surrounding context ^[4]. The masked language modeling objective produced exceptionally strong representations for classification and span-extraction tasks but was awkward for generation, because the model never learned to produce a coherent left-to-right sequence.

The second recipe, exemplified by the original GPT (Radford et al., 2018) and later GPT-2 and GPT-3, used a decoder-only architecture trained with a standard left-to-right language modeling objective. The decoder produced fluent text autoregressively but, because it could only attend to past tokens, its representations were generally weaker on tasks requiring bidirectional context. Several subsequent attempts tried to bridge the two camps. UniLM combined three attention masks within a single transformer. XLNet introduced permutation language modeling. MASS trained a sequence-to-sequence model to reconstruct masked spans. Each improved on either understanding or generation, but none simultaneously matched the best encoder-only model on understanding tasks while also setting new state-of-the-art results on generation tasks.

The BART authors argued that the most natural way to unify the two recipes was to build a standard sequence-to-sequence transformer (a bidirectional encoder coupled with a causal decoder) and train it as a denoising autoencoder. Because the encoder is bidirectional, the model can learn rich contextual representations of the input. Because the decoder is autoregressive, the model can also generate fluent output text. Because the corruption function applied to the input is arbitrary, the same architecture supports any noising scheme one might want to study. This generality let the BART team systematically compare pretraining objectives within a single fixed architecture.

How is BART's architecture structured?

BART uses a standard transformer encoder-decoder architecture, almost identical to the original transformer of Vaswani et al. (2017), with two minor modifications: following GPT, the ReLU activations in the feed-forward sublayers are replaced with GeLU (Gaussian Error Linear Units), and the model parameters are initialized from the normal distribution N(0, 0.02). The paper also contrasts the architecture with BERT in two respects. First, each layer of the BART decoder additionally performs cross-attention over the final hidden layer of the encoder, exactly as in the original transformer for translation. Second, BERT applies an additional feed-forward network before predicting masked tokens, whereas BART has no such extra network before word prediction.
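The activation swap and the initialization can be illustrated with a few lines of plain Python. This is only a sketch: the function names are made up for illustration, the GeLU shown is the exact form x·Φ(x) (implementations sometimes use an approximation), and treating 0.02 as a standard deviation is an assumption, since the notation N(0, 0.02) does not say whether it is a variance or a standard deviation.

```python
import math
import random

def relu(x: float) -> float:
    """Rectified linear unit, the activation used in the original transformer."""
    return max(0.0, x)

def gelu(x: float) -> float:
    """Exact GeLU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def init_weights(n: int, std: float = 0.02, seed: int = 0) -> list[float]:
    """Draw n weights from a zero-mean normal (std=0.02 is an assumption)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(n)]

# Unlike ReLU, GeLU lets small negative inputs through with a small negative output.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```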

BART's encoder is bidirectional: every token can attend to every other token in the input. Its decoder is causal: each token can attend only to itself, all earlier decoder tokens, and the full encoder hidden state. Apart from the tied token embeddings, the encoder and decoder stacks do not share parameters. The paper reports that BART has roughly 10 percent more parameters than an equivalently sized BERT model; the extra parameters come mainly from the cross-attention sublayer that each decoder layer carries and an encoder-only stack does not need.
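The three attention patterns described above can be written out as boolean masks. The sketch below is illustrative only (the helper names are hypothetical, and real implementations use tensors rather than nested lists); `True` means "row position may attend to column position".

```python
def encoder_mask(n: int) -> list[list[bool]]:
    """Bidirectional: every input position may attend to every input position."""
    return [[True] * n for _ in range(n)]

def decoder_self_mask(m: int) -> list[list[bool]]:
    """Causal: decoder position i may attend to positions 0..i only."""
    return [[j <= i for j in range(m)] for i in range(m)]

def cross_attention_mask(m: int, n: int) -> list[list[bool]]:
    """Each of the m decoder positions sees all n encoder positions."""
    return [[True] * n for _ in range(m)]

def show(mask: list[list[bool]]) -> str:
    """Render a mask with 'x' for allowed and '.' for blocked."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)

print(show(decoder_self_mask(4)))
# x...
# xx..
# xxx.
# xxxx
```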

The two principal released checkpoints have the following dimensions:

Variant	Encoder layers	Decoder layers	Hidden size	Attention heads	Feed-forward size	Total parameters
BART-base	6	6	768	12	3,072	approximately 140 million
BART-large	12	12	1,024	16	4,096	approximately 406 million

Both variants use a tied input/output embedding matrix, the same byte-pair encoding (BPE) vocabulary as GPT-2 (about 50,000 subword tokens), and a maximum input length of 1,024 tokens. In total depth the two models line up with BERT: BART-base's 6 encoder and 6 decoder layers match the 12 layers of BERT-base, and BART-large's 12 and 12 match the 24 layers of BERT-large, so parameter counts are roughly comparable even though BART carries a full decoder stack ^[1].
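The totals in the table can be approximately recovered from the listed dimensions. The estimate below is a back-of-the-envelope sketch under stated assumptions: a vocabulary of 50,265 entries, 1,024 learned positions for each of encoder and decoder, biases on every linear layer, one layer norm per sublayer plus one after each embedding, and tied embeddings counted once. None of these details come from the table itself, so small deviations from the released checkpoints are expected.

```python
def attention_params(d: int) -> int:
    """Query, key, value and output projections, each with a bias."""
    return 4 * (d * d + d)

def ffn_params(d: int, ff: int) -> int:
    """Two linear layers (d -> ff -> d) with biases."""
    return (d * ff + ff) + (ff * d + d)

def layer_norm_params(d: int) -> int:
    """Scale and shift vectors."""
    return 2 * d

def bart_params(layers: int, d: int, ff: int,
                vocab: int = 50_265, max_pos: int = 1_024) -> int:
    """Rough total for `layers` encoder layers plus `layers` decoder layers."""
    enc_layer = attention_params(d) + ffn_params(d, ff) + 2 * layer_norm_params(d)
    # A decoder layer has self-attention AND cross-attention, hence the factor 2.
    dec_layer = 2 * attention_params(d) + ffn_params(d, ff) + 3 * layer_norm_params(d)
    embeddings = vocab * d              # tied, so counted once
    positions = 2 * max_pos * d         # encoder and decoder position tables
    embed_norms = 2 * layer_norm_params(d)
    return layers * (enc_layer + dec_layer) + embeddings + positions + embed_norms

def cross_attention_share(layers: int, d: int, ff: int) -> float:
    """Fraction of all parameters that sit in decoder cross-attention."""
    return layers * attention_params(d) / bart_params(layers, d, ff)

print(f"BART-base  ~ {bart_params(6, 768, 3072) / 1e6:.1f}M")    # 139.4M
print(f"BART-large ~ {bart_params(12, 1024, 4096) / 1e6:.1f}M")  # 406.3M
```

Under these assumptions, `cross_attention_share` comes out at roughly a tenth of the total for both sizes, which gives a feel for what the decoder's cross-attention adds on top of a plain stack of the same depth.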

How is BART pretrained as a denoising task?

The central idea of BART is that pretraining is a denoising task: a corruption function is applied to a chunk of text, the corrupted version is fed to the encoder, and the decoder is trained to autoregressively produce the original uncorrupted text. The training objective is the cross-entropy loss between the decoder's predictions and the original tokens, summed across all positions. This is identical to the standard sequence-to-sequence training objective used in neural machine translation, with the corrupted text playing the role of the source sentence and the original document playing the role of the target.
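As a toy illustration of that objective, the sketch below computes the summed negative log-likelihood of the original tokens under teacher forcing. The start symbol, the token strings, and the per-step probabilities are all invented for the example; a real model would produce a full distribution over the vocabulary at every step, conditioned on the encoded corrupted text.

```python
import math

BOS = "<s>"  # placeholder start symbol, chosen for illustration

def shift_right(target: list[str]) -> list[str]:
    """Decoder input under teacher forcing: start symbol, then all but the last target token."""
    return [BOS] + target[:-1]

def reconstruction_loss(step_probs: list[dict[str, float]], target: list[str]) -> float:
    """Sum over positions of -log p(original token | earlier original tokens, corrupted input)."""
    if len(step_probs) != len(target):
        raise ValueError("need one predicted distribution per target position")
    return -sum(math.log(probs[token]) for probs, token in zip(step_probs, target))

target = ["the", "cat", "sat", "</s>"]
# Hypothetical decoder outputs; only the probability of the correct token is listed.
step_probs = [{"the": 0.9}, {"cat": 0.5}, {"sat": 0.8}, {"</s>": 1.0}]

print(shift_right(target))   # ['<s>', 'the', 'cat', 'sat']
print(reconstruction_loss(step_probs, target))
```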

A major contribution of the BART paper was a controlled empirical comparison of several different corruption functions, all trained at the same scale and evaluated on the same downstream tasks. Five noising functions were studied:

Noising function	Description	Effect
Token masking	Replace random tokens (about 15%) with a special `[MASK]` symbol, as in BERT.	Forces the model to predict the original token from surrounding context, but does not require predicting the number of tokens missing from a span.
Token deletion	Delete random tokens from the input entirely.	The model must decide both which positions are missing and what tokens belong there. Deletion implicitly tests positional reasoning.
Text infilling	Sample a number of text spans whose lengths are drawn from a Poisson distribution with mean 3. Replace each span with a single `[MASK]` symbol. Zero-length spans correspond to insertions of a `[MASK]` between tokens.	Generalizes token masking (length-1 spans) and adds spurious mask insertions (length-0 spans). The model must predict the missing tokens and the number of missing tokens.
Sentence permutation	Split the document on full stops and shuffle the resulting sentences into a random order.	Forces the model to learn document-level coherence and reordering, useful for tasks that depend on discourse structure.
Document rotation	Pick a token uniformly at random and rotate the document so it begins with that token.	Trains the model to identify the true beginning of the document.
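Four of the five noising functions in the table are simple enough to sketch in a few lines each. The versions below operate on whitespace tokens and a caller-supplied random generator; the function names, the corruption probability `p`, and the tokenization are illustrative assumptions rather than the paper's implementation (which works on subword tokens).

```python
import random

MASK = "[MASK]"

def token_masking(tokens: list[str], p: float, rng: random.Random) -> list[str]:
    """Replace each token with MASK independently with probability p."""
    return [MASK if rng.random() < p else t for t in tokens]

def token_deletion(tokens: list[str], p: float, rng: random.Random) -> list[str]:
    """Drop each token independently with probability p; no placeholder is left behind."""
    return [t for t in tokens if rng.random() >= p]

def sentence_permutation(text: str, rng: random.Random) -> str:
    """Split on full stops and shuffle the resulting sentences."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return " ".join(s + "." for s in sentences)

def document_rotation(tokens: list[str], rng: random.Random) -> list[str]:
    """Rotate the document so it begins at a uniformly chosen token."""
    if not tokens:
        return []
    k = rng.randrange(len(tokens))
    return tokens[k:] + tokens[:k]

rng = random.Random(0)
doc = "The cat sat. It purred. Then it slept."
print(token_masking(doc.split(), 0.3, rng))
print(token_deletion(doc.split(), 0.3, rng))
print(sentence_permutation(doc, rng))
print(document_rotation(doc.split(), rng))
```

Note the practical difference between the first two: after masking, the output has the same length as the input, so the positions of the missing tokens are visible; after deletion, the model has to infer where tokens were removed.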

The authors compared these five functions, alone and in combination, on a battery of downstream tasks (SQuAD, MNLI, ELI5, XSum, ConvAI2, and CNN/DailyMail). The headline finding was that text infilling performed most consistently well across the suite of tasks, that sentence permutation contributed additional gains mainly on CNN/DailyMail summarization, and that document rotation and sentence permutation performed poorly when used on their own; document rotation was excluded from the final recipe. The authors report that they found "the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token" ^[1]. Token masking is a special case of text infilling with length-1 spans, while token deletion remains a distinct scheme in which no placeholder marks the missing positions. The final BART pretraining recipe used a combination of text infilling (with about 30 percent of tokens corrupted) and full sentence permutation ^[1].
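The final recipe (text infilling over roughly 30 percent of tokens, plus full sentence permutation) can be sketched as follows. This is a simplified illustration, not the released training code: it uses whitespace tokens, a hand-written Poisson sampler, and a crude span-start probability (`mask_ratio / lam`) chosen only so that spans tend to use up the masking budget. All names are hypothetical.

```python
import math
import random

MASK = "[MASK]"

def poisson(lam: float, rng: random.Random) -> int:
    """Sample from Poisson(lam) with Knuth's multiplication method."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def text_infilling(tokens, mask_ratio=0.3, lam=3.0, rng=None):
    """Replace Poisson-length spans with a single MASK each, hiding at most mask_ratio of tokens."""
    rng = rng or random.Random()
    budget = round(mask_ratio * len(tokens))   # tokens that may still be hidden
    start_prob = mask_ratio / lam              # heuristic rate of span starts
    out, i = [], 0
    while i < len(tokens):
        if budget > 0 and rng.random() < start_prob:
            length = min(poisson(lam, rng), budget, len(tokens) - i)
            out.append(MASK)                   # one mask per span, whatever its length
            budget -= length
            i += length                        # length 0 inserts a mask and hides nothing
        else:
            out.append(tokens[i])
            i += 1
    return out

def bart_style_corrupt(text: str, rng: random.Random) -> list[str]:
    """Shuffle sentences (split on full stops), then apply text infilling."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return text_infilling(" ".join(sentences).split(), rng=rng)

rng = random.Random(0)
doc = "The cat sat on the mat. It purred loudly. Then it fell asleep in the sun."
print(" ".join(bart_style_corrupt(doc, rng)))
```

Because each span collapses to one mask, the corrupted input does not reveal how many tokens are missing, which is the property that distinguishes text infilling from per-token masking.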

BART-large was pretrained on the same 160 GB of text used for RoBERTa, a curated mixture of news articles (CC-NEWS), books (BookCorpus), web text (OpenWebText), and stories (CC-Stories). Training used the Adam optimizer, a batch size of 8,000 sequences, and a similar schedule to RoBERTa, running for about 500,000 update steps. The training compute was therefore comparable to RoBERTa, allowing direct comparison of the two pretraining objectives at a controlled scale.

How is BART fine-tuned for downstream tasks?

A key advantage of BART over encoder-only or decoder-only models is that it can be fine-tuned for a wide range of downstream tasks with minimal architectural surgery. The paper proposed several adaptation patterns, all of which involve loading the pretrained encoder and decoder and then fine-tuning the entire model on labeled data for the target task.

Sequence classification. For tasks like GLUE, the same input is fed to both the encoder and the decoder, and the final hidden state of a special end-of-sequence token in the decoder is passed to a small classification head. This approach lets the decoder attend to the full input via cross-attention while still producing a single representation for classification.
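A minimal sketch of that classification pattern, with plain lists standing in for tensors: the decoder's hidden state at the final end-of-sequence token is selected and passed through a linear head. The token ids, hidden states, weights, and the `eos_id` value are toy values invented for the example.

```python
def classify(decoder_states: list[list[float]], token_ids: list[int], eos_id: int,
             weights: list[list[float]], bias: list[float]) -> list[float]:
    """Return class logits computed from the decoder state at the last end-of-sequence token."""
    eos_positions = [i for i, t in enumerate(token_ids) if t == eos_id]
    if not eos_positions:
        raise ValueError("input must contain an end-of-sequence token")
    pooled = decoder_states[eos_positions[-1]]
    # Linear head: one row of weights and one bias per class.
    return [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weights, bias)]

# Toy example: 3 positions, hidden size 2, 3 classes.
states = [[1.0, 0.0], [0.0, 1.0], [2.0, 3.0]]
ids = [5, 6, 2]                      # 2 plays the role of the end-of-sequence id
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 1.0]
print(classify(states, ids, 2, W, b))   # [2.0, 3.0, 6.0]
```

Because the decoder is causal, the last position is the only one whose state has attended to the entire decoder input, which is why the representation is taken there.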

Token classification. For span-level tasks like SQuAD answer extraction, the decoder is again fed the full input, and the top hidden state at each position is used to predict the label of the corresponding input token.

Sequence generation. For abstractive tasks like summarization, dialogue, and abstractive question answering, fine-tuning is identical to pretraining: the input document is fed to the encoder, and the decoder is trained autoregressively to produce the target sequence.

Machine translation. For translation into English, the entire BART model (encoder and decoder together) can be used as a single pretrained decoder: BART's encoder embedding layer is replaced with a small randomly initialized encoder that maps the source language into an input the BART encoder can denoise into English. Training proceeds in two steps. The new encoder is trained first while most of BART's parameters are frozen, then the entire system is fine-tuned together. This approach uses BART as a strong pretrained denoising prior over English text.

What benchmark results did BART achieve?

BART-large was evaluated on a wide range of natural language understanding and generation benchmarks. The results below are from the original paper and represent the state of the art at the time of publication in 2019 and early 2020 ^[1].

Discriminative tasks (GLUE and SQuAD)

On discriminative tasks, BART performed on par with RoBERTa, the strongest encoder-only model at the time, despite using a sequence-to-sequence rather than encoder-only architecture. On the GLUE benchmark, BART achieved an average score competitive with RoBERTa across the eight subtasks. On SQuAD 1.1, BART reached an exact-match score of 88.8 and an F1 score of 94.6, comparable to RoBERTa's 88.9 EM and 94.6 F1. The result demonstrated that adding an autoregressive decoder, and the gains on generation that come with it, did not hurt performance on classic understanding tasks.

Abstractive summarization

BART set new state-of-the-art results on two widely used abstractive summarization benchmarks. On CNN/DailyMail, a dataset of news articles paired with multi-sentence highlight summaries, BART-large achieved ROUGE-1 of 44.16, ROUGE-2 of 21.28, and ROUGE-L of 40.90, improving on the previous best system by about 1 ROUGE point. On XSum, a more aggressively abstractive single-sentence summarization dataset built from BBC articles, BART achieved ROUGE-1 of 45.14, ROUGE-2 of 22.27, and ROUGE-L of 37.25, improving on the previous best by roughly 6 ROUGE points (a much larger margin than on CNN/DailyMail, because XSum favors more abstractive systems).
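For readers unfamiliar with the metric, ROUGE-N measures n-gram overlap between a candidate summary and a reference. The function below is a simplified sketch of the F1 variant (lowercased whitespace tokens, no stemming, a single reference); official scoring tools differ in these details, so it will not reproduce the published numbers exactly. It does not cover ROUGE-L, which is based on the longest common subsequence.

```python
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate: str, reference: str, n: int = 1) -> float:
    """F1 of clipped n-gram overlap between candidate and reference (range 0..1)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())   # Counter intersection clips repeated n-grams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

candidate = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(round(100 * rouge_n(candidate, reference, 1), 2))   # 83.33
print(round(100 * rouge_n(candidate, reference, 2), 2))   # 60.0
```

Scores are conventionally reported multiplied by 100, as in the figures above.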

Abstractive question answering and dialogue

On ELI5, a long-form question-answering dataset built from the Explain Like I'm Five subreddit, BART achieved a ROUGE-L of 24.3, improving on the previous best system by about 1.2 ROUGE-L points. On ConvAI2, a persona-based dialogue benchmark, BART outperformed the previous systems it was compared against on both automated metrics, obtaining lower perplexity and higher unigram F1.

Machine translation

For machine translation, BART was tested on WMT16 Romanian-English translation, with a small randomly initialized encoder placed in front of the pretrained BART model. The system improved on a strong back-translation baseline by 1.1 BLEU points, demonstrating that BART pretraining transferred meaningfully even to a task that did not directly resemble its denoising objective.

What are the main BART variants and derivatives?

The BART recipe spawned a small ecosystem of derivative models, several of which are still widely used in production systems as of 2026.

Model	Released	Parameters	Notes
BART-base	2019	approximately 140M	6+6 encoder/decoder layers, hidden 768. Used for low-latency fine-tuning.
BART-large	2019	approximately 406M	12+12 encoder/decoder layers, hidden 1024. The flagship model.
`facebook/bart-large-cnn`	2019	approximately 406M	BART-large fine-tuned on CNN/DailyMail. The most widely deployed pretrained summarizer on Hugging Face, drawing roughly 1.56 million downloads per month ^[8].
`facebook/bart-large-xsum`	2019	approximately 406M	BART-large fine-tuned on XSum for one-sentence abstractive summarization.
`facebook/bart-large-mnli`	2019	approximately 406M	BART-large fine-tuned on MultiNLI. Heavily used for zero-shot text classification via the natural language inference framing of Yin et al., drawing roughly 3.23 million downloads per month ^[9].
mBART-25	2020	approximately 610M	Multilingual BART pretrained on monolingual corpora in 25 languages. Strong on low-resource and document-level translation.
mBART-50	2020	approximately 610M	Extension of mBART to 50 languages, including additional fine-tuning recipes for many-to-many translation.
DistilBART variants	2020	approximately 200-300M	Hugging Face distillations of BART summarizers using "shrink and fine-tune" (SFT).
BARTScore	2021	n/a	Not a model variant per se; a popular metric for text generation that uses BART log-likelihoods to score candidate outputs.
PLBART	2021	approximately 140M	Programming Language BART, pretrained on source code in addition to text, for code summarization and translation tasks.

DistilBART was particularly important for production deployments. Shleifer and Rush, working at Hugging Face, showed that one could keep all 12 encoder layers of BART-large and reduce the decoder to 6 or 3 layers (or apply an analogous reduction to the encoder), initializing the smaller student from a subset of the fine-tuned teacher's layers and then fine-tuning it on the same target task. This is the "shrink and fine-tune" recipe, which the paper compares against pseudo-labeling and knowledge distillation from the teacher's outputs ^[5]. The resulting models had substantially fewer parameters (approximately 200-300M against 406M) while retaining most of the teacher's quality, making them cheaper to serve in production.
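One simple way to decide which teacher layers survive when a 12-layer decoder is shrunk to 6 or 3 layers is to pick evenly spaced indices that always include the first and last layer. The helper below sketches that heuristic; it is an illustrative assumption, and the layer choices used for the released DistilBART checkpoints may differ.

```python
def pick_layers(teacher_layers: int, student_layers: int) -> list[int]:
    """Evenly spaced teacher layer indices (0-based) to copy into a smaller student."""
    if not 1 <= student_layers <= teacher_layers:
        raise ValueError("student must have between 1 and teacher_layers layers")
    if student_layers == 1:
        return [0]
    step = (teacher_layers - 1) / (student_layers - 1)
    return [round(i * step) for i in range(student_layers)]

print(pick_layers(12, 6))   # [0, 2, 4, 7, 9, 11]
print(pick_layers(12, 3))   # [0, 6, 11]
```

After copying the selected layers, the student is fine-tuned on the downstream task as usual; no change to the training loss is needed for the shrink-and-fine-tune variant.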

mBART, introduced by Liu et al. (2020), pretrained the same architecture on a 1.4 TB multilingual corpus assembled from Common Crawl (CC25), spanning 25 languages drawn from major language families including Romance, Germanic, Slavic, Sino-Tibetan, Indo-Aryan, Semitic, and Japonic ^[3]. Compared with prior monolingual or encoder-only multilingual pretraining, mBART produced especially strong gains on low-resource translation, with improvements of up to 12 BLEU points over a strong baseline on language pairs with under 10 million sentence pairs of supervision. mBART was later extended to mBART-50 with an additional 25 languages and adapted into a many-to-many multilingual translation model.

How does BART differ from BERT, GPT, and T5?

BART occupies a distinct position in the early history of pretrained transformer language models, sitting between encoder-only and decoder-only systems. The table below compares BART with its most direct contemporaries.

Property	BART	BERT	GPT-3	T5
Architecture	Bidirectional encoder + autoregressive decoder	Bidirectional encoder only	Autoregressive decoder only	Bidirectional encoder + autoregressive decoder
Lab	Facebook AI Research	Google Research	OpenAI	Google Research
Released	October 2019	October 2018	May/June 2020	October 2019
Pretraining objective	Denoising autoencoder (text infilling + sentence permutation)	Masked language modeling + next sentence prediction	Causal (left-to-right) language modeling	Denoising span corruption (replace spans with sentinel tokens)
Largest released parameter count	406M (BART-large)	340M (BERT-large)	175B (GPT-3)	11B (T5-11B)
Pretraining data	160 GB (same as RoBERTa)	16 GB (Wikipedia + BookCorpus)	570 GB filtered web	750 GB (C4 corpus)
Strength on classification	Strong (matches RoBERTa)	Strong	Weak (only via in-context learning)	Strong
Strength on generation	Very strong (state of the art on summarization at release)	Weak (no autoregressive decoder)	Very strong (open-ended)	Very strong
Designed primary use	Sequence-to-sequence tasks	Discriminative tasks	Few-shot in-context learning	Unified text-to-text
Tokenizer	GPT-2 BPE (about 50K)	WordPiece (about 30K)	GPT-2 BPE (about 50K)	SentencePiece (about 32K)

The key conceptual difference between BART and T5 is the framing of pretraining. BART is a denoising autoencoder: the input is a corrupted text, the target is the original text, and the loss is computed only on the target sequence. T5, in contrast, is framed as a unified text-to-text problem: every task (including pretraining) is cast as mapping an input string to an output string with task-specific prefixes such as summarize: or translate English to German:. The pretraining corruption in T5 also differs in detail: T5 replaces spans with unique sentinel tokens (<extra_id_0>, <extra_id_1>, etc.) and trains the decoder to emit the missing spans separated by the same sentinels, rather than reconstructing the entire original document.
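The difference in input/target formatting is easiest to see side by side. The sketch below corrupts the same sentence at the same spans in both styles; the span positions and helper names are chosen purely for illustration, `[MASK]` stands in for BART's mask token, and the closing sentinel at the end of the T5 target follows the convention in the T5 paper.

```python
def corrupt(tokens, spans, placeholder):
    """Replace each (start, end) span with placeholder(k); spans are sorted, non-overlapping, end-exclusive."""
    out, pos = [], 0
    for k, (start, end) in enumerate(spans):
        out += tokens[pos:start] + [placeholder(k)]
        pos = end
    return out + tokens[pos:]

def bart_example(tokens, spans):
    """BART style: one generic mask per span; the target is the entire original text."""
    source = corrupt(tokens, spans, lambda k: "[MASK]")
    return " ".join(source), " ".join(tokens)

def t5_example(tokens, spans):
    """T5 style: a unique sentinel per span; the target lists only the dropped spans."""
    source = corrupt(tokens, spans, lambda k: f"<extra_id_{k}>")
    target = []
    for k, (start, end) in enumerate(spans):
        target += [f"<extra_id_{k}>"] + tokens[start:end]
    target.append(f"<extra_id_{len(spans)}>")   # closing sentinel
    return " ".join(source), " ".join(target)

tokens = "Thank you for inviting me to your party last week".split()
spans = [(2, 4), (8, 9)]
print(bart_example(tokens, spans))
print(t5_example(tokens, spans))
```

The BART target is as long as the original document, while the T5 target is only as long as the removed spans plus their sentinels, which makes T5's pretraining targets considerably shorter.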

The key difference between BART and BERT is that BART's autoregressive decoder makes it natural to fine-tune for free-form generation, while BERT requires either a separately trained decoder or awkward iterative procedures for generation. Conversely, BART has a small parameter overhead (roughly 10 percent) relative to an equivalently sized BERT, which is the price of carrying around a decoder with cross-attention.

The key difference between BART and GPT-style decoder-only models is the bidirectional encoder. Bidirectional attention over the input gives BART much stronger representations for tasks where understanding the full input precisely is critical (classification, span extraction, summarization of long documents). Decoder-only models only acquired comparable performance on these tasks much later, after scaling to tens or hundreds of billions of parameters and adopting in-context learning.

Reception and influence

BART was published as an arXiv preprint in October 2019 and accepted to ACL 2020, where it drew wide attention; it has since accumulated tens of thousands of citations. The paper's most cited contribution outside of the BART model itself was the controlled comparison of pretraining objectives, which became an experimental template for many follow-up papers studying span corruption, denoising, and self-supervised learning. The finding that text infilling outperformed simpler masking schemes paralleled the span-corruption objective of T5, which appeared the same month, and informed many subsequent encoder-decoder systems.

The model was integrated into the Hugging Face Transformers library shortly after release and quickly became one of the most downloaded checkpoints on the hub. The facebook/bart-large-cnn checkpoint draws roughly 1.56 million downloads per month and remains a default summarization model in many applied NLP pipelines, particularly for news, legal, and medical document summarization ^[8]. The facebook/bart-large-mnli checkpoint is similarly central to many zero-shot text classification systems, drawing roughly 3.23 million downloads per month, using the natural language inference framing in which a candidate label is converted into a hypothesis and the input text is treated as the premise ^[9].
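The natural language inference framing can be sketched without loading a model. Each candidate label is turned into a hypothesis, paired with the input text as the premise, and the per-label entailment scores are normalized to rank the labels. In the example below the template string is an assumption (it mirrors a common default), and the entailment logits are made-up numbers standing in for the output of an NLI model such as facebook/bart-large-mnli.

```python
import math

def build_pairs(text: str, labels: list[str],
                template: str = "This example is {}.") -> list[tuple[str, str]]:
    """One (premise, hypothesis) pair per candidate label."""
    return [(text, template.format(label)) for label in labels]

def rank_labels(labels: list[str], entailment_logits: list[float]) -> list[tuple[str, float]]:
    """Softmax over the per-label entailment logits, sorted from most to least likely."""
    peak = max(entailment_logits)
    exps = [math.exp(z - peak) for z in entailment_logits]   # subtract peak for stability
    total = sum(exps)
    scored = zip(labels, (e / total for e in exps))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

text = "The striker scored twice in the final minutes."
labels = ["sports", "politics", "cooking"]
for premise, hypothesis in build_pairs(text, labels):
    print(f"premise={premise!r}  hypothesis={hypothesis!r}")

# Hypothetical entailment logits, one per pair, as an NLI model might return.
print(rank_labels(labels, [4.1, -1.3, -2.0]))
```

This normalization across labels assumes exactly one label applies; for multi-label use, each pair's entailment score would instead be judged on its own.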

BART sits within a broader family of encoder-decoder language models, several of which build directly on it. T5, released the same month, independently pursued a similar denoising philosophy within its unified text-to-text framework and scaled the architecture up to 11 billion parameters. BigBird and Longformer-Encoder-Decoder applied sparse attention patterns to extend BART-style models to long documents. PEGASUS introduced a summarization-specific pretraining objective (gap-sentence generation) and outperformed BART on several news summarization benchmarks. MASS, which predates BART, had already explored masked sequence-to-sequence pretraining, while the MarianMT and OPUS-MT translation families are encoder-decoder transformers trained on parallel data rather than with BART's denoising recipe.

Is BART still relevant in 2026?

By 2026, the broader frontier of language modeling has shifted decisively toward decoder-only large language models trained at scales hundreds or thousands of times larger than BART-large. Models like GPT-4, Claude, Gemini, Llama 3, Mistral, and DeepSeek dominate the discourse on capabilities, and their general-purpose generation quality far exceeds anything BART could produce, especially for instruction following and multi-step reasoning. For new pretraining of encoder-decoder systems, T5, Flan-T5, and UL2 are typically preferred because their unified text-to-text framework integrates more cleanly with instruction-tuning recipes.

Despite this, BART variants remain in active production use. Fine-tuned summarization models like bart-large-cnn and bart-large-xsum have been run on real-world traffic for years, so their behavior, including how often they hallucinate, is well characterized, and they run cheaply on commodity GPUs. The bart-large-mnli checkpoint has become a de facto standard for zero-shot text classification in production systems where an LLM with a few-shot prompt is overkill in latency, cost, or privacy footprint; it remains one of the highest-traffic models on Hugging Face, with roughly 3.23 million downloads per month as of 2026 ^[9]. BART's modest size (406M parameters) also makes it tractable to fine-tune on a single GPU for domain-specific summarization or generation tasks, an attractive operating point for many applied teams. As a foundational reference, BART remains the canonical example of a denoising-autoencoder sequence-to-sequence model, and alongside T5 it helped establish the broader practice of treating pretraining as denoising rather than as next-token prediction.

Limitations and known weaknesses

BART shares many limitations with other transformer-based language models of its generation. The maximum input length of 1,024 tokens is a hard ceiling that limits its ability to summarize long documents (full books, lengthy legal contracts, academic papers) without chunking. The model can hallucinate plausible but incorrect facts, especially in summarization where the training objective rewards fluency more strongly than factuality. It has limited multilingual capability outside of mBART. Its vocabulary is fixed at training time and cannot be extended without retraining. BART's pretraining corpus also reflects the biases of the 160 GB English text mixture used for RoBERTa, and its world knowledge is frozen at the corpus cutoff in mid-2019.
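A common workaround for the 1,024-token ceiling is to split a long document into overlapping windows, summarize each window, and then combine the partial summaries. The helper below sketches only the splitting step; the overlap of 128 tokens is an arbitrary illustrative choice, and in practice the window should be slightly smaller than 1,024 to leave room for special tokens.

```python
def chunk_tokens(tokens: list, max_len: int = 1024, overlap: int = 128) -> list[list]:
    """Split tokens into windows of at most max_len, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break   # this window already reaches the end of the document
    return chunks

document = list(range(2500))           # stand-in for 2,500 token ids
windows = chunk_tokens(document)
print([len(w) for w in windows])       # [1024, 1024, 708]
```

Chunking keeps every window within the model's limit, but the model still never sees the whole document at once, so cross-chunk references can be lost.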

References

1. Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. (2019). *BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension*. arXiv:1910.13461. https://arxiv.org/abs/1910.13461
2. Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. (2020). *BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension*. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://aclanthology.org/2020.acl-main.703/
3. Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M., and Zettlemoyer, L. (2020). *Multilingual Denoising Pre-training for Neural Machine Translation*. Transactions of the Association for Computational Linguistics, 8, 726-742. arXiv:2001.08210. https://arxiv.org/abs/2001.08210
4. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*. arXiv:1810.04805. https://arxiv.org/abs/1810.04805
5. Shleifer, S. and Rush, A. M. (2020). *Pre-trained Summarization Distillation*. arXiv:2010.13002. https://arxiv.org/abs/2010.13002
6. Hugging Face Documentation. *BART Model Documentation*. https://huggingface.co/docs/transformers/model_doc/bart
7. Hugging Face Model Hub. *facebook/bart-large*. https://huggingface.co/facebook/bart-large
8. Hugging Face Model Hub. *facebook/bart-large-cnn*. https://huggingface.co/facebook/bart-large-cnn
9. Hugging Face Model Hub. *facebook/bart-large-mnli*. https://huggingface.co/facebook/bart-large-mnli
10. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. (2019). *RoBERTa: A Robustly Optimized BERT Pretraining Approach*. arXiv:1907.11692. https://arxiv.org/abs/1907.11692
11. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2020). *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer*. Journal of Machine Learning Research, 21(140), 1-67. arXiv:1910.10683. https://arxiv.org/abs/1910.10683
12. Yin, W., Hay, J., and Roth, D. (2019). *Benchmarking Zero-shot Text Classification: Datasets, Evaluation and Entailment Approach*. arXiv:1909.00161. https://arxiv.org/abs/1909.00161
13. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). *Attention Is All You Need*. NeurIPS 2017. arXiv:1706.03762. https://arxiv.org/abs/1706.03762
