Summarization Models
Last reviewed
May 11, 2026
Sources
18 citations
Review status
Source-backed
Revision
v2 · 2,499 words
Summarization models are natural language processing systems that condense a source document, set of documents, or dialogue into a shorter version that preserves the most important information. They sit alongside machine translation and question answering as canonical sequence-to-sequence tasks, and form the technical core of news digests, scientific TLDR generation, meeting recaps, and the executive briefs produced by modern chat assistants.
The field has been reshaped twice in quick succession: first by pretrained encoder-decoder models such as BART and T5 in 2019, and again by general-purpose large language models such as GPT-4, Claude, and Gemini, which produce strong summaries with zero or few examples.
See also: Natural Language Processing Models, Text Summarization.
A summarization system reads an input of length N and produces a much shorter output while keeping it informative, fluent, and faithful to the source. Researchers split the task along three axes.
Extractive versus abstractive. Extractive systems select spans (usually whole sentences) from the source and concatenate them. Abstractive systems generate new text that may paraphrase, fuse, or reorder content. Extractive output stays grounded but reads choppily; abstractive output is smoother and shorter, with a higher risk of hallucination.
Single versus multi-document. Single-document summarization condenses one input, the standard setting for news articles or scientific papers. Multi-document summarization fuses several related inputs and adds the problems of redundancy removal and cross-document coreference.
Generic versus query- or aspect-focused. A generic summary covers the most salient content. A query-focused summary answers a user question against the source; an aspect-focused summary picks out a specific facet, such as only the methods of a research paper.
Before neural networks dominated the area, extractive summarization was the practical default. Two graph-based algorithms published in 2004 became long-running baselines: TextRank by Rada Mihalcea and Paul Tarau, and LexRank by Gunes Erkan and Dragomir Radev. Both build a graph in which sentences are nodes and edges carry a similarity weight, then rank nodes with an iterative algorithm in the style of PageRank to pick the most central sentences. These methods require no training data and remain useful when annotated corpora are unavailable.
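The graph-ranking idea can be illustrated with a minimal sketch. It assumes a TextRank-style word-overlap similarity normalized by log sentence lengths and a PageRank-style power iteration; the regex tokenizer, the damping value of 0.85, the fixed iteration count, and all function names are illustrative choices, not details taken from either paper's released code.

```python
import math
import re


def tokenize(sentence):
    """Lowercased word set; a simple stand-in for real preprocessing."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(a, b):
    """Shared-word count normalized by the log lengths of both sentences."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # avoids log(1) = 0 in the denominator
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences by weighted PageRank over the similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    words = [tokenize(s) for s in sentences]
    weights = [
        [similarity(words[i], words[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to edge weight j -> i.
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            updated.append((1 - damping) / n + damping * incoming)
        scores = updated
    return scores


def extract_summary(sentences, k=2):
    """Return the k most central sentences, restored to document order."""
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares no words with the rest of the document receives only the baseline score, so it is never selected ahead of connected sentences.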
The first wave of neural summarization adapted the encoder-decoder framework from machine translation, with a recurrent encoder reading the source and a recurrent decoder generating tokens while attending over the encoder states. The decisive refinement was the pointer-generator network of Abigail See, Peter J. Liu, and Christopher D. Manning in 2017, which augments the decoder with a copy mechanism that can generate a vocabulary word or copy a token from the source. A coverage loss discourages repetition. On CNN/DailyMail it beat earlier abstractive baselines by at least 2 ROUGE points and remained competitive until pretrained transformers arrived.
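The copy mechanism amounts to mixing two probability distributions at each decoder step. The sketch below assumes the usual formulation, in which a scalar generation probability weights the vocabulary distribution and the remaining mass is spread over source tokens according to attention. The dictionaries, the example tokens, and the function names are hypothetical, and no neural network is involved.

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix generate and copy probabilities for one decoder step.

    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * (attention summed over
    the source positions that hold w).
    """
    mixed = {word: p_gen * prob for word, prob in vocab_dist.items()}
    for token, weight in zip(source_tokens, attention):
        # Source-only words (out of vocabulary) still receive copy probability.
        mixed[token] = mixed.get(token, 0.0) + (1.0 - p_gen) * weight
    return mixed


def coverage_loss(attention, coverage):
    """Sum of min(a_i, c_i), where c_i is attention accumulated over earlier steps."""
    return sum(min(a, c) for a, c in zip(attention, coverage))


# Toy step: the source word "zidane" is not in the vocabulary but is heavily attended.
vocab_dist = {"the": 0.5, "team": 0.3, "won": 0.2}
source_tokens = ["zidane", "scored", "the", "goal"]
attention = [0.7, 0.1, 0.1, 0.1]
step = final_distribution(0.4, vocab_dist, attention, source_tokens)
```

In this toy step the out-of-vocabulary source word ends up with the highest probability, which is the behaviour the copy mechanism is meant to enable. The coverage term grows only when the decoder attends again to positions it has already covered.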
In 2019 three encoder-decoder transformers established the modern paradigm. BART, from Mike Lewis and colleagues at Facebook AI, pretrains a bidirectional encoder and an autoregressive decoder to reconstruct text corrupted by token masking, deletion, sentence shuffling, and span infilling; fine-tuned on CNN/DailyMail and XSum it set new state-of-the-art ROUGE scores. T5, from Colin Raffel and colleagues at Google, cast every NLP task as text-to-text and pretrained on the C4 corpus at scales up to 11 billion parameters. PEGASUS, from Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu at Google, introduced a gap-sentence-generation objective specifically aimed at summarization: important sentences are masked and the model learns to regenerate them. PEGASUS achieved state-of-the-art results across 12 summarization datasets and strong few-shot performance with as few as 1,000 labelled examples.
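A simplified sketch of how a gap-sentence training pair might be built follows. It assumes that sentence importance is approximated by unigram-overlap F1 between a sentence and the rest of the document, used here as a stand-in for ROUGE-1. The mask string `<mask_1>`, the `num_masked` parameter, and the function names are placeholders for illustration rather than the released implementation.

```python
import re
from collections import Counter

MASK = "<mask_1>"  # placeholder mask token, chosen for illustration


def unigrams(text):
    """Bag of lowercased word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def f1_overlap(candidate, reference):
    """Unigram-overlap F1 between two bags of words."""
    overlap = sum((candidate & reference).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(candidate.values())
    recall = overlap / sum(reference.values())
    return 2 * precision * recall / (precision + recall)


def gap_sentence_example(sentences, num_masked=1):
    """Mask the sentences most similar to the rest of the document.

    Returns (source, target): the document with masked sentences replaced
    by MASK, and the masked sentences concatenated as the target.
    """
    scores = []
    for i, sentence in enumerate(sentences):
        rest = " ".join(s for j, s in enumerate(sentences) if j != i)
        scores.append(f1_overlap(unigrams(sentence), unigrams(rest)))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:num_masked])
    source = " ".join(MASK if i in chosen else s for i, s in enumerate(sentences))
    target = " ".join(s for i, s in enumerate(sentences) if i in chosen)
    return source, target
```

The resulting pair resembles a summarization example: the model sees a document with its most representative sentence removed and must produce that sentence.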
Standard transformer self-attention scales quadratically with sequence length, which prevented 2019-era models from processing documents much longer than 1,024 tokens without truncation. A series of architectures, including Longformer/LED, BigBird, LongT5, and PEGASUS-X, introduced sparse attention patterns to push the input limit further.
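To make the quadratic growth concrete, the sketch below counts query-key pairs under full attention and under a simple sliding-window pattern. The symmetric window and the function names are assumptions for illustration; global and random attention, which some of these architectures add, are not modelled.

```python
def full_attention_pairs(n):
    """Every token attends to every token: n squared query-key pairs."""
    return n * n


def sliding_window_pairs(n, window):
    """Each token attends to itself and to neighbours within window // 2
    positions on either side, clipped at the sequence boundaries."""
    half = window // 2
    return sum(min(n - 1, i + half) - max(0, i - half) + 1 for i in range(n))


def sliding_window_mask(n, window):
    """Boolean n x n mask; True where query i may attend to key j."""
    half = window // 2
    return [[abs(i - j) <= half for j in range(n)] for i in range(n)]


short, long = 1024, 16384
growth_full = full_attention_pairs(long) // full_attention_pairs(short)
pairs_windowed = sliding_window_pairs(long, 512)
```

Going from 1,024 to 16,384 tokens multiplies the length by 16 but multiplies the full-attention pair count by 256, whereas the windowed count grows roughly linearly with length.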
The arrival of GPT-3 in 2020 and its instruction-tuned successors made zero-shot summarization a routine chatbot capability. By 2023 papers were reporting that GPT-4 summaries were preferred to fine-tuned BART or PEGASUS outputs on news benchmarks, and that closed foundation models such as Claude and Gemini, with context windows of 200,000 to over a million tokens, could summarize book-length documents in a single pass. Open-weight families such as FLAN-T5 and Llama provided instruction-tuned alternatives that can be fine-tuned on domain-specific data.
Most summarization systems fall into one of four architectural families.
Encoder-only with an extractive head. A pretrained encoder such as BERT scores each sentence and the top-K sentences are selected. This route, popularized by BERTSUM (Liu and Lapata, 2019), is fast and stays faithful to the source.
Encoder-decoder transformer. BART, T5, PEGASUS, and their long-context successors share this template: the encoder builds contextual representations and the decoder generates the summary autoregressively while attending over the encoder. Pretraining objectives vary across text-infilling (BART), span corruption (T5), and gap-sentence generation (PEGASUS).
Decoder-only language model. Generic LLMs such as the GPT, Llama, Claude, and Gemini families read source and instruction in a single context window, then continue with the summary.
Retrieval-augmented or hierarchical pipelines. For inputs that exceed even long-context models, systems chunk the document, summarize the chunks, and recursively combine partial summaries; alternatively, a retrieval-augmented generation pipeline fetches the most relevant passages before generating the final summary.
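The chunk-then-combine pattern in the last family can be sketched as follows. The `summarize` argument is a hypothetical caller-supplied function (for example, a wrapper around a model call), and splitting on whitespace with a fixed `max_words` budget is a simplifying assumption; real pipelines usually split on token counts and sentence or section boundaries.

```python
def chunk_words(text, max_words):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def hierarchical_summarize(text, summarize, max_words=200):
    """Summarize chunks, then recursively summarize the joined partial summaries.

    `summarize` is a hypothetical str -> str callable supplied by the caller.
    """
    if len(text.split()) <= max_words:
        return summarize(text)
    partials = [summarize(chunk) for chunk in chunk_words(text, max_words)]
    combined = " ".join(partials)
    if len(combined.split()) >= len(text.split()):
        # The summarizer did not compress; truncate and stop rather than recurse forever.
        return summarize(" ".join(combined.split()[:max_words]))
    return hierarchical_summarize(combined, summarize, max_words)


def first_five_words(text):
    """Toy stand-in for a model: keeps the first five words."""
    return " ".join(text.split()[:5])


document = " ".join(f"w{i}" for i in range(1000))
result = hierarchical_summarize(document, first_five_words, max_words=100)
```

Each level of recursion costs one summarizer call per chunk, so cost and latency scale with document length, and errors introduced in early partial summaries can propagate to the final output.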
The table below lists models that set state-of-the-art results or shaped later research.
| Model | Year | Organization | Parameters | Notes |
|---|---|---|---|---|
| Pointer-generator network | 2017 | Stanford / Google Brain | ~22M | RNN seq2seq with copy and coverage; first strong abstractive baseline on CNN/DailyMail |
| BART | 2019 | Facebook AI | 140M, 400M | Denoising encoder-decoder; new SOTA on CNN/DailyMail and XSum |
| T5 | 2019 | Google | 60M to 11B | Text-to-text transfer transformer pretrained on C4 |
| PEGASUS | 2019 | Google | 223M, 568M | Gap-sentence generation objective; SOTA on 12 datasets |
| Longformer / LED | 2020 | Allen Institute for AI | 149M, 435M | Sliding window plus global attention up to 16K tokens |
| BigBird | 2020 | Google | ~127M, ~580M | Sparse global+local+random attention for long inputs |
| LongT5 | 2021 | Google | 220M to 3B | Transient-global attention plus PEGASUS-style pretraining |
| PEGASUS-X | 2022 | Google | 568M | Long-input PEGASUS variant for inputs up to 16K tokens |
| FLAN-T5 | 2022 | Google | 80M to 11B | Instruction-tuned T5; strong zero-shot summarization |
| GPT-4 | 2023 | OpenAI | Undisclosed | Zero-shot summarization across domains |
| Claude | 2023 onward | Anthropic | Undisclosed | 200K+ token windows for long-document summarization |
| Gemini | 2023 onward | Google DeepMind | Undisclosed | Up to 1M-token context for book-length summarization |
| Llama family | 2023 onward | Meta AI | 1B to 405B | Open-weight LLMs widely fine-tuned for summarization |
Progress in summarization has been driven by a small number of public datasets, listed below.
| Dataset | Domain | Style | Notes |
|---|---|---|---|
| CNN/DailyMail | News | Multi-sentence highlights | ~313K articles, originally collected by Hermann et al. (2015) for reading comprehension and later repurposed for summarization |
| XSum | BBC news | One-sentence extreme summaries | 226,711 articles; highly abstractive, with about 96% novel trigrams |
| Newsroom | News | Mixed extractive/abstractive | 1.3M article-summary pairs across 38 publishers |
| WikiHow | How-to articles | Step summaries | 230K articles with paragraph headlines as targets |
| BillSum | US legislation | Bill summaries | Congressional and California state bills |
| MultiNews | News clusters | Multi-document | 56K clusters of related articles with editor summaries |
| GovReport | US government reports | Executive summaries | Mean document length ~9,600 tokens |
| arXiv / PubMed (Cohan et al., 2018) | Scientific papers | Abstracts | About 215K arXiv and 133K PubMed papers; abstracts serve as targets |
| BookSum | Novels and short stories | Chapter and book-level summaries | Released by Salesforce Research for long narrative summarization |
| QMSum | Meeting transcripts | Query-focused summaries | Multi-domain meeting corpus; mean length ~13,300 tokens |
| SAMSum | Chat dialogues | Third-person summaries | 16,369 human-written messenger conversations |
| DialogSum | Spoken-style dialogues | Daily-life summaries | 13,460 dialogues in daily-life scenarios |
Automatic evaluation of summaries is hard because many different surface realizations can be equally good. The most widely used metric is ROUGE (Recall-Oriented Understudy for Gisting Evaluation), introduced by Chin-Yew Lin in 2004. ROUGE-N counts overlapping n-grams against one or more reference summaries, ROUGE-L measures the longest common subsequence, and ROUGE-S scores skip-bigrams. ROUGE was adopted at the Document Understanding Conference (DUC) 2004 and remains the default headline number in nearly every paper, though it correlates only weakly with human judgments of factuality.
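The n-gram and longest-common-subsequence definitions can be made concrete with a small sketch. It assumes whitespace tokenization, lowercasing, and a single reference, and it omits the stemming, multi-reference handling, and summary-level LCS found in standard ROUGE toolkits, so its numbers will not match published scores exactly. All function names are illustrative.

```python
from collections import Counter


def _prf(overlap, cand_total, ref_total):
    """Precision, recall and F1 from an overlap count."""
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Clipped n-gram overlap between candidate and reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())
    return _prf(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """LCS-based score at the sentence level."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return _prf(lcs_length(cand, ref), len(cand), len(ref))


reference = "the cat sat on the mat"
close = rouge_n("the cat lay on the mat", reference, n=2)
paraphrase = rouge_n("a feline rested upon a rug", reference, n=1)
```

In this example a one-word substitution still scores well, while a paraphrase with no shared words scores zero. This illustrates why lexical overlap is a weak proxy for meaning.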
BERTScore (Tianyi Zhang and colleagues, 2019) compares contextual BERT embeddings between candidate and reference, scoring semantic similarity rather than literal n-gram overlap. BLEURT (Sellam et al., 2020) is a learned regression metric fine-tuned to predict human ratings.
A separate strand of work targets factual consistency. FactCC (Kryscinski et al., 2020) trains a classifier to detect contradictions between source and summary. QAGS (Wang et al., 2020) generates questions from the candidate summary and checks whether a question answering system gives the same answers when reading the summary and the source. Newer LLM-based evaluators, such as G-Eval (Liu et al., 2023), prompt a strong model such as GPT-4 to rate candidates on coherence, relevance, consistency, and fluency.
Summarization models are deployed in many products. News aggregators such as Google News and Apple News surface short article previews. Scientific search tools such as Semantic Scholar generate one-sentence TLDRs of papers. Meeting platforms such as Zoom, Google Meet, and Microsoft Teams produce action-item summaries from live transcripts. Customer-support tools condense long ticket threads. Legal and medical software summarizes contracts, case files, and patient records, often with strict faithfulness requirements. Coding assistants summarize pull requests, log files, and stack traces.
Research focus has shifted from fine-tuning on fixed benchmarks to controlling general-purpose LLMs.
Prompt engineering for summarization. Chain-of-density prompting, introduced by Griffin Adams and colleagues in 2023, asks GPT-4 to draft an entity-sparse summary and then iteratively add salient entities at fixed length. A human study on 100 CNN/DailyMail articles found readers preferred the denser summaries over a vanilla one-shot prompt.
Controllable summarization. Users specify length, focus topic, audience, or tone in natural language, and the model adapts without retraining. Aspect-based, query-focused, and multi-document summarization can all be expressed through prompts.
Long-context summarization. With Claude offering 200K-token windows and Gemini reaching 1M tokens, entire technical reports, legal briefs, and novels can be processed without chunking, though hierarchical pipelines remain popular when budget or latency matters.
Multimodal summarization. Models that handle text, images, audio, and video together can summarize lectures, videos, and slides.
Several open problems remain.
Factual hallucination. Abstractive models, including the strongest LLMs, sometimes introduce details that contradict or do not appear in the source. The problem is sharpest for long inputs, low-resource domains, and highly compressed summaries.
Faithfulness versus informativeness. Extractive systems are faithful by construction but often miss synthesis opportunities; abstractive systems are more informative but harder to audit.
Metric reliability. ROUGE rewards lexical overlap and penalizes valid paraphrases, which is one reason model rankings shift when human judges are asked instead. Newer metrics partially address this but bring their own biases, especially when evaluator and generator share an underlying model family.
Position and length bias. Many neural summarizers learn to favor the first few sentences of the source, an artifact of news training data. Long-context models still tend to attend more strongly to the beginning and end of an input.
Evaluation in the LLM era. Reference summaries for benchmarks such as CNN/DailyMail and XSum were written by editors with their own conventions and errors. As LLM-generated summaries surpass these references, ROUGE becomes a weaker signal of progress, and the community has moved toward human preference and LLM-judge evaluation.