Summarization Models
Last reviewed
May 31, 2026
Sources
28 citations
Review status
Source-backed
Revision
v3 · 4,534 words
Summarization models are natural language processing systems that condense a source document, set of documents, or dialogue into a shorter version that preserves the most important information. They sit alongside machine translation and question answering as canonical sequence-to-sequence tasks, and form the technical core of news digests, scientific TLDR generation, meeting recaps, and the executive briefs produced by modern chat assistants.
The field has been reshaped twice in quick succession: first by pretrained encoder-decoder models such as BART and T5 in 2019, and again by general-purpose large language models such as GPT-4, Claude, and Gemini, which produce strong summaries with zero or few examples.
See also: Natural Language Processing Models, Text Summarization.
A summarization system reads an input of length N and produces a much shorter output while keeping it informative, fluent, and faithful to the source. Researchers split the task along four primary axes.
Extractive versus abstractive. Extractive systems select spans (usually whole sentences) from the source and concatenate them. Abstractive systems generate new text that may paraphrase, fuse, or reorder content. Extractive output stays grounded but reads choppily; abstractive output is smoother and shorter, with a higher risk of hallucination. A third hybrid category, sometimes called compressive or mixed summarization, selects candidate sentences and then compresses or edits them into final output, combining the faithfulness advantage of extraction with the fluency advantage of generation.
Single versus multi-document. Single-document summarization condenses one input, the standard setting for news articles or scientific papers. Multi-document summarization fuses several related inputs and adds the problems of redundancy removal and cross-document coreference. Meeting summarization is a special case that handles multi-party dialogue with speaker turns, disfluencies, and implicit topic shifts.
Generic versus query- or aspect-focused. A generic summary covers the most salient content. A query-focused summary answers a user question against the source; an aspect-focused summary picks out a specific facet, such as only the methods of a research paper. Query-focused multi-document summarization (QFMDS) combines both challenges and has a long tradition at government-sponsored evaluation workshops including TREC and DUC.
Length and granularity. Extreme summarization, typified by the XSum benchmark, targets a single sentence that captures the headline intent of an article. Longer summaries range from a short paragraph to multi-page executive reports, as in the GovReport dataset whose targets average roughly 550 words.
Before neural networks dominated the area, extractive summarization was the practical default. Two graph-based algorithms published in 2004 became long-running baselines: TextRank by Rada Mihalcea and Paul Tarau, and LexRank by Gunes Erkan and Dragomir Radev. Both build a graph in which sentences are nodes and edges carry a similarity weight, then rank nodes with an iterative algorithm in the style of PageRank to pick the most central sentences. These methods require no training data and remain useful when annotated corpora are unavailable.
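The graph-ranking idea behind these baselines can be sketched in a few lines. The version below is a simplified illustration, not the published TextRank or LexRank implementation: the sentence splitter, the Jaccard word-overlap similarity, and the function names are all assumptions chosen for brevity.

```python
import re
from itertools import combinations


def split_sentences(text):
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_similarity(a, b):
    # Jaccard overlap of lowercase word sets; an illustrative edge weight only.
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def rank_sentences(sentences, damping=0.85, iterations=50):
    # PageRank-style power iteration over a weighted, undirected sentence graph.
    n = len(sentences)
    if n == 0:
        return []
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        weights[i][j] = weights[j][i] = overlap_similarity(sentences[i], sentences[j])
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def extractive_summary(text, k=2):
    # Keep the k most central sentences, restored to document order.
    sentences = split_sentences(text)
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Because no parameters are learned, the only inputs are the text and the similarity function, which is why this family of methods works without annotated data.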
Earlier still, the MEAD system from the University of Michigan (Radev et al., 2004) combined centroid-based scoring, positional features, and redundancy penalties for multi-document summarization. The Document Understanding Conference (DUC), which ran from 2001 to 2007 and was succeeded by the Text Analysis Conference (TAC), provided annual shared tasks and standard datasets that drove the field's early progress.
Topic modeling offered a complementary angle. Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) could identify thematic segments and pick representative sentences for each topic, which proved especially useful for multi-document summarization where raw similarity measures blur across topic clusters.
The first wave of neural summarization adapted the encoder-decoder framework from machine translation, with a recurrent encoder reading the source and a recurrent decoder generating tokens while attending over the encoder states. Rush, Chopra, and Weston (2015) applied this to headline generation with attention, producing the first compelling abstractive baseline on news. Chopra, Auli, and Rush (2016) refined the attentional model with a recurrent decoder and improved results on the same Gigaword headline-generation task.
The decisive refinement was the pointer-generator network of Abigail See, Peter J. Liu, and Christopher D. Manning in 2017, which augments the decoder with a copy mechanism that can generate a vocabulary word or copy a token from the source. A coverage loss discourages repetition. On CNN/DailyMail it beat earlier abstractive baselines by at least 2 ROUGE points and remained competitive until pretrained transformers arrived.
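As a sketch of the copy mechanism, the final word distribution at one decoder step can be written as P(w) = p_gen · P_vocab(w) + (1 − p_gen) · Σ a_i over source positions i whose token is w, where a_i is the attention weight. The coverage penalty at a step is the sum over source positions of min(a_i, c_i), with c_i the attention accumulated over earlier steps. The code below illustrates both with plain Python lists and dictionaries; the function names and the toy numbers are assumptions, and a real model would compute these quantities on tensors.

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix generation and copy probabilities for a single decoder step."""
    # Generation part: scale the vocabulary distribution by p_gen.
    mixed = {word: p_gen * p for word, p in vocab_dist.items()}
    # Copy part: route attention mass to the source tokens themselves,
    # which lets out-of-vocabulary source words receive probability.
    for token, weight in zip(source_tokens, attention):
        mixed[token] = mixed.get(token, 0.0) + (1.0 - p_gen) * weight
    return mixed


def coverage_loss(attention_history, attention):
    """Penalize attending again to positions that were already attended."""
    coverage = [sum(step[i] for step in attention_history) for i in range(len(attention))]
    return sum(min(a, c) for a, c in zip(attention, coverage))


# Toy step: "Zanzibar" is not in the vocabulary but can still be copied.
dist = final_distribution(
    p_gen=0.6,
    vocab_dist={"the": 0.5, "cat": 0.3, "<unk>": 0.2},
    attention=[0.1, 0.7, 0.2],
    source_tokens=["the", "Zanzibar", "cat"],
)
```

In the toy step the mixed distribution still sums to 1, and the out-of-vocabulary source word receives probability 0.28 purely through copying.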
In 2019 three encoder-decoder transformers established the modern paradigm. BART, from Mike Lewis and colleagues at Facebook AI, pretrains a bidirectional encoder and an autoregressive decoder to reconstruct text corrupted by token masking, deletion, sentence shuffling, and span infilling; fine-tuned on CNN/DailyMail and XSum it set new state-of-the-art ROUGE scores. T5, from Colin Raffel and colleagues at Google, cast every NLP task as text-to-text and pretrained on the C4 corpus at scales up to 11 billion parameters. PEGASUS, from Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu at Google, introduced a gap-sentence-generation objective specifically aimed at summarization: important sentences are masked and the model learns to regenerate them. PEGASUS achieved state-of-the-art results across 12 summarization datasets and strong few-shot performance with as few as 1,000 labelled examples.
Also in 2019, Yang Liu and Mirella Lapata published BERTSUM, which adapted the bidirectional encoder BERT to extractive and abstractive summarization. The extractive variant inserts a [CLS] token before each sentence, encodes the full document jointly, and scores each sentence with stacked transformer layers. Training labels come from a greedy algorithm that picks the sentences maximizing ROUGE against the reference summary, and at inference the highest-scoring sentences form the summary. BERTSUM set a new extractive state of the art on CNN/DailyMail and demonstrated that the same pretrained encoder could support both extractive and abstractive heads.
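The greedy ROUGE-driven sentence selection can be illustrated as follows. This is a minimal sketch, assuming a unigram F1 score as a stand-in for ROUGE and whitespace tokenization; the function names are hypothetical and the stopping rule (stop when no sentence improves the score) is one common choice rather than a confirmed detail of BERTSUM.

```python
from collections import Counter


def unigram_f1(candidate_tokens, reference_tokens):
    # Clipped unigram overlap, a simplified stand-in for ROUGE-1 F1.
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_select(sentences, reference, max_sentences=3):
    """Add one sentence at a time while the score against the reference improves."""
    ref_tokens = reference.lower().split()
    selected, best = [], 0.0
    while len(selected) < max_sentences:
        candidates = []
        for i in range(len(sentences)):
            if i in selected:
                continue
            joined = " ".join(sentences[j] for j in sorted(selected + [i]))
            candidates.append((unigram_f1(joined.lower().split(), ref_tokens), i))
        if not candidates:
            break
        score, idx = max(candidates)
        if score <= best:
            break  # no remaining sentence improves the score
        selected.append(idx)
        best = score
    return sorted(selected)
```

The returned indices can serve as binary labels for training a sentence scorer, since they approximate the best extractive summary reachable for a given reference.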
Standard transformer self-attention scales quadratically with sequence length, which prevents 2019-era models from processing documents much longer than 1,024 tokens without truncation. A series of architectures, including Longformer/LED, BigBird, LongT5, and PEGASUS-X, introduced sparse attention patterns to push the input limit further.
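As a rough illustration of a sparse attention pattern, the sketch below builds a boolean mask that combines a sliding local window with a few global positions. It is an assumption-laden toy: the function names are hypothetical, and it materializes the full matrix for readability, whereas efficient implementations never build the dense mask.

```python
def sparse_attention_mask(seq_len, window, global_positions=()):
    """Return mask[i][j] == True where query i may attend to key j."""
    half = window // 2
    global_set = set(global_positions)
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            local = abs(i - j) <= half  # sliding window around position i
            is_global = i in global_set or j in global_set  # global tokens see and are seen by all
            mask[i][j] = local or is_global
    return mask


def count_allowed(mask):
    # Number of query-key pairs that would actually be computed.
    return sum(sum(row) for row in mask)


mask = sparse_attention_mask(seq_len=8, window=2, global_positions=(0,))
```

For this toy setting, 34 of the 64 query-key pairs are allowed. With a fixed window and a fixed number of global tokens, the number of allowed pairs grows linearly with sequence length rather than quadratically.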
The arrival of GPT-3 in 2020 and its instruction-tuned successors made zero-shot summarization a routine chatbot capability. By 2023 papers were reporting that GPT-4 summaries were preferred to fine-tuned BART or PEGASUS outputs on news benchmarks under human evaluation, even when ROUGE scores favored the fine-tuned models. This divergence highlighted the growing gap between automatic metrics and actual summary quality. Closed foundation models such as Claude and Gemini, with context windows of 200,000 to over a million tokens, could summarize book-length documents in a single pass. Open-weight families such as FLAN-T5 and Llama provided instruction-tuned alternatives that can be fine-tuned on domain-specific data. Research from 2024 and 2025 found that fine-tuned open-source LLMs with domain-specific training data could match or exceed GPT-4 zero-shot performance on narrow tasks, while GPT-4-class models retained advantages on zero-shot cross-domain and multilingual settings.
Most summarization systems fall into one of four architectural families.
Encoder-only with an extractive head. A pretrained encoder such as BERT scores each sentence and the top-K sentences are selected. This route, popularized by BERTSUM (Liu and Lapata, 2019), is fast and stays faithful to the source. The encoder can also drive an importance or saliency classifier that scores spans rather than whole sentences, enabling finer-grained extractive compression.
Encoder-decoder transformer. BART, T5, PEGASUS, and their long-context successors share this template: the encoder builds contextual representations and the decoder generates the summary autoregressively while attending over the encoder. Pretraining objectives vary across text-infilling (BART), span corruption (T5), and gap-sentence generation (PEGASUS). The encoder and decoder can be initialized separately from existing checkpoints or jointly from scratch on a massive denoising corpus.
Decoder-only language model. Generic LLMs such as the GPT, Llama, Claude, and Gemini families read source and instruction in a single context window, then continue with the summary. Instruction tuning and reinforcement learning from human feedback (RLHF) adapt these models to produce user-preferred summaries without task-specific fine-tuning on summarization corpora. The limitation is that faithfulness checking is harder without a separate encoder that explicitly grounds the decoder in the source.
Retrieval-augmented or hierarchical pipelines. For inputs that exceed even long-context models, systems chunk the document, summarize the chunks, and recursively combine partial summaries. This map-reduce pattern is easy to implement but loses cross-chunk dependencies. Alternatively, a retrieval-augmented generation pipeline fetches the most relevant passages before generating the final summary, which is efficient when the user query identifies the relevant region but introduces retrieval errors as a new failure mode.
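The map-reduce pattern can be sketched as a short recursive loop. The code below is illustrative only: `summarize` is a hypothetical callable supplied by the caller (for example a wrapper around any summarization model), chunking by word count is an assumption, and `max_rounds` guards against a summarizer that fails to shrink its input.

```python
def chunk_words(text, chunk_size):
    # Split on whitespace into consecutive chunks of at most chunk_size words.
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def map_reduce_summarize(text, summarize, chunk_size=500, max_rounds=10):
    """Summarize chunks, join the partial summaries, and repeat until the text fits."""
    current = text
    for _ in range(max_rounds):
        if len(current.split()) <= chunk_size:
            return summarize(current)  # final reduce step
        partials = [summarize(chunk) for chunk in chunk_words(current, chunk_size)]
        current = " ".join(partials)
    return summarize(current)
```

Each chunk is summarized in isolation, which is exactly where cross-chunk dependencies are lost: a pronoun in one chunk whose referent sits in another cannot be resolved by the per-chunk call.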
The choice of pretraining objective has a large effect on summarization quality when fine-tuning data is scarce. Gap-sentence generation (PEGASUS) masks whole sentences that have high overlap with the rest of the document, a proxy for "informativeness," and teaches the model to reconstruct a document-level summary rather than local spans. Span corruption (T5) and text infilling (BART) are more general, requiring reconstruction of corrupted sequences of different lengths. Sentence shuffling (BART) trains the model to recover discourse order, which helps when summaries require understanding narrative structure rather than just identifying key facts.
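To make the gap-sentence idea concrete, the sketch below builds one pretraining example by masking the sentence that overlaps most with the rest of the document. It is a simplified illustration: unigram F1 stands in for the ROUGE-based selection, the `<mask>` placeholder and the function names are assumptions, and the real objective operates on tokenized text at much larger scale.

```python
from collections import Counter


def overlap_f1(a_tokens, b_tokens):
    # Clipped unigram overlap F1, a simplified proxy for ROUGE-1.
    overlap = sum((Counter(a_tokens) & Counter(b_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(a_tokens)
    recall = overlap / len(b_tokens)
    return 2 * precision * recall / (precision + recall)


def gap_sentence_example(sentences, num_masked=1, mask_token="<mask>"):
    """Return (masked source, target) where the target is the masked sentences."""
    tokenized = [s.lower().split() for s in sentences]
    scores = []
    for i, tokens in enumerate(tokenized):
        # Score each sentence against all the other sentences combined.
        rest = [t for j, other in enumerate(tokenized) if j != i for t in other]
        scores.append(overlap_f1(tokens, rest))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:num_masked])
    source = " ".join(mask_token if i in chosen else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in chosen)
    return source, target
```

The model is then trained to generate the target from the masked source, which resembles producing a summary sentence from the remaining context.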
The table below lists models that set state-of-the-art results or shaped later research.
| Model | Year | Organization | Parameters | Notes |
|---|---|---|---|---|
| Pointer-generator network | 2017 | Stanford / Google Brain | ~22M | RNN seq2seq with copy and coverage; first strong abstractive baseline on CNN/DailyMail |
| BERTSUM | 2019 | University of Edinburgh | ~110M | BERT encoder with extractive and abstractive variants; SOTA on CNN/DailyMail |
| BART | 2019 | Facebook AI | 140M, 400M | Denoising encoder-decoder; new SOTA on CNN/DailyMail and XSum |
| T5 | 2019 | Google | 60M to 11B | Text-to-text transfer transformer pretrained on C4 |
| PEGASUS | 2019 | Google | 223M, 568M | Gap-sentence generation objective; SOTA on 12 datasets |
| Longformer / LED | 2020 | Allen Institute for AI | 149M, 435M | Sliding window plus global attention up to 16K tokens |
| BigBird | 2020 | Google Research | ~127M, ~580M | Sparse global+local+random attention for long inputs |
| LongT5 | 2021 | Google Research | 220M to 3B | Transient-global attention plus PEGASUS-style pretraining |
| PEGASUS-X | 2022 | Google Research | 568M | Long-input PEGASUS variant for inputs up to 16K tokens |
| FLAN-T5 | 2022 | Google | 80M to 11B | Instruction-tuned T5; strong zero-shot summarization |
| GPT-4 | 2023 | OpenAI | Undisclosed | Zero-shot summarization across domains; preferred by humans over fine-tuned encoder-decoder models on news |
| Claude | 2023 onward | Anthropic | Undisclosed | 200K+ token windows for long-document summarization |
| Gemini | 2023 onward | Google DeepMind | Undisclosed | Up to 1M-token context for book-length summarization |
| Llama family | 2023 onward | Meta AI | 1B to 405B | Open-weight LLMs widely fine-tuned for domain-specific summarization |
Progress in summarization has been driven by a small number of public datasets, listed below. Each dataset encodes a particular style of summarization, and models fine-tuned on one do not always transfer well to others.
| Dataset | Domain | Style | Size | Notes |
|---|---|---|---|---|
| CNN/DailyMail | News | Multi-sentence highlights | ~313K articles | Originally collected by Hermann et al. (2015) for reading comprehension and repurposed for summarization; moderately abstractive |
| Gigaword | News | Headline generation | ~4M sentence pairs | Sentence-level compression from the first sentence of articles; used by Rush et al. (2015) |
| XSum | BBC news | One-sentence extreme summaries | 226,711 articles | Highly abstractive, with about 96% novel trigrams; originally by Narayan, Cohen, and Lapata (2018) |
| Newsroom | News | Mixed extractive/abstractive | 1.3M article-summary pairs | Sourced from 38 publishers with explicit extractiveness labels |
| WikiHow | How-to articles | Step summaries | 230K articles | Paragraph headlines as targets; diverse procedural domain |
| BillSum | US legislation | Bill summaries | ~23K bills | Congressional and California state bills; long source documents |
| MultiNews | News clusters | Multi-document | 56K clusters | Related articles with editor summaries; tests redundancy handling |
| GovReport | US government reports | Executive summaries | ~19K reports | Mean document length ~9,600 tokens; mean summary ~550 words; Huang et al. (2021) |
| arXiv / PubMed | Scientific papers | Abstracts | ~215K / ~133K papers | Cohan et al. (2018); abstracts serve as targets; very long source documents |
| BookSum | Novels and short stories | Chapter and book-level summaries | ~12K chapter summaries | Salesforce Research; three granularity levels; Kryściński et al. (2021) |
| QMSum | Meeting transcripts | Query-focused summaries | 1.8K query-summary pairs | Multi-domain meeting corpus; mean length ~13,300 tokens |
| SAMSum | Chat dialogues | Third-person summaries | 16,369 conversations | Human-written messenger-style conversations; Gliwa et al. (2019) |
| DialogSum | Spoken-style dialogues | Daily-life summaries | 13,460 dialogues | Daily-life scenarios; Chen et al. (2021) |
| AMI / ICSI | Meeting transcripts | Meeting minutes | ~100 / ~75 meetings | Classic meeting corpora; multi-speaker; used before the LLM era for dialogue summarization |
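Abstractiveness figures such as the share of novel trigrams quoted for XSum in the table can be computed with a simple n-gram comparison. The sketch below assumes lowercase whitespace tokenization and counts every n-gram occurrence in the summary; published figures may use different tokenization, so treat the function as an illustration of the measure rather than a reproduction of any paper's script.

```python
def ngrams(tokens, n):
    # All consecutive n-grams, as tuples, in order of appearance.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_ratio(source, summary, n=3):
    """Fraction of summary n-grams that never occur in the source."""
    source_set = set(ngrams(source.lower().split(), n))
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    novel = sum(1 for gram in summary_ngrams if gram not in source_set)
    return novel / len(summary_ngrams)


ratio = novel_ngram_ratio(
    "the council approved the new budget on monday",
    "the council approved a larger budget",
)
```

Here three of the four summary trigrams are absent from the source, giving a ratio of 0.75. A purely extractive summary scores close to 0, while a heavily paraphrased one approaches 1.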
The spread of datasets reflects the breadth of summarization as a task. News datasets such as CNN/DailyMail and XSum differ substantially in style: CNN/DailyMail summaries are bullet-style highlights written by editors to accompany the article, while XSum requires a one-sentence answer to "what is the article about?" that demands deep comprehension. Scientific paper datasets shift the demand toward technical vocabulary and hierarchical discourse. Dialogue datasets introduce speaker role modeling and speech-act understanding. Long-document datasets (GovReport, BookSum, arXiv) push models toward hierarchical architectures or large context windows.
Training on one dataset and evaluating on another routinely reveals that neural summarizers learn dataset-specific style artifacts rather than transferable summarization skills. XSum models trained on the extreme one-sentence target style produce overly compressed outputs on CNN/DailyMail; the reverse produces verbose outputs on XSum.
Automatic evaluation of summaries is hard because many different surface realizations can be equally good. The most widely used metric is ROUGE (Recall-Oriented Understudy for Gisting Evaluation), introduced by Chin-Yew Lin in 2004. ROUGE-N counts overlapping n-grams against one or more reference summaries, ROUGE-L measures the longest common subsequence, and ROUGE-S scores skip-bigrams. ROUGE was adopted at the Document Understanding Conference (DUC) 2004 and remains the default headline number in nearly every paper.
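A minimal sketch of ROUGE-N and ROUGE-L for a single reference is shown below. It assumes lowercase whitespace tokenization and omits stemming, stopword handling, and multi-reference aggregation, so its numbers will not match an official ROUGE toolkit exactly; the function names are chosen for illustration.

```python
from collections import Counter


def _ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Clipped n-gram overlap reported as precision, recall, and F1."""
    cand = _ngram_counts(candidate.lower().split(), n)
    ref = _ngram_counts(reference.lower().split(), n)
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


def lcs_length(a, b):
    # Dynamic programming over two rows for the longest common subsequence.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    cand, ref = candidate.lower().split(), reference.lower().split()
    return lcs_length(cand, ref) / max(len(ref), 1)
```

For the candidate "the cat sat on the mat" and the reference "the cat lay on the mat", ROUGE-2 recall is 0.6 (three of five reference bigrams match) and the longest common subsequence has length 5.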
ROUGE's core limitations are well documented. It rewards lexical overlap and penalizes valid paraphrases, which can rank a poor abstractive summary that copies source phrases above a better one that paraphrases. It also fails to capture factual consistency: a summary can score well on ROUGE while containing facts that contradict the source. Cross-dataset comparisons using ROUGE are unreliable because different datasets use different numbers of reference summaries, and human references vary in abstractiveness.
BERTScore (Tianyi Zhang and colleagues, 2019) compares contextual BERT embeddings between candidate and reference, scoring semantic similarity rather than literal n-gram overlap. BLEURT (Sellam et al., 2020) is a learned regression metric fine-tuned to predict human ratings. MoverScore (Zhao et al., 2019) uses earth-mover distance between pooled word embeddings to compute soft n-gram overlap.
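The matching step behind embedding-based metrics such as BERTScore can be sketched with greedy cosine matching: each candidate token is paired with its most similar reference token for precision, and each reference token with its most similar candidate token for recall. The code below assumes the token vectors are already available as plain lists (in practice they would come from a pretrained encoder) and omits the optional importance weighting and score rescaling.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def greedy_match_scores(cand_vecs, ref_vecs):
    """Precision, recall, and F1 from best-match cosine similarities."""
    if not cand_vecs or not ref_vecs:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy 2-dimensional "embeddings" standing in for contextual token vectors.
scores = greedy_match_scores([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 1.0]])
```

Because similarity is computed in embedding space, a paraphrase with no shared words can still score well, which is the property that lexical overlap metrics lack.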
A separate strand of work targets factual consistency specifically. These metrics do not require a reference summary; they compare the generated summary directly to the source document.
FactCC (Kryściński et al., 2020) trains a classifier on synthetically generated factual and non-factual sentence pairs to detect contradictions between source and summary. QAGS (Wang, Cho, and Lewis, 2020) generates questions from the candidate summary and checks whether a question answering system gives the same answers when reading the summary and the source; divergence signals inconsistency. FEQA (Durmus et al., 2020) applies a similar question-generation and question-answering loop: it generates question-answer pairs from the summary and checks whether a QA model recovers the same answers from the source.
SummaC (Laban et al., 2022) operationalizes sentence-level consistency via natural language inference, checking whether each summary sentence is entailed by and not contradicted by the source. AlignScore (Zha et al., 2023) proposes a unified alignment function that supports both reference-based and reference-free evaluation by fine-tuning a checkpoint to predict whether one text is factually supported by another.
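The aggregation logic of sentence-level NLI checking can be sketched as follows: score every (source sentence, summary sentence) pair, keep the best-supporting source sentence for each summary sentence, and average. This is a simplified illustration, not the SummaC implementation. `entail_prob` is a hypothetical callable that would wrap an NLI model, and `toy_entail_prob` is a word-containment stub used only so the example runs without a model.

```python
def consistency_score(source_sentences, summary_sentences, entail_prob):
    """Mean over summary sentences of the best entailment score from any source sentence."""
    if not source_sentences or not summary_sentences:
        return 0.0
    per_sentence = [
        max(entail_prob(premise, hypothesis) for premise in source_sentences)
        for hypothesis in summary_sentences
    ]
    return sum(per_sentence) / len(per_sentence)


def toy_entail_prob(premise, hypothesis):
    # Stub (assumption): fraction of hypothesis words found in the premise.
    # A real system would return an NLI model's entailment probability.
    premise_words = set(premise.lower().split())
    hypothesis_words = hypothesis.lower().split()
    if not hypothesis_words:
        return 0.0
    return sum(w in premise_words for w in hypothesis_words) / len(hypothesis_words)


source = ["the company reported a profit in 2023", "its chief executive resigned in may"]
summary = ["the company reported a profit", "the chief executive resigned in june"]
score = consistency_score(source, summary, toy_entail_prob)
```

The second summary sentence changes the month, so its best support is only partial and the overall score drops below 1. With a real NLI model in place of the stub, the same aggregation flags unsupported sentences without needing a reference summary.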
Newer LLM-based evaluators, of which G-Eval (Liu et al., 2023) is a widely cited example, prompt a strong model such as GPT-4 to rate candidates on coherence, relevance, consistency, and fluency using a chain-of-thought evaluation protocol. Studies find G-Eval correlates more closely with human judgments than ROUGE on most dimensions, though it inherits the biases of the underlying model and can favor outputs from the same model family.
Human evaluation remains the gold standard but is expensive and hard to replicate. Common human evaluation protocols ask annotators to rate summaries on a 1-5 scale along dimensions including fluency (grammaticality and readability), coherence (logical structure), relevance (coverage of key information), and consistency (factual alignment with the source). Direct assessment (DA), pairwise comparison (A/B testing), and pyramid-based evaluation (Nenkova and Passonneau, 2004) are the three dominant human evaluation frameworks.
Summarization models are deployed across many industries and products.
News and media. Aggregators such as Google News and Apple News surface short article previews. Specialized tools generate push-notification-length bullets for breaking news. Some publishers use summarization to auto-generate article metadata and social media snippets.
Scientific literature. Semantic Scholar and other academic search platforms generate one-sentence TLDRs of papers for users scanning large result sets. Automated abstract generation assists authors and supports systematic review workflows in medicine and law.
Meeting and communication platforms. Zoom, Google Meet, Microsoft Teams, and dedicated tools such as Otter.ai and tl;dv produce action-item summaries, decision logs, and meeting minutes from live or recorded transcripts. Customer-support and CRM platforms condense long ticket threads to give agents context before a call.
Legal and compliance. Legal research tools summarize contracts, case files, statutes, and regulatory filings. Strict faithfulness requirements in this domain have driven interest in extractive or lightly compressive approaches that preserve verifiable spans, and in faithfulness metrics as filters on generated summaries.
Healthcare. Clinical note summarization condenses patient histories, discharge summaries, and encounter notes. Biomedical literature summarization assists physicians and researchers in keeping up with large volumes of publications. Both settings place a premium on factual accuracy and domain-specific vocabulary.
Software development. Coding assistants summarize pull requests, code review threads, log files, and stack traces. Commit message generation from diffs is a closely related summarization subtask.
Education. Adaptive learning platforms generate reading summaries at different comprehension levels. Textbook chapter summarization and lecture transcript summarization are increasingly automated with LLM-based tools.
Research focus has shifted from fine-tuning on fixed benchmarks to controlling general-purpose LLMs.
Chain-of-density prompting, introduced by Griffin Adams and colleagues in 2023, asks GPT-4 to draft an entity-sparse summary and then iteratively add salient entities at fixed length. The process runs for five iterations, each adding more entities while preserving the target word count. A human study on 100 CNN/DailyMail articles found readers preferred the denser, information-packed summaries over a vanilla one-shot prompt. The authors released 500 annotated CoD summaries and 5,000 unannotated summaries.
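The iterative densification loop can be sketched as below. This is a loose illustration under stated assumptions: `llm` is a hypothetical callable that maps a prompt string to a completion string, the prompt wording is invented for the example rather than taken from the paper, and the sketch issues one call per step instead of a single multi-step prompt.

```python
def chain_of_density(article, llm, steps=5, target_words=80):
    """Return the list of progressively denser summaries, one per step."""
    first_prompt = (
        f"Write a summary of about {target_words} words that names only a few "
        f"entities.\n\nArticle:\n{article}"
    )
    summaries = [llm(first_prompt)]
    for _ in range(steps - 1):
        # Each later step sees the article and the previous summary.
        densify_prompt = (
            "Identify 1-3 salient entities from the article that are missing from "
            "the summary, then rewrite the summary to include them without "
            f"exceeding {target_words} words.\n\n"
            f"Article:\n{article}\n\nCurrent summary:\n{summaries[-1]}"
        )
        summaries.append(llm(densify_prompt))
    return summaries
```

Keeping every intermediate summary, rather than only the last, makes it possible to pick the density level readers prefer instead of assuming the densest version is best.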
Decomposed prompting instructs the model to identify key points, draft the summary, and then revise it in separate steps, mimicking the plan-draft-revise workflow of professional summarizers.
Controllability has also moved into the prompt. Users specify length, focus topic, audience, or tone in natural language, and the model adapts without retraining. Aspect-based, query-focused, and multi-document summarization can all be expressed through prompts. Instruction-tuned LLMs generalize well across these controlled settings even when they have never seen the specific format during training.
With Claude offering 200K-token windows and Gemini reaching 1M tokens, entire technical reports, legal briefs, and novels can be processed without chunking, though hierarchical pipelines remain popular when budget or latency matters. Research in 2024 found that even long-context models still exhibit a "lost in the middle" bias: information in the center of the context window is retrieved less reliably than information near the beginning or end, with implications for summarization quality on very long inputs.
Models that handle text, images, audio, and video together can summarize lectures, podcast episodes, and narrated slide decks. Audio-grounded summarization from speech combines automatic speech recognition with abstractive summarization and must handle disfluencies, speaker overlap, and the absence of punctuation. Video summarization adds visual salience estimation and temporal segmentation to the pipeline.
A consistent empirical finding from 2024 and 2025 research is that fine-tuning a small model on domain-specific summarization data often outperforms zero-shot GPT-4 within that domain, while GPT-4 retains large advantages on zero-shot cross-domain transfer. This creates a deployment decision: fine-tuned smaller models are cheaper and more controllable; large zero-shot models are more flexible. Instruction-tuned open-source LLMs (Llama 3, Mistral, Qwen) occupy a middle ground when adapted with domain data.
Several open problems remain active areas of research.
Factual hallucination. Abstractive models, including the strongest LLMs, sometimes introduce details that contradict or do not appear in the source. Studies have found error rates of 20-30% or higher for factual inconsistencies in abstractive summaries generated by strong models on news datasets. The problem is sharpest for long inputs, low-resource domains, and highly compressed summaries where the model must infer or synthesize rather than paraphrase.
Faithfulness versus informativeness. Extractive systems are faithful by construction but often miss synthesis opportunities and produce choppy output. Abstractive systems are more informative but harder to audit. This tradeoff remains unresolved; no current approach achieves both maximum faithfulness and maximum informativeness simultaneously.
Metric reliability. ROUGE rewards lexical overlap and penalizes valid paraphrases, which is one reason model rankings shift when human judges are asked instead. Newer metrics partially address this but bring their own biases, especially when evaluator and generator share an underlying model family. G-Eval using GPT-4 tends to favor GPT-4-generated summaries in pairwise comparisons, an instance of self-preference bias.
Position and length bias. Many neural summarizers learn to favor the first few sentences of the source, an artifact of news training data where important information concentrates near the lead. This "lead bias" was documented and studied systematically in 2019 and remains present in LLM-era models; long-context models exhibit a related "lost in the middle" effect where central content is systematically underweighted.
Evaluation in the LLM era. Reference summaries for benchmarks such as CNN/DailyMail and XSum were written by editors with their own conventions and errors. As LLM-generated summaries surpass these references on human preference studies, ROUGE computed against the original references becomes a weaker signal of progress. The community has moved toward human preference evaluation and LLM-judge evaluation, but these introduce new reliability and reproducibility concerns.
Domain transfer and out-of-distribution robustness. Models trained on news summarization degrade when applied to scientific, legal, or biomedical text. Vocabulary shift, domain-specific entity types, and different rhetorical structures all reduce quality. Fine-tuning on in-domain data is effective but requires annotated corpora that may be expensive to produce in specialized domains.
Multilingual summarization. Most benchmark progress has been on English. Cross-lingual summarization (summarizing a document in one language and producing output in another) and multilingual summarization (operating across many languages) remain harder, with performance gaps between English and low-resource languages that widen for more abstractive settings.