Text summarization is the task of automatically producing a shorter version of one or more documents that preserves the most important information from the original text. It is one of the oldest and most studied problems in natural language processing (NLP), with roots stretching back to the late 1950s. The goal is to reduce the length of a document while retaining its key points, enabling readers to grasp the essence of the content without reading the full source.
Summarization systems are broadly divided into two paradigms: extractive methods, which select and concatenate existing sentences from the source, and abstractive methods, which generate new text that conveys the source content in a condensed form. Advances in deep learning and large language models have dramatically improved summarization quality over the past decade, transforming the field from rule-based heuristics into neural systems that can produce fluent, human-like summaries.
This article covers the history, methods, evaluation metrics, datasets, and applications of text summarization.
The two fundamental approaches to text summarization differ in how they produce output.
Extractive summarization works by identifying the most important sentences (or passages) in a source document and assembling them into a summary. No new words or phrases are generated; the summary consists entirely of text copied from the original. This approach can be thought of as using a highlighter on a document.
Extractive methods typically involve three steps:

- Building an intermediate representation of the input, such as word-frequency statistics or a sentence-similarity graph.
- Scoring each sentence for importance against that representation.
- Selecting the top-scoring sentences, subject to a length budget, and assembling them in their original order.
Extractive systems tend to produce grammatically correct output because each sentence was originally written by a human author. They also carry a lower risk of introducing factual errors, since they do not generate new text. However, extractive summaries can read awkwardly because sentences are pulled from different parts of a document and may lack coherent transitions.
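The extractive pipeline can be illustrated with a minimal word-frequency summarizer (a toy sketch, not any particular published system; the tokenizer and stop-word list are deliberately naive):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it", "that"}

def extractive_summary(text, num_sentences=2):
    """Score sentences by the average frequency of their content words,
    then return the top-scoring sentences in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        tokens = [w for w in re.findall(r"[a-z']+", sentence.lower())
                  if w not in STOPWORDS]
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Reassemble in document order to preserve readability.
    return " ".join(s for s in sentences if s in ranked)
```

Reordering the selected sentences back into document order is what keeps the output readable despite the sentences being chosen independently.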
Abstractive summarization generates new sentences that capture the core meaning of the source text. Rather than copying sentences verbatim, abstractive systems paraphrase, compress, and fuse information from multiple source sentences. This approach more closely mirrors how a human might write a summary.
Abstractive methods produce more fluent and readable summaries, and they can compress information more aggressively by combining facts from different parts of a document into a single sentence. However, they face a greater risk of hallucination, where the model generates content that is not supported by or contradicts the source text. Ensuring that abstractive summaries remain faithful to the source is one of the central challenges in modern summarization research.
| Feature | Extractive | Abstractive |
|---|---|---|
| Output source | Sentences copied from the original text | Newly generated text |
| Grammaticality | Generally high (original sentences) | High with modern neural models |
| Fluency and coherence | May lack smooth transitions | More natural and readable |
| Compression ratio | Limited by sentence boundaries | Can compress more aggressively |
| Factual accuracy | High (copies original text) | Risk of hallucination |
| Computational cost | Generally lower | Higher (requires text generation) |
| Typical methods | TF-IDF, TextRank, LexRank | Seq2seq, pointer-generator, BART, PEGASUS, LLMs |
The field of automatic text summarization began with Hans Peter Luhn, a researcher at IBM, who published "The Automatic Creation of Literature Abstracts" in 1958 in the IBM Journal of Research and Development. This paper is widely regarded as the founding work in automatic summarization.
Luhn's method was based on a straightforward statistical intuition: the most important sentences in a document are those that contain the highest concentration of significant words, where "significant" is determined by word frequency. His algorithm worked as follows:

- Filter out common function words (stop words).
- Count the frequency of the remaining content words and mark words above a frequency threshold as significant.
- Score each sentence by locating clusters of significant words and computing the square of the number of significant words in a cluster divided by the cluster's length.
- Select the highest-scoring sentences to form the abstract.
The system was implemented on an IBM 704 computer. Despite its simplicity, Luhn's frequency-based approach established the core principle that word frequency correlates with topical importance, a concept that continues to influence summarization systems today.
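Luhn's cluster-based sentence score can be sketched as follows (a simplified reconstruction; the significant-word set and gap threshold are supplied by the caller rather than derived from frequency thresholds as in the original):

```python
def luhn_score(sentence_tokens, significant, max_gap=4):
    """Simplified Luhn score: find the densest cluster of significant
    words (allowing gaps of at most `max_gap` insignificant words
    between them) and return significant_count**2 / cluster_length."""
    positions = [i for i, tok in enumerate(sentence_tokens) if tok in significant]
    if not positions:
        return 0.0
    best = 0.0
    start, count, prev = positions[0], 1, positions[0]
    for pos in positions[1:]:
        if pos - prev <= max_gap + 1:
            count += 1          # still inside the current cluster
        else:
            span = prev - start + 1
            best = max(best, count * count / span)
            start, count = pos, 1  # start a new cluster
        prev = pos
    span = prev - start + 1
    return max(best, count * count / span)
```

A sentence dense in significant words scores far higher than one where the same words are scattered, which is the heart of Luhn's intuition.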
H. P. Edmundson extended Luhn's work in 1969 with his paper "New Methods in Automatic Extracting," published in the Journal of the ACM. Edmundson argued that word frequency alone was insufficient and proposed three additional features for determining sentence importance:

- Cue words: the presence of bonus phrases (e.g., "significant," "in conclusion") or stigma phrases that signal importance or unimportance.
- Title and heading words: overlap between a sentence and the words of the document's title and headings.
- Sentence location: position within the document and within paragraphs, since opening and closing sentences tend to carry more weight.
Edmundson's experiments showed that these three features, used in combination, outperformed frequency-based methods alone. His work established the use of multiple surface-level features for sentence scoring, an approach that dominated extractive summarization for decades.
Two influential graph-based methods emerged in 2004, both inspired by Google's PageRank algorithm for ranking web pages.
LexRank, developed by Gunes Erkan and Dragomir Radev and published in the Journal of Artificial Intelligence Research, constructs a graph where each sentence is a node and edges represent cosine similarity between sentence TF-IDF vectors. The algorithm then computes eigenvector centrality (analogous to PageRank) on this graph, and the sentences with the highest centrality scores are selected for the summary. LexRank was particularly effective for multi-document summarization and ranked first in multiple tasks at the DUC 2004 evaluation.
TextRank, proposed by Rada Mihalcea and Paul Tarau in their paper "TextRank: Bringing Order into Texts" at EMNLP 2004, applies a similar graph-based ranking algorithm. Sentences are nodes, and edge weights reflect sentence similarity. The PageRank algorithm is run iteratively on the graph until convergence, and the highest-ranked sentences form the summary. TextRank is unsupervised, requires no training data, and is domain-agnostic. It has been widely adopted in production systems and is available in several popular Python libraries.
Both methods demonstrated that modeling inter-sentence relationships through graph structures could capture document-level importance more effectively than sentence-level scoring alone.
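A TextRank-style ranker can be sketched with plain power iteration over a sentence-similarity graph (a minimal sketch: Jaccard word overlap stands in for the papers' TF-IDF cosine or co-occurrence similarity, and a fixed iteration count replaces a convergence check):

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def textrank(sentences, sim=jaccard, d=0.85, iters=50):
    """Return sentence indices ranked by PageRank-style centrality on
    a similarity-weighted graph (nodes = sentences, edges = sim)."""
    n = len(sentences)
    # Edge weights: pairwise similarity, no self-loops.
    w = [[sim(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) or 1.0 for row in w]  # guard dangling nodes
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - d) / n
                  + d * sum(w[j][i] / out_sum[j] * scores[j] for j in range(n))
                  for i in range(n)]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```

Sentences similar to many other sentences accumulate score from their neighbors, so an off-topic sentence with no edges sinks to the bottom of the ranking.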
Beyond frequency-based and graph-based approaches, several other classical methods contributed to the field:

- Maximal Marginal Relevance (MMR), introduced by Carbonell and Goldstein in 1998, which selects sentences by trading off relevance against redundancy with already-selected content.
- Latent Semantic Analysis (LSA)-based methods, which apply singular value decomposition to a term-sentence matrix and select sentences covering the main latent topics.
- Supervised machine learning approaches, which train classifiers on surface features to predict whether a sentence belongs in a summary.
- Integer linear programming (ILP) formulations, which cast summarization as selecting the subset of sentences that maximizes content coverage under a length constraint.
The application of neural networks to text summarization began in earnest around 2015, driven by advances in sequence-to-sequence (seq2seq) models and attention mechanisms.
Alexander Rush, Sumit Chopra, and Jason Weston published "A Neural Attention Model for Abstractive Sentence Summarization" at EMNLP 2015. This was one of the first works to apply a neural encoder-decoder architecture with attention to summarization. The model used a convolutional encoder and a neural language model decoder with a local attention mechanism that conditioned on the input sentence to generate each word of the summary.
The model was trained on the Gigaword dataset (approximately 3.8 million sentence-summary pairs) and achieved significant improvements over existing baselines on the DUC-2004 shared task. This work demonstrated that data-driven neural approaches could produce abstractive summaries and laid the groundwork for subsequent innovations.
Ramesh Nallapati and colleagues extended neural abstractive summarization in their 2016 paper "Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond." Their model used a bidirectional LSTM encoder and an LSTM decoder with attention, and introduced several innovations to address challenges specific to summarization:

- A feature-rich encoder that augments word embeddings with linguistic features such as part-of-speech tags and named entities.
- A switching generator-pointer mechanism that lets the decoder copy rare and out-of-vocabulary words directly from the source.
- A hierarchical attention mechanism that models importance at both the word and sentence level.
Nallapati et al. also introduced the use of the CNN/DailyMail dataset for abstractive summarization, establishing it as a standard benchmark.
Abigail See, Peter J. Liu, and Christopher D. Manning published "Get To The Point: Summarization with Pointer-Generator Networks" at ACL 2017. This paper addressed two major shortcomings of seq2seq summarization models: their tendency to reproduce factual details inaccurately and their tendency to repeat themselves.
The pointer-generator network combines two mechanisms:

- A generator: a standard attention-based seq2seq decoder that produces a probability distribution over a fixed output vocabulary.
- A pointer: a copy mechanism that uses the attention distribution to copy words directly from the source text, including out-of-vocabulary words.
A learned "generation probability" p_gen at each time step determines the balance between pointing and generating. When p_gen is high, the model generates from the vocabulary; when p_gen is low, it copies from the source.
The paper also introduced a coverage mechanism that keeps track of which parts of the source have already been summarized. The coverage vector, accumulated over previous attention distributions, penalizes the model for attending to the same source positions repeatedly, which significantly reduces repetition in the output.
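The mixing rule and the coverage penalty can be illustrated numerically (a sketch of the published formulas only, not the network; the distributions below are toy values):

```python
def final_distribution(p_gen, vocab_dist, attention, src_tokens):
    """See et al.'s mixing rule:
    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention mass
    on source positions holding w. Source-only words (OOV for the
    vocabulary) receive probability purely from the pointer."""
    final = {w: p_gen * p for w, p in vocab_dist.items()}
    for a, tok in zip(attention, src_tokens):
        final[tok] = final.get(tok, 0.0) + (1 - p_gen) * a
    return final

def coverage_loss(attention, coverage):
    """Coverage penalty at one step: sum_i min(a_t[i], c_t[i]), where
    c_t is the sum of attention distributions from previous steps.
    Re-attending to already-covered positions is penalized."""
    return sum(min(a, c) for a, c in zip(attention, coverage))
```

Note how the out-of-vocabulary source token still receives probability mass through the pointer path, which is exactly what lets the model reproduce rare names and numbers accurately.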
On the CNN/DailyMail dataset, the pointer-generator network with coverage outperformed the previous abstractive state-of-the-art by at least 2 ROUGE points. The architecture became highly influential, and the pointer-copy mechanism was adopted in many subsequent summarization systems.
The rise of pre-trained Transformer models marked a turning point for summarization. By pre-training on large corpora and then fine-tuning on summarization datasets, these models achieved state-of-the-art results across multiple benchmarks.
BERT (Bidirectional Encoder Representations from Transformers), while not originally designed for text generation, was adapted for extractive summarization by Yang Liu in the 2019 paper "Fine-tune BERT for Extractive Summarization" (BERTSum). The approach inserts [CLS] tokens between sentences, fine-tunes BERT to produce sentence-level representations, and then classifies each sentence as summary-worthy or not. BERTSum demonstrated that pre-trained language representations could significantly improve extractive summarization.
Liu and Lapata (2019) later extended BERTSum to abstractive summarization by adding a randomly initialized Transformer decoder on top of the BERT encoder, creating a two-stage extractive-then-abstractive pipeline.
BART (Bidirectional and Auto-Regressive Transformers) was introduced by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer at Facebook AI Research (now Meta AI). The paper was presented at ACL 2020.
BART is a denoising autoencoder that combines a bidirectional encoder (like BERT) with an autoregressive decoder (like GPT). It is pre-trained by corrupting text with various noising functions and then learning to reconstruct the original text. The noising strategies include:

- Token masking, as in BERT.
- Token deletion.
- Text infilling, where spans of text are replaced with a single mask token.
- Sentence permutation, where the order of sentences is shuffled.
- Document rotation, where the document is rotated to begin at a random token.
The best performance was achieved with a combination of sentence permutation and text infilling. BART-large (400 million parameters) achieved state-of-the-art results on CNN/DailyMail summarization. The widely used facebook/bart-large-cnn model on Hugging Face is a BART model fine-tuned on CNN/DailyMail.
PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive Summarization) was introduced by Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu at Google Research. The paper was presented at ICML 2020.
PEGASUS is notable for its summarization-specific pre-training objective called Gap Sentence Generation (GSG). During pre-training, important sentences are removed (masked) from a document, and the model is trained to generate the missing sentences from the remaining context. The intuition is that this task closely resembles abstractive summarization: the model learns to generate a condensed version (the gap sentences) from a larger body of text.
Sentences to mask are selected based on their importance, measured by ROUGE-1 overlap with the rest of the document (called "Ind" strategy in the paper). The model uses a standard Transformer encoder-decoder architecture with 568 million parameters and was pre-trained on a combination of the C4 corpus and HugeNews (a dataset of 1.5 billion news articles).
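The gap-sentence selection step can be sketched as follows (a simplified reconstruction of the "Ind" strategy using whitespace tokenization; the actual implementation operates on subword tokens at corpus scale):

```python
from collections import Counter

def rouge1_f1(candidate_tokens, reference_tokens):
    """ROUGE-1 F1 with clipped unigram counts."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())
    if not overlap:
        return 0.0
    p = overlap / len(candidate_tokens)
    r = overlap / len(reference_tokens)
    return 2 * p * r / (p + r)

def select_gap_sentences(sentences, k=1):
    """Independently score each sentence by ROUGE-1 F1 against the
    remainder of the document and return the indices of the top-k
    (these are the sentences PEGASUS masks and learns to generate)."""
    def score(i):
        rest = [t for j, s in enumerate(sentences) if j != i
                for t in s.split()]
        return rouge1_f1(sentences[i].split(), rest)
    return sorted(range(len(sentences)), key=score, reverse=True)[:k]
```

Sentences that share the most content with the rest of the document are the ones masked, so generating them from context is a close proxy for writing a summary.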
PEGASUS achieved state-of-the-art results on all 12 downstream summarization datasets tested, spanning news, science, stories, instructions, emails, patents, and legislative bills. It was particularly effective in low-resource settings, reaching strong performance with as few as 1,000 fine-tuning examples on some datasets.
T5 (Text-to-Text Transfer Transformer), developed by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu at Google Research, reframes every NLP task as a text-to-text problem. For summarization, the input is prefixed with "summarize:" followed by the document, and the model generates the summary as output text.
T5 was pre-trained on the Colossal Clean Crawled Corpus (C4) and is available in sizes ranging from 60 million to 11 billion parameters. It demonstrated strong performance on CNN/DailyMail and other summarization benchmarks. The text-to-text framework makes T5 highly flexible, enabling it to handle multiple tasks with the same model and loss function.
| Approach | Year | Type | Key Innovation | Authors |
|---|---|---|---|---|
| Luhn's method | 1958 | Extractive | Word frequency-based sentence scoring | H. P. Luhn |
| Edmundson's method | 1969 | Extractive | Cue words, title words, sentence location | H. P. Edmundson |
| MMR | 1998 | Extractive | Relevance-diversity trade-off for sentence selection | Carbonell, Goldstein |
| LexRank | 2004 | Extractive | Graph-based eigenvector centrality | Erkan, Radev |
| TextRank | 2004 | Extractive | PageRank-inspired unsupervised graph ranking | Mihalcea, Tarau |
| Neural attention model | 2015 | Abstractive | First neural encoder-decoder with attention for summarization | Rush, Chopra, Weston |
| Seq2seq RNN summarizer | 2016 | Abstractive | Hierarchical encoder, keyword model, switching mechanism | Nallapati et al. |
| Pointer-generator | 2017 | Hybrid | Copy mechanism with coverage to reduce repetition | See, Liu, Manning |
| BERTSum | 2019 | Extractive/Abstractive | Fine-tuned BERT for sentence-level extraction | Liu, Lapata |
| BART | 2020 | Abstractive | Denoising autoencoder pre-training with text infilling | Lewis et al. |
| PEGASUS | 2020 | Abstractive | Gap Sentence Generation pre-training objective | Zhang et al. |
| T5 | 2020 | Abstractive | Unified text-to-text framework | Raffel et al. |
| LED | 2020 | Abstractive | Longformer sparse attention for long documents | Beltagy, Peters, Cohan |
| GPT-3/4 (few-shot) | 2020/2023 | Abstractive | In-context learning without fine-tuning | OpenAI |
The emergence of large language models (LLMs) such as GPT-3, GPT-4, Claude, Gemini, and LLaMA has fundamentally changed the landscape of text summarization. These models can produce high-quality summaries in zero-shot or few-shot settings, without requiring task-specific fine-tuning.
LLMs can summarize text simply by being prompted with instructions like "Summarize the following article in three sentences." This eliminates the need for collecting summarization training data and fine-tuning a separate model. The quality of LLM-generated summaries often matches or exceeds that of fine-tuned specialized models, particularly for news summarization.
Benchmarking studies bear this out. Goyal, Li, and Durrett, in "News Summarization and Evaluation in the Era of GPT-3" (2022), found that human annotators often preferred GPT-3 summaries over those of fine-tuned models, even though the GPT-3 summaries scored worse on ROUGE. Zhang et al., in "Benchmarking Large Language Models for News Summarization" (TACL 2024), found that summaries from instruction-tuned LLMs were judged comparable to those written by freelance writers, and that reference-based metrics correlate poorly with human preferences over LLM output.
Despite their strengths, LLMs face several challenges in summarization:

- Hallucination: like other abstractive systems, LLMs can generate content that is unsupported by the source, which is especially risky in high-stakes domains.
- Controllability: precise constraints on length and format (e.g., "exactly 50 words") are followed only approximately.
- Long inputs: content in the middle of very long contexts may be underweighted or missed.
- Cost and latency: running large models over high document volumes is far more expensive than running small fine-tuned models.
- Evaluation: standard reference-based metrics were designed for earlier systems and transfer poorly to LLM output.
Evaluating the quality of automatically generated summaries is a challenging problem. Several metrics have been developed, each capturing different aspects of summary quality.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the most widely used family of metrics for summarization evaluation. It was introduced by Chin-Yew Lin in 2004 at the Text Summarization Branches Out workshop. ROUGE measures the overlap between a candidate summary and one or more human-written reference summaries.
The main ROUGE variants are:
| Metric | Description | What It Measures |
|---|---|---|
| ROUGE-1 | Unigram overlap between candidate and reference | Word-level recall |
| ROUGE-2 | Bigram overlap between candidate and reference | Phrase-level recall |
| ROUGE-L | Longest common subsequence (LCS) between candidate and reference | Sentence-level structural similarity |
| ROUGE-W | Weighted longest common subsequence | Rewards consecutive matches more than non-consecutive ones |
| ROUGE-S | Skip-bigram overlap | Captures word pairs that may have gaps between them |
Each ROUGE variant can be reported as precision (what fraction of the candidate n-grams appear in the reference), recall (what fraction of the reference n-grams appear in the candidate), or F1 (the harmonic mean of precision and recall). In summarization research, ROUGE-1, ROUGE-2, and ROUGE-L F1 scores are the most commonly reported.
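The precision, recall, and F1 computation for ROUGE-N can be written out directly (a minimal sketch with clipped n-gram counts; the official ROUGE toolkit adds stemming, stop-word options, and multi-reference support):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) for ROUGE-N. Counter intersection
    clips each n-gram's overlap at its count in either text."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, "the cat sat on the mat" against the reference "the cat is on the mat" shares five of six unigrams and three of five bigrams.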
ROUGE has been the de facto standard for summarization evaluation for over two decades because of its simplicity, reproducibility, and reasonable correlation with human judgments. However, it has significant limitations: it relies purely on surface-level n-gram overlap and cannot account for paraphrasing, semantic equivalence, or factual correctness. A summary that uses different words to express the same meaning may receive a low ROUGE score, while a summary that copies frequent n-grams but misses the main point may score well.
BERTScore was proposed by Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi, published at ICLR 2020. Unlike ROUGE, which relies on exact n-gram matches, BERTScore computes semantic similarity between tokens in the candidate and reference summaries using contextual embeddings from pre-trained BERT models.
For each token in the candidate summary, BERTScore finds the most similar token in the reference summary (and vice versa) using cosine similarity of their BERT embeddings. The precision, recall, and F1 scores are then computed based on these greedy token-level matchings.
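The greedy matching at the heart of BERTScore can be sketched with plain cosine similarity (toy vectors stand in for contextual BERT embeddings; the real metric also applies optional IDF weighting and baseline rescaling):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def bertscore_f1(cand_embs, ref_embs):
    """Greedy token matching: recall matches each reference token to
    its closest candidate token, precision the reverse; F1 is the
    harmonic mean of the two averages."""
    recall = sum(max(cosine(r, c) for c in cand_embs)
                 for r in ref_embs) / len(ref_embs)
    precision = sum(max(cosine(c, r) for r in ref_embs)
                    for c in cand_embs) / len(cand_embs)
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

Because matching is done in embedding space rather than on surface strings, a paraphrase whose tokens embed close to the reference's tokens scores well even with zero exact n-gram overlap.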
BERTScore offers several advantages over ROUGE:
However, BERTScore is computationally more expensive than ROUGE and may not fully capture factual consistency.
| Metric | Description | Strengths |
|---|---|---|
| METEOR | Considers synonyms, stemming, and paraphrase matching beyond exact n-gram overlap | Better handles paraphrasing than ROUGE |
| QuestEval / QAEval | Generates questions from the summary and checks if they can be answered from the source | Measures information coverage |
| FactCC | Uses NLI to check whether the summary is entailed by the source document | Measures factual consistency |
| SummaC | Aggregates NLI scores at the sentence level for consistency checking | Granular factual accuracy |
| UniEval | A unified evaluator that scores coherence, consistency, fluency, and relevance | Multi-dimensional quality assessment |
| G-Eval | Uses LLMs (e.g., GPT-4) as evaluators with chain-of-thought prompting | High correlation with human judgments |
Automatic metrics remain imperfect, and human evaluation is still considered the gold standard for assessing summary quality. Human evaluators typically rate summaries along several dimensions:

- Fluency: whether each sentence is grammatical and readable.
- Coherence: whether the summary is well organized and reads as a connected whole.
- Relevance: whether the summary captures the important content of the source.
- Consistency: whether every statement in the summary is supported by the source.
Human evaluation is expensive and time-consuming, which is why automatic metrics are used for large-scale comparisons and benchmarking.
Summarization research relies on a number of standard datasets for training and evaluating models.
The CNN/DailyMail dataset is the most widely used benchmark for single-document news summarization. It consists of approximately 312,000 news articles paired with multi-sentence summaries (highlight bullets written by journalists). The dataset was originally created by Karl Moritz Hermann et al. (2015) for reading comprehension, and later adapted for summarization by Nallapati et al. (2016).
| Split | Number of Pairs |
|---|---|
| Training | 287,226 |
| Validation | 13,368 |
| Test | 11,490 |
Articles average approximately 781 tokens, and summaries average 56 tokens (roughly 3.75 sentences). Because the reference summaries are multi-sentence highlights, this dataset favors models that can extract or generate multiple key points.
The XSum dataset, introduced by Shashi Narayan, Shay B. Cohen, and Mirella Lapata at EMNLP 2018, consists of BBC news articles paired with single-sentence summaries. It was designed to test a model's ability to perform highly abstractive summarization, as the reference summaries are typically not simple extractions from the article.
| Split | Number of Pairs |
|---|---|
| Training | 204,045 |
| Validation | 11,332 |
| Test | 11,334 |
Articles average 431 words (approximately 20 sentences), while summaries average just 23 words. XSum is considered more challenging than CNN/DailyMail because it requires significant abstraction and reformulation of the source content.
| Dataset | Domain | Summary Style | Size (Train) | Average Doc Length |
|---|---|---|---|---|
| Gigaword | News headlines | Single-sentence headline | 3.8M | ~31 tokens |
| SAMSum | Dialogue | Abstractive conversation summary | 14,732 | ~94 tokens (dialogue) |
| PubMed | Scientific papers | Abstract from body text | 133,215 | ~3,000 tokens |
| arXiv | Scientific papers | Abstract from body text | 215,913 | ~6,000 tokens |
| BillSum | US Congressional bills | Legislative summary | 18,949 | ~1,800 tokens |
| BigPatent | Patent documents | Patent abstract | 1.3M | ~3,500 tokens |
| WikiHow | How-to articles | Step-by-step summary | 157,252 | ~580 tokens |
| Multi-News | News (multi-document) | Multi-document summary | 44,972 | ~2,100 tokens (total) |
| BookSum | Books | Chapter/book-level summary | 12,630 | ~5,000+ tokens |
The Document Understanding Conference (DUC), organized by NIST from 2001 to 2007, was the primary evaluation venue for summarization research in the pre-neural era. DUC introduced standard tasks for single-document and multi-document summarization with human-evaluated benchmarks. In 2008, DUC transitioned into the Text Analysis Conference (TAC), which continued to host summarization evaluation tracks.
Multi-document summarization (MDS) involves generating a single summary from a collection of related documents. This task is more complex than single-document summarization because the system must handle several additional challenges:

- Redundancy: related documents repeat the same information, which must be recognized and collapsed rather than restated.
- Contradiction: sources may disagree, and the summary must reconcile or acknowledge conflicting reports.
- Ordering: content drawn from different documents must be arranged into a single coherent narrative.
- Scale: total input length grows with the number of documents, compounding the long-input problem.
Classical approaches to MDS include cluster-based methods (grouping similar sentences and selecting representatives from each cluster), graph-based methods like LexRank (which naturally extend to multi-document settings), and optimization-based methods using ILP.
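The redundancy control used across many of these classical pipelines is Maximal Marginal Relevance (MMR; Carbonell and Goldstein, 1998), which greedily picks the sentence that is most relevant yet least redundant with what is already selected. A minimal sketch, with Jaccard word overlap standing in for whatever similarity function a system uses:

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def mmr_summarize(sentences, query, k=2, lam=0.7, sim=jaccard):
    """Greedy MMR selection:
    mmr(c) = lam * sim(c, query) - (1 - lam) * max_{s in selected} sim(c, s)."""
    selected, pool = [], list(sentences)
    while pool and len(selected) < k:
        def mmr(c):
            redundancy = max((sim(c, s) for s in selected), default=0.0)
            return lam * sim(c, query) - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected
```

The redundancy term is what makes MMR effective for multi-document input: a near-duplicate of an already-selected sentence is penalized even if it is highly relevant on its own.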
Neural approaches to MDS have explored hierarchical attention mechanisms, where local attention captures within-document structure and global attention captures cross-document relationships. More recent work leverages LLMs by concatenating multiple documents (or their summaries) into a single prompt and generating a unified summary.
Summarizing long documents, such as scientific papers, legal filings, books, or meeting transcripts, poses distinct challenges beyond those of standard news summarization:

- Input length: documents can far exceed the context windows of standard Transformer models.
- Computational cost: full self-attention scales quadratically with input length.
- Dispersed salience: key information is spread across distant sections rather than concentrated at the beginning.
- Discourse structure: long documents have explicit structure (sections, chapters) that flat models ignore.
Several strategies have been developed to handle long documents:
Sparse attention models like the Longformer Encoder-Decoder (LED), introduced by Beltagy, Peters, and Cohan (2020), use a combination of local sliding window attention and global attention on selected tokens. LED can process inputs up to 16,384 tokens and was evaluated on the arXiv summarization dataset, achieving ROUGE-1/2/L scores of 46.63/19.62/41.83.
Hierarchical approaches process documents at multiple levels of granularity, for example by first encoding sentences, then aggregating sentence representations into document-level representations. This captures both local and global document structure.
Divide-and-conquer strategies split long documents into manageable chunks, summarize each chunk independently, and then combine the chunk summaries into a final summary. This approach is often called hierarchical or recursive summarization.
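The divide-and-conquer strategy can be written as a short recursion (a sketch: the `summarize` argument is a stand-in for any single-chunk summarizer, such as a fine-tuned model or an LLM call, and real systems use token-aware chunking with overlap rather than whitespace splitting):

```python
def chunk(tokens, size):
    """Split a token list into consecutive chunks of at most `size`."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def recursive_summarize(text, summarize, max_tokens=512):
    """Summarize chunks independently, concatenate the chunk summaries,
    and recurse until the input fits in a single pass."""
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return summarize(text)
    partials = [summarize(" ".join(c)) for c in chunk(tokens, max_tokens)]
    return recursive_summarize(" ".join(partials), summarize, max_tokens)
```

Termination relies on `summarize` producing output shorter than its input; in practice the per-chunk summary length is capped well below the chunk size.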
LLMs with extended context windows offer a more straightforward solution for some use cases. Models with context windows of 100K tokens or more can process many long documents in a single pass, though they may still struggle with content in the middle of very long inputs.
One of the most pressing challenges in abstractive summarization is ensuring that generated summaries are faithful to the source document. A faithful summary contains only information that is supported by the source text; it does not add, contradict, or distort any facts.
Research has identified two main types of hallucination in summarization:

- Intrinsic hallucination: the summary distorts or contradicts information that is present in the source document.
- Extrinsic hallucination: the summary introduces information that does not appear in the source at all, whether or not it happens to be true.
Studies have found that hallucination is widespread in neural summarization systems. Cao et al. (2018) estimated that roughly 30% of the summaries produced by state-of-the-art abstractive systems contained factual inconsistencies. Maynez et al. (2020), in "On Faithfulness and Factuality in Abstractive Summarization" (ACL 2020), found the problem especially severe on XSum, where the large majority of model-generated summaries contained some form of faithfulness error, largely because XSum's highly abstractive reference summaries encourage models to generate content that goes beyond the source.
Several approaches have been developed to detect and reduce hallucination:

- Automatic detection, using natural language inference models (as in FactCC and SummaC) or question generation and answering (as in QuestEval) to check summaries against their sources.
- Training-time interventions, such as filtering noisy or unfaithful reference summaries from training data and optimizing factuality-aware objectives.
- Decoding and post-processing, such as re-ranking candidate summaries by a faithfulness score or post-editing unsupported spans.
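A cheap diagnostic often used alongside heavier NLI- and QA-based checks is the novel n-gram ratio: the fraction of summary n-grams that never appear in the source. High novelty does not prove hallucination, but it flags the summaries that stray furthest from the source text. A heuristic sketch:

```python
def novel_ngram_ratio(summary, source, n=2):
    """Fraction of summary n-grams absent from the source; a rough
    proxy for how much the summary goes beyond the source text."""
    def grams(text):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    summ, src = grams(summary), grams(source)
    return len(summ - src) / len(summ) if summ else 0.0
```

A purely extractive summary scores 0.0; summaries that introduce unsupported phrasing score progressively higher and can be routed to a stronger faithfulness checker.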
Text summarization is deployed across a wide range of industries and applications.
News organizations use summarization to generate article headlines, produce news digests, and create social media posts from longer articles. Automated summarization enables rapid processing of breaking news and helps readers stay informed without reading full articles. News Corp Australia has used generative AI to produce thousands of local news stories weekly.
Legal professionals use summarization to condense lengthy court rulings, contracts, depositions, and regulatory filings. Summarization tools can identify key testimony, track themes across multiple depositions, and highlight risky clauses in contracts. This reduces the time lawyers spend on document review, which is particularly valuable in e-discovery and due diligence processes.
In healthcare, summarization helps clinicians manage the growing volume of information in electronic health records (EHRs). Systems can summarize patient histories, discharge notes, and medical literature to support clinical decision-making. Summarization is also used in systematic reviews, where researchers must synthesize findings from hundreds of studies on a particular medical question.
Researchers use summarization to keep up with the rapidly growing scientific literature. Tools like Semantic Scholar and others provide automated summaries of research papers, helping scientists identify relevant work more efficiently. Summarization can also help with generating literature reviews and identifying gaps in existing research.
Businesses use summarization for processing customer feedback, summarizing meeting transcripts, generating executive summaries of reports, and condensing email threads. Financial analysts use summarization to extract key information from earnings calls, SEC filings, and market reports.
Summarization tools help students condense textbook chapters, research papers, and lecture materials. They can also assist educators in creating study guides and review materials.
Search engines use summarization to generate snippets that appear in search results, providing users with a preview of each result's content. This form of query-focused summarization selects or generates text from the document that is most relevant to the user's search query.
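Query-focused snippet extraction can be sketched as a sliding-window search for the region of the document densest in query terms (a toy sketch; production snippet generators use ranking features well beyond raw term counts):

```python
def best_snippet(document, query, window=10):
    """Return the contiguous window of tokens containing the most
    query terms: a simple query-focused extractive snippet."""
    q = set(query.lower().split())
    toks = document.split()
    best_i, best_score = 0, -1
    for i in range(max(1, len(toks) - window + 1)):
        score = sum(1 for t in toks[i:i + window] if t.lower() in q)
        if score > best_score:
            best_i, best_score = i, score
    return " ".join(toks[best_i:best_i + window])
```

Unlike generic summarization, the selection criterion here depends on the query, so the same document yields different snippets for different searches.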
Modern systems increasingly combine extractive and abstractive approaches. A common pipeline first uses an extractive step to select the most relevant sentences, then applies an abstractive model to rewrite and compress the selected content. This hybrid approach reduces hallucination risk while maintaining fluent output.
Controllable summarization allows users to specify desired attributes of the output, such as length, reading level, focus topic, or style. Prompt engineering with LLMs has made controllable summarization much more accessible.
Cross-lingual summarization, where the source and summary are in different languages, is a growing area of research. Models like mBART and mT5, which are multilingual variants of BART and T5, have been applied to multilingual summarization tasks. LLMs with strong multilingual capabilities further expand the possibilities for cross-lingual summarization.
Despite decades of work, evaluating summarization quality remains an open problem. Existing metrics like ROUGE do not fully capture semantic equivalence, factual consistency, or the pragmatic utility of a summary. The development of better evaluation methods, including LLM-based evaluators like G-Eval, is an active area of research.
As summarization systems are deployed in high-stakes domains like healthcare, law, and finance, ensuring factual faithfulness has become increasingly important. Research continues on better methods for detecting and preventing hallucination, with a trend toward using NLI-based and QA-based automatic evaluation for faithfulness.