Text summarization is the task of automatically producing a shorter version of one or more documents that preserves the most important information from the original text. It is one of the oldest and most studied problems in natural language processing (NLP), with roots stretching back to the late 1950s. The goal is to reduce the length of a document while retaining its key points, enabling readers to grasp the essence of the content without reading the full source.
Summarization systems are broadly divided into two paradigms: extractive methods, which select and concatenate existing sentences from the source, and abstractive methods, which generate new text that conveys the source content in a condensed form. Advances in deep learning and large language models have dramatically improved summarization quality over the past decade, transforming the field from rule-based heuristics into neural systems that can produce fluent, human-like summaries.
This article covers the history, methods, evaluation metrics, datasets, and applications of text summarization.
The two fundamental approaches to text summarization differ in how they produce output.
Extractive summarization works by identifying the most important sentences (or passages) in a source document and assembling them into a summary. No new words or phrases are generated; the summary consists entirely of text copied from the original. This approach can be thought of as using a highlighter on a document.
Extractive methods typically involve three steps:

- Building an intermediate representation of the input, such as word-frequency statistics or a sentence-similarity graph.
- Scoring each sentence for importance against that representation.
- Selecting the top-scoring sentences, subject to a length budget, and assembling them in their original order.
Extractive systems tend to produce grammatically correct output because each sentence was originally written by a human author. They also carry a lower risk of introducing factual errors, since they do not generate new text. However, extractive summaries can read awkwardly because sentences are pulled from different parts of a document and may lack coherent transitions.
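The extractive pipeline can be illustrated with a minimal word-frequency summarizer (a toy sketch, not any particular published system; the tokenizer and stop-word list are deliberately naive):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it", "that"}

def extractive_summary(text, num_sentences=2):
    """Score sentences by the average frequency of their content words,
    then return the top-scoring sentences in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        tokens = [w for w in re.findall(r"[a-z']+", sentence.lower())
                  if w not in STOPWORDS]
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Reassemble in document order to preserve readability.
    return " ".join(s for s in sentences if s in ranked)
```

Reordering the selected sentences back into document order is what keeps the output readable despite the sentences being chosen independently.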
Abstractive summarization generates new sentences that capture the core meaning of the source text. Rather than copying sentences verbatim, abstractive systems paraphrase, compress, and fuse information from multiple source sentences. This approach more closely mirrors how a human might write a summary.
Abstractive methods produce more fluent and readable summaries, and they can compress information more aggressively by combining facts from different parts of a document into a single sentence. However, they face a greater risk of hallucination, where the model generates content that is not supported by or contradicts the source text. Ensuring that abstractive summaries remain faithful to the source is one of the central challenges in modern summarization research.
| Feature | Extractive | Abstractive |
|---|---|---|
| Output source | Sentences copied from the original text | Newly generated text |
| Grammaticality | Generally high (original sentences) | High with modern neural models |
| Fluency and coherence | May lack smooth transitions | More natural and readable |
| Compression ratio | Limited by sentence boundaries | Can compress more aggressively |
| Factual accuracy | High (copies original text) | Risk of hallucination |
| Computational cost | Generally lower | Higher (requires text generation) |
| Typical methods | TF-IDF, TextRank, LexRank | Seq2seq, pointer-generator, BART, PEGASUS, LLMs |
The field of automatic text summarization began with Hans Peter Luhn, a researcher at IBM, who published "The Automatic Creation of Literature Abstracts" in 1958 in the IBM Journal of Research and Development. This paper is widely regarded as the founding work in automatic summarization.
Luhn's method was based on a straightforward statistical intuition: the most important sentences in a document are those that contain the highest concentration of significant words, where "significant" is determined by word frequency. His algorithm worked as follows:

- Filter out common function words (stop words).
- Count the frequency of the remaining content words and mark words above a frequency threshold as significant.
- Score each sentence by locating clusters of significant words and computing the square of the number of significant words in a cluster divided by the cluster's length.
- Select the highest-scoring sentences to form the abstract.
The system was implemented on an IBM 704 computer. Despite its simplicity, Luhn's frequency-based approach established the core principle that word frequency correlates with topical importance, a concept that continues to influence summarization systems today.
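Luhn's cluster-based sentence score can be sketched as follows (a simplified reconstruction; the significant-word set and gap threshold are supplied by the caller rather than derived from frequency thresholds as in the original):

```python
def luhn_score(sentence_tokens, significant, max_gap=4):
    """Simplified Luhn score: find the densest cluster of significant
    words (allowing gaps of at most `max_gap` insignificant words
    between them) and return significant_count**2 / cluster_length."""
    positions = [i for i, tok in enumerate(sentence_tokens) if tok in significant]
    if not positions:
        return 0.0
    best = 0.0
    start, count, prev = positions[0], 1, positions[0]
    for pos in positions[1:]:
        if pos - prev <= max_gap + 1:
            count += 1          # still inside the current cluster
        else:
            span = prev - start + 1
            best = max(best, count * count / span)
            start, count = pos, 1  # start a new cluster
        prev = pos
    span = prev - start + 1
    return max(best, count * count / span)
```

A sentence dense in significant words scores far higher than one where the same words are scattered, which is the heart of Luhn's intuition.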
H. P. Edmundson extended Luhn's work in 1969 with his paper "New Methods in Automatic Extracting," published in the Journal of the ACM. Edmundson argued that word frequency alone was insufficient and proposed three additional features for determining sentence importance:

- Cue words: the presence of bonus phrases (e.g., "significant," "in conclusion") or stigma phrases that signal importance or unimportance.
- Title and heading words: overlap between a sentence and the words of the document's title and headings.
- Sentence location: position within the document and within paragraphs, since opening and closing sentences tend to carry more weight.
Edmundson's experiments showed that these three features, used in combination, outperformed frequency-based methods alone. His work established the use of multiple surface-level features for sentence scoring, an approach that dominated extractive summarization for decades.
Two influential graph-based methods emerged in 2004, both inspired by Google's PageRank algorithm for ranking web pages.
LexRank, developed by Gunes Erkan and Dragomir Radev and published in the Journal of Artificial Intelligence Research, constructs a graph where each sentence is a node and edges represent cosine similarity between sentence TF-IDF vectors. The algorithm then computes eigenvector centrality (analogous to PageRank) on this graph, and the sentences with the highest centrality scores are selected for the summary. LexRank was particularly effective for multi-document summarization and ranked first in multiple tasks at the DUC 2004 evaluation.
TextRank, proposed by Rada Mihalcea and Paul Tarau in their paper "TextRank: Bringing Order into Texts" at EMNLP 2004, applies a similar graph-based ranking algorithm. Sentences are nodes, and edge weights reflect sentence similarity. The PageRank algorithm is run iteratively on the graph until convergence, and the highest-ranked sentences form the summary. TextRank is unsupervised, requires no training data, and is domain-agnostic. It has been widely adopted in production systems and is available in several popular Python libraries.
Both methods demonstrated that modeling inter-sentence relationships through graph structures could capture document-level importance more effectively than sentence-level scoring alone.
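A TextRank-style ranker can be sketched with plain power iteration over a sentence-similarity graph (a minimal sketch: Jaccard word overlap stands in for the papers' TF-IDF cosine or co-occurrence similarity, and a fixed iteration count replaces a convergence check):

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def textrank(sentences, sim=jaccard, d=0.85, iters=50):
    """Return sentence indices ranked by PageRank-style centrality on
    a similarity-weighted graph (nodes = sentences, edges = sim)."""
    n = len(sentences)
    # Edge weights: pairwise similarity, no self-loops.
    w = [[sim(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) or 1.0 for row in w]  # guard dangling nodes
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - d) / n
                  + d * sum(w[j][i] / out_sum[j] * scores[j] for j in range(n))
                  for i in range(n)]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```

Sentences similar to many other sentences accumulate score from their neighbors, so an off-topic sentence with no edges sinks to the bottom of the ranking.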
Beyond frequency-based and graph-based approaches, several other classical methods contributed to the field:

- Maximal Marginal Relevance (MMR), introduced by Carbonell and Goldstein in 1998, which selects sentences by trading off relevance against redundancy with already-selected content.
- Latent Semantic Analysis (LSA)-based methods, which apply singular value decomposition to a term-sentence matrix and select sentences covering the main latent topics.
- Supervised machine learning approaches, which train classifiers on surface features to predict whether a sentence belongs in a summary.
- Integer linear programming (ILP) formulations, which cast summarization as selecting the subset of sentences that maximizes content coverage under a length constraint.
The application of neural networks to text summarization began in earnest around 2015, driven by advances in sequence-to-sequence (seq2seq) models and attention mechanisms.
Alexander Rush, Sumit Chopra, and Jason Weston published "A Neural Attention Model for Abstractive Sentence Summarization" at EMNLP 2015. This was one of the first works to apply a neural encoder-decoder architecture with attention to summarization. The model used a convolutional encoder and a neural language model decoder with a local attention mechanism that conditioned on the input sentence to generate each word of the summary.
The model was trained on the Gigaword dataset (approximately 3.8 million sentence-summary pairs) and achieved significant improvements over existing baselines on the DUC-2004 shared task. This work demonstrated that data-driven neural approaches could produce abstractive summaries and laid the groundwork for subsequent innovations.
Ramesh Nallapati and colleagues extended neural abstractive summarization in their 2016 paper "Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond." Their model used a bidirectional LSTM encoder and an LSTM decoder with attention, and introduced several innovations to address challenges specific to summarization:

- A feature-rich encoder that augments word embeddings with linguistic features such as part-of-speech tags and named entities.
- A switching generator-pointer mechanism that lets the decoder copy rare and out-of-vocabulary words directly from the source.
- A hierarchical attention mechanism that models importance at both the word and sentence level.
Nallapati et al. also introduced the use of the CNN/DailyMail dataset for abstractive summarization, establishing it as a standard benchmark.
Abigail See, Peter J. Liu, and Christopher D. Manning published "Get To The Point: Summarization with Pointer-Generator Networks" at ACL 2017. This paper addressed two major shortcomings of seq2seq summarization models: their tendency to reproduce factual details inaccurately and their tendency to repeat themselves.
The pointer-generator network combines two mechanisms:

- A generator: a standard attention-based seq2seq decoder that produces a probability distribution over a fixed output vocabulary.
- A pointer: a copy mechanism that uses the attention distribution to copy words directly from the source text, including out-of-vocabulary words.
A learned "generation probability" p_gen at each time step determines the balance between pointing and generating. When p_gen is high, the model generates from the vocabulary; when p_gen is low, it copies from the source.
The paper also introduced a coverage mechanism that keeps track of which parts of the source have already been summarized. The coverage vector, accumulated over previous attention distributions, penalizes the model for attending to the same source positions repeatedly, which significantly reduces repetition in the output.
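The mixing rule and the coverage penalty can be illustrated numerically (a sketch of the published formulas only, not the network; the distributions below are toy values):

```python
def final_distribution(p_gen, vocab_dist, attention, src_tokens):
    """See et al.'s mixing rule:
    P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum of attention mass
    on source positions holding w. Source-only words (OOV for the
    vocabulary) receive probability purely from the pointer."""
    final = {w: p_gen * p for w, p in vocab_dist.items()}
    for a, tok in zip(attention, src_tokens):
        final[tok] = final.get(tok, 0.0) + (1 - p_gen) * a
    return final

def coverage_loss(attention, coverage):
    """Coverage penalty at one step: sum_i min(a_t[i], c_t[i]), where
    c_t is the sum of attention distributions from previous steps.
    Re-attending to already-covered positions is penalized."""
    return sum(min(a, c) for a, c in zip(attention, coverage))
```

Note how the out-of-vocabulary source token still receives probability mass through the pointer path, which is exactly what lets the model reproduce rare names and numbers accurately.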
On the CNN/DailyMail dataset, the pointer-generator network with coverage outperformed the previous abstractive state-of-the-art by at least 2 ROUGE points. The architecture became highly influential, and the pointer-copy mechanism was adopted in many subsequent summarization systems.
The rise of pre-trained Transformer models marked a turning point for summarization. By pre-training on large corpora and then fine-tuning on summarization datasets, these models achieved state-of-the-art results across multiple benchmarks.
BERT (Bidirectional Encoder Representations from Transformers), while not originally designed for text generation, was adapted for extractive summarization by Yang Liu in the 2019 paper "Fine-tune BERT for Extractive Summarization" (BERTSum). The approach inserts [CLS] tokens between sentences, fine-tunes BERT to produce sentence-level representations, and then classifies each sentence as summary-worthy or not. BERTSum demonstrated that pre-trained language representations could significantly improve extractive summarization.
Liu and Lapata (2019) later extended BERTSum to abstractive summarization by adding a randomly initialized Transformer decoder on top of the BERT encoder, creating a two-stage extractive-then-abstractive pipeline.
BART (Bidirectional and Auto-Regressive Transformers) was introduced by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer at Facebook AI Research (now Meta AI). The paper was presented at ACL 2020.
BART is a denoising autoencoder that combines a bidirectional encoder (like BERT) with an autoregressive decoder (like GPT). It is pre-trained by corrupting text with various noising functions and then learning to reconstruct the original text. The noising strategies include:

- Token masking, as in BERT.
- Token deletion.
- Text infilling, where spans of text are replaced with a single mask token.
- Sentence permutation, where the order of sentences is shuffled.
- Document rotation, where the document is rotated to begin at a random token.
The best performance was achieved with a combination of sentence permutation and text infilling. BART-large (400 million parameters) achieved state-of-the-art results on CNN/DailyMail summarization. The widely used facebook/bart-large-cnn model on Hugging Face is a BART model fine-tuned on CNN/DailyMail.
PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive Summarization) was introduced by Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu at Google Research. The paper was presented at ICML 2020.
PEGASUS is notable for its summarization-specific pre-training objective called Gap Sentence Generation (GSG). During pre-training, important sentences are removed (masked) from a document, and the model is trained to generate the missing sentences from the remaining context. The intuition is that this task closely resembles abstractive summarization: the model learns to generate a condensed version (the gap sentences) from a larger body of text.
Sentences to mask are selected based on their importance, measured by ROUGE-1 overlap with the rest of the document (called "Ind" strategy in the paper). The model uses a standard Transformer encoder-decoder architecture with 568 million parameters and was pre-trained on a combination of the C4 corpus and HugeNews (a dataset of 1.5 billion news articles).
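The gap-sentence selection step can be sketched as follows (a simplified reconstruction of the "Ind" strategy using whitespace tokenization; the actual implementation operates on subword tokens at corpus scale):

```python
from collections import Counter

def rouge1_f1(candidate_tokens, reference_tokens):
    """ROUGE-1 F1 with clipped unigram counts."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum((cand & ref).values())
    if not overlap:
        return 0.0
    p = overlap / len(candidate_tokens)
    r = overlap / len(reference_tokens)
    return 2 * p * r / (p + r)

def select_gap_sentences(sentences, k=1):
    """Independently score each sentence by ROUGE-1 F1 against the
    remainder of the document and return the indices of the top-k
    (these are the sentences PEGASUS masks and learns to generate)."""
    def score(i):
        rest = [t for j, s in enumerate(sentences) if j != i
                for t in s.split()]
        return rouge1_f1(sentences[i].split(), rest)
    return sorted(range(len(sentences)), key=score, reverse=True)[:k]
```

Sentences that share the most content with the rest of the document are the ones masked, so generating them from context is a close proxy for writing a summary.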
PEGASUS achieved state-of-the-art results on all 12 downstream summarization datasets tested, spanning news, science, stories, instructions, emails, patents, and legislative bills. It was particularly effective in low-resource settings, reaching strong performance with as few as 1,000 fine-tuning examples on some datasets.
T5 (Text-to-Text Transfer Transformer), developed by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu at Google Research, reframes every NLP task as a text-to-text problem. For summarization, the input is prefixed with "summarize:" followed by the document, and the model generates the summary as output text.
T5 was pre-trained on the Colossal Clean Crawled Corpus (C4) and is available in sizes ranging from 60 million to 11 billion parameters. It demonstrated strong performance on CNN/DailyMail and other summarization benchmarks. The text-to-text framework makes T5 highly flexible, enabling it to handle multiple tasks with the same model and loss function.
| Approach | Year | Type | Key Innovation | Authors |
|---|---|---|---|---|
| Luhn's method | 1958 | Extractive | Word frequency-based sentence scoring | H. P. Luhn |
| Edmundson's method | 1969 | Extractive | Cue words, title words, sentence location | H. P. Edmundson |
| MMR | 1998 | Extractive | Relevance-diversity trade-off for sentence selection | Carbonell, Goldstein |
| LexRank | 2004 | Extractive | Graph-based eigenvector centrality | Erkan, Radev |
| TextRank | 2004 | Extractive | PageRank-inspired unsupervised graph ranking | Mihalcea, Tarau |
| Neural attention model | 2015 | Abstractive | First neural encoder-decoder with attention for summarization | Rush, Chopra, Weston |
| Seq2seq RNN summarizer | 2016 | Abstractive | Hierarchical encoder, keyword model, switching mechanism | Nallapati et al. |
| Pointer-generator | 2017 | Hybrid | Copy mechanism with coverage to reduce repetition | See, Liu, Manning |
| BERTSum | 2019 | Extractive/Abstractive | Fine-tuned BERT for sentence-level extraction | Liu, Lapata |
| BART | 2020 | Abstractive | Denoising autoencoder pre-training with text infilling | Lewis et al. |
| PEGASUS | 2020 | Abstractive | Gap Sentence Generation pre-training objective | Zhang et al. |
| T5 | 2020 | Abstractive | Unified text-to-text framework | Raffel et al. |
| LED | 2020 | Abstractive | Longformer sparse attention for long documents | Beltagy, Peters, Cohan |
| GPT-3/4 (few-shot) | 2020/2023 | Abstractive | In-context learning without fine-tuning | OpenAI |
The emergence of large language models (LLMs) such as GPT-3, GPT-4, Claude, Gemini, and LLaMA has fundamentally changed the landscape of text summarization. These models can produce high-quality summaries in zero-shot or few-shot settings, without requiring task-specific fine-tuning.
LLMs can summarize text simply by being prompted with instructions like "Summarize the following article in three sentences." This eliminates the need for collecting summarization training data and fine-tuning a separate model. The quality of LLM-generated summaries often matches or exceeds that of fine-tuned specialized models, particularly for news summarization.
Benchmarking studies bear this out. Goyal, Li, and Durrett, in "News Summarization and Evaluation in the Era of GPT-3" (2022), found that human annotators often preferred GPT-3 summaries over those of fine-tuned models, even though the GPT-3 summaries scored worse on ROUGE. Zhang et al., in "Benchmarking Large Language Models for News Summarization" (TACL 2024), found that summaries from instruction-tuned LLMs were judged comparable to those written by freelance writers, and that reference-based metrics correlate poorly with human preferences over LLM output.
Despite their strengths, LLMs face several challenges in summarization:

- Hallucination: like other abstractive systems, LLMs can generate content that is unsupported by the source, which is especially risky in high-stakes domains.
- Controllability: precise constraints on length and format (e.g., "exactly 50 words") are followed only approximately.
- Long inputs: content in the middle of very long contexts may be underweighted or missed.
- Cost and latency: running large models over high document volumes is far more expensive than running small fine-tuned models.
- Evaluation: standard reference-based metrics were designed for earlier systems and transfer poorly to LLM output.
Evaluating the quality of automatically generated summaries is a challenging problem. Several metrics have been developed, each capturing different aspects of summary quality.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the most widely used family of metrics for summarization evaluation. It was introduced by Chin-Yew Lin in 2004 at the Text Summarization Branches Out workshop. ROUGE measures the overlap between a candidate summary and one or more human-written reference summaries.
The main ROUGE variants are:
| Metric | Description | What It Measures |
|---|---|---|
| ROUGE-1 | Unigram overlap between candidate and reference | Word-level recall |
| ROUGE-2 | Bigram overlap between candidate and reference | Phrase-level recall |
| ROUGE-L | Longest common subsequence (LCS) between candidate and reference | Sentence-level structural similarity |
| ROUGE-W | Weighted longest common subsequence | Rewards consecutive matches more than non-consecutive ones |
| ROUGE-S | Skip-bigram overlap | Captures word pairs that may have gaps between them |
Each ROUGE variant can be reported as precision (what fraction of the candidate n-grams appear in the reference), recall (what fraction of the reference n-grams appear in the candidate), or F1 (the harmonic mean of precision and recall). In summarization research, ROUGE-1, ROUGE-2, and ROUGE-L F1 scores are the most commonly reported.
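The precision, recall, and F1 computation for ROUGE-N can be written out directly (a minimal sketch with clipped n-gram counts; the official ROUGE toolkit adds stemming, stop-word options, and multi-reference support):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f1) for ROUGE-N. Counter intersection
    clips each n-gram's overlap at its count in either text."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())
    p = overlap / max(sum(cand.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, "the cat sat on the mat" against the reference "the cat is on the mat" shares five of six unigrams and three of five bigrams.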
ROUGE has been the de facto standard for summarization evaluation for over two decades because of its simplicity, reproducibility, and reasonable correlation with human judgments. However, it has significant limitations: it relies purely on surface-level n-gram overlap and cannot account for paraphrasing, semantic equivalence, or factual correctness. A summary that uses different words to express the same meaning may receive a low ROUGE score, while a summary that copies frequent n-grams but misses the main point may score well.
BERTScore was proposed by Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi, published at ICLR 2020. Unlike ROUGE, which relies on exact n-gram matches, BERTScore computes semantic similarity between tokens in the candidate and reference summaries using contextual embeddings from pre-trained BERT models.
For each token in the candidate summary, BERTScore finds the most similar token in the reference summary (and vice versa) using cosine similarity of their BERT embeddings. The precision, recall, and F1 scores are then computed based on these greedy token-level matchings.
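The greedy matching at the heart of BERTScore can be sketched with plain cosine similarity (toy vectors stand in for contextual BERT embeddings; the real metric also applies optional IDF weighting and baseline rescaling):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def bertscore_f1(cand_embs, ref_embs):
    """Greedy token matching: recall matches each reference token to
    its closest candidate token, precision the reverse; F1 is the
    harmonic mean of the two averages."""
    recall = sum(max(cosine(r, c) for c in cand_embs)
                 for r in ref_embs) / len(ref_embs)
    precision = sum(max(cosine(c, r) for r in ref_embs)
                    for c in cand_embs) / len(cand_embs)
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

Because matching is done in embedding space rather than on surface strings, a paraphrase whose tokens embed close to the reference's tokens scores well even with zero exact n-gram overlap.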
BERTScore offers several advantages over ROUGE:
However, BERTScore is computationally more expensive than ROUGE and may not fully capture factual consistency.
| Metric | Description | Strengths |
|---|---|---|
| METEOR | Considers synonyms, stemming, and paraphrase matching beyond exact n-gram overlap | Better handles paraphrasing than ROUGE |
| QuestEval / QAEval | Generates questions from the summary and checks if they can be answered from the source | Measures information coverage |
| FactCC | Uses NLI to check whether the summary is entailed by the source document | Measures factual consistency |
| SummaC | Aggregates NLI scores at the sentence level for consistency checking | Granular factual accuracy |
| UniEval | A unified evaluator that scores coherence, consistency, fluency, and relevance | Multi-dimensional quality assessment |
| G-Eval | Uses LLMs (e.g., GPT-4) as evaluators with chain-of-thought prompting | High correlation with human judgments |
Automatic metrics remain imperfect, and human evaluation is still considered the gold standard for assessing summary quality. Human evaluators typically rate summaries along several dimensions:

- Fluency: whether each sentence is grammatical and readable.
- Coherence: whether the summary is well organized and reads as a connected whole.
- Relevance: whether the summary captures the important content of the source.
- Consistency: whether every statement in the summary is supported by the source.
Human evaluation is expensive and time-consuming, which is why automatic metrics are used for large-scale comparisons and benchmarking.
Summarization research relies on a number of standard datasets for training and evaluating models.
The CNN/DailyMail dataset is the most widely used benchmark for single-document news summarization. It consists of approximately 312,000 news articles paired with multi-sentence summaries (highlight bullets written by journalists). The dataset was originally created by Karl Moritz Hermann et al. (2015) for reading comprehension, and later adapted for summarization by Nallapati et al. (2016).
| Split | Number of Pairs |
|---|---|
| Training | 287,226 |
| Validation | 13,368 |
| Test | 11,490 |
Articles average approximately 781 tokens, and summaries average 56 tokens (roughly 3.75 sentences). Because the reference summaries are multi-sentence highlights, this dataset favors models that can extract or generate multiple key points.
The XSum dataset, introduced by Shashi Narayan, Shay B. Cohen, and Mirella Lapata at EMNLP 2018, consists of BBC news articles paired with single-sentence summaries. It was designed to test a model's ability to perform highly abstractive summarization, as the reference summaries are typically not simple extractions from the article.
| Split | Number of Pairs |
|---|---|
| Training | 204,045 |
| Validation | 11,332 |
| Test | 11,334 |
Articles average 431 words (approximately 20 sentences), while summaries average just 23 words. XSum is considered more challenging than CNN/DailyMail because it requires significant abstraction and reformulation of the source content.
| Dataset | Domain | Summary Style | Size (Train) | Average Doc Length |
|---|---|---|---|---|
| Gigaword | News headlines | Single-sentence headline | 3.8M | ~31 tokens |
| SAMSum | Dialogue | Abstractive conversation summary | 14,732 | ~94 tokens (dialogue) |
| PubMed | Scientific papers | Abstract from body text | 133,215 | ~3,000 tokens |
| arXiv | Scientific papers | Abstract from body text | 215,913 | ~6,000 tokens |
| BillSum | US Congressional bills | Legislative summary | 18,949 | ~1,800 tokens |
| BigPatent | Patent documents | Patent abstract | 1.3M | ~3,500 tokens |
| WikiHow | How-to articles | Step-by-step summary | 157,252 | ~580 tokens |
| Multi-News | News (multi-document) | Multi-document summary | 44,972 | ~2,100 tokens (total) |
| BookSum | Books | Chapter/book-level summary | 12,630 | ~5,000+ tokens |
The Document Understanding Conference (DUC), organized by NIST from 2001 to 2007, was the primary evaluation venue for summarization research in the pre-neural era. DUC introduced standard tasks for single-document and multi-document summarization with human-evaluated benchmarks. In 2008, DUC transitioned into the Text Analysis Conference (TAC), which continued to host summarization evaluation tracks.
Multi-document summarization (MDS) involves generating a single summary from a collection of related documents. This task is more complex than single-document summarization because the system must handle several additional challenges:

- Redundancy: related documents repeat the same information, which must be recognized and collapsed rather than restated.
- Contradiction: sources may disagree, and the summary must reconcile or acknowledge conflicting reports.
- Ordering: content drawn from different documents must be arranged into a single coherent narrative.
- Scale: total input length grows with the number of documents, compounding the long-input problem.
Classical approaches to MDS include cluster-based methods (grouping similar sentences and selecting representatives from each cluster), graph-based methods like LexRank (which naturally extend to multi-document settings), and optimization-based methods using ILP.
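The redundancy control used across many of these classical pipelines is Maximal Marginal Relevance (MMR; Carbonell and Goldstein, 1998), which greedily picks the sentence that is most relevant yet least redundant with what is already selected. A minimal sketch, with Jaccard word overlap standing in for whatever similarity function a system uses:

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def mmr_summarize(sentences, query, k=2, lam=0.7, sim=jaccard):
    """Greedy MMR selection:
    mmr(c) = lam * sim(c, query) - (1 - lam) * max_{s in selected} sim(c, s)."""
    selected, pool = [], list(sentences)
    while pool and len(selected) < k:
        def mmr(c):
            redundancy = max((sim(c, s) for s in selected), default=0.0)
            return lam * sim(c, query) - (1 - lam) * redundancy
        best = max(pool, key=mmr)
        selected.append(best)
        pool.remove(best)
    return selected
```

The redundancy term is what makes MMR effective for multi-document input: a near-duplicate of an already-selected sentence is penalized even if it is highly relevant on its own.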
Neural approaches to MDS have explored hierarchical attention mechanisms, where local attention captures within-document structure and global attention captures cross-document relationships. More recent work leverages LLMs by concatenating multiple documents (or their summaries) into a single prompt and generating a unified summary.
Summarizing long documents, such as scientific papers, legal filings, books, or meeting transcripts, poses distinct challenges beyond those of standard news summarization:

- Input length: documents can far exceed the context windows of standard Transformer models.
- Computational cost: full self-attention scales quadratically with input length.
- Dispersed salience: key information is spread across distant sections rather than concentrated at the beginning.
- Discourse structure: long documents have explicit structure (sections, chapters) that flat models ignore.
Several strategies have been developed to handle long documents:
Sparse attention models like the Longformer Encoder-Decoder (LED), introduced by Beltagy, Peters, and Cohan (2020), use a combination of local sliding window attention and global attention on selected tokens. LED can process inputs up to 16,384 tokens and was evaluated on the arXiv summarization dataset, achieving ROUGE-1/2/L scores of 46.63/19.62/41.83.
Hierarchical approaches process documents at multiple levels of granularity, for example by first encoding sentences, then aggregating sentence representations into document-level representations. This captures both local and global document structure.
Divide-and-conquer strategies split long documents into manageable chunks, summarize each chunk independently, and then combine the chunk summaries into a final summary. This approach is often called hierarchical or recursive summarization.
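The divide-and-conquer strategy can be written as a short recursion (a sketch: the `summarize` argument is a stand-in for any single-chunk summarizer, such as a fine-tuned model or an LLM call, and real systems use token-aware chunking with overlap rather than whitespace splitting):

```python
def chunk(tokens, size):
    """Split a token list into consecutive chunks of at most `size`."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def recursive_summarize(text, summarize, max_tokens=512):
    """Summarize chunks independently, concatenate the chunk summaries,
    and recurse until the input fits in a single pass."""
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return summarize(text)
    partials = [summarize(" ".join(c)) for c in chunk(tokens, max_tokens)]
    return recursive_summarize(" ".join(partials), summarize, max_tokens)
```

Termination relies on `summarize` producing output shorter than its input; in practice the per-chunk summary length is capped well below the chunk size.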
LLMs with extended context windows offer a more straightforward solution for some use cases. Models with context windows of 100K tokens or more can process many long documents in a single pass, though they may still struggle with content in the middle of very long inputs.
One of the most pressing challenges in abstractive summarization is ensuring that generated summaries are faithful to the source document. A faithful summary contains only information that is supported by the source text; it does not add, contradict, or distort any facts.
Research has identified two main types of hallucination in summarization:

- Intrinsic hallucination: the summary distorts or contradicts information that is present in the source document.
- Extrinsic hallucination: the summary introduces information that does not appear in the source at all, whether or not it happens to be true.
Studies have found that hallucination is widespread in neural summarization systems. Cao et al. (2018) estimated that roughly 30% of the summaries produced by state-of-the-art abstractive systems contained factual inconsistencies. Maynez et al. (2020), in "On Faithfulness and Factuality in Abstractive Summarization" (ACL 2020), found the problem especially severe on XSum, where the large majority of model-generated summaries contained some form of faithfulness error, largely because XSum's highly abstractive reference summaries encourage models to generate content that goes beyond the source.
Several approaches have been developed to detect and reduce hallucination:

- Automatic detection, using natural language inference models (as in FactCC and SummaC) or question generation and answering (as in QuestEval) to check summaries against their sources.
- Training-time interventions, such as filtering noisy or unfaithful reference summaries from training data and optimizing factuality-aware objectives.
- Decoding and post-processing, such as re-ranking candidate summaries by a faithfulness score or post-editing unsupported spans.
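A cheap diagnostic often used alongside heavier NLI- and QA-based checks is the novel n-gram ratio: the fraction of summary n-grams that never appear in the source. High novelty does not prove hallucination, but it flags the summaries that stray furthest from the source text. A heuristic sketch:

```python
def novel_ngram_ratio(summary, source, n=2):
    """Fraction of summary n-grams absent from the source; a rough
    proxy for how much the summary goes beyond the source text."""
    def grams(text):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    summ, src = grams(summary), grams(source)
    return len(summ - src) / len(summ) if summ else 0.0
```

A purely extractive summary scores 0.0; summaries that introduce unsupported phrasing score progressively higher and can be routed to a stronger faithfulness checker.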
Text summarization is deployed across a wide range of industries and applications.
News organizations use summarization to generate article headlines, produce news digests, and create social media posts from longer articles. Automated summarization enables rapid processing of breaking news and helps readers stay informed without reading full articles. News Corp Australia has used generative AI to produce thousands of local news stories weekly.
Legal professionals use summarization to condense lengthy court rulings, contracts, depositions, and regulatory filings. Summarization tools can identify key testimony, track themes across multiple depositions, and highlight risky clauses in contracts. This reduces the time lawyers spend on document review, which is particularly valuable in e-discovery and due diligence processes.
In healthcare, summarization helps clinicians manage the growing volume of information in electronic health records (EHRs). Systems can summarize patient histories, discharge notes, and medical literature to support clinical decision-making. Summarization is also used in systematic reviews, where researchers must synthesize findings from hundreds of studies on a particular medical question.
Researchers use summarization to keep up with the rapidly growing scientific literature. Tools like Semantic Scholar and others provide automated summaries of research papers, helping scientists identify relevant work more efficiently. Summarization can also help with generating literature reviews and identifying gaps in existing research.
Businesses use summarization for processing customer feedback, summarizing meeting transcripts, generating executive summaries of reports, and condensing email threads. Financial analysts use summarization to extract key information from earnings calls, SEC filings, and market reports.
Summarization tools help students condense textbook chapters, research papers, and lecture materials. They can also assist educators in creating study guides and review materials.
Search engines use summarization to generate snippets that appear in search results, providing users with a preview of each result's content. This form of query-focused summarization selects or generates text from the document that is most relevant to the user's search query.
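Query-focused snippet extraction can be sketched as a sliding-window search for the region of the document densest in query terms (a toy sketch; production snippet generators use ranking features well beyond raw term counts):

```python
def best_snippet(document, query, window=10):
    """Return the contiguous window of tokens containing the most
    query terms: a simple query-focused extractive snippet."""
    q = set(query.lower().split())
    toks = document.split()
    best_i, best_score = 0, -1
    for i in range(max(1, len(toks) - window + 1)):
        score = sum(1 for t in toks[i:i + window] if t.lower() in q)
        if score > best_score:
            best_i, best_score = i, score
    return " ".join(toks[best_i:best_i + window])
```

Unlike generic summarization, the selection criterion here depends on the query, so the same document yields different snippets for different searches.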
Modern systems increasingly combine extractive and abstractive approaches. A common pipeline first uses an extractive step to select the most relevant sentences, then applies an abstractive model to rewrite and compress the selected content. This hybrid approach reduces hallucination risk while maintaining fluent output.
Controllable summarization allows users to specify desired attributes of the output, such as length, reading level, focus topic, or style. Prompt engineering with LLMs has made controllable summarization much more accessible.
Cross-lingual summarization, where the source and summary are in different languages, is a growing area of research. Models like mBART and mT5, which are multilingual variants of BART and T5, have been applied to multilingual summarization tasks. LLMs with strong multilingual capabilities further expand the possibilities for cross-lingual summarization.
Despite decades of work, evaluating summarization quality remains an open problem. Existing metrics like ROUGE do not fully capture semantic equivalence, factual consistency, or the pragmatic utility of a summary. The development of better evaluation methods, including LLM-based evaluators like G-Eval, is an active area of research.
As summarization systems are deployed in high-stakes domains like healthcare, law, and finance, ensuring factual faithfulness has become increasingly important. Research continues on better methods for detecting and preventing hallucination, with a trend toward using NLI-based and QA-based automatic evaluation for faithfulness.