Summarization Models

Updated Jul 16, 2026

Summarization models are natural language processing systems that condense a source document, set of documents, or dialogue into a shorter version that preserves the most important information.^[27] They sit alongside machine translation and question answering as canonical sequence-to-sequence tasks, and form the technical core of news digests, scientific TLDR generation, meeting recaps, and the executive briefs produced by modern chat assistants.

The field has been reshaped twice in quick succession: first by pretrained encoder-decoder models such as BART and T5 in 2019, and again by general-purpose large language models such as GPT-4, Claude, and Gemini, which produce strong summaries with zero or few examples.^[28]

Definition and core distinctions

A summarization system reads an input of length N and produces a much shorter output while keeping it informative, fluent, and faithful to the source.^[27] Researchers split the task along four primary axes.

Extractive versus abstractive. Extractive systems select spans (usually whole sentences) from the source and concatenate them. Abstractive systems generate new text that may paraphrase, fuse, or reorder content. Extractive output stays grounded but reads choppily; abstractive output is smoother and shorter, with a higher risk of hallucination. A third hybrid category, sometimes called compressive or mixed summarization, selects candidate sentences and then compresses or edits them into final output, combining the faithfulness advantage of extraction with the fluency advantage of generation.^[27]

Single versus multi-document. Single-document summarization condenses one input, the standard setting for news articles or scientific papers. Multi-document summarization fuses several related inputs and adds the problems of redundancy removal and cross-document coreference.^[27] Meeting summarization is a special case that handles multi-party dialogue with speaker turns, disfluencies, and implicit topic shifts.

Generic versus query- or aspect-focused. A generic summary covers the most salient content. A query-focused summary answers a user question against the source; an aspect-focused summary picks out a specific facet such as methods only from a research paper. Query-focused multi-document summarization (QFMDS) combines both challenges and has a long tradition at government-sponsored evaluation workshops including TREC and DUC.^[27]

Length and granularity. Extreme summarization, typified by the XSum benchmark, targets a single sentence that captures the headline intent of an article.^[7] Longer summaries range from a short paragraph to multi-page executive reports, as in the GovReport dataset whose targets average roughly 550 words.^[19]

History and key approaches

Early statistical and graph-based methods

Before neural networks dominated the area, extractive summarization was the practical default. Two graph-based algorithms published in 2004 became long-running baselines: TextRank by Rada Mihalcea and Paul Tarau,^[2] and LexRank by Gunes Erkan and Dragomir Radev.^[3] Both build a graph in which sentences are nodes and edges carry a similarity weight, then rank nodes with an iterative algorithm in the style of PageRank to pick the most central sentences.^[2]^[3] These methods require no training data and remain useful when annotated corpora are unavailable.
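The graph-ranking idea can be illustrated with a minimal sketch. It is not the published TextRank or LexRank implementation: Jaccard word overlap stands in for the similarity functions used in those papers, and the function names, damping factor, and iteration count are assumptions chosen for illustration.

```python
import re
from itertools import combinations


def tokenize(sentence):
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Return PageRank-style centrality scores, one per sentence."""
    n = len(sentences)
    if n == 0:
        return []
    tokens = [tokenize(s) for s in sentences]

    # Symmetric similarity matrix: Jaccard overlap of word sets (assumed stand-in).
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        union = tokens[i] | tokens[j]
        if union:
            sim = len(tokens[i] & tokens[j]) / len(union)
            weights[i][j] = weights[j][i] = sim

    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour passes on its score in proportion to edge weight.
            incoming = sum(
                weights[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def extract_summary(sentences, k=2):
    """Pick the k most central sentences, kept in source order."""
    scores = rank_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares no words with the rest of the document receives only the baseline score, so off-topic sentences fall to the bottom of the ranking without any training data.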

Around the same time, the MEAD system from the University of Michigan (Radev et al., 2004) combined centroid-based scoring, positional features, and redundancy penalties for multi-document summarization. The Document Understanding Conference (DUC), which ran from 2001 to 2007 and was succeeded by the Text Analysis Conference (TAC), provided annual shared tasks and standard datasets that drove the field's early progress.^[27]

Topic modelling offered a complementary angle. Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) could identify thematic segments and pick representative sentences for each topic, which proved especially useful for multi-document summarization where raw similarity measures blur across topic clusters.^[27]

Sequence-to-sequence with attention

The first wave of neural summarization adapted the encoder-decoder framework from machine translation, with a recurrent encoder reading the source and a recurrent decoder generating tokens while attending over the encoder states. Rush, Chopra, and Weston (2015) applied attention to headline generation, producing the first compelling abstractive baseline on news.^[4] Chopra, Auli, and Rush (2016) refined the attentional model with a recurrent decoder and evaluated it on the Gigaword sentence-summarization task that Rush et al. had used.

The decisive refinement was the pointer-generator network of Abigail See, Peter J. Liu, and Christopher D. Manning in 2017, which augments the decoder with a copy mechanism that can generate a vocabulary word or copy a token from the source.^[6] A coverage loss discourages repetition. On CNN/DailyMail it beat earlier abstractive baselines by at least 2 ROUGE points and remained competitive until pretrained transformers arrived.^[6]
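The copy mechanism and coverage penalty reduce to two short formulas, sketched below for a single decoding step. The inputs are plain Python values rather than network outputs, and the function and argument names are assumptions made for this illustration.

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix generating from the vocabulary with copying from the source.

    p_gen         probability of generating rather than copying (0..1)
    vocab_dist    dict word -> probability from the decoder softmax
    attention     attention weights over source positions (sums to 1)
    source_tokens source words, aligned with `attention`
    """
    mixed = {word: p_gen * prob for word, prob in vocab_dist.items()}
    for token, weight in zip(source_tokens, attention):
        # Copy mass accumulates when a word occurs several times in the source,
        # and out-of-vocabulary source words gain non-zero probability.
        mixed[token] = mixed.get(token, 0.0) + (1.0 - p_gen) * weight
    return mixed


def coverage_loss(attention, coverage):
    """Penalty for re-attending to positions that are already covered."""
    return sum(min(a, c) for a, c in zip(attention, coverage))


dist = final_distribution(
    p_gen=0.5,
    vocab_dist={"the": 0.5, "won": 0.5},
    attention=[0.25, 0.75],
    source_tokens=["messi", "won"],
)
# dist == {"the": 0.25, "won": 0.625, "messi": 0.125}
```

In the example, "messi" is absent from the toy vocabulary yet still receives probability 0.125 through copying, which is how the model can reproduce rare names from the source.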

Pretrained encoder-decoder transformers

In 2019 three encoder-decoder transformers established the modern paradigm. BART, from Mike Lewis and colleagues at Facebook AI, pretrains a bidirectional encoder and an autoregressive decoder to reconstruct text corrupted by token masking, deletion, sentence shuffling, and span infilling; fine-tuned on CNN/DailyMail and XSum it set new state-of-the-art ROUGE scores.^[10] T5, from Colin Raffel and colleagues at Google, cast every NLP task as text-to-text and pretrained on the C4 corpus at scales up to 11 billion parameters.^[12] PEGASUS, from Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu at Google, introduced a gap-sentence-generation objective specifically aimed at summarization: important sentences are masked and the model learns to regenerate them.^[13] PEGASUS achieved state-of-the-art results across 12 summarization datasets and strong few-shot performance with as few as 1,000 labelled examples.^[13]

Also in 2019, Yang Liu and Mirella Lapata published BERTSUM, which adapted the bidirectional encoder BERT to extractive and abstractive summarization.^[11] The extractive variant inserts a [CLS] token before each sentence, encodes the full document jointly, and scores each sentence with stacked transformer layers. Because the datasets provide only abstractive references, training labels come from a greedy algorithm that picks the sentences maximizing ROUGE against the reference; at inference the highest-scoring sentences form the summary. BERTSUM set a new extractive SOTA on CNN/DailyMail and demonstrated that the same pretrained encoder could support both extractive and abstractive heads.^[11]
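Greedy ROUGE-based sentence selection of the kind mentioned above needs a reference summary to score against, which makes it a natural way to derive extractive labels from abstractive references. The sketch below is an assumption-laden illustration: clipped unigram F1 stands in for ROUGE, whitespace splitting stands in for tokenization, and the function names and the three-sentence cap are hypothetical defaults.

```python
from collections import Counter


def unigram_f1(candidate_tokens, reference_tokens):
    """Clipped unigram overlap F1, a simplified stand-in for ROUGE-1."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(doc_sentences, reference, max_sentences=3):
    """Greedily add the sentence that most improves the score; stop when none does."""
    ref_tokens = reference.lower().split()
    selected, best = [], 0.0
    while len(selected) < max_sentences:
        candidates = []
        for i in range(len(doc_sentences)):
            if i in selected:
                continue
            chosen = sorted(selected + [i])
            tokens = " ".join(doc_sentences[j] for j in chosen).lower().split()
            candidates.append((unigram_f1(tokens, ref_tokens), i))
        if not candidates:
            break
        score, index = max(candidates)
        if score <= best:
            break  # no remaining sentence improves the overlap score
        selected.append(index)
        best = score
    return sorted(selected)


doc = [
    "the storm hit the coast on monday",
    "officials ordered evacuations",
    "the weather was sunny last week",
]
labels = greedy_oracle(doc, "storm hit coast officials ordered evacuations")
# labels == [0, 1]
```

The returned indices can serve as positive labels for a sentence classifier; the third sentence is left out because adding it lowers the overlap score.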

Long-context summarization

Standard transformer self-attention scales quadratically with sequence length, which prevents 2019-era models from processing documents much longer than 1,024 tokens without truncation.^[16] A series of architectures introduced sparse attention patterns to push the input limit further:

Longformer (Beltagy, Peters, and Cohan, 2020) replaces full attention with a sliding window plus task-motivated global tokens.^[16] Its encoder-decoder variant, LED, is widely used on arXiv and other long-document tasks.
BigBird (Zaheer et al., Google, 2020) combines global, local, and random attention and supports inputs roughly 8x longer than the original transformer on the same hardware.^[17]
LongT5 (Guo et al., 2021) ports a transient global attention mechanism into the T5 family and adopts a PEGASUS-style pretraining objective, reaching state-of-the-art results on arXiv, PubMed, BigPatent, and MediaSum.^[21]
PEGASUS-X (Phang, Zhao, and Liu, 2022) extends PEGASUS with block-local attention and extra long-input pretraining for inputs up to 16,384 tokens, topping GovReport and PubMed without scaling parameters much.^[23]
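To make the cost argument concrete, the sketch below builds a sliding-window attention mask with optional global tokens and counts how many query-key pairs remain. It illustrates the general pattern only, not any specific model's implementation; the function names and the toy sizes are assumptions.

```python
def sliding_window_mask(seq_len, window, global_positions=()):
    """Boolean mask: mask[i][j] is True when token i may attend to token j."""
    half = window // 2
    global_set = set(global_positions)
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            local = abs(i - j) <= half
            # Global tokens attend everywhere and are attended to by everyone.
            is_global = i in global_set or j in global_set
            row.append(local or is_global)
        mask.append(row)
    return mask


def attended_pairs(mask):
    """Number of query-key pairs that are actually computed."""
    return sum(sum(row) for row in mask)


local_only = attended_pairs(sliding_window_mask(8, window=2))
with_global = attended_pairs(sliding_window_mask(8, window=2, global_positions=(0,)))
# local_only == 22 and with_global == 34, versus 8 * 8 == 64 for full attention
```

For a sequence of n tokens and window w, the local pattern touches roughly n * (w + 1) pairs instead of n * n, so doubling the input length doubles the cost rather than quadrupling it.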

General-purpose LLMs

The arrival of GPT-3 in 2020 and its instruction-tuned successors made zero-shot summarization a routine chatbot capability. By 2023 papers were reporting that GPT-4 summaries were preferred to fine-tuned BART or PEGASUS outputs on news benchmarks under human evaluation, even when ROUGE scores favored the fine-tuned models.^[28] This divergence highlighted the growing gap between automatic metrics and actual summary quality. Closed foundation models such as Claude and Gemini, with context windows of 200,000 to over a million tokens, could summarize book-length documents in a single pass. Open-weight families such as FLAN-T5 and Llama provided instruction-tuned alternatives that can be fine-tuned on domain-specific data. Research from 2024 and 2025 found that fine-tuned open-source LLMs with domain-specific training data could match or exceed GPT-4 zero-shot performance on narrow tasks, while GPT-4-class models retained advantages on zero-shot cross-domain and multilingual settings.^[28]

Architectures and training objectives

Most summarization systems fall into one of four architectural families.

Encoder-only with an extractive head. A pretrained encoder such as BERT scores each sentence and the top-K sentences are selected. This route, popularized by BERTSUM (Liu and Lapata, 2019), is fast and stays faithful to the source.^[11] The encoder can also drive an importance or saliency classifier that scores spans rather than whole sentences, enabling finer-grained extractive compression.

Encoder-decoder transformer. BART, T5, PEGASUS, and their long-context successors share this template: the encoder builds contextual representations and the decoder generates the summary autoregressively while attending over the encoder. Pretraining objectives vary across text-infilling (BART),^[10] span corruption (T5),^[12] and gap-sentence generation (PEGASUS).^[13] The encoder and decoder can be initialized separately from existing checkpoints or jointly from scratch on a massive denoising corpus.

Decoder-only language model. Generic LLMs such as the GPT, Llama, Claude, and Gemini families read source and instruction in a single context window, then continue with the summary. Instruction tuning and reinforcement learning from human feedback (RLHF) adapt these models to produce user-preferred summaries without task-specific fine-tuning on summarization corpora. The limitation is that faithfulness checking is harder without a separate encoder that explicitly grounds the decoder in the source.

Retrieval-augmented or hierarchical pipelines. For inputs that exceed even long-context models, systems chunk the document, summarize the chunks, and recursively combine partial summaries. This map-reduce pattern is easy to implement but loses cross-chunk dependencies. Alternatively, a retrieval-augmented generation pipeline fetches the most relevant passages before generating the final summary, which is efficient when the user query identifies the relevant region but introduces retrieval errors as a new failure mode.
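The map-reduce pattern can be sketched in a few lines. The `summarize` argument is a hypothetical caller-supplied function that could wrap any model; the word-based chunking, the size limits, and the function names are assumptions for illustration rather than a recommended configuration.

```python
def chunk_words(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def map_reduce_summarize(text, summarize, chunk_size=500, max_words=500):
    """Recursively summarize chunks until the text fits in one call.

    `summarize` is a caller-supplied function str -> str (hypothetical).
    It is expected to shorten its input.
    """
    if len(text.split()) <= max_words:
        return summarize(text)
    # Map step: summarize each chunk independently.
    partials = [summarize(chunk) for chunk in chunk_words(text, chunk_size)]
    combined = " ".join(partials)
    if len(combined.split()) >= len(text.split()):
        # Guard against a summarizer that fails to compress: stop recursing.
        return summarize(combined)
    # Reduce step: treat the partial summaries as a new, shorter document.
    return map_reduce_summarize(combined, summarize, chunk_size, max_words)


def first_two_words(text):
    """Toy stand-in for a model: keep the first two words."""
    return " ".join(text.split()[:2])


document = " ".join(f"w{i}" for i in range(20))
result = map_reduce_summarize(document, first_two_words, chunk_size=5, max_words=5)
# result == "w0 w1"
```

Because each chunk is summarized in isolation, a fact split across a chunk boundary can be dropped in the map step and never reach the reduce step, which is the cross-chunk dependency loss described above.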

Pretraining objectives for summarization

The choice of pretraining objective has a large effect on summarization quality when fine-tuning data is scarce. Gap-sentence generation (PEGASUS) masks whole sentences that have high overlap with the rest of the document, a proxy for "informativeness," and teaches the model to reconstruct a document-level summary rather than local spans.^[13] Span corruption and text infilling (T5 and BART) are more general, requiring reconstruction of corrupted spans of varying length.^[10]^[12] Sentence shuffling (BART) trains the model to recover discourse order, which helps when summaries require understanding narrative structure rather than just identifying key facts.^[10]
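A gap-sentence training example can be constructed as in the sketch below. It is a simplified illustration rather than the PEGASUS preprocessing code: clipped unigram F1 stands in for ROUGE, each sentence is scored independently against the rest of the document, and the mask string, the 30% gap ratio, and the function names are assumptions.

```python
from collections import Counter


def overlap_f1(sentence_tokens, rest_tokens):
    """Clipped unigram F1 between a sentence and the rest of the document."""
    overlap = sum((Counter(sentence_tokens) & Counter(rest_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(sentence_tokens)
    recall = overlap / len(rest_tokens)
    return 2 * precision * recall / (precision + recall)


def make_gsg_example(sentences, gap_ratio=0.3, mask_token="[MASK1]"):
    """Build one (masked_input, target) pair from a list of sentences."""
    tokenized = [s.lower().split() for s in sentences]
    scores = []
    for i, tokens in enumerate(tokenized):
        rest = [t for j, other in enumerate(tokenized) if j != i for t in other]
        scores.append(overlap_f1(tokens, rest))
    num_gaps = max(1, round(len(sentences) * gap_ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    gaps = sorted(ranked[:num_gaps])
    # The input keeps unselected sentences and replaces each gap with the mask.
    masked_input = " ".join(mask_token if i in gaps else s for i, s in enumerate(sentences))
    # The target is the concatenation of the removed sentences.
    target = " ".join(sentences[i] for i in gaps)
    return masked_input, target


masked_input, target = make_gsg_example([
    "the court ruled on the merger today",
    "shares rose",
    "the merger ruling affects the court and shares",
])
# target == "the merger ruling affects the court and shares"
```

The third sentence is selected because it overlaps most with the other two, so the model is trained to regenerate the sentence that best summarizes the remaining text.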

Notable summarization models

The table below lists models that set state-of-the-art results or shaped later research.

Model	Year	Organization	Parameters	Notes
Pointer-generator network	2017	Stanford / Google Brain	~22M	RNN seq2seq with copy and coverage; first strong abstractive baseline on CNN/DailyMail^[6]
BERTSUM	2019	University of Edinburgh	~110M	BERT encoder with extractive and abstractive variants; SOTA on CNN/DailyMail^[11]
BART	2019	Facebook AI	140M, 400M	Denoising encoder-decoder; new SOTA on CNN/DailyMail and XSum^[10]
T5	2019	Google	60M to 11B	Text-to-text transfer transformer pretrained on C4^[12]
PEGASUS	2019	Google	223M, 568M	Gap-sentence generation objective; SOTA on 12 datasets^[13]
Longformer / LED	2020	Allen Institute for AI	149M, 435M	Sliding window plus global attention up to 16K tokens^[16]
BigBird	2020	Google	~127M, ~580M	Sparse global+local+random attention for long inputs^[17]
LongT5	2021	Google	220M to 3B	Transient-global attention plus PEGASUS-style pretraining^[21]
PEGASUS-X	2022	Google	568M	Long-input PEGASUS variant for inputs up to 16K tokens^[23]
FLAN-T5	2022	Google	80M to 11B	Instruction-tuned T5; strong zero-shot summarization
GPT-4	2023	OpenAI	Undisclosed	Zero-shot summarization across domains; preferred by humans over fine-tuned encoder-decoder models on news^[28]
Claude	2023 onward	Anthropic	Undisclosed	200K+ token windows for long-document summarization
Gemini	2023 onward	Google DeepMind	Undisclosed	Up to 1M-token context for book-length summarization
Llama family	2023 onward	Meta AI	1B to 405B	Open-weight LLMs widely fine-tuned for domain-specific summarization

Datasets and benchmarks

Progress in summarization has been driven by a small number of public datasets, listed below. Each dataset encodes a particular style of summarization, and models fine-tuned on one do not always transfer well to others.

Dataset	Domain	Style	Size	Notes
CNN/DailyMail	News	Multi-sentence highlights	~313K articles	Originally collected by Hermann et al. (2015) for reading comprehension and repurposed for summarization; moderately abstractive^[5]
Gigaword	News	Headline generation	~4M sentence pairs	Sentence-level compression from the first sentence of articles; used by Rush et al. (2015)^[4]
XSum	BBC news	One-sentence extreme summaries	226,711 articles	Highly abstractive, with about 96% novel trigrams; originally by Narayan, Cohen, and Lapata (2018)^[7]
Newsroom	News	Mixed extractive/abstractive	1.3M article-summary pairs	Sourced from 38 publishers with explicit extractiveness labels
WikiHow	How-to articles	Step summaries	230K articles	Paragraph headlines as targets; diverse procedural domain
BillSum	US legislation	Bill summaries	~23K bills	Congressional and California state bills; long source documents
MultiNews	News clusters	Multi-document	56K clusters	Related articles with editor summaries; tests redundancy handling
GovReport	US government reports	Executive summaries	~19K reports	Mean document length ~9,600 tokens; mean summary ~550 words; Huang et al. (2021)^[19]
arXiv / PubMed	Scientific papers	Abstracts	215K / 133K papers	Cohan et al. (2018); abstracts serve as targets; very long source documents^[8]
BookSum	Novels and short stories	Chapter and book-level summaries	~12K chapter summaries	Salesforce Research; three granularity levels; Kryściński et al. (2021)^[20]
QMSum	Meeting transcripts	Query-focused summaries	1.8K query-summary pairs	Multi-domain meeting corpus; mean length ~13,300 tokens
SAMSum	Chat dialogues	Third-person summaries	16,369 conversations	Human-written messenger-style conversations; Gliwa et al. (2019)^[9]
DialogSum	Spoken-style dialogues	Daily-life summaries	13,460 dialogues	Daily-life scenarios; Chen et al. (2021)
AMI / ICSI	Meeting transcripts	Meeting minutes	~100 / ~75 meetings	Classic meeting corpora; multi-speaker; used before the LLM era for dialogue summarization
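The "novel trigrams" figure quoted for XSum in the table is a measure of abstractiveness: the share of summary n-grams that do not appear in the source. A minimal sketch of one common variant follows; it counts distinct n-grams over whitespace tokens, whereas published figures may use different tokenization or count repeated n-grams, so exact numbers will differ. The function names are assumptions.

```python
def ngrams(tokens, n):
    """Set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_ratio(source, summary, n=3):
    """Fraction of distinct summary n-grams that never occur in the source."""
    summary_grams = ngrams(summary.lower().split(), n)
    if not summary_grams:
        return 0.0
    source_grams = ngrams(source.lower().split(), n)
    return len(summary_grams - source_grams) / len(summary_grams)


source = "the cat sat on the mat"
extractive = novel_ngram_ratio(source, "the cat sat")            # 0.0
abstractive = novel_ngram_ratio(source, "a cat sat on a rug")    # 0.75
```

A ratio near 0 indicates a largely extractive summary, while a ratio near 1 indicates that almost every phrase was newly written.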

Dataset characteristics and challenges

The spread of datasets reflects the breadth of summarization as a task. News datasets such as CNN/DailyMail and XSum differ substantially in style: CNN/DailyMail summaries are bullet-style highlights that accompany the published article, while XSum targets a single sentence answering "what is the article about?", which demands heavier abstraction.^[7] Scientific paper datasets shift the demand toward technical vocabulary and hierarchical discourse. Dialogue datasets introduce speaker role modeling and speech-act understanding. Long-document datasets (GovReport, BookSum, arXiv) push models toward hierarchical architectures or large context windows.

Training on one dataset and evaluating on another routinely reveals that neural summarizers learn dataset-specific style artifacts rather than transferable summarization skills. Models trained on XSum's one-sentence targets produce overly compressed outputs on CNN/DailyMail; models trained on CNN/DailyMail produce verbose outputs on XSum.

Evaluation metrics

Automatic evaluation of summaries is hard because many different surface realizations can be equally good. The most widely used metric is ROUGE (Recall-Oriented Understudy for Gisting Evaluation), introduced by Chin-Yew Lin in 2004.^[1] ROUGE-N counts overlapping n-grams against one or more reference summaries, ROUGE-L measures the longest common subsequence, and ROUGE-S scores skip-bigrams.^[1] ROUGE was adopted at the Document Understanding Conference (DUC) 2004 and remains the default headline number in nearly every paper.
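The ROUGE-N and ROUGE-L definitions can be sketched directly. This simplified version handles a single reference and uses lowercased whitespace tokens with no stemming, so its numbers will not match the official package exactly; the function names are assumptions.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall, precision and F1 with clipped n-gram counts."""
    def counts(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = counts(candidate), counts(reference)
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_recall(candidate, reference):
    """ROUGE-L recall: LCS length divided by reference length."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return lcs_length(cand, ref) / max(len(ref), 1)


reference = "the cat sat on the mat"
close = rouge_n("the cat lay on the mat", reference, n=1)          # recall 5/6
paraphrase = rouge_n("a feline rested upon a rug", reference, n=1)  # recall 0.0
```

The second call shows the metric's blind spot: a reasonable paraphrase with no shared words scores zero.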

ROUGE's core limitations are well documented.^[28] It rewards lexical overlap and penalizes valid paraphrases, so a weaker summary that reuses the reference's wording can outrank a better one that paraphrases it.^[14] It also fails to capture factual consistency: a summary can score well on ROUGE while containing facts that contradict the source.^[18] Cross-dataset comparisons using ROUGE are unreliable because different datasets use different numbers of reference summaries, and human references vary in abstractiveness.

BERTScore (Tianyi Zhang and colleagues, 2019) compares contextual BERT embeddings between candidate and reference, scoring semantic similarity rather than literal n-gram overlap.^[14] BLEURT (Sellam et al., 2020) is a learned regression metric fine-tuned to predict human ratings. MoverScore (Zhao et al., 2019) uses earth-mover distance between pooled word embeddings to compute soft n-gram overlap.
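The embedding-matching idea behind BERTScore can be shown with toy vectors. In the sketch below, hand-written two-dimensional vectors stand in for contextual embeddings, and refinements such as importance weighting are omitted; the function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_score(candidate_vecs, reference_vecs):
    """Precision, recall and F1 from greedy cosine matching of token vectors."""
    if not candidate_vecs or not reference_vecs:
        return 0.0, 0.0, 0.0
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# One candidate token that matches the first of two reference tokens exactly.
precision, recall, f1 = greedy_match_score([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
# precision == 1.0, recall == 0.5
```

Because similarity is computed between vectors rather than strings, two different words with nearby embeddings can still contribute to the score, which is what distinguishes this family from n-gram overlap.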

Faithfulness and factual consistency metrics

A separate strand of work targets factual consistency specifically. These metrics do not require a reference summary; they compare the generated summary directly to the source document.

FactCC (Kryściński et al., 2020) trains a classifier on synthetically generated factual and non-factual sentence pairs to detect contradictions between source and summary.^[18] QAGS (Wang, Cho, and Lewis, 2020) generates questions from the candidate summary and checks whether a question answering system gives the same answers when reading the summary and the source; divergence signals inconsistency.^[15] FEQA (Durmus et al., 2020) applies a similar loop: it generates question-answer pairs from the summary and checks whether a question answering model reading the source recovers the same answers.

SummaC (Laban et al., 2022) operationalises sentence-level consistency via natural language inference, checking whether each summary sentence is entailed by and not contradicted by the source.^[22] AlignScore (Zha et al., 2023) proposes a unified alignment function that supports both reference-based and reference-free evaluation by fine-tuning a checkpoint to predict whether one text is factually supported by another.^[26]
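An NLI-based consistency score can be aggregated as in the sketch below, which loosely follows the idea of scoring each summary sentence against every source sentence. It is not the SummaC implementation: `entail_prob` is a hypothetical caller-supplied scorer that would wrap an NLI model in practice, and `toy_entail` is a word-overlap stand-in used only to make the example runnable.

```python
def nli_consistency(doc_sentences, summary_sentences, entail_prob):
    """Mean over summary sentences of the best entailment score from any source sentence.

    `entail_prob(premise, hypothesis)` is a caller-supplied function returning
    a value in [0, 1] (hypothetical; in practice it would wrap an NLI model).
    """
    if not doc_sentences or not summary_sentences:
        return 0.0, []
    per_sentence = [
        max(entail_prob(premise, hypothesis) for premise in doc_sentences)
        for hypothesis in summary_sentences
    ]
    return sum(per_sentence) / len(per_sentence), per_sentence


def toy_entail(premise, hypothesis):
    """Stand-in scorer: share of hypothesis words found in the premise."""
    hyp = hypothesis.lower().split()
    prem = set(premise.lower().split())
    return sum(word in prem for word in hyp) / len(hyp) if hyp else 0.0


doc = ["the plant opened in 2019", "it employs 300 people"]
summary = ["the plant opened in 2019", "it employs 900 people"]
score, details = nli_consistency(doc, summary, toy_entail)
# details == [1.0, 0.75]; score == 0.875
```

The per-sentence scores are useful on their own: the lowest-scoring summary sentence points a reviewer at the claim most likely to be unsupported, here the altered headcount.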

Newer LLM-based evaluators, G-Eval (Liu et al., 2023) being the best-known example, prompt a strong model such as GPT-4 to rate candidates on coherence, relevance, consistency, and fluency using a chain-of-thought evaluation protocol.^[25] Studies find G-Eval correlates more closely with human judgments than ROUGE on most dimensions, though it inherits the biases of the underlying model and can favour outputs from the same model family.^[25]

Human evaluation

Human evaluation remains the gold standard but is expensive and hard to replicate. Common human evaluation protocols ask annotators to rate summaries on a 1-5 scale along dimensions including fluency (grammaticality and readability), coherence (logical structure), relevance (coverage of key information), and consistency (factual alignment with the source). Direct assessment (DA), pairwise comparison (A/B testing), and pyramid-based evaluation (Nenkova and Passonneau, 2004) are the three dominant human evaluation frameworks.

Applications

Summarization models are deployed across many industries and products.

News and media. Aggregators such as Google News and Apple News surface short article previews. Specialized tools generate push-notification-length bullets for breaking news. Some publishers use summarization to auto-generate article metadata and social media snippets.

Scientific literature. Semantic Scholar and other academic search platforms generate one-sentence TLDRs of papers for users scanning large result sets. Automated abstract generation assists authors and supports systematic review workflows in medicine and law.

Meeting and communication platforms. Zoom, Google Meet, Microsoft Teams, and dedicated tools such as Otter.ai and tl;dv produce action-item summaries, decision logs, and meeting minutes from live or recorded transcripts. Customer-support and CRM platforms condense long ticket threads to give agents context before a call.

Legal and compliance. Legal research tools summarize contracts, case files, statutes, and regulatory filings. Strict faithfulness requirements in this domain have driven interest in extractive or lightly compressive approaches that preserve verifiable spans, and in faithfulness metrics as filters on generated summaries.

Healthcare. Clinical note summarization condenses patient histories, discharge summaries, and encounter notes. Biomedical literature summarization assists physicians and researchers in keeping up with large volumes of publications. Both settings place a premium on factual accuracy and domain-specific vocabulary.

Software development. Coding assistants summarize pull requests, code review threads, log files, and stack traces. Commit message generation from diffs is a closely related summarization subtask.

Education. Adaptive learning platforms generate reading summaries at different comprehension levels. Textbook chapter summarization and lecture transcript summarization are increasingly automated with LLM-based tools.

Modern LLM-era developments

Research focus has shifted from fine-tuning on fixed benchmarks to controlling general-purpose LLMs.

Prompt engineering for summarization

Chain-of-density prompting, introduced by Griffin Adams and colleagues in 2023, asks GPT-4 to draft an entity-sparse summary and then iteratively add salient entities at fixed length.^[24] The process runs for five iterations, each adding more entities while preserving the target word count. A human study on 100 CNN/DailyMail articles found that readers preferred the denser summaries to those produced by a vanilla prompt.^[24] The authors released 500 annotated CoD summaries and 5,000 unannotated summaries.^[24]
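The densification loop can be sketched as follows. This is an illustration under stated assumptions rather than the authors' prompt: `llm` is a hypothetical caller-supplied function from prompt to text, the prompt wording and the 80-word target are invented for the example, and the steps are issued as separate calls for readability.

```python
def chain_of_density(article, llm, steps=5, target_words=80):
    """Iteratively densify a summary at roughly fixed length.

    `llm` is a caller-supplied function prompt -> text (hypothetical; it could
    wrap any chat model). The prompt wording below is illustrative only.
    """
    summary = llm(
        f"Write a summary of about {target_words} words that mentions few "
        f"specific entities.\n\nArticle:\n{article}"
    )
    history = [summary]
    for _ in range(steps - 1):
        # Each round adds missing entities while holding the length constant.
        summary = llm(
            "Identify 1-3 salient entities from the article that are missing "
            "from the summary, then rewrite the summary to include them "
            f"without exceeding {target_words} words.\n\n"
            f"Article:\n{article}\n\nCurrent summary:\n{summary}"
        )
        history.append(summary)
    return history
```

Returning the full history rather than only the final summary lets a caller pick an intermediate density, which matters if the densest version turns out to be harder to read.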

Decomposed prompting instructs the model to identify key points, draft the summary, and then revise it in separate steps, mimicking the plan-draft-revise workflow of professional summarizers.

Controllable summarization

Users specify length, focus topic, audience, or tone in natural language, and the model adapts without retraining. Aspect-based, query-focused, and multi-document summarization can all be expressed through prompts. Instruction-tuned LLMs generalize remarkably well across these controlled settings even when they have never seen the specific format during training.

Long-context summarization

With Claude offering 200K-token windows and Gemini reaching 1M tokens, entire technical reports, legal briefs, and novels can be processed without chunking, though hierarchical pipelines remain popular when budget or latency matters. Research in 2024 found that even long-context models still exhibit a "lost in the middle" bias: information in the center of the context window is retrieved less reliably than information near the beginning or end, with implications for summarization quality on very long inputs.

Multimodal summarization

Models that handle text, images, audio, and video together can summarize lectures, podcast episodes, and narrated slide decks. Audio-grounded summarization from speech combines automatic speech recognition with abstractive summarization and must handle disfluencies, speaker overlap, and the absence of punctuation. Video summarization adds visual salience estimation and temporal segmentation to the pipeline.

Fine-tuning versus prompting tradeoffs

A consistent empirical finding from 2024 and 2025 research is that fine-tuning a small model on domain-specific summarization data often outperforms zero-shot GPT-4 within that domain, while GPT-4 retains large advantages on zero-shot cross-domain transfer.^[28] This creates a deployment decision: fine-tuned smaller models are cheaper and more controllable; large zero-shot models are more flexible. Instruction-tuned open-source LLMs (Llama 3, Mistral, Qwen) occupy a middle ground when adapted with domain data.

Limitations

Several open problems remain active areas of research.

Factual hallucination. Abstractive models, including the strongest LLMs, sometimes introduce details that contradict or do not appear in the source.^[18] Studies have found error rates of 20-30% or higher for factual inconsistencies in abstractive summaries generated by strong models on news datasets.^[18] The problem is sharpest for long inputs, low-resource domains, and highly compressed summaries where the model must infer or synthesize rather than paraphrase.

Faithfulness versus informativeness. Extractive systems are faithful by construction but often miss synthesis opportunities and produce choppy output. Abstractive systems are more informative but harder to audit. This tradeoff remains unresolved; no current approach achieves both maximum faithfulness and maximum informativeness simultaneously.

Metric reliability. ROUGE rewards lexical overlap and penalizes valid paraphrases, which is one reason model rankings shift when human judges are asked instead.^[14] Newer metrics partially address this but bring their own biases, especially when evaluator and generator share an underlying model family. G-Eval using GPT-4 tends to favor GPT-4-generated summaries in pairwise comparisons, an instance of self-preference bias.^[25]

Position and length bias. Many neural summarizers learn to favor the first few sentences of the source, an artifact of news training data where important information concentrates near the lead. This "lead bias" was documented and studied systematically in 2019 and remains present in LLM-era models; long-context models exhibit a related "lost in the middle" effect where central content is systematically underweighted.

Evaluation in the LLM era. Reference summaries for benchmarks such as CNN/DailyMail and XSum were written by editors with their own conventions and errors. As LLM-generated summaries surpass these references on human preference studies, ROUGE computed against the original references becomes a weaker signal of progress.^[28] The community has moved toward human preference evaluation and LLM-judge evaluation, but these introduce new reliability and reproducibility concerns.

Domain transfer and out-of-distribution robustness. Models trained on news summarization degrade when applied to scientific, legal, or biomedical text. Vocabulary shift, domain-specific entity types, and different rhetorical structures all reduce quality. Fine-tuning on in-domain data is effective but requires annotated corpora that may be expensive to produce in specialized domains.

Multilingual summarization. Most benchmark progress has been on English. Cross-lingual summarization (summarizing a document in one language and producing output in another) and multilingual summarization (operating across many languages) remain harder, with performance gaps between English and low-resource languages that widen for more abstractive settings.^[28]

References

1. Lin, Chin-Yew (2004). *ROUGE: A Package for Automatic Evaluation of Summaries.* Text Summarization Branches Out, ACL Workshop. aclanthology.org/W04-1013
2. Mihalcea, Rada and Tarau, Paul (2004). *TextRank: Bringing Order into Texts.* EMNLP 2004. aclanthology.org/W04-3252
3. Erkan, Gunes and Radev, Dragomir R. (2004). *LexRank: Graph-based Lexical Centrality as Salience in Text Summarization.* Journal of Artificial Intelligence Research, 22. arxiv.org/abs/1109.2128
4. Rush, Alexander M.; Chopra, Sumit; and Weston, Jason (2015). *A Neural Attention Model for Abstractive Sentence Summarization.* EMNLP 2015. arxiv.org/abs/1509.00685
5. Hermann, Karl Moritz et al. (2015). *Teaching Machines to Read and Comprehend.* NeurIPS 2015. arxiv.org/abs/1506.03340
6. See, Abigail; Liu, Peter J.; and Manning, Christopher D. (2017). *Get To The Point: Summarization with Pointer-Generator Networks.* ACL 2017. arxiv.org/abs/1704.04368
7. Narayan, Shashi; Cohen, Shay B.; and Lapata, Mirella (2018). *Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization.* EMNLP 2018. arxiv.org/abs/1808.08745
8. Cohan, Arman et al. (2018). *A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents.* NAACL 2018. arxiv.org/abs/1804.05685
9. Gliwa, Bogdan et al. (2019). *SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization.* EMNLP-IJCNLP Workshop. arxiv.org/abs/1911.12237
10. Lewis, Mike et al. (2019). *BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension.* arxiv.org/abs/1910.13461
11. Liu, Yang and Lapata, Mirella (2019). *Text Summarization with Pretrained Encoders.* EMNLP 2019. arxiv.org/abs/1908.08345
12. Raffel, Colin et al. (2019). *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.* JMLR 21. arxiv.org/abs/1910.10683
13. Zhang, Jingqing; Zhao, Yao; Saleh, Mohammad; and Liu, Peter J. (2019). *PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization.* ICML 2020. arxiv.org/abs/1912.08777
14. Zhang, Tianyi et al. (2019). *BERTScore: Evaluating Text Generation with BERT.* ICLR 2020. arxiv.org/abs/1904.09675
15. Wang, Alex; Cho, Kyunghyun; and Lewis, Mike (2020). *Asking and Answering Questions to Evaluate the Factual Consistency of Summaries.* ACL 2020. arxiv.org/abs/2004.04228
16. Beltagy, Iz; Peters, Matthew E.; and Cohan, Arman (2020). *Longformer: The Long-Document Transformer.* arxiv.org/abs/2004.05150
17. Zaheer, Manzil et al. (2020). *Big Bird: Transformers for Longer Sequences.* NeurIPS 2020. arxiv.org/abs/2007.14062
18. Kryściński, Wojciech et al. (2020). *Evaluating the Factual Consistency of Abstractive Text Summarization.* EMNLP 2020. arxiv.org/abs/1910.12840
19. Huang, Luyang; Cao, Shuyang; Parulian, Nikolaus; Ji, Heng; and Wang, Lu (2021). *Efficient Attentions for Long Document Summarization.* NAACL 2021. arxiv.org/abs/2104.02112
20. Kryściński, Wojciech; Rajani, Nazneen; Agarwal, Divyansh; Xiong, Caiming; and Radev, Dragomir (2021). *BookSum: A Collection of Datasets for Long-form Narrative Summarization.* arxiv.org/abs/2105.08209
21. Guo, Mandy et al. (2021). *LongT5: Efficient Text-To-Text Transformer for Long Sequences.* Findings of NAACL 2022. arxiv.org/abs/2112.07916
22. Laban, Philippe et al. (2022). *SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization.* TACL 2022. arxiv.org/abs/2111.09525
23. Phang, Jason; Zhao, Yao; and Liu, Peter J. (2022). *Investigating Efficiently Extending Transformers for Long Input Summarization.* arxiv.org/abs/2208.04347
24. Adams, Griffin et al. (2023). *From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting.* NewSumm Workshop, EMNLP 2023. arxiv.org/abs/2309.04269
25. Liu, Yang et al. (2023). *G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment.* EMNLP 2023. arxiv.org/abs/2303.16634
26. Zha, Yuheng et al. (2023). *AlignScore: Evaluating Factual Consistency with a Unified Alignment Function.* ACL 2023. arxiv.org/abs/2305.16739
27. El-Kassas, Wafaa S. et al. (2021). *Automatic Text Summarization: A Comprehensive Survey.* Expert Systems with Applications 165. doi.org/10.1016/j.eswa.2020.113679
28. Tang, Yixin et al. (2024). *A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models.* arxiv.org/abs/2406.11289
