Chain of Density prompting

Natural Language Processing Prompt Engineering

16 min read

Updated Jul 23, 2026

Suggest edit History Talk

RawGraph

Last edited

Jul 23, 2026

Fact-checked

In review queue

Sources

9 citations

Revision

v4 · 3,161 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Chain of Density (CoD) is a prompting technique for abstractive text summarization with large language models, introduced in the 2023 paper "From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting" by Griffin Adams, Alex Fabbri, Faisal Ladhak, Eric Lehman, and Noémie Elhadad.^[1] The method instructs a model (originally GPT-4) to write an initial entity-sparse summary and then to rewrite it several times in succession, each time adding one to three missing salient entities while holding the total length fixed.^[1] The result is a sequence of progressively denser summaries from which a user can pick the step that best balances informativeness against readability. Empirical work on 100 CNN/DailyMail articles found that human annotators preferred summaries whose entity density was close to that of human-written references, and clearly preferred denser CoD outputs over GPT-4 outputs from a vanilla prompt.^[1] CoD was developed jointly at Salesforce AI Research, Columbia University, and MIT, and quickly attracted reimplementations in Hugging Face, LangChain, and the Instructor library.^[1]^[2]^[3]^[4]

Background and Motivation

By 2023, zero-shot prompting of LLMs like GPT-4 had become the dominant paradigm for news summarization, largely displacing supervised models such as BART and PEGASUS fine-tuned on labeled corpora.^[1] Researchers had shown that careful prompting could control length, style, and topic of generated summaries, and that GPT-3 summaries were often preferred by humans over previous supervised baselines.^[1] However, an information density axis was under-studied. The CoD authors observed that a summary "should be denser, containing a higher concentration of information, than the source document," yet vanilla GPT-4 prompts often produced summaries that were both lead-biased (drawing disproportionately from the article's opening sentences) and entity-sparse.^[1] Selecting the right level of density is a hard tradeoff: too sparse and the summary fails to inform; too dense and it becomes incoherent or factually unreliable within a fixed token budget.^[1]

The CoD paper proposed using the number of unique named entities per token as a proxy for density, and treating the choice of density as an empirical question to be answered by human preference rather than by automatic metrics.^[1] The authors note that this complements prior work on entity-based summarization (entity chains as planning targets, entity-grounded faithfulness, entity coverage as an evaluation unit) but adapts the idea to zero-shot prompting of a frontier LLM.^[1]

Publication Details

Field	Value
Title	From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting
Authors	Griffin Adams, Alex Fabbri, Faisal Ladhak, Eric Lehman, Noémie Elhadad
Affiliations	Columbia University (CS and Biomedical Informatics), Salesforce AI, MIT
arXiv ID	2309.04269
First submitted	8 September 2023
Venue	4th New Frontiers in Summarization Workshop (NewSum 2023), co-located with EMNLP 2023, Singapore
ACL Anthology ID	2023.newsum-1.7
Pages	68 to 74
Dataset	`griffin/chain_of_density` on Hugging Face (500 annotated + 5,000 unannotated CoD summaries)

Sources: arXiv listing,^[1] ACL Anthology entry,^[2] Hugging Face dataset card.^[3]

How Chain of Density Works

CoD is implemented as a single prompt to GPT-4 that asks the model to produce five successive summaries of one source article.^[1] The model is required, at each step, to identify one to three "Missing Entities" from the source that were not in its previous summary and to rewrite the prior summary so that it covers all previous content plus the new entities, without increasing the overall word count. To make room, the model is instructed to use abstraction, fusion, and compression, and to remove uninformative filler phrases such as "the article discusses."^[1]

The CoD prompt

The exact prompt published in Figure 2 of the paper reads, in part:^[1]

You will generate increasingly concise, entity-dense summaries of the above Article. Repeat the following 2 steps 5 times. Step 1. Identify 1 to 3 informative Entities ("," delimited) from the Article which are missing from the previously generated summary. Step 2. Write a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities.

A "Missing Entity" is defined by five criteria:^[1]

Relevant: to the main story.
Specific: descriptive yet concise (5 words or fewer).
Novel: not in the previous summary.
Faithful: present in the article.
Anywhere: located anywhere in the article.

The first summary is deliberately verbose: four to five sentences and roughly 80 words, "highly non-specific" and laced with fillers, so that subsequent steps have textual slack to absorb new entities by compression.^[1] All five summaries are returned in a single JSON list whose entries each contain the fields Missing_Entities and Denser_Summary.^[1]

Densification dynamics

Across 100 CNN/DailyMail articles, the authors measured direct statistics (token counts, unique entity counts via spaCy named entity recognition, and the entity-per-token ratio) at each step. Length stays close to a fixed budget (72 tokens at step 1, 67 tokens at step 2 as filler is removed, drifting back up to 72 tokens by step 5), while entity density rises monotonically from 0.089 at step 1 to 0.167 at step 5.^[1] For comparison, the average human-written reference has density 0.151 and a vanilla GPT-4 summary has 0.122.^[1]

CoD step	Avg tokens	Avg entities	Entity density (entities/token)
1	72	6.4	0.089
2	67	8.7	0.129
3	67	9.9	0.148
4	69	10.8	0.158
5	72	12.1	0.167
Human reference	60	8.8	0.151
Vanilla GPT-4	70	8.5	0.122

Table reproduced from Table 1 of the paper.^[1]

The authors also measure indirect statistics that they predict will move as a side-effect of densification. They report that abstractiveness increases (extractive density decreases) across the five steps, fusion (the number of source sentences aligned to each summary sentence) rises, and content distribution shifts away from the article's lead and toward middle and tail sentences.^[1] These three effects together describe a movement from a lead-biased extractive summary toward an abstractive, source-distributed one, while keeping length and content faithful.

Empirical Findings

Human preference study

The first four authors annotated 500 CoD summaries (5 steps for each of 100 articles), shown in randomized order alongside the articles. Each annotator picked their top preferred summary per article using the "good summary" rubric from Stiennon et al. (2020).^[1] Across annotators, the modal preferred step is 2, the median is 3, and the expected (mean) step is 3.06.^[1] In aggregate, 61% of first-place votes went to summaries at step 3 or later (23.0% + 22.5% + 15.5%), confirming that humans favor noticeably denser outputs than the vanilla GPT-4 baseline.^[1] At the same time, the most dense step (step 5) was the least preferred among CoD candidates, indicating a clear plateau followed by decline.

CoD step	Share of first-place votes (aggregate)	GPT-4 "Overall" rating (1 to 5)
1	8.3%	4.41
2	30.8%	4.58
3	23.0%	4.57
4	22.5%	4.61
5	15.5%	4.58

Compiled from Tables 2 and 3 of the paper.^[1] The "preferred" entity density inferred from step 3 ($\approx 0.15$) is essentially identical to that of human-written reference summaries (0.151) and considerably higher than that of vanilla GPT-4 (0.122).^[1] The annotators reported a low Fleiss' kappa of 0.112, which the authors attribute to the subtle differences between adjacent CoD steps and to subjectivity in summarization preference, a pattern previously observed for GPT-based summaries by Goyal et al. (2022).^[1]

Automatic ratings

The authors complement the human study with GPT-4-as-a-judge Likert ratings (1 to 5) along five dimensions adapted from Aharoni et al. (2023) and the SummEval framework: Informative, Quality, Coherence, Attributable, and Overall.^[1] The Informative score peaks at step 4 (4.74), while article-free dimensions (Quality at step 2 and Coherence at step 1) decline sooner. Overall scores are highest at steps 2 to 4 and lowest at the extremes, consistent with the human preference pattern.^[1] The summary-level Pearson correlation between GPT-4 Overall ratings and human preferences is 0.311, with weaker correlations for the other dimensions, in line with prior findings that automatic metrics struggle to distinguish summaries of similar quality.^[1]

Qualitative tradeoff

The authors give worked examples (Figure 4) in which one densification step improves the summary (adding "Liverpool" plus goal-scorers replaces a vague phrase with concrete causal information) and another step harms it (cramming in an additional detail about "TV5Monde" introduces an awkward fusion of unrelated entities, hurting readability).^[1] The qualitative takeaway is that there is a real, content-dependent ceiling on useful density beyond which entity insertion damages coherence and may invite hallucination.^[1]

CoD sits at the intersection of zero-shot summarization and structured iterative prompting. It is helpful to position it against three reference points the paper or its followers explicitly cite.

Technique	Mechanism	Goal	Iterates the same length?	Key reference
Vanilla summarization prompt	Single GPT-4 call, e.g. "Write a VERY short summary of the Article. Do not exceed 70 words."	Concise summary	Not applicable (single pass)	Adams et al., 2023^[1]
Chain of Density (CoD)	Single prompt that elicits 5 rewrites; each step adds 1 to 3 missing entities and re-compresses to identical length	Calibrate density tradeoff for a fixed length	Yes, by construction	Adams et al., 2023^[1]
Recursive summarization (Wu et al.)	Fine-tunes GPT-3 with RLHF to summarize book sections, then summaries of summaries, then summaries of those	Summarize very long inputs by recursive task decomposition	No (each level compresses)	Wu et al., 2021^[5]
Chain-of-Thought (CoT)	Prompt model to produce intermediate reasoning steps before answering	Improve multi-step reasoning accuracy	Not applicable (free length)	Wei et al., 2022

CoD and recursive summarization both use iteration but for different reasons. Recursive summarization (Wu, Ouyang et al., OpenAI, 2021) attacks input length by summarizing chunks and then summarizing the chunk-summaries, using fine-tuned models trained via human-feedback reward modeling.^[5] CoD addresses information density at fixed output length via prompt design alone with a frozen LLM, and iterates over the output rather than over chunks of the input.^[1] Compared to Chain-of-Thought prompting, which elicits intermediate reasoning to improve answer correctness, CoD elicits intermediate artifacts (the early summaries) to expose a quality tradeoff and let downstream users choose where on the curve to stop. The naming convention echoes CoT but the mechanism, target, and output structure are distinct.

Beyond these reference points, CoD is often discussed alongside Chain-of-Verification (CoVe), Tree of Thoughts, and Self-Refine in surveys of iterative LLM prompting strategies.^[4]

Implementations and Adoption

The paper, the prompt, and the dataset were released together in September 2023, and the technique was reimplemented broadly within weeks. Key public artifacts include:

Hugging Face dataset griffin/chain_of_density: 100 annotated CoD test articles and 5,000 additional unannotated CoD summaries, with per-step token counts, entity counts, density, fusion scores, ROUGE scores, and GPT-4 Likert ratings.^[3] The annotated split is intended for evaluation; the unannotated split is intended for density distillation into smaller open models such as Llama 2.^[1]
LangChain Hub publishes a chain-of-density prompt template adapted to LangChain's ChatPrompt format, configurable to different models and entity definitions.^[6]
Instructor library tutorial by Ivan Leo and Jason Liu (5 November 2023) demonstrates distilling the multi-step CoD process into a single specialized model by fine-tuning GPT-3.5 on a small set of CoD outputs, reporting roughly 20x latency improvement and substantial cost reduction while maintaining entity density of about 0.14 to 0.15.^[4]
Community reimplementations include richawo/chain-of-density, an open-source Python implementation that uses the OpenAI API to apply CoD to arbitrary input documents.^[7] Other community repositories and notebooks reproduce the prompt against open-source models.

The original authors' code and example scripts are tracked alongside the Hugging Face dataset, but the canonical reproducible artifact for most users is the dataset card and the prompt printed verbatim in Figure 2 of the paper.^[1]^[3]

Distillation into smaller models

A 2024 study, Knowledge Distillation Using Frontier Open-Source LLMs by Shirgaonkar, Pandey, Abay, Aktas, and Aski, uses Llama 3.1 405B Instruct as a teacher to generate synthetic training data, with the CoD dataset as one of three target tasks (alongside GovReport and BBCNews).^[8] The authors find that synthetic CoD-style training data significantly improves the summarization accuracy of 8B and 70B student models, providing evidence that the dense-summary signal from frontier models is transferable via knowledge distillation.^[8] This direction was anticipated in the original CoD paper's "Limitations" section, which explicitly proposes "density distillation into an open-sourced model" as future work.^[1]

Applications

The CoD framework has been applied or recommended in several settings:

News summarization: the original evaluation domain, drawn from CNN/DailyMail.^[1]
Long-document business summarization: third-party blog evaluations apply CoD to financial filings such as 10-K reports, although with mixed results on consistency and adherence to length (see Limitations).^[9]
Production summarization pipelines: the Instructor blog and LangChain Hub templates target latency- and cost-sensitive applications by either using CoD directly or distilling it into a fine-tuned smaller model.^[4]^[6]
Synthetic training data: CoD outputs from frontier models are used as targets for fine-tuning open models, with the explicit goal of teaching them to write dense yet readable short summaries.^[1]^[8]

Because the technique relies on a single prompt and no model changes, it can be combined with any frontier instruction-tuned LLM that handles structured (JSON) output reliably.^[1]^[6]

Limitations and Criticisms

The CoD authors and subsequent independent evaluators have flagged several real limitations.

Acknowledged by the original paper

Single domain: the evaluation is only on news (CNN/DailyMail); generalization to scientific, legal, medical, or conversational text is not tested.^[1]
Low instance-level agreement: Fleiss' kappa among the four annotators is just 0.112, meaning that for any single article, annotators often disagree about which CoD step is best. System-level trends (e.g., aggregate preference for steps 2 to 4) are clearer than per-article judgments.^[1]
Closed model: GPT-4 is proprietary, so the original results cannot be reproduced exactly with open weights, and only the outputs (the dataset) are open.^[1]
Automatic-metric weakness: even GPT-4-as-judge correlates only modestly with human preferences (0.31 Pearson at the summary level), so automatic evaluation alone cannot reliably tune density.^[1]

Identified by follow-up work

A 2024 evaluation by the Yugen.ai team running CoD against the NVIDIA 10-K annual report with GPT-4o and a 1,000-word target found additional practical issues:^[9]

Instruction non-adherence: the model sometimes produced shorter summaries than requested and dropped entities from earlier steps despite the "never drop entities" rule.^[9]
Early density saturation: entity-token density saturated around internal step 4, with later steps showing flat or declining unique-entity counts.^[9]
Non-reproducibility: even with temperature 0 and fixed seeds, the final summaries across five independent runs differed materially, with cosine similarity ranging from 0.78 to 0.92 (TF-IDF) and 0.79 to 0.98 (embeddings).^[9]
Latency and cost: five sequential summary rewrites per document increase both wall-clock latency and token billing relative to single-pass prompting.^[9]
Long-document limits: very long inputs require chunking or retrieval, since the entire document plus the growing chain of prior summaries must fit in the model's context window.^[9]

The trade-off between informativeness and readability also has a "hallucination edge": as compression becomes more aggressive at later steps, the risk of factually awkward fusions (the paper's TV5Monde example) rises, and the authors leave precise quantification of that risk to future work.^[1] More generally, dense summaries can amplify hallucinations when an entity that is grammatically forced into the rewrite is not actually grounded in the source.

Significance

CoD is significant for three reasons. First, it introduced entity density as a controllable, measurable dimension of LLM-generated summaries that is orthogonal to length and that previously had no principled prompt-level handle.^[1] Second, it gave practitioners a single, simple, copy-pasteable prompt that produces an entire density curve rather than one summary, letting a downstream system or user pick the operating point. Third, by open-sourcing 500 annotated and 5,000 unannotated CoD summaries, the authors created a small but high-quality resource for downstream distillation, which has since been used to transfer dense-summary behavior to open models.^[1]^[3]^[8]

Within prompt engineering more broadly, CoD is often cited alongside Chain-of-Thought, Tree of Thoughts, and self-refinement methods as an early demonstration that structured iteration within a single prompt can extract a usable quality curve from a frozen model without any fine-tuning or external orchestrator.^[4]

Chain-of-Thought prompting: parallel naming convention, different aim (reasoning accuracy versus summary density).
Tree of Thoughts: branching search over intermediate states rather than a linear chain.
Summarization models and the broader text summarization literature.
ROUGE as a long-standing automatic metric for summary evaluation.
Named entity recognition, used to count entities per token.
Knowledge distillation from frontier LLMs as a downstream use of CoD data.
RLHF and recursive summarization, the prior dominant paradigm for long-document summarization (Wu et al., 2021).
Instructor library CoD tutorial, which packaged the technique for production users.
Salesforce AI Research, one of the originating labs.

References

Griffin Adams, Alex Fabbri, Faisal Ladhak, Eric Lehman, Noémie Elhadad, "From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting", arXiv:2309.04269 [cs.CL], 2023-09-08. https://arxiv.org/abs/2309.04269. Accessed 2026-05-21. ↩
Griffin Adams, Alex Fabbri, Faisal Ladhak, Eric Lehman, Noémie Elhadad, "From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting", Proceedings of the 4th New Frontiers in Summarization Workshop (NewSum 2023), Association for Computational Linguistics, pp. 68 to 74, 2023-12-06. https://aclanthology.org/2023.newsum-1.7/. Accessed 2026-05-21. ↩
Griffin Adams, "chain_of_density dataset card", Hugging Face Datasets, 2023-09. https://huggingface.co/datasets/griffin/chain_of_density. Accessed 2026-05-21. ↩
Ivan Leo and Jason Liu, "Smarter Summaries w/ Finetuning GPT-3.5 and Chain of Density", Instructor library blog, 2023-11-05. https://python.useinstructor.com/blog/2023/11/05/chain-of-density/. Accessed 2026-05-21. ↩
Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, Paul Christiano, "Recursively Summarizing Books with Human Feedback", arXiv:2109.10862 [cs.CL], 2021-09-22. https://arxiv.org/abs/2109.10862. Accessed 2026-05-21. ↩
LangChain AI, "chain-of-density prompt", LangSmith Hub, 2023. https://smith.langchain.com/hub/langchain-ai/chain-of-density. Accessed 2026-05-21. ↩
Richard Awoyemi, "richawo/chain-of-density: Implementing the Chain Of Density text summarisation technique", GitHub, 2023. https://github.com/richawo/chain-of-density. Accessed 2026-05-21. ↩
Anup Shirgaonkar, Nikhil Pandey, Nazmiye Ceren Abay, Tolga Aktas, Vijay Aski, "Knowledge Distillation Using Frontier Open-Source LLMs: Generalizability and the Role of Synthetic Data", arXiv:2410.18588 [cs.CL], 2024-10-24. https://arxiv.org/abs/2410.18588. Accessed 2026-05-21. ↩
Deepak Jangra and Akshay Singh, "Evaluating Chain of Density Method for Better LLM Summarization", Yugen.ai Technology Blog (Medium), 2024-09-13. https://medium.com/yugen-ai-technology-blog/evaluating-chain-of-density-method-for-better-llm-summarization-2a4f32695821. Accessed 2026-05-21. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

3 revisions by 1 contributor · full history

Suggest edit

What links here

Chain-of-Thought

Background and Motivation

Publication Details

How Chain of Density Works

The CoD prompt

Densification dynamics

Empirical Findings

Human preference study

Automatic ratings

Qualitative tradeoff

Comparison with Related Prompting Techniques

Implementations and Adoption

Distillation into smaller models

Applications

Limitations and Criticisms

Acknowledged by the original paper

Identified by follow-up work

Significance

Related Work

See also

References

Improve this article

Related Articles

Agentic Context Engineering

How to Pressure LLMs for Better Output

Meta Prompting

Chain-of-Thought

26 Principles of Good Prompts

Prompt

What links here

Related Articles

Agentic Context Engineering

How to Pressure LLMs for Better Output

Meta Prompting

Chain-of-Thought

26 Principles of Good Prompts

Prompt