Chain of Density prompting
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,166 words
Chain of Density (CoD) is a prompting technique for abstractive text summarization with large language models, introduced in the 2023 paper "From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting" by Griffin Adams, Alex Fabbri, Faisal Ladhak, Eric Lehman, and Noémie Elhadad.[^1] The method instructs a model (originally GPT-4) to write an initial entity-sparse summary and then to rewrite it several times in succession, each time adding one to three missing salient entities while holding the total length fixed.[^1] The result is a sequence of progressively denser summaries from which a user can pick the step that best balances informativeness against readability. Empirical work on 100 CNN/DailyMail articles found that human annotators preferred summaries whose entity density was close to that of human-written references, and clearly preferred denser CoD outputs over GPT-4 outputs from a vanilla prompt.[^1] CoD was developed jointly at Salesforce AI Research, Columbia University, and MIT, and quickly attracted reimplementations in Hugging Face, LangChain, and the Instructor library.[^1][^2][^3][^4]
By 2023, zero-shot prompting of LLMs like GPT-4 had become the dominant paradigm for news summarization, largely displacing supervised models such as BART and PEGASUS fine-tuned on labeled corpora.[^1] Researchers had shown that careful prompting could control the length, style, and topic of generated summaries, and that GPT-3 summaries were often preferred by humans over previous supervised baselines.[^1] Information density, however, remained an under-studied axis. The CoD authors observed that a summary "should be denser, containing a higher concentration of information, than the source document," yet vanilla GPT-4 prompts often produced summaries that were both lead-biased (drawing disproportionately from the article's opening sentences) and entity-sparse.[^1] Selecting the right level of density is a hard tradeoff: too sparse and the summary fails to inform; too dense and it becomes incoherent or factually unreliable within a fixed token budget.[^1]
The CoD paper proposed using the number of unique named entities per token as a proxy for density, and treating the choice of density as an empirical question to be answered by human preference rather than by automatic metrics.[^1] The authors note that this complements prior work on entity-based summarization (entity chains as planning targets, entity-grounded faithfulness, entity coverage as an evaluation unit) but adapts the idea to zero-shot prompting of a frontier LLM.[^1]
| Field | Value |
|---|---|
| Title | From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting |
| Authors | Griffin Adams, Alex Fabbri, Faisal Ladhak, Eric Lehman, Noémie Elhadad |
| Affiliations | Columbia University (CS and Biomedical Informatics), Salesforce AI, MIT |
| arXiv ID | 2309.04269 |
| First submitted | 8 September 2023 |
| Venue | 4th New Frontiers in Summarization Workshop (NewSum 2023), co-located with EMNLP 2023, Singapore |
| ACL Anthology ID | 2023.newsum-1.7 |
| Pages | 68 to 74 |
| Dataset | griffin/chain_of_density on Hugging Face (500 annotated + 5,000 unannotated CoD summaries) |
Sources: arXiv listing,[^1] ACL Anthology entry,[^2] Hugging Face dataset card.[^3]
CoD is implemented as a single prompt to GPT-4 that asks the model to produce five successive summaries of one source article.[^1] The model is required, at each step, to identify one to three "Missing Entities" from the source that were not in its previous summary and to rewrite the prior summary so that it covers all previous content plus the new entities, without increasing the overall word count. To make room, the model is instructed to use abstraction, fusion, and compression, and to remove uninformative filler phrases such as "the article discusses."[^1]
The exact prompt published in Figure 2 of the paper reads, in part:[^1]
You will generate increasingly concise, entity-dense summaries of the above Article. Repeat the following 2 steps 5 times. Step 1. Identify 1 to 3 informative Entities ("," delimited) from the Article which are missing from the previously generated summary. Step 2. Write a new, denser summary of identical length which covers every entity and detail from the previous summary plus the Missing Entities.
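A "Missing Entity" is defined by five criteria: it must be relevant (to the main story), specific (descriptive yet concise, five words or fewer), novel (not in the previous summary), faithful (present in the Article), and may be located anywhere in the Article.[^1]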
The first summary is deliberately verbose: four to five sentences and roughly 80 words, "highly non-specific" and laced with fillers, so that subsequent steps have textual slack to absorb new entities by compression.[^1] All five summaries are returned in a single JSON list whose entries each contain the fields Missing_Entities and Denser_Summary.[^1]
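The output format described above lends itself to a simple structural check before any downstream use. The following is a minimal sketch, assuming the model's reply is already available as a string: the function names `parse_cod_output` and `word_counts` are hypothetical helpers, not part of the paper or any library, and the sample reply is invented for illustration.

```python
import json

# Field names as described in the text above.
REQUIRED_FIELDS = ("Missing_Entities", "Denser_Summary")


def parse_cod_output(raw: str, expected_steps: int = 5) -> list:
    """Parse a CoD reply (a JSON list of step objects) and check its shape."""
    steps = json.loads(raw)
    if not isinstance(steps, list) or len(steps) != expected_steps:
        raise ValueError(f"expected a JSON list of {expected_steps} steps")
    for index, step in enumerate(steps, start=1):
        if not isinstance(step, dict):
            raise ValueError(f"step {index} is not a JSON object")
        for field in REQUIRED_FIELDS:
            if field not in step:
                raise ValueError(f"step {index} is missing field {field!r}")
    return steps


def word_counts(steps: list) -> list:
    """Whitespace word count of each step's summary, to spot length drift."""
    return [len(step["Denser_Summary"].split()) for step in steps]


# Invented reply, for illustration only.
example_reply = json.dumps(
    [
        {"Missing_Entities": "Entity A", "Denser_Summary": "A short placeholder summary."}
        for _ in range(5)
    ]
)
print(word_counts(parse_cod_output(example_reply)))
```

Comparing the word counts across steps is a cheap way to notice when a model ignores the identical-length instruction, although whitespace splitting is only a rough stand-in for the tokenization used in the paper.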
Across 100 CNN/DailyMail articles, the authors measured direct statistics (token counts, unique entity counts via spaCy named entity recognition, and the entity-per-token ratio) at each step. Length stays close to a fixed budget (72 tokens at step 1, 67 tokens at step 2 as filler is removed, drifting back up to 72 tokens by step 5), while entity density rises monotonically from 0.089 at step 1 to 0.167 at step 5.[^1] For comparison, the average human-written reference has density 0.151 and a vanilla GPT-4 summary has 0.122.[^1]
| CoD step | Avg tokens | Avg entities | Entity density (entities/token) |
|---|---|---|---|
| 1 | 72 | 6.4 | 0.089 |
| 2 | 67 | 8.7 | 0.129 |
| 3 | 67 | 9.9 | 0.148 |
| 4 | 69 | 10.8 | 0.158 |
| 5 | 72 | 12.1 | 0.167 |
| Human reference | 60 | 8.8 | 0.151 |
| Vanilla GPT-4 | 70 | 8.5 | 0.122 |
Table reproduced from Table 1 of the paper.[^1]
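As a rough illustration of the density proxy, the sketch below recomputes entities per token from the averaged counts in the table and finds the CoD step whose reported density is closest to a target. The helper names are hypothetical. Dividing the average entity count by the average token count will not exactly reproduce the reported column if, as seems likely but is an assumption here, the paper averages per-summary ratios instead.

```python
def entity_density(num_unique_entities: float, num_tokens: float) -> float:
    """Entity density as unique entities per token."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return num_unique_entities / num_tokens


# step -> (avg tokens, avg entities, reported density), copied from the table above.
REPORTED = {
    1: (72, 6.4, 0.089),
    2: (67, 8.7, 0.129),
    3: (67, 9.9, 0.148),
    4: (69, 10.8, 0.158),
    5: (72, 12.1, 0.167),
}
HUMAN_REFERENCE_DENSITY = 0.151
VANILLA_GPT4_DENSITY = 0.122


def closest_step(target_density: float) -> int:
    """CoD step whose reported density is nearest to the target."""
    return min(REPORTED, key=lambda step: abs(REPORTED[step][2] - target_density))


for step, (tokens, entities, reported) in REPORTED.items():
    approx = entity_density(entities, tokens)
    print(f"step {step}: ratio of averages {approx:.3f}, reported {reported:.3f}")

print("closest to human reference:", closest_step(HUMAN_REFERENCE_DENSITY))
```

With the table's figures, the step nearest the human reference density is step 3, which matches the observation elsewhere in the article that preferred density sits close to that of human-written summaries.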
The authors also measure indirect statistics that they predict will move as a side-effect of densification. They report that abstractiveness increases (extractive density decreases) across the five steps, fusion (the number of source sentences aligned to each summary sentence) rises, and content distribution shifts away from the article's lead and toward middle and tail sentences.[^1] Together, these three effects describe a shift from a lead-biased, extractive summary toward an abstractive one that draws on the whole source, while length stays roughly fixed and the content is meant to remain faithful.
The first four authors annotated 500 CoD summaries (5 steps for each of 100 articles), shown in randomized order alongside the articles. Each annotator picked their top preferred summary per article using the "good summary" rubric from Stiennon et al. (2020).[^1] Across annotators, the modal preferred step is 2, the median is 3, and the expected (mean) step is 3.06.[^1] In aggregate, 61% of first-place votes went to summaries at step 3 or later (23.0% + 22.5% + 15.5%), confirming that humans favor noticeably denser outputs than the vanilla GPT-4 baseline.[^1] At the same time, preference falls off at the densest step: step 5 drew fewer first-place votes (15.5%) than any of steps 2 through 4, though still more than the sparse step 1 (8.3%), indicating a plateau followed by decline.
| CoD step | Share of first-place votes (aggregate) | GPT-4 "Overall" rating (1 to 5) |
|---|---|---|
| 1 | 8.3% | 4.41 |
| 2 | 30.8% | 4.58 |
| 3 | 23.0% | 4.57 |
| 4 | 22.5% | 4.61 |
| 5 | 15.5% | 4.58 |
Compiled from Tables 2 and 3 of the paper.[^1]
The "preferred" entity density inferred from step 3 (about 0.15) is essentially identical to that of human-written reference summaries (0.151) and considerably higher than that of vanilla GPT-4 (0.122).[^1] Inter-annotator agreement was low, with a Fleiss' kappa of 0.112, which the authors attribute to the subtle differences between adjacent CoD steps and to subjectivity in summarization preference, a pattern previously observed for GPT-based summaries by Goyal et al. (2022).[^1]
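The summary statistics quoted for the human preference study can be checked against the aggregate vote shares in the table. The sketch below is only an arithmetic illustration: `summarize_votes` is a hypothetical helper, and it works from the rounded aggregate percentages, not from the per-annotator data.

```python
# Aggregate share of first-place votes per CoD step, copied from the table above.
VOTE_SHARE = {1: 8.3, 2: 30.8, 3: 23.0, 4: 22.5, 5: 15.5}


def summarize_votes(shares: dict) -> dict:
    """Mean, mode, and median preferred step, plus the share at step 3 or later."""
    total = sum(shares.values())  # about 100.1 because the inputs are rounded
    mean = sum(step * share for step, share in shares.items()) / total
    mode = max(shares, key=shares.get)

    # Median: first step at which the cumulative share reaches half the total.
    cumulative = 0.0
    median = None
    for step in sorted(shares):
        cumulative += shares[step]
        if cumulative >= total / 2:
            median = step
            break

    late_share = sum(share for step, share in shares.items() if step >= 3)
    return {"mean": mean, "mode": mode, "median": median, "share_step_3_plus": late_share}


stats = summarize_votes(VOTE_SHARE)
print(stats)
```

Run on the table's figures, this gives a mean step of about 3.06, a mode of 2, a median of 3, and 61% of votes at step 3 or later, in line with the numbers quoted in the text.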
The authors complement the human study with GPT-4-as-a-judge Likert ratings (1 to 5) along five dimensions adapted from Aharoni et al. (2023) and the SummEval framework: Informative, Quality, Coherence, Attributable, and Overall.[^1] The Informative score peaks at step 4 (4.74), while the article-free dimensions peak earlier (Quality at step 2, Coherence at step 1) and decline from there. Overall scores are lowest at step 1 (4.41) and highest at step 4 (4.61), with steps 2 through 5 all falling between 4.57 and 4.61, which is broadly consistent with the human preference for intermediate densities.[^1] The summary-level Pearson correlation between GPT-4 Overall ratings and human preferences is 0.311, with weaker correlations for the other dimensions, in line with prior findings that automatic metrics struggle to distinguish summaries of similar quality.[^1]
The authors give worked examples (Figure 4) in which one densification step improves the summary (adding "Liverpool" plus goal-scorers replaces a vague phrase with concrete causal information) and another step harms it (cramming in an additional detail about "TV5Monde" introduces an awkward fusion of unrelated entities, hurting readability).[^1] The qualitative takeaway is that there is a real, content-dependent ceiling on useful density beyond which entity insertion damages coherence and may invite hallucination.[^1]
CoD sits at the intersection of zero-shot summarization and structured iterative prompting. It is helpful to position it against three reference points the paper or its followers explicitly cite.
| Technique | Mechanism | Goal | Fixed length across iterations? | Key reference |
|---|---|---|---|---|
| Vanilla summarization prompt | Single GPT-4 call, e.g. "Write a VERY short summary of the Article. Do not exceed 70 words." | Concise summary | Not applicable (single pass) | Adams et al., 2023[^1] |
| Chain of Density (CoD) | Single prompt that elicits 5 rewrites; each step adds 1 to 3 missing entities and re-compresses to identical length | Calibrate density tradeoff for a fixed length | Yes, by construction | Adams et al., 2023[^1] |
| Recursive summarization (Wu et al.) | Fine-tunes GPT-3 with RLHF to summarize book sections, then summaries of summaries, then summaries of those | Summarize very long inputs by recursive task decomposition | No (each level compresses) | Wu et al., 2021[^5] |
| Chain-of-Thought (CoT) | Prompt model to produce intermediate reasoning steps before answering | Improve multi-step reasoning accuracy | Not applicable (free length) | Wei et al., 2022 |
CoD and recursive summarization both use iteration but for different reasons. Recursive summarization (Wu, Ouyang et al., OpenAI, 2021) attacks input length by summarizing chunks and then summarizing the chunk-summaries, using fine-tuned models trained via human-feedback reward modeling.[^5] CoD addresses information density at fixed output length via prompt design alone with a frozen LLM, and iterates over the output rather than over chunks of the input.[^1] Compared to Chain-of-Thought prompting, which elicits intermediate reasoning to improve answer correctness, CoD elicits intermediate artifacts (the early summaries) to expose a quality tradeoff and let downstream users choose where on the curve to stop. The naming convention echoes CoT but the mechanism, target, and output structure are distinct.
Beyond these reference points, CoD is often discussed alongside Chain-of-Verification (CoVe), Tree of Thoughts, and Self-Refine in surveys of iterative LLM prompting strategies.[^4]
The paper, the prompt, and the dataset were released together in September 2023, and the technique was reimplemented broadly within weeks. Key public artifacts include:
- griffin/chain_of_density: 100 annotated CoD test articles and 5,000 additional unannotated CoD summaries, with per-step token counts, entity counts, density, fusion scores, ROUGE scores, and GPT-4 Likert ratings.[^3] The annotated split is intended for evaluation; the unannotated split is intended for density distillation into smaller open models such as Llama 2.[^1]
- A chain-of-density prompt template adapted to LangChain's ChatPrompt format, configurable to different models and entity definitions.[^6]
- richawo/chain-of-density, an open-source Python implementation that uses the OpenAI API to apply CoD to arbitrary input documents.[^7] Other community repositories and notebooks reproduce the prompt against open-source models.

The original authors' code and example scripts are tracked alongside the Hugging Face dataset, but the canonical reproducible artifact for most users is the dataset card and the prompt printed verbatim in Figure 2 of the paper.[^1][^3]
A 2024 study, Knowledge Distillation Using Frontier Open-Source LLMs by Shirgaonkar, Pandey, Abay, Aktas, and Aski, uses Llama 3.1 405B Instruct as a teacher to generate synthetic training data, with the CoD dataset as one of three target tasks (alongside GovReport and BBCNews).[^8] The authors find that synthetic CoD-style training data significantly improves the summarization accuracy of 8B and 70B student models, providing evidence that the dense-summary signal from frontier models is transferable via knowledge distillation.[^8] This direction was anticipated in the original CoD paper's "Limitations" section, which explicitly proposes "density distillation into an open-sourced model" as future work.[^1]
The CoD framework has been applied or recommended in several settings:
Because the technique relies on a single prompt and no model changes, it can be combined with any frontier instruction-tuned LLM that handles structured (JSON) output reliably.[^1][^6]
The CoD authors and subsequent independent evaluators have flagged several real limitations.
A 2024 evaluation by the Yugen.ai team running CoD against the NVIDIA 10-K annual report with GPT-4o and a 1,000-word target found additional practical issues:[^9]
The trade-off between informativeness and readability also has a "hallucination edge": as compression becomes more aggressive at later steps, the risk of factually awkward fusions (the paper's TV5Monde example) rises, and the authors leave precise quantification of that risk to future work.[^1] More generally, dense summaries can amplify hallucinations when an entity that is grammatically forced into the rewrite is not actually grounded in the source.
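One inexpensive guard against ungrounded entities is to check that every entity the model claims to have added actually occurs in the source text. The sketch below is a crude heuristic and not a method from the paper: `ungrounded_entities` is a hypothetical helper, the article string is invented, and plain substring matching will miss aliases, inflections, and paraphrases, so a flagged entity deserves review but is not necessarily hallucinated.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores case and line breaks."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def ungrounded_entities(article: str, entities: list) -> list:
    """Return the entities whose normalized surface form never appears in the article."""
    haystack = _normalize(article)
    return [entity for entity in entities if _normalize(entity) not in haystack]


# Invented article text, for illustration only.
article = "Liverpool beat Chelsea 2-1 at Anfield on Saturday."
print(ungrounded_entities(article, ["Liverpool", "Anfield", "TV5Monde"]))
```

A check like this fits naturally between CoD steps or as a final filter: steps whose newly added entities are all found in the source pass through, while the rest are flagged for a stricter faithfulness check.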
CoD is significant for three reasons. First, it introduced entity density as a controllable, measurable dimension of LLM-generated summaries that is orthogonal to length and that previously had no principled prompt-level handle.[^1] Second, it gave practitioners a single, simple, copy-pasteable prompt that produces an entire density curve rather than one summary, letting a downstream system or user pick the operating point. Third, by open-sourcing 500 annotated and 5,000 unannotated CoD summaries, the authors created a small but high-quality resource for downstream distillation, which has since been used to transfer dense-summary behavior to open models.[^1][^3][^8]
Within prompt engineering more broadly, CoD is often cited alongside Chain-of-Thought, Tree of Thoughts, and self-refinement methods as an early demonstration that structured iteration within a single prompt can extract a usable quality curve from a frozen model without any fine-tuning or external orchestrator.[^4]