Chain of Verification (CoVe)
Last reviewed
May 25, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,632 words
Chain of Verification (CoVe) is a prompting technique that reduces factual hallucinations in large language models by having the model fact-check its own draft response through a structured four-step deliberation process. The method was introduced in September 2023 by Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston at Meta AI, in the paper "Chain-of-Verification Reduces Hallucination in Large Language Models" (arXiv:2309.11495).[1] The pipeline asks a model to (1) draft an initial response, (2) plan a list of independent verification questions, (3) answer those questions in isolation so that the answers cannot be biased by the original draft, and (4) generate a revised final response that reconciles the verification outputs with the draft.[1] The authors reported that applying CoVe to a LLaMA 65B base model more than doubled precision on list-based Wikidata questions and lifted FActScore on long-form Wikipedia biography generation from 55.9 to 71.4 percent.[1] The paper was published in Findings of the Association for Computational Linguistics in 2024, and the technique has been implemented in third-party tooling, including LangChain templates and open-source Python packages.[2][3]
Hallucination, defined by the CoVe authors as the production of "plausible yet incorrect factual information," remains one of the most studied failure modes of modern large language models.[1] It arises in part from the autoregressive token-by-token generation procedure: once a model commits to an incorrect entity or fact early in a response, subsequent tokens often condition on and amplify that error rather than correct it.[1] The CoVe paper distinguishes between hallucinations in short, list-like outputs (for example, "name all the politicians born in New York") and in long-form prose (for example, a multi-paragraph biography), arguing that the latter are particularly hard to detect because errors are interleaved with correct content.[1]
Prior to CoVe, several lines of work attempted to mitigate hallucination at inference time without retraining. Wang and colleagues at Google introduced self-consistency in March 2022, which samples multiple chain-of-thought reasoning paths from a single prompt and selects the majority answer, improving arithmetic and commonsense reasoning accuracy by between 3.9 and 17.9 percentage points across benchmarks.[4] Madaan and colleagues introduced Self-Refine in March 2023, in which the same model generates an output, critiques it, and iteratively rewrites the output using its own feedback, reporting roughly 20 percent absolute improvement across seven tasks without any additional training.[5] Self-criticism approaches, in which a model is prompted to find errors in its own output and revise, have been studied extensively but are limited by the model's tendency to validate its own initial answer.[1] Luyu Gao and colleagues introduced RARR (Researching and Revising What Language Models Say, Using Language Models) at ACL 2023, which finds attributions for model outputs and post-edits unsupported content while preserving the original text as much as possible.[6]
A separate strand of work, retrieval-augmented generation, attacks hallucination by grounding outputs in external knowledge rather than asking the model to police itself; RAG systems retrieve passages from a corpus or the open web and pass them into the prompt as context.[7] CoVe was positioned as a complement to such grounding methods: it is a pure prompting recipe that requires no retrieval index, no tools, and no fine-tuning, and works with any sufficiently capable base model.[1]
The CoVe authors framed their contribution against this background as an attempt to test "whether language models can deliberate on the responses they give in order to correct their mistakes," without recourse to retrieval or to additional training.[1] The seven-author team came from Meta AI's Fundamental AI Research lab and from ETH Zurich, with first author Shehzaad Dhuliawala holding a joint appointment between the two institutions, and last author Jason Weston being a long-time research scientist at Meta AI working on dialogue and reasoning.[1] The paper's framing was deliberately conservative: rather than claim a new architecture or training procedure, the authors proposed a prompting protocol that could be layered onto existing models without modification.[1]
The CoVe pipeline is defined by four sequential stages applied to a single input query.[1]
In stage one, the base response generation, the model is given the user query and produces an initial draft response in the normal way, with no special instructions to verify or fact-check.[1] This draft is treated as a hypothesis that may contain errors.
In stage two, the planning stage, the model receives the query and the draft response and is prompted to list a set of fact-checking questions whose answers, if available, would let a reader judge whether the draft is correct.[1] For a draft that lists politicians born in New York, the model might propose questions such as "Where was politician A born?" and "Where was politician B born?" The plan is a flat list of verification questions, not a tree.
In stage three, the execution stage, the model answers each verification question.[1] The CoVe paper investigates several ways to execute this step, which differ in how much context the model sees when answering each question.
In stage four, the final verified response, the model is given the query, the original draft, the verification questions, and the verification answers, and is prompted to write a revised response that is consistent with the verifications.[1] If a verification answer contradicts the draft, the final response is expected to drop or correct the offending claim. If verifications confirm the draft, the final response retains the corresponding content.
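The four stages above can be sketched as a short Python function. This is a minimal illustration rather than the authors' implementation: the `LLM` callable, the prompt wording, and the `fake_llm` stub with its canned replies are all assumptions introduced here so that the sketch runs offline, and stage three is shown with one prompt per question that omits the draft.

```python
from typing import Callable, List

# Hypothetical stand-in for any text-completion call: prompt in, text out.
LLM = Callable[[str], str]


def chain_of_verification(query: str, llm: LLM) -> str:
    """Run the four CoVe stages; prompt wording is illustrative only."""
    # Stage 1: draft a baseline response with no verification instructions.
    draft = llm(f"Question: {query}\nAnswer:")

    # Stage 2: plan a flat list of verification questions from query and draft.
    plan = llm(
        "List fact-checking questions, one per line, for this answer.\n"
        f"Question: {query}\nAnswer: {draft}\nVerification questions:"
    )
    questions: List[str] = [q.strip() for q in plan.splitlines() if q.strip()]

    # Stage 3: answer each question in a prompt that does not contain the draft.
    answers = [llm(f"Question: {q}\nAnswer:") for q in questions]

    # Stage 4: revise the draft so that it agrees with the verification answers.
    checks = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    return llm(
        f"Question: {query}\nDraft: {draft}\nVerifications:\n{checks}\n"
        "Final verified answer:"
    )


def fake_llm(prompt: str) -> str:
    """Toy model with canned replies (assumed data, not real model output)."""
    if "Final verified answer:" in prompt:
        return "Person B"
    if "Verification questions:" in prompt:
        return "Where was Person A born?\nWhere was Person B born?"
    if prompt.startswith("Question: Where was Person A"):
        return "Chicago"
    if prompt.startswith("Question: Where was Person B"):
        return "New York"
    return "Person A, Person B"


print(chain_of_verification("Name politicians born in New York.", fake_llm))
```

In a real deployment `fake_llm` would be replaced by a call to an actual model; the control flow is the part that mirrors the stages described above.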
The CoVe paper investigates four variants of how the verification questions are executed.[1]
The Joint variant runs the entire pipeline in a single prompt: planning, execution, and final response are all generated left to right, with later tokens conditioned on earlier ones.[1] This is the cheapest variant but suffers from the same problem as self-criticism: the verification answers can be biased by the visible draft, because the model has already committed to the draft tokens.[1]
The 2-Step variant separates the planning of verification questions from their execution into two distinct prompts. In the second prompt, the model sees only the verification questions, not the original draft.[1] This prevents the draft from influencing the verification answers but still allows the verification answers in the second prompt to influence each other, because they are generated together.[1]
The Factored variant treats each verification question as an independent prompt. Every verification answer is produced by a separate model call that sees only the single question, with no access to the draft or to the other verification answers.[1] This eliminates cross-contamination between answers at the cost of issuing one model call per verification question.[1]
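The difference between the Joint, 2-Step, and Factored variants comes down to what each execution prompt contains. The sketch below only builds prompts and calls no model; the function names and prompt wording are assumptions made for illustration, not text from the paper.

```python
from typing import List


def joint_prompt(query: str, draft: str) -> List[str]:
    """Joint: a single prompt, so questions and answers see the draft."""
    return [
        f"Question: {query}\nDraft answer: {draft}\n"
        "Write verification questions, answer them, then give a final answer."
    ]


def two_step_prompts(questions: List[str]) -> List[str]:
    """2-Step: one execution prompt holding every question but no draft."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    return [f"Answer each of the following questions.\n{numbered}"]


def factored_prompts(questions: List[str]) -> List[str]:
    """Factored: one prompt per question, with no shared context."""
    return [f"Question: {q}\nAnswer:" for q in questions]


questions = ["Where was Person A born?", "Where was Person B born?"]
print(len(two_step_prompts(questions)), len(factored_prompts(questions)))
```

The 2-Step prompt keeps the draft out of view but still lets answers influence one another, while the Factored prompts remove that shared context at the price of one call per question.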
The Factor+Revise variant builds on Factored by adding an explicit cross-checking step. After the verification answers are produced, an additional prompt is used to detect inconsistencies between the verification answers and the original draft before the final response is generated.[1] In practice this means a separate model call for each draft claim, asking whether the verification answer supports or contradicts the claim.
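A rough sketch of that cross-checking step follows, with one call per draft claim. The `LLM` type, the CONSISTENT/INCONSISTENT reply format, and the `toy_judge` stub are hypothetical choices made here for illustration; the paper's actual prompts may differ.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # hypothetical: prompt in, completion out


def cross_check(
    pairs: List[Tuple[str, str, str]], llm: LLM
) -> List[Tuple[str, bool]]:
    """For each (claim, question, answer), ask whether the answer supports the claim."""
    results = []
    for claim, question, answer in pairs:
        verdict = llm(
            f"Claim from draft: {claim}\n"
            f"Verification question: {question}\n"
            f"Verification answer: {answer}\n"
            "Reply CONSISTENT or INCONSISTENT:"
        )
        # "INCONSISTENT" does not start with "CONSISTENT", so this test is safe.
        results.append((claim, verdict.strip().upper().startswith("CONSISTENT")))
    return results


def toy_judge(prompt: str) -> str:
    """Toy stand-in: consistent only if the answer text appears in the claim."""
    fields = dict(line.split(": ", 1) for line in prompt.splitlines()[:3])
    supported = fields["Verification answer"] in fields["Claim from draft"]
    return "CONSISTENT" if supported else "INCONSISTENT"


pairs = [
    ("Person A was born in New York", "Where was Person A born?", "Chicago"),
    ("Person B was born in New York", "Where was Person B born?", "New York"),
]
print(cross_check(pairs, toy_judge))
```

Claims flagged as inconsistent would then be dropped or corrected when the final response is written.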
The variants trade off cost and quality. Joint is the closest in cost to standard prompting because it is a single completion, although a longer one; Factor+Revise can require one model call per verification question plus one per consistency check, making it the most expensive variant.[1] The authors reported that Factor+Revise produced the best long-form generation results, while Factored and 2-Step performed best on different short-form benchmarks.[1]
The authors evaluated CoVe with the LLaMA 65B base model using greedy decoding and few-shot examples, comparing against several baselines including direct few-shot prompting, chain-of-thought prompting, and instruction-tuned LLaMA 2 Chat.[1] Three benchmark families were used.
The Wikidata list questions benchmark asks the model to enumerate entities matching a description, for example "List politicians born in Boston, Massachusetts." Precision is measured by how many of the listed entities actually satisfy the constraint according to Wikidata.[1] On this task, LLaMA 65B few-shot achieved a precision of 0.17, while the best CoVe variant (2-Step) reached 0.36, more than doubling precision.[1] The number of hallucinated entities per query dropped from 2.95 in the few-shot baseline to 0.68 with CoVe.[1] A second list task drawn from Wikipedia categories (Wiki-Category) showed a similar pattern, with precision rising from 0.12 to 0.22 using the Factored variant.[1]
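The list-task measurements described above can be computed for a single query as follows. This is a minimal sketch: the function name and the choice to deduplicate predictions are assumptions, and the paper's exact averaging across queries is not reproduced here.

```python
from typing import Iterable, Set, Tuple


def list_metrics(predicted: Iterable[str], gold: Set[str]) -> Tuple[float, int, int]:
    """Return (precision, correct entities, hallucinated entities) for one query."""
    unique = list(dict.fromkeys(predicted))  # drop repeats, keep order
    positives = sum(1 for entity in unique if entity in gold)
    negatives = len(unique) - positives  # listed entities that fail the constraint
    precision = positives / len(unique) if unique else 0.0
    return precision, positives, negatives


# Hypothetical entities: two of the four listed names satisfy the constraint.
print(list_metrics(["A", "B", "C", "D"], {"A", "C"}))
```

A drop in the third value with little change in the second is the pattern the reported figures describe: fewer hallucinated entities per query.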
The MultiSpanQA benchmark, introduced by Haonan Li and colleagues at NAACL 2022, is a multi-span extractive question answering dataset where each answer is a series of non-contiguous spans in a passage.[8] In the closed-book setting used by CoVe, the model receives only the question and must produce the answer spans from its parametric memory.[1] LLaMA 65B few-shot scored 0.39 F1 on the closed-book version; the Factored CoVe variant raised this to 0.48 F1, a relative improvement of 23 percent.[1]
The long-form generation task asks the model to produce a multi-paragraph biography for a person, evaluated using FActScore, an automatic metric introduced by Sewon Min and colleagues at EMNLP 2023.[9] FActScore decomposes a generation into atomic facts and computes the percentage supported by a reliable knowledge source (Wikipedia in the default configuration); its authors reported that ChatGPT achieved a FActScore of about 58 percent on biographies of relatively obscure people.[9] On this benchmark, LLaMA 65B few-shot scored 55.9, and CoVe Factor+Revise lifted the score to 71.4, an absolute gain of 15.5 percentage points and a relative gain of about 28 percent.[1] CoVe-augmented LLaMA 65B also exceeded the FActScores the original FActScore paper reported for ChatGPT and PerplexityAI on the same task family.[1][9]
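The arithmetic behind these figures is simple enough to check directly. The sketch below assumes the atomic facts have already been labelled as supported or not (the hard part, which FActScore automates); the function names are illustrative, and the inputs to `gains` are the scores quoted in this article.

```python
from typing import Sequence, Tuple


def supported_percentage(supported: Sequence[bool]) -> float:
    """Percentage of atomic facts labelled as supported by the knowledge source."""
    if not supported:
        raise ValueError("need at least one atomic fact")
    return 100.0 * sum(supported) / len(supported)


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain, relative gain in percent), rounded for display."""
    absolute = after - before
    return round(absolute, 2), round(100.0 * absolute / before, 1)


print(supported_percentage([True, True, False, True]))
print(gains(55.9, 71.4))  # long-form FActScore, few-shot vs Factor+Revise
print(gains(0.39, 0.48))  # closed-book MultiSpanQA F1, few-shot vs Factored
```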
The CoVe paper additionally compared standard chain-of-thought prompting against CoVe on the hallucination benchmarks and reported that CoT generated the highest number of hallucinations per query by a wide margin on the list-based tasks, suggesting that asking the model to "think step by step" does not by itself protect against factual error.[1] This finding distinguishes CoVe's verification structure from generic reasoning prompts.
CoVe sits within a broader family of self-correction prompting techniques that emerged in 2023. CRITIC, introduced by Zhibin Gou and colleagues in May 2023 in the paper "CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing" and presented at ICLR 2024, also has a model validate and revise its own outputs, but extends the idea with external tools such as search engines, code interpreters, and toxicity classifiers; the authors argued that external feedback is of central importance because models often cannot reliably critique themselves on facts they did not know.[10] CRITIC differs from CoVe in that the verification answers come from tools rather than from the same language model in isolation, which addresses one of the limitations the CoVe authors flagged about same-model verification.[10][1]
Self-RAG (Learning to Retrieve, Generate, and Critique through Self-Reflection), introduced by Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi in October 2023, trains the model end-to-end to emit special reflection tokens that decide when to retrieve passages and how to assess the relevance and supportedness of retrieved evidence.[11] Where CoVe is a pure prompting recipe over a fixed base model, Self-RAG bakes the verification behaviour into the weights through supervised training on reflection-token-annotated data.[11]
CoVe also inspired a line of domain-specific extensions. U-CoVe, posted to arXiv in 2024, adapts the CoVe pipeline to low-resource Urdu and reports gains analogous to the English-language results on Urdu factuality benchmarks.[12] Researchers at ETH Zurich evaluated CoVe in a reproducibility study and confirmed the broad direction of the original results, while noting that absolute scores depend on the choice of base model and verification prompts.[13]
Within Meta's own research stack, the verification idea appears in subsequent factuality work and in long-form generation pipelines that combine retrieval and self-checking; the CoVe paper itself was cited in numerous follow-up papers on factuality evaluation, including factuality benchmarks such as FactBench and survey papers on hallucination mitigation.[14][15]
CoVe is sometimes grouped with chain-of-thought (CoT) prompting because both impose multi-step structure on a single inference, but the two methods target different failure modes. CoT, introduced by Jason Wei and colleagues at Google in 2022, asks the model to produce intermediate reasoning steps before its final answer; the gains come primarily on arithmetic, commonsense, and symbolic reasoning, where the missing ingredient is the chain of inference rather than the underlying facts.[16] Self-consistency sharpens CoT by sampling multiple reasoning paths and taking the majority answer.[4]
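The majority-vote step of self-consistency can be sketched in a few lines. The sampling of reasoning paths is assumed to have happened already; the function name and the whitespace and case normalisation are illustrative choices, not details taken from the self-consistency paper.

```python
from collections import Counter
from typing import Sequence


def majority_answer(sampled_answers: Sequence[str]) -> str:
    """Pick the most frequent final answer across sampled reasoning paths."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    normalised = [answer.strip().lower() for answer in sampled_answers]
    return Counter(normalised).most_common(1)[0][0]


# Hypothetical final answers extracted from five sampled chains of thought.
print(majority_answer(["18", "18 ", "17", "18", "20"]))
```

Voting of this kind selects among whole answers; it does not inspect individual factual claims, which is the gap CoVe's verification questions are meant to address.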
CoVe, by contrast, targets factual recall errors: cases where the model knows how to answer the question structurally but produces a plausible but incorrect entity, date, or attribute. The verification step is not an extension of the reasoning chain but a parallel set of independent factual lookups against the model's own knowledge.[1] On the list-based Wikidata benchmark, the CoVe authors found that CoT actually produced more hallucinated entities than the plain few-shot baseline, while CoVe more than halved them, suggesting that reasoning structure and factual structure are largely orthogonal.[1]
A useful way to view the distinction is that CoT improves the answer distribution by conditioning on explicit reasoning, P(answer | reasoning, question), while CoVe filters individual claims by asking targeted, independent queries that test each candidate claim against the model's own knowledge, P(claim | knowledge).[1] In practice the two are stackable: a model can do CoT during the draft stage and CoVe during the verification stage, although the original CoVe paper did not study this combination in depth.[1]
CoVe also differs from Self-Refine, which loops a single critique-and-rewrite step over the same draft. Self-Refine uses the same model as critic and as generator without separating the contexts, so the critic sees the draft directly and tends to be biased toward accepting it.[5] The Factored and Factor+Revise variants of CoVe were designed specifically to break this bias by hiding the draft from the verification step.[1]
The CoVe authors acknowledged several limitations.[1] First, CoVe does not remove hallucinations completely; the FActScore of 71.4 on long-form biographies still leaves close to 30 percent of atomic facts unsupported.[1] Second, the method only helps with directly stated, verifiable facts: it does nothing for opinion-based content, for stylistic errors, or for failures of reasoning.[1] If the draft contains a chain of inferences with a flawed premise, no factual verification question will catch the structural error.[1]
Third, the verification answers themselves can be hallucinated. The model that answers the verification questions is the same model that produced the draft, drawing on the same parametric knowledge; if the model is confidently wrong about a fact, it will be confidently wrong about the verification question for that fact.[1] The Factor+Revise variant mitigates this somewhat by adding a consistency check, but cannot exceed the upper bound set by the base model's knowledge.[1]
Fourth, the technique is computationally expensive in its strongest forms. Factored requires one extra model call per verification question, and Factor+Revise adds another call per claim to check consistency. For a draft that mentions ten facts, this can mean ten to twenty additional inference calls per query, which translates directly into latency and token cost in production deployments.[1]
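The ten-to-twenty figure follows from a simple count. The helper below is a sketch under an assumed counting convention: it tallies only the additional calls made during verification, treating Joint as adding none (everything happens in one completion) and 2-Step as adding one execution prompt; the variant labels are strings chosen here for illustration.

```python
def extra_calls(variant: str, n_questions: int, n_claims: int = 0) -> int:
    """Additional model calls spent on verification, under the stated convention."""
    if variant == "joint":
        return 0  # planning, answers, and revision share one completion
    if variant == "2-step":
        return 1  # one separate prompt answers all questions together
    if variant == "factored":
        return n_questions  # one call per verification question
    if variant == "factor+revise":
        return n_questions + n_claims  # plus one consistency check per claim
    raise ValueError(f"unknown variant: {variant}")


for name in ("joint", "2-step", "factored", "factor+revise"):
    print(name, extra_calls(name, n_questions=10, n_claims=10))
```

Because Factored calls are independent of one another, they can in principle be issued in parallel, which reduces latency but not token cost.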
Fifth, the verification questions must be well-formed and atomic. If the plan stage produces a verification question that is itself compound ("Where and when was politician A born?"), the model can answer one half correctly and fabricate the other, defeating the purpose.[1] In practice this means CoVe is more reliable on domains where the underlying facts decompose cleanly into atomic checks, such as biographical attributes, list memberships, and dated events.[1]
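One crude way to catch compound questions before execution is to count interrogative words. This heuristic is an assumption introduced here for illustration and is not part of the CoVe paper; it will misfire on questions such as "What is the city where X was born?", so it is better treated as a flag for review than as a filter.

```python
import re

# Interrogative words; more than one in a question hints at a compound query.
WH_WORDS = re.compile(
    r"\b(who|whom|whose|what|when|where|which|why|how)\b", re.IGNORECASE
)


def looks_compound(question: str) -> bool:
    """Heuristic flag: True when the question contains several wh-words."""
    return len(WH_WORDS.findall(question)) > 1


print(looks_compound("Where and when was politician A born?"))
print(looks_compound("Where was politician A born?"))
```

A flagged question could be sent back to the planning stage to be split into separate atomic questions.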
Finally, CoVe shares a limitation with other methods that check only what the draft already says: it does not detect omissions. If the draft simply leaves out a relevant fact, no verification question will be asked about that fact, and the final response will inherit the omission.[1] Retrieval-augmented generation is better suited to surfacing missing context.[7]
Within months of the September 2023 arXiv preprint, CoVe was implemented in several open-source toolkits. The most widely used implementation, by Sourajit Ghosh ("ritun16") on GitHub, packages the CoVe pipeline as a Python module using LangChain, OpenAI models, and optional search tools; the repository accumulated 203 GitHub stars and 33 forks and is referenced in multiple tutorials.[3] A separate PyPI package, langchain-chain-of-verification, repackages the same logic as both a command-line interface and a library for use with later LangChain versions.[17] Analytics Vidhya and PromptHub published step-by-step LangChain Expression Language implementations targeted at production developers.[18][19]
The technique has been adopted as a reference template in prompt engineering education, appearing in survey papers on hallucination mitigation as one of the canonical inference-time methods alongside self-consistency, Self-Refine, and Reflexion.[15] The CoVe paper was selected for Findings of the Association for Computational Linguistics 2024, where it was published with page range 3563 to 3578.[2]
Commercial pipelines that surface CoVe-style verification include factuality-focused agents that combine the verification structure with web search or RAG: in these systems the verification answers come from a retrieval tool rather than the base model, partially addressing the self-knowledge limitation, while preserving the four-step structure proposed in the original paper.[18][3] CoVe is also referenced in reproducibility studies, including an ETH Zurich student project that re-implemented the four variants on top of open-source models and reproduced the qualitative ordering of results.[13]
The technique has been extended to languages other than English. U-CoVe, an Urdu-language adaptation, replicated the CoVe pipeline using Urdu instruction-tuned models and reported reductions in hallucinated content on Urdu factuality benchmarks, showing that the prompting structure transfers across languages without architectural changes.[12] Multilingual factuality benchmarks such as Multi-FAct have used CoVe-style verification as one of the mitigations they evaluate.[20]
Beyond direct implementations, CoVe features in the survey literature on hallucination mitigation. The 2024 survey "A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models" by Tonmoy and colleagues groups CoVe with Self-Refine and Reflexion under "self-refinement" methods, and contrasts them with retrieval-based, decoding-based, and fine-tuning-based mitigation strategies.[15] The survey notes that self-refinement methods including CoVe have the advantage of requiring no additional training data or model updates, but inherit the limitation that the same model that produced the error is being asked to detect it.[15]
CoVe has also been referenced in newer factuality benchmarks. FactBench, introduced in 2024 as a dynamic factuality benchmark drawing from in-the-wild user queries, includes CoVe as one of the comparison baselines for hallucination-mitigation systems and evaluates how the four CoVe variants behave on queries drawn from real production traffic rather than curated test sets.[14] These follow-up evaluations have generally confirmed that CoVe reduces hallucination rates relative to plain few-shot prompting, while pointing out that the absolute level of remaining hallucinations is still substantial.[14][15]