Corrective RAG (CRAG)
Last reviewed: Jun 8, 2026
Sources: 4 citations
Review status: Source-backed
Revision: v1 · 1,608 words
Corrective Retrieval Augmented Generation (CRAG) is a method for improving the robustness of retrieval-augmented generation (RAG) when the underlying retrieval step returns irrelevant, incomplete, or factually wrong documents. It was introduced in the 2024 paper "Corrective Retrieval Augmented Generation" by Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling [1]. The core idea is to insert a lightweight retrieval evaluator between the retriever and the large language model (LLM) generator. The evaluator grades the retrieved documents for a query, produces a confidence label, and then triggers one of three corrective knowledge actions: refine the documents (Correct), discard them and fall back to a large-scale web search (Incorrect), or combine both (Ambiguous). A "decompose-then-recompose" algorithm filters the retained documents down to their most relevant fragments before generation [1].
CRAG is designed to be plug-and-play: the authors state it "can be seamlessly coupled with various RAG-based approaches," including the closely related Self-RAG method, where the combination is called Self-CRAG [1]. Experiments on four short-form and long-form generation benchmarks (PopQA, Biography, PubHealth, and Arc-Challenge) show that adding CRAG improves both standard RAG and Self-RAG. The technique was subsequently popularized as a reference pattern for agentic RAG, most visibly through LangChain and LangGraph tutorials [2].
RAG augments an LLM by prepending documents retrieved from an external corpus to the input, which lets the model ground its output in knowledge it did not memorize in its parameters [1]. Its effectiveness, however, is contingent on the relevance and accuracy of what the retriever returns. Most conventional RAG pipelines incorporate the retrieved passages indiscriminately, regardless of whether the documents actually help answer the query [1].
When the corpus is static and limited, or when the query is ambiguous, a retriever can surface a substantial amount of irrelevant information. The CRAG paper illustrates this with a query such as "Who was the screenwriter for Death of a Batman?", where a low-quality retriever returns documents about the 1989 Batman film instead, potentially misleading the generator into a confident but wrong answer [1]. Because the generator treats prepended text as trusted context, bad retrieval can directly cause or amplify hallucination. CRAG specifically targets the scenario in which the retriever returns inaccurate results, asking how a RAG system should behave when retrieval goes wrong rather than assuming retrieval always succeeds [1].
The central component is a lightweight retrieval evaluator that scores how relevant each retrieved document is to the input query. For each question, roughly ten documents are retrieved; the question is concatenated with each document individually, and the evaluator predicts a relevance score for that question-document pair [1]. The evaluator is built on T5-large (Raffel et al., 2020), which is fine-tuned for this relevance-scoring task and is far smaller than the LLM generators it assists [1]. A later open-source explainability study using SHAP reported that the fine-tuned T5 evaluator relies substantially on named-entity alignment between the question and the document rather than on broad semantic similarity [3].
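The pairing step described above can be sketched as follows. This is a minimal illustration rather than the authors' code: `build_pair`, `score_documents`, the `" [SEP] "` separator, and the [-1, 1] score range are assumptions made for the example, and `toy_overlap_scorer` is a crude lexical stand-in for the fine-tuned T5-large evaluator so that the snippet runs without model weights.

```python
from typing import Callable, List

# Assumed separator: the source only says the question and document are concatenated.
SEP = " [SEP] "

# Hypothetical scorer type: maps one concatenated question-document pair to a score.
# In CRAG this role is played by a fine-tuned T5-large model.
PairScorer = Callable[[str], float]


def build_pair(question: str, document: str) -> str:
    """Concatenate the question with ONE document."""
    return f"{question}{SEP}{document}"


def score_documents(question: str, documents: List[str], scorer: PairScorer) -> List[float]:
    """Score every retrieved document individually against the question."""
    return [scorer(build_pair(question, doc)) for doc in documents]


def toy_overlap_scorer(pair_text: str) -> float:
    """Stand-in scorer: share of question tokens found in the document, mapped to [-1, 1]."""
    question, document = pair_text.split(SEP, 1)
    q_tokens = set(question.lower().split())
    d_tokens = set(document.lower().split())
    if not q_tokens:
        return -1.0
    return 2.0 * len(q_tokens & d_tokens) / len(q_tokens) - 1.0


docs = [
    "shakespeare wrote hamlet around 1600",
    "the eiffel tower is in paris",
]
scores = score_documents("who wrote hamlet", docs, toy_overlap_scorer)
```

Keeping the scorer behind a plain callable means a learned evaluator could be substituted without touching the pairing logic.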
The per-document relevance scores are aggregated into an overall confidence judgment for the query, and an upper threshold and a lower threshold convert that confidence into one of three discrete actions [1]:
| Confidence label | Trigger condition | Knowledge action |
|---|---|---|
| Correct | Confidence above the upper threshold | Refine the retrieved documents with decompose-then-recompose; use this internal knowledge |
| Incorrect | Confidence below the lower threshold | Discard all retrieved documents; run a web search and use that external knowledge |
| Ambiguous | Confidence between the two thresholds | Combine refined internal knowledge with web-search results |
The Correct label means at least one document is deemed reliable; the Incorrect label means all retrieved documents are judged irrelevant; and the intermediate Ambiguous action is a softer fallback that hedges by using both sources. The authors note that the Ambiguous action helps reduce the system's dependence on the precision of the evaluator itself, since misjudgments in borderline cases still benefit from both knowledge streams [1].
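The trigger logic in the table can be sketched in a few lines. The snippet below is one reading of the description, assuming the per-document scores are aggregated by taking the maximum, so that Correct fires when at least one document clears the upper threshold and Incorrect fires when every document falls below the lower one. The names `Action` and `choose_action`, the handling of an empty result list, and the threshold values in the usage lines are illustrative assumptions, not values taken from the paper.

```python
from enum import Enum
from typing import Sequence


class Action(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    AMBIGUOUS = "ambiguous"


def choose_action(scores: Sequence[float], upper: float, lower: float) -> Action:
    """Map per-document relevance scores to one of the three knowledge actions."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if not scores:
        # Assumption: with nothing retrieved, fall back to web search.
        return Action.INCORRECT
    confidence = max(scores)  # assumed aggregation: the best document decides
    if confidence > upper:
        return Action.CORRECT      # at least one document looks reliable
    if confidence < lower:
        return Action.INCORRECT    # every document scored below the lower bound
    return Action.AMBIGUOUS        # in between: use both knowledge sources


# Illustrative thresholds only.
UPPER, LOWER = 0.5, -0.9
first = choose_action([0.9, -0.5], UPPER, LOWER)
second = choose_action([-0.95, -0.99], UPPER, LOWER)
third = choose_action([0.1, -0.95], UPPER, LOWER)
```

Widening the gap between the two thresholds sends more queries down the Ambiguous path, which trades extra web-search cost for less reliance on the evaluator's precision.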
When retrieval is judged Incorrect (and as a supplement under Ambiguous), CRAG seeks new knowledge from outside the static corpus by performing a large-scale web search [1]. The input query is first rewritten into a form better suited for search, the returned URLs are navigated and their content transcribed, and the same knowledge-refinement step is then applied to the fetched pages [1]. To limit the bias and unreliability that open web content can introduce, the method prefers authoritative and regulated sources such as Wikipedia [1]. The official implementation performs the web search through the Serper.dev Google Search API [4]; LangChain's reproduction of the pattern instead uses Tavily Search as the web-search tool [2].
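The web-search fallback can be sketched as a small pipeline of pluggable steps. Everything here is hypothetical scaffolding: `web_knowledge`, `prioritize_urls`, the `PREFERRED_DOMAINS` allow-list, and the `max_pages` cap are names and choices made for illustration, and the `rewrite`, `search`, `fetch`, and `refine` callables stand in for the query rewriter, the search API, the page transcriber, and the knowledge-refinement step. No real search API is called.

```python
from typing import Callable, List, Sequence
from urllib.parse import urlparse

# Illustrative allow-list; the source names Wikipedia as a preferred kind of source.
PREFERRED_DOMAINS = ("wikipedia.org",)


def is_preferred(url: str, preferred: Sequence[str] = PREFERRED_DOMAINS) -> bool:
    """True if the URL's host is a preferred domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in preferred)


def prioritize_urls(urls: List[str]) -> List[str]:
    """Move preferred sources to the front; the stable sort keeps search order otherwise."""
    return sorted(urls, key=lambda u: not is_preferred(u))


def web_knowledge(
    query: str,
    rewrite: Callable[[str], str],
    search: Callable[[str], List[str]],
    fetch: Callable[[str], str],
    refine: Callable[[str, str], str],
    max_pages: int = 3,
) -> str:
    """Rewrite the query, search, fetch the top pages, and refine each page."""
    search_query = rewrite(query)
    urls = prioritize_urls(search(search_query))[:max_pages]
    pages = [fetch(u) for u in urls]
    # Skip pages that came back empty, then join the refined text.
    return "\n".join(refine(query, page) for page in pages if page)


ranked = prioritize_urls([
    "https://example.com/a",
    "https://en.wikipedia.org/wiki/B",
    "https://blog.example.org/c",
])
```

Passing the four steps in as callables keeps the sketch independent of any particular search provider, which mirrors how the official implementation and the LangChain reproduction use different search tools.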
Even relevant documents usually contain noise, so CRAG does not feed retained documents to the generator verbatim. A decompose-then-recompose knowledge-refinement method extracts the most critical content from a relevant document by splitting it into fine-grained "knowledge strips," scoring each strip for relevance, filtering out the strips judged irrelevant, and recomposing the surviving strips into a compact knowledge input [1]. This same decomposition, filtering, and recomposition procedure is applied both to documents from the local corpus and to transcribed web pages, so the generator receives focused knowledge rather than full passages [1]. After the appropriate action assembles the final knowledge, an arbitrary generator LLM produces the answer conditioned on the query and that knowledge [1].
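A minimal sketch of decompose-then-recompose, under stated assumptions: strips are formed by grouping a fixed number of sentences (the source does not specify the splitting rule), surviving strips are recomposed in their original order, and `toy_strip_scorer` is a lexical placeholder for the retrieval evaluator. The function names, `sentences_per_strip`, `threshold`, and `top_k` are all hypothetical.

```python
import re
from typing import Callable, List

# Hypothetical scorer type: (query, strip) -> relevance score.
StripScorer = Callable[[str, str], float]


def decompose(document: str, sentences_per_strip: int = 2) -> List[str]:
    """Split a document into strips of a few sentences each (assumed heuristic)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_strip])
        for i in range(0, len(sentences), sentences_per_strip)
    ]


def refine_document(
    query: str,
    document: str,
    scorer: StripScorer,
    threshold: float,
    sentences_per_strip: int = 2,
    top_k: int = 5,
) -> str:
    """Score each strip, drop those below the threshold, and recompose the rest."""
    strips = decompose(document, sentences_per_strip)
    scored = [(scorer(query, strip), index, strip) for index, strip in enumerate(strips)]
    kept = [item for item in scored if item[0] >= threshold]
    # Keep the top_k highest-scoring strips, then restore document order.
    kept = sorted(kept, key=lambda item: item[0], reverse=True)[:top_k]
    kept.sort(key=lambda item: item[1])
    return " ".join(strip for _, _, strip in kept)


def toy_strip_scorer(query: str, strip: str) -> float:
    """Stand-in scorer: fraction of query words that appear in the strip."""
    q_words = set(re.findall(r"\w+", query.lower()))
    s_words = set(re.findall(r"\w+", strip.lower()))
    return len(q_words & s_words) / len(q_words) if q_words else 0.0


text = (
    "Paris is the capital of France. The Seine runs through Paris. "
    "Bananas are yellow. Apples can be red."
)
refined = refine_document(
    "capital of France", text, toy_strip_scorer, threshold=0.5, sentences_per_strip=1
)
```

Because the same function takes any text, it can be applied unchanged to both corpus documents and transcribed web pages, as the method describes.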
CRAG was evaluated on four datasets spanning short-form and long-form generation: PopQA (short-form open-domain question answering), Biography (long-form generation), PubHealth (a health-claim true-or-false task), and Arc-Challenge (multiple-choice science questions) [1]. Accuracy was used as the metric for PopQA, PubHealth, and Arc-Challenge, while FactScore (Min et al., 2023) was used for Biography [1]. Two generators were tested: a base LLaMA2-hf-7b model and the SelfRAG-LLaMA2-7b model fine-tuned by the Self-RAG authors [1].
Adding CRAG improved results in nearly every setting. With the SelfRAG-LLaMA2-7b generator, the reported scores were as follows [1]:
| Method (SelfRAG-LLaMA2-7b) | PopQA | Biography (FactScore) | PubHealth | Arc-Challenge |
|---|---|---|---|---|
| Standard RAG | 52.8 | 59.2 | 39.0 | 53.2 |
| Self-RAG | 54.9 | 81.2 | 72.4 | 67.3 |
| CRAG | 59.8 | 74.1 | 75.6 | 68.6 |
| Self-CRAG | 61.8 | 86.2 | 74.8 | 67.2 |
CRAG outperformed standard RAG on all four datasets, and Self-CRAG (CRAG layered on top of Self-RAG) achieved the strongest PopQA and Biography scores in the table [1]. The gains are not uniform, however: plain CRAG trails Self-RAG on Biography (74.1 vs. 81.2), and Self-CRAG sits marginally below Self-RAG on Arc-Challenge (67.2 vs. 67.3). The paper reports that CRAG significantly improves the performance of both standard RAG and the state-of-the-art Self-RAG, which the authors present as evidence of its generalizability across short-form and long-form tasks and across different backbone generators [1].
CRAG is frequently discussed alongside Self-RAG (Asai et al., 2023), and the two are complementary approaches to making RAG more reliable. Self-RAG trains a single LLM to decide adaptively when to retrieve and to emit special reflection tokens that critique whether retrieved passages are relevant and whether the generated answer is supported by them [1]. CRAG instead keeps the generator unchanged and adds an external, lightweight T5-based evaluator plus a web-search correction loop, so the corrective machinery does not require retraining the main model [1].
Because CRAG operates as a wrapper around the retrieval step, it can be attached to a Self-RAG pipeline rather than replacing it. The authors call this combination Self-CRAG and report that it lifts Self-RAG's scores on several benchmarks, demonstrating the plug-and-play claim in practice [1]. Both methods are examples of self-reflective or adaptive retrieval, in which the system reasons about retrieval quality before committing to an answer instead of trusting every retrieved document [2].
CRAG was submitted to arXiv on January 29, 2024, with a revised version (v3) posted on October 7, 2024 [1]. The first author Shi-Qi Yan and second author Jia-Chen Gu are listed as equal contributors; the authors are affiliated with the University of Science and Technology of China, the University of California, Los Angeles, and Google DeepMind [1]. Reference code is released in the HuskyInSalt/CRAG GitHub repository, which provides the fine-tuned retrieval-evaluator weights and training data [4]. The paper was also submitted to ICLR 2025 but was ultimately withdrawn there, so the arXiv preprint remains the canonical reference [1].
Beyond the original paper, CRAG became a widely cited template for agentic and self-reflective RAG. LangChain published an implementation that reproduces the pattern in LangGraph, modeling the retrieval-grade, query-rewrite, web-search, and generate steps as nodes in a state graph and citing the CRAG paper directly [2]. That LangGraph cookbook and numerous derivative tutorials (for example, a walkthrough from DataCamp) typically simplify the original design: rather than fine-tuning a T5 evaluator, they grade each retrieved document with an LLM and route to a web search such as Tavily when documents fall below a relevance threshold [2]. An open-source reproduction and explainability analysis of CRAG was also published in 2026, examining how the T5-based retrieval evaluator makes its decisions [3]. Through these adaptations, the CRAG idea of grading retrieval results and correcting them through refinement or web search has become a common building block in agentic RAG pipelines.
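The simplified tutorial-style flow can be sketched as a tiny state machine in plain Python. This deliberately does not use the LangGraph API; `build_nodes`, `next_node`, and `run_graph` are hypothetical helpers, the five callables are placeholders for a retriever, an LLM grader, a query rewriter, a web-search tool, and a generator, and the routing rule (search the web whenever any document is rejected or none survive) is an assumption chosen for the example.

```python
from typing import Any, Callable, Dict, List, Optional

State = Dict[str, Any]
Node = Callable[[State], State]


def build_nodes(
    retrieve: Callable[[str], List[str]],
    grade: Callable[[str, str], bool],
    rewrite: Callable[[str], str],
    web_search: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
) -> Dict[str, Node]:
    """Wrap the placeholder callables as nodes that read and update a shared state."""
    def retrieve_node(s: State) -> State:
        return {**s, "documents": retrieve(s["question"])}

    def grade_node(s: State) -> State:
        docs = s["documents"]
        kept = [d for d in docs if grade(s["question"], d)]
        # Assumed rule: any rejected document (or no survivors) triggers web search.
        return {**s, "documents": kept, "needs_web_search": len(kept) < len(docs) or not kept}

    def rewrite_node(s: State) -> State:
        return {**s, "search_query": rewrite(s["question"])}

    def web_search_node(s: State) -> State:
        return {**s, "documents": s["documents"] + web_search(s["search_query"])}

    def generate_node(s: State) -> State:
        return {**s, "answer": generate(s["question"], s["documents"])}

    return {
        "retrieve": retrieve_node, "grade": grade_node, "rewrite": rewrite_node,
        "web_search": web_search_node, "generate": generate_node,
    }


def next_node(current: str, s: State) -> Optional[str]:
    """Conditional edge after grading; fixed edges elsewhere; None ends the run."""
    if current == "grade":
        return "rewrite" if s["needs_web_search"] else "generate"
    return {"retrieve": "grade", "rewrite": "web_search", "web_search": "generate"}.get(current)


def run_graph(nodes: Dict[str, Node], question: str) -> State:
    """Execute nodes from 'retrieve' until no next node remains, recording the path."""
    state: State = {"question": question}
    current: Optional[str] = "retrieve"
    path: List[str] = []
    while current is not None:
        path.append(current)
        state = nodes[current](state)
        current = next_node(current, state)
    return {**state, "path": path}
```

With stub callables, a run in which one document is rejected visits retrieve, grade, rewrite, web_search, and generate in turn, while a run in which every document passes goes straight from grade to generate.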