Self-RAG
Last reviewed
Jun 8, 2026
Sources
5 citations
Review status
Source-backed
Revision
v1 · 1,462 words
Self-RAG (Self-Reflective Retrieval-Augmented Generation) is a framework that trains a single large language model to adaptively decide when to retrieve external passages, to generate text grounded in those passages, and to critique both the passages and its own output by predicting special "reflection tokens." [1] It was introduced in the paper "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection" by Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi, first posted to arXiv on October 17, 2023. [1] The authors are affiliated with the University of Washington, the Allen Institute for AI (Hajishirzi), and IBM Research (Sil). [2] The work was accepted at the International Conference on Learning Representations (ICLR) 2024 as an oral presentation. [3]
Unlike conventional retrieval-augmented generation, which retrieves a fixed number of passages for every query and concatenates them to the prompt regardless of whether they help, Self-RAG retrieves on demand and verifies, at the level of individual output segments, whether each generated statement is supported by the retrieved evidence. The same model that produces the answer also emits the critique, and the reflection tokens make the model's behavior controllable at inference time without retraining. [1] Reported results show Self-RAG models with 7 billion and 13 billion parameters outperforming ChatGPT and retrieval-augmented Llama2-chat on open-domain question answering, reasoning, fact verification, and long-form generation with citations. [1]
Standard retrieval-augmented generation prepends a fixed number of retrieved documents to the model's input on every query. This design has two recurring weaknesses that Self-RAG targets. [1]
First, indiscriminate retrieval is wasteful and can be harmful. Many queries, such as simple reasoning prompts or requests that the model can answer from its parametric knowledge, do not benefit from retrieval, and injecting off-topic passages adds irrelevant context that can degrade the output. Retrieving a fixed number of passages regardless of need also limits versatility across tasks with very different information requirements. [1]
Second, standard RAG provides no guarantee that the generated text is actually consistent with the retrieved passages. The model is free to ignore the retrieved evidence or to add unsupported claims, so retrieval alone does not eliminate hallucination and does not ensure that long-form outputs are faithfully grounded in their sources. [1] Self-RAG addresses both problems by teaching the model to decide whether retrieval is warranted and to explicitly assess whether its statements are supported by what was retrieved.
The core idea is to expand the model's output vocabulary with reflection tokens that the model learns to generate inline, interleaved with ordinary text. There are four token families, divided into one retrieval token and three critique tokens. [1]
| Reflection token | Role | Possible values |
|---|---|---|
| Retrieve | Decide whether retrieval is needed for the next segment | Yes, No, Continue (reuse evidence) |
| IsRel (relevance) | Judge whether a retrieved passage is relevant to the prompt | Relevant, Irrelevant |
| IsSup (support) | Judge whether the generated segment is supported by the passage | Fully supported, Partially supported, No support |
| IsUse (usefulness) | Rate the overall usefulness of the response to the query | A 1 to 5 rating |
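The table can be written down as a small data structure, which makes the size of the added vocabulary concrete. This is a minimal sketch: the class names mirror the table, but the bracketed token strings such as `[Retrieve=Yes]` and the helper `reflection_vocabulary` are hypothetical and are not the spellings used by the released models.

```python
from enum import Enum


class Retrieve(Enum):
    YES = "Yes"
    NO = "No"
    CONTINUE = "Continue"


class IsRel(Enum):
    RELEVANT = "Relevant"
    IRRELEVANT = "Irrelevant"


class IsSup(Enum):
    FULLY = "Fully supported"
    PARTIALLY = "Partially supported"
    NONE = "No support"


# IsUse is a 1-to-5 rating, so plain integers are enough here.
IS_USE_VALUES = (1, 2, 3, 4, 5)


def reflection_vocabulary():
    """Return one hypothetical token string per (family, value) pair."""
    tokens = []
    for family in (Retrieve, IsRel, IsSup):
        tokens.extend(f"[{family.__name__}={member.value}]" for member in family)
    tokens.extend(f"[IsUse={rating}]" for rating in IS_USE_VALUES)
    return tokens
```

Under these assumptions the four families add thirteen entries to the vocabulary: three for Retrieve, two for IsRel, three for IsSup, and five for IsUse.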
Generation proceeds segment by segment. At each step the model first predicts a Retrieve token. If it predicts "No," it continues generating from its own parametric knowledge without consulting any documents. If it predicts "Yes," a retriever is invoked to fetch relevant passages; a "Continue" value indicates that previously retrieved evidence remains sufficient and can be reused. This lets a single model retrieve multiple times, once, or not at all over the course of one response, in contrast to the fixed retrieval budget of standard RAG. [1]
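The segment-by-segment control flow described above can be sketched as a short loop. This is an illustration under stated assumptions, not the authors' implementation: `predict_retrieve`, `retrieve`, and `write_segment` are hypothetical callables standing in for the model's Retrieve prediction, the retriever, and segment generation, and `max_segments` is an invented stopping rule.

```python
def generate_response(prompt, predict_retrieve, retrieve, write_segment, max_segments=4):
    """Generate a response one segment at a time with on-demand retrieval.

    All three callables are hypothetical stand-ins:
      predict_retrieve(prompt, segments) -> "Yes", "No", or "Continue"
      retrieve(prompt, segments)         -> list of passages
      write_segment(prompt, segments, evidence) -> str, or None to stop
    """
    evidence = []   # passages currently available to the generator
    segments = []   # segments written so far
    for _ in range(max_segments):
        decision = predict_retrieve(prompt, segments)
        if decision == "Yes":
            # Fetch fresh passages for the next segment.
            evidence = retrieve(prompt, segments)
        elif decision == "No":
            # Rely on parametric knowledge only.
            evidence = []
        # "Continue": keep the previously retrieved evidence unchanged.
        segment = write_segment(prompt, segments, evidence)
        if segment is None:
            break
        segments.append(segment)
    return segments
```

With this structure the retriever may be called several times, once, or never within one response, depending entirely on the sequence of Retrieve predictions.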
When retrieval is triggered, Self-RAG processes the fetched passages in parallel: for each candidate passage it generates a continuation together with critique tokens. The IsRel token marks whether the passage is relevant to the input. The IsSup token marks whether the generated continuation is fully supported, partially supported, or unsupported by that passage, which directly measures grounding. The IsUse token gives an overall usefulness rating for the response. [1] Because these judgments are emitted as part of the generation, they double as natural citations and self-assessments of factuality.
At inference time Self-RAG performs a segment-level beam search. For each generation segment it scores candidate continuations using the language model probability combined with a weighted sum of the critique-token probabilities, then keeps the best-scoring segments to extend into the next step. [1] The relative weights on the relevance, support, and usefulness signals (denoted w_rel, w_sup, and w_use in the reference implementation) are hyperparameters set at inference, so a deployment can, for example, emphasize evidential support for a fact-checking task or favor fluency for an open-ended task without any retraining. [4] A retrieval threshold similarly tunes how often the model chooses to retrieve. The reference implementation uses default settings such as a beam width of 2 and a maximum search depth of 6. [4] This inference-time controllability is a distinguishing property of the method relative to a model trained with fixed retrieval behavior.
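A minimal sketch of how such a score and threshold might be computed follows. The parameter names `w_rel`, `w_sup`, and `w_use` come from the text above; everything else is an assumption made for illustration: the function names, the default weight and threshold values, the use of a log probability for the language-model term, and the reduction of each critique to a single probability of its most desirable value.

```python
import math


def should_retrieve(p_yes, p_no, threshold=0.5):
    """Retrieve when the normalized probability of Retrieve=Yes exceeds the threshold."""
    return p_yes / (p_yes + p_no) > threshold


def critique_score(critique, w_rel=1.0, w_sup=1.0, w_use=0.5):
    """Weighted sum of critique probabilities (default weights are illustrative)."""
    return (
        w_rel * critique["rel"]
        + w_sup * critique["sup"]
        + w_use * critique["use"]
    )


def segment_score(lm_prob, critique, **weights):
    # Assumption: combine the two signals by adding the log of the
    # language-model probability to the weighted critique score.
    return math.log(lm_prob) + critique_score(critique, **weights)


def keep_best(candidates, beam_width=2, **weights):
    """Keep the beam_width highest-scoring candidate segments."""
    ranked = sorted(
        candidates,
        key=lambda c: segment_score(c["lm_prob"], c["critique"], **weights),
        reverse=True,
    )
    return ranked[:beam_width]
```

Raising `w_sup` relative to the other weights favors candidates whose text is judged to be supported by the passage, while setting all three weights to zero reduces the ranking to the language-model probability alone. Lowering `threshold` makes retrieval more frequent.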
Self-RAG is trained entirely with the standard next-token-prediction objective, with no reinforcement learning, by treating reflection tokens as ordinary vocabulary items added to the model. [1] Training is organized around two models. [1]
A critic model is trained first to predict reflection tokens. To create its supervision, the authors prompt GPT-4 to label passages and generations with the appropriate Retrieve, IsRel, IsSup, and IsUse values, then distill those labels into the critic by fine-tuning it to reproduce them. Using the distilled critic to generate the tokens offline, rather than calling GPT-4 for every training example, keeps the pipeline inexpensive. [1]
The critic is then used to annotate a large and diverse instruction-following dataset: it inserts reflection tokens and, where the Retrieve token is "Yes," the corresponding retrieved passages are interleaved into the sequence. The resulting augmented corpus, which combines original task outputs, retrieved passages, and reflection tokens, is used to train the generator (also called the Self-RAG model) with the ordinary language-modeling loss. [1] At inference the critic is no longer required: the single generator model produces both the answer text and the reflection tokens itself. [1] The released models are built on Llama2 7B and 13B base models. [4]
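The annotation step can be illustrated with a short sketch that turns one instruction-output pair into an augmented training string. It assumes a hypothetical `critic` object with `needs_retrieval`, `relevance`, `support`, and `usefulness` methods and a hypothetical `retriever` callable; the bracketed token spellings and the `<paragraph>` delimiters are likewise illustrative and are not claimed to match the released data format.

```python
def augment_example(instruction, output_segments, critic, retriever):
    """Interleave reflection tokens and retrieved passages into one training string.

    `critic` and `retriever` are hypothetical stand-ins for the trained
    critic model and the retriever described in the text.
    """
    parts = [instruction]
    for segment in output_segments:
        decision = critic.needs_retrieval(instruction, segment)
        parts.append(f"[Retrieve={decision}]")
        if decision == "Yes":
            # Insert the passage, then the critic's judgments about it.
            passage = retriever(instruction, segment)
            parts.append(f"<paragraph>{passage}</paragraph>")
            parts.append(f"[IsRel={critic.relevance(instruction, passage)}]")
            parts.append(segment)
            parts.append(f"[IsSup={critic.support(passage, segment)}]")
        else:
            # No retrieval: the original segment is kept as-is.
            parts.append(segment)
    # One overall usefulness rating closes the sequence.
    parts.append(f"[IsUse={critic.usefulness(instruction, output_segments)}]")
    return " ".join(parts)
```

Because the result is ordinary text with extra vocabulary items, the generator can be trained on it with the usual next-token-prediction loss, which is why the critic is not needed at inference.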
Across six tasks the 7B and 13B Self-RAG models substantially outperform comparable baselines, including supervised fine-tuned LLMs, standard retrieval-augmented models, ChatGPT, and retrieval-augmented Llama2-chat. [1] Evaluation tasks span short-form open-domain question answering on PopQA and TriviaQA, the ARC-Challenge reasoning benchmark, fact verification on PubHealth, and long-form generation on the ASQA long-form QA dataset and a biography generation task. [1] The largest reported gains come in factuality and in citation accuracy for long-form outputs, where the explicit support critique directly improves how well generated statements are grounded in cited evidence. [1] The authors released their code and trained models publicly. [2]
Self-RAG belongs to a family of "adaptive" or "active" retrieval methods that move beyond always retrieving a fixed set of passages, but it differs from the others in where the decision logic lives. [5]
Corrective RAG (CRAG) attaches a separate lightweight retrieval evaluator that scores retrieved documents as correct, incorrect, or ambiguous and triggers corrective actions, such as web search or a decompose-then-recompose refinement, when the evidence looks inadequate. CRAG is a plug-and-play module placed around an existing generator, whereas Self-RAG bakes the retrieval and critique decisions into the generator itself through learned reflection tokens. [5]
FLARE (Forward-Looking Active Retrieval) triggers retrieval mid-generation whenever the model becomes uncertain about an upcoming span, keeping the underlying generator frozen and relying on an external signal, rather than training the model to emit retrieval decisions. [5] Classifier-routing approaches such as Adaptive RAG instead route each query to a no-retrieval, single-step, or multi-step pipeline based on predicted query complexity. [5]
The common thread is dynamic, need-based retrieval combined with some form of evidence checking. Self-RAG's distinguishing contribution is that a single end-to-end-trained model performs retrieval decisions, generation, and fine-grained self-critique together, and that its reflection tokens expose tunable controls over that behavior at inference time. [1]