Self-RAG

Updated Jul 23, 2026

Overview

Self-RAG (Self-Reflective Retrieval-Augmented Generation) is a framework that trains a single large language model to adaptively decide when to retrieve external passages, to generate text grounded in those passages, and to critique both the passages and its own output by predicting special "reflection tokens." ^[1] It was introduced in the paper "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection" by Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi, first posted to arXiv on October 17, 2023. ^[1] The authors are affiliated with the University of Washington and the Allen Institute for AI (Hajishirzi), with Sil at IBM Research. ^[2] The work was accepted at the International Conference on Learning Representations (ICLR) 2024 as an oral presentation. ^[3]

Unlike conventional retrieval-augmented generation, which retrieves a fixed number of passages for every query and concatenates them to the prompt regardless of whether they help, Self-RAG retrieves on demand and verifies, at the level of individual output segments, whether each generated statement is supported by the retrieved evidence. The same model that produces the answer also emits the critique, and the reflection tokens make the model's behavior controllable at inference time without retraining. ^[1] Reported results show Self-RAG models with 7 billion and 13 billion parameters outperforming ChatGPT and retrieval-augmented Llama2-chat on open-domain question answering, reasoning, fact verification, and long-form generation with citations. ^[1]

Background: limits of standard RAG

Standard retrieval-augmented generation prepends a fixed number of retrieved documents to the model's input on every query. This design has two recurring weaknesses that Self-RAG targets. ^[1]

First, indiscriminate retrieval is wasteful and can be harmful. Many queries, such as simple reasoning prompts or requests that the model can answer from its parametric knowledge, do not benefit from retrieval, and injecting off-topic passages can degrade the output rather than improve it. Retrieving a fixed number of passages regardless of need also limits versatility across tasks with very different information requirements. ^[1]

Second, standard RAG provides no guarantee that the generated text is actually consistent with the cited passages. The model is free to ignore the retrieved evidence or to add unsupported claims, so retrieval alone does not eliminate hallucination and does not ensure that long-form outputs are faithfully grounded in their sources. ^[1] Self-RAG addresses both problems by teaching the model to decide whether retrieval is warranted and to explicitly assess whether its statements are supported by what was retrieved.

How Self-RAG works

The core idea is to expand the model's output vocabulary with reflection tokens that the model learns to generate inline, interleaved with ordinary text. There are four token families, divided into one retrieval token and three critique tokens. ^[1]

Reflection token	Role	Possible values
Retrieve	Decide whether retrieval is needed for the next segment	Yes, No, Continue (reuse evidence)
IsRel (relevance)	Judge whether a retrieved passage is relevant to the prompt	Relevant, Irrelevant
IsSup (support)	Judge whether the generated segment is supported by the passage	Fully supported, Partially supported, No support
IsUse (usefulness)	Rate the overall usefulness of the response to the query	A 1 to 5 rating

^[1]
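
As a rough illustration, the four families in the table can be written down as a small vocabulary extension. This is a minimal sketch: the family names and values come from the table, but the `[Family=Value]` surface form and the helper names `reflection_vocabulary` and `family_of` are hypothetical, and the literal token strings in the released models may differ.

```python
from typing import Dict, List

# Families and values copied from the table above.
REFLECTION_TOKENS: Dict[str, List[str]] = {
    "Retrieve": ["Yes", "No", "Continue"],
    "IsRel": ["Relevant", "Irrelevant"],
    "IsSup": ["Fully supported", "Partially supported", "No support"],
    "IsUse": ["1", "2", "3", "4", "5"],
}


def reflection_vocabulary(
    families: Dict[str, List[str]] = REFLECTION_TOKENS,
) -> List[str]:
    """Flatten the families into one list of token strings.

    The "[Family=Value]" surface form is hypothetical, chosen for readability.
    """
    return [
        f"[{family}={value}]"
        for family, values in families.items()
        for value in values
    ]


def family_of(token: str) -> str:
    """Recover the family name from a token written in the form above."""
    if not (token.startswith("[") and token.endswith("]") and "=" in token):
        raise ValueError(f"not a reflection token: {token!r}")
    return token[1:-1].split("=", 1)[0]
```

Under these assumptions the extension adds 13 items (3 + 2 + 3 + 5) to the model's ordinary vocabulary, one retrieval family and three critique families.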

On-demand retrieval

Generation proceeds segment by segment. At each step the model first predicts a Retrieve token. If it predicts "No," it continues generating from its own parametric knowledge without consulting any documents. If it predicts "Yes," a retriever is invoked to fetch relevant passages; a "Continue" value indicates that previously retrieved evidence remains sufficient and can be reused. This lets a single model retrieve multiple times, once, or not at all over the course of one response, in contrast to the fixed retrieval budget of standard RAG. ^[1]
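
The control flow described above can be sketched as a short loop. This is an illustrative sketch, not the reference implementation: `predict_retrieve`, `retrieve`, and `generate_segment` are hypothetical callables standing in for the model and the retriever, and `max_segments` is an assumed stopping rule.

```python
from typing import Callable, List, Tuple


def generate_with_on_demand_retrieval(
    prompt: str,
    predict_retrieve: Callable[[str, str], str],
    retrieve: Callable[[str], List[str]],
    generate_segment: Callable[[str, str, List[str]], str],
    max_segments: int = 4,
) -> Tuple[str, int]:
    """Generate segment by segment, retrieving only when the model asks to.

    Returns the generated text and the number of retriever calls made.
    """
    output = ""
    evidence: List[str] = []
    retrieval_calls = 0
    for _ in range(max_segments):
        decision = predict_retrieve(prompt, output)
        if decision == "Yes":
            # Fetch fresh passages for the upcoming segment.
            evidence = retrieve(prompt + " " + output)
            retrieval_calls += 1
        elif decision == "No":
            # Rely on parametric knowledge only.
            evidence = []
        elif decision != "Continue":
            raise ValueError(f"unexpected Retrieve value: {decision!r}")
        # On "Continue" the previously retrieved evidence is reused as-is.
        segment = generate_segment(prompt, output, evidence)
        if not segment:
            break  # An empty segment is treated as end of generation.
        output += segment
    return output, retrieval_calls
```

With a decision sequence of "No", "Yes", "Continue", this loop calls the retriever exactly once across three segments, which is the behavior a fixed retrieval budget cannot express.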

Parallel processing and critique

When retrieval is triggered, Self-RAG processes the fetched passages in parallel: for each candidate passage it generates a continuation together with critique tokens. The IsRel token marks whether the passage is relevant to the input. The IsSup token marks whether the generated continuation is fully supported, partially supported, or unsupported by that passage, which directly measures grounding. The IsUse token gives an overall usefulness rating for the response. ^[1] Because these judgments are emitted as part of the generation, they double as natural citations and self-assessments of factuality.
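
One way to picture the per-passage step is a function that expands every retrieved passage into a candidate carrying its own critique values. The sketch below is hedged: `Candidate`, `expand_passages`, and `citable` are hypothetical names, `generate_with_critique` stands in for a model call, and a thread pool is used only to mimic the parallelism (a real system would more likely batch the passages through the model).

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    passage: str  # The retrieved passage this continuation was conditioned on.
    text: str     # The generated continuation.
    is_rel: str   # "Relevant" or "Irrelevant"
    is_sup: str   # "Fully supported", "Partially supported", or "No support"
    is_use: int   # 1 to 5 usefulness rating


def expand_passages(
    prompt: str,
    passages: List[str],
    generate_with_critique: Callable[[str, str], Candidate],
) -> List[Candidate]:
    """Produce one critiqued candidate per passage, preserving passage order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda p: generate_with_critique(prompt, p), passages))


def citable(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates whose passage is relevant and fully supports the text."""
    return [
        c for c in candidates
        if c.is_rel == "Relevant" and c.is_sup == "Fully supported"
    ]
```

Filtering on the relevance and support values is what lets the critique double as a citation: a surviving candidate names the passage that backs its text.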

Inference and controllability

At inference time Self-RAG performs a segment-level beam search. For each generation segment it scores candidate continuations using the language model probability combined with a weighted sum of the critique-token probabilities, then keeps the best-scoring segments to extend into the next step. ^[1] The relative weights on the relevance, support, and usefulness signals (denoted w_rel, w_sup, and w_use in the reference implementation) are hyperparameters set at inference, so a deployment can, for example, emphasize evidential support for a fact-checking task or favor fluency for an open-ended task without any retraining. ^[4] A retrieval threshold similarly tunes how often the model chooses to retrieve. The reference implementation uses default settings such as a beam width of 2 and a maximum search depth of 6. ^[4] This inference-time controllability is a distinguishing property of the method relative to a model trained with fixed retrieval behavior.
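
The scoring rule can be sketched in a few lines. The weights `w_rel`, `w_sup`, and `w_use` and the beam width of 2 come from the text above. Everything else is an assumption made for illustration: the renormalization within each family, the rescaling of the 1 to 5 rating, the placeholder weight values of 1.0, and the threshold rule in `should_retrieve`.

```python
from typing import Dict, List, Tuple


def _share(probs: Dict[str, float], desirable: str) -> float:
    """Probability of the desirable value, renormalized within its family."""
    total = sum(probs.values())
    return probs.get(desirable, 0.0) / total if total > 0 else 0.0


def _expected_usefulness(use_probs: Dict[int, float]) -> float:
    """Expected 1 to 5 rating, rescaled to the range 0 to 1."""
    total = sum(use_probs.values())
    if total <= 0:
        return 0.0
    return sum((rating - 1) / 4 * p for rating, p in use_probs.items()) / total


def segment_score(
    lm_prob: float,
    rel_probs: Dict[str, float],
    sup_probs: Dict[str, float],
    use_probs: Dict[int, float],
    w_rel: float = 1.0,  # Placeholder weights, not the reference defaults.
    w_sup: float = 1.0,
    w_use: float = 1.0,
) -> float:
    """Language-model probability plus a weighted sum of critique scores."""
    critique = (
        w_rel * _share(rel_probs, "Relevant")
        + w_sup * _share(sup_probs, "Fully supported")
        + w_use * _expected_usefulness(use_probs)
    )
    return lm_prob + critique


def keep_best(
    candidates: List[Tuple[str, float]], beam_width: int = 2
) -> List[Tuple[str, float]]:
    """Keep the beam_width highest-scoring (segment, score) pairs."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]


def should_retrieve(retrieve_probs: Dict[str, float], threshold: float) -> bool:
    """Retrieve when the relative probability of "Yes" exceeds the threshold."""
    yes = retrieve_probs.get("Yes", 0.0)
    no = retrieve_probs.get("No", 0.0)
    return yes + no > 0 and yes / (yes + no) > threshold
```

Because the weights enter only at scoring time, raising `w_sup` makes a well-supported but less fluent candidate outrank a fluent, unsupported one, and lowering the threshold makes retrieval more frequent, all without touching the model's parameters.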

Training

Self-RAG is trained entirely with the standard next-token-prediction objective, with no reinforcement learning, by treating reflection tokens as ordinary vocabulary items added to the model. ^[1] Training is organized around two models. ^[1]

A critic model is trained first to predict reflection tokens. To create its supervision, the authors prompt GPT-4 to label passages and generations with the appropriate Retrieve, IsRel, IsSup, and IsUse values, then distill those labels into the critic by fine-tuning it to reproduce them. Annotating the generator's training data offline with this critic, rather than calling GPT-4 for every example, keeps the pipeline inexpensive. ^[1]

The critic is then used to annotate a large and diverse instruction-following dataset: it inserts reflection tokens and, where the Retrieve token is "Yes," the corresponding retrieved passages are interleaved into the sequence. The resulting augmented corpus, which combines original task outputs, retrieved passages, and reflection tokens, is used to train the generator (also called the Self-RAG model) with the ordinary language-modeling loss. ^[1] At inference the critic is no longer required: the single generator model produces both the answer text and the reflection tokens itself. ^[1] The released models are built on Llama2 7B and 13B base models. ^[4]
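
The shape of one augmented training example might look like the output of the sketch below. It assumes the critic's annotations are already available; `AnnotatedSegment`, `build_training_sequence`, the `<passage>` markers, the `[Family=Value]` token spelling, and the exact token ordering are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnnotatedSegment:
    text: str                      # Original task output for this segment.
    retrieve: str                  # "Yes", "No", or "Continue"
    passage: Optional[str] = None  # Required when retrieve == "Yes".
    is_rel: Optional[str] = None   # Critic label for the passage.
    is_sup: Optional[str] = None   # Critic label for the segment.


def build_training_sequence(
    instruction: str, segments: List[AnnotatedSegment], is_use: int
) -> str:
    """Interleave reflection tokens and passages with the original output."""
    parts = [instruction]
    for seg in segments:
        parts.append(f"[Retrieve={seg.retrieve}]")
        if seg.retrieve == "Yes":
            if seg.passage is None:
                raise ValueError("a 'Yes' segment needs a retrieved passage")
            parts.append(f"<passage>{seg.passage}</passage>")
            parts.append(f"[IsRel={seg.is_rel}]")
        parts.append(seg.text)
        if seg.retrieve == "Yes":
            parts.append(f"[IsSup={seg.is_sup}]")
    # Usefulness is rated once, for the response as a whole.
    parts.append(f"[IsUse={is_use}]")
    return " ".join(parts)
```

A sequence built this way is plain text, so the generator can be trained on it with the ordinary language-modeling loss and no separate critic head.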

Results

Across six tasks the 7B and 13B Self-RAG models substantially outperform comparable baselines, including supervised fine-tuned LLMs, standard retrieval-augmented models, ChatGPT, and retrieval-augmented Llama2-chat. ^[1] Evaluation tasks span short-form open-domain question answering on PopQA and TriviaQA, the ARC-Challenge reasoning benchmark, fact verification on PubHealth, and long-form generation on the ASQA long-form QA dataset and a biography generation task. ^[1] The largest reported gains come in factuality and in citation accuracy for long-form outputs, where the explicit support critique directly improves how well generated statements are grounded in cited evidence. ^[1] The authors released their code and trained models publicly. ^[2]

Relationship to other RAG methods

Self-RAG belongs to a family of "adaptive" or "active" retrieval methods that move beyond always retrieving a fixed set of passages, but it differs from the others in where the decision logic lives. ^[5]

Corrective RAG (CRAG) attaches a separate lightweight retrieval evaluator that scores retrieved documents as correct, incorrect, or ambiguous and triggers corrective actions, such as web search or a decompose-then-recompose refinement, when the evidence looks inadequate. CRAG is a plug-and-play module placed around an existing generator, whereas Self-RAG bakes the retrieval and critique decisions into the generator itself through learned reflection tokens. ^[5]
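
To make the contrast concrete, a CRAG-style evaluator can be pictured as a thresholding step that sits outside the generator. The sketch below is a simplified illustration: the threshold values, the function names, and the mapping from verdict to action are assumptions, not details taken from the CRAG paper.

```python
from typing import List


def judge_retrieval(
    scores: List[float], upper: float = 0.6, lower: float = 0.2
) -> str:
    """Map per-document evaluator scores to a single verdict.

    The thresholds are hypothetical placeholders.
    """
    if not scores:
        return "incorrect"  # Nothing retrieved counts as inadequate evidence.
    best = max(scores)
    if best >= upper:
        return "correct"
    if best < lower:
        return "incorrect"
    return "ambiguous"


def corrective_actions(verdict: str) -> List[str]:
    """Choose follow-up actions for a verdict (assumed mapping)."""
    actions = {
        "correct": ["refine_passages"],    # Decompose-then-recompose refinement.
        "incorrect": ["web_search"],       # Replace the inadequate evidence.
        "ambiguous": ["refine_passages", "web_search"],
    }
    try:
        return actions[verdict]
    except KeyError:
        raise ValueError(f"unknown verdict: {verdict!r}") from None
```

Nothing here touches the generator, which is the plug-and-play property; in Self-RAG the equivalent judgment is a token the generator itself emits.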

FLARE (Forward-Looking Active Retrieval) triggers retrieval mid-generation whenever the model becomes uncertain about an upcoming span. It keeps the underlying generator frozen and uses the generator's own low-confidence token predictions as the trigger, rather than training the model to emit retrieval decisions. ^[5] Classifier-routing approaches such as Adaptive-RAG instead route each query to a no-retrieval, single-step, or multi-step pipeline based on predicted query complexity. ^[5]
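
Both decision rules are simple enough to sketch side by side. The code assumes that uncertainty about an upcoming span is measured as a low token probability and that query complexity arrives as one of three labels; the threshold value, the label names, and the function names are hypothetical.

```python
from typing import Dict, List


def flare_should_retrieve(token_probs: List[float], threshold: float = 0.4) -> bool:
    """Trigger retrieval if any token in the tentative span is low-confidence.

    The threshold is an illustrative placeholder.
    """
    return any(p < threshold for p in token_probs)


# Assumed complexity labels mapped to the three pipelines named in the text.
ROUTES: Dict[str, str] = {
    "simple": "no_retrieval",
    "moderate": "single_step",
    "complex": "multi_step",
}


def route_query(complexity_label: str) -> str:
    """Pick a pipeline from a classifier's predicted complexity label."""
    try:
        return ROUTES[complexity_label]
    except KeyError:
        raise ValueError(f"unknown complexity label: {complexity_label!r}") from None
```

In both cases the decision comes from a rule or classifier applied around the generator, whereas Self-RAG moves the decision into the generator's own output vocabulary.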

The common thread is dynamic, need-based retrieval combined with some form of evidence checking. Self-RAG's distinguishing contribution is that a single end-to-end-trained model performs retrieval decisions, generation, and fine-grained self-critique together, and that its reflection tokens expose tunable controls over that behavior at inference time. ^[1]

References

1. Asai, Akari; Wu, Zeqiu; Wang, Yizhong; Sil, Avirup; Hajishirzi, Hannaneh. "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection." arXiv:2310.11511, October 17, 2023. https://arxiv.org/abs/2310.11511
2. Self-RAG project page. https://selfrag.github.io/
3. "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection." ICLR 2024 oral presentation. https://iclr.cc/virtual/2024/oral/19736
4. Asai, Akari. self-rag GitHub repository (reference implementation, training and inference details). https://github.com/AkariAsai/self-rag
5. Gao, Yunfan, et al. "Retrieval-Augmented Generation for Large Language Models: A Survey." arXiv:2312.10997. https://arxiv.org/abs/2312.10997
