Multi-hop RAG
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,129 words
Multi-hop RAG is a family of retrieval-augmented generation techniques designed to answer questions that require composing evidence from two or more documents or text chunks. Unlike single-hop questions, which can be resolved by retrieving and reading one passage, multi-hop questions need a model to chain facts across separate sources, such as "Which university employed the author of the book that inspired the 2010 film Inception?" Approaches range from iterative retrieve-and-reason loops (IRCoT, Self-Ask, FLARE) to explicit question decomposition, graph-based community summarization (GraphRAG), and self-reflective routing (Self-RAG). The field grew out of multi-hop question answering benchmarks such as HotpotQA, 2WikiMultiHopQA, and MuSiQue, and has its own RAG-specific benchmark, MultiHop-RAG, introduced in 2024.[1][2][3][4]
Standard RAG pipelines retrieve a fixed number of passages once, conditioned on the user query, and feed them to the generator. This pattern suffices for many single-hop questions but fails on questions whose answer depends on intermediate facts that are not named in the original query. A classic example from the HotpotQA dataset asks what government position was held by the woman who portrayed Corliss Archer in the film "Kiss and Tell"; to answer it, a system must first identify the actress (Shirley Temple) and then look up her government role, which requires at least two distinct retrievals over Wikipedia.[1]
Yang and colleagues released HotpotQA in 2018 at EMNLP. The dataset contains 113,000 Wikipedia-based question and answer pairs whose questions require finding and reasoning over multiple supporting documents. The authors annotated sentence-level supporting facts so systems could be trained to reason with strong supervision and produce explainable predictions. HotpotQA also introduced "comparison" questions, which test whether a model can extract attributes from two distinct entities and compare them.[1] The dataset defined two evaluation settings that remain standard: a "distractor" setting, in which each question is paired with its two gold paragraphs plus eight TF-IDF-retrieved distractors, and a "fullwiki" setting, which requires retrieval from the full English Wikipedia. Reported F1 scores fall substantially when moving from distractor to fullwiki, exposing how much of the difficulty comes from retrieval, not just reading comprehension.[5]
Two follow-up benchmarks pushed the difficulty further. 2WikiMultiHopQA, introduced by Ho and colleagues at COLING 2020, combines structured information from Wikidata with unstructured Wikipedia text. The dataset uses logical rules and templates over Wikidata triples to construct questions of four types: comparison, inference, compositional, and bridge-comparison. Each instance includes an annotated reasoning path called "evidence information" that lets evaluators check whether a system reached the right answer through valid steps rather than shortcuts.[6]
MuSiQue, introduced by Trivedi and colleagues in TACL 2022, took a bottom-up approach. Rather than crowdsourcing free-form multi-hop questions, the authors systematically composed pairs of existing single-hop questions in which one reasoning step depends critically on the answer of the other. The result is MuSiQue-Ans, a dataset of about 25,000 two-to-four-hop questions with adversarial properties: single-hop models lose roughly 30 F1 points on it relative to HotpotQA, and the human-machine gap is roughly three times larger than on prior multi-hop datasets, suggesting the benchmark is harder to game via shortcut features.[3]
These benchmarks were designed before large language models with in-context retrieval became standard. As RAG systems matured, researchers recognized that HotpotQA-style benchmarks were a poor fit for evaluating RAG pipelines, because the gold passages were short Wikipedia snippets aligned with question phrasing. A dedicated RAG-specific benchmark, MultiHop-RAG, was released by Tang and Yang in January 2024 and accepted at COLM 2024. It draws on a corpus of English news articles published between September and December 2023 and contains 2,556 multi-hop queries with ground-truth answers and supporting evidence spread across two to four documents. The four query categories are inference, comparison, temporal, and null (where the corpus does not support a definite answer).[4]
A single-hop RAG pipeline assumes one retrieval call suffices: embed the query, fetch the top-k chunks from a vector database (often combined with BM25 lexical retrieval), and pass them to the generator. Failure modes are dominated by chunk quality and embedding fidelity.
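The single-hop flow can be sketched in a few lines. This is a minimal illustration rather than a production pipeline: `embed` is a toy bag-of-words stand-in for a real embedding model, the three-sentence corpus is invented for the example, and `build_prompt` only assembles the text a generator would receive.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, corpus: list[str], k: int) -> list[str]:
    """One retrieval call: rank every chunk against the query, keep the top k."""
    q = embed(query)
    return sorted(corpus, key=lambda chunk: cosine(q, embed(chunk)), reverse=True)[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Assemble the context that would be passed to the generator."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical three-chunk corpus, for illustration only.
corpus = [
    "Shirley Temple portrayed Corliss Archer in the film Kiss and Tell",
    "Shirley Temple served as Chief of Protocol of the United States",
    "Inception is a 2010 film directed by Christopher Nolan",
]
query = ("What government position was held by the woman who portrayed "
         "Corliss Archer in the film Kiss and Tell?")

top = retrieve(query, corpus, k=2)
prompt = build_prompt(query, top)
```

In this toy corpus the two retrieved chunks are the film passage and a lexically similar distractor. The passage that names the government role shares almost no vocabulary with the question, so a single retrieval call does not surface it.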
Multi-hop RAG introduces three new sources of failure. First, the original query may not contain all the keywords needed to retrieve the second-hop passage, so a literal embedding match misses the intermediate document. Second, even when both passages are retrieved, the generator must compose them; the "compositionality gap" identified by Press and colleagues at EMNLP Findings 2023 measures how often models answer all sub-questions correctly but still fail at the composite question, and they showed that this gap does not close automatically with model scale.[7] Third, the retrieval budget is bounded: pulling more chunks per hop risks crowding out relevant context with distractors, while too few hops misses required evidence.
Tang and Yang quantified the gap directly. With ground-truth evidence supplied, GPT-4 reaches roughly 0.89 accuracy on MultiHop-RAG; with retrieved evidence from a standard dense retriever, accuracy drops to about 0.56, indicating that retrieval rather than reading is the dominant bottleneck for multi-hop queries on realistic corpora.[8]
The most common family of multi-hop RAG methods interleaves retrieval with generation, so that intermediate reasoning steps can issue new queries.
IRCoT (Interleaving Retrieval with Chain-of-Thought) was introduced by Trivedi, Balasubramanian, Khot, and Sabharwal in December 2022. The motivating observation was that "what to retrieve depends on what has already been derived." Standard one-shot retrieval uses only the input question as a query, which is insufficient when later hops need facts that are absent from the surface form of the question. IRCoT alternates two operations: a chain-of-thought step that generates one additional reasoning sentence using the LLM, and a retrieval step that uses the last generated sentence as the new query and appends the new passages to the working context. The process terminates when the model produces an answer or a maximum step budget is reached.[9]
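The alternation described above can be sketched as a short loop. This is an illustrative sketch, not the authors' implementation: `retrieve` is a toy word-overlap retriever, `toy_reason_step` is a hard-coded stand-in for the LLM's chain-of-thought step, and the "answer is" termination check is an assumption made for the example.

```python
import re

def tokens(text):
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, corpus, k=1, exclude=()):
    """Toy lexical retriever: rank unseen passages by word overlap with the query."""
    q = tokens(query)
    unseen = [p for p in corpus if p not in exclude]
    return sorted(unseen, key=lambda p: len(q & tokens(p)), reverse=True)[:k]

def ircot(question, corpus, reason_step, max_steps=4, k=1):
    passages = retrieve(question, corpus, k)  # first retrieval uses the question itself
    chain = []
    for _ in range(max_steps):
        sentence = reason_step(question, passages, chain)  # one new reasoning sentence
        chain.append(sentence)
        if "answer is" in sentence.lower():  # assumed termination signal
            break
        # The latest reasoning sentence becomes the next retrieval query.
        passages = passages + retrieve(sentence, corpus, k, exclude=passages)
    return chain, passages

def toy_reason_step(question, passages, chain):
    """Hard-coded stand-in for an LLM call (hypothetical)."""
    context = " ".join(passages)
    if "Chief of Protocol" in context:
        return "So the answer is Chief of Protocol."
    if "Shirley Temple" in context:
        return ("The woman who portrayed Corliss Archer was Shirley Temple, "
                "who later served in government.")
    return "I need to find who portrayed Corliss Archer."

corpus = [
    "Shirley Temple portrayed Corliss Archer in the film Kiss and Tell",
    "Shirley Temple served as Chief of Protocol of the United States",
    "Inception is a 2010 film directed by Christopher Nolan",
]
question = ("What government position was held by the woman who portrayed "
            "Corliss Archer in the film Kiss and Tell?")
chain, passages = ircot(question, corpus, toy_reason_step)
```

The second passage is only found because the first reasoning sentence names Shirley Temple, which the original question never does.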
Trivedi and colleagues evaluated IRCoT on HotpotQA, 2WikiMultiHopQA, MuSiQue, and IIRC. Using GPT-3 as the underlying model, IRCoT improved retrieval recall by up to 21 points and downstream QA performance by up to 15 points over a one-shot retrieval baseline. Similar gains held in out-of-distribution settings and with smaller models such as Flan-T5-large without additional training, suggesting the technique generalizes beyond a single backbone.[9]
Self-Ask, introduced by Press and colleagues at EMNLP Findings 2023, applies a related idea at the prompt level. The model is shown few-shot examples in which it first writes a "follow-up question," then writes an "intermediate answer" to that follow-up, and repeats until it can output the final answer. This makes the decomposition explicit in the model's own output, which means a search engine or retriever can be plugged in to answer the follow-up questions externally. The paper introduced the Bamboogle and Compositional Celebrities benchmarks specifically to expose the compositionality gap, and showed that Self-Ask plus search closes much of that gap on these benchmarks.[7]
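A rough sketch of the Self-Ask control loop with an external search step follows. The marker strings follow the prompt format as commonly reproduced and should be treated as assumptions; `toy_llm` and `toy_search` are hypothetical stand-ins for a real model and a real search engine.

```python
FOLLOW_UP = "Follow up:"
INTERMEDIATE = "Intermediate answer:"
FINAL = "So the final answer is:"

def self_ask(question, llm, search, max_hops=4):
    transcript = f"Question: {question}\nAre follow up questions needed here: Yes.\n"
    for _ in range(max_hops):
        line = llm(transcript)  # the model writes the next line of the transcript
        transcript += line + "\n"
        if line.startswith(FINAL):
            return line[len(FINAL):].strip(), transcript
        if line.startswith(FOLLOW_UP):
            # The follow-up is answered externally instead of by the model.
            answer = search(line[len(FOLLOW_UP):].strip())
            transcript += f"{INTERMEDIATE} {answer}\n"
    return None, transcript

# Hypothetical lookup table standing in for a search engine.
FACTS = {
    "Who portrayed Corliss Archer in Kiss and Tell?": "Shirley Temple",
    "What government position did Shirley Temple hold?": "Chief of Protocol",
}

def toy_search(query):
    return FACTS.get(query, "unknown")

def toy_llm(transcript):
    """Hard-coded stand-in for a few-shot prompted model (hypothetical)."""
    answers = [l.split(": ", 1)[1] for l in transcript.splitlines()
               if l.startswith(INTERMEDIATE)]
    if not answers:
        return f"{FOLLOW_UP} Who portrayed Corliss Archer in Kiss and Tell?"
    if len(answers) == 1:
        return f"{FOLLOW_UP} What government position did {answers[0]} hold?"
    return f"{FINAL} {answers[1]}"

final_answer, transcript = self_ask(
    "What government position was held by the woman who portrayed Corliss Archer?",
    toy_llm, toy_search)
```

Because each follow-up question is a self-contained string, it can be routed to any retriever without changing the prompt.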
ReAct (Yao and colleagues, ICLR 2023) generalizes the pattern beyond explicit question decomposition. The model alternates "Thought," "Action," and "Observation" steps, where an Action is typically a call to a tool such as a Wikipedia search API and an Observation is the text the tool returns. On HotpotQA, ReAct prompts a large language model (PaLM-540B in the paper's main experiments) to issue successive searches that ground each reasoning step in retrieved text, which the authors argue mitigates the hallucination and error-propagation problems of pure chain-of-thought.[10]
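The Thought, Action, Observation cycle can be sketched as follows. The `Search[...]` and `Finish[...]` action syntax is an assumption for this example, and `toy_policy` and the `WIKI` dictionary are hypothetical stand-ins for the prompted model and the search tool.

```python
import re

ACTION_RE = re.compile(r"^(Search|Finish)\[(.+)\]$")  # assumed action syntax

def react(question, policy, search_tool, max_steps=5):
    trace = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action = policy(trace)  # model proposes a thought and an action
        trace.append(f"Thought: {thought}")
        trace.append(f"Action: {action}")
        match = ACTION_RE.match(action)
        if match is None:
            trace.append("Observation: invalid action")
            continue
        name, argument = match.groups()
        if name == "Finish":
            return argument, trace
        # The tool output is fed back so the next thought is grounded in it.
        trace.append(f"Observation: {search_tool(argument)}")
    return None, trace

# Hypothetical two-entry "encyclopedia" standing in for a search API.
WIKI = {
    "Kiss and Tell": "Kiss and Tell is a film starring Shirley Temple as Corliss Archer.",
    "Shirley Temple": "Shirley Temple later served as Chief of Protocol of the United States.",
}

def toy_search(entity):
    return WIKI.get(entity, "no result")

def toy_policy(trace):
    """Hard-coded stand-in for the prompted LLM (hypothetical)."""
    observations = [s for s in trace if s.startswith("Observation:")]
    if not observations:
        return "I need to find who portrayed Corliss Archer.", "Search[Kiss and Tell]"
    if len(observations) == 1:
        return "Shirley Temple played the role; I need her government position.", "Search[Shirley Temple]"
    return "She served as Chief of Protocol.", "Finish[Chief of Protocol]"

answer, trace = react("What government position did the Corliss Archer actress hold?",
                      toy_policy, toy_search)
```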
FLARE (Forward-Looking Active REtrieval, Jiang and colleagues, EMNLP 2023) targets long-form generation rather than short factoid QA but is often grouped with multi-hop RAG. Rather than retrieve based on the prior text alone, FLARE drafts the next sentence the model is about to produce, then uses that draft as the retrieval query. If the draft contains low-confidence tokens (measured by the LM's own token probabilities), FLARE retrieves passages and regenerates the sentence with the new context. The active retrieval mechanism allows the system to retrieve only when needed and to anticipate which facts a future sentence will require, an approach the authors evaluated on four long-form knowledge-intensive tasks.[11]
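The confidence-triggered retrieval step can be illustrated with a small sketch. It assumes the drafting function returns `(token, probability)` pairs and uses an arbitrary threshold of 0.4; `toy_draft` and `toy_retrieve` are hypothetical stand-ins, and the draft sentence is used directly as the query.

```python
def join(tokens_with_probs):
    return " ".join(token for token, _ in tokens_with_probs)

def flare_sentence(context, draft_fn, retrieve_fn, threshold=0.4):
    """Draft the next sentence; retrieve and regenerate only if confidence is low."""
    draft = draft_fn(context, [])
    if all(prob >= threshold for _, prob in draft):
        return join(draft), []  # confident draft: keep it, no retrieval
    passages = retrieve_fn(join(draft))  # the draft sentence is the query
    return join(draft_fn(context, passages)), passages

def toy_draft(context, passages):
    """Stand-in for an LM that exposes token probabilities (hypothetical values)."""
    if passages:
        return [("She", 0.9), ("served", 0.9), ("as", 0.9),
                ("Chief", 0.8), ("of", 0.9), ("Protocol", 0.8)]
    return [("She", 0.9), ("served", 0.8), ("as", 0.9), ("Senator", 0.2)]

def toy_retrieve(query):
    return ["Shirley Temple served as Chief of Protocol of the United States"]

sentence, used = flare_sentence("Shirley Temple left acting.", toy_draft, toy_retrieve)
```

The low-probability token in the first draft triggers retrieval, and the sentence is regenerated with the retrieved passage in context.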
Self-RAG, introduced by Asai and colleagues in October 2023 and published at ICLR 2024, trains a single LM to interleave retrieval with generation under its own control. The model emits special "reflection tokens" that decide whether to retrieve, whether retrieved passages are relevant, whether the generated continuation is supported, and how useful the final answer is. Training data is constructed by prompting GPT-4 to annotate reflection tokens on existing instruction-following examples, then distilling these labels into a smaller Llama-2 base model. The 7B and 13B Self-RAG models outperformed ChatGPT and retrieval-augmented Llama-2-chat baselines on open-domain QA, reasoning, and fact-verification tasks in the original paper.[12]
A second family of methods makes decomposition explicit. The query is first transformed into a list of sub-questions, each sub-question is answered through its own retrieval call, and the partial answers are then composed.
LlamaIndex ships this pattern as the Sub-Question Query Engine. Given a complex query and a set of registered tools (each backed by an index), an LLM generates sub-questions, routes each one to the appropriate tool, gathers intermediate answers, and synthesizes a final response. The pattern is well suited to queries that span multiple data sources, for example a question that requires combining a financial filing with a research report.[13] LangChain exposes a related "Multi-Query Retriever" that automatically generates multiple paraphrases of the input query, retrieves chunks for each paraphrase, and merges the candidate set before passing it to the generator.[14]
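The decompose, route, and synthesize pattern can be sketched without any framework. The code below is not the LlamaIndex or LangChain API: `SubQuestion`, `run_subquestions`, and the toy tools are hypothetical names, and the decomposition and synthesis steps that an LLM would perform are hard-coded.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubQuestion:
    tool: str  # name of the tool (index) that should answer it
    text: str

def run_subquestions(query, decompose, tools, synthesize):
    """Decompose the query, answer each part with its tool, then combine."""
    sub_questions = decompose(query, sorted(tools))
    partials = []
    for sq in sub_questions:
        if sq.tool not in tools:
            continue  # ignore routes to tools that were never registered
        partials.append((sq.text, tools[sq.tool](sq.text)))
    return synthesize(query, partials)

def toy_decompose(query, tool_names):
    """Stand-in for the LLM that writes and routes sub-questions."""
    return [
        SubQuestion("filings", "What revenue did the company report?"),
        SubQuestion("research", "What growth do analysts forecast?"),
    ]

# Each tool would normally wrap its own index and retriever.
tools = {
    "filings": lambda q: "Revenue was 10 million dollars.",
    "research": lambda q: "Analysts forecast 5 percent growth.",
}

def toy_synthesize(query, partials):
    """Stand-in for the final LLM synthesis call."""
    return " ".join(answer for _, answer in partials)

answer = run_subquestions(
    "How does reported revenue compare with analyst forecasts?",
    toy_decompose, tools, toy_synthesize)
```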
Decomposition-style approaches show up in recent academic work as well. A representative example is Multi-Meta-RAG (Poliakov and Shvai, 2024), which targets the MultiHop-RAG benchmark specifically: an LLM extracts metadata fields from the query (entities, dates, source attributes), the metadata is used to filter the vector database before similarity search, and the filtered set is retrieved and passed to the generator. The authors report meaningful improvements on MultiHop-RAG, illustrating that LLM-extracted metadata can compensate for the weakness of pure dense retrieval on multi-hop queries.[15]
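The filter-then-search idea can be illustrated as follows. This is a sketch of the general pattern, not the Multi-Meta-RAG code: the chunk records and source names are invented, `extract_filter` replaces the LLM extraction step with simple string matching, and word overlap stands in for vector similarity.

```python
import re

# Hypothetical chunk store with metadata attached to each chunk.
chunks = [
    {"text": "The central bank raised interest rates again.",
     "source": "Daily Ledger", "month": "2023-10"},
    {"text": "The central bank held interest rates steady.",
     "source": "Tech Wire", "month": "2023-10"},
    {"text": "Interest rates were raised by the central bank.",
     "source": "Daily Ledger", "month": "2023-11"},
]

def extract_filter(query, known_sources):
    """Stand-in for LLM metadata extraction: find source names in the query."""
    mentioned = [s for s in known_sources if s.lower() in query.lower()]
    return {"source": mentioned} if mentioned else {}

def matches(chunk, metadata_filter):
    return all(chunk.get(field) in allowed
               for field, allowed in metadata_filter.items())

def score(query, text):
    """Word overlap as a stand-in for embedding similarity."""
    words = lambda s: set(re.findall(r"\w+", s.lower()))
    return len(words(query) & words(text))

def filtered_search(query, chunks, k=2):
    sources = sorted({c["source"] for c in chunks})
    metadata_filter = extract_filter(query, sources)
    # Filter on metadata first, then rank only the surviving candidates.
    candidates = [c for c in chunks if matches(c, metadata_filter)]
    return sorted(candidates, key=lambda c: score(query, c["text"]), reverse=True)[:k]

results = filtered_search("What did Daily Ledger report about interest rates?", chunks)
```

Without the filter, the chunk from the other source would compete on equal terms, since all three texts overlap with the query to the same degree.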
DSPy, the declarative LM-programming framework from Stanford and collaborators, treats multi-hop retrieval as a compilable pipeline. The framework's documentation includes a standard multi-hop example (often called "Baleen-style," after Khattab and colleagues' earlier retrieval system) that generates a query, retrieves passages, reasons over them to generate a new query, retrieves again, and finally answers. DSPy compiles the prompts and few-shot demonstrations automatically from a small set of training questions, which is useful when the cost of hand-tuning multi-hop prompts is high.[16]
Iterative methods struggle when the answer requires aggregating evidence across many entities, for example "What are the main themes in this 10,000-page corpus?" or "How do the priorities of the EU AI Act and the US AI Executive Order compare?" Such global questions are not retrievable in any literal sense, because no single passage contains the answer. Graph-based multi-hop RAG attacks this regime.
GraphRAG, released by Microsoft Research in April 2024, uses an LLM to build a knowledge graph index from source documents in two stages. First, an LLM extracts entities and relationships from each chunk to produce an entity knowledge graph. Second, the system runs the Leiden community-detection algorithm on the graph to produce a hierarchy of "communities," then prompts the LLM to write a natural-language summary for every community at every level. At query time, the system uses the relevant community summaries to draft partial answers and synthesizes them into a final response.[2] Edge and colleagues reported that on global sensemaking questions over one-million-token corpora, GraphRAG produced substantial improvements over conventional RAG baselines in both comprehensiveness and diversity of generated answers, as judged in head-to-head comparisons by an LLM evaluator.[17]
GraphRAG uses the Leiden algorithm rather than its predecessor Louvain because Leiden guarantees well-connected communities and supports hierarchical clustering; the system stores community summaries at every level of the hierarchy, from the coarsest root-level communities (level 0) down to progressively finer-grained sub-communities, and selects which level to consult based on the question's scope.[18] The Microsoft GraphRAG implementation was open-sourced in mid-2024, and several follow-ups extended the approach with hybrid graph-plus-vector retrieval and integration with graph databases such as Neo4j and Memgraph.[19]
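The index-then-summarize flow can be sketched with standard-library code. This is a heavily simplified illustration, not the Microsoft implementation: plain connected components stand in for Leiden (so there is a single flat level rather than a hierarchy), the triples are invented, and `summarize`, `map_fn`, and `reduce_fn` replace the LLM calls.

```python
from collections import defaultdict

def build_graph(triples):
    """Undirected entity graph from (head, relation, tail) triples."""
    graph = defaultdict(set)
    for head, _relation, tail in triples:
        graph[head].add(tail)
        graph[tail].add(head)
    return graph

def detect_communities(graph):
    """Stand-in for Leiden: plain connected components, one flat level only."""
    seen, communities = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node not in members:
                members.add(node)
                stack.extend(graph[node] - members)
        seen |= members
        communities.append(members)
    return communities

def summarize(members, triples):
    """Stand-in for the LLM-written summary: list the facts in the community."""
    return "; ".join(f"{h} {r} {t}" for h, r, t in triples
                     if h in members and t in members)

def global_answer(question, summaries, map_fn, reduce_fn):
    """Map each community summary to a partial answer, then reduce."""
    partials = [map_fn(question, s) for s in summaries]
    return reduce_fn(question, [p for p in partials if p])

# Hypothetical triples, as an entity-extraction step might produce them.
triples = [
    ("Acme", "acquired", "Initech"),
    ("Initech", "supplies", "Globex"),
    ("Hooli", "partners with", "Pied Piper"),
]
communities = detect_communities(build_graph(triples))
summaries = [summarize(c, triples) for c in communities]
answer = global_answer(
    "Which companies acquired others?", summaries,
    map_fn=lambda q, s: s if "acquired" in s else "",
    reduce_fn=lambda q, parts: " | ".join(parts))
```

The summaries are computed once at index time; only the map and reduce steps run per query.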
A broader literature predating GraphRAG used pre-existing knowledge graphs (Wikidata, ConceptNet, proprietary enterprise graphs) as a parallel retrieval target. The retriever fetches both passages from a text index and structured facts from a graph; the generator conditions on both. This pattern is convenient when the corpus naturally has a graph structure (clinical notes linked to patient identifiers, financial filings linked to tickers) but requires curation work that GraphRAG-style approaches automate.
Multi-hop questions are often phrased in language that does not appear in the passages that answer them. HyDE (Hypothetical Document Embeddings), introduced by Gao, Ma, Lin, and Callan at ACL 2023, addresses the mismatch by first asking an LLM to generate a plausible-but-possibly-fictional answer document, then embedding that hypothetical document with a contrastively trained encoder such as Contriever, and using the resulting vector to retrieve real documents. Real documents that share latent topical structure with the hypothetical one fall close to it in embedding space, even if the surface vocabulary differs. The technique was introduced for zero-shot dense retrieval without relevance labels and is often used as a query rewriter in multi-hop pipelines: the hypothetical document for an intermediate hop can name entities the original question never mentioned, improving recall on the second-hop retrieval.[20]
The technique is available as a ready-made component in LlamaIndex, Haystack, and similar frameworks. Empirically, HyDE provides larger gains for harder queries and less consistent gains for short keyword queries that already match passages well.
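A toy illustration of the HyDE idea follows. It is a sketch under strong simplifications: Jaccard similarity over word sets stands in for a dense encoder, the two-sentence corpus is invented, and `toy_generate` returns a fixed string where a real system would prompt an LLM.

```python
import re

def words(text):
    return set(re.findall(r"\w+", text.lower()))

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve(text, corpus, k=1):
    """Rank passages by similarity to the given text (query or hypothetical doc)."""
    probe = words(text)
    return sorted(corpus, key=lambda p: jaccard(probe, words(p)), reverse=True)[:k]

def hyde_retrieve(query, corpus, generate_hypothetical, k=1):
    # Search with the hypothetical answer document instead of the raw query.
    return retrieve(generate_hypothetical(query), corpus, k)

def toy_generate(query):
    """Stand-in for the LLM that drafts a plausible answer document."""
    return "The actress Shirley Temple later served as a diplomat and Chief of Protocol"

corpus = [
    "Shirley Temple served as Chief of Protocol of the United States",
    "Kiss and Tell is a 1945 comedy film",
]
query = "government job of the Kiss and Tell actress"

plain_hit = retrieve(query, corpus)[0]
hyde_hit = hyde_retrieve(query, corpus, toy_generate)[0]
```

The raw query matches the film passage on surface vocabulary, while the hypothetical document names the entity needed for the second hop and lands on the passage about the government role.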
A complementary line of work targets the failure mode where retrieval fails because individual chunks lack the context needed to be matched. Anthropic published "Contextual Retrieval" in September 2024. The technique prepends a chunk-specific contextual summary, generated by an LLM from the full document, to each chunk before embedding and BM25 indexing. Each chunk then carries information about its own subject, such as which company is being discussed and which fiscal year applies, so a query that mentions those entities can match even when the underlying chunk text does not.[21]
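The effect of prepending context can be shown with a small sketch. The documents below are invented, `toy_describe` stands in for the LLM that writes the context (here it only reuses the document's first line), and word overlap stands in for embedding and BM25 scoring.

```python
import re

def words(text):
    return set(re.findall(r"\w+", text.lower()))

def contextualize(document, chunk, describe):
    """Prepend a document-aware description to the chunk before indexing."""
    return f"{describe(document, chunk)} {chunk}"

def toy_describe(document, chunk):
    """Stand-in for the LLM call; uses the first line of the document as its title."""
    return f"This chunk is from the {document.splitlines()[0]}."

def best_match(query, index):
    return max(index, key=lambda entry: len(words(query) & words(entry)))

# Hypothetical documents: a title line followed by one chunk each.
docs = [
    "Globex Q2 2023 quarterly report\nRevenue fell by 1% over the previous quarter.",
    "ACME Corp Q2 2023 quarterly report\nRevenue grew by 3% over the previous quarter.",
]

raw_index, contextual_index = [], []
for doc in docs:
    for chunk in doc.splitlines()[1:]:
        raw_index.append(chunk)
        contextual_index.append(contextualize(doc, chunk, toy_describe))

query = "ACME revenue Q2 2023"
raw_hit = best_match(query, raw_index)
contextual_hit = best_match(query, contextual_index)
```

Against the raw chunks the two candidates tie and the wrong company's chunk is returned; with context prepended, the company name and period in the query disambiguate the match.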
Anthropic reported that contextual embeddings alone reduced top-20-chunk retrieval failure rate by 35 percent, that combining contextual embeddings with contextual BM25 reduced it by 49 percent, and that adding reranking on top brought the reduction to 67 percent. The article framed the technique as a way to make every chunk independently retrievable, which is particularly useful when a multi-hop query needs to recover a chunk whose only link to the answer is a pronoun or implicit reference.[21]
For multi-hop RAG specifically, contextual chunking interacts favorably with iterative methods: better-contextualized chunks reduce the number of hops needed, because the first hop can succeed where it would have failed under naive chunking.
The release of Gemini 1.5 Pro with a million-token context window in 2024 raised the question of whether multi-hop RAG matters at all. If a corpus fits in context, why not just include it and ask the model to do the reasoning natively?
Empirical results suggest the answer depends on the failure mode. On the original needle-in-a-haystack test, Gemini 1.5 Pro reaches over 99.7 percent recall on a single inserted fact within a one-million-token haystack, validating the long-context retrieval claim.[22] But on multi-needle variants, where multiple facts must be recovered and integrated, average recall on realistic queries falls substantially. Multi-hop benchmarks expose a reasoning failure mode that is separate from retrieval: even when every fact is in context, integrating them across long distances often fails.[23]
In practical pipelines, multi-hop RAG and long context are complementary rather than substitutes. Long context allows retrieved chunks to be larger and to carry more surrounding evidence, which reduces the number of retrieval hops needed. Iterative retrieval gives the model a way to seek information it does not have, which long context alone cannot do over a corpus that exceeds the window. Production systems typically combine both: retrieve aggressively with a multi-hop strategy, then put generous context into a long-context model so the synthesis step has room to reason.
Multi-hop RAG is evaluated along two dimensions: retrieval (did the system fetch the right supporting passages?) and generation (did it produce the right answer?).
Standard retrieval metrics are Hit@k, Mean Reciprocal Rank, Recall@k, and supporting-fact F1. HotpotQA reports supporting-fact EM and F1 alongside the answer metrics, which penalizes systems that arrive at the right answer by retrieving irrelevant context. The leaderboard for HotpotQA, maintained at the project site, shows that the best distractor-setting systems exceed 70 percent joint EM, while the best fullwiki systems reach into the high 60s, with most of the gap attributable to retrieval difficulty.[24]
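The ranking metrics named above are simple to compute once a ranked list and a gold set are available. The sketch below uses common textbook definitions and invented document identifiers; individual benchmarks may differ in details such as how ties or multiple gold documents are handled.

```python
def hit_at_k(ranked, gold, k):
    """1.0 if any gold document appears in the top k, else 0.0."""
    return float(any(doc in gold for doc in ranked[:k]))

def recall_at_k(ranked, gold, k):
    """Fraction of gold documents found in the top k."""
    return len(set(ranked[:k]) & set(gold)) / len(gold)

def reciprocal_rank(ranked, gold):
    """1 / rank of the first gold document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in gold:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, gold) pairs."""
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)

# Hypothetical two-hop query: both d2 and d4 are needed to answer it.
ranked = ["d7", "d2", "d9", "d4"]
gold = {"d2", "d4"}

mrr = mean_reciprocal_rank([(ranked, gold), (["d1"], {"d1"})])
```

For multi-hop queries, Recall@k is usually the more informative number: Hit@k is satisfied as soon as one supporting document is found, even though the question cannot be answered without the others.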
For short-answer benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue), exact match and F1 over normalized tokens are standard. MultiHop-RAG uses accuracy for inference, comparison, and temporal categories and includes a "null" category to penalize systems that hallucinate when the corpus does not support an answer.[4]
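Exact match and token-level F1 are typically computed after a normalization step. The sketch below follows the widely used SQuAD-style convention (lowercase, strip punctuation, drop English articles, collapse whitespace); the exact normalization used by each benchmark's official script may differ, so this is an approximation.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    return normalize(prediction) == normalize(gold)

def token_f1(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

em = exact_match("The Chief of Protocol.", "chief of protocol")
f1 = token_f1("Chief of Protocol", "the Chief of Protocol of the United States")
```

In the second call the prediction is fully contained in the gold answer, so precision is 1.0 and recall is 0.5, which gives an F1 of two thirds.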
For longer-form multi-hop generations, evaluators increasingly use LLM-as-judge scoring or pairwise human comparison, as in the GraphRAG paper.[17] These methods are harder to standardize but capture the comprehensiveness and grounding dimensions that exact-match metrics miss.
The table below collects reported results from primary sources. Scores are not directly comparable across rows because settings and base models differ.
| System | Benchmark | Metric | Score | Source |
|---|---|---|---|---|
| IRCoT (GPT-3) | HotpotQA, 2WikiMultiHopQA, MuSiQue, IIRC | Downstream QA gain | up to +15 points over one-shot retrieval | Trivedi et al. 2022 [9] |
| GPT-4 with gold evidence | MultiHop-RAG | Accuracy | ~0.89 | Tang and Yang 2024 [8] |
| GPT-4 with retrieved evidence | MultiHop-RAG | Accuracy | ~0.56 | Tang and Yang 2024 [8] |
| Contextual Retrieval (Voyage) | Anthropic eval set | Top-20 failure rate reduction | 67% | Anthropic 2024 [21] |
| Self-RAG 13B | Open-domain QA | Comparison with ChatGPT | outperforms | Asai et al. 2023 [12] |
| GraphRAG | Global queries on 1M-token corpus | Comprehensiveness | substantial gains over RAG baseline | Edge et al. 2024 [17] |
Multi-hop RAG inherits the failure modes of its components and adds compositional ones of its own.
The first limitation is cost. Iterative retrieval issues multiple LLM calls and multiple retriever calls per question. For Self-Ask or IRCoT-style approaches, a three-hop question may require three to five LLM calls plus retrievals, multiplying latency and token spend over a single-hop baseline. GraphRAG amortizes some of this by precomputing community summaries, but the index-build cost grows with corpus size and must be repeated when the corpus changes substantively.[17]
The second limitation is error propagation. Each hop conditions on the output of the previous hop. When an early step retrieves an irrelevant passage or the LLM produces a wrong intermediate answer, the error compounds. IRCoT, ReAct, and Self-RAG all attempt to mitigate this with reflection or with the ability to fall back to the original question, but no method eliminates the problem.
The third limitation is benchmark generalization. The original multi-hop QA benchmarks were built over Wikipedia, with relatively clean sentence-level supporting facts. Real corpora are messier: chunks have inconsistent length, documents have boilerplate, and entities are referred to by abbreviation or by implicit context. MultiHop-RAG was constructed to be more realistic on this dimension by using news articles, but it remains a fixed snapshot of late-2023 news, and performance there does not guarantee performance on a particular enterprise corpus.[4]
The fourth limitation is null detection. Many multi-hop systems hallucinate when the corpus does not support an answer. MultiHop-RAG's null category exists specifically to measure this; reported accuracies on null queries are typically lower than on inference or comparison queries, indicating that most current pipelines err on the side of producing some answer.[4]
Multi-hop RAG appears in production patterns offered by every major orchestration framework. LlamaIndex exposes the Sub-Question Query Engine, the Knowledge Graph RAG Query Engine, and built-in router engines that select among single-hop and multi-hop strategies based on query complexity.[13] LangChain offers multi-query retrievers, a Self-Ask agent, and integrations with GraphRAG and with academic implementations such as Self-RAG.[14] Microsoft's GraphRAG is available as an open-source Python package, and several vendors (Neo4j, Memgraph, Databricks) ship managed integrations.[19]
On the model side, the iterative pattern interacts with ReAct-style tool use, which has become a default capability of frontier LLMs. Most agentic pipelines today combine ReAct-style multi-hop reasoning with one or more retrieval tools, blurring the distinction between "multi-hop RAG" and general agentic question answering.
Enterprise deployments often combine multiple strategies. A typical pattern is: rewrite the query with HyDE for the first hop, retrieve a broad candidate set with a hybrid dense and BM25 index, rerank with a cross-encoder, run an IRCoT or Self-Ask style decomposition for questions flagged as multi-hop, fall back to a GraphRAG-style community summary for genuinely global queries, and synthesize the answer in a long-context model with citation-bound generation to make grounding auditable.