Agentic RAG
Agentic RAG (agentic retrieval-augmented generation) is a design pattern for grounded language generation in which an autonomous large language model agent plans, executes, and revises a sequence of retrievals, tool calls, and reasoning steps rather than performing a single fixed retrieval pass. The pattern emerged through 2023 and 2024 as practitioners hit the limits of vanilla RAG, which assumes that one embedding lookup against a vector store will surface the right context for any query. By contrast, an agentic RAG system treats retrieval as a tool that an AI agent can invoke iteratively, decompose into subqueries, reformulate after inspecting results, and combine with calculator calls, code execution, web search, or queries against structured stores until the agent decides it has enough evidence to answer.
The term gained wide currency after a January 2025 survey paper by Aditi Singh and colleagues titled "Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG" (arXiv:2501.09136), which catalogued the design patterns under a unified taxonomy and helped consolidate vocabulary that had previously been scattered across blog posts and framework documentation [1]. By 2026 the approach had been adopted across the major orchestration frameworks, including LangGraph, LlamaIndex, DSPy, Haystack, and AutoGen, and had shipped as a first-class product feature in Microsoft Azure AI Search, Cohere Command R, Glean, and Contextual AI's enterprise platform [2][3][4][5].
Vanilla RAG, as introduced by Patrick Lewis and colleagues at Facebook AI Research in 2020, follows a fixed three-step pipeline: embed the user query, retrieve the top-k most similar chunks from a vector index, and pass those chunks alongside the query to a generator model. The simplicity of this pipeline drove rapid adoption, and by 2024 retrieval-augmented architectures appeared in a majority of enterprise generative AI deployments. The same simplicity, however, produced predictable failure modes when queries grew more complex than "look up one fact in one document" [6].
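The fixed pipeline is compact enough to sketch in full. The following is a minimal illustration of the three steps rather than any particular library's API; `embed`, `search`, and `llm` are hypothetical callables standing in for an embedding model, a vector-index lookup, and a generator.

```python
from typing import Callable

def vanilla_rag(query: str,
                embed: Callable[[str], list[float]],
                search: Callable[[list[float], int], list[str]],
                llm: Callable[[str], str],
                k: int = 4) -> str:
    """Embed the query, fetch top-k chunks, generate once.
    No retries, no relevance check, no tools beyond the vector index."""
    query_vec = embed(query)                     # step 1: embed the query
    chunks = search(query_vec, k)                # step 2: top-k similarity lookup
    prompt = ("Answer using only the context below.\n\n"
              "Context:\n" + "\n---\n".join(chunks) +
              f"\n\nQuestion: {query}\nAnswer:")
    return llm(prompt)                           # step 3: single generation pass
```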
Four limitations of vanilla RAG drove the move toward agentic variants. First, single-shot retrieval fails on multi-hop questions where the answer requires combining facts from documents that are not semantically similar to the original query. Second, the top-k retrieval window is fixed in advance, so queries that need very little context waste tokens while queries that need a great deal of context get truncated. Third, vanilla RAG has no mechanism to recognize that retrieved documents are irrelevant or insufficient, so a low-quality retrieval propagates directly into a hallucinated answer. Fourth, vanilla RAG cannot ask clarifying questions, query a structured database, run a calculation, or call any tool other than the dense vector index it was wired to at build time [1][7].
Agentic RAG addresses each of these limitations by inserting an agent loop between the user and the retriever. The loop uses chain-of-thought reasoning, function calling, and self-evaluation primitives that became reliable in frontier models during 2023 and 2024, including GPT-4 function calling, Claude tool use, and the open-source ReAct prompting pattern. With these primitives in place, an LLM could not only generate text from retrieved context but also decide what to retrieve next, when to stop retrieving, and how to combine retrieval with other tools.
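In schematic form the loop looks like the sketch below, in which the model itself chooses the next action at every step. `llm_decide`, `llm_answer`, and the tool registry are hypothetical stand-ins; in practice the decision step is implemented with a native function-calling or tool-use API.

```python
def agentic_rag(query, llm_decide, tools, llm_answer, max_steps=6):
    """Hypothetical agent loop: retrieve (or call other tools) until the
    model decides the accumulated evidence is sufficient."""
    evidence = []
    for _ in range(max_steps):                    # hard cap prevents runaway loops
        # The model inspects the evidence so far and picks the next action,
        # e.g. {"tool": "vector_search", "input": "refined subquery"}.
        action = llm_decide(query, evidence)
        if action["tool"] == "finish":            # model judges evidence sufficient
            break
        evidence.append(tools[action["tool"]](action["input"]))
    return llm_answer(query, evidence)            # final grounded generation
```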
The specific phrase "agentic RAG" began circulating in mid-2024 in framework documentation and engineering blogs. LangChain's LangGraph team published an early influential piece titled "Agentic RAG with LangGraph" in 2024 that described how to use a state-machine abstraction to implement the iterative retrieval loop, drawing on the Self-RAG paper by Akari Asai and colleagues (arXiv:2310.11511, October 2023) which had introduced self-reflective retrieval and critique tokens [8]. LlamaIndex published a parallel guide titled "Agentic RAG with LlamaIndex" in 2024 describing per-document agents that could be composed under a top-level coordinator agent [3].
The term was consolidated in academic literature by Singh, Ehtesham, Kumar, Talaei Khoei, and Vasilakos in their January 2025 survey, which defined Agentic RAG as RAG systems that "embed autonomous AI agents into the pipeline" and use "agentic design patterns including reflection, planning, tool use, and multi-agent collaboration" to manage retrieval dynamically [1]. The survey proposed a taxonomy across four dimensions: agent cardinality (single vs. multi-agent), control structure (sequential to adaptive), autonomy level, and knowledge representation. A revised version of the survey (v4, April 2026) expanded the practical implementation guidance and added case studies in healthcare, finance, and education.
A related survey by the same research community, "Reasoning RAG via System 1 or System 2: A Survey on Reasoning Agentic Retrieval-Augmented Generation for Industry Challenges" (arXiv:2506.10408, 2025), framed agentic RAG using the dual-process distinction from cognitive psychology, with fast, intuitive single-shot RAG on one end and slower, deliberative agentic loops on the other [9].
The core difference between vanilla RAG and agentic RAG is the introduction of a control loop with conditional logic. Where vanilla RAG runs retrieval exactly once per query, agentic RAG runs retrieval an unknown number of times, with the LLM itself deciding whether each round produced sufficient evidence.
| Dimension | Vanilla RAG | Agentic RAG |
|---|---|---|
| Retrieval steps | Exactly one | Variable, decided by agent |
| Query handling | Embed user query directly | Decompose into subqueries, reformulate |
| Tool use | Only vector retriever | Multiple tools (vector, SQL, web, calculator, code) |
| Failure handling | No detection of bad retrievals | Self-evaluation and retry |
| Latency (typical) | 1 to 2 seconds | 8 to 12 seconds with 3 to 4 iteration loops |
| Token cost | Baseline | 3 to 10 times higher |
| Best for | Single-hop factual lookup | Multi-hop reasoning, ambiguous queries |
| Hallucination risk | Higher when retrieval misses | Lower due to validation steps |
| Engineering complexity | Low | High |
A practical heuristic that emerged in production engineering practice by 2026 is to route incoming queries through a classifier that selects vanilla RAG for single-hop factual questions and reserves agentic RAG for multi-step reasoning, cross-source synthesis, or high-stakes domains such as legal, medical, and financial work [7].
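A minimal sketch of that router, assuming a hypothetical `classify_complexity` helper (in practice often a small, cheap model call) and the two pipelines sketched earlier:

```python
def route_and_answer(query, classify_complexity, vanilla_rag, agentic_rag):
    label = classify_complexity(query)  # e.g. "single_hop", "multi_hop", "high_stakes"
    if label == "single_hop":
        return vanilla_rag(query)       # one retrieval, baseline latency and cost
    return agentic_rag(query)           # multi-hop, ambiguous, or high-stakes queries
```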
Agentic RAG is not a single algorithm but a family of patterns that share the agent-in-the-loop principle. The Singh et al. survey and subsequent practitioner literature have converged on roughly seven recurring patterns, any of which can be combined within a single system.
| Pattern | Description | Representative system |
|---|---|---|
| Query decomposition | Break a compound query into focused subqueries dispatched in parallel or sequence | Azure AI Search agentic retrieval, LlamaIndex sub-question engine |
| Iterative retrieval | After each retrieval, the agent inspects results and decides whether to retrieve again with a refined query | LangGraph corrective RAG, Self-RAG |
| Multi-hop reasoning | Chain retrievals where each query depends on facts surfaced by the previous one | Anthropic Claude with computer use, LlamaIndex sub-question agent |
| Tool use beyond retrieval | Mix vector retrieval with SQL, web search, calculator, code execution, and structured APIs | Cohere Command R with native tool use, OpenAI Assistants API |
| Self-reflection | The agent grades its own draft answer against the retrieved evidence and rewrites or retrieves more if support is weak | Self-RAG (Asai et al. 2023), Reflexion-style agents |
| Re-ranking and validation | A second model or pass scores retrieved chunks for relevance and discards low-confidence material | Cohere Rerank, Azure AI Search semantic ranking |
| Multi-agent orchestration | Specialized agents (planner, retriever, critic, synthesizer) divide labor and pass intermediate state | AutoGen, LangGraph supervisor pattern, CrewAI |
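The query-decomposition pattern in the first row, for example, reduces to a planner call followed by parallel dispatch. This is a hedged sketch with hypothetical `plan_subqueries`, `retrieve`, and `synthesize` callables, not any specific framework's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def decompose_and_answer(query, plan_subqueries, retrieve, synthesize):
    subqueries = plan_subqueries(query)      # planner LLM returns a list[str]
    with ThreadPoolExecutor() as pool:       # dispatch subqueries in parallel
        evidence = list(pool.map(retrieve, subqueries))
    return synthesize(query, subqueries, evidence)  # combine partial evidence
```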
Most production deployments use several of these patterns at once. Microsoft's agentic retrieval in Azure AI Search, for example, combines query decomposition, parallel subquery dispatch, semantic re-ranking, and an LLM-driven planning step in a single managed service [2]. LangGraph's reference "corrective RAG" tutorial combines iterative retrieval with self-reflection, where a grading node decides whether to accept retrieved documents, rewrite the query, or fall back to web search [8].
The orchestration frameworks that emerged around general-purpose agent development have all added explicit support for agentic RAG patterns, though they differ in abstraction style and target use case.
| Framework | Origin | Abstraction | Agentic RAG support |
|---|---|---|---|
| LangChain and LangGraph | LangChain Inc., 2022 and 2024 | Graphs and state machines | Reference tutorials for corrective RAG, self-reflective RAG, multi-agent supervisor RAG |
| LlamaIndex | Jerry Liu, 2022 | Index abstractions, query engines, workflows | Per-document agents, sub-question query engine, ReAct agent, Workflows API |
| DSPy | Stanford NLP, 2023 | Programmatic prompts with optimizers | RAG modules with MIPROv2 prompt optimization, ReAct module |
| Haystack 2.x | deepset, Berlin | Composable pipelines | Pipeline components for retrievers, rankers, routers, plus agent primitives |
| AutoGen | Microsoft Research | Conversational multi-agent | Agent roles for retrieval, code execution, and critique |
| CrewAI | CrewAI Inc., 2024 | Role-based crews | Researcher and reporter agents with shared retrieval tools |
LangGraph and LlamaIndex are the two most widely adopted frameworks for agentic RAG specifically, in part because both teams published canonical tutorials with the phrase "agentic RAG" in the title during 2024 [3][8]. DSPy takes a distinctive approach: instead of asking developers to write prompts manually, it compiles them from declarative signatures and optimizes them automatically, which has proven valuable for tuning the many LLM calls inside a long agentic loop. AutoGen and CrewAI are weighted more toward multi-agent orchestration in which retrieval is one capability among many.
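To make the DSPy contrast concrete, the fragment below sketches its declarative style, assuming a DSPy 2.x release: the generation step is specified as a typed signature whose prompt is compiled rather than hand-written, so an optimizer such as MIPROv2 can tune it alongside the other calls in the loop.

```python
import dspy

class GenerateAnswer(dspy.Signature):
    """Answer the question using only the retrieved context."""
    context = dspy.InputField(desc="retrieved passages")
    question = dspy.InputField()
    answer = dspy.OutputField()

# One optimizable module inside a larger agentic loop; DSPy also provides
# a ReAct module (dspy.ReAct) for the tool-calling steps themselves.
generate = dspy.ChainOfThought(GenerateAnswer)
```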
By 2026 agentic RAG had moved from research and demos into shipped products across the major cloud providers and AI vendors.
| Adopter | Product | Notable agentic RAG feature | First shipped |
|---|---|---|---|
| Microsoft Azure | Azure AI Search | Agentic retrieval with LLM-driven query planning, parallel subquery dispatch, semantic reranking | May 2025 (public preview) |
| Cohere | Command R and Command R+ | Native RAG and tool-use training, integrated with Rerank model | March 2024 |
| Glean | Glean Work AI platform | Agentic reasoning architecture for enterprise search and workflow automation | 2024 |
| Contextual AI | Contextual AI Platform | Production RAG with built-in reranking, grounding, and instruction-following agents | 2024 |
| Anthropic | Claude Projects and Claude with computer use | Agent loops with file and web retrieval as tools | 2024 |
| OpenAI | Assistants API, file search, web search tools | Built-in retrieval and web tools that GPT-4 family models invoke autonomously | 2023 to 2024 |
| LlamaIndex Inc. | LlamaCloud | Managed agentic RAG pipelines over enterprise data | 2024 |
Microsoft's agentic retrieval announcement in May 2025 was particularly significant because it positioned the agentic loop as a managed service rather than something application developers had to assemble themselves. Microsoft reported that the service improved answer relevance on complex questions by up to 40 percent compared with single-shot RAG [2]. Cohere's Command R, announced in March 2024, was the first frontier-class model marketed specifically for production RAG and tool-use workloads, trained from the start to call tools and cite retrieved documents [4]. Glean's pivot from enterprise search to a Work AI platform during 2024 was framed around adding an "agentic reasoning architecture" on top of its existing retrieval stack so that agents could plan, execute, evaluate, and adapt to achieve workflow goals such as resolving customer support tickets [5]. Contextual AI, founded by Douwe Kiela, a co-author of the original 2020 RAG paper, built its commercial platform around the assumption that production-quality retrieval-augmented generation requires more than a single retrieval step.
Two academic methods sit at the foundation of self-reflective agentic RAG. Self-RAG, introduced by Akari Asai and colleagues in October 2023, trained a language model to emit special "reflection tokens" that indicate whether retrieval is needed, whether retrieved passages are relevant, whether the generated output is supported by retrieval, and whether the output is useful. The model adaptively retrieves passages on demand and critiques both retrieved content and its own output during generation [8].
Corrective RAG (CRAG), introduced in early 2024, added a lightweight retrieval evaluator that grades retrieved documents and triggers corrective actions: if confidence is high, the documents are passed to the generator; if confidence is low, the system rewrites the query and falls back to web search. LangGraph's reference "corrective RAG" tutorial implements this pattern as a graph with grade and rewrite nodes and has become one of the most-cited starting points for engineers building agentic RAG pipelines [8].
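The CRAG control flow reduces to a grade-then-branch loop. The sketch below is schematic, with all callables as hypothetical stand-ins for model and tool calls rather than the paper's or LangGraph's exact interfaces:

```python
def corrective_rag(query, retrieve, grade, rewrite, web_search, generate,
                   max_rewrites=2):
    docs = retrieve(query)
    for _ in range(max_rewrites):
        if grade(query, docs) == "relevant":   # evaluator confidence is high
            return generate(query, docs)
        query = rewrite(query)                 # low confidence: reformulate
        docs = retrieve(query)
    # Still unconvincing after the allowed rewrites: fall back to web search.
    return generate(query, web_search(query))
```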
Both Self-RAG and CRAG are sometimes described as proto-agentic in that they introduce internal control flow into the RAG pipeline without yet treating the LLM as a fully general planning agent. The Singh et al. survey places them as a transitional class between modular RAG and full agentic RAG [1].
Agentic RAG sits alongside two other significant extensions of vanilla RAG: multi-vector retrieval and GraphRAG. Multi-vector retrieval, exemplified by ColBERT-style late-interaction models, improves the recall and precision of the retrieval step itself by representing each document with many vectors instead of one. It does not change the single-shot retrieval structure of vanilla RAG. GraphRAG, introduced by Microsoft Research in 2024, structures the knowledge base as a graph of entities and relationships extracted from source documents, then queries the graph for cross-document patterns. GraphRAG excels at sensemaking and corpus-wide questions but adds substantial offline construction cost.
| Approach | Adds | Retains | Best for |
|---|---|---|---|
| Vanilla RAG | Vector retrieval | Single-shot pipeline | Single-hop factual queries |
| Multi-vector retrieval | Late interaction, multiple vectors per document | Single retrieval step | Precision-critical retrieval |
| GraphRAG | Knowledge graph, community summaries | Largely batch retrieval | Cross-document sensemaking |
| Agentic RAG | Agent loop, tool use, self-reflection | Whatever retriever is wired in | Multi-step reasoning, ambiguous queries |
In practice these approaches compose. Some 2026 systems use a multi-vector retriever and a knowledge graph as two of the tools an agentic RAG agent can call, choosing between them per subquery.
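A hedged sketch of that composition, with a hypothetical `route_subquery` planner choosing between two registered retrievers per subquery:

```python
def retrieve_per_subquery(subqueries, route_subquery, tools):
    evidence = []
    for sq in subqueries:
        tool_name = route_subquery(sq)       # e.g. "multi_vector" or "graph"
        evidence.append(tools[tool_name](sq))
    return evidence
```

Here `tools` might map `"multi_vector"` to a ColBERT-style retriever and `"graph"` to a GraphRAG query endpoint, so each subquery lands on the retriever best suited to it.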
Adopting agentic RAG involves predictable costs that should be weighed against its benefits.
Latency. Each iteration of the agent loop adds at least one LLM call, often two if a separate critic or grader is involved. Production measurements consistently show latency several times that of vanilla RAG, with even worse tail behavior: a vanilla RAG system that responds in 1 to 2 seconds typically grows to 8 to 12 seconds when 3 to 4 iteration loops are used [7].
Token cost. The same multiplication applies to token spend. A system that runs at $500 per day on vanilla RAG can run at $1,500 to $5,000 per day on agentic RAG before optimization. Cost can be reduced by caching subquery results, batching parallel subqueries, and using smaller models for grading steps.
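The first of those levers, subquery caching, can be as simple as memoizing a normalized form of each subquery. A minimal sketch, assuming any `retrieve(subquery) -> list[str]` callable:

```python
from functools import lru_cache

def make_cached_retriever(retrieve, maxsize=10_000):
    """Wrap a retrieve(subquery) -> list[str] callable with an in-memory
    cache keyed on a whitespace- and case-normalized subquery."""
    @lru_cache(maxsize=maxsize)
    def _cached(normalized: str) -> tuple[str, ...]:
        return tuple(retrieve(normalized))   # tuple: cached values stay immutable
    def cached_retrieve(subquery: str) -> list[str]:
        return list(_cached(" ".join(subquery.lower().split())))
    return cached_retrieve
```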
Engineering complexity. Agentic loops introduce many new failure modes: infinite loops, cascading errors when one tool returns malformed output, hard-to-reproduce non-determinism across runs, and evaluation difficulties because the trajectory through the agent graph differs across runs. Robust agentic RAG deployments invest heavily in tracing, evaluation harnesses, and circuit breakers that cap iteration counts.
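A circuit breaker of the kind described above can be a thin wrapper around the loop; the thresholds here are illustrative assumptions, not recommendations:

```python
def run_with_breaker(step, finalize, max_iters=5, max_tool_errors=2):
    """Run agent iterations until done, an iteration cap, or repeated tool
    failures, then force a best-effort answer from the evidence gathered."""
    errors = 0
    state = {"done": False, "evidence": []}
    for _ in range(max_iters):                 # cap iterations: no infinite loops
        try:
            state = step(state)                # one iteration: LLM call + tool call
        except Exception:
            errors += 1
            if errors > max_tool_errors:       # trip on repeated tool failures
                break
        if state.get("done"):
            break
    return finalize(state)                     # answer from whatever evidence exists
```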
Hallucination reduction. The compensating benefit is meaningful. Self-reflection and validation loops catch retrieval errors that vanilla RAG would pass through to the generator. Studies cited in the agentic RAG literature consistently show improvements in factual accuracy on multi-hop benchmarks, with Microsoft reporting the 40 percent relevance improvement on complex questions for its Azure AI Search agentic retrieval [2].
Adaptability. Agentic RAG handles ambiguous, exploratory, or evolving queries that vanilla RAG cannot, because the agent can reformulate, ask clarifying subquestions, or fall back to different tools when its first attempt fails. This makes agentic RAG particularly valuable for conversational interfaces, where users rarely ask perfectly formed single-hop questions.
The Singh et al. survey and subsequent industry reports identify several application areas where agentic RAG has been deployed in production [1][5].
Enterprise search and knowledge work. Glean's platform applies agentic RAG to questions over the full corpus of an organization's documents, chat threads, tickets, and code, with agents that plan a search across multiple connected systems before answering. Microsoft's Copilot for Microsoft 365 uses an agentic retrieval layer to ground answers in tenant-specific data.
Customer support automation. Agents resolve tickets by retrieving against knowledge bases, past tickets, and product documentation, and by calling internal APIs to check account status or run remediation scripts.
Healthcare. Clinical decision support and biomedical research assistants use multi-hop retrieval across literature, guidelines, and electronic health records. The survey emphasizes that the self-reflection loop is particularly important in this domain because unverified retrievals carry safety risk.
Finance. Agentic RAG systems combine retrieval against research reports, regulatory filings, and market data with calculator and structured-query tools to answer analyst questions that no single source contains.
Software engineering. Coding agents combine retrieval against codebases and documentation with tool use including code execution, type checking, and test running, in patterns sometimes described as agentic code RAG.
Education. Tutoring systems retrieve curriculum content adaptively based on student responses, decomposing complex questions into subquestions appropriate to the student's level.
Evaluating agentic RAG remains substantially harder than evaluating vanilla RAG. The agent's trajectory through retrievals and tool calls is non-deterministic, the number of LLM calls per query varies, and many benchmark questions have multiple valid answers depending on which subquery decomposition the agent chooses. The Singh et al. survey identifies five open problems: evaluation methodologies, agent coordination at scale, memory management across long-running sessions, computational efficiency, and governance frameworks covering security, auditability, and regulatory compliance [1].
New benchmarks specifically targeting agentic RAG have emerged, including multi-hop question-answering datasets that require chained retrieval and synthetic enterprise corpora designed to test cross-source synthesis. A 2026 paper, "Is Agentic RAG worth it? An experimental comparison of RAG approaches" (arXiv:2601.07711), reported that agentic configurations outperformed vanilla RAG on multi-hop and ambiguous queries but were rarely worth the cost on simple lookup queries, supporting the routing heuristic adopted in production [7].
Agentic RAG can be viewed as a specialization of the broader AI agents research agenda to the retrieval setting. The same primitives that power agentic RAG, namely planning, tool use, self-reflection, and multi-agent orchestration, underlie general-purpose agent systems such as code generation agents, computer-use agents, and scientific research agents. The retrieval setting is convenient for the agent research community because it grounds the agent's actions in a verifiable knowledge base, making evaluation more tractable than open-ended agent tasks.
The convergence of agentic RAG with general agent infrastructure has accelerated as frameworks such as LangGraph and AutoGen have generalized beyond RAG into broader agent orchestration, and as foundation model providers have shipped tool-calling APIs that turn any external data source into something an agent can query. By 2026 the boundary between "a RAG system with agents" and "an agent that happens to use retrieval as one of its tools" had largely dissolved in practice, even though the term agentic RAG continued to be used as a useful label for the class of systems where retrieval was a primary capability.