Retrieval-augmented generation (RAG) is a technique in natural language processing that combines an information retrieval system with a generative model to produce text grounded in externally sourced knowledge. Rather than relying solely on the parameters of a large language model (LLM), a RAG system first retrieves relevant documents from a knowledge base and then feeds those documents to the model as additional context for generating a response. This approach addresses several well-known limitations of standalone LLMs, including hallucination, outdated training data, and the inability to cite sources.
The term was introduced in a 2020 paper by Patrick Lewis and colleagues at Facebook AI Research (now Meta AI), and the technique has since become one of the most widely adopted methods for building knowledge-intensive AI applications. As of 2026, over 60% of enterprises integrating generative AI use some form of retrieval-augmented architecture [1].
The foundational paper, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," was published in May 2020 on arXiv and presented at NeurIPS 2020 in December of that year [2]. The authors were Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela, all affiliated with Facebook AI Research and University College London.
The paper proposed combining two components: a pre-trained sequence-to-sequence model (BART) as the parametric memory, and a dense vector index of Wikipedia as the non-parametric memory accessed through a neural retriever based on Dense Passage Retrieval (DPR) [2]. The authors explored two formulations: RAG-Sequence, which conditions on the same retrieved passages across the entire generated sequence, and RAG-Token, which can use different passages for each generated token.
The results were striking. RAG models set new state-of-the-art results on three open-domain question answering benchmarks: Natural Questions, TriviaQA, and WebQuestions. The models also generated more specific, diverse, and factual language than parametric-only baselines on tasks like Jeopardy question generation and abstractive summarization [2].
After the original paper, the concept of retrieval-augmented generation evolved rapidly. In 2021 and 2022, researchers experimented with applying the idea beyond question answering, using it for dialogue systems, code generation, and multi-hop reasoning. The release of ChatGPT in late 2022 and the subsequent explosion of interest in LLMs accelerated adoption of RAG as a practical way to ground model outputs in real data.
By 2023, what is now called "naive RAG" became the standard approach: chunk documents, embed them, store the embeddings in a vector database, retrieve relevant chunks at query time, and pass them to the LLM. During 2024, the field moved toward "advanced RAG" with techniques like query rewriting, hybrid search, reranking, and iterative retrieval. By 2025 and into 2026, the paradigm shifted further toward modular and agentic RAG systems, where AI agents orchestrate multiple retrieval and generation steps in feedback loops [3].
A RAG system operates in three main phases: retrieval, augmentation, and generation.
When a user submits a query, the system converts it into a vector representation using an embedding model. This query vector is then compared against a pre-built index of document vectors stored in a vector database. The system retrieves the top-K documents (or document chunks) that are most semantically similar to the query.
The retrieval step may use dense retrieval (vector similarity search), sparse retrieval (keyword matching algorithms like BM25), or a hybrid approach that combines both. Modern production systems frequently employ hybrid search because dense retrieval captures semantic meaning while sparse retrieval excels at matching exact terms, identifiers, and domain-specific jargon [4].
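As a minimal sketch of dense retrieval, the following uses a bag-of-words `embed()` as a stand-in for a real neural embedding model; the function name, corpus, and scoring are illustrative assumptions, not a production recipe:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a neural embedding model: a sparse bag-of-words vector.
    # A real system would call an embedding model or API here instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    # Rank every document by similarity to the query and keep the top-K.
    qv = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(qv, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "RAG combines retrieval with generation",
    "BM25 is a sparse keyword retriever",
    "Vector databases store dense embeddings",
]
print(retrieve("dense vector retrieval", docs, k=1))
```

In a real pipeline the corpus embeddings would be precomputed and indexed in a vector database rather than re-embedded per query.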
The retrieved documents are inserted into the prompt alongside the user's original query. This is typically done by placing the retrieved context before or around the query in a structured prompt template. The augmentation step may also include metadata such as source titles, dates, or relevance scores to help the model assess the information.
In advanced RAG systems, this step can include reranking the retrieved documents using a cross-encoder model, filtering out irrelevant results, or compressing the context to fit within the model's context window.
The augmented prompt is sent to the LLM, which generates a response conditioned on both the query and the retrieved context. Because the model has access to specific, relevant documents, it can produce answers that are more accurate and verifiable than those generated from parametric knowledge alone. The system can also be configured to include inline citations pointing back to the source documents.
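The augmentation step described above can be sketched as simple prompt assembly; the template wording and the chunk fields (`text`, `source`, `date`) are illustrative assumptions, not a standard schema:

```python
def build_prompt(query: str, chunks: list[dict]) -> str:
    # Place retrieved context before the query in a structured template,
    # numbering each chunk so the model can emit inline citations like [1].
    context_blocks = []
    for i, c in enumerate(chunks, start=1):
        context_blocks.append(
            f"[{i}] (source: {c['source']}, date: {c['date']})\n{c['text']}"
        )
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below. "
        "Cite sources by their [number].\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

chunks = [
    {"text": "RAG was introduced in 2020.", "source": "wiki.md", "date": "2026-01-01"},
]
prompt = build_prompt("When was RAG introduced?", chunks)
print(prompt)
```

The resulting string would then be sent to the LLM as the final generation input.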
A production RAG system involves several interconnected components. The following sections describe each in detail.
Before any retrieval can happen, source documents must be processed and indexed. This involves parsing raw documents (PDFs, web pages, databases, internal wikis), cleaning the text, and splitting it into smaller segments called "chunks."
Chunking strategy has a significant impact on retrieval quality. Research from 2025 found that chunking quality constrains retrieval accuracy more than the choice of embedding model [5]. The main approaches include:
| Chunking strategy | Description | Pros | Cons |
|---|---|---|---|
| Fixed-size | Split text into chunks of a set token or character length (e.g., 512 tokens) with optional overlap | Simple to implement; predictable chunk sizes | Often splits sentences mid-thought; separates tables from labels |
| Sentence-based | Split at sentence boundaries | Preserves grammatical units | Sentences vary in informativeness; some are too short to be useful |
| Semantic | Use an embedding model to detect topic shifts and split accordingly | Preserves topical coherence; up to 9% recall improvement over fixed-size | More computationally expensive; requires tuning |
| Recursive / hierarchical | Split by document structure (headings, sections, paragraphs), falling back to smaller units | Respects document structure | Requires structured documents; complex to implement |
| Agentic / adaptive | An AI agent dynamically determines chunk boundaries based on content | Highest accuracy (87% vs 13% for fixed-size in one clinical study) | Slowest; most expensive; newest approach |
A common best practice is to use overlapping chunks (e.g., 20% overlap between adjacent chunks) to ensure that information at chunk boundaries is not lost.
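Fixed-size chunking with 20% overlap can be sketched as follows (operating on a pre-tokenized list for simplicity; real pipelines would use the embedding model's tokenizer):

```python
def chunk_fixed(tokens: list[str], size: int = 512,
                overlap_frac: float = 0.2) -> list[list[str]]:
    # Slide a fixed-size window over the token list; consecutive chunks
    # share overlap_frac of their tokens so boundary content is not lost.
    step = max(1, int(size * (1 - overlap_frac)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the final window already reaches the end of the text
    return chunks

tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_fixed(tokens, size=100, overlap_frac=0.2)
```

With a chunk size of 100 and 20% overlap, each chunk starts 80 tokens after the previous one, so adjacent chunks share their last and first 20 tokens.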
Embedding models convert text chunks (and user queries) into dense vector representations that capture semantic meaning. The quality of embeddings directly affects retrieval accuracy.
As of early 2026, commonly used embedding models include OpenAI's text-embedding-3-large, Cohere's embed-v4, Google's Gecko, and Voyage AI's voyage-3-large. Benchmarks from 2025 showed Voyage-3-large outperforming OpenAI and Cohere embeddings by 9 to 20% on retrieval tasks, with Voyage AI supporting 32K-token context windows compared to 8K for OpenAI [6]. Open-source alternatives like BGE, E5, and GTE from the MTEB (Massive Text Embedding Benchmark) leaderboard also perform competitively.
The embedding dimension (typically 768 to 3072 for modern models) and the maximum input length of the model are practical considerations that affect both performance and storage costs.
Vector databases are specialized storage systems optimized for indexing and querying high-dimensional vectors. They are a core infrastructure component of RAG systems.
The most widely used vector databases as of 2026 include:
| Database | Type | Key strengths | Scale |
|---|---|---|---|
| Pinecone | Managed cloud | Ease of use; sub-50ms p99 latency; serverless option | Billions of vectors |
| Weaviate | Open-source / cloud | Hybrid search (vector + keyword); strong ecosystem | Billions of vectors |
| Milvus / Zilliz Cloud | Open-source / managed | Cost-efficient at scale; lowest latency in benchmarks | Billions of vectors |
| Qdrant | Open-source / cloud | Rust-based performance; sub-50ms p99 latency; filtering | Billions of vectors |
| Chroma | Open-source | Developer-friendly; lightweight; great for prototyping | Millions of vectors |
| pgvector | PostgreSQL extension | Uses existing Postgres infrastructure; no new system needed | Tens of millions of vectors |
| FAISS | Library (Meta) | Highly optimized for research; GPU support | Billions of vectors (in-memory) |
These databases typically use the HNSW (Hierarchical Navigable Small World) algorithm for approximate nearest neighbor search. HNSW builds a multi-layered graph structure that enables logarithmic search complexity, making it practical to query billions of vectors with latencies in the 10 to 100 millisecond range [7].
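The navigable-small-world idea behind HNSW can be illustrated with a single-layer, brute-force-built simplification; real HNSW constructs a hierarchy of layers incrementally and uses priority queues rather than this plain greedy walk:

```python
import math
import random

def dist(a, b):
    return math.dist(a, b)

def build_graph(points, m=5):
    # Connect each point to its m nearest neighbors (brute force here;
    # real HNSW inserts points incrementally to avoid this O(n^2) step).
    graph = {}
    for i, p in enumerate(points):
        nbrs = sorted((j for j in range(len(points)) if j != i),
                      key=lambda j: dist(p, points[j]))[:m]
        graph[i] = nbrs
    return graph

def greedy_search(points, graph, query, entry=0):
    # Walk the graph, hopping to any neighbor strictly closer to the query;
    # stop at a local minimum, which approximates the nearest neighbor.
    current = entry
    while True:
        improved = False
        for j in graph[current]:
            if dist(points[j], query) < dist(points[current], query):
                current, improved = j, True
                break
        if not improved:
            return current

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(50)]
g = build_graph(pts, m=6)
q = (0.5, 0.5)
hit = greedy_search(pts, g, q, entry=0)
```

Each greedy hop narrows the distance to the query, which is what gives graph-based indexes their roughly logarithmic search behavior when a layer hierarchy is added on top.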
The retrieval component of a RAG system can employ several different algorithms, often in combination.
Sparse retrieval (BM25): BM25 (Best Match 25) is a classical information retrieval algorithm based on term frequency and inverse document frequency. It excels at exact keyword matching and delivers millisecond response times at millions of documents without requiring GPU infrastructure [4].
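A compact implementation of BM25 scoring follows, using the common defaults k1 = 1.5 and b = 0.75 (this is one of several BM25 variants; the toy documents are illustrative):

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    # BM25: term-frequency saturation via k1, length normalization via b,
    # weighted by inverse document frequency.
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "the quick brown fox".split(),
    "BM25 ranks documents by keyword overlap".split(),
    "dense retrieval uses embeddings".split(),
]
scores = bm25_scores("BM25 keyword".split(), docs)
```

Because scoring is just term-frequency arithmetic over an inverted index, BM25 needs no GPU and stays fast at large scale.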
Dense retrieval: Dense retrieval uses neural embedding models to represent both queries and documents as vectors in a continuous space. Similarity is computed using cosine similarity or dot product. This approach captures semantic relationships that keyword matching misses, but it can struggle with exact identifiers, acronyms, and rare terms.
Hybrid search: Hybrid search runs sparse and dense retrievers in parallel on the same query, then merges results using a fusion algorithm. The most common fusion method is Reciprocal Rank Fusion (RRF), which scores each document by summing the reciprocal of its rank, offset by a smoothing constant (commonly 60), across the individual result lists. Hybrid search consistently outperforms either method used alone [4].
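Reciprocal Rank Fusion itself is only a few lines; the document IDs below are illustrative:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranked list contributes 1 / (k + rank) per document; k (commonly
    # 60) damps the influence of any single top-ranked result.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d2", "d1", "d3"]    # ranking from the dense retriever
sparse = ["d3", "d2", "d4"]   # ranking from the sparse retriever
fused = rrf([dense, sparse])
```

Here "d2" wins because it ranks highly in both lists, even though neither retriever alone is trusted with the final ordering; RRF needs only ranks, not comparable scores, which is why it fuses heterogeneous retrievers well.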
Reranking: After initial retrieval, a reranker (typically a cross-encoder model) scores each query-document pair jointly. Cross-encoders are more accurate than embedding similarity because they process the query and document together, but they are too slow for first-stage retrieval over large collections. Cross-encoders fine-tuned on MS MARCO consistently improve retrieval metrics such as NDCG and MRR [8].
As the field has matured, several distinct RAG architectures have emerged, each addressing different limitations.
Naive RAG is the simplest implementation: documents are chunked, embedded, and stored in a vector database. At query time, the user's question is embedded, the top-K similar chunks are retrieved, and they are passed directly to the LLM for answer generation. There is no query transformation, no reranking, and no verification of the generated answer.
This approach was the standard in 2023, but its limitations quickly became apparent in production settings. Problems include irrelevant retrieval results, lost context from poor chunking, and no mechanism to handle ambiguous or multi-faceted queries. By 2025, naive RAG was widely considered inadequate for production-grade applications [3].
Advanced RAG introduces optimization techniques at each stage of the pipeline, including query rewriting and expansion before retrieval, hybrid search and reranking during retrieval, and context filtering and compression before generation.
Advanced RAG became the dominant production approach during 2024 and remains widely used [9].
Modular RAG treats the system as a collection of interchangeable components (retrievers, generators, evaluators, routers) that can be composed and configured for different tasks. Rather than a fixed linear pipeline, modular RAG allows components to be swapped, added, or removed based on the specific use case.
For example, a modular system might route simple factual queries through a lightweight retrieval path while sending complex analytical queries through a multi-step retrieval and reasoning pipeline. This flexibility is particularly valuable in enterprise settings where different departments have different requirements [3].
GraphRAG, introduced by Microsoft Research in 2024, combines vector search with knowledge graphs to capture relationships between entities. Instead of (or in addition to) retrieving flat text chunks, GraphRAG constructs a graph of entities and their relationships from the source documents. During retrieval, the system traverses the graph to find connections that vector similarity search alone would miss.
This approach is particularly effective for queries that require understanding relationships, hierarchies, or multi-hop reasoning. For example, answering "Which subsidiaries of Company X operate in the healthcare sector?" requires traversing organizational relationships that flat chunk retrieval handles poorly. Early benchmarks suggest GraphRAG implementations can achieve search precision up to 99% for complex, multi-layered corporate queries [10].
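The subsidiary example can be sketched as a two-hop traversal over a toy entity graph; the entity and relation names here are invented for illustration, not drawn from any real GraphRAG implementation:

```python
# Toy knowledge graph: each entity maps to a list of (relation, target) edges.
graph = {
    "Company X": [("subsidiary", "Sub A"), ("subsidiary", "Sub B")],
    "Sub A": [("sector", "healthcare")],
    "Sub B": [("sector", "logistics")],
}

def subsidiaries_in_sector(graph: dict, company: str, sector: str) -> list[str]:
    # Two-hop traversal: company -> subsidiaries -> sector. This kind of
    # relational query is exactly what flat chunk retrieval handles poorly.
    results = []
    for rel, target in graph.get(company, []):
        if rel != "subsidiary":
            continue
        if ("sector", sector) in graph.get(target, []):
            results.append(target)
    return results

print(subsidiaries_in_sector(graph, "Company X", "healthcare"))
```

A vector search over chunks would need both facts to co-occur in one retrieved passage; the graph makes the relationship explicit and traversable.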
Agentic RAG represents the newest evolution, combining RAG with autonomous agents. Instead of following a fixed retrieval pipeline, an AI agent plans its approach, decides what information it needs, executes multiple retrieval steps (potentially across different data sources and tools), evaluates intermediate results, and iterates until it has sufficient information to generate a high-quality answer.
In an agentic RAG system, the agent might:

- decompose a complex question into targeted sub-queries
- select among multiple data sources and tools for each retrieval step
- evaluate intermediate results for relevance and completeness
- issue follow-up retrievals until it has enough information to answer
This approach mirrors how a skilled human researcher works, and it can handle complex, open-ended questions that fixed pipelines cannot. Agentic RAG became the cutting-edge approach in 2025 and is increasingly adopted in enterprise settings [3].
Self-RAG (Self-Reflective Retrieval-Augmented Generation), proposed by Asai et al. in 2023, adds a self-reflection mechanism to the generation process. The model dynamically decides whether retrieval is needed for a given query, evaluates the relevance of retrieved passages, and critiques its own generated output. Special reflection tokens are trained into the model to enable these decisions. This allows the system to avoid unnecessary retrieval for simple queries while ensuring thorough retrieval for knowledge-intensive ones [11].
Corrective RAG introduces a verification step that evaluates retrieved documents before they are used for generation. If the retriever returns low-confidence results, the system can trigger a web search or alternative retrieval strategy rather than generating an answer from potentially irrelevant context. This acts as a safety net against poor retrieval quality [11].
RAG and fine-tuning are two distinct approaches to customizing LLM behavior, and they serve different purposes.
| Dimension | RAG | Fine-tuning |
|---|---|---|
| What it changes | The information available to the model at inference time | The model's internal weights and behavior |
| Best for | Injecting up-to-date or proprietary knowledge; factual Q&A over specific documents | Changing output style, format, or tone; domain-specific reasoning patterns |
| Data freshness | Can access real-time or frequently updated data | Knowledge is frozen at training time |
| Cost | Lower upfront cost; requires vector database infrastructure | Higher upfront cost for training; lower per-query inference cost |
| Latency | Additional retrieval step adds latency (typically 100-500ms) | No retrieval overhead; sub-second responses |
| Hallucination risk | Reduced (grounded in retrieved documents) | Can still hallucinate if training data is insufficient |
| Transparency | Can cite specific source documents | Difficult to trace where information came from |
| Maintenance | Update knowledge by updating the document index | Requires retraining to update knowledge |
The two approaches are not mutually exclusive. A growing trend as of 2025 is RAFT (Retrieval-Augmented Fine-Tuning), which combines both techniques. An organization might fine-tune a model to become an expert in medical terminology and diagnostic reasoning, then deploy it in a RAG architecture that provides access to the latest medical research papers and patient records [12]. This hybrid approach leverages fine-tuning for domain expertise and RAG for access to current, specific information.
One of the primary motivations for RAG is reducing the tendency of LLMs to generate plausible but incorrect information. By grounding generation in retrieved documents, RAG constrains the model's output to information that actually exists in the knowledge base. Research in healthcare contexts found that chatbots using RAG with reliable reference sources showed hallucination rates of 0% for GPT-4 and 6% for GPT-3.5, compared to approximately 40% for conventional chatbots without RAG [13]. The MEGA-RAG framework achieved a reduction in hallucination rates by over 40% compared to baseline models [14].
LLMs have a knowledge cutoff determined by their training data. RAG overcomes this limitation by connecting the model to continuously updated document stores. When new information is added to the knowledge base, it becomes immediately available for retrieval without any model retraining. This is particularly valuable in domains like news, finance, legal, and medicine where information changes frequently.
RAG systems can provide citations linking each part of a generated answer to its source documents. This transparency allows users to verify claims, assess the reliability of sources, and build trust in the system's outputs. Source attribution is a requirement in many enterprise and regulated environments where accountability matters.
RAG enables a general-purpose LLM to become an expert in any domain simply by connecting it to the relevant knowledge base. A single model can serve as a legal research assistant, a medical information system, or a technical support agent depending on which document collection it retrieves from. This avoids the need to train or fine-tune separate models for each domain.
Compared to fine-tuning or training custom models, RAG is generally more cost-effective. Organizations can use off-the-shelf LLMs and invest in document indexing infrastructure instead. Updating knowledge requires only re-indexing documents rather than expensive retraining runs.
Poor chunking is one of the most common causes of RAG failure. Fixed-size chunking can split tables from their headers, separate code from its comments, or break a coherent argument across two chunks. The traditional approach of using a single granularity for both embedding and retrieval creates a structural conflict: small chunks produce precise embeddings but may lack context, while large chunks preserve context but dilute the embedding signal [5]. Finding the right chunking strategy for a given corpus often requires significant experimentation.
Even with good embeddings, retrieval can fail in several ways. Semantic similarity in embedding space does not always correspond to actual relevance, particularly for domain-specific terminology. The retriever may return passages that are topically related but do not answer the specific question. Missing or unprocessed documents in the index create blind spots. Multi-hop questions that require synthesizing information from several documents are particularly challenging for single-step retrieval.
Although context windows have grown dramatically (from 4K tokens in early GPT-3.5 to 1 million or more tokens in Gemini models), there are practical limits to how much retrieved context a model can effectively use. Research from Chroma in July 2025 tested 18 models including GPT-4.1, Claude 4, and Gemini 2.5 and found that retrieval performance degrades as context length increases [15]. Counterintuitively, shorter and more precise context often produces better answers than inserting 50,000 tokens of retrieved text. Most production systems target assembled context under 8,000 tokens per query.
The retrieval step adds latency to every request. A typical RAG pipeline involves embedding the query (10-50ms), searching the vector database (10-100ms), optionally reranking results (50-200ms), and then generating the response. Production RAG applications generally target under 2 seconds for end-to-end response time [5]. Reranking, while improving precision, is the most expensive step and may need to be omitted for latency-sensitive applications.
In enterprise settings, RAG systems must enforce access controls to ensure users only retrieve documents they are authorized to see. A customer support agent should not retrieve internal HR documents, and a junior employee should not access board-level financial data. Implementing document-level permissions in vector databases adds complexity to the architecture.
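A post-retrieval permission filter might look like the sketch below; the `allowed_groups` field is an assumed schema, and production systems typically push this filter into the vector database query itself rather than filtering afterward:

```python
def filter_by_access(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    # Keep only chunks whose access-control list intersects the user's groups.
    return [c for c in chunks if user_groups & set(c["allowed_groups"])]

retrieved = [
    {"text": "Q3 board deck", "allowed_groups": ["executives"]},
    {"text": "Support macro for refunds", "allowed_groups": ["support", "executives"]},
]
visible = filter_by_access(retrieved, {"support"})
```

Filtering after retrieval is simpler but wastes the top-K budget on documents the user cannot see, which is one reason document-level permissions complicate the architecture.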
Measuring the quality of a RAG system is more complex than evaluating a standalone model. Teams must assess retrieval quality (are the right documents being found?), generation quality (is the answer correct and well-formed?), and faithfulness (does the answer accurately reflect the retrieved documents rather than hallucinating?). Frameworks like RAGAS (Retrieval Augmented Generation Assessment) have emerged to address this challenge, but evaluation remains an active area of research.
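One basic retrieval-quality metric, recall@k, can be computed from a set of known-relevant documents; the hand-labeled toy example below is illustrative:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the known-relevant documents that appear in the top-k results.
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

retrieved = ["d4", "d1", "d9", "d2"]   # system output, best first
relevant = {"d1", "d2", "d3"}          # hand-labeled ground truth
r = recall_at_k(retrieved, relevant, k=4)
```

Frameworks like RAGAS compute richer, LLM-judged metrics (faithfulness, answer relevance) on top of rank-based measures like this one.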
RAG has found adoption across numerous industries and use cases.
Enterprise search is one of the most common RAG applications. Companies use RAG to build internal assistants that can answer employee questions by searching across internal wikis, documentation, Slack messages, email archives, and databases. Glean, a prominent enterprise search company, uses RAG as a core part of its product. Businesses using RAG-powered enterprise search report up to a 40% increase in content accuracy and a 50% reduction in research time [1].
RAG-powered chatbots and agent assistants can retrieve relevant knowledge articles, past cases, and technical documentation to resolve customer issues faster. DoorDash uses a RAG-based chatbot that condenses customer conversations, searches its knowledge base for relevant articles and resolved cases, and generates contextually appropriate responses [16]. LinkedIn's RAG implementation achieved a 28.6% reduction in median customer support resolution time [16].
Law firms use RAG to query vast legal databases, ensuring that AI-generated responses are grounded in current regulations and case law. Applications include clause comparison across contracts, obligations extraction, and audit trail generation with source links for human review. A Stanford study found that legal RAG can reduce hallucinations compared to general-purpose AI systems, though hallucinations remain a significant concern in legal contexts where accuracy is paramount [17].
Healthcare providers use RAG systems that draw from electronic health records, clinical guidelines, and medical literature to support clinical decision-making. RAG helps ensure that medical AI applications provide recommendations grounded in current evidence rather than potentially outdated training data. Healthcare sector adoption of multimodal RAG for medical policy compliance and regulatory reference systems has produced 25 to 30% improvements in compliance accuracy and processing efficiency [18].
Financial institutions use RAG for regulatory compliance, risk assessment, and investment research. RAG systems can retrieve and synthesize information from earnings calls, SEC filings, market reports, and internal analysis to support decision-making with up-to-date, cited information.
Developers use RAG-powered coding assistants that retrieve relevant code snippets, documentation, and API references from a codebase. This helps generate more accurate code suggestions that are consistent with existing patterns and conventions in a project.
Several open-source frameworks have emerged to simplify RAG development.
LangChain is the most widely adopted framework for building LLM-powered applications, including RAG systems. Originally released in October 2022 by Harrison Chase, it provides abstractions for document loading, text splitting, embedding, vector storage, retrieval, and chain composition. The LangChain ecosystem saw 220% growth in GitHub stars and a 300% increase in downloads between Q1 2024 and Q1 2025 [19]. LangGraph, an extension for building stateful, multi-agent workflows, reached version 1.0 in October 2025 and has become the recommended approach for complex RAG pipelines within the LangChain ecosystem.
LlamaIndex (originally GPT Index) focuses specifically on connecting LLMs with external data. It provides specialized abstractions for hierarchical chunking, auto-merging retrieval, sub-question decomposition, and built-in reranking. In 2025, LlamaIndex achieved a 35% boost in retrieval accuracy and added a Workflows system for complex multi-step agents [19]. Benchmarks show LlamaIndex achieves document retrieval speeds 40% faster than LangChain for comparable tasks.
Haystack, developed by deepset, is a modular framework with a strong focus on production readiness. It provides components for document retrieval, question answering, and text summarization, and supports a wide range of document stores including Elasticsearch and FAISS. Haystack is known for its testable pipeline architecture with clear component contracts. In production benchmarks, it achieved 99.9% uptime and showed the lowest framework overhead (~5.9ms) and token usage (~1.57K) among the major frameworks [19].
Additional tools round out the RAG ecosystem, including the vector databases described above, evaluation frameworks such as RAGAS, and workflow extensions such as LangGraph.
The rapid expansion of LLM context windows has raised questions about whether RAG will remain necessary. Gemini 1.5 Pro introduced a 1-million-token context window in early 2024, and subsequent models have pushed this further. One million tokens is roughly equivalent to 750,000 words, or about 3,000 pages of text. In principle, this allows users to feed entire document collections directly into the model without a retrieval step.
However, practical experience tells a more nuanced story. A Gartner Q4 2025 survey of 800 enterprise AI deployments found that 71% of companies that initially deployed "context-stuffing" approaches (loading entire document sets into the context window) added vector retrieval layers within 12 months [15]. Several factors drive this: retrieval performance degrades as context length grows, per-query latency and token costs scale with the amount of context sent to the model, and continuously updated data and document-level access controls are difficult to manage without an index.
The emerging consensus as of 2026 is that long-context windows complement rather than replace RAG. Long context windows are useful for RAG systems themselves, allowing them to hold more complete, semantically coherent retrieved chunks or to aggregate intermediate results for multi-step retrieval. This marks a shift toward what practitioners are calling "Context Engineering," which combines intelligent retrieval with the expanded context capacities of modern models [20].
For smaller datasets that fit within a context window (a single codebase, a specific document collection under a few thousand pages), direct context loading can be practical. For enterprise-scale applications with millions of documents, continuously updated data, or strict latency and cost requirements, RAG remains the standard approach.
The global RAG market was valued at approximately $1.2 billion in 2024 and is forecast to reach $11 billion by 2030, representing a compound annual growth rate of 49.1% [1]. Major cloud providers including AWS, Google Cloud, and Microsoft Azure all offer managed RAG services. The technology has moved from experimental to a core component of enterprise AI infrastructure.