Retrieval-Augmented Generation
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v7 · 5,726 words
Retrieval-Augmented Generation (RAG) is a technique in natural language processing that augments a generative model with an information-retrieval component, so that responses are conditioned on documents fetched from an external corpus rather than relying solely on the parametric knowledge stored in the model's weights. A RAG system embeds a user query, retrieves the most relevant chunks from a knowledge base (typically a vector database or hybrid index), assembles them into the model's prompt, and then asks a large language model (LLM) to generate an answer grounded in that context. The approach was introduced by Patrick Lewis and colleagues at Facebook AI Research, University College London, and New York University in May 2020, and it directly addresses three persistent weaknesses of standalone LLMs: a fixed knowledge cutoff, a tendency to hallucinate, and the inability to cite sources.[^1]
Since the original paper, RAG has evolved from a single neural architecture into an entire engineering discipline. Modern pipelines combine dense embeddings with BM25 keyword scoring, reranking by cross-encoders or late-interaction models such as ColBERT, query transformations such as HyDE, and orchestration frameworks such as LangChain and LlamaIndex. New variants - GraphRAG (Microsoft Research, 2024), Self-RAG (Asai et al., 2023), Corrective RAG (Yan et al., 2024), and RAFT (Berkeley, 2024) - extend the basic recipe with knowledge graphs, self-reflection, retrieval-quality verification, and supervised fine-tuning over retrieved contexts. Frontier vendors such as OpenAI, Anthropic, and Google now ship managed RAG endpoints (OpenAI File Search in the Responses API, the Anthropic Files API, and Google Vertex AI Search) that hide much of the indexing complexity from application developers.[^2][^3]
LLMs store knowledge implicitly in their network weights, a representation sometimes called parametric memory. This makes them fluent and fast at inference time, but knowledge written into weights cannot be inspected, edited, or attributed to a source, and updating it requires re-training or fine-tuning. RAG re-introduces an explicit non-parametric memory: an external corpus that the model can read at inference time. Lewis et al. framed RAG precisely as a way to combine the two, "where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever."[^1]
Every LLM is trained on a snapshot of text up to some date - GPT-3's cutoff was October 2019, Claude Opus 4.1's was March 2025, and so on. Information that appears after the cutoff is simply unknown to the model, and information that was rare or controversial at training time may be misremembered. RAG decouples the knowledge store from the model: a vector index can be re-built nightly or updated incrementally without touching the LLM's weights.
LLMs trained with maximum-likelihood objectives are known to produce hallucinations - confident, fluent text that is not supported by any real source. Ji et al. surveyed this phenomenon in 2023 and showed that hallucination is a structural property of generative language models, not a bug to be patched out.[^4] RAG mitigates (though does not eliminate) hallucination by constraining generation to text that is supported by retrieved passages. Empirical work in clinical and legal settings has consistently found that grounded models hallucinate less often than their ungrounded counterparts, though they can still misquote, mis-attribute, or fail to retrieve the correct document.[^5][^6]
RAG did not appear in a vacuum. Several 2019-2020 systems explored how to combine retrieval with generation or language modelling.
kNN-LM (Khandelwal et al., 2019). Researchers at Stanford and Facebook AI proposed augmenting a fixed language model with a nearest-neighbour lookup over a datastore of cached key/value pairs from the training corpus. At each generation step, the model interpolates its own next-token distribution with the empirical distribution of tokens that followed similar contexts in the datastore. The technique improved perplexity on Wikitext-103 without any extra training, showing that explicit memory can complement parametric memory.[^7]
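The interpolation step described above can be illustrated with a toy example. This is only a sketch: the neighbour weights stand in for kernel weights computed from datastore distances, and the mixing coefficient `lam=0.25` is an arbitrary illustrative value, not one taken from the paper.

```python
from collections import defaultdict

def knn_distribution(neighbors):
    """Normalize (next_token, weight) pairs from the datastore into a distribution.

    The weights are assumed to be precomputed (for example exp(-distance)).
    """
    totals = defaultdict(float)
    for token, weight in neighbors:
        totals[token] += weight
    z = sum(totals.values())
    return {token: weight / z for token, weight in totals.items()}

def interpolate(p_lm, p_knn, lam=0.25):
    """Mix the two next-token distributions: lam * p_knn + (1 - lam) * p_lm."""
    vocab = set(p_lm) | set(p_knn)
    return {w: lam * p_knn.get(w, 0.0) + (1 - lam) * p_lm.get(w, 0.0) for w in vocab}

# Toy next-token distribution from the language model.
p_lm = {"Paris": 0.5, "London": 0.3, "Rome": 0.2}
# Toy neighbours: tokens that followed similar contexts, with kernel weights.
neighbors = [("Paris", 3.0), ("Rome", 1.0)]

p_knn = knn_distribution(neighbors)
p_mix = interpolate(p_lm, p_knn, lam=0.25)
```

Tokens that appear only in the datastore still receive probability mass, which is how the lookup can help with rare continuations the base model underweights.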
REALM (Guu et al., 2020). Google Research's Retrieval-Augmented Language Model Pre-training (REALM) was, alongside Lewis et al., one of the first end-to-end trained systems with a learned neural retriever. REALM jointly trained a masked-language-model objective with a retriever over a Wikipedia index, demonstrating that the retriever could be back-propagated through during pre-training. REALM was the conceptual sibling of RAG, but it focused on extractive MLM rather than seq2seq generation.[^8]
Dense Passage Retrieval (Karpukhin et al., 2020). DPR replaced the BM25 baseline used in earlier open-domain QA pipelines with a dual-encoder neural retriever trained on question/passage pairs. DPR became the retriever component of the original RAG paper, and it remains the canonical reference for dense passage retrieval.[^9]
MIPS-style retrieval and FAISS. Approximate Maximum Inner Product Search and Facebook's FAISS library made it tractable to search billions of vectors in milliseconds, which was a precondition for any RAG-style system at scale.[^10]
The paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" was uploaded to arXiv on 22 May 2020 (arXiv:2005.11401) and published at NeurIPS 2020. The twelve authors - Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela - were affiliated with Facebook AI Research, University College London, and New York University.[^1]
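The system has three components: a neural retriever based on DPR, a dense vector index of Wikipedia passages that the retriever searches, and a pre-trained seq2seq generator (BART).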
The retriever and generator are trained end-to-end, with the retrieved document treated as a latent variable. Lewis et al. coined the term Retrieval-Augmented Generation and showed that the resulting models set new state-of-the-art results on three open-domain question answering benchmarks - Natural Questions, TriviaQA, and WebQuestions - and produced "more specific, diverse, and factual" language than the parametric BART baseline on Jeopardy question generation and abstractive question answering (MS-MARCO).[^1]
After the original paper, retrieval-augmented techniques rapidly diffused across NLP. Atlas (Izacard et al., Meta AI, 2022) demonstrated few-shot learning with a retrieval-augmented seq2seq model.[^11] DeepMind's RETRO (Borgeaud et al., 2022) showed that a 7.5B-parameter model retrieving from a 2-trillion-token database could match the performance of a 175B-parameter GPT-3 on language modelling benchmarks, despite using 25x fewer parameters.[^12] After the release of ChatGPT in late 2022, demand from application developers turned RAG into a default architecture for grounding LLMs in private data, and 2023-2024 saw rapid growth of orchestration frameworks (LangChain, LlamaIndex, Haystack) and specialized vector databases (Pinecone, Weaviate, Qdrant, Chroma, Milvus, Vespa, pgvector). Microsoft Research's GraphRAG paper appeared in April 2024, Self-RAG and Corrective RAG in late 2023 and early 2024, and the RAG vs long-context debate intensified after Google launched Gemini 1.5 with a one-million-token window in February 2024.[^13][^14][^15]
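Lewis et al. proposed two formulations of how the retrieved passages affect generation: RAG-Sequence, which conditions the whole output sequence on a single retrieved passage and marginalizes over passages at the sequence level, and RAG-Token, which can draw on a different passage for each output token and marginalizes over passages at every token.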
Both variants share the same retriever and generator; they differ only in the marginalization step. In modern engineering practice, almost all production systems use a degenerate RAG-Sequence variant: pick the top-K passages once, concatenate them into the prompt, and let the LLM decide how to use them. The original RAG-Token marginalization is rarely implemented exactly, because it requires running the decoder once per retrieved passage (K times) for every generated token.
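For reference, the two marginalizations can be written as follows. The notation is an assumption based on the usual presentation of the paper: $p_\eta(z \mid x)$ is the retriever's distribution over passages $z$ given the query $x$, $p_\theta$ is the generator, and $y_i$ is the $i$-th output token of a sequence of length $N$.

```latex
\begin{align*}
p_{\text{RAG-Sequence}}(y \mid x) &\approx
  \sum_{z \in \text{top-}K(p_\eta(\cdot \mid x))}
  p_\eta(z \mid x) \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1}) \\
p_{\text{RAG-Token}}(y \mid x) &\approx
  \prod_{i=1}^{N} \sum_{z \in \text{top-}K(p_\eta(\cdot \mid x))}
  p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})
\end{align*}
```

The only difference is the order of the sum and the product: RAG-Sequence sums over passages outside the per-token product, while RAG-Token sums inside it.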
A production RAG pipeline today is typically broken into five phases. Each is a research topic of its own.
Raw documents - PDFs, HTML pages, Slack messages, code repositories, audio transcripts - are normalized to text and split into chunks. Common chunking strategies are described in detail below; the canonical default is recursive character splitting at 256-1024 token granularity with 10-20% overlap.
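A minimal sketch of overlapping chunking is shown here. It uses a plain fixed-size sliding window rather than recursive splitting, and it treats whitespace-separated words as "tokens", which is a simplifying assumption; the function name and default sizes are illustrative.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Toy document of ten "tokens", chunked with a small window for readability.
words = [f"w{i}" for i in range(10)]
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
```

With the defaults, the overlap is 12.5% of the chunk size, inside the 10-20% range mentioned above.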
Each chunk is passed through an embedding model (typically 256-3072 dimensions) and stored in a vector database together with the original text and metadata. Most production indexes use HNSW (Hierarchical Navigable Small World) or IVF-PQ (Inverted File with Product Quantization) for approximate nearest-neighbor search, both of which trade a small recall loss for sub-100ms latency over hundreds of millions of vectors.[^16] Many systems index BM25 tokens alongside the embeddings to enable hybrid search.
At query time, the user query is embedded (and optionally tokenized for BM25), the top-K nearest chunks are retrieved, and an optional reranker is applied to re-order them. Typical K is 20-100 before reranking and 3-10 after.
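The retrieval step can be sketched with an exact, brute-force cosine search. This is an illustration only: production indexes use approximate structures such as HNSW or IVF-PQ instead of scanning every vector, and the three-dimensional vectors and chunk ids below are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, index, k=3):
    """Return the k (chunk_id, score) pairs most similar to the query vector."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy index mapping chunk ids to embeddings.
index = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.7, 0.7, 0.0],
    "doc-c": [0.0, 0.0, 1.0],
}
hits = top_k([1.0, 0.1, 0.0], index, k=2)
```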
The chosen passages are inserted into the model's prompt, usually in a template such as:
You are an assistant. Use only the context below.
<context>
{passage_1}
{passage_2}
...
</context>
Question: {user_query}
Augmentation can also include metadata (source URLs, dates, scores), citations, or system instructions to refuse when context is irrelevant.
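A minimal sketch of the augmentation step, following the template above. The passage schema (dicts with `text` and an optional `source` key) and the numbered `[i] (source)` prefix are assumptions made for this example, not a standard format.

```python
def build_prompt(user_query, passages):
    """Assemble a grounded prompt from retrieved passages.

    passages: list of dicts with a 'text' key and an optional 'source' key
    (a hypothetical schema chosen for this sketch).
    """
    lines = ["You are an assistant. Use only the context below.", "<context>"]
    for i, passage in enumerate(passages, start=1):
        source = passage.get("source", "unknown")
        # Numbering each passage lets the model refer back to it when citing.
        lines.append(f"[{i}] ({source}) {passage['text']}")
    lines.append("</context>")
    lines.append(f"Question: {user_query}")
    return "\n".join(lines)

prompt = build_prompt(
    "When was the RAG paper uploaded to arXiv?",
    [{"text": "The paper was uploaded to arXiv on 22 May 2020.",
      "source": "arXiv:2005.11401"}],
)
```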
The augmented prompt is sent to an LLM, which produces an answer. Many production systems include a final citation step that maps spans of the answer back to specific retrieved chunks, and a guardrail step that checks whether the answer is supported by the retrieved evidence (see Evaluation, below).
Embedding quality is one of the strongest determinants of retrieval recall. The space of embedding models has grown rapidly since 2022, and the Massive Text Embedding Benchmark (MTEB) leaderboard, introduced by Muennighoff et al. at HuggingFace in late 2022, has become the de-facto evaluation harness.[^17] As of early 2026, commonly used embeddings include:
- BGE (BAAI): bge-large-en-v1.5 and bge-m3 (multilingual, multi-functional, multi-granularity), which were near the top of MTEB through 2023-2024.[^22]
- GTE (Alibaba): gte-large and gte-Qwen2-7B-instruct.[^23]
- E5 (Microsoft): e5-large-v2, multilingual-e5-large.[^24]
- Nomic Embed (nomic-embed-text-v1.5), the first fully reproducible open embedding model with training data and code released.[^26]

There is no single best embedding model: choice depends on language coverage, max context, dimensionality, licensing, and how well the model was pre-trained on the target domain.
The retrieval index is usually backed by a specialized vector database. The market has consolidated around a handful of options, each with different trade-offs:
| System | Type | Notes |
|---|---|---|
| Pinecone | Managed SaaS | The earliest commercial vector DB (2019); serverless tier; very low operational burden.[^27] |
| Weaviate | Open source + managed | Native hybrid search, GraphQL API, modular vectorizers; strong ecosystem.[^28] |
| Qdrant | Open source + managed | Rust implementation; payload filtering; sparse + dense in one index.[^29] |
| Milvus / Zilliz Cloud | Open source + managed | LF AI graduate project; scales to tens of billions of vectors.[^30] |
| Chroma | Open source | Developer-first, in-process; popular for prototypes and LangChain tutorials.[^31] |
| pgvector | PostgreSQL extension | Adds vector type, IVFFlat / HNSW indexes to Postgres; lets organizations reuse their existing database.[^32] |
| Vespa | Open source | Originally Yahoo's search engine; combines vector, lexical, tensor, and ranking in one stack.[^33] |
| Vald | Open source | Kubernetes-native, NGT-based; designed for billion-scale.[^34] |
| Elasticsearch / OpenSearch | Search engine + vectors | Added dense_vector and kNN support; convenient when teams already use Elastic.[^35] |
| Redis (Redis Stack) | In-memory | Vector similarity via FT.SEARCH KNN; very low latency for hot indices. |
| LanceDB | Open source | Columnar (Apache Arrow) format on object storage; popular for multimodal use cases. |
| Vectara, MongoDB Atlas Vector Search, Couchbase Capella, SingleStore | Managed | RAG features bolted onto established databases. |
Almost all of these systems implement HNSW or a closely related graph-based index. FAISS (Facebook AI) remains the most widely used library for ANN search and is embedded inside many of the products above.[^10]
A consistent empirical finding is that dense + sparse hybrid retrieval beats either alone. Dense retrievers excel at semantic similarity but struggle with rare named entities, identifiers, and acronyms; lexical models such as BM25 handle exact matches but miss paraphrases.[^36]
The standard technique for combining their outputs is Reciprocal Rank Fusion (RRF), introduced by Cormack et al. in a 2009 SIGIR paper. RRF assigns each document d the score RRF(d) = sum_i 1 / (k + rank_i(d)), where rank_i(d) is the position of d in the ranking returned by retriever i and the constant k is typically 60. Because it uses only ranks, RRF needs no score normalization across retrievers, and it consistently performs well on TREC-style benchmarks.[^37]
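The RRF formula is short enough to sketch directly. The document ids and the two input rankings below are toy data; ties are broken alphabetically by id, which is a choice made for this example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties fall back to the document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

dense = ["d1", "d2", "d3"]   # ranking from a dense retriever
bm25 = ["d3", "d1", "d4"]    # ranking from a lexical retriever
fused = reciprocal_rank_fusion([dense, bm25])
# fused == ["d1", "d3", "d2", "d4"]
```

Documents that both retrievers rank highly (d1, d3) rise above documents found by only one of them.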
Modern variants include:
Bi-encoder retrieval is fast but loses information by encoding query and document independently. A reranker runs a more expensive model over the top-K candidates to re-order them.
A standard pipeline retrieves top-50 to top-200 candidates with a fast dense + BM25 stage and reranks to top-3 or top-10 with a cross-encoder or ColBERT-style model.
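The two-stage shape can be sketched with stand-in scorers. Both scoring functions here are toy placeholders: `overlap_score` stands in for the fast dense + BM25 stage and `toy_reranker` for a cross-encoder, and neither resembles a real model.

```python
def retrieve_then_rerank(query, corpus, first_stage, reranker,
                         k_candidates=50, k_final=3):
    """Select candidates with a cheap scorer, then reorder them with a costly one."""
    candidates = sorted(corpus, key=lambda doc: first_stage(query, doc),
                        reverse=True)[:k_candidates]
    # The expensive scorer only sees k_candidates documents, not the whole corpus.
    reranked = sorted(candidates, key=lambda doc: reranker(query, doc), reverse=True)
    return reranked[:k_final]

def overlap_score(query, doc):
    """Toy first stage: number of lowercase words shared by query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def toy_reranker(query, doc):
    """Toy second stage: word overlap normalized by document length."""
    return overlap_score(query, doc) / (1 + len(doc.split()))

corpus = [
    "vector databases store embeddings",
    "rerankers reorder retrieved candidates with a cross-encoder for higher precision",
    "rerankers reorder candidates",
    "bananas are yellow",
]
top = retrieve_then_rerank("how do rerankers reorder candidates", corpus,
                           overlap_score, toy_reranker, k_candidates=3, k_final=1)
```

The design point is the cost asymmetry: the first stage must be cheap enough to run over the full index, while the reranker's cost scales only with the candidate count.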
How documents are split into chunks dominates retrieval quality. Common strategies include:[^42]
- Recursive character splitting, which tries a sequence of separators (\n\n, \n, . ), falling back to smaller units. This is the default in LangChain and LlamaIndex.

Empirically, chunking strategy interacts strongly with embedding model and corpus type: there is no universal best practice, only sensible defaults.
Many failures of naive RAG come from the gap between the user's question and the way the answer is written in the corpus. Several query-transformation techniques close this gap:
GraphRAG is an open-source project from Microsoft Research that augments RAG with a knowledge graph built directly from the source corpus by an LLM. The pipeline extracts entities and relationships from each text chunk, clusters them with the Leiden algorithm into hierarchical communities, summarizes each community, and stores the resulting graph alongside vector indexes. At query time, global queries are answered by aggregating community summaries, while local queries start from entity nodes most similar to the question and traverse the graph. Microsoft's paper ("From Local to Global: A Graph RAG Approach to Query-Focused Summarization", Edge et al., arXiv:2404.16130) reported substantial gains on holistic queries that require synthesizing across an entire corpus.[^13]
Self-Reflective Retrieval-Augmented Generation (arXiv:2310.11511) from Asai, Wu, Wang, Sil, and Hajishirzi (University of Washington / Allen AI) trains an LLM to emit special reflection tokens that decide (a) whether retrieval is needed, (b) whether retrieved passages are relevant, and (c) whether the generated answer is supported and useful. The model is supervised to produce these tokens during fine-tuning. Self-RAG outperforms ChatGPT and a Llama2-chat baseline on open-domain QA, fact verification, and long-form generation benchmarks.[^48]
CRAG (arXiv:2401.15884) trains a lightweight retrieval evaluator that scores retrieved documents and triggers one of three actions: Correct, Incorrect, or Ambiguous. When retrieval is judged Correct, CRAG applies a decompose-then-recompose strategy to filter noise out of the retrieved documents; when it is judged Incorrect, CRAG falls back to a web search; when the evaluator is not confident (Ambiguous), it combines both. The authors report consistent gains over RAG baselines on PopQA, Biography, PubHealth, and ARC-Challenge.[^49]
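The routing logic can be sketched as below. This is a loose illustration, not the authors' implementation: the thresholds `upper` and `lower`, the use of the best document score, and the `refine` and `web_search` callables are all assumptions made for the example.

```python
def crag_route(best_score, upper=0.6, lower=0.2):
    """Map the best evaluator score to a CRAG-style label (thresholds are illustrative)."""
    if best_score >= upper:
        return "correct"
    if best_score <= lower:
        return "incorrect"
    return "ambiguous"

def select_knowledge(docs_with_scores, web_search, refine):
    """Choose the knowledge passed to the generator based on the evaluator's label."""
    best = max((score for _, score in docs_with_scores), default=0.0)
    label = crag_route(best)
    retrieved = [doc for doc, _ in docs_with_scores]
    if label == "correct":
        return label, refine(retrieved)
    if label == "incorrect":
        return label, web_search()
    # Ambiguous: hedge by using both refined retrieval and web results.
    return label, refine(retrieved) + web_search()

def refine(docs):
    """Placeholder for decompose-then-recompose: here it only drops empty strings."""
    return [doc for doc in docs if doc.strip()]

def web_search():
    """Placeholder for an external search call."""
    return ["web result"]

label, knowledge = select_knowledge([("doc A", 0.9), ("doc B", 0.1)], web_search, refine)
```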
Retrieval-Augmented Fine-Tuning (Zhang et al., arXiv:2403.10131) fine-tunes an LLM on (question, retrieved-context, answer) triples in which some of the retrieved documents are deliberately irrelevant "distractors." The model learns both how to ignore irrelevant context and how to cite the right passages by quoting them in chain-of-thought style. RAFT improves accuracy on domain-specific RAG benchmarks (PubMed, HotpotQA, HuggingFace docs) and is now a common recipe for adapting open-weights models to a specific corpus.[^50]
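A training example of this kind could be assembled roughly as follows. The field names, the number of distractors, and the shuffling are assumptions for illustration; the sketch covers only the mixing of a relevant document with distractors, not the chain-of-thought answer format.

```python
import random

def make_raft_example(question, answer, oracle_doc, distractor_pool,
                      num_distractors=3, rng=None):
    """Build one (question, context, answer) record with distractor documents mixed in."""
    rng = rng or random.Random()
    k = min(num_distractors, len(distractor_pool))
    distractors = rng.sample(distractor_pool, k=k)
    context = distractors + [oracle_doc]
    # Shuffle so the model cannot learn that the relevant document sits in a fixed slot.
    rng.shuffle(context)
    return {"question": question, "context": context, "answer": answer}

example = make_raft_example(
    question="Which algorithm clusters GraphRAG entities?",
    answer="The Leiden algorithm.",
    oracle_doc="GraphRAG clusters entities with the Leiden algorithm.",
    distractor_pool=["BM25 is a lexical ranking function.",
                     "HNSW is a graph-based ANN index.",
                     "RRF fuses ranked lists."],
    num_distractors=2,
    rng=random.Random(0),
)
```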
FLARE (Forward-Looking Active REtrieval, Jiang et al., 2023) repeatedly triggers retrieval during long-form generation: it drafts the next sentence, and whenever that draft contains tokens whose probability falls below a threshold, it retrieves with the draft as the query and re-generates the sentence with the new evidence.[^51] IRCoT (Trivedi et al., 2022) interleaves chain-of-thought reasoning with retrieval steps for multi-hop QA.[^52] Both formalize the idea that retrieval can happen during generation, not just before it.
The most recent generation of systems treats RAG as a tool available to an autonomous agent. An agent plans which questions to ask, which indexes or APIs to query, iterates based on intermediate results, and finally synthesizes a response. OpenAI's Responses API File Search tool, Anthropic's tool-using assistants, and frameworks such as LangGraph, LlamaIndex Workflows, and CrewAI all embody this pattern. Agentic RAG handles open-ended research questions and multi-hop reasoning that fixed pipelines cannot, at the cost of latency and unpredictability.[^53]
Evaluating RAG is harder than evaluating either retrieval or generation alone. Three quantities matter:
Open-source evaluation frameworks include:
Benchmarks for retrieval and RAG include BEIR (zero-shot IR), MTEB (embeddings), KILT (Knowledge-Intensive Language Tasks), and HotpotQA (multi-hop QA). LongBench and the LOFT benchmark from Google specifically test long-context vs RAG trade-offs.
The 2024-2026 ecosystem has converged on a handful of orchestration frameworks:
Specialized RAG engines such as RAGFlow, FastGPT, Anything LLM, and PrivateGPT target self-hosted document chatbots.
By 2025 the three frontier-model vendors all expose managed RAG primitives:
- OpenAI: the file_search tool; the platform manages chunking and retrieval on the developer's behalf.[^58]
- Amazon Bedrock Knowledge Bases: the RetrieveAndGenerate API.[^60]

These products commoditize the "naive RAG" stack and push developer differentiation toward chunking, evaluation, and agentic orchestration.
If the retriever returns the wrong passages, no amount of LLM capability will recover the correct answer. Common failure modes include:
Hybrid retrieval, reranking, query transformations, and agentic iteration are all attempts to mitigate retrieval mismatch.
Naive chunking can split a sentence in half, separate a table from its header, or break a code block. The chunk that contains the answer may also lack the surrounding context (the paragraph or section heading) that makes the answer interpretable. Hierarchical retrieval, late chunking, and contextual retrieval (Anthropic 2024) are responses to this.
LLMs sometimes "cite" passages that do not actually support their claims. Stanford HAI's 2024 study of legal RAG products found that even purpose-built legal AI assistants hallucinated or mis-cited authorities in 17-33% of queries, despite advertising "hallucination-free" outputs.[^6] Faithfulness is a primary RAGAS metric for this reason.
Liu et al. (Stanford, 2023) showed that LLMs given long contexts use information at the beginning and end most reliably, and that accuracy degrades when the relevant evidence sits in the middle of the context.[^61] Many practitioners therefore place the most important retrieved passage at the start or end of the prompt and limit total context to a few thousand tokens.
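One simple way to act on this is to reorder passages so the strongest ones sit at the edges of the prompt. The sketch below assumes its input is already sorted most-relevant first; the alternating scheme is one possible heuristic, not a method prescribed by the paper.

```python
def reorder_for_edges(passages_by_relevance):
    """Place the most relevant passages at both ends and the weakest in the middle.

    Input must be sorted most-relevant first.
    """
    front, back = [], []
    for i, passage in enumerate(passages_by_relevance):
        # Alternate: even ranks fill the front, odd ranks fill the back.
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]

ordered = reorder_for_edges(["p1", "p2", "p3", "p4", "p5"])
# ordered == ["p1", "p3", "p5", "p4", "p2"]
```

Here the top passage opens the context, the second-best closes it, and the least relevant passage (p5) lands in the middle.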
Enterprise RAG must enforce document-level access control, redact PII, log retrieval for audit, and prevent prompt-injection attacks in which malicious text embedded in a retrieved document tries to override the system instructions. Microsoft, Google, and AWS now publish guidance on RAG security; OWASP's "Top 10 for LLM Applications" includes prompt injection and insecure plugin design.
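Document-level access control is often applied as a filter on retrieved chunks before they reach the prompt. The sketch below assumes each chunk carries an `allowed_groups` metadata field, which is a hypothetical schema; chunks without that field are denied by default.

```python
def filter_by_access(chunks, user_groups):
    """Keep only chunks whose allowed_groups intersect the user's groups.

    Chunks lacking an 'allowed_groups' entry are dropped (deny by default).
    """
    allowed = set(user_groups)
    return [chunk for chunk in chunks
            if allowed & set(chunk.get("allowed_groups", ()))]

chunks = [
    {"id": "hr-001", "text": "Salary bands ...", "allowed_groups": ["hr"]},
    {"id": "eng-042", "text": "Deployment guide ...", "allowed_groups": ["eng", "sre"]},
    {"id": "misc-007", "text": "Untagged note ..."},
]
visible = filter_by_access(chunks, user_groups=["eng"])
```

Many vector databases can apply an equivalent metadata filter inside the index query itself, which avoids retrieving chunks the user may not see in the first place.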
The most heated 2024-2026 debate in this area is whether very long context windows make RAG obsolete. Gemini 1.5 Pro (February 2024) introduced a one-million-token window and demonstrated "needle-in-a-haystack" recall close to 100% in controlled settings; Google later extended Gemini 1.5 Pro to 2 million tokens, Anthropic shipped a 1M-token Claude in 2025, and Magic.dev reported a 100M-token research model.[^14][^15]
Arguments that long context displaces RAG:
Arguments that RAG persists:
The pragmatic synthesis is that long context complements RAG rather than replacing it. Long context lets RAG keep more retrieved passages, larger parent chunks, and richer scratchpads, while RAG selects which documents to put inside that long context. Practitioners increasingly use the term context engineering to describe the discipline of assembling the right context - retrieved or otherwise - for a given query.[^65]