Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with generative AI models to produce responses grounded in external knowledge sources. Rather than relying solely on the parametric memory stored in a model's weights, RAG systems retrieve relevant documents from an external corpus at inference time and condition the generation on those documents. The approach was formally introduced by Lewis et al. (2020) at NeurIPS and has since become one of the most widely adopted methods for reducing hallucination in large language models (LLMs).
Imagine you have a friend who is very good at talking and telling stories, but sometimes makes things up because they can't remember everything. Now imagine you give that friend a library card and tell them, "Before you answer my question, go look it up in the library first." That's basically what RAG does. The "friend" is an AI language model, and the "library" is a big collection of documents. Instead of guessing, the AI goes and finds the right information first, then uses it to give you a better answer.
Large language models such as GPT-4, Claude, and LLaMA store factual knowledge implicitly within their billions of parameters. This parametric approach has several well-known limitations:

- **Staleness:** knowledge is frozen at the training cutoff and cannot reflect newer information.
- **Hallucination:** the model can generate fluent but factually incorrect statements when its parametric memory is wrong or missing.
- **No attribution:** answers cannot be traced back to a verifiable source.
- **Costly updates:** correcting or adding facts requires retraining or fine-tuning.
RAG addresses these problems by giving the model access to an external knowledge base at inference time. Retrieved documents serve as a form of non-parametric memory, providing up-to-date, verifiable evidence that the generator can use to produce more accurate and attributable responses.
The idea of combining retrieval with neural text generation has roots in earlier work on open-domain question answering, but several papers established the foundations of modern RAG:
| Year | Paper | Authors | Contribution |
|---|---|---|---|
| 2020 | REALM: Retrieval-Augmented Language Model Pre-Training | Guu et al. | Introduced the concept of pre-training a language model jointly with a neural retriever, using masked language modeling as the training signal. Achieved 4-16% absolute accuracy gains on Open-QA benchmarks. |
| 2020 | Dense Passage Retrieval for Open-Domain Question Answering | Karpukhin et al. | Demonstrated that learned dense representations outperform BM25 by 9-19% in top-20 passage retrieval accuracy using a dual-encoder framework. |
| 2020 | Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks | Lewis et al. | Introduced the RAG framework combining a pre-trained seq2seq model (BART) with a dense retriever (DPR) over a Wikipedia index. Proposed two variants: RAG-Sequence and RAG-Token. |
| 2020 | ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT | Khattab and Zaharia | Proposed a late interaction architecture that independently encodes queries and documents, enabling efficient retrieval while maintaining the expressiveness of BERT-based models. |
| 2021 | Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering | Izacard and Grave | Introduced Fusion-in-Decoder (FiD), which separately encodes multiple retrieved passages and fuses them in the decoder, achieving state-of-the-art results on Natural Questions and TriviaQA. |
| 2022 | Improving Language Models by Retrieving from Trillions of Tokens | Borgeaud et al. | Introduced RETRO, which uses chunk-wise cross-attention to condition on documents retrieved from a 2-trillion-token corpus. A 7.5B parameter RETRO model outperformed the 178B parameter Jurassic-1 on multiple benchmarks. |
| 2023 | Precise Zero-Shot Dense Retrieval without Relevance Labels | Gao et al. | Introduced HyDE (Hypothetical Document Embeddings), a query transformation technique that generates a hypothetical answer and uses its embedding for retrieval, achieving strong zero-shot performance. |
| 2024 | Retrieval-Augmented Generation for Large Language Models: A Survey | Gao et al. | Comprehensive survey categorizing RAG into Naive RAG, Advanced RAG, and Modular RAG paradigms. |
| 2024 | From Local to Global: A Graph RAG Approach to Query-Focused Summarization | Edge et al. (Microsoft) | Introduced GraphRAG, which builds a knowledge graph from source documents and uses community detection for global summarization queries. |
A RAG system consists of two primary components: a retriever and a generator. During inference, the retriever fetches relevant documents from an external knowledge source, and the generator produces a response conditioned on both the user query and the retrieved documents.
The typical RAG pipeline involves five stages:

1. **Ingestion:** source documents are collected and parsed into plain text.
2. **Chunking:** documents are split into smaller segments suitable for retrieval.
3. **Indexing:** each chunk is embedded and stored in a vector index or database.
4. **Retrieval:** at query time, the chunks most relevant to the user's query are fetched.
5. **Generation:** the query and retrieved chunks are combined into a prompt, and the generator produces the final response.
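The pipeline can be sketched end to end with toy stand-ins. In the sketch below, `score` is a word-overlap stand-in for a real retriever and `generate` is a stand-in for an LLM call; both names are illustrative, not part of any library.

```python
# Minimal RAG pipeline sketch. `score` and `generate` are toy stand-ins
# for a real retriever (sparse or dense) and a real LLM call.

def score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query words present in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k highest-scoring documents for the query."""
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]

def generate(query: str, context: list[str]) -> str:
    """Stand-in for an LLM call that conditions on retrieved context."""
    return f"Answer to {query!r} grounded in {len(context)} passages."

corpus = [
    "RAG retrieves documents before generating an answer.",
    "BM25 is a sparse retrieval method based on term statistics.",
    "Dense retrieval encodes text into embedding vectors.",
]
docs = retrieve("what is dense retrieval", corpus)
print(generate("what is dense retrieval", docs))
```

In a production system the corpus would live in a vector database, `score` would be replaced by embedding similarity or BM25, and `generate` would prompt an LLM with the retrieved passages prepended to the query.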
The original Lewis et al. (2020) paper proposed two formulations:

- **RAG-Sequence** uses the same retrieved document to generate the complete output sequence, marginalizing over the top-k documents at the sequence level.
- **RAG-Token** can draw on a different retrieved document for each generated token, marginalizing over documents at every decoding step.
In practice, RAG-Token offers more flexibility for tasks requiring information synthesis from multiple sources, while RAG-Sequence is simpler and sufficient for many question-answering tasks.
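Written out in the notation of Lewis et al. (2020), with retriever $p_\eta(z \mid x)$ over documents $z$ and generator $p_\theta$, the two variants differ only in where the sum over documents sits relative to the product over output tokens:

```latex
p_{\text{RAG-Sequence}}(y \mid x) \;\approx\; \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x) \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})

p_{\text{RAG-Token}}(y \mid x) \;\approx\; \prod_{i=1}^{N} \; \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})
```

In RAG-Sequence the document choice is fixed for the whole answer; in RAG-Token each token's distribution is a mixture over documents, which is what lets it stitch together information from multiple sources.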
The choice of retrieval method significantly affects RAG system performance. Retrieval approaches generally fall into three categories: sparse retrieval, dense retrieval, and hybrid methods.
Sparse retrieval methods represent documents and queries as high-dimensional vectors where most dimensions are zero, based on term frequency statistics.
Strengths: Computationally efficient, no training required, excellent at exact keyword matching, works well with domain-specific terminology.
Weaknesses: Cannot capture semantic similarity (e.g., "car" vs. "automobile"), struggles with paraphrased queries, and cannot handle misspellings.
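The best-known sparse method is BM25. A compact sketch of its scoring function follows; `k1` and `b` are the standard free parameters at their common default-ish values, not tuned for any particular corpus.

```python
import math

# Minimal BM25 sketch. For each query term, the score combines inverse
# document frequency (rarer terms count more) with a saturating,
# length-normalized term frequency.

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    docs = [d.lower().split() for d in corpus]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n = len(docs)
    scores = []
    for doc in docs:
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in docs if term in d)  # document frequency
            if df == 0:
                continue
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            tf = doc.count(term)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

corpus = ["the car is fast", "the automobile industry", "dogs and cats"]
print(bm25_scores("automobile", corpus))
```

Note how the query "automobile" scores zero against the document about a "car": exactly the semantic-mismatch weakness described above.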
Dense retrieval uses neural networks to encode queries and documents into low-dimensional continuous vectors (embeddings) that capture semantic meaning.
Strengths: Captures semantic similarity, handles synonyms and paraphrases, performs well on conversational and natural language queries.
Weaknesses: Requires training data or pre-trained models, computationally more expensive at indexing time, may underperform on highly technical or domain-specific terminology without fine-tuning.
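At query time, dense retrieval reduces to nearest-neighbor search over precomputed embeddings, typically by cosine similarity. The sketch below uses hand-made 3-dimensional toy vectors in place of real model embeddings (which typically have hundreds to thousands of dimensions); the vectors are constructed so that "car" and "automobile" documents land near each other, illustrating the synonym handling that sparse methods miss.

```python
import math

# Brute-force nearest-neighbor search over precomputed embeddings.
# The 3-d vectors are toy stand-ins for real embedding-model outputs.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dense_search(query_vec, index, k=1):
    """index: list of (doc_id, embedding) pairs; returns top-k doc ids."""
    ranked = sorted(index, key=lambda p: cosine(query_vec, p[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

index = [
    ("doc_car",        [0.90, 0.10, 0.00]),
    ("doc_automobile", [0.85, 0.15, 0.05]),  # semantically close to doc_car
    ("doc_cooking",    [0.00, 0.10, 0.90]),
]
print(dense_search([0.88, 0.12, 0.02], index, k=2))
```

Production systems replace this brute-force loop with an approximate nearest-neighbor index (HNSW, IVF, etc.) from a vector database or a library such as FAISS.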
Hybrid approaches combine sparse and dense retrieval to leverage the strengths of both methods. A common pattern is to run BM25 and a dense retriever in parallel, then merge results using Reciprocal Rank Fusion (RRF) or a learned score combination.
Hybrid search has been reported to improve recall by 15-30% over single-method approaches. Anthropic's Contextual Retrieval technique reported a 49% reduction in retrieval failure rates when combining hybrid search with contextual embeddings.
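Reciprocal Rank Fusion itself is only a few lines: each ranked list contributes $1/(k + \text{rank})$ per document, and documents are reordered by the summed score. The constant $k = 60$ is the value from the original RRF paper (Cormack et al., 2009).

```python
# Reciprocal Rank Fusion: merge ranked lists from different retrievers
# by summing 1 / (k + rank) per document across all lists.

def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7"]   # ranking from sparse retrieval
dense_hits = ["d1", "d5", "d3"]  # ranking from dense retrieval
print(rrf([bm25_hits, dense_hits]))
```

Because RRF only consumes ranks, not raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.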
Embedding models convert text into dense vector representations used for semantic search in RAG systems. The quality of embeddings directly impacts retrieval accuracy.
| Model | Developer | Dimensions | Max tokens | Open source | Notes |
|---|---|---|---|---|---|
| text-embedding-3-large | OpenAI | 3,072 | 8,191 | No | Strong general-purpose performance; supports dimensionality reduction via Matryoshka representation |
| text-embedding-3-small | OpenAI | 1,536 | 8,191 | No | Cost-effective alternative with good performance |
| Cohere embed-v4 | Cohere | 1,024 | 128,000 | No | Leads MTEB benchmark (65.2 overall); supports long-context embedding |
| voyage-3-large | Voyage AI | 1,024 | 32,000 | No | Strong retrieval performance; 32K context window |
| BGE-M3 | BAAI | 1,024 | 8,192 | Yes | Top open-source multilingual model; supports 100+ languages |
| NV-Embed-v2 | NVIDIA | 4,096 | 32,768 | Yes | Top overall MTEB score among open models |
| E5-Mistral-7B-Instruct | Microsoft | 4,096 | 32,768 | Yes | High retrieval performance using a Mistral-based architecture |
| Qwen-3-Embedding-8B | Alibaba | 4,096 | 32,768 | Yes | State-of-the-art 75.22 MTEB English score; surpasses many proprietary models |
The Massive Text Embedding Benchmark (MTEB) is the standard evaluation suite for embedding models, covering tasks such as retrieval, classification, clustering, and semantic textual similarity. Performance on the retrieval subset is most relevant for RAG applications, and overall MTEB scores can be misleading if a model excels at non-retrieval tasks but performs poorly on retrieval specifically.
Vector databases are specialized storage systems designed for efficient similarity search over high-dimensional embedding vectors. They form the backbone of most production RAG systems.
| Database | Type | Key features | Best for |
|---|---|---|---|
| Pinecone | Managed (cloud) | Fully managed, automatic scaling, metadata filtering | Teams wanting zero infrastructure management |
| Weaviate | Open source / managed | Built-in vectorization, GraphQL API, hybrid search | Semantic search with complex data relationships |
| Milvus / Zilliz | Open source / managed | GPU-accelerated, multi-vector support, 35,000+ GitHub stars | Enterprise-scale deployments handling billions of vectors |
| Qdrant | Open source / managed | Written in Rust, advanced payload filtering, efficient memory usage | Applications requiring both vector search and complex metadata filtering |
| ChromaDB | Open source | Lightweight, simple API, runs in-process | Prototyping and small-scale applications |
| FAISS | Library (open source) | Created by Meta; supports GPU acceleration, multiple index types (IVF, HNSW, PQ) | Research and custom high-performance solutions |
| pgvector | Extension | PostgreSQL extension for vector similarity search | Teams already using PostgreSQL who want to avoid a separate database |
A 2025 benchmark found that Milvus/Zilliz Cloud leads in low-latency performance, with Pinecone and Qdrant close behind; most achieved 10-100 ms query times on datasets of 1-10 million vectors.
Chunking is the process of splitting source documents into smaller segments before indexing. The chunking strategy directly impacts retrieval quality, as chunks must be large enough to contain meaningful information but small enough to be relevant to specific queries.
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Fixed-size | Splits text into chunks of a predetermined number of tokens or characters, with optional overlap | Simple to implement, predictable chunk sizes | May split sentences or paragraphs mid-thought |
| Recursive | Uses hierarchical separators (paragraphs, sentences, words) to split text while preserving natural boundaries | Respects document structure better than fixed-size | Produces variable-length chunks; may still split semantic units |
| Semantic | Groups sentences based on the similarity of their embeddings, splitting where semantic similarity drops below a threshold | Produces semantically coherent chunks | Computationally expensive; requires an embedding model during preprocessing |
| Document-based | Splits on structural elements such as Markdown headings, HTML tags, or code blocks | Preserves document structure and metadata | Chunk sizes can vary widely; not all documents have clear structure |
| Sentence-level | Each sentence or small group of sentences becomes a chunk | Fine-grained retrieval, good for fact-specific queries | Very small chunks may lack context |
| Late chunking | Embeds the entire document first through a long-context embedding model, then splits the token-level embeddings into chunks | Preserves full-document context in embeddings | Requires a long-context embedding model; more complex pipeline |
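The first strategy in the table is the easiest to implement. The sketch below uses whitespace-delimited words as a rough token proxy (a real pipeline would count model tokens); the overlap ensures that a sentence cut at a chunk boundary still appears whole in the neighboring chunk.

```python
# Fixed-size chunking with overlap. `chunk_size` and `overlap` are in
# words here as a rough token proxy; overlap must be < chunk_size.

def chunk_fixed(text, chunk_size=200, overlap=50):
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reaches the end of the document
    return chunks

doc = " ".join(f"word{i}" for i in range(500))
chunks = chunk_fixed(doc, chunk_size=200, overlap=50)
print(len(chunks))  # → 3 chunks covering words 0-199, 150-349, 300-499
```

Typical production values are 200-500 tokens per chunk with 10-20% overlap, though the table above shows why structure-aware strategies often beat any fixed setting.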
The RAG survey by Gao et al. (2024) categorizes the evolution of RAG into three paradigms: Naive RAG, Advanced RAG, and Modular RAG. Advanced RAG introduces optimizations at the pre-retrieval, retrieval, and post-retrieval stages.
Query rewriting: The original user query is reformulated or expanded to improve retrieval results. Techniques include:

- **Query expansion:** adding synonyms or related terms to broaden the match set.
- **Query decomposition:** splitting a complex, multi-part question into simpler sub-queries that are retrieved independently.
- **Step-back prompting:** rewriting a specific question into a more general one so that background passages are retrieved alongside specifics.
HyDE (Hypothetical Document Embeddings): Introduced by Gao et al. (2023), HyDE uses an LLM to generate a hypothetical ideal answer to the query, then uses that hypothetical answer's embedding (rather than the query's embedding) to search for real documents. This bridges the gap between short user queries and longer document passages in the embedding space. HyDE significantly outperforms the unsupervised dense retriever Contriever and performs comparably to fine-tuned retrievers across web search, QA, and fact verification tasks.
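The HyDE flow can be sketched with stubs. Here `llm` and `embed` are hypothetical stand-ins (a character-frequency "embedding" and a canned LLM response), so only the control flow is real: the point is that the embedding fed to the retriever comes from the generated hypothetical answer, not from the raw query.

```python
# HyDE control-flow sketch. `llm` and `embed` are illustrative stubs;
# a real system would call a generative model and an embedding model.

def llm(prompt: str) -> str:
    # Stub: a real implementation would call a generative model here.
    return "A hypothetical passage answering: " + prompt

def embed(text: str) -> list[float]:
    # Stub embedding: normalized character frequencies, purely illustrative.
    return [text.lower().count(c) / max(len(text), 1) for c in "abcdefghij"]

def hyde_query_vector(query: str) -> list[float]:
    # Key idea: embed the *hypothetical answer*, not the raw query,
    # so the search vector lives in the same region as real passages.
    hypothetical_doc = llm(f"Write a passage that answers: {query}")
    return embed(hypothetical_doc)

vec = hyde_query_vector("what is retrieval augmented generation")
print(len(vec))
```

The resulting vector is then handed to the ordinary dense retriever; nothing downstream of retrieval changes.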
Hybrid search: Combining sparse (BM25) and dense retrieval in parallel, as described above.
Multi-hop retrieval: For complex questions requiring reasoning over multiple documents, the system performs iterative retrieval steps, using the output of one retrieval round to inform the next query.
Parent-child retrieval: Small chunks are used for embedding and retrieval (for precision), but when a small chunk is retrieved, the system returns the larger parent chunk (for context) to the generator.
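A minimal sketch of that parent-child lookup follows. The child scoring uses toy word overlap (a real system would use embeddings), and the `parents`/`children` structures are illustrative, but the core mechanic is faithful: match on the small chunk, return the large one.

```python
# Parent-child retrieval sketch: small chunks are matched for precision,
# but the larger parent section is returned to the generator for context.

parents = {
    "sec1": "Full section on sparse retrieval, including BM25 details...",
    "sec2": "Full section on dense retrieval, including embeddings...",
}
children = [
    ("BM25 weights terms by frequency", "sec1"),
    ("Dense retrieval uses embeddings", "sec2"),
]

def retrieve_parent(query):
    # Toy child scoring by word overlap; a real system would use embeddings.
    def overlap(text):
        return len(set(query.lower().split()) & set(text.lower().split()))
    best_child, parent_id = max(children, key=lambda c: overlap(c[0]))
    return parents[parent_id]

print(retrieve_parent("how do embeddings work"))
```

Frameworks like LangChain and LlamaIndex ship this pattern under names such as "parent document retriever" and "small-to-big retrieval."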
Re-ranking: After initial retrieval, a cross-encoder model rescores the retrieved documents for more accurate relevance judgments. Cross-encoders process the query and document jointly (rather than independently, as bi-encoders do), producing more accurate but slower relevance scores. Common re-rankers include Cohere Rerank, the BGE Reranker family, and cross-encoder models from Sentence Transformers.
Contextual compression: Retrieved passages are summarized or compressed to include only the most relevant information, reducing noise in the generator's input and making better use of the model's context window.
Filtering and deduplication: Removing duplicate or near-duplicate chunks, and filtering out passages that fall below a relevance threshold.
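Exact-duplicate removal is the simple end of that spectrum and fits in a few lines; near-duplicate detection would need something like MinHash or embedding similarity instead.

```python
import hashlib

# Exact-duplicate removal by content hash, after light normalization
# (strip + lowercase) so trivially re-cased copies also collapse.

def dedupe(chunks):
    seen, unique = set(), []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique

print(dedupe(["Chunk A", "chunk a", "Chunk B"]))  # → ['Chunk A', 'Chunk B']
```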
Modular RAG systems treat the retrieval, augmentation, and generation components as interchangeable modules that can be independently optimized, replaced, or rearranged. This paradigm enables architectures such as:

- **Iterative retrieval**, where retrieval and generation alternate so that partial outputs guide subsequent retrieval rounds.
- **Adaptive retrieval**, where the model itself decides whether and when to retrieve (as in Self-RAG and FLARE).
- **Routing**, where each query is directed to the most appropriate retriever or knowledge source among several.
Evaluating RAG systems requires assessing both the retrieval component and the generation component. Traditional metrics like BLEU and ROUGE are insufficient because they do not capture factual accuracy or the quality of retrieved context.
RAGAS (Retrieval Augmented Generation Assessment), introduced by Es et al. (2023), provides reference-free evaluation metrics specifically designed for RAG pipelines:
| Metric | What it measures | Range |
|---|---|---|
| Faithfulness | Whether the generated answer is factually consistent with the retrieved context (i.e., no hallucinated claims) | 0 to 1 |
| Answer relevance | How well the generated answer addresses the user's query | 0 to 1 |
| Context precision | Whether relevant documents are ranked higher than irrelevant ones among the retrieved set | 0 to 1 |
| Context recall | What proportion of the ground-truth relevant information is captured by the retrieved documents | 0 to 1 |
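The last two metrics in the table can be illustrated with simplified, label-based versions. The actual RAGAS implementation derives the relevance and fact-coverage judgments with an LLM; the sketch below assumes binary relevance flags and fact sets are already available, which is the main simplification.

```python
# Simplified, label-based sketches of two RAGAS-style metrics.
# RAGAS itself produces the underlying judgments with an LLM judge.

def context_precision(relevance):
    """relevance: 0/1 flags for retrieved chunks, in rank order.
    Averages precision@k over the positions holding relevant chunks,
    so relevant chunks ranked early score higher."""
    precisions, hits = [], 0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        if rel:
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def context_recall(retrieved_facts, ground_truth_facts):
    """Fraction of ground-truth facts covered by the retrieved context."""
    return len(ground_truth_facts & retrieved_facts) / len(ground_truth_facts)

print(context_precision([1, 0, 1]))                    # (1/1 + 2/3) / 2
print(context_recall({"f1", "f2"}, {"f1", "f2", "f3"}))
```

Both return values in [0, 1], matching the ranges in the table; faithfulness and answer relevance have no comparable closed form, since they inherently require a judge model.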
RAG and fine-tuning are two complementary approaches for adapting LLMs to specific domains or tasks. They address different aspects of model behavior and have distinct trade-offs.
| Dimension | RAG | Fine-tuning |
|---|---|---|
| Knowledge source | External documents retrieved at inference time | Encoded in model parameters during training |
| Knowledge freshness | Can incorporate new information instantly by updating the document store | Knowledge is frozen at training time; requires retraining to update |
| Computational cost | Higher inference cost (retrieval + generation), lower upfront cost | Higher upfront training cost, lower per-query inference cost |
| Hallucination risk | Lower, because answers are grounded in retrieved evidence | Can still hallucinate if training data is noisy or insufficient |
| Attribution | Can cite retrieved sources | Generally cannot attribute specific facts to training examples |
| Output style control | Limited influence on writing style or format | Can adapt tone, format, and domain-specific vocabulary |
| Data requirements | Requires a well-organized document corpus | Requires labeled training examples |
| Latency | Higher (retrieval adds latency) | Lower (no retrieval step) |
| Best for | Factual QA, knowledge-intensive tasks, rapidly changing information | Style adaptation, domain-specific language, structured output formats |
In practice, the two approaches can be combined. A fine-tuned model can serve as the generator in a RAG system, gaining both domain-specific language capabilities from fine-tuning and factual grounding from retrieval. A case study by Balaguer et al. (2024) in the agricultural domain found that RAG consistently outperformed fine-tuning for factual accuracy, while fine-tuning was better for tasks requiring domain-specific reasoning patterns.
RAG has been adopted across a wide range of domains:

- **Customer support:** chatbots grounded in product documentation and help-center articles.
- **Enterprise search:** question answering over internal wikis, policies, and reports.
- **Healthcare:** clinical question answering grounded in medical literature and guidelines.
- **Legal:** research assistants that retrieve and cite statutes and case law.
- **Software engineering:** coding assistants that retrieve API documentation and existing code.
The RAG market was valued at $1.92 billion in 2024 and is projected to reach $10.20 billion by 2030, growing at a compound annual growth rate of 39.66%.
Despite its effectiveness, RAG has several known limitations:

- **Retrieval is a bottleneck:** if the relevant documents are not retrieved, the generator cannot recover, and irrelevant passages add noise.
- **Added latency and complexity:** retrieval, re-ranking, and index maintenance increase both response time and operational burden.
- **Context-window pressure:** long retrieved contexts can exceed the model's window, and models tend to underuse information buried in the middle of long prompts.
- **Multi-hop reasoning:** questions requiring synthesis across many documents remain difficult for single-pass retrieval.
- **Knowledge conflicts:** the generator may ignore retrieved evidence in favor of its parametric memory, or vice versa.
Several open-source frameworks simplify building RAG systems:
| Framework | Developer | Language | Description |
|---|---|---|---|
| LangChain | LangChain Inc. | Python, JS | Modular framework for building LLM applications with retrieval, chains, and agents |
| LlamaIndex | LlamaIndex Inc. | Python, JS | Data framework for connecting custom data sources to LLMs; strong indexing and query capabilities |
| Haystack | deepset | Python | End-to-end NLP framework with pipeline-based architecture for building RAG and search systems |
| DSPy | Stanford NLP | Python | Programmatic framework for optimizing LM prompts and weights, including retrieval-augmented pipelines |
| Semantic Kernel | Microsoft | C#, Python, Java | SDK for integrating LLMs with conventional programming, including RAG patterns |
| Verba | Weaviate | Python | Open-source RAG application built on Weaviate for personal and organizational data |