Retrieval-Augmented Generation

Updated Jul 28, 2026

Retrieval-augmented generation (RAG) is a family of methods that retrieves information from an external collection and conditions a generative model on that information when producing an output. It connects information retrieval with generation so that an answer can draw on material not contained, or not reliably accessible, in a model's parameters. The external collection may contain passages, records, code, images, or other retrievable units, but this article focuses on text retrieval for natural language processing.

Patrick Lewis and colleagues introduced the named RAG architecture in 2020. Their system paired a dense retriever over a Wikipedia index with a pretrained sequence-to-sequence generator and treated the retrieved document as a latent variable ^[1]. The term is now also used more broadly for modular applications that retrieve passages and place them in the input of a large language model, even when those applications do not use the original training objective or model architecture.

Retrieval can make a source collection easier to update, expose passages for inspection, and improve task performance in particular settings. It does not by itself make an answer correct, current, complete, cited, or secure. A RAG system can retrieve the wrong material, miss necessary evidence, retrieve contradictory or malicious content, ignore good evidence, or generate a claim that its cited passages do not support. Any benefit is therefore a property of a specific corpus, retriever, generator, task, and evaluation, not of the label "RAG" alone.

Scope and terminology

A minimal RAG system has four functional parts:

a source collection whose items have identifiers and content;
an index or search mechanism that can select items for an input;
a context-assembly step that presents selected material to a generator; and
a generator that produces an output from the input and assembled context.
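
The four parts can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the names `Item`, `retrieve`, `assemble`, and `answer` are hypothetical, the search mechanism is a toy word-overlap ranker, and the generator is any caller-supplied function from a prompt string to an output string.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Item:
    """One unit of the source collection: an identifier plus content."""
    item_id: str
    content: str


def retrieve(query: str, collection: list[Item], k: int = 2) -> list[Item]:
    """Toy search mechanism: rank items by the number of words shared with the query."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(it.content.lower().split())), it) for it in collection]
    scored = [pair for pair in scored if pair[0] > 0]  # drop items with no overlap
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [it for _, it in scored[:k]]


def assemble(query: str, items: list[Item]) -> str:
    """Context assembly: present selected items, each labeled with its identifier."""
    evidence = "\n".join(f"[{it.item_id}] {it.content}" for it in items)
    return f"Evidence:\n{evidence}\n\nQuestion: {query}"


def answer(query: str, collection: list[Item],
           generate: Callable[[str], str], k: int = 2) -> str:
    """Generator step: `generate` is an assumed stand-in for a real model call."""
    return generate(assemble(query, retrieve(query, collection, k)))
```

Passing `generate=lambda prompt: prompt` returns the assembled prompt itself, which is a convenient way to inspect what a generator would have seen.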

Many deployments add query rewriting, metadata filters, reranking, deduplication, citation construction, access control, caching, monitoring, or multiple rounds of retrieval. Those additions change the system's behavior, but they are not required by the broad definition.

The source collection is sometimes called non-parametric memory because its contents can be changed without directly changing the generator's learned weights. Knowledge encoded in those weights is called parametric memory. The distinction is useful but not absolute. A learned retriever has parameters; generated answers may mix retrieved information with parametric knowledge; and updating an index can require recomputing representations. RAG is therefore not equivalent to a vector database, a search engine, a prompt template, or a particular model.

Retrieval means selecting information in response to an input. Augmentation means supplying retrieved information to the computation that produces the output. Generation means constructing an output sequence rather than merely returning a stored passage. A retrieve-and-extract question-answering system and a search interface may be adjacent technologies, but they are not necessarily RAG under this definition.

Grounding and attribution are separate claims. An answer is grounded only to the extent that its claims follow from the stated evidence under an appropriate interpretation. An answer is attributed when it identifies sources or passages. A citation can be correctly formatted yet fail to support the nearby claim, and a well-supported answer can omit a citation. Freshness is also separate: retrieval can expose newer material only if the collection, index, filters, and timestamps are maintained correctly.

Historical development

Retrieval-assisted language modeling predates the 2020 RAG paper. REALM trained a masked language model with a latent retriever used during pretraining, fine-tuning, and inference, showing that retrieval could be learned from the language-modeling objective ^[2]. Dense Passage Retrieval then demonstrated a dual-encoder retriever for open-domain question answering, representing questions and passages separately and ranking passages by vector similarity ^[3]. The original RAG model initialized its retriever from that work.

Another line retrieved examples at token-prediction time. The kNN-LM method built a datastore from a language model's hidden states and interpolated the model's next-token distribution with a nearest-neighbor distribution, without additional model training in the studied setup ^[4]. This is retrieval-augmented language modeling, but it does not retrieve human-readable passages for a prompt in the same way as a typical application-level RAG pipeline.

Research after 2020 explored different points of integration. Fusion-in-Decoder encoded each question-passage pair separately and let the decoder combine the encoded evidence, avoiding a single early concatenation of all passages ^[5]. RETRO incorporated retrieval into autoregressive language-model pretraining and retrieved neighboring chunks from a very large token database ^[6]. Atlas jointly trained a retriever and generator for few-shot learning and allowed the non-parametric memory to be updated ^[7]. These systems show that retrieval can enter pretraining, fine-tuning, or inference; their architectures are not interchangeable.

Evaluation work also began to couple outputs with provenance. KILT provided a shared Wikipedia snapshot and evaluated both a task output and the pages offered as evidence ^[8]. Later methods changed when and why retrieval occurs. Self-RAG trained a model to use reflection tokens for retrieval and for judgments about relevance, support, and utility ^[9]. FLARE triggered retrieval while generating long-form text, using anticipated future content and confidence in the studied method ^[10]. IRCoT interleaved retrieval with intermediate reasoning for multi-step questions ^[11]. HyDE generated a hypothetical document as a query representation and then retrieved real documents near its embedding ^[12]. In HyDE, the generated hypothetical document is not evidence; the retrieved corpus items remain the candidate evidence.

These developments produced a broad design space rather than a single standard architecture. A useful description of a RAG system should state what is retrieved, from where, at which stage, with what supervision, and how retrieved material affects generation.

Architecture and operation

Corpus construction and indexing

The corpus determines what the system can retrieve. Corpus design includes source selection, licensing, extraction, normalization, segmentation, versioning, and retention. A source identifier should remain attached to every indexed unit so that an answer can be traced back to the exact item and version used. When permissions differ by user or tenant, authorization must be enforced before or during retrieval, not merely hidden in the user interface after retrieval.

Long documents are often divided into passages or chunks because retrieval and generation operate within finite computation and context-window limits. Boundaries affect meaning. Small units may isolate a precise fact but lose surrounding definitions, tables, or exceptions; large units preserve more context but consume more input space and may dilute the relevant part. Overlap can preserve material near a boundary while increasing duplication. Structural segmentation can respect sections, paragraphs, records, or code units. There is no evidence-based universal chunk length. The original RAG experiments used disjoint 100-word Wikipedia chunks, which was a choice for that corpus and experiment, not a general rule ^[1].
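
The tradeoff between chunk size and overlap can be made concrete with a word-window splitter. This is a sketch under simple assumptions: the function name `chunk_words` is hypothetical, words are whitespace-separated tokens, and no structural boundaries are respected.

```python
def chunk_words(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


disjoint = chunk_words("a b c d e f g", size=3)               # ['a b c', 'd e f', 'g']
overlapping = chunk_words("a b c d e f g", size=3, overlap=1)  # ['a b c', 'c d e', 'e f g']
```

With overlap, the words `c` and `e` appear in two chunks each: boundary material is preserved at the cost of duplication in the index.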

A lexical index represents terms and their statistics. BM25, for example, belongs to a probabilistic relevance framework and scores term matches using document and collection statistics ^[13]. A dense index stores learned embeddings and retrieves items whose vectors are close to a query vector. Dense Passage Retrieval is one influential dual-encoder design ^[3]. Hybrid search combines signals, commonly lexical and dense scores or ranks, to capture both exact terminology and semantic similarity.
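
One way to combine ranks from a lexical and a dense retriever is reciprocal rank fusion. The sketch below is an illustration of that one option, not a method prescribed by the sources above; the function name and the smoothing constant `k = 60` are assumptions chosen for the example.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of ids; each list adds 1 / (k + rank) to an id's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical = ["d1", "d2", "d3"]  # hypothetical BM25 order
dense = ["d3", "d1", "d4"]    # hypothetical embedding order
fused = reciprocal_rank_fusion([lexical, dense])  # ['d1', 'd3', 'd2', 'd4']
```

Fusing ranks rather than raw scores avoids having to calibrate lexical and dense scores onto a common scale, but it also discards score magnitudes that a threshold rule might need.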

No retrieval family dominates every domain. BEIR evaluated lexical, sparse, dense, late-interaction, and reranking systems over heterogeneous zero-shot datasets. Performance varied substantially across datasets, and several dense systems generalized poorly to some domains despite strong results in narrower settings ^[14]. Late-interaction models such as ColBERT keep one vector per token rather than a single vector per passage and compare query and passage tokens at search time; ColBERTv2 studied compression and denoised supervision to reduce the storage footprint while preserving that approach ^[15]. These results support measuring retrieval on representative data rather than selecting a method by architecture name.

Exact search over every dense vector can be expensive. Approximate-nearest-neighbor indexes trade some recall for speed or memory. HNSW organizes proximity links in a multilayer graph and searches the graph approximately ^[16]. Index parameters, vector quantization, filtering, and update strategy can all change the candidates returned even when the same embedding model is used. Retrieval evaluation must therefore cover the deployed index, not only an offline similarity function.
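
A deployed approximate index can be checked against exact search on a sample of queries. The sketch below assumes inner-product similarity and small in-memory vectors; `exact_top_k` and `recall_vs_exact` are hypothetical helper names, and the approximate result would come from whatever index is actually deployed.

```python
def exact_top_k(query: list[float], vectors: dict[str, list[float]], k: int) -> list[str]:
    """Brute-force reference: score every vector by inner product and keep the best k ids."""
    def dot(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))
    return sorted(vectors, key=lambda i: (-dot(query, vectors[i]), i))[:k]


def recall_vs_exact(exact_ids: list[str], approx_ids: list[str]) -> float:
    """Fraction of the exact top-k that the approximate index also returned."""
    if not exact_ids:
        return 1.0
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)


vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
reference = exact_top_k([1.0, 0.0], vectors, k=2)   # ['a', 'b']
recall = recall_vs_exact(reference, ["a", "c"])      # 0.5 for this hypothetical ANN result
```

This measures agreement with exact search only; it says nothing about whether the exact neighbors are relevant to the query.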

Query processing and retrieval

The user's input is not always an effective search query. A pipeline may normalize it, add conversational context, extract entities, generate several queries, or decompose a multi-part question. Rewriting can improve recall, but it can also discard a constraint or introduce an assumption. Systems should retain the original request and test whether rewritten queries preserve its meaning.

A first-stage retriever usually returns a candidate set. Metadata filters can restrict dates, languages, source types, permissions, or document states. A reranker can then score the query and each candidate jointly, often at higher computation per candidate than a dual encoder. The number of retrieved candidates, the number passed to the generator, and any minimum-score rule are distinct controls. A fixed top-K rule always supplies K items, even when none is useful, unless the system also applies a threshold or an explicit no-evidence decision.
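
The difference between a fixed top-K rule and a thresholded rule can be shown directly. In this sketch the function name, the scores, and the threshold values are all hypothetical; real scores are not comparable across retrievers, so a threshold has to be chosen per system.

```python
from typing import Optional


def select_for_generator(scored: list[tuple[str, float]], max_items: int,
                         min_score: Optional[float] = None) -> list[str]:
    """Keep at most `max_items` ids, optionally dropping candidates below `min_score`."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if min_score is not None:
        ranked = [pair for pair in ranked if pair[1] >= min_score]
    return [doc_id for doc_id, _ in ranked[:max_items]]


candidates = [("a", 0.21), ("b", 0.18), ("c", 0.05)]  # hypothetical low-scoring results
always_two = select_for_generator(candidates, max_items=2)                 # ['a', 'b']
thresholded = select_for_generator(candidates, max_items=2, min_score=0.5)  # []
```

An empty result is the signal for an explicit no-evidence decision; the fixed rule never produces it as long as the candidate list is non-empty.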

Multi-step questions may require evidence that does not share vocabulary with the original query. Iterative systems use an answer fragment or intermediate state to retrieve again. This can recover bridging evidence, as studied by IRCoT ^[11], but each step can also compound an early error or fill the context with redundant passages. Adaptive retrieval methods try to stop when evidence is sufficient, yet sufficiency itself must be evaluated.

Pipeline choices interact. An EMNLP 2024 study compared combinations of query classification, chunking, retrieval, reranking, repacking, summarization, and generation in its experimental settings and found performance-efficiency tradeoffs rather than a cost-free universal configuration ^[17]. Its recommendations remain conditional on the tested datasets and models. Deployment values for chunk size, overlap, candidate count, reranking depth, and context order should be selected against representative tasks, latency limits, and error costs.

Context assembly and generation

Context assembly decides which retrieved units the generator actually sees and how they are presented. Common operations include removing duplicates, grouping neighboring passages, preserving document boundaries, labeling sources, ordering evidence, and fitting the material into an input budget. A rank produced for retrieval relevance is not automatically the best order for generation. The system may need to keep a definition next to its qualifier or represent two conflicting sources separately.
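
Several of these operations fit in one small function. The sketch below assumes passages arrive already ordered, measures the budget in whitespace-separated words rather than model tokens, and uses a hypothetical `[source_id]` label format; it removes exact duplicates only after case and whitespace normalization.

```python
def assemble_context(passages: list[tuple[str, str]], budget_words: int) -> str:
    """Deduplicate, label with source ids, and stop adding text once the budget is used."""
    seen: set[str] = set()
    blocks: list[str] = []
    used = 0
    for source_id, text in passages:
        key = " ".join(text.lower().split())  # normalized form used for duplicate checks
        if key in seen:
            continue
        n_words = len(text.split())
        if used + n_words > budget_words:
            continue  # this passage does not fit; a later, shorter one still might
        seen.add(key)
        blocks.append(f"[{source_id}] {text}")
        used += n_words
    return "\n".join(blocks)


context = assemble_context(
    [
        ("doc1#2", "Refunds are accepted within 30 days."),
        ("doc7#1", "refunds are accepted within  30 days."),  # duplicate after normalization
        ("doc3#4", "Opened software is excluded from refunds unless defective."),
        ("doc9#1", "Contact support."),
    ],
    budget_words=9,
)
```

Here the exception in `doc3#4` is dropped because it does not fit the budget, which is exactly the kind of silent loss of a qualifier that assembly policies need to be tested for.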

The generator receives the request plus the assembled context. Instructions can ask it to use only the supplied evidence, distinguish uncertainty, quote identifiers, or abstain when support is missing. Those instructions shape behavior but do not enforce a logical constraint. A model can ignore context, blend it with parametric memory, misunderstand a passage, or attach a citation to an unsupported sentence.

An evidence-aware output path preserves the mapping from output claims to source units. One approach lets the generator emit source identifiers; another aligns generated spans with passages after generation. Post-hoc citation matching can improve formatting but cannot turn an unsupported claim into a supported one. If the task requires exact extraction, calculation, or database semantics, a deterministic component may be more appropriate than asking a language model to reproduce the operation from prose.

The original probabilistic model

The 2020 architecture formalized retrieval as a latent variable. For an input, a retriever assigns a probability to each passage, while a generator assigns token probabilities conditioned on the input, a passage, and earlier output tokens. In the formulas below, x is the input, y is the output sequence of N tokens, z is a passage, eta and theta denote retriever and generator parameters, and Z_K(x) is the set of top-K passages for x under the retriever. Both expressions are approximations because the sum runs over those K passages rather than over the whole collection.

In RAG-Sequence, one latent passage accounts for the whole output sequence:

p_{\mathrm{seq}}(y \mid x) \approx \sum_{z \in \mathcal{Z}_K(x)} p_{\eta}(z \mid x) \prod_{i=1}^{N} p_{\theta}(y_i \mid x,z,y_{<i})

In RAG-Token, the passage is marginalized separately for each output position:

p_{\mathrm{tok}}(y \mid x) \approx \prod_{i=1}^{N} \sum_{z \in \mathcal{Z}_K(x)} p_{\eta}(z \mid x) p_{\theta}(y_i \mid x,z,y_{<i})
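
The two marginals can differ for the same inputs. The following numerical sketch uses made-up probabilities for K = 2 passages and N = 2 output tokens; the function names are hypothetical, and `p_tokens[j][i]` stands for the generator probability of token i given passage j and the earlier tokens.

```python
from math import prod


def rag_sequence(p_z: list[float], p_tokens: list[list[float]]) -> float:
    """Sum over passages of p(z|x) times the product of token probabilities under that passage."""
    return sum(pz * prod(tokens) for pz, tokens in zip(p_z, p_tokens))


def rag_token(p_z: list[float], p_tokens: list[list[float]]) -> float:
    """Product over positions of the passage-weighted token probability at that position."""
    n_tokens = len(p_tokens[0])
    return prod(
        sum(pz * tokens[i] for pz, tokens in zip(p_z, p_tokens))
        for i in range(n_tokens)
    )


p_z = [0.5, 0.5]          # retriever probabilities for two passages (illustrative)
p_tokens = [[0.8, 0.2],   # passage 1 supports the first token, not the second
            [0.2, 0.8]]   # passage 2 supports the second token, not the first
seq = rag_sequence(p_z, p_tokens)
tok = rag_token(p_z, p_tokens)
```

With these numbers RAG-Sequence gives about 0.16 and RAG-Token gives 0.25: when different passages support different tokens, marginalizing per position lets each token draw on the passage that explains it.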

The original implementation used DPR for retrieval, BART for generation, and a dense index of a December 2018 Wikipedia snapshot. It fine-tuned the query encoder and generator while keeping the passage encoder and document index fixed ^[1]. Consequently, "end-to-end" in that experiment did not mean that every corpus representation changed during each update.

Many systems now called RAG do not calculate either marginal. They retrieve text with a separate service, concatenate selected passages with an instruction, and call a generator without jointly training the retriever and generator. That modular pattern is still retrieval-augmented generation in the broad sense, but claims about the Lewis et al. likelihood, training behavior, or experimental results do not automatically transfer to it.

Training, updating, and inference

RAG components can be trained separately or jointly. A retriever may learn from query-passage relevance pairs, question-answer supervision, clicks, or synthetic examples. A generator may be pretrained independently and used through prompting, fine-tuned to use retrieved evidence, or trained jointly with a retriever. REALM, the original RAG model, RETRO, and Atlas illustrate different training objectives and update boundaries ^[2]^[1]^[6]^[7].

Updating the corpus is not the same as updating the model. A source change may require re-extraction, re-chunking, re-embedding, index insertion or deletion, cache invalidation, and a new version identifier. If old vectors or cached answers remain active, the system can continue returning superseded material. Conversely, replacing a source collection does not erase conflicting parametric knowledge in the generator.

Retrievers also drift. A new embedding model may make stored vectors incompatible with new query vectors. A changed chunking policy alters both the retrieval unit and the citation target. A corpus snapshot may be internally consistent but stale. Reproducible evaluation records the corpus version, index build, retriever version, generator version, prompt or decoding configuration, and evaluation set.
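
The fields that a reproducible evaluation records can be kept in one immutable structure. This is a sketch with hypothetical field names and version strings; the short fingerprint is an assumed convenience for tagging logs, not a standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Everything that must match for two evaluation runs to be comparable."""
    corpus_version: str
    index_build: str
    retriever_version: str
    generator_version: str
    prompt_config: str
    eval_set: str

    def fingerprint(self) -> str:
        """Stable short hash of all fields; it changes if any component changes."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


run = RunManifest(
    corpus_version="corpus-v3",        # all values here are illustrative placeholders
    index_build="index-build-17",
    retriever_version="retriever-a",
    generator_version="generator-b",
    prompt_config="prompt-v2/temperature-0",
    eval_set="eval-v1",
)
tag = run.fingerprint()
```

Storing the fingerprint next to each logged answer makes it possible to tell later whether two differing answers came from the same configuration.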

At inference time, a well-specified system handles at least three states: useful evidence was found, conflicting evidence was found, or no adequate evidence was found. Always answering can hide the last two states. Abstention or escalation can reduce unsupported answers, but it should be evaluated for both false refusals and failures to refuse.
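
Both abstention errors can be computed from a labeled set. The sketch assumes each record states whether the request was answerable from the corpus and whether the system refused; the function and key names are hypothetical.

```python
def abstention_rates(records: list[tuple[bool, bool]]) -> dict[str, float]:
    """records holds (answerable, refused) pairs; returns the two abstention error rates."""
    refused_answerable = [refused for answerable, refused in records if answerable]
    refused_unanswerable = [refused for answerable, refused in records if not answerable]
    false_refusal = (
        sum(refused_answerable) / len(refused_answerable) if refused_answerable else 0.0
    )
    failure_to_refuse = (
        sum(not r for r in refused_unanswerable) / len(refused_unanswerable)
        if refused_unanswerable else 0.0
    )
    return {"false_refusal_rate": false_refusal, "failure_to_refuse_rate": failure_to_refuse}


rates = abstention_rates([
    (True, False), (True, True), (True, False),    # answerable: one needless refusal
    (False, True), (False, False), (False, False),  # unanswerable: two answered anyway
])
```

In this example the false-refusal rate is 1/3 and the failure-to-refuse rate is 2/3; reporting only one of the two would hide the tradeoff.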

Evaluation

RAG evaluation should separate the stages that can fail. End-to-end answer accuracy alone cannot show whether an error came from the corpus, retrieval, context assembly, or generation. Component diagnostics also prevent a generator's parametric knowledge from masking a retriever that did not find the evidence.

Retrieval evaluation

Retrieval evaluation requires queries, a defined corpus snapshot, and relevance judgments. Depending on the task, measures may include recall at K, precision at K, mean reciprocal rank, average precision, or normalized discounted cumulative gain. The metric should match the downstream need. If an answer requires two passages, retrieving only one may count as partial recall but still make the answer impossible. If the corpus has incomplete relevance labels, an apparently false positive may be an unjudged relevant item.
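
Three of these measures, plus a check for the two-passage case, can be written compactly. The sketch assumes binary relevance sets for recall and reciprocal rank and graded gains for nDCG; all function names are hypothetical, and the nDCG variant uses linear gains with a log2 discount.

```python
from math import log2


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Share of relevant ids that appear in the first k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """Discounted gain of the ranking divided by that of an ideal ordering of the judged items."""
    dcg = sum(gains.get(d, 0.0) / log2(i + 1) for i, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


def all_required_at_k(ranked: list[str], required: set[str], k: int) -> bool:
    """True only if every passage the answer needs is within the first k results."""
    return required <= set(ranked[:k])


ranked = ["d4", "d1", "d9", "d2"]
needed = {"d1", "d2"}
partial = recall_at_k(ranked, needed, k=2)         # 0.5
answerable = all_required_at_k(ranked, needed, 2)  # False
```

Recall at 2 reports 0.5 here, yet the answer needs both passages, so the stricter all-required check is the one that matches the downstream need.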

Retrieval quality includes more than topical similarity. Evidence may be topically relevant but outdated, unauthorized, contradicted, or insufficient for the requested claim. Evaluation can stratify by domain, language, date, document type, query ambiguity, multi-hop depth, and the presence of an answer in the corpus. BEIR's cross-domain results illustrate why an in-domain retrieval score should not be treated as a generalization guarantee ^[14].

Generation and evidence evaluation

Generation evaluation can measure task correctness, faithfulness to provided evidence, relevance to the request, completeness, uncertainty handling, citation quality, and style. These dimensions can conflict. A concise answer may be relevant but omit a qualifier; a faithful answer can repeat an error in the source; a factually correct answer can be unsupported by the retrieved context because the generator supplied it from parametric memory.

RAGAS proposed reference-free measures for aspects of retrieved context, faithfulness, and answer quality ^[18]. ARES uses learned judges for context relevance, answer faithfulness, and answer relevance, combined with a small human-labeled set and prediction-powered inference ^[19]. These tools can accelerate experiments, but their scores depend on judge behavior, synthetic data, prompts, and task assumptions. They should be calibrated against human judgments and consequential errors in the target use.

Citation evaluation needs at least two questions: does each citation support the claim it is attached to, and are claims that need support actually cited? The ALCE benchmark evaluates long-form generation along correctness, citation quality, and fluency, and separates citation correctness from citation completeness ^[20]. Provenance-aware benchmarks such as KILT additionally test whether the system identified source pages, not just whether it produced a target string ^[8].
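
The two questions map onto two separate ratios. The sketch below is a simplified illustration and does not reproduce the exact metric definitions of ALCE or any other benchmark; it assumes that support judgments for each citation have already been made by a human or a judge model, and the class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    needs_support: bool
    # One entry per attached citation: True if that citation supports this claim.
    citation_supports: list[bool] = field(default_factory=list)


def citation_correctness(claims: list[Claim]) -> float:
    """Of all citations attached, the share that support their claim."""
    flags = [s for c in claims for s in c.citation_supports]
    return sum(flags) / len(flags) if flags else 0.0


def citation_completeness(claims: list[Claim]) -> float:
    """Of claims that need support, the share with at least one supporting citation."""
    needing = [c for c in claims if c.needs_support]
    if not needing:
        return 1.0
    return sum(any(c.citation_supports) for c in needing) / len(needing)


claims = [
    Claim("A", needs_support=True, citation_supports=[True]),
    Claim("B", needs_support=True, citation_supports=[False]),  # cited, but not supported
    Claim("C", needs_support=True),                              # uncited
    Claim("D", needs_support=False),                             # connective sentence
]
```

For these four claims correctness is 1/2 and completeness is 1/3, which shows how an answer can look well cited while leaving most of its claims unsupported.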

RAGTruth assembled nearly 18,000 manually annotated responses across question answering, data-to-text generation, and summarization to study hallucinations in RAG outputs ^[21]. Its existence reflects a basic evaluation fact: providing retrieved context does not make every generated statement entailed by that context. Human review remains important where a wrong synthesis, omitted exception, or misplaced citation has high cost.

Operational evaluation

Operational measures include latency at each stage, index size, update delay, retrieval and generation cost, cache behavior, failure recovery, access-control correctness, and observability. Quality should be measured under the same context budgets and time limits used in production. An expensive reranker may improve a benchmark yet miss a latency objective; a cache may reduce latency while serving stale or unauthorized content.

Evaluation sets should include answerable and unanswerable requests, ambiguous queries, conflicting sources, outdated documents, malicious content, permission boundaries, and corpus changes. Online monitoring should not rely only on user ratings because users may not detect a plausible unsupported answer. Logged source identifiers and versions make later incident analysis possible, subject to privacy and retention constraints.

Failure modes and limitations

RAG introduces a pipeline of dependent components. It can improve a system only when the external information and the mechanism for using it are good enough for the task.

Retrieval and corpus failures

A coverage failure occurs when the necessary information is absent from the corpus. A retrieval miss occurs when it is present but not selected. A granularity failure occurs when segmentation separates evidence from a definition, header, table, exception, or neighboring passage needed to interpret it. A ranking failure places useful evidence below the context cutoff. Metadata errors can silently filter out the right source or admit a source the user is not permitted to see.
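
Given a labeled evidence item, the first three failure types can be told apart mechanically. The sketch assumes one known gold identifier per query and a fixed context cutoff; granularity and metadata failures need additional information and are not covered. The function name is hypothetical.

```python
def diagnose(gold_id: str, corpus_ids: set[str],
             ranked_ids: list[str], context_cutoff: int) -> str:
    """Classify where the gold evidence was lost on its way to the generator."""
    if gold_id not in corpus_ids:
        return "coverage failure"      # the evidence was never in the corpus
    if gold_id not in ranked_ids:
        return "retrieval miss"        # present in the corpus but not selected
    if ranked_ids.index(gold_id) >= context_cutoff:
        return "ranking failure"       # selected but ranked below the context cutoff
    return "evidence in context"


corpus = {"p1", "p2", "p3", "p4"}
result = diagnose("p3", corpus, ranked_ids=["p1", "p2", "p3"], context_cutoff=2)
# result == "ranking failure"
```

A result of "evidence in context" does not mean the answer was correct; it only moves the investigation from retrieval to context assembly and generation.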

Retrieved material can also be low quality. Experiments on retrieval noise show that inappropriate passages can reduce answer quality in tested models ^[22]. Relevance is not truth: counterfactual-noise experiments found that relevant-looking but conflicting passages could mislead studied retrieval-augmented models ^[23]. Systems that search changing or user-contributed corpora need source validation, versioning, and conflict handling rather than assuming that high similarity means authority.

The correct behavior when retrieval fails is itself an evaluation target. NoMIRACL contains relevant and non-relevant passage settings across 18 languages and tests both using relevant evidence and avoiding answers based on irrelevant evidence ^[24]. A system that refuses every difficult request avoids some hallucinations but has poor usefulness; a system that always answers has the opposite failure. Thresholds must be chosen for the application's error costs.

Generation and attribution failures

The generator may ignore a relevant passage, overgeneralize from it, combine incompatible sources, copy an error, or introduce unsupported details. Retrieval augmentation reduced hallucination in specific conversational experiments ^[25], but that result does not establish a general guarantee. The outcome depends on the source collection, retrieval quality, model, prompt, and task.

A citation is not proof of entailment. Citation-generation systems can attach a plausible source that discusses the topic without supporting the exact claim. They can also cite one passage for a sentence containing several claims when only one is supported. Verification should operate at the claim level and distinguish direct support, contradiction, missing evidence, and source-quality concerns.

RAG cannot by itself resolve normative or interpretive questions. If sources disagree about policy, diagnosis, or causation, a generator should not silently collapse the disagreement into one confident answer. Source authority also depends on the question: a primary experiment can establish what its authors did, while a standards body may be authoritative about its own standard but not about an empirical effect outside its evidence.

Long context and retrieval

Long-context inference and RAG are alternatives for selecting information, but they can also be combined. Supplying an entire collection avoids a separate retrieval miss only when the collection fits and the model can reliably use it. It may increase input cost, expose more irrelevant or sensitive text, and make attribution harder.

In multi-document question answering and key-value retrieval experiments, "Lost in the Middle" found that performance often varied with the position of the relevant information and could be worse when it appeared in the middle of a long context ^[26]. That is a result for the tested tasks and models, not a law of all long-context systems. LOFT found that studied long-context models could rival specialized retrieval or RAG systems on some real-world tasks, while results remained sensitive to task, model, prompting, and reasoning requirements ^[27].

The choice is therefore empirical. RAG may be preferable when the corpus is too large, changes frequently, requires source-level access control, or benefits from explicit provenance. Direct long context may be useful for a bounded collection whose interactions are difficult for a retriever to anticipate. Hybrid designs can retrieve a document set and then give the generator a relatively long contiguous context. Comparisons should report answer quality, source coverage, latency, and cost on the same tasks and under the same security boundaries.

Security and governance

Retrieved content is an input channel, not a trusted instruction channel. Indirect prompt injection research demonstrated that attacker-controlled external content can manipulate LLM-integrated applications when that content is processed as part of a model's input ^[28]. In RAG, a malicious passage may tell the model to ignore system instructions, reveal data, call a tool, or produce attacker-chosen text. Delimiters and instructions can help a model distinguish data from commands, but they are not a complete security boundary.

Corpus integrity is another attack surface. PoisonedRAG demonstrated knowledge-corruption attacks in which a small number of crafted texts inserted into a large corpus caused tested systems to produce attacker-chosen answers for target questions ^[29]. The reported attack rates belong to the paper's threat models and experiments; they should not be generalized to every system. The broader conclusion is that write access, ingestion, ranking, and source trust need security controls.

Risk management spans the full lifecycle. NIST's Generative AI Profile recommends attention to data and content provenance, third-party components, pre-deployment testing, monitoring, incident disclosure, and the limits of current measurement ^[30]. For RAG, this translates into documented source policy, authenticated ingestion, least-privilege retrieval, versioned indexes, tests for data leakage and conflicting evidence, output review proportional to consequence, and a way to remove or quarantine a source.

OWASP's guidance on vector and embedding weaknesses highlights cross-context leakage, access-control failures, poisoning, and inadequate data validation ^[31]. Authorization filters must be applied to every retrieval path, including caches and fallback search. Tenant identifiers should not be treated as the only defense if the underlying index can return another tenant's content. Sensitive data also remains sensitive after embedding; an index is not anonymization.
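
Applying authorization before ranking, rather than after truncation, can be sketched as follows. The `Chunk` fields, the group model, and the function name are assumptions for illustration; a real deployment would enforce the same rule inside the index or search service and on every cache and fallback path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    tenant: str
    allowed_groups: frozenset
    score: float


def authorized_top_k(chunks: list[Chunk], tenant: str,
                     user_groups: set, k: int) -> list[Chunk]:
    """Filter by tenant and group first, then rank, so unauthorized chunks never compete."""
    visible = [
        c for c in chunks
        if c.tenant == tenant and c.allowed_groups & user_groups
    ]
    visible.sort(key=lambda c: c.score, reverse=True)
    return visible[:k]


chunks = [
    Chunk("a", "t1", frozenset({"hr"}), 0.90),   # same tenant, group the user lacks
    Chunk("b", "t2", frozenset({"all"}), 0.95),  # another tenant's content
    Chunk("c", "t1", frozenset({"all"}), 0.50),
]
allowed = authorized_top_k(chunks, tenant="t1", user_groups={"all"}, k=1)
```

Only chunk `c` is returned. Taking the global top-1 first and filtering afterwards would have selected `b`, discarded it, and left the user with no result even though an authorized chunk exists.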

OWASP separately notes that RAG does not fully mitigate prompt injection ^[32]. Defenses are layered: restrict source and tool permissions, isolate untrusted content, validate retrieved metadata, minimize context, constrain tool calls outside the language model, monitor anomalous retrievals, and require human confirmation for irreversible or high-impact actions. None of these controls proves that a deployment is secure. A security claim must specify the attacker, assets, access, system boundaries, and tested controls.

Retrieval-augmented generation is best understood as a configurable evidence pipeline. Its value comes from making external information available at generation time and, when designed explicitly, preserving a trail back to sources. Its weaknesses come from the same dependency: the output inherits limitations from corpus construction, retrieval, context assembly, model behavior, and governance. Reliable use requires measuring each stage and preserving the option to say that adequate evidence was not found.

References

^Lewis, Patrick, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, and others. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." In *Advances in Neural Information Processing Systems 33*, 2020. proceedings.neurips.cc/...6b493230-Abstract
^Guu, Kelvin, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. "REALM: Retrieval-Augmented Language Model Pre-Training." In *Proceedings of the 37th International Conference on Machine Learning*, 2020. proceedings.mlr.press/...guu20a
^Karpukhin, Vladimir, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. "Dense Passage Retrieval for Open-Domain Question Answering." In *Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing*, 2020. aclanthology.org/2020.emnlp-main.550
^Khandelwal, Urvashi, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. "Generalization through Memorization: Nearest Neighbor Language Models." In *International Conference on Learning Representations*, 2020. openreview.net/forum
^Izacard, Gautier, and Edouard Grave. "Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering." In *Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics*, 2021. aclanthology.org/2021.eacl-main.74
^Borgeaud, Sebastian, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, and others. "Improving Language Models by Retrieving from Trillions of Tokens." In *Proceedings of the 39th International Conference on Machine Learning*, 2022. proceedings.mlr.press/...borgeaud22a
^Izacard, Gautier, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, and others. "Atlas: Few-shot Learning with Retrieval Augmented Language Models." *Journal of Machine Learning Research* 24, no. 251 (2023): 1-43. jmlr.org/...23-0037
^Petroni, Fabio, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, and others. "KILT: a Benchmark for Knowledge Intensive Language Tasks." In *Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics*, 2021. aclanthology.org/2021.naacl-main.200
^Asai, Akari, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection." In *International Conference on Learning Representations*, 2024. openreview.net/forum
^Jiang, Zhengbao, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. "Active Retrieval Augmented Generation." In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, 2023. aclanthology.org/2023.emnlp-main.495
^Trivedi, Harsh, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. "Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions." In *Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics*, 2023. aclanthology.org/2023.acl-long.557
^Gao, Luyu, Xueguang Ma, Jimmy Lin, and Jamie Callan. "Precise Zero-Shot Dense Retrieval without Relevance Labels." In *Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics*, 2023. aclanthology.org/2023.acl-long.99
^Robertson, Stephen, and Hugo Zaragoza. "The Probabilistic Relevance Framework: BM25 and Beyond." *Foundations and Trends in Information Retrieval* 3, no. 4 (2009): 333-389. doi.org/...1500000019
^Thakur, Nandan, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models." *Proceedings of the NeurIPS Track on Datasets and Benchmarks* 1, 2021. datasets-benchmarks-proceedings.neurips.cc/...t-round2
^Santhanam, Keshav, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia. "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction." In *Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics*, 2022. aclanthology.org/2022.naacl-main.272
^Malkov, Yu A., and D. A. Yashunin. "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs." *IEEE Transactions on Pattern Analysis and Machine Intelligence* 42, no. 4 (2020): 824-836. doi.org/...TPAMI.2018.2889473
^Wang, Xiaohua, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, and others. "Searching for Best Practices in Retrieval-Augmented Generation." In *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, 2024. aclanthology.org/2024.emnlp-main.981
^Es, Shahul, Jithin James, Luis Espinosa Anke, and Steven Schockaert. "RAGAs: Automated Evaluation of Retrieval Augmented Generation." In *Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations*, 2024. aclanthology.org/2024.eacl-demo.16
^Saad-Falcon, Jon, Omar Khattab, Christopher Potts, and Matei Zaharia. "ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems." In *Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics*, 2024. aclanthology.org/2024.naacl-long.20
^Gao, Tianyu, Howard Yen, Jiatong Yu, and Danqi Chen. "Enabling Large Language Models to Generate Text with Citations." In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, 2023. aclanthology.org/2023.emnlp-main.398
^Niu, Cheng, Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Randy Zhong, Juntong Song, and Tong Zhang. "RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models." In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics*, 2024. aclanthology.org/2024.acl-long.585
^Fang, Feiteng, Yuelin Bai, Shiwen Ni, Min Yang, Xiaojun Chen, and Ruifeng Xu. "Enhancing Noise Robustness of Retrieval-Augmented Language Models with Adaptive Adversarial Training." In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics*, 2024. aclanthology.org/2024.acl-long.540
^Hong, Giwon, Jeonghwan Kim, Junmo Kang, Sung-Hyon Myaeng, and Joyce Jiyoung Whang. "Why So Gullible? Enhancing the Robustness of Retrieval-Augmented Models against Counterfactual Noise." In *Findings of the Association for Computational Linguistics: NAACL 2024*, 2024. aclanthology.org/2024.findings-naacl.159
^Thakur, Nandan, Luiz Bonifacio, Crystina Zhang, Odunayo Ogundepo, Ehsan Kamalloo, and others. "Knowing When You Don't Know: A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation." In *Findings of the Association for Computational Linguistics: EMNLP 2024*, 2024. aclanthology.org/2024.findings-emnlp.730
^Shuster, Kurt, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. "Retrieval Augmentation Reduces Hallucination in Conversation." In *Findings of the Association for Computational Linguistics: EMNLP 2021*, 2021. aclanthology.org/2021.findings-emnlp.320
^Liu, Nelson F., Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. "Lost in the Middle: How Language Models Use Long Contexts." *Transactions of the Association for Computational Linguistics* 12 (2024): 157-173. aclanthology.org/2024.tacl-1.9
^Lee, Jinhyuk, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, and others. "Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?" arXiv, 2024. arxiv.org/...2406.13121
^Greshake, Kai, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. "Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection." arXiv, 2023. arxiv.org/...2302.12173
^Zou, Wei, Runpeng Geng, Binghui Wang, and Jinyuan Jia. "PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models." In *34th USENIX Security Symposium*, 2025. usenix.org/...zou-poisonedrag
^Autio, Chloe, Reva Schwartz, Jesse Dunietz, Shomik Jain, Martin Stanley, and others. "Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile." NIST AI 600-1, 2024. nist.gov/...ork-generative-artificial-intelligence
^OWASP GenAI Security Project. "LLM08:2025 Vector and Embedding Weaknesses." 2025. genai.owasp.org/...vector-and-embedding-weaknesses
^OWASP GenAI Security Project. "LLM01:2025 Prompt Injection." 2025. genai.owasp.org/...llm01-prompt-injection

10 revisions by 1 contributor · v11 · 4,976 words

Reviewer note: Independently verified against 32 primary, peer-reviewed, official, government, and authoritative records covering definitions, system architecture, original formulations, retrieval, indexing, updating, evaluation, failure modes, long-context tradeoffs, security, governance, and evidence limits; technical, mathematical, empirical, currentness, and scope claims checked through 2026-07-28.
