Dense Passage Retrieval (DPR)
Last reviewed
May 1, 2026
Sources
15 citations
Review status
Source-backed
Revision
v1 · 3,838 words
Dense Passage Retrieval (DPR) is a neural information retrieval method that uses a dual-encoder BERT architecture to map questions and passages into dense vectors, then retrieves passages by maximum inner-product search over those vectors. Introduced by Facebook AI Research in April 2020, it was the first large-scale demonstration that learned dense representations can substantially outperform classical sparse retrievers such as BM25 on open-domain question answering, and it became the technical template for a generation of retrievers that now power production retrieval-augmented generation (RAG) systems.
The foundational paper, Dense Passage Retrieval for Open-Domain Question Answering, was posted to arXiv on April 10, 2020 and presented at EMNLP 2020. Within two years it had become one of the most-cited papers in the modern retrieval literature, and the dual-encoder recipe it popularised (BERT bi-encoder, contrastive loss, in-batch negatives, hard negatives mined from BM25) is now the default starting point for almost every commercial and open-source embeddings model used for search and RAG.
Before DPR, open-domain QA pipelines almost always began with a sparse retriever. BM25, a probabilistic ranking function from the 1990s based on TF-IDF statistics with length normalisation, was the de facto first stage. Neural rerankers were sometimes layered on top, but learning the first-stage retriever end-to-end was considered hard: BERT cross-encoders are too slow to apply to millions of passages at query time, and earlier dense methods (e.g. ORQA, REALM) either required expensive auxiliary pretraining or made strong assumptions that did not transfer well to standard benchmarks.
The DPR authors set out to test a much simpler hypothesis: with the right contrastive training objective and a sensible negative-sampling strategy, an off-the-shelf BERT bi-encoder fine-tuned on existing QA supervision should be enough to beat BM25. They were right, by a large margin.
The paper was written by Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen and Wen-tau Yih. The team spanned Facebook AI Research, Princeton University (Danqi Chen), and the University of Washington (Sewon Min). Several authors went on to shape much of what came next in retrieval and RAG: Patrick Lewis was first author on the original RAG paper a few months later, Wen-tau Yih continued to publish on dense retrieval and multi-hop QA, and Sewon Min produced influential work on retrieval evaluation and on memory-augmented language models.
| Item | Value |
|---|---|
| First arXiv submission | April 10, 2020 (v1) |
| Final arXiv revision | September 30, 2020 (v3) |
| Published venue | EMNLP 2020 |
| ACL Anthology DOI | 10.18653/v1/2020.emnlp-main.550 |
| Code | github.com/facebookresearch/DPR |
| Hugging Face module | transformers.models.dpr |
DPR is a bi-encoder (also called a dual-encoder or two-tower) model. Two separate BERT-base networks are used: a question encoder $E_Q$ and a passage encoder $E_P$. Each takes raw text, runs it through 12 transformer layers, and returns the 768-dimensional hidden state of the [CLS] token as the dense representation. So a question $q$ becomes a vector $E_Q(q) \in \mathbb{R}^{768}$, and a passage $p$ becomes $E_P(p) \in \mathbb{R}^{768}$.
The similarity between a question and a passage is the dot product of these two vectors:
$$\text{sim}(q, p) = E_Q(q)^\top E_P(p)$$
The authors also experimented with cosine similarity and Euclidean (L2) distance. L2 performed comparably to the dot product and both were better than cosine, so the released models use the plain dot product, which is also much friendlier to vector indexes.
The two-tower design is what makes DPR usable at scale. Because the passage encoder does not see the question, every passage in the corpus can be encoded once, offline, and stored in a vector index. At query time only the question needs to be encoded. Retrieval then reduces to a maximum inner-product search (MIPS), which can be solved in milliseconds over millions of vectors using approximate nearest-neighbour indexes such as HNSW or IVF with product quantization, as implemented in libraries like FAISS. This is the key reason a bi-encoder is preferred over a BERT cross-encoder for first-stage retrieval, even though the cross-encoder would be more accurate per pair: the cross-encoder cannot precompute anything about a passage and would need to run a full forward pass for every (query, passage) pair at query time.
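The offline/online split described above can be illustrated with a minimal sketch. It uses random NumPy vectors as stand-ins for the encoder outputs and an exact (brute-force) inner-product search rather than an approximate index; the function name `top_k_inner_product` and the toy sizes are assumptions made for illustration, not part of DPR.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for E_P(p): in a real system these rows would be produced once,
# offline, by the passage encoder and stored in an index.
num_passages, dim = 1000, 768
passage_vectors = rng.standard_normal((num_passages, dim)).astype(np.float32)


def top_k_inner_product(query_vector, passage_matrix, k):
    """Exact maximum inner-product search: score every passage, keep the best k."""
    scores = passage_matrix @ query_vector        # shape: (num_passages,)
    top = np.argpartition(-scores, k - 1)[:k]     # the k best indices, unordered
    top = top[np.argsort(-scores[top])]           # order those k by descending score
    return top, scores[top]


# Stand-in for E_Q(q): a query vector deliberately placed close to passage 42.
noise = 0.1 * rng.standard_normal(dim).astype(np.float32)
query_vector = passage_vectors[42] + noise

indices, scores = top_k_inner_product(query_vector, passage_vectors, k=5)
print(indices, scores)
```

Only the last three lines run per query; everything above them corresponds to work done ahead of time. An approximate index trades a little recall for not having to score every row.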
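Alongside the retriever, the original DPR repository releases a DPRReader based on BERT that reads the top-k retrieved passages and extracts a final answer span. The full open-domain QA pipeline is therefore: the question encoder embeds the question, a MIPS index over the precomputed passage vectors returns the top-k passages, and the reader extracts an answer span from those passages.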
The Hugging Face transformers library exposes this as three model classes: DPRQuestionEncoder, DPRContextEncoder, and DPRReader, each with matching tokenizers.
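As a hedged sketch of how the two encoder classes might be used together, the function below embeds one question and a few passages and returns their dot-product scores. The helper name `encode_with_dpr` and the `max_length=256` truncation are assumptions for illustration; the checkpoint names are the published single-NQ checkpoints. Imports are placed inside the function so that defining it has no dependencies, while calling it requires `torch` and `transformers` and downloads the weights.

```python
def encode_with_dpr(question, passages):
    """Return dot-product scores between one question and a list of passages."""
    import torch
    from transformers import (
        DPRContextEncoder,
        DPRContextEncoderTokenizer,
        DPRQuestionEncoder,
        DPRQuestionEncoderTokenizer,
    )

    q_name = "facebook/dpr-question_encoder-single-nq-base"
    p_name = "facebook/dpr-ctx_encoder-single-nq-base"

    q_tok = DPRQuestionEncoderTokenizer.from_pretrained(q_name)
    q_enc = DPRQuestionEncoder.from_pretrained(q_name)
    p_tok = DPRContextEncoderTokenizer.from_pretrained(p_name)
    p_enc = DPRContextEncoder.from_pretrained(p_name)

    with torch.no_grad():
        q_inputs = q_tok(question, return_tensors="pt")
        p_inputs = p_tok(
            passages,
            padding=True,
            truncation=True,
            max_length=256,  # assumed cap, chosen for illustration
            return_tensors="pt",
        )
        q_vec = q_enc(**q_inputs).pooler_output    # shape: (1, 768)
        p_vecs = p_enc(**p_inputs).pooler_output   # shape: (len(passages), 768)

    # sim(q, p) = E_Q(q)^T E_P(p) for every passage.
    return (q_vec @ p_vecs.T).squeeze(0)
```

A call such as `encode_with_dpr("who wrote hamlet", ["Hamlet is a tragedy by William Shakespeare.", "Paris is the capital of France."])` would return one score per passage, with higher meaning more similar.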
DPR is trained with a contrastive objective on existing question-passage pairs. For a batch of $B$ questions, each with one positive passage, the loss is the negative log-likelihood of the positive under a softmax over the positive plus a set of negatives:
$$\mathcal{L}(q_i, p_i^+, \{p_{i,j}^-\}) = -\log \frac{\exp(\text{sim}(q_i, p_i^+))}{\exp(\text{sim}(q_i, p_i^+)) + \sum_j \exp(\text{sim}(q_i, p_{i,j}^-))}$$
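The paper's central practical insight is how it constructs negatives. Three strategies were studied: random passages drawn from the corpus; BM25 negatives, meaning top BM25 hits that match many question tokens but do not contain the answer; and gold negatives, meaning the positive passages of other questions, which can be reused from the same mini-batch as in-batch negatives.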
The in-batch trick is what makes the training cheap. With a batch of 128 questions, each example sees 127 negatives essentially for free, since their encoded vectors are already in GPU memory. Adding even one BM25 hard negative per question on top of in-batch negatives gave the best results in the paper. The released single checkpoints use one BM25 hard negative + 127 in-batch negatives per question.
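The loss with in-batch negatives can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the released training code: the function name `in_batch_nll` is hypothetical, and it assumes one particular arrangement in which each question is scored against all B positives in the batch plus only its own hard negative, matching the "1 hard + (B - 1) in-batch" count described above.

```python
import numpy as np


def in_batch_nll(q, p_pos, p_hard=None):
    """Mean negative log-likelihood of each question's own positive passage.

    q:      (B, d) question vectors
    p_pos:  (B, d) positive passage vectors, row i belongs to question i
    p_hard: optional (B, d) hard negatives, row i belongs to question i
    """
    # Column j of row i is sim(q_i, p_j^+); the diagonal holds the true positives,
    # and the off-diagonal entries are the "free" in-batch negatives.
    scores = q @ p_pos.T                                   # (B, B)
    if p_hard is not None:
        own_hard = np.sum(q * p_hard, axis=1, keepdims=True)  # (B, 1)
        scores = np.concatenate([scores, own_hard], axis=1)   # (B, B + 1)

    # Numerically stable log-softmax over each row.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))

    idx = np.arange(q.shape[0])
    return float(-log_probs[idx, idx].mean())
```

With all-zero vectors every candidate is equally likely, so the loss is log(B) without hard negatives and log(B + 1) with one hard negative per question, which is a convenient sanity check.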
The main NQ models in the paper are trained with:
| Setting | Value |
|---|---|
| Encoder backbone | bert-base-uncased |
| Output dimension | 768 (CLS token) |
| Optimiser | Adam |
| Learning rate | 1e-5 |
| Linear warm-up | yes |
| Batch size | 128 |
| Epochs | 40 |
| Negatives per question | 1 BM25 hard + 127 in-batch |
| Compute | 8 x 32 GB GPUs, ~1 day |
DPR is trained on the standard open-domain QA datasets: Natural Questions, TriviaQA, WebQuestions, CuratedTREC, and SQuAD. The paper reports two main settings: Single (a separate retriever per dataset) and Multi (one retriever trained on the union of NQ, TriviaQA, WebQuestions, and CuratedTREC, with SQuAD held out because of its narrow Wikipedia coverage). The retrieval corpus is the December 2018 English Wikipedia dump segmented into 100-word passages, yielding about 21 million passages.
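The passage segmentation step can be sketched as follows. This is a simplified illustration that assumes plain whitespace tokenisation and non-overlapping blocks; the function name `split_into_passages` is hypothetical and the original preprocessing may differ in details.

```python
def split_into_passages(text, words_per_passage=100):
    """Split text into consecutive, non-overlapping blocks of at most N words."""
    words = text.split()
    return [
        " ".join(words[start:start + words_per_passage])
        for start in range(0, len(words), words_per_passage)
    ]


# A 250-word toy article yields two full passages and one shorter remainder.
article = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(article)
print([len(p.split()) for p in passages])
```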
The original paper reports top-20 and top-100 retrieval accuracy, defined as the fraction of questions for which at least one of the top-k retrieved passages contains the gold answer string. The headline numbers from Table 2 of the paper are reproduced below, first for top-20 and then for top-100.
Top-20 retrieval accuracy (%):
| Retriever | NQ | TriviaQA | WebQuestions | CuratedTREC | SQuAD |
|---|---|---|---|---|---|
| BM25 | 59.1 | 66.9 | 55.0 | 70.9 | 68.8 |
| DPR (Single) | 78.4 | 79.4 | 73.2 | 79.8 | 63.2 |
| BM25 + DPR | 76.6 | 79.8 | 71.0 | 85.2 | 71.5 |
Top-100 retrieval accuracy (%):
| Retriever | NQ | TriviaQA | WebQuestions | CuratedTREC | SQuAD |
|---|---|---|---|---|---|
| BM25 | 73.7 | 76.7 | 71.1 | 84.1 | 80.0 |
| DPR (Single) | 85.4 | 85.0 | 81.4 | 89.1 | 77.2 |
| BM25 + DPR | 83.8 | 84.5 | 80.5 | 92.7 | 81.3 |
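The top-k retrieval accuracy metric defined above can be sketched as follows. This is a simplified version that assumes case-insensitive substring matching; real evaluation scripts typically normalise and tokenise text more carefully, and the function name `top_k_accuracy` is hypothetical.

```python
def top_k_accuracy(retrieved, answers, k):
    """Fraction of questions whose top-k passages contain at least one gold answer.

    retrieved: per question, a ranked list of passage strings (best first)
    answers:   per question, a list of acceptable answer strings
    """
    hits = 0
    for passages, golds in zip(retrieved, answers):
        top = [p.lower() for p in passages[:k]]
        if any(g.lower() in p for p in top for g in golds):
            hits += 1
    return hits / len(retrieved)


retrieved = [
    ["Berlin is in Germany.", "Paris is the capital of France."],
    ["Bananas are yellow.", "Apples can be red or green."],
]
answers = [["Paris"], ["Canberra"]]

print(top_k_accuracy(retrieved, answers, k=1))
print(top_k_accuracy(retrieved, answers, k=2))
```

In this toy data the first question is a miss at k=1 and a hit at k=2, while the second is a miss at both, which shows why accuracy can only stay flat or rise as k grows.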
The absolute improvements over BM25 on the four datasets where DPR is trained range from about 9 to 19 points at top-20. SQuAD is the one dataset where DPR underperforms BM25, which the authors attribute to two factors: SQuAD questions are written by annotators who saw the passage and thus contain unusually high lexical overlap with it (favouring BM25), and the SQuAD passages are drawn from a small set of Wikipedia articles, making the training distribution narrow.
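With DPR retrieval feeding their extractive reader, the paper reports exact-match scores of 41.5% on Natural Questions and 56.8% on TriviaQA, both state of the art at the time of publication and ahead of prior systems built on BM25 plus large readers. On WebQuestions the single-dataset retriever reaches 34.6%; the paper's state-of-the-art WebQuestions result comes from the multi-dataset (Multi) retriever instead.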
The dual-encoder recipe DPR popularised was extended in many directions, often by changing the negatives, the encoder, or the interaction function.
| Method | Year | Key idea | Why it matters |
|---|---|---|---|
| Sentence-BERT | 2019 | Siamese BERT for general sentence embeddings | Predates DPR; established the bi-encoder pattern for similarity, though not specifically for QA retrieval |
| DPR (Single / Multi) | 2020 | BERT bi-encoder + in-batch + BM25 hard negatives | The reference dense retriever for open-domain QA |
| ColBERT | 2020 | Late interaction over per-token embeddings | Better accuracy than DPR with manageable cost; ColBERTv2 (2022) compresses the token index |
| ANCE | 2021 | Asynchronously updated ANN-mined hard negatives | Showed that the negatives, not the encoder, were often the bottleneck |
| RocketQA / RocketQAv2 | 2021 | Cross-encoder distillation, denoising hard negatives | Strong gains on MS MARCO and NQ |
| coCondenser / Condenser | 2021 | Retrieval-oriented pretraining objective for the bi-encoder | Better dense retrievers from the same fine-tuning data |
| SimCSE | 2021 | Simple contrastive sentence embeddings | Strong general embeddings from unsupervised contrastive learning |
| GTR | 2021 | T5-based dual encoder, scaled to 4.8B parameters | Showed that dual-encoder accuracy keeps improving with scale |
| mDPR / CORA | 2021 | Multilingual and cross-lingual variants of DPR | Extends DPR to non-English QA |
| E5 | 2022 | Weakly-supervised contrastive pretraining at web scale | Strong out-of-the-box retriever, dominant on MTEB for a period |
| DRAGON | 2023 | Diverse augmentation across queries and supervision | Single retriever that is robust both in- and out-of-domain |
| BGE | 2023 | RetroMAE-pretrained encoder, contrastive fine-tuning | Open-source MTEB leader; widely used in production RAG |
| GTE / NV-Embed / Voyage / Cohere Embed | 2023-2025 | Larger backbones, instruction tuning, hard negatives at scale | Modern commercial and open-source embeddings, all bi-encoders |
With the exception of ColBERT-style late interaction, every entry in this table shares DPR's basic design choice (Sentence-BERT, which predates DPR, arrived at it independently): encode query and document independently, train with a contrastive loss, and search by inner product or cosine similarity. The improvements come from how the encoders are pretrained, what negatives they see, and how supervision is mixed.
ColBERT (Khattab and Zaharia, SIGIR 2020) is the most important alternative architecture. Instead of pooling each input into a single 768-dim vector, it stores per-token embeddings and computes similarity as a sum of MaxSim operations between query tokens and document tokens. This late interaction preserves fine-grained matching that DPR's single-vector representation discards, at the cost of a much larger index. ColBERTv2 (Santhanam et al. 2022) compresses these token vectors with residual quantisation, narrowing the storage gap. Late-interaction methods consistently match or exceed single-vector dense retrievers on hard benchmarks like BEIR.
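The MaxSim scoring rule can be sketched in NumPy. This is an illustrative sketch only: it takes per-token embedding matrices as given and uses raw dot products, whereas a real late-interaction model also involves its own encoder, normalisation, and indexing; the function name `maxsim_score` is hypothetical.

```python
import numpy as np


def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: for each query token, take its best-matching
    document token, then sum those maxima over all query tokens.

    query_tokens: (num_query_tokens, d)
    doc_tokens:   (num_doc_tokens, d)
    """
    sim = query_tokens @ doc_tokens.T      # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())


query_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_tokens = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]])

# Query token 0 matches doc token 0 best (1.0); query token 1 matches doc token 2 best (2.0).
print(maxsim_score(query_tokens, doc_tokens))  # 3.0
```

A single-vector retriever would have to collapse each of these matrices to one row before comparing them, which is the fine-grained information late interaction keeps.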
It is common to combine DPR-style first-stage retrieval with a BERT cross-encoder reranker that scores each (query, passage) pair jointly. The cross-encoder is too slow to use over millions of passages, but applied to the top 100 hits from a bi-encoder, it adds noticeable accuracy at modest cost. Most production RAG stacks today use exactly this two-stage pattern.
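The two-stage pattern can be sketched as below. The first stage is an exact dot-product search over precomputed vectors; the second stage calls a scoring function on each (query, passage) text pair. The names `retrieve_then_rerank` and `toy_overlap_scorer` are hypothetical, and the word-overlap scorer is only a stand-in for a real cross-encoder so that the example runs without model weights.

```python
import numpy as np


def retrieve_then_rerank(query_vec, passage_vecs, passages, query_text,
                         rerank_fn, first_stage_k=100, final_k=10):
    """Cheap vector search to get candidates, then an expensive pairwise rerank."""
    scores = passage_vecs @ query_vec
    k = min(first_stage_k, len(passages))
    candidates = np.argsort(-scores)[:k]                  # first-stage top-k
    reranked = sorted(
        candidates,
        key=lambda i: rerank_fn(query_text, passages[i]),  # joint (query, passage) score
        reverse=True,
    )
    return [int(i) for i in reranked[:final_k]]


def toy_overlap_scorer(query, passage):
    """Hypothetical stand-in for a cross-encoder: counts shared lowercase words."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


passages = [
    "paris is the capital of france",
    "berlin is in germany",
    "the capital of france hosts the louvre",
    "bananas are yellow",
]
passage_vecs = np.array([[0.5, 0.0], [0.9, 0.0], [0.8, 0.0], [-1.0, 0.0]])
query_vec = np.array([1.0, 0.0])

result = retrieve_then_rerank(
    query_vec, passage_vecs, passages,
    "what is the capital of france", toy_overlap_scorer,
    first_stage_k=3, final_k=2,
)
print(result)  # [0, 2]
```

The reranker only ever sees the three first-stage candidates, so its cost is bounded by `first_stage_k` regardless of corpus size.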
Dense retrievers and sparse retrievers solve the same problem with different inductive biases, and their failure modes are largely complementary. Hybrid retrieval, which combines a sparse and a dense retriever, is therefore standard in production.
| Property | BM25 (sparse) | SPLADE (learned sparse) | DPR (single-vector dense) | ColBERT (late interaction) |
|---|---|---|---|---|
| Representation | Term-frequency vector over the vocabulary | Sparse vector over BERT WordPiece vocabulary, learned | Single 768-dim dense vector per text | Per-token 128-dim dense vectors |
| Matching | Lexical overlap | Lexical + learned term expansion | Semantic, single vector | Semantic, fine-grained MaxSim |
| Training | None (bag-of-words statistics) | Contrastive on QA / MS MARCO + sparsity regulariser | Contrastive bi-encoder | Contrastive bi-encoder |
| Index | Inverted index | Inverted index | ANN over dense vectors (FAISS, HNSW) | Per-token dense ANN, optionally compressed |
| Query latency | Very low (CPU) | Low (CPU) | Low (GPU encode + ANN search) | Higher (GPU encode + multi-vector search) |
| GPU at query time | Not required | Not required | Not strictly required (CPU query encoding works; a GPU lowers latency) | Not strictly required (a GPU lowers latency) |
| Out-of-domain robustness | Strong | Strong | Often weak (BEIR) | Strong |
| Reference paper | Robertson, 1995 (Okapi) | Formal et al., 2021 | Karpukhin et al., 2020 | Khattab and Zaharia, 2020 |
A simple hybrid that adds normalised BM25 and DPR scores, or merges their result lists with reciprocal-rank fusion, often outperforms either retriever alone. The DPR paper itself reports that BM25 + DPR beats either method alone at top-20 on TriviaQA, CuratedTREC, and SQuAD, and at top-100 on CuratedTREC and SQuAD (Table 2 above).
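Reciprocal-rank fusion is simple enough to sketch directly. Each document receives 1 / (k + rank) from every ranked list it appears in, and the fused order sorts by the summed score. The constant k = 60 is a commonly used default rather than something the DPR paper prescribes, and the function name `reciprocal_rank_fusion` is hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    rankings: list of lists, each ordered best-first
    k:        smoothing constant; 60 is a conventional default (an assumption here)
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)


bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "d", "a"]

# "b" is ranked 2nd and 1st, "a" is ranked 1st and 3rd, so "b" edges ahead.
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))  # ['b', 'a', 'd', 'c']
```

Because it uses only ranks, this fusion sidesteps the problem that BM25 scores and dot-product scores live on different scales.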
Building a production retriever around DPR or one of its descendants involves a number of engineering choices that the original paper only touches on.
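A back-of-the-envelope calculation gives a sense of scale for these choices. The sketch below assumes roughly 21 million passages and 768-dimensional float32 vectors, as described earlier in this article, stored uncompressed; the 64-byte compressed code size is a hypothetical setting chosen only to illustrate the effect of quantisation.

```python
def flat_index_bytes(num_vectors, dim, bytes_per_value=4):
    """Memory for storing every vector uncompressed (float32 by default)."""
    return num_vectors * dim * bytes_per_value


num_passages = 21_000_000   # approximate corpus size from the article
dim = 768

flat = flat_index_bytes(num_passages, dim)
print(f"uncompressed: {flat / 1e9:.1f} GB")          # uncompressed: 64.5 GB

# Hypothetical compressed representation at 64 bytes per vector.
compressed = num_passages * 64
print(f"64-byte codes: {compressed / 1e9:.2f} GB")   # 64-byte codes: 1.34 GB
```

The gap between the two figures is the main reason compressed or approximate indexes are attractive once a corpus no longer fits comfortably in memory.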
For small corpora, exact inner-product search (FAISS IndexFlatIP) is fine. Beyond that, approximate methods such as HNSW and IVF + PQ are needed. Libraries and vector databases such as FAISS, Milvus, Pinecone, Weaviate, Qdrant, and pgvector all implement these structures with various trade-offs in build time, memory, and recall.
DPR, and to a lesser extent its descendants, has well-documented weaknesses.
The DPR paper has been cited several thousand times since 2020 and is a fixture in retrieval and QA reading lists. More importantly, its design choices became the default. The standard recipe of (1) bi-encoder transformer, (2) contrastive InfoNCE-style loss, (3) in-batch negatives plus mined hard negatives, and (4) inner-product nearest-neighbour search at inference is what every modern embedding model used for retrieval starts from. BGE, E5, GTE, NV-Embed, Cohere Embed v3, OpenAI's text-embedding-3 family, and Voyage AI's embeddings all follow this pattern, differing mainly in the encoder backbone (often much larger than BERT-base, sometimes decoder-only LLMs), the pretraining and fine-tuning data, and the volume of mined negatives.
DPR also shaped how people built retrieval-augmented systems. The original RAG paper used DPR's retriever directly. FiD, Atlas, RETRO, REPLUG, and the long line of RAG architectures that followed all assume that a strong dense retriever exists and that you can swap it in. Today the same architectural pattern underpins enterprise RAG products, code assistants that retrieve from internal codebases, biomedical and legal search engines, and the retrieval components used by web-augmented LLM chatbots.
In the broader embeddings ecosystem, DPR helped establish that contrastive bi-encoder training, not masked-language-model pretraining alone, is what produces useful retrieval vector embeddings. That insight directly motivated SimCSE, the E5 family, and the entire MTEB-driven competition for general-purpose embeddings models.
- facebookresearch/DPR (archived October 2023): the official PyTorch implementation, training scripts, pretrained NQ and Multi checkpoints, and the FAISS indexing code.
- Hugging Face transformers: DPRQuestionEncoder, DPRContextEncoder, DPRReader and matching tokenizers; weights such as facebook/dpr-question_encoder-single-nq-base and facebook/dpr-ctx_encoder-single-nq-base are downloadable.
Five years on, dense retrieval is production-standard and DPR itself is rarely deployed in its original form. The state of the art has moved to larger backbones, instruction-tuned and decoder-based encoders, and far more aggressive hard-negative mining at billion-document scale. Late-interaction approaches in the ColBERT lineage have resurged for accuracy-critical use cases. Sparse, dense, and late-interaction methods coexist in hybrid stacks because their failure modes are complementary, and BEIR-aware evaluation pushed the field to care about robustness, not just in-domain accuracy.
What has not changed is the architectural blueprint. The DPR paper showed that a simple bi-encoder, trained on a few hundred thousand QA pairs with the right negatives, could turn dense retrieval from a research curiosity into the default first stage of any modern semantic search or RAG system. That bet is the foundation almost every retriever still rests on.