# Dense Passage Retrieval (DPR)

> Source: https://aiwiki.ai/wiki/dense_passage_retrieval
> Updated: 2026-06-24
> Categories: Information Retrieval, Natural Language Processing
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Dense Passage Retrieval (DPR)** is a neural [information retrieval](/wiki/information_retrieval) method that uses a dual-encoder [BERT](/wiki/bert) architecture to map questions and passages into dense vectors, then retrieves passages by maximum inner-product search over those vectors. Introduced by Facebook AI Research in a paper posted to arXiv on April 10, 2020 and presented at EMNLP 2020, DPR was the first large-scale demonstration that learned dense representations can substantially outperform classical sparse retrievers such as [BM25](/wiki/bm25) on open-domain [question answering](/wiki/question_answering): its dense retriever beat a strong Lucene-BM25 system "largely by 9%-19% absolute in terms of top-20 passage retrieval accuracy" [1]. It became the technical template for a generation of retrievers that now power production [retrieval-augmented generation (RAG)](/wiki/retrieval_augmented_generation) systems.

The foundational paper, *Dense Passage Retrieval for Open-Domain Question Answering*, was written by eight authors at Facebook AI Research, Princeton University, and the University of Washington [1]. Within two years it had become one of the most-cited papers in the modern retrieval literature, and the dual-encoder recipe it popularised (BERT bi-encoder, contrastive loss, in-batch negatives, hard negatives mined from BM25) is now the default starting point for almost every commercial and open-source [embeddings](/wiki/embeddings) model used for [semantic search](/wiki/semantic_search) and RAG.

## Background and origin

Before DPR, open-domain QA pipelines almost always began with a sparse retriever. BM25, a probabilistic ranking function from the 1990s based on [TF-IDF](/wiki/tf_idf) statistics with length normalisation [13], was the de facto first stage. The DPR paper opens by naming exactly this status quo: "traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method" [1]. Neural rerankers were sometimes layered on top, but learning the first-stage retriever end-to-end was considered hard: BERT cross-encoders are too slow to apply to millions of passages at query time, and earlier dense methods (e.g. ORQA, REALM) either required expensive auxiliary pretraining or made strong assumptions that did not transfer well to standard benchmarks.

The DPR authors set out to test a much simpler hypothesis: with the right contrastive training objective and a sensible negative-sampling strategy, an off-the-shelf BERT bi-encoder fine-tuned on existing QA supervision should be enough to beat BM25. As the abstract puts it, "retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework" [1]. They were right, by a large margin.

### Who wrote the DPR paper?

The paper was written by Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen and Wen-tau Yih [1]. The team spanned Facebook AI Research, Princeton University (Danqi Chen), and the University of Washington (Sewon Min). Several authors went on to shape much of what came next in retrieval and RAG: Patrick Lewis was first author on the original RAG paper a few months later [2], Wen-tau Yih continued to publish on dense retrieval and multi-hop QA, and Sewon Min produced influential work on retrieval evaluation and on memory-augmented language models.

| Item | Value |
|---|---|
| First arXiv submission | April 10, 2020 (v1) |
| Final arXiv revision | September 30, 2020 (v3) |
| Published venue | EMNLP 2020 |
| ACL Anthology DOI | 10.18653/v1/2020.emnlp-main.550 |
| Code | github.com/facebookresearch/DPR |
| Hugging Face module | `transformers.models.dpr` |

## How does the DPR architecture work?

DPR is a *bi-encoder* (also called a dual-encoder or two-tower) model. Two separate BERT-base networks are used: a question encoder $E_Q$ and a passage encoder $E_P$. Each takes raw text, runs it through 12 transformer layers, and returns the 768-dimensional hidden state of the `[CLS]` token as the dense representation [11]. So a question $q$ becomes a vector $E_Q(q) \in \mathbb{R}^{768}$, and a passage $p$ becomes $E_P(p) \in \mathbb{R}^{768}$.

The similarity between a question and a passage is the dot product of these two vectors:

$$\text{sim}(q, p) = E_Q(q)^\top E_P(p)$$

The authors also evaluated Euclidean (L2) distance and cosine similarity as alternatives: L2 performed comparably to the dot product and both outperformed cosine, so the released models use the plain dot product, which is also friendly to vector indexes [1]. A dot product between two 768-dimensional vectors is just 768 multiplications and 767 additions, cheap enough to score millions of candidates per second with the right index.

### Why does DPR use two encoders?

The two-tower design is what makes DPR usable at scale. Because the passage encoder does not see the question, every passage in the corpus can be encoded once, offline, and stored in a vector index. At query time only the question needs to be encoded. Retrieval then reduces to a maximum inner-product search (MIPS), which can be solved in milliseconds over millions of vectors using approximate nearest-neighbour indexes such as HNSW or IVF with [product quantization](/wiki/product_quantization), as implemented in libraries like [FAISS](/wiki/faiss). This is the key reason a bi-encoder is preferred over a BERT cross-encoder for first-stage retrieval, even though the cross-encoder would be more accurate per pair: a cross-encoder cannot precompute passage representations, so it would have to run a full forward pass over every (question, passage) pair at query time.
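The offline/online split can be illustrated with a brute-force search in plain Python. This is only a sketch: the 3-dimensional vectors are toy values standing in for 768-dimensional encoder outputs, the names `top_k_mips` and `passage_index` are illustrative, and a real system would use an ANN index rather than scoring every passage.

```python
import heapq
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product sim(q, p) of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k_mips(query_vec: Vector, passage_vecs: Sequence[Vector], k: int) -> List[Tuple[int, float]]:
    """Exact maximum inner-product search: (passage_index, score), best first."""
    scored = ((i, dot(query_vec, p)) for i, p in enumerate(passage_vecs))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Offline: passage vectors are computed once and stored (toy values).
passage_index = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.3],
    [0.4, 0.4, 0.4],
]

# Online: only the question is encoded, then scored against the stored vectors.
question_vec = [1.0, 0.0, 0.5]
hits = top_k_mips(question_vec, passage_index, k=2)
```

Exact search like this costs one dot product per passage, which is why larger corpora move to approximate indexes.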

### Reader and pipeline

Alongside the retriever, the original DPR repository releases a `DPRReader` based on BERT that reads the top-k retrieved passages and extracts a final answer span [12]. The full open-domain QA pipeline is therefore:

1. Encode the user's question with $E_Q$.
2. Find the top-k passages by maximum inner-product search.
3. Pass question + retrieved passages into the reader.
4. Return the highest-scoring answer span.

The Hugging Face `transformers` library exposes this as three model classes: `DPRQuestionEncoder`, `DPRContextEncoder`, and `DPRReader`, each with matching tokenizers [11].

## How is DPR trained?

DPR is trained with a contrastive objective on existing question-passage pairs. For a batch of $B$ questions, each with one positive passage, the loss is the negative log-likelihood of the positive under a softmax over the positive plus a set of negatives:

$$\mathcal{L}(q_i, p_i^+, \{p_{i,j}^-\}) = -\log \frac{\exp(\text{sim}(q_i, p_i^+))}{\exp(\text{sim}(q_i, p_i^+)) + \sum_j \exp(\text{sim}(q_i, p_{i,j}^-))}$$
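For a single question, the loss can be computed directly from the similarity scores. The following is a minimal sketch in plain Python; the function name `dpr_nll_loss` and the example scores are illustrative and not taken from the DPR code base.

```python
import math
from typing import Sequence


def dpr_nll_loss(pos_score: float, neg_scores: Sequence[float]) -> float:
    """Negative log softmax probability of the positive among positive + negatives."""
    scores = [pos_score, *neg_scores]
    m = max(scores)  # subtract the max before exponentiating for numerical stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - pos_score


# Positive scored far above the negatives: loss close to zero.
loss_easy = dpr_nll_loss(10.0, [0.0, 0.0])

# Positive indistinguishable from two negatives: loss equals log(3).
loss_hard = dpr_nll_loss(1.0, [1.0, 1.0])
```

The loss shrinks as the positive's score rises above the negatives' scores, which is what pulls matching question and passage vectors together during training.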

### Negative sampling

The paper's central practical insight is how it constructs negatives. Three strategies were studied [1]:

- **Random negatives**: random passages from the corpus.
- **BM25 negatives**: top BM25 hits that do not contain the gold answer string. These are *hard* negatives because they match many of the question's tokens yet do not contain the answer.
- **In-batch negatives**: for question $q_i$, treat the positive passages of every *other* question $q_j$ in the same batch as negatives.
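The BM25 hard-negative rule can be sketched as a filter over a ranked list. This sketch assumes the ranked passages come from some lexical retriever and uses case-insensitive substring matching as a simplification; the function name and example passages are hypothetical.

```python
from typing import List, Sequence


def select_hard_negatives(
    ranked_passages: Sequence[str],
    answers: Sequence[str],
    num_negatives: int = 1,
) -> List[str]:
    """Keep the highest-ranked passages that contain none of the answer strings."""
    lowered_answers = [a.lower() for a in answers]
    negatives: List[str] = []
    for passage in ranked_passages:
        text = passage.lower()
        if not any(answer in text for answer in lowered_answers):
            negatives.append(passage)
            if len(negatives) == num_negatives:
                break
    return negatives


# Hypothetical ranked hits for the question "What is the capital of France?"
ranked = [
    "Paris is the capital of France.",
    "France's capital hosts the Louvre museum.",
    "Berlin is the capital of Germany.",
]
hard_negatives = select_hard_negatives(ranked, answers=["Paris"])
```

Here the first hit is skipped because it contains the answer, and the second hit, which overlaps heavily with the question but lacks the answer, is kept.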

The in-batch trick is what makes the training cheap. With a batch of 128 questions, each example sees 127 negatives essentially for free, since their encoded vectors are already in GPU memory [1]. Adding even one BM25 hard negative per question on top of in-batch negatives gave the best results in the paper. The released `single` checkpoints use one BM25 hard negative + 127 in-batch negatives per question [1].
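A sketch of the in-batch loss in plain Python follows. Question `i`'s positive is passage `i`, and every other passage in the batch acts as a negative. Sharing the optional hard negatives across all questions in the batch is an assumption of this sketch, and the 2-dimensional vectors are toy values.

```python
import math
from typing import Optional, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def in_batch_loss(
    question_vecs: Sequence[Vector],
    positive_vecs: Sequence[Vector],
    hard_negative_vecs: Optional[Sequence[Vector]] = None,
) -> float:
    """Mean negative log-likelihood over the batch with in-batch negatives."""
    passages = list(positive_vecs) + list(hard_negative_vecs or [])
    total = 0.0
    for i, q in enumerate(question_vecs):
        scores = [dot(q, p) for p in passages]  # one row of the score matrix
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]  # the positive for question i is column i
    return total / len(question_vecs)


questions = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
loss_in_batch = in_batch_loss(questions, positives)
loss_with_hard = in_batch_loss(questions, positives, hard_negative_vecs=[[1.0, 1.0]])
```

No extra encoder passes are needed for the in-batch negatives: the same passage vectors are reused for every question in the batch.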

### Hyperparameters

The main NQ models in the paper are trained with:

| Setting | Value |
|---|---|
| Encoder backbone | `bert-base-uncased` |
| Output dimension | 768 (CLS token) |
| Optimiser | Adam |
| Learning rate | 1e-5 |
| Linear warm-up | yes |
| Batch size | 128 |
| Epochs | 40 |
| Negatives per question | 1 BM25 hard + 127 in-batch |
| Compute | 8 x 32 GB GPUs, ~1 day |

### Datasets

DPR is trained on the standard open-domain QA datasets: Natural Questions, TriviaQA, WebQuestions, CuratedTREC, and SQuAD. The paper reports two main settings: *Single* (a separate retriever per dataset) and *Multi* (one retriever trained on the union of NQ, TriviaQA, WebQuestions, and CuratedTREC, with SQuAD held out because of its narrow Wikipedia coverage). The retrieval corpus is the December 2018 English Wikipedia dump segmented into 100-word passages, yielding about 21 million passages [1].
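The 100-word segmentation can be sketched as follows. This assumes simple whitespace tokenisation and non-overlapping blocks; the official preprocessing may differ in details such as how titles and the final short block are handled.

```python
from typing import List


def split_into_passages(text: str, words_per_passage: int = 100) -> List[str]:
    """Split text into consecutive, non-overlapping blocks of at most N words."""
    words = text.split()
    return [
        " ".join(words[start:start + words_per_passage])
        for start in range(0, len(words), words_per_passage)
    ]


# A synthetic 250-word "article" yields passages of 100, 100 and 50 words.
article = " ".join(f"w{i}" for i in range(250))
passages = split_into_passages(article)
passage_lengths = [len(p.split()) for p in passages]
```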

## Benchmark results

The original paper reports top-20 and top-100 retrieval accuracy, defined as the fraction of questions for which at least one of the top-k retrieved passages contains the gold answer string [1]. The headline numbers from Table 2 of the paper:

### Top-20 retrieval accuracy on test sets

| Retriever | NQ | TriviaQA | WebQuestions | CuratedTREC | SQuAD |
|---|---|---|---|---|---|
| BM25 | 59.1 | 66.9 | 55.0 | 70.9 | 68.8 |
| DPR (Single) | 78.4 | 79.4 | 73.2 | 79.8 | 63.2 |
| BM25 + DPR | 76.6 | 79.8 | 71.0 | 85.2 | 71.5 |

### Top-100 retrieval accuracy on test sets

| Retriever | NQ | TriviaQA | WebQuestions | CuratedTREC | SQuAD |
|---|---|---|---|---|---|
| BM25 | 73.7 | 76.7 | 71.1 | 84.1 | 80.0 |
| DPR (Single) | 85.4 | 85.0 | 81.4 | 89.1 | 77.2 |
| BM25 + DPR | 83.8 | 84.5 | 80.5 | 92.7 | 81.3 |

On Natural Questions the gap is especially stark: DPR reaches 78.4% top-20 accuracy against BM25's 59.1%, an absolute gain of 19.3 points [1]. On the four datasets other than SQuAD, the top-20 improvements over BM25 range from about 9 points (CuratedTREC, 79.8 vs 70.9) to about 19 points (NQ), matching the paper's headline claim of "9%-19% absolute" [1]. SQuAD is the one dataset where DPR underperforms BM25, which the authors attribute to two factors: SQuAD questions were written by annotators who had seen the passage and so share unusually high lexical overlap with it (favouring BM25), and the SQuAD passages are drawn from a small set of Wikipedia articles, making the training distribution narrow.
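The top-k retrieval accuracy reported in these tables can be sketched as below. The sketch uses case-insensitive substring matching to decide whether a passage contains a gold answer; the official evaluation scripts may normalise text differently, and the example data is made up.

```python
from typing import Sequence


def top_k_accuracy(
    retrieved: Sequence[Sequence[str]],
    gold_answers: Sequence[Sequence[str]],
    k: int,
) -> float:
    """Fraction of questions with at least one answer-bearing passage in the top k."""
    hits = 0
    for passages, answers in zip(retrieved, gold_answers):
        top = [p.lower() for p in passages[:k]]
        if any(a.lower() in p for p in top for a in answers):
            hits += 1
    return hits / len(retrieved)


# Two made-up questions, each with a ranked list of retrieved passages.
retrieved = [
    ["Paris is the capital of France.", "Lyon is a city in France."],
    ["Hamlet is a tragedy.", "Hamlet was written by William Shakespeare."],
]
gold_answers = [["Paris"], ["Shakespeare"]]

acc_at_1 = top_k_accuracy(retrieved, gold_answers, k=1)
acc_at_2 = top_k_accuracy(retrieved, gold_answers, k=2)
```

Because a question counts as a hit when any of its top-k passages contains the answer, accuracy can only stay flat or rise as k grows, which is why the top-100 figures are higher than the top-20 figures.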

With DPR retrieval feeding their extractive reader, the paper reports exact-match scores of 41.5% on Natural Questions and 56.8% on TriviaQA, both state of the art at the time of publication and ahead of prior systems built on BM25 plus large readers [1]. On WebQuestions the single-dataset model reaches 34.6%; the paper's stronger WebQuestions result comes from the multi-dataset retriever [1].

## Variants and successors

The dual-encoder recipe DPR popularised was extended in many directions, often by changing the negatives, the encoder, or the interaction function.

| Method | Year | Key idea | Why it matters |
|---|---|---|---|
| Sentence-BERT | 2019 | Siamese BERT for general sentence embeddings | Predates DPR; established the bi-encoder pattern for similarity, though not specifically for QA retrieval |
| DPR (Single / Multi) | 2020 | BERT bi-encoder + in-batch + BM25 hard negatives | The reference dense retriever for open-domain QA |
| ColBERT | 2020 | Late interaction over per-token embeddings | Better accuracy than DPR with manageable cost; ColBERTv2 (2022) compresses the token index |
| ANCE | 2021 | Asynchronously updated ANN-mined hard negatives | Showed that the negatives, not the encoder, were often the bottleneck |
| RocketQA / RocketQAv2 | 2021 | Cross-encoder distillation, denoising hard negatives | Strong gains on MS MARCO and NQ |
| coCondenser / Condenser | 2021 | Retrieval-oriented pretraining objective for the bi-encoder | Better dense retrievers from the same fine-tuning data |
| SimCSE | 2021 | Simple contrastive sentence embeddings | Strong general embeddings from unsupervised contrastive learning |
| GTR | 2021 | T5-based dual encoder, scaled to 4.8B parameters | Showed that dual-encoder accuracy keeps improving with scale |
| mDPR / CORA | 2021 | Multilingual and cross-lingual variants of DPR | Extends DPR to non-English QA |
| E5 | 2022 | Weakly-supervised contrastive pretraining at web scale | Strong out-of-the-box retriever, dominant on MTEB for a period |
| DRAGON | 2023 | Diverse augmentation across queries and supervision | Single retriever that is robust both in- and out-of-domain |
| BGE | 2023 | RetroMAE-pretrained encoder, contrastive fine-tuning | Open-source MTEB leader; widely used in production RAG |
| GTE / NV-Embed / Voyage / Cohere Embed | 2023-2025 | Larger backbones, instruction tuning, hard negatives at scale | Modern commercial and open-source embeddings, all bi-encoders |

With the exception of ColBERT-style late interaction, every entry in this table shares DPR's basic design: encode query and document independently, train with a contrastive loss, and search by inner product or cosine similarity. The improvements come from how the encoders are pretrained, what negatives they see, and how supervision is mixed.

### How does DPR differ from ColBERT and late interaction?

ColBERT (Khattab and Zaharia, SIGIR 2020) is the most important alternative architecture [3]. Instead of pooling each input into a single 768-dim vector, it stores per-token embeddings and computes similarity as a sum of MaxSim operations between query tokens and document tokens. This *late interaction* preserves fine-grained matching that DPR's single-vector representation discards, at the cost of a much larger index. ColBERTv2 (Santhanam et al. 2022) compresses these token vectors with residual quantisation, narrowing the storage gap [8]. Late-interaction methods consistently match or exceed single-vector dense retrievers on hard benchmarks like BEIR [7].
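The MaxSim scoring rule can be sketched in a few lines of plain Python. The 2-dimensional token vectors are toy values, and the function name `maxsim_score` is illustrative rather than ColBERT's actual API.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Late interaction: each query token takes its best-matching document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Two query-token vectors and two document-token vectors (toy values).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.5, 0.5]]

# First query token best matches doc token 0 (1.0); second best matches doc token 1 (0.5).
score = maxsim_score(query_tokens, doc_tokens)
```

Unlike a single-vector dot product, the score depends on every document token vector, which is why the index has to store all of them.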

### Cross-encoder reranking

It is common to combine DPR-style first-stage retrieval with a BERT cross-encoder reranker that scores each (query, passage) pair jointly. The cross-encoder is too slow to use over millions of passages, but applied to the top 100 hits from a bi-encoder, it adds noticeable accuracy at modest cost. Most production RAG stacks today use exactly this two-stage pattern.
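The two-stage pattern can be sketched as follows. The `pair_scorer` argument is a hypothetical stand-in for a cross-encoder, and the word-overlap scorer is only a toy placeholder so that the example runs without a model.

```python
from typing import Callable, List, Sequence, Tuple


def retrieve_then_rerank(
    query: str,
    candidates: Sequence[Tuple[str, float]],
    pair_scorer: Callable[[str, str], float],
    rerank_depth: int = 100,
    final_k: int = 10,
) -> List[str]:
    """Rerank the top first-stage hits with a slower joint (query, passage) scorer.

    candidates holds (passage, first_stage_score) pairs from the bi-encoder.
    """
    shortlist = sorted(candidates, key=lambda c: c[1], reverse=True)[:rerank_depth]
    reranked = sorted(shortlist, key=lambda c: pair_scorer(query, c[0]), reverse=True)
    return [passage for passage, _ in reranked[:final_k]]


def toy_overlap_scorer(query: str, passage: str) -> float:
    """Placeholder for a cross-encoder: counts shared lowercase words."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


candidates = [
    ("the cat sat on the mat", 0.9),
    ("shakespeare wrote hamlet", 0.5),
    ("unrelated text", 0.1),
]
best = retrieve_then_rerank(
    "who wrote hamlet", candidates, toy_overlap_scorer, rerank_depth=2, final_k=1
)
```

The expensive scorer only ever sees `rerank_depth` passages per query, which is what keeps the second stage affordable.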

## How does dense retrieval differ from sparse retrieval?

Dense retrievers and sparse retrievers solve the same problem with different inductive biases, and their failure modes are largely complementary. Hybrid retrieval, which combines a sparse and a dense retriever, is therefore standard in production.

| Property | BM25 (sparse) | SPLADE (learned sparse) | DPR (single-vector dense) | ColBERT (late interaction) |
|---|---|---|---|---|
| Representation | Term-frequency vector over the vocabulary | Sparse vector over BERT WordPiece vocabulary, learned | Single 768-dim dense vector per text | Per-token 128-dim dense vectors |
| Matching | Lexical overlap | Lexical + learned term expansion | Semantic, single vector | Semantic, fine-grained MaxSim |
| Training | None (bag-of-words statistics) | Contrastive on QA / MS MARCO + sparsity regulariser | Contrastive bi-encoder | Contrastive bi-encoder |
| Index | Inverted index | Inverted index | ANN over dense vectors (FAISS, HNSW) | Per-token dense ANN, optionally compressed |
| Query latency | Very low (CPU) | Low (CPU) | Low (GPU encode + ANN search) | Higher (GPU encode + multi-vector search) |
| GPU at query time | Not required | Not required | Recommended for low-latency query encoding | Recommended for low-latency query encoding |
| Out-of-domain robustness | Strong | Strong | Often weak (BEIR) | Strong |
| Reference paper | Robertson et al., 1995 (Okapi) [13] | Formal et al., 2021 [14] | Karpukhin et al., 2020 [1] | Khattab and Zaharia, 2020 [3] |

A simple hybrid that adds normalised BM25 and DPR scores, or merges their result lists with reciprocal-rank fusion, often outperforms either retriever alone. The DPR paper itself reports that BM25 + DPR beats both individual retrievers at top-20 on TriviaQA, CuratedTREC, and SQuAD, although on NQ and WebQuestions DPR alone remains ahead (see the top-20 table above) [1].
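Reciprocal-rank fusion can be sketched in a few lines. The constant `k = 60` is a commonly used default and an assumption here, not a value taken from the DPR paper, and the document IDs are made up.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked lists by summing 1 / (k + rank) for each document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because only ranks are used, the fusion works without putting BM25 and dot-product scores on a common scale.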

## What is DPR used for?

- **Open-domain question answering.** DPR's original target. The Karpukhin paper feeds the top-k passages into an extractive reader; later work like FiD (Izacard and Grave, EACL 2021) replaces the reader with a generative T5 that fuses many passages in its decoder, pushing exact-match scores on NQ to 51.4% with the same DPR retriever [6].
- **Retrieval-augmented generation.** The original RAG paper (Lewis et al., NeurIPS 2020) directly bolts DPR's question encoder onto a BART generator; the passage index is treated as non-parametric memory [2]. Modern RAG stacks (LlamaIndex, LangChain, custom production systems) follow the same pattern, often swapping DPR for a stronger embedding model like BGE, E5, or commercial offerings from [Voyage AI](/wiki/voyage_ai), Cohere, or OpenAI.
- **Semantic search.** Internal documentation search, e-commerce search, customer support knowledge bases. Anywhere keyword search misses paraphrases or vocabulary mismatch, a DPR-style retriever helps.
- **Code search.** Encoders fine-tuned on (natural language docstring, code snippet) pairs let users search large codebases by intent rather than exact tokens.
- **Cross-lingual retrieval.** mDPR (Asai et al., 2021) trains a multilingual DPR variant; CORA extends this to fully cross-lingual open QA where the question and answer language can differ.
- **Conversational retrieval.** ORConvQA and follow-on work apply DPR-style retrievers to multi-turn conversational QA, where the query must be reformulated based on dialogue history.
- **Entity linking and knowledge-base completion.** BLINK (Wu et al., 2020), built by some of the same FAIR authors, applies a bi-encoder + cross-encoder pipeline to large-scale entity linking against Wikipedia.

## Practical considerations

Building a production retriever around DPR or one of its descendants involves a number of engineering choices that the original paper only touches on.

- **Indexing.** Passage embeddings are computed once and stored. For up to a few million vectors, exact MIPS (FAISS `IndexFlatIP`) is fine. Beyond that, approximate methods such as HNSW and IVF + PQ are needed. The FAISS library, [vector databases](/wiki/vector_database) such as Milvus, Pinecone, Weaviate, and Qdrant, and the pgvector extension for PostgreSQL implement approximate indexes of this kind with various trade-offs in build time, memory, and recall.
- **Re-indexing.** Whenever the encoder is updated (a new fine-tune, a new base model, a domain adaptation), every passage must be re-embedded. For a Wikipedia-scale corpus this is hours of GPU time. For a billion-document corpus it can be days, which is one reason teams hesitate to retrain frequently.
- **Hard-negative mining.** The single highest-leverage trick beyond the basic recipe. ANCE-style asynchronous mining (re-embed corpus periodically, sample top hits as new negatives) consistently improves over static BM25 negatives [5].
- **Domain adaptation.** Off-the-shelf DPR or even modern MTEB-leading embeddings often perform poorly on legal, biomedical, financial, or proprietary corpora. Fine-tuning on a few thousand in-domain (query, positive) pairs, ideally with hard negatives from the target index, usually closes most of the gap.
- **Chunking.** DPR encodes 100-word passages. Modern retrievers accept longer inputs (BGE-M3, for example, goes up to 8192 tokens), but chunking strategy still matters: too-long passages dilute the signal, too-short passages lose context. Many production systems retrieve at the chunk level and then expand to surrounding context for the reader.
- **Score calibration for hybrid.** Combining BM25 and dense scores requires either rank-based fusion (RRF) or careful min-max normalisation, since the two scoring functions live on different scales.
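Score-based fusion with min-max normalisation can be sketched as follows. The equal weighting, the choice to give a document missing from one list a normalised score of 0, and the example scores are all assumptions of this sketch.

```python
from typing import Dict, List, Tuple


def min_max_normalise(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant or empty list maps to zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse_scores(
    sparse: Dict[str, float],
    dense: Dict[str, float],
    dense_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Weighted sum of normalised sparse and dense scores, best first."""
    s, d = min_max_normalise(sparse), min_max_normalise(dense)
    fused = {
        doc: (1.0 - dense_weight) * s.get(doc, 0.0) + dense_weight * d.get(doc, 0.0)
        for doc in set(s) | set(d)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


bm25_scores = {"a": 12.0, "b": 8.0, "c": 4.0}     # BM25 scale
dense_scores = {"a": 65.0, "b": 80.0, "d": 60.0}  # dot-product scale
ranking = fuse_scores(bm25_scores, dense_scores)
```

Without the normalisation step, the larger dot-product values would dominate the sum regardless of the weight chosen.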

## Limitations

DPR, and to a lesser extent its descendants, has well-documented weaknesses.

- **Compute cost.** Encoding queries at runtime requires a GPU for low-latency serving. Large corpora require non-trivial GPU time to embed, and re-embedding on encoder updates is expensive.
- **Out-of-domain generalisation.** The BEIR benchmark (Thakur et al., NeurIPS 2021) showed that BM25 is a surprisingly strong zero-shot baseline, and that several dense retrievers including DPR underperform it on tasks far from their training distribution, such as BioASQ or Touche-2020 [7]. Late-interaction models (ColBERT) and large general-purpose embeddings (BGE, E5) close this gap but do not eliminate it.
- **Rare entities and exact matches.** Single-vector dense retrievers struggle with queries that hinge on rare proper nouns, identifiers, error codes, or numbers. BM25 handles these robustly because the relevant token simply has high IDF. Hybrid retrieval is the standard mitigation.
- **Information loss from pooling.** Compressing a 100-word passage into a single 768-dim vector throws away a lot of detail. Late-interaction architectures (ColBERT, ColBERTv2, BGE-M3 multi-vector mode) recover some of this at the cost of a larger index.
- **Sensitivity to negatives.** The choice of hard negatives can change retrieval accuracy by several points. Models trained without hard negatives often show plausible top-5 results that are subtly off-topic.
- **Annotation gaps.** DPR trains on questions whose gold passage is known. For domains without such labelled data (most enterprise use cases), one has to either generate synthetic queries (e.g. with an LLM) or rely on transfer from public retrievers.

## Influence

The DPR paper has been cited several thousand times since 2020 and is a fixture in retrieval and QA reading lists. More importantly, its design choices became the default. The standard recipe of (1) bi-encoder transformer, (2) contrastive InfoNCE-style loss, (3) in-batch negatives plus mined hard negatives, and (4) inner-product nearest-neighbour search at inference is what every modern embedding model used for retrieval starts from. BGE, E5, GTE, NV-Embed, Cohere Embed v3, OpenAI's text-embedding-3 family, and Voyage AI's embeddings all follow this pattern, differing mainly in the encoder backbone (often much larger than BERT-base, sometimes decoder-only LLMs), the pretraining and fine-tuning data, and the volume of mined negatives.

DPR also shaped how people built retrieval-augmented systems. The original RAG paper used DPR's retriever directly [2]. FiD, Atlas, RETRO, REPLUG, and the long line of RAG architectures that followed all assume that a strong dense retriever exists and that you can swap it in. Today the same architectural pattern underpins enterprise RAG products, code assistants that retrieve from internal codebases, biomedical and legal search engines, and the retrieval components used by web-augmented LLM chatbots.

In the broader [embeddings](/wiki/embeddings) ecosystem, DPR helped establish that contrastive bi-encoder training, not masked-language-model pretraining alone, is what produces useful retrieval [vector embeddings](/wiki/vector_embeddings). That insight directly motivated SimCSE, the E5 family [9], and the entire MTEB-driven competition for general-purpose embedding models.

## Open-source code and models

- **`facebookresearch/DPR`** (archived October 2023): the official PyTorch implementation, training scripts, pretrained NQ and Multi checkpoints, and the FAISS indexing code [12].
- **Hugging Face `transformers`**: `DPRQuestionEncoder`, `DPRContextEncoder`, `DPRReader` and matching tokenizers; weights such as `facebook/dpr-question_encoder-single-nq-base` and `facebook/dpr-ctx_encoder-single-nq-base` are downloadable [11].
- **Sentence Transformers**: not DPR per se, but the most widely used library that exposes DPR-style training (SBERT, MultipleNegativesRankingLoss) for arbitrary domains [4].
- **Pyserini**: BM25 and DPR baselines for reproducible IR experiments, including several reproductions of the DPR paper.
- **Tevatron**: a flexible toolkit for training dense retrievers in the DPR style with modern enhancements.

## Modern context

Five years on, dense retrieval is production-standard and DPR itself is rarely deployed in its original form. The state of the art has moved to larger backbones, instruction-tuned and decoder-based encoders, and far more aggressive hard-negative mining at billion-document scale [15]. Late-interaction approaches in the ColBERT lineage have resurged for accuracy-critical use cases. Sparse, dense, and late-interaction methods coexist in hybrid stacks because their failure modes are complementary, and BEIR-aware evaluation pushed the field to care about robustness, not just in-domain accuracy [7].

What has not changed is the architectural blueprint. The DPR paper showed that a simple bi-encoder, trained on a few hundred thousand QA pairs with the right negatives, could turn dense retrieval from a research curiosity into the default first stage of any modern semantic search or RAG system. That bet is the foundation almost every retriever still rests on.

## References

1. Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., and Yih, W. (2020). *Dense Passage Retrieval for Open-Domain Question Answering*. EMNLP 2020. arXiv:2004.04906. https://aclanthology.org/2020.emnlp-main.550/
2. Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Kuttler, H., Lewis, M., Yih, W., Rocktaschel, T., Riedel, S., and Kiela, D. (2020). *Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks*. NeurIPS 2020. arXiv:2005.11401.
3. Khattab, O., and Zaharia, M. (2020). *ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT*. SIGIR 2020. arXiv:2004.12832.
4. Reimers, N., and Gurevych, I. (2019). *Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks*. EMNLP-IJCNLP 2019. https://aclanthology.org/D19-1410/
5. Xiong, L., Xiong, C., Li, Y., Tang, K., Liu, J., Bennett, P. N., Ahmed, J., and Overwijk, A. (2021). *Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval (ANCE)*. ICLR 2021. arXiv:2007.00808.
6. Izacard, G., and Grave, E. (2021). *Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering (Fusion-in-Decoder)*. EACL 2021. https://aclanthology.org/2021.eacl-main.74/
7. Thakur, N., Reimers, N., Ruckle, A., Srivastava, A., and Gurevych, I. (2021). *BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models*. NeurIPS 2021 Datasets and Benchmarks. arXiv:2104.08663.
8. Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., and Zaharia, M. (2022). *ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction*. NAACL 2022.
9. Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R., and Wei, F. (2022). *Text Embeddings by Weakly-Supervised Contrastive Pre-training (E5)*. arXiv:2212.03533.
10. Xiao, S., Liu, Z., Zhang, P., and Muennighoff, N. (2023). *C-Pack: Packaged Resources To Advance General Chinese Embedding (BGE)*. arXiv:2309.07597.
11. Hugging Face Transformers documentation. *DPR model class reference*. https://huggingface.co/docs/transformers/model_doc/dpr
12. Facebook Research. *DPR repository*. https://github.com/facebookresearch/DPR
13. Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., and Gatford, M. (1995). *Okapi at TREC-3*. NIST Special Publication 500-225.
14. Formal, T., Piwowarski, B., and Clinchant, S. (2021). *SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking*. SIGIR 2021.
15. Lin, S., Asai, A., Li, M., Oguz, B., Lin, J., Mehdad, Y., Yih, W., and Chen, X. (2023). *How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval*. EMNLP 2023.