Information retrieval (IR) is the science and practice of finding relevant documents, passages, or data from large unstructured collections in response to a user's information need. The field sits at the intersection of computer science, natural language processing, and machine learning, and it provides the theoretical and algorithmic foundations for web search engines, enterprise search, question-answering systems, and modern retrieval-augmented generation (RAG) pipelines.
The term "information retrieval" was coined by Calvin Mooers in 1950 to describe the process of searching for information within a stored collection. Early IR research focused on library automation and indexing systems. In the 1960s, Gerard Salton and his team at Harvard (and later Cornell University) developed the SMART (System for the Mechanical Analysis and Retrieval of Text) information retrieval system, which was the first to implement the vector space model for representing documents and queries. By 1971, SMART was demonstrating retrieval performance that rivaled human indexers. Salton is widely regarded as "the father of information retrieval," and many foundational concepts, including relevance feedback and Rocchio classification, emerged from SMART research.
In 1972, Karen Sparck Jones published a seminal paper in the Journal of Documentation introducing the concept of inverse document frequency (IDF), which assigns higher weight to terms that appear in fewer documents across a collection. This insight became a cornerstone of term weighting and remains embedded in virtually every modern search engine. The probabilistic retrieval framework, which provides a formal statistical foundation for ranking documents by their estimated probability of relevance, was developed throughout the 1970s and 1980s by Stephen E. Robertson, Karen Sparck Jones, and colleagues.
The Text REtrieval Conference (TREC), co-sponsored by the National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), began in 1992 and played a transformative role in advancing the field. TREC provided large-scale standardized test collections and evaluation methodologies that allowed researchers to compare retrieval systems rigorously. Within the first six years of TREC workshops, the effectiveness of participating retrieval systems approximately doubled. An independent study found that roughly one-third of the improvement in web search engines from 1999 to 2009 is attributable to research catalyzed by TREC.
The rise of the World Wide Web in the mid-1990s brought IR techniques into mainstream computing. Commercial search engines such as AltaVista, Yahoo, and later Google applied and extended IR methods at unprecedented scale. Google's PageRank algorithm combined link analysis with traditional text-matching signals, and the success of web search cemented information retrieval as one of the most impactful areas of computer science.
An IR system operates over a corpus (a collection of documents) and responds to user queries. Documents can be web pages, academic papers, news articles, product descriptions, or any body of text. The central challenge is determining which documents are relevant to a given query, where relevance is defined as the degree to which a document satisfies the user's information need.
To enable fast retrieval over large corpora, IR systems build data structures called indexes. The most widely used structure is the inverted index, which maps each unique term in the corpus to a list of documents (and positions within those documents) where that term occurs. When a search query arrives, the system looks up each query term in the inverted index, retrieves the corresponding document lists, and combines them to produce candidate results. Inverted indexes form the backbone of systems like Apache Lucene, Elasticsearch, and Apache Solr.
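A minimal inverted index can be sketched in a few lines of Python. The corpus and whitespace tokenization below are toy assumptions; production systems add stemming, stopword handling, and compressed posting lists:

```python
from collections import defaultdict

# Toy corpus; invented example documents for illustration.
docs = {
    0: "information retrieval finds relevant documents",
    1: "an inverted index maps terms to documents",
    2: "dense retrieval uses vector embeddings",
}

# Build the inverted index: term -> list of (doc_id, positions).
index = defaultdict(list)
for doc_id, text in docs.items():
    positions = defaultdict(list)
    for pos, term in enumerate(text.split()):
        positions[term].append(pos)
    for term, pos_list in positions.items():
        index[term].append((doc_id, pos_list))

def search(query):
    """Return ids of documents containing every query term (conjunctive lookup)."""
    result = None
    for term in query.split():
        docs_for_term = {doc_id for doc_id, _ in index.get(term, [])}
        result = docs_for_term if result is None else result & docs_for_term
    return sorted(result or [])

print(search("retrieval documents"))  # [0]: the only doc with both terms
```

Real systems then rank the candidate set with a scoring function such as TF-IDF or BM25 rather than returning the raw intersection.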
For dense retrieval methods, the analogous structure is a vector index that stores document embeddings and supports fast similarity search. Libraries such as FAISS (Facebook AI Similarity Search), released by Meta AI Research in 2017, implement algorithms like Hierarchical Navigable Small Worlds (HNSW) and Inverted File Index (IVF) to perform approximate nearest neighbor (ANN) search over millions or billions of vectors in sub-linear time.
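The exact ("flat") search that HNSW and IVF approximate can be sketched as brute-force cosine scoring. The four-dimensional vectors here are invented for illustration; real embeddings have hundreds of dimensions:

```python
import math

# Toy 4-dimensional embeddings; a learned encoder would produce these.
doc_vectors = {
    "doc_a": [0.9, 0.1, 0.0, 0.1],
    "doc_b": [0.1, 0.8, 0.2, 0.0],
    "doc_c": [0.0, 0.2, 0.9, 0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(query_vec, k=2):
    """Exact (flat) search: score every vector, keep the top k.
    ANN structures like HNSW and IVF approximate this result
    without scanning every stored vector."""
    scored = sorted(doc_vectors.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

print(nearest([1.0, 0.0, 0.0, 0.0], k=1))  # ['doc_a']
```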
Term Frequency-Inverse Document Frequency (TF-IDF) is one of the earliest and most influential term weighting schemes in information retrieval. Introduced through the combined work of Hans Peter Luhn (term frequency, 1957) and Karen Sparck Jones (inverse document frequency, 1972), TF-IDF quantifies the importance of a word to a document within a corpus.
The scheme consists of two components:
- Term frequency (TF): how often a term t occurs in a document d. Terms that appear frequently in a document are assumed to be more representative of its content.
- Inverse document frequency (IDF): typically log(N / df(t)), where N is the number of documents in the corpus and df(t) is the number of documents containing t. Terms that occur in many documents carry little discriminative power and receive lower weight.
The final TF-IDF score for a term in a document is the product of these two values: TF(t, d) × IDF(t). Documents are ranked by the sum of TF-IDF scores across all query terms. TF-IDF remains useful as a baseline and as a feature in more complex systems, but it has known limitations: it treats each term independently (ignoring word order and semantics) and does not account for document length variation.
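A minimal TF-IDF scorer following this description might look like the sketch below. The corpus is a toy example, and the log-based IDF is the classic formulation; real systems apply smoothing and normalization variants:

```python
import math

# Assumed toy corpus for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]
N = len(corpus)
doc_terms = [doc.split() for doc in corpus]

def tf(term, doc):
    """Raw term frequency: number of occurrences of term in doc."""
    return doc.count(term)

def idf(term):
    """Classic IDF: log of corpus size over document frequency."""
    df = sum(1 for doc in doc_terms if term in doc)
    return math.log(N / df) if df else 0.0

def tfidf_score(query, doc):
    """Sum of TF x IDF over all query terms, as described in the text."""
    return sum(tf(t, doc) * idf(t) for t in query.split())

scores = [tfidf_score("cat mat", doc) for doc in doc_terms]
```

The first document wins because it contains both query terms, and "mat" is rare in the corpus and so carries a high IDF weight.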
BM25, also known as Okapi BM25, is a probabilistic ranking function that refines and extends the ideas behind TF-IDF. Developed by Stephen E. Robertson, Karen Sparck Jones, and colleagues during the 1980s and 1990s as part of the Okapi information retrieval system at City University London, the "25" denotes its position in a series of iterative Best Matching formulations. BM25 was among the top-performing systems in the early TREC conferences and has remained a dominant baseline for over three decades.
BM25 improves on basic TF-IDF in two key ways:
- Term frequency saturation: a term's contribution grows sub-linearly with its frequency and approaches an asymptote controlled by the parameter k1, so the tenth occurrence of a term adds far less than the first.
- Document length normalization: scores are adjusted by the ratio of a document's length to the average document length, controlled by the parameter b, so longer documents are not favored merely for containing more term occurrences.
The BM25 scoring formula for a query Q containing terms q1, q2, ..., qn against a document D is:
Score(D, Q) = sum over i of [ IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D| / avgdl)) ]
where f(qi, D) is the frequency of term qi in document D, |D| is the document length, and avgdl is the average document length across the corpus.
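The formula translates directly into code. The corpus below is a toy example; k1 = 1.5 and b = 0.75 are common default parameter choices, and the IDF uses the smoothed Robertson-Sparck Jones form typical of Okapi implementations:

```python
import math

# Toy corpus, pre-tokenized for illustration.
corpus = [
    "the quick brown fox".split(),
    "the lazy dog sleeps all day long in the sun".split(),
    "the fox jumps over the lazy dog".split(),
]
N = len(corpus)
avgdl = sum(len(d) for d in corpus) / N

def idf(term):
    # Smoothed IDF (+0.5) commonly used with Okapi BM25.
    df = sum(1 for d in corpus if term in d)
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

def bm25(query, doc, k1=1.5, b=0.75):
    """Score one document against a tokenized query, per the formula above."""
    score = 0.0
    for term in query:
        f = doc.count(term)                               # f(qi, D)
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)   # saturation + length norm
        score += idf(term) * f * (k1 + 1) / denom
    return score

scores = [bm25("lazy fox".split(), d) for d in corpus]
```

The third document scores highest because it matches both query terms, while the long second document is penalized by length normalization.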
BM25 is the default ranking function in Apache Lucene (and by extension Elasticsearch and Apache Solr), and it remains remarkably competitive. On the BEIR zero-shot benchmark, BM25 outperforms many neural retrieval models that were fine-tuned on MS MARCO, demonstrating strong generalization across domains.
Before ranked retrieval became dominant, Boolean retrieval systems allowed users to combine search terms with logical operators (AND, OR, NOT). A Boolean query returns all documents that satisfy the logical expression without ranking them by relevance. While limited in expressiveness, Boolean retrieval is still used in specialized applications such as patent search and legal discovery where precision and explicit control over query logic are critical.
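Boolean retrieval reduces to set operations over posting lists, as in this sketch with invented document ids:

```python
# Term -> posting set mappings (toy data).
postings = {
    "patent": {1, 2, 5},
    "infringement": {2, 3, 5},
    "trademark": {3, 4},
}

# Evaluate: patent AND infringement NOT trademark
result = (postings["patent"] & postings["infringement"]) - postings["trademark"]
print(sorted(result))  # [2, 5]
```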
Rigorous evaluation is essential for comparing IR systems and measuring progress. Metrics fall into two broad categories: set-based metrics that consider only whether retrieved documents are relevant, and rank-aware metrics that also account for the position of relevant documents in the result list.
Precision is the fraction of retrieved documents that are relevant:
Precision = |Relevant ∩ Retrieved| / |Retrieved|
Recall is the fraction of all relevant documents that are retrieved:
Recall = |Relevant ∩ Retrieved| / |Relevant|
These metrics are often computed at a fixed cutoff k (Precision@k, Recall@k), meaning only the top k results are considered. There is a natural tension between precision and recall: increasing the number of retrieved results tends to improve recall but can reduce precision, and vice versa.
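Both cutoff metrics are straightforward to compute; the ranked list and relevance judgments below are invented for illustration:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

retrieved = ["d3", "d7", "d1", "d9", "d4"]   # ranked system output (toy ids)
relevant = {"d1", "d4", "d8"}                # ground-truth judgments

print(precision_at_k(retrieved, relevant, 5))  # 2/5 = 0.4
print(recall_at_k(retrieved, relevant, 5))     # 2/3
```

Note that "d8" is relevant but never retrieved: a deeper cutoff could recover it and raise recall, at the likely cost of lower precision.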
MAP is a rank-aware metric that combines precision at every position where a relevant document is found. For a single query, Average Precision (AP) is the mean of precision values calculated at each rank position containing a relevant document. MAP is the mean of AP scores across all queries in a test set. MAP rewards systems that place relevant documents earlier in the ranked list and is one of the most widely reported metrics in academic IR research.
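A sketch of AP and MAP, using the standard convention of dividing by the total number of relevant documents for the query:

```python
def average_precision(retrieved, relevant):
    """Mean of precision values at each rank holding a relevant document,
    divided by the total number of relevant documents."""
    hits, precisions = 0, []
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: mean of AP over (retrieved, relevant) pairs for each query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy example: relevant docs found at ranks 1 and 3.
ap = average_precision(["a", "b", "c"], {"a", "c"})  # (1/1 + 2/3) / 2
```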
NDCG is a rank-aware metric designed for graded relevance judgments (where documents can be rated on a scale such as 0-3 rather than simply relevant/not relevant). It builds on Discounted Cumulative Gain (DCG), which sums the relevance grades of results while applying a logarithmic discount based on position:
DCG@k = sum from i=1 to k of [ (2^rel_i - 1) / log2(i + 1) ]
NDCG normalizes DCG by dividing it by the ideal DCG (IDCG), which is the DCG of a perfect ranking:
NDCG@k = DCG@k / IDCG@k
NDCG@10 is the primary metric used in the BEIR benchmark and is widely used for evaluating web search and recommendation systems.
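The DCG and NDCG formulas above translate directly to code; the graded judgments here are invented for illustration:

```python
import math

def dcg_at_k(grades, k):
    """DCG with the (2^rel - 1) gain and log2(position + 1) discount
    used in the formula above; grades are listed top to bottom."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(grades[:k]))

def ndcg_at_k(grades, k):
    """Normalize by the DCG of the ideal (descending-grade) ordering."""
    idcg = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / idcg if idcg > 0 else 0.0

# Graded judgments (0-3) for a ranked result list.
score = ndcg_at_k([3, 1, 0, 2], k=4)
```

A perfectly ordered list yields NDCG of exactly 1.0; any misordering of graded documents pushes the score below 1.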
MRR measures how quickly a system returns the first relevant result. The reciprocal rank for a single query is 1 divided by the position of the first relevant document. MRR is the mean of reciprocal ranks across all queries:
MRR = (1/|Q|) * sum over i of (1 / rank_i)
MRR is particularly useful for navigational queries and factoid question answering, where the user typically seeks a single correct answer. MRR@10 is the primary metric for the MS MARCO passage ranking leaderboard.
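MRR takes only a few lines; the runs below are toy examples:

```python
def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0 if none retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mrr(runs):
    """Mean reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["d2", "d1"], {"d1"}),   # first relevant at rank 2 -> 1/2
    (["d5", "d6"], {"d5"}),   # first relevant at rank 1 -> 1
]
print(mrr(runs))  # (0.5 + 1.0) / 2 = 0.75
```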
| Metric | Rank-Aware | Relevance Type | Primary Use Case | Key Property |
|---|---|---|---|---|
| Precision@k | No | Binary | General retrieval | Measures accuracy of top k results |
| Recall@k | No | Binary | General retrieval | Measures coverage of relevant documents |
| MAP | Yes | Binary | Academic IR evaluation | Rewards relevant results placed early |
| NDCG@k | Yes | Graded | Web search, BEIR benchmark | Handles multi-level relevance |
| MRR | Yes | Binary | QA, navigational search | Focuses on first relevant result |
| F1 Score | No | Binary | Balanced evaluation | Harmonic mean of precision and recall |
MS MARCO (Microsoft Machine Reading Comprehension) is one of the most influential datasets in modern information retrieval. Released by Microsoft in 2016, it was originally designed for reading comprehension and question answering. The dataset contains over 1 million anonymized queries sampled from Bing's search logs, approximately 8.8 million passages extracted from roughly 3.6 million web documents, and around 533,000 training examples with human-annotated relevance labels.
MS MARCO's passage ranking task has become the de facto training set for neural retrieval models. The leaderboard uses MRR@10 as its primary evaluation metric. Nearly all prominent dense retrieval models, including DPR, ColBERT, and many sentence transformer variants, are trained or fine-tuned on MS MARCO data. However, models that achieve top performance on MS MARCO do not always generalize well to other domains, a limitation that motivated the creation of benchmarks like BEIR.
BEIR (Benchmarking IR) is a heterogeneous benchmark for evaluating the zero-shot generalization of retrieval models across diverse tasks and domains. Published by Nandan Thakur, Nils Reimers, Andreas Ruckle, Abhishek Srivastava, and Iryna Gurevych at NeurIPS 2021, BEIR comprises 18 publicly available datasets spanning nine different retrieval tasks: fact checking, citation prediction, duplicate question retrieval, argument retrieval, news retrieval, question answering, tweet retrieval, bio-medical IR, and entity retrieval.
The benchmark enforces a zero-shot evaluation protocol: models may be pre-trained on large generic corpora (typically MS MARCO) but may not be adapted to any BEIR-specific dataset before testing. BEIR uses NDCG@10 as its primary metric. A notable finding from BEIR is that BM25, despite being a simple lexical method, often outperforms neural models trained solely on MS MARCO when evaluated on out-of-domain datasets, highlighting the importance of generalization beyond in-domain benchmarks.
The Massive Text Embedding Benchmark (MTEB), introduced in 2022, evaluates text embedding models across eight task types, including retrieval, classification, reranking, clustering, and summarization, spanning 56 datasets. MTEB provides a holistic assessment of embedding quality and hosts a public leaderboard on Hugging Face. An expanded version, MMTEB (Massive Multilingual Text Embedding Benchmark), extends the evaluation to over 500 tasks covering more than 250 languages.
The TREC conference has produced numerous benchmark collections over its three decades of operation, covering tasks such as ad hoc retrieval, question answering, conversational search, and deep learning passage/document ranking. The TREC Deep Learning Track, launched in 2019, uses a large-scale collection based on MS MARCO data and has become a primary venue for evaluating neural retrieval systems.
Starting around 2018, the application of deep learning and pretrained language models like BERT to information retrieval transformed the field. Neural IR methods learn dense vector representations that capture semantic meaning, allowing them to match queries and documents based on meaning rather than exact keyword overlap.
A bi-encoder (also called a dual encoder) uses two separate neural network encoders: one for the query and one for the document. Each encoder independently maps its input to a fixed-size dense vector, and relevance is estimated by computing the similarity (typically dot product or cosine similarity) between the two vectors.
The key advantage of bi-encoders is efficiency. Because document embeddings can be computed offline and stored in a vector index, retrieval at query time involves only encoding the query and performing an approximate nearest neighbor search, which can process millions of candidates in milliseconds. The primary disadvantage is that by compressing each text into a single vector, fine-grained token-level interactions between the query and document are lost.
Dense Passage Retrieval (DPR), published by Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih at EMNLP 2020, was a landmark model demonstrating that learned dense representations could outperform BM25 for open-domain question answering. DPR uses two independent BERT-base models as its query and passage encoders and is trained with a contrastive loss using pairs of queries with their relevant passages along with hard negative examples retrieved by BM25.
On open-domain QA benchmarks, DPR outperformed a strong Lucene-BM25 baseline by 9-19 percentage points in top-20 passage retrieval accuracy. DPR uses FAISS for indexing and retrieval at inference time, enabling efficient search over large passage collections.
A cross-encoder processes the query and document jointly by concatenating them (separated by a special token) and feeding the combined sequence through a single transformer model. The model outputs a relevance score for the pair. Because every token from the query can attend to every token from the document through the transformer's self-attention mechanism, cross-encoders capture rich, fine-grained interactions and generally achieve higher accuracy than bi-encoders.
The drawback is computational cost. Since the query and document must be processed together, document representations cannot be precomputed. For a corpus of N documents, answering a single query requires N forward passes through the model, making cross-encoders impractical as first-stage retrievers for large collections. Instead, they are typically used as rerankers in a two-stage pipeline.
ColBERT (Contextualized Late Interaction over BERT), introduced by Omar Khattab and Matei Zaharia at SIGIR 2020, proposes a middle ground between bi-encoders and cross-encoders through a mechanism called late interaction. ColBERT independently encodes the query and document using BERT, but instead of compressing each into a single vector, it retains the full set of contextualized token embeddings.
At scoring time, each query token embedding is compared against all document token embeddings via cosine similarity, and the maximum similarity score for each query token (called MaxSim) is summed to produce the final relevance score. This approach preserves token-level matching granularity while still allowing document embeddings to be precomputed and stored.
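The MaxSim operation can be sketched over toy token embeddings (three dimensions here for readability, and the values are invented; ColBERT typically uses 128-dimensional normalized vectors):

```python
def dot(u, v):
    """Dot-product similarity between two token embeddings."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: for each query token, take its best match among
    all document tokens, then sum those maxima into one relevance score."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Invented toy embeddings: two query tokens, three document tokens.
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
doc = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.6], [0.5, 0.5, 0.0]]
score = maxsim_score(query, doc)  # 0.9 (first token) + 0.8 (second token)
```

Because the document-side embeddings are fixed, they can be precomputed and indexed; only the small query-side matrix is produced at query time.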
ColBERTv2, published by Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia in 2022, improved on the original by introducing residual compression that reduces the storage footprint by 6-10x while maintaining or improving retrieval quality. ColBERTv2 achieves state-of-the-art results both within and outside the MS MARCO training domain.
| Architecture | Query-Document Interaction | Precompute Documents | Speed at Retrieval | Accuracy | Typical Use |
|---|---|---|---|---|---|
| Bi-Encoder (e.g., DPR) | Minimal (single-vector similarity) | Yes | Very fast | Moderate | First-stage retrieval |
| Cross-Encoder | Full (joint attention) | No | Very slow | High | Reranking |
| Late Interaction (e.g., ColBERT) | Token-level (MaxSim) | Partially (token embeddings stored) | Moderate | High | Retrieval or reranking |
| Learned Sparse (e.g., SPLADE) | Term-level (weighted sparse vectors) | Yes | Fast (inverted index) | Moderate-High | First-stage retrieval |
SPLADE (Sparse Lexical and Expansion Model), introduced at SIGIR 2021, bridges the gap between classical sparse methods and neural approaches. SPLADE uses a pretrained BERT model's masked language modeling (MLM) head to generate sparse, high-dimensional vector representations for queries and documents. These representations can be searched using traditional inverted indexes, combining the efficiency of lexical retrieval with the semantic understanding of neural models.
A distinctive feature of SPLADE is term expansion: the model can assign non-zero weights to terms that do not explicitly appear in the text but are semantically related. For example, a document about "canines" might receive a non-zero weight for the term "dog." SPLADE applies log-saturation and explicit sparsity regularization to keep representations compact. On the BEIR benchmark, SPLADE variants have shown strong generalization, often outperforming dense bi-encoder models on out-of-domain tasks.
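The effect of term expansion can be illustrated with toy term-weight dictionaries. The weights below are invented; a real SPLADE model emits them from BERT's MLM head with sparsity regularization:

```python
# Learned-sparse representations as term -> weight dictionaries (toy values).
# Note "dog" gets a nonzero weight even though the document says "canine".
doc_vec = {"canine": 1.2, "veterinary": 0.8, "dog": 0.6}
query_vec = {"dog": 1.0, "health": 0.4}

def sparse_dot(q, d):
    """Score = dot product over terms shared by query and document;
    this is exactly the computation an inverted index performs efficiently."""
    return sum(w * d[t] for t, w in q.items() if t in d)

score = sparse_dot(query_vec, doc_vec)  # matches via the expanded term "dog"
```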
Hybrid retrieval combines sparse (lexical) and dense (semantic) retrieval methods to leverage their complementary strengths. Sparse methods like BM25 excel at exact keyword matching and perform well on queries containing rare or domain-specific terms. Dense methods capture semantic similarity and handle synonymy, paraphrasing, and conceptual matching more effectively. By combining both, hybrid systems achieve more robust retrieval across a wider range of query types.
There are several strategies for implementing hybrid retrieval:
Parallel retrieval with score fusion: Both a sparse retriever (e.g., BM25) and a dense retriever (e.g., a bi-encoder) run independently on the same query. Their result lists are merged using score normalization and linear combination, or through Reciprocal Rank Fusion (RRF), which assigns scores based on each document's rank position in both lists rather than relying on raw scores. RRF is popular because it does not require score calibration between different retrieval systems.
Learned hybrid representations: Models like SPLADE produce sparse vectors that can be searched alongside dense vectors in a unified index. Some systems store both sparse and dense representations for each document and combine scores at query time.
Pipeline approaches: A sparse retriever generates an initial candidate set, which is then re-scored by a dense model. This approach is conceptually similar to reranking but uses a bi-encoder rather than a cross-encoder for the second stage.
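Reciprocal Rank Fusion, used in the parallel-fusion strategy above, is simple to implement; k = 60 is the constant proposed in the original RRF formulation:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists by summing 1/(k + rank) for each document.
    Only rank positions matter, so no score calibration is needed."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy result lists from a sparse and a dense retriever.
bm25_results = ["d1", "d2", "d3"]
dense_results = ["d3", "d1", "d4"]
print(reciprocal_rank_fusion([bm25_results, dense_results]))
```

Documents that appear high in both lists (here "d1" and "d3") rise to the top of the fused ranking.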
Major search platforms including Elasticsearch, OpenSearch, Weaviate, and Pinecone provide built-in support for hybrid search, typically offering both BM25 and vector search with configurable fusion methods.
Reranking is a technique in which an initial set of candidate documents, retrieved by a fast first-stage retriever, is re-scored by a more accurate but computationally expensive model. The two-stage retrieve-then-rerank pipeline has become standard practice in modern IR systems because it balances efficiency with accuracy.
In a typical pipeline:
1. A fast first-stage retriever (e.g., BM25, a bi-encoder, or a hybrid of both) scans the full corpus and returns a candidate set, commonly the top 100 to 1,000 documents.
2. A slower but more accurate model, typically a cross-encoder, re-scores each candidate against the query.
3. The candidates are re-sorted by the new scores, and the top few results are returned to the user or passed to a downstream component.
Because the reranker only processes a small number of candidates rather than the entire corpus, the computational cost remains manageable even for transformer-based cross-encoders.
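The two-stage structure can be sketched with stand-in scorers. Neither function below is a real model: the cheap scorer substitutes for BM25 and the expensive one for a cross-encoder forward pass:

```python
def first_stage_score(query, doc):
    """Cheap lexical score: count of shared terms (stand-in for BM25)."""
    return len(set(query.split()) & set(doc.split()))

def rerank_score(query, doc):
    """Expensive score; placeholder for a cross-encoder forward pass.
    Here it simply adds a small bonus per repeated first query term."""
    return first_stage_score(query, doc) + 0.1 * doc.split().count(query.split()[0])

def retrieve_then_rerank(query, corpus, k=2, n=1):
    # Stage 1: scan the whole corpus with the cheap scorer, keep top k.
    candidates = sorted(corpus, key=lambda d: first_stage_score(query, d),
                        reverse=True)[:k]
    # Stage 2: re-score only the k survivors with the expensive model.
    return sorted(candidates, key=lambda d: rerank_score(query, d),
                  reverse=True)[:n]

corpus = ["apple pie recipe", "apple apple orchard", "car repair"]
top = retrieve_then_rerank("apple", corpus)
```

The key property is that the expensive scorer runs k times, not once per corpus document, which is what makes cross-encoder reranking tractable.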
Several families of reranking models are available:
| Reranker | Provider | Architecture | Key Features |
|---|---|---|---|
| ms-marco-MiniLM | Sentence Transformers | Cross-encoder (MiniLM) | Lightweight, fast, trained on MS MARCO |
| Cohere Rerank 3.5 | Cohere | Proprietary cross-encoder | Multilingual (100+ languages), API-based |
| Jina Reranker v2 | Jina AI | Open-weight cross-encoder | Multilingual, supports code search and long documents |
| bge-reranker-v2-m3 | BAAI | Cross-encoder (Transformer) | Open-source, competitive accuracy |
| ColBERTv2 | Stanford | Late interaction | Can serve as both retriever and reranker |
| RankGPT | Academic (open-source) | LLM-based listwise | Uses GPT models for zero-shot reranking |
Reranking consistently improves retrieval quality. Studies have shown that adding a cross-encoder reranker on top of a bi-encoder retriever can improve NDCG@10 by 5-15 points on standard benchmarks. The quality gains come from the cross-encoder's ability to model fine-grained token-level interactions that bi-encoders miss when compressing text into a single vector.
Retrieval-augmented generation (RAG) represents one of the most significant applications of information retrieval in the era of large language models. RAG combines a retrieval component with a generative model to produce answers grounded in external knowledge.
A standard RAG pipeline consists of three stages:
1. Retrieval: the user's query is used to search an indexed document collection, via sparse, dense, or hybrid retrieval, for relevant passages.
2. Augmentation: the retrieved passages are inserted into the prompt alongside the original query.
3. Generation: a large language model produces an answer conditioned on both the query and the retrieved context.
The quality of the retrieval stage directly determines the quality of the generated response. If the retriever fails to surface relevant passages, the LLM will either hallucinate or produce an incomplete answer. Research has shown that retrieval accuracy (measured by Recall@k of the relevant passages) is the single most important factor for RAG performance, more so than the choice of generative model.
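A minimal end-to-end sketch of the three stages, with a toy keyword retriever and the LLM left as a caller-supplied function (nothing here is a real model API):

```python
def retrieve(query, corpus, k=2):
    """Toy keyword-overlap retriever standing in for a real search index."""
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def build_prompt(query, passages):
    """Augmentation: pack the retrieved passages into the generation prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def rag_answer(query, corpus, llm):
    passages = retrieve(query, corpus)       # 1. retrieval
    prompt = build_prompt(query, passages)   # 2. augmentation
    return llm(prompt)                       # 3. generation

corpus = ["BM25 is a ranking function.", "ColBERT uses late interaction."]
prompt_preview = build_prompt("What is BM25?", retrieve("What is BM25?", corpus))
```

Swapping the toy `retrieve` for a hybrid retriever plus reranker is where most of the quality gains in production RAG systems come from.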
Advanced RAG systems employ multi-stage retrieval with reranking, query expansion, hypothetical document embeddings (HyDE), and iterative retrieval to maximize the relevance of context provided to the generator. The tight coupling between IR and language generation has renewed interest in retrieval research and driven the development of specialized embedding models, rerankers, and hybrid search systems optimized for RAG workloads.
Information retrieval research has produced the algorithms and data structures that power both open-source and commercial search infrastructure.
Apache Lucene is an open-source Java library for full-text search that implements inverted indexing, BM25 scoring, and (since version 9.0) dense vector search using the HNSW algorithm. Lucene serves as the foundation for several widely used search platforms:
- Elasticsearch: a distributed search and analytics engine exposing Lucene through a REST API, widely used for application search and log analytics.
- OpenSearch: a community-maintained fork of Elasticsearch distributed under the Apache 2.0 license.
- Apache Solr: a mature search server built on Lucene, offering faceting, distributed search, and rich document handling.
The growth of dense retrieval and RAG has spurred the development of purpose-built vector databases:
- Pinecone: a fully managed, cloud-native vector database service.
- Weaviate: an open-source vector database with built-in hybrid (BM25 plus vector) search.
- Milvus: an open-source vector database designed for large-scale similarity search.
- Qdrant: an open-source vector search engine written in Rust with rich payload filtering.
FAISS (Facebook AI Similarity Search), developed by Meta AI Research and released in 2017, is an open-source library for efficient similarity search and clustering of dense vectors. FAISS implements multiple indexing strategies including flat (exact) search, Inverted File Index (IVF), Product Quantization (PQ), and HNSW, and supports both CPU and GPU acceleration. It is widely used as the retrieval backend for research systems and as a building block within production vector databases.
| Method | Type | Approach | Strengths | Weaknesses | Year Introduced |
|---|---|---|---|---|---|
| Boolean Retrieval | Sparse | Exact keyword match with logical operators | Precise control, no ranking ambiguity | No ranking, rigid query syntax | 1950s |
| TF-IDF | Sparse | Term frequency weighted by inverse document frequency | Simple, interpretable, fast | Ignores semantics, no length normalization | 1957/1972 |
| BM25 | Sparse | Probabilistic ranking with saturation and length normalization | Strong baseline, excellent generalization | No semantic matching | 1980s-1994 |
| DPR | Dense | Dual BERT encoders with contrastive training | Captures semantic similarity | Single-vector bottleneck, requires training data | 2020 |
| ColBERT | Dense (late interaction) | Token-level MaxSim over BERT embeddings | High accuracy with precomputable document embeddings | Higher storage than single-vector models | 2020 |
| SPLADE | Learned sparse | BERT MLM head with sparse regularization | Efficient (inverted index), strong generalization | Requires training, more complex than BM25 | 2021 |
| Cross-Encoder | Dense (joint) | Full transformer attention over concatenated query-document | Highest accuracy | Too slow for first-stage retrieval | 2019 |
| Hybrid (BM25 + Dense) | Hybrid | Parallel sparse and dense retrieval with score fusion | Combines lexical and semantic matching | Added complexity, two indexes | 2020+ |
Information retrieval continues to evolve rapidly. Several trends are shaping the field as of 2025 and 2026: