Information retrieval (IR) is the science and practice of finding relevant documents, passages, or data from large unstructured collections in response to a user's information need. The field sits at the intersection of computer science, natural language processing, and machine learning, and it provides the theoretical and algorithmic foundations for web search engines, enterprise search, question-answering systems, and modern retrieval-augmented generation (RAG) pipelines.
The term "information retrieval" was coined by Calvin Mooers in 1950 to describe the process of searching for information within a stored collection. Early IR research focused on library automation and indexing systems. In the 1960s, Gerard Salton and his team at Harvard (and later Cornell University) developed the SMART (System for the Mechanical Analysis and Retrieval of Text) information retrieval system, which was the first to implement the vector space model for representing documents and queries. By 1971, SMART was demonstrating retrieval performance that rivaled human indexers. Salton is widely regarded as "the father of information retrieval," and many foundational concepts, including relevance feedback and Rocchio classification, emerged from SMART research.
In 1972, Karen Sparck Jones published a seminal paper in the Journal of Documentation introducing the concept of inverse document frequency (IDF), which assigns higher weight to terms that appear in fewer documents across a collection. This insight became a cornerstone of term weighting and remains embedded in virtually every modern search engine. The probabilistic retrieval framework, which provides a formal statistical foundation for ranking documents by their estimated probability of relevance, was developed throughout the 1970s and 1980s by Stephen E. Robertson, Karen Sparck Jones, and colleagues.
The Text REtrieval Conference (TREC), co-sponsored by the National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), began in 1992 and played a transformative role in advancing the field. TREC provided large-scale standardized test collections and evaluation methodologies that allowed researchers to compare retrieval systems rigorously. Within the first six years of TREC workshops, the effectiveness of participating retrieval systems approximately doubled. An independent study found that roughly one-third of the improvement in web search engines from 1999 to 2009 is attributable to research catalyzed by TREC.
The rise of the World Wide Web in the mid-1990s brought IR techniques into mainstream computing. Commercial search engines such as AltaVista, Yahoo, and later Google applied and extended IR methods at unprecedented scale. Google's PageRank algorithm combined link analysis with traditional text-matching signals, and the success of web search cemented information retrieval as one of the most impactful areas of computer science.
An IR system operates over a corpus (a collection of documents) and responds to user queries. Documents can be web pages, academic papers, news articles, product descriptions, or any body of text. The central challenge is determining which documents are relevant to a given query, where relevance is defined as the degree to which a document satisfies the user's information need.
To enable fast retrieval over large corpora, IR systems build data structures called indexes. The most widely used structure is the inverted index, which maps each unique term in the corpus to a list of documents (and positions within those documents) where that term occurs. When a search query arrives, the system looks up each query term in the inverted index, retrieves the corresponding document lists, and combines them to produce candidate results. Inverted indexes form the backbone of systems like Apache Lucene, Elasticsearch, and Apache Solr.
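A minimal inverted index can be sketched in a few lines of Python. The corpus and whitespace tokenization below are toy assumptions; production systems add stemming, stopword handling, and compressed posting lists:

```python
from collections import defaultdict

# Toy corpus; invented example documents for illustration.
docs = {
    0: "information retrieval finds relevant documents",
    1: "an inverted index maps terms to documents",
    2: "dense retrieval uses vector embeddings",
}

# Build the inverted index: term -> list of (doc_id, positions).
index = defaultdict(list)
for doc_id, text in docs.items():
    positions = defaultdict(list)
    for pos, term in enumerate(text.split()):
        positions[term].append(pos)
    for term, pos_list in positions.items():
        index[term].append((doc_id, pos_list))

def search(query):
    """Return ids of documents containing every query term (conjunctive lookup)."""
    result = None
    for term in query.split():
        docs_for_term = {doc_id for doc_id, _ in index.get(term, [])}
        result = docs_for_term if result is None else result & docs_for_term
    return sorted(result or [])

print(search("retrieval documents"))  # [0]: the only doc with both terms
```

Real systems then rank the candidate set with a scoring function such as TF-IDF or BM25 rather than returning the raw intersection.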
For dense retrieval methods, the analogous structure is a vector index that stores document embeddings and supports fast similarity search. Libraries such as FAISS (Facebook AI Similarity Search), released by Meta AI Research in 2017, implement algorithms like Hierarchical Navigable Small Worlds (HNSW) and Inverted File Index (IVF) to perform approximate nearest neighbor (ANN) search over millions or billions of vectors in sub-linear time.
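The exact ("flat") search that HNSW and IVF approximate can be sketched as brute-force cosine scoring. The four-dimensional vectors here are invented for illustration; real embeddings have hundreds of dimensions:

```python
import math

# Toy 4-dimensional embeddings; a learned encoder would produce these.
doc_vectors = {
    "doc_a": [0.9, 0.1, 0.0, 0.1],
    "doc_b": [0.1, 0.8, 0.2, 0.0],
    "doc_c": [0.0, 0.2, 0.9, 0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(query_vec, k=2):
    """Exact (flat) search: score every vector, keep the top k.
    ANN structures like HNSW and IVF approximate this result
    without scanning every stored vector."""
    scored = sorted(doc_vectors.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

print(nearest([1.0, 0.0, 0.0, 0.0], k=1))  # ['doc_a']
```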
Term Frequency-Inverse Document Frequency (TF-IDF) is one of the earliest and most influential term weighting schemes in information retrieval. Introduced through the combined work of Hans Peter Luhn (term frequency, 1957) and Karen Sparck Jones (inverse document frequency, 1972), TF-IDF quantifies the importance of a word to a document within a corpus.
The scheme consists of two components:
- Term frequency (TF): how often a term t occurs in a document d. Terms that appear frequently in a document are assumed to be more representative of its content.
- Inverse document frequency (IDF): typically log(N / df(t)), where N is the number of documents in the corpus and df(t) is the number of documents containing t. Terms that occur in many documents carry little discriminative power and receive lower weight.
The final TF-IDF score for a term in a document is the product of these two values: TF(t, d) × IDF(t). Documents are ranked by the sum of TF-IDF scores across all query terms. TF-IDF remains useful as a baseline and as a feature in more complex systems, but it has known limitations: it treats each term independently (ignoring word order and semantics) and does not account for document length variation.
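A minimal TF-IDF scorer following this description might look like the sketch below. The corpus is a toy example, and the log-based IDF is the classic formulation; real systems apply smoothing and normalization variants:

```python
import math

# Assumed toy corpus for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]
N = len(corpus)
doc_terms = [doc.split() for doc in corpus]

def tf(term, doc):
    """Raw term frequency: number of occurrences of term in doc."""
    return doc.count(term)

def idf(term):
    """Classic IDF: log of corpus size over document frequency."""
    df = sum(1 for doc in doc_terms if term in doc)
    return math.log(N / df) if df else 0.0

def tfidf_score(query, doc):
    """Sum of TF x IDF over all query terms, as described in the text."""
    return sum(tf(t, doc) * idf(t) for t in query.split())

scores = [tfidf_score("cat mat", doc) for doc in doc_terms]
```

The first document wins because it contains both query terms, and "mat" is rare in the corpus and so carries a high IDF weight.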
BM25, also known as Okapi BM25, is a probabilistic ranking function that refines and extends the ideas behind TF-IDF. Developed by Stephen E. Robertson, Karen Sparck Jones, and colleagues during the 1980s and 1990s as part of the Okapi information retrieval system at City University London, the "25" denotes its position in a series of iterative Best Matching formulations. BM25 was among the top-performing systems in the early TREC conferences and has remained a dominant baseline for over three decades.
BM25 improves on basic TF-IDF in two key ways:
- Term frequency saturation: a term's contribution grows sub-linearly with its frequency and approaches an asymptote controlled by the parameter k1, so the tenth occurrence of a term adds far less than the first.
- Document length normalization: scores are adjusted by the ratio of a document's length to the average document length, controlled by the parameter b, so longer documents are not favored merely for containing more term occurrences.
The BM25 scoring formula for a query Q containing terms q1, q2, ..., qn against a document D is:
Score(D, Q) = sum over i of [ IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D| / avgdl)) ]
where f(qi, D) is the frequency of term qi in document D, |D| is the document length, and avgdl is the average document length across the corpus.
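The formula translates directly into code. The corpus below is a toy example; k1 = 1.5 and b = 0.75 are common default parameter choices, and the IDF uses the smoothed Robertson-Sparck Jones form typical of Okapi implementations:

```python
import math

# Toy corpus, pre-tokenized for illustration.
corpus = [
    "the quick brown fox".split(),
    "the lazy dog sleeps all day long in the sun".split(),
    "the fox jumps over the lazy dog".split(),
]
N = len(corpus)
avgdl = sum(len(d) for d in corpus) / N

def idf(term):
    # Smoothed IDF (+0.5) commonly used with Okapi BM25.
    df = sum(1 for d in corpus if term in d)
    return math.log((N - df + 0.5) / (df + 0.5) + 1)

def bm25(query, doc, k1=1.5, b=0.75):
    """Score one document against a tokenized query, per the formula above."""
    score = 0.0
    for term in query:
        f = doc.count(term)                               # f(qi, D)
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)   # saturation + length norm
        score += idf(term) * f * (k1 + 1) / denom
    return score

scores = [bm25("lazy fox".split(), d) for d in corpus]
```

The third document scores highest because it matches both query terms, while the long second document is penalized by length normalization.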
BM25 is the default ranking function in Apache Lucene (and by extension Elasticsearch and Apache Solr), and it remains remarkably competitive. On the BEIR zero-shot benchmark, BM25 outperforms many neural retrieval models that were fine-tuned on MS MARCO, demonstrating strong generalization across domains.
Before ranked retrieval became dominant, Boolean retrieval systems allowed users to combine search terms with logical operators (AND, OR, NOT). A Boolean query returns all documents that satisfy the logical expression without ranking them by relevance. While limited in expressiveness, Boolean retrieval is still used in specialized applications such as patent search and legal discovery where precision and explicit control over query logic are critical.
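Boolean retrieval reduces to set operations over posting lists, as in this sketch with invented document ids:

```python
# Term -> posting set mappings (toy data).
postings = {
    "patent": {1, 2, 5},
    "infringement": {2, 3, 5},
    "trademark": {3, 4},
}

# Evaluate: patent AND infringement NOT trademark
result = (postings["patent"] & postings["infringement"]) - postings["trademark"]
print(sorted(result))  # [2, 5]
```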
Rigorous evaluation is essential for comparing IR systems and measuring progress. Metrics fall into two broad categories: set-based metrics that consider only whether retrieved documents are relevant, and rank-aware metrics that also account for the position of relevant documents in the result list.
Precision is the fraction of retrieved documents that are relevant:
Precision = |Relevant ∩ Retrieved| / |Retrieved|
Recall is the fraction of all relevant documents that are retrieved:
Recall = |Relevant ∩ Retrieved| / |Relevant|
These metrics are often computed at a fixed cutoff k (Precision@k, Recall@k), meaning only the top k results are considered. There is a natural tension between precision and recall: increasing the number of retrieved results tends to improve recall but can reduce precision, and vice versa.
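Both cutoff metrics are straightforward to compute; the ranked list and relevance judgments below are invented for illustration:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

retrieved = ["d3", "d7", "d1", "d9", "d4"]   # ranked system output (toy ids)
relevant = {"d1", "d4", "d8"}                # ground-truth judgments

print(precision_at_k(retrieved, relevant, 5))  # 2/5 = 0.4
print(recall_at_k(retrieved, relevant, 5))     # 2/3
```

Note that "d8" is relevant but never retrieved: a deeper cutoff could recover it and raise recall, at the likely cost of lower precision.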
MAP is a rank-aware metric that combines precision at every position where a relevant document is found. For a single query, Average Precision (AP) is the mean of precision values calculated at each rank position containing a relevant document. MAP is the mean of AP scores across all queries in a test set. MAP rewards systems that place relevant documents earlier in the ranked list and is one of the most widely reported metrics in academic IR research.
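A sketch of AP and MAP, using the standard convention of dividing by the total number of relevant documents for the query:

```python
def average_precision(retrieved, relevant):
    """Mean of precision values at each rank holding a relevant document,
    divided by the total number of relevant documents."""
    hits, precisions = 0, []
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP: mean of AP over (retrieved, relevant) pairs for each query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy example: relevant docs found at ranks 1 and 3.
ap = average_precision(["a", "b", "c"], {"a", "c"})  # (1/1 + 2/3) / 2
```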
NDCG is a rank-aware metric designed for graded relevance judgments (where documents can be rated on a scale such as 0-3 rather than simply relevant/not relevant). It builds on Discounted Cumulative Gain (DCG), which sums the relevance grades of results while applying a logarithmic discount based on position:
DCG@k = sum from i=1 to k of [ (2^rel_i - 1) / log2(i + 1) ]
NDCG normalizes DCG by dividing it by the ideal DCG (IDCG), which is the DCG of a perfect ranking:
NDCG@k = DCG@k / IDCG@k
NDCG@10 is the primary metric used in the BEIR benchmark and is widely used for evaluating web search and recommendation systems.
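The DCG and NDCG formulas above translate directly to code; the graded judgments here are invented for illustration:

```python
import math

def dcg_at_k(grades, k):
    """DCG with the (2^rel - 1) gain and log2(position + 1) discount
    used in the formula above; grades are listed top to bottom."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(grades[:k]))

def ndcg_at_k(grades, k):
    """Normalize by the DCG of the ideal (descending-grade) ordering."""
    idcg = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / idcg if idcg > 0 else 0.0

# Graded judgments (0-3) for a ranked result list.
score = ndcg_at_k([3, 1, 0, 2], k=4)
```

A perfectly ordered list yields NDCG of exactly 1.0; any misordering of graded documents pushes the score below 1.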
MRR measures how quickly a system returns the first relevant result. The reciprocal rank for a single query is 1 divided by the position of the first relevant document. MRR is the mean of reciprocal ranks across all queries:
MRR = (1/|Q|) * sum over i of (1 / rank_i)
MRR is particularly useful for navigational queries and factoid question answering, where the user typically seeks a single correct answer. MRR@10 is the primary metric for the MS MARCO passage ranking leaderboard.
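MRR takes only a few lines; the runs below are toy examples:

```python
def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0 if none retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mrr(runs):
    """Mean reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["d2", "d1"], {"d1"}),   # first relevant at rank 2 -> 1/2
    (["d5", "d6"], {"d5"}),   # first relevant at rank 1 -> 1
]
print(mrr(runs))  # (0.5 + 1.0) / 2 = 0.75
```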
| Metric | Rank-Aware | Relevance Type | Primary Use Case | Key Property |
|---|---|---|---|---|
| Precision@k | No | Binary | General retrieval | Measures accuracy of top k results |
| Recall@k | No | Binary | General retrieval | Measures coverage of relevant documents |
| MAP | Yes | Binary | Academic IR evaluation | Rewards relevant results placed early |
| NDCG@k | Yes | Graded | Web search, BEIR benchmark | Handles multi-level relevance |
| MRR | Yes | Binary | QA, navigational search | Focuses on first relevant result |
| F1 Score | No | Binary | Balanced evaluation | Harmonic mean of precision and recall |
MS MARCO (Microsoft Machine Reading Comprehension) is one of the most influential datasets in modern information retrieval. Released by Microsoft in 2016, it was originally designed for reading comprehension and question answering. The dataset contains over 1 million anonymized queries sampled from Bing's search logs, approximately 8.8 million passages extracted from roughly 3.6 million web documents, and around 533,000 training examples with human-annotated relevance labels.
MS MARCO's passage ranking task has become the de facto training set for neural retrieval models. The leaderboard uses MRR@10 as its primary evaluation metric. Nearly all prominent dense retrieval models, including DPR, ColBERT, and many sentence transformer variants, are trained or fine-tuned on MS MARCO data. However, models that achieve top performance on MS MARCO do not always generalize well to other domains, a limitation that motivated the creation of benchmarks like BEIR.
BEIR (Benchmarking IR) is a heterogeneous benchmark for evaluating the zero-shot generalization of retrieval models across diverse tasks and domains. Published by Nandan Thakur, Nils Reimers, Andreas Ruckle, Abhishek Srivastava, and Iryna Gurevych at NeurIPS 2021, BEIR comprises 18 publicly available datasets spanning nine different retrieval tasks: fact checking, citation prediction, duplicate question retrieval, argument retrieval, news retrieval, question answering, tweet retrieval, bio-medical IR, and entity retrieval.
The benchmark enforces a zero-shot evaluation protocol: models may be pre-trained on large generic corpora (typically MS MARCO) but may not be adapted to any BEIR-specific dataset before testing. BEIR uses NDCG@10 as its primary metric. A notable finding from BEIR is that BM25, despite being a simple lexical method, often outperforms neural models trained solely on MS MARCO when evaluated on out-of-domain datasets, highlighting the importance of generalization beyond in-domain benchmarks.
The Massive Text Embedding Benchmark (MTEB), introduced in 2022, evaluates text embedding models across eight task types, including retrieval, classification, reranking, clustering, and summarization, spanning 56 datasets. MTEB provides a holistic assessment of embedding quality and hosts a public leaderboard on Hugging Face. An expanded version, MMTEB (Massive Multilingual Text Embedding Benchmark), extends the evaluation to over 500 tasks covering more than 250 languages.
The TREC conference has produced numerous benchmark collections over its three decades of operation, covering tasks such as ad hoc retrieval, question answering, conversational search, and deep learning passage/document ranking. The TREC Deep Learning Track, launched in 2019, uses a large-scale collection based on MS MARCO data and has become a primary venue for evaluating neural retrieval systems.
Starting around 2018, the application of deep learning and pretrained language models like BERT to information retrieval transformed the field. Neural IR methods learn dense vector representations that capture semantic meaning, allowing them to match queries and documents based on meaning rather than exact keyword overlap.
A bi-encoder (also called a dual encoder) uses two separate neural network encoders: one for the query and one for the document. Each encoder independently maps its input to a fixed-size dense vector, and relevance is estimated by computing the similarity (typically dot product or cosine similarity) between the two vectors.
The key advantage of bi-encoders is efficiency. Because document embeddings can be computed offline and stored in a vector index, retrieval at query time involves only encoding the query and performing an approximate nearest neighbor search, which can process millions of candidates in milliseconds. The primary disadvantage is that by compressing each text into a single vector, fine-grained token-level interactions between the query and document are lost.
Dense Passage Retrieval (DPR), published by Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih at EMNLP 2020, was a landmark model demonstrating that learned dense representations could outperform BM25 for open-domain question answering. DPR uses two independent BERT-base models as its query and passage encoders and is trained with a contrastive loss using pairs of queries with their relevant passages along with hard negative examples retrieved by BM25.
On open-domain QA benchmarks, DPR outperformed a strong Lucene-BM25 baseline by 9-19 percentage points in top-20 passage retrieval accuracy. DPR uses FAISS for indexing and retrieval at inference time, enabling efficient search over large passage collections.
A cross-encoder processes the query and document jointly by concatenating them (separated by a special token) and feeding the combined sequence through a single transformer model. The model outputs a relevance score for the pair. Because every token from the query can attend to every token from the document through the transformer's self-attention mechanism, cross-encoders capture rich, fine-grained interactions and generally achieve higher accuracy than bi-encoders.
The drawback is computational cost. Since the query and document must be processed together, document representations cannot be precomputed. For a corpus of N documents, answering a single query requires N forward passes through the model, making cross-encoders impractical as first-stage retrievers for large collections. Instead, they are typically used as rerankers in a two-stage pipeline.
ColBERT (Contextualized Late Interaction over BERT), introduced by Omar Khattab and Matei Zaharia at SIGIR 2020, proposes a middle ground between bi-encoders and cross-encoders through a mechanism called late interaction. ColBERT independently encodes the query and document using BERT, but instead of compressing each into a single vector, it retains the full set of contextualized token embeddings.
At scoring time, each query token embedding is compared against all document token embeddings via cosine similarity, and the maximum similarity score for each query token (called MaxSim) is summed to produce the final relevance score. This approach preserves token-level matching granularity while still allowing document embeddings to be precomputed and stored.
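The MaxSim operation can be sketched over toy token embeddings (three dimensions here for readability, and the values are invented; ColBERT typically uses 128-dimensional normalized vectors):

```python
def dot(u, v):
    """Dot-product similarity between two token embeddings."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: for each query token, take its best match among
    all document tokens, then sum those maxima into one relevance score."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Invented toy embeddings: two query tokens, three document tokens.
query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
doc = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.6], [0.5, 0.5, 0.0]]
score = maxsim_score(query, doc)  # 0.9 (first token) + 0.8 (second token)
```

Because the document-side embeddings are fixed, they can be precomputed and indexed; only the small query-side matrix is produced at query time.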
ColBERTv2, published by Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia in 2022, improved on the original by introducing residual compression that reduces the storage footprint by 6-10x while maintaining or improving retrieval quality. ColBERTv2 achieves state-of-the-art results both within and outside the MS MARCO training domain.
| Architecture | Query-Document Interaction | Precompute Documents | Speed at Retrieval | Accuracy | Typical Use |
|---|---|---|---|---|---|
| Bi-Encoder (e.g., DPR) | Minimal (single-vector similarity) | Yes | Very fast | Moderate | First-stage retrieval |
| Cross-Encoder | Full (joint attention) | No | Very slow | High | Reranking |
| Late Interaction (e.g., ColBERT) | Token-level (MaxSim) | Partially (token embeddings stored) | Moderate | High | Retrieval or reranking |
| Learned Sparse (e.g., SPLADE) | Term-level (weighted sparse vectors) | Yes | Fast (inverted index) | Moderate-High | First-stage retrieval |
SPLADE (Sparse Lexical and Expansion Model), introduced at SIGIR 2021, bridges the gap between classical sparse methods and neural approaches. SPLADE uses a pretrained BERT model's masked language modeling (MLM) head to generate sparse, high-dimensional vector representations for queries and documents. These representations can be searched using traditional inverted indexes, combining the efficiency of lexical retrieval with the semantic understanding of neural models.
A distinctive feature of SPLADE is term expansion: the model can assign non-zero weights to terms that do not explicitly appear in the text but are semantically related. For example, a document about "canines" might receive a non-zero weight for the term "dog." SPLADE applies log-saturation and explicit sparsity regularization to keep representations compact. On the BEIR benchmark, SPLADE variants have shown strong generalization, often outperforming dense bi-encoder models on out-of-domain tasks.
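The effect of term expansion can be illustrated with toy term-weight dictionaries. The weights below are invented; a real SPLADE model emits them from BERT's MLM head with sparsity regularization:

```python
# Learned-sparse representations as term -> weight dictionaries (toy values).
# Note "dog" gets a nonzero weight even though the document says "canine".
doc_vec = {"canine": 1.2, "veterinary": 0.8, "dog": 0.6}
query_vec = {"dog": 1.0, "health": 0.4}

def sparse_dot(q, d):
    """Score = dot product over terms shared by query and document;
    this is exactly the computation an inverted index performs efficiently."""
    return sum(w * d[t] for t, w in q.items() if t in d)

score = sparse_dot(query_vec, doc_vec)  # matches via the expanded term "dog"
```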
Hybrid retrieval combines sparse (lexical) and dense (semantic) retrieval methods to leverage their complementary strengths. Sparse methods like BM25 excel at exact keyword matching and perform well on queries containing rare or domain-specific terms. Dense methods capture semantic similarity and handle synonymy, paraphrasing, and conceptual matching more effectively. By combining both, hybrid systems achieve more robust retrieval across a wider range of query types.
There are several strategies for implementing hybrid retrieval:
Parallel retrieval with score fusion: Both a sparse retriever (e.g., BM25) and a dense retriever (e.g., a bi-encoder) run independently on the same query. Their result lists are merged using score normalization and linear combination, or through Reciprocal Rank Fusion (RRF), which assigns scores based on each document's rank position in both lists rather than relying on raw scores. RRF is popular because it does not require score calibration between different retrieval systems.
Learned hybrid representations: Models like SPLADE produce sparse vectors that can be searched alongside dense vectors in a unified index. Some systems store both sparse and dense representations for each document and combine scores at query time.
Pipeline approaches: A sparse retriever generates an initial candidate set, which is then re-scored by a dense model. This approach is conceptually similar to reranking but uses a bi-encoder rather than a cross-encoder for the second stage.
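Reciprocal Rank Fusion, used in the parallel-fusion strategy above, is simple to implement; k = 60 is the constant proposed in the original RRF formulation:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists by summing 1/(k + rank) for each document.
    Only rank positions matter, so no score calibration is needed."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy result lists from a sparse and a dense retriever.
bm25_results = ["d1", "d2", "d3"]
dense_results = ["d3", "d1", "d4"]
print(reciprocal_rank_fusion([bm25_results, dense_results]))
```

Documents that appear high in both lists (here "d1" and "d3") rise to the top of the fused ranking.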
Major search platforms including Elasticsearch, OpenSearch, Weaviate, and Pinecone provide built-in support for hybrid search, typically offering both BM25 and vector search with configurable fusion methods.
Reranking is a technique in which an initial set of candidate documents, retrieved by a fast first-stage retriever, is re-scored by a more accurate but computationally expensive model. The two-stage retrieve-then-rerank pipeline has become standard practice in modern IR systems because it balances efficiency with accuracy.
In a typical pipeline:
1. A fast first-stage retriever (e.g., BM25, a bi-encoder, or a hybrid of both) scans the full corpus and returns a candidate set, commonly the top 100 to 1,000 documents.
2. A slower but more accurate model, typically a cross-encoder, re-scores each candidate against the query.
3. The candidates are re-sorted by the new scores, and the top few results are returned to the user or passed to a downstream component.
Because the reranker only processes a small number of candidates rather than the entire corpus, the computational cost remains manageable even for transformer-based cross-encoders.
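The two-stage structure can be sketched with stand-in scorers. Neither function below is a real model: the cheap scorer substitutes for BM25 and the expensive one for a cross-encoder forward pass:

```python
def first_stage_score(query, doc):
    """Cheap lexical score: count of shared terms (stand-in for BM25)."""
    return len(set(query.split()) & set(doc.split()))

def rerank_score(query, doc):
    """Expensive score; placeholder for a cross-encoder forward pass.
    Here it simply adds a small bonus per repeated first query term."""
    return first_stage_score(query, doc) + 0.1 * doc.split().count(query.split()[0])

def retrieve_then_rerank(query, corpus, k=2, n=1):
    # Stage 1: scan the whole corpus with the cheap scorer, keep top k.
    candidates = sorted(corpus, key=lambda d: first_stage_score(query, d),
                        reverse=True)[:k]
    # Stage 2: re-score only the k survivors with the expensive model.
    return sorted(candidates, key=lambda d: rerank_score(query, d),
                  reverse=True)[:n]

corpus = ["apple pie recipe", "apple apple orchard", "car repair"]
top = retrieve_then_rerank("apple", corpus)
```

The key property is that the expensive scorer runs k times, not once per corpus document, which is what makes cross-encoder reranking tractable.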
Several families of reranking models are available:
| Reranker | Provider | Architecture | Key Features |
|---|---|---|---|
| ms-marco-MiniLM | Sentence Transformers | Cross-encoder (MiniLM) | Lightweight, fast, trained on MS MARCO |
| Cohere Rerank 3.5 | Cohere | Proprietary cross-encoder | Multilingual (100+ languages), API-based |
| Jina Reranker v2 | Jina AI | Open-weight cross-encoder | Multilingual, supports code search and long documents |
| bge-reranker-v2-m3 | BAAI | Cross-encoder (Transformer) | Open-source, competitive accuracy |
| ColBERTv2 | Stanford | Late interaction | Can serve as both retriever and reranker |
| RankGPT | Academic (open-source) | LLM-based listwise | Uses GPT models for zero-shot reranking |
Reranking consistently improves retrieval quality. Studies have shown that adding a cross-encoder reranker on top of a bi-encoder retriever can improve NDCG@10 by 5-15 points on standard benchmarks. The quality gains come from the cross-encoder's ability to model fine-grained token-level interactions that bi-encoders miss when compressing text into a single vector.
Retrieval-augmented generation (RAG) represents one of the most significant applications of information retrieval in the era of large language models. RAG combines a retrieval component with a generative model to produce answers grounded in external knowledge.
A standard RAG pipeline consists of three stages:
1. Retrieval: the user's query is used to search an indexed document collection, via sparse, dense, or hybrid retrieval, for relevant passages.
2. Augmentation: the retrieved passages are inserted into the prompt alongside the original query.
3. Generation: a large language model produces an answer conditioned on both the query and the retrieved context.
The quality of the retrieval stage directly determines the quality of the generated response. If the retriever fails to surface relevant passages, the LLM will either hallucinate or produce an incomplete answer. Research has shown that retrieval accuracy (measured by Recall@k of the relevant passages) is the single most important factor for RAG performance, more so than the choice of generative model.
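A minimal end-to-end sketch of the three stages, with a toy keyword retriever and the LLM left as a caller-supplied function (nothing here is a real model API):

```python
def retrieve(query, corpus, k=2):
    """Toy keyword-overlap retriever standing in for a real search index."""
    def overlap(doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def build_prompt(query, passages):
    """Augmentation: pack the retrieved passages into the generation prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def rag_answer(query, corpus, llm):
    passages = retrieve(query, corpus)       # 1. retrieval
    prompt = build_prompt(query, passages)   # 2. augmentation
    return llm(prompt)                       # 3. generation

corpus = ["BM25 is a ranking function.", "ColBERT uses late interaction."]
prompt_preview = build_prompt("What is BM25?", retrieve("What is BM25?", corpus))
```

Swapping the toy `retrieve` for a hybrid retriever plus reranker is where most of the quality gains in production RAG systems come from.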
Advanced RAG systems employ multi-stage retrieval with reranking, query expansion, hypothetical document embeddings (HyDE), and iterative retrieval to maximize the relevance of context provided to the generator. The tight coupling between IR and language generation has renewed interest in retrieval research and driven the development of specialized embedding models, rerankers, and hybrid search systems optimized for RAG workloads.
Information retrieval research has produced the algorithms and data structures that power both open-source and commercial search infrastructure.
Apache Lucene is an open-source Java library for full-text search that implements inverted indexing, BM25 scoring, and (since version 9.0) dense vector search using the HNSW algorithm. Lucene serves as the foundation for several widely used search platforms:
- Elasticsearch: a distributed search and analytics engine exposing Lucene through a REST API, widely used for application search and log analytics.
- OpenSearch: a community-maintained fork of Elasticsearch distributed under the Apache 2.0 license.
- Apache Solr: a mature search server built on Lucene, offering faceting, distributed search, and rich document handling.
The growth of dense retrieval and RAG has spurred the development of purpose-built vector databases:
- Pinecone: a fully managed, cloud-native vector database service.
- Weaviate: an open-source vector database with built-in hybrid (BM25 plus vector) search.
- Milvus: an open-source vector database designed for large-scale similarity search.
- Qdrant: an open-source vector search engine written in Rust with rich payload filtering.
FAISS (Facebook AI Similarity Search), developed by Meta AI Research and released in 2017, is an open-source library for efficient similarity search and clustering of dense vectors. FAISS implements multiple indexing strategies including flat (exact) search, Inverted File Index (IVF), Product Quantization (PQ), and HNSW, and supports both CPU and GPU acceleration. It is widely used as the retrieval backend for research systems and as a building block within production vector databases.
| Method | Type | Approach | Strengths | Weaknesses | Year Introduced |
|---|---|---|---|---|---|
| Boolean Retrieval | Sparse | Exact keyword match with logical operators | Precise control, no ranking ambiguity | No ranking, rigid query syntax | 1950s |
| TF-IDF | Sparse | Term frequency weighted by inverse document frequency | Simple, interpretable, fast | Ignores semantics, no length normalization | 1957/1972 |
| BM25 | Sparse | Probabilistic ranking with saturation and length normalization | Strong baseline, excellent generalization | No semantic matching | 1980s-1994 |
| DPR | Dense | Dual BERT encoders with contrastive training | Captures semantic similarity | Single-vector bottleneck, requires training data | 2020 |
| ColBERT | Dense (late interaction) | Token-level MaxSim over BERT embeddings | High accuracy with precomputable document embeddings | Higher storage than single-vector models | 2020 |
| SPLADE | Learned sparse | BERT MLM head with sparse regularization | Efficient (inverted index), strong generalization | Requires training, more complex than BM25 | 2021 |
| Cross-Encoder | Dense (joint) | Full transformer attention over concatenated query-document | Highest accuracy | Too slow for first-stage retrieval | 2019 |
| Hybrid (BM25 + Dense) | Hybrid | Parallel sparse and dense retrieval with score fusion | Combines lexical and semantic matching | Added complexity, two indexes | 2020+ |
Information retrieval continues to evolve rapidly. Several trends are shaping the field as of 2025 and 2026: