Hybrid search
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,140 words
Hybrid search is a family of information retrieval techniques that combine the ranked outputs of a lexical (sparse) retriever, typically BM25 or a learned sparse model such as SPLADE, with those of a dense (vector) retriever based on neural embeddings or Dense Passage Retrieval (DPR)-style bi-encoders.[1][2] The two systems are run in parallel against the same corpus and their result lists are merged using a fusion algorithm, most commonly Reciprocal Rank Fusion (RRF), a weighted convex combination of normalized scores, or a learning-to-rank model trained on labeled relevance data.[3][4] Hybrid retrieval has become the dominant first-stage retrieval architecture for production retrieval-augmented generation systems because it consistently outperforms either single modality alone on heterogeneous out-of-domain benchmarks such as BEIR.[2][5] Major vector database vendors (Pinecone, Weaviate, Qdrant, Vespa) and search engines (Elasticsearch, OpenSearch) now expose hybrid search as a first-class primitive.[6][7][8][9][10]
Classical text retrieval, formalized in the 1970s and 1980s, scored documents using sparse bag-of-words representations and term-weighting schemes such as TF-IDF and BM25 (Okapi BM25), which approximate the probability that a document is relevant to a query via term frequency, inverse document frequency, and document-length normalization.[11] BM25 dominated open-domain retrieval for two decades because it is cheap to compute, requires no training data, and matches exact terms reliably, but it is fundamentally lexical: a query for "automobile" returns no documents that only mention "car".
The introduction of pre-trained transformer encoders such as BERT enabled a different style of retrieval: queries and documents are independently encoded into low-dimensional dense vectors, and relevance is measured by inner product or cosine similarity over those vectors. Karpukhin and colleagues at Facebook AI Research showed in 2020 that such a "dual encoder" Dense Passage Retriever (DPR), trained on weakly supervised question-passage pairs, could outperform BM25 by 9 to 19 points on top-20 retrieval accuracy for Natural Questions and several other open-domain QA benchmarks.[12] Dense retrievers handle synonymy, paraphrase, and semantic similarity gracefully but tend to be "token-blind": they can miss rare named entities, product codes, identifiers, and out-of-vocabulary terms that are decisive for relevance.[4][13]
These two failure modes are largely complementary. Luan, Eisenstein, Toutanova, and Collins formalized this observation in a 2021 Transactions of the Association for Computational Linguistics paper, demonstrating both theoretically and empirically that sparse bag-of-words models have unbounded capacity for long documents whereas fixed-dimension dual encoders are capacity-limited, and that simple sparse-dense hybrids "outperform strong alternatives in large-scale retrieval".[14] The 2021 BEIR benchmark by Thakur and colleagues then provided large-scale evidence that dense models trained on MS MARCO frequently fail to generalize zero-shot to out-of-domain corpora, whereas BM25 remains a robust baseline; hybrid systems and reranking architectures were among the strongest configurations evaluated.[2] Together these results catalyzed the modern view of hybrid search as a default architecture for production retrieval.
The first practical hybrid systems predate the dense-retrieval revival. Rank aggregation across multiple IR systems has been studied since at least the TREC evaluations of the 1990s, where techniques such as CombSUM, CombMNZ (Fox and Shaw), and Borda count were used to fuse runs from independent retrieval systems. The 2009 RRF paper by Cormack et al. compared these score-fusion methods with rank-fusion alternatives, including Condorcet pairwise voting, and showed that the deceptively simple reciprocal-rank score consistently won on TREC and on the spam-tracker corpora used in the paper, while needing neither training data nor per-system score normalization.[3] This robustness is why RRF became the default fusion operator for hybrid sparse-dense pipelines a decade later, when dense retrievers entered the picture: it works out of the box even when the two systems produce wildly incomparable raw scores.
A hybrid search system has three logical stages: independent first-stage retrieval, score or rank fusion, and (optionally) a second-stage reranker such as a ColBERT late-interaction model or a transformer cross-encoder.
Sparse retrieval is normally implemented over an inverted index. For BM25 the score of a document d against query q is
BM25(q, d) = sum over t in q of IDF(t) * (f(t,d) * (k1+1)) / (f(t,d) + k1 * (1 - b + b * |d|/avgdl))
with parameters typically k1 in [1.2, 2.0] and b ≈ 0.75.[11] Learned sparse models such as SPLADE replace raw term weights with weights produced by a masked language model head over the BERT vocabulary, regularized to remain sparse, so the result can still be served by an inverted index but each document is "expanded" with semantically related terms.[15]
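The formula above can be made concrete with a minimal sketch. It assumes a non-empty corpus of pre-tokenized documents and a Lucene-style non-negative IDF; the text does not specify the IDF variant, so that choice, along with the function name `bm25_scores` and the toy corpus, is illustrative only.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # absent terms contribute nothing
            # Assumed IDF variant (Lucene-style, always non-negative).
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            length_norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / length_norm
        scores.append(score)
    return scores


docs = [
    "the car is parked outside".split(),
    "an automobile dealership sells cars".split(),
    "bananas are yellow".split(),
]
scores = bm25_scores(["automobile"], docs)
```

In this toy corpus only the second document receives a non-zero score; the first, which mentions "car" but not "automobile", scores zero, which is the lexical mismatch described earlier.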
Dense retrieval encodes the query into a single vector and looks up the approximate nearest neighbors of that vector in a precomputed index over document embeddings, typically using algorithms such as HNSW graphs or product quantization implemented in libraries like FAISS.[16] Common encoders include DPR, sentence-transformer models derived from Sentence-BERT, and commercial embedding APIs from providers such as OpenAI, Cohere, and Jina.
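As a rough illustration of the dense lookup, the sketch below performs an exact (brute-force) cosine search over a handful of hand-written vectors. A production system would replace the linear scan with an approximate index such as HNSW; the function names and the three-dimensional toy embeddings are assumptions made for the example.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_dense(query_vec, doc_vecs, k=2):
    """Return the k (doc_id, similarity) pairs with the highest cosine similarity."""
    scored = ((doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy embeddings; real encoders produce hundreds or thousands of dimensions.
doc_vecs = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.0, 1.0, 0.0],
    "d3": [0.7, 0.7, 0.1],
}
hits = top_k_dense([1.0, 0.0, 0.0], doc_vecs, k=2)
```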
The most widely deployed fusion method is Reciprocal Rank Fusion (RRF), introduced by Gordon V. Cormack, Charles L. A. Clarke, and Stefan Büttcher in a short paper at SIGIR 2009.[3] Given n ranked result lists R_1, ..., R_n and a small constant k (the paper uses k = 60), each document d receives a fused score
RRFscore(d) = sum over i of 1 / (k + r_i(d))
where r_i(d) is the rank of d in list i (and the contribution is zero if d does not appear in list i).[3][17] Documents are then sorted in descending RRF score. The constant k softens the influence of the very top ranks: a larger k allows mid-ranked documents to accumulate weight across multiple lists. Cormack and colleagues showed on TREC tracks that RRF, with no training data and no per-system tuning, outperformed both Condorcet-style voting and a learning-to-rank baseline that was given access to relevance judgments.[3] Vendor implementations almost universally inherit the default k = 60: Elasticsearch's rrf retriever exposes rank_constant with default 60, Weaviate's rankedFusion uses the same constant in its 1 / (rank + 60) formula, and Qdrant's Query API similarly defaults to 60.[7][17][18]
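A minimal sketch of the RRF formula, assuming each input is a list of document IDs ordered best first and ranks start at 1. The function name and the sample lists are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs; returns (doc_id, score) pairs, best first."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document missing from a list simply gets no contribution from it.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


sparse = ["d3", "d1", "d7"]
dense = ["d1", "d5", "d3"]
fused = reciprocal_rank_fusion([sparse, dense])
```

Here d1 (ranks 2 and 1) edges out d3 (ranks 1 and 3), and both outrank d5 and d7, which each appear in only one list.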
The alternative is to fuse raw scores rather than ranks. Because BM25 scores are unbounded positive values whereas cosine similarities (and inner products over unit-normalized embeddings) live in [-1, 1], the two must first be normalized. Common normalizations include min-max within the result window, theoretical min-max (BM25 bounded below by zero, cosine by -1), and z-score normalization. The fused score is then
score(d) = alpha * s_dense(d) + (1 - alpha) * s_sparse(d)
with alpha in [0, 1] controlling the relative weight. Sebastian Bruch, Siyu Gai, and Amir Ingber of Pinecone analyzed this family in a 2022 paper "An Analysis of Fusion Functions for Hybrid Retrieval", finding that a learned convex combination is largely agnostic to the choice of score normalization, can be tuned with very few labeled examples, and outperforms RRF on both in-domain and out-of-domain BEIR-style evaluations; the same study reported that RRF is more parameter-sensitive than previously assumed.[19] In Pinecone's product, alpha-weighted convex combination is implemented by scaling query and document sparse and dense components before computing a single dot product, so the underlying index can serve hybrid queries at native speed.[6][20]
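The convex combination can be sketched as follows, assuming min-max normalization within each result window and a score of zero for documents missing from one list. Both choices, and the helper names, are assumptions for illustration rather than any vendor's implementation.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to [0, 1] within the result window."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}  # degenerate window
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def convex_fusion(dense_scores, sparse_scores, alpha=0.5):
    """score(d) = alpha * s_dense(d) + (1 - alpha) * s_sparse(d) on normalized scores."""
    dense_n, sparse_n = min_max(dense_scores), min_max(sparse_scores)
    doc_ids = set(dense_n) | set(sparse_n)
    fused = {
        d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * sparse_n.get(d, 0.0)
        for d in doc_ids
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


dense = {"d1": 0.82, "d2": 0.75, "d3": 0.40}   # cosine similarities
sparse = {"d2": 12.0, "d3": 7.5, "d4": 3.0}    # raw BM25 scores
balanced = convex_fusion(dense, sparse, alpha=0.5)
```

With alpha = 0.5 the document d2, which scores well in both lists, rises to the top; with alpha = 1.0 the ranking collapses to the dense order and d1 leads.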
When labeled relevance data is available, the scores or ranks from sparse and dense retrievers can be treated as features and fed into a supervised learning-to-rank model (e.g., LambdaMART or a small neural ranker). This is the most flexible approach but requires per-domain training data, so it is typically used in mature search systems with large query logs rather than in zero-shot RAG pipelines.[4][19]
Hybrid retrieval is often paired with a second-stage neural reranker that scores the union of the top k results from each retriever using a transformer cross-encoder or a late-interaction model such as ColBERT. The reranker sees the query and candidate document jointly and produces a calibrated relevance score, at the cost of one forward pass per candidate. This cascaded design (cheap hybrid recall, expensive precise reranker) is the de facto reference architecture in toolkits such as Pyserini and Haystack.[9][21]
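The cascade can be sketched as below. The reranker is passed in as a plain callable; `toy_overlap_scorer` is a deliberately crude stand-in (token overlap, not a neural model) so the example runs without any model weights. All names and sample documents are hypothetical.

```python
def rerank_cascade(query, sparse_hits, dense_hits, score_pair, top_k=10, final_n=3):
    """Union the top_k hits of each retriever, then rescore every candidate jointly."""
    # dict.fromkeys deduplicates while preserving first-seen order.
    candidates = list(dict.fromkeys(sparse_hits[:top_k] + dense_hits[:top_k]))
    # One scorer call per candidate: this is the expensive stage.
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:final_n]


def toy_overlap_scorer(query, doc):
    """Stand-in for a cross-encoder: Jaccard overlap of lowercase tokens."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


sparse_hits = ["how to reset a router password", "router firmware changelog"]
dense_hits = ["recovering login credentials on your modem", "how to reset a router password"]
results = rerank_cascade("reset router password", sparse_hits, dense_hits, toy_overlap_scorer)
```

In practice `score_pair` would wrap a cross-encoder or late-interaction model; keeping it as a parameter makes the cost structure explicit, since it is invoked once per unique candidate.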
The BEIR (Benchmarking-IR) suite, introduced by Thakur, Reimers, Rücklé, Srivastava, and Gurevych at NeurIPS 2021, comprises 18 publicly available retrieval datasets spanning fact-checking, question answering, biomedical search, financial filings, and other domains.[2] BEIR's central finding is that BM25 is a remarkably robust zero-shot baseline: many dense retrievers trained on MS MARCO underperform BM25 when transferred to out-of-domain corpora, while late-interaction and reranking-based models achieve the best mean nDCG@10 at substantially higher cost.[2] Subsequent work, including hybrid retrieval studies and the analysis by Bruch et al., used BEIR as the standard testbed and reported that hybrid sparse-dense systems improve average nDCG@10 substantially over either single modality.[5][19]
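For readers unfamiliar with the metric, nDCG@10 can be computed as in the sketch below. It assumes linear gain (the graded relevance label itself) and a log2 position discount, which is one common convention; evaluation toolkits may differ in detail, and the sample judgments are invented.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """nDCG@k for one query; `relevance` maps doc_id to a graded label."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


relevance = {"d1": 2, "d2": 1}  # hypothetical graded judgments
perfect = ndcg_at_k(["d1", "d2", "d9"], relevance)
degraded = ndcg_at_k(["d9", "d2", "d1"], relevance)
```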
Elastic's research team, evaluating Elastic's Learned Sparse Encoder (ELSER), BM25, and a dense baseline on a 12-task BEIR subset, reported that RRF with k = 20 (window 1000) increased average nDCG@10 by 1.4 percentage points over ELSER alone and 18 percentage points over BM25 alone, and that a weighted linear combination calibrated on annotated data delivered a 6-point gain over ELSER alone and 24 points over BM25 alone.[5] Importantly, the RRF result was "either better or similar to BM25 alone for all test data sets", which is the main practical attraction: hybrid search rarely loses to a strong sparse baseline, even without tuning.[5] Comparable conclusions have been documented by OpenSearch's team, who benchmarked normalization-and-combination pipelines on BEIR and the Amazon ESCI dataset and concluded that min-max normalization plus arithmetic-mean combination yielded the best hybrid configuration in their setup.[8]
On the MS MARCO passage ranking task, hybrid retrieval has long been a leaderboard staple. Pyserini, the Python toolkit for reproducible IR research developed in Jimmy Lin's lab at the University of Waterloo, reports reference hybrid runs combining BM25 or uniCOIL sparse signals with TCT-ColBERT-style dense encoders, and these consistently improve over either component alone.[21] Luan and colleagues' TACL 2021 analysis showed similar gains on MS MARCO and on the Wikipedia-based Natural Questions corpus.[14]
Hybrid search is now a feature of essentially every production vector store and search engine. The implementations differ in how they index sparse and dense signals, which fusion methods they expose, and whether they perform fusion at the index layer or as a post-processing step.
| System | Sparse component | Dense component | Default fusion | Notes |
|---|---|---|---|---|
| Pinecone | Sparse vectors (BM25 or SPLADE) | Dense vectors | Convex combination via alpha | Single sparse-dense vector, dotproduct metric only.[6][20] |
| Weaviate | BM25 with configurable k1, b, tokenization | Dense vector search | relativeScoreFusion (default since v1.24) and rankedFusion (RRF with k = 60) | alpha parameter weights vector vs keyword.[7] |
| Qdrant | Sparse vectors / SPLADE | Dense vectors, multi-vector (ColBERT) | RRF in Query API (since v1.10, July 2024) | Supports prefetch-then-rerank pipelines and DBSF.[18] |
| Elasticsearch | BM25 (Lucene) and ELSER learned sparse | kNN with HNSW | rrf retriever (default rank_constant = 60, weighted RRF added later) | Hybrid via the retrievers API and sub-searches.[17][22] |
| OpenSearch | BM25 (Lucene) and learned sparse | kNN with HNSW or IVF | Search-pipeline normalization-processor (min_max, l2) plus combination (arithmetic_mean, geometric_mean, harmonic_mean); RRF processor added later | Per-clause weighting.[8][23] |
| Vespa | bm25, nativeRank text features, weakAnd | nearestNeighbor over HNSW tensor field | User-defined ranking expressions combining text and vector features | Single ranking framework, supports RANK operator for retrieve-with-features.[10][24] |
| pgvector + tsvector | PostgreSQL tsvector / tsquery full-text search | pgvector cosine / inner product | Application-level RRF or weighted sum | Hybrid runs inside Postgres; commonly combined via RRF in a single SQL query.[25] |
Pinecone introduced hybrid search built on a single "sparse-dense" vector type that requires the index to use the dotproduct distance metric.[6][26] The official documentation describes sparse vectors as having a very large number of dimensions, one per vocabulary token, of which only a small proportion hold non-zero values, and recommends BM25 or SPLADE as the encoder.[26] To merge signals the system uses a convex combination governed by an alpha parameter; the helper function hybrid_score_norm multiplies dense values by alpha and sparse values by (1 - alpha) before the dot product so the index itself computes the weighted score with no extra latency.[20] Pinecone now also supports separate sparse and dense indexes with fusion at query time for users who prefer to manage the two modalities independently.[20]
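Why pre-scaling yields a weighted score can be checked with a small sketch. It assumes that only the query side is scaled and represents a sparse vector as an index-to-weight dict; both are simplifications made for illustration, and `scale_query` is a hypothetical helper, not Pinecone's own function.

```python
def scale_query(dense_q, sparse_q, alpha):
    """Scale dense values by alpha and sparse values by (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scaled_dense = [v * alpha for v in dense_q]
    scaled_sparse = {idx: v * (1 - alpha) for idx, v in sparse_q.items()}
    return scaled_dense, scaled_sparse


def hybrid_dot(dense_q, sparse_q, dense_d, sparse_d):
    """Single dot product over the concatenated sparse-dense representation."""
    dense_part = sum(a * b for a, b in zip(dense_q, dense_d))
    sparse_part = sum(w * sparse_d[i] for i, w in sparse_q.items() if i in sparse_d)
    return dense_part + sparse_part


dense_q, sparse_q = [0.2, 0.4], {7: 1.5, 42: 0.5}   # query
dense_d, sparse_d = [0.5, 0.5], {7: 2.0, 99: 1.0}   # document
alpha = 0.3

scaled_dense, scaled_sparse = scale_query(dense_q, sparse_q, alpha)
one_pass = hybrid_dot(scaled_dense, scaled_sparse, dense_d, sparse_d)

# Reference: compute each modality separately, then combine.
dense_only = hybrid_dot(dense_q, {}, dense_d, {})
sparse_only = hybrid_dot([], sparse_q, [], sparse_d)
two_pass = alpha * dense_only + (1 - alpha) * sparse_only
```

Because the dot product is linear, `one_pass` and `two_pass` agree up to floating-point error, which is what lets the index compute the weighted score in a single operation.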
Weaviate exposes a hybrid search operator that runs BM25 and vector search in parallel, then applies one of two fusion algorithms.[7] Since v1.24 the default is relativeScoreFusion, which scales the highest score in each modality to 1 and the lowest to 0, preserving magnitude information. The alternative rankedFusion is exactly RRF with k = 60 (score = 1 / (rank + 60)).[7] An alpha parameter in [0, 1] weights the contributions, with alpha = 0.5 as the default; alpha = 1.0 is pure vector search and alpha = 0.0 is pure BM25. Weaviate keeps the full BM25 surface area (custom tokenization, stopwords, k1, b, AND/OR operator) so hybrid mode behaves like a superset of the BM25 endpoint.[7]
Qdrant added a unified Query API in version 1.10 (July 2024) that can express hybrid retrieval as a single nested query, with prefetch sub-queries for sparse and dense vectors and a fusion stage on top.[18] The built-in fusion is RRF; Qdrant also documents Relative Score Fusion (RSF) and Distribution-Based Score Fusion (DBSF) as alternatives that practitioners can apply.[27] The same API natively supports reranking flows ("prefetch with sparse, rerank with ColBERT-style multi-vectors") and Matryoshka-style nested embedding queries.[18]
Elasticsearch added native RRF in the 8.x line, first via a sub-searches RRF construct and then via the retrievers API; the rrf retriever takes a rank_constant (default 60) and rank_window_size (default 100).[17] A weighted variant published by Elastic Search Labs lets each retriever contribute with its own weight: rrf_score = w1 * rrf_1 + w2 * rrf_2, useful when one signal is known to be stronger than another.[28] OpenSearch implements hybrid search through a search-pipeline normalization-processor that intercepts query-phase scores from each clause, normalizes them (min_max or l2) and combines them (arithmetic_mean, geometric_mean, or harmonic_mean) with per-clause weights.[23][8] An RRF processor was added more recently to support rank-based fusion alongside the score-based pipeline.[8]
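The weighted variant, rrf_score = w1 * rrf_1 + w2 * rrf_2, can be sketched by multiplying each list's reciprocal-rank contribution by its weight. This is an illustrative implementation of the formula, not Elasticsearch's code; the function name and sample rankings are assumptions.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, k=60):
    """RRF in which list i contributes weights[i] / (k + rank) per document."""
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


bm25_run = ["a", "b", "c"]
knn_run = ["c", "b", "a"]
favor_bm25 = weighted_rrf([bm25_run, knn_run], weights=[2.0, 1.0])
favor_knn = weighted_rrf([bm25_run, knn_run], weights=[1.0, 2.0])
```

With the two runs in exactly opposite order, the fused ranking follows whichever run carries the larger weight.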
Vespa, originally an internal Yahoo system released as open source in 2017, treats text matching and vector search as features in a single ranking pipeline rather than as separate operators.[10] A hybrid query combines the weakAnd and nearestNeighbor retrieval operators, and the ranking profile assembles arbitrary expressions over text features (bm25(title), nativeRank(content)) and vector features (closeness(field, embedding), cosine distance).[10][24] Because the ranking expression is user-defined, fusion strategies in Vespa range from simple weighted sums to multi-phase ranking with a GBDT or transformer model in the second phase.[24]
PostgreSQL supports hybrid search without a separate search service by combining its built-in tsvector / tsquery full-text search with the pgvector extension, which adds a vector data type and HNSW or IVFFlat indexing.[25] A common pattern is to run both queries in a single SQL statement using WITH clauses, then merge with RRF: each subquery produces a ranked list, the statement computes 1.0 / (k + rank) for each, sums per document, and orders by the sum.[25] This approach is attractive because the entire pipeline stays inside a single transactional database without a separate vector store.
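The single-statement pattern might look like the sketch below, written as a Python helper around a parameterized query. The table `docs` and its columns `id`, `tsv` (a tsvector) and `embedding` (a pgvector column) are hypothetical, as are the function names; the placeholders use the pyformat style accepted by psycopg, and any DB-API connection with that parameter style is assumed.

```python
HYBRID_RRF_SQL = """
WITH lexical AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank_cd(tsv, q) DESC) AS rnk
    FROM docs, plainto_tsquery('english', %(text)s) AS q
    WHERE tsv @@ q
    ORDER BY ts_rank_cd(tsv, q) DESC
    LIMIT %(window)s
),
semantic AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> %(vec)s::vector) AS rnk
    FROM docs
    ORDER BY embedding <=> %(vec)s::vector
    LIMIT %(window)s
)
SELECT COALESCE(l.id, s.id) AS id,
       COALESCE(1.0 / (%(k)s + l.rnk), 0.0)
     + COALESCE(1.0 / (%(k)s + s.rnk), 0.0) AS rrf_score
FROM lexical l
FULL OUTER JOIN semantic s ON l.id = s.id
ORDER BY rrf_score DESC
LIMIT %(limit)s
"""


def to_vector_literal(vec):
    """Format a Python sequence as pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(str(float(x)) for x in vec) + "]"


def hybrid_search(conn, query_text, query_vec, k=60, window=50, limit=10):
    """Run lexical and vector subqueries and fuse them with RRF in one statement."""
    params = {
        "text": query_text,
        "vec": to_vector_literal(query_vec),
        "k": k,
        "window": window,
        "limit": limit,
    }
    cur = conn.cursor()
    try:
        cur.execute(HYBRID_RRF_SQL, params)
        return cur.fetchall()
    finally:
        cur.close()
```

The FULL OUTER JOIN keeps documents that appear in only one subquery, and COALESCE gives the missing side a zero contribution, mirroring the RRF definition. The `<=>` operator is pgvector's cosine distance, so ascending order puts the nearest vectors first.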
Modern RAG orchestration frameworks treat hybrid retrieval as a standard component.
LangChain provides an EnsembleRetriever that accepts a list of retrievers (typically a BM25Retriever and a vector-store retriever) plus optional per-retriever weights, and fuses their results using Reciprocal Rank Fusion with a constant c (default 60).[29] An alpha-style weighting is achieved through the weights argument: a [0.6, 0.4] configuration assigns 60% influence to the first retriever and 40% to the second.[29]
LlamaIndex composes hybrid retrieval by combining a VectorIndexRetriever with a BM25Retriever (or a vendor-native hybrid like Pinecone's sparse-dense queries) and merging results.[30] For vendors that expose an alpha parameter (e.g., Weaviate), LlamaIndex documents an "alpha tuning" recipe in which alpha is treated as a hyperparameter to be swept against the developer's own evaluation set.[31]
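Treating alpha as a hyperparameter amounts to a small grid search, sketched below without any framework. The evaluation set, the recall metric, and all function names are invented for illustration, and the per-query scores are assumed to be already normalized to [0, 1]; this is not LlamaIndex's recipe itself.

```python
def fuse(dense, sparse, alpha):
    """Rank doc IDs by alpha * dense + (1 - alpha) * sparse (missing score = 0)."""
    doc_ids = sorted(set(dense) | set(sparse))  # sorted first for deterministic ties

    def score(d):
        return alpha * dense.get(d, 0.0) + (1 - alpha) * sparse.get(d, 0.0)

    return sorted(doc_ids, key=score, reverse=True)


def recall_at_k(ranking, relevant, k):
    """Fraction of relevant documents found in the top k."""
    return len(set(ranking[:k]) & relevant) / len(relevant)


def sweep_alpha(eval_set, alphas, k=1):
    """Return (best_alpha, {alpha: mean recall@k}) over a labeled evaluation set."""
    results = {}
    for alpha in alphas:
        recalls = [
            recall_at_k(fuse(q["dense"], q["sparse"], alpha), q["relevant"], k)
            for q in eval_set
        ]
        results[alpha] = sum(recalls) / len(recalls)
    return max(results, key=results.get), results


eval_set = [  # hypothetical judged queries
    {"dense": {"a": 0.9, "b": 0.2, "c": 0.1}, "sparse": {"b": 1.0, "c": 0.3}, "relevant": {"b"}},
    {"dense": {"x": 1.0, "y": 0.2}, "sparse": {"y": 0.6, "z": 0.8}, "relevant": {"x"}},
]
best_alpha, results = sweep_alpha(eval_set, alphas=[0.0, 0.5, 1.0])
```

In this toy set one query is only answered by the sparse signal and the other only by the dense signal, so the balanced setting is the only one that gets both right.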
deepset's Haystack framework treats retrievers as composable pipeline nodes. A canonical hybrid pipeline wires an InMemoryBM25Retriever and an InMemoryEmbeddingRetriever (or their Elasticsearch / Weaviate / OpenSearch / Azure AI Search analogues) into a DocumentJoiner that performs RRF or weighted joining, and optionally appends a transformer ranker for second-stage scoring.[21][32] Dedicated wrappers such as WeaviateHybridRetriever and AzureAISearchHybridRetriever push the fusion down into the vendor where supported.[33][34]
Hybrid search is widely used for:
A typical RAG pipeline backed by hybrid search looks roughly as follows. At indexing time the corpus is chunked, each chunk is sent through both a sparse encoder (BM25 statistics or a SPLADE model from Hugging Face) and a dense encoder (a Sentence-BERT variant or a hosted embedding API from OpenAI, Cohere, or Jina), and the two representations are written to the inverted index and the vector store respectively. At query time the same encoders convert the user's natural-language query into a sparse query vector and a dense query vector. The vector store retrieves the top k_dense dense matches; the inverted index retrieves the top k_sparse sparse matches. The two lists are fused by RRF or convex combination, the resulting top n candidates are passed through a reranking cross-encoder, and the final top few are concatenated into the LLM prompt. Most of this orchestration is handled by LangChain, LlamaIndex, or Haystack with minimal user code, and modern vector stores increasingly perform the fusion step server-side so the client only sees a single ranked list.
Hybrid search is not free. Operationally, the system must maintain two indexes (or a hybrid index that supports both), pay query-time latency for both retrievers, and tune at least one fusion parameter (alpha, k, or per-modality weights). The Pinecone fusion-analysis paper explicitly documents that RRF, contrary to its "tuning-free" reputation, can be parameter-sensitive when the two underlying systems differ in calibration, and recommends a small labeled set to tune a convex combination instead.[19]
The benchmark literature also contains caveats. Lassance and colleagues highlighted in 2023 that MS MARCO comes in subtly different "preprocessed" variants and that head-to-head comparisons in the literature have sometimes compared systems trained or evaluated on different versions, leading to overstated improvements.[35] BEIR's own conclusion is that dense retrievers underperform BM25 in many out-of-domain settings, so a hybrid that simply sums an under-trained dense run with BM25 can recover most, but not all, of the lost ground without addressing the underlying generalization gap.[2]
Finally, hybrid retrieval does not solve the problem of bad documents in the index. If a corpus contains contradictory facts or stale text, returning more of them via hybrid recall amplifies the downstream burden on the LLM-side reranker or generator. The cascaded design with a precise neural reranker is therefore standard in production RAG.
Hybrid search sits at the intersection of several active research areas: