BM25, also called Okapi BM25 or Best Match 25, is a probabilistic ranking function used by search engines and information retrieval systems to estimate how relevant a document is to a given query. It belongs to the broader Probabilistic Relevance Framework developed in the 1970s and 1980s by Stephen Robertson, Karen Sparck Jones, and collaborators, and it remains, more than three decades after its first publication, the de facto default scoring function for keyword search in production systems including Apache Lucene, Elasticsearch, OpenSearch, Solr, Vespa, and Tantivy.
The "BM" stands for "Best Matching" and the number 25 reflects the fact that the formula was the twenty-fifth iteration in a series of probabilistic weighting experiments by Robertson and colleagues. It was first described in fully recognizable form by Robertson and Walker in 1994 and implemented in the Okapi search system at the City University of London, the source of the "Okapi" prefix.
BM25 generalizes classical tf-idf scoring with two refinements: a saturating term-frequency function that prevents extremely repetitive documents from dominating the ranking, and an explicit document-length normalization that compensates for the fact that longer documents naturally contain more occurrences of any given term. These properties keep BM25 competitive even in the era of large neural language models, and it serves as both the standard baseline in academic IR research and the lexical component of the hybrid retrieval pipelines that power retrieval-augmented generation systems.
BM25 emerged from a long line of research into probabilistic models for document retrieval that began with the work of Maron and Kuhns in the early 1960s and was formalized by Robertson and Sparck Jones in their 1976 paper on relevance weighting of search terms. That earlier work introduced the Robertson-Sparck Jones weight, often abbreviated RSJ, which is essentially the IDF component used inside BM25 today.
A key intermediate step was Stephen Harter's 1975 work on the 2-Poisson model, which proposed that occurrences of a content-bearing term could be modeled as a mixture of two Poisson processes: one for documents genuinely about the topic ("elite" documents) and another for documents in which the term appears only incidentally. The 2-Poisson derivation produced a saturating term-weighting function, but the original formulation required estimating several latent parameters per term.
In their landmark 1994 SIGIR paper, Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval, Robertson and Walker showed that almost all of the practical benefit of the 2-Poisson model could be captured by a much simpler closed-form expression with only two free hyperparameters. They combined this saturating term-frequency factor with a length normalization derived from earlier Okapi experiments and with the RSJ inverse document frequency, producing the formula universally known as BM25. The Okapi system at City University, in which the function was first implemented, gave rise to the alternative name Okapi BM25.
Throughout the late 1990s and early 2000s, BM25 dominated the Text REtrieval Conference (TREC) ad hoc and web tracks, repeatedly outperforming vector-space cosine similarity on heterogeneous collections. Robertson and Hugo Zaragoza published a comprehensive retrospective in 2009, The Probabilistic Relevance Framework: BM25 and Beyond, which is still the canonical theoretical reference for the function and its variants.
For a query Q containing terms q₁, q₂, ..., qₙ and a document D, the BM25 score is
score(D, Q) = Σ IDF(q_i) · [ f(q_i, D) · (k₁ + 1) ] / [ f(q_i, D) + k₁ · (1 - b + b · |D| / avgdl) ]
where
- f(q_i, D) is the raw term frequency of qᵢ in document D,
- |D| is the length of D measured in tokens,
- avgdl is the average document length across the collection,
- k₁ and b are free hyperparameters, and
- IDF(q_i) is the inverse document frequency of qᵢ.

The IDF component is the Robertson-Sparck Jones weight in its commonly used form:
IDF(q_i) = log( (N - n(q_i) + 0.5) / (n(q_i) + 0.5) + 1 )
where N is the total number of documents in the collection and n(qᵢ) is the number of documents containing qᵢ. The + 1 outside the logarithm is a smoothing convention introduced in Lucene to avoid negative IDF values for very common terms; the original probabilistic derivation produces a log-odds ratio that can in fact go negative for terms appearing in more than half of the collection.
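The full scoring function can be sketched directly in Python. The following is an illustrative implementation using the Lucene-style + 1 smoothing in the IDF; the function signature and variable names are invented for the example, and a real engine would of course work from an inverted index rather than scoring tokenized documents one at a time:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avgdl,
               k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query.

    doc_freqs maps each term to n(q_i), the number of documents
    containing it; n_docs is N, the collection size.
    """
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for term in query_terms:
        f = tf[term]
        if f == 0:
            continue  # absent terms contribute nothing
        n = doc_freqs.get(term, 0)
        # Lucene-style smoothed Robertson-Sparck Jones IDF (never negative)
        idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1)
        # Saturating term frequency with document-length normalization
        norm = 1 - b + b * dl / avgdl
        score += idf * f * (k1 + 1) / (f + k1 * norm)
    return score
```

Note that all collection-level statistics (N, n(qᵢ), avgdl) are computed once at indexing time; only the term-frequency lookups happen per query.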
Three behaviors explain almost everything BM25 does well.
Term-frequency saturation. The numerator grows linearly with f(q_i, D), but the denominator also contains f(q_i, D), so the ratio asymptotes at (k₁ + 1) as term frequency grows. Doubling the count of a query term in a document raises its contribution noticeably the first few times but quickly produces diminishing returns, which prevents keyword-stuffed pages from winning the ranking.
Document-length normalization. The factor (1 - b + b · |D| / avgdl) raises the effective k₁ for documents longer than average and lowers it for shorter ones. With b = 1 normalization is full; with b = 0 length is ignored. This addresses the bias toward long documents that plain tf-idf inherits from the bag-of-words model.
Negative IDF for ubiquitous terms. Without the Lucene + 1 smoothing, words appearing in more than half of the collection contribute negative weight, actively discouraging matches on stop-word-like terms and pushing the ranking toward more discriminative query terms.
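The saturation behavior is easy to verify numerically. The snippet below (a toy illustration, not any particular library's API) evaluates the term-frequency factor for an average-length document, where the normalization factor equals 1, so the factor climbs quickly at small counts and then flattens toward the k₁ + 1 = 2.2 asymptote:

```python
def tf_factor(f, k1=1.2, b=0.75, dl=100, avgdl=100):
    """Saturating term-frequency component of BM25 for one term."""
    norm = 1 - b + b * dl / avgdl  # equals 1 for an average-length doc
    return f * (k1 + 1) / (f + k1 * norm)

# For an average-length document the factor rises steeply at first,
# then approaches the asymptote k1 + 1 = 2.2 with diminishing returns:
for f in (1, 2, 5, 20, 100):
    print(f, round(tf_factor(f), 3))
```

Going from one occurrence to two still moves the score noticeably, while going from 20 occurrences to 100 barely moves it at all, which is exactly the anti-keyword-stuffing property described above.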
BM25 has two practitioner-facing hyperparameters. The defaults work reasonably well almost everywhere, but tuning them on a held-out evaluation set can yield meaningful relevance gains.
| Parameter | Typical range | Default in Lucene | Effect when increased |
|---|---|---|---|
| k₁ | 1.2 to 2.0 | 1.2 | Slower term-frequency saturation; raw counts matter more |
| b | 0.0 to 1.0 | 0.75 | Stronger penalty on long documents |
Setting k₁ = 0 reduces the term-frequency factor to a binary indicator, recovering the Binary Independence Model. Very large k₁ causes BM25 to behave like raw tf-idf with no saturation. The default k₁ = 1.2 reflects empirical findings from TREC experiments on English collections; for very short documents like product titles, lower values around k₁ = 0.9 sometimes work better, while for long-form documents like research papers, values from 1.5 to 2.0 are common.
The parameter b has a sharper interpretation: b = 0 disables length normalization and b = 1 applies it fully. The default b = 0.75 is a compromise that works well on mixed collections. Homogeneous corpora often prefer lower b; corpora that span several orders of magnitude in length favor values close to 1.
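The effect of b can be demonstrated with two documents that contain a query term the same number of times but differ sharply in length (the lengths and counts below are toy numbers chosen for illustration):

```python
def tf_component(f, dl, avgdl, k1=1.2, b=0.75):
    """BM25 term-frequency component with length normalization."""
    norm = 1 - b + b * dl / avgdl
    return f * (k1 + 1) / (f + k1 * norm)

avgdl = 100
short_dl, long_dl = 50, 400  # document lengths in tokens
f = 3                        # same raw term frequency in both documents

# b = 0: length is ignored, so the two documents tie exactly
assert tf_component(f, short_dl, avgdl, b=0.0) == tf_component(f, long_dl, avgdl, b=0.0)

# b = 0.75 (the default): the short document scores higher, since the
# same three occurrences are a stronger signal in a shorter document
assert tf_component(f, short_dl, avgdl) > tf_component(f, long_dl, avgdl)
```

Tuning b is therefore really a statement about the corpus: if long documents in the collection are long because they are verbose, push b up; if they are long because they genuinely cover more material, push it down.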
BM25 is sometimes described as an improved tf-idf, and the comparison is informative. Both functions multiply a term-frequency component by an inverse document frequency component and sum across query terms, but the details differ in ways that matter in practice.
| Property | tf-idf | BM25 |
|---|---|---|
| Term-frequency growth | Linear or log-scaled, unbounded | Saturating, asymptotes at k₁ + 1 |
| Document-length normalization | None or ad hoc cosine normalization | Built-in, controlled by b |
| IDF formulation | log(N / df) | log((N - df + 0.5) / (df + 0.5) + 1) |
| Theoretical basis | Heuristic, vector-space model | Probabilistic Relevance Framework, 2-Poisson |
| Negative weights for very common terms | No | Yes (without Lucene +1 smoothing) |
| Hyperparameters | Usually none | k₁, b |
In head-to-head experiments on standard test collections, BM25 typically beats vanilla tf-idf by several points of mean average precision. The advantage is largest on collections with widely varying document lengths and on queries with repeated content terms, both of which are situations where unbounded term frequency in tf-idf produces unstable scores.
The core BM25 formula has been extended in many directions to address specific weaknesses or to handle structured documents. The most widely deployed variants are summarized below.
| Variant | Year | Authors | Key idea |
|---|---|---|---|
| BM25F | 2004 | Robertson, Zaragoza, Taylor | Fielded scoring; weighted combination of term frequencies across multiple document fields with per-field length normalization |
| BM25+ | 2011 | Lv, Zhai | Adds a constant δ to the term-frequency factor to enforce a strict lower bound; fixes a bias against very long documents |
| BM25L | 2011 | Lv, Zhai | Shifts the normalization curve so long documents are penalized less, while preserving the saturating shape |
| BM25-adpt | 2012 | Lv, Zhai | Adapts k₁ per term based on collection statistics rather than using a single global value |
| BM25T | 2014 | Trotman, Puurula, Burgess | Term-specific treatment of the saturation parameter; examined in a systematic comparison of BM25 modifications |
| TW-IDF | 2013 | Rousseau, Vazirgiannis | Replaces term frequency with a graph-based term weight inspired by BM25 saturation |
BM25F is by far the most operationally important variant. Real web pages have multiple fields such as title, headers, anchor text, body, and URL, each carrying different relevance signals and each having its own characteristic length distribution. BM25F handles each field as a separate "stream," computes a length-normalized weighted term frequency per stream, sums across streams with field-specific boost factors, and then applies the standard BM25 saturation and IDF on top. Search engines from Bing to Elasticsearch use BM25F or close relatives to score multi-field documents.
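A minimal sketch of the BM25F idea follows; the field names, boosts, and the helper's signature are invented for illustration. The key structural point is the ordering of operations: per-field term frequencies are length-normalized and boosted first, then saturation and IDF are applied once to the combined pseudo-frequency, rather than summing independent per-field BM25 scores:

```python
def bm25f_term_score(field_tfs, field_lens, field_avgls, boosts,
                     idf, k1=1.2, field_b=None):
    """One term's BM25F contribution for one document (sketch).

    field_tfs[f]   - raw term frequency in field f (e.g. 'title', 'body')
    field_lens[f]  - length of field f in this document
    field_avgls[f] - average length of field f across the collection
    boosts[f]      - per-field weight
    """
    if field_b is None:
        field_b = {f: 0.75 for f in field_tfs}
    # Weighted, per-field length-normalized pseudo term frequency
    pseudo_tf = 0.0
    for f, tf in field_tfs.items():
        b = field_b[f]
        norm = 1 - b + b * field_lens[f] / field_avgls[f]
        pseudo_tf += boosts[f] * tf / norm
    # Saturation and IDF applied once, on the combined stream
    return idf * pseudo_tf * (k1 + 1) / (pseudo_tf + k1)
```

Applying saturation after combining fields is what prevents a term that appears once in the title and once in the body from being double-counted as two independent saturating signals.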
BM25+ and BM25L address a subtle theoretical defect in the original BM25: the lower bound of the term-frequency contribution can become arbitrarily small for very long documents, with the result that a long document containing the query term is sometimes scored similarly to a short document that does not contain the term at all. BM25+ adds a small additive constant to enforce a strict positive lower bound; BM25L reshapes the normalization curve to similar effect.
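The BM25+ fix amounts to a one-line change to the term-frequency component (the function below is a sketch, with δ = 1.0 as a common choice): for any term that occurs at all, the contribution is bounded below by δ no matter how long the document is.

```python
def tf_component_plus(f, dl, avgdl, k1=1.2, b=0.75, delta=1.0):
    """BM25+ term-frequency component: bounded below by delta when f > 0."""
    if f == 0:
        return 0.0  # absent terms still contribute nothing
    norm = 1 - b + b * dl / avgdl
    return f * (k1 + 1) / (f + k1 * norm) + delta
```

In plain BM25, a single occurrence in an extremely long document can score arbitrarily close to zero; with the δ term, presence of the query term always dominates absence.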
BM25 is the standard scoring function in essentially every modern open-source full-text search engine.
| System | Default scoring | Notes |
|---|---|---|
| Apache Lucene | BM25 since version 6.0 (2016) | Replaced the original DefaultSimilarity (a tf-idf variant) |
| Elasticsearch | BM25 since 5.0 (2016) | Inherited the change from Lucene |
| OpenSearch | BM25 via Lucene's BM25Similarity | Direct fork of Elasticsearch |
| Apache Solr | BM25 since Solr 6 | Same Lucene foundation |
| Tantivy | BM25 | Rust full-text library used by Quickwit and ParadeDB |
| Vespa | BM25 as bm25 rank feature | nativeRank is the historical default |
| PostgreSQL (ParadeDB, VectorChord-BM25) | BM25 via extension | Brings true BM25 to Postgres |
| Whoosh, Xapian, Terrier | BM25 | Standard in research IR toolkits |
The transition in Lucene from its original DefaultSimilarity to BM25 in version 6.0 (2016) was one of the most consequential changes in the history of open-source search. It propagated downstream into Elasticsearch and Solr the same year, effectively making BM25 the default behavior of nearly every search box on the open web.
The rise of dense retrieval, in which queries and documents are encoded by neural networks into fixed-dimensional vectors and matched by approximate nearest neighbor search, was widely expected to make BM25 obsolete. That has not happened. Instead, BM25 has settled into three durable roles.
BM25 is the standard reference baseline in modern IR research. Almost every paper introducing a dense retriever, learned sparse retriever, or cross-encoder reranker reports BM25 numbers on standard benchmarks such as MS MARCO, TREC Deep Learning, and the BEIR benchmark. The toolkits Anserini (Java, built on Lucene) and Pyserini (Python bindings to Anserini) maintain reproducible BM25 baselines and prebuilt indexes for these collections, which makes BM25 the lingua franca for comparing retrieval systems.
A frequently cited result from the BEIR benchmark is that BM25 remains a remarkably strong zero-shot retriever: dense models trained on MS MARCO often fail to generalize to other domains, while BM25 has no domain-specific training and therefore degrades gracefully when moved to scientific, medical, or legal corpora. This robustness has shaped how the field thinks about retrieval evaluation.
Production semantic search systems and retrieval-augmented generation pipelines very rarely use dense retrieval alone. They combine BM25 lexical search with dense neural retrieval and then fuse the two ranked lists. The fusion can be done by a learned reranker, by weighted score interpolation, or, most commonly, by Reciprocal Rank Fusion (RRF), which assigns each document a score of 1 / (k + rank) in each list (with k typically set to 60) and sums those scores.
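Reciprocal Rank Fusion is simple enough to sketch in a few lines. The fusion constant k = 60 follows the description above; the document IDs and list contents are invented for the example:

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids via Reciprocal Rank Fusion.

    Each ranking is an ordered list, best first.  A document earns
    1 / (k + rank) from every list it appears in, rank starting at 1,
    and the fused order sorts by the summed score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7", "d2"]   # lexical ranking (illustrative)
dense_hits = ["d1", "d5", "d3", "d9"]  # dense ranking (illustrative)
fused = rrf_fuse([bm25_hits, dense_hits])
# d1 and d3 appear near the top of both lists, so they lead the fusion
```

Because RRF uses only ranks, not raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.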
| Approach | Strengths | Weaknesses |
|---|---|---|
| BM25 alone | Exact lexical match, zero training data required, fast, interpretable, robust across domains | Misses paraphrases, synonyms, and conceptual matches |
| Dense retrieval alone (e.g., DPR, two-tower bi-encoders) | Captures semantic similarity, handles paraphrases | Expensive to train, weak on rare terms and out-of-domain queries, requires a vector database |
| BM25 + dense, RRF fusion | Inherits exact-match precision and semantic recall; consistently best on heterogeneous queries | Higher infrastructure cost; two indexes to maintain |
| Learned sparse (SPLADE) | BM25-style inverted index with learned term expansion; closes most BM25 gaps while retaining its speed | Newer, less mature tooling; still requires a learned model |
Empirical reports from production teams routinely show that hybrid BM25-plus-dense fusion lifts recall at top-k by ten to twenty points over either component alone, which translates directly into fewer hallucinations in a downstream RAG system because the language model has more relevant context to ground its answers in.
A newer family of retrievers, exemplified by SPLADE (Sparse Lexical and Expansion Model, Formal et al. 2021) and its successors, attempts to keep the inverted-index efficiency of BM25 while learning the term weights and the term expansions with a neural language model. SPLADE replaces BM25's heuristic saturating term-frequency function with a learned max-pooling over BERT token activations, and it lets the model add or remove terms from the document and the query, producing a sparse vector in the BERT vocabulary that can be indexed and searched with standard inverted-index data structures.
SPLADE is best understood as BM25 with the rough edges sanded off by a transformer: it keeps the storage and latency profile of a sparse retriever, it remains interpretable in terms of which vocabulary tokens contributed to the score, and on benchmarks like MS MARCO and BEIR it closes most of the gap between BM25 and state-of-the-art dense models while being substantially cheaper to serve.
A few practical points deserve emphasis:

- The defaults k₁ = 1.2 and b = 0.75 work well across a wide range of corpora, so a working search system can be built without tuning.
- The best values of k₁ and b differ measurably between corpora, and tuning them requires labeled or implicit-feedback data.
- Lucene adds + 1 inside the logarithm to clamp IDF at zero; some research toolkits keep the original signed form.

Thirty-two years after the 1994 Robertson-Walker paper, BM25 is still the default scoring function in the search engines that index the open web, the standard baseline in academic IR, and the lexical backbone of every serious hybrid retrieval pipeline. Its longevity rests on the properties described throughout this article: a simple closed form with two hyperparameters, no training data, interpretable scores, and robust behavior across domains.
For most teams building search or retrieval-augmented generation today, the right starting point is still: index the corpus, run BM25 with default k₁ = 1.2 and b = 0.75, and only add a dense retriever, a reranker, or a learned sparse model once BM25 has been measured and shown to be insufficient. More often than not, the BM25 baseline is harder to beat than expected.
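Putting that advice together, a minimal end-to-end BM25 ranker over an in-memory corpus looks like the sketch below. The whitespace tokenization and toy documents are illustrative assumptions; a production system would use a real analyzer and an inverted index rather than rescanning every document:

```python
import math
from collections import Counter

def build_stats(docs):
    """Precompute collection statistics: document frequencies and avgdl."""
    doc_freqs = Counter()
    for doc in docs:
        doc_freqs.update(set(doc))  # count each doc at most once per term
    avgdl = sum(len(d) for d in docs) / len(docs)
    return doc_freqs, avgdl

def rank(query, docs, k1=1.2, b=0.75):
    """Return document indices ordered by BM25 score, best first."""
    doc_freqs, avgdl = build_stats(docs)
    n_docs = len(docs)
    scores = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        norm = 1 - b + b * len(doc) / avgdl
        s = 0.0
        for term in query:
            f = tf[term]
            if f == 0:
                continue
            n = doc_freqs[term]
            idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1)
            s += idf * f * (k1 + 1) / (f + k1 * norm)
        scores.append((s, i))
    return [i for s, i in sorted(scores, reverse=True)]

corpus = [
    "the cat sat on the mat".split(),
    "dogs and cats living together".split(),
    "the quick brown fox".split(),
]
print(rank("cat mat".split(), corpus))  # doc 0 ranks first: it matches both terms
```

Roughly forty lines suffice for a measurable baseline, which is much of the argument for starting here before reaching for heavier machinery.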