BM25, also called Okapi BM25 or Best Match 25, is a probabilistic ranking function used by search engines and information retrieval systems to estimate how relevant a document is to a given query. It belongs to the broader Probabilistic Relevance Framework developed in the 1970s and 1980s by Stephen Robertson, Karen Sparck Jones, and collaborators, and it remains, more than three decades after its first publication, the de facto default scoring function for keyword search in production systems including Apache Lucene, Elasticsearch, OpenSearch, Solr, Vespa, and Tantivy.
The "BM" stands for "Best Matching" and the number 25 reflects the fact that the formula was the twenty-fifth iteration in a series of probabilistic weighting experiments by Robertson and colleagues. It was first described in fully recognizable form by Robertson and Walker in 1994 and implemented in the Okapi search system at the City University of London, the source of the "Okapi" prefix.
BM25 generalizes classical tf-idf scoring with two refinements: a saturating term-frequency function that prevents extremely repetitive documents from dominating the ranking, and an explicit document-length normalization that compensates for the fact that longer documents naturally contain more occurrences of any given term. These properties keep BM25 competitive even in the era of large neural language models, and it serves as both the standard baseline in academic IR research and the lexical component of the hybrid retrieval pipelines that power retrieval-augmented generation systems.
BM25 emerged from a long line of research into probabilistic models for document retrieval that began with the work of Maron and Kuhns in the early 1960s and was formalized by Robertson and Sparck Jones in their 1976 paper on relevance weighting of search terms. That earlier work introduced the Robertson-Sparck Jones weight, often abbreviated RSJ, which is essentially the IDF component used inside BM25 today.
A key intermediate step was Stephen Harter's 1975 work on the 2-Poisson model, which proposed that occurrences of a content-bearing term could be modeled as a mixture of two Poisson processes: one for documents genuinely about the topic ("elite" documents) and another for documents in which the term appears only incidentally. The 2-Poisson derivation produced a saturating term-weighting function, but the original formulation required estimating several latent parameters per term.
In their landmark 1994 SIGIR paper, Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval, Robertson and Walker showed that almost all of the practical benefit of the 2-Poisson model could be captured by a much simpler closed-form expression with only two free hyperparameters. They combined this saturating term-frequency factor with a length normalization derived from earlier Okapi experiments and with the RSJ inverse document frequency, producing the formula universally known as BM25. The Okapi system at City University, in which the function was first implemented, gave rise to the alternative name Okapi BM25.
Throughout the late 1990s and early 2000s, BM25 dominated the Text REtrieval Conference (TREC) ad hoc and web tracks, repeatedly outperforming vector-space cosine similarity on heterogeneous collections. Robertson and Hugo Zaragoza published a comprehensive retrospective in 2009, The Probabilistic Relevance Framework: BM25 and Beyond, which is still the canonical theoretical reference for the function and its variants.
For a query Q containing terms q₁, q₂, ..., qₙ and a document D, the BM25 score is
score(D, Q) = Σ IDF(q_i) · [ f(q_i, D) · (k₁ + 1) ] / [ f(q_i, D) + k₁ · (1 - b + b · |D| / avgdl) ]
where
- f(q_i, D) is the raw term frequency of qᵢ in document D,
- |D| is the length of D measured in tokens,
- avgdl is the average document length across the collection,
- k₁ and b are free hyperparameters, and
- IDF(q_i) is the inverse document frequency of qᵢ.

The IDF component is the Robertson-Sparck Jones weight in its commonly used form:
IDF(q_i) = log( (N - n(q_i) + 0.5) / (n(q_i) + 0.5) + 1 )
where N is the total number of documents in the collection and n(qᵢ) is the number of documents containing qᵢ. The + 1 outside the logarithm is a smoothing convention introduced in Lucene to avoid negative IDF values for very common terms; the original probabilistic derivation produces a log-odds ratio that can in fact go negative for terms appearing in more than half of the collection.
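The full scoring function can be sketched directly in Python. The following is an illustrative implementation using the Lucene-style + 1 smoothing in the IDF; the function signature and variable names are invented for the example, and a real engine would of course work from an inverted index rather than scoring tokenized documents one at a time:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avgdl,
               k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query.

    doc_freqs maps each term to n(q_i), the number of documents
    containing it; n_docs is N, the collection size.
    """
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for term in query_terms:
        f = tf[term]
        if f == 0:
            continue  # absent terms contribute nothing
        n = doc_freqs.get(term, 0)
        # Lucene-style smoothed Robertson-Sparck Jones IDF (never negative)
        idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1)
        # Saturating term frequency with document-length normalization
        norm = 1 - b + b * dl / avgdl
        score += idf * f * (k1 + 1) / (f + k1 * norm)
    return score
```

Note that all collection-level statistics (N, n(qᵢ), avgdl) are computed once at indexing time; only the term-frequency lookups happen per query.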
Three behaviors explain almost everything BM25 does well.
Term-frequency saturation. The numerator grows linearly with f(q_i, D), but the denominator also contains f(q_i, D), so the ratio asymptotes at (k₁ + 1) as term frequency grows. Doubling the count of a query term in a document raises its contribution noticeably the first few times but quickly produces diminishing returns, which prevents keyword-stuffed pages from winning the ranking.
Document-length normalization. The factor (1 - b + b · |D| / avgdl) raises the effective k₁ for documents longer than average and lowers it for shorter ones. With b = 1 normalization is full; with b = 0 length is ignored. This addresses the bias toward long documents that plain tf-idf inherits from the bag-of-words model.
Negative IDF for ubiquitous terms. Without the Lucene + 1 smoothing, words appearing in more than half of the collection contribute negative weight, actively discouraging matches on stop-word-like terms and pushing the ranking toward more discriminative query terms.
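The saturation behavior is easy to verify numerically. The snippet below (a toy illustration, not any particular library's API) evaluates the term-frequency factor for an average-length document, where the normalization factor equals 1, so the factor climbs quickly at small counts and then flattens toward the k₁ + 1 = 2.2 asymptote:

```python
def tf_factor(f, k1=1.2, b=0.75, dl=100, avgdl=100):
    """Saturating term-frequency component of BM25 for one term."""
    norm = 1 - b + b * dl / avgdl  # equals 1 for an average-length doc
    return f * (k1 + 1) / (f + k1 * norm)

# For an average-length document the factor rises steeply at first,
# then approaches the asymptote k1 + 1 = 2.2 with diminishing returns:
for f in (1, 2, 5, 20, 100):
    print(f, round(tf_factor(f), 3))
```

Going from one occurrence to two still moves the score noticeably, while going from 20 occurrences to 100 barely moves it at all, which is exactly the anti-keyword-stuffing property described above.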
BM25 has two practitioner-facing hyperparameters. The defaults work reasonably well almost everywhere, but tuning them on a held-out evaluation set can yield meaningful relevance gains.
| Parameter | Typical range | Default in Lucene | Effect when increased |
|---|---|---|---|
| k₁ | 1.2 to 2.0 | 1.2 | Slower term-frequency saturation; raw counts matter more |
| b | 0.0 to 1.0 | 0.75 | Stronger penalty on long documents |
Setting k₁ = 0 reduces the term-frequency factor to a binary indicator, recovering the Binary Independence Model. Very large k₁ causes BM25 to behave like raw tf-idf with no saturation. The default k₁ = 1.2 reflects empirical findings from TREC experiments on English collections; for very short documents like product titles, lower values around k₁ = 0.9 sometimes work better, while for long-form documents like research papers, values from 1.5 to 2.0 are common.
The parameter b has a sharper interpretation: b = 0 disables length normalization and b = 1 applies it fully. The default b = 0.75 is a compromise that works well on mixed collections. Homogeneous corpora often prefer lower b; corpora that span several orders of magnitude in length favor values close to 1.
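The effect of b can be demonstrated with two documents that contain a query term the same number of times but differ sharply in length (the lengths and counts below are toy numbers chosen for illustration):

```python
def tf_component(f, dl, avgdl, k1=1.2, b=0.75):
    """BM25 term-frequency component with length normalization."""
    norm = 1 - b + b * dl / avgdl
    return f * (k1 + 1) / (f + k1 * norm)

avgdl = 100
short_dl, long_dl = 50, 400  # document lengths in tokens
f = 3                        # same raw term frequency in both documents

# b = 0: length is ignored, so the two documents tie exactly
assert tf_component(f, short_dl, avgdl, b=0.0) == tf_component(f, long_dl, avgdl, b=0.0)

# b = 0.75 (the default): the short document scores higher, since the
# same three occurrences are a stronger signal in a shorter document
assert tf_component(f, short_dl, avgdl) > tf_component(f, long_dl, avgdl)
```

Tuning b is therefore really a statement about the corpus: if long documents in the collection are long because they are verbose, push b up; if they are long because they genuinely cover more material, push it down.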
BM25 is sometimes described as an improved tf-idf, and the comparison is informative. Both functions multiply a term-frequency component by an inverse document frequency component and sum across query terms, but the details differ in ways that matter in practice.
| Property | tf-idf | BM25 |
|---|---|---|
| Term-frequency growth | Linear or log-scaled, unbounded | Saturating, asymptotes at k₁ + 1 |
| Document-length normalization | None or ad hoc cosine normalization | Built-in, controlled by b |
| IDF formulation | log(N / df) | log((N - df + 0.5) / (df + 0.5) + 1) |
| Theoretical basis | Heuristic, vector-space model | Probabilistic Relevance Framework, 2-Poisson |
| Negative weights for very common terms | No | Yes (without Lucene +1 smoothing) |
| Hyperparameters | Usually none | k₁, b |
In head-to-head experiments on standard test collections, BM25 typically beats vanilla tf-idf by several points of mean average precision. The advantage is largest on collections with widely varying document lengths and on queries with repeated content terms, both of which are situations where unbounded term frequency in tf-idf produces unstable scores.
The core BM25 formula has been extended in many directions to address specific weaknesses or to handle structured documents. The most widely deployed variants are summarized below.
| Variant | Year | Authors | Key idea |
|---|---|---|---|
| BM25F | 2004 | Robertson, Zaragoza, Taylor | Fielded scoring; weighted combination of term frequencies across multiple document fields with per-field length normalization |
| BM25+ | 2011 | Lv, Zhai | Adds a constant δ to the term-frequency factor to enforce a strict lower bound; fixes a bias against very long documents |
| BM25L | 2011 | Lv, Zhai | Shifts the normalization curve so long documents are penalized less, while preserving the saturating shape |
| BM25-adpt | 2012 | Lv, Zhai | Adapts k₁ per term based on collection statistics rather than using a single global value |
| BM25T | 2014 | Trotman, Puurula, Burgess | Term-specific treatment of the saturation parameter; examined in a systematic comparison of BM25 modifications |
| TW-IDF | 2013 | Rousseau, Vazirgiannis | Replaces term frequency with a graph-based term weight inspired by BM25 saturation |
BM25F is by far the most operationally important variant. Real web pages have multiple fields such as title, headers, anchor text, body, and URL, each carrying different relevance signals and each having its own characteristic length distribution. BM25F handles each field as a separate "stream," computes a length-normalized weighted term frequency per stream, sums across streams with field-specific boost factors, and then applies the standard BM25 saturation and IDF on top. Search engines from Bing to Elasticsearch use BM25F or close relatives to score multi-field documents.
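A minimal sketch of the BM25F idea follows; the field names, boosts, and the helper's signature are invented for illustration. The key structural point is the ordering of operations: per-field term frequencies are length-normalized and boosted first, then saturation and IDF are applied once to the combined pseudo-frequency, rather than summing independent per-field BM25 scores:

```python
def bm25f_term_score(field_tfs, field_lens, field_avgls, boosts,
                     idf, k1=1.2, field_b=None):
    """One term's BM25F contribution for one document (sketch).

    field_tfs[f]   - raw term frequency in field f (e.g. 'title', 'body')
    field_lens[f]  - length of field f in this document
    field_avgls[f] - average length of field f across the collection
    boosts[f]      - per-field weight
    """
    if field_b is None:
        field_b = {f: 0.75 for f in field_tfs}
    # Weighted, per-field length-normalized pseudo term frequency
    pseudo_tf = 0.0
    for f, tf in field_tfs.items():
        b = field_b[f]
        norm = 1 - b + b * field_lens[f] / field_avgls[f]
        pseudo_tf += boosts[f] * tf / norm
    # Saturation and IDF applied once, on the combined stream
    return idf * pseudo_tf * (k1 + 1) / (pseudo_tf + k1)
```

Applying saturation after combining fields is what prevents a term that appears once in the title and once in the body from being double-counted as two independent saturating signals.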
BM25+ and BM25L address a subtle theoretical defect in the original BM25: the lower bound of the term-frequency contribution can become arbitrarily small for very long documents, with the result that a long document containing the query term is sometimes scored similarly to a short document that does not contain the term at all. BM25+ adds a small additive constant to enforce a strict positive lower bound; BM25L reshapes the normalization curve to similar effect.
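The BM25+ fix amounts to a one-line change to the term-frequency component (the function below is a sketch, with δ = 1.0 as a common choice): for any term that occurs at all, the contribution is bounded below by δ no matter how long the document is.

```python
def tf_component_plus(f, dl, avgdl, k1=1.2, b=0.75, delta=1.0):
    """BM25+ term-frequency component: bounded below by delta when f > 0."""
    if f == 0:
        return 0.0  # absent terms still contribute nothing
    norm = 1 - b + b * dl / avgdl
    return f * (k1 + 1) / (f + k1 * norm) + delta
```

In plain BM25, a single occurrence in an extremely long document can score arbitrarily close to zero; with the δ term, presence of the query term always dominates absence.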
BM25 is the standard scoring function in essentially every modern open-source full-text search engine.
| System | Default scoring | Notes |
|---|---|---|
| Apache Lucene | BM25 since version 6.0 (2016) | Replaced the original DefaultSimilarity (a tf-idf variant) |
| Elasticsearch | BM25 since 5.0 (2016) | Inherited the change from Lucene |
| OpenSearch | BM25 via Lucene's BM25Similarity | Direct fork of Elasticsearch |
| Apache Solr | BM25 since Solr 6 | Same Lucene foundation |
| Tantivy | BM25 | Rust full-text library used by Quickwit and ParadeDB |
| Vespa | BM25 as bm25 rank feature | nativeRank is the historical default |
| PostgreSQL (ParadeDB, VectorChord-BM25) | BM25 via extension | Brings true BM25 to Postgres |
| Whoosh, Xapian, Terrier | BM25 | Standard in research IR toolkits |
The transition in Lucene from its original DefaultSimilarity to BM25 in version 6.0 (2016) was one of the most consequential changes in the history of open-source search. It propagated downstream into Elasticsearch and Solr the same year, effectively making BM25 the default behavior of nearly every search box on the open web.
The rise of dense retrieval, in which queries and documents are encoded by neural networks into fixed-dimensional vectors and matched by approximate nearest neighbor search, was widely expected to make BM25 obsolete. That has not happened. Instead, BM25 has settled into three durable roles.
BM25 is the standard reference baseline in modern IR research. Almost every paper introducing a dense retriever, learned sparse retriever, or cross-encoder reranker reports BM25 numbers on standard benchmarks such as MS MARCO, TREC Deep Learning, and the BEIR benchmark. The toolkits Anserini (Java, built on Lucene) and Pyserini (Python bindings to Anserini) maintain reproducible BM25 baselines and prebuilt indexes for these collections, which makes BM25 the lingua franca for comparing retrieval systems.
A frequently cited result from the BEIR benchmark is that BM25 remains a remarkably strong zero-shot retriever: dense models trained on MS MARCO often fail to generalize to other domains, while BM25 has no domain-specific training and therefore degrades gracefully when moved to scientific, medical, or legal corpora. This robustness has shaped how the field thinks about retrieval evaluation.
Production semantic search systems and retrieval-augmented generation pipelines very rarely use dense retrieval alone. They combine BM25 lexical search with dense neural retrieval and then fuse the two ranked lists. The fusion can be done by a learned reranker, by weighted score interpolation, or, most commonly, by Reciprocal Rank Fusion (RRF), which assigns each document a score of 1 / (k + rank) in each list (with k typically set to 60) and sums those scores.
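Reciprocal Rank Fusion is simple enough to sketch in a few lines. The fusion constant k = 60 follows the description above; the document IDs and list contents are invented for the example:

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids via Reciprocal Rank Fusion.

    Each ranking is an ordered list, best first.  A document earns
    1 / (k + rank) from every list it appears in, rank starting at 1,
    and the fused order sorts by the summed score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7", "d2"]   # lexical ranking (illustrative)
dense_hits = ["d1", "d5", "d3", "d9"]  # dense ranking (illustrative)
fused = rrf_fuse([bm25_hits, dense_hits])
# d1 and d3 appear near the top of both lists, so they lead the fusion
```

Because RRF uses only ranks, not raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.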
| Approach | Strengths | Weaknesses |
|---|---|---|
| BM25 alone | Exact lexical match, zero training data required, fast, interpretable, robust across domains | Misses paraphrases, synonyms, and conceptual matches |
| Dense retrieval alone (e.g., DPR, two-tower bi-encoders) | Captures semantic similarity, handles paraphrases | Expensive to train, weak on rare terms and out-of-domain queries, requires a vector database |
| BM25 + dense, RRF fusion | Inherits exact-match precision and semantic recall; consistently best on heterogeneous queries | Higher infrastructure cost; two indexes to maintain |
| Learned sparse (SPLADE) | BM25-style inverted index with learned term expansion; closes most BM25 gaps while retaining its speed | Newer, less mature tooling; still requires a learned model |
Empirical reports from production teams routinely show that hybrid BM25-plus-dense fusion lifts recall at top-k by ten to twenty points over either component alone, which translates directly into fewer hallucinations in a downstream RAG system because the language model has more relevant context to ground its answers in.
A newer family of retrievers, exemplified by SPLADE (Sparse Lexical and Expansion Model, Formal et al. 2021) and its successors, attempts to keep the inverted-index efficiency of BM25 while learning the term weights and the term expansions with a neural language model. SPLADE replaces BM25's heuristic saturating term-frequency function with a learned max-pooling over BERT token activations, and it lets the model add or remove terms from the document and the query, producing a sparse vector in the BERT vocabulary that can be indexed and searched with standard inverted-index data structures.
SPLADE is best understood as BM25 with the rough edges sanded off by a transformer: it keeps the storage and latency profile of a sparse retriever, it remains interpretable in terms of which vocabulary tokens contributed to the score, and on benchmarks like MS MARCO and BEIR it closes most of the gap between BM25 and state-of-the-art dense models while being substantially cheaper to serve.
A few practical points deserve emphasis:

- The defaults k₁ = 1.2 and b = 0.75 work well across a wide range of corpora, so a working search system can be built without tuning.
- The best values of k₁ and b differ measurably between corpora, and tuning them requires labeled or implicit-feedback data.
- Lucene adds + 1 inside the logarithm to clamp IDF at zero; some research toolkits keep the original signed form.

Thirty-two years after the 1994 Robertson-Walker paper, BM25 is still the default scoring function in the search engines that index the open web, the standard baseline in academic IR, and the lexical backbone of every serious hybrid retrieval pipeline. Its longevity rests on the properties described throughout this article: a simple closed form with two hyperparameters, no training data, interpretable scores, and robust behavior across domains.
For most teams building search or retrieval-augmented generation today, the right starting point is still: index the corpus, run BM25 with default k₁ = 1.2 and b = 0.75, and only add a dense retriever, a reranker, or a learned sparse model once BM25 has been measured and shown to be insufficient. More often than not, the BM25 baseline is harder to beat than expected.
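Putting that advice together, a minimal end-to-end BM25 ranker over an in-memory corpus looks like the sketch below. The whitespace tokenization and toy documents are illustrative assumptions; a production system would use a real analyzer and an inverted index rather than rescanning every document:

```python
import math
from collections import Counter

def build_stats(docs):
    """Precompute collection statistics: document frequencies and avgdl."""
    doc_freqs = Counter()
    for doc in docs:
        doc_freqs.update(set(doc))  # count each doc at most once per term
    avgdl = sum(len(d) for d in docs) / len(docs)
    return doc_freqs, avgdl

def rank(query, docs, k1=1.2, b=0.75):
    """Return document indices ordered by BM25 score, best first."""
    doc_freqs, avgdl = build_stats(docs)
    n_docs = len(docs)
    scores = []
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        norm = 1 - b + b * len(doc) / avgdl
        s = 0.0
        for term in query:
            f = tf[term]
            if f == 0:
                continue
            n = doc_freqs[term]
            idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1)
            s += idf * f * (k1 + 1) / (f + k1 * norm)
        scores.append((s, i))
    return [i for s, i in sorted(scores, reverse=True)]

corpus = [
    "the cat sat on the mat".split(),
    "dogs and cats living together".split(),
    "the quick brown fox".split(),
]
print(rank("cat mat".split(), corpus))  # doc 0 ranks first: it matches both terms
```

Roughly forty lines suffice for a measurable baseline, which is much of the argument for starting here before reaching for heavier machinery.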