# BM25 (Okapi BM25)

> Source: https://aiwiki.ai/wiki/bm25
> Updated: 2026-06-21
> Categories: Algorithms, Information Retrieval
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**BM25**, also called **Okapi BM25** or **Best Match 25**, is a probabilistic ranking function used by search engines and information retrieval systems to estimate how relevant a document is to a given query. It belongs to the broader Probabilistic Relevance Framework developed in the 1970s and 1980s by Stephen Robertson, Karen Sparck Jones, and collaborators, and it remains, more than three decades after its first publication, the de facto default scoring function for keyword search in production systems including Apache Lucene, Elasticsearch, OpenSearch, Solr, Vespa, and Tantivy.[2]

The "BM" stands for "Best Matching" and the number 25 reflects the fact that the formula was the twenty-fifth iteration in a series of probabilistic weighting experiments by Robertson and colleagues. It was first described in fully recognizable form by Robertson and Walker in 1994 and implemented in the **Okapi** search system at the City University of London, the source of the "Okapi" prefix.[1]

BM25 generalizes classical [tf_idf](/wiki/tf_idf) scoring with two refinements: a saturating term-frequency function that prevents extremely repetitive documents from dominating the ranking, and an explicit document-length normalization that compensates for the fact that longer documents naturally contain more occurrences of any given term.[2] These properties make BM25 competitive even in the era of large neural language models, and it serves as both the standard baseline in academic IR research and the lexical component of hybrid retrieval pipelines that power [retrieval_augmented_generation](/wiki/retrieval_augmented_generation) systems.

## What does BM25 stand for and where did it come from?

BM25 emerged from a long line of research into probabilistic models for document retrieval that began with the work of Maron and Kuhns in the early 1960s and was formalized by Robertson and Sparck Jones in their 1976 paper on relevance weighting of search terms.[3] That earlier work introduced the **Robertson-Sparck Jones weight**, often abbreviated RSJ, which is essentially the IDF component used inside BM25 today.[3]

A key intermediate step was Stephen Harter's 1975 work on the **2-Poisson model**, which proposed that occurrences of a content-bearing term could be modeled as a mixture of two Poisson processes: one for documents genuinely about the topic ("elite" documents) and another for documents in which the term appears only incidentally.[6] The 2-Poisson derivation produced a saturating term-weighting function, but the original formulation required estimating several latent parameters per term.[6]

In their landmark 1994 SIGIR paper, *Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval*, Robertson and Walker showed that almost all of the practical benefit of the 2-Poisson model could be captured by a much simpler closed-form expression with only two free hyperparameters.[1] As they put it, the goal was to use "the 2-Poisson model for term frequencies to suggest ways of incorporating certain variables in probabilistic models for information retrieval," namely within-document term frequency, document length, and within-query term frequency.[1] They combined this saturating term-frequency factor with a length normalization derived from earlier Okapi experiments and with the RSJ inverse document frequency, producing the formula universally known as BM25.[1] The Okapi system at City University, in which the function was first implemented, gave rise to the alternative name **Okapi BM25**.[1]

Throughout the late 1990s and early 2000s, BM25 dominated the **Text REtrieval Conference (TREC)** ad hoc and web tracks, repeatedly outperforming vector-space cosine similarity on heterogeneous collections. Robertson and Hugo Zaragoza published a comprehensive retrospective in 2009, *The Probabilistic Relevance Framework: BM25 and Beyond*, which describes the Probabilistic Relevance Framework as "a formal framework for document retrieval" that "led to one of the most successful text-retrieval algorithms, BM25," and is still the canonical theoretical reference for the function and its variants.[2]

## What is the BM25 scoring formula?

For a query *Q* containing terms *q₁, q₂, ..., qₙ* and a document *D*, the BM25 score is

```
score(D, Q) = Σ IDF(q_i) · [ f(q_i, D) · (k₁ + 1) ] / [ f(q_i, D) + k₁ · (1 - b + b · |D| / avgdl) ]
```

where

- `f(q_i, D)` is the raw term frequency of *qᵢ* in document *D*,
- `|D|` is the length of *D* measured in tokens,
- `avgdl` is the average document length across the collection,
- `k₁` and `b` are free hyperparameters, and
- `IDF(q_i)` is the inverse document frequency of *qᵢ*.

The IDF component is the Robertson-Sparck Jones weight in its commonly used form:[3]

```
IDF(q_i) = log( (N - n(q_i) + 0.5) / (n(q_i) + 0.5) + 1 )
```

where *N* is the total number of documents in the collection and *n(qᵢ)* is the number of documents containing *qᵢ*. The `+ 1` outside the logarithm is a smoothing convention introduced in Lucene to avoid negative IDF values for very common terms; the original probabilistic derivation produces a log-odds ratio that can in fact go negative for terms appearing in more than half of the collection.[12]

### How do you read the formula?

Three behaviors explain almost everything BM25 does well.

**Term-frequency saturation.** The numerator grows linearly with `f(q_i, D)`, but the denominator also contains `f(q_i, D)`, so the ratio asymptotes at `(k₁ + 1)` as term frequency grows. Doubling the count of a query term in a document raises its contribution noticeably the first few times but quickly produces diminishing returns, which prevents keyword-stuffed pages from winning the ranking.

**Document-length normalization.** The factor `(1 - b + b · |D| / avgdl)` raises the effective `k₁` for documents longer than average and lowers it for shorter ones. With `b = 1` normalization is full; with `b = 0` length is ignored. This addresses the bias toward long documents that plain tf-idf inherits from the bag-of-words model.

**Negative IDF for ubiquitous terms.** Without the Lucene `+ 1` smoothing, words appearing in more than half of the collection contribute negative weight, actively discouraging matches on stop-word-like terms and pushing the ranking toward more discriminative query terms.[12]

## What are the BM25 hyperparameters k1 and b?

BM25 has two practitioner-facing hyperparameters. The defaults work reasonably well almost everywhere, but tuning them on a held-out evaluation set can yield meaningful relevance gains.

| Parameter | Typical range | Default in Lucene | Effect when increased |
| --- | --- | --- | --- |
| `k₁` | 1.2 to 2.0 | 1.2 | Slower term-frequency saturation; raw counts matter more |
| `b` | 0.0 to 1.0 | 0.75 | Stronger penalty on long documents |

Setting `k₁ = 0` reduces the term-frequency factor to a binary indicator, recovering the **Binary Independence Model**. Very large `k₁` causes BM25 to behave like raw tf-idf with no saturation. The default `k₁ = 1.2` reflects empirical findings from TREC experiments on English collections; for very short documents like product titles, lower values around `k₁ = 0.9` sometimes work better, while for long-form documents like research papers, values from `1.5` to `2.0` are common.[12]

The parameter `b` has a sharper interpretation: `b = 0` disables length normalization and `b = 1` applies it fully. The default `b = 0.75` is a compromise that works well on mixed collections.[12] Homogeneous corpora often prefer lower `b`; corpora that span several orders of magnitude in length favor values close to `1`.

## How does BM25 differ from tf-idf?

BM25 is sometimes described as an improved tf-idf, and the comparison is informative. Both functions multiply a term-frequency component by an inverse document frequency component and sum across query terms, but the details differ in ways that matter in practice.

| Property | tf-idf | BM25 |
| --- | --- | --- |
| Term-frequency growth | Linear or log-scaled, unbounded | Saturating, asymptotes at `k₁ + 1` |
| Document-length normalization | None or ad hoc cosine normalization | Built-in, controlled by `b` |
| IDF formulation | `log(N / df)` | `log((N - df + 0.5) / (df + 0.5) + 1)` |
| Theoretical basis | Heuristic, vector-space model | Probabilistic Relevance Framework, 2-Poisson |
| Negative weights for very common terms | No | Yes (without Lucene `+1` smoothing) |
| Hyperparameters | Usually none | `k₁`, `b` |

In head-to-head experiments on standard test collections, BM25 typically beats vanilla tf-idf by several points of mean average precision.[2] The advantage is largest on collections with widely varying document lengths and on queries with repeated content terms, both of which are situations where unbounded term frequency in tf-idf produces unstable scores.

## What are the main BM25 variants?

The core BM25 formula has been extended in many directions to address specific weaknesses or to handle structured documents. The most widely deployed variants are summarized below.

| Variant | Year | Authors | Key idea |
| --- | --- | --- | --- |
| **BM25F** | 2004 | Robertson, Zaragoza, Taylor | Fielded scoring; weighted combination of term frequencies across multiple document fields with per-field length normalization |
| **BM25+** | 2011 | Lv, Zhai | Adds a constant `δ` to the term-frequency factor to enforce a strict lower bound; fixes a bias against very long documents |
| **BM25L** | 2011 | Lv, Zhai | Shifts the normalization curve so long documents are penalized less, while preserving the saturating shape |
| **BM25-adpt** | 2012 | Lv, Zhai | Adapts `k₁` per term based on collection statistics rather than using a single global value |
| **BM25T** | 2017 | Trotman, Puurula, Burgess | Various term-specific tweaks; common in academic IR |
| **TW-IDF** | 2013 | Rousseau, Vazirgiannis | Replaces term frequency with a graph-based term weight inspired by BM25 saturation |

BM25F is by far the most operationally important variant. Real web pages have multiple fields such as title, headers, anchor text, body, and URL, each carrying different relevance signals and each having its own characteristic length distribution. BM25F handles each field as a separate "stream," computes a length-normalized weighted term frequency per stream, sums across streams with field-specific boost factors, and then applies the standard BM25 saturation and IDF on top.[4] Search engines from Bing to Elasticsearch use BM25F or close relatives to score multi-field documents.

BM25+ and BM25L address a subtle theoretical defect in the original BM25: the lower bound of the term-frequency contribution can become arbitrarily small for very long documents, with the result that a long document containing the query term is sometimes scored similarly to a short document that does not contain the term at all.[5] BM25+ adds a small additive constant to enforce a strict positive lower bound; BM25L reshapes the normalization curve to similar effect.[5]

## Which search engines use BM25 by default?

BM25 is the standard scoring function in essentially every modern open-source full-text search engine.

| System | Default scoring | Notes |
| --- | --- | --- |
| **Apache Lucene** | BM25 since version 6.0 (2016) | Replaced original `DefaultSimilarity` (a tf-idf variant) |
| **elasticsearch** | BM25 since 5.0 (2016) | Inherited the change from Lucene |
| **OpenSearch** | BM25 via Lucene's `BM25Similarity` | Direct fork of Elasticsearch |
| **Apache Solr** | BM25 since Solr 6 | Same Lucene foundation |
| **Tantivy** | BM25 | Rust full-text library used by Quickwit and ParadeDB |
| **Vespa** | BM25 as `bm25` rank feature | `nativeRank` is the historical default |
| **PostgreSQL (ParadeDB, VectorChord-BM25)** | BM25 via extension | Brings true BM25 to Postgres |
| **Whoosh, Xapian, Terrier** | BM25 | Standard in research IR toolkits |

The transition in Lucene from its original `DefaultSimilarity` to BM25 in version 6.0, released in April 2016, was one of the most consequential changes in the history of open-source search.[12] It propagated downstream into Elasticsearch 5.0 and Solr 6 the same year, effectively making BM25 the default behavior of nearly every search box on the open web.[13]

## Is BM25 still relevant in the era of neural retrieval?

The rise of dense retrieval, in which queries and documents are encoded by neural networks into fixed-dimensional vectors and matched by approximate nearest neighbor search, was widely expected to make BM25 obsolete. That has not happened. Instead, BM25 has settled into three durable roles.

### As a baseline

**BM25 is the standard reference baseline in modern IR research.** Almost every paper introducing a dense retriever, learned sparse retriever, or cross-encoder reranker reports BM25 numbers on standard benchmarks such as **MS MARCO**, **TREC Deep Learning**, and the **BEIR** benchmark.[10] The toolkits **Anserini** (Java, built on Lucene) and **Pyserini** (Python bindings to Anserini) maintain reproducible BM25 baselines and prebuilt indexes for these collections, which makes BM25 the lingua franca for comparing retrieval systems.[8][9]

A frequently cited result from the BEIR benchmark is that BM25 remains a remarkably strong zero-shot retriever: dense models trained on MS MARCO often fail to generalize to other domains, while BM25 has no domain-specific training and therefore degrades gracefully when moved to scientific, medical, or legal corpora.[10] The BEIR authors found that several learned sparse models that "perform well in-domain on MS MARCO completely fail to generalize" by underperforming BM25 on nearly all out-of-domain datasets, a result that has shaped how the field thinks about retrieval evaluation.[10]

### As one half of hybrid retrieval

Production [semantic_search](/wiki/semantic_search) systems and [retrieval_augmented_generation](/wiki/retrieval_augmented_generation) pipelines very rarely use dense retrieval alone. They combine BM25 lexical search with dense neural retrieval and then fuse the two ranked lists. The fusion can be done by a learned reranker, by weighted score interpolation, or, most commonly, by **Reciprocal Rank Fusion (RRF)**, which assigns each document a score of `1 / (k + rank)` in each list (with `k` typically set to 60) and sums those scores.[11]

| Approach | Strengths | Weaknesses |
| --- | --- | --- |
| **BM25 alone** | Exact lexical match, zero training data required, fast, interpretable, robust across domains | Misses paraphrases, synonyms, and conceptual matches |
| **Dense retrieval alone** (e.g., DPR, [two-tower_model](/wiki/two-tower_model) bi-encoders) | Captures semantic similarity, handles paraphrases | Expensive to train, weak on rare terms and out-of-domain queries, requires a [vector_database](/wiki/vector_database) |
| **BM25 + dense, RRF fusion** | Inherits exact-match precision and semantic recall; consistently best on heterogeneous queries | Higher infrastructure cost; two indexes to maintain |
| **Learned sparse (SPLADE)** | BM25-style inverted index with learned term expansion; closes most BM25 gaps while retaining its speed | Newer, less mature tooling; still requires a learned model |

Empirical reports from production teams routinely show that hybrid BM25-plus-dense fusion lifts recall at top-k by ten to twenty points over either component alone, which translates directly into fewer hallucinations in a downstream RAG system because the language model has more relevant context to ground its answers in.

### As the inspiration for learned sparse retrieval

A newer family of retrievers, exemplified by **SPLADE** (Sparse Lexical and Expansion Model, Formal et al. 2021) and its successors, attempts to keep the inverted-index efficiency of BM25 while learning the term weights and the term expansions with a neural language model.[7] SPLADE replaces BM25's heuristic saturating term-frequency function with a learned max-pooling over BERT token activations, and it lets the model add or remove terms from the document and the query, producing a sparse vector in the BERT vocabulary that can be indexed and searched with standard inverted-index data structures.[7]

SPLADE is best understood as **BM25 with the rough edges sanded off by a transformer**: it keeps the storage and latency profile of a sparse retriever, it remains interpretable in terms of which vocabulary tokens contributed to the score, and on benchmarks like MS MARCO and BEIR it closes most of the gap between BM25 and state-of-the-art dense models while being substantially cheaper to serve.[7]

## What are the strengths and limitations of BM25?

### Strengths

- **No training required.** BM25 is unsupervised and can be applied to any new corpus the moment it is indexed. This is decisive for cold-start scenarios and for domains where labeled relevance judgments are scarce.
- **Inverted-index efficient.** Score contributions are computed per term and summed across the posting list, which fits the inverted-index data structure that has been optimized for fifty years. Latency is typically a few milliseconds even on collections of hundreds of millions of documents.
- **Interpretable.** The score decomposes term by term, so a search engineer can inspect exactly why a document was returned. Dense neural retrievers are mostly opaque.
- **Robust across domains.** BM25 has no notion of meaning, but it also has no domain biases. On the BEIR benchmark and similar zero-shot evaluations, BM25 frequently outperforms dense retrievers trained on a different domain.[10]
- **Stable defaults.** The default `k₁ = 1.2`, `b = 0.75` work well across a wide range of corpora, so a working search system can be built without tuning.[12]

### Limitations

- **Pure lexical matching.** BM25 cannot match a query for "car" with a document about "automobiles" unless an external synonym list, query expansion, or stemming pipeline is added. This is the primary reason hybrid search exists.
- **Bag-of-words assumption.** Word order, syntactic structure, and discourse-level context are ignored.
- **No notion of document quality.** BM25 only measures how well a document matches the query, not whether the document is well-written, authoritative, or fresh. Production systems layer additional ranking signals on top.
- **Sensitive to tokenization, stemming, and stop-word choices.** The numerical score depends entirely on the upstream text-processing pipeline, and inconsistencies between indexing and querying time will silently degrade relevance.
- **Hyperparameters drift with domain.** While defaults are robust, optimal `k₁` and `b` differ measurably between corpora and tuning them requires labeled or implicit-feedback data.

## Practical implementation notes

- **Tokenization matters.** Lowercasing, stemming, stop-word removal, and subword handling all affect the BM25 score before the formula is even evaluated. Use the same analyzer at indexing and query time.[13]
- **Negative IDF.** The original Robertson-Sparck Jones IDF can go negative for very common terms. Lucene adds `+ 1` inside the logarithm to clamp IDF at zero; some research toolkits keep the original signed form.[12]
- **Per-field BM25.** When indexing structured documents, prefer BM25F over running independent BM25 instances per field. BM25F applies length normalization within each field before saturation.[4]
- **Avoid score comparison across queries.** BM25 scores are not calibrated probabilities and only rank documents within a single query.
- **Reciprocal Rank Fusion is rank-based.** When fusing BM25 with dense scores, RRF uses only the rank, sidestepping the score-calibration problem.[11]

## Why is BM25 still relevant in 2026?

Thirty-two years after the 1994 Robertson-Walker paper, BM25 is still the default scoring function in the search engines that index the open web, the standard baseline in academic IR, and the lexical backbone of every serious hybrid retrieval pipeline.[1] Several factors explain this longevity:

1. **The information-retrieval problem has not changed as much as the rest of NLP.** Documents and queries are still mostly text, and exact lexical matches are still strong evidence of relevance.
2. **BM25 is cheap to compute and trivial to scale.** A modern engine can BM25-rank a hundred-million-document collection in single-digit milliseconds with no GPU.
3. **BM25 generalizes across domains without retraining.** Dense retrievers trained on one corpus often degrade sharply when moved to another; BM25 does not.[10]
4. **BM25 composes well with neural retrievers.** Hybrid pipelines that combine BM25 with dense embeddings consistently beat either component alone, so BM25 is not displaced; it is an ingredient.
5. **Learned sparse retrievers preserve the BM25 deployment story.** SPLADE-style models produce sparse vectors that fit into the same inverted-index infrastructure that already serves BM25, making the upgrade incremental.[7]

For most teams building search or [retrieval_augmented_generation](/wiki/retrieval_augmented_generation) today, the right starting point is still: index the corpus, run BM25 with default `k₁ = 1.2` and `b = 0.75`, and only add a dense retriever, a reranker, or a learned sparse model once BM25 has been measured and shown to be insufficient. More often than not, the BM25 baseline is harder to beat than expected.

## See also

- [tf_idf](/wiki/tf_idf)
- [semantic_search](/wiki/semantic_search)
- [retrieval_augmented_generation](/wiki/retrieval_augmented_generation)
- [vector_database](/wiki/vector_database)
- [two-tower_model](/wiki/two-tower_model)
- elasticsearch

## References

1. Robertson, S. E., & Walker, S. (1994). Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval. *Proceedings of the 17th Annual International ACM SIGIR Conference*, 232 to 241.
2. Robertson, S., & Zaragoza, H. (2009). The Probabilistic Relevance Framework: BM25 and Beyond. *Foundations and Trends in Information Retrieval*, 3(4), 333 to 389.
3. Robertson, S. E., & Sparck Jones, K. (1976). Relevance Weighting of Search Terms. *Journal of the American Society for Information Science*, 27(3), 129 to 146.
4. Robertson, S., Zaragoza, H., & Taylor, M. (2004). Simple BM25 Extension to Multiple Weighted Fields. *Proceedings of CIKM 2004*.
5. Lv, Y., & Zhai, C. (2011). Lower-bounding Term Frequency Normalization. *Proceedings of CIKM 2011* (introduces BM25+ and BM25L).
6. Harter, S. P. (1975). A Probabilistic Approach to Automatic Keyword Indexing. *Journal of the American Society for Information Science*, 26(4 and 5).
7. Formal, T., Piwowarski, B., & Clinchant, S. (2021). SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. *Proceedings of SIGIR 2021* (short paper).
8. Lin, J., Ma, X., Lin, S., Yang, J., Pradeep, R., & Nogueira, R. (2021). Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations. *Proceedings of SIGIR 2021* (demonstration).
9. Yang, P., Fang, H., & Lin, J. (2017). Anserini: Enabling the Use of Lucene for Information Retrieval Research. *Proceedings of SIGIR 2017*.
10. Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. *NeurIPS 2021 Datasets and Benchmarks Track*.
11. Cormack, G. V., Clarke, C. L. A., & Büttcher, S. (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. *Proceedings of SIGIR 2009*.
12. Apache Lucene project. *BM25Similarity* class documentation, lucene.apache.org.
13. Elastic. *Practical BM25* blog series (Parts 1 to 3) by Shane Connelly, elastic.co/blog.
14. OpenSearch Project. *Keyword search* documentation, opensearch.org.
15. Wikipedia contributors. *Okapi BM25*, en.wikipedia.org/wiki/Okapi_BM25.