See also: Information retrieval, Re-ranking, Recommendation system
Ranking in machine learning, often called learning to rank (LTR), is a class of supervised problems whose goal is to produce an ordered list of items rather than to predict a single value for each item independently. Given a query and a candidate set of documents, products, passages, or recommendations, a ranking model assigns scores so that the most relevant items appear first. The quality of the model is judged on the order it produces, not on the absolute scores it assigns. Ranking has been the workhorse of web search engines since the mid-2000s, sits at the heart of modern recommendation systems, and has become central again in the era of large language models where retrieval and re-ranking determine the quality of grounded answers in retrieval-augmented generation.
The field crystallised in the late 2000s. Microsoft researcher Tie-Yan Liu's 2009 monograph Learning to Rank for Information Retrieval organised dozens of competing methods into the now-standard pointwise, pairwise, and listwise framework, and Chris Burges and his colleagues at Microsoft Research produced the algorithmic line that ran from RankNet (2005) through LambdaRank (2006) to LambdaMART (2010), the model that powered Bing's web ranking for years. Since 2019 the field has shifted again, first to BERT-based rankers such as the monoBERT cross-encoder and the late-interaction model ColBERT, and then to listwise rerankers built on LLMs themselves.
A ranking problem is defined by a set of queries, a set of candidate items per query, and relevance labels (often graded, on a scale such as 0 through 4) for some query-item pairs. The model learns a scoring function f(q, d) that takes a query q and an item d and returns a real number. At inference time, the candidates for a new query are scored independently and sorted by score. Two features distinguish ranking from standard regression or classification. First, only the relative order of the scores within a query matters; the absolute values are never evaluated. Second, the quantities that are evaluated, metrics such as NDCG and MAP discussed below, are computed over the whole ranked list for a query and change only when two items swap positions, so they are non-smooth in the scores and cannot be optimised directly by gradient methods.
These two facts shape the algorithms. Pointwise methods sidestep both issues by reducing ranking to regression or classification on individual items. Pairwise and listwise methods confront them, either by using surrogate losses on pairs of items or by approximating list-level metrics with smooth proxies.
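As a minimal illustration of the inference step, the sketch below scores candidates with an invented toy scoring function and sorts them; the documents, query, and functions are made up for the example. Any strictly increasing transformation of the scores produces the same ranking, which is why only the order, not the absolute values, is judged.

```python
def rank(query, candidates, score_fn):
    """Score each candidate independently with f(q, d), then sort by score."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    return [doc for _, doc in sorted(scored, key=lambda pair: -pair[0])]

docs = ["ab", "bd", "xyz"]
f = lambda q, d: len(set(q) & set(d))                # toy scoring function
g = lambda q, d: 10 * f(q, d) + 7                    # strictly increasing transform of f
assert rank("abc", docs, f) == rank("abc", docs, g)  # same order, so same metric value
```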
Liu's 2009 survey is the standard reference for the way LTR algorithms are grouped. The three families differ in what each training example looks like and what loss function is minimised.
| Approach | Training instance | Typical loss | Examples | Key trade-off |
|---|---|---|---|---|
| Pointwise | (query, item, relevance) | Squared error, cross-entropy, ordinal regression loss | Linear regression, decision trees, McRank, Subset Ranking | Simple to implement, but ignores relative order between items for the same query |
| Pairwise | (query, item_i, item_j, label indicating which is more relevant) | Logistic / cross-entropy on pairs, hinge | RankNet, RankSVM, RankBoost, GBRank | Models relative order directly; many methods do not directly optimise list metrics |
| Listwise | (query, full ranked list) | Listwise cross-entropy, surrogate of NDCG / MAP | ListNet, ListMLE, AdaRank, LambdaMART, SoftRank | Closest to evaluation metric, but harder to optimise and more expensive |
Pointwise methods predict a score or label per item independently. Training looks identical to ordinary regression or classification: features describe a single (query, item) pair, the label is the graded relevance, and the loss is computed per example. Linear regression, gradient-boosted trees, and ordinal regression all fit here. Cossock and Zhang's 2008 Subset Ranking paper and Li, Burges, and Wu's 2007 McRank are commonly cited representatives. The method is easy to implement and scales well, but it throws away an important signal: a perfect score on item A is useless if the model also gives an equally high score to a less relevant item B for the same query. Pointwise methods often work well as components inside larger pipelines, especially as the first stage of a multi-stage system.
Pairwise methods reformulate ranking as binary classification on pairs of items from the same query. For every pair (item_i, item_j) where i is more relevant than j, the model learns to give item_i a higher score. Cross-entropy on the score difference, hinge loss, or AdaBoost-style weighted updates are typical choices. The pairwise family is the largest of the three. RankSVM (Joachims, 2002) adapted support vector machines to the pairwise loss. RankBoost (Freund and colleagues, 2003) used boosting. RankNet (Burges et al., 2005, ICML) introduced a neural network trained with a differentiable cross-entropy loss on pairs, and is the ancestor, via LambdaRank and LambdaMART, of most modern boosted-tree ranking systems. GBRank (Zheng et al., 2007) was an early application of gradient boosted trees to pairs.
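A compact sketch of the RankNet pairwise loss, written here in PyTorch (the function name and the use of graded labels to form pairs are choices of this sketch, not the paper's exact formulation): for every pair within a query where item i is labelled more relevant than item j, the loss is the cross-entropy between the predicted probability sigmoid(sigma * (s_i - s_j)) and a target of 1.

```python
import torch

def ranknet_loss(scores: torch.Tensor, relevance: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Pairwise RankNet loss for one query.

    scores:    (n,) model scores for the query's candidate items
    relevance: (n,) graded relevance labels for the same items
    """
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)                  # (n, n) score differences s_i - s_j
    more_relevant = relevance.unsqueeze(1) > relevance.unsqueeze(0)   # pairs where item i beats item j
    if not more_relevant.any():
        return scores.new_zeros(())
    # -log sigmoid(sigma * (s_i - s_j)) == softplus(-sigma * (s_i - s_j))
    pair_losses = torch.nn.functional.softplus(-sigma * diff[more_relevant])
    return pair_losses.mean()
```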
Pairwise loss is more aligned with ranking than pointwise loss but still does not match the metrics evaluators care about. A model can win nearly all pairs and still place a single highly relevant document at position 50 instead of position 1, which destroys NDCG@10.
Listwise methods take the entire ranked list as the unit of training. There are two sub-families. The first defines a smooth loss directly on the list, often the cross-entropy between two probability distributions over permutations: ListNet (Cao et al., ICML 2007) is the canonical example, with ListMLE (Xia et al., 2008) following. The second works directly on the metric of interest by approximating its non-smooth behaviour. SoftRank smooths the rank positions with a Gaussian. AdaRank (Xu and Li, 2007) adapts AdaBoost to optimise IR measures. LambdaRank (Burges, Ragno, and Le, NeurIPS 2006) sidesteps the problem entirely by writing down the gradient of the loss directly, multiplying the pairwise RankNet gradient by the change in NDCG that swapping the two items would cause. LambdaMART (Burges, 2010) plugs that gradient into MART (Multiple Additive Regression Trees), giving a gradient-boosted tree ensemble that optimises NDCG without ever computing the loss itself.
LambdaMART went on to dominate competitive learning-to-rank: an ensemble of LambdaMART rankers won track 1 of the 2010 Yahoo! Learning to Rank Challenge, and it powered Bing's web ranking for several years. Burges' 2010 Microsoft Research technical report From RankNet to LambdaRank to LambdaMART: An Overview is the standard reference for the lineage.
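Because LightGBM's ranking objective implements LambdaMART, a minimal training sketch looks like ordinary gradient boosting plus a group array stating how many candidates belong to each query. The data below is random and purely illustrative; hyperparameters are arbitrary.

```python
import numpy as np
import lightgbm as lgb

# Toy data: 100 queries, 10 candidates each, 20 features, graded labels 0-4.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 5, size=1000)
group = [10] * 100                 # number of candidates per query, in order

ranker = lgb.LGBMRanker(
    objective="lambdarank",        # LambdaMART: boosted trees trained with LambdaRank gradients
    n_estimators=200,
    learning_rate=0.05,
)
ranker.fit(X, y, group=group)

# At query time, score one query's candidates and sort descending.
scores = ranker.predict(X[:10])
order = np.argsort(-scores)
```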
A short tour of the most influential methods, in roughly chronological order.
| Algorithm | Year | Family | Notes |
|---|---|---|---|
| RankSVM | 2002 | Pairwise | SVM on pairwise differences (Joachims) |
| RankBoost | 2003 | Pairwise | AdaBoost adapted to pairs |
| RankNet | 2005 | Pairwise | Neural net with cross-entropy on pairs (Burges et al., ICML) |
| LambdaRank | 2006 | Listwise (gradient) | RankNet gradient scaled by delta-NDCG (Burges, Ragno, Le, NeurIPS) |
| McRank | 2007 | Pointwise | Multi-class classification on graded relevance |
| GBRank | 2007 | Pairwise | Gradient boosted trees on pairs |
| AdaRank | 2007 | Listwise | AdaBoost optimising IR measures (Xu, Li) |
| ListNet | 2007 | Listwise | Cross-entropy between softmax-normalised score distributions (Cao et al., ICML) |
| LambdaMART | 2010 | Listwise (gradient) | MART + LambdaRank gradients (Burges); won Yahoo LTR Challenge 2010 |
| monoBERT | 2019 | Cross-encoder | BERT cross-encoder fine-tuned on MS MARCO (Nogueira and Cho) |
| DPR | 2020 | Bi-encoder | Two BERT towers for open-domain QA (Karpukhin et al., EMNLP) |
| ColBERT | 2020 | Late interaction | Per-token MaxSim over BERT embeddings (Khattab and Zaharia, SIGIR) |
| monoT5 | 2020 | Cross-encoder | Sequence-to-sequence ranker on T5 (Nogueira et al., Findings of EMNLP) |
| RankGPT | 2023 | LLM listwise | Sliding-window listwise reranking with GPT-4 (Sun et al., EMNLP) |
From about 2018 onward, transformer-based language models took over the top of the leaderboard for most public IR tasks. Three architectural patterns dominate.
In a bi-encoder, also called a two-tower model, the query and the document are encoded independently into fixed-length vectors, and the score is the dot product or cosine similarity between them. Document vectors can be precomputed and indexed with an approximate nearest-neighbour structure such as HNSW or IVF, so retrieval scales to billions of items. Sentence-BERT (Reimers and Gurevych, 2019) and DPR, Dense Passage Retrieval (Karpukhin et al., EMNLP 2020), are the most cited examples. DPR was the work that established dense retrieval as competitive with BM25 on open-domain QA, beating Lucene-BM25 by 9 to 19 absolute points on top-20 retrieval accuracy.
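A minimal two-tower scoring sketch with the sentence-transformers library; the checkpoint name is one commonly used example and the documents and query are invented. Document vectors are encoded once and could be placed in an ANN index; at query time only the query is encoded and scored by dot product.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # example checkpoint

docs = ["Rainbows are caused by refraction of light in raindrops.",
        "The stock market closed higher today.",
        "Dispersion separates sunlight into colours."]
doc_vecs = encoder.encode(docs)                              # precomputable, can live in an ANN index
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)  # unit-normalise for cosine similarity

query_vec = encoder.encode(["why do rainbows form"])[0]
query_vec /= np.linalg.norm(query_vec)

scores = doc_vecs @ query_vec            # cosine similarity = dot product of unit vectors
top = np.argsort(-scores)                # indices of documents, best first
```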
Bi-encoders are fast but limited in expressiveness, since the model never sees the query and the document together.
A cross-encoder feeds the query and the document into the same transformer in a single forward pass and reads a relevance score off the [CLS] token or a similar pooled representation. The richer joint attention produces stronger relevance judgments at the cost of having to score every (query, document) pair from scratch at query time. Nogueira and Cho's monoBERT (arXiv:1901.04085, 2019) was the first widely cited application: a BERT cross-encoder fine-tuned on MS MARCO, used to rerank the top 1000 results from a BM25 first stage, lifted MRR@10 by 27 percent relative to the previous state of the art on the MS MARCO passage task. monoT5 (Nogueira, Jiang, Pradeep, and Lin, Findings of EMNLP 2020) replaced the encoder-only model with a T5 sequence-to-sequence model that generates target tokens such as "true" or "false" and reads the relevance probability off the logits.
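A reranking sketch using the CrossEncoder class from sentence-transformers; the checkpoint is one of the pretrained MS MARCO cross-encoders mentioned in the libraries table below, and the query and passages are invented. Each (query, passage) pair goes through the transformer in a joint forward pass, unlike a bi-encoder.

```python
import numpy as np
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example MS MARCO reranker

query = "what causes rainbows"
candidates = [
    "Rainbows are caused by refraction and dispersion of sunlight in water droplets.",
    "The film was released in June 2014.",
    "Rainbows appear opposite the sun after rain.",
]

# Score every (query, passage) pair jointly, then sort candidates by score.
scores = model.predict([(query, doc) for doc in candidates])
reranked = [candidates[i] for i in np.argsort(-scores)]
```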
ColBERT (Khattab and Zaharia, SIGIR 2020) sits between bi-encoders and cross-encoders. It encodes the query and the document independently with BERT into per-token embeddings, then computes a MaxSim score by, for every query token, taking the maximum dot product with any document token, and summing over query tokens. The interaction is therefore deferred ("late"), but it is still token level, capturing fine-grained matches that pure bi-encoders miss. The 2020 paper reported over 170 times speedup and four orders of magnitude fewer FLOPs per query than monoBERT for comparable quality. ColBERTv2 (Santhanam et al., NAACL 2022) added denoised supervision and residual compression, becoming a standard baseline on BEIR.
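The MaxSim operator itself is a max-then-sum over a token-similarity matrix; a NumPy sketch with random embeddings standing in for the per-token BERT outputs:

```python
import numpy as np

def maxsim_score(query_tok_embs: np.ndarray, doc_tok_embs: np.ndarray) -> float:
    """ColBERT-style late interaction: for every query token, take the maximum
    dot product with any document token, then sum over query tokens."""
    sim = query_tok_embs @ doc_tok_embs.T        # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128))      # 8 query tokens, 128-dim embeddings (placeholders for BERT outputs)
d = rng.normal(size=(150, 128))    # 150 document tokens
print(maxsim_score(q, d))
```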
In practice, modern systems combine several stages. A typical pipeline runs a fast first-stage retriever (BM25, a bi-encoder, or a sparse model such as SPLADE) over the entire corpus to fetch a few hundred candidates, then runs a slower but more accurate cross-encoder or LLM reranker on those candidates. This is the multi-stage pattern that Nogueira, Cho, and others have written about extensively, and it is the one that almost every production retrieval system uses today, including most RAG implementations.
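A toy version of the two-stage pattern, using the rank_bm25 package for the first stage and a trivial token-overlap function standing in for the cross-encoder or LLM reranker; the corpus, query, and stand-in scorer are invented for the example.

```python
from rank_bm25 import BM25Okapi

corpus = ["rainbows form when sunlight is refracted by raindrops",
          "the market closed higher on tuesday",
          "light dispersion splits white light into a spectrum"]
bm25 = BM25Okapi([doc.split() for doc in corpus])

query = "why do rainbows form"

# Stage 1: cheap lexical retrieval over the whole corpus fetches a candidate set.
candidates = bm25.get_top_n(query.split(), corpus, n=2)

# Stage 2: a slower, more accurate model scores only the candidates.
# Here a trivial stand-in; in practice this would be a cross-encoder or LLM reranker.
def rerank_score(q: str, d: str) -> float:
    return float(len(set(q.split()) & set(d.split())))

reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
```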
The most recent shift is to use large language models themselves as rankers. RankGPT (Sun et al., EMNLP 2023, Outstanding Paper Award) showed that GPT-4, prompted with a list of passages and asked to output a permutation, matches or beats supervised neural rerankers on TREC Deep Learning and several BEIR datasets. Because LLM context windows cannot fit hundreds of long passages, RankGPT uses a sliding window: the model reorders 20 passages at a time, then slides the window backward through the candidate list. RankLLM, RankZephyr, and others have built on the same approach with smaller open models, often distilling GPT-4 ranking behaviour into a fine-tuned 7B or smaller model.
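The windowing logic is independent of which model does the reordering. A sketch under the assumption of a hypothetical rerank_window(query, passages) callable that returns one window's passages in the model's preferred order; the 20/10 window and stride follow the pattern described above.

```python
def sliding_window_rerank(query, passages, rerank_window, window=20, stride=10):
    """Rerank a long candidate list with a model that can only order `window`
    passages at a time, sliding from the tail of the list toward the head so
    strong passages can move up across overlapping windows."""
    passages = list(passages)
    end = len(passages)
    while True:
        start = max(0, end - window)
        # rerank_window is a hypothetical LLM call returning the slice reordered.
        passages[start:end] = rerank_window(query, passages[start:end])
        if start == 0:
            break
        end -= stride
    return passages
```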
LLM rankers have closed much of the gap between zero-shot and supervised retrieval, particularly on out-of-domain tasks, but they are expensive: a full BEIR-scale evaluation with GPT-4 in the loop runs into thousands of dollars in API calls, and latency per query is far higher than for a fine-tuned cross-encoder.
No discussion of ranking is complete without BM25. The Okapi BM25 ranking function, developed by Robertson, Walker, and colleagues for the Okapi system in the mid-1990s, is the lexical baseline against which every neural model is measured. BM25 scores a query-document pair using term frequency, inverse document frequency, and a length normalisation that is tuned by two free parameters k1 and b. Despite its age, BM25 remains a very strong baseline, especially in zero-shot and out-of-domain settings: the BEIR benchmark (Thakur et al., NeurIPS 2021 Datasets and Benchmarks) found that BM25 outperformed many dense retrieval systems on tasks where the supervised models had not been trained.
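A sketch of one common BM25 variant; the function and argument names are this sketch's own, and the IDF form follows a widely used Robertson-style formulation.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Score one document against one query with a common BM25 variant.

    query_terms: list of query tokens
    doc_terms:   list of document tokens
    doc_freq:    dict mapping term -> number of documents containing the term
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # Term-frequency saturation (k1) and document-length normalisation (b).
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm
    return score
```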
Variants such as BM25+RM3 (pseudo-relevance feedback using Lavrenko and Croft's relevance models) and BM25F (multi-field BM25) extend the basic formula. SPLADE (Formal et al., 2021) combines lexical match with learned sparse expansions, and is widely used as a first-stage retriever alongside or in place of BM25.
Ranking evaluation is its own subfield, and the choice of metric strongly influences which algorithm wins. The most common metrics are listed below.
| Metric | Definition | Strengths | Weaknesses |
|---|---|---|---|
| Precision@k | Fraction of top-k that are relevant | Easy to interpret, position-agnostic within top-k | Ignores order within k and items below k |
| Recall@k | Fraction of all relevant items captured in top-k | Critical for first-stage retrieval | Ignores order |
| MAP (Mean Average Precision) | Mean over queries of average precision (mean of P@k at each relevant rank) | Combines precision and recall | Assumes binary relevance |
| MRR (Mean Reciprocal Rank) | Mean over queries of 1 / rank of first relevant item | Natural for QA and known-item search | Ignores anything past the first relevant item |
| NDCG@k | Discounted gain over top-k, normalised by ideal DCG | Supports graded relevance, position discount | Sensitive to choice of gain function and discount |
| ERR (Expected Reciprocal Rank) | Cascade-model based, captures user satisfaction with graded relevance | Better correlation with click data than DCG | Less common, more complex |
| Win rate / preference rate | Fraction of queries on which model A's ranking is preferred to model B's | Used for LLM rerankers and human evaluation | Pairwise, not absolute |
NDCG is the dominant metric for graded relevance and was introduced by Kalervo Järvelin and Jaana Kekäläinen in Cumulated Gain-based Evaluation of IR Techniques (ACM TOIS, 2002). The discount function (typically log2(rank+1)) and the gain function (typically 2^rel - 1) together credit a model for placing highly relevant documents near the top, with diminishing returns further down. ERR (Chapelle, Metzler, Zhang, and Grinspan, CIKM 2009) extends MRR to graded relevance and ties the metric to a cascade user model where the user keeps reading until satisfied. Empirically, ERR correlates better with click data on commercial search engines than NDCG.
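For concreteness, a small NDCG@k implementation using those gain and discount choices; the example labels are invented.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with gain 2^rel - 1 and discount log2(rank + 1),
    where ranks are 1-based (enumerate is 0-based, hence rank + 2)."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances_in_ranked_order, k):
    """NDCG@k: DCG of the produced ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances_in_ranked_order, reverse=True), k)
    return dcg_at_k(relevances_in_ranked_order, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded labels of the top 5 returned documents, in the order the model ranked them.
print(ndcg_at_k([3, 2, 3, 0, 1], k=5))
```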
In practice, papers report several metrics together. The MS MARCO leaderboards use MRR@10 for the passage task and MRR@100 for the document task; TREC Deep Learning reports NDCG@10 and MAP.
Learning to rank progressed roughly in step with public datasets. A few have been particularly influential.
| Dataset | Year | Domain | Notes |
|---|---|---|---|
| LETOR (3.0, 4.0) | 2007-2009 | Web search | Microsoft Research collection of pre-extracted features over OHSUMED, TREC, and others; first standard LTR benchmark |
| Yahoo! Learning to Rank Challenge | 2010 | Web search | 700 features per (query, document); LambdaMART ensembles won track 1 |
| Microsoft MSLR-WEB10K / 30K | 2010 | Web search | 136 features per (query, document); released by Microsoft around the same time as the Yahoo challenge |
| MS MARCO | 2016 | Passage and document retrieval | 1,010,916 anonymised Bing queries with human judgments and 8.8M passages (Bajaj et al.) |
| TREC Deep Learning Track | 2019- | Passage and document | Annual TREC track using MS MARCO data; the standard arena for neural ranking |
| Natural Questions, TriviaQA | 2017-2019 | Open-domain QA | Used to benchmark retrievers and rerankers in QA settings |
| BEIR | 2021 | 18 IR tasks across domains | Heterogeneous zero-shot benchmark (Thakur et al., NeurIPS); measures generalisation |
| MIRACL, mMARCO | 2022 | Multilingual | Cross-lingual ranking, often used to benchmark multilingual encoders |
BEIR has become the most cited zero-shot benchmark since its 2021 release. Its central finding reframed how the community evaluates new methods: BM25 was a stronger zero-shot baseline than many supervised dense retrievers, while re-rankers and late-interaction models won on average across tasks, but at a high computational cost.
Ranking models are implemented in many libraries; a few cover the bulk of production use cases.
| Library | Type | Notes |
|---|---|---|
| LightGBM | Gradient boosted trees | objective='lambdarank' implements LambdaMART; the default ranker for many production systems |
| XGBoost | Gradient boosted trees | rank:pairwise, rank:ndcg, rank:map; rank:ndcg is also LambdaMART-based |
| allRank | PyTorch listwise | Open-source library from Allegro for ListNet, LambdaRank, ApproxNDCG, NeuralNDCG |
| TF-Ranking | TensorFlow | Google's TensorFlow Ranking library, supports pointwise, pairwise, listwise losses |
| sentence-transformers | Bi- and cross-encoders | Reimers' library; ships pretrained MS MARCO cross-encoders such as ms-marco-MiniLM and ms-marco-Electra |
| Pyserini | Retrieval toolkit | Python interface to Anserini (Lucene/BM25) and to dense retrievers; commonly used in TREC submissions |
| ColBERT, RAGatouille | Late-interaction | Original ColBERT and ColBERTv2 reference implementations and a high-level wrapper |
| LangChain, LlamaIndex | RAG orchestration | Wrap retrievers (BM25, FAISS, Pinecone, Weaviate, Qdrant, Chroma) and rerankers (Cohere Rerank, Voyage, BGE) for LLM applications |
At production scale, custom systems still dominate. Google, Bing, Baidu, Amazon, and Meta all run proprietary multi-stage stacks combining lexical retrieval, dense retrieval, and learned rerankers, with feature engineering for personalisation, freshness, and business signals layered on top.
Ranking shows up in nearly every product category that involves retrieving items in response to a request.
Web and product search. Google, Bing, Baidu, Amazon, and eBay all rank documents or products against user queries. Bing in particular ran LambdaMART for years, and Google's BERT-based neural ranking has been documented in several blog posts since 2019.
Recommendation systems. YouTube, TikTok, Netflix, and Spotify rank items against an implicit query that combines user history, context, and engagement signals. The two-tower architecture is widely used at this scale because the item tower can be precomputed and the user tower can be served online.
Question answering and open-domain QA. Retrieval-then-read pipelines such as those built on DPR or ColBERT rank passages against a question, then feed the top passages to a reader model that extracts or generates an answer.
RAG for LLMs. Modern LLM applications almost always include a retrieval step, and the quality of that step depends on a combination of first-stage retriever and reranker. As LLMs get better at handling long contexts, the importance of returning the right ten passages (rather than the right hundred) has only grown.
Ads and sponsored search. Click-through rate prediction for advertising is a closely related ranking problem, with its own literature on click models, exploration-exploitation, and counterfactual evaluation.
Code search and code review. GitHub's code search and various enterprise code search products rank code snippets against natural-language queries.
Drug and protein discovery. Virtual screening of compound libraries against a target is increasingly framed as a ranking problem with learned scoring functions.
Ranking shares the limitations of any supervised system, plus a few of its own.
Training labels are expensive. Editorial relevance judgments require trained assessors, and even then they cover only a tiny fraction of (query, document) pairs. Most production systems supplement editorial labels with click data or with synthetic labels generated by a stronger model, but both come with biases.
Click data has heavy biases. Position bias means that users click on the top result more often regardless of relevance. Presentation bias, selection bias, and trust bias all distort the signal. Counterfactual learning to rank (Joachims and colleagues, 2017 onward) tries to correct for these by treating click logs as a kind of bandit feedback and applying inverse-propensity weighting.
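The core of inverse-propensity weighting is a reweighting of observed clicks by the estimated probability that the position was examined at all. A toy sketch with made-up propensities; in practice propensities are estimated from randomisation experiments or intervention harvesting, not assumed.

```python
import numpy as np

positions = np.array([1, 2, 3, 5, 8])     # positions at which the items were shown
clicks    = np.array([1, 0, 1, 0, 1])     # observed clicks on those items
propensity = 1.0 / positions              # toy examination model: P(examined) falls with position

# Inverse-propensity-weighted labels: clicks at rarely examined positions count for more,
# giving an unbiased estimate of the loss under the assumed position-bias model.
ips_labels = clicks / propensity
```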
Offline metrics do not always match online performance. A model that wins on NDCG@10 on a held-out test set can lose on click-through rate, dwell time, or revenue when deployed. The gap is particularly acute for personalisation, freshness, and exploration.
Out-of-domain generalisation. BEIR demonstrated that supervised dense retrievers often fail to transfer to new domains, while BM25 holds up surprisingly well. The community has responded with multi-domain training, hard-negative mining, and synthetic question generation, but generalisation remains an active research area.
Fairness and exposure. Ranking decisions distribute attention, and uneven exposure can amplify inequalities, whether between job applicants, news outlets, or sellers on a marketplace. Fair ranking research (Singh and Joachims, 2018; Biega and colleagues, 2018) extends the classical learning-to-rank framework with fairness constraints.
LLM rerankers are slow and expensive. RankGPT and its successors deliver strong results but cost real money per query and add seconds of latency. Distillation into smaller open models is the dominant response.
Imagine you ask a librarian for books about dinosaurs. The librarian could just hand you any book that has the word "dinosaur" on the cover. That works, but you want the best book first, the second best book second, and so on. Ranking in machine learning teaches a computer to be that careful librarian. The computer looks at the question, looks at every book it could give you, and tries to figure out the right order to put them in. It learns by practising on lots of past questions where someone has already marked which books were the most useful. The tricky part is that the computer is not graded on whether it picked good books in general, only on whether it put them in the right order. There are a lot of different recipes for teaching this, with names like LambdaMART and ColBERT, and most modern search engines and recommendation apps use some mix of them.