See also: Information retrieval, Re-ranking, Recommendation system
Ranking in machine learning, often called learning to rank (LTR), is a class of supervised problems whose goal is to produce an ordered list of items rather than to predict a single value for each item independently. Given a query and a candidate set of documents, products, passages, or recommendations, a ranking model assigns scores so that the most relevant items appear first. The quality of the model is judged on the order it produces, not on the absolute scores it assigns. Ranking has been the workhorse of web search engines since the mid-2000s, sits at the heart of modern recommendation systems, and has become central again in the era of large language models where retrieval and re-ranking determine the quality of grounded answers in retrieval-augmented generation.
The field crystallised in the late 2000s. Microsoft researcher Tie-Yan Liu's 2009 monograph Learning to Rank for Information Retrieval organised dozens of competing methods into the now-standard pointwise, pairwise, and listwise framework, and Chris Burges and his colleagues at Microsoft Research produced the algorithmic line that ran from RankNet (2005) through LambdaRank (2006) to LambdaMART (2010), the model that powered Bing's web ranking for years. Since 2019 the field has shifted again, first to BERT-based rankers such as the monoBERT cross-encoder and the late-interaction model ColBERT, and then to listwise rerankers built on LLMs themselves.
A ranking problem is defined by a set of queries, a set of candidate items per query, and relevance labels (often graded, on a scale such as 0 through 4) for some query-item pairs. The model learns a scoring function f(q, d) that takes a query q and an item d and returns a real number. At inference time, the candidates for a new query are scored independently and sorted by score. Two features distinguish ranking from standard regression or classification. First, only the relative order of the scores within a query matters; the absolute values are never evaluated. Second, the quantities that are evaluated, metrics such as NDCG and MAP discussed below, are computed over the whole ranked list for a query and change only when two items swap positions, so they are non-smooth in the scores and cannot be optimised directly by gradient methods.
These two facts shape the algorithms. Pointwise methods sidestep both issues by reducing ranking to regression or classification on individual items. Pairwise and listwise methods confront them, either by using surrogate losses on pairs of items or by approximating list-level metrics with smooth proxies.
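As a minimal illustration of the inference step, the sketch below scores candidates with an invented toy scoring function and sorts them; the documents, query, and functions are made up for the example. Any strictly increasing transformation of the scores produces the same ranking, which is why only the order, not the absolute values, is judged.

```python
def rank(query, candidates, score_fn):
    """Score each candidate independently with f(q, d), then sort by score."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    return [doc for _, doc in sorted(scored, key=lambda pair: -pair[0])]

docs = ["ab", "bd", "xyz"]
f = lambda q, d: len(set(q) & set(d))                # toy scoring function
g = lambda q, d: 10 * f(q, d) + 7                    # strictly increasing transform of f
assert rank("abc", docs, f) == rank("abc", docs, g)  # same order, so same metric value
```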
Liu's 2009 survey is the standard reference for the way LTR algorithms are grouped. The three families differ in what each training example looks like and what loss function is minimised.
| Approach | Training instance | Typical loss | Examples | Key trade-off |
|---|---|---|---|---|
| Pointwise | (query, item, relevance) | Squared error, cross-entropy, ordinal regression loss | Linear regression, decision trees, McRank, Subset Ranking | Simple to implement, but ignores relative order between items for the same query |
| Pairwise | (query, item_i, item_j, label indicating which is more relevant) | Logistic / cross-entropy on pairs, hinge | RankNet, RankSVM, RankBoost, GBRank | Models relative order directly; many methods do not directly optimise list metrics |
| Listwise | (query, full ranked list) | Listwise cross-entropy, surrogate of NDCG / MAP | ListNet, ListMLE, AdaRank, LambdaMART, SoftRank | Closest to evaluation metric, but harder to optimise and more expensive |
Pointwise methods predict a score or label per item independently. Training looks identical to ordinary regression or classification: features describe a single (query, item) pair, the label is the graded relevance, and the loss is computed per example. Linear regression, gradient-boosted trees, and ordinal regression all fit here. Cossock and Zhang's 2008 Subset Ranking paper and Li, Burges, and Wu's 2007 McRank are commonly cited representatives. The method is easy to implement and scales well, but it throws away an important signal: a perfect score on item A is useless if the model also gives an equally high score to a less relevant item B for the same query. Pointwise methods often work well as components inside larger pipelines, especially as the first stage of a multi-stage system.
Pairwise methods reformulate ranking as binary classification on pairs of items from the same query. For every pair (item_i, item_j) where i is more relevant than j, the model learns to give item_i a higher score. Cross-entropy on the score difference, hinge loss, or AdaBoost-style weighted updates are typical choices. The pairwise family is the largest of the three. RankSVM (Joachims, 2002) adapted support vector machines to the pairwise loss. RankBoost (Freund and colleagues, 2003) used boosting. RankNet (Burges et al., 2005, ICML) introduced a neural network trained with a differentiable cross-entropy loss on pairs, and is the ancestor, via LambdaRank and LambdaMART, of most modern boosted-tree ranking systems. GBRank (Zheng et al., 2007) was an early application of gradient boosted trees to pairs.
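A compact sketch of the RankNet pairwise loss, written here in PyTorch (the function name and the use of graded labels to form pairs are choices of this sketch, not the paper's exact formulation): for every pair within a query where item i is labelled more relevant than item j, the loss is the cross-entropy between the predicted probability sigmoid(sigma * (s_i - s_j)) and a target of 1.

```python
import torch

def ranknet_loss(scores: torch.Tensor, relevance: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Pairwise RankNet loss for one query.

    scores:    (n,) model scores for the query's candidate items
    relevance: (n,) graded relevance labels for the same items
    """
    diff = scores.unsqueeze(1) - scores.unsqueeze(0)                  # (n, n) score differences s_i - s_j
    more_relevant = relevance.unsqueeze(1) > relevance.unsqueeze(0)   # pairs where item i beats item j
    if not more_relevant.any():
        return scores.new_zeros(())
    # -log sigmoid(sigma * (s_i - s_j)) == softplus(-sigma * (s_i - s_j))
    pair_losses = torch.nn.functional.softplus(-sigma * diff[more_relevant])
    return pair_losses.mean()
```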
Pairwise loss is more aligned with ranking than pointwise loss but still does not match the metrics evaluators care about. A model can win nearly all pairs and still place a single highly relevant document at position 50 instead of position 1, which destroys NDCG@10.
Listwise methods take the entire ranked list as the unit of training. There are two sub-families. The first defines a smooth loss directly on the list, often the cross-entropy between two probability distributions over permutations: ListNet (Cao et al., ICML 2007) is the canonical example, with ListMLE (Xia et al., 2008) following. The second works directly on the metric of interest by approximating its non-smooth behaviour. SoftRank smooths the rank positions with a Gaussian. AdaRank (Xu and Li, 2007) adapts AdaBoost to optimise IR measures. LambdaRank (Burges, Ragno, and Le, NeurIPS 2006) sidesteps the problem entirely by writing down the gradient of the loss directly, multiplying the pairwise RankNet gradient by the change in NDCG that swapping the two items would cause. LambdaMART (Burges, 2010) plugs that gradient into MART (Multiple Additive Regression Trees), giving a gradient-boosted tree ensemble that optimises NDCG without ever computing the loss itself.
LambdaMART went on to dominate competitive learning-to-rank: an ensemble of LambdaMART rankers won track 1 of the 2010 Yahoo! Learning to Rank Challenge, and it powered Bing's web ranking for several years. Burges' 2010 Microsoft Research technical report From RankNet to LambdaRank to LambdaMART: An Overview is the standard reference for the lineage.
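Because LightGBM's ranking objective implements LambdaMART, a minimal training sketch looks like ordinary gradient boosting plus a group array stating how many candidates belong to each query. The data below is random and purely illustrative; hyperparameters are arbitrary.

```python
import numpy as np
import lightgbm as lgb

# Toy data: 100 queries, 10 candidates each, 20 features, graded labels 0-4.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 5, size=1000)
group = [10] * 100                 # number of candidates per query, in order

ranker = lgb.LGBMRanker(
    objective="lambdarank",        # LambdaMART: boosted trees trained with LambdaRank gradients
    n_estimators=200,
    learning_rate=0.05,
)
ranker.fit(X, y, group=group)

# At query time, score one query's candidates and sort descending.
scores = ranker.predict(X[:10])
order = np.argsort(-scores)
```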
A short tour of the most influential methods, in roughly chronological order.
| Algorithm | Year | Family | Notes |
|---|---|---|---|
| RankSVM | 2002 | Pairwise | SVM on pairwise differences (Joachims) |
| RankBoost | 2003 | Pairwise | AdaBoost adapted to pairs |
| RankNet | 2005 | Pairwise | Neural net with cross-entropy on pairs (Burges et al., ICML) |
| LambdaRank | 2006 | Listwise (gradient) | RankNet gradient scaled by delta-NDCG (Burges, Ragno, Le, NeurIPS) |
| McRank | 2007 | Pointwise | Multi-class classification on graded relevance |
| GBRank | 2007 | Pairwise | Gradient boosted trees on pairs |
| AdaRank | 2007 | Listwise | AdaBoost optimising IR measures (Xu, Li) |
| ListNet | 2007 | Listwise | Cross-entropy between softmax-normalised score distributions (Cao et al., ICML) |
| LambdaMART | 2010 | Listwise (gradient) | MART + LambdaRank gradients (Burges); won Yahoo LTR Challenge 2010 |
| monoBERT | 2019 | Cross-encoder | BERT cross-encoder fine-tuned on MS MARCO (Nogueira and Cho) |
| DPR | 2020 | Bi-encoder | Two BERT towers for open-domain QA (Karpukhin et al., EMNLP) |
| ColBERT | 2020 | Late interaction | Per-token MaxSim over BERT embeddings (Khattab and Zaharia, SIGIR) |
| monoT5 | 2020 | Cross-encoder | Sequence-to-sequence ranker on T5 (Nogueira et al., Findings of EMNLP) |
| RankGPT | 2023 | LLM listwise | Sliding-window listwise reranking with GPT-4 (Sun et al., EMNLP) |
From about 2018 onward, transformer-based language models took over the top of the leaderboard for most public IR tasks. Three architectural patterns dominate.
In a bi-encoder, also called a two-tower model, the query and the document are encoded independently into fixed-length vectors, and the score is the dot product or cosine similarity between them. Document vectors can be precomputed and indexed with an approximate nearest-neighbour structure such as HNSW or IVF, so retrieval scales to billions of items. Sentence-BERT (Reimers and Gurevych, 2019) and DPR, Dense Passage Retrieval (Karpukhin et al., EMNLP 2020), are the most cited examples. DPR was the work that established dense retrieval as competitive with BM25 on open-domain QA, beating Lucene-BM25 by 9 to 19 absolute points on top-20 retrieval accuracy.
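A minimal two-tower scoring sketch with the sentence-transformers library; the checkpoint name is one commonly used example and the documents and query are invented. Document vectors are encoded once and could be placed in an ANN index; at query time only the query is encoded and scored by dot product.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # example checkpoint

docs = ["Rainbows are caused by refraction of light in raindrops.",
        "The stock market closed higher today.",
        "Dispersion separates sunlight into colours."]
doc_vecs = encoder.encode(docs)                              # precomputable, can live in an ANN index
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)  # unit-normalise for cosine similarity

query_vec = encoder.encode(["why do rainbows form"])[0]
query_vec /= np.linalg.norm(query_vec)

scores = doc_vecs @ query_vec            # cosine similarity = dot product of unit vectors
top = np.argsort(-scores)                # indices of documents, best first
```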
Bi-encoders are fast but limited in expressiveness, since the model never sees the query and the document together.
A cross-encoder feeds the query and the document into the same transformer in a single forward pass and reads a relevance score off the [CLS] token or a similar pooled representation. The richer joint attention produces stronger relevance judgments at the cost of having to score every (query, document) pair from scratch at query time. Nogueira and Cho's monoBERT (arXiv:1901.04085, 2019) was the first widely cited application: a BERT cross-encoder fine-tuned on MS MARCO, used to rerank the top 1000 results from a BM25 first stage, lifted MRR@10 by 27 percent relative to the previous state of the art on the MS MARCO passage task. monoT5 (Nogueira, Jiang, Pradeep, and Lin, Findings of EMNLP 2020) replaced the encoder-only model with a T5 sequence-to-sequence model that generates target tokens such as "true" or "false" and reads the relevance probability off the logits.
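A reranking sketch using the CrossEncoder class from sentence-transformers; the checkpoint is one of the pretrained MS MARCO cross-encoders mentioned in the libraries table below, and the query and passages are invented. Each (query, passage) pair goes through the transformer in a joint forward pass, unlike a bi-encoder.

```python
import numpy as np
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example MS MARCO reranker

query = "what causes rainbows"
candidates = [
    "Rainbows are caused by refraction and dispersion of sunlight in water droplets.",
    "The film was released in June 2014.",
    "Rainbows appear opposite the sun after rain.",
]

# Score every (query, passage) pair jointly, then sort candidates by score.
scores = model.predict([(query, doc) for doc in candidates])
reranked = [candidates[i] for i in np.argsort(-scores)]
```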
ColBERT (Khattab and Zaharia, SIGIR 2020) sits between bi-encoders and cross-encoders. It encodes the query and the document independently with BERT into per-token embeddings, then computes a MaxSim score by, for every query token, taking the maximum dot product with any document token, and summing over query tokens. The interaction is therefore deferred ("late"), but it is still token level, capturing fine-grained matches that pure bi-encoders miss. The 2020 paper reported over 170 times speedup and four orders of magnitude fewer FLOPs per query than monoBERT for comparable quality. ColBERTv2 (Santhanam et al., NAACL 2022) added denoised supervision and residual compression, becoming a standard baseline on BEIR.
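The MaxSim operator itself is a max-then-sum over a token-similarity matrix; a NumPy sketch with random embeddings standing in for the per-token BERT outputs:

```python
import numpy as np

def maxsim_score(query_tok_embs: np.ndarray, doc_tok_embs: np.ndarray) -> float:
    """ColBERT-style late interaction: for every query token, take the maximum
    dot product with any document token, then sum over query tokens."""
    sim = query_tok_embs @ doc_tok_embs.T        # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 128))      # 8 query tokens, 128-dim embeddings (placeholders for BERT outputs)
d = rng.normal(size=(150, 128))    # 150 document tokens
print(maxsim_score(q, d))
```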
In practice, modern systems combine several stages. A typical pipeline runs a fast first-stage retriever (BM25, a bi-encoder, or a sparse model such as SPLADE) over the entire corpus to fetch a few hundred candidates, then runs a slower but more accurate cross-encoder or LLM reranker on those candidates. This is the multi-stage pattern that Nogueira, Cho, and others have written about extensively, and it is the one that almost every production retrieval system uses today, including most RAG implementations.
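A toy version of the two-stage pattern, using the rank_bm25 package for the first stage and a trivial token-overlap function standing in for the cross-encoder or LLM reranker; the corpus, query, and stand-in scorer are invented for the example.

```python
from rank_bm25 import BM25Okapi

corpus = ["rainbows form when sunlight is refracted by raindrops",
          "the market closed higher on tuesday",
          "light dispersion splits white light into a spectrum"]
bm25 = BM25Okapi([doc.split() for doc in corpus])

query = "why do rainbows form"

# Stage 1: cheap lexical retrieval over the whole corpus fetches a candidate set.
candidates = bm25.get_top_n(query.split(), corpus, n=2)

# Stage 2: a slower, more accurate model scores only the candidates.
# Here a trivial stand-in; in practice this would be a cross-encoder or LLM reranker.
def rerank_score(q: str, d: str) -> float:
    return float(len(set(q.split()) & set(d.split())))

reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
```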
The most recent shift is to use large language models themselves as rankers. RankGPT (Sun et al., EMNLP 2023, Outstanding Paper Award) showed that GPT-4, prompted with a list of passages and asked to output a permutation, matches or beats supervised neural rerankers on TREC Deep Learning and several BEIR datasets. Because LLM context windows cannot fit hundreds of long passages, RankGPT uses a sliding window: the model reorders 20 passages at a time, then slides the window backward through the candidate list. RankLLM, RankZephyr, and others have built on the same approach with smaller open models, often distilling GPT-4 ranking behaviour into a fine-tuned 7B or smaller model.
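The windowing logic is independent of which model does the reordering. A sketch under the assumption of a hypothetical rerank_window(query, passages) callable that returns one window's passages in the model's preferred order; the 20/10 window and stride follow the pattern described above.

```python
def sliding_window_rerank(query, passages, rerank_window, window=20, stride=10):
    """Rerank a long candidate list with a model that can only order `window`
    passages at a time, sliding from the tail of the list toward the head so
    strong passages can move up across overlapping windows."""
    passages = list(passages)
    end = len(passages)
    while True:
        start = max(0, end - window)
        # rerank_window is a hypothetical LLM call returning the slice reordered.
        passages[start:end] = rerank_window(query, passages[start:end])
        if start == 0:
            break
        end -= stride
    return passages
```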
LLM rankers have closed much of the gap between zero-shot and supervised retrieval, particularly on out-of-domain tasks, but they are expensive: a full BEIR-scale evaluation with GPT-4 in the loop runs into thousands of dollars in API calls, and latency per query is far higher than for a fine-tuned cross-encoder.
No discussion of ranking is complete without BM25. The Okapi BM25 ranking function, developed by Robertson, Walker, and colleagues for the Okapi system in the mid-1990s, is the lexical baseline against which every neural model is measured. BM25 scores a query-document pair using term frequency, inverse document frequency, and a length normalisation that is tuned by two free parameters k1 and b. Despite its age, BM25 remains a very strong baseline, especially in zero-shot and out-of-domain settings: the BEIR benchmark (Thakur et al., NeurIPS 2021 Datasets and Benchmarks) found that BM25 outperformed many dense retrieval systems on tasks where the supervised models had not been trained.
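A sketch of one common BM25 variant; the function and argument names are this sketch's own, and the IDF form follows a widely used Robertson-style formulation.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len, k1=1.2, b=0.75):
    """Score one document against one query with a common BM25 variant.

    query_terms: list of query tokens
    doc_terms:   list of document tokens
    doc_freq:    dict mapping term -> number of documents containing the term
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # Term-frequency saturation (k1) and document-length normalisation (b).
        norm = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm
    return score
```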
Variants such as BM25+RM3 (pseudo-relevance feedback using Lavrenko and Croft's relevance models) and BM25F (multi-field BM25) extend the basic formula. SPLADE (Formal et al., 2021) combines lexical match with learned sparse expansions, and is widely used as a first-stage retriever alongside or in place of BM25.
Ranking evaluation is its own subfield, and the choice of metric strongly influences which algorithm wins. The most common metrics are listed below.
| Metric | Definition | Strengths | Weaknesses |
|---|---|---|---|
| Precision@k | Fraction of top-k that are relevant | Easy to interpret, position-agnostic within top-k | Ignores order within k and items below k |
| Recall@k | Fraction of all relevant items captured in top-k | Critical for first-stage retrieval | Ignores order |
| MAP (Mean Average Precision) | Mean over queries of average precision (mean of P@k at each relevant rank) | Combines precision and recall | Assumes binary relevance |
| MRR (Mean Reciprocal Rank) | Mean over queries of 1 / rank of first relevant item | Natural for QA and known-item search | Ignores anything past the first relevant item |
| NDCG@k | Discounted gain over top-k, normalised by ideal DCG | Supports graded relevance, position discount | Sensitive to choice of gain function and discount |
| ERR (Expected Reciprocal Rank) | Cascade-model based, captures user satisfaction with graded relevance | Better correlation with click data than DCG | Less common, more complex |
| Win rate / preference rate | Fraction of queries on which model A's ranking is preferred to model B's | Used for LLM rerankers and human evaluation | Pairwise, not absolute |
NDCG is the dominant metric for graded relevance and was introduced by Kalervo Järvelin and Jaana Kekäläinen in Cumulated Gain-based Evaluation of IR Techniques (ACM TOIS, 2002). The discount function (typically log2(rank+1)) and the gain function (typically 2^rel - 1) together credit a model for placing highly relevant documents near the top, with diminishing returns further down. ERR (Chapelle, Metzler, Zhang, and Grinspan, CIKM 2009) extends MRR to graded relevance and ties the metric to a cascade user model where the user keeps reading until satisfied. Empirically, ERR correlates better with click data on commercial search engines than NDCG.
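For concreteness, a small NDCG@k implementation using those gain and discount choices; the example labels are invented.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with gain 2^rel - 1 and discount log2(rank + 1),
    where ranks are 1-based (enumerate is 0-based, hence rank + 2)."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances_in_ranked_order, k):
    """NDCG@k: DCG of the produced ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances_in_ranked_order, reverse=True), k)
    return dcg_at_k(relevances_in_ranked_order, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded labels of the top 5 returned documents, in the order the model ranked them.
print(ndcg_at_k([3, 2, 3, 0, 1], k=5))
```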
In practice, papers report several metrics together. The MS MARCO leaderboards use MRR@10 for the passage task and MRR@100 for the document task; TREC Deep Learning reports NDCG@10 and MAP.
Learning to rank progressed roughly in step with public datasets. A few have been particularly influential.
| Dataset | Year | Domain | Notes |
|---|---|---|---|
| LETOR (3.0, 4.0) | 2007-2009 | Web search | Microsoft Research collection of pre-extracted features over OHSUMED, TREC, and others; first standard LTR benchmark |
| Yahoo! Learning to Rank Challenge | 2010 | Web search | 700 features per (query, document); LambdaMART ensembles won track 1 |
| Microsoft MSLR-WEB10K / 30K | 2010 | Web search | 136 features per (query, document); released by Microsoft around the same time as the Yahoo challenge |
| MS MARCO | 2016 | Passage and document retrieval | 1,010,916 anonymised Bing queries with human judgments and 8.8M passages (Bajaj et al.) |
| TREC Deep Learning Track | 2019- | Passage and document | Annual TREC track using MS MARCO data; the standard arena for neural ranking |
| Natural Questions, TriviaQA | 2017-2019 | Open-domain QA | Used to benchmark retrievers and rerankers in QA settings |
| BEIR | 2021 | 18 IR tasks across domains | Heterogeneous zero-shot benchmark (Thakur et al., NeurIPS); measures generalisation |
| MIRACL, mMARCO | 2022 | Multilingual | Cross-lingual ranking, often used to benchmark multilingual encoders |
BEIR has become the most cited zero-shot benchmark since its 2021 release. Its central finding reframed how the community evaluates new methods: BM25 was a stronger zero-shot baseline than many supervised dense retrievers, while re-rankers and late-interaction models won on average across tasks, but at a high computational cost.
Ranking models are implemented in many libraries; a few cover the bulk of production use cases.
| Library | Type | Notes |
|---|---|---|
| LightGBM | Gradient boosted trees | objective='lambdarank' implements LambdaMART; the default ranker for many production systems |
| XGBoost | Gradient boosted trees | rank:pairwise, rank:ndcg, rank:map; rank:ndcg is also LambdaMART-based |
| allRank | PyTorch listwise | Open-source library from Allegro for ListNet, LambdaRank, ApproxNDCG, NeuralNDCG |
| TF-Ranking | TensorFlow | Google's TensorFlow Ranking library, supports pointwise, pairwise, listwise losses |
| sentence-transformers | Bi- and cross-encoders | Reimers' library; ships pretrained MS MARCO cross-encoders such as ms-marco-MiniLM and ms-marco-Electra |
| Pyserini | Retrieval toolkit | Python interface to Anserini (Lucene/BM25) and to dense retrievers; commonly used in TREC submissions |
| ColBERT, RAGatouille | Late-interaction | Original ColBERT and ColBERTv2 reference implementations and a high-level wrapper |
| LangChain, LlamaIndex | RAG orchestration | Wrap retrievers (BM25, FAISS, Pinecone, Weaviate, Qdrant, Chroma) and rerankers (Cohere Rerank, Voyage, BGE) for LLM applications |
At production scale, custom systems still dominate. Google, Bing, Baidu, Amazon, and Meta all run proprietary multi-stage stacks combining lexical retrieval, dense retrieval, and learned rerankers, with feature engineering for personalisation, freshness, and business signals layered on top.
Ranking shows up in nearly every product category that involves retrieving items in response to a request.
Web and product search. Google, Bing, Baidu, Amazon, and eBay all rank documents or products against user queries. Bing in particular ran LambdaMART for years, and Google's BERT-based neural ranking has been documented in several blog posts since 2019.
Recommendation systems. YouTube, TikTok, Netflix, and Spotify rank items against an implicit query that combines user history, context, and engagement signals. The two-tower architecture is widely used at this scale because the item tower can be precomputed and the user tower can be served online.
Question answering and open-domain QA. Retrieval-then-read pipelines such as those built on DPR or ColBERT rank passages against a question, then feed the top passages to a reader model that extracts or generates an answer.
RAG for LLMs. Modern LLM applications almost always include a retrieval step, and the quality of that step depends on a combination of first-stage retriever and reranker. As LLMs get better at handling long contexts, the importance of returning the right ten passages (rather than the right hundred) has only grown.
Ads and sponsored search. Click-through rate prediction for advertising is a closely related ranking problem, with its own literature on click models, exploration-exploitation, and counterfactual evaluation.
Code search and code review. GitHub's code search and various enterprise code search products rank code snippets against natural-language queries.
Drug and protein discovery. Virtual screening of compound libraries against a target is increasingly framed as a ranking problem with learned scoring functions.
Ranking shares the limitations of any supervised system, plus a few of its own.
Training labels are expensive. Editorial relevance judgments require trained assessors, and even then they cover only a tiny fraction of (query, document) pairs. Most production systems supplement editorial labels with click data or with synthetic labels generated by a stronger model, but both come with biases.
Click data has heavy biases. Position bias means that users click on the top result more often regardless of relevance. Presentation bias, selection bias, and trust bias all distort the signal. Counterfactual learning to rank (Joachims and colleagues, 2017 onward) tries to correct for these by treating click logs as a kind of bandit feedback and applying inverse-propensity weighting.
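The core of inverse-propensity weighting is a reweighting of observed clicks by the estimated probability that the position was examined at all. A toy sketch with made-up propensities; in practice propensities are estimated from randomisation experiments or intervention harvesting, not assumed.

```python
import numpy as np

positions = np.array([1, 2, 3, 5, 8])     # positions at which the items were shown
clicks    = np.array([1, 0, 1, 0, 1])     # observed clicks on those items
propensity = 1.0 / positions              # toy examination model: P(examined) falls with position

# Inverse-propensity-weighted labels: clicks at rarely examined positions count for more,
# giving an unbiased estimate of the loss under the assumed position-bias model.
ips_labels = clicks / propensity
```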
Offline metrics do not always match online performance. A model that wins on NDCG@10 on a held-out test set can lose on click-through rate, dwell time, or revenue when deployed. The gap is particularly acute for personalisation, freshness, and exploration.
Out-of-domain generalisation. BEIR demonstrated that supervised dense retrievers often fail to transfer to new domains, while BM25 holds up surprisingly well. The community has responded with multi-domain training, hard-negative mining, and synthetic question generation, but generalisation remains an active research area.
Fairness and exposure. Ranking decisions distribute attention, and uneven exposure can amplify inequalities, whether between job applicants, news outlets, or sellers on a marketplace. Fair ranking research (Singh and Joachims, 2018; Biega and colleagues, 2018) extends the classical learning-to-rank framework with fairness constraints.
LLM rerankers are slow and expensive. RankGPT and its successors deliver strong results but cost real money per query and add seconds of latency. Distillation into smaller open models is the dominant response.
Imagine you ask a librarian for books about dinosaurs. The librarian could just hand you any book that has the word "dinosaur" on the cover. That works, but you want the best book first, the second best book second, and so on. Ranking in machine learning teaches a computer to be that careful librarian. The computer looks at the question, looks at every book it could give you, and tries to figure out the right order to put them in. It learns by practising on lots of past questions where someone has already marked which books were the most useful. The tricky part is that the computer is not graded on whether it picked good books in general, only on whether it put them in the right order. There are a lot of different recipes for teaching this, with names like LambdaMART and ColBERT, and most modern search engines and recommendation apps use some mix of them.