See also: Machine learning terms, Ranking, Information Retrieval
Re-ranking, also written as reranking and sometimes called rank refinement or re-scoring, is the second stage of a two-stage information retrieval pipeline. A fast first-stage retriever produces a candidate set of roughly 50 to 1000 documents from a large corpus, then a slower but more accurate model rescores those candidates so the most relevant items rise to the top. The technique is now standard in web search, enterprise search, recommender systems, and retrieval-augmented generation (RAG), where the top results are fed to a large language model for answer synthesis.
The core motivation is a tradeoff. Models that score every document against the query with full cross-attention are too slow to run over millions of items, but bi-encoder retrievers that compress queries and documents into independent vectors miss fine-grained interactions between the two. Two stages get the best of both: cheap recall first, expensive precision second. Reimers and Gurevych framed this split clearly in the Sentence-BERT paper, showing that finding the most similar pair in 10,000 sentences with BERT takes about 65 hours of inference, while a bi-encoder reduces it to 5 seconds at the cost of some accuracy.[1]
A modern retrieval pipeline almost always has at least these two phases. Some systems add a third stage with an LLM judge or a fusion step.
| Stage | Goal | Typical methods | Latency budget | Items processed |
|---|---|---|---|---|
| 1. Retrieval | Recall: find anything plausibly relevant | BM25, dense retrieval (bi-encoders, also called two-tower models), hybrid sparse plus dense, ColBERT with approximate nearest-neighbor search | Sub-100 ms | Millions to billions |
| 2. Re-ranking | Precision: order the top-K | Cross-encoder, late-interaction (ColBERT), monoT5, LLM listwise reranker, learning-to-rank with rich features | 100 ms to several seconds | Top 50 to 1000 candidates |
| 3. (optional) Post-processing | Diversity, business rules, deduplication | Maximal marginal relevance, reciprocal rank fusion, custom blenders | Tens of milliseconds | Top 10 to 100 |
This split also matches what production RAG systems do. Hybrid search recovers candidates that BM25 or dense retrieval alone would miss, the cross-encoder re-orders them by true semantic match, and the top three to ten chunks go into the LLM context window.[2]
Running a cross-encoder over a million documents would require a million transformer forward passes per query. At 10 ms per pass, that is roughly three hours per query on a single GPU. Bi-encoders sidestep this by encoding documents once, offline, into a vector index, so query time becomes a fast nearest-neighbor lookup. The cost is information loss: query and document are compressed independently, and the model never sees them together.
A cross-encoder makes the opposite trade. The query and document are concatenated and processed together, so attention can match a question word against any token in the candidate. Quality goes up, throughput goes down. Reranking only the bi-encoder's top-K candidates is a clean compromise: pre-computation handles the recall problem, and cross-attention over those K pairs handles the precision problem. Industry write-ups generally report cross-encoders being 50 to 100 times slower than bi-encoders per pair, which is exactly why nobody runs them across the whole corpus.[3]
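The pattern is easy to see in code. Below is a minimal sketch using the sentence-transformers library and two of its published MS MARCO checkpoints; the toy corpus and top-K value are placeholders, not a tuned configuration.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Stage 1: bi-encoder. Documents are encoded once; only the query at runtime.
bi_encoder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "BM25 is a bag-of-words ranking function.",
    "ColBERT computes per-token MaxSim scores.",
    "Cross-encoders attend over query and document jointly.",
]
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)

query = "How do cross-encoders score relevance?"
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)

# Cheap recall: top-K nearest neighbors by cosine similarity.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]

# Stage 2: expensive precision. Score each (query, candidate) pair jointly.
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
scores = cross_encoder.predict(pairs)

# Re-order the candidates by cross-encoder score.
for score, (_, doc) in sorted(zip(scores, pairs), key=lambda x: x[0], reverse=True):
    print(f"{score:.3f}  {doc}")
```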
Four main families of re-rankers are in production use today.
| Type | Example models | Architecture | Strengths | Weaknesses |
|---|---|---|---|---|
| Cross-encoder | sentence-transformers ms-marco-MiniLM-L-6-v2, BGE reranker, Cohere Rerank, Voyage rerank-2, Jina Reranker v2 | Single transformer over (query, doc) concatenation; outputs scalar score | High accuracy, simple to fine-tune, well supported in libraries | Cannot pre-compute document representations; cost grows linearly with K |
| Late interaction | ColBERT, ColBERTv2, ColPali | Per-token embeddings for query and document; MaxSim aggregation | Faster than cross-encoder, can index for first-stage too, fine-grained matching | Larger index size; multi-vector storage |
| Sequence-to-sequence | monoT5, RankT5 | T5 generates a relevance token; its logit acts as the score | Strong zero-shot transfer, scales with model size | Slower than encoder-only cross-encoders at the same parameter count |
| LLM reranker | RankGPT, RankZephyr, RankVicuna, listwise prompting with GPT-4 or Claude | Prompt the LLM with the query and a list of candidates; ask for the ranked order | Best zero-shot quality, can use natural-language criteria | High latency and token cost; output parsing is brittle |
A cross-encoder takes both the query and the candidate document, joins them with a separator token, runs them through a transformer, and uses a final linear head to emit a single relevance score. Because every query token can attend to every document token, the model captures word-level interactions that a bi-encoder cannot. The Sentence-BERT paper formalized cross-encoder versus bi-encoder terminology in 2019 and shipped open-weight cross-encoders for MS MARCO that are still widely used in libraries like sentence-transformers.[1]
Nogueira and Cho's 2019 paper Passage Re-ranking with BERT, often called monoBERT in later work, was the first to apply BERT as a passage reranker. They fine-tuned BERT-Base and BERT-Large on the MS MARCO passage ranking task and beat the previous state of the art by 27 percent relative MRR@10, then took the top spot on the public leaderboard.[4] The monoBERT recipe (binary classification on relevant or non-relevant query-passage pairs) became the template every subsequent encoder-based reranker followed.
ColBERT, introduced by Khattab and Zaharia at SIGIR 2020, sits between a bi-encoder and a cross-encoder.[5] Instead of one vector per text, ColBERT keeps one vector per token. At query time, the model encodes the query into per-token vectors, then computes a MaxSim score: for each query token, take the maximum cosine similarity against any document token, then sum across query tokens. Documents are encoded offline, so the index is much larger than a bi-encoder index but the computation per candidate is much cheaper than a cross-encoder. The original paper reports two orders of magnitude lower latency and four orders of magnitude fewer FLOPs per query than a comparable BERT cross-encoder, while staying competitive on quality. ColBERTv2 (2022) added residual compression to shrink the index. ColBERT can serve either as a first-stage retriever (with vector indexes) or as a reranker, which makes it unusual.
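MaxSim is compact enough to state directly. Here is a sketch in PyTorch, assuming the per-token embeddings have already been produced by ColBERT's encoders and L2-normalized (so a dot product is a cosine similarity):

```python
import torch

def maxsim_score(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> float:
    """ColBERT-style late interaction.

    query_embs: (num_query_tokens, dim), L2-normalized per token
    doc_embs:   (num_doc_tokens, dim),   L2-normalized per token
    """
    # Cosine similarity of every query token against every document token.
    sim = query_embs @ doc_embs.T            # (num_query_tokens, num_doc_tokens)
    # For each query token, keep its best-matching document token...
    per_token_max = sim.max(dim=1).values    # (num_query_tokens,)
    # ...and sum across query tokens.
    return per_token_max.sum().item()

# Toy example: 4 query tokens, 12 document tokens, 128-dim embeddings.
q = torch.nn.functional.normalize(torch.randn(4, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)
print(maxsim_score(q, d))
```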
monoT5 (Nogueira et al., 2020, EMNLP Findings) treats reranking as a generation problem.[6] The model is fed prompts of the form `Query: <q> Document: <d> Relevant:` and fine-tuned to produce the token true for relevant pairs and false for irrelevant ones. The relevance score is the logit of the true token relative to false. Because T5 was already a strong sequence model, monoT5 transferred well to TREC tracks and other out-of-domain corpora. The Castorini group later extended this with duoT5 (pairwise) and the Expando-Mono-Duo design pattern.
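A scoring sketch against the public castorini/monot5-base-msmarco checkpoint on Hugging Face: the single decoding step and the true-versus-false logit comparison follow the paper's recipe, though the token-id handling here is an illustration rather than the reference implementation.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("castorini/monot5-base-msmarco")
model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco")
model.eval()

TRUE_ID = tokenizer.encode("true")[0]    # first subword token of "true"
FALSE_ID = tokenizer.encode("false")[0]  # first subword token of "false"

def monot5_score(query: str, document: str) -> float:
    prompt = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    # One decoder step: feed the start token, read the next-token logits.
    decoder_input = torch.full((1, 1), model.config.decoder_start_token_id)
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input).logits[0, -1]
    # Relevance = probability mass of "true" relative to "false".
    pair = torch.softmax(logits[[TRUE_ID, FALSE_ID]], dim=0)
    return pair[0].item()

print(monot5_score("what is reranking", "Re-ranking rescores retrieved candidates."))
```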
In 2023 Sun et al. published Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents, which won an EMNLP 2023 Outstanding Paper Award and introduced the now-standard listwise prompting recipe known as RankGPT.[7] The LLM is shown a query and the top 20 candidates by ID, then asked to output the ranked list of IDs. Because the model sees all candidates at once, it can compare them against each other, which pointwise cross-encoders cannot. To rerank more candidates than the context window allows, RankGPT uses a sliding window that walks back to front, reranking the worst window first and bubbling strong candidates upward.
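The sliding-window traversal itself is only a few lines. The sketch below shows just the window arithmetic; `llm_rank` is a hypothetical stand-in for the actual listwise LLM call, which must return the same items in ranked order.

```python
def sliding_window_rerank(candidates, llm_rank, window=20, step=10):
    """RankGPT-style sliding window: walk back to front so strong
    candidates found deep in the list bubble toward the top.

    llm_rank: stand-in (hypothetical here) for a listwise LLM call that
    takes a list of candidates and returns them in ranked order.
    """
    items = list(candidates)
    end = len(items)
    while end > 0:
        start = max(0, end - window)
        items[start:end] = llm_rank(items[start:end])
        end -= step
    return items
```

With window=20 and step=10, consecutive windows overlap by ten positions, which is what lets a strong candidate found near the bottom climb the full list over several passes.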
RankZephyr (Pradeep, Sharifymoghaddam, and Lin, 2023) distilled the RankGPT-3.5 and RankGPT-4 behavior into a 7-billion-parameter open model based on Zephyr-Beta and Mistral, closing much of the open versus closed gap on TREC Deep Learning evaluations.[8] The RankLLM toolkit packages these models for reproducible IR research.
Before neural rerankers took over, the dominant approach was learning-to-rank (LTR) over hand-engineered features such as BM25 score, click-through rate, anchor text matches, freshness, and PageRank. LambdaMART (Burges, 2010), a boosted-tree variant of LambdaRank, optimized NDCG directly and won Track 1 of the Yahoo! Learning to Rank Challenge in 2010.[9] LambdaMART implementations in LightGBM and XGBoost are still standard in commercial search and ad ranking, often as the final stage on top of neural candidates.
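A minimal LambdaMART sketch with LightGBM's LGBMRanker; the features here are synthetic stand-ins for real signals like BM25 score, click-through rate, and freshness.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)

# Synthetic training data: 100 queries, 20 candidates each, 5 features.
n_queries, per_query, n_features = 100, 20, 5
X = rng.normal(size=(n_queries * per_query, n_features))
y = rng.integers(0, 4, size=n_queries * per_query)   # graded relevance 0-3
group = [per_query] * n_queries                      # candidates per query

ranker = lgb.LGBMRanker(
    objective="lambdarank",   # LambdaMART: LambdaRank gradients + boosted trees
    metric="ndcg",
    n_estimators=100,
)
ranker.fit(X, y, group=group)

# At query time, score the candidate set and sort by predicted relevance.
candidates = rng.normal(size=(20, n_features))
order = np.argsort(-ranker.predict(candidates))
print(order[:10])
```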
A cluster of vendors now sells hosted rerankers. They are convenient because no client-side GPU is required, and the models are usually multilingual and continuously updated.
| Provider | Model name | Context length | Languages | Notes |
|---|---|---|---|---|
| Cohere | rerank-v3.5 | 4096 tokens | 100+ | Multilingual default since December 2024; available on Bedrock, Azure, Oracle |
| Voyage AI | rerank-2 | 16K tokens combined | Multilingual | Quality-focused; rerank-2-lite is cheaper at 8K context |
| Jina AI | jina-reranker-v2-base-multilingual | 1024 tokens (with chunking) | 100+ | 278M parameters; tuned for function calling and code search |
| BAAI | bge-reranker-v2-m3 | 8192 tokens | Multilingual | Open weights (Apache 2.0); 0.6B parameters; runs on consumer hardware |
| Mixedbread | mxbai-rerank-large-v1 | 512 tokens | English | Open weights; Apache 2.0 |
Neither Anthropic nor OpenAI ships a dedicated reranker endpoint. Teams that want LLM-grade reranking from those providers prompt a chat model with a listwise template, which works well but costs more per query. The Cohere Rerank, Voyage Rerank, and Jina Reranker endpoints all return scores in roughly the same shape: a JSON list of indexes paired with relevance values, suitable for sorting client-side.
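A sketch of a hosted rerank call using the Cohere Python SDK: the index-plus-relevance_score fields match the documented response shape, though the client class and field names can drift between SDK versions.

```python
import cohere

co = cohere.ClientV2(api_key="YOUR_API_KEY")

docs = [
    "Re-ranking rescores a candidate set with a more accurate model.",
    "BM25 is a sparse lexical ranking function.",
    "ColBERT stores one vector per token.",
]
response = co.rerank(
    model="rerank-v3.5",
    query="what does a reranker do",
    documents=docs,
    top_n=2,
)
# The API returns indexes into the input list paired with relevance
# scores, so the final sort happens client-side.
for result in response.results:
    print(result.index, result.relevance_score, docs[result.index])
```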
Latency depends on hardware, batch size, and candidate length. The numbers below are typical orders of magnitude for English passages of around 100 to 200 tokens.
| Method | Per-query latency | When it makes sense |
|---|---|---|
| BM25 | 1 to 10 ms | Always run for keyword matching |
| Bi-encoder ANN | 10 to 100 ms | First-stage dense retrieval |
| Cross-encoder over top-100 | 100 ms to 1 second on GPU | Default reranker for production RAG |
| Cross-encoder over top-1000 | 1 to 5 seconds on GPU | Higher recall, slower endpoint |
| ColBERT MaxSim over top-1000 | 50 to 200 ms | Mid-tier between bi- and cross-encoder |
| LLM listwise (GPT-4 over 20 docs) | 2 to 10 seconds | Highest quality, batch or async pipelines |
Quality gains compound through the stack. Genzeon's hybrid plus rerank pipeline reports MRR@3 jumping from 0.433 to 0.605 (a 39.7 percent relative improvement) once a cross-encoder is added on top of fused BM25 plus dense retrieval.[2] Pinecone, Databricks, and Cohere have all published similar numbers, showing cross-encoder rerankers lifting NDCG@10 by 5 to 15 points over the best first-stage method alone. The exact size of the lift depends on how good the first stage already is. If BM25 already returns the right answer at rank 1, the reranker has little to add. If the relevant passage is buried at rank 47 because it has weak lexical overlap, reranking is what saves the query.
Re-ranking is evaluated with the same offline metrics as ranking in general: NDCG@k, MRR, MAP, Precision@k, and Recall@k (the last is more useful for the retrieval stage). Common public benchmarks include MS MARCO passage and document tracks, the TREC Deep Learning tracks (DL19, DL20, DL21, DL22), and BEIR, the heterogeneous zero-shot benchmark covering 18 datasets. BEIR results are reported as average NDCG@10 across the included datasets, and they are the standard reference point for both first-stage retrievers and rerankers.
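Reference implementations of two of these metrics, using the linear-gain DCG variant (some toolkits use the exponential 2^rel − 1 gain instead):

```python
import math

def mrr(ranked_rels):
    """Mean reciprocal rank over queries.

    ranked_rels: one list per query of binary relevance labels,
    in the order the system ranked the documents.
    """
    total = 0.0
    for rels in ranked_rels:
        for i, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / i
                break
    return total / len(ranked_rels)

def ndcg_at_k(rels, k=10):
    """NDCG@k for one query, graded relevance labels in ranked order.

    Uses linear gain (gain = rel); some toolkits use 2**rel - 1.
    """
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))
    ideal = sorted(rels, reverse=True)[:k]
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

print(mrr([[0, 1, 0], [1, 0, 0]]))   # (1/2 + 1) / 2 = 0.75
print(ndcg_at_k([3, 2, 0, 1], k=10))
```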
For RAG-specific systems, downstream answer-quality metrics also matter. A reranker that lifts NDCG by two points but increases the share of incorrect answers because it surfaces a tempting but wrong distractor is a net loss for the application. Teams running RAG often pair retrieval metrics with end-to-end accuracy metrics on a curated answer set.
Not every reranker is a neural model. Two non-neural approaches still appear in production.
Reciprocal rank fusion (RRF), introduced by Cormack, Clarke, and Büttcher at SIGIR 2009, is the standard way to combine multiple ranked lists into one.[10] The RRF score for a document is the sum of 1 / (k + rank_i) across each input list, where k is a small constant (the paper uses 60). RRF throws away the underlying scores and works only with positions, which makes it robust to systems that produce scores on incompatible scales. It is the default fusion strategy in OpenSearch, Elasticsearch, Azure Cognitive Search, and most hybrid retrieval libraries.
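A direct implementation of the formula:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs with RRF.

    score(d) = sum over lists of 1 / (k + rank of d in that list),
    with ranks starting at 1; k=60 follows the original paper.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d3", "d1", "d7", "d2"]
dense = ["d1", "d9", "d3", "d4"]
print(reciprocal_rank_fusion([bm25, dense]))  # d1 and d3 rise to the top
```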
Pseudo-relevance feedback (PRF) treats the top-K results from the first retrieval as if they were known to be relevant, then expands the query with terms drawn from those documents and runs a second retrieval. RM3 is the most common PRF method on top of BM25, and neural variants like ANCE-PRF and ColBERT-PRF apply the same idea to dense retrieval. PRF is technically a re-ranking step because the second retrieval reorders the candidate set in light of feedback, even if there is no separate scoring model.
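A deliberately simplified PRF sketch: real RM3 weights expansion terms by relevance-model probability rather than raw frequency, so treat this as the shape of the idea only.

```python
from collections import Counter

def prf_expand(query, top_docs, num_terms=10):
    """Pseudo-relevance feedback, heavily simplified: treat the top-K
    retrieved documents as relevant, pull their most frequent terms,
    and append them to the query for a second retrieval pass.
    """
    query_terms = set(query.lower().split())
    counts = Counter()
    for doc in top_docs:
        counts.update(doc.lower().split())
    expansion = [t for t, _ in counts.most_common() if t not in query_terms]
    return query + " " + " ".join(expansion[:num_terms])
```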
Most reranker work in 2025 is done through one of these libraries.
| Stack | Reranker class or component | Typical use |
|---|---|---|
| sentence-transformers | CrossEncoder | Local cross-encoder inference; fine-tuning on custom data |
| LangChain | CohereRerank, JinaRerank, FlashrankRerank (in langchain_community) | Wrap a base retriever with ContextualCompressionRetriever |
| LlamaIndex | CohereRerank, SentenceTransformerRerank, LLMRerank node post-processors | Plug into a QueryEngine between retrieval and synthesis |
| Haystack 2 | TransformersSimilarityRanker, CohereRanker, JinaRanker, LostInTheMiddleRanker | Drop-in pipeline component |
| RankLLM | RankZephyr, RankGPT, FIRST | LLM listwise reranking with sliding window |
| FlagEmbedding | FlagReranker, FlagLLMReranker | Inference for BGE rerankers |
In LangChain the wiring is short. Build a base vector retriever, instantiate CohereRerank, then wrap them with ContextualCompressionRetriever(base_compressor=cohere, base_retriever=vector). Every query first runs through the vector retriever, the candidate documents flow into the rerank API, and the wrapped retriever returns the reordered top-N. LlamaIndex follows the same pattern with a node post-processor list on a QueryEngine.[11]
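A sketch of that wiring: import paths follow recent LangChain packaging (langchain_cohere, langchain_community) and may shift between versions, and the embedding model is an arbitrary choice, not a requirement.

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Base vector retriever over a small FAISS index.
vectorstore = FAISS.from_texts(
    [
        "Re-ranking rescores candidates.",
        "BM25 is lexical.",
        "ColBERT is late interaction.",
    ],
    OpenAIEmbeddings(),
)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

# Wrap the retriever so every query is reranked by the Cohere endpoint.
compressor = CohereRerank(model="rerank-v3.5", top_n=3)
retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)

docs = retriever.invoke("what does a reranker do")
```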
The cross-encoder design has structural limits. Document representations cannot be pre-computed, so cost is paid at query time and grows linearly with the candidate count. This is acceptable for K of 20 to 100 but painful for K of 1000 or more. Late-interaction models like ColBERT trade some accuracy for the ability to pre-compute, and learning-to-rank with light features stays useful when latency budgets are very tight.
Domain shift is another headache. A reranker trained on MS MARCO web passages can degrade on legal contracts, medical literature, or internal company documents. Fine-tuning on in-domain pairs (often generated synthetically from the corpus) usually fixes this, but it requires labeled or semi-labeled data and a non-trivial training step. Cohere, Voyage, Jina, and BAAI release multilingual checkpoints partly to amortize this problem across users.
LLM rerankers introduce token cost and latency. A RankGPT pass over 20 candidates with GPT-4 routinely takes several seconds and costs more than a typical chat completion. They also fail in interesting ways: outputs can be malformed, the model can refuse to rank, or it can hallucinate IDs that were not in the input list. RankZephyr and other distilled open models reduce cost but still need careful output parsing.
Finally, optimizing rerankers in isolation can hide problems with the first stage. If the first-stage retrieval misses the relevant document entirely, no reranker can save it, since the document is not in the candidate set. End-to-end recall at the retrieval stage is therefore as important as precision at the rerank stage, and tuning one without the other tends to give misleading results.
Classical IR systems used multi-stage ranking long before neural networks were involved. Bing and Yahoo's web search stacks in the late 2000s ran an inverted-index recall stage, an L1 ranker (often LambdaMART) on a few thousand candidates, and an L2 ranker with richer features and click signals on the top hundred. The split was driven by the same arithmetic as today: full feature extraction over the entire web is impossible, so cheap recall comes first.
The neural era started with Nogueira and Cho's monoBERT in 2019, which showed that a transformer cross-encoder could move the MS MARCO state of the art by a large margin.[4] Reimers and Gurevych's Sentence-BERT in the same year provided the bi-encoder counterpart that made retrieval cheap enough to feed those rerankers.[1] ColBERT (2020) and monoT5 (2020) explored late interaction and seq2seq formulations.[5][6] By 2023, RankGPT had shown that LLM listwise prompting could match or beat dedicated rerankers, and RankZephyr had distilled that capability into open-weight models.[7][8] Cohere shipped its first commercial Rerank model in 2023, with rerank-v3.5 (the multilingual default) following in December 2024. Voyage launched rerank-2 in September 2024, and Jina released its multilingual reranker around the same time.
The direction of travel is toward longer context, more languages, and tighter integration with hybrid retrieval. Whether the future belongs to specialized cross-encoders, late-interaction models, or just very capable general LLMs remains an open question. In practice most production systems still combine all three: BM25 plus dense for recall, a fast cross-encoder for the bulk of the rerank, and an LLM (or LLM-based judge) for the final handful of candidates that go into a generated answer.
Imagine a giant library with a million books. You ask the librarian for books about dinosaurs. The librarian quickly grabs 100 books with the word dinosaur in the title. That is fast but rough; some of those books are coloring books, some are romance novels with a dinosaur on the cover. Then a paleontologist comes over and looks through the 100 books one by one, ranking them by how useful they are for what you actually want. The paleontologist is slow but careful. That second pass is re-ranking. The librarian gives you breadth, the paleontologist gives you precision, and together they hand you the ten best books in the library without anyone having to read all million.