See also: Machine learning terms, Ranking, Information Retrieval
Re-ranking, also written as reranking and sometimes called rank refinement or re-scoring, is the second stage of a two-stage information retrieval pipeline. A fast first-stage retriever produces a candidate set of roughly 50 to 1000 documents from a large corpus, then a slower but more accurate model rescores those candidates so the most relevant items rise to the top. The technique is now standard in web search, enterprise search, recommender systems, and retrieval-augmented generation (RAG), where the top results are fed to a large language model for answer synthesis.
The core motivation is a tradeoff. Models that score every document against the query with full cross-attention are too slow to run over millions of items, but bi-encoder retrievers that compress queries and documents into independent vectors miss fine-grained interactions between the two. Two stages get the best of both: cheap recall first, expensive precision second. Reimers and Gurevych framed this split clearly in the Sentence-BERT paper, showing that finding the most similar pair in 10,000 sentences with BERT takes about 65 hours of inference, while a bi-encoder reduces it to 5 seconds at the cost of some accuracy.[1]
A modern retrieval pipeline almost always has at least these two phases. Some systems add a third stage with an LLM judge or a fusion step.
| Stage | Goal | Typical methods | Latency budget | Items processed |
|---|---|---|---|---|
| 1. Retrieval | Recall: find anything plausibly relevant | BM25, dense retrieval (bi-encoders, also called two-tower models), hybrid sparse plus dense, ColBERT with approximate nearest-neighbor search | Sub-100 ms | Millions to billions |
| 2. Re-ranking | Precision: order the top-K | Cross-encoder, late-interaction (ColBERT), monoT5, LLM listwise reranker, learning-to-rank with rich features | 100 ms to several seconds | Top 50 to 1000 candidates |
| 3. (optional) Post-processing | Diversity, business rules, deduplication | Maximal marginal relevance, reciprocal rank fusion, custom blenders | Tens of milliseconds | Top 10 to 100 |
This split also matches what production RAG systems do. Hybrid search recovers candidates that BM25 or dense retrieval alone would miss, the cross-encoder re-orders them by true semantic match, and the top three to ten chunks go into the LLM context window.[2]
Running a cross-encoder over a million documents would require a million transformer forward passes per query. At 10 ms per pass, that is roughly three hours per query on a single GPU. Bi-encoders sidestep this by encoding documents once, offline, into a vector index, so query time becomes a fast nearest-neighbor lookup. The cost is information loss: query and document are compressed independently, and the model never sees them together.
A cross-encoder makes the opposite trade. The query and document are concatenated and processed together, so attention can match a question word against any token in the candidate. Quality goes up, throughput goes down. Reranking only the bi-encoder's top-K candidates is a clean compromise: pre-computation handles the recall problem, and cross-attention over those K pairs handles the precision problem. Industry write-ups generally report cross-encoders being 50 to 100 times slower than bi-encoders per pair, which is exactly why nobody runs them across the whole corpus.[3]
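The pattern is easy to see in code. Below is a minimal sketch using the sentence-transformers library and two of its published MS MARCO checkpoints; the toy corpus and top-K value are placeholders, not a tuned configuration.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Stage 1: bi-encoder. Documents are encoded once; only the query at runtime.
bi_encoder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

corpus = [
    "BM25 is a bag-of-words ranking function.",
    "ColBERT computes per-token MaxSim scores.",
    "Cross-encoders attend over query and document jointly.",
]
corpus_embeddings = bi_encoder.encode(corpus, convert_to_tensor=True)

query = "How do cross-encoders score relevance?"
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)

# Cheap recall: top-K nearest neighbors by cosine similarity.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]

# Stage 2: expensive precision. Score each (query, candidate) pair jointly.
pairs = [(query, corpus[hit["corpus_id"]]) for hit in hits]
scores = cross_encoder.predict(pairs)

# Re-order the candidates by cross-encoder score.
for score, (_, doc) in sorted(zip(scores, pairs), key=lambda x: x[0], reverse=True):
    print(f"{score:.3f}  {doc}")
```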
Four main families of re-rankers are in production use today.
| Type | Example models | Architecture | Strengths | Weaknesses |
|---|---|---|---|---|
| Cross-encoder | sentence-transformers ms-marco-MiniLM-L-6-v2, BGE reranker, Cohere Rerank, Voyage rerank-2, Jina Reranker v2 | Single transformer over (query, doc) concatenation; outputs scalar score | High accuracy, simple to fine-tune, well supported in libraries | Cannot pre-compute document representations; cost grows linearly with K |
| Late interaction | ColBERT, ColBERTv2, ColPali | Per-token embeddings for query and document; MaxSim aggregation | Faster than cross-encoder, can index for first-stage too, fine-grained matching | Larger index size; multi-vector storage |
| Sequence-to-sequence | monoT5, RankT5 | T5 generates a relevance token; its logit acts as the score | Strong zero-shot transfer, scales with model size | Slower than encoder-only cross-encoders at the same parameter count |
| LLM reranker | RankGPT, RankZephyr, RankVicuna, listwise prompting with GPT-4 or Claude | Prompt the LLM with the query and a list of candidates; ask for the ranked order | Best zero-shot quality, can use natural-language criteria | High latency and token cost; output parsing is brittle |
A cross-encoder takes both the query and the candidate document, joins them with a separator token, runs them through a transformer, and uses a final linear head to emit a single relevance score. Because every query token can attend to every document token, the model captures word-level interactions that a bi-encoder cannot. The Sentence-BERT paper formalized cross-encoder versus bi-encoder terminology in 2019 and shipped open-weight cross-encoders for MS MARCO that are still widely used in libraries like sentence-transformers.[1]
Nogueira and Cho's 2019 paper Passage Re-ranking with BERT, often called monoBERT in later work, was the first to apply BERT as a passage reranker. They fine-tuned BERT-Base and BERT-Large on the MS MARCO passage ranking task and beat the previous state of the art by 27 percent relative MRR@10, then took the top spot on the public leaderboard.[4] The monoBERT recipe (binary classification on relevant or non-relevant query-passage pairs) became the template every subsequent encoder-based reranker followed.
ColBERT, introduced by Khattab and Zaharia at SIGIR 2020, sits between a bi-encoder and a cross-encoder.[5] Instead of one vector per text, ColBERT keeps one vector per token. At query time, the model encodes the query into per-token vectors, then computes a MaxSim score: for each query token, take the maximum cosine similarity against any document token, then sum across query tokens. Documents are encoded offline, so the index is much larger than a bi-encoder index but the computation per candidate is much cheaper than a cross-encoder. The original paper reports two orders of magnitude lower latency and four orders of magnitude fewer FLOPs per query than a comparable BERT cross-encoder, while staying competitive on quality. ColBERTv2 (2022) added residual compression to shrink the index. ColBERT can serve either as a first-stage retriever (with vector indexes) or as a reranker, which makes it unusual.
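MaxSim is compact enough to state directly. Here is a sketch in PyTorch, assuming the per-token embeddings have already been produced by ColBERT's encoders and L2-normalized (so a dot product is a cosine similarity):

```python
import torch

def maxsim_score(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> float:
    """ColBERT-style late interaction.

    query_embs: (num_query_tokens, dim), L2-normalized per token
    doc_embs:   (num_doc_tokens, dim),   L2-normalized per token
    """
    # Cosine similarity of every query token against every document token.
    sim = query_embs @ doc_embs.T            # (num_query_tokens, num_doc_tokens)
    # For each query token, keep its best-matching document token...
    per_token_max = sim.max(dim=1).values    # (num_query_tokens,)
    # ...and sum across query tokens.
    return per_token_max.sum().item()

# Toy example: 4 query tokens, 12 document tokens, 128-dim embeddings.
q = torch.nn.functional.normalize(torch.randn(4, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)
print(maxsim_score(q, d))
```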
monoT5 (Nogueira et al., 2020, EMNLP Findings) treats reranking as a generation problem.[6] The model is fed prompts of the form `Query: <q> Document: <d> Relevant:` and fine-tuned to produce the token true for relevant pairs and false for irrelevant ones. The relevance score is the logit of the true token relative to false. Because T5 was already a strong sequence model, monoT5 transferred well to TREC tracks and other out-of-domain corpora. The Castorini group later extended this with duoT5 (pairwise) and the Expando-Mono-Duo design pattern.
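A scoring sketch against the public castorini/monot5-base-msmarco checkpoint on Hugging Face: the single decoding step and the true-versus-false logit comparison follow the paper's recipe, though the token-id handling here is an illustration rather than the reference implementation.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("castorini/monot5-base-msmarco")
model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco")
model.eval()

TRUE_ID = tokenizer.encode("true")[0]    # first subword token of "true"
FALSE_ID = tokenizer.encode("false")[0]  # first subword token of "false"

def monot5_score(query: str, document: str) -> float:
    prompt = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    # One decoder step: feed the start token, read the next-token logits.
    decoder_input = torch.full((1, 1), model.config.decoder_start_token_id)
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input).logits[0, -1]
    # Relevance = probability mass of "true" relative to "false".
    pair = torch.softmax(logits[[TRUE_ID, FALSE_ID]], dim=0)
    return pair[0].item()

print(monot5_score("what is reranking", "Re-ranking rescores retrieved candidates."))
```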
In 2023 Sun et al. published Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents, which won an EMNLP 2023 Outstanding Paper Award and introduced the now-standard listwise prompting recipe known as RankGPT.[7] The LLM is shown a query and the top 20 candidates by ID, then asked to output the ranked list of IDs. Because the model sees all candidates at once, it can compare them against each other, which pointwise cross-encoders cannot. To rerank more candidates than the context window allows, RankGPT uses a sliding window that walks back to front, reranking the worst window first and bubbling strong candidates upward.
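The sliding-window traversal itself is only a few lines. The sketch below shows just the window arithmetic; `llm_rank` is a hypothetical stand-in for the actual listwise LLM call, which must return the same items in ranked order.

```python
def sliding_window_rerank(candidates, llm_rank, window=20, step=10):
    """RankGPT-style sliding window: walk back to front so strong
    candidates found deep in the list bubble toward the top.

    llm_rank: stand-in (hypothetical here) for a listwise LLM call that
    takes a list of candidates and returns them in ranked order.
    """
    items = list(candidates)
    end = len(items)
    while end > 0:
        start = max(0, end - window)
        items[start:end] = llm_rank(items[start:end])
        end -= step
    return items
```

With window=20 and step=10, consecutive windows overlap by ten positions, which is what lets a strong candidate found near the bottom climb the full list over several passes.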
RankZephyr (Pradeep, Sharifymoghaddam, and Lin, 2023) distilled the RankGPT-3.5 and RankGPT-4 behavior into a 7-billion-parameter open model based on Zephyr-Beta and Mistral, closing much of the open versus closed gap on TREC Deep Learning evaluations.[8] The RankLLM toolkit packages these models for reproducible IR research.
Before neural rerankers took over, the dominant approach was learning-to-rank (LTR) over hand-engineered features such as BM25 score, click-through rate, anchor text matches, freshness, and PageRank. LambdaMART (Burges, 2010), a boosted-tree variant of LambdaRank, optimized NDCG directly and won Track 1 of the Yahoo! Learning to Rank Challenge in 2010.[9] LambdaMART implementations in LightGBM and XGBoost are still standard in commercial search and ad ranking, often as the final stage on top of neural candidates.
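A minimal LambdaMART sketch with LightGBM's LGBMRanker; the features here are synthetic stand-ins for real signals like BM25 score, click-through rate, and freshness.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)

# Synthetic training data: 100 queries, 20 candidates each, 5 features.
n_queries, per_query, n_features = 100, 20, 5
X = rng.normal(size=(n_queries * per_query, n_features))
y = rng.integers(0, 4, size=n_queries * per_query)   # graded relevance 0-3
group = [per_query] * n_queries                      # candidates per query

ranker = lgb.LGBMRanker(
    objective="lambdarank",   # LambdaMART: LambdaRank gradients + boosted trees
    metric="ndcg",
    n_estimators=100,
)
ranker.fit(X, y, group=group)

# At query time, score the candidate set and sort by predicted relevance.
candidates = rng.normal(size=(20, n_features))
order = np.argsort(-ranker.predict(candidates))
print(order[:10])
```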
A cluster of vendors now sells hosted rerankers. They are convenient because no client-side GPU is required, and the models are usually multilingual and continuously updated.
| Provider | Model name | Context length | Languages | Notes |
|---|---|---|---|---|
| Cohere | rerank-v3.5 | 4096 tokens | 100+ | Multilingual default since December 2024; available on Bedrock, Azure, Oracle |
| Voyage AI | rerank-2 | 16K tokens combined | Multilingual | Quality-focused; rerank-2-lite is cheaper at 8K context |
| Jina AI | jina-reranker-v2-base-multilingual | 1024 tokens (with chunking) | 100+ | 278M parameters; tuned for function calling and code search |
| BAAI | bge-reranker-v2-m3 | 8192 tokens | Multilingual | Open weights (Apache 2.0); 0.6B parameters; runs on consumer hardware |
| Mixedbread | mxbai-rerank-large-v1 | 512 tokens | English | Open weights; Apache 2.0 |
Neither Anthropic nor OpenAI ships a dedicated reranker endpoint. Teams that want LLM-grade reranking from those providers prompt a chat model with a listwise template, which works well but costs more per query. The Cohere Rerank, Voyage Rerank, and Jina Reranker endpoints all return scores in roughly the same shape: a JSON list of indexes paired with relevance values, suitable for sorting client-side.
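A sketch of a hosted rerank call using the Cohere Python SDK: the index-plus-relevance_score fields match the documented response shape, though the client class and field names can drift between SDK versions.

```python
import cohere

co = cohere.ClientV2(api_key="YOUR_API_KEY")

docs = [
    "Re-ranking rescores a candidate set with a more accurate model.",
    "BM25 is a sparse lexical ranking function.",
    "ColBERT stores one vector per token.",
]
response = co.rerank(
    model="rerank-v3.5",
    query="what does a reranker do",
    documents=docs,
    top_n=2,
)
# The API returns indexes into the input list paired with relevance
# scores, so the final sort happens client-side.
for result in response.results:
    print(result.index, result.relevance_score, docs[result.index])
```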
Latency depends on hardware, batch size, and candidate length. The numbers below are typical orders of magnitude for English passages of around 100 to 200 tokens.
| Method | Per-query latency | When it makes sense |
|---|---|---|
| BM25 | 1 to 10 ms | Always run for keyword matching |
| Bi-encoder ANN | 10 to 100 ms | First-stage dense retrieval |
| Cross-encoder over top-100 | 100 ms to 1 second on GPU | Default reranker for production RAG |
| Cross-encoder over top-1000 | 1 to 5 seconds on GPU | Higher recall, slower endpoint |
| ColBERT MaxSim over top-1000 | 50 to 200 ms | Mid-tier between bi- and cross-encoder |
| LLM listwise (GPT-4 over 20 docs) | 2 to 10 seconds | Highest quality, batch or async pipelines |
Quality gains compound through the stack. Genzeon's hybrid plus rerank pipeline reports MRR@3 jumping from 0.433 to 0.605 (a 39.7 percent relative improvement) once a cross-encoder is added on top of fused BM25 plus dense retrieval.[2] Pinecone, Databricks, and Cohere have all published similar numbers, showing cross-encoder rerankers lifting NDCG@10 by 5 to 15 points over the best first-stage method alone. The exact size of the lift depends on how good the first stage already is. If BM25 already returns the right answer at rank 1, the reranker has little to add. If the relevant passage is buried at rank 47 because it has weak lexical overlap, reranking is what saves the query.
Re-ranking is evaluated with the same offline metrics as ranking in general: NDCG@k, MRR, MAP, Precision@k, and Recall@k (the last is more useful for the retrieval stage). Common public benchmarks include MS MARCO passage and document tracks, the TREC Deep Learning tracks (DL19, DL20, DL21, DL22), and BEIR, the heterogeneous zero-shot benchmark covering 18 datasets. BEIR results are reported as average NDCG@10 across the included datasets, and they are the standard reference point for both first-stage retrievers and rerankers.
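Reference implementations of two of these metrics, using the linear-gain DCG variant (some toolkits use the exponential 2^rel − 1 gain instead):

```python
import math

def mrr(ranked_rels):
    """Mean reciprocal rank over queries.

    ranked_rels: one list per query of binary relevance labels,
    in the order the system ranked the documents.
    """
    total = 0.0
    for rels in ranked_rels:
        for i, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / i
                break
    return total / len(ranked_rels)

def ndcg_at_k(rels, k=10):
    """NDCG@k for one query, graded relevance labels in ranked order.

    Uses linear gain (gain = rel); some toolkits use 2**rel - 1.
    """
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))
    ideal = sorted(rels, reverse=True)[:k]
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

print(mrr([[0, 1, 0], [1, 0, 0]]))   # (1/2 + 1) / 2 = 0.75
print(ndcg_at_k([3, 2, 0, 1], k=10))
```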
For RAG-specific systems, downstream answer-quality metrics also matter. A reranker that lifts NDCG by two points but increases the share of incorrect answers because it surfaces a tempting but wrong distractor is a net loss for the application. Teams running RAG often pair retrieval metrics with end-to-end accuracy metrics on a curated answer set.
Not every reranker is a neural model. Two non-neural approaches still appear in production.
Reciprocal rank fusion (RRF), introduced by Cormack, Clarke, and Büttcher at SIGIR 2009, is the standard way to combine multiple ranked lists into one.[10] The RRF score for a document is the sum of 1 / (k + rank_i) across each input list, where k is a small constant (the paper uses 60). RRF throws away the underlying scores and works only with positions, which makes it robust to systems that produce scores on incompatible scales. It is the default fusion strategy in OpenSearch, Elasticsearch, Azure Cognitive Search, and most hybrid retrieval libraries.
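A direct implementation of the formula:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document IDs with RRF.

    score(d) = sum over lists of 1 / (k + rank of d in that list),
    with ranks starting at 1; k=60 follows the original paper.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d3", "d1", "d7", "d2"]
dense = ["d1", "d9", "d3", "d4"]
print(reciprocal_rank_fusion([bm25, dense]))  # d1 and d3 rise to the top
```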
Pseudo-relevance feedback (PRF) treats the top-K results from the first retrieval as if they were known to be relevant, then expands the query with terms drawn from those documents and runs a second retrieval. RM3 is the most common PRF method on top of BM25, and neural variants like ANCE-PRF and ColBERT-PRF apply the same idea to dense retrieval. PRF is technically a re-ranking step because the second retrieval reorders the candidate set in light of feedback, even if there is no separate scoring model.
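A deliberately simplified PRF sketch: real RM3 weights expansion terms by relevance-model probability rather than raw frequency, so treat this as the shape of the idea only.

```python
from collections import Counter

def prf_expand(query, top_docs, num_terms=10):
    """Pseudo-relevance feedback, heavily simplified: treat the top-K
    retrieved documents as relevant, pull their most frequent terms,
    and append them to the query for a second retrieval pass.
    """
    query_terms = set(query.lower().split())
    counts = Counter()
    for doc in top_docs:
        counts.update(doc.lower().split())
    expansion = [t for t, _ in counts.most_common() if t not in query_terms]
    return query + " " + " ".join(expansion[:num_terms])
```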
Most reranker work in 2025 is done through one of these libraries.
| Stack | Reranker class or component | Typical use |
|---|---|---|
| sentence-transformers | CrossEncoder | Local cross-encoder inference; fine-tuning on custom data |
| LangChain | CohereRerank, JinaRerank, FlashrankRerank (in langchain_community) | Wrap a base retriever with ContextualCompressionRetriever |
| LlamaIndex | CohereRerank, SentenceTransformerRerank, LLMRerank node post-processors | Plug into a QueryEngine between retrieval and synthesis |
| Haystack 2 | TransformersSimilarityRanker, CohereRanker, JinaRanker, LostInTheMiddleRanker | Drop-in pipeline component |
| RankLLM | RankZephyr, RankGPT, FIRST | LLM listwise reranking with sliding window |
| FlagEmbedding | FlagReranker, FlagLLMReranker | Inference for BGE rerankers |
In LangChain the wiring is short. Build a base vector retriever, instantiate CohereRerank, then wrap them with ContextualCompressionRetriever(base_compressor=cohere, base_retriever=vector). Every query first runs through the vector retriever, the candidate documents flow into the rerank API, and the wrapped retriever returns the reordered top-N. LlamaIndex follows the same pattern with a node post-processor list on a QueryEngine.[11]
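A sketch of that wiring: import paths follow recent LangChain packaging (langchain_cohere, langchain_community) and may shift between versions, and the embedding model is an arbitrary choice, not a requirement.

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Base vector retriever over a small FAISS index.
vectorstore = FAISS.from_texts(
    [
        "Re-ranking rescores candidates.",
        "BM25 is lexical.",
        "ColBERT is late interaction.",
    ],
    OpenAIEmbeddings(),
)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

# Wrap the retriever so every query is reranked by the Cohere endpoint.
compressor = CohereRerank(model="rerank-v3.5", top_n=3)
retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever,
)

docs = retriever.invoke("what does a reranker do")
```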
The cross-encoder design has structural limits. Document representations cannot be pre-computed, so cost is paid at query time and grows linearly with the candidate count. This is acceptable for K of 20 to 100 but painful for K of 1000 or more. Late-interaction models like ColBERT trade some accuracy for the ability to pre-compute, and learning-to-rank with light features stays useful when latency budgets are very tight.
Domain shift is another headache. A reranker trained on MS MARCO web passages can degrade on legal contracts, medical literature, or internal company documents. Fine-tuning on in-domain pairs (often generated synthetically from the corpus) usually fixes this, but it requires labeled or semi-labeled data and a non-trivial training step. Cohere, Voyage, Jina, and BAAI release multilingual checkpoints partly to amortize this problem across users.
LLM rerankers introduce token cost and latency. A RankGPT pass over 20 candidates with GPT-4 routinely takes several seconds and costs more than a typical chat completion. They also fail in interesting ways: outputs can be malformed, the model can refuse to rank, or it can hallucinate IDs that were not in the input list. RankZephyr and other distilled open models reduce cost but still need careful output parsing.
Finally, optimizing rerankers in isolation can hide problems with the first stage. If the first-stage retrieval misses the relevant document entirely, no reranker can save it, since the document is not in the candidate set. End-to-end recall at the retrieval stage is therefore as important as precision at the rerank stage, and tuning one without the other tends to give misleading results.
Classical IR systems used multi-stage ranking long before neural networks were involved. Bing and Yahoo's web search stacks in the late 2000s ran an inverted-index recall stage, an L1 ranker (often LambdaMART) on a few thousand candidates, and an L2 ranker with richer features and click signals on the top hundred. The split was driven by the same arithmetic as today: full feature extraction over the entire web is impossible, so cheap recall comes first.
The neural era started with Nogueira and Cho's monoBERT in 2019, which showed that a transformer cross-encoder could move the MS MARCO state of the art by a large margin.[4] Reimers and Gurevych's Sentence-BERT in the same year provided the bi-encoder counterpart that made retrieval cheap enough to feed those rerankers.[1] ColBERT (2020) and monoT5 (2020) explored late interaction and seq2seq formulations.[5][6] By 2023, RankGPT had shown that LLM listwise prompting could match or beat dedicated rerankers, and RankZephyr had distilled that capability into open-weight models.[7][8] Cohere shipped its first commercial Rerank model in 2023, with rerank-v3.5 (the multilingual default) following in December 2024. Voyage launched rerank-2 in September 2024, and Jina released its multilingual reranker around the same time.
The direction of travel is toward longer context, more languages, and tighter integration with hybrid retrieval. Whether the future belongs to specialized cross-encoders, late-interaction models, or just very capable general LLMs remains an open question. In practice most production systems still combine all three: BM25 plus dense for recall, a fast cross-encoder for the bulk of the rerank, and an LLM (or LLM-based judge) for the final handful of candidates that go into a generated answer.
Imagine a giant library with a million books. You ask the librarian for books about dinosaurs. The librarian quickly grabs 100 books with the word dinosaur in the title. That is fast but rough; some of those books are coloring books, some are romance novels with a dinosaur on the cover. Then a paleontologist comes over and looks through the 100 books one by one, ranking them by how useful they are for what you actually want. The paleontologist is slow but careful. That second pass is re-ranking. The librarian gives you breadth, the paleontologist gives you precision, and together they hand you the ten best books in the library without anyone having to read all million.