Reranker

Updated Jul 23, 2026

A reranker (also called a cross-encoder reranker or rerank model) is a neural model used in retrieval-augmented generation (RAG) and information retrieval pipelines to re-score an initial set of candidate documents returned by a fast first-stage retriever. Rerankers take a (query, candidate) pair as joint input and produce a single relevance score, in contrast to embedding models (bi-encoders), which encode query and document independently into vectors that are later compared by cosine similarity.^[1]^[2] The dominant rerank architecture is a cross-encoder built on a transformer encoder such as BERT, DeBERTa, or MiniLM, fine-tuned on (query, relevant passage, hard negative) triples drawn from datasets like MS MARCO.^[3]^[4] In production RAG systems, the reranker sits between a fast recall stage (often BM25 combined with dense embedding vector search) and the generative large language model, typically narrowing 50 to 200 candidates down to the 5 to 20 most relevant passages that fit within the model's context window.^[5]^[6]

Commercial rerank APIs are offered by Cohere, Voyage AI, Mixedbread, and Jina AI, while open-source rerankers such as BAAI's bge-reranker-v2-m3 and the cross-encoder/ms-marco-MiniLM-L6-v2 model are widely deployed through the Sentence Transformers library.^[7]^[8]^[9]^[10]^[11] On the MTEB-adjacent BEIR benchmark, rerankers and late-interaction models such as ColBERT consistently top the leaderboard for zero-shot retrieval, at the cost of orders of magnitude more compute per query than bi-encoder retrieval.^[12]

Background

Text retrieval before 2019 was dominated by two limiting choices. The first was lexical matching with BM25, which is fast and a strong baseline but cannot capture synonymy or paraphrase. The second was deep cross-attention models like BERT that scored query-document pairs jointly. The joint scoring approach was extremely accurate but computationally infeasible at corpus scale because every query required a fresh forward pass over every candidate document. Reimers and Gurevych quantified this gap in their 2019 Sentence-BERT paper: finding the most similar pair in a collection of 10,000 sentences using BERT in pairwise mode required roughly 50 million inference computations and approximately 65 hours on a V100 GPU, while precomputed BERT-derived sentence embeddings reduced the same task to about 5 seconds with minor accuracy loss.^[1]

That observation set up the division of labor that defines modern neural retrieval. A bi-encoder, also called a dual encoder, two-tower model, or sentence transformer, encodes queries and documents independently into fixed-dimensional vectors and compares them with cosine similarity or dot product. Because document vectors are computed once at indexing time and queried with nearest-neighbor search, bi-encoders scale to large corpora and are compatible with approximate nearest neighbor indexes inside vector database systems.^[1]^[2] A cross-encoder, by contrast, ingests the concatenated query and candidate as a single transformer input, applies full self-attention across both halves, and emits a scalar relevance score. Cross-encoders cannot precompute document representations, so their cost scales with the number of candidates they score at query time, but they retain the joint reasoning that simple vector similarity loses.^[1]^[2]

The decisive demonstration that BERT-style cross-encoders could win at passage ranking came from Nogueira and Cho, who in January 2019 posted "Passage Re-ranking with BERT" on arXiv. Their reimplementation of BERT-Large for query-passage scoring took the top entry on the MS MARCO passage retrieval leaderboard and outperformed the previous state of the art by 27% relative in MRR@10.^[13] That paper, often referred to in subsequent literature as monoBERT, established the standard template that today's commercial rerankers still follow: tokenize the query and passage with a [SEP] token between them, pass the joint sequence through a pretrained transformer, take a pooled representation (the [CLS] token activation), and apply a linear scoring head trained on relevance labels.^[13]

The follow-up question of how to deploy such models without paying their full latency cost on every document produced two architectural responses. The first response, which the field eventually consolidated around, was two-stage retrieval: keep cross-encoders strictly for reranking a small candidate pool produced by a faster first stage. The second response, introduced by Khattab and Zaharia at SIGIR 2020, was ColBERT, a "late interaction" model that retains token-level interaction while preserving the precomputability of document representations. ColBERT encodes the query and document separately into per-token contextualized embeddings, then scores each query token against the maximum-similarity document token (the MaxSim operation) and sums those scores. Khattab and Zaharia reported that ColBERT was two orders of magnitude faster than BERT-based ranking and required up to four orders of magnitude fewer FLOPs per query, while preserving competitive effectiveness.^[14]

The BEIR benchmark, introduced by Thakur, Reimers, Rücklé, Srivastava, and Gurevych in April 2021 and accepted at the NeurIPS 2021 Datasets and Benchmarks Track, provided the empirical settlement to the bi-encoder versus cross-encoder debate. BEIR aggregated 18 publicly available text retrieval datasets across diverse domains and evaluated 10 retrieval systems spanning lexical, sparse, dense, late-interaction, and reranking architectures in a zero-shot setting. The authors reported that "re-ranking and late-interaction based models on average achieve the best zero-shot performances," while purely dense bi-encoder models, despite their efficiency, often underperformed BM25 on out-of-domain queries.^[12] That headline finding has driven the now-standard production pattern in which a fast first stage feeds a heavier reranker.

How a cross-encoder reranker works

A cross-encoder reranker is conceptually simple. Given a user query q and a candidate document d, the model forms a joint input string of the form [CLS] q [SEP] d [SEP], tokenizes it with the transformer's subword tokenizer, and runs the resulting token sequence through a pretrained encoder. The encoder applies multi-head self-attention across the combined sequence so that every query token attends to every document token and vice versa. After the final transformer layer, the model takes a pooled representation, most commonly the activation of the [CLS] token, and feeds it into a single linear layer (a "scoring head") that emits a real-valued logit. That logit can be used directly to sort candidates, or it can be passed through a sigmoid to produce a probability of relevance in [0, 1].^[9]^[13]

In code, the Sentence Transformers library exposes this pattern through its CrossEncoder interface. For example, calling CrossEncoder('cross-encoder/ms-marco-MiniLM-L6-v2').predict([(query, doc1), (query, doc2)]) returns a list of scores such as [8.607, -4.320], where higher values indicate greater relevance.^[9] The MiniLM-L6 backbone has 22.7M parameters and runs at roughly 1,800 query-document pairs per second on a single V100 GPU, which is fast enough for production reranking of moderate candidate pools but several orders of magnitude slower than embedding-based ANN search.^[9]
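As a minimal sketch of the post-processing step described above, the snippet below sorts candidates by raw logit and maps each logit through a sigmoid. The logits are hard-coded to the example values rather than produced by a model, and the helper name rank_by_logit and the document labels are hypothetical; nothing here is part of the Sentence Transformers API.

```python
import math

def sigmoid(logit: float) -> float:
    """Map a real-valued logit to a relevance probability in (0, 1)."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    # Numerically stable branch for large negative logits.
    z = math.exp(logit)
    return z / (1.0 + z)

def rank_by_logit(docs, logits):
    """Return (doc, logit, probability) tuples, most relevant first."""
    scored = [(doc, logit, sigmoid(logit)) for doc, logit in zip(docs, logits)]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hard-coded logits matching the example scores quoted above (assumed inputs).
ranked = rank_by_logit(["doc2", "doc1"], [-4.320, 8.607])
for doc, logit, prob in ranked:
    print(f"{doc}: logit={logit:+.3f} probability={prob:.4f}")
```

Because the sigmoid is monotonic, sorting by logit and sorting by probability give the same order; the probability form mainly helps when a fixed relevance threshold is needed.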

Three properties differentiate this from a bi-encoder. First, the reranker performs full joint attention between query and document tokens, so it can model lexical alignment ("term X in the query matches term Y in the document under context Z") that disappears when each side is collapsed to a single vector. Second, because the model never produces a standalone document embedding, document representations cannot be cached, and every (query, document) pair must be scored fresh at query time. Third, the model emits an absolute relevance estimate per pair rather than a similarity in a shared geometric space, so the score head is typically trained with a regression or binary classification objective rather than a contrastive objective.^[2]^[9]

Training data and losses

Cross-encoders are almost always initialized from a pretrained transformer and fine-tuned on relevance-labeled triples. The MS MARCO passage ranking dataset is the canonical training source. MS MARCO contains roughly 1 million unique real user queries sampled from Bing logs, paired with a corpus of about 8.8 million unique passages, with labeled relevance judgments produced by Microsoft annotators.^[15] Most public reranker checkpoints, including cross-encoder/ms-marco-MiniLM-L6-v2, are trained directly on MS MARCO triples consisting of (query, positive passage, negative passage), where positives come from labeled judgments and negatives are mined from the top results of a first-stage retriever.^[9]^[13]

The Sentence Transformers training framework supports several loss functions for cross-encoder rerankers, each appropriate to a different data format:^[16]

BinaryCrossEntropyLoss for pointwise (query, document, 0/1 label) data, the default for fine-tuning monoBERT-style models.
MultipleNegativesRankingLoss for triplet data of the form (query, positive, negatives), which treats other positives in the batch as additional in-batch negatives.
Listwise ranking losses including RankNetLoss, ListNetLoss, and LambdaLoss for cases where multiple candidates per query are ranked together.
MarginMSELoss for cross-architecture knowledge distillation from a strong cross-encoder teacher.

The MarginMSE loss, introduced by Hofstätter et al. in 2020, plays an important role in producing fast, deployable rerankers. The technique distills from a high-capacity cross-encoder teacher (such as a BERT-Large monoBERT) into smaller students (such as MiniLM or TinyBERT) by training the student to match the score margin between positive and negative pairs rather than the absolute logits. Hofstätter et al. showed that this margin-matching distillation significantly improves the reranking effectiveness of efficient architectures including TK, ColBERT, PreTT, and BERT CLS dot product models without sacrificing their inference speed.^[17]
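The margin-matching idea can be sketched in a few lines of plain Python. This is an illustrative reading of the loss as described above, not the Sentence Transformers implementation; the function name margin_mse and the toy scores are assumptions made for the example.

```python
def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """Mean squared error between student and teacher score margins.

    Each argument is a list of scores, one entry per (query, pos, neg) triple.
    """
    lengths = {len(student_pos), len(student_neg), len(teacher_pos), len(teacher_neg)}
    if len(lengths) != 1:
        raise ValueError("all score lists must have the same length")
    squared_errors = [
        ((sp - sn) - (tp - tn)) ** 2
        for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg)
    ]
    return sum(squared_errors) / len(squared_errors)

# Toy scores invented for illustration. The student uses a different score
# scale than the teacher but reproduces the first margin exactly.
loss = margin_mse(
    student_pos=[2.0, 1.0],
    student_neg=[-1.0, 0.5],
    teacher_pos=[9.0, 6.0],
    teacher_neg=[6.0, 4.0],
)
print(loss)
```

Only the difference between positive and negative scores enters the loss, so a student whose scores are shifted by a constant relative to the teacher's is not penalized.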

Hard negative mining is the other lever that consistently moves reranker quality. Easy negatives, sampled at random from the corpus, give the model little signal because almost all are obviously unrelated. Hard negatives are documents that the first-stage retriever scored highly but that a human (or a teacher model) labeled non-relevant. Training the cross-encoder to distinguish these near-miss candidates from true positives directly aligns the training distribution with the inference distribution, because at deployment the reranker will only ever see candidates that survived the first stage.^[17]^[18]
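A minimal sketch of that mining step, assuming the first-stage results for a query are already available as a ranked list of document ids. The function names, ids, and query are hypothetical and chosen only for illustration.

```python
def mine_hard_negatives(ranked_doc_ids, positive_ids, num_negatives=4):
    """Pick the highest-ranked first-stage results that are not labeled relevant.

    ranked_doc_ids: document ids from the first-stage retriever, best first.
    positive_ids:   ids labeled relevant for this query.
    """
    positives = set(positive_ids)
    negatives = []
    for doc_id in ranked_doc_ids:
        if doc_id in positives:
            continue  # never use a labeled positive as a negative
        negatives.append(doc_id)
        if len(negatives) == num_negatives:
            break
    return negatives

def build_triples(query, ranked_doc_ids, positive_ids, num_negatives=4):
    """Expand one query into (query, positive, hard negative) training triples."""
    negatives = mine_hard_negatives(ranked_doc_ids, positive_ids, num_negatives)
    return [(query, pos, neg) for pos in positive_ids for neg in negatives]

# Toy first-stage ranking; "d2" is the only labeled positive.
first_stage = ["d7", "d2", "d9", "d4", "d1", "d5"]
triples = build_triples("what is a reranker", first_stage, ["d2"], num_negatives=2)
print(triples)
```

This sketch trusts the labels completely. In sparsely labeled collections some high-ranked "negatives" may actually be relevant, which is one reason a teacher model is sometimes used to filter or re-label them.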

Late interaction as a middle ground

ColBERT occupies an architectural middle ground between bi-encoders and full cross-encoders. Rather than collapsing each side to a single vector, ColBERT encodes the query into a set of per-token contextualized embeddings and the document into a similar bag of token embeddings, both produced by a single shared BERT backbone. At scoring time, ColBERT computes the MaxSim score: for each query token, it finds the maximum cosine similarity with any document token, and the total relevance score is the sum of those per-token maxima.^[14]
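The MaxSim computation can be illustrated with toy vectors. The sketch below uses hand-written 2-dimensional "token embeddings" instead of real BERT outputs, and the function names are hypothetical; it shows only the scoring rule, not ColBERT's encoder or index.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the best cosine match among document tokens."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)

# Toy embeddings: two query tokens pointing along different axes.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has a close match for both tokens
doc_b = [[1.0, 0.0], [-1.0, 0.0]]             # matches only the first token

score_a = maxsim_score(query_tokens, doc_a)
score_b = maxsim_score(query_tokens, doc_b)
print(score_a, score_b)
```

Each query token contributes independently, so a document is rewarded for covering every query token somewhere in its text rather than for matching one token very strongly.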

Because document token embeddings can be precomputed and indexed for nearest-neighbor lookup, ColBERT supports end-to-end retrieval from millions of documents without ever running the full transformer at query time. Khattab and Zaharia reported that this design achieved "two orders of magnitude faster" execution than concatenated BERT ranking and "up to four orders of magnitude fewer FLOPs per query" while remaining competitive with cross-encoders on MS MARCO and TREC-CAR.^[14] ColBERTv2, presented at NAACL 2022 by Santhanam, Khattab, Saad-Falcon, Potts, and Zaharia, added centroid-based quantization that reduced index size by 6 to 10 times while improving zero-shot quality on out-of-domain BEIR datasets.^[19] BEIR's own evaluation grouped ColBERT-style models with rerankers as the "late interaction and reranking" category that achieved the best average zero-shot performance among all systems evaluated.^[12]

Comparison to bi-encoders

The trade-offs between bi-encoders and cross-encoder rerankers fall along three axes: latency, quality, and indexing.

Aspect	Bi-encoder (embedding)	Cross-encoder (reranker)	Late interaction (ColBERT)
Joint attention between query and document	No, encoded separately	Yes, full self-attention across concatenated input	Per-token MaxSim after independent encoding
Document representation precomputed	Yes, single vector	No, scored per query	Yes, set of token vectors
Cost scaling at query time	O(1) per candidate after ANN lookup	O(N) full transformer passes for N candidates	O(query_tokens × doc_tokens) similarity ops
Typical use in pipeline	First-stage retrieval	Second-stage reranking of 50 to 200 candidates	End-to-end retrieval or reranking
Storage cost per document	1 vector (e.g., 768 floats)	None (no doc vector)	One vector per document token

Sources for table values include the Sentence-BERT paper for the bi-encoder versus cross-encoder distinction, Pinecone's reranker explainer for the typical top-k values in production pipelines, and the ColBERT paper for the late-interaction characterization.^[1]^[5]^[14]

The empirical quality gap is consistent. cross-encoder/ms-marco-MiniLM-L6-v2 reaches MRR@10 of 39.01 on the MS MARCO Passage Ranking dev set and NDCG@10 of 74.30 on TREC Deep Learning 2019. Its MRR@10 is substantially above the BM25 baseline of roughly 19 on the same dev set and above typical bi-encoder scores in the high 20s.^[9] Rerank checkpoints with higher capacity, such as the 2B-parameter mxbai-rerank-large-v2 from Mixedbread, push average NDCG@10 across BEIR to 57.49, which is materially above strong bi-encoder baselines on the same benchmark.^[10] The price of that quality is a per-query cost that scales linearly with the number of candidates reranked: reranking 100 candidates with a model the size of mxbai-rerank-large-v2 takes roughly 0.9 seconds on a single A100 GPU.^[10]

The practical consequence is that in modern retrieval stacks, bi-encoders and cross-encoders are used together rather than in opposition. The bi-encoder (often alongside BM25 in a hybrid search union) handles the recall stage, returning a candidate set of 50 to 200 documents. The cross-encoder then handles the precision stage, re-scoring that candidate set to surface the top 5 to 20 documents that are passed downstream, typically to a large language model in a retrieval-augmented generation (RAG) application.^[5]^[6]

Commercial rerankers

A handful of vendors offer rerankers as managed APIs, removing the need for users to host their own GPU inference. The category is dominated by four products.

Cohere Rerank

Cohere introduced its Rerank API as a separate endpoint from its embeddings product, positioning it as a drop-in precision booster for any first-stage retriever. The most recent version, Rerank 3.5, launched on December 2, 2024, with a 4,096-token context length per document and stated support for over 100 languages including Arabic, Chinese, English, French, German, Hindi, Japanese, Korean, Portuguese, Russian, and Spanish.^[7]^[20] Cohere reported that Rerank 3.5 reached state-of-the-art performance on the BEIR benchmark and on domain-specific tasks in finance, e-commerce, hospitality, project management, and email retrieval. Internal Cohere benchmarks claimed a 23.4% improvement on financial services datasets compared with hybrid search and a 30.8% improvement compared with BM25 alone.^[7] The launch coincided with a v2 API revision that replaced the prior max_chunks_per_doc parameter with max_tokens_per_doc defaulting to 4,096. Rerank 3.5 is also available through Amazon Bedrock and Azure AI Foundry.^[21]

Voyage AI Rerank

Voyage AI released the rerank-2 series on September 30, 2024, branding rerank-2 as its quality-optimized reranker and rerank-2-lite as a latency-optimized variant. The full rerank-2 model supports a 16,000-token combined context length for a query-document pair, with up to 4,000 tokens reserved for the query, and was evaluated on 51 multilingual datasets spanning 31 languages, including French, German, Japanese, Spanish, Korean, Bengali, Portuguese, and Russian. Voyage AI reported that rerank-2 improves retrieval accuracy by 13.89% on average atop OpenAI's text-embedding-3-large baseline across 93 retrieval datasets, outperforming Cohere v3 by 7.14% and BGE v2-m3 by 15.61% in their internal evaluation.^[22]

On August 11, 2025, Voyage AI followed with rerank-2.5 and rerank-2.5-lite, adding instruction-following capabilities and extending context length to 32,000 tokens (described as "8x that of Cohere Rerank v3.5 and double that of rerank-2"). Voyage AI reported that rerank-2.5 outperformed Cohere Rerank v3.5 by 7.94% on the standard 93-dataset evaluation, by 12.70% on the MAIR benchmark, and by 11.48% in real-world instruction-following accuracy. The instruction-following feature lets callers steer rerank behavior with natural-language prompts that emphasize specific document components, type filters, or query disambiguation contexts.^[23]

Mixedbread Rerank

Mixedbread, a Berlin- and San Francisco-based startup, releases its rerank models as fully open-source under Apache 2.0 in addition to offering them through its hosted API. The mxbai-rerank-large-v2 model, published on June 4, 2025, is a 2B-parameter cross-encoder trained with a three-step reinforcement learning pipeline combining Guided Reinforcement Prompt Optimization (GRPO), contrastive learning, and preference learning. Mixedbread reports a BEIR average NDCG@10 of 57.49, a Mr.TyDi multilingual score of 29.79, and a Chinese score of 84.16, with end-to-end latency of approximately 0.89 seconds on an A100 GPU when reranking 100 candidates. The earlier mxbai-rerank-large-v1 reached NDCG@10 of 48.8 on a subset of 11 BEIR datasets.^[10]^[24]

Jina Reranker

Jina AI launched jina-reranker-v2-base-multilingual on June 25, 2024 as a 278M-parameter cross-encoder with a 1,024-token context length, multilingual coverage including English, German, Spanish, Chinese (simplified and traditional), and Japanese among 26 tested languages, and support for flash attention that Jina reports yields a 3x to 6x speedup over v1. Reported scores include BEIR NDCG@10 of 53.17 across 17 datasets, MKQA NDCG@10 of 54.83 across 26 languages, and MLDR Recall@10 of 68.95 across 13 languages. Jina also positions the model for code retrieval (CodeSearchNet MRR@10 of 71.36), function calling (ToolBench Recall@3 of 77.75), and table search (Recall@3 of 93.31). The Hugging Face checkpoint is licensed CC-BY-NC-4.0, with commercial use through the Jina AI API.^[11]

Open-source rerankers

Beyond the commercial offerings, several open-source rerankers are widely deployed.

BAAI BGE Reranker

The BAAI BGE family includes both embedding models and dedicated rerankers. bge-reranker-v2-m3 is a 0.6B-parameter cross-encoder built on top of the bge-m3 multilingual backbone, released under Apache 2.0. The model accepts query-passage pairs up to 8,192 tokens (with recommended fine-tuning length of 1,024 tokens) and outputs a logit per pair that can be normalized to [0, 1] with a sigmoid. BAAI evaluates it as a reranker over the top 100 results from bge-en-v1.5 large and e5-mistral 7b instruct on BEIR, top 100 from bge-zh-v1.5 large on CMTEB-retrieval, and top 100 from bge-m3 on the MIRACL multilingual benchmark.^[8] The smaller bge-reranker-base and bge-reranker-large checkpoints in the same family are commonly used as default rerankers in open-source RAG stacks because they ship under a permissive license and integrate with the FlagEmbedding and Sentence Transformers libraries.^[8]

MS MARCO MiniLM cross-encoders

The cross-encoder/ms-marco-MiniLM-L6-v2 checkpoint, hosted on Hugging Face under Apache 2.0, has become the default open-source reranker for production deployment because of its favorable speed-quality ratio. The model is a 22.7M-parameter cross-encoder built on microsoft/MiniLM-L12-H384-uncased, fine-tuned on MS MARCO. It reaches MRR@10 of 39.01 on the MS MARCO dev set and NDCG@10 of 74.30 on TREC DL 2019 while processing 1,800 query-document pairs per second on a V100 GPU. The Sentence Transformers documentation lists L-2, L-4, L-6, and L-12 variants for users who want to trade quality for latency.^[9]

Other open-source efforts

The NeuralCherche library packages cross-encoder rerankers with a sklearn-style API for users who prefer not to depend on Sentence Transformers, and academic releases such as the Hofstätter neural-ranking-kd repository provide distilled cross-encoders trained with MarginMSE.^[17] Sebastian Hofstätter's group at TU Wien continues to publish reference implementations of distilled rerankers used as teachers and students in cross-architecture knowledge distillation pipelines.^[17]

Integration with retrieval-augmented generation

In a typical retrieval-augmented generation (RAG) pipeline, the reranker occupies a specific stage between retrieval and generation. Pinecone's reference architecture, which mirrors the pattern used across LangChain, LlamaIndex, and most enterprise RAG stacks, describes the flow this way: a query enters the system; a fast bi-encoder retrieves a candidate set of size k1 (commonly 25 to 200) from a vector database; an optional sparse retriever such as BM25 contributes a parallel candidate set; the union is reranked by a cross-encoder to produce a smaller candidate set of size k2 (commonly 3 to 20); and the surviving candidates are passed as context to a large language model for generation. Pinecone's example reranks from a candidate pool of 25 down to the top 3 for the LLM.^[5]
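The flow above can be sketched end to end with placeholder components. In the snippet below the two first-stage result lists are hard-coded, and toy_overlap_score is a stand-in for a real cross-encoder; all function names, passages, and the query are hypothetical and do not come from Pinecone, LangChain, or LlamaIndex.

```python
def two_stage_retrieve(query, dense_hits, sparse_hits, rerank_score, k2=3):
    """Union two first-stage candidate lists, rerank them, keep the top k2.

    dense_hits / sparse_hits: passages from the two first-stage retrievers.
    rerank_score: callable (query, passage) -> float, standing in for a
                  cross-encoder reranker.
    """
    # Union while dropping duplicates and preserving first-seen order.
    candidates = list(dict.fromkeys(dense_hits + sparse_hits))
    scored = [(rerank_score(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:k2]]

def toy_overlap_score(query, passage):
    """Placeholder scorer: fraction of query words that appear in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words)

dense = [
    "rerankers score query document pairs jointly",
    "vector databases store embeddings",
    "bm25 is a lexical ranking function",
]
sparse = [
    "bm25 is a lexical ranking function",
    "a reranker narrows candidates before generation",
]
context = two_stage_retrieve(
    "how do rerankers score query document pairs", dense, sparse, toy_overlap_score, k2=2
)
print(context)
```

In a real deployment, rerank_score would wrap a batched model call, and the returned passages would be inserted into the LLM prompt.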

The benefit of inserting a reranker is two-sided. First, retrieval recall is preserved or improved because the first stage can over-fetch with high recall settings without worrying that low-quality candidates will pollute the LLM prompt. Second, the LLM's effective signal-to-noise ratio rises because the small number of passages it sees are the most relevant ones in the corpus, not merely the ones that scored well on cosine similarity. Recent measurements in the research literature have reported that adding a reranker to a two-stage RAG pipeline improves mean NDCG@10 by up to 5.4 percentage points, raises end-to-end generation accuracy by 6 to 8 percentage points, and reduces context tokens by approximately 35% relative to a single-stage retrieval baseline.^[6]

A second integration consideration is candidate budget. Because rerankers run in O(N) full transformer passes, doubling the candidate count doubles the latency of the rerank stage. Most production deployments size the candidate pool to balance the marginal recall benefit of more candidates against the latency cost of scoring each one. For a typical workload with sub-second latency budgets and a reranker like cross-encoder/ms-marco-MiniLM-L6-v2 running at 1,800 pairs per second per GPU, reranking 100 to 200 candidates fits comfortably in the budget; reranking 1,000 candidates does not.^[9]
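The budget arithmetic can be made explicit. This sketch assumes throughput stays constant at the quoted 1,800 pairs per second and ignores tokenization, batching, and network overhead; the 100 ms budget and the function names are hypothetical values chosen for illustration.

```python
import math

def rerank_latency_ms(num_candidates, pairs_per_second):
    """Estimated rerank time in milliseconds at a constant throughput."""
    return 1000.0 * num_candidates / pairs_per_second

def max_candidates(budget_ms, pairs_per_second):
    """Largest candidate pool that fits inside a rerank latency budget."""
    return math.floor(budget_ms * pairs_per_second / 1000)

THROUGHPUT = 1800  # pairs per second, the figure quoted for the MiniLM-L6 model

for n in (100, 200, 1000):
    print(f"{n} candidates: {rerank_latency_ms(n, THROUGHPUT):.1f} ms")

# Hypothetical 100 ms budget for the rerank stage alone.
print(max_candidates(100, THROUGHPUT))
```

Under these assumptions 1,000 candidates take more than half a second for the rerank stage alone, which leaves little room for retrieval and generation inside a sub-second end-to-end budget.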

A third consideration is heterogeneity of input. Some commercial rerankers, including Cohere Rerank 3.5 and Voyage rerank-2.5, advertise support for structured inputs such as JSON, tables, code, and emails as well as long passages.^[7]^[23] These rerankers can be used to score not just text passages but also chunks of code (in ColBERT-style code retrieval), function definitions for agentic tool use, or rows of tabular data.

Performance on standard benchmarks

The BEIR benchmark from Thakur et al. is the most widely cited test bed for measuring reranker quality across heterogeneous retrieval tasks. BEIR aggregates 18 datasets including TREC-COVID, NFCorpus, NQ, HotpotQA, FiQA, ArguAna, Quora, FEVER, and others, evaluated in a zero-shot setting where the model is trained on MS MARCO and tested on the remaining datasets without any in-domain fine-tuning.^[12] In BEIR's headline analysis, lexical BM25 served as a strong baseline that dense bi-encoder models often failed to beat in zero-shot settings, while reranking and late-interaction models including monoBERT and ColBERT consistently scored highest on average.^[12]

Specific reported scores on BEIR average NDCG@10 include:

Model	BEIR avg NDCG@10	Reference
BM25 (lexical baseline)	~42.3	Thakur et al. 2021 BEIR paper^[12]
Mixedbread mxbai-rerank-large-v2 (2B params)	57.49	Mixedbread model card 2025^[10]
Jina Reranker v2 base multilingual (278M params)	53.17	Jina model card 2024^[11]
Mixedbread mxbai-rerank-large-v1	48.8 (11-dataset subset)	Mixedbread v1 docs^[24]

MS MARCO Passage Ranking is the corresponding in-domain training benchmark; reported MRR@10 for cross-encoder rerankers includes 39.01 for cross-encoder/ms-marco-MiniLM-L6-v2 and 36.5 for the smaller MiniLM-L2 checkpoint, both compared with a BM25 baseline of roughly 19 on the same set.^[9] TREC Deep Learning 2019, derived from MS MARCO, is the corresponding test set; NDCG@10 of 74.30 is the reference number for the L6 MiniLM cross-encoder.^[9]
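For readers unfamiliar with the two metrics, the following sketch computes them for a single query. It assumes the ranked list contains every judged document for the query and uses linear gain in the DCG sum; evaluation toolkits differ on both points, so treat it as illustrative rather than a reference implementation. Reported scores such as 39.01 are these values averaged over all queries and multiplied by 100.

```python
import math

def reciprocal_rank_at_k(ranked_relevance, k=10):
    """1/rank of the first relevant result in the top k, or 0.0 if none."""
    for rank, rel in enumerate(ranked_relevance[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))

def ndcg_at_k(ranked_gains, k=10):
    """DCG of the given order divided by DCG of the ideal order."""
    ideal = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

def mean(values):
    return sum(values) / len(values)

# Toy relevance labels in ranked order, one list per query.
queries = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
mrr = mean([reciprocal_rank_at_k(labels) for labels in queries])
ndcg = ndcg_at_k([0, 2, 1])
print(mrr, ndcg)
```

MRR@10 rewards placing the first relevant passage as high as possible, while NDCG@10 also credits graded relevance across the whole top 10.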

A standard practical observation across all these benchmarks is that bi-encoder embedding quality has narrowed the gap with rerankers since 2021, particularly for in-domain queries. However, on out-of-domain or domain-specific data (legal, finance, medical, code), adding a reranker continues to yield large gains because the cross-encoder can attend to fine-grained lexical and contextual cues that a generic embedding model encodes only coarsely. This is the empirical observation that Cohere, Voyage AI, and Mixedbread all cite when justifying domain-specific rerank variants.^[7]^[22]^[10]

Cost and latency

A reranker is the most expensive stage of a typical retrieval pipeline on a per-query basis. Sizing this cost is part of the deployment decision. The relevant figures are:

A 22.7M-parameter MiniLM cross-encoder processes about 1,800 query-document pairs per second on a single V100 GPU, so reranking 100 candidates costs about 56 milliseconds.^[9]
A 0.6B-parameter bge-reranker-v2-m3 processes substantially fewer pairs per second at full precision but supports fp16 and bf16 inference with minimal quality loss.^[8]
A 2B-parameter mxbai-rerank-large-v2 reranks 100 candidates in approximately 0.89 seconds on an A100 GPU.^[10]
Commercial rerank APIs price by tokens. Cohere's Rerank 3.5 on AWS Bedrock and Azure Foundry, Voyage AI's rerank-2.5 on the Voyage API, and Jina AI's Reranker on the Jina API all charge per query plus per document token, with exact prices that vary by region and provider but are in the $0.001 to $0.01 per query range for typical RAG workloads of 50 to 200 candidates.^[21]^[23]^[11]

Three deployment patterns reduce reranker cost in practice. First, candidate budget control: production systems aggressively trim the candidate pool before rerank, often to 20 to 50 candidates rather than 100 to 200, accepting a small recall hit for substantial latency savings. Second, multi-stage cascading, in which a small distilled reranker scores 200 candidates and a larger reranker rescores only the top 20. Third, knowledge distillation: deploying a 22M-parameter MiniLM cross-encoder distilled from a Cohere or BGE teacher gives most of the quality at a fraction of the cost, the same trade-off that motivated Hofstätter et al.'s MarginMSE technique in the first place.^[17]
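The cascading pattern can be sketched with two pluggable scorers. Here both scorers are toy functions that read a number out of the document id, and the call counters exist only to show how many expensive calls the cascade makes; the names and the 200/20/5 sizes are illustrative assumptions.

```python
def top_k(query, candidates, score_fn, k):
    """Score every candidate once and return the k best, highest first."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

def cascade_rerank(query, candidates, cheap_score, expensive_score, keep=20, final_k=5):
    """Shortlist with a cheap scorer, then rescore only the shortlist."""
    shortlist = top_k(query, candidates, cheap_score, keep)
    return top_k(query, shortlist, expensive_score, final_k)

calls = {"cheap": 0, "expensive": 0}

def cheap_score(query, doc):
    """Toy stand-in for a small distilled reranker: coarse, bucketed signal."""
    calls["cheap"] += 1
    return int(doc[3:]) // 10

def expensive_score(query, doc):
    """Toy stand-in for a large reranker: exact signal."""
    calls["expensive"] += 1
    return int(doc[3:])

docs = [f"doc{i}" for i in range(200)]
result = cascade_rerank("any query", docs, cheap_score, expensive_score)
print(result, calls)
```

The expensive scorer runs 20 times instead of 200, at the risk that the cheap stage drops a document the larger model would have ranked highly.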

Limitations and open problems

Several limitations of current rerankers are well documented in the literature.

Cross-encoders inherit the input length limit of their underlying transformer. ms-marco-MiniLM-L6-v2 truncates at 512 tokens; jina-reranker-v2-base-multilingual at 1,024 tokens; Cohere Rerank 3.5 at 4,096 tokens; Voyage rerank-2 at 16,000 tokens; Voyage rerank-2.5 at 32,000 tokens.^[7]^[11]^[22]^[23] When a document exceeds the limit, callers must either truncate (losing information from the tail) or split the document into windows and aggregate, both of which introduce noise. Long-context rerankers reduce this problem but cost more per pair.
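The split-and-aggregate option can be sketched as follows. The window size, stride, and max-aggregation are assumptions chosen for illustration (other aggregations such as mean or first-window are also used), and count_query_terms is a toy stand-in for a reranker call. In practice the window must also leave room for the query tokens within the model's limit.

```python
def split_into_windows(tokens, window_size=512, stride=256):
    """Split a token list into overlapping windows that cover the whole list."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break
        start += stride
    return windows

def score_long_document(query, tokens, score_fn, window_size=512, stride=256):
    """Score every window separately and keep the best window score."""
    windows = split_into_windows(tokens, window_size, stride)
    return max(score_fn(query, window) for window in windows)

def count_query_terms(query_terms, window):
    """Toy scorer: number of window tokens that are query terms."""
    return sum(1 for token in window if token in query_terms)

# The only relevant token sits near the end of the document.
tokens = ["filler"] * 8 + ["reranker", "filler"]
query_terms = {"reranker"}

truncated = count_query_terms(query_terms, tokens[:4])
windowed = score_long_document(query_terms, tokens, count_query_terms, window_size=4, stride=2)
print(truncated, windowed)
```

Truncation misses the relevant token entirely, while windowing finds it at the cost of one scoring call per window.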

Multilinguality remains uneven. While bge-reranker-v2-m3, jina-reranker-v2-base-multilingual, Cohere Rerank 3.5, and Voyage rerank-2.5 all claim multilingual support, evaluation on non-English benchmarks consistently shows a quality gap relative to English. Mr.TyDi NDCG@10 of 29.79 for mxbai-rerank-large-v2, compared with its 57.49 BEIR English average, illustrates the gap.^[10]

Domain shift between MS MARCO training data and downstream tasks (medical, legal, code, agentic tool calling) reduces zero-shot quality. The BEIR analysis explicitly flagged this as a generalization problem for both bi-encoders and rerankers, although rerankers degrade less.^[12] Recent rerankers have responded with broader training data: Cohere advertises specific gains on finance, e-commerce, and hospitality data; Voyage on technical documentation, law, and medical; Jina on code and structured tools.^[7]^[23]^[11]

Instruction-following is an emerging capability. Until 2025, rerankers scored a (query, document) pair without any explicit notion of what kind of relevance to prefer. Voyage's rerank-2.5 was the first major commercial reranker to accept natural-language instructions alongside the query, letting callers specify "prefer documents that emphasize section X" or "filter to documents of type Y" without retraining.^[23] Whether instruction-following becomes standard across the category, and whether instruction-tuned rerankers are sufficiently controllable for safety-relevant applications, remains an open question.

Finally, the field has not converged on standard evaluation protocols for reranker quality in RAG end-to-end. BEIR measures retrieval quality with NDCG@10 and MRR@10 on text retrieval, but downstream generation quality depends on how well the reranker's chosen passages support the LLM's answer. Benchmarks such as MAIR, AirBench, and BEIR-extended variants have appeared to fill this gap, but each measures a slightly different notion of relevance, and comparisons across them require care.^[23]^[11]

References

1. Nils Reimers and Iryna Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", arXiv preprint arXiv:1908.10084, 2019-08-27. https://arxiv.org/abs/1908.10084. Accessed 2026-05-25.
2. Sentence Transformers documentation, "Cross-Encoders" pretrained models page, sbert.net, 2024. https://www.sbert.net/docs/cross_encoder/pretrained_models.html. Accessed 2026-05-25.
3. Hugging Face model hub, "cross-encoder/ms-marco-MiniLM-L6-v2", 2023. https://huggingface.co/cross-encoder/ms-marco-MiniLM-L6-v2. Accessed 2026-05-25.
4. Sentence Transformers documentation, "Cross-Encoder Training Overview", sbert.net, 2024. https://sbert.net/docs/cross_encoder/training_overview.html. Accessed 2026-05-25.
5. Pinecone, "Rerankers and Two-Stage Retrieval", Pinecone Learn series, 2023. https://www.pinecone.io/learn/series/rag/rerankers/. Accessed 2026-05-25.
6. Anonymous authors, "Enhancing Retrieval-Augmented Generation with Two-Stage Retrieval: FlashRank Reranking and Query Expansion", arXiv preprint arXiv:2601.03258, 2026. https://arxiv.org/abs/2601.03258. Accessed 2026-05-25.
7. Cohere, "Announcing Rerank-v3.5", Cohere changelog, 2024-12-02. https://docs.cohere.com/changelog/rerank-v3.5. Accessed 2026-05-25.
8. BAAI, "BAAI/bge-reranker-v2-m3 model card", Hugging Face, 2024. https://huggingface.co/BAAI/bge-reranker-v2-m3. Accessed 2026-05-25.
9. Hugging Face model hub, "cross-encoder/ms-marco-MiniLM-L6-v2 model card", 2023. https://huggingface.co/cross-encoder/ms-marco-MiniLM-L6-v2. Accessed 2026-05-25.
10. Mixedbread AI, "mixedbread-ai/mxbai-rerank-large-v2 model card", Hugging Face, 2025-06-04. https://huggingface.co/mixedbread-ai/mxbai-rerank-large-v2. Accessed 2026-05-25.
11. Jina AI, "jinaai/jina-reranker-v2-base-multilingual model card", Hugging Face, 2024-06-25. https://huggingface.co/jinaai/jina-reranker-v2-base-multilingual. Accessed 2026-05-25.
12. Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych, "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models", arXiv preprint arXiv:2104.08663, 2021-04-17. https://arxiv.org/abs/2104.08663. Accessed 2026-05-25.
13. Rodrigo Nogueira and Kyunghyun Cho, "Passage Re-ranking with BERT", arXiv preprint arXiv:1901.04085, 2019-01-13. https://arxiv.org/abs/1901.04085. Accessed 2026-05-25.
14. Omar Khattab and Matei Zaharia, "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT", arXiv preprint arXiv:2004.12832, 2020-04-27. https://arxiv.org/abs/2004.12832. Accessed 2026-05-25.
15. Microsoft Research, "MS MARCO: Dataset overview", microsoft.github.io/msmarco, 2018. https://microsoft.github.io/msmarco/. Accessed 2026-05-25.
16. Sentence Transformers documentation, "Cross-Encoder Training Overview: losses", sbert.net, 2024. https://sbert.net/docs/cross_encoder/training_overview.html. Accessed 2026-05-25.
17. Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, and Allan Hanbury, "Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation", arXiv preprint arXiv:2010.02666, 2020-10-06. https://arxiv.org/abs/2010.02666. Accessed 2026-05-25.
18. Michael Brenndoerfer, "Reranking: Cross-Encoders for Precise Information Retrieval", mbrenndoerfer.com, 2024. https://mbrenndoerfer.com/writing/reranking-cross-encoders-information-retrieval. Accessed 2026-05-25.
19. Keshav Santhanam, Omar Khattab, Jon Saad-Falcon, Christopher Potts, and Matei Zaharia, "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction", arXiv preprint arXiv:2112.01488, 2021-12-02. https://arxiv.org/abs/2112.01488. Accessed 2026-05-25.
20. Cohere, "Rerank 3.5 model card on Amazon Bedrock", AWS documentation, 2024. https://docs.aws.amazon.com/bedrock/latest/userguide/model-card-cohere-rerank-3-5.html. Accessed 2026-05-25.
21. Amazon Web Services, "Cohere Rerank 3.5 is now available in Amazon Bedrock through Rerank API", AWS Machine Learning Blog, 2024-12-02. https://aws.amazon.com/blogs/machine-learning/cohere-rerank-3-5-is-now-available-in-amazon-bedrock-through-rerank-api/. Accessed 2026-05-25.
22. Voyage AI, "rerank-2 and rerank-2-lite: the next generation of Voyage multilingual rerankers", Voyage AI blog, 2024-09-30. https://blog.voyageai.com/2024/09/30/rerank-2/. Accessed 2026-05-25.
23. Voyage AI, "rerank-2.5 and rerank-2.5-lite: instruction-following rerankers", Voyage AI blog, 2025-08-11. https://blog.voyageai.com/2025/08/11/rerank-2-5/. Accessed 2026-05-25.
24. Mixedbread AI, "mxbai-rerank-large-v1 Documentation", mixedbread.com, 2024. https://www.mixedbread.com/docs/reranking/mxbai-rerank-large-v1. Accessed 2026-05-25.
