Cross-encoder
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,229 words
A cross-encoder is a neural network architecture for scoring the relevance of a query and a candidate document by feeding them jointly into a single transformer, producing a scalar relevance score. Unlike a bi-encoder (also called dual-encoder), which embeds the query and each document independently and compares the resulting vectors with an inner product or cosine similarity, the cross-encoder concatenates the two texts (typically as [CLS] query [SEP] document [SEP]) and lets self-attention mix tokens from both sides at every layer, then reads a scalar from a classification head over the [CLS] representation.[1][2] Cross-encoders consistently outperform bi-encoders in passage ranking accuracy because they model fine-grained token-level interactions, but they cannot precompute document representations, so each query requires one forward pass for every candidate. In modern retrieval systems they are therefore used almost exclusively as a second-stage reranker sitting behind a fast first-stage retriever such as BM25, a dense bi-encoder, or a sparse learned model like SPLADE.[3][2] The pattern was popularized by Nogueira and Cho's 2019 application of BERT to MS MARCO passage ranking (later nicknamed "monoBERT"), and it underpins the cross-encoder/ms-marco-MiniLM-L6-v2 model family on the Hugging Face Hub as well as commercial rerank APIs from Cohere, Jina AI, and Mixedbread and the open BGE-reranker models from BAAI.[4][5][6][7]
Information retrieval before deep learning relied on lexical models such as TF-IDF and BM25, which score each document independently from a query through term-frequency statistics. These models scale to billions of documents because they can be evaluated with an inverted index, but they cannot resolve vocabulary mismatch (paraphrases, synonyms) between query and document.[8]
The release of BERT in 2018 made it practical to attempt query/document scoring with a transformer that jointly attends across the two texts. In January 2019, Rodrigo Nogueira and Kyunghyun Cho posted "Passage Re-ranking with BERT" (arXiv 1901.04085), the first paper to fine-tune BERT-Large on the MS MARCO Passage Ranking dataset as a relevance classifier. They concatenated the query and a candidate passage with [CLS] and [SEP] tokens, added a single linear layer on top of the [CLS] vector, and trained with binary cross-entropy on positive and sampled negative pairs. The system topped the MS MARCO leaderboard and improved MRR@10 by roughly 27% relative over the previous state of the art.[4] This single-BERT formulation became known as monoBERT, distinguishing it from later pairwise duoBERT and listwise variants.[4]
In August 2019, Nils Reimers and Iryna Gurevych of the UKP Lab at TU Darmstadt published "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" at EMNLP 2019.[1] The paper crystallized the cross-encoder versus bi-encoder distinction. Standard BERT, the authors noted, requires both sentences to be fed into the network simultaneously and "causes a massive computational overhead": finding the most similar pair in a collection of 10,000 sentences takes about 65 hours on a V100 GPU. Their proposed Sentence-BERT (SBERT) used siamese (shared-weight) BERT encoders to produce fixed-length sentence embeddings comparable by cosine similarity, reducing the same task to about 5 seconds while preserving most of BERT's accuracy.[1] Crucially, Reimers and Gurevych framed cross-encoders as the high-accuracy reference and bi-encoders as the deployable approximation, a framing that has dominated the literature since.[2]
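The 65-hour figure follows from the number of sentence pairs a cross-encoder has to score. The short sketch below works through that arithmetic; the per-pair time is derived from the reported total and is not a measured number.

```python
from math import comb

# Finding the most similar pair among n sentences with a cross-encoder
# means scoring every unordered pair once.
n_sentences = 10_000
n_pairs = comb(n_sentences, 2)  # n * (n - 1) / 2

# Per-pair cost implied by the reported ~65 hours (derived, not measured).
reported_hours = 65
seconds_per_pair = reported_hours * 3600 / n_pairs

print(f"{n_pairs:,} pairs, about {seconds_per_pair * 1000:.1f} ms per pair")
```

A bi-encoder needs only 10,000 encoder passes for the same collection, followed by cheap vector comparisons, which is where the reduction to seconds comes from.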
A second formative paper, Karpukhin et al.'s "Dense Passage Retrieval for Open-Domain Question Answering" (EMNLP 2020), pushed bi-encoders further by training two separate BERT towers (one for the question, one for the passage) with an in-batch negative sampling contrastive objective.[9] Their DPR system outperformed BM25 by 9 to 19 percentage points on top-20 retrieval accuracy across open-domain QA benchmarks and made first-stage dense retrieval the new baseline.[9] But because bi-encoders independently embed and then compare, they remained strictly less expressive than cross-encoders that attend across both texts; this gap motivated the standard retrieve-then-rerank pipeline that combines a bi-encoder first stage with a cross-encoder second stage.[3][9]
By 2021, Thakur et al.'s BEIR benchmark (NeurIPS 2021 Datasets and Benchmarks Track) had measured this pipeline across 18 heterogeneous information retrieval datasets and concluded that reranking models "on average achieve the best zero-shot performances, however, at high computational costs."[10] That trade-off has shaped every subsequent generation of reranker models.
Given a query q and a candidate document d, a cross-encoder constructs the input sequence
[CLS] q [SEP] d [SEP]
with the WordPiece (or sentencepiece) tokenizer of the underlying transformer. Segment IDs distinguish the query span from the document span, and standard position embeddings are added. The full sequence is then passed through a stack of transformer encoder layers; at every layer, multi-head self-attention can mix tokens from the query and the document, producing contextualized representations that depend on both.[4][2]
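The input layout described above can be sketched in a few lines. This is a minimal illustration, not a real tokenizer: whitespace splitting stands in for WordPiece, the function name `build_cross_encoder_input` is hypothetical, and truncating only the document side is an assumption made for the example.

```python
def build_cross_encoder_input(query_tokens, doc_tokens, max_len=512):
    """Return (tokens, segment_ids) laid out as [CLS] q [SEP] d [SEP].

    Assumption for this sketch: when the pair is too long, only the
    document side is truncated.
    """
    budget = max_len - len(query_tokens) - 3  # 3 special tokens
    if budget < 0:
        raise ValueError("query alone exceeds max_len")
    doc_tokens = doc_tokens[:budget]
    tokens = ["[CLS]", *query_tokens, "[SEP]", *doc_tokens, "[SEP]"]
    # Segment 0 covers [CLS], the query, and the first [SEP];
    # segment 1 covers the document and the final [SEP].
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(doc_tokens) + 1)
    return tokens, segment_ids


# Whitespace splitting is a stand-in for a real subword tokenizer.
query_tokens = "what is bm25".split()
doc_tokens = "bm25 is a lexical ranking function".split()
tokens, segment_ids = build_cross_encoder_input(query_tokens, doc_tokens)
print(tokens)
print(segment_ids)
```

Because query and document sit in one sequence, every encoder layer sees both; the segment IDs only tell the model which span each token belongs to.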
A small head sits on top of the encoder. The most common formulation, used by monoBERT and almost all cross-encoder/ms-marco-* checkpoints, takes the final hidden state of the [CLS] token and projects it through a single linear layer to a logit. During training the logit is converted to a Bernoulli probability via sigmoid and trained with binary cross-entropy against positive/negative labels; at inference the raw logit is used directly as a relevance score (no softmax over a candidate pool is required because each candidate is scored independently).[4][5] Variants exist: monoT5 uses an encoder-decoder T5 model and conditions on the prompt "Query: q Document: d Relevant:", then reads the logit assigned to the token "true" versus "false" as the score.[11] RankT5 replaces the classification objective with pairwise softmax or listwise Softmax/PolyLoss ranking losses computed across the candidate pool, which the authors show improves both in-domain MRR/NDCG and out-of-domain generalization on BEIR.[12]
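The two scoring conventions in this paragraph reduce to small formulas. The sketch below assumes scalar logits are already available from the model; the function names are illustrative, and the monoT5-style score is written as a softmax restricted to the "true" and "false" token logits, which is one common reading of that setup.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bce_with_logits(logit, label):
    """Binary cross-entropy for one (query, document) pair.

    label is 1 for a relevant pair and 0 for a sampled negative.
    Equivalent to -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(logit).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def mono_t5_score(true_logit, false_logit):
    """Probability of "true" under a softmax over the two token logits."""
    m = max(true_logit, false_logit)
    e_true = math.exp(true_logit - m)
    e_false = math.exp(false_logit - m)
    return e_true / (e_true + e_false)


# A confident correct prediction has low loss; a confident wrong one has high loss.
print(bce_with_logits(4.0, 1), bce_with_logits(4.0, 0))
print(mono_t5_score(2.0, -1.0))
```

At inference time the sigmoid is optional for ranking within a single query, since it is monotonic and does not change the order of the logits.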
The decisive architectural property is joint attention. In a bi-encoder, query tokens never attend to document tokens (or vice versa); each text is summarized into a fixed-length vector before comparison, which forces lossy compression. In a cross-encoder, every query token can attend to every document token at every layer, so the model can directly check token-level alignments, paraphrases, negations, numerical comparisons, and other fine-grained relations.[2][1] This is why on MS MARCO Dev and TREC Deep Learning, cross-encoder rerankers routinely add 3 to 8 absolute points of MRR@10 or NDCG@10 over their bi-encoder first stage.[5][10]
The same joint attention is also the headline limitation. To score N candidates, the cross-encoder must run N forward passes (each over a sequence of length |q|+|d|), whereas a bi-encoder can precompute document embeddings offline and reduce online cost to one query embedding plus an approximate nearest-neighbor lookup with FAISS or Pinecone. The Hugging Face card for cross-encoder/ms-marco-MiniLM-L6-v2 reports roughly 1,800 documents per second on a V100, meaning that reranking the top 100 candidates per query adds approximately 55ms of GPU time per query for that small 6-layer model; larger 12-layer or BGE-reranker-v2-gemma (3B parameter) models can be one to two orders of magnitude slower.[5][13]
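The latency estimate is simple division, shown here as a sketch. It assumes throughput stays constant regardless of batch size and passage length, which real deployments will not match exactly; the helper name `rerank_latency_ms` is illustrative.

```python
def rerank_latency_ms(num_candidates, docs_per_second):
    """Estimated GPU time to score num_candidates pairs, in milliseconds."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return 1000.0 * num_candidates / docs_per_second


# Throughput quoted for cross-encoder/ms-marco-MiniLM-L6-v2 on a V100.
small_model_ms = rerank_latency_ms(100, 1800)
print(f"top-100 rerank: about {small_model_ms:.1f} ms")

# Cost grows linearly with the candidate pool.
for k in (50, 100, 200):
    print(k, round(rerank_latency_ms(k, 1800), 1))
```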
The de facto training set for general-purpose cross-encoder rerankers is the MS MARCO Passage Ranking corpus, a Microsoft release of roughly 8.8 million web passages and around 500,000 anonymized Bing queries with sparse positive judgments. Nogueira and Cho's monoBERT trained with binary cross-entropy on positive query/passage pairs and BM25-sampled negatives; subsequent work showed that mining "hard negatives" with the current model and distilling soft scores from a strong cross-encoder teacher both substantially improve quality.[4][14] The cross-encoder/ms-marco-MiniLM-L6-v2 model card credits training on the MS MARCO Passage Ranking task and reports NDCG@10 = 74.30 on TREC Deep Learning 2019 and MRR@10 = 39.01 on MS MARCO Dev.[5]
The sentence-transformers library, originally created by UKP Lab and now maintained by Hugging Face, exposes a CrossEncoder class with a small fit/predict API; given a list of InputExample(texts=[query, doc], label=relevance) it handles tokenization, padding, sigmoid output, and evaluation against held-out positives.[2] Hugging Face's 2025 "Training and Finetuning Reranker Models with Sentence Transformers" blog formalizes this into a Sentence Transformers v4 trainer that supports binary classification, regression, and listwise losses for cross-encoders.[14]
State-of-the-art rerankers usually combine three ingredients beyond plain cross-entropy on MS MARCO:
| Model | Year | Base / size | Training data | Notes |
|---|---|---|---|---|
| monoBERT (Nogueira and Cho) | 2019 | BERT-Large, 340M | MS MARCO Passage | First BERT reranker; improved MS MARCO MRR@10 by roughly 27% relative over the prior state of the art.[4] |
| cross-encoder/ms-marco-MiniLM-L6-v2 | 2020 | MiniLM-L12, 22.7M | MS MARCO Passage | Distilled 6-layer student; 1,800 docs/s on V100; widely used default reranker.[5] |
| monoT5 (Nogueira et al.) | 2020 | T5-base/large/3B | MS MARCO Passage | Generative reranker; scores "true"/"false" token logits.[11] |
| RankT5 (Zhuang et al.) | 2022 | T5 encoder-decoder & encoder-only | MS MARCO Passage | Listwise softmax loss; strong zero-shot BEIR.[12] |
| BGE-reranker (BAAI) | 2023 | XLM-RoBERTa-base/large | Multilingual web | Multilingual cross-encoder; bundled with the BGE embedding family.[16] |
| BGE-reranker-v2-m3 (BAAI) | 2024 | BGE-M3, 0.6B | bge-m3-data, Quora, FEVER | Multilingual; fast inference; sigmoid-normalized scores.[16] |
| BGE-reranker-v2-gemma (BAAI) | 2024 | Gemma-2B, ~3B | Multilingual | LLM-based reranker; strong English and multilingual quality.[7] |
| Cohere Rerank 3.0 | 2024 | proprietary | proprietary multilingual | English-only and multilingual variants; commercial API.[17] |
| Cohere Rerank 3.5 | 2024-12-02 | proprietary | proprietary multilingual | Single multilingual model, 4,096-token context, 100+ languages.[18] |
| Jina Reranker v2 (jina-reranker-v2-base-multilingual) | 2024-06-25 | XLM-RoBERTa-style, 278M | Multilingual + code + function-calling | Flash-Attention 2; 6x throughput vs. v1; supports function-call and code retrieval.[6] |
| mxbai-rerank-large-v1 (Mixedbread) | 2024-02-29 | proprietary, 435M | LLM-labeled real-world queries | Open under Apache 2.0; 74.9% on BEIR (reported).[15] |
| mxbai-rerank-large-v2 (Mixedbread) | 2025 | 1.5B | Multilingual + code + SQL | Reinforcement-learning fine-tuned; 8k token context; 100+ languages.[19] |
cross-encoder/ms-marco-MiniLM family
The single most downloaded cross-encoder on Hugging Face is the cross-encoder/ms-marco-MiniLM-L*-v2 line, distilled from larger BERT teachers into MiniLM-L12-H384 students with 2, 4, 6, or 12 transformer layers. The L6 checkpoint at 22.7M parameters is the canonical reference reranker for academic papers and is the default used in examples in the sentence-transformers documentation.[5][2] All four sizes share the same MS MARCO Passage training pipeline; the cards report TREC DL 2019 NDCG@10 climbing from roughly 71 (L2) to about 74.3 (L6 and L12) as depth increases, while throughput on a V100 falls correspondingly.[5]
Beijing Academy of Artificial Intelligence (BAAI) released the BGE-reranker family alongside its BGE embedding models. The original bge-reranker-base and bge-reranker-large (2023) are XLM-RoBERTa cross-encoders trained on Chinese and English retrieval data.[16] The v2 generation introduced three siblings: bge-reranker-v2-m3, a 0.6B multilingual model built on the BGE-M3 backbone; bge-reranker-v2-minicpm-layerwise, which exposes intermediate-layer exits for inference acceleration; and bge-reranker-v2-gemma, a roughly 3B LLM-based reranker built on Google's Gemma-2B base.[7][16] All v2 variants accept a query/passage pair and emit a raw score that BAAI recommends normalizing to [0, 1] with a sigmoid.[16]
Cohere Rerank is a managed reranker API first released as Rerank v1 in 2023 and updated to Rerank 3.0 in 2024 with English-only and multilingual variants.[17] Rerank 3.5, announced 2 December 2024, consolidates to a single multilingual model with a 4,096-token context window and SOTA scores on Cohere's internal multilingual retrieval evals; the v2 API replaces max_chunks_per_doc with max_tokens_per_doc and makes the model field required.[18] Jina Reranker is a similar commercial API; jina-reranker-v2-base-multilingual, released 25 June 2024, is a 278M-parameter cross-encoder built with Flash Attention 2 that supports more than 100 languages and is specifically tuned for function-call and code retrieval relevant to agentic RAG.[6] Mixedbread mxbai-rerank ships both open-source weights on Hugging Face and a managed API; the v1 models (xsmall/base/large) were released 29 February 2024 under Apache 2.0, and v2 (base 0.5B, large 1.5B) added a reinforcement-learning fine-tuning step and an 8k-token context.[15][19]
The canonical deployment pattern, documented in the sentence-transformers "Retrieve & Re-Rank" tutorial, is a two-stage cascade. A fast first-stage retriever, typically BM25, a Dense Passage Retrieval (DPR) style bi-encoder, SPLADE, or ColBERT, pulls a candidate set of K documents (commonly 50 to 200). A cross-encoder then scores all K candidates for the current query and returns the top k (commonly 3 to 10).[3][20] On heterogeneous benchmarks such as BEIR, BM25 + cross-encoder pipelines typically add several absolute NDCG@10 points over the first-stage retriever alone, at the cost of one cross-encoder forward pass per surviving candidate.[10]
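A minimal sketch of the two-stage cascade follows. The scorers are pluggable functions so the example runs without any model: `term_overlap` is a toy stand-in for BM25 and `toy_reranker` is a hypothetical placeholder for a real cross-encoder. The optional `cross_encoder_scorer` helper assumes the sentence-transformers package is installed and the checkpoint can be downloaded; it is imported lazily and not exercised here.

```python
from typing import Callable, List, Sequence, Tuple

ScoreFn = Callable[[str, Sequence[str]], List[float]]


def retrieve_then_rerank(query: str, corpus: Sequence[str],
                         first_stage: ScoreFn, reranker: ScoreFn,
                         K: int = 100, k: int = 5) -> List[Tuple[str, float]]:
    """Keep the top K by first-stage score, then return the top k by reranker score."""
    first_scores = first_stage(query, corpus)
    candidates = sorted(range(len(corpus)),
                        key=lambda i: first_scores[i], reverse=True)[:K]
    # The reranker only ever sees the K surviving candidates.
    rerank_scores = reranker(query, [corpus[i] for i in candidates])
    reranked = sorted(zip(candidates, rerank_scores),
                      key=lambda pair: pair[1], reverse=True)
    return [(corpus[i], score) for i, score in reranked[:k]]


def term_overlap(query: str, docs: Sequence[str]) -> List[float]:
    """Toy lexical scorer standing in for BM25: count shared lowercase terms."""
    q_terms = set(query.lower().split())
    return [float(len(q_terms & set(d.lower().split()))) for d in docs]


def toy_reranker(query: str, docs: Sequence[str]) -> List[float]:
    """Hypothetical stand-in for a cross-encoder, used only for the demo."""
    return [1.0 if "rerank" in d else 0.0 for d in docs]


def cross_encoder_scorer(model_name: str = "cross-encoder/ms-marco-MiniLM-L6-v2") -> ScoreFn:
    """Wrap a sentence-transformers CrossEncoder as a ScoreFn (requires the package)."""
    from sentence_transformers import CrossEncoder  # lazy import
    model = CrossEncoder(model_name)

    def score(query: str, docs: Sequence[str]) -> List[float]:
        return [float(s) for s in model.predict([(query, d) for d in docs])]

    return score


corpus = [
    "bm25 is a lexical ranking function",
    "cross encoders rerank candidates",
    "a cross encoder scores a query and document jointly",
    "bananas are yellow",
]
top = retrieve_then_rerank("cross encoder", corpus, term_overlap, toy_reranker, K=2, k=1)
print(top)
```

To use a real model, pass `cross_encoder_scorer()` as the `reranker` argument in place of `toy_reranker`.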
The pattern has become standard in retrieval-augmented generation. In a Retrieval-Augmented Generation (RAG) system, the cross-encoder reranker is placed between the vector store and the LLM context window: it raises the precision of the top-3 or top-5 passages that actually enter the prompt, which has an outsized effect on answer quality because LLM context is small and expensive. The same architecture is now standard in Agentic RAG systems where the reranker also scores tool descriptions and function-call schemas, a use case explicitly targeted by Jina Reranker v2 and Mixedbread mxbai-rerank-v2.[6][19]
Throughput numbers vary by model and hardware, but the broad picture from public model cards and benchmarks is:
| Stage | Typical model | Throughput (single A100 / V100, order of magnitude) | Relative quality |
|---|---|---|---|
| First-stage sparse | BM25 (CPU inverted index) | millions of docs/query | baseline |
| First-stage dense | DPR / BGE-M3 bi-encoder | millions of docs/query (after ANN index) | + several points NDCG over BM25 on in-domain data |
| Reranker (small) | cross-encoder/ms-marco-MiniLM-L6-v2 | ~1,800 docs/sec on V100 | + several points NDCG over first stage |
| Reranker (large) | bge-reranker-v2-gemma / mxbai-rerank-large-v2 / monoT5-3B | ~10s to 100s of docs/sec | additional gains on hard / zero-shot data |
The first stage's job is recall (does the true positive sit somewhere in the top K?), and the reranker's job is precision (is it at position 1?). Increasing K improves recall, but each additional candidate costs one more cross-encoder forward pass; the typical operating point is K between 50 and 200.[3][20]
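The recall and precision roles map onto two simple metrics. The sketch below computes recall@K for a first-stage ranking and the reciprocal rank that MRR@10 averages over queries; the document IDs are made up for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1 / rank of the first relevant document in the top k, else 0.

    MRR@k is the mean of this value over all queries.
    """
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical rankings for one query whose only relevant document is "d4".
first_stage = ["d7", "d2", "d9", "d4"]
after_rerank = ["d4", "d7", "d9", "d2"]
relevant = ["d4"]

print(recall_at_k(first_stage, relevant, 2), recall_at_k(first_stage, relevant, 4))
print(reciprocal_rank_at_k(first_stage, relevant), reciprocal_rank_at_k(after_rerank, relevant))
```

In this toy case a pool of 2 would have lost the relevant document before the reranker ever saw it, which is why K is chosen for recall rather than for first-stage precision.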
Throughput. The headline limitation is that cross-encoders cannot precompute document representations. Every (query, document) pair requires a full forward pass over the concatenated sequence, so latency grows linearly with the candidate pool. For interactive search this caps practical reranker depth at roughly the top 100 to 200, and the largest LLM-based rerankers (BGE-reranker-v2-gemma, monoT5-3B, mxbai-rerank-large-v2) further constrain that depth.[5][7][19]
Score uncalibrated across queries. A cross-encoder's raw logit is trained as a per-query relevance score and is not calibrated across queries: a score of 5 for query A and a score of 5 for query B do not imply equal relevance. Most production rerankers either rerank within a single query's pool only or post-process scores with a sigmoid and a query-specific threshold.[16]
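The post-processing mentioned above can be sketched as follows. Each query's candidates are squashed with a sigmoid and ranked only against each other; the logits and per-query thresholds are made-up values for illustration, and how a threshold is chosen is outside the scope of the sketch.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def rank_within_query(doc_ids, logits, threshold):
    """Map logits to [0, 1], drop candidates below the threshold, sort descending."""
    scored = [(doc_id, sigmoid(z)) for doc_id, z in zip(doc_ids, logits)]
    kept = [(doc_id, p) for doc_id, p in scored if p >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical raw logits for two separate queries.
pools = {
    "query A": (["a1", "a2", "a3"], [5.0, 1.0, -2.0]),
    "query B": (["b1", "b2"], [5.0, 4.5]),
}
# Hypothetical query-specific thresholds.
thresholds = {"query A": 0.5, "query B": 0.99}

# Pools are never merged: each query is ranked on its own.
results = {
    query: rank_within_query(ids, logits, thresholds[query])
    for query, (ids, logits) in pools.items()
}
print(results)
```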
Domain shift. BEIR demonstrated that cross-encoders trained on MS MARCO transfer well to many domains but degrade on argument retrieval (Touche-2020), citation prediction (SciDocs), and bio-medical retrieval, often by 5 to 15 NDCG@10 points. The benchmark warned that "reranking models on average achieve the best zero-shot performances, however, at high computational costs," but performance is far from uniform across tasks.[10] RankT5 and the BGE-reranker-v2 family partially close the gap with listwise losses and broader multilingual training, respectively.[12][16]
Context window. Standard BERT-based cross-encoders inherit a 512-token cap, which can truncate longer passages. Newer models (Cohere Rerank 3.5 at 4,096 tokens, Mixedbread mxbai-rerank-v2 at 8,192) push this higher, but documents that exceed the cap still need chunking and aggregation.[18][19]
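One way to handle documents longer than the cap is to score overlapping windows and aggregate. The sketch below takes the maximum chunk score as the document score; that aggregation rule is an assumption chosen for illustration, as are the window and stride defaults. `score_fn` stands for any function that scores a (query, text) pair, and the keyword-based scorer in the demo is a placeholder.

```python
def chunk_tokens(tokens, window, stride):
    """Split tokens into overlapping windows that cover the whole sequence."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += stride
    return chunks


def score_long_document(query, tokens, score_fn, window=400, stride=200):
    """Score every chunk against the query and keep the best chunk score."""
    return max(score_fn(query, " ".join(chunk))
               for chunk in chunk_tokens(tokens, window, stride))


def keyword_scorer(query, text):
    """Placeholder scorer for the demo: 1.0 if the text contains 'relevant'."""
    return 1.0 if "relevant" in text.split() else 0.0


tokens = "a b c d e f relevant g h i".split()
truncated_score = keyword_scorer("q", " ".join(tokens[:4]))  # plain truncation
chunked_score = score_long_document("q", tokens, keyword_scorer, window=4, stride=2)
print(truncated_score, chunked_score)
```

Plain truncation misses evidence that sits past the cap, while chunking finds it at the price of one forward pass per chunk.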
Black-box scores. Because the relevance score is an opaque scalar from a deep network, debugging why a particular document was ranked above another is hard. Tools such as TermImportance and attention-rollout exist for cross-encoders but are rarely used in production.
A cross-encoder is one of three commonly contrasted neural retrieval architectures:
| Architecture | How query and doc interact | Online cost | Where it wins |
|---|---|---|---|
| Bi-encoder (DPR, Sentence-BERT) | Encoded independently, compared by dot product | O(1) per doc after indexing | First-stage retrieval over millions of docs |
| Late-interaction (ColBERT) | Per-token bi-encoder, late MaxSim aggregation | Higher than bi-encoder, lower than cross-encoder | High-quality retrieval at moderate scale |
| Cross-encoder (monoBERT, MS MARCO MiniLM) | Joint attention over concatenated input | One forward pass per candidate, O(N) per query | Reranking the top K candidates |
In practice these are complementary rather than competing. A modern hybrid pipeline frequently chains a BM25 sparse retriever, a Dense Passage Retrieval (DPR) style dense bi-encoder, an optional SPLADE sparse-learned retriever, and a cross-encoder reranker at the end.[3][10]