Cross-encoder

Updated Jun 28, 2026

A cross-encoder is a neural network architecture that scores a pair of texts by feeding them jointly into a single transformer (such as BERT) and reading out one scalar score for the pair, for example a relevance or similarity score. The two texts (typically a query and a candidate document) are concatenated into one input, [CLS] query [SEP] document [SEP], so that self-attention can mix tokens from both sides at every layer; a small head over the [CLS] representation then outputs the score.^[1]^[2] Because every token of one text can attend to every token of the other, a cross-encoder is typically more accurate than models that encode the two texts separately, but it cannot precompute document representations: a separate forward pass must run for every candidate pair, so it is too slow to search a large corpus directly. The standard remedy is the retrieve-then-rerank pipeline, in which a fast bi-encoder (which embeds each text independently into a fixed vector, so that similarity is a cheap dot product and search scales to millions of documents) retrieves a shortlist of candidates, and a cross-encoder reranks only that shortlist.^[2]^[3] The distinction was crystallized by the Sentence-Transformers / SBERT framework of Nils Reimers and Iryna Gurevych (2019), and the architecture as a relevance ranker was popularized by Nogueira and Cho's 2019 application of BERT to MS MARCO passage ranking (later nicknamed "monoBERT").^[1]^[4]

What is a cross-encoder?

A cross-encoder takes two pieces of text together and produces a single number describing their relationship. Unlike a bi-encoder (also called a dual-encoder), which embeds the query and each document independently and compares the resulting vectors with an inner product or cosine similarity, the cross-encoder concatenates the two texts (typically as [CLS] query [SEP] document [SEP]) and lets self-attention mix tokens from both sides at every layer, then reads a scalar from a classification head over the [CLS] representation.^[1]^[2] As the Sentence-Transformers documentation puts it, "A Cross-Encoder does not produce a sentence embedding"; instead "we pass both sentences simultaneously to the Transformer network" and it "outputs a value between 0 and 1 indicating the similarity" of the pair.^[2] Cross-encoders typically outperform bi-encoders in passage ranking accuracy because they model fine-grained token-level interactions, but they cannot precompute document representations, so each query requires one forward pass per candidate. In modern retrieval systems they are therefore used almost exclusively as a second-stage reranker sitting behind a fast first-stage retriever such as BM25, a dense bi-encoder, or a sparse learned model like SPLADE.^[3]^[2] The pattern underpins the cross-encoder/ms-marco-MiniLM-L6-v2 model family on the Hugging Face Hub, the open-weight rerankers from BAAI, and commercial rerank APIs from Cohere, Jina AI, and Mixedbread.^[4]^[5]^[6]^[7]

Background and history

Information retrieval before deep learning relied on lexical models such as TF-IDF and BM25, which score each document independently from a query through term-frequency statistics. These models scale to billions of documents because they can be evaluated with an inverted index, but they cannot resolve vocabulary mismatch (paraphrases, synonyms) between query and document.^[8]

The release of BERT in 2018 made it practical to attempt query/document scoring with a transformer that jointly attends across the two texts. In January 2019, Rodrigo Nogueira and Kyunghyun Cho posted "Passage Re-ranking with BERT" (arXiv 1901.04085), the first paper to fine-tune BERT-Large on the MS MARCO Passage Ranking dataset as a relevance classifier. They concatenated the query and a candidate passage with [CLS] and [SEP] tokens, added a single linear layer on top of the [CLS] vector, and trained with binary cross-entropy on positive and sampled negative pairs. The system topped the MS MARCO leaderboard and improved MRR@10 by roughly 27% relative over the previous state of the art.^[4] This single-BERT formulation became known as monoBERT, distinguishing it from later pairwise duoBERT and listwise variants.^[4]

In August 2019, Nils Reimers and Iryna Gurevych of the UKP Lab at TU Darmstadt published "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" at EMNLP 2019.^[1] The paper crystallized the cross-encoder versus bi-encoder distinction. Standard BERT, the authors noted, "requires that both sentences are fed into the network, which causes a massive computational overhead": "Finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours) with BERT."^[1] Their proposed Sentence-BERT (SBERT) used siamese (shared-weight) BERT encoders to produce fixed-length sentence embeddings comparable by cosine similarity, which "reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT" while preserving most of BERT's accuracy.^[1] Crucially, Reimers and Gurevych framed cross-encoders as the high-accuracy reference and bi-encoders as the deployable approximation, a framing that has dominated the literature since.^[2]

A second formative paper, Karpukhin et al.'s "Dense Passage Retrieval for Open-Domain Question Answering" (EMNLP 2020), pushed bi-encoders further by training two separate BERT towers (one for the question, one for the passage) with an in-batch negative sampling contrastive objective.^[9] Their DPR system outperformed BM25 by 9 to 19 percentage points on top-20 retrieval accuracy across open-domain QA benchmarks and made first-stage dense retrieval the new baseline.^[9] But because bi-encoders independently embed and then compare, they remained strictly less expressive than cross-encoders that attend across both texts; this gap motivated the standard retrieve-then-rerank pipeline that combines a bi-encoder first stage with a cross-encoder second stage.^[3]^[9]

In 2021, Thakur et al.'s BEIR benchmark (NeurIPS 2021 Datasets and Benchmarks Track) evaluated this pipeline across 18 heterogeneous information retrieval datasets and concluded that reranking models "on average achieve the best zero-shot performances, however, at high computational costs."^[10] That trade-off has shaped every subsequent generation of reranker models.

How does a cross-encoder work?

Input encoding

Given a query q and a candidate document d, a cross-encoder constructs the input sequence

[CLS] q [SEP] d [SEP]

with the WordPiece (or SentencePiece) tokenizer of the underlying transformer. In BERT-style models, segment IDs distinguish the query span from the document span, and standard position embeddings are added. The full sequence is then passed through a stack of transformer encoder layers; at every layer, multi-head self-attention can mix tokens from the query and the document, producing contextualized representations that depend on both.^[4]^[2]
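As a rough illustration of this layout, the sketch below builds the token sequence and segment IDs for one pair. It assumes a toy whitespace tokenizer in place of WordPiece, and `build_pair_input` is a hypothetical helper written for this example, not a library function.

```python
def build_pair_input(query: str, document: str, max_len: int = 512):
    """Return (tokens, segment_ids) for the layout [CLS] q [SEP] d [SEP].

    Toy whitespace tokenization stands in for WordPiece/SentencePiece.
    """
    q_tokens = query.lower().split()
    d_tokens = document.lower().split()

    # Three slots are reserved for [CLS] and the two [SEP] tokens;
    # the document is truncated if the pair would exceed max_len.
    room_for_doc = max(0, max_len - len(q_tokens) - 3)
    d_tokens = d_tokens[:room_for_doc]

    tokens = ["[CLS]", *q_tokens, "[SEP]", *d_tokens, "[SEP]"]
    # Segment 0 covers [CLS], the query, and the first [SEP];
    # segment 1 covers the document and the final [SEP].
    segment_ids = [0] * (len(q_tokens) + 2) + [1] * (len(d_tokens) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair_input(
    "what is bm25", "BM25 is a lexical ranking function"
)
print(tokens)
print(segment_ids)
```

Truncating only the document side is one possible policy; real tokenizers offer several truncation strategies.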

Scoring head and output

A small head sits on top of the encoder. The most common formulation, used by monoBERT and almost all cross-encoder/ms-marco-* checkpoints, takes the final hidden state of the [CLS] token and projects it through a single linear layer to a logit. During training the logit is converted to a probability via a sigmoid and trained with binary cross-entropy against positive/negative labels; at inference the raw logit is used directly as a relevance score (no softmax over a candidate pool is required because each candidate is scored independently).^[4]^[5] Variants exist: monoT5 uses an encoder-decoder T5 model and conditions on the prompt "Query: q Document: d Relevant:", then reads the logit assigned to the token "true" versus "false" as the score.^[11] RankT5 replaces the classification objective with pairwise logistic or listwise softmax and Poly1 ranking losses computed across the candidate pool, which the authors show improves both in-domain MRR/NDCG and out-of-domain generalization on BEIR.^[12]
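The pointwise head and its training loss can be sketched in a few lines. This is a minimal illustration, assuming a made-up 4-dimensional [CLS] vector and hand-picked weights; a real model learns the weights and uses hundreds of dimensions.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def score_head(cls_vector, weights, bias):
    """Single linear layer: project the [CLS] vector to one relevance logit."""
    return sum(w * h for w, h in zip(weights, cls_vector)) + bias


def bce_loss(logit: float, label: int) -> float:
    """Binary cross-entropy between sigmoid(logit) and a 0/1 relevance label."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Hypothetical values chosen only for illustration.
cls_vector = [0.5, -1.0, 2.0, 0.0]
weights = [1.0, 0.5, 0.25, -3.0]
bias = 0.1

logit = score_head(cls_vector, weights, bias)  # used as-is for ranking at inference
print("logit:", logit)
print("probability:", sigmoid(logit))
print("loss if relevant:", bce_loss(logit, 1))
print("loss if not relevant:", bce_loss(logit, 0))
```

Because the sigmoid is monotonic, ranking candidates by the raw logit or by the probability gives the same order.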

Why do cross-encoders score better?

The decisive architectural property is joint attention. In a bi-encoder, query tokens never attend to document tokens (or vice versa); each text is summarized into a fixed-length vector before comparison, which forces lossy compression. In a cross-encoder, every query token can attend to every document token at every layer, so the model can directly check token-level alignments, paraphrases, negations, numerical comparisons, and other fine-grained relations.^[2]^[1] This is why on MS MARCO Dev and TREC Deep Learning, cross-encoder rerankers routinely add 3 to 8 absolute points of MRR@10 or NDCG@10 over their bi-encoder first stage.^[5]^[10]

Why are cross-encoders slow?

The same joint attention is also the headline limitation. To score N candidates, the cross-encoder must run N forward passes (each over a sequence of length |q|+|d|), whereas a bi-encoder can precompute document embeddings offline and reduce online cost to one query embedding plus an approximate nearest-neighbor lookup with FAISS or Pinecone. The Hugging Face card for cross-encoder/ms-marco-MiniLM-L6-v2 reports roughly 1,800 documents per second on a V100, meaning that reranking the top 100 candidates adds approximately 55 ms of GPU time per query for that small 6-layer model; the 12-layer variant is slower, and LLM-based rerankers such as BGE-reranker-v2-gemma (roughly 3B parameters) can be one to two orders of magnitude slower.^[5]^[13]
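The 55 ms figure follows directly from the quoted throughput. The helper below is a back-of-the-envelope estimate only: it assumes throughput stays constant with batch size and ignores tokenization and data-transfer overhead.

```python
def rerank_latency_ms(num_candidates: int, docs_per_second: float) -> float:
    """Estimated time in milliseconds to score num_candidates pairs."""
    return 1000.0 * num_candidates / docs_per_second


# 1,800 docs/s is the V100 figure quoted above for the 6-layer MiniLM model.
for k in (50, 100, 200):
    print(k, "candidates:", round(rerank_latency_ms(k, 1800.0), 1), "ms")
```

This prints roughly 27.8 ms, 55.6 ms, and 111.1 ms, which shows how reranking cost grows linearly with the shortlist size.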

Cross-encoder vs bi-encoder: what is the difference?

The cross-encoder and the bi-encoder solve the same problem (scoring how well two texts match) with opposite engineering trade-offs. A bi-encoder, the architecture behind Sentence-BERT (SBERT) and Dense Passage Retrieval (DPR), encodes the query and each document separately into fixed-length vectors and compares them with a dot product or cosine similarity. Because document vectors can be computed once and stored in a vector index, the online work per query is a single query encoding plus an index lookup, which is what lets bi-encoders search millions of documents. A cross-encoder, by contrast, never produces a reusable embedding: it must see both texts together, so its cost grows linearly with the number of candidates.^[1]^[2] The Sentence-Transformers documentation summarizes the trade-off directly: "Cross-Encoders achieve higher performance than Bi-Encoders, however, they do not scale well for large datasets."^[2]

Property	Bi-encoder (dual-encoder)	Cross-encoder
Input handling	Each text encoded independently	Both texts concatenated, encoded jointly
Output	A reusable embedding per text	A single score per pair, no embedding
Query/document attention	None across texts	Full cross-attention at every layer
Document vectors precomputable?	Yes (index offline)	No (one forward pass per candidate)
Online cost	One query encoding plus an index lookup	One forward pass per candidate (N passes for N candidates)
Accuracy	Lower (lossy compression)	Higher (fine-grained token interactions)
Typical role	First-stage retrieval over millions of docs	Reranking the top K candidates

Because the two are complementary, the documented best practice is to use them together: "First, you use an efficient Bi-Encoder to retrieve e.g. the top-100 most similar sentences for a query. Then, you use a Cross-Encoder to re-rank these 100 hits by computing the score for every (query, hit) combination."^[2]
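The scaling argument can be checked by counting encoder forward passes. The sketch below is a simplified cost model under stated assumptions: document vectors are already indexed, and the index lookup itself is not counted. It also reproduces the "about 50 million" pair count quoted from the SBERT paper.

```python
from math import comb


def all_pairs_cross_encoder_passes(num_sentences: int) -> int:
    """Forward passes needed to score every unordered sentence pair jointly."""
    return comb(num_sentences, 2)


def passes_per_query(corpus_size: int, shortlist_size: int) -> dict:
    """Online encoder forward passes for one query under three designs."""
    return {
        # Document vectors are precomputed; only the query is encoded online.
        "bi_encoder_only": 1,
        # Every document must be paired with the query and scored.
        "cross_encoder_only": corpus_size,
        # One query encoding plus one pass per shortlisted candidate.
        "retrieve_then_rerank": 1 + shortlist_size,
    }


print(all_pairs_cross_encoder_passes(10_000))  # 49995000, i.e. about 50 million
print(passes_per_query(corpus_size=1_000_000, shortlist_size=100))
```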

How is a cross-encoder trained?

MS MARCO Passage Ranking

The de facto training set for general-purpose cross-encoder rerankers is the MS MARCO Passage Ranking corpus, a Microsoft release of roughly 8.8 million web passages and around 500,000 anonymized Bing queries with sparse positive judgments. Nogueira and Cho's monoBERT trained with binary cross-entropy on positive query/passage pairs and BM25-sampled negatives; subsequent work showed that mining "hard negatives" with the current model and distilling soft scores from a strong cross-encoder teacher both substantially improve quality.^[4]^[14] The cross-encoder/ms-marco-MiniLM-L6-v2 model card credits training on the MS MARCO Passage Ranking task and reports NDCG@10 = 74.30 on TREC Deep Learning 2019 and MRR@10 = 39.01 on MS MARCO Dev.^[5]
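Building pointwise training pairs with mined negatives can be sketched as follows. The helper names, the document IDs, and the `skip_top` option are illustrative assumptions rather than part of any cited recipe; the ranked list stands in for the output of BM25 or another first-stage retriever.

```python
def mine_hard_negatives(ranked_doc_ids, positive_ids, num_negatives=4, skip_top=0):
    """Pick the highest-ranked documents that are not labeled positive.

    ranked_doc_ids: doc ids ordered best-first by a first-stage retriever.
    positive_ids:   ids judged relevant for this query.
    skip_top:       optionally skip the very top ranks, which may contain
                    unlabeled positives when judgments are sparse.
    """
    positives = set(positive_ids)
    candidates = [d for d in ranked_doc_ids[skip_top:] if d not in positives]
    return candidates[:num_negatives]


def build_training_pairs(query, ranked_doc_ids, positive_ids, num_negatives=4):
    """Return (query, doc_id, label) triples for pointwise training."""
    negatives = mine_hard_negatives(ranked_doc_ids, positive_ids, num_negatives)
    pairs = [(query, d, 1) for d in positive_ids]
    pairs += [(query, d, 0) for d in negatives]
    return pairs


ranked = ["d7", "d2", "d9", "d4", "d1", "d5"]  # hypothetical first-stage ranking
print(build_training_pairs("example query", ranked, positive_ids=["d2"], num_negatives=3))
```

Negatives drawn from the top of a retriever's ranking are harder than random passages because they already look similar to the query.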

Sentence-Transformers training API

The sentence-transformers library, originally created by UKP Lab and now maintained by Hugging Face, exposes a CrossEncoder class with a small fit/predict API; given a list of InputExample(texts=[query, doc], label=relevance) it handles tokenization, padding, sigmoid output, and evaluation against held-out positives.^[2] Hugging Face's "Training and Finetuning Reranker Models with Sentence Transformers" blog post extends this into a dedicated trainer (CrossEncoderTrainer, introduced in Sentence Transformers v4) that supports binary classification, regression, and listwise losses for cross-encoders.^[14]
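The inference side of this API can be sketched as below. The wrapper `rerank_with_cross_encoder` is a hypothetical helper written for this example; it relies only on the library's `CrossEncoder` class and its `predict` method, and it assumes the package is installed and the model weights can be downloaded on first use.

```python
def rerank_with_cross_encoder(query, documents, top_k=3,
                              model_name="cross-encoder/ms-marco-MiniLM-L6-v2"):
    """Score (query, document) pairs and return the top_k (document, score) tuples.

    Requires `pip install sentence-transformers`.
    """
    # Imported lazily so that defining this function does not need the library.
    from sentence_transformers import CrossEncoder

    model = CrossEncoder(model_name)
    # predict() takes a list of (text_a, text_b) pairs and returns one score per pair.
    scores = model.predict([(query, doc) for doc in documents])
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_k]]
```

Calling `rerank_with_cross_encoder("how many people live in berlin", docs, top_k=2)` with a list of candidate passages `docs` returns the two highest-scoring passages. The scale of the scores depends on the checkpoint, so only their order within one query should be relied on. In a long-running service the model would be loaded once and reused rather than constructed per call.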

Distillation, listwise losses, and reinforcement learning

State-of-the-art rerankers usually combine three ingredients beyond plain cross-entropy on MS MARCO:

Hard-negative mining with a current bi-encoder or BM25 to ensure the training set contains near-miss passages.^[14]
Listwise or pairwise ranking losses (RankNet, ListNet, softmax cross-entropy over a candidate pool, PolyLoss). RankT5 demonstrated that listwise softmax losses outperform pointwise classification on both in-domain MS MARCO and zero-shot BEIR.^[12]
Cross-encoder distillation, where a smaller student model is trained to match the relevance scores of a much larger teacher (often a T5-3B monoT5 or a frontier LLM). This is the recipe behind the small MiniLM checkpoints that punch far above their parameter count.^[5]^[14] Mixedbread reports using a reinforcement-learning fine-tune on top of cross-entropy and distillation losses for its mxbai-rerank-v2 models.^[15]
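Two of these ingredients reduce to short loss functions. The sketch below shows a listwise softmax cross-entropy over one query's candidate pool and a score-matching distillation loss. Mean squared error is used here as one common choice of distillation objective; that choice is an assumption for illustration, not a claim about any specific model named above.

```python
import math


def listwise_softmax_loss(scores, positive_index):
    """Cross-entropy of a softmax taken over one query's candidate scores."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[positive_index]


def distillation_mse(student_scores, teacher_scores):
    """Mean squared error between student and teacher relevance scores."""
    n = len(student_scores)
    return sum((s - t) ** 2 for s, t in zip(student_scores, teacher_scores)) / n


pool_scores = [2.0, 0.5, -1.0, 0.0]  # hypothetical logits; index 0 is the positive
print(listwise_softmax_loss(pool_scores, positive_index=0))
print(distillation_mse([2.0, 0.5], [3.0, 0.5]))  # 0.5
```

Unlike pointwise binary cross-entropy, the listwise loss depends on how the positive scores relative to the other candidates in the same pool, which matches how the reranker is used at inference.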

Notable cross-encoder models

Model	Year	Base / size	Training data	Notes
monoBERT (Nogueira and Cho)	2019	BERT-Large, 340M	MS MARCO Passage	First BERT reranker; improved MS MARCO MRR@10 by roughly 27% relative over the prior state of the art.^[4]
`cross-encoder/ms-marco-MiniLM-L6-v2`	2020	MiniLM-L12, 22.7M	MS MARCO Passage	Distilled 6-layer student; 1,800 docs/s on V100; widely used default reranker.^[5]
monoT5 (Nogueira et al.)	2020	T5-base/large/3B	MS MARCO Passage	Generative reranker; scores `"true"`/`"false"` token logits.^[11]
RankT5 (Zhuang et al.)	2022	T5 encoder-decoder & encoder-only	MS MARCO Passage	Listwise softmax loss; strong zero-shot BEIR.^[12]
BGE-reranker (BAAI)	2023	XLM-RoBERTa-base/large	Multilingual web	Multilingual cross-encoder; bundled with the BGE embedding family.^[16]
BGE-reranker-v2-m3 (BAAI)	2024	BGE-M3, 0.6B	bge-m3-data, Quora, FEVER	Multilingual; fast inference; sigmoid-normalized scores.^[16]
BGE-reranker-v2-gemma (BAAI)	2024	Gemma-2B, ~3B	Multilingual	LLM-based reranker; strong English and multilingual quality.^[7]
Cohere Rerank 3.0	2024	proprietary	proprietary multilingual	English-only and multilingual variants; commercial API.^[17]
Cohere Rerank 3.5	2024-12-02	proprietary	proprietary multilingual	Single multilingual model, 4,096-token context, 100+ languages.^[18]
Jina Reranker v2 (`jina-reranker-v2-base-multilingual`)	2024-06-25	XLM-RoBERTa-style, 278M	Multilingual + code + function-calling	Flash-Attention 2; 6x throughput vs. v1; supports function-call and code retrieval.^[6]
mxbai-rerank-large-v1 (Mixedbread)	2024-02-29	proprietary, 435M	LLM-labeled real-world queries	Open under Apache 2.0; 74.9% on BEIR (reported).^[15]
mxbai-rerank-large-v2 (Mixedbread)	2025	1.5B	Multilingual + code + SQL	Reinforcement-learning fine-tuned; 8k token context; 100+ languages.^[19]

The `cross-encoder/ms-marco-MiniLM` family

The single most downloaded cross-encoder on Hugging Face is the cross-encoder/ms-marco-MiniLM-L*-v2 line, distilled from larger BERT teachers into MiniLM-L12-H384 students with 2, 4, 6, or 12 transformer layers. The L6 checkpoint at 22.7M parameters is the canonical reference reranker for academic papers and is the default used in examples in the sentence-transformers documentation.^[5]^[2] All four sizes share the same MS MARCO Passage training pipeline; the cards report TREC DL 2019 NDCG@10 climbing from roughly 71.0 (L2) to 73.0 (L4) to 74.3 (L6), with L12 adding almost nothing further, while throughput on a V100 falls as depth increases.^[5]

BGE-reranker family

Beijing Academy of Artificial Intelligence (BAAI) released the BGE-reranker family alongside its BGE embedding models. The original bge-reranker-base and bge-reranker-large (2023) are XLM-RoBERTa cross-encoders trained on Chinese and English retrieval data.^[16] The v2 generation introduced three siblings: bge-reranker-v2-m3, a 0.6B multilingual model built on the BGE-M3 backbone; bge-reranker-v2-minicpm-layerwise, which exposes intermediate-layer exits for inference acceleration; and bge-reranker-v2-gemma, a roughly 3B LLM-based reranker built on Google's Gemma-2B base.^[7]^[16] All v2 variants accept a query/passage pair and emit a raw score that BAAI recommends normalizing to [0, 1] with a sigmoid.^[16]

Commercial reranker APIs

Cohere Rerank is a managed reranker API first released as Rerank v1 in 2023 and updated to Rerank 3.0 in 2024 with English-only and multilingual variants.^[17] Rerank 3.5, announced 2 December 2024, consolidates to a single multilingual model with a 4,096-token context window and SOTA scores on Cohere's internal multilingual retrieval evals; the v2 API replaces max_chunks_per_doc with max_tokens_per_doc and makes the model field required.^[18] Jina Reranker is a similar commercial API; jina-reranker-v2-base-multilingual, released 25 June 2024, is a 278M-parameter cross-encoder built with Flash Attention 2 that supports more than 100 languages and is specifically tuned for function-call and code retrieval relevant to agentic RAG.^[6] Mixedbread mxbai-rerank ships both open-source weights on Hugging Face and a managed API; the v1 models (xsmall/base/large) were released 29 February 2024 under Apache 2.0, and v2 (base 0.5B, large 1.5B) added a reinforcement-learning fine-tuning step and an 8k-token context.^[15]^[19]

When should you use a cross-encoder?

The canonical deployment pattern, documented in the sentence-transformers "Retrieve & Re-Rank" tutorial, is a two-stage cascade. A fast first-stage retriever, typically BM25, a Dense Passage Retrieval (DPR) style bi-encoder, SPLADE, or ColBERT, pulls a candidate set of K documents (commonly 50 to 200). A cross-encoder then scores all K candidates for the current query and returns the top k (commonly 3 to 10).^[3]^[20] On heterogeneous benchmarks such as BEIR, BM25 + cross-encoder pipelines typically gain several absolute NDCG@10 points over the first-stage retriever alone, at the cost of one cross-encoder forward pass per surviving candidate.^[10]
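The cascade can be sketched with pluggable scoring functions. In the example below both scorers are toy stand-ins chosen so the code runs without any model: `overlap_score` plays the role of the first-stage retriever and `toy_pair_score` plays the role of the cross-encoder. All function names and the sample corpus are assumptions for illustration.

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    first_stage_score: Callable[[str, str], float],
    pair_score: Callable[[str, str], float],
    shortlist_size: int = 100,
    final_size: int = 5,
) -> List[Tuple[str, float]]:
    """Two-stage cascade: cheap scoring over the corpus, costly scoring over the shortlist."""
    # Stage 1: rank the whole corpus with the cheap scorer and keep the top K.
    shortlist = sorted(corpus, key=lambda d: first_stage_score(query, d), reverse=True)
    shortlist = shortlist[:shortlist_size]
    # Stage 2: one expensive pair score per surviving candidate.
    rescored = [(d, pair_score(query, d)) for d in shortlist]
    rescored.sort(key=lambda item: item[1], reverse=True)
    return rescored[:final_size]


def overlap_score(query: str, doc: str) -> float:
    """Toy lexical first stage: number of shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def toy_pair_score(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: overlap normalized by document length."""
    return overlap_score(query, doc) / max(1, len(doc.split()))


corpus = [
    "berlin is the capital and largest city of germany",
    "a long report on the population of every district of berlin and of many other german cities",
    "the population of berlin is about 3.7 million people",
    "paris is the capital of france",
]
results = retrieve_then_rerank("population of berlin", corpus, overlap_score,
                               toy_pair_score, shortlist_size=3, final_size=2)
for doc, score in results:
    print(round(score, 3), doc)
```

In a real system `pair_score` would wrap a cross-encoder model and `first_stage_score` would be replaced by an index lookup rather than a scan over the corpus. The point of the sketch is the shape of the cascade: the expensive scorer only ever sees `shortlist_size` documents.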

In short, reach for a cross-encoder when ranking quality on a small candidate set matters more than raw throughput: as a reranker over the top tens-to-hundreds of hits, for pair-scoring tasks (duplicate-question detection, natural language inference, answer selection), and anywhere precision at position 1 drives the outcome. Do not use a cross-encoder as a first-stage retriever over a large corpus; that is the bi-encoder's job.^[2]^[3]

The pattern has become standard in Retrieval-Augmented Generation (RAG). There the cross-encoder reranker is placed between the vector store and the LLM context window: it raises the precision of the top-3 or top-5 passages that actually enter the prompt, which has an outsized effect on answer quality because LLM context is limited and expensive. The same architecture is now common in Agentic RAG systems, where the reranker also scores tool descriptions and function-call schemas, a use case explicitly targeted by Jina Reranker v2 and Mixedbread mxbai-rerank-v2.^[6]^[19]

Quality / latency trade-offs

Throughput numbers vary by model and hardware, but the broad picture from public model cards and benchmarks is:

Stage	Typical model	Throughput (single A100 / V100, order of magnitude)	Relative quality
First-stage sparse	BM25 (CPU inverted index)	millions of docs/query	baseline
First-stage dense	DPR / BGE-M3 bi-encoder	millions of docs/query (after ANN index)	+ several points NDCG over BM25 on in-domain data
Reranker (small)	`cross-encoder/ms-marco-MiniLM-L6-v2`	~1,800 docs/sec on V100	+ several points NDCG over first stage
Reranker (large)	bge-reranker-v2-gemma / mxbai-rerank-large-v2 / monoT5-3B	~10s to 100s of docs/sec	additional gains on hard / zero-shot data

The first stage's job is recall (does the true positive sit somewhere in the top K?), and the reranker's job is precision (is it at position 1?). Increasing K improves recall but costs one additional cross-encoder forward pass per added candidate; the typical operating point is K between 50 and 200.^[3]^[20]
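The two jobs correspond to two different metrics. The sketch below computes recall@K for a first-stage ranking and the reciprocal rank for a reranked list; averaging the latter over many queries gives MRR@10. The document IDs and orderings are invented for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for d in ranked_ids[:k] if d in relevant)
    return hits / len(relevant)


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """1 / rank of the first relevant document within the top k, else 0."""
    relevant = set(relevant_ids)
    for rank, d in enumerate(ranked_ids[:k], start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


first_stage = ["d4", "d9", "d1", "d7", "d3"]  # hypothetical first-stage order
reranked = ["d7", "d4", "d1", "d9", "d3"]     # hypothetical order after reranking
relevant = ["d7"]

print(recall_at_k(first_stage, relevant, k=3))      # 0.0: the positive is outside the top 3
print(recall_at_k(first_stage, relevant, k=5))      # 1.0: a larger K recovers it
print(reciprocal_rank_at_k(first_stage, relevant))  # 0.25
print(reciprocal_rank_at_k(reranked, relevant))     # 1.0
```

A reranker can only reorder what the first stage retrieved, so a positive that falls outside the top K is lost no matter how accurate the cross-encoder is.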

What are the limitations of cross-encoders?

Throughput. The headline limitation is that cross-encoders cannot precompute document representations. Every (query, document) pair requires a full forward pass over the concatenated sequence, so latency grows linearly with the candidate pool. For interactive search this caps practical reranker depth at roughly the top 100 to 200, and the largest LLM-based rerankers (BGE-reranker-v2-gemma, monoT5-3B, mxbai-rerank-large-v2) further constrain that depth.^[5]^[7]^[19]

Scores are not calibrated across queries. A cross-encoder's raw logit is trained as a per-query relevance score and is not calibrated across queries: a score of 5 for query A and a score of 5 for query B do not imply equal relevance. Most production rerankers either rerank within a single query's pool only or post-process scores with a sigmoid and a query-specific threshold.^[16]
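The sigmoid-and-threshold post-processing can be sketched as follows. The logits and the 0.5 threshold are arbitrary illustrative values, and `normalize_and_filter` is a hypothetical helper; a suitable threshold would have to be tuned per application.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def normalize_and_filter(scored_docs, threshold=0.5):
    """Map raw logits to (0, 1) and keep documents at or above the threshold.

    scored_docs: list of (doc_id, raw_logit) pairs for ONE query.
    """
    normalized = [(doc_id, sigmoid(logit)) for doc_id, logit in scored_docs]
    kept = [(doc_id, p) for doc_id, p in normalized if p >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


raw = [("d1", 4.2), ("d2", -0.3), ("d3", 1.1)]  # hypothetical logits
print(normalize_and_filter(raw, threshold=0.5))
```

The sigmoid squashes scores into a fixed range but does not by itself make them comparable across queries, which is why the threshold is described above as query-specific.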

Domain shift. BEIR demonstrated that cross-encoders trained on MS MARCO transfer well to many domains but degrade on argument retrieval (Touche-2020), citation prediction (SciDocs), and bio-medical retrieval, often by 5 to 15 NDCG@10 points. The benchmark warned that "reranking models on average achieve the best zero-shot performances, however, at high computational costs," but performance is far from uniform across tasks.^[10] RankT5 and the BGE-reranker-v2 family partially close the gap with listwise losses and broader multilingual training, respectively.^[12]^[16]

Context window. Standard BERT-based cross-encoders inherit a 512-token cap, which can truncate longer passages. Newer models (Cohere Rerank 3.5 at 4,096 tokens, Mixedbread mxbai-rerank-v2 at 8,192) push this higher, but documents that exceed the cap still need chunking and aggregation.^[18]^[19]
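Chunking and aggregation can be sketched as below. The example splits a long document into overlapping token windows and takes the best window score as the document score, a strategy often called MaxP. Max-aggregation, the window sizes, and the toy scorer are assumptions made for this sketch; other aggregations such as the first window's score or the sum of scores are also used.

```python
from typing import Callable, List


def split_into_windows(tokens: List[str], window: int, stride: int) -> List[List[str]]:
    """Overlapping token windows; consecutive windows start `stride` tokens apart.

    Assumes 0 < stride <= window so that no token is skipped.
    """
    if len(tokens) <= window:
        return [tokens]
    starts = range(0, len(tokens) - window + stride, stride)
    return [tokens[s:s + window] for s in starts]


def score_long_document(
    query: str,
    document: str,
    pair_score: Callable[[str, str], float],
    window: int = 256,
    stride: int = 128,
) -> float:
    """MaxP-style aggregation: the document score is its best window's score."""
    windows = split_into_windows(document.split(), window, stride)
    return max(pair_score(query, " ".join(w)) for w in windows)


def toy_pair_score(query: str, passage: str) -> float:
    """Stand-in for a cross-encoder: fraction of query tokens found in the passage."""
    q = set(query.lower().split())
    return len(q & set(passage.lower().split())) / len(q)


doc = " ".join(["filler"] * 20 + ["cross", "encoder", "reranking"] + ["filler"] * 20)
print(score_long_document("cross encoder reranking", doc, toy_pair_score, window=8, stride=4))
```

Overlapping windows reduce the chance that a relevant span is cut in half at a boundary, at the price of more forward passes per document.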

Black-box scores. Because the relevance score is an opaque scalar from a deep network, debugging why a particular document was ranked above another is hard. Tools such as TermImportance and attention-rollout exist for cross-encoders but are rarely used in production.

Related concepts

A cross-encoder is one of three commonly contrasted neural retrieval architectures:

Architecture	How query and doc interact	Online cost	Where it wins
Bi-encoder (DPR, Sentence-BERT)	Encoded independently, compared by dot product	O(1) per doc after indexing	First-stage retrieval over millions of docs
Late-interaction (ColBERT)	Per-token bi-encoder, late MaxSim aggregation	Higher than bi-encoder, lower than cross-encoder	High-quality retrieval at moderate scale
Cross-encoder (monoBERT, MS MARCO MiniLM)	Joint attention over concatenated input	O(N) per query	Reranking the top K candidates

In practice these are complementary rather than competing. A modern hybrid pipeline frequently combines a BM25 sparse retriever, a Dense Passage Retrieval (DPR) style dense bi-encoder, and optionally a SPLADE learned-sparse retriever in the first stage, then applies a cross-encoder reranker to the merged candidates.^[3]^[10]
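The late-interaction row in the table above refers to MaxSim scoring, which can be sketched in a few lines. The 2-dimensional token embeddings are invented for illustration; real systems use learned per-token vectors of much higher dimension.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors, doc_vectors):
    """Late interaction: for each query token take its best-matching
    document token similarity, then sum over query tokens."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


# Hypothetical token embeddings: two query tokens, three document tokens.
query_vectors = [[1.0, 0.0], [0.0, 1.0]]
doc_vectors = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
print(maxsim_score(query_vectors, doc_vectors))
```

Document token vectors can still be precomputed, as in a bi-encoder, but the comparison happens per token rather than per text, which is why this design sits between the other two in both cost and accuracy.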

References

1. Reimers, Nils and Gurevych, Iryna, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", arXiv (EMNLP 2019), 2019-08-27. https://arxiv.org/abs/1908.10084. Accessed 2026-05-21.
2. Sentence Transformers Project, "Cross-Encoders documentation", sbert.net, 2025. https://www.sbert.net/examples/cross_encoder/applications/README.html. Accessed 2026-05-21.
3. Sentence Transformers Project, "Retrieve & Re-Rank Pipeline documentation", sbert.net, 2025. https://sbert.net/examples/sentence_transformer/applications/retrieve_rerank/README.html. Accessed 2026-05-21.
4. Nogueira, Rodrigo and Cho, Kyunghyun, "Passage Re-ranking with BERT", arXiv 1901.04085, 2019-01-13 (revised 2020-04-14). https://arxiv.org/abs/1901.04085. Accessed 2026-05-21.
5. Cross-Encoder Team, "cross-encoder/ms-marco-MiniLM-L6-v2 model card", Hugging Face Hub, 2020. https://huggingface.co/cross-encoder/ms-marco-MiniLM-L6-v2. Accessed 2026-05-21.
6. Jina AI, "Jina Reranker v2 for Agentic RAG", jina.ai blog, 2024-06-25. https://jina.ai/news/jina-reranker-v2-for-agentic-rag-ultra-fast-multilingual-function-calling-and-code-search/. Accessed 2026-05-21.
7. BAAI, "BAAI/bge-reranker-v2-gemma model card", Hugging Face Hub, 2024. https://huggingface.co/BAAI/bge-reranker-v2-gemma. Accessed 2026-05-21.
8. Sentence Transformers Project, "Sentence Transformers documentation index", sbert.net, 2025. https://www.sbert.net/index.html. Accessed 2026-05-21.
9. Karpukhin, Vladimir et al., "Dense Passage Retrieval for Open-Domain Question Answering", arXiv 2004.04906 (EMNLP 2020), 2020-04-10 (revised 2020-09-30). https://arxiv.org/abs/2004.04906. Accessed 2026-05-21.
10. Thakur, Nandan; Reimers, Nils; Rücklé, Andreas; Srivastava, Abhishek; Gurevych, Iryna, "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models", arXiv 2104.08663 (NeurIPS 2021 Datasets and Benchmarks), 2021-04-17. https://arxiv.org/abs/2104.08663. Accessed 2026-05-21.
11. Nogueira, Rodrigo; Jiang, Zhiying; Pradeep, Ronak; Lin, Jimmy, "Document Ranking with a Pretrained Sequence-to-Sequence Model", arXiv 2003.06713 (Findings of ACL: EMNLP 2020), 2020-03-14. https://arxiv.org/abs/2003.06713. Accessed 2026-05-21.
12. Zhuang, Honglei; Qin, Zhen; Jagerman, Rolf; et al., "RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses", arXiv 2210.10634 (SIGIR 2023), 2022-10-19. https://arxiv.org/abs/2210.10634. Accessed 2026-05-21.
13. BAAI, "BAAI/bge-reranker-v2-m3 model card", Hugging Face Hub, 2024. https://huggingface.co/BAAI/bge-reranker-v2-m3. Accessed 2026-05-21.
14. Aarsen, Tom, "Training and Finetuning Reranker Models with Sentence Transformers", Hugging Face Blog, 2024. https://huggingface.co/blog/train-reranker. Accessed 2026-05-21.
15. Mixedbread AI, "Boost Your Search With The Crispy Mixedbread Rerank Models", mixedbread.com blog, 2024-02-29. https://www.mixedbread.com/blog/mxbai-rerank-v1. Accessed 2026-05-21.
16. BAAI, "BGE-Reranker-v2 documentation", bge-model.com, 2024. https://bge-model.com/bge/bge_reranker_v2.html. Accessed 2026-05-21.
17. Cohere, "Cohere's Rerank Model (Details and Application)", Cohere docs, 2024. https://docs.cohere.com/docs/rerank. Accessed 2026-05-21.
18. Cohere, "Announcing Rerank-v3.5", Cohere changelog, 2024-12-02. https://docs.cohere.com/changelog/rerank-v3.5. Accessed 2026-05-21.
19. Mixedbread AI, "mxbai-rerank-large-v2 model documentation", mixedbread.com, 2025. https://www.mixedbread.com/docs/models/reranking/mxbai-rerank-large-v2. Accessed 2026-05-21.
20. GitHub, "FlagEmbedding repository README", FlagOpen/FlagEmbedding, 2024. https://github.com/FlagOpen/FlagEmbedding. Accessed 2026-05-21.
