MTEB (Massive Text Embedding Benchmark)
Last reviewed
May 2, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 4,422 words
MTEB, short for Massive Text Embedding Benchmark, is a public evaluation suite for text embedding models. It was introduced in October 2022 by Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers in the paper "MTEB: Massive Text Embedding Benchmark" (arXiv:2210.07316), later accepted to EACL 2023. The original release covered 8 task families across 58 datasets and 112 languages, with 33 embedding models scored on a single shared leaderboard hosted by Hugging Face. MTEB has become the de facto report card for sentence and passage encoders, and almost every embedding model released after late 2022 quotes its MTEB numbers in the model card.
The benchmark sits in an awkward but useful position. It is not as deep as a single specialist suite (BEIR for retrieval, STS-B for similarity), yet it spans enough task families that a model good at only one of them gets exposed quickly. The 2025 follow-up MMTEB (Massive Multilingual Text Embedding Benchmark) extended the same idea to more than 500 tasks across over 250 languages, mostly through community contributions, and is now the natural successor for new evaluations. A 2025 sibling benchmark, MIEB (Massive Image Embedding Benchmark), applied the same methodology to image and image-text encoders.
| Field | Value |
|---|---|
| Released | October 2022 (original paper); EACL 2023 (peer-reviewed publication) |
| Authors | Niklas Muennighoff, Nouamane Tazi, Loïc Magne, Nils Reimers |
| Original paper | arXiv:2210.07316 |
| Hosting | Hugging Face Spaces (Gradio) |
| Code repository | github.com/embeddings-benchmark/mteb |
| Results repository | github.com/embeddings-benchmark/results |
| License | Apache 2.0 (code), task licenses vary |
| Task families (v1) | 8 |
| Datasets (v1) | 58 |
| Languages (v1) | 112 |
| Successor | MMTEB (arXiv:2502.13595, ICLR 2025); MIEB for images (arXiv:2504.10471, ICCV 2025) |
The original paper was driven by Niklas Muennighoff, who at the time was working at Hugging Face, together with co-authors at Hugging Face and Cohere. Their argument in the introduction was that text embedding evaluation in 2022 was fragmented. STS-B and SICK-R dominated the literature on sentence similarity, BEIR had become the standard for zero-shot retrieval, and people who cared about classification or clustering used yet other suites. A model that did well on one of these often did badly on another, and there was no single number that compared, say, Sentence-BERT against OpenAI's text-embedding-ada-002 in a way you could trust.
Muennighoff and his co-authors built MTEB to fill that gap. The benchmark deliberately reuses well-known existing datasets (BEIR for retrieval, SemEval STS years for similarity, MTOP and Banking77 for classification, and so on) rather than inventing new ones. The novelty is in the unification: a single Python package, one evaluation protocol per task family, and one leaderboard. The 33 models in the initial run included Sentence-Transformers checkpoints (all-MiniLM-L6-v2, all-mpnet-base-v2), Sentence-BERT, Sentence-T5 variants based on Google's T5, GTR, LaBSE, LASER2, and OpenAI's text-embedding-ada-002. No fine-tuning was done by the authors; models were evaluated as released.
The headline finding from the paper was blunt: "no particular text embedding method dominates across all tasks." Sentence-T5 XXL won on STS, GTR-XXL won on retrieval, all-mpnet-base-v2 was strong on classification and reranking, and ada-002 was solid but not best-in-class anywhere. That "no winner" result was part of why the benchmark stuck. There was a real ranking to compete for, and no obvious incumbent.
MTEB groups tasks into 8 families. Each family uses one fixed main metric, so a model's score on a task is a single number you can average. The split below is from the 2022 release; MMTEB keeps these families, adds a few new task types, and grows the dataset count by roughly an order of magnitude.
| Task family | Metric | Datasets in MTEB v1 (2022) | Example datasets |
|---|---|---|---|
| Bitext Mining | F1 | 2 | BUCC, Tatoeba |
| Classification | Accuracy | 12 | Banking77, AmazonReviews, MassiveIntent, Emotion, ToxicConversations |
| Clustering | V-measure | 11 | ArxivClustering, RedditClustering, StackExchange, BiorxivClusteringS2S |
| Pair Classification | Average precision (cosine) | 3 | SprintDuplicateQuestions, TwitterSemEval, TwitterURLCorpus |
| Reranking | MAP | 4 | AskUbuntu, MindSmall, SciDocsRR, StackOverflowDupQuestions |
| Retrieval | nDCG@10 | 15 | MS MARCO, NQ, HotpotQA, FiQA, Quora, SciFact, Touche2020, ArguAna, ClimateFEVER, the rest of BEIR |
| Semantic Textual Similarity | Spearman correlation (cosine) | 10 | STS12 through STS17, STS22, SICK-R, BIOSSES, STSBenchmark |
| Summarization | Spearman correlation (cosine) | 1 | SummEval |
A few oddities are worth knowing. Pair Classification uses average precision over cosine similarity scores between pairs labelled as duplicates or not. Reranking starts from a fixed candidate list per query and asks the embedding model to re-score the candidates, so a high reranking score does not imply a high retrieval score (which has to find candidates from scratch in a much larger corpus). The Summarization task is unusual: you embed a summary and the source, then correlate cosine similarity with human-rated quality on SummEval. With only one dataset in that category, summarization scores are noisy, and most analyses skip it or down-weight it.
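To make the similarity-based metrics concrete, here is a minimal sketch of how an STS-style or SummEval-style score could be computed: cosine similarity between each pair of embeddings, then Spearman correlation against the human ratings. The toy vectors and ratings are invented for illustration, and the helper names (`cosine`, `average_ranks`, `spearman`) are assumptions of this sketch, not functions from the mteb package.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy sentence-pair embeddings (invented) and human similarity ratings.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),   # near-identical
    ([1.0, 0.0], [0.5, 0.5]),   # related
    ([1.0, 0.0], [0.0, 1.0]),   # unrelated
    ([1.0, 0.0], [-1.0, 0.2]),  # opposite
]
gold_ratings = [4.8, 3.1, 1.0, 0.2]

predicted = [cosine(u, v) for u, v in pairs]
sts_score = spearman(predicted, gold_ratings)
```

Because Spearman only looks at rank order, the model is not penalized for cosine values that sit on a different scale from the human ratings; it only has to order the pairs the same way.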
The Retrieval family is dominated by BEIR, and BEIR contributes 15 of the original 58 datasets. This is the source of one of MTEB's recurring criticisms: the average score is heavily influenced by retrieval-style tasks. A model trained with BEIR-style hard negatives gets a free boost on the average even if it does nothing special on classification or clustering.
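For reference, the retrieval metric, nDCG@10, can be sketched in a few lines. This version assumes linear gain and a log2 position discount, and it takes the ranked document IDs as given; the document IDs and relevance judgments are invented, and this is an illustration rather than the exact evaluation code used by BEIR or mteb.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document IDs in the order the model ranked them.
    qrels: dict mapping relevant document IDs to a relevance grade.
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy query: two relevant documents, ranked second and third by the model.
qrels = {"d1": 1, "d7": 1}
ranking = ["d3", "d1", "d7", "d9"]
score = ndcg_at_k(ranking, qrels)
```

A task-level score is then the mean of this per-query value over all queries in the dataset.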
The metric per family was chosen to match what the field already used for that task type, so MTEB scores are comparable to standalone results in the underlying papers. The choice does have consequences, though.
On the original leaderboard, the headline MTEB score is the unweighted mean over all 56 English dataset scores, not an average of task-family averages. This matters because the families are wildly unbalanced: with 15 retrieval datasets and 1 summarization dataset, a per-dataset mean gives retrieval fifteen times the weight of summarization. Averaging within each family first and then across families would give every task family equal weight, and the newer leaderboard reports a per-task-type mean alongside the per-task mean for that reason. Which weighting is right is debatable: you could argue retrieval should count more than summarization because people care about it more, or that a single-dataset family should not carry an eighth of the score. Either way, the per-dataset average is the number most model cards quote for MTEB v1.
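The difference between the two averaging conventions is easy to see on toy numbers. The sketch below uses invented dataset names and scores, and the function names are assumptions of this example rather than anything from the mteb package.

```python
from collections import defaultdict
from statistics import mean


def dataset_mean(scores):
    """Unweighted mean over every dataset score."""
    return mean(score for _, score in scores.values())


def family_mean(scores):
    """Mean of per-family means, so each task family counts once."""
    by_family = defaultdict(list)
    for family, score in scores.values():
        by_family[family].append(score)
    return mean(mean(values) for values in by_family.values())


# Hypothetical results: dataset name -> (task family, main score).
toy_scores = {
    "RetrievalA": ("Retrieval", 50.0),
    "RetrievalB": ("Retrieval", 54.0),
    "RetrievalC": ("Retrieval", 52.0),
    "SummEval": ("Summarization", 30.0),
}

per_dataset = dataset_mean(toy_scores)  # (50 + 54 + 52 + 30) / 4
per_family = family_mean(toy_scores)    # (52 + 30) / 2
```

With three retrieval datasets and one summarization dataset, the per-dataset mean is 46.5 while the family-balanced mean is 41.0, so a retrieval-heavy model looks stronger under the first convention.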
MTEB is run zero-shot. The model is evaluated as released; no further fine-tuning is allowed against MTEB datasets. Tasks have a designated split (usually test, sometimes validation when no public test set exists), and scores are computed from a single deterministic run.
For classification, a logistic-regression classifier (a linear probe) is fit on embeddings of the train split and evaluated on the test split; the embedding model itself stays frozen. For clustering, retrieval, reranking, STS, and pair classification, only the test split matters because no learned head is involved. For the STS datasets, including SICK-R and STSBenchmark, the gold human similarity ratings are public, which has caused contamination problems (more on this below).
The mteb Python package handles all of this. A typical evaluation looks like the following.
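```python
import mteb
from sentence_transformers import SentenceTransformer

# Load a Sentence-Transformers checkpoint; it is evaluated as released.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# Select tasks by family; here, every retrieval task in the registry.
tasks = mteb.get_tasks(task_types=["Retrieval"])
# Run the evaluation and write the results under the output folder.
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```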
Results are written as JSON files per task. The leaderboard scrapes those JSON files (now usually pulled from the embeddings-benchmark/results GitHub repo) and recomputes the rankings. Anyone can submit a result by opening a pull request against that repo.
The top of the MTEB leaderboard has turned over several times since 2022. Roughly, you can split it into four eras: small encoder models (2022 to early 2023), the BGE-and-E5 era (2023 to mid 2024), the LLM-as-encoder era (mid 2024 onward), and the closed-API resurgence (2025 onward, dominated by Google's Gemini Embedding family).
| Year | Top models | Approximate overall MTEB English score | Notes |
|---|---|---|---|
| 2022 | Sentence-T5 XXL, GTR-XXL, all-mpnet-base-v2, OpenAI text-embedding-ada-002 | ~58 to 61 | Original release. ada-002 around 60.99. |
| 2023 (early) | E5-large-v2 (Microsoft), BGE-large (BAAI), GTE-large | ~62 to 64 | Open BERT-based encoders pull ahead of ada-002. |
| 2023 (late) | bge-large-en-v1.5, Cohere embed-v3, jina-embeddings-v2-base-en, OpenAI text-embedding-3-large | ~64 to 66 | Retrieval-tuned encoders dominate. text-embedding-3-large (released January 2024) around 64.6. |
| 2024 (early) | NV-Embed (NVIDIA), SFR-Embedding (Salesforce), gte-Qwen2-7B-instruct, Stella-1.5B-v5, voyage-large-2-instruct | ~68 to 70 | Larger backbones, often 7B parameters, decoder LLMs adapted to embeddings. |
| 2024 (late) | NV-Embed-v2 | 72.31 | NVIDIA's NV-Embed-v2 reaches 72.31 across the 56 English MTEB tasks reported on the leaderboard, briefly holding the top spot. |
| 2025 to 2026 | Google gemini-embedding-001, gemini-embedding-2-preview, NV-Embed successors, Qwen3-Embedding-8B, NVIDIA Llama-Embed-Nemotron-8B | Low-to-mid 70s on MTEB(eng, v2); not directly comparable to the v1 numbers above | Closed APIs return to the top. Gemini Embedding 001 led MTEB(Multilingual) with a mean task score of 68.32 in 2025, a +5.09 gap over the next entry, and reported about 73.3 on MTEB(eng, v2). |
A couple of points are worth pulling out from that table. First, the top score on the original v1 leaderboard rose from about 61 in 2022 to 72.31 by late 2024, and the leaders have kept improving on the deduplicated MTEB(eng, v2) split since then. Second, the typical model at the top has grown from BERT-scale encoders (roughly 110M to 335M parameters) to several billion parameters (decoder LLMs with last-token pooling, or large closed APIs of unknown size). Whether that is a good trade depends entirely on what you do with the embeddings; for serving billions of queries, the smaller open models often still win on cost.
The BGE family from BAAI deserves a separate note. BGE was released in mid-2023, trained on a heavily curated dataset of hard negatives mined from the BEIR-adjacent space with an in-batch contrastive objective. It dominated the leaderboard for months and became the default open-source embedding choice for RAG pipelines. Most of the LLM-based embedding models that came later (NV-Embed, gte-Qwen2, SFR) borrowed BGE's training recipe in some form: hard negatives, instruction prefixes, last-token or mean pooling on a decoder backbone, and optional Matryoshka representation learning so a single model can produce multiple embedding sizes.
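The reason hard negatives matter can be shown with a small contrastive-loss sketch. The code below is a generic InfoNCE-style loss over cosine similarities, not BGE's actual training code; the temperature of 0.05 and the toy vectors are assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(query, positive, negatives, temperature=0.05):
    """Negative log-softmax of the positive among positive + negatives."""
    candidates = [positive] + list(negatives)
    logits = [cosine(query, c) / temperature for c in candidates]
    top = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum_exp - logits[0]


query = [1.0, 0.0]
positive = [0.9, 0.1]
easy_negative = [0.0, 1.0]  # unrelated passage, trivially separable
hard_negative = [0.8, 0.6]  # topically close passage that is still wrong

loss_easy = info_nce(query, positive, [easy_negative])
loss_hard = info_nce(query, positive, [hard_negative])
```

The easy negative produces a loss close to zero, so it contributes almost no gradient, while the hard negative produces a visibly larger loss. That is the intuition behind mining negatives that are close to the query in embedding space.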
Google released Gemini Embedding (gemini-embedding-001) into general availability via the Gemini API in 2025, citing MTEB scores directly in the launch. On the MTEB(Multilingual) leaderboard, gemini-embedding-001 posted a mean task score of 68.32 with a +5.09 absolute gap over the next entry, taking the top spot from a field of mostly open-source encoders. It also led the English MTEB(eng, v2) tab and the code subset.
In March 2026, Google followed up with Gemini Embedding 2 Preview, marketed as a native multimodal embedding model that produces a unified vector for text, image, video, audio, and PDF inputs. The launch claimed simultaneous #1 finishes on MTEB's English, multilingual, and coding tabs, with a +5 percentage-point gap over the runner-up on multilingual MTEB across 250+ languages. The fact that a closed API has stayed at the top of the leaderboard for over a year is itself a change from the 2023 to 2024 pattern, when open-source models traded the lead.
Not every leaderboard tab tells the same story. NVIDIA's Llama-Embed-Nemotron-8B (arXiv:2511.07025), released in late 2025, took #1 on the MMTEB multilingual leaderboard as of October 2025. It is a Llama-3.1-8B fine-tune trained on 16.1 million query-document pairs (7.7 million from public datasets and 8.4 million synthetic), with 32 hidden layers and a 4096-dim output. Qwen3-Embedding-8B from Alibaba reported 80.68 on MTEB Code in 2025, the strongest publicly reported score on the code subset at the time. The picture is now more fragmented than in 2023: closed APIs win some tabs, open 7B-to-8B models win others, and the relative ranking depends on whether you weight English, multilingual, code, or domain tasks.
MMTEB (Massive Multilingual Text Embedding Benchmark, Enevoldsen et al., arXiv:2502.13595, accepted to ICLR 2025) is the natural successor to the original benchmark. Where MTEB had 58 datasets and 112 languages, MMTEB has more than 500 quality-controlled tasks across over 250 languages, contributed by 85 co-authors. The expansion is mostly community-driven. People who care about a particular language or domain submit their tasks through the same GitHub repository, and once accepted those tasks become part of the benchmark.
MMTEB also adds task types that the original benchmark did not really cover. Long-document retrieval, instruction-following retrieval (where the query includes a task description that the model is supposed to follow), and code retrieval are the three big additions. These are harder than the original tasks, and the gap between models widens at the top.
MMTEB found, somewhat surprisingly, that scaling up to LLM-based embeddings does not always help on multilingual tasks. The paper reports that multilingual-e5-large-instruct, a 560M parameter model from Microsoft, was the best publicly available option overall on the multilingual subset at submission time. Larger models that did well on English MTEB sometimes underperformed it on languages outside of their training distribution. The lesson is roughly the same as the original 2022 finding: no single model dominates, and the ranking depends on which tasks you care about.
A related contribution from the MMTEB paper is a downsampling method that picks a subset of tasks whose model ranking stays close to the full-task ranking. Running the full MMTEB takes hundreds of GPU-hours per model, which is a real barrier. The downsampled MTEB(eng, v2) and similar subsets recover the same rankings at a small fraction of the compute (around 3.11 hours on an H100 for a 7B model, using 2% of the original documents and 6% of the original characters), and have become the practical default for new evaluations.
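One way to sanity-check a downsampled task set is to compare the model ranking it produces with the ranking from the full set. The sketch below uses Kendall's tau for that comparison; it is not the task-selection procedure from the MMTEB paper, and the model names, task names, and scores are all invented.

```python
from itertools import combinations
from statistics import mean


def kendall_tau(x, y):
    """Kendall tau-a between two equal-length score lists (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


def mean_scores(score_table, tasks):
    """Mean score per model over the chosen tasks, in table order."""
    return [mean(row[t] for t in tasks) for row in score_table.values()]


# Hypothetical per-task scores for three models.
score_table = {
    "model_a": {"t1": 70, "t2": 65, "t3": 60, "t4": 72},
    "model_b": {"t1": 66, "t2": 61, "t3": 63, "t4": 65},
    "model_c": {"t1": 58, "t2": 55, "t3": 50, "t4": 59},
}

full = mean_scores(score_table, ["t1", "t2", "t3", "t4"])
tau_two_tasks = kendall_tau(full, mean_scores(score_table, ["t1", "t4"]))
tau_one_task = kendall_tau(full, mean_scores(score_table, ["t3"]))
```

Here the two-task subset reproduces the full ranking exactly (tau of 1.0), while the single task `t3` swaps the top two models and drops tau to one third. A useful subset is one that keeps this correlation high while cutting most of the compute.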
MIEB (Massive Image Embedding Benchmark, Xiao et al., arXiv:2504.10471, ICCV 2025) ports the MTEB methodology to vision and vision-language encoders. The benchmark covers 130 individual tasks across 38 languages, grouped into 8 high-level categories, and the initial paper benchmarks 50 image and image-text models. Tasks include image classification, image-text retrieval, multilingual image-text retrieval, document understanding, visual STS, compositionality (matching captions in the presence of confounders), and interleaved encodings.
The paper's headline finding mirrors MTEB's: no single image encoder dominates across all eight categories. A second result is more practically useful. MIEB scores correlate strongly with how well a vision encoder performs when plugged into a multimodal large language model, which makes MIEB a reasonable proxy for selecting vision backbones for multimodal LLM training without having to train the full LLM. The same mteb Python package and Hugging Face Space host the MIEB leaderboard, alongside the text and code tabs.
A family of MTEB variants now exists for specific languages and domains.
| Variant | Scope | Notes |
|---|---|---|
| MTEB(eng, v1) | The original 56-task English benchmark | Still the most-quoted single number for older models. |
| MTEB(eng, v2) | Curated and deduplicated English subset | Used in the leaderboard's "v2" tab; reduces task overlap and is the current default for new English submissions. |
| MTEB-French | French-specific tasks | Released by community contributors; covers AlloProf, MIRACL-fr, and others. |
| MTEB-Polish | Polish tasks | Used as the standard for Polish embedding evaluation. |
| C-MTEB (Chinese) | Chinese embedding tasks | Released alongside the BGE model paper (arXiv:2309.07597) by the BAAI team. |
| SEB (Scandinavian) | Danish, Swedish, Norwegian | Scandinavian Embedding Benchmark, integrated into the MMTEB infrastructure. |
| MTEB-Code (CoIR) | Code search and retrieval | Tasks include CodeSearchNet, CosQA, and several function-name and docstring matching tasks. |
| MTEB-Law | Legal retrieval and similarity | Smaller suite; useful for domain-specific embedding evaluation. |
| MTEB(Multilingual, v1) | Multilingual subset of MMTEB | Default "multilingual" leaderboard tab. |
| MIEB | Image and image-text embeddings | 130 tasks across 38 languages; ICCV 2025. |
The variants share infrastructure with the main package. They are defined as named benchmarks and task filters in the same Python library, and they appear as separate tabs in the leaderboard UI.
MTEB has taken a fair amount of criticism, and most of it is legitimate.
Test-set contamination. Many of the underlying datasets are old enough that they appear in the training data of any large language model. STS-B, SICK-R, MS MARCO, NQ, and HotpotQA are all on the public web. Models trained with internet-scale corpora have likely seen them. The MMTEB paper and several follow-ups ("Maintaining MTEB", arXiv:2506.21182) discuss this directly. The pragmatic response has been to add new tasks (MTEB(eng, v2), MMTEB) that did not exist publicly when older models were trained. This helps but does not solve the problem, since even "new" datasets can get scraped quickly.
Retrieval bias from BEIR. Of the original 58 datasets, 15 are retrieval and most of those come from BEIR. If you train your model with hard-negative mining against BEIR-like data, you get a built-in advantage on the average score. This is not cheating in the strict sense (BEIR is the standard retrieval benchmark) but it does mean MTEB's overall number rewards retrieval-focused training and may underweight the other task types in practice.
STS dominance early on. Through 2022 and early 2023, the gap between models on retrieval and classification was narrower than the gap on STS, which meant STS Spearman correlations had outsize influence on the average. As retrieval tasks improved (and as models started solving STS at near-ceiling levels), this concern faded somewhat.
Reproducibility. Because the leaderboard accepts community submissions, and because some submissions report numbers without releasing weights or training data, several entries have been called out as not independently reproducible. The MMTEB paper explicitly notes that some top models (stella-1.5B-v5, gte-Qwen2-7B-instruct, bge-multilingual-gemma2, voyage-large-2-instruct, text-embedding-3-large among them) "have not disclosed key technical details necessary for reproduction." The leaderboard now has filters and badges to flag this, but the issue is structural.
Closed-API opacity. With Gemini Embedding and other closed APIs returning to the top of the leaderboard in 2025 and 2026, a related concern is that the public cannot inspect what training data they used, what context-length tricks they employ, or how they handle the long-tail languages on the multilingual tab. Closed-API entries are ranked the same way as open-weight models, which some researchers argue is unfair to either side.
Evaluation cost. Running the full MTEB takes several GPU-hours for a small encoder and tens of hours for a 7B model. MMTEB is much worse. This pushes researchers to use the downsampled versions, which is fine in practice but means the headline number you see for a new model is sometimes computed on a smaller subset than the leaderboard's full ranking.
Single-pooling assumption. MTEB scores a model as configured. A model that uses mean pooling for one task and CLS pooling for another (a reasonable thing to do) is not really supported, even though that is how some research models behave. The package supports it through encoder wrapper code, but the leaderboard implicitly assumes a single embedding function per model.
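The pooling strategies mentioned in the last point differ only in how they collapse per-token vectors into one embedding. The sketch below shows mean, CLS, and last-token pooling on plain Python lists, plus a small per-task dispatcher. It assumes right-padded inputs, and the task-to-pooling mapping is hypothetical; it is not how mteb or any particular model wires this up.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real (non-padding) tokens."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m]
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def cls_pool(token_vectors, attention_mask):
    """Use the first token's vector (the [CLS] position in BERT-style encoders)."""
    return list(token_vectors[0])


def last_token_pool(token_vectors, attention_mask):
    """Use the last real token's vector, as decoder-based embedders often do."""
    last = max(i for i, m in enumerate(attention_mask) if m)
    return list(token_vectors[last])


# Hypothetical mapping from task type to pooling function.
POOLING_BY_TASK_TYPE = {
    "Retrieval": last_token_pool,
    "STS": mean_pool,
}


def pool_for_task(task_type, token_vectors, attention_mask):
    """Pick a pooling function per task type, defaulting to mean pooling."""
    pooler = POOLING_BY_TASK_TYPE.get(task_type, mean_pool)
    return pooler(token_vectors, attention_mask)


# Three real tokens followed by one padding position.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]

retrieval_vec = pool_for_task("Retrieval", tokens, mask)
clustering_vec = pool_for_task("Clustering", tokens, mask)
```

A wrapper along these lines gives one model several embedding functions, which is exactly the case a single leaderboard row per model does not represent well.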
None of these criticisms are fatal. They are the normal failure modes of any benchmark that gets popular, and the maintainers have been responsive to them. Still, anyone using MTEB as the only signal for model selection is making a mistake.
The mteb Python package (pip install mteb) is the entire user-facing surface. It provides:
- A task registry (mteb.get_tasks(...)) that can filter by language, domain, task type, or specific dataset name.
- A model registry (mteb.get_model(...)) that wraps Sentence-Transformers, OpenAI, Cohere, Voyage AI, and custom callable models with a unified interface.
- An evaluate(...) or MTEB(...).run(...) entry point that runs the full evaluation pipeline and writes JSON results.
- A command-line interface: mteb run --model name --tasks task1 task2.

Results are stored in the embeddings-benchmark/results GitHub repository, which the leaderboard reads from. The leaderboard itself is a Gradio Hugging Face Space at huggingface.co/spaces/mteb/leaderboard, and it has tabs for English (v1, v2), Multilingual, Code, Law, French, Polish, Image (MIEB), and several others. New results show up within a few hours of being merged into the results repo.
It is hard to overstate how much MTEB shaped the open-source embedding landscape between 2023 and 2025. Before MTEB, embedding models were sold either with cherry-picked task numbers or with a citation to BEIR, which only covered retrieval. After MTEB, every serious release came with a 56-row table.
The direct consequences:
Open-source embeddings caught up to and overtook ada-002. ada-002 was the default embedding API for RAG pipelines through most of 2023, despite being a closed model with a paid API. By the second half of 2023, the BGE family and E5 from Microsoft were beating it on MTEB, and ada-002 started showing up in tutorials as the slow, expensive option you used if you did not want to self-host.
Hard-negative mining became standard. The training recipe that won MTEB (BGE, then E5, then the LLM-based models) is built around mining hard negatives from a large corpus and using them in a contrastive loss. That recipe is now the default starting point for any new embedding model.
Matryoshka representation learning got popular. Because the leaderboard rewards both quality and practical utility, models that could output multiple embedding dimensions (a 1024-dim and a 256-dim from the same forward pass) gained an advantage in real applications. Matryoshka, originally introduced in a 2022 paper, became standard in 2024 releases.
OpenAI shipped text-embedding-3. OpenAI's release of text-embedding-3-small and text-embedding-3-large in January 2024 cited MTEB scores directly, with text-embedding-3-large reaching around 64.6 on MTEB English. The release notes were essentially "here is our new MTEB number", and the model was positioned against open-source competitors.
The benchmark itself became a hiring signal. Several embedding research groups (NVIDIA, Salesforce, Voyage AI, Mistral) hired on the strength of leaderboard results. The leaderboard is competitive enough that being top-3 has commercial value.
Closed APIs came back. Google's Gemini Embedding 001 (2025) and Gemini Embedding 2 Preview (2026) re-established a clear closed-API lead on MTEB(eng, v2) and MTEB Multilingual after about two years of open models dominating the top.
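The Matryoshka point in the list above comes down to a simple operation at inference time: keep the first few dimensions of the embedding and re-normalize. A minimal sketch follows, with an invented 4-dimensional vector. Truncation like this only preserves quality for models trained with a Matryoshka-style objective; for other models it is an assumption that may not hold.

```python
import math


def truncate_and_renormalize(embedding, dim):
    """Keep the first `dim` components and rescale to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError("dim must be between 1 and the embedding length")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in head]


# Toy unit-length embedding; real models emit hundreds or thousands of dims.
full_embedding = [0.6, 0.0, 0.8, 0.0]
short_embedding = truncate_and_renormalize(full_embedding, 2)
```

The shortened vector can be stored and searched at a fraction of the memory cost, which is why a single checkpoint that supports several output sizes is attractive in production.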
MTEB's broader effect was to commoditize embeddings. By 2025 you could get a free open-source model that was within 5 points of the absolute best closed model, deploy it on a single GPU, and get reasonable results on almost any text task. That is a meaningful change from 2022, when the practical choice was "pay OpenAI or use a small Sentence-Transformers checkpoint and hope for the best."
A few things MTEB does not really capture, which matter for production use.