MTEB (Massive Text Embedding Benchmark)

Updated Jun 22, 2026

MTEB, short for Massive Text Embedding Benchmark, is the standard public leaderboard for evaluating text embedding models across many task types at once. It was introduced in October 2022 by Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers (then at Hugging Face and Cohere) in the paper "MTEB: Massive Text Embedding Benchmark" (arXiv:2210.07316), later accepted to EACL 2023.^[1] The original release scored 33 embedding models on 8 task families spanning 58 datasets and 112 languages, on a single shared leaderboard hosted on Hugging Face.^[1] MTEB has become the de facto report card for sentence and passage encoders, and almost every embedding model released after late 2022 quotes its MTEB numbers in the model card; the public leaderboard now holds thousands of submissions.

The benchmark sits in an awkward but useful position. It is not as deep as a single specialist suite (BEIR for retrieval, STS-B for similarity), yet it spans enough task families that a model good at only one of them gets exposed quickly. The headline finding of the 2022 paper was blunt: "no particular text embedding method dominates across all tasks," which the authors took as evidence that "the field has yet to converge on a universal text embedding method."^[1] The 2025 follow-up MMTEB (Massive Multilingual Text Embedding Benchmark) extended the same idea to more than 500 tasks across over 250 languages, mostly through community contributions, and is now the natural successor for new evaluations.^[2] A 2025 sibling benchmark, MIEB (Massive Image Embedding Benchmark), applied the same methodology to image and image-text encoders.^[3]

Facts

Field	Value
Released	October 2022 (original paper); EACL 2023 (peer-reviewed publication)
Authors	Niklas Muennighoff, Nouamane Tazi, Loïc Magne, Nils Reimers
Original paper	arXiv:2210.07316
Hosting	Hugging Face Spaces (Gradio)
Code repository	github.com/embeddings-benchmark/mteb
Results repository	github.com/embeddings-benchmark/results
License	Apache 2.0 (code), task licenses vary
Task families (v1)	8
Datasets (v1)	58
Languages (v1)	112
Models scored (v1)	33
Successor	MMTEB (arXiv:2502.13595, ICLR 2025); MIEB for images (arXiv:2504.10471, ICCV 2025)

Why was MTEB created?

The original paper was driven by Niklas Muennighoff, who at the time was working at Hugging Face, together with co-authors at Hugging Face and Cohere.^[1] Their argument in the introduction was that text embedding evaluation in 2022 was fragmented. STS-B and SICK-R dominated the literature on sentence similarity, BEIR had become the standard for zero-shot retrieval, and people who cared about classification or clustering used yet other suites.^[1] A model that did well on one of these often did badly on another, and there was no single number that compared, say, Sentence-BERT against OpenAI's text-embedding-ada-002 in a way you could trust.

Muennighoff and his co-authors built MTEB to fill that gap. The benchmark deliberately reuses well-known existing datasets (BEIR for retrieval, SemEval STS years for similarity, MTOP and Banking77 for classification, and so on) rather than inventing new ones.^[1] The novelty is in the unification: a single Python package, one evaluation protocol per task family, and one leaderboard. The 33 models in the initial run included Sentence-Transformers checkpoints (all-MiniLM-L6-v2, all-mpnet-base-v2), Sentence-BERT, Sentence-T5 variants based on Google's T5, GTR, LaBSE, LASER2, and OpenAI's text-embedding-ada-002.^[1] No fine-tuning was done by the authors; models were evaluated as released.

The headline finding from the paper was that "no particular text embedding method dominates across all tasks."^[1] Sentence-T5 XXL won on STS, GTR-XXL won on retrieval, all-mpnet-base-v2 was strong on classification and reranking, and ada-002 was solid but not best-in-class anywhere.^[1] That "no winner" result was part of why the benchmark stuck. There was a real ranking to compete for, and no obvious incumbent.

What task families and datasets does MTEB cover?

MTEB groups tasks into 8 families. Each family uses one fixed metric, so a model's score on a task is a single number you can average.^[1] The split below is from the 2022 release; MMTEB has since added further task types, and the dataset count has grown by roughly an order of magnitude.^[2]

Task family	Metric	Datasets in MTEB v1 (2022)	Example datasets
Bitext Mining	F1	2	BUCC, Tatoeba
Classification	Accuracy	12	Banking77, AmazonReviews, MassiveIntent, Emotion, ToxicConversations
Clustering	V-measure	11	ArxivClustering, RedditClustering, StackExchange, BiorxivClusteringS2S
Pair Classification	Average precision (cosine)	3	SprintDuplicateQuestions, TwitterSemEval, TwitterURLCorpus
Reranking	MAP	4	AskUbuntu, MindSmall, SciDocsRR, StackOverflowDupQuestions
Retrieval	nDCG@10	15	MS MARCO, NQ, HotpotQA, FiQA, Quora, SciFact, Touche2020, ArguAna, ClimateFEVER, the rest of BEIR
Semantic Textual Similarity	Spearman correlation (cosine)	10	STS12 through STS22, SICK-R, BIOSSES, STSBenchmark
Summarization	Spearman correlation (cosine)	1	SummEval

A few oddities are worth knowing. Pair Classification uses average precision over cosine similarity scores between pairs labelled as duplicates or not. Reranking starts from a fixed candidate list per query and asks the embedding model to re-score the candidates, so a high reranking score does not imply a high retrieval score (which has to find candidates from scratch in a much larger corpus). The Summarization task is unusual: you embed a summary and the source, then correlate cosine similarity with human-rated quality on SummEval. With only one dataset in that category, summarization scores are noisy, and most analyses skip it or down-weight it.
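As a rough illustration of the Pair Classification scoring described above, the sketch below ranks sentence pairs by cosine similarity and computes average precision against duplicate labels. The toy vectors, the labels, and the helper names (cosine, average_precision) are made up for the example, and ties are resolved by sort order here, which may differ from what the mteb package does.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_precision(labels, scores):
    """Average precision for binary labels (1 = duplicate), ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


# Toy data: (embedding of sentence A, embedding of sentence B, is_duplicate).
pairs = [
    ([1.0, 0.0], [0.9, 0.1], 1),
    ([1.0, 0.0], [0.0, 1.0], 0),
    ([0.5, 0.5], [0.6, 0.4], 1),
    ([1.0, 0.2], [-1.0, 0.1], 0),
]
scores = [cosine(a, b) for a, b, _ in pairs]
labels = [label for _, _, label in pairs]
ap = average_precision(labels, scores)
print(f"average precision: {ap:.3f}")
```

Both duplicate pairs receive the highest cosine scores in this toy set, so the ranking is perfect; swapping in a less separable set of vectors lowers the value.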

The Retrieval family is dominated by BEIR, and BEIR contributes 15 of the original 58 datasets.^[7] This is the source of one of MTEB's recurring criticisms: the average score is heavily influenced by retrieval-style tasks. A model trained with BEIR-style hard negatives gets a free boost on the average even if it does nothing special on classification or clustering.

How are the metrics defined?

The metric per family was chosen to match what the field already used for that task type, so MTEB scores are comparable to standalone results in the underlying papers.^[1] The choice does have consequences, though.

Bitext Mining: F1 over the closest-pair retrieval across two language sides. Matches the BUCC and Tatoeba conventions.
Classification: A logistic regression is fit on top of the frozen embeddings (a 100-iteration limit, default scikit-learn settings), and accuracy is reported.^[1] This is a linear probe, not full fine-tuning. Models that produce well-separated clusters in raw cosine space do well; models that need a non-linear head do less well.
Clustering: K-means is run with a known number of clusters, and V-measure is reported.^[1] V-measure is the harmonic mean of homogeneity and completeness; it is symmetric and bounded in [0, 1].
Pair Classification: Cosine similarities between embedding pairs are scored against binary labels using average precision. Some configurations also report max F1 across thresholds.
Reranking: Mean Average Precision (MAP) over a fixed candidate set per query.
Retrieval: nDCG@10 (the same metric BEIR uses).^[7] Documents are ranked by cosine similarity and the top 10 are scored against graded relevance.
STS: Spearman rank correlation between cosine similarity and human similarity ratings. Spearman is preferred over Pearson because the relationship between cosine and similarity is monotonic but not necessarily linear.
Summarization: Spearman correlation, same idea, applied to summary-document pairs.
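Two of the metrics in the list above are easy to get subtly wrong, so a minimal pure-Python sketch may help. It assumes linear gains and a log2 rank discount for nDCG@10 (the trec_eval convention) and average ranks for ties in Spearman; the function names and toy inputs are illustrative and not taken from the mteb package.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gains and a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """relevance maps doc_id -> graded relevance; unjudged documents count as 0."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            ranks[idx] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Retrieval: the model ranked d2 above d1 and returned an unjudged document d4.
relevance = {"d1": 2, "d2": 1, "d3": 0}
ndcg_imperfect = ndcg_at_k(["d2", "d1", "d4"], relevance)
ndcg_perfect = ndcg_at_k(["d1", "d2", "d3"], relevance)

# STS: cosine similarities that are monotonic in the human ratings but not linear.
cosine_sims = [0.10, 0.40, 0.50, 0.95]
human_ratings = [0.0, 2.0, 3.0, 5.0]
rho = spearman(cosine_sims, human_ratings)
print(ndcg_imperfect, ndcg_perfect, rho)
```

The STS example shows why a rank correlation suits the task: the cosine values are not a linear function of the ratings, yet the ordering matches exactly, so Spearman is at its maximum.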

The headline MTEB(eng, v1) score is the unweighted mean over the 56 English dataset scores, not a mean of task-family means.^[1] This matters because the families are wildly unbalanced: with 15 retrieval datasets and 1 summarization dataset, a per-dataset mean gives retrieval fifteen times the weight of summarization. Averaging within each family first and then across families would give every family equal weight, and the current leaderboard reports both views, as a mean over tasks and a mean over task types. Which weighting is right is debatable; you could argue retrieval should count more than summarization since people care about it more, or that a single-dataset family should not carry a full family's weight. Either way, check which of the two means a quoted number refers to.
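The two averaging conventions contrasted above can produce noticeably different headline numbers for the same model. The sketch below uses made-up scores for three families of unequal size; the numbers and function names are purely illustrative.

```python
from statistics import mean

# Made-up per-dataset scores, grouped by task family.
scores_by_family = {
    "Retrieval": [50.0, 52.0, 48.0, 50.0],
    "STS": [80.0, 82.0],
    "Summarization": [30.0],
}


def mean_over_datasets(scores):
    """Every dataset counts once, so large families dominate."""
    return mean(s for family in scores.values() for s in family)


def mean_over_families(scores):
    """Average within each family first, then across families (equal family weight)."""
    return mean(mean(family) for family in scores.values())


per_dataset = mean_over_datasets(scores_by_family)
per_family = mean_over_families(scores_by_family)
print(f"mean over datasets: {per_dataset:.2f}")
print(f"mean over families: {per_family:.2f}")
```

Here the four retrieval datasets pull the per-dataset mean toward the retrieval score, while the single summarization dataset carries a third of the per-family mean.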

How is MTEB run?

MTEB is run zero-shot. The model is evaluated as released; no further fine-tuning is allowed against MTEB datasets.^[1] Tasks have a designated split (usually test, sometimes validation when no public test set exists), and scores are computed from a single deterministic run.

For classification, the linear probe is fit on the train split and evaluated on the test split. For clustering, retrieval, reranking, STS, and pair classification, only the test split matters because no learned head is involved. For STS datasets such as SICK-R and STSBenchmark, the gold human similarity ratings are public, which has caused contamination problems (more on this below).

The mteb Python package handles all of this.^[10] A typical evaluation looks like the following.

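import mteb
from sentence_transformers import SentenceTransformer

# Load the model exactly as released; MTEB does no fine-tuning.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Select every registered task of one family (this can be a long run).
tasks = mteb.get_tasks(task_types=["Retrieval"])

# Run the evaluation and write the results under the output folder.
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")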

Results are written as JSON files per task. The leaderboard scrapes those JSON files (now usually pulled from the embeddings-benchmark/results GitHub repo) and recomputes the rankings.^[12] Anyone can submit a result by opening a pull request against that repo.^[12]

Which models top the MTEB leaderboard?

The top of the MTEB leaderboard has turned over several times since 2022. Roughly, you can split it into four eras: small encoder models (2022 to early 2023), the BGE-and-E5 era (2023 to mid 2024), the LLM-as-encoder era (mid 2024 onward), and the closed-API resurgence (2025 onward, dominated by Google's Gemini Embedding family).

Year	Top models	Approximate overall MTEB English score	Notes
2022	Sentence-T5 XXL, GTR-XXL, all-mpnet-base-v2, OpenAI text-embedding-ada-002	~58 to 61	Original release. ada-002 around 60.99.
2023 (early)	E5-large-v2 (Microsoft), BGE-large (BAAI), GTE-large	~62 to 64	Open BERT-based encoders pull ahead of ada-002.
2023 (late)	bge-large-en-v1.5, Cohere embed-v3, jina-embeddings-v2-base-en, OpenAI text-embedding-3-large	~64 to 66	Public retrieval-tuned encoders dominate. text-embedding-3-large around 64.6.
2024 (early)	NV-Embed (NVIDIA), SFR-Embedding (Salesforce), gte-Qwen2-7B-instruct, Stella-1.5B-v5, voyage-large-2-instruct	~68 to 70	Larger backbones, often 7B parameters, decoder LLMs adapted to embeddings.
2024 (late)	NV-Embed-v2	72.31	NVIDIA's NV-Embed-v2 reaches 72.31 across the 56 English MTEB tasks (No. 1 as of August 30, 2024), briefly holding the top spot.^[4]
2025 to 2026	Google gemini-embedding-001, Qwen3-Embedding-8B, NV-Embed successors, NVIDIA Llama-Embed-Nemotron-8B	68.32 on MTEB Multilingual (Gemini); higher on the original v1 English split	Closed APIs and 8B open models trade the top across leaderboard tabs. Gemini Embedding led MTEB Multilingual with 68.32 in 2025.

A couple of points are worth pulling out from that table. First, scores have risen substantially since 2022 on the harder, deduplicated MTEB(eng, v2) split, and by an even larger margin on the original v1 leaderboard. Second, the model size at the top has gone from around 110M parameters (encoder-only BERT base) to several billion (decoder LLMs with last-token pooling, or large closed APIs of unknown size). Whether that is a good trade depends entirely on what you do with the embeddings; for serving billions of queries, the smaller open models often still win on cost.

The BGE family from BAAI deserves a separate note. BGE was released in mid-2023, trained on a heavily curated dataset of hard negatives mined from the BEIR-adjacent space with an in-batch contrastive objective.^[5] It dominated the leaderboard for months and became the default open-source embedding choice for RAG pipelines. Most of the LLM-based embedding models that came later (NV-Embed, gte-Qwen2, SFR) borrowed BGE's training recipe in some form: hard negatives, instruction prefixes, last-token or mean pooling on a decoder backbone, and optional Matryoshka representation learning so a single model can produce multiple embedding sizes.^[4]
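To make the "contrastive objective with hard negatives" idea concrete, here is a minimal sketch of an InfoNCE-style loss for a single query. It is not BGE's actual training code: the temperature value, the toy unit vectors, and the function names are assumptions chosen for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Negative log-softmax of the positive's similarity among all candidates.

    Vectors are assumed to be unit-normalized, so the dot product is cosine similarity.
    """
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]


query = [1.0, 0.0]
positive = [0.8, 0.6]
easy_negative = [-1.0, 0.0]  # clearly unrelated to the query
hard_negative = [0.6, 0.8]   # almost as close to the query as the positive

loss_easy = info_nce_loss(query, positive, [easy_negative])
loss_hard = info_nce_loss(query, positive, [easy_negative, hard_negative])
print(loss_easy, loss_hard)
```

The easy negative contributes almost nothing to the loss, while the hard negative produces a visibly larger value and therefore a training signal, which is the usual argument for mining hard negatives.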

How did closed APIs return to the top?

Google first published gemini-embedding-exp-03-07 on March 7, 2025, announcing that it "achieves a mean (task) score of 68.32, a margin of +5.81 over the next competing model" on the MTEB Multilingual leaderboard, and that it "achieves the top rank on the Massive Text Embedding Benchmark (MTEB) Multilingual leaderboard."^[14] The stable version, gemini-embedding-001, later reached general availability via the Gemini API and Vertex AI, retaining a lead across retrieval, classification, and other domains.^[14] The 68.32 figure is the MTEB Multilingual mean task score, not the English-only v2 split.^[14]

The open-source field did not concede the top quietly. Alibaba's Qwen3-Embedding-8B, released in June 2025, scored 70.58 on the MTEB Multilingual leaderboard as of June 5, 2025, briefly surpassing Gemini Embedding among publicly evaluated models, and posted strong results on the code subset as well. The fact that a closed API has stayed near the top of the leaderboard since 2025 is itself a change from the 2023 to 2024 pattern, when open-source models reliably traded the lead.

Not every leaderboard tab tells the same story. NVIDIA's Llama-Embed-Nemotron-8B (arXiv:2511.07025), released in late 2025, took #1 on the MMTEB multilingual leaderboard as of October 2025.^[8]^[13] It is a Llama-3.1-8B fine-tune trained on 16.1 million query-document pairs (7.7 million from public datasets and 8.4 million synthetic), with 32 hidden layers and a 4096-dim output.^[8] The picture is now more fragmented than in 2023: closed APIs win some tabs, open 7B-to-8B models win others, and the relative ranking depends on whether you weight English, multilingual, code, or domain tasks.

What is MMTEB and how does it differ from MTEB?

MMTEB (Massive Multilingual Text Embedding Benchmark, Enevoldsen et al., arXiv:2502.13595, accepted to ICLR 2025) is the natural successor to the original benchmark.^[2] Where MTEB had 58 datasets and 112 languages, MMTEB has more than 500 quality-controlled tasks across over 250 languages, contributed by a large community of co-authors.^[2] The expansion is mostly community-driven. People who care about a particular language or domain submit their tasks through the same GitHub repository, and once accepted those tasks become part of the benchmark.

MMTEB also adds task types that the original benchmark did not really cover. Long-document retrieval, instruction-following retrieval (where the query includes a task description that the model is supposed to follow), and code retrieval are the three big additions.^[2] These are harder than the original tasks, and the gap between models widens at the top.

MMTEB found, somewhat surprisingly, that scaling up to LLM-based embeddings does not always help on multilingual tasks. The paper reports that multilingual-e5-large-instruct, a 560M parameter model from Microsoft, was a strong publicly available option overall on the multilingual subset at submission time, competitive with much larger models.^[2] Larger models that did well on English MTEB sometimes underperformed it on languages outside of their training distribution. The lesson is roughly the same as the original 2022 finding: no single model dominates, and the ranking depends on which tasks you care about.

A related contribution from the MMTEB paper is a downsampling method that picks a subset of tasks whose ranking of models correlates with the full-task ranking.^[2] Running the full MMTEB takes hundreds of GPU-hours per model, which is a real barrier. The downsampled MTEB(eng, v2) and similar subsets recover the same rankings at a small fraction of the compute (the paper reports a zero-shot English benchmark that reduces computational cost by roughly 98%), and have become the practical default for new evaluations.^[2]
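The idea of checking whether a task subset preserves the full ranking can be sketched in a few lines. This is only an illustration of the rank-agreement check, not the selection procedure used in the MMTEB paper; the model names, task names, and scores are invented.

```python
from statistics import mean

# Invented scores: model -> task -> score.
scores = {
    "model_a": {"t1": 70, "t2": 60, "t3": 65, "t4": 50},
    "model_b": {"t1": 60, "t2": 55, "t3": 58, "t4": 52},
    "model_c": {"t1": 40, "t2": 45, "t3": 50, "t4": 70},
}


def ranking(scores, tasks):
    """Models ordered from best to worst by mean score over the given tasks."""
    return sorted(scores, key=lambda m: mean(scores[m][t] for t in tasks), reverse=True)


def spearman_no_ties(rank_a, rank_b):
    """Spearman correlation between two orderings of the same items (no ties)."""
    n = len(rank_a)
    position_b = {model: i for i, model in enumerate(rank_b)}
    d_squared = sum((i - position_b[model]) ** 2 for i, model in enumerate(rank_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))


full = ranking(scores, ["t1", "t2", "t3", "t4"])
good_subset = ranking(scores, ["t1", "t3"])
bad_subset = ranking(scores, ["t4"])

rho_good = spearman_no_ties(full, good_subset)
rho_bad = spearman_no_ties(full, bad_subset)
print(full, rho_good, rho_bad)
```

A subset is a useful stand-in only when its correlation with the full ranking stays close to 1; the single task t4 in this toy data reverses the ordering and would be a poor proxy.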

What is MIEB (image embeddings)?

MIEB (Massive Image Embedding Benchmark, Xiao et al., arXiv:2504.10471, ICCV 2025) ports the MTEB methodology to vision and vision-language encoders.^[3] The benchmark covers 130 individual tasks across 38 languages, grouped into 8 high-level categories, and the initial paper benchmarks 50 image and image-text models.^[3] Tasks include image classification, image-text retrieval, multilingual image-text retrieval, document understanding, visual STS, compositionality (matching captions in the presence of confounders), and interleaved encodings.^[3] A lightweight 51-task version, MIEB-lite, offers efficient evaluation while preserving diagnostic power.^[3]

The paper's headline finding mirrors MTEB's: no single image encoder dominates across all eight categories.^[3] A second result is more practically useful. MIEB scores correlate strongly with how well a vision encoder performs when plugged into a multimodal large language model, which makes MIEB a reasonable proxy for selecting vision backbones for multimodal LLM training without having to train the full LLM.^[3] MIEB is implemented in the same mteb Python package, and its leaderboard lives in the same Hugging Face Space, alongside the text and code tabs.

What language-specific and domain variants exist?

A family of MTEB variants now exists for specific languages and domains.

Variant	Scope	Notes
MTEB(eng, v1)	The original 56-task English benchmark	Still the most-quoted single number for older models.
MTEB(eng, v2)	Curated and deduplicated English subset	Used in the leaderboard's "v2" tab; reduces task overlap and is the current default for new English submissions.
MTEB-French	French-specific tasks	Released by community contributors; covers AlloProf, MIRACL-fr, and others.
MTEB-Polish	Polish tasks	Used as the standard for Polish embedding evaluation.
C-MTEB (Chinese)	Chinese embedding tasks	Released alongside the BGE model paper (arXiv:2309.07597) by the BAAI team.
SEB (Scandinavian)	Danish, Norwegian, Swedish	Scandinavian Embedding Benchmark, integrated into the MMTEB infrastructure.
MTEB-Code (CoIR)	Code search and retrieval	Tasks include CodeSearchNet, CosQA, and several function-name and docstring matching tasks.
MTEB-Law	Legal retrieval and similarity	Smaller suite; useful for domain-specific embedding evaluation.
MTEB(Multilingual, v1)	Multilingual subset of MMTEB	Default "multilingual" leaderboard tab.
MIEB	Image and image-text embeddings	130 tasks across 38 languages; ICCV 2025.

The variants share infrastructure with the main package. They are selected as named benchmarks (for example through mteb.get_benchmark(...)) or through language and domain filters in the same Python library, and they appear as separate tabs in the leaderboard UI.^[11]

What are the main criticisms of MTEB?

MTEB has drawn a fair amount of criticism, and most of it is legitimate.

Test-set contamination. Many of the underlying datasets are old enough that they almost certainly appear in the training data of large language models. STS-B, SICK-R, MS MARCO, NQ, and HotpotQA are all on the public web, so models trained on internet-scale corpora have likely seen them. The MMTEB paper and follow-up work such as "Maintaining MTEB" (arXiv:2506.21182) discuss this directly.^[9] The pragmatic response has been to add new tasks (MTEB(eng, v2), MMTEB) that did not exist publicly when older models were trained.^[2] This helps but does not solve the problem, since even "new" datasets can get scraped quickly.

Retrieval bias from BEIR. Of the original 58 datasets, 15 are retrieval and most of those come from BEIR.^[7] If you train your model with hard-negative mining against BEIR-like data, you get a built-in advantage on the average score. This is not cheating in the strict sense (BEIR is the standard retrieval benchmark) but it does mean MTEB's overall number rewards retrieval-focused training and may underweight the other task types in practice.

STS dominance early on. Through 2022 and early 2023, the gap between models on retrieval and classification was narrower than the gap on STS, which meant STS Spearman correlations had outsize influence on the average. As retrieval tasks improved (and as models started solving STS at near-ceiling levels), this concern faded somewhat.

Reproducibility. Because the leaderboard accepts community submissions, and because some submissions report numbers without releasing weights or training data, several entries have been called out as not independently reproducible. The MMTEB paper explicitly notes that some top models (stella-1.5B-v5, gte-Qwen2-7B-instruct, bge-multilingual-gemma2, voyage-large-2-instruct, text-embedding-3-large among them) "have not disclosed key technical details necessary for reproduction."^[2] The leaderboard now has filters and badges to flag this, but the issue is structural.

Closed-API opacity. With Gemini Embedding and other closed APIs returning to the top of the leaderboard in 2025 and 2026, a related concern is that the public cannot inspect what training data they used, what context-length tricks they employ, or how they handle the long-tail languages on the multilingual tab. Closed-API entries are ranked the same way as open-weight models, which some researchers argue is unfair to either side.

Evaluation cost. Running the full MTEB takes several GPU-hours for a small encoder and tens of hours for a 7B model. MMTEB is much worse.^[2] This pushes researchers to use the downsampled versions, which is fine in practice but means the headline number you see for a new model is sometimes computed on a smaller subset than the leaderboard's full ranking.

Single-pooling assumption. MTEB scores a model as configured. A model that uses mean pooling for one task and CLS pooling for another (a reasonable thing to do) is not really supported, even though that is how some research models behave. The package supports it through encoder wrapper code, but the leaderboard implicitly assumes a single embedding function per model.
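For readers unfamiliar with the pooling strategies mentioned above, the following sketch shows mean, CLS, and last-token pooling over a toy sequence of token vectors, plus a per-task dispatch of the kind the leaderboard does not really model. It assumes right-padded inputs, and the task-to-pooling mapping is a hypothetical example rather than any real model's configuration.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of non-padding tokens."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(kept[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]


def cls_pool(token_vectors, attention_mask):
    """Use the first token's vector (the [CLS] position in BERT-style encoders)."""
    return list(token_vectors[0])


def last_token_pool(token_vectors, attention_mask):
    """Use the last non-padding token, as decoder-based embedders often do."""
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return list(token_vectors[last])


POOLERS = {"mean": mean_pool, "cls": cls_pool, "last": last_token_pool}

# Hypothetical per-task choice; this mapping is an assumption for illustration only.
POOLING_BY_TASK_TYPE = {"STS": "mean", "Classification": "cls", "Retrieval": "last"}


def embed_for_task(task_type, token_vectors, attention_mask):
    pooler = POOLERS[POOLING_BY_TASK_TYPE[task_type]]
    return pooler(token_vectors, attention_mask)


# Three real tokens followed by one padding position.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

sts_embedding = embed_for_task("STS", tokens, mask)
cls_embedding = embed_for_task("Classification", tokens, mask)
retrieval_embedding = embed_for_task("Retrieval", tokens, mask)
print(sts_embedding, cls_embedding, retrieval_embedding)
```

The same token states yield three different sentence vectors, which is why a model evaluated with one pooling choice is effectively a different model from the same weights evaluated with another.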

None of these criticisms are fatal. They are the normal failure modes of any benchmark that gets popular, and the maintainers have been responsive to them.^[9] Even so, anyone using MTEB as the only signal for model selection is making a mistake.

How do you run MTEB? (Tooling)

The mteb Python package (pip install mteb) is the entire user-facing surface.^[10] It provides:

A registry of tasks (mteb.get_tasks(...)) that can filter by language, domain, task type, or specific dataset name.
A model loading helper (mteb.get_model(...)) that wraps Sentence-Transformers, OpenAI, Cohere, Voyage AI, and custom callable models with a unified interface.
An mteb.evaluate(...) or mteb.MTEB(...).run(...) entry point that runs the full evaluation pipeline and writes JSON results.
A CLI: mteb run --model name --tasks task1 task2.

Results are stored in the embeddings-benchmark/results GitHub repository, which the leaderboard reads from.^[12] The leaderboard itself is a Gradio Hugging Face Space at huggingface.co/spaces/mteb/leaderboard, and it has tabs for English (v1, v2), Multilingual, Code, Law, French, Polish, Image (MIEB), and several others.^[11] New results show up within a few hours of being merged into the results repo.^[12]

How did MTEB shape the embeddings ecosystem?

It is hard to overstate how much MTEB shaped the open-source embedding landscape between 2023 and 2025. Before MTEB, embedding models were sold either with cherry-picked task numbers or with a citation to BEIR, which only covered retrieval.^[7] After MTEB, every serious release came with a multi-row task table.

The direct consequences:

Open-source embeddings caught up to and overtook ada-002. ada-002 was the default embedding API for RAG pipelines from its release in December 2022 through most of 2023, despite being a closed model with a paid API. By the second half of 2023, the BGE family and E5 from Microsoft were beating it on MTEB, and ada-002 started showing up in tutorials as the slow, expensive option you used if you did not want to self-host.^[5]^[6]
Hard-negative mining became standard. The training recipe that won MTEB (BGE, then E5, then the LLM-based models) is built around mining hard negatives from a large corpus and using them in a contrastive loss.^[5]^[6] That recipe is now the default starting point for any new embedding model.
Matryoshka representation learning got popular. Because the leaderboard rewards both quality and practical utility, models that could output multiple embedding dimensions (a 1024-dim and a 256-dim from the same forward pass) gained an advantage in real applications. Matryoshka, originally introduced in a 2022 paper, became standard in 2024 releases.
OpenAI shipped text-embedding-3. OpenAI's release of text-embedding-3-small and text-embedding-3-large in January 2024 cited MTEB scores directly, with text-embedding-3-large reaching around 64.6 on MTEB English. The release notes were essentially "here is our new MTEB number", and the model was positioned against open-source competitors.
The benchmark itself became a hiring signal. Several embedding research groups (NVIDIA, Salesforce, Voyage AI, Mistral) built reputations on the strength of leaderboard results. The leaderboard is competitive enough that being top-3 has commercial value.
Closed APIs came back. Google's Gemini Embedding (2025) re-established a clear closed-API lead on MTEB Multilingual after about two years of open models dominating the top.^[14]
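The Matryoshka item in the list above is easier to picture with a small example. For a model trained with a Matryoshka objective, a shorter embedding is obtained by keeping the leading dimensions and re-normalizing. The sketch below assumes such a model; the vector values and the function name are invented, and applying this truncation to a model not trained this way would generally lose much more quality.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head


# Invented 4-dimensional embedding standing in for a 1024-dimensional one.
full_embedding = [3.0, 4.0, 12.0, 0.0]
short_embedding = truncate_and_normalize(full_embedding, 2)
print(short_embedding)
```

Re-normalizing matters because cosine or dot-product search over the shorter vectors otherwise mixes vectors of different lengths.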

MTEB's broader effect was to commoditize embeddings. By 2025 you could get a free open-source model that was within a few points of the absolute best closed model, deploy it on a single GPU, and get reasonable results on almost any text task. That is a meaningful change from 2022, when the practical choice was "pay OpenAI or use a small Sentence-Transformers checkpoint and hope for the best."

What does MTEB not measure?

A few things MTEB does not really capture, which matter for production use.

Latency and memory. A 7B-parameter top-of-leaderboard model is not a drop-in replacement for a 110M-parameter encoder. The leaderboard has columns for embedding dimension and parameter count, but no direct latency benchmark.^[11]
Domain shift. Most MTEB tasks are general-purpose. A model that does well on MTEB will not necessarily do well on legal, medical, or financial text without fine-tuning.
Long contexts. The original MTEB tasks used relatively short inputs (BEIR passages, STS sentence pairs). MMTEB added long-document retrieval, but evaluation of really long contexts (tens of thousands of tokens) is still patchy.^[2]
Cross-encoder behaviour. MTEB only evaluates bi-encoders that produce a single fixed-size embedding per input. Cross-encoders (which look at the query and document jointly) and hybrid retrievers are out of scope. Models that report MTEB numbers for a reranker step usually use a separate cross-encoder that is not part of the benchmark.

See also

For background on the representations MTEB evaluates, see vector embeddings, the broader topic of embeddings, and the downstream application of semantic search.

References

1. Muennighoff, N., Tazi, N., Magne, L., and Reimers, N. (2022). "MTEB: Massive Text Embedding Benchmark." arXiv:2210.07316. Accepted to EACL 2023. https://arxiv.org/abs/2210.07316
2. Enevoldsen, K., et al. (2025). "MMTEB: Massive Multilingual Text Embedding Benchmark." arXiv:2502.13595. Accepted to ICLR 2025. https://arxiv.org/abs/2502.13595
3. Xiao, C., Chung, I., Kerboua, I., Stirling, J., Zhang, X., Kardos, M., Solomatin, R., Al Moubayed, N., Enevoldsen, K., and Muennighoff, N. (2025). "MIEB: Massive Image Embedding Benchmark." arXiv:2504.10471. ICCV 2025. https://arxiv.org/abs/2504.10471
4. Lee, C., Roy, R., Xu, M., Raiman, J., Shoeybi, M., Catanzaro, B., and Ping, W. (2024). "NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models." arXiv:2405.17428.
5. Xiao, S., Liu, Z., Zhang, P., and Muennighoff, N. (2023). "C-Pack: Packaged Resources To Advance General Chinese Embedding" (the BGE model paper). arXiv:2309.07597.
6. Wang, L., Yang, N., Huang, X., et al. (2022). "Text Embeddings by Weakly-Supervised Contrastive Pre-training" (the E5 paper). arXiv:2212.03533.
7. Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., and Gurevych, I. (2021). "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models." NeurIPS 2021 Datasets and Benchmarks Track.
8. NVIDIA (2025). "Llama-Embed-Nemotron-8B: A Universal Text Embedding Model for Multilingual and Cross-Lingual Tasks." arXiv:2511.07025. https://arxiv.org/abs/2511.07025
9. "Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks" (2025). arXiv:2506.21182. https://arxiv.org/abs/2506.21182
10. MTEB GitHub repository: https://github.com/embeddings-benchmark/mteb
11. MTEB Leaderboard (Hugging Face Space): https://huggingface.co/spaces/mteb/leaderboard
12. MTEB Results repository: https://github.com/embeddings-benchmark/results
13. NVIDIA Developer Blog (2025). "Llama-Embed-Nemotron-8B Model Tops the Multilingual Text Retrieval Leaderboard." https://huggingface.co/blog/nvidia/llama-embed-nemotron-8b
14. Google Developers Blog (2025). "State-of-the-art text embedding via the Gemini API." https://developers.googleblog.com/en/gemini-embedding-text-model-now-available-gemini-api/
