MMTEB

Updated Jun 28, 2026

MMTEB (Massive Multilingual Text Embedding Benchmark) is a large, community-built suite for evaluating text embedding models across more than 500 quality-controlled tasks and over 250 languages, making it one of the broadest multilingual evaluations available for embedding models ^[1]^[2]. It was introduced by Kenneth Enevoldsen and roughly 85 co-authors in a paper accepted at ICLR 2025 (arXiv:2502.13595), and it extends the earlier MTEB (Massive Text Embedding Benchmark) from a mostly English collection into a massive multilingual one, adding hard task types such as instruction following, long-document retrieval, and code retrieval ^[1]^[2]. The project lives inside the same open-source mteb library as the original benchmark and is served through a public leaderboard on Hugging Face; the maintainers ask users to cite both papers when reporting results ^[3].

What is MMTEB?

MMTEB is a benchmark that measures how well text embedding models perform across many tasks and languages at once. Text embeddings turn sentences, passages, or documents into dense vectors so that similar pieces of text sit close together in vector space. Those vectors power search, retrieval-augmented generation, clustering, classification, and recommendation, so a trustworthy way to compare embedding models matters for many downstream systems. The authors describe MMTEB as "a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages" ^[1].
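As a rough illustration of what "close together in vector space" means, the sketch below ranks texts by cosine similarity. The three-dimensional vectors and the helper names (`cosine_similarity`, `rank_by_similarity`, `toy_embeddings`) are invented for this example; they are not produced by any real model and are not part of the `mteb` library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings, made up for illustration only.
# Real embedding models emit hundreds or thousands of dimensions.
toy_embeddings = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "Steps to recover account access": [0.8, 0.2, 0.1],
    "Best hiking trails near Oslo": [0.0, 0.2, 0.9],
}


def rank_by_similarity(query, embeddings):
    """Return the other texts sorted from most to least similar to the query."""
    query_vector = embeddings[query]
    scored = [
        (text, cosine_similarity(query_vector, vector))
        for text, vector in embeddings.items()
        if text != query
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_by_similarity("How do I reset my password?", toy_embeddings)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

With these toy vectors the account-recovery sentence scores far higher than the hiking sentence, which is the behavior retrieval and similarity tasks try to measure on real models.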

Why does MMTEB exist?

The original MTEB, released by Niklas Muennighoff and colleagues at EACL 2023, was the first attempt at a broad standard. It gathered 8 task types over 58 datasets and benchmarked 33 models, and although it touched 112 languages, the bulk of its tasks were English ^[4]. That English tilt created a familiar problem: a model could top the leaderboard while saying little about how it handled Hindi retrieval, Swahili classification, or cross-lingual matching between, say, Finnish and Portuguese. Multilingual progress was hard to measure because the evaluation data was thin outside of English and a handful of high-resource languages.

MMTEB set out to fix that gap by scaling both the number of tasks and the language coverage by a large margin, and by adding task types that the first version did not cover. Because the people who build embedding models are spread across many language communities, the expansion was organized as an open contribution effort: researchers proposed datasets, reviewers checked quality, and the benchmark grew through pull requests rather than a single closed team. The authors report that the result is "the largest multilingual collection of evaluation tasks for embedding models to date" ^[1].

How many languages and tasks does MMTEB cover?

MMTEB spans more than 500 quality-controlled tasks across over 250 languages. When the bitext-mining tasks are counted at the level of individual language pairs, the coverage stretches past 1,000 languages, which gives the benchmark unusually long reach into low-resource settings. The paper states that "our work extends the number of languages to over 1000 (250 excluding bitext-mining tasks), particularly to cover more low-resource languages" ^[1]^[2].

| Attribute | MTEB (2022/2023) | MMTEB (2025) |
| --- | --- | --- |
| Evaluation tasks | 58 datasets, 8 task types | 500+ quality-controlled tasks, 10 task types |
| Languages | 112 (mostly English tasks) | 250+ (over 1,000 counting bitext pairs) |
| Venue | EACL 2023 | ICLR 2025 |
| First author | Niklas Muennighoff | Kenneth Enevoldsen |
| New task types | n/a | Instruction following, long-document retrieval, code retrieval |
| Library | `mteb` | `mteb` (shared) |

What task types does MMTEB include?

The tasks fall into ten categories. Several carry over from MTEB, while instruction following, long-document retrieval, and code retrieval are newer additions aimed at the kinds of workloads that became common as embedding models started to serve retrieval-augmented generation and code search ^[1].

| Task category | What the model is asked to do |
| --- | --- |
| Retrieval | Rank relevant documents for a query, the core search setting |
| Classification | Predict a label for a text using its embedding |
| Multilabel classification | Assign several labels at once to a single text |
| Clustering | Group texts so that related items land together |
| Pair classification | Decide whether two texts hold a particular relation, such as duplicate or paraphrase |
| Reranking | Reorder a candidate list to put better matches first |
| Semantic textual similarity | Score how close two texts are in meaning |
| Bitext mining | Find sentence pairs that are translations across two languages |
| Instruction retrieval | Retrieve while following a natural-language instruction about relevance |
| Code retrieval | Match natural-language queries to relevant code |

Bitext mining is the category that pushes the language total so high, since each task pairs two languages and the benchmark includes many such pairs. Instruction retrieval and code retrieval are the parts most aligned with recent embedding use, where a single model is expected to handle plain prose, task-specific instructions, and source code.

How does MMTEB keep evaluation affordable?

A benchmark this size raises an obvious worry: evaluating every model on every task could cost more compute than most research groups, and especially most low-resource language communities, can spare. The authors treat that cost as a design constraint and lean on several methods to keep evaluation feasible without distorting the rankings ^[1].

Inter-task correlation downsampling is the headline idea. The team measured how strongly tasks predicted one another using Spearman rank correlation, then used backward selection to drop tasks whose results could be inferred from the ones that remained. The aim is a smaller set that still separates models the way the full set would, so that removing redundancy does not change who ranks where.
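The idea of pruning tasks that other tasks already predict can be sketched in a few lines. The version below is a simplified greedy variant written for illustration, not the authors' exact procedure: the 0.95 threshold, the rule for which task of a pair to drop, the function names, and the toy scores are all assumptions.

```python
def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Assumes neither input is constant (otherwise the variance is zero).
    """
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def backward_select(scores, threshold=0.95):
    """Greedily drop one task from the most correlated pair until no pair
    of remaining tasks correlates above the (assumed) threshold.

    scores maps task name -> list of scores, one entry per model.
    """
    kept = dict(scores)
    while len(kept) > 1:
        names = sorted(kept)
        best_pair, best_rho = None, threshold
        for i, first in enumerate(names):
            for second in names[i + 1:]:
                rho = spearman(kept[first], kept[second])
                if rho > best_rho:
                    best_pair, best_rho = (first, second), rho
        if best_pair is None:
            break
        # The second task orders the models almost identically to the first.
        del kept[best_pair[1]]
    return sorted(kept)


# Made-up scores for five hypothetical models on three hypothetical tasks.
toy_scores = {
    "retrieval_a": [0.61, 0.55, 0.70, 0.48, 0.66],
    "retrieval_b": [0.59, 0.52, 0.73, 0.45, 0.64],  # same model order as retrieval_a
    "sts": [0.80, 0.82, 0.78, 0.79, 0.85],
}

selected = backward_select(toy_scores)
print(selected)
```

Here `retrieval_b` orders the models exactly as `retrieval_a` does, so it is dropped, while `sts` carries different information and is kept.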

For retrieval, the benchmark caps the candidate pool using TREC-style pooling. Only the top 250 ranked documents per query are kept, which shrinks the largest datasets from more than 5 million documents down to a maximum of 250,000 while preserving the hard negatives that make retrieval discriminating. Caching the resulting hard negatives means the expensive selection step does not have to be repeated for every model.
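A minimal sketch of the pooling step follows, assuming the per-query rankings come from some first-stage ranker. The function name `pool_corpus`, the choice to always retain judged-relevant documents, and the toy identifiers are assumptions made for illustration; only the cutoff of 250 comes from the text above.

```python
def pool_corpus(rankings, relevant, top_k=250):
    """Build a reduced corpus from per-query rankings.

    rankings: query id -> list of document ids, best first.
    relevant: query id -> set of document ids judged relevant.
    Returns the set of document ids that survive pooling.
    """
    pooled = set()
    for query_id, ranked_docs in rankings.items():
        # Highly ranked documents are kept; the non-relevant ones among
        # them serve as hard negatives.
        pooled.update(ranked_docs[:top_k])
        # Assumption: judged-relevant documents are never discarded.
        pooled.update(relevant.get(query_id, ()))
    return pooled


# Toy data with a cutoff of 2 instead of 250 so the effect is visible.
toy_rankings = {
    "q1": ["d3", "d7", "d1", "d9"],
    "q2": ["d7", "d2", "d5", "d3"],
}
toy_relevant = {"q1": {"d1"}, "q2": {"d2"}}

reduced = pool_corpus(toy_rankings, toy_relevant, top_k=2)
print(sorted(reduced))
```

Because each query contributes at most `top_k` ranked documents, the pooled corpus is bounded by the number of queries times `top_k`, plus any relevant documents retained, regardless of how large the original corpus was.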

Bitext mining gets an embedding-caching trick. Naively, comparing every language against every other is quadratic in the number of languages. By caching each text's embedding once and reusing it across pairings, the cost becomes linear in the number of languages instead ^[1].

Clustering tasks use a bootstrapping approach that reuses encoded documents across sampled sets; for some tasks this cuts the number of documents that must be encoded by up to 100 times, and the paper reports an average speedup of about 16x for the clustering setup ^[1].
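The saving from caching embeddings in bitext mining can be shown by counting encoder calls. The sketch below assumes a parallel corpus in which every language has the same number of sentences; `CountingEncoder` is a hypothetical stand-in for an embedding model, and its placeholder vectors carry no meaning.

```python
from itertools import combinations


class CountingEncoder:
    """Hypothetical stand-in for a model; counts how many texts it encodes."""

    def __init__(self):
        self.calls = 0

    def encode(self, texts):
        self.calls += len(texts)
        # Placeholder vectors; a real model would return learned embeddings.
        return [[float(len(text))] for text in texts]


def encode_pairs_naive(encoder, corpus):
    """Re-encode both sides of every language pair from scratch."""
    for lang_a, lang_b in combinations(sorted(corpus), 2):
        encoder.encode(corpus[lang_a])
        encoder.encode(corpus[lang_b])


def encode_pairs_cached(encoder, corpus):
    """Encode each language once, then reuse the vectors for every pair."""
    cache = {lang: encoder.encode(texts) for lang, texts in corpus.items()}
    return {
        (lang_a, lang_b): (cache[lang_a], cache[lang_b])
        for lang_a, lang_b in combinations(sorted(corpus), 2)
    }


# Four languages with three parallel sentences each (toy data).
toy_corpus = {
    "dan": ["hej", "tak", "farvel"],
    "fin": ["hei", "kiitos", "näkemiin"],
    "por": ["olá", "obrigado", "adeus"],
    "swa": ["habari", "asante", "kwaheri"],
}

naive, cached = CountingEncoder(), CountingEncoder()
encode_pairs_naive(naive, toy_corpus)
pairs = encode_pairs_cached(cached, toy_corpus)
print(naive.calls, cached.calls)
```

With four languages there are six pairs, so the naive loop encodes 36 texts while the cached version encodes 12. The pairwise comparison of vectors is still done per pair; only the expensive encoding step becomes linear.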

How are models ranked, and where is the leaderboard?

Ranking many models over hundreds of tasks needs an aggregation rule that does not let a few outlier scores dominate. MMTEB uses a Borda count drawn from social choice theory, following Colombo and colleagues. Each task acts like a voter that ranks the models, and the per-task rankings are combined into an overall order, with a tournament variant handling ties ^[1]^[5]. The practical effect is that the leaderboard rewards models that do consistently well across the whole spread of tasks and languages rather than models that win a small number of tasks by a wide margin.
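The effect of rank-based aggregation can be seen on a small example. The sketch below implements a plain Borda count in which each task awards a model one point per model it beats; splitting a point on ties is a simplification assumed here and is not the tournament variant mentioned above. Model names, task names, and scores are made up.

```python
def borda_count(task_scores):
    """Aggregate per-task scores into one ranking.

    task_scores: task name -> {model name: score}, higher is better.
    Each task gives a model 1 point per model it beats and 0.5 per tie.
    Returns (model, points) pairs, best first, ties broken by name.
    """
    totals = {}
    for scores in task_scores.values():
        for model, score in scores.items():
            points = 0.0
            for other, other_score in scores.items():
                if other == model:
                    continue
                if score > other_score:
                    points += 1.0
                elif score == other_score:
                    points += 0.5
            totals[model] = totals.get(model, 0.0) + points
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


def mean_score_ranking(task_scores):
    """Baseline for comparison: rank models by their average raw score."""
    sums, counts = {}, {}
    for scores in task_scores.values():
        for model, score in scores.items():
            sums[model] = sums.get(model, 0.0) + score
            counts[model] = counts.get(model, 0) + 1
    means = {model: sums[model] / counts[model] for model in sums}
    return sorted(means, key=lambda model: (-means[model], model))


# model_a wins one task by a wide margin but is last on the other two.
toy_results = {
    "task_1": {"model_a": 0.95, "model_b": 0.60, "model_c": 0.55},
    "task_2": {"model_a": 0.50, "model_b": 0.62, "model_c": 0.58},
    "task_3": {"model_a": 0.51, "model_b": 0.63, "model_c": 0.60},
}

borda = borda_count(toy_results)
by_mean = mean_score_ranking(toy_results)
print(borda)
print(by_mean)
```

Averaging raw scores puts `model_a` first because of its single large win, while the Borda count puts the consistently strong `model_b` first, which is the behavior the leaderboard aims for.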

The benchmark is not a single list. It ships as a family of named splits so that practitioners can focus on what they care about. MTEB(Multilingual) is the broad multilingual track. Regional cuts such as MTEB(Europe) and MTEB(Indic) cover groups of related languages, and language-specific tracks exist for languages including Chinese, French, Polish, and the Scandinavian languages. A refreshed English track, MTEB(eng, v2), uses a much smaller fraction of the original documents for speed, and MTEB(Code) covers code retrieval ^[1]^[2]. Results are served through a public leaderboard hosted on Hugging Face, which the maintainers update as new models and tasks arrive ^[3]^[6].

What are MMTEB's headline findings?

The most discussed result is about size. Embedders based on large language models, with billions of parameters, did reach top scores on certain language subsets and task categories, but the paper reports that "the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters" ^[1]^[2]. That model ranked first on the broad MTEB(Multilingual) track, ahead of the much larger GritLM-7B and e5-mistral-7b-instruct ^[1].

Two patterns sit behind that headline. First, instruction tuning helped a lot: instruction-tuned versions of a model tended to outperform their untuned counterparts, which suggests that how a model is adapted can matter as much as raw parameter count. Second, scale did not transfer evenly across languages. On several low-resource languages the 560-million-parameter model held up against the 7-billion-parameter Mistral-based systems, so a bigger LLM-derived embedder was no guarantee of better multilingual coverage ^[1]. For people choosing an embedding model, the takeaway is that the right choice depends on the languages and tasks in play, not on model size alone.

How does MMTEB extend MTEB?

MMTEB does not replace MTEB so much as absorb and widen it. The two share a codebase, and the original English-heavy tasks survive as one track inside the larger collection, which lets older results stay comparable while the multilingual scope grows around them ^[3]. Conceptually the benchmark continues a long line of work on text embeddings and on sentence-level representation models such as Sentence-BERT, which made it practical to compare texts by vector similarity at scale. As a benchmark, MMTEB plays the same role for embedding models that broad task suites play for large language models: a shared yardstick that makes claims about information retrieval and cross-lingual quality easier to check.

What are MMTEB's limitations?

A benchmark of this size carries trade-offs that the authors are open about. Even with downsampling and caching, full evaluation is heavy, and the very communities most in need of multilingual evaluation are often the ones with the least compute, which is the tension the efficiency work tries to ease rather than remove ^[1]. Coverage is uneven by necessity, since some languages and domains have far more available data than others, so a high score on the multilingual track can still rest on stronger evidence for high-resource languages. Borda aggregation, while robust to outliers, compresses the rich per-task detail into a single ranking, so the leaderboard position is best read alongside the task-level and language-level breakdowns. And like any static benchmark, MMTEB risks being optimized against over time; its open contribution model is meant to keep the task set growing so that it stays a moving target rather than a fixed exam.

References

1. Enevoldsen, K., et al. (2025). MMTEB: Massive Multilingual Text Embedding Benchmark. arXiv:2502.13595. https://arxiv.org/abs/2502.13595
2. MMTEB: Massive Multilingual Text Embedding Benchmark. OpenReview (ICLR 2025). https://openreview.net/forum?id=zl3pfz4VCV
3. embeddings-benchmark/mteb: MTEB: Massive Text Embedding Benchmark. GitHub. https://github.com/embeddings-benchmark/mteb
4. Muennighoff, N., Tazi, N., Magne, L., & Reimers, N. (2023). MTEB: Massive Text Embedding Benchmark. EACL 2023. https://aclanthology.org/2023.eacl-main.148/
5. Colombo, P., et al. (2022). What Are the Best Systems? New Perspectives on NLP Benchmarking. arXiv:2202.03799. https://arxiv.org/abs/2202.03799
6. MMTEB: Massive Multilingual Text Embedding Benchmark. Hugging Face Papers. https://huggingface.co/papers/2502.13595
7. MMTEB: Massive Multilingual Text Embedding Benchmark. ICLR 2025 Poster. https://iclr.cc/virtual/2025/poster/27651
8. MMTEB: Massive Multilingual Text Embedding Benchmark. Aarhus University research portal. https://pure.au.dk/portal/en/publications/mmteb-massive-multilingual-text-embedding-benchmark/
9. MMTEB: Massive Multilingual Text Embedding Benchmark (HTML version). arXiv. https://arxiv.org/html/2502.13595v1
