MMTEB
Last reviewed
May 31, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,672 words
MMTEB (Massive Multilingual Text Embedding Benchmark) is a large, community-built suite for evaluating text embedding models across more than 500 tasks and over 250 languages. It was introduced by Kenneth Enevoldsen and roughly 85 co-authors in a paper accepted at ICLR 2025, and it extends the earlier MTEB (Massive Text Embedding Benchmark) from a mostly English collection into one of the broadest multilingual evaluations available for embedding models [1][2]. The project lives inside the same open-source mteb library as the original benchmark, and the maintainers ask users to cite both papers when reporting results [3].
Text embeddings turn sentences, passages, or documents into dense vectors so that similar pieces of text sit close together in vector space. Those vectors power search, retrieval-augmented generation, clustering, classification, and recommendation, so a trustworthy way to compare embedding models matters for a lot of downstream systems.
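As a small illustration of "close together in vector space", the sketch below compares toy vectors with cosine similarity, one common closeness measure. The three-dimensional vectors are made-up values chosen for illustration, not the output of any real embedding model, and the function name is hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not produced by a model).
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(cat, kitten)
sim_unrelated = cosine_similarity(cat, invoice)

# Related texts should score higher than unrelated ones.
print(sim_related > sim_unrelated)
```

Real embedding models produce vectors with hundreds or thousands of dimensions, but the comparison step works the same way.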
The original MTEB, released by Niklas Muennighoff and colleagues at EACL 2023, was the first attempt at a broad standard. It gathered 8 task types over 58 datasets and benchmarked 33 models, and although it touched 112 languages, most of its tasks were in English [4]. That English tilt created a familiar problem: a model could top the leaderboard while that ranking said little about how it handled Hindi retrieval, Swahili classification, or cross-lingual matching between, say, Finnish and Portuguese. Multilingual progress was hard to measure because the evaluation data was thin outside of English and a handful of high-resource languages.
MMTEB set out to fix that gap by scaling both the number of tasks and the language coverage by a large margin, and by adding task types that the first version did not cover. Because the people who build embedding models are spread across many language communities, the expansion was organized as an open contribution effort: researchers proposed datasets, reviewers checked quality, and the benchmark grew through pull requests rather than a single closed team. The authors describe the result as the largest multilingual collection of embedding evaluation tasks assembled to that point [1].
MMTEB spans more than 500 quality-controlled tasks across over 250 languages. When the bitext-mining tasks are counted at the level of individual language pairs, the coverage stretches past 1,000 languages, which gives the benchmark unusually long reach into low-resource settings [1][2].
The tasks fall into ten categories. Several carry over from MTEB, while instruction following, long-document retrieval, and code retrieval are newer additions aimed at the kinds of workloads that became common as embedding models started to serve retrieval-augmented generation and code search [1].
| Task category | What the model is asked to do |
|---|---|
| Retrieval | Rank relevant documents for a query, the core search setting |
| Classification | Predict a label for a text using its embedding |
| Multilabel classification | Assign several labels at once to a single text |
| Clustering | Group texts so that related items land together |
| Pair classification | Decide whether two texts hold a particular relation, such as duplicate or paraphrase |
| Reranking | Reorder a candidate list to put better matches first |
| Semantic textual similarity | Score how close two texts are in meaning |
| Bitext mining | Find sentence pairs that are translations across two languages |
| Instruction retrieval | Retrieve while following a natural-language instruction about relevance |
| Code retrieval | Match natural-language queries to relevant code |
Bitext mining is the category that pushes the language total so high, since each task pairs two languages and the benchmark includes many such pairs. Instruction retrieval and code retrieval are the parts most aligned with recent embedding use, where a single model is expected to handle plain prose, task-specific instructions, and source code.
A benchmark this size raises an obvious worry: evaluating every model on every task could cost more compute than most research groups, and especially most low-resource language communities, can spare. The authors treat that cost as a design constraint and lean on several methods to keep evaluation feasible without distorting the rankings [1].
Inter-task correlation downsampling is the headline idea. The team measured how strongly tasks predicted one another using Spearman rank correlation, then used backward selection to drop tasks whose results could be inferred from the ones that remained. The aim is a smaller set that still separates models the way the full set would, so that removing redundancy does not change who ranks where.
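The idea of dropping redundant tasks can be sketched in a few lines. The version below is a simplified illustration, not the authors' exact procedure: it repeatedly removes the task whose per-model scores are most strongly rank-correlated with some other remaining task, and stops once no remaining task exceeds a correlation threshold. The threshold of 0.95, the function names, and the toy scores are all assumptions made for this example.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def backward_select(scores, threshold=0.95):
    """scores: task name -> per-model scores, all in the same model order."""
    kept = dict(scores)
    while len(kept) > 1:
        best_task, best_corr = None, threshold
        for task in kept:
            corr = max(spearman(kept[task], kept[o]) for o in kept if o != task)
            if corr > best_corr:
                best_task, best_corr = task, corr
        if best_task is None:
            break  # nothing left is redundant enough to drop
        del kept[best_task]
    return sorted(kept)

# Toy scores for four models (hypothetical numbers).
toy_scores = {
    "task_a": [0.60, 0.70, 0.80, 0.90],
    "task_b": [0.61, 0.72, 0.79, 0.95],  # ranks models exactly like task_a
    "task_c": [0.90, 0.50, 0.70, 0.60],  # ranks models differently
}
selected = backward_select(toy_scores)
print(selected)
```

In this toy case one of the two tasks that rank the models identically is dropped, while the task with a different ranking survives because it carries information the others do not.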
For retrieval, the benchmark caps the candidate pool using TREC-style pooling. Only the top 250 ranked documents per query are kept, which shrinks the largest datasets from more than 5 million documents down to a maximum of 250,000 while preserving the hard negatives that make retrieval discriminating. Caching the resulting hard negatives means the expensive selection step does not have to be repeated for every model.
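The pooling step can be pictured as taking the union of each query's top-ranked documents. The following is a minimal sketch under stated assumptions: the first-stage rankings are given as plain lists, the function name is hypothetical, and the toy cutoff is 2 rather than the 250 used in the benchmark.

```python
def pool_corpus(rankings, k):
    """Keep only documents that appear in some query's top-k list.

    rankings: query id -> document ids ordered best-first by a
    first-stage ranker (assumed to be computed beforehand).
    """
    pooled = set()
    for ranked_docs in rankings.values():
        pooled.update(ranked_docs[:k])
    return pooled

# Toy first-stage rankings (hypothetical ids).
toy_rankings = {
    "q1": ["d3", "d1", "d7", "d2"],
    "q2": ["d7", "d5", "d3", "d9"],
}
reduced = pool_corpus(toy_rankings, k=2)
print(sorted(reduced))
```

The pooled corpus can never hold more than k documents per query, which is what bounds the dataset size, and the documents that survive are the ones a ranker found most confusable with the relevant ones.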
Bitext mining gets an embedding-caching trick. Naively, comparing every language against every other requires encoding work that is quadratic in the number of languages. By caching each text's embedding once and reusing it across pairings, the encoding cost becomes linear in the number of languages.

Clustering tasks use a bootstrapping approach that reuses encoded documents across sampled sets. For some tasks this cuts the number of documents that must be encoded by a factor of up to 100, and the paper reports an average speedup of about 16x for the clustering setup [1].
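The effect of caching embeddings for bitext mining can be seen by counting encoder calls. The sketch below uses a stand-in encoder that only counts how many texts it processes; the class, the function names, and the tiny corpus are hypothetical and exist only to show the quadratic versus linear growth.

```python
from itertools import combinations

class CountingEncoder:
    """Stand-in for an embedding model; counts how many texts it encodes."""

    def __init__(self):
        self.calls = 0

    def encode(self, text):
        self.calls += 1
        # Placeholder vector; a real model would return a dense embedding.
        return [float(len(text))]

def encode_pairs_naive(encoder, sentences_by_lang):
    """Re-encode both sides for every language pair."""
    for a, b in combinations(sentences_by_lang, 2):
        for text in sentences_by_lang[a] + sentences_by_lang[b]:
            encoder.encode(text)

def encode_pairs_cached(encoder, sentences_by_lang):
    """Encode each language once and reuse the result across pairs."""
    cache = {}
    for a, b in combinations(sentences_by_lang, 2):
        for lang in (a, b):
            if lang not in cache:
                cache[lang] = [encoder.encode(t) for t in sentences_by_lang[lang]]
    return cache

# Four languages with two aligned sentences each (toy data).
toy_corpus = {
    "en": ["good morning", "thank you"],
    "fi": ["hyvää huomenta", "kiitos"],
    "pt": ["bom dia", "obrigado"],
    "sw": ["habari ya asubuhi", "asante"],
}

naive, cached = CountingEncoder(), CountingEncoder()
encode_pairs_naive(naive, toy_corpus)
encode_pairs_cached(cached, toy_corpus)
print(naive.calls, cached.calls)
```

With L languages and n sentences per language, the naive loop encodes L(L-1)n texts while the cached version encodes Ln, so the gap widens quickly as languages are added.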
Ranking many models over hundreds of tasks needs an aggregation rule that does not let a few outlier scores dominate. MMTEB uses a Borda count drawn from social choice theory, following Colombo and colleagues. Each task acts like a voter that ranks the models, and the per-task rankings are combined into an overall order, with a tournament variant handling ties [1][5]. The practical effect is that the leaderboard rewards models that do consistently well across the whole spread of tasks and languages rather than models that win a small number of tasks by a wide margin.
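A toy example shows why a rank-based rule behaves differently from averaging raw scores. The sketch below is an assumption-laden illustration rather than the benchmark's implementation: on each task a model earns one point for every model it outscores and half a point for every model it ties, and the points are summed across tasks. The function name and all scores are hypothetical.

```python
def borda_scores(task_scores):
    """task_scores: task name -> {model name: score}. Returns model -> points."""
    totals = {}
    for scores in task_scores.values():
        for model, s in scores.items():
            wins = sum(1.0 for o, t in scores.items() if o != model and s > t)
            ties = sum(0.5 for o, t in scores.items() if o != model and s == t)
            totals[model] = totals.get(model, 0.0) + wins + ties
    return totals

def mean_scores(task_scores):
    """Plain average of raw scores per model, for comparison."""
    sums = {}
    for scores in task_scores.values():
        for model, s in scores.items():
            sums[model] = sums.get(model, 0.0) + s
    return {m: total / len(task_scores) for m, total in sums.items()}

# Model A wins one task by a wide margin; model B is narrowly best on two.
toy_results = {
    "task1": {"A": 0.99, "B": 0.60, "C": 0.50},
    "task2": {"A": 0.40, "B": 0.55, "C": 0.50},
    "task3": {"A": 0.41, "B": 0.56, "C": 0.50},
}

borda = borda_scores(toy_results)
means = mean_scores(toy_results)
borda_winner = max(borda, key=borda.get)
mean_winner = max(means, key=means.get)
print(borda_winner, mean_winner)
```

Here averaging favors model A because of its single outlier score, while the Borda-style count favors model B, which is ahead on two of the three tasks.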
The benchmark is not a single list. It ships as a family of named splits so that practitioners can focus on what they care about. MTEB(Multilingual) is the broad multilingual track. Regional cuts such as MTEB(Europe) and MTEB(Indic) cover groups of related languages, and language-specific tracks exist for languages including Chinese, French, Polish, and the Scandinavian languages. A refreshed English track, MTEB(eng, v2), uses a much smaller fraction of the original documents for speed, and MTEB(Code) covers code retrieval [1][2]. Results are served through a public leaderboard hosted on Hugging Face, which the maintainers update as new models and tasks arrive [3][6].
The most discussed result is about size. Embedders built on large language models, with billions of parameters, did reach top scores on certain language subsets and task categories, but the best-performing publicly available model overall was multilingual-e5-large-instruct, which has only about 560 million parameters [1][2]. On the broad MTEB(Multilingual) track it ranked first, ahead of the much larger GritLM-7B and e5-mistral-7b-instruct [1].
Two patterns sit behind that headline. First, instruction tuning helped substantially: instruction-tuned versions of a model tended to outperform their untuned counterparts, which suggests that how a model is adapted can matter as much as raw parameter count. Second, scale did not transfer evenly across languages. On several low-resource languages the 560-million-parameter model held up against the 7-billion-parameter Mistral-based systems, so a bigger LLM-derived embedder was not a guarantee of better multilingual coverage [1]. For people choosing an embedding model, the takeaway is that the right choice depends on the languages and tasks in play, not on model size alone.
MMTEB does not replace MTEB so much as absorb and widen it. The two share a codebase, and the original English-heavy tasks survive as one track inside the larger collection, which lets older results stay comparable while the multilingual scope grows around them [3]. Conceptually the benchmark continues a long line of work on text embeddings and on sentence-level representation models such as Sentence-BERT, which made it practical to compare texts by vector similarity at scale. As a benchmark, MMTEB plays the same role for embedding models that broad task suites play for large language models: a shared yardstick that makes claims about information retrieval and cross-lingual quality easier to check.
A benchmark of this size carries trade-offs that the authors are open about. Even with downsampling and caching, full evaluation is heavy, and the very communities most in need of multilingual evaluation are often the ones with the least compute, which is the tension the efficiency work tries to ease rather than remove [1]. Coverage is uneven by necessity, since some languages and domains have far more available data than others, so a high score on the multilingual track can still rest on stronger evidence for high-resource languages. Borda aggregation, while robust to outliers, compresses the rich per-task detail into a single ranking, so the leaderboard position is best read alongside the task-level and language-level breakdowns. And like any static benchmark, MMTEB risks being optimized against over time; its open contribution model is meant to keep the task set growing so that it stays a moving target rather than a fixed exam.