MMLU-ProX
Last reviewed
Jun 2, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,560 words
MMLU-ProX is a multilingual benchmark for evaluating reasoning and knowledge in large language models, extending the English-only MMLU-Pro test set to 29 typologically diverse languages with a parallel set of identical questions in each language [1][2]. Built by a large academic collaboration led by Weihao Xuan and Irene Li, it was introduced in the paper "MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation" (arXiv:2503.10497, March 2025) and accepted to the EMNLP 2025 Main conference [1][3]. Because every language version contains the same 11,829 questions translated from the same source items, MMLU-ProX is designed to support direct cross-lingual comparison of model accuracy, isolating language effects from differences in question content [2][4].
MMLU-ProX targets a gap in multilingual evaluation: most widely used benchmarks are written in English, while many multilingual test sets lack strictly parallel questions and therefore cannot cleanly separate a model's reasoning ability from the difficulty of the specific questions asked in each language [1][4]. By translating a single difficult, reasoning-oriented source benchmark into many languages while keeping the underlying questions fixed, MMLU-ProX lets researchers measure how the same model performs on the same content across high-resource and low-resource languages [2][4].
The dataset preserves the demanding character of its English parent. Each item is a multiple-choice question with up to ten answer options (labeled A through J) drawn from academic and professional subjects, and many questions require multi-step reasoning rather than simple factual recall [4][5]. The benchmark is released under the MIT license and is distributed on Hugging Face as li-lab/MMLU-ProX, with an official evaluation pathway integrated into EleutherAI's lm-evaluation-harness [4][5].
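As a rough illustration of the item format described above, the sketch below renders one multiple-choice item with lettered options. The field names (`question`, `options`, `answer`) and the sample item are assumptions made for this example, not the dataset's documented schema, and the prompt layout is only one plausible choice.

```python
from string import ascii_uppercase

MAX_OPTIONS = 10  # options are labeled A through J

# Hypothetical item: field names and content are illustrative assumptions,
# not taken from the released dataset.
item = {
    "question": "Which quantities are conserved in a perfectly elastic collision?",
    "options": [
        "Kinetic energy only",
        "Momentum only",
        "Both kinetic energy and momentum",
        "Neither kinetic energy nor momentum",
    ],
    "answer": "C",
}


def format_question(item: dict) -> str:
    """Render an item as a prompt with one lettered line per option."""
    options = item["options"]
    if not 1 <= len(options) <= MAX_OPTIONS:
        raise ValueError(f"expected 1 to {MAX_OPTIONS} options, got {len(options)}")
    lines = [f"Question: {item['question']}", "Options:"]
    # zip stops at the shorter sequence, so labels run A, B, C, ... as needed
    for label, text in zip(ascii_uppercase, options):
        lines.append(f"{label}. {text}")
    lines.append("Answer:")
    return "\n".join(lines)


prompt = format_question(item)
print(prompt)
```

Items with fewer than ten options simply use fewer labels, which matches the "up to ten" wording above.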
MMLU-ProX sits at the end of a lineage of knowledge-and-reasoning benchmarks. The original MMLU (Massive Multitask Language Understanding) introduced a large four-option multiple-choice test covering 57 subjects in English [1]. MMLU-Pro revised that benchmark to make it harder and more reasoning-intensive: it expanded the option set from four to ten choices, removed trivial or noisy items, and reorganized the questions into 14 broad subject categories, which substantially reduced the chance of guessing correctly and increased the value of chain-of-thought reasoning [1][4].
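The effect of the larger option set on guessing can be made concrete with a small calculation. This is a minimal sketch that assumes uniform random guessing over all options; the function name is hypothetical.

```python
from fractions import Fraction


def guess_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing over num_options choices."""
    if num_options < 1:
        raise ValueError("a question needs at least one option")
    return Fraction(1, num_options)


mmlu_baseline = guess_accuracy(4)       # four-option MMLU item
mmlu_pro_baseline = guess_accuracy(10)  # ten-option MMLU-Pro item

print(f"MMLU guess baseline:     {float(mmlu_baseline):.0%}")      # 25%
print(f"MMLU-Pro guess baseline: {float(mmlu_pro_baseline):.0%}")  # 10%
```

Under this assumption, the floor a model can reach by chance drops from 25% to 10% on items that carry all ten options.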
MMLU-ProX is the multilingual extension of MMLU-Pro specifically, not of the original MMLU [1][2]. It inherits MMLU-Pro's ten-option format, its 14 subject categories, and its reasoning focus, then adds translations into 28 additional languages beyond English [4][5]. This distinguishes it from other multilingual MMLU efforts such as MMMLU, OpenAI's professionally translated version of the original, easier four-option MMLU [1].
The benchmark covers 29 languages selected to span a range of language families, scripts, and resource levels [1][2]. Using ISO codes, the languages are: af (Afrikaans), ar (Arabic), bn (Bengali), cs (Czech), de (German), en (English), es (Spanish), fr (French), hi (Hindi), hu (Hungarian), id (Indonesian), it (Italian), ja (Japanese), ko (Korean), mr (Marathi), ne (Nepali), pt (Portuguese), ru (Russian), sr (Serbian), sw (Swahili), te (Telugu), th (Thai), uk (Ukrainian), ur (Urdu), vi (Vietnamese), wo (Wolof), yo (Yoruba), zh (Chinese), and zu (Zulu) [4][5].
The selection deliberately mixes high-resource languages such as English, Chinese, German, and French with lower-resource languages, including several African languages (Swahili, Wolof, Yoruba, Zulu) and South Asian languages (Bengali, Marathi, Nepali, Telugu, Urdu) [2][5]. This range is central to the benchmark's purpose, because the resource gap between these groups is what produces the cross-lingual performance differences the benchmark is built to measure [1][6].
Version 1 of the paper (March 2025) covered 13 languages and evaluated 25 models; the dataset and the final EMNLP 2025 version were later expanded to the full set of 29 languages and 36 models [1][6]. The maintainers have indicated that they are working toward additional languages in future releases [5].
MMLU-ProX is built through a semi-automatic pipeline that pairs machine translation by strong LLMs with human expert review, rather than relying on either alone [1][2]. The construction process begins from the English MMLU-Pro test set, with a data-curation step that removes duplicate items and corrects grammatical problems in the source questions before translation [6].
The translation stage uses Claude 3.7 Sonnet to produce initial translations, with a self-reflection phase in which the model reviews and refines its own output [6]. Each translated question is then independently verified by GPT-4o, which checks terminology consistency and fluency in the target language [6]. Finally, human expert annotators rate translations on a five-point Likert scale across dimensions such as accuracy, fluency, and completeness; items scoring below the acceptance thresholds are returned for revision [6]. The authors report mean human-evaluation scores above 4.2 across these dimensions, supporting the quality of the translated questions [6].
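The human-review step can be sketched as a simple filter over Likert ratings. The source does not state the acceptance thresholds, so the cutoff of 4 below is a hypothetical value chosen for illustration, as are the class and function names and the sample ratings; this is not the authors' implementation.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the actual acceptance thresholds are not given in the text.
ACCEPT_THRESHOLD = 4


@dataclass
class Rating:
    """One annotator rating of a translated item on a 1-5 Likert scale."""
    item_id: str
    accuracy: int
    fluency: int
    completeness: int

    def scores(self) -> tuple:
        return (self.accuracy, self.fluency, self.completeness)


def needs_revision(rating: Rating, threshold: int = ACCEPT_THRESHOLD) -> bool:
    """Flag an item if any rated dimension falls below the threshold."""
    if any(not 1 <= s <= 5 for s in rating.scores()):
        raise ValueError("Likert scores must be between 1 and 5")
    return min(rating.scores()) < threshold


# Made-up ratings for three translated items
ratings = [
    Rating("q1", accuracy=5, fluency=5, completeness=4),
    Rating("q2", accuracy=3, fluency=4, completeness=5),
    Rating("q3", accuracy=4, fluency=4, completeness=4),
]

to_revise = [r.item_id for r in ratings if needs_revision(r)]
print(to_revise)  # ['q2']
```

Flagging on the weakest dimension, rather than on the mean, is one possible design: it prevents a fluent but inaccurate translation from passing on its average score.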
Keeping the question content fixed across languages is a defining design choice. Every language version contains the same 11,829 questions mapped to the same answers, so the only systematic difference between two language editions is the language itself, which is what makes parallel cross-lingual comparison possible [2][4].
The benchmark is structured to mirror MMLU-Pro's evaluation conventions. Each language edition is split into a small validation set, which supplies few-shot exemplars, and a large test set used for scoring; together they total 11,829 questions per language [4][5]. Models answer each multiple-choice question, which has up to ten options, and accuracy is the primary metric [4].
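Because every language edition shares the same questions and answer keys, per-language accuracy can be computed against a single gold mapping and compared directly. The sketch below uses made-up question IDs and predictions; the data layout is an assumption for illustration only.

```python
def accuracy(predictions: dict, gold: dict) -> float:
    """Fraction of gold questions answered correctly; missing answers count as wrong."""
    if not gold:
        raise ValueError("gold answer key is empty")
    correct = sum(predictions.get(qid) == answer for qid, answer in gold.items())
    return correct / len(gold)


# One shared answer key for all languages (hypothetical IDs and labels)
gold = {"q1": "A", "q2": "J", "q3": "C", "q4": "F"}

# Hypothetical model predictions on the same questions in two languages
predictions = {
    "en": {"q1": "A", "q2": "J", "q3": "C", "q4": "B"},
    "sw": {"q1": "A", "q2": "D", "q3": "C", "q4": "B"},
}

scores = {lang: accuracy(preds, gold) for lang, preds in predictions.items()}
gap_points = (scores["en"] - scores["sw"]) * 100  # gap in percentage points

print(scores)      # {'en': 0.75, 'sw': 0.5}
print(gap_points)  # 25.0
```

Since both languages are scored on identical question IDs, the gap reflects the language of presentation rather than a difference in question content.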
MMLU-ProX is evaluated under both 5-shot chain-of-thought prompting, in which the model is shown worked examples and asked to reason step by step before answering, and zero-shot prompting [1][6]. To make evaluation efficient across 29 languages and dozens of models, the maintainers also publish a "lite" subset of 658 questions per language, which preserves the multilingual structure at a fraction of the compute cost [2][4]. Official support is provided through lm-evaluation-harness with vLLM-based inference, exposing per-language task groups (for example mmlu_prox_{lang} and mmlu_prox_lite_{lang}) and per-subject tasks across the 14 categories [5].
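The per-language task naming pattern mentioned above can be expanded programmatically. This sketch only builds name strings from the `mmlu_prox_{lang}` and `mmlu_prox_lite_{lang}` patterns and the 29 language codes listed in this article; the helper name is hypothetical, and the resulting names should be checked against the task list of the installed harness version before use.

```python
# The 29 language codes listed in this article
LANGS = [
    "af", "ar", "bn", "cs", "de", "en", "es", "fr", "hi", "hu",
    "id", "it", "ja", "ko", "mr", "ne", "pt", "ru", "sr", "sw",
    "te", "th", "uk", "ur", "vi", "wo", "yo", "zh", "zu",
]


def task_name(lang: str, lite: bool = False) -> str:
    """Build a task-group name following the pattern quoted in the text."""
    if lang not in LANGS:
        raise ValueError(f"unsupported language code: {lang!r}")
    prefix = "mmlu_prox_lite" if lite else "mmlu_prox"
    return f"{prefix}_{lang}"


# Example: the lite task groups for the four African languages in the benchmark
african_lite = [task_name(lang, lite=True) for lang in ("sw", "wo", "yo", "zu")]
print(",".join(african_lite))
```

A comma-separated string like the one printed here is a convenient form for selecting several task groups in a single evaluation run.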
Across all models tested, MMLU-ProX shows a consistent pattern: accuracy is highest on high-resource languages such as English and degrades as language resource levels fall, with the largest drops on African languages [1][2]. The paper reports performance gaps of up to 24.3 percentage points between high-resource and low-resource languages [1]. The table below lists representative English and low-resource results from the version 1 evaluation under 5-shot chain-of-thought prompting; all figures are accuracy percentages [6].
| Model | English | Bengali | Swahili | Source |
|---|---|---|---|---|
| QwQ-32B | 70.7 | 52.7 | 32.8 | [6] |
| Qwen2.5-72B | 70.3 | n/a | ~40.1 | [6] |
| Llama 3.1-405B | 68.8 | n/a | 52.1 | [6] |
The expanded final version evaluates 36 state-of-the-art systems, including reasoning-focused and multilingual models such as GPT-4.1, o4-mini, DeepSeek-R1, DeepSeek V3, and several Qwen3 variants (for example Qwen3-235B and its "thinking" mode) [2][3]. In that evaluation the strongest models again exceed 70% on English while falling well below their English scores on the lowest-resource languages, reproducing the high-to-low-resource degradation seen in version 1 [1][2]. A public leaderboard for the expanded results was listed as forthcoming on the project page [2].
MMLU-ProX is notable as one of the first large, strictly parallel benchmarks to bring MMLU-Pro's difficulty and reasoning emphasis to a wide set of languages [1][4]. Because every language shares one question pool, the multilingual gap can be attributed to language rather than to differing question content, a recurring confound in earlier multilingual test sets [4]. The results give quantitative evidence that even frontier models with strong English reasoning lose substantial accuracy on low-resource languages, a finding that has been used to argue for more investment in multilingual training data and evaluation [1][6].
The benchmark has been adopted into common tooling, including lm-evaluation-harness, and is referenced in multilingual leaderboards and follow-up studies, which has helped make it a standard reference point for cross-lingual reasoning evaluation [5]. Its acceptance to the EMNLP 2025 main conference further marks it as a peer-reviewed contribution to the multilingual evaluation literature [3].
Several limitations follow from the benchmark's construction. Because the questions are translations of an English MMLU-Pro source rather than items authored natively in each language, they may carry English-centric framing and may not capture culturally specific knowledge particular to the target-language community, even though the pipeline reviews for cultural appropriateness [2][6]. Translation quality, while supported by automatic verification and human Likert scoring, can still vary across languages, and the authors describe human verification as an ongoing effort [6].
The benchmark also measures multiple-choice accuracy on academic and professional subjects, which is only a partial proxy for broader multilingual competence such as open-ended generation, dialogue, or pragmatics [4]. Finally, results are sensitive to prompting strategy and decoding setup; reported numbers depend on whether 5-shot chain-of-thought or zero-shot prompting is used and on the inference configuration, so scores are best compared within a consistent evaluation protocol [1][6].