MMLU-ProX

Updated Jun 2, 2026

MMLU-ProX is a multilingual benchmark for evaluating reasoning and knowledge in large language models. It extends the English-only MMLU-Pro test set to 29 typologically diverse languages, with the same questions presented in parallel in every language ^[1]^[2]. Built by a large academic collaboration led by Weihao Xuan and Irene Li, it was introduced in the paper "MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation" (arXiv:2503.10497, March 2025) and accepted to the main conference of EMNLP 2025 ^[1]^[3]. Because every language version contains the same 11,829 questions translated from the same source items, MMLU-ProX is designed to support direct cross-lingual comparison of model accuracy, isolating language effects from differences in question content ^[2]^[4].

Overview

MMLU-ProX targets a gap in multilingual evaluation: most widely used benchmarks are written in English, while many multilingual test sets lack strictly parallel questions and therefore cannot cleanly separate a model's reasoning ability from the difficulty of the specific questions asked in each language ^[1]^[4]. By translating a single difficult, reasoning-oriented source benchmark into many languages while keeping the underlying questions fixed, MMLU-ProX lets researchers measure how the same model performs on the same content across high-resource and low-resource languages ^[2]^[4].

The dataset preserves the demanding character of its English parent. Each item is a multiple-choice question with up to ten answer options (labeled A through J) drawn from academic and professional subjects, and many questions require multi-step reasoning rather than simple factual recall ^[4]^[5]. The benchmark is released under the MIT license and is distributed on Hugging Face as li-lab/MMLU-ProX, with an official evaluation pathway integrated into EleutherAI's lm-evaluation-harness ^[4]^[5].

Relationship to MMLU and MMLU-Pro

MMLU-ProX sits at the end of a lineage of knowledge-and-reasoning benchmarks. The original MMLU (Massive Multitask Language Understanding) introduced a large four-option multiple-choice test covering 57 subjects in English ^[1]. MMLU-Pro revised that benchmark to make it harder and more reasoning-intensive: it expanded the option set from four to ten choices, removed trivial or noisy items, and reorganized the questions into 14 broad subject categories, which substantially reduced the chance of guessing correctly and increased the value of chain-of-thought reasoning ^[1]^[4].
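
The effect of the larger option set on guessing can be made concrete with a small calculation. The sketch below assumes uniform random guessing over the listed options; the function name `chance_accuracy` is illustrative and does not come from either benchmark's tooling.

```python
from fractions import Fraction


def chance_accuracy(num_options: int) -> Fraction:
    """Expected accuracy of uniform random guessing on a single question."""
    if num_options < 1:
        raise ValueError("a question needs at least one answer option")
    return Fraction(1, num_options)


mmlu_chance = chance_accuracy(4)       # original MMLU: four options
mmlu_pro_chance = chance_accuracy(10)  # MMLU-Pro: up to ten options

print(f"MMLU chance level:     {float(mmlu_chance):.0%}")
print(f"MMLU-Pro chance level: {float(mmlu_pro_chance):.0%}")
```

Because MMLU-Pro items have up to ten options rather than exactly ten, the 10% figure is a floor: questions with fewer options have a somewhat higher chance level.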

MMLU-ProX is the multilingual extension of MMLU-Pro specifically, not of the original MMLU ^[1]^[2]. It inherits MMLU-Pro's ten-option format, its 14 subject categories, and its reasoning focus, then adds translations into 28 additional languages beyond English ^[4]^[5]. This distinguishes it from other multilingual MMLU efforts such as MMMLU, OpenAI's professionally translated version of the original MMLU, which is based on the easier four-option MMLU rather than on MMLU-Pro ^[1].

The languages covered

The benchmark covers 29 languages selected to span a range of language families, scripts, and resource levels ^[1]^[2]. Using ISO codes, the languages are: af (Afrikaans), ar (Arabic), bn (Bengali), cs (Czech), de (German), en (English), es (Spanish), fr (French), hi (Hindi), hu (Hungarian), id (Indonesian), it (Italian), ja (Japanese), ko (Korean), mr (Marathi), ne (Nepali), pt (Portuguese), ru (Russian), sr (Serbian), sw (Swahili), te (Telugu), th (Thai), uk (Ukrainian), ur (Urdu), vi (Vietnamese), wo (Wolof), yo (Yoruba), zh (Chinese), and zu (Zulu) ^[4]^[5].

The selection deliberately mixes high-resource languages such as English, Chinese, German, and French with lower-resource languages, including several African languages (Swahili, Wolof, Yoruba, Zulu) and South Asian languages (Bengali, Marathi, Nepali, Telugu, Urdu) ^[2]^[5]. This range is central to the benchmark's purpose, because the resource gap between these groups is what produces the cross-lingual performance differences the benchmark is built to measure ^[1]^[6].

Version 1 of the paper (March 2025) covered 13 languages and evaluated 25 models; the dataset and the final EMNLP 2025 version were later expanded to the full set of 29 languages and 36 models ^[1]^[6]. The maintainers have indicated continued work toward additional languages in future releases ^[5].

Construction: translation and verification pipeline

MMLU-ProX is built through a semi-automatic pipeline that pairs machine translation by strong LLMs with human expert review, rather than relying on either alone ^[1]^[2]. The construction process begins from the English MMLU-Pro test set, with a data-curation step that removes duplicate items and corrects grammatical problems in the source questions before translation ^[6].

The translation stage uses Claude 3.7 Sonnet to produce initial translations, with a self-reflection phase in which the model reviews and refines its own output ^[6]. Each translated question is then independently verified by GPT-4o, which checks terminology consistency and fluency in the target language ^[6]. Finally, human expert annotators rate translations on a five-point Likert scale across dimensions such as accuracy, fluency, and completeness; items scoring below the acceptance thresholds are returned for revision ^[6]. The authors report mean human-evaluation scores above 4.2 across these dimensions, supporting the quality of the translated questions ^[6].
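
The human-review step can be pictured as a simple triage over Likert ratings. The following is a minimal sketch, not the authors' implementation: the threshold value, the rule that any single low dimension triggers revision, and all names (`Rating`, `needs_revision`, `triage`) are assumptions, since the acceptance thresholds are not stated here.

```python
from dataclasses import dataclass

# Hypothetical threshold: the text says thresholds exist but gives no values.
ACCEPTANCE_THRESHOLD = 4


@dataclass
class Rating:
    question_id: int
    accuracy: int      # each dimension is scored on a 1-5 Likert scale
    fluency: int
    completeness: int


def needs_revision(rating: Rating, threshold: int = ACCEPTANCE_THRESHOLD) -> bool:
    """Assumed rule: one dimension below the threshold sends the item back."""
    return min(rating.accuracy, rating.fluency, rating.completeness) < threshold


def triage(ratings):
    """Split rated items into accepted IDs and IDs returned for revision."""
    accepted = [r.question_id for r in ratings if not needs_revision(r)]
    returned = [r.question_id for r in ratings if needs_revision(r)]
    return accepted, returned


# Toy ratings for illustration only; these are not real annotation data.
ratings = [Rating(1, 5, 5, 4), Rating(2, 5, 3, 5), Rating(3, 4, 4, 4)]
accepted, returned = triage(ratings)
print("accepted:", accepted)
print("returned for revision:", returned)
```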

Keeping the question content fixed across languages is a defining design choice. Every language version contains the same 11,829 questions mapped to the same answers, so the only systematic difference between two language editions is the language itself, which is what makes parallel cross-lingual comparison possible ^[2]^[4].
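
Because the question IDs line up across editions, per-question results can be paired rather than only compared in aggregate. The sketch below assumes results are stored as a mapping from question ID to a correct/incorrect flag; that storage format, the helper names, and the toy data are illustrative assumptions.

```python
def accuracy(results: dict) -> float:
    """Fraction of questions answered correctly."""
    return sum(results.values()) / len(results)


def paired_comparison(lang_a: dict, lang_b: dict) -> dict:
    """Compare two language editions question by question.

    Both inputs map question ID -> True/False and must cover the same IDs,
    which is what a strictly parallel benchmark guarantees.
    """
    if lang_a.keys() != lang_b.keys():
        raise ValueError("language editions must cover the same question IDs")
    acc_a, acc_b = accuracy(lang_a), accuracy(lang_b)
    return {
        "acc_a": acc_a,
        "acc_b": acc_b,
        "gap_points": round(100 * (acc_a - acc_b), 1),
        # Questions solved in one language but missed in the other.
        "only_a": sorted(q for q in lang_a if lang_a[q] and not lang_b[q]),
        "only_b": sorted(q for q in lang_b if lang_b[q] and not lang_a[q]),
    }


# Toy results for five shared questions; not real benchmark outputs.
english = {101: True, 102: True, 103: True, 104: False, 105: True}
swahili = {101: True, 102: False, 103: True, 104: False, 105: False}

report = paired_comparison(english, swahili)
print(report)
```

The `only_a` list is the part a non-parallel benchmark cannot provide: it identifies the specific items where the model's answer changes when only the language changes.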

Evaluation methodology

The benchmark is structured to mirror MMLU-Pro's evaluation conventions. Each language edition is split into a small validation set used to supply few-shot exemplars and a large test set used for scoring, summing to 11,829 questions per language ^[4]^[5]. Models answer each ten-option multiple-choice question, and accuracy is the primary metric ^[4].
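
A scoring loop for this format has three parts: rendering the options with letters A through J, pulling a letter out of the model's response, and computing accuracy. The sketch below is illustrative only. The prompt layout and the English "answer is (X)" pattern are assumptions modeled on common multiple-choice evaluation scripts, not the official prompts, and localized editions would need a pattern in the target language.

```python
import re
import string
from typing import List, Optional

OPTION_LABELS = string.ascii_uppercase[:10]  # "ABCDEFGHIJ"

# Assumed English answer phrasing; \b stops "answer is Always" matching "A".
ANSWER_PATTERN = re.compile(r"answer is \(?([A-J])\b")


def format_question(question: str, options: List[str]) -> str:
    """Render a question with lettered options (at most ten)."""
    if not 1 <= len(options) <= len(OPTION_LABELS):
        raise ValueError("expected between 1 and 10 options")
    lines = [f"Question: {question}", "Options:"]
    lines += [f"{label}. {text}" for label, text in zip(OPTION_LABELS, options)]
    return "\n".join(lines)


def extract_answer(response: str) -> Optional[str]:
    """Return the last stated answer letter, or None if none is found."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1] if matches else None


def score(responses: List[str], gold: List[str]) -> float:
    """Accuracy: share of responses whose extracted letter matches the key."""
    correct = sum(extract_answer(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


print(format_question("What is 2 + 2?", ["3", "4", "5"]))

# Toy responses for illustration; unparseable responses count as wrong.
responses = ["Adding the terms, the answer is (C).", "I think the answer is B", "Not sure."]
gold = ["C", "D", "A"]
print(f"accuracy = {score(responses, gold):.3f}")
```

Counting unparseable responses as incorrect is a design choice in this sketch; other scripts fall back to a random or default option instead, which changes reported scores slightly.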

MMLU-ProX is evaluated under both 5-shot chain-of-thought prompting, in which the model is shown worked examples and asked to reason step by step before answering, and zero-shot prompting ^[1]^[6]. To make evaluation efficient across 29 languages and dozens of models, the maintainers also publish a "lite" subset of 658 questions per language, which preserves the multilingual structure at a fraction of the compute cost ^[2]^[4]. Official support is provided through lm-evaluation-harness with vLLM-based inference, exposing per-language task groups (for example mmlu_prox_{lang} and mmlu_prox_lite_{lang}) and per-subject tasks across the 14 categories ^[5].
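
The per-language task names follow directly from the two templates quoted above and the 29 language codes. The sketch below builds them and also works out how large the lite subset is relative to the full set; the helper `task_names` is hypothetical, and the exact registered task names should be checked against the harness documentation before use.

```python
# The 29 ISO language codes listed for the benchmark.
LANGS = [
    "af", "ar", "bn", "cs", "de", "en", "es", "fr", "hi", "hu",
    "id", "it", "ja", "ko", "mr", "ne", "pt", "ru", "sr", "sw",
    "te", "th", "uk", "ur", "vi", "wo", "yo", "zh", "zu",
]

FULL_SIZE = 11_829  # questions per language in the full benchmark
LITE_SIZE = 658     # questions per language in the lite subset


def task_names(langs, lite=False):
    """Fill the mmlu_prox_{lang} / mmlu_prox_lite_{lang} templates."""
    prefix = "mmlu_prox_lite" if lite else "mmlu_prox"
    return [f"{prefix}_{lang}" for lang in langs]


full = task_names(LANGS)
lite = task_names(["sw", "yo"], lite=True)
lite_fraction = LITE_SIZE / FULL_SIZE

print(len(full), "full task groups, e.g.", full[0])
print("lite tasks:", ",".join(lite))
print(f"lite subset is {lite_fraction:.1%} of the full question count")
```

A comma-joined string such as the one printed for the lite tasks is the form the harness's `--tasks` option generally accepts, so a subset of languages can be evaluated without running all 29.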

Notable results by model

Across all models tested, MMLU-ProX shows a consistent pattern: accuracy is highest on high-resource languages such as English and degrades as language resource levels fall, with the largest drops on African languages ^[1]^[2]. The paper reports performance gaps of up to 24.3 percentage points between high-resource and low-resource languages ^[1]. The table below lists representative English and low-resource results from the version 1 evaluation under 5-shot chain-of-thought prompting; all figures are accuracy percentages ^[6].

Model	English	Bengali	Swahili	Source
QwQ-32B	70.7	52.7	32.8	^[6]
Qwen2.5-72B	70.3	n/a	~40.1	^[6]
Llama 3.1-405B	68.8	n/a	52.1	^[6]
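
The English-to-Swahili gap for each row of the table can be computed directly. The sketch below copies the three rows above and treats "n/a" as missing and a leading "~" as an approximate value; the parsing helpers are illustrative, and the resulting gaps describe only these version 1 rows.

```python
# Rows copied from the table above: model, English, Bengali, Swahili.
ROWS = """\
QwQ-32B\t70.7\t52.7\t32.8
Qwen2.5-72B\t70.3\tn/a\t~40.1
Llama 3.1-405B\t68.8\tn/a\t52.1"""


def parse_cell(cell: str):
    """Convert a table cell to a float, or None when it is marked n/a."""
    cell = cell.strip()
    if cell.lower() == "n/a":
        return None
    return float(cell.lstrip("~"))  # "~" marks an approximate figure


def english_swahili_gaps(table: str) -> dict:
    """English minus Swahili accuracy, in percentage points, per model."""
    gaps = {}
    for line in table.splitlines():
        model, english, _bengali, swahili = line.split("\t")
        en, sw = parse_cell(english), parse_cell(swahili)
        if en is not None and sw is not None:
            gaps[model] = round(en - sw, 1)
    return gaps


gaps = english_swahili_gaps(ROWS)
for model, gap in gaps.items():
    print(f"{model}: {gap} points")
```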

The expanded final version evaluates 36 state-of-the-art systems, including reasoning-focused and multilingual models such as GPT-4.1, o4-mini, DeepSeek-R1, DeepSeek V3, and several Qwen3 variants (for example Qwen3-235B and its "thinking" mode) ^[2]^[3]. In that evaluation the strongest models again exceed 70% on English while falling well below their English scores on the lowest-resource languages, reproducing the high-to-low-resource degradation seen in version 1 ^[1]^[2]. A public leaderboard for the expanded results was listed as forthcoming on the project page ^[2].

Significance for multilingual evaluation

MMLU-ProX is notable as one of the first large, strictly parallel benchmarks to bring MMLU-Pro's difficulty and reasoning emphasis to a wide set of languages ^[1]^[4]. Its parallel design allows the multilingual gap to be attributed to language rather than to differing question pools, which is a recurring confound in earlier multilingual test sets ^[4]. The results give quantitative evidence that even frontier models with strong English reasoning lose substantial accuracy on low-resource languages, a finding that has been used to argue for more investment in multilingual training data and evaluation ^[1]^[6].

The benchmark has been adopted into common tooling, including lm-evaluation-harness, and is referenced in multilingual leaderboards and follow-up studies, which has helped it become a standard reference point for cross-lingual reasoning evaluation ^[5]. Its acceptance to EMNLP 2025 Main further marks it as a peer-reviewed contribution to the multilingual evaluation literature ^[3].

Limitations

Several limitations follow from the benchmark's construction. Because the questions are translations of an English MMLU-Pro source rather than items authored natively in each language, they may carry English-centric framing and may not capture culturally specific knowledge particular to the target-language community, even though the pipeline reviews for cultural appropriateness ^[2]^[6]. Translation quality, while supported by automatic verification and human Likert scoring, can still vary across languages, and the authors describe human verification as an ongoing effort ^[6].

The benchmark also measures multiple-choice accuracy on academic and professional subjects, which is only a partial proxy for broader multilingual competence such as open-ended generation, dialogue, or pragmatics ^[4]. Finally, results are sensitive to prompting strategy and decoding setup; reported numbers depend on whether 5-shot chain-of-thought or zero-shot prompting is used and on the inference configuration, so scores are best compared within a consistent evaluation protocol ^[1]^[6].

References

1. Xuan, Weihao; et al. "MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation." arXiv:2503.10497, March 2025. https://arxiv.org/abs/2503.10497
2. "MMLU-ProX: A Multilingual Benchmark for Advanced LLM Evaluation." Project page. https://mmluprox.github.io/
3. Xuan, Weihao; et al. "MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation." Proceedings of EMNLP 2025 (Main), pp. 1513-1532. ACL Anthology. https://aclanthology.org/2025.emnlp-main.79/
4. "li-lab/MMLU-ProX." Hugging Face Datasets. https://huggingface.co/datasets/li-lab/MMLU-ProX
5. "MMLU-ProX task documentation." EleutherAI lm-evaluation-harness. https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/mmlu_prox/README.md
6. "MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation" (full text, version 1). arXiv HTML. https://arxiv.org/html/2503.10497v1
