MMMLU
Last reviewed
May 10, 2026
Sources
11 citations
Review status
Source-backed
Revision
v2 · 2,397 words
| MMMLU | |
|---|---|
| Overview | |
| Full name | Multilingual Massive Multitask Language Understanding |
| Abbreviation | MMMLU |
| Description | Professional human translations of the MMLU test set into 14 languages, released by OpenAI to evaluate multilingual knowledge and reasoning in large language models |
| Release date | September 23, 2024 |
| Publisher | OpenAI |
| HuggingFace ID | openai/MMMLU |
| License | MIT |
| Source benchmark | MMLU (Hendrycks et al., 2020) |
| Technical Details | |
| Type | Knowledge assessment, multilingual evaluation |
| Modality | Text |
| Task format | Four-option multiple choice |
| Languages | 14 (Arabic, Bengali, Chinese (Simplified), French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese (Brazilian), Spanish (Latin America), Swahili, Yoruba) |
| Questions per language | 14,042 |
| Total examples | 196,588 |
| Subjects | 57 (STEM, humanities, social sciences, other) |
| Translation method | Professional human translators |
| Evaluation metric | Accuracy, zero-shot or few-shot |
| File format | CSV (auto-converted to Parquet on HuggingFace) |
| Resources | |
| HuggingFace dataset | openai/MMMLU |
| Reference paper | Measuring Massive Multitask Language Understanding (arXiv:2009.03300) |
| Evaluation code | openai/simple-evals |
| Predecessor | MMLU |
MMMLU (Multilingual Massive Multitask Language Understanding) is a multilingual evaluation dataset published by OpenAI on September 23, 2024. It is a professional human translation of the test split of the original MMLU benchmark into 14 languages, distributed on Hugging Face under the MIT license. The dataset preserves MMLU's 57 subjects and four-option multiple-choice format, but replaces the English questions with translations produced by paid human linguists rather than machine translation.
MMMLU lets developers measure how well a model retains its general-knowledge ability when prompted in languages other than English. It is one of the few large multilingual benchmarks whose translations were not produced by another language model, and it is now a standard line item in OpenAI system cards and on third-party leaderboards.
MMMLU is distinct from similarly named benchmarks. It is not MMLU (the original English-only test), not MMLU-Pro (the harder ten-option variant), and not MMMU (Massive Multi-discipline Multimodal Understanding, a multimodal college-exam benchmark).
The original MMLU benchmark, introduced by Dan Hendrycks and colleagues in September 2020, contains 15,908 multiple-choice questions across 57 subjects. The test split has 14,042 questions, which is the portion translated for MMMLU.
For several years almost every reported MMLU number was an English-only score. Earlier multilingual MMLU efforts, such as the University of Oregon's mlmm-evaluation framework, used machine translation, which is cheap but introduces silent errors for low-resource languages and technical vocabulary.
OpenAI's motivation for MMMLU was to remove that confound by paying professional translators to render every question, answer choice, and label by hand. The dataset card frames the goal as increasing confidence in translation accuracy "especially for low-resource languages like Yoruba." The release was paired with the launch of OpenAI Academy, an initiative offering training and one million dollars in API credits to developers in low- and middle-income countries.
MMMLU covers 14 typologically diverse languages identified by locale codes that include a region tag.
| Locale code | Language | Region | Script |
|---|---|---|---|
| AR_XY | Arabic | Modern Standard, region-neutral | Arabic |
| BN_BD | Bengali | Bangladesh | Bengali |
| DE_DE | German | Germany | Latin |
| ES_LA | Spanish | Latin America | Latin |
| FR_FR | French | France | Latin |
| HI_IN | Hindi | India | Devanagari |
| ID_ID | Indonesian | Indonesia | Latin |
| IT_IT | Italian | Italy | Latin |
| JA_JP | Japanese | Japan | Kanji, Hiragana, Katakana |
| KO_KR | Korean | South Korea | Hangul |
| PT_BR | Portuguese | Brazil | Latin |
| SW_KE | Swahili | Kenya | Latin |
| YO_NG | Yoruba | Nigeria | Latin (with diacritics) |
| ZH_CN | Chinese (Simplified) | Mainland China | Hanzi |
Coverage is heavy on languages with hundreds of millions of speakers, and includes Swahili and Yoruba, usually treated as low-resource in NLP. Locale codes use region-specific variants where dialect matters: Latin American Spanish, Brazilian Portuguese, Simplified Chinese. The original English questions are not bundled inside MMMLU; researchers who want an English baseline pull the source MMLU dataset (cais/mmlu) directly.
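A minimal sketch of selecting a language subset by locale code. It assumes the Hugging Face `datasets` library is installed, that each locale code from the table above doubles as a configuration name, and that the data sits in a single split called `test`; the article does not state these details, so check the dataset card before relying on them. `load_subset` is a hypothetical helper name.

```python
# Locale codes copied from the table above.
LOCALES = [
    "AR_XY", "BN_BD", "DE_DE", "ES_LA", "FR_FR", "HI_IN", "ID_ID",
    "IT_IT", "JA_JP", "KO_KR", "PT_BR", "SW_KE", "YO_NG", "ZH_CN",
]


def load_subset(locale: str):
    """Load one MMMLU language subset (hypothetical helper).

    Assumption: the locale code is also the Hugging Face config name
    and the rows live in a split named "test".
    """
    if locale not in LOCALES:
        # English is not part of MMMLU; the source questions are in cais/mmlu.
        raise ValueError(f"{locale!r} is not an MMMLU locale code")
    from datasets import load_dataset  # lazy import; needs network access
    return load_dataset("openai/MMMLU", locale, split="test")
```

Validating the code before the import keeps a typo such as `EN_US` from triggering a download attempt.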
Each language subset is a single CSV file with identical columns across languages.
| Column | Type | Description |
|---|---|---|
| Unnamed: 0 | integer | Row index from the source CSV |
| Question | string | Translated question text |
| A, B, C, D | string | Translated answer options |
| Answer | string | Correct option label, A, B, C, or D |
| Subject | string | Subject identifier in English (for example abstract_algebra, professional_law) |
The Subject field is left in English so scores can be aggregated by topic across languages. Each subset contains 14,042 rows, matching the MMLU test split. Across all 14 subsets the dataset contains 196,588 examples. Italian questions average roughly 760 characters, while Chinese questions average about 257; the asymmetry affects token budgets when running large evaluations.
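The column layout above can be read with the standard library alone. The sketch below parses two invented rows (they are not real dataset content) and renders one as a prompt; the prompt template and the helper names `read_rows` and `format_prompt` are assumptions for illustration, not the format used by any official script.

```python
import csv
import io

# Two made-up rows in the MMMLU column layout (not real dataset content).
SAMPLE_CSV = """Unnamed: 0,Question,A,B,C,D,Answer,Subject
0,Combien font 2 + 2 ?,3,4,5,6,B,elementary_mathematics
1,Quelle est la capitale de la France ?,Lyon,Nice,Paris,Lille,C,high_school_geography
"""


def read_rows(text: str) -> list[dict]:
    """Parse CSV text into one dict per question, keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def format_prompt(row: dict) -> str:
    """Render a question and its four options (hypothetical template)."""
    options = "\n".join(f"{label}) {row[label]}" for label in "ABCD")
    return f"{row['Question']}\n{options}"


rows = read_rows(SAMPLE_CSV)
print(format_prompt(rows[0]))
# Character count of the question, a rough proxy for token cost.
print(len(rows[0]["Question"]))
```

Because `Subject` stays in English, the same `row["Subject"]` value can serve as a grouping key in every language subset.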
MMMLU inherits MMLU's 57 subjects, organized into four categories.
| Category | Number of subjects | Representative subjects |
|---|---|---|
| STEM | 18 | Abstract algebra, college physics, computer security, electrical engineering, machine learning |
| Humanities | 13 | Formal logic, international law, moral scenarios, philosophy, professional law, world religions |
| Social sciences | 12 | Econometrics, high school macroeconomics, professional psychology, sociology, US foreign policy |
| Other | 14 | Anatomy, business ethics, clinical knowledge, college medicine, professional medicine, virology |
The split between high school, college, and professional levels is preserved across all 14 languages, so a question from professional_medicine in Japanese corresponds to the same question in the English MMLU test split, just with the stem and answer options translated.
Unlike most multilingual benchmarks, MMMLU did not use a translation model. OpenAI worked with a vendor of professional human translators and produced one human translation per language per question. The dataset card argues that the human approach matters most for technical vocabulary, where a translation model can silently shift the meaning of a chemistry term or legal phrase, and for low-resource languages where machine translation is least reliable. Yoruba is the example called out on the dataset card.
The pipeline kept the source structure rigid. Each question and its four options were translated independently, but the answer key was not changed and subject labels remained in English. There was no cultural localization step, which keeps MMMLU directly comparable to MMLU at the question level but means the benchmark continues to reflect the US-centric biases of the original MMLU subjects.
The canonical way to run MMMLU is OpenAI's open-source simple-evals repository, which contains run_multilingual_mmlu.py. The script loads each language subset from Hugging Face, prompts the target model one question at a time, and parses the model's answer with a multilingual regex that looks for ANSWER: followed by a single letter. Scoring is exact match against the Answer column, so no second model is needed as a grader.
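The parse-then-exact-match step can be sketched as follows. This is a simplified illustration, not the regex from run_multilingual_mmlu.py: it only recognizes the Latin-script marker `ANSWER:` and the letters A to D, whereas a multilingual pattern would also need translated markers and non-Latin letter variants. The function names are hypothetical.

```python
import re

# Simplified pattern: "ANSWER:" in any case, then a single option letter.
# Assumption: translated markers and non-Latin letters are out of scope here.
ANSWER_RE = re.compile(r"answer\s*:\s*\(?([ABCD])\b", re.IGNORECASE)


def extract_answer(response: str):
    """Return the first matched option letter, or None if nothing matches."""
    match = ANSWER_RE.search(response)
    return match.group(1).upper() if match else None


def accuracy(responses, gold):
    """Exact-match accuracy of extracted letters against the Answer column."""
    correct = sum(extract_answer(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


responses = ["Reasoning... ANSWER: B", "answer: c", "I am not sure."]
gold = ["B", "C", "D"]
print(accuracy(responses, gold))
```

A response with no parseable letter counts as wrong, so formatting failures lower the score just as incorrect answers do.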
By default the evaluation runs zero-shot on the test split, and most published numbers use that setting. With four answer options, the random-chance baseline is 25 percent. MMMLU has been integrated into third-party frameworks including EvalScope, Inspect Evals (the UK AI Safety Institute's toolkit), and aggregator leaderboards such as LLM-Stats.
OpenAI publishes MMMLU numbers in the simple-evals repository, where each new flagship release adds a row to a per-language results table. The selected results below are taken from that table for the zero-shot setting and rounded to three decimals.
| Model | Average | Best language | Worst language |
|---|---|---|---|
| o3 (high reasoning) | 0.888 | Italian 0.912 | Yoruba 0.780 |
| o1 | 0.877 | High-resource European | Yoruba |
| o4-mini (high reasoning) | 0.852 | Spanish | Yoruba |
| GPT-4.5 preview (Feb 2025) | 0.851 | Italian, Spanish | Yoruba |
| GPT-4.1 (April 2025) | 0.837 | Italian | Yoruba |
| GPT-4o (Nov 2024) | 0.814 | Italian | Yoruba |
| GPT-4o-mini (July 2024) | 0.705 | Italian | Yoruba |
| GPT-4.1-nano (April 2025) | 0.669 | Italian | Yoruba |
A few patterns are consistent across every OpenAI model tested. European Romance languages (Italian, Spanish, Portuguese, French) are easiest, often within a point or two of the model's English MMLU score. East Asian languages sit slightly below. Hindi, Bengali, and Arabic land in the middle. Swahili and especially Yoruba show the steepest drops, with Yoruba scoring 10 to 15 percentage points below the cross-language average. The o3-high range from 0.780 (Yoruba) to 0.912 (Italian) illustrates the gap that even the best system has on a low-resource African language.
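As a worked check on the o3 figures, the short calculation below converts the table values into percentage-point gaps. It uses only the three numbers reported in the o3 row; the variable names are illustrative.

```python
# o3 (high reasoning) figures from the results table above.
average, best_italian, worst_yoruba = 0.888, 0.912, 0.780

# Differences expressed in percentage points, rounded to one decimal.
gap_to_average = round((average - worst_yoruba) * 100, 1)
best_worst_spread = round((best_italian - worst_yoruba) * 100, 1)

print(gap_to_average, best_worst_spread)
```

Yoruba sits 10.8 points below the o3 average and 13.2 points below Italian, which is consistent with the 10 to 15 point range described above.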
Third-party leaderboards such as LLM-Stats include models from Anthropic, Google, Meta, and others. Top entries are a mix of Claude and Gemini variants, with leading averages of around 0.92. Those numbers are self-reported and not independently verified.
MMMLU's main contribution is methodological. Paying for human translations removed the most common confound in multilingual evaluation: silent errors when a translation model misrenders a technical term and the resulting question is no longer answerable. Reviewers can verify question text directly in any of the 14 languages without also assessing an upstream translator.
The MIT license is unusual for a corporate benchmark. It allows free use, modification, and redistribution, including commercial use, which has made MMMLU a default choice for academic papers, eval libraries, and leaderboards.
The design preserves direct comparability with the original MMLU. Each MMMLU question is a translation of a specific MMLU test question with the same answer key and subject, so it is straightforward to compute a translation gap (English score minus per-language score) and attribute changes to language ability rather than to a different question distribution.
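The translation gap can be computed in a few lines. The sketch below uses hypothetical accuracies chosen for illustration, not measured results, and `translation_gap` is an assumed helper name.

```python
def translation_gap(english_score: float, language_scores: dict) -> dict:
    """English accuracy minus each language's accuracy, in percentage points."""
    return {
        locale: round((english_score - score) * 100, 1)
        for locale, score in language_scores.items()
    }


# Hypothetical accuracies for illustration only; not measured results.
english_mmlu = 0.90
mmmlu = {"IT_IT": 0.89, "JA_JP": 0.87, "YO_NG": 0.76}

gaps = translation_gap(english_mmlu, mmmlu)
largest = max(gaps, key=gaps.get)  # locale with the biggest drop from English
print(gaps, largest)
```

Since the English score comes from the MMLU test split and the per-language scores from MMMLU, both runs should use the same prompt format and shot count for the gap to be meaningful.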
MMMLU inherits MMLU's well-documented problems. Error audits estimate that several percent of the original test questions are flawed, with wrong keys, ambiguous wording, or overlapping options. Those errors propagate, faithfully translated, into all 14 language subsets. Cleaned variants such as MMLU-Redux address the English version but have no MMMLU equivalent.
The content is culturally English-centric. Subjects like US foreign policy, US history, and professional law lean heavily on US institutions. Translating those questions does not make them culturally neutral; it makes the same US-centric content readable in another language.
Data contamination is a third concern. The translations are public on Hugging Face and have been crawled into web archives since September 2024, so any model trained on a recent web crawl may memorize MMMLU question-answer pairs.
Finally, MMMLU has no English subset, so comparing a model's MMMLU score to its MMLU score requires combining two different Hugging Face repositories.
MMMLU sits in a small family of multilingual general-knowledge benchmarks built on top of MMLU.
| Benchmark | Languages | Translation method | Question count | Owner |
|---|---|---|---|---|
| MMLU | 1 (English) | Original | 15,908 | Hendrycks et al. |
| MMLU-Pro | 1 (English) | Curated harder questions | 12,032 | TIGER-Lab |
| MMLU-ProX | 29 | LLM translation plus expert review | 11,829 per language | MMLU-ProX team |
| Okapi mlmm-evaluation | 26 | ChatGPT translation | 14,042 per language | University of Oregon |
| CMMLU | 1 (Chinese) | Native Chinese questions, not translated MMLU | 11,528 | Li et al. |
| MMMLU | 14 | Professional human translation | 14,042 per language | OpenAI |
MMMLU prioritizes translation quality over breadth (14 languages, human-translated), while MMLU-ProX and Okapi prioritize coverage (29 and 26 languages, machine-translated). CMMLU is sometimes confused with MMMLU but is written natively in Chinese, not translated from MMLU. Other related benchmarks include Global-MMLU (a community human-translated extension covering 42 languages), BIG-bench Hard's translated subsets, and the FLORES translation benchmark.
MMMLU was widely covered at launch. VentureBeat framed it as OpenAI's response to the global language divide, MarkTechPost noted that the human-translation pipeline made it usable for sensitive industries like healthcare and law, and Hugging Face commentators highlighted the choice by a frontier lab to release a benchmark under MIT rather than a research-only license.
Since late 2024, MMMLU has been a standard line item in frontier model evaluation tables. OpenAI's GPT-4o, GPT-4.1, GPT-4.5, and the o-series all report MMMLU scores, and most other labs include at least an average MMMLU number in their model cards. For multilingual disparities research, MMMLU is most useful as a controlled comparison: because question content is identical across languages, the gap between English MMLU and Yoruba MMMLU on the same model is a rough measure of how much of the model's knowledge and reasoning ability is lost when the prompt leaves English.