MMMLU

Updated Jul 17, 2026

MMMLU
Overview
Full name	Multilingual Massive Multitask Language Understanding
Abbreviation	MMMLU
Description	Professional human translations of the MMLU test set into 14 languages, released by OpenAI to evaluate multilingual knowledge and reasoning in large language models
Release date	September 23, 2024
Publisher	OpenAI
HuggingFace ID	openai/MMMLU
License	MIT
Source benchmark	MMLU (Hendrycks et al., 2020)
Technical Details
Type	Knowledge assessment, multilingual evaluation
Modality	Text
Task format	Four-option multiple choice
Languages	14 (Arabic, Bengali, Chinese (Simplified), French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese (Brazilian), Spanish (Latin America), Swahili, Yoruba)
Questions per language	14,042
Total examples	196,588
Subjects	57 (STEM, humanities, social sciences, other)
Translation method	Professional human translators
Evaluation metric	Accuracy, zero-shot or few-shot
File format	CSV (auto-converted to Parquet on HuggingFace)
Resources
HuggingFace dataset	openai/MMMLU
Reference paper	Measuring Massive Multitask Language Understanding (arXiv:2009.03300)
Evaluation code	openai/simple-evals
Predecessor	MMLU

MMMLU (Multilingual Massive Multitask Language Understanding) is a multilingual evaluation dataset published by OpenAI on September 23, 2024.^[1]^[6] It is a professional human translation of the test split of the original MMLU benchmark into 14 languages, distributed on Hugging Face under the MIT license.^[1] The dataset preserves MMLU's 57 subjects and four-option multiple-choice format, but replaces the English questions with translations produced by paid human linguists rather than machine translation.^[1]

MMMLU lets developers measure how well a model retains its general-knowledge ability when prompted in languages other than English. It is one of the few large multilingual benchmarks whose translations were not produced by another language model, and it is now a standard line item in OpenAI system cards and on third-party leaderboards.

MMMLU is distinct from similarly named benchmarks. It is not MMLU (the original English-only test), not MMLU-Pro (the harder ten-option variant), and not MMMU (a multimodal college-exam benchmark whose name stands for Massive Multi-discipline Multimodal Understanding).

Background and motivation

The original MMLU benchmark, introduced by Dan Hendrycks and colleagues in September 2020, contains 15,908 multiple-choice questions across 57 subjects.^[5] The test split has 14,042 questions, which is the portion translated for MMMLU.^[1]

For several years almost every reported MMLU number was an English-only score. Earlier multilingual MMLU efforts, such as the University of Oregon's mlmm-evaluation framework, used machine translation, which is cheap but introduces silent errors for low-resource languages and technical vocabulary.

OpenAI's motivation for MMMLU was to remove that confound by paying professional translators to render every question, answer choice, and label by hand. The dataset card frames the goal as increasing confidence in translation accuracy "especially for low-resource languages like Yoruba."^[1] The release was paired with the launch of OpenAI Academy, an initiative offering training and one million dollars in API credits to developers in low- and middle-income countries.^[6]

Languages and locale codes

MMMLU covers 14 typologically diverse languages identified by locale codes that include a region tag.^[1]

Locale code	Language	Region	Script
AR_XY	Arabic	Modern Standard, region-neutral	Arabic
BN_BD	Bengali	Bangladesh	Bengali
DE_DE	German	Germany	Latin
ES_LA	Spanish	Latin America	Latin
FR_FR	French	France	Latin
HI_IN	Hindi	India	Devanagari
ID_ID	Indonesian	Indonesia	Latin
IT_IT	Italian	Italy	Latin
JA_JP	Japanese	Japan	Kanji, Hiragana, Katakana
KO_KR	Korean	South Korea	Hangul
PT_BR	Portuguese	Brazil	Latin
SW_KE	Swahili	Kenya	Latin
YO_NG	Yoruba	Nigeria	Latin (with diacritics)
ZH_CN	Chinese (Simplified)	Mainland China	Hanzi

Coverage leans toward languages with hundreds of millions of speakers but also includes Swahili and Yoruba, which are usually treated as low-resource in NLP. Locale codes use region-specific variants where dialect matters: Latin American Spanish, Brazilian Portuguese, Simplified Chinese. The original English questions are not bundled inside MMMLU; researchers who want an English baseline pull the source MMLU dataset (cais/mmlu) directly.
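
The locale codes in the table can be handled programmatically. The following is a minimal sketch, not an official loader: the helper names `split_locale` and `load_subset` are hypothetical, and the `load_dataset` call assumes the Hugging Face `datasets` package is installed and that each locale code is exposed as a configuration name with a `test` split.

```python
# Locale codes as listed in the table above, mapped to language names.
LOCALES = {
    "AR_XY": "Arabic",
    "BN_BD": "Bengali",
    "DE_DE": "German",
    "ES_LA": "Spanish",
    "FR_FR": "French",
    "HI_IN": "Hindi",
    "ID_ID": "Indonesian",
    "IT_IT": "Italian",
    "JA_JP": "Japanese",
    "KO_KR": "Korean",
    "PT_BR": "Portuguese",
    "SW_KE": "Swahili",
    "YO_NG": "Yoruba",
    "ZH_CN": "Chinese (Simplified)",
}


def split_locale(code):
    """Split a code such as 'PT_BR' into a (language, region) pair."""
    language, region = code.split("_")
    return language.lower(), region


def load_subset(code):
    """Load one language subset (hypothetical helper; needs network access).

    Assumes each locale code is a configuration name with a 'test' split.
    """
    if code not in LOCALES:
        raise ValueError(f"unknown MMMLU locale code: {code}")
    # Imported lazily so the helpers above work without the package installed.
    from datasets import load_dataset
    return load_dataset("openai/MMMLU", code, split="test")
```

For example, `split_locale("PT_BR")` separates the language tag from the region tag, which is convenient when grouping results by language regardless of regional variant.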

Dataset structure

Each language subset is a single CSV file with identical columns across languages.

Column	Type	Description
Unnamed: 0	integer	Row index from the source CSV
Question	string	Translated question text
A, B, C, D	string	Translated answer options
Answer	string	Correct option label, A, B, C, or D
Subject	string	Subject identifier in English (for example abstract_algebra, professional_law)

The Subject field is left in English so scores can be aggregated by topic across languages. Each subset contains 14,042 rows, matching the MMLU test split. Across all 14 subsets the dataset contains 196,588 examples.^[1] Italian questions average roughly 760 characters, while Chinese questions average about 257; the asymmetry affects token budgets when running large evaluations.
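
The column layout described above can be illustrated with Python's standard `csv` module. This is a sketch under stated assumptions: the sample row is a placeholder rather than real MMMLU data, the header is written with the column names exactly as listed in the table, and `read_rows` is a hypothetical helper.

```python
import csv
import io

# Column names as listed in the table above.
COLUMNS = ["Unnamed: 0", "Question", "A", "B", "C", "D", "Answer", "Subject"]

# Placeholder row for illustration only; not taken from the dataset.
SAMPLE_CSV = (
    "Unnamed: 0,Question,A,B,C,D,Answer,Subject\n"
    "0,Placeholder question?,option one,option two,option three,option four,"
    "B,abstract_algebra\n"
)


def read_rows(text):
    """Parse CSV text and check that every answer label is A, B, C, or D."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if row["Answer"] not in {"A", "B", "C", "D"}:
            raise ValueError(f"unexpected answer label: {row['Answer']!r}")
        rows.append(row)
    return rows


rows = read_rows(SAMPLE_CSV)

# Dataset size as stated in the article: 14 subsets of 14,042 rows each.
QUESTIONS_PER_LANGUAGE = 14_042
NUM_LANGUAGES = 14
TOTAL_EXAMPLES = QUESTIONS_PER_LANGUAGE * NUM_LANGUAGES  # 196,588
```

Because the Subject value stays in English, the same grouping key works for every language subset.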

The 57 subjects

MMMLU inherits MMLU's 57 subjects, organized into four categories.^[5]

Category	Number of subjects	Representative subjects
STEM	18	Abstract algebra, college physics, computer security, electrical engineering, machine learning
Humanities	13	Formal logic, international law, moral scenarios, philosophy, professional law, world religions
Social sciences	12	Econometrics, high school macroeconomics, professional psychology, sociology, US foreign policy
Other	14	Anatomy, business ethics, clinical knowledge, college medicine, professional medicine, virology

The split between high school, college, and professional levels is preserved across all 14 languages, so a question from professional_medicine in Japanese corresponds to the same question in the English MMLU test split, just with the stem and answer options translated.
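
Since the subject labels are shared across languages, per-category accuracy can be computed the same way for every subset. The sketch below is illustrative: `SUBJECT_TO_CATEGORY` covers only a few of the representative subjects named in the table (a full mapping would list all 57), and `accuracy_by_category` is a hypothetical helper.

```python
from collections import defaultdict

# Category sizes from the table above.
CATEGORY_SIZES = {
    "STEM": 18,
    "Humanities": 13,
    "Social sciences": 12,
    "Other": 14,
}
TOTAL_SUBJECTS = sum(CATEGORY_SIZES.values())  # 57

# Partial mapping for illustration; a real run needs all 57 subjects.
SUBJECT_TO_CATEGORY = {
    "abstract_algebra": "STEM",
    "machine_learning": "STEM",
    "professional_law": "Humanities",
    "sociology": "Social sciences",
    "professional_medicine": "Other",
}


def accuracy_by_category(graded):
    """Aggregate (subject, is_correct) pairs into per-category accuracy."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for subject, is_correct in graded:
        category = SUBJECT_TO_CATEGORY[subject]
        totals[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / totals[category] for category in totals}
```

Running the same function on, say, the Japanese and Swahili subsets yields directly comparable category scores, because each row maps to the same underlying MMLU question.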

Translation methodology

Unlike most multilingual benchmarks, MMMLU did not use a translation model. OpenAI worked with a vendor of professional human translators and produced one human translation per language per question.^[1] The dataset card argues that the human approach matters most for technical vocabulary, where a translation model can silently shift the meaning of a chemistry term or legal phrase, and for low-resource languages where machine translation is least reliable. Yoruba is the example called out on the dataset card.^[1]

The pipeline kept the source structure rigid. Each question and its four options were translated independently, but the answer key was not changed and subject labels remained in English. There was no cultural localization step, which keeps MMMLU directly comparable to MMLU at the question level but means the benchmark continues to reflect the US-centric biases of the original MMLU subjects.

Running the evaluation

The canonical way to run MMMLU is OpenAI's open-source simple-evals repository, which contains run_multilingual_mmlu.py.^[4] The script loads each language subset from Hugging Face, prompts the target model one question at a time, and parses the model's answer with a multilingual regex that looks for ANSWER: followed by a single letter. Scoring is exact match against the Answer column, so no second model is needed as a grader.^[4]
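
The parse-and-score step can be sketched as follows. This is a simplified stand-in, not the pattern used in simple-evals: the list of translated answer words is a small illustrative assumption, and the names `ANSWER_WORDS`, `extract_answer`, and `accuracy` are hypothetical.

```python
import re

# Illustrative subset of words for "Answer"; the real script covers more languages.
ANSWER_WORDS = ["Answer", "Respuesta", "Réponse"]

# Match an answer word, a colon, and a single option letter (case-insensitive).
ANSWER_RE = re.compile(
    r"(?:" + "|".join(re.escape(word) for word in ANSWER_WORDS) + r")\s*:\s*([A-D])\b",
    re.IGNORECASE,
)


def extract_answer(response):
    """Return the option letter found in a model response, or None."""
    match = ANSWER_RE.search(response)
    return match.group(1).upper() if match else None


def accuracy(responses, answers):
    """Exact-match accuracy of extracted letters against the answer key."""
    correct = sum(
        extract_answer(response) == answer
        for response, answer in zip(responses, answers)
    )
    return correct / len(answers)
```

A response with no parseable letter counts as incorrect here, which is one reasonable design choice for exact-match scoring: the model is penalized for failing to follow the output format as well as for choosing the wrong option.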

By default the evaluation runs zero-shot on the test split. Most published numbers are zero-shot. The random-chance baseline is 25 percent.^[4] MMMLU has been integrated into third-party frameworks including EvalScope, Inspect Evals (the UK AI Safety Institute's toolkit), and aggregator leaderboards such as LLM-Stats.^[9]^[10]

Reported results

OpenAI publishes MMMLU numbers in the simple-evals repository, where each new flagship release adds a row to a per-language results table. The selected results below are taken from that table for the zero-shot setting and rounded to three decimals.^[3]

Model	Average	Best language	Worst language
o3 (high reasoning)	0.888	Italian 0.912	Yoruba 0.780
o1	0.877	High-resource European	Yoruba
o4-mini (high reasoning)	0.852	Spanish	Yoruba
GPT-4.5 preview (Feb 2025)	0.851	Italian, Spanish	Yoruba
GPT-4.1 (April 2025)	0.837	Italian	Yoruba
GPT-4o (Nov 2024)	0.814	Italian	Yoruba
GPT-4o-mini (July 2024)	0.705	Italian	Yoruba
GPT-4.1-nano (April 2025)	0.669	Italian	Yoruba

A few patterns are consistent across every OpenAI model tested. European Romance languages (Italian, Spanish, Portuguese, French) are easiest, often within a point or two of the model's English MMLU score. East Asian languages sit slightly below. Hindi, Bengali, and Arabic land in the middle. Swahili and especially Yoruba show the steepest drops, with Yoruba scoring 10 to 15 percentage points below the cross-language average. The o3-high range from 0.780 (Yoruba) to 0.912 (Italian) illustrates the gap that even the best system has on a low-resource African language.^[3]
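
The size of the gap for o3 (high reasoning) follows directly from the three figures given in the table. The short calculation below uses only those numbers; the helper name `points` is hypothetical.

```python
# Figures for o3 (high reasoning) from the results table above.
O3_HIGH = {"average": 0.888, "Italian": 0.912, "Yoruba": 0.780}


def points(delta):
    """Convert a score difference on the 0-1 scale to percentage points."""
    return round(delta * 100, 1)


# Yoruba relative to the cross-language average.
yoruba_vs_average = points(O3_HIGH["average"] - O3_HIGH["Yoruba"])  # 10.8

# Spread between the best and worst language.
best_vs_worst = points(O3_HIGH["Italian"] - O3_HIGH["Yoruba"])  # 13.2
```

Both values fall inside or near the 10 to 15 point range described above.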

Third-party leaderboards such as LLM-Stats include models from Anthropic, Google, Meta, and others. Top entries are a mix of Claude and Gemini variants, with leading averages in the 0.92 range. Those numbers are self-reported and not independently verified.^[10]

Strengths

MMMLU's main contribution is methodological. Paying for human translations removed the most common confound in multilingual evaluation: silent errors when a translation model misrenders a technical term and the resulting question is no longer answerable. Reviewers can verify question text directly in any of the 14 languages without also assessing an upstream translator.

The MIT license is unusual for a corporate benchmark. It allows free use, modification, and redistribution, including commercial use, which has made MMMLU a default choice for academic papers, eval libraries, and leaderboards.

The design preserves direct comparability with the original MMLU. Each MMMLU question is a translation of a specific MMLU test question with the same answer key and subject, so it is straightforward to compute a translation gap (English score minus per-language score) and attribute changes to language ability rather than to a different question distribution.
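
The translation gap described above is a simple subtraction per language. In the sketch below the scores are made-up placeholders chosen for illustration, not reported results, and `translation_gap` is a hypothetical helper.

```python
def translation_gap(english_score, per_language_scores):
    """Return English score minus each language's score, largest gap first."""
    gaps = {
        locale: round(english_score - score, 3)
        for locale, score in per_language_scores.items()
    }
    return dict(sorted(gaps.items(), key=lambda item: item[1], reverse=True))


# Placeholder scores for illustration only.
example_gaps = translation_gap(0.90, {"IT_IT": 0.88, "YO_NG": 0.75})
# example_gaps == {"YO_NG": 0.15, "IT_IT": 0.02}
```

The English score has to come from a separate MMLU run on the same model and prompt setup, since MMMLU itself ships no English subset.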

Limitations and criticism

MMMLU inherits MMLU's well-documented problems. Error audits estimate that several percent of the original test questions are flawed, with wrong keys, ambiguous wording, or overlapping options. Those errors propagate, faithfully translated, into all 14 language subsets. Cleaned variants such as MMLU-Redux address the English version but have no MMMLU equivalent.

The content is culturally English-centric. Subjects like US foreign policy, US history, and professional law lean heavily on US institutions. Translating those questions does not make them culturally neutral; it makes the same US-centric content readable in another language.

Data contamination is a third concern. The translations are public on Hugging Face and have been crawled into web archives since September 2024, so any model trained on a recent web crawl may memorize MMMLU question-answer pairs.

Finally, MMMLU has no English subset, so comparing a model's MMMLU score to its MMLU score requires combining two different Hugging Face repositories.

Relationship to other benchmarks

MMMLU sits in a small family of multilingual general-knowledge benchmarks built on top of MMLU.

Benchmark	Languages	Translation method	Question count	Owner
MMLU	1 (English)	Original	15,908	Hendrycks et al.
MMLU-Pro	1 (English)	Curated harder questions	12,032	TIGER-Lab
MMLU-ProX	29	LLM translation plus expert review	11,829 per language	MMLU-ProX team
Okapi mlmm-evaluation	26	ChatGPT translation	14,042 per language	University of Oregon
CMMLU	1 (Chinese)	Native Chinese questions, not translated MMLU	11,528	Li et al.
MMMLU	14	Professional human translation	14,042 per language	OpenAI

MMMLU prioritizes translation quality over breadth (14 languages, human-translated), while MMLU-ProX and Okapi prioritize coverage (29 and 26 languages, machine-translated). CMMLU is sometimes confused with MMMLU but is written natively in Chinese, not translated from MMLU. Other related benchmarks include Global-MMLU (a community human-translated extension covering 42 languages), BIG-bench Hard's translated subsets, and the FLORES translation benchmark.

Reception and use

MMMLU was widely covered at launch. VentureBeat framed it as OpenAI's response to the global language divide, MarkTechPost noted that the human-translation pipeline made it usable for sensitive industries like healthcare and law, and Hugging Face commentators highlighted the choice by a frontier lab to release a benchmark under MIT rather than a research-only license.^[6]^[7]^[11]

Since late 2024, MMMLU has been a standard line item in frontier model evaluation tables. OpenAI's GPT-4o, GPT-4.1, GPT-4.5, and the o-series all report MMMLU scores, and most other labs include at least an average MMMLU number in their model cards. For multilingual disparities research, MMMLU is most useful as a controlled comparison: because question content is identical across languages, the gap between English MMLU and Yoruba MMMLU on the same model is a rough measure of how much performance is lost to the language of the prompt rather than to the question content.

References

1. OpenAI. "openai/MMMLU." Hugging Face dataset card. https://huggingface.co/datasets/openai/MMMLU
2. OpenAI. "README.md." openai/MMMLU repository on Hugging Face. https://huggingface.co/datasets/openai/MMMLU/blob/main/README.md
3. OpenAI. "simple-evals: multilingual MMLU benchmark results." GitHub. https://github.com/openai/simple-evals/blob/main/multilingual_mmlu_benchmark_results.md
4. OpenAI. "simple-evals" evaluation framework. GitHub. https://github.com/openai/simple-evals
5. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. "Measuring Massive Multitask Language Understanding." arXiv:2009.03300, September 7, 2020. https://arxiv.org/abs/2009.03300
6. Field, Hayden. "OpenAI tackles global language divide with massive multilingual AI dataset release." VentureBeat, September 23, 2024. https://venturebeat.com/ai/openai-tackles-global-language-divide-with-massive-multilingual-ai-dataset-release
7. "OpenAI Releases Multilingual Massive Multitask Language Understanding (MMMLU) Dataset on Hugging Face to Easily Evaluate Multilingual LLMs." MarkTechPost, September 23, 2024. https://www.marktechpost.com/2024/09/23/openai-releases-multilingual-massive-multitask-language-understanding-mmmlu-dataset-on-hugging-face-to-easily-evaluate-multilingual-llms/
8. "OpenAI Releases Groundbreaking Multilingual AI Dataset to Promote Global Language Equality." AIbase, September 24, 2024. https://www.aibase.com/news/11950
9. EvalScope. "MMMLU benchmark documentation." Read the Docs. https://evalscope.readthedocs.io/en/latest/benchmarks/mmmlu.html
10. LLM-Stats. "MMMLU Leaderboard." https://llm-stats.com/benchmarks/mmmlu
11. Schmid, Philipp. Post about MMMLU release on X (formerly Twitter), September 23, 2024. https://x.com/_philschmid/status/1838230108072476951
