| MMLU | |
|---|---|
| Overview | |
| Full name | Measuring Massive Multitask Language Understanding |
| Abbreviation | MMLU |
| Description | A comprehensive benchmark evaluating large language models across 57 diverse academic subjects through multiple-choice questions |
| Release date | 2020-09-07 |
| Latest version | MMLU-Pro |
| Benchmark updated | 2024-06-03 |
| Authors | Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt |
| Organization | University of California, Berkeley |
| Technical Details | |
| Type | Multitask Language Understanding, Knowledge Evaluation |
| Modality | Text |
| Task format | Multiple choice (4 options) |
| Number of tasks | 57 |
| Total examples | 15,908 |
| Evaluation metric | Accuracy, Macro-average |
| Domains | STEM, Humanities, Social Sciences, Professional Fields |
| Languages | English |
| Performance | |
| Human performance | 89.8% |
| Baseline | 25.0% |
| SOTA score | ~92.7% (o3, December 2024) |
| SOTA model | OpenAI o3, Gemini 3 Pro, Claude Opus 4.5 |
| SOTA date | 2024-2026 |
| Saturated | Yes |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | MIT License |
| Successor | MMLU-Pro |
MMLU (Measuring Massive Multitask Language Understanding) is a comprehensive benchmark designed to evaluate large language models across 57 diverse academic and professional subjects through multiple-choice questions. Created by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt at the University of California, Berkeley, with contributions from researchers at the University of Chicago, the benchmark was first released on September 7, 2020, and published as a conference paper at ICLR 2021.[1][2] MMLU consists of 15,908 questions spanning topics from elementary mathematics to professional law, with difficulty levels ranging from high school to expert professional knowledge. It quickly became the most widely cited AI evaluation benchmark of its era, with over 100 million downloads as of 2024 and citations in thousands of academic papers.[2][3]
For four years, MMLU served as the default headline number in nearly every major model release, from GPT-3 in 2020 to GPT-4 in 2023, Claude 3 in 2024, and Llama 3.1 405B later that year. By the time OpenAI's o1-preview crossed 90% in September 2024 and o3 reached approximately 92.7% by year-end, the benchmark had effectively saturated. The community moved on to harder evaluations such as MMLU-Pro, GPQA, and Humanity's Last Exam, but MMLU remains a standard inclusion in evaluation suites, both as a baseline check and as a useful discriminator for mid-tier and open-weight models.
MMLU was developed to address the need for a comprehensive evaluation framework that could assess language models across multiple domains simultaneously, testing both world knowledge and problem-solving abilities. The benchmark emerged from the recognition that existing evaluation methods often focused on narrow domains or specific tasks, failing to capture the breadth of knowledge required for artificial general intelligence.[1]
Before MMLU, the field relied on benchmarks like GLUE (General Language Understanding Evaluation, 2018), which contained nine natural language understanding tasks, and SuperGLUE (2019), which raised the difficulty with reading comprehension and coreference resolution tasks. However, models like BERT and RoBERTa achieved human-level performance on GLUE within roughly a year of its release, and SuperGLUE was similarly saturated by 2021, with many models already scoring above 90%.[4] MMLU was designed to be far more challenging and broader in scope: with 57 subjects spanning elementary to professional-level knowledge, it dwarfed GLUE's nine tasks and SuperGLUE's eight. Testing specialized domain knowledge, from abstract algebra to professional law, was a distinctive feature that set MMLU apart from earlier benchmarks that focused on elementary linguistic competence.[1][4]
The benchmark's design philosophy emphasizes zero-shot and few-shot learning, evaluating models on their pre-trained knowledge without task-specific fine-tuning. This approach provides insights into the general capabilities of language models rather than their ability to memorize specific datasets. Because of its timing and breadth, MMLU quickly became the main reference in AI papers and corporate reports, linking earlier work on narrow tasks with the emergence of general-purpose models. By 2024, MMLU had been downloaded over 100 million times, establishing itself as a standard evaluation metric in the AI research community.[2][3]
The original paper, titled simply "Measuring Massive Multitask Language Understanding," first appeared on arXiv as 2009.03300 on September 7, 2020. It was accepted to ICLR 2021 and earned an OpenReview score that placed it among the top accepted submissions that year. As of 2025, the paper has accumulated over 4,000 citations on Google Scholar, ranking among the most-cited evaluation papers in modern AI.[1][2]
Several factors contributed to MMLU's rapid adoption as the default benchmark for evaluating large language models:
Timing: MMLU arrived in September 2020, just as the scaling era of LLMs was beginning in earnest. GPT-3 had been released only three months earlier, and the field needed a benchmark that could meaningfully differentiate between increasingly capable models.
Breadth of coverage: With 57 subjects, MMLU provided a single aggregate number that could summarize a model's general knowledge and reasoning across a wide range of domains. This made it convenient for model comparison in technical reports and marketing materials.
Difficulty ceiling: When first released, even the best model (GPT-3 175B) scored only 43.9%, while expert human performance was estimated at 89.8%. This left a large gap for improvement, giving the benchmark years of useful discriminative power before saturation became a problem.
Standardized format: The 4-option multiple-choice format with a fixed 5-shot evaluation protocol made results easy to reproduce and compare across different labs and model architectures.
Open availability: The dataset was released under an MIT License on GitHub and later hosted on Hugging Face, making it freely accessible to the entire research community.
Industry adoption: Every major AI lab, including OpenAI, Google DeepMind, Anthropic, and Meta, began reporting MMLU scores in their model release papers, reinforcing its position as a shared reference point. MMLU scores feature across widely followed leaderboards such as the Open LLM Leaderboard, HELM Classic, and HELM Lite.[3][4]
Discriminative power across the scaling curve: Unlike narrower benchmarks that saturated for a single model class, MMLU produced different scores for tiny base models (around 25%, indistinguishable from random), small models in the 1-10B range (typically 30-50%), mid-tier models such as Llama 2 13B (around 55%), large pretrained models (60-75%), and frontier instruction-tuned systems (80-90%). For most of the period from 2020 to 2024, that range made MMLU genuinely useful for ranking models in published leaderboards.
MMLU was led by Dan Hendrycks, then a PhD candidate at UC Berkeley advised by Dawn Song and Jacob Steinhardt. Hendrycks would later co-found the Center for AI Safety (CAIS) in 2022 and remains one of the most prolific creators of LLM evaluation benchmarks; alongside MMLU, he co-authored the MATH and ETHICS benchmarks and the much later Humanity's Last Exam.[1]
The other co-authors were Collin Burns (then UC Berkeley, later OpenAI Superalignment), Steven Basart, Andy Zou, and Mantas Mazeika (Berkeley collaborators), with senior advisors Dawn Song (UC Berkeley) and Jacob Steinhardt (UC Berkeley). The dataset assembly itself was a substantial undertaking. The team manually collected and reformatted questions from a wide range of educational sources: free practice exams, GRE prep books, AP test materials, USMLE study guides, law school admission tests, professional certification exams, and Oxford and Cambridge undergraduate problem sets. Each of the 57 subjects required at least 100 questions, with sourcing tailored to the appropriate professional or academic level for that domain.[1]
MMLU's questions were sourced from a variety of educational materials, including textbooks, online resources, and practice exams, and were carefully curated during assembly.[1]
Questions were designed to require subject-matter knowledge, not just reading comprehension. Hendrycks and colleagues deliberately avoided pulling from any single test, since reusing a high-stakes exam in full would have made contamination detection trivially easy and would have invited copyright concerns. Instead, the questions were drawn from many free practice resources and reformatted into a uniform 4-option layout.
The complete MMLU dataset is organized as follows:
| Component | Number of Questions | Purpose |
|---|---|---|
| Development set | 285 (5 per subject) | Few-shot examples |
| Validation set | 1,540 | Hyperparameter tuning |
| Test set | 14,079 | Main evaluation |
| Auxiliary training set | ~100,000 | Supervised fine-tuning experiments |
| Total (test + val + dev) | 15,908 | Complete benchmark |
Each of the 57 subjects contains a minimum of 100 test questions. The development set provides exactly 5 examples per subject, which serve as the in-context examples for the standard 5-shot evaluation protocol. The auxiliary training set was provided as an optional resource for researchers wanting to fine-tune models on similar material; in practice, almost all reported MMLU scores are evaluated zero-shot or few-shot rather than after fine-tuning.[1]
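For concreteness, the split structure described above can be inspected directly from the public Hugging Face mirror. The sketch below assumes the commonly used "cais/mmlu" dataset card and its field names ("question", "choices", "answer"); other mirrors may use different configuration or field names.

```python
# Minimal sketch of loading one MMLU subject from the Hugging Face Hub.
# Assumes the "cais/mmlu" dataset card; config and field names may differ
# in other mirrors of the benchmark.
from datasets import load_dataset

subject = "abstract_algebra"
mmlu = load_dataset("cais/mmlu", subject)

dev = mmlu["dev"]     # 5 few-shot examples for this subject
test = mmlu["test"]   # at least 100 test questions per subject

example = test[0]
print(example["question"])  # question text
print(example["choices"])   # list of the four answer options
print(example["answer"])    # integer index 0-3 of the correct option
```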
The standard evaluation protocol for MMLU uses a 5-shot in-context learning setup. In this format, the model receives five solved example questions from the same subject as context before being asked to answer a new question. Each example includes the question text, the four answer options (A through D), and the correct answer letter. The format looks like this:[1]
The following are multiple choice questions (with answers) about [subject].
[Example Question 1]
A. [Option A]
B. [Option B]
C. [Option C]
D. [Option D]
Answer: [Correct Letter]
... (4 more examples) ...
[Test Question]
A. [Option A]
B. [Option B]
C. [Option C]
D. [Option D]
Answer:
The model must produce the correct letter (A, B, C, or D) as its response. The primary metric is exact-match accuracy, and the overall score is computed as a macro-average across all 57 subjects, giving equal weight to each subject regardless of its size.[1]
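As an illustration, the 5-shot prompt assembly and macro-averaged scoring described above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation; the helper names and the assumed record fields ("question", "choices", "answer") are invented for the example.

```python
# Sketch of the standard 5-shot prompt construction and macro-averaged scoring.
LETTERS = "ABCD"

def format_example(q, include_answer=True):
    # Render one question in the standard "A./B./C./D. ... Answer: X" layout.
    lines = [q["question"]]
    for letter, option in zip(LETTERS, q["choices"]):
        lines.append(f"{letter}. {option}")
    lines.append("Answer:" + (f" {LETTERS[q['answer']]}" if include_answer else ""))
    return "\n".join(lines)

def build_prompt(subject, dev_examples, test_question):
    # dev_examples: the 5 solved dev-set questions for this subject.
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n\n".join(format_example(q) for q in dev_examples)
    return header + shots + "\n\n" + format_example(test_question, include_answer=False)

def mmlu_score(per_subject_results):
    # per_subject_results: {subject: list of booleans (answered correctly or not)}.
    # Macro-average: every subject contributes equally, regardless of its size.
    subject_acc = [sum(r) / len(r) for r in per_subject_results.values()]
    return sum(subject_acc) / len(subject_acc)
```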
Beyond the standard 5-shot protocol, MMLU is also commonly evaluated zero-shot and, for newer models, with chain-of-thought prompting followed by answer extraction, although scores obtained under these settings are not directly comparable to standard 5-shot results.[1]
The original paper established two human performance baselines:[1]
| Baseline type | Accuracy | Description |
|---|---|---|
| Expert-level | 89.8% | Estimated from 95th-percentile scores of real-world test takers in relevant domains |
| Non-expert (MTurk) | 34.5% | Amazon Mechanical Turk workers without domain expertise |
| Random guessing | 25.0% | Chance-level performance with 4 options |
The large gap between non-expert (34.5%) and expert (89.8%) human performance highlights that MMLU questions require genuine domain knowledge, not just general reading comprehension or common sense. The 89.8% expert ceiling is itself a soft upper bound rather than a literal cap on possible accuracy: it reflects the average performance of competent humans across all 57 subjects, and a model that strictly outperformed expert humans on every subject could in principle exceed this number.
The 57 subjects are organized into four broad categories, which are used for reporting sub-scores by domain; the overall macro-average continues to weight each of the 57 subjects equally.
The STEM category covers scientific and technical fields, ranging from abstract algebra at the graduate level to elementary mathematics suitable for grade-school students.
The humanities category encompasses history, philosophy, and law.
The social sciences category covers economics, psychology, and other society-focused subjects such as sociology, geography, and government.
The remaining category groups professional fields and miscellaneous topics, including medicine, business, and health.
The Professional Law subject deserves special mention. With 1,534 questions sourced from law school admission tests and bar exam study materials, it is by far the largest subject in MMLU, accounting for roughly 11% of the total test set. As a result, even though the macro-average gives every subject equal weight, models that struggle with legal reasoning leave a particularly conspicuous deficit on this subject in subject-by-subject reports.
When MMLU was first released, even the best available models performed far below expert-level accuracy. The original paper reported the following results using 5-shot evaluation:[1]
| Model | Parameters | MMLU score (5-shot) | Notes |
|---|---|---|---|
| Random Baseline | - | 25.0% | Chance-level with 4 options |
| GPT-2 | 1.5B | 32.4% | Barely above random |
| RoBERTa-base | 125M | 27.9% | Near random chance |
| UnifiedQA | 3B | 43.7% | Question-answering specialized model |
| GPT-3 (zero-shot) | 175B | 37.7% | Without in-context examples |
| GPT-3 (5-shot) | 175B | 43.9% | Best result at release |
| Human expert | - | 89.8% | 95th-percentile test takers |
GPT-3's performance was highly uneven across subjects. It achieved 69% on US Foreign Policy (its best subject) but scored near random chance on subjects like College Chemistry, highlighting the model's inconsistent knowledge coverage.[1] The authors interpreted this lopsidedness as a key finding: scale alone improved average performance, but it did not produce the kind of uniform expertise across domains that the term "general intelligence" might suggest.
The progression of model performance on MMLU from 2020 to 2026 demonstrates the rapid advancement of AI capabilities. The leaderboard below tracks the most influential reported scores; not all are directly comparable due to evaluation differences (see the section on standardized evaluation discrepancies below).
| Year | Model | Organization | Parameters | MMLU score | Evaluation | Key milestone |
|---|---|---|---|---|---|---|
| 2020 | GPT-3 | OpenAI | 175B | 43.9% | 5-shot | Initial benchmark release |
| 2021 | Gopher | DeepMind | 280B | 60.0% | 5-shot | First model above 50% |
| 2022 | Chinchilla | DeepMind | 70B | 67.5% | 5-shot | Compute-optimal training |
| 2022 | PaLM | Google | 540B | 69.3% | 5-shot | Pathways-based architecture |
| 2023 | Flan-PaLM 2-L | Google | - | 81.2% | 5-shot | Instruction tuning gains |
| 2023 | GPT-4 | OpenAI | - | 86.4% | 5-shot | Approaching human performance |
| 2024 | Claude 3 Opus | Anthropic | - | 86.8% | 5-shot | Near human-expert level |
| 2024 | Llama 3 70B Instruct | Meta | 70B | 82.0% | 5-shot | Open-weight model surpasses GPT-3.5 |
| 2024 | Gemini Ultra | Google DeepMind | - | 90.0% | CoT@32 | First reported above 90% (non-standard eval) |
| 2024 | Gemini Ultra | Google DeepMind | - | 83.7% | 5-shot | Standard protocol, comparable to peers |
| 2024 | Llama 3.1 405B Instruct | Meta | 405B | 87.3% | 5-shot | Largest open-weight frontier model |
| 2024 | GPT-4o | OpenAI | - | 88.7% | 5-shot | Multimodal flagship |
| 2024 | Claude 3.5 Sonnet | Anthropic | - | 88.7% | 5-shot | Compact high-performance model |
| 2024 | OpenAI o1-preview | OpenAI | - | 90.8% | 0-shot CoT | Reasoning model, surpasses human expert |
| 2024 | OpenAI o1 | OpenAI | - | 92.3% | 0-shot CoT | First broad release reasoning model |
| 2024 | OpenAI o3 | OpenAI | - | ~92.7% | 0-shot CoT | Reported by OpenAI in December 2024 |
| 2025 | Claude Opus 4 | Anthropic | - | ~91% | 5-shot / CoT | Frontier reasoning model, near-saturation |
| 2025 | Gemini 3 Pro | Google DeepMind | - | ~91% | 5-shot / CoT | Frontier reasoning model |
| 2026 | DeepSeek-V4-Pro-Base | DeepSeek | - | ~90.1% | 5-shot | Open-weight frontier model |
Note: Scores are not always directly comparable due to differences in evaluation methodology. Some reported scores use chain-of-thought prompting, majority voting, or other techniques that can inflate results relative to the standard 5-shot protocol. Gemini Ultra's 90.0% score, for instance, used a chain-of-thought method with uncertainty routing across 32 samples (CoT@32), while its standard 5-shot score was 83.7%.[5][6] By 2025, most labs report MMLU primarily for backward compatibility and instead lead with newer benchmarks for frontier evaluations.
A 2024 study by Stanford's Center for Research on Foundation Models (CRFM) using the HELM framework revealed that model creators frequently report higher MMLU scores than independent evaluation can reproduce. By running all models with the same prompt template and the same 5 in-context examples per subject, HELM found consistent discrepancies:[5]
| Model | Creator-reported score | HELM score | Difference |
|---|---|---|---|
| GPT-4 (0613) | 86.4% | 82.4% | -4.0 |
| Claude 3 Opus | 86.8% | 84.6% | -2.2 |
| Claude 2.1 | 78.5% | 73.5% | -5.0 |
| PaLM 2 Unicorn | 81.2% | 78.6% | -2.6 |
| Llama 3 (70B) | 79.5% | 79.3% | -0.2 |
| Mixtral (8x22B) | 77.6% | 77.8% | +0.2 |
| Gemma (7B) | 64.3% | 66.1% | +1.8 |
The HELM study identified several sources of score inflation: non-standard prompting techniques, proprietary evaluation snapshots that prevented independent verification, and insufficient documentation of prompt templates.[5] HELM also produced a side-by-side analysis showing that prompt-template variation alone (whitespace, label letters, the exact phrasing of the instruction line) could shift a single model's score by several percentage points without any change to the underlying capability.
Analysis reveals significant variation in model performance across domains:[1]
| Category | Average score (top models) | Easiest subject | Hardest subject |
|---|---|---|---|
| STEM | 85% | High School Mathematics (92%) | Abstract Algebra (65%) |
| Humanities | 87% | World Religions (91%) | Formal Logic (72%) |
| Social Sciences | 89% | Marketing (93%) | Econometrics (70%) |
| Professional | 86% | Management (90%) | Professional Law (75%) |
The pattern of easy versus hard subjects has been remarkably stable across model generations. Models tend to do best on subjects with broad popular coverage on the open web (Marketing, Management, US history) and worst on subjects that require either rigorous formal manipulation (Formal Logic, Abstract Algebra) or highly specialized professional knowledge that is not heavily represented in scraped web data (Econometrics, Professional Law). Reasoning-tuned models such as the o-series narrowed but did not fully eliminate this gap.
In June 2024, Aryo Pradipta Gema and colleagues from the University of Edinburgh published "Are We Done with MMLU?", a paper that systematically audited the quality of MMLU questions. The study, later published at NAACL 2025, introduced MMLU-Redux: a subset of 3,000 manually re-annotated questions across 30 MMLU subjects, reviewed by 14 human experts.[7]
The study found that more than 9% of the sampled questions contain errors. These errors were classified using a hierarchical taxonomy:[7]
Type 1 (question assessment) covers problems with the question itself, such as unclear or ambiguous question wording and poorly specified answer options.
Type 2 (ground truth verification) covers labeling problems: questions with no correct option among the choices, questions with multiple correct options, and questions whose recorded ground-truth label is wrong.
Wrong ground truth labels (Type 2c) were the most prevalent error type, accounting for approximately 3% of all questions. The error rates varied dramatically by subject:[7]
| Subject | Error rate | Breakdown |
|---|---|---|
| Virology | 57% | 33% wrong ground truth, 15% unclear questions, 4% multiple correct answers |
| Logical Fallacies | 26% | Mix of unclear options and wrong labels |
| College Chemistry | 25% | Incorrect answers and ambiguous questions |
| Professional Law | 18% | Multiple defensible answers |
| Business Ethics | 14% | Wrong ground truth labels |
| Formal Logic | 13% | Ambiguous question formulations |
| Human Aging | 12% | Wrong labels |
| Global Facts | 12% | Outdated or incorrect factual claims |
| Machine Learning | 11% | Ambiguous technical questions |
The impact on model evaluation was substantial. In the Virology subset, for example, Claude 3 Opus's score shifted from 54% (ranked 9th) to 88% (ranked 6th) when only correctly labeled questions were considered, demonstrating that dataset errors can significantly distort model rankings.[7]
Data contamination, where benchmark questions appear in a model's training data, has become a major concern for MMLU's reliability. Because MMLU's questions are drawn from publicly available educational materials, there is a high probability that some questions were included in the web-scraped training corpora of modern LLMs.[8][9]
A 2024 study applied a lexical contamination detection pipeline to 513 sampled MMLU test questions and found an overall contamination rate of 13.8%. The contamination was not evenly distributed: STEM subjects showed an 18.1% rate, and Philosophy had the highest individual contamination at 66.7%. The study also found that ChatGPT and GPT-4 could guess the missing answer options in benchmark test data with exact match rates of 52% and 57% respectively, suggesting significant memorization of the test questions.[8]
When contamination is controlled for, model performance drops significantly. One analysis found that model accuracy dropped by an average of 7.0 percentage points when surface wording was changed to indirect references, with drops as high as 19.8 percentage points in Law and Ethics, precisely the domains most heavily contaminated.[8]
The contamination problem is structural. The original MMLU dataset has been hosted publicly on GitHub since 2020, mirrored on Hugging Face, and copied across countless tutorials, blog posts, and educational sites. By the time a frontier model is pre-trained on a multi-trillion-token corpus scraped from the public web, the chance that some MMLU questions and their answers appear verbatim in the training data approaches certainty for any honest accounting. This is one of the main motivations behind MMLU-CF and the broader move toward held-out, contamination-free evaluations such as GPQA and Humanity's Last Exam.
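As a rough illustration of what a lexical check involves (this is not the pipeline used in the cited study), a simple n-gram overlap test can flag questions whose wording appears verbatim in a training corpus. The 13-token window below is a common heuristic choice, not a standard.

```python
# Simple lexical-overlap contamination check: flag a benchmark question as
# potentially contaminated if a long word n-gram from the question appears
# verbatim in the training corpus. Illustrative only.
def ngrams(text, n=13):
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def possibly_contaminated(question_text, corpus_ngrams, n=13):
    return not ngrams(question_text, n).isdisjoint(corpus_ngrams)

# Usage sketch: corpus_ngrams would be built once from the training documents, e.g.
# corpus_ngrams = set().union(*(ngrams(doc) for doc in training_documents))
```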
Model scores on MMLU can vary by up to 10% depending on the exact prompt template used, including factors like whitespace formatting, instruction phrasing, and answer extraction method. This sensitivity makes it difficult to compare scores across different evaluation setups and has led to calls for stricter standardization of evaluation protocols.[5][10]
A detailed Hugging Face investigation in mid-2023 traced this directly to the Open LLM Leaderboard. When the leaderboard adopted a particular MMLU prompt template, several open-source models fell several percentage points behind their reported scores, sparking confusion about whether the rankings were correct. The eventual answer was that the rankings were correct under that specific template, but the same model could legitimately report a different number under the template originally used by the model's authors. The lesson generalized: "the MMLU score" of a model is meaningful only relative to a specific implementation of the benchmark.[10]
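A minimal illustration of the template problem: the two variants below encode the same question but differ in whitespace, option labels, and the instruction line. They are invented examples, not the specific templates compared in the investigation.

```python
# Two superficially equivalent templates for the same multiple-choice item.
# Evaluating a model under each can yield different measured accuracies.
question = "Which planet is known as the Red Planet?"
options = ["Venus", "Mars", "Jupiter", "Saturn"]

template_a = (
    f"{question}\n"
    + "\n".join(f"{l}. {o}" for l, o in zip("ABCD", options))
    + "\nAnswer:"
)

template_b = (
    f"Question: {question}\n"
    + "\n".join(f"({l}) {o}" for l, o in zip("ABCD", options))
    + "\nThe correct answer is"
)
# Same content, different formatting; a model's score is only comparable
# across runs that fix one template and one answer-extraction method.
```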
Beyond errors and contamination, MMLU has other recognized limitations: it covers only English, it tests four-option multiple-choice recognition rather than free-form generation or multi-step reasoning, and some of its factual content, collected in 2020, has since become outdated.
By late 2024, MMLU was considered largely saturated as a discriminative benchmark for frontier models. Top models clustered within a narrow 86-92% accuracy band when evaluated with the standard 5-shot protocol, making it difficult to draw meaningful distinctions between leading systems.[10][11]
The saturation of MMLU followed a predictable pattern that has affected previous benchmarks:
| Benchmark | Year introduced | Year saturated | Time to saturation |
|---|---|---|---|
| GLUE | 2018 | 2019 | ~1 year |
| SuperGLUE | 2019 | 2021 | ~2 years |
| MMLU | 2020 | 2024 | ~4 years |
| GPQA | 2023 | 2025 | ~2 years |
Despite saturation at the frontier, MMLU remains useful for evaluating mid-tier and open-weight models, where there is still meaningful performance variation. Many leaderboards and evaluation platforms continue to include MMLU alongside newer, more challenging benchmarks. It also serves as a kind of regression test: a new release that posts substantially below 80% on MMLU is almost certainly weaker than the prior generation, regardless of how it scores on more selective benchmarks.[3][10]
The saturation conversation also prompted a useful methodological correction in the field. Rather than chasing a single number that frontier models had effectively topped out, leaderboards such as Stanford HELM, Vellum, Artificial Analysis, and the LM Arena began publishing aggregated views that combine MMLU with SWE-bench, HumanEval, GPQA, AIME, MATH, and Humanity's Last Exam. The shift acknowledges that no single benchmark can summarize a frontier model's capability, and that MMLU's strength was always its breadth rather than its depth.
MMLU-Pro is a more challenging successor benchmark developed by TIGER-Lab (led by Yubo Wang, Xueguang Ma, Ge Zhang, Wenhu Chen, and colleagues) and published at NeurIPS 2024. It was designed to address the saturation and noise problems of the original MMLU.[11]
Key differences from the original MMLU:
| Feature | MMLU | MMLU-Pro |
|---|---|---|
| Answer options | 4 (A-D) | 10 (A-J) |
| Total questions | 15,908 | ~12,000 |
| Subjects | 57 | 14 consolidated domains |
| Random guess baseline | 25.0% | 10.0% |
| Prompt sensitivity | 4-5% variation | ~2% variation |
| CoT benefit | Negligible or negative | Significant positive effect |
| Question focus | Knowledge recall | Complex reasoning |
MMLU-Pro's 10-option format makes random guessing far less effective (10% vs. 25% baseline) and forces models to engage in more careful discrimination between plausible answers. The benchmark was constructed by integrating harder questions from academic exams and textbooks across 14 domains: Biology, Business, Chemistry, Computer Science, Economics, Engineering, Health, History, Law, Math, Philosophy, Physics, Psychology, and Others.[11]
Performance on MMLU-Pro is significantly lower than on the original MMLU, with accuracy drops ranging from roughly 12 to 33 percentage points depending on the model. Initial MMLU-Pro scores from the original paper:[11]
| Model | MMLU-Pro score | MMLU score | Drop |
|---|---|---|---|
| Claude 3.5 Sonnet | 76.1% | 88.7% | -12.6 |
| GPT-4o | 72.6% | 88.7% | -16.1 |
| Gemini 1.5 Pro | 69.0% | 83.7% | -14.7 |
| Claude 3 Opus | 68.5% | 86.8% | -18.3 |
| GPT-4 Turbo | 63.7% | 86.4% | -22.7 |
| Llama 3 70B Instruct | 56.2% | 79.5% | -23.3 |
A notable finding was that chain-of-thought reasoning provides a substantial benefit on MMLU-Pro, in contrast to the original MMLU where CoT prompting offered little or no improvement. This indicates that MMLU-Pro questions require more genuine multi-step reasoning rather than simple knowledge recall.[11] By 2025, frontier reasoning models had pushed MMLU-Pro into the upper 80s as well; for example, Gemini 3.1 Pro Preview reportedly led the Artificial Analysis MMLU-Pro leaderboard at around 91%, with Claude Opus 4.5 (Reasoning) and Gemini 3 Pro both close behind in the high 80s to low 90s, suggesting that MMLU-Pro itself may saturate within a few years.
MMLU-Redux, introduced by Gema et al. (2024) in the paper "Are We Done with MMLU?" and published at NAACL 2025, is not a wholly new benchmark but rather an error-corrected subset of the original MMLU. It consists of 3,000 manually re-annotated questions across 30 subjects, with each question reviewed by domain experts for question clarity and answer correctness. The corrected labels provide a more reliable evaluation baseline and can reveal how much of a model's apparent performance is an artifact of dataset errors rather than genuine capability.[7]
MMLU-CF (Contamination-Free) was proposed by Microsoft researchers and published at ACL 2025. It addresses the contamination problem directly by creating an entirely new set of 20,000 multiple-choice questions (10,000 for a closed-source test set and 10,000 for an open-source validation set) sourced from broader domains with three decontamination rules designed to prevent both unintentional and malicious data leakage. Evaluation of over 40 mainstream LLMs on MMLU-CF showed that model accuracy dropped by 14 to 16 percentage points compared to the original MMLU, and performance rankings changed considerably, confirming that contamination has inflated reported scores on the original benchmark.[9]
CMMLU (Chinese MMLU) was introduced by Li et al. (2023, ACL Findings 2024) as a comprehensive Chinese-language counterpart to the English MMLU. It covers natural science, social sciences, engineering, and humanities, including more than 10 subjects that are not typically found in standard exams but are relevant to daily life in China, such as Chinese food culture, Chinese driving rules, and Chinese law. Compared to other Chinese benchmarks such as C-Eval and M3KE, CMMLU has more humanities, social science, and culture-specific subjects but fewer STEM subjects. It is one of the most widely used Chinese-language LLM benchmarks and is included by default when reporting results for Chinese-developed models such as Qwen, DeepSeek, and Baichuan.[12]
AGIEval, introduced by Microsoft researchers in 2023, takes a different approach: rather than reformatting practice exam questions, it draws problems from real, high-stakes standardized exams in both Chinese and English. Question sources include the SAT, LSAT, GMAT, GRE, GAOKAO (the Chinese national college entrance examination), and various civil service tests. AGIEval was explicitly motivated by the observation that scoring well on benchmarks composed of practice questions does not necessarily mean a model would score well on the underlying exams that the practice questions imitate.
MMMU (Massive Multi-discipline Multimodal Understanding), introduced by Yue et al. in 2024, is the multimodal counterpart to MMLU. Where MMLU is text-only, MMMU consists of college-level questions that require interpreting images such as charts, diagrams, maps, tables, music sheets, and chemical structures. CMMMU is the Chinese-language version of MMMU. These benchmarks have largely replaced MMLU as the headline number for evaluating multimodal frontier models, while MMLU continues to be reported as a text-only baseline.[13]
Several additional specialized versions have been developed to address specific limitations of the original MMLU:
| Variant | Focus | Key features | Venue/Year |
|---|---|---|---|
| MMLU-Redux | Error correction | 3,000 re-annotated questions across 30 subjects | NAACL 2025 |
| MMLU-Pro | Harder reasoning | 10 answer options, ~12,000 questions, 14 domains | NeurIPS 2024 |
| MMLU-CF | Contamination-free | 20,000 new questions with decontamination rules | ACL 2025 |
| MMLU-SR | Robustness testing | Modified terminology to test sensitivity to surface changes | 2024 |
| CodeMMLU | Programming | Software engineering and coding focus | 2024 |
| IndicMMLU-Pro | Multilingual | Adaptation for Indian languages | 2025 |
| Global-MMLU | Multilingual | Translated and culturally adapted across 42 languages | 2024 |
| CMMLU | Chinese | Native Chinese questions with 67 subjects | ACL Findings 2024 |
| AGIEval | Standardized exams | SAT, LSAT, GMAT, GRE, GAOKAO | 2023 |
| MMMU | Multimodal | College-level questions requiring images | CVPR 2024 |
MMLU is available through multiple platforms, including the authors' original GitHub repository, mirrors on the Hugging Face Hub, and built-in task implementations in evaluation frameworks such as EleutherAI's lm-evaluation-harness and Stanford HELM.[1][3]
EleutherAI's lm-evaluation-harness is the most widely used open-source framework for evaluating language models, and it has become the de facto standard for reproducible MMLU scoring on open-weight models. The harness wraps over 60 academic benchmarks, including MMLU, HellaSwag, ARC, TruthfulQA, Winogrande, GSM8K, and HumanEval. When the Hugging Face Open LLM Leaderboard ran from 2023 to 2024, it used lm-evaluation-harness as its evaluation backbone for MMLU, ensuring that all submitted models were scored under identical conditions.[14]
The harness implements MMLU primarily as a log-probability evaluation rather than a generative one. For each question, it computes the model's log-probability for each of the four answer letter tokens (A, B, C, D) given the prompt, and selects the highest as the model's answer. This avoids the parsing problems that plague generative evaluation, where a model might respond with "The answer is C" or "C) Paris" or simply "Paris" depending on instruction tuning.[14]
This log-probability approach is also why MMLU scores on the same model can differ between, say, OpenAI's official report (typically generative, since the API does not expose log-probabilities by default for chat models) and an open-source reproduction (typically log-probability based). Both numbers are real; they just measure slightly different things.
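A minimal sketch of the log-probability approach is shown below, using the Hugging Face transformers API with a small model as a stand-in. It assumes that each answer letter preceded by a space tokenizes to a single token, which holds for many BPE tokenizers but should be verified for the tokenizer in use; it is not the lm-evaluation-harness implementation itself.

```python
# Log-probability scoring in the style described above: append each answer
# letter to the prompt, score it under the model, and pick the most likely one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def pick_answer(prompt):
    # `prompt` is the full few-shot prompt ending in "Answer:".
    scores = {}
    for letter in "ABCD":
        full = prompt + " " + letter
        inputs = tokenizer(full, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        # Logits at position -2 predict the final token (the answer letter).
        last_logprobs = torch.log_softmax(logits[0, -2], dim=-1)
        letter_id = inputs["input_ids"][0, -1]
        scores[letter] = last_logprobs[letter_id].item()
    return max(scores, key=scores.get)
```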
The standard evaluation procedure follows this format:[1]
The following are multiple choice questions (with answers) about [subject name].
Question: [Question text]
A. [Option A]
B. [Option B]
C. [Option C]
D. [Option D]
Answer: [Correct letter]
(... 4 more examples ...)
Question: [Test question text]
A. [Option A]
B. [Option B]
C. [Option C]
D. [Option D]
Answer:
Models are scored on exact-match accuracy against the correct answer letter, reported both per subject and as the macro-average across all 57 subjects.
Two main approaches are used to extract a model's answer from its output: comparing the log-probabilities the model assigns to each of the four answer letters and selecting the highest, or letting the model generate free text and parsing the chosen letter from the completion.
The choice of scoring method can affect results, which is another source of variation in reported scores across different evaluations.[5][14]
A third, less common approach is full chain-of-thought generation followed by answer extraction. Here, the model is prompted to reason aloud before producing a final letter. This is the protocol used for evaluating reasoning models like the OpenAI o-series, where the bulk of the model's compute is spent on hidden reasoning before a short final answer. Reasoning-style evaluation is what produced the 90%+ MMLU scores associated with o1, o3, Claude Opus 4, and Gemini 3 Pro.
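For the generative and chain-of-thought protocols, the final letter still has to be parsed out of free-form text. The sketch below shows one common heuristic; it is an illustrative parser, not the extraction logic used by any particular lab or harness.

```python
# Extract a final answer letter from a free-form (chain-of-thought style) completion.
import re

def extract_choice(completion: str):
    # Prefer an explicit statement such as "The answer is (C)" or "Answer: C".
    m = re.search(r"answer\s*(?:is|:)?\s*\(?([ABCD])\)?", completion, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # Fall back to the last standalone capital letter A-D in the text, if any.
    letters = re.findall(r"\b([ABCD])\b", completion)
    return letters[-1] if letters else None

print(extract_choice("Let's reason step by step... so the answer is (B)."))  # prints B
```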
MMLU has had an outsized influence on AI research and development:[2][3]
MMLU played a central role in demonstrating the relationship between model scale and knowledge acquisition. The benchmark showed clear log-linear improvement with increasing model size: GPT-3's jump from 25.9% (smallest variant) to 43.9% (175B parameters) provided early evidence of scaling laws for knowledge-intensive tasks. Later, Chinchilla's result of 67.5% with only 70B parameters (compared to Gopher's 60.0% with 280B) helped establish that training data volume matters as much as parameter count, supporting the compute-optimal training paradigm.[1][15]
The Hoffmann et al. (2022) Chinchilla paper itself used MMLU as one of its four headline benchmarks (alongside The Pile, BIG-bench, and several reading-comprehension tasks). The fact that a smaller, better-trained model out-scored a larger, undertrained one on MMLU was concrete enough to convince the field that the standard pre-training recipe of "more parameters, fixed data" was suboptimal. Subsequent training work, including Llama 1, Llama 2, Llama 3, and Mistral's open releases, all explicitly adopted Chinchilla-style data scaling, with MMLU scores serving as one of the main evidence points.[15]
Beyond headline model evaluation, the benchmark's components have been reused in other ways: the auxiliary training set as fine-tuning material, the test questions as probes in contamination and memorization studies, and the per-subject scores as a diagnostic of a model's domain coverage.
MMLU also influenced how subsequent benchmarks were designed and presented. The pattern of "57 subjects, 4-option MCQ, macro-average, 5-shot in-context examples, expert baseline" became a template that was reused or deliberately adapted by later benchmarks. CMMLU, AGIEval, MMMU, and even non-multiple-choice benchmarks like GPQA and Humanity's Last Exam all carry visible traces of MMLU's design choices, from the use of dev/val/test splits to the convention of reporting both per-subject and aggregate accuracy.
The MMLU ecosystem continues to evolve in response to the limitations identified by the research community.
As MMLU has become saturated for frontier models, several newer benchmarks have emerged to provide more discriminative evaluation:
| Benchmark | Focus | Why it extends beyond MMLU |
|---|---|---|
| MMLU-Pro | Harder multitask | 10 options, reasoning-focused questions |
| GPQA | Graduate-level expertise | PhD-level questions in physics, biology, chemistry |
| Humanity's Last Exam | Frontier difficulty | Expert-level questions designed to resist current models |
| SWE-bench | Real software engineering | Patches against real GitHub issues |
| HumanEval | Code generation | Functional programming challenges |
| GSM8K | Math reasoning | Grade school math word problems |
| BigBench | Broader task diversity | 200+ tasks beyond multiple choice |
| AIME | Competition math | High-difficulty problems with verifiable answers |
| FrontierMath | Research math | Problems written by professional mathematicians |
In the current landscape, frontier model releases typically lead with GPQA Diamond, AIME, SWE-bench Verified, Humanity's Last Exam, and Arena Elo, with MMLU and MMLU-Pro reported lower in the table as legacy comparisons. Mid-tier and open-weight model releases continue to feature MMLU prominently, since the saturation has not yet hit those tiers as hard.