| MMLU-Pro | |
|---|---|
| Overview | |
| Full name | Massive Multitask Language Understanding Professional |
| Abbreviation | MMLU-Pro |
| Description | A more robust and challenging multi-task language understanding benchmark with 10-choice questions |
| Release date | 2024-06 |
| Latest version | 1.0 |
| Benchmark updated | 2024-10 |
| Authors | Yubo Wang, Xueguang Ma, Ge Zhang, et al. |
| Organization | TIGER-AI Lab |
| Technical Details | |
| Type | Knowledge, Reasoning, Multi-task Understanding |
| Modality | Text |
| Task format | Multiple choice (10 options) |
| Number of tasks | 12,032 |
| Total examples | 12,032 |
| Evaluation metric | Accuracy, Chain-of-Thought performance |
| Domains | 14 domains (Biology, Business, Chemistry, Computer Science, etc.) |
| Languages | English |
| Performance | |
| Human performance | ~90% (estimated) |
| Baseline | Random guess: 10% |
| SOTA score | ~90.1% |
| SOTA model | Gemini 3 Pro |
| SOTA date | 2025 |
| Saturated | Approaching |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | MIT |
| Predecessor | MMLU |
**MMLU-Pro** (Massive Multitask Language Understanding Professional) is an advanced artificial intelligence benchmark designed to evaluate large language models' capabilities across challenging multi-task language understanding problems. Released in June 2024 by the TIGER-AI Lab, MMLU-Pro extends the original MMLU benchmark by incorporating more complex reasoning-focused questions, expanding answer choices from four to ten options, and eliminating trivial or noisy questions, resulting in a more robust and discriminative evaluation framework. The benchmark was accepted as a Spotlight paper at the NeurIPS 2024 Datasets and Benchmarks Track.
MMLU-Pro addresses the performance saturation observed in the original MMLU benchmark, where frontier models had converged to scores between 86-87%, making it difficult to distinguish between model capabilities. By increasing both the difficulty and robustness of questions while reducing prompt sensitivity, MMLU-Pro provides a more challenging and stable evaluation environment for modern language models.
The benchmark was developed by Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. It was published under the title "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" and made available on arXiv on June 3, 2024 (arXiv:2406.01574). The dataset is released under the MIT license and hosted on Hugging Face.
The development of MMLU-Pro was driven by several critical observations about the limitations of the original MMLU benchmark: performance saturation among frontier models, high sensitivity to prompt wording, and a preponderance of questions that reward factual recall over multi-step reasoning.
The benchmark specifically targets the need for more challenging evaluations that can differentiate between increasingly capable AI systems while providing more stable and reliable measurements. For instance, the gap between GPT-4o and GPT-4-Turbo on MMLU was only about 1%, but MMLU-Pro widened this gap to approximately 9%, providing clearer signal about relative model capabilities.
MMLU-Pro comprises exactly 12,032 rigorously curated questions spanning 14 diverse domains. Unlike the original MMLU, which drew solely from existing exam questions, MMLU-Pro integrates questions from four distinct sources:
| Source | Question Count | Percentage | Description |
|---|---|---|---|
| Original MMLU | 6,810 | 56.60% | Filtered subset of the original MMLU dataset, with trivial and noisy questions removed |
| STEM Websites | 4,083 | 33.93% | High-quality STEM problems collected from online educational platforms |
| TheoremQA | 598 | 4.97% | Human-annotated questions requiring theorem application for resolution |
| SciBench | 541 | 4.50% | Advanced science questions derived from college-level exams |
| Total | 12,032 | 100% | |
The inclusion of questions from SciBench and TheoremQA specifically strengthens the STEM coverage of the benchmark. SciBench provides college-level exam questions from physics, chemistry, and mathematics courses, while TheoremQA contributes questions that require applying mathematical and scientific theorems. These additional sources help shift the benchmark's emphasis from factual recall toward multi-step reasoning.
The 14 subject domains vary in size. The original MMLU's 57 fine-grained subjects were consolidated into 14 broader categories to reduce redundancy and focus on core knowledge areas.
| Subject | Total Questions | From MMLU | Newly Added | Primary Additional Source |
|---|---|---|---|---|
| Mathematics | 1,351 | 846 | 505 | TheoremQA (344), SciBench (161) |
| Physics | 1,299 | 411 | 888 | STEM Websites (617), SciBench (167), TheoremQA (104) |
| Chemistry | 1,132 | 178 | 954 | STEM Websites (741), SciBench (213) |
| Law | 1,101 | 1,101 | 0 | None (entirely from MMLU) |
| Engineering | 969 | 67 | 902 | STEM Websites (902) |
| Other | 924 | 924 | 0 | None (entirely from MMLU) |
| Economics | 844 | 444 | 400 | STEM Websites |
| Health | 818 | 818 | 0 | None (entirely from MMLU) |
| Psychology | 798 | 493 | 305 | STEM Websites |
| Business | 789 | 155 | 634 | STEM Websites |
| Biology | 717 | 219 | 498 | STEM Websites |
| Philosophy | 499 | 499 | 0 | None (entirely from MMLU) |
| Computer Science | 410 | 274 | 136 | STEM Websites |
| History | 381 | 381 | 0 | None (entirely from MMLU) |
| Total | 12,032 | 6,810 | 5,222 | |
Notably, several domains (Law, Philosophy, History, Health, and the catch-all Other category) retain 100% of their questions from the original MMLU, while the STEM domains received substantial supplementation from external sources. Engineering, for example, draws 93% of its questions from STEM websites rather than from MMLU.
The following table summarizes the principal design differences between MMLU and MMLU-Pro:
| Feature | MMLU | MMLU-Pro | Significance |
|---|---|---|---|
| Total Questions | 15,908 | 12,032 | Quality over quantity |
| Subject Categories | 57 | 14 | Consolidated for clarity |
| Answer Choices per Question | 4 (A through D) | 10 (A through J) | Reduces random guessing from 25% to 10% |
| Random Guess Baseline | 25% | 10% | 60% reduction in guessing advantage |
| Prompt Sensitivity | 4-5% variance | ~2% variance | More stable and reliable scoring |
| Question Quality | Mixed; includes trivial items | Curated; noise removed | Stronger signal of true capability |
| Reasoning Requirement | Minimal; mostly knowledge recall | Substantial; multi-step reasoning | Better tests analytical ability |
| Chain-of-Thought Benefit | Negligible or negative | +4% to +19% improvement | Confirms reasoning-heavy content |
| Question Sources | Existing exam banks only | MMLU + SCIBENCH + TheoremQA + STEM sites | Broader, more challenging coverage |
| Publication | ICLR 2021 (Hendrycks et al.) | NeurIPS 2024 Datasets Track (Wang et al.) | |
The construction of MMLU-Pro followed a multi-stage pipeline involving automated filtering, GPT-4-based augmentation, and expert human review.
The authors began with the 13,937 questions in the MMLU test set. They evaluated each question against eight language models of varying capability. Questions that were answered correctly by four or more of the eight models were deemed too easy and removed from the dataset. This filtering step eliminated 5,886 questions (approximately 42% of the original set), leaving 8,051 MMLU questions. After additional noise and error removal through expert review, 6,810 MMLU-origin questions were retained for MMLU-Pro.
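The filtering rule itself is simple to express. The sketch below is a minimal illustration under stated assumptions, not the authors' actual pipeline; `model_correctness`, mapping each question ID to the eight per-model correctness flags, is a hypothetical structure.

```python
# Minimal sketch of the MMLU difficulty filter (illustrative, not the authors' code).
# Assumes `model_correctness` maps each question ID to a list of eight booleans,
# one per filter model, indicating whether that model answered correctly.

def is_too_easy(correct_flags, threshold=4):
    """A question is dropped if at least `threshold` of the eight models solve it."""
    return sum(correct_flags) >= threshold

def filter_questions(questions, model_correctness):
    """Keep only questions that fewer than four of the eight reference models answered correctly."""
    return [
        q for q in questions
        if not is_too_easy(model_correctness[q["question_id"]])
    ]
```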
To strengthen coverage in science, technology, engineering, and mathematics, the authors incorporated questions from three additional sources: STEM websites, TheoremQA, and SciBench.
Questions from TheoremQA and SciBench were originally in free-response format. GPT-4-Turbo was used to extract short correct answers from the solutions and convert them into multiple-choice format.
One of the most distinctive features of MMLU-Pro is its expansion from four answer choices to ten. For questions that originally had only four options, GPT-4-Turbo was employed to generate six additional plausible distractors for each question. These distractors were not random; they were designed to be plausible incorrect answers that require discriminative reasoning to eliminate. This expansion reduces the probability of guessing correctly from 25% (with four choices) to 10% (with ten choices), placing greater demand on genuine understanding.
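The option-expansion step can be sketched as a single augmentation call per question. The snippet below is an illustrative sketch assuming the OpenAI Python client; the prompt wording and the `generate_distractors` helper are hypothetical, and the authors' actual prompts and post-processing (including the expert review described next) are not reproduced here.

```python
# Illustrative distractor-augmentation sketch (hypothetical helper, not the authors' pipeline).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_distractors(question, correct_answer, existing_options, n=6):
    """Ask an LLM for n additional plausible but incorrect answer options."""
    prompt = (
        f"Question: {question}\n"
        f"Correct answer: {correct_answer}\n"
        f"Existing incorrect options: {existing_options}\n"
        f"Write {n} new answer options that are incorrect but plausible, one per line."
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.strip().splitlines()
    return [line.strip() for line in lines if line.strip()][:n]
```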
The dataset underwent two phases of expert review to verify answer correctness and remove remaining noisy or unsuitable questions.
Each question in MMLU-Pro follows a standardized format: a question stem, ten answer options labeled A through J, the correct answer with its index, and a reference Chain-of-Thought solution. Model performance is reported using the following metrics:
| Metric | Description | Calculation |
|---|---|---|
| Overall Accuracy | Percentage of correct answers across all questions | (Correct / Total) x 100% |
| Domain Accuracy | Performance within each of the 14 subject areas | (Correct in domain / Total in domain) x 100% |
| CoT Accuracy | Accuracy when the model uses Chain-of-Thought prompting | CoT correct / Total x 100% |
| Direct Answer Accuracy | Accuracy with direct answer extraction (no reasoning) | Direct correct / Total x 100% |
| CoT Gain | Improvement attributable to Chain-of-Thought reasoning | CoT accuracy minus Direct accuracy |
The standard evaluation protocol uses 5-shot Chain-of-Thought prompting, where the model is given five example questions with worked-out reasoning before being asked to solve test questions.
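A minimal sketch of how the 5-shot CoT protocol and these metrics fit together is shown below. Field names (`question`, `options`, `cot_content`, `answer`) follow the dataset schema described later in this article; the prompt template is illustrative rather than the official one, and the model call itself is omitted.

```python
# Illustrative 5-shot CoT evaluation sketch (prompt template is an assumption).
LETTERS = "ABCDEFGHIJ"

def format_question(example):
    """Render one record as a 10-option multiple-choice question."""
    options = "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(example["options"]))
    return f"Question: {example['question']}\n{options}\nAnswer:"

def build_prompt(test_example, shots):
    """Prepend five worked examples (with their reference CoT) to the test question."""
    demos = "\n\n".join(
        f"{format_question(s)} {s['cot_content']} The answer is ({s['answer']})."
        for s in shots
    )
    return f"{demos}\n\n{format_question(test_example)}"

def accuracy(predicted_letters, examples):
    """Overall accuracy: (correct / total) x 100%."""
    correct = sum(p == e["answer"] for p, e in zip(predicted_letters, examples))
    return 100.0 * correct / len(examples)

# CoT gain is then simply the difference between two accuracies:
# cot_gain = accuracy(cot_predictions, test_set) - accuracy(direct_predictions, test_set)
```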
A key advantage of MMLU-Pro over its predecessor is reduced sensitivity to prompt wording. The authors tested 24 different prompt styles across four categories:
| Prompt Category | Variations Tested | Score Variance |
|---|---|---|
| Instruction Format | 6 styles | Less than 1% |
| Few-shot Examples | 0 to 5 examples | ~1.5% |
| Output Format | 6 formats | Less than 1% |
| Task Framing | 6 approaches | Less than 1% |
On the original MMLU, the same prompt variations produced 4-5% swings in model scores. MMLU-Pro's reduced sensitivity (approximately 2% total variance) means that benchmark results are more reproducible and less dependent on prompt engineering choices.
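The sensitivity figures themselves are just the spread of one model's accuracy across prompt variants. A minimal sketch, assuming a list of accuracies obtained with different prompt styles (the numbers below are hypothetical):

```python
# Summarizing prompt sensitivity as the spread of scores across prompt variants.
from statistics import stdev

def prompt_sensitivity(scores):
    """Return (max-min range, standard deviation) of accuracies in percentage points."""
    return max(scores) - min(scores), stdev(scores)

# Hypothetical accuracies from six instruction-format variants:
spread, sd = prompt_sensitivity([71.8, 72.6, 72.1, 71.5, 72.4, 72.0])
print(f"range = {spread:.1f} points, stdev = {sd:.2f} points")
```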
One of the most important findings from the MMLU-Pro paper is the stark difference in how Chain-of-Thought (CoT) reasoning affects performance compared to the original MMLU.
On the original MMLU, CoT prompting provided minimal or even slightly negative benefit. This was because most MMLU questions tested factual recall rather than multi-step reasoning, and the additional reasoning steps sometimes introduced errors without improving accuracy. On MMLU-Pro, however, CoT prompting yields large and consistent improvements across all models tested, confirming that the benchmark genuinely requires reasoning rather than simple knowledge lookup.
| Model | MMLU (CoT) | MMLU (Direct) | MMLU CoT Gain | MMLU-Pro (CoT) | MMLU-Pro (Direct) | MMLU-Pro CoT Gain |
|---|---|---|---|---|---|---|
| GPT-4o | 88.7% | 87.2% | +1.5% | 72.6% | 53.5% | +19.1% |
| GPT-4-Turbo | 86.5% | 86.7% | -0.2% | 63.7% | 48.4% | +15.3% |
| Phi-3-medium | 79.4% | 78.0% | +1.4% | 55.7% | 47.5% | +8.2% |
The CoT gain on MMLU-Pro is dramatically larger than on MMLU. GPT-4o, for example, improves by 19.1 percentage points when using Chain-of-Thought on MMLU-Pro, compared to only 1.5 points on MMLU. GPT-4-Turbo actually performs slightly worse with CoT on the original MMLU (-0.2%), but gains 15.3 points on MMLU-Pro. These results demonstrate that MMLU-Pro questions genuinely benefit from step-by-step reasoning and cannot be answered through pattern matching alone.
The subjects showing the largest CoT improvements include Mathematics (+41.8 percentage points for some models), Chemistry (+39.5 points), and Business (+39.4 points), indicating that these domains contain the most reasoning-intensive questions.
The following results were reported in the original MMLU-Pro paper using 5-shot CoT evaluation. These scores represent model capabilities at the time of the benchmark's release.
| Rank | Model | Type | MMLU-Pro Score (CoT) | Organization |
|---|---|---|---|---|
| 1 | GPT-4o | Closed-source | 72.6% | OpenAI |
| 2 | Gemini 1.5 Pro | Closed-source | 69.0% | Google DeepMind |
| 3 | Claude 3 Opus | Closed-source | 68.5% | Anthropic |
| 4 | GPT-4-Turbo | Closed-source | 63.7% | OpenAI |
| 5 | Gemini 1.5 Flash | Closed-source | 59.1% | Google DeepMind |
| 6 | Yi-large | Closed-source | 57.5% | 01.AI |
| 7 | Claude 3 Sonnet | Closed-source | 56.8% | Anthropic |
| 8 | LLaMA 3 70B Instruct | Open-source | 56.2% | Meta |
| 9 | Phi-3-medium-4k | Open-source | 55.7% | Microsoft |
| 10 | DeepSeek V2 Chat | Open-source | 54.8% | DeepSeek |
| 11 | LLaMA 3 70B | Open-source | 52.8% | Meta |
| 12 | Qwen 1.5 72B Chat | Open-source | 52.6% | Alibaba |
| 13 | Yi-1.5-34B-Chat | Open-source | 52.3% | 01.AI |
| 14 | Phi-3-medium-128k | Open-source | 51.9% | Microsoft |
| 15 | MAmmoTH2-8x7B-Plus | Open-source | 50.4% | TIGER-AI Lab |
| 16 | Mixtral 8x7B Instruct | Open-source | 43.3% | Mistral AI |
At the time of publication, GPT-4o was the clear leader at 72.6%. Open-source models lagged behind closed-source models by a significant margin, with the best open-source model (LLaMA 3 70B Instruct) scoring 16.4 percentage points below GPT-4o.
Since the benchmark's release, newer models have achieved substantially higher scores. The following table reflects reported MMLU-Pro scores from the official leaderboard and third-party evaluation platforms.
| Rank | Model | MMLU-Pro Score | Organization |
|---|---|---|---|
| 1 | Gemini 3 Pro | ~90.1% | Google DeepMind |
| 2 | Claude Opus 4.5 (Thinking) | ~89.5% | Anthropic |
| 3 | Gemini 3 Flash (Thinking) | ~89.0% | Google DeepMind |
| 4 | Claude Opus 4.5 | ~88.9% | Anthropic |
| 5 | Claude Opus 4.1 (Thinking) | ~88.0% | Anthropic |
| 6 | Claude Sonnet 4.5 (Thinking) | ~87.5% | Anthropic |
| 7 | GPT-5.2 Pro | ~87.4% | OpenAI |
| 8 | Claude Opus 4 (Thinking) | ~87.3% | Anthropic |
| 9 | GPT-5 | ~87.1% | OpenAI |
| 10 | Grok 4 | ~86.6% | xAI |
| 11 | Gemini 2.5 Pro | ~86.2% | Google DeepMind |
| 12 | DeepSeek V3.2 (Thinking) | ~86.2% | DeepSeek |
| 13 | o3 | ~85.3% | OpenAI |
| 14 | DeepSeek R1 | ~84.4% | DeepSeek |
| 15 | o1 | ~84.1% | OpenAI |
| 16 | Claude 3.5 Sonnet | 76.1% | Anthropic |
| 17 | DeepSeek V3 | ~73.9% | DeepSeek |
| 18 | LLaMA 3.1 405B Instruct | ~73.2% | Meta |
| 19 | GPT-4o | 72.6% | OpenAI |
| 20 | GPT-4o-mini | ~64.8% | OpenAI |
As of early 2026, 199 models have been evaluated on MMLU-Pro, with an average score of 74.4% across all submissions. Frontier models with extended reasoning capabilities (such as Claude Opus 4.5 Thinking and Gemini 3 Pro) have pushed scores above 89%, approaching the estimated human performance ceiling of approximately 90%.
Models experience significant accuracy reductions when evaluated on MMLU-Pro compared to their MMLU performance. The magnitude of the drop correlates with model capability: stronger models tend to lose fewer percentage points.
| Model Category | Typical MMLU Score | Typical MMLU-Pro Score | Approximate Drop |
|---|---|---|---|
| Frontier Models (2024) | 86-89% | 68-73% | 16-20 points |
| High-Performance Models | 80-85% | 60-70% | 20-25 points |
| Mid-Range Models | 70-80% | 45-60% | 25-30 points |
| Smaller Models | 60-70% | 35-45% | 30-33 points |
The consistent pattern of larger drops for weaker models confirms that MMLU-Pro is more discriminative. On MMLU, a mid-range model might appear to be within striking distance of a frontier model, but MMLU-Pro reveals a much wider gap in actual capability.
The MMLU-Pro paper includes a detailed error analysis of 120 incorrect responses from GPT-4o, which was the top-performing model at the time of publication. The analysis categorizes errors into four types:
| Error Type | Percentage | Description |
|---|---|---|
| Reasoning Errors | 39% | Logical inconsistencies or flawed inference chains, even when the model recalled relevant knowledge correctly |
| Knowledge Gaps | 35% | Missing domain-specific expertise needed to arrive at the correct answer |
| Calculation Mistakes | 12% | Arithmetic or computational errors in problems requiring numerical work |
| Other Errors | 14% | Miscellaneous errors including misreading the question or selecting an answer that contradicts the model's own reasoning |
The dominance of reasoning errors (39%) over knowledge gaps (35%) is noteworthy. It indicates that MMLU-Pro successfully tests reasoning ability rather than merely factual knowledge. Models frequently recalled the correct principles and formulas but failed to apply them through a valid chain of logical steps. For example, GPT-4o sometimes used incorrect values in financial calculations or misapplied refractive index ratios in optics problems, despite demonstrating awareness of the relevant concepts.
| Domain | Top Model Performance | Average Performance | Difficulty Rating |
|---|---|---|---|
| Mathematics | 85% | 65% | Very High |
| Physics | 82% | 62% | Very High |
| Computer Science | 88% | 70% | High |
| Chemistry | 80% | 60% | High |
| Engineering | 79% | 59% | High |
| Biology | 84% | 68% | Medium-High |
| Economics | 81% | 65% | Medium-High |
| Business | 83% | 67% | Medium |
| Psychology | 85% | 70% | Medium |
| History | 87% | 72% | Medium |
| Law | 78% | 58% | High |
| Philosophy | 82% | 66% | Medium-High |
| Health | 86% | 71% | Medium |
| Other | 80% | 65% | Variable |
STEM subjects (Mathematics, Physics, Chemistry, Engineering) consistently rank as the most difficult, which aligns with the heavy supplementation of these domains with questions from SciBench, TheoremQA, and STEM websites. Law also ranks as particularly challenging despite containing only questions from the original MMLU, likely because legal reasoning requires nuanced interpretation of multiple overlapping rules.
Several design choices combine to make MMLU-Pro substantially more difficult than the original MMLU:
With ten answer options instead of four, models face three times as many distractors: nine plausible incorrect alternatives rather than three. Each additional distractor is not random filler but a carefully constructed incorrect answer that a model with partial understanding might select. The probability of guessing correctly drops from 25% to 10%, which means that a model's score above the 10% baseline more directly reflects genuine comprehension.
The filtering process removed questions that most models could answer through simple recall. The remaining questions, combined with new additions from SciBench and TheoremQA, demand multi-step reasoning. A typical MMLU-Pro question in physics or mathematics may require setting up equations, performing algebraic manipulation, applying domain-specific theorems, and then selecting from ten closely spaced numerical or conceptual answers.
By removing trivial and noisy items from the original MMLU and adding questions from vetted academic sources, MMLU-Pro achieves a higher floor of question quality. Each question meaningfully tests competence rather than rewarding surface-level pattern recognition.
The combination of ten options and reduced prompt sensitivity makes MMLU-Pro harder to game. On the original MMLU, certain prompt formulations could swing scores by 4-5 percentage points, creating an incentive to optimize prompts rather than improve model capability. MMLU-Pro's ~2% prompt sensitivity largely eliminates this source of variance.
Evaluations can be run from the official GitHub repository:

```bash
# Clone the repository
git clone https://github.com/TIGER-AI-Lab/MMLU-Pro
cd MMLU-Pro

# Install dependencies
pip install -r requirements.txt

# Download dataset
python download_data.py

# Basic evaluation
python evaluate.py --model "gpt-4" --method "direct"

# With Chain-of-Thought
python evaluate.py --model "gpt-4" --method "cot"

# Specific domains
python evaluate.py --model "gpt-4" --domains "math,physics,cs"

# All 24 prompt styles
python evaluate.py --model "gpt-4" --test-prompts
```
The dataset is available on Hugging Face and can be loaded directly using the datasets library:
```python
from datasets import load_dataset

# Load MMLU-Pro dataset
dataset = load_dataset("TIGER-Lab/MMLU-Pro")

# Access specific splits
test_set = dataset['test']              # 12,032 rows
validation_set = dataset['validation']  # 70 rows

# Filter by domain
math_questions = dataset['test'].filter(lambda x: x['category'] == 'math')
```
Each record in the dataset includes the following fields:
| Field | Type | Description |
|---|---|---|
| question_id | int64 | Unique identifier for the question |
| question | string | The question text |
| options | list of strings | Ten multiple-choice options |
| answer | string | Correct answer letter (A through J) |
| answer_index | int64 | Index of correct answer (0 through 9) |
| cot_content | string | Reference Chain-of-Thought reasoning |
| category | string | Subject category (one of 14 domains) |
| src | string | Source of the question |
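Given this schema, a record can be checked for internal consistency and the domain composition tallied directly from the dataset; the snippet below is a small example against the public Hugging Face release.

```python
from collections import Counter
from datasets import load_dataset

# Load the test split and inspect one record (field names follow the schema above).
ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
ex = ds[0]

letters = "ABCDEFGHIJ"
# `answer` holds the correct letter; `answer_index` holds its 0-based position.
assert ex["answer"] == letters[ex["answer_index"]]

# Per-category question counts (these should match the domain table above).
print(Counter(ds["category"]))
```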
Since its release in June 2024, MMLU-Pro has rapidly become a standard evaluation benchmark in the large language model community.
Major AI labs now routinely report MMLU-Pro scores alongside other benchmarks when releasing new models. OpenAI, Anthropic, Google DeepMind, Meta, Mistral AI, and DeepSeek have all published MMLU-Pro results for their flagship models. Multiple independent benchmark aggregation platforms, including Artificial Analysis, Vals AI, and Kaggle Benchmarks, track MMLU-Pro scores, and as of early 2026, 199 models have been evaluated on the benchmark.
The MMLU-Pro paper has been widely cited in subsequent research on LLM evaluation methodology. It has also inspired derivative benchmarks such as MMLU-ProX, a multilingual extension that adapts the benchmark beyond English.
With frontier reasoning models now scoring above 89% on MMLU-Pro, the benchmark is beginning to face the same saturation challenge that originally motivated its creation. The gap between top models has narrowed, and further improvements may be difficult to measure reliably. This trend suggests that even more challenging benchmarks may be needed in the future to differentiate the next generation of models.
| Challenge | Description | Impact |
|---|---|---|
| Distractor Quality | High-quality incorrect options generated by GPT-4-Turbo | Reduces guessing success and requires precise discrimination |
| Reasoning Depth | Multi-step problem solving required across STEM and non-STEM domains | Challenges surface-level knowledge |
| Domain Expertise | Specialized knowledge needed in 14 distinct domains | Tests both breadth and depth of training data |
| Option Discrimination | Subtle differences between closely related choices | Requires precise understanding rather than approximate matching |
| Cross-Domain Integration | Some questions require combining knowledge from multiple fields | Tests holistic understanding |
| Limitation | Description | Impact |
|---|---|---|
| English Only | All questions are in English | Limits applicability for non-English language models (partially addressed by MMLU-ProX) |
| Multiple Choice Format | All questions use 10-option multiple choice | May not capture free-form reasoning or generation capabilities |
| Static Dataset | Fixed set of 12,032 questions | Creates risk of contamination if questions appear in training data |
| Academic Focus | Questions are drawn from academic exams and textbooks | May not reflect practical, real-world problem-solving ability |
| Approaching Saturation | Frontier models exceed 89% | Diminishing ability to differentiate top-tier models |
| GPT-4 Bias in Distractors | Additional options were generated by GPT-4-Turbo | Models from the same family may have subtle advantages or disadvantages with these distractors |
MMLU-Pro represents a crucial evolution in language model evaluation, addressing the saturation problem of its predecessor while providing more robust and discriminative assessment. Its reduced prompt sensitivity and enhanced focus on reasoning make it particularly valuable for differentiating frontier models, tracking progress in multi-step reasoning, and producing comparisons that are less dependent on prompt engineering choices.
The benchmark's success in revealing performance differences previously hidden by MMLU's ceiling effect has made it one of the most widely used LLM evaluation tools since its introduction. Its acceptance as a Spotlight paper at NeurIPS 2024 further cemented its standing in the research community.