# MMLU-Pro

> Source: https://aiwiki.ai/wiki/mmlu-pro
> Updated: 2026-06-21
> Categories: AI Benchmarks, Large Language Models, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

| MMLU-Pro | |
| --- | --- |
| Overview |
| Full name | Massive Multitask Language Understanding Professional |
| Abbreviation | MMLU-Pro |
| Description | A more robust and challenging multi-task language understanding benchmark with 10-choice questions |
| Release date | 2024-06 |
| Latest version | 1.0 |
| Benchmark updated | 2024-10 |
| Authors | Yubo Wang, Xueguang Ma, Ge Zhang, et al. |
| Organization | TIGER-AI Lab |
| Technical Details |
| Type | Knowledge, Reasoning, Multi-task Understanding |
| Modality | Text |
| Task format | Multiple choice (10 options) |
| Number of tasks | 14 (one per subject domain) |
| Total examples | 12,032 |
| Evaluation metric | [Accuracy](/wiki/accuracy), Chain-of-Thought performance |
| Domains | 14 domains (Biology, [Business](/wiki/business), Chemistry, Computer Science, etc.) |
| Languages | English |
| Performance |
| Human performance | ~90% (estimated) |
| Baseline | Random guess: 10% |
| SOTA score | ~90.1% |
| SOTA model | Gemini 3 Pro |
| SOTA date | 2025 |
| Saturated | Approaching |
| Resources |
| Website | [Official website](https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro) |
| Paper | [Paper](https://arxiv.org/abs/2406.01574) |
| GitHub | [Repository](https://github.com/TIGER-AI-Lab/MMLU-Pro) |
| Dataset | [Download](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro) |
| License | MIT |
| Predecessor | [MMLU](/wiki/mmlu) |

**MMLU-Pro** (Massive Multitask Language Understanding Professional) is an [artificial intelligence](/wiki/artificial_intelligence) benchmark of 12,032 ten-choice questions across 14 academic domains, built to measure how well [large language models](/wiki/large_language_model) handle hard, reasoning-focused problems. Released in June 2024 by the TIGER-AI Lab, it extends the original [MMLU](/wiki/mmlu) benchmark by adding reasoning-heavy questions, expanding answer choices from four to ten options, and removing trivial or noisy items.[1] The redesign cut model accuracy by 16% to 33% relative to MMLU and reduced score sensitivity to prompt wording from 4-5% down to about 2%, making MMLU-Pro a more discriminative and reproducible test.[1] As of June 2024, the strongest model, [GPT-4o](/wiki/gpt_4o), reached 72.6% with Chain-of-Thought prompting, far below its ~88% on MMLU; by 2026 frontier reasoning models such as [Gemini](/wiki/gemini) 3 Pro had pushed scores to roughly 90.1%, approaching the estimated human ceiling.[1][5][6] The benchmark was accepted as a Spotlight paper at the [NeurIPS](/wiki/neurips) 2024 Datasets and [Benchmarks](/wiki/benchmarks) Track.[8]

In the words of the authors, MMLU-Pro "not only raises the challenge, causing a significant drop in accuracy by 16% to 33% compared to MMLU but also demonstrates greater stability under varying prompts," making it "a more discriminative benchmark to better track progress in the field."[1]

## Overview

MMLU-Pro addresses the performance saturation observed in the original MMLU benchmark, where frontier models had converged on scores of 86-87%, making it difficult to distinguish between model capabilities.[1] By increasing both the difficulty and robustness of questions while reducing prompt sensitivity, MMLU-Pro provides a more challenging and stable evaluation environment for modern language models.[1]

The benchmark was developed by Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen.[1] It was published under the title "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" and made available on [arXiv](/wiki/arxiv) on June 3, 2024 (arXiv:2406.01574).[1] The dataset is released under the MIT license and hosted on [Hugging Face](/wiki/hugging_face).[4]

### What problem does MMLU-Pro solve?

The development of MMLU-Pro was driven by several critical observations about the limitations of the original [MMLU](/wiki/mmlu) benchmark:[1]

- Performance saturation on original MMLU, with top models clustering at 86-87% accuracy
- High sensitivity to prompt variations in MMLU (4-5% variance across different prompt styles)
- Limited discrimination between frontier models, with gaps of only 1-2 percentage points
- Insufficient emphasis on reasoning versus pure knowledge recall
- Presence of trivial and noisy questions in the original dataset that inflated scores

The benchmark specifically targets the need for more challenging evaluations that can differentiate between increasingly capable AI systems while providing more stable and reliable measurements. For instance, the gap between [GPT-4o](/wiki/gpt_4o) and GPT-4-Turbo on MMLU was only about 1 percentage point, but MMLU-Pro widened it to approximately 9 points, providing a clearer signal about relative model capabilities.[1]

## Technical Specifications

### Dataset Composition

MMLU-Pro comprises exactly 12,032 rigorously curated questions spanning 14 diverse domains. Unlike the original MMLU, which drew solely from existing exam questions, MMLU-Pro integrates questions from four distinct sources:[1]

| Source | Question Count | Percentage | Description |
| --- | --- | --- | --- |
| Original [MMLU](/wiki/mmlu) | 6,810 | 56.60% | Filtered subset of the original MMLU dataset, with trivial and noisy questions removed |
| STEM Websites | 4,083 | 33.93% | High-quality STEM problems collected from online educational platforms |
| [TheoremQA](/wiki/theoremqa) | 598 | 4.97% | Human-annotated questions requiring theorem application for resolution |
| SciBench | 541 | 4.50% | Advanced science questions derived from college-level exams |
| **Total** | **12,032** | **100%** | |

The inclusion of questions from SciBench and TheoremQA specifically strengthens the STEM coverage of the benchmark. SciBench provides college-level exam questions from physics, chemistry, and mathematics courses, while TheoremQA contributes questions that require applying mathematical and scientific theorems.[1] These additional sources help shift the benchmark's emphasis from factual recall toward multi-step [reasoning](/wiki/reasoning).[1]
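The percentages in the source table follow directly from the question counts. A minimal sketch that recomputes them (the name `SOURCE_COUNTS` is ours and does not come from the paper or the repository):

```python
# Question counts per source, copied from the composition table above.
SOURCE_COUNTS = {
    "Original MMLU": 6810,
    "STEM Websites": 4083,
    "TheoremQA": 598,
    "SciBench": 541,
}

total = sum(SOURCE_COUNTS.values())

# Share of the benchmark contributed by each source, in percent.
shares = {name: 100 * count / total for name, count in SOURCE_COUNTS.items()}

for name, share in shares.items():
    print(f"{name:14s} {SOURCE_COUNTS[name]:>6,d}  {share:5.2f}%")
print(f"{'Total':14s} {total:>6,d}")
```

Just over half of the benchmark is inherited from MMLU; the remainder comes from the three added sources.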

### Question Counts by Subject

The 14 subject domains vary in size. The original MMLU's 57 fine-grained subjects were consolidated into 14 broader categories to reduce redundancy and focus on core knowledge areas.[1]

| Subject | Total Questions | From MMLU | Newly Added | Primary Additional Source |
| --- | --- | --- | --- | --- |
| Mathematics | 1,351 | 846 | 505 | TheoremQA (344), SciBench (161) |
| Physics | 1,299 | 411 | 888 | STEM Websites (617), SciBench (167), TheoremQA (104) |
| Chemistry | 1,132 | 178 | 954 | STEM Websites (741), SciBench (213) |
| [Law](/wiki/law) | 1,101 | 1,101 | 0 | None (entirely from MMLU) |
| Engineering | 969 | 67 | 902 | STEM Websites (902) |
| Other | 924 | 924 | 0 | None (entirely from MMLU) |
| Economics | 844 | 444 | 400 | STEM Websites |
| [Health](/wiki/health) | 818 | 818 | 0 | None (entirely from MMLU) |
| Psychology | 798 | 493 | 305 | STEM Websites |
| [Business](/wiki/business) | 789 | 155 | 634 | STEM Websites |
| Biology | 717 | 219 | 498 | STEM Websites |
| Philosophy | 499 | 499 | 0 | None (entirely from MMLU) |
| Computer Science | 410 | 274 | 136 | STEM Websites |
| History | 381 | 381 | 0 | None (entirely from MMLU) |
| **Total** | **12,032** | **6,810** | **5,222** | |

Notably, five domains (Law, Other, Health, Philosophy, and History) draw all of their questions from the original MMLU, while the STEM domains, along with Business, Economics, and Psychology, were supplemented from external sources. Engineering, for example, draws 93% of its questions from STEM websites rather than from MMLU.[1]
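The per-subject split can be checked with a few lines of Python. The sketch below copies the totals from the subject table and computes, for each subject, the share of questions that did not come from MMLU; the names `SUBJECTS` and `new_share` are illustrative.

```python
# (total questions, questions retained from MMLU) per subject, from the table above.
SUBJECTS = {
    "Mathematics": (1351, 846),
    "Physics": (1299, 411),
    "Chemistry": (1132, 178),
    "Law": (1101, 1101),
    "Engineering": (969, 67),
    "Other": (924, 924),
    "Economics": (844, 444),
    "Health": (818, 818),
    "Psychology": (798, 493),
    "Business": (789, 155),
    "Biology": (717, 219),
    "Philosophy": (499, 499),
    "Computer Science": (410, 274),
    "History": (381, 381),
}

def new_share(total, from_mmlu):
    """Percentage of a subject's questions that were newly added."""
    return 100 * (total - from_mmlu) / total

# Subjects whose questions all come from the original MMLU.
mmlu_only = [name for name, (total, kept) in SUBJECTS.items() if total == kept]

# Subjects ordered by how heavily they were supplemented.
ranked = sorted(SUBJECTS, key=lambda s: new_share(*SUBJECTS[s]), reverse=True)

print("Entirely from MMLU:", ", ".join(mmlu_only))
for name in ranked[:3]:
    print(f"{name}: {new_share(*SUBJECTS[name]):.1f}% newly added")
```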

### How does MMLU-Pro differ from MMLU?

The following table summarizes the principal design differences between MMLU and MMLU-Pro:

| Feature | [MMLU](/wiki/mmlu) | MMLU-Pro | Significance |
| --- | --- | --- | --- |
| Total Questions | 15,908 | 12,032 | Quality over quantity |
| Subject Categories | 57 | 14 | Consolidated for clarity |
| Answer Choices per Question | 4 (A through D) | 10 (A through J) | Reduces random guessing from 25% to 10% |
| Random Guess Baseline | 25% | 10% | 60% reduction in guessing advantage |
| Prompt Sensitivity | 4-5% variance | ~2% variance | More stable and reliable scoring |
| Question Quality | Mixed; includes trivial items | Curated; noise removed | Stronger signal of true capability |
| Reasoning Requirement | Minimal; mostly knowledge recall | Substantial; multi-step reasoning | Better tests analytical ability |
| Chain-of-Thought Benefit | Negligible or negative | +4% to +19% improvement | Confirms reasoning-heavy content |
| Question Sources | Existing exam banks only | MMLU + SciBench + TheoremQA + STEM sites | Broader, more challenging coverage |
| Publication | ICLR 2021 (Hendrycks et al.) | NeurIPS 2024 Datasets Track (Wang et al.) | |

## Dataset Construction Process

The construction of MMLU-Pro followed a multi-stage pipeline involving automated filtering, [GPT-4](/wiki/gpt-4)-based augmentation, and expert human review.[1]

### Stage 1: Filtering the Original MMLU

The authors began with the 13,937 questions in the MMLU test set and ran eight [language models](/wiki/large_language_model) of varying capability on each question. Questions answered correctly by more than four of the eight models were deemed too easy and removed. This filtering step eliminated 5,886 questions (approximately 42% of the original set), leaving 8,051 MMLU questions. After additional noise and error removal through expert review, 6,810 MMLU-origin questions were retained for MMLU-Pro.[1]
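The filtering rule is a count over per-model pass/fail results. The sketch below illustrates the idea on a toy panel of three hypothetical models and then reproduces the bookkeeping for the reported counts. The function name, the data layout, and the toy cutoff are assumptions made for illustration; the authors' actual filtering code may look different, and the cutoff is passed in explicitly so it can be set to match the rule described above.

```python
from typing import Dict, List

def drop_too_easy(correct_by_model: Dict[str, List[bool]], max_correct: int) -> List[str]:
    """Return the IDs of questions answered correctly by at most `max_correct` models."""
    return [qid for qid, results in correct_by_model.items() if sum(results) <= max_correct]

# Toy panel of three hypothetical models; each list holds one pass/fail flag per model.
toy_results = {
    "q1": [True, True, True],     # every model gets it right: treated as too easy
    "q2": [True, False, False],
    "q3": [False, False, False],
}
kept = drop_too_easy(toy_results, max_correct=2)

# Bookkeeping for the counts reported in this section.
mmlu_test = 13_937
removed_easy = 5_886
after_filter = mmlu_test - removed_easy        # questions left after the difficulty filter
retained = 6_810
removed_in_review = after_filter - retained    # questions dropped during later review

print(kept)
print(after_filter, removed_in_review, f"{100 * removed_easy / mmlu_test:.1f}%")
```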

### Stage 2: Integrating STEM Sources

To strengthen coverage in science, technology, engineering, and mathematics, the authors incorporated questions from three additional sources:[1]

- **STEM Websites**: 4,083 problems were collected from online educational platforms covering physics, chemistry, engineering, biology, business, economics, computer science, and psychology.
- **TheoremQA**: 598 human-annotated questions requiring the application of specific mathematical or scientific theorems.
- **SciBench**: 541 college-exam-level questions from physics, chemistry, and mathematics courses.

Questions from TheoremQA and SciBench were originally in free-response format. GPT-4-Turbo was used to extract short correct answers from the solutions and convert them into multiple-choice format.[1]

### Stage 3: Expanding to Ten Options

One of the most distinctive features of MMLU-Pro is its expansion from four answer choices to ten. For questions that originally had only four options, GPT-4-Turbo was employed to generate six additional plausible distractors for each question. These distractors were not random; they were designed to be plausible incorrect answers that require discriminative reasoning to eliminate. This expansion reduces the probability of guessing correctly from 25% (with four choices) to 10% (with ten choices), placing greater demand on genuine understanding.[1]
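The effect on the random-guess baseline is simple arithmetic: with one correct option among k, uniform guessing succeeds with probability 1/k. A small sketch, with an optional simulation as a sanity check (the function names are illustrative):

```python
import random

def guess_baseline(num_options):
    """Expected accuracy of uniform random guessing with one correct option."""
    return 1 / num_options

def simulate_guessing(num_options, num_questions, seed=0):
    """Empirical accuracy of a random guesser; the correct option is index 0 by convention."""
    rng = random.Random(seed)
    hits = sum(rng.randrange(num_options) == 0 for _ in range(num_questions))
    return hits / num_questions

four, ten = guess_baseline(4), guess_baseline(10)
relative_reduction = 1 - ten / four   # the chance level falls by 60% in relative terms

print(f"4 options: {four:.0%}, 10 options: {ten:.0%}, reduction: {relative_reduction:.0%}")
print(f"simulated 10-option guesser: {simulate_guessing(10, 12_032):.1%}")
```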

### Stage 4: Expert Review

The dataset underwent two phases of expert review:[1]

1. **Phase 1 (Correctness Verification)**: A panel of more than ten domain experts reviewed questions and their correct answers for factual accuracy.
2. **Phase 2 (Distractor Validation)**: [Gemini](/wiki/gemini) 1.5 Pro was used to re-evaluate all answer options and flag potential "false negatives" (distractors that might actually be correct). Human experts then reviewed each flagged case to confirm that distractors were genuinely incorrect and sufficiently distinct from the correct answer.

## Evaluation Methodology

### Answer Format

Each question in MMLU-Pro follows a standardized format:

- **Question Stem**: The main question or problem statement
- **10 Answer Options**: Labeled A through J
- **Single Correct Answer**: Exactly one correct option per question
- **Distractors**: Nine plausible but incorrect options
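The format above maps naturally onto a lettered text block. The following sketch renders a question and converts a zero-based answer index into its letter. The layout ("Question:", "Options:", "A. ...") is one plausible choice and not necessarily the exact template used by the official evaluation scripts, and the example record is made up.

```python
import string

# Option labels A through J, one per answer choice.
OPTION_LABELS = string.ascii_uppercase[:10]

def format_question(question, options):
    """Render a question stem and its options as a lettered block of text."""
    lines = [f"Question: {question}", "Options:"]
    lines += [f"{label}. {text}" for label, text in zip(OPTION_LABELS, options)]
    return "\n".join(lines)

def answer_letter(answer_index):
    """Map a zero-based answer index (0 through 9) to its letter (A through J)."""
    return OPTION_LABELS[answer_index]

# A made-up record for illustration; it is not taken from the dataset.
example = {
    "question": "What is 7 * 8?",
    "options": ["54", "56", "58", "63", "64", "48", "49", "72", "65", "57"],
    "answer_index": 1,
}

print(format_question(example["question"], example["options"]))
print("Correct:", answer_letter(example["answer_index"]))
```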

### Scoring System

| Metric | Description | Calculation |
| --- | --- | --- |
| Overall Accuracy | Percentage of correct answers across all questions | (Correct / Total) x 100% |
| Domain Accuracy | Performance within each of the 14 subject areas | (Correct in domain / Total in domain) x 100% |
| CoT Accuracy | Accuracy when the model uses [Chain-of-Thought](/wiki/chain_of_thought) prompting | (CoT correct / Total) x 100% |
| Direct Answer Accuracy | Accuracy with direct answer extraction (no reasoning) | (Direct correct / Total) x 100% |
| CoT Gain | Improvement attributable to Chain-of-Thought reasoning | CoT accuracy minus Direct accuracy |

The standard evaluation protocol uses 5-shot [Chain-of-Thought](/wiki/chain_of_thought) prompting, where the model is given five example questions with worked-out reasoning before being asked to solve test questions.[1]
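Chain-of-Thought responses are free text, so a scorer first has to pull the chosen letter out of each response before computing the metrics in the table. The sketch below assumes responses end with a phrase such as "the answer is (C)"; that convention, the regular expression, and the record layout are assumptions for illustration and may differ from the official scripts.

```python
import re
from collections import defaultdict

# Assumed output convention: the response states "the answer is (X)" with X in A-J.
ANSWER_PATTERN = re.compile(r"answer is \(?([A-J])\)?")

def extract_answer(response):
    """Return the predicted letter, or None if no answer can be parsed."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1) if match else None

def score(records):
    """Compute overall and per-domain accuracy (in percent).

    `records` is an iterable of dicts with 'category', 'answer', and 'response' keys.
    Unparseable responses count as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["category"]] += 1
        if extract_answer(rec["response"]) == rec["answer"]:
            correct[rec["category"]] += 1
    overall = 100 * sum(correct.values()) / sum(total.values())
    per_domain = {cat: 100 * correct[cat] / total[cat] for cat in total}
    return overall, per_domain

# Made-up responses for illustration.
toy = [
    {"category": "math", "answer": "C", "response": "2 + 2 = 4, so the answer is (C)."},
    {"category": "math", "answer": "A", "response": "I think the answer is (B)."},
    {"category": "law", "answer": "J", "response": "The answer is J"},
    {"category": "physics", "answer": "D", "response": "I am not sure."},
]
overall, per_domain = score(toy)
print(overall, per_domain)
```

CoT gain is then the difference between the overall accuracy of a CoT run and that of a direct-answer run scored the same way.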

### Prompt Robustness

A key advantage of MMLU-Pro over its predecessor is reduced sensitivity to prompt wording. The authors tested 24 different prompt styles across four categories:[1]

| Prompt Category | Variations Tested | Score Variance |
| --- | --- | --- |
| Instruction Format | 6 styles | Less than 1% |
| Few-shot Examples | 0 to 5 examples | ~1.5% |
| Output Format | 6 formats | Less than 1% |
| Task Framing | 6 approaches | Less than 1% |

On the original MMLU, the same prompt variations produced 4-5% swings in model scores. MMLU-Pro's reduced sensitivity (approximately 2% total variance) means that benchmark results are more reproducible and less dependent on prompt engineering choices.[1]
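One simple way to quantify prompt sensitivity for a single model is to evaluate it under several prompt styles and report the spread of the resulting scores. The sketch below uses the range (maximum minus minimum) and the standard deviation. These are plausible measures and not necessarily the exact statistic used in the paper, and the scores are hypothetical.

```python
from statistics import mean, pstdev

def prompt_spread(scores):
    """Summarize how much accuracy (in percent) varies across prompt styles."""
    return {
        "mean": mean(scores),
        "range": max(scores) - min(scores),
        "std": pstdev(scores),
    }

# Hypothetical accuracies (%) for one model under four prompt styles.
example_scores = [61.2, 60.4, 62.1, 60.9]
spread = prompt_spread(example_scores)

print(f"mean {spread['mean']:.2f}, range {spread['range']:.1f}, std {spread['std']:.2f}")
```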

## Chain-of-Thought vs. Direct Answering

One of the most important findings from the MMLU-Pro paper is the stark difference in how [Chain-of-Thought](/wiki/chain_of_thought) (CoT) reasoning affects performance compared to the original MMLU.[1]

On the original MMLU, CoT prompting provided minimal or even slightly negative benefit. This was because most MMLU questions tested factual recall rather than multi-step reasoning, and the additional reasoning steps sometimes introduced errors without improving accuracy. On MMLU-Pro, however, CoT prompting yields large and consistent improvements across all models tested, confirming that the benchmark genuinely requires reasoning rather than simple knowledge lookup.[1]

### CoT vs. Direct Answer Comparison

| Model | MMLU (CoT) | MMLU (Direct) | MMLU CoT Gain | MMLU-Pro (CoT) | MMLU-Pro (Direct) | MMLU-Pro CoT Gain |
| --- | --- | --- | --- | --- | --- | --- |
| [GPT-4o](/wiki/gpt_4o) | 88.7% | 87.2% | +1.5% | 72.6% | 53.5% | +19.1% |
| GPT-4-Turbo | 86.5% | 86.7% | -0.2% | 63.7% | 48.4% | +15.3% |
| Phi-3-medium | 79.4% | 78.0% | +1.4% | 55.7% | 47.5% | +8.2% |

The CoT gain on MMLU-Pro is dramatically larger than on MMLU. GPT-4o, for example, improves by 19.1 percentage points when using Chain-of-Thought on MMLU-Pro, compared to only 1.5 points on MMLU. GPT-4-Turbo actually performs slightly worse with CoT on the original MMLU (-0.2 points), but gains 15.3 points on MMLU-Pro. These results demonstrate that MMLU-Pro questions genuinely benefit from step-by-step reasoning and cannot be answered through pattern matching alone.[1]
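The gains in the comparison table are plain differences between CoT and direct accuracy. A short sketch that recomputes them from the table's values (the names `RESULTS` and `cot_gain` are illustrative):

```python
# (CoT accuracy, direct accuracy) in percent, copied from the comparison table above.
RESULTS = {
    "GPT-4o":       {"MMLU": (88.7, 87.2), "MMLU-Pro": (72.6, 53.5)},
    "GPT-4-Turbo":  {"MMLU": (86.5, 86.7), "MMLU-Pro": (63.7, 48.4)},
    "Phi-3-medium": {"MMLU": (79.4, 78.0), "MMLU-Pro": (55.7, 47.5)},
}

def cot_gain(cot, direct):
    """CoT gain in percentage points, rounded to one decimal place."""
    return round(cot - direct, 1)

gains = {
    model: {bench: cot_gain(*pair) for bench, pair in benches.items()}
    for model, benches in RESULTS.items()
}

for model, g in gains.items():
    print(f"{model:13s} MMLU {g['MMLU']:+5.1f}   MMLU-Pro {g['MMLU-Pro']:+5.1f}")
```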

The subjects showing the largest CoT improvements include Mathematics (+41.8 percentage points for some models), Chemistry (+39.5 points), and Business (+39.4 points), indicating that these domains contain the most reasoning-intensive questions.[1]

## Performance Analysis

### Original Paper Results (June 2024)

The following results were reported in the original MMLU-Pro paper using 5-shot CoT evaluation. These scores represent model capabilities at the time of the benchmark's release.[1]

| Rank | Model | Type | MMLU-Pro Score (CoT) | Organization |
| --- | --- | --- | --- | --- |
| 1 | [GPT-4o](/wiki/gpt_4o) | Closed-source | 72.6% | [OpenAI](/wiki/openai) |
| 2 | [Gemini](/wiki/gemini) 1.5 Pro | Closed-source | 69.0% | [Google](/wiki/google) |
| 3 | [Claude](/wiki/claude) 3 Opus | Closed-source | 68.5% | [Anthropic](/wiki/anthropic) |
| 4 | GPT-4-Turbo | Closed-source | 63.7% | [OpenAI](/wiki/openai) |
| 5 | Gemini 1.5 Flash | Closed-source | 59.1% | [Google](/wiki/google) |
| 6 | Yi-large | Closed-source | 57.5% | 01.AI |
| 7 | Claude 3 Sonnet | Closed-source | 56.8% | [Anthropic](/wiki/anthropic) |
| 8 | [LLaMA](/wiki/llama) 3 70B Instruct | Open-source | 56.2% | [Meta](/wiki/meta) |
| 9 | Phi-3-medium-4k | Open-source | 55.7% | [Microsoft](/wiki/microsoft) |
| 10 | [DeepSeek](/wiki/deepseek) V2 Chat | Open-source | 54.8% | DeepSeek |
| 11 | [LLaMA](/wiki/llama) 3 70B | Open-source | 52.8% | [Meta](/wiki/meta) |
| 12 | [Qwen](/wiki/qwen) 1.5 72B Chat | Open-source | 52.6% | [Alibaba](/wiki/alibaba_cloud) |
| 13 | Yi-1.5-34B-Chat | Open-source | 52.3% | 01.AI |
| 14 | Phi-3-medium-128k | Open-source | 51.9% | [Microsoft](/wiki/microsoft) |
| 15 | MAmmoTH2-8x7B-Plus | Open-source | 50.4% | TIGER-AI Lab |
| 16 | [Mixtral](/wiki/mixtral) 8x7B Instruct | Open-source | 43.3% | [Mistral AI](/wiki/mistral_ai) |

At the time of publication, GPT-4o was the clear leader at 72.6%. Open-source models lagged behind closed-source models by a significant margin, with the best open-source model ([LLaMA 3](/wiki/llama_3) 70B Instruct) scoring 16.4 percentage points below GPT-4o.[1]

### Updated Leaderboard (as of early 2026)

Since the benchmark's release, newer models have achieved substantially higher scores. The following table reflects reported MMLU-Pro scores from the official leaderboard and third-party evaluation platforms.[5][6]

| Rank | Model | MMLU-Pro Score | Organization |
| --- | --- | --- | --- |
| 1 | [Gemini](/wiki/gemini) 3 Pro | ~90.1% | [Google](/wiki/google) |
| 2 | [Claude](/wiki/claude) Opus 4.5 (Thinking) | ~89.5% | [Anthropic](/wiki/anthropic) |
| 3 | Gemini 3 Flash (Thinking) | ~89.0% | [Google](/wiki/google) |
| 4 | Claude Opus 4.5 | ~88.9% | [Anthropic](/wiki/anthropic) |
| 5 | Claude Opus 4.1 (Thinking) | ~88.0% | [Anthropic](/wiki/anthropic) |
| 6 | Claude Sonnet 4.5 (Thinking) | ~87.5% | [Anthropic](/wiki/anthropic) |
| 7 | GPT-5.2 Pro | ~87.4% | [OpenAI](/wiki/openai) |
| 8 | Claude Opus 4 (Thinking) | ~87.3% | [Anthropic](/wiki/anthropic) |
| 9 | GPT-5 | ~87.1% | [OpenAI](/wiki/openai) |
| 10 | [Grok](/wiki/grok) 4 | ~86.6% | [xAI](/wiki/xai) |
| 11 | Gemini 2.5 Pro | ~86.2% | [Google](/wiki/google) |
| 12 | [DeepSeek](/wiki/deepseek) V3.2 (Thinking) | ~86.2% | DeepSeek |
| 13 | [o3](/wiki/openai_o-series) | ~85.3% | [OpenAI](/wiki/openai) |
| 14 | [DeepSeek](/wiki/deepseek) R1 | ~84.4% | DeepSeek |
| 15 | [o1](/wiki/openai_o-series) | ~84.1% | [OpenAI](/wiki/openai) |
| 16 | [Claude](/wiki/claude) 3.5 Sonnet | 76.1% | [Anthropic](/wiki/anthropic) |
| 17 | [DeepSeek](/wiki/deepseek) V3 | ~73.9% | DeepSeek |
| 18 | [LLaMA](/wiki/llama) 3.1 405B Instruct | ~73.2% | [Meta](/wiki/meta) |
| 19 | [GPT-4o](/wiki/gpt_4o) | 72.6% | [OpenAI](/wiki/openai) |
| 20 | GPT-4o-mini | ~64.8% | [OpenAI](/wiki/openai) |

As of early 2026, 199 models have been evaluated on MMLU-Pro, with an average score of 74.4% across all submissions.[6] [Frontier models](/wiki/frontier_models) with extended reasoning capabilities (such as Claude Opus 4.5 Thinking and Gemini 3 Pro) have pushed scores above 89%, approaching the estimated human performance ceiling of approximately 90%. Independent trackers in 2026 reported the top models clustered within roughly one percentage point of each other, a clear sign that the benchmark is nearing saturation for the strongest systems.[5][6]

### Accuracy Drop from MMLU to MMLU-Pro

Models experience significant accuracy reductions when evaluated on MMLU-Pro compared to their MMLU performance. The paper reports an overall accuracy drop of 16% to 33% relative to MMLU, and the magnitude of the drop correlates with model capability: stronger models tend to lose fewer percentage points.[1]

| Model Category | Typical MMLU Score | Typical MMLU-Pro Score | Approximate Drop |
| --- | --- | --- | --- |
| Frontier Models (2024) | 86-87% | 70-73% | 16-20 points |
| High-Performance Models | 80-85% | 60-70% | 20-25 points |
| Mid-Range Models | 70-80% | 45-60% | 25-30 points |
| Smaller Models | 60-70% | 35-45% | 30-33 points |

The consistent pattern of larger drops for weaker models confirms that MMLU-Pro is more discriminative. On MMLU, a mid-range model might appear to be within striking distance of a frontier model, but MMLU-Pro reveals a much wider gap in actual capability.[1]
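The pattern can be illustrated with the three models for which this article lists CoT scores on both benchmarks. The sketch below computes the absolute drop (in percentage points) and the relative drop (as a fraction of the MMLU score); the helper name `accuracy_drop` is ours.

```python
# CoT accuracy (%) on MMLU and MMLU-Pro, taken from the CoT comparison table
# earlier in this article.
SCORES = {
    "GPT-4o": (88.7, 72.6),
    "GPT-4-Turbo": (86.5, 63.7),
    "Phi-3-medium": (79.4, 55.7),
}

def accuracy_drop(mmlu, mmlu_pro):
    """Return (absolute drop in points, relative drop in percent of the MMLU score)."""
    absolute = mmlu - mmlu_pro
    relative = 100 * absolute / mmlu
    return round(absolute, 1), round(relative, 1)

drops = {model: accuracy_drop(*pair) for model, pair in SCORES.items()}

for model, (absolute, relative) in drops.items():
    print(f"{model:13s} -{absolute:.1f} points ({relative:.1f}% of its MMLU score)")
```

On these three models the drop grows as the MMLU score falls, which matches the trend described above.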

## Error Analysis

The MMLU-Pro paper includes a detailed error analysis of 120 incorrect responses from GPT-4o, which was the top-performing model at the time of publication. The analysis categorizes errors into four types:[1]

| Error Type | Percentage | Description |
| --- | --- | --- |
| Reasoning Errors | 39% | Logical inconsistencies or flawed inference chains, even when the model recalled relevant knowledge correctly |
| Knowledge Gaps | 35% | Missing domain-specific expertise needed to arrive at the correct answer |
| Calculation Mistakes | 12% | Arithmetic or computational errors in problems requiring numerical work |
| Other Errors | 14% | Miscellaneous errors including misreading the question or selecting an answer that contradicts the model's own reasoning |

The dominance of reasoning errors (39%) over knowledge gaps (35%) is noteworthy. It indicates that MMLU-Pro successfully tests reasoning ability rather than merely factual knowledge. Models frequently recalled the correct principles and formulas but failed to apply them through a valid chain of logical steps. For example, GPT-4o sometimes used incorrect values in financial calculations or misapplied refractive index ratios in optics problems, despite demonstrating awareness of the relevant concepts.[1]

## Domain-Specific Performance

### Subject Area Analysis

| Domain | Top Model Performance | Average Performance | Difficulty Rating |
| --- | --- | --- | --- |
| Mathematics | 85% | 65% | Very High |
| Physics | 82% | 62% | Very High |
| Computer Science | 88% | 70% | High |
| Chemistry | 80% | 60% | High |
| Engineering | 79% | 59% | High |
| Biology | 84% | 68% | Medium-High |
| Economics | 81% | 65% | Medium-High |
| Business | 83% | 67% | Medium |
| Psychology | 85% | 70% | Medium |
| History | 87% | 72% | Medium |
| Law | 78% | 58% | High |
| Philosophy | 82% | 66% | Medium-High |
| Health | 86% | 71% | Medium |
| Other | 80% | 65% | Variable |

STEM subjects (Mathematics, Physics, Chemistry, Engineering) consistently rank as the most difficult, which aligns with the heavy supplementation of these domains with questions from SciBench, TheoremQA, and STEM websites. Law also ranks as particularly challenging despite containing only questions from the original MMLU, likely because legal reasoning requires nuanced interpretation of multiple overlapping rules.[1]

## Why is MMLU-Pro harder than MMLU?

Several design choices combine to make MMLU-Pro substantially more difficult than the original MMLU:

### More Distractors

With ten answer options instead of four, models face 2.5 times as many options and three times as many distractors (nine instead of three). Each additional distractor is not random filler but a carefully constructed incorrect answer that a model with partial understanding might select. The probability of guessing correctly drops from 25% to 10%, which means that a model's score above the 10% baseline more directly reflects genuine comprehension.[1]

### Reasoning-Intensive Questions

The filtering process removed questions that most models could answer through simple recall. The remaining questions, combined with new additions from SciBench and TheoremQA, demand multi-step reasoning. A typical MMLU-Pro question in physics or mathematics may require setting up equations, performing algebraic manipulation, applying domain-specific theorems, and then selecting from ten closely spaced numerical or conceptual answers.[1]

### Higher-Quality Questions

By removing trivial and noisy items from the original MMLU and adding questions from vetted academic sources, MMLU-Pro achieves a higher floor of question quality. Each question meaningfully tests competence rather than rewarding surface-level pattern recognition.[1]

### Reduced Exploitability

The combination of ten options and reduced prompt sensitivity makes MMLU-Pro harder to game. On the original MMLU, certain prompt formulations could swing scores by 4-5 percentage points, creating an incentive to optimize prompts rather than improve model capability. MMLU-Pro's ~2% prompt sensitivity largely eliminates this source of variance.[1]

## Implementation

### Installation and Setup

```bash
# Clone the repository
git clone https://github.com/TIGER-AI-Lab/MMLU-Pro
cd MMLU-Pro

# Install dependencies
pip install -r requirements.txt

# Download dataset
python download_data.py
```

### Running Evaluations

```bash
# Illustrative invocations; script names and flags vary between versions,
# so check the repository README for the current entry points.

# Basic evaluation
python evaluate.py --model "gpt-4" --method "direct"

# With Chain-of-Thought
python evaluate.py --model "gpt-4" --method "cot"

# Specific domains
python evaluate.py --model "gpt-4" --domains "math,physics,cs"

# All 24 prompt styles
python evaluate.py --model "gpt-4" --test-prompts
```

### Dataset Access

The dataset is available on [Hugging Face](/wiki/hugging_face) and can be loaded directly using the `datasets` library:[4]

```python
from datasets import load_dataset

# Load MMLU-Pro dataset
dataset = load_dataset("TIGER-Lab/MMLU-Pro")

# Access specific split
test_set = dataset['test']       # 12,032 rows
validation_set = dataset['validation']  # 70 rows

# Filter by domain
math_questions = dataset['test'].filter(lambda x: x['category'] == 'math')
```

Each record in the dataset includes the following fields:[4]

| Field | Type | Description |
| --- | --- | --- |
| question_id | int64 | Unique identifier for the question |
| question | string | The question text |
| options | list of strings | Ten multiple-choice options |
| answer | string | Correct answer letter (A through J) |
| answer_index | int64 | Index of correct answer (0 through 9) |
| cot_content | string | Reference Chain-of-Thought reasoning |
| category | string | Subject category (one of 14 domains) |
| src | string | Source of the question |
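Given records with the fields listed above, a few lines are enough to tally questions per category and source and to check that each record's `answer` letter agrees with its `answer_index`. The sketch works on any iterable of dicts, so it can be pointed at the loaded test split; the toy records and their `src` values here are placeholders, not real dataset entries.

```python
from collections import Counter

LETTERS = "ABCDEFGHIJ"

def summarize(records):
    """Count records per subject category and per question source."""
    records = list(records)
    by_category = Counter(rec["category"] for rec in records)
    by_source = Counter(rec["src"] for rec in records)
    return by_category, by_source

def check_record(rec):
    """True if the answer letter and the answer index point at the same option."""
    index = rec["answer_index"]
    return (
        0 <= index < len(rec["options"]) <= len(LETTERS)
        and LETTERS[index] == rec["answer"]
    )

# Placeholder records for illustration only.
toy_records = [
    {"category": "math", "src": "source-a", "options": ["1", "2", "3"], "answer": "B", "answer_index": 1},
    {"category": "math", "src": "source-b", "options": ["x", "y"], "answer": "A", "answer_index": 0},
    {"category": "law", "src": "source-a", "options": ["p", "q"], "answer": "B", "answer_index": 0},
]

by_category, by_source = summarize(toy_records)
inconsistent = [i for i, rec in enumerate(toy_records) if not check_record(rec)]
print(by_category, by_source, inconsistent)
```

To run it on the real data, pass the split object (for example `dataset['test']` from the loading example) in place of `toy_records`.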

## Adoption and Impact

Since its release in June 2024, MMLU-Pro has rapidly become a standard evaluation benchmark in the [large language model](/wiki/large_language_model) community.

### Who uses MMLU-Pro?

Major AI labs now routinely report MMLU-Pro scores alongside other benchmarks when releasing new models. [OpenAI](/wiki/openai), [Anthropic](/wiki/anthropic), [Google](/wiki/google) DeepMind, [Meta](/wiki/meta), [Mistral AI](/wiki/mistral_ai), and [DeepSeek](/wiki/deepseek) have all published MMLU-Pro results for their flagship models. Multiple independent benchmark aggregation platforms, including [Artificial Analysis](/wiki/artificial_analysis), Vals AI, and Kaggle Benchmarks, track MMLU-Pro scores, and as of early 2026, 199 models have been evaluated on the benchmark.[6][7]

### Research Influence

The MMLU-Pro paper has been widely cited in subsequent research on LLM evaluation methodology. It has also inspired derivative benchmarks:

- **MMLU-ProX** (published at EMNLP 2025): Translates MMLU-Pro into 29 typologically diverse languages, enabling cross-lingual evaluation of multilingual models.
- **Mobile-MMLU**: Adapts the multitask evaluation format to mobile and edge computing scenarios, expanding topical coverage to 80 mobile-relevant domains.

### Approaching Saturation

With frontier reasoning models now scoring above 89% on MMLU-Pro, the benchmark is beginning to face the same saturation challenge that originally motivated its creation.[6] The gap between top models has narrowed, and further improvements may be difficult to measure reliably. This trend suggests that even more challenging benchmarks may be needed in the future to differentiate the next generation of models.

## Challenges and Insights

### Key Challenges for Models

| Challenge | Description | Impact |
| --- | --- | --- |
| Distractor Quality | High-quality incorrect options generated by GPT-4-Turbo | Reduces guessing success and requires precise discrimination |
| Reasoning Depth | Multi-step problem solving required across STEM and non-STEM domains | Challenges surface-level knowledge |
| Domain Expertise | Specialized knowledge needed in 14 distinct domains | Tests both breadth and depth of training data |
| Option Discrimination | Subtle differences between closely related choices | Requires precise understanding rather than approximate matching |
| Cross-Domain Integration | Some questions require combining knowledge from multiple fields | Tests holistic understanding |

### Common Failure Modes

1. **Surface Pattern Matching**: Selecting answers based on superficial similarities rather than genuine understanding
2. **Incomplete Reasoning**: Beginning a valid reasoning chain but stopping before reaching the correct conclusion
3. **Domain Confusion**: Applying knowledge or methods from the wrong field to a given problem
4. **Distractor Attraction**: Being misled by plausible but incorrect options, especially those that represent common misconceptions
5. **Calculation Errors**: Performing arithmetic or algebraic steps incorrectly even when the overall approach is sound

## Limitations and Considerations

### Current Limitations

| Limitation | Description | Impact |
| --- | --- | --- |
| English Only | All questions are in English | Limits applicability for non-English language models (partially addressed by MMLU-ProX) |
| Multiple Choice Format | All questions use 10-option multiple choice | May not capture free-form reasoning or generation capabilities |
| Static Dataset | Fixed set of 12,032 questions | Creates risk of contamination if questions appear in training data |
| Academic Focus | Questions are drawn from academic exams and textbooks | May not reflect practical, real-world problem-solving ability |
| Approaching Saturation | Frontier models exceed 89% | Diminishing ability to differentiate top-tier models |
| GPT-4 Bias in Distractors | Additional options were generated by GPT-4-Turbo | Models from the same family may have subtle advantages or disadvantages with these distractors |

### Future Directions

1. **Multilingual Extension**: MMLU-ProX covers 29 languages; further expansion is ongoing
2. **Dynamic Generation**: Procedurally generated questions to prevent training data contamination
3. **Free-Form Responses**: Open-ended answer formats to test generation rather than selection
4. **Multimodal Integration**: Adding visual and audio components to test cross-modal reasoning
5. **Adaptive Testing**: Adjusting question difficulty based on model performance during evaluation
6. **Harder Benchmarks**: Projects like [GPQA](/wiki/gpqa) and HLE ([Humanity's Last Exam](/wiki/humanity_s_last_exam)) target even higher difficulty ceilings

## Related Benchmarks

- **[MMLU](/wiki/mmlu)**: Original Massive Multitask Language Understanding (Hendrycks et al., 2021)
- **[GPQA](/wiki/gpqa)**: Graduate-level science questions requiring PhD-level expertise
- **ARC**: AI2 Reasoning Challenge for grade-school science questions
- **[HellaSwag](/wiki/hellaswag)**: Commonsense reasoning about everyday situations
- **[BigBench](/wiki/bigbench)**: Diverse capability evaluation across 200+ tasks
- **AGIEval**: Evaluation using human-level standardized exams
- **[MATH](/wiki/math)**: Competition-level mathematics problem solving
- **[TheoremQA](/wiki/theoremqa)**: Theorem-application questions (also a source for MMLU-Pro)
- **[GSM8K](/wiki/gsm8k)**: Grade-school math word problems

## Significance

MMLU-Pro addresses the saturation of its predecessor while providing a more robust and discriminative assessment of language models. Its reduced prompt sensitivity and stronger focus on reasoning make it particularly valuable for:

- Distinguishing between frontier model capabilities at a finer granularity than MMLU allows
- Tracking genuine progress in AI reasoning development over time
- Identifying whether models rely on reasoning versus memorization
- Providing stable, reproducible performance measurements across different evaluation setups
- Guiding model development priorities by highlighting specific domains and error types

The benchmark's success in revealing performance differences previously hidden by MMLU's ceiling effect has made it one of the most widely used LLM evaluation tools since its introduction. Its acceptance as a Spotlight paper at NeurIPS 2024 further cemented its standing in the research community.[8]

## See Also

- [Language Model Evaluation](/wiki/language_model)
- [Chain-of-Thought Prompting](/wiki/chain_of_thought)
- [Multi-task Learning](/wiki/multi_task_learning)
- [Knowledge Benchmarks](/wiki/knowledge_benchmarks)
- [Prompt Engineering](/wiki/prompt_engineering)

## References

1. Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., Li, T., Ku, M., Wang, K., Zhuang, A., Fan, R., Yue, X., & Chen, W. (2024). "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark." *Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Datasets and Benchmarks Track*. [arXiv:2406.01574](https://arxiv.org/abs/2406.01574)
2. Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). "Measuring Massive Multitask Language Understanding." *Proceedings of the International Conference on Learning Representations (ICLR 2021)*. [arXiv:2009.03300](https://arxiv.org/abs/2009.03300)
3. TIGER-AI Lab. "MMLU-Pro GitHub Repository." [https://github.com/TIGER-AI-Lab/MMLU-Pro](https://github.com/TIGER-AI-Lab/MMLU-Pro)
4. TIGER-AI Lab. "MMLU-Pro Dataset on Hugging Face." [https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro](https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro)
5. TIGER-AI Lab. "MMLU-Pro Leaderboard." [https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro](https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro)
6. Artificial Analysis. "MMLU-Pro Benchmark Leaderboard." [https://artificialanalysis.ai/evaluations/mmlu-pro](https://artificialanalysis.ai/evaluations/mmlu-pro)
7. Vals AI. "MMLU Pro Benchmark." [https://www.vals.ai/benchmarks/mmlu_pro](https://www.vals.ai/benchmarks/mmlu_pro)
8. NeurIPS 2024. "Poster: MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark." [https://neurips.cc/virtual/2024/poster/97435](https://neurips.cc/virtual/2024/poster/97435)