| MMLU-Pro | |
|---|---|
| Overview | |
| Full name | Massive Multitask Language Understanding Professional |
| Abbreviation | MMLU-Pro |
| Description | A more robust and challenging multi-task language understanding benchmark with 10-choice questions |
| Release date | 2024-06 |
| Latest version | 1.0 |
| Benchmark updated | 2024-10 |
| Authors | Yubo Wang, Xueguang Ma, Ge Zhang, et al. |
| Organization | TIGER-AI Lab |
| Technical Details | |
| Type | Knowledge, Reasoning, Multi-task Understanding |
| Modality | Text |
| Task format | Multiple choice (10 options) |
| Number of tasks | 12,032 |
| Total examples | 12,032 |
| Evaluation metric | Accuracy, Chain-of-Thought performance |
| Domains | 14 domains (Biology, Business, Chemistry, Computer Science, etc.) |
| Languages | English |
| Performance | |
| Human performance | ~90% (estimated) |
| Baseline | Random guess: 10% |
| SOTA score | ~90.1% |
| SOTA model | Gemini 3 Pro |
| SOTA date | 2025 |
| Saturated | Approaching |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | MIT |
| Predecessor | MMLU |
**MMLU-Pro** (Massive Multitask Language Understanding Professional) is an advanced artificial intelligence benchmark designed to evaluate large language models' capabilities across challenging multi-task language understanding problems. Released in June 2024 by the TIGER-AI Lab, MMLU-Pro extends the original MMLU benchmark by incorporating more complex reasoning-focused questions, expanding answer choices from four to ten options, and eliminating trivial or noisy questions, resulting in a more robust and discriminative evaluation framework. The benchmark was accepted as a Spotlight paper at the NeurIPS 2024 Datasets and Benchmarks Track.
MMLU-Pro addresses the performance saturation observed in the original MMLU benchmark, where frontier models had converged to scores between 86-87%, making it difficult to distinguish between model capabilities. By increasing both the difficulty and robustness of questions while reducing prompt sensitivity, MMLU-Pro provides a more challenging and stable evaluation environment for modern language models.
The benchmark was developed by Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. It was published under the title "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" and made available on arXiv on June 3, 2024 (arXiv:2406.01574). The dataset is released under the MIT license and hosted on Hugging Face.
The development of MMLU-Pro was driven by several critical observations about the limitations of the original MMLU benchmark: performance saturation among frontier models, high sensitivity to prompt wording, and a preponderance of questions that reward factual recall over multi-step reasoning.
The benchmark specifically targets the need for more challenging evaluations that can differentiate between increasingly capable AI systems while providing more stable and reliable measurements. For instance, the gap between GPT-4o and GPT-4-Turbo on MMLU was only about 1%, but MMLU-Pro widened this gap to approximately 9%, providing clearer signal about relative model capabilities.
MMLU-Pro comprises exactly 12,032 rigorously curated questions spanning 14 diverse domains. Unlike the original MMLU, which drew solely from existing exam questions, MMLU-Pro integrates questions from four distinct sources:
| Source | Question Count | Percentage | Description |
|---|---|---|---|
| Original MMLU | 6,810 | 56.60% | Filtered subset of the original MMLU dataset, with trivial and noisy questions removed |
| STEM Websites | 4,083 | 33.93% | High-quality STEM problems collected from online educational platforms |
| TheoremQA | 598 | 4.97% | Human-annotated questions requiring theorem application for resolution |
| SciBench | 541 | 4.50% | Advanced science questions derived from college-level exams |
| Total | 12,032 | 100% | |
The inclusion of questions from SciBench and TheoremQA specifically strengthens the STEM coverage of the benchmark. SciBench provides college-level exam questions from physics, chemistry, and mathematics courses, while TheoremQA contributes questions that require applying mathematical and scientific theorems. These additional sources help shift the benchmark's emphasis from factual recall toward multi-step reasoning.
The 14 subject domains vary in size. The original MMLU's 57 fine-grained subjects were consolidated into 14 broader categories to reduce redundancy and focus on core knowledge areas.
| Subject | Total Questions | From MMLU | Newly Added | Primary Additional Source |
|---|---|---|---|---|
| Mathematics | 1,351 | 846 | 505 | TheoremQA (344), SciBench (161) |
| Physics | 1,299 | 411 | 888 | STEM Websites (617), SciBench (167), TheoremQA (104) |
| Chemistry | 1,132 | 178 | 954 | STEM Websites (741), SciBench (213) |
| Law | 1,101 | 1,101 | 0 | None (entirely from MMLU) |
| Engineering | 969 | 67 | 902 | STEM Websites (902) |
| Other | 924 | 924 | 0 | None (entirely from MMLU) |
| Economics | 844 | 444 | 400 | STEM Websites |
| Health | 818 | 818 | 0 | None (entirely from MMLU) |
| Psychology | 798 | 493 | 305 | STEM Websites |
| Business | 789 | 155 | 634 | STEM Websites |
| Biology | 717 | 219 | 498 | STEM Websites |
| Philosophy | 499 | 499 | 0 | None (entirely from MMLU) |
| Computer Science | 410 | 274 | 136 | STEM Websites |
| History | 381 | 381 | 0 | None (entirely from MMLU) |
| Total | 12,032 | 6,810 | 5,222 | |
Notably, several domains (Law, Philosophy, History, Health, and the catch-all Other category) retain 100% of their questions from the original MMLU, while the STEM domains received substantial supplementation from external sources. Engineering, for example, draws 93% of its questions from STEM websites rather than from MMLU.
The following table summarizes the principal design differences between MMLU and MMLU-Pro:
| Feature | MMLU | MMLU-Pro | Significance |
|---|---|---|---|
| Total Questions | 15,908 | 12,032 | Quality over quantity |
| Subject Categories | 57 | 14 | Consolidated for clarity |
| Answer Choices per Question | 4 (A through D) | 10 (A through J) | Reduces random guessing from 25% to 10% |
| Random Guess Baseline | 25% | 10% | 60% reduction in guessing advantage |
| Prompt Sensitivity | 4-5% variance | ~2% variance | More stable and reliable scoring |
| Question Quality | Mixed; includes trivial items | Curated; noise removed | Stronger signal of true capability |
| Reasoning Requirement | Minimal; mostly knowledge recall | Substantial; multi-step reasoning | Better tests analytical ability |
| Chain-of-Thought Benefit | Negligible or negative | +4% to +19% improvement | Confirms reasoning-heavy content |
| Question Sources | Existing exam banks only | MMLU + SCIBENCH + TheoremQA + STEM sites | Broader, more challenging coverage |
| Publication | ICLR 2021 (Hendrycks et al.) | NeurIPS 2024 Datasets Track (Wang et al.) | |
The construction of MMLU-Pro followed a multi-stage pipeline involving automated filtering, GPT-4-based augmentation, and expert human review.
The authors began with the 13,937 questions in the MMLU test set. They evaluated each question against eight language models of varying capability. Questions that were answered correctly by four or more of the eight models were deemed too easy and removed from the dataset. This filtering step eliminated 5,886 questions (approximately 42% of the original set), leaving 8,051 MMLU questions. After additional noise and error removal through expert review, 6,810 MMLU-origin questions were retained for MMLU-Pro.
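The filtering rule itself is simple to express. The sketch below is a minimal illustration under stated assumptions, not the authors' actual pipeline; `model_correctness`, mapping each question ID to the eight per-model correctness flags, is a hypothetical structure.

```python
# Minimal sketch of the MMLU difficulty filter (illustrative, not the authors' code).
# Assumes `model_correctness` maps each question ID to a list of eight booleans,
# one per filter model, indicating whether that model answered correctly.

def is_too_easy(correct_flags, threshold=4):
    """A question is dropped if at least `threshold` of the eight models solve it."""
    return sum(correct_flags) >= threshold

def filter_questions(questions, model_correctness):
    """Keep only questions that fewer than four of the eight reference models answered correctly."""
    return [
        q for q in questions
        if not is_too_easy(model_correctness[q["question_id"]])
    ]
```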
To strengthen coverage in science, technology, engineering, and mathematics, the authors incorporated questions from three additional sources: STEM websites, TheoremQA, and SciBench.
Questions from TheoremQA and SciBench were originally in free-response format. GPT-4-Turbo was used to extract short correct answers from the solutions and convert them into multiple-choice format.
One of the most distinctive features of MMLU-Pro is its expansion from four answer choices to ten. For questions that originally had only four options, GPT-4-Turbo was employed to generate six additional plausible distractors for each question. These distractors were not random; they were designed to be plausible incorrect answers that require discriminative reasoning to eliminate. This expansion reduces the probability of guessing correctly from 25% (with four choices) to 10% (with ten choices), placing greater demand on genuine understanding.
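The option-expansion step can be sketched as a single augmentation call per question. The snippet below is an illustrative sketch assuming the OpenAI Python client; the prompt wording and the `generate_distractors` helper are hypothetical, and the authors' actual prompts and post-processing (including the expert review described next) are not reproduced here.

```python
# Illustrative distractor-augmentation sketch (hypothetical helper, not the authors' pipeline).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_distractors(question, correct_answer, existing_options, n=6):
    """Ask an LLM for n additional plausible but incorrect answer options."""
    prompt = (
        f"Question: {question}\n"
        f"Correct answer: {correct_answer}\n"
        f"Existing incorrect options: {existing_options}\n"
        f"Write {n} new answer options that are incorrect but plausible, one per line."
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    lines = response.choices[0].message.content.strip().splitlines()
    return [line.strip() for line in lines if line.strip()][:n]
```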
The dataset underwent two phases of expert review to verify answer correctness and remove remaining noisy or unsuitable questions.
Each question in MMLU-Pro follows a standardized format: a question stem, ten answer options labeled A through J, the correct answer with its index, and a reference Chain-of-Thought solution. Model performance is reported using the following metrics:
| Metric | Description | Calculation |
|---|---|---|
| Overall Accuracy | Percentage of correct answers across all questions | (Correct / Total) x 100% |
| Domain Accuracy | Performance within each of the 14 subject areas | (Correct in domain / Total in domain) x 100% |
| CoT Accuracy | Accuracy when the model uses Chain-of-Thought prompting | CoT correct / Total x 100% |
| Direct Answer Accuracy | Accuracy with direct answer extraction (no reasoning) | Direct correct / Total x 100% |
| CoT Gain | Improvement attributable to Chain-of-Thought reasoning | CoT accuracy minus Direct accuracy |
The standard evaluation protocol uses 5-shot Chain-of-Thought prompting, where the model is given five example questions with worked-out reasoning before being asked to solve test questions.
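A minimal sketch of how the 5-shot CoT protocol and these metrics fit together is shown below. Field names (`question`, `options`, `cot_content`, `answer`) follow the dataset schema described later in this article; the prompt template is illustrative rather than the official one, and the model call itself is omitted.

```python
# Illustrative 5-shot CoT evaluation sketch (prompt template is an assumption).
LETTERS = "ABCDEFGHIJ"

def format_question(example):
    """Render one record as a 10-option multiple-choice question."""
    options = "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(example["options"]))
    return f"Question: {example['question']}\n{options}\nAnswer:"

def build_prompt(test_example, shots):
    """Prepend five worked examples (with their reference CoT) to the test question."""
    demos = "\n\n".join(
        f"{format_question(s)} {s['cot_content']} The answer is ({s['answer']})."
        for s in shots
    )
    return f"{demos}\n\n{format_question(test_example)}"

def accuracy(predicted_letters, examples):
    """Overall accuracy: (correct / total) x 100%."""
    correct = sum(p == e["answer"] for p, e in zip(predicted_letters, examples))
    return 100.0 * correct / len(examples)

# CoT gain is then simply the difference between two accuracies:
# cot_gain = accuracy(cot_predictions, test_set) - accuracy(direct_predictions, test_set)
```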
A key advantage of MMLU-Pro over its predecessor is reduced sensitivity to prompt wording. The authors tested 24 different prompt styles across four categories:
| Prompt Category | Variations Tested | Score Variance |
|---|---|---|
| Instruction Format | 6 styles | Less than 1% |
| Few-shot Examples | 0 to 5 examples | ~1.5% |
| Output Format | 6 formats | Less than 1% |
| Task Framing | 6 approaches | Less than 1% |
On the original MMLU, the same prompt variations produced 4-5% swings in model scores. MMLU-Pro's reduced sensitivity (approximately 2% total variance) means that benchmark results are more reproducible and less dependent on prompt engineering choices.
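The sensitivity figures themselves are just the spread of one model's accuracy across prompt variants. A minimal sketch, assuming a list of accuracies obtained with different prompt styles (the numbers below are hypothetical):

```python
# Summarizing prompt sensitivity as the spread of scores across prompt variants.
from statistics import stdev

def prompt_sensitivity(scores):
    """Return (max-min range, standard deviation) of accuracies in percentage points."""
    return max(scores) - min(scores), stdev(scores)

# Hypothetical accuracies from six instruction-format variants:
spread, sd = prompt_sensitivity([71.8, 72.6, 72.1, 71.5, 72.4, 72.0])
print(f"range = {spread:.1f} points, stdev = {sd:.2f} points")
```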
One of the most important findings from the MMLU-Pro paper is the stark difference in how Chain-of-Thought (CoT) reasoning affects performance compared to the original MMLU.
On the original MMLU, CoT prompting provided minimal or even slightly negative benefit. This was because most MMLU questions tested factual recall rather than multi-step reasoning, and the additional reasoning steps sometimes introduced errors without improving accuracy. On MMLU-Pro, however, CoT prompting yields large and consistent improvements across all models tested, confirming that the benchmark genuinely requires reasoning rather than simple knowledge lookup.
| Model | MMLU (CoT) | MMLU (Direct) | MMLU CoT Gain | MMLU-Pro (CoT) | MMLU-Pro (Direct) | MMLU-Pro CoT Gain |
|---|---|---|---|---|---|---|
| GPT-4o | 88.7% | 87.2% | +1.5% | 72.6% | 53.5% | +19.1% |
| GPT-4-Turbo | 86.5% | 86.7% | -0.2% | 63.7% | 48.4% | +15.3% |
| Phi-3-medium | 79.4% | 78.0% | +1.4% | 55.7% | 47.5% | +8.2% |
The CoT gain on MMLU-Pro is dramatically larger than on MMLU. GPT-4o, for example, improves by 19.1 percentage points when using Chain-of-Thought on MMLU-Pro, compared to only 1.5 points on MMLU. GPT-4-Turbo actually performs slightly worse with CoT on the original MMLU (-0.2%), but gains 15.3 points on MMLU-Pro. These results demonstrate that MMLU-Pro questions genuinely benefit from step-by-step reasoning and cannot be answered through pattern matching alone.
The subjects showing the largest CoT improvements include Mathematics (+41.8 percentage points for some models), Chemistry (+39.5 points), and Business (+39.4 points), indicating that these domains contain the most reasoning-intensive questions.
The following results were reported in the original MMLU-Pro paper using 5-shot CoT evaluation. These scores represent model capabilities at the time of the benchmark's release.
| Rank | Model | Type | MMLU-Pro Score (CoT) | Organization |
|---|---|---|---|---|
| 1 | GPT-4o | Closed-source | 72.6% | OpenAI |
| 2 | Gemini 1.5 Pro | Closed-source | 69.0% | Google DeepMind |
| 3 | Claude 3 Opus | Closed-source | 68.5% | Anthropic |
| 4 | GPT-4-Turbo | Closed-source | 63.7% | OpenAI |
| 5 | Gemini 1.5 Flash | Closed-source | 59.1% | Google DeepMind |
| 6 | Yi-large | Closed-source | 57.5% | 01.AI |
| 7 | Claude 3 Sonnet | Closed-source | 56.8% | Anthropic |
| 8 | LLaMA 3 70B Instruct | Open-source | 56.2% | Meta |
| 9 | Phi-3-medium-4k | Open-source | 55.7% | Microsoft |
| 10 | DeepSeek V2 Chat | Open-source | 54.8% | DeepSeek |
| 11 | LLaMA 3 70B | Open-source | 52.8% | Meta |
| 12 | Qwen 1.5 72B Chat | Open-source | 52.6% | Alibaba |
| 13 | Yi-1.5-34B-Chat | Open-source | 52.3% | 01.AI |
| 14 | Phi-3-medium-128k | Open-source | 51.9% | Microsoft |
| 15 | MAmmoTH2-8x7B-Plus | Open-source | 50.4% | TIGER-AI Lab |
| 16 | Mixtral 8x7B Instruct | Open-source | 43.3% | Mistral AI |
At the time of publication, GPT-4o was the clear leader at 72.6%. Open-source models lagged behind closed-source models by a significant margin, with the best open-source model (LLaMA 3 70B Instruct) scoring 16.4 percentage points below GPT-4o.
Since the benchmark's release, newer models have achieved substantially higher scores. The following table reflects reported MMLU-Pro scores from the official leaderboard and third-party evaluation platforms.
| Rank | Model | MMLU-Pro Score | Organization |
|---|---|---|---|
| 1 | Gemini 3 Pro | ~90.1% | Google DeepMind |
| 2 | Claude Opus 4.5 (Thinking) | ~89.5% | Anthropic |
| 3 | Gemini 3 Flash (Thinking) | ~89.0% | Google DeepMind |
| 4 | Claude Opus 4.5 | ~88.9% | Anthropic |
| 5 | Claude Opus 4.1 (Thinking) | ~88.0% | Anthropic |
| 6 | Claude Sonnet 4.5 (Thinking) | ~87.5% | Anthropic |
| 7 | GPT-5.2 Pro | ~87.4% | OpenAI |
| 8 | Claude Opus 4 (Thinking) | ~87.3% | Anthropic |
| 9 | GPT-5 | ~87.1% | OpenAI |
| 10 | Grok 4 | ~86.6% | xAI |
| 11 | Gemini 2.5 Pro | ~86.2% | Google DeepMind |
| 12 | DeepSeek V3.2 (Thinking) | ~86.2% | DeepSeek |
| 13 | o3 | ~85.3% | OpenAI |
| 14 | DeepSeek R1 | ~84.4% | DeepSeek |
| 15 | o1 | ~84.1% | OpenAI |
| 16 | Claude 3.5 Sonnet | 76.1% | Anthropic |
| 17 | DeepSeek V3 | ~73.9% | DeepSeek |
| 18 | LLaMA 3.1 405B Instruct | ~73.2% | Meta |
| 19 | GPT-4o | 72.6% | OpenAI |
| 20 | GPT-4o-mini | ~64.8% | OpenAI |
As of early 2026, 199 models have been evaluated on MMLU-Pro, with an average score of 74.4% across all submissions. Frontier models with extended reasoning capabilities (such as Claude Opus 4.5 Thinking and Gemini 3 Pro) have pushed scores above 89%, approaching the estimated human performance ceiling of approximately 90%.
Models experience significant accuracy reductions when evaluated on MMLU-Pro compared to their MMLU performance. The magnitude of the drop correlates with model capability: stronger models tend to lose fewer percentage points.
| Model Category | Typical MMLU Score | Typical MMLU-Pro Score | Approximate Drop |
|---|---|---|---|
| Frontier Models (2024) | 86-89% | 68-73% | 16-20 points |
| High-Performance Models | 80-85% | 60-70% | 20-25 points |
| Mid-Range Models | 70-80% | 45-60% | 25-30 points |
| Smaller Models | 60-70% | 35-45% | 30-33 points |
The consistent pattern of larger drops for weaker models confirms that MMLU-Pro is more discriminative. On MMLU, a mid-range model might appear to be within striking distance of a frontier model, but MMLU-Pro reveals a much wider gap in actual capability.
The MMLU-Pro paper includes a detailed error analysis of 120 incorrect responses from GPT-4o, which was the top-performing model at the time of publication. The analysis categorizes errors into four types:
| Error Type | Percentage | Description |
|---|---|---|
| Reasoning Errors | 39% | Logical inconsistencies or flawed inference chains, even when the model recalled relevant knowledge correctly |
| Knowledge Gaps | 35% | Missing domain-specific expertise needed to arrive at the correct answer |
| Calculation Mistakes | 12% | Arithmetic or computational errors in problems requiring numerical work |
| Other Errors | 14% | Miscellaneous errors including misreading the question or selecting an answer that contradicts the model's own reasoning |
The dominance of reasoning errors (39%) over knowledge gaps (35%) is noteworthy. It indicates that MMLU-Pro successfully tests reasoning ability rather than merely factual knowledge. Models frequently recalled the correct principles and formulas but failed to apply them through a valid chain of logical steps. For example, GPT-4o sometimes used incorrect values in financial calculations or misapplied refractive index ratios in optics problems, despite demonstrating awareness of the relevant concepts.
| Domain | Top Model Performance | Average Performance | Difficulty Rating |
|---|---|---|---|
| Mathematics | 85% | 65% | Very High |
| Physics | 82% | 62% | Very High |
| Computer Science | 88% | 70% | High |
| Chemistry | 80% | 60% | High |
| Engineering | 79% | 59% | High |
| Biology | 84% | 68% | Medium-High |
| Economics | 81% | 65% | Medium-High |
| Business | 83% | 67% | Medium |
| Psychology | 85% | 70% | Medium |
| History | 87% | 72% | Medium |
| Law | 78% | 58% | High |
| Philosophy | 82% | 66% | Medium-High |
| Health | 86% | 71% | Medium |
| Other | 80% | 65% | Variable |
STEM subjects (Mathematics, Physics, Chemistry, Engineering) consistently rank as the most difficult, which aligns with the heavy supplementation of these domains with questions from SciBench, TheoremQA, and STEM websites. Law also ranks as particularly challenging despite containing only questions from the original MMLU, likely because legal reasoning requires nuanced interpretation of multiple overlapping rules.
Several design choices combine to make MMLU-Pro substantially more difficult than the original MMLU:
With ten answer options instead of four, models face three times as many distractors: nine plausible incorrect alternatives rather than three. Each additional distractor is not random filler but a carefully constructed incorrect answer that a model with partial understanding might select. The probability of guessing correctly drops from 25% to 10%, which means that a model's score above the 10% baseline more directly reflects genuine comprehension.
The filtering process removed questions that most models could answer through simple recall. The remaining questions, combined with new additions from SciBench and TheoremQA, demand multi-step reasoning. A typical MMLU-Pro question in physics or mathematics may require setting up equations, performing algebraic manipulation, applying domain-specific theorems, and then selecting from ten closely spaced numerical or conceptual answers.
By removing trivial and noisy items from the original MMLU and adding questions from vetted academic sources, MMLU-Pro achieves a higher floor of question quality. Each question meaningfully tests competence rather than rewarding surface-level pattern recognition.
The combination of ten options and reduced prompt sensitivity makes MMLU-Pro harder to game. On the original MMLU, certain prompt formulations could swing scores by 4-5 percentage points, creating an incentive to optimize prompts rather than improve model capability. MMLU-Pro's ~2% prompt sensitivity largely eliminates this source of variance.
Evaluations can be run from the official GitHub repository:

```bash
# Clone the repository
git clone https://github.com/TIGER-AI-Lab/MMLU-Pro
cd MMLU-Pro

# Install dependencies
pip install -r requirements.txt

# Download dataset
python download_data.py

# Basic evaluation
python evaluate.py --model "gpt-4" --method "direct"

# With Chain-of-Thought
python evaluate.py --model "gpt-4" --method "cot"

# Specific domains
python evaluate.py --model "gpt-4" --domains "math,physics,cs"

# All 24 prompt styles
python evaluate.py --model "gpt-4" --test-prompts
```
The dataset is available on Hugging Face and can be loaded directly using the datasets library:
```python
from datasets import load_dataset

# Load MMLU-Pro dataset
dataset = load_dataset("TIGER-Lab/MMLU-Pro")

# Access specific splits
test_set = dataset['test']              # 12,032 rows
validation_set = dataset['validation']  # 70 rows

# Filter by domain
math_questions = dataset['test'].filter(lambda x: x['category'] == 'math')
```
Each record in the dataset includes the following fields:
| Field | Type | Description |
|---|---|---|
| question_id | int64 | Unique identifier for the question |
| question | string | The question text |
| options | list of strings | Ten multiple-choice options |
| answer | string | Correct answer letter (A through J) |
| answer_index | int64 | Index of correct answer (0 through 9) |
| cot_content | string | Reference Chain-of-Thought reasoning |
| category | string | Subject category (one of 14 domains) |
| src | string | Source of the question |
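Given this schema, a record can be checked for internal consistency and the domain composition tallied directly from the dataset; the snippet below is a small example against the public Hugging Face release.

```python
from collections import Counter
from datasets import load_dataset

# Load the test split and inspect one record (field names follow the schema above).
ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
ex = ds[0]

letters = "ABCDEFGHIJ"
# `answer` holds the correct letter; `answer_index` holds its 0-based position.
assert ex["answer"] == letters[ex["answer_index"]]

# Per-category question counts (these should match the domain table above).
print(Counter(ds["category"]))
```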
Since its release in June 2024, MMLU-Pro has rapidly become a standard evaluation benchmark in the large language model community.
Major AI labs now routinely report MMLU-Pro scores alongside other benchmarks when releasing new models. OpenAI, Anthropic, Google DeepMind, Meta, Mistral AI, and DeepSeek have all published MMLU-Pro results for their flagship models. Multiple independent benchmark aggregation platforms, including Artificial Analysis, Vals AI, and Kaggle Benchmarks, track MMLU-Pro scores, and as of early 2026, 199 models have been evaluated on the benchmark.
The MMLU-Pro paper has been widely cited in subsequent research on LLM evaluation methodology. It has also inspired derivative benchmarks such as MMLU-ProX, a multilingual extension that adapts the benchmark beyond English.
With frontier reasoning models now scoring above 89% on MMLU-Pro, the benchmark is beginning to face the same saturation challenge that originally motivated its creation. The gap between top models has narrowed, and further improvements may be difficult to measure reliably. This trend suggests that even more challenging benchmarks may be needed in the future to differentiate the next generation of models.
| Challenge | Description | Impact |
|---|---|---|
| Distractor Quality | High-quality incorrect options generated by GPT-4-Turbo | Reduces guessing success and requires precise discrimination |
| Reasoning Depth | Multi-step problem solving required across STEM and non-STEM domains | Challenges surface-level knowledge |
| Domain Expertise | Specialized knowledge needed in 14 distinct domains | Tests both breadth and depth of training data |
| Option Discrimination | Subtle differences between closely related choices | Requires precise understanding rather than approximate matching |
| Cross-Domain Integration | Some questions require combining knowledge from multiple fields | Tests holistic understanding |
| Limitation | Description | Impact |
|---|---|---|
| English Only | All questions are in English | Limits applicability for non-English language models (partially addressed by MMLU-ProX) |
| Multiple Choice Format | All questions use 10-option multiple choice | May not capture free-form reasoning or generation capabilities |
| Static Dataset | Fixed set of 12,032 questions | Creates risk of contamination if questions appear in training data |
| Academic Focus | Questions are drawn from academic exams and textbooks | May not reflect practical, real-world problem-solving ability |
| Approaching Saturation | Frontier models exceed 89% | Diminishing ability to differentiate top-tier models |
| GPT-4 Bias in Distractors | Additional options were generated by GPT-4-Turbo | Models from the same family may have subtle advantages or disadvantages with these distractors |
MMLU-Pro represents a crucial evolution in language model evaluation, addressing the saturation problem of its predecessor while providing more robust and discriminative assessment. Its reduced prompt sensitivity and enhanced focus on reasoning make it particularly valuable for differentiating frontier models, tracking progress in multi-step reasoning, and producing comparisons that are less dependent on prompt engineering choices.
The benchmark's success in revealing performance differences previously hidden by MMLU's ceiling effect has made it one of the most widely used LLM evaluation tools since its introduction. Its acceptance as a Spotlight paper at NeurIPS 2024 further cemented its standing in the research community.