| MMLU-Pro | |
|---|---|
| Overview | |
| Full name | Massive Multitask Language Understanding Professional |
| Abbreviation | MMLU-Pro |
| Description | A more robust and challenging multi-task language understanding benchmark with 10-choice questions |
| Release date | 2024-06 |
| Latest version | 1.0 |
| Benchmark updated | 2024-10 |
| Authors | Yubo Wang, Xueguang Ma, Ge Zhang, et al. |
| Organization | TIGER-AI Lab |
| Technical Details | |
| Type | Knowledge, Reasoning, Multi-task Understanding |
| Modality | Text |
| Task format | Multiple choice (10 options) |
| Number of tasks | 14 (one per domain) |
| Total examples | 12,000+ |
| Evaluation metric | Accuracy, Chain-of-Thought performance |
| Domains | 14 domains (Biology, Business, Chemistry, Computer Science, etc.) |
| Languages | English |
| Performance | |
| Human performance | ~90% (estimated) |
| Baseline | Random guess: 10% |
| SOTA score | 83.5% |
| SOTA model | OpenAI o1 |
| SOTA date | 2025 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | MIT |
| Predecessor | MMLU |
MMLU-Pro (Massive Multitask Language Understanding Professional) is an advanced artificial intelligence benchmark designed to evaluate large language models' capabilities across challenging multi-task language understanding problems. Released in June 2024 by the TIGER-AI Lab, MMLU-Pro extends the original MMLU benchmark by incorporating more complex reasoning-focused questions, expanding answer choices from four to ten options, and eliminating trivial or noisy questions, resulting in a more robust and discriminative evaluation framework.
MMLU-Pro addresses the performance saturation observed in the original MMLU benchmark, where frontier models had converged to scores between 86-87%, making it difficult to distinguish between model capabilities. By increasing both the difficulty and robustness of questions while reducing prompt sensitivity, MMLU-Pro provides a more challenging and stable evaluation environment for modern language models.
The development of MMLU-Pro was driven by several critical observations: frontier models had saturated the original MMLU, scores fluctuated noticeably with prompt wording, and a portion of MMLU's questions were trivial or noisy. The benchmark therefore targets evaluations that can differentiate between increasingly capable AI systems while providing more stable and reliable measurements.
MMLU-Pro comprises over 12,000 rigorously curated questions spanning 14 diverse domains:
| Domain | Description | Question Types | Approximate Count |
|---|---|---|---|
| Biology | Life sciences and biological systems | Conceptual, analytical | ~850 |
| Business | Commerce, management, economics | Case studies, theory | ~850 |
| Chemistry | Chemical processes and reactions | Problem-solving, theory | ~850 |
| Computer Science | Programming, algorithms, theory | Code analysis, concepts | ~850 |
| Economics | Economic theory and applications | Models, analysis | ~850 |
| Engineering | Applied sciences and design | Technical problems | ~850 |
| Health | Medical and health sciences | Clinical, research | ~850 |
| History | Historical events and analysis | Factual, interpretive | ~850 |
| Law | Legal principles and applications | Case law, theory | ~850 |
| Mathematics | Pure and applied mathematics | Problem-solving | ~850 |
| Philosophy | Philosophical concepts and reasoning | Logical analysis | ~850 |
| Physics | Physical sciences and mechanics | Calculations, concepts | ~850 |
| Psychology | Human behavior and cognition | Theory, research | ~850 |
| Other | Miscellaneous disciplines | Various | ~850 |
| Feature | MMLU | MMLU-Pro | Improvement |
|---|---|---|---|
| Answer Choices | 4 options | 10 options | 150% increase |
| Random Guess Rate | 25% | 10% | 60% reduction |
| Prompt Sensitivity | 4-5% variance | ~2% variance | 50-60% reduction |
| Question Quality | Mixed quality | Curated, noise-free | Significant improvement |
| Reasoning Focus | Limited | Enhanced | Major emphasis |
| Total Questions | 15,908 | 12,000+ | Quality over quantity |
Each question in MMLU-Pro follows a standardized format: a question stem, ten labeled answer options (A–J), and a single correct answer, with domain and source metadata attached.
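A single record can be sketched as follows. The field names mirror those published with the Hugging Face release (`question`, `options`, `answer`, `answer_index`, `category`); the values here are invented for illustration.

```python
# Hypothetical MMLU-Pro record; field names follow the Hugging Face dataset,
# but the question and option values are invented.
record = {
    "question_id": 1234,
    "question": "A 2 kg mass accelerates at 3 m/s^2. What net force acts on it?",
    "options": ["2 N", "3 N", "5 N", "6 N", "8 N",
                "9 N", "12 N", "15 N", "18 N", "20 N"],
    "answer": "D",            # letter of the correct option
    "answer_index": 3,        # zero-based index into options
    "category": "physics",
}

# Sanity checks: ten options, and the answer letter agrees with the index.
assert len(record["options"]) == 10
assert record["options"][record["answer_index"]] == "6 N"
assert "ABCDEFGHIJ"[record["answer_index"]] == record["answer"]
```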
| Metric | Description | Calculation |
|---|---|---|
| Overall Accuracy | Percentage of correct answers | (Correct / Total) × 100% |
| Domain Accuracy | Performance per subject area | (Correct in domain / Total in domain) × 100% |
| CoT Performance | Accuracy with Chain-of-Thought | CoT correct / Total × 100% |
| Direct Answer | Accuracy without reasoning | Direct correct / Total × 100% |
| CoT Gain | Improvement from reasoning | CoT accuracy - Direct accuracy |
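The metrics above reduce to simple ratios; a minimal sketch with invented tallies shows how CoT Gain is derived from the two accuracy figures:

```python
def accuracy(correct: int, total: int) -> float:
    """Overall or per-domain accuracy as a percentage."""
    return 100.0 * correct / total

# Invented tallies for illustration.
cot_correct, direct_correct, total = 836, 756, 1000

cot_acc = accuracy(cot_correct, total)        # CoT Performance
direct_acc = accuracy(direct_correct, total)  # Direct Answer
cot_gain = cot_acc - direct_acc               # CoT Gain

print(f"CoT: {cot_acc:.1f}%, Direct: {direct_acc:.1f}%, Gain: {cot_gain:+.1f} pts")
```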
MMLU-Pro tested 24 different prompt styles to ensure stability:
| Prompt Category | Variations | Impact on Score |
|---|---|---|
| Instruction Format | 6 styles | <1% variance |
| Few-shot Examples | 0-5 examples | ~1.5% variance |
| Output Format | 6 formats | <1% variance |
| Task Framing | 6 approaches | <1% variance |
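Prompt sensitivity is usually reported as the spread of scores across prompt variants. A minimal sketch (the scores are invented, standing in for one model evaluated under six instruction formats):

```python
import statistics

# Hypothetical accuracy (%) of one model under six instruction formats.
scores = [72.1, 72.8, 71.9, 72.5, 72.3, 72.0]

spread = max(scores) - min(scores)  # worst-case variation across prompts
stdev = statistics.stdev(scores)    # sample standard deviation

print(f"spread={spread:.1f} pts, stdev={stdev:.2f} pts")
```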
| Rank | Model | Overall Accuracy | CoT Gain | Organization |
|---|---|---|---|---|
| 1 | OpenAI o1 | 83.5% | +8% | OpenAI |
| 2 | Claude 3.7 Sonnet (Thinking) | 82.7% | +7% | Anthropic |
| 3 | GPT-4o | 78.2% | +5% | OpenAI |
| 4 | Gemini 2.0 Flash | 77.4% | +4% | Google |
| 5 | DeepSeek-V3 | ~75% | +6% | DeepSeek |
| 6 | Claude 3.5 Sonnet | 73.1% | +4% | Anthropic |
| 7 | Gemini 1.5 Pro | 72.8% | +3% | Google |
| 8 | Llama 3.1 405B | 68.5% | +3% | Meta |
Models experience significant accuracy reduction when evaluated on MMLU-Pro:
| Model Category | MMLU Score | MMLU-Pro Score | Drop (points) |
|---|---|---|---|
| Frontier Models | 86-87% | 70-83% | 16-20% |
| High-Performance | 80-85% | 60-70% | 20-25% |
| Mid-Range | 70-80% | 45-60% | 25-30% |
| Smaller Models | 60-70% | 35-45% | 30-33% |
Unlike the original MMLU, MMLU-Pro shows distinct benefits from Chain-of-Thought prompting, and per-domain difficulty varies widely:
| Domain | Top Model Performance | Average Performance | Difficulty Rating |
|---|---|---|---|
| Mathematics | 85% | 65% | Very High |
| Physics | 82% | 62% | Very High |
| Computer Science | 88% | 70% | High |
| Chemistry | 80% | 60% | High |
| Engineering | 79% | 59% | High |
| Biology | 84% | 68% | Medium-High |
| Economics | 81% | 65% | Medium-High |
| Business | 83% | 67% | Medium |
| Psychology | 85% | 70% | Medium |
| History | 87% | 72% | Medium |
| Law | 78% | 58% | High |
| Philosophy | 82% | 66% | Medium-High |
| Health | 86% | 71% | Medium |
| Other | 80% | 65% | Variable |
```bash
git clone https://github.com/TIGER-AI-Lab/MMLU-Pro
cd MMLU-Pro
pip install -r requirements.txt
python download_data.py
```
```bash
# Direct answering vs. Chain-of-Thought evaluation
python evaluate.py --model "gpt-4" --method "direct"
python evaluate.py --model "gpt-4" --method "cot"

# Restrict evaluation to selected domains
python evaluate.py --model "gpt-4" --domains "math,physics,cs"

# Test robustness across prompt styles
python evaluate.py --model "gpt-4" --test-prompts
```
```python
from datasets import load_dataset

dataset = load_dataset("TIGER-Lab/MMLU-Pro")
test_set = dataset["test"]
validation_set = dataset["validation"]

# Filter to a single domain (the Hugging Face release stores it as "category")
math_questions = test_set.filter(lambda x: x["category"] == "math")
```
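Once the test split is loaded, scoring reduces to comparing a predicted letter against the gold index. A minimal sketch, where `predict` is a hypothetical stand-in for any model call:

```python
def score(examples, predict):
    """Fraction of questions answered correctly.

    `predict` maps (question, options) -> a letter in 'A'..'J';
    it is a placeholder for an actual model call.
    """
    letters = "ABCDEFGHIJ"
    correct = 0
    for ex in examples:
        pred = predict(ex["question"], ex["options"])
        if pred == letters[ex["answer_index"]]:
            correct += 1
    return correct / len(examples)

# Toy usage with a guesser that always answers "A".
toy = [{"question": "q", "options": ["x"] * 10, "answer_index": 0},
       {"question": "q", "options": ["x"] * 10, "answer_index": 3}]
print(score(toy, lambda q, o: "A"))  # 0.5
```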
| Challenge | Description | Impact |
|---|---|---|
| Distractor Quality | High-quality incorrect options | Reduces guessing success |
| Reasoning Depth | Multi-step problem solving required | Challenges surface knowledge |
| Domain Expertise | Specialized knowledge needed | Tests breadth and depth |
| Option Discrimination | Subtle differences between choices | Requires precise understanding |
| Complex Integration | Cross-domain reasoning | Tests holistic understanding |
1. **Surface Pattern Matching**: Selecting answers based on superficial similarities
2. **Incomplete Reasoning**: Stopping analysis before reaching the correct conclusion
3. **Domain Confusion**: Applying incorrect domain knowledge
4. **Distractor Attraction**: Being misled by plausible but incorrect options
5. **Prompt Misinterpretation**: Despite reduced sensitivity, some variation remains
| Application | Purpose | Value |
|---|---|---|
| Model Development | Identifying capability gaps | Targeted improvements |
| Architecture Comparison | Evaluating design choices | Technical insights |
| Training Optimization | Measuring learning progress | Performance tracking |
| Reasoning Studies | Understanding model cognition | Theoretical advancement |
| Limitation | Description | Impact |
|---|---|---|
| English Only | Single language focus | Limited global applicability |
| Multiple Choice Format | Restricted response type | May not capture full capabilities |
| Static Dataset | Fixed question set | Potential for overfitting |
| Academic Focus | Emphasis on formal knowledge | May miss practical skills |
| Cultural Bias | Western academic perspective | Limited cultural diversity |
1. **Multilingual Extension**: Versions in multiple languages
2. **Dynamic Generation**: Procedurally generated questions
3. **Free-Form Responses**: Open-ended answer formats
4. **Multimodal Integration**: Adding visual and audio components
5. **Adaptive Testing**: Difficulty adjustment based on performance
6. **Real-World Tasks**: Practical application scenarios
MMLU-Pro represents a crucial evolution in language model evaluation, addressing the saturation problem of its predecessor while providing more robust and discriminative assessment. Its reduced prompt sensitivity and enhanced focus on reasoning make it particularly valuable for model development, capability comparison, and reasoning research.
The benchmark's success in revealing performance differences previously hidden by MMLU's ceiling effect makes it an essential tool for advancing toward more capable AI systems.