MMLU-Pro
| MMLU-Pro | |
|---|---|
| Overview | |
| Full name | Massive Multitask Language Understanding Professional |
| Abbreviation | MMLU-Pro |
| Description | A more robust and challenging multi-task language understanding benchmark with 10-choice questions |
| Release date | 2024-06 |
| Latest version | 1.0 |
| Benchmark updated | 2024-10 |
| Authors | Yubo Wang, Xueguang Ma, Ge Zhang, et al. |
| Organization | TIGER-AI Lab |
| Technical Details | |
| Type | Knowledge, Reasoning, Multi-task Understanding |
| Modality | Text |
| Task format | Multiple choice (10 options) |
| Number of tasks | 14 subject areas |
| Total examples | 12,000+ |
| Evaluation metric | Accuracy, Chain-of-Thought performance |
| Domains | 14 domains (Biology, Business, Chemistry, Computer Science, etc.) |
| Languages | English |
| Performance | |
| Human performance | ~90% (estimated) |
| Baseline | Random guess: 10% |
| SOTA score | 83.5% |
| SOTA model | OpenAI o1 |
| SOTA date | 2025 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | https://github.com/TIGER-AI-Lab/MMLU-Pro |
| Dataset | TIGER-Lab/MMLU-Pro on Hugging Face |
| License | MIT |
| Predecessor | MMLU |
MMLU-Pro (Massive Multitask Language Understanding Professional) is an advanced artificial intelligence benchmark designed to evaluate large language models' capabilities across challenging multi-task language understanding problems. Released in June 2024 by the TIGER-AI Lab, MMLU-Pro extends the original MMLU benchmark by incorporating more complex reasoning-focused questions, expanding answer choices from four to ten options, and eliminating trivial or noisy questions, resulting in a more robust and discriminative evaluation framework.
Overview
MMLU-Pro addresses the performance saturation observed in the original MMLU benchmark, where frontier models had converged to scores between 86-87%, making it difficult to distinguish between model capabilities. By increasing both the difficulty and robustness of questions while reducing prompt sensitivity, MMLU-Pro provides a more challenging and stable evaluation environment for modern language models.
Motivation
The development of MMLU-Pro was driven by several critical observations:
- Performance saturation on original MMLU, with top models clustering at 86-87% accuracy
- High sensitivity to prompt variations in MMLU (4-5% variance)
- Limited discrimination between frontier models
- Insufficient emphasis on reasoning versus pure knowledge recall
- Presence of trivial and noisy questions in the original dataset
The benchmark specifically targets the need for more challenging evaluations that can differentiate between increasingly capable AI systems while providing more stable and reliable measurements.
Technical Specifications
Dataset Composition
MMLU-Pro comprises over 12,000 rigorously curated questions spanning 14 diverse domains:
| Domain | Description | Question Types | Approximate Count |
|---|---|---|---|
| Biology | Life sciences and biological systems | Conceptual, analytical | ~850 |
| Business | Commerce, management, economics | Case studies, theory | ~850 |
| Chemistry | Chemical processes and reactions | Problem-solving, theory | ~850 |
| Computer Science | Programming, algorithms, theory | Code analysis, concepts | ~850 |
| Economics | Economic theory and applications | Models, analysis | ~850 |
| Engineering | Applied sciences and design | Technical problems | ~850 |
| Health | Medical and health sciences | Clinical, research | ~850 |
| History | Historical events and analysis | Factual, interpretive | ~850 |
| Law | Legal principles and applications | Case law, theory | ~850 |
| Mathematics | Pure and applied mathematics | Problem-solving | ~850 |
| Philosophy | Philosophical concepts and reasoning | Logical analysis | ~850 |
| Physics | Physical sciences and mechanics | Calculations, concepts | ~850 |
| Psychology | Human behavior and cognition | Theory, research | ~850 |
| Other | Miscellaneous disciplines | Various | ~850 |
Key Improvements Over MMLU
| Feature | MMLU | MMLU-Pro | Improvement |
|---|---|---|---|
| Answer Choices | 4 options | 10 options | 150% increase |
| Random Guess Rate | 25% | 10% | 60% reduction |
| Prompt Sensitivity | 4-5% variance | ~2% variance | 50-60% reduction |
| Question Quality | Mixed quality | Curated, noise-free | Significant improvement |
| Reasoning Focus | Limited | Enhanced | Major emphasis |
| Total Questions | 15,908 | 12,000+ | Quality over quantity |
Evaluation Methodology
Answer Format
Each question in MMLU-Pro follows a standardized format (a hypothetical record is sketched after the list below):
- **Question Stem**: The main question or problem statement
- **10 Answer Options**: Labeled A through J
- **Single Correct Answer**: Exactly one correct option
- **Distractors**: Nine plausible but incorrect options
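To make this format concrete, the sketch below shows a hypothetical item represented as a plain Python dictionary. The field names loosely follow the Hugging Face release of the dataset (`question`, `options`, `answer`, `answer_index`, `category`), but the question content is invented for illustration.
```python
# Hypothetical MMLU-Pro item, shown as a Python dict for illustration only.
# The content is invented; field names loosely follow the released dataset.
example_item = {
    "question": "Which of the following best describes the time complexity "
                "of binary search on a sorted array of n elements?",
    "options": [  # exactly ten options, labeled A-J by position
        "O(n)", "O(log n)", "O(n log n)", "O(1)", "O(n^2)",
        "O(sqrt(n))", "O(2^n)", "O(n!)", "O(log log n)", "O(n^3)",
    ],
    "answer": "B",          # single correct letter
    "answer_index": 1,      # zero-based index of the correct option
    "category": "computer science",
}

# Render the item the way it would be presented to a model.
letters = "ABCDEFGHIJ"
prompt = example_item["question"] + "\n" + "\n".join(
    f"{letters[i]}. {opt}" for i, opt in enumerate(example_item["options"])
)
print(prompt)
```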
Scoring System
| Metric | Description | Calculation |
|---|---|---|
| Overall Accuracy | Percentage of correct answers | (Correct / Total) × 100% |
| Domain Accuracy | Performance per subject area | (Correct in domain / Total in domain) × 100% |
| CoT Performance | Accuracy with Chain-of-Thought | CoT correct / Total × 100% |
| Direct Answer | Accuracy without reasoning | Direct correct / Total × 100% |
| CoT Gain | Improvement from reasoning | CoT accuracy - Direct accuracy |
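As a rough illustration of how the metrics in the table combine, the snippet below computes overall accuracy, per-domain accuracy, and CoT gain from a toy set of per-question results. The record layout is assumed for illustration; this is not the official scoring script.
```python
from collections import defaultdict

# Toy per-question results: (domain, gold letter, direct prediction, CoT prediction).
results = [
    ("math", "B", "B", "B"),
    ("math", "D", "A", "D"),
    ("law",  "J", "J", "J"),
    ("law",  "C", "E", "C"),
]

def accuracy(pairs):
    """Percentage of (gold, prediction) pairs that match."""
    return 100.0 * sum(gold == pred for gold, pred in pairs) / len(pairs)

direct_acc = accuracy([(gold, direct) for _, gold, direct, _ in results])
cot_acc = accuracy([(gold, cot) for _, gold, _, cot in results])
print(f"Direct: {direct_acc:.1f}%  CoT: {cot_acc:.1f}%  CoT gain: {cot_acc - direct_acc:+.1f} pts")

# Per-domain accuracy, using the CoT predictions.
by_domain = defaultdict(list)
for domain, gold, _, cot in results:
    by_domain[domain].append((gold, cot))
for domain, pairs in sorted(by_domain.items()):
    print(f"{domain}: {accuracy(pairs):.1f}%")
```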
Prompt Robustness
The MMLU-Pro authors evaluated 24 different prompt styles to verify score stability:
| Prompt Category | Variations | Impact on Score |
|---|---|---|
| Instruction Format | 6 styles | <1% variance |
| Few-shot Examples | 0-5 examples | ~1.5% variance |
| Output Format | 6 formats | <1% variance |
| Task Framing | 6 approaches | <1% variance |
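The variance figures above summarize how far a model's score moves when only the prompt changes. A minimal sketch of that summary, using made-up scores for a single hypothetical model, might look like this:
```python
import statistics

# Made-up accuracies (%) for one model under a few of the prompt styles.
scores_by_prompt = {
    "instruction_v1": 72.4,
    "instruction_v2": 72.9,
    "five_shot":      73.6,
    "zero_shot":      72.1,
    "json_output":    72.7,
}

spread = max(scores_by_prompt.values()) - min(scores_by_prompt.values())
stdev = statistics.stdev(scores_by_prompt.values())
print(f"Range across prompts: {spread:.1f} points, standard deviation: {stdev:.2f}")
```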
Performance Analysis
Current Leaderboard (2025)
| Rank | Model | Overall Accuracy | CoT Gain | Organization |
|---|---|---|---|---|
| 1 | OpenAI o1 | 83.5% | +8% | OpenAI |
| 2 | Claude 3.7 Sonnet (Thinking) | 82.7% | +7% | Anthropic |
| 3 | GPT-4o | 78.2% | +5% | OpenAI |
| 4 | Gemini 2.0 Flash | 77.4% | +4% | Google |
| 5 | DeepSeek-V3 | ~75% | +6% | DeepSeek |
| 6 | Claude 3.5 Sonnet | 73.1% | +4% | Anthropic |
| 7 | Gemini 1.5 Pro | 72.8% | +3% | Google |
| 8 | Llama 3.1 405B | 68.5% | +3% | Meta |
Performance Insights
Accuracy Drop from MMLU
Models show a substantial drop in accuracy when moving from the original MMLU to MMLU-Pro:
| Model Category | MMLU Score | MMLU-Pro Score | Drop |
|---|---|---|---|
| Frontier Models | 86-87% | 70-83% | 16-20% |
| High-Performance | 80-85% | 60-70% | 20-25% |
| Mid-Range | 70-80% | 45-60% | 25-30% |
| Smaller Models | 60-70% | 35-45% | 30-33% |
Chain-of-Thought Impact
Unlike the original MMLU, MMLU-Pro shows a clear benefit from Chain-of-Thought prompting (both prompting modes are sketched after this list):
- **Positive CoT Effect**: All models benefit from reasoning chains
- **Average Gain**: 4-8% improvement with CoT
- **Reasoning Indicator**: Strong CoT gains indicate reasoning-heavy questions
- **Contrast with MMLU**: Original MMLU showed minimal or negative CoT impact
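To make the two evaluation modes concrete, the sketch below shows one way a direct prompt and a Chain-of-Thought prompt could be built for a 10-option question. The exact wording used by the official evaluation harness may differ; this only illustrates the idea.
```python
# Hedged sketch of direct vs. Chain-of-Thought prompting for a 10-option question.
LETTERS = "ABCDEFGHIJ"

def format_options(options: list[str]) -> str:
    """Label the ten options A through J, one per line."""
    return "\n".join(f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options))

def direct_prompt(question: str, options: list[str]) -> str:
    """Ask only for the answer letter, with no intermediate reasoning."""
    return f"{question}\n{format_options(options)}\nAnswer with a single letter (A-J)."

def cot_prompt(question: str, options: list[str]) -> str:
    """Ask the model to reason step by step before committing to a letter."""
    return (f"{question}\n{format_options(options)}\n"
            "Think step by step, then finish with 'The answer is (X)'.")
```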
Domain-Specific Performance
Subject Area Analysis
| Domain | Top Model Performance | Average Performance | Difficulty Rating |
|---|---|---|---|
| Mathematics | 85% | 65% | Very High |
| Physics | 82% | 62% | Very High |
| Computer Science | 88% | 70% | High |
| Chemistry | 80% | 60% | High |
| Engineering | 79% | 59% | High |
| Biology | 84% | 68% | Medium-High |
| Economics | 81% | 65% | Medium-High |
| Business | 83% | 67% | Medium |
| Psychology | 85% | 70% | Medium |
| History | 87% | 72% | Medium |
| Law | 78% | 58% | High |
| Philosophy | 82% | 66% | Medium-High |
| Health | 86% | 71% | Medium |
| Other | 80% | 65% | Variable |
Implementation
Installation and Setup
```bash
# Clone the repository
git clone https://github.com/TIGER-AI-Lab/MMLU-Pro
cd MMLU-Pro

# Install dependencies
pip install -r requirements.txt

# Download dataset
python download_data.py
```
Running Evaluations
```bash
# Basic evaluation
python evaluate.py --model "gpt-4" --method "direct"

# With Chain-of-Thought
python evaluate.py --model "gpt-4" --method "cot"

# Specific domains
python evaluate.py --model "gpt-4" --domains "math,physics,cs"

# All 24 prompt styles
python evaluate.py --model "gpt-4" --test-prompts
```
Dataset Access
```python
from datasets import load_dataset

# Load the MMLU-Pro dataset from the Hugging Face Hub
dataset = load_dataset("TIGER-Lab/MMLU-Pro")

# Access specific splits
test_set = dataset['test']
validation_set = dataset['validation']

# Filter by subject area (the released data stores the subject in a per-question
# field such as 'category'; check the dataset card for the exact column name)
math_questions = dataset.filter(lambda x: x['category'] == 'math')
```
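When scoring free-form Chain-of-Thought outputs, the chosen letter still has to be pulled out of the model's response text. The helper below is a hypothetical heuristic for that step, not the extraction logic used by the official repository.
```python
import re

def extract_choice(response: str) -> str | None:
    """Pull the final answer letter (A-J) out of a model response.

    Looks for patterns such as 'The answer is (B)', falling back to a trailing
    bare letter. This is an illustrative heuristic only.
    """
    match = re.search(r"answer is \(?([A-J])\)?", response, flags=re.IGNORECASE)
    if match:
        return match.group(1).upper()
    match = re.search(r"\b([A-J])\b\s*$", response.strip())
    return match.group(1).upper() if match else None

print(extract_choice("Let's reason step by step... The answer is (C)."))  # -> C
```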
Challenges and Insights
Key Challenges for Models
| Challenge | Description | Impact |
|---|---|---|
| Distractor Quality | High-quality incorrect options | Reduces guessing success |
| Reasoning Depth | Multi-step problem solving required | Challenges surface knowledge |
| Domain Expertise | Specialized knowledge needed | Tests breadth and depth |
| Option Discrimination | Subtle differences between choices | Requires precise understanding |
| Complex Integration | Cross-domain reasoning | Tests holistic understanding |
Common Failure Modes
1. **Surface Pattern Matching**: Selecting answers based on superficial similarities
2. **Incomplete Reasoning**: Stopping analysis before reaching the correct conclusion
3. **Domain Confusion**: Applying incorrect domain knowledge
4. **Distractor Attraction**: Being misled by plausible but incorrect options
5. **Prompt Misinterpretation**: Despite reduced sensitivity, some variation remains
Applications and Impact
Research Applications
| Application | Purpose | Value |
|---|---|---|
| Model Development | Identifying capability gaps | Targeted improvements |
| Architecture Comparison | Evaluating design choices | Technical insights |
| Training Optimization | Measuring learning progress | Performance tracking |
| Reasoning Studies | Understanding model cognition | Theoretical advancement |
Practical Applications
- **Educational Assessment**: Evaluating AI tutoring capabilities
- **Professional Certification**: Testing domain expertise
- **Recruitment Tools**: Assessing AI for skill evaluation
- **Research Assistance**: Measuring research support capabilities
- **Decision Support**: Evaluating analytical reasoning
Limitations and Considerations
Current Limitations
| Limitation | Description | Impact |
|---|---|---|
| English Only | Single language focus | Limited global applicability |
| Multiple Choice Format | Restricted response type | May not capture full capabilities |
| Static Dataset | Fixed question set | Potential for overfitting |
| Academic Focus | Emphasis on formal knowledge | May miss practical skills |
| Cultural Bias | Western academic perspective | Limited cultural diversity |
Future Directions
1. **Multilingual Extension**: Versions in multiple languages
2. **Dynamic Generation**: Procedurally generated questions
3. **Free-Form Responses**: Open-ended answer formats
4. **Multimodal Integration**: Adding visual and audio components
5. **Adaptive Testing**: Difficulty adjustment based on performance
6. **Real-World Tasks**: Practical application scenarios
Related Benchmarks
- MMLU: Original Massive Multitask Language Understanding
- GPQA: Graduate-level science questions
- ARC: AI2 Reasoning Challenge
- HellaSwag: Commonsense reasoning
- BigBench: Diverse capability evaluation
- AGIEval: Human-level exam questions
- MATH: Mathematics problem solving
Significance
MMLU-Pro represents a crucial evolution in language model evaluation, addressing the saturation problem of its predecessor while providing more robust and discriminative assessment. Its reduced prompt sensitivity and enhanced focus on reasoning make it particularly valuable for:
- Distinguishing between frontier model capabilities
- Tracking genuine progress in AI development
- Identifying reasoning versus memorization abilities
- Providing stable performance measurements
- Guiding model development priorities
The benchmark's success in revealing performance differences previously hidden by MMLU's ceiling effect makes it an essential tool for advancing toward more capable AI systems.
See Also
- Language Model Evaluation
- Multi-task Learning
- Chain-of-Thought Reasoning
- Knowledge Benchmarks
- Academic Question Answering
- AI Evaluation Metrics