MMLU
| MMLU | |
|---|---|
| Overview | |
| Full name | Measuring Massive Multitask Language Understanding |
| Abbreviation | MMLU |
| Description | A comprehensive benchmark evaluating large language models across 57 diverse academic subjects through multiple-choice questions |
Property "Description" (as page type) with input value "A comprehensive benchmark evaluating large language models across 57 diverse academic subjects through multiple-choice questions" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process. |
| Release date | 2020-09-07 |
| Latest version | MMLU-Pro |
| Benchmark updated | 2024-06-03 |
| Authors | Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt |
| Organization | University of California, Berkeley |
| Technical Details | |
| Type | Multitask Language Understanding, Knowledge Evaluation |
| Modality | Text |
| Task format | Multiple choice (4 options) |
| Number of tasks | 57 |
| Total examples | 15,908 |
| Evaluation metric | Accuracy, Macro-average |
| Domains | STEM, Humanities, Social Sciences, Professional Fields |
| Languages | English |
| Performance | |
| Human performance | 89.8 |
| Baseline | 25.0 |
| SOTA score | 90.0 |
| SOTA model | OpenAI o1-preview |
| SOTA date | 2025-01-01 |
| Saturated | Yes |
| Resources | |
| Website | Official website |
| Paper | arXiv:2009.03300 |
| GitHub | https://github.com/hendrycks/test |
| Dataset | Download |
| License | MIT License |
| Successor | MMLU-Pro |
MMLU (Measuring Massive Multitask Language Understanding) is a comprehensive benchmark designed to evaluate large language models across 57 diverse academic and professional subjects through multiple-choice questions. Created by researchers at the University of California, Berkeley and released in September 2020, MMLU has become one of the most widely adopted benchmarks for assessing general knowledge and reasoning capabilities in artificial intelligence systems. The benchmark consists of 15,908 questions spanning topics from elementary mathematics to professional law, with difficulty levels ranging from high school to expert professional knowledge.[1][2]
Overview
MMLU was developed to address the need for a comprehensive evaluation framework that could assess language models across multiple domains simultaneously, testing both world knowledge and problem-solving abilities. The benchmark emerged from the recognition that existing evaluation methods often focused on narrow domains or specific tasks, failing to capture the breadth of knowledge required for artificial general intelligence.[1]
The benchmark's design philosophy emphasizes zero-shot and few-shot learning, evaluating models on their pre-trained knowledge without task-specific fine-tuning. This approach provides insights into the general capabilities of language models rather than their ability to memorize specific datasets. By 2024, MMLU had been downloaded over 100 million times, establishing itself as a standard evaluation metric in the AI research community.[2]
Methodology
Dataset Construction
MMLU's questions were sourced from various educational materials including textbooks, online resources, and practice exams. The dataset was carefully curated to ensure:[1]
- Diverse coverage: Questions span 57 subjects across four major categories
- Difficulty variation: Content ranges from elementary to professional level
- Standardized format: All questions use 4-option multiple choice (A, B, C, D)
- Quality control: Manual review to ensure accuracy and clarity
Dataset Structure
The complete MMLU dataset is organized as follows:
| Component | Number of Questions | Purpose |
|---|---|---|
| Development Set | 285 (5 per subject) | Few-shot examples |
| Validation Set | 1,540 | Hyperparameter tuning |
| Test Set | 14,079 | Main evaluation |
| Total | 15,908 | Complete benchmark |
Evaluation Paradigms
MMLU supports multiple evaluation approaches:[1]
- Zero-shot: Direct evaluation without examples
- Few-shot: Up to 5 examples per subject provided
- Chain-of-thought: Models can show reasoning steps
- Direct answer: Models provide only the letter choice
The primary metric is accuracy through exact string matching, where models must produce the correct letter (A, B, C, or D) to receive credit.
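As an illustration, the following Python sketch formats a question in the standard A-D layout and applies letter-level exact matching; the dictionary field names are hypothetical and do not necessarily mirror the official dataset schema.
```python
# Minimal sketch of MMLU-style prompt formatting and exact-match scoring.
# The field names ("question", "choices", "answer") are illustrative assumptions.

LETTERS = ["A", "B", "C", "D"]

def format_question(example: dict) -> str:
    """Render one item in the standard 4-option multiple-choice layout."""
    lines = [example["question"]]
    lines += [f"{letter}) {choice}" for letter, choice in zip(LETTERS, example["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

def exact_match(prediction: str, reference: str) -> bool:
    """Credit the model only if its first non-blank character is the correct letter."""
    pred = prediction.strip().upper()
    return bool(pred) and pred[0] == reference.strip().upper()

if __name__ == "__main__":
    item = {"question": "What is 2 + 2?", "choices": ["3", "4", "5", "6"], "answer": "B"}
    print(format_question(item))
    print(exact_match("B) 4", item["answer"]))  # True
```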
Subject Categories
STEM (22 subjects)
The STEM category covers scientific and technical fields:
Mathematics and Physics
- Abstract Algebra
- College Mathematics
- Elementary Mathematics
- High School Mathematics
- College Physics
- High School Physics
- Conceptual Physics
- High School Statistics
Life Sciences
Chemistry and Computer Science
- College Chemistry
- High School Chemistry
- College Computer Science
- High School Computer Science
- Computer Security
- Machine Learning
Applied Sciences
Humanities (13 subjects)
The humanities category encompasses history, philosophy, and law:
History
Philosophy and Logic
Law and Religion
Social Sciences (12 subjects)
Social sciences cover economics, psychology, and society:
Economics
Psychology and Sociology
Politics and Geography
Professional and Other (10 subjects)
Professional fields and miscellaneous topics:
- Professional Accounting
- Professional Medicine
- Management
- Marketing
- Public Relations
- Nutrition
- Security Studies
- Global Facts
- Miscellaneous
Performance Results
Current Leaderboard (2025)
Top-performing models on MMLU have approached, and in some cases slightly exceeded, reported human expert performance:[3]
| Rank | Model | Organization | MMLU Score | Evaluation Type |
|---|---|---|---|---|
| 1 | o1-preview | OpenAI | 90.0% | 5-shot |
| 2 | Claude 3.5 Sonnet | Anthropic | 88.3% | 5-shot |
| 3 | GPT-4o | OpenAI | 88.0% | 5-shot |
| 4 | LLaMA 3.1 405B | Meta | 88.0% | 5-shot |
| 5 | Qwen 2.5 72B | Alibaba | 85.3% | 5-shot |
| 6 | Gemini 1.5 Pro | Google | 83.7% | 5-shot |
| 7 | Claude 3 Opus | Anthropic | 77.35% | 5-shot |
| - | Human Expert | - | 89.8% | - |
| - | Random Baseline | - | 25.0% | - |
Historical Performance Evolution
The progression of model performance on MMLU demonstrates rapid advancement in AI capabilities:
| Year | Best Model | Score | Key Milestone |
|---|---|---|---|
| 2020 | GPT-3 175B | 43.9% | Initial benchmark release |
| 2021 | Gopher 280B | 60.0% | First model above 50% |
| 2022 | PaLM 540B | 69.3% | Significant architecture improvements |
| 2023 | GPT-4 | 86.4% | Approaching human performance |
| 2024 | Multiple models | ~88% | Benchmark saturation begins |
| 2025 | o1-preview | 90.0% | Slightly exceeds human expert performance |
Performance by Subject Category
Analysis reveals significant variation in model performance across domains:[1]
| Category | Average Score (Top Models) | Easiest Subject | Hardest Subject |
|---|---|---|---|
| STEM | 85% | High School Mathematics (92%) | Abstract Algebra (65%) |
| Humanities | 87% | World Religions (91%) | Formal Logic (72%) |
| Social Sciences | 89% | Marketing (93%) | Econometrics (70%) |
| Professional | 86% | Management (90%) | Professional Law (75%) |
Quality Analysis and Limitations
Identified Issues
Research has revealed several quality concerns in the MMLU dataset:[4]
- Error rate: Approximately 6.5% of questions contain errors
- Multiple correct answers: 4% of questions have ambiguous answers
- Unclear questions: 14% lack sufficient clarity
- Subject-specific errors: Virology has 33% incorrect answers
- Cultural bias: Western-centric knowledge representation
Data Contamination
Studies suggest potential data contamination issues:
- Many questions appear in online educational resources
- Some models show anomalously high performance on specific subjects
- Performance gaps between MMLU and newer, uncontaminated benchmarks
MMLU Variants
MMLU-Pro
Released in June 2024, MMLU-Pro addresses limitations of the original benchmark:[5]
Key improvements:
- 10 answer choices instead of 4 (reducing random guess accuracy to 10%)
- 12,000+ questions across 14 consolidated domains
- Reasoning focus: Emphasis on complex reasoning over memorization
- Quality control: Eliminated trivial and noisy questions
- Performance impact: 16-33% accuracy drop compared to original MMLU
Other Notable Variants
Several specialized versions have been developed:
| Variant | Focus | Key Features | Release |
|---|---|---|---|
| MMLU-Redux | Error correction | Fixed ~1,000 problematic questions | 2024 |
| MMLU-SR | Stress testing | Modified terminology to test robustness | 2024 |
| CodeMMLU | Programming | Software engineering focus | 2024 |
| Mobile-MMLU | Efficiency | Optimized for mobile deployment | 2025 |
| IndicMMLU-Pro | Multilingual | Indian languages support | 2025 |
Technical Implementation
Dataset Access
MMLU is available through multiple platforms:[6]
- GitHub: Original repository with evaluation scripts
- Hugging Face: Dataset hosting and easy integration (see the loading sketch after this list)
- API Access: Through various evaluation platforms
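As a concrete illustration of the Hugging Face route, the sketch below loads a single subject configuration with the `datasets` library; the `cais/mmlu` identifier, split names, and field names reflect the widely used community mirror and are assumptions rather than part of the original release.
```python
# Sketch: loading one MMLU subject from the Hugging Face Hub.
# Assumes the "cais/mmlu" mirror with per-subject configurations and
# dev/validation/test splits; adjust names if your mirror differs.
from datasets import load_dataset

subject = "abstract_algebra"        # one of the 57 subject configurations
mmlu = load_dataset("cais/mmlu", subject)

print(mmlu)                         # shows available splits and their sizes
example = mmlu["test"][0]
print(example["question"])
print(example["choices"])           # the four answer options
print(example["answer"])            # index of the correct option
```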
Evaluation Protocol
Standard evaluation procedure:
```
# Example evaluation format
Question: [Question text]
A) [Option A]
B) [Option B]
C) [Option C]
D) [Option D]
Answer: [Correct letter]
```
Models are evaluated on:
- Exact match accuracy
- Macro-average across all subjects (see the sketch after this list)
- Optional: Per-category and per-subject analysis
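A minimal sketch of the two headline numbers, assuming per-subject lists of correct/incorrect flags have already been collected (all names below are illustrative):
```python
# Sketch: overall (micro) accuracy versus the macro-average across subjects.
# `results` maps subject name -> list of booleans, one per test question.

def micro_accuracy(results: dict[str, list[bool]]) -> float:
    """Every question weighted equally, regardless of subject size."""
    flags = [flag for per_subject in results.values() for flag in per_subject]
    return sum(flags) / len(flags)

def macro_accuracy(results: dict[str, list[bool]]) -> float:
    """Every subject weighted equally: average the per-subject accuracies."""
    per_subject = [sum(flags) / len(flags) for flags in results.values()]
    return sum(per_subject) / len(per_subject)

results = {
    "abstract_algebra": [True, False, False, True],
    "marketing": [True, True, True, False],
}
print(f"micro: {micro_accuracy(results):.3f}")
print(f"macro: {macro_accuracy(results):.3f}")
```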
Integration with LLM Frameworks
MMLU is integrated into major evaluation frameworks:
- EleutherAI lm-evaluation-harness (example sketch after this list)
- Hugging Face evaluate library
- OpenAI evals
- Custom evaluation pipelines
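As an example of such an integration, the sketch below calls EleutherAI's lm-evaluation-harness through its Python entry point; the `simple_evaluate` arguments, the `mmlu` task name, and the placeholder model are assumptions that may differ between harness versions, so treat this as an outline rather than a verified recipe.
```python
# Sketch: evaluating a Hugging Face model on MMLU via lm-evaluation-harness.
# Argument names and the task identifier are assumptions; check the installed
# harness version's documentation before running.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                        # transformers-based backend
    model_args="pretrained=gpt2",      # placeholder model for illustration
    tasks=["mmlu"],
    num_fewshot=5,                     # standard 5-shot MMLU setting
)
print(results["results"])              # per-task accuracy figures
```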
Impact and Significance
Research Impact
MMLU has significantly influenced AI research:[2]
- 100+ million downloads as of 2024
- Standard benchmark in model releases
- 2,000+ citations in academic literature
- Industry adoption by all major AI labs
Educational Applications
The benchmark has applications beyond model evaluation:
- Curriculum development: Identifying knowledge gaps
- Educational assessment: Comparing AI and human performance
- Tutoring systems: Baseline for educational AI
- Knowledge mapping: Understanding model capabilities
Benchmark Saturation
By 2025, MMLU is considered largely saturated:[2]
- Top models achieve 85-90% accuracy
- Minimal differentiation between leading systems
- Shift toward more challenging benchmarks
- Continued value for mid-tier model evaluation
Future Directions
Ongoing Developments
The MMLU ecosystem continues to evolve:
- Quality improvements: Ongoing error correction efforts
- Multilingual extensions: Adaptations for non-English languages
- Domain specialization: Field-specific variants
- Reasoning focus: Shift from knowledge to reasoning evaluation
Successor Benchmarks
Several benchmarks build upon MMLU's foundation:
- MMLU-Pro: More challenging with 10-option questions
- GPQA: Graduate-level questions
- ARC: Advanced reasoning challenges
- BigBench: Broader task diversity
See Also
- Large language models
- Benchmark (computing)
- Natural language processing
- Multiple choice
- Zero-shot learning
- Few-shot learning
- AI evaluation
- Knowledge representation
References
1. Hendrycks, Dan, et al. "Measuring Massive Multitask Language Understanding." arXiv preprint arXiv:2009.03300 (2020).
2. "MMLU." Wikipedia. https://en.wikipedia.org/wiki/MMLU. Accessed 2025.
3. Various AI leaderboards. Accessed January 2025.
4. Gema, Aryo Pradipta, et al. "Are We Done with MMLU?" arXiv:2406.04127 (2024).
5. Wang, Yubo, et al. "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark." arXiv:2406.01574 (2024).
6. MMLU GitHub Repository. https://github.com/hendrycks/test. Accessed 2025.