MMLU



MMLU
Overview
Full name Measuring Massive Multitask Language Understanding
Abbreviation MMLU
Description A comprehensive benchmark evaluating large language models across 57 diverse academic subjects through multiple-choice questions


Release date 2020-09-07
Latest version MMLU-Pro
Benchmark updated 2024-06-03
Authors Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt
Organization University of California, Berkeley
Technical Details
Type Multitask Language Understanding, Knowledge Evaluation
Modality Text
Task format Multiple choice (4 options)
Number of tasks 57
Total examples 15,908
Evaluation metric Accuracy, macro-average across subjects
Domains STEM, Humanities, Social Sciences, Professional Fields
Languages English
Performance
Human performance 89.8%
Baseline (random) 25.0%
SOTA score 90.0%
SOTA model o1-preview (OpenAI)
SOTA date 2025-01-01
Saturated Yes
Resources
Website Official website
Paper Paper
GitHub Repository
Dataset Download
License MIT License
Successor MMLU-Pro


MMLU (Measuring Massive Multitask Language Understanding) is a comprehensive benchmark designed to evaluate large language models across 57 diverse academic and professional subjects through multiple-choice questions. Created by researchers at the University of California, Berkeley and released in September 2020, MMLU has become one of the most widely adopted benchmarks for assessing general knowledge and reasoning capabilities in artificial intelligence systems. The benchmark consists of 15,908 questions spanning topics from elementary mathematics to professional law, with difficulty levels ranging from high school to expert professional knowledge.[1][2]

Overview

MMLU was developed to address the need for a comprehensive evaluation framework that could assess language models across multiple domains simultaneously, testing both world knowledge and problem-solving abilities. The benchmark emerged from the recognition that existing evaluation methods often focused on narrow domains or specific tasks, failing to capture the breadth of knowledge required for artificial general intelligence.[1]

The benchmark's design philosophy emphasizes zero-shot and few-shot learning, evaluating models on their pre-trained knowledge without task-specific fine-tuning. This approach provides insights into the general capabilities of language models rather than their ability to memorize specific datasets. By 2024, MMLU had been downloaded over 100 million times, establishing itself as a standard evaluation metric in the AI research community.[2]

Methodology

Dataset Construction

MMLU's questions were sourced from various educational materials including textbooks, online resources, and practice exams. The dataset was carefully curated to ensure:[1]

  • Diverse coverage: Questions span 57 subjects across four major categories
  • Difficulty variation: Content ranges from elementary to professional level
  • Standardized format: All questions use 4-option multiple choice (A, B, C, D)
  • Quality control: Manual review to ensure accuracy and clarity

Dataset Structure

The complete MMLU dataset is organized as follows:

Component Number of Questions Purpose
Development Set 285 (5 per subject) Few-shot examples
Validation Set 1,540 Hyperparameter tuning
Test Set 14,079 Main evaluation
Total 15,908 Complete benchmark
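
The split sizes above can be checked programmatically. The sketch below is a minimal example assuming the Hugging Face datasets library and the commonly used cais/mmlu mirror with its "all" configuration; the dataset identifier, split names, and field layout may differ between mirrors, and some mirrors also ship a large auxiliary training split.

```python
# Minimal sketch: inspect MMLU split sizes with the Hugging Face `datasets` library.
# Assumes the community mirror "cais/mmlu" and its "all" configuration; the dataset id,
# split names, and field names are assumptions and may vary between mirrors.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "all")  # downloads the dev/validation/test splits

for split_name, split in mmlu.items():
    print(f"{split_name}: {len(split)} questions")

# Each record typically contains:
#   question (str), choices (list of 4 strings), answer (int index 0-3), subject (str)
example = mmlu["test"][0]
print(example["question"], example["choices"], example["answer"])
```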

Evaluation Paradigms

MMLU supports multiple evaluation approaches:[1]

  • Zero-shot: Direct evaluation without examples
  • Few-shot: Up to 5 examples per subject provided
  • Chain-of-thought: Models can show reasoning steps
  • Direct answer: Models provide only the letter choice

The primary metric is accuracy through exact string matching, where models must produce the correct letter (A, B, C, or D) to receive credit.
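
As an illustration of the few-shot, direct-answer setting described above, the following sketch builds a 5-shot prompt from a subject's development questions and checks the predicted letter against the gold answer. The record fields (question, choices, answer as an index 0-3) follow the Hugging Face layout assumed earlier, and query_model is a hypothetical stand-in for whatever model interface is being evaluated.

```python
# Sketch of 5-shot, direct-answer MMLU evaluation for a single question.
# Field names (question/choices/answer) follow the common Hugging Face layout (assumption);
# `query_model` is a hypothetical stand-in for the model under evaluation.
LETTERS = ["A", "B", "C", "D"]

def format_example(record, include_answer=True):
    lines = [record["question"]]
    lines += [f"{letter}) {choice}" for letter, choice in zip(LETTERS, record["choices"])]
    answer = LETTERS[record["answer"]] if include_answer else ""
    lines.append(f"Answer: {answer}")
    return "\n".join(lines)

def build_prompt(dev_records, test_record, k=5):
    shots = [format_example(r) for r in dev_records[:k]]        # k worked examples from the dev set
    query = format_example(test_record, include_answer=False)   # question the model must answer
    return "\n\n".join(shots + [query])

def is_correct(model_output, test_record):
    predicted = model_output.strip()[:1].upper()                # first character taken as the letter
    return predicted == LETTERS[test_record["answer"]]

# Usage (hypothetical): is_correct(query_model(build_prompt(dev, q)), q)
```

In practice, many evaluation harnesses avoid free-text parsing altogether by comparing the model's likelihood of each candidate letter and selecting the highest-scoring option.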

Subject Categories

STEM (22 subjects)

The STEM category covers scientific and technical fields:

Mathematics and Physics

Life Sciences

Chemistry and Computer Science

Applied Sciences

Humanities (13 subjects)

The humanities category encompasses history, philosophy, and law:

History

Philosophy and Logic

Law and Religion

Social Sciences (12 subjects)

Social sciences cover economics, psychology, and society:

Economics

Psychology and Sociology

Politics and Geography

Professional and Other (10 subjects)

Professional fields and miscellaneous topics:

Performance Results

Current Leaderboard (2025)

Top performing models on MMLU have approached, and in some cases exceeded, human expert performance:[3]

Rank Model Organization MMLU Score Evaluation Type
1 o1-preview OpenAI 90.0% 5-shot
2 Claude 3.5 Sonnet Anthropic 88.3% 5-shot
3 GPT-4o OpenAI 88.0% 5-shot
4 LLaMA 3.1 405B Meta 88.0% 5-shot
5 Qwen 2.5 72B Alibaba 85.3% 5-shot
6 Gemini 1.5 Pro Google 83.7% 5-shot
7 Claude 3 Opus Anthropic 77.35% 5-shot
- Human Expert - 89.8% -
- Random Baseline - 25.0% -

Historical Performance Evolution

The progression of model performance on MMLU demonstrates rapid advancement in AI capabilities:

Year Best Model Score Key Milestone
2020 GPT-3 175B 43.9% Initial benchmark release
2021 Gopher 280B 60.0% First model above 50%
2022 PaLM 540B 69.3% Significant architecture improvements
2023 GPT-4 86.4% Approaching human performance
2024 Multiple models ~88% Benchmark saturation begins
2025 o1-preview 90.0% Exceeds human expert performance

Performance by Subject Category

Analysis reveals significant variation in model performance across domains:[1]

Category Average Score (Top Models) Easiest Subject Hardest Subject
STEM 85% High School Mathematics (92%) Abstract Algebra (65%)
Humanities 87% World Religions (91%) Formal Logic (72%)
Social Sciences 89% Marketing (93%) Econometrics (70%)
Professional 86% Management (90%) Professional Law (75%)

Quality Analysis and Limitations

Identified Issues

Research has revealed several quality concerns in the MMLU dataset:[4]

  • Error rate: Approximately 6.5% of questions contain errors
  • Multiple correct answers: 4% of questions have ambiguous answers
  • Unclear questions: 14% lack sufficient clarity
  • Subject-specific errors: Virology has 33% incorrect answers
  • Cultural bias: Western-centric knowledge representation

Data Contamination

Studies suggest potential data contamination issues:

  • Many questions appear in online educational resources
  • Some models show anomalously high performance on specific subjects
  • Performance gaps between MMLU and newer, uncontaminated benchmarks

MMLU Variants

MMLU-Pro

Released in June 2024, MMLU-Pro addresses limitations of the original benchmark:[5]

Key improvements:

  • 10 answer choices instead of 4 (reducing random guess accuracy to 10%)
  • 12,000+ questions across 14 consolidated domains
  • Reasoning focus: Emphasis on complex reasoning over memorization
  • Quality control: Eliminated trivial and noisy questions
  • Performance impact: 16-33% accuracy drop compared to original MMLU

Other Notable Variants

Several specialized versions have been developed:

Variant Focus Key Features Release
MMLU-Redux Error correction Fixed ~1,000 problematic questions 2024
MMLU-SR Stress testing Modified terminology to test robustness 2024
CodeMMLU Programming Software engineering focus 2024
Mobile-MMLU Efficiency Optimized for mobile deployment 2025
IndicMMLU-Pro Multilingual Indian languages support 2025

Technical Implementation

Dataset Access

MMLU is available through multiple platforms:[6]

  • GitHub: Original repository with evaluation scripts and per-subject CSV data (format sketched below)
  • Hugging Face: Dataset hosting and easy integration
  • API Access: Through various evaluation platforms
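
For the original GitHub release, questions are distributed as per-subject CSV files. The sketch below reads one such file under the assumption of header-less rows in the order question, option A through D, correct letter (the layout used by the hendrycks/test repository); the file path shown is illustrative only.

```python
# Sketch: read one subject's questions from the original CSV release.
# Assumes header-less rows of the form: question, option A, B, C, D, correct letter
# (the layout used by the hendrycks/test repository); the path is illustrative.
import csv

def load_subject_csv(path):
    records = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            question, *choices, answer = row   # 6 columns per row
            records.append({"question": question, "choices": choices, "answer": answer})
    return records

# Example (illustrative path):
# test_questions = load_subject_csv("data/test/abstract_algebra_test.csv")
```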

Evaluation Protocol

Standard evaluation procedure:

```
# Example evaluation format
Question: [Question text]
A) [Option A]
B) [Option B]
C) [Option C]
D) [Option D]
Answer: [Correct letter]
```

Models are evaluated on the following metrics, computed as in the sketch below:

  • Exact match accuracy
  • Macro-average across all subjects
  • Optional: Per-category and per-subject analysis
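
The two headline numbers can be computed as in the following sketch, which assumes per-question correctness flags grouped by subject: plain (micro) accuracy weights every question equally, while the macro-average weights every subject equally regardless of its size.

```python
# Sketch: exact-match accuracy and macro-average over subjects.
# `results` maps each subject to a list of booleans (True = model chose the correct letter).
from statistics import mean

def mmlu_scores(results: dict[str, list[bool]]) -> dict[str, float]:
    per_subject = {subject: mean(outcomes) for subject, outcomes in results.items()}
    all_outcomes = [o for outcomes in results.values() for o in outcomes]
    return {
        "micro_accuracy": mean(all_outcomes),         # every question weighted equally
        "macro_accuracy": mean(per_subject.values()), # every subject weighted equally
    }

# Toy example:
print(mmlu_scores({"anatomy": [True, False, True], "virology": [True, True]}))
# micro_accuracy = 4/5 = 0.8, macro_accuracy = (2/3 + 1)/2 ≈ 0.83
```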

Integration with LLM Frameworks

MMLU is integrated into major evaluation frameworks, including:

  • lm-evaluation-harness (EleutherAI)
  • HELM (Stanford CRFM)
  • Hugging Face Open LLM Leaderboard

Impact and Significance

Research Impact

MMLU has significantly influenced AI research:[2]

  • 100+ million downloads as of 2024
  • Standard benchmark in model releases
  • 2,000+ citations in academic literature
  • Industry adoption by all major AI labs

Educational Applications

The benchmark has applications beyond model evaluation:

  • Curriculum development: Identifying knowledge gaps
  • Educational assessment: Comparing AI and human performance
  • Tutoring systems: Baseline for educational AI
  • Knowledge mapping: Understanding model capabilities

Benchmark Saturation

By 2025, MMLU is considered largely saturated:[2]

  • Top models achieve 85-90% accuracy
  • Minimal differentiation between leading systems
  • Shift toward more challenging benchmarks
  • Continued value for mid-tier model evaluation

Future Directions

Ongoing Developments

The MMLU ecosystem continues to evolve:

  • Quality improvements: Ongoing error correction efforts
  • Multilingual extensions: Adaptations for non-English languages
  • Domain specialization: Field-specific variants
  • Reasoning focus: Shift from knowledge to reasoning evaluation

Successor Benchmarks

Several benchmarks build upon MMLU's foundation:

  • MMLU-Pro: More challenging with 10-option questions
  • GPQA: Graduate-level questions
  • ARC: Advanced reasoning challenges
  • BIG-bench: Broader task diversity


References

  1. Hendrycks, Dan, et al. "Measuring Massive Multitask Language Understanding." arXiv preprint arXiv:2009.03300 (2020).
  2. Wikipedia. "MMLU." https://en.wikipedia.org/wiki/MMLU. Accessed 2025.
  3. Various AI leaderboards. Accessed January 2025.
  4. Gema, Aryo Pradipta, et al. "Are We Done with MMLU?" arXiv preprint arXiv:2406.04127 (2024).
  5. Wang, Yubo, et al. "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark." arXiv preprint arXiv:2406.01574 (2024).
  6. MMLU GitHub repository. https://github.com/hendrycks/test. Accessed 2025.
