Measuring Massive Multitask Language Understanding
Abbreviation
MMLU
Description
A comprehensive benchmark evaluating large language models across 57 diverse academic subjects through multiple-choice questions
Property "Description" (as page type) with input value "A comprehensive benchmark evaluating large language models across 57 diverse academic subjects through multiple-choice questions" contains invalid characters or is incomplete and therefore can cause unexpected results during a query or annotation process.
Release date
2020-09-07
Latest version
MMLU-Pro
Benchmark updated
2024-06-03
Authors
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt
Organization
University of California, Berkeley
Technical Details
Type
Multitask Language Understanding, Knowledge Evaluation
MMLU (Measuring Massive Multitask Language Understanding) is a comprehensive benchmark designed to evaluate large language models across 57 diverse academic and professional subjects through multiple-choice questions. Created by researchers at the University of California, Berkeley and released in September 2020, MMLU has become one of the most widely adopted benchmarks for assessing general knowledge and reasoning capabilities in artificial intelligence systems. The benchmark consists of 15,908 questions spanning topics from elementary mathematics to professional law, with difficulty levels ranging from high school to expert professional knowledge.[1][2]
Overview
MMLU was developed to address the need for a comprehensive evaluation framework that could assess language models across multiple domains simultaneously, testing both world knowledge and problem-solving abilities. The benchmark emerged from the recognition that existing evaluation methods often focused on narrow domains or specific tasks, failing to capture the breadth of knowledge required for artificial general intelligence.[1]
The benchmark's design philosophy emphasizes zero-shot and few-shot learning, evaluating models on their pre-trained knowledge without task-specific fine-tuning. This approach provides insights into the general capabilities of language models rather than their ability to memorize specific datasets. By 2024, MMLU had been downloaded over 100 million times, establishing itself as a standard evaluation metric in the AI research community.[2]
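The few-shot setup can be made concrete with a short sketch. The Python fragment below is a minimal illustration rather than the authors' evaluation harness; it assumes records shaped like the Hugging Face `cais/mmlu` distribution (`question`, `choices`, `answer` fields) and builds a k-shot prompt in the style described in the original paper, with k=0 recovering the zero-shot setting.

```python
# Minimal sketch of MMLU-style few-shot prompting (illustrative only).
# Assumes records shaped like the Hugging Face `cais/mmlu` dataset:
# {"question": str, "choices": [str, str, str, str], "answer": int}.

LETTERS = "ABCD"

def format_question(record, include_answer=False):
    """Render one record in the standard A/B/C/D multiple-choice layout."""
    lines = [record["question"]]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(record["choices"])]
    lines.append("Answer:" + (f" {LETTERS[record['answer']]}" if include_answer else ""))
    return "\n".join(lines)

def build_few_shot_prompt(subject, dev_examples, test_record, k=5):
    """Concatenate k solved development examples with one unsolved test question."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n\n".join(format_question(ex, include_answer=True) for ex in dev_examples[:k])
    return header + shots + "\n\n" + format_question(test_record)
```

A model is then scored on whether the letter it produces after "Answer:" matches the gold option, so no task-specific fine-tuning is involved.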
Methodology
Dataset Construction
MMLU's questions were sourced from various educational materials including textbooks, online resources, and practice exams. The dataset was carefully curated to ensure:[1]
Diverse coverage: Questions span 57 subjects across four major categories
Difficulty variation: Content ranges from elementary to professional level
Standardized format: All questions use 4-option multiple choice (A, B, C, D); see the loading sketch after this list
Quality control: Manual review to ensure accuracy and clarity
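As a concrete illustration of the standardized format, the published dataset can be inspected with the Hugging Face `datasets` library. This is a sketch assuming the community `cais/mmlu` mirror and its field names; other distributions may differ.

```python
from datasets import load_dataset

# Load one of the 57 subject configurations from the Hugging Face hub
# (assumes the `cais/mmlu` mirror; other mirrors may use different names).
subset = load_dataset("cais/mmlu", "abstract_algebra")

# Every question should follow the standardized 4-option format.
for split_name, split in subset.items():
    for record in split:
        assert len(record["choices"]) == 4      # options A, B, C, D
        assert record["answer"] in range(4)     # gold answer stored as index 0-3
    print(f"{split_name}: {len(split)} questions")
```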
Dataset Structure
The complete MMLU dataset is organized into three splits: a few-shot development set with five questions per subject (used as in-context examples), a validation set, and a held-out test set on which results are reported. The 57 subjects are grouped into four broad categories: humanities, social sciences, STEM, and other (covering business, health, and miscellaneous topics).[1]
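Because the dataset is organized by subject, reported scores are aggregated from per-subject results. The sketch below is an illustration rather than the official scoring script; it shows two common aggregations, micro-averaged accuracy over all questions and macro-averaged accuracy over subjects (the per-subject counts in the usage example are made up for illustration).

```python
def aggregate_mmlu(per_subject):
    """per_subject: dict mapping subject name -> (num_correct, num_total)."""
    total_correct = sum(c for c, _ in per_subject.values())
    total_questions = sum(t for _, t in per_subject.values())
    micro = total_correct / total_questions  # weights subjects by question count
    macro = sum(c / t for c, t in per_subject.values()) / len(per_subject)  # weights subjects equally
    return micro, macro

micro, macro = aggregate_mmlu({
    "abstract_algebra": (57, 100),        # illustrative counts only
    "professional_law": (930, 1534),
})
print(f"micro: {micro:.3f}, macro: {macro:.3f}")
```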
References
[1] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). "Measuring Massive Multitask Language Understanding." Proceedings of the International Conference on Learning Representations (ICLR 2021). arXiv:2009.03300.
[2] Google DeepMind. (2024). "Gemini: A Family of Highly Capable Multimodal Models." arXiv:2312.11805.
[3] Gema, A.P., Leang, J.O.J., Hong, G., Devoto, A., Mancino, A.C.M., Saxena, R., He, X., Zhao, Y., Du, X., Madani, M.R.G., Barale, C., McHardy, R., Harris, J., Kaddour, J., van Krieken, E., & Minervini, P. (2024). "Are We Done with MMLU?" Proceedings of NAACL 2025. arXiv:2406.04127.
[4] Dong, G., Yuan, H., Lu, K., Li, C., Xue, M., Liu, D., Wang, W., Yuan, Z., Zhou, C., & Zhou, J. (2024). "Investigating Data Contamination in Modern Benchmarks for Large Language Models." arXiv:2311.09783.
[5] Microsoft Research. (2024). "MMLU-CF: A Contamination-free Multi-task Language Understanding Benchmark." Proceedings of ACL 2025. https://github.com/microsoft/MMLU-CF
[6] Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., Li, T., Ku, M., Wang, K., Zhuang, A., Fan, R., Yue, X., & Chen, W. (2024). "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark." Proceedings of NeurIPS 2024. arXiv:2406.01574.
[7] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L.A., Welbl, J., Clark, A., et al. (2022). "Training Compute-Optimal Large Language Models." arXiv:2203.15556.