| MMLU | |
|---|---|
| Overview | |
| Full name | Measuring Massive Multitask Language Understanding |
| Abbreviation | MMLU |
| Description | A comprehensive benchmark evaluating large language models across 57 diverse academic subjects through multiple-choice questions |
| Release date | 2020-09-07 |
| Latest version | MMLU-Pro |
| Benchmark updated | 2024-06-03 |
| Authors | Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt |
| Organization | University of California, Berkeley |
| Technical Details | |
| Type | Multitask Language Understanding, Knowledge Evaluation |
| Modality | Text |
| Task format | Multiple choice (4 options) |
| Number of tasks | 57 |
| Total examples | 15,908 |
| Evaluation metric | Accuracy, Macro-average |
| Domains | STEM, Humanities, Social Sciences, Professional Fields |
| Languages | English |
| Performance | |
| Human performance | 89.8% |
| Baseline | 25.0% |
| SOTA score | ~92.7% (o3, December 2024) |
| SOTA model | OpenAI o3, Gemini 3 Pro, Claude Opus 4.5 |
| SOTA date | 2024-2026 |
| Saturated | Yes |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | MIT License |
| Successor | MMLU-Pro |
MMLU (Measuring Massive Multitask Language Understanding) is a comprehensive benchmark designed to evaluate large language models across 57 diverse academic and professional subjects through multiple-choice questions. Created by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt at the University of California, Berkeley, with contributions from researchers at the University of Chicago, the benchmark was first released on September 7, 2020, and published as a conference paper at ICLR 2021.[1][2] MMLU consists of 15,908 questions spanning topics from elementary mathematics to professional law, with difficulty levels ranging from high school to expert professional knowledge. It quickly became the most widely cited AI evaluation benchmark of its era, with over 100 million downloads as of 2024 and citations in thousands of academic papers.[2][3]
For four years, MMLU served as the default headline number in nearly every major model release, from GPT-3 in 2020 to GPT-4 in 2023, Claude 3 in 2024, and Llama 3.1 405B later that year. By the time OpenAI's o1-preview crossed 90% in September 2024 and o3 reached approximately 92.7% by year-end, the benchmark had effectively saturated. The community moved on to harder evaluations such as MMLU-Pro, GPQA, and Humanity's Last Exam, but MMLU remains a standard inclusion in evaluation suites, both as a baseline check and as a useful discriminator for mid-tier and open-weight models.
MMLU was developed to address the need for a comprehensive evaluation framework that could assess language models across multiple domains simultaneously, testing both world knowledge and problem-solving abilities. The benchmark emerged from the recognition that existing evaluation methods often focused on narrow domains or specific tasks, failing to capture the breadth of knowledge required for artificial general intelligence.[1]
Before MMLU, the field relied on benchmarks like GLUE (General Language Understanding Evaluation, 2018), which contained nine natural language understanding tasks, and SuperGLUE (2019), which raised the difficulty with reading comprehension and coreference resolution tasks. However, models like BERT and RoBERTa achieved human-level performance on GLUE within roughly a year of its release, and SuperGLUE was similarly saturated by 2021, with many models already scoring above 90%.[4] MMLU was designed to be far more challenging and broader in scope: with 57 subjects spanning elementary to professional-level knowledge, it dwarfed GLUE's nine tasks and SuperGLUE's eight. Testing specialized domain knowledge, from abstract algebra to professional law, was a distinctive feature that set MMLU apart from earlier benchmarks that focused on elementary linguistic competence.[1][4]
The benchmark's design philosophy emphasizes zero-shot and few-shot learning, evaluating models on their pre-trained knowledge without task-specific fine-tuning. This approach provides insights into the general capabilities of language models rather than their ability to memorize specific datasets. Because of its timing and breadth, MMLU quickly became the main reference in AI papers and corporate reports, linking earlier work on narrow tasks with the emergence of general-purpose models. By 2024, MMLU had been downloaded over 100 million times, establishing itself as a standard evaluation metric in the AI research community.[2][3]
The original paper, titled simply "Measuring Massive Multitask Language Understanding," first appeared on arXiv as 2009.03300 on September 7, 2020. It was accepted to ICLR 2021 and earned an OpenReview score that placed it among the top accepted submissions that year. As of 2025, the paper has accumulated over 4,000 citations on Google Scholar, ranking among the most-cited evaluation papers in modern AI.[1][2]
Several factors contributed to MMLU's rapid adoption as the default benchmark for evaluating large language models:
Timing: MMLU arrived in September 2020, just as the scaling era of LLMs was beginning in earnest. GPT-3 had been released only three months earlier, and the field needed a benchmark that could meaningfully differentiate between increasingly capable models.
Breadth of coverage: With 57 subjects, MMLU provided a single aggregate number that could summarize a model's general knowledge and reasoning across a wide range of domains. This made it convenient for model comparison in technical reports and marketing materials.
Difficulty ceiling: When first released, even the best model (GPT-3 175B) scored only 43.9%, while expert human performance was estimated at 89.8%. This left a large gap for improvement, giving the benchmark years of useful discriminative power before saturation became a problem.
Standardized format: The 4-option multiple-choice format with a fixed 5-shot evaluation protocol made results easy to reproduce and compare across different labs and model architectures.
Open availability: The dataset was released under an MIT License on GitHub and later hosted on Hugging Face, making it freely accessible to the entire research community.
Industry adoption: Every major AI lab, including OpenAI, Google DeepMind, Anthropic, and Meta, began reporting MMLU scores in their model release papers, reinforcing its position as a shared reference point. MMLU scores feature across widely followed leaderboards such as the Open LLM Leaderboard, HELM Classic, and HELM Lite.[3][4]
Discriminative power across the scaling curve: Unlike narrower benchmarks that saturated for a single model class, MMLU produced different scores for tiny base models (around 25%, indistinguishable from random), small models in the 1-10B range (typically 30-50%), mid-tier models such as Llama 2 13B (around 55%), large pretrained models (60-75%), and frontier instruction-tuned systems (80-90%). For most of the period from 2020 to 2024, that range made MMLU genuinely useful for ranking models in published leaderboards.
MMLU was led by Dan Hendrycks, then a PhD candidate at UC Berkeley advised by Dawn Song and Jacob Steinhardt. Hendrycks would later co-found the Center for AI Safety (CAIS) in 2022 and remains one of the most prolific creators of LLM evaluation benchmarks; alongside MMLU, he co-authored the MATH and ETHICS benchmarks and the much later Humanity's Last Exam.[1]
The other co-authors were Collin Burns (then UC Berkeley, later OpenAI Superalignment), Steven Basart, Andy Zou, and Mantas Mazeika (Berkeley collaborators), with senior advisors Dawn Song (UC Berkeley) and Jacob Steinhardt (UC Berkeley). The dataset assembly itself was a substantial undertaking. The team manually collected and reformatted questions from a wide range of educational sources: free practice exams, GRE prep books, AP test materials, USMLE study guides, law school admission tests, professional certification exams, and Oxford and Cambridge undergraduate problem sets. Each of the 57 subjects required at least 100 questions, with sourcing tailored to the appropriate professional or academic level for that domain.[1]
MMLU's questions were sourced from a variety of educational materials, including textbooks, online resources, and practice exams, and were carefully curated during assembly.[1]
Questions were designed to require subject-matter knowledge, not just reading comprehension. Hendrycks and colleagues deliberately avoided pulling from any single test, since reusing a high-stakes exam in full would have made contamination detection trivially easy and would have invited copyright concerns. Instead, the questions were drawn from many free practice resources and reformatted into a uniform 4-option layout.
The complete MMLU dataset is organized as follows:
| Component | Number of Questions | Purpose |
|---|---|---|
| Development set | 285 (5 per subject) | Few-shot examples |
| Validation set | 1,540 | Hyperparameter tuning |
| Test set | 14,079 | Main evaluation |
| Auxiliary training set | ~100,000 | Supervised fine-tuning experiments |
| Total (test + val + dev) | 15,908 | Complete benchmark |
Each of the 57 subjects contains a minimum of 100 test questions. The development set provides exactly 5 examples per subject, which serve as the in-context examples for the standard 5-shot evaluation protocol. The auxiliary training set was provided as an optional resource for researchers wanting to fine-tune models on similar material; in practice, almost all reported MMLU scores are evaluated zero-shot or few-shot rather than after fine-tuning.[1]
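For concreteness, the split structure described above can be inspected directly from the public Hugging Face mirror. The sketch below assumes the commonly used "cais/mmlu" dataset card and its field names ("question", "choices", "answer"); other mirrors may use different configuration or field names.

```python
# Minimal sketch of loading one MMLU subject from the Hugging Face Hub.
# Assumes the "cais/mmlu" dataset card; config and field names may differ
# in other mirrors of the benchmark.
from datasets import load_dataset

subject = "abstract_algebra"
mmlu = load_dataset("cais/mmlu", subject)

dev = mmlu["dev"]     # 5 few-shot examples for this subject
test = mmlu["test"]   # at least 100 test questions per subject

example = test[0]
print(example["question"])  # question text
print(example["choices"])   # list of the four answer options
print(example["answer"])    # integer index 0-3 of the correct option
```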
The standard evaluation protocol for MMLU uses a 5-shot in-context learning setup. In this format, the model receives five solved example questions from the same subject as context before being asked to answer a new question. Each example includes the question text, the four answer options (A through D), and the correct answer letter. The format looks like this:[1]
The following are multiple choice questions (with answers) about [subject].
[Example Question 1]
A. [Option A]
B. [Option B]
C. [Option C]
D. [Option D]
Answer: [Correct Letter]
... (4 more examples) ...
[Test Question]
A. [Option A]
B. [Option B]
C. [Option C]
D. [Option D]
Answer:
The model must produce the correct letter (A, B, C, or D) as its response. The primary metric is exact-match accuracy, and the overall score is computed as a macro-average across all 57 subjects, giving equal weight to each subject regardless of its size.[1]
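As an illustration, the 5-shot prompt assembly and macro-averaged scoring described above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation; the helper names and the assumed record fields ("question", "choices", "answer") are invented for the example.

```python
# Sketch of the standard 5-shot prompt construction and macro-averaged scoring.
LETTERS = "ABCD"

def format_example(q, include_answer=True):
    # Render one question in the standard "A./B./C./D. ... Answer: X" layout.
    lines = [q["question"]]
    for letter, option in zip(LETTERS, q["choices"]):
        lines.append(f"{letter}. {option}")
    lines.append("Answer:" + (f" {LETTERS[q['answer']]}" if include_answer else ""))
    return "\n".join(lines)

def build_prompt(subject, dev_examples, test_question):
    # dev_examples: the 5 solved dev-set questions for this subject.
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n\n".join(format_example(q) for q in dev_examples)
    return header + shots + "\n\n" + format_example(test_question, include_answer=False)

def mmlu_score(per_subject_results):
    # per_subject_results: {subject: list of booleans (answered correctly or not)}.
    # Macro-average: every subject contributes equally, regardless of its size.
    subject_acc = [sum(r) / len(r) for r in per_subject_results.values()]
    return sum(subject_acc) / len(subject_acc)
```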
Beyond the standard 5-shot protocol, MMLU is also commonly evaluated zero-shot and, for newer models, with chain-of-thought prompting followed by answer extraction, although scores obtained under these settings are not directly comparable to standard 5-shot results.[1]
The original paper established two human performance baselines:[1]
| Baseline type | Accuracy | Description |
|---|---|---|
| Expert-level | 89.8% | Estimated from 95th-percentile scores of real-world test takers in relevant domains |
| Non-expert (MTurk) | 34.5% | Amazon Mechanical Turk workers without domain expertise |
| Random guessing | 25.0% | Chance-level performance with 4 options |
The large gap between non-expert (34.5%) and expert (89.8%) human performance highlights that MMLU questions require genuine domain knowledge, not just general reading comprehension or common sense. The 89.8% expert ceiling is itself a soft upper bound rather than a literal cap on possible accuracy: it reflects the average performance of competent humans across all 57 subjects, and a model that strictly outperformed expert humans on every subject could in principle exceed this number.
The 57 subjects are organized into four broad categories, which are used for reporting sub-scores by domain; the overall macro-average continues to weight each of the 57 subjects equally.
The STEM category covers scientific and technical fields, ranging from abstract algebra at the graduate level to elementary mathematics suitable for grade-school students.
The humanities category encompasses history, philosophy, and law.
The social sciences category covers economics, psychology, and other society-focused subjects such as sociology, geography, and government.
The remaining category groups professional fields and miscellaneous topics, including medicine, business, and health.
The Professional Law subject deserves special mention. With 1,534 questions sourced from law school admission tests and bar exam study materials, it is by far the largest subject in MMLU, accounting for roughly 11% of the total test set. As a result, even though the macro-average gives every subject equal weight, models that struggle with legal reasoning leave a particularly conspicuous deficit on this subject in subject-by-subject reports.
When MMLU was first released, even the best available models performed far below expert-level accuracy. The original paper reported the following results using 5-shot evaluation:[1]
| Model | Parameters | MMLU score (5-shot) | Notes |
|---|---|---|---|
| Random Baseline | - | 25.0% | Chance-level with 4 options |
| GPT-2 | 1.5B | 32.4% | Barely above random |
| RoBERTa-base | 125M | 27.9% | Near random chance |
| UnifiedQA | 3B | 43.7% | Question-answering specialized model |
| GPT-3 (zero-shot) | 175B | 37.7% | Without in-context examples |
| GPT-3 (5-shot) | 175B | 43.9% | Best result at release |
| Human expert | - | 89.8% | 95th-percentile test takers |
GPT-3's performance was highly uneven across subjects. It achieved 69% on US Foreign Policy (its best subject) but scored near random chance on subjects like College Chemistry, highlighting the model's inconsistent knowledge coverage.[1] The authors interpreted this lopsidedness as a key finding: scale alone improved average performance, but it did not produce the kind of uniform expertise across domains that the term "general intelligence" might suggest.
The progression of model performance on MMLU from 2020 to 2026 demonstrates the rapid advancement of AI capabilities. The leaderboard below tracks the most influential reported scores; not all are directly comparable due to evaluation differences (see the section on standardized evaluation discrepancies below).
| Year | Model | Organization | Parameters | MMLU score | Evaluation | Key milestone |
|---|---|---|---|---|---|---|
| 2020 | GPT-3 | OpenAI | 175B | 43.9% | 5-shot | Initial benchmark release |
| 2021 | Gopher | DeepMind | 280B | 60.0% | 5-shot | First model above 50% |
| 2022 | Chinchilla | DeepMind | 70B | 67.5% | 5-shot | Compute-optimal training |
| 2022 | PaLM | Google | 540B | 69.3% | 5-shot | Pathways-based architecture |
| 2023 | Flan-PaLM 2-L | Google | - | 81.2% | 5-shot | Instruction tuning gains |
| 2023 | GPT-4 | OpenAI | - | 86.4% | 5-shot | Approaching human performance |
| 2024 | Claude 3 Opus | Anthropic | - | 86.8% | 5-shot | Near human-expert level |
| 2024 | Llama 3 70B Instruct | Meta | 70B | 82.0% | 5-shot | Open-weight model surpasses GPT-3.5 |
| 2024 | Gemini Ultra | Google DeepMind | - | 90.0% | CoT@32 | First reported above 90% (non-standard eval) |
| 2024 | Gemini Ultra | Google DeepMind | - | 83.7% | 5-shot | Standard protocol, comparable to peers |
| 2024 | Llama 3.1 405B Instruct | Meta | 405B | 87.3% | 5-shot | Largest open-weight frontier model |
| 2024 | GPT-4o | OpenAI | - | 88.7% | 5-shot | Multimodal flagship |
| 2024 | Claude 3.5 Sonnet | Anthropic | - | 88.7% | 5-shot | Compact high-performance model |
| 2024 | OpenAI o1-preview | OpenAI | - | 90.8% | 0-shot CoT | Reasoning model, surpasses human expert |
| 2024 | OpenAI o1 | OpenAI | - | 92.3% | 0-shot CoT | First broad release reasoning model |
| 2024 | OpenAI o3 | OpenAI | - | ~92.7% | 0-shot CoT | Reported by OpenAI in December 2024 |
| 2025 | Claude Opus 4 | Anthropic | - | ~91% | 5-shot / CoT | Frontier reasoning model, near-saturation |
| 2025 | Gemini 3 Pro | Google DeepMind | - | ~91% | 5-shot / CoT | Frontier reasoning model |
| 2026 | DeepSeek-V4-Pro-Base | DeepSeek | - | ~90.1% | 5-shot | Open-weight frontier model |
Note: Scores are not always directly comparable due to differences in evaluation methodology. Some reported scores use chain-of-thought prompting, majority voting, or other techniques that can inflate results relative to the standard 5-shot protocol. Gemini Ultra's 90.0% score, for instance, used a chain-of-thought method with uncertainty routing across 32 samples (CoT@32), while its standard 5-shot score was 83.7%.[5][6] By 2025, most labs report MMLU primarily for backward compatibility and instead lead with newer benchmarks for frontier evaluations.
A 2024 study by Stanford's Center for Research on Foundation Models (CRFM) using the HELM framework revealed that model creators frequently report higher MMLU scores than independent evaluation can reproduce. By running all models with the same prompt template and the same 5 in-context examples per subject, HELM found consistent discrepancies:[5]
| Model | Creator-reported score | HELM score | Difference |
|---|---|---|---|
| GPT-4 (0613) | 86.4% | 82.4% | -4.0 |
| Claude 3 Opus | 86.8% | 84.6% | -2.2 |
| Claude 2.1 | 78.5% | 73.5% | -5.0 |
| PaLM 2 Unicorn | 81.2% | 78.6% | -2.6 |
| Llama 3 (70B) | 79.5% | 79.3% | -0.2 |
| Mixtral (8x22B) | 77.6% | 77.8% | +0.2 |
| Gemma (7B) | 64.3% | 66.1% | +1.8 |
The HELM study identified several sources of score inflation: non-standard prompting techniques, proprietary evaluation snapshots that prevented independent verification, and insufficient documentation of prompt templates.[5] HELM also produced a side-by-side analysis showing that prompt-template variation alone (whitespace, label letters, the exact phrasing of the instruction line) could shift a single model's score by several percentage points without any change to the underlying capability.
Analysis reveals significant variation in model performance across domains:[1]
| Category | Average score (top models) | Easiest subject | Hardest subject |
|---|---|---|---|
| STEM | 85% | High School Mathematics (92%) | Abstract Algebra (65%) |
| Humanities | 87% | World Religions (91%) | Formal Logic (72%) |
| Social Sciences | 89% | Marketing (93%) | Econometrics (70%) |
| Professional | 86% | Management (90%) | Professional Law (75%) |
The pattern of easy versus hard subjects has been remarkably stable across model generations. Models tend to do best on subjects with broad popular coverage on the open web (Marketing, Management, US history) and worst on subjects that require either rigorous formal manipulation (Formal Logic, Abstract Algebra) or highly specialized professional knowledge that is not heavily represented in scraped web data (Econometrics, Professional Law). Reasoning-tuned models such as the o-series narrowed but did not fully eliminate this gap.
In June 2024, Aryo Pradipta Gema and colleagues from the University of Edinburgh published "Are We Done with MMLU?", a paper that systematically audited the quality of MMLU questions. The study, later published at NAACL 2025, introduced MMLU-Redux: a subset of 3,000 manually re-annotated questions across 30 MMLU subjects, reviewed by 14 human experts.[7]
The study found that more than 9% of the sampled questions contain errors. These errors were classified using a hierarchical taxonomy:[7]
Type 1 (question assessment) covers problems with the question itself, such as unclear or ambiguous question wording and poorly specified answer options.
Type 2 (ground truth verification) covers labeling problems: questions with no correct option among the choices, questions with multiple correct options, and questions whose recorded ground-truth label is wrong.
Wrong ground truth labels (Type 2c) were the most prevalent error type, accounting for approximately 3% of all questions. The error rates varied dramatically by subject:[7]
| Subject | Error rate | Breakdown |
|---|---|---|
| Virology | 57% | 33% wrong ground truth, 15% unclear questions, 4% multiple correct answers |
| Logical Fallacies | 26% | Mix of unclear options and wrong labels |
| College Chemistry | 25% | Incorrect answers and ambiguous questions |
| Professional Law | 18% | Multiple defensible answers |
| Business Ethics | 14% | Wrong ground truth labels |
| Formal Logic | 13% | Ambiguous question formulations |
| Human Aging | 12% | Wrong labels |
| Global Facts | 12% | Outdated or incorrect factual claims |
| Machine Learning | 11% | Ambiguous technical questions |
The impact on model evaluation was substantial. In the Virology subset, for example, Claude 3 Opus's score shifted from 54% (ranked 9th) to 88% (ranked 6th) when only correctly labeled questions were considered, demonstrating that dataset errors can significantly distort model rankings.[7]
Data contamination, where benchmark questions appear in a model's training data, has become a major concern for MMLU's reliability. Because MMLU's questions are drawn from publicly available educational materials, there is a high probability that some questions were included in the web-scraped training corpora of modern LLMs.[8][9]
A 2024 study applied a lexical contamination detection pipeline to 513 sampled MMLU test questions and found an overall contamination rate of 13.8%. The contamination was not evenly distributed: STEM subjects showed an 18.1% rate, and Philosophy had the highest individual contamination at 66.7%. The study also found that ChatGPT and GPT-4 could guess the missing answer options in benchmark test data with exact match rates of 52% and 57% respectively, suggesting significant memorization of the test questions.[8]
When contamination is controlled for, model performance drops significantly. One analysis found that model accuracy dropped by an average of 7.0 percentage points when surface wording was changed to indirect references, with drops as high as 19.8 percentage points in Law and Ethics, precisely the domains most heavily contaminated.[8]
The contamination problem is structural. The original MMLU dataset has been hosted publicly on GitHub since 2020, mirrored on Hugging Face, and copied across countless tutorials, blog posts, and educational sites. By the time a frontier model is pre-trained on a multi-trillion-token corpus scraped from the public web, the chance that some MMLU questions and their answers appear verbatim in the training data approaches certainty for any honest accounting. This is one of the main motivations behind MMLU-CF and the broader move toward held-out, contamination-free evaluations such as GPQA and Humanity's Last Exam.
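As a rough illustration of what a lexical check involves (this is not the pipeline used in the cited study), a simple n-gram overlap test can flag questions whose wording appears verbatim in a training corpus. The 13-token window below is a common heuristic choice, not a standard.

```python
# Simple lexical-overlap contamination check: flag a benchmark question as
# potentially contaminated if a long word n-gram from the question appears
# verbatim in the training corpus. Illustrative only.
def ngrams(text, n=13):
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def possibly_contaminated(question_text, corpus_ngrams, n=13):
    return not ngrams(question_text, n).isdisjoint(corpus_ngrams)

# Usage sketch: corpus_ngrams would be built once from the training documents, e.g.
# corpus_ngrams = set().union(*(ngrams(doc) for doc in training_documents))
```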
Model scores on MMLU can vary by up to 10% depending on the exact prompt template used, including factors like whitespace formatting, instruction phrasing, and answer extraction method. This sensitivity makes it difficult to compare scores across different evaluation setups and has led to calls for stricter standardization of evaluation protocols.[5][10]
A detailed Hugging Face investigation in mid-2023 traced this directly to the Open LLM Leaderboard. When the leaderboard adopted a particular MMLU prompt template, several open-source models fell several percentage points behind their reported scores, sparking confusion about whether the rankings were correct. The eventual answer was that the rankings were correct under that specific template, but the same model could legitimately report a different number under the template originally used by the model's authors. The lesson generalized: "the MMLU score" of a model is meaningful only relative to a specific implementation of the benchmark.[10]
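A minimal illustration of the template problem: the two variants below encode the same question but differ in whitespace, option labels, and the instruction line. They are invented examples, not the specific templates compared in the investigation.

```python
# Two superficially equivalent templates for the same multiple-choice item.
# Evaluating a model under each can yield different measured accuracies.
question = "Which planet is known as the Red Planet?"
options = ["Venus", "Mars", "Jupiter", "Saturn"]

template_a = (
    f"{question}\n"
    + "\n".join(f"{l}. {o}" for l, o in zip("ABCD", options))
    + "\nAnswer:"
)

template_b = (
    f"Question: {question}\n"
    + "\n".join(f"({l}) {o}" for l, o in zip("ABCD", options))
    + "\nThe correct answer is"
)
# Same content, different formatting; a model's score is only comparable
# across runs that fix one template and one answer-extraction method.
```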
Beyond errors and contamination, MMLU has other recognized limitations: it covers only English, it tests four-option multiple-choice recognition rather than free-form generation or multi-step reasoning, and some of its factual content, collected in 2020, has since become outdated.
By late 2024, MMLU was considered largely saturated as a discriminative benchmark for frontier models. Top models clustered within a narrow 86-92% accuracy band when evaluated with the standard 5-shot protocol, making it difficult to draw meaningful distinctions between leading systems.[10][11]
The saturation of MMLU followed a predictable pattern that has affected previous benchmarks:
| Benchmark | Year introduced | Year saturated | Time to saturation |
|---|---|---|---|
| GLUE | 2018 | 2019 | ~1 year |
| SuperGLUE | 2019 | 2021 | ~2 years |
| MMLU | 2020 | 2024 | ~4 years |
| GPQA | 2023 | 2025 | ~2 years |
Despite saturation at the frontier, MMLU remains useful for evaluating mid-tier and open-weight models, where there is still meaningful performance variation. Many leaderboards and evaluation platforms continue to include MMLU alongside newer, more challenging benchmarks. It also serves as a kind of regression test: a new release that posts substantially below 80% on MMLU is almost certainly weaker than the prior generation, regardless of how it scores on more selective benchmarks.[3][10]
The saturation conversation also prompted a useful methodological correction in the field. Rather than chasing a single number that frontier models had effectively topped out, leaderboards such as Stanford HELM, Vellum, Artificial Analysis, and the LM Arena began publishing aggregated views that combine MMLU with SWE-bench, HumanEval, GPQA, AIME, MATH, and Humanity's Last Exam. The shift acknowledges that no single benchmark can summarize a frontier model's capability, and that MMLU's strength was always its breadth rather than its depth.
MMLU-Pro is a more challenging successor benchmark developed by TIGER-Lab (led by Yubo Wang, Xueguang Ma, Ge Zhang, Wenhu Chen, and colleagues) and published at NeurIPS 2024. It was designed to address the saturation and noise problems of the original MMLU.[11]
Key differences from the original MMLU:
| Feature | MMLU | MMLU-Pro |
|---|---|---|
| Answer options | 4 (A-D) | 10 (A-J) |
| Total questions | 15,908 | ~12,000 |
| Subjects | 57 | 14 consolidated domains |
| Random guess baseline | 25.0% | 10.0% |
| Prompt sensitivity | 4-5% variation | ~2% variation |
| CoT benefit | Negligible or negative | Significant positive effect |
| Question focus | Knowledge recall | Complex reasoning |
MMLU-Pro's 10-option format makes random guessing far less effective (10% vs. 25% baseline) and forces models to engage in more careful discrimination between plausible answers. The benchmark was constructed by integrating harder questions from academic exams and textbooks across 14 domains: Biology, Business, Chemistry, Computer Science, Economics, Engineering, Health, History, Law, Math, Philosophy, Physics, Psychology, and Others.[11]
Performance on MMLU-Pro is significantly lower than on the original MMLU, with accuracy drops ranging from roughly 12 to 33 percentage points depending on the model. Initial MMLU-Pro scores from the original paper:[11]
| Model | MMLU-Pro score | MMLU score | Drop |
|---|---|---|---|
| Claude 3.5 Sonnet | 76.1% | 88.7% | -12.6 |
| GPT-4o | 72.6% | 88.7% | -16.1 |
| Gemini 1.5 Pro | 69.0% | 83.7% | -14.7 |
| Claude 3 Opus | 68.5% | 86.8% | -18.3 |
| GPT-4 Turbo | 63.7% | 86.4% | -22.7 |
| Llama 3 70B Instruct | 56.2% | 79.5% | -23.3 |
A notable finding was that chain-of-thought reasoning provides a substantial benefit on MMLU-Pro, in contrast to the original MMLU where CoT prompting offered little or no improvement. This indicates that MMLU-Pro questions require more genuine multi-step reasoning rather than simple knowledge recall.[11] By 2025, frontier reasoning models had pushed MMLU-Pro into the upper 80s as well; for example, Gemini 3.1 Pro Preview reportedly led the Artificial Analysis MMLU-Pro leaderboard at around 91%, with Claude Opus 4.5 (Reasoning) and Gemini 3 Pro both close behind in the high 80s to low 90s, suggesting that MMLU-Pro itself may saturate within a few years.
MMLU-Redux, introduced by Gema et al. (2024) in the paper "Are We Done with MMLU?" and published at NAACL 2025, is not a wholly new benchmark but rather an error-corrected subset of the original MMLU. It consists of 3,000 manually re-annotated questions across 30 subjects, with each question reviewed by domain experts for question clarity and answer correctness. The corrected labels provide a more reliable evaluation baseline and can reveal how much of a model's apparent performance is an artifact of dataset errors rather than genuine capability.[7]
MMLU-CF (Contamination-Free) was proposed by Microsoft researchers and published at ACL 2025. It addresses the contamination problem directly by creating an entirely new set of 20,000 multiple-choice questions (10,000 for a closed-source test set and 10,000 for an open-source validation set) sourced from broader domains with three decontamination rules designed to prevent both unintentional and malicious data leakage. Evaluation of over 40 mainstream LLMs on MMLU-CF showed that model accuracy dropped by 14 to 16 percentage points compared to the original MMLU, and performance rankings changed considerably, confirming that contamination has inflated reported scores on the original benchmark.[9]
CMMLU (Chinese MMLU) was introduced by Li et al. (2023, ACL Findings 2024) as a comprehensive Chinese-language counterpart to the English MMLU. It covers natural science, social sciences, engineering, and humanities, including more than 10 subjects that are not typically found in standard exams but are relevant to daily life in China, such as Chinese food culture, Chinese driving rules, and Chinese law. Compared to other Chinese benchmarks such as C-Eval and M3KE, CMMLU has more humanities, social science, and culture-specific subjects but fewer STEM subjects. It is one of the most widely used Chinese-language LLM benchmarks and is included by default when reporting results for Chinese-developed models such as Qwen, DeepSeek, and Baichuan.[12]
AGIEval, introduced by Microsoft researchers in 2023, takes a different approach: rather than reformatting practice exam questions, it draws problems from real, high-stakes standardized exams in both Chinese and English. Question sources include the SAT, LSAT, GMAT, GRE, GAOKAO (the Chinese national college entrance examination), and various civil service tests. AGIEval was explicitly motivated by the observation that scoring well on benchmarks composed of practice questions does not necessarily mean a model would score well on the underlying exams that the practice questions imitate.
MMMU (Massive Multi-discipline Multimodal Understanding), introduced by Yue et al. in 2024, is the multimodal counterpart to MMLU. Where MMLU is text-only, MMMU consists of college-level questions that require interpreting images such as charts, diagrams, maps, tables, music sheets, and chemical structures. CMMMU is the Chinese-language version of MMMU. These benchmarks have largely replaced MMLU as the headline number for evaluating multimodal frontier models, while MMLU continues to be reported as a text-only baseline.[13]
Several additional specialized versions have been developed to address specific limitations of the original MMLU:
| Variant | Focus | Key features | Venue/Year |
|---|---|---|---|
| MMLU-Redux | Error correction | 3,000 re-annotated questions across 30 subjects | NAACL 2025 |
| MMLU-Pro | Harder reasoning | 10 answer options, ~12,000 questions, 14 domains | NeurIPS 2024 |
| MMLU-CF | Contamination-free | 20,000 new questions with decontamination rules | ACL 2025 |
| MMLU-SR | Robustness testing | Modified terminology to test sensitivity to surface changes | 2024 |
| CodeMMLU | Programming | Software engineering and coding focus | 2024 |
| IndicMMLU-Pro | Multilingual | Adaptation for Indian languages | 2025 |
| Global-MMLU | Multilingual | Translated and culturally adapted across 42 languages | 2024 |
| CMMLU | Chinese | Native Chinese questions with 67 subjects | ACL Findings 2024 |
| AGIEval | Standardized exams | SAT, LSAT, GMAT, GRE, GAOKAO | 2023 |
| MMMU | Multimodal | College-level questions requiring images | CVPR 2024 |
MMLU is available through multiple platforms, including the authors' original GitHub repository, mirrors on the Hugging Face Hub, and built-in task implementations in evaluation frameworks such as EleutherAI's lm-evaluation-harness and Stanford HELM.[1][3]
EleutherAI's lm-evaluation-harness is the most widely used open-source framework for evaluating language models, and it has become the de facto standard for reproducible MMLU scoring on open-weight models. The harness wraps over 60 academic benchmarks, including MMLU, HellaSwag, ARC, TruthfulQA, Winogrande, GSM8K, and HumanEval. When the Hugging Face Open LLM Leaderboard ran from 2023 to 2024, it used lm-evaluation-harness as its evaluation backbone for MMLU, ensuring that all submitted models were scored under identical conditions.[14]
The harness implements MMLU primarily as a log-probability evaluation rather than a generative one. For each question, it computes the model's log-probability for each of the four answer letter tokens (A, B, C, D) given the prompt, and selects the highest as the model's answer. This avoids the parsing problems that plague generative evaluation, where a model might respond with "The answer is C" or "C) Paris" or simply "Paris" depending on instruction tuning.[14]
This log-probability approach is also why MMLU scores on the same model can differ between, say, OpenAI's official report (typically generative, since the API does not expose log-probabilities by default for chat models) and an open-source reproduction (typically log-probability based). Both numbers are real; they just measure slightly different things.
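A minimal sketch of the log-probability approach is shown below, using the Hugging Face transformers API with a small model as a stand-in. It assumes that each answer letter preceded by a space tokenizes to a single token, which holds for many BPE tokenizers but should be verified for the tokenizer in use; it is not the lm-evaluation-harness implementation itself.

```python
# Log-probability scoring in the style described above: append each answer
# letter to the prompt, score it under the model, and pick the most likely one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def pick_answer(prompt):
    # `prompt` is the full few-shot prompt ending in "Answer:".
    scores = {}
    for letter in "ABCD":
        full = prompt + " " + letter
        inputs = tokenizer(full, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        # Logits at position -2 predict the final token (the answer letter).
        last_logprobs = torch.log_softmax(logits[0, -2], dim=-1)
        letter_id = inputs["input_ids"][0, -1]
        scores[letter] = last_logprobs[letter_id].item()
    return max(scores, key=scores.get)
```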
The standard evaluation procedure follows this format:[1]
The following are multiple choice questions (with answers) about [subject name].
Question: [Question text]
A. [Option A]
B. [Option B]
C. [Option C]
D. [Option D]
Answer: [Correct letter]
(... 4 more examples ...)
Question: [Test question text]
A. [Option A]
B. [Option B]
C. [Option C]
D. [Option D]
Answer:
Models are scored on exact-match accuracy against the correct answer letter, reported both per subject and as the macro-average across all 57 subjects.
Two main approaches are used to extract a model's answer from its output: comparing the log-probabilities the model assigns to each of the four answer letters and selecting the highest, or letting the model generate free text and parsing the chosen letter from the completion.
The choice of scoring method can affect results, which is another source of variation in reported scores across different evaluations.[5][14]
A third, less common approach is full chain-of-thought generation followed by answer extraction. Here, the model is prompted to reason aloud before producing a final letter. This is the protocol used for evaluating reasoning models like the OpenAI o-series, where the bulk of the model's compute is spent on hidden reasoning before a short final answer. Reasoning-style evaluation is what produced the 90%+ MMLU scores associated with o1, o3, Claude Opus 4, and Gemini 3 Pro.
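For the generative and chain-of-thought protocols, the final letter still has to be parsed out of free-form text. The sketch below shows one common heuristic; it is an illustrative parser, not the extraction logic used by any particular lab or harness.

```python
# Extract a final answer letter from a free-form (chain-of-thought style) completion.
import re

def extract_choice(completion: str):
    # Prefer an explicit statement such as "The answer is (C)" or "Answer: C".
    m = re.search(r"answer\s*(?:is|:)?\s*\(?([ABCD])\)?", completion, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # Fall back to the last standalone capital letter A-D in the text, if any.
    letters = re.findall(r"\b([ABCD])\b", completion)
    return letters[-1] if letters else None

print(extract_choice("Let's reason step by step... so the answer is (B)."))  # prints B
```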
MMLU has had an outsized influence on AI research and development:[2][3]
MMLU played a central role in demonstrating the relationship between model scale and knowledge acquisition. The benchmark showed clear log-linear improvement with increasing model size: GPT-3's jump from 25.9% (smallest variant) to 43.9% (175B parameters) provided early evidence of scaling laws for knowledge-intensive tasks. Later, Chinchilla's result of 67.5% with only 70B parameters (compared to Gopher's 60.0% with 280B) helped establish that training data volume matters as much as parameter count, supporting the compute-optimal training paradigm.[1][15]
The Hoffmann et al. (2022) Chinchilla paper itself used MMLU as one of its four headline benchmarks (alongside The Pile, BIG-bench, and several reading-comprehension tasks). The fact that a smaller, better-trained model out-scored a larger, undertrained one on MMLU was concrete enough to convince the field that the standard pre-training recipe of "more parameters, fixed data" was suboptimal. Subsequent training work, including Llama 1, Llama 2, Llama 3, and Mistral's open releases, all explicitly adopted Chinchilla-style data scaling, with MMLU scores serving as one of the main evidence points.[15]
Beyond headline model evaluation, the benchmark's components have been reused in other ways: the auxiliary training set as fine-tuning material, the test questions as probes in contamination and memorization studies, and the per-subject scores as a diagnostic of a model's domain coverage.
MMLU also influenced how subsequent benchmarks were designed and presented. The pattern of "57 subjects, 4-option MCQ, macro-average, 5-shot in-context examples, expert baseline" became a template that was reused or deliberately adapted by later benchmarks. CMMLU, AGIEval, MMMU, and even non-multiple-choice benchmarks like GPQA and Humanity's Last Exam all carry visible traces of MMLU's design choices, from the use of dev/val/test splits to the convention of reporting both per-subject and aggregate accuracy.
The MMLU ecosystem continues to evolve in response to the limitations identified by the research community.
As MMLU has become saturated for frontier models, several newer benchmarks have emerged to provide more discriminative evaluation:
| Benchmark | Focus | Why it extends beyond MMLU |
|---|---|---|
| MMLU-Pro | Harder multitask | 10 options, reasoning-focused questions |
| GPQA | Graduate-level expertise | PhD-level questions in physics, biology, chemistry |
| Humanity's Last Exam | Frontier difficulty | Expert-level questions designed to resist current models |
| SWE-bench | Real software engineering | Patches against real GitHub issues |
| HumanEval | Code generation | Functional programming challenges |
| GSM8K | Math reasoning | Grade school math word problems |
| BigBench | Broader task diversity | 200+ tasks beyond multiple choice |
| AIME | Competition math | High-difficulty problems with verifiable answers |
| FrontierMath | Research math | Problems written by professional mathematicians |
In the current landscape, frontier model releases typically lead with GPQA Diamond, AIME, SWE-bench Verified, Humanity's Last Exam, and Arena Elo, with MMLU and MMLU-Pro reported lower in the table as legacy comparisons. Mid-tier and open-weight model releases continue to feature MMLU prominently, since the saturation has not yet hit those tiers as hard.