MMLU-Redux

Updated Jun 8, 2026

Overview

MMLU-Redux is a manually re-annotated, error-corrected subset of the Massive Multitask Language Understanding (MMLU) benchmark, intended to serve as a cleaner reference for evaluating large language models (LLMs). It was introduced in the paper "Are We Done with MMLU?" by Aryo Pradipta Gema and colleagues, first posted to arXiv on June 6, 2024, and later published at the 2025 conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025) ^[1]^[2]. The work was carried out by a team spanning the University of Edinburgh, University College London, Sapienza University of Rome, the Polytechnic University of Bari, the UK Health Security Agency, AssemblyAI, and other institutions ^[2].

The project was motivated by a simple but consequential observation: MMLU, one of the most widely cited AI benchmarks for measuring the knowledge and reasoning of language models, contains a meaningful number of flawed questions and incorrect ground-truth labels. By re-annotating a sample of MMLU using a defined error taxonomy, the authors estimated that roughly 6.49 percent of MMLU questions contain errors and showed that correcting these errors materially changes model rankings on some subjects ^[1]^[2]. The original release covered 3,000 questions across 30 of MMLU's subjects; an expanded version, MMLU-Redux-2.0, covers all 57 subjects with 5,700 re-annotated questions ^[3]^[4]. Both releases are distributed publicly under a Creative Commons Attribution 4.0 (CC-BY-4.0) license on the Hugging Face Hub by the "edinburgh-dawg" organization ^[3]^[4].

The problem with MMLU

MMLU, released in 2020 by Dan Hendrycks and collaborators, is a multiple-choice benchmark covering 57 subjects ranging from elementary mathematics and US history to professional law, medicine, and virology. Each question offers four answer options and a single designated correct answer, and a model's score is its accuracy across the test set. Because MMLU is broad, easy to score automatically, and widely reported, it became a default headline number in LLM release announcements and leaderboards.

That ubiquity is precisely what makes its defects costly. The authors of MMLU-Redux argue that the original dataset, assembled by collecting questions from online sources such as practice exams, inherited several recurring problems ^[1]:

Wrong ground-truth labels, where the answer key marks an option as correct that is not actually correct.
Questions with no correct answer among the four provided options.
Questions with multiple defensible correct answers, so that a capable model can be penalized for a reasonable choice.
Poorly worded or under-specified questions, where missing context, ambiguity, or grammatical issues make the intended answer unclear.
Unclear or irrelevant answer options.

When a benchmark contains such items, a model can be marked wrong for giving a sensible or even correct response, while a model that happens to match a faulty key is rewarded. This injects noise into scores and can distort the relative ordering of models, especially on small or error-dense subjects. The MMLU-Redux paper frames its central question in its title, "Are We Done with MMLU?", and answers that the benchmark needs cleaning before its numbers can be trusted as a precise measure of capability ^[1].

How MMLU-Redux was built (error taxonomy)

To turn the problem into something measurable, the authors defined a structured error taxonomy and applied it through manual expert annotation. They sampled 100 questions at random from each subject and had 14 human experts independently assess and re-annotate them, assigning each question to a category in the taxonomy ^[1]^[3]. Annotators were encouraged to perform exact-match searches with a search engine to locate the original source of a question and verify the intended answer against credible references ^[3].
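The per-subject draw described above can be sketched as follows. This is a minimal illustration, not the authors' sampling script: the `subject` and `id` field names, the seed, and the toy subject names and sizes are all assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_per_subject(rows, per_subject=100, seed=0):
    """Draw up to `per_subject` rows uniformly at random from each subject."""
    by_subject = defaultdict(list)
    for row in rows:
        by_subject[row["subject"]].append(row)

    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = {}
    for subject in sorted(by_subject):
        items = by_subject[subject]
        # A subject with fewer rows than requested is taken whole.
        sample[subject] = rng.sample(items, min(per_subject, len(items)))
    return sample


# Toy pool with invented subject names and sizes (not real MMLU counts).
pool = [
    {"subject": name, "id": i}
    for name, size in [("subject_a", 250), ("subject_b", 130), ("subject_c", 40)]
    for i in range(size)
]
sample = sample_per_subject(pool)
print({name: len(items) for name, items in sample.items()})
# prints {'subject_a': 100, 'subject_b': 100, 'subject_c': 40}
```

Sampling a fixed number of questions per subject keeps the annotation workload predictable, at the cost of over-representing small subjects relative to their share of the full benchmark.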

The taxonomy organizes errors into two top-level groups, question assessment and ground-truth verification, with the following categories ^[1]:

Category	Type	Definition
Bad question clarity	Question assessment	The question is poorly presented in terms of clarity, grammar, or sufficiency of information.
Bad options clarity	Question assessment	The answer options are unclear, too similar, or irrelevant to the question.
No correct answer	Ground-truth verification	None of the provided options correctly answers the question.
Multiple correct answers	Ground-truth verification	More than one option can legitimately be selected as correct.
Wrong ground truth	Ground-truth verification	The designated correct answer differs from the actual correct answer.

Questions that passed inspection were marked "ok." The released datasets expose this judgment through an error_type field whose values include ok, bad_question_clarity, bad_options_clarity, no_correct_answer, multiple_correct_answers, and wrong_groundtruth, alongside an expert flag used during annotation ^[3]^[4]. To gauge consistency, the authors measured inter-annotator agreement with Cohen's Kappa on several high-error subjects, reporting values ranging from about 0.64 to 1.0 ^[1].
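For readers unfamiliar with the agreement statistic, Cohen's Kappa compares the observed agreement between two annotators, p_o, with the agreement expected by chance from their label frequencies, p_e, as kappa = (p_o - p_e) / (1 - p_e). The sketch below implements that standard definition. The two label lists are invented for illustration and are not taken from the paper's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)


# Invented annotations that reuse the error_type values listed above.
annotator_a = ["ok", "ok", "wrong_groundtruth", "ok", "no_correct_answer", "ok"]
annotator_b = ["ok", "ok", "wrong_groundtruth", "bad_question_clarity",
               "no_correct_answer", "ok"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 4))  # prints 0.7273
```

Here the annotators agree on five of six items (p_o = 30/36) and would agree on 14/36 by chance, giving kappa = 16/22. A value of 1.0 means perfect agreement, and values near 0 mean agreement no better than chance.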

The first release, hosted at the Hugging Face dataset edinburgh-dawg/mmlu-redux, contained 3,000 re-annotated questions across 30 subjects (100 per subject) ^[3]. The expanded edinburgh-dawg/mmlu-redux-2.0 extended coverage to all 57 MMLU subjects, for a total of 5,700 re-annotated questions ^[4]. The authors also published a datasheet and maintenance guidelines so that the community could contribute corrections over time ^[1].
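One way to use the annotations is to separate the items marked ok from the flagged ones before scoring. The sketch below works on hand-written rows so that it runs offline. The `question` text is invented, and the row layout is an assumption apart from the error_type field and its values. With the Hugging Face `datasets` library installed, real rows could presumably be fetched with `load_dataset("edinburgh-dawg/mmlu-redux-2.0", "<subject>", split="test")`, but the configuration and split names there are assumptions to check against the dataset card.

```python
from collections import Counter


def split_by_error_type(rows):
    """Return (clean, flagged): rows marked 'ok' and rows with any error type."""
    clean = [row for row in rows if row["error_type"] == "ok"]
    flagged = [row for row in rows if row["error_type"] != "ok"]
    return clean, flagged


# Hand-written stand-ins for dataset rows (hypothetical content).
rows = [
    {"question": "Q1", "error_type": "ok"},
    {"question": "Q2", "error_type": "wrong_groundtruth"},
    {"question": "Q3", "error_type": "ok"},
    {"question": "Q4", "error_type": "multiple_correct_answers"},
    {"question": "Q5", "error_type": "bad_question_clarity"},
]

clean, flagged = split_by_error_type(rows)
tally = Counter(row["error_type"] for row in rows)

print(len(clean), len(flagged))  # prints 2 3
print(tally.most_common(1))      # prints [('ok', 2)]
```

Keeping the flagged rows rather than discarding them makes it possible to report scores on both subsets and to inspect which error types dominate a subject.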

Findings

Aggregating the per-subject annotations and projecting via stratified sampling, the authors estimated that approximately 6.49 percent of questions across the full MMLU dataset contain errors ^[1]^[2]. The errors are far from evenly distributed. Some subjects are comparatively clean, while a handful are heavily affected. The most striking case is the Virology subset, where the authors found that about 57 percent of the analyzed questions contain errors, with a large share attributable to wrong ground-truth labels ^[1]^[2]. Other subjects with elevated error rates reported in the paper include logical fallacies, college chemistry, professional law, formal logic, and human aging ^[1].
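To see why a stratified projection differs from a plain average of per-subject error rates, consider the standard stratified estimator, which weights each subject's sampled error rate by that subject's share of the full dataset. This is a generic sketch of the technique and not necessarily the exact procedure used in the paper. The subject sizes and error counts below are invented.

```python
def stratified_error_rate(strata):
    """Estimate the overall error rate from per-stratum samples.

    `strata` is a list of (population_size, sample_size, errors_in_sample).
    Each stratum's sampled error rate is weighted by its population share.
    """
    total = sum(population for population, _, _ in strata)
    weighted = sum(
        population * (errors / sample_size)
        for population, sample_size, errors in strata
    )
    return weighted / total


# Invented numbers: one large clean subject, two smaller noisy ones.
toy_strata = [
    (1500, 100, 2),   # 2% of sampled items flagged
    (300, 100, 10),   # 10% flagged
    (200, 100, 50),   # 50% flagged
]

estimate = stratified_error_rate(toy_strata)
unweighted = sum(errors / n for _, n, errors in toy_strata) / len(toy_strata)

print(round(estimate, 4))    # prints 0.08
print(round(unweighted, 4))  # prints 0.2067
```

In this toy case the weighted estimate (8 percent) is far below the unweighted mean of the three sampled rates (about 21 percent), because the noisiest subject is also the smallest. The same effect explains how one subject can have a very high error rate while the dataset-wide estimate stays in single digits.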

Crucially, the paper shows that these errors are not merely cosmetic: removing or correcting flawed items reshuffles model rankings. The authors report that on the Virology subset, a 405-billion-parameter Llama 3.1 Instruct model ranked 16th when scored on all instances but rose to first when scored only on the verified-correct instances ^[1]. In the human sexuality subject, GPT-4 (the 0613 version) scored 0.91 and ranked fifth across all instances, but its exact-match score fell to 0.43 when evaluated only on the clean instances, dropping it to last among the top ten models considered ^[1]. These examples illustrate the paper's broader claim that label noise can obscure the true capabilities of LLMs and that benchmark cleanliness can be the difference between a model appearing best or middling on a given topic ^[1]^[2].
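The mechanism behind such rank changes can be reproduced on a toy example. The sketch below scores two hypothetical models against an answer key, first on every item and then only on items marked ok. The rows, the `answer` field, the model names, and the predictions are all invented and do not correspond to any result in the paper.

```python
def accuracy(rows, predictions, clean_only=False):
    """Exact-match accuracy against the answer key, optionally on 'ok' rows only."""
    kept = [
        (row, pred)
        for row, pred in zip(rows, predictions)
        if not clean_only or row["error_type"] == "ok"
    ]
    return sum(pred == row["answer"] for row, pred in kept) / len(kept)


def ranking(rows, model_predictions, clean_only=False):
    """Model names ordered from highest to lowest accuracy (ties broken by name)."""
    scores = {
        name: accuracy(rows, preds, clean_only)
        for name, preds in model_predictions.items()
    }
    return sorted(scores, key=lambda name: (-scores[name], name))


# "answer" is the original key, which is faulty on the two flagged rows.
rows = [
    {"answer": 0, "error_type": "ok"},
    {"answer": 1, "error_type": "ok"},
    {"answer": 2, "error_type": "wrong_groundtruth"},
    {"answer": 3, "error_type": "wrong_groundtruth"},
    {"answer": 0, "error_type": "ok"},
]
model_predictions = {
    "model_a": [0, 1, 1, 2, 0],  # right on every clean row, disagrees with faulty keys
    "model_b": [0, 3, 2, 3, 0],  # matches both faulty keys, misses one clean row
}

print(ranking(rows, model_predictions))                   # prints ['model_b', 'model_a']
print(ranking(rows, model_predictions, clean_only=True))  # prints ['model_a', 'model_b']
```

On all five items model_b scores 0.8 against model_a's 0.6, but two of model_b's matches are with faulty keys. Restricted to the three clean items, model_a scores 1.0 and model_b about 0.67, so the order flips.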

Relationship to MMLU-Pro and other variants

MMLU-Redux is one of several efforts to address the limitations of the original MMLU, but it takes a distinctive approach. Rather than replacing MMLU with a harder or augmented test, it re-annotates the existing questions to identify which items are trustworthy, producing a diagnostic and a cleaner reference rather than a new exam.

MMLU-Pro, developed concurrently in 2024, instead filters the original MMLU, adds more challenging questions, and expands the number of answer options from four to ten in order to reduce guessing and raise the difficulty ceiling. The MMLU-Redux authors note this concurrent work and observe that MMLU-Pro also surfaces quality issues in the underlying data. They further report that some errors from the original MMLU persist into MMLU-Pro, which they cite as evidence that careful re-annotation remains valuable even alongside harder successor benchmarks ^[1]. The two efforts are therefore complementary: MMLU-Pro targets difficulty and headroom, while MMLU-Redux targets label correctness and ambiguity.
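The effect of moving from four to ten options on guessing is simple arithmetic: a model that picks uniformly at random is right with probability one over the number of options. The short sketch below makes that concrete. The function names are illustrative, and the 100-question quiz size is an arbitrary choice.

```python
def chance_accuracy(num_options):
    """Expected accuracy of uniform random guessing on one question."""
    return 1 / num_options


def expected_lucky_hits(num_questions, num_options):
    """Expected number of questions answered correctly by guessing alone."""
    return num_questions * chance_accuracy(num_options)


for options in (4, 10):
    print(options, chance_accuracy(options), expected_lucky_hits(100, options))
# prints:
# 4 0.25 25.0
# 10 0.1 10.0
```

Lowering the chance floor from 25 percent to 10 percent widens the usable score range, which is a separate concern from whether the answer key itself is correct.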

These projects sit within a wider family of MMLU descendants and multilingual or multimodal extensions, such as translated versions of MMLU and domain-specific spin-offs. MMLU-Redux's contribution to that ecosystem is methodological as much as it is a dataset: a reusable taxonomy and protocol for auditing the correctness of multiple-choice benchmarks.

Significance

MMLU-Redux helped crystallize a growing concern in LLM evaluation that benchmark integrity, not just benchmark difficulty, shapes the conclusions drawn from leaderboards. By quantifying an overall error rate near 6.5 percent and exposing extreme cases such as virology, the work gave practitioners concrete reasons to treat small differences in MMLU scores with caution and to prefer cleaned subsets when fine-grained comparisons matter ^[1]^[2]. The accompanying open datasets, transparent error taxonomy, and inter-annotator agreement statistics made the analysis reproducible and extensible, and the CC-BY-4.0 license lowered the barrier to adoption in research and model-development pipelines ^[3]^[4].

More broadly, the paper reinforced a shift in the evaluation community toward auditing and maintaining benchmarks as living artifacts rather than treating them as fixed ground truth. Its title question, "Are We Done with MMLU?", became a shorthand for the argument that widely used benchmarks deserve the same scrutiny applied to the models they measure, and that careful human re-annotation can recover signal that would otherwise be lost to noisy labels.

References

1. Gema, Aryo Pradipta; Leang, Joshua Ong Jun; Hong, Giwon; Devoto, Alessio; et al. "Are We Done with MMLU?" arXiv:2406.04127, June 6, 2024. https://arxiv.org/abs/2406.04127
2. Gema, Aryo Pradipta; et al. "Are We Done with MMLU?" Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025). https://aclanthology.org/2025.naacl-long.262/
3. "edinburgh-dawg/mmlu-redux." Hugging Face Datasets. https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux
4. "edinburgh-dawg/mmlu-redux-2.0." Hugging Face Datasets. https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0
