# LLM Benchmarks Timeline

> Source: https://aiwiki.ai/wiki/llm_benchmarks_timeline
> Updated: 2026-06-25
> Categories: AI Benchmarks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [LLM Comparisons](/wiki/llm_comparisons) and [LLM Rankings](/wiki/llm_rankings)*

The LLM benchmarks timeline is the chronological record of the major tests used to evaluate [large language models](/wiki/large_language_model), from the [BERT](/wiki/bert)-era reading-comprehension suites of 2018 to the agentic and frontier-reasoning evaluations of 2025 and 2026. Its defining pattern is rapid saturation: almost every benchmark that becomes a public scorecard is matched or exceeded by frontier models within one to three years, forcing the community to build progressively harder tests. [MMLU](/wiki/mmlu) scores went from GPT-3's 43.9% in 2020 to GPT-4's 86.4% by March 2023 [7][39], and scores on the abstract-reasoning benchmark ARC-AGI took four years to climb from 0% (GPT-3, 2020) to 5% (GPT-4o, 2024) before [o3](/wiki/o3) reached 87.5% in December 2024 [40][6].

This page lists, for each major [benchmark](/wiki/benchmark), when it was first released, who built it, what it tries to measure, and (where applicable) when frontier models effectively saturated it. Two widely used precursors from 2016 and 2017, SQuAD and TriviaQA, are included because BERT-era models were the ones that saturated them.

LLM evaluation has gone through three distinct phases. The first phase (2018 to 2020) focused on natural language understanding tasks like reading comprehension, sentence classification, and common-sense reasoning. The second phase (2020 to 2022) shifted toward broad knowledge tests, math word problems, and code generation. The third phase (2023 onward) is dominated by graduate-level reasoning, agentic and tool-use tasks, multimodal understanding, and frontier evaluations like FrontierMath and Humanity's Last Exam designed to resist saturation for years.

## What does this timeline cover?

The table below collects the major LLM benchmarks in order of release. In the "Status" column, "Saturated" means that top models now match or exceed strong human baselines.

| Released | Benchmark | Category | Creators | Status | Paper |
| --- | --- | --- | --- | --- | --- |
| 2016-06 | SQuAD | Reading comprehension | Rajpurkar et al., Stanford | Saturated by [BERT](/wiki/bert) (2019) | [arxiv 1606.05250](https://arxiv.org/abs/1606.05250) |
| 2017-05 | TriviaQA | Knowledge QA | Joshi et al., UW | Saturated 2019 | [arxiv 1705.03551](https://arxiv.org/abs/1705.03551) |
| 2018-03 | ARC (AI2) | Science reasoning | Allen Institute for AI | Saturated 2023 | [arxiv 1803.05457](https://arxiv.org/abs/1803.05457) |
| 2018-04 | GLUE | NLU suite | NYU, UW, DeepMind | Saturated 2019 | [arxiv 1804.07461](https://arxiv.org/abs/1804.07461) |
| 2018-06 | SQuAD 2.0 | Reading comprehension | Rajpurkar et al., Stanford | Saturated 2019 | [arxiv 1806.03822](https://arxiv.org/abs/1806.03822) |
| 2018-08 | SWAG | Common-sense | Zellers et al., UW | Saturated 2018 | [arxiv 1808.05326](https://arxiv.org/abs/1808.05326) |
| 2019-03 | DROP | Reading + math | Dua et al., AI2 | Saturated 2023 | [arxiv 1903.00161](https://arxiv.org/abs/1903.00161) |
| 2019-05 | BoolQ | Yes/No QA | Google AI | Saturated | [arxiv 1905.10044](https://arxiv.org/abs/1905.10044) |
| 2019-05 | [HellaSwag](/wiki/hellaswag) | Common-sense | Zellers et al., AI2 | Saturated 2023 | [arxiv 1905.07830](https://arxiv.org/abs/1905.07830) |
| 2019-05 | SuperGLUE | NLU suite | NYU, DeepMind, FAIR | Saturated 2019 | [arxiv 1905.00537](https://arxiv.org/abs/1905.00537) |
| 2019-07 | WinoGrande | Common-sense | Sakaguchi et al., AI2 | Saturated 2023 | [arxiv 1907.10641](https://arxiv.org/abs/1907.10641) |
| 2019-11 | ARC-AGI | Abstract reasoning | François Chollet | Saturated by [o3](/wiki/o3) (2024) | [arxiv 1911.01547](https://arxiv.org/abs/1911.01547) |
| 2020-09 | [MMLU](/wiki/mmlu) | Broad knowledge | Hendrycks et al. | Saturated 2023 | [arxiv 2009.03300](https://arxiv.org/abs/2009.03300) |
| 2021-03 | [MATH](/wiki/math) | Competition math | Hendrycks et al. | Saturated 2024 | [arxiv 2103.03874](https://arxiv.org/abs/2103.03874) |
| 2021-07 | [HumanEval](/wiki/humaneval) | Code generation | OpenAI (Chen et al.) | Saturated 2024 | [arxiv 2107.03374](https://arxiv.org/abs/2107.03374) |
| 2021-08 | MBPP | Code generation | Google Research | Saturated 2024 | [arxiv 2108.07732](https://arxiv.org/abs/2108.07732) |
| 2021-09 | TruthfulQA | Factuality | Lin et al., Oxford | Hard, partially solved | [arxiv 2109.07958](https://arxiv.org/abs/2109.07958) |
| 2021-10 | GSM8K | Grade-school math | OpenAI (Cobbe et al.) | Saturated 2023 | [arxiv 2110.14168](https://arxiv.org/abs/2110.14168) |
| 2022-02 | CodeContests | Competitive code | DeepMind (AlphaCode) | Active | [Science 2022](https://www.science.org/doi/10.1126/science.abq1158) |
| 2022-06 | BIG-bench | Multi-task | 444 authors, Google | Largely saturated | [arxiv 2206.04615](https://arxiv.org/abs/2206.04615) |
| 2022-10 | BIG-Bench Hard | Multi-task | Suzgun et al., Google | Saturated 2024 | [arxiv 2210.09261](https://arxiv.org/abs/2210.09261) |
| 2023-04 | LMSYS Chatbot Arena | Human preference | LMSYS / UC Berkeley | Active | [blog](https://lmsys.org/blog/2023-05-03-arena/) |
| 2023-05 | HumanEval+ / MBPP+ | Code generation | EvalPlus | Active | [arxiv 2305.01210](https://arxiv.org/abs/2305.01210) |
| 2023-10 | [SWE-bench](/wiki/swe_bench) | Real GitHub issues | Princeton NLP | Replaced by Verified | [arxiv 2310.06770](https://arxiv.org/abs/2310.06770) |
| 2023-11 | IFEval | Instruction following | Google Research | Saturated 2024 | [arxiv 2311.07911](https://arxiv.org/abs/2311.07911) |
| 2023-11 | [GPQA](/wiki/gpqa) / GPQA-Diamond | Graduate-level science | Rein et al., NYU | Active | [arxiv 2311.12022](https://arxiv.org/abs/2311.12022) |
| 2023-11 | GAIA | General AI assistant | Meta, HuggingFace | Active | [arxiv 2311.12983](https://arxiv.org/abs/2311.12983) |
| 2023-11 | [MMMU](/wiki/mmmu) | Multimodal | Yue et al. | Active | [arxiv 2311.16502](https://arxiv.org/abs/2311.16502) |
| 2024-02 | AIME 2024 | Olympiad math | MAA (used by labs) | Saturated 2025 | [MAA AIME](https://maa.org/maa-invitational-competitions) |
| 2024-03 | LiveCodeBench | Contamination-free code | Berkeley, MIT, Cornell | Active (rolling) | [arxiv 2403.07974](https://arxiv.org/abs/2403.07974) |
| 2024-04 | Arena-Hard | Auto-rated chat | LMSYS | Active | [blog](https://lmsys.org/blog/2024-04-19-arena-hard/) |
| 2024-06 | MMLU-Pro | Broad knowledge | TIGER-Lab | Active | [arxiv 2406.01574](https://arxiv.org/abs/2406.01574) |
| 2024-06 | tau-bench | Tool-agent-user | Sierra Research | Active | [arxiv 2406.12045](https://arxiv.org/abs/2406.12045) |
| 2024-08 | [SWE-bench Verified](/wiki/swe_bench_verified) | Real GitHub issues | OpenAI + Princeton | Deprecated 2026-02 | [OpenAI](https://openai.com/index/introducing-swe-bench-verified/) |
| 2024-09 | MMMU-Pro | Multimodal | Yue et al. | Active | [arxiv 2409.02813](https://arxiv.org/abs/2409.02813) |
| 2024-10 | MLE-Bench | ML engineering agents | OpenAI | Active | [arxiv 2410.07095](https://arxiv.org/abs/2410.07095) |
| 2024-10 | SimpleQA | Short-form factuality | OpenAI | Active | [arxiv 2411.04368](https://arxiv.org/abs/2411.04368) |
| 2024-11 | FrontierMath | Research math | Epoch AI | Active | [arxiv 2411.04872](https://arxiv.org/abs/2411.04872) |
| 2024-12 | WebDev Arena | Front-end coding | LMArena | Active | [blog](https://lmarena.ai/) |
| 2024-12 | Aider Polyglot | Multi-language code | Aider AI | Active | [aider.chat](https://aider.chat/2024/12/21/polyglot.html) |
| 2025-01 | [Humanity's Last Exam](/wiki/humanitys_last_exam) | Frontier knowledge | CAIS + Scale AI | Active | [arxiv 2501.14249](https://arxiv.org/abs/2501.14249) |
| 2025-02 | AIME 2025 | Olympiad math | MAA (used by labs) | Active | [MAA AIME](https://maa.org/maa-invitational-competitions) |
| 2025-03 | ARC-AGI-2 | Abstract reasoning | François Chollet | Active | [arxiv 2505.11831](https://arxiv.org/abs/2505.11831) |
| 2025-05 | SWE-bench Live | Real GitHub issues, rolling | Microsoft Research | Active | [arxiv 2505.23419](https://arxiv.org/abs/2505.23419) |
| 2025-05 | HealthBench | Health conversations | OpenAI + 262 physicians | Active | [arxiv 2505.08775](https://arxiv.org/abs/2505.08775) |
| 2025-06 | tau2-bench | Conversational tool agents | Sierra Research | Active | [arxiv 2506.07982](https://arxiv.org/abs/2506.07982) |

## What were the first LLM benchmarks (2018, the BERT era)?

The modern wave of LLM benchmarking starts with the reading-comprehension and natural-language-understanding suites that the first [Transformer](/wiki/transformer) models were measured against. Stanford's SQuAD (2016) tested whether models could extract answer spans from Wikipedia paragraphs, and SQuAD 2.0 (June 2018) added unanswerable questions so that models also had to recognize when no answer existed [2]. NYU's GLUE benchmark (April 2018) bundled nine NLU tasks (sentence classification, paraphrase detection, entailment, similarity) into one score [1]. SWAG (August 2018) introduced grounded common-sense "next-event" multiple choice.

Within about a year of [BERT](/wiki/bert)'s arrival in late 2018, it and its successors had matched or exceeded human baselines on nearly all of these. The pattern, fast saturation by a single new model class, would repeat over and over.

## How did benchmarks evolve in 2019 (SuperGLUE and the common-sense wave)?

The SuperGLUE benchmark (May 2019) deliberately replaced the GLUE tasks that had been solved with harder ones: word-sense disambiguation, causal reasoning, multi-sentence reading comprehension. T5 all but closed the gap within a few months, reaching 89.3 against a human baseline of 89.8 [3].

Common-sense and grounded reasoning got a big push in 2019. HellaSwag (May 2019) extended SWAG with adversarial filtering and stayed unsolved by BERT-class models for years [4]. WinoGrande (July 2019) scaled the Winograd Schema Challenge to 44,000 examples of pronoun resolution that require real-world knowledge [5]. BoolQ (May 2019) used naturally occurring yes/no questions. DROP (March 2019) combined reading comprehension with discrete arithmetic.

A different track started the same year. ARC-AGI (Nov 2019), proposed by François Chollet in his "On the Measure of Intelligence" paper, focused on abstract visual reasoning rather than language [6]. It was designed to resist brute-force scaling of training data, and it held up far longer than any of the language-only benchmarks above.

## How did 2020 to 2021 shift to knowledge, math, and code?

MMLU (Sept 2020) by Dan Hendrycks and collaborators was the breakout benchmark of the GPT-3 era. It bundled 57 subjects (US history, professional law, abstract algebra, clinical knowledge, virology, machine learning) into a single multiple-choice test scraped from real exams [7]. GPT-3 scored a modest 43.9%, which left enough headroom for MMLU to serve as the headline number for years; GPT-4 reached 86.4% in March 2023, effectively saturating it [39].

Math got two big benchmarks in 2021. Hendrycks' MATH dataset (March 2021) used 12,500 competition math problems from AMC, AIME, and similar contests; the paper reports that a computer science PhD student scored about 40% [8]. OpenAI's GSM8K (October 2021) consisted of 8,500 grade-school math word problems, simpler in math content but requiring multi-step natural-language reasoning [12]. Both became the de facto math evaluations for frontier models, and both were effectively solved by 2024 ([o1](/wiki/o1) hit 94.8% on MATH; GPT-4 had already hit 92% on GSM8K).

Code generation arrived around the same time. OpenAI's HumanEval (July 2021) shipped with the original [Codex](/wiki/codex) paper: 164 hand-written Python programming problems, each checked by unit tests [9]. Google's MBPP (August 2021) added 974 entry-level Python problems [10]. The pass@1 metric (does the first sampled solution pass?) became standard. By 2024 [GPT-4o](/wiki/gpt-4o) was scoring 90.2% on HumanEval. EvalPlus (May 2023) addressed the problem that HumanEval's test cases were too weak by extending them roughly 80x, producing HumanEval+ and MBPP+ [17].
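
The pass@1 metric is the k = 1 case of the pass@k estimator introduced in the Codex paper [9]: draw n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k randomly chosen samples passes. The sketch below is a minimal illustration; the function names and the `(n_samples, n_correct)` input format are assumptions made for this example, not the paper's reference code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: sample budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples has no passing sample.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each item is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimate reduces to c / n, the fraction of samples that pass, so a single greedy sample per problem gives the familiar "percent of problems solved" number.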

Factuality got its first dedicated test in TruthfulQA (Sept 2021): 817 questions where humans commonly hold false beliefs (medicine, law, conspiracies, fictional history) [11]. Larger models initially did *worse* on this benchmark, an early example of inverse scaling.

## What was BIG-bench and the 2022 megabenchmark experiment?

BIG-bench (June 2022) was an unusual collaboration: 444 authors from 132 institutions contributing 204 tasks ranging from arithmetic to checkmate-in-one to crash blossom disambiguation [14]. Google's PaLM 540B was the first model to clear average human performance overall (61.4 versus a human baseline of 49.8). The follow-up paper, BIG-Bench Hard (Oct 2022) by Suzgun et al., picked 23 tasks where prior models had failed; chain-of-thought prompting unlocked dramatic gains on most of them [15].

2022 also saw DeepMind's AlphaCode team publish CodeContests, a dataset of competitive programming problems used to evaluate competition-level code generation [13].

## What changed in 2023 (GPT-4, GPQA, MMMU, SWE-bench)?

2023 was the year benchmarks pivoted toward graduate-level rigor and toward agentic, real-world tasks. Three benchmarks stand out:

- **GPQA** (Nov 2023, David Rein et al.) introduced 448 multiple-choice questions in biology, physics, and chemistry written by domain PhDs. Highly skilled non-expert validators with 30+ minutes and unrestricted web access reached only 34%; PhDs in the relevant field reached about 65% [20]. The hardest subset, GPQA-Diamond (198 items), became one of the most-watched scores for frontier reasoning models.
- **MMMU** (Nov 2023, Yue et al.) covered 11,500 multimodal exam questions across 30 subjects, with images of charts, diagrams, chemical structures, and medical scans interleaved with the questions [22]. It became the standard multimodal benchmark for [GPT-4V](/wiki/gpt-4v), [Gemini](/wiki/gemini), and [Claude 3](/wiki/claude_3).
- **SWE-bench** (Oct 2023, Princeton NLP) framed software engineering as a benchmark for the first time at scale: 2,294 real GitHub issues from 12 popular Python repositories, with the model expected to produce a patch that passes the original maintainer's hidden test suite [18].

Alongside these, IFEval (Nov 2023, Google) tested verifiable instruction following ("reply in exactly three paragraphs", "include the word 'falcon' twice") [19], and GAIA (Nov 2023, Meta + HuggingFace) introduced a tool-use benchmark on which human respondents scored 92% while [GPT-4](/wiki/gpt-4) with plugins managed only about 15% [21]. LMSYS Chatbot Arena, launched in April 2023 by the LMSYS group at UC Berkeley, became the leading human-preference leaderboard, ranking models by Elo (later Bradley-Terry) coefficients fitted to crowdsourced pairwise votes [16].
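
To make the pairwise-vote idea concrete, here is a minimal sketch of a standard online Elo update applied to a stream of (winner, loser) votes. It is not Chatbot Arena's actual pipeline: the K-factor of 32, the starting rating of 1000, and the model names are assumptions for illustration, and the later Bradley-Terry fit (which estimates all ratings jointly rather than vote by vote) is not shown.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Elo-predicted probability that the model rated r_a beats the model rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings: dict[str, float], winner: str, loser: str,
               k: float = 32.0, initial: float = 1000.0) -> None:
    """Apply one pairwise vote in place. k and initial are illustrative defaults."""
    r_w = ratings.get(winner, initial)
    r_l = ratings.get(loser, initial)
    surprise = 1.0 - expected_score(r_w, r_l)  # large when an underdog wins
    ratings[winner] = r_w + k * surprise
    ratings[loser] = r_l - k * surprise


def rate(votes: list[tuple[str, str]]) -> dict[str, float]:
    """Fold a sequence of (winner, loser) votes into a rating table."""
    ratings: dict[str, float] = {}
    for winner, loser in votes:
        update_elo(ratings, winner, loser)
    return ratings


# Hypothetical votes between two placeholder models.
table = rate([("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")])
```

Because each update adds to the winner exactly what it takes from the loser, the total rating mass stays constant; online Elo does, however, depend on the order of votes, which is one reason a leaderboard might prefer a jointly fitted model.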

## What benchmarks defined 2024 (agents, math frontiers, contamination defenses)?

By 2024 most pre-2023 benchmarks were saturated, and the field divided into three new directions: agentic and coding tasks, frontier math, and explicit defenses against benchmark contamination.

### Agentic and coding

- **LiveCodeBench** (Mar 2024) hosted 400+ coding problems from contests run between May 2023 and May 2024 with continuous updates, designed to defeat training-data leakage [23].
- **MLE-Bench** (Oct 2024, OpenAI) wrapped 75 Kaggle ML engineering competitions into an agent benchmark testing whether models could train models, prepare data, and submit predictions end to end [29].
- **SWE-bench Verified** (Aug 2024, OpenAI + Princeton) was a 500-task subset of SWE-bench audited by 93 contracted developers to remove ambiguous issues and unfair tests [27]. It became the dominant coding benchmark in 2024 and 2025 before OpenAI deprecated it on February 23, 2026 after finding that at least 59.4% of failed test cases were flawed and that every frontier model showed signs of training-data contamination [41]. OpenAI wrote that "improvements on SWE-bench Verified no longer reflect meaningful improvements in models' real-world software development abilities," and that gains "increasingly reflect how much the model was exposed to the benchmark at training time" [41].
- **tau-bench** (June 2024, Sierra Research) introduced a framework where an agent talks to a simulated user under business rules from real airline and retail policies; only models with strong tool-use and consistency could pass [26].
- **Aider Polyglot** (Dec 2024) used 225 hard Exercism problems across C++, Go, Java, JavaScript, Python, and Rust to evaluate code-editing skill across languages [32]. [o1](/wiki/o1) topped the initial leaderboard.
- **WebDev Arena** (Dec 2024, LMArena) had humans pick which of two model-generated single-file web apps looked and worked better.
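
The date-based defense that LiveCodeBench relies on can be sketched in a few lines: keep only the problems published after a model's training cutoff, so they cannot have appeared in its pretraining data. The `Problem` record, the problem IDs, and the cutoff date below are hypothetical; the benchmark's real data format and tooling may differ.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str
    released: date  # date the contest problem was first published


def after_cutoff(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Hypothetical problem pool and cutoff.
pool = [
    Problem("contest-001", date(2023, 6, 10)),
    Problem("contest-002", date(2024, 1, 15)),
    Problem("contest-003", date(2024, 4, 2)),
]
clean = after_cutoff(pool, training_cutoff=date(2023, 12, 31))
```

Scoring a model only on the filtered subset trades sample size for cleanliness, which is why a rolling benchmark needs a steady supply of new problems.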

### Frontier math

- **AIME 2024** (Feb 2024) is a real high-school competition, the American Invitational Mathematics Examination, whose two 15-question papers supply 30 integer-answer problems in algebra, combinatorics, geometry, and number theory. It became a standard reasoning eval for [o1](/wiki/o1), [o3](/wiki/o3), and competitors. AIME 2024 has since been reported to suffer from training-data contamination for several models.
- **FrontierMath** (Nov 2024, Epoch AI) is the most ambitious math benchmark to date: hundreds of original research-level problems written by 60+ professional mathematicians, with sample problems assessed by Fields medalists Terence Tao, Timothy Gowers, and Richard Borcherds, covering number theory, real analysis, algebraic geometry, and category theory [31][42]. At launch, leading models solved less than 2% of the problems [31][42]. Tao described the problems as "extremely challenging," suggesting that in the near term the only way to solve them is "a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages" [42].

### Anti-contamination and refresh efforts

- **MMLU-Pro** (June 2024, TIGER-Lab) responded to MMLU saturation by adding harder reasoning items, expanding choices from 4 to 10, and removing trivial questions; 12,000 cleaner items across 14 domains [25].
- **MMMU-Pro** (Sept 2024) tightened MMMU by filtering text-only-solvable items, increasing distractors, and adding a vision-only mode where the question itself is rendered as an image [28].
- **SimpleQA** (Oct 2024, OpenAI) targeted short-form factual recall with 4,326 hand-crafted questions; even GPT-4o scored under 40% [30].
- **Arena-Hard** (April 2024, LMSYS) used 500 challenging user prompts from Chatbot Arena, scored by a strong judge model, as an automatic and reproducible alternative to crowdsourced votes [24].
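
One effect of MMLU-Pro's move from 4 to 10 answer choices is a lower random-guess floor, which widens the usable score range. The short sketch below works through that arithmetic; the chance-corrected rescaling is a generic illustration and an assumption of this example, not part of MMLU-Pro's published scoring.

```python
def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing on a multiple-choice item."""
    if num_choices < 2:
        raise ValueError("need at least two answer choices")
    return 1.0 / num_choices


def above_chance(accuracy: float, num_choices: int) -> float:
    """Rescale accuracy so that random guessing maps to 0 and a perfect score to 1."""
    chance = chance_accuracy(num_choices)
    return (accuracy - chance) / (1.0 - chance)


mmlu_floor = chance_accuracy(4)       # 0.25 with four choices
mmlu_pro_floor = chance_accuracy(10)  # 0.10 with ten choices
```

A raw 25% on a four-choice test carries no signal, while the same raw 25% on a ten-choice test is already above the guessing floor.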

## What benchmarks define 2025 to 2026 (HLE, ARC-AGI-2, post-saturation evaluation)?

2025 marked the first year that frontier models started to saturate the 2024 benchmarks too. The community responded with deliberately harder evaluations.

- **Humanity's Last Exam (HLE)** (Jan 24, 2025) was assembled by the Center for AI Safety and Scale AI from contributions by nearly 1,000 subject experts across over 100 disciplines [33]. The launch set was around 3,000 expert-level questions (2,500 public plus a 500-question private holdout) [43]. At release, top models scored under 10%: GPT-4o reached 3.3%, Claude 3.5 Sonnet 4.3%, Gemini 6.2%, [o1](/wiki/o1) 9.1%, and [DeepSeek-R1](/wiki/deepseek_r1) 9.4% [43].
- **AIME 2025** (Feb 2025) replaced AIME 2024 as a clean, uncontaminated math benchmark; many labs report both numbers because AIME 2024 has known contamination.
- **ARC-AGI-2** (March 2025, ARC Prize team) was the long-awaited successor to ARC-AGI after [o3](/wiki/o3) cleared the original [34]. It uses harder symbolic-pattern tasks; in early 2025, frontier models scored only a few percent while humans scored over 60%.
- **HealthBench** (May 12, 2025, OpenAI with 262 physicians from 60 countries) introduced 5,000 health-related conversations, each scored against a physician-written rubric covering accuracy, completeness, communication, and safety [35].
- **SWE-bench Live** (May 2025, Microsoft Research) addressed the contamination problem in coding benchmarks by collecting 1,319 real GitHub issues created after Jan 2024 across 93 repositories, with continuous monthly refreshes [36].
- **tau2-bench** (June 2025, Sierra Research) extended tau-bench with a dual-control telecom domain where both the agent and a simulated user actively change shared world state; it was later extended into tau3-bench, which adds a banking domain and a voice modality [37].

Beyond this list, the modern evaluation stack now includes domain benchmarks like LegalBench, FinanceBench, ChemBench, MultiMedQA, and MedQA; safety-and-jailbreak benchmarks like JailbreakBench, HarmBench, and AdvBench; reward-model benchmarks like RewardBench; and continually-refreshed leaderboards like LiveBench. The full list runs into the hundreds.

## How and why do benchmarks get saturated?

Three mechanisms tend to defeat any given LLM benchmark within a few years:

1. **Pretraining-data contamination.** Most public benchmarks end up in scraped pretraining corpora, especially anything posted to GitHub, Hugging Face, or arXiv. Recent analyses suggest AIME 2024 was contaminated for several frontier models, with score inflation of 10 to 20 points relative to AIME 2025, and OpenAI found contamination in every frontier model it tested against SWE-bench Verified [41].
2. **Targeted fine-tuning and RLHF.** Once a benchmark becomes a public scorecard, labs explicitly tune for it. This is part of why GLUE, SuperGLUE, and BIG-Bench Hard fell so quickly after first appearing.
3. **Capability genuinely catching up.** Some benchmarks (HumanEval, GSM8K, MATH, ARC-AGI-1) were defeated by clear capability gains rather than contamination. When o3 reached 87.5% on ARC-AGI, benchmark author François Chollet called it "a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models" [40]. Reasoning models from late 2024 onward (o1, o3, DeepSeek-R1, [Claude 3.7 Sonnet](/wiki/claude_3_7_sonnet) extended thinking) cleared many tasks that had stood for years.

The newest benchmarks (FrontierMath, HLE, ARC-AGI-2, SWE-bench Live) try to defend against all three: original problems written by experts, held-out test sets, continuous refreshes, and difficulty calibrated so that even the best 2025 systems cannot brute-force them.
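
Pretraining-data contamination is often screened for with simple text-overlap heuristics. The sketch below measures what fraction of a benchmark item's word n-grams also appear in a corpus document. It is a generic illustration only: the analyses cited above do not state their methods here, and the whitespace tokenization and default window of 8 words are assumptions chosen for the example.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase word n-grams in text (empty if text has fewer than n words)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in the corpus document."""
    item = ngrams(benchmark_item, n)
    if not item:
        return 0.0
    return len(item & ngrams(corpus_doc, n)) / len(item)
```

A fraction near 1.0 suggests the item was copied verbatim into the document, while paraphrased or translated leaks would slip past a check this simple, which is part of why held-out and continuously refreshed test sets are used instead.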

## Which benchmarks are saturated and which are still hard?

As of mid-2026, nearly all of the headline benchmarks released before 2023 are saturated, including SQuAD, GLUE, SuperGLUE, HellaSwag, WinoGrande, MMLU, MATH, GSM8K, HumanEval, and BIG-Bench Hard. Evaluations that were still frontier targets in 2024 are also falling fast: AIME 2024 is effectively saturated and ARC-AGI-1 was cleared by o3. The benchmarks that remain genuinely difficult as of 2026 are the deliberately hard 2024-2025 cohort, where even the strongest models started well below human-expert ceilings: FrontierMath (under 2% at launch) [31], Humanity's Last Exam (under 10% at launch) [43], ARC-AGI-2 (low single digits for models versus 60%+ for humans) [34], and the contamination-resistant coding benchmarks SWE-bench Pro and SWE-bench Live [36][41]. These are the scores AI labs now compete on, precisely because the older benchmarks no longer separate frontier systems from one another.

## Original article tables (preserved)

The following "killed by" tables track the moment specific benchmarks were defeated by specific models, retained from the original version of this page [38].

### 2024

| Benchmark | Category | Date Created | Date Defeated | Killed By | Defeated By Model | Original Score | Final Score | Details | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ARC-AGI | Reasoning | 2019-11 | 2024-12 | Saturation | [o3](/wiki/o3) | Human Baseline: ~80% | o3: 87.5% | Abstract reasoning challenge with visual pattern completion tasks created by François Chollet. | [Paper](https://arxiv.org/abs/1911.01547), [Website](https://arcprize.org) |
| [MATH](/wiki/math) | Mathematics | 2021-03 | 2024-09 | Saturation | [o1](/wiki/o1) | Average CS PhD: ~40% | o1: 94.8% | 12K challenging competition math problems from AMC/AIME, requiring complex multi-step reasoning. | [Paper](https://arxiv.org/abs/2103.03874), [GitHub](https://github.com/hendrycks/math) |
| BIG-Bench-Hard | Multi-task | 2022-10 | 2024-06 | Saturation | [Sonnet 3.5](/wiki/claude_3_5_sonnet) | Average Human: 67.7% | Sonnet 3.5: 93.1% | A curated suite of 23 challenging tasks from BIG-Bench. | [Paper](https://arxiv.org/abs/2210.09261), [GitHub](https://github.com/suzgunmirac/BIG-Bench-Hard) |
| [HumanEval](/wiki/humaneval) | Coding | 2021-07 | 2024-05 | Saturation | [GPT-4o](/wiki/gpt-4o) | Unspecified | GPT-4o: 90.2% | 164 Python programming problems testing coding abilities. | [Paper](https://arxiv.org/abs/2107.03374), [GitHub](https://github.com/openai/human-eval) |
| IFEval | Instruction Following | 2023-11 | 2024-03 | Saturation | [Llama 3.3 70B](/wiki/llama_3_3) | Unspecified | Llama 3.3 70B: 92.1% | Evaluation suite testing multi-step instruction-following capabilities. | [Paper](https://arxiv.org/abs/2311.07911), [GitHub](https://github.com/google-research/google-research/tree/master/instruction_following_eval) |

### 2023

| Benchmark | Category | Date Created | Date Defeated | Killed By | Defeated By Model | Original Score | Final Score | Details | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GSM8K | Mathematics | 2021-10 | 2023-11 | Saturation | [GPT-4](/wiki/gpt-4) | Unspecified | GPT-4: 92.0% | 8.5K grade school math word problems requiring step-by-step solutions. | [Paper](https://arxiv.org/abs/2110.14168), [GitHub](https://github.com/openai/grade-school-math) |
| Turing Test | Conversation | 1950-10 | 2023-03 | Saturation | [GPT-4](/wiki/gpt-4) | Interrogator > 50% | Interrogator 46% | The original AI benchmark proposed by Alan Turing in 1950 (the "imitation game"). | [Paper](https://courses.cs.umbc.edu/471/papers/turing.pdf) |
| ARC (AI2) | Reasoning | 2018-03 | 2023-03 | Saturation | [GPT-4](/wiki/gpt-4) | Unspecified | GPT-4: 96.3% | Grade-school multiple-choice reasoning tasks testing logical, spatial, temporal reasoning. | [Paper](https://arxiv.org/abs/1803.05457) |
| [HellaSwag](/wiki/hellaswag) | Common Sense | 2019-05 | 2023-03 | Saturation | [GPT-4](/wiki/gpt-4) | Human: 95.6% | GPT-4: 95.3% | Multiple-choice questions about everyday scenarios with adversarial filtering. | [Paper](https://arxiv.org/abs/1905.07830) |
| [MMLU](/wiki/mmlu) | Knowledge | 2020-09 | 2023-03 | Saturation | [GPT-4](/wiki/gpt-4) | 95th pct Human: 87.0% | GPT-4: 87.3% | 57 subjects from real-world sources (professional exams) testing breadth and depth of knowledge. | [Paper](https://arxiv.org/abs/2009.03300) |
| WinoGrande | Common Sense | 2019-07 | 2023-03 | Saturation | [GPT-4](/wiki/gpt-4) | Human: 94% | GPT-4: 87.5% | Enhanced WSC with 44K problems testing common-sense pronoun resolution. | [Paper](https://arxiv.org/abs/1907.10641) |

### Pre-2023

#### 2022

| Benchmark | Category | Date Created | Date Defeated | Killed By | Defeated By Model | Original Score | Final Score | Details | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BIG-Bench | Multi-task | 2021-06 | 2022-04 | Saturation | PaLM 540B | Human: 49.8% | PaLM 540B: 61.4% | 204 tasks spanning linguistics, math, common-sense reasoning, and more. | [Paper](https://arxiv.org/abs/2206.04615), [GitHub](https://github.com/google/BIG-bench) |

#### 2019

| Benchmark | Category | Date Created | Date Defeated | Killed By | Defeated By Model | Original Score | Final Score | Details | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SuperGLUE | Language | 2019-05 | 2019-10 | Saturation | T5 | Human: 89.8% | T5: 89.3% | More challenging language understanding tasks (word sense, causal reasoning, RC). | [Paper](https://arxiv.org/abs/1905.00537) |
| WSC | Common Sense | 2012-05 | 2019-07 | Saturation | RoBERTa (w SFT) | Human: 96.5% | RoBERTa (w SFT): 90.1% | Carefully crafted sentence pairs with ambiguous pronoun references. | [Paper](https://cdn.aaai.org/ocs/4492/4492-21843-1-PB.pdf) |
| GLUE | Language | 2018-05 | 2019-06 | Saturation | XLNet | Human: 87.1% | XLNet: 88.4% | Nine tasks for evaluating NLU (inference, paraphrase, similarity, etc.). | [Paper](https://arxiv.org/abs/1804.07461) |
| TriviaQA | Knowledge | 2017-05 | 2019-06 | Saturation | SpanBERT | Human: 79.7% | SpanBERT: 83.6% | 650K QA-evidence triples requiring cross-sentence reasoning. | [Paper](https://arxiv.org/abs/1705.03551) |
| SQuAD v2.0 | Language | 2018-05 | 2019-04 | Saturation | [BERT](/wiki/bert) | Human: 89.5% | BERT: 89.5% | Extension of SQuAD adding unanswerable questions. | [Paper](https://arxiv.org/abs/1806.03822) |
| SQuAD | Language | 2016-05 | 2019-03 | Saturation | [BERT](/wiki/bert) | Human: 91.2% | BERT: 93.2% | 100,000+ QA tasks on Wikipedia articles. | [Paper](https://arxiv.org/abs/1606.05250) |

#### 2018

| Benchmark | Category | Date Created | Date Defeated | Killed By | Defeated By Model | Original Score | Final Score | Details | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SWAG | Common Sense | 2018-05 | 2018-10 | Saturation | [BERT](/wiki/bert) | Human: 88% | BERT: 86% | 113K multiple-choice questions about grounded situations (common sense "next step"). | [Paper](https://arxiv.org/abs/1808.05326) |
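
The "Date Created" and "Date Defeated" columns in the tables above make the time-to-saturation claim easy to check. The sketch below computes the gap in months for a handful of rows copied from those tables; the helper name and the choice of rows are illustrative.

```python
def months_between(created: str, defeated: str) -> int:
    """Whole months from a YYYY-MM creation date to a YYYY-MM defeat date."""
    c_year, c_month = map(int, created.split("-"))
    d_year, d_month = map(int, defeated.split("-"))
    return (d_year - c_year) * 12 + (d_month - c_month)


# (Date Created, Date Defeated) pairs as listed in the tables above.
ROWS = {
    "SuperGLUE": ("2019-05", "2019-10"),
    "GSM8K": ("2021-10", "2023-11"),
    "MMLU": ("2020-09", "2023-03"),
    "HumanEval": ("2021-07", "2024-05"),
    "ARC-AGI": ("2019-11", "2024-12"),
}

lifetimes = {name: months_between(*dates) for name, dates in ROWS.items()}
# SuperGLUE: 5, GSM8K: 25, MMLU: 30, HumanEval: 34, ARC-AGI: 61
```

Most of these rows fall inside the one-to-three-year window, with SuperGLUE falling much faster and ARC-AGI standing out as the long-lived exception at just over five years.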

## ELI5: how does this whole thing work?

Think of an LLM benchmark as a standardized exam for AI. Researchers write a big set of questions with known answers (math problems, science questions, coding tasks, reading passages), let the model take the test, and report the percentage it gets right next to how well humans do. When a new model finally beats the human baseline, people say the benchmark is "saturated" or "solved," and it stops being useful for telling top models apart. So researchers write a harder exam, the models catch up again, and the cycle repeats. This timeline is the scoreboard of that race: each row is an exam, when it was written, and (often) the model that finally aced it.

## See also

- [Benchmarks](/wiki/benchmarks) (overview)
- [LLM Comparisons](/wiki/llm_comparisons)
- [LLM Rankings](/wiki/llm_rankings)
- [MMLU](/wiki/mmlu), [GPQA](/wiki/gpqa), [SWE-bench](/wiki/swe_bench), [HumanEval](/wiki/humaneval), [Humanity's Last Exam](/wiki/humanitys_last_exam)

## References

1. Wang, A., et al. (2018). "GLUE: A Multi-Task Benchmark." [arxiv 1804.07461](https://arxiv.org/abs/1804.07461)
2. Rajpurkar, P., et al. (2018). "Know What You Don't Know: Unanswerable Questions for SQuAD." [arxiv 1806.03822](https://arxiv.org/abs/1806.03822)
3. Wang, A., et al. (2019). "SuperGLUE: A Stickier Benchmark." [arxiv 1905.00537](https://arxiv.org/abs/1905.00537)
4. Zellers, R., et al. (2019). "HellaSwag: Can a Machine Really Finish Your Sentence?" [arxiv 1905.07830](https://arxiv.org/abs/1905.07830)
5. Sakaguchi, K., et al. (2019). "WinoGrande." [arxiv 1907.10641](https://arxiv.org/abs/1907.10641)
6. Chollet, F. (2019). "On the Measure of Intelligence" (ARC-AGI). [arxiv 1911.01547](https://arxiv.org/abs/1911.01547)
7. Hendrycks, D., et al. (2020). "Measuring Massive Multitask Language Understanding (MMLU)." [arxiv 2009.03300](https://arxiv.org/abs/2009.03300)
8. Hendrycks, D., et al. (2021). "Measuring Mathematical Problem Solving with the MATH Dataset." [arxiv 2103.03874](https://arxiv.org/abs/2103.03874)
9. Chen, M., et al. (2021). "Evaluating Large Language Models Trained on Code (HumanEval)." [arxiv 2107.03374](https://arxiv.org/abs/2107.03374)
10. Austin, J., et al. (2021). "Program Synthesis with Large Language Models (MBPP)." [arxiv 2108.07732](https://arxiv.org/abs/2108.07732)
11. Lin, S., et al. (2021). "TruthfulQA." [arxiv 2109.07958](https://arxiv.org/abs/2109.07958)
12. Cobbe, K., et al. (2021). "Training Verifiers to Solve Math Word Problems (GSM8K)." [arxiv 2110.14168](https://arxiv.org/abs/2110.14168)
13. Li, Y., et al. (2022). "Competition-Level Code Generation with AlphaCode." [Science](https://www.science.org/doi/10.1126/science.abq1158)
14. Srivastava, A., et al. (2022). "Beyond the Imitation Game (BIG-bench)." [arxiv 2206.04615](https://arxiv.org/abs/2206.04615)
15. Suzgun, M., et al. (2022). "BIG-Bench Hard." [arxiv 2210.09261](https://arxiv.org/abs/2210.09261)
16. Chiang, W.-L., et al. (2023). "Chatbot Arena." [LMSYS blog](https://lmsys.org/blog/2023-05-03-arena/)
17. Liu, J., et al. (2023). "Is Your Code Generated by ChatGPT Really Correct? (EvalPlus / HumanEval+)." [arxiv 2305.01210](https://arxiv.org/abs/2305.01210)
18. Jimenez, C. E., et al. (2023). "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" [arxiv 2310.06770](https://arxiv.org/abs/2310.06770)
19. Zhou, J., et al. (2023). "Instruction-Following Evaluation (IFEval)." [arxiv 2311.07911](https://arxiv.org/abs/2311.07911)
20. Rein, D., et al. (2023). "GPQA: A Graduate-Level Google-Proof Q&A Benchmark." [arxiv 2311.12022](https://arxiv.org/abs/2311.12022)
21. Mialon, G., et al. (2023). "GAIA: a benchmark for General AI Assistants." [arxiv 2311.12983](https://arxiv.org/abs/2311.12983)
22. Yue, X., et al. (2023). "MMMU." [arxiv 2311.16502](https://arxiv.org/abs/2311.16502)
23. Jain, N., et al. (2024). "LiveCodeBench." [arxiv 2403.07974](https://arxiv.org/abs/2403.07974)
24. Li, T., et al. (2024). "From Crowdsourced Data to High-quality Benchmarks: Arena-Hard." [LMSYS blog](https://lmsys.org/blog/2024-04-19-arena-hard/)
25. Wang, Y., et al. (2024). "MMLU-Pro." [arxiv 2406.01574](https://arxiv.org/abs/2406.01574)
26. Yao, S., et al. (2024). "tau-bench." [arxiv 2406.12045](https://arxiv.org/abs/2406.12045)
27. OpenAI Preparedness team (2024). "Introducing SWE-bench Verified." [OpenAI](https://openai.com/index/introducing-swe-bench-verified/)
28. Yue, X., et al. (2024). "MMMU-Pro." [arxiv 2409.02813](https://arxiv.org/abs/2409.02813)
29. Chan, J. S., et al. (2024). "MLE-bench." [arxiv 2410.07095](https://arxiv.org/abs/2410.07095)
30. Wei, J., et al. (2024). "Measuring Short-Form Factuality (SimpleQA)." [arxiv 2411.04368](https://arxiv.org/abs/2411.04368)
31. Glazer, E., et al. (2024). "FrontierMath." [arxiv 2411.04872](https://arxiv.org/abs/2411.04872)
32. Aider AI (Dec 2024). "o1 tops aider's new polyglot leaderboard." [aider.chat](https://aider.chat/2024/12/21/polyglot.html)
33. Phan, L., et al. (2025). "Humanity's Last Exam." [arxiv 2501.14249](https://arxiv.org/abs/2501.14249)
34. Chollet, F., et al. (2025). "ARC-AGI-2." [arxiv 2505.11831](https://arxiv.org/abs/2505.11831)
35. Arora, R. K., et al. (2025). "HealthBench." [arxiv 2505.08775](https://arxiv.org/abs/2505.08775)
36. Zhang, L., et al. (2025). "SWE-bench Goes Live!" [arxiv 2505.23419](https://arxiv.org/abs/2505.23419)
37. Barres, V., et al. (2025). "tau2-bench." [arxiv 2506.07982](https://arxiv.org/abs/2506.07982)
38. R0bk (community). "Killed by LLM." [website](https://r0bk.github.io/killedbyllm/), [github](https://github.com/R0bk/killedbyllm)
39. OpenAI (2023). "GPT-4 Technical Report." [arxiv 2303.08774](https://arxiv.org/abs/2303.08774)
40. ARC Prize (Dec 2024). "OpenAI o3 Breakthrough High Score on ARC-AGI-Pub." [arcprize.org](https://arcprize.org/blog/oai-o3-pub-breakthrough)
41. OpenAI (2026). "Why we no longer evaluate SWE-bench Verified." [openai.com](https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/)
42. Epoch AI (2024). "FrontierMath: Evaluating Advanced Mathematical Reasoning in AI." [epoch.ai](https://epoch.ai/frontiermath)
43. Scale AI and CAIS (Jan 2025). "Unveiling the Results of Humanity's Last Exam." [scale.com](https://scale.com/blog/humanitys-last-exam-results)