LLM Benchmarks Timeline
Last reviewed: May 10, 2026
Sources: 38 citations
Review status: Source-backed
Revision: v2 · 4,119 words
See also: LLM Comparisons and LLM Rankings
This is a chronological timeline of major benchmarks used to evaluate large language models (LLMs). Each entry lists when the benchmark was first released, who built it, what it tries to measure, and (where applicable) when frontier models effectively saturated it. The page covers benchmarks from the BERT era (2018) through the agentic and reasoning era of 2025 and 2026.
LLM evaluation has gone through several distinct phases. The first phase (2018 to 2020) focused on natural language understanding tasks like reading comprehension, sentence classification, and common-sense reasoning. The second phase (2020 to 2022) shifted toward broad knowledge tests, math word problems, and code generation. The third phase (2023 onward) is dominated by graduate-level reasoning, agentic and tool-use tasks, multimodal understanding, and frontier evaluations like FrontierMath and Humanity's Last Exam designed to resist saturation for years.
The table below collects the major LLM benchmarks in order of release. The "Status" column indicates whether top models now match or exceed strong human baselines (saturated) or the benchmark still has headroom (active).
| Released | Benchmark | Category | Creators | Status | Paper |
|---|---|---|---|---|---|
| 2016-06 | SQuAD | Reading comprehension | Rajpurkar et al., Stanford | Saturated by BERT (2019) | arxiv 1606.05250 |
| 2017-05 | TriviaQA | Knowledge QA | Joshi et al., UW | Saturated 2019 | arxiv 1705.03551 |
| 2018-03 | ARC (AI2) | Science reasoning | Allen Institute for AI | Saturated 2023 | arxiv 1803.05457 |
| 2018-04 | GLUE | NLU suite | NYU, UW, DeepMind | Saturated 2019 | arxiv 1804.07461 |
| 2018-06 | SQuAD 2.0 | Reading comprehension | Rajpurkar et al., Stanford | Saturated 2019 | arxiv 1806.03822 |
| 2018-08 | SWAG | Common-sense | Zellers et al., UW | Saturated 2018 | arxiv 1808.05326 |
| 2019-03 | DROP | Reading + math | Dua et al., AI2 | Saturated 2023 | arxiv 1903.00161 |
| 2019-05 | BoolQ | Yes/No QA | Google AI | Saturated | arxiv 1905.10044 |
| 2019-05 | HellaSwag | Common-sense | Zellers et al., AI2 | Saturated 2023 | arxiv 1905.07830 |
| 2019-05 | SuperGLUE | NLU suite | NYU, DeepMind, FAIR | Saturated 2019 | arxiv 1905.00537 |
| 2019-07 | WinoGrande | Common-sense | Sakaguchi et al., AI2 | Saturated 2023 | arxiv 1907.10641 |
| 2019-11 | ARC-AGI | Abstract reasoning | François Chollet | Saturated by o3 (2024) | arxiv 1911.01547 |
| 2020-09 | MMLU | Broad knowledge | Hendrycks et al. | Saturated 2023 | arxiv 2009.03300 |
| 2021-03 | MATH | Competition math | Hendrycks et al. | Saturated 2024 | arxiv 2103.03874 |
| 2021-07 | HumanEval | Code generation | OpenAI (Chen et al.) | Saturated 2024 | arxiv 2107.03374 |
| 2021-08 | MBPP | Code generation | Google Research | Saturated 2024 | arxiv 2108.07732 |
| 2021-09 | TruthfulQA | Factuality | Lin et al., Oxford | Hard, partially solved | arxiv 2109.07958 |
| 2021-10 | GSM8K | Grade-school math | OpenAI (Cobbe et al.) | Saturated 2023 | arxiv 2110.14168 |
| 2022-02 | CodeContests | Competitive code | DeepMind (AlphaCode) | Active | Science 2022 |
| 2022-06 | BIG-bench | Multi-task | 444 authors, Google | Largely saturated | arxiv 2206.04615 |
| 2022-10 | BIG-Bench Hard | Multi-task | Suzgun et al., Google | Saturated 2024 | arxiv 2210.09261 |
| 2023-04 | LMSYS Chatbot Arena | Human preference | LMSYS / UC Berkeley | Active | blog |
| 2023-05 | HumanEval+ / MBPP+ | Code generation | EvalPlus | Active | arxiv 2305.01210 |
| 2023-10 | SWE-bench | Real GitHub issues | Princeton NLP | Replaced by Verified | arxiv 2310.06770 |
| 2023-11 | IFEval | Instruction following | Google Research | Saturated 2024 | arxiv 2311.07911 |
| 2023-11 | GPQA / GPQA-Diamond | Graduate-level science | Rein et al., NYU | Active | arxiv 2311.12022 |
| 2023-11 | GAIA | General AI assistant | Meta, HuggingFace | Active | arxiv 2311.12983 |
| 2023-11 | MMMU | Multimodal | Yue et al. | Active | arxiv 2311.16502 |
| 2024-02 | AIME 2024 | Olympiad math | MAA (used by labs) | Saturated 2025 | MAA AIME |
| 2024-03 | LiveCodeBench | Contamination-free code | Berkeley, MIT, Cornell | Active (rolling) | arxiv 2403.07974 |
| 2024-04 | Arena-Hard | Auto-rated chat | LMSYS | Active | blog |
| 2024-06 | MMLU-Pro | Broad knowledge | TIGER-Lab | Active | arxiv 2406.01574 |
| 2024-06 | tau-bench | Tool-agent-user | Sierra Research | Active | arxiv 2406.12045 |
| 2024-08 | SWE-bench Verified | Real GitHub issues | OpenAI + Princeton | Deprecated 2026-02 | OpenAI |
| 2024-09 | MMMU-Pro | Multimodal | Yue et al. | Active | arxiv 2409.02813 |
| 2024-10 | MLE-Bench | ML engineering agents | OpenAI | Active | arxiv 2410.07095 |
| 2024-10 | SimpleQA | Short-form factuality | OpenAI | Active | arxiv 2411.04368 |
| 2024-11 | FrontierMath | Research math | Epoch AI | Active | arxiv 2411.04872 |
| 2024-12 | WebDev Arena | Front-end coding | LMArena | Active | blog |
| 2024-12 | Aider Polyglot | Multi-language code | Aider AI | Active | aider.chat |
| 2025-01 | Humanity's Last Exam | Frontier knowledge | CAIS + Scale AI | Active | arxiv 2501.14249 |
| 2025-02 | AIME 2025 | Olympiad math | MAA (used by labs) | Active | MAA AIME |
| 2025-03 | ARC-AGI-2 | Abstract reasoning | François Chollet | Active | arxiv 2505.11831 |
| 2025-05 | SWE-bench Live | Real GitHub issues, rolling | Microsoft Research | Active | arxiv 2505.23419 |
| 2025-05 | HealthBench | Health conversations | OpenAI + 262 physicians | Active | arxiv 2505.08775 |
| 2025-06 | tau2-bench | Conversational tool agents | Sierra Research | Active | arxiv 2506.07982 |
The modern wave of LLM benchmarking starts with reading-comprehension and natural-language-understanding suites built around the Transformer. Stanford's SQuAD (2016) and SQuAD 2.0 (June 2018) tested whether models could extract answer spans from Wikipedia paragraphs and recognize when no answer existed. NYU's GLUE benchmark (April 2018) bundled nine NLU tasks (sentence classification, paraphrase detection, entailment, similarity) into one score. SWAG (August 2018) introduced grounded common-sense "next-event" multiple choice.
Within a year of BERT's release in October 2018, BERT and its successors had matched or exceeded human baselines on all of these. The pattern, fast saturation by a single new model class, would repeat over and over.
The SuperGLUE benchmark (May 2019) deliberately swapped out the GLUE tasks that had already been solved for harder ones: word-sense disambiguation, causal reasoning, multi-sentence reading comprehension. T5 saturated SuperGLUE within a few months, reaching 89.3% against a human baseline of 89.8% by October 2019.
Common-sense and grounded reasoning got a big push in 2019. HellaSwag (May 2019) extended SWAG with adversarial filtering and stayed unsolved by BERT-class models for years. WinoGrande (July 2019) scaled the Winograd Schema Challenge to 44,000 examples of pronoun resolution that require real-world knowledge. BoolQ (May 2019) used naturally occurring yes/no questions. DROP (March 2019) combined reading comprehension with discrete arithmetic.
A different track started the same year. ARC-AGI (Nov 2019), proposed by François Chollet in his "On the Measure of Intelligence" paper, focused on abstract visual reasoning rather than language. It was designed to resist brute-force scaling of training data, and it held up far longer than any of the language-only benchmarks above.
MMLU (Sept 2020) by Dan Hendrycks and collaborators was the breakout benchmark of the GPT-3 era. It bundled 57 subjects (US history, professional law, abstract algebra, clinical knowledge, virology, machine learning) into a single multiple-choice test scraped from real exams. GPT-3's modest 43.9% score left plenty of headroom, and MMLU became the headline number for years; GPT-4 reached 86.4% in March 2023, effectively saturating it.
Math got two big benchmarks in 2021. Hendrycks' MATH dataset (March 2021) used 12,500 competition math problems from AMC, AIME, and similar contests; a computer science PhD student tested by the authors scored about 40%. OpenAI's GSM8K (October 2021) consisted of 8,500 grade-school math word problems, simpler in mathematical content but requiring multi-step natural-language reasoning. Both became the de facto math evaluations for frontier models, and both were effectively solved by 2024 (o1 hit 94.8% on MATH; GPT-4 had already hit 92% on GSM8K).
Code generation arrived around the same time. OpenAI's HumanEval (July 2021) shipped with the original Codex paper: 164 hand-written Python programming problems, each graded by unit tests. Google's MBPP (August 2021) added 974 entry-level Python problems. The pass@1 metric (does the first sampled solution pass?) became standard. By 2024 GPT-4o was scoring 90.2% on HumanEval. EvalPlus (May 2023) addressed the weakness of HumanEval's original test cases by extending them roughly 80x to produce HumanEval+, and gave MBPP the same treatment to produce MBPP+.
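The pass@1 figure is the k = 1 case of pass@k. The following is a minimal sketch of the unbiased estimator described in the HumanEval paper, assuming n samples are drawn per problem and c of them pass the unit tests. The function names and the per-problem counts are illustrative assumptions, not taken from any particular evaluation harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: samples generated for the problem
    c: samples that passed the unit tests
    k: sampling budget being scored
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # 1 - (ways to pick k failing samples) / (ways to pick any k samples)
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical per-problem counts: (samples drawn, samples passing).
results = [(10, 10), (10, 3), (10, 0)]
print(round(benchmark_pass_at_k(results, 1), 3))  # 0.433
```

With k = 1 the estimator reduces to c / n per problem, so the benchmark score is simply the average fraction of passing samples.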
Factuality got its first dedicated test in TruthfulQA (Sept 2021): 817 questions where humans commonly hold false beliefs (medicine, law, conspiracies, fictional history). Larger models initially did worse on this benchmark, an early example of inverse scaling.
BIG-bench (June 2022) was an unusual collaboration: 444 authors from 132 institutions contributing 204 tasks ranging from arithmetic to checkmate-in-one to crash blossom disambiguation. Google's PaLM 540B was the first model to clear average human performance overall. The follow-up paper, BIG-Bench Hard (Oct 2022) by Suzgun et al., picked 23 tasks where prior models had failed; chain-of-thought prompting unlocked dramatic gains on most of them.
2022 also saw DeepMind release CodeContests alongside AlphaCode, a dataset of competitive programming problems used to evaluate competition-level code generation.
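2023 was the year benchmarks pivoted toward graduate-level rigor and toward agentic, real-world tasks. Three benchmarks stand out: SWE-bench (Oct 2023, Princeton NLP), built from real GitHub issues; GPQA (Nov 2023, Rein et al., NYU), a set of graduate-level science questions; and MMMU (Nov 2023, Yue et al.), a multimodal benchmark.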
Alongside these, IFEval (Nov 2023, Google) tested verifiable instruction following ("reply in exactly three paragraphs", "include the word 'falcon' twice"), and GAIA (Nov 2023, Meta + HuggingFace) introduced a tool-use benchmark on which human respondents scored 92% but GPT-4 equipped with plugins managed only about 15%. LMSYS Chatbot Arena, launched in April 2023 by researchers at UC Berkeley, became the leading human-preference leaderboard, ranking models by Elo (later Bradley-Terry) coefficients fitted to crowdsourced pairwise votes.
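To make "coefficients fitted to pairwise votes" concrete, here is a minimal sketch of a Bradley-Terry fit using the classic minorization-maximization update. It is not the Arena's actual pipeline: the model names and vote counts are hypothetical, and the sketch assumes every vote compares two distinct models and that every model has at least one win and one loss.

```python
from collections import Counter

def fit_bradley_terry(votes, iterations=200):
    """Fit Bradley-Terry strengths from a list of (winner, loser) votes.

    Repeats the update p_i <- wins_i / sum_j(n_ij / (p_i + p_j)),
    then renormalizes so the strengths sum to 1.
    """
    wins = Counter()
    pair_counts = Counter()  # unordered pair -> number of comparisons
    for winner, loser in votes:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
    models = sorted({m for pair in pair_counts for m in pair})

    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = sum(
                n / (strength[m] + strength[other])
                for pair, n in pair_counts.items() if m in pair
                for other in pair - {m}
            )
            updated[m] = wins[m] / denom
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength

def win_probability(strength, a, b):
    """Modelled probability that a is preferred over b."""
    return strength[a] / (strength[a] + strength[b])

# Hypothetical crowdsourced votes between three models.
votes = (
    [("model_a", "model_b")] * 7 + [("model_b", "model_a")] * 3
    + [("model_b", "model_c")] * 6 + [("model_c", "model_b")] * 4
    + [("model_a", "model_c")] * 8 + [("model_c", "model_a")] * 2
)
strengths = fit_bradley_terry(votes)
print(sorted(strengths, key=strengths.get, reverse=True))
```

The Elo win-probability formula has the same form with strength equal to 10^(rating / 400), so a fitted strength ratio can be expressed on an Elo-like scale as 400 · log10 of that ratio.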
By 2024 most pre-2023 benchmarks were saturated, and the field divided into three new directions: agentic and coding tasks, frontier math, and explicit defenses against benchmark contamination.
2025 marked the first year that frontier models started to saturate the 2024 benchmarks too. The community responded with deliberately harder evaluations.
Beyond this list, the modern evaluation stack now includes domain benchmarks like LegalBench, FinanceBench, ChemBench, MultiMedQA, and MedQA; safety-and-jailbreak benchmarks like JailbreakBench, HarmBench, and AdvBench; reward-model benchmarks like RewardBench; and continually refreshed leaderboards like LiveBench. The full list runs into the hundreds.
Three mechanisms tend to defeat any given LLM benchmark within a few years:
The newest benchmarks (FrontierMath, HLE, ARC-AGI-2, SWE-bench Live) try to defend against all three: original problems written by experts, held-out test sets, continuous refreshes, and difficulty calibrated so that even the best 2025 systems cannot brute-force them.
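One way a rolling benchmark can use continuous refreshes against contamination is to score a model only on problems published after its training cutoff. The sketch below illustrates that idea under stated assumptions: the problem records, identifiers, and cutoff date are hypothetical and do not reflect any specific benchmark's data format.

```python
from datetime import date

# Hypothetical problem records: (problem id, date first published).
PROBLEMS = [
    ("p1", date(2024, 1, 15)),
    ("p2", date(2024, 6, 2)),
    ("p3", date(2025, 2, 20)),
]

def uncontaminated(problems, training_cutoff):
    """Keep only problems published strictly after the training cutoff."""
    return [pid for pid, published in problems if published > training_cutoff]

# A model with an assumed cutoff of 2024-03-01 is scored on p2 and p3 only.
print(uncontaminated(PROBLEMS, date(2024, 3, 1)))  # ['p2', 'p3']
```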
The following "killed by" tables, retained from the original version of this page, record when specific benchmarks were defeated and by which model. They are grouped by year of defeat, most recent first.
| Benchmark | Category | Date Created | Date Defeated | Killed By | Defeated By Model | Original Score | Final Score | Details | Links |
|---|---|---|---|---|---|---|---|---|---|
| ARC-AGI | Reasoning | 2019-11 | 2024-12 | Saturation | o3 | Human Baseline: ~80% | o3: 87.5% | Abstract reasoning challenge with visual pattern completion tasks created by François Chollet. | Paper, Website |
| MATH | Mathematics | 2021-03 | 2024-09 | Saturation | o1 | Average CS PhD: ~40% | o1: 94.8% | 12K challenging competition math problems from AMC/AIME, requiring complex multi-step reasoning. | Paper, GitHub |
| BIG-Bench-Hard | Multi-task | 2022-10 | 2024-06 | Saturation | Sonnet 3.5 | Average Human: 67.7% | Sonnet 3.5: 93.1% | A curated suite of 23 challenging tasks from BIG-Bench. | Paper, GitHub |
| HumanEval | Coding | 2021-07 | 2024-05 | Saturation | GPT-4o | Unspecified | GPT-4o: 90.2% | 164 Python programming problems testing coding abilities. | Paper, GitHub |
| IFEval | Instruction Following | 2023-11 | 2024-12 | Saturation | Llama 3.3 70B | Unspecified | Llama 3.3 70B: 92.1% | Evaluation suite testing multi-step instruction-following capabilities. | Paper, GitHub |
| Benchmark | Category | Date Created | Date Defeated | Killed By | Defeated By Model | Original Score | Final Score | Details | Links |
|---|---|---|---|---|---|---|---|---|---|
| GSM8K | Mathematics | 2021-10 | 2023-11 | Saturation | GPT-4 | Unspecified | GPT-4: 92.0% | 8.5K grade school math word problems requiring step-by-step solutions. | Paper, GitHub |
| Turing Test | Conversation | 1950-10 | 2023-03 | Saturation | GPT-4 | Interrogator > 50% | Interrogator 46% | The original AI benchmark proposed by Alan Turing in 1950 (the "imitation game"). | Paper |
| ARC (AI2) | Reasoning | 2018-03 | 2023-03 | Saturation | GPT-4 | Unspecified | GPT-4: 96.3% | Grade-school multiple-choice reasoning tasks testing logical, spatial, temporal reasoning. | Paper |
| HellaSwag | Common Sense | 2019-05 | 2023-03 | Saturation | GPT-4 | Human: 95.6% | GPT-4: 95.3% | Multiple-choice questions about everyday scenarios with adversarial filtering. | Paper |
| MMLU | Knowledge | 2020-09 | 2023-03 | Saturation | GPT-4 | 95th pct Human: 87.0% | GPT-4: 87.3% | 57 subjects from real-world sources (professional exams) testing breadth and depth of knowledge. | Paper |
| WinoGrande | Common Sense | 2019-07 | 2023-03 | Saturation | GPT-4 | Human: 94% | GPT-4: 87.5% | Enhanced WSC with 44K problems testing common-sense pronoun resolution. | Paper |
| Benchmark | Category | Date Created | Date Defeated | Killed By | Defeated By Model | Original Score | Final Score | Details | Links |
|---|---|---|---|---|---|---|---|---|---|
| BIG-Bench | Multi-task | 2021-06 | 2022-04 | Saturation | PaLM 540B | Human: 49.8% | PaLM 540B: 61.4% | 204 tasks spanning linguistics, math, common-sense reasoning, and more. | Paper, GitHub |
| Benchmark | Category | Date Created | Date Defeated | Killed By | Defeated By Model | Original Score | Final Score | Details | Links |
|---|---|---|---|---|---|---|---|---|---|
| SuperGLUE | Language | 2019-05 | 2019-10 | Saturation | T5 | Human: 89.8% | T5: 89.3% | More challenging language understanding tasks (word sense, causal reasoning, RC). | Paper |
| WSC | Common Sense | 2012-05 | 2019-07 | Saturation | RoBERTa (w SFT) | Human: 96.5% | RoBERTa (w SFT): 90.1% | Carefully crafted sentence pairs with ambiguous pronoun references. | Paper |
| GLUE | Language | 2018-04 | 2019-06 | Saturation | XLNet | Human: 87.1% | XLNet: 88.4% | Nine tasks for evaluating NLU (inference, paraphrase, similarity, etc.). | Paper |
| TriviaQA | Knowledge | 2017-05 | 2019-06 | Saturation | SpanBERT | Human: 79.7% | SpanBERT: 83.6% | 650K QA-evidence triples requiring cross-sentence reasoning. | Paper |
| SQuAD v2.0 | Language | 2018-06 | 2019-04 | Saturation | BERT | Human: 89.5% | BERT: 89.5% | Extension of SQuAD adding unanswerable questions. | Paper |
| SQuAD | Language | 2016-06 | 2019-03 | Saturation | BERT | Human: 91.2% | BERT: 93.2% | 100,000+ QA tasks on Wikipedia articles. | Paper |
| Benchmark | Category | Date Created | Date Defeated | Killed By | Defeated By Model | Original Score | Final Score | Details | Links |
|---|---|---|---|---|---|---|---|---|---|
| SWAG | Common Sense | 2018-08 | 2018-10 | Saturation | BERT | Human: 88% | BERT: 86% | 113K multiple-choice questions about grounded situations (common sense "next step"). | Paper |
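The "Date Created" and "Date Defeated" columns can be turned into a time-to-defeat figure with simple month arithmetic. The sketch below does this for five rows copied from the tables above; the helper name and the choice of rows are illustrative.

```python
def months_between(created: str, defeated: str) -> int:
    """Whole months from a YYYY-MM creation date to a YYYY-MM defeat date."""
    created_year, created_month = map(int, created.split("-"))
    defeated_year, defeated_month = map(int, defeated.split("-"))
    return (defeated_year - created_year) * 12 + (defeated_month - created_month)

# (Date Created, Date Defeated) for a few rows of the tables above.
ROWS = {
    "ARC-AGI": ("2019-11", "2024-12"),
    "MATH": ("2021-03", "2024-09"),
    "HumanEval": ("2021-07", "2024-05"),
    "MMLU": ("2020-09", "2023-03"),
    "SuperGLUE": ("2019-05", "2019-10"),
}

# Print benchmarks from shortest-lived to longest-lived.
for name, dates in sorted(ROWS.items(), key=lambda item: months_between(*item[1])):
    print(f"{name}: {months_between(*dates)} months")
```

For these rows the spread runs from 5 months (SuperGLUE) to 61 months (ARC-AGI).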