LLM Benchmarks Timeline

Updated Jun 25, 2026

See also: LLM Comparisons and LLM Rankings

The LLM benchmarks timeline is the chronological record of the major tests used to evaluate large language models, from the BERT-era reading-comprehension suites of 2018 to the agentic and frontier-reasoning evaluations of 2025 and 2026. Its defining pattern is rapid saturation: almost every benchmark that becomes a public scorecard is matched or exceeded by frontier models within one to three years, forcing the community to build progressively harder tests. MMLU went from GPT-3's 43.9% in 2020 to GPT-4's 86.4% by March 2023 ^[7]^[39], and the abstract-reasoning benchmark ARC-AGI took four years to climb from 0% (GPT-3, 2020) to 5% (GPT-4o, 2024) before o3 reached 87.5% in December 2024 ^[40]^[6].

This page lists, for each major benchmark, when it was first released, who built it, what it tries to measure, and (where applicable) when frontier models effectively saturated it.

LLM evaluation has gone through several distinct phases. The first phase (2018 to 2020) focused on natural language understanding tasks like reading comprehension, sentence classification, and common-sense reasoning. The second phase (2020 to 2022) shifted toward broad knowledge tests, math word problems, and code generation. The third phase (2023 onward) is dominated by graduate-level reasoning, agentic and tool-use tasks, multimodal understanding, and frontier evaluations like FrontierMath and Humanity's Last Exam designed to resist saturation for years.

What does this timeline cover?

The table below collects the major LLM benchmarks in order of release. The Status column indicates whether top models now match or exceed strong human baselines ("Saturated") or whether the benchmark still separates frontier models ("Active").

Released	Benchmark	Category	Creators	Status	Paper
2016-06	SQuAD	Reading comprehension	Rajpurkar et al., Stanford	Saturated by BERT (2019)	arxiv 1606.05250
2017-05	TriviaQA	Knowledge QA	Joshi et al., UW	Saturated 2019	arxiv 1705.03551
2018-03	ARC (AI2)	Science reasoning	Allen Institute for AI	Saturated 2023	arxiv 1803.05457
2018-04	GLUE	NLU suite	NYU, UW, DeepMind	Saturated 2019	arxiv 1804.07461
2018-06	SQuAD 2.0	Reading comprehension	Rajpurkar et al., Stanford	Saturated 2019	arxiv 1806.03822
2018-08	SWAG	Common-sense	Zellers et al., UW	Saturated 2018	arxiv 1808.05326
2019-03	DROP	Reading + math	Dua et al., AI2	Saturated 2023	arxiv 1903.00161
2019-05	SuperGLUE	NLU suite	NYU, DeepMind, FAIR	Saturated 2019	arxiv 1905.00537
2019-05	HellaSwag	Common-sense	Zellers et al., AI2	Saturated 2023	arxiv 1905.07830
2019-05	BoolQ	Yes/No QA	Google AI	Saturated	arxiv 1905.10044
2019-07	WinoGrande	Common-sense	Sakaguchi et al., AI2	Saturated 2023	arxiv 1907.10641
2019-11	ARC-AGI	Abstract reasoning	François Chollet	Saturated by o3 (2024)	arxiv 1911.01547
2020-09	MMLU	Broad knowledge	Hendrycks et al.	Saturated 2023	arxiv 2009.03300
2021-03	MATH	Competition math	Hendrycks et al.	Saturated 2024	arxiv 2103.03874
2021-07	HumanEval	Code generation	OpenAI (Chen et al.)	Saturated 2024	arxiv 2107.03374
2021-08	MBPP	Code generation	Google Research	Saturated 2024	arxiv 2108.07732
2021-09	TruthfulQA	Factuality	Lin et al., Oxford	Hard, partially solved	arxiv 2109.07958
2021-10	GSM8K	Grade-school math	OpenAI (Cobbe et al.)	Saturated 2023	arxiv 2110.14168
2022-02	CodeContests	Competitive code	DeepMind (AlphaCode)	Active	Science 2022
2022-06	BIG-bench	Multi-task	444 authors, Google	Largely saturated	arxiv 2206.04615
2022-10	BIG-Bench Hard	Multi-task	Suzgun et al., Google	Saturated 2024	arxiv 2210.09261
2023-04	LMSYS Chatbot Arena	Human preference	LMSYS / UC Berkeley	Active	blog
2023-05	HumanEval+ / MBPP+	Code generation	EvalPlus	Active	arxiv 2305.01210
2023-10	SWE-bench	Real GitHub issues	Princeton NLP	Replaced by Verified	arxiv 2310.06770
2023-11	IFEval	Instruction following	Google Research	Saturated 2024	arxiv 2311.07911
2023-11	GPQA / GPQA-Diamond	Graduate-level science	Rein et al., NYU	Active	arxiv 2311.12022
2023-11	GAIA	General AI assistant	Meta, HuggingFace	Active	arxiv 2311.12983
2023-11	MMMU	Multimodal	Yue et al.	Active	arxiv 2311.16502
2024-02	AIME 2024	Olympiad math	MAA (used by labs)	Saturated 2025	MAA AIME
2024-03	LiveCodeBench	Contamination-free code	Berkeley, MIT, Cornell	Active (rolling)	arxiv 2403.07974
2024-04	Arena-Hard	Auto-rated chat	LMSYS	Active	blog
2024-06	MMLU-Pro	Broad knowledge	TIGER-Lab	Active	arxiv 2406.01574
2024-06	tau-bench	Tool-agent-user	Sierra Research	Active	arxiv 2406.12045
2024-08	SWE-bench Verified	Real GitHub issues	OpenAI + Princeton	Deprecated 2026-02	OpenAI
2024-09	MMMU-Pro	Multimodal	Yue et al.	Active	arxiv 2409.02813
2024-10	MLE-Bench	ML engineering agents	OpenAI	Active	arxiv 2410.07095
2024-10	SimpleQA	Short-form factuality	OpenAI	Active	arxiv 2411.04368
2024-11	FrontierMath	Research math	Epoch AI	Active	arxiv 2411.04872
2024-12	WebDev Arena	Front-end coding	LMArena	Active	blog
2024-12	Aider Polyglot	Multi-language code	Aider AI	Active	aider.chat
2025-01	Humanity's Last Exam	Frontier knowledge	CAIS + Scale AI	Active	arxiv 2501.14249
2025-02	AIME 2025	Olympiad math	MAA (used by labs)	Active	MAA AIME
2025-03	ARC-AGI-2	Abstract reasoning	François Chollet	Active	arxiv 2505.11831
2025-05	SWE-bench Live	Real GitHub issues, rolling	Microsoft Research	Active	arxiv 2505.23419
2025-05	HealthBench	Health conversations	OpenAI + 262 physicians	Active	arxiv 2505.08775
2025-06	tau2-bench	Conversational tool agents	Sierra Research	Active	arxiv 2506.07982

What were the first LLM benchmarks (2018, the BERT era)?

The modern wave of LLM benchmarking starts with reading-comprehension and natural-language-understanding suites built around the Transformer. Stanford's SQuAD (2016) and SQuAD 2.0 (June 2018) tested whether models could extract answer spans from Wikipedia paragraphs and recognize when no answer existed ^[2]. NYU's GLUE benchmark (April 2018) bundled nine NLU tasks (sentence classification, paraphrase detection, entailment, similarity) into one score ^[1]. SWAG (August 2018) introduced grounded common-sense "next-event" multiple choice.

Within about a year of BERT's arrival in October 2018, BERT and its successors had matched or closely approached human baselines on all of these. The pattern, fast saturation by a single new model class, would repeat over and over.

How did benchmarks evolve in 2019 (SuperGLUE and the common-sense wave)?

The SuperGLUE benchmark (May 2019) deliberately replaced GLUE tasks that had been solved with harder ones: word-sense disambiguation, causal reasoning, multi-sentence reading comprehension. T5 saturated SuperGLUE within a few months, reaching 89.3 against a human baseline of 89.8 ^[3].

Common-sense and grounded reasoning got a big push in 2019. HellaSwag (May 2019) extended SWAG with adversarial filtering and stayed unsolved by BERT-class models for years ^[4]. WinoGrande (July 2019) scaled the Winograd Schema Challenge to 44,000 examples of pronoun resolution that require real-world knowledge ^[5]. BoolQ (May 2019) used naturally occurring yes/no questions. DROP (March 2019) combined reading comprehension with discrete arithmetic.

A different track started the same year. ARC-AGI (Nov 2019), proposed by François Chollet in his "On the Measure of Intelligence" paper, focused on abstract visual reasoning rather than language ^[6]. It was designed to resist brute-force scaling of training data, and it held up far longer than any of the language-only benchmarks above.

How did 2020 to 2021 shift to knowledge, math, and code?

MMLU (Sept 2020) by Dan Hendrycks and collaborators was the breakout benchmark of the GPT-3 era. It bundled 57 subjects (US history, professional law, abstract algebra, clinical knowledge, virology, machine learning) into a single multiple-choice test drawn from practice exams and other freely available sources ^[7]. GPT-3's modest 43.9% left plenty of headroom, which kept MMLU a headline number for years; GPT-4 reached 86.4% in March 2023, effectively saturating it ^[39].

Math got two big benchmarks in 2021. Hendrycks' MATH dataset (March 2021) used 12,500 competition math problems from AMC, AIME, and similar contests; a computer science PhD student who attempted a sample scored about 40% ^[8]. OpenAI's GSM8K (October 2021) consisted of 8,500 grade-school math word problems, simpler in math content but requiring multi-step natural-language reasoning ^[12]. Both became the de facto math evaluations for frontier models, and both were effectively solved by 2024 (o1 hit 94.8% on MATH; GPT-4 had already hit 92% on GSM8K).

Code generation arrived around the same time. OpenAI's HumanEval (July 2021) shipped with the original Codex paper: 164 hand-written Python programming problems with hidden unit tests ^[9]. Google's MBPP (August 2021) added 974 entry-level Python problems ^[10]. The pass@1 metric (does the first sampled solution pass?) became standard. By 2024 GPT-4o was scoring 90.2% on HumanEval. EvalPlus (May 2023) addressed the problem that HumanEval's test cases were too weak by extending them roughly 80x, producing HumanEval+ and MBPP+ ^[17].
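
As a rough illustration of the pass@k family of metrics mentioned above, the sketch below follows the unbiased estimator described in the HumanEval paper: draw n samples per problem, count the c samples that pass the tests, and estimate the chance that at least one of k randomly chosen samples passes. The function names and the per-problem (n, c) input format are assumptions made for this example, not any lab's actual evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem as 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed the tests, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimate reduces to c / n, the fraction of samples that pass, which is why pass@1 is often described simply as "does the first sampled solution pass?".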

Factuality got its first dedicated test in TruthfulQA (Sept 2021): 817 questions where humans commonly hold false beliefs (medicine, law, conspiracies, fictional history) ^[11]. Larger models initially did worse on this benchmark, an early example of inverse scaling.

What was BIG-bench and the 2022 megabenchmark experiment?

BIG-bench (June 2022) was an unusual collaboration: 444 authors from 132 institutions contributing 204 tasks ranging from arithmetic to checkmate-in-one to crash blossom disambiguation ^[14]. Google's PaLM 540B was the first model to clear average human performance overall (61.4 versus a human baseline of 49.8). The follow-up paper, BIG-Bench Hard (Oct 2022) by Suzgun et al., picked 23 tasks where prior models had failed; chain-of-thought prompting unlocked dramatic gains on most of them ^[15].

2022 also saw DeepMind release CodeContests alongside AlphaCode, a dataset of competitive programming problems used to evaluate competition-level code generation ^[13].

What changed in 2023 (GPT-4, GPQA, MMMU, SWE-bench)?

2023 was the year benchmarks pivoted toward graduate-level rigor and toward agentic, real-world tasks. Three benchmarks stand out:

GPQA (Nov 2023, David Rein et al.) introduced 448 multiple-choice questions in biology, physics, and chemistry written by domain PhDs. Highly skilled non-expert validators with 30+ minutes and unrestricted web access reached only 34%; PhDs in the relevant field reached about 65% ^[20]. The highest-quality subset, GPQA-Diamond (198 items), became one of the most-watched scores for frontier reasoning models.
MMMU (Nov 2023, Yue et al.) covered 11,500 multimodal exam questions across 30 subjects, with images of charts, diagrams, chemical structures, and medical scans interleaved with the questions ^[22]. It became the standard multimodal benchmark for GPT-4V, Gemini, and Claude 3.
SWE-bench (Oct 2023, Princeton NLP) framed software engineering as a benchmark for the first time at scale: 2,294 real GitHub issues from 12 popular Python repositories, with the model expected to produce a patch that passes the original maintainer's hidden test suite ^[18].

Alongside these, IFEval (Nov 2023, Google) tested verifiable instruction following ("reply in exactly three paragraphs", "include the word 'falcon' twice") ^[19], and GAIA (Nov 2023, Meta + HuggingFace) introduced a tool-use benchmark on which human respondents scored 92% but GPT-4 equipped with plugins managed only about 15% ^[21]. LMSYS Chatbot Arena, launched in April 2023 by UC Berkeley, became the leading human-preference leaderboard, ranking models by Elo (later Bradley-Terry) coefficients fitted to crowdsourced pairwise votes ^[16].
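
To make the idea of rating models from pairwise votes concrete, here is a minimal sketch of the classic online Elo update. It is not Chatbot Arena's actual pipeline: the K-factor of 32, the starting rating of 1000, the (winner, loser) vote format, and the function names are all assumptions chosen for illustration, ties are ignored, and the later Bradley-Terry fit (which estimates all ratings jointly rather than vote by vote) is not shown.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Elo's predicted probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_from_votes(
    votes: list[tuple[str, str]],
    k: float = 32.0,      # assumed K-factor, illustrative only
    base: float = 1000.0,  # assumed starting rating, illustrative only
) -> dict[str, float]:
    """Process (winner, loser) votes in order and return final ratings."""
    ratings: dict[str, float] = defaultdict(lambda: base)
    for winner, loser in votes:
        predicted = expected_score(ratings[winner], ratings[loser])
        # An upset (low predicted win probability) moves ratings further.
        delta = k * (1.0 - predicted)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)
```

Because online Elo depends on the order in which votes arrive, two leaderboards built from the same votes can differ slightly, which is one commonly cited reason for fitting a Bradley-Terry model over the whole vote set instead.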

What benchmarks defined 2024 (agents, math frontiers, contamination defenses)?

By 2024 most pre-2023 benchmarks were saturated, and the field divided into three new directions: agentic and coding tasks, frontier math, and explicit defenses against benchmark contamination.

Agentic and coding

LiveCodeBench (Mar 2024) hosted 400+ coding problems from contests run between May 2023 and May 2024 with continuous updates, designed to defeat training-data leakage ^[23].
MLE-Bench (Oct 2024, OpenAI) wrapped 75 Kaggle ML engineering competitions into an agent benchmark testing whether models could train models, prepare data, and submit predictions end to end ^[29].
SWE-bench Verified (Aug 2024, OpenAI + Princeton) was a 500-task subset of SWE-bench audited by 93 contracted developers to remove ambiguous issues and unfair tests ^[27]. It became the dominant coding benchmark in 2024 and 2025 before OpenAI deprecated it on February 23, 2026 after finding that at least 59.4% of failed test cases were flawed and that every frontier model showed signs of training-data contamination ^[41]. OpenAI wrote that "improvements on SWE-bench Verified no longer reflect meaningful improvements in models' real-world software development abilities," and that gains "increasingly reflect how much the model was exposed to the benchmark at training time" ^[41].
tau-bench (June 2024, Sierra Research) introduced a framework where an agent talks to a simulated user under business rules from real airline and retail policies; only models with strong tool-use and consistency could pass ^[26].
Aider Polyglot (Dec 2024) used 225 hard Exercism problems across C++, Go, Java, JavaScript, Python, and Rust to evaluate code-editing skill across languages ^[32]. o1 topped the initial leaderboard.
WebDev Arena (Dec 2024, LMArena) had humans pick which of two model-generated single-file web apps looked and worked better.
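
The release-date defense used by rolling benchmarks such as LiveCodeBench in the list above can be sketched in a few lines: only score a model on problems published after its training cutoff. This is a simplified illustration, assuming each problem carries a reliable release date; the `Problem` class, its fields, and the function name are hypothetical and do not mirror any benchmark's real data format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; real benchmarks store far more metadata.
    problem_id: str
    released: date


def released_after_cutoff(
    problems: list[Problem], training_cutoff: date
) -> list[Problem]:
    """Keep only problems published strictly after the model's cutoff."""
    return [p for p in problems if p.released > training_cutoff]


if __name__ == "__main__":
    pool = [
        Problem("contest-001", date(2023, 5, 14)),
        Problem("contest-002", date(2023, 11, 2)),
        Problem("contest-003", date(2024, 4, 20)),
    ]
    # A model whose (assumed) training data ends on 2023-09-30.
    for problem in released_after_cutoff(pool, date(2023, 9, 30)):
        print(problem.problem_id, problem.released.isoformat())
```

A date filter only helps when the cutoff itself is known and honest, which is why these benchmarks also keep adding new problems over time.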

Frontier math

AIME 2024 (Feb 2024) is a real high-school invitational competition: its two papers (AIME I and II) together supply 30 integer-answer problems from algebra, combinatorics, geometry, and number theory. It became a standard reasoning eval for o1, o3, and competitors. Note: AIME 2024 has since been shown to suffer from training-data contamination for several models.
FrontierMath (Nov 2024, Epoch AI) is the most ambitious math benchmark to date: hundreds of original research-level problems developed with 60+ professional mathematicians, with their difficulty assessed by Fields medalists Terence Tao, Timothy Gowers, and Richard Borcherds, covering number theory, real analysis, algebraic geometry, and category theory ^[31]^[42]. At launch, leading models solved less than 2% of the problems ^[31]^[42]. Tao described the problems as "extremely challenging," suggesting that in the near term the only way to solve them is "a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages" ^[42].

Anti-contamination and refresh efforts

MMLU-Pro (June 2024, TIGER-Lab) responded to MMLU saturation by adding harder reasoning items, expanding choices from 4 to 10, and removing trivial questions; 12,000 cleaner items across 14 domains ^[25].
MMMU-Pro (Sept 2024) tightened MMMU by filtering text-only-solvable items, increasing distractors, and adding a vision-only mode where the question itself is rendered as an image ^[28].
SimpleQA (Oct 2024, OpenAI) targeted short-form factual recall with 4,326 hand-crafted questions; even GPT-4o scored under 40% ^[30].
Arena-Hard (April 2024, LMSYS) used 500 challenging user prompts from Chatbot Arena, scored by a strong judge model, as an automatic and reproducible alternative to crowdsourced votes ^[24].

What benchmarks define 2025 to 2026 (HLE, ARC-AGI-2, post-saturation evaluation)?

2025 was the first year in which frontier models began to saturate the 2024 benchmarks as well. The community responded with deliberately harder evaluations.

Humanity's Last Exam (HLE) (Jan 24, 2025) was assembled by the Center for AI Safety and Scale AI from contributions by nearly 1,000 subject experts across over 100 disciplines ^[33]. The launch set was around 3,000 expert-level questions (2,500 public plus a 500-question private holdout) ^[43]. At release, top models scored under 10%: GPT-4o reached 3.3%, Claude 3.5 Sonnet 4.3%, Gemini 6.2%, o1 9.1%, and DeepSeek-R1 9.4% ^[43].
AIME 2025 (Feb 2025) replaced AIME 2024 as a clean, uncontaminated math benchmark; many labs report both numbers because AIME 2024 has known contamination.
ARC-AGI-2 (March 2025, ARC Prize team) was the long-awaited successor to ARC-AGI after o3 cleared the original ^[34]. It uses harder symbolic-pattern tasks; in early 2025, frontier models scored only a few percent while humans scored over 60%.
HealthBench (May 12, 2025, OpenAI with 262 physicians from 60 countries) introduced 5,000 health-related conversations, each scored against a physician-written rubric covering accuracy, completeness, communication, and safety ^[35].
SWE-bench Live (May 2025, Microsoft Research) addressed the contamination problem in coding benchmarks by collecting 1,319 real GitHub issues created after Jan 2024 across 93 repositories, with continuous monthly refreshes ^[36].
tau2-bench (June 2025, Sierra Research) extended tau-bench with a dual-control telecom domain where both the agent and a simulated user actively change shared world state. It was later extended into tau3-bench, which adds a banking domain and a voice modality ^[37].

Beyond this list, the modern evaluation stack now includes domain benchmarks like LegalBench, FinanceBench, ChemBench, MultiMedQA, and MedQA; safety-and-jailbreak benchmarks like JailbreakBench, HarmBench, and AdvBench; reward-model benchmarks like RewardBench; and continually-refreshed leaderboards like LiveBench. The full list runs into the hundreds.

How and why do benchmarks get saturated?

Three mechanisms tend to defeat any given LLM benchmark within a few years:

Pretraining-data contamination. Most public benchmarks end up in scraped pretraining corpora, especially anything posted to GitHub, Hugging Face, or arXiv. Recent analyses show AIME 2024 was likely contaminated for several frontier models, with score inflation of 10 to 20 points relative to AIME 2025, and OpenAI found contamination in every frontier model it tested against SWE-bench Verified ^[41].
Targeted fine-tuning and RLHF. Once a benchmark becomes a public scorecard, labs explicitly tune for it. This is part of why GLUE, SuperGLUE, and BIG-Bench Hard fell so quickly after first appearing.
Capability genuinely catching up. Some benchmarks (HumanEval, GSM8K, MATH, ARC-AGI-1) were defeated by clear capability gains rather than contamination. When o3 reached 87.5% on ARC-AGI, benchmark author François Chollet called it "a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models" ^[40]. Reasoning models from late 2024 onward (o1, o3, DeepSeek-R1, Claude 3.7 Sonnet extended thinking) cleared many tasks that had stood for years.

The newest benchmarks (FrontierMath, HLE, ARC-AGI-2, SWE-bench Live) try to defend against all three: original problems written by experts, held-out test sets, continuous refreshes, and difficulty calibrated so that even the best 2025 systems cannot brute-force them.

Which benchmarks are saturated and which are still hard?

As of mid-2026, nearly all benchmarks released before 2023 are saturated, including SQuAD, GLUE, SuperGLUE, HellaSwag, WinoGrande, MMLU, MATH, GSM8K, HumanEval, and BIG-Bench Hard; the timeline table lists TruthfulQA and CodeContests as the main holdouts. Later and longer-lived evaluations are also falling fast: AIME 2024 is effectively saturated, and ARC-AGI-1, which had resisted models since 2019, was cleared by o3. The benchmarks that remain genuinely difficult as of 2026 are the deliberately hard 2024-2025 cohort, where even the strongest models start well below human-expert ceilings: FrontierMath (under 2% at launch) ^[31], Humanity's Last Exam (under 10% at launch) ^[43], ARC-AGI-2 (low single digits for models versus 60%+ for humans) ^[34], and the contamination-resistant coding benchmarks SWE-bench Pro and SWE-bench Live ^[36]^[41]. These are the scores AI labs now compete on, precisely because the older benchmarks no longer separate frontier systems from one another.

Original article tables (preserved)

The following "killed by" tables track the moment specific benchmarks were defeated by specific models, retained from the original version of this page ^[38].

2024

Benchmark	Category	Date Created	Date Defeated	Killed By	Defeated By Model	Original Score	Final Score	Details	Links
ARC-AGI	Reasoning	2019-11	2024-12	Saturation	o3	Human Baseline: ~80%	o3: 87.5%	Abstract reasoning challenge with visual pattern completion tasks created by François Chollet.	Paper, Website
MATH	Mathematics	2021-03	2024-09	Saturation	o1	Average CS PhD: ~40%	o1: 94.8%	12K challenging competition math problems from AMC/AIME, requiring complex multi-step reasoning.	Paper, GitHub
BIG-Bench-Hard	Multi-task	2022-10	2024-06	Saturation	Sonnet 3.5	Average Human: 67.7%	Sonnet 3.5: 93.1%	A curated suite of 23 challenging tasks from BIG-Bench.	Paper, GitHub
HumanEval	Coding	2021-07	2024-05	Saturation	GPT-4o	Unspecified	GPT-4o: 90.2%	164 Python programming problems testing coding abilities.	Paper, GitHub
IFEval	Instruction Following	2023-11	2024-03	Saturation	Llama 3.3 70B	Unspecified	Llama 3.3 70B: 92.1%	Evaluation suite testing multi-step instruction-following capabilities.	Paper, GitHub

2023

Benchmark	Category	Date Created	Date Defeated	Killed By	Defeated By Model	Original Score	Final Score	Details	Links
GSM8K	Mathematics	2021-10	2023-11	Saturation	GPT-4	Unspecified	GPT-4: 92.0%	8.5K grade school math word problems requiring step-by-step solutions.	Paper, GitHub
Turing Test	Conversation	1950-10	2023-03	Saturation	GPT-4	Interrogator > 50%	Interrogator 46%	The original AI benchmark proposed by Alan Turing in 1950 (the "imitation game").	Paper
ARC (AI2)	Reasoning	2018-03	2023-03	Saturation	GPT-4	Unspecified	GPT-4: 96.3%	Grade-school multiple-choice reasoning tasks testing logical, spatial, temporal reasoning.	Paper
HellaSwag	Common Sense	2019-05	2023-03	Saturation	GPT-4	Human: 95.6%	GPT-4: 95.3%	Multiple-choice questions about everyday scenarios with adversarial filtering.	Paper
MMLU	Knowledge	2020-09	2023-03	Saturation	GPT-4	95th pct Human: 87.0%	GPT-4: 87.3%	57 subjects from real-world sources (professional exams) testing breadth and depth of knowledge.	Paper
WinoGrande	Common Sense	2019-07	2023-03	Saturation	GPT-4	Human: 94%	GPT-4: 87.5%	Enhanced WSC with 44K problems testing common-sense pronoun resolution.	Paper

Pre-2023

2022

Benchmark	Category	Date Created	Date Defeated	Killed By	Defeated By Model	Original Score	Final Score	Details	Links
BIG-Bench	Multi-task	2021-06	2022-04	Saturation	PaLM 540B	Human: 49.8%	PaLM 540B: 61.4%	204 tasks spanning linguistics, math, common-sense reasoning, and more.	Paper, GitHub

2019

Benchmark	Category	Date Created	Date Defeated	Killed By	Defeated By Model	Original Score	Final Score	Details	Links
SuperGLUE	Language	2019-05	2019-10	Saturation	T5	Human: 89.8%	T5: 89.3%	More challenging language understanding tasks (word sense, causal reasoning, RC).	Paper
WSC	Common Sense	2012-05	2019-07	Saturation	RoBERTa (w SFT)	Human: 96.5%	RoBERTa (w SFT): 90.1%	Carefully crafted sentence pairs with ambiguous pronoun references.	Paper
GLUE	Language	2018-05	2019-06	Saturation	XLNet	Human: 87.1%	XLNet: 88.4%	Nine tasks for evaluating NLU (inference, paraphrase, similarity, etc.).	Paper
TriviaQA	Knowledge	2017-05	2019-06	Saturation	SpanBERT	Human: 79.7%	SpanBERT: 83.6%	650K QA-evidence triples requiring cross-sentence reasoning.	Paper
SQuAD v2.0	Language	2018-05	2019-04	Saturation	BERT	Human: 89.5%	BERT: 89.5%	Extension of SQuAD adding unanswerable questions.	Paper
SQuAD	Language	2016-05	2019-03	Saturation	BERT	Human: 91.2%	BERT: 93.2%	100,000+ QA tasks on Wikipedia articles.	Paper

2018

Benchmark	Category	Date Created	Date Defeated	Killed By	Defeated By Model	Original Score	Final Score	Details	Links
SWAG	Common Sense	2018-05	2018-10	Saturation	BERT	Human: 88%	BERT: 86%	113K multiple-choice questions about grounded situations (common sense "next step").	Paper

ELI5: how does this whole thing work?

Think of an LLM benchmark as a standardized exam for AI. Researchers write a big set of questions with known answers (math problems, science questions, coding tasks, reading passages), let the model take the test, and report the percentage it gets right next to how well humans do. When a new model finally beats the human baseline, people say the benchmark is "saturated" or "solved," and it stops being useful for telling top models apart. So researchers write a harder exam, the models catch up again, and the cycle repeats. This timeline is the scoreboard of that race: each row is an exam, when it was written, and (often) the model that finally aced it.
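
The exam analogy can be written down directly. The sketch below grades a list of answers and compares the resulting percentage with a human baseline; it is an illustration only, with hypothetical function names. The `margin` parameter is an assumption added here because the tables on this page count near-misses as saturation, for example T5's 89.3 against a human baseline of 89.8 on SuperGLUE.

```python
def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Percentage of predictions that exactly match the answer key."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("need equally sized, non-empty lists")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def is_saturated(
    model_score: float, human_baseline: float, margin: float = 0.0
) -> bool:
    """True once the model is within `margin` points of the human baseline."""
    return model_score >= human_baseline - margin


if __name__ == "__main__":
    # Toy four-question exam with one wrong answer.
    print(accuracy(["A", "B", "C", "D"], ["A", "B", "C", "A"]))
    # Scores taken from the tables above.
    print(is_saturated(93.2, 91.2))              # SQuAD: BERT vs human
    print(is_saturated(89.3, 89.8))              # SuperGLUE, strict rule
    print(is_saturated(89.3, 89.8, margin=1.0))  # SuperGLUE, with tolerance
```

The strict and tolerant results differ for SuperGLUE, which shows that "saturated" is a judgment call about usefulness rather than a fixed threshold.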

References

1. Wang, A., et al. (2018). "GLUE: A Multi-Task Benchmark." arxiv 1804.07461
2. Rajpurkar, P., et al. (2018). "Know What You Don't Know: Unanswerable Questions for SQuAD." arxiv 1806.03822
3. Wang, A., et al. (2019). "SuperGLUE: A Stickier Benchmark." arxiv 1905.00537
4. Zellers, R., et al. (2019). "HellaSwag: Can a Machine Really Finish Your Sentence?" arxiv 1905.07830
5. Sakaguchi, K., et al. (2019). "WinoGrande." arxiv 1907.10641
6. Chollet, F. (2019). "On the Measure of Intelligence" (ARC-AGI). arxiv 1911.01547
7. Hendrycks, D., et al. (2020). "Measuring Massive Multitask Language Understanding (MMLU)." arxiv 2009.03300
8. Hendrycks, D., et al. (2021). "Measuring Mathematical Problem Solving with the MATH Dataset." arxiv 2103.03874
9. Chen, M., et al. (2021). "Evaluating Large Language Models Trained on Code (HumanEval)." arxiv 2107.03374
10. Austin, J., et al. (2021). "Program Synthesis with Large Language Models (MBPP)." arxiv 2108.07732
11. Lin, S., et al. (2021). "TruthfulQA." arxiv 2109.07958
12. Cobbe, K., et al. (2021). "Training Verifiers to Solve Math Word Problems (GSM8K)." arxiv 2110.14168
13. Li, Y., et al. (2022). "Competition-Level Code Generation with AlphaCode." Science
14. Srivastava, A., et al. (2022). "Beyond the Imitation Game (BIG-bench)." arxiv 2206.04615
15. Suzgun, M., et al. (2022). "BIG-Bench Hard." arxiv 2210.09261
16. Chiang, W.-L., et al. (2023). "Chatbot Arena." LMSYS blog
17. Liu, J., et al. (2023). "Is Your Code Generated by ChatGPT Really Correct? (EvalPlus / HumanEval+)." arxiv 2305.01210
18. Jimenez, C. E., et al. (2023). "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" arxiv 2310.06770
19. Zhou, J., et al. (2023). "Instruction-Following Evaluation (IFEval)." arxiv 2311.07911
20. Rein, D., et al. (2023). "GPQA: A Graduate-Level Google-Proof Q&A Benchmark." arxiv 2311.12022
21. Mialon, G., et al. (2023). "GAIA: a benchmark for General AI Assistants." arxiv 2311.12983
22. Yue, X., et al. (2023). "MMMU." arxiv 2311.16502
23. Jain, N., et al. (2024). "LiveCodeBench." arxiv 2403.07974
24. Li, T., et al. (2024). "From Crowdsourced Data to High-quality Benchmarks: Arena-Hard." LMSYS blog
25. Wang, Y., et al. (2024). "MMLU-Pro." arxiv 2406.01574
26. Yao, S., et al. (2024). "tau-bench." arxiv 2406.12045
27. OpenAI Preparedness team (2024). "Introducing SWE-bench Verified." OpenAI
28. Yue, X., et al. (2024). "MMMU-Pro." arxiv 2409.02813
29. Chan, J. S., et al. (2024). "MLE-bench." arxiv 2410.07095
30. Wei, J., et al. (2024). "Measuring Short-Form Factuality (SimpleQA)." arxiv 2411.04368
31. Glazer, E., et al. (2024). "FrontierMath." arxiv 2411.04872
32. Aider AI (Dec 2024). "o1 tops aider's new polyglot leaderboard." aider.chat
33. Phan, L., et al. (2025). "Humanity's Last Exam." arxiv 2501.14249
34. Chollet, F., et al. (2025). "ARC-AGI-2." arxiv 2505.11831
35. Arora, R. K., et al. (2025). "HealthBench." arxiv 2505.08775
36. Zhang, L., et al. (2025). "SWE-bench Goes Live!" arxiv 2505.23419
37. Barres, V., et al. (2025). "tau2-bench." arxiv 2506.07982
38. R0bk (community). "Killed by LLM." website, github
39. OpenAI (2023). "GPT-4 Technical Report." arxiv 2303.08774
40. ARC Prize (Dec 2024). "OpenAI o3 Breakthrough High Score on ARC-AGI-Pub." arcprize.org
41. OpenAI (2026). "Why we no longer evaluate SWE-bench Verified." openai.com
42. Epoch AI (2024). "FrontierMath: Evaluating Advanced Mathematical Reasoning in AI." epoch.ai
43. Scale AI and CAIS (Jan 2025). "Unveiling the Results of Humanity's Last Exam." scale.com
