- See also: LLM Comparisons and LLM Rankings
Timeline of benchmarks surpassed by large language models (LLMs).
2024
| Benchmark | Category | Date Created | Date Defeated | Killed By | Defeated By Model | Original Score | Final Score | Details | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ARC-AGI | Reasoning | 2019-11 | 2024-12 | Saturation | o3 | Human Baseline: ~80% | o3: 87.5% | Abstract reasoning challenge with visual pattern-completion tasks, created by François Chollet. | Paper, Website |
| MATH | Mathematics | 2021-03 | 2024-09 | Saturation | o1 | Average CS PhD: ~40% | o1: 94.8% | 12K challenging competition math problems from AMC/AIME, requiring complex multi-step reasoning. | Paper, GitHub |
| BIG-Bench-Hard | Multi-task | 2022-10 | 2024-06 | Saturation | Claude 3.5 Sonnet | Average Human: 67.7% | Claude 3.5 Sonnet: 93.1% | A curated suite of 23 challenging tasks from BIG-Bench. | Paper, GitHub, Evidence |
| HumanEval | Coding | 2021-07 | 2024-05 | Saturation | GPT-4o | Unspecified | GPT-4o: 90.2% | 164 Python programming problems testing coding abilities (pass@k scoring sketched after this table). | Paper, GitHub, Evidence |
| IFEval | Instruction Following | 2023-11 | 2024-03 | Saturation | Llama 3.3 70B | Unspecified | Llama 3.3 70B: 92.1% | Evaluation suite testing multi-step instruction-following capabilities. | Paper, GitHub, Evidence |
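The HumanEval figure above is typically a pass@1 score. Below is a minimal sketch of the unbiased pass@k estimator described in the HumanEval paper, assuming `n` completions were sampled per problem and `c` of them passed the unit tests; the surrounding harness (sampling completions and executing the tests) is omitted.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for a single problem.

    n: total completions sampled, c: completions that passed all unit tests,
    k: evaluation budget. Averaging this value over all 164 problems gives
    the benchmark score (k = 1 for the pass@1 numbers quoted above).
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```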
2023
| Benchmark | Category | Date Created | Date Defeated | Killed By | Defeated By Model | Original Score | Final Score | Details | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GSM8K | Mathematics | 2021-10 | 2023-11 | Saturation | GPT-4 | Unspecified | GPT-4: 92.0% | 8.5K grade-school math word problems requiring step-by-step solutions (answer-matching sketch after this table). | Paper, GitHub, Evidence |
| Turing Test | Conversation | 1950-10 | 2023-03 | Saturation | GPT-4 | Interrogator: > 50% | Interrogator: 46% | The original AI benchmark proposed by Alan Turing in 1950 (the "imitation game"). | Paper, Evidence |
| ARC (AI2) | Reasoning | 2018-03 | 2023-03 | Saturation | GPT-4 | Unspecified | GPT-4: 96.3% | Grade-school multiple-choice reasoning tasks testing logical, spatial, and temporal reasoning. | Paper, Website, Evidence |
| HellaSwag | Common Sense | 2019-05 | 2023-03 | Saturation | GPT-4 | Human: 95.6% | GPT-4: 95.3% | Multiple-choice questions about everyday scenarios with adversarial filtering. | Paper, Website, Evidence |
| MMLU | Knowledge | 2020-09 | 2023-03 | Saturation | GPT-4 | 95th-percentile Human: 87.0% | GPT-4: 87.3% | 57 subjects drawn from real-world sources (professional exams) testing breadth and depth of knowledge. | Paper, GitHub, Evidence |
| WinoGrande | Common Sense | 2019-07 | 2023-03 | Saturation | GPT-4 | Human: 94% | GPT-4: 87.5% | Enhanced WSC with 44K problems testing common-sense pronoun resolution. | Paper, Website, Evidence |
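GSM8K is scored by exact match on the final numeric answer: reference solutions end with a `#### <answer>` line, and a model's output is usually reduced to its last number before comparison. The sketch below illustrates that convention; the `final_number` helper and its fallback rule are illustrative, not the official evaluation code.

```python
import re

def final_number(text: str) -> str | None:
    """Return the final answer of a GSM8K-style solution.

    Reference solutions end with '#### <answer>'; for model outputs that
    lack the marker, fall back to the last number in the text.
    """
    marked = re.search(r"####\s*(-?[\d,]*\.?\d+)", text)
    if marked:
        return marked.group(1).replace(",", "")
    numbers = re.findall(r"-?[\d,]*\.?\d+", text)
    return numbers[-1].replace(",", "") if numbers else None

def gsm8k_correct(model_output: str, reference: str) -> bool:
    # Accuracy over the test set is the fraction of problems where this holds.
    return final_number(model_output) == final_number(reference)
```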
Pre-2023
2022
| Benchmark | Category | Date Created | Date Defeated | Killed By | Defeated By Model | Original Score | Final Score | Details | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BIG-Bench | Multi-task | 2021-06 | 2022-04 | Saturation | PaLM 540B | Human: 49.8% | PaLM 540B: 61.4% | 204 tasks spanning linguistics, math, common-sense reasoning, and more. | Paper, GitHub, Evidence |
2019
| Benchmark | Category | Date Created | Date Defeated | Killed By | Defeated By Model | Original Score | Final Score | Details | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SuperGLUE | Language | 2019-05 | 2019-10 | Saturation | T5 | Human: 89.8% | T5: 89.3% | More challenging language-understanding tasks (word sense, causal reasoning, reading comprehension). | Paper, Website |
| WSC | Common Sense | 2012-05 | 2019-07 | Saturation | RoBERTa (w/ SFT) | Human: 96.5% | RoBERTa (w/ SFT): 90.1% | Carefully crafted sentence pairs with ambiguous pronoun references. | Paper, Website |
| GLUE | Language | 2018-05 | 2019-06 | Saturation | XLNet | Human: 87.1% | XLNet: 88.4% | Nine tasks for evaluating NLU (inference, paraphrase, similarity, etc.). | Paper, Website |
| TriviaQA | Knowledge | 2017-05 | 2019-06 | Saturation | SpanBERT | Human: 79.7% | SpanBERT: 83.6% | 650K question-answer-evidence triples requiring cross-sentence reasoning. | Paper, Website |
| SQuAD v2.0 | Language | 2018-05 | 2019-04 | Saturation | BERT | Human: 89.5% | BERT: 89.5% | Extension of SQuAD adding unanswerable questions. | Paper, Website |
| SQuAD | Language | 2016-05 | 2019-03 | Saturation | BERT | Human: 91.2% | BERT: 93.2% | 100,000+ QA tasks on Wikipedia articles (EM/F1 scoring sketched after this table). | Paper, Website |
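SQuAD-style extractive QA (the last two rows above) is scored with exact match and token-level F1 against the gold answer span. Below is a minimal sketch of the F1 half, following the normalization used by the official evaluation script (lowercasing, stripping punctuation and the articles a/an/the); the helper names and exact regexes here are illustrative. In the full benchmark, each question's score is the maximum over its gold answers, averaged across the dataset.

```python
import re
import string
from collections import Counter

def normalize_tokens(text: str) -> list[str]:
    """Lowercase, drop punctuation and the articles a/an/the, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def squad_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer string and one gold answer."""
    pred, ref = normalize_tokens(prediction), normalize_tokens(gold)
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```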
2018
| Benchmark | Category | Date Created | Date Defeated | Killed By | Defeated By Model | Original Score | Final Score | Details | Links |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SWAG | Common Sense | 2018-05 | 2018-10 | Saturation | BERT | Human: 88% | BERT: 86% | 113K multiple-choice questions about grounded situations (common-sense “next step”). | Paper, Website |