LLM Benchmarks Timeline

{{see also|LLM Comparisons|LLM Rankings}}
Timeline of [[benchmarks]] surpassed by [[large language models]] (LLMs).
 
==2024==
{| class="wikitable sortable"
|-
! Benchmark
! Category
! Date Created
! Date Defeated
! Killed By
! Defeated By Model
! Original Score
! Final Score
! Details
! Links
|-
| [[ARC-AGI]]
| Reasoning
| 2019-11
| 2024-12
| Saturation
| O3
| Human Baseline: ~80%
| O3: 87.5%
| Abstract reasoning challenge with visual pattern completion tasks, created by François Chollet.
| Paper, Website
|-
| [[MATH]]
| Mathematics
| 2021-03
| 2024-09
| Saturation
| O1
| Average CS PhD: ~40%
| O1: 94.8%
| 12K challenging competition math problems from AMC/AIME, requiring complex multi-step reasoning.
| Paper, GitHub
|-
| [[BIG-Bench-Hard]]
| Multi-task
| 2022-10
| 2024-06
| Saturation
| Sonnet 3.5
| Average Human: 67.7%
| Sonnet 3.5: 93.1%
| A curated suite of 23 challenging tasks from BIG-Bench.
| Paper, GitHub, Evidence
|-
| [[HumanEval]]
| Coding
| 2021-07
| 2024-05
| Saturation
| GPT-4o
| Unspecified
| GPT-4o: 90.2%
| 164 Python programming problems testing coding abilities.
| Paper, GitHub, Evidence
|-
| [[IFEval]]
| Instruction Following
| 2023-11
| 2024-03
| Saturation
| Llama 3.3 70B
| Unspecified
| Llama 3.3 70B: 92.1%
| Evaluation suite testing multi-step instruction-following capabilities.
| [https://arxiv.org/abs/2311.07911 Paper], [https://github.com/google-research/google-research/tree/master/instruction_following_eval GitHub], [https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md Evidence]
|}
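
The HumanEval score above is conventionally reported as pass@k (pass@1 in most model cards): the probability that at least one of k sampled completions for a problem passes its unit tests. The sketch below shows the standard unbiased estimator described in the HumanEval paper (Chen et al., 2021); it is illustrative only and is not the specific harness behind the 90.2% figure in the table.

<syntaxhighlight lang="python">
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (HumanEval, Chen et al., 2021).

    n: completions sampled for one problem
    c: completions that pass all unit tests
    k: evaluation budget
    """
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a stable running product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# pass@1 reduces to the plain pass rate: 2 passing out of 5 samples -> 0.4
print(pass_at_k(n=5, c=2, k=1))
</syntaxhighlight>
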
==2023==
{| class="wikitable sortable"
|-
! Benchmark
! Category
! Date Created
! Date Defeated
! Killed By
! Defeated By Model
! Original Score
! Final Score
! Details
! Links
|-
| [[GSM8K]]
| Mathematics
| 2021-10
| 2023-11
| Saturation
| GPT-4
| Unspecified
| GPT-4: 92.0%
| 8.5K grade-school math word problems requiring step-by-step solutions.
| Paper, GitHub, Evidence
|-
| [[Turing Test]]
| Conversation
| 1950-10
| 2023-03
| Saturation
| GPT-4
| Interrogator > 50%
| Interrogator: 46%
| The original AI benchmark proposed by Alan Turing in 1950 (the "imitation game").
| Paper, Evidence
|-
| [[ARC (AI2)]]
| Reasoning
| 2018-03
| 2023-03
| Saturation
| GPT-4
| Unspecified
| GPT-4: 96.3%
| Grade-school multiple-choice reasoning tasks testing logical, spatial, and temporal reasoning.
| Paper, Website, Evidence
|-
| [[HellaSwag]]
| Common Sense
| 2019-05
| 2023-03
| Saturation
| GPT-4
| Human: 95.6%
| GPT-4: 95.3%
| Multiple-choice questions about everyday scenarios with adversarial filtering.
| Paper, Website, Evidence
|-
| [[MMLU]]
| Knowledge
| 2020-09
| 2023-03
| Saturation
| GPT-4
| 95th pct Human: 87.0%
| GPT-4: 87.3%
| 57 subjects drawn from real-world sources such as professional exams, testing breadth and depth of knowledge.
| Paper, GitHub, Evidence
|-
| [[WinoGrande]]
| Common Sense
| 2019-07
| 2023-03
| Saturation
| GPT-4
| Human: 94%
| GPT-4: 87.5%
| Enhanced WSC with 44K problems testing common-sense pronoun resolution.
| Paper, Website, Evidence
|}
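
GSM8K (first row above) is scored by exact match on the final numeric answer: each reference solution in the dataset ends with a line of the form "#### <answer>". The sketch below illustrates that comparison; the regex and the last-number heuristic for the model's output are assumptions of this example, not the benchmark's official grader.

<syntaxhighlight lang="python">
import re

ANSWER_RE = re.compile(r"####\s*(-?[\d,]+(?:\.\d+)?)")

def reference_answer(solution: str) -> str:
    """Final answer after the '#### ' marker in a GSM8K reference solution."""
    match = ANSWER_RE.search(solution)
    return match.group(1).replace(",", "") if match else ""

def predicted_answer(output: str) -> str:
    """Last number in the model output, a common heuristic for its final answer."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", output)
    return numbers[-1].replace(",", "") if numbers else ""

def is_correct(output: str, solution: str) -> bool:
    return predicted_answer(output) == reference_answer(solution)

# Example: the reference ends with "#### 42" and the model concludes "... the answer is 42."
print(is_correct("Step by step, the answer is 42.", "She buys 6 * 7 = 42 eggs.\n#### 42"))  # True
</syntaxhighlight>
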


==Pre-2023==
===2022===
{| class="wikitable sortable"
|-
! Benchmark
! Category
! Date Created
! Date Defeated
! Killed By
! Defeated By Model
! Original Score
! Final Score
! Details
! Links
|-
| [[BIG-Bench]]
| Multi-task
| 2021-06
| 2022-04
| Saturation
| [[Palm 540B|PaLM 540B]]
| Human: 49.8%
| PaLM 540B: 61.4%
| 204 tasks spanning linguistics, math, common-sense reasoning, and more.
| Paper, GitHub, Evidence
|}


===2019===
{| class="wikitable sortable"
|-
! Benchmark
! Category
! Date Created
! Date Defeated
! Killed By
! Defeated By Model
! Original Score
! Final Score
! Details
! Links
|-
| [[SuperGLUE]]
| Language
| 2019-05
| 2019-10
| Saturation
| [[T5]]
| Human: 89.8%
| T5: 89.3%
| More challenging language-understanding tasks (word sense, causal reasoning, reading comprehension).
| Paper, Website
|-
| [[WSC]]
| Common Sense
| 2012-05
| 2019-07
| Saturation
| [[ROBERTA (w SFT)|RoBERTa (w/ SFT)]]
| Human: 96.5%
| RoBERTa (w/ SFT): 90.1%
| Carefully crafted sentence pairs with ambiguous pronoun references.
| Paper, Website
|-
| [[GLUE]]
| Language
| 2018-05
| 2019-06
| Saturation
| [[XLNet]]
| Human: 87.1%
| XLNet: 88.4%
| Nine tasks for evaluating natural language understanding (inference, paraphrase, similarity, etc.).
| Paper, Website
|-
| [[TriviaQA]]
| Knowledge
| 2017-05
| 2019-06
| Saturation
| [[SpanBERT]]
| Human: 79.7%
| SpanBERT: 83.6%
| 650K question-answer-evidence triples requiring cross-sentence reasoning.
| Paper, Website
|-
| [[SQuAD v2.0]]
| Language
| 2018-05
| 2019-04
| Saturation
| [[BERT]]
| Human: 89.5%
| BERT: 89.5%
| Extension of SQuAD adding unanswerable questions.
| Paper, Website
|-
| [[SQuAD]]
| Language
| 2016-05
| 2019-03
| Saturation
| [[BERT]]
| Human: 91.2%
| BERT: 93.2%
| 100,000+ question-answering tasks on Wikipedia articles.
| Paper, Website
|}
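
The SQuAD and SQuAD v2.0 scores above are reported in the style of the benchmarks' token-level F1 metric over the predicted answer span. Below is a simplified sketch of that metric under the usual normalization (lowercasing, stripping punctuation and articles); the official evaluation script additionally takes the maximum over multiple reference answers and also reports exact match, which this example omits.

<syntaxhighlight lang="python">
import re
import string
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and articles, and split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def f1_score(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer span."""
    pred, gold = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Partial-credit example: shares "1905" and "revolution" with the reference -> 0.8
print(f1_score("the 1905 revolution", "1905 Russian Revolution"))
</syntaxhighlight>
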


===2018===
{| class="wikitable sortable"
|-
! Benchmark
! Category
! Date Created
! Date Defeated
! Killed By
! Defeated By Model
! Original Score
! Final Score
! Details
! Links
|-
| [[SWAG]]
| Common Sense
| 2018-05
| 2018-10
| Saturation
| [[BERT]]
| Human: 88%
| BERT: 86%
| 113K multiple-choice questions about grounded situations (common-sense "next step" prediction).
| [https://arxiv.org/abs/1808.05326 Paper], [https://rowanzellers.com/swag/ Website]
|}
==References==
* [https://r0bk.github.io/killedbyllm/ Killed by LLM website]
* [https://github.com/R0bk/killedbyllm Killed by LLM on GitHub]
[[Category:Benchmarks]] [[Category:Timelines]] [[Category:Aggregate pages]]
