LLM Benchmarks Timeline: Difference between revisions

Revision as of 17:02, 10 January 2025

2024

Benchmark	Category	Date Created	Date Defeated	Killed By	Defeated By	Original Score	Final Score	Details	Links
ARC-AGI	Reasoning	2019-11	2024-12	Saturation	o3	Human Baseline: ~80%	o3: 87.5%	Abstract reasoning challenge with visual pattern completion tasks created by François Chollet.	Paper, Website
MATH	Mathematics	2021-03	2024-09	Saturation	o1	Average CS PhD: ~40%	o1: 94.8%	12K challenging competition math problems from AMC/AIME, requiring complex multi-step reasoning.	Paper, GitHub
BIG-Bench-Hard	Multi-task	2022-10	2024-06	Saturation	Sonnet 3.5	Average Human: 67.7%	Sonnet 3.5: 93.1%	A curated suite of 23 challenging tasks from BIG-Bench.	Paper, GitHub, Evidence
HumanEval	Coding	2021-07	2024-05	Saturation	GPT-4o	Unspecified	GPT-4o: 90.2%	164 Python programming problems testing coding abilities.	Paper, GitHub, Evidence
IFEval	Instruction Following	2023-11	2024-03	Saturation	Llama 3.3 70B	Unspecified	Llama 3.3 70B: 92.1%	Evaluation suite testing multi-step instruction-following capabilities.	Paper, GitHub, Evidence

2023

Benchmark	Category	Date Created	Date Defeated	Killed By	Defeated By	Original Score	Final Score	Details	Links
GSM8K	Mathematics	2021-10	2023-11	Saturation	GPT-4	Unspecified	GPT-4: 92.0%	8.5K grade school math word problems requiring step-by-step solutions.	Paper, GitHub, Evidence
Turing Test	Conversation	1950-10	2023-03	Saturation	GPT-4	Interrogator > 50%	Interrogator 46%	The original AI benchmark proposed by Alan Turing in 1950 (the "imitation game").	Paper, Evidence
ARC (AI2)	Reasoning	2018-03	2023-03	Saturation	GPT-4	Unspecified	GPT-4: 96.3%	Grade-school multiple-choice reasoning tasks testing logical, spatial, temporal reasoning.	Paper, Website, Evidence
HellaSwag	Common Sense	2019-05	2023-03	Saturation	GPT-4	Human: 95.6%	GPT-4: 95.3%	Multiple-choice questions about everyday scenarios with adversarial filtering.	Paper, Website, Evidence
MMLU	Knowledge	2020-09	2023-03	Saturation	GPT-4	95th pct Human: 87.0%	GPT-4: 87.3%	57 subjects from real-world sources (professional exams) testing breadth and depth of knowledge.	Paper, GitHub, Evidence
WinoGrande	Common Sense	2019-07	2023-03	Saturation	GPT-4	Human: 94%	GPT-4: 87.5%	Enhanced WSC with 44K problems testing common-sense pronoun resolution.	Paper, Website, Evidence

Pre-2023

2022

Benchmark	Category	Time Span	Date Created	Date Defeated	Killed By	Defeated By	Original Score	Final Score	Details	Links
BIG-Bench	Multi-task	2021-06 – 2022-04	2021-06	2022-04	Saturation	PaLM 540B	Human: 49.8%	PaLM 540B: 61.4%	204 tasks spanning linguistics, math, common-sense reasoning, and more.	Paper, GitHub, Evidence

2019

Benchmark	Category	Time Span	Date Created	Date Defeated	Killed By	Defeated By	Original Score	Final Score	Details	Links
SuperGLUE	Language	2019-05 – 2019-10	2019-05	2019-10	Saturation	T5	Human: 89.8%	T5: 89.3%	More challenging language understanding tasks (word sense, causal reasoning, reading comprehension).	Paper, Website
WSC	Common Sense	2012-05 – 2019-07	2012-05	2019-07	Saturation	RoBERTa (w/ SFT)	Human: 96.5%	RoBERTa (w/ SFT): 90.1%	Carefully crafted sentence pairs with ambiguous pronoun references.	Paper, Website
GLUE	Language	2018-05 – 2019-06	2018-05	2019-06	Saturation	XLNet	Human: 87.1%	XLNet: 88.4%	Nine tasks for evaluating NLU (inference, paraphrase, similarity, etc.).	Paper, Website
TriviaQA	Knowledge	2017-05 – 2019-06	2017-05	2019-06	Saturation	SpanBERT	Human: 79.7%	SpanBERT: 83.6%	650K QA-evidence triples requiring cross-sentence reasoning.	Paper, Website
SQuAD v2.0	Language	2018-05 – 2019-04	2018-05	2019-04	Saturation	BERT	Human: 89.5%	BERT: 89.5%	Extension of SQuAD adding unanswerable questions.	Paper, Website
SQuAD	Language	2016-05 – 2019-03	2016-05	2019-03	Saturation	BERT	Human: 91.2%	BERT: 93.2%	100,000+ QA tasks on Wikipedia articles.	Paper, Website

2018

Benchmark	Category	Time Span	Date Created	Date Defeated	Killed By	Defeated By	Original Score	Final Score	Details	Links
SWAG	Common Sense	2018-05 – 2018-10	2018-05	2018-10	Saturation	BERT	Human: 88%	BERT: 86%	113K multiple-choice questions about grounded situations (common sense “next step”).	Paper, Website

@@ Line 4: / Line 4: @@
 ! Benchmark
 ! Category
-! Time Span
 ! Date Created
 ! Date Defeated
@@ Line 16: / Line 15: @@
 | [[ARC-AGI]]
 | Reasoning
-| 2019-11 – 2024-12
 | 2019-11
 | 2024-12
 | Saturation
-| O3
+| [[O3]]
 | Human Baseline: ~80%
 | O3: 87.5%
@@ Line 28: / Line 26: @@
 | [[MATH]]
 | Mathematics
-| 2021-03 – 2024-09
 | 2021-03
 | 2024-09
 | Saturation
-| O1
+| [[O1]]
 | Average CS PhD: ~40%
 | O1: 94.8%
@@ Line 40: / Line 37: @@
 | [[BIG-Bench-Hard]]
 | Multi-task
-| 2022-10 – 2024-06
 | 2022-10
 | 2024-06
 | Saturation
-| Sonnet 3.5
+| [[Sonnet 3.5]]
 | Average Human: 67.7%
 | Sonnet 3.5: 93.1%
@@ Line 52: / Line 48: @@
 | [[HumanEval]]
 | Coding
-| 2021-07 – 2024-05
 | 2021-07
 | 2024-05
 | Saturation
-| GPT-4o
+| [[GPT-4o]]
 | Unspecified
 | GPT-4o: 90.2%
@@ Line 64: / Line 59: @@
 | [[IFEval]]
 | Instruction Following
-| 2023-11 – 2024-03
 | 2023-11
 | 2024-03
 | Saturation
-| LLama 3.3 70B
+| [[LLama 3.3 70B]]
 | Unspecified
 | LLama 3.3 70B: 92.1%
@@ Line 79: / Line 73: @@
 ! Benchmark
 ! Category
-! Time Span
 ! Date Created
 ! Date Defeated
@@ Line 91: / Line 84: @@
 | [[GSM8K]]
 | Mathematics
-| 2021-10 – 2023-11
 | 2021-10
 | 2023-11
 | Saturation
-| GPT-4
+| [[GPT-4]]
 | Unspecified
 | GPT-4: 92.0%
@@ Line 103: / Line 95: @@
 | [[Turing Test]]
 | Conversation
-| 1950-10 – 2023-03
 | 1950-10
 | 2023-03
 | Saturation
-| GPT-4
+| [[GPT-4]]
 | Interrogator > 50%
 | Interrogator 46%
@@ Line 115: / Line 106: @@
 | [[ARC (AI2)]]
 | Reasoning
-| 2018-03 – 2023-03
 | 2018-03
 | 2023-03
 | Saturation
-| GPT-4
+| [[GPT-4]]
 | Unspecified
 | GPT-4: 96.3%
@@ Line 127: / Line 117: @@
 | [[HellaSwag]]
 | Common Sense
-| 2019-05 – 2023-03
 | 2019-05
 | 2023-03
 | Saturation
-| GPT-4
+| [[GPT-4]]
 | Human: 95.6%
 | GPT-4: 95.3%
@@ Line 139: / Line 128: @@
 | [[MMLU]]
 | Knowledge
-| 2020-09 – 2023-03
 | 2020-09
 | 2023-03
 | Saturation
-| GPT-4
+| [[GPT-4]]
 | 95th pct Human: 87.0%
 | GPT-4: 87.3%
@@ Line 151: / Line 139: @@
 | [[WinoGrande]]
 | Common Sense
-| 2019-07 – 2023-03
 | 2019-07
 | 2023-03
 | Saturation
-| GPT-4
+| [[GPT-4]]
 | Human: 94%
 | GPT-4: 87.5%