LLM Benchmarks Timeline

A timeline of language-model benchmarks: when each was created, when it was effectively defeated (saturated), and by which model.

== 2024 ==
{| class="wikitable"
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Details
! Links
|-
| [[ARC-AGI]]
| Reasoning
| 2019-11 – 2024-12
| 2019-11
| 2024-12
| Saturation
| O3
| Human Baseline: ~80%
| O3: 87.5%
| Abstract reasoning challenge with visual pattern completion tasks created by François Chollet.
| [https://arxiv.org/abs/1911.01547 Paper], [https://arcs-benchmark.org Website]
|-
| [[MATH]]
| Mathematics
| 2021-03 – 2024-09
| 2021-03
| 2024-09
| Saturation
| O1
| Average CS PhD: ~40%
| O1: 94.8%
| 12K challenging competition math problems from AMC/AIME, requiring complex multi-step reasoning.
| [https://arxiv.org/abs/2103.03874 Paper], [https://github.com/hendrycks/math GitHub]
|-
| [[BIG-Bench-Hard]]
| Multi-task
| 2022-10 – 2024-06
| 2022-10
| 2024-06
| Saturation
| Sonnet 3.5
| Average Human: 67.7%
| Sonnet 3.5: 93.1%
| A curated suite of 23 challenging tasks from BIG-Bench.
| [https://arxiv.org/abs/2210.09261 Paper], [https://github.com/suzgunmirac/BIG-Bench-Hard GitHub], [https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Claude-3-Model-Card-October-Addendum.pdf Evidence]
|-
| [[HumanEval]]
| Coding
| 2021-07 – 2024-05
| 2021-07
| 2024-05
| Saturation
| GPT-4o
| Unspecified
| GPT-4o: 90.2%
| 164 Python programming problems testing coding abilities (see the evaluation sketch after this table).
| [https://arxiv.org/abs/2107.03374 Paper], [https://github.com/openai/human-eval GitHub], [https://openai.com/index/hello-gpt-4o/ Evidence]
|-
| [[IFEval]]
| Instruction Following
| 2023-11 – 2024-03
| 2023-11
| 2024-03
| Saturation
| Llama 3.3 70B
| Unspecified
| Llama 3.3 70B: 92.1%
| Evaluation suite testing multi-step instruction-following capabilities.
| [https://arxiv.org/abs/2311.07911 Paper], [https://github.com/google-research/google-research/tree/master/instruction_following_eval GitHub], [https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md Evidence]
|}
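To make concrete what "defeating" HumanEval involves: each of the 164 problems supplies a function signature and docstring, the model writes the function body, and the completion is executed against hidden unit tests (reported as pass@k). Below is a minimal sketch of running a candidate model through the openai/human-eval harness; <code>generate_completion</code> is an illustrative stub standing in for the model under test, not part of the harness.

<syntaxhighlight lang="python">
# Minimal sketch of a HumanEval run with the openai/human-eval harness.
# `generate_completion` is a placeholder for a real model call.
from human_eval.data import read_problems, write_jsonl

def generate_completion(prompt: str) -> str:
    # Placeholder: a real evaluation would send `prompt` (the function
    # signature + docstring) to an LLM and return the generated body.
    return "    pass\n"

problems = read_problems()  # dict: task_id -> {"prompt", "entry_point", "test", ...}
samples = [
    {"task_id": task_id, "completion": generate_completion(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)

# Scoring (runs the hidden unit tests and reports pass@k):
#   $ evaluate_functional_correctness samples.jsonl
</syntaxhighlight>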
== 2023 ==
{| class="wikitable"
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Details
! Links
|-
| [[GSM8K]]
| Mathematics
| 2021-10 – 2023-11
| 2021-10
| 2023-11
| Saturation
| GPT-4
| Unspecified
| GPT-4: 92.0%
| 8.5K grade school math word problems requiring step-by-step solutions (see the scoring sketch after this table).
| [https://arxiv.org/abs/2110.14168 Paper], [https://github.com/openai/grade-school-math GitHub], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|-
| [[Turing Test]]
| Conversation
| 1950-10 – 2023-03
| 1950-10
| 2023-03
| Saturation
| GPT-4
| Interrogator > 50%
| Interrogator 46%
| The original AI benchmark proposed by Alan Turing in 1950 (the "imitation game").
| [https://courses.cs.umbc.edu/471/papers/turing.pdf Paper], [https://arxiv.org/pdf/2405.08007 Evidence]
|-
| [[ARC (AI2)]]
| Reasoning
| 2018-03 – 2023-03
| 2018-03
| 2023-03
| Saturation
| GPT-4
| Unspecified
| GPT-4: 96.3%
| Grade-school multiple-choice reasoning tasks testing logical, spatial, temporal reasoning.
| [https://arxiv.org/abs/1803.05457 Paper], [https://leaderboard.allenai.org/arc/submissions/get-started Website], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|-
| [[HellaSwag]]
| Common Sense
| 2019-05 – 2023-03
| 2019-05
| 2023-03
| Saturation
| GPT-4
| Human: 95.6%
| GPT-4: 95.3%
| Multiple-choice questions about everyday scenarios with adversarial filtering.
| [https://arxiv.org/abs/1905.07830 Paper], [https://rowanzellers.com/hellaswag/ Website], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|-
| [[MMLU]]
| Knowledge
| 2020-09 – 2023-03
| 2020-09
| 2023-03
| Saturation
| GPT-4
| 95th pct Human: 87.0%
| GPT-4: 87.3%
| 57 subjects from real-world sources (professional exams) testing breadth and depth of knowledge.
| [https://arxiv.org/abs/2009.03300 Paper], [https://github.com/hendrycks/test GitHub], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|-
| [[WinoGrande]]
| Common Sense
| 2019-07 – 2023-03
| 2019-07
| 2023-03
| Saturation
| GPT-4
| Human: 94%
| GPT-4: 87.5%
| Enhanced WSC with 44K problems testing common-sense pronoun resolution.
| [https://arxiv.org/abs/1907.10641 Paper], [https://winogrande.allenai.org/ Website], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|}
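GSM8K answers are free-form chains of reasoning, so scoring typically reduces to extracting the final number and comparing it with the reference. A minimal sketch, assuming the Hugging Face <code>datasets</code> copy of GSM8K (dataset id <code>gsm8k</code>, config <code>main</code>), where each reference answer ends with a line of the form <code>#### &lt;number&gt;</code>; the <code>solve</code> stub is a placeholder for the model under test.

<syntaxhighlight lang="python">
# Sketch of GSM8K exact-match scoring; `solve` is a placeholder model.
import re
from datasets import load_dataset  # assumes the Hugging Face `datasets` library

def final_number(text: str) -> str:
    """Return the last number in `text`, with thousands separators stripped."""
    nums = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return nums[-1].replace(",", "") if nums else ""

def solve(question: str) -> str:
    # Placeholder: a real run would prompt an LLM for step-by-step reasoning.
    return "The answer is 0."

test = load_dataset("gsm8k", "main", split="test")  # fields: "question", "answer"
correct = 0
for row in test:
    gold = row["answer"].split("####")[-1].strip().replace(",", "")
    if final_number(solve(row["question"])) == gold:
        correct += 1
print(f"accuracy = {correct / len(test):.3f}")
</syntaxhighlight>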




== Pre-2023 ==

=== 2022 ===
{| class="wikitable"
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Details
! Links
|-
| [[BIG-Bench]]
| Multi-task
| 2021-06 – 2022-04
| 2021-06
| 2022-04
| Saturation
| PaLM 540B
| Human: 49.8%
| PaLM 540B: 61.4%
| 204 tasks spanning linguistics, math, common-sense reasoning, and more.
| Paper, GitHub, Evidence
|}

=== 2019 ===
{| class="wikitable"
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Details
! Links
|-
| [[SuperGLUE]]
| Language
| 2019-05 – 2019-10
| 2019-05
| 2019-10
| Saturation
| T5
| Human: 89.8%
| T5: 89.3%
| More challenging language understanding tasks (word sense disambiguation, causal reasoning, reading comprehension).
| Paper, Website
|}

=== 2018 ===
{| class="wikitable"
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Details
! Links
|-
| [[SWAG]]
| Common Sense
| 2018-05 – 2018-10
| 2018-05
| 2018-10
| Saturation
| BERT
| Human: 88%
| BERT: 86%
| 113K multiple-choice questions about grounded situations (common-sense "next step" prediction; see the scoring sketch below).
| [https://arxiv.org/abs/1808.05326 Paper], [https://rowanzellers.com/swag/ Website]
|}
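SWAG-style benchmarks (and successors such as HellaSwag) are scored by asking the model to pick the most plausible continuation out of four candidates. The sketch below shows that scoring loop on a single hard-coded, illustrative item (not drawn from the actual SWAG data); the <code>plausibility</code> function is a deliberately trivial word-overlap stand-in for a real language model's log-likelihood.

<syntaxhighlight lang="python">
# Sketch of SWAG/HellaSwag-style multiple-choice scoring.
# `plausibility` is a toy stand-in for a language model's log-likelihood.

def plausibility(context: str, ending: str) -> float:
    # Toy heuristic: count words the ending shares with the context.
    ctx_words = set(context.lower().split())
    return sum(1.0 for w in ending.lower().split() if w in ctx_words)

def predict(context: str, endings: list[str]) -> int:
    """Return the index of the ending judged most plausible."""
    scores = [plausibility(context, e) for e in endings]
    return scores.index(max(scores))

# One illustrative item (invented for this example, not from SWAG).
example = {
    "context": "The chef cracks two eggs into the bowl. She",
    "endings": [
        "whisks the eggs together with a fork.",
        "parks a car in her garage.",
        "climbs a mountain before sunrise.",
        "signs a contract with blue ink.",
    ],
    "label": 0,
}

pred = predict(example["context"], example["endings"])
print("predicted:", pred, "correct:", pred == example["label"])
</syntaxhighlight>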