LLM Benchmarks Timeline: Difference between revisions

Revision as of 16:36, 10 January 2025

2024

Benchmark	Category	Time Span	Date Created	Date Defeated	Killed By	Defeated By	Original Score	Final Score	Links	Details
ARC-AGI	Reasoning	2019-11 – 2024-12	2019-11	2024-12	Saturation	o3	Human Baseline: ~80%	o3: 87.5%	[Paper](https://arxiv.org/abs/1911.01547), [Website](https://arcs-benchmark.org)	Abstract reasoning challenge with visual pattern completion tasks created by François Chollet.
MATH	Mathematics	2021-03 – 2024-09	2021-03	2024-09	Saturation	o1	Average CS PhD: ~40%	o1: 94.8%	[Paper](https://arxiv.org/abs/2103.03874), [GitHub](https://github.com/hendrycks/math)	12K challenging competition math problems from AMC/AIME, requiring complex multi-step reasoning.
BIG-Bench-Hard	Multi-task	2022-10 – 2024-06	2022-10	2024-06	Saturation	Sonnet 3.5	Average Human: 67.7%	Sonnet 3.5: 93.1%	[Paper](https://arxiv.org/abs/2210.09261), [GitHub](https://github.com/suzgunmirac/BIG-Bench-Hard), [Evidence](https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Claude-3-Model-Card-October-Addendum.pdf)	A curated suite of 23 challenging tasks from BIG-Bench.
HumanEval	Coding	2021-07 – 2024-05	2021-07	2024-05	Saturation	GPT-4o	Unspecified	GPT-4o: 90.2%	[Paper](https://arxiv.org/abs/2107.03374), [GitHub](https://github.com/openai/human-eval), [Evidence](https://openai.com/index/hello-gpt-4o/)	164 Python programming problems testing coding abilities.
IFEval	Instruction Following	2023-11 – 2024-03	2023-11	2024-03	Saturation	Llama 3.3 70B	Unspecified	Llama 3.3 70B: 92.1%	[Paper](https://arxiv.org/abs/2311.07911), [GitHub](https://github.com/google-research/google-research/tree/master/instruction_following_eval), [Evidence](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md)	Evaluation suite testing multi-step instruction-following capabilities.
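
The older tables on this page carry an Age column, while the 2024 table lists Date Created and Date Defeated instead. As a minimal sketch, assuming Age is simply the whole-month gap between those two `YYYY-MM` values, it could be derived as follows. The function names `parse_year_month` and `age` are hypothetical and not part of the wiki.

```python
def parse_year_month(text: str) -> tuple[int, int]:
    """Split a 'YYYY-MM' string into (year, month) integers."""
    year, month = text.split("-")
    return int(year), int(month)


def age(created: str, defeated: str) -> str:
    """Format the whole-month gap between two 'YYYY-MM' dates.

    Assumption: Age counts calendar months only and ignores days.
    """
    created_year, created_month = parse_year_month(created)
    defeated_year, defeated_month = parse_year_month(defeated)
    months = (defeated_year - created_year) * 12 + (defeated_month - created_month)
    if months < 0:
        raise ValueError("defeated date precedes created date")

    years, remainder = divmod(months, 12)
    parts = []
    if years:
        parts.append(f"{years} year{'s' if years != 1 else ''}")
    if remainder:
        parts.append(f"{remainder} month{'s' if remainder != 1 else ''}")
    return ", ".join(parts) or "0 months"


# Date Created / Date Defeated values taken from the 2024 table.
for name, created, defeated in [
    ("ARC-AGI", "2019-11", "2024-12"),
    ("MATH", "2021-03", "2024-09"),
    ("IFEval", "2023-11", "2024-03"),
]:
    print(f"{name}: {age(created, defeated)}")
```

Under that assumption the ARC-AGI row works out to 5 years, 1 month and the IFEval row to 4 months.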

2023

Benchmark	Category	Time Span	Killed By	Killed (Ago)	Age	Defeated By	Original Score	Final Score	Details
GSM8K	Mathematics	2021 – 2023	Saturation	1 year ago	2 years, 1 month	GPT-4	Unspecified	GPT-4: 92.0%	8.5K grade school math word problems requiring step-by-step solutions.
Turing Test	Conversation	1950 – 2023	Saturation	1 year ago	73 years, 5 months	GPT-4	Interrogator > 50%	Interrogator 46%	The original AI benchmark proposed by Alan Turing in 1950 (the "imitation game").
ARC (AI2)	Reasoning	2018 – 2023	Saturation	1 year ago	5 years	GPT-4	Unspecified	GPT-4: 96.3%	Grade-school multiple-choice reasoning tasks testing logical, spatial, temporal reasoning.
HellaSwag	Common Sense	2019 – 2023	Saturation	1 year ago	3 years, 10 months	GPT-4	Human: 95.6%	GPT-4: 95.3%	Multiple-choice questions about everyday scenarios with adversarial filtering.
MMLU	Knowledge	2020 – 2023	Saturation	1 year ago	2 years, 6 months	GPT-4	95th pct Human: 87.0%	GPT-4: 87.3%	57 subjects from real-world sources (professional exams) testing breadth and depth of knowledge.
WinoGrande	Common Sense	2019 – 2023	Saturation	1 year ago	3 years, 8 months	GPT-4	Human: 94%	GPT-4: 87.5%	Enhanced WSC with 44K problems testing common-sense pronoun resolution.

Pre-2023

2022

Benchmark	Category	Time Span	Killed By	Killed (Ago)	Age	Defeated By	Original Score	Final Score	Details
BIG-Bench	Multi-task	2021 – 2022	Saturation	2 years ago	10 months	PaLM 540B	Human: 49.8%	PaLM 540B: 61.4%	204 tasks spanning linguistics, math, common-sense reasoning, and more.

2019

Benchmark	Category	Time Span	Killed By	Killed (Ago)	Age	Defeated By	Original Score	Final Score	Details
SuperGLUE	Language	2019 – 2019	Saturation	5 years ago	5 months	T5	Human: 89.8%	T5: 89.3%	More challenging language understanding tasks (word sense, causal reasoning, RC).
WSC	Common Sense	2012 – 2019	Saturation	5 years ago	7 years, 3 months	RoBERTa (w/ SFT)	Human: 96.5%	RoBERTa (w/ SFT): 90.1%	Carefully crafted sentence pairs with ambiguous pronoun references.
GLUE	Language	2018 – 2019	Saturation	5 years ago	1 year, 1 month	XLNet	Human: 87.1%	XLNet: 88.4%	Nine tasks for evaluating NLU (inference, paraphrase, similarity, etc.).
TriviaQA	Knowledge	2017 – 2019	Saturation	5 years ago	2 years, 1 month	SpanBERT	Human: 79.7%	SpanBERT: 83.6%	650K QA-evidence triples requiring cross-sentence reasoning.
SQuAD v2.0	Language	2018 – 2019	Saturation	5 years ago	11 months	BERT	Human: 89.5%	BERT: 89.5%	Extension of SQuAD adding unanswerable questions.
SQuAD	Language	2016 – 2019	Saturation	5 years ago	2 years, 10 months	BERT	Human: 91.2%	BERT: 93.2%	100,000+ QA tasks on Wikipedia articles.

2018

Benchmark	Category	Time Span	Killed By	Killed (Ago)	Age	Defeated By	Original Score	Final Score	Details
SWAG	Common Sense	2018 – 2018	Saturation	6 years ago	5 months	BERT	Human: 88%	BERT: 86%	113K multiple-choice questions about grounded situations (common sense “next step”).

@@ Line 5: / Line 5: @@
 ! Category
 ! Time Span
+! Date Created
+! Date Defeated
 ! Killed By
-! Killed (Ago)
-! Age
 ! Defeated By
 ! Original Score
 ! Final Score
+! Links
 ! Details
 |-
 | '''ARC-AGI'''
 | Reasoning
-| 2019 – 2024
+| 2019-11 – 2024-12
+| 2019-11
+| 2024-12
 | Saturation
-| 1 month ago
-| 5 years, 1 month
 | o3
 | Human Baseline: ~80%
 | o3: 87.5%
+| [Paper](https://arxiv.org/abs/1911.01547), [Website](https://arcs-benchmark.org)
 | Abstract reasoning challenge with visual pattern completion tasks created by François Chollet.
 |-
 | '''MATH'''
 | Mathematics
-| 2021 – 2024
+| 2021-03 – 2024-09
+| 2021-03
+| 2024-09
 | Saturation
-| 4 months ago
-| 3 years, 6 months
 | o1
 | Average CS PhD: ~40%
 | o1: 94.8%
+| [Paper](https://arxiv.org/abs/2103.03874), [GitHub](https://github.com/hendrycks/math)
 | 12K challenging competition math problems from AMC/AIME, requiring complex multi-step reasoning.
 |-
 | '''BIG-Bench-Hard'''
 | Multi-task
-| 2022 – 2024
+| 2022-10 – 2024-06
+| 2022-10
+| 2024-06
 | Saturation
-| 7 months ago
-| 1 year, 8 months
 | Sonnet 3.5
 | Average Human: 67.7%
 | Sonnet 3.5: 93.1%
+| [Paper](https://arxiv.org/abs/2210.09261), [GitHub](https://github.com/suzgunmirac/BIG-Bench-Hard), [Evidence](https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Claude-3-Model-Card-October-Addendum.pdf)
 | A curated suite of 23 challenging tasks from BIG-Bench.
 |-
 | '''HumanEval'''
 | Coding
-| 2021 – 2024
+| 2021-07 – 2024-05
+| 2021-07
+| 2024-05
 | Saturation
-| 8 months ago
-| 2 years, 10 months
 | GPT-4o
 | Unspecified
 | GPT-4o: 90.2%
+| [Paper](https://arxiv.org/abs/2107.03374), [GitHub](https://github.com/openai/human-eval), [Evidence](https://openai.com/index/hello-gpt-4o/)
 | 164 Python programming problems testing coding abilities.
 |-
 | '''IFEval'''
 | Instruction Following
-| 2023 – 2024
+| 2023-11 – 2024-03
+| 2023-11
+| 2024-03
 | Saturation
-| 10 months ago
-| 4 months
 | Llama 3.3 70B
 | Unspecified
 | Llama 3.3 70B: 92.1%
+| [Paper](https://arxiv.org/abs/2311.07911), [GitHub](https://github.com/google-research/google-research/tree/master/instruction_following_eval), [Evidence](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md)
 | Evaluation suite testing multi-step instruction-following capabilities.
 |}
-----
 == 2023 ==