LLM Benchmarks Timeline: Difference between revisions

| Human: 89.8%
| T5: 89.3%
| [https://arxiv.org/abs/1905.00537 Paper], [https://super.gluebenchmark.com/ Website]
| More challenging language understanding tasks (word sense, causal reasoning, reading comprehension).
|-
| '''WSC'''
| Common Sense
| 2012-05 – 2019-07
| 2012-05
| 2019-07
| Saturation
| RoBERTa (w/ SFT)
| Human: 96.5%
| RoBERTa (w/ SFT): 90.1%
| [https://cdn.aaai.org/ocs/4492/4492-21843-1-PB.pdf Paper], [https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html Website]
| Carefully crafted sentence pairs with ambiguous pronoun references.
|-
| '''GLUE'''
| Language
| 2018-05 – 2019-06
| 2018-05
| 2019-06
| Saturation
| XLNet
| Human: 87.1%
| XLNet: 88.4%
| [https://arxiv.org/abs/1804.07461 Paper], [https://gluebenchmark.com/ Website]
| Nine tasks for evaluating NLU (inference, paraphrase, similarity, etc.).
|-
| '''TriviaQA'''
| Knowledge
| 2017-05 – 2019-06
| 2017-05
| 2019-06
| Saturation
| SpanBERT
| Human: 79.7%
| SpanBERT: 83.6%
| [https://arxiv.org/abs/1705.03551 Paper], [http://nlp.cs.washington.edu/triviaqa/ Website]
| 650K QA-evidence triples requiring cross-sentence reasoning.
|-
| '''SQuAD v2.0'''
| Language
| 2018-05 – 2019-04
| 2018-05
| 2019-04
| Saturation
| BERT
| Human: 89.5%
| BERT: 89.5%
| [https://arxiv.org/abs/1806.03822 Paper], [https://rajpurkar.github.io/SQuAD-explorer/ Website]
| Extension of SQuAD adding unanswerable questions.
|-
| '''SQuAD'''
| Language
| 2016-05 – 2019-03
| 2016-05
| 2019-03
| Saturation
| BERT
| Human: 91.2%
| BERT: 93.2%
| [https://arxiv.org/abs/1606.05250 Paper], [https://rajpurkar.github.io/SQuAD-explorer/ Website]
| 100,000+ QA tasks on Wikipedia articles.
|}



Revision as of 16:49, 10 January 2025

2024

{| class="wikitable"
! Benchmark !! Category !! Time Span !! Date Created !! Date Defeated !! Killed By !! Defeated By !! Original Score !! Final Score !! Links !! Details
|-
| '''ARC-AGI''' || Reasoning || 2019-11 – 2024-12 || 2019-11 || 2024-12 || Saturation || o3 || Human Baseline: ~80% || o3: 87.5% || Paper, Website || Abstract reasoning challenge with visual pattern completion tasks created by François Chollet.
|-
| '''MATH''' || Mathematics || 2021-03 – 2024-09 || 2021-03 || 2024-09 || Saturation || o1 || Average CS PhD: ~40% || o1: 94.8% || Paper, GitHub || 12K challenging competition math problems from AMC/AIME, requiring complex multi-step reasoning.
|-
| '''BIG-Bench-Hard''' || Multi-task || 2022-10 – 2024-06 || 2022-10 || 2024-06 || Saturation || Claude 3.5 Sonnet || Average Human: 67.7% || Claude 3.5 Sonnet: 93.1% || Paper, GitHub, Evidence || A curated suite of 23 challenging tasks from BIG-Bench.
|-
| '''HumanEval''' || Coding || 2021-07 – 2024-05 || 2021-07 || 2024-05 || Saturation || GPT-4o || Unspecified || GPT-4o: 90.2% || Paper, GitHub, Evidence || 164 Python programming problems testing coding abilities.
|-
| '''IFEval''' || Instruction Following || 2023-11 – 2024-03 || 2023-11 || 2024-03 || Saturation || Llama 3.3 70B || Unspecified || Llama 3.3 70B: 92.1% || Paper, GitHub, Evidence || Evaluation suite testing multi-step instruction-following capabilities.
|}

2023

{| class="wikitable"
! Benchmark !! Category !! Time Span !! Date Created !! Date Defeated !! Killed By !! Defeated By !! Original Score !! Final Score !! Links !! Details
|-
| '''GSM8K''' || Mathematics || 2021-10 – 2023-11 || 2021-10 || 2023-11 || Saturation || GPT-4 || Unspecified || GPT-4: 92.0% || Paper, GitHub, Evidence || 8.5K grade school math word problems requiring step-by-step solutions.
|-
| '''Turing Test''' || Conversation || 1950-10 – 2023-03 || 1950-10 || 2023-03 || Saturation || GPT-4 || Interrogator > 50% || Interrogator: 46% || Paper, Evidence || The original AI benchmark proposed by Alan Turing in 1950 (the "imitation game").
|-
| '''ARC (AI2)''' || Reasoning || 2018-03 – 2023-03 || 2018-03 || 2023-03 || Saturation || GPT-4 || Unspecified || GPT-4: 96.3% || Paper, Website, Evidence || Grade-school multiple-choice reasoning tasks testing logical, spatial, and temporal reasoning.
|-
| '''HellaSwag''' || Common Sense || 2019-05 – 2023-03 || 2019-05 || 2023-03 || Saturation || GPT-4 || Human: 95.6% || GPT-4: 95.3% || Paper, Website, Evidence || Multiple-choice questions about everyday scenarios with adversarial filtering.
|-
| '''MMLU''' || Knowledge || 2020-09 – 2023-03 || 2020-09 || 2023-03 || Saturation || GPT-4 || 95th-percentile Human: 87.0% || GPT-4: 87.3% || Paper, GitHub, Evidence || 57 subjects from real-world sources (professional exams) testing breadth and depth of knowledge.
|-
| '''WinoGrande''' || Common Sense || 2019-07 – 2023-03 || 2019-07 || 2023-03 || Saturation || GPT-4 || Human: 94% || GPT-4: 87.5% || Paper, Website, Evidence || Enhanced WSC with 44K problems testing common-sense pronoun resolution.
|}

Pre-2023

2022

{| class="wikitable"
! Benchmark !! Category !! Time Span !! Date Created !! Date Defeated !! Killed By !! Defeated By !! Original Score !! Final Score !! Links !! Details
|-
| '''BIG-Bench''' || Multi-task || 2021-06 – 2022-04 || 2021-06 || 2022-04 || Saturation || PaLM 540B || Human: 49.8% || PaLM 540B: 61.4% || Paper, GitHub, Evidence || 204 tasks spanning linguistics, math, common-sense reasoning, and more.
|}

2019

{| class="wikitable"
! Benchmark !! Category !! Time Span !! Date Created !! Date Defeated !! Killed By !! Defeated By !! Original Score !! Final Score !! Links !! Details
|-
| '''SuperGLUE''' || Language || 2019-05 – 2019-10 || 2019-05 || 2019-10 || Saturation || T5 || Human: 89.8% || T5: 89.3% || [https://arxiv.org/abs/1905.00537 Paper], [https://super.gluebenchmark.com/ Website] || More challenging language understanding tasks (word sense, causal reasoning, reading comprehension).
|}

2018

{| class="wikitable"
! Benchmark !! Category !! Time Span !! Date Created !! Date Defeated !! Killed By !! Defeated By !! Original Score !! Final Score !! Links !! Details
|-
| '''SWAG''' || Common Sense || 2018-05 – 2018-10 || 2018-05 || 2018-10 || Saturation || BERT || Human: 88% || BERT: 86% || [https://arxiv.org/abs/1808.05326 Paper], [https://rowanzellers.com/swag/ Website] || 113K multiple-choice questions about grounded situations (common sense “next step”).
|}
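
All of the tables above share one row schema (Benchmark, Category, Time Span, Date Created, Date Defeated, Killed By, Defeated By, Original Score, Final Score, Links, Details). The sketch below is only an illustration of how that schema could be modelled in code: the class and method names are hypothetical, the example values are copied from the ARC-AGI row of the 2024 table, and the score comparison is a rough heuristic rather than the exact criterion behind the "Saturation" label (a few rows, such as HellaSwag and WinoGrande, are listed as defeated while still slightly below the human baseline).

<syntaxhighlight lang="python">
from dataclasses import dataclass


@dataclass
class BenchmarkEntry:
    """One row of the timeline tables (illustrative field names, not an official schema)."""
    name: str
    category: str
    date_created: str      # YYYY-MM, e.g. "2019-11"
    date_defeated: str     # YYYY-MM, e.g. "2024-12"
    defeated_by: str       # model credited with saturating the benchmark
    original_score: float  # human / reference baseline, in percent
    final_score: float     # best reported model score at saturation, in percent

    def roughly_saturated(self) -> bool:
        # Heuristic only: most rows are marked "Saturation" once the best model
        # score meets or exceeds the reference score, but a few (e.g. HellaSwag,
        # WinoGrande) are listed as defeated while still just under the human baseline.
        return self.final_score >= self.original_score


# Example values copied from the 2024 table (ARC-AGI row).
arc_agi = BenchmarkEntry(
    name="ARC-AGI",
    category="Reasoning",
    date_created="2019-11",
    date_defeated="2024-12",
    defeated_by="o3",
    original_score=80.0,   # "Human Baseline: ~80%"
    final_score=87.5,      # "o3: 87.5%"
)
print(arc_agi.roughly_saturated())  # True
</syntaxhighlight>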