= LLM Benchmarks Timeline =
A memorial to the benchmarks that defined—and were defeated by—AI progress
=== All Time ===
* [[#2024|2024]]
* [[#2023|2023]]
* [[#Pre-2023|Pre-2023]]
----
== 2024 ==
{| class="wikitable"
|-
! Benchmark
! Category
! Time Span
! Killed By
! Killed (Ago)
! Age
! Defeated By
! Original Score
! Final Score
! Details
|-
| '''ARC-AGI'''
| Reasoning
| 2019 – 2024
| Saturation
| 1 month ago
| 5 years, 1 month
| o3
| Human Baseline: ~80%
| o3: 87.5%
| Abstract reasoning challenge consisting of visual pattern-completion tasks. Each task presents a sequence of abstract visual patterns and requires selecting the correct completion. Created by François Chollet as part of a broader investigation into measuring intelligence.
|-
| '''MATH'''
| Mathematics
| 2021 – 2024
| Saturation
| 4 months ago
| 3 years, 6 months
| o1
| Average CS PhD: ~40%
| o1: 94.8%
| 12K challenging competition mathematics problems from AMC, AIME, and other contests, ranging from pre-algebra to olympiad level. Each problem has a detailed solution and requires complex multi-step reasoning.
|-
| '''BIG-Bench-Hard'''
| Multi-task
| 2022 – 2024
| Saturation
| 7 months ago
| 1 year, 8 months
| Claude 3.5 Sonnet
| Average Human: 67.7%
| Claude 3.5 Sonnet: 93.1%
| A curated suite of 23 challenging BIG-Bench tasks on which language models initially performed below the average human level, selected to measure progress on particularly difficult capabilities.
|-
| '''HumanEval'''
| Coding
| 2021 – 2024
| Saturation
| 8 months ago
| 2 years, 10 months
| GPT-4o
| Unspecified
| GPT-4o: 90.2%
| 164 Python programming problems, each with a function signature, docstring, and unit tests. Models must generate complete, correct implementations that pass all test cases.
|-
| '''IFEval'''
| Instruction Following
| 2023 – 2024
| Saturation
| 10 months ago
| 4 months
| Llama 3.3 70B
| Unspecified
| Llama 3.3 70B: 92.1%
| Evaluation suite testing instruction following across coding, math, roleplay, and other tasks, measuring the ability to handle complex multi-step instructions and constraints.
|}
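
To make the HumanEval row concrete, here is a minimal sketch of how an execution-based coding benchmark of that kind is scored: the model sees a function signature plus docstring, and its completion counts toward pass@1 only if the unit tests pass. The problem and the <code>check</code>/<code>evaluate</code> helpers below are illustrative stand-ins written for this page, not items or code from the actual dataset.

<syntaxhighlight lang="python">
# Illustrative HumanEval-style problem (not an actual dataset item).
# The model is given PROMPT and must complete the function body; the
# completion is accepted only if every assertion in check() passes.

PROMPT = '''
def running_max(numbers):
    """Return a list where element i is the maximum of numbers[:i+1].

    >>> running_max([1, 3, 2, 5, 4])
    [1, 3, 3, 5, 5]
    """
'''

def check(candidate):
    # Simplified stand-in for the benchmark's hidden unit tests.
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([]) == []
    assert candidate([2, 2, 2]) == [2, 2, 2]

def evaluate(completion_body: str) -> bool:
    """Run prompt + model completion in a scratch namespace, then the tests."""
    namespace: dict = {}
    try:
        exec(PROMPT + completion_body, namespace)
        check(namespace["running_max"])
        return True   # this sample counts toward pass@1
    except Exception:
        return False

# Example: a correct model completion for the prompt above.
model_completion = (
    "    result, best = [], float('-inf')\n"
    "    for x in numbers:\n"
    "        best = max(best, x)\n"
    "        result.append(best)\n"
    "    return result\n"
)
print(evaluate(model_completion))  # True
</syntaxhighlight>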


----


== 2023 ==
{| class="wikitable"
|-
! Benchmark
! Category
! Time Span
! Killed By
! Killed (Ago)
! Age
! Defeated By
! Original Score
! Final Score
! Details
|-
| '''GSM8K'''
| Mathematics
| 2021 – 2023
| Saturation
| 1 year ago
| 2 years, 1 month
| GPT-4
| Unspecified
| GPT-4: 92.0%
| 8.5K grade-school math word problems requiring step-by-step solutions, testing both numerical computation and natural-language understanding through multi-step reasoning.
|-
| '''Turing Test'''
| Conversation
| 1950 – 2023
| Saturation
| 1 year ago
| 73 years, 5 months
| GPT-4
| Interrogator: >50%
| Interrogator: 46%
| The original AI benchmark proposed by Alan Turing in 1950. In this "imitation game," a computer must convince human judges it is human through natural conversation. The test sparked decades of debate about machine intelligence and consciousness.
|-
| '''ARC (AI2)'''
| Reasoning
| 2018 – 2023
| Saturation
| 1 year ago
| 5 years
| GPT-4
| Unspecified
| GPT-4: 96.3%
| AI2 Reasoning Challenge: grade-school multiple-choice reasoning tasks testing logical deduction, spatial reasoning, and temporal reasoning, each requiring multi-step abstract reasoning.
|-
| '''HellaSwag'''
| Common Sense
| 2019 – 2023
| Saturation
| 1 year ago
| 3 years, 10 months
| GPT-4
| Human: 95.6%
| GPT-4: 95.3%
| Multiple-choice questions about everyday scenarios, built with adversarial filtering to test whether models can reason about real-world situations and their likely outcomes.
|-
| '''MMLU'''
| Knowledge
| 2020 – 2023
| Saturation
| 1 year ago
| 2 years, 6 months
| GPT-4
| 95th pct Human: 87.0%
| GPT-4: 87.3%
| A comprehensive benchmark covering 57 subjects, including mathematics, history, law, and computer science, with questions drawn from real-world sources such as professional exams to test breadth and depth of knowledge.
|-
| '''WinoGrande'''
| Common Sense
| 2019 – 2023
| Saturation
| 1 year ago
| 3 years, 8 months
| GPT-4
| Human: 94%
| GPT-4: 87.5%
| An enhanced version of WSC with 44K problems testing common-sense reasoning through pronoun resolution, using adversarial filtering to ensure problems require real-world understanding.
|}
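
Most of the 2023 entries (ARC, HellaSwag, MMLU, WinoGrande) are multiple-choice datasets scored by plain accuracy against an answer key. The sketch below shows that scoring loop under simplified assumptions: <code>ask_model</code> is a hypothetical stand-in for the model call, and real harnesses often compare option log-likelihoods rather than parsing a generated letter.

<syntaxhighlight lang="python">
# Minimal sketch of multiple-choice scoring in the MMLU/ARC/HellaSwag style:
# the model picks one lettered option per question and the metric is plain
# accuracy against the answer key. ask_model is a hypothetical stand-in.

from typing import Callable

def format_prompt(question: str, choices: list[str]) -> str:
    letters = "ABCD"
    lines = [question] + [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def accuracy(dataset: list[dict], ask_model: Callable[[str], str]) -> float:
    """dataset items: {'question': str, 'choices': [str, ...], 'answer': 'A'..'D'}"""
    correct = 0
    for item in dataset:
        prediction = ask_model(format_prompt(item["question"], item["choices"]))
        # Keep only the first character the model emits, e.g. "B." -> "B".
        predicted_letter = prediction.strip()[:1].upper()
        correct += predicted_letter == item["answer"]
    return correct / len(dataset)

# Toy usage with a fake model that always answers "A".
toy_data = [
    {"question": "2 + 2 = ?", "choices": ["4", "3", "5", "22"], "answer": "A"},
    {"question": "Capital of France?", "choices": ["Lyon", "Paris", "Nice", "Lille"], "answer": "B"},
]
print(accuracy(toy_data, lambda prompt: "A"))  # 0.5
</syntaxhighlight>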


----
== Pre-2023 ==
=== 2022 ===
{| class="wikitable"
|-
! Benchmark
! Category
! Time Span
! Killed By
! Killed (Ago)
! Age
! Defeated By
! Original Score
! Final Score
! Details
|-
| '''BIG-Bench'''
| Multi-task
| 2021 – 2022
| Saturation
| 2 years ago
| 10 months
| PaLM 540B
| Human: 49.8%
| PaLM 540B: 61.4%
| A collaborative collection of 204 tasks spanning linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, and software development, testing diverse capabilities of language models.
|}


=== 2019 ===
{| class="wikitable"
|-
! Benchmark
! Category
! Time Span
! Killed By
! Killed (Ago)
! Age
! Defeated By
! Original Score
! Final Score
! Details
|-
| '''SuperGLUE'''
| Language
| 2019 – 2019
| Saturation
| 5 years ago
| 5 months
| T5
| Human: 89.8%
| T5: 89.3%
| More challenging language-understanding tasks, including word-sense disambiguation, causal reasoning, and reading comprehension; designed as a harder successor to GLUE.
|-
| '''WSC'''
| Common Sense
| 2012 – 2019
| Saturation
| 5 years ago
| 7 years, 3 months
| RoBERTa (w/ SFT)
| Human: 96.5%
| RoBERTa (w/ SFT): 90.1%
| Carefully crafted sentence pairs with ambiguous pronoun references that resolve differently based on small changes, designed to test genuine language understanding over statistical patterns.
|-
| '''GLUE'''
| Language
| 2018 – 2019
| Saturation
| 5 years ago
| 1 year, 1 month
| XLNet
| Human: 87.1%
| XLNet: 88.4%
| Nine natural-language-understanding tasks, including single-sentence, similarity and paraphrase, and inference tasks; the primary NLU benchmark before SuperGLUE.
|-
| '''TriviaQA'''
| Knowledge
| 2017 – 2019
| Saturation
| 5 years ago
| 2 years, 1 month
| SpanBERT
| Human: 79.7%
| SpanBERT: 83.6%
| A large-scale dataset of 650K question-answer-evidence triples authored by trivia enthusiasts, requiring cross-sentence reasoning and synthesis of information from multiple sources.
|-
| '''SQuAD v2.0'''
| Language
| 2018 – 2019
| Saturation
| 5 years ago
| 11 months
| BERT
| Human: 89.5%
| BERT: 89.5%
| An extension of SQuAD that adds unanswerable questions; models must both answer questions when possible and determine when no answer is supported by the passage.
|-
| '''SQuAD'''
| Language
| 2016 – 2019
| Saturation
| 5 years ago
| 2 years, 10 months
| BERT
| Human: 91.2%
| BERT: 93.2%
| A reading-comprehension dataset of 100,000+ questions posed by crowdworkers on Wikipedia articles; answers must be text segments from the corresponding passage.
|}
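
The SQuAD rows above are scored with exact match and token-level F1 over normalized answer strings (lowercased, with punctuation and articles stripped). The snippet below is a simplified illustration of those two metrics, not the official evaluation script.

<syntaxhighlight lang="python">
# Simplified exact-match and token-F1 metrics in the SQuAD style.

import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> int:
    return int(normalize(prediction) == normalize(gold))

def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "Eiffel Tower"))      # 1
print(round(token_f1("in the city of Paris", "Paris"), 2))  # 0.4
</syntaxhighlight>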


=== 2018 ===
{| class="wikitable"
|-
! Benchmark
! Category
! Time Span
! Killed By
! Killed (Ago)
! Age
! Defeated By
! Original Score
! Final Score
! Details
|-
| '''SWAG'''
| Common Sense
| 2018 – 2018
| Saturation
| 6 years ago
| 5 months
| BERT
| Human: 88%
| BERT: 86%
| 113K multiple-choice questions about grounded situations: given a partial description of a situation, models must predict what happens next from four choices using common-sense reasoning.
|}
