= LLM Benchmarks Timeline =

A memorial to the benchmarks that defined, and were defeated by, AI progress. Relative ages ("1 month ago", "5 years ago") are as of January 2025.

=== All Time ===
* [[#2024|2024]]
* [[#2023|2023]]
* [[#Pre-2023|Pre-2023]]

----
== 2024 ==
{| class="wikitable"
|-
! Benchmark
! Category
! Time Span
! Killed By
! Killed
! Age
! Defeated By
! Original Score
! Final Score
! Details
|-
| '''ARC-AGI'''
| Reasoning
| 2019 – 2024
| Saturation
| 1 month ago
| 5 years, 1 month
| o3
| Human baseline: ~80%
| o3: 87.5%
| Abstract reasoning challenge of visual pattern-completion tasks, created by François Chollet.
|-
| '''MATH'''
| Mathematics
| 2021 – 2024
| Saturation
| 4 months ago
| 3 years, 6 months
| o1
| Average CS PhD: ~40%
| o1: 94.8%
| 12K challenging competition math problems from AMC/AIME, requiring complex multi-step reasoning.
|-
| '''BIG-Bench-Hard'''
| Multi-task
| 2022 – 2024
| Saturation
| 7 months ago
| 1 year, 8 months
| Claude 3.5 Sonnet
| Average human: 67.7%
| Claude 3.5 Sonnet: 93.1%
| A curated suite of 23 challenging tasks from BIG-Bench.
|-
| '''HumanEval'''
| Coding
| 2021 – 2024
| Saturation
| 8 months ago
| 2 years, 10 months
| GPT-4o
| Unspecified
| GPT-4o: 90.2%
| 164 Python programming problems testing code-generation ability.
|-
| '''IFEval'''
| Instruction Following
| 2023 – 2024
| Saturation
| 10 months ago
| 4 months
| Llama 3.3 70B
| Unspecified
| Llama 3.3 70B: 92.1%
| Evaluation suite testing multi-step instruction-following capabilities.
|}
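To make the HumanEval row concrete: each of its 164 problems is a Python function stub with a docstring, the model writes the function body, and hidden unit tests decide pass/fail; the reported score is the fraction of problems solved. The sketch below illustrates that loop with an invented problem and tests (not an item from the real set), so treat it as a rough approximation rather than the benchmark's actual harness.

<syntaxhighlight lang="python">
# Minimal sketch of HumanEval-style scoring: the model gets a function stub
# with a docstring, returns a completion for the body, and unit tests decide
# pass/fail. The problem, completion, and tests below are invented for
# illustration; they are not items from the real 164-problem set.

PROMPT = '''
def count_vowels(s: str) -> int:
    """Return the number of vowels (a, e, i, o, u) in s, case-insensitive."""
'''

# Stand-in for a model-generated function body.
COMPLETION = '''
    return sum(1 for ch in s.lower() if ch in "aeiou")
'''

TESTS = '''
assert count_vowels("Hello") == 2
assert count_vowels("xyz") == 0
assert count_vowels("AEiou") == 5
'''


def passes(prompt: str, completion: str, tests: str) -> bool:
    """Run stub + completion + tests; True means the item counts as solved."""
    namespace: dict = {}
    try:
        exec(prompt + completion + tests, namespace)
        return True
    except Exception:
        return False


print(passes(PROMPT, COMPLETION, TESTS))  # True for this toy completion
</syntaxhighlight>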
----
== 2023 ==
{| class="wikitable"
|-
! Benchmark
! Category
! Time Span
! Killed By
! Killed
! Age
! Defeated By
! Original Score
! Final Score
! Details
|-
| '''GSM8K'''
| Mathematics
| 2021 – 2023
| Saturation
| 1 year ago
| 2 years, 1 month
| GPT-4
| Unspecified
| GPT-4: 92.0%
| 8.5K grade-school math word problems requiring step-by-step solutions.
|-
| '''Turing Test'''
| Conversation
| 1950 – 2023
| Saturation
| 1 year ago
| 73 years, 5 months
| GPT-4
| Interrogator accuracy > 50%
| Interrogator accuracy: 46%
| The original AI benchmark, proposed by Alan Turing in 1950 as the "imitation game".
|-
| '''ARC (AI2)'''
| Reasoning
| 2018 – 2023
| Saturation
| 1 year ago
| 5 years
| GPT-4
| Unspecified
| GPT-4: 96.3%
| Grade-school multiple-choice science questions testing logical, spatial, and temporal reasoning.
|-
| '''HellaSwag'''
| Common Sense
| 2019 – 2023
| Saturation
| 1 year ago
| 3 years, 10 months
| GPT-4
| Human: 95.6%
| GPT-4: 95.3%
| Multiple-choice questions about everyday scenarios, built with adversarial filtering.
|-
| '''MMLU'''
| Knowledge
| 2020 – 2023
| Saturation
| 1 year ago
| 2 years, 6 months
| GPT-4
| 95th-percentile human: 87.0%
| GPT-4: 87.3%
| 57 subjects drawn from real-world sources such as professional exams, testing breadth and depth of knowledge.
|-
| '''WinoGrande'''
| Common Sense
| 2019 – 2023
| Saturation
| 1 year ago
| 3 years, 8 months
| GPT-4
| Human: 94%
| GPT-4: 87.5%
| Enhanced WSC with 44K problems testing common-sense pronoun resolution.
|}
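The GSM8K row is graded by final-answer matching rather than unit tests: reference solutions end with a line of the form <code>#### answer</code>, so scoring usually reduces to extracting that number from the model's worked solution and comparing it with the reference. Below is a minimal sketch of that step; the question, model output, and regex are invented for illustration and are not the benchmark's official harness.

<syntaxhighlight lang="python">
import re

# Sketch of GSM8K-style grading: extract the final numeric answer from a
# model's step-by-step solution and compare it with the reference answer.
# The question, model output, and regex are illustrative assumptions, not
# the benchmark's official evaluation code.

question = (
    "A pencil costs 2 dollars and a notebook costs 3 times as much. "
    "How much do 2 pencils and 1 notebook cost together?"
)
reference_answer = "10"

model_output = (
    "A notebook costs 3 * 2 = 6 dollars. "
    "Two pencils cost 2 * 2 = 4 dollars. "
    "Together that is 6 + 4 = 10 dollars.\n"
    "#### 10"
)


def extract_final_answer(text: str):
    """Return the number after the last '####' marker, or None if absent."""
    matches = re.findall(r"####\s*(-?[\d,.]+)", text)
    return matches[-1].replace(",", "") if matches else None


print(extract_final_answer(model_output) == reference_answer)  # True
</syntaxhighlight>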
----
== Pre-2023 ==

=== 2022 ===
{| class="wikitable"
|-
! Benchmark
! Category
! Time Span
! Killed By
! Killed
! Age
! Defeated By
! Original Score
! Final Score
! Details
|-
| '''BIG-Bench'''
| Multi-task
| 2021 – 2022
| Saturation
| 2 years ago
| 10 months
| PaLM 540B
| Human: 49.8%
| PaLM 540B: 61.4%
| 204 tasks spanning linguistics, math, common-sense reasoning, and more.
|}
=== 2019 ===
{| class="wikitable"
|-
! Benchmark
! Category
! Time Span
! Killed By
! Killed
! Age
! Defeated By
! Original Score
! Final Score
! Details
|-
| '''SuperGLUE'''
| Language
| 2019 – 2019
| Saturation
| 5 years ago
| 5 months
| T5
| Human: 89.8%
| T5: 89.3%
| More challenging language-understanding tasks covering word sense, causal reasoning, and reading comprehension.
|-
| '''WSC'''
| Common Sense
| 2012 – 2019
| Saturation
| 5 years ago
| 7 years, 3 months
| RoBERTa (w/ SFT)
| Human: 96.5%
| RoBERTa (w/ SFT): 90.1%
| Carefully crafted sentence pairs with ambiguous pronoun references.
|-
| '''GLUE'''
| Language
| 2018 – 2019
| Saturation
| 5 years ago
| 1 year, 1 month
| XLNet
| Human: 87.1%
| XLNet: 88.4%
| Nine tasks for evaluating natural-language understanding (inference, paraphrase, similarity, etc.).
|-
| '''TriviaQA'''
| Knowledge
| 2017 – 2019
| Saturation
| 5 years ago
| 2 years, 1 month
| SpanBERT
| Human: 79.7%
| SpanBERT: 83.6%
| 650K question-answer-evidence triples requiring cross-sentence reasoning.
|-
| '''SQuAD v2.0'''
| Language
| 2018 – 2019
| Saturation
| 5 years ago
| 11 months
| BERT
| Human: 89.5%
| BERT: 89.5%
| Extension of SQuAD that adds unanswerable questions.
|-
| '''SQuAD'''
| Language
| 2016 – 2019
| Saturation
| 5 years ago
| 2 years, 10 months
| BERT
| Human: 91.2%
| BERT: 93.2%
| 100,000+ question-answering tasks on Wikipedia articles.
|}
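The extractive QA rows above (SQuAD, SQuAD v2.0, TriviaQA) are scored by comparing a predicted answer span against a gold span, conventionally with exact match and token-level F1. The sketch below shows simplified versions of both metrics; the official SQuAD evaluation additionally strips punctuation and English articles before comparing, which is omitted here.

<syntaxhighlight lang="python">
from collections import Counter

# Simplified sketch of SQuAD-style span scoring: exact match and token-level
# F1 between a predicted answer and a gold answer. The official script adds
# further normalization (punctuation and article removal); this version keeps
# only lowercasing and whitespace tokenization.


def tokens(text: str):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> bool:
    return tokens(prediction) == tokens(gold)


def f1_score(prediction: str, gold: str) -> float:
    pred, ref = tokens(prediction), tokens(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Denver Broncos", "denver broncos"))              # True
print(round(f1_score("the Denver Broncos", "Denver Broncos"), 2))   # 0.8
</syntaxhighlight>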
=== 2018 ===
{| class="wikitable"
|-
! Benchmark
! Category
! Time Span
! Killed By
! Killed
! Age
! Defeated By
! Original Score
! Final Score
! Details
|-
| '''SWAG'''
| Common Sense
| 2018 – 2018
| Saturation
| 6 years ago
| 5 months
| BERT
| Human: 88%
| BERT: 86%
| 113K multiple-choice questions about grounded situations, testing common-sense "next step" prediction.
|}