LLM Benchmarks Timeline
{{see also|LLM Comparisons|LLM Rankings}}
Timeline of [[benchmarks]] surpassed by [[large language models]] (LLMs).
==2024==
{| class="wikitable sortable"
|-
! Benchmark
! Category
! Date Created
! Date Defeated
! Killed By
! Defeated By Model
! Original Score
! Final Score
! Details
! Links
|-
| [[ARC-AGI]]
| Reasoning
| 2019-11
| 2024-12
| Saturation
| [[O3]]
| Human Baseline: ~80%
| O3: 87.5%
| Abstract reasoning challenge of grid-based visual pattern-completion tasks, created by François Chollet.
| [https://arxiv.org/abs/1911.01547 Paper], [https://arcs-benchmark.org Website]
|-
| [[MATH]]
| Mathematics
| 2021-03
| 2024-09
| Saturation
| [[O1]]
| Average CS PhD: ~40%
| O1: 94.8%
| 12K challenging competition math problems (AMC/AIME) requiring complex multi-step reasoning.
| [https://arxiv.org/abs/2103.03874 Paper], [https://github.com/hendrycks/math GitHub]
|-
| [[BIG-Bench-Hard]]
| Multi-task
| 2022-10
| 2024-06
| Saturation
| [[Sonnet 3.5]]
| Average Human: 67.7%
| Sonnet 3.5: 93.1%
| A curated suite of 23 challenging tasks from BIG-Bench.
| [https://arxiv.org/abs/2210.09261 Paper], [https://github.com/suzgunmirac/BIG-Bench-Hard GitHub], [https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Claude-3-Model-Card-October-Addendum.pdf Evidence]
|-
| [[HumanEval]]
| Coding
| 2021-07
| 2024-05
| Saturation
| [[GPT-4o]]
| Unspecified
| GPT-4o: 90.2%
| 164 hand-written Python programming problems testing code generation; see the usage sketch below the table.
| [https://arxiv.org/abs/2107.03374 Paper], [https://github.com/openai/human-eval GitHub], [https://openai.com/index/hello-gpt-4o/ Evidence]
|-
| [[IFEval]]
| Instruction Following
| 2023-11
| 2024-03
| Saturation
| [[Llama 3.3 70B]]
| Unspecified
| Llama 3.3 70B: 92.1%
| Evaluation suite of prompts with automatically verifiable instructions (length, format, and keyword constraints).
| [https://arxiv.org/abs/2311.07911 Paper], [https://github.com/google-research/google-research/tree/master/instruction_following_eval GitHub], [https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md Evidence]
|}
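HumanEval scores such as the one above are typically computed with OpenAI's reference harness from the linked GitHub repository. A minimal sketch, assuming the <code>human-eval</code> package is installed and using a hypothetical <code>generate_one_completion</code> stand-in for the model call:

<syntaxhighlight lang="python">
# Minimal sketch of scoring a model on HumanEval with OpenAI's reference
# harness (https://github.com/openai/human-eval).
# `generate_one_completion` is a hypothetical stand-in for your own model call.
from human_eval.data import read_problems, write_jsonl

def generate_one_completion(prompt: str) -> str:
    """Return the model's completion of the function body for `prompt`."""
    raise NotImplementedError  # call your LLM here

problems = read_problems()  # dict: task_id -> {"prompt", "test", "entry_point", ...}
samples = [
    {"task_id": task_id, "completion": generate_one_completion(p["prompt"])}
    for task_id, p in problems.items()
]
write_jsonl("samples.jsonl", samples)
# Then, from the shell (the harness executes untrusted generated code; sandbox it):
#   evaluate_functional_correctness samples.jsonl
</syntaxhighlight>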
==2023==
{| class="wikitable sortable"
|-
! Benchmark
! Category
! Date Created
! Date Defeated
! Killed By
! Defeated By Model
! Original Score
! Final Score
! Details
! Links
|-
| [[GSM8K]]
| Mathematics
| 2021-10
| 2023-11
| Saturation
| [[GPT-4]]
| Unspecified
| GPT-4: 92.0%
| 8.5K grade-school math word problems requiring step-by-step solutions; see the scoring sketch below the table.
| [https://arxiv.org/abs/2110.14168 Paper], [https://github.com/openai/grade-school-math GitHub], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|-
| [[Turing Test]]
| Conversation
| 1950-10
| 2023-03
| Saturation
| [[GPT-4]]
| Interrogator accuracy > 50%
| Interrogator accuracy: 46%
| The original AI benchmark proposed by Alan Turing in 1950 (the "imitation game").
| [https://courses.cs.umbc.edu/471/papers/turing.pdf Paper], [https://arxiv.org/pdf/2405.08007 Evidence]
|-
| [[ARC (AI2)]]
| Reasoning
| 2018-03
| 2023-03
| Saturation
| [[GPT-4]]
| Unspecified
| GPT-4: 96.3%
| Grade-school multiple-choice science questions testing reasoning and knowledge.
| [https://arxiv.org/abs/1803.05457 Paper], [https://leaderboard.allenai.org/arc/submissions/get-started Website], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|-
| [[HellaSwag]]
| Common Sense
| 2019-05
| 2023-03
| Saturation
| [[GPT-4]]
| Human: 95.6%
| GPT-4: 95.3%
| Multiple-choice sentence-completion questions about everyday scenarios, built with adversarial filtering.
| [https://arxiv.org/abs/1905.07830 Paper], [https://rowanzellers.com/hellaswag/ Website], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|-
| [[MMLU]]
| Knowledge
| 2020-09
| 2023-03
| Saturation
| [[GPT-4]]
| 95th-percentile Human: 87.0%
| GPT-4: 87.3%
| 57 subjects drawn from real-world sources (professional and academic exams) testing breadth and depth of knowledge.
| [https://arxiv.org/abs/2009.03300 Paper], [https://github.com/hendrycks/test GitHub], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|-
| [[WinoGrande]]
| Common Sense
| 2019-07
| 2023-03
| Saturation
| [[GPT-4]]
| Human: 94%
| GPT-4: 87.5%
| Enhanced Winograd Schema Challenge with 44K problems testing common-sense pronoun resolution.
| [https://arxiv.org/abs/1907.10641 Paper], [https://winogrande.allenai.org/ Website], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|}
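GSM8K is usually graded by exact match on the final number: each reference solution ends with the final numeric answer on a line prefixed by <code>####</code>. A minimal scoring sketch, assuming the dataset is loaded through the Hugging Face <code>datasets</code> library (answer-extraction heuristics vary between papers):

<syntaxhighlight lang="python">
# Sketch of standard GSM8K exact-match scoring: the gold answer ends with
# "#### <number>", and a model response counts as correct if its final
# number matches. Assumes the Hugging Face dataset id "gsm8k", config "main".
import re
from datasets import load_dataset

def extract_final_number(text: str) -> str | None:
    """Return the last number in the text, with commas and '$' stripped."""
    nums = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text.replace("$", ""))
    return nums[-1].replace(",", "") if nums else None

gsm8k = load_dataset("gsm8k", "main", split="test")
example = gsm8k[0]
gold = example["answer"].split("####")[-1].strip().replace(",", "")

model_response = "... so the total is 18."  # hypothetical model output
correct = extract_final_number(model_response) == gold
print(example["question"][:80], gold, correct)
</syntaxhighlight>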
==Pre-2023==
===2022===
{| class="wikitable sortable"
|-
! Benchmark
! Category
! Date Created
! Date Defeated
! Killed By
! Defeated By Model
! Original Score
! Final Score
! Details
! Links
|-
| [[BIG-Bench]]
| Multi-task
| 2021-06
| 2022-04
| Saturation
| [[PaLM 540B]]
| Human: 49.8%
| PaLM 540B: 61.4%
| 204 tasks spanning linguistics, mathematics, common-sense reasoning, and more.
| [https://arxiv.org/abs/2206.04615 Paper], [https://github.com/google/BIG-bench GitHub], [https://arxiv.org/pdf/2204.02311 Evidence]
|}
===2019===
{| class="wikitable sortable"
|-
! Benchmark
! Category
! Date Created
! Date Defeated
! Killed By
! Defeated By Model
! Original Score
! Final Score
! Details
! Links
|-
| [[SuperGLUE]]
| Language
| 2019-05
| 2019-10
| Saturation
| [[T5]]
| Human: 89.8%
| T5: 89.3%
| More challenging language-understanding tasks (word-sense disambiguation, causal reasoning, reading comprehension).
| [https://arxiv.org/abs/1905.00537 Paper], [https://super.gluebenchmark.com/ Website]
|-
| [[WSC]]
| Common Sense
| 2012-05
| 2019-07
| Saturation
| [[RoBERTa (w/ SFT)]]
| Human: 96.5%
| RoBERTa (w/ SFT): 90.1%
| Carefully crafted sentence pairs with ambiguous pronoun references (e.g., "The trophy doesn't fit in the suitcase because it is too big"; what does "it" refer to?).
| [https://cdn.aaai.org/ocs/4492/4492-21843-1-PB.pdf Paper], [https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html Website]
|-
| [[GLUE]]
| Language
| 2018-05
| 2019-06
| Saturation
| [[XLNet]]
| Human: 87.1%
| XLNet: 88.4%
| Nine tasks for evaluating natural language understanding (inference, paraphrase, similarity, etc.).
| [https://arxiv.org/abs/1804.07461 Paper], [https://gluebenchmark.com/ Website]
|-
| [[TriviaQA]]
| Knowledge
| 2017-05
| 2019-06
| Saturation
| [[SpanBERT]]
| Human: 79.7%
| SpanBERT: 83.6%
| 650K question-answer-evidence triples requiring cross-sentence reasoning.
| [https://arxiv.org/abs/1705.03551 Paper], [http://nlp.cs.washington.edu/triviaqa/ Website]
|-
| [[SQuAD v2.0]]
| Language
| 2018-05
| 2019-04
| Saturation
| [[BERT]]
| Human: 89.5%
| BERT: 89.5%
| Extension of SQuAD that adds unanswerable questions; scored with exact match and token-level F1 (see the sketch below the table).
| [https://arxiv.org/abs/1806.03822 Paper], [https://rajpurkar.github.io/SQuAD-explorer/ Website]
|-
| [[SQuAD]]
| Language
| 2016-05
| 2019-03
| Saturation
| [[BERT]]
| Human: 91.2%
| BERT: 93.2%
| 100,000+ question-answer pairs on Wikipedia articles.
| [https://arxiv.org/abs/1606.05250 Paper], [https://rajpurkar.github.io/SQuAD-explorer/ Website]
|}
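SQuAD and SQuAD v2.0 are scored with exact match and token-overlap F1 between the predicted and gold answer spans. A simplified sketch of the F1 computation in the spirit of the official evaluation script (the real script additionally strips punctuation and articles before tokenizing):

<syntaxhighlight lang="python">
# Token-overlap F1 between a predicted answer span and a gold span, in the
# spirit of the official SQuAD evaluation script. Normalization is simplified
# here (lowercasing and whitespace tokenization only).
from collections import Counter

def f1_score(prediction: str, ground_truth: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    if not pred_tokens or not gold_tokens:
        # SQuAD v2.0 convention: both empty (unanswerable) scores 1.0
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("Denver Broncos", "the Denver Broncos"))  # 0.8
</syntaxhighlight>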
===2018===
{| class="wikitable sortable"
|-
! Benchmark
! Category
! Date Created
! Date Defeated
! Killed By
! Defeated By Model
! Original Score
! Final Score
! Details
! Links
|-
| [[SWAG]]
| Common Sense
| 2018-05
| 2018-10
| Saturation
| [[BERT]]
| Human: 88%
| BERT: 86%
| 113K multiple-choice questions about grounded situations (common-sense inference about what happens next).
| [https://arxiv.org/abs/1808.05326 Paper], [https://rowanzellers.com/swag/ Website]
|}
==References==
* [https://r0bk.github.io/killedbyllm/ Killed by LLM website]
* [https://github.com/R0bk/killedbyllm Killed by LLM GitHub repository]

[[Category:Benchmarks]] [[Category:Timelines]] [[Category:Aggregate pages]]