LLM Benchmarks Timeline

{{see also|LLM Comparisons|LLM Rankings}}

Timeline of [[benchmarks]] surpassed by [[large language models]] (LLMs).

==2024==
{| class="wikitable sortable"
|-
! Benchmark
! Category
! Date Created
! Date Defeated
! Killed By
! Defeated By Model
! Original Score
! Final Score
! Details
! Links
|-
| [[ARC-AGI]]
| Reasoning
| 2019-11
| 2024-12
| Saturation
| [[O3]]
| Human Baseline: ~80%
| O3: 87.5%
| Abstract reasoning challenge with visual pattern completion tasks created by François Chollet.
| [https://arxiv.org/abs/1911.01547 Paper], [https://arcs-benchmark.org Website]
|-
| [[MATH]]
| Mathematics
| 2021-03
| 2024-09
| Saturation
| [[O1]]
| Average CS PhD: ~40%
| O1: 94.8%
| 12K challenging competition math problems from AMC/AIME, requiring complex multi-step reasoning.
| [https://arxiv.org/abs/2103.03874 Paper], [https://github.com/hendrycks/math GitHub]
|-
| [[BIG-Bench-Hard]]
| Multi-task
| 2022-10
| 2024-06
| Saturation
| [[Sonnet 3.5]]
| Average Human: 67.7%
| Sonnet 3.5: 93.1%
| A curated suite of 23 challenging tasks from BIG-Bench.
| [https://arxiv.org/abs/2210.09261 Paper], [https://github.com/suzgunmirac/BIG-Bench-Hard GitHub], [https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Claude-3-Model-Card-October-Addendum.pdf Evidence]
|-
| [[HumanEval]]
| Coding
| 2021-07
| 2024-05
| Saturation
| [[GPT-4o]]
| Unspecified
| GPT-4o: 90.2%
| 164 Python programming problems testing coding abilities.
| [https://arxiv.org/abs/2107.03374 Paper], [https://github.com/openai/human-eval GitHub], [https://openai.com/index/hello-gpt-4o/ Evidence]
|-
| [[IFEval]]
| Instruction Following
| 2023-11
| 2024-03
| Saturation
| [[Llama 3.3 70B]]
| Unspecified
| Llama 3.3 70B: 92.1%
| Evaluation suite testing multi-step instruction-following capabilities.
| [https://arxiv.org/abs/2311.07911 Paper], [https://github.com/google-research/google-research/tree/master/instruction_following_eval GitHub], [https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md Evidence]
|}
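
HumanEval results such as the GPT-4o score above are typically reported as pass@1, estimated with the unbiased pass@k formula from the HumanEval paper. The sketch below illustrates that estimator only; the sample counts are made up for the example and are not taken from any reported run.

<syntaxhighlight lang="python">
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper: given n
    generated samples for a problem, of which c pass the unit tests,
    estimate the chance that at least one of k drawn samples passes."""
    if n - c < k:
        return 1.0  # fewer failing samples than draws, so a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: 200 samples for one problem, 181 passing.
print(round(pass_at_k(200, 181, 1), 3))   # 0.905
print(round(pass_at_k(200, 181, 10), 3))  # ~1.0
</syntaxhighlight>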
 
==2023==
{| class="wikitable sortable"
|-
! Benchmark
! Category
! Date Created
! Date Defeated
! Killed By
! Defeated By Model
! Original Score
! Final Score
! Details
! Links
|-
| [[GSM8K]]
| Mathematics
| 2021-10
| 2023-11
| Saturation
| [[GPT-4]]
| Unspecified
| GPT-4: 92.0%
| 8.5K grade school math word problems requiring step-by-step solutions.
| [https://arxiv.org/abs/2110.14168 Paper], [https://github.com/openai/grade-school-math GitHub], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|-
| [[Turing Test]]
| Conversation
| 1950-10
| 2023-03
| Saturation
| [[GPT-4]]
| Interrogator > 50%
| Interrogator 46%
| The original AI benchmark proposed by Alan Turing in 1950 (the "imitation game").
| [https://courses.cs.umbc.edu/471/papers/turing.pdf Paper], [https://arxiv.org/pdf/2405.08007 Evidence]
|-
| [[ARC (AI2)]]
| Reasoning
| 2018-03
| 2023-03
| Saturation
| [[GPT-4]]
| Unspecified
| GPT-4: 96.3%
| Grade-school multiple-choice reasoning tasks testing logical, spatial, temporal reasoning.
| [https://arxiv.org/abs/1803.05457 Paper], [https://leaderboard.allenai.org/arc/submissions/get-started Website], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|-
| [[HellaSwag]]
| Common Sense
| 2019-05
| 2023-03
| Saturation
| [[GPT-4]]
| Human: 95.6%
| GPT-4: 95.3%
| Multiple-choice questions about everyday scenarios with adversarial filtering.
| [https://arxiv.org/abs/1905.07830 Paper], [https://rowanzellers.com/hellaswag/ Website], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|-
| [[MMLU]]
| Knowledge
| 2020-09
| 2023-03
| Saturation
| [[GPT-4]]
| 95th pct Human: 87.0%
| GPT-4: 87.3%
| 57 subjects from real-world sources (professional exams) testing breadth and depth of knowledge.
| [https://arxiv.org/abs/2009.03300 Paper], [https://github.com/hendrycks/test GitHub], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|-
| [[WinoGrande]]
| Common Sense
| 2019-07
| 2023-03
| Saturation
| [[GPT-4]]
| Human: 94%
| GPT-4: 87.5%
| Enhanced WSC with 44K problems testing common-sense pronoun resolution.
| [https://arxiv.org/abs/1907.10641 Paper], [https://winogrande.allenai.org/ Website], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|}
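
In this timeline, "Killed By: Saturation" appears to mean that the model's score reached roughly the listed human reference rather than clearing a formal threshold; the WinoGrande row above, for example, is marked defeated with GPT-4 still several points under the human score. A minimal sketch of that comparison, using numbers copied from the 2023 table:

<syntaxhighlight lang="python">
# Scores copied from the 2023 table above (percentages).
rows = {
    "HellaSwag":  {"human": 95.6, "model": 95.3},  # GPT-4
    "MMLU":       {"human": 87.0, "model": 87.3},  # GPT-4 vs 95th pct human
    "WinoGrande": {"human": 94.0, "model": 87.5},  # GPT-4
}

for name, s in rows.items():
    margin = s["model"] - s["human"]
    status = "at/above the human reference" if margin >= 0 else f"{-margin:.1f} pts below it"
    print(f"{name}: model {s['model']}% vs human {s['human']}% -> {status}")
</syntaxhighlight>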


==Pre-2023==
===2022===
{| class="wikitable sortable"
|-
! Benchmark
! Category
! Date Created
! Date Defeated
! Killed By
! Defeated By Model
! Original Score
! Final Score
! Details
! Links
|-
| [[BIG-Bench]]
| Multi-task
| 2021-06
| 2022-04
| Saturation
| [[PaLM 540B]]
| Human: 49.8%
| PaLM 540B: 61.4%
| 204 tasks spanning linguistics, math, common-sense reasoning, and more.
| [https://arxiv.org/abs/2206.04615 Paper], [https://github.com/google/BIG-bench GitHub], [https://arxiv.org/pdf/2204.02311 Evidence]
|}


===2019===
{| class="wikitable sortable"
|-
! Benchmark
! Category
! Date Created
! Date Defeated
! Killed By
! Defeated By Model
! Original Score
! Final Score
! Details
! Links
|-
| [[SuperGLUE]]
| Language
| 2019-05
| 2019-10
| Saturation
| [[T5]]
| Human: 89.8%
| T5: 89.3%
| More challenging language understanding tasks (word sense, causal reasoning, reading comprehension).
| [https://arxiv.org/abs/1905.00537 Paper], [https://super.gluebenchmark.com/ Website]
|-
| [[RoBERTa (w/ SFT)|WSC]]
| Common Sense
| 2012-05
| 2019-07
| Saturation
| [[RoBERTa (w/ SFT)]]
| Human: 96.5%
| RoBERTa (w/ SFT): 90.1%
| Carefully crafted sentence pairs with ambiguous pronoun references.
| [https://cdn.aaai.org/ocs/4492/4492-21843-1-PB.pdf Paper], [https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html Website]
|-
| [[GLUE]]
| Language
| 2018-05
| 2019-06
| Saturation
| [[XLNet]]
| Human: 87.1%
| XLNet: 88.4%
| Nine tasks for evaluating NLU (inference, paraphrase, similarity, etc.).
| [https://arxiv.org/abs/1804.07461 Paper], [https://gluebenchmark.com/ Website]
|-
| [[TriviaQA]]
| Knowledge
| 2017-05
| 2019-06
| Saturation
| [[SpanBERT]]
| Human: 79.7%
| SpanBERT: 83.6%
| 650K question-answer-evidence triples requiring cross-sentence reasoning.
| [https://arxiv.org/abs/1705.03551 Paper], [http://nlp.cs.washington.edu/triviaqa/ Website]
|-
| [[SQuAD v2.0]]
| Language
| 2018-05
| 2019-04
| Saturation
| [[BERT]]
| Human: 89.5%
| BERT: 89.5%
| Extension of SQuAD adding unanswerable questions.
| [https://arxiv.org/abs/1806.03822 Paper], [https://rajpurkar.github.io/SQuAD-explorer/ Website]
|-
| [[SQuAD]]
| Language
| 2016-05
| 2019-03
| Saturation
| [[BERT]]
| Human: 91.2%
| BERT: 93.2%
| 100,000+ question-answer pairs on Wikipedia articles.
| [https://arxiv.org/abs/1606.05250 Paper], [https://rajpurkar.github.io/SQuAD-explorer/ Website]
|}
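
SQuAD-style benchmarks (including SQuAD v2.0 above) are scored with exact match and token-overlap F1 against reference answers. Below is a minimal sketch of the token-overlap F1, assuming plain whitespace tokenization (the official evaluation script additionally lowercases and strips punctuation and articles); the example strings are illustrative only.

<syntaxhighlight lang="python">
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-overlap F1 between a predicted answer span
    and a gold answer span (whitespace tokenization only)."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Illustrative strings, not taken from the dataset.
print(round(token_f1("the Eiffel Tower", "Eiffel Tower"), 2))  # 0.8
</syntaxhighlight>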


===2018===
{| class="wikitable sortable"
|-
! Benchmark
! Category
! Date Created
! Date Defeated
! Killed By
! Defeated By Model
! Original Score
! Final Score
! Details
! Links
|-
| [[SWAG]]
| Common Sense
| 2018-05
| 2018-10
| Saturation
| [[BERT]]
| Human: 88%
| BERT: 86%
| 113K multiple-choice questions about grounded situations (common sense “next step”).
| [https://arxiv.org/abs/1808.05326 Paper], [https://rowanzellers.com/swag/ Website]
|}
==References==
* [https://r0bk.github.io/killedbyllm/ Website]
* [https://github.com/R0bk/killedbyllm GitHub]

[[Category:Benchmarks]] [[Category:Timelines]] [[Category:Aggregate pages]]
