LLM Benchmarks Timeline

== 2024 ==
{| class="wikitable"
|-
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Links
! Details
|-
| '''ARC-AGI'''
| Reasoning
| 2019-11 – 2024-12
| 2019-11
| 2024-12
| Saturation
| o3
| Human Baseline: ~80%
| o3: 87.5%
| [https://arxiv.org/abs/1911.01547 Paper], [https://arcs-benchmark.org Website]
| Abstract reasoning challenge with visual pattern completion tasks created by François Chollet.
|-
| '''MATH'''
| Mathematics
| 2021-03 – 2024-09
| 2021-03
| 2024-09
| Saturation
| o1
| Average CS PhD: ~40%
| o1: 94.8%
| [https://arxiv.org/abs/2103.03874 Paper], [https://github.com/hendrycks/math GitHub]
| 12K challenging competition math problems from AMC/AIME, requiring complex multi-step reasoning.
|-
| '''BIG-Bench-Hard'''
| Multi-task
| 2022-10 – 2024-06
| 2022-10
| 2024-06
| Saturation
| Claude 3.5 Sonnet
| Average Human: 67.7%
| Claude 3.5 Sonnet: 93.1%
| [https://arxiv.org/abs/2210.09261 Paper], [https://github.com/suzgunmirac/BIG-Bench-Hard GitHub], [https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Claude-3-Model-Card-October-Addendum.pdf Evidence]
| A curated suite of 23 challenging tasks from BIG-Bench.
|-
| '''HumanEval'''
| Coding
| 2021-07 – 2024-05
| 2021-07
| 2024-05
| Saturation
| GPT-4o
| Unspecified
| GPT-4o: 90.2%
| [https://arxiv.org/abs/2107.03374 Paper], [https://github.com/openai/human-eval GitHub], [https://openai.com/index/hello-gpt-4o/ Evidence]
| 164 Python programming problems with unit tests, scored by pass@k (see the sketch after this table).
|-
| '''IFEval'''
| Instruction Following
| 2023-11 – 2024-03
| 2023-11
| 2024-03
| Saturation
| Llama 3.3 70B
| Unspecified
| Llama 3.3 70B: 92.1%
| [https://arxiv.org/abs/2311.07911 Paper], [https://github.com/google-research/google-research/tree/master/instruction_following_eval GitHub], [https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md Evidence]
| Evaluation suite of prompts with verifiable instructions (e.g., length and format constraints) for testing instruction following.
|}
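HumanEval-style coding benchmarks are usually reported as pass@k: for each problem, ''n'' completions are sampled, ''c'' of them pass the unit tests, and pass@k is estimated with the unbiased estimator 1 - C(n-c, k) / C(n, k) from the HumanEval paper. Below is a minimal sketch of that estimator; the function name and example numbers are illustrative and not taken from the official <code>human-eval</code> package.

<syntaxhighlight lang="python">
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    where n completions were sampled and c of them passed the unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples per problem, 140 pass the tests.
print(round(pass_at_k(n=200, c=140, k=1), 3))   # 0.7
print(round(pass_at_k(n=200, c=140, k=10), 3))  # ~1.0 to three decimals
</syntaxhighlight>
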
== 2023 ==
{| class="wikitable"
|-
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Links
! Details
|-
| '''GSM8K'''
| Mathematics
| 2021-10 – 2023-11
| 2021-10
| 2023-11
| Saturation
| GPT-4
| Unspecified
| GPT-4: 92.0%
| [https://arxiv.org/abs/2110.14168 Paper], [https://github.com/openai/grade-school-math GitHub], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
| 8.5K grade-school math word problems requiring step-by-step solutions, scored by exact match on the final answer (see the sketch after this table).
|-
| '''Turing Test'''
| Conversation
| 1950-10 – 2023-03
| 1950-10
| 2023-03
| Saturation
| GPT-4
| Interrogator accuracy: >50% (chance)
| Interrogator accuracy: 46%
| [https://courses.cs.umbc.edu/471/papers/turing.pdf Paper], [https://arxiv.org/pdf/2405.08007 Evidence]
| The original AI benchmark proposed by Alan Turing in 1950 (the "imitation game").
|-
| '''ARC (AI2)'''
| Reasoning
| 2018-03 – 2023-03
| 2018-03
| 2023-03
| Saturation
| GPT-4
| Unspecified
| GPT-4: 96.3%
| [https://arxiv.org/abs/1803.05457 Paper], [https://leaderboard.allenai.org/arc/submissions/get-started Website], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
| Grade-school-level multiple-choice science questions requiring reasoning rather than simple retrieval.
|-
| '''HellaSwag'''
| Common Sense
| 2019-05 – 2023-03
| 2019-05
| 2023-03
| Saturation
| GPT-4
| Human: 95.6%
| GPT-4: 95.3%
| [https://arxiv.org/abs/1905.07830 Paper], [https://rowanzellers.com/hellaswag/ Website], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
| Multiple-choice questions about everyday scenarios with adversarial filtering.
|-
| '''MMLU'''
| Knowledge
| 2020-09 – 2023-03
| 2020-09
| 2023-03
| Saturation
| GPT-4
| 95th pct Human: 87.0%
| GPT-4: 87.3%
| [https://arxiv.org/abs/2009.03300 Paper], [https://github.com/hendrycks/test GitHub], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
| 57 subjects from real-world sources (professional exams) testing breadth and depth of knowledge.
|-
| '''WinoGrande'''
| Common Sense
| 2019-07 – 2023-03
| 2019-07
| 2023-03
| Saturation
| GPT-4
| Human: 94%
| GPT-4: 87.5%
| [https://arxiv.org/abs/1907.10641 Paper], [https://winogrande.allenai.org/ Website], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
| Enhanced WSC with 44K problems testing common-sense pronoun resolution.
|}
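GSM8K answers are typically scored by exact match on the final numeric answer; in the dataset, each reference solution ends with a line of the form <code>#### &lt;answer&gt;</code>. Below is a minimal scoring sketch under that assumption; the regex and normalization are illustrative, not the official grader.

<syntaxhighlight lang="python">
import re
from typing import Optional

def final_answer(text: str) -> Optional[str]:
    """Extract the final numeric answer, GSM8K-style.
    Reference solutions mark it with '#### <answer>'; for model output
    we fall back to the last number appearing in the text."""
    marked = re.search(r"####\s*(-?[\d,.]+)", text)
    candidate = marked.group(1) if marked else None
    if candidate is None:
        numbers = re.findall(r"-?\d[\d,]*\.?\d*", text)
        candidate = numbers[-1] if numbers else None
    return candidate.replace(",", "").rstrip(".") if candidate else None

def exact_match(prediction: str, reference: str) -> bool:
    return final_answer(prediction) == final_answer(reference)

# Illustrative example (not taken from the dataset):
reference = "48 / 2 = 24 clips were sold in May.\n#### 72"
prediction = "She sold 48 in April and 24 in May, so 72 clips in total."
print(exact_match(prediction, reference))  # True
</syntaxhighlight>
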
== Pre-2023 ==
=== 2022 ===
{| class="wikitable"
|-
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Links
! Details
|-
| '''BIG-Bench'''
| Multi-task
| 2021-06 – 2022-04
| 2021-06
| 2022-04
| Saturation
| PaLM 540B
| Human: 49.8%
| PaLM 540B: 61.4%
| [https://arxiv.org/abs/2206.04615 Paper], [https://github.com/google/BIG-bench GitHub], [https://arxiv.org/pdf/2204.02311 Evidence]
| 204 tasks spanning linguistics, math, common-sense reasoning, and more.
|}
=== 2019 ===
{| class="wikitable"
|-
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Links
! Details
|-
| '''SuperGLUE'''
| Language
| 2019-05 – 2019-10
| 2019-05
| 2019-10
| Saturation
| T5
| Human: 89.8%
| T5: 89.3%
| [https://arxiv.org/abs/1905.00537 Paper], [https://super.gluebenchmark.com/ Website]
| More challenging language understanding tasks (word sense, causal reasoning, RC).
|-
| '''WSC'''
| Common Sense
| 2012-05 – 2019-07
| 2012-05
| 2019-07
| Saturation
| RoBERTa (w/ SFT)
| Human: 96.5%
| RoBERTa (w/ SFT): 90.1%
| [https://cdn.aaai.org/ocs/4492/4492-21843-1-PB.pdf Paper], [https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html Website]
| Carefully crafted sentence pairs with ambiguous pronoun references.
|-
| '''GLUE'''
| Language
| 2018-05 – 2019-06
| 2018-05
| 2019-06
| Saturation
| XLNet
| Human: 87.1%
| XLNet: 88.4%
| [https://arxiv.org/abs/1804.07461 Paper], [https://gluebenchmark.com/ Website]
| Nine tasks for evaluating NLU (inference, paraphrase, similarity, etc.).
|-
| '''TriviaQA'''
| Knowledge
| 2017-05 – 2019-06
| 2017-05
| 2019-06
| Saturation
| SpanBERT
| Human: 79.7%
| SpanBERT: 83.6%
| [https://arxiv.org/abs/1705.03551 Paper], [http://nlp.cs.washington.edu/triviaqa/ Website]
| 650K QA-evidence triples requiring cross-sentence reasoning.
|-
| '''SQuAD v2.0'''
| Language
| 2018-05 – 2019-04
| 2018-05
| 2019-04
| Saturation
| BERT
| Human: 89.5%
| BERT: 89.5%
| [https://arxiv.org/abs/1806.03822 Paper], [https://rajpurkar.github.io/SQuAD-explorer/ Website]
| Extension of SQuAD adding unanswerable questions.
|-
| '''SQuAD'''
| Language
| 2016-05 – 2019-03
| 2016-05
| 2019-03
| Saturation
| BERT
| Human: 91.2%
| BERT: 93.2%
| [https://arxiv.org/abs/1606.05250 Paper], [https://rajpurkar.github.io/SQuAD-explorer/ Website]
| 100,000+ question-answer pairs on Wikipedia articles, scored by exact match and token-level F1 (see the sketch after this table).
|}
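The SQuAD rows above report span-answer scores; the benchmark's standard metrics are exact match and token-level F1 between the predicted answer span and the reference answers. Below is a minimal sketch of the token-level F1, omitting the official script's extra normalization (lowercasing plus stripping articles and punctuation):

<syntaxhighlight lang="python">
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer span and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Illustrative example: partial credit for an answer with an extra article.
print(token_f1("the Denver Broncos", "Denver Broncos"))  # 0.8
</syntaxhighlight>
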
=== 2018 ===
{| class="wikitable"
|-
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Links
! Details
|-
| '''SWAG'''
| Common Sense
| 2018-05 – 2018-10
| 2018-05
| 2018-10
| Saturation
| BERT
| Human: 88%
| BERT: 86%
| [https://arxiv.org/abs/1808.05326 Paper], [https://rowanzellers.com/swag/ Website]
| 113K multiple-choice questions about grounded situations (common sense “next step”), scored as accuracy over candidate endings (see the sketch after this table).
|}
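SWAG-style multiple-choice benchmarks (and later ones such as HellaSwag and MMLU) are commonly scored by having the model rate each candidate continuation and taking the highest-scoring option. Below is a minimal, harness-agnostic sketch; <code>loglik</code> is a placeholder for whatever likelihood function a given evaluation harness exposes, not a real API.

<syntaxhighlight lang="python">
from typing import Callable, Iterable, Sequence, Tuple

def pick_choice(context: str,
                choices: Sequence[str],
                loglik: Callable[[str, str], float]) -> int:
    """Return the index of the candidate continuation the model scores highest."""
    scores = [loglik(context, choice) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)

def accuracy(examples: Iterable[Tuple[str, Sequence[str], int]],
             loglik: Callable[[str, str], float]) -> float:
    """examples: (context, choices, gold_index) triples."""
    results = [pick_choice(ctx, choices, loglik) == gold
               for ctx, choices, gold in examples]
    return sum(results) / len(results)

# Toy usage with a dummy scorer that prefers shorter continuations.
dummy = lambda ctx, cont: -len(cont)
examples = [("He opened the door and",
             ["walked in.", "evaporated into a cloud of frogs."], 0)]
print(accuracy(examples, dummy))  # 1.0
</syntaxhighlight>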