LLM Benchmarks Timeline

A timeline of language-model benchmarks: when each was created, when it was effectively defeated (saturated), and by which model.

== 2024 ==
{| class="wikitable"
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Details
! Links
|-
| [[ARC-AGI]]
| Reasoning
| 2019-11 – 2024-12
| 2019-11
| 2024-12
| Saturation
| O3
| Human Baseline: ~80%
| O3: 87.5%
| Abstract reasoning challenge with visual pattern completion tasks created by François Chollet.
| [https://arxiv.org/abs/1911.01547 Paper], [https://arcs-benchmark.org Website]
|-
| [[MATH]]
| Mathematics
| 2021-03 – 2024-09
| 2021-03
| 2024-09
| Saturation
| O1
| Average CS PhD: ~40%
| O1: 94.8%
| 12K challenging competition math problems from AMC/AIME, requiring complex multi-step reasoning.
| [https://arxiv.org/abs/2103.03874 Paper], [https://github.com/hendrycks/math GitHub]
|-
| [[BIG-Bench-Hard]]
| Multi-task
| 2022-10 – 2024-06
| 2022-10
| 2024-06
| Saturation
| Sonnet 3.5
| Average Human: 67.7%
| Sonnet 3.5: 93.1%
| A curated suite of 23 challenging tasks from BIG-Bench.
| [https://arxiv.org/abs/2210.09261 Paper], [https://github.com/suzgunmirac/BIG-Bench-Hard GitHub], [https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Claude-3-Model-Card-October-Addendum.pdf Evidence]
|-
| [[HumanEval]]
| Coding
| 2021-07 – 2024-05
| 2021-07
| 2024-05
| Saturation
| GPT-4o
| Unspecified
| GPT-4o: 90.2%
| 164 Python programming problems testing coding abilities (see the evaluation sketch after this table).
| [https://arxiv.org/abs/2107.03374 Paper], [https://github.com/openai/human-eval GitHub], [https://openai.com/index/hello-gpt-4o/ Evidence]
|-
| [[IFEval]]
| Instruction Following
| 2023-11 – 2024-03
| 2023-11
| 2024-03
| Saturation
| Llama 3.3 70B
| Unspecified
| Llama 3.3 70B: 92.1%
| Evaluation suite testing multi-step instruction-following capabilities.
| [https://arxiv.org/abs/2311.07911 Paper], [https://github.com/google-research/google-research/tree/master/instruction_following_eval GitHub], [https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md Evidence]
|}
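To make concrete what "defeating" HumanEval involves: each of the 164 problems supplies a function signature and docstring, the model writes the function body, and the completion is executed against hidden unit tests (reported as pass@k). Below is a minimal sketch of running a candidate model through the openai/human-eval harness; <code>generate_completion</code> is an illustrative stub standing in for the model under test, not part of the harness.

<syntaxhighlight lang="python">
# Minimal sketch of a HumanEval run with the openai/human-eval harness.
# `generate_completion` is a placeholder for a real model call.
from human_eval.data import read_problems, write_jsonl

def generate_completion(prompt: str) -> str:
    # Placeholder: a real evaluation would send `prompt` (the function
    # signature + docstring) to an LLM and return the generated body.
    return "    pass\n"

problems = read_problems()  # dict: task_id -> {"prompt", "entry_point", "test", ...}
samples = [
    {"task_id": task_id, "completion": generate_completion(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)

# Scoring (runs the hidden unit tests and reports pass@k):
#   $ evaluate_functional_correctness samples.jsonl
</syntaxhighlight>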
== 2023 ==
{| class="wikitable"
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Details
! Links
|-
| [[GSM8K]]
| Mathematics
| 2021-10 – 2023-11
| 2021-10
| 2023-11
| Saturation
| GPT-4
| Unspecified
| GPT-4: 92.0%
| 8.5K grade school math word problems requiring step-by-step solutions (see the scoring sketch after this table).
| [https://arxiv.org/abs/2110.14168 Paper], [https://github.com/openai/grade-school-math GitHub], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|-
| [[Turing Test]]
| Conversation
| 1950-10 – 2023-03
| 1950-10
| 2023-03
| Saturation
| GPT-4
| Interrogator > 50%
| Interrogator 46%
| The original AI benchmark proposed by Alan Turing in 1950 (the "imitation game").
| [https://courses.cs.umbc.edu/471/papers/turing.pdf Paper], [https://arxiv.org/pdf/2405.08007 Evidence]
|-
| [[ARC (AI2)]]
| Reasoning
| 2018-03 – 2023-03
| 2018-03
| 2023-03
| Saturation
| GPT-4
| Unspecified
| GPT-4: 96.3%
| Grade-school multiple-choice reasoning tasks testing logical, spatial, temporal reasoning.
| [https://arxiv.org/abs/1803.05457 Paper], [https://leaderboard.allenai.org/arc/submissions/get-started Website], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|-
| [[HellaSwag]]
| Common Sense
| 2019-05 – 2023-03
| 2019-05
| 2023-03
| Saturation
| GPT-4
| Human: 95.6%
| GPT-4: 95.3%
| Multiple-choice questions about everyday scenarios with adversarial filtering.
| [https://arxiv.org/abs/1905.07830 Paper], [https://rowanzellers.com/hellaswag/ Website], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|-
| [[MMLU]]
| Knowledge
| 2020-09 – 2023-03
| 2020-09
| 2023-03
| Saturation
| GPT-4
| 95th pct Human: 87.0%
| GPT-4: 87.3%
| 57 subjects from real-world sources (professional exams) testing breadth and depth of knowledge.
| [https://arxiv.org/abs/2009.03300 Paper], [https://github.com/hendrycks/test GitHub], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|-
| [[WinoGrande]]
| Common Sense
| 2019-07 – 2023-03
| 2019-07
| 2023-03
| Saturation
| GPT-4
| Human: 94%
| GPT-4: 87.5%
| Enhanced WSC with 44K problems testing common-sense pronoun resolution.
| [https://arxiv.org/abs/1907.10641 Paper], [https://winogrande.allenai.org/ Website], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|}
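GSM8K answers are free-form chains of reasoning, so scoring typically reduces to extracting the final number and comparing it with the reference. A minimal sketch, assuming the Hugging Face <code>datasets</code> copy of GSM8K (dataset id <code>gsm8k</code>, config <code>main</code>), where each reference answer ends with a line of the form <code>#### &lt;number&gt;</code>; the <code>solve</code> stub is a placeholder for the model under test.

<syntaxhighlight lang="python">
# Sketch of GSM8K exact-match scoring; `solve` is a placeholder model.
import re
from datasets import load_dataset  # assumes the Hugging Face `datasets` library

def final_number(text: str) -> str:
    """Return the last number in `text`, with thousands separators stripped."""
    nums = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return nums[-1].replace(",", "") if nums else ""

def solve(question: str) -> str:
    # Placeholder: a real run would prompt an LLM for step-by-step reasoning.
    return "The answer is 0."

test = load_dataset("gsm8k", "main", split="test")  # fields: "question", "answer"
correct = 0
for row in test:
    gold = row["answer"].split("####")[-1].strip().replace(",", "")
    if final_number(solve(row["question"])) == gold:
        correct += 1
print(f"accuracy = {correct / len(test):.3f}")
</syntaxhighlight>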




== Pre-2023 ==

=== 2022 ===
{| class="wikitable"
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Details
! Links
|-
| [[BIG-Bench]]
| Multi-task
| 2021-06 – 2022-04
| 2021-06
| 2022-04
| Saturation
| PaLM 540B
| Human: 49.8%
| PaLM 540B: 61.4%
| 204 tasks spanning linguistics, math, common-sense reasoning, and more.
| Paper, GitHub, Evidence
|}

=== 2019 ===
{| class="wikitable"
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Details
! Links
|-
| [[SuperGLUE]]
| Language
| 2019-05 – 2019-10
| 2019-05
| 2019-10
| Saturation
| T5
| Human: 89.8%
| T5: 89.3%
| More challenging language understanding tasks (word sense disambiguation, causal reasoning, reading comprehension).
| Paper, Website
|}

=== 2018 ===
{| class="wikitable"
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Details
! Links
|-
| [[SWAG]]
| Common Sense
| 2018-05 – 2018-10
| 2018-05
| 2018-10
| Saturation
| BERT
| Human: 88%
| BERT: 86%
| 113K multiple-choice questions about grounded situations (common-sense "next step" prediction; see the scoring sketch below).
| [https://arxiv.org/abs/1808.05326 Paper], [https://rowanzellers.com/swag/ Website]
|}
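SWAG-style benchmarks (and successors such as HellaSwag) are scored by asking the model to pick the most plausible continuation out of four candidates. The sketch below shows that scoring loop on a single hard-coded, illustrative item (not drawn from the actual SWAG data); the <code>plausibility</code> function is a deliberately trivial word-overlap stand-in for a real language model's log-likelihood.

<syntaxhighlight lang="python">
# Sketch of SWAG/HellaSwag-style multiple-choice scoring.
# `plausibility` is a toy stand-in for a language model's log-likelihood.

def plausibility(context: str, ending: str) -> float:
    # Toy heuristic: count words the ending shares with the context.
    ctx_words = set(context.lower().split())
    return sum(1.0 for w in ending.lower().split() if w in ctx_words)

def predict(context: str, endings: list[str]) -> int:
    """Return the index of the ending judged most plausible."""
    scores = [plausibility(context, e) for e in endings]
    return scores.index(max(scores))

# One illustrative item (invented for this example, not from SWAG).
example = {
    "context": "The chef cracks two eggs into the bowl. She",
    "endings": [
        "whisks the eggs together with a fork.",
        "parks a car in her garage.",
        "climbs a mountain before sunrise.",
        "signs a contract with blue ink.",
    ],
    "label": 0,
}

pred = predict(example["context"], example["endings"])
print("predicted:", pred, "correct:", pred == example["label"])
</syntaxhighlight>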