LLM Benchmarks Timeline

== 2024 ==

{| class="wikitable"
! Benchmark
! Category
! Date Created
! Date Defeated
! Killed By
! Defeated By Model
! Original Score
! Final Score
! Details
! Links
|-
| [[ARC-AGI]]
| Reasoning
| 2019-11
| 2024-12
| Saturation
| [[O3]]
| Human Baseline: ~80%
| O3: 87.5%
| Abstract reasoning challenge with visual pattern-completion tasks created by François Chollet.
| Paper, Website
|-
| [[MATH]]
| Mathematics
| 2021-03
| 2024-09
| Saturation
| [[O1]]
| Average CS PhD: ~40%
| O1: 94.8%
| 12K challenging competition math problems from AMC/AIME, requiring complex multi-step reasoning.
| Paper, GitHub
|-
| [[BIG-Bench-Hard]]
| Multi-task
| 2022-10
| 2024-06
| Saturation
| [[Sonnet 3.5]]
| Average Human: 67.7%
| Sonnet 3.5: 93.1%
| A curated suite of 23 challenging tasks from BIG-Bench.
| Paper, GitHub, Evidence
|-
| [[HumanEval]]
| Coding
| 2021-07
| 2024-05
| Saturation
| [[GPT-4o]]
| Unspecified
| GPT-4o: 90.2%
| 164 Python programming problems testing coding abilities (an illustrative item is sketched below the table).
| Paper, GitHub, Evidence
|-
| [[IFEval]]
| Instruction Following
| 2023-11
| 2024-03
| Saturation
| [[Llama 3.3 70B]]
| Unspecified
| Llama 3.3 70B: 92.1%
| Evaluation suite testing multi-step instruction-following capabilities.
| Paper, GitHub, Evidence
|}
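
For context on the HumanEval entry above: each item pairs a function signature and docstring with hidden unit tests, and a model's completion counts only if those tests pass. The following is a minimal illustrative sketch of that setup; the problem, completion, and tests are invented for illustration and are not drawn from the actual dataset.

<syntaxhighlight lang="python">
# Illustrative HumanEval-style item: the model sees only the prompt (signature
# plus docstring) and must generate the body; hidden unit tests decide pass/fail.
# The problem, completion, and tests here are invented for illustration.

PROMPT = '''
def running_max(numbers: list) -> list:
    """Return a list where element i is the maximum of numbers[:i + 1]."""
'''

# A candidate completion, as a model might generate it.
COMPLETION = '''
    result, current = [], None
    for n in numbers:
        current = n if current is None else max(current, n)
        result.append(current)
    return result
'''

def check(candidate):
    # Hidden unit tests: the completion passes only if every assertion holds.
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([]) == []
    assert candidate([-2, -5]) == [-2, -2]

# Run prompt + completion in an isolated namespace, then apply the tests.
namespace = {}
exec(PROMPT + COMPLETION, namespace)
check(namespace["running_max"])
print("passed")
</syntaxhighlight>

HumanEval results are usually reported as pass@k (the probability that at least one of k sampled completions passes its tests); the sketch above checks a single completion.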

== 2023 ==

{| class="wikitable"
! Benchmark
! Category
! Date Created
! Date Defeated
! Killed By
! Defeated By Model
! Original Score
! Final Score
! Details
! Links
|-
| [[GSM8K]]
| Mathematics
| 2021-10
| 2023-11
| Saturation
| [[GPT-4]]
| Unspecified
| GPT-4: 92.0%
| 8.5K grade school math word problems requiring step-by-step solutions (a scoring sketch follows the table).
| Paper, GitHub, Evidence
|-
| [[Turing Test]]
| Conversation
| 1950-10
| 2023-03
| Saturation
| [[GPT-4]]
| Interrogator > 50%
| Interrogator: 46%
| The original AI benchmark proposed by Alan Turing in 1950 (the "imitation game").
| Paper, Evidence
|-
| [[ARC (AI2)]]
| Reasoning
| 2018-03
| 2023-03
| Saturation
| [[GPT-4]]
| Unspecified
| GPT-4: 96.3%
| Grade-school multiple-choice reasoning tasks testing logical, spatial, and temporal reasoning.
| Paper, Website, Evidence
|-
| [[HellaSwag]]
| Common Sense
| 2019-05
| 2023-03
| Saturation
| [[GPT-4]]
| Human: 95.6%
| GPT-4: 95.3%
| Multiple-choice questions about everyday scenarios with adversarial filtering.
| Paper, Website, Evidence
|-
| [[MMLU]]
| Knowledge
| 2020-09
| 2023-03
| Saturation
| [[GPT-4]]
| 95th-percentile human: 87.0%
| GPT-4: 87.3%
| 57 subjects from real-world sources (professional exams) testing breadth and depth of knowledge.
| Paper, GitHub, Evidence
|-
| [[WinoGrande]]
| Common Sense
| 2019-07
| 2023-03
| Saturation
| [[GPT-4]]
| Human: 94%
| GPT-4: 87.5%
| Enhanced WSC with 44K problems testing common-sense pronoun resolution.
| Paper, Website, Evidence
|}
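
On the GSM8K entry above: reference solutions in the dataset end with a final line of the form <code>#### &lt;answer&gt;</code>, and a common scoring convention is exact match on the number extracted from the model's output. The sketch below illustrates that convention; the problem, reference solution, and model output are invented for illustration.

<syntaxhighlight lang="python">
import re

# Reference answers in GSM8K end with a line of the form "#### <number>";
# a common scoring convention is exact match on that extracted number.
# The problem, reference solution, and model output below are invented.

reference = (
    "Ella buys 3 packs of 8 pencils and gives away 5 pencils. "
    "How many pencils does she have left?\n"
    "3 * 8 = 24 pencils bought.\n"
    "24 - 5 = 19 pencils left.\n"
    "#### 19"
)

model_output = "She buys 24 pencils and gives away 5, leaving 19. #### 19"

def extract_answer(text):
    """Pull the last number following '####', ignoring thousands separators."""
    matches = re.findall(r"####\s*([-+]?[\d,]*\.?\d+)", text)
    return matches[-1].replace(",", "") if matches else None

is_correct = extract_answer(model_output) == extract_answer(reference)
print(is_correct)  # True
</syntaxhighlight>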

== Pre-2023 ==

=== 2022 ===

{| class="wikitable"
! Benchmark
! Category
! Date Created
! Date Defeated
! Killed By
! Defeated By Model
! Original Score
! Final Score
! Details
! Links
|-
| [[BIG-Bench]]
| Multi-task
| 2021-06
| 2022-04
| Saturation
| [[Palm 540B]]
| Human: 49.8%
| Palm 540B: 61.4%
| 204 tasks spanning linguistics, math, common-sense reasoning, and more.
| Paper, GitHub, Evidence
|}

=== 2019 ===

{| class="wikitable"
! Benchmark
! Category
! Date Created
! Date Defeated
! Killed By
! Defeated By Model
! Original Score
! Final Score
! Details
! Links
|-
| [[SuperGLUE]]
| Language
| 2019-05
| 2019-10
| Saturation
| [[T5]]
| Human: 89.8%
| T5: 89.3%
| More challenging language understanding tasks (word sense, causal reasoning, reading comprehension).
| Paper, Website
|-
| [[WSC]]
| Common Sense
| 2012-05
| 2019-07
| Saturation
| [[ROBERTA (w SFT)]]
| Human: 96.5%
| ROBERTA (w SFT): 90.1%
| Carefully crafted sentence pairs with ambiguous pronoun references.
| Paper, Website
|-
| [[GLUE]]
| Language
| 2018-05
| 2019-06
| Saturation
| [[XLNet]]
| Human: 87.1%
| XLNet: 88.4%
| Nine tasks for evaluating NLU (inference, paraphrase, similarity, etc.).
| Paper, Website
|-
| [[TriviaQA]]
| Knowledge
| 2017-05
| 2019-06
| Saturation
| [[SpanBERT]]
| Human: 79.7%
| SpanBERT: 83.6%
| 650K QA-evidence triples requiring cross-sentence reasoning.
| Paper, Website
|-
| [[SQuAD v2.0]]
| Language
| 2018-05
| 2019-04
| Saturation
| [[BERT]]
| Human: 89.5%
| BERT: 89.5%
| Extension of SQuAD adding unanswerable questions (the item shape is sketched after the table).
| Paper, Website
|-
| [[SQuAD]]
| Language
| 2016-05
| 2019-03
| Saturation
| [[BERT]]
| Human: 91.2%
| BERT: 93.2%
| 100,000+ QA tasks on Wikipedia articles.
| Paper, Website
|}
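
On the SQuAD and SQuAD v2.0 entries above: both are extractive, so each answer is a character span of the given passage, and v2.0 adds questions with no answer in the passage. The sketch below shows the general shape of such items; the passage and questions are invented, and the field names (e.g. <code>is_impossible</code>) follow the commonly used JSON layout rather than being copied verbatim from the dataset.

<syntaxhighlight lang="python">
# Shape of SQuAD-style extractive QA items. The passage and questions are
# invented; field names (e.g. "is_impossible") follow the commonly used JSON
# layout for SQuAD v2.0, where unanswerable questions carry no answer span.

context = (
    "The Rhine rises in the Swiss Alps and flows through Germany "
    "and the Netherlands before reaching the North Sea."
)

examples = [
    {
        "question": "Where does the Rhine rise?",
        "answers": [{"text": "the Swiss Alps", "answer_start": 19}],
        "is_impossible": False,
    },
    {
        # Unanswerable under v2.0: the passage never states the river's length.
        "question": "How long is the Rhine?",
        "answers": [],
        "is_impossible": True,
    },
]

# Sanity check: every answer span must be recoverable directly from the
# context, which is what makes the task extractive rather than generative.
for example in examples:
    for answer in example["answers"]:
        start = answer["answer_start"]
        assert context[start:start + len(answer["text"])] == answer["text"]
print("spans verified")
</syntaxhighlight>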

=== 2018 ===

{| class="wikitable"
! Benchmark
! Category
! Date Created
! Date Defeated
! Killed By
! Defeated By Model
! Original Score
! Final Score
! Details
! Links
|-
| [[SWAG]]
| Common Sense
| 2018-05
| 2018-10
| Saturation
| [[BERT]]
| Human: 88%
| BERT: 86%
| 113K multiple-choice questions about grounded situations (common sense “next step”).
| Paper, Website
|}