LLM Benchmarks Timeline: Difference between revisions

| Human: 89.8%
| T5: 89.3%
| [https://arxiv.org/abs/1905.00537 Paper], [https://super.gluebenchmark.com/ Website]
| More challenging language understanding tasks (word sense, causal reasoning, reading comprehension).
|-
| '''WSC'''
| Common Sense
| 2012-05 – 2019-07
| 2012-05
| 2019-07
| Saturation
| RoBERTa (w/ SFT)
| Human: 96.5%
| RoBERTa (w/ SFT): 90.1%
| [https://cdn.aaai.org/ocs/4492/4492-21843-1-PB.pdf Paper], [https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html Website]
| Carefully crafted sentence pairs with ambiguous pronoun references.
|-
| '''GLUE'''
| Language
| 2018-05 – 2019-06
| 2018-05
| 2019-06
| Saturation
| XLNet
| Human: 87.1%
| XLNet: 88.4%
| [https://arxiv.org/abs/1804.07461 Paper], [https://gluebenchmark.com/ Website]
| Nine tasks for evaluating NLU (inference, paraphrase, similarity, etc.).
|-
| '''TriviaQA'''
| Knowledge
| 2017-05 – 2019-06
| 2017-05
| 2019-06
| Saturation
| SpanBERT
| Human: 79.7%
| SpanBERT: 83.6%
| [https://arxiv.org/abs/1705.03551 Paper], [http://nlp.cs.washington.edu/triviaqa/ Website]
| 650K QA-evidence triples requiring cross-sentence reasoning.
|-
| '''SQuAD v2.0'''
| Language
| 2018-05 – 2019-04
| 2018-05
| 2019-04
| Saturation
| BERT
| Human: 89.5%
| BERT: 89.5%
| [https://arxiv.org/abs/1806.03822 Paper], [https://rajpurkar.github.io/SQuAD-explorer/ Website]
| Extension of SQuAD adding unanswerable questions.
|-
| '''SQuAD'''
| Language
| 2016-05 – 2019-03
| 2016-05
| 2019-03
| Saturation
| BERT
| Human: 91.2%
| BERT: 93.2%
| [https://arxiv.org/abs/1606.05250 Paper], [https://rajpurkar.github.io/SQuAD-explorer/ Website]
| 100,000+ QA tasks on Wikipedia articles.
|}



Revision as of 16:49, 10 January 2025

2024

{| class="wikitable"
! Benchmark !! Category !! Time Span !! Date Created !! Date Defeated !! Killed By !! Defeated By !! Original Score !! Final Score !! Links !! Details
|-
| '''ARC-AGI''' || Reasoning || 2019-11 – 2024-12 || 2019-11 || 2024-12 || Saturation || o3 || Human Baseline: ~80% || o3: 87.5% || Paper, Website || Abstract reasoning challenge with visual pattern completion tasks created by François Chollet.
|-
| '''MATH''' || Mathematics || 2021-03 – 2024-09 || 2021-03 || 2024-09 || Saturation || o1 || Average CS PhD: ~40% || o1: 94.8% || Paper, GitHub || 12K challenging competition math problems from AMC/AIME, requiring complex multi-step reasoning.
|-
| '''BIG-Bench-Hard''' || Multi-task || 2022-10 – 2024-06 || 2022-10 || 2024-06 || Saturation || Claude 3.5 Sonnet || Average Human: 67.7% || Claude 3.5 Sonnet: 93.1% || Paper, GitHub, Evidence || A curated suite of 23 challenging tasks from BIG-Bench.
|-
| '''HumanEval''' || Coding || 2021-07 – 2024-05 || 2021-07 || 2024-05 || Saturation || GPT-4o || Unspecified || GPT-4o: 90.2% || Paper, GitHub, Evidence || 164 Python programming problems testing coding abilities.
|-
| '''IFEval''' || Instruction Following || 2023-11 – 2024-03 || 2023-11 || 2024-03 || Saturation || Llama 3.3 70B || Unspecified || Llama 3.3 70B: 92.1% || Paper, GitHub, Evidence || Evaluation suite testing multi-step instruction-following capabilities.
|}

2023

{| class="wikitable"
! Benchmark !! Category !! Time Span !! Date Created !! Date Defeated !! Killed By !! Defeated By !! Original Score !! Final Score !! Links !! Details
|-
| '''GSM8K''' || Mathematics || 2021-10 – 2023-11 || 2021-10 || 2023-11 || Saturation || GPT-4 || Unspecified || GPT-4: 92.0% || Paper, GitHub, Evidence || 8.5K grade school math word problems requiring step-by-step solutions.
|-
| '''Turing Test''' || Conversation || 1950-10 – 2023-03 || 1950-10 || 2023-03 || Saturation || GPT-4 || Interrogator > 50% || Interrogator: 46% || Paper, Evidence || The original AI benchmark proposed by Alan Turing in 1950 (the "imitation game").
|-
| '''ARC (AI2)''' || Reasoning || 2018-03 – 2023-03 || 2018-03 || 2023-03 || Saturation || GPT-4 || Unspecified || GPT-4: 96.3% || Paper, Website, Evidence || Grade-school multiple-choice reasoning tasks testing logical, spatial, and temporal reasoning.
|-
| '''HellaSwag''' || Common Sense || 2019-05 – 2023-03 || 2019-05 || 2023-03 || Saturation || GPT-4 || Human: 95.6% || GPT-4: 95.3% || Paper, Website, Evidence || Multiple-choice questions about everyday scenarios with adversarial filtering.
|-
| '''MMLU''' || Knowledge || 2020-09 – 2023-03 || 2020-09 || 2023-03 || Saturation || GPT-4 || 95th-percentile Human: 87.0% || GPT-4: 87.3% || Paper, GitHub, Evidence || 57 subjects from real-world sources (professional exams) testing breadth and depth of knowledge.
|-
| '''WinoGrande''' || Common Sense || 2019-07 – 2023-03 || 2019-07 || 2023-03 || Saturation || GPT-4 || Human: 94% || GPT-4: 87.5% || Paper, Website, Evidence || Enhanced WSC with 44K problems testing common-sense pronoun resolution.
|}

Pre-2023

2022

{| class="wikitable"
! Benchmark !! Category !! Time Span !! Date Created !! Date Defeated !! Killed By !! Defeated By !! Original Score !! Final Score !! Links !! Details
|-
| '''BIG-Bench''' || Multi-task || 2021-06 – 2022-04 || 2021-06 || 2022-04 || Saturation || PaLM 540B || Human: 49.8% || PaLM 540B: 61.4% || Paper, GitHub, Evidence || 204 tasks spanning linguistics, math, common-sense reasoning, and more.
|}

2019

{| class="wikitable"
! Benchmark !! Category !! Time Span !! Date Created !! Date Defeated !! Killed By !! Defeated By !! Original Score !! Final Score !! Links !! Details
|-
| '''SuperGLUE''' || Language || 2019-05 – 2019-10 || 2019-05 || 2019-10 || Saturation || T5 || Human: 89.8% || T5: 89.3% || [https://arxiv.org/abs/1905.00537 Paper], [https://super.gluebenchmark.com/ Website] || More challenging language understanding tasks (word sense, causal reasoning, reading comprehension).
|}

2018

{| class="wikitable"
! Benchmark !! Category !! Time Span !! Date Created !! Date Defeated !! Killed By !! Defeated By !! Original Score !! Final Score !! Links !! Details
|-
| '''SWAG''' || Common Sense || 2018-05 – 2018-10 || 2018-05 || 2018-10 || Saturation || BERT || Human: 88% || BERT: 86% || [https://arxiv.org/abs/1808.05326 Paper], [https://rowanzellers.com/swag/ Website] || 113K multiple-choice questions about grounded situations (common sense “next step”).
|}
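
All of the tables above share one row schema (Benchmark, Category, Time Span, Date Created, Date Defeated, Killed By, Defeated By, Original Score, Final Score, Links, Details). The sketch below is only an illustration of how that schema could be modelled in code: the class and method names are hypothetical, the example values are copied from the ARC-AGI row of the 2024 table, and the score comparison is a rough heuristic rather than the exact criterion behind the "Saturation" label (a few rows, such as HellaSwag and WinoGrande, are listed as defeated while still slightly below the human baseline).

<syntaxhighlight lang="python">
from dataclasses import dataclass


@dataclass
class BenchmarkEntry:
    """One row of the timeline tables (illustrative field names, not an official schema)."""
    name: str
    category: str
    date_created: str      # YYYY-MM, e.g. "2019-11"
    date_defeated: str     # YYYY-MM, e.g. "2024-12"
    defeated_by: str       # model credited with saturating the benchmark
    original_score: float  # human / reference baseline, in percent
    final_score: float     # best reported model score at saturation, in percent

    def roughly_saturated(self) -> bool:
        # Heuristic only: most rows are marked "Saturation" once the best model
        # score meets or exceeds the reference score, but a few (e.g. HellaSwag,
        # WinoGrande) are listed as defeated while still just under the human baseline.
        return self.final_score >= self.original_score


# Example values copied from the 2024 table (ARC-AGI row).
arc_agi = BenchmarkEntry(
    name="ARC-AGI",
    category="Reasoning",
    date_created="2019-11",
    date_defeated="2024-12",
    defeated_by="o3",
    original_score=80.0,   # "Human Baseline: ~80%"
    final_score=87.5,      # "o3: 87.5%"
)
print(arc_agi.roughly_saturated())  # True
</syntaxhighlight>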