LLM Benchmarks Timeline

2024

| Benchmark | Category | Time Span | Killed By | Killed (Ago) | Age | Defeated By | Original Score | Final Score | Details |
|---|---|---|---|---|---|---|---|---|---|
| ARC-AGI | Reasoning | 2019 – 2024 | Saturation | 1 month ago | 5 years, 1 month | o3 | Human baseline: ~80% | o3: 87.5% | Abstract reasoning challenge with visual pattern-completion tasks, created by François Chollet. |
| MATH | Mathematics | 2021 – 2024 | Saturation | 4 months ago | 3 years, 6 months | o1 | Average CS PhD: ~40% | o1: 94.8% | 12,500 challenging competition math problems from AMC/AIME, requiring complex multi-step reasoning. |
| BIG-Bench-Hard | Multi-task | 2022 – 2024 | Saturation | 7 months ago | 1 year, 8 months | Claude 3.5 Sonnet | Average human: 67.7% | Claude 3.5 Sonnet: 93.1% | A curated suite of 23 challenging tasks from BIG-Bench. |
| HumanEval | Coding | 2021 – 2024 | Saturation | 8 months ago | 2 years, 10 months | GPT-4o | Unspecified | GPT-4o: 90.2% | 164 Python programming problems testing code generation, scored with pass@k (see the sketch below this table). |
| IFEval | Instruction Following | 2023 – 2024 | Saturation | 10 months ago | 4 months | Llama 3.3 70B | Unspecified | Llama 3.3 70B: 92.1% | Evaluation suite testing multi-step instruction-following capabilities. |
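
HumanEval-style coding benchmarks are conventionally scored with the unbiased pass@k estimator from the original HumanEval paper (Chen et al., 2021): sample n completions per problem, count the c that pass all unit tests, and estimate the probability that at least one of k samples passes. A minimal sketch of that estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: total completions sampled for a problem
    c: completions that passed all unit tests
    k: sampling budget being evaluated
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    # 1 - C(n-c, k) / C(n, k): chance that a random k-subset has no pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 of which pass the tests
print(pass_at_k(200, 37, 1))   # 0.185 (exactly c/n for k=1)
print(pass_at_k(200, 37, 10))  # ~0.877
```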

2023

| Benchmark | Category | Time Span | Killed By | Killed (Ago) | Age | Defeated By | Original Score | Final Score | Details |
|---|---|---|---|---|---|---|---|---|---|
| GSM8K | Mathematics | 2021 – 2023 | Saturation | 1 year ago | 2 years, 1 month | GPT-4 | Unspecified | GPT-4: 92.0% | 8.5K grade-school math word problems requiring step-by-step solutions, scored by exact match on the final answer (see the sketch below this table). |
| Turing Test | Conversation | 1950 – 2023 | Saturation | 1 year ago | 73 years, 5 months | GPT-4 | Interrogator accuracy > 50% | Interrogator accuracy: 46% | The original AI benchmark, proposed by Alan Turing in 1950 as the "imitation game". |
| ARC (AI2) | Reasoning | 2018 – 2023 | Saturation | 1 year ago | 5 years | GPT-4 | Unspecified | GPT-4: 96.3% | Grade-school multiple-choice science questions requiring logical, spatial, and temporal reasoning. |
| HellaSwag | Common Sense | 2019 – 2023 | Saturation | 1 year ago | 3 years, 10 months | GPT-4 | Human: 95.6% | GPT-4: 95.3% | Multiple-choice questions about everyday scenarios, built with adversarial filtering. |
| MMLU | Knowledge | 2020 – 2023 | Saturation | 1 year ago | 2 years, 6 months | GPT-4 | 95th-percentile human: 87.0% | GPT-4: 87.3% | 57 subjects drawn from real-world sources (professional exams), testing breadth and depth of knowledge. |
| WinoGrande | Common Sense | 2019 – 2023 | Saturation | 1 year ago | 3 years, 8 months | GPT-4 | Human: 94% | GPT-4: 87.5% | Enhanced WSC with 44K problems testing common-sense pronoun resolution. |
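
GSM8K reference solutions end with a final line of the form `#### <number>`, and models are typically scored by exact match on that extracted value. A minimal sketch of the extraction and comparison; the regex and normalization here are illustrative assumptions, not the official evaluation script:

```python
import re

# Captures the number after '####', allowing sign, commas, and decimals
ANSWER_RE = re.compile(r"####\s*(-?[\d,]+(?:\.\d+)?)")

def extract_answer(text: str) -> str | None:
    """Pull the final '#### <number>' answer from a GSM8K-style solution."""
    match = ANSWER_RE.search(text)
    if match is None:
        return None
    return match.group(1).replace(",", "")  # drop thousands separators

def exact_match(prediction: str, reference: str) -> bool:
    """Compare extracted final answers from two full solution texts."""
    pred, ref = extract_answer(prediction), extract_answer(reference)
    return pred is not None and pred == ref

gold = "Natalia sold 48 clips in April and half as many in May. 48 + 24 = 72. #### 72"
pred = "She sold 48 + 48/2 = 72 clips in total. #### 72"
print(exact_match(pred, gold))  # True
```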

Pre-2023

2022

| Benchmark | Category | Time Span | Killed By | Killed (Ago) | Age | Defeated By | Original Score | Final Score | Details |
|---|---|---|---|---|---|---|---|---|---|
| BIG-Bench | Multi-task | 2021 – 2022 | Saturation | 2 years ago | 10 months | PaLM 540B | Human: 49.8% | PaLM 540B: 61.4% | 204 tasks spanning linguistics, math, common-sense reasoning, and more. |

2019

| Benchmark | Category | Time Span | Killed By | Killed (Ago) | Age | Defeated By | Original Score | Final Score | Details |
|---|---|---|---|---|---|---|---|---|---|
| SuperGLUE | Language | 2019 – 2019 | Saturation | 5 years ago | 5 months | T5 | Human: 89.8% | T5: 89.3% | More challenging language-understanding tasks (word sense, causal reasoning, reading comprehension). |
| WSC | Common Sense | 2012 – 2019 | Saturation | 5 years ago | 7 years, 3 months | RoBERTa (w/ SFT) | Human: 96.5% | RoBERTa (w/ SFT): 90.1% | Carefully crafted sentence pairs with ambiguous pronoun references (Winograd Schema Challenge). |
| GLUE | Language | 2018 – 2019 | Saturation | 5 years ago | 1 year, 1 month | XLNet | Human: 87.1% | XLNet: 88.4% | Nine tasks for evaluating NLU (inference, paraphrase, similarity, etc.). |
| TriviaQA | Knowledge | 2017 – 2019 | Saturation | 5 years ago | 2 years, 1 month | SpanBERT | Human: 79.7% | SpanBERT: 83.6% | 650K question-answer-evidence triples requiring cross-sentence reasoning. |
| SQuAD v2.0 | Language | 2018 – 2019 | Saturation | 5 years ago | 11 months | BERT | Human: 89.5% | BERT: 89.5% | Extension of SQuAD adding unanswerable questions. |
| SQuAD | Language | 2016 – 2019 | Saturation | 5 years ago | 2 years, 10 months | BERT | Human: 91.2% | BERT: 93.2% | 100,000+ extractive QA tasks on Wikipedia articles, scored by exact match and token-level F1 (see the sketch below this table). |
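
SQuAD reports exact match and token-level F1 between the predicted span and the gold answer, after normalizing both. A simplified sketch of that metric, assuming the standard normalization (lowercasing, stripping punctuation and articles); the official script additionally takes the max over multiple gold answers:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> list[str]:
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted span and one gold answer span."""
    pred, ref = normalize(prediction), normalize(reference)
    common = Counter(pred) & Counter(ref)     # multiset token overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(f1_score("the Norman conquest", "Norman conquest of England"))  # ~0.667
```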

2018

| Benchmark | Category | Time Span | Killed By | Killed (Ago) | Age | Defeated By | Original Score | Final Score | Details |
|---|---|---|---|---|---|---|---|---|---|
| SWAG | Common Sense | 2018 – 2018 | Saturation | 6 years ago | 5 months | BERT | Human: 88% | BERT: 86% | 113K multiple-choice questions about grounded situations (common-sense "next step" prediction; see the scoring sketch below this table). |
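
SWAG-style multiple-choice benchmarks (and successors like HellaSwag) are commonly scored by picking the candidate ending the model assigns the highest log-likelihood, often length-normalized so longer endings are not penalized. A minimal sketch, assuming per-token log-probabilities for each candidate are already available; function and variable names are illustrative:

```python
def pick_ending(choice_logprobs: list[list[float]], normalize: bool = True) -> int:
    """Return the index of the best candidate ending.

    choice_logprobs: one list of per-token log-probabilities per candidate.
    normalize: divide by token count (length normalization).
    """
    def score(tokens: list[float]) -> float:
        total = sum(tokens)
        return total / len(tokens) if normalize else total

    scores = [score(tokens) for tokens in choice_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)

# Example: four candidate endings with per-token logprobs from some LM
endings = [[-2.1, -0.3], [-0.5, -0.4, -0.6], [-3.0], [-1.2, -1.1]]
print(pick_ending(endings))  # 1: best average log-probability
```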