LLM Benchmarks Timeline
2024
Benchmark | Category | Time Span | Date Created | Date Defeated | Killed By | Defeated By | Original Score | Final Score | Links | Details |
---|---|---|---|---|---|---|---|---|---|---|
ARC-AGI | Reasoning | 2019-11 – 2024-12 | 2019-11 | 2024-12 | Saturation | o3 | Human Baseline: ~80% | o3: 87.5% | [Paper](https://arxiv.org/abs/1911.01547), [Website](https://arcs-benchmark.org) | Abstract reasoning challenge with visual pattern completion tasks created by François Chollet. |
MATH | Mathematics | 2021-03 – 2024-09 | 2021-03 | 2024-09 | Saturation | o1 | Average CS PhD: ~40% | o1: 94.8% | [Paper](https://arxiv.org/abs/2103.03874), [GitHub](https://github.com/hendrycks/math) | 12K challenging competition math problems from AMC/AIME, requiring complex multi-step reasoning. |
BIG-Bench-Hard | Multi-task | 2022-10 – 2024-06 | 2022-10 | 2024-06 | Saturation | Claude 3.5 Sonnet | Average Human: 67.7% | Claude 3.5 Sonnet: 93.1% | [Paper](https://arxiv.org/abs/2210.09261), [GitHub](https://github.com/suzgunmirac/BIG-Bench-Hard), [Evidence](https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Claude-3-Model-Card-October-Addendum.pdf) | A curated suite of 23 challenging tasks from BIG-Bench. |
HumanEval | Coding | 2021-07 – 2024-05 | 2021-07 | 2024-05 | Saturation | GPT-4o | Unspecified | GPT-4o: 90.2% | [Paper](https://arxiv.org/abs/2107.03374), [GitHub](https://github.com/openai/human-eval), [Evidence](https://openai.com/index/hello-gpt-4o/) | 164 Python programming problems testing coding abilities. |
IFEval | Instruction Following | 2023-11 – 2024-03 | 2023-11 | 2024-03 | Saturation | Llama 3.3 70B | Unspecified | Llama 3.3 70B: 92.1% | [Paper](https://arxiv.org/abs/2311.07911), [GitHub](https://github.com/google-research/google-research/tree/master/instruction_following_eval), [Evidence](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md) | Evaluation suite testing multi-step instruction-following capabilities. |
2023
Benchmark | Category | Time Span | Date Created | Date Defeated | Killed By | Defeated By | Original Score | Final Score | Links | Details |
---|---|---|---|---|---|---|---|---|---|---|
GSM8K | Mathematics | 2021-10 – 2023-11 | 2021-10 | 2023-11 | Saturation | GPT-4 | Unspecified | GPT-4: 92.0% | [Paper](https://arxiv.org/abs/2110.14168), [GitHub](https://github.com/openai/grade-school-math), [Evidence](https://cdn.openai.com/papers/gpt-4.pdf) | 8.5K grade school math word problems requiring step-by-step solutions. |
Turing Test | Conversation | 1950-10 – 2023-03 | 1950-10 | 2023-03 | Saturation | GPT-4 | Interrogator > 50% | Interrogator 46% | [Paper](https://courses.cs.umbc.edu/471/papers/turing.pdf), [Evidence](https://arxiv.org/pdf/2405.08007) | The original AI benchmark proposed by Alan Turing in 1950 (the "imitation game"). |
ARC (AI2) | Reasoning | 2018-03 – 2023-03 | 2018-03 | 2023-03 | Saturation | GPT-4 | Unspecified | GPT-4: 96.3% | [Paper](https://arxiv.org/abs/1803.05457), [Website](https://leaderboard.allenai.org/arc/submissions/get-started), [Evidence](https://cdn.openai.com/papers/gpt-4.pdf) | Grade-school multiple-choice reasoning tasks testing logical, spatial, temporal reasoning. |
HellaSwag | Common Sense | 2019-05 – 2023-03 | 2019-05 | 2023-03 | Saturation | GPT-4 | Human: 95.6% | GPT-4: 95.3% | [Paper](https://arxiv.org/abs/1905.07830), [Website](https://rowanzellers.com/hellaswag/), [Evidence](https://cdn.openai.com/papers/gpt-4.pdf) | Multiple-choice questions about everyday scenarios with adversarial filtering. |
MMLU | Knowledge | 2020-09 – 2023-03 | 2020-09 | 2023-03 | Saturation | GPT-4 | 95th pct Human: 87.0% | GPT-4: 87.3% | [Paper](https://arxiv.org/abs/2009.03300), [GitHub](https://github.com/hendrycks/test), [Evidence](https://cdn.openai.com/papers/gpt-4.pdf) | 57 subjects from real-world sources (professional exams) testing breadth and depth of knowledge. |
WinoGrande | Common Sense | 2019-07 – 2023-03 | 2019-07 | 2023-03 | Saturation | GPT-4 | Human: 94% | GPT-4: 87.5% | [Paper](https://arxiv.org/abs/1907.10641), [Website](https://winogrande.allenai.org/), [Evidence](https://cdn.openai.com/papers/gpt-4.pdf) | Enhanced WSC with 44K problems testing common-sense pronoun resolution. |
Pre-2023
2022
Benchmark | Category | Time Span | Killed By | Killed (Ago) | Age | Defeated By | Original Score | Final Score | Details |
---|---|---|---|---|---|---|---|---|---|
BIG-Bench | Multi-task | 2021 – 2022 | Saturation | 2 years ago | 10 months | PaLM 540B | Human: 49.8% | PaLM 540B: 61.4% | 204 tasks spanning linguistics, math, common-sense reasoning, and more. |
2019
Benchmark | Category | Time Span | Killed By | Killed (Ago) | Age | Defeated By | Original Score | Final Score | Details |
---|---|---|---|---|---|---|---|---|---|
SuperGLUE | Language | 2019 – 2019 | Saturation | 5 years ago | 5 months | T5 | Human: 89.8% | T5: 89.3% | More challenging language understanding tasks (word sense, causal reasoning, reading comprehension). |
WSC | Common Sense | 2012 – 2019 | Saturation | 5 years ago | 7 years, 3 months | RoBERTa (w/ SFT) | Human: 96.5% | RoBERTa (w/ SFT): 90.1% | Carefully crafted sentence pairs with ambiguous pronoun references. |
GLUE | Language | 2018 – 2019 | Saturation | 5 years ago | 1 year, 1 month | XLNet | Human: 87.1% | XLNet: 88.4% | Nine tasks for evaluating NLU (inference, paraphrase, similarity, etc.). |
TriviaQA | Knowledge | 2017 – 2019 | Saturation | 5 years ago | 2 years, 1 month | SpanBERT | Human: 79.7% | SpanBERT: 83.6% | 650K QA-evidence triples requiring cross-sentence reasoning. |
SQuAD v2.0 | Language | 2018 – 2019 | Saturation | 5 years ago | 11 months | BERT | Human: 89.5% | BERT: 89.5% | Extension of SQuAD adding unanswerable questions. |
SQuAD | Language | 2016 – 2019 | Saturation | 5 years ago | 2 years, 10 months | BERT | Human: 91.2% | BERT: 93.2% | 100,000+ QA tasks on Wikipedia articles. |
2018
Benchmark | Category | Time Span | Killed By | Killed (Ago) | Age | Defeated By | Original Score | Final Score | Details |
---|---|---|---|---|---|---|---|---|---|
SWAG | Common Sense | 2018 – 2018 | Saturation | 6 years ago | 5 months | BERT | Human: 88% | BERT: 86% | 113K multiple-choice questions about grounded situations (common sense “next step”). |
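
The Age column in the pre-2023 tables is the gap between a benchmark's creation and its defeat; the 2023 and 2024 tables carry the same information as Date Created and Date Defeated. The following is a minimal sketch of how such an age could be derived, assuming `YYYY-MM` strings as used in those columns. The function names `months_between` and `format_age` are hypothetical and not part of the source site, and the site's own rounding may differ.

```python
def months_between(created: str, defeated: str) -> int:
    """Whole months from a YYYY-MM creation date to a YYYY-MM defeat date."""
    c_year, c_month = (int(part) for part in created.split("-"))
    d_year, d_month = (int(part) for part in defeated.split("-"))
    return (d_year - c_year) * 12 + (d_month - c_month)


def format_age(months: int) -> str:
    """Render a month count in the 'N years, M months' style of the Age column."""
    years, rem = divmod(months, 12)
    parts = []
    if years:
        parts.append(f"{years} year" + ("s" if years != 1 else ""))
    # Show months when non-zero, or when the whole age is under one year.
    if rem or not parts:
        parts.append(f"{rem} month" + ("s" if rem != 1 else ""))
    return ", ".join(parts)


# Dates taken from the ARC-AGI and IFEval rows of the 2024 table.
print(format_age(months_between("2019-11", "2024-12")))  # 5 years, 1 month
print(format_age(months_between("2023-11", "2024-03")))  # 4 months
```

Because the pre-2023 Time Span column lists years only, this calculation cannot be used to re-derive those rows' Age values from the table alone.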