LLM Benchmarks Timeline
== 2024 ==
{| class="wikitable"
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Details
! Links
|-
| [[ARC-AGI]]
| Reasoning
| 2019-11 – 2024-12
| 2019-11
| 2024-12
| Saturation
| O3
| Human Baseline: ~80%
| O3: 87.5%
| Abstract reasoning challenge with visual pattern completion tasks, created by François Chollet.
| [https://arxiv.org/abs/1911.01547 Paper], [https://arcs-benchmark.org Website]
|-
| [[MATH]]
| Mathematics
| 2021-03 – 2024-09
| 2021-03
| 2024-09
| Saturation
| O1
| Average CS PhD: ~40%
| O1: 94.8%
| 12K challenging competition math problems from AMC/AIME, requiring complex multi-step reasoning.
| [https://arxiv.org/abs/2103.03874 Paper], [https://github.com/hendrycks/math GitHub]
|-
| [[BIG-Bench-Hard]]
| Multi-task
| 2022-10 – 2024-06
| 2022-10
| 2024-06
| Saturation
| Sonnet 3.5
| Average Human: 67.7%
| Sonnet 3.5: 93.1%
| A curated suite of 23 challenging BIG-Bench tasks on which earlier language models had not beaten the average human rater.
| [https://arxiv.org/abs/2210.09261 Paper], [https://github.com/suzgunmirac/BIG-Bench-Hard GitHub], [https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Claude-3-Model-Card-October-Addendum.pdf Evidence]
|-
| [[HumanEval]]
| Coding
| 2021-07 – 2024-05
| 2021-07
| 2024-05
| Saturation
| GPT-4o
| Unspecified
| GPT-4o: 90.2%
| 164 hand-written Python programming problems testing functional correctness of generated code (see the pass@k sketch below the table).
| [https://arxiv.org/abs/2107.03374 Paper], [https://github.com/openai/human-eval GitHub], [https://openai.com/index/hello-gpt-4o/ Evidence]
|-
| [[IFEval]]
| Instruction Following
| 2023-11 – 2024-03
| 2023-11
| 2024-03
| Saturation
| Llama 3.3 70B
| Unspecified
| Llama 3.3 70B: 92.1%
| Evaluation suite of prompts with automatically verifiable instructions (e.g., length and format constraints) testing instruction following.
| [https://arxiv.org/abs/2311.07911 Paper], [https://github.com/google-research/google-research/tree/master/instruction_following_eval GitHub], [https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md Evidence]
|}
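Coding benchmarks such as HumanEval are scored with the pass@k metric: a problem counts as solved if at least one of k generated samples passes the problem's unit tests, and headline numbers like the 90.2% above are typically pass@1. Below is a minimal sketch of the unbiased pass@k estimator described in the HumanEval paper; the sample counts in the usage example are illustrative only.

<syntaxhighlight lang="python">
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem, given n generated
    samples of which c passed the unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # 1 - C(n-c, k) / C(n, k), written as a numerically stable running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative numbers: 200 samples drawn for one problem, 120 of them passing.
print(pass_at_k(200, 120, 1))    # 0.6 -> this problem's contribution to pass@1
print(pass_at_k(200, 120, 10))   # close to 1.0 -> contribution to pass@10
</syntaxhighlight>

The reported benchmark score is this per-problem estimate averaged over all 164 problems.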
== 2023 ==
{| class="wikitable"
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Details
! Links
|-
| [[GSM8K]]
| Mathematics
| 2021-10 – 2023-11
| 2021-10
| 2023-11
| Saturation
| GPT-4
| Unspecified
| GPT-4: 92.0%
| 8.5K grade-school math word problems requiring step-by-step solutions (see the scoring sketch below the table).
| [https://arxiv.org/abs/2110.14168 Paper], [https://github.com/openai/grade-school-math GitHub], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|-
| [[Turing Test]]
| Conversation
| 1950-10 – 2023-03
| 1950-10
| 2023-03
| Saturation
| GPT-4
| Interrogator > 50%
| Interrogator: 46%
| The original AI benchmark proposed by Alan Turing in 1950 (the "imitation game").
| [https://courses.cs.umbc.edu/471/papers/turing.pdf Paper], [https://arxiv.org/pdf/2405.08007 Evidence]
|-
| [[ARC (AI2)]]
| Reasoning
| 2018-03 – 2023-03
| 2018-03
| 2023-03
| Saturation
| GPT-4
| Unspecified
| GPT-4: 96.3%
| Grade-school multiple-choice science questions testing logical, spatial, and temporal reasoning.
| [https://arxiv.org/abs/1803.05457 Paper], [https://leaderboard.allenai.org/arc/submissions/get-started Website], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|-
| [[HellaSwag]]
| Common Sense
| 2019-05 – 2023-03
| 2019-05
| 2023-03
| Saturation
| GPT-4
| Human: 95.6%
| GPT-4: 95.3%
| Multiple-choice sentence-completion questions about everyday scenarios, built with adversarial filtering.
| [https://arxiv.org/abs/1905.07830 Paper], [https://rowanzellers.com/hellaswag/ Website], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|-
| [[MMLU]]
| Knowledge
| 2020-09 – 2023-03
| 2020-09
| 2023-03
| Saturation
| GPT-4
| 95th pct Human: 87.0%
| GPT-4: 87.3%
| Multiple-choice questions across 57 subjects, drawn from real-world sources such as professional exams, testing breadth and depth of knowledge.
| [https://arxiv.org/abs/2009.03300 Paper], [https://github.com/hendrycks/test GitHub], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|-
| [[WinoGrande]]
| Common Sense
| 2019-07 – 2023-03
| 2019-07
| 2023-03
| Saturation
| GPT-4
| Human: 94%
| GPT-4: 87.5%
| Enhanced Winograd Schema Challenge with 44K problems testing common-sense pronoun resolution.
| [https://arxiv.org/abs/1907.10641 Paper], [https://winogrande.allenai.org/ Website], [https://cdn.openai.com/papers/gpt-4.pdf Evidence]
|}
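Most of the benchmarks in the table above are scored by plain exact-match or multiple-choice accuracy. As a concrete illustration, here is a minimal sketch of GSM8K-style scoring, assuming the Hugging Face <code>datasets</code> package and a hypothetical <code>generate()</code> call standing in for the model under test; GSM8K marks its gold final answer after "####".

<syntaxhighlight lang="python">
import re
from datasets import load_dataset

def final_number(text: str) -> str | None:
    """Return the last number in the text, with commas stripped."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

dataset = load_dataset("gsm8k", "main", split="test")

correct = 0
for example in dataset:
    # Gold answers end with "#### <number>".
    gold = example["answer"].split("####")[-1].strip().replace(",", "")
    prediction = generate(example["question"])  # hypothetical model call
    if final_number(prediction) == gold:
        correct += 1

print(f"Exact-match accuracy: {correct / len(dataset):.1%}")
</syntaxhighlight>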
== Pre-2023 ==
=== 2022 ===
{| class="wikitable"
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Details
! Links
|-
| [[BIG-Bench]]
| Multi-task
| 2021-06 – 2022-04
| 2021-06
| 2022-04
| Saturation
| PaLM 540B
| Human: 49.8%
| PaLM 540B: 61.4%
| 204 tasks spanning linguistics, math, common-sense reasoning, and more.
| Paper, GitHub, Evidence
|}
=== 2019 ===
{| class="wikitable"
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Details
! Links
|-
| [[SuperGLUE]]
| Language
| 2019-05 – 2019-10
| 2019-05
| 2019-10
| Saturation
| T5
| Human: 89.8%
| T5: 89.3%
| More challenging language-understanding tasks (word sense disambiguation, causal reasoning, reading comprehension).
| Paper, Website
|}
=== 2018 ===
{| class="wikitable"
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Details
! Links
|-
| [[SWAG]]
| Common Sense
| 2018-05 – 2018-10
| 2018-05
| 2018-10
| Saturation
| BERT
| Human: 88%
| BERT: 86%
| 113K multiple-choice questions about grounded situations (common-sense “next step” prediction; see the scoring sketch below the table).
| [https://arxiv.org/abs/1808.05326 Paper], [https://rowanzellers.com/swag/ Website]
|}
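Multiple-choice common-sense benchmarks such as SWAG (and the later HellaSwag) are commonly scored by comparing the language model's likelihood of each candidate continuation and picking the highest. The sketch below illustrates that idea only; it is not the official SWAG or SuperGLUE harness, and the model and example sentence are illustrative stand-ins.

<syntaxhighlight lang="python">
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small illustrative model; real evaluations substitute the model under test.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def choice_logprob(context: str, ending: str) -> float:
    """Sum of log-probabilities of the ending tokens, conditioned on the context."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits at position i predict the token at position i + 1
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    ending_positions = range(ctx_len - 1, full_ids.shape[1] - 1)
    ending_tokens = full_ids[0, ctx_len:]
    return sum(log_probs[pos, tok].item()
               for pos, tok in zip(ending_positions, ending_tokens))

# Made-up example in the spirit of a SWAG item.
context = "She put the kettle on the stove and"
endings = [" waited for the water to boil.", " painted the ceiling blue."]
best = max(range(len(endings)), key=lambda i: choice_logprob(context, endings[i]))
print(endings[best])
</syntaxhighlight>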