== 2024 ==

{| class="wikitable"
|-
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Links
! Details
|-
| '''ARC-AGI'''
| Reasoning
| 2019-11 – 2024-12
| 2019-11
| 2024-12
| Saturation
| o3
| Human Baseline: ~80%
| o3: 87.5%
| [Paper](https://arxiv.org/abs/1911.01547), [Website](https://arcs-benchmark.org)
| Abstract reasoning challenge with visual pattern completion tasks created by François Chollet.
|-
| '''MATH'''
| Mathematics
| 2021-03 – 2024-09
| 2021-03
| 2024-09
| Saturation
| o1
| Average CS PhD: ~40%
| o1: 94.8%
| [Paper](https://arxiv.org/abs/2103.03874), [GitHub](https://github.com/hendrycks/math)
| 12K challenging competition math problems from AMC/AIME, requiring complex multi-step reasoning.
|-
| '''BIG-Bench-Hard'''
| Multi-task
| 2022-10 – 2024-06
| 2022-10
| 2024-06
| Saturation
| Sonnet 3.5
| Average Human: 67.7%
| Sonnet 3.5: 93.1%
| [Paper](https://arxiv.org/abs/2210.09261), [GitHub](https://github.com/suzgunmirac/BIG-Bench-Hard), [Evidence](https://assets.anthropic.com/m/1cd9d098ac3e6467/original/Claude-3-Model-Card-October-Addendum.pdf)
| A curated suite of 23 challenging tasks from BIG-Bench.
|-
| '''HumanEval'''
| Coding
| 2021-07 – 2024-05
| 2021-07
| 2024-05
| Saturation
| GPT-4o
| Unspecified
| GPT-4o: 90.2%
| [Paper](https://arxiv.org/abs/2107.03374), [GitHub](https://github.com/openai/human-eval), [Evidence](https://openai.com/index/hello-gpt-4o/)
| 164 Python programming problems testing coding abilities, scored with the pass@k metric (see the sketch after this table).
|-
| '''IFEval'''
| Instruction Following
| 2023-11 – 2024-03
| 2023-11
| 2024-03
| Saturation
| Llama 3.3 70B
| Unspecified
| Llama 3.3 70B: 92.1%
| [Paper](https://arxiv.org/abs/2311.07911), [GitHub](https://github.com/google-research/google-research/tree/master/instruction_following_eval), [Evidence](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md)
| Evaluation suite testing multi-step instruction-following capabilities.
|}
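HumanEval figures such as the GPT-4o score above are typically reported as pass@1, estimated with the unbiased pass@k formula from the HumanEval paper (Chen et al., 2021, linked in the table). Below is a minimal sketch of that estimator, not the reference implementation from the openai/human-eval repository; the helper name and example numbers are illustrative only.

<syntaxhighlight lang="python">
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper (arXiv:2107.03374).

    n: number of samples generated for one problem
    c: number of those samples that pass the unit tests
    k: sample budget being scored
    """
    if n - c < k:
        # Fewer failing samples than k: every size-k subset contains a pass.
        return 1.0
    # 1 - P(all k samples drawn without replacement fail)
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: 200 samples for a problem, 30 passing, scored at k = 1.
print(round(pass_at_k(200, 30, 1), 3))  # 0.15
</syntaxhighlight>

A benchmark-level score is then the mean of this per-problem estimate over all 164 problems.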
== 2023 ==

{| class="wikitable"
|-
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Links
! Details
|-
| '''GSM8K'''
| Mathematics
| 2021-10 – 2023-11
| 2021-10
| 2023-11
| Saturation
| GPT-4
| Unspecified
| GPT-4: 92.0%
| [Paper](https://arxiv.org/abs/2110.14168), [GitHub](https://github.com/openai/grade-school-math), [Evidence](https://cdn.openai.com/papers/gpt-4.pdf)
| 8.5K grade school math word problems requiring step-by-step solutions.
|-
| '''Turing Test'''
| Conversation
| 1950-10 – 2023-03
| 1950-10
| 2023-03
| Saturation
| GPT-4
| Interrogator accuracy > 50%
| Interrogator accuracy: 46%
| [Paper](https://courses.cs.umbc.edu/471/papers/turing.pdf), [Evidence](https://arxiv.org/pdf/2405.08007)
| The original AI benchmark proposed by Alan Turing in 1950 (the "imitation game").
|-
| '''ARC (AI2)'''
| Reasoning
| 2018-03 – 2023-03
| 2018-03
| 2023-03
| Saturation
| GPT-4
| Unspecified
| GPT-4: 96.3%
| [Paper](https://arxiv.org/abs/1803.05457), [Website](https://leaderboard.allenai.org/arc/submissions/get-started), [Evidence](https://cdn.openai.com/papers/gpt-4.pdf)
| Grade-school multiple-choice science questions testing logical, spatial, and temporal reasoning.
|-
| '''HellaSwag'''
| Common Sense
| 2019-05 – 2023-03
| 2019-05
| 2023-03
| Saturation
| GPT-4
| Human: 95.6%
| GPT-4: 95.3%
| [Paper](https://arxiv.org/abs/1905.07830), [Website](https://rowanzellers.com/hellaswag/), [Evidence](https://cdn.openai.com/papers/gpt-4.pdf)
| Multiple-choice questions about everyday scenarios with adversarial filtering.
|-
| '''MMLU'''
| Knowledge
| 2020-09 – 2023-03
| 2020-09
| 2023-03
| Saturation
| GPT-4
| 95th pct Human: 87.0%
| GPT-4: 87.3%
| [Paper](https://arxiv.org/abs/2009.03300), [GitHub](https://github.com/hendrycks/test), [Evidence](https://cdn.openai.com/papers/gpt-4.pdf)
| 57 subjects from real-world sources (professional exams) testing breadth and depth of knowledge.
|-
| '''WinoGrande'''
| Common Sense
| 2019-07 – 2023-03
| 2019-07
| 2023-03
| Saturation
| GPT-4
| Human: 94%
| GPT-4: 87.5%
| [Paper](https://arxiv.org/abs/1907.10641), [Website](https://winogrande.allenai.org/), [Evidence](https://cdn.openai.com/papers/gpt-4.pdf)
| Enhanced WSC with 44K problems testing common-sense pronoun resolution.
|}
== Pre-2023 ==

=== 2022 ===

{| class="wikitable"
|-
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Links
! Details
|-
| '''BIG-Bench'''
| Multi-task
| 2021-06 – 2022-04
| 2021-06
| 2022-04
| Saturation
| PaLM 540B
| Human: 49.8%
| PaLM 540B: 61.4%
| [Paper](https://arxiv.org/abs/2206.04615), [GitHub](https://github.com/google/BIG-bench), [Evidence](https://arxiv.org/pdf/2204.02311)
| 204 tasks spanning linguistics, math, common-sense reasoning, and more.
|}
=== 2019 ===

{| class="wikitable"
|-
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Links
! Details
|-
| '''SuperGLUE'''
| Language
| 2019-05 – 2019
| 2019-05
| 2019
| Saturation
| T5
| Human: 89.8%
| T5: 89.3%
| [Paper](https://arxiv.org/abs/1905.00537), [Website](https://super.gluebenchmark.com/)
| More challenging language understanding tasks (word sense, causal reasoning, RC).
|-
| '''WSC'''
| Common Sense
| 2012-05 – 2019-07
| 2012-05
| 2019-07
| Saturation
| RoBERTa (w/ SFT)
| Human: 96.5%
| RoBERTa (w/ SFT): 90.1%
| [Paper](https://cdn.aaai.org/ocs/4492/4492-21843-1-PB.pdf), [Website](https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html)
| Carefully crafted sentence pairs with ambiguous pronoun references.
|-
| '''GLUE'''
| Language
| 2018-05 – 2019-06
| 2018-05
| 2019-06
| Saturation
| XLNet
| Human: 87.1%
| XLNet: 88.4%
| [Paper](https://arxiv.org/abs/1804.07461), [Website](https://gluebenchmark.com/)
| Nine tasks for evaluating NLU (inference, paraphrase, similarity, etc.).
|-
| '''TriviaQA'''
| Knowledge
| 2017-05 – 2019-06
| 2017-05
| 2019-06
| Saturation
| SpanBERT
| Human: 79.7%
| SpanBERT: 83.6%
| [Paper](https://arxiv.org/abs/1705.03551), [Website](http://nlp.cs.washington.edu/triviaqa/)
| 650K QA-evidence triples requiring cross-sentence reasoning.
|-
| '''SQuAD v2.0'''
| Language
| 2018-05 – 2019-04
| 2018-05
| 2019-04
| Saturation
| BERT
| Human: 89.5%
| BERT: 89.5%
| [Paper](https://arxiv.org/abs/1806.03822), [Website](https://rajpurkar.github.io/SQuAD-explorer/)
| Extension of SQuAD adding unanswerable questions.
|-
| '''SQuAD'''
| Language
| 2016-05 – 2019-03
| 2016-05
| 2019-03
| Saturation
| BERT
| Human: 91.2%
| BERT: 93.2%
| [Paper](https://arxiv.org/abs/1606.05250), [Website](https://rajpurkar.github.io/SQuAD-explorer/)
| 100,000+ QA tasks on Wikipedia articles.
|}
=== 2018 ===

{| class="wikitable"
|-
! Benchmark
! Category
! Time Span
! Date Created
! Date Defeated
! Killed By
! Defeated By
! Original Score
! Final Score
! Links
! Details
|-
| '''SWAG'''
| Common Sense
| 2018-05 – 2018-10
| 2018-05
| 2018-10
| Saturation
| BERT
| Human: 88%
| BERT: 86%
| [Paper](https://arxiv.org/abs/1808.05326), [Website](https://rowanzellers.com/swag/)
| 113K multiple-choice questions about grounded situations (common sense “next step”).
|}