# Benchmark (AI)

> Source: https://aiwiki.ai/wiki/benchmark
> Updated: 2026-07-14
> Categories: AI Benchmarks, Model Evaluation
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

In artificial intelligence and machine learning, a **benchmark** is a standardized combination of a dataset, a task definition, and a scoring protocol that lets different models be compared on a common footing, so that a score reported by one lab can be reproduced by others and contrasted with prior results. Benchmarks have repeatedly defined eras of AI progress: the 2012 ImageNet result, in which a deep convolutional network cut the top-5 image-classification error rate to 15.3% versus 26.2% for the runner-up, is widely credited with launching the modern deep learning era.[^37] Modern AI benchmarks typically include training and held-out test splits, a precisely specified input/output format, and one or more quantitative metrics such as accuracy, exact match, F1, BLEU, or pass@k. The history of the field is closely tied to its benchmarks: image classification was reshaped by the ImageNet Large Scale Visual Recognition Challenge,[^1] natural language understanding was driven by GLUE and SuperGLUE,[^2][^3] and the evaluation of large language models now spans dozens of benchmarks covering knowledge, mathematics, code, reasoning, multimodal understanding, long context, and tool use. Benchmarks are also among the most contested artifacts in AI, criticized for data contamination, saturation, weak construct validity, and Goodhart-style optimization pressure.[^4]

## What is an AI benchmark?

A benchmark in machine learning has four components. The first is a **dataset**: a collection of inputs (and usually reference outputs) drawn from some target distribution. The second is a **task definition**: a precise specification of what the model must produce given each input, including the allowed prompt format, decoding constraints, and any few-shot exemplars. The third is a **scoring metric**: a function that maps model outputs and reference outputs to a numerical score. The fourth is an **evaluation protocol**: the rules governing which split is used, whether the test labels are public, how many samples may be drawn, and whether external tools or retrieval are allowed.

This combination matters because the same dataset can support multiple benchmarks. Wikipedia text, for example, underlies the Stanford Question Answering Dataset (SQuAD),[^5] the unanswerable-question extension SQuAD 2.0,[^6] and many open-domain question-answering setups, each with its own scoring conventions. Likewise, the same model can score very differently depending on how prompts and decoding are configured, which is one reason benchmark organizers increasingly publish reference harnesses such as Stanford's [helm](/wiki/helm) codebase[^7] and EleutherAI's lm-evaluation-harness.
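
The four components can be pictured as a small data structure. The sketch below is purely illustrative: the `Example`, `Benchmark`, `evaluate`, and `exact_match` names are hypothetical and do not come from any real harness, and the "model" is a stand-in function. It shows how swapping the prompt format or the metric yields a different benchmark over the same dataset.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    prompt: str      # model input
    reference: str   # gold output


@dataclass(frozen=True)
class Benchmark:
    dataset: Sequence[Example]            # 1. dataset
    format_prompt: Callable[[str], str]   # 2. task definition (prompt format)
    metric: Callable[[str, str], float]   # 3. scoring metric
    split: str = "test"                   # 4. protocol (recorded only, not enforced here)


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the stripped strings are identical, else 0.0."""
    return float(prediction.strip() == reference.strip())


def evaluate(benchmark: Benchmark, model: Callable[[str], str]) -> float:
    """Mean metric over the dataset, with one model call per item."""
    scores = [
        benchmark.metric(model(benchmark.format_prompt(ex.prompt)), ex.reference)
        for ex in benchmark.dataset
    ]
    return sum(scores) / len(scores)


toy = Benchmark(
    dataset=[Example("2 + 2 =", "4"), Example("3 + 5 =", "8")],
    format_prompt=lambda q: f"Answer with a number only.\n{q}",
    metric=exact_match,
)


def always_four(prompt: str) -> str:
    """Stand-in 'model' that ignores its input."""
    return "4"


print(evaluate(toy, always_four))  # 0.5
```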

## Purpose

Benchmarks exist to make claims about model capability **comparable**, **reproducible**, **trackable over time**, and **mappable to capabilities of interest**. Comparability means that numbers reported by different labs on the same benchmark can be set side by side and contrasted with prior results. Reproducibility means that another lab can regenerate a reported number, which requires that the dataset, prompts, and scoring code be published. Progress tracking lets the community plot performance versus model size, compute, or release date and observe trends such as the scaling laws documented by Kaplan et al. and the [chinchilla scaling](/wiki/chinchilla_scaling) revisions.[^8] Capability mapping means that a suite of benchmarks attempts to cover a structured set of skills (knowledge recall, multi-step math, code synthesis, multimodal perception, tool use) so that its results, taken together, reflect competence across many dimensions, the explicit design goal of [big bench](/wiki/big_bench) and [helm](/wiki/helm).[^9][^7]

## History

### Vision benchmarks (1998 to 2014)

The first widely adopted ML benchmark was [mnist](/wiki/mnist), introduced in the LeCun et al. 1998 paper *Gradient-based learning applied to document recognition*. MNIST contains 60,000 training and 10,000 test images of handwritten digits, normalized to 28-by-28 grayscale pixels, derived from earlier NIST datasets.[^10] It became the canonical sanity check for neural network research for two decades.

[fei fei li](/wiki/fei_fei_li) and collaborators released **Caltech-101** in 2004, with 9,146 images across 101 object categories.[^11] This was followed by the PASCAL Visual Object Classes (VOC) challenge, which ran annually from 2005 to 2012 and standardized object detection and segmentation evaluation; see [pascal voc](/wiki/pascal_voc) for details.

The pivotal moment for computer vision was [imagenet](/wiki/imagenet), assembled at Princeton starting in 2007 by Fei-Fei Li and Jia Deng and described in Russakovsky et al. 2015, *ImageNet Large Scale Visual Recognition Challenge* (arXiv:1409.0575).[^1] The full ImageNet database is organized according to the WordNet hierarchy and contains more than 14 million labeled images across roughly 22,000 categories; the associated ImageNet Large Scale Visual Recognition Challenge (ILSVRC) used a subset of 1,000 categories with about 1.2 million training images.[^38] ILSVRC ran from 2010 to 2017 and provided the substrate on which AlexNet (2012) and subsequent deep convolutional networks demonstrated that supervised deep learning at scale could outperform classical computer-vision pipelines. AlexNet, from Krizhevsky, Sutskever, and Hinton, won ILSVRC 2012 with a top-5 error rate of 15.3%, compared with 26.2% for the second-place entry, a margin that catalyzed the deep learning revolution in computer vision.[^37] Microsoft COCO, introduced in Lin et al. 2014, extended the field to detection, segmentation, and captioning of common objects in context.[^12]

### NLP benchmarks (2016 to 2019)

Reading comprehension was reshaped by [squad](/wiki/squad) (Rajpurkar et al. 2016), which provided more than 100,000 crowd-written questions answerable by spans from Wikipedia passages, with an F1 metric against reference span answers.[^5] Two years later, Rajpurkar, Jia, and Liang released **SQuAD 2.0**, adding more than 50,000 adversarial unanswerable questions and requiring models to abstain when no span is supported; strong neural systems that reached 86% F1 on SQuAD 1.1 dropped to 66% F1 on SQuAD 2.0.[^6]

[glue benchmark](/wiki/glue_benchmark) (Wang et al. 2018) bundled nine sentence-level English understanding tasks behind a common API, intending to discourage task-specific tricks.[^2] Within a year, top systems exceeded the published human baseline on most GLUE tasks, so the same group released **SuperGLUE** in 2019 with harder tasks and clearer headroom.[^3] By 2020, [superglue](/wiki/superglue) itself was at or above the human baseline for leading systems, illustrating an early pattern: useful benchmarks saturate quickly once the community focuses on them.

[hellaswag](/wiki/hellaswag) (Zellers et al. 2019) introduced an adversarially filtered commonsense sentence-completion benchmark where humans score above 95% and the best models at release scored under 48%, demonstrating that adversarial filtering can produce a benchmark with large initial headroom even when its constituent questions are individually easy for humans.[^13]

### Broad LLM evaluations (2020 to 2022)

[mmlu](/wiki/mmlu) (Hendrycks et al. 2020, arXiv:2009.03300), formally *Measuring Massive Multitask Language Understanding*, contains questions drawn from 57 subjects ranging from elementary mathematics to professional law and ethics, all formatted as four-way multiple choice.[^14] It became the de facto comparison metric in the GPT-3 / PaLM / Llama era.

[big bench](/wiki/big_bench) (Srivastava et al. 2022, arXiv:2206.04615), short for *Beyond the Imitation Game*, is a community-built benchmark with 204 tasks contributed by 450 authors across 132 institutions.[^9] A curated *BIG-bench Hard* subset (BBH) selected 23 tasks where state-of-the-art models at the time fell well short of human performance.

[helm](/wiki/helm) (Liang et al. 2022, arXiv:2211.09110), the Holistic Evaluation of Language Models from Stanford's Center for Research on Foundation Models, evaluated 30 models across 42 scenarios under standardized prompts and reported seven metrics per cell: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. The authors reported that, before HELM, these models had on average been evaluated on only 17.9% of HELM's core scenarios, and that running every model under the same conditions raised this coverage to 96.0%.[^7]

## Modern LLM benchmarks

By 2023 the field had moved from single-number leaderboards toward portfolios of specialized evaluations. The categories below summarize widely cited benchmarks; each is covered in its own article on this wiki.

### Knowledge

- [mmlu](/wiki/mmlu): 57-subject multiple choice, 15,908 questions in total across the few-shot development, validation, and test splits.[^14]
- **MMLU-Pro** (Wang et al. 2024, NeurIPS 2024): extends MMLU by replacing four-option questions with ten-option questions, removing trivial items, and emphasizing reasoning. Accuracy drops by 16% to 33% relative to MMLU on the same models, and the sensitivity of scores to prompt variation falls from 4-5% on MMLU to 2%.[^15]
- [gpqa](/wiki/gpqa) (Rein et al. 2023, arXiv:2311.12022): 448 multiple-choice physics, chemistry, and biology questions written by domain PhDs. Domain-PhD validators reach 65% accuracy; skilled non-experts with unrestricted web access reach 34%. At release, GPT-4 reached 39%. The GPQA Diamond subset is the most commonly reported number.[^16]
- **Humanity's Last Exam** (Phan et al. 2025, arXiv:2501.14249): a deliberately frontier-level academic benchmark of 2,500 questions across more than 100 subjects, assembled by close to 1,000 expert contributors and released in January 2025 by the Center for AI Safety and Scale AI to counter MMLU saturation. Its authors note that "LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities," and describe HLE as "the final closed-ended academic benchmark of its kind with broad subject coverage."[^39] At launch, leading reasoning models scored below 10% (DeepSeek-R1 was highest at roughly 9%); by late 2025, frontier models had climbed toward the high 30s on the same test, illustrating how quickly even a benchmark designed to be hard can be approached.[^39]

### Math

- [gsm8k](/wiki/gsm8k) (Cobbe et al. 2021, arXiv:2110.14168): 8,500 high-quality grade-school math word problems, each with a step-by-step reasoning trace. The paper also introduced verifier-based selection.[^17]
- [math benchmark](/wiki/math_benchmark) (Hendrycks et al. 2021, arXiv:2103.03874): 12,500 competition-style problems with step-by-step solutions, originally far from saturation.[^18]
- [aime](/wiki/aime) and AIME 2024 / AIME 2025: short integer-answer problems from the American Invitational Mathematics Examination, used heavily for reasoning-model evaluation.
- [frontiermath](/wiki/frontiermath) (Epoch AI, November 2024): 350 original research-level problems written and vetted by professional mathematicians, including Fields medalists. At launch, leading models including o1-preview, Claude 3.5 Sonnet, GPT-4o, Grok 2 Beta, and Gemini 1.5 Pro 002 solved less than 2% of problems.[^19]

### Code

- [humaneval](/wiki/humaneval) (Chen et al. 2021, arXiv:2107.03374): 164 hand-written Python programming problems graded by execution against unit tests, scored as pass@k. In the original paper, Codex solved 28.8% of the problems with a single sample (pass@1) and 70.2% when 100 samples were drawn per problem.[^20]
- [mbpp](/wiki/mbpp): the Mostly Basic Python Programming benchmark of 974 entry-level problems, often paired with HumanEval.
- [swe bench](/wiki/swe_bench) (Jimenez et al. 2023, arXiv:2310.06770): 2,294 real GitHub issues from 12 Python repositories. At release, Claude 2 solved only 1.96% of issues, demonstrating a large gap between toy code generation and real software engineering.[^21] In August 2024, OpenAI's Preparedness team released **SWE-bench Verified**, a 500-issue subset hand-reviewed by 93 professional developers to ensure that each task is solvable, has unambiguous descriptions, and has fair unit tests.[^22] Progress on SWE-bench Verified has been steep: where the original SWE-bench saw single-digit resolution rates in 2023, frontier coding models reported scores above 70% on the Verified subset during 2025, a roughly fortyfold improvement in under two years.[^40]
- [livecodebench](/wiki/livecodebench) (Jain et al. 2024, arXiv:2403.07974): a continually updated set of programming-contest problems from LeetCode, AtCoder, and Codeforces, designed to mitigate training-set contamination by evaluating each model only on problems released after its training cutoff.[^23]

### Reasoning

- [arc agi](/wiki/arc_agi) (Chollet 2019, arXiv:1911.01547): the Abstraction and Reasoning Corpus introduced in *On the Measure of Intelligence*, designed to test few-shot generalization on visual reasoning puzzles unlike anything in pretraining data.[^24] Chollet's framing motivates much of modern benchmark design: he argues that "skill is heavily modulated by prior knowledge and experience," so that "unlimited priors or unlimited training data allow experimenters to buy arbitrary levels of skills," and proposes instead to measure intelligence as skill-acquisition efficiency.[^41]
- **ARC-AGI-2** (Chollet et al. 2025, arXiv:2505.11831): a revised version with greater task complexity and tasks that use symbols whose meaning is defined within the task. The 2025 ARC Prize Kaggle competition attracted 1,455 teams and 15,154 entries, with the top score on the private evaluation set reaching 24%.[^25]
- **ARC-AGI-3**: announced for release in early 2026 alongside ARC Prize 2026, with an interactive-reasoning design that requires exploration, planning, memory, and goal acquisition.[^25]

### Multimodal

- **MMMU** (Yue et al. 2023, arXiv:2311.16502): the Massive Multi-discipline Multimodal Understanding benchmark, with 11,500 college-exam-style multimodal questions covering 30 subjects and 183 subfields across six disciplines. Image types include charts, diagrams, maps, tables, music sheets, and chemical structures.[^26]
- **MathVista**: multimodal mathematical reasoning over charts, figures, and geometric diagrams.
- **ChartQA**: question answering on chart images.

### Long context

- [needle in a haystack](/wiki/needle_in_a_haystack) (NIAH): a synthetic protocol in which a single sentence (the "needle") is placed at varying depths within a long distractor passage and the model is asked to retrieve it. NIAH became the default sanity check for context-length claims in 2023.
- [ruler benchmark](/wiki/ruler_benchmark) (Hsieh et al. 2024, arXiv:2404.06654): a long-context benchmark from NVIDIA covering 13 tasks across retrieval, multi-hop tracing, aggregation, and question answering. Despite near-perfect performance on vanilla NIAH, most evaluated models degrade sharply on RULER as length increases; of 17 models claiming context windows of 32K tokens or more, only about half effectively handle 32K sequences.[^27]
- **BABILong**: an extension of the bAbI tasks that interleaves question structure with long distractor text to measure reasoning over long contexts.
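
The needle-in-a-haystack protocol described above can be sketched in a few lines. This is a minimal illustration, not the code of any published NIAH harness: the function names, the sentence-level insertion, and the substring check are all assumptions made for the example.

```python
from typing import List


def build_haystack(filler_sentences: List[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def is_retrieved(model_answer: str, expected: str) -> bool:
    """Lenient check: the expected fact appears somewhere in the model's answer."""
    return expected.lower() in model_answer.lower()


filler = [f"Filler sentence number {i}." for i in range(4)]
needle = "The secret code is 7421."

# Sweep the needle across several depths; a real run would also sweep context length
# and send each haystack plus a retrieval question to the model under test.
for depth in (0.0, 0.5, 1.0):
    print(depth, build_haystack(filler, needle, depth))
```

A full evaluation would report retrieval success as a grid over depth and context length, which is where degradation at long lengths or particular depths becomes visible.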

### Agentic

- **GAIA** (Mialon et al. 2023, arXiv:2311.12983): 466 real-world questions for general AI assistants, requiring reasoning, multimodal handling, web browsing, and tool use. At release, humans scored 92% while GPT-4 with plugins reached 15%.[^28]
- [osworld](/wiki/osworld) (Xie et al. 2024, arXiv:2404.07972): 369 real-world computer tasks on Ubuntu, Windows, and macOS desktops, evaluated by execution-based checks.[^29]
- [webarena](/wiki/webarena) (Zhou et al. 2023, arXiv:2307.13854): a self-hosted suite of fully functional web applications (shopping, forums, code hosting, maps) with task-completion checks.[^30]
- [agentbench](/wiki/agentbench) (Liu et al. 2023, arXiv:2308.03688): a multi-dimensional benchmark across 8 environments, evaluating LLMs in reasoning and decision making.[^31]
- **SWE-bench Verified** (see Code).

### Tool use

- **τ-bench** (Yao et al. 2024, arXiv:2406.12045): a tool-agent-user interaction benchmark from Sierra, with realistic databases and APIs, domain policy documents, and simulated users. It introduces the pass^k metric, which measures reliability as the chance that all k independent trials of a task succeed, a stricter standard than pass@k, which requires only one success among k.[^32]
- **ToolBench**: a large-scale tool-use evaluation built on top of RapidAPI tools.
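
The difference between pass@k and pass^k is easiest to see numerically. The sketch below assumes that a task is run for n trials of which c succeed, and that pass^k is estimated as C(c, k) / C(n, k), the chance that k trials drawn without replacement all succeed. The function names are illustrative and are not taken from the τ-bench codebase.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k trials, drawn from n with c successes, succeeds."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k trials, drawn from n with c successes, succeed."""
    return comb(c, k) / comb(n, k)


# A hypothetical agent that succeeds on 6 of 8 trials of a single task.
n, c = 8, 6
for k in (1, 2, 4):
    print(k, round(pass_at_k(n, c, k), 3), round(pass_hat_k(n, c, k), 3))
# 1 0.75 0.75
# 2 0.964 0.536
# 4 1.0 0.214
```

As k grows, pass@k rises toward 1 while pass^k falls, which is why the latter is the more demanding measure for agents that must behave consistently.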

### Vibes and human preference

- **Chatbot Arena / LMArena** (Chiang et al. 2024, arXiv:2403.04132): a crowdsourced platform from LMSYS (now LMArena.ai) where users compare anonymized model outputs side by side, with rankings derived using a Bradley-Terry style Elo system. The platform accumulated over 240,000 votes in the period reported in the paper and has become a widely cited reference for "vibes" assessment.[^33] See [lmsys chatbot arena](/wiki/lmsys_chatbot_arena) and [lmarena org](/wiki/lmarena_org) for the platform itself.
- [mt bench](/wiki/mt_bench) (Zheng et al. 2023, arXiv:2306.05685): an 80-question multi-turn set scored by GPT-4 as judge, introduced alongside the Chatbot Arena pipeline.[^34]
- [alpacaeval](/wiki/alpacaeval): an automatic preference-based evaluator. Dubois et al. 2024 (arXiv:2404.04475) introduced **Length-Controlled AlpacaEval (AlpacaEval 2 LC)**, which fits a generalized linear model over output length to debias the auto-annotator. Spearman correlation with [lmsys chatbot arena](/wiki/lmsys_chatbot_arena) rose from 0.94 to 0.98 after length control.[^35]
- [arena hard](/wiki/arena_hard): a curated harder slice derived from Chatbot Arena prompts, judged by a strong LLM.

## Scoring metrics

Different tasks require different scoring functions, and the choice of metric is part of what defines a benchmark.

**Accuracy and exact match** are used for multiple-choice tasks (MMLU, GPQA, HellaSwag) and for short-answer tasks where any deviation from the gold answer counts as wrong (GSM8K final-answer extraction).

**[f1 score](/wiki/f1_score)** is standard for span-extraction reading comprehension such as SQuAD, where the model output is a span of text whose tokens are compared against a reference span at the token level.
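
A minimal sketch of token-level F1 follows. It assumes the simplest possible normalization (lowercasing and whitespace tokenization); the official SQuAD evaluation script additionally strips punctuation and articles and takes the maximum over multiple reference answers, which this example omits.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two answer strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Precision is 2/3 and recall is 2/2, so F1 is 0.8.
print(round(token_f1("the Eiffel Tower", "Eiffel Tower"), 3))  # 0.8
```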

**[bleu](/wiki/bleu)**, **[rouge score](/wiki/rouge_score)**, and **METEOR** are n-gram overlap metrics for machine translation and summarization. BLEU was introduced by Papineni et al. in 2002 for machine translation, and ROUGE by Lin in 2004 for summarization. These metrics are still used as automatic proxies despite known weaknesses on paraphrastic outputs.

**pass@k** is the standard metric for code generation, introduced with HumanEval: a problem is considered solved if any of k independent samples passes all unit tests. An unbiased estimate of pass@k can be computed from a larger sample of n completions per problem using the closed-form formula given in the original Chen et al. paper.[^20]
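
The closed-form estimator is 1 - C(n - c, k) / C(n, k), where n completions are sampled for a problem and c of them pass the unit tests. The sketch below is a direct transcription using Python's exact integer binomials; the function name and the per-problem counts are illustrative, and the benchmark score is the mean of this quantity over problems.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Mean pass@k over problems, given (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for three problems, 10 samples each.
results = [(10, 3), (10, 0), (10, 10)]
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
print(round(pass_at_k(10, 3, 5), 4))  # 0.9167
print(round(benchmark_pass_at_k(results, 1), 4))  # 0.4333
```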

**Elo and Bradley-Terry ratings** are used for pairwise-comparison arenas such as Chatbot Arena. Each comparison updates the ratings using the same logic as competitive chess ratings, with confidence intervals reported via bootstrap.[^33]
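
The chess-style update can be sketched as follows. This shows only the classic online Elo rule, with the conventional 400-point scale and an assumed K-factor of 32; it is not the fitting procedure of any particular arena, and Bradley-Terry ratings are typically fit to all comparisons at once rather than updated one vote at a time.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Predicted probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated (A, B) ratings; score_a is 1.0 for an A win, 0.5 tie, 0.0 loss."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Two equally rated models; a user prefers model A in one comparison.
print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```

An upset moves ratings more than an expected result, because the update is proportional to the gap between the observed and predicted outcome.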

**LLM-as-judge** metrics use a strong model to score outputs against either a reference or a competing output. MT-Bench uses GPT-4 to score 80 multi-turn questions; AlpacaEval LC uses GPT-4 as a pairwise judge with length-control regression; Arena-Hard uses Claude or GPT-4 to judge a curated set of harder prompts.

Auxiliary metrics increasingly accompany these primary scores. HELM's seven-metric design reports calibration, robustness, fairness, bias, toxicity, and efficiency alongside accuracy, on the argument that single-number leaderboards obscure important trade-offs.[^7]

## Issues and critiques

### When is a benchmark contaminated?

Public benchmarks are scraped into pretraining corpora, raising the possibility that a model has seen test items during training. Sainz et al. 2023 (*NLP Evaluation in trouble*, EMNLP Findings, arXiv:2310.18018) define multiple levels of contamination and argue for per-benchmark contamination measurement as a community standard. They note that the most severe case is when a model is trained on the test split of the very benchmark on which it is then evaluated.[^4] Public discussions around GPT-4 highlighted possible exposure to SAT, AP, bar-exam, and Codeforces problems used in the system-card evaluations, contributing to skepticism about headline numbers on standardized exams.

### Goodhart's law and overfitting

[goodharts law](/wiki/goodharts_law) states that "when a measure becomes a target, it ceases to be a good measure." In ML, this manifests as labs targeting specific benchmarks during pretraining mixture selection, instruction tuning, or RLHF reward shaping. Critics argue that strong scores on saturated benchmarks no longer track the underlying capability the benchmark was intended to measure.

### Why do benchmarks saturate?

[glue benchmark](/wiki/glue_benchmark) reached human parity within a year of release; [superglue](/wiki/superglue) within two; [mmlu](/wiki/mmlu) is now near saturation for frontier models, motivating MMLU-Pro.[^15] Saturation happens because once a benchmark becomes a target, the community pours optimization pressure into it (better prompting, fine-tuning on similar data, and sometimes contamination), so the headline gap to human performance closes faster than the underlying capability changes. This pushes the field toward harder evaluations (FrontierMath, GPQA, Humanity's Last Exam, ARC-AGI-2) and toward dynamic or held-out designs that resist memorization.

### Construct validity

Construct validity asks whether a benchmark actually measures the latent capability it claims to measure. A long-running line of work questions whether multiple-choice exams measure reasoning or only test-taking heuristics, and whether BLEU measures translation quality or only n-gram overlap. Concerns about cultural bias in MMLU and similar benchmarks (predominantly US-centric content) also fall under construct validity.

### Cultural and linguistic bias

Many widely cited benchmarks are English-only or US-centric. Efforts such as MGSM (multilingual GSM8K) and MMLU-ProX address some of this gap, but coverage of other languages remains thin by comparison.

## Solutions

Several design strategies attempt to mitigate the above problems.

**Held-out and dynamic benchmarks.** [livecodebench](/wiki/livecodebench) only counts programming-contest problems released after a given model's training cutoff, making contamination structurally impossible for new entries.[^23] [frontiermath](/wiki/frontiermath) keeps a private test set, with answers withheld from the public.[^19] [arc agi](/wiki/arc_agi) keeps a private evaluation set against which the ARC Prize Kaggle competition is graded.[^25]
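
The release-date filter described for LiveCodeBench amounts to a simple comparison per problem. The sketch below is a hedged illustration: the `Problem` record, its field names, the problem IDs, and the cutoff date are all hypothetical and do not reflect the benchmark's actual data schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Problem:
    problem_id: str
    released: date  # date the contest problem was first published


def eligible_problems(problems: List[Problem], training_cutoff: date) -> List[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


pool = [
    Problem("contest-a/1", date(2023, 6, 1)),
    Problem("contest-b/7", date(2024, 2, 15)),
    Problem("contest-c/3", date(2024, 9, 30)),
]
cutoff = date(2024, 1, 1)  # hypothetical training cutoff for some model

print([p.problem_id for p in eligible_problems(pool, cutoff)])
# ['contest-b/7', 'contest-c/3']
```

Because each model has its own cutoff, two models may be scored on different subsets, so comparisons are usually made over a window of problems released after the later of the two cutoffs.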

**Verified and audited benchmarks.** OpenAI's **SWE-bench Verified** removed underspecified and unsolvable tasks from the original SWE-bench via human review by 93 professional developers, with the goal of producing more accurate estimates of autonomous software engineering capability.[^22]

**Private leaderboards.** Some benchmarks (FrontierMath Tier 4, ARC-AGI evaluation set, parts of GAIA) keep ground-truth answers private and require model providers to submit predictions to an organizer.

**Adversarial construction.** [hellaswag](/wiki/hellaswag) used adversarial filtering to remove items that were easy for current models, yielding large initial human-model gaps.[^13] Adversarial NLI (ANLI) iterated this process across rounds, with annotators specifically trying to construct examples that fooled the current best model.

**Verifiable and execution-based scoring.** Benchmarks that grade by executing code against unit tests (HumanEval, MBPP, SWE-bench, LiveCodeBench, OSWorld) are harder to game than benchmarks scored by text-overlap metrics, because the grader is a deterministic program rather than a reference string.

## Leaderboards

Several public leaderboards aggregate benchmark numbers.

- **Papers With Code** maintains crowdsourced state-of-the-art tables for thousands of benchmarks across ML, linked to the papers that achieved each score.
- **Hugging Face Open LLM Leaderboard** evaluates open-weights models on a fixed suite under standardized prompting, using EleutherAI's [lm-evaluation-harness](/wiki/lm_evaluation_harness) as its backend. The v2 launch in June 2024 retired saturated benchmarks (the original HellaSwag, ARC, and original MMLU configurations) and replaced them with MMLU-Pro, GPQA, MuSR, MATH, IFEval, and BBH to restore headroom.[^36]
- **LMArena.ai** (formerly Chatbot Arena) is the canonical preference-based ranking, with Elo ratings updated continuously from anonymous head-to-head user votes.[^33] See [lmarena org](/wiki/lmarena_org).
- **HELM** publishes an interactive interface to its leaderboard with per-scenario, per-metric breakdowns rather than a single composite.[^7]
- **Scale AI SEAL** is a set of private, expert-curated leaderboards run by [scale ai](/wiki/scale_ai) across domains such as math, coding, and adversarial robustness.

## What is the hardest AI benchmark?

There is no single hardest benchmark, because difficulty is relative to the capability being measured and to the current frontier. As of 2026, the evaluations with the largest remaining human-model gaps are research-level math and broad expert knowledge suites: [frontiermath](/wiki/frontiermath), where leading 2024 models solved under 2% of problems,[^19] and Humanity's Last Exam, on which the best reasoning models scored below 10% at its January 2025 launch.[^39] For agents, real-world computer-use and long-horizon software tasks such as [osworld](/wiki/osworld) and the original SWE-bench remained far from solved when released.[^29][^21] Because frontier models climb these benchmarks within months, "the hardest benchmark" is a moving target, and organizers deliberately keep private test sets and refresh problem streams to preserve headroom.

## Shift to verifiable, agentic, and harder evaluations

Between 2023 and 2026, the most cited LLM benchmarks shifted in three directions. First, **harder static benchmarks** such as GPQA, MMLU-Pro, MATH-500, Humanity's Last Exam, and FrontierMath replaced saturated predecessors. Second, **verifiable and execution-based evaluations** such as SWE-bench Verified, LiveCodeBench, OSWorld, and WebArena replaced text-overlap or LLM-as-judge metrics where possible, because they cannot be gamed by stylistic mimicry. Third, **agentic and long-horizon benchmarks** such as GAIA, OSWorld, WebArena, τ-bench, and AgentBench moved evaluation away from single-turn prompts toward multi-step tasks involving tools, memory, and recovery from failure.

The reasoning-model era starting with OpenAI o1 in 2024 accelerated this shift. Within a year, frontier models saturated MATH, surpassed human PhDs on GPQA Diamond, and posted nontrivial scores on FrontierMath, AIME 2024, and AIME 2025. Benchmark organizers responded with private test sets (FrontierMath), held-out problem streams (LiveCodeBench), and entirely new task formats (ARC-AGI-2 and the planned ARC-AGI-3).

## Connection to scaling laws

The empirical [scaling laws](/wiki/scaling_laws) literature relies on benchmarks as the dependent variable: pretraining loss, perplexity, and downstream-benchmark accuracy are plotted against model size, dataset size, and training compute to identify power-law trends. The Kaplan et al. 2020 and Hoffmann et al. 2022 ([chinchilla scaling](/wiki/chinchilla_scaling)) papers used cross-entropy loss and downstream task accuracy as proxies for capability.[^8] BIG-bench documented the *breakthrough* phenomenon, in which some tasks remain at random-chance accuracy until a critical scale, after which they jump sharply.[^9] Benchmark design choices therefore feed directly back into how the community describes and predicts model progress.
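
Identifying a power-law trend usually means fitting a straight line in log-log space. The sketch below fits y = a * x^b by ordinary least squares on logarithms using only the standard library. The data points are synthetic and generated inside the example; they are not taken from Kaplan et al., Hoffmann et al., or any other paper, and published scaling-law fits use richer functional forms (for example, with an irreducible-loss term).

```python
import math
from typing import Sequence, Tuple


def fit_power_law(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Fit y = a * x**b by least squares on (log x, log y); returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    covariance = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    variance = sum((u - mean_x) ** 2 for u in log_x)
    b = covariance / variance            # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept in log-log space = log(a)
    return a, b


# Synthetic "loss versus parameter count" curve with a known exponent of -0.1.
model_sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * size ** -0.1 for size in model_sizes]

a, b = fit_power_law(model_sizes, losses)
print(round(a, 3), round(b, 3))  # recovers values close to a = 5.0, b = -0.1
```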

## See also

- [FRAMES (benchmark)](/wiki/frames_benchmark)
- [RE-Bench](/wiki/re_bench)
- [HELMET](/wiki/helmet)
- [SuperGPQA](/wiki/supergpqa)
- [HalluLens](/wiki/hallulens)
- [mmlu](/wiki/mmlu)
- [gpqa](/wiki/gpqa)
- [humaneval](/wiki/humaneval)
- [mbpp](/wiki/mbpp)
- [swe bench](/wiki/swe_bench)
- [arc agi](/wiki/arc_agi)
- [lmsys chatbot arena](/wiki/lmsys_chatbot_arena)
- [mt bench](/wiki/mt_bench)
- [alpacaeval](/wiki/alpacaeval)
- [helm](/wiki/helm)
- [imagenet](/wiki/imagenet)
- [mnist](/wiki/mnist)
- [glue benchmark](/wiki/glue_benchmark)
- [superglue](/wiki/superglue)
- [squad](/wiki/squad)
- [hellaswag](/wiki/hellaswag)
- [big bench](/wiki/big_bench)
- [frontiermath](/wiki/frontiermath)
- [livecodebench](/wiki/livecodebench)
- [gaia benchmark](/wiki/gaia_benchmark)
- [osworld](/wiki/osworld)
- [webarena](/wiki/webarena)
- [agentbench](/wiki/agentbench)
- [gsm8k](/wiki/gsm8k)
- [math benchmark](/wiki/math_benchmark)
- [needle in a haystack](/wiki/needle_in_a_haystack)
- [ruler benchmark](/wiki/ruler_benchmark)
- [mmmu](/wiki/mmmu)
- [mathvista](/wiki/mathvista)
- [bleu](/wiki/bleu)
- [rouge score](/wiki/rouge_score)
- [f1 score](/wiki/f1_score)
- [goodharts law](/wiki/goodharts_law)
- [scaling laws](/wiki/scaling_laws)
- [agent evaluation](/wiki/agent_evaluation)

## References

[^1]: Olga Russakovsky et al., "ImageNet Large Scale Visual Recognition Challenge", arXiv:1409.0575, 2015-01-30. https://arxiv.org/abs/1409.0575. Accessed 2026-05-26.
[^2]: Alex Wang et al., "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding", arXiv:1804.07461, 2018-04-20. https://arxiv.org/abs/1804.07461. Accessed 2026-05-26.
[^3]: Alex Wang et al., "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems", arXiv:1905.00537, 2019-05-02. https://arxiv.org/abs/1905.00537. Accessed 2026-05-26.
[^4]: Oscar Sainz, Jon Ander Campos, Iker Garcia-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, Eneko Agirre, "NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark", arXiv:2310.18018, 2023-10-27. https://arxiv.org/abs/2310.18018. Accessed 2026-05-26.
[^5]: Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang, "SQuAD: 100,000+ Questions for Machine Comprehension of Text", arXiv:1606.05250, 2016-06-16. https://arxiv.org/abs/1606.05250. Accessed 2026-05-26.
[^6]: Pranav Rajpurkar, Robin Jia, Percy Liang, "Know What You Don't Know: Unanswerable Questions for SQuAD", arXiv:1806.03822, 2018-06-11. https://arxiv.org/abs/1806.03822. Accessed 2026-05-26.
[^7]: Percy Liang et al., "Holistic Evaluation of Language Models", arXiv:2211.09110, 2022-11-16. https://arxiv.org/abs/2211.09110. Accessed 2026-05-26.
[^8]: Jared Kaplan et al., "Scaling Laws for Neural Language Models", arXiv:2001.08361, 2020-01-23. https://arxiv.org/abs/2001.08361. Accessed 2026-05-26.
[^9]: Aarohi Srivastava et al., "Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models", arXiv:2206.04615, 2022-06-09. https://arxiv.org/abs/2206.04615. Accessed 2026-05-26.
[^10]: Yann LeCun, Leon Bottou, Yoshua Bengio, Patrick Haffner, "Gradient-based learning applied to document recognition", Proceedings of the IEEE 86(11):2278-2324, 1998-11. http://yann.lecun.com/exdb/mnist/. Accessed 2026-05-26.
[^11]: Li Fei-Fei, Rob Fergus, Pietro Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories", Caltech / CVPR Workshop, 2004. https://data.caltech.edu/records/mzrjq-6wc02. Accessed 2026-05-26.
[^12]: Tsung-Yi Lin et al., "Microsoft COCO: Common Objects in Context", arXiv:1405.0312, 2014-05-01. https://arxiv.org/abs/1405.0312. Accessed 2026-05-26.
[^13]: Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, Yejin Choi, "HellaSwag: Can a Machine Really Finish Your Sentence?", arXiv:1905.07830, 2019-05-19. https://arxiv.org/abs/1905.07830. Accessed 2026-05-26.
[^14]: Dan Hendrycks et al., "Measuring Massive Multitask Language Understanding", arXiv:2009.03300, 2020-09-07. https://arxiv.org/abs/2009.03300. Accessed 2026-05-26.
[^15]: Yubo Wang et al., "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark", arXiv:2406.01574, 2024-06-03. https://arxiv.org/abs/2406.01574. Accessed 2026-05-26.
[^16]: David Rein et al., "GPQA: A Graduate-Level Google-Proof Q&A Benchmark", arXiv:2311.12022, 2023-11-20. https://arxiv.org/abs/2311.12022. Accessed 2026-05-26.
[^17]: Karl Cobbe et al., "Training Verifiers to Solve Math Word Problems", arXiv:2110.14168, 2021-10-27. https://arxiv.org/abs/2110.14168. Accessed 2026-05-26.
[^18]: Dan Hendrycks et al., "Measuring Mathematical Problem Solving With the MATH Dataset", arXiv:2103.03874, 2021-03-05. https://arxiv.org/abs/2103.03874. Accessed 2026-05-26.
[^19]: Epoch AI, "FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI", Epoch AI, 2024-11-08. https://epoch.ai/frontiermath/. Accessed 2026-05-26.
[^20]: Mark Chen et al., "Evaluating Large Language Models Trained on Code", arXiv:2107.03374, 2021-07-07. https://arxiv.org/abs/2107.03374. Accessed 2026-05-26.
[^21]: Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan, "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", arXiv:2310.06770, 2023-10-10. https://arxiv.org/abs/2310.06770. Accessed 2026-05-26.
[^22]: OpenAI, "Introducing SWE-bench Verified", OpenAI Blog, 2024-08-13. https://openai.com/index/introducing-swe-bench-verified/. Accessed 2026-05-26.
[^23]: Naman Jain et al., "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code", arXiv:2403.07974, 2024-03-12. https://arxiv.org/abs/2403.07974. Accessed 2026-05-26.
[^24]: Francois Chollet, "On the Measure of Intelligence", arXiv:1911.01547, 2019-11-05. https://arxiv.org/abs/1911.01547. Accessed 2026-05-26.
[^25]: Francois Chollet et al., "ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems", arXiv:2505.11831, 2025-05-17. https://arxiv.org/abs/2505.11831. Accessed 2026-05-26.
[^26]: Xiang Yue et al., "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI", arXiv:2311.16502, 2023-11-27. https://arxiv.org/abs/2311.16502. Accessed 2026-05-26.
[^27]: Cheng-Ping Hsieh et al., "RULER: What's the Real Context Size of Your Long-Context Language Models?", arXiv:2404.06654, 2024-04-09. https://arxiv.org/abs/2404.06654. Accessed 2026-05-26.
[^28]: Gregoire Mialon, Clementine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom, "GAIA: a benchmark for General AI Assistants", arXiv:2311.12983, 2023-11-21. https://arxiv.org/abs/2311.12983. Accessed 2026-05-26.
[^29]: Tianbao Xie et al., "OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments", arXiv:2404.07972, 2024-04-11. https://arxiv.org/abs/2404.07972. Accessed 2026-05-26.
[^30]: Shuyan Zhou et al., "WebArena: A Realistic Web Environment for Building Autonomous Agents", arXiv:2307.13854, 2023-07-25. https://arxiv.org/abs/2307.13854. Accessed 2026-05-26.
[^31]: Xiao Liu et al., "AgentBench: Evaluating LLMs as Agents", arXiv:2308.03688, 2023-08-07. https://arxiv.org/abs/2308.03688. Accessed 2026-05-26.
[^32]: Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik Narasimhan, "τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains", arXiv:2406.12045, 2024-06-17. https://arxiv.org/abs/2406.12045. Accessed 2026-05-26.
[^33]: Wei-Lin Chiang et al., "Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference", arXiv:2403.04132, 2024-03-07. https://arxiv.org/abs/2403.04132. Accessed 2026-05-26.
[^34]: Lianmin Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena", arXiv:2306.05685, 2023-06-09. https://arxiv.org/abs/2306.05685. Accessed 2026-05-26.
[^35]: Yann Dubois, Balazs Galambosi, Percy Liang, Tatsunori B. Hashimoto, "Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators", arXiv:2404.04475, 2024-04-06. https://arxiv.org/abs/2404.04475. Accessed 2026-05-26.
[^36]: Hugging Face Open LLM Leaderboard Team, "Open-LLM performances are plateauing, let's make the leaderboard steep again", Hugging Face Blog, 2024-06-26. https://huggingface.co/spaces/open-llm-leaderboard/blog. Accessed 2026-05-26.
[^37]: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", Advances in Neural Information Processing Systems 25 (NeurIPS 2012), 2012. https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html. Accessed 2026-06-20.
[^38]: Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009. https://www.image-net.org/static_files/papers/imagenet_cvpr09.pdf. Accessed 2026-06-20.
[^39]: Long Phan et al., "Humanity's Last Exam", arXiv:2501.14249, 2025-01-23. https://arxiv.org/abs/2501.14249. Accessed 2026-06-20.
[^40]: SWE-bench, "SWE-bench Verified Leaderboard", swebench.com. https://www.swebench.com/. Accessed 2026-06-20.
[^41]: Francois Chollet, "On the Measure of Intelligence", arXiv:1911.01547, 2019-11-05. https://arxiv.org/abs/1911.01547. Accessed 2026-06-20.