Benchmark (AI)
Last reviewed
May 26, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 4,023 words
In artificial intelligence and machine learning, a benchmark is a standardized combination of a dataset, a task definition, and a scoring protocol that allows different models and systems to be compared on a common footing. Modern AI benchmarks typically include training and held-out test splits, a precisely specified input/output format, and one or more quantitative metrics such as accuracy, exact match, F1, BLEU, or pass@k. The history of the field is closely tied to its benchmarks: image classification was reshaped by the ImageNet Large Scale Visual Recognition Challenge,[1] natural language understanding was driven by GLUE and SuperGLUE,[2][3] and the evaluation of large language models now spans dozens of evaluations covering knowledge, mathematics, code, reasoning, multimodal understanding, long context, and tool use. Benchmarks are also one of the most contested artifacts in AI, criticized for data contamination, saturation, construct validity, and Goodhart-style optimization pressure.[4]
A benchmark in machine learning has four components. The first is a dataset: a collection of inputs (and usually reference outputs) drawn from some target distribution. The second is a task definition: a precise specification of what the model must produce given each input, including the allowed prompt format, decoding constraints, and any few-shot exemplars. The third is a scoring metric: a function that maps model outputs and reference outputs to a numerical score. The fourth is an evaluation protocol: the rules governing which split is used, whether the test labels are public, how many samples may be drawn, and whether external tools or retrieval are allowed.
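The four components can be made concrete as a small data structure. The following is a minimal sketch, not the schema of any real harness: the class names `Example`, `Protocol`, and `Benchmark`, and all of their fields, are hypothetical and chosen only to mirror the description above.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    """One dataset item: an input and its reference output."""
    prompt: str
    reference: str


@dataclass(frozen=True)
class Protocol:
    """Evaluation rules; the field names are illustrative assumptions."""
    split: str = "test"
    samples_per_item: int = 1
    tools_allowed: bool = False


@dataclass(frozen=True)
class Benchmark:
    dataset: Sequence[Example]               # 1. dataset
    format_prompt: Callable[[Example], str]  # 2. task definition
    metric: Callable[[str, str], float]      # 3. scoring metric
    protocol: Protocol                       # 4. evaluation protocol

    def evaluate(self, model: Callable[[str], str]) -> float:
        """Mean metric over a non-empty dataset for a prompt -> output model."""
        scores = [
            self.metric(model(self.format_prompt(ex)), ex.reference)
            for ex in self.dataset
        ]
        return sum(scores) / len(scores)


# Toy usage: a two-item arithmetic benchmark scored by exact match.
toy = Benchmark(
    dataset=[Example("2+2", "4"), Example("3+5", "8")],
    format_prompt=lambda ex: f"Q: {ex.prompt}\nA:",
    metric=lambda out, ref: float(out.strip() == ref),
    protocol=Protocol(),
)
always_four = lambda prompt: "4"  # stand-in "model" that ignores its input
toy_score = toy.evaluate(always_four)  # one of the two items is correct
```

Keeping the prompt formatter and the protocol as explicit fields reflects the point that changing either one, with the dataset held fixed, yields a different benchmark.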
This combination matters because the same dataset can support multiple benchmarks. Wikipedia text, for example, underlies the Stanford Question Answering Dataset (SQuAD),[5] the unanswerable-question extension SQuAD 2.0,[6] and many open-domain question-answering setups, each with its own scoring conventions. Likewise, the same model can score very differently depending on how prompts and decoding are configured, which is one reason benchmark organizers increasingly publish reference harnesses such as Stanford's HELM codebase[7] and EleutherAI's lm-evaluation-harness.
Benchmarks exist to make claims about model capability comparable, reproducible, trackable over time, and mappable to capabilities of interest. Comparability means that a number reported on a benchmark by one lab can, in principle, be reproduced by another lab and contrasted with prior numbers. Reproducibility requires that the dataset, prompts, and scoring code be published. Progress tracking lets the community plot performance versus model size, compute, or release date and observe trends such as the scaling laws documented by Kaplan et al. and the Chinchilla scaling revisions.[8] Capability mapping means that a suite of benchmarks attempts to cover a structured set of skills (knowledge recall, multi-step math, code synthesis, multimodal perception, tool use) so that a single aggregate score reflects competence across many dimensions, the explicit design goal of BIG-bench and HELM.[9][7]
The first widely adopted ML benchmark was MNIST, introduced in the LeCun et al. 1998 paper "Gradient-based learning applied to document recognition". MNIST contains 60,000 training and 10,000 test images of handwritten digits, normalized to 28-by-28 grayscale pixels, derived from earlier NIST datasets.[10] It became the canonical sanity check for neural network research for two decades.
Fei-Fei Li and collaborators released Caltech-101 in 2004, with 9,146 images across 101 object categories.[11] This was followed by the PASCAL Visual Object Classes (VOC) challenge, which ran annually from 2005 to 2012 and standardized object detection and segmentation evaluation. A separate article covers PASCAL VOC in detail.
The pivotal moment for computer vision was ImageNet, assembled at Princeton starting in 2007 by Fei-Fei Li and Jia Deng and described in Russakovsky et al. 2015, "ImageNet Large Scale Visual Recognition Challenge" (arXiv:1409.0575).[1] The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) ran from 2010 to 2017 and provided the substrate on which AlexNet (2012) and subsequent deep convolutional networks demonstrated that supervised deep learning at scale could outperform classical computer-vision pipelines. Microsoft COCO, introduced in Lin et al. 2014, extended the field to detection, segmentation, and captioning of common objects in context.[12]
Reading comprehension was reshaped by SQuAD (Rajpurkar et al. 2016), which provided more than 100,000 crowd-written questions answerable by spans from Wikipedia passages, with an F1 metric against reference span answers.[5] Two years later, Rajpurkar, Jia, and Liang released SQuAD 2.0, adding more than 50,000 adversarial unanswerable questions and requiring models to abstain when no span is supported; strong neural systems that reached 86% F1 on SQuAD 1.1 dropped to 66% F1 on SQuAD 2.0.[6]
GLUE (Wang et al. 2018) bundled nine sentence-level English understanding tasks behind a common API, intending to discourage task-specific tricks.[2] Within a year, top systems exceeded the published human baseline on most GLUE tasks, so the same group released SuperGLUE in 2019 with harder tasks and clearer headroom.[3] By 2020, SuperGLUE itself was at or above the human baseline for leading systems, illustrating an early pattern: useful benchmarks saturate quickly once the community focuses on them.
HellaSwag (Zellers et al. 2019) introduced an adversarially filtered commonsense sentence-completion benchmark where humans score above 95% and the best models at release scored under 48%, demonstrating that adversarial filtering can produce a benchmark with large initial headroom even when its constituent questions are individually easy for humans.[13]
MMLU (Hendrycks et al. 2020, arXiv:2009.03300), formally "Measuring Massive Multitask Language Understanding", contains questions drawn from 57 subjects ranging from elementary mathematics to professional law and ethics, all formatted as four-way multiple choice.[14] It became the de facto point of comparison in the GPT-3 / PaLM / Llama era.
BIG-bench (Srivastava et al. 2022, arXiv:2206.04615), short for Beyond the Imitation Game benchmark, is a community-built benchmark with 204 tasks contributed by 450 authors across 132 institutions.[9] A curated BIG-bench Hard subset (BBH) selected 23 tasks where state-of-the-art models at the time fell well short of human performance.
HELM (Liang et al. 2022, arXiv:2211.09110), the Holistic Evaluation of Language Models from Stanford's Center for Research on Foundation Models, evaluated 30 models across 42 scenarios under standardized prompts and reported seven metrics per cell: accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. The authors reported that benchmark coverage of major models rose from an average of 17.9% to 96.0% under their methodology.[7]
By 2023 the field had moved from single-number leaderboards toward portfolios of specialized evaluations. The categories below summarize widely cited benchmarks; each is covered in its own article on this wiki.
Different tasks require different scoring functions, and the choice of metric is part of what defines a benchmark.
Accuracy and exact match are used for multiple-choice tasks (MMLU, GPQA, HellaSwag) and for short-answer tasks where any deviation from the gold answer counts as wrong (GSM8K final-answer extraction).
F1 score is standard for span-extraction reading comprehension such as SQuAD, where the model output is a span of text whose tokens are compared against a reference span at the token level.
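Exact match and token-level F1 can be sketched in a few lines. This is a simplified illustration under stated assumptions: normalization here only lowercases and collapses whitespace, whereas the official SQuAD scorer also strips punctuation and articles, and the function names are hypothetical.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Simplified normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against one reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For the prediction "the cat sat" against the reference "the cat", exact match is 0 while token F1 is 0.8 (precision 2/3, recall 1), which shows why span-extraction tasks usually report both numbers.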
BLEU, ROUGE, and METEOR are n-gram overlap metrics for machine translation and summarization. BLEU was introduced by Papineni et al. in 2002 for machine translation, and ROUGE by Lin in 2004 for summarization. These metrics are still used as automatic proxies despite known weaknesses on paraphrastic outputs.
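The core ingredient of BLEU, clipped (modified) n-gram precision, is easy to illustrate. The sketch below covers only that ingredient for a single reference with whitespace tokenization; full BLEU additionally combines several n-gram orders through a geometric mean and applies a brevity penalty, neither of which is implemented here. The function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: str, reference: str, n: int) -> float:
    """Fraction of candidate n-grams found in the reference, with each
    n-gram's credit clipped to its count in the reference."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# A paraphrase shares few n-grams with the reference even when it is adequate.
unigram = clipped_precision("the film was great", "the movie was excellent", 1)
bigram = clipped_precision("the film was great", "the movie was excellent", 2)
```

In this toy pair the unigram precision is 0.5 and the bigram precision is 0, although the candidate is a reasonable paraphrase, which is the weakness on paraphrastic outputs noted above.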
pass@k is the metric for code generation, introduced with HumanEval: a problem is considered solved if any of k independent samples passes all unit tests. An unbiased estimate of pass@k can be computed from a larger sample of n ≥ k completions using the closed-form estimator given in the original Chen et al. paper.[20]
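For a single problem with n sampled completions of which c pass all tests, the estimator from Chen et al. is 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below computes the per-problem term; the function name and the argument validation are assumptions for illustration, and exact integer arithmetic via `math.comb` is used for simplicity rather than any particular reference implementation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Per-problem unbiased pass@k estimate.

    n: completions sampled, c: completions that passed all tests,
    k: sample budget being estimated (1 <= k <= n).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failing samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With n = 10 and c = 3, the estimate for k = 1 is 0.3 (the raw pass rate), and it rises toward 1 as k grows because any draw of k samples becomes more likely to include a passing one.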
Elo and Bradley-Terry ratings are used for pairwise-comparison arenas such as Chatbot Arena. Each comparison updates the ratings using the same logic as competitive chess ratings, with confidence intervals reported via bootstrap.[33]
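A single chess-style Elo update can be sketched as follows. The K-factor of 32 and the 400-point logistic scale are conventional chess defaults assumed here for illustration, not values taken from any particular arena; a Bradley-Terry fit would instead estimate all ratings jointly by maximum likelihood over the full set of comparisons.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Predicted probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Return updated (A, B) ratings after one comparison.

    score_a is 1.0 if A won, 0.0 if B won, and 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two equally rated models; A wins the comparison.
new_a, new_b = elo_update(1000.0, 1000.0, 1.0)
```

Because the update depends on the gap between the observed and expected result, an upset win against a much higher-rated model moves both ratings more than an expected win does.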
LLM-as-judge metrics use a strong model to score outputs against either a reference or a competing output. MT-Bench uses GPT-4 to score 80 multi-turn questions; AlpacaEval LC uses GPT-4 as a pairwise judge with length-control regression; Arena-Hard uses Claude or GPT-4 to judge a curated set of harder prompts.
Auxiliary metrics increasingly accompany these primary scores. HELM's seven-metric design reports calibration, robustness, fairness, bias, toxicity, and efficiency alongside accuracy, on the argument that single-number leaderboards obscure important trade-offs.[7]
Public benchmarks are scraped into pretraining corpora, raising the possibility that a model has seen test items during training. Sainz et al. 2023 ("NLP Evaluation in trouble", EMNLP Findings, arXiv:2310.18018) define multiple levels of contamination and argue for per-benchmark contamination measurement as a community standard. They note that the most severe case is when a model is trained on the test split of the very benchmark on which it is then evaluated.[4] Public discussions around GPT-4 highlighted possible exposure to SAT, AP, bar-exam, and Codeforces problems used in the system-card evaluations, contributing to skepticism about headline numbers on standardized exams.
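One simple way to probe for this kind of exposure is to measure verbatim n-gram overlap between a test item and a sample of training text. The sketch below is a generic heuristic offered as an assumption, not the method of Sainz et al.; the function names, the whitespace tokenization, and the default window of 8 words are all illustrative choices.

```python
def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that appear verbatim in the corpus.

    Returns 0.0 when the item is shorter than n words.
    """
    item_grams = word_ngrams(test_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & word_ngrams(corpus_text, n)) / len(item_grams)


item = "what is the capital of france and why is it famous"
leaked = "quiz: " + item + " answer paris"
clean = "a short unrelated passage about rivers and mountains in the alps today"
```

A fraction near 1 suggests the item was copied into the corpus, while a fraction of 0 only rules out verbatim copies: paraphrased or translated leaks would pass this check undetected.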
Goodhart's law states that "when a measure becomes a target, it ceases to be a good measure." In ML, this manifests as labs targeting specific benchmarks during pretraining mixture selection, instruction tuning, or RLHF reward shaping. Critics argue that strong scores on saturated benchmarks no longer track the underlying capability the benchmark was intended to measure.
GLUE reached human parity within a year of release; SuperGLUE within two; MMLU is now near saturation for frontier models, motivating MMLU-Pro.[15] Benchmark saturation pushes the field toward harder evaluations (FrontierMath, GPQA, ARC-AGI-2) and toward dynamic or held-out designs that resist memorization.
Construct validity asks whether the benchmark actually measures the latent capability it claims to measure. A long-running line of work questions whether multiple-choice exams measure reasoning or only test-taking heuristics, and whether translation BLEU measures translation quality or only n-gram overlap. Concerns about cultural bias in MMLU and similar benchmarks (predominantly US-centric content) also fall under construct validity.
Many widely cited benchmarks are English-only or US-centric. Efforts such as MGSM (multilingual GSM8K) and MMLU-ProX address some of this gap, but benchmark coverage outside English remains thin compared to English coverage.
Several design strategies attempt to mitigate the above problems.
Held-out and dynamic benchmarks. LiveCodeBench only counts programming-contest problems released after a given model's training cutoff, making contamination structurally impossible for new entries.[23] FrontierMath keeps a private test set, with answers withheld from the public.[19] ARC-AGI keeps a private evaluation set against which the ARC Prize Kaggle competition is graded.[25]
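The release-date rule amounts to a filter over problem metadata. The following is a minimal sketch of that idea only; the `Problem` class, its fields, and the sample dates are hypothetical and do not reflect LiveCodeBench's actual data format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str  # hypothetical identifier
    released: date   # date the problem first became public


def eligible_problems(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Illustrative data, not real contest problems.
pool = [
    Problem("p1", date(2023, 5, 1)),
    Problem("p2", date(2024, 1, 15)),
    Problem("p3", date(2024, 6, 30)),
]
scored = eligible_problems(pool, training_cutoff=date(2023, 12, 31))
```

The guarantee is only as good as the reported cutoff: each model is scored on a different, cutoff-dependent subset, so comparisons between models are cleanest on the problems that postdate every cutoff involved.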
Verified and audited benchmarks. OpenAI's SWE-bench Verified removed underspecified and unsolvable tasks from the original SWE-bench via human review by 93 professional developers, with the goal of producing more accurate estimates of autonomous software engineering capability.[22]
Private leaderboards. Some benchmarks (FrontierMath Tier 4, ARC-AGI evaluation set, parts of GAIA) keep ground-truth answers private and require model providers to submit predictions to an organizer.
Adversarial construction. HellaSwag used adversarial filtering to remove items that were easy for current models, yielding large initial human-model gaps.[13] Adversarial NLI (ANLI) iterated this process across rounds, with annotators specifically trying to construct examples that fooled the current best model.
Verifiable and execution-based scoring. Benchmarks that grade by executing code against unit tests (HumanEval, MBPP, SWE-bench, LiveCodeBench, OSWorld) are harder to game than benchmarks scored by text-overlap metrics, because the grader is a deterministic program rather than a reference string.
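The idea of a program as grader can be sketched with a toy checker that runs a candidate function against input/output cases. This is an illustration under strong assumptions: the function name and test-case format are hypothetical, and `exec` on untrusted model output is unsafe, so a real harness would run candidates in an isolated process with time and resource limits.

```python
from typing import Any


def passes_all_tests(
    candidate_source: str,
    entry_point: str,
    test_cases: list[tuple[tuple[Any, ...], Any]],
) -> bool:
    """Return True only if the candidate defines `entry_point` and returns
    the expected value for every (args, expected) case."""
    namespace: dict[str, Any] = {}
    try:
        # WARNING: no sandboxing here; for illustration only.
        exec(candidate_source, namespace)
        func = namespace[entry_point]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


cases = [((1, 2), 3), ((0, 0), 0)]
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
```

Here `good` passes and `bad` fails regardless of how either solution is worded or formatted, which is the property that makes execution-based scores harder to inflate through stylistic mimicry.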
Several public leaderboards aggregate benchmark numbers.
Between 2023 and 2026, the most cited LLM benchmarks shifted in three directions. First, harder static benchmarks such as GPQA, MMLU-Pro, MATH-500, and FrontierMath replaced saturated predecessors. Second, verifiable and execution-based evaluations such as SWE-bench Verified, LiveCodeBench, OSWorld, and WebArena replaced text-overlap or LLM-as-judge metrics where possible, because they cannot be gamed by stylistic mimicry. Third, agentic and long-horizon benchmarks such as GAIA, OSWorld, WebArena, τ-bench, and AgentBench moved evaluation away from single-turn prompts toward multi-step tasks involving tools, memory, and recovery from failure.
The reasoning-model era starting with OpenAI o1 in 2024 accelerated this shift. Within a year, frontier models saturated MATH, surpassed human PhDs on GPQA Diamond, and posted nontrivial scores on FrontierMath, AIME 2024, and AIME 2025. Benchmark organizers responded with private test sets (FrontierMath), held-out problem streams (LiveCodeBench), and entirely new task formats (ARC-AGI-2 and the planned ARC-AGI-3).
The empirical scaling laws literature relies on benchmarks as the dependent variable: pretraining loss, perplexity, and downstream-benchmark accuracy are plotted against model size, dataset size, and training compute to identify power-law trends. The Kaplan et al. 2020 and Hoffmann et al. 2022 (Chinchilla scaling) papers used cross-entropy loss and downstream task accuracy as proxies for capability.[8] BIG-bench documented the breakthrough phenomenon, in which some tasks remain at random-chance accuracy until a critical scale, after which they jump sharply.[9] Benchmark design choices therefore feed directly back into how the community describes and predicts model progress.
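Identifying a power-law trend typically reduces to a straight-line fit in log-log space. The sketch below fits y = a * x^b by ordinary least squares on the logarithms; it is a generic illustration rather than the fitting procedure of either paper, and the data points are synthetic, generated from a known power law purely to check the fit.

```python
import math


def fit_power_law(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Fit y = a * x**b by least squares on (log x, log y).

    Requires positive values and at least two distinct x values.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    b = sxy / sxx                       # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)   # intercept in log space = log(a)
    return a, b


# Synthetic "loss vs. parameter count" points from y = 5 * x**-0.1 (not real data).
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * s ** -0.1 for s in sizes]
a_hat, b_hat = fit_power_law(sizes, losses)
```

A smooth fit of this kind describes quantities such as loss well, whereas a task showing the breakthrough pattern would appear as a flat segment followed by a jump and would be poorly summarized by a single exponent.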