Benchmarks

In artificial intelligence, benchmarks are standardized tests and evaluation protocols used to measure the capabilities of AI systems across defined tasks. They provide a common framework for comparing different models, tracking progress over time, and identifying areas where systems fall short. Benchmarks have become central to how AI labs communicate the performance of their large language models, multimodal systems, and AI agents, though they face growing criticism over data contamination, saturation, and the distorting effects described by Goodhart's Law.

Why benchmarks matter

Benchmarks serve several interconnected purposes in AI research and development. At their core, they provide reproducible evaluation conditions, so that different research groups can compare their models on the same tasks under the same protocol. Without standardized benchmarks, claims about model performance would be difficult to verify or compare across organizations.^[1]^[2]

Beyond comparison, benchmarks help identify specific strengths and weaknesses within a model. A system might excel at mathematical reasoning (as measured by MATH or GSM8K) while struggling with coding tasks (as measured by SWE-bench), and this granularity helps researchers target improvements. Benchmarks also create shared reference points for the broader community to gauge progress toward milestones like human-level performance on various cognitive tasks.^[1]^[2]

For commercial AI development, benchmark results have taken on additional significance as marketing material. Results are prominently featured in model announcements from labs like OpenAI, Google DeepMind, Anthropic, and Meta AI. This dual role as both scientific measurement tool and promotional metric creates tensions that are explored in the sections on Goodhart's Law and limitations below.^[3]

The Stanford AI Index Report for 2025 documented how benchmark-measured progress accelerated dramatically: performance on GPQA rose by 48.9 percentage points between 2023 and 2024, and on SWE-bench, AI systems went from solving 4.4% of coding problems in 2023 to 71.7% in 2024.^[1]

Types of benchmarks

AI benchmarks can be organized into several broad categories based on the capabilities they evaluate.

Language understanding and knowledge

These benchmarks test a model's ability to comprehend text, recall factual knowledge, and demonstrate understanding across academic subjects. MMLU (Massive Multitask Language Understanding) is the most widely cited example, covering 57 subjects from elementary mathematics to professional law with 15,908 multiple-choice questions. Created by Dan Hendrycks and collaborators in 2020, MMLU essentially functions as a broad "SAT test" for AI models.^[4]^[5]

As leading models began saturating MMLU (with scores above 90%), harder variants were introduced. MMLU-Pro (2024) increases difficulty by offering 10 answer choices instead of 4 and drawing from more challenging question pools. MMMLU extends the concept to multilingual evaluation across multiple languages.^[4]^[5]
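One effect of the wider answer set can be shown with simple arithmetic: uniform random guessing yields 1/4 = 25% expected accuracy with 4 options but only 1/10 = 10% with 10 options, so the same raw score sits further above chance on the harder format. The following is a minimal sketch of multiple-choice scoring; the function names and the sample answers are illustrative assumptions, not part of any official evaluation harness.

```python
def accuracy(predictions, answers):
    """Fraction of questions where the predicted option matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def chance_accuracy(num_options):
    """Expected accuracy of uniform random guessing over num_options choices."""
    return 1 / num_options


# Hypothetical predictions and answer key for four questions.
print(accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"]))  # 0.75
print(chance_accuracy(4))   # 0.25 (4-option format)
print(chance_accuracy(10))  # 0.1  (10-option format)
```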

Reasoning

Reasoning benchmarks evaluate logical thinking, graduate-level problem solving, and commonsense inference. GPQA Diamond (Graduate-Level Google-Proof Q&A) stands out for its difficulty: the benchmark's 198 questions are designed so that even PhD-level domain experts achieve only around 65% accuracy, while skilled non-experts reach about 34% even with web access. The "Google-proof" design means answers cannot be easily found through internet search.^[6]

HellaSwag (2019) evaluates commonsense reasoning through sentence completion tasks. The benchmark provides the beginning of a scenario and asks models to choose the correct continuation from multiple options. Humans achieve approximately 95% accuracy, and while top models have approached this level, the benchmark revealed important gaps in language models' understanding of everyday physical and social situations.^[7]

WinoGrande (2019) focuses on commonsense reasoning through pronoun resolution. Its 44,000 fill-in-the-blank problems, inspired by the Winograd Schema Challenge, require models to resolve ambiguous pronoun references using world knowledge. Human performance sits at 94%, and the benchmark has been a staple of model evaluation suites.^[7]

BIG-Bench Hard (BBH) extracts 23 of the most challenging tasks from the broader BIG-Bench collection, focusing on problems that were deemed beyond the capabilities of language models at the time of creation. These tasks span diverse reasoning challenges including logical deduction, causal reasoning, and multi-step inference.^[7]

Mathematics

Mathematics benchmarks range from grade-school arithmetic to research-level problems. GSM8K (Grade School Math 8K), created by Karl Cobbe and colleagues at OpenAI in 2021, consists of 8,500 linguistically diverse grade-school math word problems. Each problem requires between 2 and 8 steps of elementary arithmetic and is designed to test multi-step reasoning rather than rote calculation. Solutions include natural language explanations of intermediate steps, making the benchmark useful for evaluating chain-of-thought reasoning. GSM8K has become effectively saturated for frontier models as of 2025.^[8]

MATH, also from 2021, presents 12,500 competition-level mathematics problems across five difficulty levels, requiring multi-step symbolic reasoning, proof construction, and mathematical creativity. Problems are drawn from competitions such as AMC, AIME, and olympiad-level contests.^[9]

As these benchmarks became saturated, the community developed harder alternatives. FrontierMath, maintained by Epoch AI, features research-level problems at the frontier of mathematical knowledge. Humanity's Last Exam (HLE), a collaboration between the Center for AI Safety and Scale AI published in Nature in 2025, contains 2,500 expert-vetted questions across mathematics (41%), physics (9%), biology/medicine (11%), and other fields. The questions are designed to be "Google-proof," and early frontier model scores were remarkably low: GPT-4o scored 2.7%, Claude 3.5 Sonnet scored 4.1%, and OpenAI's o1 reached 8%. Around 1,000 subject expert contributors from over 500 institutions across 50 countries participated in creating the questions.^[10]

Coding

HumanEval, created by Mark Chen and colleagues at OpenAI in 2021, was one of the first widely adopted coding benchmarks. It presents 164 Python function signatures with docstrings and evaluates whether model-generated code passes the unit tests associated with each problem. As of 2025, top models score above 90% on HumanEval, effectively saturating the benchmark for frontier systems.^[11]
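The idea of judging generated code by whether it passes unit tests can be sketched as follows. This is a simplified illustration, not the official HumanEval harness: the `passes_tests` helper and the `add` problem are hypothetical, and a real evaluation would run untrusted code in a sandbox with timeouts rather than calling `exec` directly.

```python
def passes_tests(prompt: str, completion: str, test_code: str) -> bool:
    """Return True if prompt + completion defines code that passes test_code."""
    namespace = {}
    try:
        exec(prompt + completion, namespace)  # define the candidate function
        exec(test_code, namespace)            # run the assertions against it
    except Exception:
        # Any syntax error, runtime error, or failed assertion counts as a failure.
        return False
    return True


# Hypothetical problem: a signature with a docstring, as in the benchmark's format.
PROMPT = 'def add(a, b):\n    """Return the sum of a and b."""\n'
TESTS = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

GOOD_COMPLETION = "    return a + b\n"
BAD_COMPLETION = "    return a - b\n"

print(passes_tests(PROMPT, GOOD_COMPLETION, TESTS))  # True
print(passes_tests(PROMPT, BAD_COMPLETION, TESTS))   # False
```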

SWE-bench (Software Engineering Benchmark), developed by Carlos Jimenez and collaborators at Princeton Language and Intelligence in 2023, represents a major step up in complexity. It presents models with real GitHub issues from 12 popular Python repositories and asks them to generate patches that resolve the described problems. Success is determined by whether the generated patch passes both fail-to-pass tests (tests that fail before the fix but pass after) and pass-to-pass tests (regression tests). The full benchmark contains 2,294 tasks, while the human-verified SWE-bench Verified subset of 500 tasks has become the standard evaluation target. Top AI coding agents now solve over 70% of SWE-bench Verified tasks.^[12]
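The success criterion described above can be expressed compactly: a task counts as resolved only if every fail-to-pass test and every pass-to-pass test passes once the patch is applied. The sketch below assumes test outcomes have already been collected into a dictionary; the function name and test names are hypothetical and do not reflect the benchmark's actual tooling.

```python
def is_resolved(results_after_patch, fail_to_pass, pass_to_pass):
    """Check the resolution criterion for one task.

    results_after_patch maps a test name to True if the test passed after
    the generated patch was applied. A missing test is treated as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(results_after_patch.get(name, False) for name in required)


# Hypothetical test names for a single task.
fail_to_pass = ["test_issue_fixed"]
pass_to_pass = ["test_existing_a", "test_existing_b"]

clean_fix = {"test_issue_fixed": True, "test_existing_a": True, "test_existing_b": True}
with_regression = {"test_issue_fixed": True, "test_existing_a": True, "test_existing_b": False}

print(is_resolved(clean_fix, fail_to_pass, pass_to_pass))        # True
print(is_resolved(with_regression, fail_to_pass, pass_to_pass))  # False
```

Requiring the pass-to-pass tests is what prevents a patch from "fixing" the issue by breaking unrelated behavior.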

LiveCodeBench (2024) addresses contamination concerns by drawing problems exclusively from recent programming competitions that post-date model training cutoffs. By continuously adding new problems, it maintains its validity as a benchmark even as training data expands. This rolling design represents one approach to the broader contamination problem.^[13]
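The rolling design amounts to a date filter: for a given model, only problems released after its training cutoff are scored. The following is a minimal sketch under that assumption; the problem records, field names, and cutoff date are invented for illustration and are not taken from LiveCodeBench itself.

```python
from datetime import date


def post_cutoff_problems(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p["released"] > training_cutoff]


# Hypothetical problem pool with release dates.
problems = [
    {"id": "contest-101-a", "released": date(2023, 6, 1)},
    {"id": "contest-140-c", "released": date(2024, 3, 15)},
    {"id": "contest-152-b", "released": date(2024, 9, 30)},
]

# Hypothetical training cutoff for the model under evaluation.
eligible = post_cutoff_problems(problems, training_cutoff=date(2024, 1, 1))
print([p["id"] for p in eligible])  # ['contest-140-c', 'contest-152-b']
```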

Multimodal

MMMU (Massive Multi-discipline Multimodal Understanding), released in 2023, tests vision-language models on 11,500 college-level problems that require understanding both visual and textual information. Questions span subjects from art history to electrical engineering, with images including charts, diagrams, photographs, and mathematical notation. Other multimodal benchmarks include MathVista for mathematical reasoning with visual inputs, OCRBench for text recognition in images, and CharXiv for chart understanding.^[14]

Safety and instruction following

TruthfulQA, created by Stephanie Lin and colleagues in 2021, evaluates whether models resist generating widely held misconceptions and falsehoods. Its 817 questions across 38 knowledge domains are designed to elicit common human errors, and the benchmark revealed that many models reproduce popular misconceptions rather than providing accurate answers. However, analysis has shown that TruthfulQA is now partially saturated due to its inclusion in training datasets.^[15]

IFEval (Instruction-Following Evaluation), developed by Jeffrey Zhou and colleagues at Google in 2023, tests a model's ability to follow 25 types of automatically verifiable instructions. These include constraints like "write in more than 400 words," "mention the keyword AI at least 3 times," or "output in JSON format." The benchmark contains approximately 500 prompts, each with one or more verifiable instructions, and evaluates compliance at both the prompt level and the individual instruction level.^[16]
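Because each instruction is automatically verifiable, compliance can be checked with ordinary code, and the two reporting levels differ only in how results are aggregated. The sketch below illustrates three checks modeled on the examples above and both aggregation levels; the function names and sample responses are assumptions for illustration, not the benchmark's own implementation.

```python
import json
import re


def more_than_n_words(response, n):
    """True if the response contains more than n whitespace-separated words."""
    return len(response.split()) > n


def mentions_keyword(response, keyword, min_count):
    """True if keyword appears as a whole word at least min_count times."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return len(re.findall(pattern, response)) >= min_count


def is_valid_json(response):
    """True if the whole response parses as JSON."""
    try:
        json.loads(response)
    except json.JSONDecodeError:
        return False
    return True


def score(per_prompt_results):
    """Return (prompt-level accuracy, instruction-level accuracy).

    per_prompt_results holds one list of booleans per prompt, one boolean
    per instruction attached to that prompt.
    """
    prompt_level = sum(all(r) for r in per_prompt_results) / len(per_prompt_results)
    flat = [ok for r in per_prompt_results for ok in r]
    instruction_level = sum(flat) / len(flat)
    return prompt_level, instruction_level


# Hypothetical responses to three prompts.
first = "AI helps. AI learns. AI scales."
results = [
    [mentions_keyword(first, "AI", 3), more_than_n_words(first, 5)],
    [is_valid_json("not json"), more_than_n_words("not json", 1)],
    [is_valid_json('{"answer": 42}')],
]

prompt_level, instruction_level = score(results)
# 2 of 3 prompts satisfy all their instructions; 4 of 5 instructions are satisfied.
print(prompt_level, instruction_level)
```

A prompt counts at the prompt level only if all of its instructions are satisfied, so prompt-level accuracy is never higher than instruction-level accuracy.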

Human preference

Chatbot Arena (formerly LMSYS Chatbot Arena) takes a fundamentally different approach from static benchmarks. Users interact with two anonymous models simultaneously and vote on which one provides a better response. The platform, created by LMSYS in 2023, has accumulated over 6 million votes, and Elo-style ratings are computed using the Bradley-Terry model. This approach captures real-world user preferences in a way that static benchmarks cannot, though it can be influenced by factors like response length, formatting style, and the demographics of the voting population.^[17]
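To illustrate how pairwise votes turn into ratings, the Bradley-Terry model assigns each model a positive strength s, with the probability that model i beats model j given by s_i / (s_i + s_j). The sketch below fits those strengths with a standard iterative (minorization-maximization) update in plain Python. It is an illustration only: the vote data is invented, the 400-point scale and 1,000-point base are assumed display conventions, and the platform's actual pipeline (tie handling, confidence intervals, and so on) is not reproduced here.

```python
import math
from collections import defaultdict


def fit_bradley_terry(votes, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Assumes every model wins at least once; otherwise its strength
    collapses to zero and no finite rating exists.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    models = set()
    for winner, loser in votes:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            denom = sum(
                pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i and frozenset((i, j)) in pair_counts
            )
            updated[i] = wins[i] / denom if denom else strength[i]
        # Rescale so the strengths average to 1 (the model is scale-invariant).
        total = sum(updated.values())
        strength = {m: s * len(models) / total for m, s in updated.items()}
    return strength


def to_elo_scale(strength, scale=400.0, base=1000.0):
    """Map strengths onto an Elo-like scale (scale and base are assumptions)."""
    return {m: base + scale * math.log10(s) for m, s in strength.items()}


# Hypothetical votes among three anonymous models.
votes = (
    [("A", "B")] * 7 + [("B", "A")] * 3
    + [("B", "C")] * 6 + [("C", "B")] * 4
    + [("A", "C")] * 8 + [("C", "A")] * 2
)
ratings = to_elo_scale(fit_bradley_terry(votes))
for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(model, round(rating))
```

On this scale a rating gap maps directly to an expected win rate: a 400-point gap corresponds to 10-to-1 odds under the assumed convention.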

As of early 2026, the Chatbot Arena leaderboard serves as one of the most watched indicators of overall model quality in the AI industry, with top models clustering around 1,400-1,500 Elo points.^[17]

Major benchmarks

The following table provides a comprehensive reference for widely used benchmarks in the AI evaluation ecosystem as of early 2026.

Benchmark	Domain	Format	Size	Created	Key detail
MMLU	Knowledge (57 subjects)	Multiple choice (4 options)	15,908 questions	2020	Hendrycks et al.; top models above 90%
MMLU-Pro	Knowledge (harder)	Multiple choice (10 options)	12,000 questions	2024	Wider answer set; more challenging
GPQA Diamond	Expert science	Multiple choice	198 questions	2023	PhD experts ~65%; non-experts ~34%
MATH	Competition math	Open-ended	12,500 problems	2021	Five difficulty levels
GSM8K	Grade-school math	Open-ended	8,500 problems	2021	Cobbe et al.; 2-8 step arithmetic
HumanEval	Python coding	Code generation	164 problems	2021	Chen et al.; top models above 90%
SWE-bench	Software engineering	Patch generation	2,294 tasks	2023	Princeton; real GitHub issues
SWE-bench Verified	Software engineering	Patch generation	500 tasks	2024	Human-verified; top agents above 70%
LiveCodeBench	Coding (contamination-resistant)	Code generation	Rolling	2024	Post-cutoff competition problems
ARC-AGI	Fluid intelligence	Visual grid puzzles	400+400 tasks	2019	Chollet; ~50-80% with heavy compute
MMMU	Multimodal understanding	Multiple choice	11,500 questions	2023	College-level vision+text problems
Chatbot Arena	Human preference	Pairwise voting	6M+ votes	2023	Crowdsourced Elo via LMSYS
HellaSwag	Commonsense reasoning	Sentence completion	10,000 items	2019	Humans ~95%; models near saturation
WinoGrande	Commonsense reasoning	Fill-in-the-blank	44,000 problems	2019	Pronoun resolution
BIG-Bench Hard	Diverse reasoning	Mixed	23 tasks	2022	Hardest BIG-Bench subset
TruthfulQA	Truthfulness	Open-ended / MC	817 questions	2021	Lin et al.; tests misconception resistance
IFEval	Instruction following	Prompted tasks	~500 prompts	2023	Zhou et al.; 25 verifiable instruction types
Humanity's Last Exam	Expert knowledge	MC + short-answer	2,500 questions	2025	CAIS + Scale AI; early top scores under 20%
FrontierMath	Research math	Open-ended	Ongoing	2024	Epoch AI; frontier-level problems
SimpleQA	Factual accuracy	Short-answer	4,326 questions	2024	OpenAI; verifiable factual recall
BrowseComp	Web browsing	Information retrieval	Varies	2025	Tests agentic web research ability

Benchmark contamination

Data contamination occurs when a model's training data includes questions or answers from a benchmark's test set, artificially inflating scores. This is one of the most significant and persistent threats to the validity of AI evaluation.

The mechanism is straightforward: large language models are trained on vast internet crawls that can contain billions of web pages. Benchmark questions, once published online (in papers, GitHub repositories, forums, or blog posts), can be inadvertently included in these training datasets. A model that has "seen" a test question during training may appear to reason through it when it is actually recalling a memorized answer.^[3]^[18]

Research has demonstrated that contamination effects can be dramatic. In one study, the coding model StarCoder-7B scored 4.9 times higher on leaked test data compared to clean data. More broadly, models can gain up to 10 percentage points on test sets simply through exposure during training, without any intentional overfitting. This means that apparent leaps in capability may sometimes reflect data leakage rather than genuine improvement.^[18]

The problem is compounded by a lack of standardization. As of early 2026, there are no industry-wide standards for contamination detection. Every lab uses different methods with different sensitivity thresholds, making it difficult to compare contamination assessments across organizations. Some mitigation strategies have emerged:

Private test sets: Benchmarks like ARC-AGI maintain a private evaluation set that is never published, so it cannot appear in training data.
Rolling benchmarks: LiveCodeBench continuously adds problems from recent programming competitions that post-date model training cutoffs.
Contamination detection: Some evaluation frameworks attempt to detect whether a model has seen specific test questions, though these methods are imperfect.
Canary strings: Some benchmarks embed unique identifiable strings that can be searched for in training data dumps.^[3]^[18]
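Two of the strategies above lend themselves to a short sketch: checking a training document for word n-gram overlap with a benchmark item, and searching it for a canary string. The code below is a simplified illustration; the overlap measure, the n-gram length, the sample texts, and the canary value are all assumptions rather than any lab's actual detection method.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item, training_doc, n=8):
    """Fraction of the item's n-grams that also appear in the training document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)


def contains_canary(training_doc, canary):
    """True if the benchmark's canary string appears verbatim in the document."""
    return canary in training_doc


# Hypothetical benchmark question, training documents, and canary string.
question = "what is the capital of france and why"
leaked_doc = "forum post: what is the capital of france and why does it matter"
clean_doc = "paris is lovely in spring"
CANARY = "EXAMPLE-CANARY-0000"  # placeholder, not a real benchmark canary

print(overlap_ratio(question, leaked_doc, n=3))  # 1.0
print(overlap_ratio(question, clean_doc, n=3))   # 0.0
print(contains_canary("header " + CANARY + " footer", CANARY))  # True
```

Exact n-gram matching misses paraphrased or translated copies of a question, which is one reason such detection methods remain imperfect.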

Goodhart's Law and benchmarks

Goodhart's Law, often paraphrased as "when a measure becomes a target, it ceases to be a good measure," applies directly to AI benchmarks. Once AI labs and researchers optimize their models to achieve high scores on specific benchmarks, those benchmarks progressively lose their ability to meaningfully distinguish between systems or predict real-world performance.^[3]^[19]

Several dynamics contribute to this problem:

Selective disclosure. Labs can privately evaluate many model variants across many benchmarks and choose to report only the most favorable results. When researchers analyzed 2.8 million model comparison records from Chatbot Arena, they found that selective model submissions inflated scores by up to 100 Elo points through cherry-picking favorable matchups.^[19]
Benchmark-adjacent training. Models may be specifically fine-tuned on data that closely resembles benchmark questions without being trained directly on test items. This practice is difficult to detect and falls into a gray area between legitimate improvement and gaming.
Benchmark-specific optimization. Labs may tune hyperparameters, prompting strategies, or inference settings specifically for benchmark evaluation, achieving scores that do not reflect typical usage conditions.^[3]
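The effect of selective disclosure can be illustrated with a small simulation: if several equally capable variants are each measured with some noise and only the best result is reported, the reported score rises with the number of variants even though true capability is unchanged. The sketch below is a toy model, not the methodology of the cited analysis; the true score, noise level, and variant counts are arbitrary assumptions.

```python
import random
import statistics


def reported_score(true_score, noise_sd, num_variants, rng):
    """Best measured score among num_variants equally capable variants."""
    return max(rng.gauss(true_score, noise_sd) for _ in range(num_variants))


def mean_reported(num_variants, trials=2000, seed=0):
    """Average reported score over many simulated evaluation campaigns."""
    rng = random.Random(seed)
    return statistics.mean(
        # Hypothetical values: true rating 1200, measurement noise of 30 points.
        reported_score(1200.0, 30.0, num_variants, rng)
        for _ in range(trials)
    )


for n in (1, 5, 10):
    print(n, round(mean_reported(n), 1))
```

With a single variant the average reported score stays near the true value; with more variants to choose from, the gap between reported and true score grows purely through selection.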

The benchmark lifecycle follows a recurring pattern: a new benchmark is introduced and reveals genuine capability gaps; labs optimize for it; scores saturate; and researchers create harder replacements. MMLU's progression to MMLU-Pro, HumanEval's evolution toward SWE-bench, and the creation of Humanity's Last Exam as an intentionally "Google-proof" benchmark all illustrate this cycle. The useful lifespan of a public benchmark can be surprisingly short once high scores drive prestige and funding decisions.^[10]^[19]

Limitations of benchmarks

Beyond contamination and Goodhart effects, AI benchmarks face several structural limitations that the research community has increasingly acknowledged.

Narrow task coverage. Most benchmarks test isolated capabilities rather than the integrated, multi-step skills needed for real-world applications. A model may score well on HumanEval's isolated coding problems but struggle with the full software engineering workflow tested by SWE-bench, which requires understanding large codebases, interpreting issue descriptions, and generating correct patches.^[2]
Static nature. Fixed benchmarks become stale as models improve. The saturation of GSM8K, MMLU, and HumanEval by 2024-2025 reduced their ability to differentiate frontier models, forcing the community into a constant cycle of creating harder replacements.^[1]
Format sensitivity. Model performance can vary significantly based on prompt format, few-shot examples, system prompts, and evaluation protocols. Different evaluation harnesses (such as EleutherAI's LM Evaluation Harness versus individual lab implementations) can produce meaningfully different scores on the same benchmark with the same model.^[2]
Missing real-world validity. High benchmark scores do not always translate to useful real-world performance. A model with a strong MMLU score may still produce hallucinations on simple factual questions or fail at tasks requiring sustained multi-turn reasoning over many interactions.^[3]
Lack of linguistic and cultural diversity. Many benchmarks are English-centric and focus on Western academic knowledge systems. Efforts like MGSM (multilingual GSM8K) and MMMLU (multilingual MMLU) address this gap, but they remain less widely adopted than their English-only counterparts.^[2]
Cost and access barriers. Running comprehensive benchmark evaluations requires significant computational resources. Some benchmarks, particularly agent-based evaluations like SWE-bench, involve executing code in sandboxed environments, which adds infrastructure complexity and cost.^[12]

Current state (2025-2026)

The AI evaluation landscape continues to evolve rapidly in response to the limitations of traditional benchmarks. Several trends characterize the current state.

First, there has been a clear shift toward harder and more contamination-resistant evaluations. Humanity's Last Exam, FrontierMath, and ARC-AGI (now in its third version) represent attempts to stay ahead of rapid model improvement. Private and rolling benchmarks that continuously refresh their question pools are gaining adoption.^[1]^[10]

Second, human-preference platforms like Chatbot Arena are increasingly viewed as complementary or even superior to traditional static benchmarks for measuring overall model quality. The crowdsourced approach captures aspects of model performance (helpfulness, tone, formatting, nuance) that fixed-answer benchmarks miss, though it introduces its own biases related to evaluator demographics and preferences.^[17]

Third, agent-based evaluations that test multi-step, real-world task completion are growing in importance. Benchmarks like SWE-bench, tau-bench, and WebArena evaluate models not just on isolated questions but on their ability to use tools, navigate complex environments, and complete practical workflows. This shift reflects the industry's move toward AI agents that operate autonomously in real-world settings.^[12]

Fourth, the community is grappling with the tension between benchmark transparency (needed for scientific reproducibility) and benchmark security (needed to prevent contamination). Various approaches, from private test sets to rolling evaluation to controlled-access platforms, attempt to balance these competing needs.^[3]^[18]

The Open LLM Leaderboard maintained by Hugging Face standardizes evaluation across a core set of benchmarks, while platforms like Artificial Analysis, Scale Labs, and the Vellum LLM Leaderboard provide additional comparison points. Despite their well-documented limitations, benchmarks remain the primary shared language through which the AI community communicates model capabilities and tracks progress.^[2]

References

1. "Technical Performance." The 2025 AI Index Report, Stanford HAI. https://hai.stanford.edu/ai-index/2025-ai-index-report/technical-performance
2. "Top 50 AI Model Benchmarks & Evaluation Metrics (2025 Guide)." o-mega. https://o-mega.ai/articles/top-50-ai-model-evals-full-list-of-benchmarks-october-2025
3. "Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation." arXiv, February 2025. https://arxiv.org/html/2502.06559v1
4. Hendrycks, D. et al. "Measuring Massive Multitask Language Understanding." ICLR, 2021. https://arxiv.org/abs/2009.03300
5. "MMLU." Wikipedia. https://en.wikipedia.org/wiki/MMLU
6. Rein, D. et al. "GPQA: A Graduate-Level Google-Proof Q&A Benchmark." arXiv, 2023. https://arxiv.org/abs/2311.12022
7. "WinoGrande, HellaSwag, BIG-Bench." DeepEval. https://deepeval.com/docs/benchmarks-hellaswag
8. Cobbe, K. et al. "Training Verifiers to Solve Math Word Problems." arXiv, 2021. https://arxiv.org/abs/2110.14168
9. Hendrycks, D. et al. "Measuring Mathematical Problem Solving With the MATH Dataset." NeurIPS, 2021. https://arxiv.org/abs/2103.03874
10. Phan, L. et al. "Humanity's Last Exam." Nature, 2025. https://www.nature.com/articles/s41586-025-09962-4
11. Chen, M. et al. "Evaluating Large Language Models Trained on Code." arXiv, 2021. https://arxiv.org/abs/2107.03374
12. Jimenez, C.E. et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR, 2024. https://pli.princeton.edu/blog/2023/swe-bench-can-language-models-resolve-real-world-github-issues
13. "LiveCodeBench." LiveCodeBench. https://livecodebench.github.io/
14. "MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark." 2023. https://mmmu-benchmark.github.io/
15. Lin, S. et al. "TruthfulQA: Measuring How Models Mimic Human Falsehoods." ACL, 2022. https://arxiv.org/abs/2109.07958
16. Zhou, J. et al. "Instruction-Following Evaluation for Large Language Models." arXiv, 2023. https://arxiv.org/abs/2311.07911
17. "Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings." LMSYS Org, 2023. https://lmsys.org/blog/2023-05-03-arena/
18. "AI Benchmarks Are a Game Now - And the Industry Is Cheating to Win." UC Strategies. https://ucstrategies.com/news/ai-benchmarks-are-a-game-now-and-the-industry-is-cheating-to-win/
19. "Gaming the System: Goodhart's Law Exemplified in AI Leaderboard Controversy." Collinear AI Blog. https://blog.collinear.ai/p/gaming-the-system-goodharts-law-exemplified-in-ai-leaderboard-controversy

Page	Type	What it covers
Mind2Web	Dataset and benchmark	Real-website benchmark for generalist web agents
BrowserGym	Evaluation ecosystem	Unified environment for running multiple web-agent benchmarks
