Model Evaluation
Last reviewed
Sources
26 citations
Review status
Source-backed
Revision
v2 · 3,611 words
Model evaluation is the process of measuring how well a machine learning model performs on data it was not trained on, in order to estimate how it will generalize to new, unseen inputs. The foundational technique is to split a dataset into training, validation, and test partitions (or to use cross-validation), then score the model with task-appropriate metrics: accuracy, precision, recall, F1 score, and area under the ROC curve for classification; mean squared error (MSE) and mean absolute error (MAE) for regression; and perplexity, BLEU, or ROUGE for language tasks. The central goal is to detect overfitting, which shows up as a gap between performance on training data and on held-out data, while avoiding pitfalls such as data leakage. For modern large language models, this routine step has grown into a research field, a marketing battleground, and an industry in its own right, built on static benchmarks scored against fixed answer keys, human preference platforms such as Chatbot Arena, model-graded "LLM-as-judge" methods, agentic task suites such as SWE-bench, and adversarial red-teaming. Every approach has documented failure modes, and by 2025 the combination of data contamination, benchmark saturation, and leaderboard gaming had become serious enough that prominent researchers described an "evaluation crisis" in AI.
Before the LLM era, evaluation followed a simple recipe: train on one split of a dataset, then report a metric on a held-out test split. Classic metrics included accuracy, F1 score, BLEU for translation, ROUGE for summarization, and top-5 error on ImageNet; aggregate suites such as GLUE (2018) and SuperGLUE (2019) extended the recipe to language understanding. General-purpose LLMs broke it. A single model can attempt almost any task, its web-scale training data may silently include the test questions, and many of the qualities users care about, such as helpfulness, style, and judgment, have no answer key at all.
Evaluation has therefore become a layered ecosystem in which no single number is fully trusted:
| Approach | Representative examples | Characteristic weakness |
|---|---|---|
| Static benchmarks | MMLU, GSM8K, HumanEval, GPQA | Contamination and saturation |
| Human preference arenas | Chatbot Arena (LMArena) | Style bias; gameable submission policies |
| LLM-as-judge | MT-Bench, AlpacaEval | Position, verbosity, and self-preference biases |
| Agentic task suites | SWE-bench, OSWorld, GDPval | Expensive; sensitive to agent scaffolding |
| Safety and red-team evaluations | Frontier-lab dangerous-capability tests, government institute reviews | Limited transparency and reproducibility |
The foundation of supervised model evaluation is partitioning labeled data so that the model is scored on examples it never saw during fitting. A typical workflow uses three disjoint sets: a training set used to fit model parameters, a validation set (or development set) used to tune hyperparameters and choose between candidate models, and a test set held back and used only once, at the very end, to produce an unbiased estimate of generalization performance [22]. Common ratios are 60/20/20 or 80/10/10, though the right split depends on dataset size. This single-split approach is called the holdout method. Sebastian Raschka's 2018 review of evaluation methodology notes that "the holdout method for model evaluation and selection" is widely used but is "not recommended when working with small datasets," because a small test set yields a high-variance, unreliable estimate [22].
Two rules make a holdout estimate trustworthy. First, the split should be stratified for classification, meaning each partition preserves the overall class proportions, and it must respect structure in the data: time series are split chronologically, and grouped data (for example, multiple records per patient) is split by group so that no entity appears in more than one partition. Second, the test set must be quarantined: it is excluded from every preprocessing, feature-selection, and tuning decision, and ideally evaluated only once, since repeatedly checking against it gradually leaks information and inflates the reported score.
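The stratification rule can be illustrated with a minimal sketch in plain Python. The function name `stratified_split`, the 60/20/20 default, and the toy labels are illustrative assumptions, not taken from any cited source; chronological or group-aware splitting would need different logic.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, val, test) index lists that roughly preserve class proportions.

    Hypothetical helper for illustration; rounding can shift a sample or two
    between partitions when a class is very small.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class before slicing
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Toy imbalanced dataset: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
train_idx, val_idx, test_idx = stratified_split(labels)
```

In this toy case each partition keeps the 10 percent positive rate, and the three index lists are disjoint, so the test indices can be set aside untouched until the final evaluation.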
When data is scarce, a single holdout split wastes information and gives a noisy estimate. K-fold cross-validation addresses this by shuffling the data and dividing it into k roughly equal, non-overlapping folds; the model is then trained k times, each time holding out one fold as the validation set and training on the remaining k-1 folds, and the k scores are averaged into a single estimate [22]. Every example is used for both training and validation across the rounds, which is why cross-validation uses the available data more efficiently than a single holdout. Common choices are k = 5 or k = 10; the extreme case k = n, where each fold is a single example, is called leave-one-out cross-validation. Raschka's review observes that the choice of k itself involves a "bias-variance trade-off," with larger k reducing bias in the performance estimate but increasing its variance and computational cost [22].
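The k-fold procedure can be sketched in a few lines of plain Python. The helper names (`k_fold_indices`, `cross_validate`) and the toy "predict the training mean" model are assumptions made for illustration only.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, val) index lists for k shuffled, non-overlapping folds."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # near-equal fold sizes
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

def cross_validate(xs, ys, fit, score, k=5, seed=0):
    """Train k times, holding out one fold each round, and average the scores."""
    scores = []
    for train, val in k_fold_indices(len(xs), k, seed):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in val], [ys[i] for i in val]))
    return sum(scores) / len(scores), scores

# Toy usage: the "model" is just the mean of the training targets.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)

def mae_score(model, val_x, val_y):
    return sum(abs(model - y) for y in val_y) / len(val_y)

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
mean_score, fold_scores = cross_validate(xs, ys, fit_mean, mae_score, k=5)
```

Setting `k` equal to the number of samples turns the same loop into leave-one-out cross-validation, at the cost of one model fit per example.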
Cross-validation is used during model selection and hyperparameter tuning, not as a substitute for a final test set. To avoid optimistic bias when both selecting a model and estimating its performance, practitioners use nested cross-validation, in which an inner loop tunes hyperparameters and an outer loop estimates generalization. A crucial discipline applies in both holdout and cross-validation: the final test data must be defined and isolated before any cross-validation is run, so that it never participates in training or model selection.
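Nested cross-validation is easier to follow in code than in prose. The sketch below assumes a toy one-dimensional ridge regression through the origin and a hypothetical grid of penalty values; the point is only the loop structure, in which the inner loop sees nothing but the outer training indices.

```python
import random

def k_folds(indices, k, seed):
    """Yield (train, val) index lists for k shuffled, non-overlapping folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    parts = [shuffled[i::k] for i in range(k)]
    for i, val in enumerate(parts):
        train = [idx for j, part in enumerate(parts) if j != i for idx in part]
        yield train, val

def fit_ridge(xs, ys, lam):
    """Toy 1-D ridge through the origin: w = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_and_score(xs, ys, train, val, lam):
    w = fit_ridge([xs[i] for i in train], [ys[i] for i in train], lam)
    return mse(w, [xs[i] for i in val], [ys[i] for i in val])

def nested_cv(xs, ys, lambdas, outer_k=5, inner_k=3, seed=0):
    outer_scores, chosen = [], []
    for outer_train, outer_val in k_folds(range(len(xs)), outer_k, seed):
        # Inner loop: pick the penalty using only the outer-training data.
        def inner_error(lam):
            errs = [fit_and_score(xs, ys, tr, va, lam)
                    for tr, va in k_folds(outer_train, inner_k, seed + 1)]
            return sum(errs) / len(errs)
        best = min(lambdas, key=inner_error)
        chosen.append(best)
        # Outer loop: refit with the chosen penalty, score on the untouched fold.
        outer_scores.append(fit_and_score(xs, ys, outer_train, outer_val, best))
    return sum(outer_scores) / len(outer_scores), outer_scores, chosen

# Synthetic data (assumed for illustration): y = 3x plus small Gaussian noise.
rng = random.Random(42)
xs = [rng.uniform(-1, 1) for _ in range(60)]
ys = [3.0 * x + rng.gauss(0, 0.1) for x in xs]
mean_mse, outer_scores, chosen = nested_cv(xs, ys, lambdas=[0.0, 0.1, 1.0, 10.0])
```

Because each outer fold never influences the hyperparameter chosen for it, the averaged outer score is not biased upward by the selection step.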
The right metric depends on the task. For classification, predictions are summarized in a confusion matrix of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), from which the standard metrics derive [23]:
| Metric | Definition | When to use |
|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Balanced classes |
| Precision | TP / (TP + FP) | Cost of false positives is high |
| Recall (sensitivity) | TP / (TP + FN) | Cost of missing positives is high |
| F1 score | Harmonic mean of precision and recall | Imbalanced classes; balancing the two |
| ROC AUC | Area under the curve of true-positive rate against false-positive rate | Threshold-independent ranking quality |
Accuracy alone can be deeply misleading on imbalanced data. As Google's machine learning crash course puts it, "For heavily imbalanced datasets, where one class appears very rarely, say 1% of the time, a model that predicts negative 100% of the time would score 99% on accuracy, despite being useless" [23]. The F1 score, derived from C. J. van Rijsbergen's effectiveness measure, addresses this by combining precision and recall; per Wikipedia's definition, "The F1 score is the harmonic mean of the precision and recall," computed as F1 = 2 * (precision * recall) / (precision + recall), so it falls toward zero whenever either component is low [24].
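The table's definitions and the 1 percent example can be checked with a short sketch. The function names are illustrative; returning 0.0 when a ratio is undefined (for example, precision with no positive predictions) is one common convention assumed here, and the pairwise AUC is written for clarity rather than speed.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# The imbalanced case from the text: 1% positives, model always predicts 0.
y_true = [1] + [0] * 99
always_negative = [0] * 100
imbalanced = classification_metrics(y_true, always_negative)
# accuracy is 0.99, while recall and F1 are both 0.0
```

The always-negative model scores 99 percent accuracy yet has zero recall and zero F1, which is the failure the harmonic mean is meant to expose.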
For regression, error is measured against continuous targets. Mean absolute error (MAE) averages the absolute differences between predictions and true values and is comparatively robust to outliers, while mean squared error (MSE) averages the squared differences and therefore penalizes large errors much more heavily. The square root of MSE, RMSE, returns the error to the units of the target, and the coefficient of determination, R-squared, reports the fraction of variance explained.
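A small worked example makes the differences between these regression metrics concrete. The function name and the four data points are illustrative assumptions.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, and R-squared for paired targets and predictions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
metrics = regression_metrics(y_true, y_pred)
# mae = 0.5, mse = 0.375, rmse ≈ 0.612, r2 ≈ 0.949
```

The single error of 1.0 contributes half of the total absolute error but two thirds of the total squared error, which shows how MSE weights large misses more heavily than MAE.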
For natural language generation, perplexity measures how well a language model predicts a text and is defined as the exponential of the per-token cross-entropy: a perplexity of 10 means the model is, on average, as uncertain as if choosing uniformly among 10 next words, and lower is better [25]. Task-specific overlap metrics include BLEU for machine translation and ROUGE for summarization, both of which compare generated text against reference outputs.
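The definition of perplexity translates directly into code. This sketch assumes the per-token probabilities that the model assigned to the observed tokens are already available as a list; obtaining them from a real model is outside its scope.

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-probability per token."""
    cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(cross_entropy)

# A model that assigns probability 0.1 to every observed token is as
# uncertain as a uniform choice among 10 options.
uniform_ppl = perplexity([0.1] * 5)      # ≈ 10.0
confident_ppl = perplexity([0.9] * 5)    # ≈ 1.11, lower is better
```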
The core diagnostic of model evaluation is the generalization gap, the difference between a model's error on the training set and its error on held-out data. A model that fits its training data closely but performs poorly on the test set is overfitting: it has memorized noise and idiosyncrasies rather than the underlying pattern. The opposite failure, where a model is too simple to capture the signal and performs poorly on both training and test data, is underfitting.
These two failure modes are formalized by the bias-variance decomposition, which expresses a model's expected generalization error as the sum of three parts: bias (systematic error from an overly simple model, causing underfitting), variance (sensitivity to fluctuations in the training set, causing overfitting), and irreducible error from noise in the problem itself [26]. The bias-variance tradeoff is the conflict in trying to minimize bias and variance simultaneously: as model complexity increases, bias falls but variance rises, producing the characteristic U-shaped test-error curve [26]. A small train-test gap with high error indicates high bias (underfitting); a large gap indicates high variance (overfitting). Practitioners reduce variance through regularization, more training data, and early stopping, monitoring validation curves to halt training before the gap widens.
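For squared-error loss the decomposition can be written out explicitly. The notation below is an assumption introduced for illustration: the target is modeled as y = f(x) + ε with zero-mean noise of variance σ², the fitted model is f̂, and expectations are taken over training sets and noise. Note that the bias term enters squared.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```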
Data leakage is one of the most common and damaging errors in evaluation: it occurs when information that would not be available at prediction time is used during training, producing performance estimates that look excellent in development but collapse in deployment. The phenomenon, given a formal treatment by Kaufman and colleagues in 2012, is commonly divided into two types. Target leakage happens when a feature encodes the answer or information that only becomes known after the target, such as a billing code that exists only for patients already diagnosed. Train-test contamination happens when information from the test set leaks into training, often through preprocessing or feature selection performed on the full dataset before splitting, or through duplicate or near-duplicate records spanning the partitions.
The practical defenses follow directly from the splitting discipline above: split before any data-dependent transformation; fit all preprocessing, including scaling, imputation, and feature selection, only on the training fold and then apply it unchanged to validation and test data; and deduplicate across partitions. In the LLM era, the same problem reappears at web scale as benchmark data contamination, where public test questions are absorbed into pretraining corpora. This is the central reason modern evaluation increasingly favors private and freshly generated test sets, discussed below.
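The "fit on the training fold only" rule can be shown with a minimal standardization sketch. The helper names and toy values are assumptions; the deduplication helper handles exact duplicates only, and near-duplicates would need fuzzy matching that is not shown here.

```python
import statistics

def fit_standardizer(values):
    """Learn mean and standard deviation from the given values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against zero spread
    return mean, std

def transform(values, params):
    mean, std = params
    return [(v - mean) / std for v in values]

def drop_overlap(train_rows, test_rows):
    """Remove training rows that also appear verbatim in the test set."""
    seen = set(test_rows)
    return [row for row in train_rows if row not in seen]

train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Correct: statistics come from the training data and are reused as-is.
params = fit_standardizer(train)
train_scaled = transform(train, params)
test_scaled = transform(test, params)

# Leaky (shown for contrast only): test values shift the learned statistics.
leaky_params = fit_standardizer(train + test)
```

In the leaky variant the learned mean moves from 2.5 to about 5.33, so information about the test distribution has already reached the training pipeline before any model is fit.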
Static benchmarks dominated LLM evaluation from roughly 2020 to 2024. MMLU (2020) tests multiple-choice knowledge across 57 subjects [1]. GSM8K (2021) contains about 8,500 grade-school math word problems, HumanEval (2021) scores Python code generation against unit tests, and GPQA (2023) poses graduate-level science questions designed to be "Google-proof." Hugging Face's Open LLM Leaderboard, built on EleutherAI's lm-evaluation-harness, automated this kind of scoring for open models and evaluated more than 13,000 of them before being retired in March 2025, with its maintainers concluding that the underlying benchmarks no longer captured meaningful capability differences [2].
Saturation forced repeated escalation. The best frontier models now score at or above 90 percent on MMLU, GSM8K, and HumanEval, so harder successors appeared: MMLU-Pro, Epoch AI's research-level FrontierMath (November 2024), and Humanity's Last Exam (January 2025), a 2,500-question benchmark assembled by the Center for AI Safety and Scale AI from contributions by nearly 1,000 subject-matter experts [3]. Frontier models scored under 10 percent on Humanity's Last Exam at release; one year later, Google's Gemini 3 Pro reached 37.5 percent without tools, showing how quickly even purpose-built "hard" benchmarks erode.
Because many LLM qualities lack ground truth, the field increasingly asks humans to choose between outputs. Chatbot Arena, launched in May 2023 by the LMSYS group at UC Berkeley, shows visitors two anonymous responses to their own prompt and asks which is better; millions of pairwise votes are converted into Elo-style ratings using a Bradley-Terry model [4]. The arena became the industry's most watched leaderboard, and in 2025 its academic operators spun out a company, LMArena, which raised $100 million at a $600 million valuation in May 2025 and a further $150 million at a $1.7 billion valuation in January 2026 [5][6].
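The conversion from pairwise votes to ratings can be illustrated with a small Bradley-Terry fit. This is a sketch under stated assumptions, not a reproduction of the arena's actual pipeline: it uses the classic iterative (minorization-maximization) update, invented vote counts for three hypothetical models, and an arbitrary anchor of 1000 for the Elo-style scale. It also assumes every model has at least one win.

```python
import math
from collections import defaultdict

def fit_bradley_terry(votes, iterations=200):
    """Estimate strengths from (winner, loser) pairs; strengths sum to 1."""
    models = sorted({m for pair in votes for m in pair})
    wins = defaultdict(int)
    games = defaultdict(int)  # comparisons per unordered pair
    for winner, loser in votes:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for i in models:
            denom = sum(games[frozenset((i, j))] / (strength[i] + strength[j])
                        for j in models if j != i)
            updated[i] = wins[i] / denom if denom else strength[i]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength

def to_elo_scale(strength, anchor=1000.0):
    """Map strengths to a 400*log10 scale centered on a hypothetical anchor."""
    raw = {m: 400.0 * math.log10(s) for m, s in strength.items()}
    shift = anchor - sum(raw.values()) / len(raw)
    return {m: r + shift for m, r in raw.items()}

# Invented votes: A beats B 7-3, B beats C 6-4, A beats C 8-2.
votes = ([("A", "B")] * 7 + [("B", "A")] * 3 +
         [("B", "C")] * 6 + [("C", "B")] * 4 +
         [("A", "C")] * 8 + [("C", "A")] * 2)
strengths = fit_bradley_terry(votes)
ratings = to_elo_scale(strengths)
```

Under this model the probability that one system beats another is its strength divided by the sum of the two strengths, which is why a difference on the 400*log10 scale reads like an Elo gap.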
Influence attracted gaming. In April 2025 Meta promoted the second-place arena rank of Llama 4 Maverick, a position earned by an unreleased "experimental" variant optimized for conversationality; the publicly released model placed around 32nd when evaluated, and LMArena updated its submission policies in response [7]. Weeks later, "The Leaderboard Illusion," a study by researchers at Cohere Labs, the Allen Institute for AI, Princeton, Stanford, Waterloo, and the University of Washington, argued that the arena structurally favored large proprietary labs: providers could privately test many variants and publish only the winner (Meta tested 27 before Llama 4's release), and Google and OpenAI models received an estimated 19.2 percent and 20.4 percent of all arena data respectively, versus 29.7 percent for 83 open-weight models combined [8]. LMArena disputed the paper's framing and several of its figures [9]. A separate criticism is stylistic: voters tend to reward longer and more elaborately formatted answers, so LMArena added "style control" rankings in 2024 that statistically adjust for length and formatting.
LLM-as-judge methods use a strong model to grade other models' outputs, making large-scale evaluation cheap and fast. The approach was formalized in 2023 with MT-Bench, in which GPT-4 graded multi-turn answers and matched human preferences more than 80 percent of the time, comparable to agreement between humans [4]. The same study cataloged systematic biases that have shaped all subsequent work: position bias (preferring whichever answer appears first), verbosity bias (preferring longer answers), self-enhancement bias (judges favoring their own outputs), and weak grading of mathematical reasoning [4]. Common mitigations include swapping answer order, length-controlled scoring as in AlpacaEval, detailed rubrics, and panels of multiple judge models. Model grading now underpins much of the industry's routine testing as well as the reward models used in RLHF, but it inherits the judge's blind spots, and systems optimized against model-graded objectives can learn to exploit them.
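The order-swapping mitigation can be sketched as a wrapper around any pairwise judge. The `judge` callable and its return values are hypothetical stand-ins for a real model call, and treating inconsistent verdicts as a tie is one conservative convention assumed here.

```python
def judge_with_swap(judge, prompt, answer_a, answer_b):
    """Query the judge in both orders; keep a verdict only if both agree.

    `judge(prompt, first, second)` is a hypothetical callable returning
    "first", "second", or "tie".
    """
    forward = judge(prompt, answer_a, answer_b)
    backward = judge(prompt, answer_b, answer_a)
    forward_pick = {"first": "A", "second": "B"}.get(forward, "tie")
    backward_pick = {"first": "B", "second": "A"}.get(backward, "tie")
    return forward_pick if forward_pick == backward_pick else "tie"

# Stub judges standing in for a model call.
def always_first(prompt, first, second):
    return "first"  # pure position bias

def prefers_longer(prompt, first, second):
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"

biased_verdict = judge_with_swap(always_first, "q", "x", "y")
# biased_verdict == "tie": the position-biased judge contradicts itself
length_verdict = judge_with_swap(prefers_longer, "q", "short",
                                 "a much longer answer")
# length_verdict == "B": swapping does not remove verbosity bias
```

The second stub shows the limit of the technique: swapping neutralizes position bias but leaves verbosity bias intact, which is why length-controlled scoring is listed as a separate mitigation.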
As models became agents, evaluation shifted from answering questions to completing real work. SWE-bench (2023) asks models to resolve real GitHub issues from popular Python repositories; in August 2024 OpenAI and the benchmark's authors released SWE-bench Verified, a 500-task subset screened by 93 experienced developers [10]. Related suites include OSWorld for computer use, WebArena for web navigation, tau-bench for tool-using customer-service agents, and Terminal-Bench for command-line tasks. Agentic scores are highly sensitive to the scaffolding around the model, and saturation arrived fast: leading agents passed 70 percent on SWE-bench Verified during 2025. In February 2026 OpenAI announced it would stop reporting the benchmark, citing contamination concerns and an audit in which 59.4 percent of audited problems had flawed tests that rejected correct solutions. It pointed instead to harder successors such as Scale AI's SWE-bench Pro, which uses held-out private test sets [11].
Two newer task-based approaches stand out. METR measures a model's "time horizon," the length of task (in human-professional time) that an agent can complete with 50 percent reliability; METR found this horizon doubling roughly every seven months since 2019 [12]. GDPval, released by OpenAI in September 2025, evaluates models on 1,320 real work deliverables spanning 44 occupations in nine major US industries, graded blind by professionals averaging 14 years of experience; at launch the strongest model, Claude Opus 4.1, produced deliverables judged as good as or better than the human expert's in just under half of comparisons [13].
Safety evaluation asks not what a model can do but what it should not do. Frontier labs run dangerous-capability evaluations covering biological, chemical, cyber, and autonomy risks under commitments such as Anthropic's Responsible Scaling Policy, OpenAI's Preparedness Framework, and Google DeepMind's Frontier Safety Framework, supplemented by internal and contracted red-teaming. Governments have institutionalized the practice: the UK's AI Security Institute (founded as the AI Safety Institute in November 2023) conducts pre-deployment testing of frontier models and open-sourced its Inspect evaluation framework in 2024, while the US AI Safety Institute was renamed the Center for AI Standards and Innovation in June 2025. In August 2025 OpenAI and Anthropic completed a first-of-its-kind exchange in which each lab ran its internal alignment and misuse evaluations against the other's models and published the results [14].
By 2025 several compounding failures led researchers, including Sara Hooker, senior author of the Leaderboard Illusion study and then head of Cohere Labs, to describe AI evaluation as being in crisis [15].
Contamination. Public benchmarks leak into web-scale training data. Scale AI's GSM1k study (2024) wrote fresh GSM8K-style math problems from scratch and found that some model families scored up to 13 percentage points lower on the new questions, with statistical evidence of partial memorization of the originals; frontier models showed minimal gaps [16].
Saturation. When top models cluster near a benchmark's ceiling, the remaining differences reflect noise and label errors rather than capability, which motivated the retirements of the Open LLM Leaderboard and SWE-bench Verified [2][11].
Goodhart's law and overfitting. Because benchmark results drive launches, funding, and procurement, labs optimize for them, illustrating Goodhart's law: when a measure becomes a target, it ceases to be a good measure. The Llama 4 Maverick episode and the Leaderboard Illusion findings are widely cited examples [7][8].
Conflicts of interest and benchmark quality. In December 2024 OpenAI announced that o3 solved about 25 percent of FrontierMath; in January 2025 it emerged that OpenAI had commissioned and owned the benchmark, with access to the problems and solutions apart from a 50-question holdout, and Epoch AI acknowledged it should have disclosed the funding relationship earlier [17]. An AI-assisted error review that Epoch launched in 2026 subsequently flagged fatal errors in roughly a third of the benchmark's Tier 1-4 problems [18].
The main responses aim to make evaluations harder to game and closer to real use.
Private and semi-private test sets. Scale AI's SEAL leaderboards (May 2024) rank models on unpublished expert-written datasets of roughly 1,000 prompts [19]. The ARC Prize maintains tiered splits for ARC-AGI: a public set, a semi-private set used to test API-served models, and a fully private set on which official state-of-the-art results are reported; the interactive ARC-AGI-3 keeps the large majority of its game environments held out entirely [20]. Humanity's Last Exam likewise retains a private question subset to measure overfitting to its public set [3].
Dynamic benchmarks. LiveBench (June 2024) adds new objectively scored questions every month, sourced from recent competitions, papers, and news [21], and LiveCodeBench timestamps coding problems so that models can be scored only on problems published after their training cutoff. The approach descends from Meta's earlier Dynabench (2020) experiment in dynamic adversarial data collection.
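The timestamp-based idea reduces to a date filter. The record layout, field names, and dates below are invented for illustration and do not reflect any benchmark's actual data format.

```python
from datetime import date

# Hypothetical problem records with publication dates.
problems = [
    {"id": "p1", "published": date(2023, 5, 1)},
    {"id": "p2", "published": date(2024, 2, 10)},
    {"id": "p3", "published": date(2024, 9, 30)},
]

def problems_after_cutoff(problems, training_cutoff):
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p["published"] > training_cutoff]

# A model with an assumed cutoff of 1 January 2024 is scored on p2 and p3 only.
fresh = problems_after_cutoff(problems, date(2024, 1, 1))
```

The filter is only as reliable as the stated training cutoff, so a model whose cutoff is undisclosed or wrong can still be scored on problems it has seen.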
Independent evaluation organizations. A small ecosystem of third parties now exists between the labs and the public:
| Organization | Role |
|---|---|
| Epoch AI | Nonprofit benchmarking hub; maintains FrontierMath and cross-model capability data |
| METR | Autonomy and long-task evaluations; pre-deployment testing for frontier labs |
| LMArena | Crowdsourced human preference rankings |
| ARC Prize Foundation | Fluid-intelligence benchmarks with private test sets |
| Artificial Analysis | Independent cross-provider benchmarking of quality, price, and speed |
| AI Security Institute (UK) | Government pre-deployment safety testing; Inspect framework |
Enterprise evaluation tooling. Production teams increasingly treat evals as core engineering infrastructure: curated "golden" datasets, regression suites run in continuous integration, LLM-as-judge scoring with human spot checks, and platforms such as LangSmith, Braintrust, Arize Phoenix, Weights & Biases Weave, and Patronus AI, alongside open frameworks like OpenAI Evals, Inspect, and lm-evaluation-harness.
No single fix has resolved the underlying tension: any measurement that becomes influential invites optimization against it. The practical consensus as of 2026 is to rely on portfolios of evaluations, to refresh or withhold test data, and to weight independent and task-based results over self-reported benchmark tables.