# Validation Set

> Source: https://aiwiki.ai/wiki/validation_set
> Updated: 2026-06-21
> Categories: Machine Learning, Model Evaluation
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **validation set** (also called a **development set** or **dev set**) is a subset of labeled data that is held out from the [training set](/wiki/training_set) and used to evaluate a [model](/wiki/model)'s performance during development, before any final assessment on the [test set](/wiki/test_set). It is the partition practitioners consult to tune [hyperparameters](/wiki/hyperparameter), select among competing models, and decide when to stop training, acting as a proxy for how well a model will generalize to unseen data without contaminating the test set.

The validation set is one of the three canonical partitions of labeled data in [supervised learning](/wiki/supervised_learning), alongside the training set and the test set. While the training set teaches the model and the test set delivers a final unbiased verdict, the validation set is where day-to-day modeling decisions are made: which architecture to keep, when to halt training, and which hyperparameters yield the best generalization. The careful, disciplined use of a validation set is often the difference between a model that performs well in the lab and one that survives contact with real-world data.

## What is a validation set used for?

In supervised learning, the available labeled data is typically divided into two or three non-overlapping partitions: a [training set](/wiki/training_set), a validation set, and a [test set](/wiki/test_set). The training set is used to fit the model's internal [parameters](/wiki/parameter) (such as [weights](/wiki/weight) in a [neural network](/wiki/neural_network)), while the validation set provides an independent evaluation of the model's fit during training. The test set is reserved for the final, unbiased assessment of the finished model.

The validation set serves several key purposes:

- **[Hyperparameter](/wiki/hyperparameter) tuning.** The validation set is the standard benchmark for comparing models trained with different hyperparameter configurations (for example, different [learning rates](/wiki/learning_rate), regularization strengths, or network architectures). By measuring each candidate model's performance on the validation set, practitioners select the configuration that generalizes best.
- **Model selection.** When evaluating multiple algorithms or architectures, the validation set is used to pick the best-performing approach before final evaluation on the test set.
- **[Early stopping](/wiki/early_stopping).** For iterative training algorithms such as [gradient descent](/wiki/stochastic_gradient_descent_sgd), the validation set is monitored at each [epoch](/wiki/epoch) to detect when performance begins to degrade, a signal that the model is starting to overfit.
- **Guarding the [test set](/wiki/test_set).** By making all development decisions using the validation set, the test set remains completely untouched and can provide a trustworthy estimate of real-world performance.
- **Diagnosing bias and variance.** Comparing training and validation error reveals whether a model is suffering from high bias (both errors high) or high variance (large gap between training and validation error). This diagnostic guides whether to add capacity, gather more data, or strengthen [regularization](/wiki/regularization).
- **Calibration and threshold selection.** For probabilistic classifiers, the validation set is used to fit calibration maps (Platt scaling, isotonic regression) and to choose decision thresholds that balance precision and recall.

The word "validation" can be misleading because it suggests a single act of confirmation. In practice, the validation set is consulted hundreds or thousands of times during a project. That is precisely why it is kept separate from the test set: any data used repeatedly to make decisions becomes contaminated by selection bias.

## How does a validation set differ from a test set?

One of the most common sources of confusion in [machine learning](/wiki/machine_learning) is the distinction between the validation set and the test set. Although both are used to evaluate a model on data it was not trained on, their roles in the workflow are fundamentally different.

| Aspect | Validation Set | [Test Set](/wiki/test_set) |
|---|---|---|
| **When used** | During model development (repeatedly) | After all development is complete (ideally once) |
| **Primary purpose** | Tune [hyperparameters](/wiki/hyperparameter) and select models | Provide an unbiased final performance estimate |
| **Influence on the model** | Indirectly shapes the model through selection decisions | Should have zero influence on the model |
| **Allowed to look at results** | Yes, results guide further development | Looking at results and then changing the model introduces bias |
| **Typical usage frequency** | Many times throughout training | Once, at the very end |
| **Recommended size** | Large enough to discriminate between models | Large enough to give tight confidence intervals on the final metric |
| **Public visibility** | Often shared with collaborators | Sometimes sealed by an external party (Kaggle private leaderboard, NIST holdouts) |

The critical rule is that the test set should never be used to make any training or design decision. If test set performance is used to adjust hyperparameters or select among candidate models, the test set effectively becomes a second validation set, and the reported performance will be optimistically biased [1]. Cawley and Talbot (2010) showed that this danger applies to the validation procedure itself: in their analysis of model selection, they warn that "the variance of the model selection criterion can result in over-fitting in model selection, resulting in a form of selection bias" that makes naive performance estimates unreliable [5].

In competitions and academic benchmarks, this discipline is often enforced by infrastructure rather than by the honor system. Kaggle competitions split the test set into a public leaderboard portion (visible during the competition) and a private leaderboard portion (revealed only at the end). Submissions ranked on the public board can suffer from leaderboard overfitting, while final standings on the private board reveal which teams generalized. The [GLUE](/wiki/glue) and [SuperGLUE](/wiki/superglue) language understanding benchmarks similarly hide test labels behind a submission server.

## How are validation sets created?

### Hold-out validation

The simplest approach is to randomly split the available data into a training portion and a validation portion before training begins. This is called hold-out validation (or a simple train/validation split). Common split ratios include:

| Dataset Size | Typical Train : Validation : Test Split |
|---|---|
| Small (hundreds to low thousands) | 70 : 15 : 15 or 80 : 10 : 10 |
| Medium (tens of thousands) | 80 : 10 : 10 |
| Large (hundreds of thousands or more) | 90 : 5 : 5 or even 98 : 1 : 1 |

With very large datasets, even 1% of the data can contain tens of thousands of examples, which is more than enough to estimate performance reliably. For smaller datasets, devoting a larger fraction to validation ensures that the performance estimate is stable [2].

When splitting data, practitioners should ensure that the validation set follows the same probability distribution as the training set. For classification problems with imbalanced classes, **stratified splitting** preserves the original class proportions in both the training and validation partitions. For grouped data (such as multiple measurements from the same patient or multiple frames from the same video), group-aware splitting ensures that all samples from a single group appear in only one partition.

In scikit-learn, the function `train_test_split` performs hold-out splitting and accepts a `stratify` argument that takes the class labels and ensures proportional representation. The companion splitter class `GroupShuffleSplit` takes a `groups` array in its `split` method and guarantees that no group is split across partitions.
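As a minimal sketch of both splitting styles, the example below builds an 80 : 10 : 10 stratified split and a separate group-aware split. The data is synthetic, and the names `X_rest`, `groups`, and `shared_groups` are illustrative assumptions rather than anything prescribed by scikit-learn.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 900 + [1] * 100)  # synthetic labels, 10% positives

# Stratified 80/10/10 split: hold out 20%, then halve the held-out part.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)
print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())

# Group-aware split: 200 hypothetical patients with 5 rows each.
groups = np.repeat(np.arange(200), 5)
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(X, y, groups=groups))
shared_groups = set(groups[train_idx]) & set(groups[val_idx])
print("groups present in both partitions:", len(shared_groups))
```

Stratification keeps the positive rate the same in every partition, while the group-aware splitter trades exact partition sizes for the guarantee that no group straddles the boundary.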

### [Cross-validation](/wiki/cross_validation)

In [k-fold cross-validation](/wiki/cross_validation), the training data is divided into *k* folds of approximately equal size. The model is trained *k* times, each time using a different fold as the validation set and the remaining *k* - 1 folds as the training set. The final performance metric is the average across all *k* runs.

This approach is especially valuable when data is limited, because every sample serves as part of the validation set exactly once and as part of the training set *k* - 1 times. Common choices are *k* = 5 or *k* = 10. Stratified k-fold ensures that each fold preserves the class distribution of the full dataset [3].

Cross-validation is implemented in scikit-learn through `cross_val_score` and `cross_validate`, with splitter classes such as `KFold`, `StratifiedKFold`, `GroupKFold`, and `TimeSeriesSplit` defining how folds are constructed. Each splitter respects a different invariant, so choosing the correct splitter for the data structure is critical.
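The following sketch runs stratified 5-fold cross-validation on synthetic data and also checks the property that each sample is validated exactly once. The dataset, the logistic regression model, and the `val_counts` bookkeeping array are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

# One accuracy score per fold; the mean and spread summarize the run.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Count how many times each sample lands in a validation fold.
val_counts = np.zeros(len(y), dtype=int)
for _, val_idx in cv.split(X, y):
    val_counts[val_idx] += 1
print("every sample validated once:", bool((val_counts == 1).all()))
```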

### Hold-out vs. k-fold: which should you use?

| Factor | Hold-Out Validation | K-Fold [Cross-Validation](/wiki/cross_validation) |
|---|---|---|
| **Computational cost** | Low (train once) | Higher (train *k* times) |
| **Data efficiency** | Lower (part of data is never used for training) | Higher (all data used for both training and validation) |
| **Variance of estimate** | Higher (depends on which samples land in which split) | Lower (averages over *k* different splits) |
| **Bias of estimate** | Can be higher with small data | Generally lower |
| **Best suited for** | Large datasets where a single split is representative | Small to medium datasets |
| **Communicates uncertainty** | Single number with no built-in error bar | Mean and standard deviation across folds |

For [deep learning](/wiki/deep_learning) models that are expensive to train, hold-out validation is often the pragmatic choice, since training a large neural network *k* times can be prohibitively slow. For classical [machine learning](/wiki/machine_learning) algorithms with faster training times, k-fold cross-validation is the standard practice [4].

## What are the main cross-validation variants?

While basic k-fold is the workhorse, the right cross-validation scheme depends on the structure of the data. Choosing the wrong scheme is one of the most common ways to produce overly optimistic validation scores.

| Variant | How it works | Best for | Implementation |
|---|---|---|---|
| **K-fold** | Split into *k* equal folds; each fold serves once as validation | i.i.d. data, regression, balanced classification | `KFold` |
| **Stratified k-fold** | Preserves class proportions in each fold | Classification with class imbalance | `StratifiedKFold` |
| **Group k-fold** | Ensures samples from the same group never span fold boundaries | Patient records, user sessions, document sentences | `GroupKFold` |
| **Stratified group k-fold** | Combines stratification with group constraints | Imbalanced classification with grouped samples | `StratifiedGroupKFold` |
| **Time-series split** | Forward chaining: training fold always precedes validation fold | Forecasting, sequential data | `TimeSeriesSplit` |
| **Repeated k-fold** | Runs k-fold multiple times with different shuffles | Reducing variance of the estimate | `RepeatedKFold`, `RepeatedStratifiedKFold` |
| **Leave-one-out (LOOCV)** | k = n; each sample takes its turn as a single-element validation set | Very small datasets (n < 50) | `LeaveOneOut` |
| **Leave-p-out** | Each combination of p samples serves once as the validation set | Theoretical analysis; rarely practical for p > 1 | `LeavePOut` |
| **Leave-one-group-out** | Each unique group is held out as a validation fold | Multi-site studies, federated data | `LeaveOneGroupOut` |
| **Shuffle split** | Random subsets repeated for a fixed number of iterations | Quick estimation when k-fold is overkill | `ShuffleSplit` |
| **Nested cross-validation** | Inner loop tunes hyperparameters, outer loop estimates generalization | Reporting unbiased results when tuning is part of the pipeline | Manual composition of CV objects |
| **Blocked cross-validation** | Contiguous chunks of time or space form folds with gaps between them | Spatial data, autocorrelated time series | Custom splitters |

For time series, the standard rule is that any validation point must come strictly after every training point. Randomly shuffling a time series silently leaks the future into the past and can inflate metrics substantially. The `TimeSeriesSplit` object implements expanding-window forward chaining: with a validation window of n samples, fold 1 trains on [1..n] and validates on [n+1..2n], fold 2 trains on [1..2n] and validates on [2n+1..3n], and so on. Some forecasting workflows further insert a gap between training and validation (the `gap` parameter of `TimeSeriesSplit`) to prevent leakage from short-range autocorrelation.
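To make the forward-chaining pattern concrete, this small sketch prints the index sets that `TimeSeriesSplit` produces for 12 ordered observations, once without and once with a one-step gap. The toy array and the variable names `folds` and `gap_folds` are illustrative assumptions; note that indices here are 0-based.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations

# Expanding window: each validation block of 3 follows its training block.
tscv = TimeSeriesSplit(n_splits=3, test_size=3)
folds = [(train.tolist(), val.tolist()) for train, val in tscv.split(X)]
for train, val in folds:
    print("train", train, "-> validate", val)

# Same scheme, but drop one observation between training and validation.
tscv_gap = TimeSeriesSplit(n_splits=3, test_size=3, gap=1)
gap_folds = [(train.tolist(), val.tolist()) for train, val in tscv_gap.split(X)]
print("first fold with gap:", gap_folds[0])
```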

### What is nested cross-validation?

When hyperparameter tuning and final evaluation both use the same cross-validation procedure, the reported score is biased upward because the chosen hyperparameters were selected to maximize that very score. **Nested cross-validation** addresses this by separating the two responsibilities into two nested loops. The inner loop performs hyperparameter search on each outer training fold, and the outer loop computes a held-out score using a fold the inner loop never saw. The outer scores are then averaged.

For a typical 5x5 nested scheme with grid search over 100 configurations:

- The outer loop trains and evaluates 5 times.
- For each outer training fold, the inner loop performs a 5-fold search across 100 configurations, requiring 500 fits.
- Total fits: 5 outer x 5 inner x 100 configurations = 2,500 fits, plus 5 final fits with the chosen configuration.

Nested cross-validation is the gold standard for academic publications that compare algorithms, because non-nested estimates can lure a researcher into overestimating generalization performance, particularly when the inner-fold scores have large standard deviations and the maximum is taken across many candidate configurations [5].
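The fit count above, and the nesting itself, can be sketched as follows. The helper `nested_cv_fits` is a hypothetical function written for this example, and the scikit-learn part uses a deliberately tiny 3x3 scheme with three candidate values of `C` so that it runs quickly; it is an illustration under those assumptions, not a recommended configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC


def nested_cv_fits(outer_folds, inner_folds, n_configs, refit=True):
    """Count model fits in nested CV with an exhaustive inner grid search."""
    search_fits = outer_folds * inner_folds * n_configs
    final_fits = outer_folds if refit else 0  # one refit per outer fold
    return search_fits + final_fits


total = nested_cv_fits(outer_folds=5, inner_folds=5, n_configs=100)
print("fits for the 5x5 scheme with 100 configurations:", total)

# Small runnable instance: the inner search is cloned and refit per outer fold.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
inner = KFold(n_splits=3, shuffle=True, random_state=0)
outer = KFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner)
outer_scores = cross_val_score(search, X, y, cv=outer)
print("outer-fold scores:", outer_scores.round(3))
```

Passing the search object itself to `cross_val_score` is what keeps each outer validation fold unseen by the hyperparameter selection.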

## How does a validation set drive early stopping?

[Early stopping](/wiki/early_stopping) is a [regularization](/wiki/regularization) technique that uses the validation set to determine when to halt training. During each epoch of training, the model's [loss](/wiki/loss) (or another metric such as [accuracy](/wiki/accuracy)) is computed on both the training set and the validation set.

In a typical training run:

1. Both training [loss](/wiki/loss) and validation loss decrease during the early epochs as the model learns useful patterns.
2. At some point, the training loss continues to decrease while the validation loss levels off or starts to increase. This divergence signals that the model is beginning to memorize the training data rather than learning generalizable patterns, a phenomenon known as [overfitting](/wiki/overfitting).
3. Early stopping halts training when the validation loss has not improved for a specified number of consecutive epochs (a threshold called **patience**).
4. The model checkpoint with the lowest validation loss is restored as the final model.

Typical patience values range from 3 to 10 epochs, depending on the dataset and model complexity. Early stopping is widely used in training [neural networks](/wiki/neural_network) because it is simple to implement and effective at preventing [overfitting](/wiki/overfitting) without requiring manual tuning of the number of training epochs [6].
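A patience rule of this kind can be sketched in a few lines of framework-independent Python. The function name `early_stopping_epoch` and the list of validation losses are made up for illustration; real training loops would compute the loss each epoch and save a checkpoint whenever it improves.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a patience rule."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: this is where a checkpoint would be saved.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted.
    return len(val_losses) - 1, best_epoch


val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stop at epoch {stop_epoch}, restore checkpoint from epoch {best_epoch}")
```

In this toy run the loss bottoms out at epoch 3, three non-improving epochs follow, and training halts at epoch 6 with the epoch-3 checkpoint restored.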

Prechelt's classic 1998 study cataloged several variants of the early stopping criterion, ranging from a simple "stop when the validation error has not improved for *p* epochs" rule to more elaborate definitions based on the generalization-to-progress quotient (GL/Pk). The simplest patience-based scheme is the most widely used in practice because of its robustness and ease of implementation [6].

## How do you interpret validation loss curves?

Plotting training loss and validation loss across epochs produces learning curves that provide diagnostic information about the model's behavior.

| Curve Pattern | Training Loss | Validation Loss | Diagnosis |
|---|---|---|---|
| **Good fit** | Decreases and stabilizes | Decreases and stabilizes close to training loss | The model generalizes well |
| **[Overfitting](/wiki/overfitting)** | Continues to decrease | Decreases then increases (diverges from training loss) | The model memorizes training data |
| **[Underfitting](/wiki/underfitting)** | Remains high | Remains high, mirrors training loss | The model is too simple or needs more training |
| **Oscillating loss** | Fluctuates erratically | Fluctuates erratically | [Learning rate](/wiki/learning_rate) may be too high or data may have quality issues |
| **Validation lower than training** | Higher than expected | Below training loss | Distribution mismatch, label noise in training, or [dropout](/wiki/dropout_regularization) inflating training loss |
| **Sudden spike in validation** | Smooth | Sudden jump | Catastrophic step due to [learning rate](/wiki/learning_rate) instability or bad batch |

The gap between the training loss and the validation loss is sometimes called the **generalization gap**. A small generalization gap indicates that the model's performance on the training data is a good predictor of its performance on new data. A large and growing gap is a classic indicator of [overfitting](/wiki/overfitting) [7].

A related diagnostic is the **learning curve** plotted against training set size rather than epoch number. By training the same model on increasing fractions of the data and recording validation error, the learning curve reveals whether the model is data-limited (validation error is still falling as more data is added) or capacity-limited (validation error has plateaued and adding data will not help). This guides whether to invest in more labels or in a larger model.
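As a rough illustration, scikit-learn's `learning_curve` helper computes these size-versus-score curves directly. The synthetic dataset and the logistic regression model below are assumptions chosen to keep the sketch fast; the scores it prints are accuracies, so higher is better.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Train on 20%..100% of each training fold and score on the held-out fold.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  validation={va:.3f}  gap={tr - va:.3f}")
```

A validation score that is still climbing at the largest size suggests a data-limited model, while a flat curve suggests that more capacity or better features would help more than more labels.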

## How is the validation set used in hyperparameter tuning?

Hyperparameters are settings that are not learned during training but must be specified before training begins. Examples include the [learning rate](/wiki/learning_rate), the number of [hidden layers](/wiki/hidden_layer), [dropout](/wiki/dropout_regularization) rate, [batch size](/wiki/batch_size), and [regularization](/wiki/regularization) strength.

The standard workflow for hyperparameter tuning is:

1. Define a set of hyperparameter configurations to evaluate (through grid search, random search, or [Bayesian optimization](/wiki/bayesian_optimization)).
2. For each configuration, train the model on the training set.
3. Evaluate the trained model on the validation set.
4. Select the hyperparameter configuration that achieves the best validation performance.
5. Retrain the final model on the combined training and validation data using the selected hyperparameters.
6. Evaluate once on the test set to get an unbiased performance estimate.

When using [cross-validation](/wiki/cross_validation) for hyperparameter tuning, steps 2 and 3 are repeated for each fold, and the average validation performance across folds is used to compare configurations. This gives a more robust estimate of each configuration's quality, especially with limited data [8].
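The six-step workflow can be sketched end to end with a single hold-out validation set. The candidate values of `C`, the synthetic data, and the 60/20/20 split are illustrative assumptions; any estimator and search space could take their place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Carve off the test set first, then split the rest into train and validation.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, stratify=y_dev, random_state=0
)

# Steps 1-3: train each candidate on the training set, score on validation.
candidates = [0.01, 0.1, 1.0, 10.0]
val_scores = {}
for C in candidates:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_scores[C] = model.score(X_val, y_val)

# Step 4: pick the configuration with the best validation accuracy.
best_C = max(val_scores, key=val_scores.get)

# Step 5: retrain on training + validation data with the chosen setting.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_dev, y_dev)

# Step 6: a single evaluation on the untouched test set.
test_accuracy = final_model.score(X_test, y_test)
print(f"best C={best_C}, test accuracy={test_accuracy:.3f}")
```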

### Hyperparameter search methods

Validation performance is the objective that hyperparameter search algorithms optimize. The choice of search method affects how quickly a good configuration is found and how thoroughly the space is explored.

| Method | How it works | Strengths | Weaknesses | Typical tools |
|---|---|---|---|---|
| **[Grid search](/wiki/grid_search)** | Exhaustively evaluates every combination on a predefined grid | Simple, reproducible, embarrassingly parallel | Combinatorial explosion in high dimensions; wastes effort on unimportant axes | scikit-learn `GridSearchCV` |
| **Random search** | Samples configurations independently from specified distributions over the search space | Often beats grid search for the same compute budget; trivially parallel | No memory of past trials | scikit-learn `RandomizedSearchCV` |
| **[Bayesian optimization](/wiki/bayesian_optimization)** | Builds a surrogate model of validation performance and chooses next trial by an acquisition function (Expected Improvement, UCB) | Sample-efficient; uses past trials to guide future ones | Hard to parallelize naively; surrogate cost grows with trials | scikit-optimize, GPyOpt, BoTorch |
| **Tree-structured Parzen Estimator (TPE)** | Models the densities of good and bad configurations separately | Handles conditional and discrete spaces well | Can underperform Gaussian processes on smooth low-dim spaces | Optuna, Hyperopt |
| **Hyperband** | Successive halving across multiple bracket widths; cheaply rules out bad trials | Anytime algorithm; strong results on deep learning | Requires a meaningful budget knob (epochs, data fraction) | Ray Tune, Optuna |
| **BOHB** | Combines Bayesian sampling with Hyperband's resource allocation | Sample-efficient and budget-aware | More complex to implement and debug | HpBandSter, Ray Tune |
| **Halving search (SH)** | Trains all candidates on a small budget, halves them by validation score, doubles budget, repeats | Fast under tight compute | Brittle to noisy validation scores at small budgets | scikit-learn `HalvingGridSearchCV`, `HalvingRandomSearchCV` |
| **Population-based training (PBT)** | Evolves a population of models; periodically replaces underperformers and perturbs hyperparameters | Adapts hyperparameter schedules during training | Stochastic, hard to reproduce exactly | DeepMind's PBT, Ray Tune PBT |
| **Evolutionary search** | Mutates and recombines configurations across generations | No gradient or surrogate needed | Can be sample-inefficient | DEAP, NSGA-II implementations |

The theoretical insight behind random search is that hyperparameter response surfaces typically have low effective dimensionality: only a few hyperparameters meaningfully affect performance, and grid search wastes most of its evaluations along unimportant axes. Bergstra and Bengio (2012) showed that random search finds equally good or better configurations than grid search using the same budget, especially as the dimensionality of the space grows. In their words, "randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid," because for most datasets "only a few of the hyper-parameters really matter" [9].
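The low-effective-dimensionality argument can be illustrated with a toy count. Suppose, purely as an assumption for this sketch, that only the first of two hyperparameters matters: a 3x3 grid then probes just three distinct values of it in nine trials, while nine random trials probe nine.

```python
import random

# Nine grid trials: every combination of three values per hyperparameter.
grid_trials = [(a, b) for a in (0.1, 0.5, 0.9) for b in (0.1, 0.5, 0.9)]

# Nine random trials drawn independently from the unit square.
rng = random.Random(0)
random_trials = [(rng.random(), rng.random()) for _ in range(9)]

# Distinct values tried along the (hypothetically) important first axis.
distinct_grid = len({a for a, _ in grid_trials})
distinct_random = len({a for a, _ in random_trials})
print("grid search explores", distinct_grid, "values of the important axis")
print("random search explores", distinct_random, "values of the important axis")
```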

Hyperband (Li et al., 2017) reformulates hyperparameter search as a non-stochastic best-arm identification problem and uses a principled early-stopping schedule. It calls successive halving as a subroutine: start with a large set of configurations and a small budget, evaluate all of them, then keep only the top 1/η fraction (the top half when η = 2; the Hyperband authors suggest η = 3 or 4) and multiply the survivors' budget by η. Hyperband sweeps over different initial pool sizes to balance exploration and exploitation. In Li et al.'s benchmarks, Hyperband achieved more than an order-of-magnitude speedup over standard Bayesian optimization on several deep learning workloads [10].
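The successive-halving subroutine can be sketched in plain Python. Everything specific here is an assumption for illustration: the function names, the halving factor `eta=2`, and the toy `evaluate` function, which pretends that validation score peaks at a learning rate of 0.01 and improves with budget. This is the subroutine only, not the full Hyperband bracket sweep.

```python
import math


def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the top 1/eta of configs each round; multiply the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    history = []  # (budget, number of configs evaluated) per round
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        history.append((budget, len(survivors)))
        survivors = ranked[: max(1, len(survivors) // eta)]
        budget *= eta
    return survivors[0], history


def evaluate(learning_rate, budget):
    """Hypothetical validation score: best near lr=0.01, better with budget."""
    distance = (math.log10(learning_rate) + 2) ** 2
    return 1.0 - distance / 10 - 1.0 / budget


configs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]
best, history = successive_halving(configs, evaluate)
spent = sum(budget * n for budget, n in history)
print("winner:", best, "rounds:", history, "budget units spent:", spent)
```

The toy run spends 24 budget units, compared with 32 if all eight configurations had been trained at the final budget of 4.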

BOHB (Falkner et al., 2018) replaces Hyperband's random configuration sampling with a TPE-based model, retaining Hyperband's resource allocation while gaining the sample efficiency of Bayesian optimization. BOHB is a common choice in AutoML tooling because it combines fast early-stage progress with strong asymptotic performance [11].

### scikit-learn API for validation-based tuning

| API | Purpose | Notes |
|---|---|---|
| `train_test_split` | Single train/validation split | Supports `stratify`, `shuffle`, `random_state` |
| `cross_val_score` | Score a single metric on each CV fold | Returns an array with one score per fold; take the mean to summarize |
| `cross_validate` | Same as above with multi-metric scoring and timing | Returns a dict of arrays |
| `cross_val_predict` | Out-of-fold predictions for every sample | Useful for stacking and calibration |
| `GridSearchCV` | Exhaustive grid search with CV | `refit=True` retrains on all data with best params |
| `RandomizedSearchCV` | Random search with CV | Specify `n_iter` and parameter distributions |
| `HalvingGridSearchCV` | Successive halving over a grid | Experimental; faster than `GridSearchCV` for many candidates |
| `HalvingRandomSearchCV` | Successive halving over random samples | Combines speed of halving with breadth of random sampling |
| `validation_curve` | Sweep one parameter, plot training vs. validation score | Diagnostic for overfitting/underfitting on a single axis |
| `learning_curve` | Sweep training set size | Diagnostic for data-limited vs. capacity-limited regimes |

For distributed and large-scale tuning, dedicated libraries pick up where scikit-learn ends. **Optuna** offers a define-by-run API that lets the search space depend on previous samples, plus pruners that stop unpromising trials early. **Ray Tune** scales tuning across clusters and integrates schedulers like ASHA, PBT, and Hyperband. **Weights & Biases Sweeps** and **MLflow** add experiment tracking and visualization. **AutoML systems** such as Auto-sklearn, AutoGluon, H2O.ai, and Google Vertex AI fully automate algorithm and hyperparameter selection on the validation set, returning a tuned pipeline without manual intervention [12].

## What is validation set overfitting (and how do you prevent it)?

Although the validation set is not used to train the model's parameters directly, repeated evaluation on the same validation set can lead to a subtler form of [overfitting](/wiki/overfitting). When practitioners run many experiments, choosing hyperparameters and model designs based on validation performance, information from the validation set gradually leaks into the modeling decisions. Over time, the selected model may be tuned to perform well specifically on the validation set rather than on truly unseen data.

This problem is known as **validation set overfitting** or **adaptive overfitting**. Signs include:

- The chosen model performs well on the validation set but poorly on the test set.
- Performance on the validation set improves steadily across many rounds of experimentation, but test set performance does not follow the same trend.
- The final ensemble is dominated by the configurations that were tested most recently, suggesting recency bias rather than genuine improvement.

Strategies to mitigate validation set leakage include:

- **Using [cross-validation](/wiki/cross_validation)** instead of a single hold-out validation set, which reduces the chance of tuning to one specific data split.
- **Limiting the number of evaluations** on the validation set. Each time validation results influence a decision, some information leaks.
- **Maintaining a strict separation** between the validation and test sets. The test set should only be evaluated once, at the very end of the project.
- **Refreshing the validation set** periodically if new labeled data becomes available.
- **Differential privacy mechanisms** like the Reusable Holdout (Dwork et al., 2015), which adds calibrated noise to validation queries to bound the information that can be extracted across many evaluations [19].
- **Using nested cross-validation** for the final reported metric, so that the score reflects the entire pipeline including hyperparameter selection.

### Sources of leakage during preprocessing

Another common source of leakage is **preprocessing leakage**: fitting data transformations (such as [normalization](/wiki/normalization), [feature engineering](/wiki/feature_engineering), or imputation) on the entire dataset before splitting it into training and validation sets. Preprocessing steps should be fitted only on the training data and then applied to the validation and test sets [13].

| Leakage source | Example | Fix |
|---|---|---|
| **Scaling on the full dataset** | `StandardScaler().fit(X)` before split | Fit scaler on training fold only, transform validation fold |
| **Imputation with global statistics** | Filling NaN with the mean of the entire dataset | Compute mean on training fold only |
| **Target encoding leakage** | Using target statistics computed from the validation rows | Use out-of-fold target encoding |
| **Feature selection on full data** | Picking features by their correlation with the target on the full dataset | Run feature selection inside each CV fold |
| **Oversampling before split** | Applying SMOTE before splitting | Apply oversampling only inside the training fold |
| **Lookahead in time series** | Including future observations in moving-average features | Use only past data, with a sufficient lag |
| **Group leakage** | Same patient in train and validation | Use group-aware splitters such as `GroupKFold` |
| **Duplicate or near-duplicate rows** | Same image present multiple times | Deduplicate before splitting |
| **Train/validation contamination via embeddings** | Embeddings pretrained on data that overlaps validation | Track and disclose the pretraining corpus |

The scikit-learn `Pipeline` object exists in part to make leakage-free preprocessing easier: when the whole `Pipeline` is passed to a cross-validation routine such as `cross_val_score` or `GridSearchCV`, every transformer in it is fit only on the training fold, eliminating an entire category of bugs.
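A minimal sketch of leakage-free preprocessing: imputation and scaling live inside the pipeline, so their statistics are recomputed from the training portion of every fold. The synthetic data and the injected missing values are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X[::25, 0] = np.nan  # inject some missing values into the first feature

# Imputer and scaler are refit on the training fold inside each CV split.
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"leak-free CV accuracy: {scores.mean():.3f}")
```

The leaky alternative would call `fit_transform` on all of `X` before cross-validation, letting validation rows influence the imputed means and scaling statistics.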

### Why is group leakage so dangerous in medical imaging?

Per-group leakage is especially severe in medical imaging, where multiple slices, scans, or visits commonly come from the same patient. If the same patient appears in both the training and validation sets, the model can memorize patient-specific features (anatomy, scanner artifacts, demographics) and produce dramatically inflated metrics. A 2021 study on brain MRI classification reported that slice-wise random splitting boosted apparent slice-level accuracy by 30% on OASIS, 29% on ADNI, 48% on PPMI, and 55% on a local Parkinson's dataset compared with patient-wise splitting [14]. The same study ran a control experiment on randomly relabeled data: slice-level splitting produced about 96% accuracy on data with meaningless labels, while subject-level splitting correctly collapsed to about 50% (chance), a stark demonstration that the inflated scores were an artifact of leakage rather than real signal [14]. The accepted practice is to split at the patient level, not the slice or visit level, and to verify the split by checking that no patient identifier appears in more than one partition.
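The patient-level check can be sketched as a set intersection over identifiers. The 20 patients with 10 slices each are hypothetical, and `shared_patients` is a helper name invented for this example; the contrast with a shuffled slice-wise `KFold` shows how easily the same patient ends up on both sides.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

patient_ids = np.repeat(np.arange(20), 10)  # 20 hypothetical patients x 10 slices
X = np.zeros((len(patient_ids), 1))         # placeholder features
y = np.repeat(np.arange(20) % 2, 10)        # one label per patient


def shared_patients(train_idx, val_idx, ids):
    """Patients whose slices appear on both sides of a split."""
    return set(ids[train_idx]) & set(ids[val_idx])


# Patient-wise folds: the intersection should be empty in every fold.
grouped = GroupKFold(n_splits=5).split(X, y, groups=patient_ids)
group_overlaps = [len(shared_patients(tr, va, patient_ids)) for tr, va in grouped]

# Slice-wise random folds: patients leak across the boundary.
sliced = KFold(n_splits=5, shuffle=True, random_state=0).split(X)
slice_overlaps = [len(shared_patients(tr, va, patient_ids)) for tr, va in sliced]

print("shared patients per fold, patient-wise:", group_overlaps)
print("shared patients per fold, slice-wise:  ", slice_overlaps)
```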

Similar group-leakage failures occur in user-level recommender systems (same user split across folds), document-level NLP tasks (same document split across folds), and connectome-based neuroimaging studies (same subject's connectivity matrix split across folds).

## How is validation used for neural networks?

Validation sets are especially important when training [deep learning](/wiki/deep_learning) models because neural networks have a large number of [parameters](/wiki/parameter) and a strong capacity to memorize training data. Without a validation set, it is difficult to know when to stop training or which architecture works best.

In practice, training a neural network involves:

1. Splitting data into training, validation, and test sets.
2. Training the network on the training set for multiple [epochs](/wiki/epoch).
3. After each epoch (or after a fixed number of [batches](/wiki/batch)), computing the validation [loss](/wiki/loss) and any relevant metrics.
4. Saving the model [checkpoint](/wiki/checkpoint) whenever the validation metric improves.
5. Applying [early stopping](/wiki/early_stopping) if the validation metric has not improved for several epochs.
6. Loading the best checkpoint and evaluating on the test set.

Modern deep learning frameworks such as [PyTorch](/wiki/pytorch) and [TensorFlow](/wiki/tensorflow), together with the libraries built on them, provide callbacks and utilities for monitoring validation metrics during training. For example, PyTorch Lightning's `EarlyStopping` callback and Keras's `ModelCheckpoint` callback automate the process of tracking the validation loss and saving the best model. Training scripts based on Hugging Face's `transformers.Trainer` similarly set an evaluation schedule through `TrainingArguments` (the `evaluation_strategy` option, renamed `eval_strategy` in recent releases), which controls how often the trainer runs the validation loop.

### How do large language models use a validation set?

For large pretrained models, validation takes on a slightly different shape. During pretraining, a held-out portion of the corpus is used to compute validation [perplexity](/wiki/perplexity) (or equivalently bits-per-byte or cross-entropy on next-token prediction). The Chinchilla scaling laws (Hoffmann et al., 2022) used this validation loss as the objective for compute-optimal scaling, sweeping model and data size combinations under a fixed compute budget to find configurations that minimize held-out loss. Their central finding was that "for compute-optimal training, the model size and the number of training tokens should be scaled equally": doubling the model size should be matched by doubling the training tokens [15]. To test this prediction they trained Chinchilla, a 70 billion parameter model, on 1.4 trillion tokens using the same compute budget as the 280 billion parameter Gopher model (trained on roughly 300 billion tokens), and Chinchilla outperformed Gopher despite being four times smaller [15]. The validation set in this setting is large (tens of millions of tokens) and chosen to be representative of the pretraining distribution, with separate "validation" and "test" splits maintained from the start.
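
The relationship between cross-entropy, perplexity, and bits-per-byte can be made concrete with a small calculation. This sketch assumes the per-token natural-log probabilities have already been collected from a model on the held-out text; `validation_metrics` and `num_bytes` (the UTF-8 length of that text) are names introduced for the example.

```python
import math

def validation_metrics(token_log_probs, num_bytes):
    """Compute held-out cross-entropy, perplexity, and bits-per-byte.

    token_log_probs: natural-log probability assigned to each held-out token.
    num_bytes: size of the same held-out text in bytes.
    """
    n_tokens = len(token_log_probs)
    total_nll = -sum(token_log_probs)              # total negative log-likelihood, in nats
    cross_entropy = total_nll / n_tokens           # nats per token
    perplexity = math.exp(cross_entropy)
    bits_per_byte = total_nll / math.log(2) / num_bytes
    return cross_entropy, perplexity, bits_per_byte

# Toy input: 4 tokens, each predicted with probability 0.25, covering 16 bytes.
log_probs = [math.log(0.25)] * 4
ce, ppl, bpb = validation_metrics(log_probs, num_bytes=16)
print(round(ce, 4), round(ppl, 4), round(bpb, 4))
# prints: 1.3863 4.0 0.5
```

Perplexity and cross-entropy depend on the tokenizer, so they are comparable only between models that share one; bits-per-byte normalizes by the raw text size instead.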

Downstream evaluation, on the other hand, uses curated benchmark suites rather than perplexity. The **LM Evaluation Harness** (`lm-evaluation-harness`, EleutherAI) provides a unified interface for over 200 tasks including MMLU, HellaSwag, ARC, GSM8K, and BBH, supporting multiple-choice, generation, and likelihood-based scoring. **HELM** (Holistic Evaluation of Language Models, Liang et al., 2022) measures seven properties (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) across 16 core scenarios. **BIG-bench** offers more than 200 tasks contributed by hundreds of researchers, while **MT-Bench**, **AlpacaEval**, and **Chatbot Arena** focus on instruction-following and conversational quality [16].

A recurring pitfall in modern LLM evaluation is that benchmark validation sets often leak into the pretraining corpus, especially as web-scale crawls grow. Practitioners mitigate this with **decontamination** procedures: hashing benchmark text and removing matches from the training corpus, or using held-out splits constructed after the model's data cutoff. Even so, comparing scores across reports requires fixing the harness, prompt templates, and scoring rules. A common mistake is comparing model A's lm-evaluation-harness score against model B's HELM score: different harnesses use different prompt templates, few-shot examples, and aggregation rules, so the numbers are not directly comparable.
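
A hash-based decontamination pass can be sketched as follows. This is an illustrative simplification: the whitespace tokenization, lowercasing, the choice of n-gram length, and the rule "drop any document sharing one n-gram with a benchmark" are assumptions for the example, and production pipelines differ in all of them.

```python
import hashlib

def ngram_hashes(text, n=8):
    """Hash every n-gram of lowercased, whitespace-split tokens."""
    tokens = text.lower().split()
    # Texts shorter than n tokens yield an empty set.
    return {
        hashlib.sha1(" ".join(tokens[i:i + n]).encode("utf-8")).hexdigest()
        for i in range(len(tokens) - n + 1)
    }

def decontaminate(training_docs, benchmark_texts, n=8):
    """Split training docs into (kept, removed) by n-gram overlap with benchmarks."""
    benchmark_index = set()
    for text in benchmark_texts:
        benchmark_index |= ngram_hashes(text, n)

    kept, removed = [], []
    for doc in training_docs:
        if ngram_hashes(doc, n) & benchmark_index:
            removed.append(doc)       # shares at least one n-gram with a benchmark
        else:
            kept.append(doc)
    return kept, removed

# Toy data (hypothetical benchmark item and training documents).
benchmark = ["What is the capital of France? The capital is Paris."]
training = [
    "Forum post: what is the capital of france? the capital is paris. Thanks!",
    "A recipe for sourdough bread with a long overnight fermentation step.",
]
kept, removed = decontaminate(training, benchmark, n=5)
print(len(kept), len(removed))
# prints: 1 1
```

Exact n-gram matching misses paraphrased or reformatted benchmark items, which is one reason held-out splits built after the data cutoff remain useful.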

## How big should a validation set be?

Choosing the right size for the validation set involves balancing two competing concerns:

- **Too small a validation set** leads to noisy, unreliable performance estimates. A difference of a few correctly or incorrectly classified examples can cause large swings in the measured metric.
- **Too large a validation set** reduces the amount of data available for training, potentially leading to a worse model.

General guidelines for sizing:

| Scenario | Recommendation |
|---|---|
| Small dataset (< 1,000 samples) | Use k-fold [cross-validation](/wiki/cross_validation) instead of a fixed validation set |
| Medium dataset (1,000 to 100,000 samples) | 10% to 20% for validation |
| Large dataset (> 100,000 samples) | 1% to 10% for validation (even 1% may yield thousands of samples) |
| Very large dataset (millions of samples) | 1% or less is often sufficient |

The validation set should be large enough to detect meaningful differences between candidate models. If the expected improvement from a hyperparameter change is small (for example, a 0.1% increase in [accuracy](/wiki/accuracy)), the validation set needs to be large enough for that difference to be statistically significant [17].

A quick rule of thumb based on the binomial standard error: with 1,000 validation samples and a true accuracy near 80%, the 95% confidence interval has a half-width of approximately 2.5 percentage points. With 10,000 samples, the same interval shrinks to about 0.8 percentage points. Researchers comparing models that differ by less than 1% in accuracy should ensure their validation set holds tens of thousands of examples or use cross-validation with many folds to reduce variance.
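
The figures above follow from the normal approximation to the binomial, where the half-width is $z\sqrt{p(1-p)/n}$. The short sketch below reproduces them and inverts the formula to estimate a required sample size; the function names are introduced here for illustration.

```python
import math

def accuracy_ci_half_width(accuracy, n_samples, z=1.96):
    """Half-width of the normal-approximation confidence interval for accuracy."""
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_samples)
    return z * standard_error

def samples_needed(accuracy, half_width, z=1.96):
    """Smallest n whose confidence-interval half-width is at most `half_width`."""
    return math.ceil(z**2 * accuracy * (1.0 - accuracy) / half_width**2)

for n in (1_000, 10_000):
    print(n, round(100 * accuracy_ci_half_width(0.80, n), 2))
# prints:
# 1000 2.48
# 10000 0.78

print(samples_needed(0.80, 0.005))
# prints: 24587
```

Resolving accuracy to within half a percentage point at 80% accuracy therefore takes roughly 25,000 validation examples under this approximation, consistent with the "tens of thousands" guideline.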

### Statistical tests for comparing models

When two candidate models produce similar validation scores, a single comparison of point estimates can be misleading. Practitioners often apply paired statistical tests:

| Test | Use case |
|---|---|
| **McNemar's test** | Two classifiers evaluated on the same validation set |
| **5x2 cross-validation paired t-test** (Dietterich) [20] | Robust comparison of two algorithms with limited data |
| **Wilcoxon signed-rank** | Non-parametric pairwise comparison across folds |
| **Bootstrap confidence intervals** | Estimating uncertainty around a single metric |
| **Friedman + Nemenyi** | Comparing many algorithms across multiple datasets |

These tests answer the question: given the size and variability of the validation results, is the observed difference plausibly real or just noise?
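
As an illustration of the first row, the following is a minimal sketch of the exact (binomial) form of McNemar's test, which looks only at the examples where the two classifiers disagree. The function name and the toy predictions are assumptions made for the example.

```python
import math

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts examples only model A gets right
    and c counts examples only model B gets right.
    """
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree
    # Under the null hypothesis, each disagreement favors A or B with probability 0.5.
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2**n
    return b, c, min(1.0, 2.0 * tail)

# Toy validation set of 100 examples: A alone is right on 12, B alone on 3.
y_true = [1] * 100
pred_a = [1] * 12 + [0] * 3 + [1] * 85
pred_b = [0] * 12 + [1] * 3 + [1] * 85
b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(b, c, round(p_value, 4))
# prints: 12 3 0.0352
```

Here the accuracies differ by nine percentage points (97% versus 88%), yet the evidence rests on only 15 discordant examples, which is why the p-value is not smaller.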

## Specialized validation strategies

Beyond standard hold-out and k-fold approaches, several specialized validation methods address specific data characteristics:

- **Stratified validation.** Ensures that class proportions in the validation set match those in the full dataset. This is critical for imbalanced classification problems where rare classes might be underrepresented or absent in a naive random split.
- **Group-based validation.** Prevents data from the same group (such as the same user, patient, or document) from appearing in both the training and validation sets. This avoids inflated performance estimates caused by the model recognizing group-level patterns rather than learning generalizable features.
- **Time-series validation.** For temporal data, the validation set must always come from a later time period than the training set to simulate realistic forecasting conditions. Scikit-learn's `TimeSeriesSplit` implements this by using expanding training windows with forward-looking validation windows.
- **Leave-one-out cross-validation (LOOCV).** An extreme case of k-fold where *k* equals the number of samples. Each sample is used as a single-item validation set while the rest serve as training data. LOOCV provides a nearly unbiased estimate but has high variance and is computationally expensive, so it is generally only practical for very small datasets (fewer than 50 samples) [18].
- **Out-of-distribution (OOD) validation.** A second validation set drawn from a target distribution different from training, used to check robustness. Common in domain generalization and continual learning research.
- **Adversarial validation.** Train a classifier to discriminate training from validation samples; if it succeeds easily, the splits differ in distribution and metrics may be misleading.
- **Spatial cross-validation.** For geospatial data, training and validation tiles are separated by a buffer zone to prevent spatial autocorrelation from leaking nearby measurements into both sets.
- **Reusable holdout.** Differential-privacy-based mechanisms (Dwork et al., 2015) that allow a validation set to be queried many times without overfitting, by adding calibrated noise to each answer [19].
- **Train/dev/test/dev-test splits.** Some workflows (such as Andrew Ng's Deep Learning Specialization) recommend two validation sets: a smaller "dev" set for fast iteration and a larger "dev-test" set for less frequent sanity checks.
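
Three of the strategies above (stratified, group-based, and time-series validation) map directly onto scikit-learn splitters. The sketch below uses a tiny synthetic dataset, invented for illustration, to show the property each splitter guarantees.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, TimeSeriesSplit

# Synthetic data: 12 time-ordered samples, 3 positives, 4 groups of 3 samples.
X = np.arange(12).reshape(-1, 1)
y = np.array([0] * 9 + [1] * 3)
groups = np.repeat(np.arange(4), 3)

# Time-series validation: every validation index comes after every training index.
ts_folds = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, val_idx in ts_folds:
    assert train_idx.max() < val_idx.min()

# Stratified validation: each validation fold keeps the 3:1 class ratio.
strat_folds = list(
    StratifiedKFold(n_splits=3, shuffle=True, random_state=0).split(X, y)
)
positives_per_fold = [int(y[val_idx].sum()) for _, val_idx in strat_folds]

# Group-based validation: no group appears on both sides of a fold.
group_folds = list(GroupKFold(n_splits=4).split(X, y, groups))
for train_idx, val_idx in group_folds:
    assert not set(groups[train_idx]) & set(groups[val_idx])

print(ts_folds[0][0].tolist(), ts_folds[0][1].tolist())
# prints: [0, 1, 2] [3, 4, 5]
print(positives_per_fold)
# prints: [1, 1, 1]
```

When both class balance and group integrity matter, `StratifiedGroupKFold` combines the two constraints on a best-effort basis.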

## Common pitfalls

Even experienced practitioners run into recurring failure modes when working with validation sets. The following table summarizes the ones most often seen in production projects.

| Pitfall | Symptom | Fix |
|---|---|---|
| Random shuffle on time-series data | Validation metrics vastly better than live performance | Use `TimeSeriesSplit` or temporal hold-out |
| Same patient or user in both splits | Per-group memorization inflates scores | Use `GroupKFold` and verify that no group ID appears in both splits |
| Preprocessing fit on full dataset | Subtle leakage that may go undetected | Wrap preprocessing inside a `Pipeline` and fit per fold |
| Stratification ignored | Rare classes missing from a fold | Use `StratifiedKFold` or `StratifiedGroupKFold` |
| Tuning on test set | Test scores cease to be unbiased | Reserve test set for final, single evaluation |
| Reporting cherry-picked seeds | Reproducibility fails for others | Report mean and standard deviation across seeds |
| Comparing models across different harnesses | Apples-to-oranges comparison | Pin the harness, prompts, and scoring rules |
| Forgetting to retrain on train + val | Final model uses less data than necessary | Refit with the chosen hyperparameters on the union |
| Validation set too small to discriminate | Noise dominates differences | Increase size or use cross-validation |
| Distribution drift between dev and prod | Lab metrics do not transfer | Refresh validation set from current production data |
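
Several of these fixes fit into one short workflow: preprocessing wrapped in a pipeline so it is refit inside each fold, a final refit on the combined training and validation data, and a single test evaluation. The sketch below assumes a synthetic dataset and a logistic regression model purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real labeled dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Set the test set aside once; all tuning happens on the development portion.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The scaler lives inside the pipeline, so each fold fits it on that fold's
# training part only and validation rows never influence the scaling.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv_scores = cross_val_score(model, X_dev, y_dev, cv=5)
print(round(cv_scores.mean(), 3), round(cv_scores.std(), 3))

# After model selection: refit on train + validation, then evaluate once on test.
model.fit(X_dev, y_dev)
test_accuracy = model.score(X_test, y_test)
print(round(test_accuracy, 3))
```

Calling `StandardScaler().fit(X)` on the full dataset before splitting would be the leaky variant of the same code, and the difference is easy to miss in review.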

## Validation in production machine learning

Validation does not stop when a model ships. Production systems usually maintain a **shadow validation set** drawn from recent production traffic to detect concept drift, label shift, and feature pipeline regressions. Tools such as Evidently AI, Fiddler, and Arize compute distribution-shift statistics (population stability index, Kolmogorov-Smirnov statistic, Wasserstein distance) and prediction-quality metrics on this rolling validation slice. ML platforms (MLflow, Kubeflow, SageMaker) version validation datasets alongside trained models so that rerunning the same evaluation tomorrow produces the same number. Tracking the SHA hash of the validation set is a small but important discipline: a metric without a versioned dataset is unreproducible.
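
One lightweight way to track such a hash is to fingerprint the validation records themselves. The sketch below assumes the records are JSON-serializable dictionaries; `dataset_fingerprint` and the record layout are hypothetical, and hashing the serialized file is an equally valid choice.

```python
import hashlib
import json

def dataset_fingerprint(rows):
    """Return a SHA-256 fingerprint of a dataset, independent of row order."""
    # Serialize each row canonically (sorted keys) and hash it.
    row_hashes = sorted(
        hashlib.sha256(json.dumps(row, sort_keys=True).encode("utf-8")).hexdigest()
        for row in rows
    )
    # Hash the sorted row hashes so that shuffling rows does not change the result.
    return hashlib.sha256("\n".join(row_hashes).encode("utf-8")).hexdigest()

# Toy validation set (hypothetical records).
val_rows = [
    {"id": 1, "text": "good product", "label": 1},
    {"id": 2, "text": "arrived broken", "label": 0},
]
fingerprint = dataset_fingerprint(val_rows)

# Reordering rows keeps the fingerprint; changing a single label alters it.
assert dataset_fingerprint(list(reversed(val_rows))) == fingerprint
relabeled = [dict(val_rows[0], label=0), val_rows[1]]
assert dataset_fingerprint(relabeled) != fingerprint
print(fingerprint[:12])
```

Logging the fingerprint next to each reported metric makes it possible to tell later whether two numbers were computed on the same validation data.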

## Explain like I'm 5 (ELI5)

Imagine you are studying for a big test at school. You have a workbook full of practice problems. You use most of the problems to learn and practice (that is your [training set](/wiki/training_set)). But you save a few problems that you do not look at while studying. After you think you have studied enough, you try those saved problems to see if you really understand the material (that is your validation set). If you get them wrong, you go back and study differently. You keep checking with those saved problems until you do well.

Then, on test day, the teacher gives you brand-new problems you have never seen before (that is the [test set](/wiki/test_set)). Your score on those brand-new problems tells you how well you truly learned, not just how well you memorized the practice answers.

The validation set is like a practice quiz you give yourself before the real test. It helps you figure out the best way to study without spoiling the real test.

If you peek at the practice quiz too many times and only study what is on it, you might do great on the practice but fail the real test.

## References

1. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning*. Springer. Chapter 7: Model Assessment and Selection.
2. Ng, A. (2018). "Train / Dev / Test sets." *Deep Learning Specialization*, Coursera.
3. Kohavi, R. (1995). "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection." *Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI)*, 2, 1137-1143.
4. Raschka, S. (2018). "Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning." *arXiv preprint arXiv:1811.12808*.
5. Cawley, G. C., & Talbot, N. L. C. (2010). "On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation." *Journal of Machine Learning Research*, 11, 2079-2107.
6. Prechelt, L. (1998). "Early Stopping - But When?" In *Neural Networks: Tricks of the Trade*, Springer, 55-69.
7. Google Developers. "Overfitting: Interpreting Loss Curves." *Machine Learning Crash Course*. https://developers.google.com/machine-learning/crash-course/overfitting/interpreting-loss-curves
8. Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12, 2825-2830.
9. Bergstra, J., & Bengio, Y. (2012). "Random Search for Hyper-Parameter Optimization." *Journal of Machine Learning Research*, 13, 281-305.
10. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., & Talwalkar, A. (2017). "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization." *Journal of Machine Learning Research*, 18(185), 1-52. arXiv:1603.06560.
11. Falkner, S., Klein, A., & Hutter, F. (2018). "BOHB: Robust and Efficient Hyperparameter Optimization at Scale." *Proceedings of the 35th International Conference on Machine Learning (ICML)*, 1437-1446.
12. Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). "Optuna: A Next-generation Hyperparameter Optimization Framework." *Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining*, 2623-2631.
13. Kaufman, S., Rosset, S., & Perlich, C. (2012). "Leakage in Data Mining: Formulation, Detection, and Avoidance." *ACM Transactions on Knowledge Discovery from Data*, 6(4), 1-21.
14. Yagis, E., Atnafu, S. W., Garcia Seco de Herrera, A., et al. (2021). "Effect of data leakage in brain MRI classification using 2D convolutional neural networks." *Scientific Reports*, 11, 22544.
15. Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). "Training Compute-Optimal Large Language Models." *Advances in Neural Information Processing Systems*, 35. arXiv:2203.15556.
16. Liang, P., Bommasani, R., Lee, T., et al. (2022). "Holistic Evaluation of Language Models." *arXiv preprint arXiv:2211.09110*.
17. Guyon, I. (1997). "A Scaling Law for the Validation-Set Training-Set Size Ratio." AT&T Bell Laboratories.
18. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 5: Machine Learning Basics.
19. Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth, A. (2015). "The reusable holdout: Preserving validity in adaptive data analysis." *Science*, 349(6248), 636-638.
20. Dietterich, T. G. (1998). "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms." *Neural Computation*, 10(7), 1895-1923.