A validation set (also called a development set or dev set) is a subset of labeled data that is held out from the training set and used to evaluate a model's performance during development. It plays a central role in hyperparameter tuning, model selection, and early stopping, acting as a proxy for how well the model will generalize to unseen data without touching the final test set.
In supervised learning, the available labeled data is typically divided into two or three non-overlapping partitions: a training set, a validation set, and a test set. The training set is used to fit the model's internal parameters (such as weights in a neural network), while the validation set provides an independent evaluation of the model's fit during training. The test set is reserved for the final, unbiased assessment of the finished model.
The validation set serves several key purposes:

- Tuning hyperparameters such as the learning rate or regularization strength
- Comparing candidate models or architectures and selecting among them
- Deciding when to stop training (early stopping)
- Monitoring for overfitting during development, before the final test evaluation
One of the most common sources of confusion in machine learning is the distinction between the validation set and the test set. Although both are used to evaluate a model on data it was not trained on, their roles in the workflow are fundamentally different.
| Aspect | Validation Set | Test Set |
|---|---|---|
| When used | During model development (repeatedly) | After all development is complete (ideally once) |
| Primary purpose | Tune hyperparameters and select models | Provide an unbiased final performance estimate |
| Influence on the model | Indirectly shapes the model through selection decisions | Should have zero influence on the model |
| Allowed to look at results | Yes, results guide further development | Looking at results and then changing the model introduces bias |
| Typical usage frequency | Many times throughout training | Once, at the very end |
The critical rule is that the test set should never be used to make any training or design decision. If test set performance is used to adjust hyperparameters or select among candidate models, the test set effectively becomes a second validation set, and the reported performance will be optimistically biased [1].
The simplest approach is to randomly split the available data into a training portion and a validation portion before training begins. This is called hold-out validation (or a simple train/validation split). Common split ratios include:
| Dataset Size | Typical Train : Validation : Test Split |
|---|---|
| Small (hundreds to low thousands) | 70 : 15 : 15 or 80 : 10 : 10 |
| Medium (tens of thousands) | 80 : 10 : 10 |
| Large (hundreds of thousands or more) | 90 : 5 : 5 or even 98 : 1 : 1 |
With very large datasets, even 1% of the data can contain tens of thousands of examples, which is more than enough to estimate performance reliably. For smaller datasets, devoting a larger fraction to validation ensures that the performance estimate is stable [2].
When splitting data, practitioners should ensure that the validation set follows the same probability distribution as the training set. For classification problems with imbalanced classes, stratified splitting preserves the original class proportions in both the training and validation partitions. For grouped data (such as multiple measurements from the same patient or multiple frames from the same video), group-aware splitting ensures that all samples from a single group appear in only one partition.
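As a concrete illustration, the sketch below uses scikit-learn's train_test_split to create a stratified 80 : 10 : 10 train/validation/test split; the synthetic dataset, array names, and ratio are illustrative assumptions, not part of any particular project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative imbalanced dataset; in practice X and y come from your own data.
X, y = make_classification(n_samples=10_000, n_classes=2, weights=[0.9, 0.1],
                           random_state=42)

# First carve out the 10% test set, stratifying to preserve class proportions.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42)

# Then split the remainder into training (80% overall) and validation (10% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=1 / 9, stratify=y_temp, random_state=42)

# For grouped data, sklearn.model_selection.GroupShuffleSplit keeps each group
# entirely within one partition instead of stratifying by class.
print(len(X_train), len(X_val), len(X_test))  # roughly 8000, 1000, 1000
```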
In k-fold cross-validation, the training data is divided into k equally sized folds. The model is trained k times, each time using a different fold as the validation set and the remaining k − 1 folds as the training set. The final performance metric is the average across all k runs.
This approach is especially valuable when data is limited, because every sample serves as part of the validation set exactly once and as part of the training set k − 1 times. Common choices are k = 5 or k = 10. Stratified k-fold ensures that each fold preserves the class distribution of the full dataset [3].
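A minimal sketch of stratified 5-fold cross-validation with scikit-learn follows; the logistic-regression model and synthetic data are placeholders chosen only to make the example runnable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2_000, random_state=0)

# Each of the 5 folds serves as the validation set exactly once;
# stratification preserves the class distribution within every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(f"fold accuracies: {np.round(scores, 3)}")
print(f"mean validation accuracy: {scores.mean():.3f}")
```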
| Factor | Hold-Out Validation | K-Fold Cross-Validation |
|---|---|---|
| Computational cost | Low (train once) | Higher (train k times) |
| Data efficiency | Lower (part of data is never used for training) | Higher (all data used for both training and validation) |
| Variance of estimate | Higher (depends on which samples land in which split) | Lower (averages over k different splits) |
| Bias of estimate | Can be higher with small data | Generally lower |
| Best suited for | Large datasets where a single split is representative | Small to medium datasets |
For deep learning models that are expensive to train, hold-out validation is often the pragmatic choice, since training a large neural network k times can be prohibitively slow. For classical machine learning algorithms with faster training times, k-fold cross-validation is the standard practice [4].
Early stopping is a regularization technique that uses the validation set to determine when to halt training. During each epoch of training, the model's loss (or another metric such as accuracy) is computed on both the training set and the validation set.
In a typical training run:

1. After each epoch, the loss is computed on the validation set as well as the training set.
2. If the validation loss improves, the current model weights are recorded as the best so far.
3. If the validation loss fails to improve for a set number of consecutive epochs (the patience), training is halted.
4. The weights from the epoch with the lowest validation loss are restored as the final model.
Typical patience values range from 3 to 10 epochs, depending on the dataset and model complexity. Early stopping is widely used in training neural networks because it is simple to implement and effective at preventing overfitting without requiring manual tuning of the number of training epochs [5].
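The patience logic can be written framework-agnostically. The sketch below is a minimal version of that loop, assuming hypothetical train_one_epoch and evaluate functions, a PyTorch-style model with state_dict/load_state_dict, and illustrative values for patience and the epoch budget.

```python
import copy

# Early-stopping sketch: model, train_loader, val_loader, train_one_epoch and
# evaluate are placeholders for whatever framework is actually in use.
max_epochs = 100   # illustrative upper bound on training length
patience = 5       # illustrative patience value

best_val_loss = float("inf")
best_weights = None
epochs_without_improvement = 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)     # fit on the training set
    val_loss = evaluate(model, val_loader)   # compute validation loss

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_weights = copy.deepcopy(model.state_dict())  # keep best checkpoint
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # validation loss has not improved for `patience` epochs

model.load_state_dict(best_weights)  # restore the best checkpoint
```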
Plotting training loss and validation loss across epochs produces learning curves that provide diagnostic information about the model's behavior.
| Curve Pattern | Training Loss | Validation Loss | Diagnosis |
|---|---|---|---|
| Good fit | Decreases and stabilizes | Decreases and stabilizes close to training loss | The model generalizes well |
| Overfitting | Continues to decrease | Decreases then increases (diverges from training loss) | The model memorizes training data |
| Underfitting | Remains high | Remains high, mirrors training loss | The model is too simple or needs more training |
| Oscillating loss | Fluctuates erratically | Fluctuates erratically | Learning rate may be too high or data may have quality issues |
The gap between the training loss and the validation loss is sometimes called the generalization gap. A small generalization gap indicates that the model's performance on the training data is a good predictor of its performance on new data. A large and growing gap is a classic indicator of overfitting [6].
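A short matplotlib sketch for producing such learning curves is shown below; the per-epoch loss values are placeholder numbers standing in for metrics collected during an actual training run.

```python
import matplotlib.pyplot as plt

# train_losses and val_losses hold one loss value per epoch, collected during
# training; the values here are placeholders that mimic mild overfitting.
train_losses = [0.90, 0.60, 0.45, 0.35, 0.30, 0.27, 0.25]
val_losses   = [0.95, 0.68, 0.55, 0.50, 0.49, 0.50, 0.53]

epochs = range(1, len(train_losses) + 1)
plt.plot(epochs, train_losses, label="training loss")
plt.plot(epochs, val_losses, label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.title("Learning curves: a widening gap suggests overfitting")
plt.show()
```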
Hyperparameters are settings that are not learned during training but must be specified before training begins. Examples include the learning rate, the number of hidden layers, dropout rate, batch size, and regularization strength.
The standard workflow for hyperparameter tuning is:

1. Choose a set of candidate hyperparameter configurations (for example, via grid search or random search).
2. Train a model on the training set with each configuration.
3. Evaluate each trained model on the validation set.
4. Select the configuration with the best validation performance.
5. Train the final model with that configuration and report its performance on the test set.
When using cross-validation for hyperparameter tuning, steps 2 and 3 are repeated for each fold, and the average validation performance across folds is used to compare configurations. This gives a more robust estimate of each configuration's quality, especially with limited data [7].
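As an illustration of steps 2 through 4 combined with cross-validation, scikit-learn's GridSearchCV trains and validates every configuration on every fold; the SVC model and the parameter grid here are arbitrary examples rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X_train, y_train = make_classification(n_samples=1_000, random_state=0)

# Candidate hyperparameter configurations (step 1); values are illustrative.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}

# 5-fold cross-validation: each configuration is trained and evaluated on every
# fold (steps 2 and 3), then the best average score is selected (step 4).
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_, search.best_score_)
```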
Although the validation set is not used to train the model's parameters directly, repeated evaluation on the same validation set can lead to a subtler form of overfitting. When practitioners run many experiments, choosing hyperparameters and model designs based on validation performance, information from the validation set gradually leaks into the modeling decisions. Over time, the selected model may be tuned to perform well specifically on the validation set rather than on truly unseen data.
This problem is known as validation set overfitting or adaptive overfitting. Signs include:

- Validation metrics that keep improving across experiments while performance on genuinely fresh data stagnates
- A noticeable gap between validation performance and the final test performance
- Gains from small, unprincipled tweaks that were kept only because they happened to score well on the validation set
Strategies to mitigate validation set leakage include:

- Refreshing the validation split, or holding out a second validation set, after many rounds of experimentation
- Using cross-validation so that selection decisions are averaged over several different splits rather than one fixed split
- Limiting the number of modeling decisions driven by a single fixed validation set
- Keeping the test set strictly untouched until the very end of the project
Another common source of leakage is preprocessing leakage: fitting data transformations (such as normalization, feature engineering, or imputation) on the entire dataset before splitting it into training and validation sets. Preprocessing steps should be fitted only on the training data and then applied to the validation and test sets [8].
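One common way to avoid preprocessing leakage is to place the transformations inside a scikit-learn Pipeline, so they are refitted on the training portion of every split; the StandardScaler and logistic-regression model below are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1_000, random_state=0)

# The pipeline refits the scaler on the training portion of every fold, so the
# validation data never influences the normalization statistics.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"mean validation accuracy: {scores.mean():.3f}")
```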
Validation sets are especially important when training deep learning models because neural networks have a large number of parameters and a strong capacity to memorize training data. Without a validation set, it is difficult to know when to stop training or which architecture works best.
In practice, training a neural network involves:

- Splitting the labeled data into training, validation, and test sets
- Fitting the network on the training set while computing the validation loss (and other metrics) after each epoch
- Using validation metrics to drive early stopping and to decide which checkpoint to keep
- Comparing architectures and hyperparameter settings by their validation performance before the final test evaluation
Modern deep learning frameworks such as PyTorch and TensorFlow provide built-in callbacks and utilities for monitoring validation metrics during training. For example, PyTorch Lightning's EarlyStopping callback and Keras's ModelCheckpoint callback automate the process of tracking the validation loss and saving the best model.
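For example, in Keras the two kinds of callback mentioned above can be combined roughly as follows; the toy data, model architecture, checkpoint path, and patience value are placeholder choices, not recommendations.

```python
import numpy as np
import tensorflow as tf

# Placeholder data and model; replace with your own.
X = np.random.rand(1_000, 20).astype("float32")
y = np.random.randint(0, 2, size=(1_000,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    # Stop when the validation loss has not improved for 5 epochs.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
    # Keep only the checkpoint with the best validation loss.
    tf.keras.callbacks.ModelCheckpoint("best_model.keras", monitor="val_loss",
                                       save_best_only=True),
]

# validation_split holds out 10% of the training data as the validation set.
model.fit(X, y, epochs=100, validation_split=0.1, callbacks=callbacks)
```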
Choosing the right size for the validation set involves balancing two competing concerns:

- A larger validation set yields a more stable, lower-variance estimate of model performance.
- Every sample placed in the validation set is a sample the model cannot learn from, so an oversized validation set can starve the training set.
General guidelines for sizing:
| Scenario | Recommendation |
|---|---|
| Small dataset (< 1,000 samples) | Use k-fold cross-validation instead of a fixed validation set |
| Medium dataset (1,000 to 100,000 samples) | 10% to 20% for validation |
| Large dataset (> 100,000 samples) | 1% to 10% for validation (even 1% may yield thousands of samples) |
| Very large dataset (millions of samples) | 1% or less is often sufficient |
The validation set should be large enough to detect meaningful differences between candidate models. If the expected improvement from a hyperparameter change is small (for example, a 0.1% increase in accuracy), the validation set needs to be large enough for that difference to be statistically significant [9].
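To make this concrete, a rough back-of-the-envelope check: the standard error of an accuracy estimate from n validation samples is approximately sqrt(p(1 − p)/n). The sketch below evaluates this for a hypothetical model with 90% accuracy; the sample sizes are illustrative.

```python
import math

def accuracy_standard_error(p: float, n: int) -> float:
    """Approximate standard error of an accuracy estimate from n samples."""
    return math.sqrt(p * (1 - p) / n)

for n in (1_000, 10_000, 100_000, 1_000_000):
    se = accuracy_standard_error(0.90, n)
    print(f"n={n:>9,}: standard error ~ {se:.4%}")

# With 10,000 samples the standard error is about 0.3%, so a 0.1% improvement
# cannot be distinguished from noise; reliably detecting a difference that
# small requires hundreds of thousands to millions of validation samples.
```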
Beyond standard hold-out and k-fold approaches, several specialized validation methods address specific data characteristics. For time-ordered data, the validation set must contain only observations that come after those in the training set, so the model is never evaluated on data from before the period it was trained on; scikit-learn's TimeSeriesSplit implements this by using expanding training windows with forward-looking validation windows. For grouped data, group-aware variants such as GroupKFold keep all samples from the same group within a single fold, mirroring the group-aware splitting described earlier.
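As a small illustration, the sketch below applies TimeSeriesSplit to a toy array of twelve time-ordered observations; the number of splits is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations

tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    # The training window grows over time; validation always lies in the future.
    print(f"fold {fold}: train={train_idx.tolist()}, val={val_idx.tolist()}")
```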
Imagine you are studying for a big test at school. You have a workbook full of practice problems. You use most of the problems to learn and practice (that is your training set). But you save a few problems that you do not look at while studying. After you think you have studied enough, you try those saved problems to see if you really understand the material (that is your validation set). If you get them wrong, you go back and study differently. You keep checking with those saved problems until you do well.
Then, on test day, the teacher gives you brand-new problems you have never seen before (that is the test set). Your score on those brand-new problems tells you how well you truly learned, not just how well you memorized the practice answers.
The validation set is like a practice quiz you give yourself before the real test. It helps you figure out the best way to study without spoiling the real test.