Validation

Introduction

Validation in machine learning is the process of checking the quality of a model's predictions by testing the trained model against data it has not seen during training. The data used for this check is called a validation set. Validation is the first round of evaluation and is usually run during training or right after it. Testing is the second and final round, run on a separate, untouched sample called the test set.

A typical supervised learning project splits a labeled dataset into three parts: a training set that fits model parameters, a validation set that guides choices like hyperparameter settings and early stopping, and a test set that gives the final, unbiased read on how well the model generalizes. Keeping these three pools separate is the simplest way to detect overfitting and to avoid optimistic performance estimates that come from training and judging on the same data.

Google's machine learning glossary describes the validation set as "a dataset to evaluate the model during training" and the test set as a dataset used to evaluate the final model after initial vetting by the validation set. The split between these roles matters because any information that leaks from validation or test data into the trained weights gives a misleading picture of real-world performance.

Why a validation set is needed

Training a model on a sample and then judging it on the same sample is a known mistake. The model can memorize the labels and look perfect, then fail on anything new. A separate test set fixes part of this problem, but a single test set runs into a different issue: as soon as you start tuning hyperparameters based on test scores, the test set becomes part of the modeling process. Its scores stop being unbiased.

The validation set sits in the middle to absorb this tuning. You can compare different learning rates, network depths, regularization strengths, or feature sets by their validation scores, then save the test set for one final number at the end. Without that middle layer, every round of tuning leaks information about the test set into your choices, and the headline accuracy drifts upward without the model actually being better. A held-out validation set, or cross-validation on the training portion, keeps the test set clean for the final reading.

Validation vs training vs test sets

All three sets are drawn from the same overall pool of labeled examples and should follow the same probability distribution. They serve different jobs.

Set	Role	When used
Training set	Fits model parameters (weights, coefficients)	During training
Validation set	Evaluates the model during training, guides hyperparameter tuning and early stopping	Between training runs or epochs
Test set	Provides a final, unbiased estimate of generalization	Once, after all training and tuning is done

The training set teaches the model. The validation set helps the engineer pick between models. The test set is the receipt at the end. If the test set ever influences a hyperparameter choice, it has effectively become a second validation set and the project needs a fresh hold-out for the final read.

Typical data splits

There is no universal split that always works. The right ratio depends on dataset size, class imbalance, and how stable the metric needs to be. A few patterns show up often in practice.

Split	Use case
80 / 10 / 10	Default for many supervised learning projects; keeps most data for training while leaving enough for both validation and test
70 / 15 / 15	Used when more careful evaluation is wanted, or when validation and test scores need lower variance
60 / 20 / 20	Common on smaller tabular datasets where the held-out sets need to be bigger to be representative
98 / 1 / 1 (or similar)	Used on very large datasets (millions of examples) where 1% is already a large validation and test set in absolute terms

Larger datasets tolerate a smaller percentage for validation and test because 1% of ten million examples is still 100,000 examples, which is plenty to estimate a metric. Small datasets push toward bigger validation and test portions, or toward k-fold cross-validation so every example takes a turn in each role.
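The claim that a small percentage of a large dataset is "plenty" can be checked with a rough calculation. The sketch below assumes the metric is accuracy and that examples are independent, so the usual binomial standard error applies. The accuracy value of 0.90 and the helper name are illustrative assumptions, not figures from any particular project.

```python
import math

def accuracy_standard_error(p: float, n: int) -> float:
    """Binomial standard error of an accuracy estimate p measured on n examples."""
    return math.sqrt(p * (1.0 - p) / n)

# Assumed accuracy of 0.90, evaluated on held-out sets of different sizes.
for n in (100, 1_000, 100_000):
    se = accuracy_standard_error(0.90, n)
    print(f"n={n:>7}: 0.90 +/- {1.96 * se:.4f} (approx. 95% interval)")
```

Under these assumptions, 100 examples pin the accuracy down only to within about six points, while 100,000 examples narrow it to about two tenths of a point.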

For classification problems with imbalanced classes, the split is usually stratified so each set keeps roughly the same class proportions as the full dataset. Without stratification, a minority class can end up nearly absent from the validation or test set, which makes metrics like precision, recall, and AUC unstable or undefined.
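A stratified three-way split can be sketched in plain Python as follows. The function name, the 80 / 10 / 10 default, and the toy labels are assumptions made for illustration. Libraries such as scikit-learn provide their own split utilities, which this sketch does not try to reproduce.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train_idx, val_idx, test_idx) with class proportions preserved.

    `fractions` are the train/validation/test shares and should sum to 1.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n = len(indices)
        n_val = round(n * fractions[1])
        n_test = round(n * fractions[2])
        n_train = n - n_val - n_test              # remainder goes to training
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])
    return train, val, test

# Toy imbalanced dataset (made-up): 900 negatives and 100 positives.
labels = [0] * 900 + [1] * 100
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
print(sum(labels[i] for i in val_idx), "positives in validation")
```

Because the split is done class by class, the validation and test sets each keep the 10% positive rate of the full toy dataset.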

Hold-out validation

Hold-out validation is the simplest approach. You set aside a single validation set before training begins and never use it to fit parameters. After each training run, you compute the validation loss or another metric on this set and use it to compare models or to decide when to stop training.

The approach is fast, easy to explain, and well-suited to large datasets where a single 10% slice is already big enough to be representative. The downside is variance: with a small dataset, the score depends on which examples happened to land in the validation set. A different random split can give a noticeably different answer. Hold-out is often paired with a fixed random seed for reproducibility and with stratification so the split is fair.
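The sensitivity to the split can be illustrated with a small sketch. It assumes a hypothetical model whose per-example correctness is already known, which sidesteps training entirely. In a real project the model would be refit on each training portion. The `holdout_split` helper and the data are invented for this example.

```python
import random

def holdout_split(n_examples, val_fraction=0.1, seed=42):
    """Shuffle indices with a fixed seed and hold out the last slice for validation."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(n_examples * val_fraction))
    return indices[:-n_val], indices[-n_val:]

# Made-up data: 1 marks an example the hypothetical model gets right.
correct = [1] * 14 + [0] * 6          # 70% accuracy over all 20 examples

for seed in range(5):
    _, val_idx = holdout_split(len(correct), val_fraction=0.25, seed=seed)
    val_acc = sum(correct[i] for i in val_idx) / len(val_idx)
    print(f"seed={seed}: validation accuracy on {len(val_idx)} examples = {val_acc:.2f}")
```

Reusing the same seed reproduces the same split, which is why a fixed seed is the usual choice when comparing models against one hold-out set.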

k-fold cross-validation

When the dataset is small or the validation score swings too much between random splits, k-fold cross-validation is the standard alternative. The training portion is divided into k pieces of roughly equal size called folds. The model is trained k times. In each round, one fold is held out as the validation fold and the remaining k - 1 folds are used for training. After all k rounds, the k validation scores are averaged into a single estimate.

The most common choices are k = 5 and k = 10. Five folds are cheaper, but each model trains on only 80% of the data, so the estimate is slightly more pessimistic; ten folds train on 90% and cost twice as many fits. Wikipedia notes that ten-fold cross-validation is the most widely used setting, though the right number depends on dataset size and compute budget.

k-fold has two practical advantages. Every example is used for validation exactly once, which makes the score less sensitive to a lucky or unlucky split. The averaged score has lower variance than a single hold-out, so small differences between models are easier to trust. The cost is that the full training procedure runs k times instead of once.
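The procedure can be sketched with the standard library alone. The toy "model" below simply predicts the mean of its training targets, and the data are made up. Both are stand-ins chosen so the fold logic stays visible, not a recommendation for a real estimator.

```python
import random
from statistics import mean

def k_fold_indices(n_examples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs; each example is in exactly one validation fold."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k folds of nearly equal size
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

def fit_mean(ys):
    """Toy 'model': always predict the training mean."""
    return mean(ys)

def mse(prediction, ys):
    """Mean squared error of a constant prediction."""
    return mean((y - prediction) ** 2 for y in ys)

y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]  # made-up targets
fold_scores = []
for train_idx, val_idx in k_fold_indices(len(y), k=5):
    model = fit_mean([y[i] for i in train_idx])          # fit on k - 1 folds
    fold_scores.append(mse(model, [y[i] for i in val_idx]))  # score on the held-out fold

cv_score = mean(fold_scores)
print(f"per-fold MSE: {[round(s, 2) for s in fold_scores]}")
print(f"cross-validated MSE: {cv_score:.2f}")
```

The spread of the per-fold scores is itself useful: if it is wide, a small difference between two models' averaged scores should not be trusted.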

Stratified k-fold

For classification, stratified k-fold preserves the class distribution in each fold. This matters most for imbalanced problems, where a plain random split can drop a rare class entirely from a fold and break the metric calculation.
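One simple way to build stratified folds is to deal each class's examples across the folds in turn. The sketch below takes that approach; the function name and the toy labels are assumptions for illustration, and library implementations may balance fold sizes differently.

```python
import random
from collections import Counter, defaultdict

def stratified_k_fold(labels, k=5, seed=0):
    """Return k validation folds (lists of indices) with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's examples round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

labels = [0] * 45 + [1] * 5      # made-up data with one rare class
for fold in stratified_k_fold(labels, k=5):
    print(Counter(labels[i] for i in fold))
```

With five positives and five folds, every fold receives exactly one positive example, so no fold is left without the rare class.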

Leave-one-out cross-validation

Leave-one-out cross-validation, often written as LOOCV, is the extreme case where k equals the number of examples. The model is trained once for every single data point: each point in turn is the validation set, and the rest is training data. The averaged score uses every example.

LOOCV is attractive on very small datasets because it wastes almost nothing. It has two well-known drawbacks. The compute cost is high because the model is fit as many times as there are examples. The variance of the estimate can also be surprisingly large: with n examples, the n training sets are nearly identical to each other, so the fitted models and their errors are highly correlated, and averaging them reduces variance less than it would for independent estimates. Scikit-learn's documentation explicitly recommends 5-fold or 10-fold over LOOCV for most problems.
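A minimal LOOCV loop looks like this. As a simplifying assumption, the "model" just predicts the mean of the training targets, and the five data points are made up.

```python
from statistics import mean

def loocv_mse(ys):
    """Leave-one-out MSE for a toy model that predicts the training mean."""
    errors = []
    for i, held_out in enumerate(ys):
        train = ys[:i] + ys[i + 1:]          # everything except example i
        prediction = mean(train)             # one model fit per example
        errors.append((held_out - prediction) ** 2)
    return mean(errors), len(errors)

y = [1.0, 2.0, 3.0, 4.0, 5.0]     # made-up targets
score, n_fits = loocv_mse(y)
print(f"LOOCV MSE = {score:.4f} from {n_fits} model fits")
```

The number of fits equals the number of examples, which is the cost problem described above. Each training set differs from the others by a single point.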

Validation loss and early stopping

During training, the model's loss on the training set usually drops steadily. The loss on the validation set tends to fall at first, then flatten, then start rising again as the model begins to memorize training data rather than learn patterns that generalize. A widening gap between these two curves is the classic signature of overfitting.

Early stopping uses this signal. The training loop tracks validation loss at the end of each epoch. If validation loss stops improving for a set number of checks, training halts and the model rolls back to the weights from the best epoch. The Wikipedia article on early stopping describes it as using "the error on the validation set as a proxy for the generalization error in determining when overfitting has begun."

A few practical pieces matter. The patience setting controls how many epochs without improvement are allowed before stopping; typical values are five to ten epochs for deep learning models. The monitored metric can be validation loss or validation accuracy, with loss preferred because it changes smoothly. The best weights are usually restored, not the final ones.

Early stopping is one of the cheapest forms of regularization. It does not change the loss function, the optimizer, or the architecture; it just stops at the right time. Modern frameworks like Keras, PyTorch Lightning, and XGBoost expose it as a standard callback.
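The patience rule can be sketched independently of any framework. The function below is a hypothetical helper, not the API of Keras, PyTorch Lightning, or XGBoost. It replays a made-up list of per-epoch validation losses and reports where training would stop and which epoch's weights would be restored.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops once `patience` consecutive epochs fail to improve the best
    loss by more than `min_delta`. Epochs are counted from 0.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch                 # weights from this epoch would be saved
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch       # stop here, restore best_epoch
    return best_epoch, len(val_losses) - 1     # ran out of epochs without stopping

# Made-up validation curve: falls, flattens, then rises.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60, 0.70]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stop after epoch {stop_epoch}, restore weights from epoch {best_epoch}")
```

In this run the best loss is reached at epoch 3 and training halts at epoch 6, after three epochs without improvement, so the later and worse epochs are never trained.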

Hyperparameter tuning with the validation set

The validation set is also where hyperparameter tuning happens. Hyperparameters are settings chosen before training: learning rate, batch size, number of layers, regularization coefficient, tree depth, and so on. A typical sweep trains several candidate configurations on the training set, scores each on the validation set, and picks the configuration with the best validation score.

Methods include manual tweaking, grid search, random search, and Bayesian optimization. All share the same risk: every comparison spends a little of the validation set's budget, and after enough comparisons the best score may be a lucky outlier. Ways to push back: keep tuning rounds modest, use k-fold cross-validation so each configuration is scored on multiple folds, or use nested cross-validation when stakes are high. Always keep a final test set that is never touched until the model is frozen.
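A small grid search shows the pattern of fitting on the training set, selecting on the validation set, and touching the test set once. The sketch assumes a one-parameter ridge regression without an intercept, whose closed-form solution keeps the code short. The data, the grid, and the function names are all invented for illustration.

```python
def fit_ridge_slope(xs, ys, lam):
    """Closed-form slope for y = w * x with an L2 penalty lam * w**2 (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    """Mean squared error of the line y = w * x on the given points."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up data, already split into the three pools.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.9]
val_x, val_y = [5.0, 6.0], [5.1, 5.9]
test_x, test_y = [7.0, 8.0], [7.1, 7.9]

# Grid search: fit on training data, score each candidate on validation data.
val_scores = {}
for lam in (0.0, 0.1, 1.0, 10.0):
    w = fit_ridge_slope(train_x, train_y, lam)
    val_scores[lam] = mse(w, val_x, val_y)

best_lambda = min(val_scores, key=val_scores.get)
print("validation MSE per lambda:", {k: round(v, 4) for k, v in val_scores.items()})

# Only after the choice is frozen is the test set used, exactly once.
final_w = fit_ridge_slope(train_x, train_y, best_lambda)
print(f"best lambda = {best_lambda}, test MSE = {mse(final_w, test_x, test_y):.4f}")
```

The test score plays no part in choosing `best_lambda`. With only four candidates the risk of a lucky winner is small, but it grows with the size of the grid.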

Overfitting to the validation set

The validation set is not magic. If you compare hundreds of variants against it and pick the winner, you can fit the noise in the validation set the same way the model can fit noise in the training set. The result looks great on validation, scores worse on test, and worse still in production.

Typical signs are a steady gap between validation and test scores, or configurations that win by tiny margins which disappear on a fresh hold-out. Common defenses include limiting the number of comparisons, using cross-validation instead of a single hold-out, and reserving the test set for the very end of the project. Competition platforms like Kaggle split a public leaderboard from a private leaderboard for the same reason: the public score acts like a validation set that participants can probe, while the private score is the test set that decides the final ranking.
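The effect can be reproduced with a deliberately artificial simulation. Here the labels are coin flips and every "candidate model" is a set of random guesses, so no candidate has real skill. Picking the best of many by validation accuracy still produces a score well above chance. All sizes and names below are assumptions chosen for the demonstration.

```python
import random

rng = random.Random(0)
n_val, n_test, n_candidates = 50, 1000, 500

# Labels are pure coin flips, so no model can truly beat 50% accuracy.
val_labels = [rng.randint(0, 1) for _ in range(n_val)]
test_labels = [rng.randint(0, 1) for _ in range(n_test)]

def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

best_val_acc, test_acc_of_winner = 0.0, 0.0
for _ in range(n_candidates):
    # Each "candidate model" is just a fresh set of random guesses.
    val_preds = [rng.randint(0, 1) for _ in range(n_val)]
    test_preds = [rng.randint(0, 1) for _ in range(n_test)]
    val_acc = accuracy(val_preds, val_labels)
    if val_acc > best_val_acc:               # select the winner on validation only
        best_val_acc = val_acc
        test_acc_of_winner = accuracy(test_preds, test_labels)

print(f"best validation accuracy: {best_val_acc:.2f}")
print(f"same candidate on test:   {test_acc_of_winner:.2f}")
```

The winning validation accuracy reflects the selection process, not the model. The test accuracy of the same candidate stays close to 50%, which is the honest number.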

Common evaluation metrics on the validation set

The metric used for validation depends on the task.

Metric	Used for
Accuracy	Balanced classification; misleading on imbalanced data
Precision and recall	Classification with class imbalance
F1 score	Harmonic mean of precision and recall
AUC-ROC	Ranking quality of a binary classifier
Mean squared error	Regression; penalizes large errors heavily
Mean absolute error	Regression; more robust to outliers than MSE
Cross-entropy loss	Probabilistic classification; standard objective for neural networks

Good practice is to pick the validation metric to match the business or scientific goal. A model that wins on raw accuracy can be the wrong choice if the cost of false negatives is much higher than the cost of false positives.
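The gap between accuracy and the other classification metrics is easy to see on a toy example. The helper below computes four of the metrics from the table using their standard definitions. Returning 0.0 when a ratio is undefined is a convention assumed here, and the labels and predictions are made up.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Assumed convention: undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up imbalanced validation set: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5

always_negative = [0] * 100                            # never flags a positive
some_positives = [0] * 94 + [1] + [1, 1, 1, 0, 0]      # 1 false alarm, 3 of 5 found

print(classification_metrics(y_true, always_negative))
print(classification_metrics(y_true, some_positives))
```

The model that never predicts the positive class reaches 95% accuracy with zero recall. The second model has almost the same accuracy but actually finds three of the five positives, which is the difference that matters when false negatives are costly.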

Practical workflow

A standard supervised learning workflow looks roughly like this:

1. Collect and clean a labeled dataset.
2. Split into training, validation, and test sets, with stratification if classes are imbalanced.
3. Train a baseline model on the training set.
4. Score on the validation set. Adjust hyperparameters, architecture, features, or preprocessing.
5. Optionally swap the single hold-out for k-fold cross-validation, especially on small datasets.
6. Use early stopping based on validation loss to avoid overtraining.
7. Once the model is frozen, run a single evaluation on the test set and report that number.

The order matters. Touching the test set before step 7, even casually, breaks the guarantee that the test score is unbiased.

Explain like I'm 5 (ELI5)

Imagine you are studying for a math test. The training set is your homework: you do problems and check the answers in the back of the book so you can learn. The validation set is a practice quiz your teacher gives you the day before the real test. You can see how you did, study a bit more, change how you take notes. The real test is the test set: you only get one shot, and that score is what counts.

If you peek at the real test ahead of time, your score does not really mean anything anymore. That is why machine learning practitioners keep the test set locked away and use the validation set for all the in-between checking.
