Validation
Last reviewed
May 11, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 2,234 words
See also: Machine learning terms
Validation in machine learning is the process of checking the quality of a model's predictions by evaluating the trained model on data it did not see during training. The data used for this check is called a validation set. Validation is the first round of evaluation and is usually run during training or right after it; testing on the test set is the second and final round, run on a separate, untouched sample.
A typical supervised learning project splits a labeled dataset into three parts: a training set that fits model parameters, a validation set that guides choices like hyperparameter settings and early stopping, and a test set that gives the final, unbiased read on how well the model generalizes. Keeping these three pools separate is the simplest way to detect overfitting and to avoid optimistic performance estimates that come from training and judging on the same data.
Google's machine learning glossary describes the validation set as "a dataset to evaluate the model during training" and the test set as a dataset used to evaluate the final model after initial vetting by the validation set. The split between these roles matters because any information that leaks from validation or test data into the trained weights gives a misleading picture of real-world performance.
Training a model on a sample and then judging it on the same sample is a known mistake. The model can memorize the labels and look perfect, then fail on anything new. A separate test set fixes part of this problem, but a single test set runs into a different issue: as soon as you start tuning hyperparameters based on test scores, the test set becomes part of the modeling process. Its scores stop being unbiased.
The validation set sits in the middle to absorb this tuning. You can compare different learning rates, network depths, regularization strengths, or feature sets by their validation scores, then save the test set for one final number at the end. Without that middle layer, every round of tuning leaks information about the test set into your choices, and the headline accuracy drifts upward without the model actually being better. A held-out validation set, or cross-validation on the training portion, keeps the test set clean for the final reading.
All three sets are drawn from the same overall pool of labeled examples and should follow the same probability distribution. They serve different jobs.
| Set | Role | When used |
|---|---|---|
| Training set | Fits model parameters (weights, coefficients) | During training |
| Validation set | Evaluates the model during training, guides hyperparameter tuning and early stopping | Between training runs or epochs |
| Test set | Provides a final, unbiased estimate of generalization | Once, after all training and tuning is done |
The training set teaches the model. The validation set helps the engineer pick between models. The test set is the receipt at the end. If the test set ever influences a hyperparameter choice, it has effectively become a second validation set and the project needs a fresh hold-out for the final read.
There is no universal split that always works. The right ratio depends on dataset size, class imbalance, and how stable the metric needs to be. A few patterns show up often in practice.
| Split | Use case |
|---|---|
| 80 / 10 / 10 | Default for many supervised learning projects; keeps most data for training while leaving enough for both validation and test |
| 70 / 15 / 15 | Used when more careful evaluation is wanted, or when validation and test scores need lower variance |
| 60 / 20 / 20 | Common on smaller tabular datasets where the held-out sets need to be bigger to be representative |
| 98 / 1 / 1 (or similar) | Used on very large datasets (millions of examples) where 1% is already a large validation and test set in absolute terms |
Larger datasets tolerate a smaller percentage for validation and test because 1% of ten million examples is still 100,000 examples, which is plenty to estimate a metric. Small datasets push toward bigger validation and test portions, or toward k-fold cross-validation so every example takes a turn in each role.
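To make the size argument concrete, the sketch below computes the binomial standard error of an accuracy estimate for a few held-out set sizes. The true accuracy of 0.90 is an assumed value chosen only for illustration, and the formula treats examples as independent.

```python
import math

def accuracy_standard_error(accuracy: float, n_examples: int) -> float:
    """Binomial standard error of an accuracy estimate from n independent examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)

# Assumed true accuracy of 0.90, chosen only for illustration.
for n in (100, 1_000, 100_000):
    se = accuracy_standard_error(0.90, n)
    print(f"n={n:>7}: accuracy 0.90 +/- {1.96 * se:.4f} (approx. 95% interval)")
```

With 100 held-out examples the interval is roughly six percentage points wide on each side; with 100,000 it shrinks to about two tenths of a point, which is why very large datasets can afford a 1% slice.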
For classification problems with imbalanced classes, the split is usually stratified so each set keeps roughly the same class proportions as the full dataset. Without stratification, a minority class can end up nearly absent from the validation or test set, which makes metrics like precision, recall, and AUC unstable or undefined.
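A stratified three-way split can be sketched with the standard library alone. The function name `stratified_split`, the 80 / 10 / 10 default, and the toy labels are all assumptions made for this example; a real project would more likely rely on a library routine.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists that keep class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        # Shuffle within each class, then cut that class by the same fractions.
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test

# Toy imbalanced labels: 90 negatives, 10 positives (made-up data).
labels = [0] * 90 + [1] * 10
train, val, test = stratified_split(labels)
for name, part in (("train", train), ("val", val), ("test", test)):
    print(name, len(part), Counter(labels[i] for i in part))
```

Because each class is cut separately, the single positive example that a 10% slice is entitled to is guaranteed to appear in both the validation and the test set.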
Hold-out validation is the simplest approach. You set aside a single validation set before training begins and never use it to fit parameters. After each training run, you compute the validation loss or another metric on this set and use it to compare models or to decide when to stop training.
The approach is fast, easy to explain, and well-suited to large datasets where a single 10% slice is already big enough to be representative. The downside is variance: with a small dataset, the score depends on which examples happened to land in the validation set. A different random split can give a noticeably different answer. Hold-out is often paired with a fixed random seed for reproducibility and with stratification so the split is fair.
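The split-to-split variance is easy to see on a small synthetic dataset. In this sketch the "model" is a deliberately trivial baseline that predicts the training mean, and `hold_out_mse` is a hypothetical helper written for the example.

```python
import random
from statistics import mean

def hold_out_mse(ys, val_fraction=0.1, seed=0):
    """Split once with the given seed, predict the training mean, return validation MSE."""
    indices = list(range(len(ys)))
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(val_fraction * len(ys)))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    prediction = mean(ys[i] for i in train_idx)
    return mean((ys[i] - prediction) ** 2 for i in val_idx)

# Small synthetic target values (made up for illustration).
data_rng = random.Random(123)
ys = [data_rng.gauss(0, 1) for _ in range(40)]

scores = [hold_out_mse(ys, seed=s) for s in range(5)]
print("validation MSE by split seed:", [round(s, 3) for s in scores])
# Reusing a seed reproduces the split, and therefore the score.
print(hold_out_mse(ys, seed=0) == hold_out_mse(ys, seed=0))
```

The same data and the same model give different scores under different seeds, while a fixed seed makes the number repeatable.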
When the dataset is small or the validation score swings too much between random splits, k-fold cross-validation is the standard alternative. The training portion is divided into k roughly equal pieces called folds. The model is trained k times. In each round, one fold is held out as the validation fold and the remaining k − 1 folds are used for training. After all k rounds, the k validation scores are averaged into a single estimate.
The most common choices are k = 5 and k = 10. Five folds are cheaper, but each model trains on only 80% of the data, which makes the estimate slightly pessimistic; ten folds train on 90% and give a smoother estimate at roughly twice the compute. Wikipedia notes that 10-fold cross-validation is commonly used, though the right number depends on dataset size and compute budget.
k-fold has two practical advantages. Every example is used for validation exactly once, which makes the score less sensitive to a lucky or unlucky split. The averaged score has lower variance than a single hold-out, so small differences between models are easier to trust. The cost is that the full training procedure runs k times instead of once.
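The procedure can be sketched in a few lines. This is a minimal illustration, not a library implementation: the one-feature least-squares model, the helper names, and the synthetic data are all assumptions made for the example.

```python
import random
from statistics import mean

def k_fold_indices(n_examples, k, seed=0):
    """Shuffle indices once and deal them into k folds of near-equal size."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, y_bar - a * x_bar

def mse(model, xs, ys):
    a, b = model
    return mean((a * x + b - y) ** 2 for x, y in zip(xs, ys))

def cross_validate(xs, ys, k=5, seed=0):
    """Train k times, each time validating on the one fold left out of training."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held]
        model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(mse(model, [xs[i] for i in held_out], [ys[i] for i in held_out]))
    return scores

# Synthetic data: y = 2x + 1 plus Gaussian noise (made up for illustration).
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(50)]
ys = [2 * x + 1 + rng.gauss(0, 1) for x in xs]

fold_scores = cross_validate(xs, ys, k=5)
print("per-fold MSE:", [round(s, 3) for s in fold_scores])
print("mean MSE:", round(mean(fold_scores), 3))
```

The spread of the per-fold scores is itself useful: it shows how much a single hold-out score could have moved depending on the split.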
For classification, stratified k-fold preserves the class distribution in each fold. This matters most for imbalanced problems, where a plain random split can drop a rare class entirely from a fold and break the metric calculation.
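One simple way to stratify folds is to shuffle each class separately and deal its indices round-robin across the folds. The sketch below takes that approach; `stratified_k_fold` is a hypothetical helper, and always starting the deal at the first fold is a simplification that can leave fold sizes slightly uneven when there are many small classes.

```python
import random
from collections import Counter, defaultdict

def stratified_k_fold(labels, k, seed=0):
    """Deal each class's shuffled indices round-robin so every fold gets its share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Made-up imbalanced labels: 45 negatives, 5 positives.
labels = [0] * 45 + [1] * 5
for fold in stratified_k_fold(labels, k=5):
    print(len(fold), Counter(labels[i] for i in fold))
```

Every fold ends up with nine negatives and exactly one positive, so recall and similar metrics stay defined in each round.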
Leave-one-out cross-validation, often written as LOOCV, is the extreme case where k equals the number of examples. The model is trained once for every single data point: each point in turn is the validation set, and the rest is training data. The averaged score uses every example.
LOOCV is attractive on very small datasets because it wastes almost nothing. It has two well-known drawbacks. The compute cost is high because the model is fit as many times as there are examples. The variance of the estimate can also be surprisingly large: the training sets, one per example, differ from each other by only a single point, so the resulting models and their errors are highly correlated. Scikit-learn's documentation notes that, as a general rule, 5-fold or 10-fold cross-validation should be preferred to LOOCV.
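A minimal sketch of the leave-one-out loop, using a one-feature nearest-neighbor classifier as a stand-in model. The dataset and helper names are invented for the example.

```python
def nearest_neighbor_predict(train_xs, train_ys, x):
    """Predict the label of the closest training point (1-NN on one feature)."""
    nearest = min(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x))
    return train_ys[nearest]

def leave_one_out_accuracy(xs, ys):
    correct = 0
    for i in range(len(xs)):
        # Every other point is training data; point i alone is the validation set.
        train_xs = xs[:i] + xs[i + 1:]
        train_ys = ys[:i] + ys[i + 1:]
        correct += nearest_neighbor_predict(train_xs, train_ys, xs[i]) == ys[i]
    return correct / len(xs)

# Tiny made-up dataset: one feature, two classes, one point near the boundary.
xs = [1.0, 1.5, 2.0, 2.5, 4.4, 5.0, 5.5, 6.0]
ys = [0, 0, 0, 0, 0, 1, 1, 1]
print(leave_one_out_accuracy(xs, ys))  # 0.875: only the point at 4.4 is misclassified
```

The loop runs once per example, which is cheap for eight points and a model with no training step, but becomes the dominant cost as soon as fitting is expensive.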
During training, the model's loss on the training set usually drops steadily. The loss on the validation set tends to fall at first, then flatten, then start rising again as the model begins to memorize training data rather than learn patterns that generalize. A widening gap between these two curves, with training loss still falling while validation loss rises, is the classic signature of overfitting.
Early stopping uses this signal. The training loop tracks validation loss at the end of each epoch. If validation loss stops improving for a set number of checks, training halts and the model rolls back to the weights from the best epoch. The Wikipedia article on early stopping describes it as using "the error on the validation set as a proxy for the generalization error in determining when overfitting has begun."
A few practical pieces matter. The patience setting controls how many epochs without improvement are allowed before stopping; typical values are five to ten epochs for deep learning models. The monitored metric can be validation loss or validation accuracy, with loss preferred because it changes smoothly. The best weights are usually restored, not the final ones.
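The patience logic and best-weight restoration can be sketched as a framework-independent loop. Everything here is an assumption for illustration: `train_one_epoch` and `validation_loss` are hypothetical callables supplied by the caller, and the toy stand-ins below replace a real model with a single number.

```python
import copy

def train_with_early_stopping(train_one_epoch, validation_loss, weights,
                              max_epochs=100, patience=5):
    """Stop after `patience` consecutive epochs without a new best validation loss."""
    best_loss = float("inf")
    best_weights = copy.deepcopy(weights)
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        weights = train_one_epoch(weights)
        loss = validation_loss(weights)
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            best_weights = copy.deepcopy(weights)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    # Return the best weights seen, not the final ones.
    return best_weights, best_epoch, best_loss

# Toy stand-ins (hypothetical): "weights" is one number that grows each epoch,
# and validation loss is U-shaped with its minimum at 10.
def toy_train_one_epoch(w):
    return w + 1

def toy_validation_loss(w):
    return (w - 10) ** 2

best_w, best_epoch, best_loss = train_with_early_stopping(
    toy_train_one_epoch, toy_validation_loss, weights=0, patience=5)
print(best_w, best_epoch, best_loss)  # 10 9 0
```

Training runs five epochs past the minimum before giving up, then hands back the weights from epoch 9 rather than the worse ones it ended on.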
Early stopping is one of the cheapest forms of regularization. It does not change the loss function, the optimizer, or the architecture; it just stops at the right time. Modern frameworks like Keras, PyTorch Lightning, and XGBoost expose it as a built-in callback or training option.
The validation set is also where hyperparameter tuning happens. Hyperparameters are settings chosen before training: learning rate, batch size, number of layers, regularization coefficient, tree depth, and so on. A typical sweep trains several candidate configurations on the training set, scores each on the validation set, and picks the configuration with the best validation score.
Methods include manual tweaking, grid search, random search, and Bayesian optimization. All share the same risk: every comparison spends a little of the validation set's budget, and after enough comparisons the best score may be a lucky outlier. Ways to push back: keep tuning rounds modest, use k-fold cross-validation so each configuration is scored on multiple folds, or use nested cross-validation when stakes are high. Always keep a final test set that is never touched until the model is frozen.
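A grid search over a single hyperparameter shows the pattern. This sketch assumes a one-feature ridge regression without intercept, a fixed 60 / 20 / 20 split, and synthetic data; the candidate values in `grid` are arbitrary.

```python
import random

def fit_ridge(xs, ys, lam):
    """One-feature ridge regression without intercept: w = sum(xy) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data (made up): y = 3x plus noise, split 60 / 20 / 20.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [3 * x + rng.gauss(0, 0.5) for x in xs]
x_train, y_train = xs[:60], ys[:60]
x_val, y_val = xs[60:80], ys[60:80]
x_test, y_test = xs[80:], ys[80:]

# Grid search: fit every candidate on training data, score on validation data only.
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
val_scores = {lam: mse(fit_ridge(x_train, y_train, lam), x_val, y_val) for lam in grid}
best_lam = min(val_scores, key=val_scores.get)

# The test set is used once, after the configuration is frozen.
final_w = fit_ridge(x_train, y_train, best_lam)
print("validation MSE per lambda:", {lam: round(s, 3) for lam, s in val_scores.items()})
print("chosen lambda:", best_lam, "test MSE:", round(mse(final_w, x_test, y_test), 3))
```

The test data never enters the loop that picks `best_lam`, so the final test MSE is not biased by the search.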
The validation set is not magic. If you compare hundreds of variants against it and pick the winner, you can fit the noise in the validation set the same way the model can fit noise in the training set. The result looks great on validation, scores worse on test, and worse still in production.
Typical signs are a steady gap between validation and test scores, or configurations that win by tiny margins which disappear on a fresh hold-out. Common defenses include limiting the number of comparisons, using cross-validation instead of a single hold-out, and reserving the test set for the very end of the project. Competition platforms like Kaggle split a public leaderboard from a private leaderboard for the same reason: the public score acts like a validation set that participants can probe, while the private score is the test set that decides the final ranking.
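The effect can be reproduced with a small simulation. In this sketch the labels are coin flips and every "configuration" is a random guesser, so no candidate has any real skill; the set sizes and the number of candidates are arbitrary choices for the example.

```python
import random

rng = random.Random(0)

# Labels are pure coin flips, so no model can truly beat 50% accuracy.
n_val, n_test = 50, 2000
val_labels = [rng.randint(0, 1) for _ in range(n_val)]
test_labels = [rng.randint(0, 1) for _ in range(n_test)]

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Each "configuration" is a random guesser; keep whichever scores best on validation.
best_val_acc, best_test_acc = 0.0, 0.0
for _ in range(200):
    val_preds = [rng.randint(0, 1) for _ in range(n_val)]
    test_preds = [rng.randint(0, 1) for _ in range(n_test)]
    val_acc = accuracy(val_preds, val_labels)
    if val_acc > best_val_acc:
        best_val_acc, best_test_acc = val_acc, accuracy(test_preds, test_labels)

print(f"best validation accuracy: {best_val_acc:.2f}")
print(f"test accuracy of that winner: {best_test_acc:.2f}")
```

The winner's validation accuracy sits well above 50% purely because it was selected from many tries on a small set, while its accuracy on the larger untouched test set stays near chance.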
The metric used for validation depends on the task.
| Metric | Used for |
|---|---|
| Accuracy | Balanced classification; misleading on imbalanced data |
| Precision and recall | Classification with class imbalance |
| F1 score | Harmonic mean of precision and recall |
| AUC-ROC | Ranking quality of a binary classifier |
| Mean squared error | Regression; penalizes large errors heavily |
| Mean absolute error | Regression; more robust to outliers than MSE |
| Cross-entropy loss | Probabilistic classification; standard objective for neural networks |
Good practice is to pick the validation metric to match the business or scientific goal. A model that wins on raw accuracy can be the wrong choice if the cost of false negatives is much higher than the cost of false positives.
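The gap between accuracy and the other classification metrics is easiest to see on an imbalanced validation set. The sketch below computes the metrics from confusion counts; the two sets of predictions are invented to make the contrast visible.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels in {0, 1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    # Report 0.0 when a ratio is undefined (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}

# Made-up imbalanced validation set: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100
# A second, hypothetical model: 10 false alarms, but it finds 4 of the 5 positives.
finds_positives = [1] * 10 + [0] * 85 + [1] * 4 + [0] * 1

print(classification_metrics(y_true, always_negative))
print(classification_metrics(y_true, finds_positives))
```

The model that never predicts the positive class wins on accuracy (0.95 versus 0.89) while having zero recall, which is the situation the paragraph above warns about when false negatives are costly.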
A standard supervised learning workflow looks roughly like this:
The order matters. Touching the test set before step 7, even casually, breaks the guarantee that the test score is unbiased.
Imagine you are studying for a math test. The training set is your homework: you do problems and check the answers in the back of the book so you can learn. The validation set is a practice quiz your teacher gives you the day before the real test. You can see how you did, study a bit more, change how you take notes. The real test is the test set: you only get one shot, and that score is what counts.
If you peek at the real test ahead of time, your score does not really mean anything anymore. That is why machine learning practitioners keep the test set locked away and use the validation set for all the in-between checking.