# Validation

> Source: https://aiwiki.ai/wiki/validation
> Updated: 2026-06-27
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

Validation in [machine learning](/wiki/machine_learning) is the process of checking how well a trained [model](/wiki/model) performs on data it did not see during [training](/wiki/training), using a held-out [validation set](/wiki/validation_set) to tune [hyperparameters](/wiki/hyperparameter), choose between candidate models, and decide when to stop training. It is the first round of evaluation, run during or right after training, and it sits between the [training set](/wiki/training_set) that fits the model and the [test set](/wiki/test_set) that gives the final, untouched read on performance. Keeping these three pools separate is the standard way to detect [overfitting](/wiki/overfitting) and to avoid the optimistic accuracy you get from training and judging a model on the same data. [1][2]

## What is validation in machine learning?

A typical supervised learning project splits a [labeled dataset](/wiki/labeled_data) into three parts: a training set that fits model parameters, a validation set that guides choices like hyperparameter settings and early stopping, and a test set that gives the final, unbiased read on how well the model generalizes. The Wikipedia reference article defines the middle pool precisely: "A validation data set is a data set of examples used to tune the hyperparameters (i.e. the architecture) of a model." [1] The same source notes the validation set provides "an unbiased evaluation of a model fit on the training data set while tuning the model's hyperparameters," and supports "regularization by early stopping." [1]

Google's machine learning glossary describes the validation set as a dataset used to evaluate the model during training and the test set as a [dataset](/wiki/dataset) used to evaluate the final model after initial vetting by the validation set. [3] The split between these roles matters because any information that leaks from validation or test data into the trained weights gives a misleading picture of real-world performance.

## Why is a separate validation set needed?

Training a model on a sample and then judging it on the same sample is a known mistake. The model can memorize the labels and look perfect, then fail on anything new. A separate test set fixes part of this problem, but a single test set runs into a different issue: as soon as you start tuning hyperparameters based on test scores, the test set becomes part of the modeling process. Its scores stop being unbiased.

The validation set sits in the middle to absorb this tuning. You can compare different learning rates, network depths, regularization strengths, or feature sets by their validation scores, then save the test set for one final number at the end. Without that middle layer, every round of tuning leaks information about the test set into your choices, and the headline accuracy drifts upward without the model actually being better. A held-out validation set, or [cross-validation](/wiki/cross_validation) on the training portion, keeps the test set clean for the final reading.

## How do validation, training, and test sets differ?

All three sets are drawn from the same overall pool of labeled examples and should follow the same probability distribution. They serve different jobs.

| Set | Role | When used |
|---|---|---|
| [Training set](/wiki/training_set) | Fits model parameters (weights, coefficients) | During training |
| [Validation set](/wiki/validation_set) | Evaluates the model during training, guides [hyperparameter tuning](/wiki/hyperparameter_tuning) and early stopping | Between training runs or epochs |
| [Test set](/wiki/test_set) | Provides a final, unbiased estimate of [generalization](/wiki/generalization) | Once, after all training and tuning is done |

The training set teaches the model. The validation set helps the engineer pick between models. The test set delivers the final verdict. If the test set ever influences a hyperparameter choice, it has effectively become a second validation set and the project needs a fresh hold-out for the final read.

## What are typical train, validation, and test split ratios?

There is no universal split that always works. The right ratio depends on dataset size, [class](/wiki/class) imbalance, and how stable the metric needs to be. A few patterns show up often in practice.

| Split | Use case |
|---|---|
| 80 / 10 / 10 | Default for many supervised learning projects; keeps most data for training while leaving enough for both validation and test |
| 70 / 15 / 15 | Used when more careful evaluation is wanted, or when validation and test scores need lower variance |
| 60 / 20 / 20 | Common on smaller tabular datasets where the held-out sets need to be bigger to be representative |
| 98 / 1 / 1 (or similar) | Used on very large datasets (millions of examples) where 1% is already a large validation and test set in absolute terms |

Larger datasets tolerate a smaller percentage for validation and test because 1% of ten million examples is still 100,000 examples, which is plenty to estimate a metric. Small datasets push toward bigger validation and test portions, or toward k-fold [cross-validation](/wiki/cross_validation) so every example takes a turn in each role.
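The arithmetic behind these ratios can be checked with a short sketch. The helper name `split_sizes` and its integer-percentage arguments are illustrative assumptions, not part of any library.

```python
def split_sizes(n_examples, train_pct, val_pct):
    """Return (train, validation, test) counts for a dataset of n_examples.

    Percentages are whole numbers; the test set takes whatever remains,
    so the three counts always sum to n_examples.
    """
    n_train = n_examples * train_pct // 100
    n_val = n_examples * val_pct // 100
    n_test = n_examples - n_train - n_val
    return n_train, n_val, n_test


# 98 / 1 / 1 on ten million examples still leaves 100,000 for validation.
print(split_sizes(10_000_000, 98, 1))   # (9800000, 100000, 100000)
# 60 / 20 / 20 on a small tabular dataset of 1,000 rows.
print(split_sizes(1_000, 60, 20))       # (600, 200, 200)
```

The same 1% that yields 100,000 validation examples on the large dataset would yield only 10 on the small one, which is why small datasets need a larger held-out share.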

For [classification](/wiki/classification) problems with imbalanced classes, the split is usually stratified so each set keeps roughly the same class proportions as the full dataset. Without stratification, a minority class can end up nearly absent from the validation or test set, which makes metrics like [precision](/wiki/precision), [recall](/wiki/recall), and [AUC](/wiki/auc) unstable or undefined.
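A minimal sketch of a stratified split, using only the Python standard library. The function `stratified_split` and the 950/50 label mix are hypothetical, chosen to show that each set keeps the 5% positive rate of the full dataset.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, val_frac=0.1, test_frac=0.1, seed=0):
    """Split example indices into train/validation/test, one class at a time."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # randomize which examples of this class go where
        n_val = round(len(indices) * val_frac)
        n_test = round(len(indices) * test_frac)
        val.extend(indices[:n_val])
        test.extend(indices[n_val:n_val + n_test])
        train.extend(indices[n_val + n_test:])
    return train, val, test


# Hypothetical imbalanced dataset: 950 negatives, 50 positives.
labels = [0] * 950 + [1] * 50
train, val, test = stratified_split(labels)
for name, part in (("train", train), ("val", val), ("test", test)):
    print(name, Counter(labels[i] for i in part))
```

Because the split is done per class, the validation and test sets each receive 5 of the 50 positives, whereas a plain random split offers no such guarantee.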

## What is hold-out validation?

Hold-out validation is the simplest approach. You set aside a single validation set before training begins and never use it to fit parameters. After each training run, you compute the [validation loss](/wiki/loss) or another metric on this set and use it to compare models or to decide when to stop training.

The approach is fast, easy to explain, and well-suited to large datasets where a single 10% slice is already big enough to be representative. The downside is variance: with a small dataset, the score depends on which examples happened to land in the validation set. A different random split can give a noticeably different answer. Hold-out is often paired with a fixed random seed for reproducibility and with stratification so the split is fair.
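One way to set up a hold-out split is sketched below with scikit-learn's `train_test_split`. The synthetic dataset, the logistic regression model, and the 800/100/100 sizes are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real labeled dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Carve off the test set first, then split the remainder into train and validation.
# An integer test_size is an absolute number of examples.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=100, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=100, stratify=y_rest, random_state=42
)

# Fit on the training set only; the validation set is used purely for scoring.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
val_accuracy = model.score(X_val, y_val)
print(len(X_train), len(X_val), len(X_test), round(val_accuracy, 3))
```

Fixing `random_state` makes the split reproducible. Changing it on a dataset this small is a quick way to see how much the validation score moves between splits.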

## What is k-fold cross-validation?

When the dataset is small or the validation score swings too much between random splits, k-fold [cross-validation](/wiki/cross_validation) is the standard alternative. The training portion is divided into k roughly equal pieces called folds. The model is trained k times. In each round, one fold is held out as the validation fold and the remaining k - 1 folds are used for training. After all k rounds, the k validation scores are averaged into a single estimate.

The most common choices are k = 5 and k = 10. Five folds are cheaper but train each model on only 80% of the data, which makes the estimate slightly more pessimistic; ten folds train on 90% per round, which reduces that bias at roughly twice the compute cost. Wikipedia notes that "10-fold cross-validation is commonly used," while cautioning that "in general k remains an unfixed parameter" whose best value depends on dataset size and compute budget. [2]

k-fold has two practical advantages. Every example is used for validation exactly once, which makes the score less sensitive to a lucky or unlucky split. The averaged score has lower variance than a single hold-out, so small differences between models are easier to trust. The cost is that the full training procedure runs k times instead of once.
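The procedure can be sketched in plain Python. The helper names and the toy "model" (predict the training mean) are assumptions chosen to keep the example short. The folds here are contiguous, so real data should be shuffled first.

```python
from statistics import mean


def k_fold_indices(n_examples, k):
    """Yield (train_indices, val_indices) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_examples))
        yield train, val
        start += size


def cross_validate_mean_predictor(targets, k):
    """Score a toy model (predict the training mean) with k-fold CV; returns mean MSE."""
    fold_scores = []
    for train, val in k_fold_indices(len(targets), k):
        prediction = mean(targets[i] for i in train)  # "fit" on the other k - 1 folds
        fold_scores.append(mean((targets[i] - prediction) ** 2 for i in val))
    return mean(fold_scores)  # average the k validation scores


targets = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
print(cross_validate_mean_predictor(targets, k=5))
```

Each index appears in exactly one validation fold, and the model is refit k times, which is where the extra cost comes from.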

### What is stratified k-fold?

For classification, stratified k-fold preserves the class distribution in each fold. This matters most for imbalanced problems, where a plain random split can drop a rare class entirely from a fold and break the metric calculation.
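A simple way to build stratified folds is to deal the examples of each class round-robin across the folds, as in this sketch. The function name and the 90/10 label mix are hypothetical, and a real implementation would shuffle within each class first. scikit-learn offers `StratifiedKFold` for the same purpose.

```python
from collections import Counter, defaultdict


def stratified_fold_assignment(labels, k):
    """Assign each example index to one of k folds, preserving the class mix."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal this class's examples across the folds like cards.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical imbalanced labels: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
for fold in stratified_fold_assignment(labels, k=5):
    print(Counter(labels[i] for i in fold))  # each fold: 18 negatives, 2 positives
```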

### What is leave-one-out cross-validation?

Leave-one-out cross-validation, often written as LOOCV, is the extreme case where k equals the number of examples. The model is trained once for every single data point: each point in turn is the validation set, and the rest is training data. The averaged score uses every example.

LOOCV is attractive on very small datasets because it wastes almost nothing. It has two well-known drawbacks. The compute cost is high because the model is fit as many times as there are examples. The variance of the estimate can also be surprisingly large, because the n training sets are nearly identical to each other, so the errors are correlated. Scikit-learn's documentation notes that "LOO often results in high variance as an estimator for the test error" and states, "As a general rule, most authors and empirical evidence suggest that 5 or 10-fold cross validation should be preferred to LOO." [4]
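As a small illustration, the sketch below runs LOOCV for a toy model that predicts the mean of its training data. The function name and the five data points are assumptions. The loop "fits" once per example, which is the source of the compute cost.

```python
def loocv_mse_mean_predictor(targets):
    """Leave-one-out CV for a toy model that predicts the mean of its training data.

    Returns (mean squared error, number of fits).
    """
    n = len(targets)
    total = sum(targets)
    squared_errors = []
    for held_out in targets:
        # Fit on the other n - 1 points, then score on the single held-out point.
        prediction = (total - held_out) / (n - 1)
        squared_errors.append((held_out - prediction) ** 2)
    return sum(squared_errors) / n, len(squared_errors)


mse, n_fits = loocv_mse_mean_predictor([1.0, 2.0, 3.0, 4.0, 5.0])
print(mse, n_fits)  # 3.125 5
```

Each of the five training sets shares three of its four points with every other one, which is the overlap that makes the per-fold errors correlated.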

## How does the validation set drive early stopping?

During training, the model's loss on the training set usually drops steadily. The loss on the validation set tends to fall at first, then flatten, then start rising again as the model begins to memorize training data rather than learn patterns that generalize. A widening gap between the two curves, with training loss still falling while validation loss climbs, is the classic signature of overfitting.

[Early stopping](/wiki/early_stopping) uses this signal. The training loop tracks validation loss at the end of each [epoch](/wiki/epoch). If validation loss stops improving for a set number of checks, training halts and the model rolls back to the weights from the best epoch. As the Wikipedia article on early stopping puts it, "The error on the validation set is used as a proxy for the generalization error in determining when overfitting has begun." [5]

A few practical pieces matter. The patience setting controls how many epochs without improvement are allowed before stopping; typical values are five to ten epochs for [deep learning](/wiki/deep_learning) models. The monitored metric can be validation loss or validation accuracy, with loss preferred because it changes smoothly. The best weights are usually restored, not the final ones.
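The patience logic can be sketched independently of any framework. The function below is a hypothetical helper that takes a list of per-epoch validation losses and reports which epoch had the best loss and at which epoch training would stop. The loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops once `patience` consecutive epochs fail to beat the best
    loss seen so far. Epochs are numbered from 0.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0  # a real loop would checkpoint weights here
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1  # never triggered; ran to the end


# Hypothetical validation curve: falls, flattens, then rises.
losses = [0.90, 0.70, 0.55, 0.50, 0.48, 0.49, 0.50, 0.52, 0.55, 0.60]
print(early_stopping_epoch(losses, patience=3))  # (4, 7)
```

Training halts at epoch 7, but the weights worth keeping are those from epoch 4, which is why restoring the best checkpoint matters.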

Early stopping is one of the cheapest forms of [regularization](/wiki/regularization). It does not change the loss function, the optimizer, or the architecture; it just stops at the right time. Modern frameworks like [Keras](/wiki/keras), [PyTorch](/wiki/pytorch) Lightning, and [XGBoost](/wiki/xgboost) expose it as a standard callback.

## How is the validation set used for hyperparameter tuning?

The validation set is also where [hyperparameter tuning](/wiki/hyperparameter_tuning) happens. Hyperparameters are settings chosen before training: learning rate, batch size, number of layers, regularization coefficient, tree depth, and so on. A typical sweep trains several candidate configurations on the training set, scores each on the validation set, and picks the configuration with the best validation score.

Methods include manual tweaking, grid search, random search, and [Bayesian optimization](/wiki/bayesian_optimization). All share the same risk: every comparison spends a little of the validation set's budget, and after enough comparisons the best score may be a lucky outlier. Ways to push back: keep tuning rounds modest, use k-fold [cross-validation](/wiki/cross_validation) so each configuration is scored on multiple folds, or use nested cross-validation when stakes are high. Always keep a final test set that is never touched until the model is frozen.
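A grid search over a single hyperparameter might look like the sketch below. The one-dimensional ridge regression, the synthetic data, and the candidate penalties are all assumptions picked so the example runs with the standard library alone.

```python
import random


def fit_ridge_slope(xs, ys, penalty):
    """Closed-form slope for 1-D ridge regression without an intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


def mse(xs, ys, slope):
    """Mean squared error of the line y = slope * x."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(60)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]  # synthetic data, true slope 2.0
train_x, train_y = xs[:40], ys[:40]
val_x, val_y = xs[40:], ys[40:]

# Fit each candidate on the training set, score it on the validation set.
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
val_scores = {
    penalty: mse(val_x, val_y, fit_ridge_slope(train_x, train_y, penalty))
    for penalty in candidates
}
best_penalty = min(val_scores, key=val_scores.get)
print(best_penalty, round(val_scores[best_penalty], 4))
```

The winning penalty is chosen on validation data only. A held-out test set would be scored once, after this choice is frozen.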

## Can a model overfit to the validation set?

The validation set is not magic. If you compare hundreds of variants against it and pick the winner, you can fit the noise in the validation set the same way the model can fit noise in the training set. The result looks great on validation, scores worse on test, and worse still in production.

Typical signs are a steady gap between validation and test scores, or configurations that win by tiny margins which disappear on a fresh hold-out. Common defenses include limiting the number of comparisons, using cross-validation instead of a single hold-out, and reserving the test set for the very end of the project. Competition platforms like [Kaggle](/wiki/kaggle) split a public leaderboard from a private leaderboard for the same reason: the public score acts like a validation set that participants can probe, while the private score is the test set that decides the final ranking.
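The effect is easy to reproduce in a simulation. In this sketch every "model" is a coin flip with a true accuracy of 50%, yet the best of 200 looks noticeably better than chance on a 100-example validation set. The set sizes and model count are arbitrary assumptions.

```python
import random

rng = random.Random(0)
n_val, n_test, n_models = 100, 5_000, 200

val_labels = [rng.getrandbits(1) for _ in range(n_val)]
test_labels = [rng.getrandbits(1) for _ in range(n_test)]


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)


# Each "model" guesses at random, so none is genuinely better than another.
best_val, winner_test = 0.0, 0.0
for _ in range(n_models):
    val_acc = accuracy([rng.getrandbits(1) for _ in range(n_val)], val_labels)
    test_acc = accuracy([rng.getrandbits(1) for _ in range(n_test)], test_labels)
    if val_acc > best_val:  # select the winner by validation score alone
        best_val, winner_test = val_acc, test_acc

print(f"best validation accuracy: {best_val:.2f}")
print(f"same model on test set:   {winner_test:.2f}")
```

The winner's validation score is inflated by selection alone, while its test score stays near 50%. The more variants compared, the larger this gap tends to be.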

## Which metrics are used to evaluate the validation set?

The metric used for validation depends on the task.

| Metric | Used for |
|---|---|
| [Accuracy](/wiki/accuracy) | Balanced classification; misleading on imbalanced data |
| [Precision](/wiki/precision) and [recall](/wiki/recall) | Classification with class imbalance |
| [F1 score](/wiki/f1_score) | Imbalanced classification when one number must balance both; the harmonic mean of precision and recall |
| AUC-ROC | Ranking quality of a binary classifier |
| Mean squared error | [Regression](/wiki/regression); penalizes large errors heavily |
| Mean absolute error | Regression; more robust to outliers than MSE |
| [Cross-entropy loss](/wiki/cross_entropy) | Probabilistic classification; standard objective for [neural networks](/wiki/neural_network) |

Good practice is to pick the validation metric to match the business or scientific goal. A model that wins on raw accuracy can be the wrong choice if the cost of false negatives is much higher than the cost of false positives.
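To make the point concrete, the sketch below computes four classification metrics by hand on a hypothetical imbalanced validation set. The function name and the label values are illustrative assumptions.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0  # harmonic mean of the two
    return accuracy, precision, recall, f1


# Hypothetical validation set: 5 positives among 100 examples.
y_true = [1] * 5 + [0] * 95
y_pred = [1, 1, 0, 0, 0] + [1] + [0] * 94  # 2 true positives, 1 false positive
accuracy, precision, recall, f1 = classification_metrics(y_true, y_pred)
print(accuracy, round(precision, 3), recall, f1)  # 0.96 0.667 0.4 0.5
```

Accuracy of 96% looks strong, yet the model misses three of the five positives. Recall and F1 expose that, which matters when false negatives are the costly error.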

## What does a standard validation workflow look like?

A standard supervised learning workflow looks roughly like this:

1. Collect and clean a [labeled dataset](/wiki/labeled_data).
2. Split into training, validation, and test sets, with stratification if classes are imbalanced.
3. Train a baseline model on the training set.
4. Score on the validation set. Adjust hyperparameters, architecture, features, or preprocessing.
5. Optionally swap the single hold-out for k-fold cross-validation, especially on small datasets.
6. Use early stopping based on validation loss to avoid overtraining.
7. Once the model is frozen, run a single evaluation on the test set and report that number.

The order matters. Touching the test set before step 7, even casually, breaks the guarantee that the test score is unbiased.

## Explain like I'm 5 (ELI5)

Imagine you are studying for a math test. The training set is your homework: you do problems and check the answers in the back of the book so you can learn. The validation set is a practice quiz your teacher gives you the day before the real test. You can see how you did, study a bit more, change how you take notes. The real test is the test set: you only get one shot, and that score is what counts.

If you peek at the real test ahead of time, your score does not really mean anything anymore. That is why machine learning practitioners keep the test set locked away and use the validation set for all the in-between checking.

## References

1. Wikipedia. "Training, validation, and test data sets." https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets
2. Wikipedia. "Cross-validation (statistics)." https://en.wikipedia.org/wiki/Cross-validation_(statistics)
3. Google Developers. "Machine Learning Glossary." https://developers.google.com/machine-learning/glossary
4. scikit-learn developers. "Cross-validation: evaluating estimator performance." https://scikit-learn.org/stable/modules/cross_validation.html
5. Wikipedia. "Early stopping." https://en.wikipedia.org/wiki/Early_stopping
6. Google Developers. "Machine Learning Glossary: ML Fundamentals." https://developers.google.com/machine-learning/glossary/fundamentals
7. Lightly. "Train Test Validation Split: Best Practices and Examples." https://www.lightly.ai/blog/train-test-validation-split