Validation loss
Last reviewed
May 11, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 2,233 words
See also: Machine learning terms
Validation loss in machine learning is the value of a model's loss function computed on the validation set, a portion of data held out from training. It is the main signal for judging whether a neural network is still learning useful patterns or has begun memorizing its training data. Practitioners track validation loss after every epoch, plot it next to training loss, and use the resulting curves to decide when to stop training, which hyperparameters to keep, and which model checkpoint to deploy.
The number is computed like training loss. The only difference is the data: validation loss is averaged over examples the model has not seen during weight updates. That single change turns the loss into an estimate of how well the model will generalize.
A machine learning model is trained by minimizing a loss function on a training set. The optimizer, typically stochastic gradient descent or a variant like Adam, nudges the weights in the direction that lowers this number. A model with enough capacity can drive training loss arbitrarily low by memorizing noise and idiosyncrasies of the training examples. This is overfitting, and it produces a model that scores well on its training set but poorly on anything else. Validation loss catches it. By holding out a portion of labeled data and never letting the optimizer touch it, you get a clean estimate of how the current weights would perform on inputs the model has not memorized.
The validation set sits between two other splits. The training set is what the optimizer updates against. The test set is touched only once, at the end, to give a final unbiased estimate of generalization. The validation set is used repeatedly during development, both for stopping decisions and for choosing between configurations. Common splits are 70/15/15 or 80/10/10 for training, validation, and test. With smaller datasets, k-fold cross-validation rotates the validation slice through the data so every example acts as validation exactly once.
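As a minimal sketch of the three-way split described above, the following shuffles example indices and cuts them 80/10/10. The function name `split_indices`, its parameters, and the fixed seed are illustrative assumptions, not part of any particular library.

```python
import random

def split_indices(n, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle indices 0..n-1 and cut them into train/validation/test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_train = int(round(n * train_frac))
    n_val = int(round(n * val_frac))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains becomes the test set
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

A plain random split like this assumes the examples are independent. Time series or grouped data usually need a split that respects that structure.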
Validation loss uses the same formula as training loss, applied over different examples. For a validation set with M examples,
$$\text{Validation loss} = \frac{1}{M} \sum_{i=1}^{M} L(y_i, \hat{y}_i)$$
where $y_i$ is the true label, $\hat{y}_i$ is the model's prediction, and $L$ is the chosen loss function. The choice of $L$ depends on the problem:
| Task | Typical loss |
|---|---|
| Binary classification | Binary cross-entropy |
| Multi-class classification | Categorical cross-entropy |
| Regression with continuous targets | Mean squared error or mean absolute error |
| Ranking and retrieval | Pairwise or listwise ranking losses |
| Sequence modeling (language) | Token level cross-entropy, often reported as perplexity |
For cross-entropy with binary labels, the per example loss is L(y, p) = -(y log p + (1 - y) log(1 - p)), where p is the predicted probability of the positive class. The validation loss is the mean of these values over the held out set.
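The binary cross-entropy case can be written out directly. This sketch uses only the standard library. The clamp value `eps` and the sample labels and probabilities are assumptions added for illustration.

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """Per-example loss L(y, p) = -(y log p + (1 - y) log(1 - p))."""
    p = min(max(p, eps), 1.0 - eps)  # clamp so log(0) never occurs
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def validation_loss(labels, probs):
    """Mean per-example loss over the held-out examples."""
    losses = [binary_cross_entropy(y, p) for y, p in zip(labels, probs)]
    return sum(losses) / len(losses)

# Made-up validation labels and predicted positive-class probabilities.
val_labels = [1, 0, 1, 0]
val_probs = [0.9, 0.2, 0.6, 0.4]
print(validation_loss(val_labels, val_probs))
```

Confident correct predictions (0.9 for a positive) contribute little to the mean, while hesitant ones (0.6 for a positive) contribute more.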
The model is put in evaluation mode while validation loss is computed: dropout is disabled, batch normalization layers use running statistics, and gradients are not computed. The forward pass is the same as inference, so validation loss is a faithful estimate of inference time behavior.
Training and validation loss are usually plotted on the same axes, with epochs on the horizontal axis. The shape of these two curves is the most informative diagnostic in supervised learning.
At the start, both curves are high. As the optimizer makes progress, both fall together. The gap between them, called the generalization gap, gives a rough sense of how much the model has overfit so far. A small gap means training and validation errors track each other; a large and growing gap means the model is fitting patterns specific to the training set that do not transfer.
Four common patterns show up in the loss curves:
| Pattern | What it means |
|---|---|
| Both losses decrease and level off near each other | Good fit. The model is learning generalizable patterns and the chosen capacity is appropriate. |
| Training loss keeps falling while validation loss rises | Classic overfitting. The model is memorizing training examples. |
| Both losses stay high | Underfitting. The model lacks capacity, the features are weak, or the learning rate is wrong. |
| Validation loss noisily oscillates around a falling trend | Often a learning rate that is slightly too high, a small validation set, or a non representative split. |
The overfitting case is where validation loss proves most useful in practice. The validation curve typically traces a U shape: it falls along with training loss for a while, reaches a minimum, then climbs back up while training loss continues to descend. The bottom of that U is the point of best generalization for the current configuration. Everything to the right of it is the model getting worse at the actual job, even though training loss looks better.
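Locating the bottom of the U and the size of the generalization gap takes only a few lines. The loss values below are invented to show the overfitting pattern, and `best_epoch` is a hypothetical helper name.

```python
def best_epoch(val_losses):
    """Index of the epoch with the lowest validation loss (bottom of the U)."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i])

# Invented curves: training loss keeps falling, validation loss turns upward.
train_losses = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18]
val_losses = [1.05, 0.80, 0.65, 0.60, 0.66, 0.75]

# Generalization gap per epoch: validation loss minus training loss.
gaps = [v - t for t, v in zip(train_losses, val_losses)]

i = best_epoch(val_losses)
print("best epoch index:", i, "validation loss there:", val_losses[i])
print("gap at best epoch:", gaps[i], "gap at final epoch:", gaps[-1])
```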
The most common use of validation loss is early stopping. The idea is simple: stop training when validation loss stops improving, even if training loss is still dropping. This cuts training off near the bottom of the U shape rather than letting it climb.
Early stopping is usually implemented with a patience parameter. The trainer keeps a running record of the best validation loss seen so far. After each epoch, if validation loss does not beat the best by some minimum delta, a counter increments. When the counter exceeds the patience value, training halts and the weights from the best epoch are restored. Common patience values sit in the range of 3 to 10 epochs, with 3 to 6 typical for medium sized models. Larger patience tolerates more noise at the cost of extra compute; smaller patience risks stopping during a temporary plateau before a genuine improvement.
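The patience logic can be sketched as a small class. `EarlyStopping`, `step`, and the simulated losses are hypothetical names and data chosen for illustration. This version stops as soon as the counter reaches `patience`, while some implementations use a strict comparison instead. Restoring weights is left as a comment because it depends on the training framework.

```python
class EarlyStopping:
    """Track the best validation loss and signal when patience runs out."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # Improvement by at least min_delta: remember it and reset the counter.
            # A real loop would also save a checkpoint of the weights here.
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Simulated per-epoch validation losses standing in for a real training loop.
simulated_val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.60, 0.62, 0.65]
stopper = EarlyStopping(patience=3, min_delta=0.01)
for epoch, loss in enumerate(simulated_val_losses):
    if stopper.step(epoch, loss):
        break

print("stopped at epoch", epoch, "best epoch", stopper.best_epoch)
```

After the loop, a trainer would reload the checkpoint saved at `stopper.best_epoch` rather than keep the final weights.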
A few practical points show up repeatedly:
Beyond early stopping, validation loss is the score by which hyperparameters are compared. When you run a grid search or random search over learning rate, batch size, weight decay, depth, width, dropout rate, and so on, the winning configuration is the one with the lowest validation loss (or the highest validation accuracy, when that is the headline metric).
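A random search that keeps the configuration with the lowest validation loss might look like the sketch below. `train_and_validate` is a hypothetical stand-in for a full training run. Its toy formula, the search ranges, and the seed are assumptions made so the example runs on its own.

```python
import math
import random

def train_and_validate(config):
    """Hypothetical stand-in for a training run; returns a validation loss.

    The formula is a toy bowl with its minimum near lr=1e-3, dropout=0.3.
    """
    lr_term = (math.log10(config["lr"]) + 3.0) ** 2
    dropout_term = (config["dropout"] - 0.3) ** 2
    return 0.5 + lr_term + dropout_term

def random_search(n_trials, seed=0):
    """Sample configurations at random and return the (loss, config) pair with the lowest loss."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        config = {
            "lr": 10 ** rng.uniform(-5, -1),   # learning rate sampled on a log scale
            "dropout": rng.uniform(0.0, 0.6),
        }
        results.append((train_and_validate(config), config))
    return min(results, key=lambda r: r[0])

best_loss, best_config = random_search(n_trials=50)
print(best_loss, best_config)
```

Sampling the learning rate on a log scale is a common choice because useful values often span several orders of magnitude.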
The Google Research deep learning tuning playbook recommends starting with hyperparameters that govern how the model learns, such as learning rate and optimizer momentum, and notes that these interact strongly with batch size and regularization strength. Validation loss is the arbiter that lets you compare otherwise hard to compare runs on the same footing.
For large training jobs, sweeping every combination is too expensive. Practitioners often use validation loss in two stages: a coarse search at small scale, then a finer search at full scale around the best configurations. Bayesian optimization tools such as Optuna and Vizier can take validation loss as their objective and explore the hyperparameter space adaptively.
One caution: every time you select a hyperparameter based on validation loss, you leak a little information from the validation set into the chosen model. After enough rounds, validation loss becomes optimistic about generalization. This is why a held out test set, untouched until the very end, is essential.
Validation loss sometimes comes in lower than training loss. This is not always a bug. There are three common causes.
The first is regularization. Dropout is active during training but disabled during validation, and when an L2 or weight decay penalty is added to the reported training loss it is usually left out of the validation loss. Dropout in particular masks out a fraction of neurons on every training step, forcing the rest of the network to compensate. Training loss reflects a deliberately handicapped network; at validation time, all neurons fire, so the model gets a cleaner look at the data and may report a lower loss. The effect can be large enough to keep the validation curve below the training curve for entire runs.
The second is the timing of the measurement. Training loss is typically averaged over the minibatches inside an epoch, while validation loss is computed once at the end of it. The reported training loss therefore includes the early minibatches, when the model was worse, and on average describes the model about half an epoch earlier than the validation loss does. Shifting the training curve half an epoch to the left usually brings the two into line.
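One way to apply the half-epoch shift when plotting is to place each epoch's training loss at the epoch midpoint and its validation loss at the epoch end. The helper name `align_for_plotting` and the sample values are illustrative assumptions.

```python
def align_for_plotting(train_losses, val_losses):
    """Return (x, loss) points with training loss placed at each epoch's midpoint."""
    # Epoch k (0-based) spans x in [k, k + 1]; the running average sits near k + 0.5.
    train_points = [(epoch + 0.5, loss) for epoch, loss in enumerate(train_losses)]
    # Validation loss is measured after the epoch finishes, at x = k + 1.
    val_points = [(epoch + 1.0, loss) for epoch, loss in enumerate(val_losses)]
    return train_points, val_points

train_points, val_points = align_for_plotting([0.9, 0.6, 0.45], [0.7, 0.5, 0.42])
print(train_points)
print(val_points)
```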
The third is data leakage or a mismatched split. If a chunk of the training data was accidentally copied into the validation set, or if the validation distribution is easier than the training distribution, the model will look artificially good on validation. The other two causes are benign; data leakage produces a model that fails in production, so it is worth checking carefully.
Dropout is the most common reason the two curves behave oddly. During training, a randomly chosen fraction of activations is zeroed out on each forward pass, so the network must learn redundant representations. During validation, dropout is switched off and the full network runs at full capacity, often yielding lower loss.
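The difference between the two modes can be seen in a toy dropout function. This sketch assumes the common "inverted dropout" convention, in which surviving activations are scaled up during training so that nothing needs rescaling at validation time. The scaling step and the function signature are assumptions, not details taken from the text above.

```python
import random

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a random fraction while training, identity otherwise."""
    if not training or rate == 0.0:
        return list(activations)  # validation: every activation passes through
    keep = 1.0 - rate
    # Survivors are scaled by 1/keep so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10
train_out = dropout(hidden, rate=0.5, training=True, rng=rng)
eval_out = dropout(hidden, rate=0.5, training=False, rng=rng)
print(train_out)  # a mix of 0.0 and 2.0
print(eval_out)   # unchanged copy of hidden
```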
Dropout deliberately trades off training loss against validation loss. Up to a point, a higher dropout rate tends to push validation loss down by suppressing overfitting, even as it raises training loss; past that point the network underfits and both losses rise. If lowering dropout makes validation loss improve faster but plateau higher, that is the regularization knob doing its job. The right setting is the one that minimizes validation loss at convergence, not the one that makes training fastest.
When the dataset is small, a single train/validation split is noisy: the reported validation loss depends heavily on which examples landed in the validation slice. K-fold cross-validation addresses this by splitting the data into k folds, training k models with a different fold as validation each time, and averaging the k validation losses.
The price is compute. K-fold with k=5 trains the model five times; with k=10, ten times. For deep networks this is often impractical, and a single held out validation set is used instead, sometimes combined with multiple random seeds to estimate variance. Smaller models, gradient boosted trees, and tabular pipelines use k-fold cross-validation routinely because the per fold cost is low.
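The k-fold procedure described above can be sketched without any library. `k_fold_indices`, `cross_validated_loss`, and the toy mean-predicting "model" are hypothetical names and stand-ins. Real data would normally be shuffled before the folds are cut.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, val_indices) so each example is validation exactly once."""
    # Spread any remainder over the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

def cross_validated_loss(n, k, fit_and_score):
    """Average the k validation losses; fit_and_score is supplied by the caller."""
    losses = [fit_and_score(train, val) for train, val in k_fold_indices(n, k)]
    return sum(losses) / len(losses)

# Toy regression targets and a toy model that predicts the training mean.
targets = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

def mean_model_mse(train, val):
    """Fit on the training fold (take its mean), score with MSE on the validation fold."""
    prediction = sum(targets[i] for i in train) / len(train)
    return sum((targets[i] - prediction) ** 2 for i in val) / len(val)

print(cross_validated_loss(len(targets), k=3, fit_and_score=mean_model_mse))
```

Each call to `fit_and_score` corresponds to one full training run, which is where the k-fold compute cost comes from.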
A few rules of thumb help when looking at training plots:
Loss is not the only metric, and a model with the lowest validation loss is not always the best deployable model. For classification, validation accuracy, F1, AUROC, and calibration may matter more depending on the use case. For language models, perplexity is loss exponentiated, but downstream task accuracy on benchmarks like MMLU or HumanEval is what users care about. Validation loss is the development workhorse, not the final verdict.
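The link between loss and perplexity is a single exponentiation. The sketch below assumes the validation loss is the mean per-token cross-entropy measured in nats (natural logarithm). A loss reported in bits would need base 2 instead.

```python
import math

def perplexity(mean_token_cross_entropy):
    """Perplexity as exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_token_cross_entropy)

# A model that is uniformly unsure among 50 tokens has loss log(50) nats.
print(perplexity(math.log(50)))
```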
Imagine practicing for a spelling test by going over the same word list every night. After a few nights you can recite the list perfectly, but the teacher gives you new words on the test and you struggle. Validation loss is what happens when you try a few new words each night and track how many you miss. If your score on the new words stops improving, you have memorized the practice list rather than learned how to spell. Time to stop drilling and try something different.