# Validation loss

> Source: https://aiwiki.ai/wiki/validation_loss
> Updated: 2026-06-24
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## Introduction

Validation loss is the value of a [model](/wiki/model)'s [loss function](/wiki/loss_function) measured on a held-out [validation set](/wiki/validation_set), that is, on data the model never sees during weight updates. It is the primary signal practitioners use to judge how well a [neural network](/wiki/neural_network) will [generalize](/wiki/generalization), as opposed to how well it has memorized its [training data](/wiki/training_data). Because it is computed on unseen examples, validation loss is the standard tool for detecting [overfitting](/wiki/overfitting), driving [early stopping](/wiki/early_stopping), tuning [hyperparameters](/wiki/hyperparameter), and selecting which checkpoint to deploy. When training loss keeps falling while validation loss starts rising, the divergence is the textbook indicator that a model is overfitting its training set [1].

Validation loss is computed with exactly the same formula as training loss; the only difference is the data. Validation loss is averaged over examples the optimizer has not updated against, and that single change turns the loss into an estimate of how well the model will perform on new inputs. Practitioners track it after every [epoch](/wiki/epoch), plot it next to training loss, and read the resulting curves to decide when to stop training, which hyperparameters to keep, and which model checkpoint to ship.

## What is validation loss?

A [machine learning model](/wiki/machine_learning_model) is trained by minimizing a loss function on a [training set](/wiki/training_set). The optimizer, typically [stochastic gradient descent](/wiki/stochastic_gradient_descent) or a variant like [Adam](/wiki/adam_optimizer), nudges the [weights](/wiki/weights) in the direction that lowers this number. A model with enough capacity can drive training loss arbitrarily low by memorizing noise and idiosyncrasies of the training examples. This is overfitting, and it produces a model that scores well on its training set but poorly on anything else. Validation loss catches it. By holding out a portion of labeled data and never letting the optimizer touch it, you get a clean estimate of how the current weights would perform on inputs the model has not memorized.

The validation set sits between two other splits. The training set is what the optimizer updates against. The [test set](/wiki/test_set) is touched only once, at the end, to give a final unbiased estimate of generalization. The validation set is used repeatedly during development, both for stopping decisions and for choosing between configurations. As the textbook *Deep Learning* by Goodfellow, Bengio, and Courville frames it, the validation set guides the choice of hyperparameters while the test set must stay untouched so that the final generalization estimate is not biased [2]. Common splits are 70/15/15 or 80/10/10 for training, validation, and test. With smaller datasets, [k-fold cross-validation](/wiki/cross_validation) rotates the validation slice through the data so every example acts as validation exactly once [3].
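As a minimal sketch of the 80/10/10 split mentioned above, the following shuffles example indices and cuts them into three disjoint lists. The function name `split_indices` and its parameters are hypothetical, chosen for illustration; real pipelines often use a library helper and may need stratified or grouped splits instead.

```python
import random

def split_indices(n_examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle example indices and cut them into train/validation/test lists."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is reproducible
    n_val = round(n_examples * val_frac)
    n_test = round(n_examples * test_frac)
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]      # everything left over is training data
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 800 100 100
```

The optimizer would only ever see `train_idx`; `val_idx` is reused throughout development, and `test_idx` is set aside until the final evaluation.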

## How is validation loss computed?

Validation loss uses the same formula as training loss, applied over different examples. For a validation set with M examples,

$$\text{Validation loss} = \frac{1}{M} \sum_{i=1}^{M} L(y_i, \hat{y}_i)$$

where $y_i$ is the true label, $\hat{y}_i$ is the model's prediction, and $L$ is the chosen loss function. The choice of $L$ depends on the problem:

| Task | Typical loss |
|------|--------------|
| Binary classification | [Binary cross-entropy](/wiki/cross_entropy) |
| Multi-class classification | Categorical cross-entropy |
| Regression with continuous targets | [Mean squared error](/wiki/mean_squared_error) or mean absolute error |
| Ranking and retrieval | Pairwise or listwise ranking losses |
| Sequence modeling (language) | Token-level cross-entropy, often reported as [perplexity](/wiki/perplexity) |

For cross-entropy with binary labels, the per-example loss is $L(y, p) = -\left(y \log p + (1 - y) \log(1 - p)\right)$, where $p$ is the predicted probability of the positive class. The validation loss is the mean of these values over the held-out set.
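The binary cross-entropy formula can be checked by hand on a tiny validation set. The sketch below uses four made-up labels and predicted probabilities; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero and is not part of the formula itself.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Per-example loss: -(y log p + (1 - y) log(1 - p))."""
    p = min(max(p_pred, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def validation_loss(labels, probs):
    """Mean per-example loss over the held-out examples."""
    losses = [binary_cross_entropy(y, p) for y, p in zip(labels, probs)]
    return sum(losses) / len(losses)

# Hypothetical validation labels and predicted positive-class probabilities.
val_labels = [1, 0, 1, 0]
val_probs = [0.9, 0.2, 0.6, 0.5]
print(round(validation_loss(val_labels, val_probs), 4))  # 0.3831
```

The confident, correct predictions (0.9 for a positive, 0.2 for a negative) contribute little to the mean, while the uncertain prediction of 0.5 contributes the most.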

The model is put in evaluation mode while validation loss is computed: [dropout](/wiki/dropout) is disabled, [batch normalization](/wiki/batch_normalization) layers use running statistics, and gradients are not computed. The forward pass is the same as inference, so validation loss is a faithful estimate of inference-time behavior.

## How does training loss differ from validation loss?

Training and validation loss are usually plotted on the same axes, with epochs on the horizontal axis. The shape of these two curves is the most informative diagnostic in supervised learning [4].

At the start, both curves are high. As the optimizer makes progress, both fall together. The gap between them, called the [generalization gap](/wiki/generalization_gap), gives a rough sense of how much the model has overfit so far. A small gap means training and validation errors track each other; a large and growing gap means the model is fitting patterns specific to the training set that do not transfer.

Four common patterns show up in the loss curves:

| Pattern | What it means |
|---------|---------------|
| Both losses decrease and level off near each other | Good fit. The model is learning generalizable patterns and the chosen capacity is appropriate. |
| Training loss keeps falling while validation loss rises | Classic overfitting. The model is memorizing training examples. |
| Both losses stay high | Underfitting. The model lacks capacity, the features are weak, or the learning rate is wrong. |
| Validation loss noisily oscillates around a falling trend | Often a learning rate that is slightly too high, a small validation set, or a non-representative split. |

The overfitting case is where validation loss earns its keep in practice. The validation curve typically traces a U shape: it falls along with training loss for a while, reaches a minimum, then climbs back up while training loss continues to descend. Google's Machine Learning Crash Course describes exactly this signature: "the training loss curve appears to converge, but the validation loss begins to rise after a certain number of training steps," and identifies the cause as the model overfitting the training set [1]. The bottom of that U is the point of best generalization for the current configuration. Everything to the right of it is the model getting worse at the actual job, even though training loss looks better.
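To make the U shape concrete, the short sketch below locates the bottom of a validation curve and tracks the generalization gap. The loss values are invented for illustration and epochs are counted from zero.

```python
# Hypothetical per-epoch losses showing the classic overfitting signature.
train_loss = [1.00, 0.70, 0.50, 0.38, 0.30, 0.24, 0.19, 0.15]
val_loss = [1.05, 0.80, 0.65, 0.58, 0.56, 0.59, 0.64, 0.71]

# Bottom of the U: the epoch with the lowest validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)

# Generalization gap per epoch: validation loss minus training loss.
gaps = [v - t for t, v in zip(train_loss, val_loss)]

print(best_epoch)          # 4
print(round(gaps[4], 2))   # 0.26
print(round(gaps[-1], 2))  # 0.56
```

Training loss keeps improving after epoch 4, but the gap more than doubles by the final epoch, which is the divergence described above.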

## How is validation loss used to detect overfitting and stop training?

The most common use of validation loss is early stopping. The idea is simple: stop training when validation loss stops improving, even if training loss is still dropping. This cuts training off near the bottom of the U shape rather than letting it climb. In the validation-based formulation, the error on the validation set is used as a proxy for the generalization error in deciding when overfitting has begun, and training halts once that error stops improving [5].

Early stopping is usually implemented with a patience parameter. The trainer keeps a running record of the best validation loss seen so far. After each epoch, if validation loss does not beat the best by some minimum delta, a counter increments. When the counter exceeds the patience value, training halts and the weights from the best epoch are restored. Common patience values sit in the range of 3 to 10 epochs, with 3 to 6 typical for medium-sized models. Larger patience tolerates more noise at the cost of extra compute; smaller patience risks stopping during a temporary plateau before a genuine improvement.

A few practical points show up repeatedly:

- Validation loss is rarely smooth. It dips and rises from epoch to epoch even when the underlying trend is downward. Patience exists to keep the trainer from reacting to single-epoch noise.
- Save model weights every time validation loss hits a new minimum. When training stops, the saved checkpoint is the model you deploy, not the final epoch [6].
- Some teams smooth the validation curve with an exponential moving average before applying the patience rule, dampening single-epoch fluctuations.
- Early stopping itself is a form of regularization, defined as "a form of regularization used to avoid overfitting when training a model with an iterative method, such as gradient descent" [5]. Geoffrey Hinton is widely quoted as calling it a "beautiful free lunch," because it costs almost nothing to implement and often matches the effect of explicit penalties on the weights [7].
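The patience rule can be sketched as a function that replays a recorded validation curve. The name `early_stopping_epoch` and its return format are hypothetical. This sketch stops as soon as the counter of non-improving epochs reaches `patience`, which is one common convention; some implementations wait one epoch longer.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Replay a validation curve and return (stop_epoch, best_epoch, best_loss)."""
    best_loss = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: a real trainer would also save a checkpoint here.
            best_loss, best_epoch = loss, epoch
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch, best_loss
    # The curve ended before patience ran out.
    return len(val_losses) - 1, best_epoch, best_loss

# Hypothetical noisy validation curve (epochs counted from zero).
curve = [0.90, 0.70, 0.60, 0.62, 0.61, 0.63, 0.59]
print(early_stopping_epoch(curve, patience=3))  # (5, 2, 0.6)
print(early_stopping_epoch(curve, patience=4))  # (6, 6, 0.59)
```

With a patience of 3 the run halts at epoch 5 and restores the epoch 2 weights; one extra epoch of patience lets it reach the lower loss at epoch 6, which illustrates the trade-off between compute and tolerance for noise.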

## How is validation loss used for hyperparameter selection?

Beyond early stopping, validation loss is the score by which hyperparameters are compared. When you run a grid search or random search over [learning rate](/wiki/learning_rate), [batch size](/wiki/batch_size), [weight decay](/wiki/weight_decay), depth, width, dropout rate, and so on, the winning configuration is the one with the lowest validation loss (or the highest validation accuracy, when that is the headline metric).

The Google Research deep learning tuning playbook recommends starting with hyperparameters that govern how the model learns, such as learning rate and optimizer momentum, and notes that these interact strongly with batch size and regularization strength [8]. Validation loss is the arbiter that lets you compare otherwise hard-to-compare runs on the same footing.

For large training jobs, sweeping every combination is too expensive. Practitioners often use validation loss in two stages: a coarse search at small scale, then a finer search at full scale around the best configurations. Bayesian optimization tools such as Optuna and Vizier can take validation loss as their objective and explore the hyperparameter space adaptively.
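A random search that keeps the configuration with the lowest validation loss can be sketched in a few lines. Here `toy_validation_loss` is a hypothetical stand-in for "train a model with this configuration and return its best validation loss"; its shape, and the search ranges, are assumptions made purely so the example runs.

```python
import math
import random

def toy_validation_loss(learning_rate, dropout):
    # Hypothetical stand-in for a full training run. By construction it is
    # lowest (0.3) at learning_rate = 1e-3 and dropout = 0.2.
    return (math.log10(learning_rate) + 3.0) ** 2 + (dropout - 0.2) ** 2 + 0.3

def random_search(n_trials=20, seed=0):
    """Sample configurations at random and return (best_loss, best_config)."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        config = {
            "learning_rate": 10 ** rng.uniform(-5, -1),  # log-uniform sampling
            "dropout": rng.uniform(0.0, 0.5),
        }
        trials.append((toy_validation_loss(**config), config))
    return min(trials, key=lambda trial: trial[0])

best_loss, best_config = random_search()
print(best_loss, best_config)
```

Sampling the learning rate on a log scale is a common choice because plausible values span several orders of magnitude.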

One caution: every time you select a hyperparameter based on validation loss, you leak a little information from the validation set into the chosen model. After enough rounds, validation loss becomes optimistic about generalization. This is why a held-out test set, untouched until the very end, is essential [2].

## Why is validation loss sometimes lower than training loss?

Validation loss sometimes comes in lower than training loss. This is not always a bug. There are three common causes [9].

The first is regularization. Dropout is active during training but disabled during validation, and penalties such as an L2 weight-decay term, when they are folded into the reported training loss, are not part of the validation loss. Dropout in particular masks out a fraction of neurons on every training step, forcing the rest of the network to compensate. Training loss therefore reflects a deliberately handicapped network; at validation time, all neurons fire, so the model gets a cleaner look at the data and may report a lower loss. The effect can be large enough to keep the validation curve below the training curve for entire runs.

The second is the timing of the measurement. Training loss is typically averaged over the minibatches inside an epoch, while validation loss is computed at the end. The reported training loss is weighted toward the start of the epoch, when the model was worse, while validation loss reflects the model at the end. Shifting the training curve back by half an epoch usually makes the two align [9].
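The timing effect can be reproduced with an idealized simulation. The sketch assumes a smooth, hypothetical loss curve `true_loss` that is identical for training and validation data (no generalization gap at all), so any difference between the two reported numbers comes from when they are measured.

```python
import math

STEPS_PER_EPOCH = 100

def true_loss(step):
    # Hypothetical loss of the model after `step` updates, assumed to be the
    # same on training and validation data.
    return math.exp(-step / 200.0)

def epoch_report(epoch):
    first = epoch * STEPS_PER_EPOCH
    steps = range(first, first + STEPS_PER_EPOCH)
    # Training loss as usually reported: mean over the epoch's minibatches.
    reported_train = sum(true_loss(s) for s in steps) / STEPS_PER_EPOCH
    # Validation loss: measured once, with the end-of-epoch weights.
    end_of_epoch_val = true_loss(first + STEPS_PER_EPOCH)
    # Loss of the model halfway through the epoch, for comparison.
    mid_epoch = true_loss(first + STEPS_PER_EPOCH / 2)
    return reported_train, end_of_epoch_val, mid_epoch

train, val, mid = epoch_report(0)
print(f"train={train:.3f} val={val:.3f} mid-epoch={mid:.3f}")
```

Under these assumptions the reported training loss sits well above the end-of-epoch validation loss and close to the mid-epoch value, which is why a half-epoch shift brings the curves into line.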

The third is data leakage or a mismatched split. If a chunk of the training data was accidentally copied into the validation set, or if the validation distribution is easier than the training distribution, the model will look artificially good on validation. The other two causes are benign; data leakage produces a model that fails in production, so it is worth checking carefully.

## What effect does dropout have on the two curves?

Dropout is the most common reason the two curves behave oddly. During training, a randomly chosen fraction of activations is zeroed out on each forward pass, so the network must learn redundant representations. During validation, dropout is switched off and the full network runs at full capacity, often yielding lower loss.
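The difference between the two modes can be shown with a small pure-Python sketch of inverted dropout, a common convention in which surviving activations are scaled up during training so that nothing needs rescaling at evaluation time. The function name and signature are hypothetical and do not correspond to any particular library.

```python
import random

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a random subset in training, pass through in eval."""
    if not training or rate == 0.0:
        return list(activations)  # evaluation mode: the full network runs
    keep = 1.0 - rate
    # Each unit survives with probability `keep`; survivors are scaled by
    # 1 / keep so the expected activation matches evaluation mode.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(hidden, rate=0.5, training=True, rng=rng)
eval_out = dropout(hidden, rate=0.5, training=False, rng=rng)
print(train_out)  # each entry is either 0.0 or twice the input
print(eval_out)   # [1.0, 2.0, 3.0, 4.0]
```

The training-mode output is a noisy version of the evaluation-mode output, which is one reason the loss measured during training can be higher than the loss measured on validation data.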

Dropout deliberately trades off training loss against validation loss. A higher dropout rate tends to push validation loss down on average by suppressing overfitting, even if it raises training loss. If lowering dropout makes validation loss improve faster but plateau higher, that is the regularization knob doing its job. The right setting is the one that minimizes validation loss at convergence, not the one that makes training fastest.

## How is validation loss handled with small datasets?

When the dataset is small, a single train/validation split is noisy: the reported validation loss depends heavily on which examples landed in the validation slice. K-fold cross-validation addresses this by splitting the data into k folds, training k models with a different fold as validation each time, and averaging the k validation losses [10].
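The k-fold procedure can be sketched as an index generator plus an averaging loop. `k_fold_indices`, `cross_validated_loss`, and the `fit_and_score` callback are hypothetical names; the "model" in the usage example simply predicts the mean training target. The folds here are contiguous, so real data should be shuffled first.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_indices, val_indices); each example is validation once."""
    indices = list(range(n_examples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

def cross_validated_loss(n_examples, k, fit_and_score):
    """Average the validation loss returned by `fit_and_score` over k folds."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_examples, k)]
    return sum(scores) / len(scores)

targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

def mean_baseline_mse(train, val):
    # Hypothetical "model": predict the mean training target, score with MSE.
    prediction = sum(targets[i] for i in train) / len(train)
    return sum((targets[i] - prediction) ** 2 for i in val) / len(val)

print(cross_validated_loss(len(targets), k=3, fit_and_score=mean_baseline_mse))  # 6.25
```

The per-fold losses in this toy example are 9.25, 0.25, and 9.25, which shows how much a single split can mislead on a small dataset.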

The price is compute. K-fold with k=5 trains the model five times; with k=10, ten times. For deep networks this is often impractical, and a single held-out validation set is used instead, sometimes combined with multiple random seeds to estimate variance. Smaller models, gradient-boosted trees, and tabular pipelines use k-fold cross-validation routinely because the per-fold cost is low.

## What is deep double descent and how does it change the U-shaped curve?

The U-shaped validation curve, where loss falls, bottoms out, then rises with overfitting, is the classical bias-variance picture, but it is not the whole story for very large models. The deep double descent phenomenon, documented by Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever in a 2019 paper later published at ICLR 2020, shows that beyond a certain point validation error can fall a second time. The paper's headline finding is that "a variety of modern deep learning tasks exhibit a 'double-descent' phenomenon where, as we increase model size, performance first gets worse and then gets better" [11].

The peak of that worsening sits at the interpolation threshold, the point where a model is just barely large enough or trained just long enough to fit the training set almost perfectly. OpenAI, which co-published the work, summarized the regime this way: "The peak of test error appears systematically when models are just barely able to fit the train set" [12]. Below that threshold, the classical U holds and bigger means worse once you pass the optimum; above it, in the over-parameterized regime, adding capacity makes validation loss fall again, which is why enormous neural networks can generalize well despite having far more parameters than training examples.

Three facets of the phenomenon matter for anyone reading validation curves:

| Form | What happens |
|------|--------------|
| Model-wise double descent | Holding training time fixed, validation error rises to a peak at the interpolation threshold as model width grows, then descends again in the over-parameterized regime. |
| Epoch-wise double descent | Holding model size fixed, validation error can rise then fall a second time as training proceeds, so "double descent occurs not just as a function of model size, but also as a function of the number of training epochs" [11]. |
| Sample-wise non-monotonicity | Near the threshold, adding data can move the peak; the authors found regimes where "increasing (even quadrupling) the number of train samples actually hurts test performance" [11]. |

To unify these effects, the authors introduce "a new complexity measure we call the effective model complexity" and show that the double descent peak appears as a function of it [11]. The practical upshot is a caution against over-reading a single rising validation curve: a model in the critical regime near the interpolation threshold can look like it is overfitting yet improve again with more capacity or more training. Epoch-wise double descent in particular means that an early-stopping rule with too little patience can halt on the rise between the two descents and miss the second, lower minimum.
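The interaction between patience and epoch-wise double descent can be illustrated on a synthetic curve. The values below are invented to have two descents and are not taken from the paper; `best_loss_before_stop` is a hypothetical helper that applies a plain patience rule and reports the best loss seen before halting.

```python
def best_loss_before_stop(val_losses, patience):
    """Apply a patience rule and return the best validation loss reached."""
    best, bad_epochs = float("inf"), 0
    for loss in val_losses:
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # training would halt here
    return best

# Synthetic curve: first minimum 0.50, a bump up to 0.66, second minimum 0.40.
curve = [1.00, 0.80, 0.60, 0.50, 0.55, 0.62, 0.66,
         0.60, 0.50, 0.42, 0.40, 0.41, 0.42, 0.43]

print(best_loss_before_stop(curve, patience=2))  # 0.5
print(best_loss_before_stop(curve, patience=6))  # 0.4
```

The short-patience run settles for the first minimum, while the longer one rides out the bump and finds the second. Whether a real run has a second descent at all is not knowable in advance, so this is a reason to inspect curves, not a rule for setting patience.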

## How do you read a real loss curve?

A few rules of thumb help when looking at training plots [4]:

- If validation loss is still trending down at the end of training, train longer.
- If validation loss has plateaued for many epochs while training loss is still falling, you are wasting compute and probably starting to overfit.
- If validation loss is bouncing more than 10 percent epoch to epoch on a small validation set, the set may be too small. Increase its size or rely on cross-validation.
- A sudden spike in validation loss usually means a noisy batch or an exploding gradient. If gradient norms are stable, look for outliers in the validation data.
- A widening gap between training and validation loss is a stronger overfitting signal than the absolute value of validation loss.

Loss is not the only metric, and a model with the lowest validation loss is not always the best deployable model. For classification, validation accuracy, F1, AUROC, and calibration may matter more depending on the use case. For language models, perplexity is loss exponentiated, but downstream task accuracy on benchmarks like MMLU or HumanEval is what users care about. Validation loss is the development workhorse, not the final verdict.
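The relationship between loss and perplexity is a one-line conversion. The sketch assumes the cross-entropy is a per-token mean measured in nats (natural logarithm); if the loss were measured in bits, the base would be 2 instead of e.

```python
import math

def perplexity(mean_token_cross_entropy):
    """Perplexity from a mean per-token cross-entropy measured in nats."""
    return math.exp(mean_token_cross_entropy)

print(round(perplexity(2.0), 2))  # 7.39

# A model that assigns probability 0.25 to every correct token has a loss of
# log(4) nats, so it is as uncertain as a uniform choice among 4 tokens.
uniform_loss = -math.log(0.25)
print(round(perplexity(uniform_loss), 6))  # 4.0
```

Because the exponential is monotonic, ranking checkpoints by validation loss and ranking them by validation perplexity give the same order.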

## Explain like I'm 5

Imagine practicing for a spelling test by going over the same word list every night. After a few nights you can recite the list perfectly, but the teacher gives you new words on the test and you struggle. Validation loss is what happens when you try a few new words each night and track how many you miss. If your score on the new words stops improving, you have memorized the practice list rather than learned how to spell. Time to stop drilling and try something different.

## References

1. Google Machine Learning Crash Course, *Overfitting: Interpreting loss curves*. https://developers.google.com/machine-learning/crash-course/overfitting/interpreting-loss-curves
2. Goodfellow, I., Bengio, Y., and Courville, A., *Deep Learning*, MIT Press, 2016 (Hyperparameters and Validation Sets). https://www.deeplearningbook.org/
3. scikit-learn documentation, *Cross-validation: evaluating estimator performance*. https://scikit-learn.org/stable/modules/cross_validation.html
4. Brownlee, J., *How to use Learning Curves to Diagnose Machine Learning Model Performance*, Machine Learning Mastery. https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
5. Wikipedia, *Early stopping*. https://en.wikipedia.org/wiki/Early_stopping
6. GeeksforGeeks, *Using Early Stopping to Reduce Overfitting in Neural Networks*. https://www.geeksforgeeks.org/deep-learning/using-early-stopping-to-reduce-overfitting-in-neural-networks/
7. Nyandwi, J., *Early Stopping Explained*. https://jeande.hashnode.dev/early-stopping-explained
8. Google Research, *Deep Learning Tuning Playbook*. https://github.com/google-research/tuning_playbook
9. Rosebrock, A., *Why is my validation loss lower than my training loss?*, PyImageSearch, 2019. https://pyimagesearch.com/2019/10/14/why-is-my-validation-loss-lower-than-my-training-loss/
10. Wikipedia, *Cross-validation (statistics)*. https://en.wikipedia.org/wiki/Cross-validation_(statistics)
11. Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I., *Deep Double Descent: Where Bigger Models and More Data Hurt*, arXiv:1912.02292, 2019; published at ICLR 2020. https://arxiv.org/abs/1912.02292
12. OpenAI, *Deep double descent*, 2019. https://openai.com/index/deep-double-descent/