Test loss
Last reviewed
May 11, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 2,171 words
See also: Machine learning terms
Test loss is a metric that measures a model's loss against the test data set. The test data is a separate set of examples from the training data set and the validation data set. Running the model on the test set is the last step in a typical machine learning workflow, done after training and after all hyperparameter choices have already been made. The lower the test loss, the better the model has fit the kind of data it will see in the wild, assuming the test set was sampled from the same distribution as the data the model will face in production.
The number is calculated the same way as training loss or validation loss. You feed the test examples through the trained machine learning model, compute the loss function on each prediction versus its label, and average the per-example losses. What changes is what the number is used for. Training loss steers the optimizer through gradient descent. Validation loss drives model selection during development. Test loss does neither; it is the final number you write down and report.
Test loss is often described as an "unbiased estimate" of generalization performance, but the unbiasedness only holds if the test set was truly held out. The moment you start tweaking the model based on what the test set tells you, the test set has become a second validation set, and your test loss starts drifting optimistic.
Supervised learning frames its goal as minimizing expected risk: the average loss the model would incur if you could draw an infinite number of samples from the true data distribution. That expectation is impossible to compute, because the true distribution is unknown. The training loss is the empirical risk on the training set. The test loss is the empirical risk on a held-out sample not used for fitting or selection, which makes it a reasonable estimator of the true risk.
The Wikipedia article on generalization error puts it cleanly: the test sample is "previously unseen by the algorithm and so represents a random sample from the joint probability distribution." Test loss is a Monte Carlo estimate of generalization error. Its variance shrinks in proportion to 1/n for a test set of n examples, so its standard error shrinks with the square root of the test set size. A 100-example test set gives you a noisy estimate; a 10,000-example one is about ten times tighter.
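A minimal sketch of that scaling, assuming hypothetical per-example losses drawn uniformly from [0, 1) purely for illustration. The helper name `loss_estimate` is not from any library; it reports the mean loss together with its standard error.

```python
import math
import random
import statistics


def loss_estimate(per_example_losses):
    """Return (mean loss, standard error of that mean)."""
    n = len(per_example_losses)
    mean = statistics.fmean(per_example_losses)
    # Standard error of a sample mean: sample standard deviation / sqrt(n).
    standard_error = statistics.stdev(per_example_losses) / math.sqrt(n)
    return mean, standard_error


rng = random.Random(0)
# Stand-in losses (an assumption): uniform values in [0, 1).
small_test_set = [rng.random() for _ in range(100)]
large_test_set = [rng.random() for _ in range(10_000)]

mean_small, se_small = loss_estimate(small_test_set)
mean_large, se_large = loss_estimate(large_test_set)
print(f"n=100:    {mean_small:.3f} +/- {se_small:.4f}")
print(f"n=10000:  {mean_large:.3f} +/- {se_large:.4f}")
```

Going from 100 to 10,000 examples is a 100-fold increase in size, so the standard error should drop by a factor of roughly 10.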
In empirical risk minimization, training loss is what the learner explicitly minimizes, while test loss is a sanity check that the minimization actually generalized. The difference between them, often called the generalization gap, is one of the central objects of study in statistical learning theory.
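The quantities in this section can be written out as follows. The notation is an assumption made for illustration: $f$ is the trained model, $\ell$ the loss function, $\mathcal{D}$ the true data distribution, $(x_i, y_i)$ the $n$ training examples, and $(x'_j, y'_j)$ the $m$ held-out test examples.

```latex
\begin{align}
R(f) &= \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(f(x), y)\big]
  && \text{(expected risk, not computable)} \\
\hat{R}_{\text{train}}(f) &= \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(x_i), y_i\big)
  && \text{(training loss)} \\
\hat{R}_{\text{test}}(f) &= \frac{1}{m}\sum_{j=1}^{m} \ell\big(f(x'_j), y'_j\big)
  && \text{(test loss, an estimate of } R(f)\text{)} \\
R(f) - \hat{R}_{\text{train}}(f) &\approx \hat{R}_{\text{test}}(f) - \hat{R}_{\text{train}}(f)
  && \text{(generalization gap, estimated)}
\end{align}
```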
The computation depends on the problem and the loss function you picked.
| Problem type | Common loss function | What it measures |
|---|---|---|
| Regression | Mean squared error (MSE) | Average of squared differences between predicted and actual values |
| Regression | Mean absolute error (MAE) | Average of absolute differences between predicted and actual values |
| Binary classification | Binary cross-entropy (log loss) | Negative log probability the model assigned to the correct class |
| Multi-class classification | Categorical cross-entropy | Negative log probability assigned to the true class (for soft targets, the target-weighted sum over classes) |
| Probabilistic models | Negative log-likelihood | How likely the observed labels are under the model |
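As a rough illustration of the table, the sketch below computes four of these losses in plain Python, each as an average of per-example losses. The function names and the `eps` clipping constant are choices made for this example, not a library API.

```python
import math


def mean_squared_error(targets, predictions):
    return sum((p - t) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def mean_absolute_error(targets, predictions):
    return sum(abs(p - t) for t, p in zip(targets, predictions)) / len(targets)


def binary_cross_entropy(labels, probabilities, eps=1e-12):
    """labels are 0 or 1; probabilities are the predicted P(label == 1)."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def categorical_cross_entropy(class_indices, probability_rows, eps=1e-12):
    """class_indices hold the true class; each row is a predicted distribution."""
    total = 0.0
    for true_class, row in zip(class_indices, probability_rows):
        total += -math.log(max(row[true_class], eps))
    return total / len(class_indices)


mse = mean_squared_error([3.0, 5.0], [2.0, 7.0])             # (1 + 4) / 2
mae = mean_absolute_error([3.0, 5.0], [2.0, 7.0])            # (1 + 2) / 2
bce = binary_cross_entropy([1, 0], [0.5, 0.5])               # log(2)
cce = categorical_cross_entropy([0], [[0.25, 0.75]])         # -log(0.25)
```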
The convention is to use the same loss function at test time that was used during training. If you trained against cross-entropy, you report test cross-entropy. Mixing them is technically legal but makes comparison across runs awkward and is rare in practice.
A practical detail: at evaluation time, batches are run with model updates disabled and stochastic behavior (dropout, data augmentation, batch normalization in train mode) turned off. In PyTorch this is model.eval() plus torch.no_grad(); in TensorFlow/Keras the equivalent is training=False. Forgetting this is a classic source of test losses that look slightly worse than they should.
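A minimal PyTorch sketch of such an evaluation loop follows. The function name `evaluate_test_loss`, the toy model, and the random tensors are all hypothetical; the sketch also assumes the loss function averages over each batch (the default `reduction="mean"`).

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate_test_loss(model: nn.Module, loader: DataLoader, loss_fn: nn.Module) -> float:
    """Return the mean per-example loss over every batch in `loader`."""
    was_training = model.training
    model.eval()  # dropout off, batch norm uses its running statistics
    total_loss = 0.0
    total_examples = 0
    with torch.no_grad():  # skip gradient tracking; evaluation never updates weights
        for inputs, targets in loader:
            batch_loss = loss_fn(model(inputs), targets)  # mean over this batch
            # Weight by batch size so a smaller final batch is counted correctly.
            total_loss += batch_loss.item() * inputs.size(0)
            total_examples += inputs.size(0)
    model.train(was_training)  # restore whatever mode the caller had set
    return total_loss / total_examples


torch.manual_seed(0)
# Toy model and random data, for illustration only.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(8, 1))
test_set = TensorDataset(torch.randn(100, 4), torch.randn(100, 1))
test_loader = DataLoader(test_set, batch_size=32)

test_loss = evaluate_test_loss(model, test_loader, nn.MSELoss())
print(f"test loss: {test_loss:.4f}")
```

Because dropout is disabled inside the function, calling it twice on the same data gives the same number, which is a quick way to check that evaluation mode is actually on.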
MSE is the default for regression. It is smooth, differentiable everywhere, and integrates nicely with gradient-based optimization. The squaring makes it sensitive to large errors, so a model that is mostly right but has a handful of catastrophic predictions will show a high MSE. Sometimes that sensitivity is what you want, when rare large mistakes are unusually costly. Sometimes it is not, in which case MAE is a better fit.
MAE measures the average absolute deviation between predictions and targets. It is more robust to outliers than MSE because it does not square errors, so a single large miss contributes in proportion to its size rather than its square. The cost is that MAE is not differentiable at zero, which can complicate optimization, though subgradients work fine in practice. MAE is also easy to read because it is in the same units as the target: an MAE of 4.2 on a temperature regression means the model is off by 4.2 degrees on average.
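A small worked example of the difference, using made-up prediction errors: five errors of size 1, then the same set with one error replaced by a catastrophic miss of 20.

```python
def mse_from_errors(errors):
    return sum(e ** 2 for e in errors) / len(errors)


def mae_from_errors(errors):
    return sum(abs(e) for e in errors) / len(errors)


typical_errors = [1.0, -1.0, 1.0, -1.0, 1.0]
errors_with_outlier = [1.0, -1.0, 1.0, -1.0, 20.0]

mae_typical = mae_from_errors(typical_errors)        # 1.0
mse_typical = mse_from_errors(typical_errors)        # 1.0
mae_outlier = mae_from_errors(errors_with_outlier)   # 24 / 5 = 4.8
mse_outlier = mse_from_errors(errors_with_outlier)   # 404 / 5 = 80.8
```

One bad prediction raises MAE by a factor of about 5 and MSE by a factor of about 80.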
Cross-entropy is the standard loss for classification. It compares the model's predicted probability distribution over classes to a one-hot or soft target distribution. The loss is high when the model puts low probability on the right class, and the penalty grows sharply as the predicted probability approaches zero. A model that is confidently wrong gets punished more than one that is hedging. You can technically use MSE for classification, but it tends to give weak gradients when the model is far from the truth, which is why cross-entropy is preferred.
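The weak-gradient point can be checked numerically. The sketch below assumes a single sigmoid output with true label 1 and compares the two losses and their gradients with respect to the logit; the function name is made up for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def losses_and_logit_gradients(logit, label=1.0):
    """Return (cross-entropy, squared error, dCE/dlogit, dSE/dlogit)."""
    p = sigmoid(logit)
    cross_entropy = -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))
    squared_error = (p - label) ** 2
    ce_grad = p - label                          # derivative of CE through the sigmoid
    se_grad = 2.0 * (p - label) * p * (1.0 - p)  # chain rule: 2(p - y) * sigmoid'(z)
    return cross_entropy, squared_error, ce_grad, se_grad


hedging = losses_and_logit_gradients(logit=-0.4)            # p is about 0.40
confidently_wrong = losses_and_logit_gradients(logit=-6.0)  # p is about 0.0025

for name, result in [("hedging", hedging), ("confidently wrong", confidently_wrong)]:
    ce, se, ce_grad, se_grad = result
    print(f"{name}: CE={ce:.3f} SE={se:.3f} dCE={ce_grad:.4f} dSE={se_grad:.4f}")
```

For the confidently wrong prediction the cross-entropy keeps growing and its gradient stays close to -1, while the squared error saturates below 1 and its gradient shrinks toward zero because of the `p * (1 - p)` factor.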
Reading the three numbers together tells you more than any one in isolation.
| Loss | Computed on | Used for | When measured |
|---|---|---|---|
| Training loss | Training set | Updating weights via backprop | Every batch or epoch |
| Validation loss | Validation set | Hyperparameter tuning, early stopping, model selection | Every epoch, typically |
| Test loss | Test set | Final reporting and comparison | Once, at the end |
A few patterns come up over and over:
Google's machine learning crash course recommends three checks when training and test losses diverge: simplify the model by reducing features, increase the regularization rate, and verify that training and test sets are statistically equivalent. The third is easy to skip and surprisingly often the real problem.
One quirky case: validation loss below training loss. This sounds impossible but happens routinely with regularization like dropout, which is only active during training. The training-time loss includes the noise injected by dropout; the validation-time loss does not.
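The effect can be reproduced with a toy model. This is a simplified sketch: it applies inverted dropout to the inputs of a linear model rather than to hidden activations, and the labels are generated from the same weights so that the noise-free loss is exactly zero. All names and values are illustrative.

```python
import random


def forward(weights, features, rng=None, drop_prob=0.0):
    """Linear model; applies inverted dropout to the inputs when rng is given."""
    if rng is not None and drop_prob > 0.0:
        keep = 1.0 - drop_prob
        # Zero each input with probability drop_prob, rescale the survivors.
        features = [x / keep if rng.random() < keep else 0.0 for x in features]
    return sum(w * x for w, x in zip(weights, features))


def mean_squared_loss(weights, data, rng=None, drop_prob=0.0):
    return sum((forward(weights, x, rng, drop_prob) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)
weights = [0.5, -1.0, 2.0]
data = []
for _ in range(500):
    x = [rng.uniform(-1.0, 1.0) for _ in weights]
    data.append((x, forward(weights, x)))  # noise-free labels from the same weights

loss_with_dropout = mean_squared_loss(weights, data, rng, drop_prob=0.5)  # training mode
loss_without_dropout = mean_squared_loss(weights, data)                   # evaluation mode
print(loss_with_dropout, loss_without_dropout)
```

The same weights on the same data give a clearly positive loss with dropout active and zero loss with it disabled, which is the asymmetry behind a validation loss that sits below the training loss.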
The whole point of training a model is to use it on data it has not seen yet. Training loss tells you how well the model fits the data you already have, and a model with zero training loss and no other information is essentially a lookup table. What you actually care about is performance in the wild, and the test loss is the closest pre-deployment proxy for that.
If the test set is representative, the test loss is a useful prediction of real-world performance. If it is not, the test loss is misleading in ways that the number alone cannot detect. Thoughtful practitioners pair test loss with sliced metrics, error analysis on held-out examples, and, when possible, a separate validation against fresh data after deployment. For benchmark comparisons in language modeling, vision, and reinforcement learning, test loss is one of the standard reporting metrics, sometimes alongside task scores like accuracy, F1, or BLEU.
There are two main protocols for producing a test loss number. The simple holdout split, where you set aside a single test partition, is the standard for large datasets and most deep learning work. It is fast and the test loss has well-understood properties as long as the split is random and stratified where appropriate. The downside is variance: a single split can be lucky or unlucky, and on small datasets the resulting estimate is noisy.
K-fold cross-validation cycles through k different train/test splits, computes a test loss on each, and averages. This reduces variance and is the standard approach for tabular machine learning with modest data. The price is k times the compute. When reporting cross-validated test loss, include the standard deviation across folds, not just the mean. Nested cross-validation, with an inner loop for hyperparameter tuning and an outer loop for evaluation, is more rigorous when both selection and evaluation must happen on limited data, but is rarely used in deep learning at scale.
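A minimal sketch of k-fold evaluation in plain Python, assuming a one-dimensional least-squares line as the model and synthetic data (y = 2x + 1 plus Gaussian noise). The helper names are hypothetical; in practice a library such as scikit-learn would usually handle the splitting.

```python
import random
import statistics


def fit_line(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def mse(params, xs, ys):
    slope, intercept = params
    return statistics.fmean((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))


def k_fold_losses(xs, ys, k=5, seed=0):
    """Return the held-out loss of each of the k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    losses = []
    for held_out in folds:
        held_out_set = set(held_out)
        train_idx = [i for i in indices if i not in held_out_set]
        params = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        losses.append(mse(params, [xs[i] for i in held_out], [ys[i] for i in held_out]))
    return losses


rng = random.Random(1)
xs = [rng.uniform(0.0, 10.0) for _ in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.5) for x in xs]

fold_losses = k_fold_losses(xs, ys, k=5)
print(f"test MSE: {statistics.fmean(fold_losses):.3f} "
      f"+/- {statistics.stdev(fold_losses):.3f} across {len(fold_losses)} folds")
```

Reporting the spread across folds alongside the mean makes it visible when one fold was unusually easy or hard.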
The gap between training loss and test loss is closely tied to the bias-variance tradeoff. A model with high bias underfits and shows high training loss and high test loss, with the two roughly equal. A model with high variance overfits and shows low training loss with a much higher test loss. Tuning regularization, model capacity, and training duration is largely about pushing this tradeoff toward the point that minimizes loss on unseen data. That tuning is judged on validation loss, so that the test loss remains an honest final check.
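Both regimes can be seen with two deliberately extreme models on synthetic data. This is an illustrative sketch, not a benchmark: the data (a noisy sine wave), the constant predictor standing in for high bias, and the 1-nearest-neighbor predictor standing in for high variance are all assumptions chosen for the example.

```python
import math
import random


def make_data(n, rng):
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


def mse(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(2000, rng)

mean_y = sum(train_y) / len(train_y)


def constant_model(x):
    """High bias: ignores the input and always predicts the training mean."""
    return mean_y


def nearest_neighbor_model(x):
    """High variance: returns the label of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


constant_train = mse(constant_model, train_x, train_y)
constant_test = mse(constant_model, test_x, test_y)
neighbor_train = mse(nearest_neighbor_model, train_x, train_y)
neighbor_test = mse(nearest_neighbor_model, test_x, test_y)

print(f"constant model:   train={constant_train:.3f} test={constant_test:.3f}")
print(f"nearest neighbor: train={neighbor_train:.3f} test={neighbor_test:.3f}")
```

The constant model has a high loss on both sets with almost no gap, while the nearest-neighbor model memorizes the training set (training loss of zero) and pays for it on the test set.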
In modern deep learning, particularly with overparameterized neural networks, the classical bias-variance picture becomes less tidy. Phenomena like double descent show that test loss can decrease again as model size grows past the interpolation point. The intuition still applies, but the simple "bigger model means worse generalization" heuristic does not hold reliably.
Imagine studying for a math test. You practice with a workbook (the training set). Halfway through, your tutor gives you a quiz with different problems to check how you are doing (the validation set). On test day, your teacher hands you brand new problems you have never seen (the test set). How well you do on those is your test loss.
A low test loss means you actually understood the math. A high test loss means you may have memorized the practice problems without learning the underlying idea. The reason your teacher does not let you peek at the test ahead of time is the same reason machine learning engineers keep the test set sealed off until the end: peeking would let you tune your answers to the specific questions and make the score look better than your real understanding deserves.