Holdout data
Last reviewed: May 11, 2026 · Sources: 13 citations · Review status: Source-backed · Revision: v3 · 2,192 words
See also: Machine learning terms
Holdout data is a portion of a labeled dataset that is deliberately kept out of training so it can be used later to evaluate how well a model performs on examples it has never seen. Google's Machine Learning Glossary defines the term as data that "helps evaluate your model's ability to generalize to data other than the data it was trained on," with validation and test sets given as the two canonical examples. The loss on a holdout set is treated as a closer approximation of the loss the model will incur on truly new data than the loss on the training set, which can be driven arbitrarily low by overfitting.
The idea is older than modern machine learning. Brian Ripley's 1996 *Pattern Recognition and Neural Networks* describes the three-way split still used today: a training set fits the model, a validation set tunes choices about the model, and a test set is touched only at the end. "Holdout data" is the umbrella term covering both the validation set and the test set, because both are held out of the training routine.
In practice, three subsets show up over and over.
| Subset | Used during | Touched by training algorithm | Typical role |
|---|---|---|---|
| Training set | Fitting | Yes | Adjusts model parameters via gradient descent or similar |
| Validation set | Model selection | No, but used for hyperparameter tuning and early stopping | Picks the best architecture, learning rate, or checkpoint |
| Test set | Final reporting | No | Gives a single, honest performance number |
Validation and test data are both holdout data, but they play different roles. The validation set gets looked at many times while you iterate; the test set should ideally be looked at once. Russell and Norvig describe the test set as a "sealed envelope" you only open after every other decision is locked in. Once you have used a test set to pick between models or tweak a threshold, it has effectively become part of your tuning loop and stops being a clean estimate of generalization.
Kaggle's competition format makes this distinction concrete. Each competition splits the labeled data into a public training set and a hidden test set, and the hidden test set is further partitioned into a public leaderboard slice (typically around 20 to 30 percent of the hidden labels) and a private leaderboard slice (the remaining 70 to 80 percent). Competitors get instant feedback on the public slice during the contest, but the private slice is only scored once when the competition ends. The private leaderboard is the closest real world example of a truly sealed test set: nobody, including the participants, sees those labels until rankings are final.
The simplest way to produce holdout data is the holdout method: shuffle the labeled examples, then slice off a fixed fraction for evaluation. Common splits are 80/20 or 70/30 for a two-way train/test partition, and 60/20/20 or 70/15/15 for a three-way train/validation/test partition. The right ratio depends on how much data you have. With tens of thousands of examples a 20 percent test set is plenty; with hundreds of millions, 1 percent can already contain more than enough examples to drive the standard error of the estimate well below the differences you care about.
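To make the ratios concrete, here is a minimal sketch of a 60/20/20 three-way split built from two calls to scikit-learn's train_test_split; the array shapes and random seed are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data; in practice X and y come from your labeled dataset.
X = np.random.rand(1_000, 10)
y = np.random.randint(0, 2, size=1_000)

# First slice off the 20 percent test set, then carve a validation set out
# of the remainder: 25 percent of the remaining 80 percent is 20 percent
# of the original data, giving 60/20/20 overall.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```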
A few practical variants come up often:

- A simple random split, which shuffles the examples and slices off a fixed fraction.
- A stratified split, which preserves the class proportions of the full dataset in each subset, so a rare class does not vanish from the holdout by chance.
- A time-ordered split, which trains on older records and holds out newer ones, so the model is never evaluated on data from before its training period.
Scikit-learn's train_test_split, StratifiedShuffleSplit, and TimeSeriesSplit are the standard library implementations, but any system that respects the same invariants works.
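The two non-random variants look roughly like this in scikit-learn; the data here is synthetic, and the 5 percent positive rate is only an illustration of why stratification matters.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, TimeSeriesSplit

X = np.random.rand(1_000, 10)
y = (np.random.rand(1_000) < 0.05).astype(int)   # rare positive class

# Stratified split: both sides keep roughly the same 5 percent positive rate,
# so the holdout does not end up with zero positives by bad luck.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(sss.split(X, y))
print(y[train_idx].mean(), y[test_idx].mean())

# Time-ordered split: every training index precedes every test index,
# so the model never trains on data newer than its holdout.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    assert train_idx.max() < test_idx.min()
```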
The single holdout split is fast and cheap. It is also noisy. Performance on a 20 percent test set depends on which examples happened to land there, and on small datasets this variance can be embarrassingly large. The Wikipedia article on cross-validation puts this bluntly: the holdout method, used by itself, "should be used with caution because without such averaging of multiple runs, one may achieve highly misleading results."
The usual fix is k-fold cross-validation, where the data is split into k roughly equal folds, the model is trained k times on k − 1 folds, and each fold takes a turn as the holdout. The fold scores are averaged. Every example ends up in a holdout at some point, which both lowers the variance of the estimate and uses the data more efficiently. The cost is computational: k separate training runs instead of one. Leave-one-out cross-validation is the extreme case where k equals the number of examples; it gives a nearly unbiased estimate but is impractical for large datasets or expensive models.
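A minimal k-fold sketch with scikit-learn, using a small built-in dataset purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Five folds: the model is trained five times, each time holding out a
# different fifth of the data, and the five holdout scores are averaged.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean(), scores.std())
```

Reporting the mean and the spread of the fold scores gives a steadier estimate than any single 80/20 split of the same data.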
A reasonable rule of thumb: use a single holdout split when data is plentiful, training is expensive, or you just need a sanity check; use k-fold cross-validation when data is scarce or when you are comparing models whose performance differences are small.
In modern deep learning workflows a hybrid is common. Practitioners use k-fold cross-validation during model selection on the training-plus-validation portion of the data, then evaluate the final chosen model once on a separate test set that was never touched during cross-validation.
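One way that hybrid looks in code, sketched with scikit-learn; the model, parameter grid, and split ratio are placeholders rather than recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Set aside a test set that cross-validation never touches.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Model selection by 5-fold cross-validation on the development portion only.
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    param_grid={"svc__C": [0.1, 1, 10]},
    cv=5)
search.fit(X_dev, y_dev)

# One final, honest number on the sealed test set.
print(search.score(X_test, y_test))
```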
The whole value of a holdout set comes from independence. If even small amounts of information about the test set sneak into training, the holdout score stops being an honest estimate of generalization. This is data leakage, and it is one of the most common reasons published results fail to reproduce in production.
IBM's overview separates leakage into two flavors. Target leakage is when a feature that is available at training time will not be available at prediction time, or when a feature is a near copy of the label. Train-test contamination is the holdout-specific failure: information from the test set leaks back into the training process. Common causes include:

- Fitting preprocessing steps such as scalers, imputers, or feature selectors on the full dataset before splitting.
- Duplicate or near-duplicate records that land on both sides of the split.
- Oversampling or augmenting the data before splitting, so copies of test examples end up in training.
- Splitting randomly when several records belong to the same patient, user, or session, so the model sees its test subjects during training.
A 2024 systematic review covering 17 scientific fields catalogued at least 294 papers whose machine learning results were affected by some form of leakage, almost always in the optimistic direction. The fix is mechanical: split first, then fit every preprocessing step inside a pipeline that only sees training data, then evaluate.
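A sketch of that pattern with a scikit-learn Pipeline: because the scaler and feature selector live inside the pipeline, each cross-validation round fits them on the training fold only and merely applies them to the holdout fold. The particular steps and fold count are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Every preprocessing step lives inside the pipeline, so nothing is ever
# fit on data from the holdout fold.
leak_free = make_pipeline(
    StandardScaler(),
    SelectKBest(k=10),
    LogisticRegression(max_iter=5000))

print(cross_val_score(leak_free, X, y, cv=5).mean())
```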
Even a perfectly clean holdout can be eroded if you query it too many times. Each time you look at a test score and adjust your model in response, you are using the test set to make a decision, and a small amount of its signal leaks into your choices. After enough rounds the test score is no longer independent of the model.
Moritz Hardt's 2015 analysis of the Kaggle leaderboard made this concrete with what he called the "wacky boosting" attack: a participant who only generates random submissions, keeps the ones that score slightly above chance on the public leaderboard, and combines them by majority vote can climb the public leaderboard without learning anything about the actual problem. The same effect happens softly in real workflows when a team submits dozens of variants and keeps the best one. This is why Kaggle reserves the private leaderboard, why ML benchmarks periodically refresh their test sets, and why some labs maintain a final "sealed" test set that is only run a small fixed number of times.
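A toy simulation in the spirit of that attack (the sizes, threshold, and number of submissions are made up) shows the public score drifting above chance while the private score stays put:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4_000
labels = rng.integers(0, 2, size=n)       # hidden test labels
public = np.arange(n) < 1_200             # roughly 30 percent scored publicly
private = ~public

kept = []
for _ in range(300):
    guess = rng.integers(0, 2, size=n)    # a purely random "submission"
    if (guess[public] == labels[public]).mean() > 0.51:
        kept.append(guess)                # keep only the lucky ones

# Majority vote over the lucky random submissions.
ensemble = (np.mean(kept, axis=0) > 0.5).astype(int)

print("public :", (ensemble[public] == labels[public]).mean())
print("private:", (ensemble[private] == labels[private]).mean())
```

With random labels the private accuracy hovers around 50 percent no matter how far the public score climbs, which is exactly the gap the private leaderboard is designed to expose.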
Google has recommended a similar practice internally: keep an evaluation set that is queried rarely, and treat any model decision based on it as a one way door.
A holdout set drawn from the same source as the training set tells you how the model will do on data from that same distribution. It does not, by itself, tell you how the model will do once it is deployed and the world moves on. This gap is called distribution shift or data drift, and it is one of the most common reasons production model performance degrades over time.
Two practical patterns help:

- Make the holdout temporal: hold out the most recent slice of data rather than a random sample, so the evaluation already mimics the train-now, predict-later situation the model will face in production.
- Keep evaluating after deployment: score the model on fresh labeled data at a regular cadence and watch for drops that signal drift.
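A minimal sketch of the temporal-holdout pattern, assuming a hypothetical events table with a timestamp column and a couple of made-up features:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical frame: one row per event, with a timestamp column.
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=2_000, freq="h"),
    "f1": np.random.rand(2_000),
    "f2": np.random.rand(2_000),
    "label": np.random.randint(0, 2, size=2_000),
}).sort_values("timestamp")

# Temporal holdout: train on the oldest 80 percent, evaluate on the newest 20.
cutoff = df["timestamp"].iloc[int(len(df) * 0.8)]
train = df[df["timestamp"] < cutoff]
recent = df[df["timestamp"] >= cutoff]

model = LogisticRegression().fit(train[["f1", "f2"]], train["label"])
print("accuracy on the newest slice:",
      model.score(recent[["f1", "f2"]], recent["label"]))
```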
In high-stakes domains, like medical imaging or credit scoring, regulators increasingly expect both: an internal holdout for development, plus a documented monitoring plan that keeps re-evaluating the model on fresh data after deployment.
A short list of failures that come up repeatedly:

- Preprocessing (scaling, imputation, feature selection) fit on the full dataset before the split.
- Duplicate or near-duplicate examples that appear in both the training and holdout data.
- Tuning hyperparameters or choosing between models on the test set instead of the validation set.
- Shuffling time-ordered data randomly, so the model trains on the future and is tested on the past.
- A holdout so small that the variance of its score swamps the differences between the models being compared.
Imagine you are studying for a math test. Your teacher gives you a big stack of practice problems. To make sure you actually learn how to do the math and are not just memorizing answers, the teacher hides some problems away in a drawer. After you have practiced on the rest, the teacher pulls the hidden problems out and asks you to solve those.
The hidden problems are the holdout data. If you do well on them, you probably learned the math. If you only do well on the problems you practiced, you might just have memorized them. The trick only works if you really did not peek in the drawer, even by accident. Asking a friend who saw the hidden problems for hints counts as peeking. So does using the answers to check your work too early. Once you peek, the hidden problems are not hidden anymore, and the test is no longer fair.