# Holdout data

> Source: https://aiwiki.ai/wiki/holdout_data
> Updated: 2026-05-11
> Categories: Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

Holdout data is a portion of a labeled dataset that is deliberately kept out of training so it can be used later to evaluate how well a model performs on examples it has never seen. Google's [Machine Learning Glossary](/wiki/machine_learning_glossary) defines the term as data that "helps evaluate your model's ability to generalize to data other than the data it was trained on," with validation and test sets given as the two canonical examples. The loss on a holdout set is treated as a closer approximation of the loss the model will incur on truly new data than the loss on the training set, which can be driven arbitrarily low by [overfitting](/wiki/overfitting).

The idea is older than modern machine learning. Brian Ripley's 1996 *Pattern Recognition and Neural Networks* describes the three-way split still used today: a training set fits the model, a validation set tunes choices about the model, and a test set is touched only at the end. "Holdout data" is the umbrella term covering both the validation set and the test set, because both are held out of the training routine.

## What counts as holdout data

In practice, three subsets show up over and over.

| Subset | Used during | Touched by training algorithm | Typical role |
|---|---|---|---|
| Training set | Fitting | Yes | Adjusts model parameters via [gradient descent](/wiki/gradient_descent) or similar |
| Validation set | [Model selection](/wiki/model_selection) | No, but used for [hyperparameter tuning](/wiki/hyperparameter_tuning) and early stopping | Picks the best architecture, learning rate, or checkpoint |
| Test set | Final reporting | No | Gives a single, honest performance number |

Validation and test data are both holdout data, but they play different roles. The validation set gets looked at many times while you iterate; the test set should ideally be looked at once. Russell and Norvig describe the test set as a "sealed envelope" you only open after every other decision is locked in. Once you have used a test set to pick between models or tweak a threshold, it has effectively become part of your tuning loop and stops being a clean estimate of generalization.

Kaggle's competition format makes this distinction concrete. A typical competition splits the labeled data into a public training set and a hidden test set, and the hidden test set is further partitioned into a public leaderboard slice (typically around 20 to 30 percent of the hidden labels) and a private leaderboard slice (the remaining 70 to 80 percent). Competitors get instant feedback on the public slice during the contest, but the private slice is only scored once when the competition ends. The private leaderboard is the closest real-world example of a truly sealed test set: participants never see those labels, and their scores on them stay hidden until the competition closes.

## Creating a holdout split

The simplest way to produce holdout data is the holdout method: shuffle the labeled examples, then slice off a fixed fraction for evaluation. Common splits are 80/20 or 70/30 for a two-way train and test partition, and 60/20/20 or 70/15/15 for a three-way train, validation, and test partition. The right ratio depends on how much data you have. With tens of thousands of examples a 20 percent test set is plenty; with hundreds of millions, 1 percent can already contain more than enough examples to drive the [standard error](/wiki/standard_error) of the estimate well below the differences you care about.
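The effect of test-set size on the standard error can be made concrete with a little arithmetic. The sketch below assumes the metric is accuracy, that test examples are independent, and that the model's true accuracy is 0.90; the function name and the dataset sizes are illustrative, not taken from any library.

```python
import math


def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy estimate from n_test independent examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


# Hypothetical scenarios: (total labeled examples, fraction held out for testing).
scenarios = [(20_000, 0.20), (200_000_000, 0.01)]

for n_total, test_fraction in scenarios:
    n_test = round(n_total * test_fraction)
    se = accuracy_standard_error(0.90, n_test)
    print(f"{n_total:>11,} examples, {test_fraction:.0%} held out: "
          f"n_test={n_test:,}, standard error={se:.5f}")
```

Under these assumptions the 4,000-example test set gives a standard error of roughly 0.5 percentage points, while the 2,000,000-example one gives about 0.02 points, which is why a small fraction suffices on very large datasets.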

A few practical variants come up often:

* **Random split.** Examples are assigned to train, validation, or test uniformly at random. Easy, but only valid when the examples are roughly independent and identically distributed.
* **Stratified split.** The split preserves the class proportions in each subset. This matters a lot when classes are imbalanced; without stratification, a small test set might end up with very few minority-class examples, making accuracy estimates noisy.
* **Grouped split.** Examples that share a grouping key (same patient, same user, same source document) are kept together so the model is never tested on a group it has seen in training. Forgetting this is one of the easier ways to get an inflated score.
* **Temporal split.** For time series and other sequential data, the test set is the most recent window and the training set is everything earlier. Random splitting of time series leaks future information into training, which is one of the canonical examples of [data leakage](/wiki/data_leakage).

scikit-learn's `train_test_split`, `StratifiedShuffleSplit`, `GroupShuffleSplit`, and `TimeSeriesSplit` are widely used implementations of these variants, but any system that respects the same invariants works.
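As a minimal sketch of the variants above, the following uses scikit-learn on synthetic data. The array shapes, the 10 percent positive rate, and the `groups` column standing in for patient or user IDs are all assumptions made for illustration; the temporal split additionally assumes the rows are already in chronological order.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, TimeSeriesSplit, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.1).astype(int)    # imbalanced labels, roughly 10% positives
groups = rng.integers(0, 100, size=1000)    # hypothetical patient/user IDs

# Stratified 60/20/20 split: carve off the test set first, then split the remainder.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0
)

# Grouped split: every group lands entirely in train or entirely in test.
gss = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=groups))
shared_groups = set(groups[train_idx]) & set(groups[test_idx])

# Temporal split: each test window comes strictly after its training window.
tscv = TimeSeriesSplit(n_splits=5)
temporal_ok = all(tr.max() < te.min() for tr, te in tscv.split(X))

print(len(X_train), len(X_val), len(X_test))
print("groups shared between train and test:", len(shared_groups))
print("temporal order respected:", temporal_ok)
```

Splitting off the test set before the validation set keeps the test rows out of every later decision, including the choice of validation fraction.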

## Holdout versus cross-validation

The single holdout split is fast and cheap. It is also noisy. Performance on a 20 percent test set depends on which examples happened to land there, and on small datasets this variance can be embarrassingly large. The Wikipedia article on [cross-validation](/wiki/cross_validation) puts this bluntly: the holdout method, used by itself, "should be used with caution because without such averaging of multiple runs, one may achieve highly misleading results."

The usual fix is k-fold cross-validation, where the data is split into k roughly equal folds, the model is trained k times on k minus one folds, and each fold takes a turn as the holdout. The fold scores are averaged. Every example ends up in a holdout exactly once, which both lowers the variance of the estimate and uses the data more efficiently. The cost is computational: k separate training runs instead of one. Leave-one-out cross-validation is the extreme case where k equals the number of examples; it gives a nearly unbiased estimate but is impractical for large datasets or expensive models.
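A minimal sketch of k-fold cross-validation with scikit-learn follows. The synthetic dataset, the logistic regression model, and the choice of k = 5 are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for a small labeled dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One training run per fold; each fold serves as the holdout exactly once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy")

# Collect the held-out indices across folds to confirm full coverage.
held_out = np.sort(np.concatenate([test_idx for _, test_idx in cv.split(X)]))

print(f"fold accuracies: {np.round(scores, 3)}")
print(f"mean={scores.mean():.3f}, std={scores.std(ddof=1):.3f}")
```

The spread across folds is a rough indication of how much a single holdout score could have moved with a different split.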

A reasonable rule of thumb: use a single holdout split when data is plentiful, training is expensive, or you just need a sanity check; use k-fold cross-validation when data is scarce or when you are comparing models whose performance differences are small.

A hybrid is common in practice. Practitioners use k-fold cross-validation during model selection on the training plus validation portion of the data, then evaluate the final chosen model once on a separate test set that was never touched during cross-validation.
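The hybrid workflow might look like the sketch below, which uses scikit-learn's `GridSearchCV` for the cross-validated selection step. The dataset, the model, and the grid of `C` values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Seal off the test set before any model selection happens.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# 5-fold cross-validation on the development portion picks the hyperparameter.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_dev, y_dev)

# The refit best model is scored on the test set exactly once.
final_test_accuracy = search.score(X_test, y_test)

print("chosen C:", search.best_params_["C"])
print(f"cross-validated accuracy: {search.best_score_:.3f}")
print(f"final test accuracy: {final_test_accuracy:.3f}")
```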

## Why a clean holdout matters: leakage

The whole value of a holdout set comes from independence. If even small amounts of information about the test set sneak into training, the holdout score stops being an honest estimate of generalization. This is data leakage, and it is one of the most common reasons published results fail to reproduce in production.

IBM's overview separates leakage into two flavors. Target leakage is when a feature available at training time will not be available at prediction time, or when a feature is a near copy of the label. Train-test contamination is the holdout-specific failure: information from the test set leaks back into the training process. Common causes include:

* Fitting a [scaler](/wiki/feature_scaling), imputer, or feature selector on the full dataset before splitting, so the training pipeline has seen summary statistics from the test rows.
* Using the test set to choose hyperparameters, then reporting the test score as final performance.
* Duplicates, near duplicates, or examples from the same group appearing in both training and test.
* Re-running the experiment and re-shuffling the split after seeing the test result, which slowly leaks the labels through the experimenter.

A 2023 survey by Kapoor and Narayanan covering 17 scientific fields catalogued at least 294 papers whose machine learning results were affected by some form of leakage, almost always in the optimistic direction. The fix is mechanical: split first, then fit every preprocessing step inside a pipeline that only sees training data, then evaluate.
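One way to follow the split-first rule is scikit-learn's pipeline mechanism, sketched here on synthetic data. The scaler and classifier are placeholders for whatever preprocessing and model a real project uses.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Step 1: split before any statistic is computed from the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Step 2: fit preprocessing and model together, on training rows only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Step 3: evaluate; the scaler reuses training statistics on the test rows.
test_accuracy = model.score(X_test, y_test)

# The fitted scaler's means come from the training rows, not the full dataset.
scaler = model.named_steps["standardscaler"]
print("scaler fit on train only:", np.allclose(scaler.mean_, X_train.mean(axis=0)))
print(f"test accuracy: {test_accuracy:.3f}")
```

Passing the same pipeline object to a cross-validation routine keeps the guarantee inside each fold, since the scaler is refit on that fold's training portion.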

## Adaptive overfitting and the sealed test set

Even a perfectly clean holdout can be eroded if you query it too many times. Each time you look at a test score and adjust your model in response, you are using the test set to make a decision, and a small amount of its signal leaks into your choices. After enough rounds the test score is no longer independent of the model.

Moritz Hardt's 2015 analysis of the Kaggle leaderboard made this concrete with what he called the "wacky boosting" attack: a participant who only generates random submissions, keeps the ones that score slightly above chance on the public leaderboard, and combines them by majority vote can climb the public leaderboard without learning anything about the actual problem. The same effect happens softly in real workflows when a team submits dozens of variants and keeps the best one. This is why Kaggle reserves the private leaderboard, why ML benchmarks periodically refresh their test sets, and why some labs maintain a final "sealed" test set that is only run a small fixed number of times.
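The attack is easy to simulate. The sketch below is a simplified reconstruction from the description above, not Hardt's original code; the slice sizes, the number of submissions, and the binary-label setup are assumptions chosen to keep it small.

```python
import numpy as np

rng = np.random.default_rng(0)
n_public, n_private, n_submissions = 4000, 4000, 500

# Hidden binary labels, split into a public and a private leaderboard slice.
labels = rng.integers(0, 2, size=n_public + n_private)
public = slice(0, n_public)
private = slice(n_public, None)

# Each "submission" is pure noise: it uses no information about the problem.
submissions = rng.integers(0, 2, size=(n_submissions, n_public + n_private))
public_scores = (submissions[:, public] == labels[public]).mean(axis=1)

# Keep only the submissions that happened to beat chance on the public slice.
kept = submissions[public_scores > 0.5]

# Combine the kept submissions by majority vote (ties go to class 0).
ensemble = (kept.mean(axis=0) > 0.5).astype(int)

public_acc = (ensemble[public] == labels[public]).mean()
private_acc = (ensemble[private] == labels[private]).mean()

print(f"kept {len(kept)} of {n_submissions} random submissions")
print(f"public accuracy:  {public_acc:.3f}")
print(f"private accuracy: {private_acc:.3f}")
```

The ensemble scores clearly above chance on the public slice because it was selected on that slice, while the private slice, which played no part in the selection, stays near 50 percent.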

Google has recommended a similar practice internally: keep an evaluation set that is queried rarely, and treat any model decision based on it as a one-way door.

## Holdout data and distribution shift

A holdout set drawn from the same source as the training set tells you how the model will do on data from that same distribution. It does not, by itself, tell you how the model will do once it is deployed and the world moves on. This gap is called distribution shift or data drift, and it is one of the most common reasons production model performance degrades over time.

Two practical patterns help:

* **Time-based holdout.** Hold out the most recent slice of data, train on everything earlier, and evaluate on the recent slice. This gives a more realistic estimate of how the model will perform on tomorrow's data than a random split would.
* **Ongoing holdout monitoring.** Once the model is in production, periodically score it on a fresh, labeled sample and compare against the original holdout score. Statistical tests like the Kolmogorov-Smirnov test on input feature distributions, or the Population Stability Index for binned features, are commonly used to flag when production data has drifted enough that the original holdout estimate is no longer trustworthy.
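Both drift checks can be sketched in a few lines. The example below uses SciPy's `ks_2samp` for the Kolmogorov-Smirnov test and a hand-written `population_stability_index` helper; that helper, its quantile binning, and its small clipping constant are assumptions for illustration, since binning conventions for the index vary. The data is synthetic, with a deliberate mean shift of 0.5 in one sample.

```python
import numpy as np
from scipy.stats import ks_2samp


def population_stability_index(expected, actual, n_bins=10):
    """PSI between a reference sample and a new sample, using reference quantile bins."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    # Interior edges only, so out-of-range values fall into the first or last bin.
    interior = edges[1:-1]
    expected_counts = np.bincount(np.searchsorted(interior, expected, side="right"), minlength=n_bins)
    actual_counts = np.bincount(np.searchsorted(interior, actual, side="right"), minlength=n_bins)
    # Clip to avoid log(0) when a bin is empty.
    e = np.clip(expected_counts / len(expected), 1e-6, None)
    a = np.clip(actual_counts / len(actual), 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))


rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=5000)   # feature values in the original holdout
stable = rng.normal(0.0, 1.0, size=5000)      # production sample, same distribution
shifted = rng.normal(0.5, 1.0, size=5000)     # production sample after a mean shift

for name, sample in [("stable", stable), ("shifted", shifted)]:
    ks = ks_2samp(reference, sample)
    psi = population_stability_index(reference, sample)
    print(f"{name:>7}: KS statistic={ks.statistic:.3f}, p-value={ks.pvalue:.3g}, PSI={psi:.3f}")
```

Both measures look only at input features, so they flag that the data has moved, not whether accuracy has dropped; a fresh labeled sample is still needed to measure that.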

In high-stakes domains, like medical imaging or credit scoring, regulators increasingly expect both: an internal holdout for development, plus a documented monitoring plan that keeps re-evaluating the model on fresh data after deployment.

## Common mistakes

A short list of failures that come up repeatedly:

* Reporting the validation score as if it were a test score. By the time you have tuned against it many times, the validation set is no longer holdout in any meaningful sense.
* Splitting after preprocessing. If normalization, encoding, or feature selection saw the test data, the test data is leaking into training.
* Randomly splitting time series. The test set ends up containing earlier timestamps than parts of the training set, so the model gets to peek at the future.
* Splitting at the row level when groups exist. Patients, users, or documents appear in both training and test, and the model memorizes group identity instead of learning a general rule.
* Picking the best of many test runs. The reported number is the maximum over k tries, which is biased upward; the mean and spread over those tries are the more honest summary.
* Treating a single 80/20 split as definitive on a small dataset. Either use cross-validation or report a confidence interval over multiple seeds.
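The last two points can be illustrated by repeating an 80/20 split over several seeds on a small dataset. This is a sketch on synthetic data with an arbitrary model and 20 seeds; note that repeated random splits overlap, so the spread it reports is only a rough guide rather than a formal confidence interval.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Deliberately small dataset, where a single split is noisy.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

scores = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))
scores = np.array(scores)

print(f"best single split: {scores.max():.3f}")
print(f"mean over seeds:   {scores.mean():.3f}")
print(f"std over seeds:    {scores.std(ddof=1):.3f}")
```

Reporting only the best split would overstate performance; the mean with its spread shows how much of the headline number is down to which rows landed in the test set.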

## Explain like I'm 5

Imagine you are studying for a math test. Your teacher gives you a big stack of practice problems. To make sure you actually learn how to do the math and are not just memorizing answers, the teacher hides some problems away in a drawer. After you have practiced on the rest, the teacher pulls the hidden problems out and asks you to solve those.

The hidden problems are the holdout data. If you do well on them, you probably learned the math. If you only do well on the problems you practiced, you might just have memorized them. The trick only works if you really did not peek in the drawer, even by accident. Asking a friend who saw the hidden problems for hints counts as peeking. So does using the answers to check your work too early. Once you peek, the hidden problems are not hidden anymore, and the test is no longer fair.

## References

1. Google. "[Machine Learning Glossary: holdout data](https://developers.google.com/machine-learning/glossary)." Google for Developers.
2. Wikipedia. "[Training, validation, and test data sets](https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets)."
3. Wikipedia. "[Cross-validation (statistics)](https://en.wikipedia.org/wiki/Cross-validation_(statistics))."
4. DataRobot. "[Training Sets, Validation Sets, and Holdout Sets](https://www.datarobot.com/wiki/training-validation-holdout/)."
5. C3 AI. "[Holdout Data](https://c3.ai/glossary/data-science/holdout-data/)."
6. Hardt, Moritz. "[Competing in a data science contest without reading the data](https://blog.mrtz.org/2015/03/09/competition.html)." 2015.
7. Blum, Avrim, and Moritz Hardt. "[The Ladder: A Reliable Leaderboard for Machine Learning Competitions](http://proceedings.mlr.press/v37/blum15.pdf)." ICML 2015.
8. IBM. "[What is Data Leakage in Machine Learning?](https://www.ibm.com/think/topics/data-leakage-machine-learning)"
9. Kapoor, Sayash, and Arvind Narayanan. "[Leakage and the reproducibility crisis in machine-learning-based science](https://doi.org/10.1016/j.patter.2023.100804)." *Patterns*, 2023.
10. Ripley, Brian D. *Pattern Recognition and Neural Networks*. Cambridge University Press, 1996.
11. Russell, Stuart, and Peter Norvig. *Artificial Intelligence: A Modern Approach*, 4th ed. Pearson, 2020.
12. scikit-learn developers. "[3.1. Cross-validation: evaluating estimator performance](https://scikit-learn.org/stable/modules/cross_validation.html)."
13. Evidently AI. "[What is data drift in ML, and how to detect and handle it](https://www.evidentlyai.com/ml-in-production/data-drift)."
