Test Set

Test Set in Machine Learning

Definition

In machine learning, a test set (also called a test dataset or held-out test data) is a subset of data that is strictly reserved for evaluating the final performance of a trained model. The test set remains completely unseen by the model during both the training and hyperparameter tuning phases. Its sole purpose is to provide an unbiased estimate of how well the model will perform on new, real-world data it has never encountered before.

The test set is one of three standard data partitions used in supervised machine learning, alongside the training set (used to fit the model's parameters) and the validation set (used for hyperparameter tuning and model selection). This three-way split is sometimes called the holdout method, and it forms the foundation of reliable model evaluation across nearly all machine learning workflows ^[1].

Purpose

The primary purpose of a test set is to assess the generalization ability of a machine learning model ^[1]. Generalization refers to the model's capability to make accurate predictions on new, unseen data rather than simply memorizing the patterns in the training data.

By comparing a model's performance on the training set, validation set, and test set, practitioners can diagnose potential issues. If a model performs well on training data but poorly on the test set, this is a strong signal of overfitting ^[2]. Conversely, if the model performs poorly on all three splits, it may be underfitting. The test set provides the final, definitive performance number that researchers report in papers and that engineers use to decide whether a model is ready for deployment.

Test Set vs. Validation Set

A common source of confusion in machine learning is the distinction between the test set and the validation set ^[7]. Although both contain data the model has not trained on directly, they serve fundamentally different roles in the model development pipeline.

Aspect	Validation Set	Test Set
When used	During training and model selection	Only after all training and tuning is complete
Purpose	Tune hyperparameters, select model architecture, guide early stopping	Provide a final, unbiased estimate of model performance
Influence on model	Indirectly influences the model through decisions made during tuning	Must never influence model decisions
Frequency of use	Evaluated repeatedly throughout the development cycle	Ideally evaluated only once at the very end
Analogy	Practice exam taken while studying	Final exam taken after all studying is done

The validation set is part of the iterative development loop: practitioners train the model, evaluate it on the validation set, adjust hyperparameters, and repeat. The test set sits outside this loop entirely. Using the test set during development (even indirectly, by peeking at results and then changing the model) compromises its ability to give an honest performance estimate ^[7].

Creating a Test Set

Data Splitting Ratios

To create a test set, the initial dataset is divided into training, validation, and test portions ^[2]. Common splitting ratios include:

Split Strategy	Training	Validation	Test	Best For
70/15/15	70%	15%	15%	Medium-sized datasets
80/10/10	80%	10%	10%	Large datasets
60/20/20	60%	20%	20%	Smaller datasets where evaluation reliability matters
90/5/5	90%	5%	5%	Very large datasets (millions of samples)

There is no universal rule for the best ratio. The right split depends on the total amount of data available, the complexity of the task, and how precise the performance estimate needs to be. For very large datasets (hundreds of thousands or millions of samples), even 5% or 10% can yield a sufficiently large test set. For small datasets with only a few hundred examples, a larger test proportion (20% or more) may be necessary to get stable performance estimates.
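As an illustration of the ratios above, the following is a minimal sketch of a 70/15/15 split using only the Python standard library. The function name `three_way_split` and its parameters are hypothetical, chosen for this example; libraries such as scikit-learn provide their own splitting utilities.

```python
import random

def three_way_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle a copy of `items` and cut it into train/validation/test lists."""
    if val_frac < 0 or test_frac < 0 or val_frac + test_frac >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))  # 700 150 150
```

Fixing the seed matters here: if the split changed between runs, examples could drift from the test set into the training set across experiments.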

Stratified Splitting

When dealing with classification tasks, especially those involving imbalanced datasets, stratified sampling is essential. In a stratified split, the data is divided so that each partition (train, validation, test) preserves the same class distribution as the original dataset.

For example, if a medical dataset contains 95% healthy patients and 5% patients with a rare disease, a random split might produce a test set with no diseased patients at all. A stratified split ensures that approximately 5% of the test set consists of diseased patients, matching the overall distribution. In Python's scikit-learn library, this is as simple as setting stratify=y in the train_test_split() function.

Stratified splitting is also valuable for cross-validation. Stratified k-fold cross-validation ensures that every fold maintains the class balance, leading to more consistent and less biased performance estimates.
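The idea behind a stratified split can be sketched in a few lines of plain Python. This is an illustrative implementation under simple assumptions (a single label per example, a two-way split), not the scikit-learn code; the helper name `stratified_split` is hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) so each class keeps roughly the same share in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_frac)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# The 95% / 5% medical example from the text, with 1,000 patients.
labels = ["healthy"] * 950 + ["diseased"] * 50
train_idx, test_idx = stratified_split(labels, test_frac=0.2)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts["healthy"], test_counts["diseased"])  # 190 10
```

Because the quota is computed per class, the test set keeps the 5% disease rate regardless of how the shuffle falls.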

Additional Considerations for Creating Test Sets

Beyond stratification, several other factors matter when constructing a test set:

Temporal ordering. In time-series problems, the test set should contain data from a later time period than the training set. Randomly shuffling and splitting time-series data can cause data leakage because future information bleeds into the training set.
Group-based splits. In some domains, data points are not independent. For instance, in medical imaging, multiple images may come from the same patient. All images from a given patient should go into the same split to prevent the model from memorizing patient-specific features.
Geographic or domain splits. For models intended to generalize across regions or domains, holding out entire regions or domains as the test set provides a more realistic evaluation of real-world performance.
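The group-based and temporal rules above can be sketched as follows. Both helpers (`group_split`, `temporal_split`) and the sample patient IDs and day numbers are hypothetical, made up for illustration only.

```python
import random

def group_split(groups, test_frac=0.2, seed=0):
    """Split row indices so every group lands entirely in train or entirely in test."""
    unique_groups = sorted(set(groups))
    random.Random(seed).shuffle(unique_groups)
    n_test_groups = max(1, round(len(unique_groups) * test_frac))
    test_groups = set(unique_groups[:n_test_groups])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

def temporal_split(timestamps, test_frac=0.2):
    """Put the most recent `test_frac` of rows in the test set, with no shuffling."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    cut = len(order) - round(len(order) * test_frac)
    return order[:cut], order[cut:]

# Several images per patient: no patient may appear on both sides.
patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p5", "p6"]
train_idx, test_idx = group_split(patient_ids)
train_patients = {patient_ids[i] for i in train_idx}
test_patients = {patient_ids[i] for i in test_idx}
print(train_patients & test_patients)  # set()

# Observations indexed by day: the test rows are the latest ones.
days = [5, 1, 9, 3, 7, 2, 8, 4, 6, 10]
past_idx, future_idx = temporal_split(days)
print(sorted(days[i] for i in future_idx))  # [9, 10]
```

Splitting by group means the realized test fraction is only approximate, since groups differ in size.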

Test Set Size Considerations

The size of the test set directly affects the reliability of performance estimates. A test set that is too small produces noisy, unstable metrics; a test set that is too large wastes data that could have been used for training.

Several factors influence the ideal test set size:

Factor	Impact on Test Set Size
Total dataset size	Larger datasets can afford smaller test proportions while still having enough absolute test samples
Number of classes	More classes generally require a larger test set to have enough examples per class
Desired confidence level	Narrower confidence intervals require more test samples
Model complexity	Complex models with subtle failure modes benefit from larger test sets
Class imbalance	Heavily imbalanced datasets need larger test sets to adequately represent minority classes

A useful rule of thumb is that the test set should contain at least several hundred examples, and ideally more than 1,000, to produce stable accuracy estimates. For classification tasks with many classes or significant class imbalance, larger test sets are needed to ensure each class has enough representation.
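One way to see where this rule of thumb comes from is to invert the normal-approximation margin of error, z * sqrt(p * (1 - p) / n), and solve for n. The sketch below assumes a simple accuracy metric on independent test examples; the function name `required_test_size` is hypothetical.

```python
import math

def required_test_size(margin, expected_accuracy=0.5, z=1.96):
    """Smallest n for which z * sqrt(p * (1 - p) / n) <= margin."""
    p = expected_accuracy
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# Worst case p = 0.5, 95% confidence (z = 1.96).
for margin in (0.05, 0.03, 0.015):
    print(margin, required_test_size(margin))
# 0.05 385
# 0.03 1068
# 0.015 4269
```

Under these assumptions, a few hundred examples pin accuracy down to about five percentage points, and roughly a thousand are needed for three points. Halving the margin quadruples the required size.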

Why Test Sets Must Remain Unseen

The fundamental principle of test set evaluation is that the test data must remain completely unseen during the entire model development process. This requirement exists because any exposure to test data, whether direct or indirect, can lead to optimistically biased performance estimates.

When a model is tuned based on test set feedback, even unconsciously through manual inspection, the reported performance no longer reflects how the model would handle truly novel data. This problem is sometimes called information leakage or data leakage.

Data Leakage and the Test Set

Data leakage occurs when information from the test set inadvertently influences the training process. There are several common ways this happens:

Preprocessing leakage. Applying normalization, feature scaling, or imputation to the entire dataset before splitting means that statistics computed from the test set (such as means and standard deviations) influence how training data is transformed. The correct approach is to fit preprocessing steps on the training set only and then apply those same transformations to the test set ^[2].
Feature selection leakage. Selecting features based on correlations computed over the full dataset (including test data) leaks test set information into the model. Feature selection should be performed using training data alone.
Overlap leakage. In some cases, duplicate or near-duplicate data points exist in both the training and test sets. This is particularly common with web-scraped datasets, where the same content may appear under different URLs.
Temporal leakage. Using future data to predict past events in a time-series context is a form of leakage that produces unrealistically high performance.

Data leakage leads to models that appear to perform well during evaluation but fail significantly when deployed in production. Detecting and preventing leakage is a critical part of building trustworthy machine learning systems.
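The fix for preprocessing leakage can be illustrated with a small standardization sketch: the mean and standard deviation are computed from the training split alone and then reused, unchanged, on the test split. The helper names and the toy feature values are hypothetical.

```python
import statistics

def fit_standardizer(train_values):
    """Compute scaling statistics from the training split only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        std = 1.0  # avoid division by zero for constant features
    return mean, std

def apply_standardizer(values, mean, std):
    """Apply previously fitted statistics; never refit on validation or test data."""
    return [(v - mean) / std for v in values]

train_feature = [2.0, 4.0, 6.0, 8.0]
test_feature = [10.0, 12.0]

mean, std = fit_standardizer(train_feature)           # test data is not consulted
train_scaled = apply_standardizer(train_feature, mean, std)
test_scaled = apply_standardizer(test_feature, mean, std)
print(mean)  # 5.0
```

The leaky variant would call `fit_standardizer(train_feature + test_feature)`, letting test values shift the statistics that shape the training inputs.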

Test Set Contamination in Large Language Models

With the rise of large language models (LLMs) trained on massive web-scraped corpora, a new form of test set compromise has emerged: benchmark data contamination. This occurs when text from well-known evaluation benchmarks ends up in the pretraining data of an LLM, allowing the model to effectively memorize answers rather than demonstrate genuine reasoning ability ^[5].

Studies have reported contamination rates ranging from 1% to 45% across popular benchmarks. The problem is particularly acute because LLM training corpora often include billions of web pages, making it difficult to verify that no benchmark data has been included. A 2024 survey categorized benchmark data contamination into four severity levels ^[5]:

Severity Level	Description	Example
Semantic	Exposure to content from the same source or topic as the benchmark	Training on Wikipedia articles that cover the same questions
Information	Exposure to metadata, label distributions, or external discussions	Training on forum posts discussing benchmark answers
Data	Exposure to benchmark inputs without labels	Benchmark questions appearing in training text without answers
Label	Complete exposure including answers	Full benchmark question-answer pairs in training data

Several detection methods have been proposed, including n-gram overlap analysis, membership inference testing, and comparison-based methods that check whether a model finds certain orderings of benchmark items suspiciously likely ^[5].
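Of these methods, n-gram overlap analysis is the simplest to sketch. The example below is a toy version that assumes whitespace tokenization and an in-memory corpus; real contamination checks run over far larger corpora with more careful normalization. The function names and the sample strings are hypothetical.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(benchmark_item, corpus_docs, n=8):
    """Fraction of the item's n-grams that also occur somewhere in the corpus."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)

corpus = [
    "forum post: the question was what is the capital of france and the answer is paris",
    "an unrelated page about gardening tips for the spring season",
]
leaked = "What is the capital of France and the answer is Paris"
clean = "Which river flows through the city of Vienna in Austria"

print(overlap_fraction(leaked, corpus, n=4))  # 1.0
print(overlap_fraction(clean, corpus, n=4))   # 0.0
```

A high overlap fraction flags an item for closer inspection. It does not by itself prove that a model memorized the answer.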

To combat contamination, researchers have developed dynamic benchmarks such as LiveBench, which uses frequently updated questions from recent sources that could not have appeared in training data ^[6]. Other approaches include private benchmarks isolated from the public internet and crowdsourced evaluation platforms like Chatbot Arena that rely on human preference rankings rather than fixed test sets.

Benchmark Test Sets

Several benchmark test sets have become standard reference points for measuring progress in machine learning:

Benchmark	Domain	Test Set Details	Significance
MNIST	Handwritten digit recognition	10,000 images across 10 digit classes	One of the earliest and most widely used ML benchmarks
CIFAR-10 / CIFAR-100	Image classification	10,000 test images (10 or 100 classes)	Standard benchmarks for evaluating convolutional neural networks
ImageNet	Large-scale image classification	50,000 labeled validation images across 1,000 classes (commonly used for evaluation because test labels are withheld)	Drove the deep learning revolution in computer vision
GLUE	Natural language understanding	Multiple test sets across 9 NLU tasks	Established standardized NLP evaluation; models reached human-level performance by 2019
SuperGLUE	Advanced natural language understanding	More challenging tasks than GLUE	Created after models saturated GLUE benchmarks
MMLU	Multitask language understanding	14,042 multiple-choice questions across 57 subjects	Tests factual knowledge and reasoning in LLMs
SQuAD	Reading comprehension	Over 10,000 question-answer pairs	Widely used for evaluating question-answering systems

These benchmarks have played a crucial role in driving progress, but they also illustrate the challenge of test set reuse: as the community repeatedly evaluates new models on the same test sets over many years, there is a risk of indirect overfitting to those specific data distributions ^[3].

Test Set Reuse and Adaptive Overfitting

One subtle but important problem arises when the same test set is used to evaluate a long sequence of models over time. Even if no single researcher uses the test set improperly, the collective community may gradually overfit to it through a process called adaptive overfitting ^[3].

In machine learning competitions and public benchmarks, participants receive test set feedback through leaderboard scores. Each submission is implicitly influenced by the scores of previous submissions. Over many iterations, this feedback loop can inflate test set performance without corresponding improvements in real-world generalization.

Research on adaptive data analysis has formalized this problem ^[4]. In a non-adaptive setting, where all k models are fixed before any test results are seen, the error from evaluating them on a test set of n samples grows as O(sqrt(log k / n)), which is quite slow. However, in an adaptive setting where each model is designed after seeing previous results, the error grows as O(sqrt(k * log n / n)). The dependence on k goes from logarithmic to linear, which is exponentially worse. With enough adaptive queries, it becomes possible to artificially inflate test set performance by a significant margin.
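To get a feel for the gap between the two rates, the expressions inside the O(...) can be evaluated directly. This is only a rough illustration: the hidden constants are ignored, so the numbers show how the bounds scale rather than the actual error. The function names are hypothetical.

```python
import math

def nonadaptive_bound(k, n):
    """sqrt(log k / n): k models fixed in advance, n test samples."""
    return math.sqrt(math.log(k) / n)

def adaptive_bound(k, n):
    """sqrt(k * log n / n): each model chosen after seeing earlier test results."""
    return math.sqrt(k * math.log(n) / n)

n = 10_000
for k in (10, 100, 1000):
    print(k, round(nonadaptive_bound(k, n), 3), round(adaptive_bound(k, n), 3))
```

With n = 10,000, the non-adaptive expression stays below 0.03 even for 1,000 models, while the adaptive expression approaches 1, at which point the bound no longer guarantees anything useful.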

Mitigation strategies include:

Adding calibrated noise to test set evaluations, which allows a number of adaptive queries that is quadratic, rather than linear, in the test set size before accuracy degrades ^[4].
Using fresh test sets periodically, retiring old benchmarks and introducing new ones.
Holdout sets with restricted access, where test set labels are never publicly released and evaluation can only be done through a submission system with rate limits.

Out-of-Distribution Test Sets

A standard test set is drawn from the same distribution as the training data (known as in-distribution or IID evaluation). However, real-world deployment often involves data that differs from the training distribution. Out-of-distribution (OOD) test sets are designed to evaluate how well a model handles such distributional shifts.

OOD evaluation is important because models that achieve high accuracy on in-distribution test sets can fail dramatically when the input distribution changes. For example, an image classifier trained on photos taken in good lighting conditions might perform poorly on images captured in fog or at night.

Common types of distribution shifts tested by OOD evaluation include:

Covariate shift: The input distribution changes, but the relationship between inputs and outputs stays the same.
Label shift: The class proportions change between training and deployment.
Domain shift: The model is applied to a related but different domain (such as training on product reviews and testing on movie reviews).
Adversarial inputs: Deliberately crafted inputs designed to fool the model.

Several benchmarks explicitly test OOD robustness, including ImageNet-C (corrupted images), ImageNet-R (renditions), and WILDS (a collection of distribution shift benchmarks across multiple domains).

Reporting Test Results

How test results are reported matters almost as much as how they are obtained. Responsible reporting practices help the community draw valid conclusions and avoid misleading comparisons.

Single Run vs. Multiple Runs

Deep learning models involve randomness through weight initialization, data shuffling, and stochastic optimization. A single training run may produce different results from another run with a different random seed. Reporting results from a single run can be misleading because a lucky seed might overestimate (or underestimate) typical model performance.

Best practice is to train the model multiple times with different random seeds and report the mean and standard deviation of test set performance. This gives readers a sense of how stable the results are.
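A minimal sketch of this reporting practice is shown below. The five accuracy values are made-up placeholders standing in for the test scores of five training runs, and `summarize_runs` is a hypothetical helper.

```python
import statistics

def summarize_runs(scores):
    """Mean and sample standard deviation of per-seed test scores."""
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical test accuracies from five runs that differ only in random seed.
seed_scores = [0.912, 0.905, 0.918, 0.909, 0.914]

mean, std = summarize_runs(seed_scores)
print(f"test accuracy: {mean:.3f} +/- {std:.3f} (n={len(seed_scores)} seeds)")
```

Stating the number of seeds alongside the mean and standard deviation lets readers judge how much weight the spread can bear.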

Confidence Intervals

Confidence intervals provide a range within which the true performance is likely to fall. A 95% confidence interval, for example, means that if the experiment were repeated many times, approximately 95% of the resulting intervals would contain the true performance value ^[8].

Two common methods for computing confidence intervals on test set results are:

Method	Description	When to Use
Normal approximation	Uses the formula: accuracy +/- 1.96 * sqrt(accuracy * (1 - accuracy) / n)	Large test sets where accuracy is not close to 0 or 1
Bootstrap	Resamples the test set thousands of times and computes metrics on each sample	Any test set size; does not assume normality

Reporting confidence intervals is especially important when comparing models, because small differences in test set accuracy may not be statistically significant ^[8].
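Both methods in the table can be sketched with the standard library. The example assumes per-example correctness flags (1 for a correct prediction, 0 otherwise) and uses the percentile bootstrap, which is one of several bootstrap variants. The function names and the 850-of-1,000 example are hypothetical.

```python
import math
import random

def normal_approx_ci(correct_flags, z=1.96):
    """accuracy +/- z * sqrt(accuracy * (1 - accuracy) / n)."""
    n = len(correct_flags)
    acc = sum(correct_flags) / n
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half_width, acc + half_width

def bootstrap_ci(correct_flags, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample the test set with replacement and re-score."""
    rng = random.Random(seed)
    n = len(correct_flags)
    accs = sorted(
        sum(rng.choices(correct_flags, k=n)) / n for _ in range(n_resamples)
    )
    lower = accs[int((alpha / 2) * n_resamples)]
    upper = accs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical outcome: 850 correct predictions on a 1,000-example test set.
flags = [1] * 850 + [0] * 150
normal_lo, normal_hi = normal_approx_ci(flags)
boot_lo, boot_hi = bootstrap_ci(flags)
print(f"normal:    [{normal_lo:.3f}, {normal_hi:.3f}]")
print(f"bootstrap: [{boot_lo:.3f}, {boot_hi:.3f}]")
```

On a test set of this size the two intervals should be close to each other. The bootstrap becomes more useful for metrics such as F1 or AUC, where no simple closed-form interval is available.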

Performance Metrics on the Test Set

The performance of a machine learning model on the test set is measured using metrics appropriate to the task:

Task Type	Common Metrics
Classification	Accuracy, precision, recall, F1 score, AUC-ROC
Regression	Mean squared error (MSE), mean absolute error (MAE), R-squared
Ranking	NDCG, MAP, MRR
Generation	BLEU, ROUGE, perplexity
Clustering	Adjusted Rand index, silhouette score, mutual information

It is good practice to report multiple metrics rather than a single number, as different metrics capture different aspects of model behavior. For instance, in an imbalanced classification task, accuracy alone can be misleading if the model achieves high accuracy by simply predicting the majority class.
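The majority-class pitfall is easy to reproduce. The sketch below scores a classifier that always predicts the negative class on a hypothetical test set with 5% positives; `classification_metrics` is an illustrative helper for the binary case, not a library function.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

y_true = [1] * 5 + [0] * 95   # 5% positive class
y_pred = [0] * 100            # always predict the majority class
metrics = classification_metrics(y_true, y_pred)
print(metrics["accuracy"], metrics["recall"], metrics["f1"])  # 0.95 0.0 0.0
```

The 95% accuracy hides the fact that the model never detects a single positive case, which recall and F1 expose immediately.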

Alternatives to a Fixed Test Set

In some scenarios, maintaining a separate, fixed test set is impractical or insufficient. Common alternatives include:

Cross-validation. The dataset is split into k folds, and each fold serves as the test set once while the remaining folds are used for training. This approach is especially valuable for small datasets, as every data point gets used for both training and evaluation.
Bootstrapping. Random samples with replacement are drawn from the dataset, and the model is evaluated on the out-of-bag (unsampled) data points. This provides both a performance estimate and a confidence interval.
Rolling/expanding window evaluation. For time-series data, the training window grows or slides forward over time, and the model is tested on the next time period. This preserves the temporal ordering of data.
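As an illustration of the first alternative, the sketch below generates the index sets for k-fold cross-validation. It is a simplified version that assumes independent examples (no stratification, grouping, or temporal ordering); the generator name `k_fold_indices` is hypothetical.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, held_out_indices) once per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out

splits = list(k_fold_indices(10, k=5))
for train_idx, held_out_idx in splits:
    print(sorted(held_out_idx), len(train_idx))
```

Each example is held out exactly once, so the per-fold scores can be averaged into a single estimate that uses the whole dataset for evaluation.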

Explain Like I'm 5 (ELI5)

Imagine you are studying for a big test at school. You have a workbook full of practice problems that you use to learn and study. These practice problems are like the training set.

Your teacher also gives you a practice quiz to help you figure out what you still need to study. That practice quiz is like the validation set. You can look at your mistakes on the practice quiz and study more.

Finally, on test day, you get a real test with questions you have never seen before. That real test is the test set. It shows whether you actually learned the material or whether you just memorized the answers to the practice problems.

If someone gave you the real test questions ahead of time, your score would not mean anything because you could just memorize the answers. That is why the test set has to be kept secret until the very end.

References

Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning*. Springer. Chapter 7: Model Assessment and Selection.
Raschka, S. (2018). "Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning." *arXiv:1811.12808*.
Recht, B., Roelofs, R., Schmidt, L., & Shankar, V. (2019). "Do ImageNet Classifiers Generalize to ImageNet?" *Proceedings of the 36th International Conference on Machine Learning (ICML)*.
Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth, A. (2015). "Generalization in Adaptive Data Analysis and Holdout Reuse." *Advances in Neural Information Processing Systems (NeurIPS)*.
Xu, R., Li, Y., Luo, C., & Xu, Z. (2024). "Benchmark Data Contamination of Large Language Models: A Survey." *arXiv:2406.04244*.
White, C., Dooley, S., Roberts, M., Pal, A., Feuer, B., & Jain, S. (2024). "LiveBench: A Challenging, Contamination-Free LLM Benchmark." *arXiv:2406.19314*.
Brownlee, J. (2020). "What is the Difference Between Test and Validation Datasets?" *Machine Learning Mastery*. https://machinelearningmastery.com/difference-test-validation-datasets/
Raschka, S. (2022). "Creating Confidence Intervals for Machine Learning Classifiers." https://sebastianraschka.com/blog/2022/confidence-intervals-for-ml.html
Northcutt, C. G., Athalye, A., & Mueller, J. (2021). "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks." *Advances in Neural Information Processing Systems (NeurIPS)*.
Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2019). "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems." *Advances in Neural Information Processing Systems (NeurIPS)*.
