# Test Set

> Source: https://aiwiki.ai/wiki/test_set
> Updated: 2026-07-12
> Categories: Machine Learning, Model Evaluation
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **test set** is a portion of data held back from model development and used only once, after all training and tuning is complete, to give an unbiased estimate of how a [machine learning](/wiki/machine_learning) model performs on new, unseen data. It is one of three standard data partitions in supervised learning, alongside the [training set](/wiki/training_set) that fits the model's parameters and the [validation set](/wiki/validation_set) that guides [hyperparameter](/wiki/hyperparameter) tuning [1]. Because the test set is never used to make modeling decisions, its score is the single most trusted number for judging [generalization](/wiki/generalization), the model's ability to perform on data it has never encountered.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## Test Set in Machine Learning

### Definition

In [machine learning](/wiki/machine_learning), a **test set** (also called a **test dataset** or **held-out test data**) is a subset of data that is strictly reserved for evaluating the final performance of a trained model. The test set remains completely unseen by the model during both the [training](/wiki/training) and hyperparameter tuning phases. Its sole purpose is to provide an unbiased estimate of how well the model will perform on new, real-world data it has never encountered before.

The test set is one of three standard data partitions used in supervised machine learning, alongside the [training set](/wiki/training_set) (used to fit the model's parameters) and the [validation set](/wiki/validation_set) (used for [hyperparameter](/wiki/hyperparameter) tuning and model selection). This three-way split is sometimes called the **holdout method**, and it forms the foundation of reliable model evaluation across nearly all machine learning workflows [1]. As Hastie, Tibshirani, and Friedman put it in *The Elements of Statistical Learning*, "the test set should be kept in a vault, and be brought out only at the end of the data analysis" [1].

### What is a test set used for?

The primary purpose of a test set is to assess the [generalization](/wiki/generalization) ability of a machine learning model [1]. Generalization refers to the model's capability to make accurate predictions on new, unseen data rather than simply memorizing the patterns in the training data.

By comparing a model's performance on the [training set](/wiki/training_set), [validation set](/wiki/validation_set), and test set, practitioners can diagnose potential issues. If a model performs well on training data but poorly on the test set, this is a strong signal of [overfitting](/wiki/overfitting) [2]. Conversely, if the model performs poorly on all three splits, it may be [underfitting](/wiki/underfitting). The test set provides the final, definitive performance number that researchers report in papers and that engineers use to decide whether a model is ready for deployment.

### How does a test set differ from a validation set?

A common source of confusion in machine learning is the distinction between the test set and the [validation set](/wiki/validation_set) [7]. Although both contain data the model has not trained on directly, they serve fundamentally different roles in the model development pipeline.

| Aspect | Validation Set | Test Set |
|---|---|---|
| **When used** | During training and model selection | Only after all training and tuning is complete |
| **Purpose** | Tune [hyperparameters](/wiki/hyperparameter), select model architecture, guide early stopping | Provide a final, unbiased estimate of model performance |
| **Influence on model** | Indirectly influences the model through decisions made during tuning | Must never influence model decisions |
| **Frequency of use** | Evaluated repeatedly throughout the development cycle | Ideally evaluated only once at the very end |
| **Analogy** | Practice exam taken while studying | Final exam taken after all studying is done |

The validation set is part of the iterative development loop: practitioners train the model, evaluate it on the validation set, adjust hyperparameters, and repeat. The test set sits outside this loop entirely. Using the test set during development (even indirectly, by peeking at results and then changing the model) compromises its ability to give an honest performance estimate [7].

### How do you create a test set?

#### Data Splitting Ratios

To create a test set, the initial dataset is divided into training, validation, and test portions [2]. Common splitting ratios include:

| Split Strategy | Training | Validation | Test | Best For |
|---|---|---|---|---|
| 70/15/15 | 70% | 15% | 15% | Medium-sized datasets |
| 80/10/10 | 80% | 10% | 10% | Large datasets |
| 60/20/20 | 60% | 20% | 20% | Smaller datasets where evaluation reliability matters |
| 90/5/5 | 90% | 5% | 5% | Very large datasets (millions of samples) |

There is no universal rule for the best ratio. The right split depends on the total amount of data available, the complexity of the task, and how precise the performance estimate needs to be. For very large datasets (hundreds of thousands or millions of samples), even 5% or 10% can yield a sufficiently large test set. For small datasets with only a few hundred examples, a larger test proportion (20% or more) may be necessary to get stable performance estimates.
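As an illustration, a 70/15/15 split can be sketched with the Python standard library alone. The function name `three_way_split` and its parameters are assumptions made for this example and do not come from any particular library.

```python
import random

def three_way_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut the data into train / validation / test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))  # 700 150 150
```

Fixing the seed matters here: if the split changes between runs, examples that were once in the test set can drift into the training set.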

#### Stratified Splitting

When dealing with classification tasks, especially those involving imbalanced datasets, **stratified sampling** is essential. In a stratified split, the data is divided so that each partition (train, validation, test) preserves the same class distribution as the original dataset.

For example, if a medical dataset contains 95% healthy patients and 5% patients with a rare disease, a purely random split of a small dataset might produce a test set with very few diseased patients, or none at all. A stratified split ensures that approximately 5% of the test set consists of diseased patients, matching the overall distribution. In Python's scikit-learn library, this is as simple as setting `stratify=y` in the `train_test_split()` function.

Stratified splitting is also valuable for [cross-validation](/wiki/cross-validation). Stratified k-fold cross-validation ensures that every fold maintains the class balance, leading to more consistent and less biased performance estimates.
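To make the idea concrete, the following is a minimal standard-library sketch of a stratified train/test split. It is not scikit-learn's implementation; the helper name `stratified_split` is hypothetical, and in practice `train_test_split(..., stratify=y)` would normally be used instead.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx), sampling the same fraction from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical dataset mirroring the 95% / 5% example above.
labels = ["healthy"] * 950 + ["disease"] * 50
train_idx, test_idx = stratified_split(labels, test_frac=0.2)
print(Counter(labels[i] for i in test_idx))
```

With these inputs the test set holds 190 healthy and 10 diseased examples, so the 5% minority share is preserved.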

#### Additional Considerations for Creating Test Sets

Beyond stratification, several other factors matter when constructing a test set:

- **Temporal ordering.** In time-series problems, the test set should contain data from a later time period than the training set. Randomly shuffling and splitting time-series data can cause data leakage because future information bleeds into the training set.
- **Group-based splits.** In some domains, data points are not independent. For instance, in medical imaging, multiple images may come from the same patient. All images from a given patient should go into the same split to prevent the model from memorizing patient-specific features.
- **Geographic or domain splits.** For models intended to generalize across regions or domains, holding out entire regions or domains as the test set provides a more realistic evaluation of real-world performance.
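The group-based and temporal rules above can be sketched as follows. This is a simplified illustration with hypothetical helper names (`group_split`, `temporal_split`) and made-up patient identifiers, not a reference implementation.

```python
import random

def group_split(groups, test_frac=0.2, seed=0):
    """Assign whole groups (e.g. patients) to train or test, never to both."""
    unique_groups = sorted(set(groups))
    random.Random(seed).shuffle(unique_groups)
    n_test_groups = max(1, round(len(unique_groups) * test_frac))
    test_groups = set(unique_groups[:n_test_groups])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

def temporal_split(timestamps, cutoff):
    """Train on everything strictly before the cutoff, test on the rest."""
    train_idx = [i for i, t in enumerate(timestamps) if t < cutoff]
    test_idx = [i for i, t in enumerate(timestamps) if t >= cutoff]
    return train_idx, test_idx

# Hypothetical data: 10 patients with 3 images each.
patient_ids = [f"patient_{p}" for p in range(10) for _ in range(3)]
train_idx, test_idx = group_split(patient_ids)
train_patients = {patient_ids[i] for i in train_idx}
test_patients = {patient_ids[i] for i in test_idx}
print(train_patients & test_patients)  # set()
```

Splitting on the group identifier rather than on individual rows is what guarantees that no patient appears on both sides of the split.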

### How large should a test set be?

The size of the test set directly affects the reliability of performance estimates. A test set that is too small produces noisy, unstable metrics; a test set that is too large wastes data that could have been used for training.

Several factors influence the ideal test set size:

| Factor | Impact on Test Set Size |
|---|---|
| Total dataset size | Larger datasets can afford smaller test proportions while still having enough absolute test samples |
| Number of classes | More classes generally require a larger test set to have enough examples per class |
| Desired confidence level | Narrower confidence intervals require more test samples |
| Model complexity | Complex models with subtle failure modes benefit from larger test sets |
| Class imbalance | Heavily imbalanced datasets need larger test sets to adequately represent minority classes |

A useful rule of thumb is that the test set should contain at least several hundred examples, and ideally more than 1,000, to produce stable [accuracy](/wiki/accuracy) estimates. For classification tasks with many classes or significant class imbalance, larger test sets are needed to ensure each class has enough representation.
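One way to see why a few hundred examples is a practical minimum is to compute the half-width of a 95% normal-approximation confidence interval for different test set sizes. The sketch below assumes a true accuracy of 0.90 purely for illustration; `margin_of_error` is a name chosen for this example.

```python
import math

def margin_of_error(accuracy, n, z=1.96):
    """Half-width of the normal-approximation confidence interval for accuracy."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)

for n in (100, 1_000, 10_000):
    print(n, round(margin_of_error(0.90, n), 4))
# 100 0.0588
# 1000 0.0186
# 10000 0.0059
```

Under this assumption, a 100-example test set cannot distinguish 90% accuracy from 85%, while 1,000 examples narrow the uncertainty to about two percentage points.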

### Why must the test set remain unseen?

The fundamental principle of test set evaluation is that the test data must remain completely unseen during the entire model development process. This requirement exists because any exposure to test data, whether direct or indirect, can lead to optimistically biased performance estimates.

When a model is tuned based on test set feedback, even unconsciously through manual inspection, the reported performance no longer reflects how the model would handle truly novel data. This problem is sometimes called **information leakage** or **data leakage**.

### Data Leakage and the Test Set

Data leakage occurs when information from the test set inadvertently influences the training process. There are several common ways this happens:

- **Preprocessing leakage.** Applying normalization, feature scaling, or imputation to the entire dataset before splitting means that statistics computed from the test set (such as means and standard deviations) influence how training data is transformed. The correct approach is to fit preprocessing steps on the training set only and then apply those same transformations to the test set [2].
- **Feature selection leakage.** Selecting features based on correlations computed over the full dataset (including test data) leaks test set information into the model. Feature selection should be performed using training data alone.
- **Overlap leakage.** In some cases, duplicate or near-duplicate data points exist in both the training and test sets. This is particularly common with web-scraped datasets, where the same content may appear under different URLs.
- **Temporal leakage.** Using future data to predict past events in a time-series context is a form of leakage that produces unrealistically high performance.

Data leakage leads to models that appear to perform well during evaluation but fail significantly when deployed in production. Detecting and preventing leakage is a critical part of building trustworthy machine learning systems.
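The rule for avoiding preprocessing leakage can be sketched with a hand-written standardizer: statistics are computed on the training values only and then reused unchanged on the test values. The function names and the toy numbers are assumptions for illustration.

```python
import statistics

def fit_standardizer(train_values):
    """Compute scaling statistics from the training split only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)  # assumes the training values are not all equal
    return mean, std

def standardize(values, mean, std):
    """Apply previously fitted statistics without recomputing them."""
    return [(v - mean) / std for v in values]

train_values = [1.0, 2.0, 3.0, 4.0, 5.0]
test_values = [10.0, 11.0]  # toy values, deliberately outside the training range

mean, std = fit_standardizer(train_values)            # fit on train only
train_scaled = standardize(train_values, mean, std)
test_scaled = standardize(test_values, mean, std)     # reuse the train statistics
print(mean, round(std, 4))
```

Had the mean and standard deviation been computed over all seven values, the test examples would have shifted the statistics and information about them would have reached the training pipeline.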

### Test Set Contamination in Large Language Models

With the rise of [large language models](/wiki/large_language_model) (LLMs) trained on massive web-scraped corpora, a new form of test set compromise has emerged: **benchmark data contamination**. This occurs when text from well-known evaluation benchmarks ends up in the pretraining data of an LLM, allowing the model to effectively memorize answers rather than demonstrate genuine reasoning ability [5].

Research has found contamination rates ranging from 1% to 45% across popular benchmarks. The problem is particularly acute because LLM training corpora often include billions of web pages, making it difficult to verify that no benchmark data has been included. A 2024 survey categorized benchmark data contamination into four severity levels [5]:

| Severity Level | Description | Example |
|---|---|---|
| Semantic | Exposure to content from the same source or topic as the benchmark | Training on Wikipedia articles that cover the same questions |
| Information | Exposure to metadata, label distributions, or external discussions | Training on forum posts discussing benchmark answers |
| Data | Exposure to benchmark inputs without labels | Benchmark questions appearing in training text without answers |
| Label | Complete exposure including answers | Full benchmark question-answer pairs in training data |

Several detection methods have been proposed, including n-gram overlap analysis, membership inference testing, and comparison-based methods that check whether a model finds certain orderings of benchmark items suspiciously likely [5].
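As a rough illustration of the n-gram overlap idea, the sketch below measures what fraction of a benchmark item's word n-grams also occur in a training document. It is a simplified, assumed version of the technique (whitespace tokenization, exact matching, a default window of 8 words), not the procedure used by any specific paper, and the example strings are invented.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in a text (empty if the text is too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(benchmark_item, corpus_document, n=8):
    """Fraction of the benchmark item's n-grams that also appear in the document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    doc_grams = ngrams(corpus_document, n)
    return len(item_grams & doc_grams) / len(item_grams)

question = "what is the capital city of the country that hosted the 1998 world cup"
leaked_page = "quiz answers: " + question + " answer: paris"
clean_page = "a travel guide describing museums and restaurants in several european cities"

print(overlap_fraction(question, leaked_page))  # 1.0
print(overlap_fraction(question, clean_page))   # 0.0
```

Exact matching of this kind misses paraphrased or translated copies, which is one reason the other detection methods mentioned above exist.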

To combat contamination, researchers have developed dynamic benchmarks such as LiveBench, which uses frequently updated questions from recent sources that could not have appeared in training data [6]. Other approaches include private benchmarks isolated from the public internet and crowdsourced evaluation platforms like Chatbot Arena that rely on human preference rankings rather than fixed test sets.

### Benchmark Test Sets

Several benchmark test sets have become standard reference points for measuring progress in machine learning:

| Benchmark | Domain | Test Set Details | Significance |
|---|---|---|---|
| [MNIST](/wiki/mnist) | Handwritten digit recognition | 10,000 images across 10 digit classes | One of the earliest and most widely used ML benchmarks |
| [CIFAR-10](/wiki/cifar_10) / [CIFAR-100](/wiki/cifar_100) | Image classification | 10,000 test images (10 or 100 classes) | Standard benchmarks for evaluating [convolutional neural networks](/wiki/convolutional_neural_network) |
| [ImageNet](/wiki/imagenet) | Large-scale image classification | 50,000 validation images across 1,000 classes (the official test labels are withheld, so results are usually reported on the validation set) | Drove the [deep learning](/wiki/deep_learning) revolution in [computer vision](/wiki/computer_vision) |
| [GLUE](/wiki/glue_benchmark) | Natural language understanding | Multiple test sets across 9 NLU tasks | Established standardized NLP evaluation; models reached human-level by 2019 |
| [SuperGLUE](/wiki/superglue) | Advanced natural language understanding | More challenging tasks than GLUE | Created after models saturated GLUE benchmarks [10] |
| [MMLU](/wiki/mmlu) | Multitask language understanding | 14,042 test questions across 57 subjects | Tests factual knowledge and reasoning in LLMs |
| [SQuAD](/wiki/squad) | Reading comprehension | Over 10,000 question-answer pairs | Widely used for evaluating question-answering systems |

These benchmarks have played a crucial role in driving progress, but they also illustrate the challenge of test set reuse: as the community repeatedly evaluates new models on the same test sets over many years, there is a risk of indirect overfitting to those specific data distributions [3].

These standard test sets are also not error-free. Northcutt, Athalye, and Mueller (2021) audited the test sets of 10 widely used computer vision, natural language, and audio benchmarks and estimated an average of 3.4% incorrect labels, with 2,916 errors comprising roughly 6% of the [ImageNet](/wiki/imagenet) validation set [9]. They concluded that "label errors in test sets are numerous and widespread," and showed that correcting these labels can change which model ranks best on a benchmark [9].

### Test Set Reuse and Adaptive Overfitting

One subtle but important problem arises when the same test set is used to evaluate a long sequence of models over time. Even if no single researcher uses the test set improperly, the collective community may gradually overfit to it through a process called **adaptive overfitting** [3].

In machine learning competitions and public benchmarks, participants receive test set feedback through leaderboard scores. Each submission is implicitly influenced by the scores of previous submissions. Over many iterations, this feedback loop can inflate test set performance without corresponding improvements in real-world generalization.
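A toy simulation can show how this feedback loop inflates scores. In the sketch below, which is an assumed setup and not taken from the cited papers, labels are pure coin flips, so no model can truly beat 50% accuracy. Random "submissions" that happen to score above 50% on the reused test set are kept and combined by majority vote.

```python
import random

def simulate_leaderboard(n_test=200, n_models=500, seed=0):
    """Return (accuracy on the reused test set, accuracy on fresh data)."""
    rng = random.Random(seed)
    test_labels = [rng.randint(0, 1) for _ in range(n_test)]
    fresh_labels = [rng.randint(0, 1) for _ in range(n_test)]

    kept_test_preds, kept_fresh_preds = [], []
    for _ in range(n_models):
        # Each "model" guesses at random on both the reused and the fresh data.
        test_preds = [rng.randint(0, 1) for _ in range(n_test)]
        fresh_preds = [rng.randint(0, 1) for _ in range(n_test)]
        score = sum(p == y for p, y in zip(test_preds, test_labels)) / n_test
        if score > 0.5:  # leaderboard feedback: keep only submissions that scored well
            kept_test_preds.append(test_preds)
            kept_fresh_preds.append(fresh_preds)

    def majority_vote_accuracy(all_preds, labels):
        correct = 0
        for j, y in enumerate(labels):
            votes = sum(preds[j] for preds in all_preds)
            predicted = 1 if 2 * votes >= len(all_preds) else 0
            correct += int(predicted == y)
        return correct / len(labels)

    return (majority_vote_accuracy(kept_test_preds, test_labels),
            majority_vote_accuracy(kept_fresh_preds, fresh_labels))

reused_acc, fresh_acc = simulate_leaderboard()
print(reused_acc, fresh_acc)
```

The ensemble scores well above chance on the reused test set yet stays near 50% on fresh data, because the selection step used the test labels even though no individual submission saw them.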

To measure this empirically, Recht, Roelofs, Schmidt, and Shankar (2019) built entirely new test sets for [CIFAR-10](/wiki/cifar_10) and [ImageNet](/wiki/imagenet) using the original data collection process, then re-evaluated a broad range of models. Accuracy dropped by 11% to 14% on the new ImageNet test set [3]. Notably, the authors concluded the gap was "not caused by adaptivity, but by the models' inability to generalize to slightly harder images than those found in the original test sets," since accuracy gains on the original test set translated to even larger gains on the new one [3].

Research on adaptive data analysis has formalized this problem [4]. In a non-adaptive setting, the error from evaluating k models on a test set of n samples grows as $$O(\sqrt{\log k / n})$$, so it increases only logarithmically with the number of models. However, in an adaptive setting where each model is designed after seeing previous results, the error grows as $$O(\sqrt{k \log n / n})$$. The dependence on k is now linear rather than logarithmic under the square root, which is exponentially worse. With enough adaptive queries, it becomes possible to artificially inflate test set performance by a significant margin.
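To get a feel for the gap between the two rates, they can be evaluated numerically. The sketch below drops the constants hidden by the big-O notation, so the outputs indicate relative scale only and are not actual error bounds. The function names are chosen for this example.

```python
import math

def nonadaptive_rate(k, n):
    """sqrt(log k / n): error scale when the k models are fixed in advance."""
    return math.sqrt(math.log(k) / n)

def adaptive_rate(k, n):
    """sqrt(k log n / n): error scale when each model depends on earlier results."""
    return math.sqrt(k * math.log(n) / n)

n = 10_000  # test set size
k = 100     # number of models evaluated
print(round(nonadaptive_rate(k, n), 3))  # 0.021
print(round(adaptive_rate(k, n), 3))     # 0.303
```

For 100 evaluations on 10,000 test samples, the adaptive scale is more than ten times the non-adaptive one.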

Mitigation strategies include:

- **Adding calibrated noise** to test set evaluations, which allows quadratic rather than linear numbers of queries before accuracy degrades [4].
- **Using fresh test sets** periodically, retiring old benchmarks and introducing new ones.
- **Holdout sets with restricted access**, where test set labels are never publicly released and evaluation can only be done through a submission system with rate limits.

### Out-of-Distribution Test Sets

A standard test set is drawn from the same distribution as the training data (known as **in-distribution** or **IID** evaluation). However, real-world deployment often involves data that differs from the training distribution. **Out-of-distribution (OOD) test sets** are designed to evaluate how well a model handles such distributional shifts.

OOD evaluation is important because models that achieve high [accuracy](/wiki/accuracy) on in-distribution test sets can fail dramatically when the input distribution changes. For example, an image classifier trained on photos taken in good lighting conditions might perform poorly on images captured in fog or at night.

Common types of distribution shifts tested by OOD evaluation include:

- **Covariate shift:** The input distribution changes, but the relationship between inputs and outputs stays the same.
- **Label shift:** The class proportions change between training and deployment.
- **Domain shift:** The model is applied to a related but different domain (such as training on product reviews and testing on movie reviews).
- **Adversarial inputs:** Deliberately crafted inputs designed to fool the model.

Several benchmarks explicitly test OOD robustness, including ImageNet-C (corrupted images), ImageNet-R (renditions), and WILDS (a collection of distribution shift benchmarks across multiple domains).

### Reporting Test Results

How test results are reported matters almost as much as how they are obtained. Responsible reporting practices help the community draw valid conclusions and avoid misleading comparisons.

#### Single Run vs. Multiple Runs

Deep learning models involve randomness through weight initialization, data shuffling, and stochastic optimization. A single training run may produce different results from another run with a different random seed. Reporting results from a single run can be misleading because a lucky seed might overestimate (or underestimate) typical model performance.

Best practice is to train the model multiple times with different random seeds and report the mean and standard deviation of test set performance. This gives readers a sense of how stable the results are.
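A short sketch of this reporting step, using made-up accuracies from five hypothetical runs that differ only in random seed:

```python
import statistics

# Made-up test accuracies from five hypothetical training runs (different seeds).
test_accuracies = [0.912, 0.907, 0.915, 0.904, 0.910]

mean_acc = statistics.mean(test_accuracies)
std_acc = statistics.stdev(test_accuracies)  # sample standard deviation (divides by n - 1)

print(f"{mean_acc:.3f} ± {std_acc:.3f} over {len(test_accuracies)} seeds")
# 0.910 ± 0.004 over 5 seeds
```

Stating the number of seeds alongside the mean and standard deviation lets readers judge how much weight the spread can bear.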

#### Confidence Intervals

Confidence intervals provide a range within which the true performance is likely to fall. A 95% confidence interval, for example, means that if the experiment were repeated many times, approximately 95% of the resulting intervals would contain the true performance value [8].

Two common methods for computing confidence intervals on test set results are:

| Method | Description | When to Use |
|---|---|---|
| Normal approximation | Uses the formula $$\text{accuracy} \pm 1.96 \sqrt{\text{accuracy} \times (1 - \text{accuracy}) / n}$$, where n is the number of test examples | Large test sets where accuracy is not close to 0 or 1 |
| Bootstrap | Resamples the test set thousands of times and computes metrics on each sample | Any test set size; does not assume normality |

Reporting confidence intervals is especially important when comparing models, because small differences in test set accuracy may not be statistically significant [8].
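Both methods in the table can be sketched in a few lines. The example assumes per-example correctness is recorded as a list of 0/1 values, uses the simple percentile bootstrap (one of several bootstrap variants), and relies on invented data. The function names are illustrative.

```python
import math
import random

def normal_approx_ci(correct, z=1.96):
    """95% interval from the normal approximation to the binomial."""
    n = len(correct)
    acc = sum(correct) / n
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half_width, acc + half_width

def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample the test set with replacement many times."""
    rng = random.Random(seed)
    n = len(correct)
    accs = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = accs[int((alpha / 2) * n_resamples)]
    upper = accs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical results: 450 of 500 test predictions were correct.
correct = [1] * 450 + [0] * 50
print(normal_approx_ci(correct))
print(bootstrap_ci(correct))
```

On a test set of this size the two intervals are typically close. They tend to diverge for small test sets or for accuracies near 0 or 1, where the normal approximation is less reliable.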

### Performance Metrics on the Test Set

The performance of a machine learning model on the test set is measured using metrics appropriate to the task:

| Task Type | Common Metrics |
|---|---|
| Classification | [Accuracy](/wiki/accuracy), [precision](/wiki/precision), [recall](/wiki/recall), [F1 score](/wiki/f1_score), AUC-ROC |
| Regression | [Mean squared error](/wiki/mean_absolute_error_mae) (MSE), [mean absolute error](/wiki/mean_absolute_error_mae) (MAE), R-squared |
| Ranking | NDCG, MAP, MRR |
| Generation | [BLEU](/wiki/bleu), [ROUGE](/wiki/rouge_score), perplexity |
| Clustering | Adjusted Rand index, silhouette score, mutual information |

It is good practice to report multiple metrics rather than a single number, as different metrics capture different aspects of model behavior. For instance, in an imbalanced classification task, accuracy alone can be misleading if the model achieves high accuracy by simply predicting the majority class.
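The majority-class problem is easy to reproduce. In this sketch a trivial classifier predicts "negative" for every example in an invented test set of 95 negatives and 5 positives. The metric definitions are the standard ones, with undefined ratios reported as 0.0 by convention here.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0  # 0.0 when nothing is predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [0] * 95 + [1] * 5   # invented imbalanced test set
y_pred = [0] * 100            # always predict the majority class
print(classification_metrics(y_true, y_pred))
# {'accuracy': 0.95, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
```

Accuracy looks excellent at 95%, while recall and F1 show that the classifier never finds a single positive case.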

### Alternatives to a Fixed Test Set

In some scenarios, maintaining a separate, fixed test set is impractical or insufficient. Common alternatives include:

- **[Cross-validation](/wiki/cross-validation).** The dataset is split into k folds, and each fold serves as the test set once while the remaining folds are used for training. This approach is especially valuable for small datasets, as every data point gets used for both training and evaluation.
- **Bootstrapping.** Random samples with replacement are drawn from the dataset, the model is trained on each resample, and it is evaluated on the out-of-bag (unsampled) data points. Repeating this many times provides both a performance estimate and a confidence interval.
- **Rolling/expanding window evaluation.** For time-series data, the training window grows or slides forward over time, and the model is tested on the next time period. This preserves the temporal ordering of data.
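The index bookkeeping behind k-fold cross-validation and expanding-window evaluation can be sketched as two small generators. The function names and parameters are assumptions for this example. The k-fold version does not shuffle, so non-temporal data would normally be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def expanding_window_indices(n_samples, initial_train, horizon):
    """Yield (train_idx, test_idx) pairs where the test block always follows the training block."""
    end = initial_train
    while end + horizon <= n_samples:
        yield list(range(end)), list(range(end, end + horizon))
        end += horizon

for train_idx, test_idx in expanding_window_indices(10, initial_train=4, horizon=2):
    print(len(train_idx), test_idx)
# 4 [4, 5]
# 6 [6, 7]
# 8 [8, 9]
```

In the expanding-window case every test index is larger than every training index, which is what keeps future information out of training.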

## Explain Like I'm 5 (ELI5)

Imagine you are studying for a big test at school. You have a workbook full of practice problems that you use to learn and study. These practice problems are like the **training set**.

Your teacher also gives you a practice quiz to help you figure out what you still need to study. That practice quiz is like the **validation set**. You can look at your mistakes on the practice quiz and study more.

Finally, on test day, you get a real test with questions you have never seen before. That real test is the **test set**. It shows whether you actually learned the material or whether you just memorized the answers to the practice problems.

If someone gave you the real test questions ahead of time, your score would not mean anything because you could just memorize the answers. That is why the test set has to be kept secret until the very end.

## References

1. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning*. Springer. Chapter 7: Model Assessment and Selection.
2. Raschka, S. (2018). "Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning." *arXiv:1811.12808*.
3. Recht, B., Roelofs, R., Schmidt, L., & Shankar, V. (2019). "Do ImageNet Classifiers Generalize to ImageNet?" *Proceedings of the 36th International Conference on Machine Learning (ICML)*. arXiv:1902.10811.
4. Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth, A. (2015). "Generalization in Adaptive Data Analysis and Holdout Reuse." *Advances in Neural Information Processing Systems (NeurIPS)*.
5. Xu, R., Li, Y., Luo, C., & Xu, Z. (2024). "Benchmark Data Contamination of Large Language Models: A Survey." *arXiv:2406.04244*.
6. White, C., Dooley, S., Roberts, M., Pal, A., Feber, B., & Jain, S. (2024). "LiveBench: A Challenging, Contamination-Free LLM Benchmark." *arXiv:2406.19314*.
7. Brownlee, J. (2020). "What is the Difference Between Test and Validation Datasets?" *Machine Learning Mastery*. https://machinelearningmastery.com/difference-test-validation-datasets/
8. Raschka, S. (2022). "Creating Confidence Intervals for Machine Learning Classifiers." https://sebastianraschka.com/blog/2022/confidence-intervals-for-ml.html
9. Northcutt, C. G., Athalye, A., & Mueller, J. (2021). "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks." *Advances in Neural Information Processing Systems (NeurIPS)*. arXiv:2103.14749.
10. Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2019). "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems." *Advances in Neural Information Processing Systems (NeurIPS)*.