Validation Set
Last reviewed
May 9, 2026
Sources
20 citations
Review status
Source-backed
Revision
v4 · 6,481 words
A validation set (also called a development set or dev set) is a subset of labeled data that is held out from the training set and used to evaluate a model's performance during development. It plays a central role in hyperparameter tuning, model selection, and early stopping, acting as a proxy for how well the model will generalize to unseen data without touching the final test set.
The validation set is one of the three canonical partitions of labeled data in supervised learning, alongside the training set and the test set. While the training set teaches the model and the test set delivers a final unbiased verdict, the validation set is where day-to-day modeling decisions are made: which architecture to keep, when to halt training, and which hyperparameters yield the best generalization. The careful, disciplined use of a validation set is often the difference between a model that performs well in the lab and one that survives contact with real-world data.
In supervised learning, the available labeled data is typically divided into two or three non-overlapping partitions: a training set, a validation set, and a test set. The training set is used to fit the model's internal parameters (such as weights in a neural network), while the validation set provides an independent evaluation of the model's fit during training. The test set is reserved for the final, unbiased assessment of the finished model.
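The validation set serves several key purposes: tuning hyperparameters, selecting among candidate models, deciding when to stop training, and revealing overfitting through the gap between training and validation performance.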
The word "validation" can be misleading because it suggests a single act of confirmation. In practice, the validation set is consulted hundreds or thousands of times during a project. That is precisely why it is kept separate from the test set: any data used repeatedly to make decisions becomes contaminated by selection bias.
One of the most common sources of confusion in machine learning is the distinction between the validation set and the test set. Although both are used to evaluate a model on data it was not trained on, their roles in the workflow are fundamentally different.
| Aspect | Validation Set | Test Set |
|---|---|---|
| When used | During model development (repeatedly) | After all development is complete (ideally once) |
| Primary purpose | Tune hyperparameters and select models | Provide an unbiased final performance estimate |
| Influence on the model | Indirectly shapes the model through selection decisions | Should have zero influence on the model |
| Allowed to look at results | Yes, results guide further development | Looking at results and then changing the model introduces bias |
| Typical usage frequency | Many times throughout training | Once, at the very end |
| Recommended size | Large enough to discriminate between models | Large enough to give tight confidence intervals on the final metric |
| Public visibility | Often shared with collaborators | Sometimes sealed by an external party (Kaggle private leaderboard, NIST holdouts) |
The critical rule is that the test set should never be used to make any training or design decision. If test set performance is used to adjust hyperparameters or select among candidate models, the test set effectively becomes a second validation set, and the reported performance will be optimistically biased [1].
In academic benchmarks, this discipline is often enforced by infrastructure rather than honor. Kaggle competitions split the test set into a public leaderboard portion (visible during the competition) and a private leaderboard portion (revealed only at the end). Submissions ranked on the public board can suffer from leaderboard overfitting, while final standings on the private board reveal which teams generalized. The GLUE and SuperGLUE language understanding benchmarks similarly hide test labels behind a submission server.
The simplest approach is to randomly split the available data into a training portion and a validation portion before training begins. This is called hold-out validation (or a simple train/validation split). Common split ratios include:
| Dataset Size | Typical Train : Validation : Test Split |
|---|---|
| Small (hundreds to low thousands) | 70 : 15 : 15 or 80 : 10 : 10 |
| Medium (tens of thousands) | 80 : 10 : 10 |
| Large (hundreds of thousands or more) | 90 : 5 : 5 or even 98 : 1 : 1 |
With very large datasets, even 1% of the data can contain tens of thousands of examples, which is more than enough to estimate performance reliably. For smaller datasets, devoting a larger fraction to validation ensures that the performance estimate is stable [2].
When splitting data, practitioners should ensure that the validation set follows the same probability distribution as the training set. For classification problems with imbalanced classes, stratified splitting preserves the original class proportions in both the training and validation partitions. For grouped data (such as multiple measurements from the same patient or multiple frames from the same video), group-aware splitting ensures that all samples from a single group appear in only one partition.
In scikit-learn, the function train_test_split performs hold-out splitting and accepts a stratify argument that takes the class labels and ensures proportional representation. The companion splitter class GroupShuffleSplit takes group labels through the groups argument of its split method and guarantees that no group is split across partitions.
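A minimal sketch of both kinds of split, assuming scikit-learn and NumPy are installed. The toy data (200 rows, 10% positives, 40 groups of 5 rows) is invented purely for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)

# Toy data (illustrative only): 200 rows, 20 positives, 40 groups of 5 rows each.
X = rng.normal(size=(200, 3))
y = np.array([1] * 20 + [0] * 180)
groups = np.repeat(np.arange(40), 5)

# Stratified hold-out: the 10% positive rate is preserved in both partitions.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(y_train.mean(), y_val.mean())

# Group-aware hold-out: every group lands entirely in one partition.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(X, y, groups=groups))
shared_groups = set(groups[train_idx]) & set(groups[val_idx])
print(shared_groups)  # empty set: no group spans both partitions
```

Note that the two calls solve different problems: stratify balances class proportions, while the groups argument prevents related rows from straddling the split. When both constraints matter, a splitter that combines them is the safer choice.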
In k-fold cross-validation, the training data is divided into k folds of approximately equal size. The model is trained k times, each time using a different fold as the validation set and the remaining k minus 1 folds as the training set. The final performance metric is the average across all k runs.
This approach is especially valuable when data is limited, because every sample serves as part of the validation set exactly once and as part of the training set k minus 1 times. Common choices are k = 5 or k = 10. Stratified k-fold ensures that each fold preserves the class distribution of the full dataset [3].
Cross-validation is implemented in scikit-learn through cross_val_score and cross_validate, with splitter classes such as KFold, StratifiedKFold, GroupKFold, and TimeSeriesSplit defining how folds are constructed. Each splitter respects a different invariant, so choosing the correct splitter for the data structure is critical.
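As a rough illustration, the sketch below runs stratified 5-fold cross-validation and checks that every sample is validated exactly once. The synthetic dataset and the choice of logistic regression are assumptions made only for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced data (illustrative only).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count how many times each sample is used for validation.
val_counts = np.zeros(len(y), dtype=int)
for _, val_idx in cv.split(X, y):
    val_counts[val_idx] += 1

# One accuracy score per fold; the mean and spread summarize the estimate.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```

Passing an explicit splitter object (rather than an integer) makes the fold construction visible and reproducible.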
| Factor | Hold-Out Validation | K-Fold Cross-Validation |
|---|---|---|
| Computational cost | Low (train once) | Higher (train k times) |
| Data efficiency | Lower (part of data is never used for training) | Higher (all data used for both training and validation) |
| Variance of estimate | Higher (depends on which samples land in which split) | Lower (averages over k different splits) |
| Bias of estimate | Can be higher with small data | Generally lower |
| Best suited for | Large datasets where a single split is representative | Small to medium datasets |
| Communicates uncertainty | Single number with no built-in error bar | Mean and standard deviation across folds |
For deep learning models that are expensive to train, hold-out validation is often the pragmatic choice, since training a large neural network k times can be prohibitively slow. For classical machine learning algorithms with faster training times, k-fold cross-validation is the standard practice [4].
While basic k-fold is the workhorse, the right cross-validation scheme depends on the structure of the data. Choosing the wrong scheme is one of the most common ways to produce overly optimistic validation scores.
| Variant | How it works | Best for | Implementation |
|---|---|---|---|
| K-fold | Split into k equal folds; each fold serves once as validation | i.i.d. data, regression, balanced classification | KFold |
| Stratified k-fold | Preserves class proportions in each fold | Classification with class imbalance | StratifiedKFold |
| Group k-fold | Ensures samples from the same group never span fold boundaries | Patient records, user sessions, document sentences | GroupKFold |
| Stratified group k-fold | Combines stratification with group constraints | Imbalanced classification with grouped samples | StratifiedGroupKFold |
| Time-series split | Forward chaining: training fold always precedes validation fold | Forecasting, sequential data | TimeSeriesSplit |
| Repeated k-fold | Runs k-fold multiple times with different shuffles | Reducing variance of the estimate | RepeatedKFold, RepeatedStratifiedKFold |
| Leave-one-out (LOOCV) | k = n; each sample takes its turn as a single-element validation set | Very small datasets (n < 50) | LeaveOneOut |
| Leave-p-out | Each combination of p samples serves once as the validation set | Theoretical analysis; rarely practical for p > 1 | LeavePOut |
| Leave-one-group-out | Each unique group is held out as a validation fold | Multi-site studies, federated data | LeaveOneGroupOut |
| Shuffle split | Random subsets repeated for a fixed number of iterations | Quick estimation when k-fold is overkill | ShuffleSplit |
| Nested cross-validation | Inner loop tunes hyperparameters, outer loop estimates generalization | Reporting unbiased results when tuning is part of the pipeline | Manual composition of CV objects |
| Blocked cross-validation | Contiguous chunks of time or space form folds with gaps between them | Spatial data, autocorrelated time series | Custom splitters |
For time series, the standard rule is that any validation point must come strictly after every training point. Randomly shuffling a time series silently leaks the future into the past and can inflate validation metrics far beyond what the model achieves on live data. The TimeSeriesSplit object implements expanding-window forward chaining: fold 1 trains on [1..n] and validates on [n+1..2n], fold 2 trains on [1..2n] and validates on [2n+1..3n], and so on. Some forecasting workflows further insert a gap between training and validation to prevent leakage from short-range autocorrelation.
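The forward-chaining rule can be inspected directly by printing the indices that TimeSeriesSplit produces. This is a small sketch on 12 toy observations; the gap of one step is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations (toy data)

# gap=1 drops one observation between each training window and its validation window.
tscv = TimeSeriesSplit(n_splits=3, gap=1)
folds = [(train.tolist(), val.tolist()) for train, val in tscv.split(X)]

for train, val in folds:
    # Every training index precedes every validation index.
    print("train:", train, "validate:", val)
```

The training window grows from fold to fold while the validation window always lies in the future relative to it.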
When hyperparameter tuning and final evaluation both use the same cross-validation procedure, the reported score is biased upward because the chosen hyperparameters were selected to maximize that very score. Nested cross-validation addresses this by separating the two responsibilities into two nested loops. The inner loop performs hyperparameter search on each outer training fold, and the outer loop computes a held-out score using a fold the inner loop never saw. The outer scores are then averaged.
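For a typical 5x5 nested scheme with grid search over 100 configurations, the cost adds up quickly: each of the 5 outer folds runs an inner search of 100 configurations × 5 inner folds = 500 fits, for 2,500 inner fits in total, plus 5 refits of the winning configuration on the outer training folds. That is roughly 25 times the cost of evaluating the same grid on a single hold-out split.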
Nested cross-validation is the gold standard for academic publications that compare algorithms, because non-nested estimates can lure a researcher into overestimating generalization performance, particularly when the inner-fold scores have large standard deviations and the maximum is taken across many candidate configurations [5].
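In scikit-learn, one common way to compose the two loops is to pass a search object as the estimator of an outer cross-validation call. The sketch below assumes a small synthetic dataset, an SVC model, and a deliberately tiny grid so that it runs quickly; none of these choices come from the cited work.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # tunes hyperparameters
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # estimates generalization

# Inner loop: grid search that sees only the outer training fold.
search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
    cv=inner_cv,
)

# Outer loop: each score comes from a fold the inner search never touched.
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print(nested_scores.mean(), nested_scores.std())
```

The reported number is the mean of the outer scores. The hyperparameters chosen may differ from one outer fold to the next, which is expected: the procedure evaluates the tuning process as a whole, not one fixed configuration.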
Early stopping is a regularization technique that uses the validation set to determine when to halt training. During each epoch of training, the model's loss (or another metric such as accuracy) is computed on both the training set and the validation set.
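In a typical training run, the validation loss falls along with the training loss at first, then flattens or starts to rise as the model begins to overfit. Training is halted once the validation metric has failed to improve for a fixed number of consecutive epochs (the patience), and the weights from the best validation epoch are kept.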
Typical patience values range from 3 to 10 epochs, depending on the dataset and model complexity. Early stopping is widely used in training neural networks because it is simple to implement and effective at preventing overfitting without requiring manual tuning of the number of training epochs [6].
Prechelt's classic 1998 study cataloged several variants of the early stopping criterion, ranging from a simple "stop when the validation error has not improved for p epochs" rule to more elaborate definitions based on the generalization-to-progress quotient (GL/Pk). The simplest patience-based scheme is the most widely used in practice because of its robustness and ease of implementation [6].
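The patience-based rule is short enough to sketch in plain Python. The function name, the min_delta parameter, and the loss values below are illustrative assumptions, not taken from Prechelt's paper or from any framework's API.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a patience-based rule (epochs are 0-indexed)."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best validation loss: remember it (a real loop would checkpoint here).
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop.
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: improvement until epoch 3, then a slow rise.
val_losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.57, 0.60, 0.65]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```

Training stops at epoch 6, but the model that should be kept is the checkpoint from epoch 3, which is why early stopping is usually paired with saving the best weights.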
Plotting training loss and validation loss across epochs produces learning curves that provide diagnostic information about the model's behavior.
| Curve Pattern | Training Loss | Validation Loss | Diagnosis |
|---|---|---|---|
| Good fit | Decreases and stabilizes | Decreases and stabilizes close to training loss | The model generalizes well |
| Overfitting | Continues to decrease | Decreases then increases (diverges from training loss) | The model memorizes training data |
| Underfitting | Remains high | Remains high, mirrors training loss | The model is too simple or needs more training |
| Oscillating loss | Fluctuates erratically | Fluctuates erratically | Learning rate may be too high or data may have quality issues |
| Validation lower than training | Higher than expected | Below training loss | Distribution mismatch, label noise in training, or dropout inflating training loss |
| Sudden spike in validation | Smooth | Sudden jump | Catastrophic step due to learning rate instability or bad batch |
The gap between the training loss and the validation loss is sometimes called the generalization gap. A small generalization gap indicates that the model's performance on the training data is a good predictor of its performance on new data. A large and growing gap is a classic indicator of overfitting [7].
A related diagnostic is the learning curve plotted against training set size rather than epoch number. By training the same model on increasing fractions of the data and recording validation error, the learning curve reveals whether the model is data-limited (validation error is still falling as more data is added) or capacity-limited (validation error has plateaued and adding data will not help). This guides whether to invest in more labels or in a larger model.
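scikit-learn's learning_curve helper computes this size-based curve. The sketch below uses synthetic data and logistic regression as stand-ins; the training sizes are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# With cv=5 each training fold has 400 rows, so 400 is the largest usable size.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=[40, 120, 240, 400],
    cv=5,
    shuffle=True,
    random_state=0,
)

# Rows are training sizes, columns are folds; average across folds for each size.
for n, train_acc, val_acc in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={train_acc:.3f}  validation={val_acc:.3f}")
```

If the validation column is still rising at the largest size, more data is likely to help; if it has flattened while the training score stays close to it, extra capacity or better features are the more promising investment.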
Hyperparameters are settings that are not learned during training but must be specified before training begins. Examples include the learning rate, the number of hidden layers, dropout rate, batch size, and regularization strength.
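The standard workflow for hyperparameter tuning is:
1. Define the candidate hyperparameter configurations (the search space).
2. Train a model on the training set for each configuration.
3. Evaluate each trained model on the validation set.
4. Select the configuration with the best validation performance.
5. Optionally refit the chosen configuration on the combined training and validation data, then evaluate it once on the test set.
When using cross-validation for hyperparameter tuning, steps 2 and 3 are repeated for each fold, and the average validation performance across folds is used to compare configurations. This gives a more robust estimate of each configuration's quality, especially with limited data [8].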
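A hold-out version of the tuning workflow can be sketched in a few lines. The dataset, the model, and the candidate values for the regularization strength C are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 700 / 150 / 150 split: carve off the test set first, then the validation set.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=150, stratify=y, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=150, stratify=y_dev, random_state=0
)

# Train one model per candidate and score it on the validation set only.
candidates = [0.01, 0.1, 1.0, 10.0]
val_scores = {}
for C in candidates:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_scores[C] = model.score(X_val, y_val)

best_C = max(val_scores, key=val_scores.get)

# Refit the winner on train + validation, then touch the test set exactly once.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_dev, y_dev)
test_accuracy = final_model.score(X_test, y_test)
print(best_C, round(test_accuracy, 3))
```

The test set appears only in the last two lines, after every decision has already been made.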
Validation performance is the objective that hyperparameter search algorithms optimize. The choice of search method affects how quickly a good configuration is found and how thoroughly the space is explored.
| Method | How it works | Strengths | Weaknesses | Typical tools |
|---|---|---|---|---|
| Grid search | Exhaustively evaluates every combination on a predefined grid | Simple, reproducible, embarrassingly parallel | Combinatorial explosion in high dimensions; wastes effort on unimportant axes | scikit-learn GridSearchCV |
| Random search | Samples configurations uniformly from the search space | Often beats grid search for the same compute budget; trivially parallel | No memory of past trials | scikit-learn RandomizedSearchCV |
| Bayesian optimization | Builds a surrogate model of validation performance and chooses next trial by an acquisition function (Expected Improvement, UCB) | Sample-efficient; uses past trials to guide future ones | Hard to parallelize naively; surrogate cost grows with trials | scikit-optimize, GPyOpt, BoTorch |
| Tree-structured Parzen Estimator (TPE) | Models the densities of good and bad configurations separately | Handles conditional and discrete spaces well | Can underperform Gaussian processes on smooth low-dim spaces | Optuna, Hyperopt |
| Hyperband | Successive halving across multiple bracket widths; cheaply rules out bad trials | Anytime algorithm; strong results on deep learning | Requires a meaningful budget knob (epochs, data fraction) | Ray Tune, Optuna |
| BOHB | Combines Bayesian sampling with Hyperband's resource allocation | Sample-efficient and budget-aware | More complex to implement and debug | HpBandSter, Ray Tune |
| Halving search (SH) | Trains all candidates on a small budget, halves them by validation score, doubles budget, repeats | Fast under tight compute | Brittle to noisy validation scores at small budgets | scikit-learn HalvingGridSearchCV, HalvingRandomSearchCV |
| Population-based training (PBT) | Evolves a population of models; periodically replaces underperformers and perturbs hyperparameters | Adapts hyperparameter schedules during training | Stochastic, hard to reproduce exactly | DeepMind's PBT, Ray Tune PBT |
| Evolutionary search | Mutates and recombines configurations across generations | No gradient or surrogate needed | Can be sample-inefficient | DEAP, NSGA-II implementations |
The theoretical insight behind random search is that hyperparameter response surfaces typically have low effective dimensionality: only a few hyperparameters meaningfully affect performance, and grid search wastes most of its evaluations along unimportant axes. Bergstra and Bengio (2012) showed that random search finds equally good or better configurations than grid search using the same budget, especially as the dimensionality of the space grows [9].
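The low-effective-dimensionality argument can be made concrete by counting how many distinct values of one hyperparameter each strategy tries under the same budget. In this sketch the learning rate is assumed to be the axis that matters and momentum the one that does not; both the axes and the ranges are illustrative.

```python
import itertools
import random

random.seed(0)
budget = 9

# Grid search: 3 values per axis gives 9 trials but only 3 distinct learning rates.
grid_lr = [1e-4, 1e-3, 1e-2]
grid_momentum = [0.0, 0.5, 0.9]
grid_trials = list(itertools.product(grid_lr, grid_momentum))

# Random search: each of the 9 trials draws a fresh log-uniform learning rate.
random_trials = [
    (10 ** random.uniform(-4, -2), random.uniform(0.0, 0.9)) for _ in range(budget)
]

distinct_grid_lr = len({lr for lr, _ in grid_trials})
distinct_random_lr = len({lr for lr, _ in random_trials})
print(distinct_grid_lr, distinct_random_lr)  # 3 9
```

With the same nine trials, random search probes three times as many settings of the hyperparameter assumed to matter, which is the intuition behind the Bergstra and Bengio result.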
Hyperband (Li et al., 2017) reformulates hyperparameter search as a non-stochastic best-arm identification problem and uses a principled early-stopping schedule. It calls successive halving as a subroutine: start with a large set of configurations and a small budget, evaluate all of them, then keep only the top fraction and increase the budget of the survivors (basic successive halving keeps the top half and doubles the budget; Hyperband generalizes this to keeping the top 1/η and multiplying the budget by η). Hyperband sweeps over different initial pool sizes to balance exploration and exploitation. In Li et al.'s benchmarks, Hyperband achieved more than an order-of-magnitude speedup over standard Bayesian optimization on several deep learning workloads [10].
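The successive halving subroutine can be sketched in plain Python. The scoring function here is a hypothetical, noise-free stand-in for "train for a given budget and return validation accuracy"; real validation scores are noisy, which is exactly what makes small budgets risky. This is a simplified sketch, not the full Hyperband bracket schedule.

```python
import math


def toy_validation_score(lr, budget):
    # Hypothetical stand-in for training with learning rate `lr` for `budget` epochs.
    # The score peaks at lr = 1e-3 and improves as the budget grows.
    peak = 1.0 - (math.log10(lr) + 3.0) ** 2 / 10.0
    return peak * (1.0 - math.exp(-budget))


def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Repeatedly keep the best 1/eta of the configs and multiply the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    history = []  # (budget, number of configs evaluated) for each round
    while len(survivors) > 1:
        history.append((budget, len(survivors)))
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0], history


learning_rates = [10.0 ** e for e in (-5.0, -4.5, -4.0, -3.5, -3.0, -2.5, -2.0, -1.5)]
best_lr, history = successive_halving(learning_rates, toy_validation_score)
print(best_lr)
print(history)  # [(1, 8), (2, 4), (4, 2)]
```

Eight candidates are cut to four, then two, then one, while the per-candidate budget grows from 1 to 4, so most of the compute goes to the configurations that survived the cheap early rounds.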
BOHB (Falkner et al., 2018) replaces Hyperband's random configuration sampling with a TPE-based model, retaining Hyperband's resource allocation while gaining the sample efficiency of Bayesian optimization. BOHB is a popular search strategy in AutoML systems because it combines fast early-stage progress with strong asymptotic performance [11].
| API | Purpose | Notes |
|---|---|---|
| train_test_split | Single train/validation split | Supports stratify, shuffle, random_state |
| cross_val_score | Average a metric across CV folds | Returns one number per fold |
| cross_validate | Same as above with multi-metric scoring and timing | Returns a dict of arrays |
| cross_val_predict | Out-of-fold predictions for every sample | Useful for stacking and calibration |
| GridSearchCV | Exhaustive grid search with CV | refit=True retrains on all data with best params |
| RandomizedSearchCV | Random search with CV | Specify n_iter and parameter distributions |
| HalvingGridSearchCV | Successive halving over a grid | Experimental; faster than GridSearchCV for many candidates |
| HalvingRandomSearchCV | Successive halving over random samples | Combines speed of halving with breadth of random sampling |
| validation_curve | Sweep one parameter, plot training vs. validation score | Diagnostic for overfitting/underfitting on a single axis |
| learning_curve | Sweep training set size | Diagnostic for data-limited vs. capacity-limited regimes |
For distributed and large-scale tuning, dedicated libraries pick up where scikit-learn ends. Optuna offers a define-by-run API that lets the search space depend on previous samples, plus pruners that stop unpromising trials early. Ray Tune scales tuning across clusters and integrates schedulers like ASHA, PBT, and Hyperband. Weights & Biases Sweeps and MLflow add experiment tracking and visualization. AutoML systems such as Auto-sklearn, AutoGluon, H2O.ai, and Google Vertex AI fully automate algorithm and hyperparameter selection on the validation set, returning a tuned pipeline without manual intervention [12].
Although the validation set is not used to train the model's parameters directly, repeated evaluation on the same validation set can lead to a subtler form of overfitting. When practitioners run many experiments, choosing hyperparameters and model designs based on validation performance, information from the validation set gradually leaks into the modeling decisions. Over time, the selected model may be tuned to perform well specifically on the validation set rather than on truly unseen data.
This problem is known as validation set overfitting or adaptive overfitting. Signs include:
Strategies to mitigate validation set leakage include:
Another common source of leakage is preprocessing leakage: fitting data transformations (such as normalization, feature engineering, or imputation) on the entire dataset before splitting it into training and validation sets. Preprocessing steps should be fitted only on the training data and then applied to the validation and test sets [13].
| Leakage source | Example | Fix |
|---|---|---|
| Scaling on the full dataset | StandardScaler().fit(X) before split | Fit scaler on training fold only, transform validation fold |
| Imputation with global statistics | Filling NaN with the mean of the entire dataset | Compute mean on training fold only |
| Target encoding leakage | Using target statistics computed from the validation rows | Use out-of-fold target encoding |
| Feature selection on full data | Picking features by their correlation with the target on the full dataset | Run feature selection inside each CV fold |
| Oversampling before split | Applying SMOTE before splitting | Apply oversampling only inside the training fold |
| Lookahead in time series | Including future observations in moving-average features | Use only past data, with a sufficient lag |
| Group leakage | Same patient in train and validation | Use group-aware splitters such as GroupKFold |
| Duplicate or near-duplicate rows | Same image present multiple times | Deduplicate before splitting |
| Train/validation contamination via embeddings | Embeddings pretrained on data that overlaps validation | Track and disclose the pretraining corpus |
The scikit-learn Pipeline object exists in part to make leakage-free preprocessing easier: any transformer added to a Pipeline is fit only on the training fold during cross-validation, eliminating an entire category of bugs.
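A minimal sketch of the leak-free pattern, assuming a synthetic dataset and a scaler-plus-logistic-regression pipeline chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Leaky pattern (avoid): calling StandardScaler().fit(X) on all rows before
# cross-validation lets validation-fold statistics influence the features.

# Leak-free pattern: the scaler is part of the estimator, so cross_val_score
# refits it on each training fold and only transforms the validation fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```

The same pipeline object can be handed to GridSearchCV or RandomizedSearchCV, in which case the preprocessing is also refit inside every search fold.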
Per-group leakage is especially severe in medical imaging, where multiple slices, scans, or visits commonly come from the same patient. If the same patient appears in both the training and validation sets, the model can memorize patient-specific features (anatomy, scanner artifacts, demographics) and produce dramatically inflated metrics. A 2021 study on brain MRI classification reported that slice-wise random splitting boosted apparent slice-level accuracy by 30% on OASIS, 29% on ADNI, 48% on PPMI, and 55% on a local Parkinson's dataset compared with patient-wise splitting [14]. The accepted practice is to split at the patient level, not the slice or visit level, and to verify the split by checking (for example, by comparing hashed patient identifiers) that no patient appears in more than one partition.
Similar group-leakage failures occur in user-level recommender systems (same user split across folds), document-level NLP tasks (same document split across folds), and connectome-based neuroimaging studies (same subject's connectivity matrix split across folds).
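A simple guard against group leakage is to compare group identifiers across partitions before training. The sketch below hashes the identifiers first so the check can be logged without exposing raw patient or user IDs; the function names and the sample IDs are hypothetical.

```python
import hashlib


def hashed_ids(group_ids):
    """Hash identifiers so splits can be compared without exposing raw IDs."""
    return {hashlib.sha256(str(gid).encode("utf-8")).hexdigest() for gid in group_ids}


def assert_no_group_overlap(train_ids, val_ids):
    """Raise ValueError if any group appears in both partitions."""
    overlap = hashed_ids(train_ids) & hashed_ids(val_ids)
    if overlap:
        raise ValueError(f"{len(overlap)} group(s) appear in both splits")


# Hypothetical patient identifiers, one entry per slice or visit.
train_patients = ["P001", "P001", "P002", "P003"]
val_patients = ["P004", "P005", "P005"]
assert_no_group_overlap(train_patients, val_patients)  # passes silently
```

Running this check as part of the data-loading code turns a silent source of inflated metrics into an immediate, visible failure.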
Validation sets are especially important when training deep learning models because neural networks have a large number of parameters and a strong capacity to memorize training data. Without a validation set, it is difficult to know when to stop training or which architecture works best.
In practice, training a neural network involves:
Modern deep learning frameworks such as PyTorch and TensorFlow provide built-in callbacks and utilities for monitoring validation metrics during training. For example, PyTorch Lightning's EarlyStopping callback and Keras's ModelCheckpoint callback automate the process of tracking the validation loss and saving the best model. Training scripts based on Hugging Face's transformers.Trainer are configured similarly: the eval_strategy argument of TrainingArguments (named evaluation_strategy in older releases) controls how often the trainer runs the validation loop.
For large pretrained models, validation takes on a slightly different shape. During pretraining, a held-out portion of the corpus is used to compute validation perplexity (or equivalently bits-per-byte or cross-entropy on next-token prediction). The Chinchilla scaling laws (Hoffmann et al., 2022) used this validation loss as the objective for compute-optimal scaling, sweeping model and data size combinations under a fixed compute budget to find configurations that minimize held-out loss [15]. The validation set in this setting is large (tens of millions of tokens) and chosen to be representative of the pretraining distribution, with separate "validation" and "test" splits maintained from the start.
Downstream evaluation, on the other hand, uses curated benchmark suites rather than perplexity. LM-Eval-Harness (EleutherAI) provides a unified interface for over 200 tasks including MMLU, HellaSwag, ARC, GSM8K, and BBH, supporting multiple-choice, generation, and likelihood-based scoring. HELM (Holistic Evaluation of Language Models, Liang et al., 2022) measures seven properties (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) across 16 core scenarios. BIG-bench offers more than 200 tasks contributed by hundreds of researchers, while MT-Bench, AlpacaEval, and Chatbot Arena focus on instruction-following and conversational quality [16].
A recurring pitfall in modern LLM evaluation is that benchmark validation sets often leak into the pretraining corpus, especially as web-scale crawls grow. Practitioners mitigate this with decontamination procedures: hashing benchmark text and removing matches from the training corpus, or using held-out splits constructed after the model's data cutoff. Even so, comparing scores across reports requires fixing the harness, prompt templates, and scoring rules. A common mistake is to compare model A's lm-evaluation-harness score against model B's HELM score: the harnesses use different prompt templates, few-shot examples, and aggregation rules, so the numbers are not directly comparable.
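The hashing approach to decontamination can be sketched as an n-gram overlap filter. Whitespace tokenization, lowercasing, and the n-gram length are simplifying assumptions for this example; production pipelines typically use more careful normalization and longer n-grams.

```python
import hashlib


def ngram_hashes(text, n):
    """Hash every run of n consecutive lowercase whitespace tokens in the text."""
    tokens = text.lower().split()
    return {
        hashlib.sha1(" ".join(tokens[i : i + n]).encode("utf-8")).hexdigest()
        for i in range(len(tokens) - n + 1)
    }


def decontaminate(corpus_docs, benchmark_texts, n=5):
    """Drop any corpus document that shares an n-gram with a benchmark item."""
    banned = set()
    for text in benchmark_texts:
        banned |= ngram_hashes(text, n)
    return [doc for doc in corpus_docs if not (ngram_hashes(doc, n) & banned)]


# Toy benchmark item and toy corpus (illustrative only).
benchmark = ["What is the capital of France? The capital is Paris."]
corpus = [
    "Quiz answers: what is the capital of France? The capital is Paris. Next question.",
    "The validation loss plateaued after the third epoch of training.",
]
clean_corpus = decontaminate(corpus, benchmark, n=5)
print(len(clean_corpus))  # 1
```

Only the second document survives, because the first contains the benchmark item verbatim apart from capitalization.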
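Choosing the right size for the validation set involves balancing two competing concerns: a larger validation set gives a more reliable performance estimate, but every example placed in it is one fewer available for training.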
General guidelines for sizing:
| Scenario | Recommendation |
|---|---|
| Small dataset (< 1,000 samples) | Use k-fold cross-validation instead of a fixed validation set |
| Medium dataset (1,000 to 100,000 samples) | 10% to 20% for validation |
| Large dataset (> 100,000 samples) | 1% to 10% for validation (even 1% may yield thousands of samples) |
| Very large dataset (millions of samples) | 1% or less is often sufficient |
The validation set should be large enough to detect meaningful differences between candidate models. If the expected improvement from a hyperparameter change is small (for example, a 0.1 percentage point increase in accuracy), the validation set needs to be large enough for that difference to be statistically significant [17].
A quick rule of thumb based on the binomial standard error: with 1,000 validation samples and a true accuracy near 80%, the 95% confidence interval has a half-width of approximately 2.5 percentage points. With 10,000 samples, the same interval shrinks to about 0.8 percentage points. Researchers comparing models that differ by less than 1 percentage point in accuracy should ensure their validation set holds tens of thousands of examples or use cross-validation with many folds to reduce variance.
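The rule of thumb follows from the normal approximation to the binomial, where the half-width is z times the square root of p(1 − p)/n. A short calculation reproduces the figures quoted above; the function name is illustrative.

```python
import math


def accuracy_ci_half_width(accuracy, n, z=1.96):
    """Half-width of the normal-approximation confidence interval for an accuracy."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


for n in (1_000, 10_000):
    half_width_pp = 100 * accuracy_ci_half_width(0.80, n)
    print(n, round(half_width_pp, 2))
# 1000 2.48
# 10000 0.78
```

Because the half-width shrinks with the square root of n, cutting the uncertainty in half requires four times as many validation examples.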
When two candidate models produce similar validation scores, a single comparison of point estimates can be misleading. Practitioners often apply paired statistical tests:
| Test | Use case |
|---|---|
| McNemar's test | Two classifiers evaluated on the same validation set |
| 5x2 cross-validation paired t-test (Dietterich) | Robust comparison of two algorithms with limited data |
| Wilcoxon signed-rank | Non-parametric pairwise comparison across folds |
| Bootstrap confidence intervals | Estimating uncertainty around a single metric |
| Friedman + Nemenyi | Comparing many algorithms across multiple datasets |
These tests answer the question: given the size and variability of the validation results, is the observed difference plausibly real or just noise?
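As one example, the exact (binomial) form of McNemar's test needs only the two discordant counts: cases where the first classifier is right and the second wrong, and the reverse. The sketch below is a plain-Python version with made-up predictions; statistical libraries offer equivalent routines, and the function names here are illustrative.

```python
from math import comb


def discordant_counts(preds_a, preds_b, labels):
    """Return (b, c): A right and B wrong, A wrong and B right."""
    b = sum(pa == y and pb != y for pa, pb, y in zip(preds_a, preds_b, labels))
    c = sum(pa != y and pb == y for pa, pb, y in zip(preds_a, preds_b, labels))
    return b, c


def mcnemar_exact_p(b, c):
    """Two-sided exact p-value under the null that both error rates are equal."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * tail)


# Made-up predictions on 10 validation examples, all with true label 1.
labels = [1] * 10
preds_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]
preds_b = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

b, c = discordant_counts(preds_a, preds_b, labels)
print(b, c, mcnemar_exact_p(b, c))  # 8 1 0.0390625
```

Examples that both models get right (or both get wrong) do not enter the test at all, which is why two models with very similar accuracy can still differ significantly if their errors rarely coincide.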
Beyond standard hold-out and k-fold approaches, several specialized validation methods address specific data characteristics:
For temporally ordered data, TimeSeriesSplit implements forward-chaining validation by using expanding training windows with forward-looking validation windows.
Even experienced practitioners run into recurring failure modes when working with validation sets. The following table summarizes the ones most often seen in production projects.
| Pitfall | Symptom | Fix |
|---|---|---|
| Random shuffle on time-series data | Validation metrics vastly better than live performance | Use TimeSeriesSplit or temporal hold-out |
| Same patient or user in both splits | Per-group memorization inflates scores | Use GroupKFold and verify with hashing |
| Preprocessing fit on full dataset | Subtle leakage that may go undetected | Wrap preprocessing inside a Pipeline and fit per fold |
| Stratification ignored | Rare classes missing from a fold | Use StratifiedKFold or StratifiedGroupKFold |
| Tuning on test set | Test scores cease to be unbiased | Reserve test set for final, single evaluation |
| Reporting cherry-picked seeds | Reproducibility fails for others | Report mean and standard deviation across seeds |
| Comparing models across different harnesses | Apples-to-oranges comparison | Pin the harness, prompts, and scoring rules |
| Forgetting to retrain on train + val | Final model uses less data than necessary | Refit with the chosen hyperparameters on the union |
| Validation set too small to discriminate | Noise dominates differences | Increase size or use cross-validation |
| Distribution drift between dev and prod | Lab metrics do not transfer | Refresh validation set from current production data |
Validation does not stop when a model ships. Production systems usually maintain a shadow validation set drawn from recent production traffic to detect concept drift, label shift, and feature pipeline regressions. Tools such as Evidently AI, Fiddler, and Arize compute distribution distances (PSI, KS, Wasserstein) and prediction-quality metrics on this rolling validation slice. ML platforms (MLflow, Kubeflow, SageMaker) version validation datasets alongside trained models so that rerunning the same evaluation tomorrow produces the same number. Tracking the SHA hash of the validation set is a small but important discipline: a metric without a versioned dataset is unreproducible.
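One lightweight way to version a validation set is to store a content hash next to every reported metric. The sketch below fingerprints a list of records with SHA-256; the canonical JSON serialization, the row sorting, and the sample rows are assumptions made for illustration, and large datasets would normally be hashed file by file instead.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """SHA-256 over canonical JSON lines; sorting makes the hash order-independent."""
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Hypothetical validation rows.
validation_rows = [
    {"id": 1, "text": "good product", "label": 1},
    {"id": 2, "text": "arrived broken", "label": 0},
]
fingerprint = dataset_fingerprint(validation_rows)
print(fingerprint[:12])  # short prefix to log alongside each reported metric
```

If a single label is corrected or a row is added, the fingerprint changes, which makes it obvious when two metrics were computed on different versions of the validation set.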
Imagine you are studying for a big test at school. You have a workbook full of practice problems. You use most of the problems to learn and practice (that is your training set). But you save a few problems that you do not look at while studying. After you think you have studied enough, you try those saved problems to see if you really understand the material (that is your validation set). If you get them wrong, you go back and study differently. You keep checking with those saved problems until you do well.
Then, on test day, the teacher gives you brand-new problems you have never seen before (that is the test set). Your score on those brand-new problems tells you how well you truly learned, not just how well you memorized the practice answers.
The validation set is like a practice quiz you give yourself before the real test. It helps you figure out the best way to study without spoiling the real test.
If you peek at the practice quiz too many times and only study what is on it, you might do great on the practice but fail the real test.