Cross-validation is a statistical technique used in machine learning to evaluate how well a predictive model generalizes to an independent dataset. Rather than relying on a single split of data into training and testing portions, cross-validation systematically partitions the data into multiple subsets, trains the model on some subsets, and validates it on the remaining ones. By averaging performance across these partitions, cross-validation provides a more reliable and less biased estimate of model performance than simple holdout methods.
Cross-validation serves two primary purposes: model assessment (estimating the generalization error of a final model) and model selection (choosing the best model or hyperparameter configuration among competing alternatives). It is one of the most widely used techniques in applied machine learning and statistics, appearing in virtually every modeling workflow from academic research to production systems.
Imagine you are studying for an exam using a set of 100 practice questions. If you always practice with questions 1 through 80 and then test yourself on questions 81 through 100, you might get lucky or unlucky depending on which questions end up in your test. Maybe your weak spots happen to not show up in those 20 questions, giving you a false sense of confidence.
Cross-validation is like taking five different practice exams, each time setting aside a different group of 20 questions as the test. First you test on questions 1 through 20, then 21 through 40, and so on. By averaging your scores across all five mini-exams, you get a much more honest picture of how prepared you really are. That is exactly what cross-validation does for machine learning models: it gives a fair, well-rounded estimate of how the model will perform on data it has never seen before.
The formalization of cross-validation as a statistical methodology dates to the early 1970s. Mervyn Stone published his seminal paper "Cross-Validatory Choice and Assessment of Statistical Predictions" in 1974 in the Journal of the Royal Statistical Society, Series B. In this work, Stone defined the leave-one-out cross-validation procedure and applied a generalized cross-validation criterion to problems in univariate estimation, linear regression, and analysis of variance [1]. Independently, Seymour Geisser introduced the "predictive sample reuse" method in 1975, published in the Journal of the American Statistical Association. Geisser's approach was broader in scope, allowing more general data splits beyond the leave-one-out case [2].
Mervyn Stone also demonstrated in 1977 that cross-validation is asymptotically equivalent to the Akaike Information Criterion (AIC) for model selection, establishing a deep theoretical connection between cross-validation and information-theoretic approaches [3]. In 1995, Ron Kohavi conducted a landmark empirical study comparing cross-validation and bootstrap methods for accuracy estimation. Based on extensive experiments with over half a million runs of C4.5 and Naive Bayes algorithms, Kohavi concluded that ten-fold stratified cross-validation offered the best tradeoff between bias and variance for model selection on real-world datasets [4].
The simplest form of model evaluation is the holdout method, which splits the dataset into two disjoint parts: a training set (typically 70 to 80 percent of the data) and a test set (the remaining 20 to 30 percent). The model learns from the training set and is evaluated on the test set.
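As a minimal illustration (the dataset, the 80/20 ratio, and the model choice here are arbitrary), a holdout evaluation with scikit-learn might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Reserve 20 percent of the data as a single held-out test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(f"Holdout accuracy: {clf.score(X_test, y_test):.3f}")
```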
While computationally efficient, the holdout method has significant limitations. The performance estimate depends heavily on which data points end up in each partition, leading to high variance in the results. It also uses data inefficiently, since a substantial portion is reserved solely for evaluation and never contributes to training. Cross-validation addresses both of these shortcomings by rotating the roles of training and testing data across multiple iterations.
The general cross-validation procedure works as follows:

1. Partition the dataset into several subsets (folds).
2. Train the model on all but one subset and validate it on the held-out subset.
3. Repeat the process, rotating which subset is held out, until every subset has served as the validation set exactly once.
4. Average the performance metrics across all iterations to obtain the final estimate.
This rotation ensures that every data point is used for both training and validation, maximizing data utilization while providing a robust performance estimate.
K-fold cross-validation is the most commonly used variant. The dataset is randomly shuffled and divided into k equal-sized (or nearly equal-sized) folds. The model is trained k times, each time using a different fold as the validation set and the remaining k - 1 folds as the training set. The final performance estimate is the average across all k iterations.
Common choices for k are 5 and 10. Kohavi's 1995 study recommended 10-fold cross-validation as offering a good balance between bias and variance for most practical applications [4]. With k = 5, each training set contains 80 percent of the data; with k = 10, each contains 90 percent.
Procedure for 5-fold cross-validation:

1. Shuffle the dataset and split it into five equal (or nearly equal) folds.
2. Train the model on folds 2 through 5 and validate it on fold 1.
3. Repeat, holding out fold 2, then fold 3, and so on, until each fold has served as the validation set exactly once.
4. Average the five validation scores to obtain the final performance estimate.
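A minimal sketch of this procedure using scikit-learn's KFold splitter (the dataset and model are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])        # train on the other four folds
    score = model.score(X[val_idx], y[val_idx])  # validate on the held-out fold
    scores.append(score)
    print(f"Fold {fold}: accuracy = {score:.3f}")

print(f"Mean accuracy: {np.mean(scores):.3f}")
```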
Stratified k-fold cross-validation modifies the standard k-fold approach by ensuring that each fold preserves the approximate class distribution of the original dataset. This is particularly important for imbalanced classification problems where one or more classes are underrepresented.
For example, if the original dataset contains 90 percent negative samples and 10 percent positive samples, each fold in stratified k-fold will also contain approximately 90 percent negative and 10 percent positive samples. Without stratification, some folds might contain very few (or no) positive samples by chance, leading to unreliable performance estimates.
Kohavi's 1995 study found that stratification consistently improved the reliability of cross-validation estimates, and most modern machine learning libraries (including scikit-learn) use stratified k-fold as the default strategy for classification tasks [4].
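For illustration, the following sketch (with made-up labels matching the 90/10 split described above) shows that StratifiedKFold keeps the minority-class proportion constant in every fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 90 negatives and 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), start=1):
    print(f"Fold {fold}: {len(val_idx)} samples, {y[val_idx].sum()} positive")
# Every validation fold contains exactly 2 positives (10 percent), mirroring the full dataset
```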
Leave-one-out cross-validation is a special case of k-fold cross-validation where k equals the total number of samples n in the dataset. In each iteration, a single observation is held out for validation while the remaining n - 1 observations form the training set. This process repeats n times so that every observation serves as the validation point exactly once.
Advantages of LOOCV:

- The performance estimate is nearly unbiased, since each training set contains n - 1 observations, almost the entire dataset.
- The procedure is deterministic: there is no randomness in how the splits are formed, so results are fully reproducible.
- Every iteration uses the maximum possible amount of training data.
Disadvantages of LOOCV:

- It is computationally expensive, requiring n separate model fits.
- The estimate can have high variance, because the n training sets overlap almost completely and the individual estimates are highly correlated.
LOOCV is most useful for small datasets (fewer than 200 samples) where maximizing training data is critical and the computational cost of n model fits is manageable.
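A minimal sketch using scikit-learn's LeaveOneOut splitter (the dataset and classifier are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)  # 150 samples, so LOOCV trains 150 models

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(f"Number of fits: {len(scores)}")       # 150
print(f"LOOCV accuracy: {scores.mean():.3f}")
```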
Leave-p-out cross-validation generalizes LOOCV by holding out p observations in each iteration. This produces C(n, p) distinct training/test splits, where C(n, p) is the binomial coefficient "n choose p." For even moderate values of n and p, the number of iterations becomes astronomically large. For instance, with n = 100 and p = 5, there are over 75 million possible splits.
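The combinatorial count quoted above can be checked directly with Python's standard library:

```python
from math import comb

# Number of distinct train/test splits for leave-p-out with n = 100, p = 5
print(comb(100, 5))  # 75287520
```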
Because of this combinatorial explosion, leave-p-out cross-validation is rarely used in practice. It serves primarily as a theoretical framework for understanding the properties of cross-validation estimators.
Repeated k-fold cross-validation runs the entire k-fold procedure multiple times, each time with a different random shuffling of the data. The results from all repetitions are then averaged. For example, 10 repetitions of 5-fold cross-validation (often written as 10 x 5 CV) produces 50 individual performance estimates that are averaged into the final score.
This approach reduces the variance of the performance estimate compared to a single run of k-fold cross-validation, at the cost of increased computation. It is particularly valuable when the dataset is small enough that the random assignment of observations to folds can noticeably affect results. A common configuration is 10 repeats of 10-fold cross-validation, yielding 100 individual estimates.
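A minimal sketch of repeated k-fold with scikit-learn (the dataset and model are illustrative; 5 folds repeated 10 times matches the 10 x 5 configuration described above):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5 folds repeated 10 times -> 50 individual performance estimates
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=rkf)
print(f"{len(scores)} estimates, mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```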
Nested cross-validation addresses the problem of simultaneously performing hyperparameter tuning and model evaluation without introducing optimistic bias. It uses two layers of cross-validation loops:

- Inner loop: runs on the training portion of each outer fold and selects the best hyperparameter configuration (for example, via grid search).
- Outer loop: holds out each outer fold in turn as a test set, providing an estimate of how the full tuning-and-training pipeline performs on unseen data.
The key insight is that if you use the same cross-validation procedure for both selecting hyperparameters and estimating performance, the resulting estimate will be optimistically biased. The test data is no longer statistically "pure" because it was indirectly used to guide hyperparameter choices. Nested cross-validation eliminates this bias by ensuring the outer test fold is never seen during hyperparameter optimization [6].
A typical configuration uses 5 folds in the outer loop and 5 folds in the inner loop (5 x 5 nested CV). Each outer fold triggers 5 inner fits per candidate hyperparameter configuration, so with m candidate configurations the procedure requires 5 x 5 x m inner fits in total, plus the 5 outer evaluations (125 inner fits for a grid of five configurations). While computationally demanding, nested cross-validation provides an unbiased estimate of how well the entire model selection and training pipeline will perform on new data.
Standard k-fold cross-validation assumes that data points are independently and identically distributed (i.i.d.). This assumption is violated in time series data, where observations have temporal dependencies. Randomly shuffling time series data and splitting it into folds would create data leakage: the model could train on future observations to predict past ones.
Time series cross-validation preserves the temporal order of observations using specialized strategies:
Expanding window (forward chaining): The training set starts with the earliest observations and grows with each iteration. The test set is always the next chronological block. For example, with five splits:
| Iteration | Training period | Test period |
|---|---|---|
| 1 | Months 1 to 3 | Month 4 |
| 2 | Months 1 to 4 | Month 5 |
| 3 | Months 1 to 5 | Month 6 |
| 4 | Months 1 to 6 | Month 7 |
| 5 | Months 1 to 7 | Month 8 |
Sliding window (rolling window): The training set has a fixed size and "slides" forward in time. Older observations are dropped as newer ones are added. This is preferable when the data distribution shifts over time (concept drift), making distant historical data less relevant.
| Iteration | Training period | Test period |
|---|---|---|
| 1 | Months 1 to 3 | Month 4 |
| 2 | Months 2 to 4 | Month 5 |
| 3 | Months 3 to 5 | Month 6 |
| 4 | Months 4 to 6 | Month 7 |
| 5 | Months 5 to 7 | Month 8 |
In scikit-learn, the TimeSeriesSplit class implements the expanding window approach [5].
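A sliding window can be approximated by capping the training size with TimeSeriesSplit's max_train_size parameter; a minimal sketch with eight "monthly" observations (illustrative data):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)  # eight consecutive "monthly" observations

# max_train_size caps the window at three observations, giving rolling-window behavior
tscv = TimeSeriesSplit(n_splits=5, max_train_size=3)
for train_index, test_index in tscv.split(X):
    print(f"Train: {train_index}, Test: {test_index}")
# Train: [0 1 2], Test: [3]
# Train: [1 2 3], Test: [4]
# Train: [2 3 4], Test: [5]
# Train: [3 4 5], Test: [6]
# Train: [4 5 6], Test: [7]
```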
Group k-fold cross-validation ensures that all observations belonging to the same group appear in the same fold. This is necessary when data contains groups of related observations that are not independent, such as multiple measurements from the same patient, multiple images from the same camera, or multiple transactions from the same customer.
If related observations are split across training and validation sets, the model may appear to generalize well simply because it recognizes the group rather than learning the underlying pattern. Group k-fold prevents this leakage by keeping entire groups together. Scikit-learn provides GroupKFold, LeaveOneGroupOut, and StratifiedGroupKFold for this purpose [5].
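A minimal sketch with GroupKFold, using made-up patient identifiers as the grouping variable:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Three measurements from each of four hypothetical patients
X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])

gkf = GroupKFold(n_splits=4)
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups), start=1):
    # All measurements from a given patient land in the same fold
    print(f"Fold {fold}: validation patients = {sorted(set(groups[val_idx]))}")
```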
| Method | Number of fits | Bias | Variance | Best for |
|---|---|---|---|---|
| 2-fold CV | 2 | High (50% training data) | Low | Very large datasets |
| 5-fold CV | 5 | Moderate (80% training data) | Moderate | Large datasets (100,000+ samples) |
| 10-fold CV | 10 | Low (90% training data) | Moderate | General-purpose use |
| LOOCV | n | Very low | High | Small datasets (under 200 samples) |
| Leave-p-out | C(n, p) | Low | Depends on p | Theoretical analysis |
| Repeated 10-fold (10x) | 100 | Low | Low | When variance reduction is critical |
| Stratified k-fold | k | Low | Lower than standard k-fold | Imbalanced classification |
| Nested CV (5x5) | 125+ | Unbiased for pipeline | Moderate | Hyperparameter tuning + evaluation |
| Time series split | k | Depends on window | Moderate | Temporal data |
| Group k-fold | k | Depends on groups | Moderate | Grouped or clustered data |
The choice of k in k-fold cross-validation involves a fundamental tradeoff between bias and variance:
Bias: With a smaller k (such as 2 or 3), each training set contains a smaller fraction of the total data. Models trained on less data tend to underperform compared to models trained on the full dataset, so the cross-validation estimate is pessimistically biased (it underestimates the true performance). As k increases, each training set approaches the size of the full dataset, reducing this bias. LOOCV (k = n) has the smallest bias.
Variance: With a larger k, the k training sets overlap substantially. When k = n (LOOCV), any two training sets differ by only two observations. Because of this heavy overlap, the k performance estimates are highly correlated, and their average can have high variance. Smaller values of k produce less correlated estimates, resulting in lower variance.
Computation: Larger values of k require more model training runs, increasing computational cost proportionally.
The consensus recommendation in the machine learning community, supported by both theoretical analysis and Kohavi's empirical study, is that k = 10 (or k = 5 for very large datasets) offers the best practical tradeoff [4][7].
| Aspect | Holdout validation | K-fold cross-validation | Bootstrapping |
|---|---|---|---|
| Method | Single train/test split | k rotated train/test splits | Resampling with replacement |
| Data usage | Inefficient (20-30% reserved) | Efficient (all data used for both) | Full dataset per sample |
| Bias | Can be high (depends on split) | Low to moderate (depends on k) | Lower bias (uses full data) |
| Variance | High (single estimate) | Lower (averaged over k folds) | Can be high (replacement sampling) |
| Computational cost | Lowest | Moderate (k model fits) | High (hundreds of resamples) |
| Determinism | Depends on split randomness | Reproducible with fixed seed | Reproducible with fixed seed |
| Use case | Quick prototyping, very large data | Standard model evaluation | Confidence intervals, small data |
Bootstrapping creates training sets by sampling n observations with replacement from the original dataset, meaning some observations appear multiple times while others are left out. The left-out observations (roughly 36.8 percent on average) form the test set. This approach can produce lower-bias estimates because each bootstrap sample is the same size as the original dataset, but the estimates can have higher variance due to the nature of replacement sampling [4][8].
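A minimal sketch of a single bootstrap resample, illustrating the roughly 36.8 percent out-of-bag fraction (the sample size of 1,000 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000
indices = np.arange(n)

# Draw n observations with replacement; the ones never drawn are "out of bag"
sample = rng.choice(indices, size=n, replace=True)
out_of_bag = np.setdiff1d(indices, sample)

print(f"Unique observations in the bootstrap sample: {len(np.unique(sample))}")
print(f"Out-of-bag fraction: {len(out_of_bag) / n:.3f}")  # close to 1 - 1/e ≈ 0.368
```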
It is important to distinguish between two distinct uses of cross-validation:
Model selection refers to choosing the best model, algorithm, or hyperparameter configuration. For example, you might use 5-fold cross-validation to compare a random forest with 100 trees versus 500 trees and select whichever achieves better average validation performance.
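As an illustrative sketch of this kind of comparison (the dataset is arbitrary; the forest sizes mirror the example above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Compare two candidate configurations using the same deterministic 5-fold splits
for n_trees in (100, 500):
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{n_trees} trees: {scores.mean():.3f} (+/- {scores.std():.3f})")
```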
Model assessment refers to estimating the generalization error of the final chosen model. After selecting the best model through the model selection process, you need an honest estimate of how it will perform on completely new data.
Using the same cross-validation procedure for both purposes introduces selection bias: the chosen model's cross-validation score is an optimistically biased estimate of its true performance. This is because you are reporting the "winner" of a comparison, and the winner's score tends to overestimate true performance by chance.
Nested cross-validation solves this by separating the two steps. The inner loop handles model selection, while the outer loop provides an unbiased assessment of the entire selection procedure [6][9].
The most critical mistake in cross-validation is data leakage, where information from the validation set improperly influences the training process. Common sources of leakage include:

- Fitting preprocessing steps such as scaling, imputation, or feature selection on the entire dataset before splitting it into folds.
- Splitting related observations (for example, multiple records from the same patient or customer) across training and validation folds.
- Randomly shuffling time series data, allowing the model to train on future observations when predicting past ones.
In scikit-learn, the recommended way to prevent preprocessing leakage is to use Pipeline objects that encapsulate both preprocessing and modeling steps, ensuring that preprocessing is fit only on training data within each fold [5].
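A minimal sketch of this pattern (the dataset, scaler, and classifier are illustrative): wrapping preprocessing and model in a Pipeline ensures the scaler is re-fit on the training portion of each fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is fit only on the training folds inside each CV iteration,
# so no information from the validation fold leaks into preprocessing
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```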
For classification problems with imbalanced class distributions, using unstratified k-fold cross-validation can produce folds where minority classes are severely underrepresented or entirely absent. This leads to unreliable and highly variable performance estimates. Stratified k-fold should be the default choice for classification tasks.
Using the same cross-validation split for hyperparameter tuning and final performance reporting produces optimistically biased results. The reported performance will be better than what the model actually achieves on new data. Nested cross-validation or a separate held-out test set should be used to obtain honest estimates.
Reporting only the mean cross-validation score without the standard deviation across folds hides important information about estimate reliability. A model with a mean accuracy of 85 percent and a standard deviation of 2 percent is very different from one with the same mean but a standard deviation of 10 percent. Always report both the mean and variance (or standard deviation) of cross-validation scores.
Scikit-learn provides a comprehensive suite of cross-validation tools in the sklearn.model_selection module [5].
| Class/Function | Description |
|---|---|
| cross_val_score | Computes cross-validated scores for an estimator using a specified CV strategy |
| cross_validate | Similar to cross_val_score but returns multiple metrics, fit times, and optionally trained estimators |
| cross_val_predict | Returns cross-validated predictions for each data point (when in the test fold) |
| KFold | Standard k-fold splitter |
| StratifiedKFold | K-fold splitter that preserves class proportions |
| RepeatedKFold | Repeats k-fold with different random splits |
| RepeatedStratifiedKFold | Repeats stratified k-fold with different random splits |
| LeaveOneOut | Leave-one-out splitter |
| LeavePOut | Leave-p-out splitter |
| GroupKFold | K-fold splitter respecting group boundaries |
| StratifiedGroupKFold | Stratified k-fold respecting group boundaries |
| TimeSeriesSplit | Time series-aware expanding window splitter |
| ShuffleSplit | Random permutation train/test splitter |
Basic example: 5-fold cross-validation with cross_val_score.

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold stratified cross-validation (default for classifiers)
scores = cross_val_score(clf, X, y, cv=5)
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```
Nested cross-validation: hyperparameter tuning in the inner loop, performance estimation in the outer loop.

```python
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Inner loop: hyperparameter tuning
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
clf = GridSearchCV(SVC(), param_grid, cv=inner_cv)

# Outer loop: performance estimation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
nested_scores = cross_val_score(clf, X, y, cv=outer_cv)
print(f"Nested CV Accuracy: {nested_scores.mean():.3f} (+/- {nested_scores.std():.3f})")
```
Time series cross-validation with TimeSeriesSplit (expanding window).

```python
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])

tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    print(f"Train: {train_index}, Test: {test_index}")
# Train: [0 1 2], Test: [3]
# Train: [0 1 2 3], Test: [4]
# Train: [0 1 2 3 4], Test: [5]
```
Cross-validation multiplies the computational cost of model training by the number of folds and repetitions. For a single run of k-fold cross-validation, k models must be trained. For repeated k-fold with r repetitions, k x r models are needed. Nested cross-validation with k_outer x k_inner folds and m hyperparameter combinations requires k_outer x k_inner x m model fits.
Several strategies help manage computational costs:

- Parallelization: most scikit-learn cross-validation utilities (via the n_jobs parameter) support parallel execution across folds.
- Using a smaller k or fewer repetitions when the dataset is large enough that the estimate is already stable.
- Running preliminary experiments on a random subsample of the data before committing to a full cross-validation run.

While model evaluation is its primary use, cross-validation also plays a role in model selection and hyperparameter tuning, feature selection, and generating out-of-fold predictions (via functions such as cross_val_predict) for downstream analysis.