Cross-validation is a statistical technique used in machine learning to evaluate how well a predictive model generalizes to an independent dataset. Rather than relying on a single split of data into training and testing portions, cross-validation systematically partitions the data into multiple subsets, trains the model on some subsets, and validates it on the remaining ones. By averaging performance across these partitions, cross-validation provides a more reliable and less biased estimate of model performance than simple holdout methods.
Cross-validation serves two primary purposes: model assessment (estimating the generalization error of a final model) and model selection (choosing the best model or hyperparameter configuration among competing alternatives). It is one of the most widely used techniques in applied machine learning and statistics, appearing in virtually every modeling workflow from academic research to production systems.
Imagine you are studying for an exam using a set of 100 practice questions. If you always practice with questions 1 through 80 and then test yourself on questions 81 through 100, you might get lucky or unlucky depending on which questions end up in your test. Maybe your weak spots happen to not show up in those 20 questions, giving you a false sense of confidence.
Cross-validation is like taking five different practice exams, each time setting aside a different group of 20 questions as the test. First you test on questions 1 through 20, then 21 through 40, and so on. By averaging your scores across all five mini-exams, you get a much more honest picture of how prepared you really are. That is exactly what cross-validation does for machine learning models: it gives a fair, well-rounded estimate of how the model will perform on data it has never seen before.
The formalization of cross-validation as a statistical methodology dates to the early 1970s. Mervyn Stone published his seminal paper "Cross-Validatory Choice and Assessment of Statistical Predictions" in 1974 in the Journal of the Royal Statistical Society, Series B. In this work, Stone defined the leave-one-out cross-validation procedure and applied a generalized cross-validation criterion to problems in univariate estimation, linear regression, and analysis of variance [1]. Independently, Seymour Geisser introduced the "predictive sample reuse" method in 1975, published in the Journal of the American Statistical Association. Geisser's approach was broader in scope, allowing more general data splits beyond the leave-one-out case [2].
Mervyn Stone also demonstrated in 1977 that cross-validation is asymptotically equivalent to the Akaike Information Criterion (AIC) for model selection, establishing a deep theoretical connection between cross-validation and information-theoretic approaches [3]. In 1995, Ron Kohavi conducted a landmark empirical study comparing cross-validation and bootstrap methods for accuracy estimation. Based on extensive experiments with over half a million runs of C4.5 and Naive Bayes algorithms, Kohavi concluded that ten-fold stratified cross-validation offered the best tradeoff between bias and variance for model selection on real-world datasets [4].
The simplest form of model evaluation is the holdout method, which splits the dataset into two disjoint parts: a training set (typically 70 to 80 percent of the data) and a test set (the remaining 20 to 30 percent). The model learns from the training set and is evaluated on the test set.
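As a minimal illustration (the dataset, the 80/20 ratio, and the model choice here are arbitrary), a holdout evaluation with scikit-learn might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Reserve 20 percent of the data as a single held-out test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(f"Holdout accuracy: {clf.score(X_test, y_test):.3f}")
```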
While computationally efficient, the holdout method has significant limitations. The performance estimate depends heavily on which data points end up in each partition, leading to high variance in the results. It also uses data inefficiently, since a substantial portion is reserved solely for evaluation and never contributes to training. Cross-validation addresses both of these shortcomings by rotating the roles of training and testing data across multiple iterations.
The general cross-validation procedure works as follows:

1. Partition the dataset into several subsets (folds).
2. Train the model on all but one subset and validate it on the held-out subset.
3. Repeat the process, rotating which subset is held out, until every subset has served as the validation set exactly once.
4. Average the performance metrics across all iterations to obtain the final estimate.
This rotation ensures that every data point is used for both training and validation, maximizing data utilization while providing a robust performance estimate.
K-fold cross-validation is the most commonly used variant. The dataset is randomly shuffled and divided into k equal-sized (or nearly equal-sized) folds. The model is trained k times, each time using a different fold as the validation set and the remaining k - 1 folds as the training set. The final performance estimate is the average across all k iterations.
Common choices for k are 5 and 10. Kohavi's 1995 study recommended 10-fold cross-validation as offering a good balance between bias and variance for most practical applications [4]. With k = 5, each training set contains 80 percent of the data; with k = 10, each contains 90 percent.
Procedure for 5-fold cross-validation:

1. Shuffle the dataset and split it into five equal (or nearly equal) folds.
2. Train the model on folds 2 through 5 and validate it on fold 1.
3. Repeat, holding out fold 2, then fold 3, and so on, until each fold has served as the validation set exactly once.
4. Average the five validation scores to obtain the final performance estimate.
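A minimal sketch of this procedure using scikit-learn's KFold splitter (the dataset and model are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])        # train on the other four folds
    score = model.score(X[val_idx], y[val_idx])  # validate on the held-out fold
    scores.append(score)
    print(f"Fold {fold}: accuracy = {score:.3f}")

print(f"Mean accuracy: {np.mean(scores):.3f}")
```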
Stratified k-fold cross-validation modifies the standard k-fold approach by ensuring that each fold preserves the approximate class distribution of the original dataset. This is particularly important for imbalanced classification problems where one or more classes are underrepresented.
For example, if the original dataset contains 90 percent negative samples and 10 percent positive samples, each fold in stratified k-fold will also contain approximately 90 percent negative and 10 percent positive samples. Without stratification, some folds might contain very few (or no) positive samples by chance, leading to unreliable performance estimates.
Kohavi's 1995 study found that stratification consistently improved the reliability of cross-validation estimates, and most modern machine learning libraries (including scikit-learn) use stratified k-fold as the default strategy for classification tasks [4].
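For illustration, the following sketch (with made-up labels matching the 90/10 split described above) shows that StratifiedKFold keeps the minority-class proportion constant in every fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 90 negatives and 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), start=1):
    print(f"Fold {fold}: {len(val_idx)} samples, {y[val_idx].sum()} positive")
# Every validation fold contains exactly 2 positives (10 percent), mirroring the full dataset
```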
Leave-one-out cross-validation is a special case of k-fold cross-validation where k equals the total number of samples n in the dataset. In each iteration, a single observation is held out for validation while the remaining n - 1 observations form the training set. This process repeats n times so that every observation serves as the validation point exactly once.
Advantages of LOOCV:

- The performance estimate is nearly unbiased, since each training set contains n - 1 observations, almost the entire dataset.
- The procedure is deterministic: there is no randomness in how the splits are formed, so results are fully reproducible.
- Every iteration uses the maximum possible amount of training data.
Disadvantages of LOOCV:

- It is computationally expensive, requiring n separate model fits.
- The estimate can have high variance, because the n training sets overlap almost completely and the individual estimates are highly correlated.
LOOCV is most useful for small datasets (fewer than 200 samples) where maximizing training data is critical and the computational cost of n model fits is manageable.
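A minimal sketch using scikit-learn's LeaveOneOut splitter (the dataset and classifier are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)  # 150 samples, so LOOCV trains 150 models

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(f"Number of fits: {len(scores)}")       # 150
print(f"LOOCV accuracy: {scores.mean():.3f}")
```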
Leave-p-out cross-validation generalizes LOOCV by holding out p observations in each iteration. This produces C(n, p) distinct training/test splits, where C(n, p) is the binomial coefficient "n choose p." For even moderate values of n and p, the number of iterations becomes astronomically large. For instance, with n = 100 and p = 5, there are over 75 million possible splits.
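The combinatorial count quoted above can be checked directly with Python's standard library:

```python
from math import comb

# Number of distinct train/test splits for leave-p-out with n = 100, p = 5
print(comb(100, 5))  # 75287520
```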
Because of this combinatorial explosion, leave-p-out cross-validation is rarely used in practice. It serves primarily as a theoretical framework for understanding the properties of cross-validation estimators.
Repeated k-fold cross-validation runs the entire k-fold procedure multiple times, each time with a different random shuffling of the data. The results from all repetitions are then averaged. For example, 10 repetitions of 5-fold cross-validation (often written as 10 x 5 CV) produces 50 individual performance estimates that are averaged into the final score.
This approach reduces the variance of the performance estimate compared to a single run of k-fold cross-validation, at the cost of increased computation. It is particularly valuable when the dataset is small enough that the random assignment of observations to folds can noticeably affect results. A common configuration is 10 repeats of 10-fold cross-validation, yielding 100 individual estimates.
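A minimal sketch of repeated k-fold with scikit-learn (the dataset and model are illustrative; 5 folds repeated 10 times matches the 10 x 5 configuration described above):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5 folds repeated 10 times -> 50 individual performance estimates
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=rkf)
print(f"{len(scores)} estimates, mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```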
Nested cross-validation addresses the problem of simultaneously performing hyperparameter tuning and model evaluation without introducing optimistic bias. It uses two layers of cross-validation loops:

- Inner loop: runs on the training portion of each outer fold and selects the best hyperparameter configuration (for example, via grid search).
- Outer loop: holds out each outer fold in turn as a test set, providing an estimate of how the full tuning-and-training pipeline performs on unseen data.
The key insight is that if you use the same cross-validation procedure for both selecting hyperparameters and estimating performance, the resulting estimate will be optimistically biased. The test data is no longer statistically "pure" because it was indirectly used to guide hyperparameter choices. Nested cross-validation eliminates this bias by ensuring the outer test fold is never seen during hyperparameter optimization [6].
A typical configuration uses 5 folds in the outer loop and 5 folds in the inner loop (5 x 5 nested CV). Each outer fold triggers 5 inner fits per candidate hyperparameter configuration, so with m candidate configurations the procedure requires 5 x 5 x m inner fits in total, plus the 5 outer evaluations (125 inner fits for a grid of five configurations). While computationally demanding, nested cross-validation provides an unbiased estimate of how well the entire model selection and training pipeline will perform on new data.
Standard k-fold cross-validation assumes that data points are independently and identically distributed (i.i.d.). This assumption is violated in time series data, where observations have temporal dependencies. Randomly shuffling time series data and splitting it into folds would create data leakage: the model could train on future observations to predict past ones.
Time series cross-validation preserves the temporal order of observations using specialized strategies:
Expanding window (forward chaining): The training set starts with the earliest observations and grows with each iteration. The test set is always the next chronological block. For example, with five splits:
| Iteration | Training period | Test period |
|---|---|---|
| 1 | Months 1 to 3 | Month 4 |
| 2 | Months 1 to 4 | Month 5 |
| 3 | Months 1 to 5 | Month 6 |
| 4 | Months 1 to 6 | Month 7 |
| 5 | Months 1 to 7 | Month 8 |
Sliding window (rolling window): The training set has a fixed size and "slides" forward in time. Older observations are dropped as newer ones are added. This is preferable when the data distribution shifts over time (concept drift), making distant historical data less relevant.
| Iteration | Training period | Test period |
|---|---|---|
| 1 | Months 1 to 3 | Month 4 |
| 2 | Months 2 to 4 | Month 5 |
| 3 | Months 3 to 5 | Month 6 |
| 4 | Months 4 to 6 | Month 7 |
| 5 | Months 5 to 7 | Month 8 |
In scikit-learn, the TimeSeriesSplit class implements the expanding window approach [5].
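A sliding window can be approximated by capping the training size with TimeSeriesSplit's max_train_size parameter; a minimal sketch with eight "monthly" observations (illustrative data):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)  # eight consecutive "monthly" observations

# max_train_size caps the window at three observations, giving rolling-window behavior
tscv = TimeSeriesSplit(n_splits=5, max_train_size=3)
for train_index, test_index in tscv.split(X):
    print(f"Train: {train_index}, Test: {test_index}")
# Train: [0 1 2], Test: [3]
# Train: [1 2 3], Test: [4]
# Train: [2 3 4], Test: [5]
# Train: [3 4 5], Test: [6]
# Train: [4 5 6], Test: [7]
```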
Group k-fold cross-validation ensures that all observations belonging to the same group appear in the same fold. This is necessary when data contains groups of related observations that are not independent, such as multiple measurements from the same patient, multiple images from the same camera, or multiple transactions from the same customer.
If related observations are split across training and validation sets, the model may appear to generalize well simply because it recognizes the group rather than learning the underlying pattern. Group k-fold prevents this leakage by keeping entire groups together. Scikit-learn provides GroupKFold, LeaveOneGroupOut, and StratifiedGroupKFold for this purpose [5].
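A minimal sketch with GroupKFold, using made-up patient identifiers as the grouping variable:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Three measurements from each of four hypothetical patients
X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])

gkf = GroupKFold(n_splits=4)
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups), start=1):
    # All measurements from a given patient land in the same fold
    print(f"Fold {fold}: validation patients = {sorted(set(groups[val_idx]))}")
```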
| Method | Number of fits | Bias | Variance | Best for |
|---|---|---|---|---|
| 2-fold CV | 2 | High (50% training data) | Low | Very large datasets |
| 5-fold CV | 5 | Moderate (80% training data) | Moderate | Large datasets (100,000+ samples) |
| 10-fold CV | 10 | Low (90% training data) | Moderate | General-purpose use |
| LOOCV | n | Very low | High | Small datasets (under 200 samples) |
| Leave-p-out | C(n, p) | Low | Depends on p | Theoretical analysis |
| Repeated 10-fold (10x) | 100 | Low | Low | When variance reduction is critical |
| Stratified k-fold | k | Low | Lower than standard k-fold | Imbalanced classification |
| Nested CV (5x5) | 125+ | Unbiased for pipeline | Moderate | Hyperparameter tuning + evaluation |
| Time series split | k | Depends on window | Moderate | Temporal data |
| Group k-fold | k | Depends on groups | Moderate | Grouped or clustered data |
The choice of k in k-fold cross-validation involves a fundamental tradeoff between bias and variance:
Bias: With a smaller k (such as 2 or 3), each training set contains a smaller fraction of the total data. Models trained on less data tend to underperform compared to models trained on the full dataset, so the cross-validation estimate is pessimistically biased (it underestimates the true performance). As k increases, each training set approaches the size of the full dataset, reducing this bias. LOOCV (k = n) has the smallest bias.
Variance: With a larger k, the k training sets overlap substantially. When k = n (LOOCV), any two training sets differ by only two observations. Because of this heavy overlap, the k performance estimates are highly correlated, and their average can have high variance. Smaller values of k produce less correlated estimates, resulting in lower variance.
Computation: Larger values of k require more model training runs, increasing computational cost proportionally.
The consensus recommendation in the machine learning community, supported by both theoretical analysis and Kohavi's empirical study, is that k = 10 (or k = 5 for very large datasets) offers the best practical tradeoff [4][7].
| Aspect | Holdout validation | K-fold cross-validation | Bootstrapping |
|---|---|---|---|
| Method | Single train/test split | k rotated train/test splits | Resampling with replacement |
| Data usage | Inefficient (20-30% reserved) | Efficient (all data used for both) | Full dataset per sample |
| Bias | Can be high (depends on split) | Low to moderate (depends on k) | Lower bias (uses full data) |
| Variance | High (single estimate) | Lower (averaged over k folds) | Can be high (replacement sampling) |
| Computational cost | Lowest | Moderate (k model fits) | High (hundreds of resamples) |
| Determinism | Depends on split randomness | Reproducible with fixed seed | Reproducible with fixed seed |
| Use case | Quick prototyping, very large data | Standard model evaluation | Confidence intervals, small data |
Bootstrapping creates training sets by sampling n observations with replacement from the original dataset, meaning some observations appear multiple times while others are left out. The left-out observations (roughly 36.8 percent on average) form the test set. This approach can produce lower-bias estimates because each bootstrap sample is the same size as the original dataset, but the estimates can have higher variance due to the nature of replacement sampling [4][8].
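A minimal sketch of a single bootstrap resample, illustrating the roughly 36.8 percent out-of-bag fraction (the sample size of 1,000 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000
indices = np.arange(n)

# Draw n observations with replacement; the ones never drawn are "out of bag"
sample = rng.choice(indices, size=n, replace=True)
out_of_bag = np.setdiff1d(indices, sample)

print(f"Unique observations in the bootstrap sample: {len(np.unique(sample))}")
print(f"Out-of-bag fraction: {len(out_of_bag) / n:.3f}")  # close to 1 - 1/e ≈ 0.368
```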
It is important to distinguish between two distinct uses of cross-validation:
Model selection refers to choosing the best model, algorithm, or hyperparameter configuration. For example, you might use 5-fold cross-validation to compare a random forest with 100 trees versus 500 trees and select whichever achieves better average validation performance.
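As an illustrative sketch of this kind of comparison (the dataset is arbitrary; the forest sizes mirror the example above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Compare two candidate configurations using the same deterministic 5-fold splits
for n_trees in (100, 500):
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{n_trees} trees: {scores.mean():.3f} (+/- {scores.std():.3f})")
```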
Model assessment refers to estimating the generalization error of the final chosen model. After selecting the best model through the model selection process, you need an honest estimate of how it will perform on completely new data.
Using the same cross-validation procedure for both purposes introduces selection bias: the chosen model's cross-validation score is an optimistically biased estimate of its true performance. This is because you are reporting the "winner" of a comparison, and the winner's score tends to overestimate true performance by chance.
Nested cross-validation solves this by separating the two steps. The inner loop handles model selection, while the outer loop provides an unbiased assessment of the entire selection procedure [6][9].
The most critical mistake in cross-validation is data leakage, where information from the validation set improperly influences the training process. Common sources of leakage include:

- Fitting preprocessing steps such as scaling, imputation, or feature selection on the entire dataset before splitting it into folds.
- Splitting related observations (for example, multiple records from the same patient or customer) across training and validation folds.
- Randomly shuffling time series data, allowing the model to train on future observations when predicting past ones.
In scikit-learn, the recommended way to prevent preprocessing leakage is to use Pipeline objects that encapsulate both preprocessing and modeling steps, ensuring that preprocessing is fit only on training data within each fold [5].
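A minimal sketch of this pattern (the dataset, scaler, and classifier are illustrative): wrapping preprocessing and model in a Pipeline ensures the scaler is re-fit on the training portion of each fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is fit only on the training folds inside each CV iteration,
# so no information from the validation fold leaks into preprocessing
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```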
For classification problems with imbalanced class distributions, using unstratified k-fold cross-validation can produce folds where minority classes are severely underrepresented or entirely absent. This leads to unreliable and highly variable performance estimates. Stratified k-fold should be the default choice for classification tasks.
Using the same cross-validation split for hyperparameter tuning and final performance reporting produces optimistically biased results. The reported performance will be better than what the model actually achieves on new data. Nested cross-validation or a separate held-out test set should be used to obtain honest estimates.
Reporting only the mean cross-validation score without the standard deviation across folds hides important information about estimate reliability. A model with a mean accuracy of 85 percent and a standard deviation of 2 percent is very different from one with the same mean but a standard deviation of 10 percent. Always report both the mean and variance (or standard deviation) of cross-validation scores.
Scikit-learn provides a comprehensive suite of cross-validation tools in the sklearn.model_selection module [5].
| Class/Function | Description |
|---|---|
| cross_val_score | Computes cross-validated scores for an estimator using a specified CV strategy |
| cross_validate | Similar to cross_val_score but returns multiple metrics, fit times, and optionally trained estimators |
| cross_val_predict | Returns cross-validated predictions for each data point (when in the test fold) |
| KFold | Standard k-fold splitter |
| StratifiedKFold | K-fold splitter that preserves class proportions |
| RepeatedKFold | Repeats k-fold with different random splits |
| RepeatedStratifiedKFold | Repeats stratified k-fold with different random splits |
| LeaveOneOut | Leave-one-out splitter |
| LeavePOut | Leave-p-out splitter |
| GroupKFold | K-fold splitter respecting group boundaries |
| StratifiedGroupKFold | Stratified k-fold respecting group boundaries |
| TimeSeriesSplit | Time series-aware expanding window splitter |
| ShuffleSplit | Random permutation train/test splitter |
Basic example: 5-fold cross-validation with cross_val_score.

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold stratified cross-validation (default for classifiers)
scores = cross_val_score(clf, X, y, cv=5)
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```
Nested cross-validation: hyperparameter tuning in the inner loop, performance estimation in the outer loop.

```python
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Inner loop: hyperparameter tuning
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
clf = GridSearchCV(SVC(), param_grid, cv=inner_cv)

# Outer loop: performance estimation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
nested_scores = cross_val_score(clf, X, y, cv=outer_cv)
print(f"Nested CV Accuracy: {nested_scores.mean():.3f} (+/- {nested_scores.std():.3f})")
```
Time series cross-validation with TimeSeriesSplit (expanding window).

```python
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])

tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    print(f"Train: {train_index}, Test: {test_index}")
# Train: [0 1 2], Test: [3]
# Train: [0 1 2 3], Test: [4]
# Train: [0 1 2 3 4], Test: [5]
```
Cross-validation multiplies the computational cost of model training by the number of folds and repetitions. For a single run of k-fold cross-validation, k models must be trained. For repeated k-fold with r repetitions, k x r models are needed. Nested cross-validation with k_outer x k_inner folds and m hyperparameter combinations requires k_outer x k_inner x m model fits.
Several strategies help manage computational costs:

- Parallelization: most scikit-learn cross-validation utilities (via the n_jobs parameter) support parallel execution across folds.
- Using a smaller k or fewer repetitions when the dataset is large enough that the estimate is already stable.
- Running preliminary experiments on a random subsample of the data before committing to a full cross-validation run.

While model evaluation is its primary use, cross-validation also plays a role in model selection and hyperparameter tuning, feature selection, and generating out-of-fold predictions (via functions such as cross_val_predict) for downstream analysis.