# Cross-Validation

> Source: https://aiwiki.ai/wiki/cross-validation
> Updated: 2026-06-20
> Categories: Machine Learning, Model Evaluation

Cross-validation is a statistical resampling technique used in [machine learning](/wiki/machine_learning) to estimate how accurately a predictive model will generalize to data it was not trained on. Instead of relying on a single split of data into [training](/wiki/training_set) and testing portions, cross-validation systematically partitions the data into multiple subsets called folds, trains the model on some folds, and validates it on the held-out fold, then averages the results. The technique was formalized independently by Mervyn Stone and Seymour Geisser in 1974 and 1975 [1][2], and Ron Kohavi's 1995 study of over half a million runs concluded that ten-fold stratified cross-validation is the recommended method for model selection on real-world datasets [4].

By rotating which fold is held out and averaging performance across all folds, cross-validation provides a more reliable and less biased estimate of model performance than a single holdout split. It serves two primary purposes: **model assessment** (estimating the [generalization](/wiki/generalization) error of a final model) and **model selection** (choosing the best model or [hyperparameter](/wiki/hyperparameter) configuration among competing alternatives). It is one of the most widely used techniques in applied machine learning and statistics, appearing in virtually every modeling workflow from academic research to production systems.

## What is cross-validation in simple terms?

Imagine you are studying for an exam using a set of 100 practice questions. If you always practice with questions 1 through 80 and then test yourself on questions 81 through 100, you might get lucky or unlucky depending on which questions end up in your test. Maybe your weak spots happen to not show up in those 20 questions, giving you a false sense of confidence.

Cross-validation is like taking five different practice exams, each time setting aside a different group of 20 questions as the test. First you test on questions 1 through 20, then 21 through 40, and so on. By averaging your scores across all five mini-exams, you get a much more honest picture of how prepared you really are. That is exactly what cross-validation does for machine learning models: it gives a fair, well-rounded estimate of how the model will perform on data it has never seen before.

## When was cross-validation invented?

The formalization of cross-validation as a statistical methodology dates to the early 1970s. Mervyn Stone published his seminal paper "Cross-Validatory Choice and Assessment of Statistical Predictions" in 1974 in the Journal of the Royal Statistical Society, Series B (volume 36, pages 111-147 including discussion). In this work, Stone defined the leave-one-out cross-validation procedure and applied a generalized cross-validation criterion to problems in univariate estimation, linear [regression](/wiki/regression_model), and analysis of variance [1]. Independently, Seymour Geisser introduced the "predictive sample reuse" method in 1975, published in the Journal of the American Statistical Association (volume 70, number 350, pages 320-328). Geisser's approach was broader in scope, allowing more general data splits beyond the leave-one-out case [2]. The two statisticians are credited with simultaneously and independently inventing the method now known as cross-validation.

Stone also showed in 1977 that choosing a model by leave-one-out cross-validation is asymptotically equivalent to choosing it by the Akaike Information Criterion (AIC), establishing a theoretical connection between cross-validation and information-theoretic approaches [3]. In 1995, Ron Kohavi conducted a landmark empirical study comparing cross-validation and [bootstrap](/wiki/bootstrap_aggregating) methods for accuracy estimation. Based on extensive experiments with over half a million runs of the C4.5 decision-tree and Naive Bayes algorithms across real-world datasets, Kohavi concluded that ten-fold stratified cross-validation offered the best tradeoff between bias and variance for model selection [4]. His recommendation was unambiguous: "for model selection use 10-fold stratified cross validation" even when more computation is available [4].

## Core methodology

### The holdout method

The simplest form of model evaluation is the holdout method, which splits the dataset into two disjoint parts: a [training set](/wiki/training_set) (typically 70 to 80 percent of the data) and a [test set](/wiki/test_set) (the remaining 20 to 30 percent). The model learns from the training set and is evaluated on the test set.

While computationally efficient, the holdout method has significant limitations. The performance estimate depends heavily on which data points end up in each partition, leading to high variance in the results. It also uses data inefficiently, since a substantial portion is reserved solely for evaluation and never contributes to training. Cross-validation addresses both of these shortcomings by rotating the roles of training and testing data across multiple iterations.

### General cross-validation procedure

The general cross-validation procedure works as follows:

1. Partition the dataset into multiple non-overlapping subsets (called "folds").
2. For each fold, hold it out as the [validation set](/wiki/validation_set) and train the model on all remaining folds.
3. Evaluate the model on the held-out fold and record the performance metric.
4. Repeat until every fold has served as the validation set exactly once.
5. Aggregate the performance scores (typically by computing the mean and standard deviation) to produce the final estimate.

This rotation ensures that every data point is used for both training and validation, maximizing data utilization while providing a robust performance estimate.
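The five steps above can be sketched from scratch with NumPy. This is a minimal illustration rather than a reference implementation: the helper name `k_fold_indices`, the synthetic data, and the choice of a straight-line fit scored by mean squared error are all assumptions made for the example.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k non-overlapping folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)  # fold sizes differ by at most one
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Synthetic regression data (hypothetical): y is roughly 3x plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + rng.normal(0, 1, size=50)

fold_mse = []
for train_idx, val_idx in k_fold_indices(len(x), k=5):
    # Fit on the training folds only, then score on the held-out fold
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    pred = slope * x[val_idx] + intercept
    fold_mse.append(np.mean((y[val_idx] - pred) ** 2))

# Aggregate the per-fold scores into the final estimate
mean_mse, std_mse = np.mean(fold_mse), np.std(fold_mse)
print(f"MSE: {mean_mse:.3f} +/- {std_mse:.3f}")
```

Every index lands in exactly one validation fold, which is what lets each observation serve for both training and validation across the run.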

## What are the types of cross-validation?

### K-fold cross-validation

K-fold cross-validation is the most commonly used variant. The dataset is randomly shuffled and divided into *k* equal-sized (or nearly equal-sized) folds. The model is trained *k* times, each time using a different fold as the validation set and the remaining *k* - 1 folds as the training set. The final performance estimate is the average across all *k* iterations.

Common choices for *k* are 5 and 10. Kohavi's 1995 study recommended 10-fold cross-validation as offering a good balance between bias and variance for most practical applications [4]. With *k* = 5, each training set contains 80 percent of the data; with *k* = 10, each contains 90 percent.

**Procedure for 5-fold cross-validation:**

1. Shuffle the dataset randomly.
2. Split into 5 equal folds: F1, F2, F3, F4, F5.
3. Iteration 1: Train on F2 + F3 + F4 + F5, validate on F1.
4. Iteration 2: Train on F1 + F3 + F4 + F5, validate on F2.
5. Iteration 3: Train on F1 + F2 + F4 + F5, validate on F3.
6. Iteration 4: Train on F1 + F2 + F3 + F5, validate on F4.
7. Iteration 5: Train on F1 + F2 + F3 + F4, validate on F5.
8. Report the mean and standard deviation of the 5 validation scores.
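As a rough illustration of the same rotation with scikit-learn's `KFold` splitter, the sketch below prints which samples make up each validation fold. The ten-sample toy array and the seed are assumptions chosen only to keep the output small.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # ten toy samples with indices 0..9

kf = KFold(n_splits=5, shuffle=True, random_state=0)
splits = list(kf.split(X))

for i, (train_idx, val_idx) in enumerate(splits, start=1):
    # Fold F_i is held out; the other four folds form the training set
    print(f"Iteration {i}: train on {len(train_idx)} samples, "
          f"validate on F{i} = {val_idx.tolist()}")
```

With 10 samples and 5 folds, each iteration trains on 8 samples and validates on 2, matching the 80/20 proportion described above.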

### Stratified k-fold cross-validation

Stratified k-fold cross-validation modifies the standard k-fold approach by ensuring that each fold preserves the approximate class distribution of the original dataset. This is particularly important for imbalanced [classification](/wiki/classification_model) problems where one or more classes are underrepresented.

For example, if the original dataset contains 90 percent negative samples and 10 percent positive samples, each fold in stratified k-fold will also contain approximately 90 percent negative and 10 percent positive samples. Without stratification, some folds might contain very few (or no) positive samples by chance, leading to unreliable performance estimates.

Kohavi's 1995 study found that stratification consistently improved the reliability of cross-validation estimates, and most modern machine learning libraries (including scikit-learn) use stratified k-fold as the default strategy for classification tasks. The scikit-learn documentation states that when the `cv` argument is an integer, `cross_val_score` uses "the `KFold` or `StratifiedKFold` strategies by default, the latter being used if the estimator derives from `ClassifierMixin`" [4][5].
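The 90/10 example can be checked directly. The following sketch, which assumes a synthetic label vector with 90 negatives and 10 positives, counts the positives that land in each validation fold with and without stratification.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic imbalanced labels (hypothetical): 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
strat_pos = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]

kf = KFold(n_splits=5, shuffle=True, random_state=0)
plain_pos = [int(y[val_idx].sum()) for _, val_idx in kf.split(X)]

print("Positives per fold, stratified:  ", strat_pos)  # 2 in every fold
print("Positives per fold, unstratified:", plain_pos)  # varies with the shuffle
```

The stratified splitter needs `y` to compute class proportions, which is why `split` receives the labels as well as the features.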

### Leave-one-out cross-validation (LOOCV)

Leave-one-out cross-validation is a special case of k-fold cross-validation where *k* equals the total number of samples *n* in the dataset. In each iteration, a single observation is held out for validation while the remaining *n* - 1 observations form the training set. This process repeats *n* times so that every observation serves as the validation point exactly once.

**Advantages of LOOCV:**
- Maximizes the amount of training data in each iteration (only one sample is excluded).
- Produces a nearly unbiased estimate of the true generalization error.
- Is deterministic; there is no randomness from data shuffling.

**Disadvantages of LOOCV:**
- Computationally expensive, requiring *n* separate model training runs. The scikit-learn documentation notes that "LOO is more computationally expensive than k-fold cross validation" [5].
- Tends to produce high-variance estimates because training sets across iterations overlap by *n* - 2 samples, making the individual estimates highly correlated. The same documentation observes that "LOO often results in high variance as an estimator for the test error" [5].
- The scikit-learn documentation and most authors recommend 5 or 10-fold cross-validation over LOOCV for practical use, stating: "As a general rule, most authors and empirical evidence suggest that 5 or 10-fold cross validation should be preferred to LOO" [5].

LOOCV is most useful for small datasets (fewer than 200 samples) where maximizing training data is critical and the computational cost of *n* model fits is manageable.
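A minimal sketch of LOOCV with scikit-learn's `LeaveOneOut` splitter follows. The dataset (iris) and the k-nearest-neighbors classifier are assumptions picked because both are small and quick to refit 150 times.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

loo = LeaveOneOut()
clf = KNeighborsClassifier(n_neighbors=5)

# One fit per sample; each score is 1.0 (correct) or 0.0 (wrong)
scores = cross_val_score(clf, X, y, cv=loo)

print(f"Number of fits: {len(scores)}")
print(f"LOOCV accuracy: {scores.mean():.3f}")
```

Because every validation set holds a single observation, the per-fold scores are binary, and only their mean is meaningful as an accuracy estimate.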

### Leave-p-out cross-validation

Leave-p-out cross-validation generalizes LOOCV by holding out *p* observations in each iteration. This produces C(*n*, *p*) distinct training/test splits, where C(*n*, *p*) is the binomial coefficient "*n* choose *p*." For even moderate values of *n* and *p*, the number of iterations becomes astronomically large. For instance, with *n* = 100 and *p* = 5, there are over 75 million possible splits.

Because of this combinatorial explosion, leave-p-out cross-validation is rarely used in practice. It serves primarily as a theoretical framework for understanding the properties of cross-validation estimators.
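The split count is easy to verify with the standard library, and scikit-learn's `LeavePOut` reports the same quantity. The five-sample array below is a toy assumption used to keep the enumeration tiny.

```python
import math

import numpy as np
from sklearn.model_selection import LeavePOut

# Number of distinct leave-5-out splits for 100 samples
n_splits_large = math.comb(100, 5)
print(n_splits_large)  # 75287520

# On a toy dataset the splits can actually be enumerated
X_small = np.arange(5).reshape(-1, 1)
lpo = LeavePOut(p=2)
n_splits_small = lpo.get_n_splits(X_small)
print(n_splits_small)  # 10, i.e. C(5, 2)
```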

### Repeated k-fold cross-validation

Repeated k-fold cross-validation runs the entire k-fold procedure multiple times, each time with a different random shuffling of the data. The results from all repetitions are then averaged. For example, 10 repetitions of 5-fold cross-validation (often written as 10 x 5 CV) produce 50 individual performance estimates that are averaged into the final score.

This approach reduces the variance of the performance estimate compared to a single run of k-fold cross-validation, at the cost of increased computation. It is particularly valuable when the dataset is small enough that the random assignment of observations to folds can noticeably affect results. A common configuration is 10 repeats of 10-fold cross-validation, yielding 100 individual estimates.
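As an illustration, the 10 x 5 configuration mentioned above can be run with `RepeatedStratifiedKFold`. The dataset and classifier here are assumptions chosen for speed, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 10 repetitions of 5-fold CV, each repetition with a fresh shuffle
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=rskf)

print(f"Individual estimates: {len(scores)}")
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```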

### Nested cross-validation

Nested cross-validation addresses the problem of simultaneously performing hyperparameter tuning and model evaluation without introducing optimistic bias. It uses two layers of cross-validation loops:

- **Outer loop:** Splits the data into training and test folds for estimating generalization performance.
- **Inner loop:** Within each outer training fold, further splits the data to perform hyperparameter search (for example, using grid search or random search).

The key insight is that if you use the same cross-validation procedure for both selecting hyperparameters and estimating performance, the resulting estimate will be optimistically biased. The test data is no longer statistically "pure" because it was indirectly used to guide hyperparameter choices. Varma and Simon demonstrated this empirically in 2006, reporting that "using CV to compute an error estimate for a classifier that has itself been tuned using CV gives a significantly biased estimate of the true error," and recommending nested cross-validation to obtain a nearly unbiased estimate [6]. Nested cross-validation eliminates this bias by ensuring the outer test fold is never seen during hyperparameter optimization.

A typical configuration uses 5 folds in the outer loop and 5 folds in the inner loop (5 x 5 nested CV). Each hyperparameter candidate is then fit 5 times within each outer fold, so a search over *m* candidates costs 5 x *m* inner fits per outer fold, or 25 x *m* in total, plus one final fit per outer fold to train the selected configuration before scoring it on the outer test fold. While computationally demanding, nested cross-validation provides a nearly unbiased estimate of how well the entire model selection and training pipeline will perform on new data.

### Time series cross-validation

Standard k-fold cross-validation assumes that data points are independent and identically distributed (i.i.d.). This assumption is violated in time series data, where observations have temporal dependencies. Randomly shuffling time series data and splitting it into folds would create data leakage: the model could train on future observations to predict past ones.

Time series cross-validation preserves the temporal order of observations using specialized strategies:

**Expanding window (forward chaining):** The training set starts with the earliest observations and grows with each iteration. The test set is always the next chronological block. For example, with five splits:

| Iteration | Training period | Test period |
|-----------|----------------|-------------|
| 1 | Months 1 to 3 | Month 4 |
| 2 | Months 1 to 4 | Month 5 |
| 3 | Months 1 to 5 | Month 6 |
| 4 | Months 1 to 6 | Month 7 |
| 5 | Months 1 to 7 | Month 8 |

**Sliding window (rolling window):** The training set has a fixed size and "slides" forward in time. Older observations are dropped as newer ones are added. This is preferable when the data distribution shifts over time (concept drift), making distant historical data less relevant.

| Iteration | Training period | Test period |
|-----------|----------------|-------------|
| 1 | Months 1 to 3 | Month 4 |
| 2 | Months 2 to 4 | Month 5 |
| 3 | Months 3 to 5 | Month 6 |
| 4 | Months 4 to 6 | Month 7 |
| 5 | Months 5 to 7 | Month 8 |

In scikit-learn, the `TimeSeriesSplit` class implements the expanding window approach [5].
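Both tables can be reproduced with `TimeSeriesSplit`. In this sketch the expanding window uses the class as is, and the sliding window is approximated by capping the training size with the `max_train_size` parameter. The eight-month toy series and the helper name `month_splits` are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

months = np.arange(1, 9)      # months 1..8 in chronological order
X = months.reshape(-1, 1)     # one toy observation per month

def month_splits(splitter):
    """Return (training months, test months) for every split."""
    return [
        (months[train_idx].tolist(), months[test_idx].tolist())
        for train_idx, test_idx in splitter.split(X)
    ]

# Expanding window: the training set grows by one month per iteration
expanding = month_splits(TimeSeriesSplit(n_splits=5, test_size=1))

# Sliding window: keep only the 3 most recent months for training
sliding = month_splits(TimeSeriesSplit(n_splits=5, test_size=1, max_train_size=3))

for name, rows in (("Expanding", expanding), ("Sliding", sliding)):
    print(name)
    for train_months, test_months in rows:
        print(f"  train {train_months} -> test {test_months}")
```

In both schemes every test month comes strictly after all of its training months, so no future information reaches the model.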

### Group k-fold cross-validation

Group k-fold cross-validation ensures that all observations belonging to the same group appear in the same fold. This is necessary when data contains groups of related observations that are not independent, such as multiple measurements from the same patient, multiple images from the same camera, or multiple transactions from the same customer.

If related observations are split across training and validation sets, the model may appear to generalize well simply because it recognizes the group rather than learning the underlying pattern. Group k-fold prevents this leakage by keeping entire groups together. Scikit-learn provides `GroupKFold`, `LeaveOneGroupOut`, and `StratifiedGroupKFold` for this purpose [5].
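A small sketch of `GroupKFold` follows, assuming four hypothetical patients with three measurements each. It checks that no patient contributes to both the training and the validation side of any split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)                   # 12 toy measurements
y = np.array([0, 1] * 6)
groups = np.repeat(["p1", "p2", "p3", "p4"], 3)    # hypothetical patient IDs

gkf = GroupKFold(n_splits=4)
group_splits = []
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    train_groups = set(groups[train_idx].tolist())
    val_groups = set(groups[val_idx].tolist())
    group_splits.append((train_groups, val_groups))
    print(f"train patients: {sorted(train_groups)}, "
          f"validation patients: {sorted(val_groups)}")
```

The `groups` array is passed to `split` alongside `X` and `y`; the splitter never looks at the feature values when deciding group membership.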

## Comparison of cross-validation methods

| Method | Number of fits | Bias | Variance | Best for |
|--------|---------------|------|----------|----------|
| 2-fold CV | 2 | High (50% training data) | Low | Very large datasets |
| 5-fold CV | 5 | Moderate (80% training data) | Moderate | Large datasets (100,000+ samples) |
| 10-fold CV | 10 | Low (90% training data) | Moderate | General-purpose use |
| LOOCV | *n* | Very low | High | Small datasets (under 200 samples) |
| Leave-p-out | C(*n*, *p*) | Low | Depends on *p* | Theoretical analysis |
| Repeated 10-fold (10x) | 100 | Low | Low | When variance reduction is critical |
| Stratified k-fold | *k* | Low | Lower than standard k-fold | Imbalanced classification |
| Nested CV (5x5) | 25 x *m* + 5 for *m* candidates | Unbiased for pipeline | Moderate | [Hyperparameter](/wiki/hyperparameter) tuning + evaluation |
| Time series split | *k* | Depends on window | Moderate | Temporal data |
| Group k-fold | *k* | Depends on groups | Moderate | Grouped or clustered data |

## How do you choose the value of k?

The choice of *k* in k-fold cross-validation involves a fundamental tradeoff between bias and variance:

**Bias:** With a smaller *k* (such as 2 or 3), each training set contains a smaller fraction of the total data. Models trained on less data tend to underperform compared to models trained on the full dataset, so the cross-validation estimate is pessimistically biased (it underestimates the true performance). As *k* increases, each training set approaches the size of the full dataset, reducing this bias. LOOCV (*k* = *n*) has the smallest bias.

**Variance:** With a larger *k*, the *k* training sets overlap substantially. When *k* = *n* (LOOCV), any two training sets share *n* - 2 of their *n* - 1 observations. Because of this heavy overlap, the fitted models, and hence the *k* performance estimates, are highly correlated, and their average can have high variance. Smaller values of *k* produce less correlated estimates, resulting in lower variance.

**Computation:** Larger values of *k* require more model training runs, increasing computational cost proportionally.

The consensus recommendation in the machine learning community, supported by both theoretical analysis and Kohavi's empirical study, is that *k* = 10 (or *k* = 5 for very large datasets) offers the best practical tradeoff [4][7].

## How does cross-validation differ from holdout and bootstrapping?

| Aspect | Holdout validation | K-fold cross-validation | Bootstrapping |
|--------|-------------------|------------------------|---------------|
| Method | Single train/test split | *k* rotated train/test splits | Resampling with replacement |
| Data usage | Inefficient (20-30% reserved) | Efficient (all data used for both) | Full dataset per sample |
| Bias | Can be high (depends on split) | Low to moderate (depends on *k*) | Depends on the estimator (out-of-bag error is pessimistic; .632 variants correct for it) |
| Variance | High (single estimate) | Lower (averaged over *k* folds) | Can be high (replacement sampling) |
| Computational cost | Lowest | Moderate (*k* model fits) | High (hundreds of resamples) |
| Determinism | Depends on split randomness | Reproducible with fixed seed | Reproducible with fixed seed |
| Use case | Quick prototyping, very large data | Standard model evaluation | Confidence intervals, small data |

Bootstrapping creates training sets by sampling *n* observations with replacement from the original dataset, meaning some observations appear multiple times while others are left out. Because each observation has a probability of 1 - (1 - 1/n)^n of being selected, which converges to 1 - 1/e (approximately 0.632) as *n* grows, a bootstrap sample contains on average about 63.2 percent of the unique observations, and the roughly 36.8 percent left out form the test set [4][8]. Each bootstrap training set matches the original dataset in size but holds only about 63.2 percent of its distinct observations, so the error measured on the left-out observations tends to be pessimistic. The .632 bootstrap corrects for this by combining the optimistic resubstitution error and the pessimistic leave-one-out bootstrap error with fixed weights of 0.368 and 0.632; Efron and Tibshirani's .632+ bootstrap (1997) refines it by adapting the weights to the degree of overfitting [8].
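The 0.632 figure can be checked numerically. The sketch below evaluates 1 - (1 - 1/n)^n for a few sample sizes and then draws one bootstrap sample to measure the fraction of distinct observations it contains. The sample size of 10,000 and the seed are arbitrary assumptions.

```python
import numpy as np

# Analytic inclusion probability for increasing n
for size in (10, 100, 1000, 10000):
    print(f"n = {size:>5}: {1 - (1 - 1 / size) ** size:.4f}")
print(f"limit 1 - 1/e: {1 - np.exp(-1):.4f}")

# Empirical check with a single bootstrap sample
n = 10_000
rng = np.random.default_rng(0)
sample = rng.integers(0, n, size=n)              # draw n indices with replacement
in_bag_fraction = np.unique(sample).size / n     # distinct observations drawn
out_of_bag = np.setdiff1d(np.arange(n), sample)  # observations never drawn

print(f"in-bag fraction:     {in_bag_fraction:.4f}")
print(f"out-of-bag fraction: {out_of_bag.size / n:.4f}")
```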

## Cross-validation for model selection vs. model assessment

It is important to distinguish between two distinct uses of cross-validation:

**Model selection** refers to choosing the best model, algorithm, or hyperparameter configuration. For example, you might use 5-fold cross-validation to compare a [random forest](/wiki/random_forest) with 100 trees versus 500 trees and select whichever achieves better average validation performance.

**Model assessment** refers to estimating the generalization error of the final chosen model. After selecting the best model through the model selection process, you need an honest estimate of how it will perform on completely new data.

Using the same cross-validation procedure for both purposes introduces selection bias: the chosen model's cross-validation score is an optimistically biased estimate of its true performance. This is because you are reporting the "winner" of a comparison, and the winner's score tends to overestimate true performance by chance. Cawley and Talbot quantified this effect in 2010, showing that over-fitting in model selection can produce performance estimates that are misleadingly optimistic [9].

Nested cross-validation solves this by separating the two steps. The inner loop handles model selection, while the outer loop provides an unbiased assessment of the entire selection procedure [6][9].

## What are common cross-validation mistakes?

### Data leakage

The most critical mistake in cross-validation is data leakage, where information from the validation set improperly influences the training process. Common sources of leakage include:

- **Preprocessing before splitting:** Fitting a scaler, performing feature selection, or computing statistics on the entire dataset before cross-validation. The correct approach is to fit preprocessing steps only on the training fold within each iteration.
- **Temporal leakage:** Using future data to predict the past in time series problems by applying standard k-fold instead of time-aware splitting.
- **Group leakage:** Splitting related observations (such as multiple measurements from the same subject) across training and validation sets.

In scikit-learn, the recommended way to prevent preprocessing leakage is to use `Pipeline` objects that encapsulate both preprocessing and modeling steps, ensuring that preprocessing is fit only on training data within each fold [5].
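A minimal sketch of that pattern: the scaler and the classifier are wrapped in one pipeline, so `cross_val_score` refits the scaler on the training portion of every fold. The dataset and the choice of logistic regression are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling statistics are learned inside each training fold, never on the
# held-out fold, because the scaler is part of the estimator being cloned.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=5)
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The leaky alternative would be to call `StandardScaler().fit_transform(X)` once on the full dataset and then cross-validate the classifier alone, which lets validation-fold statistics influence training.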

### Failure to stratify

For classification problems with imbalanced class distributions, using unstratified k-fold cross-validation can produce folds where minority classes are severely underrepresented or entirely absent. This leads to unreliable and highly variable performance estimates. Stratified k-fold should be the default choice for classification tasks.

### Reporting overly optimistic results

Using the same cross-validation split for hyperparameter tuning and final performance reporting produces optimistically biased results. The reported performance will be better than what the model actually achieves on new data. Nested cross-validation or a separate held-out test set should be used to obtain honest estimates.

### Ignoring variance in estimates

Reporting only the mean cross-validation score without the standard deviation across folds hides important information about estimate reliability. A model with a mean accuracy of 85 percent and a standard deviation of 2 percent is very different from one with the same mean but a standard deviation of 10 percent. Always report both the mean and variance (or standard deviation) of cross-validation scores.

## Implementation in scikit-learn

Scikit-learn provides a comprehensive suite of cross-validation tools in the `sklearn.model_selection` module [5].

### Key classes and functions

| Class/Function | Description |
|---------------|-------------|
| `cross_val_score` | Computes cross-validated scores for an estimator using a specified CV strategy |
| `cross_validate` | Similar to `cross_val_score` but returns multiple metrics, fit times, and optionally trained estimators |
| `cross_val_predict` | Returns cross-validated predictions for each data point (when in the test fold) |
| `KFold` | Standard k-fold splitter |
| `StratifiedKFold` | K-fold splitter that preserves class proportions |
| `RepeatedKFold` | Repeats k-fold with different random splits |
| `RepeatedStratifiedKFold` | Repeats stratified k-fold with different random splits |
| `LeaveOneOut` | Leave-one-out splitter |
| `LeavePOut` | Leave-p-out splitter |
| `GroupKFold` | K-fold splitter respecting group boundaries |
| `StratifiedGroupKFold` | Stratified k-fold respecting group boundaries |
| `TimeSeriesSplit` | Time series-aware expanding window splitter |
| `ShuffleSplit` | Random permutation train/test splitter |

### Example: basic cross-validation

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold stratified cross-validation (default for classifiers)
scores = cross_val_score(clf, X, y, cv=5)
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```
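When more than one metric is needed, `cross_validate` can be used in place of `cross_val_score`. The following is a sketch under the same setup as above; the two scoring names are standard scikit-learn scorer strings, and the smaller forest is an assumption to keep the run short.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=50, random_state=42)

results = cross_validate(
    clf, X, y, cv=5,
    scoring=["accuracy", "f1_macro"],  # evaluate two metrics per fold
    return_estimator=True,             # keep the fitted model from each fold
)

for metric in ("test_accuracy", "test_f1_macro"):
    vals = results[metric]
    print(f"{metric}: {vals.mean():.3f} (+/- {vals.std():.3f})")
print(f"mean fit time: {results['fit_time'].mean():.3f} s")
```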

### Example: nested cross-validation

```python
from sklearn.model_selection import cross_val_score, GridSearchCV, KFold
from sklearn.svm import SVC
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Inner loop: hyperparameter tuning
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
inner_cv = KFold(n_splits=5, shuffle=True, random_state=42)
clf = GridSearchCV(SVC(), param_grid, cv=inner_cv)

# Outer loop: performance estimation
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)
nested_scores = cross_val_score(clf, X, y, cv=outer_cv)
print(f"Nested CV Accuracy: {nested_scores.mean():.3f} (+/- {nested_scores.std():.3f})")
```

### Example: time series cross-validation

```python
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])

tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    print(f"Train: {train_index}, Test: {test_index}")
# Train: [0 1 2], Test: [3]
# Train: [0 1 2 3], Test: [4]
# Train: [0 1 2 3 4], Test: [5]
```

## Computational cost considerations

Cross-validation multiplies the computational cost of model training by the number of folds and repetitions. For a single run of k-fold cross-validation, *k* models must be trained. For repeated k-fold with *r* repetitions, *k* x *r* models are needed. Nested cross-validation with *k_outer* x *k_inner* folds and *m* hyperparameter combinations requires *k_outer* x *k_inner* x *m* model fits for the search, plus one refit of the selected configuration per outer fold when the search refits its best model (the default for scikit-learn's `GridSearchCV`).
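These counts can be tallied with a few lines of arithmetic. The helper names below are hypothetical, and the nested count assumes that each outer fold refits the selected configuration once, as `GridSearchCV` does by default with `refit=True`.

```python
def repeated_cv_fits(k, repeats=1):
    """Model fits for k-fold CV repeated `repeats` times."""
    return k * repeats

def nested_cv_fits(k_outer, k_inner, n_candidates, refit=True):
    """Model fits for nested CV with a search over n_candidates settings."""
    search_fits = k_outer * k_inner * n_candidates
    refits = k_outer if refit else 0  # one refit of the winner per outer fold
    return search_fits + refits

print(repeated_cv_fits(k=10))              # 10
print(repeated_cv_fits(k=10, repeats=10))  # 100

# The nested example earlier searches 3 values of C x 2 kernels = 6 candidates
print(nested_cv_fits(k_outer=5, k_inner=5, n_candidates=6))  # 155
```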

Several strategies help manage computational costs:

- **Reduce *k*:** Using 5-fold instead of 10-fold halves the number of fits with only a modest increase in bias.
- **Use stratified over repeated CV:** Stratified k-fold achieves variance reduction more efficiently than repeated k-fold for classification.
- **Parallelize:** Most implementations (including scikit-learn's `n_jobs` parameter) support parallel execution across folds.
- **Use approximate methods:** For linear models, closed-form LOOCV formulas exist (such as the PRESS statistic for linear regression) that compute the leave-one-out estimate without actually refitting the model *n* times.
- **Fall back to a single holdout split on very large datasets:** With millions of rows, a single holdout split may be sufficient because the performance estimate has low variance even without cross-validation.
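The closed-form shortcut for linear regression mentioned in the list above can be sketched as follows. For ordinary least squares, the leave-one-out residual of observation *i* equals its ordinary residual divided by 1 - h_ii, where h_ii is the *i*-th diagonal entry of the hat matrix. The synthetic data is an assumption, and the brute-force loop is included only to confirm that both routes agree.

```python
import numpy as np

# Synthetic linear data (hypothetical): intercept plus two features
rng = np.random.default_rng(0)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# Closed form: a single fit on all n observations
H = X @ np.linalg.pinv(X)                 # hat matrix, maps y to fitted values
residuals = y - H @ y
loo_residuals = residuals / (1.0 - np.diag(H))
press = np.sum(loo_residuals ** 2)        # PRESS statistic

# Brute force: n separate fits, each leaving one observation out
brute_residuals = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    brute_residuals[i] = y[i] - X[i] @ beta

print(f"PRESS (closed form): {press:.4f}")
print(f"PRESS (n refits):    {np.sum(brute_residuals ** 2):.4f}")
print(f"LOOCV MSE:           {press / n:.4f}")
```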

## Applications beyond model evaluation

While model evaluation is its primary use, cross-validation also plays a role in:

- **Feature selection:** Cross-validated performance can guide the inclusion or exclusion of features, helping identify the most informative subset.
- **[Ensemble methods](/wiki/ensemble_learning):** Stacking (stacked generalization) uses cross-validated predictions from base models as inputs to a meta-learner, preventing [overfitting](/wiki/overfitting) of the meta-learner to the base models' training predictions.
- **Statistical testing:** Paired cross-validation tests (such as the 5x2 CV paired t-test proposed by Dietterich in 1998, published in Neural Computation volume 10, pages 1895-1923) provide formal statistical comparisons between two classifiers. Dietterich found that a paired-differences t-test based on several random train-test splits has an elevated type I error and "should never be used," recommending the 5x2 CV test instead [10].
- **Learning curve estimation:** Running cross-validation with varying training set sizes produces learning curves that reveal whether a model suffers from high bias or high variance.
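To illustrate the stacking item above, the sketch below builds meta-features from out-of-fold predictions with `cross_val_predict`, so the meta-learner never sees a prediction that a base model made on its own training data. The choice of base models and dataset is an assumption; scikit-learn's `StackingClassifier` packages the same idea.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

base_models = [
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(random_state=0),
]

# Each row's probabilities come from a model that did not train on that row
meta_features = np.hstack([
    cross_val_predict(model, X, y, cv=5, method="predict_proba")
    for model in base_models
])

# The meta-learner is trained on the out-of-fold predictions
meta_learner = LogisticRegression(max_iter=1000).fit(meta_features, y)
print(meta_features.shape)  # (150, 6): 3 class probabilities per base model
```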

## References

1. Stone, M. (1974). "Cross-Validatory Choice and Assessment of Statistical Predictions." *Journal of the Royal Statistical Society, Series B*, 36(2), 111-147.
2. Geisser, S. (1975). "The Predictive Sample Reuse Method with Applications." *Journal of the American Statistical Association*, 70(350), 320-328.
3. Stone, M. (1977). "An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike's Criterion." *Journal of the Royal Statistical Society, Series B*, 39(1), 44-47.
4. Kohavi, R. (1995). "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection." *Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI)*, 1137-1145. https://www.ijcai.org/Proceedings/95-2/Papers/016.pdf
5. Pedregosa, F., Varoquaux, G., Gramfort, A., et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12, 2825-2830. Documentation: https://scikit-learn.org/stable/modules/cross_validation.html
6. Varma, S. and Simon, R. (2006). "Bias in Error Estimation When Using Cross-Validation for Model Selection." *BMC Bioinformatics*, 7, 91.
7. Hastie, T., Tibshirani, R., and Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*. 2nd ed. Springer.
8. Efron, B. and Tibshirani, R. (1997). "Improvements on Cross-Validation: The .632+ Bootstrap Method." *Journal of the American Statistical Association*, 92(438), 548-560.
9. Cawley, G.C. and Talbot, N.L.C. (2010). "On Over-Fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation." *Journal of Machine Learning Research*, 11, 2079-2107.
10. Dietterich, T.G. (1998). "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms." *Neural Computation*, 10(7), 1895-1923.
11. James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). *An Introduction to Statistical Learning*. Springer.
12. Arlot, S. and Celisse, A. (2010). "A Survey of Cross-Validation Procedures for Model Selection." *Statistics Surveys*, 4, 40-79.

