Out-of-bag evaluation (OOB evaluation)

Overview

Out-of-bag (OOB) evaluation, sometimes called out-of-bag estimation or OOB error, is a model validation technique used with bagging-based ensemble methods such as random forests and bagged decision trees. It estimates the generalization error of an ensemble model using the training instances that were left out of each base learner's bootstrap sample, eliminating the need for a separate validation set or for repeated cross-validation passes.

The technique was introduced by Leo Breiman in his 1996 University of California, Berkeley technical report "Out-of-Bag Estimation" and refined in his 2001 paper introducing random forests. Because OOB error is computed as a byproduct of training, it provides a built-in estimate of test accuracy at little additional computational cost. Breiman reported that the OOB estimate is about as accurate as an estimate from a test set of the same size as the training set, which makes it an attractive alternative to k-fold cross-validation when working with bagged ensembles.

In modern practice, OOB evaluation is most commonly accessed through the oob_score=True option on the RandomForestClassifier, RandomForestRegressor, BaggingClassifier, and BaggingRegressor estimators in scikit-learn, but the same idea is implemented in R's randomForest and ranger packages, in H2O, and in other ensemble learning libraries.

Definition

Given a training set with n observations and an ensemble of m base learners trained on m bootstrap samples, the OOB prediction for an observation x_i is the aggregated prediction of only those base learners whose bootstrap sample did not include x_i. The OOB error is the average loss between these OOB predictions and the true labels, computed across all n training observations. Because each observation is predicted only by base learners that never saw it during training, the OOB error mimics a held-out test estimate.

Mathematical foundation

OOB evaluation relies on a basic property of bootstrap sampling. When n observations are sampled uniformly at random with replacement from a set of size n, the probability that a given observation is not chosen on any single draw is (1 - 1/n). The probability that it is not chosen across all n draws (and so does not appear at all in that bootstrap sample) is

P(not selected) = (1 - 1/n)^n

As n grows large, this expression converges to the limit 1/e, which is approximately 0.3679. So roughly 36.8% of the original observations are left out of each bootstrap sample, and the complementary 63.2% of unique observations make it in. The set of observations not selected is referred to as the out-of-bag set for that particular base learner.

The table below shows how P(not selected) approaches 1/e even for modest sample sizes.

Sample size n	(1 - 1/n)^n	Approximate fraction OOB
2	0.2500	25.0%
5	0.3277	32.8%
10	0.3487	34.9%
50	0.3642	36.4%
100	0.3660	36.6%
1,000	0.3677	36.8%
10,000	0.3679	36.8%
Limit (n -> infinity)	1/e	36.79%
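The values in the table can be reproduced directly from the formula. The short Python sketch below does so; the function name prob_out_of_bag is illustrative only and does not come from any library.

```python
import math


def prob_out_of_bag(n: int) -> float:
    """Probability that one fixed observation is absent from a bootstrap sample of size n."""
    return (1.0 - 1.0 / n) ** n


for n in (2, 5, 10, 50, 100, 1_000, 10_000):
    print(f"n={n:>6}  P(not selected)={prob_out_of_bag(n):.4f}")

# The limiting value as n grows without bound
print(f"limit 1/e = {math.exp(-1):.4f}")
```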

This means that for any individual training observation x_i in an ensemble of m models, roughly 0.368 m of those models did not see x_i during training. Because none of those models was trained on x_i, their aggregated prediction behaves like the prediction of an ensemble of about 0.368 m base learners applied to held-out data. For large m, the OOB prediction therefore approximates the prediction of a smaller (but still substantial) ensemble on truly held-out data.

For a supervised learning task with loss function L(y, y_hat), the OOB error estimate is:

OOB error = (1 / n) * sum over i = 1 to n of L(y_i, y_hat_i^OOB)

where y_hat_i^OOB is the aggregated prediction (majority vote for classification, average for regression) over the subset of models that did not include x_i in their bootstrap sample. Common choices for L include zero-one loss (misclassification rate), Brier score, log loss, and mean squared error.

History

The technique grew out of two strands of research in the 1990s. Leo Breiman published "Bagging Predictors" in Machine Learning in 1996, which introduced the idea of training multiple base learners on bootstrap samples and aggregating their outputs. In a companion 1996 technical report titled "Out-of-Bag Estimation," Breiman argued that the bootstrap sampling step in bagging produced a free estimate of generalization error, because every training observation was held out by some fraction of the base learners.

In parallel, Tin Kam Ho published "Random Decision Forests" in 1995 at the Third International Conference on Document Analysis and Recognition, followed by "The Random Subspace Method for Constructing Decision Forests" in IEEE Transactions on Pattern Analysis and Machine Intelligence in 1998. Ho's work used random feature subsets rather than bootstrap samples but anticipated the broader idea of randomized tree ensembles.

Breiman synthesized bagging, Ho's random subspace approach, and ideas from Yoav Amit and Donald Geman's work on randomized geometric features into the random forest algorithm, which he published as "Random Forests" in Machine Learning in 2001. The OOB error was a central feature of the algorithm, used not only to estimate test accuracy but also to compute per-feature variable importance scores. By the time random forests became one of the most widely used machine learning algorithms in the early 2000s, OOB evaluation had become a standard tool in the ensemble learner's toolkit.

The broader bootstrap framework underlying OOB had been developed by Bradley Efron beginning with his 1979 paper "Bootstrap Methods: Another Look at the Jackknife" in The Annals of Statistics. Efron's 1983 paper "Estimating the Error Rate of a Prediction Rule" introduced the 0.632 bootstrap estimator, and Efron and Robert Tibshirani's 1997 paper "Improvements on Cross-Validation: The .632+ Bootstrap Method" extended it. These bootstrap estimators are conceptually related to OOB error but are designed for general predictive models, not specifically for bagged ensembles.

How OOB evaluation works

OOB evaluation runs alongside the normal training of a bagged ensemble. The procedure can be summarized in five steps.

Step 1: Train base learners on bootstrap samples

For each base learner b = 1, 2, ..., m, draw a bootstrap sample D_b of size n from the original training set D by sampling with replacement. Train the base learner h_b on D_b. As a byproduct, record the indices of the observations that were not selected for D_b. Call this set OOB_b.

Step 2: Track OOB membership per observation

For each training observation x_i, identify the set S_i of base learners for which x_i belongs to the OOB set, that is, S_i = {b : i is in OOB_b}. With m base learners and n large, the expected size of S_i is approximately 0.368 m.
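Steps 1 and 2 can be illustrated without training any model, since OOB membership depends only on the bootstrap draws. The following is a minimal sketch using Python's standard library; the helper names draw_oob_sets and learners_per_observation are hypothetical and chosen for this illustration.

```python
import random


def draw_oob_sets(n: int, m: int, seed: int = 0) -> list:
    """For each of m base learners, return the set of indices left out of its bootstrap sample."""
    rng = random.Random(seed)
    all_indices = set(range(n))
    oob_sets = []
    for _ in range(m):
        in_bag = {rng.randrange(n) for _ in range(n)}  # n draws with replacement
        oob_sets.append(all_indices - in_bag)          # OOB_b
    return oob_sets


def learners_per_observation(oob_sets: list, n: int) -> list:
    """Invert the OOB sets: S[i] lists the learners b for which i is in OOB_b."""
    s = [[] for _ in range(n)]
    for b, oob in enumerate(oob_sets):
        for i in oob:
            s[i].append(b)
    return s


n, m = 500, 200
oob_sets = draw_oob_sets(n, m)
S = learners_per_observation(oob_sets, n)

# Average share of learners for which an observation is out-of-bag
mean_fraction = sum(len(s_i) for s_i in S) / (n * m)
print(f"mean |S_i| / m = {mean_fraction:.3f}")
```

With these settings the printed fraction should land close to the theoretical 0.368, although the exact value depends on the seed.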

Step 3: Generate OOB predictions

For each observation x_i, aggregate the predictions of only those base learners in S_i to obtain the OOB prediction y_hat_i^OOB. The aggregation rule depends on the task:

Task type	Aggregation rule
Classification (hard voting)	Majority vote over predicted classes from base learners in S_i
Classification (soft voting)	Average predicted class probabilities, then take argmax
Regression	Arithmetic mean of base learner predictions
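The aggregation rules in the table can give different answers on the same inputs. The sketch below uses made-up probabilities from three hypothetical OOB learners to show a case where hard and soft voting disagree; the function names are illustrative, not part of any library.

```python
from collections import Counter


def hard_vote(class_predictions):
    """Majority vote over predicted class labels."""
    return Counter(class_predictions).most_common(1)[0][0]


def soft_vote(probabilities):
    """Average the per-class probabilities, then return the index of the largest mean."""
    n_classes = len(probabilities[0])
    mean = [sum(p[k] for p in probabilities) / len(probabilities) for k in range(n_classes)]
    return max(range(n_classes), key=lambda k: mean[k])


def mean_prediction(values):
    """Arithmetic mean of regression predictions."""
    return sum(values) / len(values)


# Class probabilities from three OOB learners for one observation (illustrative numbers)
probs = [[0.55, 0.45], [0.55, 0.45], [0.10, 0.90]]
labels = [max(range(2), key=lambda k: p[k]) for p in probs]  # per-learner hard labels

print("hard vote:", hard_vote(labels))   # two of three learners pick class 0
print("soft vote:", soft_vote(probs))    # mean probabilities are [0.4, 0.6], so class 1
print("regression:", mean_prediction([2.0, 3.0, 4.0]))
```

Two weakly confident learners outvote one strongly confident learner under hard voting, while soft voting lets the confident learner dominate.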

Step 4: Compute OOB error

Compare the OOB predictions to the true labels using a loss function appropriate to the task and average across all observations:

For classification, use zero-one loss to compute the OOB misclassification rate, or use log loss or Brier score for probabilistic outputs.
For regression, use mean squared error, mean absolute error, or another regression loss.

The resulting average is the OOB error estimate.
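Steps 1 through 4 can be sketched end to end in plain Python. The example below is an illustration under stated assumptions rather than a reference implementation: the base learner is a one-dimensional threshold rule ("stump") invented for this sketch, the data are synthetic, and observations that happen to have no OOB learner are skipped rather than counted.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Return the threshold t with the fewest training errors for the rule 'predict 1 if x > t'."""
    best_t, best_err = None, None
    for t in sorted(set(xs)):
        err = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def oob_error(xs, ys, m=60, seed=0):
    """Bag m stumps and return (OOB misclassification rate, number of scored observations)."""
    rng = random.Random(seed)
    n = len(xs)
    votes = [Counter() for _ in range(n)]               # OOB votes per observation
    for _ in range(m):
        bag = [rng.randrange(n) for _ in range(n)]      # Step 1: bootstrap sample
        t = fit_stump([xs[i] for i in bag], [ys[i] for i in bag])
        for i in set(range(n)) - set(bag):              # Step 2: OOB rows for this learner
            votes[i][int(xs[i] > t)] += 1               # Step 3: collect OOB votes
    scored = [i for i in range(n) if votes[i]]          # rows with at least one OOB vote
    wrong = sum(votes[i].most_common(1)[0][0] != ys[i] for i in scored)
    return wrong / len(scored), len(scored)             # Step 4: zero-one loss


# Synthetic two-class data: class 1 is shifted upward by two units
data_rng = random.Random(42)
ys = [i % 2 for i in range(120)]
xs = [data_rng.gauss(2.0 * y, 1.0) for y in ys]

err, n_scored = oob_error(xs, ys)
print(f"OOB misclassification rate: {err:.3f} over {n_scored} observations")
```

Ties in the majority vote are broken in favor of whichever class received its first vote earlier, which is a simplification of this sketch.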

Step 5: Use OOB error as a generalization estimate

The OOB error can be used in several ways: as a final estimate of generalization performance, as an early stopping criterion (monitoring OOB error as more trees are added), or as a tuning signal for hyperparameters such as the number of features sampled at each split.

OOB error in random forests

Random forests are the canonical use case for OOB error. In a random forest with m trees, each tree is grown on a bootstrap sample of the training data, and at each split the tree considers only a random subset of features (typically sqrt(p) for classification or p/3 for regression where p is the number of features). The bootstrap step alone is enough to enable OOB evaluation; the feature subsampling adds additional decorrelation between trees but does not change the OOB mechanism.

Breiman's 2001 random forest paper reported empirical comparisons showing that OOB error tracked the true test error closely across a wide range of UCI Machine Learning Repository benchmarks. The OOB error estimate is essentially unbiased once the ensemble has enough trees, typically a few hundred for most datasets.

Convergence and number of trees

OOB error decreases (or fluctuates and then plateaus) as more trees are added to the forest. With too few trees, some training observations may have very few or no base learners in their OOB set, which makes their individual OOB predictions noisy. As m grows, every observation accumulates a substantial number of OOB predictions and the OOB error converges to a stable estimate. A common heuristic is to plot OOB error against the number of trees and stop adding trees once the curve flattens.
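One way to produce such a curve in scikit-learn is to grow a single forest incrementally with warm_start=True and record the OOB error after each batch of trees. The sketch below assumes scikit-learn is installed; the dataset, the tree counts, and the step size are arbitrary choices for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(
    warm_start=True,   # keep the trees already grown and fit only the new ones
    oob_score=True,
    random_state=0,
)

oob_curve = []
for n_trees in range(50, 301, 50):
    rf.set_params(n_estimators=n_trees)
    rf.fit(X, y)
    oob_curve.append((n_trees, 1.0 - rf.oob_score_))  # OOB error = 1 - OOB accuracy

for n_trees, err in oob_curve:
    print(f"{n_trees:>4} trees  OOB error = {err:.4f}")
```

Starting the curve at a moderate number of trees avoids the regime in which some observations have no OOB prediction at all.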

Example in scikit-learn

The following Python example trains a random forest on the Wisconsin Breast Cancer dataset and reports both the test accuracy and the OOB score.

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf = RandomForestClassifier(
    n_estimators=500,
    oob_score=True,        # turn on out-of-bag scoring
    bootstrap=True,        # OOB requires bootstrap sampling
    n_jobs=-1,
    random_state=42,
)
rf.fit(X_train, y_train)

print(f"OOB score:      {rf.oob_score_:.4f}")
print(f"Test accuracy:  {rf.score(X_test, y_test):.4f}")

Key scikit-learn parameters and attributes related to OOB evaluation are summarized below.

Parameter or attribute	Description
`bootstrap=True`	Required for OOB; controls whether each tree is built on a bootstrap sample
`oob_score=True`	Enables OOB score computation after `fit`
`oob_score_`	Mean accuracy (classifier) or R^2 (regressor) on OOB samples
`oob_decision_function_`	Per-sample OOB class probabilities (classifier only)
`oob_prediction_`	Per-sample OOB predictions (regressor only)
`n_estimators`	Number of trees; OOB requires enough trees so each sample is OOB for several trees

If oob_score=True is combined with bootstrap=False, scikit-learn raises a ValueError at fit time, because without bootstrap sampling no observations are left out of training.
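The per-sample attributes listed in the table make it possible to compute OOB metrics other than the default accuracy. The following sketch, which assumes scikit-learn and NumPy are available, derives OOB log loss and per-class recall from oob_decision_function_; the choice of metrics and dataset is illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss, recall_score

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0).fit(X, y)

proba = rf.oob_decision_function_            # shape (n_samples, n_classes)
covered = ~np.isnan(proba).any(axis=1)       # rows that were OOB for at least one tree
oob_labels = rf.classes_[np.argmax(proba[covered], axis=1)]

print("OOB accuracy:", accuracy_score(y[covered], oob_labels))
print("OOB log loss:", log_loss(y[covered], proba[covered], labels=rf.classes_))
print("OOB recall per class:", recall_score(y[covered], oob_labels, average=None))
```

The covered mask guards against rows with no OOB prediction, which can occur when the number of trees is small.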

Variable importance via OOB permutation

One of the most influential applications of OOB evaluation is permutation variable importance, also introduced by Breiman in the 2001 random forest paper. The idea is to measure how much a model's predictive performance degrades when the values of a single feature are randomly shuffled, breaking any relationship between that feature and the response.

The procedure for OOB permutation importance is:

Train the random forest and compute the baseline OOB error (or accuracy) for each tree using its OOB samples.
For each feature j, randomly permute the values of feature j among the OOB samples for each tree, then recompute the OOB error using the permuted feature.
The importance of feature j is the average increase in OOB error (or decrease in accuracy) across all trees, often standardized by dividing by the standard deviation across trees.
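The procedure above can be sketched with a hand-rolled bagging loop, which keeps each tree's OOB rows explicitly available. This is a simplified illustration, assuming scikit-learn and NumPy are installed: it uses 50 trees, omits the optional standardization by the standard deviation across trees, and is not how any particular library implements the measure.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y, names = data.data, data.target, data.feature_names
n, p = X.shape
rng = np.random.default_rng(0)
n_trees = 50
increase = np.zeros((n_trees, p))   # per-tree increase in OOB error for each feature

for b in range(n_trees):
    bag = rng.integers(0, n, size=n)                 # bootstrap indices for tree b
    oob = np.setdiff1d(np.arange(n), bag)            # rows tree b never saw
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
    tree.fit(X[bag], y[bag])
    base_err = np.mean(tree.predict(X[oob]) != y[oob])   # baseline OOB error
    for j in range(p):
        X_perm = X[oob].copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])     # shuffle feature j among OOB rows
        increase[b, j] = np.mean(tree.predict(X_perm) != y[oob]) - base_err

importance = increase.mean(axis=0)   # average increase in OOB error across trees
for j in np.argsort(importance)[::-1][:5]:
    print(f"{names[j]:<25} {importance[j]:.4f}")
```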

Features that are critical to the model's predictions show a large drop in OOB accuracy when permuted, while features that contribute little show a small drop. This approach has several advantages over simpler importance measures based on splits or impurity reduction:

Importance measure	Basis	Bias toward high-cardinality features	Computational cost
Mean decrease in impurity (MDI / Gini)	Sum of impurity reductions across all splits using the feature	High; favors continuous and high-cardinality categorical features	Low (computed during training)
OOB permutation importance	Drop in OOB accuracy when feature is shuffled	Lower; less prone to high-cardinality bias	Moderate; requires extra OOB passes
SHAP values	Game-theoretic attribution of model output	Low	High

Permutation importance is implemented in scikit-learn as sklearn.inspection.permutation_importance, which can be applied to any fitted estimator (including random forests). It scores the model on whatever data is passed to it, typically a held-out validation set or the training data, and does not use the per-tree OOB samples.

A known caveat is that when two features are strongly correlated, permuting one of them may have only a small effect on OOB error because the other still carries the same information. As a result, permutation importance can underestimate the importance of correlated features. Conditional permutation importance and grouped permutation importance address this in part.

Comparison to cross-validation and holdout

OOB error, cross-validation, and holdout validation all estimate generalization error but differ in cost, bias, and applicability. The table below summarizes the trade-offs.

Property	OOB evaluation	k-fold cross-validation	Hold-out (single split)
Applicable to	Bagged ensembles only (random forest, bagged trees, etc.)	Any model	Any model
Extra training passes required	None (uses ensemble already trained)	k full retrainings	None beyond one training run
Effective fraction of data used per estimate	~36.8% OOB for each base learner; full dataset across the ensemble	(k-1)/k for training, 1/k for testing per fold	User-defined (e.g., 80/20)
Bias of estimate	Approximately equivalent to leave-one-out for the ensemble	Slight pessimistic bias for small k	Depends on split
Variance of estimate	Low for large m	Moderate; depends on k	High; depends on the single split
Suitable for hyperparameter tuning	Yes, but care needed (no separate validation set)	Yes	Yes
Suitable for time series	No (random sampling breaks temporal order)	Special variants needed	Yes with chronological split
Computational cost	Negligible additional cost	k times training cost	One training cost

Breiman's 1996 OOB technical report and follow-up empirical studies showed that OOB error and 10-fold cross-validation give very similar estimates for random forests and bagged trees, with OOB tending to be slightly pessimistic when the ensemble is small. For practical purposes, OOB error is the preferred estimator when working with bagged ensembles because it avoids the k-fold retraining cost entirely.
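The cost difference is easy to see in code. The sketch below, which assumes scikit-learn is installed, obtains an OOB accuracy from a single fit and a 10-fold cross-validated accuracy from ten additional fits on the same data. The dataset and settings are illustrative, and the two numbers are expected to be close but not identical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# One fit yields the OOB estimate as a byproduct
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
oob_acc = rf.oob_score_

# 10-fold cross-validation refits the forest ten times
cv_model = RandomForestClassifier(n_estimators=200, random_state=0)
cv_acc = cross_val_score(cv_model, X, y, cv=10).mean()

print(f"OOB accuracy:        {oob_acc:.4f}")
print(f"10-fold CV accuracy: {cv_acc:.4f}")
```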

For non-bagging models such as a single decision tree, logistic regression, support vector machine, or neural network, OOB evaluation is not directly applicable and cross-validation or hold-out remains the standard.

The 0.632 and 0.632+ bootstrap

Closely related to OOB evaluation are two bootstrap-based error estimators developed by Bradley Efron and Robert Tibshirani in the 1980s and 1990s.

0.632 bootstrap

Efron's 1983 paper "Estimating the Error Rate of a Prediction Rule" introduced the 0.632 bootstrap estimator. It addresses a known issue with the naive bootstrap error estimate (the average error on each bootstrap sample's OOB set), which tends to be pessimistically biased because each bootstrap model is trained on only about 63.2% of the unique observations. The 0.632 estimator combines the OOB-style error with the resubstitution (training) error using the constant 0.632:

err_0.632 = 0.368 * err_train + 0.632 * err_OOB

The weights are chosen because, in expectation, each bootstrap training sample contains 63.2% of the unique original observations. The 0.632 estimator can have lower bias than either err_train (which underestimates) or err_OOB (which overestimates) alone, particularly for unstable learners.
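As a worked example of the formula, the short sketch below evaluates the 0.632 estimator for illustrative error rates. The function name err_632 and the input values are made up for this example.

```python
def err_632(err_train: float, err_oob: float) -> float:
    """0.632 bootstrap estimate: a fixed blend of training error and OOB-style error."""
    return 0.368 * err_train + 0.632 * err_oob


# Illustrative inputs: 2% training error, 12% OOB-style error
estimate = err_632(0.02, 0.12)
print(f"err_0.632 = {estimate:.4f}")  # 0.368 * 0.02 + 0.632 * 0.12 = 0.0832
```

The estimate always lies between the two inputs, closer to the OOB-style error because of its larger weight.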

0.632+ bootstrap

The 0.632 estimator can fail in cases of severe overfitting, where the training error is essentially zero but the model has memorized the data. Efron and Tibshirani's 1997 paper "Improvements on Cross-Validation: The .632+ Bootstrap Method" introduced the 0.632+ bootstrap estimator, which adjusts the weighting based on a measure of overfitting called the relative overfitting rate R:

err_0.632+ = (1 - w) * err_train + w * err_OOB

where w = 0.632 / (1 - 0.368 R) and R is the relative overfitting rate, with R = 0 corresponding to no overfitting (recovers the standard 0.632 estimator) and R = 1 corresponding to maximal overfitting (collapses to err_OOB). The 0.632+ method gives a more honest error estimate when the underlying learner severely overfits.
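The effect of R on the weighting can be checked numerically. The sketch below takes R as a given input rather than estimating it from data, and the function name err_632_plus and the input values are illustrative.

```python
def err_632_plus(err_train: float, err_oob: float, r: float) -> float:
    """0.632+ estimate for a given relative overfitting rate r in [0, 1]."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    w = 0.632 / (1.0 - 0.368 * r)   # w grows from 0.632 (r = 0) to 1 (r = 1)
    return (1.0 - w) * err_train + w * err_oob


# Illustrative inputs: 2% training error, 12% OOB-style error
for r in (0.0, 0.5, 1.0):
    print(f"R = {r:.1f}  err_0.632+ = {err_632_plus(0.02, 0.12, r):.4f}")
```

At R = 0 the result matches the plain 0.632 estimate, and at R = 1 it equals err_OOB, with intermediate values of R moving the estimate between the two.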

Relationship to OOB evaluation

The 0.632 and 0.632+ estimators apply to general predictive models trained on bootstrap samples, not just to bagged ensembles. They share the same statistical foundation as OOB evaluation (the (1 - 1/n)^n -> 1/e limit) but combine the OOB-style error with the training error to correct for known biases. In bagged ensembles, Breiman's OOB error plays the role of err_OOB used on its own, without blending in the training error, and it is computed from aggregated OOB predictions rather than from the errors of individual bootstrap-trained models. When the goal is to evaluate the ensemble itself rather than a single bootstrap-trained model, this unblended OOB error is usually preferred.

Estimator	Formula	Best for
Naive resubstitution	err_train	Lower bound only; severely biased downward
Leave-one-out CV	Average error over n folds of size n-1	General models; high computational cost
OOB (Breiman)	Average loss between aggregated OOB predictions and labels	Bagging ensembles; near-zero extra cost
0.632 bootstrap (Efron 1983)	0.368 err_train + 0.632 err_OOB	Single bootstrap-trained models; corrects pessimistic bias
0.632+ bootstrap (Efron and Tibshirani 1997)	Adaptive weighting using relative overfitting rate R	Severely overfitting learners

Limitations and best practices

OOB evaluation is one of the most useful tools in ensemble learning, but it has well-known limitations.

Limitations

Requires bootstrap sampling. OOB evaluation is meaningless without it. Models such as Extra-Trees (Extremely Randomized Trees) by default do not bootstrap, in which case OOB scoring must be turned off or bootstrap=True must be set.
Small ensembles produce noisy OOB estimates. With only a handful of base learners, some training observations may be predicted by very few OOB models or none at all. A typical guideline is to use at least 200 to 500 base learners.
Can be biased in small samples. When the training set is very small (say n < 30), a bootstrap sample may contain extreme duplication and the OOB set may not be representative of the population, so the OOB error can differ noticeably from the true test error.
Not directly applicable to time series or grouped data. Bootstrap sampling assumes exchangeable observations. For time series data, randomly resampled bootstrap samples break the temporal dependence structure, and OOB error can be optimistically biased. Block bootstrap or specialized time-series cross-validation is preferred in this setting.
Permutation importance based on OOB can be misleading for correlated features. As noted earlier, two strongly correlated features may both show low importance even if either alone would be highly predictive.
Class imbalance can distort OOB error. With heavily imbalanced classes, the OOB error may be dominated by the majority class. Stratified sampling, class-weighted loss, or class-specific OOB metrics (precision, recall, F1) can address this.

Best practices

Use oob_score=True whenever you train a random forest or bagging ensemble. The cost is negligible and it provides a free generalization estimate.
Monitor OOB error against the number of trees to choose a sensible value of n_estimators. Stop adding trees once the OOB curve plateaus.
Cross-check with a held-out test set for any model that will be deployed. OOB error is a strong estimate but is not a substitute for evaluation on truly unseen data when the stakes are high.
Use OOB permutation importance rather than mean decrease in impurity (MDI) when interpreting feature importance, especially if features have different scales or cardinalities.
Switch to time-aware validation for temporal data. Replace OOB with a chronological hold-out, walk-forward validation, or block bootstrap variants designed for time series.
Record the random seed. OOB scores depend on the bootstrap draws, which depend on the seed. Pinning random_state makes results reproducible.

Implementation in popular libraries

OOB evaluation is supported across many of the major machine learning libraries.

Library or framework	OOB support	How to enable
scikit-learn (Python)	Yes; `RandomForestClassifier`, `RandomForestRegressor`, `BaggingClassifier`, `BaggingRegressor`	`oob_score=True` and `bootstrap=True`
randomForest (R)	Yes; OOB error printed by default	Automatic; reported in `print(model)` and `plot(model)`
ranger (R)	Yes; computes OOB error and prediction by default	Automatic; available in `model$prediction.error`
H2O (Java/Python/R)	Yes; reports OOB metrics for `H2ORandomForestEstimator`	Automatic when `nfolds = 0`
Spark MLlib	Limited; MLlib does not expose OOB by default	Compute manually using bootstrap indices
XGBoost	No; XGBoost is a boosting method without bootstrap sampling	Use cross-validation instead
LightGBM	No (default); supports row-bagging via `bagging_fraction`, but does not expose OOB error	Use cross-validation
CatBoost	No native OOB; relies on cross-validation	Use cross-validation

Note that gradient boosting libraries such as XGBoost, LightGBM, and CatBoost are based on sequential boosting rather than bagging. While some of them implement optional row-subsampling that resembles bagging, the sequential nature of boosting means the OOB framework does not transfer cleanly, so cross-validation is the standard evaluation method for these libraries.

Explain like I'm 5 (ELI5)

Imagine your class is divided into 100 small reading groups, and each group is given a random handful of storybooks from the library. Because the handfuls are picked with replacement, some books end up in lots of groups and some books end up in only a few groups. About one third of the books never make it into any given group's pile.

Later, the teacher wants to know how good the class is at understanding new stories. Instead of giving everyone a brand new book, the teacher does something clever: for each book, she finds all the groups that did not see it, asks them what they think the story is about, and combines their answers. Since those groups have never read the book, their combined answer is a fair test of how well the class can handle a new story.

That is what out-of-bag evaluation does. Each tree in the random forest is one little reading group, each training row is one storybook, and the OOB error is the score the class gets when only the groups that have never seen a row try to predict it. The whole trick is free because the trees were going to be trained anyway, and the rows they did not use were just sitting there waiting to be useful.

References

Breiman, L. (1996a). "Bagging Predictors." *Machine Learning*, 24(2), 123-140. doi:10.1007/BF00058655
Breiman, L. (1996b). "Out-of-Bag Estimation." Technical Report, Statistics Department, University of California, Berkeley.
Breiman, L. (2001). "Random Forests." *Machine Learning*, 45(1), 5-32. doi:10.1023/A:1010933404324
Ho, T.K. (1995). "Random Decision Forests." *Proceedings of the Third International Conference on Document Analysis and Recognition*, Montreal, Vol. 1, 278-282.
Ho, T.K. (1998). "The Random Subspace Method for Constructing Decision Forests." *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 20(8), 832-844.
Efron, B. (1979). "Bootstrap Methods: Another Look at the Jackknife." *The Annals of Statistics*, 7(1), 1-26.
Efron, B. (1983). "Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation." *Journal of the American Statistical Association*, 78(382), 316-331.
Efron, B. & Tibshirani, R.J. (1997). "Improvements on Cross-Validation: The .632+ Bootstrap Method." *Journal of the American Statistical Association*, 92(438), 548-560.
Efron, B. & Tibshirani, R.J. (1993). *An Introduction to the Bootstrap*. Chapman & Hall.
Hastie, T., Tibshirani, R. & Friedman, J. (2009). *The Elements of Statistical Learning*, 2nd ed. Springer. Chapter 15: Random Forests.
Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12, 2825-2830.
Strobl, C., Boulesteix, A.L., Zeileis, A. & Hothorn, T. (2007). "Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution." *BMC Bioinformatics*, 8, 25.
Janitza, S., Celik, E. & Boulesteix, A.L. (2018). "A Computationally Fast Variable Importance Test for Random Forests for High-Dimensional Data." *Advances in Data Analysis and Classification*, 12, 885-915.
Wright, M.N. & Ziegler, A. (2017). "ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R." *Journal of Statistical Software*, 77(1), 1-17.
Probst, P. & Boulesteix, A.L. (2018). "To Tune or Not to Tune the Number of Trees in Random Forest." *Journal of Machine Learning Research*, 18(181), 1-18.
