Out-of-bag evaluation (OOB evaluation)
Last reviewed
Apr 30, 2026
Sources
15 citations
Review status
Source-backed
Revision
v4 · 4,053 words
See also: Machine learning terms, Bagging, Random forest, Cross-validation
Out-of-bag (OOB) evaluation, sometimes called out-of-bag estimation or OOB error, is a model validation technique used with bagging-based ensemble methods such as random forests and bagged decision trees. It estimates the generalization error of an ensemble model using the training instances that were left out of each base learner's bootstrap sample, eliminating the need for a separate validation set or for repeated cross-validation passes.
The technique was introduced by Leo Breiman in his 1996 University of California, Berkeley technical report "Out-of-Bag Estimation" and refined in his 2001 paper introducing random forests. Because OOB error is computed as a byproduct of training, it provides a built-in estimate of test accuracy at essentially no additional computational cost. Breiman reported empirical evidence that the OOB estimate is about as accurate as using a test set of the same size as the training set, which makes it an attractive replacement for k-fold cross-validation when working with bagged ensembles.
In modern practice, OOB evaluation is most commonly accessed through the oob_score=True option on the RandomForestClassifier, RandomForestRegressor, BaggingClassifier, and BaggingRegressor estimators in scikit-learn, but the same idea is implemented in R's randomForest and ranger packages, in H2O, and in many other ensemble learning libraries.
Given a training set with n observations and an ensemble of m base learners trained on m bootstrap samples, the OOB prediction for an observation x_i is the aggregated prediction of only those base learners whose bootstrap sample did not include x_i. The OOB error is the average loss between these OOB predictions and the true labels, computed across all n training observations. Because each observation is predicted only by base learners that never saw it during training, the OOB error mimics a held-out test estimate.
OOB evaluation relies on a basic property of bootstrap sampling. When n observations are sampled uniformly at random with replacement from a set of size n, the probability that a given observation is not chosen on any single draw is (1 - 1/n). The probability that it is not chosen across all n draws (and so does not appear at all in that bootstrap sample) is
P(not selected) = (1 - 1/n)^n
As n grows large, this expression converges to the limit 1/e, which is approximately 0.3679. So roughly 36.8% of the original observations are left out of each bootstrap sample, and the complementary 63.2% of unique observations make it in. The set of observations not selected is referred to as the out-of-bag set for that particular base learner.
The table below shows how P(not selected) approaches 1/e even for modest sample sizes.
| Sample size n | (1 - 1/n)^n | Approximate fraction OOB |
|---|---|---|
| 2 | 0.2500 | 25.0% |
| 5 | 0.3277 | 32.8% |
| 10 | 0.3487 | 34.9% |
| 50 | 0.3642 | 36.4% |
| 100 | 0.3660 | 36.6% |
| 1,000 | 0.3677 | 36.8% |
| 10,000 | 0.3679 | 36.8% |
| Limit (n -> infinity) | 1/e | 36.79% |
This means that for any individual training observation x_i in an ensemble of m models, roughly 0.368 m of those models did not see x_i during training. Those models can collectively make a prediction on x_i that is statistically equivalent to a prediction made by an ensemble of about 0.368 m base learners trained on completely separate data, so for large m the OOB prediction approximates the prediction of a smaller (but still substantial) ensemble on truly held-out data.
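The limit and the 0.368 m figure can be checked numerically. The sketch below first evaluates (1 - 1/n)^n for the sample sizes in the table, then simulates bootstrap sampling with NumPy. The sizes n = 1,000 and m = 200 and the fixed seed are arbitrary choices for illustration, not values from any reference implementation.

```python
import math
import numpy as np

def oob_probability(n: int) -> float:
    """Probability that a given observation is absent from a bootstrap sample of size n."""
    return (1.0 - 1.0 / n) ** n

for size in (2, 5, 10, 50, 100, 1_000, 10_000):
    print(f"n={size:>6}  (1 - 1/n)^n = {oob_probability(size):.4f}")
print(f"limit 1/e  = {1.0 / math.e:.4f}")

# Simulation: m bootstrap samples over n observations (illustrative sizes).
rng = np.random.default_rng(0)
n, m = 1_000, 200
oob_mask = np.ones((m, n), dtype=bool)   # oob_mask[b, i]: row i is absent from sample b
for b in range(m):
    drawn = rng.integers(0, n, size=n)   # n draws with replacement
    oob_mask[b, drawn] = False

oob_fraction = oob_mask.mean(axis=1)      # share of rows left out of each sample
learners_per_row = oob_mask.sum(axis=0)   # |S_i|: learners that never saw row i
print(f"mean OOB fraction per sample: {oob_fraction.mean():.3f}")
print(f"mean OOB learners per row:    {learners_per_row.mean():.1f} of {m}")
```

Both simulated averages should land close to 0.368 and 0.368 m respectively, with small run-to-run variation if the seed is changed.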
For a supervised learning task (classification or regression) with loss function L(y, y_hat), the OOB error estimate is:
OOB error = (1 / n) * sum over i = 1 to n of L(y_i, y_hat_i^OOB)
where y_hat_i^OOB is the aggregated prediction (majority vote for classification, average for regression) over the subset of models that did not include x_i in their bootstrap sample. Common choices for L include zero-one loss (misclassification rate), Brier score, log loss, and mean squared error.
The technique grew out of two strands of research in the 1990s. Leo Breiman published "Bagging Predictors" in Machine Learning in 1996, which introduced the idea of training multiple base learners on bootstrap samples and aggregating their outputs. In a companion 1996 technical report titled "Out-of-Bag Estimation," Breiman argued that the bootstrap sampling step in bagging produced a free estimate of generalization error, because every training observation was held out by some fraction of the base learners.
In parallel, Tin Kam Ho published "Random Decision Forests" in 1995 at the Third International Conference on Document Analysis and Recognition, followed by "The Random Subspace Method for Constructing Decision Forests" in IEEE Transactions on Pattern Analysis and Machine Intelligence in 1998. Ho's work used random feature subsets rather than bootstrap samples but anticipated the broader idea of randomized tree ensembles.
Breiman synthesized bagging, Ho's random subspace approach, and ideas from Yoav Amit and Donald Geman's work on randomized geometric features into the random forest algorithm, which he published as "Random Forests" in Machine Learning in 2001. The OOB error was a central feature of the algorithm, used not only to estimate test accuracy but also to compute per-feature variable importance scores. By the time random forests became one of the most widely used machine learning algorithms in the early 2000s, OOB evaluation had become a standard tool in the ensemble learner's toolkit.
The broader bootstrap framework underlying OOB had been developed by Bradley Efron beginning with his 1979 paper "Bootstrap Methods: Another Look at the Jackknife" in The Annals of Statistics. Efron's 1983 paper "Estimating the Error Rate of a Prediction Rule" introduced the 0.632 bootstrap estimator, and Efron and Robert Tibshirani's 1997 paper "Improvements on Cross-Validation: The .632+ Bootstrap Method" extended it. These bootstrap estimators are conceptually related to OOB error but are designed for general predictive models, not specifically for bagged ensembles.
OOB evaluation runs alongside the normal training of a bagged ensemble. The procedure can be summarized in five steps.
Step 1. For each base learner b = 1, 2, ..., m, draw a bootstrap sample D_b of size n from the original training set D by sampling with replacement, and train the base learner h_b on D_b. As a byproduct, record the indices of the observations that were not selected for D_b. Call this set OOB_b.
Step 2. For each training observation x_i, identify the set S_i of base learners for which x_i belongs to the OOB set, that is, S_i = {b : i is in OOB_b}. With m base learners and n large, the expected size of S_i is approximately 0.368 m.
Step 3. For each observation x_i, aggregate the predictions of only those base learners in S_i to obtain the OOB prediction y_hat_i^OOB. The aggregation rule depends on the task:
| Task type | Aggregation rule |
|---|---|
| Classification (hard voting) | Majority vote over predicted classes from base learners in S_i |
| Classification (soft voting) | Average predicted class probabilities, then take argmax |
| Regression | Arithmetic mean of base learner predictions |
Step 4. Compare the OOB predictions to the true labels using a loss function appropriate to the task and average across all observations:
OOB error = (1 / n) * sum over i = 1 to n of L(y_i, y_hat_i^OOB)
The resulting average is the OOB error estimate.
Step 5. Use the estimate. The OOB error can serve as a final estimate of generalization performance, as an early stopping criterion (monitoring OOB error as more trees are added), or as a tuning signal for hyperparameters such as the number of features sampled at each split.
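The procedure above can be sketched by hand with plain decision trees. This is a minimal illustration rather than how any particular library implements OOB scoring: it assumes hard majority voting, zero-one loss, the Wisconsin Breast Cancer data bundled with scikit-learn, and an arbitrary ensemble size of 100 trees.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
n, m = len(y), 100                      # m = 100 trees is an illustrative choice
n_classes = len(np.unique(y))
rng = np.random.default_rng(0)

votes = np.zeros((n, n_classes))        # OOB vote counts per observation and class
for b in range(m):
    # Draw bootstrap sample D_b and record its complement OOB_b.
    in_bag = rng.integers(0, n, size=n)
    oob = np.setdiff1d(np.arange(n), in_bag)
    tree = DecisionTreeClassifier(random_state=b).fit(X[in_bag], y[in_bag])
    # The tree votes only on rows it never saw during training.
    votes[oob, tree.predict(X[oob])] += 1

has_votes = votes.sum(axis=1) > 0       # rows that were OOB for at least one tree
oob_pred = votes.argmax(axis=1)         # hard majority vote over OOB trees
# Zero-one loss averaged over rows that received an OOB prediction.
oob_error = np.mean(oob_pred[has_votes] != y[has_votes])

print(f"rows with an OOB prediction: {has_votes.sum()} / {n}")
print(f"OOB misclassification rate:  {oob_error:.4f}")
```

Rows that were never out-of-bag are excluded from the average here; with 100 trees that case is extremely unlikely, but the mask keeps the sketch safe for small ensembles.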
Random forests are the canonical use case for OOB error. In a random forest with m trees, each tree is grown on a bootstrap sample of the training data, and at each split the tree considers only a random subset of features (typically sqrt(p) for classification or p/3 for regression where p is the number of features). The bootstrap step alone is enough to enable OOB evaluation; the feature subsampling adds additional decorrelation between trees but does not change the OOB mechanism.
Breiman's 2001 random forest paper reported empirical comparisons showing that OOB error tracked the true test error closely across a wide range of UCI Machine Learning Repository benchmarks. The OOB error estimate is essentially unbiased once the ensemble has enough trees, typically a few hundred for most datasets.
OOB error decreases (or fluctuates and then plateaus) as more trees are added to the forest. With too few trees, some training observations may be out-of-bag for very few or no base learners, which makes their individual OOB predictions noisy or unavailable. As m grows, every observation accumulates a substantial number of OOB predictions and the OOB error converges to a stable estimate. A common heuristic is to plot OOB error against the number of trees and stop adding trees once the curve flattens.
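One way to trace the OOB error curve in scikit-learn is to grow a single forest incrementally with warm_start=True, so that earlier trees are reused at each step. The sketch below assumes an arbitrary schedule of 25 to 300 trees in steps of 25; the right range depends on the dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# warm_start=True keeps the trees already grown and only adds new ones on refit.
rf = RandomForestClassifier(
    warm_start=True, oob_score=True, bootstrap=True, random_state=0
)

oob_errors = {}
for n_trees in range(25, 301, 25):      # illustrative schedule, not a recommendation
    rf.set_params(n_estimators=n_trees)
    rf.fit(X, y)
    oob_errors[n_trees] = 1.0 - rf.oob_score_

for n_trees, err in oob_errors.items():
    print(f"{n_trees:>4} trees  OOB error = {err:.4f}")
```

Plotting oob_errors against the number of trees gives the curve described above; the point where it flattens is a reasonable place to stop adding trees.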
The following Python example trains a random forest on the Wisconsin Breast Cancer dataset and reports both the test accuracy and the OOB score.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf = RandomForestClassifier(
    n_estimators=500,
    oob_score=True,   # turn on out-of-bag scoring
    bootstrap=True,   # OOB requires bootstrap sampling
    n_jobs=-1,
    random_state=42,
)
rf.fit(X_train, y_train)

print(f"OOB score: {rf.oob_score_:.4f}")
print(f"Test accuracy: {rf.score(X_test, y_test):.4f}")
```
Key scikit-learn parameters and attributes related to OOB evaluation are summarized below.
| Parameter or attribute | Description |
|---|---|
| bootstrap=True | Required for OOB; controls whether each tree is built on a bootstrap sample |
| oob_score=True | Enables OOB score computation after fit |
| oob_score_ | Mean accuracy (classifier) or R^2 (regressor) on OOB samples |
| oob_decision_function_ | Per-sample OOB class probabilities (classifier only) |
| oob_prediction_ | Per-sample OOB predictions (regressor only) |
| n_estimators | Number of trees; OOB requires enough trees so each sample is OOB for several trees |
If oob_score=True is combined with bootstrap=False, scikit-learn raises a ValueError at fit time, because without bootstrap sampling no observations are left out of training and there is nothing to score.
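The per-sample attributes in the table can be used to recompute the summary score by hand. The sketch below does this for a classifier and a regressor; the datasets (both bundled with scikit-learn) and the choice of 200 trees are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import r2_score

# Classifier: oob_decision_function_ holds one row of class probabilities per sample.
Xc, yc = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(Xc, yc)
oob_proba = clf.oob_decision_function_
oob_labels = clf.classes_[oob_proba.argmax(axis=1)]
clf_manual = np.mean(oob_labels == yc)          # accuracy of the OOB predictions
print(f"classifier: oob_score_={clf.oob_score_:.4f}  manual={clf_manual:.4f}")

# Regressor: oob_prediction_ holds one OOB prediction per sample.
Xr, yr = load_diabetes(return_X_y=True)
reg = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0).fit(Xr, yr)
reg_manual = r2_score(yr, reg.oob_prediction_)  # R^2 of the OOB predictions
print(f"regressor:  oob_score_={reg.oob_score_:.4f}  manual={reg_manual:.4f}")
```

Having the per-sample OOB outputs also makes it possible to compute other metrics, such as log loss or mean absolute error, from the same fitted forest.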
One of the most influential applications of OOB evaluation is permutation variable importance, also introduced by Breiman in the 2001 random forest paper. The idea is to measure how much a model's predictive performance degrades when the values of a single feature are randomly shuffled, breaking any relationship between that feature and the response.
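The procedure for OOB permutation importance is:
1. For each tree, compute the prediction accuracy on that tree's OOB observations.
2. Randomly permute the values of one feature among those OOB observations and recompute the tree's accuracy on the permuted data.
3. Record the drop in accuracy caused by the permutation.
4. Average the drop over all trees to obtain the importance of that feature, and repeat for every feature.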
Features that are critical to the model's predictions show a large drop in OOB accuracy when permuted, while features that contribute little show a small drop. This approach has several advantages over simpler importance measures based on splits or impurity reduction:
| Importance measure | Basis | Bias toward high-cardinality features | Computational cost |
|---|---|---|---|
| Mean decrease in impurity (MDI / Gini) | Sum of impurity reductions across all splits using the feature | High; favors continuous and high-cardinality categorical features | Low (computed during training) |
| OOB permutation importance | Drop in OOB accuracy when feature is shuffled | Lower; less prone to high-cardinality bias | Moderate; requires extra OOB passes |
| SHAP values | Game-theoretic attribution of model output | Low | High |
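The OOB permutation idea can be sketched by hand with bagged decision trees. This is a simplified illustration rather than the exact routine of any library: it assumes accuracy as the metric, one permutation per feature per tree, 50 trees, and sqrt(p) feature subsampling at each split.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target
n, p = X.shape
m = 50                                   # illustrative ensemble size
rng = np.random.default_rng(0)

importance = np.zeros(p)
for b in range(m):
    in_bag = rng.integers(0, n, size=n)
    oob = np.setdiff1d(np.arange(n), in_bag)
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
    tree.fit(X[in_bag], y[in_bag])
    baseline = tree.score(X[oob], y[oob])             # accuracy on this tree's OOB rows
    for j in range(p):
        X_perm = X[oob].copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link between feature j and y
        importance[j] += baseline - tree.score(X_perm, y[oob])
importance /= m                          # mean drop in OOB accuracy per tree

for j in np.argsort(importance)[::-1][:5]:
    print(f"{data.feature_names[j]:<25} {importance[j]:.4f}")
```

Because each tree is scored only on rows it never trained on, the measured drop reflects lost predictive power rather than memorized training patterns.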
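Permutation importance is implemented in scikit-learn as sklearn.inspection.permutation_importance, which can be applied to any fitted estimator (including random forests). The function scores the model on whatever data the caller passes in, typically a held-out validation set or the training data; it does not use the forest's OOB samples, so an OOB-based variant has to be computed separately.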
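A short usage sketch of sklearn.inspection.permutation_importance follows. It assumes a held-out validation split of 25% and 10 shuffles per feature, both arbitrary illustrative settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_val, y_train, y_val = train_test_split(
    data.data, data.target, test_size=0.25, random_state=0, stratify=data.target
)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Shuffle each feature n_repeats times and measure the drop in validation accuracy.
result = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)

ranking = result.importances_mean.argsort()[::-1]
for j in ranking[:5]:
    print(f"{data.feature_names[j]:<25} "
          f"{result.importances_mean[j]:.4f} +/- {result.importances_std[j]:.4f}")
```

The returned object exposes the raw per-repeat scores in result.importances, which is useful for judging whether a small mean importance is distinguishable from noise.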
A known caveat is that when two features are strongly correlated, permuting one of them may have only a small effect on OOB error because the other still carries the same information. As a result, permutation importance can underestimate the importance of correlated features. Conditional permutation importance and grouped permutation importance address this in part.
OOB error, cross-validation, and holdout validation all estimate generalization error but differ in cost, bias, and applicability. The table below summarizes the trade-offs.
| Property | OOB evaluation | k-fold cross-validation | Hold-out (single split) |
|---|---|---|---|
| Applicable to | Bagged ensembles only (random forest, bagged trees, etc.) | Any model | Any model |
| Extra training passes required | None (uses ensemble already trained) | k full retrainings | None beyond one training run |
| Effective fraction of data used per estimate | ~36.8% OOB for each base learner; full dataset across the ensemble | (k-1)/k for training, 1/k for testing per fold | User-defined (e.g., 80/20) |
| Bias of estimate | Approximately equivalent to leave-one-out for the ensemble | Slight pessimistic bias for small k | Depends on split |
| Variance of estimate | Low for large m | Moderate; depends on k | High; depends on the single split |
| Suitable for hyperparameter tuning | Yes, but care needed (no separate validation set) | Yes | Yes |
| Suitable for time series | No (random sampling breaks temporal order) | Special variants needed | Yes with chronological split |
| Computational cost | Negligible additional cost | k times training cost | One training cost |
Breiman's 1996 OOB technical report and follow-up empirical studies showed that OOB error and 10-fold cross-validation give very similar estimates for random forests and bagged trees, with OOB tending to be slightly pessimistic when the ensemble is small. For practical purposes, OOB error is the preferred estimator when working with bagged ensembles because it avoids the k-fold retraining cost entirely.
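The agreement between OOB and cross-validated estimates can be checked on a small dataset. The sketch below compares the OOB accuracy of one forest with 10-fold cross-validated accuracy of the same configuration; the dataset and the 100-tree setting are assumptions for illustration, and results on other data may differ.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# OOB estimate: a single fit on all of the data.
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0).fit(X, y)

# 10-fold CV estimate: ten additional fits of the same configuration.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=10
)

print(f"OOB accuracy:        {rf.oob_score_:.4f}")
print(f"10-fold CV accuracy: {cv_scores.mean():.4f} (std {cv_scores.std():.4f})")
```

The cross-validated figure costs ten extra training runs, while the OOB figure comes from the single forest that would be trained anyway.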
For non-bagging models such as a single decision tree, logistic regression, support vector machine, or neural network, OOB evaluation is not directly applicable and cross-validation or hold-out remains the standard.
Closely related to OOB evaluation are two bootstrap-based error estimators developed by Bradley Efron and Robert Tibshirani in the 1980s and 1990s.
Efron's 1983 paper "Estimating the Error Rate of a Prediction Rule" introduced the 0.632 bootstrap estimator. It addresses a known issue with the naive bootstrap error estimate (the average error on each bootstrap sample's OOB set), which tends to be pessimistically biased because each bootstrap model is trained on only about 63.2% of the unique observations. The 0.632 estimator combines the OOB-style error with the resubstitution (training) error using the constant 0.632:
err_0.632 = 0.368 * err_train + 0.632 * err_OOB
The weights are chosen because, in expectation, each bootstrap training sample contains 63.2% of the unique original observations. The 0.632 estimator can have lower bias than either err_train (which underestimates) or err_OOB (which overestimates) alone, particularly for unstable learners.
The 0.632 estimator can fail in cases of severe overfitting, where the training error is essentially zero but the model has memorized the data. Efron and Tibshirani's 1997 paper "Improvements on Cross-Validation: The .632+ Bootstrap Method" introduced the 0.632+ bootstrap estimator, which adjusts the weighting based on a measure of overfitting called the relative overfitting rate R:
err_0.632+ = (1 - w) * err_train + w * err_OOB
where w = 0.632 / (1 - 0.368 R) and R is the relative overfitting rate, with R = 0 corresponding to no overfitting (recovers the standard 0.632 estimator) and R = 1 corresponding to maximal overfitting (collapses to err_OOB). The 0.632+ method gives a more honest error estimate when the underlying learner severely overfits.
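The two estimators can be written as small functions of the training error and the OOB-style error. The formula for R used below, R = (err_OOB - err_train) / (gamma - err_train), and the no-information error rate gamma are taken from the usual statement of the 0.632+ method and are not spelled out in the text above, so treat them as assumptions of this sketch; gamma is supplied by the caller here rather than estimated from data.

```python
def err_632(err_train: float, err_oob: float) -> float:
    """0.632 bootstrap estimate: fixed blend of training and OOB-style error."""
    return 0.368 * err_train + 0.632 * err_oob

def err_632_plus(err_train: float, err_oob: float, gamma: float) -> float:
    """0.632+ estimate. gamma is the no-information error rate (assumed given)."""
    err_oob = min(err_oob, gamma)                  # cap the OOB error at gamma
    if err_oob > err_train and gamma > err_train:
        r = (err_oob - err_train) / (gamma - err_train)   # relative overfitting rate
    else:
        r = 0.0                                    # no measurable overfitting
    w = 0.632 / (1.0 - 0.368 * r)                  # weight grows from 0.632 to 1 with r
    return (1.0 - w) * err_train + w * err_oob

# Hypothetical error rates for a learner that fits the training data perfectly.
train_error, oob_style_error, no_info_rate = 0.0, 0.2, 0.5
print(f"0.632 estimate:  {err_632(train_error, oob_style_error):.4f}")
print(f"0.632+ estimate: {err_632_plus(train_error, oob_style_error, no_info_rate):.4f}")
```

With zero training error the 0.632+ estimate sits closer to the OOB-style error than the plain 0.632 estimate does, which is the correction the method is designed to make.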
The 0.632 and 0.632+ estimators apply to general predictive models trained on bootstrap samples, not just to bagged ensembles. They share the same statistical foundation as OOB evaluation (the (1 - 1/n)^n -> 1/e limit) but combine the OOB-style error with the training error to correct for known biases. In bagged ensembles, the OOB error from Breiman's procedure is essentially an unweighted version of err_OOB; when the goal is to evaluate the ensemble itself rather than a single bootstrap-trained model, the unweighted OOB error is usually preferred.
| Estimator | Formula | Best for |
|---|---|---|
| Naive resubstitution | err_train | Lower bound only; severely biased downward |
| Leave-one-out CV | Average error over n folds of size n-1 | General models; high computational cost |
| OOB (Breiman) | Average loss between aggregated OOB predictions and labels | Bagging ensembles; near-zero extra cost |
| 0.632 bootstrap (Efron 1983) | 0.368 err_train + 0.632 err_OOB | Single bootstrap-trained models; corrects pessimistic bias |
| 0.632+ bootstrap (Efron and Tibshirani 1997) | Adaptive weighting using relative overfitting rate R | Severely overfitting learners |
OOB evaluation is one of the most useful tools in ensemble learning, but it has well-known limitations.
- Bootstrap sampling is required: bootstrap=True must be set.
- Enable oob_score=True whenever you train a random forest or bagging ensemble. The cost is negligible and it provides a free generalization estimate.
- Track OOB error as a function of n_estimators. Stop adding trees once the OOB curve plateaus.
- Fixing random_state makes results reproducible.

OOB evaluation is supported across many of the major machine learning libraries.
| Library or framework | OOB support | How to enable |
|---|---|---|
| scikit-learn (Python) | Yes; RandomForestClassifier, RandomForestRegressor, BaggingClassifier, BaggingRegressor | oob_score=True and bootstrap=True |
| randomForest (R) | Yes; OOB error printed by default | Automatic; reported in print(model) and plot(model) |
| ranger (R) | Yes; computes OOB error and prediction by default | Automatic; available in model$prediction.error |
| H2O (Java/Python/R) | Yes; reports OOB metrics for H2ORandomForestEstimator | Automatic when nfolds = 0 |
| Spark MLlib | Limited; MLlib does not expose OOB by default | Compute manually using bootstrap indices |
| XGBoost | No; XGBoost is a boosting method without bootstrap sampling | Use cross-validation instead |
| LightGBM | No (default); supports row-bagging via bagging_fraction, but does not expose OOB error | Use cross-validation |
| CatBoost | No native OOB; relies on cross-validation | Use cross-validation |
Note that gradient boosting libraries such as XGBoost, LightGBM, and CatBoost are based on sequential boosting rather than bagging. While some of them implement optional row-subsampling that resembles bagging, the sequential nature of boosting means the OOB framework does not transfer cleanly, so cross-validation is the standard evaluation method for these libraries.
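For the scikit-learn row of the table, OOB scoring is not limited to random forests. The sketch below uses BaggingClassifier with its default decision-tree base learner and also inspects the estimators_samples_ attribute to confirm the in-bag fraction; the dataset and the 100-estimator setting are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier

X, y = load_breast_cancer(return_X_y=True)

# Default base learner is a decision tree; bootstrap=True is the default.
bag = BaggingClassifier(n_estimators=100, oob_score=True, random_state=0).fit(X, y)
print(f"Bagging OOB accuracy: {bag.oob_score_:.4f}")

# estimators_samples_ lists the row indices drawn for each base learner.
unique_in_bag = [np.unique(idx).size / len(y) for idx in bag.estimators_samples_]
print(f"mean unique in-bag fraction: {np.mean(unique_in_bag):.3f}")
```

The mean unique in-bag fraction should come out near 0.632, the complement of the out-of-bag fraction.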
Imagine your class is divided into 100 small reading groups, and each group is given a random handful of storybooks from the library. Because the handfuls are picked with replacement, some books end up in lots of groups and some books end up in only a few groups. About one third of the books never make it into any given group's pile.
Later, the teacher wants to know how good the class is at understanding new stories. Instead of giving everyone a brand new book, the teacher does something clever: for each book, she finds all the groups that did not see it, asks them what they think the story is about, and combines their answers. Since those groups have never read the book, their combined answer is a fair test of how well the class can handle a new story.
That is what out-of-bag evaluation does. Each tree in the random forest is one little reading group, each training row is one storybook, and the OOB error is the score the class gets when only the groups that have never seen a row try to predict it. The whole trick is free because the trees were going to be trained anyway, and the rows they did not use were just sitting there waiting to be useful.