# Out-of-bag evaluation (OOB evaluation)

> Source: https://aiwiki.ai/wiki/out-of-bag_evaluation_oob_evaluation
> Updated: 2026-07-11
> Categories: Machine Learning, Model Evaluation
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms), [Bagging](/wiki/bagging), [Random forest](/wiki/random_forest), [Cross-validation](/wiki/cross-validation)*

## Overview

**Out-of-bag (OOB) evaluation**, sometimes called **out-of-bag estimation** or **OOB error**, is a model validation technique used with [bagging](/wiki/bagging)-based [ensemble methods](/wiki/ensemble) such as [random forests](/wiki/random_forest) and bagged decision trees. It estimates the generalization error of an ensemble model using the training instances that were left out of each base learner's bootstrap sample, eliminating the need for a separate validation set or for repeated cross-validation passes.

The technique was introduced by Leo Breiman in his 1996 University of California, Berkeley technical report "Out-of-Bag Estimation" [2] and refined in his 2001 paper introducing random forests [3]. Because OOB error is computed as a byproduct of training, it provides a built-in estimate of test accuracy at essentially no additional computational cost. Breiman showed that OOB error is approximately as accurate as a [test set](/wiki/test_set) estimate of the same size as the training set [2][3], which makes it an attractive replacement for [k-fold cross-validation](/wiki/cross-validation) when working with bagged ensembles. In Breiman's words, "using the out-of-bag error estimate removes the need for a set aside test set" [3].

In modern practice, OOB evaluation is most commonly accessed through the `oob_score=True` option on the `RandomForestClassifier`, `RandomForestRegressor`, `BaggingClassifier`, and `BaggingRegressor` estimators in [scikit-learn](/wiki/scikit-learn) [11]. The same idea is implemented in R's `randomForest` package [16], in the faster `ranger` package [14], in H2O, and in other ensemble learning libraries; [Apache Spark](/wiki/apache_spark) MLlib, by contrast, does not expose OOB error directly.

## Definition

Given a training set with n observations and an ensemble of m base learners trained on m bootstrap samples, the OOB prediction for an observation x_i is the aggregated prediction of only those base learners whose bootstrap sample did not include x_i. The OOB error is the average loss between these OOB predictions and the true labels, computed across all n training observations. Because each observation is predicted only by base learners that never saw it during training, the OOB error mimics a held-out test estimate. Hastie, Tibshirani and Friedman, in the random forests chapter of *The Elements of Statistical Learning*, note that the OOB error is essentially equivalent to N-fold cross-validation, so a random forest can be fit and validated in a single training pass [10].

## Mathematical foundation

OOB evaluation relies on a basic property of bootstrap sampling. When n observations are sampled uniformly at random with replacement from a set of size n, the probability that a given observation is **not** chosen on any single draw is $$(1 - 1/n)$$. The probability that it is not chosen across all n draws (and so does not appear at all in that bootstrap sample) is

$$
P(\text{not selected}) = (1 - 1/n)^n
$$

As n grows large, this expression converges to the limit $$1/e$$, which is approximately **0.3679**. So roughly **36.8%** of the original observations are left out of each bootstrap sample, and the remaining **63.2%** appear in it at least once. The set of observations not selected is referred to as the **out-of-bag set** for that particular base learner. Breiman summarized this directly in the abstract of the 1996 report: "Each bootstrap sample leaves out about 37% of the examples. These left-out examples can be used to form accurate estimates of important quantities" [2].

Equivalently, the **expected number of distinct (unique) observations** that appear in a bootstrap sample of size n is

$$
\mathbb{E}[\text{distinct}] = n (1 - (1 - 1/n)^n)
$$

which converges to $$n (1 - 1/e) \approx 0.632 n$$ as n grows. This 0.632 figure is the source of the name of the 0.632 bootstrap estimator discussed later, and it is the average fraction of unique training rows each base learner actually trains on.

The table below shows how $$P(\text{not selected})$$ approaches $$1/e$$ even for modest sample sizes.

| Sample size n | $$(1 - 1/n)^n$$ | Approximate fraction OOB |
|---|---|---|
| 2 | 0.2500 | 25.0% |
| 5 | 0.3277 | 32.8% |
| 10 | 0.3487 | 34.9% |
| 50 | 0.3642 | 36.4% |
| 100 | 0.3660 | 36.6% |
| 1,000 | 0.3677 | 36.8% |
| 10,000 | 0.3679 | 36.8% |
| Limit ($$n \to \infty$$) | $$1/e$$ | 36.79% |
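
The table values can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `prob_never_drawn` is illustrative and not part of any library.

```python
import math


def prob_never_drawn(n: int) -> float:
    """Probability that one fixed row is absent from a bootstrap sample of size n."""
    return (1.0 - 1.0 / n) ** n


for n in (2, 5, 10, 50, 100, 1_000, 10_000):
    print(f"n={n:>6}  P(not selected)={prob_never_drawn(n):.4f}")
print(f"limit 1/e = {math.exp(-1):.4f}")
```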

This means that for any individual training observation x_i in an ensemble of m models, roughly 0.368 m of those models did not see x_i during training. Those models can collectively make a prediction on x_i that is statistically equivalent to a prediction made by an ensemble of about 0.368 m base learners trained on completely separate data, so for large m the OOB prediction approximates the prediction of a smaller (but still substantial) ensemble on truly held-out data.

For a classification or regression task with loss function $$L(y, \hat{y})$$, the OOB error estimate is:

$$
\text{OOB error} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i^{\text{OOB}})
$$

where $$\hat{y}_i^{\text{OOB}}$$ is the aggregated prediction (majority vote for classification, average for regression) over the subset of models that did not include x_i in their bootstrap sample. Common choices for L include zero-one loss (misclassification rate), Brier score, log loss, and mean squared error.

## History

The technique grew out of two strands of research in the 1990s. Leo Breiman published "Bagging Predictors" in *Machine Learning* in 1996, which introduced the idea of training multiple base learners on bootstrap samples and aggregating their outputs [1]. In a companion 1996 technical report titled "Out-of-Bag Estimation," Breiman argued that the bootstrap sampling step in bagging produced a free estimate of generalization error, because every training observation was held out by some fraction of the base learners [2]. Breiman credited two earlier 1996 papers as the direct stimuli for the OOB idea: a technical report by Robert Tibshirani that used out-of-bag estimates of variance to estimate generalization error for arbitrary classifiers, and a paper by David Wolpert and William Macready on estimating bagging's generalization error for regression problems [2][3][17]. Breiman noted that for classification "our results are new" relative to that prior work [2].

In parallel, Tin Kam Ho published "Random Decision Forests" in 1995 at the Third International Conference on Document Analysis and Recognition [4], followed by "The Random Subspace Method for Constructing Decision Forests" in *IEEE Transactions on Pattern Analysis and Machine Intelligence* in 1998 [5]. Ho's work used random feature subsets rather than bootstrap samples but anticipated the broader idea of randomized tree ensembles.

Breiman synthesized bagging, Ho's random subspace approach, and ideas from Yoav Amit and Donald Geman's work on randomized geometric features into the [random forest](/wiki/random_forest) algorithm, which he published as "Random Forests" in *Machine Learning* in 2001 [3]. The OOB error was a central feature of the algorithm, used not only to estimate test accuracy but also to compute per-feature variable importance scores and internal estimates of forest strength and the correlation between trees. By the time random forests became one of the most widely used machine learning algorithms in the early 2000s, OOB evaluation had become a standard tool in the ensemble learner's toolkit.

The broader bootstrap framework underlying OOB had been developed by Bradley Efron beginning with his 1979 paper "Bootstrap Methods: Another Look at the Jackknife" in *The Annals of Statistics* [6]. Efron's 1983 paper "Estimating the Error Rate of a Prediction Rule" introduced the **0.632 bootstrap** estimator [7], and Efron and Robert Tibshirani's 1997 paper "Improvements on Cross-Validation: The .632+ Bootstrap Method" extended it [8]. These bootstrap estimators are conceptually related to OOB error but are designed for general predictive models, not specifically for bagged ensembles [9].

## How OOB evaluation works

OOB evaluation runs alongside the normal training of a bagged ensemble. The procedure can be summarized in five steps.

### Step 1: Train base learners on bootstrap samples

For each base learner b = 1, 2, ..., m, draw a bootstrap sample D_b of size n from the original training set D by sampling with replacement. Train the base learner h_b on D_b. As a byproduct, record the indices of the observations that were not selected for D_b. Call this set OOB_b.

### Step 2: Track OOB membership per observation

For each training observation x_i, identify the set S_i of base learners for which x_i belongs to the OOB set, that is, $$S_i = \{b : i \in \mathrm{OOB}_b\}$$. With m base learners and n large, the expected size of S_i is approximately 0.368 m.
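
The membership sets can be checked by simulation. The sketch below uses only the standard library and draws bootstrap indices without training any model; `oob_membership` is a hypothetical helper, and the sizes `n=500` and `m=200` are arbitrary.

```python
import random


def oob_membership(n: int, m: int, seed: int = 0) -> list[set[int]]:
    """Return S_i for every row i: the learners whose bootstrap sample missed row i."""
    rng = random.Random(seed)
    s = [set() for _ in range(n)]
    for b in range(m):
        in_bag = {rng.randrange(n) for _ in range(n)}  # n draws with replacement
        for i in range(n):
            if i not in in_bag:
                s[i].add(b)
    return s


m = 200
S = oob_membership(n=500, m=m)
mean_size = sum(len(s_i) for s_i in S) / len(S)
print(f"mean |S_i| = {mean_size:.1f}, 0.368 * m = {0.368 * m:.1f}")
```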

### Step 3: Generate OOB predictions

For each observation x_i, aggregate the predictions of only those base learners in S_i to obtain the OOB prediction $$\hat{y}_i^{\text{OOB}}$$. The aggregation rule depends on the task:

| Task type | Aggregation rule |
|---|---|
| [Classification](/wiki/classification_model) (hard voting) | Majority vote over predicted classes from base learners in S_i |
| [Classification](/wiki/classification_model) (soft voting) | Average predicted class probabilities, then take argmax |
| [Regression](/wiki/regression_model) | Arithmetic mean of base learner predictions |
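
The three aggregation rules are simple enough to sketch directly. The function names below are illustrative, and the inputs are assumed to be the predictions of the base learners in S_i for a single observation.

```python
from collections import Counter
from statistics import fmean


def hard_vote(labels):
    """Majority vote over predicted class labels (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probas):
    """Average the per-class probability vectors, then return the argmax class index."""
    n_classes = len(probas[0])
    avg = [fmean(p[k] for p in probas) for k in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)


def mean_prediction(values):
    """Arithmetic mean of regression predictions."""
    return fmean(values)


# Three OOB learners for one row; probabilities are made up for illustration.
probas = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]
hard = hard_vote([max(range(2), key=p.__getitem__) for p in probas])
soft = soft_vote(probas)
```

In this made-up case the two classification rules disagree: two of the three learners lean toward class 1, but the one confident learner pulls the averaged probability toward class 0.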

### Step 4: Compute OOB error

Compare the OOB predictions to the true labels using a loss function appropriate to the task and average across all observations:

- For classification, use zero-one loss to compute the OOB misclassification rate, or use log loss or Brier score for probabilistic outputs.
- For regression, use mean squared error, mean absolute error, or another regression loss.

The resulting average is the OOB error estimate.
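
Steps 1 through 4 can be sketched from scratch as follows. This is an illustrative implementation under stated assumptions (plain bagged `DecisionTreeClassifier` trees, hard voting, zero-one loss, 200 learners), not the code scikit-learn runs internally.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
n, m = len(y), 200
rng = np.random.default_rng(0)

votes = np.zeros((n, 2))  # per-row OOB vote counts for the two classes
for b in range(m):
    in_bag = rng.integers(0, n, size=n)           # Step 1: bootstrap sample D_b
    oob = np.setdiff1d(np.arange(n), in_bag)      # Step 1: OOB_b
    tree = DecisionTreeClassifier(random_state=b).fit(X[in_bag], y[in_bag])
    pred = tree.predict(X[oob])
    votes[oob, pred] += 1                         # Steps 2-3: only OOB rows get a vote

has_vote = votes.sum(axis=1) > 0                  # rows that were OOB at least once
y_oob = votes[has_vote].argmax(axis=1)            # Step 3: majority vote
oob_error = np.mean(y_oob != y[has_vote])         # Step 4: zero-one loss
print(f"manual OOB error: {oob_error:.4f}")
```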

### Step 5: Use OOB error as a generalization estimate

The OOB error can be used in several ways: as a final estimate of generalization performance, as an early stopping criterion (monitoring OOB error as more trees are added), or as a tuning signal for hyperparameters such as the number of features sampled at each split.
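
As a sketch of the tuning use case, the loop below compares a few `max_features` settings by OOB score. The grid and the forest size are arbitrary choices for illustration. Note that the OOB score of the winning setting has been used for selection, so it is no longer an untouched estimate of generalization performance.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

oob_by_setting = {}
for max_features in (1, "sqrt", 0.5, None):  # illustrative grid; None means all features
    rf = RandomForestClassifier(
        n_estimators=200,
        max_features=max_features,
        oob_score=True,
        n_jobs=-1,
        random_state=0,
    )
    rf.fit(X, y)
    oob_by_setting[max_features] = rf.oob_score_

best = max(oob_by_setting, key=oob_by_setting.get)
print(f"best max_features by OOB accuracy: {best!r}")
```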

## OOB error in random forests

Random forests are the canonical use case for OOB error. In a random forest with m trees, each tree is grown on a bootstrap sample of the training data, and at each split the tree considers only a random subset of features. Breiman's original recommendation and the historical R `randomForest` default use $$\sqrt{p}$$ candidate features for classification and $$\lfloor p/3 \rfloor$$ for regression, where p is the number of features [3][16]. Modern scikit-learn keeps `max_features="sqrt"` as the classifier default but sets the regressor default to `max_features=1.0` (all features) [11]. The bootstrap step alone is enough to enable OOB evaluation; the feature subsampling adds additional decorrelation between trees but does not change the OOB mechanism.

Breiman's 2001 random forest paper reported empirical comparisons showing that OOB error tracked the true test error closely across a wide range of UCI Machine Learning Repository benchmarks [3]. In the earlier 1996 study, the average OOB misclassification rate was within a fraction of a percentage point of the average test set error on datasets such as breast-cancer (4.4% OOB vs 4.4% test) and dna (7.6% vs 7.5%), and Breiman observed that "in classification, the out-of-bag estimates appear almost unbiased," while in regression the OOB estimates "may be systematically low" [2].

### Convergence, number of trees, and a subtle bias

OOB error decreases (or fluctuates and then plateaus) as more trees are added to the forest. With too few trees, some training observations are out of bag for very few base learners, or for none at all, which makes their individual OOB predictions noisy or undefined. As m grows, every observation accumulates a substantial number of OOB predictions and the OOB error converges to a stable estimate. A common heuristic is to plot OOB error against the number of trees and stop adding trees once the curve flattens [3].
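
One way to trace this curve in scikit-learn is to grow a single forest incrementally with `warm_start=True` and record the OOB error after each batch of trees. The step size of 50 and the upper limit of 500 are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)
oob_curve = []
for n_trees in range(50, 501, 50):
    rf.set_params(n_estimators=n_trees)
    rf.fit(X, y)  # warm_start keeps the existing trees and adds the new ones
    oob_curve.append((n_trees, 1.0 - rf.oob_score_))

for n_trees, err in oob_curve:
    print(f"{n_trees:>4} trees  OOB error = {err:.4f}")
```

Plotting `oob_curve` shows where the estimate flattens; starting the loop at a very small forest may trigger a warning that some rows have no OOB prediction yet.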

Breiman pointed out a subtle and frequently overlooked source of bias. Because each training row is OOB for only about one-third of the trees, the OOB prediction for that row aggregates roughly m/3 trees rather than the full m. Since a forest's error rate decreases as more trees are combined, an OOB prediction built from m/3 trees tends to be slightly worse than the final m-tree forest, so "the out-of-bag estimates will tend to overestimate the current error rate" [3]. Breiman's remedy was to keep adding trees: "To get unbiased out-of-bag estimates, it is necessary to run past the point where the test set error converges. But unlike cross-validation, where bias is present but its extent unknown, the out-of-bag estimates are unbiased" [3]. In practice this means a forest large enough for a stable test error is usually large enough for a stable, nearly unbiased OOB error as well.

### Example in scikit-learn

The following Python example trains a random forest on the Wisconsin Breast Cancer dataset and reports both the test accuracy and the OOB score.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf = RandomForestClassifier(
    n_estimators=500,
    oob_score=True,        # turn on out-of-bag scoring
    bootstrap=True,        # OOB requires bootstrap sampling
    n_jobs=-1,
    random_state=42,
)
rf.fit(X_train, y_train)

print(f"OOB score:      {rf.oob_score_:.4f}")
print(f"Test accuracy:  {rf.score(X_test, y_test):.4f}")
```

Key scikit-learn parameters and attributes related to OOB evaluation are summarized below.

| Parameter or attribute | Description |
|---|---|
| `bootstrap=True` | Required for OOB; controls whether each tree is built on a bootstrap sample |
| `oob_score` | `bool` or callable, default `False`. Set to `True` to compute the OOB score after `fit` using accuracy (classifier) or $$R^2$$ (regressor); pass a callable with signature `metric(y_true, y_pred)` to use a custom OOB metric [11] |
| `oob_score_` | Score of the training set obtained from the OOB estimate; exists only when `oob_score` is set |
| `oob_decision_function_` | Per-sample OOB class probabilities (classifier only). If `n_estimators` is small, a sample may never have been left out, in which case its entries can be `NaN` [11] |
| `oob_prediction_` | Per-sample OOB predictions (regressor only) |
| `max_samples` | Number or fraction of rows drawn per bootstrap sample; default `None` draws n rows (the classic bootstrap). Smaller values change the OOB fraction |
| `n_estimators` | Number of trees; OOB requires enough trees so each sample is OOB for several trees |

The ability to pass a custom scoring callable to `oob_score` was added in scikit-learn 1.4, so older versions accept only a boolean [11]. If `oob_score` is requested while `bootstrap=False`, scikit-learn raises a `ValueError`, because no observations are left out of training. A practical consequence of the `NaN` caveat is that very small forests can leave some rows with no OOB prediction at all, which is another reason to use at least a few hundred trees when relying on the OOB score.
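
On versions that predate the callable option, other OOB metrics can still be derived from `oob_decision_function_`. The sketch below computes OOB ROC AUC and log loss for a binary problem and masks any rows that were never out of bag; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=42).fit(X, y)

proba = rf.oob_decision_function_          # shape (n_samples, n_classes)
scored = ~np.isnan(proba).any(axis=1)      # drop rows that were never out of bag

oob_auc = roc_auc_score(y[scored], proba[scored, 1])
oob_log_loss = log_loss(y[scored], proba[scored])
oob_accuracy = np.mean(rf.classes_[proba[scored].argmax(axis=1)] == y[scored])

print(f"OOB accuracy: {oob_accuracy:.4f}  AUC: {oob_auc:.4f}  log loss: {oob_log_loss:.4f}")
```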

## Variable importance via OOB permutation

One of the most influential applications of OOB evaluation is **permutation variable importance**, also introduced by Breiman in the 2001 random forest paper [3]. The idea is to measure how much a model's predictive performance degrades when the values of a single feature are randomly shuffled, breaking any relationship between that feature and the response.

The procedure for OOB permutation importance is:

1. Train the random forest and compute the baseline OOB error (or accuracy) for each tree using its OOB samples.
2. For each feature j, randomly permute the values of feature j among the OOB samples for each tree, then recompute the OOB error using the permuted feature.
3. The importance of feature j is the average increase in OOB error (or decrease in accuracy) across all trees, often standardized by dividing by the standard deviation across trees.
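
The three steps can be sketched with hand-rolled bagging so that each tree's OOB rows are known explicitly. This is an illustrative implementation under stated assumptions (100 trees, `max_features="sqrt"`, zero-one loss), not the code used by any particular package.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
n, p = X.shape
m = 100
rng = np.random.default_rng(0)

increase = np.zeros((m, p))  # per-tree increase in OOB error for each feature
for b in range(m):
    in_bag = rng.integers(0, n, size=n)
    oob = np.setdiff1d(np.arange(n), in_bag)
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
    tree.fit(X[in_bag], y[in_bag])
    base_err = np.mean(tree.predict(X[oob]) != y[oob])       # step 1: baseline OOB error
    for j in range(p):
        X_perm = X[oob].copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])         # step 2: shuffle feature j
        increase[b, j] = np.mean(tree.predict(X_perm) != y[oob]) - base_err

importance = increase.mean(axis=0)                           # step 3: average over trees
std = increase.std(axis=0)
scaled = np.divide(importance, std, out=np.zeros(p), where=std > 0)
top5 = np.argsort(importance)[::-1][:5]
print("top features by raw importance:", top5)
```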

Features that are critical to the model's predictions show a large drop in OOB accuracy when permuted, while features that contribute little show a small drop. This approach has several advantages over simpler importance measures based on splits or impurity reduction:

| Importance measure | Basis | Bias toward high-cardinality features | Computational cost |
|---|---|---|---|
| Mean decrease in impurity (MDI / Gini) | Sum of impurity reductions across all splits using the feature | High; favors continuous and high-cardinality categorical features [12] | Low (computed during training) |
| OOB permutation importance | Drop in OOB accuracy when feature is shuffled | Lower; less prone to high-cardinality bias [12] | Moderate; requires extra OOB passes |
| SHAP values | Game-theoretic attribution of model output | Low | High |

In Breiman's original formulation the per-tree increases in OOB error are averaged across the forest and the result can be divided by its standard deviation across trees to give a z-score-style importance, which is how the R `randomForest` package reports its "MeanDecreaseAccuracy" measure [3][16]. Permutation importance is also implemented in scikit-learn as `sklearn.inspection.permutation_importance`, which can be applied to any fitted estimator (including random forests) using either a held-out validation set or the training data; the scikit-learn implementation permutes a supplied dataset rather than the per-tree OOB rows, but the underlying logic is the same [11].
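
A minimal usage sketch of `sklearn.inspection.permutation_importance` on a held-out split follows. The split size, `n_repeats=10`, and the forest size are arbitrary illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=0, stratify=data.target
)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Shuffle each feature 10 times on the held-out rows and record the accuracy drop.
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
for j in result.importances_mean.argsort()[::-1][:5]:
    print(f"{data.feature_names[j]:<25} "
          f"{result.importances_mean[j]:.4f} +/- {result.importances_std[j]:.4f}")
```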

A known caveat, documented in detail by Strobl and colleagues, is that when two features are strongly correlated, permuting one of them may have only a small effect on OOB error because the other still carries the same information [12]. As a result, permutation importance can underestimate the importance of correlated features. Strobl's group proposed a **conditional permutation importance** that permutes a feature within strata defined by correlated variables to give a more faithful picture, and grouped permutation importance addresses the problem in part [12]. Janitza, Celik and Boulesteix later proposed a fast OOB-based importance test that yields p-values for high-dimensional data [13].

## Comparison to cross-validation and holdout

OOB error, [cross-validation](/wiki/cross-validation), and [holdout validation](/wiki/holdout) all estimate generalization error but differ in cost, bias, and applicability. The table below summarizes the trade-offs.

| Property | OOB evaluation | k-fold cross-validation | Hold-out (single split) |
|---|---|---|---|
| Applicable to | Bagged ensembles only (random forest, bagged trees, etc.) | Any model | Any model |
| Extra training passes required | None (uses ensemble already trained) | k full retrainings | None beyond one training run |
| Effective fraction of data used per estimate | ~36.8% OOB for each base learner; full dataset across the ensemble | (k-1)/k for training, 1/k for testing per fold | User-defined (e.g., 80/20) |
| Bias of estimate | Converges to leave-one-out for the ensemble; slightly pessimistic for small forests, can over- or under-estimate in edge cases | Slight pessimistic bias for small k | Depends on split |
| Variance of estimate | Low for large m | Moderate; depends on k | High; depends on the single split |
| Suitable for hyperparameter tuning | Yes, but care needed (no separate validation set) | Yes | Yes |
| Suitable for [time series](/wiki/time_series_analysis) | No (random sampling breaks temporal order) | Special variants needed | Yes with chronological split |
| Computational cost | Negligible additional cost | k times training cost | One training cost |

Breiman's 1996 OOB technical report and follow-up empirical studies showed that OOB error and 10-fold cross-validation give very similar estimates for random forests and bagged trees [2]. As the OOB error stabilizes over many trees it converges to the leave-one-out cross-validation error, since each row is scored by trees that never saw it. For practical purposes, OOB error is often the preferred estimator when working with bagged ensembles because it avoids the k-fold retraining cost entirely.

For non-bagging models such as a single [decision tree](/wiki/decision_tree), [logistic regression](/wiki/logistic_regression), [support vector machine](/wiki/support_vector_machine_svm), or [neural network](/wiki/neural_network), OOB evaluation is not directly applicable and cross-validation or hold-out remains the standard.

### When is OOB error optimistic or pessimistic?

The direction of OOB bias depends on the setting, and the literature documents cases in both directions.

- **Pessimistic (overestimates error) in the standard tree-count argument.** As explained above, an OOB prediction aggregates only about m/3 trees, so for any fixed forest size the OOB error is slightly worse than the final forest's error and overestimates it. Breiman's fix is to grow more trees until the estimate stabilizes [3].
- **Pessimistic in small, balanced, high-dimensional problems.** Janitza and Hornung showed in a 2018 *PLOS ONE* study that OOB error can substantially overestimate the true error in "high-risk" settings with few observations, many predictors, nearly balanced classes, and weak effects. For an extreme case (n = 20, p = 1000) the gap between OOB and test error was between 10% and 30%, driven by class-distribution differences between the in-bag and OOB portions of each bootstrap sample; they recommend stratified subsampling with class-proportional sampling fractions as a remedy [18].
- **Optimistic (underestimates error) when observations are not independent.** If rows are temporally or spatially correlated, or there are near-duplicate rows, a row can be "out of bag" for a tree that nonetheless trained on a near-copy of it. The OOB row is then not truly unseen, so OOB error can be optimistically biased. This is the main reason OOB is unsuitable for [time series](/wiki/time_series_analysis) and grouped data.
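
The near-duplicate effect in the last bullet can be demonstrated on synthetic data. In this sketch, `flip_y=0.2` randomly reassigns the class of about 20% of rows so that no model can score near-perfectly on new data, and every training row is then duplicated exactly; the dataset sizes are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)


def oob_and_test(X_fit, y_fit):
    """Fit a forest and return (OOB accuracy, accuracy on the untouched test rows)."""
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X_fit, y_fit)
    return rf.oob_score_, rf.score(X_test, y_test)


oob_clean, test_clean = oob_and_test(X_train, y_train)
# Append an exact copy of every training row, so each row now has a twin.
oob_dup, test_dup = oob_and_test(np.vstack([X_train, X_train]),
                                 np.concatenate([y_train, y_train]))

print(f"no duplicates:   OOB={oob_clean:.3f}  test={test_clean:.3f}")
print(f"with duplicates: OOB={oob_dup:.3f}  test={test_dup:.3f}")
```

With duplicates, a row that is out of bag for a tree often has its twin in that tree's bootstrap sample, so the OOB score rises while the test accuracy does not.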

Largely because of the pessimism in small and balanced problems, the scikit-learn user guide advises caution: "Out-of-bag estimates are usually very pessimistic thus we recommend to use cross-validation instead and only use OOB if cross-validation is too time consuming" [20]. For large datasets with many trees, by contrast, OOB and cross-validation typically agree closely, and OOB's near-zero cost makes it the practical default.

## The 0.632 and 0.632+ bootstrap

Closely related to OOB evaluation are two bootstrap-based error estimators developed by Bradley Efron and Robert Tibshirani in the 1980s and 1990s.

### 0.632 bootstrap

Efron's 1983 paper "Estimating the Error Rate of a Prediction Rule" introduced the **0.632 bootstrap estimator** [7]. It addresses a known issue with the naive bootstrap error estimate (the average error on each bootstrap sample's OOB set), which tends to be pessimistically biased because each bootstrap model is trained on only about 63.2% of the unique observations. The 0.632 estimator combines the OOB-style error with the resubstitution (training) error using the constant 0.632:

$$
\text{err}_{0.632} = 0.368 \, \text{err}_{\text{train}} + 0.632 \, \text{err}_{\text{OOB}}
$$

The weights are chosen because, in expectation, each bootstrap training sample contains 63.2% of the unique original observations. The 0.632 estimator can have lower bias than either $$\text{err}_{\text{train}}$$ (which underestimates) or $$\text{err}_{\text{OOB}}$$ (which overestimates) alone, particularly for unstable learners.
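
As a small worked example, the blend can be written as a one-line function. The error rates below are hypothetical numbers chosen only to show the arithmetic.

```python
def bootstrap_632(err_train: float, err_oob: float) -> float:
    """Efron's 0.632 estimator: a fixed-weight blend of training and OOB-style error."""
    return 0.368 * err_train + 0.632 * err_oob


# Hypothetical inputs: 2% training error, 10% OOB-style error.
estimate = bootstrap_632(err_train=0.02, err_oob=0.10)
print(f"0.632 estimate: {estimate:.5f}")
```

The result lies between the two inputs and closer to the OOB-style error, as the weights imply.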

### 0.632+ bootstrap

The 0.632 estimator can fail in cases of severe overfitting, where the training error is essentially zero but the model has memorized the data. Efron and Tibshirani's 1997 paper "Improvements on Cross-Validation: The .632+ Bootstrap Method" introduced the **0.632+ bootstrap estimator** [8], which adjusts the weighting based on a measure of overfitting called the relative overfitting rate R:

$$
\text{err}_{0.632+} = (1 - w) \, \text{err}_{\text{train}} + w \, \text{err}_{\text{OOB}}
$$

where $$w = \frac{0.632}{1 - 0.368 R}$$ and R is the relative overfitting rate, with $$R = 0$$ corresponding to no overfitting (recovers the standard 0.632 estimator) and $$R = 1$$ corresponding to maximal overfitting (collapses to $$\text{err}_{\text{OOB}}$$). The 0.632+ method gives a more honest error estimate when the underlying learner severely overfits.
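
The adaptive weighting can be sketched as below. The article does not spell out how R is computed; this sketch assumes the definition from Efron and Tibshirani's paper, R = (err_OOB - err_train) / (gamma - err_train), where gamma is the no-information error rate, and it takes gamma as an input rather than estimating it.

```python
def bootstrap_632_plus(err_train: float, err_oob: float, gamma: float) -> float:
    """0.632+ estimator; gamma is the no-information error rate (supplied by the caller)."""
    err_oob = min(err_oob, gamma)  # cap the OOB-style error at the no-information rate
    if gamma > err_train and err_oob > err_train:
        r = (err_oob - err_train) / (gamma - err_train)  # relative overfitting rate R
    else:
        r = 0.0  # no measurable overfitting: fall back to the plain 0.632 weights
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * err_train + w * err_oob


# Hypothetical inputs for a balanced binary problem, where gamma is about 0.5.
mild = bootstrap_632_plus(err_train=0.02, err_oob=0.10, gamma=0.5)
memorized = bootstrap_632_plus(err_train=0.0, err_oob=0.5, gamma=0.5)
```

In the second call the training error is zero and the OOB-style error equals gamma, so R = 1, w = 1, and the estimate collapses to the OOB-style error.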

### Relationship to OOB evaluation

The 0.632 and 0.632+ estimators apply to general predictive models trained on bootstrap samples, not just to bagged ensembles [9]. They share the same statistical foundation as OOB evaluation (the $$(1 - 1/n)^n \to 1/e$$ limit) but combine the OOB-style error with the training error to correct for known biases. In bagged ensembles, the OOB error from Breiman's procedure is essentially an unweighted version of err_OOB; when the goal is to evaluate the ensemble itself rather than a single bootstrap-trained model, the unweighted OOB error is usually preferred. Both estimators are available in Python through the `bootstrap_point632_score` function in the mlxtend library, which exposes `method='.632'`, `method='.632+'`, and a plain `method='oob'` option for comparison [19].

| Estimator | Formula | Best for |
|---|---|---|
| Naive resubstitution | $$\text{err}_{\text{train}}$$ | Lower bound only; severely biased downward |
| Leave-one-out CV | Average error over n folds, each trained on n-1 observations and tested on the one held out | General models; high computational cost |
| OOB (Breiman) | Average loss between aggregated OOB predictions and labels | [Bagging](/wiki/bagging) ensembles; near-zero extra cost [2] |
| 0.632 bootstrap (Efron 1983) | $$0.368 \, \text{err}_{\text{train}} + 0.632 \, \text{err}_{\text{OOB}}$$ | Single bootstrap-trained models; corrects pessimistic bias [7] |
| 0.632+ bootstrap (Efron and Tibshirani 1997) | Adaptive weighting using relative overfitting rate R | Severely overfitting learners [8] |

## Limitations and best practices

OOB evaluation is one of the most useful tools in ensemble learning, but it has well-known limitations.

### Limitations

- **Requires bootstrap sampling.** OOB evaluation is meaningless without it. Models such as Extra-Trees (Extremely Randomized Trees) by default do not bootstrap, in which case OOB scoring must be turned off or `bootstrap=True` must be set [11].
- **Small ensembles produce noisy OOB estimates.** With only a handful of base learners, some training observations may be predicted by very few OOB models or none at all, in which case scikit-learn's `oob_decision_function_` can even contain `NaN` for those rows. A typical guideline is to use at least 200 to 500 base learners [11][15].
- **Can be biased in small samples.** When the training set is very small, the bootstrap sample may contain extreme duplication and the OOB set may not be representative of the population. As noted above, Janitza and Hornung found OOB error overestimating the true error by 10% to 30% in extreme small-n high-dimensional cases [18].
- **Not directly applicable to time series or grouped data.** Bootstrap sampling assumes exchangeable observations. For [time series data](/wiki/time_series_analysis), randomly resampled bootstrap samples break the temporal dependence structure, and OOB error can be optimistically biased because a row left out of a tree may still be represented by a temporally adjacent, highly correlated row that was kept. Block bootstrap or specialized time-series cross-validation is preferred in this setting.
- **Permutation importance based on OOB can be misleading for correlated features.** As noted earlier, two strongly correlated features may both show low importance even if either alone would be highly predictive [12].
- **Class imbalance can distort OOB error.** With heavily imbalanced classes, the OOB error may be dominated by the majority class. Stratified sampling with class-proportional fractions, class-weighted loss, or class-specific OOB metrics (precision, recall, F1) can address this [18].
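
The bootstrap requirement in the first bullet is easy to see with scikit-learn's `ExtraTreesClassifier`. The sketch below captures the error raised under the default `bootstrap=False` and then opts in to bootstrap sampling; the forest size is an arbitrary choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier

X, y = load_breast_cancer(return_X_y=True)

# Default Extra-Trees (bootstrap=False) cannot produce an OOB score.
try:
    ExtraTreesClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
    default_error = None
except ValueError as exc:
    default_error = str(exc)

# Opting in to bootstrap sampling makes OOB scoring available.
et = ExtraTreesClassifier(n_estimators=200, bootstrap=True, oob_score=True, random_state=0)
et.fit(X, y)

print(f"default configuration: {default_error}")
print(f"with bootstrap=True:   OOB score = {et.oob_score_:.4f}")
```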

### Best practices

- **Use `oob_score=True` whenever you train a random forest or bagging ensemble.** The cost is negligible and it provides a free generalization estimate.
- **Monitor OOB error against the number of trees** to choose a sensible value of `n_estimators`. Stop adding trees once the OOB curve plateaus, and remember Breiman's note that you should run slightly past convergence for the OOB estimate itself to be unbiased [3]. Empirical work by Probst and Boulesteix found that the marginal accuracy gain from additional trees follows a diminishing-returns curve, so very large forests rarely hurt OOB-based estimates but offer little benefit past a point [15].
- **Cross-check with a held-out test set** for any model that will be deployed. OOB error is a strong estimate but is not a substitute for evaluation on truly unseen data when the stakes are high.
- **Use OOB permutation importance** rather than mean decrease in impurity (MDI) when interpreting feature importance, especially if features have different scales or cardinalities.
- **Switch to time-aware validation for temporal data.** Replace OOB with a chronological hold-out, walk-forward validation, or block bootstrap variants designed for time series.
- **Record the random seed.** OOB scores depend on the bootstrap draws, which depend on the seed. Pinning `random_state` makes results reproducible.

## Implementation in popular libraries

OOB evaluation is supported across many of the major machine learning libraries.

| Library or framework | OOB support | How to enable |
|---|---|---|
| scikit-learn (Python) | Yes; `RandomForestClassifier`, `RandomForestRegressor`, `BaggingClassifier`, `BaggingRegressor` | `oob_score=True` (or a scoring callable) and `bootstrap=True` [11] |
| randomForest (R) | Yes; OOB error printed by default | Automatic; reported in `print(model)` and `plot(model)` [16] |
| ranger (R) | Yes; computes OOB error and prediction by default | Automatic; available in `model$prediction.error` [14] |
| H2O (Java/Python/R) | Yes; reports OOB metrics for `H2ORandomForestEstimator` | Automatic when `nfolds = 0` |
| Spark MLlib | Limited; MLlib does not expose OOB by default | Compute manually using bootstrap indices |
| XGBoost | No; XGBoost is a [boosting](/wiki/boosting) method without bootstrap sampling | Use cross-validation instead |
| LightGBM | No (default); supports row-bagging via `bagging_fraction`, but does not expose OOB error | Use cross-validation |
| CatBoost | No native OOB; relies on cross-validation | Use cross-validation |

Note that gradient boosting libraries such as XGBoost, LightGBM, and CatBoost are based on sequential boosting rather than bagging. While some of them implement optional row-subsampling that resembles bagging, the sequential nature of boosting means the OOB framework does not transfer cleanly, so cross-validation is the standard evaluation method for these libraries.

## Explain like I'm 5 (ELI5)

Imagine your class is divided into 100 small reading groups, and each group is given a random handful of storybooks from the library. Because the handfuls are picked with replacement, some books end up in lots of groups and some books end up in only a few groups. About one third of the books never make it into any given group's pile.

Later, the teacher wants to know how good the class is at understanding new stories. Instead of giving everyone a brand new book, the teacher does something clever: for each book, she finds all the groups that did not see it, asks them what they think the story is about, and combines their answers. Since those groups have never read the book, their combined answer is a fair test of how well the class can handle a new story.

That is what out-of-bag evaluation does. Each tree in the random forest is one little reading group, each training row is one storybook, and the OOB error is the score the class gets when only the groups that have never seen a row try to predict it. The whole trick is free because the trees were going to be trained anyway, and the rows they did not use were just sitting there waiting to be useful.

## References

1. Breiman, L. (1996a). "Bagging Predictors." *Machine Learning*, 24(2), 123-140. [doi:10.1007/BF00058655](https://doi.org/10.1007/BF00058655)
2. Breiman, L. (1996b). "Out-of-Bag Estimation." Technical Report, Statistics Department, University of California, Berkeley. [PDF](https://www.stat.berkeley.edu/~breiman/OOBestimation.pdf)
3. Breiman, L. (2001). "Random Forests." *Machine Learning*, 45(1), 5-32. [doi:10.1023/A:1010933404324](https://doi.org/10.1023/A:1010933404324)
4. Ho, T.K. (1995). "Random Decision Forests." *Proceedings of the Third International Conference on Document Analysis and Recognition*, Montreal, Vol. 1, 278-282.
5. Ho, T.K. (1998). "The Random Subspace Method for Constructing Decision Forests." *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 20(8), 832-844.
6. Efron, B. (1979). "Bootstrap Methods: Another Look at the Jackknife." *The Annals of Statistics*, 7(1), 1-26.
7. Efron, B. (1983). "Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation." *Journal of the American Statistical Association*, 78(382), 316-331.
8. Efron, B. & Tibshirani, R.J. (1997). "Improvements on Cross-Validation: The .632+ Bootstrap Method." *Journal of the American Statistical Association*, 92(438), 548-560.
9. Efron, B. & Tibshirani, R.J. (1993). *An Introduction to the Bootstrap*. Chapman & Hall.
10. Hastie, T., Tibshirani, R. & Friedman, J. (2009). *The Elements of Statistical Learning*, 2nd ed. Springer. Chapter 15: Random Forests.
11. Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12, 2825-2830.
12. Strobl, C., Boulesteix, A.L., Zeileis, A. & Hothorn, T. (2007). "Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution." *BMC Bioinformatics*, 8, 25.
13. Janitza, S., Celik, E. & Boulesteix, A.L. (2018). "A Computationally Fast Variable Importance Test for Random Forests for High-Dimensional Data." *Advances in Data Analysis and Classification*, 12, 885-915.
14. Wright, M.N. & Ziegler, A. (2017). "ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R." *Journal of Statistical Software*, 77(1), 1-17. [doi:10.18637/jss.v077.i01](https://doi.org/10.18637/jss.v077.i01)
15. Probst, P. & Boulesteix, A.L. (2018). "To Tune or Not to Tune the Number of Trees in Random Forest." *Journal of Machine Learning Research*, 18(181), 1-18. [Link](https://jmlr.org/papers/v18/17-269.html)
16. Liaw, A. & Wiener, M. (2002). "Classification and Regression by randomForest." *R News*, 2(3), 18-22. [PDF](https://cran.r-project.org/doc/Rnews/Rnews_2002-3.pdf)
17. Wolpert, D.H. & Macready, W.G. (1999). "An Efficient Method to Estimate Bagging's Generalization Error." *Machine Learning*, 35(1), 41-55. [doi:10.1023/A:1007519102914](https://doi.org/10.1023/A:1007519102914)
18. Janitza, S. & Hornung, R. (2018). "On the Overestimation of Random Forest's Out-of-Bag Error." *PLOS ONE*, 13(8), e0201904. [doi:10.1371/journal.pone.0201904](https://doi.org/10.1371/journal.pone.0201904)
19. Raschka, S. (2018). "MLxtend: Providing machine learning and data science utilities and extensions to Python's scientific computing stack" (`bootstrap_point632_score`). *Journal of Open Source Software*, 3(24), 638. [doi:10.21105/joss.00638](https://doi.org/10.21105/joss.00638). [Documentation](https://rasbt.github.io/mlxtend/user_guide/evaluate/bootstrap_point632_score/)
20. Scikit-learn developers. "1.11. Ensembles: Gradient boosting, random forests, bagging, voting, stacking." *scikit-learn User Guide*. [Link](https://scikit-learn.org/stable/modules/ensemble.html)