See also: Machine learning terms, Ensemble methods
Bagging, short for bootstrap aggregating, is an ensemble learning technique in machine learning designed to improve the stability and accuracy of prediction algorithms. The method works by training multiple instances of the same base learner on different bootstrap samples of the training data, then combining their predictions through averaging (for regression) or majority voting (for classification). Bagging primarily reduces variance and helps prevent overfitting, making it one of the most widely used ensemble strategies in practice.
Leo Breiman introduced bagging in his 1994 technical report and published the foundational paper "Bagging Predictors" in the journal Machine Learning in 1996. The technique was among the first effective ensemble methods and remains a building block for more advanced approaches such as random forests.
The bagging algorithm follows a straightforward three-step process: bootstrap sampling, parallel model training, and prediction aggregation.
Given an original training dataset D of size n, bagging creates m new training sets D₁, D₂, ..., Dₘ by sampling uniformly at random with replacement from D. Each bootstrap sample Dᵢ is the same size as the original dataset (n observations), but because sampling is done with replacement, some observations appear multiple times while others are left out entirely.
A key statistical property governs this process. For sufficiently large n, the probability that any given observation appears at least once in a bootstrap sample approaches 1 - 1/e, so each sample is expected to contain roughly 63.2% of the unique observations from the original dataset. The remaining 36.8% of the original observations are, on average, left out of that sample, with duplicated observations filling the sample out to size n. This means each bootstrap sample provides a meaningfully different view of the data, which is essential for producing diverse models.
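A quick simulation illustrates this property. The sketch below uses NumPy; the dataset size n = 10,000 is an arbitrary choice for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # size of the original dataset (assumed for illustration)

# Draw one bootstrap sample: n indices sampled uniformly with replacement
sample = rng.integers(0, n, size=n)

unique_fraction = np.unique(sample).size / n
print(f"Unique observations in sample: {unique_fraction:.3f}")  # ~0.632
print(f"Theoretical limit 1 - 1/e:     {1 - np.exp(-1):.3f}")   # 0.632
```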
A base learning algorithm (the "base learner") is trained independently on each of the m bootstrap samples. Because each model sees a different subset of the data, the resulting models differ from one another even though they all use the same learning algorithm. This diversity among models is what allows bagging to reduce prediction variance.
An important practical advantage is that all m models can be trained in parallel, since none depends on the output of any other. This makes bagging straightforward to distribute across multiple processors or machines.
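A minimal from-scratch sketch of this stage might look as follows; the synthetic dataset and the choice of m = 25 models are illustrative assumptions, with scikit-learn decision trees as the base learner.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

m = 25  # number of bootstrap samples / models (illustrative choice)
models = []
for _ in range(m):
    # Bootstrap sample: n indices drawn with replacement
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Each model is fit independently of the others, so this loop could be
# distributed across processes or machines without changing the result.
```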
Once all m models have been trained, their individual predictions are combined to produce the final output:
| Task | Aggregation method | Description |
|---|---|---|
| Classification | Majority voting | Each model casts one vote for a class; the class receiving the most votes is the final prediction |
| Regression | Averaging | The final prediction is the arithmetic mean of all individual model predictions |
In some implementations, soft voting is used for classification, where the predicted class probabilities from each model are averaged and the class with the highest average probability is selected.
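Each aggregation rule is only a few lines of NumPy. The sketch below uses tiny hand-made prediction arrays standing in for real model outputs, just to make the shapes concrete.

```python
import numpy as np

# Hard voting: one predicted class label per model (m models x k test points)
votes = np.array([[0, 1, 1],   # model 1's predictions
                  [0, 1, 0],   # model 2's predictions
                  [1, 1, 0]])  # model 3's predictions
# Majority vote per column (i.e., per test point)
hard = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print(hard)  # [0 1 0]

# Regression: average the per-model numeric predictions
preds = np.array([[2.0, 3.5], [2.4, 3.1], [1.9, 3.3]])
print(preds.mean(axis=0))  # [2.1 3.3]

# Soft voting: average class-probability vectors, then take the argmax
probs = np.array([[[0.9, 0.1]], [[0.6, 0.4]], [[0.4, 0.6]]])  # (m, k, classes)
print(probs.mean(axis=0).argmax(axis=1))  # [0]
```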
The effectiveness of bagging is grounded in a simple statistical principle. Consider a set of m independent random variables, each with variance σ². The variance of their average is σ²/m. In practice, the bootstrap models are not fully independent because they are all drawn from the same original dataset, but they are sufficiently different that averaging their outputs still produces a meaningful reduction in variance.
More precisely, if the pairwise correlation between any two models is ρ and each model has variance σ², then the variance of the bagged ensemble is:
Var(ensemble) = ρσ² + (1 - ρ)σ²/m
As the number of models m increases, the second term shrinks toward zero, leaving ρσ² as the irreducible floor. This formula explains why bagging is most effective when the base models are diverse (low ρ) and individually high in variance (large σ²).
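Plugging in illustrative numbers makes the limit concrete; the values of ρ, σ², and m below are assumptions chosen only for the example.

```python
# Variance of a bagged ensemble: rho*sigma2 + (1 - rho)*sigma2/m
rho, sigma2 = 0.3, 1.0  # assumed pairwise correlation and per-model variance
for m in (1, 10, 100, 1000):
    var = rho * sigma2 + (1 - rho) * sigma2 / m
    print(f"m={m:5d}  ensemble variance = {var:.4f}")
# The variance approaches the floor rho*sigma2 = 0.30 as m grows.
```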
Bagging does not reduce bias. If the base learner is biased, the bagged ensemble will retain that bias. This is why bagging works best with unstable, low-bias, high-variance learners. Deep, unpruned decision trees are the classic example: they tend to have low bias (they can fit the training data closely) but high variance (small changes in the training data can produce very different trees). Bagging smooths out this variance without sacrificing the low bias.
Conversely, bagging provides little benefit, and can even slightly degrade performance, when applied to stable, low-variance models such as k-nearest neighbors or linear regression.
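This difference is easy to observe empirically. The rough sketch below compares a single model against its bagged ensemble for both learner types; the synthetic dataset and default hyperparameters are assumptions, so exact numbers will vary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for base in (DecisionTreeClassifier(random_state=0), KNeighborsClassifier()):
    single = cross_val_score(base, X, y, cv=5).mean()
    bagged = cross_val_score(
        BaggingClassifier(estimator=base, n_estimators=50, random_state=0),
        X, y, cv=5).mean()
    print(f"{type(base).__name__:25s} single: {single:.3f}  bagged: {bagged:.3f}")

# Bagging typically lifts the high-variance tree noticeably,
# while the stable k-NN model changes little.
```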
Because each bootstrap sample includes only about 63.2% of the original observations, the remaining 36.8% that were not selected for a given model are called out-of-bag (OOB) observations. These provide a convenient, built-in validation mechanism.
For each observation xᵢ in the original dataset, there exists a subset of models that did not include xᵢ in their training data. The OOB prediction for xᵢ is obtained by aggregating the predictions of only those models. The OOB error is then computed by comparing these OOB predictions against the true labels across all observations.
The OOB error estimate has been shown to be approximately as accurate as leave-one-out cross-validation, but it comes "for free" as a byproduct of the bagging process, with no need to set aside a separate validation set or perform multiple rounds of retraining. Leo Breiman demonstrated in his 1996 technical report "Out-of-bag Estimation" that the OOB estimate is a reliable measure of generalization error, making it especially useful when data is limited.
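The mechanics can be sketched directly: track which indices each model did not see, and aggregate only those models' votes for each point. The minimal illustration below assumes binary classification and m = 50 models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, random_state=0)
n, m = len(X), 50

# votes[i, c] counts OOB votes for class c on observation i
votes = np.zeros((n, 2))
for _ in range(m):
    idx = rng.integers(0, n, size=n)        # bootstrap indices
    oob = np.setdiff1d(np.arange(n), idx)   # indices this model never saw
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    votes[oob, tree.predict(X[oob])] += 1   # record OOB votes only

covered = votes.sum(axis=1) > 0             # points with at least one OOB vote
oob_error = (votes[covered].argmax(axis=1) != y[covered]).mean()
print(f"OOB error estimate: {oob_error:.3f}")
```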
Decision trees are by far the most common base learner used with bagging. An individual unpruned decision tree is a high-variance model: it memorizes the training data and can change drastically if a few training points are altered. This makes it an ideal candidate for variance reduction through bagging.
A bagged ensemble of decision trees (sometimes called a "bagged forest") trains each tree on a different bootstrap sample and combines their predictions. The result is a model that retains the expressiveness of deep trees while achieving much lower variance and better generalization.
However, one drawback is the loss of interpretability. A single decision tree is easy to visualize and explain, but a committee of hundreds or thousands of trees is not.
Random forests, introduced by Leo Breiman in 2001, extend bagging by adding a second source of randomness. In addition to training each tree on a bootstrap sample, random forests restrict each split in each tree to consider only a random subset of features (typically √p for classification or p/3 for regression, where p is the total number of features).
This feature subsampling further decorrelates the individual trees, reducing the pairwise correlation ρ in the variance formula. The result is an ensemble with even lower variance than standard bagged trees, which explains why random forests consistently rank among the best-performing off-the-shelf classifiers.
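In scikit-learn the feature-subsampling behavior is controlled by max_features. The sketch below contrasts the two configurations side by side; the dataset and ensemble sizes are illustrative, so the scores are only indicative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=40, random_state=0)

# Bagged trees: every split considers all 40 features
bagged = BaggingClassifier(estimator=DecisionTreeClassifier(),
                           n_estimators=200, random_state=0)
# Random forest: each split considers only sqrt(p) features
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=0)

print("bagged trees :", cross_val_score(bagged, X, y, cv=5).mean().round(3))
print("random forest:", cross_val_score(forest, X, y, cv=5).mean().round(3))
```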
| Method | Bootstrap samples | Feature subsampling | Correlation between trees |
|---|---|---|---|
| Bagged trees | Yes | No (all features at each split) | Moderate |
| Random forest | Yes | Yes (random subset at each split) | Lower |
Bagging and boosting are both ensemble methods, but they operate on fundamentally different principles.
| Aspect | Bagging | Boosting |
|---|---|---|
| Training | Models trained independently, in parallel | Models trained sequentially; each model focuses on errors of previous models |
| Goal | Reduce variance | Reduce bias (and variance) |
| Error focus | Equal treatment of all training examples | Higher weight given to misclassified examples |
| Sensitivity to noise | Robust to noisy data and outliers | More sensitive to noise and outliers |
| Overfitting risk | Low | Higher if not properly regularized |
| Hyperparameter sensitivity | Low; adding more models rarely hurts | Higher; requires careful tuning of learning rate and iterations |
| Typical base learners | High-variance models (deep trees) | Weak learners (shallow trees, stumps) |
| Notable algorithms | Random forest, bagged trees | AdaBoost, gradient boosting, XGBoost, LightGBM |
In general, boosting can achieve higher accuracy on clean data by addressing both bias and variance, but bagging is more robust and simpler to configure. The choice depends on the dataset, the noise level, and the amount of tuning effort available.
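A side-by-side run makes the contrast concrete. The sketch below uses deep trees for bagging and stumps for boosting, per the table above; the noisy synthetic dataset (flip_y adds 10% label noise) and all other settings are assumptions, so the numbers are only indicative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, flip_y=0.1, random_state=0)

# Bagging: deep (high-variance) trees, trained independently
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=100, random_state=0)
# Boosting: shallow trees (stumps), trained sequentially
boost = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                           n_estimators=100, random_state=0)

for name, model in [("bagging", bag), ("boosting", boost)]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```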
Pasting is a closely related method that differs in one key detail: it samples training subsets without replacement instead of with replacement. This means each observation can appear at most once in a given subset, and the subsets are typically smaller than the full training set.
| Aspect | Bagging | Pasting |
|---|---|---|
| Sampling method | With replacement (bootstrap) | Without replacement |
| Subset size | Typically same as original dataset | Smaller than original dataset |
| Duplicate observations in subset | Yes | No |
| OOB estimation | Available (from unselected samples) | Not directly available |
Both methods can produce diverse ensembles, but bagging is more commonly used because the bootstrap procedure naturally creates the variation needed for effective ensembles and enables OOB error estimation.
Subagging (subsample aggregating), proposed by Bühlmann and Yu in 2002, replaces bootstrap sampling with subsampling without replacement, drawing subsets of size k < n from the original dataset (k here is the subsample size, distinct from the number of models m). This approach can achieve variance reduction similar to full bagging, sometimes at lower computational cost because each model trains on a smaller dataset.
The random subspace method, proposed by Ho in 1998, trains each model on all instances but only a random subset of features.
The random patches method, introduced by Louppe and Geurts in 2012, generalizes both bagging and random subspaces by drawing random subsets of both instances and features for each base model. This creates even greater diversity among models and can be particularly effective in high-dimensional settings. In scikit-learn, this is achieved by setting both max_samples and max_features to values less than 1.0 in the BaggingClassifier.
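All of these variants map onto BaggingClassifier parameters. The configurations below are illustrative sketches; the specific fractions are arbitrary choices, not recommended values.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()

# Pasting: sample instances without replacement
pasting = BaggingClassifier(estimator=tree, bootstrap=False, max_samples=0.8)

# Subagging: smaller subsamples, also drawn without replacement
subagging = BaggingClassifier(estimator=tree, bootstrap=False, max_samples=0.5)

# Random subspaces: all instances, random subset of features per model
subspaces = BaggingClassifier(estimator=tree, bootstrap=False, max_samples=1.0,
                              bootstrap_features=False, max_features=0.5)

# Random patches: random subsets of both instances and features
patches = BaggingClassifier(estimator=tree, max_samples=0.7, max_features=0.5)
```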
Bagging can be applied to neural networks, which are themselves unstable, high-variance learners. Training multiple neural networks on different bootstrap samples and averaging their predictions often improves generalization. This approach has been used in deep learning competition entries and production systems.
However, training multiple full neural networks is computationally expensive. In practice, several approximations to bagging are more commonly used with neural networks:
- Dropout, which randomly deactivates units during training and can be viewed as implicitly training and averaging a very large ensemble of subnetworks that share weights.
- Snapshot ensembling, which saves model checkpoints at several points along a single training run and averages their predictions, avoiding the cost of separate training runs.
These techniques capture much of the variance-reduction benefit of bagging at a fraction of the computational cost.
| Scenario | Bagging effectiveness | Reason |
|---|---|---|
| High-variance base learner (e.g., deep decision tree) | Very effective | Large variance to reduce |
| Noisy training data | Effective | Bootstrap smooths out noise |
| Low-variance base learner (e.g., k-nearest neighbors) | Minimal benefit or slightly harmful | Little variance to reduce |
| High-bias base learner (e.g., decision stump) | Ineffective | Bagging does not reduce bias |
| Small dataset | Effective if combined with OOB estimation | Makes good use of limited data |
| Large number of features | Effective when paired with feature subsampling | Reduces correlation between models |
Scikit-learn provides BaggingClassifier and BaggingRegressor in the sklearn.ensemble module. These meta-estimators wrap any base learner and handle the bootstrap sampling, parallel training, and aggregation automatically.
```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=1.0,   # each bootstrap sample is same size as training set
    bootstrap=True,    # sample with replacement (True = bagging, False = pasting)
    oob_score=True,    # compute out-of-bag error estimate
    n_jobs=-1,         # train all models in parallel
    random_state=42
)
bagging_clf.fit(X_train, y_train)

print(f"Test accuracy: {bagging_clf.score(X_test, y_test):.4f}")
print(f"OOB score: {bagging_clf.oob_score_:.4f}")
```
Key parameters include:
| Parameter | Description | Default |
|---|---|---|
| estimator | The base learning algorithm to bag | DecisionTreeClassifier |
| n_estimators | Number of base models in the ensemble | 10 |
| max_samples | Number or fraction of samples per bootstrap | 1.0 |
| max_features | Number or fraction of features per model | 1.0 |
| bootstrap | Whether to sample with replacement | True |
| oob_score | Whether to compute OOB error estimate | False |
| n_jobs | Number of parallel jobs (-1 for all cores) | None |
Setting bootstrap=False switches from bagging to pasting. Setting max_features to a value less than 1.0 enables random subspace or random patches behavior.
Breiman's 1996 bagging paper has accumulated over 16,000 citations and is considered one of the landmark contributions to machine learning. It demonstrated that simple resampling and aggregation could dramatically improve the performance of unstable learners, which opened the door to the entire field of ensemble methods.
Bagging laid the groundwork for random forests (Breiman, 2001), which became one of the most popular and successful machine learning algorithms across domains ranging from bioinformatics to computer vision to finance. The principles behind bagging also influenced the development of boosting methods such as AdaBoost and gradient boosting, even though those methods follow a different (sequential) strategy.
Today, bagging remains widely used both directly (through random forests and scikit-learn's bagging estimators) and indirectly (through dropout in neural networks and other variance-reduction techniques). Its simplicity, robustness, and theoretical grounding make it a foundational concept that every machine learning practitioner should understand.
Imagine you want to guess how many candies are in a jar. Instead of asking just one friend, you ask 100 friends to each take a quick look and make their best guess. Each friend sees the jar from a slightly different angle, so their guesses are all a little different.
To get your final answer, you add up all their guesses and divide by 100 to get the average. That average is almost always closer to the real number than any single friend's guess would be.
Bagging works the same way. It gives each "friend" (a model) a slightly different version of the data to learn from, then combines all their answers. The individual models might make mistakes, but their mistakes tend to cancel each other out when you average them together.