Bagging
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 6,226 words
See also: Machine learning terms, Ensemble methods, Random forest, Boosting
Bagging, short for bootstrap aggregating, is an ensemble learning technique in machine learning designed to improve the stability and accuracy of prediction algorithms. The method works by training multiple instances of the same base learner on different bootstrap samples of the training data, then combining their predictions through averaging (for regression) or majority voting (for classification). Bagging primarily reduces variance and helps prevent overfitting, making it one of the most widely used ensemble strategies in practice.
Leo Breiman introduced bagging in his 1994 technical report and published the foundational paper Bagging Predictors in the journal Machine Learning in 1996 (Volume 24, Issue 2, pages 123 to 140). The technique was among the first effective ensemble methods and remains a building block for more advanced approaches such as random forests, extra trees, and the implicit ensembles induced by dropout in neural networks. As of 2026, bagging is taught in nearly every introductory machine learning course and is implemented in every major library, including scikit-learn, Spark MLlib, H2O, and R packages such as randomForest and ranger.
The enduring popularity of bagging on tabular data was reinforced by Grinsztajn, Oyallon, and Varoquaux in their 2022 NeurIPS paper Why do tree-based models still outperform deep learning on typical tabular data? (arXiv:2207.08815), which benchmarked 45 tabular datasets and concluded that bagging-based and gradient boosting tree ensembles remain state of the art on medium-sized tabular problems even compared to modern deep tabular architectures.
Before bagging, ensemble approaches in statistics were limited mostly to model averaging in Bayesian inference and to ad hoc voting schemes among classifiers. The dominant view in the early 1990s held that high-variance models such as fully grown decision trees and neural networks were inherently unreliable and required heavy regularization, pruning, or weight decay to generalize.
Leo Breiman, then a professor of statistics at the University of California, Berkeley, took a different angle. Inspired by the bootstrap resampling technique developed by Bradley Efron in 1979, Breiman asked whether averaging predictions from many models trained on bootstrap replicates of the data could reduce variance without increasing bias. His 1994 technical report, Bagging Predictors, formalized the procedure. The paper was accepted into Machine Learning and appeared in 1996. It became one of the most influential machine learning papers of the decade, accumulating tens of thousands of citations.
Key milestones in the development of bagging and its descendants:
| Year | Event | Reference |
|---|---|---|
| 1979 | Efron introduces the bootstrap | Efron, Annals of Statistics |
| 1994 | Breiman circulates the Bagging Predictors technical report | UC Berkeley TR No. 421 |
| 1996 | Breiman publishes Bagging Predictors in Machine Learning | Breiman 1996a |
| 1996 | Breiman publishes Out-of-Bag Estimation technical report | Breiman 1996b |
| 1998 | Ho introduces the random subspace method | Ho 1998 |
| 2001 | Breiman publishes Random Forests | Breiman 2001 |
| 2002 | Buhlmann and Yu publish Analyzing Bagging, introducing subagging | Annals of Statistics |
| 2006 | Geurts, Ernst, and Wehenkel introduce extremely randomized trees | Machine Learning journal |
| 2012 | Louppe and Geurts introduce ensembles on random patches | ECML PKDD 2012 |
| 2014 | Srivastava et al. publish Dropout, framing it as approximate bagging | JMLR 15 |
| 2022 | Grinsztajn et al. show tree ensembles still beat deep nets on tabular data | NeurIPS 2022 |
Bagging is also a core inspiration for the modern view of model uncertainty estimation through deep ensembles, which were popularized by Lakshminarayanan, Pritzel, and Blundell in 2017.
The bagging algorithm follows a straightforward three-step process: bootstrap sampling, parallel model training, and prediction aggregation.
Given an original training dataset $D$ of size $n$, bagging creates $m$ new training sets $D_1, D_2, \dots, D_m$ by sampling uniformly at random with replacement from $D$. Each bootstrap sample $D_i$ is the same size as the original dataset ($n$ observations), but because sampling is done with replacement, some observations appear multiple times while others are left out entirely.
A key statistical property governs this process. The probability that a given observation is not chosen in a single draw is $1 - 1/n$. The probability it is missed across all $n$ draws is $(1 - 1/n)^n$, which converges to $1/e \approx 0.368$ as $n$ grows. Therefore, each bootstrap sample is expected to contain approximately $1 - 1/e \approx 0.632$, or roughly 63.2%, of the unique observations from the original dataset. The remaining 36.8% of the original observations are left out of that sample, and the slots they would have occupied are filled by repeats of observations that were selected. This means each bootstrap sample provides a meaningfully different view of the data, which is essential for producing diverse models.
| $n$ | Expected unique fraction | Expected OOB fraction |
|---|---|---|
| 10 | 0.6513 | 0.3487 |
| 100 | 0.6340 | 0.3660 |
| 1,000 | 0.6323 | 0.3677 |
| 10,000 | 0.6321 | 0.3679 |
| $\infty$ | $1 - 1/e \approx 0.6321$ | $1/e \approx 0.3679$ |
The values converge quickly, which is why the 63.2% / 36.8% split is treated as a constant in practice.
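The table values can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function names are illustrative and not taken from any package.

```python
import random


def expected_unique_fraction(n: int) -> float:
    """Closed-form expected share of distinct observations in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_unique_fraction(n: int, trials: int = 200, seed: int = 0) -> float:
    """Monte Carlo estimate: draw n indices with replacement and count the distinct ones."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [rng.randrange(n) for _ in range(n)]
        total += len(set(sample)) / n
    return total / trials


for n in (10, 100, 1000):
    # Columns: n, closed-form value, simulated value
    print(n, round(expected_unique_fraction(n), 4), round(simulated_unique_fraction(n), 4))
```

The simulated column should land close to the closed-form one, and both approach $1 - 1/e$ as $n$ grows.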
A base learning algorithm (the "base learner") is trained independently on each of the $m$ bootstrap samples. Because each model sees a different subset of the data, the resulting models differ from one another even though they all use the same learning algorithm. This diversity among models is what allows bagging to reduce prediction variance.
An important practical advantage is that all $m$ models can be trained in parallel, since none depends on the output of any other. This makes bagging straightforward to distribute across multiple processors or machines. By contrast, boosting is inherently sequential: each round depends on the residuals or weights produced by the previous round.
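The sampling and training steps can be sketched from scratch. The example below is an illustration, not a reference implementation: it assumes NumPy and scikit-learn's DecisionTreeRegressor as the base learner, uses made-up toy data, and the helper names fit_bagged_trees and predict_bagged are hypothetical. A plain average is used to combine the models.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_bagged_trees(X, y, m=50, seed=0):
    """Train m fully grown trees, each on its own bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)  # n draws with replacement
        tree = DecisionTreeRegressor(random_state=0)
        models.append(tree.fit(X[idx], y[idx]))
    return models


def predict_bagged(models, X):
    """Aggregate by averaging the individual predictions."""
    return np.mean([model.predict(X) for model in models], axis=0)


# Toy regression problem (an assumption for illustration): noisy sine curve.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

models = fit_bagged_trees(X, y)
X_grid = np.linspace(-3, 3, 5).reshape(-1, 1)
preds = predict_bagged(models, X_grid)
print(preds)
```

Each iteration of the training loop is independent of the others, so the loop body could be handed to separate workers without changing the result.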
Once all $m$ models have been trained, their individual predictions are combined to produce the final output:
| Task | Aggregation method | Description |
|---|---|---|
| Classification | Majority voting | Each model casts one vote for a class; the class receiving the most votes is the final prediction |
| Regression | Averaging | The final prediction is the arithmetic mean of all individual model predictions |
In some implementations, soft voting is used for classification, where the predicted class probabilities from each model are averaged and the class with the highest average probability is selected. Soft voting often outperforms hard voting because it preserves the confidence information from each model.
Formally, given base models $\hat{f}_1, \hat{f}_2, \dots, \hat{f}_m$ trained on bootstrap samples $D_1, D_2, \dots, D_m$, the bagged regression prediction at input $x$ is:
$$ \hat{f}_{bag}(x) = \frac{1}{m} \sum_{i=1}^{m} \hat{f}_i(x) $$
For classification with soft voting, the predicted probability for class $k$ is:
$$ \hat{p}_{bag}(y = k \mid x) = \frac{1}{m} \sum_{i=1}^{m} \hat{p}_i(y = k \mid x) $$
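To see how hard and soft voting can disagree, consider a small NumPy sketch. The probability values below are invented for illustration, and hard_vote and soft_vote are hypothetical helper names.

```python
import numpy as np

# Hypothetical class-probability outputs of m = 3 models for 2 inputs and 2 classes.
# Shape: (models, samples, classes).
probas = np.array([
    [[0.55, 0.45], [0.10, 0.90]],
    [[0.60, 0.40], [0.40, 0.60]],
    [[0.05, 0.95], [0.45, 0.55]],
])


def hard_vote(probas):
    """Majority vote: each model votes for its most probable class."""
    votes = probas.argmax(axis=2)  # (models, samples)
    n_classes = probas.shape[2]
    counts = np.stack([(votes == k).sum(axis=0) for k in range(n_classes)], axis=1)
    return counts.argmax(axis=1)  # (samples,)


def soft_vote(probas):
    """Average the class probabilities across models, then pick the largest."""
    return probas.mean(axis=0).argmax(axis=1)


print(hard_vote(probas))  # [0 1]
print(soft_vote(probas))  # [1 1]
```

For the first input, two models lean weakly toward class 0 while the third is very confident in class 1. Majority voting picks class 0, whereas the averaged probabilities (0.4 versus 0.6) favor class 1.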
The effectiveness of bagging is grounded in a simple statistical principle. Consider a set of $m$ independent random variables, each with variance $\sigma^2$. The variance of their average is $\sigma^2 / m$. In practice, the bootstrap models are not fully independent because they are all drawn from the same original dataset, but they are sufficiently different that averaging their outputs still produces a meaningful reduction in variance.
More precisely, if the pairwise correlation between any two models is $\rho$ and each model has variance $\sigma^2$, then the variance of the bagged ensemble is:
$$ \operatorname{Var}(\hat{f}_{bag}) = \rho \sigma^2 + \frac{1 - \rho}{m} \sigma^2 $$
As the number of models $m$ increases, the second term shrinks toward zero, leaving $\rho \sigma^2$ as the irreducible floor. This formula explains why bagging is most effective when the base models are diverse (low $\rho$) and individually high in variance (large $\sigma^2$). It also motivates the design of random forests, which add feature subsampling specifically to lower $\rho$.
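The formula can be checked numerically. The sketch below assumes equicorrelated Gaussian "predictions" built from one shared noise term plus one private term per model, which is an idealization rather than a model of real bootstrap replicates; the function names are illustrative.

```python
import numpy as np


def ensemble_variance(rho, sigma2, m):
    """Variance of the average of m predictors with pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) / m * sigma2


def simulated_ensemble_variance(rho, sigma2, m, trials=50_000, seed=0):
    """Simulate m equicorrelated predictors and measure the variance of their mean."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((trials, 1))   # component common to all models
    private = rng.standard_normal((trials, m))  # component unique to each model
    preds = np.sqrt(sigma2) * (np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private)
    return preds.mean(axis=1).var()


for m in (1, 10, 100):
    print(m, ensemble_variance(0.3, 4.0, m), round(simulated_ensemble_variance(0.3, 4.0, m), 3))
```

With $\rho = 0.3$ and $\sigma^2 = 4$, the variance falls from 4.0 for a single model toward the floor $\rho \sigma^2 = 1.2$, and adding models beyond a certain point yields little further reduction.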
The expected squared error of a regression model can be decomposed as bias squared plus variance plus irreducible noise:
$$ \mathbb{E}[(y - \hat{f}(x))^2] = \text{Bias}^2(\hat{f}(x)) + \operatorname{Var}(\hat{f}(x)) + \sigma_{\epsilon}^2 $$
Bagging reduces the second term while leaving the first term essentially unchanged. If the base learner is biased, the bagged ensemble will retain that bias. This is why bagging works best with unstable, low-bias, high-variance learners. Deep, unpruned decision trees are the classic example: they tend to have low bias (they can fit the training data closely) but high variance (small changes in the training data can produce very different trees). Bagging smooths out this variance without sacrificing the low bias.
Conversely, bagging provides little benefit, and can even slightly degrade performance, when applied to stable, low-variance models such as k-nearest neighbors, linear regression, or shallow trees with high bias.
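One way to probe this contrast empirically is to bag an unstable and a stable learner on the same data and compare cross-validated accuracy. The sketch below assumes scikit-learn and a synthetic dataset; the exact numbers depend on the data and library version, so no particular outcome is claimed here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, chosen only for illustration.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

learners = {
    "deep tree": DecisionTreeClassifier(random_state=0),  # unstable, high variance
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),    # comparatively stable
}

scores = {}
for name, base in learners.items():
    # The base learner is passed as the first positional argument.
    bagged = BaggingClassifier(base, n_estimators=50, random_state=0)
    scores[name] = {
        "single": cross_val_score(base, X, y, cv=5).mean(),
        "bagged": cross_val_score(bagged, X, y, cv=5).mean(),
    }

for name, result in scores.items():
    print(f"{name}: single={result['single']:.3f}  bagged={result['bagged']:.3f}")
```

On data like this, the gap between the single and bagged scores is usually the quantity to look at: a large gap suggests the base learner had variance to spare.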
Buhlmann and Yu, in their 2002 paper Analyzing Bagging in the Annals of Statistics, gave the first rigorous theoretical treatment of why bagging reduces variance for hard-decision predictors such as decision trees. They showed that bagging effectively smooths discontinuous decision rules near their decision boundaries, converting an indicator function into something closer to a smooth probability. Within a cylindric neighborhood of width $n^{-1/3}$ around the optimal split point, both the variance and the mean squared error of bagged decision stumps can be characterized in closed form. They also introduced subagging (subsample aggregating), which uses subsampling without replacement and which they showed achieves comparable accuracy at lower computational cost.
Later theoretical work by Biau and Scornet (2016) and others extended this analysis to random forests, providing consistency results and asymptotic distribution theorems. Mentch and Hooker (2016) developed inferential procedures using the U-statistic framework to construct confidence intervals around bagged predictions.
Because each bootstrap sample includes only about 63.2% of the original observations, the remaining 36.8% that were not selected for a given model are called out-of-bag (OOB) observations. These provide a convenient, built-in validation mechanism. The OOB estimate is sometimes treated as a free pseudo cross-validation procedure.
For each observation $x_i$ in the original dataset, there exists a subset of models that did not include $x_i$ in their training data (on average about 36.8% of the $m$ models). The OOB prediction for $x_i$ is obtained by aggregating the predictions of only those models. The OOB error is then computed by comparing these OOB predictions against the true labels across all observations.
The OOB error estimate has been shown to be approximately equivalent to leave-one-out cross-validation error, but it comes "for free" as a byproduct of the bagging process, with no need to set aside a separate validation set or perform multiple rounds of retraining. Leo Breiman demonstrated in his 1996b paper Out-of-Bag Estimation that the OOB estimate is a reliable measure of generalization error, making it especially useful when data is limited.
Let $S_i$ denote the set of model indices $j$ for which $x_i \notin D_j$ (that is, the models that did not see $x_i$ during training). The OOB prediction for observation $i$ in regression is:
$$ \hat{y}_i^{OOB} = \frac{1}{|S_i|} \sum_{j \in S_i} \hat{f}_j(x_i) $$
The OOB error is then:
$$ \text{OOB error} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i^{OOB}) $$
for an appropriate loss $L$ (squared error for regression, zero-one loss or log loss for classification). Because $|S_i| \approx 0.368\,m$ on average, an ensemble with $m$ in the hundreds gives each observation anywhere from a few dozen to a few hundred OOB models, which keeps the OOB prediction stable.
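The OOB computation can be written out by hand. This is a minimal sketch assuming NumPy, scikit-learn's DecisionTreeRegressor as the base learner, squared-error loss, and invented toy data; the variable names mirror the notation above but are otherwise arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n, m = 200, 100
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)

oob_sum = np.zeros(n)               # running sum of OOB predictions per observation
oob_count = np.zeros(n, dtype=int)  # |S_i|: number of models that never saw x_i

for j in range(m):
    idx = rng.integers(0, n, size=n)  # bootstrap sample D_j
    in_bag = np.zeros(n, dtype=bool)
    in_bag[idx] = True
    oob = ~in_bag                     # observations left out of D_j
    tree = DecisionTreeRegressor(random_state=j).fit(X[idx], y[idx])
    oob_sum[oob] += tree.predict(X[oob])
    oob_count[oob] += 1

covered = oob_count > 0  # guard against observations that were never out-of-bag
y_oob = oob_sum[covered] / oob_count[covered]
oob_mse = np.mean((y[covered] - y_oob) ** 2)

print(f"mean |S_i| / m: {oob_count.mean() / m:.3f}")
print(f"OOB mean squared error: {oob_mse:.3f}")
```

The first printed value should sit near 0.368, matching the expected OOB fraction.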
A closely related use of OOB samples is permutation feature importance, which Breiman introduced for random forests in his 2001 paper. The idea is to measure the increase in OOB error when the values of a single feature are randomly permuted across the OOB samples. A feature whose permutation drives OOB error up sharply is considered important; a feature whose permutation has no effect is considered uninformative. This procedure works for any bagged ensemble, not only random forests. scikit-learn provides a closely related utility, permutation_importance in sklearn.inspection, which applies the same permutation test to whatever dataset is supplied (typically a held-out set) rather than to the OOB samples.
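A short sketch of the permutation idea using scikit-learn's permutation_importance follows. Note the assumption: this function scores on the data passed to it, so a held-out validation split stands in for the OOB samples here. The dataset is synthetic, built so that the first three columns are the informative ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# shuffle=False keeps the 3 informative features in columns 0-2; the rest are noise.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Permute each column of the validation set in turn and record the drop in accuracy.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]

for col in ranking:
    print(f"feature {col}: {result.importances_mean[col]:.4f}")
```

Permuting a noise column should barely move the score, while permuting an informative column should reduce it noticeably.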
Decision trees are by far the most common base learner used with bagging. An individual unpruned decision tree is a high-variance model: it memorizes the training data and can change drastically if a few training points are altered. This makes it an ideal candidate for variance reduction through bagging.
A bagged ensemble of decision trees (sometimes called a "bagged forest") trains each tree on a different bootstrap sample and combines their predictions. The result is a model that retains the expressiveness of deep trees while achieving much lower variance and better generalization.
However, one drawback is the loss of interpretability. A single decision tree is easy to visualize and explain, but a committee of hundreds or thousands of trees is not. Practitioners typically rely on permutation importance, SHAP values, or partial dependence plots to interpret bagged tree ensembles.
A practical training note: each tree in a bagged ensemble should be grown deeply (or fully) so that it has low bias. Pruning each individual tree would remove the very property that makes bagging effective.
Random forests, introduced by Leo Breiman in 2001, extend bagging by adding a second source of randomness. In addition to training each tree on a bootstrap sample, random forests restrict each split in each tree to consider only a random subset of features (typically $\sqrt{p}$ for classification or $p/3$ for regression, where $p$ is the total number of features).
This feature subsampling further decorrelates the individual trees, reducing the pairwise correlation $\rho$ in the variance formula. The result is an ensemble with even lower variance than standard bagged trees, which helps explain why random forests consistently rank among the best-performing off-the-shelf classifiers. In a benchmark by Fernandez-Delgado et al. (2014) covering 121 UCI datasets, random forests were the top-ranked classifier family overall.
| Method | Bootstrap samples | Feature subsampling | Correlation between trees |
|---|---|---|---|
| Single decision tree | No | No | Not applicable |
| Bagged trees | Yes | No (all features at each split) | Moderate |
| Random forest | Yes | Yes (random subset at each split) | Lower |
| Extra trees | Optional | Yes plus random thresholds | Lowest |
The extremely randomized trees algorithm, introduced by Geurts, Ernst, and Wehenkel in 2006 in the Machine Learning journal (Volume 63, pages 3 to 42), pushes the random forest idea further. At each internal node, instead of computing the optimal split threshold for each candidate feature, extra trees pick a random threshold from each candidate's range and choose the best among these random thresholds. By default, scikit-learn's ExtraTreesClassifier does not even use bootstrap sampling; each tree is trained on the full dataset, and randomness comes entirely from feature and threshold sampling. This often reduces variance further at a small cost in bias.
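The four rows of the comparison table map onto scikit-learn estimators roughly as follows. This is a sketch on synthetic data; the settings are assumptions chosen to make the contrast explicit, and the resulting scores will vary with the dataset and library version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=25, n_informative=6, random_state=0)

models = {
    # One fully grown tree: no resampling, all features at every split.
    "single tree": DecisionTreeClassifier(random_state=0),
    # Bootstrap samples, all features at every split.
    "bagged trees": BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0),
    # Bootstrap samples plus a random feature subset at every split.
    "random forest": RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0),
    # Full dataset per tree, random feature subset and random thresholds.
    "extra trees": ExtraTreesClassifier(
        n_estimators=100, max_features="sqrt", bootstrap=False, random_state=0
    ),
}

cv_accuracy = {name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()}
for name, acc in cv_accuracy.items():
    print(f"{name}: {acc:.3f}")
```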
Bagging and boosting are both ensemble methods, but they operate on fundamentally different principles. Bagging aims to reduce variance by training independent models in parallel, while boosting aims to reduce bias by training models sequentially with each new model focusing on the errors of the previous ones.
| Aspect | Bagging | Boosting |
|---|---|---|
| Training | Models trained independently, in parallel | Models trained sequentially; each model focuses on errors of previous models |
| Goal | Reduce variance | Reduce bias (and variance to a lesser extent) |
| Error focus | Equal treatment of all training examples | Higher weight given to misclassified examples |
| Sensitivity to noise | Robust to noisy data and outliers | More sensitive to noise and outliers |
| Overfitting risk | Low; adding more models rarely hurts | Higher if not properly regularized |
| Hyperparameter sensitivity | Low | Higher; requires careful tuning of learning rate and iterations |
| Typical base learners | High-variance models (deep trees) | Weak learners (shallow trees, stumps) |
| Computational parallelism | Trivially parallel across models | Sequential by definition |
| OOB estimation available | Yes | Generally no (boosting uses staged predictions instead) |
| Notable algorithms | Random forest, bagged trees, extra trees | AdaBoost, gradient boosting, XGBoost, LightGBM, CatBoost |
In general, boosting can achieve higher accuracy on clean data by addressing both bias and variance, but bagging is more robust and simpler to configure. The choice depends on the dataset, the noise level, and the amount of tuning effort available.
The Grinsztajn, Oyallon, and Varoquaux 2022 NeurIPS paper Why do tree-based models still outperform deep learning on typical tabular data? benchmarked random forests, gradient boosting trees, and several deep tabular architectures across 45 medium-sized tabular datasets. They reported that boosting (gradient boosted trees) was generally the strongest single family, with random forests close behind, and that both bagging and boosting tree ensembles outperformed all neural network families they tested even after extensive hyperparameter tuning. The authors attributed this to three inductive biases of tree ensembles: robustness to uninformative features, respect for the natural orientation of the data (tree ensembles are not rotation-invariant, which suits tabular columns that each carry their own meaning), and the ability to learn irregular, piecewise-constant target functions where neural networks are biased toward overly smooth solutions.
Pasting is a closely related method that differs in one key detail: it samples training subsets without replacement instead of with replacement. This means each observation can appear at most once in a given subset, and the subsets are typically smaller than the full training set. The term "pasting" was introduced by Breiman in 1999 in his paper Pasting Small Votes for Classification in Large Databases and On-Line.
| Aspect | Bagging | Pasting |
|---|---|---|
| Sampling method | With replacement (bootstrap) | Without replacement |
| Subset size | Typically same as original dataset | Smaller than original dataset |
| Duplicate observations in subset | Yes | No |
| OOB estimation | Available (from unselected samples) | Not directly available |
| Best when | Dataset fits in memory | Dataset is very large; subsets fit in memory |
Both methods can produce diverse ensembles, but bagging is more commonly used because the bootstrap procedure naturally creates the variation needed for effective ensembles and enables OOB error estimation. Pasting is more useful in distributed and out-of-core settings where each machine can only load a subset of the full data.
Subagging (subsample aggregating), proposed by Buhlmann and Yu in 2002, replaces bootstrap sampling with subsampling without replacement, drawing subsets of size $m_n < n$ from the original dataset. Theoretical analysis shows that subagging with $m_n \approx n / 2$ achieves variance reduction comparable to full bagging. This approach can be cheaper because each model trains on a smaller dataset. The same row subsampling without replacement is what the subsample parameter does in libraries such as XGBoost and LightGBM when it is set below 1.0, although in those libraries the subsampling is combined with boosting rather than independent aggregation.
The random subspace method, proposed by Tin Kam Ho in 1998 in IEEE Transactions on Pattern Analysis and Machine Intelligence, trains each model on all instances but only a random subset of features. This is particularly effective in high-dimensional problems where many features are noisy or redundant. When combined with bootstrap instance sampling, it becomes the random patches method.
The random patches method, introduced by Louppe and Geurts in 2012 in their ECML PKDD paper Ensembles on Random Patches, generalizes both bagging and random subspaces by drawing random subsets of both instances and features for each base model. This creates even greater diversity among models and can be particularly effective in high-dimensional settings. In scikit-learn, this is achieved by setting both max_samples and max_features to values less than 1.0 in the BaggingClassifier.
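The sampling variants described above differ only in how rows and columns are drawn, so in scikit-learn they can all be expressed through BaggingClassifier settings. The sketch below uses synthetic data, and the specific fractions are assumptions picked for illustration. It fits each configuration and inspects the rows and columns seen by the first base model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

configs = {
    "bagging": dict(bootstrap=True, max_samples=1.0, max_features=1.0),
    "pasting / subagging": dict(bootstrap=False, max_samples=0.5, max_features=1.0),
    "random subspaces": dict(bootstrap=False, max_samples=1.0, max_features=0.5),
    "random patches": dict(bootstrap=True, max_samples=0.5, max_features=0.5),
}

summary = {}
for name, params in configs.items():
    model = BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=10, random_state=0, **params
    ).fit(X, y)
    rows = model.estimators_samples_[0]   # row indices drawn for the first base model
    cols = model.estimators_features_[0]  # column indices drawn for the first base model
    summary[name] = (len(rows), len(set(rows.tolist())), len(cols))

for name, (n_rows, n_unique_rows, n_cols) in summary.items():
    print(f"{name}: rows={n_rows}, unique rows={n_unique_rows}, features={n_cols}")
```

Bagging draws as many rows as the dataset has but with repeats, pasting draws fewer rows with no repeats, and the two feature-sampling variants restrict each model to half of the columns.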
Wagging (weight aggregation), proposed by Bauer and Kohavi in 1999, replaces bootstrap sampling with random reweighting of the training observations. Each base model is trained on the full dataset but with randomly perturbed instance weights: the original formulation adds Gaussian noise to the weights, and a later variant by Webb (2000) draws them from a continuous Poisson (exponential) distribution. Wagging avoids the duplication behavior of bootstrap sampling, but it works only with base learners that support example weights.
Iterated bagging, proposed by Breiman in 2001 in Using Iterated Bagging to Debias Regressions, combines bagging with iterative bias correction. After fitting an initial bagged ensemble, a second ensemble is fit on the residuals, and so on. This hybrid approach borrows the bias reduction of boosting while keeping bagging's parallel structure across each layer.
Double bagging, proposed by Hothorn and Lausen in 2003, makes use of the observations that each bootstrap sample leaves out. A linear discriminant analysis is fit on the OOB observations of a bootstrap sample, and the resulting discriminant variables are added as extra features for the tree trained on that same bootstrap sample. It is mainly of historical and academic interest today.
Bagging can be applied to neural networks, which are themselves unstable, high-variance learners. Training multiple neural networks on different bootstrap samples and averaging their predictions often improves generalization. A closely related approach is deep ensembles, which Lakshminarayanan, Pritzel, and Blundell formalized in 2017 as a simple and reliable method for predictive uncertainty estimation in deep learning; deep ensembles typically train every network on the full dataset and rely on different random initializations, rather than bootstrap samples, for diversity.
However, training multiple full neural networks is computationally expensive. In practice, several approximations to bagging are more commonly used with neural networks:
| Technique | Reference | Mechanism |
|---|---|---|
| Dropout | Srivastava et al. 2014, JMLR 15:1929-1958 | Randomly zero units during training; at test time use full network with rescaled weights, approximating an exponentially large geometric mean |
| Stochastic depth | Huang et al. 2016 | Randomly drop entire residual blocks during training |
| Snapshot ensembles | Huang et al. 2017 | Save model checkpoints at different cyclic learning rate minima and average their predictions |
| Stochastic weight averaging (SWA) | Izmailov et al. 2018 | Average weights of a model across SGD iterations near convergence |
| Deep ensembles | Lakshminarayanan et al. 2017 | Train multiple networks independently with different random seeds; average predictions |
| BatchEnsemble | Wen et al. 2020 | Share most weights across an ensemble using rank-1 perturbations to reduce memory cost |
| Multi-input multi-output (MIMO) | Havasi et al. 2021 | Train one network with multiple input/output channels acting as a virtual ensemble |
Dropout is the most influential of these. The original paper, Dropout: A Simple Way to Prevent Neural Networks from Overfitting (Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov, JMLR Volume 15, 2014, pages 1929 to 1958), explicitly framed dropout as approximating an equally weighted geometric mean of the predictions of an exponential number of "thinned" sub-networks that share parameters, which is in spirit a form of model averaging. The connection to bagging is approximate because dropout sub-networks share weights and are not trained on independent bootstrap samples, but the variance-reduction intuition transfers.
| Scenario | Bagging effectiveness | Reason |
|---|---|---|
| High-variance base learner (e.g., deep decision tree) | Very effective | Large variance to reduce |
| Noisy training data | Effective | Bootstrap smooths out noise |
| Low-variance base learner (e.g., k-nearest neighbors) | Minimal benefit or slightly harmful | Little variance to reduce |
| High-bias base learner (e.g., decision stump) | Ineffective | Bagging does not reduce bias |
| Linear regression with many features | Marginal | Linear models already have relatively low variance |
| Small dataset | Effective if combined with OOB estimation | Makes good use of limited data |
| Large number of features | Effective when paired with feature subsampling | Reduces correlation between models |
| Strongly imbalanced classification | Mixed; balanced bagging variants such as BalancedBaggingClassifier from imbalanced-learn work better | Standard bagging keeps the imbalance in each bootstrap |
Scikit-learn provides BaggingClassifier and BaggingRegressor in the sklearn.ensemble module. These meta-estimators wrap any base learner and handle the bootstrap sampling, parallel training, and aggregation automatically.
```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=1.0,   # each bootstrap sample is same size as training set
    bootstrap=True,    # sample with replacement (True = bagging, False = pasting)
    oob_score=True,    # compute out-of-bag error estimate
    n_jobs=-1,         # train all models in parallel
    random_state=42,
)
bagging_clf.fit(X_train, y_train)

print(f"Test accuracy: {bagging_clf.score(X_test, y_test):.4f}")
print(f"OOB score: {bagging_clf.oob_score_:.4f}")
```
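A fuller example combining soft voting, random patches, and probability calibration:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, log_loss
from sklearn.model_selection import train_test_split

# Same synthetic data as the previous example, repeated so this block runs on its own.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

base = LogisticRegression(max_iter=1000)
bag = BaggingClassifier(
    estimator=base,
    n_estimators=200,
    max_samples=0.8,
    max_features=0.8,          # with max_samples < 1.0 this gives random patches
    bootstrap=True,
    bootstrap_features=False,
    oob_score=True,
    n_jobs=-1,
    random_state=0,
)

# BaggingClassifier averages predict_proba across base models (soft voting).
# Wrapping it in CalibratedClassifierCV calibrates those averaged probabilities.
calibrated = CalibratedClassifierCV(bag, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

proba = calibrated.predict_proba(X_test)[:, 1]
print(f"AUC: {roc_auc_score(y_test, proba):.4f}")
print(f"Log loss: {log_loss(y_test, proba):.4f}")
```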
Key parameters include:
| Parameter | Description | Default |
|---|---|---|
| estimator | The base learning algorithm to bag | None (falls back to a decision tree) |
| n_estimators | Number of base models in the ensemble | 10 |
| max_samples | Number or fraction of samples per bootstrap | 1.0 |
| max_features | Number or fraction of features per model | 1.0 |
| bootstrap | Whether to sample with replacement | True |
| bootstrap_features | Whether to sample features with replacement | False |
| oob_score | Whether to compute OOB error estimate | False |
| warm_start | Reuse previous fit and add more estimators | False |
| n_jobs | Number of parallel jobs (-1 for all cores) | None |
| random_state | Seed for reproducibility | None |
Setting bootstrap=False switches from bagging to pasting. Setting max_features to a value less than 1.0 enables random subspace or random patches behavior.
| Estimator | Family | Notes |
|---|---|---|
| BaggingClassifier, BaggingRegressor | General bagging | Wraps any base estimator |
| RandomForestClassifier, RandomForestRegressor | Bagging plus feature subsampling at each split | Most popular bagged tree implementation |
| ExtraTreesClassifier, ExtraTreesRegressor | Extremely randomized trees | Random thresholds; bootstrap off by default |
| IsolationForest | Bagged anomaly detection trees | Uses subsampling rather than bootstrap |
| BalancedBaggingClassifier (imbalanced-learn) | Bagging with class rebalancing per bag | Useful for imbalanced classification |
| Library | API | Notes |
|---|---|---|
| R randomForest package | randomForest() | Original Breiman/Cutler Fortran code wrapped in R |
| R ranger package | ranger() | Faster C++ random forest implementation |
| Spark MLlib | RandomForestClassifier, RandomForestRegressor | Distributed random forest |
| H2O | H2ORandomForestEstimator | Distributed random forest with grid search |
| XGBoost | subsample, colsample_bytree | Combines bagging-style subsampling with boosting |
| LightGBM | bagging_fraction, feature_fraction | Same idea as XGBoost |
| CatBoost | bootstrap_type | Supports Bayesian, Bernoulli, MVS, and Poisson bootstrap |
| Julia DecisionTree.jl | RandomForestClassifier | Native Julia implementation |
| MATLAB Statistics Toolbox | TreeBagger | Bagging for trees with parallel training |
- With n_estimators=10 (the scikit-learn default for BaggingClassifier), the ensemble may be undertrained. Most practitioners use 100 to 500 for general work and 1,000+ for tasks that benefit from very tight variance.
- Returns diminish once n_estimators exceeds a few hundred. Plot OOB error against n_estimators to pick a reasonable cap.
- Setting oob_score=True provides a free estimate of generalization error that closely tracks leave-one-out cross-validation.
- Set n_jobs=-1 to use all available cores. Bagging is embarrassingly parallel.
- Setting max_features below 1.0 reduces correlation between trees and often improves accuracy, especially in high-dimensional problems.
- Wrap the ensemble in CalibratedClassifierCV if downstream applications require well-calibrated probabilities.
- For imbalanced classes, use BalancedBaggingClassifier from imbalanced-learn or set class weights inside the base learner.
- For very large datasets, use pasting (bootstrap=False) or random patches with max_samples < 1.0 to keep memory under control.

Breiman's 1996 Bagging Predictors paper has accumulated more than 30,000 citations as of 2026 and is considered one of the landmark contributions to machine learning. It demonstrated that simple resampling and aggregation could dramatically improve the performance of unstable learners, which opened the door to the entire field of ensemble methods.
Bagging laid the groundwork for random forests (Breiman, 2001), which became one of the most popular and successful machine learning algorithms across domains ranging from bioinformatics and remote sensing to computer vision, finance, and biomedicine. Boosting methods such as AdaBoost and gradient boosting emerged in the same period and follow a different (sequential) strategy, but later variants such as stochastic gradient boosting borrowed bagging's random subsampling. The principles behind bagging likewise influenced modern deep learning ensemble techniques such as dropout, snapshot ensembles, and stochastic weight averaging.
Today, bagging remains widely used both directly (through random forests, extra trees, and scikit-learn's bagging estimators) and indirectly (through dropout in neural networks and other variance-reduction techniques). Its simplicity, robustness, and theoretical grounding make it a foundational concept that every machine learning practitioner should understand.
The 2022 NeurIPS paper by Grinsztajn, Oyallon, and Varoquaux, along with continuing strong showings of random forests on Kaggle competitions and tabular benchmarks, makes the case that as of 2026 bagging-based methods remain at or near the state of the art for medium-sized tabular problems even in an era dominated by transformer and foundation model approaches for unstructured data.
Imagine you want to guess how many candies are in a jar. Instead of asking just one friend, you ask 100 friends to each take a quick look and make their best guess. Each friend sees the jar from a slightly different angle, so their guesses are all a little different.
To get your final answer, you add up all their guesses and divide by 100 to get the average. That average is almost always closer to the real number than any single friend's guess would be.
Bagging works the same way. It gives each "friend" (a model) a slightly different version of the data to learn from, then combines all their answers. The individual models might make mistakes, but their mistakes tend to cancel each other out when you average them together. The clever trick in bagging is the way each "slightly different version" of the data is created: you put all the training examples in a hat, and you draw with replacement, meaning the same example can be drawn more than once. Some examples end up in the bag two or three times, and others do not appear at all. Each model studies its own slightly lopsided bag, and the lopsidedness is what gives every friend a different perspective.
Does bagging always help? No. Bagging mainly reduces variance. If your base learner is already low variance (such as a linear regression on plenty of data) or has high bias (such as a single decision stump), bagging will help only marginally or not at all.
How many estimators should I use? There is no firm rule. A common starting point is 100 to 500. You can plot the OOB error against n_estimators and stop where the curve flattens.
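A rough way to trace that curve is sketched below, assuming scikit-learn and synthetic data. The ensemble is refit from scratch at each size, because BaggingClassifier does not allow oob_score=True together with warm_start=True. The grid of sizes is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=20, random_state=0)

oob_error = {}
for n_estimators in (25, 50, 100, 200):
    model = BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=n_estimators,
        oob_score=True,   # requires bootstrap=True, which is the default
        random_state=0,
    )
    model.fit(X, y)
    oob_error[n_estimators] = 1.0 - model.oob_score_

for n_estimators, err in oob_error.items():
    print(f"n_estimators={n_estimators}: OOB error={err:.4f}")
```

Once the printed error stops improving by more than the noise between runs, additional estimators mostly add training time.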
Is bagging the same as random forest? No. Random forest is a special case of bagging applied to decision trees, with the additional twist of randomly subsampling features at each split. Plain bagging applied to decision trees does not subsample features.
Why is the OOB fraction 36.8%? It is the limit of $(1 - 1/n)^n$ as $n$ grows, which equals $1/e$, approximately 0.3679. So roughly 36.8% of training examples are not selected in any given bootstrap sample.
Can I bag any model? Yes, in principle. In practice, the benefit is largest for high-variance, low-bias base learners such as deep decision trees and neural networks.
Is dropout the same as bagging? Dropout is sometimes described as approximate bagging because it implicitly trains an exponentially large ensemble of sub-networks that share parameters. It is not exact bagging because the sub-networks are not trained on independent bootstrap samples and they share weights.
Does bagging guarantee no overfitting? No. Bagging can still overfit if the base learners are extremely flexible and the dataset is small, but it overfits much less than a single deep tree would. The variance reduction produces a smoother decision boundary that generalizes better in most settings.