See also: Machine learning terms, Ensemble methods
Bagging, short for bootstrap aggregating, is an ensemble learning technique in machine learning designed to improve the stability and accuracy of prediction algorithms. The method works by training multiple instances of the same base learner on different bootstrap samples of the training data, then combining their predictions through averaging (for regression) or majority voting (for classification). Bagging primarily reduces variance and helps prevent overfitting, making it one of the most widely used ensemble strategies in practice.
Leo Breiman introduced bagging in his 1994 technical report and published the foundational paper "Bagging Predictors" in the journal Machine Learning in 1996. The technique was among the first effective ensemble methods and remains a building block for more advanced approaches such as random forests.
The bagging algorithm follows a straightforward three-step process: bootstrap sampling, parallel model training, and prediction aggregation.
Given an original training dataset D of size n, bagging creates m new training sets D₁, D₂, ..., Dₘ by sampling uniformly at random with replacement from D. Each bootstrap sample Dᵢ is the same size as the original dataset (n observations), but because sampling is done with replacement, some observations appear multiple times while others are left out entirely.
A key statistical property governs this process. For sufficiently large n, the probability that any given observation appears at least once in a bootstrap sample approaches 1 - 1/e, so each sample is expected to contain roughly 63.2% of the unique observations from the original dataset. The remaining 36.8% of the original observations are, on average, left out of that sample, with duplicated observations filling the sample out to size n. This means each bootstrap sample provides a meaningfully different view of the data, which is essential for producing diverse models.
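A quick simulation illustrates this property. The sketch below uses NumPy; the dataset size n = 10,000 is an arbitrary choice for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # size of the original dataset (assumed for illustration)

# Draw one bootstrap sample: n indices sampled uniformly with replacement
sample = rng.integers(0, n, size=n)

unique_fraction = np.unique(sample).size / n
print(f"Unique observations in sample: {unique_fraction:.3f}")  # ~0.632
print(f"Theoretical limit 1 - 1/e:     {1 - np.exp(-1):.3f}")   # 0.632
```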
A base learning algorithm (the "base learner") is trained independently on each of the m bootstrap samples. Because each model sees a different subset of the data, the resulting models differ from one another even though they all use the same learning algorithm. This diversity among models is what allows bagging to reduce prediction variance.
An important practical advantage is that all m models can be trained in parallel, since none depends on the output of any other. This makes bagging straightforward to distribute across multiple processors or machines.
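A minimal from-scratch sketch of this stage might look as follows; the synthetic dataset and the choice of m = 25 models are illustrative assumptions, with scikit-learn decision trees as the base learner.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

m = 25  # number of bootstrap samples / models (illustrative choice)
models = []
for _ in range(m):
    # Bootstrap sample: n indices drawn with replacement
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Each model is fit independently of the others, so this loop could be
# distributed across processes or machines without changing the result.
```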
Once all m models have been trained, their individual predictions are combined to produce the final output:
| Task | Aggregation method | Description |
|---|---|---|
| Classification | Majority voting | Each model casts one vote for a class; the class receiving the most votes is the final prediction |
| Regression | Averaging | The final prediction is the arithmetic mean of all individual model predictions |
In some implementations, soft voting is used for classification, where the predicted class probabilities from each model are averaged and the class with the highest average probability is selected.
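Each aggregation rule is only a few lines of NumPy. The sketch below uses tiny hand-made prediction arrays standing in for real model outputs, just to make the shapes concrete.

```python
import numpy as np

# Hard voting: one predicted class label per model (m models x k test points)
votes = np.array([[0, 1, 1],   # model 1's predictions
                  [0, 1, 0],   # model 2's predictions
                  [1, 1, 0]])  # model 3's predictions
# Majority vote per column (i.e., per test point)
hard = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print(hard)  # [0 1 0]

# Regression: average the per-model numeric predictions
preds = np.array([[2.0, 3.5], [2.4, 3.1], [1.9, 3.3]])
print(preds.mean(axis=0))  # [2.1 3.3]

# Soft voting: average class-probability vectors, then take the argmax
probs = np.array([[[0.9, 0.1]], [[0.6, 0.4]], [[0.4, 0.6]]])  # (m, k, classes)
print(probs.mean(axis=0).argmax(axis=1))  # [0]
```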
The effectiveness of bagging is grounded in a simple statistical principle. Consider a set of m independent random variables, each with variance σ². The variance of their average is σ²/m. In practice, the bootstrap models are not fully independent because they are all drawn from the same original dataset, but they are sufficiently different that averaging their outputs still produces a meaningful reduction in variance.
More precisely, if the pairwise correlation between any two models is ρ and each model has variance σ², then the variance of the bagged ensemble is:
Var(ensemble) = ρσ² + (1 - ρ)σ²/m
As the number of models m increases, the second term shrinks toward zero, leaving ρσ² as the irreducible floor. This formula explains why bagging is most effective when the base models are diverse (low ρ) and individually high in variance (large σ²).
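Plugging in illustrative numbers makes the limit concrete; the values of ρ, σ², and m below are assumptions chosen only for the example.

```python
# Variance of a bagged ensemble: rho*sigma2 + (1 - rho)*sigma2/m
rho, sigma2 = 0.3, 1.0  # assumed pairwise correlation and per-model variance
for m in (1, 10, 100, 1000):
    var = rho * sigma2 + (1 - rho) * sigma2 / m
    print(f"m={m:5d}  ensemble variance = {var:.4f}")
# The variance approaches the floor rho*sigma2 = 0.30 as m grows.
```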
Bagging does not reduce bias. If the base learner is biased, the bagged ensemble will retain that bias. This is why bagging works best with unstable, low-bias, high-variance learners. Deep, unpruned decision trees are the classic example: they tend to have low bias (they can fit the training data closely) but high variance (small changes in the training data can produce very different trees). Bagging smooths out this variance without sacrificing the low bias.
Conversely, bagging provides little benefit, and can even slightly degrade performance, when applied to stable, low-variance models such as k-nearest neighbors or linear regression.
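This difference is easy to observe empirically. The rough sketch below compares a single model against its bagged ensemble for both learner types; the synthetic dataset and default hyperparameters are assumptions, so exact numbers will vary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for base in (DecisionTreeClassifier(random_state=0), KNeighborsClassifier()):
    single = cross_val_score(base, X, y, cv=5).mean()
    bagged = cross_val_score(
        BaggingClassifier(estimator=base, n_estimators=50, random_state=0),
        X, y, cv=5).mean()
    print(f"{type(base).__name__:25s} single: {single:.3f}  bagged: {bagged:.3f}")

# Bagging typically lifts the high-variance tree noticeably,
# while the stable k-NN model changes little.
```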
Because each bootstrap sample includes only about 63.2% of the original observations, the remaining 36.8% that were not selected for a given model are called out-of-bag (OOB) observations. These provide a convenient, built-in validation mechanism.
For each observation xᵢ in the original dataset, there exists a subset of models that did not include xᵢ in their training data. The OOB prediction for xᵢ is obtained by aggregating the predictions of only those models. The OOB error is then computed by comparing these OOB predictions against the true labels across all observations.
The OOB error estimate has been shown to be approximately as accurate as leave-one-out cross-validation, but it comes "for free" as a byproduct of the bagging process, with no need to set aside a separate validation set or perform multiple rounds of retraining. Leo Breiman demonstrated in his 1996 technical report "Out-of-bag Estimation" that the OOB estimate is a reliable measure of generalization error, making it especially useful when data is limited.
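The mechanics can be sketched directly: track which indices each model did not see, and aggregate only those models' votes for each point. The minimal illustration below assumes binary classification and m = 50 models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, random_state=0)
n, m = len(X), 50

# votes[i, c] counts OOB votes for class c on observation i
votes = np.zeros((n, 2))
for _ in range(m):
    idx = rng.integers(0, n, size=n)        # bootstrap indices
    oob = np.setdiff1d(np.arange(n), idx)   # indices this model never saw
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    votes[oob, tree.predict(X[oob])] += 1   # record OOB votes only

covered = votes.sum(axis=1) > 0             # points with at least one OOB vote
oob_error = (votes[covered].argmax(axis=1) != y[covered]).mean()
print(f"OOB error estimate: {oob_error:.3f}")
```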
Decision trees are by far the most common base learner used with bagging. An individual unpruned decision tree is a high-variance model: it memorizes the training data and can change drastically if a few training points are altered. This makes it an ideal candidate for variance reduction through bagging.
A bagged ensemble of decision trees (sometimes called a "bagged forest") trains each tree on a different bootstrap sample and combines their predictions. The result is a model that retains the expressiveness of deep trees while achieving much lower variance and better generalization.
However, one drawback is the loss of interpretability. A single decision tree is easy to visualize and explain, but a committee of hundreds or thousands of trees is not.
Random forests, introduced by Leo Breiman in 2001, extend bagging by adding a second source of randomness. In addition to training each tree on a bootstrap sample, random forests restrict each split in each tree to consider only a random subset of features (typically √p for classification or p/3 for regression, where p is the total number of features).
This feature subsampling further decorrelates the individual trees, reducing the pairwise correlation ρ in the variance formula. The result is an ensemble with even lower variance than standard bagged trees, which explains why random forests consistently rank among the best-performing off-the-shelf classifiers.
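In scikit-learn the feature-subsampling behavior is controlled by max_features. The sketch below contrasts the two configurations side by side; the dataset and ensemble sizes are illustrative, so the scores are only indicative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=40, random_state=0)

# Bagged trees: every split considers all 40 features
bagged = BaggingClassifier(estimator=DecisionTreeClassifier(),
                           n_estimators=200, random_state=0)
# Random forest: each split considers only sqrt(p) features
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=0)

print("bagged trees :", cross_val_score(bagged, X, y, cv=5).mean().round(3))
print("random forest:", cross_val_score(forest, X, y, cv=5).mean().round(3))
```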
| Method | Bootstrap samples | Feature subsampling | Correlation between trees |
|---|---|---|---|
| Bagged trees | Yes | No (all features at each split) | Moderate |
| Random forest | Yes | Yes (random subset at each split) | Lower |
Bagging and boosting are both ensemble methods, but they operate on fundamentally different principles.
| Aspect | Bagging | Boosting |
|---|---|---|
| Training | Models trained independently, in parallel | Models trained sequentially; each model focuses on errors of previous models |
| Goal | Reduce variance | Reduce bias (and variance) |
| Error focus | Equal treatment of all training examples | Higher weight given to misclassified examples |
| Sensitivity to noise | Robust to noisy data and outliers | More sensitive to noise and outliers |
| Overfitting risk | Low | Higher if not properly regularized |
| Hyperparameter sensitivity | Low; adding more models rarely hurts | Higher; requires careful tuning of learning rate and iterations |
| Typical base learners | High-variance models (deep trees) | Weak learners (shallow trees, stumps) |
| Notable algorithms | Random forest, bagged trees | AdaBoost, gradient boosting, XGBoost, LightGBM |
In general, boosting can achieve higher accuracy on clean data by addressing both bias and variance, but bagging is more robust and simpler to configure. The choice depends on the dataset, the noise level, and the amount of tuning effort available.
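A side-by-side run makes the contrast concrete. The sketch below uses deep trees for bagging and stumps for boosting, per the table above; the noisy synthetic dataset (flip_y adds 10% label noise) and all other settings are assumptions, so the numbers are only indicative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, flip_y=0.1, random_state=0)

# Bagging: deep (high-variance) trees, trained independently
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=100, random_state=0)
# Boosting: shallow trees (stumps), trained sequentially
boost = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                           n_estimators=100, random_state=0)

for name, model in [("bagging", bag), ("boosting", boost)]:
    print(name, cross_val_score(model, X, y, cv=5).mean().round(3))
```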
Pasting is a closely related method that differs in one key detail: it samples training subsets without replacement instead of with replacement. This means each observation can appear at most once in a given subset, and the subsets are typically smaller than the full training set.
| Aspect | Bagging | Pasting |
|---|---|---|
| Sampling method | With replacement (bootstrap) | Without replacement |
| Subset size | Typically same as original dataset | Smaller than original dataset |
| Duplicate observations in subset | Yes | No |
| OOB estimation | Available (from unselected samples) | Not directly available |
Both methods can produce diverse ensembles, but bagging is more commonly used because the bootstrap procedure naturally creates the variation needed for effective ensembles and enables OOB error estimation.
Subagging (subsample aggregating), proposed by Bühlmann and Yu in 2002, replaces bootstrap sampling with subsampling without replacement, drawing subsets of size k < n from the original dataset (k here is the subsample size, distinct from the number of models m). This approach can achieve variance reduction similar to full bagging, sometimes at lower computational cost because each model trains on a smaller dataset.
The random subspace method, proposed by Ho in 1998, trains each model on all instances but only a random subset of features.
The random patches method, introduced by Louppe and Geurts in 2012, generalizes both bagging and random subspaces by drawing random subsets of both instances and features for each base model. This creates even greater diversity among models and can be particularly effective in high-dimensional settings. In scikit-learn, this is achieved by setting both max_samples and max_features to values less than 1.0 in the BaggingClassifier.
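All of these variants map onto BaggingClassifier parameters. The configurations below are illustrative sketches; the specific fractions are arbitrary choices, not recommended values.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier()

# Pasting: sample instances without replacement
pasting = BaggingClassifier(estimator=tree, bootstrap=False, max_samples=0.8)

# Subagging: smaller subsamples, also drawn without replacement
subagging = BaggingClassifier(estimator=tree, bootstrap=False, max_samples=0.5)

# Random subspaces: all instances, random subset of features per model
subspaces = BaggingClassifier(estimator=tree, bootstrap=False, max_samples=1.0,
                              bootstrap_features=False, max_features=0.5)

# Random patches: random subsets of both instances and features
patches = BaggingClassifier(estimator=tree, max_samples=0.7, max_features=0.5)
```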
Bagging can be applied to neural networks, which are themselves unstable, high-variance learners. Training multiple neural networks on different bootstrap samples and averaging their predictions often improves generalization. This approach has been used in deep learning competition entries and production systems.
However, training multiple full neural networks is computationally expensive. In practice, several approximations to bagging are more commonly used with neural networks:
- Dropout, which randomly deactivates units during training and can be viewed as implicitly training and averaging a very large ensemble of subnetworks that share weights.
- Snapshot ensembling, which saves model checkpoints at several points along a single training run and averages their predictions, avoiding the cost of separate training runs.
These techniques capture much of the variance-reduction benefit of bagging at a fraction of the computational cost.
| Scenario | Bagging effectiveness | Reason |
|---|---|---|
| High-variance base learner (e.g., deep decision tree) | Very effective | Large variance to reduce |
| Noisy training data | Effective | Bootstrap smooths out noise |
| Low-variance base learner (e.g., k-nearest neighbors) | Minimal benefit or slightly harmful | Little variance to reduce |
| High-bias base learner (e.g., decision stump) | Ineffective | Bagging does not reduce bias |
| Small dataset | Effective if combined with OOB estimation | Makes good use of limited data |
| Large number of features | Effective when paired with feature subsampling | Reduces correlation between models |
Scikit-learn provides BaggingClassifier and BaggingRegressor in the sklearn.ensemble module. These meta-estimators wrap any base learner and handle the bootstrap sampling, parallel training, and aggregation automatically.
```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=1.0,   # each bootstrap sample is same size as training set
    bootstrap=True,    # sample with replacement (True = bagging, False = pasting)
    oob_score=True,    # compute out-of-bag error estimate
    n_jobs=-1,         # train all models in parallel
    random_state=42
)
bagging_clf.fit(X_train, y_train)

print(f"Test accuracy: {bagging_clf.score(X_test, y_test):.4f}")
print(f"OOB score: {bagging_clf.oob_score_:.4f}")
```
Key parameters include:
| Parameter | Description | Default |
|---|---|---|
| estimator | The base learning algorithm to bag | DecisionTreeClassifier |
| n_estimators | Number of base models in the ensemble | 10 |
| max_samples | Number or fraction of samples per bootstrap | 1.0 |
| max_features | Number or fraction of features per model | 1.0 |
| bootstrap | Whether to sample with replacement | True |
| oob_score | Whether to compute OOB error estimate | False |
| n_jobs | Number of parallel jobs (-1 for all cores) | None |
Setting bootstrap=False switches from bagging to pasting. Setting max_features to a value less than 1.0 enables random subspace or random patches behavior.
Breiman's 1996 bagging paper has accumulated over 16,000 citations and is considered one of the landmark contributions to machine learning. It demonstrated that simple resampling and aggregation could dramatically improve the performance of unstable learners, which opened the door to the entire field of ensemble methods.
Bagging laid the groundwork for random forests (Breiman, 2001), which became one of the most popular and successful machine learning algorithms across domains ranging from bioinformatics to computer vision to finance. The principles behind bagging also influenced the development of boosting methods such as AdaBoost and gradient boosting, even though those methods follow a different (sequential) strategy.
Today, bagging remains widely used both directly (through random forests and scikit-learn's bagging estimators) and indirectly (through dropout in neural networks and other variance-reduction techniques). Its simplicity, robustness, and theoretical grounding make it a foundational concept that every machine learning practitioner should understand.
Imagine you want to guess how many candies are in a jar. Instead of asking just one friend, you ask 100 friends to each take a quick look and make their best guess. Each friend sees the jar from a slightly different angle, so their guesses are all a little different.
To get your final answer, you add up all their guesses and divide by 100 to get the average. That average is almost always closer to the real number than any single friend's guess would be.
Bagging works the same way. It gives each "friend" (a model) a slightly different version of the data to learn from, then combines all their answers. The individual models might make mistakes, but their mistakes tend to cancel each other out when you average them together.