Ensemble methods are techniques in machine learning that combine the predictions of multiple models, known as base learners, to produce a single prediction that is typically more accurate and robust than any individual model. The core insight is straightforward: by aggregating the outputs of several learners, individual errors tend to cancel out, yielding better generalization on unseen data. Ensemble approaches underpin many of the most successful techniques in applied machine learning, including random forests and gradient boosting machines, as well as competition-winning solutions on platforms like Kaggle.
The effectiveness of ensemble methods rests on several complementary intuitions.
If individual models make independent errors, averaging their predictions causes those errors to cancel. This is a statistical consequence of the law of large numbers. The Condorcet jury theorem (1785) formalizes a related idea for binary decisions: if each of N independent voters is correct with probability p > 0.5, the probability that the majority vote is correct approaches 1 as N grows. Applied to a majority vote of N independent classifiers, each correct with probability p > 0.5, Hoeffding's inequality bounds the probability that the ensemble errs:
P(ensemble error) <= exp(-2N(p - 0.5)^2)
This exponential decay explains why even modest individual accuracy can yield strong collective performance, provided the models' errors are sufficiently uncorrelated.
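To make the bound concrete, here is a small Python sketch (illustrative, not from the source) that compares the exact majority-vote error of N independent classifiers against the Hoeffding bound; the individual accuracy p = 0.6 is an arbitrary choice, and odd N avoids ties:

```python
from math import comb, exp

def majority_vote_error(n: int, p: float) -> float:
    """Exact probability that a majority vote of n independent classifiers,
    each correct with probability p, is wrong (n odd, so no ties)."""
    # The vote fails when at most n // 2 of the classifiers are correct.
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1))

def hoeffding_bound(n: int, p: float) -> float:
    """Hoeffding upper bound on the same error."""
    return exp(-2 * n * (p - 0.5) ** 2)

for n in (11, 101, 1001):
    print(f"n={n}: exact={majority_vote_error(n, 0.6):.4f}, "
          f"bound={hoeffding_bound(n, 0.6):.4f}")
```

At p = 0.6 the bound already falls to about 2 × 10^-9 by n = 1001, even though each individual classifier is only slightly better than chance.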
Prediction error can be decomposed into bias (systematic error from incorrect assumptions), variance (sensitivity to training data fluctuations), and irreducible noise. Different ensemble strategies target different components:
| Strategy | Primary effect | Mechanism |
|---|---|---|
| Bagging | Reduces variance | Averages many high-variance models trained on bootstrap samples |
| Boosting | Reduces bias | Sequentially corrects residual errors of prior models |
| Stacking | Reduces both | A meta-learner learns optimal combination weights |
Recent theoretical work by Brown (2023) extends this to a three-way bias/variance/diversity trade-off, showing that diversity among ensemble members always subtracts from expected risk when bias and variance are held fixed.
Diversity is the engine that makes ensembles work. If every base learner made identical predictions, combining them would offer no benefit. Sources of diversity include:

- Data diversity: training each model on a different sample of the data, as in bagging
- Feature diversity: training each model on a different subset of features, as in random subspaces
- Algorithmic diversity: combining different model families, as in stacking and voting
- Parameter diversity: varying hyperparameters or random initializations across members
Empirical studies consistently show that ensembles of diverse models outperform ensembles of similar models, even when the individual diverse models are somewhat weaker.
Bagging, introduced by Leo Breiman in 1996, generates multiple versions of a predictor by training each on a different bootstrap sample (random sample with replacement) of the training data. For regression tasks, predictions are averaged; for classification tasks, a majority vote is taken.
Because each bootstrap sample omits roughly 37% of the original data (more precisely, a fraction approaching 1/e ≈ 36.8% as the dataset grows), the held-out portion (called the out-of-bag sample) can be used to estimate generalization error without a separate validation set. Bagging is most effective with high-variance, low-bias base learners such as deep decision trees.
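A minimal scikit-learn sketch of bagging deep decision trees with the out-of-bag score enabled (it assumes scikit-learn 1.2+, where the base-learner parameter is named `estimator`; the dataset and hyperparameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # deep, high-variance trees (no depth limit)
    n_estimators=100,   # number of bootstrap replicates
    oob_score=True,     # estimate generalization error on out-of-bag samples
    random_state=0,
)
bagging.fit(X, y)
print(f"Out-of-bag accuracy estimate: {bagging.oob_score_:.3f}")
```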
Variants:
| Variant | Description |
|---|---|
| Pasting | Samples drawn without replacement instead of with replacement |
| Random subspaces | Each model trained on a random subset of features rather than a random subset of samples |
| Random patches | Combines random sampling of both features and samples |
The random forest algorithm, also developed by Breiman (2001), extends bagging by adding feature randomness. At each split in each tree, only a random subset of features is considered. This further decorrelates the trees, reducing variance beyond what plain bagging achieves. Random forests are widely used due to their strong out-of-the-box performance, resistance to overfitting, and ability to estimate feature importance.
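As a brief illustrative example, scikit-learn's `max_features` parameter controls the per-split feature subsampling that distinguishes a random forest from plain bagging, and `feature_importances_` exposes the impurity-based importance estimates (dataset and hyperparameters here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # consider only sqrt(n_features) candidates at each split
    oob_score=True,
    random_state=0,
)
forest.fit(X, y)
print(f"OOB accuracy: {forest.oob_score_:.3f}")
print(forest.feature_importances_)  # impurity-based importance, one value per feature
```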
Boosting trains a sequence of weak learners, where each new learner focuses on the examples that previous learners got wrong. Unlike bagging, which trains models independently in parallel, boosting builds models sequentially so that each one corrects the mistakes of its predecessors.
AdaBoost (Adaptive Boosting), proposed by Freund and Schapire (1996), was the first practical boosting algorithm. It maintains a set of weights over training examples, increasing the weight of misclassified instances after each round. The final prediction is a weighted majority vote of all weak learners, where each learner's weight reflects its accuracy.
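The mechanism can be sketched in a few lines. The following from-scratch Python sketch implements the discrete AdaBoost update for labels in {-1, +1}, using decision stumps as weak learners; function names are illustrative, and production code would use a library implementation such as scikit-learn's `AdaBoostClassifier`:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Discrete AdaBoost for labels y in {-1, +1}, with stump weak learners."""
    w = np.full(len(y), 1.0 / len(y))        # start with uniform example weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - err) / err)  # accurate stumps get larger votes
        w *= np.exp(-alpha * y * pred)         # up-weight misclassified examples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """Weighted majority vote of all weak learners."""
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))
```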
Gradient boosting, formalized by Friedman (2001), generalizes boosting to arbitrary differentiable loss functions. Each new model fits the negative gradient (pseudo-residuals) of the loss with respect to the current ensemble's predictions. This framework is extremely flexible and forms the basis of high-performance libraries such as XGBoost (Chen and Guestrin, 2016), LightGBM, and CatBoost.
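For squared-error loss, the negative gradient is simply the residual y − F(x), which makes the core loop easy to sketch. This is an illustrative Python sketch only; real libraries add regularization, shrinkage schedules, and second-order information:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, learning_rate=0.1):
    """Gradient boosting for squared loss, where the negative gradient
    of the loss is exactly the residual y - F(x)."""
    f0 = y.mean()                    # initial constant prediction
    F = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residuals = y - F            # pseudo-residuals (negative gradient)
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)
        F += learning_rate * tree.predict(X)  # shrink each correction step
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    return f0 + learning_rate * sum(t.predict(X) for t in trees)
```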
| Boosting algorithm | Key idea | Typical use case |
|---|---|---|
| AdaBoost | Reweights misclassified samples | Binary classification with weak learners |
| Gradient boosting | Fits residuals using gradient descent in function space | Tabular data, structured prediction |
| XGBoost | Regularized gradient boosting with system optimizations | Kaggle competitions, production ML |
| LightGBM | Histogram-based splits, leaf-wise growth | Large-scale tabular data |
| CatBoost | Native categorical feature handling, ordered boosting | Data with many categorical variables |
Stacking, introduced by David Wolpert in 1992, uses a two-level architecture. First, several diverse base learners are trained on the training data, typically with cross-validation so that each training example receives out-of-fold predictions from every base learner. Then a second-level model, called the meta-learner, is trained on those predictions to combine them into a final output. The meta-learner learns which base models to trust in which regions of the input space. Common choices for the meta-learner include logistic regression, ridge regression, and simple neural networks.
Stacking can be extended to multiple levels, though in practice two levels are most common because additional layers add complexity with diminishing returns.
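A minimal sketch with scikit-learn's `StackingClassifier`, which generates the cross-validated out-of-fold predictions internally (the model choices and hyperparameters here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # the meta-learner
    cv=5,  # out-of-fold predictions for the meta-learner guard against leakage
)
stack.fit(X, y)
```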
Voting is the simplest ensemble combination strategy. Given a set of trained models, their predictions are combined by:

- Hard voting: each model casts one vote for a class, and the majority class wins
- Soft voting: the models' predicted class probabilities are averaged, and the highest-probability class is selected
- Weighted voting: each model's vote or probability is scaled by a weight before aggregation
Soft voting generally outperforms hard voting because it uses more information from each model. Weighted voting allows higher-performing models to have greater influence on the final prediction.
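A short illustrative example with scikit-learn's `VotingClassifier`; the weights are arbitrary and would normally be tuned on validation data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, random_state=0)

vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",      # average predicted class probabilities
    weights=[1, 2, 1],  # arbitrary: give the forest twice the influence
)
vote.fit(X, y)
```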
Blending is similar to stacking but uses a single holdout validation set rather than cross-validation to generate the training data for the meta-learner. The process is:

1. Split the training data into a training portion and a holdout set
2. Train the base models on the training portion
3. Generate base-model predictions on the holdout set
4. Train the meta-learner on those holdout predictions
5. At inference time, pass the base models' predictions on new inputs to the meta-learner
Blending is simpler to implement than stacking and less prone to data leakage, but it uses less data for training the base models and can have higher variance in the meta-learner's training signal.
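A minimal sketch of the blending procedure (illustrative names; the 80/20 split and model choices are arbitrary, and for brevity the base models here differ only by random seed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# 1. Hold out part of the training data for the meta-learner.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. Train the base models on the remaining data.
base_models = [RandomForestClassifier(n_estimators=100, random_state=i).fit(X_train, y_train)
               for i in range(3)]

# 3. Base-model probabilities on the holdout set become meta-features.
meta_X = np.column_stack([m.predict_proba(X_hold)[:, 1] for m in base_models])

# 4. The meta-learner is trained only on holdout predictions, avoiding leakage.
meta_learner = LogisticRegression().fit(meta_X, y_hold)
```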
Bayesian model averaging (BMA) weights each model's prediction by its posterior probability given the observed data. The posterior probabilities are typically derived from model evidence using criteria like the Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC). BMA provides a principled framework for model uncertainty but assumes that the true model is among the candidates, which may not hold in practice.
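As an illustrative sketch, BIC values can be converted to approximate posterior model weights via p(M_i | data) ∝ exp(−BIC_i / 2), assuming equal model priors (the BIC values below are made up):

```python
import numpy as np

def bic_weights(bics):
    """Approximate posterior model probabilities from BIC values,
    using p(M_i | data) proportional to exp(-BIC_i / 2) with equal priors."""
    bics = np.asarray(bics, dtype=float)
    log_w = -0.5 * (bics - bics.min())  # shift by the minimum for numerical stability
    w = np.exp(log_w)
    return w / w.sum()

print(bic_weights([1000.0, 1002.0, 1010.0]))  # favors the lower-BIC models
```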
Ensembling is also widely used in deep learning, though the computational cost of training multiple large neural networks can be substantial. Several techniques address this.
The simplest approach is to train the same architecture multiple times with different random initializations. Because neural network optimization is non-convex, different seeds converge to different local minima, providing natural diversity. Lakshminarayanan et al. (2017) showed that deep ensembles of as few as five networks provide well-calibrated uncertainty estimates and strong predictive performance.
Snapshot ensembles, proposed by Huang et al. (2017), collect multiple models from a single training run by using a cyclic learning rate schedule. The learning rate is periodically reset to a high value, causing the optimizer to escape the current local minimum and converge to a new one. The model weights are saved ("snapshotted") at each convergence point. This yields an ensemble at essentially no extra training cost compared to training a single model.
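A sketch of the cyclic cosine-annealing schedule the paper describes (the function name and parameters are illustrative):

```python
import math

def snapshot_lr(step: int, total_steps: int, n_cycles: int, lr_max: float) -> float:
    """Cyclic cosine annealing: the learning rate restarts at lr_max at the
    start of each cycle and decays toward zero by the cycle's end."""
    steps_per_cycle = math.ceil(total_steps / n_cycles)
    t = step % steps_per_cycle
    return 0.5 * lr_max * (math.cos(math.pi * t / steps_per_cycle) + 1)

# A snapshot is saved at the end of each cycle, when the rate is near zero
# and the model has settled into a (new) local minimum.
```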
Dropout, a common regularization technique, can be interpreted as implicitly training an exponential number of sub-networks. At test time, using all units with scaled weights approximates the geometric mean of all possible sub-networks. Monte Carlo dropout extends this idea by running multiple stochastic forward passes at inference time to produce ensemble-like uncertainty estimates.
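A minimal PyTorch sketch of Monte Carlo dropout (illustrative; it assumes the model's only training-mode stochasticity is dropout, since `model.train()` would also change batch-norm behavior):

```python
import torch

def mc_dropout_predict(model: torch.nn.Module, x: torch.Tensor, n_samples: int = 30):
    """Monte Carlo dropout: keep dropout stochastic at inference time and
    average several forward passes; the spread estimates predictive uncertainty."""
    model.train()  # train mode keeps dropout layers active
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    model.eval()
    return preds.mean(dim=0), preds.std(dim=0)
```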
Mixture of Experts (MoE) is a form of ensemble where different sub-networks (experts) specialize in different regions of the input space. A gating network learns to route each input to the most appropriate expert or combination of experts. Unlike traditional ensembles that query every member for every input, MoE activates only a subset of experts per input, making it more computationally efficient.
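A simplified top-k routing sketch in PyTorch (illustrative only; real MoE layers add load-balancing losses and capacity limits, and `experts` and `gate` here are assumed to be simple modules that preserve the feature dimension):

```python
import torch
import torch.nn.functional as F

def top_k_moe(x, experts, gate, k=2):
    """Sparse MoE forward pass: route each input to its top-k experts
    and combine their outputs with renormalized gate weights."""
    logits = gate(x)                         # (batch, n_experts) routing scores
    weights, idx = logits.topk(k, dim=-1)    # best k experts per input
    weights = F.softmax(weights, dim=-1)     # renormalize over the chosen k
    out = torch.zeros_like(x)                # assumes experts preserve dimension
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e         # inputs routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out

# Example wiring: 8 linear experts over a 64-dimensional input.
experts = torch.nn.ModuleList(torch.nn.Linear(64, 64) for _ in range(8))
gate = torch.nn.Linear(64, 8)
y = top_k_moe(torch.randn(32, 64), experts, gate, k=2)
```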
MoE has become central to modern large language models. DeepSeek-V3 uses 256 experts, Llama 4 Scout employs 16 experts with 109 billion total parameters but only 17 billion active per input, and Mixtral 8x7B uses 8 experts per layer. This sparse activation pattern allows models to have very large total parameter counts while keeping inference cost manageable.
In many scenarios, using every available model in an ensemble is unnecessary or even harmful. Ensemble pruning (also called ensemble selection or ensemble thinning) reduces the ensemble to a subset of members that performs as well as or better than the full collection, while lowering computational cost.
Common approaches include:

- Greedy forward selection: start from an empty ensemble and repeatedly add the model that most improves validation performance
- Ranking-based pruning: keep only the top-k members by individual performance or contribution
- Clustering-based pruning: group similar members and retain one representative per cluster
- Optimization-based pruning: search directly for the best-performing subset, for example with genetic algorithms
Caruana et al. (2004) showed that ensemble selection from a library of models can significantly outperform simply using the best single model, while keeping the ensemble size small.
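A sketch of the greedy forward-selection idea (illustrative; Caruana et al. also add refinements such as selection with replacement over bagged model subsets):

```python
import numpy as np

def greedy_ensemble_selection(val_preds, y_val, max_size=10):
    """Greedy forward selection with replacement: val_preds[i] holds model i's
    predicted positive-class probabilities on a validation set (y_val in {0, 1})."""
    chosen, best_score = [], -np.inf
    for _ in range(max_size):
        # Score each model as a candidate addition to the current ensemble.
        scores = []
        for p in val_preds:
            avg = np.mean([val_preds[j] for j in chosen] + [p], axis=0)
            scores.append(np.mean((avg > 0.5) == y_val))  # validation accuracy
        best_i = int(np.argmax(scores))
        if chosen and scores[best_i] <= best_score:
            break  # adding any model no longer helps
        best_score = scores[best_i]
        chosen.append(best_i)
    return chosen  # indices of selected models (repeats allowed)
```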
Ensemble methods have been instrumental in machine learning competitions. Nearly every winning solution on Kaggle involves some form of ensembling. A typical competition pipeline stacks several diverse models (gradient boosting, neural networks, linear models) and uses a blending or stacking layer to combine them.
Notable examples include:

- The Netflix Prize, where the winning "BellKor's Pragmatic Chaos" entry blended predictions from more than a hundred individual models
- Kaggle tabular competitions, where winning pipelines routinely stack gradient-boosted trees, neural networks, and linear models
The success of ensembles in competitions has driven development of automated ensembling tools in libraries like auto-sklearn, AutoGluon, and H2O AutoML.
Ensembles require training and storing multiple models, which increases both training time and memory usage. For bagging and voting, base models can be trained in parallel. Boosting is inherently sequential since each model depends on the previous one's errors. Stacking requires multiple rounds of cross-validated training.
At prediction time, querying every model in an ensemble increases latency proportionally to the ensemble size (or sublinearly if models run in parallel). In latency-sensitive applications, techniques like model distillation (training a single student model to mimic the ensemble) can compress the ensemble's knowledge into a single, faster model.
Deploying ensembles in production requires managing multiple model artifacts, monitoring each component's performance over time, and ensuring consistency across updates. Modern ML platforms and model registries provide tooling for this, but the operational burden is real and should be weighed against the accuracy gains.
| Consideration | Challenge | Mitigation |
|---|---|---|
| Training cost | N times the cost of a single model | Parallel training, snapshot ensembles |
| Inference latency | Each prediction requires N model evaluations | Distillation, pruning, parallel inference |
| Storage | N model artifacts to store and version | Shared base layers, pruning |
| Interpretability | Harder to explain than a single model | Feature importance aggregation, SHAP values |
| Maintenance | Multiple models to monitor and update | Automated pipelines, model registries |
| Method | Training style | Diversity source | Strengths | Weaknesses |
|---|---|---|---|---|
| Bagging | Parallel | Bootstrap sampling | Reduces variance, simple to implement | Does not reduce bias |
| Random forests | Parallel | Bootstrap + feature sampling | Strong defaults, feature importance | Can be slow with many trees |
| AdaBoost | Sequential | Reweighting samples | Reduces bias, interpretable weights | Sensitive to noisy data and outliers |
| Gradient boosting | Sequential | Fitting residuals | Flexible loss functions, state-of-the-art on tabular data | Prone to overfitting without regularization |
| Stacking | Two-stage | Different algorithms | Can combine heterogeneous models | Complex, risk of overfitting meta-learner |
| Voting | Independent | Different algorithms | Simple, no additional training | Limited improvement if models are similar |
| Blending | Two-stage | Different algorithms | Simple holdout approach | Uses less data than stacking |
| MoE | Joint (gating + experts) | Specialization by input region | Efficient sparse computation | Complex routing, training instability |
Several theoretical results justify the effectiveness of ensemble methods:

- The Condorcet jury theorem and the Hoeffding bound discussed above show that majority votes of independent, better-than-chance classifiers approach perfect accuracy as the ensemble grows
- The bias-variance decomposition, and its three-way extension with diversity, explains which error component each ensemble strategy reduces
- The ambiguity decomposition (Krogh and Vedelsby, 1995) shows that an ensemble's squared error equals the average member error minus the average disagreement among members, so disagreement can only help
- Margin theory (Schapire et al., 1998) explains why boosting can keep improving generalization even after training error reaches zero
- Schapire's (1990) proof that weak learnability implies strong learnability provides the foundational guarantee behind boosting
Imagine you have a big test coming up, and you want to know the answers to all the questions. You could ask one really smart friend, but what if they don't know the answer to one of the questions? Instead, you ask several friends, each with different strengths and weaknesses. Then, you take their answers and combine them in a smart way to get the best possible answers.
That is what ensemble methods do in machine learning. They take the "opinions" of several different models and combine them to get a better, more accurate prediction. Each model might get some things wrong, but they tend to get different things wrong. When you put all their answers together, the mistakes cancel out and the right answers shine through.