Ensemble methods are techniques in machine learning that combine the predictions of multiple models, known as base learners, to produce a single prediction that is typically more accurate and robust than any individual model. The core insight is straightforward: by aggregating the outputs of several learners, individual errors tend to cancel out, yielding better generalization on unseen data. Ensemble approaches underpin many of the most successful techniques in applied machine learning, including random forests and gradient boosting machines, as well as competition-winning solutions on platforms like Kaggle.
The effectiveness of ensemble methods rests on several complementary intuitions.
If individual models make independent errors, averaging their predictions causes those errors to cancel. This is a statistical consequence of the law of large numbers. The Condorcet jury theorem (1785) formalizes a related idea for binary decisions: if each of N independent voters is correct with probability p > 0.5, the probability that the majority vote is correct approaches 1 as N grows. Applied to a majority vote of N independent classifiers, each correct with probability p > 0.5, Hoeffding's inequality bounds the probability that the ensemble errs:
P(ensemble error) <= exp(-2N(p - 0.5)^2)
This exponential decay explains why even modest individual accuracy can yield strong collective performance, provided the models' errors are sufficiently uncorrelated.
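To make the bound concrete, here is a small Python sketch (illustrative, not from the source) that compares the exact majority-vote error of N independent classifiers against the Hoeffding bound; the individual accuracy p = 0.6 is an arbitrary choice, and odd N avoids ties:

```python
from math import comb, exp

def majority_vote_error(n: int, p: float) -> float:
    """Exact probability that a majority vote of n independent classifiers,
    each correct with probability p, is wrong (n odd, so no ties)."""
    # The vote fails when at most n // 2 of the classifiers are correct.
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1))

def hoeffding_bound(n: int, p: float) -> float:
    """Hoeffding upper bound on the same error."""
    return exp(-2 * n * (p - 0.5) ** 2)

for n in (11, 101, 1001):
    print(f"n={n}: exact={majority_vote_error(n, 0.6):.4f}, "
          f"bound={hoeffding_bound(n, 0.6):.4f}")
```

At p = 0.6 the bound already falls to about 2 × 10^-9 by n = 1001, even though each individual classifier is only slightly better than chance.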
Prediction error can be decomposed into bias (systematic error from incorrect assumptions), variance (sensitivity to training data fluctuations), and irreducible noise. Different ensemble strategies target different components:
| Strategy | Primary effect | Mechanism |
|---|---|---|
| Bagging | Reduces variance | Averages many high-variance models trained on bootstrap samples |
| Boosting | Reduces bias | Sequentially corrects residual errors of prior models |
| Stacking | Reduces both | A meta-learner learns optimal combination weights |
Recent theoretical work by Brown (2023) extends this to a three-way bias/variance/diversity trade-off, showing that diversity among ensemble members always subtracts from expected risk when bias and variance are held fixed.
Diversity is the engine that makes ensembles work. If every base learner made identical predictions, combining them would offer no benefit. Sources of diversity include:

- Data diversity: training each model on a different sample of the data, as in bagging
- Feature diversity: training each model on a different subset of features, as in random subspaces
- Algorithmic diversity: combining different model families, as in stacking and voting
- Parameter diversity: varying hyperparameters or random initializations across members
Empirical studies consistently show that ensembles of diverse models outperform ensembles of similar models, even when the individual diverse models are somewhat weaker.
Bagging, introduced by Leo Breiman in 1996, generates multiple versions of a predictor by training each on a different bootstrap sample (random sample with replacement) of the training data. For regression tasks, predictions are averaged; for classification tasks, a majority vote is taken.
Because each bootstrap sample omits roughly 37% of the original data (more precisely, a fraction approaching 1/e ≈ 36.8% as the dataset grows), the held-out portion (called the out-of-bag sample) can be used to estimate generalization error without a separate validation set. Bagging is most effective with high-variance, low-bias base learners such as deep decision trees.
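A minimal scikit-learn sketch of bagging deep decision trees with the out-of-bag score enabled (it assumes scikit-learn 1.2+, where the base-learner parameter is named `estimator`; the dataset and hyperparameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # deep, high-variance trees (no depth limit)
    n_estimators=100,   # number of bootstrap replicates
    oob_score=True,     # estimate generalization error on out-of-bag samples
    random_state=0,
)
bagging.fit(X, y)
print(f"Out-of-bag accuracy estimate: {bagging.oob_score_:.3f}")
```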
Variants:
| Variant | Description |
|---|---|
| Pasting | Samples drawn without replacement instead of with replacement |
| Random subspaces | Each model trained on a random subset of features rather than a random subset of samples |
| Random patches | Combines random sampling of both features and samples |
The random forest algorithm, also developed by Breiman (2001), extends bagging by adding feature randomness. At each split in each tree, only a random subset of features is considered. This further decorrelates the trees, reducing variance beyond what plain bagging achieves. Random forests are widely used due to their strong out-of-the-box performance, resistance to overfitting, and ability to estimate feature importance.
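As a brief illustrative example, scikit-learn's `max_features` parameter controls the per-split feature subsampling that distinguishes a random forest from plain bagging, and `feature_importances_` exposes the impurity-based importance estimates (dataset and hyperparameters here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",  # consider only sqrt(n_features) candidates at each split
    oob_score=True,
    random_state=0,
)
forest.fit(X, y)
print(f"OOB accuracy: {forest.oob_score_:.3f}")
print(forest.feature_importances_)  # impurity-based importance, one value per feature
```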
Boosting trains a sequence of weak learners, where each new learner focuses on the examples that previous learners got wrong. Unlike bagging, which trains models independently in parallel, boosting builds models sequentially so that each one corrects the mistakes of its predecessors.
AdaBoost (Adaptive Boosting), proposed by Freund and Schapire (1996), was the first practical boosting algorithm. It maintains a set of weights over training examples, increasing the weight of misclassified instances after each round. The final prediction is a weighted majority vote of all weak learners, where each learner's weight reflects its accuracy.
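The mechanism can be sketched in a few lines. The following from-scratch Python sketch implements the discrete AdaBoost update for labels in {-1, +1}, using decision stumps as weak learners; function names are illustrative, and production code would use a library implementation such as scikit-learn's `AdaBoostClassifier`:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Discrete AdaBoost for labels y in {-1, +1}, with stump weak learners."""
    w = np.full(len(y), 1.0 / len(y))        # start with uniform example weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - err) / err)  # accurate stumps get larger votes
        w *= np.exp(-alpha * y * pred)         # up-weight misclassified examples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """Weighted majority vote of all weak learners."""
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))
```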
Gradient boosting, formalized by Friedman (2001), generalizes boosting to arbitrary differentiable loss functions. Each new model fits the negative gradient (pseudo-residuals) of the loss with respect to the current ensemble's predictions. This framework is extremely flexible and forms the basis of high-performance libraries such as XGBoost (Chen and Guestrin, 2016), LightGBM, and CatBoost.
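For squared-error loss, the negative gradient is simply the residual y − F(x), which makes the core loop easy to sketch. This is an illustrative Python sketch only; real libraries add regularization, shrinkage schedules, and second-order information:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, learning_rate=0.1):
    """Gradient boosting for squared loss, where the negative gradient
    of the loss is exactly the residual y - F(x)."""
    f0 = y.mean()                    # initial constant prediction
    F = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residuals = y - F            # pseudo-residuals (negative gradient)
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)
        F += learning_rate * tree.predict(X)  # shrink each correction step
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, learning_rate=0.1):
    return f0 + learning_rate * sum(t.predict(X) for t in trees)
```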
| Boosting algorithm | Key idea | Typical use case |
|---|---|---|
| AdaBoost | Reweights misclassified samples | Binary classification with weak learners |
| Gradient boosting | Fits residuals using gradient descent in function space | Tabular data, structured prediction |
| XGBoost | Regularized gradient boosting with system optimizations | Kaggle competitions, production ML |
| LightGBM | Histogram-based splits, leaf-wise growth | Large-scale tabular data |
| CatBoost | Native categorical feature handling, ordered boosting | Data with many categorical variables |
Stacking, introduced by David Wolpert in 1992, uses a two-level architecture. First, several diverse base learners are trained on the training data, typically with cross-validation so that each training example receives out-of-fold predictions from every base learner. Then a second-level model, called the meta-learner, is trained on those predictions to combine them into a final output. The meta-learner learns which base models to trust in which regions of the input space. Common choices for the meta-learner include logistic regression, ridge regression, and simple neural networks.
Stacking can be extended to multiple levels, though in practice two levels are most common because additional layers add complexity with diminishing returns.
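A minimal sketch with scikit-learn's `StackingClassifier`, which generates the cross-validated out-of-fold predictions internally (the model choices and hyperparameters here are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # the meta-learner
    cv=5,  # out-of-fold predictions for the meta-learner guard against leakage
)
stack.fit(X, y)
```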
Voting is the simplest ensemble combination strategy. Given a set of trained models, their predictions are combined by:

- Hard voting: each model casts one vote for a class, and the majority class wins
- Soft voting: the models' predicted class probabilities are averaged, and the highest-probability class is selected
- Weighted voting: each model's vote or probability is scaled by a weight before aggregation
Soft voting generally outperforms hard voting because it uses more information from each model. Weighted voting allows higher-performing models to have greater influence on the final prediction.
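A short illustrative example with scikit-learn's `VotingClassifier`; the weights are arbitrary and would normally be tuned on validation data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, random_state=0)

vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",      # average predicted class probabilities
    weights=[1, 2, 1],  # arbitrary: give the forest twice the influence
)
vote.fit(X, y)
```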
Blending is similar to stacking but uses a single holdout validation set rather than cross-validation to generate the training data for the meta-learner. The process is:

1. Split the training data into a training portion and a holdout set
2. Train the base models on the training portion
3. Generate base-model predictions on the holdout set
4. Train the meta-learner on those holdout predictions
5. At inference time, pass the base models' predictions on new inputs to the meta-learner
Blending is simpler to implement than stacking and less prone to data leakage, but it uses less data for training the base models and can have higher variance in the meta-learner's training signal.
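A minimal sketch of the blending procedure (illustrative names; the 80/20 split and model choices are arbitrary, and for brevity the base models here differ only by random seed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# 1. Hold out part of the training data for the meta-learner.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

# 2. Train the base models on the remaining data.
base_models = [RandomForestClassifier(n_estimators=100, random_state=i).fit(X_train, y_train)
               for i in range(3)]

# 3. Base-model probabilities on the holdout set become meta-features.
meta_X = np.column_stack([m.predict_proba(X_hold)[:, 1] for m in base_models])

# 4. The meta-learner is trained only on holdout predictions, avoiding leakage.
meta_learner = LogisticRegression().fit(meta_X, y_hold)
```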
Bayesian model averaging (BMA) weights each model's prediction by its posterior probability given the observed data. The posterior probabilities are typically derived from model evidence using criteria like the Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC). BMA provides a principled framework for model uncertainty but assumes that the true model is among the candidates, which may not hold in practice.
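As an illustrative sketch, BIC values can be converted to approximate posterior model weights via p(M_i | data) ∝ exp(−BIC_i / 2), assuming equal model priors (the BIC values below are made up):

```python
import numpy as np

def bic_weights(bics):
    """Approximate posterior model probabilities from BIC values,
    using p(M_i | data) proportional to exp(-BIC_i / 2) with equal priors."""
    bics = np.asarray(bics, dtype=float)
    log_w = -0.5 * (bics - bics.min())  # shift by the minimum for numerical stability
    w = np.exp(log_w)
    return w / w.sum()

print(bic_weights([1000.0, 1002.0, 1010.0]))  # favors the lower-BIC models
```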
Ensembling is also widely used in deep learning, though the computational cost of training multiple large neural networks can be substantial. Several techniques address this.
The simplest approach is to train the same architecture multiple times with different random initializations. Because neural network optimization is non-convex, different seeds converge to different local minima, providing natural diversity. Lakshminarayanan et al. (2017) showed that deep ensembles of as few as five networks provide well-calibrated uncertainty estimates and strong predictive performance.
Snapshot ensembles, proposed by Huang et al. (2017), collect multiple models from a single training run by using a cyclic learning rate schedule. The learning rate is periodically reset to a high value, causing the optimizer to escape the current local minimum and converge to a new one. The model weights are saved ("snapshotted") at each convergence point. This yields an ensemble at essentially no extra training cost compared to training a single model.
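A sketch of the cyclic cosine-annealing schedule the paper describes (the function name and parameters are illustrative):

```python
import math

def snapshot_lr(step: int, total_steps: int, n_cycles: int, lr_max: float) -> float:
    """Cyclic cosine annealing: the learning rate restarts at lr_max at the
    start of each cycle and decays toward zero by the cycle's end."""
    steps_per_cycle = math.ceil(total_steps / n_cycles)
    t = step % steps_per_cycle
    return 0.5 * lr_max * (math.cos(math.pi * t / steps_per_cycle) + 1)

# A snapshot is saved at the end of each cycle, when the rate is near zero
# and the model has settled into a (new) local minimum.
```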
Dropout, a common regularization technique, can be interpreted as implicitly training an exponential number of sub-networks. At test time, using all units with scaled weights approximates the geometric mean of all possible sub-networks. Monte Carlo dropout extends this idea by running multiple stochastic forward passes at inference time to produce ensemble-like uncertainty estimates.
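A minimal PyTorch sketch of Monte Carlo dropout (illustrative; it assumes the model's only training-mode stochasticity is dropout, since `model.train()` would also change batch-norm behavior):

```python
import torch

def mc_dropout_predict(model: torch.nn.Module, x: torch.Tensor, n_samples: int = 30):
    """Monte Carlo dropout: keep dropout stochastic at inference time and
    average several forward passes; the spread estimates predictive uncertainty."""
    model.train()  # train mode keeps dropout layers active
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    model.eval()
    return preds.mean(dim=0), preds.std(dim=0)
```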
Mixture of Experts (MoE) is a form of ensemble where different sub-networks (experts) specialize in different regions of the input space. A gating network learns to route each input to the most appropriate expert or combination of experts. Unlike traditional ensembles that query every member for every input, MoE activates only a subset of experts per input, making it more computationally efficient.
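A simplified top-k routing sketch in PyTorch (illustrative only; real MoE layers add load-balancing losses and capacity limits, and `experts` and `gate` here are assumed to be simple modules that preserve the feature dimension):

```python
import torch
import torch.nn.functional as F

def top_k_moe(x, experts, gate, k=2):
    """Sparse MoE forward pass: route each input to its top-k experts
    and combine their outputs with renormalized gate weights."""
    logits = gate(x)                         # (batch, n_experts) routing scores
    weights, idx = logits.topk(k, dim=-1)    # best k experts per input
    weights = F.softmax(weights, dim=-1)     # renormalize over the chosen k
    out = torch.zeros_like(x)                # assumes experts preserve dimension
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = idx[:, slot] == e         # inputs routed to expert e in this slot
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out

# Example wiring: 8 linear experts over a 64-dimensional input.
experts = torch.nn.ModuleList(torch.nn.Linear(64, 64) for _ in range(8))
gate = torch.nn.Linear(64, 8)
y = top_k_moe(torch.randn(32, 64), experts, gate, k=2)
```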
MoE has become central to modern large language models. DeepSeek-V3 uses 256 experts, Llama 4 Scout employs 16 experts with 109 billion total parameters but only 17 billion active per input, and Mixtral 8x7B uses 8 experts per layer. This sparse activation pattern allows models to have very large total parameter counts while keeping inference cost manageable.
In many scenarios, using every available model in an ensemble is unnecessary or even harmful. Ensemble pruning (also called ensemble selection or ensemble thinning) reduces the ensemble to a subset of members that performs as well as or better than the full collection, while lowering computational cost.
Common approaches include:

- Greedy forward selection: start from an empty ensemble and repeatedly add the model that most improves validation performance
- Ranking-based pruning: keep only the top-k members by individual performance or contribution
- Clustering-based pruning: group similar members and retain one representative per cluster
- Optimization-based pruning: search directly for the best-performing subset, for example with genetic algorithms
Caruana et al. (2004) showed that ensemble selection from a library of models can significantly outperform simply using the best single model, while keeping the ensemble size small.
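A sketch of the greedy forward-selection idea (illustrative; Caruana et al. also add refinements such as selection with replacement over bagged model subsets):

```python
import numpy as np

def greedy_ensemble_selection(val_preds, y_val, max_size=10):
    """Greedy forward selection with replacement: val_preds[i] holds model i's
    predicted positive-class probabilities on a validation set (y_val in {0, 1})."""
    chosen, best_score = [], -np.inf
    for _ in range(max_size):
        # Score each model as a candidate addition to the current ensemble.
        scores = []
        for p in val_preds:
            avg = np.mean([val_preds[j] for j in chosen] + [p], axis=0)
            scores.append(np.mean((avg > 0.5) == y_val))  # validation accuracy
        best_i = int(np.argmax(scores))
        if chosen and scores[best_i] <= best_score:
            break  # adding any model no longer helps
        best_score = scores[best_i]
        chosen.append(best_i)
    return chosen  # indices of selected models (repeats allowed)
```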
Ensemble methods have been instrumental in machine learning competitions. Nearly every winning solution on Kaggle involves some form of ensembling. A typical competition pipeline stacks several diverse models (gradient boosting, neural networks, linear models) and uses a blending or stacking layer to combine them.
Notable examples include:

- The Netflix Prize, where the winning "BellKor's Pragmatic Chaos" entry blended predictions from more than a hundred individual models
- Kaggle tabular competitions, where winning pipelines routinely stack gradient-boosted trees, neural networks, and linear models
The success of ensembles in competitions has driven development of automated ensembling tools in libraries like auto-sklearn, AutoGluon, and H2O AutoML.
Ensembles require training and storing multiple models, which increases both training time and memory usage. For bagging and voting, base models can be trained in parallel. Boosting is inherently sequential since each model depends on the previous one's errors. Stacking requires multiple rounds of cross-validated training.
At prediction time, querying every model in an ensemble increases latency proportionally to the ensemble size (or sublinearly if models run in parallel). In latency-sensitive applications, techniques like model distillation (training a single student model to mimic the ensemble) can compress the ensemble's knowledge into a single, faster model.
Deploying ensembles in production requires managing multiple model artifacts, monitoring each component's performance over time, and ensuring consistency across updates. Modern ML platforms and model registries provide tooling for this, but the operational burden is real and should be weighed against the accuracy gains.
| Consideration | Challenge | Mitigation |
|---|---|---|
| Training cost | N times the cost of a single model | Parallel training, snapshot ensembles |
| Inference latency | Each prediction requires N model evaluations | Distillation, pruning, parallel inference |
| Storage | N model artifacts to store and version | Shared base layers, pruning |
| Interpretability | Harder to explain than a single model | Feature importance aggregation, SHAP values |
| Maintenance | Multiple models to monitor and update | Automated pipelines, model registries |
| Method | Training style | Diversity source | Strengths | Weaknesses |
|---|---|---|---|---|
| Bagging | Parallel | Bootstrap sampling | Reduces variance, simple to implement | Does not reduce bias |
| Random forests | Parallel | Bootstrap + feature sampling | Strong defaults, feature importance | Can be slow with many trees |
| AdaBoost | Sequential | Reweighting samples | Reduces bias, interpretable weights | Sensitive to noisy data and outliers |
| Gradient boosting | Sequential | Fitting residuals | Flexible loss functions, state-of-the-art on tabular data | Prone to overfitting without regularization |
| Stacking | Two-stage | Different algorithms | Can combine heterogeneous models | Complex, risk of overfitting meta-learner |
| Voting | Independent | Different algorithms | Simple, no additional training | Limited improvement if models are similar |
| Blending | Two-stage | Different algorithms | Simple holdout approach | Uses less data than stacking |
| MoE | Joint (gating + experts) | Specialization by input region | Efficient sparse computation | Complex routing, training instability |
Several theoretical results justify the effectiveness of ensemble methods:

- The Condorcet jury theorem and the Hoeffding bound discussed above show that majority votes of independent, better-than-chance classifiers approach perfect accuracy as the ensemble grows
- The bias-variance decomposition, and its three-way extension with diversity, explains which error component each ensemble strategy reduces
- The ambiguity decomposition (Krogh and Vedelsby, 1995) shows that an ensemble's squared error equals the average member error minus the average disagreement among members, so disagreement can only help
- Margin theory (Schapire et al., 1998) explains why boosting can keep improving generalization even after training error reaches zero
- Schapire's (1990) proof that weak learnability implies strong learnability provides the foundational guarantee behind boosting
Imagine you have a big test coming up, and you want to know the answers to all the questions. You could ask one really smart friend, but what if they don't know the answer to one of the questions? Instead, you ask several friends, each with different strengths and weaknesses. Then, you take their answers and combine them in a smart way to get the best possible answers.
That is what ensemble methods do in machine learning. They take the "opinions" of several different models and combine them to get a better, more accurate prediction. Each model might get some things wrong, but they tend to get different things wrong. When you put all their answers together, the mistakes cancel out and the right answers shine through.