# Ensemble

> Source: https://aiwiki.ai/wiki/ensemble
> Updated: 2026-06-21
> Categories: Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

**Ensemble methods** are techniques in [machine learning](/wiki/machine_learning) that combine the predictions of multiple models, known as base learners, to produce a single prediction that is typically more accurate and robust than any individual model. The core insight is straightforward: when the outputs of several learners are aggregated, their individual errors tend to cancel out, yielding better generalization on unseen data. Ensemble approaches underpin many of the most successful algorithms in applied machine learning, including [random forests](/wiki/random_forest), [gradient boosting](/wiki/gradient_boosting) machines, and competition-winning solutions on platforms like Kaggle. Their effectiveness is well documented: the $1 million Netflix Prize was won in 2009 by a team that blended hundreds of individual models, achieving a 10.06% improvement over Netflix's own algorithm with a final RMSE of 0.8567.[13]

## What are the main types of ensemble methods?

The principal families of ensemble methods are bagging, boosting, stacking, voting, blending, and Bayesian model averaging, plus the [mixture of experts](/wiki/mixture_of_experts) architecture used in modern deep learning. Bagging trains models in parallel on resampled data to cut variance; boosting trains models sequentially to cut bias; and stacking trains a meta-learner to combine heterogeneous models. Each is described in detail below.

## Why do ensembles work?

The effectiveness of ensemble methods rests on several complementary intuitions.

### Wisdom of crowds and error decorrelation

If individual models make independent errors, averaging their predictions causes those errors to cancel. This is a statistical consequence of the law of large numbers. The Condorcet jury theorem (1785) formalizes a related idea for binary decisions: if each of N independent voters is correct with probability p > 0.5, the probability that the majority vote is correct approaches 1 as N grows. Applied to N such independent classifiers, Hoeffding's inequality gives a bound on the majority-vote error:

P(ensemble error) <= exp(-2N(p - 0.5)^2)

This exponential decay explains why even modest individual accuracy can yield strong collective performance, provided the models' errors are sufficiently uncorrelated.
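The bound can be compared with the exact majority-vote error, which follows a binomial distribution when the voters are independent. The following is a minimal sketch under that independence assumption; the function names are illustrative and not taken from any library.

```python
from math import comb, exp

def majority_vote_accuracy(n_voters: int, p: float) -> float:
    """Exact probability that a strict majority of n independent voters is correct."""
    threshold = n_voters // 2 + 1  # votes needed for a strict majority
    return sum(
        comb(n_voters, k) * p**k * (1 - p) ** (n_voters - k)
        for k in range(threshold, n_voters + 1)
    )

def hoeffding_error_bound(n_voters: int, p: float) -> float:
    """Upper bound exp(-2N(p - 0.5)^2) on the majority-vote error."""
    return exp(-2 * n_voters * (p - 0.5) ** 2)

# Each voter is right 60% of the time; the ensemble error shrinks as N grows.
for n in (1, 11, 101):
    exact_error = 1 - majority_vote_accuracy(n, 0.6)
    print(n, exact_error, hoeffding_error_bound(n, 0.6))
```

The exact error is always at or below the bound, and the bound is loose for small N. Both quantities assume fully independent voters, which real classifiers trained on overlapping data rarely satisfy.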

### Bias-variance perspective

Prediction error can be decomposed into bias (systematic error from incorrect assumptions), variance (sensitivity to training data fluctuations), and irreducible noise. Different ensemble strategies target different components:

| Strategy | Primary effect | Mechanism |
|---|---|---|
| [Bagging](/wiki/bagging) | Reduces variance | Averages many high-variance models trained on bootstrap samples |
| [Boosting](/wiki/boosting) | Reduces bias | Sequentially corrects residual errors of prior models |
| Stacking | Can reduce both | A meta-learner learns how to weight and combine the base models |

Recent theoretical work by Wood et al. (2023) extends this to a three-way bias/variance/diversity trade-off, showing that diversity among ensemble members always subtracts from expected risk when bias and variance are held fixed.[11]

### Model diversity

Diversity is the engine that makes ensembles work. If every base learner made identical predictions, combining them would offer no benefit. Sources of diversity include:

- Training on different data subsets (as in bagging)
- Using different algorithms or architectures
- Varying hyperparameters or random seeds
- Training on different feature subsets (random subspace method)

Empirical studies consistently show that ensembles of diverse models outperform ensembles of similar models, even when the individual diverse models are somewhat weaker.

## Types of ensemble methods

### Bagging (bootstrap aggregating)

Bagging, introduced by Leo Breiman in 1996, generates multiple versions of a predictor by training each on a different bootstrap sample (random sample with replacement) of the training data.[1] In Breiman's own words, "Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor."[1] For [regression](/wiki/regression_model) tasks, predictions are averaged; for [classification](/wiki/classification_model) tasks, a majority vote is taken.
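The procedure can be sketched in a few lines of plain Python. This is an illustrative sketch for regression, not Breiman's original code: the 1-nearest-neighbour base learner, the synthetic data, and all function names are assumptions chosen for brevity.

```python
import random
from statistics import mean

def fit_1nn(points):
    """Toy base learner (an assumption): 1-nearest-neighbour regression on (x, y) pairs."""
    return lambda x: min(points, key=lambda pt: abs(pt[0] - x))[1]

def fit_bagged(points, n_models, rng):
    """Fit one base learner per bootstrap sample (n draws with replacement)."""
    n = len(points)
    return [
        fit_1nn([points[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_models)
    ]

def predict_bagged(models, x):
    """Regression aggregate: average the individual predictions."""
    return mean(model(x) for model in models)

rng = random.Random(0)
# Synthetic noisy quadratic, used only to exercise the code.
data = [(x / 10, (x / 10) ** 2 + rng.gauss(0, 0.1)) for x in range(50)]
models = fit_bagged(data, n_models=25, rng=rng)
print(predict_bagged(models, 2.5))
```

For classification, `predict_bagged` would take a majority vote over the predicted labels instead of an average.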

Because each bootstrap sample draws n observations with replacement from a set of size n, the probability that any given observation is omitted from a sample is (1 - 1/n)^n, which converges to 1/e, or about 36.8%, as n grows. This held-out portion (called the out-of-bag sample) can be used to estimate generalization error without a separate validation set. Bagging is most effective with high-variance, low-bias base learners such as deep [decision trees](/wiki/decision_tree).
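The out-of-bag fraction is easy to check numerically. The sketch below draws a single bootstrap sample and compares the simulated omitted fraction with the analytic value and its limit; the sample size and seed are arbitrary choices.

```python
import math
import random

def omitted_fraction(n: int, rng: random.Random) -> float:
    """Draw one bootstrap sample of size n; return the fraction of indices never drawn."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1 - len(drawn) / n

rng = random.Random(0)
n = 10_000
simulated = omitted_fraction(n, rng)
analytic = (1 - 1 / n) ** n
print(simulated, analytic, 1 / math.e)
```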

**Variants:**

| Variant | Description |
|---|---|
| Pasting | Samples drawn without replacement instead of with replacement |
| Random subspaces | Each model trained on a random subset of features rather than a random subset of samples |
| Random patches | Combines random sampling of both features and samples |

### Random forests

The [random forest](/wiki/random_forest) algorithm, also developed by Breiman (2001), extends bagging by adding feature randomness.[4] At each split in each tree, only a random subset of features is considered. This further decorrelates the trees, reducing variance beyond what plain bagging achieves. Random forests are widely used due to their strong out-of-the-box performance, resistance to [overfitting](/wiki/overfitting), and ability to estimate feature importance.

### Boosting

Boosting trains a sequence of weak learners, where each new learner focuses on the examples that previous learners got wrong. Unlike bagging, which trains models independently in parallel, boosting builds models sequentially so that each one corrects the mistakes of its predecessors.

**AdaBoost** (Adaptive Boosting), proposed by Freund and Schapire (1996), was the first practical boosting algorithm.[2] It maintains a set of weights over training examples, increasing the weight of misclassified instances after each round. The final prediction is a weighted majority vote of all weak learners, where each learner's weight reflects its accuracy.
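A compact sketch of the reweighting loop is shown below. It assumes one-dimensional inputs, labels in {-1, +1}, and threshold "stumps" as weak learners; the toy dataset and helper names are illustrative and not taken from the original paper.

```python
import math

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best 1-D threshold stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Discrete AdaBoost; returns a list of (alpha, threshold, polarity) triples."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # more accurate stumps get larger votes
        ensemble.append((alpha, threshold, polarity))
        # Up-weight misclassified examples, down-weight correct ones, then renormalize.
        weights = [
            w * math.exp(-alpha * y * (polarity if x >= threshold else -polarity))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote of all stumps."""
    score = sum(a * (pol if x >= thr else -pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# No single stump separates these labels, but three boosted stumps do.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
print([predict(ensemble, x) for x in xs])
```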

**Gradient boosting**, formalized by Friedman (2001), generalizes boosting to arbitrary differentiable loss functions.[5] Each new model fits the negative gradient (pseudo-residuals) of the loss with respect to the current ensemble's predictions. This framework is extremely flexible and forms the basis of high-performance libraries such as XGBoost (Chen and Guestrin, 2016), LightGBM, and CatBoost.[6]
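For squared-error loss the negative gradient is simply the residual, which makes the idea easy to sketch. The code below is a toy illustration under that assumption, using single-split regression stumps on one-dimensional data; it omits the regularization and system optimizations that libraries such as XGBoost add, and the learning rate of 0.5 is an arbitrary choice.

```python
from statistics import mean

def fit_regression_stump(xs, residuals):
    """Least-squares fit of a single-split, piecewise-constant function to the residuals."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # skip the smallest x so the left side is never empty
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_value, right_value = mean(left), mean(right)
        sse = sum((r - left_value) ** 2 for r in left) + sum((r - right_value) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x < threshold else right_value

def gradient_boost(xs, ys, rounds, learning_rate=0.5):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    initial = mean(ys)
    stumps = []
    predictions = [initial] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]  # pseudo-residuals
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: initial + learning_rate * sum(s(x) for s in stumps)

xs = list(range(10))
ys = [x * x for x in xs]
model = gradient_boost(xs, ys, rounds=20)
print([round(model(x), 1) for x in xs])
```

Other differentiable losses change only the line that computes `residuals`, which would become the negative gradient of the chosen loss at the current predictions.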

| Boosting algorithm | Key idea | Typical use case |
|---|---|---|
| AdaBoost | Reweights misclassified samples | Binary classification with weak learners |
| Gradient boosting | Fits residuals using gradient descent in function space | Tabular data, regression and ranking |
| XGBoost | Regularized gradient boosting with system optimizations | Kaggle competitions, production ML |
| LightGBM | Histogram-based splits, leaf-wise growth | Large-scale tabular data |
| CatBoost | Native categorical feature handling, ordered boosting | Data with many categorical variables |

### Stacking (stacked generalization)

Stacking, introduced by David Wolpert in 1992, uses a two-level architecture.[3] First, several diverse base learners are trained on the full training set (typically using cross-validation to generate out-of-fold predictions). Then a second-level model, called the meta-learner, is trained to combine the base learners' predictions into a final output. The meta-learner learns which base models to trust in which regions of the input space. Common choices for the meta-learner include logistic regression, ridge regression, and simple [neural networks](/wiki/neural_network).

Stacking can be extended to multiple levels, though in practice two levels are most common because additional layers add complexity with diminishing returns.
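The key mechanical step is generating out-of-fold predictions so that the meta-learner never sees a prediction made by a model that was trained on the same example. The sketch below illustrates this with two toy base learners and a deliberately simple meta-learner (a single convex mixing weight found by grid search). These choices, and all function names, are assumptions made to keep the example short.

```python
from statistics import mean

def fit_mean(train):
    """Base learner 1: always predict the training mean."""
    value = mean(y for _, y in train)
    return lambda x: value

def fit_1nn(train):
    """Base learner 2: 1-nearest-neighbour regression."""
    return lambda x: min(train, key=lambda pt: abs(pt[0] - x))[1]

def out_of_fold_predictions(data, fit_fns, n_folds):
    """Predict each example with base models fitted on the other folds only."""
    oof = [[None] * len(fit_fns) for _ in data]
    for fold in range(n_folds):
        held_out = [i for i in range(len(data)) if i % n_folds == fold]
        train = [pt for i, pt in enumerate(data) if i % n_folds != fold]
        for j, fit in enumerate(fit_fns):
            model = fit(train)
            for i in held_out:
                oof[i][j] = model(data[i][0])
    return oof

def fit_meta_weight(oof, targets, grid=101):
    """Meta-learner: weight w on model 0 (and 1 - w on model 1) minimizing squared error."""
    def loss(w):
        return sum((w * a + (1 - w) * b - y) ** 2 for (a, b), y in zip(oof, targets))
    return min((k / (grid - 1) for k in range(grid)), key=loss)

def fit_stack(data, n_folds=5):
    """Return the stacked predictor and the learned mixing weight."""
    fit_fns = [fit_mean, fit_1nn]
    oof = out_of_fold_predictions(data, fit_fns, n_folds)
    w = fit_meta_weight(oof, [y for _, y in data])
    models = [fit(data) for fit in fit_fns]  # refit base learners on all data
    return (lambda x: w * models[0](x) + (1 - w) * models[1](x)), w

data = [(float(x), 2.0 * x) for x in range(20)]
stacked, weight_on_mean = fit_stack(data)
print(weight_on_mean, stacked(7.5))
```

In practice the meta-learner is usually a regularized linear or logistic model rather than a grid search over one weight.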

### Voting

Voting is the simplest ensemble combination strategy. Given a set of trained models, their predictions are combined by:

- **Hard voting:** Each model casts a vote for a class label, and the majority class is selected.
- **Soft voting:** Each model outputs class probabilities, which are averaged (optionally with weights), and the class with the highest average probability is selected.

Soft voting generally outperforms hard voting because it uses more information from each model. Weighted voting allows higher-performing models to have greater influence on the final prediction.
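The two rules can disagree when a minority of models is much more confident than the majority. The following sketch uses made-up probabilities for three hypothetical models on a two-class problem.

```python
from collections import Counter

def hard_vote(probabilities):
    """Each model votes for its most probable class; the most common vote wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probabilities, weights=None):
    """Average class probabilities (optionally weighted) and pick the highest average."""
    weights = weights or [1.0] * len(probabilities)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / sum(weights)
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=averaged.__getitem__)

# Two lukewarm votes for class 0, one confident vote for class 1.
model_outputs = [[0.55, 0.45], [0.55, 0.45], [0.10, 0.90]]
print(hard_vote(model_outputs))  # class 0 wins on vote count
print(soft_vote(model_outputs))  # class 1 wins on average probability
```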

### Blending

Blending is similar to stacking but uses a single holdout validation set rather than cross-validation to generate the training data for the meta-learner. The process is:

1. Split training data into a training portion and a holdout (blending) portion.
2. Train base models on the training portion.
3. Generate predictions on the holdout portion.
4. Train a meta-learner on these holdout predictions.

Blending is simpler to implement than stacking and less prone to data leakage, but it uses less data for training the base models and can have higher variance in the meta-learner's training signal.

### Bayesian model averaging

Bayesian model averaging (BMA) weights each model's prediction by its posterior probability given the observed data. The posterior probabilities are derived from the model evidence (marginal likelihood), which in practice is often approximated with the Bayesian Information Criterion (BIC); weights based on the Akaike Information Criterion (AIC) are a closely related, non-Bayesian alternative. BMA provides a principled framework for model uncertainty but assumes that the true model is among the candidates, which may not hold in practice.
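One common approximation, assuming equal prior probability for every candidate model, sets each weight proportional to exp(-BIC/2). The sketch below applies that approximation; the BIC values and predictions are made-up numbers for illustration.

```python
import math

def bic_weights(bic_values):
    """Approximate posterior model probabilities from BIC, assuming equal model priors."""
    best = min(bic_values)
    # Subtract the minimum before exponentiating for numerical stability.
    unnormalized = [math.exp(-0.5 * (b - best)) for b in bic_values]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def bma_predict(predictions, weights):
    """Weighted average of the candidate models' predictions."""
    return sum(w * p for w, p in zip(weights, predictions))

weights = bic_weights([100.0, 102.0, 110.0])  # hypothetical BIC scores
print(weights)
print(bma_predict([1.0, 2.0, 5.0], weights))
```

Because the weights decay exponentially in the BIC difference, a model only a few BIC units worse than the best one already receives a small share of the average.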

## How are ensembles used in deep learning?

Ensembling is also widely used in deep learning, though the computational cost of training multiple large neural networks can be substantial. Several techniques address this.

### Multi-seed ensembles

The simplest approach is to train the same architecture multiple times with different random initializations. Because neural network optimization is non-convex, different seeds converge to different local minima, providing natural diversity. Lakshminarayanan et al. (2017) showed that deep ensembles of as few as five networks provide well-calibrated uncertainty estimates and strong predictive performance.[8]

### Snapshot ensembles

Snapshot ensembles, proposed by Huang et al. (2017), collect multiple models from a single training run by using a cyclic learning rate schedule.[7] The learning rate is periodically reset to a high value, causing the optimizer to escape the current local minimum and converge to a new one. The model weights are saved ("snapshotted") at each convergence point. This yields an ensemble at essentially no extra training cost compared to training a single model.
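The schedule used by Huang et al. is a cosine that restarts at the beginning of every cycle. A minimal sketch of that schedule follows, with iterations counted from 1; the function names and the example settings (300 iterations, 3 cycles, initial rate 0.1) are illustrative.

```python
import math

def snapshot_learning_rate(t, total_iters, n_cycles, initial_lr):
    """Cyclic cosine annealing: restart at initial_lr, decay toward 0 within each cycle."""
    cycle_len = math.ceil(total_iters / n_cycles)
    position = (t - 1) % cycle_len  # iteration index within the current cycle
    return initial_lr / 2 * (math.cos(math.pi * position / cycle_len) + 1)

def snapshot_iterations(total_iters, n_cycles):
    """Last iteration of each cycle, where the rate is lowest and a snapshot is saved."""
    cycle_len = math.ceil(total_iters / n_cycles)
    return [min(c * cycle_len, total_iters) for c in range(1, n_cycles + 1)]

print(snapshot_iterations(300, 3))
print([snapshot_learning_rate(t, 300, 3, 0.1) for t in (1, 50, 100, 101)])
```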

### Dropout as implicit ensemble

Dropout, a common regularization technique, can be interpreted as implicitly training an exponential number of sub-networks that share weights. At test time, using all units with scaled weights approximates the geometric mean of the predictions of all possible sub-networks. Monte Carlo dropout extends this idea by running multiple stochastic forward passes at inference time to produce ensemble-like uncertainty estimates.

## How does mixture of experts relate to ensembles?

[Mixture of Experts](/wiki/mixture_of_experts) (MoE) is a form of ensemble where different sub-networks (experts) specialize in different regions of the input space. A gating network learns to route each input to the most appropriate expert or combination of experts. Unlike traditional ensembles that query every member for every input, MoE activates only a subset of experts per input, making it more computationally efficient.
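Sparse routing can be sketched with a top-k gate. In the illustration below the gate logits are passed in directly, whereas a real gating network would compute them from the input; the experts are trivial stand-in functions, and normalizing with a softmax over only the selected logits is one common design rather than the only one.

```python
import math

def top_k_gate(logits, k):
    """Keep the k largest gate logits and softmax over them; other experts get no weight."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, gate_logits, k=2):
    """Evaluate only the selected experts and mix their outputs by gate weight."""
    gate = top_k_gate(gate_logits, k)
    return sum(weight * experts[i](x) for i, weight in gate.items())

# Hypothetical experts; only two of the four are evaluated for this input.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 10 * x, lambda x: -x]
gate_logits = [0.0, 1.0, 1.0, -5.0]
print(top_k_gate(gate_logits, 2))
print(moe_forward(1.0, experts, gate_logits))
```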

MoE has become central to modern large language models. DeepSeek-V3 uses 256 routed experts (plus one shared expert) and activates only 37 billion of its 671 billion total parameters per token.[14] Llama 4 Scout employs 16 experts with 109 billion total parameters but only 17 billion active per token. Mixtral 8x7B uses 8 experts per layer and routes each token to the top 2, so only about 12.9 billion of its 46.7 billion total parameters are used per forward pass.[15] This sparse activation pattern allows models to have very large total parameter counts while keeping inference cost manageable.

## Ensemble selection and pruning

In many scenarios, using every available model in an ensemble is unnecessary or even harmful. Ensemble pruning (also called ensemble selection or ensemble thinning) reduces the ensemble to a subset of members that performs as well as or better than the full collection, while lowering computational cost.

Common approaches include:

- **Forward selection:** Greedily add models that improve ensemble performance on a validation set.
- **Backward elimination:** Start with all models and iteratively remove those whose removal improves or does not hurt performance.
- **Diversity-based pruning:** Select a subset that maximizes both accuracy and diversity among members.
- **Reinforcement learning-based pruning:** An agent learns a policy for selecting which classifiers to include by exploring a state space and maximizing a cumulative reward.

Caruana et al. (2004) showed that ensemble selection from a library of models can significantly outperform simply using the best single model, while keeping the ensemble size small.[10]
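Forward selection can be sketched as a greedy loop over a library of validation-set predictions. The version below allows a model to be picked more than once, in the spirit of Caruana et al.; the mean-squared-error metric, the early-stopping rule, and the tiny library are assumptions made for illustration.

```python
def ensemble_error(member_preds, targets):
    """Mean squared error of the unweighted average of the selected members' predictions."""
    n_members = len(member_preds)
    averaged = [sum(col) / n_members for col in zip(*member_preds)]
    return sum((a - y) ** 2 for a, y in zip(averaged, targets)) / len(targets)

def forward_selection(library, targets, max_size):
    """Greedily add the model (with replacement) that most lowers validation error."""
    selected = []
    best_error = float("inf")
    for _ in range(max_size):
        candidates = [
            (ensemble_error([library[name] for name in selected + [cand]], targets), cand)
            for cand in sorted(library)
        ]
        error, cand = min(candidates)
        if error >= best_error:
            break  # no candidate improves the ensemble, so stop early
        selected.append(cand)
        best_error = error
    return selected, best_error

# Hypothetical validation predictions from three models.
library = {"a": [1.0, 1.0, 1.0], "b": [3.0, 3.0, 3.0], "c": [10.0, 10.0, 10.0]}
targets = [2.0, 2.0, 2.0]
print(forward_selection(library, targets, max_size=5))
```

Here models `a` and `b` err in opposite directions, so their average is exact and the poorly performing model `c` is never selected.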

## Why do ensembles dominate machine learning competitions?

Ensemble methods have been instrumental in machine learning competitions. Nearly every winning solution on Kaggle involves some form of ensembling. A typical competition pipeline stacks several diverse models (gradient boosting, neural networks, linear models) and uses a blending or stacking layer to combine them.

Notable examples include:

- The Netflix Prize (2009), where the winning team BellKor's Pragmatic Chaos combined hundreds of models through blending to reach an RMSE of 0.8567, a 10.06% improvement over Netflix's Cinematch baseline; the team submitted its winning entry on July 26, 2009, just 20 minutes ahead of a rival team that achieved an identical score.[13]
- The Crowdflower Search Results Relevance competition, won by Chenglong Chen using an ensemble of 35 models, many of which were themselves ensembles.
- Most Kaggle competitions on tabular data, where XGBoost or LightGBM models ensembled with neural networks and other learners dominate the leaderboards.

The success of ensembles in competitions has driven development of automated ensembling tools in libraries like auto-sklearn, AutoGluon, and H2O AutoML.

## Practical considerations

### Computational cost

Ensembles require training and storing multiple models, which increases both training time and memory usage. For bagging and voting, base models can be trained in parallel. Boosting is inherently sequential since each model depends on the previous one's errors. Stacking requires multiple rounds of cross-validated training.

### Inference latency

At prediction time, querying every model in an ensemble increases latency in proportion to the ensemble size (or less if the models run in parallel). In latency-sensitive applications, techniques like knowledge distillation (training a single student model to mimic the ensemble) can compress the ensemble's knowledge into a single, faster model.

### Model management

Deploying ensembles in production requires managing multiple model artifacts, monitoring each component's performance over time, and ensuring consistency across updates. Modern ML platforms and model registries provide tooling for this, but the operational burden is real and should be weighed against the accuracy gains.

| Consideration | Challenge | Mitigation |
|---|---|---|
| Training cost | N times the cost of a single model | Parallel training, snapshot ensembles |
| Inference latency | Each prediction requires N model evaluations | Distillation, pruning, parallel inference |
| Storage | N model artifacts to store and version | Shared base layers, pruning |
| Interpretability | Harder to explain than a single model | Feature importance aggregation, SHAP values |
| Maintenance | Multiple models to monitor and update | Automated pipelines, model registries |

## How do bagging, boosting, and stacking differ?

| Method | Training style | Diversity source | Strengths | Weaknesses |
|---|---|---|---|---|
| Bagging | Parallel | Bootstrap sampling | Reduces variance, simple to implement | Does not reduce bias |
| Random forests | Parallel | Bootstrap + feature sampling | Strong defaults, feature importance | Can be slow with many trees |
| AdaBoost | Sequential | Reweighting samples | Reduces bias, interpretable weights | Sensitive to noisy data and outliers |
| Gradient boosting | Sequential | Fitting residuals | Flexible loss functions, state-of-the-art on tabular data | Prone to overfitting without regularization |
| Stacking | Two-stage | Different algorithms | Can combine heterogeneous models | Complex, risk of overfitting meta-learner |
| Voting | Independent | Different algorithms | Simple, no additional training | Limited improvement if models are similar |
| Blending | Two-stage | Different algorithms | Simple holdout approach | Uses less data than stacking |
| MoE | Joint (gating + experts) | Specialization by input region | Efficient sparse computation | Complex routing, training instability |

## Theoretical foundations

Several theoretical results justify the effectiveness of ensemble methods:

- **Condorcet jury theorem (1785):** Under majority voting with independent classifiers each having accuracy above 50%, ensemble accuracy converges to 100% as the number of classifiers grows.
- **Hoeffding's inequality:** Provides exponential concentration bounds on how far the fraction of correct votes can fall below its expected value, which in turn bounds the majority-vote error for independent base learners.
- **Bias-variance decomposition:** Breiman (1996) showed formally that bagging reduces the variance component of prediction error, explaining why it helps most with unstable learners.[1]
- **Ambiguity decomposition:** Krogh and Vedelsby (1995) proved that the ensemble error equals the average error of the individual members minus a diversity term (the "ambiguity"), showing that diversity directly improves ensemble performance.[9]
- **Margin theory:** Schapire et al. (1998) analyzed boosting through the lens of classification margins, showing that boosting increases the margins on training examples, which is connected to improved generalization.[12]
- **Unified diversity theory:** Wood et al. (2023) provided a unified theory relating bias, variance, and diversity in ensembles, showing these three quantities jointly determine ensemble risk.[11]
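The ambiguity decomposition listed above is an exact identity for squared error and a weighted-average ensemble, so it can be verified numerically. The sketch below does this for a single target value; the predictions are made-up numbers.

```python
def ambiguity_decomposition(predictions, target, weights=None):
    """Return (ensemble error, weighted average member error, ambiguity) for squared error."""
    n = len(predictions)
    weights = weights or [1.0 / n] * n  # weights are assumed to sum to 1
    ensemble = sum(w * f for w, f in zip(weights, predictions))
    avg_member_error = sum(w * (f - target) ** 2 for w, f in zip(weights, predictions))
    ambiguity = sum(w * (f - ensemble) ** 2 for w, f in zip(weights, predictions))
    ensemble_error = (ensemble - target) ** 2
    return ensemble_error, avg_member_error, ambiguity

ens_err, avg_err, ambiguity = ambiguity_decomposition([1.0, 2.0, 6.0], target=2.0)
print(ens_err, avg_err - ambiguity)  # the two values agree up to rounding
```

Because the ambiguity term is never negative, the ensemble's squared error can never exceed the weighted average error of its members.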

## Explain Like I'm 5 (ELI5)

Imagine you have a big test coming up, and you want to know the answers to all the questions. You could ask one really smart friend, but what if they don't know the answer to one of the questions? Instead, you ask several friends, each with different strengths and weaknesses. Then, you take their answers and combine them in a smart way to get the best possible answers.

That is what ensemble methods do in machine learning. They take the "opinions" of several different models and combine them to get a better, more accurate prediction. Each model might get some things wrong, but they tend to get different things wrong. When you put all their answers together, the mistakes cancel out and the right answers shine through.

## References

1. Breiman, L. (1996). "Bagging Predictors." *Machine Learning*, 24(2), 123-140.
2. Freund, Y. and Schapire, R. (1996). "Experiments with a New Boosting Algorithm." *Proceedings of the Thirteenth International Conference on Machine Learning*, pp. 148-156.
3. Wolpert, D. H. (1992). "Stacked Generalization." *Neural Networks*, 5(2), 241-259.
4. Breiman, L. (2001). "Random Forests." *Machine Learning*, 45(1), 5-32.
5. Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." *Annals of Statistics*, 29(5), 1189-1232.
6. Chen, T. and Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, pp. 785-794.
7. Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., and Weinberger, K. Q. (2017). "Snapshot Ensembles: Train 1, Get M for Free." *Proceedings of ICLR 2017*.
8. Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2017). "Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles." *Advances in Neural Information Processing Systems*, 30.
9. Krogh, A. and Vedelsby, J. (1995). "Neural Network Ensembles, Cross Validation, and Active Learning." *Advances in Neural Information Processing Systems*, 7.
10. Caruana, R., Niculescu-Mizil, A., Crew, G., and Ksikes, A. (2004). "Ensemble Selection from Libraries of Models." *Proceedings of the 21st International Conference on Machine Learning*.
11. Wood, D., Mu, T., Webb, A. M., Reeve, H. W. J., Luján, M., and Brown, G. (2023). "A Unified Theory of Diversity in Ensemble Learning." *Journal of Machine Learning Research*, 24(359), 1-49.
12. Schapire, R. E., Freund, Y., Bartlett, P., and Lee, W. S. (1998). "Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods." *Annals of Statistics*, 26(5), 1651-1686.
13. Netflix Prize. "Grand Prize awarded to team BellKor's Pragmatic Chaos" (2009); see also "Netflix Prize," Wikipedia. https://www.netflixprize.com/community/topic_1537.html
14. DeepSeek-AI (2024). "DeepSeek-V3 Technical Report." arXiv:2412.19437. https://arxiv.org/abs/2412.19437
15. Jiang, A. Q., et al. (2024). "Mixtral of Experts." arXiv:2401.04088. https://arxiv.org/abs/2401.04088

