# Ensemble learning

> Source: https://aiwiki.ai/wiki/ensemble_learning
> Updated: 2026-06-21
> Categories: Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Ensemble learning** is a [machine learning](/wiki/machine_learning) paradigm that combines multiple models to produce predictions that are often more accurate than those of any single constituent model. By aggregating the outputs of diverse learners, ensemble methods can reduce [overfitting](/wiki/overfitting), lower variance, improve accuracy, and increase robustness. Ensemble approaches underpin many of the most successful algorithms in applied machine learning, including [random forests](/wiki/random_forest), gradient boosted trees, and the stacking strategies that consistently win data science competitions [1]. As Thomas Dietterich put it in his 2000 survey, "An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way (typically by weighted or unweighted voting) to classify new examples" [1]. Two of the field's landmark papers, Leo Breiman's "Random Forests" (2001) and Tianqi Chen and Carlos Guestrin's "XGBoost: A Scalable Tree Boosting System" (2016), are among the most cited works in all of machine learning, with the random forests paper alone accumulating well over 100,000 citations [3][6].

## Overview

The core idea behind ensemble learning is straightforward: when a group of models each make different kinds of errors, combining them tends to cancel out the individual mistakes and yield more accurate predictions. This principle is analogous to the "wisdom of crowds" effect in human decision-making, where the aggregate judgment of many independent individuals tends to outperform any single expert.

Ensemble methods vary in how they train the individual models (called base learners or weak learners) and how they combine their predictions. The three main families are bagging, boosting, and stacking, each addressing different aspects of the [bias-variance tradeoff](/wiki/bias_variance_tradeoff).

### What are the main types of ensemble methods?

| Type | Training strategy | Combination method | Primary effect | Key algorithms |
|---|---|---|---|---|
| Bagging | Train models in parallel on bootstrap samples | Averaging (regression) or majority vote (classification) | Reduces variance | [Random forest](/wiki/random_forest), bagged decision trees |
| Boosting | Train models sequentially; each corrects predecessor's errors | Weighted sum | Reduces bias (and variance) | AdaBoost, [gradient boosting](/wiki/gradient_boosting), XGBoost, LightGBM, CatBoost |
| Stacking | Train diverse base models, then train a meta-model on their outputs | Learned combination via meta-learner | Reduces both bias and variance | Stacked generalization, blending |
| Voting | Train diverse models independently | Majority vote (hard) or average probabilities (soft) | Reduces variance | Voting classifier/regressor |

## Bagging

Bagging, short for bootstrap aggregating, was introduced by Leo Breiman in 1996 [2]. The method works by generating multiple bootstrap samples (random samples drawn with replacement) from the training data, training a separate base model on each sample, and then aggregating their predictions. Breiman's central observation was that the gains come from instability in the base learner: "The vital element is the instability of the prediction method. If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy" [2].

### How does bagging work?

1. From a training set of *n* examples, draw *B* bootstrap samples, each of size *n*, with replacement. On average, each bootstrap sample contains about 63.2% of the unique original training points (the rest are duplicates).
2. Train an independent base model on each bootstrap sample.
3. For [classification](/wiki/classification), combine predictions by majority vote. For [regression](/wiki/regression), combine by averaging.

Because each base model sees a slightly different version of the training data, the models develop different patterns of errors. Averaging over many such models smooths out these errors, reducing variance without substantially increasing bias.
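The 63.2% figure in step 1 follows from the probability that a given point is drawn at least once in *n* draws with replacement, 1 - (1 - 1/n)^n, which approaches 1 - 1/e as *n* grows. The sketch below checks this numerically; the helper names `bootstrap_indices` and `expected_unique_fraction` are illustrative, not part of any library.

```python
import math
import random

def bootstrap_indices(n, rng):
    """Draw n indices uniformly at random, with replacement, from range(n)."""
    return [rng.randrange(n) for _ in range(n)]

def expected_unique_fraction(n):
    """Probability that a given point appears at least once in a bootstrap sample."""
    return 1.0 - (1.0 - 1.0 / n) ** n

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 10_000
sample = bootstrap_indices(n, rng)
observed = len(set(sample)) / n  # fraction of distinct original points in the sample

print(f"expected for n={n}: {expected_unique_fraction(n):.4f}")
print(f"observed in one sample: {observed:.4f}")
print(f"limit 1 - 1/e: {1 - math.exp(-1):.4f}")
```

The remaining fraction, roughly 36.8%, is what each model never sees.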

Bagging is most effective with high-variance, low-bias base learners, such as deep [decision trees](/wiki/decision_tree). A single deep decision tree tends to overfit, but an ensemble of many deep trees trained on different bootstrap samples generalizes much better.

### Out-of-bag estimation

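A useful byproduct of bagging is the out-of-bag (OOB) error estimate. Since each bootstrap sample omits roughly 36.8% of the training points, each model can be evaluated on the points it did not see during training. For each training point, the predictions of the models that did not train on it are aggregated (by vote or by average), and comparing these aggregated predictions with the true targets gives an estimate of generalization error. The estimate behaves much like [cross-validation](/wiki/cross-validation), though it can be slightly pessimistic because each point is scored by only a subset of the ensemble, and it comes at almost no extra cost during training.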
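A minimal sketch of bagging with an OOB estimate is shown below, assuming a one-dimensional toy dataset with about 10% label noise and a single-threshold classifier as the base learner. Real bagging would normally use deep trees; the stump, the dataset, and the function names are illustrative assumptions chosen to keep the example short.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Fit a 1-D threshold classifier and return it as a predict function."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]   # never empty: t is one of xs
        right = [y for x, y in zip(xs, ys) if x > t]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    _, t, left_label, right_label = best
    return lambda x: right_label if x > t else left_label

def bagged_oob_error(xs, ys, n_models, rng):
    """Train n_models on bootstrap samples and score each point out-of-bag."""
    n = len(xs)
    oob_votes = [Counter() for _ in range(n)]
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        model = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:                          # point unseen by this model
                oob_votes[i][model(xs[i])] += 1
    # Majority vote over the models that did not train on each point.
    scored = [(votes.most_common(1)[0][0], ys[i])
              for i, votes in enumerate(oob_votes) if votes]
    return sum(pred != y for pred, y in scored) / len(scored)

rng = random.Random(0)
xs = [rng.uniform(0.0, 1.0) for _ in range(150)]
# True rule is x > 0.5; roughly 10% of labels are flipped as noise.
ys = [int(x > 0.5) if rng.random() > 0.1 else int(x <= 0.5) for x in xs]

oob_error = bagged_oob_error(xs, ys, n_models=30, rng=rng)
print(f"OOB error estimate: {oob_error:.3f}")
```

Because the flipped labels cannot be predicted, the estimate should land somewhere near the noise rate rather than near zero.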

## Random forests

The random forest algorithm, introduced by Leo Breiman in 2001 [3], extends bagging with an additional layer of randomization. In addition to training each tree on a bootstrap sample, random forests also restrict each split in each tree to consider only a random subset of features.

### How do random forests work?

1. Draw *B* bootstrap samples from the training data.
2. For each sample, grow a decision tree. At each node split, instead of considering all *p* features, randomly select *m* features (where *m* < *p*; a common default is m = sqrt(p) for classification and m = p/3 for regression) and choose the best split among those *m* features.
3. Grow each tree fully (or to a specified depth) without pruning.
4. Aggregate predictions by majority vote (classification) or averaging (regression).

The additional feature randomization serves a critical purpose: it decorrelates the individual trees. Without it, if one feature is a very strong predictor, most trees in a bagged ensemble would use that feature at the root and produce correlated predictions, limiting the variance reduction from averaging. By forcing each tree to consider different features, random forests ensure diversity among the trees.
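The per-split feature sampling in step 2 can be sketched as follows. The function names are hypothetical, and the rounding of sqrt(p) and p/3 is an assumption, since implementations differ in how they round these defaults.

```python
import math
import random

def features_per_split(p, task):
    """Common defaults for m: sqrt(p) for classification, p/3 for regression."""
    if task == "classification":
        return max(1, int(math.sqrt(p)))
    if task == "regression":
        return max(1, p // 3)
    raise ValueError(f"unknown task: {task!r}")

def candidate_features(p, task, rng):
    """Pick the random subset of feature indices considered at one node split."""
    return sorted(rng.sample(range(p), features_per_split(p, task)))

rng = random.Random(42)
p = 16
for node in range(3):
    # A fresh subset is drawn at every node, not once per tree.
    print(f"node {node}: candidate features {candidate_features(p, 'classification', rng)}")
```

Drawing a new subset at each node is what keeps a single dominant feature from appearing at the root of every tree.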

### Properties of random forests

| Property | Details |
|---|---|
| Accuracy | Competitive with the best algorithms for tabular data; often among the top performers on benchmark datasets |
| Robustness | Resistant to overfitting as the number of trees increases; adding more trees rarely hurts accuracy (though gains diminish and training cost grows) |
| Feature importance | Provides built-in measures of feature importance based on how much each feature reduces impurity across all trees |
| Handling of mixed data types | Can handle both numerical and categorical features |
| Missing values | Some implementations handle missing values natively |
| Hyperparameters | Main parameters are the number of trees (*B*) and the number of features per split (*m*); both are relatively easy to tune |
| Parallelization | Trees are independent, so training can be parallelized across multiple CPU cores |

Breiman's 2001 paper [3] became one of the most cited papers in all of machine learning, with over 100,000 citations recorded by Semantic Scholar. Random forests remain a default choice for tabular data problems in industry, especially when interpretability (via feature importance) and minimal tuning are priorities.

## Boosting

Boosting is a family of algorithms that trains models sequentially, with each new model focusing on the examples that previous models got wrong. Unlike bagging, which reduces variance by averaging independent models, boosting reduces bias by iteratively correcting errors.

### AdaBoost

Adaptive Boosting (AdaBoost), introduced by Yoav Freund and Robert Schapire in 1997, was the first practical boosting algorithm [4]. Freund and Schapire received the Gödel Prize in 2003 for the theoretical contributions of the work.

AdaBoost works as follows:

1. Initialize equal weights for all training examples.
2. Train a weak learner (typically a decision stump, which is a tree with a single split) on the weighted data.
3. Compute the weighted error rate of the learner.
4. Assign a weight to the learner based on its accuracy (more accurate learners get higher weights).
5. Update the training example weights: increase weights for misclassified examples, decrease weights for correctly classified ones.
6. Repeat steps 2-5 for a specified number of rounds.
7. The final prediction is a weighted vote of all weak learners.

The key insight is that by re-weighting examples, each successive learner is forced to concentrate on the "hard" cases that previous learners struggled with. The theoretical guarantee is that if each weak learner performs slightly better than random guessing, the combined ensemble can achieve arbitrarily low training error.
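The steps above can be sketched in plain Python for binary labels in {-1, +1}. This is a minimal illustration, not a reference implementation: it assumes the common learner weight alpha = 0.5 * ln((1 - err) / err) and the exponential re-weighting rule, and the ten-point dataset and function names are made up for the example.

```python
import math

def stump_predict(x, threshold, polarity):
    """Predict `polarity` when x < threshold and `-polarity` otherwise."""
    return polarity if x < threshold else -polarity

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best 1-D stump."""
    values = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in thresholds:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, t, polarity) != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best

def adaboost(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n                          # step 1: equal weights
    ensemble = []                                    # (alpha, threshold, polarity)
    for _ in range(n_rounds):
        err, t, pol = fit_stump(xs, ys, weights)     # steps 2-3
        err = min(max(err, 1e-12), 1 - 1e-12)        # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # step 4: learner weight
        ensemble.append((alpha, t, pol))
        # Step 5: misclassified points (y * h(x) = -1) gain weight.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, pol))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Step 7: weighted vote of all weak learners."""
    score = sum(alpha * stump_predict(x, t, pol) for alpha, t, pol in ensemble)
    return 1 if score >= 0 else -1

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]   # no single threshold separates these
for rounds in (1, 2, 3):
    model = adaboost(xs, ys, rounds)
    errors = sum(predict(model, x) != y for x, y in zip(xs, ys))
    print(f"{rounds} round(s): {errors} training errors")
```

No single stump can fit this label pattern, yet the weighted vote of three stumps classifies every training point correctly.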

### Gradient boosting

Gradient boosting, formalized by Jerome Friedman in 2001 [5], generalizes boosting to arbitrary differentiable [loss functions](/wiki/loss_function). Instead of re-weighting examples, gradient boosting fits each new model to the negative gradient (pseudo-residuals) of the loss function with respect to the current ensemble's predictions.

The gradient boosting procedure is:

1. Initialize the model with a constant prediction (for example, the mean of the target values).
2. For each boosting round:
   a. Compute the negative gradient of the loss function at the current predictions. For squared error loss, this is simply the residuals (true values minus current predictions).
   b. Fit a new weak learner (typically a small decision tree with 4 to 8 leaves) to these negative gradients.
   c. Add the new learner's predictions, scaled by a learning rate, to the current ensemble.
3. Repeat for a specified number of rounds.

The learning rate (or shrinkage factor, typically 0.01 to 0.3) controls how much each new tree contributes. Smaller learning rates require more trees but generally produce better results due to the regularization effect.
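For squared error loss the procedure reduces to repeatedly fitting small trees to residuals. The sketch below assumes one-dimensional inputs and two-leaf regression stumps as weak learners (smaller than the 4 to 8 leaves mentioned above); the function names and the toy target y = x^2 are illustrative.

```python
def fit_regression_stump(xs, targets):
    """Best single-split regression tree (two leaves) under squared error."""
    values = sorted(set(xs))
    best = None
    for a, b in zip(values, values[1:]):
        t = (a + b) / 2                       # midpoint keeps both sides non-empty
        left = [r for x, r in zip(xs, targets) if x < t]
        right = [r for x, r in zip(xs, targets) if x >= t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return t, left_mean, right_mean

def gradient_boost(xs, ys, n_rounds, learning_rate):
    base = sum(ys) / len(ys)                  # step 1: constant prediction
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):
        # Step 2a: for squared error the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        # Step 2b: fit a weak learner to the residuals.
        t, left, right = fit_regression_stump(xs, residuals)
        stumps.append((t, left, right))
        # Step 2c: add the shrunken correction to the ensemble.
        preds = [p + learning_rate * (left if x < t else right)
                 for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history

def predict(base, stumps, learning_rate, x):
    return base + learning_rate * sum(left if x < t else right
                                      for t, left, right in stumps)

xs = [i / 10 for i in range(20)]
ys = [x ** 2 for x in xs]
base, stumps, history = gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1)
print(f"training MSE after round 1:  {history[0]:.4f}")
print(f"training MSE after round 50: {history[-1]:.4f}")
```

With a learning rate of 0.1 each stump removes only part of the remaining residual, which is why many rounds are needed; raising the rate speeds up the fit on training data at the cost of the regularization effect described above.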

Gradient boosting is extremely flexible because it works with any differentiable loss function: squared error for regression, log loss for classification, quantile loss for quantile regression, and many domain-specific objectives.

### XGBoost

XGBoost (Extreme Gradient Boosting), introduced by Tianqi Chen and Carlos Guestrin in 2016 [6], is an optimized implementation of gradient boosting that added several key innovations:

| Innovation | Description |
|---|---|
| Regularization | Adds L1 and L2 penalties on leaf weights, reducing overfitting |
| Approximate split finding | Uses weighted quantile sketch for efficient handling of large datasets |
| Sparsity awareness | Handles missing values natively by learning optimal default split directions |
| Column subsampling | Borrows from random forests to add feature randomization, further reducing overfitting |
| Parallel tree construction | Parallelizes the computation within each tree (not across trees) for faster training |
| Cache-aware access | Optimizes memory access patterns for hardware efficiency |

XGBoost became the dominant algorithm in [Kaggle](/wiki/kaggle) competitions and many industrial applications. The paper's central claim is one of scale: "XGBoost scales beyond billions of examples using far fewer resources than existing systems" [6]. The same paper reported that a majority of the challenge-winning solutions published on Kaggle's blog during 2015 used XGBoost [6].

### LightGBM

LightGBM, developed by Microsoft Research and released in 2017 [7], introduced two key techniques that make gradient boosting faster on large datasets:

**Gradient-based One-Side Sampling (GOSS).** Instead of using all data points to compute gradients, GOSS keeps all instances with large gradients (which contribute more to information gain) and randomly samples from instances with small gradients. This significantly reduces computation without sacrificing accuracy.
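A rough sketch of the GOSS selection step is shown below. It assumes the scheme described in the LightGBM paper [7], where a fraction `top_rate` of instances with the largest absolute gradients is kept, a fraction `other_rate` of the full dataset is sampled from the rest, and the sampled instances are up-weighted by (1 - top_rate) / other_rate to compensate for the changed data distribution. The function name and parameter names here are illustrative and are not LightGBM's API.

```python
import random

def goss_sample(gradients, top_rate, other_rate, rng):
    """Return (indices, weights) chosen by Gradient-based One-Side Sampling."""
    n = len(gradients)
    # Rank instances by gradient magnitude, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = round(top_rate * n)
    n_other = round(other_rate * n)
    top = order[:n_top]                           # always kept
    sampled = rng.sample(order[n_top:], n_other)  # random subset of the rest
    amplify = (1.0 - top_rate) / other_rate       # compensates for undersampling
    return top + sampled, [1.0] * n_top + [amplify] * n_other

rng = random.Random(0)
gradients = [rng.gauss(0.0, 1.0) for _ in range(100)]
indices, weights = goss_sample(gradients, top_rate=0.2, other_rate=0.1, rng=rng)
print(f"kept {len(indices)} of {len(gradients)} instances")
print(f"weight on sampled small-gradient instances: {weights[-1]:.1f}")
```

Split gains would then be computed on the 30 selected instances using these weights rather than on all 100.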

**Exclusive Feature Bundling (EFB).** In sparse datasets (common in real-world applications), many features are mutually exclusive (they rarely take nonzero values simultaneously). EFB bundles these features together, reducing the effective number of features.

LightGBM also grows trees leaf-wise rather than level-wise, which can produce deeper, more accurate trees with fewer splits. It is particularly fast on large datasets and is widely used in production systems at scale.

### CatBoost

CatBoost, developed by Yandex and released in 2017 [8], addresses a specific challenge in gradient boosting: handling categorical features. XGBoost has traditionally required categorical features to be encoded (one-hot encoding, label encoding, or target encoding) before training, and LightGBM offers only basic built-in support. CatBoost instead processes categorical features directly using an ordered target encoding scheme that avoids target leakage.
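The idea behind ordered target encoding can be sketched as follows: each row's category is replaced by a smoothed average of the targets of earlier rows with the same category, so a row's own target never leaks into its encoding. This is a simplified illustration that treats the given row order as the permutation (CatBoost itself uses random permutations); the function name, the `prior`, and the `prior_weight` parameter are assumptions for the example.

```python
def ordered_target_encoding(categories, targets, prior, prior_weight=1.0):
    """Encode each row using only targets of earlier rows in the same category."""
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # Smoothed mean of previously seen targets; falls back to the prior.
        encoded.append((s + prior_weight * prior) / (c + prior_weight))
        # Only now does this row's target become visible to later rows.
        sums[cat] = s + y
        counts[cat] = c + 1
    return encoded

categories = ["a", "b", "a", "a", "b"]
targets = [1, 0, 1, 0, 1]
encoded = ordered_target_encoding(categories, targets, prior=0.5)
for cat, y, e in zip(categories, targets, encoded):
    print(f"category={cat} target={y} encoded={e:.3f}")
```

The first occurrence of each category receives the prior, since no earlier targets exist for it.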

CatBoost also uses ordered boosting, a permutation-based approach that trains each tree on a different ordering of the training data. This reduces the overfitting that can occur when the same data is used for both computing the gradient and fitting the tree.

### How do XGBoost, LightGBM, and CatBoost differ?

| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Tree growth | Level-wise | Leaf-wise | Symmetric (balanced) |
| Categorical feature handling | Requires encoding | Basic support | Native, ordered target encoding |
| Missing value handling | Native | Native | Native |
| Training speed (large data) | Moderate | Fast | Moderate |
| GPU support | Yes | Yes | Yes |
| Regularization | L1, L2 on weights | L1, L2 on weights | L2 on weights, ordered boosting |
| Default performance | Strong | Strong | Strong, especially with categorical data |
| Year introduced | 2016 | 2017 | 2017 |

All three libraries produce comparable results on most benchmarks. The choice among them often depends on the specific characteristics of the dataset (number of categorical features, dataset size, sparsity) and practical considerations (training speed, ease of use).

## Stacking

Stacking (stacked generalization) was introduced by David Wolpert in 1992 [9]. Unlike bagging and boosting, which combine models of the same type, stacking combines diverse models of potentially different types through a learned meta-model.

### How does stacking work?

1. **Level 0: Base learners.** Train multiple diverse models (for example, a random forest, an SVM, a [neural network](/wiki/neural_network), and a gradient boosting model) on the training data. Crucially, use cross-validation to generate out-of-fold predictions: for each fold, train each base model on the training folds and generate predictions for the held-out fold. This prevents the meta-model from overfitting to the base learners' training-set predictions.
2. **Level 1: Meta-learner.** Treat the out-of-fold predictions from all base learners as features and train a meta-learner (often a simple model like [logistic regression](/wiki/logistic_regression) or [linear regression](/wiki/linear_regression)) to combine them.
3. **Prediction.** For new data, generate predictions from all base learners and pass them through the meta-learner.

Stacking works because different algorithms capture different patterns in the data, and the meta-learner learns how much weight to give each of them. A random forest might excel at capturing nonlinear interactions, while a linear model might capture linear trends more accurately. A simple linear meta-learner learns one global weight per base model; a more flexible meta-learner, or one that also receives the original features, can learn which base model to trust in which regions of the feature space.
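The three stacking steps can be sketched for a small regression problem. Everything here is an illustrative assumption: the two base learners (a least-squares line and a 1-nearest-neighbor regressor), the synthetic data, the interleaved fold assignment, and a meta-learner that fits two weights without an intercept.

```python
import math
import random

def fit_linear(xs, ys):
    """Ordinary least squares line y = a + b*x; returns a predict function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return lambda x: a + b * x

def fit_nearest_neighbor(xs, ys):
    """1-nearest-neighbor regressor; returns a predict function."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def out_of_fold_predictions(fit, xs, ys, n_folds):
    """Predict every point with a model that never saw it during fitting."""
    n = len(xs)
    oof = [None] * n
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, n, n_folds):          # the held-out fold
            oof[i] = model(xs[i])
    return oof

def fit_meta_weights(f1, f2, ys):
    """Least-squares weights for y ~ w1*f1 + w2*f2, via the 2x2 normal equations."""
    s11 = sum(a * a for a in f1)
    s22 = sum(b * b for b in f2)
    s12 = sum(a * b for a, b in zip(f1, f2))
    t1 = sum(a * y for a, y in zip(f1, ys))
    t2 = sum(b * y for b, y in zip(f2, ys))
    det = s11 * s22 - s12 * s12
    return (t1 * s22 - t2 * s12) / det, (t2 * s11 - t1 * s12) / det

def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

rng = random.Random(0)
xs = [rng.uniform(0.0, 6.0) for _ in range(120)]
ys = [0.5 * x + math.sin(2.0 * x) + rng.gauss(0.0, 0.2) for x in xs]

# Level 0: out-of-fold predictions from each base learner.
oof_linear = out_of_fold_predictions(fit_linear, xs, ys, n_folds=5)
oof_nn = out_of_fold_predictions(fit_nearest_neighbor, xs, ys, n_folds=5)

# Level 1: the meta-learner is trained only on out-of-fold predictions.
w_linear, w_nn = fit_meta_weights(oof_linear, oof_nn, ys)

# Prediction: refit the base learners on all data, then combine.
linear_model = fit_linear(xs, ys)
nn_model = fit_nearest_neighbor(xs, ys)

def stacked_predict(x):
    return w_linear * linear_model(x) + w_nn * nn_model(x)

print(f"meta weights: linear={w_linear:.3f}, nearest neighbor={w_nn:.3f}")
print(f"stacked prediction at x=3.0: {stacked_predict(3.0):.3f}")
```

An honest performance estimate for the stacked model would still need data that neither level has seen, for example an outer hold-out set.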

Multi-level stacking (with multiple meta-learner layers) is sometimes used in competitions, but it adds complexity and is rarely necessary in practice.

## Voting

Voting ensembles are the simplest form of model combination. Multiple models are trained independently, and their predictions are combined by majority vote (hard voting) or by averaging their predicted probabilities (soft voting).

**Hard voting** assigns the label predicted by the majority of models. If three out of five models predict class A, the ensemble predicts class A.

**Soft voting** averages the predicted probabilities across all models and selects the class with the highest average probability. Soft voting generally performs better because it accounts for the confidence of each model's prediction, not just its top choice.

Voting is effective when the base models are diverse and individually strong. It provides no benefit when all models make the same predictions.
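The difference between the two rules shows up when a minority of models is very confident. In the sketch below, three hypothetical models output class probabilities for a two-class problem; the numbers are invented to make hard and soft voting disagree.

```python
from collections import Counter

def hard_vote(probabilities):
    """Each model votes for its most likely class; the most common vote wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probabilities]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probabilities):
    """Average class probabilities across models, then pick the largest."""
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    mean = [sum(p[c] for p in probabilities) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)

# Rows are models, columns are P(class 0) and P(class 1).
probs = [
    [0.55, 0.45],   # weakly prefers class 0
    [0.60, 0.40],   # weakly prefers class 0
    [0.05, 0.95],   # strongly prefers class 1
]
print("hard vote:", hard_vote(probs))   # two of three models pick class 0
print("soft vote:", soft_vote(probs))   # mean P(class 1) = 0.6, so class 1
```

Soft voting only helps when the probabilities are reasonably well calibrated; an overconfident model can dominate the average.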

## Why do ensembles work? The bias-variance perspective

The effectiveness of ensemble methods can be understood through the bias-variance decomposition of prediction error. For a single model, the expected prediction error at a point can be decomposed as:

Expected error = Bias^2 + Variance + Irreducible noise

**Bagging reduces variance.** If the base models have roughly the same bias, averaging their predictions leaves the expected bias unchanged but reduces the variance. For *B* independent models, each with variance σ², the variance of their average is σ² / B. In practice, the models are not fully independent (they are trained on overlapping bootstrap samples), so the variance falls by less than a factor of *B*, but the reduction is still substantial.
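The effect of correlation can be made concrete. Under the simplifying assumption that all *B* models share the same variance σ² and the same pairwise correlation ρ, the variance of their average is ρσ² + (1 - ρ)σ²/B, which reduces to σ²/B when ρ = 0. The function name below is illustrative.

```python
def variance_of_average(sigma2, n_models, rho):
    """Variance of the mean of equally correlated predictors.

    sigma2: variance of each individual model
    rho:    pairwise correlation between any two models (assumed equal)
    """
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_models

for rho in (0.0, 0.3, 0.9):
    row = ", ".join(f"B={b}: {variance_of_average(1.0, b, rho):.3f}"
                    for b in (1, 10, 100))
    print(f"rho={rho}: {row}")
```

The first term does not shrink as *B* grows, so correlation sets a floor on what averaging can achieve. This is why decorrelating the base models matters as much as adding more of them.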

**Boosting reduces bias.** By sequentially correcting errors, boosting allows the ensemble to approximate complex functions that no single weak learner could capture. Each boosting round adds a new term to the model, gradually reducing the bias. With regularization (learning rate, tree depth limits), boosting also controls variance.

**Stacking reduces both.** By combining diverse models that have different bias profiles, stacking can achieve lower bias than any single model while the meta-learner averages out variance.

### The role of diversity

Diversity among base learners is the critical ingredient for ensemble success. If all models make the same predictions (and the same errors), combining them yields no improvement. Ensembles benefit when individual models are accurate but make errors on different subsets of the data.

Diversity can be introduced through:

| Strategy | How it creates diversity |
|---|---|
| Different training data | Bagging uses bootstrap samples; boosting re-weights examples |
| Different features | Random forests use feature subsampling |
| Different algorithms | Stacking combines models of different types |
| Different hyperparameters | Training the same algorithm with different settings |
| Different random seeds | Even the same algorithm with different initialization can produce diverse models |
| Different training objectives | Models optimizing different loss functions capture different aspects of the data |

## Ensemble methods in competitions

Ensemble techniques have a storied history in machine learning competitions, particularly on [Kaggle](/wiki/kaggle). The overwhelming majority of winning solutions in Kaggle competitions use some form of ensembling.

The famous Netflix Prize competition (2006-2009) was a watershed moment for ensemble methods. The winning team, BellKor's Pragmatic Chaos, combined hundreds of models using blending (a variant of stacking) to achieve the required 10% improvement over Netflix's existing Cinematch recommendation algorithm. Their verified winning submission on July 26, 2009, reached a test RMSE of 0.8567, a 10.06% improvement over Cinematch, and the team collected the $1 million Grand Prize at a ceremony in New York City on September 21, 2009 [10]. The finish was extraordinarily close: BellKor's Pragmatic Chaos edged out the rival team "The Ensemble," which had matched its accuracy, by submitting roughly 20 minutes earlier near the end of the nearly three-year contest. The competition demonstrated that sophisticated ensembling could produce substantial gains over individual models [10].

In Kaggle competitions, a common pattern has emerged:

1. Start with strong individual models (typically XGBoost, LightGBM, CatBoost, or [deep learning](/wiki/deep_learning) models depending on the problem type).
2. Tune each model independently using cross-validation.
3. Combine models through weighted averaging, stacking, or blending.
4. The ensemble typically improves over the best single model by a small margin, which can nonetheless make the difference between the top positions on the leaderboard.

Tianqi Chen and Carlos Guestrin reported in their 2016 XGBoost paper [6] that among the 29 challenge-winning solutions published on Kaggle's blog during 2015, 17 used XGBoost. Of these, eight used XGBoost alone, while most of the others combined XGBoost with neural networks in ensembles.

## Model merging in large language models

A modern evolution of ensemble ideas appears in the practice of model merging for [large language models](/wiki/large_language_model) (LLMs). Rather than running multiple LLMs at inference time (which would be prohibitively expensive), model merging combines the weights of multiple fine-tuned models into a single set of weights, producing a model that inherits capabilities from each source model.

Weight averaging can also improve accuracy outright. In the influential "Model soups" paper (Wortsman et al., ICML 2022), averaging the weights of many fine-tuned models produced a ViT-G model that reached 90.94% top-1 accuracy on ImageNet, a new state of the art at the time, with no additional inference or memory cost compared to a single model [12].

Popular model merging techniques include:

| Technique | Description |
|---|---|
| Model soups | Average the weights of multiple models fine-tuned from the same base model with different hyperparameters |
| SLERP (Spherical Linear Interpolation) | Interpolates between two models' weight vectors along the surface of a hypersphere, preserving angular relationships |
| TIES-Merging | Resolves conflicts between multiple task vectors by trimming redundant parameters, resolving sign disagreements, and merging |
| DARE (Drop and Rescale) | Randomly drops a large fraction of delta parameters (often 90%+) and rescales the rest to approximate the original fine-tuned behavior |
| Task arithmetic | Computes task vectors (fine-tuned weights minus base weights), scales and adds them, then adds the result back to the base model |

[Model merging](/wiki/model_merging) can be seen as a form of parameter-space ensembling. While traditional ensembles combine predictions (output space), model merging combines weights (parameter space). This distinction means model merging produces a single model with no additional inference cost, unlike a traditional ensemble that must run all component models.
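Two of the simpler techniques from the table, a uniform model soup and task arithmetic, can be sketched with toy weights. Real checkpoints hold tensors with millions of entries; here each "model" is a dictionary of two scalar parameters, and the model names and values are invented for illustration.

```python
def uniform_soup(models):
    """Average each parameter across models that share the same parameter names."""
    return {name: sum(m[name] for m in models) / len(models) for name in models[0]}

def task_vector(finetuned, base):
    """Fine-tuned weights minus base weights."""
    return {name: finetuned[name] - base[name] for name in base}

def apply_task_vectors(base, vectors, scale):
    """Add scaled task vectors back onto the base model's weights."""
    merged = dict(base)
    for vec in vectors:
        for name, delta in vec.items():
            merged[name] += scale * delta
    return merged

# Hypothetical base model and two fine-tuned variants.
base = {"w": 1.0, "b": 0.0}
coder = {"w": 1.5, "b": 0.25}
writer = {"w": 0.5, "b": 0.75}

soup = uniform_soup([coder, writer])
merged = apply_task_vectors(
    base, [task_vector(coder, base), task_vector(writer, base)], scale=0.5
)
print("uniform soup:   ", soup)
print("task arithmetic:", merged)
```

With a scale of 0.5 and two task vectors, task arithmetic here gives the same result as the uniform soup; other scales or a different number of task vectors would make the two methods diverge.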

Tools like mergekit (by Charles Goddard) have made model merging accessible to the open-source LLM community. On the [Hugging Face](/wiki/hugging_face) model hub, merged models are among the most popular, with users combining specialized fine-tuned models (for coding, creative writing, reasoning, and other tasks) into versatile general-purpose models [11].

This practice draws a direct conceptual line from Wolpert's stacked generalization in 1992 to modern foundation model engineering, showing how the ensemble principle adapts to new computational realities.

## Practical guidance

### When should you use each ensemble method?

| Situation | Recommended approach | Rationale |
|---|---|---|
| Quick baseline with minimal tuning | Random forest | Robust, few hyperparameters, hard to break |
| Maximum accuracy on tabular data | Gradient boosting (XGBoost, LightGBM, CatBoost) | Consistently top-performing on structured data |
| Competition or critical deployment | Stacking of diverse models | Squeezes out the last fraction of accuracy |
| Many categorical features | CatBoost or LightGBM | Native categorical handling avoids error-prone encoding |
| Very large dataset (millions of rows) | LightGBM | Fastest training with GOSS and EFB |
| Need for interpretability | Random forest with feature importance | Built-in importance measures; SHAP values available |
| Combining deep learning with classical ML | Stacking or soft voting | Leverages complementary strengths |

### Common pitfalls

**Overfitting through data leakage.** When building a stacking ensemble, it is essential to use out-of-fold predictions for the meta-learner's training data. Using the base learners' training-set predictions leads to severe data leakage and an overoptimistic estimate of performance.

**Diminishing returns.** Adding more models to an ensemble yields diminishing improvements. Beyond a certain point, the additional complexity, training time, and maintenance cost outweigh the marginal accuracy gains.

**Correlation among base learners.** If all base learners are highly correlated (for example, several XGBoost models with similar hyperparameters), the ensemble will provide little benefit over a single model. Diversity is key.

**Deployment complexity.** Running multiple models at inference time increases latency, memory usage, and operational complexity. In production settings, a single well-tuned model often provides a better trade-off between accuracy and simplicity.

## Current relevance

Ensemble methods remain among the most practically important techniques in machine learning. Gradient boosted trees (XGBoost, LightGBM, CatBoost) are the default choice for tabular data in industry, consistently outperforming or matching deep learning approaches on structured data while being faster to train and easier to deploy.

Recent benchmarks and studies continue to confirm this. TabNet, transformer-based tabular models, and other deep learning approaches for structured data have not convincingly surpassed well-tuned gradient boosting in most comparisons. As of 2025 and into 2026, gradient boosted trees remain the recommended starting point for tabular machine learning at companies ranging from startups to large technology firms.

The ensemble principle has also found new expressions in deep learning beyond model merging: mixture-of-experts ([MoE](/wiki/mixture_of_experts)) architectures, which route inputs to specialized sub-networks, can be viewed as a form of learned ensembling within a single model. Snapshot ensembles, which save neural network checkpoints during training and average their predictions, provide an inexpensive way to ensemble deep models. Test-time augmentation, where a model's predictions on multiple augmented versions of an input are averaged, is another form of ensembling.

The theoretical and practical insights of ensemble learning, from Breiman's bagging to modern LLM merging, continue to shape how practitioners build high-performing machine learning systems.

## References

[1] Dietterich, T.G. (2000). "Ensemble Methods in Machine Learning." *Proceedings of the First International Workshop on Multiple Classifier Systems (MCS 2000)*, Lecture Notes in Computer Science, vol. 1857, 1-15. https://doi.org/10.1007/3-540-45014-9_1

[2] Breiman, L. (1996). "Bagging Predictors." *Machine Learning*, 24(2), 123-140. https://doi.org/10.1007/BF00058655

[3] Breiman, L. (2001). "Random Forests." *Machine Learning*, 45, 5-32. https://doi.org/10.1023/A:1010933404324

[4] Freund, Y. and Schapire, R.E. (1997). "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting." *Journal of Computer and System Sciences*, 55(1), 119-139. https://doi.org/10.1006/jcss.1997.1504

[5] Friedman, J.H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." *Annals of Statistics*, 29(5), 1189-1232. https://doi.org/10.1214/aos/1013203451

[6] Chen, T. and Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 785-794. https://doi.org/10.1145/2939672.2939785

[7] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.Y. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." *Advances in Neural Information Processing Systems*, 30, 3146-3154. https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html

[8] Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A.V., and Gulin, A. (2018). "CatBoost: Unbiased Boosting with Categorical Features." *Advances in Neural Information Processing Systems*, 31, 6638-6648. https://proceedings.neurips.cc/paper/2018/hash/14491b756b3a51daac41c24863285549-Abstract.html

[9] Wolpert, D.H. (1992). "Stacked Generalization." *Neural Networks*, 5(2), 241-259. https://doi.org/10.1016/S0893-6080(05)80023-1

[10] Bell, R.M. and Koren, Y. (2007). "Lessons from the Netflix Prize Challenge." *ACM SIGKDD Explorations Newsletter*, 9(2), 75-79. https://doi.org/10.1145/1345448.1345465

[11] Goddard, C. (2024). "mergekit: Tools for Merging Pre-Trained Large Language Models." GitHub repository. https://github.com/arcee-ai/mergekit

[12] Wortsman, M., Ilharco, G., Gadre, S.Y., et al. (2022). "Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time." *Proceedings of the 39th International Conference on Machine Learning (ICML)*, PMLR 162. https://proceedings.mlr.press/v162/wortsman22a.html

