Ensemble learning is a machine learning paradigm that combines multiple models to produce predictions that are better than any individual model could achieve alone. By aggregating the outputs of diverse learners, ensemble methods reduce overfitting, lower variance, improve accuracy, and increase robustness. Ensemble approaches underpin many of the most successful algorithms in applied machine learning, including random forests, gradient boosted trees, and the stacking strategies that consistently win data science competitions [1].
The core idea behind ensemble learning is straightforward: a group of models that each make different kinds of errors will, when combined, cancel out their individual mistakes and converge on more accurate predictions. This principle is analogous to the "wisdom of crowds" effect in human decision-making, where the aggregate judgment of many independent individuals tends to outperform any single expert.
Ensemble methods vary in how they train the individual models (called base learners or weak learners) and how they combine their predictions. The three main families are bagging, boosting, and stacking, each addressing different aspects of the bias-variance tradeoff.
| Type | Training strategy | Combination method | Primary effect | Key algorithms |
|---|---|---|---|---|
| Bagging | Train models in parallel on bootstrap samples | Averaging (regression) or majority vote (classification) | Reduces variance | Random forest, bagged decision trees |
| Boosting | Train models sequentially; each corrects predecessor's errors | Weighted sum | Reduces bias (and variance) | AdaBoost, gradient boosting, XGBoost, LightGBM, CatBoost |
| Stacking | Train diverse base models, then train a meta-model on their outputs | Learned combination via meta-learner | Reduces both bias and variance | Stacked generalization, blending |
| Voting | Train diverse models independently | Majority vote (hard) or average probabilities (soft) | Reduces variance | Voting classifier/regressor |
Bagging, short for bootstrap aggregating, was introduced by Leo Breiman in 1996 [2]. The method works by generating multiple bootstrap samples (random samples drawn with replacement) from the training data, training a separate base model on each sample, and then aggregating their predictions.
Because each base model sees a slightly different version of the training data, the models develop different patterns of errors. Averaging over many such models smooths out these errors, reducing variance without substantially increasing bias.
Bagging is most effective with high-variance, low-bias base learners, such as deep decision trees. A single deep decision tree tends to overfit, but an ensemble of many deep trees trained on different bootstrap samples generalizes much better.
A useful byproduct of bagging is the out-of-bag (OOB) error estimate. Since each bootstrap sample omits roughly 36.8% of the training points (the fraction approaches 1/e as the dataset grows), each model can be evaluated on the points it did not see during training. Averaging these predictions across all models that did not train on a given point provides an approximately unbiased estimate of generalization error, similar to cross-validation but obtained for free during training.
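A minimal sketch of bagging with an OOB estimate, using scikit-learn's `BaggingClassifier` (whose default base learner is a decision tree) on a synthetic dataset; the sample counts and number of estimators are illustrative:

```python
# Bagging with an out-of-bag (OOB) error estimate.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

bag = BaggingClassifier(
    n_estimators=100,  # number of bootstrap models (trees, by default)
    oob_score=True,    # score each point only with models that never saw it
    random_state=0,
)
bag.fit(X, y)

# Approximates generalization accuracy "for free" during training.
print(f"OOB accuracy: {bag.oob_score_:.3f}")
```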
The random forest algorithm, introduced by Leo Breiman in 2001 [3], extends bagging with an additional layer of randomization. In addition to training each tree on a bootstrap sample, random forests also restrict each split in each tree to consider only a random subset of features.
The additional feature randomization serves a critical purpose: it decorrelates the individual trees. Without it, if one feature is a very strong predictor, most trees in a bagged ensemble would use that feature at the root and produce correlated predictions, limiting the variance reduction from averaging. By forcing each tree to consider different features, random forests ensure diversity among the trees.
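The two sources of randomization map directly onto two hyperparameters in scikit-learn's `RandomForestClassifier`: bootstrap sampling is on by default, and `max_features` controls the random feature subset considered at each split. A small sketch on synthetic data (sizes are illustrative):

```python
# Random forest: bagging plus per-split feature subsampling.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)

rf = RandomForestClassifier(
    n_estimators=200,     # B: number of trees
    max_features="sqrt",  # m: random feature subset tried at each split
    random_state=0,
)
rf.fit(X, y)

# Built-in importances: how much each feature reduces impurity across trees.
top5 = rf.feature_importances_.argsort()[::-1][:5]
print("most important features:", top5)
```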
| Property | Details |
|---|---|
| Accuracy | Competitive with the best algorithms for tabular data; often among the top performers on benchmark datasets |
| Robustness | Resistant to overfitting as the number of trees increases; adding more trees does not cause overfitting (though gains diminish) |
| Feature importance | Provides built-in measures of feature importance based on how much each feature reduces impurity across all trees |
| Handling of mixed data types | Can handle both numerical and categorical features |
| Missing values | Some implementations handle missing values natively |
| Hyperparameters | Main parameters are the number of trees (B) and the number of features per split (m); both are relatively easy to tune |
| Parallelization | Trees are independent, so training can be parallelized across multiple CPU cores |
Breiman's 2001 paper [3] became one of the most cited papers in all of machine learning. Random forests remain a default choice for tabular data problems in industry, especially when interpretability (via feature importance) and minimal tuning are priorities.
Boosting is a family of algorithms that trains models sequentially, with each new model focusing on the examples that previous models got wrong. Unlike bagging, which reduces variance by averaging independent models, boosting reduces bias by iteratively correcting errors.
Adaptive Boosting (AdaBoost), introduced by Yoav Freund and Robert Schapire in 1997, was the first practical boosting algorithm [4]. It won the Gödel Prize in 2003 for its theoretical contributions.
AdaBoost works as follows:

1. Initialize uniform weights over the training examples.
2. Train a weak learner (typically a decision stump) on the weighted data.
3. Compute the learner's weighted error and assign it a vote proportional to its accuracy.
4. Increase the weights of misclassified examples and decrease the weights of correctly classified ones.
5. Repeat steps 2-4 for a fixed number of rounds; the final prediction is the weighted vote of all weak learners.
The key insight is that by re-weighting examples, each successive learner is forced to concentrate on the "hard" cases that previous learners struggled with. The theoretical guarantee is that if each weak learner performs slightly better than random guessing, the combined ensemble can achieve arbitrarily low training error.
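The re-weighting loop can be sketched in a few lines of numpy and scikit-learn. This follows the classic AdaBoost.M1 update with labels in {-1, +1}; the number of rounds and the dataset are illustrative:

```python
# Illustrative AdaBoost loop with decision stumps.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
y = 2 * y - 1                    # map {0, 1} -> {-1, +1}
n = len(y)
w = np.full(n, 1.0 / n)          # uniform initial example weights
stumps, alphas = [], []

for _ in range(50):
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum()                          # weighted error
    alpha = 0.5 * np.log((1 - err) / (err + 1e-12))   # learner's vote
    w *= np.exp(-alpha * y * pred)                    # up-weight mistakes
    w /= w.sum()                                      # renormalize
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the weighted sum of weak learners.
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
acc = np.mean(np.sign(F) == y)
print(f"training accuracy: {acc:.3f}")
```

Note how a misclassified example (where `y * pred = -1`) has its weight multiplied by `exp(+alpha)`, forcing the next stump to concentrate on it.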
Gradient boosting, formalized by Jerome Friedman in 2001 [5], generalizes boosting to arbitrary differentiable loss functions. Instead of re-weighting examples, gradient boosting fits each new model to the negative gradient (pseudo-residuals) of the loss function with respect to the current ensemble's predictions.
The gradient boosting procedure is:

1. Initialize the model with a constant prediction (e.g., the mean of the targets for squared error).
2. Compute the pseudo-residuals: the negative gradient of the loss with respect to the current ensemble's predictions.
3. Fit a new base learner (typically a shallow tree) to the pseudo-residuals.
4. Add the new learner to the ensemble, scaled by the learning rate.
5. Repeat steps 2-4 for a fixed number of iterations.
The learning rate (or shrinkage factor, typically 0.01 to 0.3) controls how much each new tree contributes. Smaller learning rates require more trees but generally produce better results due to the regularization effect.
Gradient boosting is extremely flexible because it works with any differentiable loss function: squared error for regression, log loss for classification, quantile loss for quantile regression, and many domain-specific objectives.
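For squared error, the negative gradient is simply the residual `y - F`, which makes the procedure easy to sketch from scratch. A toy regression example with illustrative hyperparameters (learning rate, tree depth, iteration count):

```python
# Toy gradient boosting for squared error: each tree fits the residuals
# (the negative gradient of 1/2 * (y - F)^2), scaled by a learning rate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=400)

lr = 0.1                        # shrinkage factor
F = np.full_like(y, y.mean())   # step 1: constant initial prediction
trees = []
for _ in range(200):
    residuals = y - F           # step 2: pseudo-residuals for squared error
    t = DecisionTreeRegressor(max_depth=2)
    t.fit(X, residuals)         # step 3: fit a shallow tree to the residuals
    F += lr * t.predict(X)      # step 4: small step toward the residuals
    trees.append(t)

mse = np.mean((y - F) ** 2)
print(f"training MSE: {mse:.4f}")
```

Swapping in a different loss only changes the pseudo-residual computation in step 2, which is what makes the framework so flexible.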
XGBoost (Extreme Gradient Boosting), introduced by Tianqi Chen and Carlos Guestrin in 2016 [6], is an optimized implementation of gradient boosting that added several key innovations:
| Innovation | Description |
|---|---|
| Regularization | Adds L1 and L2 penalties on leaf weights, reducing overfitting |
| Approximate split finding | Uses weighted quantile sketch for efficient handling of large datasets |
| Sparsity awareness | Handles missing values natively by learning optimal default split directions |
| Column subsampling | Borrows from random forests to add feature randomization, further reducing overfitting |
| Parallel tree construction | Parallelizes the computation within each tree (not across trees) for faster training |
| Cache-aware access | Optimizes memory access patterns for hardware efficiency |
XGBoost became the dominant algorithm in Kaggle competitions and many industrial applications. Chen and Guestrin's paper [6] reported that XGBoost was used by the majority of winning teams in Kaggle competitions at the time of publication.
LightGBM, developed by Microsoft Research and released in 2017 [7], introduced two key techniques that make gradient boosting faster on large datasets:
Gradient-based One-Side Sampling (GOSS). Instead of using all data points to compute gradients, GOSS keeps all instances with large gradients (which contribute more to information gain) and randomly samples from instances with small gradients. This significantly reduces computation without sacrificing accuracy.
Exclusive Feature Bundling (EFB). In sparse datasets (common in real-world applications), many features are mutually exclusive (they rarely take nonzero values simultaneously). EFB bundles these features together, reducing the effective number of features.
LightGBM also grows trees leaf-wise rather than level-wise, which can produce deeper, more accurate trees with fewer splits. It is particularly fast on large datasets and is widely used in production systems at scale.
CatBoost, developed by Yandex and released in 2017 [8], addresses a specific challenge in gradient boosting: handling categorical features. While XGBoost and LightGBM require categorical features to be encoded (one-hot encoding, label encoding, or target encoding) before training, CatBoost processes categorical features directly using an ordered target encoding scheme that avoids target leakage.
CatBoost also uses ordered boosting, a permutation-based approach that trains each tree on a different ordering of the training data. This reduces the overfitting that can occur when the same data is used for both computing the gradient and fitting the tree.
| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Tree growth | Level-wise | Leaf-wise | Symmetric (balanced) |
| Categorical feature handling | Requires encoding | Basic support | Native, ordered target encoding |
| Missing value handling | Native | Native | Native |
| Training speed (large data) | Moderate | Fast | Moderate |
| GPU support | Yes | Yes | Yes |
| Regularization | L1, L2 on weights | L1, L2 on weights | L2 on weights, ordered boosting |
| Default performance | Strong | Strong | Strong, especially with categorical data |
| Year introduced | 2016 | 2017 | 2017 |
All three libraries produce comparable results on most benchmarks. The choice among them often depends on the specific characteristics of the dataset (number of categorical features, dataset size, sparsity) and practical considerations (training speed, ease of use).
Stacking (stacked generalization) was introduced by David Wolpert in 1992 [9]. Unlike bagging and boosting, which combine models of the same type, stacking combines diverse models of potentially different types through a learned meta-model.
Stacking works because different algorithms capture different patterns in the data, and the meta-learner learns to weight their contributions optimally. A random forest might excel at capturing nonlinear interactions, while a linear model might capture linear trends more accurately. The meta-learner discovers which base model to trust in which regions of the feature space.
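A stacking sketch using scikit-learn's `StackingClassifier`, which trains the meta-learner on out-of-fold base-model predictions internally (via cross-validation) to avoid leakage. The base models and dataset here are illustrative:

```python
# Stacking: diverse base models plus a logistic-regression meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # learns to weight the base models
    cv=5,  # meta-model trains on out-of-fold predictions
)
stack.fit(X_tr, y_tr)
print(f"test accuracy: {stack.score(X_te, y_te):.3f}")
```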
Multi-level stacking (with multiple meta-learner layers) is sometimes used in competitions, but it adds complexity and is rarely necessary in practice.
Voting ensembles are the simplest form of model combination. Multiple models are trained independently, and their predictions are combined by majority vote (hard voting) or by averaging their predicted probabilities (soft voting).
Hard voting assigns the label predicted by the majority of models. If three out of five models predict class A, the ensemble predicts class A.
Soft voting averages the predicted probabilities across all models and selects the class with the highest average probability. Soft voting generally performs better because it accounts for the confidence of each model's prediction, not just its top choice.
Voting is effective when the base models are diverse and individually strong. It provides no benefit when all models make the same predictions.
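Both voting modes are available in scikit-learn's `VotingClassifier`; the three base models below are just one illustrative diverse set:

```python
# Hard vs. soft voting over three diverse models.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, random_state=0)
models = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("nb", GaussianNB()),
]

hard = VotingClassifier(models, voting="hard").fit(X, y)  # majority label
soft = VotingClassifier(models, voting="soft").fit(X, y)  # average probabilities
print(hard.score(X, y), soft.score(X, y))
```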
The effectiveness of ensemble methods can be understood through the bias-variance decomposition of prediction error. For a single model, the expected prediction error at a point can be decomposed as:
Expected error = Bias^2 + Variance + Irreducible noise
Bagging reduces variance. If the base models have roughly the same bias, averaging their predictions does not change the expected bias but reduces the variance. For B independent models, each with variance sigma^2, the variance of their average is sigma^2 / B. In practice, the models are not fully independent (they are trained on overlapping bootstrap samples), so the reduction is less than 1/B, but it is still substantial.
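The sigma^2 / B reduction for independent models is easy to verify numerically; the sketch below simulates idealized fully independent "predictions" (real bagged models are correlated, so the actual reduction is smaller):

```python
# Numerical check of the 1/B variance reduction for independent models.
import numpy as np

rng = np.random.default_rng(0)
sigma2, B, trials = 4.0, 25, 100_000

# One "model": a noisy prediction with variance sigma^2.
single = rng.normal(0.0, np.sqrt(sigma2), size=trials)
# An ensemble: the average of B independent such predictions.
avg = rng.normal(0.0, np.sqrt(sigma2), size=(trials, B)).mean(axis=1)

print(single.var())  # close to sigma^2 = 4.0
print(avg.var())     # close to sigma^2 / B = 0.16
```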
Boosting reduces bias. By sequentially correcting errors, boosting allows the ensemble to approximate complex functions that no single weak learner could capture. Each boosting round adds a new term to the model, gradually reducing the bias. With regularization (learning rate, tree depth limits), boosting also controls variance.
Stacking reduces both. By combining diverse models that have different bias profiles, stacking can achieve lower bias than any single model while the meta-learner averages out variance.
Diversity among base learners is the critical ingredient for ensemble success. If all models make the same predictions (and the same errors), combining them yields no improvement. Ensembles benefit when individual models are accurate but make errors on different subsets of the data.
Diversity can be introduced through:
| Strategy | How it creates diversity |
|---|---|
| Different training data | Bagging uses bootstrap samples; boosting re-weights examples |
| Different features | Random forests use feature subsampling |
| Different algorithms | Stacking combines models of different types |
| Different hyperparameters | Training the same algorithm with different settings |
| Different random seeds | Even the same algorithm with different initialization can produce diverse models |
| Different training objectives | Models optimizing different loss functions capture different aspects of the data |
Ensemble techniques have a storied history in machine learning competitions, particularly on Kaggle. The overwhelming majority of winning solutions in Kaggle competitions use some form of ensembling.
The famous Netflix Prize competition (2006-2009) was a watershed moment for ensemble methods. The winning team, BellKor's Pragmatic Chaos, combined hundreds of models using blending (a variant of stacking) to achieve the required 10% improvement over Netflix's existing recommendation algorithm. The prize was $1 million, and the competition demonstrated that sophisticated ensembling could produce substantial gains over individual models [10].
In Kaggle competitions, a common pattern has emerged: strong gradient-boosted models form the backbone of the solution, neural networks and other diverse learners are added for variety, and the final submission blends or stacks them all.
Tianqi Chen and Carlos Guestrin reported in their 2016 XGBoost paper [6] that among the 29 challenge-winning solutions published at Kaggle's blog during 2015, 17 used XGBoost. Of these, eight used XGBoost alone, and the remaining nine combined XGBoost with neural networks in an ensemble.
A modern evolution of ensemble ideas appears in the practice of model merging for large language models (LLMs). Rather than running multiple LLMs at inference time (which would be prohibitively expensive), model merging combines the weights of multiple fine-tuned models into a single set of weights, producing a model that inherits capabilities from each source model.
Popular model merging techniques include:
| Technique | Description |
|---|---|
| Model soups | Average the weights of multiple models fine-tuned from the same base model with different hyperparameters |
| SLERP (Spherical Linear Interpolation) | Interpolates between two models' weight vectors along the surface of a hypersphere, preserving angular relationships |
| TIES-Merging | Resolves conflicts between multiple task vectors by trimming redundant parameters, resolving sign disagreements, and merging |
| DARE (Drop and Rescale) | Randomly drops a large fraction of delta parameters (often 90%+) and rescales the rest to approximate the original fine-tuned behavior |
| Task arithmetic | Computes task vectors (fine-tuned weights minus base weights), scales and adds them, then adds the result back to the base model |
Model merging can be seen as a form of parameter-space ensembling. While traditional ensembles combine predictions (output space), model merging combines weights (parameter space). This distinction means model merging produces a single model with no additional inference cost, unlike a traditional ensemble that must run all component models.
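The simplest parameter-space combination, a uniform model soup, is just an element-wise mean over checkpoints of the same architecture. A sketch using plain numpy dicts as stand-ins for hypothetical fine-tuned state dicts (the shapes and names here are illustrative):

```python
# Parameter-space "model soup" sketch: uniformly average the weights of
# several fine-tuned checkpoints of the same architecture.
import numpy as np

def average_weights(state_dicts):
    """Uniform soup: element-wise mean of each parameter tensor."""
    return {
        k: np.mean([sd[k] for sd in state_dicts], axis=0)
        for k in state_dicts[0]
    }

# Three hypothetical checkpoints fine-tuned from the same base model.
rng = np.random.default_rng(0)
checkpoints = [
    {"w": rng.normal(size=(4, 4)), "b": rng.normal(size=4)}
    for _ in range(3)
]

soup = average_weights(checkpoints)
print(soup["w"].shape, soup["b"].shape)  # one model, no extra inference cost
```

Unlike output-space ensembling, the result is a single set of weights, so inference cost is unchanged.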
Tools like mergekit (by Charles Goddard) have made model merging accessible to the open-source LLM community. On the Hugging Face model hub, merged models are among the most popular, with users combining specialized fine-tuned models (for coding, creative writing, reasoning, and other tasks) into versatile general-purpose models [11].
This practice draws a direct conceptual line from Wolpert's stacked generalization in 1992 to modern foundation model engineering, showing how the ensemble principle adapts to new computational realities.
| Situation | Recommended approach | Rationale |
|---|---|---|
| Quick baseline with minimal tuning | Random forest | Robust, few hyperparameters, hard to break |
| Maximum accuracy on tabular data | Gradient boosting (XGBoost, LightGBM, CatBoost) | Consistently top-performing on structured data |
| Competition or critical deployment | Stacking of diverse models | Squeezes out the last fraction of accuracy |
| Many categorical features | CatBoost or LightGBM | Native categorical handling avoids error-prone encoding |
| Very large dataset (millions of rows) | LightGBM | Fastest training with GOSS and EFB |
| Need for interpretability | Random forest with feature importance | Built-in importance measures; SHAP values available |
| Combining deep learning with classical ML | Stacking or soft voting | Leverages complementary strengths |
Overfitting through data leakage. When building a stacking ensemble, it is essential to use out-of-fold predictions for the meta-learner's training data. Using the base learners' training-set predictions leads to severe data leakage and an overoptimistic estimate of performance.
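The standard remedy is `cross_val_predict`, which guarantees that each row's base-model prediction comes from a fold that excluded that row. A minimal sketch with a single illustrative base model:

```python
# Leakage-free meta-features for stacking via out-of-fold predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=600, random_state=0)
base = GradientBoostingClassifier(random_state=0)

# Each row is predicted by a model trained on the other 4 folds only.
oof = cross_val_predict(base, X, y, cv=5, method="predict_proba")[:, 1]

meta_X = oof.reshape(-1, 1)  # out-of-fold probabilities as meta-features
meta = LogisticRegression().fit(meta_X, y)
print(f"meta-learner accuracy: {meta.score(meta_X, y):.3f}")
```

Using `base.fit(X, y).predict_proba(X)` instead would feed the meta-learner in-sample predictions and inflate its apparent accuracy.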
Diminishing returns. Adding more models to an ensemble yields diminishing improvements. Beyond a certain point, the additional complexity, training time, and maintenance cost outweigh the marginal accuracy gains.
Correlation among base learners. If all base learners are highly correlated (for example, several XGBoost models with similar hyperparameters), the ensemble will provide little benefit over a single model. Diversity is key.
Deployment complexity. Running multiple models at inference time increases latency, memory usage, and operational complexity. In production settings, a single well-tuned model often provides a better trade-off between accuracy and simplicity.
Ensemble methods remain among the most practically important techniques in machine learning. Gradient boosted trees (XGBoost, LightGBM, CatBoost) are the default choice for tabular data in industry, consistently outperforming or matching deep learning approaches on structured data while being faster to train and easier to deploy.
Recent benchmarks and studies continue to confirm this. TabNet, transformer-based tabular models, and other deep learning approaches for structured data have not convincingly surpassed well-tuned gradient boosting in most comparisons. As of 2025, gradient boosted trees remain the recommended starting point for tabular machine learning at companies ranging from startups to large technology firms.
The ensemble principle has also found new expressions in deep learning beyond model merging: mixture-of-experts (MoE) architectures, which route inputs to specialized sub-networks, can be viewed as a form of learned ensembling within a single model. Snapshot ensembles, which save neural network checkpoints during training and average their predictions, provide an inexpensive way to ensemble deep models. Test-time augmentation, where a model's predictions on multiple augmented versions of an input are averaged, is another form of ensembling.
The theoretical and practical insights of ensemble learning, from Breiman's bagging to modern LLM merging, continue to shape how practitioners build high-performing machine learning systems.