# Gradient Boosting

> Source: https://aiwiki.ai/wiki/gradient_boosting
> Updated: 2026-07-12
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

**Gradient boosting** is an [ensemble](/wiki/ensemble) [machine learning](/wiki/machine_learning) technique that builds a predictive model by combining many weak learners, typically [decision trees](/wiki/decision_tree), in a sequential and additive fashion, where each new learner is trained to correct the errors of the combined model so far using the gradients of a differentiable [loss function](/wiki/loss_function). Originally formalized by Jerome H. Friedman in his 2001 paper "Greedy Function Approximation: A Gradient Boosting Machine" (The Annals of Statistics, vol. 29, no. 5, pp. 1189-1232), it frames learning as functional [gradient descent](/wiki/gradient_descent) in function space, making it one of the most accurate supervised learning methods for structured and tabular data.[1] It is used for both [classification](/wiki/classification) and [regression](/wiki/regression) tasks, and it belongs to the family of [ensemble learning](/wiki/ensemble_learning) methods, which combine the predictions of multiple base models to produce a more accurate and robust prediction.

Gradient boosting has become a dominant force in applied machine learning, particularly for structured and tabular data. Friedman opened the founding paper by stating that "function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space."[1] Its practical dominance is well documented: in the XGBoost paper, Tianqi Chen and Carlos Guestrin reported that among the 29 challenge-winning solutions published on Kaggle's blog during 2015, 17 used XGBoost, of which 8 used it alone, while the second most popular method (deep neural nets) appeared in 11 solutions.[3] Libraries such as XGBoost, LightGBM, and CatBoost have made gradient boosting accessible, efficient, and highly competitive in both industry applications and machine learning competitions.

## ELI5 (Explain like I'm five)

Imagine you are trying to guess how many jelly beans are in a jar. Your first guess is probably not perfect. So a friend looks at how far off you were and makes a small correction. Then another friend looks at the remaining error and makes another small correction. You keep adding friends who each focus only on fixing whatever mistakes the group has made so far. After many rounds of corrections, the combined guess from the whole group is very close to the real number.

Another way to picture it: imagine you have a team of people who are not very good at solving a specific problem, but they are all slightly better than guessing randomly. You want to create a super team that can solve the problem much better than any individual. You start by having one person attempt to solve the problem, and then you see where they made mistakes. Next, you add another person to the team who focuses on fixing those mistakes. You keep adding more people to the team, with each new person focusing on fixing the mistakes made by the team so far.

That is essentially what gradient boosting does. It starts with a simple initial prediction, then adds one small model at a time, with each new model focused specifically on reducing the errors left behind by all the previous models combined.

## When was gradient boosting invented?

The roots of gradient boosting trace back to the development of [boosting](/wiki/boosting) methods in computational learning theory during the 1990s. The framework itself was formalized by Jerome H. Friedman in a 1999 technical report, published in 2001, and the modern high-performance implementations (XGBoost, LightGBM, CatBoost) appeared between 2014 and 2018.[1][3][4][5]

### Early boosting and AdaBoost

In 1990, Robert Schapire proved that weak learners could be combined to form a strong learner,[17] and in 1995 Yoav Freund and Robert Schapire introduced [AdaBoost](/wiki/adaboost) (Adaptive Boosting), which iteratively trained weak classifiers on re-weighted versions of the training data, placing greater weight on samples that were previously misclassified.[6] AdaBoost proved remarkably effective in practice, and their work earned the 2003 Godel Prize for the paper "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting" (Journal of Computer and System Sciences, 1997), establishing boosting as a major paradigm in machine learning.[18] The ACM SIGACT awarding committee praised AdaBoost for "a combination of features, including its elegance, the simplicity of its implementation, its wide applicability, and its striking success in reducing errors in benchmark applications."[18]

### Breiman's statistical interpretation

In 1997, Leo Breiman recast AdaBoost and related methods as "ARCing" (Adaptive Resampling and Combining) algorithms.[7] Breiman was among the first to observe that boosting could be interpreted as a gradient descent procedure in function space.[8] This insight connected boosting to the broader field of numerical optimization and laid the groundwork for the gradient boosting framework.

### Friedman's Gradient Boosting Machine

Jerome H. Friedman formalized the gradient boosting framework in his landmark 1999 technical report, later published as "Greedy Function Approximation: A Gradient Boosting Machine" in The Annals of Statistics in 2001.[1] Friedman's key contribution was framing boosting as functional gradient descent: instead of optimizing parameters in a fixed model, gradient boosting performs steepest descent in the space of functions by iteratively adding functions (weak learners) that point in the direction of the negative gradient.[1] This perspective allowed the framework to generalize beyond classification to arbitrary differentiable loss functions, covering regression, ranking, and robust estimation.

In parallel, Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean independently developed a similar functional gradient descent view of boosting, presented at the 1999 [NeurIPS](/wiki/neurips) conference.[14] Friedman's contribution was distinctive in its focus on decision tree base learners and practical algorithmic details, including region-specific updates within each tree.

### Stochastic gradient boosting

In 2002, Friedman introduced stochastic gradient boosting in the journal Computational Statistics and Data Analysis.[2] This paper showed that randomly subsampling the training data at each boosting iteration (without replacement) could substantially improve both accuracy and training speed.[2] By drawing only a fraction of the training set to fit each new tree, the algorithm introduced a form of [regularization](/wiki/regularization) analogous to the benefits of [bagging](/wiki/bagging), reducing variance and improving generalization. These two papers laid the theoretical and practical foundation for all modern gradient boosting implementations.

### Modern era

From roughly 2014 onward, a wave of highly optimized gradient boosting libraries transformed the practical landscape. XGBoost (2014, published 2016),[3] LightGBM (2017),[4] and CatBoost (2017, published 2018)[5] each introduced architectural and algorithmic innovations that made gradient boosting faster, more scalable, and more accurate. These libraries propelled gradient boosting to dominance in tabular data competitions and widespread industrial adoption.

## How does gradient boosting work?

Gradient boosting constructs an additive model in a forward stage-wise manner.[16] At each stage, a new weak learner (base learner) is fit to the negative gradient of the loss function evaluated at the current model's predictions.[1] These negative gradients are often called "pseudo-residuals" because, for squared error loss, they correspond exactly to the residual errors of the current model.

The core idea can be understood through an analogy with numerical optimization. In standard gradient descent, you iteratively update a parameter vector by moving in the direction of the negative gradient. In gradient boosting, the "parameter" being optimized is the prediction function itself, and the update at each step is a new function (a weak learner) that approximates the negative gradient direction.

### Algorithm outline

The gradient boosting algorithm proceeds as follows:[1]

1. **Initialize** the model with a constant value that minimizes the loss function across all training examples. For squared error loss, this is simply the mean of the target values.

2. **For each iteration** $$m = 1, 2, \ldots, M$$:
   - Compute pseudo-residuals: calculate the negative gradient of the loss function with respect to the current model's prediction for each training example.
   - Fit a weak learner (typically a shallow decision tree) to the pseudo-residuals.
   - Determine the optimal step size (multiplier) for this learner by performing a line search that minimizes the loss.
   - Update the model by adding the scaled weak learner to the current ensemble.

3. **Output** the final model as the sum of the initial constant and all M weak learner contributions.

### Mathematical formulation

The goal is to find a function $$F(x)$$ that minimizes the expected loss:

$$
\hat{F} = \arg\min_F \mathbb{E}_{x,y}[L(y, F(x))]
$$

The model takes the form of an additive expansion:

$$
F_M(x) = F_0(x) + \sum_{m=1}^{M} \eta_m \, h_m(x)
$$

where $$F_0(x)$$ is the initial constant prediction, $$h_m(x)$$ is the weak learner fit at iteration m, and $$\eta_m$$ is the step size (learning rate multiplier) for that learner.

At each iteration m, the pseudo-residuals are computed as:

$$
r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}}
$$

A weak learner $$h_m$$ is then fit to the set $$\{(x_i, r_{im})\}$$. The step size is found by line search:

$$
\eta_m = \arg\min_\eta \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \eta \, h_m(x_i))
$$

Finally, the model is updated (including a shrinkage factor $$\nu$$ for the learning rate):

$$
F_m(x) = F_{m-1}(x) + \nu \, \eta_m \, h_m(x)
$$

where $$\nu$$ is the shrinkage (learning rate) parameter, typically between 0.01 and 0.3.

When decision trees are used as base learners, Friedman proposed a refinement called "TreeBoost" where separate optimal values are computed for each leaf region of the tree, rather than a single global step size.[1]

### Pseudo-residuals and the gradient connection

The term "pseudo-residuals" comes from the fact that, for squared error loss, the negative gradient at each point is simply the residual $$(y_i - F(x_i))$$. For other loss functions, the negative gradient serves as a generalized notion of the residual. For example, with absolute error loss, the pseudo-residuals are the signs of the residuals. With logistic loss (used in classification), the pseudo-residuals involve the difference between observed class labels and predicted probabilities.

By fitting each successive tree to these pseudo-residuals, the algorithm performs a form of functional gradient descent: each new tree approximates the steepest descent direction in function space, and the ensemble gradually moves toward a function that minimizes the chosen loss.

## Loss functions

One of the strengths of gradient boosting is its flexibility with respect to the loss function. Any differentiable loss function can be used, and the choice of loss function determines what the pseudo-residuals look like and, consequently, what each weak learner tries to approximate. Friedman's original paper derived specific algorithms for least-squares, least-absolute-deviation, and Huber-M loss for regression, and multiclass logistic likelihood for classification.[1]

| Loss function | Task | Formula | Pseudo-residual / Notes |
|---|---|---|---|
| Mean squared error (MSE) | [Regression](/wiki/regression) | $$L = \frac{1}{n} \sum (y_i - F(x_i))^2$$ | $$y_i - F(x_i)$$; standard choice for regression, sensitive to outliers |
| Mean absolute error (MAE) | [Regression](/wiki/regression) | $$L = \frac{1}{n} \sum \lvert y_i - F(x_i) \rvert$$ | $$\operatorname{sign}(y_i - F(x_i))$$; more robust to outliers than MSE |
| Huber loss | [Regression](/wiki/regression) | Piecewise combination of MSE and MAE with threshold $$\delta$$ | Residual if $$\lvert r \rvert \le \delta$$, else $$\delta \, \operatorname{sign}(r)$$; balances sensitivity and robustness |
| Log loss (binary cross-entropy) | [Classification](/wiki/classification) | $$L = -\sum [y_i \log(p_i) + (1-y_i) \log(1-p_i)]$$ | $$y_i - p_i$$ (where $$p_i = \operatorname{sigmoid}(F(x_i))$$); standard for binary classification |
| Multinomial deviance / Multi-class log loss | [Classification](/wiki/classification) | $$L = -\sum y_k \log(p_k)$$; generalized log loss for K classes | Per-class residuals, extension via softmax |
| Quantile loss | Quantile regression | $$L = \alpha \, \max(y - F(x), 0) + (1 - \alpha) \, \max(F(x) - y, 0)$$ | Asymmetric absolute loss at quantile $$\tau$$; used for prediction intervals |

For regression, MSE produces pseudo-residuals equal to the ordinary residuals, making gradient boosting with MSE equivalent to iteratively fitting trees to the residuals of the current model. MAE is more robust to outliers because its pseudo-residuals are bounded. Huber loss provides a compromise, behaving like MSE for small errors and MAE for large errors.

For classification, log loss is the standard choice. The model outputs log-odds, which are converted to probabilities via the sigmoid (binary) or softmax (multi-class) function.

## Gradient tree boosting

In practice, the weak learners in gradient boosting are almost always regression trees, giving rise to the term "gradient tree boosting" or "gradient boosted decision trees" (GBDT). Trees are preferred for several reasons: they can capture nonlinear relationships and interactions between features, they handle mixed data types naturally, and they are invariant to monotonic transformations of features, so feature scaling is unnecessary.

The most commonly used weak learners in gradient boosting are shallow decision trees, sometimes called stumps, chosen for their simplicity and computational efficiency. In practice, trees with depths of 3 to 8 are common. Deeper trees capture more complex interactions between features but increase the risk of [overfitting](/wiki/overfitting).

The size of the individual trees is a critical hyperparameter. Friedman recommended using trees with 4 to 8 terminal nodes (leaves).[1] The number of terminal nodes J controls the maximum order of interaction that the model can capture. A tree with J terminal nodes can model interactions among at most $$J - 1$$ variables.[16]

When trees serve as base learners, the tree partitions the input space into disjoint regions $$R_{1m}, R_{2m}, \ldots, R_{Jm}$$, and the tree output for an input x is:

$$
h_m(x) = \sum_{j=1}^{J} b_{jm} \, \mathbf{1}(x \in R_{jm})
$$

Friedman's TreeBoost modification computes a separate optimal value (leaf weight) for each region rather than a single step size for the entire tree.[1] This region-specific optimization allows more flexible updates and improves convergence.

## Stochastic gradient boosting

In 2002, Friedman proposed stochastic gradient boosting, which introduces randomness into the training process by fitting each base learner on a random subsample of the training data (drawn without replacement) rather than the full dataset.[2] This technique is analogous to the use of random subsets in [random forests](/wiki/random_forest).

### Row subsampling

At each iteration, a fraction f of the training examples is sampled at random, and only this subsample is used to compute pseudo-residuals and fit the next tree. Friedman found empirically that values of f between 0.5 and 0.8 generally perform well, with smaller fractions reducing variance at the cost of increased bias.[2] Even moderate subsampling ($$f = 0.5$$) often outperforms deterministic gradient boosting while also reducing computation time.

### Column subsampling

In addition to row subsampling, modern implementations also support column (feature) subsampling, where only a random subset of features is considered when finding the best split at each node. This idea, borrowed from random forests, further reduces correlation among trees and helps prevent [overfitting](/wiki/overfitting).

The combination of row and column subsampling reduces variance, improves generalization, and speeds up training, making stochastic gradient boosting the default in most practical applications.

## Key hyperparameters

Gradient boosting has several hyperparameters that significantly affect model performance. Careful tuning of these parameters is essential for achieving good results.[15] Without proper regularization, gradient boosting can memorize the training data, especially when using many trees or deep trees.

| Hyperparameter | Parameter Name(s) | Typical range | Effect |
|---|---|---|---|
| Number of estimators | `n_estimators`, `num_boost_round`, M | 100 to 10,000 | Total number of boosting rounds. More trees can improve performance but increase training time and risk overfitting. |
| [Learning rate](/wiki/learning_rate) (shrinkage, $$\eta$$) | `learning_rate`, `eta` | 0.001 to 0.3 | Scales the contribution of each tree. Smaller values require more trees but often produce better generalization. |
| Maximum tree depth | `max_depth` | 3 to 10 | Controls the complexity of individual trees. Deeper trees can capture more complex interactions but are more prone to overfitting. |
| Subsample (row fraction) | `subsample`, `bagging_fraction` | 0.5 to 1.0 | Fraction of training rows used per iteration. Values less than 1.0 enable stochastic gradient boosting. |
| Column subsample (colsample) | `colsample_bytree`, `feature_fraction` | 0.5 to 1.0 | Fraction of features considered at each split or each tree. Reduces overfitting and speeds up training. |
| Minimum samples per leaf | `min_child_weight`, `min_data_in_leaf` | 1 to 100+ | Minimum number of training examples (or sum of Hessians) required in a leaf node. Higher values produce smoother predictions. |
| L1 regularization ($$\alpha$$) | `reg_alpha`, `lambda_l1` | 0 to 10 | Adds an L1 penalty on leaf weights. Encourages sparsity in leaf values. |
| L2 regularization ($$\lambda$$) | `reg_lambda`, `lambda_l2` | 0 to 10 | Adds an L2 penalty on leaf weights. Smooths leaf values and reduces overfitting. |
| Early stopping | `early_stopping_rounds` | 10 to 50 | Stops training when validation performance has not improved for a specified number of rounds. |
| Maximum number of leaves | `max_leaves`, `num_leaves` | 15 to 255 | Directly limits the number of terminal nodes per tree, an alternative to depth constraint. |

The interaction between learning rate and number of estimators is particularly important. A common practice is to set a small learning rate (0.01 to 0.1) and then use early stopping to determine the optimal number of trees.

## Regularization and early stopping

[Regularization](/wiki/regularization) is critical in gradient boosting because the sequential, error-correcting nature of the algorithm makes it prone to overfitting, especially on noisy data.

### Shrinkage

Shrinkage (the learning rate) is the most important regularization technique. By scaling each tree's contribution by a small factor (typically 0.01 to 0.1), the model learns more slowly and has better generalization. Friedman showed that smaller learning rates, combined with more boosting rounds, consistently outperform larger learning rates with fewer rounds.[1] Empirically, using a smaller learning rate with more trees almost always yields better test performance than using a larger learning rate with fewer trees, at the cost of longer training time. The update becomes:

$$
F_m(x) = F_{m-1}(x) + \nu \, \eta_m \, h_m(x)
$$

where $$\nu$$ is the shrinkage parameter ($$0 < \nu \le 1$$).

### Tree constraints

Limiting tree depth, minimum leaf size, and maximum number of leaves all restrict the complexity of individual trees. Shallower trees act as weaker learners, which is consistent with the boosting philosophy of combining many weak models.

### L1 and L2 penalties

L1 regularization (Lasso-type) adds a penalty proportional to the sum of the absolute values of leaf weights to the objective function. This pushes leaf weights toward zero and can make some leaf outputs exactly zero, encouraging sparsity. L2 regularization (Ridge-type) adds a penalty proportional to the sum of squared leaf weights, which encourages smaller and more evenly distributed leaf weights. L2 regularization is generally smoother and is often preferred as a default because it effectively prevents any single leaf from having an outsized influence on predictions. XGBoost popularized this approach by including both penalties directly in the objective function, along with a penalty on the number of leaves (tree complexity). This explicit regularization was one of XGBoost's key contributions relative to earlier gradient boosting implementations.[3]

### Subsampling (stochastic gradient boosting)

Row subsampling, introduced by Friedman (2002), draws a random fraction of the training data (without replacement) to fit each new tree.[2] This injects randomness and reduces variance, similar to the effect of [bagging](/wiki/bagging) in [random forests](/wiki/random_forest). Column subsampling, which randomly selects a subset of features for each tree (or each split), further reduces correlation between trees and can improve both speed and accuracy. The combination of row and column subsampling is sometimes called "stochastic gradient boosting" and is standard practice in modern implementations.

### Early stopping

Early stopping is a practical regularization technique where the model is evaluated on a held-out validation set at each boosting round. Training is halted when performance on the validation set stops improving (or begins to degrade) for a specified number of consecutive rounds (the "patience" parameter). This avoids the need to guess the optimal number of boosting rounds in advance and is one of the most effective ways to prevent overfitting, while also saving computation.

## Major implementations

Several highly optimized libraries have been developed for gradient boosting, each introducing algorithmic innovations that improve speed, scalability, or accuracy.

### What is XGBoost?

XGBoost (Extreme Gradient Boosting) was developed by Tianqi Chen and first released in 2014, with the accompanying paper by Chen and Carlos Guestrin published in 2016 at the ACM SIGKDD conference.[3] The paper described it as "a scalable end-to-end tree boosting system" that "scales beyond billions of examples using far fewer resources than existing systems."[3] It quickly became one of the most widely used machine learning tools, particularly in Kaggle competitions where it powered many winning solutions: as noted above, 17 of the 29 challenge-winning solutions posted on Kaggle's blog in 2015 used XGBoost, and every team in the top 10 of the KDDCup 2015 used it.[3]

Key innovations in XGBoost include:

- **Regularized objective function.** The loss function explicitly includes L1 and L2 penalties on leaf weights plus a penalty on the number of leaves, providing built-in regularization that earlier gradient boosting implementations lacked. The objective includes both the loss function and a penalty term that depends on the number of leaves and the L2 norm of leaf weights.
- **Weighted quantile sketch.** For approximate split finding on large datasets, XGBoost uses a distributed weighted quantile sketch algorithm to propose candidate split points based on the distribution of feature values, weighted by the second-order gradient statistics (Hessians). This allows it to handle datasets that do not fit in memory.
- **Sparsity-aware split finding.** XGBoost includes an algorithm that handles missing values and sparse features natively by learning a default direction for missing values at each split. It tries both directions and selects the one that maximizes gain. This makes it substantially faster on sparse data (up to 50 times faster than the naive approach).[3]
- **Cache-aware and out-of-core computation.** XGBoost optimizes memory access patterns for cache efficiency and supports out-of-core computation, allowing it to handle datasets that exceed main memory.
- **Parallel and distributed training.** Although boosting is inherently sequential (each tree depends on the previous ones), XGBoost parallelizes the construction of individual trees by distributing split finding across features and data partitions. It supports multi-threaded CPU training and optional GPU acceleration.
- **Column block structure.** Training data is stored in compressed column (CSC) format in sorted blocks, enabling efficient parallel split evaluation.

XGBoost supports level-wise (breadth-first) tree growth by default, though it also offers a "depthwise" and "lossguide" (leaf-wise) growth policy.

### What is LightGBM?

LightGBM (Light Gradient Boosting Machine) was developed by Guolin Ke and colleagues at Microsoft Research and published at NeurIPS in 2017.[4] It was designed for training speed and memory efficiency on large-scale datasets with many features. The authors reported that LightGBM "speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy."[4]

Key innovations in LightGBM include:

- **Gradient-based One-Side Sampling (GOSS).** Instead of using all training instances, GOSS keeps all instances with large gradients (which contribute most to information gain) and randomly samples a fraction of instances with small gradients. A constant multiplier is applied to the small-gradient samples to correct for the sampling bias. This reduces the number of data instances used for split evaluation without significantly sacrificing accuracy.[4]
- **Exclusive Feature Bundling (EFB).** In high-dimensional sparse datasets, many features are mutually exclusive (rarely nonzero simultaneously). EFB bundles these features together, reducing the effective number of features and speeding up histogram construction. One-hot encoded features are a natural candidate for bundling.
- **Leaf-wise (best-first) tree growth.** Unlike level-wise growth used by most other implementations, LightGBM grows trees leaf-wise, always splitting the leaf with the highest loss reduction. This produces asymmetric trees that often converge faster for the same number of leaves, though it can overfit on small datasets if the maximum depth is not constrained.
- **Histogram-based split finding.** LightGBM bins continuous features into discrete histograms, reducing the computational cost of evaluating splits from $$O(n)$$ to $$O(B)$$, where B is the number of bins. This approach also reduces memory usage.[4]
- **Parallel learning and GPU support.** LightGBM supports feature-parallel and data-parallel distributed training, as well as GPU-accelerated training.

LightGBM can be up to 20 times faster than conventional GBDT implementations while achieving comparable accuracy.[4]

### What is CatBoost?

CatBoost (Categorical Boosting) was developed by Anna Veronika Dorogush, Andrey Gulin, Gleb Gusev, Nikita Kazeev, Liudmila Ostroumova Prokhorenkova, and Aleksandr Vorobev at Yandex. The NeurIPS 2018 paper by Prokhorenkova et al. introduced the theoretical foundations.[5] CatBoost was designed to address two specific problems in gradient boosting: prediction shift (a form of target leakage) and the handling of categorical features.

Key innovations in CatBoost include:

- **Ordered boosting.** Standard gradient boosting suffers from a subtle form of target leakage: the same data used to estimate the model at iteration m is also used to compute the residuals that the next tree learns from. CatBoost addresses this with ordered boosting, which uses a permutation-based scheme to ensure that the residuals for each training example are computed from a model that was not trained on that example. Multiple random permutations of the dataset are generated, and each tree is built using gradient estimates computed only on "preceding" examples in the permutation, producing unbiased gradient estimates and reducing overfitting.[5]
- **Ordered target statistics for categorical features.** Instead of using one-hot encoding or simple target encoding (which leaks target information), CatBoost computes target statistics using only the training examples that precede the current one in a random permutation. This prevents target leakage while still converting categorical features into informative numerical representations.
- **Symmetric (oblivious) decision trees.** CatBoost builds symmetric (balanced) trees by default, where every node at the same depth uses the same splitting condition. This architectural choice speeds up prediction (since the tree can be evaluated as a lookup table), enables efficient CPU and GPU implementation, and acts as a regularization mechanism.
- **Feature combinations.** CatBoost can automatically generate and evaluate combinations of categorical features during training, capturing interactions without manual [feature engineering](/wiki/feature_engineering).
- **Robust default parameters.** CatBoost is designed to produce strong results with minimal hyperparameter tuning, making it particularly accessible for practitioners.

### Comparison of major implementations

| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Year introduced / Initial release | 2014 (paper 2016) | 2016 (paper 2017) | 2017 (paper 2018) |
| Developed by | Tianqi Chen (Univ. of Washington, then DMLC), Carlos Guestrin | Microsoft Research | Yandex |
| Key paper | Chen & Guestrin, KDD 2016 | Ke et al., NeurIPS 2017 | Prokhorenkova et al., NeurIPS 2018 |
| Tree growth strategy | Level-wise (default) | Leaf-wise (best-first) | Symmetric (oblivious / balanced) |
| Split finding | Exact or approximate (weighted quantile sketch) | Histogram-based | Histogram-based |
| Categorical features | Requires encoding (native support added later) | Native support via optimal split | Native ordered target statistics |
| Missing value handling | Native (learned default direction) | Native (assigns to gain-maximizing direction) | Native |
| Key speed optimization | Cache-aware access, column blocks, parallel split finding | GOSS, EFB, histogram-based splits | Symmetric tree structure, ordered boosting |
| Regularization | L1 + L2 on leaf weights, tree complexity penalty | L1 + L2, min data in leaf, max depth, GOSS | Ordered boosting, L2, random strength, depth |
| GPU training | Yes | Yes | Yes |
| Distributed training | Yes (Dask, Spark, built-in) | Yes (built-in) | Yes (built-in) |
| Ranking support | Yes (LambdaMART-based) | Yes (LambdaRank) | Yes (YetiRank, YetiRankPairwise) |
| Language bindings | Python, R, Java, Scala, C++, Julia | Python, R, C++, Java | Python, R, Java, command line |
| License | Apache 2.0 | MIT | Apache 2.0 |
| Training speed (large data) | Moderate | Fast | Moderate |
| Ease of tuning | Moderate | Moderate | Easier (good defaults) |
| Best for | Competitions, general use | Large datasets, speed-critical | Categorical-heavy data, less tuning |

All three libraries are open source under permissive licenses (Apache 2.0 for XGBoost and CatBoost, MIT for LightGBM), and all support histogram-based approximate split finding, parallel training, and GPU acceleration.

### scikit-learn HistGradientBoosting

Starting with version 0.21 (2019), [scikit-learn](https://scikit-learn.org/) introduced `HistGradientBoostingClassifier` and `HistGradientBoostingRegressor`, histogram-based gradient boosting estimators inspired by LightGBM. These estimators bin continuous features into integer-valued histograms before training, which drastically reduces the number of candidate splits. They are orders of magnitude faster than scikit-learn's original `GradientBoostingClassifier` and `GradientBoostingRegressor` on datasets with more than 10,000 samples. The histogram-based estimators also support native missing value handling and categorical feature support (added in scikit-learn 1.0).

## Histogram-based gradient boosting

Histogram-based gradient boosting is an algorithmic optimization that has become the standard approach in modern implementations. Rather than evaluating every unique feature value as a potential split point (which is expensive for continuous features), the algorithm first discretizes each feature into a fixed number of bins (typically 255).

### How histogram binning works

1. **Preprocessing:** Before training begins, each feature's values are sorted and partitioned into at most `max_bins` buckets based on quantiles. Each training sample's feature value is then replaced with its corresponding bin index.
2. **Histogram construction:** For each node during tree building, the algorithm accumulates gradient and Hessian statistics into a histogram for each feature. Each bin in the histogram stores the sum of gradients and the sum of Hessians for all samples falling into that bin.
3. **Split evaluation:** Candidate splits are evaluated by scanning the histogram bins rather than individual data points. For a feature with B bins, only $$B-1$$ candidate splits need to be evaluated.
4. **Histogram subtraction trick:** After splitting a node, the histogram for one child can be computed by subtracting the other child's histogram from the parent's histogram, saving roughly half the computation.

### Advantages

Histogram-based splitting reduces the computational complexity of split finding from $$O(nd)$$ to $$O(Bd)$$, where n is the number of samples, d is the number of features, and B is the number of bins (typically 255). This provides dramatic speedups on large datasets. The binning also acts as a form of regularization, since it groups nearby feature values together and reduces sensitivity to exact feature values. Additionally, histograms are compact data structures that improve cache locality and reduce memory usage.

LightGBM, CatBoost, and scikit-learn's HistGradientBoosting all use histogram-based splitting by default. XGBoost added a `tree_method='hist'` option and eventually made it the default method as well.

## How does gradient boosting differ from random forest?

Gradient boosting and random forest are the two most widely used tree-based ensemble methods, but they differ fundamentally in how they combine trees. The short answer: a random forest builds many deep trees independently and in parallel and averages them (a [bagging](/wiki/bagging) method that mainly reduces variance), whereas gradient boosting builds many shallow trees sequentially, each correcting the previous ensemble's errors (a boosting method that mainly reduces bias).[16]

| Aspect | Gradient boosting | Random forest |
|---|---|---|
| Ensemble strategy | Sequential (each tree corrects errors) | Parallel (trees are independent) |
| Bias-variance tradeoff | Primarily reduces bias | Primarily reduces variance |
| Overfitting risk | Higher (requires regularization) | Lower (bagging provides natural regularization) |
| Training speed | Slower (sequential) | Faster (fully parallelizable) |
| Hyperparameter sensitivity | High (learning rate, depth, n_estimators) | Low (robust with defaults) |
| Peak accuracy on tabular data | Generally higher with tuning | Competitive, but often slightly lower |
| Handling noisy data | More sensitive | More robust |
| Interpretability | Moderate (SHAP, feature importance) | Moderate (feature importance) |

Random forests build many deep trees independently and average their predictions, which reduces variance effectively but does little to reduce bias. Gradient boosting builds many shallow trees sequentially, with each tree targeting the residual errors, which reduces bias aggressively.[16] In practice, well-tuned gradient boosting often achieves higher accuracy than random forests on tabular data, but random forests are faster to train and require less tuning.

For many applications, a reasonable workflow is to start with a random forest as a baseline and then switch to gradient boosting for improved accuracy, using the random forest results to guide hyperparameter choices.

## Feature importance

Gradient boosting provides several methods for measuring the importance of input features. Understanding which features drive predictions is critical for model interpretability and debugging.

### Gain-based importance

Gain-based importance (also called "split importance") measures the total reduction in the loss function (or improvement in the splitting criterion) contributed by splits on a given feature across all trees in the ensemble. Features that are used in more splits and that produce larger loss reductions receive higher importance scores. This is the default feature importance method in most gradient boosting libraries (e.g., XGBoost's `importance_type='gain'`).

A limitation of gain-based importance is that it can be biased toward high-cardinality features (features with many unique values), because these features have more potential split points.

### Split count (frequency) importance

Split count importance (also called "weight" in XGBoost) counts how many times a feature is used as a split variable across all trees. A feature used in many splits is considered more important. This metric is simple but can be misleading: a feature might be split on frequently but contribute little to the overall reduction in loss.

### Permutation importance

Permutation importance is a model-agnostic method that measures how much the model's performance degrades when a feature's values are randomly shuffled. If shuffling a feature causes a large drop in accuracy, the feature is important. Permutation importance avoids the bias of gain-based importance, but it can be computationally expensive and may give misleading results when features are correlated (shuffling one correlated feature may have little effect because the other correlated features still carry the information).

### SHAP values

SHAP (SHapley Additive exPlanations), introduced by Scott Lundberg and Su-In Lee in 2017, provides a theoretically grounded approach to [feature importance](/wiki/feature_importances) based on Shapley values from cooperative game theory.[9] Each feature's SHAP value represents its marginal contribution to a prediction, averaged over all possible orderings of features.

For tree-based models, Lundberg, Erion, and Lee (2019) developed TreeSHAP, an exact polynomial-time algorithm that computes SHAP values efficiently ($$O(TLD^2)$$ where T is the number of trees, L is the maximum number of leaves, and D is the maximum depth).[10] TreeSHAP reduces the computational complexity from exponential in the number of features to polynomial, making it practical for gradient boosting models with hundreds of features. SHAP values offer several desirable properties:

- **Local accuracy:** SHAP values for a single prediction sum to the difference between that prediction and the average prediction (the base value).
- **Consistency:** If a feature's contribution increases in a modified model, its SHAP value will not decrease.
- **Missingness:** Features that are not present have zero attribution.

SHAP values provide both local explanations (why was this particular prediction made?) and global explanations (which features matter most across the dataset?). Lundberg et al. (2020) showed that TreeSHAP is theoretically superior to gain-based and split-count importance for tree ensembles.[11] They have become the de facto standard for interpreting gradient boosting models in both research and industry.

### Comparison of feature importance methods

| Method | Type | Bias Toward High-Cardinality | Handles Correlated Features | Provides Local Explanations | Computational Cost |
|---|---|---|---|---|---|
| Gain | Model-specific | Yes | No | No | Low (computed during training) |
| Split Count | Model-specific | Yes | No | No | Low (computed during training) |
| Permutation | Model-agnostic | No | Poorly (splits importance among correlated features) | No | Medium (requires repeated evaluation) |
| SHAP (TreeSHAP) | Model-specific (for trees) | No | Yes (attributes fairly) | Yes | Medium to high (depends on model size) |

## Gradient boosting for ranking

Gradient boosting has found widespread use in learning-to-rank tasks, where the goal is to order a list of items (such as search results, product recommendations, or advertisement placements) by relevance to maximize a ranking quality metric.

### Background

Traditional [supervised learning](/wiki/supervised_learning) approaches optimize pointwise or pairwise loss functions, but ranking metrics like Normalized Discounted Cumulative Gain (NDCG) and Mean Average [Precision](/wiki/precision) (MAP) are non-differentiable and listwise in nature. This creates a mismatch: directly optimizing NDCG via gradient descent is not possible because the metric involves sorting operations.

### From RankNet to LambdaMART

The evolution from RankNet to LambdaMART proceeded through three stages:

1. **RankNet (Burges et al., 2005):** A [neural network](/wiki/neural_network) model that optimizes a pairwise cross-entropy loss over pairs of documents.[13] RankNet demonstrated that pairwise objectives could be effective for ranking but did not directly optimize listwise metrics like NDCG.

2. **LambdaRank (Burges, 2006):** Rather than deriving gradients from a formal loss function, LambdaRank defines "lambda gradients" that incorporate the change in NDCG (or another ranking metric) that would result from swapping two documents. These lambda gradients are heuristically motivated but empirically effective at optimizing the target ranking metric.[12]

3. **LambdaMART (Burges, 2010):** LambdaMART is one of the most successful learning-to-rank algorithms and combines the lambda gradients from LambdaRank with gradient boosted [decision trees](/wiki/decision_tree) (MART, or Multiple Additive Regression Trees).[12] The key insight is that instead of optimizing a smooth proxy loss function, LambdaMART defines gradients that directly account for the effect of swapping document pairs on ranking metrics like NDCG. For each query/list, it computes lambda gradients for all document pairs based on the current model's scores and the target metric, and then trains a regression tree to predict these lambda gradients (pseudo-residuals). The resulting algorithm directly optimizes ranking quality and was the core algorithm behind several winning entries in the Yahoo Learning to Rank Challenge (2010).

LambdaMART has been used in production ranking systems at major search engines. It is supported natively in XGBoost (via `objective='rank:ndcg'`), LightGBM (via `objective='lambdarank'`), and CatBoost (via `YetiRank` and `YetiRankPairwise` objectives). It remains the most widely used gradient boosting approach for ranking in production [search engines](/wiki/search_engine) and [recommendation systems](/wiki/recommender_system).

## What is gradient boosting used for?

Gradient boosting is used extensively across industries and research domains. Its strength on tabular/structured data, combined with mature library support, makes it a default choice for many prediction problems. Common applications include machine learning competitions, credit scoring and fraud detection in finance, web search ranking, ad click-through rate prediction, clinical risk modeling, and demand forecasting.

### Kaggle and machine learning competitions

Gradient boosting, particularly through XGBoost and LightGBM, has dominated competitive machine learning on platforms like Kaggle. In their 2016 paper, Chen and Guestrin documented this directly: among the 29 challenge-winning solutions published on Kaggle's blog during 2015, 17 used XGBoost, of which 8 used it as the sole model, while the second most popular method, deep neural networks, was used in 11 solutions.[3] Kaggle's then-CEO Anthony Goldbloom likewise observed in 2015 that XGBoost was winning practically every competition in the structured-data category. The XGBoost GitHub repository documents dozens of first-place solutions using the library. In many tabular data competitions, gradient boosting ensembles outperform [deep learning](/wiki/deep_learning) approaches, especially when feature counts are moderate and data volumes are in the thousands to millions of rows.

### Finance and risk modeling

Gradient boosting is widely used in finance for credit scoring, loan default prediction, fraud detection, and algorithmic trading. Financial institutions value the method for its accuracy on tabular data, its ability to handle mixed feature types, and its interpretability via feature importance and SHAP values. Regulatory requirements in banking and insurance often demand model explanations, and gradient boosting models can be made interpretable more easily than black-box [neural networks](/wiki/neural_network).

### Web search and ranking

Major [search engines](/wiki/search_engine), including Yahoo, Bing, and Yandex, have used gradient boosting (often LambdaMART variants) as a core ranking component. The algorithm takes features extracted from query-document pairs (such as text relevance scores, click-through rates, document quality signals, and freshness indicators) and learns a ranking function that maximizes NDCG or a similar metric.

### Ad click-through rate prediction

Predicting whether a user will click on a digital advertisement is a massive-scale [classification](/wiki/classification) problem. Gradient boosting models are used to predict click-through rates (CTR) at companies like Facebook, Google, and Yandex. The method handles the heterogeneous feature space typical of ad prediction (user demographics, ad content features, contextual signals, historical behavior) and scales to billions of training examples with distributed implementations.

### Healthcare and bioinformatics

Gradient boosting has been applied to clinical prediction tasks such as disease diagnosis, patient outcome prediction, readmission risk scoring, patient risk stratification, treatment response prediction, and drug response modeling. Its ability to handle missing data natively (in XGBoost and LightGBM) is valuable in healthcare settings where patient records are frequently incomplete.

### Natural language processing

While [deep learning](/wiki/deep_learning) dominates most [natural language processing](/wiki/natural_language_processing) tasks, gradient boosting remains relevant for NLP problems that involve engineered tabular features, such as text classification with TF-IDF features, named entity recognition post-processing, and [information retrieval](/wiki/information_retrieval) re-ranking.

### Fraud detection and anomaly detection

Gradient boosting is commonly used for fraud detection in banking, e-commerce, and insurance. The algorithm handles class imbalance well (via weighted loss functions or custom objectives) and can incorporate diverse feature types including transaction amounts, time-based features, and categorical merchant codes.

### High-energy physics

Gradient boosting was used in the Higgs boson machine learning challenge and in physics analyses at CERN for event classification.

### Other applications

| Domain | Example Use Cases |
|---|---|
| Energy | Load forecasting, solar/wind power prediction, building energy optimization |
| Retail | Demand forecasting, customer churn prediction, price optimization |
| Manufacturing | Predictive maintenance, quality control, defect detection |
| Ecology | Species distribution modeling, habitat classification |
| Recommendation systems | Feature-based ranking and scoring in recommendation pipelines |
| Insurance | Claims prediction, pricing models, risk assessment |
| Genomics | Gene expression prediction, variant calling, protein function prediction |

## Advantages and limitations

### Advantages

- Consistently achieves state-of-the-art performance on tabular and structured data.
- Handles mixed feature types (numerical, categorical, ordinal) without extensive preprocessing.
- Robust to outliers when using appropriate loss functions (e.g., Huber loss, quantile loss).
- Supports arbitrary differentiable loss functions, enabling customization for specific tasks.
- Mature, highly optimized implementations are available (XGBoost, LightGBM, CatBoost).
- Built-in support for missing values in modern implementations.
- Feature importance and SHAP values provide interpretability.
- Effective regularization mechanisms prevent overfitting.
- Effective for both regression and classification tasks.

### Limitations

- Sequential training limits parallelization of the boosting process (individual trees can be parallelized, but the boosting rounds themselves are sequential), making gradient boosting harder to parallelize than methods like [random forests](/wiki/random_forest).
- More sensitive to hyperparameter choices than random forests; poor tuning can lead to overfitting or underfitting, especially on small or noisy datasets without careful tuning of [hyperparameters](/wiki/hyperparameters).
- Generally underperforms [deep learning](/wiki/deep_learning) on unstructured data such as images, text, and audio.
- Prediction latency can be high for ensembles with thousands of trees, though symmetric tree structures (CatBoost) and tree pruning can mitigate this.
- Sensitive to the choice of learning rate and number of iterations.
- Does not naturally capture spatial or temporal structure (unlike convolutional or recurrent [neural networks](/wiki/neural_network)).
- Requires more careful data preparation and validation strategy compared to simpler methods.

## Alternative names

Gradient boosting is known by several names in the literature and in different software packages:

| Name | Context |
|---|---|
| Gradient Boosting Machine (GBM) | Friedman's original name for the framework |
| Gradient Boosted Decision Trees (GBDT) | When decision trees are used as base learners |
| Multiple Additive Regression Trees (MART) | Term used by Friedman and in some commercial software |
| Boosted Regression Trees (BRT) | Common in ecology and environmental science |
| TreeNet | Commercial implementation by Salford Systems |
| TreeBoost | Friedman's term for the tree-specific variant |
| Stochastic Gradient Boosting | When row/column subsampling is applied |

## References

1. Friedman, J.H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." *The Annals of Statistics*, 29(5), 1189-1232.
2. Friedman, J.H. (2002). "Stochastic Gradient Boosting." *Computational Statistics and Data Analysis*, 38(4), 367-378.
3. Chen, T. and Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 785-794. arXiv:1603.02754.
4. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." *Advances in Neural Information Processing Systems*, 30.
5. Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A.V., and Gulin, A. (2018). "CatBoost: Unbiased Boosting with Categorical Features." *Advances in Neural Information Processing Systems*, 31.
6. Freund, Y. and Schapire, R.E. (1997). "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting." *Journal of Computer and System Sciences*, 55(1), 119-139.
7. Breiman, L. (1998). "Arcing Classifiers." *The Annals of Statistics*, 26(3), 801-849.
8. Breiman, L. (1997). "Arcing the Edge." Technical Report 486, Statistics Department, University of California, Berkeley.
9. Lundberg, S.M. and Lee, S.-I. (2017). "A Unified Approach to Interpreting Model Predictions." *Advances in Neural Information Processing Systems*, 30.
10. Lundberg, S.M., Erion, G.G., and Lee, S.-I. (2019). "Consistent Individualized Feature Attribution for Tree Ensembles." *arXiv preprint arXiv:1802.03888*.
11. Lundberg, S.M., Erion, G., Chen, H., DeGrave, A., Prutkin, J.M., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., and Lee, S.-I. (2020). "From Local Explanations to Global Understanding with [Explainable AI](/wiki/explainable_ai) for Trees." *Nature Machine Intelligence*, 2, 56-67.
12. Burges, C.J.C. (2010). "From RankNet to LambdaRank to LambdaMART: An Overview." *Microsoft Research Technical Report MSR-TR-2010-82*.
13. Burges, C.J.C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hullender, G. (2005). "Learning to Rank using Gradient Descent." *Proceedings of the 22nd International Conference on Machine Learning*.
14. Mason, L., Baxter, J., Bartlett, P., and Frean, M. (2000). "Boosting Algorithms as Gradient Descent." *Advances in Neural Information Processing Systems*, 12.
15. Natekin, A. and Knoll, A. (2013). "Gradient Boosting Machines, a Tutorial." *Frontiers in Neurorobotics*, 7, 21.
16. Hastie, T., Tibshirani, R., and Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction.* 2nd ed. Springer.
17. Schapire, R.E. and Freund, Y. (2012). *Boosting: Foundations and Algorithms.* MIT Press.
18. ACM SIGACT (2003). "2003 Godel Prize: Yoav Freund and Robert Schapire." ACM Special Interest Group on Algorithms and Computation Theory. https://sigact.org/prizes/godel/2003.html