See also: Machine learning terms, Gradient boosting
Gradient boosted decision trees (GBT, also written GBDT, GBM, or GBRT) is an ensemble method that builds a strong predictor by sequentially fitting many shallow decision trees to the negative gradient of a loss function. Each new tree corrects the residual errors of the running model, and the final prediction is a sum of the small contributions from every tree. The framework was formalized by Jerome H. Friedman in 2001 in his Annals of Statistics paper "Greedy Function Approximation: A Gradient Boosting Machine" (cited more than 26,000 times as of 2026). It remains the dominant supervised learning method for tabular data, powering most winning entries in machine learning competitions for over a decade and a large fraction of production models in finance, advertising, search, and forecasting.
GBT is implemented in three industry-standard libraries that account for the vast majority of real-world deployments: XGBoost, LightGBM, and CatBoost. The scikit-learn HistGradientBoostingClassifier and HistGradientBoostingRegressor classes provide a fourth widely used option that is bundled with the most popular general-purpose ML library in Python. Together these implementations have set the state of the art on benchmark after benchmark for tabular problems, and the 2022 Grinsztajn et al. NeurIPS paper "Why do tree-based models still outperform deep learning on typical tabular data?" found that tree ensembles still beat even carefully tuned deep learning models on typical tabular benchmarks.
Imagine you are guessing how heavy a pumpkin is. Your first guess is just "the average pumpkin weight," so you are off by a few pounds. A friend looks at your error and writes a tiny rule: "if the pumpkin is wider than 30 cm, add 4 pounds." Now your guess is closer. A second friend looks at the new error and adds another small rule. After a hundred friends, each adding a tiny correction, your combined guess is very accurate.
That is what GBT does. Each "friend" is a small decision tree. The trees are weak on their own (often only a few splits deep), but their sum is very strong because each one is built specifically to fix the mistakes left by all the earlier trees. The "gradient" part means each new tree is told exactly which direction to push every prediction in order to make the loss go down.
The boosting idea predates gradient boosting by about a decade. Rob Schapire proved in 1990 that any weak learner (one that does only slightly better than random) could be combined with others into a strong learner. Yoav Freund and Schapire turned this into a working algorithm in 1995 with AdaBoost (Adaptive Boosting), published as "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting" in the Journal of Computer and System Sciences in 1997. AdaBoost reweights training examples after each round so that the next weak learner focuses on the examples the previous learners got wrong. Freund and Schapire received the 2003 Godel Prize for this work.
Leo Breiman noted in his 1998 "Arcing Classifiers" paper that AdaBoost could be interpreted as the optimization of an exponential loss function. This was the conceptual key. If boosting is just optimization, the loss function should not be limited to the exponential, and the optimization step should not be limited to reweighting. Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean made this explicit in their 1999 NeurIPS paper "Boosting Algorithms as Gradient Descent," which introduced the functional gradient view.
Jerome Friedman put the framework on solid ground in 2001 with "Greedy Function Approximation: A Gradient Boosting Machine." He framed boosting as steepest descent in function space, derived the algorithm for any differentiable loss, and worked out the special case where the weak learner is a regression tree (which he called TreeBoost). The 2001 paper also introduced shrinkage, scaling each tree's contribution by a small learning rate, and showed empirically that small learning rates with many trees beat large learning rates with few trees almost universally.
In 2002 Friedman followed up with "Stochastic Gradient Boosting," which trains each tree on a random subsample of the data. Borrowing the idea from random forest bagging, this single change usually improves accuracy and, at a typical subsample rate of 0.5, roughly halves per-tree training time. Stochastic gradient boosting (often combined with column subsampling) is the default mode in every modern implementation.
The combination of fast machines, large datasets, and Kaggle competitions made GBT the most consequential ML method of the 2010s. XGBoost was released by Tianqi Chen in 2014 as a command-line tool, and after Chen and Tong He used it to win the special High Energy Physics meets Machine Learning Award in the Higgs Boson Kaggle challenge that summer, it was wrapped in Python and R packages. By 2015 XGBoost had won 17 of 29 Kaggle competitions, and every team in the KDDCup 2015 top 10 used it. Chen and Carlos Guestrin published the formal paper "XGBoost: A Scalable Tree Boosting System" at KDD 2016. Microsoft Research's LightGBM paper appeared at NeurIPS 2017 (the library was open-sourced in 2016), Yandex's CatBoost paper appeared at NeurIPS 2018 (open-sourced in 2017), and the M5 Forecasting Competition (Walmart sales, 2020) was won by an undergraduate using an ensemble of LightGBM models.
GBT builds an additive model in a forward stage-wise way. After M rounds the model has the form:
F_M(x) = F_0(x) + Σ_{m=1..M} ν · γ_m · h_m(x)
where F_0(x) is a constant initialization, each h_m(x) is a regression tree fit at iteration m, γ_m is a per-tree (or per-leaf) step size, and ν is the global shrinkage factor (the learning rate, usually 0.01 to 0.1). The full recipe is:
1. Initialize F_0(x) to the constant that minimizes the loss. For squared error this is the mean of y; for log loss it is the log-odds of the positive class.
2. Compute the pseudo-residuals r_im = -∂L(y_i, F(x_i)) / ∂F(x_i) evaluated at F = F_{m-1}.
3. Fit a regression tree h_m to the targets (x_i, r_im).
4. For each leaf region R_jm, find the leaf value b_jm by line search on the original loss, not on the squared-error fit.
5. Update F_m(x) = F_{m-1}(x) + ν · h_m(x).
6. After M rounds, return F_M.

For squared error the negative gradient is just the residual y_i - F(x_i), so gradient boosting reduces to repeatedly fitting trees on the leftover error. For log loss the negative gradient is y_i - p_i, the difference between the true label and the predicted probability. This unifies regression and classification under one framework.
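For squared error, where the pseudo-residuals are plain residuals, the recipe above fits in a dozen lines. This is an illustrative from-scratch sketch (the synthetic data, tree depth, learning rate, and round count are arbitrary choices, not library defaults):

```python
# Minimal gradient boosting for squared error: each tree is fit to the
# residual y - F(x), which is the negative gradient of the squared-error loss.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=500)

nu = 0.1                          # shrinkage (learning rate)
F = np.full(500, y.mean())        # F_0: constant initialization (mean of y)
trees = []
for m in range(100):
    r = y - F                     # pseudo-residuals = negative gradient
    h = DecisionTreeRegressor(max_depth=2).fit(X, r)
    trees.append(h)
    F = F + nu * h.predict(X)     # F_m = F_{m-1} + nu * h_m

print("final training MSE:", np.mean((y - F) ** 2))
```

Predicting on new data is the same sum: start from the constant and add `nu * h.predict(X_new)` for every stored tree.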
Any differentiable loss works. The common choices are:
| Loss | Task | Negative gradient |
|---|---|---|
| Squared error | Regression | y_i - F(x_i) |
| Absolute error (L1) | Robust regression | sign(y_i - F(x_i)) |
| Huber | Regression with outliers | residual if abs(r) ≤ δ, else δ · sign(r) |
| Quantile loss | Quantile / interval prediction | τ - 1{r < 0} |
| Log loss | Binary classification | y_i - σ(F(x_i)) |
| Multinomial deviance | Multi-class | y_i,k - softmax_k(F(x_i)) |
| Poisson | Count regression | y_i - exp(F(x_i)) |
| Gamma | Strictly positive targets | (y_i - μ_i) / μ_i |
| LambdaRank | Learning to rank | Pairwise λ-gradients on NDCG |
| Cox | Survival | Partial-likelihood gradient |
XGBoost popularized a refinement called Newton boosting. Instead of just the first-order gradient, it uses a second-order Taylor expansion of the loss at F_{m-1}:
L ≈ Σ [g_i · h_m(x_i) + ½ · h_i · h_m(x_i)²] + Ω(h_m)
where g_i is the gradient and h_i is the Hessian (the second derivative) of the loss. With this, the optimal leaf value has a closed form: w_j* = -G_j / (H_j + λ) where G and H are the sums of gradients and Hessians in leaf j. The split-finding gain is ½ · (G_L²/(H_L+λ) + G_R²/(H_R+λ) - G²/(H+λ)) - γ, with γ a per-leaf complexity penalty. LightGBM and CatBoost use the same Newton-style update; the original Friedman algorithm uses only first-order information plus a line search.
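The closed-form leaf weight and split gain are easy to verify numerically. The sketch below uses log loss with λ = 1 and γ = 0 (illustrative values chosen for the example; check your library's documentation for its actual defaults):

```python
# Newton-boosting statistics for log loss: per-example gradient g_i = p_i - y_i
# and Hessian h_i = p_i * (1 - p_i), then the closed-form leaf weight and gain.
import numpy as np

def logloss_grad_hess(y, f):
    p = 1.0 / (1.0 + np.exp(-f))      # sigmoid of the current margin
    return p - y, p * (1 - p)          # gradient, Hessian per example

y = np.array([1, 1, 0, 0, 1], dtype=float)
f = np.zeros(5)                        # current model outputs (margins)
g, h = logloss_grad_hess(y, f)

lambda_ = 1.0                          # L2 penalty on leaf weights

# Optimal weight for a leaf containing all five examples: w* = -G / (H + lambda)
w_star = -g.sum() / (h.sum() + lambda_)

# Gain of splitting examples {0,1,2} from {3,4} (gamma = 0 here):
def score(g, h):
    return g.sum() ** 2 / (h.sum() + lambda_)

gain = 0.5 * (score(g[:3], h[:3]) + score(g[3:], h[3:]) - score(g, h))
print("leaf weight:", w_star, "split gain:", gain)
```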
GBT is famously sensitive to hyperparameters. The interaction between learning rate, number of trees, and tree depth is where most of the tuning effort goes.
| Hyperparameter | Typical range | What it does |
|---|---|---|
| n_estimators / num_boost_round | 100 to 10,000 | Total number of trees. Use early stopping to find it. |
| learning_rate (eta, ν) | 0.01 to 0.3 | Shrinks each tree's contribution. Lower = more trees needed. |
| max_depth | 3 to 10 | Maximum tree depth. Deep trees model interactions but overfit. |
| num_leaves | 15 to 255 | LightGBM's leaf-wise budget; controls capacity directly. |
| min_child_weight / min_samples_leaf | 1 to 100+ | Minimum sum of Hessians (or count) per leaf. Smooths predictions. |
| subsample | 0.5 to 1.0 | Row subsampling for stochastic boosting. |
| colsample_bytree | 0.5 to 1.0 | Feature subsampling per tree. |
| reg_alpha (L1) | 0 to 10 | L1 penalty on leaf weights. |
| reg_lambda (L2) | 0 to 10 | L2 penalty on leaf weights. |
| gamma (XGBoost) | 0 to 5 | Minimum loss reduction to make a split. |
| early_stopping_rounds | 10 to 100 | Stop if validation metric does not improve for this many rounds. |
| max_bins | 64 to 511 | Histogram resolution. More bins = slightly better, slower. |
A common pattern is to fix a small learning rate (0.05), set n_estimators to something large like 5000, and let early stopping pick the actual count on a validation set. Tree depth is usually swept first, then regularization (reg_lambda, min_child_weight), then subsampling.
The four implementations below cover essentially every modern GBT use case. Each has subtly different defaults and tradeoffs.
| Feature | XGBoost | LightGBM | CatBoost | sklearn HistGradientBoosting |
|---|---|---|---|---|
| First release | 2014 | 2016 | 2017 | 2019 |
| Reference paper | Chen & Guestrin (KDD 2016) | Ke et al. (NeurIPS 2017) | Prokhorenkova et al. (NeurIPS 2018) | Inspired by LightGBM |
| Origin | Tianqi Chen (DMLC) | Microsoft Research | Yandex | scikit-learn project |
| Default tree growth | Level-wise (depthwise); leaf-wise via grow_policy | Leaf-wise | Symmetric (oblivious) | Leaf-wise |
| Split finding | Exact + histogram (hist) + GPU | Histogram + GOSS | Histogram | Histogram |
| Categorical features | One-hot or enable_categorical=True (recent) | Native, optimal split via Fisher 1958 | Ordered target statistics, native | Native, partition-based |
| Missing values | Native (learned default direction) | Native | Native | Native |
| Newton boosting | Yes (default) | Yes | Yes | Yes |
| GPU | Yes (tree_method=hist, device=cuda) | Yes | Yes | No (CPU only, OpenMP) |
| Distributed training | Yes (Dask, Spark, Ray) | Yes (MPI, Dask, Spark) | Yes (CPU/GPU multi-host) | No |
| Monotonic constraints | Yes | Yes | Yes | Yes |
| SHAP integration | Yes (TreeSHAP built in) | Yes | Yes | Yes (via shap library) |
| Notable speed trick | Sparsity-aware split, column block | GOSS + EFB + leaf-wise | Symmetric trees for fast inference | OpenMP parallel histograms |
| Best at | General-purpose, large competitions | Largest datasets, fast iteration | Categorical-heavy data, robust defaults | Anyone already using sklearn |
Other implementations worth knowing about: H2O GBM (Java, used in enterprise), Apache Spark MLlib's GBTClassifier (for distributed JVM workloads), TensorFlow Decision Forests / Yggdrasil Decision Forests (Google's open-source library based on the system that ran inside Google for years), and the legacy sklearn.ensemble.GradientBoostingClassifier (pre-histogram, slow on anything over 10,000 samples; usually replaced by HistGradientBoosting).
XGBoost. Newton boosting with explicit L1 + L2 + leaf-count regularization in the objective. Sparsity-aware split finding that learns a default branch for missing values, reportedly 50x faster than the naive approach on sparse data. Cache-aware block storage for parallel histogram building. The original tree_method=exact mode does pre-sorted split finding; the hist mode added in 2017 uses histograms and is now the default.
LightGBM. Three innovations stacked: histogram-based split finding (bin into 255 buckets, compute splits in O(bins) instead of O(rows)), leaf-wise tree growth (always split the leaf with the highest loss reduction, instead of growing level by level), and Gradient-based One-Side Sampling (GOSS) plus Exclusive Feature Bundling (EFB). GOSS keeps every example with a large gradient and randomly subsamples the small-gradient ones, with a correction multiplier to keep the sums unbiased. EFB groups mutually-exclusive sparse features into single bundles, useful when one-hot encodings have made the feature count explode. Together they make LightGBM the fastest of the three on the largest datasets.
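GOSS itself is only a few lines. A sketch with keep rates a = 0.2 (large-gradient fraction) and b = 0.1 (sampled fraction of the rest), which are common settings; LightGBM's real implementation differs in detail:

```python
# Gradient-based One-Side Sampling: keep every large-gradient row, randomly
# sample the rest, and reweight the sampled rows by (1 - a) / b so that
# gradient sums stay unbiased estimates of the full-data sums.
import numpy as np

def goss(g, a=0.2, b=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(g)
    top_k = int(a * n)
    order = np.argsort(-np.abs(g))
    top = order[:top_k]                        # all large-gradient rows kept
    rest = order[top_k:]
    sampled = rng.choice(rest, size=int(b * n), replace=False)
    weights = np.ones(n)
    weights[sampled] *= (1 - a) / b            # correction multiplier
    idx = np.concatenate([top, sampled])
    return idx, weights[idx]

g = np.random.default_rng(1).normal(size=1000)
idx, w = goss(g)
print(len(idx), "rows kept out of", len(g))
```

The next tree is then trained only on the rows in `idx`, with each row's gradient scaled by its weight.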
CatBoost. Two big ideas, both aimed at fixing target leakage. Ordered target statistics replace one-hot encoding for categorical features: for each row, the target encoding is computed using only the rows that come before it in a random permutation, so the row's own label never leaks into its features. Ordered boosting generalizes this to the boosting loop itself: for each row, the residual that the next tree learns on is computed from a model that was not trained on that row. CatBoost also uses oblivious trees (also called symmetric trees), where every node at the same depth uses the same split. Oblivious trees are weaker individually but extremely fast at inference, since prediction reduces to evaluating a binary index of length depth. CatBoost's defaults are good enough that it is often the best library when a practitioner does not want to spend days on hyperparameter search.
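The ordered-target-statistics idea can be sketched in a few lines. This uses a simple additive-smoothing prior for illustration; CatBoost's actual formula and its averaging over multiple permutations differ in detail:

```python
# Ordered target statistics: encode each row's category using only the labels
# of rows that appear BEFORE it in a random permutation, so a row's own label
# never leaks into its own feature.
import numpy as np

def ordered_target_encoding(cats, y, prior=0.5, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(cats))
    sums, counts = {}, {}
    enc = np.empty(len(cats))
    for i in perm:                                   # walk rows in random order
        c = cats[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        enc[i] = (s + alpha * prior) / (n + alpha)   # only earlier rows used
        sums[c] = s + y[i]            # row's own label added AFTER encoding
        counts[c] = n + 1
    return enc

cats = np.array(["a", "a", "b", "a", "b"])
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
enc = ordered_target_encoding(cats, y)
print(enc)
```

The first row visited in the permutation has seen no history, so it receives the pure prior; later rows get encodings that converge toward the category's running mean.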
scikit-learn HistGradientBoosting. Histogram-based GBT directly inspired by LightGBM, parallelized with OpenMP, and shipped inside scikit-learn since version 0.21 (2019). Native missing-value support, native categorical support (since 1.0), and monotonic constraints. Orders of magnitude faster than the older GradientBoostingClassifier. The main reason to choose it over XGBoost or LightGBM is that it is zero-install if you already have sklearn, and it integrates cleanly with sklearn pipelines and GridSearchCV.
The reasons GBT keeps winning are consistent across domains: high accuracy on heterogeneous tabular features, native handling of missing values and categoricals, fast training, and very fast inference. The speed/accuracy knobs follow the same pattern in every library: raise max_bins and decrease learning_rate for accuracy; reduce num_leaves and use GOSS for speed. Trained models serialize compactly (for example with XGBoost's bst.save_model), or can be compiled to native code via Treelite for low-latency serving.

GBT is not the right tool everywhere: trees cannot extrapolate beyond the range of the training targets, they ignore the spatial and sequential structure that convolutional and transformer architectures exploit, and they offer no transfer learning, which is why images, audio, and raw text remain neural-network territory.
For about a decade, the conventional wisdom has been: tabular data goes to GBT, everything else goes to neural networks. Several papers have tried to overturn the first half of that statement. TabNet (2019) used attention-based feature selection. Hopular (2022) used Hopfield networks. FT-Transformer and SAINT applied transformers to tabular data. NODE used differentiable trees. Each was claimed to beat XGBoost on selected benchmarks at release time.
Grinsztajn, Oyallon, and Varoquaux's 2022 NeurIPS paper "Why do tree-based models still outperform deep learning on typical tabular data?" pushed back hard. They benchmarked tree ensembles (XGBoost, gradient boosting, random forests) against the best deep tabular models on a curated set of 45 datasets, controlling for hyperparameter tuning and dataset size. Trees won on medium-sized data (around 10,000 rows) and stayed ahead even after the neural networks were tuned at much higher cost. The authors traced the gap to three inductive biases: neural networks are biased toward overly smooth solutions while tabular targets are often irregular, they are more harmed by the uninformative features common in tabular data, and their rotation invariance is a poor match for feature axes that each carry individual meaning, which trees exploit directly.
The practical takeaway: on tabular data with under a million rows, GBT should be the first thing tried. Deep tabular models are sometimes worth the cost when there is a huge amount of data, when the problem has multimodal inputs (numerical + text + image), or when transfer learning across tasks is needed.
From roughly 2015 to 2020, GBT (mostly XGBoost, then LightGBM) won almost every Kaggle competition that involved tabular data. The pattern was:
| Year | Competition / context | Winning approach |
|---|---|---|
| 2014 | Higgs Boson Machine Learning Challenge (Kaggle) | XGBoost (Special HEP-meets-ML award to Tianqi Chen and Tong He) |
| 2015 | KDDCup 2015 student dropout prediction | All top-10 teams used XGBoost |
| 2015 | Otto Group Product Classification | XGBoost (Titericz, Semenov) |
| 2016 | Allstate Claims Severity | XGBoost (Bhattacharjee) |
| 2017+ | Most tabular Kaggle competitions | XGBoost, then LightGBM as it matured |
| 2018+ | Categorical-heavy competitions | CatBoost competitive with XGBoost / LightGBM |
| 2020 | M5 Forecasting Competition (Walmart) | LightGBM ensemble (YeonJun In, Kyung Hee University) |
| 2022 | M5 Uncertainty conclusions | Tree-based methods dominated |
The M5 result was particularly striking. M5 was the first M-competition where every top method was a pure ML approach rather than a statistical time-series model, and the winning solution was an equal-weighted average of six LightGBM models trained on different aggregations of the Walmart sales hierarchy. This was a turning point in the forecasting community: hand-tuned ARIMA and ETS were no longer competitive with off-the-shelf GBT.
GBT and random forest are the two dominant tree ensemble families, but they are built on opposite philosophies.
| Aspect | Gradient boosted trees | Random forest |
|---|---|---|
| Training | Sequential, each tree depends on previous trees' errors | Parallel, every tree trained independently on a bootstrap sample |
| Bias-variance focus | Reduces bias aggressively | Reduces variance aggressively |
| Tree depth | Shallow (3 to 10) | Deep (often unlimited) |
| Number of trees | Tuned via early stopping | More is always better, just slower |
| Hyperparameter sensitivity | High | Low |
| Out-of-the-box accuracy | Lower than tuned, but excellent when tuned | Strong even with defaults |
| Risk of overfitting | Real, must regularize | Bagging is self-regularizing |
| Best suited to | Maximizing accuracy | Quick baselines, feature ranking, low-tuning use |
A practical recipe: train a random forest first to get a quick baseline and a sanity-check feature importance, then move to GBT for the production model.
GBT remains the default for almost any tabular ML problem in industry: credit risk scoring in finance, click-through-rate prediction in advertising, learning-to-rank in search engines, and demand forecasting in retail.
GBT is also frequently used as a feature extractor for downstream models. The leaf indices of a trained GBT can be one-hot encoded and fed into a linear model or neural network; this trick was popularized by Facebook in their 2014 ad CTR paper "Practical Lessons from Predicting Clicks on Ads at Facebook," where the combined GBT-plus-logistic system outperformed either alone.
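The leaf-index trick can be sketched with scikit-learn's older GradientBoostingClassifier, whose apply method exposes leaf indices (the dataset and model sizes here are illustrative):

```python
# GBT leaves as features: each tree maps a row to one leaf; one-hot encoding
# the (n_samples, n_trees) leaf-index matrix gives a sparse binary feature
# set for a downstream linear model, as in the Facebook 2014 CTR paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
gbt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
gbt.fit(X, y)

leaves = gbt.apply(X)[:, :, 0]         # (n_samples, n_trees) leaf indices
onehot = OneHotEncoder().fit_transform(leaves)   # one binary column per leaf
lr = LogisticRegression(max_iter=1000).fit(onehot, y)
print("training accuracy of LR on GBT leaves:", lr.score(onehot, y))
```

In production the encoder must be fit on training data only and applied with handle_unknown="ignore" so unseen leaves at serving time do not raise errors.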
A minimal training loop in Python with sklearn's HistGradientBoostingClassifier:
```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml

X, y = fetch_openml("adult", version=2, return_X_y=True, as_frame=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
clf = HistGradientBoostingClassifier(
    max_iter=2000,
    learning_rate=0.05,
    max_leaf_nodes=63,
    min_samples_leaf=20,
    l2_regularization=1.0,
    early_stopping=True,
    n_iter_no_change=20,
    validation_fraction=0.1,
    categorical_features="from_dtype",
    random_state=0,
)
clf.fit(X_tr, y_tr)
print("validation accuracy:", clf.score(X_va, y_va))
```
The equivalent in XGBoost looks similar but with explicit early_stopping_rounds:
```python
import xgboost as xgb

dtrain = xgb.DMatrix(X_tr, label=(y_tr == ">50K").astype(int), enable_categorical=True)
dvalid = xgb.DMatrix(X_va, label=(y_va == ">50K").astype(int), enable_categorical=True)
params = {
    "objective": "binary:logistic",
    "tree_method": "hist",
    "learning_rate": 0.05,
    "max_depth": 6,
    "min_child_weight": 1.0,
    "reg_lambda": 1.0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "eval_metric": "logloss",
}
bst = xgb.train(
    params,
    dtrain,
    num_boost_round=2000,
    evals=[(dvalid, "val")],
    early_stopping_rounds=20,
    verbose_eval=100,
)
```
The parameters that matter most are essentially the same in every library: a small learning rate, a moderate-sized tree (depth 6 or num_leaves=31 to 63), modest subsampling, an L2 penalty, and early stopping on a held-out validation set.
Scott Lundberg and Su-In Lee's 2017 NeurIPS paper "A Unified Approach to Interpreting Model Predictions" introduced SHAP (SHapley Additive exPlanations), and Lundberg, Erion, and Lee's 2019 follow-up gave TreeSHAP, an algorithm that computes exact Shapley values for tree ensembles in polynomial time. TreeSHAP runs in O(TLD²) for an ensemble of T trees with up to L leaves and depth D, instead of the exponential cost of model-agnostic SHAP. Every major GBT library now ships with TreeSHAP integration: model.predict(X, pred_contribs=True) in XGBoost, model.predict(X, pred_contrib=True) in LightGBM, and model.get_feature_importance(type="ShapValues") in CatBoost. SHAP has effectively replaced gain-based importance as the default explanation tool for GBT models in industry, especially in regulated settings (credit, insurance, healthcare) where per-prediction attribution is required.
GBT is one node in a larger family: