See also: Machine learning terms, Gradient boosting
Gradient boosted decision trees (GBT, also written GBDT, GBM, or GBRT) is an ensemble method that builds a strong predictor by sequentially fitting many shallow decision trees to the negative gradient of a loss function. Each new tree corrects the residual errors of the running model, and the final prediction is a sum of the small contributions from every tree. The framework was formalized by Jerome H. Friedman in 2001 in his Annals of Statistics paper "Greedy Function Approximation: A Gradient Boosting Machine" (cited more than 26,000 times as of 2026). It remains the dominant supervised learning method for tabular data, powering most winning entries in machine learning competitions for over a decade and a large fraction of production models in finance, advertising, search, and forecasting.
GBT is implemented in three industry-standard libraries that account for the vast majority of real-world deployments: XGBoost, LightGBM, and CatBoost. The scikit-learn HistGradientBoostingClassifier and HistGradientBoostingRegressor classes provide a fourth widely used option that is bundled with the most popular general-purpose ML library in Python. Together these implementations have set the state of the art on benchmark after benchmark for tabular problems, and the 2022 Grinsztajn et al. NeurIPS paper "Why do tree-based models still outperform deep learning on typical tabular data?" found that tree ensembles still beat even carefully tuned deep learning models on typical tabular benchmarks.
Imagine you are guessing how heavy a pumpkin is. Your first guess is just "the average pumpkin weight," so you are off by a few pounds. A friend looks at your error and writes a tiny rule: "if the pumpkin is wider than 30 cm, add 4 pounds." Now your guess is closer. A second friend looks at the new error and adds another small rule. After a hundred friends, each adding a tiny correction, your combined guess is very accurate.
That is what GBT does. Each "friend" is a small decision tree. The trees are weak on their own (often only a few splits deep), but their sum is very strong because each one is built specifically to fix the mistakes left by all the earlier trees. The "gradient" part means each new tree is told exactly which direction to push every prediction in order to make the loss go down.
The boosting idea predates gradient boosting by about a decade. Rob Schapire proved in 1990 that any weak learner (one that does only slightly better than random) could be combined with others into a strong learner. Yoav Freund and Schapire turned this into a working algorithm in 1995 with AdaBoost (Adaptive Boosting), published as "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting" in the Journal of Computer and System Sciences in 1997. AdaBoost reweights training examples after each round so that the next weak learner focuses on the examples the previous learners got wrong. Freund and Schapire received the 2003 Godel Prize for this work.
Leo Breiman noted in his 1998 "Arcing Classifiers" paper that AdaBoost could be interpreted as the optimization of an exponential loss function. This was the conceptual key. If boosting is just optimization, the loss function should not be limited to the exponential, and the optimization step should not be limited to reweighting. Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean made this explicit in their 1999 NeurIPS paper "Boosting Algorithms as Gradient Descent," which introduced the functional gradient view.
Jerome Friedman put the framework on solid ground in 2001 with "Greedy Function Approximation: A Gradient Boosting Machine." He framed boosting as steepest descent in function space, derived the algorithm for any differentiable loss, and worked out the special case where the weak learner is a regression tree (which he called TreeBoost). The 2001 paper also introduced shrinkage, scaling each tree's contribution by a small learning rate, and showed empirically that small learning rates with many trees beat large learning rates with few trees almost universally.
In 2002 Friedman followed up with "Stochastic Gradient Boosting," which trains each tree on a random subsample of the data. Borrowing the idea from random forest bagging, this single change usually improves accuracy and, at a typical subsample rate of 0.5, roughly halves per-tree training time. Stochastic gradient boosting (often combined with column subsampling) is the default mode in every modern implementation.
The combination of fast machines, large datasets, and Kaggle competitions made GBT the most consequential ML method of the 2010s. XGBoost was released by Tianqi Chen in 2014 as a command-line tool, and after Chen and Tong He used it to win the special High Energy Physics meets Machine Learning Award in the Higgs Boson Kaggle challenge that summer, it was wrapped in Python and R packages. By 2015 XGBoost had won 17 of 29 Kaggle competitions, and every team in the KDDCup 2015 top 10 used it. Chen and Carlos Guestrin published the formal paper "XGBoost: A Scalable Tree Boosting System" at KDD 2016. Microsoft Research's LightGBM paper appeared at NeurIPS 2017 (the library was open-sourced in 2016), Yandex's CatBoost paper appeared at NeurIPS 2018 (open-sourced in 2017), and the M5 Forecasting Competition (Walmart sales, 2020) was won by an undergraduate using an ensemble of LightGBM models.
GBT builds an additive model in a forward stage-wise way. After M rounds the model has the form:
F_M(x) = F_0(x) + Σ_{m=1..M} ν · γ_m · h_m(x)
where F_0(x) is a constant initialization, each h_m(x) is a regression tree fit at iteration m, γ_m is a per-tree (or per-leaf) step size, and ν is the global shrinkage factor (the learning rate, usually 0.01 to 0.1). The full recipe is:
1. Initialize F_0(x) to the constant that minimizes the loss. For squared error this is the mean of y; for log loss it is the log-odds of the positive class.
2. Compute the pseudo-residuals r_im = -∂L(y_i, F(x_i)) / ∂F(x_i) evaluated at F = F_{m-1}.
3. Fit a regression tree h_m to the targets (x_i, r_im).
4. For each leaf region R_jm, find the leaf value b_jm by line search on the original loss, not on the squared-error fit.
5. Update F_m(x) = F_{m-1}(x) + ν · h_m(x).
6. After M rounds, return F_M.

For squared error the negative gradient is just the residual y_i - F(x_i), so gradient boosting reduces to repeatedly fitting trees on the leftover error. For log loss the negative gradient is y_i - p_i, the difference between the true label and the predicted probability. This unifies regression and classification under one framework.
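For squared error, where the pseudo-residuals are plain residuals, the recipe above fits in a dozen lines. This is an illustrative from-scratch sketch (the synthetic data, tree depth, learning rate, and round count are arbitrary choices, not library defaults):

```python
# Minimal gradient boosting for squared error: each tree is fit to the
# residual y - F(x), which is the negative gradient of the squared-error loss.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=500)

nu = 0.1                          # shrinkage (learning rate)
F = np.full(500, y.mean())        # F_0: constant initialization (mean of y)
trees = []
for m in range(100):
    r = y - F                     # pseudo-residuals = negative gradient
    h = DecisionTreeRegressor(max_depth=2).fit(X, r)
    trees.append(h)
    F = F + nu * h.predict(X)     # F_m = F_{m-1} + nu * h_m

print("final training MSE:", np.mean((y - F) ** 2))
```

Predicting on new data is the same sum: start from the constant and add `nu * h.predict(X_new)` for every stored tree.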
Any differentiable loss works. The common choices are:
| Loss | Task | Negative gradient |
|---|---|---|
| Squared error | Regression | y_i - F(x_i) |
| Absolute error (L1) | Robust regression | sign(y_i - F(x_i)) |
| Huber | Regression with outliers | residual if abs(r) ≤ δ, else δ · sign(r) |
| Quantile loss | Quantile / interval prediction | τ - 1{r < 0} |
| Log loss | Binary classification | y_i - σ(F(x_i)) |
| Multinomial deviance | Multi-class | y_i,k - softmax_k(F(x_i)) |
| Poisson | Count regression | y_i - exp(F(x_i)) |
| Gamma | Strictly positive targets | (y_i - μ_i) / μ_i |
| LambdaRank | Learning to rank | Pairwise λ-gradients on NDCG |
| Cox | Survival | Partial-likelihood gradient |
XGBoost popularized a refinement called Newton boosting. Instead of just the first-order gradient, it uses a second-order Taylor expansion of the loss at F_{m-1}:
L ≈ Σ [g_i · h_m(x_i) + ½ · h_i · h_m(x_i)²] + Ω(h_m)
where g_i is the gradient and h_i is the Hessian (the second derivative) of the loss. With this, the optimal leaf value has a closed form: w_j* = -G_j / (H_j + λ) where G and H are the sums of gradients and Hessians in leaf j. The split-finding gain is ½ · (G_L²/(H_L+λ) + G_R²/(H_R+λ) - G²/(H+λ)) - γ, with γ a per-leaf complexity penalty. LightGBM and CatBoost use the same Newton-style update; the original Friedman algorithm uses only first-order information plus a line search.
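The closed-form leaf weight and split gain are easy to verify numerically. The sketch below uses log loss with λ = 1 and γ = 0 (illustrative values chosen for the example; check your library's documentation for its actual defaults):

```python
# Newton-boosting statistics for log loss: per-example gradient g_i = p_i - y_i
# and Hessian h_i = p_i * (1 - p_i), then the closed-form leaf weight and gain.
import numpy as np

def logloss_grad_hess(y, f):
    p = 1.0 / (1.0 + np.exp(-f))      # sigmoid of the current margin
    return p - y, p * (1 - p)          # gradient, Hessian per example

y = np.array([1, 1, 0, 0, 1], dtype=float)
f = np.zeros(5)                        # current model outputs (margins)
g, h = logloss_grad_hess(y, f)

lambda_ = 1.0                          # L2 penalty on leaf weights

# Optimal weight for a leaf containing all five examples: w* = -G / (H + lambda)
w_star = -g.sum() / (h.sum() + lambda_)

# Gain of splitting examples {0,1,2} from {3,4} (gamma = 0 here):
def score(g, h):
    return g.sum() ** 2 / (h.sum() + lambda_)

gain = 0.5 * (score(g[:3], h[:3]) + score(g[3:], h[3:]) - score(g, h))
print("leaf weight:", w_star, "split gain:", gain)
```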
GBT is famously sensitive to hyperparameters. The interaction between learning rate, number of trees, and tree depth is where most of the tuning effort goes.
| Hyperparameter | Typical range | What it does |
|---|---|---|
| n_estimators / num_boost_round | 100 to 10,000 | Total number of trees. Use early stopping to find it. |
| learning_rate (eta, ν) | 0.01 to 0.3 | Shrinks each tree's contribution. Lower = more trees needed. |
| max_depth | 3 to 10 | Maximum tree depth. Deep trees model interactions but overfit. |
| num_leaves | 15 to 255 | LightGBM's leaf-wise budget; controls capacity directly. |
| min_child_weight / min_samples_leaf | 1 to 100+ | Minimum sum of Hessians (or count) per leaf. Smooths predictions. |
| subsample | 0.5 to 1.0 | Row subsampling for stochastic boosting. |
| colsample_bytree | 0.5 to 1.0 | Feature subsampling per tree. |
| reg_alpha (L1) | 0 to 10 | L1 penalty on leaf weights. |
| reg_lambda (L2) | 0 to 10 | L2 penalty on leaf weights. |
| gamma (XGBoost) | 0 to 5 | Minimum loss reduction to make a split. |
| early_stopping_rounds | 10 to 100 | Stop if validation metric does not improve for this many rounds. |
| max_bins | 64 to 511 | Histogram resolution. More bins = slightly better, slower. |
A common pattern is to fix a small learning rate (0.05), set n_estimators to something large like 5000, and let early stopping pick the actual count on a validation set. Tree depth is usually swept first, then regularization (reg_lambda, min_child_weight), then subsampling.
The four implementations below cover essentially every modern GBT use case. Each has subtly different defaults and tradeoffs.
| Feature | XGBoost | LightGBM | CatBoost | sklearn HistGradientBoosting |
|---|---|---|---|---|
| First release | 2014 | 2016 | 2017 | 2019 |
| Reference paper | Chen & Guestrin (KDD 2016) | Ke et al. (NeurIPS 2017) | Prokhorenkova et al. (NeurIPS 2018) | Inspired by LightGBM |
| Origin | Tianqi Chen (DMLC) | Microsoft Research | Yandex | scikit-learn project |
| Default tree growth | Level-wise (depthwise); leaf-wise via grow_policy | Leaf-wise | Symmetric (oblivious) | Leaf-wise |
| Split finding | Exact + histogram (hist) + GPU | Histogram + GOSS | Histogram | Histogram |
| Categorical features | One-hot or enable_categorical=True (recent) | Native, optimal split via Fisher 1958 | Ordered target statistics, native | Native, partition-based |
| Missing values | Native (learned default direction) | Native | Native | Native |
| Newton boosting | Yes (default) | Yes | Yes | Yes |
| GPU | Yes (tree_method=hist, device=cuda) | Yes | Yes | No (CPU only, OpenMP) |
| Distributed training | Yes (Dask, Spark, Ray) | Yes (MPI, Dask, Spark) | Yes (CPU/GPU multi-host) | No |
| Monotonic constraints | Yes | Yes | Yes | Yes |
| SHAP integration | Yes (TreeSHAP built in) | Yes | Yes | Yes (via shap library) |
| Notable speed trick | Sparsity-aware split, column block | GOSS + EFB + leaf-wise | Symmetric trees for fast inference | OpenMP parallel histograms |
| Best at | General-purpose, large competitions | Largest datasets, fast iteration | Categorical-heavy data, robust defaults | Anyone already using sklearn |
Other implementations worth knowing about: H2O GBM (Java, used in enterprise), Apache Spark MLlib's GBTClassifier (for distributed JVM workloads), TensorFlow Decision Forests / Yggdrasil Decision Forests (Google's open-source library based on the system that ran inside Google for years), and the legacy sklearn.ensemble.GradientBoostingClassifier (pre-histogram, slow on anything over 10,000 samples; usually replaced by HistGradientBoosting).
XGBoost. Newton boosting with explicit L1 + L2 + leaf-count regularization in the objective. Sparsity-aware split finding that learns a default branch for missing values, reportedly 50x faster than the naive approach on sparse data. Cache-aware block storage for parallel histogram building. The original tree_method=exact mode does pre-sorted split finding; the hist mode added in 2017 uses histograms and is now the default.
LightGBM. Three innovations stacked: histogram-based split finding (bin into 255 buckets, compute splits in O(bins) instead of O(rows)), leaf-wise tree growth (always split the leaf with the highest loss reduction, instead of growing level by level), and Gradient-based One-Side Sampling (GOSS) plus Exclusive Feature Bundling (EFB). GOSS keeps every example with a large gradient and randomly subsamples the small-gradient ones, with a correction multiplier to keep the sums unbiased. EFB groups mutually-exclusive sparse features into single bundles, useful when one-hot encodings have made the feature count explode. Together they make LightGBM the fastest of the three on the largest datasets.
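GOSS itself is only a few lines. A sketch with keep rates a = 0.2 (large-gradient fraction) and b = 0.1 (sampled fraction of the rest), which are common settings; LightGBM's real implementation differs in detail:

```python
# Gradient-based One-Side Sampling: keep every large-gradient row, randomly
# sample the rest, and reweight the sampled rows by (1 - a) / b so that
# gradient sums stay unbiased estimates of the full-data sums.
import numpy as np

def goss(g, a=0.2, b=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = len(g)
    top_k = int(a * n)
    order = np.argsort(-np.abs(g))
    top = order[:top_k]                        # all large-gradient rows kept
    rest = order[top_k:]
    sampled = rng.choice(rest, size=int(b * n), replace=False)
    weights = np.ones(n)
    weights[sampled] *= (1 - a) / b            # correction multiplier
    idx = np.concatenate([top, sampled])
    return idx, weights[idx]

g = np.random.default_rng(1).normal(size=1000)
idx, w = goss(g)
print(len(idx), "rows kept out of", len(g))
```

The next tree is then trained only on the rows in `idx`, with each row's gradient scaled by its weight.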
CatBoost. Two big ideas, both aimed at fixing target leakage. Ordered target statistics replace one-hot encoding for categorical features: for each row, the target encoding is computed using only the rows that come before it in a random permutation, so the row's own label never leaks into its features. Ordered boosting generalizes this to the boosting loop itself: for each row, the residual that the next tree learns on is computed from a model that was not trained on that row. CatBoost also uses oblivious trees (also called symmetric trees), where every node at the same depth uses the same split. Oblivious trees are weaker individually but extremely fast at inference, since prediction reduces to evaluating a binary index of length depth. CatBoost's defaults are good enough that it is often the best library when a practitioner does not want to spend days on hyperparameter search.
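The ordered-target-statistics idea can be sketched in a few lines. This uses a simple additive-smoothing prior for illustration; CatBoost's actual formula and its averaging over multiple permutations differ in detail:

```python
# Ordered target statistics: encode each row's category using only the labels
# of rows that appear BEFORE it in a random permutation, so a row's own label
# never leaks into its own feature.
import numpy as np

def ordered_target_encoding(cats, y, prior=0.5, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(cats))
    sums, counts = {}, {}
    enc = np.empty(len(cats))
    for i in perm:                                   # walk rows in random order
        c = cats[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        enc[i] = (s + alpha * prior) / (n + alpha)   # only earlier rows used
        sums[c] = s + y[i]            # row's own label added AFTER encoding
        counts[c] = n + 1
    return enc

cats = np.array(["a", "a", "b", "a", "b"])
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
enc = ordered_target_encoding(cats, y)
print(enc)
```

The first row visited in the permutation has seen no history, so it receives the pure prior; later rows get encodings that converge toward the category's running mean.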
scikit-learn HistGradientBoosting. Histogram-based GBT directly inspired by LightGBM, parallelized with OpenMP, and shipped inside scikit-learn since version 0.21 (2019). Native missing-value support, native categorical support (since 1.0), and monotonic constraints. Orders of magnitude faster than the older GradientBoostingClassifier. The main reason to choose it over XGBoost or LightGBM is that it is zero-install if you already have sklearn, and it integrates cleanly with sklearn pipelines and GridSearchCV.
The reasons GBT keeps winning are consistent across domains: high accuracy on heterogeneous tabular features, native handling of missing values and categoricals, fast training, and very fast inference. The speed/accuracy knobs follow the same pattern in every library: raise max_bins and decrease learning_rate for accuracy; reduce num_leaves and use GOSS for speed. Trained models serialize compactly (for example with XGBoost's bst.save_model), or can be compiled to native code via Treelite for low-latency serving.

GBT is not the right tool everywhere: trees cannot extrapolate beyond the range of the training targets, they ignore the spatial and sequential structure that convolutional and transformer architectures exploit, and they offer no transfer learning, which is why images, audio, and raw text remain neural-network territory.
For about a decade, the conventional wisdom has been: tabular data goes to GBT, everything else goes to neural networks. Several papers have tried to overturn the first half of that statement. TabNet (2019) used attention-based feature selection. Hopular (2022) used Hopfield networks. FT-Transformer and SAINT applied transformers to tabular data. NODE used differentiable trees. Each was claimed to beat XGBoost on selected benchmarks at release time.
Grinsztajn, Oyallon, and Varoquaux's 2022 NeurIPS paper "Why do tree-based models still outperform deep learning on typical tabular data?" pushed back hard. They benchmarked tree ensembles (XGBoost, gradient boosting, random forests) against the best deep tabular models on a curated set of 45 datasets, controlling for hyperparameter tuning and dataset size. Trees won on medium-sized data (around 10,000 rows) and stayed ahead even after the neural networks were tuned at much higher cost. The authors traced the gap to three inductive biases: neural networks are biased toward overly smooth solutions while tabular targets are often irregular, they are more harmed by the uninformative features common in tabular data, and their rotation invariance is a poor match for feature axes that each carry individual meaning, which trees exploit directly.
The practical takeaway: on tabular data with under a million rows, GBT should be the first thing tried. Deep tabular models are sometimes worth the cost when there is a huge amount of data, when the problem has multimodal inputs (numerical + text + image), or when transfer learning across tasks is needed.
From roughly 2015 to 2020, GBT (mostly XGBoost, then LightGBM) won almost every Kaggle competition that involved tabular data. The pattern was:
| Year | Competition / context | Winning approach |
|---|---|---|
| 2014 | Higgs Boson Machine Learning Challenge (Kaggle) | XGBoost (Special HEP-meets-ML award to Tianqi Chen and Tong He) |
| 2015 | KDDCup 2015 student dropout prediction | All top-10 teams used XGBoost |
| 2015 | Otto Group Product Classification | XGBoost (Titericz, Semenov) |
| 2016 | Allstate Claims Severity | XGBoost (Bhattacharjee) |
| 2017+ | Most tabular Kaggle competitions | XGBoost, then LightGBM as it matured |
| 2018+ | Categorical-heavy competitions | CatBoost competitive with XGBoost / LightGBM |
| 2020 | M5 Forecasting Competition (Walmart) | LightGBM ensemble (YeonJun In, Kyung Hee University) |
| 2022 | M5 Uncertainty conclusions | Tree-based methods dominated |
The M5 result was particularly striking. M5 was the first M-competition where every top method was a pure ML approach rather than a statistical time-series model, and the winning solution was an equal-weighted average of six LightGBM models trained on different aggregations of the Walmart sales hierarchy. This was a turning point in the forecasting community: hand-tuned ARIMA and ETS were no longer competitive with off-the-shelf GBT.
GBT and random forest are the two dominant tree ensemble families, but they are built on opposite philosophies.
| Aspect | Gradient boosted trees | Random forest |
|---|---|---|
| Training | Sequential, each tree depends on previous trees' errors | Parallel, every tree trained independently on a bootstrap sample |
| Bias-variance focus | Reduces bias aggressively | Reduces variance aggressively |
| Tree depth | Shallow (3 to 10) | Deep (often unlimited) |
| Number of trees | Tuned via early stopping | More is always better, just slower |
| Hyperparameter sensitivity | High | Low |
| Out-of-the-box accuracy | Lower than tuned, but excellent when tuned | Strong even with defaults |
| Risk of overfitting | Real, must regularize | Bagging is self-regularizing |
| Best suited to | Maximizing accuracy | Quick baselines, feature ranking, low-tuning use |
A practical recipe: train a random forest first to get a quick baseline and a sanity-check feature importance, then move to GBT for the production model.
GBT remains the default for almost any tabular ML problem in industry: credit risk scoring in finance, click-through-rate prediction in advertising, learning-to-rank in search engines, and demand forecasting in retail.
GBT is also frequently used as a feature extractor for downstream models. The leaf indices of a trained GBT can be one-hot encoded and fed into a linear model or neural network; this trick was popularized by Facebook in their 2014 ad CTR paper "Practical Lessons from Predicting Clicks on Ads at Facebook," where the combined GBT-plus-logistic system outperformed either alone.
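The leaf-index trick can be sketched with scikit-learn's older GradientBoostingClassifier, whose apply method exposes leaf indices (the dataset and model sizes here are illustrative):

```python
# GBT leaves as features: each tree maps a row to one leaf; one-hot encoding
# the (n_samples, n_trees) leaf-index matrix gives a sparse binary feature
# set for a downstream linear model, as in the Facebook 2014 CTR paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
gbt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
gbt.fit(X, y)

leaves = gbt.apply(X)[:, :, 0]         # (n_samples, n_trees) leaf indices
onehot = OneHotEncoder().fit_transform(leaves)   # one binary column per leaf
lr = LogisticRegression(max_iter=1000).fit(onehot, y)
print("training accuracy of LR on GBT leaves:", lr.score(onehot, y))
```

In production the encoder must be fit on training data only and applied with handle_unknown="ignore" so unseen leaves at serving time do not raise errors.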
A minimal training loop in Python with sklearn's HistGradientBoostingClassifier:
```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml

X, y = fetch_openml("adult", version=2, return_X_y=True, as_frame=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
clf = HistGradientBoostingClassifier(
    max_iter=2000,
    learning_rate=0.05,
    max_leaf_nodes=63,
    min_samples_leaf=20,
    l2_regularization=1.0,
    early_stopping=True,
    n_iter_no_change=20,
    validation_fraction=0.1,
    categorical_features="from_dtype",
    random_state=0,
)
clf.fit(X_tr, y_tr)
print("validation accuracy:", clf.score(X_va, y_va))
```
The equivalent in XGBoost looks similar but with explicit early_stopping_rounds:
```python
import xgboost as xgb

dtrain = xgb.DMatrix(X_tr, label=(y_tr == ">50K").astype(int), enable_categorical=True)
dvalid = xgb.DMatrix(X_va, label=(y_va == ">50K").astype(int), enable_categorical=True)
params = {
    "objective": "binary:logistic",
    "tree_method": "hist",
    "learning_rate": 0.05,
    "max_depth": 6,
    "min_child_weight": 1.0,
    "reg_lambda": 1.0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "eval_metric": "logloss",
}
bst = xgb.train(
    params,
    dtrain,
    num_boost_round=2000,
    evals=[(dvalid, "val")],
    early_stopping_rounds=20,
    verbose_eval=100,
)
```
The parameters that matter most are essentially the same in every library: a small learning rate, a moderate-sized tree (depth 6 or num_leaves=31 to 63), modest subsampling, an L2 penalty, and early stopping on a held-out validation set.
Scott Lundberg and Su-In Lee's 2017 NeurIPS paper "A Unified Approach to Interpreting Model Predictions" introduced SHAP (SHapley Additive exPlanations), and Lundberg, Erion, and Lee's 2019 follow-up gave TreeSHAP, an algorithm that computes exact Shapley values for tree ensembles in polynomial time. TreeSHAP runs in O(TLD²) for an ensemble of T trees with up to L leaves and depth D, instead of the exponential cost of model-agnostic SHAP. Every major GBT library now ships with TreeSHAP integration: model.predict(X, pred_contribs=True) in XGBoost, model.predict(X, pred_contrib=True) in LightGBM, and model.get_feature_importance(type="ShapValues") in CatBoost. SHAP has effectively replaced gain-based importance as the default explanation tool for GBT models in industry, especially in regulated settings (credit, insurance, healthcare) where per-prediction attribution is required.
GBT is one node in a larger family: