# Gradient boosted (decision) trees (GBT)

> Source: https://aiwiki.ai/wiki/gradient_boosted_decision_trees_gbt
> Updated: 2026-06-23
> Categories: Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms), [Gradient boosting](/wiki/gradient_boosting)*

**Gradient boosted decision trees** (GBT, also written GBDT, GBM, or GBRT) is an [ensemble method](/wiki/ensemble_method) that builds a strong predictor by sequentially fitting many shallow [decision trees](/wiki/decision_tree) to the negative gradient of a loss function, so each new tree corrects the residual errors of the running model and the final prediction is the sum of every tree's small contribution. The framework was formalized by [Jerome H. Friedman](/wiki/friedman) in his 2001 Annals of Statistics paper "Greedy Function Approximation: A Gradient Boosting Machine," which views function estimation "from the perspective of numerical optimization in function space, rather than parameter space" and makes "a connection between stagewise additive expansions and steepest-descent minimization."[1] The 2001 paper has been cited more than 26,000 times.[1] GBT remains the dominant supervised learning method for tabular data, and a 2022 NeurIPS benchmark found that "tree-based models remain state-of-the-art on medium-sized data (~10K samples)" even after extensive tuning of competing neural networks.[9]

GBT is implemented in three industry-standard libraries that account for the vast majority of real-world deployments: [XGBoost](/wiki/xgboost) (Tianqi Chen, first released March 2014), [LightGBM](/wiki/lightgbm) (Microsoft Research, 2016), and [CatBoost](/wiki/catboost) (Yandex, 2017). The [scikit-learn](/wiki/scikit-learn) `HistGradientBoostingClassifier` and `HistGradientBoostingRegressor` classes (added in scikit-learn 0.21, 2019) provide a fourth widely used option bundled with the most popular general-purpose ML library in Python. Together these implementations have set the state of the art on benchmark after benchmark for tabular problems, and the 2022 Grinsztajn et al. NeurIPS paper "Why do tree-based models still outperform deep learning on typical tabular data?" confirmed that the gap with neural networks is not closing.[9]

## eli5 (explain like i'm five)

Imagine you are guessing how heavy a pumpkin is. Your first guess is just "the average pumpkin weight," so you are off by a few pounds. A friend looks at your error and writes a tiny rule: "if the pumpkin is wider than 30 cm, add 4 pounds." Now your guess is closer. A second friend looks at the new error and adds another small rule. After a hundred friends, each adding a tiny correction, your combined guess is very accurate.

That is what GBT does. Each "friend" is a small [decision tree](/wiki/decision_tree). The trees are weak on their own (often only a few splits deep), but their sum is very strong because each one is built specifically to fix the mistakes left by all the earlier trees. The "gradient" part means each new tree is told exactly which direction to push every prediction in order to make the loss go down.

## How did gradient boosting develop? (history)

### from adaboost to gradient boosting

The boosting idea predates gradient boosting by about a decade. Rob Schapire proved in 1990 that any weak learner (one that does only slightly better than random) could be combined with others into a strong learner. Yoav Freund and Schapire turned this into a working algorithm in 1995 with [AdaBoost](/wiki/adaboost) (Adaptive Boosting), published as "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting" in the Journal of Computer and System Sciences in 1997.[3] AdaBoost reweights training examples after each round so that the next weak learner focuses on the examples the previous learners got wrong. Freund and Schapire received the 2003 Godel Prize for this work.

Leo Breiman noted in his 1998 "Arcing Classifiers" paper that AdaBoost could be interpreted as the optimization of an exponential loss function.[5] This was the conceptual key. If boosting is just optimization, the loss function should not be limited to the exponential, and the optimization step should not be limited to reweighting. Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean made this explicit in their 1999 NeurIPS paper "Boosting Algorithms as Gradient Descent," which introduced the functional gradient view.[4]

### friedman 2001 and 2002

Jerome Friedman put the framework on solid ground in 2001 with "Greedy Function Approximation: A Gradient Boosting Machine," published in The Annals of Statistics 29(5), pages 1189-1232.[1] He framed boosting as steepest descent in function space, derived the algorithm for any differentiable loss, and worked out the special case where the weak learner is a regression tree (which he called TreeBoost). The 2001 paper also introduced shrinkage, scaling each tree's contribution by a small learning rate, and showed empirically that small learning rates with many trees beat large learning rates with few trees almost universally.[1]

In 2002 Friedman followed up with "Stochastic Gradient Boosting," which trains each tree on a random subsample of the data. Borrowing from [random forest](/wiki/random_forest) bagging, this single change usually improves accuracy and cuts training time substantially.[2] Stochastic gradient boosting (often combined with column subsampling) is the default mode in every modern implementation.

### the kaggle era

The combination of fast machines, large datasets, and [Kaggle](/wiki/kaggle) competitions made GBT the most consequential ML method of the 2010s. [XGBoost](/wiki/xgboost) was first released by Tianqi Chen on March 27, 2014, as a command-line tool, and after Chen and Tong He used it to win the special High Energy Physics meets Machine Learning Award in the [Higgs Boson](/wiki/higgs_boson) Kaggle challenge that summer, it was wrapped in Python and R packages.[13] Among the 29 challenge-winning solutions published on Kaggle's blog during 2015, 17 used XGBoost (the next most common method, deep neural networks, appeared in 11), and every team in the KDDCup 2015 top 10 used it.[6] Chen and Carlos Guestrin published the formal paper "XGBoost: A Scalable Tree Boosting System" at KDD 2016.[6] Microsoft Research released LightGBM in 2016 with the paper following at NeurIPS 2017,[7] Yandex open-sourced CatBoost on July 18, 2017 with the paper at NeurIPS 2018,[8] and the M5 Forecasting Competition (Walmart sales, 2020) was won by an undergraduate using an ensemble of LightGBM models.[14]

## How does the gradient boosting algorithm work?

GBT builds an additive model in a forward stage-wise way. After M rounds the model has the form:

```
F_M(x) = F_0(x) + Σ_{m=1..M} ν · γ_m · h_m(x)
```

where `F_0(x)` is a constant initialization, each `h_m(x)` is a regression tree fit at iteration m, `γ_m` is a per-tree (or per-leaf) step size, and `ν` is the global shrinkage factor (the learning rate, usually 0.01 to 0.1). The recipe for one round is:

1. **Initialize** with the constant that minimizes the loss on the training set. For squared error this is the mean of `y`; for log loss it is the log-odds of the positive class.
2. **For m = 1 to M:**
   1. Compute the negative gradient (the pseudo-residual) for every training point: `r_im = -∂L(y_i, F(x_i)) / ∂F(x_i)` evaluated at `F = F_{m-1}`.
   2. Fit a regression tree `h_m` to the targets `(x_i, r_im)`.
   3. (TreeBoost refinement.) For each leaf region `R_jm`, find the leaf value `b_jm` by line search on the original loss, not on the squared-error fit.
   4. Update: `F_m(x) = F_{m-1}(x) + ν · h_m(x)`.
3. **Return** `F_M`.

For squared error the negative gradient is just the residual `y_i - F(x_i)`, so gradient boosting reduces to repeatedly fitting trees on the leftover error. For log loss the negative gradient is `y_i - p_i`, the difference between the true label and the predicted probability. This unifies regression and classification under one framework.[16]

### loss functions

Any differentiable loss works. The common choices are:

| Loss | Task | Negative gradient |
|---|---|---|
| Squared error | Regression | y_i - F(x_i) |
| Absolute error (L1) | Robust regression | sign(y_i - F(x_i)) |
| Huber | Regression with outliers | residual if &#124;r&#124; ≤ δ, else δ · sign(r) |
| Quantile loss | Quantile / interval prediction | τ - 1{r < 0} |
| Log loss | Binary classification | y_i - σ(F(x_i)) |
| Multinomial deviance | Multi-class | y_i,k - softmax_k(F(x_i)) |
| Poisson | Count regression | y_i - exp(F(x_i)) |
| Gamma | Strictly positive targets | (y_i - μ_i) / μ_i |
| LambdaRank | Learning to rank | Pairwise λ-gradients on NDCG |
| Cox | Survival | Partial-likelihood gradient |

### newton boosting

XGBoost popularized a refinement called Newton boosting.[6] Instead of just the first-order gradient, it uses a second-order Taylor expansion of the loss at `F_{m-1}`:

```
L ≈ Σ [g_i · h_m(x_i) + ½ · h_i · h_m(x_i)²] + Ω(h_m)
```

where `g_i` is the gradient and `h_i` is the Hessian (the second derivative) of the loss. With this, the optimal leaf value has a closed form: `w_j* = -G_j / (H_j + λ)` where G and H are the sums of gradients and Hessians in leaf j. The split-finding gain is `½ · (G_L²/(H_L+λ) + G_R²/(H_R+λ) - G²/(H+λ)) - γ`, with γ a per-leaf complexity penalty. LightGBM and CatBoost use the same Newton-style update; the original Friedman algorithm uses only first-order information plus a line search.

## Which hyperparameters matter most?

GBT is famously sensitive to hyperparameters. The interaction between learning rate, number of trees, and tree depth is where most of the tuning effort goes.

| Hyperparameter | Typical range | What it does |
|---|---|---|
| n_estimators / num_boost_round | 100 to 10,000 | Total number of trees. Use early stopping to find it. |
| learning_rate (eta, ν) | 0.01 to 0.3 | Shrinks each tree's contribution. Lower = more trees needed. |
| max_depth | 3 to 10 | Maximum tree depth. Deep trees model interactions but overfit. |
| num_leaves | 15 to 255 | LightGBM's leaf-wise budget; controls capacity directly. |
| min_child_weight / min_samples_leaf | 1 to 100+ | Minimum sum of Hessians (or count) per leaf. Smooths predictions. |
| subsample | 0.5 to 1.0 | Row subsampling for stochastic boosting. |
| colsample_bytree | 0.5 to 1.0 | Feature subsampling per tree. |
| reg_alpha (L1) | 0 to 10 | L1 penalty on leaf weights. |
| reg_lambda (L2) | 0 to 10 | L2 penalty on leaf weights. |
| gamma (XGBoost) | 0 to 5 | Minimum loss reduction to make a split. |
| early_stopping_rounds | 10 to 100 | Stop if validation metric does not improve for this many rounds. |
| max_bins | 64 to 511 | Histogram resolution. More bins = slightly better, slower. |

A common pattern is to fix a small learning rate (0.05), set `n_estimators` to something large like 5000, and let early stopping pick the actual count on a validation set. Tree depth is usually swept first, then regularization (`reg_lambda`, `min_child_weight`), then subsampling.

## How do XGBoost, LightGBM, and CatBoost compare?

The four implementations below cover essentially every modern GBT use case. Each has subtly different defaults and tradeoffs.

| Feature | XGBoost | LightGBM | CatBoost | sklearn HistGradientBoosting |
|---|---|---|---|---|
| First release | 2014 | 2016 | 2017 | 2019 (scikit-learn 0.21) |
| Reference paper | Chen & Guestrin (KDD 2016) | Ke et al. (NeurIPS 2017) | Prokhorenkova et al. (NeurIPS 2018) | Inspired by LightGBM |
| Origin | Tianqi Chen (DMLC) | Microsoft Research | Yandex | scikit-learn project |
| Default tree growth | Level-wise (depthwise); leaf-wise via `grow_policy` | Leaf-wise | Symmetric (oblivious) | Leaf-wise |
| Split finding | Exact + histogram (`hist`) + GPU | Histogram + GOSS | Histogram | Histogram |
| Categorical features | One-hot or `enable_categorical=True` (recent) | Native, optimal split via Fisher 1958 | Ordered target statistics, native | Native, partition-based |
| Missing values | Native (learned default direction) | Native | Native | Native |
| Newton boosting | Yes (default) | Yes | Yes | Yes |
| GPU | Yes (`tree_method=hist`, `device=cuda`) | Yes | Yes | No (CPU only, OpenMP) |
| Distributed training | Yes (Dask, Spark, Ray) | Yes (MPI, Dask, Spark) | Yes (CPU/GPU multi-host) | No |
| Monotonic constraints | Yes | Yes | Yes | Yes |
| SHAP integration | Yes (TreeSHAP built in) | Yes | Yes | Yes (via `shap` library) |
| Notable speed trick | Sparsity-aware split, column block | GOSS + EFB + leaf-wise | Symmetric trees for fast inference | OpenMP parallel histograms |
| Best at | General-purpose, large competitions | Largest datasets, fast iteration | Categorical-heavy data, robust defaults | Anyone already using sklearn |

Other implementations worth knowing about: H2O GBM (Java, used in enterprise), Apache Spark MLlib's `GBTClassifier` (for distributed JVM workloads), TensorFlow Decision Forests / Yggdrasil Decision Forests (Google's open-source library based on the system that ran inside Google for years), and the legacy `sklearn.ensemble.GradientBoostingClassifier` (pre-histogram, slow on anything over 10,000 samples; usually replaced by `HistGradientBoosting`).

### key innovations by library

**XGBoost.** Newton boosting with explicit L1 + L2 + leaf-count regularization in the objective. Sparsity-aware split finding that learns a default branch for missing values, reportedly 50x faster than the naive approach on sparse data.[6] Cache-aware block storage for parallel histogram building. The paper's abstract states that "XGBoost scales beyond billions of examples using far fewer resources than existing systems."[6] The original `tree_method=exact` mode does pre-sorted split finding; the `hist` mode added in 2017 uses histograms and is now the default.[18]

**LightGBM.** Three innovations stacked: histogram-based split finding (bin into 255 buckets, compute splits in O(bins) instead of O(rows)), leaf-wise tree growth (always split the leaf with the highest loss reduction, instead of growing level by level), and Gradient-based One-Side Sampling (GOSS) plus Exclusive Feature Bundling (EFB). GOSS keeps every example with a large gradient and randomly subsamples the small-gradient ones, with a correction multiplier to keep the sums unbiased. EFB groups mutually-exclusive sparse features into single bundles, useful when one-hot encodings have made the feature count explode. The paper reports that "LightGBM speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy."[7] Together these tricks make LightGBM the fastest of the three on the largest datasets.[19]

**CatBoost.** Two big ideas, both aimed at fixing target leakage. **Ordered target statistics** replace one-hot encoding for categorical features: for each row, the target encoding is computed using only the rows that come before it in a random permutation, so the row's own label never leaks into its features. **Ordered boosting** generalizes this to the boosting loop itself: for each row, the residual that the next tree learns on is computed from a model that was not trained on that row. CatBoost also uses **oblivious trees** (also called symmetric trees), where every node at the same depth uses the same split. Oblivious trees are weaker individually but extremely fast at inference, since prediction reduces to evaluating a binary index of length depth.[20] CatBoost's defaults are good enough that it is often the best library when a practitioner does not want to spend days on hyperparameter search.[8]

**scikit-learn HistGradientBoosting.** Histogram-based GBT directly inspired by LightGBM, parallelized with OpenMP, and shipped inside scikit-learn since version 0.21 (2019), becoming a non-experimental estimator in version 1.0.[17] Native missing-value support, native categorical support (since 1.0), and monotonic constraints. Orders of magnitude faster than the older `GradientBoostingClassifier`. The main reason to choose it over XGBoost or LightGBM is that it is zero-install if you already have sklearn, and it integrates cleanly with sklearn pipelines and GridSearchCV.

## What are the strengths of gradient boosted trees?

The reasons GBT keeps winning:

- **State of the art on tabular data.** The Grinsztajn et al. 2022 NeurIPS paper benchmarked tree ensembles against TabNet, FT-Transformer, ResNet, SAINT, and other deep tabular models on 45 medium-sized datasets. Tree ensembles still won, even after extensive hyperparameter tuning of the neural networks.[9]
- **Mixed feature types out of the box.** Numerical, categorical, ordinal, and missing values can be passed without scaling, encoding, or imputation. CatBoost and LightGBM handle categoricals natively; XGBoost and HistGradientBoosting are catching up.
- **Robust to feature scaling.** Tree splits are invariant to monotonic transformations, so log-transforming or standardizing inputs has no effect.
- **Built-in feature importance.** Gain-based importance, split-count importance, and TreeSHAP (Lundberg, Erion & Lee 2019) all give per-feature attributions in polynomial time.[11]
- **Strong with missing values.** Modern GBT learns a default direction for missing values at each split; in some datasets the missingness pattern is itself predictive and the model picks this up automatically.
- **Tunable speed-accuracy tradeoff.** Increase `max_bins` and decrease `learning_rate` for accuracy; reduce `num_leaves` and use GOSS for speed.
- **Mature production tooling.** Models can be exported to ONNX, embedded in C/C++ via XGBoost's `bst.save_model`, or compiled to native code via Treelite for low-latency serving.

## What are the weaknesses of gradient boosted trees?

GBT is not the right tool everywhere:

- **Unstructured data is not its game.** For images, audio, and natural language, deep learning crushes GBT. The inductive bias of axis-aligned splits is wrong for pixel grids and token sequences.
- **Hyperparameter sensitivity.** A poorly tuned GBT can be much worse than a default [random forest](/wiki/random_forest). The interaction between learning rate, depth, and tree count traps newcomers.
- **Sequential by construction.** Each tree depends on the previous trees' predictions, so the boosting loop itself does not parallelize across iterations (though split-finding within a single tree does).
- **Overfits without regularization.** Without shrinkage, subsampling, and early stopping, GBT will memorize noise. The exponential loss in AdaBoost and the log loss in GBT are both vulnerable to outliers.
- **Inference latency.** A model with 5,000 trees at depth 8 is not free to evaluate. CatBoost's oblivious trees and Treelite compilation help, but neural networks with batch GPU inference can sometimes be cheaper at scale.
- **Less interpretable as the ensemble grows.** A single tree is a flowchart; 5,000 trees are not. SHAP recovers some interpretability but at a cost.

## Why do tree models still beat deep learning on tabular data?

For about a decade, the conventional wisdom has been: tabular data goes to GBT, everything else goes to neural networks. Several papers have tried to overturn the first half of that statement. TabNet (2019) used attention-based feature selection. Hopular (2022) used Hopfield networks. FT-Transformer and SAINT applied transformers to tabular data. NODE used differentiable trees. Each was claimed to beat XGBoost on selected benchmarks at release time.

Grinsztajn, Oyallon, and Varoquaux's 2022 NeurIPS paper "Why do tree-based models still outperform deep learning on typical tabular data?" (Datasets and Benchmarks Track) pushed back hard. They benchmarked tree ensembles (XGBoost, gradient boosting, random forests) against the best deep tabular models on a curated set of 45 datasets, controlling for hyperparameter tuning and dataset size. The paper concludes that "tree-based models remain state-of-the-art on medium-sized data (~10K samples) even without accounting for their superior speed."[9] The authors traced the gap to three inductive biases, framed as challenges for anyone building tabular-specific neural networks, which should "1. be robust to uninformative features, 2. preserve the orientation of the data, and 3. be able to easily learn irregular functions":[9]

1. Tree models are robust to uninformative features; neural networks are not.
2. Tree models preserve the orientation of the data, since splits are axis-aligned; neural networks blur it through linear layers.
3. Tree models can fit irregular, piecewise-constant functions easily; neural networks bias toward smooth functions.

The practical takeaway: on tabular data with under a million rows, GBT should be the first thing tried. Deep tabular models are sometimes worth the cost when there is a huge amount of data, when the problem has multimodal inputs (numerical + text + image), or when transfer learning across tasks is needed.

## How did GBT dominate Kaggle and forecasting competitions?

From roughly 2015 to 2020, GBT (mostly XGBoost, then LightGBM) won almost every Kaggle competition that involved tabular data. The pattern was:

| Year | Competition / context | Winning approach |
|---|---|---|
| 2014 | Higgs Boson Machine Learning Challenge ([Kaggle](/wiki/kaggle)) | XGBoost (Special HEP-meets-ML award to Tianqi Chen and Tong He) |
| 2015 | KDDCup 2015 student dropout prediction | All top-10 teams used XGBoost |
| 2015 | Otto Group Product Classification | XGBoost (Titericz, Semenov) |
| 2016 | Allstate Claims Severity | XGBoost (Bhattacharjee) |
| 2017+ | Most tabular Kaggle competitions | XGBoost, then LightGBM as it matured |
| 2018+ | Categorical-heavy competitions | CatBoost competitive with XGBoost / LightGBM |
| 2020 | [M5 Forecasting Competition](/wiki/m5_competition) (Walmart) | LightGBM ensemble (YeonJun In, Kyung Hee University) |
| 2022 | M5 Uncertainty conclusions | Tree-based methods dominated |

The M5 result was particularly striking. M5 was the first M-competition where every top method was a pure ML approach rather than a statistical time-series model. The winning solution by YeonJun In (a senior undergraduate at Kyung Hee University, competing as "YJ_STU") built 220 LightGBM models in total, trained on data pooled per store, per store-category, and per store-department; each Walmart sales series was then forecast as the equal-weighted average of six of those models, optimized on the Tweedie loss.[14] This was a turning point in the forecasting community: hand-tuned ARIMA and ETS were no longer competitive with off-the-shelf GBT.

## How does random forest differ from GBT?

GBT and [random forest](/wiki/random_forest) are the two dominant tree ensemble families, but they are built on opposite philosophies.

| Aspect | Gradient boosted trees | Random forest |
|---|---|---|
| Training | Sequential, each tree depends on previous trees' errors | Parallel, every tree trained independently on a bootstrap sample |
| Bias-variance focus | Reduces bias aggressively | Reduces variance aggressively |
| Tree depth | Shallow (3 to 10) | Deep (often unlimited) |
| Number of trees | Tuned via early stopping | More is always better, just slower |
| Hyperparameter sensitivity | High | Low |
| Out-of-the-box accuracy | Lower than tuned, but excellent when tuned | Strong even with defaults |
| Risk of overfitting | Real, must regularize | Bagging is self-regularizing |
| Best suited to | Maximizing accuracy | Quick baselines, feature ranking, low-tuning use |

A practical recipe: train a random forest first to get a quick baseline and a sanity-check feature importance, then move to GBT for the production model.

## Where is GBT used in 2026? (modern context)

GBT remains the default for almost any tabular ML problem in industry. Concrete examples:

- **Credit scoring.** Banks and fintech use XGBoost or LightGBM for default prediction, often constrained with monotonic constraints to keep the model auditable.
- **Fraud detection.** Card networks and payment processors use GBT for transaction-level scoring, where latency matters and feature engineering is rich.
- **Search ranking.** LambdaMART (gradient boosting on pairwise NDCG gradients) was used at Yahoo, Bing, and Yandex; LightGBM and XGBoost both ship LambdaMART as a built-in objective.[12]
- **Click-through-rate prediction.** Many ad-tech systems still use GBT, although deep learning (DLRM, Wide & Deep) has gained ground for the largest-scale CTR problems.
- **Recommendation systems.** GBT often serves as the second-stage ranker on top of candidate generators (collaborative filtering, two-tower models).
- **Forecasting.** Demand forecasting in retail, supply chain, and energy is typically LightGBM since the M5 results.
- **Healthcare.** Electronic health record models for sepsis prediction, readmission risk, and ICU mortality are commonly gradient-boosted.
- **High-energy physics.** Boosted decision trees are still the workhorse classifier for separating signal from background in CERN experiments.

GBT is also frequently used as a **feature extractor** for downstream models. The leaf indices of a trained GBT can be one-hot encoded and fed into a linear model or neural network; this trick was popularized by Facebook in their 2014 ad CTR paper "Practical Lessons from Predicting Clicks on Ads at Facebook," where the combined GBT-plus-logistic system outperformed either alone.[15]

## implementation example

A minimal training loop in Python with sklearn's `HistGradientBoostingClassifier`:

```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_openml

X, y = fetch_openml("adult", version=2, return_X_y=True, as_frame=True)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, stratify=y)

clf = HistGradientBoostingClassifier(
    max_iter=2000,
    learning_rate=0.05,
    max_leaf_nodes=63,
    min_samples_leaf=20,
    l2_regularization=1.0,
    early_stopping=True,
    n_iter_no_change=20,
    validation_fraction=0.1,
    categorical_features="from_dtype",
    random_state=0,
)
clf.fit(X_tr, y_tr)
print("validation accuracy:", clf.score(X_va, y_va))
```

The equivalent in XGBoost looks similar but with explicit `early_stopping_rounds`:

```python
import xgboost as xgb

dtrain = xgb.DMatrix(X_tr, label=(y_tr == ">50K").astype(int), enable_categorical=True)
dvalid = xgb.DMatrix(X_va, label=(y_va == ">50K").astype(int), enable_categorical=True)

params = {
    "objective": "binary:logistic",
    "tree_method": "hist",
    "learning_rate": 0.05,
    "max_depth": 6,
    "min_child_weight": 1.0,
    "reg_lambda": 1.0,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "eval_metric": "logloss",
}
bst = xgb.train(
    params, dtrain,
    num_boost_round=2000,
    evals=[(dvalid, "val")],
    early_stopping_rounds=20,
    verbose_eval=100,
)
```

The parameters that matter most are essentially the same in every library: a small learning rate, a moderate-sized tree (depth 6 or `num_leaves=31` to 63), modest subsampling, an L2 penalty, and early stopping on a held-out validation set.

## How are GBT models explained with SHAP?

Scott Lundberg and Su-In Lee's 2017 NeurIPS paper "A Unified Approach to Interpreting Model Predictions" introduced SHAP (SHapley Additive exPlanations),[10] and Lundberg, Erion, and Lee's 2019 follow-up gave **TreeSHAP**, an algorithm that computes exact Shapley values for tree ensembles in polynomial time. TreeSHAP runs in `O(TLD²)` for an ensemble of T trees with up to L leaves and depth D, instead of the exponential cost of model-agnostic SHAP.[11] Every major GBT library now ships with TreeSHAP integration: `model.predict(X, pred_contribs=True)` in XGBoost, `model.predict(X, pred_contrib=True)` in LightGBM, and `model.get_feature_importance(type="ShapValues")` in CatBoost. SHAP has effectively replaced gain-based importance as the default explanation tool for GBT models in industry, especially in regulated settings (credit, insurance, healthcare) where per-prediction attribution is required.

## relationship to other methods

GBT is one node in a larger family:

- **[AdaBoost](/wiki/adaboost)** is the original boosting algorithm; gradient boosting generalizes it to arbitrary differentiable losses.
- **[Random forest](/wiki/random_forest)** is the bagging cousin; it averages many high-variance trees instead of summing many low-variance ones.
- **Stacking** combines heterogeneous models with a meta-learner; GBT is often the meta-learner.
- **Functional gradient descent** is the broader perspective: any time you fit a new function to the negative gradient of a loss and add it to the running model, you are doing gradient boosting. The base learner does not have to be a tree (it can be a neural network or a linear model), but trees dominate in practice because they are fast, handle mixed data, and capture interactions.
- **[Stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd)** is the parameter-space analogue. SGD does steepest descent on parameters; GBT does steepest descent on functions.

## References

1. Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." *The Annals of Statistics*, 29(5), 1189-1232. doi:10.1214/aos/1013203451. (Cited more than 26,000 times per Semantic Scholar, 2026.)
2. Friedman, J. H. (2002). "Stochastic Gradient Boosting." *Computational Statistics and Data Analysis*, 38(4), 367-378.
3. Freund, Y. and Schapire, R. E. (1997). "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting." *Journal of Computer and System Sciences*, 55(1), 119-139. (Godel Prize, 2003.)
4. Mason, L., Baxter, J., Bartlett, P., and Frean, M. (1999). "Boosting Algorithms as Gradient Descent." *Advances in Neural Information Processing Systems*, 12.
5. Breiman, L. (1998). "Arcing Classifiers." *The Annals of Statistics*, 26(3), 801-849.
6. Chen, T. and Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 785-794. arXiv:1603.02754.
7. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." *Advances in Neural Information Processing Systems*, 30, 3149-3157.
8. Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. (2018). "CatBoost: Unbiased Boosting with Categorical Features." *Advances in Neural Information Processing Systems*, 31. (Library open-sourced by Yandex on July 18, 2017.)
9. Grinsztajn, L., Oyallon, E., and Varoquaux, G. (2022). "Why Do Tree-Based Models Still Outperform Deep Learning on Typical Tabular Data?" *Advances in Neural Information Processing Systems 35: Datasets and Benchmarks Track*. arXiv:2207.08815.
10. Lundberg, S. M. and Lee, S.-I. (2017). "A Unified Approach to Interpreting Model Predictions." *Advances in Neural Information Processing Systems*, 30.
11. Lundberg, S. M., Erion, G. G., and Lee, S.-I. (2019). "Consistent Individualized Feature Attribution for Tree Ensembles." arXiv:1802.03888.
12. Burges, C. J. C. (2010). "From RankNet to LambdaRank to LambdaMART: An Overview." *Microsoft Research Technical Report* MSR-TR-2010-82.
13. Chen, T. and He, T. (2014). "Higgs Boson Discovery with Boosted Trees." *NeurIPS Workshop on High-Energy Physics and Machine Learning*. PMLR 42:69-80.
14. Makridakis, S., Spiliotis, E., and Assimakopoulos, V. (2022). "M5 Accuracy Competition: Results, Findings, and Conclusions." *International Journal of Forecasting*, 38(4), 1346-1364.
15. He, X., Pan, J., Jin, O., et al. (2014). "Practical Lessons from Predicting Clicks on Ads at Facebook." *Proceedings of the Eighth International Workshop on Data Mining for Online Advertising* (ADKDD), 1-9.
16. Hastie, T., Tibshirani, R., and Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction.* 2nd ed. Springer.
17. scikit-learn developers (2026). "Histogram-Based Gradient Boosting." scikit-learn documentation. (HistGradientBoosting added in scikit-learn 0.21, 2019; non-experimental since 1.0.)
18. XGBoost developers (2026). "Introduction to Boosted Trees." XGBoost documentation.
19. LightGBM developers (2026). "LightGBM Features." LightGBM documentation.
20. CatBoost developers (2026). "CatBoost Algorithm." CatBoost documentation.

