See also: Gradient boosted (decision) trees, Gradient Boosting, LightGBM, CatBoost
XGBoost (short for eXtreme Gradient Boosting) is an open-source software library that implements an optimized, distributed version of gradient boosted decision trees. It was created by Tianqi Chen in 2014 as a research project at the University of Washington and was popularized by the 2016 paper "XGBoost: A Scalable Tree Boosting System," co-authored with Carlos Guestrin and presented at the ACM SIGKDD conference. The library combines a regularized formulation of gradient boosting with a set of systems-level engineering techniques (sparsity-aware splitting, a weighted quantile sketch, a cache-aware block structure, distributed training, and GPU acceleration) that together let it train on billions of examples while routinely topping leaderboards on tabular machine learning problems.
For most of the period from 2014 through 2018, XGBoost was the single most cited algorithm in the winning solutions of Kaggle competitions, and even in 2026 it remains a default first choice for tabular regression and classification in finance, advertising, fraud detection, and risk modeling. The library is written in C++ and exposes idiomatic bindings for Python, R, Julia, Java, Scala, Ruby, Swift, and the JVM ecosystem (Spark, Flink, Hadoop), with a scikit-learn-compatible API and first-class support for distributed training on Apache Spark, Dask, and Ray. It is hosted by the Distributed Machine Learning Community (DMLC) on GitHub at dmlc/xgboost and has more than 26,000 GitHub stars as of 2026.
Imagine you are trying to predict how much rain will fall tomorrow. You ask a hundred friends to make tiny guesses, and you add their guesses together to get a final answer. The trick is that each new friend is only allowed to look at the mistakes the previous friends made, and they have to write a small rule that fixes those mistakes a tiny bit. After a hundred friends, the total guess becomes very accurate.
XGBoost is a computer program that does this very fast and very well. Each "friend" is a small decision tree that splits the world into a few buckets and gives a number to each bucket. XGBoost adds a few extra clever ideas: it punishes friends who try to make rules that are too complicated, it skips empty data without complaining, and it uses every CPU core (and every GPU) on your computer at the same time so the whole thing finishes in seconds instead of hours.
XGBoost began as a research project by Tianqi Chen, then a PhD student at the University of Washington under the supervision of Carlos Guestrin in the Paul G. Allen School of Computer Science. Chen pushed the first commit to a public Git repository on March 27, 2014. The initial release was a small command-line C++ program that read its configuration from text files and its data in libsvm format. Chen built it to study how a single, well-engineered implementation of gradient tree boosting could scale to industrial workloads after running into the slow performance of the existing R gbm package on a Kaggle problem.
The project was hosted under the Distributed (Deep) Machine Learning Community (DMLC), an umbrella that Chen co-founded along with collaborators including Mu Li, Naiyan Wang, and Bing Xu. DMLC also hatched MXNet, TVM, and other systems that would later become Apache top-level projects.
XGBoost first attracted broad attention through the Higgs Boson Machine Learning Challenge, which ran on Kaggle from May to September 2014. The challenge invited data scientists to help physicists at CERN's Large Hadron Collider distinguish Higgs boson decay events from background noise. Tianqi Chen released a public XGBoost benchmark script for the challenge that outperformed almost every other submission on the leaderboard for weeks. Multiple top-10 finishers credited XGBoost in their solutions, and the library was singled out at the award ceremony held at the NIPS 2014 workshop on high-energy physics and machine learning. By the time the competition closed, the project had jumped from a small research codebase to one of the most discussed tools in competitive machine learning.
The Higgs win catalyzed two follow-on developments. First, Bing Xu wrote the Python bindings (xgboost on PyPI) and Tong He wrote the R bindings (xgboost on CRAN), which together turned the project from a niche command-line tool into a library that could be plugged into any data scientist's workflow. Second, the team began to invest seriously in distributed training, leading to the early Apache Spark and YARN integrations.
In March 2016, Chen and Guestrin posted the now-famous paper "XGBoost: A Scalable Tree Boosting System" to arXiv (1603.02754) and presented it at the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining in San Francisco. The paper formalized the algorithmic improvements behind the library: a regularized objective derived from a second-order Taylor expansion, a sparsity-aware split-finding algorithm that learns a default direction for missing values, a weighted quantile sketch for approximate split-point selection on huge datasets, and a cache-aware columnar block layout for parallel and out-of-core computation.
The paper has since been cited more than 50,000 times (Google Scholar, 2026), making it one of the most cited works in applied machine learning. In 2016, Chen and Tong He received the John Chambers Award from the American Statistical Association for their contributions to statistical software through the xgboost R package, and the team had earlier been recognized with the High Energy Physics meets Machine Learning Award for the role XGBoost played in the Higgs challenge.
Between 2015 and 2018, XGBoost was the most popular algorithm on Kaggle. A Kaggle blog survey of the 29 competition-winning solutions published in 2015 reported that 17 used XGBoost, eight as the sole model and the rest as a component of an ensemble. By comparison, deep neural networks, the second most popular method, appeared in only 11 of those winning solutions. The library became a kind of professional standard: tutorials, kernels, and Coursera courses all assumed it as the default tabular baseline.
This dominance prompted the creation of competing implementations. Microsoft Research released LightGBM in 2016 with leaf-wise growth and a more aggressive histogram algorithm that delivered substantial speedups on very wide datasets. Yandex released CatBoost in 2017 with a focus on categorical-feature handling and ordered boosting. Scikit-learn shipped its own histogram-based implementation, HistGradientBoosting, in version 0.21 in 2019, drawing heavily on the LightGBM design. By 2020, the gradient boosting market had three major open-source players plus a strong scikit-learn implementation, and XGBoost responded by adopting many of the same techniques (histogram trees, leaf-wise growth, native categorical support).
The project followed a slow, deliberate release cadence in its first few years. With the 1.0 release in February 2020, the team adopted semantic versioning and committed to a [MAJOR].[FEATURE].[MAINTENANCE] scheme. Subsequent major releases reworked GPU support, the Apache Spark integration, the JVM packages, and the R API.
XGBoost optimizes a regularized objective at each boosting iteration t:
L(t) = sum_i l(y_i, y_hat_i^(t-1) + f_t(x_i)) + Omega(f_t)
where l is any twice-differentiable convex loss (squared error for regression, logistic loss for binary classification, softmax for multiclass, etc.), f_t is the new tree being added, and Omega(f) = gamma * T + 0.5 * lambda * ||w||^2 penalizes the number of leaves T and the L2 norm of the leaf weight vector w. The library also supports an L1 penalty alpha * ||w||_1 on the leaf weights, giving both ridge and lasso flavors of regularization.
Where classical gradient boosting (Friedman, 2001) approximated the loss with only the first-order gradient, XGBoost uses a second-order Taylor expansion. After dropping constants, the per-iteration objective becomes:
L_tilde(t) ~= sum_i [ g_i * f_t(x_i) + 0.5 * h_i * f_t(x_i)^2 ] + Omega(f_t)
where g_i = d l / d y_hat_i^(t-1) and h_i = d^2 l / d y_hat_i^(t-1)^2 are the first and second derivatives of the loss at the previous prediction. For a fixed tree structure with leaf assignments q(x), the optimal leaf weight is w_j* = -G_j / (H_j + lambda) and the corresponding objective score is -0.5 * sum_j G_j^2 / (H_j + lambda) + gamma * T, where G_j and H_j are the sums of g_i and h_i over the instances in leaf j. The split-finding algorithm uses this score to greedily evaluate every candidate split, picking the one that maximizes:
Gain = 0.5 * [ G_L^2/(H_L+lambda) + G_R^2/(H_R+lambda) - (G_L+G_R)^2/(H_L+H_R+lambda) ] - gamma
The gamma term acts as a fixed cost for splitting and lets the algorithm prune any split whose gain does not exceed it.
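To make the formulas concrete, here is a minimal numpy sketch of the leaf-weight and gain computations above; the helper names are ours, not the library's, and the toy gradients are illustrative only.
import numpy as np

def leaf_weight(G, H, lam):
    # Optimal leaf weight from above: w* = -G / (H + lambda).
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    # Gain = 0.5 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam)
    #               - (G_L+G_R)^2/(H_L+H_R+lam)] - gamma
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# For squared error, g_i = y_hat_i - y_i and h_i = 1.
g = np.array([-2.0, -1.5, 0.5, 1.0])         # per-example gradients
h = np.ones_like(g)                          # per-example Hessians
left = np.array([True, True, False, False])  # a candidate split

print(leaf_weight(g[left].sum(), h[left].sum(), lam=1.0))
print(split_gain(g[left].sum(), h[left].sum(),
                 g[~left].sum(), h[~left].sum(), lam=1.0, gamma=0.0))
A split is kept only when this gain is positive, which is exactly the pruning role the gamma term plays.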
For small datasets the library uses an exact greedy search that enumerates every candidate split for every feature. For large datasets this becomes infeasible, so XGBoost falls back to an approximate algorithm based on a weighted quantile sketch. The key insight is that the second-order weights h_i already act as natural importance weights for each example, so the histogram bin boundaries should be the quantiles of the empirical distribution of features weighted by h_i. The paper introduces a novel sketch data structure that supports merge and prune operations on weighted points, allowing the quantile boundaries to be computed in a single distributed pass over the data.
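A single-machine stand-in for the idea (not the mergeable sketch itself) fits in a few lines of numpy: candidate cut points are quantiles of the feature's Hessian-weighted CDF. The function below is our own illustration, not XGBoost's API.
import numpy as np

def weighted_quantile_cuts(x, h, n_bins):
    # Candidate split points = quantiles of the feature distribution
    # weighted by the per-example Hessians h. XGBoost's actual sketch
    # additionally supports distributed merge and prune operations.
    order = np.argsort(x)
    cdf = np.cumsum(h[order]) / h.sum()          # Hessian-weighted CDF
    targets = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.unique(x[order][np.searchsorted(cdf, targets)])

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)                       # one feature column
h = rng.uniform(0.1, 1.0, size=1_000)            # pretend Hessians
print(weighted_quantile_cuts(x, h, n_bins=16))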
In 2017 the team added a fully histogram-based tree method (tree_method="hist") that pre-bins the features once and then operates on integer bin indices, similar to LightGBM's default. This is now the recommended setting for both CPU and GPU training.
Real tabular data is full of missing values, one-hot encodings, and categorical sparsity. Rather than imputing, XGBoost learns a default direction at every split: when a feature value is missing (or zero in a sparse matrix), the instance is routed down the default branch. The default direction is chosen during training by trying both options for every candidate split and keeping whichever maximizes the gain. This means the model can exploit informative missingness, a common situation in survey data, electronic health records, and fraud detection. The original paper reported a roughly 50x speedup on the sparse Allstate-10K dataset compared with a naive scan that did not exploit sparsity.
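A quick way to see this in practice is to train directly on a matrix containing NaNs with no imputation step; the synthetic data below is a toy sketch of ours.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)            # label depends on feature 0
X[rng.random(X.shape) < 0.3] = np.nan    # knock out 30% of the values

# No imputation: at each split, NaNs follow the learned default direction.
model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)
print(model.predict(X[:5]))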
XGBoost stores the training data as a set of compressed, in-memory column blocks sorted by feature value. Each block can be scanned independently by a different CPU core to compute split statistics, which is the source of the library's intra-machine parallelism. The blocks are also designed to fit in CPU cache, with prefetching used to hide memory latency. For datasets that do not fit in RAM, the blocks can be sharded onto disk; the library also supports an out-of-core mode that streams blocks from disk, and an external-memory mode (substantially overhauled in version 3.0) for datasets larger than host memory.
XGBoost has had a CUDA-based GPU implementation since version 0.7 (released in late 2017). The current device="cuda" setting (which superseded the earlier tree_method="gpu_hist") implements the histogram algorithm directly on the GPU and supports both single-GPU and multi-GPU training (the latter via NCCL). On large datasets it can be ten to twenty times faster than the CPU version. NVIDIA has invested directly in the project, and the GPU code is co-maintained by their RAPIDS team.
Two constraint mechanisms make XGBoost especially useful for regulated domains. Monotonic constraints force the model to be monotonically increasing or decreasing in a chosen feature, which is often a regulatory requirement in credit scoring (a higher income can never lower the predicted creditworthiness). The constraint is enforced during split-finding by rejecting any split that would violate the requested direction at the current node. Feature interaction constraints restrict which subsets of features are allowed to appear together in a single root-to-leaf path, which is useful for interpretable modeling and for reducing spurious interactions in noisy datasets.
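Both mechanisms are exposed as ordinary parameters in the Python API. Below is a minimal sketch on synthetic data; the feature roles and values are hypothetical.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.uniform(size=(1_000, 3))
y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=1_000)

model = xgb.XGBRegressor(
    n_estimators=200,
    # 1: predictions must be non-decreasing in feature 0,
    # -1: non-increasing in feature 1, 0: unconstrained.
    monotone_constraints="(1,-1,0)",
    # Features 0 and 1 may share a root-to-leaf path; feature 2 only
    # appears in paths by itself.
    interaction_constraints="[[0, 1], [2]]",
)
model.fit(X, y)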
XGBoost has more than fifty parameters, but a small subset accounts for almost all the practical tuning. The table below lists the hyperparameters that data scientists most often touch. Defaults shown are for the Python XGBClassifier / XGBRegressor API as of version 3.2.
| parameter | default | typical range | role |
|---|---|---|---|
| n_estimators | 100 | 100 to 5000 | Number of boosting rounds (trees). Use early stopping to find the right value. |
| eta (learning_rate) | 0.3 | 0.01 to 0.3 | Shrinkage applied to each new tree. Lower values need more trees but generalize better. |
| max_depth | 6 | 3 to 10 | Maximum depth of each tree. Shallower trees reduce overfitting but may underfit. |
| min_child_weight | 1 | 1 to 10 | Minimum sum of Hessians required in a leaf. Acts as a complexity penalty. |
| gamma (min_split_loss) | 0 | 0 to 10 | Minimum loss reduction required to make a split. Larger values prune more aggressively. |
| subsample | 1.0 | 0.5 to 1.0 | Fraction of rows sampled for each tree. Adds randomness, similar to bagging. |
| colsample_bytree | 1.0 | 0.5 to 1.0 | Fraction of columns sampled for each tree. |
| colsample_bylevel | 1.0 | 0.5 to 1.0 | Fraction of columns sampled at each tree level. |
| colsample_bynode | 1.0 | 0.5 to 1.0 | Fraction of columns sampled at each split. |
| reg_lambda | 1.0 | 0 to 10 | L2 regularization on leaf weights. |
| reg_alpha | 0.0 | 0 to 10 | L1 regularization on leaf weights. |
| scale_pos_weight | 1.0 | depends | Up-weights the positive class for imbalanced classification. |
| tree_method | "hist" | hist, exact, approx | Algorithm used for split finding. Hist is now the default. |
| device | "cpu" | cpu, cuda, cuda:0 | Hardware backend. Set to cuda for GPU training. |
| objective | task-dependent | reg:squarederror, binary:logistic, multi:softprob, rank:pairwise, etc. | Loss function used during training. |
| eval_metric | task-dependent | rmse, logloss, auc, ndcg, etc. | Metric reported on the validation set. |
| early_stopping_rounds | None | 10 to 100 | Stop training when the validation metric does not improve for this many rounds. |
A common starting point for tabular classification is eta=0.05, max_depth=6, min_child_weight=1, subsample=0.8, colsample_bytree=0.8, with n_estimators=2000 and early_stopping_rounds=50. Bayesian optimization tools such as Optuna and Ray Tune are widely used to find better settings automatically.
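As a hedged illustration of that workflow, an Optuna study over the handful of parameters that matter most might look like the sketch below; the dataset and search ranges are placeholders, not recommendations.
import optuna
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_tr, X_va, y_tr, y_va = train_test_split(
    *load_breast_cancer(return_X_y=True), random_state=0
)

def objective(trial):
    # Search only the parameters that dominate practical tuning.
    params = dict(
        n_estimators=2000,
        learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        max_depth=trial.suggest_int("max_depth", 3, 10),
        min_child_weight=trial.suggest_int("min_child_weight", 1, 10),
        subsample=trial.suggest_float("subsample", 0.5, 1.0),
        colsample_bytree=trial.suggest_float("colsample_bytree", 0.5, 1.0),
        eval_metric="auc",
        early_stopping_rounds=50,
    )
    model = xgb.XGBClassifier(**params)
    model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], verbose=False)
    return roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)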
XGBoost has supported distributed training since 2014. The original implementation used the DMLC Rabit library for collective communication (allreduce, broadcast) over a small set of nodes connected via TCP or YARN. Modern XGBoost integrates with three main distributed frameworks:
- Apache Spark, via the xgboost4j-spark JVM package and the PySpark interface added in version 1.6. The Spark integration uses Barrier execution mode and supports both row-partitioned and feature-partitioned data.
- Dask, via the xgboost.dask Python module. Dask is the recommended choice for users who already have a Dask cluster and want to stay inside the Python ecosystem; the interface mirrors the standard scikit-learn API (see the Dask sketch below).
- Ray, via the xgboost_ray package originally developed at Anyscale and Uber. Ray adds elastic training (the job continues if a worker dies) and seamless integration with Ray Tune for hyperparameter optimization.

All three integrations support multi-GPU training. Distributed XGBoost typically scales close to linearly up to a few dozen workers; beyond that, communication of the gradient histograms can become the bottleneck.
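A minimal Dask training loop looks like the following sketch; the local two-worker cluster and random data are stand-ins for a real deployment.
import dask.array as da
from dask.distributed import Client, LocalCluster
from xgboost import dask as dxgb

cluster = LocalCluster(n_workers=2)    # stand-in for a real cluster
client = Client(cluster)

# Dask arrays are lazily partitioned across the workers.
X = da.random.random((100_000, 20), chunks=(10_000, 20))
y = (X[:, 0] > 0.5).astype("int")

dtrain = dxgb.DaskDMatrix(client, X, y)
output = dxgb.train(
    client,
    {"objective": "binary:logistic", "tree_method": "hist"},
    dtrain,
    num_boost_round=100,
)
booster = output["booster"]            # an ordinary xgboost.Booster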
XGBoost ships native bindings for several languages, with the C++ core compiled as a shared library that is loaded by each language wrapper.
- Python (xgboost on PyPI): the dominant interface, including a scikit-learn-compatible API (XGBClassifier, XGBRegressor, XGBRanker) and the lower-level Booster API.
- R (xgboost on CRAN): a fully featured wrapper that integrates with the caret and tidymodels packages.
- JVM (Java, Scala): xgboost4j and xgboost4j-spark for Maven/Gradle.
- Julia (XGBoost.jl): a community-maintained wrapper.
- Ruby (xgboost-ruby), Swift (SwiftXGBoost), Go bindings, and a Rust crate (xgboost-rs) all exist with varying maturity.

Within the broader Python ecosystem, XGBoost integrates with scikit-learn pipelines, MLflow for experiment tracking, ONNX for model export, SHAP for interpretability (the shap.TreeExplainer is highly optimized for XGBoost), and almost every commercial ML platform including AWS SageMaker, Google Vertex AI, Azure Machine Learning, Databricks, and Snowflake.
The table below summarizes how XGBoost compares with the two main alternative implementations of gradient boosted trees and with scikit-learn's built-in HistGradientBoosting. All four are considered "production grade" and benchmarks generally rank them within a few percentage points of one another on most tabular tasks.
| feature | XGBoost | LightGBM | CatBoost | scikit-learn HistGB |
|---|---|---|---|---|
| First release | 2014 | 2016 | 2017 | 2019 (v0.21) |
| Maintainer | DMLC | Microsoft Research | Yandex | scikit-learn community |
| Default tree growth | Level-wise (depth-wise) | Leaf-wise | Symmetric (oblivious) | Level-wise |
| Default split algorithm | Histogram | Histogram | Histogram | Histogram |
| Categorical handling | Native (since 1.5) | Native | Native (best in class) | Limited |
| Missing value handling | Native (sparsity-aware) | Native | Native | Native |
| GPU training | Yes (CUDA, multi-GPU) | Yes (CUDA, OpenCL) | Yes (CUDA, multi-GPU) | No |
| Distributed training | Spark, Dask, Ray | Dask | Spark, Dask | No |
| L1/L2 regularization | Both | Both | Both | L2 only |
| Monotonic constraints | Yes | Yes | Yes | Yes |
| Feature interaction constraints | Yes | No | No | No |
| Typical relative speed | 1.0x | 2x to 7x faster | 0.5x to 1.0x | 1.0x to 2.0x |
| Best fit | Default tabular baseline; sparse, large data | Wide datasets (many features), large row counts | Heavy categorical features, low-tuning workflows | Pipelines that must stay inside scikit-learn |
The rough heuristic among practitioners is: try XGBoost first as a strong baseline, switch to LightGBM if training time or memory becomes a bottleneck on very large data, and switch to CatBoost when the dataset is dominated by high-cardinality categorical features. In Kaggle competitions a stacked ensemble of all three is still common.
XGBoost was the dominant Kaggle algorithm from 2014 through 2018. Most winning solutions used it as the main predictive engine, often stacked with a small neural network or a logistic regression for diversity. After 2018 the share of pure-XGBoost wins decreased as LightGBM and CatBoost matured and as deep learning models started to compete on image and text-heavy challenges, but XGBoost remains a near-universal component in tabular competitions.
The library has been adopted across the financial industry for credit-risk modeling, loan default prediction, anti-money laundering, and trading signal research. The combination of strong out-of-the-box accuracy, monotonic constraints, and SHAP-based interpretability makes it a natural fit for regulated environments where the model must be auditable. Several large banks have moved their credit-decisioning pipelines from logistic regression to XGBoost since 2018.
Click-through rate (CTR) prediction is the canonical advertising-tech problem: given a user, a context, and a candidate ad, estimate the probability of a click. XGBoost was widely used in CTR models before deep learning architectures (Wide & Deep, DeepFM, DCN, DIN) took over the largest production systems, and it is still a common baseline and feature-extraction stage. Hybrid GBDT + LR pipelines (popularized by Facebook's 2014 paper "Practical Lessons from Predicting Clicks on Ads at Facebook" and replicated in many follow-on systems with XGBoost as the tree stage) feed the trees' leaf indices as one-hot features into a downstream linear model.
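The leaf-index trick is easy to reproduce with the public API. The sketch below uses a convenience dataset rather than a CTR benchmark, purely for illustration.
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X_tr, X_te, y_tr, y_te = train_test_split(
    *load_breast_cancer(return_X_y=True), random_state=0
)

gbt = xgb.XGBClassifier(n_estimators=100, max_depth=4).fit(X_tr, y_tr)

# apply() maps each row to the leaf index it reaches in every tree;
# one-hot encoding those indices yields sparse features for a linear model.
enc = OneHotEncoder(handle_unknown="ignore")
lr = LogisticRegression(max_iter=1000)
lr.fit(enc.fit_transform(gbt.apply(X_tr)), y_tr)
print(lr.score(enc.transform(gbt.apply(X_te)), y_te))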
Customer churn prediction (telecom, banking, SaaS), credit-card fraud detection, supply-chain demand forecasting, energy load forecasting, and electronic health record (EHR) modeling are all dominated by gradient boosting in practice, and XGBoost is among the most common implementations. Public benchmarks regularly report 95%+ AUC on canonical churn datasets when XGBoost is properly tuned.
Beyond the original Higgs Boson use case, XGBoost is used in particle physics, astronomy (galaxy classification, photometric redshift), bioinformatics (variant pathogenicity prediction, drug response modeling), epidemiology, and climate science. The combination of fast training on large tabular datasets and built-in feature importance makes it attractive whenever a researcher needs both predictive accuracy and a rough sense of which inputs matter.
A recurring debate in the machine learning community is whether deep neural networks can out-perform gradient boosted trees on tabular data. As of 2026 the answer remains, with caveats, no. The 2021 Shwartz-Ziv and Armon paper "Tabular Data: Deep Learning Is Not All You Need" benchmarked XGBoost against TabNet, NODE, and several other tabular deep models on 11 datasets and concluded that XGBoost outperformed every deep model on average and required dramatically less hyperparameter tuning. The 2022 Grinsztajn et al. NeurIPS paper "Why do tree-based models still outperform deep learning on typical tabular data?" reached the same conclusion on a curated benchmark of 45 datasets.
Several structural reasons explain the persistent gap. Decision trees are robust to uninformative features (they simply do not split on them), while neural networks must learn to ignore noise. Trees are invariant to monotonic transformations of individual features, so they do not require careful normalization. Trees handle missing values and heavy-tailed distributions natively. Finally, neural networks for tabular data tend to need large amounts of data to compete, while most tabular problems live in the small-to-medium-data regime.
That said, deep models close the gap or surpass XGBoost in two specific situations. First, when the dataset is very large (tens of millions of rows or more) and there are meaningful interactions among many features, transformer-based architectures such as TabTransformer, FT-Transformer, and SAINT can produce slightly better results, often at much higher computational cost. Second, in multimodal pipelines where tabular features must be combined with text, image, or sequence data, deep models offer a natural way to fuse modalities; even there, a common pattern is to use XGBoost on the tabular part and concatenate or stack its predictions with the deep model's outputs.
Despite its strengths, XGBoost has well-known weaknesses that practitioners need to be aware of. Like all tree ensembles, it cannot extrapolate beyond the range of target values seen during training, which matters for regression on trending series. Its many hyperparameters make careless tuning a real overfitting risk. Training holds histograms and gradient statistics in memory, which can be demanding on very wide datasets. And it offers no advantage on unstructured inputs such as images, audio, or free text, where deep learning dominates.
The table below lists the major XGBoost releases and the headline features they introduced. Patch releases and pre-1.0 versions are omitted for brevity.
| version | release date | highlights |
|---|---|---|
| 0.4 | May 2015 | First Spark integration; PyPI release. |
| 0.6 | July 2016 | First R CRAN release; faster Booster; better Python sklearn API. |
| 0.7 | December 2017 | First CUDA GPU implementation (gpu_hist). |
| 0.8 | August 2018 | DART booster generalizations; multi-GPU support. |
| 0.9 | May 2019 | One-versus-rest classification; gpu_hist as default GPU method. |
| 1.0 | February 2020 | Adopted semantic versioning; improved Spark integration; pickle compatibility. |
| 1.1 | May 2020 | Sparse-matrix performance improvements. |
| 1.2 | August 2020 | Improved categorical interface; better gpu_hist scalability. |
| 1.3 | December 2020 | Experimental categorical support; default device API previewed. |
| 1.4 | April 2021 | Refined categorical handling; deterministic GPU training. |
| 1.5 | October 2021 | Native categorical features in tree_method="hist". |
| 1.6 | April 2022 | Native PySpark estimator; streamlined Dask interface. |
| 1.7 | October 2022 | Quantile regression objective; PyPI macOS arm64 wheels. |
| 2.0 | September 2023 | Vector leaves for multi-target regression; new federated learning interface; default tree_method="hist". |
| 2.1 | June 2024 | Improved external memory; better column sampling; device parameter replaces gpu_hist. |
| 3.0 | March 2025 | Major reworked R package; redesigned JVM packages; ExtMemQuantileDMatrix for efficient external-memory training; expanded categorical and quantile regression support. |
| 3.1 | October 2025 | scikit-learn 1.8 compatibility; improved error reporting. |
| 3.2 | February 2026 | Stable release; further GPU NDCG and ranking improvements. |
(Where the dates differ between independent trackers, the value here follows the GitHub release tag for the corresponding vX.Y.0 release.)
The Python package is the most common entry point and is installed with pip install xgboost or conda install -c conda-forge xgboost. A minimal binary classification example using the scikit-learn wrapper looks like this:
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
data.data, data.target, random_state=42
)
model = xgb.XGBClassifier(
n_estimators=500,
max_depth=6,
learning_rate=0.05,
objective="binary:logistic",
eval_metric="logloss",
early_stopping_rounds=20,
tree_method="hist",
device="cuda", # remove this line for CPU training
)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
print("AUC:", model.score(X_test, y_test))
The lower-level Booster API offers more control and is preferred for production deployments and custom training loops:
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
params = {
"objective": "binary:logistic",
"eval_metric": "auc",
"max_depth": 6,
"eta": 0.05,
"tree_method": "hist",
"device": "cuda",
}
booster = xgb.train(
params,
dtrain,
num_boost_round=2000,
evals=[(dtrain, "train"), (dtest, "valid")],
early_stopping_rounds=50,
verbose_eval=100,
)
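After training, the booster can be truncated to its best early-stopping iteration at prediction time and persisted in the portable JSON format; the short sketch below continues the example above.
# Use only the trees up to the best early-stopping iteration.
pred = booster.predict(dtest, iteration_range=(0, booster.best_iteration + 1))

# Save and reload in the self-describing JSON format.
booster.save_model("model.json")
loaded = xgb.Booster()
loaded.load_model("model.json")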
XGBoost exposes several tools for understanding what a trained model is doing.
- Feature importance: Booster.get_score exposes three flavors of importance: weight (the number of splits a feature is used in), gain (the average loss reduction contributed by the feature), and cover (the average Hessian coverage of the splits using the feature). The scikit-learn wrapper's model.feature_importances_ returns one of these, selected by the importance_type parameter.
- Plotting: xgb.plot_importance and xgb.plot_tree visualize feature importance and individual trees.
- SHAP values: via shap.TreeExplainer. The implementation is specialized for tree ensembles such as XGBoost and can compute per-prediction feature attributions in milliseconds (see the sketch after this list).
- Interactions: the pred_interactions=True option in Booster.predict returns the SHAP interaction matrix, a richer attribution that splits each feature's contribution across pairwise interactions.
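A minimal SHAP sketch, assuming the fitted model and X_test from the quick-start example above:
import shap

# TreeExplainer implements the fast tree-path attribution algorithm.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Each row's attributions plus the base value sum to that row's margin output.
shap.summary_plot(shap_values, X_test)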
XGBoost belongs to the family of gradient boosting machines first formalized by Jerome Friedman in 2001. Its closest cousins are:

- the gbm R package (Greg Ridgeway, 2003), which was the first widely used implementation but was slow and single-threaded;
- scikit-learn's GradientBoostingClassifier and GradientBoostingRegressor (introduced 2010), which use exact greedy split finding and are now superseded by HistGradientBoosting.

XGBoost is released under the Apache License 2.0. Development is coordinated through the dmlc/xgboost GitHub repository, with a small set of committers drawn from academia (Carnegie Mellon, University of Washington), large tech companies (NVIDIA, AWS, Microsoft, Yandex), and the open-source community. The project follows a lightweight RFC process for major design changes and uses GitHub Actions for continuous integration across Linux, macOS, and Windows on x86-64 and ARM64 architectures.