# XGBoost

> Source: https://aiwiki.ai/wiki/xgboost
> Updated: 2026-06-21
> Categories: Algorithms, Machine Learning, Open Source AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Gradient boosted (decision) trees](/wiki/gradient_boosted_decision_trees_gbt), [Gradient Boosting](/wiki/gradient_boosting), [LightGBM](/wiki/lightgbm), [CatBoost](/wiki/catboost)*

**XGBoost** (short for **eXtreme Gradient Boosting**) is an open-source software library that implements an optimized, distributed version of [gradient boosted decision trees](/wiki/gradient_boosted_decision_trees_gbt), and it is one of the most widely used [machine learning](/wiki/machine_learning) algorithms for tabular data. [1] [3] It was created by [Tianqi Chen](/wiki/tianqi_chen) in 2014 as a research project at the University of Washington and popularized by the 2016 paper "XGBoost: A Scalable Tree Boosting System," co-authored with Carlos Guestrin and presented at the 22nd ACM SIGKDD conference; that paper has been cited more than 47,000 times (Semantic Scholar, 2026), making it one of the most cited works in applied machine learning. [1] [2] The library combines a regularized formulation of gradient boosting with a set of systems-level engineering tricks (sparsity-aware splitting, weighted quantile sketch, cache-aware histogram blocks, distributed training, and GPU acceleration) that together let it train on billions of examples while routinely topping the leaderboard on tabular problems. [1]

The paper itself states the goal directly: "The most important factor behind the success of XGBoost is its scalability in all scenarios. The system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings." [1] For most of the period from 2014 through 2018, XGBoost was the single most cited algorithm in the winning solutions of [Kaggle](/wiki/kaggle) competitions, and even in 2026 it remains a default first choice for tabular regression and classification in finance, advertising, fraud detection, and risk modeling. The library is written in C++ and exposes bindings for Python, R, Julia, Java, Scala, Ruby, and Swift, integrates with the JVM ecosystem (Spark, Flink, Hadoop), and offers a [scikit-learn](/wiki/scikit_learn)-compatible Python API alongside first-class support for distributed training on Apache Spark, Dask, and Ray. [3] It is hosted by the Distributed Machine Learning Community (DMLC) on GitHub at `dmlc/xgboost` and has more than 28,000 GitHub stars as of 2026. [4]

## eli5 (explain like i'm five)

Imagine you are trying to predict how much rain will fall tomorrow. You ask a hundred friends to make tiny guesses, and you add their guesses together to get a final answer. The trick is that each new friend is only allowed to look at the mistakes the previous friends made, and they have to write a small rule that fixes those mistakes a tiny bit. After a hundred friends, the total guess becomes very accurate.

XGBoost is a computer program that does this very fast and very well. Each "friend" is a small [decision tree](/wiki/decision_tree) that splits the world into a few buckets and gives a number to each bucket. XGBoost adds a few extra clever ideas: it punishes friends who try to make rules that are too complicated, it skips empty data without complaining, and it uses every CPU core (and every GPU) on your computer at the same time so the whole thing finishes in seconds instead of hours.
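The "friends fixing mistakes" picture can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions (squared error, a single feature, one-split "stumps" instead of full trees, and hypothetical helper names `fit_stump` and `boost`); it is not how XGBoost itself is implemented.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def boost(xs, ys, n_rounds=100, learning_rate=0.1):
    """Each round fits a stump to the current mistakes and adds a small correction."""
    stumps = []
    predictions = [0.0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: sum(learning_rate * s(x) for s in stumps)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9]
model = boost(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(f"training MSE after boosting: {mse:.4f}")
```

Each "friend" here only sees the residuals left by the friends before it, and the small `learning_rate` keeps any single correction from dominating the total.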

## When was XGBoost created?

### the dmlc origin (2014)

XGBoost began as a research project by Tianqi Chen, then a PhD student at the University of Washington under the supervision of Carlos Guestrin in the Paul G. Allen School of Computer Science. [6] Chen pushed the first commit to a public Git repository on March 27, 2014. [4] The initial release was a small command-line C++ program configured through libsvm-format text files. Chen built it to study how a single, well-engineered implementation of gradient tree boosting could scale to industrial workloads after he ran into the slow performance of the existing R `gbm` package on a Kaggle problem.

The project was hosted under the Distributed (Deep) Machine Learning Community (DMLC), an umbrella that Chen co-founded along with collaborators including Mu Li, Naiyan Wang, and Bing Xu. DMLC also hatched MXNet, TVM, and other systems that would later become Apache top-level projects.

### the higgs boson breakthrough

XGBoost first attracted broad attention through the Higgs Boson Machine Learning Challenge that ran on Kaggle from May to September 2014, a competition that drew 1,785 teams of 1,942 players. [13] The challenge invited data scientists to help physicists at the Large Hadron Collider at CERN classify Higgs boson decay events from background noise. [13] Tianqi Chen released a public XGBoost benchmark script for the challenge that beat almost every other submission on the leaderboard for many weeks. Multiple top finishers attributed their solutions to XGBoost, and at the closing workshop Chen and Tong He (team Crowwork) received the Special "High Energy Physics meets Machine Learning" Award; the award citation noted that their algorithm "was an excellent compromise between performance and simplicity." [13] By the time the competition closed, the project had jumped from a small research codebase to one of the most discussed tools in competitive machine learning.

The Higgs win catalyzed two follow-on developments. First, Bing Xu wrote the Python bindings (`xgboost` on PyPI) and Tong He wrote the R bindings (`xgboost` on CRAN), which together turned the project from a niche command-line tool into a library that could be plugged into any data scientist's workflow. Second, the team began to invest seriously in distributed training, leading to the early Apache Spark and YARN integrations.

### the sigkdd 2016 paper

In March 2016, Chen and Guestrin posted the now-famous paper "XGBoost: A Scalable Tree Boosting System" to arXiv (1603.02754) and presented it at the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining in San Francisco. [1] [2] The paper formalized the algorithmic improvements behind the library: a regularized objective derived from a second-order Taylor expansion, a sparsity-aware split-finding algorithm that learns a default direction for missing values, a weighted quantile sketch for approximate split-point selection on huge datasets, and a cache-aware columnar block layout for parallel and out-of-core computation. [1] The authors summarized the contribution as "a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning." [1]

The paper has since been cited more than 47,000 times (Semantic Scholar, 2026), making it one of the most cited works in applied machine learning. [2] In the same year, Chen received the John M. Chambers Statistical Software Award from the American Statistical Association for his contributions to statistical software (the award carries a cash prize of 2,000 dollars), and the team had earlier been recognized with the High Energy Physics meets Machine Learning Award for the role XGBoost played in the Higgs work. [13] [19]

### kaggle dominance, then competition (2015-2020)

Between 2015 and 2018, XGBoost was the most popular algorithm on Kaggle. The original paper reports that among the 29 competition-winning solutions published on Kaggle's blog during 2015, 17 used XGBoost (eight of them as the sole model and the rest as a component of an ensemble), while the second most popular method, deep neural networks, was used in only 11 winning solutions. [1] The library became a kind of professional standard: tutorials, kernels, and Coursera courses all assumed it as the default tabular baseline.

This dominance prompted the creation of competing implementations. Microsoft Research released [LightGBM](/wiki/lightgbm) in 2016 with leaf-wise growth and a more aggressive histogram algorithm that delivered substantial speedups on very wide datasets. [9] Yandex released [CatBoost](/wiki/catboost) in 2017 with a focus on categorical-feature handling and ordered boosting. [10] Scikit-learn shipped its own histogram-based implementation, `HistGradientBoosting`, in version 0.21 in 2019, drawing heavily on the LightGBM design. By 2020, the gradient boosting market had three major open-source players plus a strong scikit-learn implementation, and XGBoost responded by adopting many of the same techniques (histogram trees, leaf-wise growth, native categorical support). [5]

### major version milestones

The project followed a slow, deliberate release cadence in its first few years. With the 1.0 release in February 2020, the team adopted semantic versioning and committed to a `[MAJOR].[FEATURE].[MAINTENANCE]` scheme. [5] Subsequent major releases reworked GPU support, the Apache Spark integration, the JVM packages, the R API, and (in the 3.x line) external-memory training. [5]

## How does the XGBoost algorithm work?

### regularized objective

XGBoost optimizes a regularized objective at each boosting iteration `t`:

```
L(t) = sum_i l(y_i, y_hat_i^(t-1) + f_t(x_i)) + Omega(f_t)
```

where `l` is any twice-differentiable convex loss (squared error for regression, logistic loss for binary classification, softmax for multiclass, etc.), `f_t` is the new tree being added, and `Omega(f) = gamma * T + 0.5 * lambda * ||w||^2` penalizes the number of leaves `T` and the L2 norm of the leaf weight vector `w`. [1] The library also supports an L1 penalty `alpha * ||w||_1` on the leaf weights, giving both ridge and lasso flavors of regularization. [3]

Where classical [gradient boosting](/wiki/gradient_boosting) (Friedman, 2001) approximated the loss with only the first-order gradient, XGBoost uses a second-order Taylor expansion. [8] After dropping constants, the per-iteration objective becomes:

```
L_tilde(t) ~= sum_i [ g_i * f_t(x_i) + 0.5 * h_i * f_t(x_i)^2 ] + Omega(f_t)
```

where `g_i = d l / d y_hat_i^(t-1)` and `h_i = d^2 l / d y_hat_i^(t-1)^2` are the first and second derivatives of the loss at the previous prediction. For a fixed tree structure with leaf assignments `q(x)`, the optimal leaf weight is `w_j* = -G_j / (H_j + lambda)` and the corresponding objective score is `-0.5 * sum_j G_j^2 / (H_j + lambda) + gamma * T`, where `G_j` and `H_j` are the sums of `g_i` and `h_i` over the instances in leaf `j`. The split-finding algorithm uses this score to greedily evaluate every candidate split, picking the one that maximizes:

```
Gain = 0.5 * [ G_L^2/(H_L+lambda) + G_R^2/(H_R+lambda) - (G_L+G_R)^2/(H_L+H_R+lambda) ] - gamma
```

The `gamma` term acts as a fixed cost for splitting and lets the algorithm prune any split whose gain does not exceed it. [1]
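The formulas above can be checked by hand on a tiny example. The sketch below assumes logistic loss and four instances that all start from a raw margin of 0; the function names are illustrative and are not part of the XGBoost API.

```python
import math


def logistic_grad_hess(y, margin):
    """Gradient g and Hessian h of the logistic loss with respect to the raw margin."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return p - y, p * (1.0 - p)


def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + lam)


def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of a candidate split, following the formula above."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma


# Four instances, all starting from margin 0 (predicted probability 0.5).
labels = [1, 1, 0, 0]
stats = [logistic_grad_hess(y, 0.0) for y in labels]

# Candidate split: the two positives go left, the two negatives go right.
G_L = sum(g for g, _ in stats[:2])
H_L = sum(h for _, h in stats[:2])
G_R = sum(g for g, _ in stats[2:])
H_R = sum(h for _, h in stats[2:])

gain = split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0)
w_left = leaf_weight(G_L, H_L, lam=1.0)
w_right = leaf_weight(G_R, H_R, lam=1.0)
print(gain, w_left, w_right)
```

With `lam=1.0` the left leaf receives a positive weight and the right leaf a negative one, each shrunk toward zero by the `lambda` term in the denominator. Raising `gamma` above the computed gain would make the split unprofitable.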

### exact and approximate split finding

For small datasets the library uses an exact greedy search that enumerates every candidate split for every feature. For large datasets this becomes infeasible, so XGBoost falls back to an approximate algorithm based on a **weighted quantile sketch**. [1] The key insight is that the second-order weights `h_i` already act as natural importance weights for each example, so the histogram bin boundaries should be the quantiles of the empirical distribution of features weighted by `h_i`. The paper introduces a novel sketch data structure that supports merge and prune operations on weighted points, allowing the quantile boundaries to be computed in a single distributed pass over the data. [1]
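The idea of Hessian-weighted quantiles can be illustrated with an exact, in-memory computation. This is only a sketch of what the cut points mean: it sorts the full column, whereas the paper's contribution is an approximate sketch structure that can be merged and pruned across machines. The function name is hypothetical.

```python
def weighted_quantile_cuts(values, hessians, n_bins):
    """Pick cut points so that each bin holds roughly equal total Hessian weight.

    A cut at value v means "values up to and including v fall in this bin".
    """
    pairs = sorted(zip(values, hessians))
    step = sum(h for _, h in pairs) / n_bins
    cuts, cumulative, next_target = [], 0.0, step
    for value, h in pairs:
        cumulative += h
        if cumulative >= next_target and len(cuts) < n_bins - 1:
            cuts.append(value)
            # Skip any targets that a single heavy point has already passed.
            while next_target <= cumulative:
                next_target += step
    return cuts


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
uniform = [1.0] * 8
skewed = [4.0, 4.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0]  # heavy weight on small values

print(weighted_quantile_cuts(values, uniform, n_bins=4))
print(weighted_quantile_cuts(values, skewed, n_bins=4))
```

With uniform weights the cuts are ordinary quantiles; with the skewed weights the cuts crowd toward the low values, where the Hessian mass (and therefore the influence on the loss) is concentrated.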

In 2017 the team added a fully histogram-based tree method (`tree_method="hist"`) that pre-bins the features once and then operates on integer bin indices, similar to LightGBM's default. This is now the recommended setting for both CPU and GPU training. [3]
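A minimal sketch of the histogram approach, assuming a single feature and hand-picked cut points: values are binned once, gradient statistics are accumulated per bin, and split finding then scans bin boundaries instead of raw values. The helper names and the cut points are illustrative assumptions, not library internals.

```python
from bisect import bisect_left


def bin_feature(values, cuts):
    """Map raw values to integer bin indices; bin b holds values up to cuts[b]."""
    return [bisect_left(cuts, v) for v in values]


def build_histogram(bins, grads, hessians, n_bins):
    """Accumulate gradient and Hessian sums per bin in one pass over the rows."""
    hist = [[0.0, 0.0] for _ in range(n_bins)]
    for b, g, h in zip(bins, grads, hessians):
        hist[b][0] += g
        hist[b][1] += h
    return hist


def best_split(hist, lam=1.0):
    """Scan bin boundaries left to right and return (last_left_bin, gain)."""
    G_total = sum(g for g, _ in hist)
    H_total = sum(h for _, h in hist)

    def score(G, H):
        return G * G / (H + lam)

    best_bin, best_gain = None, 0.0
    G_L = H_L = 0.0
    for b in range(len(hist) - 1):
        G_L += hist[b][0]
        H_L += hist[b][1]
        gain = 0.5 * (score(G_L, H_L) + score(G_total - G_L, H_total - H_L)
                      - score(G_total, H_total))
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


values = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
grads = [-0.5, -0.5, -0.5, -0.5, 0.5, 0.5]
hessians = [0.25] * 6
cuts = [2.0, 4.0]  # hypothetical cut points; three bins in total

bins = bin_feature(values, cuts)
hist = build_histogram(bins, grads, hessians, n_bins=len(cuts) + 1)
split_bin, gain = best_split(hist)
print(bins, split_bin, round(gain, 4))
```

The cost of the split search now depends on the number of bins rather than the number of distinct feature values, which is why pre-binning pays off on large datasets.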

### sparsity-aware splitting

Real tabular data is full of missing values, one-hot encodings, and categorical sparsity. Rather than imputing, XGBoost defines a **default direction** at every split: when a feature value is missing (or zero in a sparse matrix), the instance is routed to the default branch. [1] The default direction is learned during training by trying both options for every candidate split and choosing whichever maximizes the gain. This means the model can take advantage of the meaningful pattern of missingness in datasets where missingness is informative (a common situation in survey data, electronic health records, and fraud detection). The original paper reported that the sparsity-aware algorithm ran more than 50 times faster on an Allstate sparse dataset than a naive implementation that did not exploit sparsity. [1]
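Learning a default direction can be sketched as follows: for a fixed threshold, the gradient statistics of the missing rows are added first to the left child and then to the right child, and the better of the two gains wins. The sketch loops over all rows for clarity; the speedup described in the paper comes from visiting only the non-missing entries. Names such as `best_default_direction` are illustrative.

```python
def gain(G_L, H_L, G_R, H_R, lam=1.0):
    """Split gain without the gamma term."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R))


def best_default_direction(rows, threshold, lam=1.0):
    """rows holds (feature_value_or_None, g, h); returns (direction, gain)."""
    G_L = H_L = G_R = H_R = G_M = H_M = 0.0
    for value, g, h in rows:
        if value is None:          # missing value: statistics kept aside
            G_M += g
            H_M += h
        elif value < threshold:
            G_L += g
            H_L += h
        else:
            G_R += g
            H_R += h
    gain_left = gain(G_L + G_M, H_L + H_M, G_R, H_R, lam)
    gain_right = gain(G_L, H_L, G_R + G_M, H_R + H_M, lam)
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right


rows = [
    (1.0, -0.5, 0.25), (2.0, -0.5, 0.25),   # small values, negative gradients
    (5.0, 0.5, 0.25), (6.0, 0.5, 0.25),     # large values, positive gradients
    (None, 0.5, 0.25), (None, 0.5, 0.25),   # missing rows resemble the large values
]
direction, best_gain = best_default_direction(rows, threshold=3.0)
print(direction, round(best_gain, 4))
```

Here the missing rows share the gradient sign of the right-hand group, so routing them right gives the larger gain, which is the sense in which the pattern of missingness itself becomes informative.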

### parallel and cache-aware execution

XGBoost stores the training data as a set of compressed, in-memory **column blocks** sorted by feature value. [1] Each block can be scanned independently by a different CPU core to compute split statistics, which is the source of the library's intra-machine parallelism. The blocks are also designed to fit in CPU cache, with prefetching used to hide memory latency. [1] For datasets that do not fit in RAM, the blocks can be sharded onto disk and streamed back in an out-of-core mode; the external-memory mode that grew out of this design was substantially overhauled in version 3.0 with the new `ExtMemQuantileDMatrix`. [5]
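The pre-sorted column layout can be sketched in plain Python. The example assumes an uncompressed, in-memory table and hypothetical helper names; it shows only the core idea that each column is sorted once and then scanned linearly, so that different columns can be handed to different workers.

```python
def build_column_blocks(rows):
    """Sort row indices by each feature once; the result is reused for every split search."""
    n_features = len(rows[0])
    return [sorted(range(len(rows)), key=lambda i, f=f: rows[i][f])
            for f in range(n_features)]


def scan_column(rows, feature, sorted_idx, grads, hessians, lam=1.0):
    """Linear scan over one pre-sorted column; returns (threshold, gain)."""
    G_total, H_total = sum(grads), sum(hessians)
    parent = G_total ** 2 / (H_total + lam)
    best_threshold, best_gain = None, 0.0
    G_L = H_L = 0.0
    for pos in range(len(sorted_idx) - 1):
        i = sorted_idx[pos]
        G_L += grads[i]
        H_L += hessians[i]
        current = rows[i][feature]
        following = rows[sorted_idx[pos + 1]][feature]
        if current == following:
            continue  # no threshold can separate equal values
        G_R, H_R = G_total - G_L, H_total - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam) - parent)
        if gain > best_gain:
            best_threshold, best_gain = (current + following) / 2, gain
    return best_threshold, best_gain


rows = [[3.0, 10.0], [1.0, 20.0], [4.0, 10.0], [2.0, 20.0]]
grads = [0.5, -0.5, 0.5, -0.5]
hessians = [0.25] * 4

blocks = build_column_blocks(rows)
# Each column scan is independent, so the calls below could run on separate cores.
results = [scan_column(rows, f, blocks[f], grads, hessians) for f in range(len(blocks))]
print(blocks, results)
```

Because the sort order never changes during training, the sorting cost is paid once up front rather than at every node of every tree.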

### gpu support

XGBoost has had a CUDA-based GPU implementation since version 0.7 (released in December 2017). [5] The current `device="cuda"` setting (which superseded the earlier `tree_method="gpu_hist"`) implements the histogram algorithm directly on the GPU and supports both single-GPU and multi-GPU training (the latter via NCCL). [3] On large datasets it can be ten to twenty times faster than the CPU version. NVIDIA has invested directly in the project, and the GPU code is co-maintained by their RAPIDS team. [15]

### monotonic and feature interaction constraints

Two constraint mechanisms make XGBoost especially useful for regulated domains. **Monotonic constraints** force the model to be monotonically increasing or decreasing in a chosen feature, which is often a regulatory requirement in credit scoring (a higher income can never lower the predicted creditworthiness). [3] The constraint is enforced during split-finding by rejecting any split that would violate the requested direction at the current node. **Feature interaction constraints** restrict which subsets of features are allowed to appear together in a single root-to-leaf path, which is useful for interpretable modeling and for reducing spurious interactions in noisy datasets. [3]
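The split-rejection rule for monotonic constraints can be sketched as a comparison of the two child weights. This is a simplified illustration with hypothetical function names: it checks only the current node, whereas a complete implementation also has to keep descendants of the left child from overtaking descendants of the right child.

```python
def leaf_weight(G, H, lam=1.0):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + lam)


def split_allowed(G_L, H_L, G_R, H_R, constraint, lam=1.0):
    """constraint: +1 for increasing, -1 for decreasing, 0 for unconstrained.

    The left child holds the smaller feature values, so an increasing
    constraint requires the left weight not to exceed the right weight.
    """
    w_left = leaf_weight(G_L, H_L, lam)
    w_right = leaf_weight(G_R, H_R, lam)
    if constraint > 0:
        return w_left <= w_right
    if constraint < 0:
        return w_left >= w_right
    return True


# Left child would predict higher than the right child.
G_L, H_L, G_R, H_R = -1.0, 0.5, 1.0, 0.5
print(split_allowed(G_L, H_L, G_R, H_R, constraint=+1))
print(split_allowed(G_L, H_L, G_R, H_R, constraint=-1))
```

In the library itself these rules are requested through the `monotone_constraints` and `interaction_constraints` parameters rather than written by hand.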

## What are the most important XGBoost hyperparameters?

XGBoost has more than fifty parameters, but a small subset accounts for almost all the practical tuning. The table below lists the hyperparameters that data scientists most often touch. Defaults shown are for the Python `XGBClassifier` / `XGBRegressor` API as of version 3.2. [3]

| parameter | default | typical range | role |
|-----------|---------|---------------|------|
| `n_estimators` | 100 | 100 to 5000 | Number of boosting rounds (trees). Use early stopping to find the right value. |
| `eta` (`learning_rate`) | 0.3 | 0.01 to 0.3 | Shrinkage applied to each new tree. Lower values need more trees but often generalize better. |
| `max_depth` | 6 | 3 to 10 | Maximum depth of each tree. Shallower trees reduce overfitting but may underfit. |
| `min_child_weight` | 1 | 1 to 10 | Minimum sum of Hessians required in a leaf. Acts as a complexity penalty. |
| `gamma` (`min_split_loss`) | 0 | 0 to 10 | Minimum loss reduction required to make a split. Larger values prune more aggressively. |
| `subsample` | 1.0 | 0.5 to 1.0 | Fraction of rows sampled for each tree. Adds randomness, similar to bagging. |
| `colsample_bytree` | 1.0 | 0.5 to 1.0 | Fraction of columns sampled for each tree. |
| `colsample_bylevel` | 1.0 | 0.5 to 1.0 | Fraction of columns sampled at each tree level. |
| `colsample_bynode` | 1.0 | 0.5 to 1.0 | Fraction of columns sampled at each split. |
| `reg_lambda` | 1.0 | 0 to 10 | L2 regularization on leaf weights. |
| `reg_alpha` | 0.0 | 0 to 10 | L1 regularization on leaf weights. |
| `scale_pos_weight` | 1.0 | depends | Up-weights the positive class for imbalanced classification. |
| `tree_method` | `"hist"` | hist, exact, approx | Algorithm used for split finding. Hist is now the default. |
| `device` | `"cpu"` | cpu, cuda, cuda:0 | Hardware backend. Set to cuda for GPU training. |
| `objective` | task-dependent | reg:squarederror, binary:logistic, multi:softprob, rank:pairwise, etc. | Loss function used during training. |
| `eval_metric` | task-dependent | rmse, logloss, auc, ndcg, etc. | Metric reported on the validation set. |
| `early_stopping_rounds` | None | 10 to 100 | Stop training when validation metric does not improve for this many rounds. |

A common starting point for tabular classification is `eta=0.05`, `max_depth=6`, `min_child_weight=1`, `subsample=0.8`, `colsample_bytree=0.8`, with `n_estimators=2000` and `early_stopping_rounds=50`. Bayesian optimization tools such as Optuna and Ray Tune are widely used to find better settings automatically.
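The starting point above can be written down as a plain configuration dictionary, together with a sketch of what early stopping does with the validation curve. The `scale_pos_weight` heuristic (ratio of negatives to positives) is a common rule of thumb and an assumption here, as are the helper names and the toy loss values.

```python
def suggested_params(labels):
    """Starting configuration; scale_pos_weight is set to the negative/positive ratio."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return {
        "learning_rate": 0.05,
        "max_depth": 6,
        "min_child_weight": 1,
        "subsample": 0.8,
        "colsample_bytree": 0.8,
        "n_estimators": 2000,
        "early_stopping_rounds": 50,
        "scale_pos_weight": n_neg / n_pos,
    }


def best_round(validation_losses, patience):
    """Index of the best round, stopping once `patience` rounds pass without improvement."""
    best_idx = 0
    for i, loss in enumerate(validation_losses):
        if loss < validation_losses[best_idx]:
            best_idx = i
        elif i - best_idx >= patience:
            break
    return best_idx


labels = [1] * 10 + [0] * 90          # toy labels with a 1:9 class imbalance
params = suggested_params(labels)

losses = [0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.50]  # made-up validation losses
print(params["scale_pos_weight"], best_round(losses, patience=3))
```

Note that training stops before the late improvement at the end of the toy curve is ever seen, which is the usual trade-off of a small patience value. A dictionary like this could then be unpacked into the scikit-learn wrapper, for example `xgb.XGBClassifier(**params)`.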

## distributed training

XGBoost has supported distributed training since 2014. The original implementation used the DMLC Rabit library for collective communication (allreduce, broadcast) over a small set of nodes connected via TCP or YARN. [3] Modern XGBoost integrates with three main distributed frameworks:

- **Apache Spark**, via the `xgboost4j-spark` JVM package and the PySpark interface (`xgboost.spark`) added in version 1.7. The Spark integration uses the Barrier execution mode and supports both row-partitioned and feature-partitioned data. [20]
- **Dask**, via the `xgboost.dask` Python module. Dask is the recommended choice for users who already have a Dask cluster and want to stay inside the Python ecosystem. The interface mirrors the standard scikit-learn API. [3]
- **Ray**, via the `xgboost_ray` package originally developed at Anyscale and Uber. Ray adds elastic training (the job continues if a worker dies) and seamless integration with Ray Tune for hyperparameter optimization. [14]

All three integrations support multi-GPU training. Distributed XGBoost typically scales close to linearly up to a few dozen workers; beyond that, communication of the gradient histograms can become the bottleneck.
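The communication step in row-partitioned training can be sketched as an element-wise sum of per-worker gradient histograms. The example simulates two workers in one process with hypothetical function names; in a real cluster the sum would be performed by a collective allreduce rather than a Python loop.

```python
def local_histogram(bins, grads, hessians, n_bins):
    """Per-worker histogram of (gradient sum, Hessian sum) for one feature."""
    hist = [[0.0, 0.0] for _ in range(n_bins)]
    for b, g, h in zip(bins, grads, hessians):
        hist[b][0] += g
        hist[b][1] += h
    return [tuple(entry) for entry in hist]


def allreduce_sum(histograms):
    """Element-wise sum across workers; every worker would receive this same result."""
    n_bins = len(histograms[0])
    return [
        (sum(h[b][0] for h in histograms), sum(h[b][1] for h in histograms))
        for b in range(n_bins)
    ]


# Two workers, each holding a different slice of the rows.
worker_a = local_histogram([0, 1, 1], [-0.5, 0.5, 0.5], [0.25, 0.25, 0.25], n_bins=3)
worker_b = local_histogram([2, 2, 0], [0.5, -0.5, -0.5], [0.25, 0.25, 0.25], n_bins=3)

merged = allreduce_sum([worker_a, worker_b])
print(merged)
```

The message size depends on the number of features and bins, not on the number of rows, which is why histogram exchange scales well at first and only becomes the bottleneck as the worker count grows.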

## bindings and ecosystem

XGBoost ships native bindings for several languages, with the C++ core compiled as a shared library that is loaded by each language wrapper. [3]

- **Python** (`xgboost` on PyPI): the dominant interface, including a scikit-learn-compatible API (`XGBClassifier`, `XGBRegressor`, `XGBRanker`) and the lower-level Booster API. [3]
- **R** (`xgboost` on CRAN): a fully featured wrapper that integrates with the `caret` and `tidymodels` packages, substantially reworked into a more idiomatic interface in version 3.0. [3] [5]
- **JVM** (Java, Scala, Kotlin): packaged as `xgboost4j` and `xgboost4j-spark` for Maven/Gradle. [20]
- **Julia** (`XGBoost.jl`): a community-maintained wrapper.
- **Other**: Ruby (`xgboost-ruby`), Swift (`SwiftXGBoost`), Go bindings, and a Rust crate (`xgboost-rs`) all exist with varying maturity.

Within the broader Python ecosystem, XGBoost integrates with [scikit-learn](/wiki/scikit_learn) pipelines, MLflow for experiment tracking, ONNX for model export, SHAP for interpretability (the `shap.TreeExplainer` is highly optimized for XGBoost), and almost every commercial ML platform including AWS SageMaker, Google Vertex AI, Azure Machine Learning, Databricks, and Snowflake. [16]

## How does XGBoost differ from LightGBM and CatBoost?

The table below summarizes how XGBoost compares with the two main alternative implementations of gradient boosted trees and with scikit-learn's built-in `HistGradientBoosting`. All four are considered "production grade" and benchmarks generally rank them within a few percentage points of one another on most tabular tasks.

| feature | XGBoost | LightGBM | CatBoost | scikit-learn HistGB |
|---------|---------|----------|----------|---------------------|
| First release | 2014 | 2016 | 2017 | 2019 (v0.21) |
| Maintainer | DMLC | Microsoft Research | Yandex | scikit-learn community |
| Default tree growth | Level-wise (depth-wise) | Leaf-wise | Symmetric (oblivious) | Level-wise |
| Default split algorithm | Histogram | Histogram | Histogram | Histogram |
| Categorical handling | Native (since 1.5) | Native | Native (best in class) | Limited |
| Missing value handling | Native (sparsity-aware) | Native | Native | Native |
| GPU training | Yes (CUDA, multi-GPU) | Yes (CUDA, OpenCL) | Yes (CUDA, multi-GPU) | No |
| Distributed training | Spark, Dask, Ray | Dask | Spark, Dask | No |
| L1/L2 regularization | Both | Both | L2 only | L2 only |
| Monotonic constraints | Yes | Yes | Yes | Yes |
| Feature interaction constraints | Yes | Yes | No | Yes (since 1.2) |
| Typical relative speed | 1.0x | 2x to 7x faster | 0.5x to 1.0x | 1.0x to 2.0x |
| Best fit | Default tabular baseline; sparse, large data | Wide datasets (many features), large row counts | Heavy categorical features, low-tuning workflows | Pipelines that must stay inside scikit-learn |

The rough heuristic among practitioners is: try XGBoost first as a strong baseline, switch to LightGBM if training time or memory becomes a bottleneck on very large data, and switch to CatBoost when the dataset is dominated by high-cardinality categorical features. In Kaggle competitions a stacked ensemble of all three is still common.

## What is XGBoost used for?

### kaggle and competitive machine learning

XGBoost was the dominant Kaggle algorithm from 2014 through 2018. Most winning solutions used it as the main predictive engine, often stacked with a small neural network or a logistic regression for diversity. [1] After 2018 the share of pure-XGBoost wins decreased as LightGBM and CatBoost matured and as deep learning models started to compete on image and text-heavy challenges, but XGBoost remains a near-universal component in tabular competitions.

### financial risk and credit scoring

The library has been adopted across the financial industry for credit-risk modeling, loan default prediction, anti-money laundering, and trading signal research. The combination of strong out-of-the-box accuracy, monotonic constraints, and SHAP-based interpretability makes it a natural fit for regulated environments where the model must be auditable. [16] Several large banks have moved their credit-decisioning pipelines from logistic regression to XGBoost since 2018.

### click-through rate and ad ranking

Click-through rate (CTR) prediction is the canonical advertising-tech problem: given a user, a context, and a candidate ad, estimate the probability of a click. XGBoost was widely used in CTR models before deep learning architectures (Wide & Deep, DeepFM, DCN, DIN) took over the largest production systems. It is still a common baseline and feature-extraction stage. Hybrid GBDT + LR pipelines (described by Facebook in their 2014 "Practical Lessons from Predicting Clicks on Ads" paper, which used boosted decision trees, and replicated with XGBoost in many follow-on systems) feed tree leaf indices as one-hot features into a downstream linear model. [17]
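The hand-off between the trees and the linear model in such a hybrid pipeline can be sketched as a one-hot encoding of leaf indices. The example assumes the leaf indices have already been extracted and renumbered to run from 0 within each tree; the function name and the toy numbers are illustrative.

```python
def one_hot_leaves(leaf_indices, leaves_per_tree):
    """Turn per-tree leaf indices into one binary feature vector per row."""
    offsets = [0]
    for n_leaves in leaves_per_tree[:-1]:
        offsets.append(offsets[-1] + n_leaves)
    width = sum(leaves_per_tree)

    encoded = []
    for row in leaf_indices:
        vector = [0] * width
        for tree, leaf in enumerate(row):
            vector[offsets[tree] + leaf] = 1   # exactly one active slot per tree
        encoded.append(vector)
    return encoded


# Two rows scored by two trees: the first tree has 2 leaves, the second has 3.
leaf_indices = [[0, 2], [1, 0]]
features = one_hot_leaves(leaf_indices, leaves_per_tree=[2, 3])
print(features)
```

With XGBoost the raw indices could come from `XGBClassifier.apply(X)` or `Booster.predict(dmatrix, pred_leaf=True)`; those calls return node identifiers, so a renumbering step or an off-the-shelf one-hot encoder would be needed before the linear model.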

### churn, fraud, and demand forecasting

Customer churn prediction (telecom, banking, SaaS), credit-card fraud detection, supply-chain demand forecasting, energy load forecasting, and electronic health record (EHR) modeling are all dominated by gradient boosting in practice, and XGBoost is among the most common implementations. Published results on canonical churn datasets often report AUC above 95% for a properly tuned XGBoost model, although such figures depend heavily on the dataset and the evaluation protocol.

### scientific applications

Beyond the original Higgs Boson use case, XGBoost is used in particle physics, astronomy (galaxy classification, photometric redshift), bioinformatics (variant pathogenicity prediction, drug response modeling), epidemiology, and climate science. [13] The combination of fast training on large tabular tables and built-in feature importance makes it attractive whenever a researcher needs both predictive accuracy and a rough sense of which inputs matter.

## Does deep learning beat XGBoost on tabular data?

A recurring debate in the machine learning community is whether deep neural networks can out-perform gradient boosted trees on tabular data. As of 2026 the answer remains, with caveats, no. The 2021 Shwartz-Ziv and Armon paper "Tabular Data: Deep Learning Is Not All You Need" benchmarked XGBoost against TabNet, NODE, and several other tabular deep models on 11 datasets and concluded that XGBoost outperformed every deep model on average and required dramatically less hyperparameter tuning. [11] The 2022 Grinsztajn et al. NeurIPS paper "Why do tree-based models still outperform deep learning on typical tabular data?" reached the same conclusion on a curated benchmark of 45 datasets, finding that tree-based models such as XGBoost remained state of the art "for medium-sized data (around 10K samples)." [12]

Several structural reasons explain the persistent gap. Decision trees are robust to uninformative features (they simply do not split on them), while neural networks must learn to ignore noise. Trees are invariant to monotonic transformations of individual features, so they do not require careful normalization. Trees handle missing values and heavy-tailed distributions natively. Finally, neural networks for tabular data tend to require large amounts of data to compete, whereas most tabular problems live in the small-to-medium data regime. [12]
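The invariance to monotonic transformations can be demonstrated with a one-split tree. The sketch below uses made-up income figures and a hypothetical `best_partition` helper, and shows that taking the logarithm of a feature changes the threshold value but not which rows end up on each side.

```python
import math


def sse(targets, idx):
    """Sum of squared errors around the mean for the rows in idx."""
    mean = sum(targets[i] for i in idx) / len(idx)
    return sum((targets[i] - mean) ** 2 for i in idx)


def best_partition(values, targets):
    """Row indices sent left by the best least-squares single split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_error, best_left = None, None
    for cut in range(1, len(order)):
        left, right = order[:cut], order[cut:]
        if values[left[-1]] == values[right[0]]:
            continue  # no threshold separates equal values
        error = sse(targets, left) + sse(targets, right)
        if best_error is None or error < best_error:
            best_error, best_left = error, frozenset(left)
    return best_left


incomes = [20_000.0, 35_000.0, 50_000.0, 120_000.0, 900_000.0]  # heavy-tailed toy data
targets = [0.1, 0.2, 0.8, 0.9, 1.0]

raw = best_partition(incomes, targets)
logged = best_partition([math.log(v) for v in incomes], targets)
print(raw == logged, sorted(raw))
```

A split depends only on the ordering of the feature values, and a strictly increasing transformation preserves that ordering, so the tree's predictions on the training rows are unchanged.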

That said, deep models close the gap or surpass XGBoost in two specific situations. First, when the dataset is very large (tens of millions of rows or more) and there are meaningful interactions between many features, transformer-based architectures such as TabTransformer, FT-Transformer, and SAINT can produce slightly better results, often at much higher computational cost. Second, in multimodal pipelines where tabular features must be combined with text, image, or sequence data, deep models offer a natural way to fuse modalities; a common pattern there is to use XGBoost on the tabular part and concatenate or stack its predictions with the deep model's outputs.

## limitations

Despite its strengths, XGBoost has well-known weaknesses that practitioners need to be aware of.

- **Limited interpretability of large ensembles**: While SHAP makes per-prediction attribution feasible, an ensemble of a thousand depth-6 trees is not directly inspectable. In regulated domains a simpler model (logistic regression, EBM, decision tree) may still be preferred for top-level reasoning. [16]
- **Hyperparameter sensitivity**: Although less sensitive than deep networks, XGBoost has many hyperparameters and a poorly tuned model can lag a well-tuned one by several percentage points. Automated tuning (Optuna, HyperOpt, Vizier) is standard practice for production use.
- **Not ideal for unstructured data**: Images, raw text, and audio do not have the well-defined feature columns that tree models expect. For these modalities convolutional networks, transformers, and pre-trained foundation models perform far better.
- **Memory and compute footprint at scale**: While XGBoost can train on billion-row datasets with the histogram method and external memory, very large sparse problems (such as ad-click prediction with billions of features) still favor specialized linear models or factorization machines.
- **Sequential dependence**: Each tree depends on the residuals of the previous trees, so boosting itself is inherently sequential. The parallelism is at the per-split level within each tree, not across boosting rounds. This caps the wall-clock speedups that distributed training can achieve, particularly on small to mid-size datasets. [8]
- **Risk of overfitting on noisy or small datasets**: Without strong regularization or early stopping, XGBoost can memorize a small training set. Cross-validation and early stopping on a held-out set are essential.
- **No native time-series handling**: There is no built-in concept of time. Practitioners must hand-engineer lag features, rolling statistics, and seasonal indicators, or pair XGBoost with a specialized library such as Nixtla's MLForecast or Skforecast.
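The hand-engineering mentioned in the last item usually starts with lag and rolling-window features. The sketch below is a minimal plain-Python version with a hypothetical helper name; it builds each row only from values that precede the target, so no future information leaks into the features.

```python
def make_lag_features(series, lags, window):
    """Rows of lagged values plus a rolling mean; the target is the current value."""
    start = max(max(lags), window)
    rows, targets = [], []
    for t in range(start, len(series)):
        lagged = [series[t - lag] for lag in lags]
        rolling_mean = sum(series[t - window:t]) / window   # past values only
        rows.append(lagged + [rolling_mean])
        targets.append(series[t])
    return rows, targets


series = [10, 12, 14, 16, 18, 20]
rows, targets = make_lag_features(series, lags=[1, 2], window=3)
print(rows, targets)
```

The resulting rows form an ordinary tabular dataset that a gradient boosting model can consume; validation splits should still respect time order rather than being drawn at random.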

## version history

The table below lists the major XGBoost releases and the headline features they introduced. Patch releases and pre-1.0 versions are omitted for brevity. Dates follow the GitHub release tag for the corresponding `vX.Y.0` release; the 3.x highlights are drawn from the official release notes. [5] [18]

| version | release date | highlights |
|---------|--------------|------------|
| 0.4 | May 2015 | First Spark integration; PyPI release. |
| 0.6 | July 2016 | First R CRAN release; faster Booster; better Python sklearn API. |
| 0.7 | December 2017 | First CUDA GPU implementation (`gpu_hist`). |
| 0.8 | August 2018 | DART booster generalizations; multi-GPU support. |
| 0.9 | May 2019 | One-versus-rest classification; gpu_hist as default GPU method. |
| 1.0 | February 2020 | Adopted semantic versioning; improved Spark integration; pickle compatibility. |
| 1.1 | May 2020 | Sparse-matrix performance improvements. |
| 1.2 | August 2020 | Improved categorical interface; better gpu_hist scalability. |
| 1.3 | December 2020 | Experimental categorical support; default device API previewed. |
| 1.4 | April 2021 | Refined categorical handling; deterministic GPU training. |
| 1.5 | October 2021 | Native categorical features in `tree_method="hist"`. |
| 1.6 | April 2022 | Streamlined Dask interface. |
| 1.7 | October 2022 | Experimental PySpark estimators (`xgboost.spark`); PyPI macOS arm64 wheels. |
| 2.0 | September 2023 | Vector leaves for multi-target regression; new federated learning interface; quantile regression objective; default `tree_method="hist"`; `device` parameter replaces `gpu_hist`. |
| 2.1 | June 2024 | Improved external memory; better column sampling. |
| 3.0 | February 2025 | Major external-memory rework with `ExtMemQuantileDMatrix`; redesigned R package; restructured Spark/JVM packages; reduced GPU memory use. |
| 3.1 | September 2025 | Categorical re-coder that keeps encodings consistent at inference; adaptive CUDA external-memory cache; CUDA 13 wheels (`xgboost-cu13`). |
| 3.2 | February 2026 | Vector-leaf multi-target tree progress (reduced-gradient "sketch boost"); enhanced GPU external-memory training; removal of the deprecated CLI. |

## installation and quick start

The Python package is the most common entry point and is installed with `pip install xgboost` or `conda install -c conda-forge xgboost`. [3] A minimal binary classification example using the scikit-learn wrapper looks like this:

```
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=42
)

model = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.05,
    objective="binary:logistic",
    eval_metric="logloss",
    early_stopping_rounds=20,
    tree_method="hist",
    device="cpu",  # set to "cuda" for GPU training
)
# The held-out split doubles as the early-stopping set here for brevity;
# use a separate validation split when reporting final metrics.
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

# score() returns accuracy, so compute AUC from the predicted probabilities.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print("AUC:", auc)
```

The lower-level Booster API offers more control and is preferred for production deployments and custom training loops:

```
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "max_depth": 6,
    "eta": 0.05,
    "tree_method": "hist",
    "device": "cpu",  # set to "cuda" for GPU training
}

booster = xgb.train(
    params,
    dtrain,
    num_boost_round=2000,
    evals=[(dtrain, "train"), (dtest, "valid")],
    early_stopping_rounds=50,
    verbose_eval=100,
)
```

## Interpretability and model debugging

XGBoost exposes several tools for understanding what a trained model is doing. [3]

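- `model.feature_importances_` returns a single importance array whose flavor is set by the estimator's `importance_type` parameter, while `Booster.get_score(importance_type=...)` returns any flavor on demand: `weight` (number of splits that use the feature), `gain` (average loss reduction from those splits), and `cover` (average sum of Hessians of the samples reaching those splits), plus the `total_gain` and `total_cover` variants. [3]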
- `xgb.plot_importance` and `xgb.plot_tree` visualize feature importance and individual trees. [3]
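- The SHAP library provides exact, fast Shapley values for tree ensembles via `shap.TreeExplainer`. It supports XGBoost models directly (alongside other tree libraries) and is typically fast enough to compute per-prediction feature attributions interactively; XGBoost also exposes the same attributions natively through `pred_contribs=True` in `Booster.predict`. [16]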
- The `pred_interactions=True` option in `Booster.predict` returns the SHAP interaction matrix, a richer attribution that splits each feature's contribution across pairwise interactions. [16]

## Relationship to the broader gradient boosting family

XGBoost belongs to the family of gradient boosting machines first formalized by Jerome Friedman in 2001. [8] Its closest cousins are:

- The original `gbm` R package (Greg Ridgeway, 2003), which was the first widely used implementation but was slow and single-threaded.
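- Scikit-learn's `GradientBoostingClassifier` and `GradientBoostingRegressor` (introduced 2010), which use exact greedy split finding; for larger datasets scikit-learn now recommends its histogram-based `HistGradientBoostingClassifier` and `HistGradientBoostingRegressor`.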
- [LightGBM](/wiki/lightgbm) (Microsoft, 2016), a histogram-based implementation with leaf-wise growth optimized for very wide datasets. [9]
- [CatBoost](/wiki/catboost) (Yandex, 2017), an ordered-boosting implementation with native handling of categorical features. [10]
- AdaBoost (Freund and Schapire, 1995), which predates gradient boosting, and [random forests](/wiki/random_forest) (Breiman, 2001) are closely related ensemble methods that complement it on tabular tasks.

## Is XGBoost open source?

XGBoost is released under the Apache License 2.0, a permissive open-source license. [4] Development is coordinated through the `dmlc/xgboost` GitHub repository, with a small set of committers drawn from academia (Carnegie Mellon, University of Washington), large tech companies (NVIDIA, AWS, Microsoft, Yandex), and the open-source community. [4] The project follows a lightweight RFC process for major design changes and uses GitHub Actions for continuous integration across Linux, macOS, and Windows on x86-64 and ARM64 architectures.

## See also

- [Gradient boosted (decision) trees](/wiki/gradient_boosted_decision_trees_gbt)
- [Gradient boosting](/wiki/gradient_boosting)
- [LightGBM](/wiki/lightgbm)
- [CatBoost](/wiki/catboost)
- [Decision tree](/wiki/decision_tree)
- [Random forest](/wiki/random_forest)
- [Scikit-learn](/wiki/scikit_learn)
- [Machine learning](/wiki/machine_learning)
- [Kaggle](/wiki/kaggle)

## References

1. Chen, Tianqi, and Carlos Guestrin. "XGBoost: A Scalable Tree Boosting System." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016): 785-794. https://dl.acm.org/doi/10.1145/2939672.2939785
2. Chen, Tianqi, and Carlos Guestrin. "XGBoost: A Scalable Tree Boosting System." arXiv preprint arXiv:1603.02754 (2016). https://arxiv.org/abs/1603.02754
3. XGBoost official documentation. https://xgboost.readthedocs.io/
4. dmlc/xgboost GitHub repository. https://github.com/dmlc/xgboost
5. XGBoost release notes and version history. https://xgboost.readthedocs.io/en/stable/changes/index.html
6. Tianqi Chen. Personal home page, Carnegie Mellon University. https://tqchen.com/
7. Tianqi Chen. Faculty profile, CMU School of Computer Science. https://www.csd.cmu.edu/people/faculty/tianqi-chen
8. Friedman, Jerome H. "Greedy Function Approximation: A Gradient Boosting Machine." Annals of Statistics 29.5 (2001): 1189-1232.
9. Ke, Guolin, et al. "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." Advances in Neural Information Processing Systems 30 (2017).
10. Prokhorenkova, Liudmila, et al. "CatBoost: unbiased boosting with categorical features." Advances in Neural Information Processing Systems 31 (2018).
11. Shwartz-Ziv, Ravid, and Amitai Armon. "Tabular Data: Deep Learning is Not All You Need." arXiv preprint arXiv:2106.03253 (2021). https://arxiv.org/abs/2106.03253
12. Grinsztajn, Léo, Edouard Oyallon, and Gaël Varoquaux. "Why do tree-based models still outperform deep learning on typical tabular data?" Advances in Neural Information Processing Systems 35 (2022). https://arxiv.org/abs/2207.08815
13. Higgs Boson Machine Learning Challenge. Kaggle, 2014. https://www.kaggle.com/c/higgs-boson
14. Anyscale. "Introducing Distributed XGBoost Training with Ray." https://www.anyscale.com/blog/distributed-xgboost-training-with-ray
15. NVIDIA RAPIDS. "GPU-Accelerated XGBoost." https://developer.nvidia.com/rapids
16. SHAP documentation. "TreeExplainer." https://shap.readthedocs.io/
17. He, Xinran, et al. "Practical Lessons from Predicting Clicks on Ads at Facebook." Proceedings of the Eighth International Workshop on Data Mining for Online Advertising (2014).
18. XGBoost 3.0.0, 3.1.0, and 3.2.0 release notes. https://xgboost.readthedocs.io/en/latest/changes/
19. American Statistical Association John M. Chambers Statistical Software Award. https://community.amstat.org/jointscsg-section/awards/john-m-chambers
20. XGBoost JVM packages and Apache Spark integration. https://xgboost.readthedocs.io/en/stable/jvm/