LightGBM
Last reviewed
Sources
10 citations
Review status
Source-backed
Revision
v3 · 4,349 words
LightGBM (short for Light Gradient-Boosting Machine) is a free and open-source gradient boosting framework that trains ensembles of decision trees on tabular data, originally developed at Microsoft Research by Guolin Ke and colleagues and first released on GitHub in October 2016. Its defining contributions, the Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) techniques introduced in the 2017 NeurIPS paper LightGBM: A Highly Efficient Gradient Boosting Decision Tree, let it, in the authors' words, "speed up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy" [1]. The framework is designed for distributed and high-performance training, and is widely used for classification, regression, ranking, and many other supervised learning tasks. LightGBM is one of the three dominant gradient boosting libraries in modern practice, alongside XGBoost and CatBoost, and it is consistently one of the most-used algorithms in Kaggle competitions and in production machine learning systems for tabular data.
LightGBM differs from earlier gradient boosting implementations through four interlocking design choices: a histogram-based split-finding algorithm that bins continuous features into a small number of integer buckets, a leaf-wise (best-first) tree growth strategy that grows whichever leaf will most reduce the loss, GOSS, which subsamples training rows by gradient magnitude, and EFB, which fuses sparse, mutually exclusive features into denser bundles. Together these techniques cut training time by roughly an order of magnitude relative to pre-2017 implementations of gradient boosted decision trees (GBDT) at comparable accuracy. The original NeurIPS paper reports per-iteration speedups of 21x, 6x, 1.6x, 14x, and 13x over a histogram-based LightGBM baseline without GOSS and EFB on the Allstate, Flight Delay, Microsoft LETOR, KDD CUP 2010, and KDD CUP 2012 datasets respectively, with no measurable loss in test accuracy [1].
LightGBM is a general-purpose supervised learning library for tabular (structured, row-and-column) data. It is applied to ranking (search relevance, recommendation reranking), binary and multiclass classification (fraud detection, churn, click-through-rate prediction), and regression (demand forecasting, pricing, risk scoring). It is not used directly on raw images, audio, or free text, which are the domain of deep neural networks, although LightGBM is frequently trained on engineered features derived from those modalities. Its combination of fast training, low memory use, native categorical handling, and competitive accuracy makes it a default first model for tabular problems and a workhorse in Kaggle competition pipelines.
LightGBM began as an internal Microsoft Research Asia project around 2015, motivated by the limitations of existing GBDT implementations on the very large click-through-rate, search ranking, and advertising datasets that Microsoft was running in production. The lead author, Guolin Ke, was at the time a researcher at Microsoft Research Asia working with Tie-Yan Liu's group on machine learning for search and recommendation. Early prototypes targeted speed on Microsoft's own learning-to-rank workloads, which routinely involved hundreds of millions of rows and tens of thousands of features. The first public release on GitHub was made in October 2016, initially under the microsoft/LightGBM repository as part of Microsoft's Distributed Machine Learning Toolkit (DMTK) umbrella project [2].
Key early milestones:
The paper formalized the GOSS and EFB algorithms and provided the first peer-reviewed benchmarks against XGBoost. It has since been cited tens of thousands of times and is one of the most cited applied machine learning papers of the late 2010s. The project subsequently grew well beyond its DMTK origin and became one of Microsoft's most popular open source projects. In March 2026 it was transferred from microsoft/LightGBM to a community-governed lightgbm-org/LightGBM organization, a move suggested by Microsoft's Open Source Conduct Team to establish the project's identity as an authoritative source independent of Microsoft's organization structure; it remains MIT-licensed and continues under the same core maintainers, including the project's creator [9].
LightGBM combines four core ideas. None of them was, strictly speaking, invented for LightGBM, but the engineering combination is what gives the library its characteristic speed-vs-accuracy profile. The paper frames the central problem as follows: in conventional GBDT, "for each feature, they need to scan all the data instances to estimate the information gain of all possible split points, which is very time consuming," so both GOSS and EFB attack that scan by shrinking either the number of rows or the number of features [1].
Like pre-sorted GBDT implementations, LightGBM looks for the best split point on each feature at each node. Unlike the classical pre-sorted algorithm, it first bins each continuous feature into a fixed number of integer buckets (controlled by max_bin, default 255). Histograms of gradients and Hessians are then accumulated per bin per leaf. Building a histogram costs O(#data), but once it is built, finding the best split is an O(#bins) scan rather than an O(#data) scan. The histograms also use far less memory than per-row sorted indices, which is one of the reasons LightGBM has a substantially lower memory footprint than pre-sorted XGBoost. A subtler but important optimization is histogram subtraction: LightGBM builds the histogram only for the child with fewer rows and obtains its sibling's histogram by subtracting that from the parent's in O(#bins), so each split scans at most half of the parent's rows [3].
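As a rough illustration of the histogram idea, the sketch below accumulates per-bin gradient and Hessian sums, scans the bins for the best threshold, and recovers a sibling histogram by subtraction. The function names, the regularization constant lam, and the second-order gain formula are assumptions chosen for illustration; this is not LightGBM's internal code.

```python
def build_histogram(bins, grads, hess, n_bins):
    """Accumulate gradient and Hessian sums per bin in one O(#data) pass."""
    hist_g = [0.0] * n_bins
    hist_h = [0.0] * n_bins
    for b, g, h in zip(bins, grads, hess):
        hist_g[b] += g
        hist_h[b] += h
    return hist_g, hist_h


def best_split(hist_g, hist_h, lam=1.0):
    """Scan bin boundaries in O(#bins) and return (best_bin, best_gain).

    Rows with bin <= best_bin go to the left child. The score
    G^2 / (H + lam) is an assumed, commonly used second-order gain.
    """
    total_g, total_h = sum(hist_g), sum(hist_h)
    parent_score = total_g ** 2 / (total_h + lam)
    best_bin, best_gain = None, 0.0
    acc_g = acc_h = 0.0
    for b in range(len(hist_g) - 1):
        acc_g += hist_g[b]
        acc_h += hist_h[b]
        rest_g, rest_h = total_g - acc_g, total_h - acc_h
        gain = (acc_g ** 2 / (acc_h + lam)
                + rest_g ** 2 / (rest_h + lam)
                - parent_score)
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


def subtract_histogram(parent, child):
    """Sibling histogram = parent histogram minus the child already built."""
    return [p - c for p, c in zip(parent, child)]


# Toy data: eight rows that have already been mapped to four integer bins.
bins = [0, 0, 1, 1, 2, 2, 3, 3]
grads = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
hess = [1.0] * 8

parent_g, parent_h = build_histogram(bins, grads, hess, n_bins=4)
split_bin, split_gain = best_split(parent_g, parent_h)

# Build one child's histogram directly and derive the other by subtraction.
left_rows = [i for i, b in enumerate(bins) if b <= split_bin]
left_g, left_h = build_histogram(
    [bins[i] for i in left_rows],
    [grads[i] for i in left_rows],
    [hess[i] for i in left_rows],
    n_bins=4,
)
right_g = subtract_histogram(parent_g, left_g)
```

The split scan touches only the four bins, never the eight rows, which is the source of the O(#bins) cost described above.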
Most earlier GBDT implementations grow trees level-wise (also called depth-wise): every leaf at the current depth is split before moving on to the next depth. LightGBM instead grows trees leaf-wise: at every step, it picks whichever leaf in the entire tree will yield the largest loss reduction when split, and grows only that leaf. The official documentation states the underlying rationale directly: "Holding #leaf fixed, leaf-wise algorithms tend to achieve lower loss than level-wise algorithms" [3]. The downside is a propensity to overfit on small datasets, since leaf-wise growth can produce very deep, unbalanced trees. LightGBM mitigates this with the max_depth, num_leaves, and min_data_in_leaf parameters, which together cap tree complexity [3].
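The best-first selection can be sketched with a priority queue. In this toy version the gain of splitting each leaf is supplied up front in a hypothetical split_info table; a real learner would compute it with a split search. All names here are illustrative assumptions.

```python
import heapq


def grow_leaf_wise(root, split_info, num_leaves):
    """Repeatedly split the leaf with the largest gain until num_leaves is hit.

    split_info maps a leaf name to (gain, left_child, right_child).
    Leaves missing from split_info, or with non-positive gain, are never split.
    Returns (split_order, final_leaves).
    """
    leaves = [root]
    heap = []  # max-heap emulated by pushing negated gains

    def push(leaf):
        if leaf in split_info and split_info[leaf][0] > 0:
            heapq.heappush(heap, (-split_info[leaf][0], leaf))

    push(root)
    split_order = []
    while heap and len(leaves) < num_leaves:
        _, leaf = heapq.heappop(heap)
        _, left, right = split_info[leaf]
        leaves.remove(leaf)
        leaves.extend([left, right])
        split_order.append(leaf)
        push(left)
        push(right)
    return split_order, leaves


# Toy gains: the left subtree keeps offering larger gains than the right one.
split_info = {
    "root": (10.0, "A", "B"),
    "A": (8.0, "A1", "A2"),
    "B": (1.0, "B1", "B2"),
    "A1": (5.0, "A1a", "A1b"),
    "A2": (0.5, "A2a", "A2b"),
}
split_order, final_leaves = grow_leaf_wise("root", split_info, num_leaves=4)
```

With a budget of four leaves, this procedure splits root, A, and then A1, leaving B unsplit. A level-wise learner would have split B before descending to A1, which is why leaf-wise trees can become deep and unbalanced.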
GOSS is the first of the two novel techniques introduced in the LightGBM paper. The intuition is that data points with large gradients (in absolute value) carry more information about where the model is currently underfitting than points with small gradients. Rather than discard the small-gradient points entirely, GOSS keeps the top a fraction of points (by gradient magnitude), randomly samples a further b fraction of the dataset from the remaining points, and upweights those random samples by (1 - a) / b when computing information gain, so that the data distribution remains approximately unbiased. The paper proves that "since the data instances with larger gradients play a more important role in the computation of information gain, GOSS can obtain quite accurate estimation of the information gain with a much smaller data size" [1]. In the paper's own experiments the sampling rates were set to a = 0.05, b = 0.05 for the Allstate, KDD10, and KDD12 datasets and a = 0.1, b = 0.1 for Flight Delay and LETOR, and the authors report that GOSS alone delivers nearly a 2x speedup while remaining more accurate than ordinary Stochastic Gradient Boosting at the same sampling ratio [1].
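A minimal sketch of the sampling step follows, using only the standard library. The function name goss_sample, the seed argument, and the dictionary return type are assumptions made for illustration; only the top-a / random-b selection and the (1 - a) / b weight come from the description above.

```python
import random


def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Return {row_index: weight} for the rows kept by one GOSS round.

    Rows in the top `a` fraction by |gradient| are always kept with weight 1.
    A further `b` fraction of the dataset is drawn at random from the rest
    and upweighted by (1 - a) / b.
    """
    n = len(gradients)
    top_n = round(a * n)
    rand_n = round(b * n)

    # Row indices sorted by gradient magnitude, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top_idx, rest_idx = order[:top_n], order[top_n:]

    rng = random.Random(seed)
    rand_idx = rng.sample(rest_idx, rand_n)

    amplify = (1.0 - a) / b
    weights = {i: 1.0 for i in top_idx}
    weights.update({i: amplify for i in rand_idx})
    return weights


# Toy gradients whose magnitude equals the row index (0..99).
gradients = [(-1) ** i * float(i) for i in range(100)]
weights = goss_sample(gradients, a=0.2, b=0.1)
```

Here 30 of 100 rows survive: the 20 rows with the largest gradients at weight 1 and 10 random small-gradient rows at weight 8, so the small-gradient group still contributes roughly its original total mass to the gain estimate.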
EFB targets high-dimensional sparse feature spaces, the kind that arises after one-hot encoding categorical variables in click logs or text data. The observation is that in such data, many features are mutually exclusive: they almost never take nonzero values on the same row. EFB greedily groups such features into bundles, mapping each bundle to a single new feature whose values are computed by offsetting each member feature's values into a unique subrange, which reduces the complexity of histogram building from O(#data x #feature) to O(#data x #bundle) [1]. The paper proves that "finding the optimal bundling of exclusive features is NP-hard, but a greedy algorithm can achieve quite good approximation ratio," reducing it to a graph-coloring problem [1]. LightGBM therefore uses a graph-coloring heuristic that allows a small conflict ratio (the fraction of rows on which two bundled features are both nonzero). EFB reduces feature count, and therefore histogram-construction time, by roughly the bundling factor, often with negligible loss in accuracy [1].
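The two EFB steps, grouping features and merging a group into one column, can be sketched as follows. This is a simplified illustration: features are processed in the order given rather than in the ordering used by the paper, conflicts are counted against the union of rows already in a bundle, and all function names are hypothetical.

```python
def greedy_bundles(features, max_conflicts=0):
    """Greedily group features whose nonzero rows (almost) never overlap.

    features[j][i] is the bin of feature j on row i, with 0 meaning "zero".
    Returns a list of bundles, each a list of feature indices.
    """
    bundles = []  # each entry: (member_indices, set_of_nonzero_rows)
    for j, column in enumerate(features):
        rows = {i for i, value in enumerate(column) if value != 0}
        for members, nonzero in bundles:
            if len(nonzero & rows) <= max_conflicts:
                members.append(j)
                nonzero |= rows
                break
        else:
            bundles.append(([j], set(rows)))
    return [members for members, _ in bundles]


def merge_bundle(columns, bin_counts):
    """Merge exclusive columns into one by shifting each into its own subrange.

    bin_counts[k] is the number of nonzero bins of columns[k]. Bin 0 of the
    merged column means "every member feature is zero on this row".
    """
    offsets, total = [], 0
    for count in bin_counts:
        offsets.append(total)
        total += count
    merged = [0] * len(columns[0])
    for column, offset in zip(columns, offsets):
        for i, value in enumerate(column):
            if value != 0:
                merged[i] = value + offset
    return merged, offsets


# Three sparse features over four rows; f2 collides with both f0 and f1.
f0 = [1, 0, 3, 0]  # nonzero bins 1..3
f1 = [0, 2, 0, 0]  # nonzero bins 1..2
f2 = [1, 1, 0, 0]
bundles = greedy_bundles([f0, f1, f2])
merged, offsets = merge_bundle([f0, f1], bin_counts=[3, 2])
```

After merging, bins 1 to 3 of the bundle encode f0 and bins 4 to 5 encode f1, so a histogram over the single merged column carries the same split candidates as the two original histograms.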
LightGBM was one of the first major GBDT libraries to support categorical features without one-hot encoding. When a feature is declared categorical, LightGBM uses an algorithm based on Fisher (1958) to find the optimal partition of categories into two subsets at each split. Internally, it sorts the histogram of categories by sum_gradient / sum_hessian and then walks down the sorted order, which gives O(k log k) split finding rather than the O(2^k) of brute-force category partitioning [4]. This typically produces better trees than one-hot encoding, especially for high-cardinality categories such as user IDs, ZIP codes, or product SKUs [4].
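The sort-then-scan idea can be sketched in a few lines. The function name, the input layout, and the gain formula with regularization constant lam are assumptions for illustration; LightGBM's actual implementation adds smoothing and other constraints that are omitted here.

```python
def best_categorical_split(cat_stats, lam=1.0):
    """Find a two-subset partition of categories by sorting, then scanning.

    cat_stats maps category -> (sum_gradient, sum_hessian), with positive
    Hessian sums. Categories are ordered by sum_gradient / sum_hessian and
    only the k - 1 prefixes of that ordering are evaluated.
    Returns (left_categories, gain).
    """
    order = sorted(cat_stats, key=lambda c: cat_stats[c][0] / cat_stats[c][1])
    total_g = sum(g for g, _ in cat_stats.values())
    total_h = sum(h for _, h in cat_stats.values())
    parent_score = total_g ** 2 / (total_h + lam)

    best_left, best_gain = set(), 0.0
    acc_g = acc_h = 0.0
    for k, category in enumerate(order[:-1], start=1):
        g, h = cat_stats[category]
        acc_g += g
        acc_h += h
        rest_g, rest_h = total_g - acc_g, total_h - acc_h
        gain = (acc_g ** 2 / (acc_h + lam)
                + rest_g ** 2 / (rest_h + lam)
                - parent_score)
        if gain > best_gain:
            best_left, best_gain = set(order[:k]), gain
    return best_left, best_gain


# Four categories with per-category gradient and Hessian sums.
cat_stats = {
    "a": (-4.0, 2.0),
    "b": (3.0, 2.0),
    "c": (-3.0, 2.0),
    "d": (4.0, 2.0),
}
left_categories, gain = best_categorical_split(cat_stats)
```

With four categories a brute-force search would consider every subset, whereas the sorted scan evaluates only three prefixes and here groups the two negative-gradient categories together.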
LightGBM is implemented in C++ for performance, with thin language bindings for Python, R, C, and a community-maintained Daal4py interface. Trees are stored compactly as integer-bin split conditions. The library supports four boosting modes (boosting_type):
| boosting_type | Algorithm | Notes |
|---|---|---|
| gbdt | Standard gradient boosting (default) | Highest accuracy in most benchmarks. |
| dart | Dropouts meet Multiple Additive Regression Trees | Adds dropout-style regularization to the boosting ensemble. |
| goss | GOSS-only without DART/RF | Faster training; small accuracy hit. |
| rf | Random Forest mode | Bagged trees instead of boosted. |
Objective functions cover the standard supervised tasks:
- regression (L2), regression_l1, huber, fair, poisson, quantile, mape, gamma, tweedie.
- binary with logistic loss.
- multiclass (softmax), multiclassova (one-vs-all).
- cross_entropy, cross_entropy_lambda.
- lambdarank, rank_xendcg.

Custom objectives can be supplied as Python callables returning gradient and Hessian arrays.
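As a hedged illustration of what such a callable looks like, the sketch below re-implements logistic loss by hand. It assumes NumPy is installed and that the callable receives raw (pre-sigmoid) scores plus a Dataset-like object exposing get_label(); the function names are hypothetical, and how the callable is handed to the trainer differs between LightGBM versions, so check the documentation for the release in use.

```python
import numpy as np


def logistic_grad_hess(raw_scores, labels):
    """Gradient and Hessian of binary log loss with respect to raw scores."""
    prob = 1.0 / (1.0 + np.exp(-raw_scores))
    grad = prob - labels
    hess = prob * (1.0 - prob)
    return grad, hess


def lgb_logistic_objective(preds, train_data):
    """Adapter with the (preds, train_data) shape of a custom objective.

    `train_data` only needs a get_label() method, as lightgbm.Dataset has.
    """
    return logistic_grad_hess(preds, train_data.get_label())


# Quick check at a raw score of 0, where the predicted probability is 0.5.
grad, hess = logistic_grad_hess(np.array([0.0, 0.0]), np.array([1.0, 0.0]))
```

At a raw score of zero the gradient is -0.5 for a positive label and 0.5 for a negative one, with a Hessian of 0.25 in both cases, which matches the closed-form derivatives of log loss.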
The original NeurIPS 2017 paper benchmarked LightGBM against XGBoost on five public datasets, listed below with the row and feature counts reported in the paper [1]:
| Dataset | Rows | Features | Task | Metric |
|---|---|---|---|---|
| Allstate | 12M | 4,228 | Binary classification | AUC |
| Flight Delay | 10M | 700 | Binary classification | AUC |
| LETOR (Microsoft LTR) | 2M | 136 | Ranking | NDCG |
| KDD CUP 2010 | 19M | 29M | Binary classification | AUC |
| KDD CUP 2012 | 119M | 54M | Binary classification | AUC |
All experiments ran on a Linux server with two Intel E5-2670 v3 CPUs (24 cores total) and 256GB of memory, with the thread count fixed at 16 [1]. On these datasets, LightGBM with GOSS and EFB enabled was reported to be up to over 20x faster than XGBoost's pre-sorted (xgb_exa) algorithm and roughly 2x faster than XGBoost's histogram (xgb_his) algorithm where the latter could run, while matching test accuracy to within roughly 0.1% AUC [1]. The histogram version of XGBoost ran out of memory on both KDD CUP datasets, whereas LightGBM completed them, which the paper attributes to histogram-based split finding and EFB lowering memory consumption [1].
Later independent benchmarks have largely confirmed the original results in the regimes the paper targeted. A 2023 benchmarking study by Florek and Zagdanski on a broad suite of tabular classification problems found LightGBM to be the fastest of the three major libraries, often by a factor of 2x-7x over XGBoost and roughly 2x over CatBoost, while accuracy differences across the three were typically within one or two percent [5]. The general pattern that has emerged from years of experimentation is:
The table below summarizes the rough comparison reported across multiple benchmark studies:
| Property | LightGBM | XGBoost | CatBoost |
|---|---|---|---|
| Training speed (large data) | Fastest | Medium | Fast |
| Memory footprint | Lowest | Medium-High | Medium |
| Default-hyperparameter accuracy | Good | Good | Best |
| Categorical features (native) | Yes (integer-coded) | No (needs encoding; histogram aware in v1.5+) | Yes (ordered target encoding) |
| Missing value handling | Yes | Yes | Yes |
| GPU training | Yes (OpenCL + CUDA) | Yes (CUDA) | Yes (CUDA) |
| Distributed training | Yes (Dask, MPI, Spark via SynapseML) | Yes (Dask, Spark via XGBoost4J-Spark) | Yes (Spark) |
| Trees grown | Leaf-wise | Level-wise (also leaf-wise option) | Symmetric (oblivious) |
| First public release | 2016 | 2014 | 2017 |
| License | MIT | Apache 2.0 | Apache 2.0 |
No single library dominates on every benchmark, and choice of framework is usually a function of dataset size, sparsity, categorical cardinality, and deployment constraints. In practice, many teams test all three and pick whichever performs best on their validation set.
LightGBM exposes more than a hundred parameters, but only about a dozen are routinely tuned in practice. The official tuning guide recommends adjusting num_leaves, min_data_in_leaf, and max_depth to control tree complexity, and learning_rate together with num_iterations to control the boosting trajectory [3].
| Parameter | Default | Purpose | Typical tuning range |
|---|---|---|---|
| num_leaves | 31 | Max leaves per tree (the primary complexity knob in leaf-wise growth) | 15-255 |
| max_depth | -1 (no limit) | Cap on tree depth; useful guardrail against overfitting | 6-12 |
| learning_rate | 0.1 | Shrinkage applied to each tree's contribution | 0.01-0.3 |
| n_estimators / num_iterations | 100 | Number of boosting rounds | 100-10000 (with early stopping) |
| min_data_in_leaf | 20 | Minimum samples per leaf; prevents tiny leaves | 5-200 |
| feature_fraction | 1.0 | Fraction of features sampled per tree (column subsampling) | 0.6-1.0 |
| bagging_fraction | 1.0 | Fraction of rows sampled per iteration | 0.6-1.0 |
| bagging_freq | 0 | How often to re-sample rows (0 = never) | 1-10 |
| lambda_l1 | 0.0 | L1 regularization on leaf weights | 0-10 |
| lambda_l2 | 0.0 | L2 regularization on leaf weights | 0-10 |
| min_gain_to_split | 0.0 | Minimum loss reduction to make a split | 0-1 |
| max_bin | 255 | Histogram bin count; lower = faster, less precise | 63-512 |
| cat_smooth | 10 | Smoothing applied to categorical splits | 5-100 |
| early_stopping_round | 0 | Stop if validation metric does not improve for N rounds | 10-100 |
| boosting_type | gbdt | Choice of boosting algorithm (gbdt, dart, goss, rf) | depends on task |
| objective | regression / binary / multiclass | Loss function | task-dependent |
A typical tuning recipe is to start from defaults, fix learning_rate at 0.05 and n_estimators to a large number with early stopping, then sweep num_leaves (often jointly constrained as num_leaves <= 2^max_depth), min_data_in_leaf, and the bagging and feature fractions. Bayesian optimization tools such as Optuna and Hyperopt have first-class LightGBM integrations for this loop.
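The recipe above can be turned into a simple random-search sampler. The sketch below only generates candidate parameter dictionaries and leaves the training loop to the reader; the ranges are taken from the table, while the function name and the n_estimators value of 5000 are hypothetical choices.

```python
import random


def sample_candidate(rng):
    """Draw one parameter set that respects num_leaves <= 2 ** max_depth."""
    max_depth = rng.randint(6, 12)
    # Cap num_leaves by both the table's upper range and the depth limit.
    num_leaves = rng.randint(15, min(255, 2 ** max_depth))
    return {
        "learning_rate": 0.05,   # fixed, as in the recipe
        "n_estimators": 5000,    # large on purpose; rely on early stopping
        "max_depth": max_depth,
        "num_leaves": num_leaves,
        "min_data_in_leaf": rng.randint(5, 200),
        "feature_fraction": rng.uniform(0.6, 1.0),
        "bagging_fraction": rng.uniform(0.6, 1.0),
        "bagging_freq": rng.randint(1, 10),
    }


rng = random.Random(42)
candidates = [sample_candidate(rng) for _ in range(20)]
```

Each dictionary can then be passed to an estimator and scored on a validation set; sampling max_depth first keeps every candidate inside the joint constraint instead of rejecting invalid pairs afterwards.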
LightGBM ships two parallel Python APIs: a native lightgbm.train function operating on lightgbm.Dataset objects, and a scikit-learn-compatible API in the lightgbm.sklearn module. The scikit-learn API exposes three estimator classes:
- LGBMClassifier, with default objective binary or multiclass depending on the target.
- LGBMRegressor, with default objective regression.
- LGBMRanker, with default objective lambdarank.

These classes implement fit and predict (LGBMClassifier additionally provides predict_proba) and can therefore be dropped into scikit-learn Pipeline, GridSearchCV, RandomizedSearchCV, and similar utilities. A minimal training example:
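```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the example runs end to end; substitute your own X, y.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = lgb.LGBMClassifier(
    n_estimators=2000,     # upper bound; early stopping picks the actual count
    learning_rate=0.05,
    num_leaves=63,
    feature_fraction=0.9,  # alias of colsample_bytree
    bagging_fraction=0.8,  # alias of subsample
    bagging_freq=5,        # alias of subsample_freq
)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    # Stop when the validation metric has not improved for 50 rounds.
    callbacks=[lgb.early_stopping(stopping_rounds=50)],
)
preds = model.predict_proba(X_val)  # class probabilities, one column per class
```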
The native Dataset API is slightly faster on very large data because it avoids one extra pass to build histograms, and gives finer control over categorical features and weights. It is the API used by most of the LightGBM examples in the official documentation [6].
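LightGBM was designed with distributed training in mind from the start, which is one of the reasons it was originally released under the DMTK umbrella. The library supports three modes of distributed parallelism, selected with the tree_learner parameter: feature parallel, data parallel, and voting parallel.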
For practical deployment, LightGBM integrates with Dask through lightgbm.dask, with Apache Spark via Microsoft's SynapseML library, with Ray via lightgbm-ray, and with raw MPI for HPC environments. Distributed training is one area where LightGBM remains differentiated from its competitors: feature-parallel training and voting parallelism are first-class concepts.
GPU training is supported via OpenCL (the original GPU backend, contributed by Huan Zhang and others) and via CUDA (added in v3.x and substantially expanded in v4.x). The OpenCL backend works on AMD, Intel, and NVIDIA GPUs; the CUDA backend is faster on NVIDIA hardware, and as of v4.6 supports NVIDIA's Blackwell architecture [8][9]. The official tuning guide recommends max_bin = 63 and single-precision arithmetic (gpu_use_dp = false) on GPUs, since most consumer GPUs have weak double-precision throughput. On large dense datasets the GPU backend can be 2x to 10x faster than CPU training [8].
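A parameter dictionary reflecting those GPU recommendations might look like the sketch below. It assumes a LightGBM build compiled with GPU support; the objective value is a placeholder chosen for illustration, and the dictionary is only a configuration fragment, not a full training script.

```python
# Hypothetical parameter set for GPU training, following the guidance above.
gpu_params = {
    "objective": "binary",  # placeholder task; use whatever fits the data
    "device_type": "gpu",   # OpenCL backend; "cuda" selects the CUDA backend
    "max_bin": 63,          # fewer bins, as recommended for GPUs
    "gpu_use_dp": False,    # single precision; used by the OpenCL backend
}
```

Such a dictionary would be passed as the params argument of the native training function, alongside the usual tree and learning-rate settings.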
LightGBM provides first-class bindings for a wide range of environments:
| Binding | Maintainer | Notes |
|---|---|---|
| C++ core | LightGBM team | The reference implementation. |
| Python (lightgbm) | LightGBM team | Native and scikit-learn APIs; Dask integration. |
| R (lightgbm) | LightGBM team | CRAN package; full feature parity for tabular tasks. |
| C API | LightGBM team | Used by all language bindings as the FFI surface. |
| Daal4py | Intel | Optimized inference path on Intel CPUs. |
| .NET (Microsoft.ML.LightGbm) | Microsoft | Part of the ML.NET package family. |
| SynapseML | Microsoft | Spark and Synapse Analytics integration. |
| Treelite | DMLC | Cross-platform compiled inference for LightGBM models. |
| ONNX Runtime | Community + Microsoft | Convert LightGBM models to ONNX for portable inference. |
This broad surface area is part of the reason LightGBM is so widely adopted in production: a model trained in Python on a workstation can be exported to ONNX or compiled with Treelite and served from a C++ inference server, an ML.NET application, or a Spark batch job without retraining.
Microsoft. LightGBM is used internally for ranking, advertising, and CTR prediction across Bing search, Microsoft Advertising, and Microsoft 365 features that touch tabular signals. The library was originally driven by these workloads and continues to be maintained, in part, against them.
Kaggle. Since roughly 2017, LightGBM has been one of the two most-used algorithms on tabular Kaggle competitions, alongside XGBoost. From 2018 onward, a large majority of top-3 finishes on tabular competitions have used LightGBM either as the primary model or as part of an ensemble. Its combination of training speed and competitive accuracy makes it especially attractive for the rapid iteration that competition workflows require [5].
Industry adoption. LightGBM is deployed at scale at companies including Netflix (recommendation reranking), Uber, Lyft, Booking.com, Airbnb, and many financial services and ad-tech firms, typically for ranking, fraud detection, churn prediction, demand forecasting, and credit scoring. Its low-memory inference, support for large categorical cardinalities, and easy integration with Spark and Dask are usually cited as the reasons for adoption.
The practical decision among LightGBM, XGBoost, and CatBoost typically comes down to data shape and operational constraints rather than absolute accuracy:
It is common in practice to train all three and ensemble or stack them, since their errors are sufficiently uncorrelated to give a small accuracy lift even when each individual model is well-tuned.
The table below summarizes major releases. Patch versions are omitted for brevity.
| Version | Release Date | Highlights |
|---|---|---|
| Initial commits | Oct 2016 | First public release on GitHub under microsoft/LightGBM. |
| Categorical features | Dec 2016 | Native integer-coded categorical splits added (no one-hot needed). |
| Python beta | Dec 2016 | First Python package release. |
| R beta | Jan 2017 | First R package release. |
| v1.0 | Feb 2017 | First stable release. |
| NeurIPS paper | Dec 2017 | Ke et al. paper formally introduces GOSS and EFB. |
| v2.0 | 2017 | Stable distributed training, first GPU (OpenCL) support. |
| v2.1 | Jan 2018 | Improved categorical feature splits, R package on CRAN. |
| v2.2 | Sep 2018 | Faster histogram building, multi-class objective fixes. |
| v2.3 | Sep 2019 | Voting parallelism improvements, lambdarank stability. |
| v3.0 | Aug 2020 | CUDA GPU backend (experimental), Dask integration, refactored sklearn API. |
| v3.1 | Nov 2020 | DART/RF mode improvements; min_data_in_leaf default change. |
| v3.2 | Mar 2021 | Faster prediction; improved early stopping callbacks. |
| v3.3 | Late 2021 | Production-quality CUDA backend; larger-than-memory training improvements. |
| v4.0 | Jul 2023 | Major release: dropped legacy APIs, removed deprecated parameters, faster CUDA, improved categorical handling, Python 3.7+ minimum. |
| v4.1 | Sep 2023 | Quantile regression improvements, more callback hooks, CUDA fixes. |
| v4.2 | Dec 2023 | Distributed training stability, scikit-learn 1.4 compatibility. |
| v4.3 | Jan 2024 | Further CUDA backend improvements, improved Dask support. |
| v4.4 | Jun 2024 | Performance fixes; better handling of pandas nullable integer dtypes. |
| v4.5 | Jul 2024 | Improved categorical and missing-value handling on CUDA. |
| v4.6 | Feb 2025 | Latest stable release as of mid-2026: Python 3.13 support, NVIDIA Blackwell CUDA support, linear tree on GPU, and Bagging by Query for lambdarank [9]. |
The project was transferred from microsoft/LightGBM to lightgbm-org/LightGBM in March 2026 to reflect its multi-organization maintainer base, while remaining MIT-licensed and continuing under the same core team [9].
LightGBM inherits the general limitations of gradient boosted decision trees (GBDT): it is a tabular-only method, it does not extrapolate beyond the range of training data, and it requires meaningful feature engineering on raw signals such as text or images. In addition, LightGBM has some library-specific quirks:
- Leaf-wise growth tends to overfit on small datasets; a smaller num_leaves and a stronger min_data_in_leaf are the standard fixes.
- The interaction between num_leaves, max_depth, and min_data_in_leaf is a frequent source of confusion for newcomers.