CatBoost
Last reviewed
Apr 27, 2026
Sources
15 citations
Review status
Source-backed
Revision
v1 · 3,971 words
CatBoost is an open-source gradient boosted decision trees library developed by Yandex and released to the public on July 18, 2017 [1]. The name is a contraction of "Categorical Boosting," reflecting the library's signature contribution: native handling of categorical features through ordered target statistics, eliminating the need for users to manually one-hot encode or label-encode string columns before training. CatBoost belongs to the same family of high-performance gradient boosting toolkits as XGBoost and LightGBM, and it is one of the three libraries that together dominate tabular machine learning competitions on Kaggle and the bulk of production tabular models in industry.
The library introduced two algorithmic innovations that distinguish it from earlier boosting frameworks. The first is ordered boosting, a permutation-driven training procedure that fixes a subtle form of target leakage present in conventional gradient boosting implementations. The second is ordered target statistics, a related permutation trick that converts categorical features to numerical features without leaking information from the response variable. Together these techniques reduce a phenomenon the authors call prediction shift: a discrepancy between the distribution of the model's predictions on training samples and on test samples, which arises because the same data points are used for both the gradient computation and the leaf-value estimation [2]. CatBoost also uses oblivious decision trees (also called symmetric trees) as its base learner, which gives it extremely fast prediction throughput and acts as a structural regularizer against overfitting.
The canonical reference for CatBoost is the NeurIPS 2018 paper "CatBoost: unbiased boosting with categorical features" by Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin [2]. An earlier preprint, "Fighting biases with dynamic boosting," appeared on arXiv in June 2017 [3]. The library is licensed under Apache License 2.0, maintained on GitHub at github.com/catboost/catboost, and as of 2025 the most recent stable release is version 1.2.8.
CatBoost is the publicly released descendant of a long line of internal Yandex gradient boosting libraries. In 2009, Andrey Gulin developed MatrixNet, a proprietary gradient boosting algorithm that became Yandex's primary ranking model for web search results [4]. Over the following years MatrixNet was extended for many other Yandex products including recommendation systems, weather forecasting, and the Yandex.Music streaming service. MatrixNet operated only on numerical features, so users had to encode categorical variables manually before training, a significant practical limitation given that web ranking features include many categorical signals such as country, language, and document type.
In 2014 and 2015, Gulin led research on a project internally called Tensornet with the explicit goal of solving "how to work with categorical data" inside a gradient boosting framework. Tensornet experimented with several encoding schemes but did not converge on a single approach until the team identified the prediction shift problem and developed the ordered target statistics solution. In 2016, Anna Veronika Dorogush's Machine Learning Infrastructure team at Yandex consolidated the MatrixNet and Tensornet codebases into the unified library that would become CatBoost [4].
Yandex announced the open-source release of CatBoost on July 18, 2017, framing it as the first machine learning method that natively handled categorical data [5]. The initial release supported Python and command-line interfaces and ran only on CPU. By the end of 2017, InfoWorld had named CatBoost one of the best open-source machine learning tools of the year. The 0.6 release in 2018 added GPU training support, and version 0.10 introduced the R interface. NVIDIA reported in 2018 that CatBoost on a single Volta GPU could run roughly 40 times faster than a dual-socket Intel Xeon E5-2660v4 server on large datasets [6]. Text feature support arrived in version 0.19 in late 2019, and embedding feature support followed in 2020. Multilabel classification, multitarget regression, and survival analysis losses were added across the 0.24 through 1.0 release series. The first 1.0 release shipped in 2021, signaling API stability after four years of rapid iteration.
Gradient boosting was originally formulated by Jerome Friedman as a stagewise additive procedure that fits each new base learner to the negative gradient of a loss function evaluated on the current ensemble's predictions. Each successive decision tree is added with a small step size, and the ensemble's predictions converge toward the minimum of the empirical risk over many rounds of fitting. Gradient boosting on decision trees has become the dominant approach for machine learning on tabular data, often outperforming deep neural networks on heterogeneous numerical and categorical features.
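The stagewise procedure described above can be sketched in a few lines. This is a minimal illustration, assuming squared-error loss (where the negative gradient is simply the residual), a single numerical feature, and depth-1 stumps as base learners. The helper names `fit_stump`, `boost`, and `mse` are hypothetical and are not part of any library.

```python
def fit_stump(xs, residuals):
    """Least-squares stump on one feature: returns (threshold, left_value, right_value)."""
    mean = sum(residuals) / len(residuals)
    best = (float("inf"), xs[0], mean, mean)  # fallback when no split is possible
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, rounds=50, learning_rate=0.1):
    """Stagewise additive fitting: each stump is fit to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    for _ in range(rounds):
        # For the loss 0.5 * (y - p)^2 the negative gradient is y - p.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        # Add the new learner with a small step size.
        preds = [p + learning_rate * (left_value if x <= threshold else right_value)
                 for x, p in zip(xs, preds)]
    return preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
baseline = [sum(ys) / len(ys)] * len(ys)
preds_5 = boost(xs, ys, rounds=5)
preds_50 = boost(xs, ys, rounds=50)
```

On this toy data the training error shrinks as rounds are added, which is the convergence behavior the paragraph describes. Note that the same samples are used for both the residuals and the stump values, which is exactly the reuse that the next paragraph identifies as a source of bias.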
Classical gradient boosting implementations reuse the same training data to compute both the gradient residuals and the leaf values for each new tree. This reuse causes a subtle problem: the leaf values are biased toward the specific samples used to compute them, and that bias propagates into the gradient residuals for the next tree. Over many boosting rounds, the bias accumulates and produces models that fit the training data more aggressively than they should. The CatBoost authors named this accumulated bias prediction shift and showed that it is responsible for a measurable fraction of the generalization gap in standard boosting implementations [2]. A related issue arises when categorical features are encoded using target statistics, a popular preprocessing technique that replaces each category with the mean target value observed for that category. Target encoding usually outperforms one-hot encoding when categories are high-cardinality, but it leaks information from the response into the features. Practitioners typically address the leak with cross-validated target encoding; CatBoost's contribution is to formalize the idea inside the gradient boosting algorithm itself rather than as a preprocessing step.
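The target-encoding leak is easiest to see on a high-cardinality column where every category appears once. The sketch below is an illustration only: `greedy_target_statistic` and `out_of_fold_target_statistic` are hypothetical helper names, and the fold assignment (row index modulo the number of folds) is an assumption chosen for brevity.

```python
from collections import defaultdict


def greedy_target_statistic(categories, targets):
    """Replace each category with the mean target over ALL rows, including the row itself."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return [sums[c] / counts[c] for c in categories]


def out_of_fold_target_statistic(categories, targets, n_folds=2):
    """Encode each row using only rows from the other folds (a cross-validated encoding)."""
    n = len(targets)
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        total, seen = 0.0, 0
        for i in range(n):
            if i % n_folds != fold:  # statistics come from the other folds only
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total += targets[i]
                seen += 1
        prior = total / seen if seen else 0.0  # fallback for unseen categories
        for i in range(n):
            if i % n_folds == fold:
                c = categories[i]
                encoded[i] = sums[c] / counts[c] if counts[c] else prior
    return encoded


categories = ["u1", "u2", "u3", "u4"]  # e.g. user IDs, each seen once
targets = [1, 0, 0, 1]
leaky = greedy_target_statistic(categories, targets)
held_out = out_of_fold_target_statistic(categories, targets)
```

With unique categories, the greedy encoding reproduces the label exactly, so a model can memorize the training set through the encoded feature. The out-of-fold variant falls back to the prior for every row here and carries no information about the row's own label.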
Ordered boosting maintains a separate model for each training sample, where each per-sample model is trained on only the samples that come before it in a randomly chosen permutation of the data. When the algorithm needs to compute the gradient residual for sample $i$ during round $t$, it uses the prediction from the model trained on samples 1 through $i-1$, never on sample $i$ itself. This guarantees that the gradient for any given sample is computed from a model that has never seen that sample's target, breaking the chain of bias accumulation that produces prediction shift in classical boosting.
A naive implementation of ordered boosting would require storing $n$ separate models, one for each training sample, which is prohibitively expensive. CatBoost uses an efficient approximation that maintains $\log_2 n$ models corresponding to powers-of-two prefixes of the permutation. The library generates several independent permutations of the data and averages results across them to reduce the variance introduced by relying on a single random ordering.
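The core guarantee of ordered boosting can be illustrated with a deliberately simplified sketch. It assumes the "model" for each sample is just the mean target of the samples preceding it in the permutation, which stands in for a real tree ensemble. The function names `ordered_predictions` and `prefix_lengths` are hypothetical, and the powers-of-two helper only illustrates the prefix sizes mentioned above, not the library's actual bookkeeping.

```python
def ordered_predictions(targets, permutation, prior=0.0):
    """Predict each sample from a constant model fit only on earlier samples."""
    predictions = [0.0] * len(targets)
    running_sum = 0.0
    for position, idx in enumerate(permutation):
        # The model for this sample has seen `position` earlier targets, never its own.
        predictions[idx] = running_sum / position if position > 0 else prior
        running_sum += targets[idx]
    return predictions


def prefix_lengths(n):
    """Powers-of-two prefix sizes below n: about log2(n) auxiliary models."""
    lengths, size = [], 1
    while size < n:
        lengths.append(size)
        size *= 2
    return lengths


targets = [3.0, 1.0, 4.0, 1.0, 5.0]
permutation = [2, 0, 4, 1, 3]  # a fixed ordering for reproducibility; normally random
before = ordered_predictions(targets, permutation)

# Change the target of sample 4 and recompute.
changed = list(targets)
changed[4] = 50.0
after = ordered_predictions(changed, permutation)
```

Sample 4's own prediction is unaffected by the change to its target, while samples later in the permutation do see the change. A plain mean over all targets would shift for every sample, including sample 4 itself.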
The library exposes a parameter called boosting_type that switches between Plain (classical gradient boosting) and Ordered (the permutation-driven variant). On small datasets the ordered variant generally produces better validation metrics, while on very large datasets the cost of maintaining the auxiliary models often outweighs the benefit. The library chooses an appropriate default automatically based on dataset size unless the user overrides it.
Ordered target statistics apply the same permutation trick to categorical feature encoding. For each categorical feature value, CatBoost computes a target statistic using only the samples that precede the current sample in the permutation. The encoded value is roughly the smoothed mean of target values among earlier samples sharing the same category, with a Bayesian prior added to stabilize rare categories near the start of the permutation. Because only earlier examples are used, the example's own label cannot leak into its own encoding.
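A minimal sketch of this encoding follows, assuming the common smoothed form (sum of earlier targets in the category + prior weight × prior) / (count of earlier samples in the category + prior weight). The function name `ordered_target_statistics` and the parameter names `prior` and `prior_weight` are illustrative choices, not the library's API.

```python
from collections import defaultdict


def ordered_target_statistics(categories, targets, permutation, prior, prior_weight=1.0):
    """Encode each sample from the targets of earlier same-category samples only."""
    sums, counts = defaultdict(float), defaultdict(int)
    encoded = [0.0] * len(categories)
    for idx in permutation:
        c = categories[idx]
        # Encode first, using only what has been seen so far...
        encoded[idx] = (sums[c] + prior_weight * prior) / (counts[c] + prior_weight)
        # ...then reveal this sample's target to later samples.
        sums[c] += targets[idx]
        counts[c] += 1
    return encoded


categories = ["a", "b", "a", "a", "b"]
targets = [1, 0, 1, 0, 1]
permutation = [0, 1, 2, 3, 4]  # identity order for readability; normally a random shuffle
encoded = ordered_target_statistics(categories, targets, permutation, prior=0.5)
```

The first occurrence of each category receives the prior (0.5 here), and later occurrences move toward the running category mean. Because the encoding is computed before the sample's own target is added to the running totals, the label cannot leak into its own feature value.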
For low-cardinality categorical features (controlled by the one_hot_max_size parameter), CatBoost falls back to one-hot encoding. For high-cardinality features such as user IDs, product IDs, or city names, ordered target statistics are dramatically more efficient than one-hot encoding and usually outperform other target encoding schemes that rely on cross-validation folds. The library also supports feature combinations, where two or more categorical features are concatenated into a single composite feature whose target statistic is then computed. CatBoost considers these combinations greedily during training, adding new combined features only if they improve the validation loss. This automatic feature engineering is particularly valuable on datasets with many categorical interactions, such as recommendation systems.
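The two ideas in this paragraph, a cardinality threshold for one-hot encoding and composite categorical features, can be sketched as follows. Both helpers are hypothetical illustrations: `choose_encoding` assumes a simple "number of distinct values at most `one_hot_max_size`" rule, and `combine` uses a `|` separator purely for readability. The greedy, validation-driven selection of combinations described above is not reproduced here.

```python
def choose_encoding(values, one_hot_max_size=2):
    """Pick an encoding from the number of distinct values in the column."""
    if len(set(values)) <= one_hot_max_size:
        return "one_hot"
    return "target_statistic"


def combine(column_a, column_b):
    """Concatenate two categorical columns into one composite feature."""
    return [f"{a}|{b}" for a, b in zip(column_a, column_b)]


country = ["ru", "us", "ru", "de"]
platform = ["ios", "web", "web", "ios"]
country_platform = combine(country, platform)
```

The composite column can then be encoded with a target statistic like any other categorical feature, which lets a single split capture an interaction between the two original columns.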
CatBoost's base learners are oblivious decision trees, also known as symmetric trees. In an oblivious tree, every internal node at the same depth uses the same feature and the same split threshold, so the tree is perfectly balanced and the path from root to leaf is determined by a fixed sequence of binary tests. A tree of depth $d$ has exactly $2^d$ leaves.
The oblivious structure has three practical advantages. First, prediction is extremely fast: scoring a sample requires $d$ comparisons regardless of which leaf it lands in, and the comparisons can be vectorized. CatBoost's prediction throughput is among the highest of any gradient boosting library. Second, the symmetric structure acts as a structural regularizer that prevents the tree from carving out very narrow regions of feature space to fit individual training samples. Third, the regular structure is well-suited to GPU implementation; the entire tree-building loop can be expressed as a batched tensor operation.
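The fixed-cost prediction path can be sketched as a bit-indexing routine. This is an illustration under assumed conventions (one `(feature_index, threshold)` pair per level, a "greater than" comparison, and level `k` controlling bit `k` of the leaf index). The function name `predict_oblivious` is hypothetical and does not describe the library's internal layout.

```python
def predict_oblivious(splits, leaf_values, sample):
    """Score one sample with an oblivious tree of depth len(splits)."""
    leaf_index = 0
    for level, (feature, threshold) in enumerate(splits):
        # Every node at this level applies the same test, so the outcome is one bit.
        if sample[feature] > threshold:
            leaf_index |= 1 << level
    return leaf_values[leaf_index]


# Depth-2 tree: 2**2 = 4 leaves.
splits = [(0, 0.5), (1, 10.0)]
leaf_values = [0.1, 0.2, 0.3, 0.4]

score_a = predict_oblivious(splits, leaf_values, [0.7, 3.0])   # bits 01 -> leaf 1
score_b = predict_oblivious(splits, leaf_values, [0.2, 12.0])  # bits 10 -> leaf 2
score_c = predict_oblivious(splits, leaf_values, [0.9, 11.0])  # bits 11 -> leaf 3
```

Each sample costs exactly `d` comparisons and one table lookup, with no data-dependent branching between levels, which is what makes the structure easy to vectorize.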
Oblivious trees are less expressive per tree than asymmetric trees, so CatBoost typically needs more boosting iterations to reach a given accuracy than LightGBM or XGBoost with leaf-wise growth. The advantage of faster prediction usually outweighs the cost of additional iterations in practice.
The three major gradient boosting libraries differ along several axes that matter for practitioners choosing among them. The table below summarizes the main contrasts.
| Aspect | CatBoost | XGBoost | LightGBM |
|---|---|---|---|
| First release | July 2017 | March 2014 | January 2017 |
| Original developer | Yandex | DMLC (Tianqi Chen) | Microsoft |
| Tree structure | Oblivious (symmetric) | Level-wise or leaf-wise | Leaf-wise |
| Categorical handling | Native via ordered target statistics + one-hot | Manual encoding required (some recent native support) | Native via Fisher splits, less robust than CatBoost |
| Prediction shift mitigation | Ordered boosting | None | None |
| Training speed (large data) | Moderate | Slowest | Fastest |
| Prediction speed | Fastest | Moderate | Moderate |
| Default accuracy on heavy categorical data | Often best | Requires careful encoding | Competitive |
| GPU support | Yes, multi-GPU and multi-host | Yes | Yes |
| Text feature support | Native (since 0.19) | None | None |
| Embedding feature support | Native | None | None |
| Memory usage during training | Higher (ordered boosting state) | Moderate | Lowest |
| Sensitivity to hyperparameters | Low (good defaults) | Higher | Higher |
LightGBM is generally the fastest of the three on large datasets, with reported training speeds roughly 7 times faster than XGBoost and 2 times faster than CatBoost on common benchmarks, owing to its leaf-wise tree growth and histogram-based split finding [7]. XGBoost is the oldest and most battle-tested, with the broadest ecosystem of integrations. CatBoost's competitive advantage is the combination of native categorical handling, low sensitivity to hyperparameters, and very fast prediction. On datasets where categorical features carry most of the signal, CatBoost frequently produces the best validation accuracy with default hyperparameters, often beating tuned XGBoost and LightGBM configurations [8]. In practice, many Kaggle competitors and production teams ensemble all three libraries together, exploiting the modest decorrelation in their errors. CatBoost is also the only one of the three with native support for text features and embedding features.
A distinguishing characteristic of CatBoost is that its default hyperparameters produce strong models on most datasets without much tuning. The library team has invested significant effort in choosing defaults that work well across a wide variety of tasks, and the documentation emphasizes that practitioners can often skip hyperparameter optimization entirely for initial experiments. The most important parameters are summarized in the table below.
| Parameter | Default | Typical range | Description |
|---|---|---|---|
| iterations | 1000 | 100 to 10000+ | Number of boosting rounds (synonyms: num_boost_round, n_estimators, num_trees) |
| learning_rate | Auto | 0.01 to 0.3 | Step size for each new tree; auto-tuned based on dataset size and iterations |
| depth | 6 | 4 to 10 | Depth of each oblivious tree; deeper trees fit more complex patterns but risk overfitting |
| l2_leaf_reg | 3.0 | 1 to 10 | L2 regularization on leaf values; higher values combat overfitting |
| bagging_temperature | 1.0 | 0 to infinity | Controls Bayesian bootstrap aggregating; 0 disables, higher values increase randomness |
| random_strength | 1.0 | 0 to 10 | Amount of randomness added when scoring splits; higher values discourage overfitting |
| border_count | 254 (CPU) / 128 (GPU) | 32 to 255 | Number of histogram bins per numerical feature |
| one_hot_max_size | 2 in most cases (mode-dependent, up to 255) | 2 to 255 | Cardinality threshold at or below which one-hot encoding is used instead of target statistics |
| boosting_type | Auto | Plain or Ordered | Selects between classical and ordered boosting |
| early_stopping_rounds | None | 10 to 100 | Stop training if validation metric does not improve for this many rounds |
| subsample | 0.8 | 0.5 to 1.0 | Fraction of rows sampled per tree (with bootstrap_type Bernoulli or Poisson) |
| rsm | 1.0 | 0.5 to 1.0 | Random subspace method; fraction of features sampled per tree |
The learning_rate parameter is automatically chosen by CatBoost based on the dataset size and the number of iterations. A common workflow is to fix iterations to a large number such as 5000, set early_stopping_rounds to around 50, and let the library determine an appropriate learning rate. The depth parameter has the largest impact on capacity; depth 6 is the default and works well for most problems, while depths above 10 typically overfit. The l2_leaf_reg parameter is the primary lever for controlling overfitting beyond depth and learning rate. The bagging_temperature parameter draws sample weights from an exponential distribution scaled by the temperature value: setting it to 0 disables bagging while higher values produce more stochastic training. Combined with subsample and rsm, this gives CatBoost three independent randomization knobs that practitioners can tune for ensemble diversity.
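The workflow described above might look like the following sketch. The parameter names are taken from the table, and `CatBoostClassifier`, its `fit` method, `cat_features`, and `eval_set` are part of the CatBoost Python package. The wrapper function `train_with_early_stopping` and its arguments are hypothetical, and the import is placed inside the function so the parameter dictionary can be inspected without the package installed.

```python
# learning_rate is deliberately left out so the library can choose it.
params = {
    "iterations": 5000,           # large budget; early stopping decides the real count
    "early_stopping_rounds": 50,  # stop after 50 rounds without validation improvement
    "depth": 6,                   # the default; the main capacity lever
    "l2_leaf_reg": 3.0,           # raise this first if the model overfits
}


def train_with_early_stopping(X_train, y_train, X_valid, y_valid, cat_features):
    """Hypothetical helper: fit a classifier with the settings above."""
    from catboost import CatBoostClassifier  # lazy import; requires the catboost package

    model = CatBoostClassifier(**params, verbose=False)
    model.fit(
        X_train,
        y_train,
        cat_features=cat_features,    # column names or indices of categorical features
        eval_set=(X_valid, y_valid),  # validation data that drives early stopping
    )
    return model
```

Keeping the tunable values in one dictionary makes it straightforward to vary `depth` or `l2_leaf_reg` later without touching the training code.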
CatBoost supports a wide catalogue of loss functions covering classification, regression, ranking, multilabel, multitask, and survival problems. The defaults are sensible: Logloss for binary classification, MultiClass for multiclass classification, and RMSE for regression. The library exposes a separate set of evaluation metrics that can be computed on validation data without being used as the optimization objective, including AUC, F1, MCC, and many problem-specific scores.
For ranking problems, CatBoost provides the CatBoostRanker interface and supports several specialized loss functions including PairLogit, YetiRank, and YetiRankPairwise. YetiRank is a smooth approximation of the NDCG ranking metric originally developed inside Yandex for web search. The ranker supports group-wise data layouts where samples belong to query groups, and the loss is computed within each group rather than across the full dataset. For multilabel and multitarget problems, CatBoost provides MultiLogloss and MultiRMSE objectives that treat each target column as an independent task while still benefiting from a shared ensemble of trees. A notable extension is support for survival analysis through accelerated failure time loss functions, including SurvivalAft with normal, logistic, and extreme value error distributions, making CatBoost one of the few gradient boosting libraries with first-class support for time-to-event modeling.
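The group-wise computation can be illustrated with a pairwise logistic loss in the spirit of PairLogit. This is a simplified sketch, not the library's implementation: it assumes pairs are formed from every two documents in the same group whose labels differ, and the function name `pairwise_logistic_loss` is hypothetical.

```python
import math
from collections import defaultdict


def pairwise_logistic_loss(scores, labels, group_ids):
    """Sum log(1 + exp(-(s_winner - s_loser))) over pairs inside each query group."""
    groups = defaultdict(list)
    for i, g in enumerate(group_ids):
        groups[g].append(i)

    loss, n_pairs = 0.0, 0
    for members in groups.values():
        for winner in members:
            for loser in members:
                if labels[winner] > labels[loser]:
                    margin = scores[winner] - scores[loser]
                    loss += math.log(1.0 + math.exp(-margin))
                    n_pairs += 1
    return loss, n_pairs


scores = [2.0, 1.0, 0.5, 0.1]
labels = [1, 0, 0, 1]
group_ids = ["q1", "q1", "q2", "q2"]
loss, n_pairs = pairwise_logistic_loss(scores, labels, group_ids)
```

Only two pairs contribute here, one per query, because documents from different queries are never compared. The second query is ranked incorrectly (the relevant document has the lower score), so it contributes the larger share of the loss.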
Starting with version 0.19, CatBoost added native support for text features, columns containing free-form strings such as product descriptions, user reviews, or natural language queries. Text features are tokenized internally and converted to numerical features using a combination of bag-of-words counts, character n-grams, and naive Bayes classifiers. The derived numerical features are then used as inputs to the gradient boosting algorithm in the same way as ordinary numerical features.
CatBoost also supports embedding features, fixed-dimensional dense vectors that arise from neural networks or other representation learning methods. Embedding features are converted to scalar features using two strategies: linear discriminant analysis projects the embedding onto a small number of axes that maximally separate the target classes, and a nearest-neighbor lookup computes target statistics from the labels of the closest embeddings in the training set. This makes CatBoost a natural choice for hybrid pipelines where neural networks produce embeddings of unstructured inputs (text, images, audio) and a gradient boosted model combines those embeddings with structured tabular features for the final prediction.
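The nearest-neighbor strategy can be sketched as follows. This is an illustration only, assuming Euclidean distance and a plain average of neighbor labels. The function name `knn_target_statistic` and the parameter `k` are hypothetical, and a real pipeline would also need to keep a training row's own label out of its neighborhood to avoid the leakage discussed earlier.

```python
import math


def knn_target_statistic(query, train_embeddings, train_labels, k=3):
    """Mean label of the k training embeddings closest to `query`."""
    by_distance = sorted(
        (math.dist(query, embedding), label)
        for embedding, label in zip(train_embeddings, train_labels)
    )
    nearest = by_distance[:k]
    return sum(label for _, label in nearest) / len(nearest)


train_embeddings = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_labels = [0, 0, 1, 1]

near_positive = knn_target_statistic([5.0, 6.0], train_embeddings, train_labels, k=2)
near_negative = knn_target_statistic([0.2, 0.1], train_embeddings, train_labels, k=2)
```

The result is a single scalar per embedding column that the trees can split on like any other numerical feature.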
The text and embedding features are still less mature than the categorical and numerical features, and only a subset of the export formats support them. As of the 1.2 release, models with text or embedding features cannot be exported to ONNX or CoreML and must be served with the CatBoost runtime.
CatBoost powers a wide variety of production systems at Yandex. The Yandex Search ranker uses CatBoost models to score the relevance of web pages to user queries, replacing the older MatrixNet system. Yandex.Music, Yandex.Taxi, Yandex.Drive, Yandex.Weather, and the Alice intelligent assistant all use CatBoost models for ranking, recommendation, and prediction tasks. Yandex's self-driving car program relied on CatBoost for several of its perception and planning models, and Yandex Cloud offers CatBoost as a managed training and inference service for external customers.
Outside Yandex, CatBoost has been adopted at significant scale by several large companies. Cloudflare uses CatBoost as a core component of its bot detection system, training on a dataset of more than a trillion HTTP requests; the deployed solution scores over 660 billion requests per day [9]. CERN uses CatBoost for physics analysis tasks including signal-versus-background classification at the LHCb experiment. JetBrains has integrated CatBoost into the code completion engine in its IntelliJ IDEA family of integrated development environments. Careem, the Middle Eastern ride-hailing platform, uses CatBoost to predict ride destinations.
CatBoost is also widely used in Kaggle competitions, where it appears in winning solutions for a substantial fraction of competitions involving categorical-heavy data. Kaggle's annual data science survey has consistently ranked CatBoost among the top ten most-used machine learning frameworks since 2018, and the library has approximately 100,000 daily installations from PyPI as of 2022 [1].
CatBoost ships with first-class bindings for several programming languages and a command-line interface. The Python interface is the most fully featured, exposing training, evaluation, and prediction APIs through scikit-learn-compatible classes (CatBoostClassifier, CatBoostRegressor, CatBoostRanker) as well as a lower-level CatBoost interface that gives direct access to dataset construction, custom loss functions, and pool objects. The Python package is distributed through PyPI and conda-forge with both CPU and GPU-enabled wheels. The R interface (catboost) provides the same capabilities through R-idiomatic function names and integrates with caret and tidymodels. The Java interface (catboost-spark) targets Apache Spark and JVM-based production systems and is widely used for batch scoring at large companies. The C++ interface is the lowest-level binding and is used internally by all the other interfaces. A REST API is available through the catboost-server package.
For model deployment, CatBoost supports several export formats. Models can be saved as native binary files, JSON, standalone Python code, standalone C++ code, Apple CoreML for iOS and macOS deployment, ONNX-ML for cross-framework inference, and PMML 4.3 for legacy enterprise pipelines. The CoreML and ONNX exports support only models with numerical features; categorical, text, and embedding features require either the native runtime or a custom preprocessing layer.
| Version | Release date | Notable features |
|---|---|---|
| 0.1 | July 2017 | Initial open-source release; CPU-only training |
| 0.2 | August 2017 | Speed and stability improvements |
| 0.6 | March 2018 | GPU training support |
| 0.10 | August 2018 | R interface; many quality-of-life improvements |
| 0.12 | January 2019 | Multi-classification on GPU; per-tree feature importance |
| 0.19 | November 2019 | Text feature support for classification on GPU |
| 0.22 | December 2019 | Multilabel classification |
| 0.24 | September 2020 | Embedding feature support; nearest-neighbor and LDA derived features |
| 0.25 | November 2020 | MonoForest model analysis framework based on a NeurIPS 2019 paper |
| 1.0 | September 2021 | API stability milestone; first major version |
| 1.1 | November 2022 | Survival analysis losses |
| 1.2 | June 2023 | Performance improvements; new ranking loss variants |
| 1.2.8 | April 2025 | Stable maintenance release |
CatBoost is not the right tool for every problem. The ordered boosting algorithm and the per-permutation auxiliary models add memory overhead that can exceed the dataset size on very large problems, and the symmetric tree structure produces less expressive trees than the asymmetric leaf-wise growth used by LightGBM. Practitioners working with hundreds of millions of rows and pure numerical features often find that LightGBM trains faster while producing comparable accuracy. The text and embedding feature pipelines, while convenient, are less competitive than dedicated neural network approaches for problems where the text or embedding signal dominates over the structured features.
A more fundamental limitation is that gradient boosting on decision trees has been somewhat displaced for very high-dimensional and unstructured tasks by deep learning methods. CatBoost remains the state of the art on many tabular benchmarks but is not competitive with transformer models or convolutional networks on text, image, audio, and video problems. The growing class of tabular deep learning methods (NODE, TabNet, FT-Transformer, SAINT, TabPFN) has begun to close the gap on tabular tasks, although gradient boosting libraries including CatBoost still hold a meaningful lead on most public benchmarks.
Gradient boosting libraries including CatBoost are best understood in contrast to two simpler classes of tree-based methods. A single decision tree partitions the feature space into axis-aligned regions and is highly interpretable but typically has high variance. A random forest ensembles many decision trees grown independently on bootstrap samples of the training data, averaging their predictions to reduce variance, but does not benefit from the bias reduction of boosting. Gradient boosting differs from random forests in that the trees are grown sequentially, with each new tree fitted to the residual errors of the current ensemble, achieving much lower bias at the cost of being harder to parallelize and more sensitive to hyperparameters. Compared to deep neural networks, CatBoost typically outperforms multilayer perceptrons on tabular data with moderate sample sizes, while neural networks tend to win when the dataset is very large or when the features include unstructured signals like text or images. Hybrid pipelines that pass neural embeddings into CatBoost as embedding features often outperform either approach alone.