In a decision tree, a threshold is the value used in an internal node's split test that decides which child subtree a sample is routed to. For a numerical feature, the test typically has the form "is feature_x less than or equal to 4.5?" If yes, the sample goes to the left child; if no, it goes to the right child. The threshold (here, 4.5) is the cut point that partitions the input space, and the choice of threshold at each node is what determines the geometry of the entire tree.
Thresholds are not given by the user. They are learned from training data by a greedy splitting procedure that searches over candidate cut points and picks the one that most improves a chosen split criterion (Gini impurity, entropy, mean squared error, and so on). The same idea generalizes to categorical features (where the "threshold" is a subset of categories), to oblique splits (a threshold on a linear combination of features), and to set membership tests. Threshold selection is the inner loop of every classical tree learner, and most of the engineering effort in modern gradient boosting libraries goes into making this loop fast and statistically well behaved.
This article covers what thresholds are, how they are chosen, how the choice differs across CART, ID3, C4.5, and CHAID, what split criteria look like for classification and regression, how modern libraries like XGBoost, LightGBM, and CatBoost approximate the search for very large datasets, and the well-known limitations of axis-aligned thresholds.
A decision tree is a recursive partition of the input space. Every non-leaf node carries a split test, and every leaf carries a prediction (a class label, a class distribution, a real-valued estimate, or a more general summary). The split test takes one input sample and decides whether to route it left or right. For a single numerical feature x_j and a threshold t, the test is x_j <= t, which is equivalent to a hyperplane in feature space that is perpendicular to the j-th coordinate axis. Trees built from this kind of test are called axis-aligned, because every decision boundary is parallel to a feature axis (Breiman et al. 1984).
Two properties follow immediately. First, the tree's decision regions are unions of hyperrectangles, each corresponding to one root-to-leaf path. Second, the tree's predictions are piecewise constant: every sample that reaches the same leaf gets the same prediction. The thresholds at the internal nodes are what determine the size, shape, and number of these hyperrectangles.
The word "threshold" originally referred to a numerical cut point, but most modern libraries support several kinds of split conditions. The internal data structure stores a feature index plus a condition object. The TensorFlow Decision Forests source code, for example, distinguishes numerical, categorical, and several oblique condition types.
| Condition type | Test form | Typical use |
|---|---|---|
| Numerical (axis-aligned) | x_j <= t | Continuous features, default in CART and gradient boosting |
| Categorical (subset) | x_j in S | Categorical features in CART, LightGBM, CatBoost |
| In-set | x_j in {v1, v2, ..., vk} | Discrete features with small alphabets |
| Oblique | sum(w_j * x_j) <= t | Diagonal boundaries, OC1, sparse oblique trees |
| Multiway (CHAID) | x_j in S1, S2, ..., Sm | Categorical or binned numerical with more than two children |
Numerical splits are by far the most common. Categorical splits in CART are handled by sorting categories by class proportion and treating the resulting ordering as if it were numerical (Breiman et al. 1984), which keeps the search at O(k log k) instead of the 2^(k-1) cost of trying every subset. LightGBM uses a more aggressive variant (Fisher 1958) that finds the optimal partition for a single categorical split in O(k log k) time. Oblique trees test a linear combination, so the split is no longer perpendicular to a feature axis; OC1 by Murthy, Kasif, and Salzberg (1994) is the canonical algorithm. CHAID (Kass 1980) uses chi-square tests and produces multiway splits for categorical features.
For a single numerical feature in a CART-style learner, the standard procedure is:

1. Sort the samples at the node by the feature's value.
2. Form candidate thresholds, typically the midpoints between consecutive distinct feature values.
3. For each candidate, compute the split criterion of the two resulting children and their sample-weighted average.
4. Keep the candidate whose weighted child impurity gives the largest improvement over the parent node.
This is known as the exact greedy algorithm. Repeated for every feature at every node, it is the workhorse behind scikit-learn's DecisionTreeClassifier and DecisionTreeRegressor when splitter='best' (scikit-learn 1.10 docs). With n samples and d features, the exact greedy split costs roughly O(n d log n) per node when implemented carefully, which is fast for small datasets but becomes a bottleneck once n d gets into the hundreds of millions.
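As a concrete illustration, here is a minimal sketch of that inner loop for one feature, assuming a binary 0/1 target and Gini impurity; the function name and structure are illustrative, not taken from any library:

```python
import numpy as np

def best_gini_threshold(x, y):
    """Exact greedy search over midpoint thresholds for one numerical feature."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(y)
    total_pos = y.sum()

    def gini(pos, count):
        if count == 0:
            return 0.0
        p = pos / count
        return 1.0 - p ** 2 - (1.0 - p) ** 2

    parent = gini(total_pos, n)
    best_decrease, best_t = 0.0, None
    left_pos = 0
    for i in range(n - 1):
        left_pos += y[i]
        if x[i] == x[i + 1]:          # no valid cut between identical feature values
            continue
        n_left = i + 1
        weighted = (n_left / n) * gini(left_pos, n_left) \
            + ((n - n_left) / n) * gini(total_pos - left_pos, n - n_left)
        decrease = parent - weighted
        if decrease > best_decrease:
            best_decrease, best_t = decrease, (x[i] + x[i + 1]) / 2
    return best_t, best_decrease
```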
The splitter='random' option in scikit-learn replaces the inner loop with a single random candidate threshold per feature, sacrificing some quality for speed. Extra Trees by Geurts, Ernst, and Wehenkel (2006) take this further and use random thresholds throughout, which trades extra bias for very low variance and very fast training.
The split criterion is the function the algorithm tries to minimize (or whose decrease it tries to maximize) when choosing a threshold. For classification the dominant choices are Gini impurity, entropy, and misclassification error, all defined on the class distribution at a node.
| Criterion | Formula | Used by | Notes |
|---|---|---|---|
| Gini impurity | G = 1 - sum(p_i^2) | CART (default), scikit-learn criterion='gini' | Smooth, fast to compute, slightly favors larger partitions |
| Entropy / information gain | H = -sum(p_i log p_i) | ID3, C4.5, scikit-learn criterion='entropy' or 'log_loss' | Logarithmic, more sensitive to small class probabilities |
| Misclassification error | 1 - max(p_i) | Rarely used as split criterion | Not differentiable at the maximum, less responsive to changes in class distribution |
| Gain ratio | InformationGain / SplitInformation | C4.5 | Penalizes splits with many small partitions, fixes ID3's bias toward high-cardinality features |
| Chi-square | sum((O - E)^2 / E) | CHAID | Statistical significance test, allows multiway splits |
In practice, Gini impurity and entropy almost always select the same threshold in binary problems, and their predictions correlate strongly. Hastie, Tibshirani, and Friedman (2009, section 9.2.3) note that the choice between them rarely matters for accuracy. Misclassification error is theoretically reasonable but practically unhelpful as a split criterion because the function is flat over many candidate splits, so two very different cuts can score identically.
Suppose we have ten samples at a node with feature x and binary label y:
| x | 1.0 | 1.5 | 2.2 | 3.0 | 3.7 | 4.1 | 4.8 | 5.5 | 6.0 | 7.2 |
|---|---|---|---|---|---|---|---|---|---|---|
| y | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
The parent node has 5 zeros and 5 ones, so its Gini impurity is 1 - (0.5^2 + 0.5^2) = 0.5. Sorted, the candidate thresholds are the nine midpoints (1.25, 1.85, 2.6, 3.35, 3.9, 4.45, 5.15, 5.75, 6.6). Try the threshold 3.35: the left child has four samples (all label 0, Gini 0), the right child has six (one 0, five 1, Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278). Weighted child impurity is (4/10) * 0 + (6/10) * 0.278 = 0.167, so the impurity decrease is 0.5 - 0.167 = 0.333. Try threshold 3.9: left has five (four 0, one 1, Gini = 0.32), right has five (one 0, four 1, Gini = 0.32). Weighted = 0.32, decrease = 0.18. Try threshold 4.45: left has six (five 0, one 1, Gini = 0.278), right has four (all label 1, Gini 0). Weighted = (6/10) * 0.278 + (4/10) * 0 = 0.167, decrease = 0.333.
Thresholds 3.35 and 4.45 tie for the largest impurity decrease (0.333); every other candidate does worse, so the algorithm splits at one of the two (a splitter that scans candidates in increasing order and replaces the incumbent only on a strict improvement keeps 3.35). The tree learner then recurses on the left and right subsets, repeating the search for each.
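This is easy to check with scikit-learn; a minimal sketch using a depth-1 tree (the printed threshold should be 3.35, or 4.45 if an implementation breaks the tie the other way):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0], [1.5], [2.2], [3.0], [3.7], [4.1], [4.8], [5.5], [6.0], [7.2]])
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

# A depth-1 tree ("stump") performs exactly one exact greedy threshold search.
stump = DecisionTreeClassifier(max_depth=1, criterion="gini").fit(X, y)
print(stump.tree_.feature[0], stump.tree_.threshold[0])  # expected: 0 3.35
```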
For regression, the labels are continuous, so impurity is replaced by variance or some other dispersion measure. The most common criteria are mean squared error, mean absolute error, and Friedman's modified MSE.
| Criterion | What it minimizes | Used by | Notes |
|---|---|---|---|
| MSE / variance reduction | sum((y - mean)^2) in each child | CART, scikit-learn criterion='squared_error' | Default for regression, sensitive to outliers |
| MAE / absolute deviation | sum(abs(y - median)) in each child | scikit-learn criterion='absolute_error' | Robust to outliers, much slower because the median has no O(1) incremental update |
| Friedman MSE | Weighted improvement scaled by left and right counts | scikit-learn criterion='friedman_mse' (default for GradientBoostingRegressor) | From Friedman (2001), better for boosting because it accounts for child sizes |
| Poisson deviance | Sum of Poisson deviance per child | scikit-learn criterion='poisson' | Count regression with non-negative targets |
| Quantile loss | Sum of pinball loss at quantile q | LightGBM, scikit-learn GradientBoostingRegressor with loss='quantile' | Quantile regression and prediction intervals |
For MSE the leaf prediction is the mean of the training targets in that leaf, and the variance reduction at a candidate threshold can be updated incrementally in O(1) as the threshold moves through the sorted values. This is what makes the exact greedy algorithm fast in practice. MAE is harder because the median has no equivalent O(1) update, which is why scikit-learn marks it as significantly slower in the documentation.
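A minimal sketch of that incremental update for a regression split, assuming squared error and using the identity SSE = sum(y^2) - (sum(y))^2 / n so that only running sums need to be maintained (names are illustrative):

```python
import numpy as np

def best_mse_threshold(x, y):
    """Exact greedy regression split with O(1) updates as the threshold moves."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(y)
    total_sum, total_sq = y.sum(), (y ** 2).sum()
    parent_sse = total_sq - total_sum ** 2 / n
    left_sum = left_sq = 0.0
    best_gain, best_t = 0.0, None
    for i in range(n - 1):
        left_sum += y[i]
        left_sq += y[i] ** 2
        if x[i] == x[i + 1]:              # no valid cut between identical values
            continue
        n_l, n_r = i + 1, n - i - 1
        right_sum, right_sq = total_sum - left_sum, total_sq - left_sq
        children_sse = (left_sq - left_sum ** 2 / n_l) + (right_sq - right_sum ** 2 / n_r)
        gain = parent_sse - children_sse  # variance reduction (unnormalized)
        if gain > best_gain:
            best_gain, best_t = gain, (x[i] + x[i + 1]) / 2
    return best_t, best_gain
```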
Different classical tree learners differ mostly in the kinds of features they accept, the split criterion they use, and how many children a node may have.
| Algorithm | Year | Authors | Split criterion | Splits | Features supported |
|---|---|---|---|---|---|
| AID / THAID | 1963, 1972 | Morgan and Sonquist; Messenger and Mandell | Variance / heuristic | Multiway | Categorical |
| ID3 | 1986 | Quinlan | Information gain | Multiway (one branch per category) | Categorical only |
| CART | 1984 | Breiman, Friedman, Olshen, Stone | Gini (classification), MSE (regression) | Binary | Numerical and categorical, classification and regression |
| C4.5 | 1993 | Quinlan | Gain ratio | Binary for numerical, multiway for categorical | Numerical and categorical, handles missing values |
| C5.0 | 1997 onward | Quinlan (commercial) | Gain ratio | Same as C4.5 | Adds boosting and rule extraction |
| CHAID | 1980 | Kass | Chi-square (classification), F-test (regression) | Multiway | Categorical, ordinal, binned numerical |
| QUEST | 1997 | Loh and Shih | Statistical tests for variable selection, then quadratic discriminant for split | Binary | Mixed |
| GUIDE | Ongoing | Loh | Curvature and interaction tests | Binary or multiway | Mixed, including missing values |
ID3 is the simplest of these and has mainly historical interest. It cannot handle continuous features directly and produces multiway splits that overfit when categorical features have many levels (Quinlan 1986). C4.5 fixed both issues: continuous features were handled by sorting and choosing a threshold, and the gain ratio criterion divided information gain by the entropy of the split itself, which penalized very fragmented partitions (Quinlan 1993). CART by Breiman, Friedman, Olshen, and Stone is the version closest to what scikit-learn and most modern libraries implement: binary splits, Gini for classification, MSE for regression, cost complexity pruning, surrogate splits for missing values, and treatment of both classification and regression in one framework.
For small datasets, exact greedy threshold search (sort the feature, try every midpoint) is fine. For very large datasets, sorting all feature values at every node is expensive in both time and memory. Two approximate methods dominate modern practice.
XGBoost's original exact method (tree_method='exact') uses a pre-sorted column block layout: feature values are sorted once at the start, and a column block holds sorted values plus row pointers. At each node the algorithm scans through the sorted values and updates the split statistics incrementally. This is exact but uses memory proportional to the dataset size and is slow for very wide or very large data (Chen and Guestrin 2016).
For large datasets XGBoost's approximate method (tree_method='approx') proposes a fixed set of candidate thresholds per feature, computed using a weighted quantile sketch. The candidates are the values that split the feature into roughly equal-weight buckets, where the weights are second-order gradients (Hessians) from the loss. The split search then only evaluates these candidates rather than every value. Chen and Guestrin (2016) prove the approximation has bounded error and show that the sketch can be merged across machines, which is why XGBoost scales horizontally.
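A minimal sketch of the idea behind Hessian-weighted candidate proposal, not XGBoost's actual sketch data structure: pick cut points so that each bucket carries roughly equal total Hessian weight (the function name and the n_candidates parameter are illustrative):

```python
import numpy as np

def weighted_quantile_candidates(x, hess, n_candidates=32):
    """Propose candidate thresholds so each bucket holds ~equal Hessian mass."""
    order = np.argsort(x)
    x_sorted, h_sorted = x[order], hess[order]
    cum_h = np.cumsum(h_sorted)
    # Interior targets of the cumulative Hessian, excluding the two endpoints.
    targets = np.linspace(0.0, cum_h[-1], n_candidates + 1)[1:-1]
    idx = np.searchsorted(cum_h, targets)
    return np.unique(x_sorted[idx])
```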
Histogram-based methods (LightGBM, XGBoost's tree_method='hist', CatBoost) discretize features up front. LightGBM by Ke et al. (2017) bins continuous feature values into a small number of integer histograms (typically 256 bins) before training begins. At each node, the algorithm only considers the bin boundaries as candidate thresholds, and the per-bin gradient and Hessian sums can be computed in a single pass over the data. Because bin counts are small fixed integers, the split search is essentially constant per feature per node once the histograms exist. LightGBM also reuses parent histograms when computing child histograms by subtracting one child from the parent.
XGBoost added a histogram method (tree_method='hist') inspired by LightGBM that is now the default in modern releases. CatBoost uses oblivious symmetric trees, where every node at the same depth tests the same feature with the same threshold, and combines this with histogram-based binning.
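The sketch below shows the histogram idea in miniature, assuming a second-order gain of the usual form G^2/(H + lambda) summed over children; the binning, regularization, and function names are simplified assumptions rather than any library's actual code:

```python
import numpy as np

def histogram_split(x, grad, hess, n_bins=256, reg_lambda=1.0):
    """Bin one feature, accumulate per-bin gradient/Hessian sums, scan bin edges."""
    # Quantile-based bin edges; real libraries build these once before training.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, x)                  # integer bin index per sample
    g_hist = np.bincount(bins, weights=grad, minlength=n_bins)
    h_hist = np.bincount(bins, weights=hess, minlength=n_bins)

    G, H = g_hist.sum(), h_hist.sum()
    g_left = h_left = 0.0
    best_gain, best_bin = 0.0, None
    for b in range(n_bins - 1):                       # candidate thresholds = bin edges
        g_left += g_hist[b]
        h_left += h_hist[b]
        g_right, h_right = G - g_left, H - h_left
        gain = (g_left ** 2 / (h_left + reg_lambda)
                + g_right ** 2 / (h_right + reg_lambda)
                - G ** 2 / (H + reg_lambda))          # second-order gain, constants omitted
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```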
LightGBM introduced two more techniques that make the threshold search even cheaper:
Gradient-based One-Side Sampling (GOSS) keeps all training examples with large gradients and randomly subsamples examples with small gradients, then upweights the subsampled set when computing information gain. The intuition is that examples the current model already fits well (small gradient) carry less information about where to split next.
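A minimal sketch of the GOSS sampling step as described in the paper, assuming a top_rate fraction kept by gradient magnitude and an other_rate fraction sampled from the remainder (the function name and defaults are illustrative):

```python
import numpy as np

def goss_sample(grad, top_rate=0.2, other_rate=0.1, seed=None):
    """Keep large-gradient rows, subsample the rest, upweight the sampled part."""
    rng = np.random.default_rng(seed)
    n = len(grad)
    n_top, n_other = int(top_rate * n), int(other_rate * n)
    order = np.argsort(-np.abs(grad))
    top_idx = order[:n_top]                            # always kept
    sampled_idx = rng.choice(order[n_top:], size=n_other, replace=False)
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate    # compensate for the subsampling
    return np.concatenate([top_idx, sampled_idx]), weights
```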
Exclusive Feature Bundling (EFB) bundles features that rarely take nonzero values at the same time into a single new feature, reducing the effective feature count. For sparse high-dimensional data such as one-hot encoded categorical variables, this can shrink the per-node split search by an order of magnitude.
Ke et al. (2017) report up to 20x speedup over conventional gradient boosting decision trees with comparable accuracy.
| Library | Default split method | Categorical handling | Notable threshold trick |
|---|---|---|---|
| scikit-learn DecisionTreeClassifier | Exact greedy | None built in; categoricals must be encoded first | splitter='random' for fast random thresholds |
| scikit-learn HistGradientBoostingClassifier | Histogram (255 bins + 1 for missing) | Native categorical with sorted-by-target trick | Inspired by LightGBM |
| XGBoost | hist (default since 2.0), approx, or exact | One-hot or native categorical (since 1.5) | Weighted quantile sketch |
| LightGBM | Histogram (255 bins) | Optimal split in O(k log k) | GOSS, EFB, leaf-wise tree growth |
| CatBoost | Symmetric oblivious trees over histograms | Ordered target statistics, native categorical | Oblivious trees, ordered boosting |
For a categorical feature with k levels, the number of nontrivial subset splits is 2^(k-1) - 1, which is intractable beyond about k = 20. Real systems use one of several shortcuts.
CART's classical trick (Breiman et al. 1984) for binary classification with k categories sorts the categories by the proportion of class 1 within each, then treats the sorted index as if it were numerical and searches for a threshold over k - 1 candidates. Fisher (1958) proved this gives the optimal binary split for two-class problems, and the result extends to MSE regression. LightGBM uses essentially this method.
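A minimal sketch of this ordering trick for a binary target and Gini impurity, assuming pandas for the per-category statistics (function and variable names are illustrative):

```python
import numpy as np
import pandas as pd

def best_categorical_split(cat, y):
    """Sort categories by positive-class rate, then scan the k - 1 prefix partitions."""
    df = pd.DataFrame({"cat": cat, "y": y})
    stats = df.groupby("cat")["y"].agg(["mean", "count"]).sort_values("mean")
    total_pos, total_n = df["y"].sum(), len(df)

    def gini(pos, n):
        if n == 0:
            return 0.0
        p = pos / n
        return 1.0 - p ** 2 - (1.0 - p) ** 2

    parent = gini(total_pos, total_n)
    cats = list(stats.index)
    best_gain, best_left_subset = 0.0, None
    left_pos = left_n = 0.0
    for i in range(len(cats) - 1):
        left_pos += stats["mean"].iloc[i] * stats["count"].iloc[i]
        left_n += stats["count"].iloc[i]
        right_pos, right_n = total_pos - left_pos, total_n - left_n
        weighted = (left_n / total_n) * gini(left_pos, left_n) \
            + (right_n / total_n) * gini(right_pos, right_n)
        if parent - weighted > best_gain:
            best_gain, best_left_subset = parent - weighted, set(cats[: i + 1])
    return best_left_subset, best_gain
```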
One-hot encoding, where each category becomes a separate binary feature, is widely used but is suboptimal for high-cardinality features because the split search has to choose one category at a time, which forces the tree to grow deep before it can isolate any single value.
Target encoding (or mean encoding) replaces each category with the average of the target variable conditional on the category, which converts categorical features back into numerical ones at the cost of leakage if not done carefully. CatBoost addresses this with ordered target statistics, where the encoding for each row uses only previously seen rows in a random permutation, which preserves an out-of-sample property and reduces overfitting (Prokhorenkova et al. 2018).
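A minimal sketch in the spirit of ordered target statistics, assuming a single random permutation and a global mean prior (CatBoost itself uses several permutations and more elaborate smoothing; the names here are illustrative):

```python
import numpy as np

def ordered_target_encoding(cat, y, prior_weight=1.0, seed=None):
    """Encode each row using only rows seen earlier in a random permutation."""
    rng = np.random.default_rng(seed)
    n = len(y)
    prior = y.mean()                      # global prior
    sums, counts = {}, {}
    encoded = np.empty(n)
    for idx in rng.permutation(n):        # process rows in permutation order
        c = cat[idx]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        encoded[idx] = (s + prior_weight * prior) / (k + prior_weight)
        sums[c] = s + y[idx]              # this row's label only enters the stats afterwards
        counts[c] = k + 1
    return encoded
```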
The greedy threshold search will keep splitting until each leaf contains a single sample if you let it. To prevent that, decision tree learners apply pre-pruning constraints, post-pruning, or both.
Pre-pruning constraints include max_depth (maximum tree depth), min_samples_split (minimum samples required to consider splitting a node), min_samples_leaf (minimum samples that must end up in a leaf), min_impurity_decrease (a candidate split is rejected if the impurity decrease is below this value), and max_leaf_nodes or max_features. The min_impurity_decrease parameter is essentially a threshold on the threshold-selection criterion: if no candidate split passes the bar, the node is left as a leaf. Scikit-learn implements all of these.
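For example, a scikit-learn tree with a few of these constraints set (the dataset and values are chosen purely for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Any candidate threshold whose impurity decrease falls below
# min_impurity_decrease is rejected and the node stays a leaf.
tree = DecisionTreeClassifier(
    max_depth=4,
    min_samples_leaf=20,
    min_impurity_decrease=0.001,
    random_state=0,
).fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```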
Post-pruning, particularly minimal cost complexity pruning from Breiman et al. (1984), grows a large tree first and then prunes it back. The pruning chooses a regularization parameter alpha and removes subtrees that minimize a cost-complexity score equal to total impurity plus alpha times the number of leaves. As alpha grows, more subtrees are pruned. Scikit-learn exposes this through the ccp_alpha argument.
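A short example of scanning the pruning path with scikit-learn (the dataset and the step over alphas are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# cost_complexity_pruning_path returns the effective alphas at which subtrees
# would be pruned away; refit with each alpha and pick one by validation score.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
for alpha in path.ccp_alphas[::10]:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_te, y_te):.3f}")
```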
A single decision tree pruned this way is still high variance, which motivates ensembles: random forest by Breiman (2001) averages many bootstrap-resampled trees with random feature subsets at each split, and gradient boosting by Friedman (2001) fits a sequence of small trees to the residuals of the current ensemble.
Thresholds tied to one feature at a time impose a particular geometry, and that geometry has well-known weaknesses.
A diagonal decision boundary like x_1 + x_2 < 10 cannot be represented by a single axis-aligned split. The tree can approximate it with a staircase of small axis-aligned cuts, but it needs many leaves and a deep tree to do so. Oblique splits like the ones in OC1 (Murthy, Kasif, and Salzberg 1994) and sparse oblique trees in TensorFlow Decision Forests test linear combinations of features and capture diagonal boundaries directly, although the search for a good linear combination is computationally expensive.
The greedy choice of one threshold at a time is locally optimal, not globally optimal. A pair of splits chosen jointly might give a much better partition than either chosen first. Finding the globally optimal binary tree is NP-hard (Hyafil and Rivest 1976), which is why production tree learners settle for greedy. Recent work on optimal sparse decision trees (Hu, Rudin, and Seltzer 2019; Lin et al. 2020) uses branch-and-bound to find provably optimal trees up to moderate sizes.
Small changes in the training data can change the chosen threshold and propagate down the tree, producing a very different tree, which is why single trees are high-variance models. Ensembles average over this variance.
Finally, axis-aligned trees are insensitive to monotone transformations of a feature (taking logs, for example, does not change which split is chosen) but are sensitive to rotations of the feature space, since a rotation makes diagonal patterns axis-aligned and vice versa. This is the opposite of linear models like logistic regression, which are rotation-invariant after standardization but very sensitive to monotone transformations of features.
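The monotone-invariance claim is easy to check empirically; a small sketch, assuming a dataset with non-negative features so that log1p is a valid monotone transform (with deterministic tie-breaking the two trees should make identical training-set predictions):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# log1p preserves the ordering of every feature, so the same samples end up
# on the same side of every learned threshold (the thresholds themselves differ).
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_log = DecisionTreeClassifier(random_state=0).fit(np.log1p(X), y)
print(np.array_equal(tree_raw.predict(X), tree_log.predict(np.log1p(X))))  # expect True
```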
The most common decision tree implementations in Python expose the threshold-selection mechanism through a few standard parameters.
| Library | Class | Key threshold-related parameters |
|---|---|---|
| scikit-learn | DecisionTreeClassifier, DecisionTreeRegressor | criterion, splitter, max_depth, min_samples_split, min_samples_leaf, min_impurity_decrease, ccp_alpha |
| scikit-learn | HistGradientBoostingClassifier, HistGradientBoostingRegressor | max_bins, max_depth, learning_rate, min_samples_leaf |
| XGBoost | XGBClassifier, XGBRegressor | tree_method (hist, exact, approx), max_bin, max_depth, min_child_weight, gamma |
| LightGBM | LGBMClassifier, LGBMRegressor | num_leaves, max_bin, min_data_in_leaf, cat_smooth, boosting_type |
| CatBoost | CatBoostClassifier, CatBoostRegressor | depth, border_count, feature_border_type, cat_features |
| TensorFlow Decision Forests | RandomForestModel, GradientBoostedTreesModel | num_trees, max_depth, split_axis, categorical_algorithm |
The scikit-learn fitted tree exposes the per-node threshold through the tree_.threshold array. Inspecting it is a quick way to understand what the learned tree actually does on a small problem.
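For example, on a small dataset (iris is used purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# tree_.feature[i] and tree_.threshold[i] describe node i's split test;
# leaf nodes are marked with feature = -2 and threshold = -2.0.
print(tree.tree_.feature)
print(tree.tree_.threshold)
print(export_text(tree, feature_names=iris.feature_names))
```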