# Threshold (for decision trees)

> Source: https://aiwiki.ai/wiki/threshold_for_decision_trees
> Updated: 2026-06-24
> Categories: Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

In a [decision tree](/wiki/decision_tree), a **threshold** is the cut point used in an internal node's split test that decides which child subtree a sample is routed to. For a numerical feature the test has the form "is feature_x less than or equal to 4.5?": if yes the sample goes to the left child, if no it goes to the right child, and the threshold (here, 4.5) is the value that partitions the input space. Thresholds are not supplied by the user; they are learned from the training data by a greedy procedure that searches candidate cut points (typically the midpoints between sorted, distinct feature values) and keeps the one that most reduces a split criterion such as [Gini impurity](/wiki/gini_impurity) or [information gain](/wiki/information_gain). The scikit-learn user guide describes the default behavior exactly: the optimal split is "found by performing a greedy exhaustive search over all available features and all possible thresholds (i.e. midpoints between sorted, distinct feature values)" [19].

The same idea generalizes to categorical features (where the "threshold" is a subset of categories), to oblique splits (a threshold on a linear combination of features), and to set membership tests. Threshold selection is the inner loop of every classical tree learner, and most of the engineering effort in modern gradient boosting libraries goes into making this loop fast and statistically well behaved.

This article covers what thresholds are, how they are chosen, how the choice differs across [CART](/wiki/cart_algorithm), [ID3](/wiki/id3_algorithm), [C4.5](/wiki/c4_5_algorithm), and CHAID, what split criteria look like for classification and regression, how modern libraries like [XGBoost](/wiki/xgboost), [LightGBM](/wiki/lightgbm), and [CatBoost](/wiki/catboost) approximate the search for very large datasets, and the well-known limitations of axis-aligned thresholds.

## What is a threshold in a decision tree?

A decision tree is a recursive partition of the input space. Every non-leaf node carries a split test, and every leaf carries a prediction (a class label, a class distribution, a real-valued estimate, or a more general summary). The split test takes one input sample and decides whether to route it left or right. For a single numerical feature x_j and a threshold t, the test is x_j <= t, which is equivalent to a hyperplane in feature space that is perpendicular to the j-th coordinate axis. Trees built from this kind of test are called axis-aligned, because every decision boundary is parallel to a feature axis (Breiman et al. 1984) [1].

Two properties follow immediately. First, the tree's decision regions are unions of hyperrectangles, each corresponding to one root-to-leaf path. Second, the tree's predictions are piecewise constant: every sample that reaches the same leaf gets the same prediction. The thresholds at the internal nodes are what determine the size, shape, and number of these hyperrectangles.

A node threshold is distinct from a [decision threshold](/wiki/decision_threshold), which is the probability cutoff (often 0.5) applied to a model's predicted score when turning it into a hard class label. A tree can have thousands of internal node thresholds and still expose a single, separate decision threshold for its final probability output.

## What kinds of split conditions are there?

The word "threshold" originally referred to a numerical cut point, but most modern libraries support several kinds of split conditions. The internal data structure stores a feature index plus a condition object. The TensorFlow Decision Forests source code, for example, distinguishes numerical, categorical, and several oblique condition types [21].

| Condition type | Test form | Typical use |
| --- | --- | --- |
| Numerical (axis-aligned) | x_j <= t | Continuous features, default in CART and gradient boosting |
| Categorical (subset) | x_j in S | Categorical features in CART, LightGBM, CatBoost |
| [In-set](/wiki/in-set_condition) | x_j in {v1, v2, ..., vk} | Discrete features with small alphabets |
| [Oblique](/wiki/oblique_condition) | sum(w_j * x_j) <= t | Diagonal boundaries, OC1, sparse oblique trees |
| Multiway (CHAID) | x_j in S1, S2, ..., Sm | Categorical or binned numerical with more than two children |

Numerical splits are by far the most common. Categorical splits in CART are handled by sorting categories by class proportion and treating the resulting ordering as if it were numerical (Breiman et al. 1984) [1], which keeps the search at O(k log k) instead of the 2^(k-1) cost of trying every subset. LightGBM uses a more aggressive variant (Fisher 1958) [6] that finds the optimal partition for a single categorical split in O(k log k) time. Oblique trees test a linear combination, so the split is no longer perpendicular to a feature axis; OC1 by Murthy, Kasif, and Salzberg (1994) [7] is the canonical algorithm. CHAID (Kass 1980) [5] uses chi-square tests and produces multiway splits for categorical features.

## How is the threshold chosen?

For a single numerical feature in a CART-style learner, the standard procedure is:

1. Sort the unique training values of the feature at this node.
2. Generate candidate thresholds. Typically these are the midpoints between consecutive sorted values, since any threshold strictly between two adjacent sorted values yields the same split.
3. For each candidate threshold, partition the samples and compute the split criterion (impurity, gain, or variance) on the resulting two children.
4. Pick the threshold that gives the best score.

This is known as the **exact greedy** algorithm. Repeated for every feature at every node, it is the workhorse behind scikit-learn's `DecisionTreeClassifier` and `DecisionTreeRegressor` when `splitter='best'` (scikit-learn docs) [18][19]. With n samples and d features, the exact greedy split costs roughly O(n d log n) per node when implemented carefully, which is fast for small datasets but becomes a bottleneck once n d gets into the hundreds of millions. C4.5 introduced exactly this sort-then-threshold treatment for continuous attributes, and Quinlan (1996) later refined the way candidate thresholds are penalized to produce smaller, more accurate trees [4].

The `splitter='random'` option in scikit-learn replaces the inner loop with a single random candidate threshold per feature, sacrificing some quality for speed [19]. Extra Trees by Geurts, Ernst, and Wehenkel (2006) take this further and use random thresholds throughout, which trades extra bias for very low variance and very fast training [10].

Because the search depends only on the ordering of feature values, decision-tree thresholds are invariant under any monotone transformation of a feature: rescaling, taking logs, or min-max normalization does not change which split is selected, which is why feature scaling is not required before fitting a tree [19].

## What split criteria are used for classification?

The split criterion is the function the algorithm tries to minimize (or whose decrease it tries to maximize) when choosing a threshold. For classification the dominant choices are Gini impurity, entropy, and misclassification error, all defined on the class distribution at a node. The scikit-learn user guide gives the two common ones for a node Q_m as H(Q_m) = sum_k p_mk (1 - p_mk) for Gini and H(Q_m) = -sum_k p_mk log(p_mk) for log loss / entropy [19].

| Criterion | Formula | Used by | Notes |
| --- | --- | --- | --- |
| [Gini impurity](/wiki/gini_impurity) | G = 1 - sum(p_i^2) | CART (default), scikit-learn `criterion='gini'` | Smooth, fast to compute, slightly favors larger partitions |
| [Entropy](/wiki/entropy) / [information gain](/wiki/information_gain) | H = -sum(p_i log p_i) | ID3, C4.5, scikit-learn `criterion='entropy'` or `'log_loss'` | Logarithmic, more sensitive to small class probabilities |
| Misclassification error | 1 - max(p_i) | Rarely used as split criterion | Not differentiable at the maximum, less responsive to changes in class distribution |
| Gain ratio | InformationGain / SplitInformation | C4.5 | Penalizes splits with many small partitions, fixes ID3's bias toward high-cardinality features |
| Chi-square | sum((O - E)^2 / E) | CHAID | Statistical significance test, allows multiway splits |

In practice, Gini impurity and entropy almost always select the same threshold in binary problems, and their predictions correlate strongly. Hastie, Tibshirani, and Friedman (2009, section 9.2.3) note that the choice between them rarely matters for accuracy [17]. Misclassification error is theoretically reasonable but practically unhelpful as a split criterion because the function is flat over many candidate splits, so two very different cuts can score identically.

### Worked example: choosing a threshold

Suppose we have ten samples at a node with feature x and binary label y:

| x | 1.0 | 1.5 | 2.2 | 3.0 | 3.7 | 4.1 | 4.8 | 5.5 | 6.0 | 7.2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| y | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |

The parent node has 5 zeros and 5 ones, so its Gini impurity is 1 - (0.5^2 + 0.5^2) = 0.5. Sorted, the candidate thresholds are the nine midpoints (1.25, 1.85, 2.6, 3.35, 3.9, 4.45, 5.15, 5.75, 6.6). Try the threshold 3.35: the left child has four samples (all label 0, Gini 0), the right child has six (one 0, five 1, Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278). Weighted child impurity is (4/10) * 0 + (6/10) * 0.278 = 0.167, so the impurity decrease is 0.5 - 0.167 = 0.333. Try threshold 3.9: left has five (four 0, one 1, Gini = 0.32), right has five (one 0, four 1, Gini = 0.32). Weighted = 0.32, decrease = 0.18. Try threshold 4.45: left has six (four 0, one 1, plus one 0, so five 0, one 1, Gini = 0.278), right has four (one 0, three 1, Gini = 0.375). Weighted = 0.32, decrease = 0.18.

Threshold 3.35 wins, with the largest impurity decrease, so the algorithm splits there. The tree learner then recurses on the left and right subsets, repeating the search for each.

## What split criteria are used for regression?

For regression, the labels are continuous, so impurity is replaced by variance or some other dispersion measure. The most common criteria are mean squared error, mean absolute error, and Friedman's modified MSE.

| Criterion | What it minimizes | Used by | Notes |
| --- | --- | --- | --- |
| MSE / variance reduction | sum((y - mean)^2) in each child | CART, scikit-learn `criterion='squared_error'` | Default for regression, sensitive to outliers |
| MAE / absolute deviation | sum(|y - median|) in each child | scikit-learn `criterion='absolute_error'` | Robust to outliers, slower to compute |
| Friedman MSE | Weighted improvement scaled by left and right counts | scikit-learn `criterion='friedman_mse'` (default for `GradientBoostingRegressor`) | From Friedman (2001), better for boosting because it accounts for child sizes |
| Poisson deviance | Sum of Poisson deviance per child | scikit-learn `criterion='poisson'` | Count regression with non-negative targets |
| Quantile loss | Sum of pinball loss at quantile q | LightGBM, sklearn `GradientBoostingRegressor` (loss) | Quantile regression and prediction intervals |

For MSE the leaf prediction is the mean of the training targets in that leaf, and the variance reduction at a candidate threshold can be updated incrementally in O(1) as the threshold moves through the sorted values. This is what makes the exact greedy algorithm fast in practice. MAE is harder because the median has no equivalent O(1) update, which is why scikit-learn marks it as significantly slower in the documentation [18].

## How do the tree algorithms compare?

Different classical tree learners differ mostly in the kinds of features they accept, the split criterion they use, and how many children a node may have.

| Algorithm | Year | Authors | Split criterion | Splits | Features supported |
| --- | --- | --- | --- | --- | --- |
| AID / THAID | 1963, 1972 | Morgan and Sonquist; Messenger and Mandell | Variance / heuristic | Multiway | Categorical |
| ID3 | 1986 | Quinlan | Information gain | Multiway (one branch per category) | Categorical only |
| CART | 1984 | Breiman, Friedman, Olshen, Stone | Gini (classification), MSE (regression) | Binary | Numerical and categorical, classification and regression |
| C4.5 | 1993 | Quinlan | Gain ratio | Binary for numerical, multiway for categorical | Numerical and categorical, handles missing values |
| C5.0 | 1997 onward | Quinlan (commercial) | Gain ratio | Same as C4.5 | Adds boosting and rule extraction |
| CHAID | 1980 | Kass | Chi-square (classification), F-test (regression) | Multiway | Categorical, ordinal, binned numerical |
| QUEST | 1997 | Loh and Shih | Statistical tests for variable selection, then quadratic discriminant for split | Binary | Mixed |
| GUIDE | Loh, ongoing | Loh | Curvature and interaction tests | Binary or multiway | Mixed, including missing values |

ID3 is the simplest of these and has mainly historical interest. It cannot handle continuous features directly and produces multiway splits that overfit when categorical features have many levels (Quinlan 1986) [2]. C4.5 fixed both issues: continuous features were handled by sorting and choosing a threshold, and the gain ratio criterion divided information gain by the entropy of the split itself, which penalized very fragmented partitions (Quinlan 1993) [3]. CART by Breiman, Friedman, Olshen, and Stone is the version closest to what scikit-learn and most modern libraries implement: binary splits, Gini for classification, MSE for regression, cost complexity pruning, surrogate splits for missing values, and treatment of both classification and regression in one framework [1].

## How do large datasets speed up the threshold search?

For small datasets, exact greedy threshold search (sort the feature, try every midpoint) is fine. For very large datasets, sorting all feature values at every node is expensive in both time and memory. As the LightGBM authors put it, "for each feature, they need to scan all the data instances to estimate the information gain of all possible split points, which is very time consuming" [12]. Two approximate methods dominate modern practice.

### Pre-sorted exact split (XGBoost `tree_method='exact'`)

XGBoost's original implementation uses a pre-sorted column block layout: feature values are sorted once at the start, and a column block holds sorted values plus row pointers. At each node the algorithm scans through the sorted values and updates the split statistics incrementally. This is exact but uses memory proportional to the dataset size and is slow for very wide or very large data, because, as Chen and Guestrin (2016) note, it is "computationally demanding to enumerate all the possible splits for continuous features" [11].

### Approximate quantile sketch (XGBoost `tree_method='approx'`)

For large datasets XGBoost proposes a fixed set of candidate thresholds per feature, computed using a weighted quantile sketch. The candidates are the values that split the feature into roughly equal-weight buckets, where the weights are second-order gradients (Hessians) from the loss. The split search then only evaluates these candidates rather than every value. Chen and Guestrin (2016) prove the approximation has bounded error and show that the sketch can be merged across machines, which is why XGBoost scales horizontally [11].

### Histogram-based split finding (LightGBM, XGBoost `tree_method='hist'`, CatBoost)

LightGBM by Ke et al. (2017) bins continuous feature values into a small number of integer histograms (typically 256 bins) before training begins [12]. At each node, the algorithm only considers the bin boundaries as candidate thresholds, and the per-bin gradient and Hessian sums can be computed in a single pass over the data. Because bin counts are small fixed integers, the split search is essentially constant per feature per node once the histograms exist. LightGBM also reuses parent histograms when computing child histograms by subtracting one child from the parent.

XGBoost added a histogram method (`tree_method='hist'`) inspired by LightGBM that is now the default in modern releases. CatBoost uses oblivious symmetric trees, where every node at the same depth tests the same feature with the same threshold, and combines this with histogram-based binning.

### Speedups specific to LightGBM

LightGBM introduced two more techniques that make the threshold search even cheaper:

Gradient-based One-Side Sampling (GOSS) keeps all training examples with large gradients and randomly subsamples examples with small gradients, then upweights the subsampled set when computing information gain. The intuition is that examples the current model already fits well (small gradient) carry less information about where to split next.

Exclusive Feature Bundling (EFB) bundles features that rarely take nonzero values at the same time into a single new feature, reducing the effective feature count. For sparse high-dimensional data such as one-hot encoded categorical variables, this can shrink the per-node split search by an order of magnitude.

Ke et al. (2017) report that the combination "speeds up the training process of conventional GBDT by up to over 20 times while achieving almost the same accuracy" [12].

### Modern boosting libraries

| Library | Default split method | Categorical handling | Notable threshold trick |
| --- | --- | --- | --- |
| scikit-learn `DecisionTreeClassifier` | Exact greedy | Binary subset (CART-style) | `splitter='random'` for fast random thresholds |
| scikit-learn `HistGradientBoostingClassifier` | Histogram (256 bins) | Native categorical with sorted-by-target trick | Inspired by LightGBM |
| XGBoost | `hist` (default since 2.0), `approx`, or `exact` | One-hot or native categorical (since 1.5) | Weighted quantile sketch |
| LightGBM | Histogram (255 bins) | Optimal split in O(k log k) | GOSS, EFB, leaf-wise tree growth |
| CatBoost | Symmetric oblivious trees over histograms | Ordered target statistics, native categorical | Oblivious trees, ordered boosting |

## How are categorical features handled?

For a categorical feature with k levels, the number of nontrivial subset splits is 2^(k-1) - 1, which is intractable beyond about k = 20. Real systems use one of several shortcuts.

CART's classical trick (Breiman et al. 1984) for binary classification with k categories sorts the categories by the proportion of class 1 within each, then treats the sorted index as if it were numerical and searches for a threshold over k - 1 candidates [1]. Fisher (1958) proved this gives the optimal binary split for two-class problems, and the result extends to MSE regression [6]. LightGBM uses essentially this method.

One-hot encoding, where each category becomes a separate binary feature, is widely used but is suboptimal for high-cardinality features because the split search has to choose one category at a time, which forces the tree to grow deep before it can isolate any single value.

Target encoding (or mean encoding) replaces each category with the average of the target variable conditional on the category, which converts categorical features back into numerical ones at the cost of leakage if not done carefully. CatBoost addresses this with ordered target statistics, where the encoding for each row uses only previously seen rows in a random permutation, which preserves an out-of-sample property and reduces overfitting (Prokhorenkova et al. 2018) [13].

## How does threshold choice relate to overfitting and pruning?

The greedy threshold search will keep splitting until each leaf contains a single sample if you let it, which produces a tree that memorizes the training set and overfits. To prevent that, decision tree learners apply pre-pruning constraints, post-pruning, or both.

Pre-pruning constraints include `max_depth` (maximum tree depth), `min_samples_split` (minimum samples required to consider splitting a node), `min_samples_leaf` (minimum samples that must end up in a leaf), `min_impurity_decrease` (a candidate split is rejected if the impurity decrease is below this value), and `max_leaf_nodes` or `max_features`. The `min_impurity_decrease` parameter is essentially a threshold on the threshold-selection criterion: if no candidate split passes the bar, the node is left as a leaf. Scikit-learn implements all of these [18].

Post-pruning, particularly minimal cost complexity pruning from Breiman et al. (1984), grows a large tree first and then prunes it back [1]. The pruning chooses a regularization parameter alpha and removes subtrees that minimize a cost-complexity score equal to total impurity plus alpha times the number of leaves. As alpha grows, more subtrees are pruned. Scikit-learn exposes this through the `ccp_alpha` argument [20].

A single decision tree pruned this way is still high variance, which motivates ensembles: [random forest](/wiki/random_forest) by Breiman (2001) averages many bootstrap-resampled trees with random feature subsets at each split [9], and [gradient boosting](/wiki/gradient_boosting) by Friedman (2001) fits a sequence of small trees to the residuals of the current ensemble [8].

## Why do axis-aligned thresholds have limitations?

Thresholds tied to one feature at a time impose a particular geometry, and that geometry has well-known weaknesses.

A diagonal decision boundary like x_1 + x_2 < 10 cannot be represented by a single axis-aligned split. The tree can approximate it with a staircase of small axis-aligned cuts, but it needs many leaves and a deep tree to do so. Oblique splits like the ones in OC1 (Murthy, Kasif, and Salzberg 1994) [7] and sparse oblique trees in TensorFlow Decision Forests [21] test linear combinations of features and capture diagonal boundaries directly, although the search for a good linear combination is computationally expensive.

The greedy choice of one threshold at a time is locally optimal, not globally optimal. A pair of splits chosen jointly might give a much better partition than either chosen first. As the scikit-learn user guide states, "the problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts" [19], a result that traces back to Hyafil and Rivest (1976) showing that constructing optimal binary decision trees is NP-complete [14]. That is why production tree learners settle for greedy heuristics. Recent work on optimal sparse decision trees (Hu, Rudin, and Seltzer 2019 [15]; Lin et al. 2020 [16]) uses branch-and-bound to find provably optimal trees up to moderate sizes.

Small changes in the training data can change the chosen threshold and propagate down the tree, producing a very different tree, which is why single trees are high-variance models. Ensembles average over this variance.

Finally, axis-aligned trees are insensitive to monotone transformations of a feature (taking logs, for example, does not change which split is chosen) but are sensitive to rotations of the feature space, since a rotation makes diagonal patterns axis-aligned and vice versa. This is the opposite of linear models like [logistic regression](/wiki/logistic_regression), which are rotation-invariant after standardization but very sensitive to monotone transformations of features.

## How do libraries expose the threshold?

The most common decision tree implementations in Python expose the threshold-selection mechanism through a few standard parameters.

| Library | Class | Key threshold-related parameters |
| --- | --- | --- |
| [scikit-learn](/wiki/scikit-learn) | `DecisionTreeClassifier`, `DecisionTreeRegressor` | `criterion`, `splitter`, `max_depth`, `min_samples_split`, `min_samples_leaf`, `min_impurity_decrease`, `ccp_alpha` |
| scikit-learn | `HistGradientBoostingClassifier`, `HistGradientBoostingRegressor` | `max_bins`, `max_depth`, `learning_rate`, `min_samples_leaf` |
| XGBoost | `XGBClassifier`, `XGBRegressor` | `tree_method` (`hist`, `exact`, `approx`), `max_bin`, `max_depth`, `min_child_weight`, `gamma` |
| LightGBM | `LGBMClassifier`, `LGBMRegressor` | `num_leaves`, `max_bin`, `min_data_in_leaf`, `cat_smooth`, `boosting_type` |
| CatBoost | `CatBoostClassifier`, `CatBoostRegressor` | `depth`, `border_count`, `feature_border_type`, `cat_features` |
| TensorFlow Decision Forests | `RandomForestModel`, `GradientBoostedTreesModel` | `num_trees`, `max_depth`, `split_axis`, `categorical_algorithm` |

The scikit-learn fitted tree exposes the per-node threshold through the `tree_.threshold` array. Inspecting it is a quick way to understand what the learned tree actually does on a small problem [19].

## See also

- [Decision tree](/wiki/decision_tree)
- [Decision boundary](/wiki/decision_boundary)
- [Decision threshold](/wiki/decision_threshold)
- [Gini impurity](/wiki/gini_impurity)
- [Entropy](/wiki/entropy)
- [Information gain](/wiki/information_gain)
- [CART algorithm](/wiki/cart_algorithm)
- [ID3 algorithm](/wiki/id3_algorithm)
- [C4.5 algorithm](/wiki/c4_5_algorithm)
- [Random forest](/wiki/random_forest)
- [Gradient boosting](/wiki/gradient_boosting)
- [XGBoost](/wiki/xgboost)
- [LightGBM](/wiki/lightgbm)
- [CatBoost](/wiki/catboost)
- [Oblique condition](/wiki/oblique_condition)
- [In-set condition](/wiki/in-set_condition)

## References

1. Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). *Classification and Regression Trees*. Wadsworth and Brooks/Cole.
2. Quinlan, J. R. (1986). "Induction of Decision Trees." *Machine Learning* 1: 81-106. https://hunch.net/~coms-4771/quinlan.pdf
3. Quinlan, J. R. (1993). *C4.5: Programs for Machine Learning*. Morgan Kaufmann.
4. Quinlan, J. R. (1996). "Improved Use of Continuous Attributes in C4.5." *Journal of Artificial Intelligence Research* 4: 77-90. https://arxiv.org/abs/cs/9603103
5. Kass, G. V. (1980). "An Exploratory Technique for Investigating Large Quantities of Categorical Data." *Applied Statistics* 29(2): 119-127.
6. Fisher, W. D. (1958). "On Grouping for Maximum Homogeneity." *Journal of the American Statistical Association* 53(284): 789-798.
7. Murthy, S. K., Kasif, S., and Salzberg, S. (1994). "A System for Induction of Oblique Decision Trees." *Journal of Artificial Intelligence Research* 2: 1-32. https://arxiv.org/abs/cs/9408103
8. Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." *Annals of Statistics* 29(5): 1189-1232.
9. Breiman, L. (2001). "Random Forests." *Machine Learning* 45(1): 5-32.
10. Geurts, P., Ernst, D., and Wehenkel, L. (2006). "Extremely Randomized Trees." *Machine Learning* 63(1): 3-42.
11. Chen, T. and Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*. https://arxiv.org/abs/1603.02754
12. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." *Advances in Neural Information Processing Systems* 30. https://proceedings.neurips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.pdf
13. Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. (2018). "CatBoost: Unbiased Boosting with Categorical Features." *Advances in Neural Information Processing Systems* 31. https://arxiv.org/abs/1706.09516
14. Hyafil, L. and Rivest, R. L. (1976). "Constructing Optimal Binary Decision Trees is NP-Complete." *Information Processing Letters* 5(1): 15-17.
15. Hu, X., Rudin, C., and Seltzer, M. (2019). "Optimal Sparse Decision Trees." *Advances in Neural Information Processing Systems* 32. https://arxiv.org/abs/1904.12847
16. Lin, J., Zhong, C., Hu, D., Rudin, C., and Seltzer, M. (2020). "Generalized and Scalable Optimal Sparse Decision Trees." *International Conference on Machine Learning*. https://arxiv.org/abs/2006.08690
17. Hastie, T., Tibshirani, R., and Friedman, J. (2009). *The Elements of Statistical Learning*, second edition, chapter 9. Springer. https://hastie.su.domains/Papers/ESLII.pdf
18. scikit-learn developers. "DecisionTreeClassifier." scikit-learn documentation. https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
19. scikit-learn developers. "1.10. Decision Trees." scikit-learn documentation. https://scikit-learn.org/stable/modules/tree.html
20. scikit-learn developers. "Post pruning decision trees with cost complexity pruning." scikit-learn examples. https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
21. TensorFlow Decision Forests source. `tensorflow_decision_forests/component/py_tree/condition.py`. https://github.com/tensorflow/decision-forests

