In a decision tree, a threshold is the value used in an internal node's split test that decides which child subtree a sample is routed to. For a numerical feature, the test typically has the form "is feature_x less than or equal to 4.5?" If yes, the sample goes to the left child; if no, it goes to the right child. The threshold (here, 4.5) is the cut point that partitions the input space, and the choice of threshold at each node is what determines the geometry of the entire tree.
Thresholds are not given by the user. They are learned from training data by a greedy splitting procedure that searches over candidate cut points and picks the one that most improves a chosen split criterion (Gini impurity, entropy, mean squared error, and so on). The same idea generalizes to categorical features (where the "threshold" is a subset of categories), to oblique splits (a threshold on a linear combination of features), and to set membership tests. Threshold selection is the inner loop of every classical tree learner, and most of the engineering effort in modern gradient boosting libraries goes into making this loop fast and statistically well behaved.
This article covers what thresholds are, how they are chosen, how the choice differs across CART, ID3, C4.5, and CHAID, what split criteria look like for classification and regression, how modern libraries like XGBoost, LightGBM, and CatBoost approximate the search for very large datasets, and the well-known limitations of axis-aligned thresholds.
A decision tree is a recursive partition of the input space. Every non-leaf node carries a split test, and every leaf carries a prediction (a class label, a class distribution, a real-valued estimate, or a more general summary). The split test takes one input sample and decides whether to route it left or right. For a single numerical feature x_j and a threshold t, the test is x_j <= t, which is equivalent to a hyperplane in feature space that is perpendicular to the j-th coordinate axis. Trees built from this kind of test are called axis-aligned, because every decision boundary is parallel to a feature axis (Breiman et al. 1984).
Two properties follow immediately. First, the tree's decision regions are unions of hyperrectangles, each corresponding to one root-to-leaf path. Second, the tree's predictions are piecewise constant: every sample that reaches the same leaf gets the same prediction. The thresholds at the internal nodes are what determine the size, shape, and number of these hyperrectangles.
The word "threshold" originally referred to a numerical cut point, but most modern libraries support several kinds of split conditions. The internal data structure stores a feature index plus a condition object. The TensorFlow Decision Forests source code, for example, distinguishes numerical, categorical, and several oblique condition types.
| Condition type | Test form | Typical use |
|---|---|---|
| Numerical (axis-aligned) | x_j <= t | Continuous features, default in CART and gradient boosting |
| Categorical (subset) | x_j in S | Categorical features in CART, LightGBM, CatBoost |
| In-set | x_j in {v1, v2, ..., vk} | Discrete features with small alphabets |
| Oblique | sum(w_j * x_j) <= t | Diagonal boundaries, OC1, sparse oblique trees |
| Multiway (CHAID) | x_j in S1, S2, ..., Sm | Categorical or binned numerical with more than two children |
Numerical splits are by far the most common. Categorical splits in CART are handled by sorting categories by class proportion and treating the resulting ordering as if it were numerical (Breiman et al. 1984), which keeps the search at O(k log k) instead of the 2^(k-1) cost of trying every subset. LightGBM uses a more aggressive variant (Fisher 1958) that finds the optimal partition for a single categorical split in O(k log k) time. Oblique trees test a linear combination, so the split is no longer perpendicular to a feature axis; OC1 by Murthy, Kasif, and Salzberg (1994) is the canonical algorithm. CHAID (Kass 1980) uses chi-square tests and produces multiway splits for categorical features.
For a single numerical feature in a CART-style learner, the standard procedure is:

1. Sort the samples at the node by the feature's value.
2. Form candidate thresholds, typically the midpoints between consecutive distinct feature values.
3. For each candidate, compute the split criterion of the two resulting children and their sample-weighted average.
4. Keep the candidate whose weighted child impurity gives the largest improvement over the parent node.
This is known as the exact greedy algorithm. Repeated for every feature at every node, it is the workhorse behind scikit-learn's DecisionTreeClassifier and DecisionTreeRegressor when splitter='best' (scikit-learn 1.10 docs). With n samples and d features, the exact greedy split costs roughly O(n d log n) per node when implemented carefully, which is fast for small datasets but becomes a bottleneck once n d gets into the hundreds of millions.
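As a concrete illustration, here is a minimal sketch of that inner loop for one feature, assuming a binary 0/1 target and Gini impurity; the function name and structure are illustrative, not taken from any library:

```python
import numpy as np

def best_gini_threshold(x, y):
    """Exact greedy search over midpoint thresholds for one numerical feature."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(y)
    total_pos = y.sum()

    def gini(pos, count):
        if count == 0:
            return 0.0
        p = pos / count
        return 1.0 - p ** 2 - (1.0 - p) ** 2

    parent = gini(total_pos, n)
    best_decrease, best_t = 0.0, None
    left_pos = 0
    for i in range(n - 1):
        left_pos += y[i]
        if x[i] == x[i + 1]:          # no valid cut between identical feature values
            continue
        n_left = i + 1
        weighted = (n_left / n) * gini(left_pos, n_left) \
            + ((n - n_left) / n) * gini(total_pos - left_pos, n - n_left)
        decrease = parent - weighted
        if decrease > best_decrease:
            best_decrease, best_t = decrease, (x[i] + x[i + 1]) / 2
    return best_t, best_decrease
```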
The splitter='random' option in scikit-learn replaces the inner loop with a single random candidate threshold per feature, sacrificing some quality for speed. Extra Trees by Geurts, Ernst, and Wehenkel (2006) take this further and use random thresholds throughout, which trades extra bias for very low variance and very fast training.
The split criterion is the function the algorithm tries to minimize (or whose decrease it tries to maximize) when choosing a threshold. For classification the dominant choices are Gini impurity, entropy, and misclassification error, all defined on the class distribution at a node.
| Criterion | Formula | Used by | Notes |
|---|---|---|---|
| Gini impurity | G = 1 - sum(p_i^2) | CART (default), scikit-learn criterion='gini' | Smooth, fast to compute, slightly favors larger partitions |
| Entropy / information gain | H = -sum(p_i log p_i) | ID3, C4.5, scikit-learn criterion='entropy' or 'log_loss' | Logarithmic, more sensitive to small class probabilities |
| Misclassification error | 1 - max(p_i) | Rarely used as split criterion | Not differentiable at the maximum, less responsive to changes in class distribution |
| Gain ratio | InformationGain / SplitInformation | C4.5 | Penalizes splits with many small partitions, fixes ID3's bias toward high-cardinality features |
| Chi-square | sum((O - E)^2 / E) | CHAID | Statistical significance test, allows multiway splits |
In practice, Gini impurity and entropy almost always select the same threshold in binary problems, and their predictions correlate strongly. Hastie, Tibshirani, and Friedman (2009, section 9.2.3) note that the choice between them rarely matters for accuracy. Misclassification error is theoretically reasonable but practically unhelpful as a split criterion because the function is flat over many candidate splits, so two very different cuts can score identically.
Suppose we have ten samples at a node with feature x and binary label y:
| x | 1.0 | 1.5 | 2.2 | 3.0 | 3.7 | 4.1 | 4.8 | 5.5 | 6.0 | 7.2 |
|---|---|---|---|---|---|---|---|---|---|---|
| y | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
The parent node has 5 zeros and 5 ones, so its Gini impurity is 1 - (0.5^2 + 0.5^2) = 0.5. Sorted, the candidate thresholds are the nine midpoints (1.25, 1.85, 2.6, 3.35, 3.9, 4.45, 5.15, 5.75, 6.6). Try the threshold 3.35: the left child has four samples (all label 0, Gini 0), the right child has six (one 0, five 1, Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278). Weighted child impurity is (4/10) * 0 + (6/10) * 0.278 = 0.167, so the impurity decrease is 0.5 - 0.167 = 0.333. Try threshold 3.9: left has five (four 0, one 1, Gini = 0.32), right has five (one 0, four 1, Gini = 0.32). Weighted = 0.32, decrease = 0.18. Try threshold 4.45: left has six (five 0, one 1, Gini = 0.278), right has four (all label 1, Gini 0). Weighted = (6/10) * 0.278 + (4/10) * 0 = 0.167, decrease = 0.333.
Thresholds 3.35 and 4.45 tie for the largest impurity decrease (0.333); every other candidate does worse, so the algorithm splits at one of the two (a splitter that scans candidates in increasing order and replaces the incumbent only on a strict improvement keeps 3.35). The tree learner then recurses on the left and right subsets, repeating the search for each.
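This is easy to check with scikit-learn; a minimal sketch using a depth-1 tree (the printed threshold should be 3.35, or 4.45 if an implementation breaks the tie the other way):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0], [1.5], [2.2], [3.0], [3.7], [4.1], [4.8], [5.5], [6.0], [7.2]])
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

# A depth-1 tree ("stump") performs exactly one exact greedy threshold search.
stump = DecisionTreeClassifier(max_depth=1, criterion="gini").fit(X, y)
print(stump.tree_.feature[0], stump.tree_.threshold[0])  # expected: 0 3.35
```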
For regression, the labels are continuous, so impurity is replaced by variance or some other dispersion measure. The most common criteria are mean squared error, mean absolute error, and Friedman's modified MSE.
| Criterion | What it minimizes | Used by | Notes |
|---|---|---|---|
| MSE / variance reduction | sum((y - mean)^2) in each child | CART, scikit-learn criterion='squared_error' | Default for regression, sensitive to outliers |
| MAE / absolute deviation | sum(abs(y - median)) in each child | scikit-learn criterion='absolute_error' | Robust to outliers, much slower because the median has no O(1) incremental update |
| Friedman MSE | Weighted improvement scaled by left and right counts | scikit-learn criterion='friedman_mse' (default for GradientBoostingRegressor) | From Friedman (2001), better for boosting because it accounts for child sizes |
| Poisson deviance | Sum of Poisson deviance per child | scikit-learn criterion='poisson' | Count regression with non-negative targets |
| Quantile loss | Sum of pinball loss at quantile q | LightGBM, scikit-learn GradientBoostingRegressor with loss='quantile' | Quantile regression and prediction intervals |
For MSE the leaf prediction is the mean of the training targets in that leaf, and the variance reduction at a candidate threshold can be updated incrementally in O(1) as the threshold moves through the sorted values. This is what makes the exact greedy algorithm fast in practice. MAE is harder because the median has no equivalent O(1) update, which is why scikit-learn marks it as significantly slower in the documentation.
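A minimal sketch of that incremental update for a regression split, assuming squared error and using the identity SSE = sum(y^2) - (sum(y))^2 / n so that only running sums need to be maintained (names are illustrative):

```python
import numpy as np

def best_mse_threshold(x, y):
    """Exact greedy regression split with O(1) updates as the threshold moves."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(y)
    total_sum, total_sq = y.sum(), (y ** 2).sum()
    parent_sse = total_sq - total_sum ** 2 / n
    left_sum = left_sq = 0.0
    best_gain, best_t = 0.0, None
    for i in range(n - 1):
        left_sum += y[i]
        left_sq += y[i] ** 2
        if x[i] == x[i + 1]:              # no valid cut between identical values
            continue
        n_l, n_r = i + 1, n - i - 1
        right_sum, right_sq = total_sum - left_sum, total_sq - left_sq
        children_sse = (left_sq - left_sum ** 2 / n_l) + (right_sq - right_sum ** 2 / n_r)
        gain = parent_sse - children_sse  # variance reduction (unnormalized)
        if gain > best_gain:
            best_gain, best_t = gain, (x[i] + x[i + 1]) / 2
    return best_t, best_gain
```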
Different classical tree learners differ mostly in the kinds of features they accept, the split criterion they use, and how many children a node may have.
| Algorithm | Year | Authors | Split criterion | Splits | Features supported |
|---|---|---|---|---|---|
| AID / THAID | 1963, 1972 | Morgan and Sonquist; Messenger and Mandell | Variance / heuristic | Multiway | Categorical |
| ID3 | 1986 | Quinlan | Information gain | Multiway (one branch per category) | Categorical only |
| CART | 1984 | Breiman, Friedman, Olshen, Stone | Gini (classification), MSE (regression) | Binary | Numerical and categorical, classification and regression |
| C4.5 | 1993 | Quinlan | Gain ratio | Binary for numerical, multiway for categorical | Numerical and categorical, handles missing values |
| C5.0 | 1997 onward | Quinlan (commercial) | Gain ratio | Same as C4.5 | Adds boosting and rule extraction |
| CHAID | 1980 | Kass | Chi-square (classification), F-test (regression) | Multiway | Categorical, ordinal, binned numerical |
| QUEST | 1997 | Loh and Shih | Statistical tests for variable selection, then quadratic discriminant for split | Binary | Mixed |
| GUIDE | Ongoing | Loh | Curvature and interaction tests | Binary or multiway | Mixed, including missing values |
ID3 is the simplest of these and has mainly historical interest. It cannot handle continuous features directly and produces multiway splits that overfit when categorical features have many levels (Quinlan 1986). C4.5 fixed both issues: continuous features were handled by sorting and choosing a threshold, and the gain ratio criterion divided information gain by the entropy of the split itself, which penalized very fragmented partitions (Quinlan 1993). CART by Breiman, Friedman, Olshen, and Stone is the version closest to what scikit-learn and most modern libraries implement: binary splits, Gini for classification, MSE for regression, cost complexity pruning, surrogate splits for missing values, and treatment of both classification and regression in one framework.
For small datasets, exact greedy threshold search (sort the feature, try every midpoint) is fine. For very large datasets, sorting all feature values at every node is expensive in both time and memory. Two approximate methods dominate modern practice.
XGBoost's original exact method (tree_method='exact') uses a pre-sorted column block layout: feature values are sorted once at the start, and a column block holds sorted values plus row pointers. At each node the algorithm scans through the sorted values and updates the split statistics incrementally. This is exact but uses memory proportional to the dataset size and is slow for very wide or very large data (Chen and Guestrin 2016).
For large datasets XGBoost's approximate method (tree_method='approx') proposes a fixed set of candidate thresholds per feature, computed using a weighted quantile sketch. The candidates are the values that split the feature into roughly equal-weight buckets, where the weights are second-order gradients (Hessians) from the loss. The split search then only evaluates these candidates rather than every value. Chen and Guestrin (2016) prove the approximation has bounded error and show that the sketch can be merged across machines, which is why XGBoost scales horizontally.
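A minimal sketch of the idea behind Hessian-weighted candidate proposal, not XGBoost's actual sketch data structure: pick cut points so that each bucket carries roughly equal total Hessian weight (the function name and the n_candidates parameter are illustrative):

```python
import numpy as np

def weighted_quantile_candidates(x, hess, n_candidates=32):
    """Propose candidate thresholds so each bucket holds ~equal Hessian mass."""
    order = np.argsort(x)
    x_sorted, h_sorted = x[order], hess[order]
    cum_h = np.cumsum(h_sorted)
    # Interior targets of the cumulative Hessian, excluding the two endpoints.
    targets = np.linspace(0.0, cum_h[-1], n_candidates + 1)[1:-1]
    idx = np.searchsorted(cum_h, targets)
    return np.unique(x_sorted[idx])
```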
Histogram-based methods (LightGBM, XGBoost's tree_method='hist', CatBoost) discretize features up front. LightGBM by Ke et al. (2017) bins continuous feature values into a small number of integer histograms (typically 256 bins) before training begins. At each node, the algorithm only considers the bin boundaries as candidate thresholds, and the per-bin gradient and Hessian sums can be computed in a single pass over the data. Because bin counts are small fixed integers, the split search is essentially constant per feature per node once the histograms exist. LightGBM also reuses parent histograms when computing child histograms by subtracting one child from the parent.
XGBoost added a histogram method (tree_method='hist') inspired by LightGBM that is now the default in modern releases. CatBoost uses oblivious symmetric trees, where every node at the same depth tests the same feature with the same threshold, and combines this with histogram-based binning.
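The sketch below shows the histogram idea in miniature, assuming a second-order gain of the usual form G^2/(H + lambda) summed over children; the binning, regularization, and function names are simplified assumptions rather than any library's actual code:

```python
import numpy as np

def histogram_split(x, grad, hess, n_bins=256, reg_lambda=1.0):
    """Bin one feature, accumulate per-bin gradient/Hessian sums, scan bin edges."""
    # Quantile-based bin edges; real libraries build these once before training.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, x)                  # integer bin index per sample
    g_hist = np.bincount(bins, weights=grad, minlength=n_bins)
    h_hist = np.bincount(bins, weights=hess, minlength=n_bins)

    G, H = g_hist.sum(), h_hist.sum()
    g_left = h_left = 0.0
    best_gain, best_bin = 0.0, None
    for b in range(n_bins - 1):                       # candidate thresholds = bin edges
        g_left += g_hist[b]
        h_left += h_hist[b]
        g_right, h_right = G - g_left, H - h_left
        gain = (g_left ** 2 / (h_left + reg_lambda)
                + g_right ** 2 / (h_right + reg_lambda)
                - G ** 2 / (H + reg_lambda))          # second-order gain, constants omitted
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```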
LightGBM introduced two more techniques that make the threshold search even cheaper:
Gradient-based One-Side Sampling (GOSS) keeps all training examples with large gradients and randomly subsamples examples with small gradients, then upweights the subsampled set when computing information gain. The intuition is that examples the current model already fits well (small gradient) carry less information about where to split next.
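A minimal sketch of the GOSS sampling step as described in the paper, assuming a top_rate fraction kept by gradient magnitude and an other_rate fraction sampled from the remainder (the function name and defaults are illustrative):

```python
import numpy as np

def goss_sample(grad, top_rate=0.2, other_rate=0.1, seed=None):
    """Keep large-gradient rows, subsample the rest, upweight the sampled part."""
    rng = np.random.default_rng(seed)
    n = len(grad)
    n_top, n_other = int(top_rate * n), int(other_rate * n)
    order = np.argsort(-np.abs(grad))
    top_idx = order[:n_top]                            # always kept
    sampled_idx = rng.choice(order[n_top:], size=n_other, replace=False)
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate    # compensate for the subsampling
    return np.concatenate([top_idx, sampled_idx]), weights
```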
Exclusive Feature Bundling (EFB) bundles features that rarely take nonzero values at the same time into a single new feature, reducing the effective feature count. For sparse high-dimensional data such as one-hot encoded categorical variables, this can shrink the per-node split search by an order of magnitude.
Ke et al. (2017) report up to 20x speedup over conventional gradient boosting decision trees with comparable accuracy.
| Library | Default split method | Categorical handling | Notable threshold trick |
|---|---|---|---|
| scikit-learn DecisionTreeClassifier | Exact greedy | None built in; categoricals must be encoded first | splitter='random' for fast random thresholds |
| scikit-learn HistGradientBoostingClassifier | Histogram (255 bins + 1 for missing) | Native categorical with sorted-by-target trick | Inspired by LightGBM |
| XGBoost | hist (default since 2.0), approx, or exact | One-hot or native categorical (since 1.5) | Weighted quantile sketch |
| LightGBM | Histogram (255 bins) | Optimal split in O(k log k) | GOSS, EFB, leaf-wise tree growth |
| CatBoost | Symmetric oblivious trees over histograms | Ordered target statistics, native categorical | Oblivious trees, ordered boosting |
For a categorical feature with k levels, the number of nontrivial subset splits is 2^(k-1) - 1, which is intractable beyond about k = 20. Real systems use one of several shortcuts.
CART's classical trick (Breiman et al. 1984) for binary classification with k categories sorts the categories by the proportion of class 1 within each, then treats the sorted index as if it were numerical and searches for a threshold over k - 1 candidates. Fisher (1958) proved this gives the optimal binary split for two-class problems, and the result extends to MSE regression. LightGBM uses essentially this method.
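A minimal sketch of this ordering trick for a binary target and Gini impurity, assuming pandas for the per-category statistics (function and variable names are illustrative):

```python
import numpy as np
import pandas as pd

def best_categorical_split(cat, y):
    """Sort categories by positive-class rate, then scan the k - 1 prefix partitions."""
    df = pd.DataFrame({"cat": cat, "y": y})
    stats = df.groupby("cat")["y"].agg(["mean", "count"]).sort_values("mean")
    total_pos, total_n = df["y"].sum(), len(df)

    def gini(pos, n):
        if n == 0:
            return 0.0
        p = pos / n
        return 1.0 - p ** 2 - (1.0 - p) ** 2

    parent = gini(total_pos, total_n)
    cats = list(stats.index)
    best_gain, best_left_subset = 0.0, None
    left_pos = left_n = 0.0
    for i in range(len(cats) - 1):
        left_pos += stats["mean"].iloc[i] * stats["count"].iloc[i]
        left_n += stats["count"].iloc[i]
        right_pos, right_n = total_pos - left_pos, total_n - left_n
        weighted = (left_n / total_n) * gini(left_pos, left_n) \
            + (right_n / total_n) * gini(right_pos, right_n)
        if parent - weighted > best_gain:
            best_gain, best_left_subset = parent - weighted, set(cats[: i + 1])
    return best_left_subset, best_gain
```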
One-hot encoding, where each category becomes a separate binary feature, is widely used but is suboptimal for high-cardinality features because the split search has to choose one category at a time, which forces the tree to grow deep before it can isolate any single value.
Target encoding (or mean encoding) replaces each category with the average of the target variable conditional on the category, which converts categorical features back into numerical ones at the cost of leakage if not done carefully. CatBoost addresses this with ordered target statistics, where the encoding for each row uses only previously seen rows in a random permutation, which preserves an out-of-sample property and reduces overfitting (Prokhorenkova et al. 2018).
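A minimal sketch in the spirit of ordered target statistics, assuming a single random permutation and a global mean prior (CatBoost itself uses several permutations and more elaborate smoothing; the names here are illustrative):

```python
import numpy as np

def ordered_target_encoding(cat, y, prior_weight=1.0, seed=None):
    """Encode each row using only rows seen earlier in a random permutation."""
    rng = np.random.default_rng(seed)
    n = len(y)
    prior = y.mean()                      # global prior
    sums, counts = {}, {}
    encoded = np.empty(n)
    for idx in rng.permutation(n):        # process rows in permutation order
        c = cat[idx]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        encoded[idx] = (s + prior_weight * prior) / (k + prior_weight)
        sums[c] = s + y[idx]              # this row's label only enters the stats afterwards
        counts[c] = k + 1
    return encoded
```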
The greedy threshold search will keep splitting until each leaf contains a single sample if you let it. To prevent that, decision tree learners apply pre-pruning constraints, post-pruning, or both.
Pre-pruning constraints include max_depth (maximum tree depth), min_samples_split (minimum samples required to consider splitting a node), min_samples_leaf (minimum samples that must end up in a leaf), min_impurity_decrease (a candidate split is rejected if the impurity decrease is below this value), and max_leaf_nodes or max_features. The min_impurity_decrease parameter is essentially a threshold on the threshold-selection criterion: if no candidate split passes the bar, the node is left as a leaf. Scikit-learn implements all of these.
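For example, a scikit-learn tree with a few of these constraints set (the dataset and values are chosen purely for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Any candidate threshold whose impurity decrease falls below
# min_impurity_decrease is rejected and the node stays a leaf.
tree = DecisionTreeClassifier(
    max_depth=4,
    min_samples_leaf=20,
    min_impurity_decrease=0.001,
    random_state=0,
).fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```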
Post-pruning, particularly minimal cost complexity pruning from Breiman et al. (1984), grows a large tree first and then prunes it back. The pruning chooses a regularization parameter alpha and removes subtrees that minimize a cost-complexity score equal to total impurity plus alpha times the number of leaves. As alpha grows, more subtrees are pruned. Scikit-learn exposes this through the ccp_alpha argument.
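A short example of scanning the pruning path with scikit-learn (the dataset and the step over alphas are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# cost_complexity_pruning_path returns the effective alphas at which subtrees
# would be pruned away; refit with each alpha and pick one by validation score.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
for alpha in path.ccp_alphas[::10]:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"test accuracy={pruned.score(X_te, y_te):.3f}")
```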
A single decision tree pruned this way is still high variance, which motivates ensembles: random forest by Breiman (2001) averages many bootstrap-resampled trees with random feature subsets at each split, and gradient boosting by Friedman (2001) fits a sequence of small trees to the residuals of the current ensemble.
Thresholds tied to one feature at a time impose a particular geometry, and that geometry has well-known weaknesses.
A diagonal decision boundary like x_1 + x_2 < 10 cannot be represented by a single axis-aligned split. The tree can approximate it with a staircase of small axis-aligned cuts, but it needs many leaves and a deep tree to do so. Oblique splits like the ones in OC1 (Murthy, Kasif, and Salzberg 1994) and sparse oblique trees in TensorFlow Decision Forests test linear combinations of features and capture diagonal boundaries directly, although the search for a good linear combination is computationally expensive.
The greedy choice of one threshold at a time is locally optimal, not globally optimal. A pair of splits chosen jointly might give a much better partition than either chosen first. Finding the globally optimal binary tree is NP-hard (Hyafil and Rivest 1976), which is why production tree learners settle for greedy. Recent work on optimal sparse decision trees (Hu, Rudin, and Seltzer 2019; Lin et al. 2020) uses branch-and-bound to find provably optimal trees up to moderate sizes.
Small changes in the training data can change the chosen threshold and propagate down the tree, producing a very different tree, which is why single trees are high-variance models. Ensembles average over this variance.
Finally, axis-aligned trees are insensitive to monotone transformations of a feature (taking logs, for example, does not change which split is chosen) but are sensitive to rotations of the feature space, since a rotation makes diagonal patterns axis-aligned and vice versa. This is the opposite of linear models like logistic regression, which are rotation-invariant after standardization but very sensitive to monotone transformations of features.
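The monotone-invariance claim is easy to check empirically; a small sketch, assuming a dataset with non-negative features so that log1p is a valid monotone transform (with deterministic tie-breaking the two trees should make identical training-set predictions):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# log1p preserves the ordering of every feature, so the same samples end up
# on the same side of every learned threshold (the thresholds themselves differ).
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_log = DecisionTreeClassifier(random_state=0).fit(np.log1p(X), y)
print(np.array_equal(tree_raw.predict(X), tree_log.predict(np.log1p(X))))  # expect True
```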
The most common decision tree implementations in Python expose the threshold-selection mechanism through a few standard parameters.
| Library | Class | Key threshold-related parameters |
|---|---|---|
| scikit-learn | DecisionTreeClassifier, DecisionTreeRegressor | criterion, splitter, max_depth, min_samples_split, min_samples_leaf, min_impurity_decrease, ccp_alpha |
| scikit-learn | HistGradientBoostingClassifier, HistGradientBoostingRegressor | max_bins, max_depth, learning_rate, min_samples_leaf |
| XGBoost | XGBClassifier, XGBRegressor | tree_method (hist, exact, approx), max_bin, max_depth, min_child_weight, gamma |
| LightGBM | LGBMClassifier, LGBMRegressor | num_leaves, max_bin, min_data_in_leaf, cat_smooth, boosting_type |
| CatBoost | CatBoostClassifier, CatBoostRegressor | depth, border_count, feature_border_type, cat_features |
| TensorFlow Decision Forests | RandomForestModel, GradientBoostedTreesModel | num_trees, max_depth, split_axis, categorical_algorithm |
The scikit-learn fitted tree exposes the per-node threshold through the tree_.threshold array. Inspecting it is a quick way to understand what the learned tree actually does on a small problem.
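For example, on a small dataset (iris is used purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# tree_.feature[i] and tree_.threshold[i] describe node i's split test;
# leaf nodes are marked with feature = -2 and threshold = -2.0.
print(tree.tree_.feature)
print(tree.tree_.threshold)
print(export_text(tree, feature_names=iris.feature_names))
```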