See also: Machine learning terms
In a decision tree, a leaf (also called a terminal node) is a node that has no children. Every path from the root ends at a leaf, and the leaf is where the model finally produces a prediction. Internal nodes carry a split condition such as "age <= 35" or "color in {red, blue}" and route an example to one of their children; leaves do nothing of the sort. They simply hold a value (or a small structure) that the tree returns for any example whose path lands on them.
Google's decision forests glossary puts it concisely: "a leaf is any endpoint in a decision tree. Unlike a condition, a leaf does not perform a test. Rather, a leaf is a possible prediction." The same glossary describes the chain of conditions traversed at inference time as the inference path, with the leaf as the terminal node of that path.
Leaves are the part of the model that actually carries predictive information. The conditions are routing logic. Once a tree is built, you could think of it as a lookup table that maps an example to one of its leaves, and the leaf is what gets returned. This is why most of the interesting design choices in modern tree learners (LightGBM, XGBoost, CatBoost) revolve around how many leaves to grow, how deep they sit, what value to store in each one, and how to keep them from memorizing noise.
A decision tree is a directed rooted tree. There are exactly two kinds of nodes:

- Internal nodes (conditions), each holding a test that routes an example to one of its children. The test can be axis-aligned (for example, feature_j <= threshold), oblique (a test on a linear combination of features), or in-set (a categorical membership test). See axis-aligned condition, oblique condition, and in-set condition for the distinctions. The threshold is the constant the feature is compared against.
- Leaves (terminal nodes), which have no children and hold the value the tree returns.

The root is the special internal node at the top of the tree where every inference begins. In a binary tree (the kind produced by CART, XGBoost, LightGBM, and the default scikit-learn implementation), every internal node has exactly two children. A tree of depth k therefore has at most 2^k leaves, and CatBoost's symmetric trees hit that bound exactly.
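The two node kinds can be sketched in a few lines of plain Python. This is an illustrative structure, not any library's internal representation:

```python
from dataclasses import dataclass

@dataclass
class Leaf:
    value: float      # the prediction the tree returns

@dataclass
class Internal:
    feature: int      # index of the feature tested (axis-aligned condition)
    threshold: float  # constant the feature is compared against
    left: object      # child followed when x[feature] <= threshold
    right: object     # child followed otherwise

def predict(node, x):
    """Follow the inference path from the root until a leaf is reached."""
    while isinstance(node, Internal):
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value

# Depth-1 tree: one condition ("age <= 35"), two leaves.
tree = Internal(feature=0, threshold=35.0, left=Leaf(0.0), right=Leaf(1.0))
print(predict(tree, [30.0]))  # 0.0: the example routes to the left leaf
```

Note that `predict` does no computation at the leaf itself; the leaf just stores the value, which is the point the glossary definition makes.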
A common point of confusion: "leaf" in a decision tree is not the same as "leaf" in graph theory (where it usually means a vertex of degree one in an undirected tree), and neither has anything to do with botany. The vocabulary collides because trees are everywhere in computer science. The decision-tree leaf is specifically the terminal predictive node.
What sits inside a leaf depends on the task and the algorithm. The body of the leaf is whatever the training procedure decides best summarizes the subset of training examples that fell into that leaf. Below are the common cases.
| Task | Stored in the leaf | Prediction returned |
|---|---|---|
| Binary classification | Majority class (or class proportions) of training examples in the leaf | Class label, or a probability between 0 and 1 |
| Multi-class classification | Vector of class counts or normalized class probabilities | Argmax class, or full probability distribution |
| Regression | Mean of the training target values in the leaf (sometimes the median) | A real number |
| Probabilistic classification | Full categorical distribution (smoothed if desired) | A distribution over labels |
| Quantile regression | A specified quantile of the leaf's training targets | The quantile (for example, the 90th percentile) |
| Survival / Cox regression | Hazard estimate or Kaplan-Meier curve | A survival function |
| Boosted tree (XGBoost, LightGBM, CatBoost) | A leaf weight (a real-valued gradient step), not a prediction by itself | Sum the leaf weights from every tree, then apply the link function |
| Custom | Any function of the leaf subset | Whatever the loss requires |
For a single classification tree, the standard rule is the majority vote of the training examples that arrived at the leaf. For a single regression tree, the standard is the leaf-subset mean. Both are the choices that minimize the corresponding training loss (0/1 loss or squared error) given a constant prediction per leaf. Median predictions are sometimes preferred for regression with heavy-tailed targets because medians are less sensitive to outliers.
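These minimization claims can be checked numerically on toy leaf subsets, in pure Python with made-up targets:

```python
# Regression: among constant predictions, the mean minimizes squared error.
targets = [1.0, 2.0, 2.0, 10.0]   # toy targets that reached one leaf

def sse(c):
    return sum((t - c) ** 2 for t in targets)

mean = sum(targets) / len(targets)
assert sse(mean) <= min(sse(mean - 0.5), sse(mean + 0.5))

# Classification: the majority class minimizes 0/1 loss.
labels = [0, 0, 1, 0]             # toy labels in another leaf
majority = max(set(labels), key=labels.count)
errors = lambda c: sum(l != c for l in labels)
assert errors(majority) <= errors(1 - majority)

print(mean, majority)             # 3.75 0
```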
In scikit-learn, you can get back the actual leaf an example lands in by calling tree.apply(X), which returns the integer leaf index for each row. This is useful for debugging, for cohort analysis, and for the leaf-embedding trick described later.
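A minimal sketch, assuming scikit-learn is installed and using a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0).fit(X, y)

# apply() returns the integer leaf index each row lands in.
leaf_ids = tree.apply(X)
print(sorted(set(leaf_ids)))  # at most 4 distinct leaf indices
```

Grouping rows by their leaf index is the cohort-analysis use mentioned above: every row sharing an index received the same prediction for the same structural reason.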
Gradient-boosted trees treat leaves differently from a standalone decision tree. A single boosted tree is not a model on its own; it is one term in a sum. Each tree is trained to nudge the current ensemble's predictions in a direction that reduces the loss, and what the leaf stores is the size of that nudge.
XGBoost and LightGBM follow Tianqi Chen and Carlos Guestrin's 2016 derivation. For a leaf j containing training instances with summed gradient G_j and summed Hessian H_j (computed from the current ensemble's predictions), the optimal leaf weight under an L2 regularizer lambda is
w_j = - G_j / (H_j + lambda)
and the gain from a candidate split into left and right children is
Gain = (1/2) * [ G_L^2/(H_L + lambda) + G_R^2/(H_R + lambda) - (G_L + G_R)^2/(H_L + H_R + lambda) ] - gamma
where gamma is a complexity penalty per added leaf. This is essentially one Newton step on the loss, applied per leaf. The final ensemble prediction for an example is the sum of the leaf weights from every tree in the ensemble (after applying the inverse link function for classification, for example a logistic for binary log loss).
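The two formulas can be checked with toy numbers; the gradient and Hessian sums below are made up for illustration:

```python
# Leaf weight and split gain from the XGBoost-style derivation above.
lam, gamma = 1.0, 0.0

def leaf_weight(G, H):
    return -G / (H + lam)

def gain(GL, HL, GR, HR):
    score = lambda G, H: G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR)
                  - score(GL + GR, HL + HR)) - gamma

print(leaf_weight(-4.0, 3.0))      # 1.0: negative gradient sum -> positive step
print(gain(-4.0, 3.0, 5.0, 2.0))   # positive: separating opposing gradients pays off
```

The gain is large exactly when the two children have gradient sums of opposite sign, which is the Newton-step intuition: the split lets each child take its own step instead of averaging them out.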
The practical consequence is that you cannot read meaning out of an individual boosted-tree leaf the way you can with a single CART leaf. A leaf weight of -0.4 in tree 17 is just a contribution; it is the sum across hundreds of trees that produces the final score.
Different libraries grow trees in different orders, and the growth strategy directly determines the shape and number of leaves you end up with.
| Strategy | Used by | How it picks the next split | Tree shape |
|---|---|---|---|
| Level-wise (depth-wise) | XGBoost (default), classical CART variants | Splits every leaf at the current depth before going deeper | Balanced; like a complete binary tree up to the point where splits stop being profitable |
| Leaf-wise (best-first) | LightGBM (default), XGBoost with grow_policy=lossguide | Always splits the leaf with the largest expected gain | Often deep and asymmetric; concentrates capacity where the data needs it |
| Symmetric (oblivious) | CatBoost | Forces every node at the same depth to use the same feature and threshold | Perfectly balanced; depth k gives exactly 2^k leaves |
| Pre-order (recursive) | Classic CART implementations | Recurses into one child fully before the other | Order does not affect the final tree because all profitable splits are taken |
Leaf-wise growth typically reaches a given training loss with fewer leaves on the same dataset, because each split spends complexity on the loss reduction that matters most. The cost is that with the same num_leaves budget, a leaf-wise tree is often deeper than a level-wise tree, and on small datasets it can chase noise. The LightGBM documentation explicitly warns about this: theoretically you could set num_leaves = 2^max_depth to match a depth-wise tree, but a leaf-wise tree at that setting is typically much deeper, and the project recommends keeping num_leaves smaller than 2^max_depth in practice. The example given in the docs is that for max_depth=7, setting num_leaves to around 70 or 80 often beats setting it to 127.
CatBoost goes the other way. By forcing all nodes at a given depth to share the same split, oblivious trees lose flexibility per tree but gain speed: leaf indices can be looked up with a sequence of bit operations, evaluation vectorizes well on CPUs and GPUs, and the symmetric structure acts as a strong regularizer.
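The bit-operation lookup can be sketched in a few lines. This is illustrative only; CatBoost's internals differ in detail:

```python
# In an oblivious tree, every level shares one (feature, threshold) pair,
# so the leaf index is just the k condition outcomes packed as bits.
def leaf_index(x, conditions):
    """conditions: one (feature, threshold) pair per level, root first."""
    idx = 0
    for feature, threshold in conditions:
        idx = (idx << 1) | (x[feature] > threshold)  # one bit per level
    return idx

conditions = [(0, 0.5), (1, 2.0), (0, 3.0)]  # depth 3 -> 2^3 = 8 leaves
print(leaf_index([4.0, 1.0], conditions))    # outcomes 1,0,1 -> index 5
```

No pointer chasing is needed: evaluating all k conditions up front and assembling the bits is branch-free, which is why the structure vectorizes so well.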
Most regularization in tree learners is leaf-shaped: you constrain how many leaves you may have, how small they may be, and how informative each split must be. The table below summarizes the parameters that show up most often.
| Parameter | Library | What it limits | Typical default |
|---|---|---|---|
| num_leaves | LightGBM | Maximum leaves per tree (the main complexity knob for leaf-wise growth) | 31 |
| max_leaves / max_leaf_nodes | XGBoost (max_leaves), scikit-learn (max_leaf_nodes) | Cap on total leaves per tree | 0 (unlimited) in XGBoost, None in scikit-learn |
| max_depth | All libraries | Caps depth, which indirectly caps leaves | 6 (XGBoost), -1 in LightGBM (no limit), None in scikit-learn |
| min_samples_leaf | scikit-learn | Minimum training samples that must end up in any leaf | 1 |
| min_data_in_leaf / min_child_samples | LightGBM | Minimum training samples per leaf | 20 |
| min_child_weight | XGBoost | Minimum sum of Hessians per leaf (for squared loss this is just the sample count) | 1 |
| min_sum_hessian_in_leaf | LightGBM | Hessian-sum analog of min_child_weight | 1e-3 |
| min_impurity_decrease | scikit-learn | Required reduction in impurity for a split to be accepted | 0.0 |
| gamma (min_split_loss) | XGBoost | Required gain reduction (after regularization) for a split | 0 |
| reg_lambda / lambda | XGBoost, LightGBM | L2 penalty on leaf weights (shrinks w_j) | 1.0 (XGBoost), 0 (LightGBM) |
| reg_alpha / alpha | XGBoost, LightGBM | L1 penalty on leaf weights | 0 |
| ccp_alpha | scikit-learn | Cost-complexity pruning strength applied after fitting | 0.0 |
These knobs all push in the same direction: too many large-capacity leaves and the tree memorizes; too few small leaves and the tree underfits. The right setting is almost always discovered by cross-validation, by an internal early-stopping loop on a validation set, or by a parameter-search tool like Optuna. The defaults are conservative on purpose. For LightGBM in particular, the docs note that min_data_in_leaf set into the hundreds or low thousands is reasonable for large datasets.
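A small scikit-learn sketch of the most common leaf-size knob, min_samples_leaf, on a noisy synthetic regression task (assumes scikit-learn is installed):

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# min_samples_leaf=1 lets the tree grow one leaf per training example.
loose = DecisionTreeRegressor(min_samples_leaf=1, random_state=0).fit(X, y)
# min_samples_leaf=25 forces each leaf to summarize at least 25 examples.
tight = DecisionTreeRegressor(min_samples_leaf=25, random_state=0).fit(X, y)

print(loose.get_n_leaves(), tight.get_n_leaves())
```

The constraint bounds the leaf count directly: with 300 samples and at least 25 per leaf, the tight tree can have at most 12 leaves.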
Leaves are the unit of capacity in a tree. Add more leaves and you can fit a more complex function; add too many and you fit the noise. The standard bias-variance picture applies. A tree with one leaf is a constant predictor; a tree with one leaf per training example perfectly memorizes the training set and generalizes terribly. The interesting region is in between, and it is where almost all hyperparameter tuning happens.
A few diagnostics help in practice:
- Sweep max_depth or num_leaves and plot validation error. The classic U-shaped validation curve tells you when you have crossed into overfitting.
- Raising min_split_loss or min_impurity_decrease will trim the tree.

For boosted ensembles, the situation is slightly different because you also get to control the number of trees and the learning rate. Many practitioners keep individual trees fairly shallow (depth 4 to 8, or num_leaves around 31 to 255) and rely on adding more trees with a small learning rate, which empirically generalizes better than building one or two enormous trees.
Pruning is the act of throwing away leaves (or whole subtrees) after the tree has been grown. The two flavors mirror the two ways you might decide to throw something away.
Pre-pruning (early stopping) refuses to create leaves that would be too small or too uninformative in the first place. Constraints like min_samples_leaf, min_data_in_leaf, min_child_weight, and min_impurity_decrease are pre-pruning. They are cheap because the tree never grows unnecessary structure, but they suffer from the horizon effect: a split that looks bad on its own might unlock excellent splits one level down, and pre-pruning never gives that chance.
Post-pruning grows the tree to its full size first and then collapses subtrees that hurt validation performance. The canonical method is minimal cost-complexity pruning, introduced by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone in their 1984 book Classification and Regression Trees. For a subtree T,
R_alpha(T) = R(T) + alpha * |T_leaves|
where R(T) is the resubstitution error (misclassification or squared error), |T_leaves| is the number of leaves, and alpha is a non-negative complexity parameter. The algorithm computes the "weakest link" alpha at which collapsing each subtree to a leaf is justified, prunes that subtree, and repeats. This produces a nested sequence of subtrees indexed by alpha, and the final alpha is chosen by cross-validation. In scikit-learn, this is exposed as the ccp_alpha parameter on both DecisionTreeClassifier and DecisionTreeRegressor.
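The nested alpha sequence is exposed in scikit-learn via cost_complexity_pruning_path; a short sketch on synthetic data (assumes scikit-learn is installed):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
# ccp_alphas is the ascending sequence of "weakest link" alphas; each one
# collapses the next subtree in the nested sequence.
path = full.cost_complexity_pruning_path(X, y)

# Refit with a large alpha from the sequence: more subtrees become leaves.
pruned = DecisionTreeClassifier(
    random_state=0, ccp_alpha=path.ccp_alphas[-2]
).fit(X, y)

print(full.get_n_leaves(), pruned.get_n_leaves())
```

In practice each alpha in the sequence would be scored by cross-validation rather than picked by position as done here.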
Reduced error pruning is a simpler alternative used in some C4.5-style implementations: hold out a validation set, walk up from the leaves, and replace any subtree with a single leaf if doing so does not hurt validation accuracy. C4.5 itself uses error-based pruning, which estimates pessimistic confidence intervals on the leaf error and prunes when the bound improves.
A neat idea from Xinran He and colleagues at Facebook ("Practical Lessons from Predicting Clicks on Ads at Facebook," ADKDD 2014) is to use the index of the leaf an example lands in as a categorical feature. Train a gradient boosted tree, then for every example record which leaf it ended up in for each tree. One-hot encode those leaf indices, concatenate the resulting sparse vector across all trees, and feed it into a logistic regression. The boosted trees handle nonlinear feature interactions; the linear model on top is cheap to retrain and easy to deploy. Facebook reported that this hybrid improved normalized cross-entropy by about 3.4 percent over either model alone, which at click-through-rate scale translates to a substantial revenue lift. The approach popularized leaf-embedding tricks in industrial recommender systems.
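A minimal sketch of the trick, using scikit-learn's GradientBoostingClassifier as a stand-in for the production GBDT and synthetic data (assumes scikit-learn is installed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

gbt = GradientBoostingClassifier(
    n_estimators=20, max_depth=3, random_state=0
).fit(X, y)

# apply() records, per example, the leaf index reached in every tree.
leaves = gbt.apply(X).reshape(len(X), -1)   # shape (n_samples, n_trees)

# One-hot encode the leaf indices and train a linear model on top.
enc = OneHotEncoder(handle_unknown="ignore")
leaf_onehot = enc.fit_transform(leaves)     # sparse leaf-embedding features
lr = LogisticRegression(max_iter=1000).fit(leaf_onehot, y)
print(lr.score(leaf_onehot, y))
```

In a real deployment the encoder and linear model would be fit on a separate split from the trees, and scored on held-out data, to avoid the leakage this toy version ignores.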
Beyond CTR, leaf indices are sometimes used as a learned discretization of the input space, as inputs to nearest-neighbor methods ("find me other examples that share many leaf assignments"), or as the categorical interface in deep tabular methods that try to combine the inductive bias of trees with the gradient flow of neural nets.
| Library | What is stored at a leaf | How to access |
|---|---|---|
| scikit-learn DecisionTreeClassifier | Class counts and predicted class | tree_.value, tree_.apply(X) |
| scikit-learn DecisionTreeRegressor | Mean target and sample count | tree_.value, tree_.apply(X) |
| XGBoost | Real-valued leaf weight per tree | Booster.predict(..., pred_leaf=True) returns leaf indices; dump_model() returns weights |
| LightGBM | Real-valued leaf weight per tree | Booster.predict(..., pred_leaf=True) returns leaf indices |
| CatBoost | Real-valued leaf weight per tree (oblivious) | model.calc_leaf_indexes() |
| Spark MLlib | Class label or mean | transform() produces predictions; intermediate leaf access requires UDFs |
For visualization, scikit-learn's sklearn.tree.plot_tree and sklearn.tree.export_text print the tree with the leaves at the bottom of the diagram or text dump. Each leaf line shows the predicted value, the impurity at the leaf, and the number of samples that reach it.
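For example, on the iris dataset that ships with scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(
    iris.data, iris.target
)

# Each branch of the text dump terminates in a leaf line ("class: ...").
report = export_text(clf, feature_names=list(iris.feature_names))
print(report)
```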
Gradient-boosted decision trees remain the dominant tool for tabular data. Benchmarks like the 2022 Grinsztajn et al. study ("Why do tree-based models still outperform deep learning on tabular data?") found that XGBoost and other GBT libraries beat large neural networks on a wide range of tabular tasks, and leaves are the unit doing the work. Production GBT models routinely have ensembles of 1,000 to 10,000 trees with 31 to 255 leaves each, which can mean millions of leaves in total. The combinatorial number of distinct leaf-index tuples (one per tree) is enormous, and that is why ensembles can fit such complex functions even though each individual tree is shallow.
Leaf representations also show up in neural tabular methods. Neural Oblivious Decision Ensembles (NODE), TabNet, and related architectures borrow the leaf-and-condition vocabulary, replacing hard splits with soft attention so that gradients can flow back through what would otherwise be a discontinuous routing decision. The leaves in those models hold learnable embeddings rather than scalars, but the structural intuition is the same: route an example through a sequence of conditions and emit whatever the matching leaf stores.
A few things to watch for when working with leaves:
- Tiny leaves memorize noise. Raise min_samples_leaf, min_data_in_leaf, or min_child_weight to prevent them.
- Imbalanced classes produce leaves dominated by the majority class. The is_unbalance (LightGBM) and scale_pos_weight (XGBoost) parameters reweight the loss so leaves can move toward the minority class.

Imagine you are sorting toys by asking yes-or-no questions. "Is it red?" If yes, walk to the red bucket. "Is it bigger than a shoebox?" If yes, walk to the big-red bucket. After a few questions you arrive at one specific bucket, and that bucket already has a guess written on it: "This is probably a fire truck." The bucket is the leaf. The questions are the splits. The whole point of building the tree was to figure out the right questions and to write a good guess on each bucket.