See also: Machine learning terms
In a decision tree, a leaf (also called a terminal node) is a node that has no children. Every path from the root ends at a leaf, and the leaf is where the model finally produces a prediction. Internal nodes carry a split condition such as "age <= 35" or "color in {red, blue}" and route an example to one of their children; leaves do nothing of the sort. They simply hold a value (or a small structure) that the tree returns for any example whose path lands on them.
Google's decision forests glossary puts it concisely: "a leaf is any endpoint in a decision tree. Unlike a condition, a leaf does not perform a test. Rather, a leaf is a possible prediction." The same glossary describes the chain of conditions traversed at inference time as the inference path, with the leaf as the terminal node of that path.
Leaves are the part of the model that actually carries predictive information. The conditions are routing logic. Once a tree is built, you could think of it as a lookup table that maps an example to one of its leaves, and the leaf is what gets returned. This is why most of the interesting design choices in modern tree learners (LightGBM, XGBoost, CatBoost) revolve around how many leaves to grow, how deep they sit, what value to store in each one, and how to keep them from memorizing noise.
A decision tree is a directed rooted tree. There are exactly two kinds of nodes:

- Internal nodes (conditions), each holding a test that routes an example to one of its children. The test can be axis-aligned (for example, feature_j <= threshold), oblique (a test on a linear combination of features), or in-set (a categorical membership test). See axis-aligned condition, oblique condition, and in-set condition for the distinctions. The threshold is the constant the feature is compared against.
- Leaves (terminal nodes), which have no children and hold the value the tree returns.

The root is the special internal node at the top of the tree where every inference begins. In a binary tree (the kind produced by CART, XGBoost, LightGBM, and the default scikit-learn implementation), every internal node has exactly two children. A tree of depth k therefore has at most 2^k leaves, and CatBoost's symmetric trees hit that bound exactly.
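The two node kinds can be sketched in a few lines of plain Python. This is an illustrative structure, not any library's internal representation:

```python
from dataclasses import dataclass

@dataclass
class Leaf:
    value: float      # the prediction the tree returns

@dataclass
class Internal:
    feature: int      # index of the feature tested (axis-aligned condition)
    threshold: float  # constant the feature is compared against
    left: object      # child followed when x[feature] <= threshold
    right: object     # child followed otherwise

def predict(node, x):
    """Follow the inference path from the root until a leaf is reached."""
    while isinstance(node, Internal):
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value

# Depth-1 tree: one condition ("age <= 35"), two leaves.
tree = Internal(feature=0, threshold=35.0, left=Leaf(0.0), right=Leaf(1.0))
print(predict(tree, [30.0]))  # 0.0: the example routes to the left leaf
```

Note that `predict` does no computation at the leaf itself; the leaf just stores the value, which is the point the glossary definition makes.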
A common point of confusion: "leaf" in a decision tree is not the same as "leaf" in graph theory (where it usually means a vertex of degree one in an undirected tree), and neither has anything to do with botany. The vocabulary collides because trees are everywhere in computer science. The decision-tree leaf is specifically the terminal predictive node.
What sits inside a leaf depends on the task and the algorithm. The body of the leaf is whatever the training procedure decides best summarizes the subset of training examples that fell into that leaf. Below are the common cases.
| Task | Stored in the leaf | Prediction returned |
|---|---|---|
| Binary classification | Majority class (or class proportions) of training examples in the leaf | Class label, or a probability between 0 and 1 |
| Multi-class classification | Vector of class counts or normalized class probabilities | Argmax class, or full probability distribution |
| Regression | Mean of the training target values in the leaf (sometimes the median) | A real number |
| Probabilistic classification | Full categorical distribution (smoothed if desired) | A distribution over labels |
| Quantile regression | A specified quantile of the leaf's training targets | The quantile (for example, the 90th percentile) |
| Survival / Cox regression | Hazard estimate or Kaplan-Meier curve | A survival function |
| Boosted tree (XGBoost, LightGBM, CatBoost) | A leaf weight (a real-valued gradient step), not a prediction by itself | Sum the leaf weights from every tree, then apply the link function |
| Custom | Any function of the leaf subset | Whatever the loss requires |
For a single classification tree, the standard rule is the majority vote of the training examples that arrived at the leaf. For a single regression tree, the standard is the leaf-subset mean. Both are the choices that minimize the corresponding training loss (0/1 loss or squared error) given a constant prediction per leaf. Median predictions are sometimes preferred for regression with heavy-tailed targets because medians are less sensitive to outliers.
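These minimization claims can be checked numerically on toy leaf subsets, in pure Python with made-up targets:

```python
# Regression: among constant predictions, the mean minimizes squared error.
targets = [1.0, 2.0, 2.0, 10.0]   # toy targets that reached one leaf

def sse(c):
    return sum((t - c) ** 2 for t in targets)

mean = sum(targets) / len(targets)
assert sse(mean) <= min(sse(mean - 0.5), sse(mean + 0.5))

# Classification: the majority class minimizes 0/1 loss.
labels = [0, 0, 1, 0]             # toy labels in another leaf
majority = max(set(labels), key=labels.count)
errors = lambda c: sum(l != c for l in labels)
assert errors(majority) <= errors(1 - majority)

print(mean, majority)             # 3.75 0
```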
In scikit-learn, you can get back the actual leaf an example lands in by calling tree.apply(X), which returns the integer leaf index for each row. This is useful for debugging, for cohort analysis, and for the leaf-embedding trick described later.
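A minimal sketch, assuming scikit-learn is installed and using a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0).fit(X, y)

# apply() returns the integer leaf index each row lands in.
leaf_ids = tree.apply(X)
print(sorted(set(leaf_ids)))  # at most 4 distinct leaf indices
```

Grouping rows by their leaf index is the cohort-analysis use mentioned above: every row sharing an index received the same prediction for the same structural reason.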
Gradient-boosted trees treat leaves differently from a standalone decision tree. A single boosted tree is not a model on its own; it is one term in a sum. Each tree is trained to nudge the current ensemble's predictions in a direction that reduces the loss, and what the leaf stores is the size of that nudge.
XGBoost and LightGBM follow Tianqi Chen and Carlos Guestrin's 2016 derivation. For a leaf j containing training instances with summed gradient G_j and summed Hessian H_j (computed from the current ensemble's predictions), the optimal leaf weight under an L2 regularizer lambda is
w_j = - G_j / (H_j + lambda)
and the gain from a candidate split into left and right children is
Gain = (1/2) * [ G_L^2/(H_L + lambda) + G_R^2/(H_R + lambda) - (G_L + G_R)^2/(H_L + H_R + lambda) ] - gamma
where gamma is a complexity penalty per added leaf. This is essentially one Newton step on the loss, applied per leaf. The final ensemble prediction for an example is the sum of the leaf weights from every tree in the ensemble (after applying the inverse link function for classification, for example a logistic for binary log loss).
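The two formulas can be checked with toy numbers; the gradient and Hessian sums below are made up for illustration:

```python
# Leaf weight and split gain from the XGBoost-style derivation above.
lam, gamma = 1.0, 0.0

def leaf_weight(G, H):
    return -G / (H + lam)

def gain(GL, HL, GR, HR):
    score = lambda G, H: G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR)
                  - score(GL + GR, HL + HR)) - gamma

print(leaf_weight(-4.0, 3.0))      # 1.0: negative gradient sum -> positive step
print(gain(-4.0, 3.0, 5.0, 2.0))   # positive: separating opposing gradients pays off
```

The gain is large exactly when the two children have gradient sums of opposite sign, which is the Newton-step intuition: the split lets each child take its own step instead of averaging them out.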
The practical consequence is that you cannot read meaning out of an individual boosted-tree leaf the way you can with a single CART leaf. A leaf weight of -0.4 in tree 17 is just a contribution; it is the sum across hundreds of trees that produces the final score.
Different libraries grow trees in different orders, and the growth strategy directly determines the shape and number of leaves you end up with.
| Strategy | Used by | How it picks the next split | Tree shape |
|---|---|---|---|
| Level-wise (depth-wise) | XGBoost (default), classical CART variants | Splits every leaf at the current depth before going deeper | Balanced; like a complete binary tree up to the point where splits stop being profitable |
| Leaf-wise (best-first) | LightGBM (default), XGBoost with grow_policy=lossguide | Always splits the leaf with the largest expected gain | Often deep and asymmetric; concentrates capacity where the data needs it |
| Symmetric (oblivious) | CatBoost | Forces every node at the same depth to use the same feature and threshold | Perfectly balanced; depth k gives exactly 2^k leaves |
| Pre-order (recursive) | Classic CART implementations | Recurses into one child fully before the other | Order does not affect the final tree because all profitable splits are taken |
Leaf-wise growth typically reaches a given training loss with fewer leaves on the same dataset, because each split spends complexity on the loss reduction that matters most. The cost is that with the same num_leaves budget, a leaf-wise tree is often deeper than a level-wise tree, and on small datasets it can chase noise. The LightGBM documentation explicitly warns about this: theoretically you could set num_leaves = 2^max_depth to match a depth-wise tree, but a leaf-wise tree at that setting is typically much deeper, and the project recommends keeping num_leaves smaller than 2^max_depth in practice. The example given in the docs is that for max_depth=7, setting num_leaves to around 70 or 80 often beats setting it to 127.
CatBoost goes the other way. By forcing all nodes at a given depth to share the same split, oblivious trees lose flexibility per tree but gain speed: leaf indices can be looked up with a sequence of bit operations, evaluation vectorizes well on CPUs and GPUs, and the symmetric structure acts as a strong regularizer.
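The bit-operation lookup can be sketched in a few lines. This is illustrative only; CatBoost's internals differ in detail:

```python
# In an oblivious tree, every level shares one (feature, threshold) pair,
# so the leaf index is just the k condition outcomes packed as bits.
def leaf_index(x, conditions):
    """conditions: one (feature, threshold) pair per level, root first."""
    idx = 0
    for feature, threshold in conditions:
        idx = (idx << 1) | (x[feature] > threshold)  # one bit per level
    return idx

conditions = [(0, 0.5), (1, 2.0), (0, 3.0)]  # depth 3 -> 2^3 = 8 leaves
print(leaf_index([4.0, 1.0], conditions))    # outcomes 1,0,1 -> index 5
```

No pointer chasing is needed: evaluating all k conditions up front and assembling the bits is branch-free, which is why the structure vectorizes so well.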
Most regularization in tree learners is leaf-shaped: you constrain how many leaves you may have, how small they may be, and how informative each split must be. The table below summarizes the parameters that show up most often.
| Parameter | Library | What it limits | Typical default |
|---|---|---|---|
| num_leaves | LightGBM | Maximum leaves per tree (the main complexity knob for leaf-wise growth) | 31 |
| max_leaves / max_leaf_nodes | XGBoost (max_leaves), scikit-learn (max_leaf_nodes) | Cap on total leaves per tree | 0 (unlimited) in XGBoost, None in scikit-learn |
| max_depth | All libraries | Caps depth, which indirectly caps leaves | 6 (XGBoost), -1 in LightGBM (no limit), None in scikit-learn |
| min_samples_leaf | scikit-learn | Minimum training samples that must end up in any leaf | 1 |
| min_data_in_leaf / min_child_samples | LightGBM | Minimum training samples per leaf | 20 |
| min_child_weight | XGBoost | Minimum sum of Hessians per leaf (for squared loss this is just the sample count) | 1 |
| min_sum_hessian_in_leaf | LightGBM | Hessian-sum analog of min_child_weight | 1e-3 |
| min_impurity_decrease | scikit-learn | Required reduction in impurity for a split to be accepted | 0.0 |
| gamma (min_split_loss) | XGBoost | Required gain reduction (after regularization) for a split | 0 |
| reg_lambda / lambda | XGBoost, LightGBM | L2 penalty on leaf weights (shrinks w_j) | 1.0 (XGBoost), 0 (LightGBM) |
| reg_alpha / alpha | XGBoost, LightGBM | L1 penalty on leaf weights | 0 |
| ccp_alpha | scikit-learn | Cost-complexity pruning strength applied after fitting | 0.0 |
These knobs all push in the same direction: too many large-capacity leaves and the tree memorizes; too few small leaves and the tree underfits. The right setting is almost always discovered by cross-validation, by an internal early-stopping loop on a validation set, or by a parameter-search tool like Optuna. The defaults are conservative on purpose. For LightGBM in particular, the docs note that min_data_in_leaf set into the hundreds or low thousands is reasonable for large datasets.
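A small scikit-learn sketch of the most common leaf-size knob, min_samples_leaf, on a noisy synthetic regression task (assumes scikit-learn is installed):

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# min_samples_leaf=1 lets the tree grow one leaf per training example.
loose = DecisionTreeRegressor(min_samples_leaf=1, random_state=0).fit(X, y)
# min_samples_leaf=25 forces each leaf to summarize at least 25 examples.
tight = DecisionTreeRegressor(min_samples_leaf=25, random_state=0).fit(X, y)

print(loose.get_n_leaves(), tight.get_n_leaves())
```

The constraint bounds the leaf count directly: with 300 samples and at least 25 per leaf, the tight tree can have at most 12 leaves.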
Leaves are the unit of capacity in a tree. Add more leaves and you can fit a more complex function; add too many and you fit the noise. The standard bias-variance picture applies. A tree with one leaf is a constant predictor; a tree with one leaf per training example perfectly memorizes the training set and generalizes terribly. The interesting region is in between, and it is where almost all hyperparameter tuning happens.
A few diagnostics help in practice:
- Sweep max_depth or num_leaves and plot validation error. The classic U-shaped validation curve tells you when you have crossed into overfitting.
- Raising min_split_loss or min_impurity_decrease will trim the tree.

For boosted ensembles, the situation is slightly different because you also get to control the number of trees and the learning rate. Many practitioners keep individual trees fairly shallow (depth 4 to 8, or num_leaves around 31 to 255) and rely on adding more trees with a small learning rate, which empirically generalizes better than building one or two enormous trees.
Pruning is the act of throwing away leaves (or whole subtrees) after the tree has been grown. The two flavors mirror the two ways you might decide to throw something away.
Pre-pruning (early stopping) refuses to create leaves that would be too small or too uninformative in the first place. Constraints like min_samples_leaf, min_data_in_leaf, min_child_weight, and min_impurity_decrease are pre-pruning. They are cheap because the tree never grows unnecessary structure, but they suffer from the horizon effect: a split that looks bad on its own might unlock excellent splits one level down, and pre-pruning never gives that chance.
Post-pruning grows the tree to its full size first and then collapses subtrees that hurt validation performance. The canonical method is minimal cost-complexity pruning, introduced by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone in their 1984 book Classification and Regression Trees. For a subtree T,
R_alpha(T) = R(T) + alpha * |T_leaves|
where R(T) is the resubstitution error (misclassification or squared error), |T_leaves| is the number of leaves, and alpha is a non-negative complexity parameter. The algorithm computes the "weakest link" alpha at which collapsing each subtree to a leaf is justified, prunes that subtree, and repeats. This produces a nested sequence of subtrees indexed by alpha, and the final alpha is chosen by cross-validation. In scikit-learn, this is exposed as the ccp_alpha parameter on both DecisionTreeClassifier and DecisionTreeRegressor.
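The nested alpha sequence is exposed in scikit-learn via cost_complexity_pruning_path; a short sketch on synthetic data (assumes scikit-learn is installed):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
# ccp_alphas is the ascending sequence of "weakest link" alphas; each one
# collapses the next subtree in the nested sequence.
path = full.cost_complexity_pruning_path(X, y)

# Refit with a large alpha from the sequence: more subtrees become leaves.
pruned = DecisionTreeClassifier(
    random_state=0, ccp_alpha=path.ccp_alphas[-2]
).fit(X, y)

print(full.get_n_leaves(), pruned.get_n_leaves())
```

In practice each alpha in the sequence would be scored by cross-validation rather than picked by position as done here.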
Reduced error pruning is a simpler alternative used in some C4.5-style implementations: hold out a validation set, walk up from the leaves, and replace any subtree with a single leaf if doing so does not hurt validation accuracy. C4.5 itself uses error-based pruning, which estimates pessimistic confidence intervals on the leaf error and prunes when the bound improves.
A neat idea from Xinran He and colleagues at Facebook ("Practical Lessons from Predicting Clicks on Ads at Facebook," ADKDD 2014) is to use the index of the leaf an example lands in as a categorical feature. Train a gradient boosted tree, then for every example record which leaf it ended up in for each tree. One-hot encode those leaf indices, concatenate the resulting sparse vector across all trees, and feed it into a logistic regression. The boosted trees handle nonlinear feature interactions; the linear model on top is cheap to retrain and easy to deploy. Facebook reported that this hybrid improved normalized cross-entropy by about 3.4 percent over either model alone, which at click-through-rate scale translates to a substantial revenue lift. The approach popularized leaf-embedding tricks in industrial recommender systems.
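A minimal sketch of the trick, using scikit-learn's GradientBoostingClassifier as a stand-in for the production GBDT and synthetic data (assumes scikit-learn is installed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

gbt = GradientBoostingClassifier(
    n_estimators=20, max_depth=3, random_state=0
).fit(X, y)

# apply() records, per example, the leaf index reached in every tree.
leaves = gbt.apply(X).reshape(len(X), -1)   # shape (n_samples, n_trees)

# One-hot encode the leaf indices and train a linear model on top.
enc = OneHotEncoder(handle_unknown="ignore")
leaf_onehot = enc.fit_transform(leaves)     # sparse leaf-embedding features
lr = LogisticRegression(max_iter=1000).fit(leaf_onehot, y)
print(lr.score(leaf_onehot, y))
```

In a real deployment the encoder and linear model would be fit on a separate split from the trees, and scored on held-out data, to avoid the leakage this toy version ignores.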
Beyond CTR, leaf indices are sometimes used as a learned discretization of the input space, as inputs to nearest-neighbor methods ("find me other examples that share many leaf assignments"), or as the categorical interface in deep tabular methods that try to combine the inductive bias of trees with the gradient flow of neural nets.
| Library | What is stored at a leaf | How to access |
|---|---|---|
| scikit-learn DecisionTreeClassifier | Class counts and predicted class | tree_.value, tree_.apply(X) |
| scikit-learn DecisionTreeRegressor | Mean target and sample count | tree_.value, tree_.apply(X) |
| XGBoost | Real-valued leaf weight per tree | Booster.predict(..., pred_leaf=True) returns leaf indices; dump_model() returns weights |
| LightGBM | Real-valued leaf weight per tree | Booster.predict(..., pred_leaf=True) returns leaf indices |
| CatBoost | Real-valued leaf weight per tree (oblivious) | model.calc_leaf_indexes() |
| Spark MLlib | Class label or mean | transform() produces predictions; intermediate leaf access requires UDFs |
For visualization, scikit-learn's sklearn.tree.plot_tree and sklearn.tree.export_text print the tree with the leaves at the bottom of the diagram or text dump. Each leaf line shows the predicted value, the impurity at the leaf, and the number of samples that reach it.
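For example, on the iris dataset that ships with scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(
    iris.data, iris.target
)

# Each branch of the text dump terminates in a leaf line ("class: ...").
report = export_text(clf, feature_names=list(iris.feature_names))
print(report)
```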
Gradient-boosted decision trees remain the dominant tool for tabular data. Benchmarks like the 2022 Grinsztajn et al. study ("Why do tree-based models still outperform deep learning on tabular data?") found that XGBoost and other GBT libraries beat large neural networks on a wide range of tabular tasks, and leaves are the unit doing the work. Production GBT models routinely have ensembles of 1,000 to 10,000 trees with 31 to 255 leaves each, which can mean millions of leaves in total. The combinatorial number of distinct leaf-index tuples (one per tree) is enormous, and that is why ensembles can fit such complex functions even though each individual tree is shallow.
Leaf representations also show up in neural tabular methods. Neural Oblivious Decision Ensembles (NODE), TabNet, and related architectures borrow the leaf-and-condition vocabulary, replacing hard splits with soft attention so that gradients can flow back through what would otherwise be a discontinuous routing decision. The leaves in those models hold learnable embeddings rather than scalars, but the structural intuition is the same: route an example through a sequence of conditions and emit whatever the matching leaf stores.
A few things to watch for when working with leaves:
- Tiny leaves memorize noise. Raise min_samples_leaf, min_data_in_leaf, or min_child_weight to prevent them.
- Imbalanced classes produce leaves dominated by the majority class. The is_unbalance (LightGBM) and scale_pos_weight (XGBoost) parameters reweight the loss so leaves can move toward the minority class.

Imagine you are sorting toys by asking yes-or-no questions. "Is it red?" If yes, walk to the red bucket. "Is it bigger than a shoebox?" If yes, walk to the big-red bucket. After a few questions you arrive at one specific bucket, and that bucket already has a guess written on it: "This is probably a fire truck." The bucket is the leaf. The questions are the splits. The whole point of building the tree was to figure out the right questions and to write a good guess on each bucket.