# Leaf

> Source: https://aiwiki.ai/wiki/leaf
> Updated: 2026-07-11
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

A **leaf** (also called a **terminal node**) is a node in a [decision tree](/wiki/decision_tree) that has no children and holds the model's prediction. Every path from the root of the tree ends at exactly one leaf, and that leaf is where the model finally produces an output: a class label for classification, a real number for regression, or a leaf weight in a gradient-boosted ensemble. Unlike an internal node, a leaf performs no test; it simply returns the value stored in it for any example whose path lands there. [1][10]

Google's decision forests glossary defines the term concisely: "Any endpoint in a decision tree. Unlike a condition, a leaf doesn't perform a test. Rather, a leaf is a possible prediction." [10] The same glossary describes the chain of [conditions](/wiki/axis-aligned_condition) an example traverses at inference time as the *inference path*, defined as "the route a particular example takes from the root to other conditions, terminating with a leaf." [10]

Leaves are the part of the model that actually carries predictive information. The conditions are routing logic. Once a tree is built, you could think of it as a lookup table that maps an example to one of its leaves, and the leaf is what gets returned. This is why most of the interesting design choices in modern tree learners ([LightGBM](/wiki/lightgbm), [XGBoost](/wiki/xgboost), CatBoost) revolve around how many leaves to grow, how deep they sit, what value to store in each one, and how to keep them from memorizing noise. [5][6][7]

## What is a leaf node?

A decision tree is a directed rooted tree. There are exactly two kinds of nodes:

- An **internal node** has at least one child and holds a split condition. The condition can be axis-aligned (a test on a single feature, like `feature_j <= threshold`), oblique (a test on a linear combination of features), or in-set (a categorical membership test). See [axis-aligned condition](/wiki/axis-aligned_condition), [oblique condition](/wiki/oblique_condition), and [in-set condition](/wiki/in-set_condition) for the distinctions. The [threshold](/wiki/threshold_for_decision_trees) is the constant the feature is compared against. In Google's glossary a condition is "any node that performs a test," also called a split or a test. [10]
- A **leaf** has no children and holds a prediction.

The root is the special internal node at the top of the tree where every inference begins. In a binary tree (the kind produced by [CART](/wiki/cart_algorithm), XGBoost, LightGBM, and the default [scikit-learn](/wiki/scikit-learn) implementation), every internal node has exactly two children. A tree of depth $$k$$ therefore has at most $$2^k$$ leaves, and CatBoost's symmetric trees actually hit that bound exactly: at the default depth of 6, every CatBoost tree has $$2^6 = 64$$ leaves. [7]

A common point of confusion: "leaf" in a decision tree is not the same as "leaf" in graph theory (where it usually means a vertex of degree one in an undirected tree), and neither has anything to do with botany. The vocabulary collides because trees are everywhere in computer science. The decision-tree leaf is specifically the terminal predictive node. [10]

## What does a leaf store and how does it make a prediction?

What sits inside a leaf depends on the task and the algorithm. The body of the leaf is whatever the training procedure decides best summarizes the subset of training examples that fell into that leaf. Below are the common cases.

| Task | Stored in the leaf | Prediction returned |
|------|--------------------|---------------------|
| Binary classification | Majority class (or class proportions) of training examples in the leaf | Class label, or a probability between 0 and 1 |
| Multi-class classification | Vector of class counts or normalized class probabilities | Argmax class, or full probability distribution |
| Regression | Mean of the training target values in the leaf (sometimes the median) | A real number |
| Probabilistic classification | Full categorical distribution (smoothed if desired) | A distribution over labels |
| Quantile regression | A specified quantile of the leaf's training targets | The quantile (for example, the 90th percentile) |
| Survival / Cox regression | Hazard estimate or Kaplan-Meier curve | A survival function |
| Boosted tree (XGBoost, LightGBM, CatBoost) | A *leaf weight* (a real-valued gradient step), not a prediction by itself | Sum the leaf weights from every tree, then apply the inverse link function |
| Custom | Any function of the leaf subset | Whatever the loss requires |

For a single classification tree, the standard rule is the majority vote of the training examples that arrived at the leaf. For a single regression tree, the standard is the leaf-subset mean. Both are the choices that minimize the corresponding training loss (0/1 loss or squared error) given a constant prediction per leaf. [1] Median predictions are sometimes preferred for regression with heavy-tailed targets because medians are less sensitive to outliers.
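
As a minimal sketch of those two rules, a leaf's stored value can be computed directly from the training labels that reached it. The helper names `classification_leaf_value` and `regression_leaf_value` are hypothetical and not part of any library; the sample data is made up.

```python
from collections import Counter
from statistics import mean, median

def classification_leaf_value(labels):
    """Return (majority_class, class_proportions) for the labels in one leaf."""
    counts = Counter(labels)
    total = len(labels)
    proportions = {cls: n / total for cls, n in counts.items()}
    majority_class = counts.most_common(1)[0][0]
    return majority_class, proportions

def regression_leaf_value(targets, robust=False):
    """Return the leaf mean (squared-error optimum) or the median (outlier-robust)."""
    return median(targets) if robust else mean(targets)

# Made-up training examples that all landed in the same leaf.
labels_in_leaf = ["spam", "spam", "ham", "spam"]
targets_in_leaf = [1.0, 2.0, 3.0, 100.0]  # one heavy-tailed outlier

print(classification_leaf_value(labels_in_leaf))
print(regression_leaf_value(targets_in_leaf))               # mean is pulled up by the outlier
print(regression_leaf_value(targets_in_leaf, robust=True))  # median stays near the bulk
```

The outlier in `targets_in_leaf` shows why the median is sometimes preferred: it moves the mean far from the bulk of the data but barely affects the median.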

In scikit-learn, you can get back the actual leaf an example lands in by calling `tree.apply(X)`, which returns the integer leaf index for each row. [13] This is useful for debugging, for cohort analysis, and for the leaf-embedding trick described later.
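
A short sketch of that call, assuming scikit-learn is installed and using its bundled iris dataset purely as an illustration. The exact contents of `tree_.value` (counts versus proportions) may differ between scikit-learn versions, so treat the printed numbers as version-dependent.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# One integer per row: the node id of the leaf that row lands in.
leaf_ids = clf.apply(X)

tree = clf.tree_
# In the fitted tree arrays, a node with no left child (-1) is a leaf.
is_leaf = tree.children_left == -1

for node_id in sorted(set(leaf_ids)):
    # n_node_samples: training rows that reached this node.
    # value: per-class counts or proportions, depending on the version.
    print(node_id, tree.n_node_samples[node_id], tree.value[node_id].ravel())
```

Grouping rows by `leaf_ids` gives the cohorts mentioned above: every row in a group followed the same path and receives the same prediction.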

## What is leaf purity?

During training, a decision-tree learner chooses splits to make the resulting leaves as *pure* as possible, meaning each leaf should contain training examples that mostly share one label (for classification) or cluster tightly around one value (for regression). A perfectly pure classification leaf contains examples of a single class; an impure leaf is a mix. The learner measures impurity with a function evaluated on the class proportions in a node, and it accepts the split that reduces total impurity the most. [1][13]

The two impurity measures used most often for classification are:

| Impurity measure | Formula (over class proportions $$p_i$$ in a node) | Range for K classes | Notes |
|------------------|-----------------------------------------------|---------------------|-------|
| Gini impurity | $$1 - \sum_i p_i^2$$ | 0 to $$1 - 1/K$$ | Default `criterion` in scikit-learn's `DecisionTreeClassifier`; 0 means a pure node |
| Entropy (information gain) | $$-\sum_i p_i \log_2(p_i)$$ | 0 to $$\log_2(K)$$ | Used by ID3 and, through the gain ratio, C4.5; the split that maximizes the entropy drop is the "information gain" criterion |

Both functions equal 0 when a node is perfectly pure (one class has proportion 1) and reach their maximum when classes are evenly mixed. [13] For a two-class node, Gini impurity peaks at 0.5 and entropy peaks at 1.0 bit, both at the 50/50 split. In scikit-learn, `plot_tree` displays the impurity at each node (including each leaf), and the fitted tree exposes the same numbers in `tree_.impurity`; a leaf impurity near 0 signals that the path to it cleanly separated the classes. For regression trees, the analogous quantity is the mean squared error (variance) of the target inside the node, which the tree minimizes instead of a class-impurity measure. [13]
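
The two formulas in the table can be checked with a few lines of plain Python. This is a sketch with hypothetical helper names (`gini`, `entropy`), not the code any library uses internally.

```python
from math import log2

def gini(proportions):
    """Gini impurity: 1 - sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Shannon entropy in bits; p * log2(1/p) equals -p * log2(p).

    Classes with proportion 0 are skipped (0 * log 0 is taken as 0).
    """
    return sum(p * log2(1.0 / p) for p in proportions if p > 0)

# Pure node, mostly pure node, evenly mixed two-class node.
for props in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(props, round(gini(props), 4), round(entropy(props), 4))
```

Both functions rise from 0 at the pure node to their two-class maxima at the 50/50 node, matching the ranges listed in the table.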

Purity is the engine of tree growth: the recursive split-selection rule keeps subdividing nodes as long as splitting buys a worthwhile drop in impurity, and it stops (creating a leaf) when no split helps or a stopping constraint such as `min_impurity_decrease` is hit. [13]

## How do leaves work in gradient-boosted trees?

Gradient-boosted trees treat leaves differently from a standalone decision tree. A single boosted tree is not a model on its own; it is one term in a sum. Each tree is trained to nudge the current ensemble's predictions in a direction that reduces the loss, and what the leaf stores is the size of that nudge.

XGBoost and LightGBM follow Tianqi Chen and Carlos Guestrin's 2016 derivation. [5] For a leaf $$j$$ containing training instances with summed gradient $$G_j$$ and summed Hessian $$H_j$$ (computed from the current ensemble's predictions), the optimal leaf weight under an L2 regularizer $$\lambda$$ is

$$
w_j = -\frac{G_j}{H_j + \lambda}
$$

and the gain from a candidate split into left and right children is

$$
\text{Gain} = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma
$$

where $$\gamma$$ is a complexity penalty per added leaf. [5] The leaf weight $$w_j$$ is essentially one Newton step on the loss, applied per leaf, because XGBoost uses a second-order Taylor expansion of the objective. [5] The final ensemble score for an example is the sum of the leaf weights it reaches, one per tree in the ensemble; for classification that score is then passed through the inverse link function (for example, a logistic sigmoid for binary log loss).
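
The two formulas above translate almost directly into code. The sketch below uses hypothetical function names and assumes a squared-error loss of the form ½(prediction − target)², so each example contributes a gradient of `prediction - target` and a Hessian of 1; the targets are made up.

```python
def leaf_weight(G, H, lam=1.0):
    """Optimal leaf weight w_j = -G_j / (H_j + lambda)."""
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Gain of splitting one leaf into left and right children."""
    def score(G, H):
        return G * G / (H + lam)
    parent = score(G_L + G_R, H_L + H_R)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - parent) - gamma

# Current ensemble predicts 0.0 for every example (assumption for the demo).
left_targets = [1.0, 1.0, 1.0]
right_targets = [-1.0, -1.0]

G_L = sum(0.0 - t for t in left_targets)   # summed gradients on the left
H_L = float(len(left_targets))             # Hessian is 1 per example
G_R = sum(0.0 - t for t in right_targets)
H_R = float(len(right_targets))

print(leaf_weight(G_L, H_L))            # positive nudge toward the left targets
print(leaf_weight(G_R, H_R))            # negative nudge toward the right targets
print(split_gain(G_L, H_L, G_R, H_R))   # positive, so the split is worth taking
```

With `lam=1.0` the left weight is 0.75 rather than the raw mean of 1.0, which shows how the L2 term shrinks leaf weights toward zero.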

The practical consequence is that you cannot read meaning out of an individual boosted-tree leaf the way you can with a single CART leaf. A leaf weight of -0.4 in tree 17 is just a contribution; it is the sum across hundreds of trees that produces the final score.

## How do tree-growth strategies change the leaves?

Different libraries grow trees in different orders, and the growth strategy directly determines the shape and number of leaves you end up with.

| Strategy | Used by | How it picks the next split | Tree shape |
|----------|---------|-----------------------------|------------|
| Level-wise (depth-wise) | XGBoost (default), classical CART variants | Splits every leaf at the current depth before going deeper | Balanced; like a complete binary tree up to the point where splits stop being profitable |
| Leaf-wise (best-first) | LightGBM (default), XGBoost with `grow_policy=lossguide` | Always splits the leaf with the largest expected gain ("split at nodes with highest loss change") | Often deep and asymmetric; concentrates capacity where the data needs it |
| Symmetric (oblivious) | CatBoost | Forces every node at the same depth to use the same feature and threshold | Perfectly balanced; depth $$k$$ gives exactly $$2^k$$ leaves |
| Pre-order (recursive) | Classic CART implementations | Recurses into one child fully before the other | Absent a cap on the number of leaves, order does not affect the final tree because all profitable splits are taken |

Leaf-wise growth typically reaches a given training loss with fewer leaves than level-wise growth on the same dataset, because each split spends complexity on the loss reduction that matters most. The cost is that with the same `num_leaves` budget, a leaf-wise tree is often deeper than a level-wise tree, and on small datasets it can chase noise. The LightGBM documentation explicitly warns about this: "when trying to tune the num_leaves, we should let it be smaller than $$2^{\text{max\_depth}}$$." [11] The example given in the docs is that for `max_depth=7`, "setting num_leaves to 127 may cause over-fitting, and setting it to 70 or 80 may get better accuracy than depth-wise." [11]

CatBoost goes the other way. By forcing all nodes at a given depth to share the same split, oblivious trees lose flexibility per tree but gain speed: leaf indices can be looked up with a sequence of bit operations, evaluation vectorizes well on CPUs and GPUs, and the symmetric structure acts as a strong regularizer. [7]
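
To illustrate the bit-operation lookup, here is a minimal sketch of inference in a single oblivious tree. The function name, the split list, and the bit ordering (first level as the most significant bit) are assumptions made for the example; CatBoost's internal layout may differ.

```python
def oblivious_leaf_index(x, splits):
    """Compute the leaf index of example x in an oblivious tree.

    splits holds one (feature_index, threshold) pair per depth level,
    shared by every node at that level. Each comparison yields one bit.
    """
    index = 0
    for feature, threshold in splits:
        bit = 1 if x[feature] > threshold else 0
        index = (index << 1) | bit
    return index

# Depth 3, so the tree has 2**3 = 8 leaves (values are made up).
splits = [(0, 0.5), (2, 10.0), (1, -1.0)]
leaf_values = [0.1 * i for i in range(2 ** len(splits))]

x = [0.7, -3.0, 12.0]
idx = oblivious_leaf_index(x, splits)
print(idx, leaf_values[idx])
```

Because every example runs the same fixed sequence of comparisons with no data-dependent branching on tree structure, this loop is easy to vectorize across a batch.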

## Which hyperparameters control leaves?

Most regularization in tree learners is leaf-shaped: you constrain how many leaves you may have, how small they may be, and how informative each split must be. The table below summarizes the parameters that show up most often, with defaults verified against each library's current documentation. [11][12][13]

| Parameter | Library | What it limits | Typical default |
|-----------|---------|----------------|-----------------|
| `num_leaves` | LightGBM | Maximum leaves per tree (the main complexity knob for leaf-wise growth) | 31 |
| `max_leaves` / `max_leaf_nodes` | XGBoost (`max_leaves`), scikit-learn (`max_leaf_nodes`) | Cap on total leaves per tree | 0 (unlimited) in XGBoost, None in scikit-learn |
| `max_depth` | All libraries | Caps depth, which indirectly caps leaves | 6 (XGBoost), -1 in LightGBM (no limit), None in scikit-learn |
| `min_samples_leaf` | scikit-learn | Minimum training samples that must end up in any leaf | 1 |
| `min_data_in_leaf` / `min_child_samples` | LightGBM | Minimum training samples per leaf | 20 |
| `min_child_weight` | XGBoost | Minimum sum of Hessians per leaf (for squared loss this is just the sample count) | 1 |
| `min_sum_hessian_in_leaf` | LightGBM | Hessian-sum analog of `min_child_weight` | 1e-3 |
| `min_impurity_decrease` | scikit-learn | Required reduction in impurity for a split to be accepted | 0.0 |
| `gamma` (`min_split_loss`) | XGBoost | Minimum loss reduction (gain after regularization) required to make a split | 0 |
| `reg_lambda` / `lambda` | XGBoost, LightGBM | L2 penalty on leaf weights (shrinks `w_j`) | 1.0 (XGBoost), 0 (LightGBM) |
| `reg_alpha` / `alpha` | XGBoost, LightGBM | L1 penalty on leaf weights | 0 |
| `ccp_alpha` | scikit-learn | Cost-complexity pruning strength applied after fitting | 0.0 |

These knobs all manage the same trade-off: with too many small leaves the tree memorizes the training data, and with too few leaves it underfits. The right setting is almost always discovered by cross-validation, by an internal early-stopping loop on a validation set, or by a parameter-search tool like Optuna. The defaults are conservative on purpose. For LightGBM in particular, the docs note that for `min_data_in_leaf`, "setting it to hundreds or thousands is enough for a large dataset." [11]
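
As a rough sketch of how these knobs look in practice, the dictionaries below use parameter names from the table. The values are illustrative assumptions rather than recommendations, and `respects_leaf_budget` is a hypothetical helper that checks LightGBM's `num_leaves` guidance.

```python
# Illustrative values only; tune them with cross-validation on real data.
lightgbm_params = {
    "num_leaves": 63,         # main capacity knob for leaf-wise growth
    "max_depth": 7,           # a depth-7 binary tree has at most 2**7 = 128 leaves
    "min_data_in_leaf": 100,  # larger leaves for a larger dataset
    "reg_lambda": 1.0,        # L2 penalty on leaf weights
}

xgboost_params = {
    "max_depth": 6,
    "min_child_weight": 5,    # minimum Hessian sum per leaf
    "gamma": 0.1,             # minimum loss reduction to accept a split
    "reg_lambda": 1.0,
}

sklearn_tree_params = {
    "max_leaf_nodes": 32,
    "min_samples_leaf": 10,
    "min_impurity_decrease": 1e-4,
    "ccp_alpha": 0.0,         # no post-pruning in this sketch
}

def respects_leaf_budget(params):
    """True when num_leaves is below the 2**max_depth ceiling."""
    return params["num_leaves"] < 2 ** params["max_depth"]

print(respects_leaf_budget(lightgbm_params))
```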

## Why does the number of leaves matter?

Leaves are the unit of capacity in a tree. Add more leaves and you can fit a more complex function; add too many and you fit the noise. The standard bias-variance picture applies. A tree with one leaf is a constant predictor; a tree with one leaf per training example perfectly memorizes the training set and generalizes terribly. The interesting region is in between, and it is where almost all hyperparameter tuning happens. [1]

A few diagnostics help in practice:

- Plot training and validation loss against `max_depth` or `num_leaves`. The classic U-shaped validation curve tells you when you have crossed into [overfitting](/wiki/overfitting).
- Watch the average leaf size. If many leaves contain only one or two training examples, predictions in those leaves are essentially memorized labels.
- Look at split gains. If the marginal gain of adding another leaf is close to zero, raising `min_split_loss` or `min_impurity_decrease` will trim the tree.
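
The leaf-size diagnostic can be sketched in a few lines, given one leaf index per training row (for example, the output of an `apply`-style call). The function name `leaf_size_report`, the threshold, and the sample indices are assumptions made for illustration.

```python
from collections import Counter

def leaf_size_report(leaf_ids, tiny_threshold=2):
    """Summarize how many training rows fall into each leaf."""
    sizes = Counter(leaf_ids)
    n_leaves = len(sizes)
    tiny = sorted(leaf for leaf, n in sizes.items() if n <= tiny_threshold)
    return {
        "n_leaves": n_leaves,
        "mean_leaf_size": len(leaf_ids) / n_leaves,
        "tiny_leaves": tiny,  # leaves with little statistical support
    }

# Made-up leaf assignments for ten training rows.
leaf_ids = [3, 3, 3, 4, 6, 6, 6, 6, 7, 7]
report = leaf_size_report(leaf_ids)
print(report)
```

A long `tiny_leaves` list is a hint to raise the minimum leaf size or lower the leaf budget.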

For boosted ensembles, the situation is slightly different because you also get to control the number of trees and the learning rate. Many practitioners keep individual trees fairly shallow (depth 4 to 8, or `num_leaves` around 31 to 255) and rely on adding more trees with a small learning rate, which empirically generalizes better than building one or two enormous trees. [3]

## What is leaf pruning?

Pruning is the act of throwing away leaves (or whole subtrees) after the tree has been grown. The two flavors mirror the two ways you might decide to throw something away.

*Pre-pruning* (early stopping) refuses to create leaves that would be too small or too uninformative in the first place. Constraints like `min_samples_leaf`, `min_data_in_leaf`, `min_child_weight`, and `min_impurity_decrease` are pre-pruning. They are cheap because the tree never grows unnecessary structure, but they suffer from the *horizon effect*: a split that looks bad on its own might unlock excellent splits one level down, and pre-pruning never gives that chance.

*Post-pruning* grows the tree to its full size first and then collapses subtrees that hurt validation performance. The canonical method is *minimal cost-complexity pruning*, introduced by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone in their 1984 book *Classification and Regression Trees*. [1] For a subtree T,

$$
R_\alpha(T) = R(T) + \alpha \lvert T_{\text{leaves}} \rvert
$$

where $$R(T)$$ is the resubstitution error (misclassification or squared error), $$\lvert T_{\text{leaves}} \rvert$$ is the number of leaves, and $$\alpha$$ is a non-negative complexity parameter. [1] For each internal node, the algorithm computes the effective alpha at which collapsing the subtree rooted there into a single leaf costs no more than keeping it, prunes the node with the smallest such value (the "weakest link"), and repeats. This produces a nested sequence of subtrees indexed by alpha, and the final alpha is chosen by cross-validation. In scikit-learn, this is exposed as the `ccp_alpha` parameter on both `DecisionTreeClassifier` and `DecisionTreeRegressor`. [13]
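
A small numeric sketch of the weakest-link step follows. For an internal node $$t$$ with subtree $$T_t$$, the effective alpha is $$(R(t) - R(T_t)) / (\lvert T_t \rvert - 1)$$, the value at which the collapsed node and the full subtree have equal cost. The function names and error values are made up for illustration.

```python
def cost_complexity(resub_error, n_leaves, alpha):
    """R_alpha(T) = R(T) + alpha * (number of leaves)."""
    return resub_error + alpha * n_leaves

def effective_alpha(node_error, subtree_error, subtree_leaves):
    """Alpha at which collapsing the subtree to one leaf breaks even."""
    return (node_error - subtree_error) / (subtree_leaves - 1)

# Hypothetical internal nodes: (error if collapsed, subtree error, subtree leaves)
candidates = {
    "A": (0.10, 0.04, 4),
    "B": (0.08, 0.07, 2),
}

alphas = {name: effective_alpha(*stats) for name, stats in candidates.items()}
weakest = min(alphas, key=alphas.get)
print(alphas, "-> prune", weakest, "first")

# At its effective alpha, a collapsed node costs the same as its subtree.
node_err, sub_err, leaves = candidates[weakest]
a = alphas[weakest]
print(cost_complexity(node_err, 1, a), cost_complexity(sub_err, leaves, a))
```

Node B is pruned first because its subtree buys the least error reduction per extra leaf.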

Reduced error pruning is a simpler alternative used in some C4.5-style implementations: hold out a validation set, walk up from the leaves, and replace any subtree with a single leaf if doing so does not hurt validation accuracy. C4.5 itself uses error-based pruning, which estimates pessimistic confidence intervals on the leaf error and prunes when the bound improves. [2]

## Can leaf indices be used as features?

A neat idea from Xinran He and colleagues at Facebook ("Practical Lessons from Predicting Clicks on Ads at Facebook," ADKDD 2014) is to use the *index* of the leaf an example lands in as a categorical feature. [4] In the authors' words, "we treat each individual tree as a categorical feature that takes as value the index of the leaf an instance ends up falling in." [4] Train a [gradient boosted](/wiki/gradient_boosting) tree ensemble, then for every example record which leaf it ended up in for each tree. One-hot encode those leaf indices, concatenate the resulting sparse vector across all trees, and feed it into a logistic regression. The boosted trees handle nonlinear feature interactions; the linear model on top is cheap to retrain and easy to deploy. Facebook reported that this hybrid improved normalized cross-entropy by about 3.4 percent over either model alone, which the paper notes is large given that most feature-engineering experiments move the metric by only a fraction of a percent. [4] At click-through-rate scale that 3.4 percent translates to a substantial revenue lift, and the approach popularized leaf-embedding tricks in industrial recommender systems.
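
The encoding step can be sketched without any library. The function name `one_hot_leaves` and the toy leaf assignments are assumptions; a production pipeline would use a sparse matrix rather than dense lists.

```python
def one_hot_leaves(leaf_indices, leaves_per_tree):
    """One-hot encode per-tree leaf indices and concatenate across trees.

    leaf_indices[i][t] is the leaf reached by example i in tree t,
    numbered from 0 within that tree.
    """
    # Column offset where each tree's block of leaf columns starts.
    offsets = [0]
    for n in leaves_per_tree[:-1]:
        offsets.append(offsets[-1] + n)
    width = sum(leaves_per_tree)

    rows = []
    for example in leaf_indices:
        row = [0] * width
        for tree, leaf in enumerate(example):
            row[offsets[tree] + leaf] = 1
        rows.append(row)
    return rows

# Two trees with 3 and 2 leaves; two examples (made-up assignments).
features = one_hot_leaves([[0, 1], [2, 0]], leaves_per_tree=[3, 2])
print(features)
```

Each row has exactly one active column per tree, and those rows are what the downstream logistic regression would be trained on.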

Beyond CTR, leaf indices are sometimes used as a learned discretization of the input space, as inputs to nearest-neighbor methods ("find me other examples that share many leaf assignments"), or as the categorical interface in deep tabular methods that try to combine the inductive bias of trees with the gradient flow of neural nets.

## How do real libraries expose leaves?

| Library | What is stored at a leaf | How to access |
|---------|--------------------------|---------------|
| [scikit-learn](/wiki/scikit-learn) `DecisionTreeClassifier` | Per-class counts (proportions in newer releases) and predicted class | `tree_.value`; the estimator's `apply(X)` returns leaf indices |
| scikit-learn `DecisionTreeRegressor` | Mean target and sample count | `tree_.value`, `tree_.n_node_samples`; the estimator's `apply(X)` returns leaf indices |
| [XGBoost](/wiki/xgboost) | Real-valued leaf weight per tree | `Booster.predict(..., pred_leaf=True)` returns leaf indices; `get_dump()` returns text dumps that include the leaf weights, and `dump_model()` writes them to a file |
| [LightGBM](/wiki/lightgbm) | Real-valued leaf weight per tree | `Booster.predict(..., pred_leaf=True)` returns leaf indices |
| CatBoost | Real-valued leaf weight per tree (oblivious) | `model.calc_leaf_indexes()` |
| Spark MLlib | Class label or mean | `transform()` produces predictions; intermediate leaf access requires UDFs |

For visualization, scikit-learn's `sklearn.tree.plot_tree` draws the tree with the leaves at the bottom of the diagram, and `sklearn.tree.export_text` prints an indented text dump in which each leaf is a line carrying a `class:` or `value:` entry. [13] In the `plot_tree` diagram, each leaf box shows the predicted value, the impurity at the leaf, and the number of samples that reach it.

## Why do leaves still matter in 2026?

Gradient-boosted decision trees remain the dominant tool for tabular data. The 2022 Grinsztajn, Oyallon, and Varoquaux study ("Why do tree-based models still outperform deep learning on typical tabular data?", NeurIPS Datasets and Benchmarks track) benchmarked tree-based models against modern deep nets across 45 datasets and a large grid of hyperparameters, and found that XGBoost and other gradient-boosted-tree libraries remained state of the art on medium-sized tabular data (around 10,000 samples), even before accounting for their much faster training. [9] Leaves are the unit doing the work. Production GBT models routinely have ensembles of 1,000 to 10,000 trees with 31 to 255 leaves each, which means anywhere from tens of thousands to a few million leaves in total (about 31,000 for 1,000 trees of 31 leaves, about 2.5 million for 10,000 trees of 255 leaves). The combinatorial number of distinct leaf-index tuples (one per tree) is enormous, and that is why ensembles can fit such complex functions even though each individual tree is shallow.

Leaf representations also show up in neural tabular methods. Neural Oblivious Decision Ensembles (NODE), TabNet, and related architectures borrow the leaf-and-condition vocabulary, replacing hard splits with soft attention so that gradients can flow back through what would otherwise be a discontinuous routing decision. [8] The leaves in those models hold learnable embeddings rather than scalars, but the structural intuition is the same: route an example through a sequence of conditions and emit whatever the matching leaf stores.

## What are common pitfalls with leaves?

A few things to watch for when working with leaves:

- Predictions from a single tree are piecewise constant. A regression tree with eight leaves can only output eight distinct values. If a stairstep prediction surface is unacceptable, you need an ensemble (which can output many more distinct sums) or a different model class.
- Trees cannot extrapolate. Every leaf prediction is a summary of training data inside the leaf's region; outside the training distribution, the tree just keeps returning the value of the boundary leaf.
- Tiny leaves are a red flag. A leaf with one or two training samples has essentially no statistical support, and its prediction is dominated by noise. Use `min_samples_leaf`, `min_data_in_leaf`, or `min_child_weight` to prevent them.
- For boosted trees, do not interpret a single leaf weight as a probability or a predicted target. It is one term in a sum, and for classification that sum is a raw score (log-odds for binary log loss) that becomes a probability only after the inverse link is applied. Use SHAP or partial dependence to interpret the ensemble as a whole.
- Imbalanced classes can produce leaves that always predict the majority class, even after splitting. Class weights or `is_unbalance` (LightGBM) and `scale_pos_weight` (XGBoost) parameters reweight the loss so leaves can move toward the minority class. [11][12]
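
The first two pitfalls are easy to see with a hand-built tree. The sketch below uses a made-up three-leaf regression tree on a single feature, stored as nested dictionaries; the thresholds and leaf values are illustrative assumptions.

```python
# Internal nodes hold a threshold and two children; leaves hold a value.
TREE = {
    "threshold": 5.0,
    "left": {
        "threshold": 2.0,
        "left": {"value": 1.0},
        "right": {"value": 3.5},
    },
    "right": {"value": 8.0},
}

def predict(node, x):
    """Route x from the root to a leaf and return the stored value."""
    while "value" not in node:
        node = node["left"] if x <= node["threshold"] else node["right"]
    return node["value"]

# Inputs span far beyond the thresholds on both sides.
xs = [-100.0, 0.0, 2.0, 2.1, 5.0, 5.1, 9.0, 1e6]
preds = [predict(TREE, x) for x in xs]
print(preds)
print(sorted(set(preds)))  # only as many distinct outputs as there are leaves
```

However extreme the input, the output is always one of the three leaf values, and inputs far outside the thresholds simply receive the value of the outermost leaf.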

## Explain like I'm 5

Imagine you are sorting toys by asking yes-or-no questions. "Is it red?" If yes, walk to the red bucket. "Is it bigger than a shoebox?" If yes, walk to the big-red bucket. After a few questions you arrive at one specific bucket, and that bucket already has a guess written on it: "This is probably a fire truck." The bucket is the leaf. The questions are the splits. The whole point of building the tree was to figure out the right questions and to write a good guess on each bucket.

## References

1. Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). *Classification and Regression Trees*. Wadsworth.
2. Quinlan, J. R. (1993). *C4.5: Programs for Machine Learning*. Morgan Kaufmann.
3. Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." *The Annals of Statistics*, 29(5), 1189-1232.
4. He, X., Pan, J., Jin, O., Xu, T., Liu, B., Xu, T., Shi, Y., Atallah, A., Herbrich, R., Bowers, S., & Candela, J. Q. (2014). "Practical Lessons from Predicting Clicks on Ads at Facebook." *Proceedings of the Eighth International Workshop on Data Mining for Online Advertising (ADKDD)*.
5. Chen, T., & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 785-794.
6. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." *Advances in Neural Information Processing Systems*, 30.
7. Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., & Gulin, A. (2018). "CatBoost: Unbiased Boosting with Categorical Features." *Advances in Neural Information Processing Systems*, 31.
8. Popov, S., Morozov, S., & Babenko, A. (2019). "Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data." *International Conference on Learning Representations*.
9. Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). "Why do tree-based models still outperform deep learning on typical tabular data?" *Advances in Neural Information Processing Systems 35 (Datasets and Benchmarks Track)*.
10. Google Developers, "Machine Learning Glossary: Decision Forests." https://developers.google.com/machine-learning/glossary/df
11. LightGBM Documentation, "Parameters Tuning." https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html
12. XGBoost Documentation, "XGBoost Parameters." https://xgboost.readthedocs.io/en/stable/parameter.html
13. scikit-learn Documentation, "Decision Trees" and "Post pruning decision trees with cost complexity pruning." https://scikit-learn.org/stable/modules/tree.html