Node (decision tree)
Last reviewed
May 11, 2026
Sources
12 citations
Review status
Source-backed
Revision
v2 ยท 2,287 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 11, 2026
Sources
12 citations
Review status
Source-backed
Revision
v2 ยท 2,287 words
Add missing citations, update stale details, or suggest a clearer explanation.
See also: Machine learning terms
In machine learning, a node is a single point inside a decision tree where the model either tests an attribute of the input or outputs a prediction. Google's Machine Learning Glossary states the rule cleanly: every non-leaf node contains a condition, and every leaf node contains a prediction. The whole tree is just a hierarchy of those two kinds of nodes wired together by branches, and inference works by routing one example from the root at the top down to exactly one leaf at the bottom.
Nodes are the structural unit of the tree. They are what the learning algorithm creates while growing the tree, what pruning trims away, and what the inference path walks. Everything a decision tree "knows" about a problem is stored across its nodes, either as conditions (for splits) or as values (for predictions).
Decision tree nodes are usually grouped into three roles:
| Role | Other names | Has incoming edge? | Has outgoing edges? | Stores |
|---|---|---|---|---|
| Root node | Top node | No | Yes (typically two) | A condition |
| Internal node | Decision node, split node, test node | Yes | Yes (typically two) | A condition |
| Leaf node | Terminal node, end node | Yes | No | A prediction (class label, probability vector, or numeric value) |
The distinction between an internal node and the root is structural rather than semantic: the root is just the unique internal node with no parent. Some references therefore treat "internal node" as a superset that includes the root.
The root node is the topmost node and the entry point for inference. It has no incoming edge and represents the first split applied to the full training set. From the root, branches lead either to other internal nodes that apply more refined conditions, or directly to leaves if the tree is shallow. In a stump (a tree of depth one), the root is the only internal node, and its two children are both leaves.
An internal node is any non-leaf node. It holds a condition, sometimes called a split or a test, that compares one or more features of an input to a threshold or a set membership rule. Each internal node has one incoming edge from its parent and two or more outgoing edges to its children. Most popular implementations, including CART, use strict binary trees, so internal nodes have exactly two children, one for the True branch and one for the False branch.
Conditions inside internal nodes come in two broad flavors:
| Condition type | Form | Features used | Notes |
|---|---|---|---|
| Axis-aligned (univariate) | x_i >= t or x_i in S | One | Splits are hyperplanes parallel to a feature axis. Used by almost all mainstream implementations because they are fast to learn and easy to interpret. |
| Oblique (multivariate) | w_1 x_1 + w_2 x_2 + ... + w_k x_k >= t | Several | Hyperplane can tilt freely. Produces smoother boundaries but is more expensive to train and harder to read. |
Because axis-aligned splits only care about the order of feature values, decision trees do not require feature scaling or normalization. This is one practical reason they remain popular in tabular settings: you can mix integers, floats, and one-hot indicators without preprocessing.
A leaf node has no outgoing edges. It is where the inference path stops and a prediction is returned. The form of that prediction depends on the task:
In the scikit-learn implementation, leaves and internal nodes share the same array layout. A node is identified as a leaf when its left child id equals its right child id, both set to -1, indicating that there is no further split. The value array at that node holds the prediction.
Decision trees are built greedily, from the root down, in a procedure often called top-down induction of decision trees (TDIDT). At each node, the learner scans candidate splits and picks the one that improves a chosen impurity or loss criterion the most. The classic criteria are:
| Criterion | Used by | Intuition |
|---|---|---|
| Information gain (entropy reduction) | ID3, C4.5, C5.0 | How much does the split reduce Shannon entropy of the label distribution? |
| Gini impurity | CART | Probability that a randomly labeled sample at the node is misclassified. |
| Gain ratio | C4.5 | Information gain normalized by the split's own entropy, to avoid favoring features with many categories. |
| Variance reduction | CART (regression) | Reduction in target variance after splitting. |
| Chi-squared | CHAID | Statistical test of independence between feature and target. |
For a numeric feature, the standard trick for finding the best threshold is to sort the feature values and test thresholds halfway between consecutive sorted values. This makes the per-feature split search run in O(n log n) time. Across m features and n samples, the full training of a single decision tree typically runs in O(m n log^2 n) time.
The tree keeps growing recursively. At each child node the learner repeats the same search on the subset of samples that reached that child. Growth stops when one of the stopping conditions fires, such as a maximum depth, a minimum number of samples per node, a minimum impurity decrease, or perfect purity. When growth stops at a node, that node becomes a leaf and a prediction is computed and stored.
Several well known algorithms differ mainly in how they choose conditions, how they handle different data types, and whether they prune.
| Algorithm | Author | Splitting criterion | Notable features |
|---|---|---|---|
| CART (Classification and Regression Trees) | Breiman, Friedman, Olshen, and Stone (1984) | Gini impurity (classification) or variance (regression) | Binary splits only; supports regression; uses cost-complexity post-pruning. |
| ID3 (Iterative Dichotomiser 3) | J. Ross Quinlan (1986) | Information gain | Categorical features only; no pruning in the original version. |
| C4.5 | Quinlan (1993) | Gain ratio | Handles numeric and categorical features, missing values, and post-pruning. |
| C5.0 | Quinlan (commercial) | Gain ratio with boosting | Faster, more memory efficient, supports boosting and rule extraction. |
| CHAID | Kass (1980) | Chi-squared | Multiway splits; commonly used for survey and marketing data. |
| Conditional inference trees | Hothorn, Hornik, and Zeileis (2006) | Permutation tests | Avoids selection bias toward features with many splits; usually does not need pruning. |
These algorithms produce different trees on the same data, but they all manipulate the same three node roles. The choice of algorithm mostly changes what a condition can look like and how aggressively the tree is grown.
A fully grown tree can keep splitting until each leaf holds a single training sample. The result fits the training set perfectly but typically suffers from overfitting. Pruning is the process of removing nodes or whole subtrees to produce a smaller, more general model. There are two timing strategies.
Pre-pruning halts growth during construction. Common stop rules include:
Pre-pruning is computationally cheap because the tree never grows beyond its final size. The main drawback is the horizon effect: a split that looks weak on its own may unlock useful splits one or two levels deeper, and pre-pruning will never discover them because it stops too early.
Post-pruning lets the tree grow fully and then trims it back. Several specific procedures are in use:
| Technique | Idea |
|---|---|
| Reduced-error pruning (REP) | Walk the tree bottom-up. Replace each subtree with a leaf labeled by the majority class on a separate validation set. Keep the replacement if it does not hurt validation accuracy. |
| Cost-complexity pruning (CCP) | Introduce a penalty alpha for tree size. Find the subtree that minimizes training error plus alpha times the number of leaves. Sweeping alpha gives a nested sequence of trees, and cross-validation picks the best. This is the post-pruning method used by CART. |
| Pessimistic error pruning (PEP) | Estimate generalization error from training error with a continuity correction, then prune subtrees whose pessimistic error does not decrease after collapsing. Used by C4.5. |
| Minimum description length (MDL) pruning | Choose the tree that minimizes the combined cost of encoding the tree plus encoding the residual errors. |
| Minimum error pruning (MEP) | Estimate expected error rates with a Bayesian correction and prune to minimize them. |
In scikit-learn, cost-complexity pruning is exposed through the ccp_alpha parameter on DecisionTreeClassifier and DecisionTreeRegressor: larger values prune more aggressively, and the cost_complexity_pruning_path method returns the sequence of effective alphas.
Most decision tree libraries store nodes as parallel arrays so that traversal is cache friendly. The scikit-learn Tree object is a good reference. For every node id i, it tracks:
| Attribute | Meaning |
|---|---|
children_left[i] | Id of the left child, or -1 if the node is a leaf. |
children_right[i] | Id of the right child, or -1 if the node is a leaf. |
feature[i] | Index of the feature used by the split (meaningless for leaves). |
threshold[i] | Threshold value used by the split. |
n_node_samples[i] | Number of training samples that reached the node. |
weighted_n_node_samples[i] | Same count, but using sample weights. |
impurity[i] | Impurity (Gini or entropy) at the node. |
value[i, j, k] | Per-class proportions or regression target at the node. |
A node is a leaf when its left and right children are both -1. Walking the tree from the root and applying the condition at each internal node lets you reconstruct the path that an example takes and explain the prediction.
Suppose a small CART model predicts whether a loan will default using two features, income and credit_score. The tree might look like this:
Root: income < 30000?
True -> Internal: credit_score < 600?
True -> Leaf: predict "default"
False -> Leaf: predict "no default"
False -> Internal: credit_score < 720?
True -> Leaf: predict "no default"
False -> Leaf: predict "no default"
For an applicant with income = 45000 and credit_score = 690, inference visits three nodes: the root (False branch), the internal node testing credit score (True branch), and a leaf. The leaf's value, "no default", is the prediction. Three nodes were touched, two of them internal and one of them a leaf.
This is the entire computational story of a decision tree: walk from the root, apply the condition stored at each internal node, take the matching branch, and stop when you hit a leaf.
The node level is also where most engineering knobs live. Tuning a decision tree mostly means tuning what conditions are allowed at internal nodes and when leaves are created:
max_depth, min_samples_split, min_samples_leaf, and min_impurity_decrease are pre-pruning controls on internal nodes.max_leaf_nodes limits how many leaves the tree may end up with.criterion chooses how an internal node scores candidate splits.ccp_alpha controls post-pruning.max_features restricts which features an internal node may consider for its split, which is central to random forests.In ensemble methods such as random forests, gradient boosted trees, and extra trees, the same node concept carries over. Each tree in the ensemble is built from the same internal-node-and-leaf machinery, and the ensemble's prediction is just an aggregation over the leaves reached in every tree.
Think of playing 20 Questions to guess an animal. Each question you ask is a node. The first question is the root node. The follow-up questions are internal nodes. When you finally say, "It's a giraffe," that final answer is a leaf node. A decision tree does the same thing with data: it keeps asking yes-or-no questions about a row of numbers until it reaches a leaf, and the leaf gives the answer.