Node (decision tree)

Definition

In machine learning, a node is a single point inside a decision tree where the model either tests an attribute of the input or outputs a prediction. Google's Machine Learning Glossary states the rule cleanly: every non-leaf node contains a condition, and every leaf node contains a prediction. The whole tree is just a hierarchy of those two kinds of nodes wired together by branches, and inference works by routing one example from the root at the top down to exactly one leaf at the bottom.

Nodes are the structural unit of the tree. They are what the learning algorithm creates while growing the tree, what pruning trims away, and what the inference path walks. Everything a decision tree "knows" about a problem is stored across its nodes, either as conditions (for splits) or as values (for predictions).

Decision tree nodes are usually grouped into three roles:

Role	Other names	Has incoming edge?	Has outgoing edges?	Stores
Root node	Top node	No	Yes (typically two)	A condition
Internal node	Decision node, split node, test node	Yes	Yes (typically two)	A condition
Leaf node	Terminal node, end node	Yes	No	A prediction (class label, probability vector, or numeric value)

The distinction between an internal node and the root is structural rather than semantic: the root is just the unique internal node with no parent. Some references therefore treat "internal node" as a superset that includes the root.

Node types

Root node

The root node is the topmost node and the entry point for inference. It has no incoming edge and represents the first split applied to the full training set. From the root, branches lead either to other internal nodes that apply more refined conditions, or directly to leaves if the tree is shallow. In a stump (a tree of depth one), the root is the only internal node, and its two children are both leaves.

Internal node

An internal node is any non-leaf node. It holds a condition, sometimes called a split or a test, that compares one or more features of an input to a threshold or a set membership rule. Each internal node has one incoming edge from its parent and two or more outgoing edges to its children. Most popular implementations, including CART, use strict binary trees, so internal nodes have exactly two children, one for the True branch and one for the False branch.

Conditions inside internal nodes come in two broad flavors:

Condition type	Form	Features used	Notes
Axis-aligned (univariate)	`x_i >= t` or `x_i in S`	One	Splits are hyperplanes parallel to a feature axis. Used by almost all mainstream implementations because they are fast to learn and easy to interpret.
Oblique (multivariate)	`w_1 x_1 + w_2 x_2 + ... + w_k x_k >= t`	Several	Hyperplane can tilt freely. Produces smoother boundaries but is more expensive to train and harder to read.

Because axis-aligned splits only care about the order of feature values, decision trees do not require feature scaling or normalization. This is one practical reason they remain popular in tabular settings: you can mix integers, floats, and one-hot indicators without preprocessing.

Leaf node

A leaf node has no outgoing edges. It is where the inference path stops and a prediction is returned. The form of that prediction depends on the task:

Classification: the leaf stores either a single class label (the majority class among training samples that reached it) or a probability vector over classes.
Regression: the leaf stores a numeric value, typically the mean of the target variable over training samples that reached it.
Survival, quantile, or ranking trees: the leaf stores a small model or summary statistic appropriate to the task, such as a Kaplan-Meier curve or a quantile estimate.

In the scikit-learn implementation, leaves and internal nodes share the same array layout. A node is identified as a leaf when its left child id equals its right child id, both set to -1, indicating that there is no further split. The value array at that node holds the prediction.

How the tree decides what to put in each node

Decision trees are built greedily, from the root down, in a procedure often called top-down induction of decision trees (TDIDT). At each node, the learner scans candidate splits and picks the one that improves a chosen impurity or loss criterion the most. The classic criteria are:

Criterion	Used by	Intuition
Information gain (entropy reduction)	ID3, C4.5, C5.0	How much does the split reduce Shannon entropy of the label distribution?
Gini impurity	CART	Probability that a randomly labeled sample at the node is misclassified.
Gain ratio	C4.5	Information gain normalized by the split's own entropy, to avoid favoring features with many categories.
Variance reduction	CART (regression)	Reduction in target variance after splitting.
Chi-squared	CHAID	Statistical test of independence between feature and target.

For a numeric feature, the standard trick for finding the best threshold is to sort the feature values and test thresholds halfway between consecutive sorted values. This makes the per-feature split search run in O(n log n) time. Across m features and n samples, the full training of a single decision tree typically runs in O(m n log^2 n) time.

The tree keeps growing recursively. At each child node the learner repeats the same search on the subset of samples that reached that child. Growth stops when one of the stopping conditions fires, such as a maximum depth, a minimum number of samples per node, a minimum impurity decrease, or perfect purity. When growth stops at a node, that node becomes a leaf and a prediction is computed and stored.

Learning algorithms

Several well known algorithms differ mainly in how they choose conditions, how they handle different data types, and whether they prune.

Algorithm	Author	Splitting criterion	Notable features
CART (Classification and Regression Trees)	Breiman, Friedman, Olshen, and Stone (1984)	Gini impurity (classification) or variance (regression)	Binary splits only; supports regression; uses cost-complexity post-pruning.
ID3 (Iterative Dichotomiser 3)	J. Ross Quinlan (1986)	Information gain	Categorical features only; no pruning in the original version.
C4.5	Quinlan (1993)	Gain ratio	Handles numeric and categorical features, missing values, and post-pruning.
C5.0	Quinlan (commercial)	Gain ratio with boosting	Faster, more memory efficient, supports boosting and rule extraction.
CHAID	Kass (1980)	Chi-squared	Multiway splits; commonly used for survey and marketing data.
Conditional inference trees	Hothorn, Hornik, and Zeileis (2006)	Permutation tests	Avoids selection bias toward features with many splits; usually does not need pruning.

These algorithms produce different trees on the same data, but they all manipulate the same three node roles. The choice of algorithm mostly changes what a condition can look like and how aggressively the tree is grown.

Pruning: removing nodes to improve generalization

A fully grown tree can keep splitting until each leaf holds a single training sample. The result fits the training set perfectly but typically suffers from overfitting. Pruning is the process of removing nodes or whole subtrees to produce a smaller, more general model. There are two timing strategies.

Pre-pruning (early stopping)

Pre-pruning halts growth during construction. Common stop rules include:

maximum tree depth,
minimum number of samples required to split a node,
minimum number of samples in a leaf,
minimum impurity decrease threshold,
maximum number of leaves.

Pre-pruning is computationally cheap because the tree never grows beyond its final size. The main drawback is the horizon effect: a split that looks weak on its own may unlock useful splits one or two levels deeper, and pre-pruning will never discover them because it stops too early.

Post-pruning

Post-pruning lets the tree grow fully and then trims it back. Several specific procedures are in use:

Technique	Idea
Reduced-error pruning (REP)	Walk the tree bottom-up. Replace each subtree with a leaf labeled by the majority class on a separate validation set. Keep the replacement if it does not hurt validation accuracy.
Cost-complexity pruning (CCP)	Introduce a penalty alpha for tree size. Find the subtree that minimizes training error plus alpha times the number of leaves. Sweeping alpha gives a nested sequence of trees, and cross-validation picks the best. This is the post-pruning method used by CART.
Pessimistic error pruning (PEP)	Estimate generalization error from training error with a continuity correction, then prune subtrees whose pessimistic error does not decrease after collapsing. Used by C4.5.
Minimum description length (MDL) pruning	Choose the tree that minimizes the combined cost of encoding the tree plus encoding the residual errors.
Minimum error pruning (MEP)	Estimate expected error rates with a Bayesian correction and prune to minimize them.

In scikit-learn, cost-complexity pruning is exposed through the ccp_alpha parameter on DecisionTreeClassifier and DecisionTreeRegressor: larger values prune more aggressively, and the cost_complexity_pruning_path method returns the sequence of effective alphas.

Node attributes in practice

Most decision tree libraries store nodes as parallel arrays so that traversal is cache friendly. The scikit-learn Tree object is a good reference. For every node id i, it tracks:

Attribute	Meaning
`children_left[i]`	Id of the left child, or -1 if the node is a leaf.
`children_right[i]`	Id of the right child, or -1 if the node is a leaf.
`feature[i]`	Index of the feature used by the split (meaningless for leaves).
`threshold[i]`	Threshold value used by the split.
`n_node_samples[i]`	Number of training samples that reached the node.
`weighted_n_node_samples[i]`	Same count, but using sample weights.
`impurity[i]`	Impurity (Gini or entropy) at the node.
`value[i, j, k]`	Per-class proportions or regression target at the node.

A node is a leaf when its left and right children are both -1. Walking the tree from the root and applying the condition at each internal node lets you reconstruct the path that an example takes and explain the prediction.

Example: routing one sample through the nodes

Suppose a small CART model predicts whether a loan will default using two features, income and credit_score. The tree might look like this:

Root: income < 30000?
  True  -> Internal: credit_score < 600?
              True  -> Leaf: predict "default"
              False -> Leaf: predict "no default"
  False -> Internal: credit_score < 720?
              True  -> Leaf: predict "no default"
              False -> Leaf: predict "no default"

For an applicant with income = 45000 and credit_score = 690, inference visits three nodes: the root (False branch), the internal node testing credit score (True branch), and a leaf. The leaf's value, "no default", is the prediction. Three nodes were touched, two of them internal and one of them a leaf.

This is the entire computational story of a decision tree: walk from the root, apply the condition stored at each internal node, take the matching branch, and stop when you hit a leaf.

Why the node abstraction matters

The node level is also where most engineering knobs live. Tuning a decision tree mostly means tuning what conditions are allowed at internal nodes and when leaves are created:

max_depth, min_samples_split, min_samples_leaf, and min_impurity_decrease are pre-pruning controls on internal nodes.
max_leaf_nodes limits how many leaves the tree may end up with.
criterion chooses how an internal node scores candidate splits.
ccp_alpha controls post-pruning.
max_features restricts which features an internal node may consider for its split, which is central to random forests.

In ensemble methods such as random forests, gradient boosted trees, and extra trees, the same node concept carries over. Each tree in the ensemble is built from the same internal-node-and-leaf machinery, and the ensemble's prediction is just an aggregation over the leaves reached in every tree.

Explain like I'm 5 (ELI5)

Think of playing 20 Questions to guess an animal. Each question you ask is a node. The first question is the root node. The follow-up questions are internal nodes. When you finally say, "It's a giraffe," that final answer is a leaf node. A decision tree does the same thing with data: it keeps asking yes-or-no questions about a row of numbers until it reaches a leaf, and the leaf gives the answer.

References

Google for Developers, Machine Learning Glossary, entries for "node (decision tree)," "leaf," and "condition (decision tree)." https://developers.google.com/machine-learning/glossary
Google for Developers, Decision Forests course, "Decision trees" page. https://developers.google.com/machine-learning/decision-forests/decision-trees
Google for Developers, Decision Forests course, "Types of conditions" page. https://developers.google.com/machine-learning/decision-forests/conditions
Google for Developers, Decision Forests course, "Growing decision trees" page. https://developers.google.com/machine-learning/decision-forests/growing
Wikipedia, "Decision tree learning." https://en.wikipedia.org/wiki/Decision_tree_learning
Wikipedia, "Decision tree." https://en.wikipedia.org/wiki/Decision_tree
Wikipedia, "Decision tree pruning." https://en.wikipedia.org/wiki/Decision_tree_pruning
scikit-learn documentation, "Understanding the decision tree structure." https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html
scikit-learn documentation, "Post pruning decision trees with cost complexity pruning." https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). *Classification and Regression Trees*. Wadsworth.
Quinlan, J. R. (1986). "Induction of Decision Trees." *Machine Learning*, 1(1), 81 to 106.
Quinlan, J. R. (1993). *C4.5: Programs for Machine Learning*. Morgan Kaufmann.

Definition

Node types

Root node

Internal node

Leaf node

How the tree decides what to put in each node

Learning algorithms

Pruning: removing nodes to improve generalization

Pre-pruning (early stopping)

Post-pruning

Node attributes in practice

Example: routing one sample through the nodes

Why the node abstraction matters

Explain like I'm 5 (ELI5)

References

Improve this article

Related Articles

Machine learning terms/Natural Language Processing

Machine learning terms/Computer Vision

Machine learning terms/Sequence Models

Split

Static

Agglomerative clustering

Definition

Node types

Root node

Internal node

Leaf node

How the tree decides what to put in each node

Learning algorithms

Pruning: removing nodes to improve generalization

Pre-pruning (early stopping)

Post-pruning

Node attributes in practice

Example: routing one sample through the nodes

Why the node abstraction matters

Explain like I'm 5 (ELI5)

References

Related Articles

Machine learning terms/Natural Language Processing

Machine learning terms/Computer Vision

Machine learning terms/Sequence Models

Split

Static

Agglomerative clustering