Root
Last reviewed
May 11, 2026
Sources
10 citations
Review status
Source-backed
Revision
v2 · 2,196 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 11, 2026
Sources
10 citations
Review status
Source-backed
Revision
v2 · 2,196 words
Add missing citations, update stale details, or suggest a clearer explanation.
See also: Machine learning terms
The word "root" shows up in machine learning in two different settings. The first is structural: it names the starting node of a decision tree, the single point at the top of the tree from which every other split descends. The second is arithmetic: it appears inside the name of the root mean square error (RMSE), where it refers to the square root that turns a mean squared error back into the original units of the target variable. The two meanings share only the English word. This article focuses on the first sense, the root node, with a short section at the end on RMSE.
In a decision tree, the root is the starting point where the first split is made. A decision tree is a supervised learning algorithm that recursively partitions the feature space into regions for classification or regression. The root node is the topmost node and represents the entire dataset before any splitting has occurred. Every training sample passes through the root, and the rule it encodes is the first question the tree asks about an input.
The Wikipedia article on decision tree learning describes the structure compactly: a tree is built by splitting the source set, which forms the root node, into subsets that constitute the successor children. The same procedure runs recursively on each child until the subsets are pure or no useful split remains. The root is the geometric apex of the tree and the first decision boundary the algorithm imposes on the training data. Most diagrams place it at the top and grow downward, the opposite of how botanical trees grow.
Before any split runs, the root sees all of the training rows and all of the candidate features. Its job is to pick one feature and one threshold (for numeric features) or one set partition (for categorical features) that best separates samples by their target value. After the split, the data is divided into a left child and a right child, and each child becomes a new sub problem with its own selection step.
Three properties distinguish the root from every other node. It is unique, since there is exactly one root per tree. It contains the full training set. And it owns the first decision, so whichever feature and threshold the algorithm picks becomes the very first question asked at inference time.
Decision tree construction begins with selecting the best feature to split the dataset at the root node. The choice is based on a numerical metric that quantifies how cleanly a candidate split separates the training samples by their target value. For classification the two standard metrics are Gini impurity and information gain, which measures the reduction in Shannon entropy. For regression the standard metric is the reduction in variance, typically expressed as mean squared error. The chosen feature and the corresponding threshold value form the basis for the root node's decision rule.
Gini impurity is defined as 1 minus the sum of squared class probabilities at a node. A node with only one class has a Gini impurity of zero. The split that minimizes the weighted Gini impurity of the resulting children is the one the tree selects. Information gain works similarly but uses Shannon entropy, and the chosen split is the one that produces the largest drop in entropy between the parent and the weighted average of the children. Both metrics tend to agree on which split is best in most real datasets, and Gini is cheaper to compute because it avoids the logarithm.
If a node contains samples from K classes with proportions p_1 through p_K, Gini impurity equals 1 minus the sum of p_i squared, and Shannon entropy equals the negative sum of p_i times log base 2 of p_i. Information gain equals parent entropy minus the weighted entropy of the two children.
Suppose the root has 7 samples, 4 from class A and 3 from class B. Parent entropy is about 0.985 bits. Consider a split that produces a left child with 3 samples (2 A, 1 B) and a right child with 4 samples (2 A, 2 B). Left child entropy is roughly 0.918, right child entropy is exactly 1.0, and the weighted average is about 0.965. Information gain is 0.985 minus 0.965, or 0.020 bits. The algorithm computes this for every candidate split on every feature and picks the one with the highest gain. That split becomes the root's decision rule.
The root node plays an important role in the model's overall behavior. A well chosen root can lead to a more accurate and efficient tree structure, because the first split usually separates the largest amount of class signal in one step. A poor root choice leaves most of the work to deeper nodes, which tends to produce a less accurate model with more complexity and a higher risk of overfitting. Feature selection and the choice of splitting criterion at the root are therefore essential aspects of decision tree construction.
The root also has an outsized effect on interpretation. Decision trees are popular partly because the path from root to leaf reads like a sequence of if then rules, and the first rule on every path is the root's rule. Practitioners often inspect the root first to see which feature the model considers most informative; if the root splits on a feature that does not match domain intuition, that is usually a signal to check the data for leakage or encoding issues before trusting the rest of the tree.
The standard algorithms for tree induction, including ID3, C4.5, and CART, are greedy. They pick the locally optimal split at the root, then move on to the children without ever revisiting the root choice. This greedy strategy is fast and almost always good enough, but it can miss combinations of features that would have produced a better tree under a more global search. Finding the globally optimal decision tree is NP hard, as proved by Hyafil and Rivest in 1976, so the greedy choice is the practical default.
The Python library scikit learn implements an optimized version of CART in its DecisionTreeClassifier and DecisionTreeRegressor classes. Two parameters most directly control how the root is chosen.
The criterion parameter sets the impurity function. For classification its options are gini, entropy, and log_loss, with gini as the default. For regression its options include squared_error, friedman_mse, absolute_error, and poisson, with squared_error as the default. The splitter parameter controls the search strategy. Its default best makes the algorithm run a greedy exhaustive search over every candidate feature and every threshold between sorted distinct feature values. Setting it to random samples a single random threshold per feature.
The max_features parameter restricts how many features the algorithm considers at each split, including the root. Its default of None means all features are eligible. Setting it to sqrt or log2 restricts the search to a random subset, which is the setting used inside a random forest. After training, the tree_ attribute on a fitted estimator exposes the tree structure, with the root at node index 0; its split feature and threshold are read from tree_.feature<sup><a href="#cite_note-0" class="cite-ref">[0]</a></sup> and tree_.threshold<sup><a href="#cite_note-0" class="cite-ref">[0]</a></sup>.
A single decision tree tends to overfit, and the root choice in particular is brittle. Small changes in the training data can change which feature scores highest at the root, which then changes the entire tree below it. Ensembles address this by training many trees on slightly different data or with slightly different rules and combining their predictions.
Bagging, short for bootstrap aggregating, trains each tree on a bootstrap sample of the original training set. Because the bootstrap samples differ, the root split selected on each sample can differ, and the resulting trees explore different first questions. The trees vote or average their predictions, and the variance of the combined prediction is lower than that of any individual tree.
Random forests, introduced by Leo Breiman in 2001, take this further. On top of bootstrap sampling they also restrict each split, including the root, to a random subset of candidate features. The Wikipedia article on random forests explains the motivation: if one or a few features are very strong predictors for the target, those features would be selected as the root in many of the trees, causing the trees to become correlated and the ensemble to lose its variance reduction. By forcing each tree to consider only a sample of features at each split, the random forest guarantees that many trees end up with different roots, even when one feature is dominant on the full data. This is the structural reason random forests typically outperform plain bagged trees.
Boosting algorithms such as gradient boosted trees use a different strategy. They build trees one at a time, with each new tree fit to the residual errors of the previous trees. The roots of trees deeper in the sequence often split on features that earlier trees did not exploit, because those features now carry the largest remaining signal.
| Criterion | Used for | Definition | Lowest value at |
|---|---|---|---|
| Gini impurity | Classification | 1 minus the sum of squared class proportions | Pure node (one class only) |
| Information gain | Classification | Parent entropy minus weighted child entropy | Maximally informative split |
| Mean squared error | Regression | Average squared distance from node mean | Node where target is constant |
| Mean absolute error | Regression | Average absolute distance from node median | Node where target is constant |
| Chi square | Classification | Sum of (observed minus expected) squared over expected | No association with target |
Gini and information gain rarely disagree on the best split in practice; the resulting accuracy difference is usually negligible. CART uses Gini by default; ID3 and C4.5 use entropy.
A separate use of "root" appears in the name of the root mean square error, a common performance metric for regression models. RMSE measures the typical size of the model's prediction errors in the same units as the target variable, which makes it easier to interpret than mean squared error on its own.
The RMSE is calculated in three steps.
The resulting value is the RMSE. It is a single number summarizing the average magnitude of the model's errors.
RMSE is widely used because it is interpretable and because squaring the errors before averaging makes it sensitive to large errors. A lower RMSE indicates a better fit of the model to the data, and a higher RMSE indicates a poorer fit. RMSE is not normalized by the scale of the target, so comparing RMSE values across models trained on different targets is misleading; a model predicting house prices in dollars will always have a larger RMSE than a model predicting whether a coin came up heads.
"Root" in machine learning can mean two things. The first is about a kind of model called a decision tree. A decision tree makes a prediction by asking a chain of yes or no questions until it reaches an answer. The root is the very first question. Imagine guessing an animal: the root is the first yes or no question you would ask, like "does it have feathers?" That first question shapes everything after, so picking a good first question matters. The computer picks it by scoring how well each possible question splits the training examples into clean groups; the question that splits them most neatly becomes the root.
The second meaning is about checking how good a model is. Root mean square error (RMSE) looks at how far off the model's guesses are from the true answers, squares each difference, averages them, and then takes the square root. A small RMSE means the model is usually close to the truth. A big RMSE means it is often pretty far off.