Condition
Last reviewed
May 11, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 2,194 words
See also: Machine learning terms
In machine learning, a condition is the test that sits at a non-leaf node of a decision tree. Each condition examines one or more features of an example, picks a path based on the result, and passes the example to a child node. Google's Machine Learning Glossary defines a condition simply as "any node that performs a test," and the same source notes that the words condition, split, and test refer to the same thing in decision tree terminology.
Conditions are the workhorses of decision forests, the family of models that includes random forests and gradient boosted trees. A trained tree is mostly a stack of conditions; the leaf nodes at the bottom just hold the prediction. During inference, an example follows what Google calls an inference path: it starts at the root condition, answers each test in turn, and ends at a leaf whose value is returned as the prediction.
The word "condition" also shows up loosely in other parts of machine learning, for example to describe rules in expert systems or stopping criteria in optimization. The strict definition used in this article is the modern one: a node in a decision tree that tests features and routes examples.
A decision tree has three kinds of nodes:
| Component | Role |
|---|---|
| Root | The starting node at the top of the tree. It is the first condition every example sees. |
| Internal node | Any non-root, non-leaf node. Each internal node holds a condition. |
| Leaf | A terminal node that holds a prediction (a class label for classification, a value for regression). |
In a decision tree diagram the root sits at the top and the leaves at the bottom. A condition lives at the root and at every internal node, so a binary tree with n leaves contains exactly n - 1 conditions. A condition does not store a prediction; it only decides which child an example goes to next.
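The relationship between leaves and conditions in a binary tree can be checked on a small example. The sketch below is illustrative only: the nested-tuple representation, the feature names, and the leaf labels are assumptions made for this example, not the storage format of any library.

```python
# A condition is a tuple (test_description, child_if_true, child_if_false).
# Anything that is not a tuple is treated as a leaf holding a prediction.
# The tree below is hypothetical and exists only to count nodes.
tree = (
    "num_legs >= 4",
    ("weight_kg >= 20", "dog", "cat"),
    ("has_feathers == True", "bird", "snake"),
)


def is_condition(node):
    return isinstance(node, tuple)


def count_nodes(node):
    """Return (number_of_conditions, number_of_leaves) in the subtree."""
    if not is_condition(node):
        return 0, 1
    _, yes_child, no_child = node
    yes_conditions, yes_leaves = count_nodes(yes_child)
    no_conditions, no_leaves = count_nodes(no_child)
    return 1 + yes_conditions + no_conditions, yes_leaves + no_leaves


conditions, leaves = count_nodes(tree)
print(conditions, leaves)  # 3 4
```

With four leaves the tree holds three conditions, one fewer than the number of leaves.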
Most conditions in practical tree libraries follow one of a few standard shapes. Google's Decision Forests course identifies five common forms:
| Form | Example | Notes |
|---|---|---|
| Threshold (numeric) | area >= 200 | Most common form. Tests a single numerical feature against a learned cutoff. |
| Equality (categorical) | species == "cat" | Tests whether a categorical feature equals a specific value. |
| In-set (categorical) | species in {"cat", "dog", "bird"} | Tests whether a categorical feature belongs to a learned subset. |
| Oblique (linear combination) | 0.4 * height + 0.6 * width >= 1.2 | Combines several features into a single linear test. |
| Missing-value test | feature_i is Missing | Routes examples that lack a value for a particular feature. |
The threshold form is the one most people picture when they hear "decision tree." It is also what scikit-learn's DecisionTreeClassifier produces by default and what Google's YDF library uses out of the box. The other forms matter when the data has many categories, when classes are not separable along single feature axes, or when missing values need their own path.
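Each of the five forms in the table can be read as a function that takes an example and returns true or false. The following sketch writes them as plain Python predicates; the example record, the feature names, and the use of `None` or NaN to mark a missing value are assumptions for illustration and do not mirror the internals of scikit-learn or YDF.

```python
import math

# Hypothetical example record; "age" is deliberately missing.
example = {"area": 250.0, "species": "cat", "height": 2.0, "width": 1.0, "age": None}


def threshold(ex):
    # Numeric threshold on a single feature.
    return ex["area"] >= 200


def equality(ex):
    # Categorical equality test.
    return ex["species"] == "cat"


def in_set(ex):
    # Categorical membership test.
    return ex["species"] in {"cat", "dog", "bird"}


def oblique(ex):
    # Linear combination of two features against a cutoff.
    return 0.4 * ex["height"] + 0.6 * ex["width"] >= 1.2


def is_missing(ex, feature="age"):
    # Missing-value test: absent key, None, or NaN all count as missing here.
    value = ex.get(feature)
    return value is None or (isinstance(value, float) and math.isnan(value))


results = {
    "threshold": threshold(example),
    "equality": equality(example),
    "in_set": in_set(example),
    "oblique": oblique(example),
    "is_missing": is_missing(example),
}
print(results)
```

In a trained tree the cutoff, the category subset, and the weights would be learned by the splitter rather than written by hand.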
Google's glossary draws a line between two families of conditions based on how many features they touch.
An axis-aligned condition involves exactly one feature. A test like num_legs >= 2 is axis-aligned because it only looks at one column. Plotted in feature space, these conditions become splits parallel to one of the coordinate axes, which is where the name comes from. Most production libraries train axis-aligned trees by default because they are fast to learn, fast to evaluate, and easy to interpret.
An oblique condition involves more than one feature. The classic example is something like height > width, but in practice oblique splits are usually learned as linear combinations of several features at once, for example 0.4 * x1 + 0.6 * x2 >= threshold. A widely cited algorithm for learning such splits is OC1, introduced by Murthy, Kasif, and Salzberg in 1994, which built on the linear-combination splits proposed earlier for CART; it remains a standard reference for multivariate decision trees.
Oblique trees can carve more flexible decision boundaries because their splits do not have to align with the feature axes. The trade-off is cost. Finding a good oblique condition is harder than scanning thresholds on a single feature, so training and inference are both slower. In Google's YDF library this mode is opt-in, enabled with split_axis="SPARSE_OBLIQUE".
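The extra flexibility is easy to see on data whose label is height > width. In the sketch below, which uses four made-up rows, no single axis-aligned threshold classifies every row correctly, while one oblique condition does. The brute-force search over thresholds is an illustrative stand-in, not how any library learns its splits.

```python
# Hypothetical (height, width, label) rows where label is height > width.
rows = [(1.0, 2.0, False), (2.0, 1.0, True), (3.0, 4.0, False), (4.0, 3.0, True)]


def correct_count(predict):
    """Number of rows for which predict(height, width) matches the label."""
    return sum(predict(h, w) == label for h, w, label in rows)


def best_axis_aligned():
    """Best score of any test `feature >= t` or its negation, on one feature."""
    best = 0
    for index in (0, 1):  # 0 = height, 1 = width
        for cutoff in sorted({row[index] for row in rows}):
            for negate in (False, True):
                def predict(h, w, i=index, t=cutoff, n=negate):
                    return ((h, w)[i] >= t) != n
                best = max(best, correct_count(predict))
    return best


def oblique(h, w):
    # One linear combination of both features: 1.0 * h - 1.0 * w > 0.
    return 1.0 * h - 1.0 * w > 0.0


axis_best = best_axis_aligned()
oblique_correct = correct_count(oblique)
print(axis_best, oblique_correct)  # 3 4
```

An axis-aligned tree can still fit these rows, but only by stacking several conditions where the oblique tree needs one.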
A second axis of classification is the number of outgoing branches.
A binary condition has exactly two possible outcomes, typically yes and no. Decision trees that only contain binary conditions are called binary decision trees. Most modern libraries train binary trees because they are simpler to implement, less prone to overfitting, and easy to combine with techniques like surrogate splits for missing values.
A non-binary condition has more than two outcomes. A node that tests a categorical feature with three possible values could in principle branch three ways. CHAID, for example, learns multi-way splits for classification. Non-binary conditions are more expressive but also more prone to overfitting, since each extra branch leaves a smaller subset of the training data in each child.
The two families are not as different as they look. As Google's Decision Forests course points out, "a non-binary condition can be emulated with multiple binary conditions," so a binary tree can represent the same partitioning as a multi-way tree, just with more layers. That is one reason binary splits dominate in modern libraries.
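A minimal sketch of that emulation, using a hypothetical `color` feature and placeholder leaf names: one three-way condition and a chain of two binary conditions send every value to the same leaf.

```python
def three_way(color):
    # One non-binary condition with three outgoing branches.
    branches = {"red": "leaf_a", "green": "leaf_b", "blue": "leaf_c"}
    return branches[color]


def two_binary(color):
    # First binary condition: color == "red"?
    if color == "red":
        return "leaf_a"
    # Second binary condition, one level deeper: color == "green"?
    if color == "green":
        return "leaf_b"
    return "leaf_c"


colors = ["red", "green", "blue"]
same_routing = all(three_way(c) == two_binary(c) for c in colors)
print(same_routing)  # True
```

The binary version needs two levels instead of one, which is the extra depth the paragraph above refers to.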
Conditions are not written by hand. They are picked by a part of the training algorithm called the splitter, which Google's glossary defines as "the routine (and algorithm) responsible for finding the best condition at each node." During training, the algorithm starts with all examples at the root and asks the splitter to find the best possible test for that node. Once the best condition is chosen, the examples are partitioned according to the condition's outcome, and the process repeats on each child.
This recursive procedure, called top-down induction of decision trees, is greedy. It picks the locally best condition at every node without backtracking, which means the final tree is not guaranteed to be globally optimal. Finding the globally optimal tree is NP-hard, so greedy growth is the standard compromise.
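The greedy loop can be sketched in a few lines. The version below is a simplified illustration under several assumptions: features are numeric, conditions are thresholds of the form feature >= t, the splitter scores candidates by weighted Gini impurity, and growth stops when no condition lowers the impurity. Function names and data are invented for the example.

```python
from collections import Counter


def gini(labels):
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_condition(rows, labels):
    """Splitter: (feature, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for feature in range(len(rows[0])):
        for cutoff in sorted({row[feature] for row in rows}):
            yes = [y for row, y in zip(rows, labels) if row[feature] >= cutoff]
            no = [y for row, y in zip(rows, labels) if row[feature] < cutoff]
            if not yes or not no:
                continue  # a condition must send examples both ways
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
            if score < best_score:
                best, best_score = (feature, cutoff), score
    return best


def grow(rows, labels):
    """Greedy top-down induction: pick a condition, partition, recurse."""
    condition = best_condition(rows, labels)
    if condition is None:  # nothing improves impurity, so emit a leaf
        return Counter(labels).most_common(1)[0][0]
    feature, cutoff = condition
    yes = [(r, y) for r, y in zip(rows, labels) if r[feature] >= cutoff]
    no = [(r, y) for r, y in zip(rows, labels) if r[feature] < cutoff]
    return (feature, cutoff,
            grow([r for r, _ in yes], [y for _, y in yes]),
            grow([r for r, _ in no], [y for _, y in no]))


def predict(tree, row):
    while isinstance(tree, tuple):  # tuples are conditions, anything else a leaf
        feature, cutoff, yes_child, no_child = tree
        tree = yes_child if row[feature] >= cutoff else no_child
    return tree


rows = [(1.0, 5.0), (2.0, 4.0), (3.0, 1.0), (4.0, 2.0)]
labels = ["a", "a", "b", "b"]
tree = grow(rows, labels)
print(tree)  # (0, 3.0, 'b', 'a')
```

Because each node commits to its locally best condition and never revisits it, this procedure can miss splits that only pay off two levels down, which is the sense in which it is greedy.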
The quality of a candidate condition is measured by a splitting criterion. The choice of criterion depends on the task and the algorithm:
| Criterion | Used by | Task | Idea |
|---|---|---|---|
| Information gain | ID3 (C4.5 and C5.0 use the normalized variant, gain ratio) | Classification | The entropy of the parent node minus the size-weighted entropy of the children. |
| Gini impurity | CART | Classification | The probability of misclassifying a random example from the node if it were labeled at random according to the node's class distribution. |
| Variance reduction | CART | Regression | The drop in within-node variance of the target variable. |
| Mean squared error | Most regression trees | Regression | Equivalent to variance reduction for the squared loss. |
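All of these criteria share one pattern: compute an impurity for the parent, subtract the size-weighted impurity of the children, and prefer the condition with the largest drop. The sketch below writes that pattern out for entropy, Gini impurity, and variance; the labels and prices are invented for the example.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gini(labels):
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def gain(impurity, parent, children):
    """Parent impurity minus the size-weighted impurity of the children."""
    weighted = sum(len(child) * impurity(child) for child in children) / len(parent)
    return impurity(parent) - weighted


parent = ["spam", "spam", "ham", "ham"]
children = [["spam", "spam"], ["ham", "ham"]]
info_gain = gain(entropy, parent, children)
gini_gain = gain(gini, parent, children)

prices = [100.0, 200.0, 300.0, 400.0]
variance_reduction = gain(variance, prices, [[100.0, 200.0], [300.0, 400.0]])

print(info_gain, gini_gain, variance_reduction)  # 1.0 0.5 10000.0
```

A perfect split of a balanced two-class node removes all of its impurity, which is one bit of entropy or 0.5 of Gini impurity.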
For a numeric feature with many possible thresholds, the splitter scans the candidate cutpoints (often the unique sorted values), scores each one by the chosen criterion, and keeps the best. For a categorical feature with k levels, exhaustive search over all partitions is exponential, so libraries use heuristics such as the Breiman two-class shortcut or random subset sampling. Google's YDF picks the splitter automatically based on feature type and hyperparameters, since the splitter is, in their words, "the bottleneck when training a decision tree."
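For the categorical case, the sketch below assumes that the two-class shortcut means the ordering trick described for CART: sort the categories by their rate of positive examples, then only test the k - 1 splits that cut that ordering. It compares the result against an exhaustive search on four hypothetical categories. The per-category counts are made up.

```python
from itertools import combinations

# Hypothetical counts per category: (positive examples, negative examples).
counts = {"a": (8, 2), "b": (1, 9), "c": (5, 5), "d": (9, 1)}


def gini(pos, neg):
    p = pos / (pos + neg)
    return 2.0 * p * (1.0 - p)  # two-class Gini impurity


def weighted_gini(subset):
    """Weighted Gini of the in-set condition `category in subset`."""
    inside = [counts[c] for c in counts if c in subset]
    outside = [counts[c] for c in counts if c not in subset]
    score = 0.0
    for group in (inside, outside):
        pos = sum(p for p, _ in group)
        neg = sum(n for _, n in group)
        score += (pos + neg) * gini(pos, neg)
    return score / sum(p + n for p, n in counts.values())


def exhaustive_candidates():
    cats = sorted(counts)
    return [set(s) for r in range(1, len(cats)) for s in combinations(cats, r)]


def ordered_candidates():
    # Rank categories by positive rate, then keep only prefixes of the ranking.
    ranked = sorted(counts, key=lambda c: counts[c][0] / sum(counts[c]))
    return [set(ranked[:i]) for i in range(1, len(ranked))]


best_exhaustive = min(weighted_gini(s) for s in exhaustive_candidates())
best_ordered = min(weighted_gini(s) for s in ordered_candidates())
print(len(exhaustive_candidates()), len(ordered_candidates()))  # 14 3
print(best_exhaustive, best_ordered)
```

On this data the three ordered candidates reach the same best score as all fourteen subsets. The guarantee behind the shortcut applies to two-class problems, so multi-class tasks need other heuristics.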
A spam classifier might use a root condition like num_links >= 10. This is an axis-aligned, binary, threshold condition on a single numeric feature. Emails that satisfy the test go down the yes branch toward more conditions about sender reputation or capitalized words; those that fail follow a different sub-tree.
A regression tree that predicts house prices might use square_footage > 2000 as its root. The two children are themselves conditions, perhaps num_bathrooms >= 3 and zip_code in {94110, 94114, 94117}. The latter is an in-set condition on a categorical feature. The leaves hold predicted prices in dollars rather than class labels.
On the classic Iris dataset, an axis-aligned tree typically splits on petal length and petal width. An oblique tree might learn a condition like 0.6 * petal_length + 0.4 * petal_width >= 3.1. The boundary is a diagonal line in feature space, which can separate the three species with fewer total splits.
In modern practice, single decision trees are rarely used on their own; they are almost always combined into ensembles. Conditions still play the same role inside each tree, but the way they are learned changes.
Random forests train many trees in parallel on different bootstrap samples of the data. At each node, the splitter is restricted to consider only a random subset of features. Google's glossary calls this attribute sampling: "a tactic for training a decision forest in which each decision tree considers only a random subset of possible features." The point is to make the trees less correlated with each other, which lowers the variance of the averaged prediction. Each tree's conditions are still axis-aligned thresholds in the default case, but the candidate features at any node are sampled rather than exhaustive.
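The two sources of randomness can be sketched with the standard library. The helper names, the feature list, and the choice of three candidates per node (roughly the square root of nine features, a common heuristic for classification) are assumptions for illustration.

```python
import random

# Hypothetical feature names for a nine-feature dataset.
feature_names = [f"feature_{i}" for i in range(9)]


def bootstrap_indices(num_examples, rng):
    """Sample example indices with replacement: one bootstrap sample per tree."""
    return [rng.randrange(num_examples) for _ in range(num_examples)]


def candidate_features(names, num_candidates, rng):
    """Attribute sampling: the features the splitter may test at one node."""
    return rng.sample(names, num_candidates)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
tree_sample = bootstrap_indices(num_examples=10, rng=rng)
node_candidates = [candidate_features(feature_names, 3, rng) for _ in range(4)]

print(tree_sample)
for candidates in node_candidates:
    print(candidates)
```

The splitter would then search for the best condition using only the listed candidates, and a fresh subset is drawn at every node.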
Gradient boosted decision trees (GBDTs), used by libraries like XGBoost, LightGBM, and CatBoost, train trees sequentially. Each new tree is fit to the errors of the current ensemble: the residuals in the case of squared-error loss, or more generally the negative gradient of the loss. The conditions inside each tree are still learned by a splitter, but the splitting criterion is based on the gradient and (often) the Hessian of the loss with respect to the current predictions, not on Gini or entropy directly. XGBoost, for example, uses a regularized objective and picks the condition that maximizes the gain in that objective. The trees in a GBDT are usually shallow (a few levels) so that no single tree overfits.
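As a rough sketch of a gradient-based criterion, the function below follows the split gain given in the XGBoost paper: each side of a candidate condition is scored by its squared gradient sum divided by its Hessian sum plus an L2 term, and the gain is the improvement over the unsplit node minus a complexity penalty. The function names and the data are invented for the example; `reg_lambda` and `gamma` are named after the corresponding XGBoost parameters.

```python
def side_score(gradient_sum, hessian_sum, reg_lambda):
    return gradient_sum ** 2 / (hessian_sum + reg_lambda)


def split_gain(left, right, reg_lambda=1.0, gamma=0.0):
    """Gain of a candidate condition.

    `left` and `right` are lists of (gradient, hessian) pairs, one pair per
    example routed to that side.
    """
    g_left = sum(g for g, _ in left)
    h_left = sum(h for _, h in left)
    g_right = sum(g for g, _ in right)
    h_right = sum(h for _, h in right)
    improvement = (
        side_score(g_left, h_left, reg_lambda)
        + side_score(g_right, h_right, reg_lambda)
        - side_score(g_left + g_right, h_left + h_right, reg_lambda)
    )
    return 0.5 * improvement - gamma


# Squared-error loss with current prediction 0 and targets [1, 1, -1, -1]:
# gradient = prediction - target, hessian = 1 for every example.
pairs = [(-1.0, 1.0), (-1.0, 1.0), (1.0, 1.0), (1.0, 1.0)]

good_gain = split_gain(left=pairs[:2], right=pairs[2:])
bad_gain = split_gain(left=[pairs[0], pairs[2]], right=[pairs[1], pairs[3]])
print(good_gain, bad_gain)
```

The condition that groups examples with similar gradients earns a positive gain, while the one that mixes them earns none.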
Both approaches still rely on the same atomic unit: a node that tests a feature and routes examples.
At prediction time, a tree consumes one example at a time. The example enters at the root, evaluates the condition there, and is sent to the matching child. The process repeats at each internal node until the example arrives at a leaf, at which point the leaf's stored value is returned. Google's glossary calls this sequence the inference path.
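A minimal sketch of that traversal follows, recording each condition and its outcome along the way. The `Condition` and `Leaf` classes, the feature names, and the tiny spam tree are hypothetical and chosen only to make the path visible.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    value: str


@dataclass
class Condition:
    description: str
    test: Callable[[dict], bool]
    if_true: "Node"
    if_false: "Node"


Node = Union[Condition, Leaf]


def predict_with_path(node, example):
    """Walk from the root to a leaf and return (prediction, inference_path)."""
    path = []
    while isinstance(node, Condition):
        outcome = node.test(example)
        path.append((node.description, outcome))
        node = node.if_true if outcome else node.if_false
    return node.value, path


tree = Condition(
    "num_links >= 10",
    lambda ex: ex["num_links"] >= 10,
    Condition(
        "sender_known",
        lambda ex: ex["sender_known"],
        Leaf("not spam"),
        Leaf("spam"),
    ),
    Leaf("not spam"),
)

prediction, path = predict_with_path(tree, {"num_links": 12, "sender_known": False})
print(prediction)  # spam
print(path)        # [('num_links >= 10', True), ('sender_known', False)]
```

Only the conditions on the path are evaluated; the rest of the tree is never touched for this example.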
For an ensemble, each tree produces its own prediction from its own chain of conditions, and the ensemble aggregates them. Random forests average for regression or vote for classification. Gradient boosted models sum the tree outputs and, for classification, typically pass that sum through a sigmoid or softmax to obtain probabilities. Because each tree evaluates only one condition per level, decision forests stay fast even when they contain hundreds or thousands of trees.
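The three aggregation rules can be sketched as follows. The per-tree outputs are made-up numbers, and the boosted example assumes a binary classifier whose tree outputs are already scaled by the learning rate.

```python
import math
from collections import Counter


def forest_regression(tree_outputs):
    """Random forest regression: average the per-tree values."""
    return sum(tree_outputs) / len(tree_outputs)


def forest_vote(tree_votes):
    """Random forest classification: majority vote over per-tree labels."""
    return Counter(tree_votes).most_common(1)[0][0]


def boosted_probability(tree_outputs, initial_score=0.0):
    """Gradient boosting, binary case: sum the raw scores, then apply a sigmoid."""
    raw = initial_score + sum(tree_outputs)
    return 1.0 / (1.0 + math.exp(-raw))


average_price = forest_regression([200.0, 220.0, 240.0])
majority = forest_vote(["spam", "ham", "spam"])
probability = boosted_probability([0.4, -0.1, 0.3])

print(average_price)  # 220.0
print(majority)       # spam
print(probability)
```

Note the order of operations in the boosted case: the sigmoid is applied once to the summed score, not to each tree's output separately.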
| Term | Meaning |
|---|---|
| Node | Any condition or leaf. |
| Root | The first condition in a tree. |
| Split | A synonym for condition. |
| Test | Another synonym for condition. |
| Splitter | The routine that picks the best condition at a node. |
| Inference path | The sequence of conditions an example traverses to reach a leaf. |
| Pre-pruning | Stopping criteria that prevent a condition from being added (for example minimum samples per leaf). |
| Post-pruning | Removing or collapsing conditions after the tree is grown to reduce overfitting. |
Many practical hyperparameters in libraries like scikit-learn, XGBoost, and LightGBM (max_depth, min_samples_split, min_child_weight, max_features) are constraints on which conditions the splitter is allowed to consider or how many conditions a tree can contain.
Think of a decision tree as a game of yes-or-no questions. A condition is one of those questions. You start with one question at the top: "Does the animal have feathers?" If the answer is yes, you go one way and ask another question. If no, you go the other way. You keep answering until you reach the bottom, which is your final guess ("penguin!"). Each question is a condition. The computer learns which questions to ask by looking at a lot of examples and figuring out which question splits them most cleanly into the right groups.