See also: Machine learning terms
In the field of machine learning, a decision tree is a tree-structured predictive model used for both classification and regression tasks. The model works by recursively splitting the input data into subsets based on the values of input features, forming a hierarchy of decision rules that ultimately produce a prediction at each terminal node. Decision trees are among the most widely used algorithms in supervised learning because of their intuitive structure, interpretability, and ability to handle both numerical and categorical data without extensive preprocessing.
A decision tree can be understood as a flowchart-like structure where each internal node represents a test on a feature (for example, "Is temperature greater than 30?" or "Is color equal to red?"), each branch represents the outcome of that test, and each leaf node represents a class label or a continuous value. The path from the root to a leaf encodes a conjunction of feature conditions that leads to a particular prediction.
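As a minimal sketch of this flowchart structure, the following uses scikit-learn (whose tree implementation is discussed later in this article) to fit a shallow tree on the Iris dataset — an illustrative choice — and print its root-to-leaf rules:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Limit depth so the resulting flowchart stays small enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each printed rule is one root-to-leaf path: a conjunction of feature tests.
print(export_text(tree, feature_names=iris.feature_names))
print(tree.predict(iris.data[:1]))  # route one sample from the root down to a leaf
```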
Constructing an optimal binary decision tree for a given dataset is NP-complete, as proven by Laurent Hyafil and Ronald Rivest in 1976. Because of this computational intractability, practical decision tree algorithms rely on greedy heuristics that make locally optimal splitting decisions at each node rather than searching for a globally optimal tree.
The development of decision tree algorithms spans more than six decades, with contributions from statisticians, computer scientists, and machine learning researchers.
The earliest decision tree algorithm was Automatic Interaction Detection (AID), published by James N. Morgan and John A. Sonquist in 1963 in the Journal of the American Statistical Association. Morgan and Sonquist were motivated by the challenge of dealing with complex interactions among categorical variables in social survey data. They wanted to untangle the influence of factors such as age, education, ethnicity, and profession on a person's income. AID works by recursively splitting the data, beginning at the root node, into two child nodes, selecting binary splits on ordinal and nominal independent variables that most reduce the sum of squared deviations of an interval-scaled dependent variable from its mean. Because interaction effects are a natural byproduct of the splitting process, Morgan and Sonquist called the procedure "automatic." AID is considered the first regression tree algorithm in the published literature.
In the early 1970s, Morgan and colleagues extended AID with THAID (THeta Automatic Interaction Detection), which was designed for classification problems with categorical dependent variables. THAID used a different splitting criterion based on maximizing the proportion of correctly classified observations.
The Chi-squared Automatic Interaction Detection (CHAID) algorithm was developed by Gordon V. Kass, a statistician at the University of the Witwatersrand in South Africa. Kass published the method in his 1980 paper "An Exploratory Technique for Investigating Large Quantities of Categorical Data" in Applied Statistics. CHAID is based on a formal extension of AID and THAID, and it uses chi-squared tests of significance (with Bonferroni correction) to determine the best splits. Unlike AID and CART, CHAID can produce multi-way splits (not just binary), creating discrete groups at each node. CHAID was originally designed for classification and was later extended to handle regression problems as well.
The CART (Classification and Regression Trees) algorithm was introduced by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone in their landmark 1984 book Classification and Regression Trees. CART represented a major advance over earlier tree methods. It produces strictly binary trees (each node splits into exactly two children), handles both classification and regression tasks, introduces cost-complexity pruning for tree simplification, and provides theoretical guarantees for consistency. Breiman and colleagues also proved conditions under which all recursive partitioning techniques are Bayes risk consistent. CART remains one of the most widely used tree algorithms and is the basis for the tree implementation in scikit-learn.
ID3 (Iterative Dichotomiser 3) was developed by Ross Quinlan and published in 1986 in the paper "Induction of Decision Trees" in Machine Learning. ID3 uses information gain (based on entropy) as the splitting criterion, selecting the feature with the highest information gain at each step. ID3 follows a top-down, greedy search through the space of possible decision trees. A key limitation of ID3 is that it can only handle categorical features and does not include a pruning mechanism, making it prone to overfitting.
C4.5 is the successor to ID3, also developed by Ross Quinlan. Published in his 1993 book C4.5: Programs for Machine Learning, C4.5 addresses several limitations of ID3. The algorithm introduces gain ratio as a splitting criterion (which normalizes information gain by the intrinsic information of a split, reducing bias toward features with many values), handles both categorical and continuous features, includes mechanisms for tree pruning, and can deal with missing values. C4.5 was ranked first among the "Top 10 Algorithms in Data Mining" identified by the IEEE International Conference on Data Mining (ICDM) in 2006.
C5.0 is the commercial successor to C4.5, released by Ross Quinlan in the late 1990s. C5.0 offers improved speed and memory efficiency, support for boosting to improve accuracy, handling of very large datasets, and a more compact rule set output. While C5.0 is proprietary, its predecessor C4.5 remains freely available and widely used in academic research.
The following table summarizes the major milestones in the history of decision tree algorithms:
| Year | Algorithm | Creator(s) | Key Contribution |
|---|---|---|---|
| 1963 | AID | Morgan and Sonquist | First regression tree algorithm; binary splits on survey data |
| 1973 | THAID | Morgan et al. | Extended AID to classification with categorical dependent variables |
| 1976 | NP-completeness proof | Hyafil and Rivest | Proved that constructing optimal binary decision trees is NP-complete |
| 1980 | CHAID | Gordon V. Kass | Chi-squared significance testing; multi-way splits |
| 1984 | CART | Breiman, Friedman, Olshen, Stone | Binary trees for classification and regression; cost-complexity pruning |
| 1986 | ID3 | Ross Quinlan | Information gain splitting criterion; top-down greedy approach |
| 1993 | C4.5 | Ross Quinlan | Gain ratio; handles continuous features and missing values; pruning |
| 1997 | C5.0 | Ross Quinlan | Faster, more memory-efficient; boosting support |
A decision tree consists of three types of nodes:
- Root node: The topmost node in the tree, representing the entire dataset. The root node contains the first splitting condition applied to all training samples.
- Internal nodes (decision nodes): Nodes that represent decision points, splitting the dataset based on a particular feature value. Each internal node tests a condition on one feature and routes samples to child nodes based on the outcome.
- Leaf nodes (terminal nodes): Nodes that correspond to the final predictions or outcomes of the model. In classification trees, a leaf node typically stores the majority class of the training samples that reached it. In regression trees, it stores the mean (or sometimes the median) of the target values.
Branches in a decision tree connect nodes and represent the decision path taken based on the feature values. Each branch corresponds to a specific decision rule, such as "age < 30" or "salary > 50,000". For binary trees (like CART), each internal node has exactly two branches. For multi-way trees (like CHAID or ID3 with categorical features), a node may have as many branches as there are distinct values of the splitting feature.
The depth of a decision tree is the length of the longest path from the root node to any leaf node. The size of a tree refers to the total number of nodes. Deeper trees can capture more complex patterns in the data but are more likely to overfit. Controlling tree depth is one of the primary mechanisms for regularizing decision trees.
The core of any decision tree algorithm is the criterion used to select the best feature and threshold for splitting at each node. The goal is to create child nodes that are more "pure" (homogeneous) than the parent node with respect to the target variable.
Gini impurity measures the probability that a randomly chosen sample from a node would be incorrectly classified if it were labeled at random according to the distribution of class labels at that node. For a node with K classes, the Gini impurity is defined as:
Gini(t) = 1 - sum(p_i^2) for i = 1 to K
where p_i is the proportion of samples belonging to class i at node t. A Gini impurity of 0 indicates a perfectly pure node (all samples belong to one class), while the maximum impurity for a binary classification problem is 0.5 (equal split between two classes). CART uses Gini impurity as its default splitting criterion for classification.
When evaluating a candidate split, the algorithm computes the weighted average Gini impurity of the child nodes and selects the split that yields the greatest reduction in impurity (the "Gini gain").
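A minimal sketch of these two quantities in Python; the helper names gini and gini_gain are illustrative, not part of any library, and the toy split is assumed:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(parent, left, right):
    """Reduction in impurity from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_child = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_child

parent = np.array([0, 0, 0, 1, 1, 1])                 # Gini = 0.5 (maximally impure)
left, right = np.array([0, 0, 0]), np.array([1, 1, 1])
print(gini(parent))                    # 0.5
print(gini_gain(parent, left, right))  # 0.5: a perfect split
```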
Entropy is a concept from information theory that quantifies the uncertainty or disorder in a dataset. For a node with K classes, entropy is defined as:
Entropy(t) = - sum(p_i * log2(p_i)) for i = 1 to K
where p_i is the proportion of samples belonging to class i. Entropy ranges from 0 (perfectly pure node) to log2(K) (uniform distribution across all classes).
Information gain measures the reduction in entropy achieved by splitting on a particular feature:
IG(t, feature) = Entropy(t) - sum((|t_v| / |t|) * Entropy(t_v))
where t_v represents the child nodes created by the split and |t_v|/|t| is the fraction of samples routed to each child. ID3 uses information gain as its splitting criterion. A known limitation of information gain is that it favors features with many distinct values (high cardinality), which led to the development of gain ratio in C4.5.
Gain ratio, introduced in C4.5, normalizes information gain by the intrinsic information (split information) of the feature:
GainRatio(t, feature) = IG(t, feature) / SplitInfo(t, feature)
where SplitInfo(t, feature) = - sum((|t_v| / |t|) * log2(|t_v| / |t|)). This normalization penalizes features that produce many small partitions, reducing the bias inherent in information gain.
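The entropy, information gain, and gain ratio formulas above can be sketched in a few lines of Python; the function names and the example split are illustrative:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the child nodes."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

def gain_ratio(parent, children):
    """Information gain normalized by the split information, as in C4.5."""
    n = len(parent)
    fractions = np.array([len(c) / n for c in children])
    split_info = -np.sum(fractions * np.log2(fractions))
    return information_gain(parent, children) / split_info

parent = np.array([0, 0, 1, 1, 1, 1])
children = [np.array([0, 0, 1]), np.array([1, 1, 1])]
print(entropy(parent))                    # ~0.918
print(information_gain(parent, children)) # ~0.459
print(gain_ratio(parent, children))       # ~0.459 (split info is 1 for a 50/50 split)
```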
For regression trees, the splitting criterion is typically variance reduction (also known as reduction in sum of squared errors). The algorithm selects the split that minimizes the weighted variance of the target variable in the child nodes:
VarianceReduction = Var(t) - sum((|t_v| / |t|) * Var(t_v))
where Var(t) is the variance of the target variable at node t. Each leaf node in a regression tree predicts the mean of the target values of the training samples that reached it. CART uses this criterion for regression tasks.
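A corresponding sketch for the regression criterion, with an assumed toy target vector:

```python
import numpy as np

def variance_reduction(parent_y, left_y, right_y):
    """Weighted decrease in target variance achieved by a candidate split."""
    n = len(parent_y)
    weighted = (len(left_y) / n) * np.var(left_y) + (len(right_y) / n) * np.var(right_y)
    return np.var(parent_y) - weighted

y = np.array([1.0, 1.2, 0.9, 5.0, 5.3, 4.8])
left, right = y[:3], y[3:]
print(variance_reduction(y, left, right))  # large: the split separates two clusters
# Each leaf of a regression tree would then predict the mean of its samples:
print(left.mean(), right.mean())
```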
The following table compares the main splitting criteria:
| Criterion | Task | Used By | Range | Notes |
|---|---|---|---|---|
| Gini impurity | Classification | CART, scikit-learn | 0 to 0.5 (binary) | Computationally simpler; no logarithm |
| Entropy / Information Gain | Classification | ID3, C4.5 | 0 to log2(K) | Based on information theory; slightly more expensive to compute |
| Gain Ratio | Classification | C4.5, C5.0 | Normalized | Corrects information gain bias toward high-cardinality features |
| Variance Reduction | Regression | CART | 0 to Var(y) | Minimizes sum of squared errors in child nodes |
| Chi-squared test | Classification | CHAID | Significance level | Uses statistical significance for split selection |
Decision trees can be applied to both classification and regression problems. The fundamental tree-growing procedure is the same in both cases (recursive binary or multi-way partitioning), but several details differ.
| Aspect | Classification tree | Regression tree |
|---|---|---|
| Target variable | Categorical (discrete class labels) | Continuous (real-valued) |
| Splitting criterion | Gini impurity, entropy, gain ratio, chi-squared test | Variance reduction (MSE), mean absolute error (MAE) |
| Leaf prediction | Majority class among training samples at the leaf | Mean (or median) of target values at the leaf |
| Probability output | Class probability distribution at each leaf | Point estimate (mean) at each leaf |
| Evaluation metrics | Accuracy, F1 score, AUC-ROC | Mean squared error, mean absolute error, R-squared |
| Common algorithms | CART, ID3, C4.5, CHAID | CART, AID |
In a classification model, each leaf outputs the class that appears most frequently among the training samples that arrived at that leaf. The tree can also output a probability vector by reporting the proportion of each class at the leaf. In a regression model, each leaf outputs a constant value, typically the mean of the target variable for the training samples at that leaf. Some implementations use the median instead of the mean to produce predictions that are more robust to outliers.
CART is unique in that the same algorithmic framework handles both tasks. The only difference is the impurity measure: Gini impurity (or entropy) for classification and variance reduction (or mean absolute error) for regression.
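A small sketch of the two output types using scikit-learn's CART implementation; the toy data and depth limit are assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_reg = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])

clf = DecisionTreeClassifier(max_depth=1).fit(X, y_class)
print(clf.predict([[2.5]]))        # majority class at the leaf
print(clf.predict_proba([[2.5]]))  # class proportions at the leaf

reg = DecisionTreeRegressor(max_depth=1).fit(X, y_reg)
print(reg.predict([[2.5]]))        # mean of the target values at the leaf (about 1.0)
```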
CART (Classification and Regression Trees) is the most widely implemented decision tree algorithm and serves as the foundation for many modern tree-based methods. Several practical details of how CART grows trees and handles data are described below.
The computational cost of growing a CART tree on a dataset with n samples and p features is O(p * n * log(n)) at each node (sorting each feature costs O(n log n), and there are p features to consider). The total cost depends on the depth of the tree, which in the worst case yields O(p * n^2 * log(n)) for a fully grown tree.
CART handles continuous and categorical features differently. For a continuous feature, the algorithm sorts the observed values and evaluates candidate thresholds (typically midpoints between consecutive sorted values), choosing the one that best reduces impurity. For a categorical feature, it searches over ways of partitioning the categories into two groups, one group routed to each child node.
CART handles missing values through surrogate splits. When the best split feature has missing values for some samples, the algorithm identifies surrogate features that produce splits most similar to the primary split. Samples with missing values on the primary feature are then routed using the best available surrogate split. This approach avoids the need to impute missing values before training.
Other algorithms handle missing values differently. C4.5 distributes samples with missing values across all child nodes, weighting each branch by the fraction of non-missing samples that follow it. XGBoost learns a default direction for missing values during training: at each split, samples with missing values are sent left or right depending on which direction reduces the loss more. LightGBM similarly assigns missing values to whichever child reduces the objective.
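A minimal sketch of XGBoost's native missing-value handling, using its scikit-learn-style wrapper on synthetic data; the dataset and hyperparameters are illustrative:

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan  # inject 20% missing values

# NaN entries are passed through unchanged; each split learns a default
# direction for them during training, so no imputation step is required.
model = XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)
print(model.predict(X[:5]))
```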
Pruning is the process of simplifying a decision tree by removing branches that do not significantly improve predictive performance. Without pruning, decision trees tend to grow very deep, memorizing noise in the training data and performing poorly on unseen data. There are two main categories of pruning.
Pre-pruning halts the growth of the tree during construction by imposing constraints. Common pre-pruning parameters include:
| Parameter | Description | Effect |
|---|---|---|
| Maximum depth | Limits the longest path from root to any leaf | Prevents excessively deep trees |
| Minimum samples per split | Requires a minimum number of samples at a node before it can be split | Prevents splits on very small subsets |
| Minimum samples per leaf | Requires each leaf to contain at least a specified number of samples | Ensures leaf predictions are based on sufficient data |
| Maximum number of leaf nodes | Limits the total number of leaves in the tree | Controls overall tree complexity |
| Minimum impurity decrease | Requires a split to achieve at least a specified reduction in impurity | Prevents trivial splits |
Pre-pruning is computationally efficient because it avoids growing the full tree. However, it can be overly conservative: a split that appears unhelpful on its own may enable useful splits further down the tree (the "horizon effect").
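In scikit-learn, these pre-pruning constraints map directly onto constructor parameters; the dataset and the specific values below are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(
    max_depth=4,                  # maximum depth
    min_samples_split=20,         # minimum samples required to split a node
    min_samples_leaf=10,          # minimum samples required at each leaf
    max_leaf_nodes=15,            # maximum number of leaf nodes
    min_impurity_decrease=0.001,  # minimum impurity decrease per split
    random_state=0,
).fit(X, y)

print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```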
Post-pruning first grows the tree to its full depth (or near-full depth) and then removes or collapses branches that do not improve generalization. Since the tree is fully grown before pruning begins, post-pruning can consider all possible patterns before discarding weak or irrelevant ones.
Cost-complexity pruning, introduced with the CART algorithm, is the most widely used post-pruning method. It defines a cost-complexity measure for any subtree T:
R_alpha(T) = R(T) + alpha * |T|
where R(T) is the total misclassification rate (or sum of squared errors for regression) of the tree, |T| is the number of leaf nodes, and alpha is a non-negative complexity parameter. Starting from the full tree, the algorithm repeatedly collapses the "weakest link" (the internal node whose removal increases error the least per leaf removed), producing a nested sequence of subtrees, each of which is optimal for some range of alpha.
Greater values of alpha increase the number of nodes pruned, yielding simpler trees. The optimal alpha is typically selected via k-fold cross-validation.
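A common recipe with scikit-learn, sketched on an illustrative dataset: compute the pruning path, then cross-validate each candidate alpha and refit with the best one:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Effective alphas at which the fully grown tree would be pruned further.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]  # drop the last alpha, which prunes to a single node

scores = [
    cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(scores))]
print("best alpha:", best_alpha)

pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print("leaves after pruning:", pruned.get_n_leaves())
```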
Reduced error pruning is one of the simplest post-pruning methods. It requires a separate validation set. Starting from the leaves, each internal node is tentatively replaced with a leaf labeled with the majority class. If this replacement does not decrease accuracy on the validation set, the change is kept. This process continues until no further pruning improves validation accuracy. Reduced error pruning is fast and straightforward but requires setting aside data for validation, which can be wasteful when data is scarce.
Error-based pruning, used in C4.5, estimates the error rate at each node using a confidence interval based on the training data. Subtrees whose estimated error exceeds that of a single leaf replacement are pruned. This method does not require a separate validation set.
Standard decision tree algorithms like CART, ID3, and C4.5 create axis-aligned (univariate) splits, meaning each split tests a single feature at a time. This produces rectangular decision regions in the feature space. Oblique decision trees (also called multivariate decision trees) generalize this by splitting on linear combinations of features. Instead of a condition like "feature_j <= threshold," an oblique split takes the form:
w_1 * x_1 + w_2 * x_2 + ... + w_p * x_p <= threshold
where w_1, ..., w_p are learned weights. This allows the tree to produce decision boundaries that are diagonal or angled relative to the feature axes, which can represent certain patterns much more compactly than a staircase of axis-aligned splits.
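A minimal numpy sketch of evaluating a single oblique split; the weights, threshold, and synthetic data are assumed for illustration:

```python
import numpy as np

# Synthetic 2-D data whose true boundary is diagonal: an axis-aligned tree would
# need a staircase of splits to approximate it, while one oblique split captures it.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # true boundary: x1 + x2 = 0

# One oblique split: samples with w.x <= threshold go to the left child.
w = np.array([1.0, 1.0])   # weights (set by hand here; learned by an oblique tree)
threshold = 0.0
goes_left = X @ w <= threshold

print("left child size:", goes_left.sum(), "right child size:", (~goes_left).sum())
print("accuracy of this single split as a classifier:",
      ((~goes_left) == y.astype(bool)).mean())
```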
Oblique trees often achieve higher generalization accuracy with fewer nodes than axis-aligned trees, especially on datasets where the true decision boundary is not parallel to any feature axis. However, they come with trade-offs. First, finding the optimal oblique split at each node is computationally expensive because the search space is continuous and high-dimensional. Second, oblique splits are harder for humans to interpret because they involve weighted combinations of multiple features rather than simple thresholds on a single variable.
Several algorithms have been proposed for constructing oblique trees. The OC1 system (Murthy, Kasif, and Salzberg, 1994) uses a randomized perturbation procedure and coordinate descent to optimize oblique splits. CART-LC (a variant of CART with linear combination splits) optimizes split coefficients by iterating over one weight at a time. More recent approaches use linear programming, support vector machines, or neural network components to find oblique hyperplanes at each node.
A multi-output decision tree predicts multiple target variables simultaneously using a single tree structure. Instead of predicting a single class label or a single continuous value, each leaf node stores a vector of predictions (one per target). The tree is grown using a splitting criterion that aggregates the impurity or variance across all targets.
For multi-output classification, the tree minimizes the average Gini impurity or entropy across all target variables at each split. For multi-output regression, it minimizes the average variance reduction across all targets.
Multi-output trees are useful in scenarios where the target variables are correlated, because a single tree can capture shared structure in the input space that affects multiple outputs simultaneously. Applications include predicting multiple properties of a chemical compound, multi-label classification tasks, and structured output problems such as image completion.
In scikit-learn, both DecisionTreeClassifier and DecisionTreeRegressor natively support multi-output prediction. When fit on a target array Y of shape (n_samples, n_outputs), the estimator learns a single tree that produces predictions for all outputs at once.
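A minimal sketch of multi-output regression with scikit-learn; the synthetic targets below are assumptions chosen to be correlated:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
# Two correlated targets that share structure in the input space.
Y = np.column_stack([np.sin(3 * X[:, 0]),
                     np.sin(3 * X[:, 0]) + 0.5 * X[:, 1]])

tree = DecisionTreeRegressor(max_depth=5).fit(X, Y)
print(tree.predict(X[:3]))  # shape (3, 2): one prediction per target
```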
Decision trees and tree-based ensemble methods provide natural measures of feature importance, which indicate how much each feature contributes to the model's predictions.
Mean decrease in impurity (MDI), also known as impurity-based feature importance, measures the total reduction in the splitting criterion (Gini impurity, entropy, or mean squared error) attributable to each feature, averaged across all trees in the ensemble. For a single tree, the importance of a feature is the sum of the impurity decreases at all nodes where that feature is used for splitting, weighted by the number of samples reaching each node.
MDI is computed during training and is therefore very fast. However, it has known limitations: it is biased toward high-cardinality features (features with many distinct values), and it can assign high importance to features that overfit the training data but are not genuinely predictive.
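For example, scikit-learn exposes MDI importances through the feature_importances_ attribute after fitting; the dataset below is illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(data.data, data.target)

# One importance per feature; the values sum to 1.
top5 = sorted(zip(data.feature_names, forest.feature_importances_),
              key=lambda pair: pair[1], reverse=True)[:5]
for name, importance in top5:
    print(f"{name}: {importance:.3f}")
```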
Permutation importance, introduced by Breiman in his 2001 Random Forest paper, measures feature importance by randomly shuffling the values of a single feature and observing the resulting decrease in model performance on a held-out dataset. If shuffling a feature causes a large drop in accuracy (or increase in error), that feature is considered important.
Permutation importance is model-agnostic (it can be applied to any model, not just trees), avoids the high-cardinality bias of MDI, and can be computed on test data to reflect genuine predictive importance. Its main drawback is that it can be slow for large datasets and is affected by feature correlations (correlated features may share importance).
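A sketch using scikit-learn's permutation_importance on a held-out split; the dataset and the n_repeats value are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each feature 10 times and measure the mean drop in test accuracy.
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean[:5])
```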
SHAP (SHapley Additive exPlanations), developed by Scott Lundberg and Su-In Lee in 2017, provides a unified framework for feature importance based on Shapley values from cooperative game theory. TreeSHAP is an efficient algorithm for computing exact SHAP values for tree-based models in polynomial time. SHAP values provide both global feature importance (average absolute SHAP value across all samples) and local explanations (the contribution of each feature to an individual prediction).
Single decision trees have well-known limitations, including high variance and a tendency to overfit. Ensemble methods combine multiple trees to produce more accurate and robust predictions. The two main families of tree-based ensembles are bagging and boosting.
Bootstrap aggregating (bagging), introduced by Leo Breiman in 1996, trains multiple decision trees on different bootstrap samples (random samples drawn with replacement) from the training data. The final prediction is the average (regression) or majority vote (classification) of all trees. Bagging reduces variance without increasing bias.
Random forest, proposed by Leo Breiman in his 2001 paper "Random Forests" published in Machine Learning, extends bagging by adding random feature selection. At each split in each tree, only a random subset of features (typically the square root of the total number of features for classification, or one-third for regression) is considered as candidates. This additional randomness further decorrelates the individual trees, leading to lower generalization error. Random forests also provide out-of-bag (OOB) error estimates computed from the samples left out of each bootstrap sample, expose built-in feature importance measures, and deliver strong performance with relatively little hyperparameter tuning.
Gradient boosting, formalized by Jerome Friedman in his 2001 paper "Greedy Function Approximation: A Gradient Boosting Machine" published in The Annals of Statistics, builds trees sequentially rather than independently. Each new tree is trained to predict the residual errors (or, more precisely, the negative gradient of the loss function) of the current ensemble. The predictions of all trees are summed to produce the final output. Gradient boosting typically uses shallow trees (often called "stumps" for depth-1 trees, or trees with depth 3 to 8) as weak learners. Key hyperparameters include the learning rate (shrinkage), number of trees, and tree depth.
AdaBoost, introduced by Yoav Freund and Robert Schapire in 1997, is a related boosting method that adjusts sample weights at each iteration, giving more weight to misclassified samples. AdaBoost can be viewed as a special case of gradient boosting with an exponential loss function.
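A small sketch comparing the bagging- and boosting-style ensembles available in scikit-learn; the dataset and hyperparameters are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                                    max_depth=3, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```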
Several high-performance implementations of gradient boosted trees have become dominant in applied machine learning and competitive data science:
XGBoost (eXtreme Gradient Boosting), developed by Tianqi Chen and Carlos Guestrin at the University of Washington, was published in the 2016 paper "XGBoost: A Scalable Tree Boosting System" at the ACM SIGKDD conference. XGBoost introduced several innovations: a regularized objective function that penalizes tree complexity (including both L1 and L2 regularization on leaf weights), a sparsity-aware algorithm for handling missing values, a weighted quantile sketch for approximate split finding, and system-level optimizations such as cache-aware access and out-of-core computation. XGBoost uses level-wise (breadth-first) tree growth by default.
LightGBM (Light Gradient Boosting Machine), developed by a team led by Guolin Ke at Microsoft Research, was published in the 2017 NeurIPS paper "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." LightGBM introduces two key techniques: Gradient-based One-Side Sampling (GOSS), which excludes data instances with small gradients to speed up information gain estimation; and Exclusive Feature Bundling (EFB), which bundles mutually exclusive features to reduce dimensionality. LightGBM also uses leaf-wise (best-first) tree growth rather than level-wise growth, which can produce deeper, more asymmetric trees that converge faster. LightGBM achieves training speedups of over 20x compared to conventional gradient boosting while maintaining comparable accuracy.
CatBoost (Categorical Boosting), developed by the machine learning team at Yandex, was described in the 2018 NeurIPS paper "CatBoost: Unbiased Boosting with Categorical Features" by Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost introduces ordered boosting, a permutation-driven approach that addresses prediction shift (a form of target leakage present in standard gradient boosting), and an efficient algorithm for processing categorical features directly without manual encoding. CatBoost uses oblivious (symmetric) trees as base learners, where all nodes at the same depth level use the same feature and split threshold. This symmetric structure enables efficient vectorized evaluation, particularly on GPUs.
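A minimal sketch of the three libraries used through their scikit-learn-compatible wrappers; the hyperparameter values are illustrative, and defaults differ between libraries:

```python
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in [
    XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1),
    LGBMClassifier(n_estimators=300, num_leaves=31, learning_rate=0.1),
    CatBoostClassifier(iterations=300, depth=6, learning_rate=0.1, verbose=False),
]:
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```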
The following table compares the major tree-based ensemble algorithms:
| Algorithm | Year | Author(s) | Ensemble Type | Tree Growth | Key Innovation |
|---|---|---|---|---|---|
| Bagging | 1996 | Breiman | Parallel (bootstrap) | Full depth | Variance reduction via averaging |
| Random Forest | 2001 | Breiman | Parallel (bootstrap + feature subspace) | Full depth | Random feature selection at each split |
| AdaBoost | 1997 | Freund, Schapire | Sequential (boosting) | Shallow (stumps) | Adaptive sample weighting |
| Gradient Boosting | 2001 | Friedman | Sequential (boosting) | Shallow | Fits residuals via gradient descent in function space |
| XGBoost | 2016 | Chen, Guestrin | Sequential (boosting) | Level-wise | Regularized objective; sparsity-aware splits |
| LightGBM | 2017 | Ke et al. (Microsoft) | Sequential (boosting) | Leaf-wise | GOSS and EFB for speed; leaf-wise growth |
| CatBoost | 2018 | Prokhorenkova et al. (Yandex) | Sequential (boosting) | Symmetric (oblivious) | Ordered boosting; native categorical feature handling |
One of the strongest advantages of decision trees is that they can be visualized in a way that humans can inspect and understand. Several tools and techniques exist for visualizing decision trees and interpreting their predictions.
The most common visualization is a node-link diagram where each node displays the splitting feature, threshold, impurity value, number of samples, and (for classification) the class distribution. Nodes are typically colored by the majority class or the predicted value, with color intensity reflecting the node's purity. This rendering allows practitioners to trace any prediction from root to leaf and understand why the model made a particular decision.
In scikit-learn, two primary functions support tree visualization:
- sklearn.tree.plot_tree: Added in scikit-learn 0.21, this function renders a decision tree directly as a matplotlib figure without requiring external software. It accepts parameters such as max_depth (to limit the displayed depth), feature_names, class_names, filled (to color nodes by class or value), and rounded (for rounded node boxes).
- sklearn.tree.export_graphviz: This function exports the tree in DOT format, which can be rendered into high-resolution images using the Graphviz software. It provides more control over aesthetic details and can produce output in formats like PNG, PDF, or SVG.

Additionally, the sklearn.tree.export_text function produces a plain-text representation of the tree rules, which is useful for logging, debugging, or environments where graphical output is not available.
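A short sketch combining plot_tree and export_text on an illustrative dataset:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Node-link diagram: nodes are colored by majority class, with intensity
# reflecting purity, and annotated with the split, impurity, and sample counts.
fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(tree, feature_names=iris.feature_names, class_names=iris.target_names,
          filled=True, rounded=True, ax=ax)
plt.savefig("iris_tree.png", dpi=150)

# Plain-text rules, one indented block per root-to-leaf path.
print(export_text(tree, feature_names=iris.feature_names))
```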
Decision trees can be converted into a set of if-then rules, one rule per leaf. Each rule is the conjunction of all split conditions along the path from the root to that leaf. Rule extraction is useful for embedding tree-based logic into production systems, regulatory documentation, or clinical guidelines where a visual diagram is insufficient.
Decision trees are often cited as one of the most interpretable machine learning models. For shallow trees (depth 3 to 5), a human can follow the decision logic easily. However, interpretability degrades rapidly as trees grow deeper. A tree with hundreds of leaves is no easier to interpret than a black-box model. This is one reason why post-pruning and depth constraints are important not only for generalization but also for practical usability.
For ensemble methods built on decision trees, individual tree interpretability is largely lost because the final prediction aggregates hundreds or thousands of trees. Techniques like SHAP values, partial dependence plots, and feature importance scores are used to interpret ensemble models at a higher level.
Decision trees and tree-based ensembles are used across a wide range of domains, including credit scoring and fraud detection in finance, diagnosis and risk stratification in healthcare, churn prediction in marketing, and ranking and tabular prediction tasks in web search and recommendation systems.
Decision trees and tree-based ensembles are available in most major machine learning libraries:
| Library | Language | Tree Algorithms Available |
|---|---|---|
| scikit-learn | Python | CART (DecisionTreeClassifier, DecisionTreeRegressor), Random Forest, Gradient Boosting, AdaBoost, ExtraTreesClassifier |
| XGBoost | Python, R, C++, Java | XGBoost (regularized gradient boosting) |
| LightGBM | Python, R, C++ | LightGBM (leaf-wise gradient boosting with GOSS and EFB) |
| CatBoost | Python, R, C++ | CatBoost (ordered boosting with oblivious trees) |
| rpart | R | CART |
| Spark MLlib | Scala, Java, Python | Decision tree, Random Forest, Gradient Boosting |
| Weka | Java | J48 (C4.5 implementation), Random Forest, REPTree |
| dtreeviz | Python | Visualization library for scikit-learn, XGBoost, and Spark trees |
Scikit-learn's decision tree implementation is based on an optimized version of the CART algorithm. It uses Cython for performance-critical routines and supports both dense and sparse input data. Key classes include DecisionTreeClassifier for classification tasks and DecisionTreeRegressor for regression tasks, both of which support multi-output prediction, sample weighting, and cost-complexity pruning via the ccp_alpha parameter.
Imagine you are playing a guessing game with a friend who is thinking of an animal. You ask simple yes-or-no questions to narrow down the answer. "Does it have fur?" If yes, you ask "Is it bigger than a cat?" If no, you ask "Does it have feathers?" Each question helps you eliminate animals that do not fit, and after a few questions you can guess the right answer.
A decision tree works the same way with data. The computer looks at a bunch of examples and figures out which questions are the best ones to ask, meaning which questions split the examples into the most clear-cut groups. It arranges these questions into a tree shape, where the first question is at the top, follow-up questions come next, and the final answers are at the bottom. When it sees new data, it starts at the top and follows the branches based on the answers until it reaches a final prediction.