# Decision Tree

> Source: https://aiwiki.ai/wiki/decision_tree
> Updated: 2026-06-20
> Categories: Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## Introduction

A **decision tree** is a tree-structured predictive model in [machine learning](/wiki/machine_learning) that makes a prediction by asking a sequence of yes-or-no questions about an input's features, following a path of branching rules from a root node down to a leaf that outputs a class label (for [classification](/wiki/classification)) or a numeric value (for [regression](/wiki/regression)). Decision trees are among the most widely used and interpretable algorithms in [supervised learning](/wiki/supervised_learning): the two foundational tree methods, C4.5 and CART, were ranked the number 1 and number 3 most influential algorithms in data mining in a 2006 survey by the IEEE International Conference on Data Mining (ICDM).[17]

The model works by recursively splitting the input data into subsets based on the values of input features, forming a hierarchy of decision rules that ultimately produce a prediction at each terminal node. Decision trees are popular because of their intuitive structure, their interpretability, and their ability to handle both numerical and categorical data without extensive preprocessing.

A decision tree can be understood as a flowchart-like structure where each internal node represents a test on a feature (for example, "Is temperature greater than 30?" or "Is color equal to red?"), each branch represents the outcome of that test, and each leaf node represents a class label or a continuous value. The path from the root to a leaf encodes a conjunction of feature conditions that leads to a particular prediction.

Finding the optimal decision tree for a given dataset is NP-complete, as proven by Laurent Hyafil and Ronald Rivest in 1976.[1] Because of this computational intractability, all practical decision tree algorithms rely on greedy heuristics that make locally optimal splitting decisions at each node rather than searching for a globally optimal tree.

### What is a decision tree used for?

Decision trees are used for both classification (predicting a discrete category, such as whether an email is spam) and regression (predicting a continuous quantity, such as a house price). Beyond standalone prediction, decision trees are the building blocks of the most successful tabular-data models in applied machine learning: [random forests](/wiki/random_forest) and [gradient boosting](/wiki/gradient_boosting) ensembles such as [XGBoost](/wiki/xgboost), [LightGBM](/wiki/lightgbm), and [CatBoost](/wiki/catboost) combine hundreds or thousands of trees to achieve state-of-the-art accuracy. Trees are also valued in regulated domains (medicine, finance, law) where the ability to read off and audit the decision rules matters as much as raw accuracy.

## History

The development of decision tree algorithms spans more than six decades, with contributions from statisticians, computer scientists, and machine learning researchers.[12]

### AID (1963)

The earliest decision tree algorithm was **Automatic Interaction Detection (AID)**, published by James N. Morgan and John A. Sonquist in 1963 in the *Journal of the American Statistical Association*.[2] Morgan and Sonquist were motivated by the challenge of dealing with complex interactions among categorical variables in social survey data. They wanted to untangle the influence of factors such as age, education, ethnicity, and profession on a person's income. AID works by recursively splitting data at the root node into two child nodes, selecting binary splits on ordinal and nominal independent variables that most reduce the sum of squares of an interval dependent variable from its mean. Because interaction effects are a natural byproduct of the splitting process, Morgan and Sonquist called the procedure "automatic."[2] AID is considered the first regression tree algorithm in the published literature.[12]

### THAID (1973)

In the early 1970s, Morgan and colleagues extended AID with **THAID** (THeta Automatic Interaction Detection), which was designed for classification problems with categorical dependent variables. THAID used a different splitting criterion based on maximizing the proportion of correctly classified observations.

### CHAID (1980)

The **Chi-squared Automatic Interaction Detection (CHAID)** algorithm was developed by Gordon V. Kass, a statistician at the University of the Witwatersrand in South Africa. Kass published the method in his 1980 paper "An Exploratory Technique for Investigating Large Quantities of Categorical Data" in *Applied Statistics*.[3] CHAID is based on a formal extension of AID and THAID, and it uses chi-squared tests of significance (with Bonferroni correction) to determine the best splits.[3] Unlike AID and CART, CHAID can produce multi-way splits (not just binary), creating discrete groups at each node. CHAID was originally designed for classification and was later extended to handle regression problems as well.

### CART (1984)

The **CART (Classification and Regression Trees)** algorithm was introduced by [Leo Breiman](/wiki/leo_breiman), Jerome Friedman, Richard Olshen, and Charles Stone in their landmark 1984 book *Classification and Regression Trees*.[4] CART represented a major advance over earlier tree methods. It produces strictly binary trees (each node splits into exactly two children), handles both classification and regression tasks, introduces cost-complexity pruning for tree simplification, and provides theoretical guarantees for consistency.[4] Breiman and colleagues also proved conditions under which all recursive partitioning techniques are Bayes risk consistent.[4] CART remains one of the most widely used tree algorithms and is the basis for the tree implementation in [scikit-learn](/wiki/scikit_learn). The scikit-learn documentation states that "scikit-learn uses an optimized version of the CART algorithm; however, the scikit-learn implementation does not support categorical variables for now."[18]

### ID3 (1986)

**ID3 (Iterative Dichotomiser 3)** was developed by [Ross Quinlan](/wiki/ross_quinlan) and published in 1986 in the paper "Induction of Decision Trees" in *Machine Learning*.[5] ID3 uses [information gain](/wiki/information_gain) (based on [entropy](/wiki/entropy)) as the splitting criterion, selecting the feature with the highest information gain at each step.[5] ID3 follows a top-down, greedy search through the space of possible decision trees. A key limitation of ID3 is that it can only handle categorical features and does not include a pruning mechanism, making it prone to [overfitting](/wiki/overfitting). Quinlan's 1986 paper became one of the most cited works in machine learning, accumulating more than 18,000 citations.[19]

### C4.5 (1993)

**C4.5** is the successor to ID3, also developed by Ross Quinlan. Published in his 1993 book *C4.5: Programs for Machine Learning*, C4.5 addresses several limitations of ID3.[6] The algorithm introduces gain ratio as a splitting criterion (which normalizes information gain by the intrinsic information of a split, reducing bias toward features with many values), handles both categorical and continuous features, includes mechanisms for tree pruning, and can deal with missing values.[6] C4.5 was ranked the number 1 most influential algorithm in data mining by a survey of the IEEE International Conference on Data Mining (ICDM) presented in December 2006 and published as "Top 10 Algorithms in Data Mining" by Wu and colleagues; CART appeared on the same list at number 3.[17]

### C5.0 (1997)

**C5.0** is the commercial successor to C4.5, released by Ross Quinlan in the late 1990s. C5.0 offers improved speed and memory efficiency, support for boosting to improve accuracy, handling of very large datasets, and a more compact rule set output. While C5.0 is proprietary, its predecessor C4.5 remains freely available and widely used in academic research.

The following table summarizes the major milestones in the history of decision tree algorithms:

| Year | Algorithm | Creator(s) | Key Contribution |
|------|-----------|------------|-------------------|
| 1963 | [AID](/wiki/automatic_interaction_detection) | Morgan and Sonquist | First regression tree algorithm; binary splits on survey data |
| 1973 | THAID | Morgan et al. | Extended AID to classification with categorical dependent variables |
| 1976 | NP-completeness proof | Hyafil and Rivest | Proved that constructing optimal binary decision trees is NP-complete |
| 1980 | [CHAID](/wiki/chaid) | Gordon V. Kass | Chi-squared significance testing; multi-way splits |
| 1984 | [CART](/wiki/cart) | Breiman, Friedman, Olshen, Stone | Binary trees for classification and regression; cost-complexity pruning |
| 1986 | [ID3](/wiki/id3) | Ross Quinlan | Information gain splitting criterion; top-down greedy approach |
| 1993 | [C4.5](/wiki/c4_5) | Ross Quinlan | Gain ratio; handles continuous features and missing values; pruning |
| 1997 | [C5.0](/wiki/c5_0) | Ross Quinlan | Faster, more memory-efficient; boosting support |

## Structure of a decision tree

### Nodes

A decision tree consists of three types of nodes:

- **Root node**: The topmost node in the tree that represents the entire dataset. The root node contains the first splitting condition applied to all training samples.

- **Internal nodes (decision nodes)**: Nodes that represent decision points, which split the dataset based on a particular feature value. Each internal node tests a condition on one feature and routes samples to child nodes based on the outcome.

- **Leaf nodes (terminal nodes)**: Terminal nodes that correspond to the final predictions or outcomes of the model. In classification trees, a leaf node typically stores the majority class of the training samples that reached it. In regression trees, it stores the mean (or sometimes the median) of the target values.

### Branches

**Branches** in a decision tree connect nodes and represent the decision path taken based on the feature values. Each branch corresponds to a specific decision rule, such as "age < 30" or "salary > 50,000". For binary trees (like CART), each internal node has exactly two branches. For multi-way trees (like CHAID or ID3 with categorical features), a node may have as many branches as there are distinct values of the splitting feature.

### Tree depth and size

The **depth** of a decision tree is the length of the longest path from the root node to any leaf node. The **size** of a tree refers to the total number of nodes. Deeper trees can capture more complex patterns in the data but are more likely to overfit. Controlling tree depth is one of the primary mechanisms for regularizing decision trees.

## Splitting criteria

The core of any decision tree algorithm is the criterion used to select the best feature and threshold for splitting at each node. The goal is to create child nodes that are more "pure" (homogeneous) than the parent node with respect to the target variable.

### Gini impurity

**[Gini impurity](/wiki/gini_impurity)** measures the probability of misclassifying a randomly chosen element from the dataset if it were labeled according to the distribution of labels in the subset. For a node with *K* classes, the Gini impurity is defined as:

```
Gini(t) = 1 - sum(p_i^2) for i = 1 to K
```

where `p_i` is the proportion of samples belonging to class *i* at node *t*. A Gini impurity of 0 indicates a perfectly pure node (all samples belong to one class), while the maximum impurity for a binary classification problem is 0.5 (equal split between two classes). CART uses Gini impurity as its default splitting criterion for classification.[4]

When evaluating a candidate split, the algorithm computes the weighted average Gini impurity of the child nodes and selects the split that yields the greatest reduction in impurity (the "Gini gain").

### Entropy and information gain

**Entropy** is a concept from [information theory](/wiki/information_theory) that quantifies the uncertainty or disorder in a dataset. For a node with *K* classes, entropy is defined as:

```
Entropy(t) = - sum(p_i * log2(p_i)) for i = 1 to K
```

where `p_i` is the proportion of samples belonging to class *i*. Entropy ranges from 0 (perfectly pure node) to log2(K) (uniform distribution across all classes).

**Information gain** measures the reduction in entropy achieved by splitting on a particular feature:

```
IG(t, feature) = Entropy(t) - sum((|t_v| / |t|) * Entropy(t_v))
```

where *t_v* represents the child nodes created by the split and |t_v|/|t| is the fraction of samples routed to each child. ID3 uses information gain as its splitting criterion.[5] A known limitation of information gain is that it favors features with many distinct values (high cardinality), which led to the development of gain ratio in C4.5.[6]

### Gain ratio

**Gain ratio**, introduced in C4.5, normalizes information gain by the intrinsic information (split information) of the feature:[6]

```
GainRatio(t, feature) = IG(t, feature) / SplitInfo(t, feature)
```

where `SplitInfo(t, feature) = - sum((|t_v| / |t|) * log2(|t_v| / |t|))`. This normalization penalizes features that produce many small partitions, reducing the bias inherent in information gain.

### Variance reduction

For regression trees, the splitting criterion is typically **variance reduction** (also known as reduction in sum of squared errors). The algorithm selects the split that minimizes the weighted variance of the target variable in the child nodes:

```
VarianceReduction = Var(t) - sum((|t_v| / |t|) * Var(t_v))
```

where `Var(t)` is the variance of the target variable at node *t*. Each leaf node in a regression tree predicts the mean of the target values of the training samples that reached it. CART uses this criterion for regression tasks.[4]

The following table compares the main splitting criteria:

| Criterion | Task | Used By | Range | Notes |
|-----------|------|---------|-------|-------|
| [Gini impurity](/wiki/gini_impurity) | Classification | CART, [scikit-learn](/wiki/scikit_learn) | 0 to 0.5 (binary) | Computationally simpler; no logarithm |
| [Entropy](/wiki/entropy) / Information Gain | Classification | ID3, C4.5 | 0 to log2(K) | Based on information theory; slightly more expensive to compute |
| Gain Ratio | Classification | C4.5, C5.0 | Normalized | Corrects information gain bias toward high-cardinality features |
| Variance Reduction | Regression | CART | 0 to Var(y) | Minimizes sum of squared errors in child nodes |
| Chi-squared test | Classification | CHAID | Significance level | Uses statistical significance for split selection |

## Classification trees vs. regression trees

Decision trees can be applied to both classification and regression problems. The fundamental tree-growing procedure is the same in both cases (recursive binary or multi-way partitioning), but several details differ.

| Aspect | Classification tree | Regression tree |
|--------|--------------------|-----------------|
| Target variable | Categorical (discrete class labels) | Continuous (real-valued) |
| Splitting criterion | Gini impurity, entropy, gain ratio, chi-squared test | Variance reduction (MSE), mean absolute error (MAE) |
| Leaf prediction | Majority class among training samples at the leaf | Mean (or median) of target values at the leaf |
| Probability output | Class probability distribution at each leaf | Point estimate (mean) at each leaf |
| Evaluation metrics | Accuracy, F1 score, AUC-ROC | Mean squared error, mean absolute error, R-squared |
| Common algorithms | CART, ID3, C4.5, CHAID | CART, AID |

In a [classification model](/wiki/classification_model), each leaf outputs the class that appears most frequently among the training samples that arrived at that leaf. The tree can also output a probability vector by reporting the proportion of each class at the leaf. In a [regression model](/wiki/regression_model), each leaf outputs a constant value, typically the mean of the target variable for the training samples at that leaf. Some implementations use the median instead of the mean to produce predictions that are more robust to outliers.

CART is unique in that the same algorithmic framework handles both tasks. The only difference is the impurity measure: Gini impurity (or entropy) for classification and variance reduction (or mean absolute error) for regression.[4]

## The CART algorithm in detail

CART (Classification and Regression Trees) is the most widely implemented decision tree algorithm and serves as the foundation for many modern tree-based methods.[4][12] Below is a detailed description of how CART works.

### Tree growing

1. **Start at the root node** with the entire training dataset.
2. **Evaluate all possible splits**: For each feature, consider every possible threshold (for continuous features) or subset partition (for categorical features). For continuous features, the candidate thresholds are typically the midpoints between consecutive distinct values in the sorted feature column.
3. **Select the best split**: Choose the feature and threshold that maximizes the reduction in the impurity measure (Gini impurity for classification, variance reduction for regression).
4. **Create two child nodes**: Route samples satisfying the split condition to the left child and the remaining samples to the right child.
5. **Recurse**: Repeat steps 2 through 4 for each child node.
6. **Stop** when a stopping condition is met (for example, the node is pure, the maximum depth is reached, or the number of samples in a node falls below a minimum threshold).

The computational cost of growing a CART tree on a dataset with *n* samples and *p* features is O(p * n * log(n)) at each node (sorting each feature costs O(n log n), and there are *p* features to consider). The total cost depends on the depth of the tree, which in the worst case yields O(p * n^2 * log(n)) for a fully grown tree.

### Handling different feature types

CART handles continuous and categorical features differently:

- **Continuous features**: The algorithm searches over all possible thresholds of the form "feature <= value" and selects the one that produces the best split.
- **Categorical features**: For a feature with *m* categories, CART considers all possible binary partitions of the categories into two groups. For classification with two classes, there is an efficient shortcut: sort the categories by the proportion of one class and search for the optimal split point along this ordering. This reduces the number of candidate splits from 2^(m-1) - 1 to m - 1.

### Handling missing values

CART handles missing values through **surrogate splits**. When the best split feature has missing values for some samples, the algorithm identifies surrogate features that produce splits most similar to the primary split. Samples with missing values on the primary feature are then routed using the best available surrogate split. This approach avoids the need to impute missing values before training.[4]

Other algorithms handle missing values differently. C4.5 distributes samples with missing values across all child nodes, weighting each branch by the fraction of non-missing samples that follow it.[6] XGBoost learns a default direction for missing values during training: at each split, samples with missing values are sent left or right depending on which direction reduces the loss more.[13] LightGBM similarly assigns missing values to whichever child reduces the objective.[14]

### Predictions

- **Classification**: Each leaf node stores the majority class among the training samples that reached it. The predicted class for a new sample is determined by traversing the tree from root to leaf. CART can also output class probabilities based on the class distribution at the leaf.
- **Regression**: Each leaf node stores the mean of the target values of the training samples that reached it. The prediction for a new sample is this mean value.

## Pruning

[Pruning](/wiki/pruning) is the process of simplifying a decision tree by removing branches that do not significantly improve predictive performance. Without pruning, decision trees tend to grow very deep, memorizing noise in the training data and performing poorly on unseen data. There are two main categories of pruning.

### Pre-pruning (early stopping)

Pre-pruning halts the growth of the tree during construction by imposing constraints. Common pre-pruning parameters include:

| Parameter | Description | Effect |
|-----------|-------------|--------|
| Maximum depth | Limits the longest path from root to any leaf | Prevents excessively deep trees |
| Minimum samples per split | Requires a minimum number of samples at a node before it can be split | Prevents splits on very small subsets |
| Minimum samples per leaf | Requires each leaf to contain at least a specified number of samples | Ensures leaf predictions are based on sufficient data |
| Maximum number of leaf nodes | Limits the total number of leaves in the tree | Controls overall tree complexity |
| Minimum impurity decrease | Requires a split to achieve at least a specified reduction in impurity | Prevents trivial splits |

Pre-pruning is computationally efficient because it avoids growing the full tree. However, it can be overly conservative: a split that appears unhelpful on its own may enable useful splits further down the tree (the "horizon effect").

### Post-pruning

Post-pruning first grows the tree to its full depth (or near-full depth) and then removes or collapses branches that do not improve generalization. Since the tree is fully grown before pruning begins, post-pruning can consider all possible patterns before discarding weak or irrelevant ones.

#### Cost-complexity pruning (minimal cost-complexity pruning)

Cost-complexity pruning, introduced with the CART algorithm, is the most widely used post-pruning method.[4] It defines a cost-complexity measure for any subtree *T*:

```
R_alpha(T) = R(T) + alpha * |T|
```

where `R(T)` is the total misclassification rate (or sum of squared errors for regression) of the tree, `|T|` is the number of leaf nodes, and `alpha` is a non-negative complexity parameter. The algorithm works as follows:

1. For each internal node, compute the "effective alpha" at which pruning that node (replacing its subtree with a leaf) becomes worthwhile.
2. Prune the node with the smallest effective alpha first (the "weakest link").
3. Repeat, generating a sequence of progressively smaller subtrees.
4. Select the optimal subtree using cross-validation or a held-out validation set.

Greater values of alpha increase the number of nodes pruned, yielding simpler trees. The optimal alpha is typically selected via k-fold [cross-validation](/wiki/cross-validation).

#### Reduced error pruning

Reduced error pruning is one of the simplest post-pruning methods. It requires a separate validation set. Starting from the leaves, each internal node is tentatively replaced with a leaf labeled with the majority class. If this replacement does not decrease accuracy on the validation set, the change is kept. This process continues until no further pruning improves validation accuracy. Reduced error pruning is fast and straightforward but requires setting aside data for validation, which can be wasteful when data is scarce.

#### Error-based pruning

Error-based pruning, used in C4.5, estimates the error rate at each node using a confidence interval based on the training data. Subtrees whose estimated error exceeds that of a single leaf replacement are pruned. This method does not require a separate validation set.[6]

## Oblique decision trees

Standard decision tree algorithms like CART, ID3, and C4.5 create **axis-aligned** (univariate) splits, meaning each split tests a single feature at a time. This produces rectangular decision regions in the feature space. **Oblique decision trees** (also called multivariate decision trees) generalize this by splitting on linear combinations of features. Instead of a condition like "feature_j <= threshold," an oblique split takes the form:

```
w_1 * x_1 + w_2 * x_2 + ... + w_p * x_p <= threshold
```

where `w_1, ..., w_p` are learned weights. This allows the tree to produce decision boundaries that are diagonal or angled relative to the feature axes, which can represent certain patterns much more compactly than a staircase of axis-aligned splits.

Oblique trees often achieve higher generalization accuracy with fewer nodes than axis-aligned trees, especially on datasets where the true decision boundary is not parallel to any feature axis.[7] However, they come with trade-offs. First, finding the optimal oblique split at each node is computationally expensive because the search space is continuous and high-dimensional. Second, oblique splits are harder for humans to interpret because they involve weighted combinations of multiple features rather than simple thresholds on a single variable.

Several algorithms have been proposed for constructing oblique trees. The OC1 system (Murthy, Kasif, and Salzberg, 1994) uses a randomized perturbation procedure and coordinate descent to optimize oblique splits.[7] CART-LC (a variant of CART with linear combination splits) optimizes split coefficients by iterating over one weight at a time.[4] More recent approaches use linear programming, support vector machines, or neural network components to find oblique hyperplanes at each node.

## Multi-output decision trees

A **multi-output decision tree** predicts multiple target variables simultaneously using a single tree structure. Instead of predicting a single class label or a single continuous value, each leaf node stores a vector of predictions (one per target). The tree is grown using a splitting criterion that aggregates the impurity or variance across all targets.

For multi-output classification, the tree minimizes the average Gini impurity or entropy across all target variables at each split. For multi-output regression, it minimizes the average variance reduction across all targets.

Multi-output trees are useful in scenarios where the target variables are correlated, because a single tree can capture shared structure in the input space that affects multiple outputs simultaneously. Applications include predicting multiple properties of a chemical compound, multi-label classification tasks, and structured output problems such as image completion.

In scikit-learn, both `DecisionTreeClassifier` and `DecisionTreeRegressor` natively support multi-output prediction. When fit on a target array Y of shape (n_samples, n_outputs), the estimator learns a single tree that produces predictions for all outputs at once.

## Feature importance

Decision trees and tree-based [ensemble](/wiki/ensemble) methods provide natural measures of [feature importance](/wiki/feature_importances), which indicate how much each feature contributes to the model's predictions.

### Mean decrease in impurity (MDI)

Also known as **impurity-based feature importance**, MDI measures the total reduction in the splitting criterion (Gini impurity, entropy, or mean squared error) attributable to each feature, averaged across all trees in the ensemble. For a single tree, the importance of a feature is the sum of the impurity decreases at all nodes where that feature is used for splitting, weighted by the number of samples reaching each node.

MDI is computed during training and is therefore very fast. However, it has known limitations: it is biased toward high-cardinality features (features with many distinct values), and it can assign high importance to features that overfit the training data but are not genuinely predictive.

### Permutation importance

**Permutation importance**, introduced by Breiman in his 2001 Random Forest paper, measures feature importance by randomly shuffling the values of a single feature and observing the resulting decrease in model performance on a held-out dataset.[10] If shuffling a feature causes a large drop in accuracy (or increase in error), that feature is considered important.

Permutation importance is model-agnostic (it can be applied to any model, not just trees), avoids the high-cardinality bias of MDI, and can be computed on test data to reflect genuine predictive importance. Its main drawback is that it can be slow for large datasets and is affected by feature correlations (correlated features may share importance).

### SHAP values

**[SHAP](/wiki/shap) (SHapley Additive exPlanations)**, developed by Scott Lundberg and Su-In Lee in 2017, provides a unified framework for feature importance based on Shapley values from cooperative game theory.[15] TreeSHAP is an efficient algorithm for computing exact SHAP values for tree-based models in polynomial time.[15] SHAP values provide both global feature importance (average absolute SHAP value across all samples) and local explanations (the contribution of each feature to an individual prediction).

## Ensemble methods based on decision trees

Single decision trees have well-known limitations, including high variance and a tendency to overfit.[8] Ensemble methods combine multiple trees to produce more accurate and robust predictions. The two main families of tree-based ensembles are bagging and boosting.

### Bagging and random forests

**[Bootstrap aggregating](/wiki/bagging) (bagging)**, introduced by Leo Breiman in 1996, trains multiple decision trees on different bootstrap samples (random samples drawn with replacement) from the training data.[8] The final prediction is the average (regression) or majority vote (classification) of all trees. Bagging reduces variance without increasing bias.[8]

**[Random forest](/wiki/random_forest)**, proposed by Leo Breiman in his 2001 paper "Random Forests" published in *Machine Learning*, extends bagging by adding random feature selection.[10] At each split in each tree, only a random subset of features (typically the square root of the total number of features for classification, or one-third for regression) is considered as candidates. This additional randomness further decorrelates the individual trees, leading to lower generalization error.[10] In the paper's abstract, Breiman observes of random forests that "because of the Law of Large Numbers they do not overfit."[10] Key properties of random forests include:

- They do not overfit as more trees are added (Breiman showed this using the law of large numbers).[10]
- They provide out-of-bag (OOB) error estimates, which can serve as an unbiased estimate of the test error without needing a separate validation set.
- They can estimate feature importance through permutation importance or mean decrease in impurity.
- They scale well to large datasets and high-dimensional feature spaces.

### Gradient boosting

**[Gradient boosting](/wiki/gradient_boosting)**, formalized by Jerome Friedman in his 2001 paper "Greedy Function Approximation: A Gradient Boosting Machine" published in *The Annals of Statistics*, builds trees sequentially rather than independently.[11] Each new tree is trained to predict the residual errors (or, more precisely, the negative gradient of the loss function) of the current ensemble.[11] The predictions of all trees are summed to produce the final output. Gradient boosting typically uses shallow trees (often called "stumps" for depth-1 trees, or trees with depth 3 to 8) as weak learners. Key hyperparameters include the learning rate (shrinkage), number of trees, and tree depth.

AdaBoost, introduced by Yoav Freund and Robert Schapire in 1997, is a related boosting method that adjusts sample weights at each iteration, giving more weight to misclassified samples.[9] AdaBoost can be viewed as a special case of gradient boosting with an exponential loss function.[11]

### Modern gradient boosting implementations

Several high-performance implementations of gradient boosted trees have become dominant in applied machine learning and competitive data science:

#### XGBoost

**[XGBoost](/wiki/xgboost)** (eXtreme Gradient Boosting), developed by Tianqi Chen and Carlos Guestrin at the University of Washington, was published in the 2016 paper "XGBoost: A Scalable Tree Boosting System" at the ACM SIGKDD conference.[13] The paper describes XGBoost as a system that "is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges," and reports that among the 29 winning solutions published on the Kaggle competition blog during 2015, 17 used XGBoost.[13] XGBoost introduced several innovations: a regularized objective function that penalizes tree complexity (including both L1 and L2 regularization on leaf weights), a sparsity-aware algorithm for handling missing values, a weighted quantile sketch for approximate split finding, and system-level optimizations such as cache-aware access and out-of-core computation.[13] XGBoost uses level-wise (breadth-first) tree growth by default.

#### LightGBM

**[LightGBM](/wiki/lightgbm)** (Light Gradient Boosting Machine), developed by a team led by Guolin Ke at Microsoft Research, was published in the 2017 [NeurIPS](/wiki/neurips) paper "LightGBM: A Highly Efficient Gradient Boosting Decision Tree."[14] LightGBM introduces two key techniques: **Gradient-based One-Side Sampling (GOSS)**, which excludes data instances with small gradients to speed up information gain estimation; and **Exclusive Feature Bundling (EFB)**, which bundles mutually exclusive features to reduce dimensionality.[14] LightGBM also uses leaf-wise (best-first) tree growth rather than level-wise growth, which can produce deeper, more asymmetric trees that converge faster. LightGBM achieves training speedups of over 20x compared to conventional gradient boosting while maintaining comparable accuracy.[14]

#### CatBoost

**[CatBoost](/wiki/catboost)** (Categorical Boosting), developed by the machine learning team at Yandex, was described in the 2018 NeurIPS paper "CatBoost: Unbiased Boosting with Categorical Features" by Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin.[16] CatBoost introduces **ordered boosting**, a permutation-driven approach that addresses prediction shift (a form of target leakage present in standard gradient boosting), and an efficient algorithm for processing categorical features directly without manual encoding.[16] CatBoost uses **oblivious (symmetric) trees** as base learners, where all nodes at the same depth level use the same feature and split threshold.[16] This symmetric structure enables efficient vectorized evaluation, particularly on GPUs.

The following table compares the major tree-based ensemble algorithms:

| Algorithm | Year | Author(s) | Ensemble Type | Tree Growth | Key Innovation |
|-----------|------|-----------|---------------|-------------|----------------|
| [Bagging](/wiki/bagging) | 1996 | Breiman | Parallel (bootstrap) | Full depth | Variance reduction via averaging |
| [Random Forest](/wiki/random_forest) | 2001 | Breiman | Parallel (bootstrap + feature subspace) | Full depth | Random feature selection at each split |
| [AdaBoost](/wiki/adaboost) | 1997 | Freund, Schapire | Sequential (boosting) | Shallow (stumps) | Adaptive sample weighting |
| [Gradient Boosting](/wiki/gradient_boosting) | 2001 | Friedman | Sequential (boosting) | Shallow | Fits residuals via gradient descent in function space |
| [XGBoost](/wiki/xgboost) | 2016 | Chen, Guestrin | Sequential (boosting) | Level-wise | Regularized objective; sparsity-aware splits |
| [LightGBM](/wiki/lightgbm) | 2017 | Ke et al. (Microsoft) | Sequential (boosting) | Leaf-wise | GOSS and EFB for speed; leaf-wise growth |
| [CatBoost](/wiki/catboost) | 2018 | Prokhorenkova et al. (Yandex) | Sequential (boosting) | Symmetric (oblivious) | Ordered boosting; native categorical feature handling |

## Visualization and interpretability

One of the strongest advantages of decision trees is that they can be visualized in a way that humans can inspect and understand. Several tools and techniques exist for visualizing decision trees and interpreting their predictions.

### Graphical tree rendering

The most common visualization is a node-link diagram where each node displays the splitting feature, threshold, impurity value, number of samples, and (for classification) the class distribution. Nodes are typically colored by the majority class or the predicted value, with color intensity reflecting the node's purity. This rendering allows practitioners to trace any prediction from root to leaf and understand why the model made a particular decision.

In scikit-learn, two primary functions support tree visualization:

- **`sklearn.tree.plot_tree`**: Added in scikit-learn 0.21, this function renders a decision tree directly as a matplotlib figure without requiring external software. It accepts parameters such as `max_depth` (to limit the displayed depth), `feature_names`, `class_names`, `filled` (to color nodes by class or value), and `rounded` (for rounded node boxes).
- **`sklearn.tree.export_graphviz`**: This function exports the tree in DOT format, which can be rendered into high-resolution images using the Graphviz software. It provides more control over aesthetic details and can produce output in formats like PNG, PDF, or SVG.

Additionally, the `sklearn.tree.export_text` function produces a plain-text representation of the tree rules, which is useful for logging, debugging, or environments where graphical output is not available.

### Rule extraction

Decision trees can be converted into a set of if-then rules, one rule per leaf. Each rule is the conjunction of all split conditions along the path from the root to that leaf. Rule extraction is useful for embedding tree-based logic into production systems, regulatory documentation, or clinical guidelines where a visual diagram is insufficient.

### Why are decision trees considered interpretable?

Decision trees are often cited as one of the most interpretable machine learning models. For shallow trees (depth 3 to 5), a human can follow the decision logic easily by reading the sequence of feature tests from root to leaf, and every prediction comes with an explicit rule path that explains it. However, interpretability degrades rapidly as trees grow deeper. A tree with hundreds of leaves is no easier to interpret than a black-box model. This is one reason why post-pruning and depth constraints are important not only for generalization but also for practical usability.

For [ensemble](/wiki/ensemble) methods built on decision trees, individual tree interpretability is largely lost because the final prediction aggregates hundreds or thousands of trees. Techniques like SHAP values, partial dependence plots, and feature importance scores are used to interpret ensemble models at a higher level.

## Advantages

- **[Interpretability](/wiki/interpretability)**: Decision trees are among the most interpretable machine learning models. The tree structure can be visualized as a flowchart, and the decision rules at each node are easy to understand. This makes decision trees popular in fields where model transparency is important, such as medicine, finance, and law.
- **No feature scaling required**: Unlike [neural networks](/wiki/neural_network), [support vector machines](/wiki/support_vector_machine_svm), or [logistic regression](/wiki/logistic_regression), decision trees do not require feature normalization or standardization. They are invariant to monotonic transformations of individual features.
- **Handles mixed data types**: Decision trees can handle both numerical and categorical features natively (depending on the implementation), as well as datasets with mixed feature types.
- **Handles missing values**: Algorithms such as CART (via surrogate splits), XGBoost, and LightGBM have built-in mechanisms for handling missing values without requiring imputation.
- **Nonlinear relationships**: Decision trees can capture nonlinear relationships and complex interactions between features without requiring explicit feature engineering.
- **Fast inference**: Prediction with a decision tree requires only a series of comparisons along a single path from root to leaf, making inference very fast even for large trees.
- **Feature selection**: The tree-building process implicitly performs feature selection, since only the most informative features are used for splitting. Uninformative features are simply ignored.

## Limitations

- **Overfitting**: Decision trees are prone to overfitting, particularly when grown to full depth without pruning or other regularization. A deep tree can memorize the training data, including noise, leading to poor generalization.
- **High variance**: Small changes in the training data can produce very different tree structures. This instability is a consequence of the greedy, top-down construction process, where a different split at the root node cascades into an entirely different tree.
- **Axis-aligned splits**: Standard decision tree algorithms (CART, ID3, C4.5) split on one feature at a time, producing axis-aligned (orthogonal) decision boundaries. This means that diagonal or curved decision boundaries can only be approximated by a staircase of axis-aligned splits, which may require a very deep tree. Oblique decision trees address this limitation at the cost of increased computational complexity and reduced interpretability.
- **Biased toward high-cardinality features**: Some splitting criteria (such as information gain) favor features with many distinct values, which can lead to suboptimal trees. Gain ratio and other corrections address this to some extent.
- **Greedy construction**: Decision trees are typically built using a greedy algorithm that selects the locally optimal split at each node. There is no guarantee that the resulting tree is globally optimal. Finding the optimal decision tree is NP-complete in general (Hyafil and Rivest, 1976).[1]
- **Imbalanced data**: Decision trees can be biased toward the majority class in imbalanced datasets unless class weights or other balancing techniques are used.
- **Piecewise constant predictions**: Decision tree predictions are discontinuous, piecewise constant functions. They cannot extrapolate beyond the range of training data and may produce blocky approximations of smooth target functions. This is particularly noticeable in regression tasks.
- **Limited expressiveness of single trees**: A single decision tree often cannot match the accuracy of ensemble methods, neural networks, or other more flexible models on complex tasks. This is why ensemble methods like random forests and gradient boosting are preferred in practice.

## When should you use a decision tree?

A single decision tree is a good choice when interpretability is the priority, when a fast and transparent baseline is needed, or when the decision logic must be auditable by domain experts or regulators (for example, a clinical decision rule or a credit policy). When raw predictive accuracy on tabular data matters more than interpretability, a tree-based ensemble is usually preferred: random forests are robust and require little tuning, while gradient boosting libraries such as XGBoost, LightGBM, and CatBoost often deliver the strongest accuracy on structured datasets and remain among the most successful models in machine learning competitions.[13] For unstructured data such as images, audio, or natural language, [deep learning](/wiki/deep_learning) models generally outperform tree-based methods.

## Applications

Decision trees and tree-based ensembles are used across a wide range of domains:

- **Medical diagnosis**: Decision trees are used to build clinical decision support systems that help physicians diagnose diseases based on patient symptoms and test results. Their interpretability is especially valuable in healthcare settings where clinicians need to understand and trust the model's reasoning.
- **Credit scoring and finance**: Banks and financial institutions use tree-based models to assess credit risk, detect fraud, and make lending decisions. Regulatory requirements for model transparency make decision trees and interpretable ensembles attractive choices.
- **Customer analytics**: Tree-based models are used for customer segmentation, churn prediction, and recommendation systems in marketing and e-commerce.
- **[Natural language processing](/wiki/natural_language_processing)**: Gradient boosted trees are used for feature-rich text classification, sentiment analysis, and ranking tasks in information retrieval.
- **Bioinformatics**: Decision trees and random forests are used for gene expression analysis, protein function prediction, and drug discovery.
- **Remote sensing**: Tree-based classifiers are widely used for land cover classification from satellite imagery.
- **Manufacturing and quality control**: Decision trees help identify which process parameters most influence product defects, enabling root-cause analysis on production lines.

## Software implementations

Decision trees and tree-based ensembles are available in most major machine learning libraries:

| Library | Language | Tree Algorithms Available |
|---------|----------|---------------------------|
| [scikit-learn](/wiki/scikit_learn) | Python | CART (DecisionTreeClassifier, DecisionTreeRegressor), Random Forest, Gradient Boosting, AdaBoost, ExtraTreesClassifier |
| [XGBoost](/wiki/xgboost) | Python, R, C++, Java | XGBoost (regularized gradient boosting) |
| [LightGBM](/wiki/lightgbm) | Python, R, C++ | LightGBM (leaf-wise gradient boosting with GOSS and EFB) |
| [CatBoost](/wiki/catboost) | Python, R, C++ | CatBoost (ordered boosting with oblivious trees) |
| rpart | R | CART |
| Spark MLlib | Scala, Java, Python | Decision tree, Random Forest, Gradient Boosting |
| [Weka](/wiki/weka) | Java | J48 (C4.5 implementation), Random Forest, REPTree |
| dtreeviz | Python | Visualization library for scikit-learn, XGBoost, and Spark trees |

Scikit-learn's decision tree implementation is based on an optimized version of the CART algorithm. It uses Cython for performance-critical routines and supports both dense and sparse input data. Key classes include `DecisionTreeClassifier` for classification tasks and `DecisionTreeRegressor` for regression tasks, both of which support multi-output prediction, sample weighting, and cost-complexity pruning via the `ccp_alpha` parameter.

## Explain like I'm 5 (ELI5)

Imagine you are playing a guessing game with a friend who is thinking of an animal. You ask simple yes-or-no questions to narrow down the answer. "Does it have fur?" If yes, you ask "Is it bigger than a cat?" If no, you ask "Does it have feathers?" Each question helps you eliminate animals that do not fit, and after a few questions you can guess the right answer.

A decision tree works the same way with data. The computer looks at a bunch of examples and figures out which questions are the best ones to ask, meaning which questions split the examples into the most clear-cut groups. It arranges these questions into a tree shape, where the first question is at the top, follow-up questions come next, and the final answers are at the bottom. When it sees new data, it starts at the top and follows the branches based on the answers until it reaches a final prediction.

## References

1. Hyafil, L., & Rivest, R. L. (1976). "Constructing Optimal Binary Decision Trees is NP-Complete." *Information Processing Letters*, 5(1), 15-17.
2. Morgan, J. N., & Sonquist, J. A. (1963). "Problems in the Analysis of Survey Data, and a Proposal." *Journal of the American Statistical Association*, 58(302), 415-434.
3. Kass, G. V. (1980). "An Exploratory Technique for Investigating Large Quantities of Categorical Data." *Applied Statistics*, 29(2), 119-127.
4. Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). *Classification and Regression Trees*. Wadsworth.
5. Quinlan, J. R. (1986). "Induction of Decision Trees." *Machine Learning*, 1(1), 81-106.
6. Quinlan, J. R. (1993). *C4.5: Programs for Machine Learning*. Morgan Kaufmann.
7. Murthy, S. K., Kasif, S., & Salzberg, S. (1994). "A System for Induction of Oblique Decision Trees." *Journal of Artificial Intelligence Research*, 2, 1-32.
8. Breiman, L. (1996). "Bagging Predictors." *Machine Learning*, 24(2), 123-140.
9. Freund, Y., & Schapire, R. E. (1997). "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting." *Journal of Computer and System Sciences*, 55(1), 119-139.
10. Breiman, L. (2001). "Random Forests." *Machine Learning*, 45(1), 5-32.
11. Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." *The Annals of Statistics*, 29(5), 1189-1232.
12. Loh, W.-Y. (2014). "Fifty Years of Classification and Regression Trees." *International Statistical Review*, 82(3), 329-348.
13. Chen, T., & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 785-794. arXiv:1603.02754.
14. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." *Advances in Neural Information Processing Systems*, 30.
15. Lundberg, S. M., & Lee, S.-I. (2017). "A Unified Approach to Interpreting Model Predictions." *Advances in Neural Information Processing Systems*, 30.
16. Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., & Gulin, A. (2018). "CatBoost: Unbiased Boosting with Categorical Features." *Advances in Neural Information Processing Systems*, 31.
17. Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., et al. (2008). "Top 10 Algorithms in Data Mining." *Knowledge and Information Systems*, 14(1), 1-37. (C4.5 ranked 1, CART ranked 3; algorithms identified at IEEE ICDM, December 2006.)
18. scikit-learn developers. "Decision Trees: Tree algorithms ID3, C4.5, C5.0 and CART." scikit-learn User Guide. https://scikit-learn.org/stable/modules/tree.html
19. Quinlan, J. R. (1986). "Induction of Decision Trees." *Machine Learning*, 1(1), 81-106. Citation count per Semantic Scholar and Google Scholar (more than 18,000 citations as of 2024).

