See also: Machine learning terms
In the field of machine learning, a decision tree is a tree-structured predictive model used for both classification and regression tasks. The model works by recursively splitting the input data into subsets based on the values of input features, forming a hierarchy of decision rules that ultimately produce a prediction at each terminal node. Decision trees are among the most widely used algorithms in supervised learning because of their intuitive structure, interpretability, and ability to handle both numerical and categorical data without extensive preprocessing.
A decision tree can be understood as a flowchart-like structure where each internal node represents a test on a feature (for example, "Is temperature greater than 30?" or "Is color equal to red?"), each branch represents the outcome of that test, and each leaf node represents a class label or a continuous value. The path from the root to a leaf encodes a conjunction of feature conditions that leads to a particular prediction.
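As a minimal sketch of this flowchart structure, the following uses scikit-learn (whose tree implementation is discussed later in this article) to fit a shallow tree on the Iris dataset — an illustrative choice — and print its root-to-leaf rules:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Limit depth so the resulting flowchart stays small enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each printed rule is one root-to-leaf path: a conjunction of feature tests.
print(export_text(tree, feature_names=iris.feature_names))
print(tree.predict(iris.data[:1]))  # route one sample from the root down to a leaf
```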
Constructing an optimal binary decision tree for a given dataset is NP-complete, as proven by Laurent Hyafil and Ronald Rivest in 1976. Because of this computational intractability, practical decision tree algorithms rely on greedy heuristics that make locally optimal splitting decisions at each node rather than searching for a globally optimal tree.
The development of decision tree algorithms spans more than six decades, with contributions from statisticians, computer scientists, and machine learning researchers.
The earliest decision tree algorithm was Automatic Interaction Detection (AID), published by James N. Morgan and John A. Sonquist in 1963 in the Journal of the American Statistical Association. Morgan and Sonquist were motivated by the challenge of dealing with complex interactions among categorical variables in social survey data. They wanted to untangle the influence of factors such as age, education, ethnicity, and profession on a person's income. AID works by recursively splitting the data, beginning at the root node, into two child nodes, selecting binary splits on ordinal and nominal independent variables that most reduce the sum of squared deviations of an interval-scaled dependent variable from its mean. Because interaction effects are a natural byproduct of the splitting process, Morgan and Sonquist called the procedure "automatic." AID is considered the first regression tree algorithm in the published literature.
In the early 1970s, Morgan and colleagues extended AID with THAID (THeta Automatic Interaction Detection), which was designed for classification problems with categorical dependent variables. THAID used a different splitting criterion based on maximizing the proportion of correctly classified observations.
The Chi-squared Automatic Interaction Detection (CHAID) algorithm was developed by Gordon V. Kass, a statistician at the University of the Witwatersrand in South Africa. Kass published the method in his 1980 paper "An Exploratory Technique for Investigating Large Quantities of Categorical Data" in Applied Statistics. CHAID is based on a formal extension of AID and THAID, and it uses chi-squared tests of significance (with Bonferroni correction) to determine the best splits. Unlike AID and CART, CHAID can produce multi-way splits (not just binary), creating discrete groups at each node. CHAID was originally designed for classification and was later extended to handle regression problems as well.
The CART (Classification and Regression Trees) algorithm was introduced by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone in their landmark 1984 book Classification and Regression Trees. CART represented a major advance over earlier tree methods. It produces strictly binary trees (each node splits into exactly two children), handles both classification and regression tasks, introduces cost-complexity pruning for tree simplification, and provides theoretical guarantees for consistency. Breiman and colleagues also proved conditions under which all recursive partitioning techniques are Bayes risk consistent. CART remains one of the most widely used tree algorithms and is the basis for the tree implementation in scikit-learn.
ID3 (Iterative Dichotomiser 3) was developed by Ross Quinlan and published in 1986 in the paper "Induction of Decision Trees" in Machine Learning. ID3 uses information gain (based on entropy) as the splitting criterion, selecting the feature with the highest information gain at each step. ID3 follows a top-down, greedy search through the space of possible decision trees. A key limitation of ID3 is that it can only handle categorical features and does not include a pruning mechanism, making it prone to overfitting.
C4.5 is the successor to ID3, also developed by Ross Quinlan. Published in his 1993 book C4.5: Programs for Machine Learning, C4.5 addresses several limitations of ID3. The algorithm introduces gain ratio as a splitting criterion (which normalizes information gain by the intrinsic information of a split, reducing bias toward features with many values), handles both categorical and continuous features, includes mechanisms for tree pruning, and can deal with missing values. C4.5 was ranked first among the "Top 10 Algorithms in Data Mining" identified by the IEEE International Conference on Data Mining (ICDM) in 2006.
C5.0 is the commercial successor to C4.5, released by Ross Quinlan in the late 1990s. C5.0 offers improved speed and memory efficiency, support for boosting to improve accuracy, handling of very large datasets, and a more compact rule set output. While C5.0 is proprietary, its predecessor C4.5 remains freely available and widely used in academic research.
The following table summarizes the major milestones in the history of decision tree algorithms:
| Year | Algorithm | Creator(s) | Key Contribution |
|---|---|---|---|
| 1963 | AID | Morgan and Sonquist | First regression tree algorithm; binary splits on survey data |
| 1973 | THAID | Morgan et al. | Extended AID to classification with categorical dependent variables |
| 1976 | NP-completeness proof | Hyafil and Rivest | Proved that constructing optimal binary decision trees is NP-complete |
| 1980 | CHAID | Gordon V. Kass | Chi-squared significance testing; multi-way splits |
| 1984 | CART | Breiman, Friedman, Olshen, Stone | Binary trees for classification and regression; cost-complexity pruning |
| 1986 | ID3 | Ross Quinlan | Information gain splitting criterion; top-down greedy approach |
| 1993 | C4.5 | Ross Quinlan | Gain ratio; handles continuous features and missing values; pruning |
| 1997 | C5.0 | Ross Quinlan | Faster, more memory-efficient; boosting support |
A decision tree consists of three types of nodes:
- Root node: The topmost node in the tree, representing the entire dataset. The root node contains the first splitting condition applied to all training samples.
- Internal nodes (decision nodes): Nodes that represent decision points, splitting the dataset based on a particular feature value. Each internal node tests a condition on one feature and routes samples to child nodes based on the outcome.
- Leaf nodes (terminal nodes): Nodes that correspond to the final predictions or outcomes of the model. In classification trees, a leaf node typically stores the majority class of the training samples that reached it. In regression trees, it stores the mean (or sometimes the median) of the target values.
Branches in a decision tree connect nodes and represent the decision path taken based on the feature values. Each branch corresponds to a specific decision rule, such as "age < 30" or "salary > 50,000". For binary trees (like CART), each internal node has exactly two branches. For multi-way trees (like CHAID or ID3 with categorical features), a node may have as many branches as there are distinct values of the splitting feature.
The depth of a decision tree is the length of the longest path from the root node to any leaf node. The size of a tree refers to the total number of nodes. Deeper trees can capture more complex patterns in the data but are more likely to overfit. Controlling tree depth is one of the primary mechanisms for regularizing decision trees.
The core of any decision tree algorithm is the criterion used to select the best feature and threshold for splitting at each node. The goal is to create child nodes that are more "pure" (homogeneous) than the parent node with respect to the target variable.
Gini impurity measures the probability that a randomly chosen sample from a node would be incorrectly classified if it were labeled at random according to the distribution of class labels at that node. For a node with K classes, the Gini impurity is defined as:
Gini(t) = 1 - sum(p_i^2) for i = 1 to K
where p_i is the proportion of samples belonging to class i at node t. A Gini impurity of 0 indicates a perfectly pure node (all samples belong to one class), while the maximum impurity for a binary classification problem is 0.5 (equal split between two classes). CART uses Gini impurity as its default splitting criterion for classification.
When evaluating a candidate split, the algorithm computes the weighted average Gini impurity of the child nodes and selects the split that yields the greatest reduction in impurity (the "Gini gain").
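A minimal sketch of these two quantities in Python; the helper names gini and gini_gain are illustrative, not part of any library, and the toy split is assumed:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_gain(parent, left, right):
    """Reduction in impurity from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_child = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_child

parent = np.array([0, 0, 0, 1, 1, 1])                 # Gini = 0.5 (maximally impure)
left, right = np.array([0, 0, 0]), np.array([1, 1, 1])
print(gini(parent))                    # 0.5
print(gini_gain(parent, left, right))  # 0.5: a perfect split
```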
Entropy is a concept from information theory that quantifies the uncertainty or disorder in a dataset. For a node with K classes, entropy is defined as:
Entropy(t) = - sum(p_i * log2(p_i)) for i = 1 to K
where p_i is the proportion of samples belonging to class i. Entropy ranges from 0 (perfectly pure node) to log2(K) (uniform distribution across all classes).
Information gain measures the reduction in entropy achieved by splitting on a particular feature:
IG(t, feature) = Entropy(t) - sum((|t_v| / |t|) * Entropy(t_v))
where t_v represents the child nodes created by the split and |t_v|/|t| is the fraction of samples routed to each child. ID3 uses information gain as its splitting criterion. A known limitation of information gain is that it favors features with many distinct values (high cardinality), which led to the development of gain ratio in C4.5.
Gain ratio, introduced in C4.5, normalizes information gain by the intrinsic information (split information) of the feature:
GainRatio(t, feature) = IG(t, feature) / SplitInfo(t, feature)
where SplitInfo(t, feature) = - sum((|t_v| / |t|) * log2(|t_v| / |t|)). This normalization penalizes features that produce many small partitions, reducing the bias inherent in information gain.
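The entropy, information gain, and gain ratio formulas above can be sketched in a few lines of Python; the function names and the example split are illustrative:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the child nodes."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

def gain_ratio(parent, children):
    """Information gain normalized by the split information, as in C4.5."""
    n = len(parent)
    fractions = np.array([len(c) / n for c in children])
    split_info = -np.sum(fractions * np.log2(fractions))
    return information_gain(parent, children) / split_info

parent = np.array([0, 0, 1, 1, 1, 1])
children = [np.array([0, 0, 1]), np.array([1, 1, 1])]
print(entropy(parent))                    # ~0.918
print(information_gain(parent, children)) # ~0.459
print(gain_ratio(parent, children))       # ~0.459 (split info is 1 for a 50/50 split)
```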
For regression trees, the splitting criterion is typically variance reduction (also known as reduction in sum of squared errors). The algorithm selects the split that minimizes the weighted variance of the target variable in the child nodes:
VarianceReduction = Var(t) - sum((|t_v| / |t|) * Var(t_v))
where Var(t) is the variance of the target variable at node t. Each leaf node in a regression tree predicts the mean of the target values of the training samples that reached it. CART uses this criterion for regression tasks.
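A corresponding sketch for the regression criterion, with an assumed toy target vector:

```python
import numpy as np

def variance_reduction(parent_y, left_y, right_y):
    """Weighted decrease in target variance achieved by a candidate split."""
    n = len(parent_y)
    weighted = (len(left_y) / n) * np.var(left_y) + (len(right_y) / n) * np.var(right_y)
    return np.var(parent_y) - weighted

y = np.array([1.0, 1.2, 0.9, 5.0, 5.3, 4.8])
left, right = y[:3], y[3:]
print(variance_reduction(y, left, right))  # large: the split separates two clusters
# Each leaf of a regression tree would then predict the mean of its samples:
print(left.mean(), right.mean())
```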
The following table compares the main splitting criteria:
| Criterion | Task | Used By | Range | Notes |
|---|---|---|---|---|
| Gini impurity | Classification | CART, scikit-learn | 0 to 0.5 (binary) | Computationally simpler; no logarithm |
| Entropy / Information Gain | Classification | ID3, C4.5 | 0 to log2(K) | Based on information theory; slightly more expensive to compute |
| Gain Ratio | Classification | C4.5, C5.0 | Normalized | Corrects information gain bias toward high-cardinality features |
| Variance Reduction | Regression | CART | 0 to Var(y) | Minimizes sum of squared errors in child nodes |
| Chi-squared test | Classification | CHAID | Significance level | Uses statistical significance for split selection |
Decision trees can be applied to both classification and regression problems. The fundamental tree-growing procedure is the same in both cases (recursive binary or multi-way partitioning), but several details differ.
| Aspect | Classification tree | Regression tree |
|---|---|---|
| Target variable | Categorical (discrete class labels) | Continuous (real-valued) |
| Splitting criterion | Gini impurity, entropy, gain ratio, chi-squared test | Variance reduction (MSE), mean absolute error (MAE) |
| Leaf prediction | Majority class among training samples at the leaf | Mean (or median) of target values at the leaf |
| Probability output | Class probability distribution at each leaf | Point estimate (mean) at each leaf |
| Evaluation metrics | Accuracy, F1 score, AUC-ROC | Mean squared error, mean absolute error, R-squared |
| Common algorithms | CART, ID3, C4.5, CHAID | CART, AID |
In a classification model, each leaf outputs the class that appears most frequently among the training samples that arrived at that leaf. The tree can also output a probability vector by reporting the proportion of each class at the leaf. In a regression model, each leaf outputs a constant value, typically the mean of the target variable for the training samples at that leaf. Some implementations use the median instead of the mean to produce predictions that are more robust to outliers.
CART is unique in that the same algorithmic framework handles both tasks. The only difference is the impurity measure: Gini impurity (or entropy) for classification and variance reduction (or mean absolute error) for regression.
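A small sketch of the two output types using scikit-learn's CART implementation; the toy data and depth limit are assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_reg = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])

clf = DecisionTreeClassifier(max_depth=1).fit(X, y_class)
print(clf.predict([[2.5]]))        # majority class at the leaf
print(clf.predict_proba([[2.5]]))  # class proportions at the leaf

reg = DecisionTreeRegressor(max_depth=1).fit(X, y_reg)
print(reg.predict([[2.5]]))        # mean of the target values at the leaf (about 1.0)
```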
CART (Classification and Regression Trees) is the most widely implemented decision tree algorithm and serves as the foundation for many modern tree-based methods. Several practical details of how CART grows trees and handles data are described below.
The computational cost of growing a CART tree on a dataset with n samples and p features is O(p * n * log(n)) at each node (sorting each feature costs O(n log n), and there are p features to consider). The total cost depends on the depth of the tree, which in the worst case yields O(p * n^2 * log(n)) for a fully grown tree.
CART handles continuous and categorical features differently. For a continuous feature, the algorithm sorts the observed values and evaluates candidate thresholds (typically midpoints between consecutive sorted values), choosing the one that best reduces impurity. For a categorical feature, it searches over ways of partitioning the categories into two groups, one group routed to each child node.
CART handles missing values through surrogate splits. When the best split feature has missing values for some samples, the algorithm identifies surrogate features that produce splits most similar to the primary split. Samples with missing values on the primary feature are then routed using the best available surrogate split. This approach avoids the need to impute missing values before training.
Other algorithms handle missing values differently. C4.5 distributes samples with missing values across all child nodes, weighting each branch by the fraction of non-missing samples that follow it. XGBoost learns a default direction for missing values during training: at each split, samples with missing values are sent left or right depending on which direction reduces the loss more. LightGBM similarly assigns missing values to whichever child reduces the objective.
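A minimal sketch of XGBoost's native missing-value handling, using its scikit-learn-style wrapper on synthetic data; the dataset and hyperparameters are illustrative:

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan  # inject 20% missing values

# NaN entries are passed through unchanged; each split learns a default
# direction for them during training, so no imputation step is required.
model = XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)
print(model.predict(X[:5]))
```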
Pruning is the process of simplifying a decision tree by removing branches that do not significantly improve predictive performance. Without pruning, decision trees tend to grow very deep, memorizing noise in the training data and performing poorly on unseen data. There are two main categories of pruning.
Pre-pruning halts the growth of the tree during construction by imposing constraints. Common pre-pruning parameters include:
| Parameter | Description | Effect |
|---|---|---|
| Maximum depth | Limits the longest path from root to any leaf | Prevents excessively deep trees |
| Minimum samples per split | Requires a minimum number of samples at a node before it can be split | Prevents splits on very small subsets |
| Minimum samples per leaf | Requires each leaf to contain at least a specified number of samples | Ensures leaf predictions are based on sufficient data |
| Maximum number of leaf nodes | Limits the total number of leaves in the tree | Controls overall tree complexity |
| Minimum impurity decrease | Requires a split to achieve at least a specified reduction in impurity | Prevents trivial splits |
Pre-pruning is computationally efficient because it avoids growing the full tree. However, it can be overly conservative: a split that appears unhelpful on its own may enable useful splits further down the tree (the "horizon effect").
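In scikit-learn, these pre-pruning constraints map directly onto constructor parameters; the dataset and the specific values below are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(
    max_depth=4,                  # maximum depth
    min_samples_split=20,         # minimum samples required to split a node
    min_samples_leaf=10,          # minimum samples required at each leaf
    max_leaf_nodes=15,            # maximum number of leaf nodes
    min_impurity_decrease=0.001,  # minimum impurity decrease per split
    random_state=0,
).fit(X, y)

print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```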
Post-pruning first grows the tree to its full depth (or near-full depth) and then removes or collapses branches that do not improve generalization. Since the tree is fully grown before pruning begins, post-pruning can consider all possible patterns before discarding weak or irrelevant ones.
Cost-complexity pruning, introduced with the CART algorithm, is the most widely used post-pruning method. It defines a cost-complexity measure for any subtree T:
R_alpha(T) = R(T) + alpha * |T|
where R(T) is the total misclassification rate (or sum of squared errors for regression) of the tree, |T| is the number of leaf nodes, and alpha is a non-negative complexity parameter. Starting from the full tree, the algorithm repeatedly collapses the "weakest link" (the internal node whose removal increases error the least per leaf removed), producing a nested sequence of subtrees, each of which is optimal for some range of alpha.
Greater values of alpha increase the number of nodes pruned, yielding simpler trees. The optimal alpha is typically selected via k-fold cross-validation.
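A common recipe with scikit-learn, sketched on an illustrative dataset: compute the pruning path, then cross-validate each candidate alpha and refit with the best one:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Effective alphas at which the fully grown tree would be pruned further.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]  # drop the last alpha, which prunes to a single node

scores = [
    cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(scores))]
print("best alpha:", best_alpha)

pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print("leaves after pruning:", pruned.get_n_leaves())
```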
Reduced error pruning is one of the simplest post-pruning methods. It requires a separate validation set. Starting from the leaves, each internal node is tentatively replaced with a leaf labeled with the majority class. If this replacement does not decrease accuracy on the validation set, the change is kept. This process continues until no further pruning improves validation accuracy. Reduced error pruning is fast and straightforward but requires setting aside data for validation, which can be wasteful when data is scarce.
Error-based pruning, used in C4.5, estimates the error rate at each node using a confidence interval based on the training data. Subtrees whose estimated error exceeds that of a single leaf replacement are pruned. This method does not require a separate validation set.
Standard decision tree algorithms like CART, ID3, and C4.5 create axis-aligned (univariate) splits, meaning each split tests a single feature at a time. This produces rectangular decision regions in the feature space. Oblique decision trees (also called multivariate decision trees) generalize this by splitting on linear combinations of features. Instead of a condition like "feature_j <= threshold," an oblique split takes the form:
w_1 * x_1 + w_2 * x_2 + ... + w_p * x_p <= threshold
where w_1, ..., w_p are learned weights. This allows the tree to produce decision boundaries that are diagonal or angled relative to the feature axes, which can represent certain patterns much more compactly than a staircase of axis-aligned splits.
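A minimal numpy sketch of evaluating a single oblique split; the weights, threshold, and synthetic data are assumed for illustration:

```python
import numpy as np

# Synthetic 2-D data whose true boundary is diagonal: an axis-aligned tree would
# need a staircase of splits to approximate it, while one oblique split captures it.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # true boundary: x1 + x2 = 0

# One oblique split: samples with w.x <= threshold go to the left child.
w = np.array([1.0, 1.0])   # weights (set by hand here; learned by an oblique tree)
threshold = 0.0
goes_left = X @ w <= threshold

print("left child size:", goes_left.sum(), "right child size:", (~goes_left).sum())
print("accuracy of this single split as a classifier:",
      ((~goes_left) == y.astype(bool)).mean())
```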
Oblique trees often achieve higher generalization accuracy with fewer nodes than axis-aligned trees, especially on datasets where the true decision boundary is not parallel to any feature axis. However, they come with trade-offs. First, finding the optimal oblique split at each node is computationally expensive because the search space is continuous and high-dimensional. Second, oblique splits are harder for humans to interpret because they involve weighted combinations of multiple features rather than simple thresholds on a single variable.
Several algorithms have been proposed for constructing oblique trees. The OC1 system (Murthy, Kasif, and Salzberg, 1994) uses a randomized perturbation procedure and coordinate descent to optimize oblique splits. CART-LC (a variant of CART with linear combination splits) optimizes split coefficients by iterating over one weight at a time. More recent approaches use linear programming, support vector machines, or neural network components to find oblique hyperplanes at each node.
A multi-output decision tree predicts multiple target variables simultaneously using a single tree structure. Instead of predicting a single class label or a single continuous value, each leaf node stores a vector of predictions (one per target). The tree is grown using a splitting criterion that aggregates the impurity or variance across all targets.
For multi-output classification, the tree minimizes the average Gini impurity or entropy across all target variables at each split. For multi-output regression, it minimizes the average variance reduction across all targets.
Multi-output trees are useful in scenarios where the target variables are correlated, because a single tree can capture shared structure in the input space that affects multiple outputs simultaneously. Applications include predicting multiple properties of a chemical compound, multi-label classification tasks, and structured output problems such as image completion.
In scikit-learn, both DecisionTreeClassifier and DecisionTreeRegressor natively support multi-output prediction. When fit on a target array Y of shape (n_samples, n_outputs), the estimator learns a single tree that produces predictions for all outputs at once.
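A minimal sketch of multi-output regression with scikit-learn; the synthetic targets below are assumptions chosen to be correlated:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))
# Two correlated targets that share structure in the input space.
Y = np.column_stack([np.sin(3 * X[:, 0]),
                     np.sin(3 * X[:, 0]) + 0.5 * X[:, 1]])

tree = DecisionTreeRegressor(max_depth=5).fit(X, Y)
print(tree.predict(X[:3]))  # shape (3, 2): one prediction per target
```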
Decision trees and tree-based ensemble methods provide natural measures of feature importance, which indicate how much each feature contributes to the model's predictions.
Mean decrease in impurity (MDI), also known as impurity-based feature importance, measures the total reduction in the splitting criterion (Gini impurity, entropy, or mean squared error) attributable to each feature, averaged across all trees in the ensemble. For a single tree, the importance of a feature is the sum of the impurity decreases at all nodes where that feature is used for splitting, weighted by the number of samples reaching each node.
MDI is computed during training and is therefore very fast. However, it has known limitations: it is biased toward high-cardinality features (features with many distinct values), and it can assign high importance to features that overfit the training data but are not genuinely predictive.
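For example, scikit-learn exposes MDI importances through the feature_importances_ attribute after fitting; the dataset below is illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(data.data, data.target)

# One importance per feature; the values sum to 1.
top5 = sorted(zip(data.feature_names, forest.feature_importances_),
              key=lambda pair: pair[1], reverse=True)[:5]
for name, importance in top5:
    print(f"{name}: {importance:.3f}")
```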
Permutation importance, introduced by Breiman in his 2001 Random Forest paper, measures feature importance by randomly shuffling the values of a single feature and observing the resulting decrease in model performance on a held-out dataset. If shuffling a feature causes a large drop in accuracy (or increase in error), that feature is considered important.
Permutation importance is model-agnostic (it can be applied to any model, not just trees), avoids the high-cardinality bias of MDI, and can be computed on test data to reflect genuine predictive importance. Its main drawback is that it can be slow for large datasets and is affected by feature correlations (correlated features may share importance).
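A sketch using scikit-learn's permutation_importance on a held-out split; the dataset and the n_repeats value are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each feature 10 times and measure the mean drop in test accuracy.
result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean[:5])
```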
SHAP (SHapley Additive exPlanations), developed by Scott Lundberg and Su-In Lee in 2017, provides a unified framework for feature importance based on Shapley values from cooperative game theory. TreeSHAP is an efficient algorithm for computing exact SHAP values for tree-based models in polynomial time. SHAP values provide both global feature importance (average absolute SHAP value across all samples) and local explanations (the contribution of each feature to an individual prediction).
Single decision trees have well-known limitations, including high variance and a tendency to overfit. Ensemble methods combine multiple trees to produce more accurate and robust predictions. The two main families of tree-based ensembles are bagging and boosting.
Bootstrap aggregating (bagging), introduced by Leo Breiman in 1996, trains multiple decision trees on different bootstrap samples (random samples drawn with replacement) from the training data. The final prediction is the average (regression) or majority vote (classification) of all trees. Bagging reduces variance without increasing bias.
Random forest, proposed by Leo Breiman in his 2001 paper "Random Forests" published in Machine Learning, extends bagging by adding random feature selection. At each split in each tree, only a random subset of features (typically the square root of the total number of features for classification, or one-third for regression) is considered as candidates. This additional randomness further decorrelates the individual trees, leading to lower generalization error. Random forests also provide out-of-bag (OOB) error estimates computed from the samples left out of each bootstrap sample, expose built-in feature importance measures, and deliver strong performance with relatively little hyperparameter tuning.
Gradient boosting, formalized by Jerome Friedman in his 2001 paper "Greedy Function Approximation: A Gradient Boosting Machine" published in The Annals of Statistics, builds trees sequentially rather than independently. Each new tree is trained to predict the residual errors (or, more precisely, the negative gradient of the loss function) of the current ensemble. The predictions of all trees are summed to produce the final output. Gradient boosting typically uses shallow trees (often called "stumps" for depth-1 trees, or trees with depth 3 to 8) as weak learners. Key hyperparameters include the learning rate (shrinkage), number of trees, and tree depth.
AdaBoost, introduced by Yoav Freund and Robert Schapire in 1997, is a related boosting method that adjusts sample weights at each iteration, giving more weight to misclassified samples. AdaBoost can be viewed as a special case of gradient boosting with an exponential loss function.
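A small sketch comparing the bagging- and boosting-style ensembles available in scikit-learn; the dataset and hyperparameters are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                                    max_depth=3, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=200, random_state=0),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```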
Several high-performance implementations of gradient boosted trees have become dominant in applied machine learning and competitive data science:
XGBoost (eXtreme Gradient Boosting), developed by Tianqi Chen and Carlos Guestrin at the University of Washington, was published in the 2016 paper "XGBoost: A Scalable Tree Boosting System" at the ACM SIGKDD conference. XGBoost introduced several innovations: a regularized objective function that penalizes tree complexity (including both L1 and L2 regularization on leaf weights), a sparsity-aware algorithm for handling missing values, a weighted quantile sketch for approximate split finding, and system-level optimizations such as cache-aware access and out-of-core computation. XGBoost uses level-wise (breadth-first) tree growth by default.
LightGBM (Light Gradient Boosting Machine), developed by a team led by Guolin Ke at Microsoft Research, was published in the 2017 NeurIPS paper "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." LightGBM introduces two key techniques: Gradient-based One-Side Sampling (GOSS), which excludes data instances with small gradients to speed up information gain estimation; and Exclusive Feature Bundling (EFB), which bundles mutually exclusive features to reduce dimensionality. LightGBM also uses leaf-wise (best-first) tree growth rather than level-wise growth, which can produce deeper, more asymmetric trees that converge faster. LightGBM achieves training speedups of over 20x compared to conventional gradient boosting while maintaining comparable accuracy.
CatBoost (Categorical Boosting), developed by the machine learning team at Yandex, was described in the 2018 NeurIPS paper "CatBoost: Unbiased Boosting with Categorical Features" by Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost introduces ordered boosting, a permutation-driven approach that addresses prediction shift (a form of target leakage present in standard gradient boosting), and an efficient algorithm for processing categorical features directly without manual encoding. CatBoost uses oblivious (symmetric) trees as base learners, where all nodes at the same depth level use the same feature and split threshold. This symmetric structure enables efficient vectorized evaluation, particularly on GPUs.
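A minimal sketch of the three libraries used through their scikit-learn-compatible wrappers; the hyperparameter values are illustrative, and defaults differ between libraries:

```python
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in [
    XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1),
    LGBMClassifier(n_estimators=300, num_leaves=31, learning_rate=0.1),
    CatBoostClassifier(iterations=300, depth=6, learning_rate=0.1, verbose=False),
]:
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))
```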
The following table compares the major tree-based ensemble algorithms:
| Algorithm | Year | Author(s) | Ensemble Type | Tree Growth | Key Innovation |
|---|---|---|---|---|---|
| Bagging | 1996 | Breiman | Parallel (bootstrap) | Full depth | Variance reduction via averaging |
| Random Forest | 2001 | Breiman | Parallel (bootstrap + feature subspace) | Full depth | Random feature selection at each split |
| AdaBoost | 1997 | Freund, Schapire | Sequential (boosting) | Shallow (stumps) | Adaptive sample weighting |
| Gradient Boosting | 2001 | Friedman | Sequential (boosting) | Shallow | Fits residuals via gradient descent in function space |
| XGBoost | 2016 | Chen, Guestrin | Sequential (boosting) | Level-wise | Regularized objective; sparsity-aware splits |
| LightGBM | 2017 | Ke et al. (Microsoft) | Sequential (boosting) | Leaf-wise | GOSS and EFB for speed; leaf-wise growth |
| CatBoost | 2018 | Prokhorenkova et al. (Yandex) | Sequential (boosting) | Symmetric (oblivious) | Ordered boosting; native categorical feature handling |
One of the strongest advantages of decision trees is that they can be visualized in a way that humans can inspect and understand. Several tools and techniques exist for visualizing decision trees and interpreting their predictions.
The most common visualization is a node-link diagram where each node displays the splitting feature, threshold, impurity value, number of samples, and (for classification) the class distribution. Nodes are typically colored by the majority class or the predicted value, with color intensity reflecting the node's purity. This rendering allows practitioners to trace any prediction from root to leaf and understand why the model made a particular decision.
In scikit-learn, two primary functions support tree visualization:
- sklearn.tree.plot_tree: Added in scikit-learn 0.21, this function renders a decision tree directly as a matplotlib figure without requiring external software. It accepts parameters such as max_depth (to limit the displayed depth), feature_names, class_names, filled (to color nodes by class or value), and rounded (for rounded node boxes).
- sklearn.tree.export_graphviz: This function exports the tree in DOT format, which can be rendered into high-resolution images using the Graphviz software. It provides more control over aesthetic details and can produce output in formats like PNG, PDF, or SVG.

Additionally, the sklearn.tree.export_text function produces a plain-text representation of the tree rules, which is useful for logging, debugging, or environments where graphical output is not available.
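A short sketch combining plot_tree and export_text on an illustrative dataset:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Node-link diagram: nodes are colored by majority class, with intensity
# reflecting purity, and annotated with the split, impurity, and sample counts.
fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(tree, feature_names=iris.feature_names, class_names=iris.target_names,
          filled=True, rounded=True, ax=ax)
plt.savefig("iris_tree.png", dpi=150)

# Plain-text rules, one indented block per root-to-leaf path.
print(export_text(tree, feature_names=iris.feature_names))
```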
Decision trees can be converted into a set of if-then rules, one rule per leaf. Each rule is the conjunction of all split conditions along the path from the root to that leaf. Rule extraction is useful for embedding tree-based logic into production systems, regulatory documentation, or clinical guidelines where a visual diagram is insufficient.
Decision trees are often cited as one of the most interpretable machine learning models. For shallow trees (depth 3 to 5), a human can follow the decision logic easily. However, interpretability degrades rapidly as trees grow deeper. A tree with hundreds of leaves is no easier to interpret than a black-box model. This is one reason why post-pruning and depth constraints are important not only for generalization but also for practical usability.
For ensemble methods built on decision trees, individual tree interpretability is largely lost because the final prediction aggregates hundreds or thousands of trees. Techniques like SHAP values, partial dependence plots, and feature importance scores are used to interpret ensemble models at a higher level.
Decision trees and tree-based ensembles are used across a wide range of domains, including credit scoring and fraud detection in finance, diagnosis and risk stratification in healthcare, churn prediction in marketing, and ranking and tabular prediction tasks in web search and recommendation systems.
Decision trees and tree-based ensembles are available in most major machine learning libraries:
| Library | Language | Tree Algorithms Available |
|---|---|---|
| scikit-learn | Python | CART (DecisionTreeClassifier, DecisionTreeRegressor), Random Forest, Gradient Boosting, AdaBoost, ExtraTreesClassifier |
| XGBoost | Python, R, C++, Java | XGBoost (regularized gradient boosting) |
| LightGBM | Python, R, C++ | LightGBM (leaf-wise gradient boosting with GOSS and EFB) |
| CatBoost | Python, R, C++ | CatBoost (ordered boosting with oblivious trees) |
| rpart | R | CART |
| Spark MLlib | Scala, Java, Python | Decision tree, Random Forest, Gradient Boosting |
| Weka | Java | J48 (C4.5 implementation), Random Forest, REPTree |
| dtreeviz | Python | Visualization library for scikit-learn, XGBoost, and Spark trees |
Scikit-learn's decision tree implementation is based on an optimized version of the CART algorithm. It uses Cython for performance-critical routines and supports both dense and sparse input data. Key classes include DecisionTreeClassifier for classification tasks and DecisionTreeRegressor for regression tasks, both of which support multi-output prediction, sample weighting, and cost-complexity pruning via the ccp_alpha parameter.
Imagine you are playing a guessing game with a friend who is thinking of an animal. You ask simple yes-or-no questions to narrow down the answer. "Does it have fur?" If yes, you ask "Is it bigger than a cat?" If no, you ask "Does it have feathers?" Each question helps you eliminate animals that do not fit, and after a few questions you can guess the right answer.
A decision tree works the same way with data. The computer looks at a bunch of examples and figures out which questions are the best ones to ask, meaning which questions split the examples into the most clear-cut groups. It arranges these questions into a tree shape, where the first question is at the top, follow-up questions come next, and the final answers are at the bottom. When it sees new data, it starts at the top and follows the branches based on the answers until it reaches a final prediction.