CART algorithm
Last reviewed
Apr 28, 2026
Sources
29 citations
Review status
Source-backed
Revision
v1 ยท 4,287 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Apr 28, 2026
Sources
29 citations
Review status
Source-backed
Revision
v1 ยท 4,287 words
Add missing citations, update stale details, or suggest a clearer explanation.
The CART algorithm (Classification And Regression Trees) is a non-parametric supervised learning method that builds a binary decision tree from labelled training data. It was introduced by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone in their 1984 monograph Classification and Regression Trees, published by Wadsworth and Brooks/Cole [1]. The same algorithmic framework can perform classification, in which a tree predicts a discrete class label, and regression, in which a tree predicts a continuous numeric value. CART has become one of the most widely used algorithms in machine learning, both as a standalone interpretable model and as the base learner inside random forest, gradient boosting, and other tree ensembles [2][3].
The algorithm works by recursively splitting the feature space into rectangular regions. At each internal node it selects a single feature and a threshold (or a partition of categorical levels) that best separates the training observations under a chosen impurity criterion. CART uses the Gini impurity by default for classification problems and squared error for regression problems. The resulting tree is initially grown to a large size, then reduced through cost-complexity pruning, a grow-then-prune strategy that is one of the algorithm's defining contributions [1][2]. Other distinguishing features include surrogate splits for missing data and a procedure for handling categorical predictors with many levels efficiently.
Tree-structured statistical models predate CART by roughly two decades. The first widely cited regression tree procedure was Automatic Interaction Detection (AID), published by James Morgan and John Sonquist in 1963 at the University of Michigan to study survey data with complex interactions among categorical predictors [4]. AID partitioned data into homogeneous subgroups using sums of squares as a splitting criterion. A classification analogue, THAID (Theta Automatic Interaction Detection), was introduced by Messenger and Mandell in 1972 [4]. CHAID, a chi-squared automatic interaction detector that supports multiway splits, was published by Gordon Kass in 1980 [5].
Leo Breiman, then at the University of California, Berkeley, began collaborating with Jerome Friedman, Richard Olshen, and Charles Stone in the mid 1970s. Their work was motivated in part by limitations of AID and THAID, particularly the absence of a principled mechanism to control tree size and the lack of a way to estimate honest generalisation error. Drafts of what became CART circulated as technical reports in 1977 and 1978, and the full treatment was published as a 358-page book in 1984 [1][6]. The book introduced four ideas that distinguish CART from earlier tree procedures: an exhaustive search over binary splits, a Gini impurity criterion that simplifies computation for multi-class problems, cost-complexity pruning combined with cross-validated error estimation, and surrogate splits for handling missing data.
The trademarked name CART belongs to a commercial software implementation. The original code base was licensed by Salford Systems, which Minitab acquired in 2017; the trademark CART is now held by Minitab, LLC [7][8]. The underlying algorithm, however, is in the public domain. Open-source reimplementations include the R package rpart by Terry Therneau, Beth Atkinson, and Brian Ripley, whose name (Recursive PARTitioning) was deliberately chosen to avoid the trademark [9], and the scikit-learn DecisionTreeClassifier and DecisionTreeRegressor classes [10].
In parallel with CART, Ross Quinlan developed a competing family of algorithms in Sydney: ID3 in 1986 [11] and its successor C4.5 in 1993 [12]. ID3 and C4.5 use information-theoretic splitting criteria and produce multiway splits on categorical features. Together with CART and CHAID they constitute the four classical algorithm families for decision tree induction.
A CART model is a binary tree. Each internal node holds a split rule of the form x_j <= s for a numeric feature x_j and threshold s, or x_j in A for a categorical feature and a subset A of its levels. Observations that satisfy the rule travel to the left child; the rest travel to the right child. Each leaf node stores a prediction: the majority class for classification problems, or the mean of the response values for regression problems [2][3]. Predicting a value for a new observation requires walking from the root to a leaf along the path determined by the split rules, an operation that takes O(log n) time on a balanced tree.
Growing the tree is a greedy, top-down procedure. The algorithm begins with all training observations at the root and considers every possible split (every feature, every threshold). For each candidate split it computes the weighted impurity of the resulting two child nodes. The split that produces the largest reduction in impurity is selected, the data are partitioned, and the procedure recurses on each child. The recursion continues until a stopping condition is met or every leaf is pure.
Formally, for a node t with sample set Q_t and a candidate split theta = (j, s) that partitions Q_t into Q_t^L and Q_t^R, CART chooses
theta* = argmin_theta [ (n_L / n_t) * H(Q_t^L) + (n_R / n_t) * H(Q_t^R) ]
where H is the impurity function (Gini for classification, squared error for regression) and n_L, n_R, n_t are sample counts in the left child, right child, and parent node respectively [10][13]. The greedy choice is locally optimal, not globally optimal: finding the smallest tree that perfectly classifies a training set is NP-complete [14].
The splitting criterion is the function CART minimises when comparing candidate splits. The choice of criterion depends on whether the response is categorical (classification) or numeric (regression).
For a classification node t with class proportions p_1, p_2, ..., p_K (where K is the number of classes and p_k is the fraction of training samples at t belonging to class k), the Gini impurity is
Gini(t) = 1 - sum_{k=1}^{K} p_k^2
This quantity is the probability that a randomly chosen sample at the node would be misclassified if it were labelled by drawing a class according to the empirical distribution at the node [1][15]. Gini impurity is bounded between 0 (a pure node containing one class) and 1 - 1/K (a node with equal class proportions). The impurity reduction for a split is
Delta Gini(t, theta) = Gini(t) - (n_L / n_t) * Gini(t_L) - (n_R / n_t) * Gini(t_R)
CART selects the split that maximises Delta Gini(t, theta), equivalently the split that minimises the weighted average child impurity [1][13]. Gini is the default classification criterion in CART because it requires no logarithm evaluations and tends to isolate the most frequent class into its own node, producing pure regions quickly.
Entropy is an alternative criterion borrowed from information theory:
Entropy(t) = - sum_{k=1}^{K} p_k * log_2(p_k)
The corresponding impurity reduction is called information gain and is the criterion used by ID3 and C4.5 [11][12]. In practice Gini and entropy give similar trees: the two functions agree at the endpoints (zero for pure nodes, maximum for uniform nodes) and differ only modestly in between. scikit-learn exposes both via the criterion parameter ('gini', 'entropy', or the equivalent 'log_loss') [10].
For a regression node t containing samples with response values y_1, ..., y_{n_t}, CART minimises the within-node sum of squared errors:
SSE(t) = sum_{i in t} (y_i - y_bar_t)^2
where y_bar_t = (1 / n_t) * sum_{i in t} y_i is the mean response at the node [1][2]. The leaf prediction is the within-node mean, which is the value that minimises this expression. Equivalently, CART maximises the variance reduction
Delta SSE(t, theta) = SSE(t) - SSE(t_L) - SSE(t_R)
because the impurity at a regression node is proportional to the within-node variance multiplied by the sample count [10][13]. Squared error is the default regression criterion because it admits a closed-form leaf estimate and supports efficient incremental computation when scanning candidate splits.
Mean absolute error is an alternative criterion that is more robust to outliers but slower to evaluate. With MAE the leaf prediction is the median rather than the mean, and the splitting criterion is the sum of absolute deviations from the median. The scikit-learn implementation supports MAE via criterion='absolute_error', but documentation notes that it is roughly three to six times slower to fit than the squared-error variant [10].
The three criteria differ in computational cost, sensitivity, and interpretation, but in practice produce broadly similar trees on typical tabular data.
| Criterion | Task | Formula | Notes |
|---|---|---|---|
| Gini impurity | Classification | 1 - sum p_k^2 | Default in CART; no logs; favours larger classes. |
| Entropy / information gain | Classification | -sum p_k log_2 p_k | Default in ID3 and C4.5; slower due to logs. |
| Squared error (MSE) | Regression | sum (y_i - y_bar)^2 | Default in CART; closed-form leaf mean. |
| Absolute error (MAE) | Regression | `sum | y_i - median |
| Friedman MSE | Regression | Modified MSE | Used by gradient boosting; weighted by sample counts. |
| Poisson deviance | Regression | 2 * sum (y log(y/mu) - (y - mu)) | For non-negative count targets [10]. |
Left unchecked, the recursive splitting procedure produces a tree that perfectly classifies the training set, with one observation per leaf. Such a tree memorises the training data and generalises poorly. CART addresses this through a two-stage strategy: first grow a tree larger than necessary, then prune it back using cross-validation to find the right size [1][2].
Stopping criteria arrest growth before the tree becomes pathologically deep. Common stopping rules include a maximum tree depth, a minimum number of samples required to attempt a split (min_samples_split), a minimum number of samples per leaf (min_samples_leaf), and a minimum reduction in impurity required to accept a split (min_impurity_decrease). These are pre-pruning controls; they limit growth but do not, by themselves, recover the optimal tree size from a fully grown tree [10].
The defining post-pruning method introduced in the 1984 book is cost-complexity pruning, also called weakest-link pruning. For a fully grown tree T, define
R_alpha(T) = R(T) + alpha * |T|
where R(T) is the total impurity (or misclassification rate, or sum of squared errors) summed across the leaves of T, |T| is the number of leaves, and alpha >= 0 is a complexity parameter [1][16]. The first term rewards predictive accuracy; the second term penalises tree size. For each value of alpha, the pruned tree T_alpha that minimises R_alpha(T) is unique. As alpha increases from 0 to a sufficiently large value, the optimal sub-tree shrinks from the full tree to the root.
The weakest-link procedure efficiently traces this path. For each non-leaf node t in the current tree, define
g(t) = (R(t) - R(T_t)) / (|T_t| - 1)
where T_t is the sub-tree rooted at t and R(t) is the impurity of t collapsed into a single leaf. The node with the smallest g(t) is the weakest link: collapsing it produces the smallest increase in training error per leaf removed. CART repeatedly collapses the weakest link, generating a finite nested sequence of sub-trees T_0 supset T_1 supset ... supset T_root indexed by an increasing sequence of effective alpha values [1][16].
The optimal alpha is selected by cross-validation, typically tenfold. For each fold, a tree is grown on the training portion and the validation portion is run through each sub-tree in the pruning sequence. The alpha that minimises the average cross-validated error across folds is then used to choose the corresponding sub-tree of the full-data tree as the final model. The scikit-learn implementation exposes this through the cost_complexity_pruning_path method and the ccp_alpha constructor argument [10][16].
Real-world tabular data are rarely complete, and decision trees benefit from a principled missing-data mechanism. CART introduced surrogate splits for this purpose [1][17]. The idea is straightforward: at every internal node, in addition to the primary split selected by the impurity criterion, the algorithm computes a ranked list of alternative splits on other features that mimic the partition produced by the primary split. When an observation reaches the node and the primary split feature is missing, the algorithm sends the observation down the branch indicated by the highest-ranked surrogate whose feature is observed.
Formally, for a primary split that sends a fraction p_L of training observations left and p_R right, a surrogate split on a different feature is evaluated by the proportion of observations it sends to the same side as the primary, after correcting for the proportion expected by chance. Surrogates that do better than the trivial "always go to the majority side" rule are retained and ordered by their concordance with the primary split. CART stores surrogates at each node during training so that any future observation with missing values can still be routed to a leaf [1][17].
Surrogates also play a secondary role in measuring feature importance. A feature that frequently appears as a high-ranked surrogate even when not chosen as the primary split is a strong correlate of the primary signal and contributes to the so-called feature importance score. The C4.5 algorithm, by contrast, handles missing values by sending fractional observations down both branches with weights proportional to the observed split [12]. The current scikit-learn implementation supports a different approach: when missing values appear during training with splitter='best', samples with missing values for a candidate split feature are routed to whichever child minimises impurity, and that decision is recorded for use at prediction time [10].
A binary tree must split a categorical feature with q levels into two non-empty subsets. The number of distinct binary partitions of q levels is 2^(q-1) - 1, which grows quickly with q and is intractable for features with many levels [13]. CART exploits a result from the 1984 book that simplifies this for two important cases.
For binary classification, suppose a categorical feature has q levels. Sort the levels by the proportion of class-1 observations they contain. Breiman et al. proved that the optimal binary split (under Gini impurity or any equivalent measure) is one of the q - 1 splits obtained by cutting this sorted list at a single position [1][13]. The exhaustive search over 2^(q-1) - 1 partitions thus reduces to a linear scan, and the same reduction applies to regression with squared error: sort the levels by mean response and cut. For multi-class classification with three or more classes, no analogous shortcut exists, and CART implementations either restrict the search heuristically or convert categorical features to ordered numeric encodings [13].
The scikit-learn implementation does not natively handle categorical features in DecisionTreeClassifier; users must one-hot encode or ordinal encode them before fitting [10]. The R rpart package, by contrast, handles unordered factors directly using the Breiman ordering trick. LightGBM and CatBoost provide native categorical-feature support that follows similar principles, sorting levels by gradient or target statistics before scanning candidate splits [3][18].
CART belongs to a family of recursive partitioning algorithms that diverged in the 1980s and 1990s. The four classical members differ in splitting criterion, split arity, missing-data handling, and pruning strategy.
| Feature | AID (1963) | CHAID (1980) | ID3 (1986) | C4.5 (1993) | CART (1984) |
|---|---|---|---|---|---|
| Task | Regression | Classification | Classification | Classification | Both |
| Feature types | Categorical | Categorical | Categorical | Both | Both |
| Split arity | Multiway | Multiway | Multiway | Multiway | Binary |
| Splitting criterion | Sum of squares | Chi-squared | Information gain | Gain ratio | Gini / squared error |
| Pruning | None | None | None | Error-based pruning | Cost-complexity |
| Missing values | None | Separate category | None | Probabilistic distribution | Surrogate splits |
| Originators | Morgan & Sonquist | Kass | Quinlan | Quinlan | Breiman et al. |
ID3 introduced information gain to the decision-tree literature but was limited to categorical features and produced unpruned trees that overfit easily [11]. C4.5 extended ID3 to handle continuous features, missing values via probabilistic distribution, and post-pruning by error estimation; it also introduced gain ratio as a split criterion to counteract information gain's bias toward features with many levels [12]. CHAID constructs multiway splits using chi-squared statistics, which yields trees that are more interpretable when categorical predictors have natural groupings, but it has no formal pruning step [5].
CART's binary-only structure is sometimes seen as a limitation but in practice creates simpler, more flexible trees: any multiway split can be expressed as a sequence of binary splits, while the converse is not always true. Its principled cost-complexity pruning produces trees with lower variance than ID3 or C4.5 of similar depth, and its surrogate-split mechanism for missing data is more rigorous than C4.5's fractional-observation approach in the high-correlation regime [17][19].
The enduring popularity of CART rests on several practical advantages. Trees are highly interpretable: each prediction can be traced back to a chain of feature thresholds, and the model can be visualised directly. They impose no parametric assumptions on the data distribution and capture non-linear relationships and feature interactions automatically. They handle mixed feature types (numeric and categorical) without special preprocessing, and they are invariant to monotone transformations of numeric features, so feature scaling has no effect on the resulting tree [1][2]. Predictions at inference time are fast: a single pass from root to leaf takes time logarithmic in the number of nodes. Trees also produce native feature importance scores by aggregating impurity reductions across splits.
The algorithm has well-documented limitations. A single CART tree is a high-variance estimator: small perturbations of the training data can produce structurally different trees with different splits at the root [20]. This instability motivated the development of bagging and random forest, both of which average many trees to reduce variance. Greedy splitting is myopic, choosing the locally best split without regard for downstream interactions, and finding the globally optimal tree is NP-complete [14]. Trees produce piecewise constant predictions on axis-aligned rectangles, which makes them poor at approximating smooth functions of correlated inputs without an excessive number of leaves. Trees are also biased toward features that admit many candidate splits: numeric features and categorical features with many levels can dominate informative features with fewer levels in feature-importance comparisons [21]. Without pruning or strong regularisation, trees overfit the training set easily.
Most modern applications of CART trees occur inside ensemble learning algorithms that combine many trees to reduce variance, bias, or both [3]. Three families dominate.
Bagging (Bootstrap Aggregating) was introduced by Breiman in 1996 [20]. The procedure draws B bootstrap samples from the training set, fits a CART tree on each, and averages the predictions (for regression) or takes a majority vote (for classification). Because each bootstrap sample omits roughly 37% of the training observations, those out-of-bag observations can be used for an internal estimate of generalisation error. Bagging reduces variance without increasing bias and is most effective on high-variance base learners such as deep, unpruned CART trees.
Random forest, introduced by Breiman in 2001, extends bagging by injecting additional randomness at each split: instead of considering all features at every internal node, the algorithm samples a subset (typically sqrt(p) features for classification or p/3 for regression) and selects the best split within that subset [22]. This decorrelates the trees and gives larger variance reductions than bagging alone. Random forests use unpruned CART trees as base learners; the original implementation used Gini impurity for classification and squared error for regression.
Boosting algorithms grow trees sequentially, each new tree fitting the residuals or gradient of a loss function evaluated on the current ensemble. AdaBoost, by Yoav Freund and Robert Schapire (1997), reweights misclassified samples after each tree to focus subsequent learners on hard cases [23]. Gradient boosting, formalised by Friedman in 1999 and 2001, frames boosting as functional gradient descent in function space and uses CART trees of fixed depth (typically two to eight) as weak learners [24][25]. Gradient boosting underlies several widely used libraries: XGBoost by Tianqi Chen and Carlos Guestrin (2016) [26], LightGBM by Microsoft (2017) [27], and CatBoost by Yandex (2017) [28]. Each uses CART-style binary trees but adds engineering optimisations: XGBoost adds a regularisation term, second-order gradient information, and shrinkage; LightGBM introduces histogram-based binning and leaf-wise growth; CatBoost grows oblivious (symmetric) trees and natively supports categorical features through ordered target statistics.
The practical implication is that CART trees, although rarely used in isolation in modern competitive machine learning, remain the building block of the most accurate tabular-data models. Random forests and gradient boosting variants regularly top public benchmarks on tabular tasks and are standard choices when interpretability of individual decisions matters less than aggregate predictive accuracy [3].
Multiple software packages implement CART or close variants of it.
DecisionTreeClassifier and DecisionTreeRegressor in the sklearn.tree module. The implementation is described as an "optimised version of the CART algorithm" with support for Gini, entropy, and log loss criteria for classification, and squared error, absolute error, Friedman MSE, and Poisson deviance for regression [10]. It supports cost-complexity pruning via the ccp_alpha parameter and the cost_complexity_pruning_path method.rpart by Therneau, Atkinson, and Ripley implements the 1984 CART procedure in R, with native handling of categorical predictors, surrogate splits, and cross-validated cost-complexity pruning [9]. The printcp and plotcp functions visualise the pruning sequence.fitctree and fitrtree in the Statistics and Machine Learning Toolbox implement CART with surrogate splits.DecisionTreeClassifier and DecisionTreeRegressor classes that grow trees level-wise across cluster nodes, using approximate quantile-based split candidates for scalability to billions of rows [29].