Machine learning terms/Decision Forests
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 ยท 3,982 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 ยท 3,982 words
Add missing citations, update stale details, or suggest a clearer explanation.
See also: Machine learning terms
Decision forests are a family of machine learning models built from collections of decision trees. Instead of relying on a single tree, a decision forest trains many trees and combines their predictions, usually by averaging for regression tasks and by majority vote or class probability averaging for classification. The result is a model that is far more accurate, far more stable, and far less prone to overfitting than any individual tree.
The two dominant decision forest paradigms are random forests, which train many independent trees in parallel on bootstrap samples of the data, and gradient boosted trees, which train trees sequentially with each new tree correcting the residual errors of the ensemble built so far. Both approaches dominate practical work on tabular data. They power credit scoring at major banks, ad click-through prediction at search engines, learning-to-rank systems, fraud detection, churn prediction, and the majority of winning solutions on tabular Kaggle competitions.
This page is the gateway hub for decision-forest entries on the AI Wiki. It covers the history, the mathematics behind tree splitting, the major algorithm families, the leading software libraries, and a curated index of every decision-forest concept that has its own dedicated wiki page.
A decision forest is an ensemble learning method whose base learners are decision trees. At inference time the input traverses each tree along an inference path of conditions until it reaches a leaf, and the leaf values are aggregated across the forest to produce the final prediction. Decision forests are non-parametric, handle mixed numerical and categorical features natively, are scale-invariant for monotonic feature transformations, and require comparatively little feature engineering.
The broad appeal of decision forests rests on a short list of practical strengths:
| Property | Why it matters |
|---|---|
| Strong accuracy on tabular data | Repeatedly beats deep learning baselines on heterogeneous tabular benchmarks |
| Native handling of categorical features | No mandatory one-hot encoding; algorithms such as LightGBM and CatBoost split on raw category values |
| Robust to outliers and feature scaling | Splits are based on order, not magnitude, so a single extreme value cannot distort the geometry of the input |
| Built-in feature importances | Direct measures of which inputs the model relies on most |
| Predictable training cost | Training time scales close to linearly with the number of trees and roughly with $n \log n$ in the number of training examples |
| Few brittle hyperparameters | Modern libraries train respectable models with default settings |
Decision forests are weaker on problems where signal lives in high-dimensional structured inputs such as raw pixels, audio waveforms, or token sequences. Those domains belong to deep learning and convolutional or transformer architectures.
Every decision forest is built out of decision trees. A tree is a recursive partition of the input space. Each internal node holds a test that compares one or more features against a threshold or category set, splitting incoming examples into child branches. Each leaf stores a constant prediction, typically a class probability vector for classification or a numeric value for regression. The starting node of the tree is the root. The component that decides which condition to install at each node is called the splitter, and the operation it performs is a split.
Three historical algorithm families established the recipes that modern forests still follow:
| Algorithm | Year | Author | Key ideas |
|---|---|---|---|
| ID3 | 1986 | J. Ross Quinlan | Multiway splits on categorical features, information gain (entropy reduction) as the splitting criterion, no pruning |
| C4.5 | 1993 | J. Ross Quinlan | Successor to ID3. Adds support for numerical features, missing-value handling, gain ratio normalization, and post-pruning |
| CART (Classification and Regression Trees) | 1984 | Breiman, Friedman, Olshen, Stone | Strictly binary conditions at each node, Gini impurity for classification, mean squared error for regression, cost-complexity pruning |
Most modern decision forest implementations are CART-style: they fit binary trees, choose splits greedily, and grow a forest of such trees. ID3 and C4.5 remain influential, especially in textbook treatments and in the older Weka ecosystem.
A condition, also called a test, is the question installed at a node. The major condition types are:
| Type | Form | Notes |
|---|---|---|
| Binary condition | Two child branches | The default in CART, random forest, and gradient boosted trees |
| Non-binary condition | More than two children | Used by ID3 and C4.5 on categorical features |
| Axis-aligned condition | Compares a single feature against a threshold | Standard in nearly every production forest |
| Oblique condition | Linear combination of features compared against a threshold | More expressive but more expensive; appears in oblique random forests |
| In-set condition | Tests whether a categorical feature lies in a learned subset | The native categorical split used by LightGBM and CatBoost |
The threshold used in an axis-aligned numerical condition is selected by the splitter to minimize an impurity measure on the resulting children. For categorical features the splitter searches a partition of category values; for high-cardinality categoricals modern libraries use efficient sorting-based heuristics rather than exhaustive search.
A split is chosen by scoring every candidate condition with an impurity or loss function and picking the condition that reduces the score the most.
Gini impurity is the probability that a randomly chosen example would be misclassified if labelled by the class distribution at the node:
$$\text{Gini}(t) = 1 - \sum_{k=1}^{K} p_k^2$$
where $p_k$ is the proportion of class $k$ in node $t$. Gini is zero for a pure node and reaches its maximum when classes are perfectly balanced. It is the default split criterion in scikit-learn's DecisionTreeClassifier and RandomForestClassifier.
Shannon entropy measures uncertainty in the class distribution at a node:
$$H(t) = -\sum_{k=1}^{K} p_k \log_2 p_k$$
The corresponding split score is information gain, the reduction in entropy after splitting:
$$\text{IG}(t, s) = H(t) - \sum_{c \in \text{children}(s)} \frac{|c|}{|t|} H(c)$$
ID3 selects the split with maximum information gain. C4.5 uses gain ratio, a normalized version that penalizes splits with many children. In practice, Gini and entropy almost always agree on which split is best; differences in tree structure are small and rarely change downstream accuracy.
For regression trees, the impurity at a node is usually the mean squared error of the leaf prediction. The split that minimizes the post-split sum of squared errors is selected. Mean absolute error and Huber loss are also available in major libraries for problems with heavy-tailed targets.
In gradient boosting, trees are fit to the negative gradient of a loss function, so the splitting criterion comes from a second-order Taylor expansion of that loss rather than from a fixed impurity. XGBoost popularized this view with its gain formula
$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$
where $G$ and $H$ are sums of first and second derivatives in the left and right children and $\lambda$, $\gamma$ are regularization parameters.
Random forest was introduced by Leo Breiman in 2001. It combines two ideas:
Because each bootstrap sample omits roughly $1 - (1 - 1/n)^n \approx 1/e \approx 36.8%$ of the training examples, random forests support out-of-bag evaluation: each example is scored by the trees that did not see it during training, giving a near-unbiased generalization estimate without a separate validation set.
A random forest also has a built-in measure of variable importances. Two variants are widely used: mean decrease in impurity (MDI), which sums the impurity reductions a feature contributes across all trees, and permutation variable importances, which measures the drop in accuracy when a feature's values are randomly permuted. Permutation importance is generally preferred because MDI is biased toward features with many possible split points or high cardinality.
The success of random forests is sometimes described in terms of the wisdom of the crowd: independent learners that each do better than chance, when aggregated, produce far stronger predictions than any single learner. Galton's 1906 ox-weight contest is the canonical illustration.
Extremely Randomized Trees, often abbreviated Extra-Trees, were introduced by Pierre Geurts, Damien Ernst, and Louis Wehenkel in 2006. They differ from random forests in two ways:
The extra randomization further reduces variance at the cost of a small increase in bias. In practice Extra-Trees often match random forests on accuracy while training faster, since random thresholds avoid the per-feature sort that dominates standard split search. They are available in scikit-learn as ExtraTreesClassifier and ExtraTreesRegressor.
Gradient boosting was formalized by Jerome Friedman in 2001 in the paper "Greedy Function Approximation: A Gradient Boosting Machine". It generalizes the AdaBoost algorithm of Freund and Schapire (1995) to arbitrary differentiable loss functions. The training loop is sequential rather than parallel:
Friedman's MART (Multiple Additive Regression Trees) and Stochastic Gradient Boosting (2002) added subsampling of training rows to inject randomness, in the spirit of bagging.
Three modern open-source libraries dominate the boosted-trees landscape:
| Library | First release | Origin | Distinctive ideas |
|---|---|---|---|
| XGBoost | 2014 | Tianqi Chen, University of Washington | Second-order gradient + Hessian split formula, regularization in the objective, sparsity-aware split finding, exact and approximate histogram methods, GPU support |
| LightGBM | 2016 | Microsoft Research | Histogram-based splits, leaf-wise tree growth instead of level-wise, gradient-based one-side sampling (GOSS), exclusive feature bundling (EFB), native categorical splits |
| CatBoost | 2017 | Yandex | Ordered boosting to fight target leakage in categorical encodings, ordered target statistics for categorical features, symmetric (oblivious) trees, strong defaults |
All three reduce to the same mathematical core (gradient boosting of regression trees) but differ in growth strategy, categorical handling, and engineering. Benchmarks across academic papers and Kaggle competitions show they trade the lead frequently and are usually within a few tenths of a percent of one another on a given dataset.
Gradient boosted trees are also commonly referred to as gradient boosted (decision) trees or GBT, and the boosted-trees method as a whole is known as GBDT (gradient boosted decision trees) or GBM (gradient boosting machines).
The two dominant forest paradigms differ in nearly every dimension. The contrast is the easiest way to remember when to reach for which.
| Dimension | Bagging (random forest, Extra-Trees) | Boosting (gradient boosted trees) |
|---|---|---|
| Tree training order | Parallel and independent | Sequential, each tree depends on the previous |
| Sampling | Bootstrap of rows + random subspace of features | Often subsamples rows and columns per tree, but no bootstrap |
| Tree depth | Deep, fully grown trees | Shallow, typically 4 to 8 levels |
| Aggregation | Equal-weight average or vote | Weighted sum with shrinkage |
| What it reduces | Variance of low-bias deep trees | Bias of weak shallow learners |
| Out-of-the-box accuracy | Strong with defaults | Strong with defaults but more sensitive to learning rate and tree count |
| Tuning cost | Low | Moderate to high |
| Risk profile | Hard to overfit by adding more trees | Can overfit if too many rounds without early stopping |
| Parallelism | Embarrassingly parallel across trees | Parallel within a tree (per-feature histograms) but sequential across trees |
A common rule of thumb on tabular supervised learning: start with a random forest at default settings to establish a quick baseline, then run a tuned gradient boosted trees model with early stopping for the production result. The gap is typically one to three percent of accuracy in favor of boosting once tuned, sometimes more on noisy data, and occasionally negligible on small or very clean problems.
Decision forests have a small but high-leverage set of hyperparameters. The names below are the conventional ones; each library uses slight variants.
| Hyperparameter | Typical name | Effect | Sensible starting range |
|---|---|---|---|
| Number of trees | n_estimators, num_trees, num_boost_round | More trees reduce variance for bagging and bias for boosting until diminishing returns | 100 to 2,000 for boosting, 100 to 1,000 for bagging |
| Maximum depth | max_depth | Deeper trees fit more interactions but may overfit | 6 to 16 for bagging, 4 to 8 for boosting |
| Minimum samples per split | min_samples_split, min_data_in_leaf | Larger values prevent splits on noise | 2 to 50 |
| Maximum features per split | max_features (sklearn), colsample_bytree (XGBoost / LightGBM) | Controls attribute sampling; $\sqrt{p}$ for classification and $p/3$ for regression are common defaults in random forests | $\sqrt{p}$ to $p$ |
| Learning rate | learning_rate, eta, shrinkage | Smaller is more accurate but slower to converge; only used in boosting | 0.01 to 0.3 |
| L1 / L2 regularization | reg_alpha, reg_lambda | Penalizes leaf weights to reduce overfitting | 0 to a few |
| Row subsample | subsample, bagging_fraction | Fraction of rows sampled per tree | 0.5 to 1.0 |
| Minimum child weight | min_child_weight | Minimum sum of Hessian per leaf in XGBoost-style boosting | 1 to 100 |
For boosting, the standard tuning workflow is to fix a small learning rate (for example 0.05), set the number of rounds high (for example 5,000), and use early stopping on a held-out validation set to pick the best round automatically. This trades extra training compute for an essentially tuning-free choice of num_trees.
Decision forests give two complementary measures of feature importances:
| Measure | Definition | Strengths | Weaknesses |
|---|---|---|---|
| Mean decrease in impurity (MDI) | Sum, across all trees and all splits, of the impurity reduction credited to a feature | Cheap; computed as a byproduct of training | Biased toward high-cardinality and continuous features; computed on training data so can reward overfitting |
| Permutation importance | Drop in a chosen metric (accuracy, AUC, MSE) when a feature's values are randomly shuffled in a held-out set | Works with any model and any metric; uses fresh data, so it estimates true predictive contribution | More expensive; can underestimate importance when features are correlated |
| SHAP values | Shapley value attributions to each feature for each prediction | Local and global, additive, model-agnostic, with fast tree-specific implementation | Mathematically richer but slower; needs a baseline distribution choice |
The permutation variable importances method was popularized by the original random forest paper and is the default recommendation in modern scikit-learn tutorials. SHAP values, introduced by Lundberg and Lee (2017), have become the standard for explaining individual boosted-tree predictions in finance and healthcare.
A recurring finding in the empirical literature is that, despite a decade of effort to build neural networks for tabular data, tree-based ensembles continue to win on most heterogeneous tabular benchmarks.
| Study | Year | Conclusion |
|---|---|---|
| Olson et al., "Data-driven advice for applying machine learning to bioinformatics problems" | 2018 | Random forest and gradient boosting outperform deep nets on PMLB benchmarks |
| Shwartz-Ziv and Armon, "Tabular data: Deep learning is not all you need" | 2021 | Across 11 datasets, XGBoost outperforms deep tabular models on most tasks; ensembles of XGBoost and deep models do best |
| Borisov et al., "Deep neural networks and tabular data: A survey" | 2022 | Neural tabular models lag GBDT in average accuracy and require far more tuning |
| Grinsztajn, Oyallon, Varoquaux, "Why do tree-based models still outperform deep learning on tabular data?" (NeurIPS 2022) | 2022 | On 45 medium-sized tabular datasets, tree-based models beat deep learning even after extensive tuning. The paper attributes the gap to neural-network rotational invariance, sensitivity to uninformative features, and difficulty learning irregular target functions |
The practical conclusions from this body of work are direct: for tabular supervised learning under five million rows or so, a well-tuned gradient boosted trees model is the right default, and the burden of proof rests on whoever proposes a deep network. Deep learning is the right tool when raw inputs are sequences, images, audio, or graphs whose structure benefits from learned representations.
A wide range of libraries implement decision forests. The most widely used today are:
| Library | Language / runtime | Notes |
|---|---|---|
| scikit-learn | Python (Cython core) | DecisionTreeClassifier, RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier. The HistGradientBoosting estimators (added in 0.21) are competitive with XGBoost and LightGBM on accuracy and speed for medium datasets |
| XGBoost | C++ core with Python, R, JVM, Julia bindings | The original modern boosted-trees library; widely deployed in production and on Kaggle |
| LightGBM | C++ core with Python, R, C# bindings | Histogram-based with leaf-wise growth; usually the fastest of the big three on large datasets |
| CatBoost | C++ core with Python, R, C# bindings, plus standalone CLI | Best out-of-the-box defaults, especially with raw categorical features |
| TensorFlow Decision Forests (TF-DF) | Python on top of the YDF C++ library | Brings random forests, Extra-Trees, and gradient boosted trees into the TensorFlow ecosystem; supports Keras pipelines and serving via TensorFlow Serving |
| YDF (Yggdrasil Decision Forests) | C++ and Python | Google's standalone successor to TF-DF; emphasizes reproducibility and serving size |
| H2O | Java with Python, R, REST APIs | Distributed random forest and GBM; AutoML pipelines on Hadoop and Spark clusters |
| Spark MLlib | Scala on Apache Spark | Distributed random forest, gradient boosted trees, and isolation forest implementations for very large datasets |
| Weka | Java | Classical research and teaching package; includes J48 (a C4.5 reimplementation) and many forest variants |
| Ranger | C++ with R and Python bindings | High-performance random forest and survival forest implementation popular in statistics and biomedical research |
For most production tabular work in Python, the practical choice is between scikit-learn HistGradientBoostingClassifier, XGBoost, LightGBM, and CatBoost. All four interoperate with the standard pandas / NumPy stack and produce models that can be exported to ONNX or to native serving formats.
Decision forests dominate three broad application areas in production machine learning.
Most business problems present as a table of mixed numeric and categorical features. Credit scoring, churn prediction, lead scoring, customer lifetime value estimation, demand forecasting, and price optimization all have this shape. Decision forests are typically the strongest single-model baseline and frequently the production model.
Banks and insurers use gradient boosted trees for credit-default prediction, transaction fraud detection, anti-money-laundering alert ranking, and pricing of insurance policies. Their interpretability via feature importances and SHAP values, combined with regulatory acceptance under model-risk frameworks, makes them an easier sell to compliance teams than opaque deep networks.
Learning-to-rank with gradient boosted trees, especially LambdaMART (Burges, 2010), powered Bing and Yahoo web search relevance for years and remains a workhorse for product search, ads ranking, and recommendation re-ranking. LightGBM and XGBoost both ship LambdaRank and LambdaMART objectives. Ad click-through-rate prediction also relies heavily on boosted trees, sometimes in hybrid stacks with a wide-and-deep neural component.
Isolation forests, introduced by Liu, Ting, and Zhou (2008), use random axis-aligned splits to isolate individual points and score anomalies by the depth required to isolate them. They are an unsupervised cousin of random forests and are widely used in fraud and intrusion detection.
The pages below cover individual concepts in this glossary in depth.