Machine learning terms/Decision Forests

See also: Machine learning terms

Decision forests are a family of machine learning models built from collections of decision trees. Instead of relying on a single tree, a decision forest trains many trees and combines their predictions, usually by averaging for regression tasks and by majority vote or class probability averaging for classification. The result is a model that is far more accurate, far more stable, and far less prone to overfitting than any individual tree.

The two dominant decision forest paradigms are random forests, which train many independent trees in parallel on bootstrap samples of the data, and gradient boosted trees, which train trees sequentially with each new tree correcting the residual errors of the ensemble built so far. Both approaches dominate practical work on tabular data. They power credit scoring at major banks, ad click-through prediction at search engines, learning-to-rank systems, fraud detection, churn prediction, and the majority of winning solutions on tabular Kaggle competitions.

This page is the gateway hub for decision-forest entries on the AI Wiki. It covers the history, the mathematics behind tree splitting, the major algorithm families, the leading software libraries, and a curated index of every decision-forest concept that has its own dedicated wiki page.

what a decision forest is

A decision forest is an ensemble learning method whose base learners are decision trees. At inference time the input traverses each tree along an inference path of conditions until it reaches a leaf, and the leaf values are aggregated across the forest to produce the final prediction. Decision forests are non-parametric, handle mixed numerical and categorical features natively, are scale-invariant for monotonic feature transformations, and require comparatively little feature engineering.

The broad appeal of decision forests rests on a short list of practical strengths:

Property	Why it matters
Strong accuracy on tabular data	Repeatedly beats deep learning baselines on heterogeneous tabular benchmarks
Native handling of categorical features	No mandatory one-hot encoding; algorithms such as LightGBM and CatBoost split on raw category values
Robust to outliers and feature scaling	Splits are based on order, not magnitude, so a single extreme value cannot distort the geometry of the input
Built-in feature importances	Direct measures of which inputs the model relies on most
Predictable training cost	Training time scales close to linearly with the number of trees and roughly with $n \log n$ in the number of training examples
Few brittle hyperparameters	Modern libraries train respectable models with default settings

Decision forests are weaker on problems where signal lives in high-dimensional structured inputs such as raw pixels, audio waveforms, or token sequences. Those domains belong to deep learning and convolutional or transformer architectures.

decision tree fundamentals

Every decision forest is built out of decision trees. A tree is a recursive partition of the input space. Each internal node holds a test that compares one or more features against a threshold or category set, splitting incoming examples into child branches. Each leaf stores a constant prediction, typically a class probability vector for classification or a numeric value for regression. The starting node of the tree is the root. The component that decides which condition to install at each node is called the splitter, and the operation it performs is a split.

the classical algorithm families

Three historical algorithm families established the recipes that modern forests still follow:

Algorithm	Year	Author	Key ideas
ID3	1986	J. Ross Quinlan	Multiway splits on categorical features, information gain (entropy reduction) as the splitting criterion, no pruning
C4.5	1993	J. Ross Quinlan	Successor to ID3. Adds support for numerical features, missing-value handling, gain ratio normalization, and post-pruning
CART (Classification and Regression Trees)	1984	Breiman, Friedman, Olshen, Stone	Strictly binary conditions at each node, Gini impurity for classification, mean squared error for regression, cost-complexity pruning

Most modern decision forest implementations are CART-style: they fit binary trees, choose splits greedily, and grow a forest of such trees. ID3 and C4.5 remain influential, especially in textbook treatments and in the older Weka ecosystem.

conditions and node tests

A condition, also called a test, is the question installed at a node. The major condition types are:

Type	Form	Notes
Binary condition	Two child branches	The default in CART, random forest, and gradient boosted trees
Non-binary condition	More than two children	Used by ID3 and C4.5 on categorical features
Axis-aligned condition	Compares a single feature against a threshold	Standard in nearly every production forest
Oblique condition	Linear combination of features compared against a threshold	More expressive but more expensive; appears in oblique random forests
In-set condition	Tests whether a categorical feature lies in a learned subset	The native categorical split used by LightGBM and CatBoost

The threshold used in an axis-aligned numerical condition is selected by the splitter to minimize an impurity measure on the resulting children. For categorical features the splitter searches a partition of category values; for high-cardinality categoricals modern libraries use efficient sorting-based heuristics rather than exhaustive search.

splitting criteria

A split is chosen by scoring every candidate condition with an impurity or loss function and picking the condition that reduces the score the most.

gini impurity

Gini impurity is the probability that a randomly chosen example would be misclassified if labelled by the class distribution at the node:

$$\text{Gini}(t) = 1 - \sum_{k=1}^{K} p_k^2$$

where $p_k$ is the proportion of class $k$ in node $t$. Gini is zero for a pure node and reaches its maximum when classes are perfectly balanced. It is the default split criterion in scikit-learn's DecisionTreeClassifier and RandomForestClassifier.

entropy and information gain

Shannon entropy measures uncertainty in the class distribution at a node:

$$H(t) = -\sum_{k=1}^{K} p_k \log_2 p_k$$

The corresponding split score is information gain, the reduction in entropy after splitting:

$$\text{IG}(t, s) = H(t) - \sum_{c \in \text{children}(s)} \frac{|c|}{|t|} H(c)$$

ID3 selects the split with maximum information gain. C4.5 uses gain ratio, a normalized version that penalizes splits with many children. In practice, Gini and entropy almost always agree on which split is best; differences in tree structure are small and rarely change downstream accuracy.

regression splits

For regression trees, the impurity at a node is usually the mean squared error of the leaf prediction. The split that minimizes the post-split sum of squared errors is selected. Mean absolute error and Huber loss are also available in major libraries for problems with heavy-tailed targets.

loss-based splits in boosting

In gradient boosting, trees are fit to the negative gradient of a loss function, so the splitting criterion comes from a second-order Taylor expansion of that loss rather than from a fixed impurity. XGBoost popularized this view with its gain formula

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$

where $G$ and $H$ are sums of first and second derivatives in the left and right children and $\lambda$, $\gamma$ are regularization parameters.

random forests

Random forest was introduced by Leo Breiman in 2001. It combines two ideas:

Bagging, short for bootstrap aggregating, also due to Breiman in 1996. Each tree is trained on a bootstrap sample drawn by sampling with replacement from the original training set. Different bootstrap samples produce different trees; averaging their predictions reduces variance.
Random subspace, also called feature attribute sampling. At each split, only a random subset of features is considered as candidates. This decorrelates the trees so that the variance reduction from averaging is closer to the ideal $1/T$ for $T$ trees.

Because each bootstrap sample omits roughly $1 - (1 - 1/n)^n \approx 1/e \approx 36.8%$ of the training examples, random forests support out-of-bag evaluation: each example is scored by the trees that did not see it during training, giving a near-unbiased generalization estimate without a separate validation set.

A random forest also has a built-in measure of variable importances. Two variants are widely used: mean decrease in impurity (MDI), which sums the impurity reductions a feature contributes across all trees, and permutation variable importances, which measures the drop in accuracy when a feature's values are randomly permuted. Permutation importance is generally preferred because MDI is biased toward features with many possible split points or high cardinality.

The success of random forests is sometimes described in terms of the wisdom of the crowd: independent learners that each do better than chance, when aggregated, produce far stronger predictions than any single learner. Galton's 1906 ox-weight contest is the canonical illustration.

extremely randomized trees

Extremely Randomized Trees, often abbreviated Extra-Trees, were introduced by Pierre Geurts, Damien Ernst, and Louis Wehenkel in 2006. They differ from random forests in two ways:

The full training set is used to grow each tree, with no bootstrap step.
At each node, the threshold for each candidate feature is drawn at random rather than chosen to maximize impurity reduction. The split is then selected as the best among these random candidates.

The extra randomization further reduces variance at the cost of a small increase in bias. In practice Extra-Trees often match random forests on accuracy while training faster, since random thresholds avoid the per-feature sort that dominates standard split search. They are available in scikit-learn as ExtraTreesClassifier and ExtraTreesRegressor.

gradient boosting and modern boosted trees

Gradient boosting was formalized by Jerome Friedman in 2001 in the paper "Greedy Function Approximation: A Gradient Boosting Machine". It generalizes the AdaBoost algorithm of Freund and Schapire (1995) to arbitrary differentiable loss functions. The training loop is sequential rather than parallel:

Initialize the model with a constant prediction.
At each round $m = 1, \dots, M$:
- Compute the negative gradient of the loss with respect to the current predictions; this is the pseudo-residual.
- Fit a small regression tree to the pseudo-residuals.
- Add the new tree to the ensemble, scaled by a learning rate $\eta$ (also called shrinkage).
Stop when validation loss stops improving.

Friedman's MART (Multiple Additive Regression Trees) and Stochastic Gradient Boosting (2002) added subsampling of training rows to inject randomness, in the spirit of bagging.

Three modern open-source libraries dominate the boosted-trees landscape:

Library	First release	Origin	Distinctive ideas
XGBoost	2014	Tianqi Chen, University of Washington	Second-order gradient + Hessian split formula, regularization in the objective, sparsity-aware split finding, exact and approximate histogram methods, GPU support
LightGBM	2016	Microsoft Research	Histogram-based splits, leaf-wise tree growth instead of level-wise, gradient-based one-side sampling (GOSS), exclusive feature bundling (EFB), native categorical splits
CatBoost	2017	Yandex	Ordered boosting to fight target leakage in categorical encodings, ordered target statistics for categorical features, symmetric (oblivious) trees, strong defaults

All three reduce to the same mathematical core (gradient boosting of regression trees) but differ in growth strategy, categorical handling, and engineering. Benchmarks across academic papers and Kaggle competitions show they trade the lead frequently and are usually within a few tenths of a percent of one another on a given dataset.

Gradient boosted trees are also commonly referred to as gradient boosted (decision) trees or GBT, and the boosted-trees method as a whole is known as GBDT (gradient boosted decision trees) or GBM (gradient boosting machines).

bagging vs boosting

The two dominant forest paradigms differ in nearly every dimension. The contrast is the easiest way to remember when to reach for which.

Dimension	Bagging (random forest, Extra-Trees)	Boosting (gradient boosted trees)
Tree training order	Parallel and independent	Sequential, each tree depends on the previous
Sampling	Bootstrap of rows + random subspace of features	Often subsamples rows and columns per tree, but no bootstrap
Tree depth	Deep, fully grown trees	Shallow, typically 4 to 8 levels
Aggregation	Equal-weight average or vote	Weighted sum with shrinkage
What it reduces	Variance of low-bias deep trees	Bias of weak shallow learners
Out-of-the-box accuracy	Strong with defaults	Strong with defaults but more sensitive to learning rate and tree count
Tuning cost	Low	Moderate to high
Risk profile	Hard to overfit by adding more trees	Can overfit if too many rounds without early stopping
Parallelism	Embarrassingly parallel across trees	Parallel within a tree (per-feature histograms) but sequential across trees

A common rule of thumb on tabular supervised learning: start with a random forest at default settings to establish a quick baseline, then run a tuned gradient boosted trees model with early stopping for the production result. The gap is typically one to three percent of accuracy in favor of boosting once tuned, sometimes more on noisy data, and occasionally negligible on small or very clean problems.

key hyperparameters

Decision forests have a small but high-leverage set of hyperparameters. The names below are the conventional ones; each library uses slight variants.

Hyperparameter	Typical name	Effect	Sensible starting range
Number of trees	`n_estimators`, `num_trees`, `num_boost_round`	More trees reduce variance for bagging and bias for boosting until diminishing returns	100 to 2,000 for boosting, 100 to 1,000 for bagging
Maximum depth	`max_depth`	Deeper trees fit more interactions but may overfit	6 to 16 for bagging, 4 to 8 for boosting
Minimum samples per split	`min_samples_split`, `min_data_in_leaf`	Larger values prevent splits on noise	2 to 50
Maximum features per split	`max_features` (sklearn), `colsample_bytree` (XGBoost / LightGBM)	Controls attribute sampling; $\sqrt{p}$ for classification and $p/3$ for regression are common defaults in random forests	$\sqrt{p}$ to $p$
Learning rate	`learning_rate`, `eta`, `shrinkage`	Smaller is more accurate but slower to converge; only used in boosting	0.01 to 0.3
L1 / L2 regularization	`reg_alpha`, `reg_lambda`	Penalizes leaf weights to reduce overfitting	0 to a few
Row subsample	`subsample`, `bagging_fraction`	Fraction of rows sampled per tree	0.5 to 1.0
Minimum child weight	`min_child_weight`	Minimum sum of Hessian per leaf in XGBoost-style boosting	1 to 100

For boosting, the standard tuning workflow is to fix a small learning rate (for example 0.05), set the number of rounds high (for example 5,000), and use early stopping on a held-out validation set to pick the best round automatically. This trades extra training compute for an essentially tuning-free choice of num_trees.

feature importances: MDI vs permutation

Decision forests give two complementary measures of feature importances:

Measure	Definition	Strengths	Weaknesses
Mean decrease in impurity (MDI)	Sum, across all trees and all splits, of the impurity reduction credited to a feature	Cheap; computed as a byproduct of training	Biased toward high-cardinality and continuous features; computed on training data so can reward overfitting
Permutation importance	Drop in a chosen metric (accuracy, AUC, MSE) when a feature's values are randomly shuffled in a held-out set	Works with any model and any metric; uses fresh data, so it estimates true predictive contribution	More expensive; can underestimate importance when features are correlated
SHAP values	Shapley value attributions to each feature for each prediction	Local and global, additive, model-agnostic, with fast tree-specific implementation	Mathematically richer but slower; needs a baseline distribution choice

The permutation variable importances method was popularized by the original random forest paper and is the default recommendation in modern scikit-learn tutorials. SHAP values, introduced by Lundberg and Lee (2017), have become the standard for explaining individual boosted-tree predictions in finance and healthcare.

why tree models still beat deep learning on tabular data

A recurring finding in the empirical literature is that, despite a decade of effort to build neural networks for tabular data, tree-based ensembles continue to win on most heterogeneous tabular benchmarks.

Study	Year	Conclusion
Olson et al., "Data-driven advice for applying machine learning to bioinformatics problems"	2018	Random forest and gradient boosting outperform deep nets on PMLB benchmarks
Shwartz-Ziv and Armon, "Tabular data: Deep learning is not all you need"	2021	Across 11 datasets, XGBoost outperforms deep tabular models on most tasks; ensembles of XGBoost and deep models do best
Borisov et al., "Deep neural networks and tabular data: A survey"	2022	Neural tabular models lag GBDT in average accuracy and require far more tuning
Grinsztajn, Oyallon, Varoquaux, "Why do tree-based models still outperform deep learning on tabular data?" (NeurIPS 2022)	2022	On 45 medium-sized tabular datasets, tree-based models beat deep learning even after extensive tuning. The paper attributes the gap to neural-network rotational invariance, sensitivity to uninformative features, and difficulty learning irregular target functions

The practical conclusions from this body of work are direct: for tabular supervised learning under five million rows or so, a well-tuned gradient boosted trees model is the right default, and the burden of proof rests on whoever proposes a deep network. Deep learning is the right tool when raw inputs are sequences, images, audio, or graphs whose structure benefits from learned representations.

implementations and ecosystems

A wide range of libraries implement decision forests. The most widely used today are:

Library	Language / runtime	Notes
scikit-learn	Python (Cython core)	`DecisionTreeClassifier`, `RandomForestClassifier`, `ExtraTreesClassifier`, `GradientBoostingClassifier`, `HistGradientBoostingClassifier`. The HistGradientBoosting estimators (added in 0.21) are competitive with XGBoost and LightGBM on accuracy and speed for medium datasets
XGBoost	C++ core with Python, R, JVM, Julia bindings	The original modern boosted-trees library; widely deployed in production and on Kaggle
LightGBM	C++ core with Python, R, C# bindings	Histogram-based with leaf-wise growth; usually the fastest of the big three on large datasets
CatBoost	C++ core with Python, R, C# bindings, plus standalone CLI	Best out-of-the-box defaults, especially with raw categorical features
TensorFlow Decision Forests (TF-DF)	Python on top of the YDF C++ library	Brings random forests, Extra-Trees, and gradient boosted trees into the TensorFlow ecosystem; supports Keras pipelines and serving via TensorFlow Serving
YDF (Yggdrasil Decision Forests)	C++ and Python	Google's standalone successor to TF-DF; emphasizes reproducibility and serving size
H2O	Java with Python, R, REST APIs	Distributed random forest and GBM; AutoML pipelines on Hadoop and Spark clusters
Spark MLlib	Scala on Apache Spark	Distributed random forest, gradient boosted trees, and isolation forest implementations for very large datasets
Weka	Java	Classical research and teaching package; includes J48 (a C4.5 reimplementation) and many forest variants
Ranger	C++ with R and Python bindings	High-performance random forest and survival forest implementation popular in statistics and biomedical research

For most production tabular work in Python, the practical choice is between scikit-learn HistGradientBoostingClassifier, XGBoost, LightGBM, and CatBoost. All four interoperate with the standard pandas / NumPy stack and produce models that can be exported to ONNX or to native serving formats.

use cases

Decision forests dominate three broad application areas in production machine learning.

tabular data and structured prediction

Most business problems present as a table of mixed numeric and categorical features. Credit scoring, churn prediction, lead scoring, customer lifetime value estimation, demand forecasting, and price optimization all have this shape. Decision forests are typically the strongest single-model baseline and frequently the production model.

finance and risk

Banks and insurers use gradient boosted trees for credit-default prediction, transaction fraud detection, anti-money-laundering alert ranking, and pricing of insurance policies. Their interpretability via feature importances and SHAP values, combined with regulatory acceptance under model-risk frameworks, makes them an easier sell to compliance teams than opaque deep networks.

ranking and recommendation

Learning-to-rank with gradient boosted trees, especially LambdaMART (Burges, 2010), powered Bing and Yahoo web search relevance for years and remains a workhorse for product search, ads ranking, and recommendation re-ranking. LightGBM and XGBoost both ship LambdaRank and LambdaMART objectives. Ad click-through-rate prediction also relies heavily on boosted trees, sometimes in hybrid stacks with a wide-and-deep neural component.

isolation forests for anomaly detection

Isolation forests, introduced by Liu, Ting, and Zhou (2008), use random axis-aligned splits to isolate individual points and score anomalies by the depth required to isolate them. They are an unsupervised cousin of random forests and are widely used in fraud and intrusion detection.

index of decision-forest term wiki pages

The pages below cover individual concepts in this glossary in depth.

references

Breiman, L. (1996). "Bagging predictors". Machine Learning, 24(2), 123 to 140.
Breiman, L. (2001). "Random Forests". Machine Learning, 45(1), 5 to 32.
Breiman, L., Friedman, J., Olshen, R., Stone, C. (1984). Classification and Regression Trees. Wadsworth.
Quinlan, J. R. (1986). "Induction of decision trees". Machine Learning, 1(1), 81 to 106.
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Friedman, J. H. (2001). "Greedy function approximation: A gradient boosting machine". Annals of Statistics, 29(5), 1189 to 1232.
Friedman, J. H. (2002). "Stochastic gradient boosting". Computational Statistics and Data Analysis, 38(4), 367 to 378.
Freund, Y., Schapire, R. E. (1995). "A decision-theoretic generalization of on-line learning and an application to boosting". In European Conference on Computational Learning Theory.
Geurts, P., Ernst, D., Wehenkel, L. (2006). "Extremely randomized trees". Machine Learning, 63(1), 3 to 42.
Chen, T., Guestrin, C. (2016). "XGBoost: A scalable tree boosting system". Proceedings of KDD 2016.
Ke, G., Meng, Q., Finley, T., et al. (2017). "LightGBM: A highly efficient gradient boosting decision tree". NeurIPS 2017.
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., Gulin, A. (2018). "CatBoost: unbiased boosting with categorical features". NeurIPS 2018.
Liu, F. T., Ting, K. M., Zhou, Z.-H. (2008). "Isolation forest". ICDM 2008.
Burges, C. J. C. (2010). "From RankNet to LambdaRank to LambdaMART: An overview". Microsoft Research Technical Report MSR-TR-2010-82.
Lundberg, S. M., Lee, S.-I. (2017). "A unified approach to interpreting model predictions". NeurIPS 2017.
Shwartz-Ziv, R., Armon, A. (2021). "Tabular data: Deep learning is not all you need". arXiv:2106.03253.
Borisov, V., Leemann, T., Sessler, K., et al. (2022). "Deep neural networks and tabular data: A survey". IEEE Transactions on Neural Networks and Learning Systems.
Grinsztajn, L., Oyallon, E., Varoquaux, G. (2022). "Why do tree-based models still outperform deep learning on tabular data?". NeurIPS 2022 Datasets and Benchmarks.