See also: Random Forest, Decision Tree, Ensemble Learning
A decision forest is a family of ensemble learning methods in machine learning that combine multiple decision trees to produce more accurate and stable predictions than any single tree. The term "decision forest" is often used interchangeably with "random forest," though it can also refer more broadly to any ensemble of decision trees, including gradient boosting ensembles, extremely randomized trees, and isolation forests. Decision forests are among the most widely used algorithms for classification and regression tasks, particularly on tabular and structured data.
The core idea behind decision forests is simple: individual decision trees are prone to overfitting and can produce high-variance predictions, but by training many trees and combining their outputs, the ensemble reduces variance while maintaining low bias. This principle, rooted in the bias-variance tradeoff, is what makes decision forests so effective in practice.
Imagine you have a hard question, like "Is this animal a cat or a dog?" Instead of asking just one friend, you ask 100 friends. Each friend looks at different clues (one looks at the ears, another at the tail, another at the fur). Then you count up all their answers and go with whatever most of them said. That is basically what a decision forest does. Each "friend" is a decision tree that looks at the data in a slightly different way. By combining all their answers together, the group almost always gets a better answer than any single friend would on their own.
The development of decision forests spans several decades and involves contributions from multiple researchers.
The concept of combining multiple decision trees into an ensemble was explored throughout the 1990s. In 1995, Tin Kam Ho introduced "random decision forests" at the Third International Conference on Document Analysis and Recognition (ICDAR). Ho's approach used the random subspace method, which constructs trees in randomly chosen feature subspaces. Trees built in different subspaces generalize in complementary ways, and Ho demonstrated that their combined classification accuracy improves monotonically as the forest grows. Ho published a more comprehensive treatment in IEEE Transactions on Pattern Analysis and Machine Intelligence in 1998.
Around the same time, Leo Breiman introduced bagging (bootstrap aggregating) in 1996. Bagging trains each model on a different bootstrap sample (a random sample drawn with replacement from the training data) and averages their predictions. Breiman showed that bagging is especially effective for unstable learners like decision trees, where small changes in training data can produce very different trees.
The modern formulation of the random forest algorithm was presented by Leo Breiman in his 2001 paper "Random Forests," published in the journal Machine Learning. This paper has become one of the most cited publications in all of statistics and machine learning. Breiman combined bagging with random feature selection at each split, building on Ho's random subspace method and work by Amit and Geman (1997) on randomized node optimization.
Breiman's 2001 paper introduced several concepts that remain central to the algorithm: the out-of-bag (OOB) error estimate, permutation-based measures of variable importance, and an analysis of generalization error in terms of the strength of the individual trees and the correlation between them.
Breiman proved that as the number of trees increases, the generalization error of a random forest converges to a limit and does not overfit, a property that distinguishes random forests from many other machine learning methods.
Several variants and extensions followed Breiman's work:
| Year | Method | Authors | Contribution |
|---|---|---|---|
| 1995 | Random decision forests | Tin Kam Ho | Introduced random subspace method for constructing decision tree ensembles |
| 1996 | Bagging | Leo Breiman | Bootstrap aggregating for variance reduction |
| 1999 | Stochastic gradient boosting | Jerome Friedman | Combined gradient boosting with randomized subsampling |
| 2001 | Random forest | Leo Breiman | Combined bagging with random feature selection; introduced OOB error |
| 2006 | Extremely randomized trees | Pierre Geurts, Damien Ernst, Louis Wehenkel | Randomized both feature selection and split thresholds |
| 2008 | Isolation forest | Fei Tony Liu, Kai Ming Ting, Zhi-Hua Zhou | Adapted decision tree ensembles for anomaly detection |
| 2014 | XGBoost | Tianqi Chen | Scalable, regularized gradient boosting system |
| 2017 | LightGBM | Microsoft Research | Leaf-wise tree growth with GOSS and EFB optimizations |
| 2017 | CatBoost | Yandex | Ordered boosting with native categorical feature support |
The most common type of decision forest, the random forest, uses bootstrap aggregating as its foundation. The process works as follows:

1. Draw B bootstrap samples from the training data (random samples of the same size as the training set, drawn with replacement).
2. Grow one decision tree per bootstrap sample. At each node, select a random subset of k features and choose the best split among only those features.
3. Aggregate the predictions of all B trees: majority vote for classification, averaging for regression.
The number of features considered at each split, k, is a key hyperparameter. Common defaults are k = sqrt(p) for classification and k = p/3 for regression, where p is the total number of features.
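As a minimal sketch of this procedure, the following uses scikit-learn's RandomForestClassifier on a synthetic dataset; the dataset and parameter values are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset; any tabular classification data works the same way.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees; each split considers only sqrt(p) randomly chosen features (k = sqrt(p)).
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

# Predictions are aggregated across trees (majority vote for classification).
print("Test accuracy:", forest.score(X_test, y_test))
```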
The mathematical intuition behind decision forests relies on the effect of averaging on variance. Consider a set of B trees, each producing a prediction. If the trees were independent with variance sigma-squared, the variance of their average would be sigma-squared / B, which decreases as more trees are added.
In practice, trees trained on bootstrap samples from the same dataset are not independent; they are correlated. Breiman showed that the generalization error of a random forest depends on two factors: the strength s of the individual trees (how accurate each tree is on its own) and the correlation rho between trees (how similar their errors are).
Breiman's bound states that the generalization error of the forest is at most rho * (1 - s^2) / s^2. This means that reducing the correlation between trees (by randomizing feature selection at each split) directly tightens the bound, even if it slightly reduces the accuracy of individual trees. This is why random forests use random feature subsets at each split rather than always choosing the globally best feature.
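In equation form (a standard identity for the average of identically distributed, pairwise-correlated variables, together with Breiman's bound as stated above):

```latex
% Variance of the average of B tree predictions T_b(x), each with variance \sigma^2
% and pairwise correlation \rho between trees:
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2

% As B grows, the second term vanishes, so the correlation \rho sets the floor on the
% achievable variance reduction. Breiman's bound on the generalization error PE^*
% in terms of the mean correlation \bar{\rho} and the strength s of the trees:
PE^* \le \frac{\bar{\rho}\,(1 - s^2)}{s^2}
```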
One of the most practical features of random forests is the out-of-bag (OOB) error estimate. Because each tree is trained on a bootstrap sample that leaves out roughly one-third of the data, each training sample is "out of bag" for a subset of the trees. The OOB prediction for a given sample is obtained by aggregating only the predictions of trees for which that sample was not used in training.
The OOB error estimate has been shown to be approximately as accurate as using a held-out test set of the same size as the training set. This eliminates the need for a separate validation set or cross-validation, making random forests particularly convenient when data is limited.
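A sketch of how the OOB estimate can be obtained in scikit-learn (the dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=6, random_state=0)

# oob_score=True scores each training sample using only the trees whose
# bootstrap sample did not contain it.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

# OOB accuracy, obtained without a separate validation set.
print("OOB accuracy:", forest.oob_score_)
```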
Decision forests provide two primary methods for measuring feature importance:
Impurity-based importance (Gini importance): For each feature, sum the total reduction in impurity (measured by Gini index or entropy for classification, or variance for regression) across all splits in all trees that use that feature. Features that produce larger reductions in impurity are considered more important. This method is fast to compute since it is a byproduct of the training process, but it can be biased toward features with many categories or high cardinality.
Permutation importance: For each feature, randomly shuffle (permute) its values in the OOB samples, then measure how much the OOB error increases. If a feature is genuinely useful, permuting its values should cause a noticeable increase in prediction error. If the feature is unimportant, permuting it will have little effect. Permutation importance is more reliable than impurity-based importance but more computationally expensive.
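Both measures are exposed in scikit-learn; the following sketch uses synthetic data and illustrative parameter values (note that scikit-learn's permutation_importance shuffles features on a supplied dataset, here a held-out split, rather than on the OOB samples):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Impurity-based (Gini) importance: a byproduct of training.
print("Impurity-based:", forest.feature_importances_)

# Permutation importance: shuffle each feature on the validation split
# and measure how much the score drops.
result = permutation_importance(forest, X_val, y_val, n_repeats=10, random_state=0)
print("Permutation:", result.importances_mean)
```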
The random forest is the most well-known type of decision forest. It uses bagging combined with random feature selection at each split, as described above. Random forests can handle both classification and regression tasks, are robust to noisy features, and require relatively little hyperparameter tuning. Their main limitation is that they can be slow to train and predict on very large datasets with many trees.
Proposed by Geurts, Ernst, and Wehenkel in 2006, extremely randomized trees (Extra-Trees) take the randomization of random forests one step further. In addition to randomly selecting a subset of features at each split, Extra-Trees also randomly choose the split threshold for each selected feature, rather than searching for the optimal threshold.
This additional randomization has two main effects: it further decorrelates the trees, reducing the variance of the ensemble at the cost of a small increase in bias, and it speeds up training, because no search over candidate split thresholds is performed.
Extra-Trees also differ from random forests in that they typically do not use bootstrap sampling; instead, each tree is trained on the entire dataset. The default number of features considered at each split is sqrt(p) for classification and p (all features) for regression.
In practice, Extra-Trees often achieve comparable or slightly better accuracy than random forests, especially on datasets with noisy features, because the added randomness acts as additional regularization.
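A minimal comparison sketch using scikit-learn (synthetic data; results will vary by dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)

# Extra-Trees: random split thresholds, no bootstrap sampling by default.
extra = ExtraTreesClassifier(n_estimators=200, random_state=0)
# Random forest: optimal thresholds within a random feature subset, bootstrap sampling.
forest = RandomForestClassifier(n_estimators=200, random_state=0)

for name, model in [("Extra-Trees", extra), ("Random forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 4))
```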
While random forests and Extra-Trees use bagging (parallel ensemble construction), gradient boosting builds trees sequentially. Each new tree is trained to correct the errors (residuals) of the combined ensemble of all previous trees. This is achieved by fitting each new tree to the negative gradient of a differentiable loss function, effectively performing gradient descent in function space.
Jerome Friedman formalized this approach in his 2001 paper "Greedy Function Approximation: A Gradient Boosting Machine." The algorithm works as follows:

1. Initialize the ensemble with a constant prediction (for squared-error loss, the mean of the target values).
2. At each iteration, compute the negative gradient of the loss function at the current predictions (the pseudo-residuals).
3. Fit a new tree to the pseudo-residuals.
4. Add the new tree's predictions to the ensemble, scaled by a learning rate, and repeat.
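A minimal from-scratch sketch of this loop for squared-error regression, where the negative gradient is simply the residual (data and constants are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

learning_rate = 0.1
n_trees = 100

# Step 1: initialize the ensemble with a constant prediction.
prediction = np.full_like(y, y.mean())
trees = []

for _ in range(n_trees):
    # Step 2: negative gradient of squared-error loss = residual.
    residuals = y - prediction
    # Step 3: fit a shallow tree to the residuals.
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)
    # Step 4: add the new tree, scaled by the learning rate.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("Training MSE:", np.mean((y - prediction) ** 2))
```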
Gradient boosted trees tend to achieve higher accuracy than random forests on many tasks, but they are more sensitive to hyperparameter settings and more prone to overfitting if not properly regularized. They also cannot be parallelized as easily as random forests because each tree depends on the output of all previous trees.
Several high-performance implementations of gradient boosted decision trees have become standard tools in applied machine learning:
| Framework | Developer | Year | Key features |
|---|---|---|---|
| XGBoost | Tianqi Chen | 2014 | L1/L2 regularization, sparsity-aware splits, column block structure, approximate split finding |
| LightGBM | Microsoft | 2017 | Leaf-wise growth, Gradient-based One-Side Sampling (GOSS), Exclusive Feature Bundling (EFB), histogram-based splits |
| CatBoost | Yandex | 2017 | Ordered boosting to prevent target leakage, native categorical feature handling, symmetric tree growth |
XGBoost (Extreme Gradient Boosting) popularized gradient boosting in the data science community and became the dominant algorithm in competitive machine learning (Kaggle). It introduced regularization terms in the objective function to control model complexity.
LightGBM uses a leaf-wise tree growth strategy instead of the level-wise approach used by XGBoost. This means it expands the leaf with the largest potential loss reduction, which can lead to faster convergence and lower error. LightGBM's GOSS technique keeps samples with large gradients while randomly sampling those with small gradients, and EFB bundles mutually exclusive features to reduce dimensionality.
CatBoost addresses a subtle but important problem in gradient boosting: target leakage. Traditional gradient boosting computes residuals using the same data that trains the model, which can introduce bias. CatBoost's ordered boosting uses a permutation-driven approach to compute unbiased estimates. It also handles categorical features directly using ordered target statistics, eliminating the need for one-hot encoding or label encoding.
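For illustration, the three frameworks expose broadly similar scikit-learn-style estimator interfaces; the parameter values below are arbitrary, and each library offers many framework-specific options not shown here:

```python
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Roughly comparable gradient-boosting configurations across the three libraries.
models = {
    "XGBoost": XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=6),
    "LightGBM": LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=31),
    "CatBoost": CatBoostClassifier(iterations=500, learning_rate=0.05, depth=6, verbose=0),
}

# Each model is then trained and used the same way:
#     model.fit(X_train, y_train)
#     model.predict(X_test)
```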
The isolation forest, introduced by Liu, Ting, and Zhou in 2008, adapts the decision forest framework for anomaly detection. Instead of trying to model normal data, isolation forests explicitly isolate anomalous points.
The algorithm works by randomly selecting a feature and a split value, then recursively partitioning the data. Anomalies, being rare and different from normal observations, tend to be isolated in fewer splits (shorter path lengths from root to leaf). The anomaly score for each data point is based on its average path length across all trees in the forest.
Isolation forests have a training time complexity of O(n log n) and a prediction time complexity of O(log n), making them efficient for large datasets. The Extended Isolation Forest (EIF), proposed by Hariri, Kind, and Brunner in 2018, improved on the original by using random hyperplane cuts instead of axis-aligned splits, which produces better anomaly score maps for datasets with complex structures.
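A minimal anomaly-detection sketch using scikit-learn's implementation (synthetic data and an illustrative contamination rate):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Mostly "normal" points around the origin, plus a few scattered outliers.
normal = rng.normal(0, 1, size=(980, 2))
outliers = rng.uniform(-6, 6, size=(20, 2))
X = np.vstack([normal, outliers])

# contamination is the expected fraction of anomalies in the data.
iso = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
labels = iso.fit_predict(X)      # +1 for inliers, -1 for anomalies
scores = iso.score_samples(X)    # higher (less negative) means more normal

print("Flagged anomalies:", (labels == -1).sum())
```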
Decision forests have several hyperparameters that control their behavior. The table below lists the most common parameters for random forests (as implemented in Scikit-learn):
| Hyperparameter | Description | Typical default | Effect of increasing |
|---|---|---|---|
| n_estimators | Number of trees in the forest | 100 | Improves accuracy and stability; increases training time |
| max_depth | Maximum depth of each tree | None (unlimited) | Allows more complex trees; can increase overfitting |
| min_samples_split | Minimum samples required to split a node | 2 | Regularizes the model; reduces overfitting |
| min_samples_leaf | Minimum samples required at a leaf node | 1 | Regularizes the model; produces smoother predictions |
| max_features | Number of features considered at each split | sqrt(p) for classification; 1.0 (all features) for regression | Higher values increase correlation between trees; lower values increase diversity but may reduce individual tree accuracy |
| bootstrap | Whether to use bootstrap sampling | True | False trains each tree on the full dataset (as Extra-Trees do by default) |
| oob_score | Whether to compute OOB error | False | True enables built-in error estimation at little extra cost |
| max_samples | Size of each bootstrap sample | None (same size as training set) | Smaller values speed up training and increase tree diversity |
For random forests, the most impactful hyperparameters are generally n_estimators and max_features. A common approach is to set n_estimators as high as the computational budget allows (more trees do not cause overfitting, only longer training) and then tune max_features using the OOB error or cross-validation, as in the sketch below.
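A hedged sketch of OOB-based tuning of max_features in scikit-learn (synthetic data; candidate values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=30, n_informative=8, random_state=0)

# Fix a generous number of trees, then compare candidate max_features
# values by their OOB accuracy (no separate validation set needed).
for max_features in ["sqrt", "log2", 0.3, 0.6, 1.0]:
    forest = RandomForestClassifier(
        n_estimators=300, max_features=max_features,
        oob_score=True, random_state=0,
    )
    forest.fit(X, y)
    print(max_features, round(forest.oob_score_, 4))
```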
For gradient boosted trees, the learning rate and n_estimators interact strongly: a lower learning rate requires more trees but often produces better results. Regularization parameters like max_depth (typically 3 to 8 for boosting) and subsampling rate are also important.
The following table compares decision forests with several other common supervised learning methods:
| Property | Random forest | Gradient boosting | Decision tree | Logistic regression | Neural network |
|---|---|---|---|---|---|
| Ensemble type | Bagging | Boosting | Single model | Single model | Single model |
| Overfitting risk | Low | Moderate | High | Low | Moderate to high |
| Interpretability | Low | Low | High | High | Very low |
| Feature scaling required | No | No | No | Yes | Yes |
| Handles missing data | Yes (some implementations) | Yes (some implementations) | Yes | No | No |
| Training speed (large data) | Moderate | Slow | Fast | Fast | Slow |
| Tabular data performance | Very good | Excellent | Good | Good | Good (large data) |
| Unstructured data performance | Poor | Poor | Poor | Poor | Excellent |
| Extrapolation ability | Poor | Poor | Poor | Good | Good |
Decision forests are available in all major machine learning libraries:
| Library | Language | Classes / functions | Notes |
|---|---|---|---|
| Scikit-learn | Python | RandomForestClassifier, RandomForestRegressor, ExtraTreesClassifier, ExtraTreesRegressor, GradientBoostingClassifier, IsolationForest | Standard reference implementation; good for small to medium datasets |
| XGBoost | Python, R, C++, Java | xgb.XGBClassifier, xgb.XGBRegressor | High-performance gradient boosting; widely used in competitions |
| LightGBM | Python, R, C++ | lgb.LGBMClassifier, lgb.LGBMRegressor | Fast training on large datasets; histogram-based splits |
| CatBoost | Python, R, C++ | CatBoostClassifier, CatBoostRegressor | Best native categorical feature handling |
| ranger | R | ranger() | Fast C++ implementation of random forest for R |
| randomForest | R | randomForest() | Classic R implementation; slower than ranger on large datasets |
| Spark MLlib | Scala, Python, Java | RandomForestClassifier, GBTClassifier | Distributed training for very large datasets |
| TensorFlow Decision Forests | Python | tfdf.keras.RandomForestModel, tfdf.keras.GradientBoostedTreesModel | Integrates decision tree models into the TensorFlow ecosystem |
Decision forests are used across a wide range of domains:
Healthcare and biomedical research: Random forests are used for disease diagnosis, patient outcome prediction, biomarker discovery, and gene expression classification. For example, random forest models have been applied to predict COVID-19 severity using patient demographic and health data, achieving accuracy above 94% in some studies.
Finance: Applications include credit scoring (evaluating the likelihood that a loan applicant will default), fraud detection (identifying unusual transaction patterns), stock price prediction, and algorithmic trading. Random forests are popular in finance because they handle mixed feature types well and provide feature importance rankings that support regulatory explanations.
Remote sensing and ecology: Random forests are one of the most commonly used classifiers for land cover classification from satellite imagery. They are also used in species distribution modeling and biodiversity assessment.
Bioinformatics: Random forests have been widely adopted for protein function prediction, drug-target interaction prediction, and genomic variant classification.
Recommendation systems: Gradient boosted decision trees power many recommendation and ranking systems at large technology companies.
Anomaly detection: Isolation forests are used for fraud detection, network intrusion detection, manufacturing quality control, and identifying outliers in large datasets.
Natural language processing: While deep learning now dominates most NLP tasks, decision forests remain useful for text classification, spam detection, and sentiment analysis when combined with feature extraction techniques like TF-IDF or bag-of-words.
A recurring question in applied machine learning is whether deep learning has made decision forests obsolete. Research consistently shows that this is not the case for tabular data.
A 2022 benchmark study by Grinsztajn, Oyallon, and Varoquaux (published at NeurIPS) systematically compared tree-based models to deep learning on 45 tabular datasets of varying sizes. The study found that tree-based models (random forests and gradient boosted trees) outperformed neural networks on medium-sized datasets (around 10,000 samples), and that the difference persisted even after extensive hyperparameter tuning of the deep learning models.
The reasons for this advantage include the ability of tree ensembles to fit irregular, axis-aligned decision boundaries without assuming a smooth target function, their robustness to uninformative features, and their insensitivity to feature scaling and to heterogeneous mixes of numerical and categorical features.
However, on very large tabular datasets (millions of samples), deep learning approaches can close the gap or occasionally outperform tree-based methods, especially when data augmentation is used.
Random forests have several theoretically established properties:
Consistency: Under mild conditions, random forests are universally consistent, meaning their predictions converge to the true conditional expectation as the number of training samples and trees grows to infinity. This was formally proved by Biau, Devroye, and Lugosi (2008) for simplified versions and extended by Scornet, Biau, and Vert (2015) for Breiman's original algorithm.
Convergence of error: Breiman (2001) proved that as the number of trees B grows, the generalization error converges to a finite limit. Increasing B never causes the forest to overfit. This is in contrast to gradient boosting, where too many iterations can lead to overfitting.
Variable importance: Permutation-based variable importance in random forests has been shown to be a consistent measure of feature relevance under certain conditions, though it can be biased when features are correlated.