See also: Random Forest, Decision Tree, Ensemble Learning
A decision forest is a family of ensemble learning methods in machine learning that combine multiple decision trees to produce more accurate and stable predictions than any single tree. The term "decision forest" is often used interchangeably with "random forest," though it can also refer more broadly to any ensemble of decision trees, including gradient boosting ensembles, extremely randomized trees, and isolation forests. Decision forests are among the most widely used algorithms for classification and regression tasks, particularly on tabular and structured data.
The core idea behind decision forests is simple: individual decision trees are prone to overfitting and can produce high-variance predictions, but by training many trees and combining their outputs, the ensemble reduces variance while maintaining low bias. This principle, rooted in the bias-variance tradeoff, is what makes decision forests so effective in practice.
Imagine you have a hard question, like "Is this animal a cat or a dog?" Instead of asking just one friend, you ask 100 friends. Each friend looks at different clues (one looks at the ears, another at the tail, another at the fur). Then you count up all their answers and go with whatever most of them said. That is basically what a decision forest does. Each "friend" is a decision tree that looks at the data in a slightly different way. By combining all their answers together, the group almost always gets a better answer than any single friend would on their own.
The development of decision forests spans several decades and involves contributions from multiple researchers.
The concept of combining multiple decision trees into an ensemble was explored throughout the 1990s. In 1995, Tin Kam Ho introduced "random decision forests" at the Third International Conference on Document Analysis and Recognition (ICDAR). Ho's approach used the random subspace method, which constructs trees in randomly chosen feature subspaces. Trees built in different subspaces generalize in complementary ways, and Ho demonstrated that their combined classification accuracy improves monotonically as the forest grows. Ho published a more comprehensive treatment in IEEE Transactions on Pattern Analysis and Machine Intelligence in 1998.
Around the same time, Leo Breiman introduced bagging (bootstrap aggregating) in 1996. Bagging trains each model on a different bootstrap sample (a random sample drawn with replacement from the training data) and averages their predictions. Breiman showed that bagging is especially effective for unstable learners like decision trees, where small changes in training data can produce very different trees.
The modern formulation of the random forest algorithm was presented by Leo Breiman in his 2001 paper "Random Forests," published in the journal Machine Learning. This paper has become one of the most cited publications in all of statistics and machine learning. Breiman combined bagging with random feature selection at each split, building on Ho's random subspace method and work by Amit and Geman (1997) on randomized node optimization.
Breiman's 2001 paper introduced several concepts that remain central to the algorithm: the out-of-bag (OOB) error estimate, permutation-based measures of variable importance, and an analysis of generalization error in terms of the strength of the individual trees and the correlation between them.
Breiman proved that as the number of trees increases, the generalization error of a random forest converges to a limit and does not overfit, a property that distinguishes random forests from many other machine learning methods.
Several variants and extensions followed Breiman's work:
| Year | Method | Authors | Contribution |
|---|---|---|---|
| 1995 | Random decision forests | Tin Kam Ho | Introduced random subspace method for constructing decision tree ensembles |
| 1996 | Bagging | Leo Breiman | Bootstrap aggregating for variance reduction |
| 1999 | Stochastic gradient boosting | Jerome Friedman | Combined gradient boosting with randomized subsampling |
| 2001 | Random forest | Leo Breiman | Combined bagging with random feature selection; introduced OOB error |
| 2006 | Extremely randomized trees | Pierre Geurts, Damien Ernst, Louis Wehenkel | Randomized both feature selection and split thresholds |
| 2008 | Isolation forest | Fei Tony Liu, Kai Ming Ting, Zhi-Hua Zhou | Adapted decision tree ensembles for anomaly detection |
| 2014 | XGBoost | Tianqi Chen | Scalable, regularized gradient boosting system |
| 2017 | LightGBM | Microsoft Research | Leaf-wise tree growth with GOSS and EFB optimizations |
| 2017 | CatBoost | Yandex | Ordered boosting with native categorical feature support |
The most common type of decision forest, the random forest, uses bootstrap aggregating as its foundation. The process works as follows:

1. Draw B bootstrap samples from the training data (random samples of the same size as the training set, drawn with replacement).
2. Grow one decision tree per bootstrap sample. At each node, select a random subset of k features and choose the best split among only those features.
3. Aggregate the predictions of all B trees: majority vote for classification, averaging for regression.
The number of features considered at each split, k, is a key hyperparameter. Common defaults are k = sqrt(p) for classification and k = p/3 for regression, where p is the total number of features.
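As a minimal sketch of this procedure, the following uses scikit-learn's RandomForestClassifier on a synthetic dataset; the dataset and parameter values are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset; any tabular classification data works the same way.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees; each split considers only sqrt(p) randomly chosen features (k = sqrt(p)).
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

# Predictions are aggregated across trees (majority vote for classification).
print("Test accuracy:", forest.score(X_test, y_test))
```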
The mathematical intuition behind decision forests relies on the effect of averaging on variance. Consider a set of B trees, each producing a prediction. If the trees were independent with variance sigma-squared, the variance of their average would be sigma-squared / B, which decreases as more trees are added.
In practice, trees trained on bootstrap samples from the same dataset are not independent; they are correlated. Breiman showed that the generalization error of a random forest depends on two factors: the strength s of the individual trees (how accurate each tree is on its own) and the correlation rho between trees (how similar their errors are).
Breiman's bound states that the generalization error of the forest is at most rho * (1 - s^2) / s^2. This means that reducing the correlation between trees (by randomizing feature selection at each split) directly tightens the bound, even if it slightly reduces the accuracy of individual trees. This is why random forests use random feature subsets at each split rather than always choosing the globally best feature.
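In equation form (a standard identity for the average of identically distributed, pairwise-correlated variables, together with Breiman's bound as stated above):

```latex
% Variance of the average of B tree predictions T_b(x), each with variance \sigma^2
% and pairwise correlation \rho between trees:
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2

% As B grows, the second term vanishes, so the correlation \rho sets the floor on the
% achievable variance reduction. Breiman's bound on the generalization error PE^*
% in terms of the mean correlation \bar{\rho} and the strength s of the trees:
PE^* \le \frac{\bar{\rho}\,(1 - s^2)}{s^2}
```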
One of the most practical features of random forests is the out-of-bag (OOB) error estimate. Because each tree is trained on a bootstrap sample that leaves out roughly one-third of the data, each training sample is "out of bag" for a subset of the trees. The OOB prediction for a given sample is obtained by aggregating only the predictions of trees for which that sample was not used in training.
The OOB error estimate has been shown to be approximately as accurate as using a held-out test set of the same size as the training set. This eliminates the need for a separate validation set or cross-validation, making random forests particularly convenient when data is limited.
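A sketch of how the OOB estimate can be obtained in scikit-learn (the dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=6, random_state=0)

# oob_score=True scores each training sample using only the trees whose
# bootstrap sample did not contain it.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

# OOB accuracy, obtained without a separate validation set.
print("OOB accuracy:", forest.oob_score_)
```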
Decision forests provide two primary methods for measuring feature importance:
Impurity-based importance (Gini importance): For each feature, sum the total reduction in impurity (measured by Gini index or entropy for classification, or variance for regression) across all splits in all trees that use that feature. Features that produce larger reductions in impurity are considered more important. This method is fast to compute since it is a byproduct of the training process, but it can be biased toward features with many categories or high cardinality.
Permutation importance: For each feature, randomly shuffle (permute) its values in the OOB samples, then measure how much the OOB error increases. If a feature is genuinely useful, permuting its values should cause a noticeable increase in prediction error. If the feature is unimportant, permuting it will have little effect. Permutation importance is more reliable than impurity-based importance but more computationally expensive.
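Both measures are exposed in scikit-learn; the following sketch uses synthetic data and illustrative parameter values (note that scikit-learn's permutation_importance shuffles features on a supplied dataset, here a held-out split, rather than on the OOB samples):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Impurity-based (Gini) importance: a byproduct of training.
print("Impurity-based:", forest.feature_importances_)

# Permutation importance: shuffle each feature on the validation split
# and measure how much the score drops.
result = permutation_importance(forest, X_val, y_val, n_repeats=10, random_state=0)
print("Permutation:", result.importances_mean)
```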
The random forest is the most well-known type of decision forest. It uses bagging combined with random feature selection at each split, as described above. Random forests can handle both classification and regression tasks, are robust to noisy features, and require relatively little hyperparameter tuning. Their main limitation is that they can be slow to train and predict on very large datasets with many trees.
Proposed by Geurts, Ernst, and Wehenkel in 2006, extremely randomized trees (Extra-Trees) take the randomization of random forests one step further. In addition to randomly selecting a subset of features at each split, Extra-Trees also randomly choose the split threshold for each selected feature, rather than searching for the optimal threshold.
This additional randomization has two main effects: it further decorrelates the trees, reducing the variance of the ensemble at the cost of a small increase in bias, and it speeds up training, because no search over candidate split thresholds is performed.
Extra-Trees also differ from random forests in that they typically do not use bootstrap sampling; instead, each tree is trained on the entire dataset. The default number of features considered at each split is sqrt(p) for classification and p (all features) for regression.
In practice, Extra-Trees often achieve comparable or slightly better accuracy than random forests, especially on datasets with noisy features, because the added randomness acts as additional regularization.
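A minimal comparison sketch using scikit-learn (synthetic data; results will vary by dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)

# Extra-Trees: random split thresholds, no bootstrap sampling by default.
extra = ExtraTreesClassifier(n_estimators=200, random_state=0)
# Random forest: optimal thresholds within a random feature subset, bootstrap sampling.
forest = RandomForestClassifier(n_estimators=200, random_state=0)

for name, model in [("Extra-Trees", extra), ("Random forest", forest)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, round(scores.mean(), 4))
```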
While random forests and Extra-Trees use bagging (parallel ensemble construction), gradient boosting builds trees sequentially. Each new tree is trained to correct the errors (residuals) of the combined ensemble of all previous trees. This is achieved by fitting each new tree to the negative gradient of a differentiable loss function, effectively performing gradient descent in function space.
Jerome Friedman formalized this approach in his 2001 paper "Greedy Function Approximation: A Gradient Boosting Machine." The algorithm works as follows:

1. Initialize the ensemble with a constant prediction (for squared-error loss, the mean of the target values).
2. At each iteration, compute the negative gradient of the loss function at the current predictions (the pseudo-residuals).
3. Fit a new tree to the pseudo-residuals.
4. Add the new tree's predictions to the ensemble, scaled by a learning rate, and repeat.
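A minimal from-scratch sketch of this loop for squared-error regression, where the negative gradient is simply the residual (data and constants are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

learning_rate = 0.1
n_trees = 100

# Step 1: initialize the ensemble with a constant prediction.
prediction = np.full_like(y, y.mean())
trees = []

for _ in range(n_trees):
    # Step 2: negative gradient of squared-error loss = residual.
    residuals = y - prediction
    # Step 3: fit a shallow tree to the residuals.
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)
    # Step 4: add the new tree, scaled by the learning rate.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("Training MSE:", np.mean((y - prediction) ** 2))
```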
Gradient boosted trees tend to achieve higher accuracy than random forests on many tasks, but they are more sensitive to hyperparameter settings and more prone to overfitting if not properly regularized. They also cannot be parallelized as easily as random forests because each tree depends on the output of all previous trees.
Several high-performance implementations of gradient boosted decision trees have become standard tools in applied machine learning:
| Framework | Developer | Year | Key features |
|---|---|---|---|
| XGBoost | Tianqi Chen | 2014 | L1/L2 regularization, sparsity-aware splits, column block structure, approximate split finding |
| LightGBM | Microsoft | 2017 | Leaf-wise growth, Gradient-based One-Side Sampling (GOSS), Exclusive Feature Bundling (EFB), histogram-based splits |
| CatBoost | Yandex | 2017 | Ordered boosting to prevent target leakage, native categorical feature handling, symmetric tree growth |
XGBoost (Extreme Gradient Boosting) popularized gradient boosting in the data science community and became the dominant algorithm in competitive machine learning (Kaggle). It introduced regularization terms in the objective function to control model complexity.
LightGBM uses a leaf-wise tree growth strategy instead of the level-wise approach used by XGBoost. This means it expands the leaf with the largest potential loss reduction, which can lead to faster convergence and lower error. LightGBM's GOSS technique keeps samples with large gradients while randomly sampling those with small gradients, and EFB bundles mutually exclusive features to reduce dimensionality.
CatBoost addresses a subtle but important problem in gradient boosting: target leakage. Traditional gradient boosting computes residuals using the same data that trains the model, which can introduce bias. CatBoost's ordered boosting uses a permutation-driven approach to compute unbiased estimates. It also handles categorical features directly using ordered target statistics, eliminating the need for one-hot encoding or label encoding.
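For illustration, the three frameworks expose broadly similar scikit-learn-style estimator interfaces; the parameter values below are arbitrary, and each library offers many framework-specific options not shown here:

```python
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Roughly comparable gradient-boosting configurations across the three libraries.
models = {
    "XGBoost": XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=6),
    "LightGBM": LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=31),
    "CatBoost": CatBoostClassifier(iterations=500, learning_rate=0.05, depth=6, verbose=0),
}

# Each model is then trained and used the same way:
#     model.fit(X_train, y_train)
#     model.predict(X_test)
```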
The isolation forest, introduced by Liu, Ting, and Zhou in 2008, adapts the decision forest framework for anomaly detection. Instead of trying to model normal data, isolation forests explicitly isolate anomalous points.
The algorithm works by randomly selecting a feature and a split value, then recursively partitioning the data. Anomalies, being rare and different from normal observations, tend to be isolated in fewer splits (shorter path lengths from root to leaf). The anomaly score for each data point is based on its average path length across all trees in the forest.
Isolation forests have a training time complexity of O(n log n) and a prediction time complexity of O(log n), making them efficient for large datasets. The Extended Isolation Forest (EIF), proposed by Hariri, Kind, and Brunner in 2018, improved on the original by using random hyperplane cuts instead of axis-aligned splits, which produces better anomaly score maps for datasets with complex structures.
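A minimal anomaly-detection sketch using scikit-learn's implementation (synthetic data and an illustrative contamination rate):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Mostly "normal" points around the origin, plus a few scattered outliers.
normal = rng.normal(0, 1, size=(980, 2))
outliers = rng.uniform(-6, 6, size=(20, 2))
X = np.vstack([normal, outliers])

# contamination is the expected fraction of anomalies in the data.
iso = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
labels = iso.fit_predict(X)      # +1 for inliers, -1 for anomalies
scores = iso.score_samples(X)    # higher (less negative) means more normal

print("Flagged anomalies:", (labels == -1).sum())
```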
Decision forests have several hyperparameters that control their behavior. The table below lists the most common parameters for random forests (as implemented in Scikit-learn):
| Hyperparameter | Description | Typical default | Effect of increasing |
|---|---|---|---|
| n_estimators | Number of trees in the forest | 100 | Improves accuracy and stability; increases training time |
| max_depth | Maximum depth of each tree | None (unlimited) | Allows more complex trees; can increase overfitting |
| min_samples_split | Minimum samples required to split a node | 2 | Regularizes the model; reduces overfitting |
| min_samples_leaf | Minimum samples required at a leaf node | 1 | Regularizes the model; produces smoother predictions |
| max_features | Number of features considered at each split | sqrt(p) for classification; 1.0 (all features) for regression | Higher values increase correlation between trees; lower values increase diversity but may reduce individual tree accuracy |
| bootstrap | Whether to use bootstrap sampling | True | False trains each tree on the full dataset (as Extra-Trees do by default) |
| oob_score | Whether to compute OOB error | False | True enables built-in error estimation at little extra cost |
| max_samples | Size of each bootstrap sample | None (same size as training set) | Smaller values speed up training and increase tree diversity |
For random forests, the most impactful hyperparameters are generally n_estimators and max_features. A common approach is to set n_estimators as high as the computational budget allows (more trees do not cause overfitting, only longer training) and then tune max_features using the OOB error or cross-validation, as in the sketch below.
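A hedged sketch of OOB-based tuning of max_features in scikit-learn (synthetic data; candidate values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=30, n_informative=8, random_state=0)

# Fix a generous number of trees, then compare candidate max_features
# values by their OOB accuracy (no separate validation set needed).
for max_features in ["sqrt", "log2", 0.3, 0.6, 1.0]:
    forest = RandomForestClassifier(
        n_estimators=300, max_features=max_features,
        oob_score=True, random_state=0,
    )
    forest.fit(X, y)
    print(max_features, round(forest.oob_score_, 4))
```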
For gradient boosted trees, the learning rate and n_estimators interact strongly: a lower learning rate requires more trees but often produces better results. Regularization parameters like max_depth (typically 3 to 8 for boosting) and subsampling rate are also important.
The following table compares decision forests with several other common supervised learning methods:
| Property | Random forest | Gradient boosting | Decision tree | Logistic regression | Neural network |
|---|---|---|---|---|---|
| Ensemble type | Bagging | Boosting | Single model | Single model | Single model |
| Overfitting risk | Low | Moderate | High | Low | Moderate to high |
| Interpretability | Low | Low | High | High | Very low |
| Feature scaling required | No | No | No | Yes | Yes |
| Handles missing data | Yes (some implementations) | Yes (some implementations) | Yes | No | No |
| Training speed (large data) | Moderate | Slow | Fast | Fast | Slow |
| Tabular data performance | Very good | Excellent | Good | Good | Good (large data) |
| Unstructured data performance | Poor | Poor | Poor | Poor | Excellent |
| Extrapolation ability | Poor | Poor | Poor | Good | Good |
Decision forests are available in all major machine learning libraries:
| Library | Language | Classes / functions | Notes |
|---|---|---|---|
| Scikit-learn | Python | RandomForestClassifier, RandomForestRegressor, ExtraTreesClassifier, ExtraTreesRegressor, GradientBoostingClassifier, IsolationForest | Standard reference implementation; good for small to medium datasets |
| XGBoost | Python, R, C++, Java | xgb.XGBClassifier, xgb.XGBRegressor | High-performance gradient boosting; widely used in competitions |
| LightGBM | Python, R, C++ | lgb.LGBMClassifier, lgb.LGBMRegressor | Fast training on large datasets; histogram-based splits |
| CatBoost | Python, R, C++ | CatBoostClassifier, CatBoostRegressor | Best native categorical feature handling |
| ranger | R | ranger() | Fast C++ implementation of random forest for R |
| randomForest | R | randomForest() | Classic R implementation; slower than ranger on large datasets |
| Spark MLlib | Scala, Python, Java | RandomForestClassifier, GBTClassifier | Distributed training for very large datasets |
| TensorFlow Decision Forests | Python | tfdf.keras.RandomForestModel, tfdf.keras.GradientBoostedTreesModel | Integrates decision tree models into the TensorFlow ecosystem |
Decision forests are used across a wide range of domains:
Healthcare and biomedical research: Random forests are used for disease diagnosis, patient outcome prediction, biomarker discovery, and gene expression classification. For example, random forest models have been applied to predict COVID-19 severity using patient demographic and health data, achieving accuracy above 94% in some studies.
Finance: Applications include credit scoring (evaluating the likelihood that a loan applicant will default), fraud detection (identifying unusual transaction patterns), stock price prediction, and algorithmic trading. Random forests are popular in finance because they handle mixed feature types well and provide feature importance rankings that support regulatory explanations.
Remote sensing and ecology: Random forests are one of the most commonly used classifiers for land cover classification from satellite imagery. They are also used in species distribution modeling and biodiversity assessment.
Bioinformatics: Random forests have been widely adopted for protein function prediction, drug-target interaction prediction, and genomic variant classification.
Recommendation systems: Gradient boosted decision trees power many recommendation and ranking systems at large technology companies.
Anomaly detection: Isolation forests are used for fraud detection, network intrusion detection, manufacturing quality control, and identifying outliers in large datasets.
Natural language processing: While deep learning now dominates most NLP tasks, decision forests remain useful for text classification, spam detection, and sentiment analysis when combined with feature extraction techniques like TF-IDF or bag-of-words.
A recurring question in applied machine learning is whether deep learning has made decision forests obsolete. Research consistently shows that this is not the case for tabular data.
A 2022 benchmark study by Grinsztajn, Oyallon, and Varoquaux (published at NeurIPS) systematically compared tree-based models to deep learning on 45 tabular datasets of varying sizes. The study found that tree-based models (random forests and gradient boosted trees) outperformed neural networks on medium-sized datasets (around 10,000 samples), and that the difference persisted even after extensive hyperparameter tuning of the deep learning models.
The reasons for this advantage include the ability of tree ensembles to fit irregular, axis-aligned decision boundaries without assuming a smooth target function, their robustness to uninformative features, and their insensitivity to feature scaling and to heterogeneous mixes of numerical and categorical features.
However, on very large tabular datasets (millions of samples), deep learning approaches can close the gap or occasionally outperform tree-based methods, especially when data augmentation is used.
Random forests have several theoretically established properties:
Consistency: Under mild conditions, random forests are universally consistent, meaning their predictions converge to the true conditional expectation as the number of training samples and trees grows to infinity. This was formally proved by Biau, Devroye, and Lugosi (2008) for simplified versions and extended by Scornet, Biau, and Vert (2015) for Breiman's original algorithm.
Convergence of error: Breiman (2001) proved that as the number of trees B grows, the generalization error converges to a finite limit. Increasing B never causes the forest to overfit. This is in contrast to gradient boosting, where too many iterations can lead to overfitting.
Variable importance: Permutation-based variable importance in random forests has been shown to be a consistent measure of feature relevance under certain conditions, though it can be biased when features are correlated.