# Decision Forest

> Source: https://aiwiki.ai/wiki/decision_forest
> Updated: 2026-07-11
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Random Forest](/wiki/random_forest), [Decision Tree](/wiki/decision_tree), [Ensemble Learning](/wiki/ensemble_learning)*

A **decision forest** is a family of [ensemble learning](/wiki/ensemble_learning) methods in [machine learning](/wiki/machine_learning) that combine many [decision trees](/wiki/decision_tree) to produce more accurate and stable predictions than any single tree [1][2]. The term covers the [random forest](/wiki/random_forest) (the most common variant), gradient boosted decision trees (GBDT), extremely randomized trees, and isolation forests. Decision forests are among the most widely used algorithms for [classification](/wiki/classification) and [regression](/wiki/regression) on tabular and structured data, and as of 2022 tree-based ensembles still outperformed [deep learning](/wiki/deep_learning) on most medium-sized tabular problems [10].

In his foundational 2001 paper, Leo Breiman defined the canonical case precisely: "Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest" [1]. The two dominant ways to build that combination are [bagging](/wiki/bagging) (training trees in parallel on resampled data and averaging them, as in random forests) and [boosting](/wiki/boosting) (training trees sequentially so each corrects the previous ensemble, as in GBDT) [1][4].

The core idea is simple: individual decision trees are prone to [overfitting](/wiki/overfitting) and can produce high-variance predictions, but by training many trees and combining their outputs, the ensemble reduces variance while keeping bias low. This principle, rooted in the [bias-variance tradeoff](/wiki/bias_variance_tradeoff), is what makes decision forests so effective in practice.

## Explain like I'm 5 (ELI5)

Imagine you have a hard question, like "Is this animal a cat or a dog?" Instead of asking just one friend, you ask 100 friends. Each friend looks at different clues (one looks at the ears, another at the tail, another at the fur). Then you count up all their answers and go with whatever most of them said. That is basically what a decision forest does. Each "friend" is a decision tree that looks at the data in a slightly different way. By combining all their answers together, the group almost always gets a better answer than any single friend would on their own.

## What is a decision forest?

A decision forest is an ensemble of [decision trees](/wiki/decision_tree) whose individual predictions are aggregated (by voting for classification or averaging for regression) into a single, stronger prediction. "Decision forest" is an umbrella term: it is sometimes used as a synonym for "random forest," but more broadly it refers to any ensemble of trees, including [gradient boosting](/wiki/gradient_boosting) ensembles, extremely randomized trees, and isolation forests.

A single deep decision tree is a low-bias, high-variance model: it can fit training data very closely but is unstable, since small changes in the data can yield a very different tree. A decision forest deliberately trades a little of each tree's accuracy for diversity between trees, then averages that diversity away. The result is a model that is much more robust than any one of its components.

## Who invented decision forests?

The development of decision forests spans several decades and involves contributions from multiple researchers.

### Early foundations (1990s)

The concept of combining multiple decision trees into an ensemble was explored throughout the 1990s. In 1995, Tin Kam Ho introduced "random decision forests" at the Third International Conference on Document Analysis and Recognition (ICDAR) [2]. Ho's approach used the **random subspace method**, which constructs trees in randomly chosen feature subspaces. Trees built in different subspaces generalize in complementary ways, and Ho showed that their combined classification accuracy improves monotonically as the forest grows [2]. Ho published a more comprehensive treatment in IEEE Transactions on Pattern Analysis and Machine Intelligence in 1998 [3].

Around the same time, Leo Breiman introduced [bagging](/wiki/bagging) (bootstrap aggregating) in 1996 [14]. Bagging trains each model on a different bootstrap sample (a random sample drawn with replacement from the [training data](/wiki/training_data)) and averages their predictions. Breiman showed that bagging is especially effective for unstable learners like decision trees, where small changes in training data can produce very different trees [14].

### Breiman's random forests (2001)

The modern formulation of the random forest algorithm was presented by Leo Breiman in his 2001 paper "Random Forests," published in the journal Machine Learning [1]. Spanning pages 5 to 32 of volume 45, this paper has become one of the most cited publications in all of statistics and machine learning, with well over 100,000 citations. Breiman combined bagging with random feature selection at each split, building on Ho's random subspace method and on work by Amit and Geman (1997) on randomized node optimization.

Breiman's 2001 paper introduced several concepts that remain central to the algorithm [1]:

- **Out-of-bag (OOB) error estimation**: a built-in method for estimating generalization error without a separate [test set](/wiki/test_set)
- **Variable importance measures**: methods for ranking features by their contribution to predictive accuracy
- **Theoretical bounds**: an upper bound on the generalization error expressed in terms of the strength of individual trees and the correlation between them

Breiman proved that as the number of trees increases, the generalization error of a random forest converges to a limit and does not overfit [1], a property that distinguishes random forests from many other machine learning methods.

### Later developments

Several variants and extensions followed Breiman's work:

| Year | Method | Authors | Contribution |
|------|--------|---------|--------------|
| 1995 | Random decision forests | Tin Kam Ho | Introduced random subspace method for constructing [decision tree](/wiki/decision_tree) ensembles [2] |
| 1996 | [Bagging](/wiki/bagging) | Leo Breiman | Bootstrap aggregating for variance reduction [14] |
| 1999 | Stochastic gradient [boosting](/wiki/boosting) | Jerome Friedman | Combined [gradient boosting](/wiki/gradient_boosting) with randomized subsampling |
| 2001 | [Random forest](/wiki/random_forest) | Leo Breiman | Combined bagging with random feature selection; introduced OOB error [1] |
| 2006 | Extremely randomized trees | Pierre Geurts, Damien Ernst, Louis Wehenkel | Randomized both feature selection and split thresholds [5] |
| 2008 | Isolation forest | Fei Tony Liu, Kai Ming Ting, Zhi-Hua Zhou | Adapted [decision tree](/wiki/decision_tree) ensembles for [anomaly detection](/wiki/anomaly_detection) [6] |
| 2016 | XGBoost | Tianqi Chen, Carlos Guestrin | Scalable, regularized gradient [boosting](/wiki/boosting) system [7] |
| 2017 | LightGBM | Microsoft Research | Leaf-wise tree growth with GOSS and EFB optimizations [8] |
| 2018 | CatBoost | Yandex | Ordered boosting with native categorical feature support [9] |

## How do decision forests work?

### Bootstrap aggregating (bagging)

The most common type of decision forest, the random forest, uses bootstrap aggregating as its foundation. The process works as follows:

1. **Bootstrap sampling**: From a training dataset of N samples, draw N samples with replacement to create a bootstrap sample. On average, each bootstrap sample contains about 63.2% of the unique original samples (the remaining 36.8% are left out and form the "out-of-bag" set for that tree).
2. **Tree construction**: Grow a [decision tree](/wiki/decision_tree) on each bootstrap sample. At each node, instead of evaluating all available features to find the best split, randomly select a subset of k features and choose the best split among only those features.
3. **No pruning**: Trees are typically grown to their full depth (or until leaves contain a minimum number of samples), since overfitting at the individual tree level is counteracted by the averaging process.
4. **Aggregation**: Combine the predictions of all trees. For classification, use majority voting. For regression, average the predicted values.

The number of features considered at each split, k, is a key [hyperparameter](/wiki/hyperparameter). Common defaults are $$k = \sqrt{p}$$ for classification and $$k = p/3$$ for regression, where p is the total number of features.

### Why it works: variance reduction

The mathematical intuition behind decision forests relies on the effect of averaging on variance. Consider a set of B trees, each producing a prediction. If the trees were independent with variance $$\sigma^2$$, the variance of their average would be $$\sigma^2 / B$$, which decreases as more trees are added.

In practice, trees trained on bootstrap samples from the same dataset are not independent; they are correlated. Breiman showed that the generalization error of a random forest depends on two factors [1]:

- **Strength ($$s$$)**: the accuracy of the individual trees
- **Correlation ($$\rho$$)**: the average correlation between pairs of trees

The upper bound on generalization error is approximately $$\rho (1 - s^2) / s^2$$ [1]. This means that reducing the correlation between trees (by randomizing feature selection at each split) directly improves forest performance, even if it slightly reduces the accuracy of individual trees. This is why random forests use random feature subsets at each split rather than always choosing the globally best feature.

### Out-of-bag error estimation

One of the most practical features of random forests is the out-of-bag (OOB) error estimate. Because each tree is trained on a bootstrap sample that leaves out roughly one-third of the data, each training sample is "out of bag" for a subset of the trees. The OOB prediction for a given sample is obtained by aggregating only the predictions of trees for which that sample was not used in training.

The OOB error estimate has been shown to be approximately as accurate as using a held-out [test set](/wiki/test_set) of the same size as the training set [1]. This eliminates the need for a separate [validation](/wiki/validation) set or [cross-validation](/wiki/cross-validation), making random forests particularly convenient when data is limited.

### Feature importance

Decision forests provide two primary methods for measuring feature importance:

**Impurity-based importance (Gini importance):** For each feature, sum the total reduction in impurity (measured by Gini index or entropy for classification, or variance for regression) across all splits in all trees that use that feature. Features that produce larger reductions in impurity are considered more important. This method is fast to compute since it is a byproduct of the training process, but it can be biased toward features with many categories or high cardinality.

**Permutation importance:** For each feature, randomly shuffle (permute) its values in the OOB samples, then measure how much the OOB error increases. If a feature is genuinely useful, permuting its values should cause a noticeable increase in prediction error. If the feature is unimportant, permuting it will have little effect. Permutation importance is more reliable than impurity-based importance but more computationally expensive.

## What are the types of decision forests?

### Random forests

The [random forest](/wiki/random_forest) is the most well-known type of decision forest. It uses bagging combined with random feature selection at each split, as described above. Random forests can handle both classification and regression tasks, are robust to noisy features, and require relatively little [hyperparameter tuning](/wiki/hyperparameter_tuning). Their main limitation is that they can be slow to train and predict on very large datasets with many trees.

### Extremely randomized trees (Extra-Trees)

Proposed by Geurts, Ernst, and Wehenkel in 2006, extremely randomized trees (Extra-Trees) take the randomization of random forests one step further [5]. In addition to randomly selecting a subset of features at each split, Extra-Trees also randomly choose the split threshold for each selected feature, rather than searching for the optimal threshold.

This additional randomization has two effects:

- **Reduced variance**: The extra randomness further decorrelates the trees, reducing ensemble variance.
- **Faster training**: Because the algorithm does not need to search for optimal split points, Extra-Trees can be trained significantly faster than standard random forests.

Extra-Trees also differ from random forests in that they typically do not use bootstrap sampling; instead, each tree is trained on the entire dataset [5]. The default number of features considered at each split is $$\sqrt{p}$$ for classification and $$p$$ (all features) for regression.

In practice, Extra-Trees often achieve comparable or slightly better accuracy than random forests, especially on datasets with noisy features, because the added randomness acts as additional regularization.

### Gradient boosted decision trees

While random forests and Extra-Trees use [bagging](/wiki/bagging) (parallel ensemble construction), [gradient boosting](/wiki/gradient_boosting) builds trees sequentially. Each new tree is trained to correct the errors (residuals) of the combined ensemble of all previous trees. This is achieved by fitting each new tree to the negative gradient of a differentiable [loss function](/wiki/loss_function), effectively performing gradient descent in function space.

Jerome Friedman formalized this approach in his 2001 paper "Greedy Function Approximation: A Gradient Boosting Machine" [4]. As Friedman framed it, "Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space" [4]. The algorithm works as follows:

1. Initialize the model with a constant prediction (for example, the mean of the target values).
2. For each iteration, compute the negative gradient of the loss function with respect to the current predictions (these are the "pseudo-residuals").
3. Fit a new decision tree to the pseudo-residuals.
4. Add the new tree to the ensemble, scaled by a learning rate.
5. Repeat until a stopping criterion is met.

Gradient boosted trees tend to achieve higher accuracy than random forests on many tasks, but they are more sensitive to [hyperparameter](/wiki/hyperparameter) settings and more prone to overfitting if not properly regularized. They also cannot be parallelized as easily as random forests because each tree depends on the output of all previous trees.

### Modern gradient boosting frameworks

Several high-performance implementations of gradient boosted decision trees have become standard tools in applied machine learning:

| Framework | Developer | Year | Key features |
|-----------|-----------|------|--------------|
| XGBoost | Tianqi Chen | 2014 (paper 2016) | L1/L2 regularization, sparsity-aware splits, column block structure, approximate split finding [7] |
| LightGBM | Microsoft | 2017 | Leaf-wise growth, Gradient-based One-Side Sampling (GOSS), Exclusive Feature Bundling (EFB), histogram-based splits [8] |
| CatBoost | Yandex | 2017 (paper 2018) | Ordered boosting to prevent target leakage, native categorical feature handling, symmetric tree growth [9] |

**XGBoost** (Extreme Gradient Boosting) popularized gradient boosting in the data science community and became the dominant algorithm in competitive machine learning, especially on Kaggle [7]. It introduced regularization terms in the objective function to control model complexity.

**LightGBM** uses a leaf-wise tree growth strategy instead of the level-wise approach used by XGBoost [8]. This means it expands the leaf with the largest potential loss reduction, which can lead to faster convergence and lower error. LightGBM's GOSS technique keeps samples with large gradients while randomly sampling those with small gradients, and EFB bundles mutually exclusive features to reduce dimensionality.

**CatBoost** addresses a subtle but important problem in gradient boosting: target leakage [9]. Traditional gradient boosting computes residuals using the same data that trains the model, which can introduce bias. CatBoost's ordered boosting uses a permutation-driven approach to compute unbiased estimates. It also handles categorical features directly using ordered target statistics, eliminating the need for one-hot encoding or label encoding.

### Isolation forests

The isolation forest, introduced by Liu, Ting, and Zhou in 2008, adapts the decision forest framework for [anomaly detection](/wiki/anomaly_detection) [6]. Instead of trying to model normal data, isolation forests explicitly isolate anomalous points.

The algorithm works by randomly selecting a feature and a split value, then recursively partitioning the data. Anomalies, being rare and different from normal observations, tend to be isolated in fewer splits (shorter path lengths from root to leaf). The anomaly score for each data point is based on its average path length across all trees in the forest.

Isolation forests have a training time complexity of $$O(n \log n)$$ and a prediction time complexity of $$O(\log n)$$, making them efficient for large datasets [6]. The Extended Isolation Forest (EIF), proposed by Hariri, Kind, and Brunner in 2018, improved on the original by using random hyperplane cuts instead of axis-aligned splits, which produces better anomaly score maps for datasets with complex structures [13].

## How do random forests and gradient boosted trees differ?

Random forests and gradient boosted decision trees (GBDT) are the two most important decision forests, and they take opposite approaches to building the ensemble. The table below summarizes the key differences.

| Property | Random forest | Gradient boosted trees |
|----------|---------------|------------------------|
| Ensemble strategy | [Bagging](/wiki/bagging): trees trained in parallel and averaged [1] | [Boosting](/wiki/boosting): trees trained sequentially, each correcting the last [4] |
| Each tree sees | A bootstrap resample of the data | The residuals/gradients of the current ensemble |
| Tree depth | Deep, fully grown (high variance, low bias) | Shallow (typically depth 3 to 8) |
| Overfitting | Adding trees never overfits; error converges [1] | Adding too many trees can overfit |
| Parallelism | Trees are independent, easily parallelized | Sequential; harder to parallelize |
| Tuning sensitivity | Low; works well near defaults | High; learning rate, depth, and tree count interact |
| Typical accuracy on tabular data | Very good | Often slightly higher, when well tuned |

In short, a random forest reduces variance by averaging many strong, decorrelated trees, while gradient boosting reduces bias by additively stacking many weak trees. Random forests are the safer default that needs little tuning; gradient boosted trees can reach higher accuracy but demand more careful regularization.

## Key hyperparameters

Decision forests have several [hyperparameters](/wiki/hyperparameter) that control their behavior. The table below lists the most common parameters for random forests (as implemented in [Scikit-learn](/wiki/scikit-learn)):

| Hyperparameter | Description | Typical default | Effect of increasing |
|----------------|-------------|-----------------|---------------------|
| n_estimators | Number of trees in the forest | 100 | Improves accuracy and stability; increases training time |
| max_depth | Maximum depth of each tree | None (unlimited) | Allows more complex trees; can increase [overfitting](/wiki/overfitting) |
| min_samples_split | Minimum samples required to split a node | 2 | Regularizes the model; reduces overfitting |
| min_samples_leaf | Minimum samples required at a leaf node | 1 | Regularizes the model; produces smoother predictions |
| max_features | Number of features considered at each split | $$\sqrt{p}$$ for classification, $$p/3$$ for [regression](/wiki/regression) | Lower values reduce tree correlation but may reduce individual tree accuracy |
| bootstrap | Whether to use bootstrap sampling | True | False gives Extra-Trees behavior |
| oob_score | Whether to compute OOB error | False | Enables built-in error estimation at no extra cost |
| max_samples | Size of each bootstrap sample | N (same as training set) | Smaller values speed up training and increase diversity |

### Tuning strategies

For random forests, the most impactful hyperparameters are generally n_estimators and max_features. A common approach is:

1. Start with the default settings and evaluate performance.
2. Increase n_estimators until the OOB error or cross-validation score plateaus.
3. Try different values of max_features (for example, $$\sqrt{p}$$, $$\log_2(p)$$, or $$p/3$$).
4. Adjust max_depth or min_samples_leaf if overfitting is observed.

For gradient boosted trees, the learning rate and n_estimators interact strongly: a lower learning rate requires more trees but often produces better results. Regularization parameters like max_depth (typically 3 to 8 for boosting) and subsampling rate are also important.

## What are the advantages and limitations?

### Advantages

- **High accuracy on tabular data**: Decision forests, particularly gradient boosted variants, consistently rank among the top-performing methods on structured and tabular datasets. A 2022 study by Grinsztajn et al. found that tree-based models still outperform [deep learning](/wiki/deep_learning) on typical tabular data [10].
- **Robustness**: Random forests are resistant to overfitting because the averaging process reduces variance. Adding more trees never hurts performance (though it increases computation) [1].
- **Built-in feature importance**: Decision forests provide feature importance scores without requiring additional analysis, which aids in [feature engineering](/wiki/feature_engineering) and model interpretation.
- **Minimal preprocessing**: Decision forests handle missing values (in some implementations), do not require feature scaling, and can work with both numerical and categorical features.
- **OOB error estimate**: Random forests provide a free, built-in estimate of generalization error through the OOB mechanism [1].
- **Parallelization**: In bagging-based forests, trees are independent and can be trained in parallel across multiple cores or machines.
- **Versatility**: Decision forests support classification, regression, multi-output tasks, and survival analysis.

### Limitations

- **Interpretability**: While individual [decision trees](/wiki/decision_tree) are highly interpretable, an ensemble of hundreds or thousands of trees is much harder to understand. Methods like SHAP (SHapley Additive exPlanations) and LIME can help, but the model is no longer a simple set of rules.
- **Memory and storage**: Each tree in the forest must be stored in memory. For forests with many deep trees, memory usage can be substantial.
- **Prediction latency**: Making predictions requires traversing every tree in the forest, which can be slow for real-time applications with strict latency requirements.
- **Extrapolation**: Decision trees (and by extension, decision forests) cannot extrapolate beyond the range of values seen in the training data. They always predict values within the range of the training targets, which can be a problem for regression tasks where the test data contains values outside the training distribution.
- **Performance on unstructured data**: Decision forests are designed for tabular data. For images, text, audio, and other unstructured data types, [neural networks](/wiki/neural_network) and [deep learning](/wiki/deep_learning) approaches generally perform much better.
- **Sequential training (boosting)**: Gradient boosted forests cannot be fully parallelized because each tree depends on the predictions of all previous trees.

## How do decision forests compare to other methods?

The following table compares decision forests with several other common [supervised learning](/wiki/supervised_learning) methods:

| Property | [Random forest](/wiki/random_forest) | [Gradient boosting](/wiki/gradient_boosting) | [Decision tree](/wiki/decision_tree) | [Logistic regression](/wiki/logistic_regression) | [Neural network](/wiki/neural_network) |
|----------|---------------|-------------------|---------------|----------------------|----------------|
| Ensemble type | [Bagging](/wiki/bagging) | [Boosting](/wiki/boosting) | Single model | Single model | Single model |
| Overfitting risk | Low | Moderate | High | Low | Moderate to high |
| Interpretability | Low | Low | High | High | Very low |
| Feature scaling required | No | No | No | Yes | Yes |
| Handles missing data | Yes (some implementations) | Yes (some implementations) | Yes | No | No |
| Training speed (large data) | Moderate | Slow | Fast | Fast | Slow |
| Tabular data performance | Very good | Excellent | Good | Good | Good (large data) |
| Unstructured data performance | Poor | Poor | Poor | Poor | Excellent |
| Extrapolation ability | Poor | Poor | Poor | Good | Good |

## What software libraries implement decision forests?

Decision forests are available in all major machine learning libraries:

| Library | Language | Classes / functions | Notes |
|---------|----------|---------------------|-------|
| [Scikit-learn](/wiki/scikit-learn) | Python | RandomForestClassifier, RandomForestRegressor, ExtraTreesClassifier, ExtraTreesRegressor, GradientBoostingClassifier, IsolationForest | Standard reference implementation; good for small to medium datasets |
| XGBoost | Python, R, C++, Java | xgb.XGBClassifier, xgb.XGBRegressor | High-performance gradient [boosting](/wiki/boosting); widely used in competitions [7] |
| LightGBM | Python, R, C++ | lgb.LGBMClassifier, lgb.LGBMRegressor | Fast training on large datasets; histogram-based splits [8] |
| CatBoost | Python, R, C++ | CatBoostClassifier, CatBoostRegressor | Best native categorical feature handling [9] |
| ranger | R | ranger() | Fast C++ implementation of [random forest](/wiki/random_forest) for R |
| randomForest | R | randomForest() | Classic R implementation; slower than ranger on large datasets |
| Spark MLlib | Scala, Python, Java | RandomForestClassifier, GBTClassifier | Distributed training for very large datasets |
| TensorFlow Decision Forests | Python | tfdf.keras.RandomForestModel, tfdf.keras.GradientBoostedTreesModel | Integrates [decision tree](/wiki/decision_tree) models into the TensorFlow ecosystem |

## What are decision forests used for?

Decision forests are used across a wide range of domains:

**Healthcare and biomedical research**: Random forests are used for disease diagnosis, patient outcome prediction, biomarker discovery, and gene expression classification. For example, random forest models have been applied to predict COVID-19 severity using patient demographic and health data, achieving accuracy above 94% in some studies.

**Finance**: Applications include credit scoring (evaluating the likelihood that a loan applicant will default), fraud detection (identifying unusual transaction patterns), stock price prediction, and algorithmic trading. Random forests are popular in finance because they handle mixed feature types well and provide feature importance rankings that support regulatory explanations.

**Remote sensing and ecology**: Random forests are one of the most commonly used classifiers for land cover classification from satellite imagery. They are also used in species distribution modeling and biodiversity assessment.

**Bioinformatics**: Random forests have been widely adopted for protein function prediction, drug-target interaction prediction, and genomic variant classification.

**Recommendation systems**: Gradient boosted decision trees power many recommendation and ranking systems at large technology companies.

**[Anomaly detection](/wiki/anomaly_detection)**: Isolation forests are used for fraud detection, network intrusion detection, manufacturing quality control, and identifying outliers in large datasets [6].

**[Natural language processing](/wiki/natural_language_processing)**: While [deep learning](/wiki/deep_learning) now dominates most NLP tasks, decision forests remain useful for text classification, spam detection, and [sentiment analysis](/wiki/sentiment_analysis) when combined with feature extraction techniques like TF-IDF or bag-of-words.

## Why are decision forests good for tabular data?

A recurring question in applied machine learning is whether [deep learning](/wiki/deep_learning) has made decision forests obsolete. Research consistently shows that this is not the case for tabular data.

A 2022 benchmark study by Grinsztajn, Oyallon, and Varoquaux (published at NeurIPS) systematically compared tree-based models to deep learning on a large suite of tabular datasets of varying sizes [10]. The authors concluded that "tree-based models remain state-of-the-art on medium-sized data (around 10K samples) even without accounting for their superior speed," and that the advantage persisted even after extensive hyperparameter tuning of the deep learning models [10].

The reasons for this advantage include [10]:

- Decision trees handle irregular, non-smooth decision boundaries naturally, while neural networks are biased toward smooth functions that may not fit tabular data well.
- Tabular data typically lacks the spatial or sequential structure that convolutional and recurrent neural networks exploit in images and text.
- Decision forests are relatively insensitive to uninformative features, while neural networks can be distracted by them.
- Random forests and boosted trees require less hyperparameter tuning and less training data to achieve good performance.

However, on very large tabular datasets (millions of samples), deep learning approaches can close the gap or occasionally outperform tree-based methods, especially when data augmentation is used.

## What theoretical guarantees do decision forests have?

Random forests have several theoretically established properties:

**Consistency**: Under mild conditions, random forests are universally consistent, meaning their predictions converge to the true conditional expectation as the number of training samples and trees grows to infinity. This was formally proved by Biau, Devroye, and Lugosi (2008) for simplified versions and extended by Scornet, Biau, and Vert (2015) for Breiman's original algorithm [12].

**Convergence of error**: Breiman (2001) proved that as the number of trees B grows, the generalization error converges to a finite limit, and increasing B never causes the forest to overfit [1]. This is in contrast to gradient boosting, where too many iterations can lead to overfitting.

**Variable importance**: Permutation-based variable importance in random forests has been shown to be a consistent measure of feature relevance under certain conditions, though it can be biased when features are correlated [11].

## References

1. Breiman, L. (2001). "Random Forests." *Machine Learning*, 45(1), 5-32. https://doi.org/10.1023/A:1010933404324

2. Ho, T. K. (1995). "Random Decision Forests." *Proceedings of the 3rd International Conference on Document Analysis and Recognition*, 278-282.

3. Ho, T. K. (1998). "The Random Subspace Method for Constructing Decision Forests." *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 20(8), 832-844.

4. Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." *The Annals of Statistics*, 29(5), 1189-1232.

5. Geurts, P., Ernst, D., & Wehenkel, L. (2006). "Extremely Randomized Trees." *Machine Learning*, 63(1), 3-42. https://doi.org/10.1007/s10994-006-6226-1

6. Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). "Isolation Forest." *Proceedings of the 8th IEEE International Conference on Data Mining*, 413-422.

7. Chen, T., & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 785-794.

8. Ke, G., Meng, Q., Finley, T., et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." *Advances in Neural Information Processing Systems*, 30.

9. Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., & Gulin, A. (2018). "CatBoost: Unbiased Boosting with Categorical Features." *Advances in Neural Information Processing Systems*, 31.

10. Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). "Why do tree-based models still outperform deep learning on typical tabular data?" *Advances in Neural Information Processing Systems*, 35 (Datasets and Benchmarks Track).

11. Biau, G., & Scornet, E. (2016). "A Random Forest Guided Tour." *TEST*, 25(2), 197-227. https://doi.org/10.1007/s11749-016-0481-7

12. Scornet, E., Biau, G., & Vert, J.-P. (2015). "Consistency of Random Forests." *The Annals of Statistics*, 43(4), 1716-1741.

13. Hariri, S., Kind, M. C., & Brunner, R. J. (2019). "Extended Isolation Forest." *IEEE Transactions on Knowledge and Data Engineering*, 33(4), 1479-1489.

14. Breiman, L. (1996). "Bagging Predictors." *Machine Learning*, 24(2), 123-140.