Random forest is a versatile and widely used ensemble learning method in machine learning. It combines the predictions of many individual decision trees, each trained on a different random subset of the data using a technique called bootstrap aggregating (or bagging). At every node split during tree construction, only a randomly selected subset of features is considered, which decorrelates the trees and leads to stronger ensemble performance. This combination of bagging with random feature selection makes random forests robust to overfitting, effective in high-dimensional settings, and applicable to both classification and regression tasks.
Random forests are sometimes referred to more broadly as decision forests, particularly in the Microsoft Research literature. The method has become one of the most popular off-the-shelf learning algorithms due to its strong out-of-the-box performance, few assumptions about data distribution, and relatively low sensitivity to hyperparameter settings.
The intellectual roots of random forests trace back to several lines of research in the 1990s.
In 1995, Tin Kam Ho introduced the concept of random decision forests in a paper presented at the International Conference on Document Analysis and Recognition. Ho's approach built multiple decision trees in randomly selected subspaces of the feature space, demonstrating that forests of trees grown in random subspaces could achieve accuracy improvements as more trees were added, without overfitting. Ho expanded this work in 1998 with a paper on the random subspace method for constructing decision forests, published in IEEE Transactions on Pattern Analysis and Machine Intelligence. The key insight was that trees built in different random subspaces generalize in complementary ways, so their combined classification improves monotonically.
Independently, Yali Amit and Donald Geman (1997) introduced the idea of randomly sampling available features when splitting nodes during tree construction, an idea that influenced later work.
Leo Breiman synthesized these ideas and introduced the modern random forest algorithm in his influential 2001 paper "Random Forests," published in the journal Machine Learning. Breiman's formulation combined his earlier work on bagging (1996) with random feature selection at each split, and it introduced several practical tools that are now standard: out-of-bag error estimation, permutation-based variable importance, and proximity measures. Breiman also proved an upper bound on the generalization error of random forests, showing that performance depends on the strength of individual trees and the correlation between them. His 2001 paper is one of the most cited papers in all of machine learning.
Bagging, short for bootstrap aggregating, is the foundational variance-reduction technique behind random forests. Breiman introduced bagging in 1996 as a method for reducing the variance of a statistical learning procedure. The core idea is straightforward: rather than training a single model on the entire dataset, bagging trains multiple models on different random subsets of the data and then combines their predictions.
The variance reduction in bagging comes from the statistical principle that the variance of an average of B random variables, each with variance sigma squared, is:
Var(average) = (1/B) * sigma^2 (when variables are independent)
In practice, the individual trees in a bagged ensemble are not fully independent because they are trained on overlapping data. If the pairwise correlation between trees is rho, the variance of the ensemble average becomes:
Var(average) = rho * sigma^2 + ((1 - rho) / B) * sigma^2
The first term depends on the correlation between trees and does not decrease as more trees are added. The second term decreases with B. By reducing rho through feature randomness, random forests reduce the dominant first term, achieving better variance reduction than plain bagging.
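Plugging illustrative numbers into this decomposition makes the behavior concrete (the values sigma^2 = 1 and rho = 0.3 below are assumptions chosen for illustration, not figures from the text):

```python
# Variance of the average of B estimators with pairwise correlation rho:
#   Var(average) = rho * sigma^2 + ((1 - rho) / B) * sigma^2
sigma2, rho = 1.0, 0.3  # illustrative values

for B in (1, 10, 100, 1000):
    var = rho * sigma2 + ((1 - rho) / B) * sigma2
    print(f"B={B:4d}  Var={var:.4f}")
# The variance approaches rho * sigma^2 = 0.3 as B grows: the correlation
# term is the floor that random feature selection aims to lower.
```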
Bootstrap sampling is the process of creating a new dataset by randomly sampling with replacement from the original training data. For a dataset with n observations, each bootstrap sample also contains n observations, but some original observations appear multiple times while others are left out entirely.
On average, each bootstrap sample contains approximately 63.2% of the unique observations from the original dataset. The remaining 36.8% of observations that are not included in a given bootstrap sample are called out-of-bag (OOB) samples. This follows from the probability that a specific observation is not selected in any of n draws:
P(not selected) = (1 - 1/n)^n, which approaches 1/e (approximately 0.368) as n grows large.
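A quick simulation confirms the 63.2% figure (a sketch using NumPy; n = 10,000 is an arbitrary illustrative size):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
bootstrap = rng.integers(0, n, size=n)        # n draws with replacement
unique_frac = np.unique(bootstrap).size / n
print(f"unique fraction: {unique_frac:.3f}")  # ~0.632; the rest are out-of-bag
```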
The OOB samples serve as a natural validation set for each tree, which is one of the unique advantages of the random forest approach.
The key innovation that distinguishes random forests from simple bagged decision trees is the introduction of feature randomness at each split. When building each decision tree, at every node where a split is made, only a random subset of the available features is considered as candidates for the split.
This additional randomness serves to decorrelate the individual trees in the ensemble. Without feature randomness, if one or two features are very strong predictors, most trees in a bagged ensemble would use those same features at the top splits, making the trees highly correlated. Averaging correlated trees provides less variance reduction than averaging uncorrelated trees.
The typical number of features considered at each split (denoted m or max_features) is:
| Task | Recommended max_features | Formula |
|---|---|---|
| Classification | Square root of total features | m = sqrt(p) |
| Regression | One-third of total features | m = p/3 |
where p is the total number of features. These defaults work well in practice, though tuning max_features can sometimes improve performance.
Breiman (2001) derived an upper bound on the generalization error of a random forest. Let s denote the strength of the forest (the expected margin of the correct class) and rho-bar denote the mean correlation between trees. The generalization error (PE*) is bounded by:
PE* <= rho-bar * (1 - s^2) / s^2
This bound reveals two opposing forces. Reducing the number of features considered at each split (max_features) decreases the correlation rho-bar between trees, which helps. But it also decreases the strength s of each individual tree, which hurts. The optimal max_features balances these two effects. This theoretical result explains why random forests do not overfit as the number of trees increases; adding more trees only reduces the second term in the variance decomposition.
The random forest algorithm constructs a collection of decision trees during the training phase. The complete training procedure for a random forest with B trees is as follows:

1. Draw a bootstrap sample of n observations (sampling with replacement) from the training data.
2. Grow a decision tree on the bootstrap sample. At each node, select max_features candidate features uniformly at random and choose the best split among those candidates only.
3. Repeat steps 1 and 2 until B trees have been grown; the forest is the collection of all B trees.
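This loop can be sketched in a few lines of Python (a minimal illustration built on scikit-learn's DecisionTreeClassifier; the helper names train_random_forest and forest_predict are hypothetical, and NumPy arrays with integer class labels are assumed):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(X, y, n_trees=100, max_features="sqrt", seed=0):
    """Grow n_trees trees, each on a bootstrap sample, with per-split feature subsets."""
    rng = np.random.default_rng(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                          # step 1: bootstrap sample
        tree = DecisionTreeClassifier(max_features=max_features)  # step 2: random feature
        tree.fit(X[idx], y[idx])                                  # subset at each split
        forest.append(tree)
    return forest

def forest_predict(forest, X):
    votes = np.stack([tree.predict(X) for tree in forest])  # shape (n_trees, n_samples)
    # Majority vote per sample (assumes integer class labels).
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```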
Individual trees in a random forest are typically grown deep (often to full depth for classification), with little or no pruning. This makes each individual tree a high-variance, low-bias estimator. The ensemble averaging then reduces the variance while preserving the low bias.
At each node, the algorithm selects, among the randomly chosen candidate features, the feature and threshold that best separate the data. The quality of a split is measured by an impurity metric: Gini impurity or entropy for classification, and variance reduction (mean squared error) for regression. For a node with class proportions p_k, the Gini impurity is 1 - Sum over k of p_k^2.
Gini impurity is the default criterion in most implementations because it is slightly faster to compute than entropy and typically produces similar results.
Once the trees are constructed, the random forest makes predictions by combining the output of each individual tree:
Formally, for regression with B trees, the prediction for a new input x is:
f_hat(x) = (1/B) * Sum from b=1 to B of f_b(x)
where f_b is the prediction function of the b-th tree.
One of the most useful properties of random forests is the ability to estimate generalization error without a separate validation set, using out-of-bag (OOB) error estimation.
For each training observation, approximately one-third of the trees did not use that observation in their bootstrap sample. These are the OOB trees for that observation. The OOB prediction for an observation is obtained by aggregating the predictions of only the trees for which that observation was out-of-bag.
The OOB error is the average prediction error across all training observations using their OOB predictions. Breiman (2001) showed that the OOB error is a nearly unbiased estimate of the true generalization error, comparable to the error estimate from cross-validation but computed at no extra cost during training.
This makes OOB error particularly valuable when the dataset is small and setting aside a separate validation set would significantly reduce the amount of training data available. In practice, OOB error estimates tend to be slightly pessimistic (conservative), which is generally preferable to optimistic estimates.
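In scikit-learn, OOB estimates are available directly on the fitted model (a brief sketch; the synthetic dataset is purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)
# Per-sample OOB class probabilities; an entry can be ill-defined (NaN) for the
# rare sample that was in-bag in every tree when n_estimators is small.
print(rf.oob_decision_function_[:3])
```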
Random forests provide several methods for measuring feature importance, helping practitioners understand which features contribute most to predictions.
MDI measures the total reduction in node impurity (measured by Gini impurity for classification or variance for regression) that each feature provides across all trees in the forest. Features that appear in more splits and produce larger impurity reductions receive higher importance scores.
Formally, the unnormalized importance of feature x is:
Importance(x) = (1/B) * Sum over all trees T, Sum over all nodes j where x is the splitting variable, of p(j) * Delta_impurity(j)
where p(j) is the fraction of samples reaching node j and Delta_impurity(j) is the impurity decrease at that node.
MDI is fast to compute because it is calculated during training. However, it has a known bias toward features with high cardinality (many unique values), such as continuous features or categorical features with many levels. Strobl et al. (2007) demonstrated this bias and recommended using permutation importance or conditional importance as alternatives.
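In scikit-learn, MDI scores are exposed on fitted forests via the feature_importances_ attribute (a short sketch; rf stands for any fitted forest from your own code):

```python
import numpy as np

mdi = rf.feature_importances_  # MDI scores, normalized to sum to 1
for i in np.argsort(mdi)[::-1][:5]:
    print(f"feature {i}: MDI importance {mdi[i]:.3f}")
```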
MDA (also called permutation importance) measures how much the model's accuracy decreases when the values of a feature are randomly shuffled (permuted), breaking the relationship between the feature and the target. The procedure is:

1. Record the model's baseline accuracy on OOB or held-out data.
2. Randomly permute the values of a single feature and re-evaluate the model on the same data.
3. Report the drop in accuracy, typically averaged over several repetitions, as that feature's importance.
Permutation importance is more reliable than MDI because it is not biased toward high-cardinality features. However, it is more computationally expensive because it requires re-evaluating the model for each feature. It can also be misleading when features are highly correlated, since permuting one correlated feature may have little effect because the other correlated features still carry the same information.
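scikit-learn implements this procedure in sklearn.inspection (a short sketch; it assumes a fitted model rf and held-out arrays X_test and y_test from your own pipeline):

```python
from sklearn.inspection import permutation_importance

result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```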
| Method | Speed | Bias | Reliability | Computed during |
|---|---|---|---|---|
| MDI (Gini importance) | Fast | Biased toward high-cardinality features | Moderate | Training |
| MDA (Permutation importance) | Slow | Unbiased (but affected by correlated features) | High | After training (using OOB or validation data) |
In recent years, SHAP (SHapley Additive exPlanations) values have become a popular alternative for explaining random forest predictions. SHAP values are based on cooperative game theory and provide a theoretically grounded way to attribute the prediction of each individual observation to the contributions of each feature. The TreeSHAP algorithm (Lundberg et al., 2020) computes exact SHAP values for tree-based models efficiently, making it practical for random forests.
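With the third-party shap package, TreeSHAP values for a fitted forest can be computed as follows (a sketch; rf and X stand for a fitted model and feature matrix from your own pipeline):

```python
import shap  # third-party package: pip install shap

explainer = shap.TreeExplainer(rf)      # exact TreeSHAP for tree ensembles
shap_values = explainer.shap_values(X)  # per-sample, per-feature attributions
```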
Breiman (2001) introduced the concept of proximity measures as a byproduct of random forest training. The proximity between two observations is defined as the proportion of trees in which both observations land in the same terminal (leaf) node. Formally, for observations i and j:
Proximity(i, j) = (1 / B) * Sum over all trees T of I(leaf_T(i) = leaf_T(j))
where I is the indicator function and leaf_T(i) is the leaf node that observation i falls into in tree T.
Proximity measures capture the notion that two observations are "similar" if the random forest treats them similarly. These proximities have several practical applications: imputing missing values (a missing entry can be filled with a proximity-weighted average over the other observations), detecting outliers (observations with uniformly low proximity to the rest of their class), and producing low-dimensional visualizations by applying multidimensional scaling to 1 - proximity.
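In scikit-learn, proximities can be computed from the apply method, which returns the leaf index each sample reaches in each tree (a short sketch; the n x n proximity matrix makes this practical only for modest sample sizes):

```python
import numpy as np

leaves = rf.apply(X)  # shape (n_samples, n_trees): leaf index of each sample per tree
# proximity[i, j] = fraction of trees in which samples i and j share a leaf
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
```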
Random forests can also be used to detect interactions between variables. Breiman proposed an experimental procedure based on Gini importance values. The approach works as follows: for each pair of variables (m, k), the algorithm computes the absolute difference in their Gini importance ranks across all trees, then compares this to the expected difference under the hypothesis that the two variables are independent. A large positive difference implies that a split on one variable inhibits or promotes splits on the other variable, suggesting an interaction.
Breiman noted that this procedure is experimental and its conclusions should be regarded with caution. More recent approaches to interaction detection in random forests include partial dependence plots and the H-statistic proposed by Friedman and Popescu (2008), which quantifies the strength of interactions between features using the partial dependence decomposition.
Random forests have several important hyperparameters that affect model performance. While random forests are generally robust to hyperparameter choices compared to methods like gradient boosting, tuning can still yield meaningful improvements.
| Hyperparameter | Description | Typical default | Effect of increasing |
|---|---|---|---|
| n_estimators | Number of trees in the forest | 100 | Better performance, higher computation; no overfitting risk |
| max_depth | Maximum depth of each tree | None (fully grown) | More complex trees, risk of overfitting per tree |
| max_features | Number of features considered at each split | sqrt(p) for classification, p/3 for regression | Lower correlation among trees, potentially lower accuracy per tree |
| min_samples_split | Minimum samples required to split a node | 2 | Smoother decision boundaries, less overfitting |
| min_samples_leaf | Minimum samples in a leaf node | 1 | Prevents very small leaves, reduces overfitting |
| bootstrap | Whether to use bootstrap sampling | True | If False, each tree uses the full dataset |
| oob_score | Whether to compute OOB error | False | Enables OOB error estimation |
| n_jobs | Number of parallel jobs for training | 1 | Faster training (trees are independent) |
| class_weight | Weights for each class (classification only) | None | Helps with imbalanced datasets |
| max_samples | Number of samples to draw for each tree (if bootstrap=True) | n (full dataset size) | Smaller values increase diversity but may reduce per-tree accuracy |
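A typical tuning workflow searches over a few of these hyperparameters at once (a sketch; the parameter ranges are illustrative and X_train, y_train come from your own pipeline):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": [100, 300, 500],
    "max_features": ["sqrt", "log2", 0.3],
    "min_samples_leaf": [1, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1), param_dist,
    n_iter=10, cv=5, random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```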
Random forests can be applied to both classification and regression problems. The core algorithm is the same, but there are some differences in defaults and aggregation:
| Aspect | Classification | Regression |
|---|---|---|
| Aggregation method | Majority vote (or averaged probabilities) | Mean of predictions |
| Default max_features | sqrt(p) | p/3 |
| Splitting criterion | Gini impurity or entropy | Variance reduction (MSE) |
| Output | Class label or class probabilities | Continuous value |
| Tree depth | Often fully grown | May benefit from depth limits |
| Evaluation metric | Accuracy, F1, AUC | MSE, MAE, R-squared |
For classification, each tree votes for a class and the majority wins. For regression, the predictions from all trees are averaged. A notable extension is the Quantile Regression Forest (Meinshausen, 2006), which retains all training observations in the leaf nodes rather than just their mean. This allows the forest to estimate the full conditional distribution of the target variable, enabling prediction intervals.
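For standard regression forests, the scikit-learn API mirrors the classifier (a sketch with synthetic data for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=12, noise=10.0, random_state=0)
reg = RandomForestRegressor(n_estimators=200, max_features=1 / 3, random_state=0)
reg.fit(X, y)
print(reg.predict(X[:3]))  # each prediction is the mean over all 200 trees
```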
Random forests offer several advantages over single decision trees and many other machine learning algorithms:

- Robustness to overfitting: averaging many decorrelated trees keeps variance low, and adding more trees does not cause overfitting.
- Strong out-of-the-box performance, with few assumptions about the data distribution and low sensitivity to hyperparameter settings.
- Built-in, essentially free estimation of generalization error via out-of-bag samples.
- Built-in feature importance measures (MDI and permutation importance).
- Effectiveness in high-dimensional settings, for both classification and regression.
- Embarrassingly parallel training, since trees are grown independently.
Despite these strengths, random forests have notable limitations:

- Interpretability: a forest of hundreds of deep trees is far harder to inspect than a single decision tree.
- Memory and prediction cost: storing and evaluating many fully grown trees can be expensive on large datasets.
- No extrapolation: like all tree-based methods, predictions cannot extend beyond the range of the training targets.
- Biased MDI importance scores for high-cardinality features, as discussed above.
- On tabular data, well-tuned gradient boosting methods often achieve slightly higher accuracy.
Extremely Randomized Trees, or Extra-Trees, were introduced by Geurts, Ernst, and Wehenkel (2006) as a variant that pushes the randomization strategy even further than standard random forests. The key differences are:
| Aspect | Random forest | Extra-Trees |
|---|---|---|
| Bootstrap sampling | Yes (sample with replacement) | No (uses the full training set) |
| Split threshold | Best threshold among random features | Random threshold for each random feature |
| Split optimization | Optimized per feature | Fully randomized |
| Training speed | Moderate | Faster (no threshold optimization) |
| Bias-variance trade-off | Lower bias, moderate variance | Slightly higher bias, lower variance |
| Tree correlation | Moderate | Lower (more diverse trees) |
In Extra-Trees, split thresholds are drawn at random for each candidate feature, and the best of these random splits is chosen. Because the algorithm skips the computationally expensive step of finding optimal split points, Extra-Trees are generally faster to train. The additional randomness further reduces the correlation between trees, which can reduce the variance of the ensemble. However, individual Extra-Trees tend to have higher bias than standard random forest trees because they do not optimize splits.
In practice, Extra-Trees often achieve comparable or sometimes slightly better performance than standard random forests, particularly on datasets where overfitting is a concern. Scikit-learn provides ExtraTreesClassifier and ExtraTreesRegressor for this variant.
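A brief comparison of the two scikit-learn estimators (a sketch; the synthetic dataset and settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
for Model in (RandomForestClassifier, ExtraTreesClassifier):
    scores = cross_val_score(Model(n_estimators=200, random_state=0), X, y, cv=5)
    print(Model.__name__, f"{scores.mean():.3f}")
```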
Gradient boosting is another popular ensemble method that combines decision trees, but it works very differently from random forests. Understanding the differences helps practitioners choose the right method.
| Aspect | Random forest | Gradient boosting (XGBoost, LightGBM) |
|---|---|---|
| Ensemble strategy | Bagging (parallel) | Boosting (sequential) |
| Tree construction | Independent trees | Each tree corrects errors of previous trees |
| Tree depth | Deep (often fully grown) | Shallow (typically 3 to 8 levels) |
| Overfitting risk | Low | Higher (requires careful regularization) |
| Hyperparameter sensitivity | Low | High |
| Training speed | Fast (parallelizable) | Slower (sequential), though LightGBM is very fast |
| Typical accuracy | Good | Often slightly better when well-tuned |
| Ease of use | High (good defaults) | Moderate (requires tuning learning rate, n_estimators, depth) |
| Extrapolation | Cannot extrapolate | Cannot extrapolate |
| Missing values | Not handled natively in most implementations (Breiman proposed proximity-based imputation) | Handled natively in XGBoost and LightGBM |
In random forests, trees are built independently using bootstrap samples and random feature subsets. In gradient boosting, trees are built sequentially, with each new tree trained to correct the residual errors of the ensemble built so far. Gradient boosting trees are deliberately kept shallow (weak learners) because each tree only needs to make a small improvement.
Gradient boosting primarily reduces bias by iteratively fitting residuals, while random forests primarily reduce variance through averaging. In practice, gradient boosting methods like XGBoost (Chen and Guestrin, 2016), LightGBM (Ke et al., 2017), and CatBoost often achieve higher accuracy than random forests on structured/tabular data when properly tuned. However, random forests remain preferred when ease of use and robustness are priorities, when computational resources for hyperparameter tuning are limited, or when a reliable baseline model is needed quickly.
Random forests have a deep connection to kernel methods and nearest-neighbor approaches. Both random forests and k-nearest neighbors can be expressed as weighted neighborhood schemes:
y_hat(x) = Sum over i of W(x_i, x) * y_i
where W represents neighborhood weights. In a random forest, the weight assigned to training observation i when predicting at point x is the fraction of trees in which x and x_i land in the same leaf node. This is precisely the proximity measure discussed earlier.
Lin and Jeon (2006) showed that random forests can be viewed as adaptive nearest-neighbor methods, where the neighborhood of a point depends in a complex way on the tree structures and adapts locally to feature importance. This perspective has led to the development of Kernel Random Forests (KeRF), which make the connection explicit and provide a framework that is easier to analyze theoretically.
Scornet (2016) studied the consistency of these kernel-based forest estimators and established convergence rates under mild regularity conditions.
Several variants and extensions of the original random forest algorithm have been developed: Extremely Randomized Trees (Geurts et al., 2006, discussed above), Quantile Regression Forests (Meinshausen, 2006) for estimating conditional distributions and prediction intervals, Kernel Random Forests (KeRF) that make the connection to kernel methods explicit, and conditional inference forests (Hothorn et al., 2006), which address the variable-selection bias identified by Strobl et al. (2007).
Scikit-learn provides the most widely used Python implementation of random forests:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Example dataset; substitute your own feature matrix X and labels y.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", oob_score=True)
rf.fit(X_train, y_train)

print("OOB Score:", rf.oob_score_)  # out-of-bag accuracy estimate
print("Test Accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```
For regression tasks, RandomForestRegressor provides an equivalent API. Other popular implementations include:
| Library / Package | Language | Notes |
|---|---|---|
| scikit-learn | Python | Most popular; RandomForestClassifier, RandomForestRegressor |
| randomForest (Liaw and Wiener) | R | Classic R implementation based on Breiman's original Fortran code |
| ranger | R | Fast C++ implementation, handles large datasets efficiently |
| H2O | Python, R, Java | Distributed random forest for big data |
| Apache Spark MLlib | Scala, Python | Distributed training on large clusters |
| XGBoost | Python, R, C++ | Includes a random forest mode via num_parallel_tree |
When working with random forests in practice, the following guidelines can help achieve good results:
- Set oob_score=True to get an estimate of generalization performance without needing a validation set or cross-validation.
- class_weight='balanced' adjusts the algorithm to give more weight to the minority class during training.
- Limiting max_depth or max_leaf_nodes reduces per-tree memory usage.

Finally, here is an intuitive way to picture the whole method. Imagine you want to know whether it will rain tomorrow. Instead of asking just one friend, you ask a hundred friends. Each friend looks at slightly different clues (some look at the clouds, some check the wind, some look at what the animals are doing). Not every friend sees the same clues, and not every friend uses the same weather data from the past.
Once everyone has made their guess, you count up the votes. If 70 friends say "rain" and 30 say "no rain," you go with "rain." That is basically how a random forest works. Each "friend" is a decision tree. By asking lots of different trees and taking the most popular answer, random forests make better predictions than any single tree would on its own.