Random forest is a versatile and widely used ensemble learning method in machine learning. It combines the predictions of many individual decision trees, each trained on a different random subset of the data using a technique called bootstrap aggregating (or bagging). At every node split during tree construction, only a randomly selected subset of features is considered, which decorrelates the trees and leads to stronger ensemble performance. This combination of bagging with random feature selection makes random forests robust to overfitting, effective in high-dimensional settings, and applicable to both classification and regression tasks.
Random forests are sometimes referred to more broadly as decision forests, particularly in the Microsoft Research literature. The method has become one of the most popular off-the-shelf learning algorithms due to its strong out-of-the-box performance, few assumptions about data distribution, and relatively low sensitivity to hyperparameter settings.
The intellectual roots of random forests trace back to several lines of research in the 1990s.
In 1995, Tin Kam Ho introduced the concept of random decision forests in a paper presented at the International Conference on Document Analysis and Recognition. Ho's approach built multiple decision trees in randomly selected subspaces of the feature space, demonstrating that forests of trees grown in random subspaces could achieve accuracy improvements as more trees were added, without overfitting. Ho expanded this work in 1998 with a paper on the random subspace method for constructing decision forests, published in IEEE Transactions on Pattern Analysis and Machine Intelligence. The key insight was that trees built in different random subspaces generalize in complementary ways, so their combined classification improves monotonically.
Independently, Yali Amit and Donald Geman (1997) introduced the idea of randomly sampling available features when splitting nodes during tree construction, an idea that influenced later work.
Leo Breiman synthesized these ideas and introduced the modern random forest algorithm in his influential 2001 paper "Random Forests," published in the journal Machine Learning. Breiman's formulation combined his earlier work on bagging (1996) with random feature selection at each split, and it introduced several practical tools that are now standard: out-of-bag error estimation, permutation-based variable importance, and proximity measures. Breiman also proved an upper bound on the generalization error of random forests, showing that performance depends on the strength of individual trees and the correlation between them. His 2001 paper is one of the most cited papers in all of machine learning.
Bagging, short for bootstrap aggregating, is the foundational variance-reduction technique behind random forests. Breiman introduced bagging in 1996 as a method for reducing the variance of a statistical learning procedure. The core idea is straightforward: rather than training a single model on the entire dataset, bagging trains multiple models on different random subsets of the data and then combines their predictions.
The variance reduction in bagging comes from the statistical principle that the variance of an average of B random variables, each with variance sigma squared, is:
Var(average) = (1/B) * sigma^2 (when variables are independent)
In practice, the individual trees in a bagged ensemble are not fully independent because they are trained on overlapping data. If the pairwise correlation between trees is rho, the variance of the ensemble average becomes:
Var(average) = rho * sigma^2 + ((1 - rho) / B) * sigma^2
The first term depends on the correlation between trees and does not decrease as more trees are added. The second term decreases with B. By reducing rho through feature randomness, random forests reduce the dominant first term, achieving better variance reduction than plain bagging.
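Plugging illustrative numbers into this decomposition makes the behavior concrete (the values sigma^2 = 1 and rho = 0.3 below are assumptions chosen for illustration, not figures from the text):

```python
# Variance of the average of B estimators with pairwise correlation rho:
#   Var(average) = rho * sigma^2 + ((1 - rho) / B) * sigma^2
sigma2, rho = 1.0, 0.3  # illustrative values

for B in (1, 10, 100, 1000):
    var = rho * sigma2 + ((1 - rho) / B) * sigma2
    print(f"B={B:4d}  Var={var:.4f}")
# The variance approaches rho * sigma^2 = 0.3 as B grows: the correlation
# term is the floor that random feature selection aims to lower.
```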
Bootstrap sampling is the process of creating a new dataset by randomly sampling with replacement from the original training data. For a dataset with n observations, each bootstrap sample also contains n observations, but some original observations appear multiple times while others are left out entirely.
On average, each bootstrap sample contains approximately 63.2% of the unique observations from the original dataset. The remaining 36.8% of observations that are not included in a given bootstrap sample are called out-of-bag (OOB) samples. This follows from the probability that a specific observation is not selected in any of n draws:
P(not selected) = (1 - 1/n)^n, which approaches 1/e (approximately 0.368) as n grows large.
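A quick simulation confirms the 63.2% figure (a sketch using NumPy; n = 10,000 is an arbitrary illustrative size):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
bootstrap = rng.integers(0, n, size=n)        # n draws with replacement
unique_frac = np.unique(bootstrap).size / n
print(f"unique fraction: {unique_frac:.3f}")  # ~0.632; the rest are out-of-bag
```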
The OOB samples serve as a natural validation set for each tree, which is one of the unique advantages of the random forest approach.
The key innovation that distinguishes random forests from simple bagged decision trees is the introduction of feature randomness at each split. When building each decision tree, at every node where a split is made, only a random subset of the available features is considered as candidates for the split.
This additional randomness serves to decorrelate the individual trees in the ensemble. Without feature randomness, if one or two features are very strong predictors, most trees in a bagged ensemble would use those same features at the top splits, making the trees highly correlated. Averaging correlated trees provides less variance reduction than averaging uncorrelated trees.
The typical number of features considered at each split (denoted m or max_features) is:
| Task | Recommended max_features | Formula |
|---|---|---|
| Classification | Square root of total features | m = sqrt(p) |
| Regression | One-third of total features | m = p/3 |
where p is the total number of features. These defaults work well in practice, though tuning max_features can sometimes improve performance.
Breiman (2001) derived an upper bound on the generalization error of a random forest. Let s denote the strength of the forest (the expected margin of the correct class) and rho-bar denote the mean correlation between trees. The generalization error (PE*) is bounded by:
PE* <= rho-bar * (1 - s^2) / s^2
This bound reveals two opposing forces. Reducing the number of features considered at each split (max_features) decreases the correlation rho-bar between trees, which helps. But it also decreases the strength s of each individual tree, which hurts. The optimal max_features balances these two effects. This theoretical result explains why random forests do not overfit as the number of trees increases; adding more trees only reduces the second term in the variance decomposition.
The random forest algorithm constructs a collection of decision trees during the training phase. The complete training procedure for a random forest with B trees is as follows:

1. Draw a bootstrap sample of n observations (sampling with replacement) from the training data.
2. Grow a decision tree on the bootstrap sample. At each node, select max_features candidate features uniformly at random and choose the best split among those candidates only.
3. Repeat steps 1 and 2 until B trees have been grown; the forest is the collection of all B trees.
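This loop can be sketched in a few lines of Python (a minimal illustration built on scikit-learn's DecisionTreeClassifier; the helper names train_random_forest and forest_predict are hypothetical, and NumPy arrays with integer class labels are assumed):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_random_forest(X, y, n_trees=100, max_features="sqrt", seed=0):
    """Grow n_trees trees, each on a bootstrap sample, with per-split feature subsets."""
    rng = np.random.default_rng(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                          # step 1: bootstrap sample
        tree = DecisionTreeClassifier(max_features=max_features)  # step 2: random feature
        tree.fit(X[idx], y[idx])                                  # subset at each split
        forest.append(tree)
    return forest

def forest_predict(forest, X):
    votes = np.stack([tree.predict(X) for tree in forest])  # shape (n_trees, n_samples)
    # Majority vote per sample (assumes integer class labels).
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```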
Individual trees in a random forest are typically grown deep (often to full depth for classification), with little or no pruning. This makes each individual tree a high-variance, low-bias estimator. The ensemble averaging then reduces the variance while preserving the low bias.
At each node, the algorithm selects, among the randomly chosen candidate features, the feature and threshold that best separate the data. The quality of a split is measured by an impurity metric: Gini impurity or entropy for classification, and variance reduction (mean squared error) for regression. For a node with class proportions p_k, the Gini impurity is 1 - Sum over k of p_k^2.
Gini impurity is the default criterion in most implementations because it is slightly faster to compute than entropy and typically produces similar results.
Once the trees are constructed, the random forest makes predictions by combining the output of each individual tree:
Formally, for regression with B trees, the prediction for a new input x is:
f_hat(x) = (1/B) * Sum from b=1 to B of f_b(x)
where f_b is the prediction function of the b-th tree.
One of the most useful properties of random forests is the ability to estimate generalization error without a separate validation set, using out-of-bag (OOB) error estimation.
For each training observation, approximately one-third of the trees did not use that observation in their bootstrap sample. These are the OOB trees for that observation. The OOB prediction for an observation is obtained by aggregating the predictions of only the trees for which that observation was out-of-bag.
The OOB error is the average prediction error across all training observations using their OOB predictions. Breiman (2001) showed that the OOB error is a nearly unbiased estimate of the true generalization error, comparable to the error estimate from cross-validation but computed at no extra cost during training.
This makes OOB error particularly valuable when the dataset is small and setting aside a separate validation set would significantly reduce the amount of training data available. In practice, OOB error estimates tend to be slightly pessimistic (conservative), which is generally preferable to optimistic estimates.
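In scikit-learn, OOB estimates are available directly on the fitted model (a brief sketch; the synthetic dataset is purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)
# Per-sample OOB class probabilities; an entry can be ill-defined (NaN) for the
# rare sample that was in-bag in every tree when n_estimators is small.
print(rf.oob_decision_function_[:3])
```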
Random forests provide several methods for measuring feature importance, helping practitioners understand which features contribute most to predictions.
MDI measures the total reduction in node impurity (measured by Gini impurity for classification or variance for regression) that each feature provides across all trees in the forest. Features that appear in more splits and produce larger impurity reductions receive higher importance scores.
Formally, the unnormalized importance of feature x is:
Importance(x) = (1/B) * Sum over all trees T, Sum over all nodes j where x is the splitting variable, of p(j) * Delta_impurity(j)
where p(j) is the fraction of samples reaching node j and Delta_impurity(j) is the impurity decrease at that node.
MDI is fast to compute because it is calculated during training. However, it has a known bias toward features with high cardinality (many unique values), such as continuous features or categorical features with many levels. Strobl et al. (2007) demonstrated this bias and recommended using permutation importance or conditional importance as alternatives.
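In scikit-learn, MDI scores are exposed on fitted forests via the feature_importances_ attribute (a short sketch; rf stands for any fitted forest from your own code):

```python
import numpy as np

mdi = rf.feature_importances_  # MDI scores, normalized to sum to 1
for i in np.argsort(mdi)[::-1][:5]:
    print(f"feature {i}: MDI importance {mdi[i]:.3f}")
```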
MDA (also called permutation importance) measures how much the model's accuracy decreases when the values of a feature are randomly shuffled (permuted), breaking the relationship between the feature and the target. The procedure is:

1. Record the model's baseline accuracy on OOB or held-out data.
2. Randomly permute the values of a single feature and re-evaluate the model on the same data.
3. Report the drop in accuracy, typically averaged over several repetitions, as that feature's importance.
Permutation importance is more reliable than MDI because it is not biased toward high-cardinality features. However, it is more computationally expensive because it requires re-evaluating the model for each feature. It can also be misleading when features are highly correlated, since permuting one correlated feature may have little effect because the other correlated features still carry the same information.
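scikit-learn implements this procedure in sklearn.inspection (a short sketch; it assumes a fitted model rf and held-out arrays X_test and y_test from your own pipeline):

```python
from sklearn.inspection import permutation_importance

result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```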
| Method | Speed | Bias | Reliability | Computed during |
|---|---|---|---|---|
| MDI (Gini importance) | Fast | Biased toward high-cardinality features | Moderate | Training |
| MDA (Permutation importance) | Slow | Unbiased (but affected by correlated features) | High | After training (using OOB or validation data) |
In recent years, SHAP (SHapley Additive exPlanations) values have become a popular alternative for explaining random forest predictions. SHAP values are based on cooperative game theory and provide a theoretically grounded way to attribute the prediction of each individual observation to the contributions of each feature. The TreeSHAP algorithm (Lundberg et al., 2020) computes exact SHAP values for tree-based models efficiently, making it practical for random forests.
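With the third-party shap package, TreeSHAP values for a fitted forest can be computed as follows (a sketch; rf and X stand for a fitted model and feature matrix from your own pipeline):

```python
import shap  # third-party package: pip install shap

explainer = shap.TreeExplainer(rf)      # exact TreeSHAP for tree ensembles
shap_values = explainer.shap_values(X)  # per-sample, per-feature attributions
```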
Breiman (2001) introduced the concept of proximity measures as a byproduct of random forest training. The proximity between two observations is defined as the proportion of trees in which both observations land in the same terminal (leaf) node. Formally, for observations i and j:
Proximity(i, j) = (1 / B) * Sum over all trees T of I(leaf_T(i) = leaf_T(j))
where I is the indicator function and leaf_T(i) is the leaf node that observation i falls into in tree T.
Proximity measures capture the notion that two observations are "similar" if the random forest treats them similarly. These proximities have several practical applications: imputing missing values (a missing entry can be filled with a proximity-weighted average over the other observations), detecting outliers (observations with uniformly low proximity to the rest of their class), and producing low-dimensional visualizations by applying multidimensional scaling to 1 - proximity.
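In scikit-learn, proximities can be computed from the apply method, which returns the leaf index each sample reaches in each tree (a short sketch; the n x n proximity matrix makes this practical only for modest sample sizes):

```python
import numpy as np

leaves = rf.apply(X)  # shape (n_samples, n_trees): leaf index of each sample per tree
# proximity[i, j] = fraction of trees in which samples i and j share a leaf
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
```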
Random forests can also be used to detect interactions between variables. Breiman proposed an experimental procedure based on Gini importance values. The approach works as follows: for each pair of variables (m, k), the algorithm computes the absolute difference in their Gini importance ranks across all trees, then compares this to the expected difference under the hypothesis that the two variables are independent. A large positive difference implies that a split on one variable inhibits or promotes splits on the other variable, suggesting an interaction.
Breiman noted that this procedure is experimental and its conclusions should be regarded with caution. More recent approaches to interaction detection in random forests include partial dependence plots and the H-statistic proposed by Friedman and Popescu (2008), which quantifies the strength of interactions between features using the partial dependence decomposition.
Random forests have several important hyperparameters that affect model performance. While random forests are generally robust to hyperparameter choices compared to methods like gradient boosting, tuning can still yield meaningful improvements.
| Hyperparameter | Description | Typical default | Effect of increasing |
|---|---|---|---|
| n_estimators | Number of trees in the forest | 100 | Better performance, higher computation; no overfitting risk |
| max_depth | Maximum depth of each tree | None (fully grown) | More complex trees, risk of overfitting per tree |
| max_features | Number of features considered at each split | sqrt(p) for classification, p/3 for regression | Lower correlation among trees, potentially lower accuracy per tree |
| min_samples_split | Minimum samples required to split a node | 2 | Smoother decision boundaries, less overfitting |
| min_samples_leaf | Minimum samples in a leaf node | 1 | Prevents very small leaves, reduces overfitting |
| bootstrap | Whether to use bootstrap sampling | True | If False, each tree uses the full dataset |
| oob_score | Whether to compute OOB error | False | Enables OOB error estimation |
| n_jobs | Number of parallel jobs for training | 1 | Faster training (trees are independent) |
| class_weight | Weights for each class (classification only) | None | Helps with imbalanced datasets |
| max_samples | Number of samples to draw for each tree (if bootstrap=True) | n (full dataset size) | Smaller values increase diversity but may reduce per-tree accuracy |
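A typical tuning workflow searches over a few of these hyperparameters at once (a sketch; the parameter ranges are illustrative and X_train, y_train come from your own pipeline):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": [100, 300, 500],
    "max_features": ["sqrt", "log2", 0.3],
    "min_samples_leaf": [1, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1), param_dist,
    n_iter=10, cv=5, random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```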
Random forests can be applied to both classification and regression problems. The core algorithm is the same, but there are some differences in defaults and aggregation:
| Aspect | Classification | Regression |
|---|---|---|
| Aggregation method | Majority vote (or averaged probabilities) | Mean of predictions |
| Default max_features | sqrt(p) | p/3 |
| Splitting criterion | Gini impurity or entropy | Variance reduction (MSE) |
| Output | Class label or class probabilities | Continuous value |
| Tree depth | Often fully grown | May benefit from depth limits |
| Evaluation metric | Accuracy, F1, AUC | MSE, MAE, R-squared |
For classification, each tree votes for a class and the majority wins. For regression, the predictions from all trees are averaged. A notable extension is the Quantile Regression Forest (Meinshausen, 2006), which retains all training observations in the leaf nodes rather than just their mean. This allows the forest to estimate the full conditional distribution of the target variable, enabling prediction intervals.
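For standard regression forests, the scikit-learn API mirrors the classifier (a sketch with synthetic data for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=12, noise=10.0, random_state=0)
reg = RandomForestRegressor(n_estimators=200, max_features=1 / 3, random_state=0)
reg.fit(X, y)
print(reg.predict(X[:3]))  # each prediction is the mean over all 200 trees
```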
Random forests offer several advantages over single decision trees and many other machine learning algorithms:

- Robustness to overfitting: averaging many decorrelated trees keeps variance low, and adding more trees does not cause overfitting.
- Strong out-of-the-box performance, with few assumptions about the data distribution and low sensitivity to hyperparameter settings.
- Built-in, essentially free estimation of generalization error via out-of-bag samples.
- Built-in feature importance measures (MDI and permutation importance).
- Effectiveness in high-dimensional settings, for both classification and regression.
- Embarrassingly parallel training, since trees are grown independently.
Despite these strengths, random forests have notable limitations:

- Interpretability: a forest of hundreds of deep trees is far harder to inspect than a single decision tree.
- Memory and prediction cost: storing and evaluating many fully grown trees can be expensive on large datasets.
- No extrapolation: like all tree-based methods, predictions cannot extend beyond the range of the training targets.
- Biased MDI importance scores for high-cardinality features, as discussed above.
- On tabular data, well-tuned gradient boosting methods often achieve slightly higher accuracy.
Extremely Randomized Trees, or Extra-Trees, were introduced by Geurts, Ernst, and Wehenkel (2006) as a variant that pushes the randomization strategy even further than standard random forests. The key differences are:
| Aspect | Random forest | Extra-Trees |
|---|---|---|
| Bootstrap sampling | Yes (sample with replacement) | No (uses the full training set) |
| Split threshold | Best threshold among random features | Random threshold for each random feature |
| Split optimization | Optimized per feature | Fully randomized |
| Training speed | Moderate | Faster (no threshold optimization) |
| Bias-variance trade-off | Lower bias, moderate variance | Slightly higher bias, lower variance |
| Tree correlation | Moderate | Lower (more diverse trees) |
In Extra-Trees, split thresholds are drawn at random for each candidate feature, and the best of these random splits is chosen. Because the algorithm skips the computationally expensive step of finding optimal split points, Extra-Trees are generally faster to train. The additional randomness further reduces the correlation between trees, which can reduce the variance of the ensemble. However, individual Extra-Trees tend to have higher bias than standard random forest trees because they do not optimize splits.
In practice, Extra-Trees often achieve comparable or sometimes slightly better performance than standard random forests, particularly on datasets where overfitting is a concern. Scikit-learn provides ExtraTreesClassifier and ExtraTreesRegressor for this variant.
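A brief comparison of the two scikit-learn estimators (a sketch; the synthetic dataset and settings are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
for Model in (RandomForestClassifier, ExtraTreesClassifier):
    scores = cross_val_score(Model(n_estimators=200, random_state=0), X, y, cv=5)
    print(Model.__name__, f"{scores.mean():.3f}")
```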
Gradient boosting is another popular ensemble method that combines decision trees, but it works very differently from random forests. Understanding the differences helps practitioners choose the right method.
| Aspect | Random forest | Gradient boosting (XGBoost, LightGBM) |
|---|---|---|
| Ensemble strategy | Bagging (parallel) | Boosting (sequential) |
| Tree construction | Independent trees | Each tree corrects errors of previous trees |
| Tree depth | Deep (often fully grown) | Shallow (typically 3 to 8 levels) |
| Overfitting risk | Low | Higher (requires careful regularization) |
| Hyperparameter sensitivity | Low | High |
| Training speed | Fast (parallelizable) | Slower (sequential), though LightGBM is very fast |
| Typical accuracy | Good | Often slightly better when well-tuned |
| Ease of use | High (good defaults) | Moderate (requires tuning learning rate, n_estimators, depth) |
| Extrapolation | Cannot extrapolate | Cannot extrapolate |
| Missing values | Not handled natively in most implementations (Breiman proposed proximity-based imputation) | Handled natively in XGBoost and LightGBM |
In random forests, trees are built independently using bootstrap samples and random feature subsets. In gradient boosting, trees are built sequentially, with each new tree trained to correct the residual errors of the ensemble built so far. Gradient boosting trees are deliberately kept shallow (weak learners) because each tree only needs to make a small improvement.
Gradient boosting primarily reduces bias by iteratively fitting residuals, while random forests primarily reduce variance through averaging. In practice, gradient boosting methods like XGBoost (Chen and Guestrin, 2016), LightGBM (Ke et al., 2017), and CatBoost often achieve higher accuracy than random forests on structured/tabular data when properly tuned. However, random forests remain preferred when ease of use and robustness are priorities, when computational resources for hyperparameter tuning are limited, or when a reliable baseline model is needed quickly.
Random forests have a deep connection to kernel methods and nearest-neighbor approaches. Both random forests and k-nearest neighbors can be expressed as weighted neighborhood schemes:
y_hat(x) = Sum over i of W(x_i, x) * y_i
where W represents neighborhood weights. In a random forest, the weight assigned to training observation i when predicting at point x is the fraction of trees in which x and x_i land in the same leaf node. This is precisely the proximity measure discussed earlier.
Lin and Jeon (2006) showed that random forests can be viewed as adaptive nearest-neighbor methods, where the neighborhood of a point depends in a complex way on the tree structures and adapts locally to feature importance. This perspective has led to the development of Kernel Random Forests (KeRF), which make the connection explicit and provide a framework that is easier to analyze theoretically.
Scornet (2016) studied the consistency of these kernel-based forest estimators and established convergence rates under mild regularity conditions.
Several variants and extensions of the original random forest algorithm have been developed: Extremely Randomized Trees (Geurts et al., 2006, discussed above), Quantile Regression Forests (Meinshausen, 2006) for estimating conditional distributions and prediction intervals, Kernel Random Forests (KeRF) that make the connection to kernel methods explicit, and conditional inference forests (Hothorn et al., 2006), which address the variable-selection bias identified by Strobl et al. (2007).
Scikit-learn provides the most widely used Python implementation of random forests:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Example dataset; substitute your own feature matrix X and labels y.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", oob_score=True)
rf.fit(X_train, y_train)

print("OOB Score:", rf.oob_score_)  # out-of-bag accuracy estimate
print("Test Accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```
For regression tasks, RandomForestRegressor provides an equivalent API. Other popular implementations include:
| Library / Package | Language | Notes |
|---|---|---|
| scikit-learn | Python | Most popular; RandomForestClassifier, RandomForestRegressor |
| randomForest (Liaw and Wiener) | R | Classic R implementation based on Breiman's original Fortran code |
| ranger | R | Fast C++ implementation, handles large datasets efficiently |
| H2O | Python, R, Java | Distributed random forest for big data |
| Apache Spark MLlib | Scala, Python | Distributed training on large clusters |
| XGBoost | Python, R, C++ | Includes a random forest mode via num_parallel_tree |
When working with random forests in practice, the following guidelines can help achieve good results:
- Set oob_score=True to get an estimate of generalization performance without needing a validation set or cross-validation.
- class_weight='balanced' adjusts the algorithm to give more weight to the minority class during training.
- Limiting max_depth or max_leaf_nodes reduces per-tree memory usage.

Finally, here is an intuitive way to picture the whole method. Imagine you want to know whether it will rain tomorrow. Instead of asking just one friend, you ask a hundred friends. Each friend looks at slightly different clues (some look at the clouds, some check the wind, some look at what the animals are doing). Not every friend sees the same clues, and not every friend uses the same weather data from the past.
Once everyone has made their guess, you count up the votes. If 70 friends say "rain" and 30 say "no rain," you go with "rain." That is basically how a random forest works. Each "friend" is a decision tree. By asking lots of different trees and taking the most popular answer, random forests make better predictions than any single tree would on its own.