# Random Forest

> Source: https://aiwiki.ai/wiki/random_forest
> Updated: 2026-07-11
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

A random forest is a supervised [machine learning](/wiki/machine_learning) algorithm that builds many [decision trees](/wiki/decision_tree) on random subsets of the data and combines their outputs, by majority vote for [classification](/wiki/classification_model) or by averaging for [regression](/wiki/regression_model), to produce predictions that are more accurate and more stable than any single tree. It was introduced by statistician Leo Breiman in 2001, who defined it as "a classifier consisting of a collection of tree-structured classifiers $$\{h(x, \Theta_k), k = 1, \ldots\}$$ where the $$\{\Theta_k\}$$ are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x."[1] Breiman's paper "Random Forests," published in the journal *Machine Learning* (volume 45, pages 5-32), is one of the most cited works in all of machine learning, with more than 100,000 citations recorded on Semantic Scholar.[1][14]

Random forests combine two sources of randomness: bootstrap aggregating (or [bagging](/wiki/bagging)), in which each tree is trained on a different random sample of the data, and random feature selection, in which only a randomly chosen subset of features is considered at each node split. This combination decorrelates the trees and makes random forests robust to [overfitting](/wiki/overfitting), effective in high-dimensional settings, and applicable to both classification and regression tasks. In a large benchmark of 179 classifiers from 17 families across 121 datasets, Fernandez-Delgado et al. (2014) found that "the classifiers most likely to be the bests are the random forest (RF) versions," with the best random forest reaching 94.1% of the maximum accuracy achieved by any method on each dataset.[15]

## What is a random forest?

## Introduction

Random forest is a versatile and widely used [ensemble](/wiki/ensemble) learning method in [machine learning](/wiki/machine_learning). It combines the predictions of many individual [decision trees](/wiki/decision_tree), each trained on a different random subset of the data using a technique called bootstrap aggregating (or [bagging](/wiki/bagging)). At every node split during tree construction, only a randomly selected subset of features is considered, which decorrelates the trees and leads to stronger ensemble performance. This combination of bagging with random feature selection makes random forests robust to [overfitting](/wiki/overfitting), effective in high-dimensional settings, and applicable to both [classification](/wiki/classification_model) and [regression](/wiki/regression_model) tasks.

Random forests are sometimes referred to more broadly as [decision forests](/wiki/decision_forest), particularly in the Microsoft research literature. The method has become one of the most popular off-the-shelf learning algorithms due to its strong out-of-the-box performance, few assumptions about data distribution, and relatively low sensitivity to [hyperparameter](/wiki/hyperparameter) settings.

## When was the random forest algorithm invented?

## Historical background

The intellectual roots of random forests trace back to several lines of research in the 1990s.

In 1995, Tin Kam Ho introduced the concept of random decision forests in a paper presented at the International Conference on Document Analysis and Recognition.[3] Ho's approach built multiple decision trees in randomly selected subspaces of the feature space, demonstrating that forests of trees growing in random subspaces could achieve accuracy improvements as more trees were added, without overfitting.[3] Ho expanded this work in 1998 with a paper on the random subspace method for constructing decision forests, published in IEEE Transactions on Pattern Analysis and Machine Intelligence.[4] The key insight was that trees built in different random subspaces generalize in complementary ways, so their combined classification improves monotonically.[4]

Independently, Yali Amit and Donald Geman (1997) introduced the idea of randomly sampling available features when splitting nodes during tree construction, an idea that influenced later work. Breiman explicitly credited this work, writing that "this latter paper has been influential in my thinking."[1]

Leo Breiman synthesized these ideas and introduced the modern random forest algorithm in his influential 2001 paper "Random Forests," published in the journal *Machine Learning*.[1] The paper is dated January 2001 and lists Breiman's affiliation as the Statistics Department, University of California, Berkeley.[1] Breiman's formulation combined his earlier work on bagging (1996)[2] with random feature selection at each split, and it introduced several practical tools that are now standard: out-of-bag error estimation, permutation-based variable importance, and proximity measures.[1] Breiman also proved an upper bound on the generalization error of random forests, showing that performance depends on the strength of individual trees and the correlation between them.[1] His 2001 paper is one of the most cited papers in all of machine learning, with more than 100,000 citations recorded on Semantic Scholar.[14]

In the paper's abstract, Breiman summarized the method this way: "Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest."[1]

## Theoretical foundations

### Bagging (bootstrap aggregating)

Bagging, short for bootstrap aggregating, is the foundational variance-reduction technique behind random forests. Breiman introduced bagging in 1996 as a method for reducing the variance of a statistical learning procedure.[2] The core idea is straightforward: rather than training a single model on the entire dataset, bagging trains multiple models on different random subsets of the data and then combines their predictions.[2]

The variance reduction in bagging comes from the statistical principle that the variance of an average of B random variables, each with variance $$\sigma^2$$, is:

$$
\mathrm{Var}(\text{average}) = \frac{1}{B} \sigma^2 \quad \text{(when variables are independent)}
$$

In practice, the individual trees in a bagged ensemble are not fully independent because they are trained on overlapping data. If the pairwise correlation between trees is $$\rho$$, the variance of the ensemble average becomes:

$$
\mathrm{Var}(\text{average}) = \rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2
$$

The first term depends on the correlation between trees and does not decrease as more trees are added. The second term decreases with B. By reducing $$\rho$$ through feature randomness, random forests reduce the dominant first term, achieving better variance reduction than plain bagging.[8]

### Bootstrap sampling

Bootstrap sampling is the process of creating a new dataset by randomly sampling with replacement from the original training data. For a dataset with n observations, each bootstrap sample also contains n observations, but some original observations appear multiple times while others are left out entirely.

On average, each bootstrap sample contains approximately 63.2% of the unique observations from the original dataset. The remaining 36.8% of observations that are not included in a given bootstrap sample are called out-of-bag (OOB) samples. This follows from the probability that a specific observation is not selected in any of n draws:

$$P(\text{not selected}) = (1 - 1/n)^n$$, which approaches $$1/e$$ (approximately 0.368) as n grows large.

The OOB samples serve as a natural validation set for each tree, which is one of the unique advantages of the random forest approach.

### Feature randomness (random subspace method)

The key innovation that distinguishes random forests from simple bagged decision trees is the introduction of feature randomness at each split. When building each decision tree, at every node where a split is made, only a random subset of the available features is considered as candidates for the split.

This additional randomness serves to decorrelate the individual trees in the ensemble. Without feature randomness, if one or two features are very strong predictors, most trees in a bagged ensemble would use those same features at the top splits, making the trees highly correlated. Averaging correlated trees provides less variance reduction than averaging uncorrelated trees.

The typical number of features considered at each split (denoted $$m$$ or max_features) is:

| Task | Recommended max_features | Formula |
|---|---|---|
| Classification | Square root of total features | $$m = \sqrt{p}$$ |
| Regression | One-third of total features | $$m = p/3$$ |

where $$p$$ is the total number of features. These defaults work well in practice, though tuning max_features can sometimes improve performance.[8] Breiman observed empirically that this choice is not delicate, reporting that "the results turn out to be insensitive to the number of features selected to split each node" and that "usually, selecting one or two features gives near optimum results."[1]

### Generalization error bound

Breiman (2001) derived an upper bound on the generalization error of a random forest.[1] Let $$s$$ denote the strength of the forest (the expected margin of the correct class) and $$\bar{\rho}$$ denote the mean correlation between trees. The generalization error ($$\mathrm{PE}^*$$) is bounded by:

$$
\mathrm{PE}^* \le \bar{\rho} \cdot \frac{1 - s^2}{s^2}
$$

This bound reveals two opposing forces. Reducing the number of features considered at each split (max_features) decreases the correlation $$\bar{\rho}$$ between trees, which helps. But it also decreases the strength $$s$$ of each individual tree, which hurts. The optimal max_features balances these two effects. This theoretical result explains why random forests do not overfit as the number of trees increases; adding more trees only reduces the second term in the variance decomposition.

## Why don't random forests overfit?

A defining property of random forests is that adding more trees does not cause overfitting. Breiman proved this using the Strong Law of Large Numbers, writing in the 2001 paper that "use of the Strong Law of Large Numbers shows that they always converge so that overfitting is not a problem."[1] As the number of trees grows, the generalization error converges almost surely to a limiting value rather than increasing.[1] In practical terms, this means the number of trees (n_estimators) can be raised freely to stabilize predictions: more trees only add computation, never overfitting. Each individual tree is grown deep and is therefore a low-bias, high-variance estimator; the ensemble averaging then drives down the variance while preserving the low bias.[8]

## Algorithm

### Construction

The random forest algorithm constructs a collection of decision trees during the training phase. The complete training procedure for a random forest with B trees is as follows:

1. For each tree $$b = 1, 2, \ldots, B$$:
   a. Draw a bootstrap sample of size n from the training data.
   b. Grow a decision tree on the bootstrap sample. At each node, randomly select $$m$$ features from the $$p$$ available features, find the best split among those $$m$$ features, and split the node into two child nodes.
   c. Grow the tree until a stopping criterion is met (for example, each leaf contains fewer than a minimum number of observations, or the tree reaches a maximum depth).
2. Return the ensemble of B trees.

Individual trees in a random forest are typically grown deep (often to full depth for classification), with little or no pruning. This makes each individual tree a high-variance, low-bias estimator. The ensemble averaging then reduces the variance while preserving the low bias.[8]

### Splitting criteria

At each node, the algorithm selects the feature and threshold that best separates the data. The quality of a split is measured by an impurity metric:

- **Gini impurity** (for classification): measures the probability of incorrectly classifying a randomly chosen sample. For a node with $$K$$ classes and class proportions $$p_1, p_2, \ldots, p_K$$, the Gini impurity is: $$\mathrm{Gini} = 1 - \sum_{k=1}^{K} p_k^2$$.
- **Entropy / information gain** (for classification): measures the reduction in information entropy achieved by a split.
- **Variance reduction** (for regression): measures the reduction in the variance of the target variable.

Gini impurity is the default criterion in most implementations because it is slightly faster to compute than entropy and typically produces similar results.

### Prediction

Once the trees are constructed, the random forest makes predictions by combining the output of each individual tree:

- **Classification:** The final prediction is determined by majority vote. Each tree casts a vote for one class, and the class that receives the most votes is the predicted class. Alternatively, the predicted probability for each class can be computed as the fraction of trees voting for that class.
- **Regression:** The final prediction is the average of the predictions from all individual trees.

Formally, for regression with $$B$$ trees, the prediction for a new input $$x$$ is:

$$
\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)
$$

where $$f_b$$ is the prediction function of the $$b$$-th tree.

## Out-of-bag (OOB) error estimation

One of the most useful properties of random forests is the ability to estimate generalization error without a separate validation set, using out-of-bag (OOB) error estimation.

For each training observation, approximately one-third of the trees did not use that observation in their bootstrap sample. These are the OOB trees for that observation. The OOB prediction for an observation is obtained by aggregating the predictions of only the trees for which that observation was out-of-bag.

The OOB error is the average prediction error across all training observations using their OOB predictions. Breiman (2001) showed that the OOB error is a nearly unbiased estimate of the true generalization error, comparable to the error estimate from [cross-validation](/wiki/cross-validation) but computed at no extra cost during training.[1]

This makes OOB error particularly valuable when the dataset is small and setting aside a separate validation set would significantly reduce the amount of training data available. In practice, OOB error estimates tend to be slightly pessimistic (conservative), which is generally preferable to optimistic estimates.

## Feature importance

Random forests provide several methods for measuring [feature importance](/wiki/feature_importances), helping practitioners understand which features contribute most to predictions.

### Mean decrease in impurity (MDI / Gini importance)

MDI measures the total reduction in node impurity (measured by Gini impurity for classification or variance for regression) that each feature provides across all trees in the forest. Features that appear in more splits and produce larger impurity reductions receive higher importance scores.

Formally, the unnormalized importance of feature $$x$$ is:

$$
\mathrm{Importance}(x) = \frac{1}{B} \sum_{\text{trees } T} \; \sum_{\substack{\text{nodes } j \text{ in } T \\ \text{split on } x}} p(j) \, \Delta_{\text{impurity}}(j)
$$

where $$p(j)$$ is the fraction of samples reaching node $$j$$ and $$\Delta_{\text{impurity}}(j)$$ is the impurity decrease at that node.

MDI is fast to compute because it is calculated during training. However, it has a known bias toward features with high cardinality (many unique values), such as continuous features or categorical features with many levels. Strobl et al. (2007) demonstrated this bias and recommended using permutation importance or conditional importance as alternatives.[6]

### Mean decrease in accuracy (MDA / permutation importance)

MDA (also called permutation importance) measures how much the model's accuracy decreases when the values of a feature are randomly shuffled (permuted), breaking the relationship between the feature and the target. The procedure is:

1. Compute the baseline OOB error (or validation error).
2. For each feature, randomly permute its values and recompute the error.
3. The importance of the feature is the increase in error caused by the permutation.

Permutation importance is more reliable than MDI because it is not biased toward high-cardinality features. However, it is more computationally expensive because it requires re-evaluating the model for each feature. It can also be misleading when features are highly correlated, since permuting one correlated feature may have little effect because the other correlated features still carry the same information.

| Method | Speed | Bias | Reliability | Computed during |
|---|---|---|---|---|
| MDI (Gini importance) | Fast | Biased toward high-cardinality features | Moderate | Training |
| MDA (Permutation importance) | Slow | Unbiased (but affected by correlated features) | High | After training (using OOB or validation data) |

### SHAP values

In recent years, SHAP (SHapley Additive exPlanations) values have become a popular alternative for explaining random forest predictions. SHAP values are based on cooperative game theory and provide a theoretically grounded way to attribute the prediction of each individual observation to the contributions of each feature. The TreeSHAP algorithm (Lundberg et al., 2020) computes exact SHAP values for tree-based models efficiently, making it practical for random forests.[9]

## Proximity measures

Breiman (2001) introduced the concept of proximity measures as a byproduct of random forest training.[1] The proximity between two observations is defined as the proportion of trees in which both observations land in the same terminal (leaf) node. Formally, for observations $$i$$ and $$j$$:

$$
\mathrm{Proximity}(i, j) = \frac{1}{B} \sum_{\text{trees } T} \mathbb{I}\!\left(\mathrm{leaf}_T(i) = \mathrm{leaf}_T(j)\right)
$$

where $$\mathbb{I}$$ is the indicator function and $$\mathrm{leaf}_T(i)$$ is the leaf node that observation $$i$$ falls into in tree $$T$$.

Proximity measures capture the notion that two observations are "similar" if the random forest treats them similarly. These proximities have several practical applications:

- **Outlier detection:** Observations with low proximity to all other observations in their class may be outliers.
- **Missing value imputation:** Missing values can be filled in using a proximity-weighted average of the non-missing values from nearby observations.
- **Data visualization:** The proximity matrix can be used with multidimensional scaling (MDS) to produce low-dimensional plots of the data, revealing cluster structure.
- **Clustering:** The proximity matrix can serve as input to clustering algorithms, providing a data-driven dissimilarity measure that adapts to the structure of the problem.

## Variable interaction detection

Random forests can also be used to detect interactions between variables. Breiman proposed an experimental procedure based on Gini importance values. The approach works as follows: for each pair of variables $$(m, k)$$, the algorithm computes the absolute difference in their Gini importance ranks across all trees, then compares this to the expected difference under the hypothesis that the two variables are independent. A large positive difference implies that a split on one variable inhibits or promotes splits on the other variable, suggesting an interaction.

Breiman noted that this procedure is experimental and its conclusions should be regarded with caution.[1] More recent approaches to interaction detection in random forests include partial dependence plots and the H-statistic proposed by Friedman and Popescu (2008), which quantifies the strength of interactions between features using the partial dependence decomposition.[12]

## Hyperparameters

Random forests have several important hyperparameters that affect model performance. While random forests are generally robust to hyperparameter choices compared to methods like [gradient boosting](/wiki/gradient_boosting), tuning can still yield meaningful improvements.

| Hyperparameter | Description | Typical default | Effect of increasing |
|---|---|---|---|
| n_estimators | Number of trees in the forest | 100 | Better performance, higher computation; no overfitting risk |
| max_depth | Maximum depth of each tree | None (fully grown) | More complex trees, risk of overfitting per tree |
| max_features | Number of features considered at each split | $$\sqrt{p}$$ for classification, $$p/3$$ for regression | Lower correlation among trees, potentially lower accuracy per tree |
| min_samples_split | Minimum samples required to split a node | 2 | Smoother decision boundaries, less overfitting |
| min_samples_leaf | Minimum samples in a leaf node | 1 | Prevents very small leaves, reduces overfitting |
| bootstrap | Whether to use bootstrap sampling | True | If False, each tree uses the full dataset |
| oob_score | Whether to compute OOB error | False | Enables OOB error estimation |
| n_jobs | Number of parallel jobs for training | 1 | Faster training (trees are independent) |
| class_weight | Weights for each class (classification only) | None | Helps with imbalanced datasets |
| max_samples | Number of samples to draw for each tree (if bootstrap=True) | n (full dataset size) | Smaller values increase diversity but may reduce per-tree accuracy |

In the widely used scikit-learn implementation, the default number of trees (n_estimators) was changed from 10 to 100 in version 0.22, reflecting the practical finding that larger forests generally improve accuracy at no risk of overfitting.[16]

### Tuning guidelines

- **n_estimators:** More trees generally improve performance but with diminishing returns beyond a certain point. A common practice is to start with 100 trees and increase until the OOB error stabilizes. In most cases, 300 to 500 trees are sufficient. Importantly, increasing n_estimators never causes overfitting; it only adds computation.
- **max_features:** Reducing max_features increases the diversity of trees (more decorrelation) but makes each tree weaker. The defaults of $$\sqrt{p}$$ for classification and $$p/3$$ for regression are good starting points.
- **max_depth / min_samples_leaf:** For classification, fully grown trees (no depth limit) often work well. For regression or noisy datasets, limiting depth or increasing min_samples_leaf can improve performance.
- **Tuning strategy:** A common practical approach is to use RandomizedSearchCV first to narrow down the hyperparameter ranges, then apply GridSearchCV to fine-tune the best values found. The most impactful hyperparameters to tune are typically n_estimators, max_features, and max_depth or min_samples_split.

## How does random forest differ from a single decision tree?

A single [decision tree](/wiki/decision_tree) is easy to interpret but tends to have high variance: small changes in the training data can produce a very different tree, and a deep tree often overfits. A random forest addresses this by training hundreds of de-correlated trees on bootstrap samples and averaging their votes, which sharply reduces variance while keeping bias low. The trade-off is interpretability: a single tree can be read as a flow chart, whereas a forest of hundreds of trees behaves more like a "black box" and must be explained through feature importance or SHAP values. The table below contrasts a single decision tree with a random forest of the same trees.

| Aspect | Single decision tree | Random forest |
|---|---|---|
| Variance | High (unstable) | Low (averaged over many trees) |
| Overfitting | Prone to overfit when grown deep | Resistant; adding trees does not overfit |
| Interpretability | High (readable flow chart) | Lower (ensemble of many trees) |
| Predictive accuracy | Moderate | Generally higher |
| Built-in validation | None | Out-of-bag error estimate |

## Random forests for classification vs. regression

Random forests can be applied to both classification and regression problems. The core algorithm is the same, but there are some differences in defaults and aggregation:

| Aspect | Classification | Regression |
|---|---|---|
| Aggregation method | Majority vote (or averaged probabilities) | Mean of predictions |
| Default max_features | $$\sqrt{p}$$ | $$p/3$$ |
| Splitting criterion | Gini impurity or entropy | Variance reduction (MSE) |
| Output | Class label or class probabilities | Continuous value |
| Tree depth | Often fully grown | May benefit from depth limits |
| Evaluation metric | Accuracy, F1, AUC | MSE, MAE, R-squared |

For classification, each tree votes for a class and the majority wins. For regression, the predictions from all trees are averaged. A notable extension is the Quantile Regression Forest (Meinshausen, 2006), which retains all training observations in the leaf nodes rather than just their mean.[7] This allows the forest to estimate the full conditional distribution of the target variable, enabling prediction intervals.[7]

## How accurate are random forests?

Random forests are widely regarded as one of the strongest off-the-shelf classifiers, often delivering near-best accuracy with default settings. In a large empirical study, Fernandez-Delgado et al. (2014) evaluated 179 classifiers drawn from 17 families on 121 datasets and concluded that "the classifiers most likely to be the bests are the random forest (RF) versions."[15] The best random forest in that study reached 94.1% of the maximum accuracy achieved by any classifier on each dataset, ahead of all other families, though the gap to the second-best family (a support vector machine with a Gaussian kernel) was not statistically significant.[15] These results help explain why random forests remain a default baseline for tabular data, even though carefully tuned gradient boosting methods can edge them out.

## Advantages

Random forests offer several advantages over single decision trees and many other machine learning algorithms:

- **Improved accuracy:** By aggregating the output of multiple trees, random forests achieve higher accuracy and better generalization than individual trees.
- **Robustness to overfitting:** The use of bootstrap sampling and random feature selection helps reduce overfitting. Adding more trees never increases overfitting; it only stabilizes the ensemble.
- **Handling missing values:** The algorithm can handle missing data through surrogate splits in the decision trees and through proximity-based imputation.
- **Feature importance:** The relative importance of features can be assessed using Gini importance, permutation importance, or SHAP values.
- **Minimal preprocessing:** Random forests do not require feature scaling or normalization. They handle both numerical and categorical features natively.
- **Parallelizable:** Since each tree is built independently, training can be easily parallelized across multiple CPU cores.
- **Built-in validation:** OOB error provides a free estimate of generalization performance without needing a separate validation set.
- **Handles high-dimensional data:** Random forests work well even when the number of features greatly exceeds the number of observations, a common situation in genomics and text classification.
- **Nonparametric:** No assumptions about the underlying data distribution are required.

## Limitations

Despite its many strengths, random forests have notable limitations:

- **Interpretability trade-off:** While individual decision trees are highly interpretable, the ensemble nature of random forests makes the final model harder to interpret. Feature importance scores and SHAP values partially address this, but the model itself remains a "black box" compared to a single tree.
- **Memory usage:** Storing hundreds or thousands of full-depth decision trees can require significant memory, especially for large datasets with many features.
- **Slow prediction for large forests:** Although training can be parallelized, prediction requires passing each new observation through every tree. For real-time applications with very large forests, this can be a bottleneck.
- **Biased feature importance for high-cardinality features:** The default Gini importance measure is biased toward features with many unique values. Permutation importance or conditional importance should be used when this is a concern.
- **No extrapolation:** Random forests cannot extrapolate beyond the range of the training data. For regression tasks, predictions are bounded by the minimum and maximum target values seen during training. This makes them unsuitable for tasks that require predicting trends beyond observed data.
- **Performance ceiling:** On structured/tabular data, well-tuned gradient boosting methods (XGBoost, LightGBM) often slightly outperform random forests.
- **Bias with imbalanced data:** Random forests can be biased toward the majority class in imbalanced datasets, though class weighting and balanced bootstrap sampling can mitigate this.

## Extremely Randomized Trees (Extra-Trees)

Extremely Randomized Trees, or Extra-Trees, were introduced by Geurts, Ernst, and Wehenkel (2006) as a variant that pushes the randomization strategy even further than standard random forests.[5] The key differences are:

| Aspect | Random forest | Extra-Trees |
|---|---|---|
| Bootstrap sampling | Yes (sample with replacement) | No (uses the full training set) |
| Split threshold | Best threshold among random features | Random threshold for each random feature |
| Split optimization | Optimized per feature | Fully randomized |
| Training speed | Moderate | Faster (no threshold optimization) |
| Bias-variance trade-off | Lower bias, moderate variance | Slightly higher bias, lower variance |
| Tree correlation | Moderate | Lower (more diverse trees) |

In Extra-Trees, split thresholds are drawn at random for each candidate feature, and the best of these random splits is chosen. Because the algorithm skips the computationally expensive step of finding optimal split points, Extra-Trees are generally faster to train. The additional randomness further reduces the correlation between trees, which can reduce the variance of the ensemble. However, individual Extra-Trees tend to have higher bias than standard random forest trees because they do not optimize splits.

In practice, Extra-Trees often achieve comparable or sometimes slightly better performance than standard random forests, particularly on datasets where overfitting is a concern. Scikit-learn provides `ExtraTreesClassifier` and `ExtraTreesRegressor` for this variant.

## How does random forest differ from gradient boosting?

## Comparison with gradient boosting

Gradient boosting is another popular ensemble method that combines decision trees, but it works very differently from random forests. Understanding the differences helps practitioners choose the right method.

| Aspect | Random forest | Gradient boosting (XGBoost, LightGBM) |
|---|---|---|
| Ensemble strategy | Bagging (parallel) | Boosting (sequential) |
| Tree construction | Independent trees | Each tree corrects errors of previous trees |
| Tree depth | Deep (often fully grown) | Shallow (typically 3 to 8 levels) |
| Overfitting risk | Low | Higher (requires careful regularization) |
| Hyperparameter sensitivity | Low | High |
| Training speed | Fast (parallelizable) | Slower (sequential), though LightGBM is very fast |
| Typical accuracy | Good | Often slightly better when well-tuned |
| Ease of use | High (good defaults) | Moderate (requires tuning learning rate, n_estimators, depth) |
| Extrapolation | Cannot extrapolate | Cannot extrapolate |
| Missing values | Handled via surrogates | Handled natively in XGBoost and LightGBM |

In random forests, trees are built independently using bootstrap samples and random feature subsets. In gradient boosting, trees are built sequentially, with each new tree trained to correct the residual errors of the ensemble built so far. Gradient boosting trees are deliberately kept shallow (weak learners) because each tree only needs to make a small improvement.

Gradient boosting primarily reduces bias by iteratively fitting residuals, while random forests primarily reduce variance through averaging. In practice, gradient boosting methods like XGBoost (Chen and Guestrin, 2016),[10] LightGBM (Ke et al., 2017), and CatBoost often achieve higher accuracy than random forests on structured/tabular data when properly tuned. However, random forests remain preferred when ease of use and robustness are priorities, when computational resources for hyperparameter tuning are limited, or when a reliable baseline model is needed quickly.

## Relationship to kernel methods and nearest neighbors

Random forests have a deep connection to kernel methods and nearest-neighbor approaches. Both random forests and k-nearest neighbors can be expressed as weighted neighborhood schemes:

$$
\hat{y}(x) = \sum_{i} W(x_i, x) \, y_i
$$

where $$W$$ represents neighborhood weights. In a random forest, the weight assigned to training observation i when predicting at point x is the fraction of trees in which $$x$$ and $$x_i$$ land in the same leaf node. This is precisely the proximity measure discussed earlier.

Lin and Jeon (2006) showed that random forests can be viewed as adaptive nearest-neighbor methods, where the neighborhood of a point depends in a complex way on the tree structures and adapts locally to feature importance.[11] This perspective has led to the development of Kernel Random Forests (KeRF), which make the connection explicit and provide a framework that is easier to analyze theoretically.

Scornet (2016) studied the consistency of these kernel-based forest estimators and established convergence rates under mild regularity conditions.

## Variants and extensions

Several variants and extensions of the original random forest algorithm have been developed:

- **Isolation Forest:** An adaptation of random forests for anomaly detection that isolates anomalies by randomly partitioning data. Anomalies require fewer random partitions to isolate than normal points.
- **Weighted Random Forest:** Assigns weights to individual trees based on their performance, giving more influence to better-performing trees.
- **Random Survival Forests:** Extends the algorithm to handle censored data common in medical and reliability studies, where the outcome of interest is time until an event occurs.
- **Quantile Regression Forests:** Instead of predicting only the conditional mean, this variant estimates the full conditional distribution of the target variable, allowing for prediction intervals (Meinshausen, 2006).[7]
- **Mondrian Forests:** An online variant that supports incremental learning, allowing new data to be incorporated without retraining the entire forest.
- **Random Forest for high-dimensional data:** Variants like enriched random forests and regularized random forests have been developed specifically for settings where the number of features far exceeds the number of observations, such as in genomics.

## What is a random forest used for?

Random forests are applied across a wide range of domains because they work well on tabular data with mixed numerical and categorical features, require little preprocessing, and give a reliable accuracy estimate for free through out-of-bag error. Common applications include credit scoring and fraud detection in finance, disease classification and biomarker discovery in genomics and medicine, churn prediction and recommendation in business analytics, land-cover classification from satellite imagery in remote sensing, and as a strong baseline in data-science competitions. Breiman highlighted high-dimensional problems specifically, noting that "important recent problems, i.e., medical diagnosis and document retrieval, often have the property that there are many input variables, often in the hundreds or thousands, with each one containing only a small amount of information," a setting in which combining trees grown with random features "can produce improved accuracy."[1]

## Random forests in practice

Scikit-learn provides the most widely used Python implementation of random forests:[13]

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', oob_score=True)
rf.fit(X_train, y_train)
print("OOB Score:", rf.oob_score_)
print("Test Accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```

For regression tasks, `RandomForestRegressor` provides an equivalent API. Other popular implementations include:

| Library / Package | Language | Notes |
|---|---|---|
| scikit-learn | Python | Most popular; `RandomForestClassifier`, `RandomForestRegressor` |
| randomForest (Liaw and Wiener) | R | Classic R implementation based on Breiman's original Fortran code |
| ranger | R | Fast C++ implementation, handles large datasets efficiently |
| H2O | Python, R, Java | Distributed random forest for big data |
| Apache Spark MLlib | Scala, Python | Distributed training on large clusters |
| XGBoost | Python, R, C++ | Includes a random forest mode via `num_parallel_tree` |

### Practical tips

When working with random forests in practice, the following guidelines can help achieve good results:

- **Start with defaults.** Random forests' default hyperparameters often produce competitive results without any tuning. This makes them an excellent baseline model for any classification or regression problem.
- **Use OOB score for quick evaluation.** Enable `oob_score=True` to get an estimate of generalization performance without needing a validation set or cross-validation.
- **Check feature importance early.** After fitting the model, inspect feature importance to identify irrelevant features. Removing noise features can sometimes improve performance and reduce training time.
- **Increase n_estimators for stability.** If the OOB score fluctuates across runs, increasing the number of trees usually helps stabilize results.
- **Consider class weights for imbalanced data.** When one class is much rarer than the other, setting `class_weight='balanced'` adjusts the algorithm to give more weight to the minority class during training.
- **Monitor memory usage.** Each tree stores its full structure in memory. For very large forests or very deep trees, memory consumption can become significant. Limiting `max_depth` or `max_leaf_nodes` reduces per-tree memory usage.
- **When to prefer random forests over gradient boosting.** Random forests are a good choice when you need a quick, reliable baseline; when you cannot afford extensive hyperparameter tuning; when training data is limited (OOB estimation is valuable); or when you want a model that is robust to noisy features.

## Explain like I'm 5 (ELI5)

Imagine you want to know whether it will rain tomorrow. Instead of asking just one friend, you ask a hundred friends. Each friend looks at slightly different clues (some look at the clouds, some check the wind, some look at what the animals are doing). Not every friend sees the same clues, and not every friend uses the same weather data from the past.

Once everyone has made their guess, you count up the votes. If 70 friends say "rain" and 30 say "no rain," you go with "rain." That is basically how a random forest works. Each "friend" is a decision tree. By asking lots of different trees and taking the most popular answer, random forests make better predictions than any single tree would on its own.

## References

1. Breiman, L. (2001). "Random Forests." *Machine Learning*, 45(1), 5-32. https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
2. Breiman, L. (1996). "Bagging Predictors." *Machine Learning*, 24(2), 123-140.
3. Ho, T. K. (1995). "Random Decision Forests." *Proceedings of the 3rd International Conference on Document Analysis and Recognition*, 278-282.
4. Ho, T. K. (1998). "The Random Subspace Method for Constructing Decision Forests." *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 20(8), 832-844.
5. Geurts, P., Ernst, D., & Wehenkel, L. (2006). "Extremely Randomized Trees." *Machine Learning*, 63(1), 3-42.
6. Strobl, C., Boulesteix, A. L., Zeileis, A., & Hothorn, T. (2007). "Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution." *BMC Bioinformatics*, 8, 25.
7. Meinshausen, N. (2006). "Quantile Regression Forests." *Journal of Machine Learning Research*, 7, 983-999.
8. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning*. Springer.
9. Lundberg, S. M., Erion, G., Chen, H., et al. (2020). "From Local Explanations to Global Understanding with Explainable AI for Trees." *Nature Machine Intelligence*, 2, 56-67.
10. Chen, T. & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 785-794.
11. Lin, Y. & Jeon, Y. (2006). "Random Forests and Adaptive Nearest Neighbors." *Journal of the American Statistical Association*, 101(474), 578-590.
12. Friedman, J. H. & Popescu, B. E. (2008). "Predictive Learning via Rule Ensembles." *Annals of Applied Statistics*, 2(3), 916-954.
13. Scikit-learn documentation: [Ensemble Methods](https://scikit-learn.org/stable/modules/ensemble.html).
14. Semantic Scholar entry for Breiman, L. (2001), "Random Forests," citation count. https://www.semanticscholar.org/paper/Random-Forests-Breiman/8e0be569ea77b8cb29bb0e8b031887630fe7a96c
15. Fernandez-Delgado, M., Cernadas, E., Barro, S., & Amorim, D. (2014). "Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?" *Journal of Machine Learning Research*, 15, 3133-3181. https://jmlr.org/papers/v15/delgado14a.html
16. Scikit-learn, `RandomForestClassifier` API reference (n_estimators default changed from 10 to 100 in version 0.22). https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html