# Machine learning terms/Decision Forests

> Source: https://aiwiki.ai/wiki/machine_learning_terms_decision_forests
> Updated: 2026-07-07
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

**Decision forests** are a family of [machine learning](/wiki/machine_learning) models built from collections of [decision trees](/wiki/decision_tree). Instead of relying on a single tree, a [decision forest](/wiki/decision_forest) trains many trees and combines their predictions, usually by averaging for [regression](/wiki/regression) tasks and by majority vote or class probability averaging for [classification](/wiki/classification). The result is a model that is typically more accurate, more stable, and less prone to [overfitting](/wiki/overfitting) than any individual tree.

The two dominant decision forest paradigms are [random forests](/wiki/random_forest), which train many independent trees in parallel on bootstrap samples of the data [2], and [gradient boosted trees](/wiki/gradient_boosted_decision_trees_gbt), which train trees sequentially with each new tree correcting the residual errors of the ensemble built so far [6]. Both approaches dominate practical work on tabular data. They power credit scoring at major banks, ad click-through prediction at search engines, learning-to-rank systems, fraud detection, churn prediction, and the majority of winning solutions on tabular [Kaggle](/wiki/kaggle) competitions. On a 2022 benchmark of 45 medium-sized tabular datasets, tree-based models remained state of the art against deep networks even after both families received extensive hyperparameter tuning [18].

This page is the gateway hub for decision-forest entries on the AI Wiki. It covers the history, the mathematics behind tree splitting, the major algorithm families, the leading software libraries, and a curated index of every decision-forest concept that has its own dedicated wiki page.

## what is a decision forest?

A decision forest is an [ensemble learning](/wiki/ensemble_learning) method whose base learners are [decision trees](/wiki/decision_tree). At inference time the input traverses each tree along an [inference path](/wiki/inference_path) of [conditions](/wiki/condition) until it reaches a [leaf](/wiki/leaf), and the leaf values are aggregated across the forest to produce the final prediction. Decision forests are non-parametric, handle mixed numerical and categorical features natively, are invariant to monotonic transformations of individual features (such as rescaling or taking logarithms), and require comparatively little [feature engineering](/wiki/feature_engineering).

The broad appeal of decision forests rests on a short list of practical strengths:

| Property | Why it matters |
|---|---|
| Strong accuracy on tabular data | Repeatedly beats deep learning baselines on heterogeneous tabular benchmarks |
| Native handling of categorical features | No mandatory [one-hot encoding](/wiki/one-hot_encoding); algorithms such as [LightGBM](/wiki/lightgbm) and [CatBoost](/wiki/catboost) split on raw category values |
| Robust to outliers and feature scaling | Splits are based on order, not magnitude, so a single extreme value cannot distort the geometry of the input |
| Built-in [feature importances](/wiki/feature_importances) | Direct measures of which inputs the model relies on most |
| Predictable training cost | Training time scales close to linearly with the number of trees and roughly with $n \log n$ in the number of training examples |
| Few brittle hyperparameters | Modern libraries train respectable models with default settings |

Decision forests are weaker on problems where signal lives in high-dimensional structured inputs such as raw pixels, audio waveforms, or token sequences. Those domains belong to [deep learning](/wiki/deep_learning) and convolutional or [transformer](/wiki/transformer) architectures.

## decision tree fundamentals

Every decision forest is built out of [decision trees](/wiki/decision_tree). A tree is a recursive partition of the input space. Each internal [node](/wiki/node_decision_tree) holds a [test](/wiki/test) that compares one or more [features](/wiki/feature) against a [threshold](/wiki/threshold_for_decision_trees) or category set, splitting incoming examples into child branches. Each [leaf](/wiki/leaf) stores a constant prediction, typically a class probability vector for classification or a numeric value for regression. The starting node of the tree is the [root](/wiki/root). The component that decides which condition to install at each node is called the [splitter](/wiki/splitter), and the operation it performs is a [split](/wiki/split).
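
The structure described above can be sketched in a few lines. This is a minimal illustration, not the data layout of any particular library: the `Node` class, the `predict_tree` and `predict_forest` helpers, and the two hand-built trees are all hypothetical, and the forest aggregates by plain averaging as in regression.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """A node of a binary, axis-aligned regression tree (illustrative only)."""
    feature: Optional[int] = None   # index of the tested feature; None marks a leaf
    threshold: float = 0.0          # examples with x[feature] <= threshold go left
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: float = 0.0              # constant prediction stored at a leaf


def predict_tree(root: Node, x: Sequence[float]) -> float:
    """Follow the inference path from the root to a leaf and return its value."""
    node = root
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value


def predict_forest(trees: Sequence[Node], x: Sequence[float]) -> float:
    """Aggregate the leaf values of all trees by averaging."""
    return sum(predict_tree(t, x) for t in trees) / len(trees)


# Two hand-built trees over a two-feature input.
tree_a = Node(feature=0, threshold=1.5, left=Node(value=10.0), right=Node(value=20.0))
tree_b = Node(
    feature=1, threshold=0.0,
    left=Node(value=12.0),
    right=Node(feature=0, threshold=3.0, left=Node(value=18.0), right=Node(value=30.0)),
)

example = [2.0, 0.5]
print(predict_tree(tree_a, example))              # 20.0
print(predict_tree(tree_b, example))              # 18.0
print(predict_forest([tree_a, tree_b], example))  # 19.0
```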

### the classical algorithm families

Three historical algorithm families established the recipes that modern forests still follow:

| Algorithm | Year | Author | Key ideas |
|---|---|---|---|
| ID3 | 1986 | J. Ross Quinlan | Multiway splits on categorical features, [information gain](/wiki/information_gain) (entropy reduction) as the splitting criterion, no pruning [4] |
| C4.5 | 1993 | J. Ross Quinlan | Successor to ID3. Adds support for numerical features, missing-value handling, gain ratio normalization, and post-pruning [5] |
| CART (Classification and Regression Trees) | 1984 | Breiman, Friedman, Olshen, Stone | Strictly [binary conditions](/wiki/binary_condition) at each node, [Gini impurity](/wiki/gini_impurity) for classification, mean squared error for regression, cost-complexity [pruning](/wiki/pruning) [3] |

Most modern decision forest implementations are CART-style: they fit binary trees, choose splits greedily, and grow a forest of such trees. ID3 and C4.5 remain influential, especially in textbook treatments and in the older Weka ecosystem.

### conditions and node tests

A [condition](/wiki/condition), also called a test, is the question installed at a node. The major condition types are:

| Type | Form | Notes |
|---|---|---|
| [Binary condition](/wiki/binary_condition) | Two child branches | The default in CART, [random forest](/wiki/random_forest), and gradient boosted trees |
| [Non-binary condition](/wiki/non-binary_condition) | More than two children | Used by ID3 and C4.5 on categorical features |
| [Axis-aligned condition](/wiki/axis-aligned_condition) | Compares a single feature against a threshold | Standard in nearly every production forest |
| [Oblique condition](/wiki/oblique_condition) | Linear combination of features compared against a threshold | More expressive but more expensive; appears in oblique random forests |
| [In-set condition](/wiki/in-set_condition) | Tests whether a categorical feature lies in a learned subset | The native categorical split used by [LightGBM](/wiki/lightgbm) and [CatBoost](/wiki/catboost) |

The [threshold](/wiki/threshold_for_decision_trees) used in an axis-aligned numerical condition is selected by the [splitter](/wiki/splitter) to minimize an impurity measure on the resulting children. For categorical features the splitter searches a partition of category values; for high-cardinality categoricals modern libraries use efficient sorting-based heuristics rather than exhaustive search.
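
The condition types in the table can be viewed as small predicates over an example. The sketch below is only an illustration under that assumption: the helper names (`axis_aligned`, `oblique`, `in_set`), the dictionary-based example format, and the convention that a condition is true when the value is at or above the threshold are all choices made here, not taken from any library.

```python
from typing import Callable, Dict, FrozenSet

Example = Dict[str, object]
Condition = Callable[[Example], bool]


def axis_aligned(feature: str, threshold: float) -> Condition:
    """Compare a single numerical feature against a threshold."""
    return lambda ex: ex[feature] >= threshold


def oblique(weights: Dict[str, float], threshold: float) -> Condition:
    """Compare a linear combination of features against a threshold."""
    return lambda ex: sum(w * ex[name] for name, w in weights.items()) >= threshold


def in_set(feature: str, categories: FrozenSet[str]) -> Condition:
    """Test whether a categorical feature falls in a given subset of values."""
    return lambda ex: ex[feature] in categories


ex = {"age": 42, "income": 30.0, "country": "FR"}

print(axis_aligned("age", 40)(ex))                          # True: 42 >= 40
print(oblique({"age": 1.0, "income": -1.0}, 20.0)(ex))      # False: 42 - 30 = 12 < 20
print(in_set("country", frozenset({"FR", "DE"}))(ex))       # True
```

All three are binary conditions: each routes an example to one of two children depending on the boolean result.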

## how does a decision tree choose splits?

A split is chosen by scoring every candidate condition with an impurity or loss function and picking the condition that reduces the score the most.

### gini impurity

[Gini impurity](/wiki/gini_impurity) is the probability that a randomly chosen example would be misclassified if labelled by the class distribution at the node:

$$\text{Gini}(t) = 1 - \sum_{k=1}^{K} p_k^2$$

where $p_k$ is the proportion of class $k$ in node $t$. Gini is zero for a pure node and reaches its maximum when classes are perfectly balanced. It is the default split criterion in [scikit-learn](/wiki/scikit-learn)'s `DecisionTreeClassifier` and `RandomForestClassifier` [20].
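
As a quick check of the formula, the following sketch computes Gini impurity from a list of labels. The function name `gini` is an illustrative choice, and the empty-node convention (returning 0) is an assumption made here.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of the class distribution in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention assumed here for an empty node
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini(["a", "a", "a", "a"]))   # 0.0 for a pure node
print(gini(["a", "a", "b", "b"]))   # 0.5, the two-class maximum
print(gini(["a", "b", "c"]))        # about 0.667, the three-class maximum 1 - 1/3
```

With $K$ balanced classes the impurity equals $1 - 1/K$, so the maximum grows toward 1 as the number of classes increases.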

### entropy and information gain

Shannon [entropy](/wiki/entropy) measures uncertainty in the class distribution at a node:

$$H(t) = -\sum_{k=1}^{K} p_k \log_2 p_k$$

The corresponding split score is [information gain](/wiki/information_gain), the reduction in entropy after splitting:

$$\text{IG}(t, s) = H(t) - \sum_{c \in \text{children}(s)} \frac{|c|}{|t|} H(c)$$

ID3 selects the split with maximum information gain [4]. C4.5 uses gain ratio, a normalized version that penalizes splits with many children [5]. In practice, Gini and entropy almost always agree on which split is best; differences in tree structure are small and rarely change downstream accuracy.
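
The entropy and information gain formulas above translate directly into code. The sketch below uses hypothetical helper names (`entropy`, `information_gain`) and a made-up ten-example node to show how a cleaner split earns a larger gain.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)


parent = ["yes"] * 5 + ["no"] * 5

# A perfect split separates the classes completely.
perfect = [["yes"] * 5, ["no"] * 5]
# A noisier split leaves one minority example in each child.
noisy = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]

print(entropy(parent))                    # 1.0 bit for a balanced two-class node
print(information_gain(parent, perfect))  # 1.0, all uncertainty removed
print(information_gain(parent, noisy))    # smaller gain, roughly 0.28 bits
```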

### regression splits

For [regression](/wiki/regression) trees, the impurity at a node is usually the mean squared error of the leaf prediction. The split that minimizes the post-split sum of squared errors is selected. Mean absolute error and Huber loss are also available in major libraries for problems with heavy-tailed targets.
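
A minimal sketch of squared-error split search on a single numerical feature follows. The names `sse` and `best_threshold` are hypothetical, the data are invented, and the search recomputes the error for every candidate, which is quadratic in the number of examples; production splitters use running sums or histograms instead.

```python
def sse(values):
    """Sum of squared errors around the mean, i.e. the error of a constant leaf."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_threshold(xs, ys):
    """Return (threshold, post-split SSE) for the best axis-aligned split.

    The threshold is None when no split improves on the unsplit node.
    """
    pairs = sorted(zip(xs, ys))
    best = (None, sse(ys))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot place a threshold between equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = sse(left) + sse(right)
        if score < best[1]:
            midpoint = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (midpoint, score)
    return best


xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, score = best_threshold(xs, ys)
print(threshold, score)  # the split lands between x = 3 and x = 10
```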

### loss-based splits in boosting

In [gradient boosting](/wiki/gradient_boosting), trees are fit to the negative gradient of a loss function, so the splitting criterion comes from a second-order Taylor expansion of that loss rather than from a fixed impurity. XGBoost popularized this view with its gain formula [10]:

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$

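where $G_L$, $G_R$ and $H_L$, $H_R$ are the sums of the first and second derivatives of the loss over the examples in the left and right children, $\lambda$ is the L2 [regularization](/wiki/regularization) penalty on leaf weights, and $\gamma$ is the complexity penalty charged for each additional leaf.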
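
The gain formula can be evaluated by hand for a small case. The sketch below is an illustration only: `split_gain` is a hypothetical helper, and the numbers assume a squared-error loss $\tfrac{1}{2}(\hat{y} - y)^2$, for which each example contributes a gradient of $\hat{y} - y$ and a Hessian of 1.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Second-order split gain with L2 penalty `lam` and per-leaf penalty `gamma`."""
    def leaf_score(g, h):
        return g * g / (h + lam)

    parent = leaf_score(g_left + g_right, h_left + h_right)
    return 0.5 * (leaf_score(g_left, h_left) + leaf_score(g_right, h_right) - parent) - gamma


# Two examples with gradient -2 fall left, two with gradient +2 fall right.
# Under squared error each Hessian is 1, so H_L = H_R = 2.
gain = split_gain(g_left=-4.0, h_left=2.0, g_right=4.0, h_right=2.0, lam=1.0)
print(gain)  # 0.5 * (16/3 + 16/3 - 0) = 16/3

# A large gamma makes the same split unprofitable, so it would be rejected.
print(split_gain(-4.0, 2.0, 4.0, 2.0, lam=1.0, gamma=6.0))
```

A split is worth making only when the gain is positive, which is how $\gamma$ acts as a pruning threshold.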

## random forests

[Random forest](/wiki/random_forest) was introduced by Leo Breiman in 2001, who defined it as "a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest" [2]. It combines two ideas:

1. [Bagging](/wiki/bagging), short for bootstrap aggregating, also due to Breiman in 1996 [1]. Each tree is trained on a bootstrap sample drawn by [sampling with replacement](/wiki/sampling_with_replacement) from the original training set. Different bootstrap samples produce different trees; averaging their predictions reduces variance.
2. Random subspace sampling, also called [attribute sampling](/wiki/attribute_sampling). At each split, only a random subset of features is considered as candidates. This decorrelates the trees, bringing the variance reduction from averaging closer to the ideal $1/T$ for $T$ independent trees.
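
The role of decorrelation can be made concrete with the standard formula for the variance of an average of $T$ identically distributed predictors, each with variance $\sigma^2$ and pairwise correlation $\rho$: $\rho\sigma^2 + (1 - \rho)\sigma^2 / T$. The helper name `ensemble_variance` and the numbers below are illustrative assumptions.

```python
def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the mean of `n_trees` predictors with pairwise correlation `rho`."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


# Fully independent trees reach the ideal 1/T reduction.
print(ensemble_variance(sigma2=1.0, rho=0.0, n_trees=100))  # 0.01

# Correlated trees hit a floor of rho * sigma2 no matter how many are added.
print(ensemble_variance(sigma2=1.0, rho=0.5, n_trees=100))
print(ensemble_variance(sigma2=1.0, rho=0.5, n_trees=10_000))
```

Adding trees shrinks only the second term, so lowering $\rho$ through bootstrap sampling and attribute sampling is what keeps averaging effective.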

Because each bootstrap sample omits roughly $(1 - 1/n)^n \approx 1/e \approx 36.8\%$ of the training examples, random forests support [out-of-bag evaluation](/wiki/out-of-bag_evaluation_oob_evaluation): each example is scored by the trees that did not see it during training, giving a near-unbiased generalization estimate without a separate validation set [2].
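
The out-of-bag fraction is easy to verify numerically. A given example is missed by a single draw with probability $1 - 1/n$, so it is missed by all $n$ draws of a bootstrap sample with probability $(1 - 1/n)^n$, which approaches $1/e$. The function names and the seed below are illustrative choices.

```python
import math
import random


def oob_probability(n):
    """Probability that one example is absent from a bootstrap sample of size n."""
    return (1.0 - 1.0 / n) ** n


def simulate_oob_fraction(n, seed=0):
    """Draw one bootstrap sample of size n and return the fraction of unseen examples."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1.0 - len(drawn) / n


print(oob_probability(10))          # small n: still noticeably below 1/e
print(oob_probability(1000))        # close to 1/e
print(math.exp(-1))                 # the limiting value
print(simulate_oob_fraction(10_000))  # one simulated bootstrap sample
```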

A random forest also has a built-in measure of [variable importances](/wiki/variable_importances). Two variants are widely used: mean decrease in impurity (MDI), which sums the impurity reductions a feature contributes across all trees, and [permutation variable importances](/wiki/permutation_variable_importances), which measures the drop in accuracy when a feature's values are randomly permuted. Permutation importance is generally preferred because MDI is biased toward features with many possible split points or high cardinality.

The success of random forests is sometimes described in terms of the [wisdom of the crowd](/wiki/wisdom_of_the_crowd): independent learners that each do better than chance, when aggregated, produce far stronger predictions than any single learner. Galton's 1906 ox-weight contest is the canonical illustration.

## extremely randomized trees

Extremely Randomized Trees, often abbreviated Extra-Trees, were introduced by Pierre Geurts, Damien Ernst, and Louis Wehenkel in 2006 [9]. They differ from random forests in two ways:

1. The full training set is used to grow each tree, with no bootstrap step.
2. At each node, the threshold for each candidate feature is drawn at random rather than chosen to maximize impurity reduction. The split is then selected as the best among these random candidates.
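
The node-splitting rule in the list above can be sketched as follows. This is a simplified illustration with hypothetical names (`sse`, `extra_trees_split`), a squared-error score, and toy data; it is not the implementation of any library.

```python
import random


def sse(values):
    """Sum of squared errors around the mean of `values`."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def extra_trees_split(rows, targets, n_candidate_features, rng):
    """Draw one random threshold per candidate feature and keep the best one.

    Returns (feature index, threshold, post-split SSE), or None if no usable
    split was drawn.
    """
    features = rng.sample(range(len(rows[0])), n_candidate_features)
    best = None
    for f in features:
        lo = min(r[f] for r in rows)
        hi = max(r[f] for r in rows)
        if lo == hi:
            continue  # constant feature in this node, nothing to split on
        threshold = rng.uniform(lo, hi)  # random cut point: no sorting, no search
        left = [t for r, t in zip(rows, targets) if r[f] < threshold]
        right = [t for r, t in zip(rows, targets) if r[f] >= threshold]
        if not left or not right:
            continue
        score = sse(left) + sse(right)
        if best is None or score < best[2]:
            best = (f, threshold, score)
    return best


rows = [[0.0, 5.0], [1.0, 5.0], [9.0, 5.0], [10.0, 5.0]]
targets = [0.0, 0.0, 1.0, 1.0]
result = extra_trees_split(rows, targets, n_candidate_features=2, rng=random.Random(0))
print(result)  # feature 1 is constant, so the chosen split is on feature 0
```

The only optimization step is the comparison among the randomly drawn candidates, which is why no per-feature sort is needed.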

The extra randomization further reduces variance at the cost of a small increase in bias. In practice Extra-Trees often match random forests on accuracy while training faster, since random thresholds avoid the per-feature sort that dominates standard split search. They are available in [scikit-learn](/wiki/scikit-learn) as `ExtraTreesClassifier` and `ExtraTreesRegressor` [20].

## gradient boosting and modern boosted trees

Gradient boosting was formalized by Jerome Friedman in 2001 in the paper "Greedy Function Approximation: A Gradient Boosting Machine" [6]. It generalizes the [AdaBoost](/wiki/adaboost) algorithm of Freund and Schapire (1995) to arbitrary differentiable loss functions [8]. The training loop is sequential rather than parallel:

1. Initialize the model with a constant prediction.
2. At each round $m = 1, \dots, M$:
   - Compute the negative gradient of the loss with respect to the current predictions; this is the pseudo-residual.
   - Fit a small regression tree to the pseudo-residuals.
   - Add the new tree to the ensemble, scaled by a [learning rate](/wiki/learning_rate) $\eta$ (also called [shrinkage](/wiki/shrinkage)).
3. Stop when validation loss stops improving.
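
The loop above can be sketched in plain Python for the squared-error loss, where the pseudo-residual is simply the target minus the current prediction. Everything here is a simplifying assumption: one-dimensional inputs with at least two distinct values, depth-one trees (stumps), a fixed number of rounds in place of early stopping, and the hypothetical names `fit_stump`, `gradient_boost`, and `mse`.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump (a depth-one tree) on 1-D inputs."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared error with stumps as base learners."""
    base = sum(ys) / len(ys)                 # step 1: constant initial prediction
    trees = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):                # step 2: one tree per round
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [x ** 2 for x in xs]
mean_y = sum(ys) / len(ys)

print(mse(lambda x: mean_y, xs, ys))              # constant baseline
print(mse(gradient_boost(xs, ys, 5), xs, ys))     # after 5 rounds
print(mse(gradient_boost(xs, ys, 50), xs, ys))    # after 50 rounds
```

The training error falls with every round; this is also why a validation-based stopping rule is needed in practice.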

Friedman's gradient boosted regression trees are also known as MART (Multiple Additive Regression Trees). His follow-up method, Stochastic Gradient Boosting (2002), added [subsampling](/wiki/subsampling) of training rows at each round to inject randomness, in the spirit of bagging [7].

Three modern open-source libraries dominate the boosted-trees landscape:

| Library | First release | Origin | Distinctive ideas |
|---|---|---|---|
| [XGBoost](/wiki/xgboost) | 2014 | Tianqi Chen, University of Washington | Second-order (gradient and Hessian) split formula, regularization in the objective, sparsity-aware split finding, exact and approximate (quantile- and histogram-based) split finding, GPU support [10] |
| [LightGBM](/wiki/lightgbm) | 2016 | Microsoft Research | Histogram-based splits, leaf-wise tree growth instead of level-wise, gradient-based one-side sampling (GOSS), exclusive feature bundling (EFB), native categorical splits [11] |
| [CatBoost](/wiki/catboost) | 2017 | Yandex | Ordered boosting to fight target leakage in categorical encodings, ordered target statistics for categorical features, symmetric (oblivious) trees, strong defaults [12] |

All three reduce to the same mathematical core (gradient boosting of regression trees) but differ in growth strategy, categorical handling, and engineering. Benchmarks across academic papers and Kaggle competitions show they trade the lead frequently and are usually within a few tenths of a percent of one another on a given dataset.

Gradient boosted trees are also written as [gradient boosted (decision) trees](/wiki/gradient_boosted_decision_trees_gbt) and abbreviated GBT. The method as a whole is known as GBDT (gradient boosted decision trees) or GBM (gradient boosting machines).

## how do bagging and boosting differ?

The two dominant forest paradigms differ in nearly every dimension. The contrast is the easiest way to remember when to reach for which.

| Dimension | Bagging (random forest, Extra-Trees) | Boosting (gradient boosted trees) |
|---|---|---|
| Tree training order | Parallel and independent | Sequential, each tree depends on the previous |
| Sampling | [Bootstrap](/wiki/sampling_with_replacement) of rows + random subspace of features | Often subsamples rows and columns per tree, but no bootstrap |
| Tree depth | Deep, fully grown trees | Shallow, typically 4 to 8 levels |
| Aggregation | Equal-weight average or vote | Weighted sum with [shrinkage](/wiki/shrinkage) |
| What it reduces | Variance of low-bias deep trees | Bias of weak shallow learners |
| Out-of-the-box accuracy | Strong with defaults | Strong with defaults but more sensitive to learning rate and tree count |
| Tuning cost | Low | Moderate to high |
| Risk profile | Hard to overfit by adding more trees | Can overfit if too many rounds without early stopping |
| Parallelism | Embarrassingly parallel across trees | Parallel within a tree (per-feature histograms) but sequential across trees |

A common rule of thumb on tabular [supervised learning](/wiki/supervised_learning): start with a random forest at default settings to establish a quick baseline, then run a tuned [gradient boosted trees](/wiki/gradient_boosted_decision_trees_gbt) model with early stopping for the production result. The gap is typically one to three percent of accuracy in favor of boosting once tuned, sometimes more on noisy data, and occasionally negligible on small or very clean problems.

## which hyperparameters matter most?

Decision forests have a small but high-leverage set of [hyperparameters](/wiki/hyperparameter). The names below are the conventional ones; each library uses slight variants.

| Hyperparameter | Typical name | Effect | Sensible starting range |
|---|---|---|---|
| Number of trees | `n_estimators`, `num_trees`, `num_boost_round` | More trees reduce variance for bagging and bias for boosting until diminishing returns | 100 to 2,000 for boosting, 100 to 1,000 for bagging |
| Maximum depth | `max_depth` | Deeper trees fit more interactions but may overfit | 6 to 16 for bagging, 4 to 8 for boosting |
| Minimum samples per split or leaf | `min_samples_split`, `min_samples_leaf` (scikit-learn), `min_data_in_leaf` (LightGBM) | Larger values prevent splits on noise | 2 to 50 |
| Maximum features per split or tree | `max_features` (scikit-learn, sampled at each split), `colsample_bytree` (XGBoost / LightGBM, sampled once per tree) | Controls [attribute sampling](/wiki/attribute_sampling); $\sqrt{p}$ for classification and $p/3$ for regression are common defaults in random forests | $\sqrt{p}$ to $p$ |
| Learning rate | `learning_rate`, `eta`, `shrinkage` | Smaller is more accurate but slower to converge; only used in boosting | 0.01 to 0.3 |
| L1 / L2 regularization | `reg_alpha`, `reg_lambda` | Penalizes leaf weights to reduce overfitting | 0 to a few |
| Row subsample | `subsample`, `bagging_fraction` | Fraction of rows sampled per tree | 0.5 to 1.0 |
| Minimum child weight | `min_child_weight` | Minimum sum of Hessian per leaf in XGBoost-style boosting | 1 to 100 |

For boosting, the standard tuning workflow is to fix a small learning rate (for example 0.05), set the number of rounds high (for example 5,000), and use early stopping on a held-out validation set to pick the best round automatically. This trades extra training compute for an essentially tuning-free choice of `num_trees`.
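
The early-stopping rule in this workflow amounts to scanning the per-round validation losses and stopping once no improvement has been seen for a fixed number of rounds. The sketch below is library-independent; the function `best_round`, the `patience` parameter, and the loss values are hypothetical, and real libraries expose their own early-stopping options.

```python
def best_round(validation_losses, patience=3):
    """Return (number of rounds to keep, best loss) under a patience rule.

    Scanning stops once `patience` consecutive rounds fail to improve on the
    best validation loss seen so far.
    """
    best_index, best_loss = 0, float("inf")
    for i, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_index, best_loss = i, loss
        elif i - best_index >= patience:
            break
    return best_index + 1, best_loss  # rounds are counted from 1


losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.40]

print(best_round(losses, patience=3))   # (4, 0.55): stops before seeing 0.40
print(best_round(losses, patience=10))  # (8, 0.4): never triggers, scans everything
```

The example also shows the trade-off in choosing `patience`: a small value saves compute but can stop before a late improvement.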

## feature importances: MDI vs permutation

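Decision forests support several complementary measures of [feature importances](/wiki/feature_importances). MDI and permutation importance are the two classic ones, and SHAP values are a widely used third option: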

| Measure | Definition | Strengths | Weaknesses |
|---|---|---|---|
| Mean decrease in impurity (MDI) | Sum, across all trees and all splits, of the impurity reduction credited to a feature | Cheap; computed as a byproduct of training | Biased toward high-cardinality and continuous features; computed on training data so can reward overfitting |
| Permutation importance | Drop in a chosen metric (accuracy, AUC, MSE) when a feature's values are randomly shuffled in a held-out set | Works with any model and any metric; uses fresh data, so it estimates true predictive contribution | More expensive; can underestimate importance when features are correlated |
| SHAP values | Shapley value attributions to each feature for each prediction | Local and global, additive, model-agnostic, with fast tree-specific implementation | Mathematically richer but slower; needs a baseline distribution choice |

The [permutation variable importances](/wiki/permutation_variable_importances) method was popularized by the original random forest paper and is the default recommendation in modern [scikit-learn](/wiki/scikit-learn) tutorials [2][20]. SHAP values, introduced by Lundberg and Lee (2017), have become the standard for explaining individual boosted-tree predictions in finance and healthcare [15].
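
Permutation importance as defined in the table above needs only a fitted model, a held-out set, and a metric. The sketch below is a minimal illustration: the helper names, the toy model that looks only at feature 0, and the synthetic held-out rows are all assumptions made for the example.

```python
import random


def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, feature, n_repeats=20, seed=0):
    """Mean drop in accuracy when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        # Rebuild each row with the shuffled value in place of the original.
        shuffled = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(model, shuffled, labels))
    return sum(drops) / n_repeats


# Toy "fitted model" that uses feature 0 and ignores feature 1.
model = lambda r: int(r[0] > 0.5)

# Synthetic held-out data: the label equals feature 0, feature 1 is unrelated.
rows = [[i % 2, (i * 7) % 5] for i in range(40)]
labels = [r[0] for r in rows]

print(permutation_importance(model, rows, labels, feature=0))  # large drop
print(permutation_importance(model, rows, labels, feature=1))  # 0.0, feature unused
```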

## why do tree-based models still beat deep learning on tabular data?

A recurring finding in the empirical literature is that, despite a decade of effort to build neural networks for tabular data, tree-based ensembles continue to win on most heterogeneous tabular benchmarks.

| Study | Year | Conclusion |
|---|---|---|
| Olson et al., "Data-driven advice for applying machine learning to bioinformatics problems" | 2018 | Gradient boosting and random forest rank highest among the 13 scikit-learn classifiers compared on 165 PMLB classification datasets [19] |
| Shwartz-Ziv and Armon, "Tabular data: Deep learning is not all you need" | 2021 | Across 11 datasets, [XGBoost](/wiki/xgboost) outperforms deep tabular models on most tasks; ensembles of XGBoost and deep models do best [16] |
| Borisov et al., "Deep neural networks and tabular data: A survey" | 2022 | Neural tabular models lag GBDT in average accuracy and require far more tuning [17] |
| Grinsztajn, Oyallon, Varoquaux, "Why do tree-based models still outperform deep learning on tabular data?" (NeurIPS 2022) | 2022 | On 45 medium-sized tabular datasets, tree-based models beat deep learning even after extensive tuning. The paper attributes the gap to neural-network rotational invariance, sensitivity to uninformative features, and difficulty learning irregular target functions [18] |

Grinsztajn, Oyallon, and Varoquaux summarize the pattern bluntly: tree-based models "remain state-of-the-art on medium-sized data (~10K samples) even without accounting for their superior speed" [18]. Shwartz-Ziv and Armon reach a similar verdict across 11 datasets, reporting that XGBoost "outperforms these deep models" and "requires much less tuning", while an ensemble of XGBoost and deep models performs better than either component alone [16].

The practical conclusions from this body of work are direct: for tabular [supervised learning](/wiki/supervised_learning) under five million rows or so, a well-tuned [gradient boosted trees](/wiki/gradient_boosted_decision_trees_gbt) model is the right default, and the burden of proof rests on whoever proposes a deep network. Deep learning is the right tool when raw inputs are sequences, images, audio, or graphs whose structure benefits from learned representations.

## which libraries implement decision forests?

A wide range of libraries implement decision forests. The most widely used today are:

| Library | Language / runtime | Notes |
|---|---|---|
| [scikit-learn](/wiki/scikit-learn) | Python (Cython core) | `DecisionTreeClassifier`, `RandomForestClassifier`, `ExtraTreesClassifier`, `GradientBoostingClassifier`, `HistGradientBoostingClassifier`. The HistGradientBoosting estimators (added in 0.21) are competitive with XGBoost and LightGBM on accuracy and speed for medium datasets [20] |
| [XGBoost](/wiki/xgboost) | C++ core with Python, R, JVM, Julia bindings | The original modern boosted-trees library; widely deployed in production and on Kaggle |
| [LightGBM](/wiki/lightgbm) | C++ core with Python, R, C# bindings | Histogram-based with leaf-wise growth; usually the fastest of the big three on large datasets |
| [CatBoost](/wiki/catboost) | C++ core with Python, R, C# bindings, plus standalone CLI | Strong out-of-the-box defaults, especially with raw categorical features |
| [TensorFlow Decision Forests (TF-DF)](/wiki/tensorflow_decision_forests) | Python on top of the YDF C++ library | Brings random forests, Extra-Trees, and gradient boosted trees into the TensorFlow ecosystem; supports Keras pipelines and serving via TensorFlow Serving |
| YDF (Yggdrasil Decision Forests) | C++ and Python | Google's standalone successor to TF-DF; emphasizes reproducibility and serving size |
| H2O | Java with Python, R, REST APIs | Distributed random forest and GBM; AutoML pipelines on Hadoop and Spark clusters |
| Spark MLlib | Scala on Apache Spark | Distributed random forest and gradient boosted trees implementations for very large datasets |
| Weka | Java | Classical research and teaching package; includes J48 (a C4.5 reimplementation) and many forest variants |
| Ranger | C++ with R and Python bindings | High-performance random forest and survival forest implementation popular in statistics and biomedical research |

For most production tabular work in Python, the practical choice is between [scikit-learn](/wiki/scikit-learn) `HistGradientBoostingClassifier`, [XGBoost](/wiki/xgboost), [LightGBM](/wiki/lightgbm), and [CatBoost](/wiki/catboost). All four interoperate with the standard pandas / NumPy stack and produce models that can be exported to ONNX or to native serving formats.

## what are decision forests used for?

Decision forests dominate four broad application areas in production machine learning.

### tabular data and structured prediction

Most business problems present as a table of mixed numeric and categorical features. Credit scoring, churn prediction, lead scoring, customer lifetime value estimation, demand forecasting, and price optimization all have this shape. Decision forests are typically the strongest single-model baseline and frequently the production model.

### finance and risk

Banks and insurers use [gradient boosted trees](/wiki/gradient_boosted_decision_trees_gbt) for credit-default prediction, transaction fraud detection, anti-money-laundering alert ranking, and pricing of insurance policies. Their interpretability via [feature importances](/wiki/feature_importances) and SHAP values, combined with regulatory acceptance under model-risk frameworks, makes them an easier sell to compliance teams than opaque deep networks.

### ranking and recommendation

Learning-to-rank with gradient boosted trees, especially LambdaMART (Burges, 2010), powered Bing and Yahoo web search relevance for years and remains a workhorse for product search, ads ranking, and recommendation re-ranking [14]. LightGBM and XGBoost both ship LambdaMART-style ranking objectives (`lambdarank` in LightGBM and `rank:ndcg` in XGBoost). Ad click-through-rate prediction also relies heavily on boosted trees, sometimes in hybrid stacks with a wide-and-deep neural component.

### isolation forests for anomaly detection

Isolation forests, introduced by Liu, Ting, and Zhou (2008), use random axis-aligned splits to isolate individual points and score anomalies by the depth required to isolate them: points that are separated after only a few splits are scored as more anomalous [13]. They are an unsupervised cousin of [random forests](/wiki/random_forest) and are widely used in fraud and intrusion detection.
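
The isolation idea can be sketched without any library. The code below is a simplified illustration, not the published algorithm: it isolates a query point directly rather than building trees on subsamples, it reports a raw mean depth instead of a normalized anomaly score, and the names `isolation_depth` and `mean_depth` and the toy data are hypothetical.

```python
import random


def isolation_depth(point, data, rng, depth=0, max_depth=20):
    """Number of random axis-aligned splits needed to separate `point` from `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    f = rng.randrange(len(point))          # pick a random feature
    lo = min(r[f] for r in data)
    hi = max(r[f] for r in data)
    if lo == hi:
        return depth                       # simplification: stop on a constant feature
    threshold = rng.uniform(lo, hi)        # pick a random cut point
    # Keep only the rows that fall on the same side of the cut as the point.
    same_side = [r for r in data if (r[f] < threshold) == (point[f] < threshold)]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)


def mean_depth(point, data, n_trees=200, seed=0):
    """Average isolation depth over many random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees


# A tight 5 x 5 grid of normal points plus one far-away outlier.
grid = [[x * 0.1, y * 0.1] for x in range(5) for y in range(5)]
outlier = [10.0, 10.0]
data = grid + [outlier]
inlier = grid[12]  # the centre of the grid

print(mean_depth(outlier, data))  # isolated after very few splits
print(mean_depth(inlier, data))   # needs more splits to separate from its neighbours
```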

## index of decision-forest term wiki pages

The pages below cover individual concepts in this glossary in depth.

- [attribute sampling](/wiki/attribute_sampling)
- [axis-aligned condition](/wiki/axis-aligned_condition)
- [bagging](/wiki/bagging)
- [binary condition](/wiki/binary_condition)
- [condition](/wiki/condition)
- [decision forest](/wiki/decision_forest)
- [decision tree](/wiki/decision_tree)
- [entropy](/wiki/entropy)
- [feature importances](/wiki/feature_importances)
- [gini impurity](/wiki/gini_impurity)
- [gradient boosting](/wiki/gradient_boosting)
- [gradient boosted (decision) trees (GBT)](/wiki/gradient_boosted_decision_trees_gbt)
- [inference path](/wiki/inference_path)
- [information gain](/wiki/information_gain)
- [in-set condition](/wiki/in-set_condition)
- [leaf](/wiki/leaf)
- [node (decision tree)](/wiki/node_decision_tree)
- [non-binary condition](/wiki/non-binary_condition)
- [oblique condition](/wiki/oblique_condition)
- [out-of-bag evaluation (OOB evaluation)](/wiki/out-of-bag_evaluation_oob_evaluation)
- [permutation variable importances](/wiki/permutation_variable_importances)
- [random forest](/wiki/random_forest)
- [root](/wiki/root)
- [sampling with replacement](/wiki/sampling_with_replacement)
- [shrinkage](/wiki/shrinkage)
- [split](/wiki/split)
- [splitter](/wiki/splitter)
- [test](/wiki/test)
- [threshold (for decision trees)](/wiki/threshold_for_decision_trees)
- [variable importances](/wiki/variable_importances)
- [wisdom of the crowd](/wiki/wisdom_of_the_crowd)

## related glossary hubs

- [Machine learning terms](/wiki/machine_learning_terms)
- [Machine learning terms / Fundamentals](/wiki/machine_learning_terms_fundamentals)
- [Machine learning terms / Natural Language Processing](/wiki/machine_learning_terms_natural_language_processing)
- [Machine learning terms / Reinforcement Learning](/wiki/machine_learning_terms_reinforcement_learning)
- [Machine learning terms / Computer Vision](/wiki/machine_learning_terms_computer_vision)
- [Machine learning terms / Fairness](/wiki/machine_learning_terms_fairness)
- [Machine learning terms / TensorFlow](/wiki/machine_learning_terms_tensorflow)
- [Machine learning terms / Google Cloud](/wiki/machine_learning_terms_google_cloud)

## references

1. Breiman, L. (1996). "Bagging predictors". *Machine Learning*, 24(2), 123 to 140.
2. Breiman, L. (2001). "Random Forests". *Machine Learning*, 45(1), 5 to 32.
3. Breiman, L., Friedman, J., Olshen, R., Stone, C. (1984). *Classification and Regression Trees*. Wadsworth.
4. Quinlan, J. R. (1986). "Induction of decision trees". *Machine Learning*, 1(1), 81 to 106.
5. Quinlan, J. R. (1993). *C4.5: Programs for Machine Learning*. Morgan Kaufmann.
6. Friedman, J. H. (2001). "Greedy function approximation: A gradient boosting machine". *Annals of Statistics*, 29(5), 1189 to 1232.
7. Friedman, J. H. (2002). "Stochastic gradient boosting". *Computational Statistics and Data Analysis*, 38(4), 367 to 378.
8. Freund, Y., Schapire, R. E. (1995). "A decision-theoretic generalization of on-line learning and an application to boosting". In *European Conference on Computational Learning Theory*.
9. Geurts, P., Ernst, D., Wehenkel, L. (2006). "Extremely randomized trees". *Machine Learning*, 63(1), 3 to 42.
10. Chen, T., Guestrin, C. (2016). "XGBoost: A scalable tree boosting system". *Proceedings of KDD 2016*.
11. Ke, G., Meng, Q., Finley, T., et al. (2017). "LightGBM: A highly efficient gradient boosting decision tree". *NeurIPS 2017*.
12. Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., Gulin, A. (2018). "CatBoost: unbiased boosting with categorical features". *NeurIPS 2018*.
13. Liu, F. T., Ting, K. M., Zhou, Z.-H. (2008). "Isolation forest". *ICDM 2008*.
14. Burges, C. J. C. (2010). "From RankNet to LambdaRank to LambdaMART: An overview". *Microsoft Research Technical Report MSR-TR-2010-82*.
15. Lundberg, S. M., Lee, S.-I. (2017). "A unified approach to interpreting model predictions". *NeurIPS 2017*.
16. Shwartz-Ziv, R., Armon, A. (2021). "Tabular data: Deep learning is not all you need". arXiv:2106.03253. Published in *Information Fusion*, 81, 84 to 90 (2022).
17. Borisov, V., Leemann, T., Sessler, K., et al. (2022). "Deep neural networks and tabular data: A survey". *IEEE Transactions on Neural Networks and Learning Systems*.
18. Grinsztajn, L., Oyallon, E., Varoquaux, G. (2022). "Why do tree-based models still outperform deep learning on tabular data?". *NeurIPS 2022 Datasets and Benchmarks*. arXiv:2207.08815.
19. Olson, R. S., La Cava, W. G., Mustahsan, Z., Varik, A., Moore, J. H. (2018). "Data-driven advice for applying machine learning to bioinformatics problems". *Pacific Symposium on Biocomputing*, 23, 192 to 203. arXiv:1708.05070.
20. scikit-learn developers. "Ensemble methods" and "Permutation feature importance". scikit-learn User Guide. https://scikit-learn.org/stable/modules/ensemble.html