# Attribute sampling

> Source: https://aiwiki.ai/wiki/attribute_sampling
> Updated: 2026-07-11
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

Attribute sampling is a randomization technique in which a [decision tree](/wiki/decision_tree) considers only a small, randomly drawn subset of the available input features when searching for the best split at each node, rather than evaluating all of them. It is the core mechanism that turns bagged trees into a [random forest](/wiki/random_forest): by forcing different trees (and different splits) to look at different features, it decorrelates the trees and lowers the variance of the averaged ensemble [1][6]. The technique is also known as feature sampling, feature bagging, attribute bagging, random feature selection, or column subsampling, and it is a standard regularizer in random forests, [extra trees](/wiki/extra_trees), and [gradient boosting](/wiki/gradient_boosting) libraries such as XGBoost, LightGBM, and CatBoost.

## What is attribute sampling?

Attribute sampling restricts the set of candidate features that a tree-based learner is allowed to split on. Instead of letting every split choose from the full set of input variables, the algorithm limits each split, each tree level, or each whole tree to a small random subset of the available features, and the split is then chosen only among that subset [1][6]. The number of features in the subset is a hyperparameter, conventionally written F or mtry, and choosing it well is one of the most consequential decisions in tuning a forest [4].

In Leo Breiman's terminology this column randomization is what separates a true random forest from plain [bagging](/wiki/bagging): bagging randomizes the rows (training instances), attribute sampling randomizes the columns (features), and combining the two produces a random forest [1][10]. The technique is unrelated to the identically named auditing or data-quality procedure (drawing a random sample of records and inspecting selected attributes); this article covers only the machine learning meaning.

## Where did attribute sampling come from?

Attribute sampling has several roots in the late 1990s. Yali Amit and Donald Geman, working on handwritten digit recognition, proposed in 1997 to grow randomized trees that consider only a small random sample of binary queries at each node, because the family of possible queries was effectively infinite and a full search was not feasible [3]. Their paper, *Shape Quantization and Recognition with Randomized Trees* (Neural Computation 9), is one of the earliest descriptions of node-level feature randomization in decision trees.

Tin Kam Ho introduced the closely related random subspace method in 1998 in *The Random Subspace Method for Constructing Decision Forests* (IEEE Transactions on Pattern Analysis and Machine Intelligence 20) [2]. In Ho's setting each tree in the ensemble is trained on the full training set but using a randomly chosen subspace of the input features, fixed for the life of that tree. The method was designed to deal with high dimensional pattern recognition problems such as optical character recognition and satellite image classification, where exhaustive search over all features per tree was both expensive and prone to overfitting.

Leo Breiman's 2001 paper *Random Forests* (Machine Learning 45) combined bootstrap aggregation of training instances with per split random feature selection [1]. Breiman gave the variant the name Forest-RI, where RI stands for random input selection. At each node of each tree, F input variables are drawn at random from the M available variables and the best split is chosen only among those F. The forest grows unpruned trees. Breiman's experiments tested values such as $$F = 1$$ and $$F = \mathrm{int}(\log_2(M) + 1)$$, and he reported that even F = 1 produced a useful forest as long as the trees were diverse enough [1]. In one reported run on a dataset with around 1,000 features, the average single-tree error rate fell from about 80 percent at F = 1 to about 65 percent at F = 10 and about 60 percent at F = 25, illustrating how raising F strengthens individual trees while reducing their diversity [1]. The per-node formulation is the one used by most modern random forest implementations.

## How does attribute sampling work?

A standard decision tree splitter examines every input variable at every node, picks the variable and threshold that most improves the chosen criterion (Gini impurity, entropy, mean squared error, or similar), and then recurses. Attribute sampling intercepts that loop in one of three places:

* Per tree (tree level subsampling). Before a tree is built the algorithm draws a random subset of columns and that tree only ever sees those columns. This matches Ho's original random subspace method [2] and corresponds to XGBoost's `colsample_bytree` [7] and LightGBM's `feature_fraction` [8].
* Per level. A fresh random subset is drawn each time the tree grows a new depth level. Every node at that level shares the same candidate columns. XGBoost exposes this as `colsample_bylevel`, sampled from the columns already chosen for the current tree [7].
* Per node or per split. A fresh random subset is drawn for every node, so two sibling nodes can be split on completely different columns. This is the Breiman Forest-RI setting [1]. In scikit-learn it is controlled by `max_features` on [RandomForestClassifier](/wiki/random_forest), [RandomForestRegressor](/wiki/random_forest), [ExtraTreesClassifier](/wiki/extra_trees), and `HistGradientBoostingRegressor` [6]. In XGBoost it is `colsample_bynode` [7]. In LightGBM it is `feature_fraction_bynode` [8].
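The per-node case can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not any library's implementation: the helper names `gini`, `best_split`, and `sampled_split` are hypothetical, Gini impurity is assumed as the criterion, and the split search is a brute-force scan over midpoints between observed values.

```python
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(X, y, candidate_features):
    """Best (feature, threshold, weighted_gini) among candidate_features, or None."""
    best = None
    n = len(y)
    for j in candidate_features:
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y[i] for i in range(n) if X[i][j] <= t]
            right = [y[i] for i in range(n) if X[i][j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, t, score)
    return best

def sampled_split(X, y, max_features, rng):
    """Per-node attribute sampling: draw max_features columns, search only those."""
    n_features = len(X[0])
    candidates = rng.sample(range(n_features), max_features)
    return candidates, best_split(X, y, candidates)

# Toy data: column 0 separates the classes perfectly, columns 1 and 2 do not.
X = [[0.1, 5.0, 1.0], [0.2, 3.0, 0.0], [0.8, 4.0, 1.0], [0.9, 2.0, 0.0]]
y = [0, 0, 1, 1]
rng = random.Random(0)
for node in range(3):  # stand-ins for three different nodes of one tree
    candidates, split = sampled_split(X, y, max_features=2, rng=rng)
    print(node, sorted(candidates), split)
```

Whenever column 0 is left out of the draw, the node has to settle for a weaker split on another column. That is the mechanism by which sampling makes trees differ from one another. Drawing the candidates once per tree or once per level instead would only move the `rng.sample` call further out.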

The three levels are not mutually exclusive. XGBoost applies them cumulatively: with 64 features and the configuration `colsample_bytree=0.5`, `colsample_bylevel=0.5`, `colsample_bynode=0.5`, the splitter sees $$64 \times 0.5 \times 0.5 \times 0.5 = 8$$ candidate features at each split [7]. LightGBM combines `feature_fraction` and `feature_fraction_bynode` the same way: the effective fraction at each node is the product of the two [8].
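The cumulative arithmetic can be checked with a short sketch. The function name `nested_column_sample` is hypothetical, and the rounding rule (truncate, but keep at least one column) is an assumption made for illustration; the libraries' exact rounding may differ, although it does not matter for 64 columns and fractions of 0.5.

```python
import random

def nested_column_sample(n_features, by_tree, by_level, by_node, seed=0):
    """Draw nested column subsets: tree, then level within tree, then node within level."""
    rng = random.Random(seed)

    def draw(columns, fraction):
        # Assumed rounding rule: truncate, but never go below one column.
        k = max(1, int(len(columns) * fraction))
        return sorted(rng.sample(columns, k))

    tree_cols = draw(list(range(n_features)), by_tree)
    level_cols = draw(tree_cols, by_level)   # sampled from the tree's columns
    node_cols = draw(level_cols, by_node)    # sampled from the level's columns
    return tree_cols, level_cols, node_cols

tree_cols, level_cols, node_cols = nested_column_sample(64, 0.5, 0.5, 0.5)
print(len(tree_cols), len(level_cols), len(node_cols))
```

Each stage draws from the survivors of the previous stage, which is why the fractions multiply.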

## How many features are sampled at each node?

The size of the random subset is the most sensitive hyperparameter in feature sampling. The conventions below are widely used.

| Library | Setting | Default | Common rule of thumb |
| --- | --- | --- | --- |
| scikit-learn `RandomForestClassifier` | `max_features` | `"sqrt"`, that is $$\sqrt{p}$$ | $$\sqrt{p}$$ for classification |
| scikit-learn `RandomForestRegressor` | `max_features` | `1.0`, equivalent to bagging | $$p/3$$ for regression (Breiman's original suggestion) |
| scikit-learn `ExtraTreesClassifier` | `max_features` | `"sqrt"` | $$\sqrt{p}$$ |
| R `randomForest` | `mtry` | $$\sqrt{p}$$ for classification, $$p/3$$ for regression | same |
| XGBoost | `colsample_bytree`, `_bylevel`, `_bynode` | 1.0 each | 0.5 to 0.9 per parameter, tuned by cross validation |
| LightGBM | `feature_fraction`, `feature_fraction_bynode` | 1.0 each | 0.6 to 0.9 |

For classification the long-standing default is the square root of the feature count, $$\sqrt{p}$$; for regression Breiman's original suggestion was one third of the features, $$p/3$$ [1][6]. The scikit-learn `RandomForestRegressor` is an interesting outlier: by default it considers every feature at every split, which is equivalent to plain [bagging](/wiki/bagging) of trees. Version 1.1 changed only the spelling of that default, from `"auto"` (which for regressors had meant all features) to `1.0`, while `RandomForestClassifier` moved from `"auto"` to the equivalent `"sqrt"` and so kept feature sampling on by default [6]. The scikit-learn documentation describes the full feature setting as an empirically good default for regression and leaves feature sampling to the user [6]. Practitioners often still set `max_features="sqrt"` or a float such as 0.3 by hand for regression, especially with high dimensional data.

## Why does attribute sampling improve random forests?

Breiman framed the analysis of random forests in terms of two quantities for the ensemble: the strength of the individual trees, written s, and the average correlation between predictions of different trees, written rho [1]. He proved that the generalization error of a random forest is bounded above by $$\rho\,(1 - s^2)/s^2$$ [1]. To make the bound small you want trees that are individually accurate (high s) and that disagree with each other (low rho); as Breiman put it, the design goal is to reduce the correlation while holding the strength roughly constant.
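Plugging numbers into the bound shows the trade that it rewards. The sketch below is only an arithmetic illustration: the function name `breiman_bound` is hypothetical, and the two (rho, s) pairs are made-up values rather than measurements from any dataset.

```python
def breiman_bound(rho, s):
    """Upper bound rho * (1 - s^2) / s^2 on generalization error, for strength 0 < s <= 1."""
    if not 0 < s <= 1:
        raise ValueError("strength s must lie in (0, 1]")
    return rho * (1 - s ** 2) / s ** 2

# Hypothetical values: feature sampling costs a little strength
# but removes a lot of correlation.
bagging_like = breiman_bound(rho=0.6, s=0.60)
sampled = breiman_bound(rho=0.3, s=0.55)
print(bagging_like, sampled)
```

The bound is loose and can exceed 1, so it is best read as a guide to direction rather than as an error estimate.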

Fully grown trees on a bootstrap sample already have decent individual accuracy, but they tend to be strongly correlated. If two or three features dominate the signal, every tree will pick those features for its top splits, so the trees will look similar and will make similar errors on test data. Averaging similar predictors does not buy you much variance reduction.

Attribute sampling deliberately weakens each tree (s drops a little) in order to make the trees more different from each other (rho drops a lot). Empirically, on most tabular problems, the net effect is a sizable reduction in ensemble variance and a clear gain in test accuracy compared to bagging the same trees without feature sampling. Breiman's original experiments showed that very aggressive sampling, even F = 1, can still produce useful forests as long as the ensemble is large [1].

More recent theoretical work, including the 2025 paper *Randomization Can Reduce Both Bias and Variance: A Case Study in Random Forests* by Liu and Mazumder (Journal of Machine Learning Research 26), argues that in some regimes attribute sampling can also reduce bias by visiting feature combinations that a greedy splitter would otherwise ignore [5]. The classical reading (lower variance, slightly higher bias per tree, lower correlation across trees) is still a fair first approximation.

## How does attribute sampling affect bias and variance?

The bias-variance picture for a forest of T trees, each with prediction variance v and pairwise prediction correlation rho, gives an ensemble variance of roughly $$\rho v + (1 - \rho)\,v/T$$. As T grows the second term vanishes and the residual variance is $$\rho v$$. For a fixed per-tree variance, lowering rho is therefore the only way to keep cutting variance once the forest is already large. This is exactly the lever that attribute sampling pulls. Bagging alone reduces rho only modestly; feature sampling reduces it much more [1].
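The variance expression can be checked numerically. The sketch below evaluates the formula and compares it with a small Monte Carlo simulation in which "tree predictions" are modelled as equicorrelated Gaussians. That modelling choice and the function names `ensemble_variance` and `simulated_variance` are assumptions for illustration; real trees are not Gaussian.

```python
import random
import statistics

def ensemble_variance(rho, v, T):
    """Variance of the average of T predictors with variance v and pairwise correlation rho."""
    return rho * v + (1 - rho) * v / T

def simulated_variance(rho, v, T, trials=10000, seed=0):
    """Monte Carlo estimate using a shared component plus independent noise per tree."""
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        shared = rng.gauss(0, 1)  # component common to every tree
        preds = [
            (v ** 0.5) * ((rho ** 0.5) * shared + ((1 - rho) ** 0.5) * rng.gauss(0, 1))
            for _ in range(T)
        ]
        averages.append(sum(preds) / T)
    return statistics.pvariance(averages)

for T in (1, 10, 100, 1000):
    print(T, round(ensemble_variance(0.3, 1.0, T), 4))
print("simulated, T=10:", round(simulated_variance(0.3, 1.0, 10), 3))
```

As T grows the printed values flatten out near rho times v, so adding more trees stops helping while lowering rho still does.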

The trade-off is that with very small candidate sets each tree becomes a weaker learner. If F is set to 1 with a few thousand features that are mostly noise, most splits will land on uninformative features, so the trees will be only weakly correlated but also individually weak, which can hurt overall accuracy. The $$\sqrt{p}$$ and $$p/3$$ rules are compromises that work well across a wide range of tabular problems but are not optimal in any strict sense. For very high dimensional sparse data such as text bag of words, higher fractions are often better. For very low dimensional data with five or ten variables, sampling at all may be counterproductive.
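How often a split "lands on garbage" can be quantified with a standard hypergeometric calculation. The sketch assumes a simplified setting in which exactly k of the p features are informative and F candidates are drawn without replacement; the function name `prob_informative_candidate` and the example numbers are hypothetical.

```python
from math import comb

def prob_informative_candidate(p, k, F):
    """P(at least one of the k informative features is among F of p drawn without replacement)."""
    return 1 - comb(p - k, F) / comb(p, F)

# Hypothetical setting: 1,000 features, 10 of them informative.
for F in (1, 31, 333):
    print(F, round(prob_informative_candidate(1000, 10, F), 3))
```

With F = 1 the chance that a node even sees an informative feature equals k/p, which is why very small candidate sets struggle when most features are noise.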

## How is attribute sampling implemented in practice?

### scikit-learn

In `RandomForestClassifier`, `RandomForestRegressor`, `ExtraTreesClassifier`, and `ExtraTreesRegressor` the parameter is `max_features`. Accepted values are [6]:

* `"sqrt"`: max_features = floor(sqrt(n_features))
* `"log2"`: max_features = floor(log2(n_features))
* `None` or `1.0`: max_features = n_features (no sampling)
* an int: that exact count is used
* a float in (0, 1]: max(1, int(max_features * n_features)) is used
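The rules in this list can be mirrored in a small helper. This is a sketch of the documented behaviour, not scikit-learn's code: `resolve_max_features` is a hypothetical name, and clamping the `"sqrt"` and `"log2"` options to at least one feature is an assumption added for safety.

```python
import math

def resolve_max_features(max_features, n_features):
    """Translate a max_features setting into a concrete number of candidate features."""
    if max_features is None:
        return n_features                      # no sampling
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, bool):
        raise ValueError("max_features must not be a bool")
    if isinstance(max_features, int):
        return max_features                    # exact count
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))
    raise ValueError(f"unsupported max_features: {max_features!r}")

for setting in ("sqrt", "log2", None, 1.0, 5, 0.5):
    print(repr(setting), resolve_max_features(setting, 64))
```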

The `max_features` value is a soft target, not a hard cap. The scikit-learn documentation states: "the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than `max_features` features." [6]

For [gradient boosted trees](/wiki/gradient_boosting), `HistGradientBoostingRegressor` and `HistGradientBoostingClassifier` accept a `max_features` parameter that subsamples columns per split, added to scikit-learn in pull request 27139 [6].

### XGBoost

XGBoost exposes three multiplicative knobs: `colsample_bytree`, `colsample_bylevel`, and `colsample_bynode`. All three default to 1.0 and accept any value in (0, 1] [7]. `colsample_bynode` is not supported by the exact tree method, only by the histogram and approximate methods. The XGBoost documentation gives the worked example that with 64 columns and all three set to 0.5, each split sees 8 candidate columns [7].

### LightGBM

LightGBM uses `feature_fraction` (per tree) and `feature_fraction_bynode` (per split). The effective per node fraction is the product of the two [8]. The `feature_fraction_seed` parameter pins the random selection for reproducibility.

### CatBoost and others

CatBoost supports `colsample_bylevel`, which subsamples columns at each level of each oblivious tree. R's `randomForest` package uses `mtry`, with defaults of $$\sqrt{p}$$ for classification and $$p/3$$ for regression (both rounded down), in line with the conventions attributed to Breiman [1]. H2O's distributed random forest and gradient boosting models expose `col_sample_rate_per_tree`, `col_sample_rate`, and `col_sample_rate_change_per_level`.

## How should attribute sampling be tuned?

The value of mtry or max_features is usually worth tuning. The 2019 review *Hyperparameters and Tuning Strategies for Random Forest* by Probst, Wright, and Boulesteix (WIREs Data Mining and Knowledge Discovery) lists mtry among the two most influential parameters of a random forest, alongside the minimum node size [4]. A few rules of thumb from the empirical literature:

* Start with $$\sqrt{p}$$ for classification and $$p/3$$ for regression. These were Breiman's defaults and they remain reasonable starting points [1][4].
* Try a small grid around the default, for example {$$\sqrt{p}$$, 2 $$\sqrt{p}$$, $$p/3$$, $$p/2$$}. Differences between adjacent values are often small but consistent.
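* With many noisy features, raise mtry so that each candidate set is likely to contain at least one informative feature. With many highly correlated features, lower it: correlated features make it easy for different trees to choose near-identical splits, so more randomization is needed to keep the trees diverse.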
* For XGBoost and LightGBM, treat `colsample_bytree` (or `feature_fraction`) as a regularizer to combine with row subsampling. Common starting points are 0.7 to 0.9 with row subsampling near 0.8 [7][8].
* Per node sampling (`colsample_bynode`, `feature_fraction_bynode`, scikit-learn's default split level behaviour) is more aggressive than per tree sampling and is closer in spirit to Breiman's original Forest-RI [1].
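The small grid suggested in the list above could be generated as follows. The helper name `mtry_grid` is hypothetical, and rounding down and clamping to the range 1 to p are assumptions; the resulting candidates would then be compared by cross validation or out-of-bag error, which is not shown here.

```python
import math

def mtry_grid(p):
    """Candidate mtry values around the usual defaults for p features, deduplicated and sorted."""
    raw = [math.sqrt(p), 2 * math.sqrt(p), p / 3, p / 2]
    # Round down and clamp to [1, p]; the set removes duplicates when p is small.
    return sorted({min(p, max(1, int(x))) for x in raw})

for p in (9, 100, 1000):
    print(p, mtry_grid(p))
```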

## How does attribute sampling relate to other techniques?

Attribute sampling sits next to several closely related ideas. Bootstrap aggregation (bagging) randomizes the rows; attribute sampling randomizes the columns; combining the two gives a [random forest](/wiki/random_forest) [1][10]. Extra trees go further by also randomizing the split threshold, not just the candidate features. The random subspace method, in Ho's original sense, is the special case of attribute sampling where the subset is fixed for the lifetime of a base learner and the same subset is used for every split in that learner [2][9]. Feature selection methods such as recursive feature elimination or L1 regularized models pursue the same goal of dropping features, but deterministically rather than at random and once globally rather than at every node.

Attribute sampling in auditing or data quality work refers to drawing a random sample of records and checking the values of selected attributes. That usage shares a name with the machine learning sense but is otherwise unrelated.

## ELI5

Imagine a class of students all guessing the answer to the same hard question. If every student studies from the same notes, they will all be wrong in the same way, and asking the whole class will not help. So you give each student a different randomly chosen page of the textbook to study from. Now their guesses are different. When you average the guesses, the random errors cancel out and the average is closer to the right answer than any single student would be. Random forests do the same trick with trees: each tree only gets to look at a random subset of the data columns at each decision, so the trees end up disagreeing in useful ways, and the average prediction is more accurate than any single tree.

## References

1. Breiman, L. (2001). Random Forests. *Machine Learning*, 45(1), 5-32. https://link.springer.com/article/10.1023/A:1010933404324
2. Ho, T. K. (1998). The Random Subspace Method for Constructing Decision Forests. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 20(8), 832-844. https://ieeexplore.ieee.org/document/709601
3. Amit, Y., and Geman, D. (1997). Shape Quantization and Recognition with Randomized Trees. *Neural Computation*, 9(7), 1545-1588.
4. Probst, P., Wright, M. N., and Boulesteix, A. L. (2019). Hyperparameters and Tuning Strategies for Random Forest. *WIREs Data Mining and Knowledge Discovery*, 9(3). https://arxiv.org/abs/1804.03515
5. Liu, B., and Mazumder, R. (2025). Randomization Can Reduce Both Bias and Variance: A Case Study in Random Forests. *Journal of Machine Learning Research*, 26. https://jmlr.org/papers/volume26/24-0255/24-0255.pdf
6. scikit-learn developers. RandomForestClassifier and RandomForestRegressor documentation. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html and https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
7. XGBoost developers. XGBoost Parameters documentation. https://xgboost.readthedocs.io/en/stable/parameter.html
8. LightGBM developers. LightGBM Parameters documentation. https://lightgbm.readthedocs.io/en/latest/Parameters.html
9. Wikipedia. Random subspace method. https://en.wikipedia.org/wiki/Random_subspace_method
10. Wikipedia. Random forest. https://en.wikipedia.org/wiki/Random_forest