# Attribute sampling

> Source: https://aiwiki.ai/wiki/attribute_sampling
> Updated: 2026-05-11
> Categories: Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

Attribute sampling, also called feature sampling, feature bagging, attribute bagging, or column subsampling, is a randomization technique used by tree ensemble methods such as [random forest](/wiki/random_forest), [extra trees](/wiki/extra_trees), and [gradient boosting](/wiki/gradient_boosting) machines. Instead of letting every split in a [decision tree](/wiki/decision_tree) choose from the full set of input variables, the algorithm restricts each split, each tree level, or each tree to a small random subset of the available features. The split is then chosen only among that subset. The technique is the core mechanism that turns bagged trees into random forests and is a standard regularizer in XGBoost, LightGBM, and CatBoost.

## History and origin

Attribute sampling has several roots in the late 1990s. Yali Amit and Donald Geman, working on handwritten digit recognition, proposed in 1997 to grow randomized trees that consider only a small random sample of binary queries at each node, because the family of possible queries was effectively infinite and a full search was not feasible. Their paper, *Shape Quantization and Recognition with Randomized Trees* (Neural Computation 9), is one of the earliest descriptions of node-level feature randomization in decision trees.

Tin Kam Ho introduced the closely related random subspace method in 1998 in *The Random Subspace Method for Constructing Decision Forests* (IEEE Transactions on Pattern Analysis and Machine Intelligence 20). In Ho's setting each tree in the ensemble is trained on the full training set but using a randomly chosen subspace of the input features, fixed for the life of that tree. The method was designed to deal with high dimensional pattern recognition problems such as optical character recognition and satellite image classification, where exhaustive search over all features per tree was both expensive and prone to overfitting.

Leo Breiman's 2001 paper *Random Forests* (Machine Learning 45) combined bootstrap aggregation of training instances with per split random feature selection. Breiman gave the variant the name Forest-RI, where RI stands for random input selection. At each node of each tree, F input variables are drawn at random from the M available variables and the best split is chosen only among those F. The forest grows unpruned trees. Breiman's experiments tested values such as F = 1 and F = int(log2(M) + 1), and he reported that even F = 1 produced a useful forest as long as the trees were diverse enough. This is the formulation used by most modern random forest implementations.

## How attribute sampling works

A standard decision tree splitter examines every input variable at every node, picks the variable and threshold that most improves the chosen criterion (Gini impurity, entropy, mean squared error, or similar), and then recurses. Attribute sampling intercepts that loop in one of three places:

* Per tree (tree level subsampling). Before a tree is built, the algorithm draws a random subset of columns, and that tree only ever sees those columns. This matches Ho's original random subspace method and corresponds to XGBoost's `colsample_bytree` and LightGBM's `feature_fraction`.
* Per level. A fresh random subset is drawn each time the tree grows a new depth level. Every node at that level shares the same candidate columns. XGBoost exposes this as `colsample_bylevel`.
* Per node or per split. A fresh random subset is drawn for every node, so two sibling nodes can be split on completely different columns. This is the Breiman Forest-RI setting. In scikit-learn it is controlled by `max_features` on [RandomForestClassifier](/wiki/random_forest), [RandomForestRegressor](/wiki/random_forest), [ExtraTreesClassifier](/wiki/extra_trees), and `HistGradientBoostingRegressor`. In XGBoost it is `colsample_bynode`. In LightGBM it is `feature_fraction_bynode`.
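
The three draw points can be sketched in a few lines of Python. This is a minimal illustration, not any library's implementation: `draw_columns` and `candidate_columns` are hypothetical helpers, the tree is assumed to be a complete binary tree, and the subset size is assumed to be `max(1, int(fraction * n))`.

```python
import random


def draw_columns(columns, fraction, rng):
    """Return a sorted random subset of `columns` (assumed size rule: max(1, int(fraction * n)))."""
    k = max(1, int(fraction * len(columns)))
    return sorted(rng.sample(columns, k))


def candidate_columns(n_features, depth, by_tree=1.0, by_level=1.0, by_node=1.0, seed=0):
    """Simulate which columns each node of one complete binary tree may split on.

    Returns a dict mapping (level, node_index) to the list of candidate columns.
    """
    rng = random.Random(seed)
    all_columns = list(range(n_features))
    # Per tree: drawn once, before the tree is grown.
    tree_columns = draw_columns(all_columns, by_tree, rng)
    candidates = {}
    for level in range(depth):
        # Per level: redrawn for each depth level, shared by all nodes at that level.
        level_columns = draw_columns(tree_columns, by_level, rng)
        for node in range(2 ** level):
            # Per node: redrawn for every single split.
            candidates[(level, node)] = draw_columns(level_columns, by_node, rng)
    return candidates


per_tree = candidate_columns(16, depth=3, by_tree=0.25)
per_node = candidate_columns(16, depth=3, by_node=0.25)
print(per_tree[(0, 0)], per_tree[(2, 3)])  # same four columns at every node
print(per_node[(0, 0)], per_node[(2, 3)])  # an independent draw of four per node
```

With `by_tree=0.25` every node reports the same four columns, while with `by_node=0.25` each node receives its own draw of four.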

The three levels are not mutually exclusive. XGBoost applies them cumulatively: with 64 features and the configuration `colsample_bytree=0.5`, `colsample_bylevel=0.5`, `colsample_bynode=0.5`, the splitter sees 64 x 0.5 x 0.5 x 0.5 = 8 candidate features at each split. LightGBM combines `feature_fraction` and `feature_fraction_bynode` the same way: the effective fraction at each node is the product of the two.
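
The arithmetic above can be checked with a small helper. `effective_candidates` is a hypothetical name, and truncating to an integer at each stage is an assumption made for this sketch; the libraries may round differently.

```python
def effective_candidates(n_features, by_tree=1.0, by_level=1.0, by_node=1.0):
    """Approximate number of columns a split can choose from when the
    three fractions are applied one after another."""
    count = n_features
    for fraction in (by_tree, by_level, by_node):
        # Assumed rounding rule: truncate, but never go below one column.
        count = max(1, int(count * fraction))
    return count


print(effective_candidates(64, 0.5, 0.5, 0.5))  # 8
```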

## Typical defaults

The size of the random subset is the most sensitive hyperparameter in feature sampling. The conventions below are widely used.

| Library | Setting | Default | Common rule of thumb |
| --- | --- | --- | --- |
| scikit-learn `RandomForestClassifier` | `max_features` | `"sqrt"`, that is sqrt(p) | sqrt(p) for classification |
| scikit-learn `RandomForestRegressor` | `max_features` | `1.0`, equivalent to bagging | p/3 for regression (Breiman's original suggestion) |
| scikit-learn `ExtraTreesClassifier` | `max_features` | `"sqrt"` | sqrt(p) |
| R `randomForest` | `mtry` | sqrt(p) for classification, p/3 for regression | same |
| XGBoost | `colsample_bytree`, `_bylevel`, `_bynode` | 1.0 each | 0.5 to 0.9 per parameter, tuned by cross validation |
| LightGBM | `feature_fraction`, `feature_fraction_bynode` | 1.0 each | 0.6 to 0.9 |

The scikit-learn `RandomForestRegressor` is an interesting outlier: its default considers all p features at every split, which is equivalent to plain [bagging](/wiki/bagging) of trees. Version 1.1 changed the spelling of that default from `"auto"` to `1.0`; for the regressor both values mean all features. The scikit-learn documentation notes that p/3 was the original suggestion and that the full feature setting was later justified empirically, so feature sampling is left to the user. Practitioners often still set `max_features="sqrt"` or a float such as 0.3 by hand for regression, especially with high dimensional data.

## Why it works: strength and correlation

Breiman framed the analysis of random forests in terms of two quantities for the ensemble: the strength of the individual trees, written s, and the average correlation between predictions of different trees, written rho. He proved that the generalization error of a random forest is bounded by a quantity of the form rho times (1 minus s squared) over s squared. To make the bound small you want trees that are individually accurate (high s) and that disagree with each other (low rho).
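
As a rough numerical illustration, the bound can be evaluated for made-up values of rho and s. The function name and the numbers below are assumptions chosen for illustration, not results from Breiman's paper.

```python
def breiman_bound(rho, s):
    """Evaluate rho * (1 - s**2) / s**2, the form of the bound described above."""
    if s <= 0:
        raise ValueError("the bound is only meaningful for positive strength s")
    return rho * (1 - s ** 2) / s ** 2


# Hypothetical forests: strong but highly correlated trees,
# versus slightly weaker trees that are much less correlated.
print(round(breiman_bound(rho=0.6, s=0.8), 4))  # 0.3375
print(round(breiman_bound(rho=0.3, s=0.7), 4))  # 0.3122
```

In this toy comparison, halving the correlation more than pays for the drop in strength.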

Fully grown trees on a bootstrap sample already have decent individual accuracy, but they tend to be strongly correlated. If two or three features dominate the signal, every tree will pick those features for its top splits, so the trees will look similar and will make similar errors on test data. Averaging similar predictors does not buy you much variance reduction.

Attribute sampling deliberately weakens each tree (s drops a little) in order to make the trees more different from each other (rho drops a lot). Empirically, on most tabular problems, the net effect is a sizable reduction in ensemble variance and a clear gain in test accuracy compared to bagging the same trees without feature sampling. Breiman's original experiments showed that very aggressive sampling, even F = 1, can still produce useful forests as long as the ensemble is large.

More recent theoretical work, including the paper *Randomization Can Reduce Both Bias and Variance: A Case Study in Random Forests* by Liu and Mazumder (Journal of Machine Learning Research 26, 2025), argues that in some regimes attribute sampling can also reduce bias by visiting feature combinations that a greedy splitter would otherwise ignore. The classical reading (lower variance, slightly higher bias per tree, lower correlation across trees) is still a fair first approximation.

## Effect on bias and variance

The bias variance picture for a forest of T trees, each with prediction variance v and pairwise prediction correlation rho, gives an ensemble variance of roughly rho times v plus (1 minus rho) times v over T. As T grows the second term vanishes and the residual variance is rho times v. For a given per tree variance v, lowering rho is therefore the only way to keep cutting variance once the forest is already large. This is exactly the lever that attribute sampling pulls. Bagging alone reduces rho only modestly; feature sampling reduces it much more.
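
The formula is easy to tabulate. The sketch below assumes a unit per tree variance and two illustrative correlation values; `ensemble_variance` is a hypothetical helper name.

```python
def ensemble_variance(rho, v, n_trees):
    """Variance of the average of n_trees predictors, each with variance v
    and pairwise correlation rho: rho * v + (1 - rho) * v / n_trees."""
    return rho * v + (1 - rho) * v / n_trees


for rho in (0.5, 0.1):
    # As the number of trees grows, the values approach the floor rho * v.
    row = [round(ensemble_variance(rho, 1.0, t), 4) for t in (10, 100, 1000)]
    print(f"rho={rho}: {row}")
```

Adding trees moves each row towards its floor of rho times v, and only the lower correlation moves the floor itself.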

The trade off is that with very small candidate sets each tree becomes a weaker learner. If F is set to 1 with a few thousand features that are mostly noise, most splits will land on garbage and trees will be both weak and only weakly correlated, which can hurt overall accuracy. The sqrt(p) and p/3 rules are compromises that work well across a wide range of tabular problems but are not optimal in any strict sense. For very high dimensional sparse data such as text bag of words, higher fractions are often better. For very low dimensional data with five or ten variables, sampling at all may be counterproductive.
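
One way to see why very small candidate sets hurt on noisy data is to compute how often a candidate set contains any informative feature at all. The sketch below assumes that F features are drawn uniformly without replacement and that exactly k of the p features carry signal; the function name and the numbers are hypothetical.

```python
from math import comb, isqrt


def prob_informative(p, k, f):
    """Probability that f features drawn without replacement from p
    include at least one of the k informative ones."""
    if not (0 <= k <= p and 1 <= f <= p):
        raise ValueError("need 0 <= k <= p and 1 <= f <= p")
    # Complement of drawing all f features from the p - k noise features.
    return 1 - comb(p - k, f) / comb(p, f)


p, k = 2000, 20  # hypothetical: 20 informative features among 2000
for f in (1, isqrt(p), p // 3):
    print(f, round(prob_informative(p, k, f), 3))
```

With F = 1 only one split in a hundred even has a chance of using a useful feature under these assumptions, and the probability rises quickly as F grows.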

## Implementations

### scikit-learn

In `RandomForestClassifier`, `RandomForestRegressor`, `ExtraTreesClassifier`, and `ExtraTreesRegressor` the parameter is `max_features`. Accepted values are:

* `"sqrt"`: max_features = floor(sqrt(n_features))
* `"log2"`: max_features = floor(log2(n_features))
* `None` or `1.0`: max_features = n_features (no sampling)
* an int: that exact count is used
* a float in (0, 1]: max(1, int(max_features * n_features)) is used

The splitter will keep examining features until it finds at least one valid split, even if that means looking at more than `max_features` columns; the parameter is a soft target, not a hard cap.
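
The rules in the list above can be written as a small resolver. This is a sketch that mirrors the listed rules, not scikit-learn's own code; `resolve_max_features` is a hypothetical name, and the `max(1, ...)` guard on the string options is an assumption added so that the result is never zero.

```python
import math


def resolve_max_features(max_features, n_features):
    """Translate a max_features setting into a number of candidate columns."""
    if max_features is None:
        return n_features  # no sampling
    if isinstance(max_features, str):
        if max_features == "sqrt":
            return max(1, int(math.sqrt(n_features)))
        if max_features == "log2":
            return max(1, int(math.log2(n_features)))
        raise ValueError(f"unknown max_features option: {max_features!r}")
    if isinstance(max_features, int):
        return max_features  # an exact count
    # Otherwise treat the value as a float fraction in (0, 1].
    return max(1, int(max_features * n_features))


for setting in ("sqrt", "log2", None, 5, 0.5, 1.0):
    print(repr(setting), resolve_max_features(setting, n_features=64))
```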

For [gradient boosted trees](/wiki/gradient_boosting), `HistGradientBoostingRegressor` and `HistGradientBoostingClassifier` accept a `max_features` parameter that subsamples columns per split, added to scikit-learn in pull request 27139.

### XGBoost

XGBoost exposes three multiplicative knobs: `colsample_bytree`, `colsample_bylevel`, and `colsample_bynode`. All three default to 1.0 and accept any value in (0, 1]. `colsample_bynode` is not supported by the exact tree method, only by the histogram and approximate methods. The XGBoost documentation gives the worked example that with 64 columns and all three set to 0.5, each split sees 8 candidate columns.

### LightGBM

LightGBM uses `feature_fraction` (per tree) and `feature_fraction_bynode` (per split). The effective per node fraction is the product of the two. The `feature_fraction_seed` parameter pins the random selection for reproducibility.

### CatBoost and others

CatBoost supports `colsample_bylevel`, which subsamples columns at each level of each oblivious tree. R's `randomForest` package uses `mtry`, with defaults of floor(sqrt(p)) for classification and max(floor(p/3), 1) for regression. H2O's distributed random forest and gradient boosting models expose `col_sample_rate_per_tree`, `col_sample_rate`, and `col_sample_rate_change_per_level`.

## Practical guidance

The value of mtry or max_features is usually worth tuning. The 2019 review *Hyperparameters and Tuning Strategies for Random Forest* by Probst, Wright, and Boulesteix (WIREs Data Mining and Knowledge Discovery) lists mtry among the two most influential parameters of a random forest, alongside the minimum node size. A few rules of thumb from the empirical literature:

* Start with sqrt(p) for classification and p/3 for regression. These were Breiman's defaults and they remain reasonable starting points.
* Try a small grid around the default, for example {sqrt(p), 2 sqrt(p), p/3, p/2}. Differences between adjacent values are often small but consistent.
* With many noisy features, raise mtry so that each candidate set is likely to contain at least one informative feature. With many highly correlated features, lower it: correlated features make it easy for trees to end up similar, so more diversity is needed.
* For XGBoost and LightGBM, treat `colsample_bytree` (or `feature_fraction`) as a regularizer to combine with row subsampling. Common starting points are 0.7 to 0.9 with row subsampling near 0.8.
* Per node sampling (`colsample_bynode`, `feature_fraction_bynode`, scikit-learn's default split level behaviour) is more aggressive than per tree sampling and is closer in spirit to Breiman's original Forest-RI.
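
The small grid suggested in the list above can be generated as follows. `mtry_grid` is a hypothetical helper, and truncating to integers and clipping to the range 1 to p are assumptions made for this sketch.

```python
import math


def mtry_grid(p):
    """Candidate mtry values {sqrt(p), 2 sqrt(p), p/3, p/2} as sorted unique integers."""
    raw = [math.sqrt(p), 2 * math.sqrt(p), p / 3, p / 2]
    # Truncate each value, keep it within [1, p], and drop duplicates.
    return sorted({min(p, max(1, int(x))) for x in raw})


print(mtry_grid(100))  # [10, 20, 33, 50]
print(mtry_grid(9))    # [3, 4, 6]
```

Each value in the grid would then be scored by cross validation or out of bag error, with the rest of the forest configuration held fixed.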

## Related techniques

Attribute sampling sits next to several closely related ideas. Bootstrap aggregation (bagging) randomizes the rows; attribute sampling randomizes the columns; combining the two gives a [random forest](/wiki/random_forest). Extra trees go further by also randomizing the split threshold, not just the candidate features. The random subspace method, in Ho's original sense, is the special case of attribute sampling where the subset is fixed for the lifetime of a base learner and the same subset is used for every split in that learner. Feature selection methods such as recursive feature elimination or L1 regularized models pursue the same goal of dropping features, but deterministically rather than at random and once globally rather than at every node.

Attribute sampling in auditing or data quality work refers to drawing a random sample of records and checking the values of selected attributes. That usage shares a name with the machine learning sense but is otherwise unrelated; this article covers only the machine learning meaning.

## ELI5

Imagine a class of students all guessing the answer to the same hard question. If every student studies from the same notes, they will all be wrong in the same way, and asking the whole class will not help. So you give each student a different randomly chosen page of the textbook to study from. Now their guesses are different. When you average the guesses, the random errors cancel out and the average is closer to the right answer than any single student would be. Random forests do the same trick with trees: each tree only gets to look at a random subset of the data columns at each decision, so the trees end up disagreeing in useful ways, and the average prediction is more accurate than any single tree.

## References

1. Breiman, L. (2001). Random Forests. *Machine Learning*, 45(1), 5 to 32. https://link.springer.com/article/10.1023/A:1010933404324
2. Ho, T. K. (1998). The Random Subspace Method for Constructing Decision Forests. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 20(8), 832 to 844. https://ieeexplore.ieee.org/document/709601
3. Amit, Y., and Geman, D. (1997). Shape Quantization and Recognition with Randomized Trees. *Neural Computation*, 9(7), 1545 to 1588.
4. Probst, P., Wright, M. N., and Boulesteix, A. L. (2019). Hyperparameters and Tuning Strategies for Random Forest. *WIREs Data Mining and Knowledge Discovery*, 9(3). https://arxiv.org/abs/1804.03515
5. Liu, B., and Mazumder, R. (2025). Randomization Can Reduce Both Bias and Variance: A Case Study in Random Forests. *Journal of Machine Learning Research*, 26. https://jmlr.org/papers/volume26/24-0255/24-0255.pdf
6. scikit-learn developers. RandomForestClassifier and RandomForestRegressor documentation. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html and https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
7. XGBoost developers. XGBoost Parameters documentation. https://xgboost.readthedocs.io/en/stable/parameter.html
8. LightGBM developers. LightGBM Parameters documentation. https://lightgbm.readthedocs.io/en/latest/Parameters.html
9. Wikipedia. Random subspace method. https://en.wikipedia.org/wiki/Random_subspace_method
10. Wikipedia. Random forest. https://en.wikipedia.org/wiki/Random_forest
