Attribute sampling
Last reviewed
May 11, 2026
Sources
10 citations
Review status
Source-backed
Revision
v2 · 2,248 words
See also: Machine learning terms
Attribute sampling, also called feature sampling, feature bagging, attribute bagging, or column subsampling, is a randomization technique used by tree ensemble methods such as random forest, extra trees, and gradient boosting machines. Instead of letting every split in a decision tree choose from the full set of input variables, the algorithm restricts each split, each tree level, or each tree to a small random subset of the available features. The split is then chosen only among that subset. The technique is the core mechanism that turns bagged trees into random forests and is a standard regularizer in XGBoost, LightGBM, and CatBoost.
Attribute sampling has several roots in the late 1990s. Yali Amit and Donald Geman, working on handwritten digit recognition, proposed in 1997 to grow randomized trees that consider only a small random sample of binary queries at each node, because the family of possible queries was effectively infinite and a full search was not feasible. Their paper, Shape Quantization and Recognition with Randomized Trees (Neural Computation 9), is one of the earliest descriptions of node-level feature randomization in decision trees.
Tin Kam Ho introduced the closely related random subspace method in 1998 in The Random Subspace Method for Constructing Decision Forests (IEEE Transactions on Pattern Analysis and Machine Intelligence 20). In Ho's setting each tree in the ensemble is trained on the full training set but using a randomly chosen subspace of the input features, fixed for the life of that tree. The method was designed to deal with high dimensional pattern recognition problems such as optical character recognition and satellite image classification, where exhaustive search over all features per tree was both expensive and prone to overfitting.
Leo Breiman's 2001 paper Random Forests (Machine Learning 45) combined bootstrap aggregation of training instances with per split random feature selection. Breiman gave the variant the name Forest-RI, where RI stands for random input selection. At each node of each tree, F input variables are drawn at random from the M available variables and the best split is chosen only among those F. The forest grows unpruned trees. Breiman's experiments tested values such as F = 1 and F = int(log2(M) + 1), and he reported that even F = 1 produced a useful forest as long as the trees were diverse enough. This is the formulation used by most modern random forest implementations.
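The per node draw in Forest-RI can be sketched in a few lines. This is only an illustration under assumptions: the function name forest_ri_candidates and the use of Python's random module are hypothetical choices, and the split search itself is left out. It shows how F = int(log2(M) + 1) candidate features would be drawn afresh at each node.

```python
import math
import random

def forest_ri_candidates(n_features, rng, n_candidates=None):
    """Draw the feature indices that one node is allowed to split on."""
    if n_candidates is None:
        # One of the settings Breiman tested: F = int(log2(M) + 1).
        n_candidates = int(math.log2(n_features) + 1)
    # Sample without replacement so the node sees F distinct features.
    return sorted(rng.sample(range(n_features), n_candidates))

rng = random.Random(0)
M = 64  # number of available input variables
for node in range(3):
    # Each node gets its own independent draw of F = 7 candidates.
    print(node, forest_ri_candidates(M, rng))
```

With M = 64 the default gives F = 7, so each node searches 7 of the 64 variables; passing n_candidates=1 reproduces the most aggressive F = 1 setting.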
A standard decision tree splitter examines every input variable at every node, picks the variable and threshold that most improves the chosen criterion (Gini impurity, entropy, mean squared error, or similar), and then recurses. Attribute sampling intercepts that loop in one of three places:
- Per tree: one feature subset is drawn when a tree is started and is reused for every split in that tree. This is XGBoost's colsample_bytree, LightGBM's feature_fraction, and CatBoost's colsample_bylevel when set on a per tree basis.
- Per level: a fresh subset is drawn for each depth level of a tree. This is XGBoost's colsample_bylevel.
- Per split: a fresh subset is drawn at every node. In scikit-learn this is max_features on RandomForestClassifier, RandomForestRegressor, ExtraTreesClassifier, and HistGradientBoostingRegressor. In XGBoost it is colsample_bynode. In LightGBM it is feature_fraction_bynode.

The three levels are not mutually exclusive. XGBoost applies them cumulatively: with 64 features and the configuration colsample_bytree=0.5, colsample_bylevel=0.5, colsample_bynode=0.5, the splitter sees 64 x 0.5 x 0.5 x 0.5 = 8 candidate features at each split. LightGBM combines feature_fraction and feature_fraction_bynode the same way: the effective fraction at each node is the product of the two.
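The cumulative behaviour can be illustrated with a small simulation. This is a sketch rather than any library's actual code: the helper sample_columns and its rounding rule (truncate, but keep at least one column) are assumptions made for the example.

```python
import random

def sample_columns(columns, fraction, rng):
    """Keep a random `fraction` of `columns`, but never fewer than one."""
    # Assumed rounding rule for this sketch: truncate, with a floor of 1.
    k = max(1, int(len(columns) * fraction))
    return rng.sample(columns, k)

rng = random.Random(42)
all_columns = list(range(64))

per_tree = sample_columns(all_columns, 0.5, rng)  # drawn once per tree
per_level = sample_columns(per_tree, 0.5, rng)    # drawn from the tree's columns
per_node = sample_columns(per_level, 0.5, rng)    # drawn from the level's columns

print(len(per_tree), len(per_level), len(per_node))  # 32 16 8
```

Each stage samples from the output of the previous one, which is why the fractions multiply and the final node sees 8 of the original 64 columns.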
The size of the random subset is the most sensitive hyperparameter in feature sampling. The conventions below are widely used.
| Library | Setting | Default | Common rule of thumb |
|---|---|---|---|
| scikit-learn RandomForestClassifier | max_features | "sqrt", that is sqrt(p) | sqrt(p) for classification |
| scikit-learn RandomForestRegressor | max_features | 1.0, equivalent to bagging | p/3 for regression (Breiman's original suggestion) |
| scikit-learn ExtraTreesClassifier | max_features | "sqrt" | sqrt(p) |
| R randomForest | mtry | sqrt(p) for classification, p/3 for regression | same |
| XGBoost | colsample_bytree, _bylevel, _bynode | 1.0 each | 0.5 to 0.9 per parameter, tuned by cross validation |
| LightGBM | feature_fraction, feature_fraction_bynode | 1.0 each | 0.6 to 0.9 |
The scikit-learn RandomForestRegressor is an interesting outlier: its default considers every feature at every split, which is equivalent to plain bagging of trees. In version 1.1 the spelling of that default changed from "auto" to 1.0. For regressors "auto" already meant all features, so the behaviour stayed the same; only for classifiers did "auto" mean sqrt(p). The scikit-learn documentation describes the full feature setting as an empirically good default for regression, which departs from the p/3 rule and leaves feature sampling to the user. Practitioners often still set max_features="sqrt" or a float such as 0.3 by hand for regression, especially with high dimensional data.
Breiman framed the analysis of random forests in terms of two quantities for the ensemble: the strength of the individual trees, written s, and the average correlation between predictions of different trees, written rho. He proved that the generalization error of a random forest is bounded by a quantity of the form rho times (1 minus s squared) over s squared. To make the bound small you want trees that are individually accurate (high s) and that disagree with each other (low rho).
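Written as a formula, the bound reads as follows. The symbol PE* for the generalization error is notation assumed here for illustration; rho and s are the correlation and strength defined above.

```latex
\[
  \mathrm{PE}^{*} \;\le\; \frac{\rho \,\bigl(1 - s^{2}\bigr)}{s^{2}}
\]
```

As a quick numerical reading, rho = 0.2 and s = 0.7 give 0.2 x 0.51 / 0.49, or about 0.21, while raising rho to 0.6 at the same strength gives about 0.62. The bound is loose in practice, so these numbers indicate direction rather than expected error.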
Fully grown trees on a bootstrap sample already have decent individual accuracy, but they tend to be strongly correlated. If two or three features dominate the signal, every tree will pick those features for its top splits, so the trees will look similar and will make similar errors on test data. Averaging similar predictors does not buy you much variance reduction.
Attribute sampling deliberately weakens each tree (s drops a little) in order to make the trees more different from each other (rho drops a lot). Empirically, on most tabular problems, the net effect is a sizable reduction in ensemble variance and a clear gain in test accuracy compared to bagging the same trees without feature sampling. Breiman's original experiments showed that very aggressive sampling, even F = 1, can still produce useful forests as long as the ensemble is large.
More recent theoretical work, including the 2024 paper Randomization Can Reduce Both Bias and Variance: A Case Study in Random Forests by Liu and Mazumder (Journal of Machine Learning Research 26), argues that in some regimes attribute sampling can also reduce bias by visiting feature combinations that a greedy splitter would otherwise ignore. The classical reading (lower variance, slightly higher bias per tree, lower correlation across trees) is still a fair first approximation.
The bias variance picture for a forest of T trees, each with prediction variance v and pairwise prediction correlation rho, gives an ensemble variance of roughly rho times v plus (1 minus rho) times v over T. As T grows the second term vanishes and the residual variance is rho times v. Lowering rho is therefore the only way to keep cutting variance once the forest is already large. This is exactly the lever that attribute sampling pulls. Bagging alone reduces rho only modestly; feature sampling reduces it much more.
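The variance formula above is easy to evaluate directly. The following sketch uses a hypothetical helper, ensemble_variance, and made-up values of rho and v purely to show how the floor at rho times v emerges as T grows.

```python
def ensemble_variance(rho, v, n_trees):
    """Variance of the average of n_trees predictors, each with variance v
    and pairwise correlation rho: rho*v + (1 - rho)*v / n_trees."""
    return rho * v + (1.0 - rho) * v / n_trees

v = 1.0  # illustrative per-tree prediction variance
for rho in (0.8, 0.3):
    for n_trees in (1, 10, 100, 1000):
        value = ensemble_variance(rho, v, n_trees)
        print(f"rho={rho} T={n_trees:>4} variance={value:.4f}")
```

With T = 1000 the ensemble variance is about 0.8002 for rho = 0.8 and about 0.3007 for rho = 0.3: adding more trees no longer helps, but lowering the correlation still does.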
The trade off is that with very small candidate sets each tree becomes a weaker learner. If F is set to 1 with a few thousand features that are mostly noise, most splits will land on garbage and trees will be both weak and only weakly correlated, which can hurt overall accuracy. The sqrt(p) and p/3 rules are compromises that work well across a wide range of tabular problems but are not optimal in any strict sense. For very high dimensional sparse data such as text bag of words, higher fractions are often better. For very low dimensional data with five or ten variables, sampling at all may be counterproductive.
In RandomForestClassifier, RandomForestRegressor, ExtraTreesClassifier, and ExtraTreesRegressor the parameter is max_features. Accepted values are:
- "sqrt": max_features = floor(sqrt(n_features))
- "log2": max_features = floor(log2(n_features))
- an int: that many features are considered at each split
- a float: a fraction of the features, max(1, int(max_features * n_features))
- None or 1.0: max_features = n_features (no sampling)

The splitter will keep examining features until it finds at least one valid split, even if that means looking at more than max_features columns; the parameter is a soft target, not a hard cap.
For gradient boosted trees, HistGradientBoostingRegressor and HistGradientBoostingClassifier accept a max_features parameter that subsamples columns per split, added to scikit-learn in pull request 27139.
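How a max_features setting turns into a column count can be sketched as below. The function resolve_max_features is a hypothetical stand-in, not scikit-learn's internal code; the integer and fractional branches follow the usual scikit-learn convention and are included here as an assumption.

```python
import math

def resolve_max_features(max_features, n_features):
    """Translate a max_features setting into a number of candidate columns."""
    if max_features is None:
        return n_features  # no sampling
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        # Assumed convention: a float is a fraction of the columns.
        return max(1, int(max_features * n_features))
    if isinstance(max_features, int):
        # Assumed convention: an int is an absolute column count.
        return max_features
    raise ValueError(f"unsupported max_features: {max_features!r}")

for setting in ("sqrt", "log2", None, 1.0, 0.5, 5):
    print(repr(setting), resolve_max_features(setting, 100))
```

For 100 input columns this resolves "sqrt" to 10, "log2" to 6, and both None and 1.0 to all 100.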
XGBoost exposes three multiplicative knobs: colsample_bytree, colsample_bylevel, and colsample_bynode. All three default to 1.0 and accept any value in (0, 1]. colsample_bynode is not supported by the exact tree method, only by the histogram and approximate methods. The XGBoost documentation gives the worked example that with 64 columns and all three set to 0.5, each split sees 8 candidate columns.
LightGBM uses feature_fraction (per tree) and feature_fraction_bynode (per split). The effective per node fraction is the product of the two. The feature_fraction_seed parameter pins the random selection for reproducibility.
CatBoost supports colsample_bylevel, which subsamples columns at each level of each oblivious tree. R's randomForest package uses mtry and matches Breiman's original defaults exactly. H2O's distributed random forest and gradient boosting models expose col_sample_rate_per_tree, col_sample_rate, and col_sample_rate_change_per_level.
The value of mtry or max_features is usually worth tuning. The 2018 review Hyperparameters and Tuning Strategies for Random Forest by Probst, Wright, and Boulesteix (WIREs Data Mining and Knowledge Discovery) lists mtry among the two most influential parameters of a random forest, alongside the minimum node size. A few rules of thumb from the empirical literature:
- For gradient boosting, use colsample_bytree (or feature_fraction) as a regularizer to combine with row subsampling. Common starting points are 0.7 to 0.9 with row subsampling near 0.8.
- Per node sampling (colsample_bynode, feature_fraction_bynode, scikit-learn's default split level behaviour) is more aggressive than per tree sampling and is closer in spirit to Breiman's original Forest-RI.

Attribute sampling sits next to several closely related ideas. Bootstrap aggregation (bagging) randomizes the rows; attribute sampling randomizes the columns; combining the two gives a random forest. Extra trees go further by also randomizing the split threshold, not just the candidate features. The random subspace method, in Ho's original sense, is the special case of attribute sampling where the subset is fixed for the lifetime of a base learner and the same subset is used for every split in that learner. Feature selection methods such as recursive feature elimination or L1 regularized models pursue the same goal of dropping features, but deterministically rather than at random and once globally rather than at every node.
Attribute sampling in auditing or data quality work refers to drawing a random sample of records and checking the values of selected attributes. That usage shares a name with the machine learning sense but is otherwise unrelated; this article covers only the machine learning meaning.
Imagine a class of students all guessing the answer to the same hard question. If every student studied from the same notes, they will all be wrong in the same way, and asking the whole class will not help. So you give each student a different randomly chosen page of the textbook to study from. Now their guesses are different. When you average the guesses, the random errors cancel out and the average is closer to the right answer than any single student would be. Random forests do the same trick with trees: each tree only gets to look at a random subset of the data columns at each decision, so the trees end up disagreeing in useful ways, and the average prediction is more accurate than any single tree.