See also: downsampling, upsampling, oversampling, undersampling, bootstrapping, data augmentation
Subsampling is a family of techniques in machine learning, statistics, and signal processing that draw a smaller subset from a larger collection of data, samples, features, or signal values. The term covers several distinct procedures that look superficially similar but solve different problems. In ensemble learning, subsampling refers to fitting each base model on a random fraction of the training rows. In stochastic optimization, it refers to drawing a mini-batch of training examples per gradient step. In convolutional neural networks, it refers to pooling or strided convolutions that reduce spatial resolution. In statistics, the word names a specific resampling method (Politis and Romano, 1994) where subsamples of size m are drawn without replacement from a sample of size n. In signal processing, it is a synonym for downsampling the sample rate of an audio or image signal, with an anti-aliasing filter applied first. In imbalanced learning, it usually means undersampling the majority class.
This article surveys the main meanings, presents the underlying algorithms and formulas where they are short enough, and points to framework support. The unifying intuition is that working with fewer samples reduces compute, often acts as regularization, and changes the bias-variance trade-off.
| Context | What is subsampled | Typical purpose | Key reference |
|---|---|---|---|
| Stochastic gradient boosting | Training rows per boosting iteration | Faster fitting, regularization | Friedman (2002) |
| Random forests | Training rows per tree (bootstrap) | Decorrelate trees, bagging | Breiman (2001) |
| Mini-batch SGD | Training examples per gradient step | Tractable optimization | Robbins and Monro (1951) |
| CNN pooling and strided convolution | Spatial positions in a feature map | Reduce resolution, expand receptive field | LeCun et al. (1998) |
| Signal and image processing | Sample-rate of a 1D or 2D signal | Compression, format conversion | Nyquist-Shannon |
| Imbalanced classification | Majority-class examples | Rebalance class distribution | Kubat and Matwin (1997) |
| Word2vec | Frequent words and negative (noise) samples | Faster training, better embeddings | Mikolov et al. (2013) |
| Statistical resampling | Subsamples of size m from n | Confidence regions under weak assumptions | Politis and Romano (1994) |
| Efficient transformers | Token pairs in self-attention | Sub-quadratic attention | Kitaev et al. (2020) |
In the original gradient boosting machine, each new tree is fit to the residuals of the running model on the entire training set (Friedman, 2001). One year later, Friedman (2002) proposed a stochastic variant: at every boosting iteration, draw a random subsample of rows from the training data without replacement and fit the next tree only on that subsample. The fraction of rows drawn is the subsample rate, written as a number between 0 and 1.
Friedman reported two effects from this change. First, training is faster because each tree sees fewer rows. Second, the variance introduced by random row selection acts as regularization and improves generalization, especially when the base learner is large enough to overfit. He recommended values around 0.5 for a wide range of problems.
| Library | Parameter | Default | Notes |
|---|---|---|---|
| XGBoost | subsample | 1.0 | Sampled once per tree, without replacement |
| LightGBM | bagging_fraction (alias subsample) | 1.0 | Requires bagging_freq > 0 to take effect |
| CatBoost | subsample | 0.8 (Bernoulli) | Bernoulli or Poisson bootstrap |
| scikit-learn GradientBoostingClassifier | subsample | 1.0 | Values < 1.0 give stochastic gradient boosting |
| H2O GBM | sample_rate | 1.0 | Per-tree sampling without replacement |
A related but separate idea is column subsampling: each split or each tree is restricted to a random subset of features. XGBoost calls these colsample_bytree, colsample_bylevel, and colsample_bynode. LightGBM uses feature_fraction. Row subsampling and column subsampling can be combined and they regularize the model in different ways. Column subsampling is the same idea Breiman used in random forests, applied here on top of boosting.
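A minimal sketch of both knobs using XGBoost's scikit-learn wrapper (assuming the xgboost package is installed; the synthetic dataset, tree count, and the 0.5 / 0.8 fractions are illustrative choices, not recommendations):

```python
# Stochastic gradient boosting with row and column subsampling.
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    subsample=0.5,          # each tree is fit on a random 50% of the rows
    colsample_bytree=0.8,   # each tree sees a random 80% of the columns
    random_state=0,
)
model.fit(X, y)
```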
Random forests (Breiman, 2001) build an ensemble of decision trees where each tree is fit on a bootstrap sample of the training data, that is, a sample of size n drawn with replacement from the n rows of the training set. About 36.8% of the original rows are left out of any given bootstrap (the out-of-bag set), and these can be used as a free validation set for each tree. Bootstrap row sampling is the bagging part of the algorithm. On top of bagging, each split considers only a random subset of features (often sqrt(p) for classification or p/3 for regression with p features), which is the random subspace method of Ho (1998).
Extremely Randomized Trees, or Extra Trees (Geurts, Ernst, and Wehenkel, 2006), drop the bootstrap step and fit each tree on the full training set, while randomizing the cut-points of each split. They argue that the variance reduction from random splits already substitutes for the variance reduction from bootstrap sampling. In scikit-learn, both RandomForestClassifier and ExtraTreesClassifier expose a bootstrap flag and a max_samples argument that controls the fraction of rows drawn per tree.
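A small sketch of these scikit-learn controls (the synthetic data and the 0.8 fraction are illustrative):

```python
# Row-subsampling controls in scikit-learn's forest estimators.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=2_000, n_features=25, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,
    max_samples=0.8,   # each tree is fit on 80% of the rows, drawn with replacement
    oob_score=True,    # evaluate each tree on its out-of-bag rows
    random_state=0,
).fit(X, y)

et = ExtraTreesClassifier(
    n_estimators=200,
    bootstrap=False,   # Extra Trees default: every tree sees the full training set
    random_state=0,
).fit(X, y)

print(rf.oob_score_)   # free validation estimate from the out-of-bag rows
```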
Stochastic gradient descent (SGD) is itself a form of subsampling. Each step computes the gradient on a single example or a small mini-batch instead of on the full training set. This is what makes deep learning practical: with millions of training examples, computing the exact gradient at every step would be far too expensive. SGD in its single-example form goes back to Robbins and Monro (1951); mini-batch variants became standard practice in the 1990s and 2000s.
Mini-batch sizes typically range from 32 to 8192 in modern deep learning. Larger mini-batches give lower-variance gradient estimates but require more memory and need careful learning-rate scaling to avoid generalization losses (Goyal et al., 2017). Small mini-batches add noise that can act as a regularizer and help the optimizer escape sharp minima (Keskar et al., 2017). The interaction between batch size, learning rate, and generalization remains an active research question.
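As a plain illustration of the idea, a NumPy sketch of mini-batch SGD for least-squares regression (the synthetic data, batch size of 64, learning rate, and epoch count are arbitrary choices):

```python
# Mini-batch SGD for linear least squares: each update uses only a subsample of rows.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=10_000)

w = np.zeros(5)
batch_size, lr = 64, 0.05
for epoch in range(5):
    perm = rng.permutation(len(X))                    # reshuffle the rows every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]          # the subsampled mini-batch
        xb, yb = X[idx], y[idx]
        grad = 2 * xb.T @ (xb @ w - yb) / len(idx)    # gradient on the mini-batch only
        w -= lr * grad

print(w.round(2))   # approaches true_w despite never seeing a full-batch gradient
```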
Convolutional neural networks (CNNs) routinely subsample feature maps to reduce spatial resolution and to enlarge the receptive field of deeper layers. The original LeNet-5 (LeCun et al., 1998) called these layers S2 and S4, where S stood for subsampling. Each subsampling unit summed a 2x2 patch of the previous feature map, multiplied the sum by a trainable coefficient, added a trainable bias, and passed the result through a sigmoid.
Modern CNNs achieve subsampling in two ways:
| Method | Mechanism | Example |
|---|---|---|
| Pooling | Aggregate a small spatial window with max or average | 2x2 max pool with stride 2 discards 75% of activations |
| Strided convolution | Apply a convolution with stride > 1 | A 3x3 conv with stride 2 halves spatial dimensions |
A 2x2 max-pooling layer with stride 2 is the canonical building block: it picks the largest value from each non-overlapping 2x2 window, halving height and width and dropping the activation count to one quarter of the input. Strided convolutions, popularized by ResNet (He et al., 2016) and All-Convolutional Networks (Springenberg et al., 2015), achieve the same downsampling effect while learning the filter that performs it. Both approaches are subsampling because they reduce the number of spatial positions retained from one layer to the next, and both expand the receptive field of subsequent layers.
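A short PyTorch sketch showing that both operations halve the spatial resolution (the tensor shape and channel counts are illustrative):

```python
# 2x2 max pooling and a stride-2 convolution both subsample a feature map.
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)           # (batch, channels, height, width)

pool = nn.MaxPool2d(kernel_size=2, stride=2)
strided = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1)

print(pool(x).shape)     # torch.Size([1, 16, 16, 16]) -- 75% of activations dropped
print(strided(x).shape)  # torch.Size([1, 32, 16, 16]) -- learned downsampling filter
```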
There is a known interaction between aggressive subsampling and translation invariance: vanilla strided convolutions and pooling violate the shift-equivariance properties one would expect from a convolution. Zhang (2019) showed that adding a low-pass anti-aliasing filter before subsampling improves both robustness and accuracy on standard image benchmarks.
In classical signal processing, subsampling is a synonym for downsampling: keep only every M-th sample of a sampled signal, where M is the integer downsampling factor. The Nyquist-Shannon sampling theorem requires that a signal whose highest frequency component is B Hz be sampled at a rate of at least 2B Hz to avoid aliasing. When the sample rate is reduced, the new Nyquist limit is lower, so any spectral content above that limit must be removed before subsampling, or it will fold back into the retained band as aliasing artifacts. The standard practice is therefore to apply a low-pass anti-aliasing filter and then drop samples.
A few common cases:
| Domain | Original rate | Reduced rate | Notes |
|---|---|---|---|
| Audio for speech models | 48 kHz studio | 16 kHz model input | Many speech recognizers and TTS models expect 16 kHz |
| Image downsampling | 1024x1024 image | 256x256 image | 4x downsampling per dimension; needs blur or area filter |
| Telephony | 8 kHz audio | n/a | Low rate is itself a result of historical subsampling |
| Lidar / sensor logs | 100 Hz | 10 Hz | Often subsampled to match a slower controller loop |
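The audio case in the table can be sketched with SciPy, whose signal.decimate applies a low-pass anti-aliasing filter before keeping every M-th sample; the 440 Hz and 9 kHz test tones below are illustrative:

```python
# 48 kHz -> 16 kHz (factor M = 3), with and without an anti-aliasing filter.
import numpy as np
from scipy import signal

fs = 48_000
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 9_000 * t)

naive = x[::3]                    # plain decimation: the 9 kHz tone folds to 7 kHz
filtered = signal.decimate(x, 3)  # low-pass filter first, then keep every 3rd sample
```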
In image processing, common anti-aliasing filters before subsampling include Gaussian blur, area averaging, Lanczos kernels, and bicubic kernels. Pillow, OpenCV, and TensorFlow all expose these as resize options. Skipping the filter and simply decimating pixels produces the visible aliasing pattern known as moiré.
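Rather than a library resize call, the same blur-then-decimate idea can be written out explicitly with SciPy's Gaussian filter (the random stand-in image and the sigma rule of thumb are assumptions for the example):

```python
# 4x image downsampling with and without a blur-based anti-aliasing step.
import numpy as np
from scipy import ndimage

img = np.random.rand(1024, 1024)        # stand-in for a grayscale image

naive = img[::4, ::4]                   # plain pixel decimation: prone to moiré
blurred = ndimage.gaussian_filter(img, sigma=2.0)   # sigma ~ half the factor is a rough rule of thumb
antialiased = blurred[::4, ::4]         # then keep every 4th pixel in each dimension
```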
When one class dominates a classification dataset, training on the raw data tends to bias the model toward the majority class. A common remedy is to subsample the majority class so that the class proportions become more balanced. This is usually called undersampling rather than subsampling, but the operation is the same: random selection without replacement from the more numerous class.
| Method | Description |
|---|---|
| Random undersampling | Pick majority-class examples uniformly at random until the desired ratio is reached |
| Tomek links | Remove majority-class examples that form a Tomek link with a minority example, cleaning the decision boundary |
| NearMiss-1 | Keep majority examples whose average distance to the three nearest minority examples is smallest |
| NearMiss-2 | Keep majority examples whose average distance to the three farthest minority examples is smallest |
| NearMiss-3 | For each minority example, retain a fixed number of nearest majority neighbors |
| EditedNearestNeighbours | Drop majority examples whose neighbors disagree with their label |
| Cluster centroids | Replace majority examples with the centroids of clusters fit to that class |
These methods are all available in the open-source imbalanced-learn Python package, which integrates with scikit-learn pipelines. See the imbalanced dataset and undersampling articles for a deeper discussion of when each method works.
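A minimal sketch of random undersampling with imbalanced-learn (assuming the package is installed; the synthetic 9:1 class ratio and the target 1:2 ratio are illustrative):

```python
# Random undersampling of the majority class with imbalanced-learn.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(
    n_samples=11_000, n_features=10, weights=[0.9, 0.1], random_state=0
)
print(Counter(y))                      # roughly 9,900 majority vs 1,100 minority

rus = RandomUnderSampler(sampling_strategy=0.5, random_state=0)  # minority:majority = 1:2
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y_res))                  # majority class subsampled to twice the minority count
```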
Word2vec (Mikolov et al., 2013) uses two separate subsampling tricks. The first is negative sampling, an alternative to the full softmax over the vocabulary. For each true (target word, context word) pair, the model samples a small number of "negative" words (typically between 5 and 20 for small datasets, 2 to 5 for large ones) drawn from the vocabulary according to the unigram distribution raised to the 3/4 power. The model then trains a binary classifier to distinguish the true context from the negatives. This replaces an O(|V|) softmax with a constant-time update.
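A small sketch of that noise distribution (the vocabulary and counts are made up for illustration):

```python
# Negative-sampling noise distribution: unigram counts raised to the 3/4 power.
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "serendipity"]
counts = np.array([5_000_000, 12_000, 9_000, 3_000_000, 7_000, 40], dtype=float)

noise_dist = counts ** 0.75
noise_dist /= noise_dist.sum()                        # renormalize to a probability vector

rng = np.random.default_rng(0)
negatives = rng.choice(vocab, size=5, p=noise_dist)   # 5 negatives per true pair
print(dict(zip(vocab, noise_dist.round(4))), negatives)
```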
The second trick is subsampling of frequent words. Common words such as "the", "and", "a" appear in many contexts and contribute little information per occurrence. Mikolov et al. discard each occurrence of word w with probability:
P(discard w) = 1 - sqrt(t / f(w))
where f(w) is the relative frequency of w in the corpus and t is a threshold (the original paper recommends t = 1e-5). Words with f(w) below t are never discarded, while very frequent words are discarded most of the time. The published implementation uses a slightly different formula, but the qualitative effect is the same. Mikolov reported that this scheme accelerated training by about 2x and improved accuracy on word-similarity tasks because it gave the model relatively more training signal from rare, semantically rich words.
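A toy implementation of the discard rule as written above (the word counts and the assumed one-billion-token corpus size are made up; t = 1e-5 follows the paper):

```python
# Frequent-word subsampling: P(discard w) = 1 - sqrt(t / f(w)).
import math
import random

total_tokens = 1_000_000_000     # hypothetical corpus size
counts = {"the": 50_000_000, "learning": 1_200_000, "serendipity": 900}
t = 1e-5

def keep_probability(word: str) -> float:
    f = counts[word] / total_tokens          # relative frequency f(w)
    return min(1.0, math.sqrt(t / f))        # words with f(w) <= t are always kept

def subsample(tokens: list[str]) -> list[str]:
    return [w for w in tokens if random.random() < keep_probability(w)]

for w in counts:
    print(w, round(keep_probability(w), 4))  # "the" ~0.014, "serendipity" 1.0
```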
The same family of ideas reappears in noise-contrastive estimation (Gutmann and Hyvarinen, 2010) and in modern contrastive learning, where positive pairs are augmented and negatives are sampled from the batch.
In theoretical statistics, subsampling has a precise technical meaning: given an i.i.d. sample of size n from an unknown distribution, draw subsamples of size m without replacement, where m grows more slowly than n (so m/n -> 0 as n -> infinity). Compute the statistic of interest on each subsample and form an empirical distribution of the suitably normalized values. Politis and Romano (1994) proved that this procedure yields asymptotically valid confidence regions under remarkably weak assumptions, essentially only that the statistic has a non-degenerate limiting distribution.
This is different from the bootstrap, where one resamples with replacement at the same size n. The bootstrap requires stronger conditions to be consistent (smoothness of the statistic, finite second moments, and so on), and it can fail in heavy-tailed or non-regular cases where subsampling still works. The reference text is the Springer monograph by Politis, Romano, and Wolf (1999), which extends the i.i.d. theory to time series, random fields, and dependent data. The price for the weaker assumptions is a slower convergence rate and the need to choose the block size m, often via a data-driven calibration rule.
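A minimal sketch of the i.i.d. recipe for the sample mean (the exponential data, the number of subsamples, and the simple rule m = sqrt(n) are illustrative choices, not the calibrated block-size selection the monograph describes):

```python
# m-out-of-n subsampling confidence interval for the mean (Politis-Romano style).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_exponential(2_000)          # placeholder i.i.d. sample
n = len(x)
m = int(n ** 0.5)                            # subsample size: m -> inf, m/n -> 0
theta_n = x.mean()                           # statistic on the full sample

roots = []
for _ in range(2_000):
    sub = rng.choice(x, size=m, replace=False)        # subsample WITHOUT replacement
    roots.append(np.sqrt(m) * (sub.mean() - theta_n)) # normalized subsample root
lo_q, hi_q = np.quantile(roots, [0.025, 0.975])

# Invert the approximated distribution of sqrt(n) * (theta_n - theta) for a 95% CI.
ci = (theta_n - hi_q / np.sqrt(n), theta_n - lo_q / np.sqrt(n))
print(theta_n, ci)
```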
Standard self-attention in transformers has O(L^2) cost in the sequence length L, because every token attends to every other token. Several efficient transformer variants reduce this cost by subsampling the set of token pairs that participate in attention: Reformer (Kitaev et al., 2020) hashes tokens into buckets with locality-sensitive hashing and attends only within a bucket, Longformer combines a sliding local window with a few global tokens, and BigBird adds randomly chosen token pairs on top of that.
In each case, subsampling the L-by-L attention matrix is what buys the sub-quadratic cost. The trade-off is approximation error in the attention output, which is usually small enough in practice that long-context models can be trained.
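As an illustration of the simplest such pattern, a NumPy sketch of local-window attention in which each query attends only to keys within w positions; this is a toy, not any particular model's implementation:

```python
# Local-window attention: roughly L * (2w + 1) token pairs survive instead of L^2.
import numpy as np

L, d, w = 128, 16, 8
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(L, d)) for _ in range(3))

scores = q @ k.T / np.sqrt(d)
idx = np.arange(L)
mask = np.abs(idx[:, None] - idx[None, :]) <= w      # keep only nearby token pairs
scores = np.where(mask, scores, -np.inf)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))   # masked softmax
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v
print(mask.sum(), "of", L * L, "token pairs retained")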
Despite the variety of contexts, the same practical question recurs whenever subsampling is introduced: how much to subsample. Reasonable starting points are summarized below.
| Context | Typical starting point | Notes |
|---|---|---|
| Row subsampling in boosting and forests | 0.5 to 0.8 of the training rows per tree | Smaller fractions speed up training and add regularization; larger fractions reduce stochastic noise |
| Mini-batch SGD | 32 to 512 for tabular and small image data; 1024 to 65536 for large vision and language models | Scale the learning rate with the batch size, often via the linear or square-root scaling rule |
| CNN pooling and strides | 2x2 window with stride 2 | Inception uses 3x3 windows with stride 2; many modern architectures replace pooling with strided convolutions entirely |
| Statistical subsampling | Block size m with m -> infinity and m/n -> 0 | Often chosen by minimizing a calibration criterion such as the minimum-volatility method |
Imagine you have a giant box of LEGO with thousands of bricks. There are different ways to use just some of them.
If you want to build a quick test castle, you might grab a handful at random instead of sorting through the whole box. That is mini-batch subsampling: it is faster than dumping the box on the floor every time. If you want a friend to also build a castle and then compare, you might give them a different handful, and looking at both castles together gives you a better idea of what is in the box. That is what random forests and stochastic gradient boosting do.
If your friend takes a photo of your castle but the picture is too big to send, they might keep only every other pixel to make it smaller. If they do not blur the photo first, the fine stripes of the bricks will look weird and jagged. That is signal subsampling, and the blur is the anti-aliasing filter.
And if you have a thousand red bricks but only ten blue ones, your friend might say "let me only look at fifty red bricks so I do not forget about the blue ones." That is undersampling for imbalanced data.
All of these are subsampling: pick fewer items, pick them carefully, and pay attention to what you might lose.