# Subsampling

> Source: https://aiwiki.ai/wiki/subsampling
> Updated: 2026-07-11
> Categories: Machine Learning, Statistics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

Subsampling is the practice of drawing a smaller subset from a larger collection of data points, training examples, features, or signal values, in order to cut compute cost, add regularization, or rebalance a dataset. The term is used across [machine learning](/wiki/machine_learning), statistics, and signal processing for several distinct procedures that share this core idea: in ensemble methods it means fitting each tree on a random fraction of the rows, in stochastic optimization it means using a mini-batch per gradient step, in [convolutional neural networks](/wiki/convolutional_neural_network) it means [pooling](/wiki/pooling) or strided convolutions that shrink spatial resolution, and in [word2vec](/wiki/word2vec) it means discarding frequent words with probability $$P(w) = 1 - \sqrt{t / f(w)}$$, with the threshold t typically around 1e-5.[13] In formal statistics it names a specific resampling method (Politis and Romano, 1994) that draws subsamples of size m without replacement from a sample of size n.[16]

*See also: [downsampling](/wiki/downsampling), [upsampling](/wiki/upsampling), [oversampling](/wiki/oversampling), [undersampling](/wiki/undersampling), [bootstrapping](/wiki/bootstrapping), [data augmentation](/wiki/data_augmentation)*

## What is subsampling?

Subsampling is a family of techniques in [machine learning](/wiki/machine_learning), statistics, and signal processing that draw a smaller subset from a larger collection of data, samples, features, or signal values. The term covers several distinct procedures that look superficially similar but solve different problems. In ensemble learning, subsampling refers to fitting each base model on a random fraction of the training rows. In stochastic optimization, it refers to drawing a mini-batch of training examples per gradient step. In convolutional neural networks, it refers to pooling or strided convolutions that reduce spatial resolution. In statistics, the word names a specific resampling method (Politis and Romano, 1994) where subsamples of size m are drawn without replacement from a sample of size n.[16] In signal processing, it is a synonym for [downsampling](/wiki/downsampling) the sample rate of an audio or image signal, with an anti-aliasing filter applied first. In imbalanced learning, it usually means undersampling the majority class.

This article surveys the main meanings, presents the underlying algorithms and formulas where they are short enough, and points to framework support. The unifying intuition is that working with fewer samples reduces compute, often acts as regularization, and changes the bias-variance trade-off.

## What are the meanings of subsampling at a glance?

| Context | What is subsampled | Typical purpose | Key reference |
|---------|-------------------|-----------------|---------------|
| Stochastic gradient boosting | Training rows per boosting iteration | Faster fitting, regularization | Friedman (2002) |
| Random forests | Training rows per tree (bootstrap) | Decorrelate trees, bagging | Breiman (2001) |
| Mini-batch SGD | Training examples per gradient step | Tractable optimization | Robbins and Monro (1951) |
| CNN pooling and strided convolution | Spatial positions in a feature map | Reduce resolution, expand receptive field | LeCun et al. (1998) |
| Signal and image processing | Sample-rate of a 1D or 2D signal | Compression, format conversion | Nyquist-Shannon |
| Imbalanced classification | Majority-class examples | Rebalance class distribution | Kubat and Matwin (1997) |
| Word2vec | Frequent words and noise contrasts | Faster training, better embeddings | Mikolov et al. (2013) |
| Statistical resampling | Subsamples of size m from n | Confidence regions under weak assumptions | Politis and Romano (1994) |
| Efficient transformers | Token pairs in self-attention | Sub-quadratic attention | Kitaev et al. (2020) |

## How does subsampling work in stochastic gradient boosting?

In the original gradient boosting machine, each new tree is fit to the residuals of the running model on the entire training set (Friedman, 2001).[2] One year later, Friedman (2002) proposed a stochastic variant: at every boosting iteration, draw a random subsample of rows from the training data without replacement and fit the next tree only on that subsample.[1] The fraction of rows drawn is the subsample rate, written as a number between 0 and 1. The paper's abstract states that "both the approximation accuracy and execution speed of gradient boosting can be substantially improved by incorporating randomization into the procedure," where "at each iteration a subsample of the training data is drawn at random (without replacement) from the full training data set."[1]

Friedman reported two effects from this change. First, training is faster because each tree sees fewer rows. Second, the variance introduced by random row selection acts as regularization and improves generalization, especially when the base learner is large enough to overfit. He recommended values around 0.5 for a wide range of problems, and the widely used `gbm` package set its default subsample rate to 0.5 (half the observations per step).[1]

### Parameter names across libraries

| Library | Parameter | Default | Notes |
|---------|-----------|---------|-------|
| [XGBoost](/wiki/xgboost) | `subsample` | 1.0 | Sampled once per tree, without replacement |
| [LightGBM](/wiki/lightgbm) | `bagging_fraction` (alias `subsample`) | 1.0 | Requires `bagging_freq` > 0 to take effect |
| CatBoost | `subsample` | 0.8 (Bernoulli) | Bernoulli or Poisson bootstrap |
| scikit-learn `GradientBoostingClassifier` | `subsample` | 1.0 | Values < 1.0 give stochastic gradient boosting |
| H2O GBM | `sample_rate` | 1.0 | Per-tree sampling without replacement |

A related but separate idea is column subsampling: each split or each tree is restricted to a random subset of features. XGBoost calls these `colsample_bytree`, `colsample_bylevel`, and `colsample_bynode`.[24] LightGBM uses `feature_fraction`.[25] Row subsampling and column subsampling can be combined and they regularize the model in different ways. Column subsampling is the same idea Breiman used in random forests, applied here on top of boosting.

## How does subsampling work in random forests?

Random forests (Breiman, 2001) build an ensemble of decision trees where each tree is fit on a bootstrap sample of the training data, that is, a sample of size n drawn with replacement from the n rows of the training set.[3] About 36.8% of the original rows are left out of any given bootstrap (the out-of-bag set), and these can be used as a free validation set for each tree. Bootstrap row sampling is the bagging part of the algorithm. On top of bagging, each split considers only a random subset of features (often $$\sqrt{p}$$ for classification or $$p/3$$ for regression with p features), which is the random subspace method of Ho (1998).[4]

Extremely Randomized Trees, or Extra Trees (Geurts, Ernst, and Wehenkel, 2006), drop the bootstrap step and fit each tree on the full training set, while randomizing the cut-points of each split. They argue that the variance reduction from random splits already substitutes for the variance reduction from bootstrap sampling.[5] In scikit-learn, both `RandomForestClassifier` and `ExtraTreesClassifier` expose a `bootstrap` flag and a `max_samples` argument that controls the fraction of rows drawn per tree.

## How does subsampling work in stochastic optimization?

Stochastic gradient descent (SGD) is itself a form of subsampling. Each step computes the gradient on a single example or a small mini-batch instead of on the full training set. This is what makes deep learning practical: with millions of training examples, computing the exact gradient at every step would be far too expensive. The mini-batch version of SGD goes back to Robbins and Monro (1951) for the single-example case, with mini-batch variants becoming standard practice in the 1990s and 2000s.[10]

Mini-batch sizes typically range from 32 to 8192 in modern deep learning. Larger mini-batches give lower-variance gradient estimates but require more memory and need careful learning-rate scaling to avoid generalization losses (Goyal et al., 2017).[11] Small mini-batches add noise that can act as a regularizer and help the optimizer escape sharp minima (Keskar et al., 2017).[12] The interaction between batch size, learning rate, and generalization remains an active research question.

## How does subsampling work in convolutional networks?

Convolutional neural networks ([CNNs](/wiki/convolutional_neural_network)) routinely subsample feature maps to reduce spatial resolution and to enlarge the receptive field of deeper layers. The original LeNet-5 (LeCun et al., 1998) called these layers S2 and S4, where S stood for subsampling. Each subsampling unit had a 2x2 receptive field, halving the height and width of the preceding feature map. In the authors' description, "the four inputs to a unit in S2 are added, then multiplied by a trainable coefficient, and added to a trainable bias," and "the result is passed through a sigmoidal function."[6] Because the 2x2 windows do not overlap, each subsampling layer reduces the spatial dimensions of the feature maps by a factor of two.[6]

Modern CNNs achieve subsampling in two ways:

| Method | Mechanism | Example |
|--------|-----------|---------|
| [Pooling](/wiki/pooling) | Aggregate a small spatial window with max or average | 2x2 max pool with stride 2 discards 75% of activations |
| Strided convolution | Apply a convolution with stride > 1 | A 3x3 conv with stride 2 halves spatial dimensions |

A 2x2 max-pooling layer with stride 2 is the canonical building block: it picks the largest value from each non-overlapping 2x2 window, halving height and width and dropping the activation count to one quarter of the input. Strided convolutions, popularized by ResNet (He et al., 2016) and All-Convolutional Networks (Springenberg et al., 2015), achieve the same downsampling effect while learning the filter that performs it.[7][8] Both approaches are subsampling because they reduce the number of spatial positions retained from one layer to the next, and both expand the receptive field of subsequent layers.

There is a known interaction between aggressive subsampling and translation invariance: vanilla strided convolutions and pooling violate the shift-equivariance properties one would expect from a convolution. Zhang (2019) showed that adding a [low-pass anti-aliasing filter](/wiki/anti_aliasing) before subsampling improves both robustness and accuracy on standard image benchmarks.[9]

## How does subsampling work in signal and image processing?

In classical signal processing, subsampling is a synonym for downsampling: keep only every M-th sample of a sampled signal, where M is the integer downsampling factor. The Nyquist-Shannon sampling theorem requires that any signal whose highest frequency component is below B Hz be sampled at least at 2B Hz to avoid aliasing.[26] When the sample rate is reduced, the new Nyquist limit is lower, so any spectral content above the new limit must be removed before subsampling, or it will fold back into the audible band as aliasing artifacts. The standard practice is to apply a low-pass anti-aliasing filter and then drop samples.

A few common cases:

| Domain | Original rate | Reduced rate | Notes |
|--------|--------------|--------------|-------|
| Audio for speech models | 48 kHz studio | 16 kHz model input | Many speech recognizers and TTS models expect 16 kHz |
| Image downsampling | 1024x1024 image | 256x256 image | 4x downsampling per dimension; needs blur or area filter |
| Telephony | 8 kHz audio | n/a | Low rate is itself a result of historical subsampling |
| Lidar / sensor logs | 100 Hz | 10 Hz | Often subsampled to match a slower controller loop |

In image processing, common anti-aliasing filters before subsampling include Gaussian blur, area averaging, Lanczos kernels, and bicubic kernels. Pillow, OpenCV, and TensorFlow all expose these as resize options. Skipping the filter and just decimating pixels produces the visible aliasing pattern known as moire.

## How does subsampling help with imbalanced learning?

When one class dominates a classification dataset, training on the raw data tends to bias the model toward the majority class. A common remedy is to subsample the majority class so that the class proportions become more balanced.[18] This is usually called undersampling rather than subsampling, but the operation is the same: random selection without replacement from the more numerous class.

| Method | Description |
|--------|-------------|
| Random undersampling | Pick majority-class examples uniformly at random until the desired ratio is reached |
| Tomek links | Remove majority-class examples that form a Tomek pair with a minority example, cleaning the decision boundary |
| NearMiss-1 | Keep majority examples whose average distance to the three nearest minority examples is smallest |
| NearMiss-2 | Keep majority examples whose average distance to the three farthest minority examples is smallest |
| NearMiss-3 | For each minority example, retain a fixed number of nearest majority neighbors |
| EditedNearestNeighbours | Drop majority examples whose neighbors disagree with their label |
| Cluster centroids | Replace majority examples with the centroids of clusters fit to that class |

These methods are all available in the open-source `imbalanced-learn` Python package, which integrates with scikit-learn pipelines.[19] See the [imbalanced dataset](/wiki/imbalanced_dataset) and [undersampling](/wiki/undersampling) articles for a deeper discussion of when each method works.

## How does subsampling work in word2vec?

Word2vec (Mikolov et al., 2013) uses two separate subsampling tricks. The first is negative sampling, an alternative to the full softmax over the vocabulary. For each true (target word, context word) pair, the model samples a small number of "negative" words (typically between 5 and 20 for small datasets, 2 to 5 for large ones) drawn from the unigram distribution raised to the 3/4 power.[13] The authors report that this U(w) to the 3/4 power weighting "outperformed significantly the unigram and the uniform distributions" on their analogy tasks.[13] The model then trains a binary classifier to distinguish the true context from the negatives.[14] This replaces an O(|V|) softmax with a constant-time update.

The second trick is subsampling of frequent words. Common words such as "the", "and", "a" appear in many contexts and contribute little information per occurrence. Mikolov et al. discard each occurrence of word w with probability:

$$
P(\text{discard } w) = 1 - \sqrt{t / f(w)}
$$

where f(w) is the relative frequency of w in the corpus and t is a threshold (the original paper recommends t = 1e-5).[13] Words with f(w) below t are never discarded, while very frequent words are discarded most of the time. The published implementation uses a slightly different formula, but the qualitative effect is the same. Mikolov reported that this scheme "accelerates learning and even significantly improves the accuracy of the learned vectors of the rare words," giving roughly a 2x speedup because the model spends relatively more of its training signal on rare, semantically rich words.[13]

The same family of ideas reappears in noise-contrastive estimation (Gutmann and Hyvarinen, 2010)[15] and in modern contrastive learning, where positive pairs are augmented and negatives are sampled from the batch.

## What is statistical subsampling (Politis and Romano)?

In theoretical statistics, subsampling has a precise technical meaning: given an i.i.d. sample of size n from an unknown distribution, draw subsamples of size m without replacement, where m grows more slowly than n (so $$m/n \to 0$$ as $$n \to \infty$$). Compute the statistic of interest on each subsample and form an empirical distribution of the suitably normalized values. Politis and Romano (1994) proved that this procedure yields asymptotically valid confidence regions under remarkably weak assumptions, essentially only that the statistic has a non-degenerate limiting distribution.[16]

This is different from the [bootstrap](/wiki/bootstrapping), where one resamples with replacement at the same size n. The bootstrap requires stronger conditions to be consistent (smoothness of the statistic, finite second moments, and so on), and it can fail in heavy-tailed or non-regular cases where subsampling still works. The reference text is the Springer monograph by Politis, Romano, and Wolf (1999), which extends the i.i.d. theory to time series, random fields, and dependent data.[17] The price for the weaker assumptions is a slower convergence rate and the need to choose the block size m, often via a data-driven calibration rule.

## How does subsampling enable efficient self-attention?

Standard self-attention in transformers has $$O(L^2)$$ cost in the sequence length $$L$$, because every token attends to every other token. Several efficient transformer variants reduce this cost by subsampling the set of token pairs that participate in attention.

- Reformer (Kitaev, Kaiser, and Levskaya, 2020) groups tokens with locality-sensitive hashing and only attends within each hash bucket, achieving $$O(L \log L)$$ cost.[20]
- Performer (Choromanski et al., 2021) uses random feature maps to approximate the attention kernel in $$O(L)$$ cost without explicit subsampling, but the random features can be viewed as a probabilistic sampling of kernel evaluations.[21]
- BigBird (Zaheer et al., 2020) attends to a fixed pattern of local, global, and randomly sampled positions, with the random subset providing approximation guarantees.[22]
- Sparse Transformer (Child et al., 2019) restricts each query to attend to a strided or fixed pattern of keys.[23]

In each case, subsampling the $$L \times L$$ attention matrix is what buys the sub-quadratic cost. The trade-off is approximation error in the attention output, which is usually small enough in practice that long-context models can be trained.

## What themes are common across the meanings?

Despite the variety of contexts, several patterns recur whenever subsampling is introduced:

1. Compute reduction. Touching fewer items per step is the most direct motivation, whether the items are training rows, attention pairs, or signal samples.
2. Variance reduction at the model level. By averaging many models, each fit on a different subsample, ensembles like random forests reduce variance compared to a single model fit on the whole dataset.
3. Variance increase at the iteration level. Because each iteration sees a different subsample, the path of the optimizer or the boosting trajectory becomes noisier. This noise often acts as implicit regularization.
4. Bias trade-off. Subsampling a smaller fraction of the data introduces bias into estimates of the gradient, the loss, or whatever statistic is being computed on the subsample. The art is choosing a fraction small enough to gain speed and regularization without hurting accuracy too much.
5. Need for a matching pre-processing step. In signal processing, subsampling without anti-aliasing produces artifacts. In CNNs, naive strided pooling breaks shift-equivariance. In imbalanced learning, naive random undersampling discards potentially informative majority-class examples and can be improved with structure-aware methods like Tomek links or NearMiss.

## How do you choose a subsampling fraction?

For row subsampling in stochastic gradient boosting and random forests, a common starting range is 0.5 to 0.8 of the training rows per tree. Smaller fractions speed up training and add more regularization, larger fractions reduce stochastic noise. For mini-batch SGD, batch sizes between 32 and 512 are typical for tabular and small image data, while large vision and language models use 1024 to 65536, scaled with learning rate (often via the linear or square-root scaling rule). For pooling in CNNs, 2x2 with stride 2 is the canonical default, but architectures like Inception use 3x3 windows with stride 2, and modern architectures often replace pooling with strided convolutions entirely. For statistical subsampling, the block size m must satisfy $$m \to \infty$$ and $$m/n \to 0$$; in practice it is often chosen by minimizing a calibration criterion such as the minimum-volatility method.

## Explain like I'm 5

Imagine you have a giant box of LEGO with thousands of bricks. There are different ways to use just some of them.

If you want to build a quick test castle, you might grab a handful at random instead of sorting through the whole box. That is mini-batch subsampling: it is faster than dumping the box on the floor every time. If you want a friend to also build a castle and then compare, you might give them a different handful, and looking at both castles together gives you a better idea of what is in the box. That is what random forests and stochastic gradient boosting do.

If your friend is recording your castle on their phone but the camera is too zoomed out, the picture will be blurry. They might decide to keep only every other pixel to make a smaller picture, but if they do not blur the photo first, the result will look weird and jagged. That is signal subsampling with anti-aliasing.

And if you have a thousand red bricks but only ten blue ones, your friend might say "let me only look at fifty red bricks so I do not forget about the blue ones." That is undersampling for imbalanced data.

All of these are subsampling: pick fewer items, pick them carefully, and pay attention to what you might lose.

## References

1. Friedman, J. H. (2002). Stochastic gradient boosting. *Computational Statistics and Data Analysis*, 38(4), 367-378. https://www.sciencedirect.com/science/article/abs/pii/S0167947301000652
2. Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. *Annals of Statistics*, 29(5), 1189-1232. https://projecteuclid.org/journals/annals-of-statistics/volume-29/issue-5/Greedy-function-approximation-A-gradient-boosting-machine/10.1214/aos/1013203451.full
3. Breiman, L. (2001). Random forests. *Machine Learning*, 45(1), 5-32. https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
4. Ho, T. K. (1998). The random subspace method for constructing decision forests. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 20(8), 832-844.
5. Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. *Machine Learning*, 63(1), 3-42. https://link.springer.com/article/10.1007/s10994-006-6226-1
6. LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. *Proceedings of the IEEE*, 86(11), 2278-2324. http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf
7. He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. *CVPR 2016*. arXiv:1512.03385.
8. Springenberg, J. T., Dosovitskiy, A., Brox, T., and Riedmiller, M. (2015). Striving for simplicity: The all convolutional net. *ICLR Workshop 2015*. arXiv:1412.6806.
9. Zhang, R. (2019). Making convolutional networks shift-invariant again. *ICML 2019*. arXiv:1904.11486.
10. Robbins, H. and Monro, S. (1951). A stochastic approximation method. *Annals of Mathematical Statistics*, 22(3), 400-407.
11. Goyal, P., et al. (2017). Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677.
12. Keskar, N. S., et al. (2017). On large-batch training for deep learning: Generalization gap and sharp minima. *ICLR 2017*. arXiv:1609.04836.
13. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013). Distributed representations of words and phrases and their compositionality. *NeurIPS 2013*. arXiv:1310.4546. https://arxiv.org/pdf/1310.4546
14. Goldberg, Y. and Levy, O. (2014). word2vec explained: Deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv:1402.3722.
15. Gutmann, M. and Hyvarinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. *AISTATS 2010*.
16. Politis, D. N. and Romano, J. P. (1994). Large sample confidence regions based on subsamples under minimal assumptions. *Annals of Statistics*, 22(4), 2031-2050.
17. Politis, D. N., Romano, J. P., and Wolf, M. (1999). *Subsampling*. Springer Series in Statistics. https://link.springer.com/book/10.1007/978-1-4612-1554-7
18. Kubat, M. and Matwin, S. (1997). Addressing the curse of imbalanced training sets: One-sided selection. *ICML 1997*.
19. Lemaitre, G., Nogueira, F., and Aridas, C. K. (2017). Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. *JMLR*, 18, 1-5. https://imbalanced-learn.org/
20. Kitaev, N., Kaiser, L., and Levskaya, A. (2020). Reformer: The efficient transformer. *ICLR 2020*. arXiv:2001.04451.
21. Choromanski, K., et al. (2021). Rethinking attention with Performers. *ICLR 2021*. arXiv:2009.14794.
22. Zaheer, M., et al. (2020). Big Bird: Transformers for longer sequences. *NeurIPS 2020*. arXiv:2007.14062.
23. Child, R., Gray, S., Radford, A., and Sutskever, I. (2019). Generating long sequences with sparse transformers. arXiv:1904.10509.
24. XGBoost developers. XGBoost parameters documentation. https://xgboost.readthedocs.io/en/stable/parameter.html
25. LightGBM developers. LightGBM parameters documentation. https://lightgbm.readthedocs.io/en/latest/Parameters.html
26. Shannon, C. E. (1949). Communication in the presence of noise. *Proceedings of the IRE*, 37(1), 10-21.