See also: downsampling, upsampling, oversampling, undersampling, bootstrapping, data augmentation
Subsampling is a family of techniques in machine learning, statistics, and signal processing that draw a smaller subset from a larger collection of data, samples, features, or signal values. The term covers several distinct procedures that look superficially similar but solve different problems. In ensemble learning, subsampling refers to fitting each base model on a random fraction of the training rows. In stochastic optimization, it refers to drawing a mini-batch of training examples per gradient step. In convolutional neural networks, it refers to pooling or strided convolutions that reduce spatial resolution. In statistics, the word names a specific resampling method (Politis and Romano, 1994) where subsamples of size m are drawn without replacement from a sample of size n. In signal processing, it is a synonym for downsampling the sample rate of an audio or image signal, with an anti-aliasing filter applied first. In imbalanced learning, it usually means undersampling the majority class.
This article surveys the main meanings, presents the underlying algorithms and formulas where they are short enough, and points to framework support. The unifying intuition is that working with fewer samples reduces compute, often acts as regularization, and changes the bias-variance trade-off.
| Context | What is subsampled | Typical purpose | Key reference |
|---|---|---|---|
| Stochastic gradient boosting | Training rows per boosting iteration | Faster fitting, regularization | Friedman (2002) |
| Random forests | Training rows per tree (bootstrap) | Decorrelate trees, bagging | Breiman (2001) |
| Mini-batch SGD | Training examples per gradient step | Tractable optimization | Robbins and Monro (1951) |
| CNN pooling and strided convolution | Spatial positions in a feature map | Reduce resolution, expand receptive field | LeCun et al. (1998) |
| Signal and image processing | Sample-rate of a 1D or 2D signal | Compression, format conversion | Nyquist-Shannon |
| Imbalanced classification | Majority-class examples | Rebalance class distribution | Kubat and Matwin (1997) |
| Word2vec | Frequent words and negative (noise) samples | Faster training, better embeddings | Mikolov et al. (2013) |
| Statistical resampling | Subsamples of size m from n | Confidence regions under weak assumptions | Politis and Romano (1994) |
| Efficient transformers | Token pairs in self-attention | Sub-quadratic attention | Kitaev et al. (2020) |
In the original gradient boosting machine, each new tree is fit to the residuals of the running model on the entire training set (Friedman, 2001). One year later, Friedman (2002) proposed a stochastic variant: at every boosting iteration, draw a random subsample of rows from the training data without replacement and fit the next tree only on that subsample. The fraction of rows drawn is the subsample rate, written as a number between 0 and 1.
Friedman reported two effects from this change. First, training is faster because each tree sees fewer rows. Second, the variance introduced by random row selection acts as regularization and improves generalization, especially when the base learner is large enough to overfit. He recommended values around 0.5 for a wide range of problems.
| Library | Parameter | Default | Notes |
|---|---|---|---|
| XGBoost | subsample | 1.0 | Sampled once per tree, without replacement |
| LightGBM | bagging_fraction (alias subsample) | 1.0 | Requires bagging_freq > 0 to take effect |
| CatBoost | subsample | 0.8 (Bernoulli) | Bernoulli or Poisson bootstrap |
| scikit-learn GradientBoostingClassifier | subsample | 1.0 | Values < 1.0 give stochastic gradient boosting |
| H2O GBM | sample_rate | 1.0 | Per-tree sampling without replacement |
A related but separate idea is column subsampling: each split or each tree is restricted to a random subset of features. XGBoost calls these colsample_bytree, colsample_bylevel, and colsample_bynode. LightGBM uses feature_fraction. Row subsampling and column subsampling can be combined and they regularize the model in different ways. Column subsampling is the same idea Breiman used in random forests, applied here on top of boosting.
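A minimal sketch of both knobs using XGBoost's scikit-learn wrapper (assuming the xgboost package is installed; the synthetic dataset, tree count, and the 0.5 / 0.8 fractions are illustrative choices, not recommendations):

```python
# Stochastic gradient boosting with row and column subsampling.
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.05,
    subsample=0.5,          # each tree is fit on a random 50% of the rows
    colsample_bytree=0.8,   # each tree sees a random 80% of the columns
    random_state=0,
)
model.fit(X, y)
```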
Random forests (Breiman, 2001) build an ensemble of decision trees where each tree is fit on a bootstrap sample of the training data, that is, a sample of size n drawn with replacement from the n rows of the training set. About 36.8% of the original rows are left out of any given bootstrap (the out-of-bag set), and these can be used as a free validation set for each tree. Bootstrap row sampling is the bagging part of the algorithm. On top of bagging, each split considers only a random subset of features (often sqrt(p) for classification or p/3 for regression with p features), which is the random subspace method of Ho (1998).
Extremely Randomized Trees, or Extra Trees (Geurts, Ernst, and Wehenkel, 2006), drop the bootstrap step and fit each tree on the full training set, while randomizing the cut-points of each split. They argue that the variance reduction from random splits already substitutes for the variance reduction from bootstrap sampling. In scikit-learn, both RandomForestClassifier and ExtraTreesClassifier expose a bootstrap flag and a max_samples argument that controls the fraction of rows drawn per tree.
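A small sketch of these scikit-learn controls (the synthetic data and the 0.8 fraction are illustrative):

```python
# Row-subsampling controls in scikit-learn's forest estimators.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=2_000, n_features=25, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,
    max_samples=0.8,   # each tree is fit on 80% of the rows, drawn with replacement
    oob_score=True,    # evaluate each tree on its out-of-bag rows
    random_state=0,
).fit(X, y)

et = ExtraTreesClassifier(
    n_estimators=200,
    bootstrap=False,   # Extra Trees default: every tree sees the full training set
    random_state=0,
).fit(X, y)

print(rf.oob_score_)   # free validation estimate from the out-of-bag rows
```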
Stochastic gradient descent (SGD) is itself a form of subsampling. Each step computes the gradient on a single example or a small mini-batch instead of on the full training set. This is what makes deep learning practical: with millions of training examples, computing the exact gradient at every step would be far too expensive. SGD in its single-example form goes back to Robbins and Monro (1951); mini-batch variants became standard practice in the 1990s and 2000s.
Mini-batch sizes typically range from 32 to 8192 in modern deep learning. Larger mini-batches give lower-variance gradient estimates but require more memory and need careful learning-rate scaling to avoid generalization losses (Goyal et al., 2017). Small mini-batches add noise that can act as a regularizer and help the optimizer escape sharp minima (Keskar et al., 2017). The interaction between batch size, learning rate, and generalization remains an active research question.
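As a plain illustration of the idea, a NumPy sketch of mini-batch SGD for least-squares regression (the synthetic data, batch size of 64, learning rate, and epoch count are arbitrary choices):

```python
# Mini-batch SGD for linear least squares: each update uses only a subsample of rows.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=10_000)

w = np.zeros(5)
batch_size, lr = 64, 0.05
for epoch in range(5):
    perm = rng.permutation(len(X))                    # reshuffle the rows every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]          # the subsampled mini-batch
        xb, yb = X[idx], y[idx]
        grad = 2 * xb.T @ (xb @ w - yb) / len(idx)    # gradient on the mini-batch only
        w -= lr * grad

print(w.round(2))   # approaches true_w despite never seeing a full-batch gradient
```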
Convolutional neural networks (CNNs) routinely subsample feature maps to reduce spatial resolution and to enlarge the receptive field of deeper layers. The original LeNet-5 (LeCun et al., 1998) called these layers S2 and S4, where S stood for subsampling. Each subsampling unit summed a 2x2 patch of the previous feature map, multiplied the sum by a trainable coefficient, added a trainable bias, and passed the result through a sigmoid.
Modern CNNs achieve subsampling in two ways:
| Method | Mechanism | Example |
|---|---|---|
| Pooling | Aggregate a small spatial window with max or average | 2x2 max pool with stride 2 discards 75% of activations |
| Strided convolution | Apply a convolution with stride > 1 | A 3x3 conv with stride 2 halves spatial dimensions |
A 2x2 max-pooling layer with stride 2 is the canonical building block: it picks the largest value from each non-overlapping 2x2 window, halving height and width and dropping the activation count to one quarter of the input. Strided convolutions, popularized by ResNet (He et al., 2016) and All-Convolutional Networks (Springenberg et al., 2015), achieve the same downsampling effect while learning the filter that performs it. Both approaches are subsampling because they reduce the number of spatial positions retained from one layer to the next, and both expand the receptive field of subsequent layers.
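A short PyTorch sketch showing that both operations halve the spatial resolution (the tensor shape and channel counts are illustrative):

```python
# 2x2 max pooling and a stride-2 convolution both subsample a feature map.
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)           # (batch, channels, height, width)

pool = nn.MaxPool2d(kernel_size=2, stride=2)
strided = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1)

print(pool(x).shape)     # torch.Size([1, 16, 16, 16]) -- 75% of activations dropped
print(strided(x).shape)  # torch.Size([1, 32, 16, 16]) -- learned downsampling filter
```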
There is a known interaction between aggressive subsampling and translation invariance: vanilla strided convolutions and pooling violate the shift-equivariance properties one would expect from a convolution. Zhang (2019) showed that adding a low-pass anti-aliasing filter before subsampling improves both robustness and accuracy on standard image benchmarks.
In classical signal processing, subsampling is a synonym for downsampling: keep only every M-th sample of a sampled signal, where M is the integer downsampling factor. The Nyquist-Shannon sampling theorem requires that a signal whose highest frequency component is B Hz be sampled at a rate of at least 2B Hz to avoid aliasing. When the sample rate is reduced, the new Nyquist limit is lower, so any spectral content above that limit must be removed before subsampling, or it will fold back into the retained band as aliasing artifacts. The standard practice is therefore to apply a low-pass anti-aliasing filter and then drop samples.
A few common cases:
| Domain | Original rate | Reduced rate | Notes |
|---|---|---|---|
| Audio for speech models | 48 kHz studio | 16 kHz model input | Many speech recognizers and TTS models expect 16 kHz |
| Image downsampling | 1024x1024 image | 256x256 image | 4x downsampling per dimension; needs blur or area filter |
| Telephony | 8 kHz audio | n/a | Low rate is itself a result of historical subsampling |
| Lidar / sensor logs | 100 Hz | 10 Hz | Often subsampled to match a slower controller loop |
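The audio case in the table can be sketched with SciPy, whose signal.decimate applies a low-pass anti-aliasing filter before keeping every M-th sample; the 440 Hz and 9 kHz test tones below are illustrative:

```python
# 48 kHz -> 16 kHz (factor M = 3), with and without an anti-aliasing filter.
import numpy as np
from scipy import signal

fs = 48_000
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 9_000 * t)

naive = x[::3]                    # plain decimation: the 9 kHz tone folds to 7 kHz
filtered = signal.decimate(x, 3)  # low-pass filter first, then keep every 3rd sample
```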
In image processing, common anti-aliasing filters before subsampling include Gaussian blur, area averaging, Lanczos kernels, and bicubic kernels. Pillow, OpenCV, and TensorFlow all expose these as resize options. Skipping the filter and simply decimating pixels produces the visible aliasing pattern known as moiré.
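Rather than a library resize call, the same blur-then-decimate idea can be written out explicitly with SciPy's Gaussian filter (the random stand-in image and the sigma rule of thumb are assumptions for the example):

```python
# 4x image downsampling with and without a blur-based anti-aliasing step.
import numpy as np
from scipy import ndimage

img = np.random.rand(1024, 1024)        # stand-in for a grayscale image

naive = img[::4, ::4]                   # plain pixel decimation: prone to moiré
blurred = ndimage.gaussian_filter(img, sigma=2.0)   # sigma ~ half the factor is a rough rule of thumb
antialiased = blurred[::4, ::4]         # then keep every 4th pixel in each dimension
```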
When one class dominates a classification dataset, training on the raw data tends to bias the model toward the majority class. A common remedy is to subsample the majority class so that the class proportions become more balanced. This is usually called undersampling rather than subsampling, but the operation is the same: random selection without replacement from the more numerous class.
| Method | Description |
|---|---|
| Random undersampling | Pick majority-class examples uniformly at random until the desired ratio is reached |
| Tomek links | Remove majority-class examples that form a Tomek link with a minority example, cleaning the decision boundary |
| NearMiss-1 | Keep majority examples whose average distance to the three nearest minority examples is smallest |
| NearMiss-2 | Keep majority examples whose average distance to the three farthest minority examples is smallest |
| NearMiss-3 | For each minority example, retain a fixed number of nearest majority neighbors |
| EditedNearestNeighbours | Drop majority examples whose neighbors disagree with their label |
| Cluster centroids | Replace majority examples with the centroids of clusters fit to that class |
These methods are all available in the open-source imbalanced-learn Python package, which integrates with scikit-learn pipelines. See the imbalanced dataset and undersampling articles for a deeper discussion of when each method works.
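A minimal sketch of random undersampling with imbalanced-learn (assuming the package is installed; the synthetic 9:1 class ratio and the target 1:2 ratio are illustrative):

```python
# Random undersampling of the majority class with imbalanced-learn.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(
    n_samples=11_000, n_features=10, weights=[0.9, 0.1], random_state=0
)
print(Counter(y))                      # roughly 9,900 majority vs 1,100 minority

rus = RandomUnderSampler(sampling_strategy=0.5, random_state=0)  # minority:majority = 1:2
X_res, y_res = rus.fit_resample(X, y)
print(Counter(y_res))                  # majority class subsampled to twice the minority count
```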
Word2vec (Mikolov et al., 2013) uses two separate subsampling tricks. The first is negative sampling, an alternative to the full softmax over the vocabulary. For each true (target word, context word) pair, the model samples a small number of "negative" words (typically between 5 and 20 for small datasets, 2 to 5 for large ones) drawn from the vocabulary according to the unigram distribution raised to the 3/4 power. The model then trains a binary classifier to distinguish the true context from the negatives. This replaces an O(|V|) softmax with a constant-time update.
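A small sketch of that noise distribution (the vocabulary and counts are made up for illustration):

```python
# Negative-sampling noise distribution: unigram counts raised to the 3/4 power.
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "serendipity"]
counts = np.array([5_000_000, 12_000, 9_000, 3_000_000, 7_000, 40], dtype=float)

noise_dist = counts ** 0.75
noise_dist /= noise_dist.sum()                        # renormalize to a probability vector

rng = np.random.default_rng(0)
negatives = rng.choice(vocab, size=5, p=noise_dist)   # 5 negatives per true pair
print(dict(zip(vocab, noise_dist.round(4))), negatives)
```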
The second trick is subsampling of frequent words. Common words such as "the", "and", "a" appear in many contexts and contribute little information per occurrence. Mikolov et al. discard each occurrence of word w with probability:
P(discard w) = 1 - sqrt(t / f(w))
where f(w) is the relative frequency of w in the corpus and t is a threshold (the original paper recommends t = 1e-5). Words with f(w) below t are never discarded, while very frequent words are discarded most of the time. The published implementation uses a slightly different formula, but the qualitative effect is the same. Mikolov reported that this scheme accelerated training by about 2x and improved accuracy on word-similarity tasks because it gave the model relatively more training signal from rare, semantically rich words.
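A toy implementation of the discard rule as written above (the word counts and the assumed one-billion-token corpus size are made up; t = 1e-5 follows the paper):

```python
# Frequent-word subsampling: P(discard w) = 1 - sqrt(t / f(w)).
import math
import random

total_tokens = 1_000_000_000     # hypothetical corpus size
counts = {"the": 50_000_000, "learning": 1_200_000, "serendipity": 900}
t = 1e-5

def keep_probability(word: str) -> float:
    f = counts[word] / total_tokens          # relative frequency f(w)
    return min(1.0, math.sqrt(t / f))        # words with f(w) <= t are always kept

def subsample(tokens: list[str]) -> list[str]:
    return [w for w in tokens if random.random() < keep_probability(w)]

for w in counts:
    print(w, round(keep_probability(w), 4))  # "the" ~0.014, "serendipity" 1.0
```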
The same family of ideas reappears in noise-contrastive estimation (Gutmann and Hyvarinen, 2010) and in modern contrastive learning, where positive pairs are augmented and negatives are sampled from the batch.
In theoretical statistics, subsampling has a precise technical meaning: given an i.i.d. sample of size n from an unknown distribution, draw subsamples of size m without replacement, where m grows more slowly than n (so m/n -> 0 as n -> infinity). Compute the statistic of interest on each subsample and form an empirical distribution of the suitably normalized values. Politis and Romano (1994) proved that this procedure yields asymptotically valid confidence regions under remarkably weak assumptions, essentially only that the statistic has a non-degenerate limiting distribution.
This is different from the bootstrap, where one resamples with replacement at the same size n. The bootstrap requires stronger conditions to be consistent (smoothness of the statistic, finite second moments, and so on), and it can fail in heavy-tailed or non-regular cases where subsampling still works. The reference text is the Springer monograph by Politis, Romano, and Wolf (1999), which extends the i.i.d. theory to time series, random fields, and dependent data. The price for the weaker assumptions is a slower convergence rate and the need to choose the block size m, often via a data-driven calibration rule.
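A minimal sketch of the i.i.d. recipe for the sample mean (the exponential data, the number of subsamples, and the simple rule m = sqrt(n) are illustrative choices, not the calibrated block-size selection the monograph describes):

```python
# m-out-of-n subsampling confidence interval for the mean (Politis-Romano style).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_exponential(2_000)          # placeholder i.i.d. sample
n = len(x)
m = int(n ** 0.5)                            # subsample size: m -> inf, m/n -> 0
theta_n = x.mean()                           # statistic on the full sample

roots = []
for _ in range(2_000):
    sub = rng.choice(x, size=m, replace=False)        # subsample WITHOUT replacement
    roots.append(np.sqrt(m) * (sub.mean() - theta_n)) # normalized subsample root
lo_q, hi_q = np.quantile(roots, [0.025, 0.975])

# Invert the approximated distribution of sqrt(n) * (theta_n - theta) for a 95% CI.
ci = (theta_n - hi_q / np.sqrt(n), theta_n - lo_q / np.sqrt(n))
print(theta_n, ci)
```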
Standard self-attention in transformers has O(L^2) cost in the sequence length L, because every token attends to every other token. Several efficient transformer variants reduce this cost by subsampling the set of token pairs that participate in attention: Reformer (Kitaev et al., 2020) hashes tokens into buckets with locality-sensitive hashing and attends only within a bucket, Longformer combines a sliding local window with a few global tokens, and BigBird adds randomly chosen token pairs on top of that.
In each case, subsampling the L-by-L attention matrix is what buys the sub-quadratic cost. The trade-off is approximation error in the attention output, which is usually small enough in practice that long-context models can be trained.
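As an illustration of the simplest such pattern, a NumPy sketch of local-window attention in which each query attends only to keys within w positions; this is a toy, not any particular model's implementation:

```python
# Local-window attention: roughly L * (2w + 1) token pairs survive instead of L^2.
import numpy as np

L, d, w = 128, 16, 8
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(L, d)) for _ in range(3))

scores = q @ k.T / np.sqrt(d)
idx = np.arange(L)
mask = np.abs(idx[:, None] - idx[None, :]) <= w      # keep only nearby token pairs
scores = np.where(mask, scores, -np.inf)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))   # masked softmax
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v
print(mask.sum(), "of", L * L, "token pairs retained")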
Despite the variety of contexts, the same practical question recurs whenever subsampling is introduced: how much to subsample. Reasonable starting points are summarized below.
| Context | Typical starting point | Notes |
|---|---|---|
| Row subsampling in boosting and forests | 0.5 to 0.8 of the training rows per tree | Smaller fractions speed up training and add regularization; larger fractions reduce stochastic noise |
| Mini-batch SGD | 32 to 512 for tabular and small image data; 1024 to 65536 for large vision and language models | Scale the learning rate with the batch size, often via the linear or square-root scaling rule |
| CNN pooling and strides | 2x2 window with stride 2 | Inception uses 3x3 windows with stride 2; many modern architectures replace pooling with strided convolutions entirely |
| Statistical subsampling | Block size m with m -> infinity and m/n -> 0 | Often chosen by minimizing a calibration criterion such as the minimum-volatility method |
Imagine you have a giant box of LEGO with thousands of bricks. There are different ways to use just some of them.
If you want to build a quick test castle, you might grab a handful at random instead of sorting through the whole box. That is mini-batch subsampling: it is faster than dumping the box on the floor every time. If you want a friend to also build a castle and then compare, you might give them a different handful, and looking at both castles together gives you a better idea of what is in the box. That is what random forests and stochastic gradient boosting do.
If your friend takes a photo of your castle but the picture is too big to send, they might keep only every other pixel to make it smaller. If they do not blur the photo first, the fine stripes of the bricks will look weird and jagged. That is signal subsampling, and the blur is the anti-aliasing filter.
And if you have a thousand red bricks but only ten blue ones, your friend might say "let me only look at fifty red bricks so I do not forget about the blue ones." That is undersampling for imbalanced data.
All of these are subsampling: pick fewer items, pick them carefully, and pay attention to what you might lose.