Sampling with replacement is a method of drawing items from a population in which each selected item is returned to the pool before the next draw, so the same item can be chosen more than once. The procedure is the workhorse behind a long list of statistical and machine learning techniques, including the bootstrap, bagging, the random forest, and many Monte Carlo procedures. It contrasts with sampling without replacement, where every item is drawn at most once, which is the default for most survey designs and for shuffling-based mini-batching in stochastic gradient descent.
The distinction matters because the two schemes induce different probability structures. With replacement, draws are independent and identically distributed, so the probability of selecting any particular item stays constant from draw to draw. Without replacement, the population shrinks after each pick, and successive draws are dependent. That single difference, independence versus dependence, is what makes the with-replacement scheme so attractive when statisticians want to mimic the data-generating process from a single observed sample.
Given a population of size N, sampling with replacement draws n items by repeating the following step n times: pick one item uniformly at random from the N items, record it, and put it back. Each draw is an independent trial with the same probability distribution, so the resulting sequence of n picks forms an i.i.d. sequence. When the population is finite and uniformly weighted, every item has probability 1/N of being chosen on any given draw, regardless of what was drawn before.
The sample can also use non-uniform weights, in which case each item i has probability p_i of being selected, with the weights summing to 1. Weighted sampling with replacement is the standard mechanism behind importance sampling, particle filters, and several reinforcement learning replay schemes.
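Both the uniform and the weighted procedure can be sketched from scratch with only the Python standard library. The function name `sample_with_replacement` below is ours, not a library API; it is a minimal illustration, not a production implementation:

```python
import random

def sample_with_replacement(population, n, weights=None, seed=0):
    """Draw n items i.i.d. from population; duplicates are allowed.

    With weights=None every item has probability 1/N on each draw;
    otherwise item i is drawn with probability weights[i] (summing to 1).
    """
    rng = random.Random(seed)
    if weights is None:
        # Uniform case: each draw picks an index independently
        return [population[rng.randrange(len(population))] for _ in range(n)]
    # random.choices implements weighted sampling with replacement directly
    return rng.choices(population, weights=weights, k=n)

draws = sample_with_replacement(["a", "b", "c"], n=10)
print(draws)  # duplicates can, and usually do, appear
```

Note that the urn never empties: `n` can exceed the population size, which is impossible without replacement.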
A convenient mental model is an urn with N balls. To sample with replacement, you draw a ball, note it, and drop it back in the urn before the next draw. The urn never empties, and the same ball can come out many times in a row. Sampling without replacement removes the ball after each draw, so the urn shrinks as you go.
The two schemes lead to different downstream properties for variance, sample composition, and the form of the resulting estimators.
| Property | With replacement | Without replacement |
|---|---|---|
| Independence of draws | i.i.d. | Dependent |
| Probability of selecting item on draw k | 1/N (constant) | Depends on prior draws |
| Sample can contain duplicates | Yes | No |
| Maximum sample size | Unlimited | At most N |
| Variance of sample mean (size n, population variance sigma squared) | sigma squared / n | (sigma squared / n) times (N - n) / (N - 1) |
| Typical use cases | Bootstrap, bagging, MCMC proposals, weighted resampling | Survey sampling, mini-batch SGD within an epoch, permutation tests, k-fold cross-validation |
| Computational cost | Cheap, draws are independent | Slightly higher, requires tracking what was drawn |
The finite-population correction factor (N - n) / (N - 1) shows that without-replacement sampling has lower variance when the sample is a meaningful fraction of the population. As N grows large relative to n, the two schemes become nearly indistinguishable. This is one reason why bootstrap theory is usually framed for large N, where the with-replacement and without-replacement distinctions blur.
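The correction factor can be checked numerically. The sketch below, with an arbitrary population size, sample size, and replication count, simulates both schemes and compares the empirical variances of the sample mean to the formulas in the table:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(size=1000)
N, n = population.size, 200
sigma2 = population.var()                     # population variance (divisor N)

# Theoretical variance of the sample mean under each scheme
var_with = sigma2 / n
var_without = var_with * (N - n) / (N - 1)    # finite-population correction

# Empirical check: simulate both schemes many times
reps = 5000
means_with = [rng.choice(population, size=n, replace=True).mean() for _ in range(reps)]
means_without = [rng.choice(population, size=n, replace=False).mean() for _ in range(reps)]
print(var_with, np.var(means_with))           # the pair should roughly agree
print(var_without, np.var(means_without))     # smaller by the factor (N - n)/(N - 1)
```

With n equal to a fifth of the population, the factor is about 0.80, so the without-replacement variance is visibly smaller; shrink n relative to N and the gap closes.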
The most quoted fact about sampling with replacement comes from the bootstrap setting, where a sample of size n is drawn with replacement from a population of size N = n. The probability that a particular item is not selected on any single draw is (1 - 1/N), and the probability that it is missed across all n draws is:
P(item not chosen) = (1 - 1/N)^n
When n equals N and N is large, this expression has a famous limit. Using the standard identity that (1 - 1/N)^N approaches 1/e as N goes to infinity, the probability that any specific item never appears in a bootstrap sample approaches 1/e, which is approximately 0.368. The probability that an item does appear at least once is therefore approximately 1 - 1/e, or about 0.632.
This is why a bootstrap sample of size n drawn from a population of size n contains, on average, about 63.2 percent unique items. The remaining 36.8 percent of original items are absent from the sample, and the slots they would have occupied are taken up by duplicates of items that were chosen multiple times. The 63.2 percent figure converges quickly: even at N = 100, the probability of inclusion is already very close to its asymptotic value.
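A quick simulation confirms the limit; the population size and trial count here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials = 1000, 2000
# Fraction of distinct items in a size-N with-replacement draw, averaged over trials
unique_frac = np.mean([np.unique(rng.integers(0, N, size=N)).size / N
                       for _ in range(trials)])
print(unique_frac)              # close to 0.632
print(1 - (1 - 1/N) ** N)       # exact inclusion probability for N = 1000
```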
This fact has direct practical consequences. In a random forest, the items not chosen for any individual tree's bootstrap sample are called out-of-bag samples. They are about 36.8 percent of the training set per tree, and they form a free held-out set that can be used to estimate generalization error without a separate validation split.
Sampling with replacement underpins the bootstrap, a resampling method introduced by Bradley Efron in his 1979 paper "Bootstrap Methods: Another Look at the Jackknife," published in the Annals of Statistics. The core idea is to treat the observed sample as a stand-in for the unknown population. To approximate the sampling distribution of a statistic, the bootstrap repeatedly draws synthetic samples from the original data with replacement, recomputes the statistic on each, and uses the resulting empirical distribution to estimate standard errors, bias, and confidence intervals.
The bootstrap replaces analytical derivations with computation. Before the bootstrap, computing the standard error of a complicated estimator like a sample median or a regression coefficient often required either restrictive distributional assumptions or laborious calculus. Efron's insight was that resampling with replacement from the data itself produces an estimate of the sampling distribution that converges, under regularity conditions, to the truth. Bootstrap methods became practical only with the rise of cheap computing.
A few common bootstrap variants illustrate why with-replacement sampling is the right tool:
| Variant | What it does | Typical use |
|---|---|---|
| Percentile bootstrap | Reads off the 2.5 and 97.5 percentiles of the bootstrap statistic distribution | Quick confidence intervals when the statistic is roughly symmetric |
| Basic (reverse) bootstrap | Reflects bootstrap quantiles around the observed statistic | Alternative to percentile when distribution is skewed |
| Bias-corrected and accelerated (BCa) | Corrects for bias and skewness using a bias factor and a jackknife-based acceleration constant | Default choice for accurate intervals; introduced by Efron in 1987 |
| Studentized bootstrap | Standardizes each bootstrap statistic by an estimated standard error | Better coverage when sample size is small |
| Bootstrap hypothesis test | Resamples under a null distribution to compute p-values | Tests where analytical null distributions are intractable |
| Block bootstrap | Resamples contiguous blocks rather than single observations | Time series and other dependent data, introduced by Kunsch in 1989 |
All of these schemes inherit the basic resampling-with-replacement step from Efron's original construction.
Leo Breiman borrowed Efron's resampling idea for prediction in his 1996 paper "Bagging Predictors," published in Machine Learning. Bagging, short for bootstrap aggregating, trains a collection of base models on bootstrap samples of the training data and combines their outputs by averaging for regression or majority vote for classification. Each base model sees a slightly different view of the data, and aggregating their predictions reduces variance without increasing bias much. Breiman showed that bagging works best when the base learner is unstable, meaning small changes in the training data produce different models. Decision trees are the canonical example.
Five years later, Breiman extended bagging into the random forest, described in his 2001 Machine Learning paper of the same name. A random forest fits many decision trees, each on its own bootstrap sample, and adds a second layer of randomness by considering only a random subset of features at each split. The two sources of randomness, bootstrap sampling and random feature selection, decorrelate the trees so that aggregation gives a larger variance reduction than bagging alone.
The random forest also exploits the 36.8 percent out-of-bag fact. For each training example, the trees that did not see it during fitting form a held-out ensemble that can predict its label. Aggregating these out-of-bag predictions across the forest yields the out-of-bag error estimate, which approximates a leave-one-out cross-validation score without the expense of repeated retraining. The out-of-bag error is a side effect of with-replacement sampling that has no clean analogue under without-replacement schemes.
Resampling with replacement appears throughout statistics and machine learning beyond the bootstrap and tree ensembles.
| Application | How sampling with replacement is used |
|---|---|
| Sequential Monte Carlo | Particle filters and other sequential Monte Carlo methods include resampling steps that redraw particles with replacement, weighted by importance, to focus computation on high-likelihood regions |
| Importance resampling | Convert weighted samples from a proposal distribution into approximately unweighted samples from a target by resampling with probabilities proportional to the weights |
| Bootstrap aggregating in deep learning | Snapshot ensembles and stochastic weight averaging take inspiration from bagging, although they typically use without-replacement minibatches inside each model |
| Reinforcement learning replay buffers | Prioritized experience replay samples transitions from the buffer with probabilities proportional to TD error, with replacement, so high-priority transitions can appear multiple times in a minibatch |
| Polyak averaging | Averages parameter snapshots taken across training, where the snapshots themselves come from minibatches that may be drawn with or without replacement depending on the implementation |
| Class rebalancing | When the positive class is rare, upsampling duplicates positives by sampling them with replacement to match the negative class count |
| Bootstrap evaluation of ML metrics | Confidence intervals for accuracy, F1 score, AUC, and other test-set metrics are routinely computed by resampling the test set with replacement |
In each case, the with-replacement step does the same job that it does in the bootstrap: it manufactures additional pseudo-samples from a fixed dataset so that the variance of an estimator can be quantified or so that an ensemble can see diverse training views.
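The importance-resampling entry in the table above can be sketched in a few lines: draw from an easy proposal, weight by the target density, then resample with replacement proportionally to the weights. The proposal, target, sizes, and seed are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Proposal: uniform on [-5, 5]. Target: standard normal (unnormalized).
proposal = rng.uniform(-5, 5, size=20000)
log_w = -0.5 * proposal ** 2                  # log of the unnormalized target density
w = np.exp(log_w - log_w.max())               # subtract max for numerical stability
w /= w.sum()                                  # normalized importance weights

# Resample WITH replacement, proportional to the weights
resampled = rng.choice(proposal, size=20000, replace=True, p=w)
print(resampled.mean(), resampled.std())      # roughly 0 and 1, matching the target
```

The resampled points are an approximately unweighted draw from the target; heavily weighted proposal points appear many times, negligible ones vanish.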
Many machine learning workflows use sampling without replacement because each observation should appear at most once per pass over the data.
As a rule of thumb, with-replacement sampling is the right choice when the goal is to mimic the act of drawing fresh samples from an underlying population. Without-replacement sampling is the right choice when the goal is to partition a fixed dataset into non-overlapping pieces, or to ensure each example contributes exactly one gradient signal per epoch.
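The rule of thumb comes down to two different NumPy calls: a permutation for an epoch, a with-replacement draw for a bootstrap. The toy index array below is ours:

```python
import numpy as np

rng = np.random.default_rng(0)
indices = np.arange(8)

# Without replacement: one epoch visits every example exactly once
epoch_order = rng.permutation(indices)
print(sorted(epoch_order.tolist()))           # a permutation of 0..7, no repeats

# With replacement: a bootstrap draw may repeat some indices and omit others
boot = rng.choice(indices, size=8, replace=True)
print(np.unique(boot).size)                   # typically fewer than 8 distinct indices
```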
Both NumPy and scikit-learn expose with-replacement sampling through standard APIs.
```python
import numpy as np

# Uniform sampling with replacement using NumPy
rng = np.random.default_rng(seed=42)
data = np.array([10, 20, 30, 40, 50])
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
# Possible result: array([30, 10, 40, 30, 20])

# Weighted sampling with replacement
weights = np.array([0.1, 0.1, 0.6, 0.1, 0.1])
weighted_sample = rng.choice(data, size=10, replace=True, p=weights)
```
The legacy interface numpy.random.choice and the modern Generator.choice from NumPy both default to replace=True. Setting replace=False switches to sampling without replacement, but then size cannot exceed the length of the input.
Scikit-learn provides sklearn.utils.resample for the same task in a tabular context:
```python
from sklearn.utils import resample
import numpy as np

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Bootstrap sample of size 10 from a 10-row dataset
X_boot, y_boot = resample(X, y, replace=True, n_samples=10, random_state=0)
```
For confidence interval construction, SciPy added scipy.stats.bootstrap in version 1.7. It supports the percentile, basic, and BCa methods, and handles the resampling, statistic recomputation, and quantile extraction in a single call:
```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=5.0, scale=2.0, size=200)
result = bootstrap((sample,), np.mean, n_resamples=10000,
                   confidence_level=0.95, method='BCa', random_state=rng)
print(result.confidence_interval)
```
The BootstrapMethod helper in newer SciPy versions also lets users configure bootstrap-based confidence intervals for other statistics, such as the correlation coefficient returned by scipy.stats.pearsonr, with finer control over the resampling.
A bootstrap sample drawn from a fixed dataset has several useful properties that follow directly from the with-replacement structure.
| Statistic | Value under with-replacement sampling |
|---|---|
| Probability a specific item appears at least once (n = N, large N) | Approximately 0.632 |
| Probability a specific item is missed entirely (n = N, large N) | Approximately 0.368 |
| Expected number of unique items in a bootstrap sample of size N | Approximately 0.632 N |
| Expected number of duplicate slots (size n, drawn with replacement) | n minus expected unique count |
| Variance of the bootstrap mean (sample variance s squared, size n) | (n - 1) s squared / n squared, approximately s squared / n for large n |
| Number of distinct possible bootstrap samples of size n from N items | C(N + n - 1, n), the number of multisets of size n; equals C(2N - 1, N) when n = N |
The number of possible distinct bootstrap samples grows enormously with N, which is why practical bootstrap implementations rely on Monte Carlo approximation: drawing a few thousand bootstrap replicates rather than enumerating all of them.
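The multiset count is easy to compute exactly with the standard library; `n_bootstrap_samples` is a hypothetical helper name, not a library function:

```python
import math

def n_bootstrap_samples(N, n):
    """Number of distinct multisets of size n drawn from N items: C(N + n - 1, n)."""
    return math.comb(N + n - 1, n)

print(n_bootstrap_samples(5, 5))      # 126
print(n_bootstrap_samples(20, 20))    # 68923264410, already about 6.9e10
```

Even a 20-item dataset admits tens of billions of distinct bootstrap samples, so a few thousand Monte Carlo replicates barely scratch the enumeration yet suffice for quantile estimates.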
Sampling with replacement is powerful but not universal.
The bootstrap can underestimate variance when the statistic is a function of extreme order statistics, such as the sample maximum or minimum. The maximum of a bootstrap sample is at most the maximum of the original data, so the bootstrap distribution of the max has a point mass at the observed max and cannot capture true tail variability.
The assumption that observations are independent and identically distributed breaks down for time series, spatial data, and other forms of dependent data. Naive bootstrap resampling destroys the temporal or spatial dependence structure. The fix is to resample blocks rather than individual observations: Hans Kunsch introduced the moving block bootstrap in 1989, and Dimitris Politis and Joseph Romano introduced the stationary bootstrap in 1994 to give a smoothed alternative. These block-based variants preserve short-range dependence within each block.
With-replacement resampling can also be inappropriate when the original dataset is itself unrepresentative. The bootstrap can only reflect the distribution it sees, so a biased sample produces biased bootstrap inference. The bootstrap is not a cure for bad data.
In machine learning, naive bagging on top of an already low-variance learner offers little benefit and may even hurt accuracy. Breiman observed in 1996 that bagging works only when the base learner is unstable: if the model is already smooth and stable, perturbing the training set does not produce diverse ensemble members, and aggregation cannot reduce a variance that is not there.
Finally, very small samples cause the bootstrap to inherit the small-sample problems of the underlying estimator. With only a handful of observations, even resampling many times cannot manufacture meaningful information that was not present in the data. Asymptotic guarantees about bootstrap consistency presume the original sample is large enough for the empirical distribution to approximate the true distribution.
Resampling-with-replacement methods sit at the heart of how scientific results are reported. Bootstrap confidence intervals appear routinely in clinical trials, econometrics papers, ecology, and machine learning benchmarks. SciPy's bootstrap function and the comparable boot package in R have made these methods accessible to researchers who do not want to implement the resampling logic by hand.
In modern machine learning evaluation, bootstrap confidence intervals are used to compare models, particularly when the test set is small or when reporting per-class metrics. The standard recipe is to resample the test set with replacement, recompute the metric, and report the 2.5 and 97.5 percentiles as a 95 percent interval. This is far more honest than reporting a single point estimate, and many leaderboards and benchmarks now expect it.
Tree ensembles built on bootstrap samples remain among the most reliable tabular learners. Modern implementations like XGBoost and LightGBM use bagging-style row subsampling alongside their primary boosting mechanism, blending Breiman's bagging idea with gradient boosting. The 36.8 percent out-of-bag estimate is still cited as a defensible alternative to k-fold cross-validation when computational budget is tight.
Beyond classical statistics and tree ensembles, with-replacement sampling appears in modern reinforcement learning replay buffers, in importance-weighted variational inference, and in particle-based generative models. The same independence-preserving property that made the bootstrap work in 1979 continues to make resampling with replacement the natural choice whenever computation needs to mimic independent draws from an unknown distribution.
| Library or function | Defaults |
|---|---|
| numpy.random.choice(a, size, replace=True, p=None) | Uniform with replacement when p is omitted; supports weighted sampling |
| numpy.random.Generator.choice | Modern, recommended NumPy interface with the same replace flag |
| sklearn.utils.resample(*arrays, replace=True, n_samples=None) | Bootstrap-style resampling for paired feature and label arrays; preserves alignment between X and y |
| scipy.stats.bootstrap(data, statistic, n_resamples, method) | Confidence intervals via percentile, basic, or BCa, available since SciPy 1.7 |
| scipy.stats.BootstrapMethod | Helper class for configuring bootstrap-based confidence intervals for other SciPy statistics |
| pandas.DataFrame.sample(n, replace=True) | Row sampling with replacement at the DataFrame level |
| R boot::boot and boot::boot.ci | The reference R package for bootstrap statistics; supports normal, basic, percentile, BCa, and Studentized intervals |
| TensorFlow tf.random.categorical | Draws indices with replacement from logits, the building block for sequence sampling and weighted resampling |
| PyTorch torch.multinomial(input, num_samples, replacement=True) | Weighted sampling with replacement, used heavily in language model decoding and replay buffers |
These implementations all rest on the same underlying procedure: pick an index uniformly or with given weights, record it, repeat. The variation across libraries lies in vectorization, random number generator handling, and convenience wrappers for confidence intervals.