Sampling with replacement is a method of drawing items from a population in which each selected item is returned to the pool before the next draw, so the same item can be chosen more than once. The procedure is the workhorse behind a long list of statistical and machine learning techniques, including the bootstrap, bagging, the random forest, and many Monte Carlo procedures. It contrasts with sampling without replacement, where every item is drawn at most once, which is the default for most survey designs and for shuffling-based mini-batching in stochastic gradient descent.
The distinction matters because the two schemes induce different probability structures. With replacement, draws are independent and identically distributed, so the probability of selecting any particular item stays constant from draw to draw. Without replacement, the population shrinks after each pick, and successive draws are dependent. That single difference, independence versus dependence, is what makes the with-replacement scheme so attractive when statisticians want to mimic the data-generating process from a single observed sample.
Given a population of size N, sampling with replacement draws n items by repeating the following step n times: pick one item uniformly at random from the N items, record it, and put it back. Each draw is an independent trial with the same probability distribution, so the resulting sequence of n picks forms an i.i.d. sequence. When the population is finite and uniformly weighted, every item has probability 1/N of being chosen on any given draw, regardless of what was drawn before.
The sample can also use non-uniform weights, in which case each item i has probability p_i of being selected, with the weights summing to 1. Weighted sampling with replacement is the standard mechanism behind importance sampling, particle filters, and several reinforcement learning replay schemes.
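Both the uniform and the weighted procedure can be sketched from scratch with only the Python standard library. The function name `sample_with_replacement` below is ours, not a library API; it is a minimal illustration, not a production implementation:

```python
import random

def sample_with_replacement(population, n, weights=None, seed=0):
    """Draw n items i.i.d. from population; duplicates are allowed.

    With weights=None every item has probability 1/N on each draw;
    otherwise item i is drawn with probability weights[i] (summing to 1).
    """
    rng = random.Random(seed)
    if weights is None:
        # Uniform case: each draw picks an index independently
        return [population[rng.randrange(len(population))] for _ in range(n)]
    # random.choices implements weighted sampling with replacement directly
    return rng.choices(population, weights=weights, k=n)

draws = sample_with_replacement(["a", "b", "c"], n=10)
print(draws)  # duplicates can, and usually do, appear
```

Note that the urn never empties: `n` can exceed the population size, which is impossible without replacement.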
A convenient mental model is an urn with N balls. To sample with replacement, you draw a ball, note it, and drop it back in the urn before the next draw. The urn never empties, and the same ball can come out many times in a row. Sampling without replacement removes the ball after each draw, so the urn shrinks as you go.
The two schemes lead to different downstream properties for variance, sample composition, and the form of the resulting estimators.
| Property | With replacement | Without replacement |
|---|---|---|
| Independence of draws | i.i.d. | Dependent |
| Probability of selecting item on draw k | 1/N (constant) | Depends on prior draws |
| Sample can contain duplicates | Yes | No |
| Maximum sample size | Unlimited | At most N |
| Variance of sample mean (size n, population variance sigma squared) | sigma squared / n | (sigma squared / n) times (N - n) / (N - 1) |
| Typical use cases | Bootstrap, bagging, MCMC proposals, weighted resampling | Survey sampling, mini-batch SGD within an epoch, permutation tests, k-fold cross-validation |
| Computational cost | Cheap, draws are independent | Slightly higher, requires tracking what was drawn |
The finite-population correction factor (N - n) / (N - 1) shows that without-replacement sampling has lower variance when the sample is a meaningful fraction of the population. As N grows large relative to n, the two schemes become nearly indistinguishable. This is one reason why bootstrap theory is usually framed for large N, where the with-replacement and without-replacement distinctions blur.
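The correction factor can be checked numerically. The sketch below, with an arbitrary population size, sample size, and replication count, simulates both schemes and compares the empirical variances of the sample mean to the formulas in the table:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(size=1000)
N, n = population.size, 200
sigma2 = population.var()                     # population variance (divisor N)

# Theoretical variance of the sample mean under each scheme
var_with = sigma2 / n
var_without = var_with * (N - n) / (N - 1)    # finite-population correction

# Empirical check: simulate both schemes many times
reps = 5000
means_with = [rng.choice(population, size=n, replace=True).mean() for _ in range(reps)]
means_without = [rng.choice(population, size=n, replace=False).mean() for _ in range(reps)]
print(var_with, np.var(means_with))           # the pair should roughly agree
print(var_without, np.var(means_without))     # smaller by the factor (N - n)/(N - 1)
```

With n equal to a fifth of the population, the factor is about 0.80, so the without-replacement variance is visibly smaller; shrink n relative to N and the gap closes.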
The most quoted fact about sampling with replacement comes from the bootstrap setting, where a sample of size n is drawn with replacement from a population of size N = n. The probability that a particular item is not selected on any single draw is (1 - 1/N), and the probability that it is missed across all n draws is:
P(item not chosen) = (1 - 1/N)^n
When n equals N and N is large, this expression has a famous limit. Using the standard identity that (1 - 1/N)^N approaches 1/e as N goes to infinity, the probability that any specific item never appears in a bootstrap sample approaches 1/e, which is approximately 0.368. The probability that an item does appear at least once is therefore approximately 1 - 1/e, or about 0.632.
This is why a bootstrap sample of size n drawn from a population of size n contains, on average, about 63.2 percent unique items. The remaining 36.8 percent of original items are absent from the sample, and the slots they would have occupied are taken up by duplicates of items that were chosen multiple times. The 63.2 percent figure converges quickly: even at N = 100, the probability of inclusion is already very close to its asymptotic value.
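A quick simulation confirms the limit; the population size and trial count here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials = 1000, 2000
# Fraction of distinct items in a size-N with-replacement draw, averaged over trials
unique_frac = np.mean([np.unique(rng.integers(0, N, size=N)).size / N
                       for _ in range(trials)])
print(unique_frac)              # close to 0.632
print(1 - (1 - 1/N) ** N)       # exact inclusion probability for N = 1000
```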
This fact has direct practical consequences. In a random forest, the items not chosen for any individual tree's bootstrap sample are called out-of-bag samples. They are about 36.8 percent of the training set per tree, and they form a free held-out set that can be used to estimate generalization error without a separate validation split.
Sampling with replacement underpins the bootstrap, a resampling method introduced by Bradley Efron in his 1979 paper "Bootstrap Methods: Another Look at the Jackknife," published in the Annals of Statistics. The core idea is to treat the observed sample as a stand-in for the unknown population. To approximate the sampling distribution of a statistic, the bootstrap repeatedly draws synthetic samples from the original data with replacement, recomputes the statistic on each, and uses the resulting empirical distribution to estimate standard errors, bias, and confidence intervals.
The bootstrap replaces analytical derivations with computation. Before the bootstrap, computing the standard error of a complicated estimator like a sample median or a regression coefficient often required either restrictive distributional assumptions or laborious calculus. Efron's insight was that resampling with replacement from the data itself produces an estimate of the sampling distribution that converges, under regularity conditions, to the truth. Bootstrap methods became practical only with the rise of cheap computing.
A few common bootstrap variants illustrate why with-replacement sampling is the right tool:
| Variant | What it does | Typical use |
|---|---|---|
| Percentile bootstrap | Reads off the 2.5 and 97.5 percentiles of the bootstrap statistic distribution | Quick confidence intervals when the statistic is roughly symmetric |
| Basic (reverse) bootstrap | Reflects bootstrap quantiles around the observed statistic | Alternative to percentile when distribution is skewed |
| Bias-corrected and accelerated (BCa) | Corrects for bias and skewness using a bias factor and a jackknife-based acceleration constant | Default choice for accurate intervals; introduced by Efron in 1987 |
| Studentized bootstrap | Standardizes each bootstrap statistic by an estimated standard error | Better coverage when sample size is small |
| Bootstrap hypothesis test | Resamples under a null distribution to compute p-values | Tests where analytical null distributions are intractable |
| Block bootstrap | Resamples contiguous blocks rather than single observations | Time series and other dependent data, introduced by Kunsch in 1989 |
All of these schemes inherit the basic resampling-with-replacement step from Efron's original construction.
Leo Breiman borrowed Efron's resampling idea for prediction in his 1996 paper "Bagging Predictors," published in Machine Learning. Bagging, short for bootstrap aggregating, trains a collection of base models on bootstrap samples of the training data and combines their outputs by averaging for regression or majority vote for classification. Each base model sees a slightly different view of the data, and aggregating their predictions reduces variance without increasing bias much. Breiman showed that bagging works best when the base learner is unstable, meaning small changes in the training data produce different models. Decision trees are the canonical example.
Five years later, Breiman extended bagging into the random forest, described in his 2001 Machine Learning paper of the same name. A random forest fits many decision trees, each on its own bootstrap sample, and adds a second layer of randomness by considering only a random subset of features at each split. The two sources of randomness, bootstrap sampling and random feature selection, decorrelate the trees so that aggregation gives a larger variance reduction than bagging alone.
The random forest also exploits the 36.8 percent out-of-bag fact. For each training example, the trees that did not see it during fitting form a held-out ensemble that can predict its label. Aggregating these out-of-bag predictions across the forest yields the out-of-bag error estimate, which approximates a leave-one-out cross-validation score without the expense of repeated retraining. The out-of-bag error is a side effect of with-replacement sampling that has no clean analogue under without-replacement schemes.
Resampling with replacement appears throughout statistics and machine learning beyond the bootstrap and tree ensembles.
| Application | How sampling with replacement is used |
|---|---|
| Sequential Monte Carlo | Particle filters and other sequential Monte Carlo methods include resampling steps that redraw particles with replacement, weighted by importance, to focus computation on high-likelihood regions |
| Importance resampling | Convert weighted samples from a proposal distribution into approximately unweighted samples from a target by resampling with probabilities proportional to the weights |
| Bootstrap aggregating in deep learning | Snapshot ensembles and stochastic weight averaging take inspiration from bagging, although they typically use without-replacement minibatches inside each model |
| Reinforcement learning replay buffers | Prioritized experience replay samples transitions from the buffer with probabilities proportional to TD error, with replacement, so high-priority transitions can appear multiple times in a minibatch |
| Polyak averaging | Averages parameter snapshots taken across training, where the snapshots themselves come from minibatches that may be drawn with or without replacement depending on the implementation |
| Class rebalancing | When the positive class is rare, upsampling duplicates positives by sampling them with replacement to match the negative class count |
| Bootstrap evaluation of ML metrics | Confidence intervals for accuracy, F1 score, AUC, and other test-set metrics are routinely computed by resampling the test set with replacement |
In each case, the with-replacement step does the same job that it does in the bootstrap: it manufactures additional pseudo-samples from a fixed dataset so that the variance of an estimator can be quantified or so that an ensemble can see diverse training views.
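The importance-resampling entry in the table above can be sketched in a few lines: draw from an easy proposal, weight by the target density, then resample with replacement proportionally to the weights. The proposal, target, sizes, and seed are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Proposal: uniform on [-5, 5]. Target: standard normal (unnormalized).
proposal = rng.uniform(-5, 5, size=20000)
log_w = -0.5 * proposal ** 2                  # log of the unnormalized target density
w = np.exp(log_w - log_w.max())               # subtract max for numerical stability
w /= w.sum()                                  # normalized importance weights

# Resample WITH replacement, proportional to the weights
resampled = rng.choice(proposal, size=20000, replace=True, p=w)
print(resampled.mean(), resampled.std())      # roughly 0 and 1, matching the target
```

The resampled points are an approximately unweighted draw from the target; heavily weighted proposal points appear many times, negligible ones vanish.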
Many machine learning workflows use sampling without replacement because each observation should appear at most once per pass over the data.
As a rule of thumb, with-replacement sampling is the right choice when the goal is to mimic the act of drawing fresh samples from an underlying population. Without-replacement sampling is the right choice when the goal is to partition a fixed dataset into non-overlapping pieces, or to ensure each example contributes exactly one gradient signal per epoch.
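The rule of thumb comes down to two different NumPy calls: a permutation for an epoch, a with-replacement draw for a bootstrap. The toy index array below is ours:

```python
import numpy as np

rng = np.random.default_rng(0)
indices = np.arange(8)

# Without replacement: one epoch visits every example exactly once
epoch_order = rng.permutation(indices)
print(sorted(epoch_order.tolist()))           # a permutation of 0..7, no repeats

# With replacement: a bootstrap draw may repeat some indices and omit others
boot = rng.choice(indices, size=8, replace=True)
print(np.unique(boot).size)                   # typically fewer than 8 distinct indices
```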
Both NumPy and scikit-learn expose with-replacement sampling through standard APIs.
```python
import numpy as np

# Uniform sampling with replacement using NumPy
rng = np.random.default_rng(seed=42)
data = np.array([10, 20, 30, 40, 50])
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
# Possible result: array([30, 10, 40, 30, 20])

# Weighted sampling with replacement
weights = np.array([0.1, 0.1, 0.6, 0.1, 0.1])
weighted_sample = rng.choice(data, size=10, replace=True, p=weights)
```
The legacy interface numpy.random.choice and the modern Generator.choice from NumPy both default to replace=True. Setting replace=False switches to sampling without replacement, but then size cannot exceed the length of the input.
Scikit-learn provides sklearn.utils.resample for the same task in a tabular context:
```python
from sklearn.utils import resample
import numpy as np

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Bootstrap sample of size 10 from a 10-row dataset
X_boot, y_boot = resample(X, y, replace=True, n_samples=10, random_state=0)
```
For confidence interval construction, SciPy added scipy.stats.bootstrap in version 1.7. It supports the percentile, basic, and BCa methods, and handles the resampling, statistic recomputation, and quantile extraction in a single call:
```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=5.0, scale=2.0, size=200)
result = bootstrap((sample,), np.mean, n_resamples=10000,
                   confidence_level=0.95, method='BCa', random_state=rng)
print(result.confidence_interval)
```
The BootstrapMethod helper in newer SciPy versions also lets users configure bootstrap-based confidence intervals for other statistics, such as the correlation coefficient returned by scipy.stats.pearsonr, with finer control over the resampling.
A bootstrap sample drawn from a fixed dataset has several useful properties that follow directly from the with-replacement structure.
| Statistic | Value under with-replacement sampling |
|---|---|
| Probability a specific item appears at least once (n = N, large N) | Approximately 0.632 |
| Probability a specific item is missed entirely (n = N, large N) | Approximately 0.368 |
| Expected number of unique items in a bootstrap sample of size N | Approximately 0.632 N |
| Expected number of duplicate slots (size n, drawn with replacement) | n minus expected unique count |
| Variance of the bootstrap mean (sample variance s squared, size n) | (n - 1) s squared / n squared, approximately s squared / n for large n |
| Number of distinct possible bootstrap samples of size n from N items | C(N + n - 1, n), the number of multisets of size n; equals C(2N - 1, N) when n = N |
The number of possible distinct bootstrap samples grows enormously with N, which is why practical bootstrap implementations rely on Monte Carlo approximation: drawing a few thousand bootstrap replicates rather than enumerating all of them.
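The multiset count is easy to compute exactly with the standard library; `n_bootstrap_samples` is a hypothetical helper name, not a library function:

```python
import math

def n_bootstrap_samples(N, n):
    """Number of distinct multisets of size n drawn from N items: C(N + n - 1, n)."""
    return math.comb(N + n - 1, n)

print(n_bootstrap_samples(5, 5))      # 126
print(n_bootstrap_samples(20, 20))    # 68923264410, already about 6.9e10
```

Even a 20-item dataset admits tens of billions of distinct bootstrap samples, so a few thousand Monte Carlo replicates barely scratch the enumeration yet suffice for quantile estimates.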
Sampling with replacement is powerful but not universal.
The bootstrap can underestimate variance when the statistic is a function of extreme order statistics, such as the sample maximum or minimum. The maximum of a bootstrap sample is at most the maximum of the original data, so the bootstrap distribution of the max has a point mass at the observed max and cannot capture true tail variability.
The assumption that observations are independent and identically distributed breaks down for time series, spatial data, and other forms of dependent data. Naive bootstrap resampling destroys the temporal or spatial dependence structure. The fix is to resample blocks rather than individual observations: Hans Kunsch introduced the moving block bootstrap in 1989, and Dimitris Politis and Joseph Romano introduced the stationary bootstrap in 1994 to give a smoothed alternative. These block-based variants preserve short-range dependence within each block.
With-replacement resampling can also be inappropriate when the original dataset is itself unrepresentative. The bootstrap can only reflect the distribution it sees, so a biased sample produces biased bootstrap inference. The bootstrap is not a cure for bad data.
In machine learning, naive bagging on top of an already low-variance learner offers little benefit and may even hurt accuracy. Breiman observed in 1996 that bagging works only when the base learner is unstable: if the model is already smooth and stable, perturbing the training set does not produce diverse ensemble members, and aggregation cannot reduce a variance that is not there.
Finally, very small samples cause the bootstrap to inherit the small-sample problems of the underlying estimator. With only a handful of observations, even resampling many times cannot manufacture meaningful information that was not present in the data. Asymptotic guarantees about bootstrap consistency presume the original sample is large enough for the empirical distribution to approximate the true distribution.
Resampling-with-replacement methods sit at the heart of how scientific results are reported. Bootstrap confidence intervals appear routinely in clinical trials, econometrics papers, ecology, and machine learning benchmarks. SciPy's bootstrap function and the comparable boot package in R have made these methods accessible to researchers who do not want to implement the resampling logic by hand.
In modern machine learning evaluation, bootstrap confidence intervals are used to compare models, particularly when the test set is small or when reporting per-class metrics. The standard recipe is to resample the test set with replacement, recompute the metric, and report the 2.5 and 97.5 percentiles as a 95 percent interval. This is far more honest than reporting a single point estimate, and many leaderboards and benchmarks now expect it.
Tree ensembles built on bootstrap samples remain among the most reliable tabular learners. Modern implementations like XGBoost and LightGBM use bagging-style row subsampling alongside their primary boosting mechanism, blending Breiman's bagging idea with gradient boosting. The 36.8 percent out-of-bag estimate is still cited as a defensible alternative to k-fold cross-validation when computational budget is tight.
Beyond classical statistics and tree ensembles, with-replacement sampling appears in modern reinforcement learning replay buffers, in importance-weighted variational inference, and in particle-based generative models. The same independence-preserving property that made the bootstrap work in 1979 continues to make resampling with replacement the natural choice whenever computation needs to mimic independent draws from an unknown distribution.
| Library or function | Defaults |
|---|---|
| numpy.random.choice(a, size, replace=True, p=None) | Uniform with replacement when p is omitted; supports weighted sampling |
| numpy.random.Generator.choice | Modern, recommended NumPy interface with the same replace flag |
| sklearn.utils.resample(*arrays, replace=True, n_samples=None) | Bootstrap-style resampling for paired feature and label arrays; preserves alignment between X and y |
| scipy.stats.bootstrap(data, statistic, n_resamples, method) | Confidence intervals via percentile, basic, or BCa, available since SciPy 1.7 |
| scipy.stats.BootstrapMethod | Helper class for configuring bootstrap-based confidence intervals for other SciPy statistics |
| pandas.DataFrame.sample(n, replace=True) | Row sampling with replacement at the DataFrame level |
| R boot::boot and boot::boot.ci | The reference R package for bootstrap statistics; supports normal, basic, percentile, BCa, and Studentized intervals |
| TensorFlow tf.random.categorical | Draws indices with replacement from logits, the building block for sequence sampling and weighted resampling |
| PyTorch torch.multinomial(input, num_samples, replacement=True) | Weighted sampling with replacement, used heavily in language model decoding and replay buffers |
These implementations all rest on the same underlying procedure: pick an index uniformly or with given weights, record it, repeat. The variation across libraries lies in vectorization, random number generator handling, and convenience wrappers for confidence intervals.