See also: Monte Carlo methods, Importance sampling, Markov chain Monte Carlo, Speculative decoding, Bayesian inference
Rejection sampling, also called the accept-reject method or the acceptance-rejection method, is a foundational Monte Carlo technique for drawing independent samples from a target probability distribution using a simpler auxiliary (proposal) distribution. The method works by generating candidate samples from the proposal, then accepting or rejecting each candidate according to a probabilistic rule that exactly corrects the difference between the two distributions. First formalized by John von Neumann in 1951, rejection sampling is taught in virtually every graduate course on computational statistics, and variants of the same core rule power modern large language model acceleration techniques such as speculative decoding.
Let p(x) be a target density that is difficult to sample from directly but easy to evaluate (at least up to a normalizing constant). Let q(x) be a proposal density that is easy to sample from and easy to evaluate. Suppose there exists a finite constant M greater than or equal to 1 such that
p(x) <= M * q(x) for all x in the support of p.
The function M q(x) is called the envelope because its graph lies entirely above the graph of p(x). Rejection sampling then proceeds by drawing a candidate x from q, drawing an independent uniform random number u from the interval [0, 1], and accepting x whenever
u <= p(x) / (M * q(x)).
Any accepted x is distributed exactly according to p. Rejected candidates are discarded and the procedure is repeated until a sample is accepted [1][2].
Although the method is most often stated for continuous univariate densities, it applies equally to multivariate distributions, discrete distributions, and even mixed distributions, as long as a suitable envelope can be constructed.
Imagine you want a dart board shaped exactly like a funny cloud, but all you have is a big rectangular board. Here is a trick that always works: throw a dart uniformly at random anywhere on the rectangle. If the dart lands inside the cloud, keep it and call it a "good" dart. If it lands outside the cloud, throw it away and try again.
If you only look at the darts you kept, they are spread out perfectly and uniformly inside the cloud, which is exactly what you wanted. Rejection sampling is the same idea in math: the rectangle is a simple distribution you know how to sample from, the cloud is the complicated distribution you actually want, and the accept-or-reject rule is the "is the dart inside the cloud" check. The only downside is that if the rectangle is much bigger than the cloud, you throw away most of your darts and it takes a long time to get the samples you want.
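The dart-board picture can be simulated directly. The sketch below (plain Python, with a unit disk standing in for the "cloud") keeps only darts that land inside the disk; the kept points are exactly uniform on the disk, and the acceptance rate approaches the area ratio pi/4.

```python
import random

def uniform_on_disk(n, seed=0):
    """Throw darts uniformly at the square [-1, 1]^2 and keep only
    those landing inside the unit disk (the 'cloud')."""
    rng = random.Random(seed)
    kept, attempts = [], 0
    while len(kept) < n:
        attempts += 1
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1.0:  # is the dart inside the cloud?
            kept.append((x, y))
    return kept, attempts

points, attempts = uniform_on_disk(20_000)
accept_rate = len(points) / attempts  # approaches area(disk)/area(square) = pi/4
```

Because the acceptance rate estimates pi/4, this same experiment doubles as a (very inefficient) way to estimate pi, echoing the Buffon-style arguments mentioned below.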
Rejection sampling was introduced by the Hungarian-American mathematician John von Neumann in his 1951 paper "Various techniques used in connection with random digits," which appeared as a chapter in the proceedings of a Monte Carlo symposium held at the Institute for Numerical Analysis in Los Angeles in 1949 and published in National Bureau of Standards Applied Mathematics Series 12 [3][4]. The same paper introduced the middle-square method for generating pseudorandom numbers and contained von Neumann's famous quip that "anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin."
Von Neumann's original motivation came from the Los Alamos work on neutron transport and simulations of nuclear reactions during and after the Manhattan Project. Researchers needed to draw samples from complicated physical distributions using only the primitive random number generators available on early computers such as ENIAC and MANIAC I. The accept-reject rule gave a universal recipe: any density bounded by a constant multiple of a tractable proposal could be sampled with essentially no algebra, at the price of occasionally wasting a candidate draw.
Although von Neumann's formal statement is the first modern description of the algorithm, the underlying geometric idea has roots in earlier probabilistic arguments. Buffon's needle problem from the 18th century, which estimates pi by counting the fraction of dropped needles that cross a line, can be read as a rejection-style experiment on a uniform proposal [1]. From the 1960s onwards, Brian Ripley, George Marsaglia, Luc Devroye, and others developed highly optimized rejection algorithms for specific distributions, which form the basis of random variate generators in almost every scientific computing library today [2].
Markov chain Monte Carlo methods, developed from the 1950s through the 1980s (notably the Metropolis algorithm in 1953 and the Metropolis-Hastings generalization in 1970), eventually overtook pure rejection sampling as the dominant Monte Carlo tool for high-dimensional Bayesian problems. Rejection sampling nevertheless remained important both as a teaching device and as a subroutine inside more complex methods, including Gibbs sampling (through adaptive rejection sampling), particle filters, and, more recently, speculative decoding for LLMs.
To apply rejection sampling the user must supply three ingredients:

1. a target density p(x) that can be evaluated pointwise, at least up to a normalizing constant;
2. a proposal density q(x) that can be both evaluated pointwise and sampled from;
3. a finite constant M such that p(x) <= M q(x) everywhere on the support of p.

The pair (q, M) defines the envelope M q(x), which must dominate the target everywhere. The closer M q(x) hugs p(x), the fewer candidates will be rejected.
Given the setup above, one iteration of the algorithm consists of three operations:

1. Draw a candidate x from the proposal q.
2. Draw an independent u from Uniform(0, 1).
3. Compute the ratio r(x) = p(x) / (M q(x)); accept x if u <= r(x), otherwise reject it and start over.

The ratio r(x) always lies in [0, 1] because of the envelope condition, so the comparison is always well defined.
    function rejection_sample(p, q, M):
        while True:
            x ~ q                          # draw candidate
            u ~ Uniform(0, 1)              # draw auxiliary
            if u <= p(x) / (M * q(x)):     # accept?
                return x
Each call to rejection_sample returns one independent draw from p. To generate n samples the procedure is repeated n times. Samples produced this way are genuinely independent and identically distributed (i.i.d.), in contrast to MCMC output which is only asymptotically distributed as the target and usually autocorrelated [1].
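The pseudocode translates almost line for line into Python. As an illustration (the target and proposal are chosen here, not taken from the sources), the sketch below samples a Beta(2, 5) density, p(x) = 30 x (1 - x)^4 on [0, 1], from a Uniform(0, 1) proposal; setting the derivative of p to zero gives the maximum at x = 1/5, so M = p(1/5) ≈ 2.458.

```python
import random

def rejection_sample(p, sample_q, q, M, rng):
    """Return one exact i.i.d. draw from p using proposal q and envelope M*q."""
    while True:
        x = sample_q(rng)               # draw candidate from q
        u = rng.random()                # draw auxiliary uniform
        if u <= p(x) / (M * q(x)):      # accept?
            return x

# Target: Beta(2, 5), p(x) = 30 x (1 - x)^4 on [0, 1].
p = lambda x: 30.0 * x * (1.0 - x) ** 4
q = lambda x: 1.0                       # Uniform(0, 1) density
sample_q = lambda rng: rng.random()
M = p(0.2)                              # sup of p/q, attained at x = 1/5

rng = random.Random(0)
samples = [rejection_sample(p, sample_q, q, M, rng) for _ in range(50_000)]
mean = sum(samples) / len(samples)      # E[Beta(2, 5)] = 2/7 ~ 0.286
```

Each accepted draw costs on average M ≈ 2.46 candidate draws, matching the 1/M acceptance probability derived below.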
The correctness of rejection sampling is most easily seen through the envelope interpretation: generating (x, u) uniformly under the curve M q(x) and accepting only those pairs that fall under p(x) produces points uniformly distributed under p(x). The marginal density of the accepted x is therefore proportional to p(x) [1][5].
A direct probability calculation gives the same conclusion. Let A denote the acceptance event {U <= p(X) / (M q(X))}. For any measurable set B:
P(X in B, A) = integral over B of q(x) * P(U <= p(x) / (M q(x))) dx
= integral over B of q(x) * (p(x) / (M q(x))) dx
= (1 / M) * integral over B of p(x) dx
= P(X_target in B) / M.
Taking B to be the entire support yields
P(A) = 1 / M,
and applying Bayes' rule gives the conditional density of an accepted sample:
f(x | A) = P(X in dx, A) / P(A) = p(x).
In other words, the distribution of the accepted candidates is exactly the target p, the acceptance probability is 1/M, and the samples are independent across iterations because each trial uses fresh random numbers [1][2][5]. The argument works identically for discrete distributions (replace integrals by sums) and for multivariate distributions (interpret x as a vector).
An attractive side effect of the proof is that it also works when p is known only up to a normalizing constant Z. In that case M only needs to bound the unnormalized density p_tilde(x) = Z p(x), and the acceptance ratio becomes p_tilde(x) / (M q(x)). The unknown Z simply appears inside M and never has to be computed, which is precisely why rejection sampling is useful for Bayesian inference problems where the posterior normalization is intractable.
Each iteration of rejection sampling succeeds with probability 1/M. The number of iterations required to obtain one accepted sample is therefore a geometric random variable with mean M, so generating n samples takes on average n M candidate draws plus n M density evaluations. Minimizing M subject to the envelope constraint p(x) <= M q(x) is the main tuning problem of rejection sampling.
The smallest valid choice for a given proposal is
M* = sup_x p(x) / q(x),
which is tight: the envelope M* q(x) touches p(x) at the worst-case point. Any larger M is also valid but wastes effort; any smaller M violates the bound and invalidates the algorithm. When p and q are known analytically the supremum can sometimes be computed in closed form by differentiating p(x)/q(x) and solving for its maximum.
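As a worked example of computing M* by calculus (an illustration chosen here, not drawn from the cited sources): to sample a half-normal target p(x) = sqrt(2/pi) exp(-x^2/2) on [0, inf) from an Exp(1) proposal q(x) = exp(-x), the ratio p(x)/q(x) = sqrt(2/pi) exp(x - x^2/2) is maximized at x = 1, giving M* = sqrt(2e/pi) ≈ 1.316 and an acceptance rate of about 76%. The same code works unchanged if p is replaced by the unnormalized exp(-x^2/2), since the constant is absorbed into M.

```python
import math
import random

# Target: half-normal, p(x) = sqrt(2/pi) * exp(-x**2 / 2) on [0, inf).
# Proposal: Exp(1), q(x) = exp(-x).
# d/dx [x - x**2/2] = 0 at x = 1, so M* = p(1)/q(1) = sqrt(2e/pi).
M = math.sqrt(2.0 * math.e / math.pi)

def sample_half_normal(rng):
    while True:
        x = rng.expovariate(1.0)                       # candidate from Exp(1)
        # p(x) / (M q(x)) simplifies algebraically to exp(-(x - 1)**2 / 2)
        if rng.random() <= math.exp(-((x - 1.0) ** 2) / 2.0):
            return x

rng = random.Random(1)
draws = [sample_half_normal(rng) for _ in range(50_000)]
mean = sum(draws) / len(draws)   # E|Z| = sqrt(2/pi) ~ 0.798
```

This is the classical exponential-envelope construction for Gaussian tails; note how the tight M* makes the acceptance ratio simplify to a clean Gaussian bump centered at the touching point x = 1.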
The table below gives intuition for how M controls efficiency. All values assume that both p and q are normalized, so the acceptance probability equals 1/M exactly.
| Envelope constant M | Acceptance probability | Average candidates per accepted sample | Quality of proposal |
|---|---|---|---|
| 1.0 | 100% | 1 | q equals p (no work to do) |
| 1.1 | 90.9% | 1.1 | Excellent match |
| 2 | 50% | 2 | Good match |
| 5 | 20% | 5 | Reasonable match |
| 10 | 10% | 10 | Loose envelope |
| 100 | 1% | 100 | Poor proposal |
| 10,000 | 0.01% | 10,000 | Nearly unusable |
Choosing M is therefore a balance between analytic convenience and runtime cost. Three strategies are common in practice:

1. Compute the supremum of p(x)/q(x) in closed form by calculus when the ratio is tractable.
2. Maximize the ratio numerically (over a grid or with an optimizer) and inflate the result by a small safety factor.
3. Fall back on a crude but provable analytic bound, for example bounding p by its global maximum, when neither of the above is feasible.

When no closed-form supremum is available, a small overestimate of M is preferred to an underestimate, because an underestimate silently biases the sampler.
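A grid-based overestimate with a safety factor is a common pragmatic fallback; the sketch below assumes the ratio p/q is smooth enough that a fine grid comes close to the true supremum, which is why the safety factor errs on the side of overestimation.

```python
def envelope_constant(p, q, grid, safety=1.05):
    """Numerically estimate M = sup p/q on a grid, then inflate by a
    safety factor: overestimating M only wastes candidates, while
    underestimating it silently biases the sampler."""
    return safety * max(p(x) / q(x) for x in grid)

# Example: Beta(2, 5) target against a Uniform(0, 1) proposal.
p = lambda x: 30.0 * x * (1.0 - x) ** 4
q = lambda x: 1.0
grid = [i / 10_000 for i in range(1, 10_000)]
M = envelope_constant(p, q, grid)   # slightly above the true sup ~ 2.458
```

The grid excludes the endpoints because p/q need not be defined there; for ratios with interior spikes, a grid can miss the true maximum, which is exactly why the safety factor (and, where possible, an analytic check) matters.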
Adaptive rejection sampling (ARS), introduced by Wally Gilks and Pascal Wild in 1992, automates the construction of q and M for any univariate log-concave density [6]. The key observation is that if log p(x) is concave, then tangent lines to log p at selected points form an upper piecewise-linear envelope, and chords between the same points form a lower piecewise-linear squeezing function. Exponentiating these piecewise-linear functions yields a piecewise-exponential upper envelope that is easy to sample from and a piecewise-exponential lower squeeze that is cheap to evaluate.
ARS operates by maintaining a list of contact points, drawing candidates from the upper envelope, and updating the envelope with the new contact point every time a candidate is rejected. As more points are added, the envelope converges to the target and the acceptance rate climbs toward 100%. ARS is particularly valuable inside Gibbs sampling, where the full conditionals of hierarchical Bayesian models are often log-concave but have no closed-form sampler. It is the default univariate sampler in the BUGS family of probabilistic programming tools [6].
Several extensions relax the log-concavity requirement:

- Adaptive rejection Metropolis sampling (ARMS), due to Gilks, Best, and Tan (1995), appends a Metropolis-Hastings correction step after each acceptance, so that densities that are not log-concave can still be handled at the cost of losing exact independence.
- Transformed density rejection applies a monotone transformation to the density (for example T(x) = -1/sqrt(x) in place of the logarithm), extending the tangent-and-chord construction to the broader class of "T-concave" densities.
A squeeze function s(x) is a cheap lower bound that satisfies s(x) <= p(x) for all x in the support. Before evaluating the full density, the algorithm first checks whether u <= s(x) / (M q(x)). If so, the candidate is accepted without ever computing p(x), which saves time when evaluating p is expensive. Only when the squeeze check fails does the algorithm fall back to the full ratio check. Squeeze functions can reduce the number of expensive density evaluations by an order of magnitude in well-tuned implementations and are used pervasively inside random variate generators for Beta, Gamma, and related distributions [2][6].
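As an illustration (constructed here, not taken from the sources), the half-normal sampler with an Exp(1) proposal admits a free squeeze from the elementary bound e^t >= 1 + t: since exp(-(x-1)^2/2) >= 1 - (x-1)^2/2, a cheap polynomial check can accept most candidates without ever calling exp.

```python
import math
import random

def sample_half_normal_squeezed(rng, stats):
    """Half-normal via Exp(1) proposal. The exact acceptance ratio is
    exp(-(x-1)**2/2); the polynomial 1 - (x-1)**2/2 is a valid squeeze
    because exp(-t) >= 1 - t for all t."""
    while True:
        x = rng.expovariate(1.0)
        u = rng.random()
        t = (x - 1.0) ** 2 / 2.0
        if u <= 1.0 - t:                # cheap squeeze check: no exp needed
            stats["squeeze"] += 1
            return x
        if u <= math.exp(-t):           # full (more expensive) ratio check
            stats["full"] += 1
            return x

rng = random.Random(2)
stats = {"squeeze": 0, "full": 0}
draws = [sample_half_normal_squeezed(rng, stats) for _ in range(20_000)]
# Most acceptances go through the squeeze, skipping the exp call entirely.
```

With a transcendental function as cheap as exp the saving is cosmetic, but the same pattern pays off substantially when evaluating p means a special-function call, a numerical integral, or a model forward pass.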
The ziggurat algorithm, introduced by George Marsaglia and Wai Wan Tsang in 2000, is a highly optimized rejection sampler for symmetric unimodal densities such as the Gaussian and exponential. It partitions the area under the density into horizontal strips of equal area, allowing most candidate draws to be accepted using a single table lookup and one comparison. Modern statistical libraries (including many random number generators in NumPy, SciPy, and R) use ziggurat-style rejection for normal random variates [2].
Variational rejection sampling, introduced by Grover and colleagues in 2018, combines the idea with variational inference. The variational family is allowed to be tighter than would otherwise be possible because any residual gap can be closed by a rejection step, trading extra compute for a provably tighter evidence lower bound.
Rejection sampling is one of several general-purpose techniques for drawing samples from complicated distributions. The table below summarizes the main differences against the other classical options. All four methods are discussed in standard references such as Robert and Casella (2004), MacKay (2003), and Bishop (2006) [1][5][7].
| Property | Rejection sampling | Importance sampling | MCMC (Metropolis-Hastings) | Inverse transform sampling |
|---|---|---|---|---|
| Produces i.i.d. samples? | Yes | Yes (weighted) | No (autocorrelated) | Yes |
| Returns unweighted samples? | Yes | No, samples have weights w = p/q | Yes | Yes |
| Needs envelope constant M? | Yes, M must satisfy p <= M q | No | No | No |
| Tolerates unnormalized p? | Yes | Yes (with self-normalization) | Yes | No (needs inverse CDF) |
| Convergence guarantee | Exact in finite time | Asymptotic (variance depends on q) | Asymptotic (stationary distribution) | Exact in one step |
| Scales well in high dim.? | No, M grows exponentially | Often no (effective sample size collapses) | Better (but still challenging) | Only for separable distributions |
| Main failure mode | Runaway rejection rate | Weight degeneracy (a few huge weights) | Slow mixing, unclear convergence | Inverse CDF unavailable |
| Typical use case | Low-dim, bounded ratio, simulations | Expectations, rare events | High-dim Bayesian posteriors | 1-D distributions with closed CDF |
| Requires tuning? | Choice of q and M | Choice of q | Step size, burn-in, diagnostics | None |
In very rough terms, rejection sampling is the gold standard when it is affordable: it gives you independent, unweighted, exact samples with no burn-in and no diagnostics, but it falls apart in high dimensions because the envelope constant M grows too fast. Importance sampling avoids the rejection step at the price of returning weighted samples whose variance can blow up. MCMC, and in particular Metropolis-Hastings and its descendants, is the method of choice for modern high-dimensional Bayesian problems but gives up the i.i.d. property. Slice sampling and Gibbs sampling are specialized MCMC variants that can be viewed as structured rejection methods, using uniform auxiliary variables and conditional exactness, respectively.
Rejection sampling is notoriously poor in high dimensions. The reason is purely geometric: if p and q are d-dimensional densities that differ at each coordinate by a factor c > 1, then their ratio and hence the envelope constant grows as c^d. The acceptance probability 1/M therefore decays exponentially in the dimension, and the expected number of candidates needed to obtain a single sample explodes.
A concrete textbook example is sampling a d-dimensional standard Gaussian using a slightly wider d-dimensional Gaussian proposal with covariance sigma^2 I, sigma > 1. The envelope constant satisfies M = sigma^d, so even modest mismatches compound rapidly. With sigma = 1.01 and d = 100 the acceptance rate is already around 37%; by d = 1000 it collapses to less than 0.005%. With sigma = 1.1 and d = 100 the acceptance rate is roughly 7 x 10^(-5), and by d = 1000 it is far below any number that could be observed in practice [5][7]. The rejection sampler technically still works, but it would take longer than the lifetime of the universe to obtain a single sample.
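The quoted acceptance rates follow directly from M = sigma^d; a quick check, assuming both Gaussians are normalized:

```python
def gaussian_accept_rate(sigma, d):
    """Acceptance rate 1/M = sigma**(-d) for a standard Gaussian target
    under a N(0, sigma**2 I) proposal in d dimensions."""
    return sigma ** (-d)

rates = {
    (1.01, 100): gaussian_accept_rate(1.01, 100),    # ~0.37
    (1.01, 1000): gaussian_accept_rate(1.01, 1000),  # ~4.8e-5
    (1.10, 100): gaussian_accept_rate(1.10, 100),    # ~7.3e-5
    (1.10, 1000): gaussian_accept_rate(1.10, 1000),  # astronomically small
}
```

Even a one-percent per-coordinate mismatch compounds to near-total rejection within a thousand dimensions, which is the exponential blow-up described above.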
This "curse of dimensionality" is the primary reason that rejection sampling is rarely used directly for modern Bayesian posteriors. High-dimensional problems almost always require MCMC or variational methods, both of which exploit local structure in the target to sidestep the exponential blow-up. Rejection sampling does, however, remain competitive in low or moderate dimensions, in problems where a tight analytic envelope is available, and inside hybrid samplers that perform only small local rejection steps [1][5].
Rejection sampling is the workhorse behind the random variate generators shipped with almost every scientific computing library. Efficient rejection algorithms exist for the normal, exponential, gamma, beta, chi-squared, Student-t, Poisson, binomial, and many other standard distributions. These generators form the bedrock on top of which almost all higher-level statistical simulations are built. The classical reference is Luc Devroye's 1986 book Non-Uniform Random Variate Generation, which catalogs rejection methods for essentially every common distribution and was made freely available by the author [2].
In low-dimensional Bayesian inference problems, rejection sampling is a tempting choice because it requires only evaluation of the unnormalized posterior and delivers genuine i.i.d. draws. A particularly clean special case is approximate Bayesian computation (ABC), which avoids likelihood evaluation entirely: candidate parameter values are drawn from the prior, the model is simulated forward, and the candidate is accepted only if the simulated data match (or are close to) the observed data. ABC is essentially a rejection sampler on the joint space of parameters and simulated data, and it remains the method of choice in fields like population genetics and systems biology where likelihoods are unavailable.
Particle filters, also known as sequential Monte Carlo methods, track a population of weighted particles through a state-space model. Resampling and propagation steps inside these filters often use rejection sampling as a subroutine, in particular when drawing from the transition density is easier than drawing from the optimal proposal. Rejection sampling is also used in smoothing algorithms that require exact draws from conditional distributions along the trajectory.
Rejection sampling has been used to sharpen neural generative models. For example, Azadi and colleagues (2018) proposed discriminator rejection sampling for generative adversarial networks, using the GAN discriminator to compute an acceptance probability that brings the sample distribution closer to the data distribution. Related work applies rejection sampling to autoregressive models, energy-based models, and normalizing flows, typically as a post-hoc correction that trades compute for quality.
The original application domain, neutron transport and particle physics, still uses rejection sampling extensively. Almost every Monte Carlo integration package for high-energy physics (including MCNP, which was developed at Los Alamos) contains heavily optimized rejection samplers for cross-section distributions, angular scattering distributions, and energy spectra.
The most visible modern use of rejection sampling in AI is speculative decoding (also called speculative sampling), introduced independently in 2022 by Leviathan, Kalman, and Matias at Google and in 2023 by Chen and colleagues at DeepMind [8][9]. The goal is to accelerate autoregressive decoding in LLMs without changing the output distribution of the large target model.
The method runs a small, fast "draft" model q that speculatively generates several tokens ahead, then uses a single parallel forward pass of the large "target" model p to verify them. Each drafted token x is accepted with probability
min(1, p(x) / q(x)),
which is precisely a modified rejection sampling rule. If the draft distribution overestimates the probability of a token (q(x) > p(x)) the token is only sometimes kept; if the draft underestimates it (q(x) <= p(x)) the token is always kept. When a token is rejected, the decoder resamples from the residual distribution
p'(x) = normalize(max(0, p(x) - q(x)))
and discards all subsequent draft tokens beyond the first rejection. The residual distribution is constructed precisely so that the overall probability of emitting any token is exactly p(x). The algorithm therefore samples from the target model's distribution exactly, bit for bit, no matter how bad or biased the draft model is, while typically running two to three times faster than naive autoregressive decoding because most drafted tokens are accepted [8][9].
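The accept/resample rule can be checked empirically on a toy vocabulary. The sketch below (plain Python with distributions as dicts, a minimal model of the rule rather than any production decoder) emits single tokens whose long-run frequencies match the target p exactly, even under a deliberately bad draft q.

```python
import random

def speculative_token(p, q, rng):
    """Emit one token distributed exactly as p, using draft distribution q."""
    tokens = list(q)
    x = rng.choices(tokens, weights=[q[t] for t in tokens])[0]
    if rng.random() <= min(1.0, p[x] / q[x]):
        return x                         # draft token accepted
    # Rejected: resample from the residual max(0, p - q), renormalized.
    resid = {t: max(0.0, p[t] - q[t]) for t in tokens}
    z = sum(resid.values())
    return rng.choices(tokens, weights=[resid[t] / z for t in tokens])[0]

p = {"a": 0.5, "b": 0.3, "c": 0.2}      # target model's next-token dist
q = {"a": 0.2, "b": 0.2, "c": 0.6}      # deliberately poor draft dist
rng = random.Random(3)
counts = {"a": 0, "b": 0, "c": 0}
for _ in range(100_000):
    counts[speculative_token(p, q, rng)] += 1
freqs = {t: c / 100_000 for t, c in counts.items()}   # ~ p despite bad q
```

Note that tokens the draft underweights (here "a" and "b") are always kept when proposed, and the residual distribution exists exactly when rejection is possible, so the procedure never gets stuck.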
| Feature | Classical rejection sampling | Speculative decoding |
|---|---|---|
| Target distribution | p(x) | Target model p(x_t \| x_<t) |
| Proposal distribution | q(x) | Draft model q(x_t \| x_<t) |
| Envelope constant | M with p <= M q | None required |
| Acceptance rule | u <= p(x) / (M q(x)) | u <= min(1, p(x) / q(x)) |
| Behavior on rejection | Discard and retry | Resample from normalize(max(0, p - q)) |
| Distribution of output | Exactly p | Exactly p |
| Typical speedup | Depends on M | 2x to 3x on modern LLMs |
Speculative decoding departs from classical rejection sampling in one important way: it does not require an envelope constant M. Instead, the accept-reject rule is asymmetric (the ratio p/q is clipped at 1), and the leftover probability mass max(0, p(x) - q(x)) is explicitly redistributed on rejection. This trick guarantees distributional equality without needing a worst-case bound on p/q, which would otherwise be impossible to compute over the exponentially large token space [8][9]. The technique is now the standard production-grade method for accelerating LLM inference in systems including vLLM, TensorRT-LLM, and the Gemini and Claude serving stacks.
Rejection sampling fine-tuning, sometimes abbreviated RFT, is a technique for improving LLM reasoning ability introduced by Yuan and colleagues in 2023 [10]. The idea is to use rejection sampling at the level of entire generated solutions rather than individual tokens. For each training problem the model produces a large number of candidate chains of thought, each candidate is graded for correctness (for example, by checking the final numerical answer against the ground truth), and only correct solutions are kept for supervised fine-tuning. Incorrect solutions are rejected.
Because the acceptance criterion in RFT is deterministic (correct vs. incorrect), the procedure is effectively rejection sampling from the conditional distribution of model outputs given that the output is correct. Yuan and colleagues used RFT to push LLaMA 7B from 35.9% to 49.3% accuracy on GSM8K by combining rejection samples from several fine-tuned models [10]. Follow-up work (DART-Math, Difficulty-Aware Rejection Tuning, and others) showed that RFT benefits weaker base models more than stronger ones and remains competitive with more elaborate preference-learning methods such as RLHF and DPO in reasoning tasks.
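The RFT loop reduces to generate, grade, filter. The sketch below is a schematic with hypothetical `generate_solutions` and `is_correct` stand-ins (toy arithmetic problems in place of a model and a grader), not the authors' actual pipeline.

```python
import random

def rejection_sampling_finetune_data(problems, generate_solutions, is_correct, k=16):
    """For each problem, sample k candidate solutions and keep only those
    graded correct -- rejection sampling at the whole-solution level."""
    dataset = []
    for prob in problems:
        for sol in generate_solutions(prob, k):
            if is_correct(prob, sol):               # deterministic accept rule
                dataset.append({"problem": prob, "solution": sol})
    return dataset

# Toy stand-ins: 'solutions' are noisy guessed sums for addition problems.
def generate_solutions(prob, k, rng=random.Random(4)):
    a, b = prob
    return [a + b + rng.choice([-1, 0, 0, 0, 1]) for _ in range(k)]

def is_correct(prob, sol):
    return sol == prob[0] + prob[1]

data = rejection_sampling_finetune_data([(2, 3), (10, 7)], generate_solutions, is_correct)
```

The surviving dataset is then used for ordinary supervised fine-tuning; because the filter is deterministic, the kept solutions are draws from the model's output distribution conditioned on correctness, as described above.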
A complementary variant, reinforced self-training with rejection (sometimes called ReST), iterates the RFT loop: each round uses the latest fine-tuned model to generate fresh candidates, rejects incorrect ones, and trains on the survivors. This is a close relative of expert iteration in reinforcement learning and is used in the training pipelines of several state-of-the-art reasoning LLMs.
Rejection sampling is also the simplest baseline for constrained decoding, in which generated outputs must satisfy hard constraints (for example, being valid JSON, matching a regular expression, or staying within a length budget). The naive approach samples completions from the model and rejects any that fail the constraint check. While simple, this can be catastrophically slow when the constraint is rare, because it inherits the high-dimensional curse of dimensionality. More efficient constrained decoders (grammar-constrained decoding, FSM-constrained decoding) replace rejection with direct probability masking of the next-token distribution, which can be viewed as performing the rejection step in closed form at every step.
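The naive baseline is literally a loop around the constraint check. The sketch below uses a stub `sample_completion` in place of a real model, with valid-JSON as the hard constraint; the `max_tries` cap guards against the runaway-rejection failure mode when the constraint is rare.

```python
import json
import random

def constrained_sample(sample_completion, is_valid, max_tries=1000):
    """Naive constrained decoding: resample whole completions until
    the constraint holds (or give up)."""
    for _ in range(max_tries):
        text = sample_completion()
        if is_valid(text):
            return text
    raise RuntimeError("constraint too rare for naive rejection sampling")

def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Stub 'model': emits valid JSON only some of the time.
rng = random.Random(5)
candidates = ['{"ok": true}', "{'bad': quotes}", "[1, 2, 3]", "not json"]
sample_completion = lambda: rng.choice(candidates)
out = constrained_sample(sample_completion, is_valid_json)
```

Grammar- or FSM-constrained decoders avoid this loop entirely by masking invalid next tokens at every step, which amounts to performing the rejection in closed form before sampling.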