See also: Monte Carlo methods, Importance sampling, Markov chain Monte Carlo, Speculative decoding, Bayesian inference
Rejection sampling, also called the accept-reject method or the acceptance-rejection method, is a foundational Monte Carlo technique for drawing independent samples from a target probability distribution using a simpler auxiliary (proposal) distribution. The method works by generating candidate samples from the proposal, then accepting or rejecting each candidate according to a probabilistic rule that exactly corrects the difference between the two distributions. First formalized by John von Neumann in 1951, rejection sampling is taught in virtually every graduate course on computational statistics, and variants of the same core rule power modern large language model acceleration techniques such as speculative decoding.
Let p(x) be a target density that is difficult to sample from directly but easy to evaluate (at least up to a normalizing constant). Let q(x) be a proposal density that is easy to sample from and easy to evaluate. Suppose there exists a finite constant M greater than or equal to 1 such that
p(x) <= M * q(x) for all x in the support of p.
The function M q(x) is called the envelope because its graph lies entirely above the graph of p(x). Rejection sampling then proceeds by drawing a candidate x from q, drawing an independent uniform random number u from the interval [0, 1], and accepting x whenever
u <= p(x) / (M * q(x)).
Any accepted x is distributed exactly according to p. Rejected candidates are discarded and the procedure is repeated until a sample is accepted [1][2].
Although the method is most often stated for continuous univariate densities, it applies equally to multivariate distributions, discrete distributions, and even mixed distributions, as long as a suitable envelope can be constructed.
Imagine you want a dart board shaped exactly like a funny cloud, but all you have is a big rectangular board. Here is a trick that always works: throw a dart uniformly at random anywhere on the rectangle. If the dart lands inside the cloud, keep it and call it a "good" dart. If it lands outside the cloud, throw it away and try again.
If you only look at the darts you kept, they are spread out perfectly and uniformly inside the cloud, which is exactly what you wanted. Rejection sampling is the same idea in math: the rectangle is a simple distribution you know how to sample from, the cloud is the complicated distribution you actually want, and the accept-or-reject rule is the "is the dart inside the cloud" check. The only downside is that if the rectangle is much bigger than the cloud, you throw away most of your darts and it takes a long time to get the samples you want.
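The dart-board picture can be simulated directly. The sketch below (plain Python, with a unit disk standing in for the "cloud") keeps only darts that land inside the disk; the kept points are exactly uniform on the disk, and the acceptance rate approaches the area ratio pi/4.

```python
import random

def uniform_on_disk(n, seed=0):
    """Throw darts uniformly at the square [-1, 1]^2 and keep only
    those landing inside the unit disk (the 'cloud')."""
    rng = random.Random(seed)
    kept, attempts = [], 0
    while len(kept) < n:
        attempts += 1
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1.0:  # is the dart inside the cloud?
            kept.append((x, y))
    return kept, attempts

points, attempts = uniform_on_disk(20_000)
accept_rate = len(points) / attempts  # approaches area(disk)/area(square) = pi/4
```

Because the acceptance rate estimates pi/4, this same experiment doubles as a (very inefficient) way to estimate pi, echoing the Buffon-style arguments mentioned below.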
Rejection sampling was introduced by the Hungarian-American mathematician John von Neumann in his 1951 paper "Various techniques used in connection with random digits," which appeared as a chapter in the proceedings of a Monte Carlo symposium held at the Institute for Numerical Analysis in Los Angeles in 1949 and published in National Bureau of Standards Applied Mathematics Series 12 [3][4]. The same paper introduced the middle-square method for generating pseudorandom numbers and contained von Neumann's famous quip that "anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin."
Von Neumann's original motivation came from the Los Alamos work on neutron transport and simulations of nuclear reactions during and after the Manhattan Project. Researchers needed to draw samples from complicated physical distributions using only the primitive random number generators available on early computers such as ENIAC and MANIAC I. The accept-reject rule gave a universal recipe: any density bounded by a constant multiple of a tractable proposal could be sampled with essentially no algebra, at the price of occasionally wasting a candidate draw.
Although von Neumann's formal statement is the first modern description of the algorithm, the underlying geometric idea has roots in earlier probabilistic arguments. Buffon's needle problem from the 18th century, which estimates pi by counting the fraction of dropped needles that cross a line, can be read as a rejection-style experiment on a uniform proposal [1]. From the 1960s onwards, Brian Ripley, George Marsaglia, Luc Devroye, and others developed highly optimized rejection algorithms for specific distributions, which form the basis of random variate generators in almost every scientific computing library today [2].
Markov chain Monte Carlo methods, developed from the 1950s through the 1980s (notably the Metropolis algorithm in 1953 and the Metropolis-Hastings generalization in 1970), eventually overtook pure rejection sampling as the dominant Monte Carlo tool for high-dimensional Bayesian problems. Rejection sampling nevertheless remained important both as a teaching device and as a subroutine inside more complex methods, including Gibbs sampling (through adaptive rejection sampling), particle filters, and, more recently, speculative decoding for LLMs.
To apply rejection sampling the user must supply three ingredients:

1. a target density p(x) that can be evaluated pointwise, at least up to a normalizing constant;
2. a proposal density q(x) that can be both evaluated pointwise and sampled from;
3. a finite constant M such that p(x) <= M q(x) everywhere on the support of p.

The pair (q, M) defines the envelope M q(x), which must dominate the target everywhere. The closer M q(x) hugs p(x), the fewer candidates will be rejected.
Given the setup above, one iteration of the algorithm consists of three operations:

1. Draw a candidate x from the proposal q.
2. Draw an independent u from Uniform(0, 1).
3. Compute the ratio r(x) = p(x) / (M q(x)); accept x if u <= r(x), otherwise reject it and start over.

The ratio r(x) always lies in [0, 1] because of the envelope condition, so the comparison is always well defined.
    function rejection_sample(p, q, M):
        while True:
            x ~ q                          # draw candidate
            u ~ Uniform(0, 1)              # draw auxiliary
            if u <= p(x) / (M * q(x)):     # accept?
                return x
Each call to rejection_sample returns one independent draw from p. To generate n samples the procedure is repeated n times. Samples produced this way are genuinely independent and identically distributed (i.i.d.), in contrast to MCMC output which is only asymptotically distributed as the target and usually autocorrelated [1].
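The pseudocode translates almost line for line into Python. As an illustration (the target and proposal are chosen here, not taken from the sources), the sketch below samples a Beta(2, 5) density, p(x) = 30 x (1 - x)^4 on [0, 1], from a Uniform(0, 1) proposal; setting the derivative of p to zero gives the maximum at x = 1/5, so M = p(1/5) ≈ 2.458.

```python
import random

def rejection_sample(p, sample_q, q, M, rng):
    """Return one exact i.i.d. draw from p using proposal q and envelope M*q."""
    while True:
        x = sample_q(rng)               # draw candidate from q
        u = rng.random()                # draw auxiliary uniform
        if u <= p(x) / (M * q(x)):      # accept?
            return x

# Target: Beta(2, 5), p(x) = 30 x (1 - x)^4 on [0, 1].
p = lambda x: 30.0 * x * (1.0 - x) ** 4
q = lambda x: 1.0                       # Uniform(0, 1) density
sample_q = lambda rng: rng.random()
M = p(0.2)                              # sup of p/q, attained at x = 1/5

rng = random.Random(0)
samples = [rejection_sample(p, sample_q, q, M, rng) for _ in range(50_000)]
mean = sum(samples) / len(samples)      # E[Beta(2, 5)] = 2/7 ~ 0.286
```

Each accepted draw costs on average M ≈ 2.46 candidate draws, matching the 1/M acceptance probability derived below.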
The correctness of rejection sampling is most easily seen through the envelope interpretation: generating (x, u) uniformly under the curve M q(x) and accepting only those pairs that fall under p(x) produces points uniformly distributed under p(x). The marginal density of the accepted x is therefore proportional to p(x) [1][5].
A direct probability calculation gives the same conclusion. Let A denote the acceptance event {U <= p(X) / (M q(X))}. For any measurable set B:
P(X in B, A) = integral over B of q(x) * P(U <= p(x) / (M q(x))) dx
= integral over B of q(x) * (p(x) / (M q(x))) dx
= (1 / M) * integral over B of p(x) dx
= P(X_target in B) / M.
Taking B to be the entire support yields
P(A) = 1 / M,
and applying Bayes' rule gives the conditional density of an accepted sample:
f(x | A) = P(X in dx, A) / P(A) = p(x).
In other words, the distribution of the accepted candidates is exactly the target p, the acceptance probability is 1/M, and the samples are independent across iterations because each trial uses fresh random numbers [1][2][5]. The argument works identically for discrete distributions (replace integrals by sums) and for multivariate distributions (interpret x as a vector).
An attractive side effect of the proof is that it also works when p is known only up to a normalizing constant Z. In that case M only needs to bound the unnormalized density p_tilde(x) = Z p(x), and the acceptance ratio becomes p_tilde(x) / (M q(x)). The unknown Z simply appears inside M and never has to be computed, which is precisely why rejection sampling is useful for Bayesian inference problems where the posterior normalization is intractable.
Each iteration of rejection sampling succeeds with probability 1/M. The number of iterations required to obtain one accepted sample is therefore a geometric random variable with mean M, so generating n samples takes on average n M candidate draws plus n M density evaluations. Minimizing M subject to the envelope constraint p(x) <= M q(x) is the main tuning problem of rejection sampling.
The smallest valid choice for a given proposal is
M* = sup_x p(x) / q(x),
which is tight: the envelope M* q(x) touches p(x) at the worst-case point. Any larger M is also valid but wastes effort; any smaller M violates the bound and invalidates the algorithm. When p and q are known analytically the supremum can sometimes be computed in closed form by differentiating p(x)/q(x) and solving for its maximum.
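As a worked example of computing M* by calculus (an illustration chosen here, not drawn from the cited sources): to sample a half-normal target p(x) = sqrt(2/pi) exp(-x^2/2) on [0, inf) from an Exp(1) proposal q(x) = exp(-x), the ratio p(x)/q(x) = sqrt(2/pi) exp(x - x^2/2) is maximized at x = 1, giving M* = sqrt(2e/pi) ≈ 1.316 and an acceptance rate of about 76%. The same code works unchanged if p is replaced by the unnormalized exp(-x^2/2), since the constant is absorbed into M.

```python
import math
import random

# Target: half-normal, p(x) = sqrt(2/pi) * exp(-x**2 / 2) on [0, inf).
# Proposal: Exp(1), q(x) = exp(-x).
# d/dx [x - x**2/2] = 0 at x = 1, so M* = p(1)/q(1) = sqrt(2e/pi).
M = math.sqrt(2.0 * math.e / math.pi)

def sample_half_normal(rng):
    while True:
        x = rng.expovariate(1.0)                       # candidate from Exp(1)
        # p(x) / (M q(x)) simplifies algebraically to exp(-(x - 1)**2 / 2)
        if rng.random() <= math.exp(-((x - 1.0) ** 2) / 2.0):
            return x

rng = random.Random(1)
draws = [sample_half_normal(rng) for _ in range(50_000)]
mean = sum(draws) / len(draws)   # E|Z| = sqrt(2/pi) ~ 0.798
```

This is the classical exponential-envelope construction for Gaussian tails; note how the tight M* makes the acceptance ratio simplify to a clean Gaussian bump centered at the touching point x = 1.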
The table below gives intuition for how M controls efficiency. All values assume that both p and q are normalized, so the acceptance probability equals 1/M exactly.
| Envelope constant M | Acceptance probability | Average candidates per accepted sample | Quality of proposal |
|---|---|---|---|
| 1.0 | 100% | 1 | q equals p (no work to do) |
| 1.1 | 90.9% | 1.1 | Excellent match |
| 2 | 50% | 2 | Good match |
| 5 | 20% | 5 | Reasonable match |
| 10 | 10% | 10 | Loose envelope |
| 100 | 1% | 100 | Poor proposal |
| 10,000 | 0.01% | 10,000 | Nearly unusable |
Choosing M is therefore a balance between analytic convenience and runtime cost. Three strategies are common in practice:

1. Compute the supremum of p(x)/q(x) in closed form by calculus when the ratio is tractable.
2. Maximize the ratio numerically (over a grid or with an optimizer) and inflate the result by a small safety factor.
3. Fall back on a crude but provable analytic bound, for example bounding p by its global maximum, when neither of the above is feasible.

When no closed-form supremum is available, a small overestimate of M is preferred to an underestimate, because an underestimate silently biases the sampler.
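A grid-based overestimate with a safety factor is a common pragmatic fallback; the sketch below assumes the ratio p/q is smooth enough that a fine grid comes close to the true supremum, which is why the safety factor errs on the side of overestimation.

```python
def envelope_constant(p, q, grid, safety=1.05):
    """Numerically estimate M = sup p/q on a grid, then inflate by a
    safety factor: overestimating M only wastes candidates, while
    underestimating it silently biases the sampler."""
    return safety * max(p(x) / q(x) for x in grid)

# Example: Beta(2, 5) target against a Uniform(0, 1) proposal.
p = lambda x: 30.0 * x * (1.0 - x) ** 4
q = lambda x: 1.0
grid = [i / 10_000 for i in range(1, 10_000)]
M = envelope_constant(p, q, grid)   # slightly above the true sup ~ 2.458
```

The grid excludes the endpoints because p/q need not be defined there; for ratios with interior spikes, a grid can miss the true maximum, which is exactly why the safety factor (and, where possible, an analytic check) matters.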
Adaptive rejection sampling (ARS), introduced by Wally Gilks and Pascal Wild in 1992, automates the construction of q and M for any univariate log-concave density [6]. The key observation is that if log p(x) is concave, then tangent lines to log p at selected points form an upper piecewise-linear envelope, and chords between the same points form a lower piecewise-linear squeezing function. Exponentiating these piecewise-linear functions yields a piecewise-exponential upper envelope that is easy to sample from and a piecewise-exponential lower squeeze that is cheap to evaluate.
ARS operates by maintaining a list of contact points, drawing candidates from the upper envelope, and updating the envelope with the new contact point every time a candidate is rejected. As more points are added, the envelope converges to the target and the acceptance rate climbs toward 100%. ARS is particularly valuable inside Gibbs sampling, where the full conditionals of hierarchical Bayesian models are often log-concave but have no closed-form sampler. It is the default univariate sampler in the BUGS family of probabilistic programming tools [6].
Several extensions relax the log-concavity requirement:

- Adaptive rejection Metropolis sampling (ARMS), due to Gilks, Best, and Tan (1995), appends a Metropolis-Hastings correction step after each acceptance, so that densities that are not log-concave can still be handled at the cost of losing exact independence.
- Transformed density rejection applies a monotone transformation to the density (for example T(x) = -1/sqrt(x) in place of the logarithm), extending the tangent-and-chord construction to the broader class of "T-concave" densities.
A squeeze function s(x) is a cheap lower bound that satisfies s(x) <= p(x) for all x in the support. Before evaluating the full density, the algorithm first checks whether u <= s(x) / (M q(x)). If so, the candidate is accepted without ever computing p(x), which saves time when evaluating p is expensive. Only when the squeeze check fails does the algorithm fall back to the full ratio check. Squeeze functions can reduce the number of expensive density evaluations by an order of magnitude in well-tuned implementations and are used pervasively inside random variate generators for Beta, Gamma, and related distributions [2][6].
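As an illustration (constructed here, not taken from the sources), the half-normal sampler with an Exp(1) proposal admits a free squeeze from the elementary bound e^t >= 1 + t: since exp(-(x-1)^2/2) >= 1 - (x-1)^2/2, a cheap polynomial check can accept most candidates without ever calling exp.

```python
import math
import random

def sample_half_normal_squeezed(rng, stats):
    """Half-normal via Exp(1) proposal. The exact acceptance ratio is
    exp(-(x-1)**2/2); the polynomial 1 - (x-1)**2/2 is a valid squeeze
    because exp(-t) >= 1 - t for all t."""
    while True:
        x = rng.expovariate(1.0)
        u = rng.random()
        t = (x - 1.0) ** 2 / 2.0
        if u <= 1.0 - t:                # cheap squeeze check: no exp needed
            stats["squeeze"] += 1
            return x
        if u <= math.exp(-t):           # full (more expensive) ratio check
            stats["full"] += 1
            return x

rng = random.Random(2)
stats = {"squeeze": 0, "full": 0}
draws = [sample_half_normal_squeezed(rng, stats) for _ in range(20_000)]
# Most acceptances go through the squeeze, skipping the exp call entirely.
```

With a transcendental function as cheap as exp the saving is cosmetic, but the same pattern pays off substantially when evaluating p means a special-function call, a numerical integral, or a model forward pass.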
The ziggurat algorithm, introduced by George Marsaglia and Wai Wan Tsang in 2000, is a highly optimized rejection sampler for symmetric unimodal densities such as the Gaussian and exponential. It partitions the area under the density into horizontal strips of equal area, allowing most candidate draws to be accepted using a single table lookup and one comparison. Modern statistical libraries (including many random number generators in NumPy, SciPy, and R) use ziggurat-style rejection for normal random variates [2].
Variational rejection sampling, introduced by Grover and colleagues in 2018, combines the idea with variational inference. The variational family is allowed to be tighter than would otherwise be possible because any residual gap can be closed by a rejection step, trading extra compute for a provably tighter evidence lower bound.
Rejection sampling is one of several general-purpose techniques for drawing samples from complicated distributions. The table below summarizes the main differences against the other classical options. All four methods are discussed in standard references such as Robert and Casella (2004), MacKay (2003), and Bishop (2006) [1][5][7].
| Property | Rejection sampling | Importance sampling | MCMC (Metropolis-Hastings) | Inverse transform sampling |
|---|---|---|---|---|
| Produces i.i.d. samples? | Yes | Yes (weighted) | No (autocorrelated) | Yes |
| Returns unweighted samples? | Yes | No, samples have weights w = p/q | Yes | Yes |
| Needs envelope constant M? | Yes, M must satisfy p <= M q | No | No | No |
| Tolerates unnormalized p? | Yes | Yes (with self-normalization) | Yes | No (needs inverse CDF) |
| Convergence guarantee | Exact in finite time | Asymptotic (variance depends on q) | Asymptotic (stationary distribution) | Exact in one step |
| Scales well in high dim.? | No, M grows exponentially | Often no (effective sample size collapses) | Better (but still challenging) | Only for separable distributions |
| Main failure mode | Runaway rejection rate | Weight degeneracy (a few huge weights) | Slow mixing, unclear convergence | Inverse CDF unavailable |
| Typical use case | Low-dim, bounded ratio, simulations | Expectations, rare events | High-dim Bayesian posteriors | 1-D distributions with closed CDF |
| Requires tuning? | Choice of q and M | Choice of q | Step size, burn-in, diagnostics | None |
In very rough terms, rejection sampling is the gold standard when it is affordable: it gives you independent, unweighted, exact samples with no burn-in and no diagnostics, but it falls apart in high dimensions because the envelope constant M grows too fast. Importance sampling avoids the rejection step at the price of returning weighted samples whose variance can blow up. MCMC, and in particular Metropolis-Hastings and its descendants, is the method of choice for modern high-dimensional Bayesian problems but gives up the i.i.d. property. Slice sampling and Gibbs sampling are specialized MCMC variants that can be viewed as structured rejection methods, using uniform auxiliary variables and conditional exactness, respectively.
Rejection sampling is notoriously poor in high dimensions. The reason is purely geometric: if p and q are d-dimensional densities that differ at each coordinate by a factor c > 1, then their ratio and hence the envelope constant grows as c^d. The acceptance probability 1/M therefore decays exponentially in the dimension, and the expected number of candidates needed to obtain a single sample explodes.
A concrete textbook example is sampling a d-dimensional standard Gaussian using a slightly wider d-dimensional Gaussian proposal with covariance sigma^2 I, sigma > 1. The envelope constant satisfies M = sigma^d, so even modest mismatches compound rapidly. With sigma = 1.01 and d = 100 the acceptance rate is already around 37%; by d = 1000 it collapses to less than 0.005%. With sigma = 1.1 and d = 100 the acceptance rate is roughly 7 x 10^(-5), and by d = 1000 it is far below any number that could be observed in practice [5][7]. The rejection sampler technically still works, but it would take longer than the lifetime of the universe to obtain a single sample.
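The quoted acceptance rates follow directly from M = sigma^d; a quick check, assuming both Gaussians are normalized:

```python
def gaussian_accept_rate(sigma, d):
    """Acceptance rate 1/M = sigma**(-d) for a standard Gaussian target
    under a N(0, sigma**2 I) proposal in d dimensions."""
    return sigma ** (-d)

rates = {
    (1.01, 100): gaussian_accept_rate(1.01, 100),    # ~0.37
    (1.01, 1000): gaussian_accept_rate(1.01, 1000),  # ~4.8e-5
    (1.10, 100): gaussian_accept_rate(1.10, 100),    # ~7.3e-5
    (1.10, 1000): gaussian_accept_rate(1.10, 1000),  # astronomically small
}
```

Even a one-percent per-coordinate mismatch compounds to near-total rejection within a thousand dimensions, which is the exponential blow-up described above.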
This "curse of dimensionality" is the primary reason that rejection sampling is rarely used directly for modern Bayesian posteriors. High-dimensional problems almost always require MCMC or variational methods, both of which exploit local structure in the target to sidestep the exponential blow-up. Rejection sampling does, however, remain competitive in low or moderate dimensions, in problems where a tight analytic envelope is available, and inside hybrid samplers that perform only small local rejection steps [1][5].
Rejection sampling is the workhorse behind the random variate generators shipped with almost every scientific computing library. Efficient rejection algorithms exist for the normal, exponential, gamma, beta, chi-squared, Student-t, Poisson, binomial, and many other standard distributions. These generators form the bedrock on top of which almost all higher-level statistical simulations are built. The classical reference is Luc Devroye's 1986 book Non-Uniform Random Variate Generation, which catalogs rejection methods for essentially every common distribution and was made freely available by the author [2].
In low-dimensional Bayesian inference problems, rejection sampling is a tempting choice because it requires only evaluation of the unnormalized posterior and delivers genuine i.i.d. draws. A particularly clean special case is approximate Bayesian computation (ABC), which avoids likelihood evaluation entirely: candidate parameter values are drawn from the prior, the model is simulated forward, and the candidate is accepted only if the simulated data match (or are close to) the observed data. ABC is essentially a rejection sampler on the joint space of parameters and simulated data, and it remains the method of choice in fields like population genetics and systems biology where likelihoods are unavailable.
Particle filters, also known as sequential Monte Carlo methods, track a population of weighted particles through a state-space model. Resampling and propagation steps inside these filters often use rejection sampling as a subroutine, in particular when drawing from the transition density is easier than drawing from the optimal proposal. Rejection sampling is also used in smoothing algorithms that require exact draws from conditional distributions along the trajectory.
Rejection sampling has been used to sharpen neural generative models. For example, Azadi and colleagues (2018) proposed discriminator rejection sampling for generative adversarial networks, using the GAN discriminator to compute an acceptance probability that brings the sample distribution closer to the data distribution. Related work applies rejection sampling to autoregressive models, energy-based models, and normalizing flows, typically as a post-hoc correction that trades compute for quality.
The original application domain, neutron transport and particle physics, still uses rejection sampling extensively. Almost every Monte Carlo integration package for high-energy physics (including MCNP, which was developed at Los Alamos) contains heavily optimized rejection samplers for cross-section distributions, angular scattering distributions, and energy spectra.
The most visible modern use of rejection sampling in AI is speculative decoding (also called speculative sampling), introduced independently in 2022 by Leviathan, Kalman, and Matias at Google and in 2023 by Chen and colleagues at DeepMind [8][9]. The goal is to accelerate autoregressive decoding in LLMs without changing the output distribution of the large target model.
The method runs a small, fast "draft" model q that speculatively generates several tokens ahead, then uses a single parallel forward pass of the large "target" model p to verify them. Each drafted token x is accepted with probability
min(1, p(x) / q(x)),
which is precisely a modified rejection sampling rule. If the draft distribution overestimates the probability of a token (q(x) > p(x)) the token is only sometimes kept; if the draft underestimates it (q(x) <= p(x)) the token is always kept. When a token is rejected, the decoder resamples from the residual distribution
p'(x) = normalize(max(0, p(x) - q(x)))
and discards all subsequent draft tokens beyond the first rejection. The residual distribution is constructed precisely so that the overall probability of emitting any token is exactly p(x). The algorithm therefore samples from the target model's distribution exactly, bit for bit, no matter how bad or biased the draft model is, while typically running two to three times faster than naive autoregressive decoding because most drafted tokens are accepted [8][9].
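The accept/resample rule can be checked empirically on a toy vocabulary. The sketch below (plain Python with distributions as dicts, a minimal model of the rule rather than any production decoder) emits single tokens whose long-run frequencies match the target p exactly, even under a deliberately bad draft q.

```python
import random

def speculative_token(p, q, rng):
    """Emit one token distributed exactly as p, using draft distribution q."""
    tokens = list(q)
    x = rng.choices(tokens, weights=[q[t] for t in tokens])[0]
    if rng.random() <= min(1.0, p[x] / q[x]):
        return x                         # draft token accepted
    # Rejected: resample from the residual max(0, p - q), renormalized.
    resid = {t: max(0.0, p[t] - q[t]) for t in tokens}
    z = sum(resid.values())
    return rng.choices(tokens, weights=[resid[t] / z for t in tokens])[0]

p = {"a": 0.5, "b": 0.3, "c": 0.2}      # target model's next-token dist
q = {"a": 0.2, "b": 0.2, "c": 0.6}      # deliberately poor draft dist
rng = random.Random(3)
counts = {"a": 0, "b": 0, "c": 0}
for _ in range(100_000):
    counts[speculative_token(p, q, rng)] += 1
freqs = {t: c / 100_000 for t, c in counts.items()}   # ~ p despite bad q
```

Note that tokens the draft underweights (here "a" and "b") are always kept when proposed, and the residual distribution exists exactly when rejection is possible, so the procedure never gets stuck.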
| Feature | Classical rejection sampling | Speculative decoding |
|---|---|---|
| Target distribution | p(x) | Target model p(x_t \| x_<t) |
| Proposal distribution | q(x) | Draft model q(x_t \| x_<t) |
| Envelope constant | M with p <= M q | None required |
| Acceptance rule | u <= p(x) / (M q(x)) | u <= min(1, p(x) / q(x)) |
| Behavior on rejection | Discard and retry | Resample from normalize(max(0, p - q)) |
| Distribution of output | Exactly p | Exactly p |
| Typical speedup | Depends on M | 2x to 3x on modern LLMs |
Speculative decoding departs from classical rejection sampling in one important way: it does not require an envelope constant M. Instead, the accept-reject rule is asymmetric (the ratio p/q is clipped at 1), and the leftover probability mass max(0, p(x) - q(x)) is explicitly redistributed on rejection. This trick guarantees distributional equality without needing a worst-case bound on p/q, which would otherwise be impossible to compute over the exponentially large token space [8][9]. The technique is now the standard production-grade method for accelerating LLM inference in systems including vLLM, TensorRT-LLM, and the Gemini and Claude serving stacks.
Rejection sampling fine-tuning, sometimes abbreviated RFT, is a technique for improving LLM reasoning ability introduced by Yuan and colleagues in 2023 [10]. The idea is to use rejection sampling at the level of entire generated solutions rather than individual tokens. For each training problem the model produces a large number of candidate chains of thought, each candidate is graded for correctness (for example, by checking the final numerical answer against the ground truth), and only correct solutions are kept for supervised fine-tuning. Incorrect solutions are rejected.
Because the acceptance criterion in RFT is deterministic (correct vs. incorrect), the procedure is effectively rejection sampling from the conditional distribution of model outputs given that the output is correct. Yuan and colleagues used RFT to push LLaMA 7B from 35.9% to 49.3% accuracy on GSM8K by combining rejection samples from several fine-tuned models [10]. Follow-up work (DART-Math, Difficulty-Aware Rejection Tuning, and others) showed that RFT benefits weaker base models more than stronger ones and remains competitive with more elaborate preference-learning methods such as RLHF and DPO in reasoning tasks.
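The RFT loop reduces to generate, grade, filter. The sketch below is a schematic with hypothetical `generate_solutions` and `is_correct` stand-ins (toy arithmetic problems in place of a model and a grader), not the authors' actual pipeline.

```python
import random

def rejection_sampling_finetune_data(problems, generate_solutions, is_correct, k=16):
    """For each problem, sample k candidate solutions and keep only those
    graded correct -- rejection sampling at the whole-solution level."""
    dataset = []
    for prob in problems:
        for sol in generate_solutions(prob, k):
            if is_correct(prob, sol):               # deterministic accept rule
                dataset.append({"problem": prob, "solution": sol})
    return dataset

# Toy stand-ins: 'solutions' are noisy guessed sums for addition problems.
def generate_solutions(prob, k, rng=random.Random(4)):
    a, b = prob
    return [a + b + rng.choice([-1, 0, 0, 0, 1]) for _ in range(k)]

def is_correct(prob, sol):
    return sol == prob[0] + prob[1]

data = rejection_sampling_finetune_data([(2, 3), (10, 7)], generate_solutions, is_correct)
```

The surviving dataset is then used for ordinary supervised fine-tuning; because the filter is deterministic, the kept solutions are draws from the model's output distribution conditioned on correctness, as described above.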
A complementary variant, reinforced self-training with rejection (sometimes called ReST), iterates the RFT loop: each round uses the latest fine-tuned model to generate fresh candidates, rejects incorrect ones, and trains on the survivors. This is a close relative of expert iteration in reinforcement learning and is used in the training pipelines of several state-of-the-art reasoning LLMs.
Rejection sampling is also the simplest baseline for constrained decoding, in which generated outputs must satisfy hard constraints (for example, being valid JSON, matching a regular expression, or staying within a length budget). The naive approach samples completions from the model and rejects any that fail the constraint check. While simple, this can be catastrophically slow when the constraint is rare, because it inherits the high-dimensional curse of dimensionality. More efficient constrained decoders (grammar-constrained decoding, FSM-constrained decoding) replace rejection with direct probability masking of the next-token distribution, which can be viewed as performing the rejection step in closed form at every step.
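The naive baseline is literally a loop around the constraint check. The sketch below uses a stub `sample_completion` in place of a real model, with valid-JSON as the hard constraint; the `max_tries` cap guards against the runaway-rejection failure mode when the constraint is rare.

```python
import json
import random

def constrained_sample(sample_completion, is_valid, max_tries=1000):
    """Naive constrained decoding: resample whole completions until
    the constraint holds (or give up)."""
    for _ in range(max_tries):
        text = sample_completion()
        if is_valid(text):
            return text
    raise RuntimeError("constraint too rare for naive rejection sampling")

def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Stub 'model': emits valid JSON only some of the time.
rng = random.Random(5)
candidates = ['{"ok": true}', "{'bad': quotes}", "[1, 2, 3]", "not json"]
sample_completion = lambda: rng.choice(candidates)
out = constrained_sample(sample_completion, is_valid_json)
```

Grammar- or FSM-constrained decoders avoid this loop entirely by masking invalid next tokens at every step, which amounts to performing the rejection in closed form before sampling.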