Best-of-N sampling

Machine Learning Reinforcement Learning

11 min read

Updated Jun 28, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 28, 2026

Fact-checked

In review queue

Sources

12 citations

Revision

v2 · 2,151 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Best-of-N sampling (BoN) is an inference-time method that improves a large language model output by drawing N independent candidate responses to the same prompt, scoring each with a reward model, verifier, or voting rule, and returning the single highest-scoring candidate. It requires no extra training of the base model, its measured quality typically rises with N, and it is the canonical baseline for both decode-time preference alignment and parallel test-time compute scaling ^[1]^[7]. The method is also called best-of-n or best-of-k, and its training-time cousin is rejection-sampling fine-tuning.

What is best-of-N sampling?

Best-of-N sampling (BoN), also written best-of-n or best-of-k, is an inference-time method for improving the quality of large language model outputs. For a given prompt the model draws N independent candidate responses, scores each one, and returns the single highest-scoring candidate. The scorer is most often a reward model or verifier, though it can also be a deterministic checker or a voting rule. Because BoN requires no extra training of the base model and improves measured output quality over a wide range of N, it is one of the most widely used baselines both for aligning model outputs with human preferences at decode time and for test-time compute scaling ^[1]^[2].

The starting intuition is that a model's per-sample probability of producing a good answer may be modest, but the probability that at least one of N samples is good rises quickly with N; a reliable scorer then surfaces that good sample. BoN is simple yet theoretically rich. It is tightly connected to RLHF: keeping the best of N samples implicitly defines a reward-tilted policy whose divergence from the base model grows with N, paralleling the KL-regularized objective of policy optimization ^[3]^[4]. It also seeds derived methods, most notably Best-of-N distillation (BOND), which trains a policy to imitate the BoN distribution so that BoN-level quality is obtained without the N-fold inference cost ^[5]. The same "sample many, keep the best" logic has an adversarial counterpart in Best-of-N jailbreaking ^[6].

How does best-of-N sampling work?

BoN has three steps:

Sampling. Draw N candidate completions independently from the base policy for the prompt, usually at temperature greater than zero so the samples are diverse.
Scoring. Assign a numeric score to each candidate. The scorer is most often a learned outcome reward model or verifier, but can be a process reward model that grades intermediate reasoning steps, a deterministic checker such as unit tests or a proof assistant, or a voting rule over extracted final answers.
Selection. Return the candidate with the highest score.

The cost of BoN is essentially linear in N: N generations plus N scoring passes. Sampling and scoring can be parallelized, so latency need not grow linearly, but total compute does. Trading this linear compute increase for higher accuracy is what makes BoN a canonical test-time-scaling baseline ^[1]^[7].

Two distinctions matter for understanding BoN's behavior:

Selection versus coverage. BoN must commit to a single answer using only its scorer. This differs from coverage, or pass@k, which asks whether any of the N samples is correct under a ground-truth oracle. Coverage is an upper bound on what any selector can reach and keeps rising with N across several orders of magnitude, but practical selectors such as reward models and majority voting tend to plateau after a few hundred samples ^[1].
The scorer is the ceiling. BoN can be no better than the signal it selects on. With sound automatic verifiers (code that compiles and passes tests, machine-checked proofs) accuracy scales cleanly with N; with imperfect learned reward models, gains are bounded by reward-model error and ultimately reverse through over-optimization (see Limitations) ^[1]^[3].

Selection rule	Scorer used	Needs a trained reward model	Notes
Majority vote (self-consistency)	answer frequency	No	Needs comparable, extractable final answers
Reward-model Best-of-N	outcome reward model or verifier	Yes	Returns the single highest-reward sample
Weighted Best-of-N	reward or process reward model	Yes	Sums score per distinct answer; often strongest
Oracle / pass@k (coverage)	ground-truth checker	No	Upper bound only; needs verifiable answers

How does best-of-N relate to RLHF and rejection sampling?

BoN is an inference-time alternative to reinforcement-learning policy optimization. Standard RLHF maximizes a KL-regularized objective, expected reward minus a penalty for drifting from the reference policy: maximize E[r(x,y)] - beta * KL(pi || pi_ref). Its closed-form optimum is the reward-tilted distribution pi*(y|x) proportional to pi_ref(y|x) * exp(r(x,y) / beta), where the temperature beta sets how hard the policy chases reward at the expense of KL divergence ^[3]^[4].

BoN reaches a related reward-tilted distribution without changing any weights. Choosing a larger N pushes the selected-sample distribution further from pi_ref, so N plays a role analogous to 1/beta. A widely cited analytic estimate sets the KL divergence of the BoN policy from the reference at log n - (n - 1)/n. Beirami et al. (January 2024) showed this expression is an upper bound rather than an exact equality, gave a tighter estimator, and proved the BoN policy's win rate against the reference is at most n/(n + 1) ^[4]. Empirically, BoN traces a reward-versus-KL frontier comparable to RL, and in the low-KL regime it often matches or beats KL-regularized policy optimization, at the price of N-fold inference compute. Gao et al. (October 2022) found that "RL consumes much more KL distance than best-of-n for the same amount of optimization," so for a fixed proxy-reward gain BoN moves the policy less far from the base model than reinforcement learning does ^[3].

BoN is closely related to rejection sampling. Selecting the top candidate out of N is sometimes loosely called rejection sampling, although it differs from classical statistical rejection sampling, which probabilistically accepts or rejects samples to reproduce a target density exactly; BoN instead deterministically keeps the top-1 of N. The same operation is also used during training: rejection-sampling fine-tuning generates N samples per prompt, keeps the best by reward, and fine-tunes the model on those winners. This data-generation loop, used for example in Meta's Llama 2 alignment pipeline (July 2023), is BoN applied at training time rather than inference time. In Llama 2 the loop was run only for the 70B model, and smaller models were fine-tuned on its rejection-sampled outputs, distilling the larger model's gains ^[12].

How does best-of-N improve accuracy as a test-time scaling method?

BoN is the simplest form of parallel test-time scaling, trading more samples for accuracy. Cobbe et al. (October 2021), introducing the GSM8K grade-school-math benchmark, trained an outcome-supervised verifier and selected the best of N candidate solutions; a 6B-parameter generator plus verifier matched a fine-tuned 175B model, an early demonstration that inference-time verification can partly substitute for model scale ^[8]. Brown et al. (July 2024), in "Large Language Monkeys," showed that coverage scales log-linearly with N over four orders of magnitude: on SWE-bench Lite, DeepSeek-Coder-V2-Instruct rose from 15.9% with one sample to 56% with 250 samples, above the 43% single-attempt state of the art at the time. As the authors put it, "the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws" ^[1]. They also documented the selection bottleneck, that reward-model and majority-vote selectors plateau in domains where automatic verifiers do not ^[1].

Snell et al. (August 2024) framed this as compute-optimal scaling: the best way to spend a fixed inference budget depends on problem difficulty, and for many problems a smaller model given enough test-time compute (BoN search against a verifier, or sequential revisions) can outperform a much larger model. They distinguished parallel scaling (independent samples, as in BoN) from sequential scaling (iterative revision of one trajectory), noting the two are complementary ^[7]. This line of work underpins the "reasoning model" trend popularized by systems such as o1 and DeepSeek-R1, although those rely more on long chains of thought than on explicit BoN.

What are the main variants of best-of-N?

Self-consistency and majority voting

When the final answer can be extracted and compared (a number, a label, a short string), the scorer can be replaced by a vote. Self-consistency (Wang et al., March 2022) samples many chain-of-thought reasoning paths and returns the most frequent final answer, marginalizing over the reasoning. This is BoN with majority vote as the selector and needs no reward model, but it applies only where answers are directly comparable ^[9].

Weighted Best-of-N

Weighted Best-of-N, or weighted majority voting, combines the two ideas: each candidate's reward or verifier score is summed per distinct answer and the highest-scoring answer is returned. Empirically this often beats both plain majority voting and plain reward-model BoN, especially when paired with a process reward model that grades intermediate steps. Lightman et al. (May 2023), in "Let's Verify Step by Step," showed that such step-level verifiers select better candidates than outcome-only verifiers on the MATH benchmark ^[10].

Best-of-N distillation (BOND)

BoN's drawback is paying for N generations at every query. BOND (Sessa et al., Google DeepMind, July 2024) removes that recurring cost by training the policy itself to match the BoN distribution. It casts alignment as distribution matching against the analytic best-of-N distribution and uses the Jeffreys divergence, a combination of forward and backward KL that balances mode-covering and mode-seeking behavior. The practical iterative variant, J-BOND, distills the best-of-N of a slowly moving anchor policy (it targets the best-of-2 distribution at each step), yielding BoN-level reward with single-sample inference ^[5].

Best-of-N jailbreaking

The same scaling principle can be turned against safety filters. Best-of-N jailbreaking (Hughes et al., December 2024) is a black-box attack that repeatedly samples augmented variants of a harmful prompt (random capitalization, word shuffling, and similar perturbations for text, with analogous augmentations for image and audio) until the target model complies. Attack success rate grows with N along a power law; the authors report 89% on GPT-4o and 78% on Claude 3.5 Sonnet at 10,000 sampled prompts, with the method also working across vision and audio inputs ^[6].

What are the risks and limitations of best-of-N?

The central limitation is reward over-optimization, an instance of reward hacking and Goodhart's law: as N increases, BoN searches harder over the scorer's errors, so the proxy reward keeps rising while true quality eventually falls. Gao et al. (October 2022) measured this directly, fitting the gold reward as R_bon(d) = d * (alpha_bon - beta_bon * d) for best-of-n versus R_RL(d) = d * (alpha_RL - beta_RL * log d) for RL, where d is the square root of the KL divergence from the initial policy; both curves rise, peak, and then decline, quantifying when more optimization hurts ^[3]. Mitigations include reward-model ensembles, which dampen but do not eliminate the effect (Eisenstein et al., December 2023) ^[11], constraining N, and using verifiable rather than learned scorers where possible.

Other limitations follow from the method's structure. Cost scales linearly with N, which can be prohibitive at the thousands-of-samples scale used in some studies. BoN cannot exceed its sampling support: if no drawn sample is correct, no selector can recover the right answer (the coverage ceiling). Selection quality for learned scorers plateaus well before coverage does ^[1]. Outputs inherit any bias or miscalibration of the reward model. Finally, BoN is purely parallel, with no information shared across candidates, so on hard multi-step problems sequential methods that revise a single trajectory can be more compute-efficient ^[7].

References

B. Brown, J. Juravsky, R. Ehrlich, R. Clark, Q. V. Le, C. Re, A. Mirhoseini, "Large Language Monkeys: Scaling Inference Compute with Repeated Sampling," arXiv:2407.21787, July 2024. ↩
R. Nakano et al., "WebGPT: Browser-assisted question-answering with human feedback," arXiv:2112.09332, December 2021. ↩
L. Gao, J. Schulman, J. Hilton, "Scaling Laws for Reward Model Overoptimization," arXiv:2210.10760, October 2022 (ICML 2023). ↩
A. Beirami, A. Agarwal, J. Berant, A. D'Amour, J. Eisenstein, C. Nagpal, A. T. Suresh, "Theoretical guarantees on the best-of-n alignment policy," arXiv:2401.01879, January 2024 (ICML 2025). ↩
P. G. Sessa et al., "BOND: Aligning LLMs with Best-of-N Distillation," arXiv:2407.14622, July 2024. ↩
J. Hughes, S. Price, A. Lynch et al., "Best-of-N Jailbreaking," arXiv:2412.03556, December 2024. ↩
C. Snell, J. Lee, K. Xu, A. Kumar, "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters," arXiv:2408.03314, August 2024. ↩
K. Cobbe et al., "Training Verifiers to Solve Math Word Problems," arXiv:2110.14168, October 2021. ↩
X. Wang et al., "Self-Consistency Improves Chain of Thought Reasoning in Language Models," arXiv:2203.11171, March 2022 (ICLR 2023). ↩
H. Lightman et al., "Let's Verify Step by Step," arXiv:2305.20050, May 2023. ↩
J. Eisenstein et al., "Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking," arXiv:2312.09244, December 2023. ↩
H. Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models," arXiv:2307.09288, July 2023. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

1 revision by 1 contributors · full history

Suggest edit

What links here

AgentHarm ProcessBench Rejection sampling RewardBench

What is best-of-N sampling?

How does best-of-N sampling work?

How does best-of-N relate to RLHF and rejection sampling?

How does best-of-N improve accuracy as a test-time scaling method?

What are the main variants of best-of-N?

Self-consistency and majority voting

Weighted Best-of-N

Best-of-N distillation (BOND)

Best-of-N jailbreaking

What are the risks and limitations of best-of-N?

References

Improve this article

Related Articles

State (Reinforcement Learning)

State-Action Value Function

Action (Reinforcement Learning)

Bellman Equation

Critic

Deep Q-Network (DQN)

What links here

Related Articles

State (Reinforcement Learning)

State-Action Value Function

Action (Reinforcement Learning)

Bellman Equation

Critic

Deep Q-Network (DQN)

What links here