Best-of-N sampling
Last reviewed
Jun 8, 2026
Sources
11 citations
Review status
Source-backed
Revision
v1 · 1,897 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 8, 2026
Sources
11 citations
Review status
Source-backed
Revision
v1 · 1,897 words
Add missing citations, update stale details, or suggest a clearer explanation.
Best-of-N sampling (BoN), also written best-of-n or best-of-k, is an inference-time method for improving the quality of large language model outputs. For a given prompt the model draws N independent candidate responses, scores each one, and returns the single highest-scoring candidate. The scorer is most often a reward model or verifier, though it can also be a deterministic checker or a voting rule. Because BoN requires no extra training of the base model and improves measured output quality over a wide range of N, it is one of the most widely used baselines both for aligning model outputs with human preferences at decode time and for test-time compute scaling [1][2].
The starting intuition is that a model's per-sample probability of producing a good answer may be modest, but the probability that at least one of N samples is good rises quickly with N; a reliable scorer then surfaces that good sample. BoN is simple yet theoretically rich. It is tightly connected to RLHF: keeping the best of N samples implicitly defines a reward-tilted policy whose divergence from the base model grows with N, paralleling the KL-regularized objective of policy optimization [3][4]. It also seeds derived methods, most notably Best-of-N distillation (BOND), which trains a policy to imitate the BoN distribution so that BoN-level quality is obtained without the N-fold inference cost [5]. The same "sample many, keep the best" logic has an adversarial counterpart in Best-of-N jailbreaking [6].
BoN has three steps:
The cost of BoN is essentially linear in N: N generations plus N scoring passes. Sampling and scoring can be parallelized, so latency need not grow linearly, but total compute does. Trading this linear compute increase for higher accuracy is what makes BoN a canonical test-time-scaling baseline [1][7].
Two distinctions matter for understanding BoN's behavior:
| Selection rule | Scorer used | Needs a trained reward model | Notes |
|---|---|---|---|
| Majority vote (self-consistency) | answer frequency | No | Needs comparable, extractable final answers |
| Reward-model Best-of-N | outcome reward model or verifier | Yes | Returns the single highest-reward sample |
| Weighted Best-of-N | reward or process reward model | Yes | Sums score per distinct answer; often strongest |
| Oracle / pass@k (coverage) | ground-truth checker | No | Upper bound only; needs verifiable answers |
BoN is an inference-time alternative to reinforcement-learning policy optimization. Standard RLHF maximizes a KL-regularized objective, expected reward minus a penalty for drifting from the reference policy: maximize E[r(x,y)] - beta * KL(pi || pi_ref). Its closed-form optimum is the reward-tilted distribution pi*(y|x) proportional to pi_ref(y|x) * exp(r(x,y) / beta), where the temperature beta sets how hard the policy chases reward at the expense of KL divergence [3][4].
BoN reaches a related reward-tilted distribution without changing any weights. Choosing a larger N pushes the selected-sample distribution further from pi_ref, so N plays a role analogous to 1/beta. A widely cited analytic estimate sets the KL divergence of the BoN policy from the reference at log n - (n - 1)/n. Beirami et al. (January 2024) showed this expression is an upper bound rather than an exact equality, gave a tighter estimator, and proved the BoN policy's win rate against the reference is at most n/(n + 1) [4]. Empirically, BoN traces a reward-versus-KL frontier comparable to RL, and in the low-KL regime it often matches or beats KL-regularized policy optimization, at the price of N-fold inference compute [3].
BoN is closely related to rejection sampling. Selecting the top candidate out of N is sometimes loosely called rejection sampling, although it differs from classical statistical rejection sampling, which probabilistically accepts or rejects samples to reproduce a target density exactly; BoN instead deterministically keeps the top-1 of N. The same operation is also used during training: rejection-sampling fine-tuning generates N samples per prompt, keeps the best by reward, and fine-tunes the model on those winners. This data-generation loop, used for example in Meta's Llama 2 alignment pipeline (July 2023), is BoN applied at training time rather than inference time.
BoN is the simplest form of parallel test-time scaling, trading more samples for accuracy. Cobbe et al. (October 2021), introducing the GSM8K grade-school-math benchmark, trained an outcome-supervised verifier and selected the best of N candidate solutions; a 6B-parameter generator plus verifier matched a fine-tuned 175B model, an early demonstration that inference-time verification can partly substitute for model scale [8]. Brown et al. (July 2024), in "Large Language Monkeys," showed that coverage scales log-linearly with N over four orders of magnitude: on SWE-bench Lite, DeepSeek-Coder-V2-Instruct rose from 15.9% with one sample to 56% with 250 samples. They also documented the selection bottleneck, that reward-model and majority-vote selectors plateau in domains where automatic verifiers do not [1].
Snell et al. (August 2024) framed this as compute-optimal scaling: the best way to spend a fixed inference budget depends on problem difficulty, and for many problems a smaller model given enough test-time compute (BoN search against a verifier, or sequential revisions) can outperform a much larger model. They distinguished parallel scaling (independent samples, as in BoN) from sequential scaling (iterative revision of one trajectory), noting the two are complementary [7]. This line of work underpins the "reasoning model" trend popularized by systems such as o1 and DeepSeek-R1, although those rely more on long chains of thought than on explicit BoN.
When the final answer can be extracted and compared (a number, a label, a short string), the scorer can be replaced by a vote. Self-consistency (Wang et al., March 2022) samples many chain-of-thought reasoning paths and returns the most frequent final answer, marginalizing over the reasoning. This is BoN with majority vote as the selector and needs no reward model, but it applies only where answers are directly comparable [9].
Weighted Best-of-N, or weighted majority voting, combines the two ideas: each candidate's reward or verifier score is summed per distinct answer and the highest-scoring answer is returned. Empirically this often beats both plain majority voting and plain reward-model BoN, especially when paired with a process reward model that grades intermediate steps. Lightman et al. (May 2023), in "Let's Verify Step by Step," showed that such step-level verifiers select better candidates than outcome-only verifiers on the MATH benchmark [10].
BoN's drawback is paying for N generations at every query. BOND (Sessa et al., Google DeepMind, July 2024) removes that recurring cost by training the policy itself to match the BoN distribution. It casts alignment as distribution matching against the analytic best-of-N distribution and uses the Jeffreys divergence, a combination of forward and backward KL that balances mode-covering and mode-seeking behavior. The practical iterative variant, J-BOND, distills the best-of-N of a slowly moving anchor policy, yielding BoN-level reward with single-sample inference [5].
The same scaling principle can be turned against safety filters. Best-of-N jailbreaking (Hughes et al., December 2024) is a black-box attack that repeatedly samples augmented variants of a harmful prompt (random capitalization, word shuffling, and similar perturbations for text, with analogous augmentations for image and audio) until the target model complies. Attack success rate grows with N along a power law; the authors report 89% on GPT-4o and 78% on Claude 3.5 Sonnet at 10,000 sampled prompts, with the method also working across vision and audio inputs [6].
The central limitation is reward over-optimization, an instance of reward hacking and Goodhart's law: as N increases, BoN searches harder over the scorer's errors, so the proxy reward keeps rising while true quality eventually falls. Gao et al. (October 2022) measured this directly, fitting the gold reward as R_bon(d) = d * (alpha_bon - beta_bon * d) for best-of-n versus R_RL(d) = d * (alpha_RL - beta_RL * log d) for RL, where d is the square root of the KL divergence from the initial policy; both curves rise, peak, and then decline, quantifying when more optimization hurts [3]. Mitigations include reward-model ensembles, which dampen but do not eliminate the effect (Eisenstein et al., December 2023) [11], constraining N, and using verifiable rather than learned scorers where possible.
Other limitations follow from the method's structure. Cost scales linearly with N, which can be prohibitive at the thousands-of-samples scale used in some studies. BoN cannot exceed its sampling support: if no drawn sample is correct, no selector can recover the right answer (the coverage ceiling). Selection quality for learned scorers plateaus well before coverage does [1]. Outputs inherit any bias or miscalibration of the reward model. Finally, BoN is purely parallel, with no information shared across candidates, so on hard multi-step problems sequential methods that revise a single trajectory can be more compute-efficient [7].