Self-consistency
Last reviewed
May 19, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,441 words
Self-consistency is a decoding strategy for large language models introduced by Wang et al. (2022) in the paper "Self-Consistency Improves Chain of Thought Reasoning in Language Models."[1] The method replaces the standard greedy decoding used in chain-of-thought prompting with a "sample-and-marginalize" procedure: multiple reasoning paths are sampled from the language model, and the final answer is selected by majority vote over the resulting answers. The technique is entirely unsupervised, requires no fine-tuning, and works off the shelf with any pre-trained language model that supports stochastic decoding.[1]
Self-consistency was a foundational result in the broader research program on test-time compute for reasoning. By spending additional inference compute on multiple sampled paths rather than a single greedy one, the method delivers large accuracy gains on arithmetic and commonsense reasoning benchmarks. On GSM8K with PaLM-540B, self-consistency raised greedy chain-of-thought accuracy from 56.5% to 74.4%, a +17.9 absolute point improvement.[1] The paper was accepted as a poster at ICLR 2023.[2]
Chain-of-thought (CoT) prompting, introduced by Wei et al. (2022), conditions a language model on few-shot exemplars that demonstrate step-by-step reasoning leading to a final answer. Instead of asking the model to emit an answer directly, the prompt encourages it to generate intermediate steps that mimic the reasoning a person might follow. CoT prompting significantly improves performance on multi-step arithmetic, symbolic, and commonsense tasks, particularly with very large models.[1]
The standard decoding strategy used alongside CoT prompting has been greedy decoding, in which the most probable next token is chosen at every step. Greedy decoding generates a single deterministic reasoning trace and a single final answer.[1]
Modern language models support several alternative decoding strategies. Temperature sampling rescales the model's output distribution by a temperature parameter T before sampling, with T < 1 sharpening the distribution and T > 1 flattening it. Top-p (nucleus) sampling restricts sampling to the smallest set of tokens whose cumulative probability exceeds p. Top-k sampling restricts sampling to the k highest-probability tokens. Beam search maintains the top-B partial sequences ranked by joint log-probability rather than sampling.[1] Each method trades off diversity against fidelity to high-probability outputs in a different way.
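The decoding strategies above can be illustrated with a small Python sketch. It is a minimal example that operates on a plain list of logits rather than a real model, and the function names `sample_token` and `greedy_token` are assumptions made for illustration, not part of any library or of the paper.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token index from raw logits with optional truncation."""
    # Temperature scaling: T < 1 sharpens, T > 1 flattens the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank token indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]  # keep only the k most probable tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:  # smallest prefix whose mass reaches p
                break
        order = kept

    # random.choices renormalizes the surviving weights internally.
    return rng.choices(order, weights=[probs[i] for i in order], k=1)[0]

def greedy_token(logits):
    """Greedy decoding: always pick the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

logits = [2.0, 1.0, 0.1, -1.0]  # toy values, not from a real model
rng = random.Random(0)
print(greedy_token(logits))
print([sample_token(logits, temperature=0.7, top_k=2, rng=rng) for _ in range(5)])
```

With `top_k=2`, only the two highest-scoring indices can ever be drawn, while greedy decoding always returns the same index.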
For tasks with a unique correct answer, researchers have historically defaulted to greedy decoding because it produces the single most likely continuation. Self-consistency challenges this default by showing that strategically introducing diversity via sampling and then aggregating answers can outperform greedy decoding on reasoning tasks.[1]
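Self-consistency is described in the paper as a three-step procedure: prompt the language model with chain-of-thought exemplars, sample a diverse set of reasoning paths from the model's decoder in place of greedy decoding, and marginalize out the reasoning paths by selecting the most consistent final answer.[1]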
Formally, let (r_i, a_i) denote the i-th sampled (rationale, answer) pair, where i = 1, ..., m. After parsing the final answer a_i from each sampled output, self-consistency selects:
a* = argmax_a sum_{i=1}^{m} 1[a_i == a]
That is, a* is the answer that appears most frequently across the m samples. The paper notes that this unweighted majority vote achieves accuracy very close to that of a weighted vote using length-normalized model probabilities as weights, and that both far outperform a vote weighted by unnormalized probabilities.[1]
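The selection rule above amounts to counting parsed answers. The following Python sketch shows one way to write it; the function name `majority_vote` and the sample answers are illustrative assumptions.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among the parsed final answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common orders by count; equal counts keep first-seen order.
    return counts.most_common(1)[0][0]

sampled = ["18", "26", "18", "18", "9"]  # hypothetical parsed answers a_1..a_5
print(majority_vote(sampled))  # prints 18
```

Ties are broken here in favor of the answer seen first, which is a choice of this sketch rather than something the paper specifies.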
The original paper used m = 40 sampled reasoning paths as its primary setting. Sampling hyperparameters varied by model: UL2-20B and LaMDA-137B were sampled with temperature T = 0.5 and top-k k = 40; PaLM-540B used T = 0.7 with k = 40; and GPT-3 used T = 0.7 with no top-k truncation. The authors report that the method is generally robust to the choice of sampling strategy and parameters.[1]
Self-consistency relies on the empirical observation that, for a well-prompted language model, correct reasoning paths tend to agree in their final answer while incorrect paths tend to disperse across many wrong answers. Even when individual sampled paths contain mistakes, their final answers concentrate near the truth when the model has the underlying competence to solve the problem. The wrong answers, by contrast, are scattered across the answer space.[1]
This is closely analogous to the human intuition that arriving at the same answer through several different routes increases confidence in that answer. The paper attributes the underlying cognitive intuition to dual-process accounts of reasoning, in which deliberate analytic thinking admits multiple valid solution paths.[1]
The paper frames self-consistency as a tractable approximation to marginalization over latent reasoning paths.[1] Given a question q and a prompt, the model defines a joint distribution P(r, a | q) over rationales r and answers a. The probability of a particular answer is:
P(a | q) = sum_r P(r, a | q)
Exact marginalization is intractable because the space of rationales is combinatorially large. Self-consistency approximates this sum by Monte Carlo sampling from P(r, a | q) and then computing the empirical answer distribution. The most frequent sampled answer is the Monte Carlo estimate of argmax_a P(a | q).[1]
This view explains why self-consistency tends to outperform greedy decoding even for tasks with a unique correct answer. Greedy decoding approximates the single most probable joint trajectory (r*, a*), and the answer attached to that trajectory is not necessarily the most probable marginal answer. The most probable rationale may be wrong even when the marginal probability of the correct answer is high. By marginalizing, self-consistency selects according to the posterior over answers rather than over (rationale, answer) pairs.[1]
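A toy example makes the gap between the joint and marginal argmax concrete. The joint distribution below is invented for illustration and does not come from the paper: one rationale is individually the most probable, but three others agree on a different answer.

```python
import random
from collections import Counter

# Hypothetical joint distribution P(r, a | q) over four rationales.
joint = {
    ("r1", "A"): 0.30,
    ("r2", "B"): 0.25,
    ("r3", "B"): 0.25,
    ("r4", "B"): 0.20,
}

def joint_argmax(joint):
    """Answer attached to the single most probable (rationale, answer) pair."""
    (_, answer), _ = max(joint.items(), key=lambda kv: kv[1])
    return answer

def exact_marginal_argmax(joint):
    """Answer with the highest marginal probability P(a | q)."""
    marginal = Counter()
    for (_, answer), p in joint.items():
        marginal[answer] += p
    return max(marginal, key=marginal.get)

def sampled_marginal_argmax(joint, m, rng):
    """Monte Carlo estimate: sample m pairs, then majority-vote the answers."""
    pairs = list(joint)
    draws = rng.choices(pairs, weights=[joint[p] for p in pairs], k=m)
    votes = Counter(answer for _, answer in draws)
    return votes.most_common(1)[0][0]

print(joint_argmax(joint))           # prints A (probability 0.30)
print(exact_marginal_argmax(joint))  # prints B (marginal 0.70)
print(sampled_marginal_argmax(joint, m=40, rng=random.Random(0)))
```

Here the most probable single path ends in answer A, while answer B holds 70% of the marginal mass, so a majority vote over enough samples will almost always return B.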
The authors found that weighting samples by their normalized log-probability yields accuracy similar to unweighted majority vote, while weighting by unnormalized log-probability hurts performance substantially. They attribute this to the fact that longer rationales receive lower unnormalized log-probabilities, biasing the vote toward short, often incorrect, paths.[1]
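The length bias can be seen in a small sketch. It assumes each sample carries its per-token log-probabilities, uses the mean per-token log-probability as the normalized weight, and uses made-up numbers; the function name `weighted_vote` is hypothetical.

```python
import math
from collections import defaultdict

def weighted_vote(samples, normalize):
    """samples: list of (answer, token_logprobs) pairs."""
    scores = defaultdict(float)
    for answer, logprobs in samples:
        total = sum(logprobs)
        if normalize:
            total /= len(logprobs)  # length-normalized: mean per-token log-prob
        scores[answer] += math.exp(total)
    return max(scores, key=scores.get)

# Two long paths agree on "18"; one short path says "26" (toy values).
samples = [
    ("18", [-0.2] * 20),
    ("18", [-0.2] * 20),
    ("26", [-0.5] * 4),
]
print(weighted_vote(samples, normalize=True))   # prints 18
print(weighted_vote(samples, normalize=False))  # prints 26
```

Without normalization, each 20-token path contributes about exp(-4) while the 4-token path contributes about exp(-2), so the single short path outweighs the two longer paths that agree with each other.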
The original paper evaluated self-consistency on four language models of varying scales: UL2-20B, GPT-3 (with the code-davinci-001 and code-davinci-002 engines), LaMDA-137B, and PaLM-540B. Benchmarks covered arithmetic, commonsense, and symbolic reasoning.[1]
For PaLM-540B with chain-of-thought prompting (greedy) versus self-consistency (40 samples), the paper reports:[1]
For GPT-3 code-davinci-002, the largest measured GPT-3 engine, self-consistency raised GSM8K from 60.1% to 78.0% (+17.9), SVAMP from 75.8% to 86.8% (+11.0), and AQuA from 39.8% to 52.0% (+12.2). For the larger models, which already achieve high accuracy on most tasks (GPT-3 and PaLM-540B), the absolute gain from self-consistency was about +12 to +18 points on AQuA and GSM8K, and about +7 to +11 points on SVAMP and ASDiv.[1]
The headline GSM8K improvement on PaLM-540B (56.5% to 74.4%) achieved a new state of the art for the benchmark at the time of publication, despite the method being unsupervised and task-agnostic. Prior SoTA on GSM8K had relied on supervised fine-tuning with thousands of examples and a separately trained verifier model.[1]
Self-consistency also delivered gains on tasks where the final answer is a discrete label:[1]
The +6.4% headline number for StrategyQA reported in the abstract matches the GPT-3 code-davinci-002 gain; the PaLM-540B gain on StrategyQA was +6.3 points.[1] Self-consistency reached a new state of the art on five of the six commonsense and symbolic reasoning tasks evaluated.[1]
The paper plotted mean accuracy and standard deviation over 10 runs for varying numbers of sampled paths m ∈ {1, 5, 10, 20, 40}. Performance improves monotonically with m, with gains diminishing as m grows. Most of the improvement is captured by the first 5 to 10 paths, with the remaining samples providing smaller marginal gains.[1]
The authors recommend trying a small number of paths (5 or 10) as a starting point to realize most of the accuracy gain while limiting compute cost, and note that performance generally saturates quickly.[1] This saturation behavior reflects the fact that majority vote with m samples is a noisy estimator of the underlying answer posterior, and the variance of the estimator falls roughly as 1/m.
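The saturation behavior can be reproduced with a small simulation. The per-sample answer distribution below is an invented example, not data from the paper, and `vote_accuracy` is a hypothetical helper name.

```python
import random
from collections import Counter

def vote_accuracy(answer_probs, correct, m, trials, rng):
    """Fraction of trials in which an m-sample majority vote is correct."""
    answers = list(answer_probs)
    weights = [answer_probs[a] for a in answers]
    hits = 0
    for _ in range(trials):
        draws = rng.choices(answers, weights=weights, k=m)
        winner = Counter(draws).most_common(1)[0][0]
        hits += winner == correct
    return hits / trials

# Hypothetical model: the correct answer "18" gets 40% of the samples,
# and the remaining mass is spread over several wrong answers.
answer_probs = {"18": 0.40, "26": 0.25, "9": 0.20, "42": 0.15}
rng = random.Random(0)
for m in (1, 5, 10, 20, 40):
    print(m, vote_accuracy(answer_probs, "18", m, trials=2000, rng=rng))
```

Under these assumptions a single sample is right about 40% of the time, and the vote becomes more reliable as m grows, with each additional batch of samples adding less than the one before.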
Sampling-based self-consistency yields consistent improvements across all four model scales tested, but the gains are larger for larger models. The paper notes that the absolute accuracy gain on AQuA and GSM8K reaches +12 to +18 points only at the PaLM-540B and GPT-3 code-davinci-002 scales, while smaller models show more modest gains.[1] The authors attribute this to the fact that small models have limited reasoning ability and may not generate diverse correct paths even when sampling.[1]
The paper includes ablations showing that self-consistency is robust to:
The paper compares self-consistency against several baselines on equivalent compute budgets:[1]
The paper additionally evaluates self-consistency on NLP tasks where chain-of-thought sometimes hurts accuracy compared to standard prompting (for example, ANLI-R1, e-SNLI, RTE). On these tasks the authors find that self-consistency robustly recovers the standard-prompting performance and exceeds it.[1]
The paper observes that the consistency rate (fraction of samples agreeing with the final majority answer) correlates strongly with accuracy on GSM8K. This suggests self-consistency can double as an uncertainty estimate: when the model's sampled answers strongly agree, the answer is more likely correct; when they disperse, the model is more likely wrong. The authors describe this as giving the model some ability to "know when it doesn't know."[1]
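The consistency rate is simple to compute alongside the vote. The sketch below also shows one possible use, abstaining when agreement is low; the 0.5 threshold and the function names are assumptions for illustration, not recommendations from the paper.

```python
from collections import Counter

def vote_with_confidence(answers):
    """Return (majority answer, fraction of samples that agree with it)."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)

def answer_or_abstain(answers, min_agreement=0.5):
    """Return the majority answer, or None when agreement is too low."""
    answer, rate = vote_with_confidence(answers)
    return answer if rate >= min_agreement else None

print(vote_with_confidence(["18", "18", "18", "26"]))  # prints ('18', 0.75)
print(answer_or_abstain(["1", "2", "3", "4"]))         # prints None
```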
A practical limitation of the original self-consistency procedure is that it requires a way to extract a discrete final answer from each sample so that votes can be tallied. This works well for math problems (the answer is a single number), multiple-choice tasks (the answer is a letter or label), and other tasks with a fixed answer set, but fails for open-ended generation such as summarization, code, or free-form question answering.[3]
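For tasks with numeric or multiple-choice answers, the extraction step is often a pattern match on the end of each completion. The sketch below assumes completions end with a sentence of the form "The answer is X.", as chain-of-thought exemplars commonly do; the regular expression and the name `extract_answer` are illustrative assumptions.

```python
import re

# Matches a number (optionally with $ and thousands separators) or a
# choice letter A-E after the phrase "The answer is".
ANSWER_PATTERN = re.compile(
    r"[Tt]he answer is\s*\(?\$?(-?\d[\d,]*(?:\.\d+)?|[A-E])\b"
)

def extract_answer(completion):
    """Return the last stated final answer as a string, or None."""
    matches = ANSWER_PATTERN.findall(completion)
    if not matches:
        return None
    return matches[-1].replace(",", "")  # drop thousands separators

print(extract_answer("She has 9 * 2 = 18 dollars. The answer is 18."))  # 18
print(extract_answer("So the total is large. The answer is $1,250.50."))  # 1250.50
print(extract_answer("Option B fits best. The answer is (B)."))  # B
print(extract_answer("I am not sure."))  # None
```

This sketch compares answers as strings, so "18" and "18.0" would count as different votes; a real pipeline may need task-specific normalization before tallying.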
Universal Self-Consistency (USC), introduced by Chen et al. (2023), addresses this limitation by replacing the explicit answer-extraction-and-vote step with a single additional LLM call.[3] After sampling m candidate responses, USC prompts the LLM with all m candidates and asks it to identify the most consistent response. The selected response is returned as the final answer.
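The selection step can be sketched as follows. The prompt wording here is an assumption and is not the exact prompt from the USC paper; `llm` is a hypothetical callable that maps a prompt string to a completion string, and the fallback to the first candidate is a choice made for this sketch.

```python
import re

def build_usc_prompt(question, candidates):
    """Assemble one prompt that shows every candidate response."""
    lines = [f"Question: {question}", ""]
    for i, text in enumerate(candidates, start=1):
        lines.append(f"Response {i}: {text}")
    lines.append("")
    lines.append(
        "Evaluate these responses and select the most consistent one. "
        f"Reply with 'Response N' where N is between 1 and {len(candidates)}."
    )
    return "\n".join(lines)

def universal_self_consistency(question, candidates, llm):
    """Ask the model to pick the most consistent candidate and return it."""
    reply = llm(build_usc_prompt(question, candidates))
    match = re.search(r"Response\s+(\d+)", reply)
    if match and 1 <= int(match.group(1)) <= len(candidates):
        return candidates[int(match.group(1)) - 1]
    return candidates[0]  # fallback when the reply cannot be parsed

# Stub standing in for a real model call.
fake_llm = lambda prompt: "The most consistent one is Response 2."
print(universal_self_consistency("Summarize X.", ["a", "b", "c"], fake_llm))  # prints b
```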
USC was evaluated on mathematical reasoning (GSM8K and MATH), code generation (BIRD-SQL text-to-SQL and Python generation), long-context summarization, and open-ended question answering.[3] Key findings:[3]
USC suffers from some long-context weaknesses of the underlying LLM. The paper notes that accuracy on GSM8K under USC decreased when the number of candidates fed into the final call grew very large (for example, 16 candidates), because the LLM struggles with the long composite prompt; the authors recommend roughly 8 candidates as a practical sweet spot.[3]
USC inherits both the strengths and weaknesses of LLM-as-judge approaches: it is sensitive to the LLM's ability to assess consistency, can exhibit position bias, and may inherit other biases of the model. The original paper argues that consistency assessment is an easier task for an LLM than quality assessment, partially mitigating these concerns.[3]
A follow-up paper by the same group, also Wang et al. (2022), generalized the self-consistency framework into rationale-augmented ensembles.[4] The paper observes that (input, rationale → output) prompting is sensitive to the quality of manually written rationales, and proposes a unified framework that ensembles over multiple sampled rationales. Self-consistency is identified as one instance of this framework (output-space sampling), alongside prompt-order ensembles and input-rationale ensembles. The framework extends to NLP tasks such as question answering, word sense disambiguation, and sentiment analysis that do not traditionally use chain-of-thought.[4]
Huang et al. (2022) proposed using self-consistency as a data-generation procedure for self-improvement: the LLM generates multiple chain-of-thought solutions, self-consistency picks the majority answer, and rationales leading to that majority answer are used as training data to fine-tune the LLM. This approach demonstrates that the labels produced by self-consistency are high-quality enough to act as a pseudo-supervisory signal.[5]
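The filtering step of this procedure can be sketched in a few lines. The function name `select_training_rationales` and the sample data are hypothetical, and the sketch covers only the selection of rationales, not the fine-tuning itself.

```python
from collections import Counter

def select_training_rationales(samples):
    """samples: list of (rationale, answer) pairs for one question.

    Returns the majority answer and the rationales that reached it.
    """
    majority, _ = Counter(answer for _, answer in samples).most_common(1)[0]
    kept = [rationale for rationale, answer in samples if answer == majority]
    return majority, kept

samples = [
    ("9 * 2 = 18", "18"),
    ("9 + 2 = 11", "11"),
    ("2 * 9 = 18", "18"),
]
print(select_training_rationales(samples))
# prints ('18', ['9 * 2 = 18', '2 * 9 = 18'])
```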
Self-consistency is closely related to classical ensemble methods such as bagging in statistical learning, with one important distinction: the m sampled paths come from the same model rather than independently trained models. The paper characterizes self-consistency as a "self-ensemble" that produces ensemble-like gains without the cost of training multiple base models.[1] It also bears family resemblance to minimum Bayes risk decoding for sequence generation, in which a candidate is selected to minimize expected risk under the model distribution.
Self-consistency is one of the canonical demonstrations that test-time compute can be traded for accuracy on reasoning tasks: spending m times the inference cost on parallel sampling produces measurable accuracy gains that often exceed the gains from a comparable increase in model size.[1] This insight underlies a broader research program on test-time scaling that includes tree of thoughts, best-of-N selection with reward models, process reward models for verifier-guided search, and iterative refinement methods.
Modern reasoning models such as the OpenAI o-series are widely described as using internal test-time inference procedures conceptually related to self-consistency, in that they spend additional compute at inference to explore multiple reasoning paths before producing an answer. The specific algorithms used by such proprietary systems have not been fully disclosed.
The original paper contrasts self-consistency with verifier-based approaches such as the trained verifier used by Cobbe et al. (2021) for GSM8K, in which a separately trained scoring model ranks candidate solutions and selects the best.[1] Verifier-based methods can outperform plain self-consistency when the verifier has been trained on enough high-quality data, but require additional supervision, training compute, and a separate model. Self-consistency requires no additional model or annotation, only inference compute. The two approaches can be combined: candidates can be filtered by verifier score before majority vote, or the verifier can act as a tiebreaker. Process reward models extend verifier-based methods by scoring intermediate steps rather than only final answers.
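One way to combine the two approaches is sketched below: candidates under a score threshold are dropped, the rest are majority-voted, and the summed verifier score breaks ties. The scores, the 0.5 threshold, and the function name `verifier_filtered_vote` are assumptions for illustration; the sketch does not reproduce any specific published method.

```python
from collections import Counter

def verifier_filtered_vote(candidates, min_score=0.5):
    """candidates: list of (answer, verifier_score) pairs.

    Drops low-scoring candidates, then takes a majority vote; the
    summed verifier score breaks ties between equally frequent answers.
    """
    # If the filter removes everything, fall back to all candidates.
    kept = [(a, s) for a, s in candidates if s >= min_score] or candidates
    counts = Counter(a for a, _ in kept)
    score_sum = Counter()
    for a, s in kept:
        score_sum[a] += s
    return max(counts, key=lambda a: (counts[a], score_sum[a]))

candidates = [("18", 0.9), ("26", 0.8), ("26", 0.2), ("26", 0.1), ("18", 0.7)]
print(verifier_filtered_vote(candidates))                 # prints 18
print(verifier_filtered_vote(candidates, min_score=0.0))  # prints 26
```

With the filter disabled, the plain majority picks "26" (three votes to two); with the filter, two low-scoring "26" candidates are removed and "18" wins.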
The paper and follow-up work identify several limitations of self-consistency:[1][3]
A typical implementation of self-consistency follows the structure described in the original paper:[1]
Sampling can be parallelized across the m paths since each is independent given the prompt. This is in contrast to beam search, which requires synchronization across beams at each decoding step. Self-consistency therefore parallelizes more easily on accelerator hardware and on serving infrastructure that batches independent requests.
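Because the samples are independent, a client can issue them concurrently. The following sketch uses a thread pool from the Python standard library; `sample_fn` and `parse_fn` are hypothetical callables supplied by the caller (for example, a wrapper around a model API and an answer parser), and the default `max_workers` value is arbitrary.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def self_consistency(prompt, sample_fn, parse_fn, m=40, max_workers=8):
    """Sample m completions concurrently, parse them, and majority-vote.

    sample_fn(prompt) -> str and parse_fn(str) -> answer or None are
    hypothetical callables supplied by the caller.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each call is independent, so no coordination between samples is needed.
        completions = list(pool.map(sample_fn, [prompt] * m))
    # Discard completions from which no final answer could be parsed.
    answers = [a for a in map(parse_fn, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```

Threads suit the case where `sample_fn` waits on a remote service; a serving stack that batches requests on the accelerator could instead submit all m samples as one batch.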
For long-context settings where each chain of thought is itself long, the memory cost of holding m independent generation states grows with m. Practical deployments tune m and the per-sample length budget to fit serving constraints.
Self-consistency has been an unusually influential paper in prompt engineering and reasoning research:[1][3][4]
The original paper has been cited many thousands of times since its 2022 arXiv release and 2023 ICLR publication, and the technique is widely implemented in evaluation harnesses and serving stacks for reasoning models.[2]