Self-consistency


Updated Jun 22, 2026


Self-consistency is a decoding strategy for large language models that samples multiple chain-of-thought reasoning paths for the same question and returns the answer that the majority of those paths agree on, instead of taking the single greedy path. Introduced by Xuezhi Wang and colleagues at Google in the 2022 paper "Self-Consistency Improves Chain of Thought Reasoning in Language Models," the method is unsupervised, requires no fine-tuning, and works off the shelf with any pretrained model that supports stochastic decoding.^[1] On GSM8K grade-school math, self-consistency raised greedy chain-of-thought accuracy with PaLM-540B from 56.5% to 74.4%, a 17.9 absolute point gain.^[1] The paper's authors describe self-consistency as a "sample-and-marginalize" procedure that "leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer."^[1]

Self-consistency was a foundational result in the broader research program on test-time compute for reasoning. By spending additional inference compute on multiple sampled paths rather than a single greedy one, the method delivers large accuracy gains on arithmetic and commonsense reasoning benchmarks. The paper was accepted as a poster at ICLR 2023.^[2]

What problem does self-consistency solve?

Chain-of-thought prompting

Chain-of-thought (CoT) prompting, introduced by Wei et al. (2022), conditions a language model on few-shot exemplars that demonstrate step-by-step reasoning leading to a final answer. Instead of asking the model to emit an answer directly, the prompt encourages it to generate intermediate steps that mimic the reasoning a person might follow. CoT prompting significantly improves performance on multi-step arithmetic, symbolic, and commonsense tasks, particularly with very large models.^[1]

The standard decoding strategy used alongside CoT prompting has been greedy decoding, in which the most probable next token is chosen at every step. Greedy decoding generates a single deterministic reasoning trace and a single final answer.^[1]

Decoding strategies

Modern language models support several alternative decoding strategies. Temperature sampling rescales the model's output distribution by a temperature parameter T before sampling, with T < 1 sharpening the distribution and T > 1 flattening it. Top-p (nucleus) sampling restricts sampling to the smallest set of tokens whose cumulative probability exceeds p. Top-k sampling restricts sampling to the k highest-probability tokens. Beam search maintains the top-B partial sequences ranked by joint log-probability rather than sampling.^[1] Each method trades off diversity against fidelity to high-probability outputs in a different way.
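The three sampling strategies above can be sketched in a few lines of plain Python. This is a minimal illustration that operates on a toy list of logits or probabilities, not on a real model; the function names are assumptions chosen for this example.

```python
import math
import random

def apply_temperature(logits, T):
    """Softmax of logits / T. T < 1 sharpens the distribution, T > 1 flattens it."""
    scaled = [x / T for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k highest-probability tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def sample_token(probs, rng):
    """Draw one token index from the (possibly truncated) distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = apply_temperature([2.0, 1.0, 0.0], T=0.7)
token = sample_token(top_k_filter(probs, k=2), random.Random(0))
```

Truncation and temperature compose: a sampler typically rescales by T first and then applies top-k or top-p before drawing a token.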

For tasks with a unique correct answer, researchers have historically defaulted to greedy decoding because it commits to the most probable token at every step and so returns one deterministic, high-probability continuation. Self-consistency challenges this default by showing that deliberately introducing diversity via sampling and then aggregating answers can outperform greedy decoding on reasoning tasks.^[1]

How does self-consistency work?

Self-consistency is described in the paper as a three-step procedure:^[1]

Prompt a language model with chain-of-thought exemplars in the same format used by standard CoT prompting.
Sample a diverse set of m candidate outputs (reasoning path plus final answer) from the model's decoder using a stochastic decoding strategy such as temperature, top-k, or nucleus sampling.
Aggregate the m candidate answers by majority vote over the final answer set, selecting the most frequent answer as the model's prediction.

Formally, let (r_i, a_i) denote the i-th sampled (rationale, answer) pair, where i = 1, ..., m. After parsing the final answer a_i from each sampled output, self-consistency selects:

a* = argmax_a sum_{i=1}^{m} 1[a_i == a]

That is, self-consistency returns the answer that appears most frequently across the m samples. The paper reports that this unweighted majority vote reaches accuracy very close to a weighted vote that uses the normalized model log-probabilities as weights, and far outperforms a vote that uses unnormalized log-probabilities.^[1]
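The selection rule above amounts to counting parsed answers. A minimal sketch, assuming the final answers have already been extracted as strings (the function name is illustrative):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    counts = Counter(answers)
    return counts.most_common(1)[0][0]

# Five sampled paths, three of which agree on "18".
prediction = majority_vote(["18", "26", "18", "9", "18"])
```

The tie rule here is an arbitrary choice; it follows from `Counter.most_common` keeping equal counts in first-seen order.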

Sampling configuration in the original paper

The original paper used m = 40 sampled reasoning paths as its primary setting. Sampling hyperparameters varied by model: UL2-20B and LaMDA-137B were sampled with temperature T = 0.5 and top-k k = 40; PaLM-540B used T = 0.7 with k = 40; and GPT-3 used T = 0.7 with no top-k truncation. The authors report that the method is generally robust to the choice of sampling strategy and parameters.^[1]

Why does majority vote work?

Self-consistency relies on the empirical observation that, for a well-prompted language model, correct reasoning paths tend to agree in their final answer while incorrect paths tend to disperse across many wrong answers. Even when some sampled paths contain mistakes, the final answers concentrate on the correct one as long as the model has the underlying competence to solve the problem, because the wrong answers are scattered across the answer space.^[1]

This is closely analogous to the human intuition that arriving at the same answer through several different routes increases confidence in that answer. The paper attributes the underlying cognitive intuition to dual-process accounts of reasoning, in which deliberate analytic thinking admits multiple valid solution paths.^[1]

Theoretical motivation: marginalization over reasoning paths

The paper frames self-consistency as a tractable approximation to marginalization over latent reasoning paths.^[1] Given a question q and a prompt, the model defines a joint distribution P(r, a | q) over rationales r and answers a. The probability of a particular answer is:

P(a | q) = sum_r P(r, a | q)

Exact marginalization is intractable because the space of rationales is combinatorially large. Self-consistency approximates this sum by Monte Carlo sampling from P(r, a | q) and then computing the empirical answer distribution. The most frequent sampled answer is the Monte Carlo estimate of argmax_a P(a | q).^[1]

This view explains why self-consistency tends to outperform greedy decoding even for tasks with a unique correct answer. Greedy decoding commits to one high-probability trajectory (a rationale together with the answer it leads to), and the answer on that single trajectory is not necessarily the most probable marginal answer. A single high-probability rationale may be wrong even when the marginal probability of the correct answer is high. By marginalizing, self-consistency selects according to the posterior over answers rather than over (rationale, answer) pairs.^[1]
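A toy example makes the gap between the best single path and the best marginal answer concrete. The joint distribution below is invented for illustration: the single most probable path ends in a wrong answer, while three less probable paths agree on the right one.

```python
import random
from collections import Counter, defaultdict

# Hypothetical joint distribution P(r, a | q) over (rationale, answer) pairs.
joint = {
    ("path A", "26"): 0.30,  # most probable single path, wrong answer
    ("path B", "18"): 0.25,
    ("path C", "18"): 0.25,
    ("path D", "18"): 0.20,
}

def most_probable_path(joint):
    """Pick the single (rationale, answer) pair with the highest joint probability."""
    return max(joint, key=joint.get)

def marginal_over_answers(joint):
    """Sum out the rationale: P(a | q) = sum_r P(r, a | q)."""
    marginal = defaultdict(float)
    for (_, answer), p in joint.items():
        marginal[answer] += p
    return dict(marginal)

def monte_carlo_answer_counts(joint, m, rng):
    """Approximate the marginal by sampling m paths and counting their answers."""
    pairs = rng.choices(list(joint), weights=list(joint.values()), k=m)
    return Counter(answer for _, answer in pairs)

marginal = marginal_over_answers(joint)
counts = monte_carlo_answer_counts(joint, m=40, rng=random.Random(0))
```

Here the best single path answers "26" with probability 0.30, while the marginal probability of "18" is 0.70, so a vote over enough samples is expected to favor "18".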

The authors found that weighting samples by their normalized log-probability yields accuracy similar to unweighted majority vote, while weighting by unnormalized log-probability hurts performance substantially. They attribute this to the fact that longer rationales receive lower unnormalized log-probabilities, biasing the vote toward short, often incorrect, paths.^[1]
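The length bias can be reproduced on invented numbers. The sketch below assumes that "normalized" means the average per-token log-probability and "unnormalized" means the summed log-probability of the whole sequence; the sample data and function name are hypothetical.

```python
import math
from collections import defaultdict

def weighted_vote(samples, normalize):
    """samples: list of (answer, token_logprobs). Sum exp(score) per answer."""
    scores = defaultdict(float)
    for answer, token_logprobs in samples:
        total = sum(token_logprobs)
        if normalize:
            total /= len(token_logprobs)  # length-normalize the log-probability
        scores[answer] += math.exp(total)
    return max(scores, key=scores.get)

samples = [
    ("18", [-0.2] * 20),  # long, confident rationale
    ("18", [-0.2] * 20),  # a second long rationale with the same answer
    ("26", [-0.5] * 5),   # short rationale, less confident per token
]

normalized_choice = weighted_vote(samples, normalize=True)
unnormalized_choice = weighted_vote(samples, normalize=False)
```

With these numbers the normalized vote picks "18", but the unnormalized vote picks "26": the 20-token paths each contribute exp(-4.0), which is smaller than the exp(-2.5) contributed by the single 5-token path.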

How much does self-consistency improve accuracy?

The original paper evaluated self-consistency on four language models of varying scales: UL2-20B, GPT-3 (with the code-davinci-001 and code-davinci-002 engines), LaMDA-137B, and PaLM-540B. Benchmarks covered arithmetic, commonsense, and symbolic reasoning.^[1] In the paper's abstract, the headline gains versus greedy chain-of-thought are GSM8K +17.9%, SVAMP +11.0%, AQuA +12.2%, StrategyQA +6.4%, and ARC-challenge +3.9%.^[1]

Arithmetic reasoning

For PaLM-540B with chain-of-thought prompting (greedy) versus self-consistency (40 samples), the paper reports:^[1]

Benchmark	Greedy CoT	Self-consistency	Gain
GSM8K	56.5%	74.4%	+17.9
SVAMP	79.0%	86.6%	+7.6
AQuA	35.8%	48.3%	+12.5
MultiArith	94.7%	99.3%	+4.6
ASDiv	74.0%	81.9%	+7.9
AddSub	91.9%	93.7%	+1.8

For GPT-3 code-davinci-002, the stronger of the two GPT-3 engines tested, self-consistency raised GSM8K from 60.1% to 78.0% (+17.9), SVAMP from 75.8% to 86.8% (+11.0), and AQuA from 39.8% to 52.0% (+12.2). For PaLM-540B and GPT-3 code-davinci-002, the absolute gain from self-consistency was about +12 to +18 points on AQuA and GSM8K, and about +7 to +11 points on SVAMP and ASDiv.^[1]

The headline GSM8K improvement on PaLM-540B (56.5% to 74.4%) achieved a new state of the art for the benchmark at the time of publication, despite the method being unsupervised and task-agnostic. Prior SoTA on GSM8K had relied on supervised fine-tuning with thousands of examples and a separately trained verifier model.^[1]

Commonsense and symbolic reasoning

Self-consistency also delivered gains on tasks where the final answer is a discrete label:^[1]

PaLM-540B on CommonsenseQA: 79.0% to 80.7% (+1.7)
PaLM-540B on StrategyQA: 75.3% to 81.6% (+6.3)
PaLM-540B on ARC-Easy: 95.3% to 96.4% (+1.1)
PaLM-540B on ARC-Challenge: 85.2% to 88.7% (+3.5)
GPT-3 code-davinci-002 on ARC-Challenge: 83.6% to 87.5% (+3.9)

The +6.4% headline number for StrategyQA reported in the abstract matches the GPT-3 code-davinci-002 result; the corresponding PaLM-540B gain was +6.3.^[1] Self-consistency reached new state of the art on five of the six commonsense and symbolic reasoning tasks evaluated.^[1]

How many sampled paths are needed?

The paper plotted mean accuracy and standard deviation over 10 runs for varying numbers of sampled paths m in {1, 5, 10, 20, 40}. Performance improves monotonically with m, with gains diminishing as m grows. Most of the improvement is captured by the first 5 to 10 paths, with the remaining samples providing smaller marginal gains.^[1]

The authors recommend trying a small number of paths (5 or 10) as a starting point to realize most of the accuracy gain while limiting compute cost, and note that performance generally saturates quickly.^[1] This saturation behavior reflects the fact that majority vote with m samples is a noisy estimator of the underlying answer posterior, and the variance of the estimator falls roughly as 1/m.
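The diminishing returns can be illustrated with a deliberately simplified model, which is an assumption of this sketch rather than the paper's setup: only two possible answers, and each sampled path independently correct with a fixed probability p. The chance that a strict majority is correct then has a closed form.

```python
from math import comb

def majority_correct_prob(p, m):
    """P(strict majority of m independent votes is correct), for odd m."""
    if m % 2 == 0:
        raise ValueError("use an odd m so that ties cannot occur")
    return sum(
        comb(m, k) * p**k * (1 - p) ** (m - k)
        for k in range(m // 2 + 1, m + 1)
    )

# Each path is right 60% of the time; watch how the vote improves with m.
for m in (1, 5, 11, 21, 41):
    print(m, round(majority_correct_prob(0.6, m), 3))
```

Under this model accuracy rises with every additional pair of samples when p > 0.5, but each increment is smaller than the last. Real tasks have many possible wrong answers, so the vote can succeed even when fewer than half of the paths are correct.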

Does model size matter?

Sampling-based self-consistency yields consistent improvements across all four model scales tested, but the gains are larger for larger models. The paper notes that the absolute accuracy gain on AQuA and GSM8K reaches +12 to +18 points only at the PaLM-540B and GPT-3 code-davinci-002 scales, while smaller models show more modest gains.^[1] The authors attribute this to the fact that small models have limited reasoning ability and may not generate diverse correct paths even when sampling.^[1]

Robustness studies

The paper includes ablations showing that self-consistency is robust to:

Sampling strategy and hyperparameters: varying T in temperature sampling, k in top-k, and p in nucleus sampling all produce broadly similar gains on PaLM-540B GSM8K.^[1]
Imperfect prompts: when the few-shot CoT prompt is corrupted (numbers in rationales randomly swapped), greedy CoT accuracy drops from 17.1% to 14.9% on the corrupted prompt, but self-consistency partially recovers the loss.^[1]
Zero-shot CoT: self-consistency improves zero-shot chain-of-thought prompting (Kojima et al. "Let's think step by step") by +26.2 absolute points on GSM8K in the paper's reported setting.^[1]
Equation-only rationales: self-consistency improves performance even when intermediate steps are expressed as equations rather than natural language, although the gain is smaller.^[1]

Comparison to alternative methods

The paper compares self-consistency against several baselines on equivalent compute budgets:^[1]

Sample-and-rank: sampling the same number of sequences and selecting the top-ranked one by joint log-probability. Sample-and-rank improves over greedy CoT but the gain is much smaller than self-consistency.^[1]
Beam search: on UL2-20B, self-consistency significantly outperforms beam search at the same number of beams or paths, because beam search produces less diverse outputs.^[1]
Prompt-order ensemble: randomly permuting the order of few-shot exemplars 40 times and aggregating. This yields smaller gains than self-consistency on LaMDA-137B.^[1]
Multi-prompt ensemble: aggregating greedy decodes from 40 different hand-written prompts. Also delivers smaller gains than self-consistency.^[1]

Standard prompting tasks

The paper additionally evaluates self-consistency on NLP tasks where chain-of-thought sometimes hurts accuracy compared to standard prompting (for example, ANLI-R1, e-SNLI, RTE). On these tasks the authors find that self-consistency robustly recovers the standard-prompting performance and exceeds it.^[1]

Consistency as a confidence signal

The paper observes that the consistency rate (fraction of samples agreeing with the final majority answer) correlates strongly with accuracy on GSM8K. This suggests self-consistency can double as an uncertainty estimate: when the model's sampled answers strongly agree, the answer is more likely correct; when they disperse, the model is more likely wrong. The authors describe this as giving the model some ability to "know when it doesn't know."^[1]
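The consistency rate falls out of the same vote count. A minimal sketch follows; the abstention threshold is a hypothetical parameter added for illustration and is not taken from the paper.

```python
from collections import Counter

def vote_with_consistency(answers):
    """Return (majority answer, fraction of samples that agree with it)."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

def answer_or_abstain(answers, threshold=0.5):
    """Return the majority answer only when agreement reaches the threshold."""
    answer, rate = vote_with_consistency(answers)
    return answer if rate >= threshold else None

confident = vote_with_consistency(["7", "7", "7", "3"])
uncertain = answer_or_abstain(["1", "2", "3", "4"])
```

In the first call three of four samples agree, so the rate is 0.75; in the second no answer repeats, so the helper abstains.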

What is universal self-consistency?

A practical limitation of the original self-consistency procedure is that it requires a way to extract a discrete final answer from each sample so that votes can be tallied. This works well for math problems (the answer is a single number), multiple-choice tasks (the answer is a letter or label), and other tasks with a fixed answer set, but fails for open-ended generation such as summarization, code, or free-form question answering.^[3]

Universal Self-Consistency (USC), introduced by Chen et al. (2023), addresses this limitation by replacing the explicit answer-extraction-and-vote step with a single additional LLM call.^[3] After sampling m candidate responses, USC prompts the LLM with all m candidates and asks it to identify the most consistent response. The selected response is returned as the final answer.
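The selection call can be sketched as follows. The prompt wording, the reply format, and the `llm` callable are all assumptions made for this example; they paraphrase the idea and are not the exact prompt used by Chen et al.

```python
import re

def build_usc_prompt(question, candidates):
    """Concatenate all candidates into one selection prompt."""
    lines = [f"Question: {question}", "", "Candidate responses:"]
    for i, text in enumerate(candidates, start=1):
        lines.append(f"Response {i}: {text}")
    lines.append("")
    lines.append(
        "Select the response that is most consistent with the others. "
        "Reply in the form: The most consistent response is Response X"
    )
    return "\n".join(lines)

def universal_self_consistency(question, candidates, llm):
    """llm: hypothetical callable mapping a prompt string to a reply string."""
    reply = llm(build_usc_prompt(question, candidates))
    match = re.search(r"Response (\d+)", reply)
    if match:
        index = int(match.group(1)) - 1
        if 0 <= index < len(candidates):
            return candidates[index]
    return candidates[0]  # fallback when the reply cannot be parsed

# A stub stands in for the model so the sketch runs without any API.
stub_llm = lambda prompt: "The most consistent response is Response 2"
chosen = universal_self_consistency("Summarize the memo.", ["a", "b", "c"], stub_llm)
```

Because the whole candidate set is packed into one prompt, the prompt length grows with the number of candidates.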

USC was evaluated on mathematical reasoning (GSM8K and MATH), code generation (BIRD-SQL text-to-SQL and Python generation), long-context summarization, and open-ended question answering.^[3] Key findings:^[3]

On math problems where answer extraction is straightforward, USC matches the accuracy of standard self-consistency without requiring an answer parser.
On code generation, USC matches the performance of execution-based consistency methods (which run candidate programs and cluster them by execution output) without needing program execution.
On open-ended tasks where standard self-consistency does not apply, USC improves over greedy decoding and random selection.

USC suffers from some long-context weaknesses of the underlying LLM. The paper notes that accuracy on GSM8K under USC decreased when the number of candidates fed into the final call grew very large (for example, 16 candidates), because the LLM struggles with the long composite prompt; the authors recommend roughly 8 candidates as a practical sweet spot.^[3]

USC inherits both the strengths and weaknesses of LLM-as-judge approaches: it is sensitive to the LLM's ability to assess consistency, can exhibit position bias, and may inherit other biases of the model. The original paper argues that consistency assessment is an easier task for an LLM than quality assessment, partially mitigating these concerns.^[3]

Rationale-augmented ensembles

A follow-up by the same group, Wang et al. (2022), generalized the self-consistency framework into rationale-augmented ensembles.^[4] The paper observes that (input, rationale → output) prompting is sensitive to the quality of manually written rationales, and proposes a unified framework that ensembles over multiple sampled rationales. Self-consistency is identified as one instance of this framework (output-space sampling), alongside prompt-order ensembles and input-rationale ensembles. The framework extends to NLP tasks such as question answering, word sense disambiguation, and sentiment analysis that do not traditionally use chain-of-thought.^[4]

Self-improvement via self-consistency

Huang et al. (2022) proposed using self-consistency as a data-generation procedure for self-improvement: the LLM generates multiple chain-of-thought solutions, self-consistency picks the majority answer, and rationales leading to that majority answer are used as training data to fine-tune the LLM. This approach demonstrates that the labels produced by self-consistency are high-quality enough to act as a pseudo-supervisory signal.^[5]
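The data-generation step can be sketched as a filter over sampled paths. The record layout and function name are assumptions for this example, and the sketch covers only the selection of rationales, not the fine-tuning itself.

```python
from collections import Counter

def build_self_training_examples(question, samples):
    """samples: list of (rationale, answer). Keep rationales that reach the majority answer."""
    majority, _ = Counter(answer for _, answer in samples).most_common(1)[0]
    return [
        {"question": question, "rationale": rationale, "answer": answer}
        for rationale, answer in samples
        if answer == majority
    ]

samples = [
    ("16 - 3 - 4 = 9, 9 * 2 = 18", "18"),
    ("16 * 2 = 32, 32 - 6 = 26", "26"),
    ("(16 - 7) * 2 = 18", "18"),
]
examples = build_self_training_examples("How much does she earn per day?", samples)
```

The two rationales that end in "18" are kept as pseudo-labeled training examples and the dissenting path is dropped.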

Connection to majority voting and ensembles

Self-consistency is closely related to classical ensemble methods such as bagging in statistical learning, with one important distinction: the m sampled paths come from the same model rather than independently trained models. The paper characterizes self-consistency as a "self-ensemble" that produces ensemble-like gains without the cost of training multiple base models.^[1] It also bears family resemblance to minimum Bayes risk decoding for sequence generation, in which a candidate is selected to minimize expected risk under the model distribution.

How does self-consistency relate to test-time compute?

Self-consistency is one of the canonical demonstrations that test-time compute can be traded for accuracy on reasoning tasks: spending m times the inference cost on parallel sampling produces measurable accuracy gains that often exceed the gains from a comparable increase in model size.^[1] This insight underlies a broader research program on test-time scaling that includes tree of thoughts, best-of-N selection with reward models, process reward models for verifier-guided search, and iterative refinement methods.

Modern reasoning models such as the OpenAI o-series are widely described as using internal test-time inference procedures conceptually related to self-consistency, in that they spend additional compute at inference to explore multiple reasoning paths before producing an answer. The specific algorithms used by such proprietary systems have not been fully disclosed.

Verifier-based methods

The original paper contrasts self-consistency with verifier-based approaches such as the trained verifier used by Cobbe et al. (2021) for GSM8K, in which a separately trained scoring model ranks candidate solutions and selects the best.^[1] Verifier-based methods can outperform plain self-consistency when the verifier has been trained on enough high-quality data, but require additional supervision, training compute, and a separate model. Self-consistency requires no additional model or annotation, only inference compute. The two approaches can be combined: candidates can be filtered by verifier score before majority vote, or the verifier can act as a tiebreaker. Process reward models extend verifier-based methods by scoring intermediate steps rather than only final answers.
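Both combinations mentioned above can be sketched on top of a plain vote. The verifier scores here are invented numbers standing in for the output of a trained scoring model, and the function names and the `min_score` cutoff are hypothetical.

```python
from collections import Counter

def verifier_filtered_vote(samples, min_score):
    """samples: list of (answer, verifier_score). Vote only over high-scoring samples."""
    kept = [answer for answer, score in samples if score >= min_score]
    if not kept:  # nothing passed the filter: fall back to a plain vote
        kept = [answer for answer, _ in samples]
    return Counter(kept).most_common(1)[0][0]

def vote_with_verifier_tiebreak(samples):
    """Majority vote; ties are broken by the best verifier score for each answer."""
    counts = Counter(answer for answer, _ in samples)
    best = {}
    for answer, score in samples:
        best[answer] = max(best.get(answer, float("-inf")), score)
    return max(counts, key=lambda a: (counts[a], best[a]))

samples = [("18", 0.9), ("26", 0.2), ("26", 0.3), ("18", 0.8), ("9", 0.1)]
filtered = verifier_filtered_vote(samples, min_score=0.5)
tiebroken = vote_with_verifier_tiebreak(samples)
```

In this toy data "18" and "26" each get two votes, so a plain vote is tied; both variants resolve it in favor of "18" because its paths score higher.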

What are the limitations of self-consistency?

The paper and follow-up work identify several limitations of self-consistency:^[1]^[3]

Discrete answer requirement: standard self-consistency only works for tasks with a fixed, parseable answer set. Free-form tasks such as summarization, translation, dialogue, and creative writing do not have a natural answer-equality predicate for majority voting. Universal Self-Consistency was proposed in part to address this limitation.^[3]
Compute cost: generating m sampled paths costs m times the inference compute of greedy decoding. For m = 40, this is a substantial overhead. The paper recommends m = 5 to 10 as a practical starting point because performance saturates quickly.^[1]
No improvement on impossible tasks: when the model's underlying distribution over reasoning paths does not concentrate on the correct answer, sampling more paths does not help. Self-consistency cannot conjure knowledge or skills the base model lacks. The paper notes that gains are smaller for smaller models for this reason.^[1]
Answer extraction fragility: even on tasks with discrete answers, the procedure depends on a working answer-extraction rule. If the model produces answers in inconsistent formats, equivalent answers may be counted as distinct, fragmenting the vote. This motivated the LLM-as-judge approach in USC.^[3]
Vulnerability to systematic bias: if the model is systematically biased toward a particular wrong answer, sampling more paths reinforces that bias rather than correcting it. Self-consistency is a variance-reduction technique, not a bias-correction technique.
Marginal gains on already strong models: the absolute gain shrinks on tasks where greedy CoT already performs near ceiling (for example MultiArith with PaLM-540B at 94.7% greedy).^[1]
Sensitivity to tokenization of the answer: when the answer is a long expression rather than a single number, even subtle formatting differences (whitespace, units, parentheses) can split the vote. Practical implementations use task-specific answer normalization.
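Vote fragmentation from formatting differences can be reduced with a normalization pass before counting. The sketch below handles only simple numeric answers and is an illustrative assumption, not a routine from the paper; real tasks usually need their own rules.

```python
import re
from collections import Counter

def normalize_numeric_answer(text):
    """Map strings such as '$1,200.00' or ' 1200 ' to one canonical form."""
    cleaned = text.strip().replace(",", "")
    match = re.search(r"-?\d+(?:\.\d+)?", cleaned)
    if match is None:
        return cleaned.lower()  # non-numeric answers: fall back to a case-folded string
    value = float(match.group())
    return str(int(value)) if value.is_integer() else str(value)

raw = ["$1,200.00", "1200", "1200 dollars", "1,250"]
raw_top = Counter(raw).most_common(1)[0]
normalized_top = Counter(map(normalize_numeric_answer, raw)).most_common(1)[0]
```

On the raw strings every answer is distinct, so the top count is 1; after normalization three of the four samples collapse to "1200".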

How is self-consistency implemented?

A typical implementation of self-consistency follows the structure described in the original paper:^[1]

Prepare a chain-of-thought prompt (few-shot exemplars demonstrating the reasoning style).
For each test question, sample m completions from the model with a stochastic decoder (temperature sampling alone or combined with top-k or top-p truncation). Typical settings reported in the literature include T in the range 0.5 to 0.7 and m between 5 and 40.
Parse a final answer from each completion using a task-specific regular expression or normalization routine.
Compute the majority answer among the parsed final answers. Break ties arbitrarily or by joint log-probability.

Sampling can be parallelized across the m paths since each is independent given the prompt. This is in contrast to beam search, which requires synchronization across beams at each decoding step. Self-consistency therefore parallelizes more easily on accelerator hardware and on serving infrastructure that batches independent requests.
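The steps above fit in a short pipeline. This is a sketch under stated assumptions: `sample_fn` is a hypothetical callable that returns one sampled completion per call (for example a wrapper around a model API), and completions are assumed to end with a phrase of the form "The answer is X".

```python
import itertools
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

ANSWER_PATTERN = re.compile(r"The answer is\s*\$?(-?[\d,]+(?:\.\d+)?)")

def extract_answer(completion):
    """Parse the final numeric answer, or return None if the format is missing."""
    match = ANSWER_PATTERN.search(completion)
    return match.group(1).replace(",", "") if match else None

def self_consistency(prompt, sample_fn, m=10, max_workers=8):
    """Sample m completions in parallel, parse each answer, return the majority."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        completions = list(pool.map(sample_fn, [prompt] * m))
    answers = [a for a in map(extract_answer, completions) if a is not None]
    if not answers:
        return None  # no completion could be parsed
    return Counter(answers).most_common(1)[0][0]

# Stub sampler that replays canned completions so the sketch runs offline.
canned = [
    "She sells 9 eggs at $2 each. The answer is 18.",
    "16 * 2 = 32, minus 6. The answer is 26.",
    "9 * 2 = 18. The answer is $18.",
    "I am not sure.",
    "16 - 3 - 4 = 9 and 9 * 2 = 18. The answer is 18.",
]
counter = itertools.count()
stub_sample_fn = lambda prompt: canned[next(counter) % len(canned)]
result = self_consistency("Q: ... A:", stub_sample_fn, m=5)
```

Threads suit this pattern when each sample is a network call to a serving endpoint; unparseable completions are dropped before the vote rather than counted as an answer.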

For long-context settings where each chain of thought is itself long, the memory cost of holding m independent generation states grows with m. Practical deployments tune m and the per-sample length budget to fit serving constraints.

Why is self-consistency significant?

The self-consistency paper has been unusually influential in prompt engineering and reasoning research:^[1]^[3]^[4]

It was an early demonstration that scaling inference compute (m sampled paths) can be as effective as scaling model size, foreshadowing the test-time-compute paradigm that became central to research on reasoning models.
It established majority-vote-over-CoT as the standard baseline for reasoning evaluation, against which subsequent verifier-based, search-based, and reward-model-based methods are typically compared.
It introduced the marginalization framing for inference-time reasoning, which has informed later work on process supervision, tree search (tree of thoughts), and Monte Carlo tree search over LLM rollouts.
It identified consistency as a natural confidence signal, anticipating later work on calibration and uncertainty estimation for LLMs.

The original paper has been cited many thousands of times since its 2022 arXiv release and 2023 ICLR publication, and the technique is widely implemented in evaluation harnesses and serving stacks for reasoning models.^[2]

References

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2022). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv:2203.11171. Published as a conference paper at ICLR 2023. https://arxiv.org/abs/2203.11171 (Accessed 2026-06-22). Full text PDF: https://arxiv.org/pdf/2203.11171 (Accessed 2026-06-22). ↩
ICLR 2023 Poster page, "Self-Consistency Improves Chain of Thought Reasoning in Language Models." https://iclr.cc/virtual/2023/poster/11718 (Accessed 2026-06-22). OpenReview page: https://openreview.net/forum?id=1PL1NIMMrw (Accessed 2026-06-22). ↩
Chen, X., Aksitov, R., Alon, U., Ren, J., Xiao, K., Yin, P., Prakash, S., Sutton, C., Wang, X., & Zhou, D. (2023). "Universal Self-Consistency for Large Language Model Generation." arXiv:2311.17311. https://arxiv.org/abs/2311.17311 (Accessed 2026-06-22). ↩
Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., & Zhou, D. (2022). "Rationale-Augmented Ensembles in Language Models." arXiv:2207.00747. https://arxiv.org/abs/2207.00747 (Accessed 2026-06-22). ↩
Huang, J., Gu, S. S., Hou, L., Wu, Y., Wang, X., Yu, H., & Han, J. (2022). "Large Language Models Can Self-Improve." arXiv:2210.11610. https://arxiv.org/abs/2210.11610 (Accessed 2026-06-22). ↩
