Chain-of-thought (CoT) prompting is a prompt engineering technique that encourages large language models (LLMs) to break down complex problems into intermediate reasoning steps before arriving at a final answer. Rather than producing a direct response, the model generates a sequence of logical steps that mirror how a human might work through a problem. First introduced by Jason Wei and colleagues at Google Brain in January 2022, chain-of-thought prompting has become one of the most influential techniques in modern natural language processing, fundamentally changing how researchers and practitioners interact with language models [1].
The concept of chain-of-thought prompting was formalized in the paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," authored by Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou [1]. The paper was first posted to arXiv on January 28, 2022 and later presented at NeurIPS 2022.
The core observation was deceptively simple: by including a few examples of step-by-step reasoning in the prompt (as part of few-shot prompting), large language models could be induced to generate their own intermediate reasoning steps before producing an answer. This approach required no changes to model weights, no fine-tuning, and no architectural modifications. It was purely a prompting strategy.
Wei et al. tested chain-of-thought prompting on three large language models: LaMDA (137B parameters), PaLM (540B parameters), and GPT-3 (175B parameters). The results across arithmetic, commonsense, and symbolic reasoning benchmarks were striking. On the GSM8K benchmark of grade school math word problems, PaLM 540B with chain-of-thought prompting achieved 56.9% accuracy compared to just 17.9% with standard prompting. This result surpassed even a fine-tuned GPT-3 175B model that used a specially trained verifier, which had previously held the state of the art at 55% [1].
One of the paper's most important findings was that chain-of-thought prompting exhibits emergent behavior: it only becomes effective with sufficiently large models, generally those with over 100 billion parameters. Smaller models often produced illogical chains of thought that actually degraded performance compared to standard prompting [1].
In standard few-shot prompting, a user provides several input-output examples before posing a question. Chain-of-thought prompting modifies this pattern by including intermediate reasoning steps in each example.
Standard few-shot prompting:

```
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 11.
```

Chain-of-thought prompting:

```
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
6 tennis balls. 5 + 6 = 11. The answer is 11.
```
When the model sees examples formatted this way, it learns to produce its own step-by-step reasoning for new questions. The key mechanism is that by decomposing a multi-step problem into individual operations, each step becomes simpler and more tractable for the model. The model can allocate more computation (in the form of generated tokens) to harder problems, rather than being forced to compute the answer in a single forward pass [1].
Several hypotheses explain the effectiveness of chain-of-thought prompting. Decomposition makes each intermediate step simpler and more tractable than the full problem; generating reasoning tokens lets the model spend computation in proportion to problem difficulty rather than answering in a single forward pass; and the step-by-step format resembles the worked explanations present in pretraining data, steering the model toward that style of text [1].
The original chain-of-thought paper sparked a wave of follow-up research that extended and improved upon the basic technique. The table below summarizes the major variants.
| Variant | Authors | Year | Key Idea | Publication Venue |
|---|---|---|---|---|
| Chain-of-Thought (Few-Shot) | Wei et al. | 2022 | Include step-by-step reasoning examples in the prompt | NeurIPS 2022 |
| Zero-Shot CoT | Kojima et al. | 2022 | Append "Let's think step by step" without examples | NeurIPS 2022 |
| Self-Consistency | Wang et al. | 2022 | Sample multiple reasoning paths, take majority vote | ICLR 2023 |
| Tree of Thoughts (ToT) | Yao et al. | 2023 | Explore branching reasoning paths with search algorithms | NeurIPS 2023 |
| Graph of Thoughts (GoT) | Besta et al. | 2023 | Model reasoning as an arbitrary graph structure | AAAI 2024 |
In May 2022, Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa published "Large Language Models are Zero-Shot Reasoners" [2]. Their key discovery was that simply appending the phrase "Let's think step by step" to a prompt, without providing any examples, could elicit chain-of-thought reasoning from large models.
This zero-shot approach uses a two-stage process. In the first stage, the prompt with "Let's think step by step" appended is sent to the model, which generates a reasoning path. In the second stage, the generated reasoning is combined with the original question and a prompt like "Therefore, the answer is" to extract the final answer [2].
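The two-stage pipeline can be sketched as follows. Here `generate` is a placeholder for any text-completion call, and the function and constant names are assumptions of this sketch, not the authors' API:

```python
# Minimal sketch of the two-stage zero-shot CoT pipeline of Kojima et al. [2].
REASONING_TRIGGER = "Let's think step by step."
ANSWER_TRIGGER = "Therefore, the answer is"

def zero_shot_cot(question: str, generate) -> str:
    # Stage 1: elicit a free-form reasoning path with the trigger phrase.
    stage1_prompt = f"Q: {question}\nA: {REASONING_TRIGGER}"
    reasoning = generate(stage1_prompt)
    # Stage 2: re-prompt with the generated reasoning plus an answer cue
    # to extract a terse final answer.
    stage2_prompt = f"{stage1_prompt} {reasoning}\n{ANSWER_TRIGGER}"
    return generate(stage2_prompt).strip()
```

Splitting the process into two calls is what makes the method usable for automatic evaluation: the second call reliably yields a short, parseable answer even though the first call's output is free-form.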
The researchers tested multiple candidate trigger phrases and found that "Let's think step by step" was consistently the most effective. The results were impressive: on arithmetic reasoning tasks, models like InstructGPT and PaLM 540B improved from accuracy in the teens to 70-80%. On symbolic reasoning tasks, accuracy jumped from around 10% to 40% [2].
Like few-shot CoT, zero-shot CoT is an emergent capability of large models. Smaller models showed minimal or no improvement, suggesting that the ability to generate coherent reasoning chains without examples requires a critical mass of learned knowledge and language capability [2].
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou introduced self-consistency in March 2022, published at ICLR 2023 [3]. The method addresses a fundamental limitation of standard chain-of-thought prompting: it relies on greedy decoding, which produces only a single reasoning path.
Self-consistency works by sampling multiple diverse reasoning paths from the model (using a non-zero temperature setting) and then selecting the most common final answer through majority voting. The intuition is that complex reasoning problems typically admit multiple valid solution approaches, and the correct answer is more likely to be reached by several different reasoning paths than an incorrect one [3].
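The voting step itself is simple to express. In this sketch, `sample` stands in for one temperature-sampled decoding pass that returns a (reasoning, answer) pair; the name and signature are assumptions, not a real API:

```python
from collections import Counter

def self_consistency(question: str, sample, n: int = 10) -> str:
    """Sample n diverse reasoning paths and majority-vote over final answers.

    `sample` is any callable returning (reasoning, answer) pairs from a
    temperature > 0 decoding pass.
    """
    answers = [sample(question)[1] for _ in range(n)]
    # Marginalize out the reasoning paths: keep only the most common answer.
    return Counter(answers).most_common(1)[0][0]
```

Note that the reasoning text is discarded after extraction; only the final answers are compared, which is why the method tolerates paths that reach the right answer by different routes.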
The performance improvements were substantial across multiple benchmarks:
| Benchmark | Improvement over Standard CoT |
|---|---|
| GSM8K | +17.9% |
| SVAMP | +11.0% |
| AQuA | +12.2% |
| StrategyQA | +6.4% |
| ARC-challenge | +3.9% |
Self-consistency requires no additional training, no auxiliary models, and no changes to the prompting format. The only cost is increased inference time due to multiple sampling passes [3].
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan introduced Tree of Thoughts (ToT) in May 2023, published at NeurIPS 2023 [4]. ToT generalizes chain-of-thought prompting by allowing the model to explore multiple reasoning branches simultaneously, evaluate their promise, and backtrack when necessary.
Where standard CoT produces a single linear chain of reasoning, ToT maintains a tree structure where each node represents a "thought" (a coherent unit of reasoning). The model generates candidate thoughts at each step, evaluates them (either by voting or by assigning value estimates), and uses search algorithms like breadth-first search (BFS) or depth-first search (DFS) to navigate the tree [4].
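A breadth-first variant of this search can be sketched generically. The `propose` and `score` callables stand in for model calls (thought generation and value estimation); `tot_bfs` and its parameters are illustrative names for this sketch, not the authors' implementation:

```python
# Breadth-first Tree-of-Thoughts skeleton in the spirit of Yao et al. [4].
# propose(state) -> list of candidate next thoughts extending that state.
# score(state)   -> numeric estimate of how promising the state is.
def tot_bfs(root: str, propose, score, beam_width: int = 3, depth: int = 3) -> str:
    frontier = [root]
    for _ in range(depth):
        # Expand every state on the frontier into candidate thoughts...
        candidates = [t for state in frontier for t in propose(state)]
        if not candidates:
            break
        # ...then keep only the beam_width most promising candidates.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(frontier, key=score)
```

Pruning all but the top `beam_width` states at each level is what lets the search explore alternatives without the exponential blowup of expanding every branch.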
The performance gains on difficult planning tasks were dramatic. On the Game of 24 (a mathematical puzzle requiring the combination of four numbers to reach 24 using basic arithmetic), GPT-4 with standard chain-of-thought prompting solved only 4% of problems. With Tree of Thoughts, the success rate rose to 74% [4].
Maciej Besta and colleagues at ETH Zurich extended the paradigm further with Graph of Thoughts (GoT), introduced in August 2023 and published at AAAI 2024 [5]. GoT models the reasoning process as an arbitrary directed graph rather than a chain or tree. In this framework, individual thoughts are vertices and dependencies between them are edges.
The graph structure enables operations that are impossible in linear or tree-based approaches: thoughts can be merged, refined through feedback loops, or aggregated from multiple independent reasoning paths. On sorting tasks, GoT improved quality by 62% over Tree of Thoughts while simultaneously reducing computational costs by more than 31% [5].
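A toy data structure makes the difference from a tree concrete: a thought may have several parents, so independent reasoning paths can be aggregated into one vertex. The class and method names here are illustrative, not the authors' API:

```python
# Toy Graph-of-Thoughts structure in the spirit of Besta et al. [5]:
# thoughts are vertices, dependency edges point from parents to derived thoughts.
class ThoughtGraph:
    def __init__(self):
        self.thoughts = {}  # vertex id -> thought content
        self.parents = {}   # vertex id -> list of parent vertex ids

    def add(self, vid, content, parents=()):
        self.thoughts[vid] = content
        self.parents[vid] = list(parents)
        return vid

    def aggregate(self, vid, parent_ids, merge):
        """Merge several independent thoughts into one child vertex.

        A vertex with multiple parents is exactly what a tree cannot express.
        """
        content = merge([self.thoughts[p] for p in parent_ids])
        return self.add(vid, content, parent_ids)
```

For example, a sorting task can sort two sublists as independent thoughts and then aggregate them with a merge operation, mirroring the sorting experiments reported in the GoT paper.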
Chain-of-thought prompting and its variants have been evaluated extensively across reasoning benchmarks. The technique is most effective on tasks that require multi-step reasoning, particularly in arithmetic, symbolic reasoning, and commonsense inference.
| Benchmark | Model | Standard Prompting | CoT Prompting | Improvement |
|---|---|---|---|---|
| GSM8K | PaLM 540B | 17.9% | 56.9% | +39.0 pp |
| GSM8K | PaLM 540B + Self-Consistency | 17.9% | 74.4% | +56.5 pp |
| SVAMP | PaLM 540B | 79.0% | 86.6% | +7.6 pp |
| MAWPS | PaLM 540B | 84.7% | 93.3% | +8.6 pp |
On commonsense reasoning benchmarks like StrategyQA and the AI2 Reasoning Challenge (ARC), chain-of-thought prompting provided more modest but still meaningful improvements. Performance gains ranged from 3-10 percentage points depending on the benchmark and model used [1].
Symbolic reasoning tasks (such as last-letter concatenation and coin flip tracking) showed some of the most dramatic improvements. PaLM 540B went from near-chance accuracy with standard prompting to strong performance with chain-of-thought prompting on these tasks, including on out-of-distribution examples that were longer than anything seen in the few-shot examples [1].
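The last-letter concatenation task is simple to state programmatically; the helper below is a ground-truth sketch of the task (the function name is this sketch's, not the paper's), which also shows why length generalization is easy to test: longer word lists are trivially generated:

```python
def last_letter_concat(words):
    """Ground truth for the last-letter concatenation task [1]:
    take the last letter of each word and join them in order."""
    return "".join(w[-1] for w in words)
```

For instance, the two-word case used in the paper maps a name like "Elon Musk" to "nk", while out-of-distribution evaluation simply uses lists longer than any seen in the few-shot exemplars.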
Starting in late 2024, a new class of models emerged that internalize chain-of-thought reasoning into the model itself, rather than relying on prompt-level techniques. These "reasoning models" or "thinking models" represent a fundamental shift: instead of the user instructing the model to reason step by step, the model is trained through reinforcement learning to do so automatically.
OpenAI released o1-preview in September 2024, the first commercially available model with internalized chain-of-thought reasoning [6]. The model maintains a hidden "thinking block" where it works through potential solutions step by step before presenting its final answer. This internal reasoning is produced during inference but is not fully visible to the user (OpenAI provides a summarized version).
The o1 model scored 83% on the 2024 American Invitational Mathematics Examination (AIME). Its successor, o3, announced in December 2024, scored 96.7% on the same exam, matching the level of gold-medal participants in the International Mathematical Olympiad [7]. On the GPQA Diamond benchmark (a graduate-level science exam), o3 achieved 87.7% [7].
Unlike standard LLMs that generate answers in a single forward pass, reasoning models pause, plan, explore multiple solution paths, critique their own logic, and backtrack when they hit dead ends. They trade speed for deliberation.
Released on January 20, 2025, DeepSeek R1 demonstrated that reasoning capabilities comparable to OpenAI o1 could be achieved through a different training approach [8]. Built on a Mixture of Experts (MoE) architecture with 671 billion total parameters (37 billion activated per forward pass), DeepSeek R1 uses large-scale reinforcement learning applied directly to the base model without supervised fine-tuning as a preliminary step.
A particularly notable contribution was DeepSeek-R1-Zero, which showed that pure RL over a strong base model can unlock advanced chain-of-thought reasoning without manually curated reasoning traces. This dramatically reduced alignment costs. DeepSeek open-sourced the model weights under the MIT License, enabling commercial use and distillation [8].
Anthropic introduced extended thinking with Claude 3.7 Sonnet, allowing the model to engage in longer internal reasoning before responding [9]. When extended thinking is active, Claude uses serial test-time compute, working through multiple sequential reasoning steps before producing its final output. This capability was refined in subsequent releases, including Claude Sonnet 4.5 (September 2025), and further developed with adaptive thinking in Claude Opus 4.6 (2026), which allows the model to automatically determine when deeper reasoning would be helpful [9].
The emergence of reasoning models has changed the value proposition of explicit chain-of-thought prompting. A June 2025 study from the Wharton Generative AI Labs found that for reasoning models with built-in chain-of-thought capabilities (like o3-mini and o4-mini), explicit CoT prompting produced only minimal additional benefits of 2.9-3.1%, while increasing processing time by 20-80% [10]. Many recent models perform some form of chain-of-thought reasoning internally even when not explicitly prompted to do so, which explains the diminishing returns of explicit CoT instructions with newer systems.
For non-reasoning models, chain-of-thought prompting still provides modest average improvements but introduces increased variability in responses [10].
Despite its effectiveness on complex reasoning tasks, chain-of-thought prompting has several well-documented limitations.
CoT prompting is not a universal improvement. For straightforward, single-step problems, forcing the model to generate intermediate reasoning can introduce unnecessary complexity and degrade performance. A 2024 study titled "Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse" found that on tasks involving implicit statistical learning, o1-preview experienced an absolute performance drop of 36.3% compared to GPT-4o with zero-shot prompting. Consistent accuracy reductions were observed across eight other state-of-the-art models as well [11].
The general pattern is clear: tasks that benefit from conscious, deliberate reasoning in humans also benefit from CoT in language models, while tasks that rely on intuitive or pattern-based processing can be harmed by forced step-by-step reasoning.
Chain-of-thought prompting only works reliably with large models (roughly 100B+ parameters). In smaller models, the generated reasoning chains tend to be illogical or incoherent, leading to worse performance than standard prompting. This limits the technique's applicability in resource-constrained settings where only smaller models are available [1].
Because chain-of-thought prompting requires the model to generate many more tokens (the intermediate reasoning steps plus the final answer), it increases both response time and computational cost. Self-consistency compounds this further by requiring multiple sampling passes. For applications with strict latency requirements or tight cost constraints, these overheads can be prohibitive.
A growing body of research has shown that the reasoning steps produced by chain-of-thought prompting do not always faithfully represent the model's actual internal computation. The paper "Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting" (Turpin et al., 2023) demonstrated that models can produce plausible but misleading reasoning chains [12].
Several types of unfaithfulness have been identified. In post-hoc rationalization, the stated reasoning is constructed to justify an answer the model has effectively already committed to. In biased reasoning, features that demonstrably influence the answer never appear in the explanation: Turpin et al. showed that biasing features such as the ordering of multiple-choice options could systematically change a model's answers while its chains of thought remained silent about them [12].
A 2025 paper from Anthropic, "Reasoning Models Don't Always Say What They Think," extended these findings to reasoning models specifically, showing that even models with internalized chain-of-thought can produce unfaithful reasoning traces [14]. The paradox is that as models become more capable, their unfaithful reasoning also becomes more sophisticated and harder to detect.
These findings have significant implications for AI safety. Strategies that rely on monitoring chain-of-thought outputs to detect undesired model behavior may be unreliable if the displayed reasoning does not accurately reflect the model's true decision-making process.
The effectiveness of chain-of-thought prompting can be sensitive to the specific formatting of examples, the number of examples provided, and the choice of trigger phrases. Small changes in prompt wording can sometimes lead to large changes in performance, making the technique somewhat fragile in practice.
Chain-of-thought prompting is one of the most important techniques in the broader field of prompt engineering. It demonstrated that the way information is presented to a language model can dramatically alter the model's capabilities without any changes to the model itself.
CoT prompting built on earlier work in few-shot prompting (as popularized by the GPT-3 paper by Brown et al., 2020) and contributed to a broader understanding that LLMs can be "programmed" through carefully designed prompts. Its success inspired numerous other prompting techniques, including least-to-most prompting (decomposing a problem into explicitly ordered subproblems), self-ask (having the model pose and answer its own follow-up questions), and ReAct (interleaving reasoning steps with tool-using actions).
As of early 2026, chain-of-thought reasoning has evolved from an external prompting technique into a built-in capability of frontier language models. The major model providers (OpenAI, Anthropic, Google, and others) all offer models with internalized reasoning capabilities, often marketed as "thinking" modes.
For practitioners, the practical implications are notable. With reasoning models, explicit CoT instructions add little accuracy while substantially increasing latency and cost, so it is usually better to let the model's built-in reasoning run. With non-reasoning models, few-shot or zero-shot CoT remains a cheap way to improve multi-step accuracy, at the price of more variable outputs. And for simple or intuition-like tasks, forcing step-by-step reasoning can degrade performance outright.
The research frontier has moved beyond prompting-level CoT toward understanding and improving the internal reasoning processes of models trained with reinforcement learning. Questions about the faithfulness of reasoning traces, the relationship between visible chain-of-thought and actual model computation, and how to make reasoning more reliable remain active areas of investigation [13] [14].
Chain-of-thought prompting's legacy extends beyond its direct technical contributions. It established a paradigm: that language models can be coaxed into performing complex cognitive tasks through careful prompt design. Even as the specific technique of manually prompting for step-by-step reasoning becomes less necessary with newer models, the underlying insight that decomposed reasoning improves model performance continues to shape how large language models are trained, evaluated, and deployed.