See also: Prompt engineering, Large language model, and In-context learning
Chain-of-thought (CoT) prompting is a prompt engineering technique for improving the reasoning capabilities of large language models (LLMs) by prompting them to produce intermediate reasoning steps before arriving at a final answer. Rather than asking a model to output a direct response to a complex question, CoT prompting encourages the model to "show its work" by generating a sequence of logical steps that mirror how a human might work through a problem. The method was introduced by Jason Wei and colleagues at Google Brain in their January 2022 paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," later published at the NeurIPS 2022 conference.
CoT prompting has become one of the most widely studied and applied techniques in prompt engineering and in modern natural language processing, fundamentally changing how researchers and practitioners interact with language models. It requires no changes to model weights or architecture; instead, it works entirely through the input prompt, making it accessible to anyone using a pretrained language model. The technique has demonstrated strong performance gains on tasks involving arithmetic reasoning, commonsense reasoning, and symbolic manipulation, particularly when applied to models with 100 billion or more parameters.
Imagine you ask your teacher "What is 27 times 13?" If the teacher just says "351," you might not understand how they got that number. But if the teacher says, "First, 27 times 10 is 270. Then 27 times 3 is 81. Now add 270 and 81 together, and you get 351," you can follow each step and see why the answer is correct.
Chain-of-thought prompting works the same way with AI. Instead of asking the AI to jump straight to the answer, you show it examples where problems are solved step by step. When the AI sees these examples, it learns to break down new problems into smaller steps too, which helps it get the right answer more often.
The concept of chain-of-thought prompting was formalized in the paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," authored by Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. The paper was first posted to arXiv on January 28, 2022 and later presented at NeurIPS 2022.
The core observation was deceptively simple: by including a few examples of step-by-step reasoning in the prompt (as part of few-shot prompting), large language models could be induced to generate their own intermediate reasoning steps before producing an answer. This approach required no changes to model weights, no fine-tuning, and no architectural modifications. It was purely a prompting strategy.
Wei et al. tested chain-of-thought prompting on three large language models: LaMDA (137B parameters), PaLM (540B parameters), and GPT-3 (175B parameters). The results across arithmetic, commonsense, and symbolic reasoning benchmarks were striking. On the GSM8K benchmark of grade-school math word problems, PaLM 540B with chain-of-thought prompting achieved 56.9% accuracy (a figure sometimes rounded to 58% in later summaries using eight exemplars), compared to just 17.9% with standard prompting. This result surpassed even a fine-tuned GPT-3 175B model that used a specially trained verifier, which had previously held the state of the art at 55%.
One of the paper's most important findings was that chain-of-thought prompting exhibits emergent behavior: it only becomes effective with sufficiently large models, generally those with over 100 billion parameters. Smaller models often produced illogical chains of thought that actually degraded performance compared to standard prompting.
In standard few-shot prompting, a user provides the model with several input-output pairs as examples, then poses a new question. The model generates an answer directly. In CoT prompting, each example is expanded from a simple (input, output) pair to a triple of (input, chain of thought, output), where the chain of thought is a sequence of natural language sentences describing the reasoning process.
For example, consider a math word problem:
Standard prompt:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now? A: The answer is 11.
Chain-of-thought prompt:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now? A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
By including the intermediate reasoning steps in the demonstration, the model learns to generate similar step-by-step reasoning for new questions, leading to more accurate results. The key mechanism is that by decomposing a multi-step problem into individual operations, each step becomes simpler and more tractable for the model. The model can allocate more computation (in the form of generated tokens) to harder problems, rather than being forced to compute the answer in a single forward pass.
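In code, assembling a few-shot CoT prompt amounts to concatenating (input, chain of thought, output) exemplars before the new question. The sketch below illustrates this; the `generate` function is a hypothetical stand-in for whatever LLM completion API is in use and is not part of the original method.

```python
# Minimal sketch of few-shot chain-of-thought prompt construction.
# `generate` is a hypothetical stand-in for any LLM completion call.

COT_EXEMPLARS = [
    {
        "question": ("Roger has 5 tennis balls. He buys 2 more cans of tennis "
                     "balls. Each can has 3 tennis balls. How many tennis balls "
                     "does he have now?"),
        "chain_of_thought": ("Roger started with 5 balls. 2 cans of 3 tennis "
                             "balls each is 6 tennis balls. 5 + 6 = 11."),
        "answer": "11",
    },
    # Typically 4 to 8 exemplars are used; more would be added here.
]

def build_cot_prompt(question: str) -> str:
    """Prepend (input, chain of thought, output) exemplars to a new question."""
    parts = []
    for ex in COT_EXEMPLARS:
        parts.append(f"Q: {ex['question']}\n"
                     f"A: {ex['chain_of_thought']} The answer is {ex['answer']}.")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

def generate(prompt: str) -> str:
    """Hypothetical wrapper around an LLM completion API (not shown)."""
    raise NotImplementedError

prompt = build_cot_prompt(
    "A juggler can juggle 16 balls. Half of the balls are golf balls, and half "
    "of the golf balls are blue. How many blue golf balls are there?")
# completion = generate(prompt)  # expected to end with something like "The answer is 4."
```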
CoT prompting is an emergent ability of model scale: its benefits appear only in models with approximately 100 billion parameters or more. In smaller models, chain-of-thought prompting often produces illogical or incoherent reasoning chains, which can actually hurt performance compared to standard prompting.
Wei et al. (2022) observed that on the GSM8K benchmark, chain-of-thought prompting performed worse than standard prompting for models below a critical compute threshold (approximately 10^22 FLOPs). Above that threshold, performance improved substantially. This pattern was consistent across multiple model families, including PaLM, LaMDA, and GPT-3.
Since the original CoT paper, researchers have proposed several variants and extensions. The following table summarizes the major approaches.
| Variant | Authors | Year | Key idea | Publication Venue | Demonstration examples required? |
|---|---|---|---|---|---|
| Few-shot CoT | Wei et al. | 2022 | Include step-by-step reasoning in few-shot exemplars | NeurIPS 2022 | Yes (manually written) |
| Zero-shot CoT | Kojima et al. | 2022 | Append "Let's think step by step" to the prompt | NeurIPS 2022 | No |
| Self-consistency | Wang et al. | 2022 | Sample multiple reasoning paths, take majority vote | ICLR 2023 | Yes |
| Auto-CoT | Zhang et al. | 2022 | Automatically generate demonstrations via clustering | ICLR 2023 | No (auto-generated) |
| Least-to-most prompting | Zhou et al. | 2022 | Decompose problem into subproblems, solve sequentially | ICLR 2023 | Yes |
| Tree of Thoughts | Yao et al. | 2023 | Explore multiple reasoning branches with backtracking | NeurIPS 2023 | Yes |
| Graph of Thoughts | Besta et al. | 2023 | Model reasoning as an arbitrary graph structure | AAAI 2024 | Yes |
| Active-Prompt | Diao et al. | 2023 | Select most uncertain questions for human annotation | ACL 2024 | Yes (selectively annotated) |
| Contrastive CoT | Chia et al. | 2023 | Pair valid and invalid reasoning chains as examples | 2023 (arXiv) | Yes |
The original method proposed by Wei et al. (2022) uses manually crafted demonstrations that include intermediate reasoning steps. The user writes a small number of example problems (typically 4 to 8) along with detailed step-by-step solutions. These examples are prepended to the actual question as part of the prompt.
The approach was evaluated across three families of large language models: GPT-3 (175B parameters), LaMDA (137B parameters), and PaLM (540B parameters). On the GSM8K benchmark of grade-school math problems, PaLM 540B with eight chain-of-thought exemplars achieved 58% accuracy, compared to 18% with standard prompting. This result surpassed the prior state of the art set by a fine-tuned GPT-3 model with a verifier (55%).
In May 2022, Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa published "Large Language Models are Zero-Shot Reasoners." Their key discovery was that LLMs can perform chain-of-thought reasoning without any hand-crafted examples. By simply appending the phrase "Let's think step by step" to the end of a question, models generate intermediate reasoning steps on their own. This approach, called zero-shot CoT, is remarkably simple yet effective.
The method uses a two-stage process. In the first stage, the prompt includes the question followed by "Let's think step by step," and the model generates a reasoning chain. In the second stage, the generated reasoning is combined with the original question and a prompt like "Therefore, the answer is" to extract the final answer. With the InstructGPT model (text-davinci-002), zero-shot CoT increased accuracy on MultiArith from 17.7% to 78.7% and on GSM8K from 10.4% to 40.7%. Similar improvements were observed with PaLM 540B. On symbolic reasoning tasks, accuracy jumped from around 10% to 40%.
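The two-stage procedure can be sketched as follows; `generate` is again a hypothetical wrapper around an LLM completion API, and the trigger and extraction phrases follow the description above.

```python
# Sketch of the two-stage zero-shot CoT procedure described above.
# `generate` is a hypothetical wrapper around an LLM completion API.

def generate(prompt: str) -> str:
    """Hypothetical LLM completion call."""
    raise NotImplementedError

def zero_shot_cot(question: str) -> str:
    # Stage 1: elicit a reasoning chain with the generic trigger phrase.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = generate(reasoning_prompt)

    # Stage 2: feed the reasoning back and extract the final answer.
    extraction_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    return generate(extraction_prompt)
```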
The researchers tested multiple candidate trigger phrases and found that "Let's think step by step" was consistently the most effective. The discovery that a single generic prompt phrase could elicit reasoning across diverse task types suggested that LLMs possess latent zero-shot reasoning capabilities that had previously gone untapped. Like few-shot CoT, zero-shot CoT is an emergent capability of large models. Smaller models showed minimal or no improvement, suggesting that the ability to generate coherent reasoning chains without examples requires a critical mass of learned knowledge and language capability.
Self-consistency, proposed by Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou in March 2022 and published at ICLR 2023, replaces the greedy decoding strategy typically used with CoT prompting. Instead of generating a single reasoning path, the method samples multiple diverse reasoning paths from the model (using a non-zero temperature setting) and then selects the most common final answer through majority voting.
The intuition behind self-consistency is straightforward: complex reasoning problems often admit multiple valid solution paths, and if several independent reasoning chains converge on the same answer, that answer is more likely to be correct. The method does not require any additional training, fine-tuning, or auxiliary models.
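A minimal sketch of the procedure, with hypothetical `generate` (temperature sampling) and `extract_answer` (answer parsing) helpers standing in for the model call and the answer-extraction step:

```python
# Sketch of self-consistency decoding: sample several reasoning chains at
# non-zero temperature, parse each final answer, and return the majority vote.
from collections import Counter

def generate(prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical sampling call to an LLM completion API."""
    raise NotImplementedError

def extract_answer(completion: str) -> str:
    """Hypothetical parser that pulls the final answer out of a reasoning chain."""
    raise NotImplementedError

def self_consistency(cot_prompt: str, num_samples: int = 20) -> str:
    answers = []
    for _ in range(num_samples):
        chain = generate(cot_prompt, temperature=0.7)  # one diverse reasoning path
        answers.append(extract_answer(chain))
    # The answer reached by the most reasoning paths wins the vote.
    return Counter(answers).most_common(1)[0][0]
```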
Self-consistency produced large improvements over standard CoT across multiple benchmarks:
| Benchmark | CoT accuracy | CoT + self-consistency accuracy | Improvement (percentage points) |
|---|---|---|---|
| GSM8K | 56.5% | 74.4% | +17.9 |
| SVAMP | 68.9% | 79.9% | +11.0 |
| AQuA | 35.8% | 48.0% | +12.2 |
| StrategyQA | 73.4% | 79.8% | +6.4 |
| ARC-challenge | 85.2% | 89.1% | +3.9 |
These results were obtained using PaLM 540B. Self-consistency requires no additional training, no auxiliary models, and no changes to the prompting format. The only cost is increased inference time due to multiple sampling passes.
Zhang et al. (2022) introduced Auto-CoT to eliminate the need for manually crafted demonstrations. The method works in two stages. First, it clusters the input questions into several groups to ensure diversity. Second, it selects a representative question from each cluster and uses zero-shot CoT to automatically generate a reasoning chain for that question.
On ten public benchmark reasoning tasks with GPT-3, Auto-CoT consistently matched or exceeded the performance of manually designed CoT prompts. The method was published at ICLR 2023.
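The pipeline can be sketched roughly as follows, with hypothetical `embed` and `zero_shot_cot` helpers; the published method uses Sentence-BERT embeddings and additional selection heuristics that are omitted here.

```python
# Rough sketch of Auto-CoT: cluster the question pool for diversity, then
# auto-generate one demonstration per cluster using zero-shot CoT.
import numpy as np
from sklearn.cluster import KMeans

def embed(texts):
    """Hypothetical sentence-embedding function returning an (n, d) array."""
    raise NotImplementedError

def zero_shot_cot(question: str) -> str:
    """Hypothetical zero-shot CoT call ("Let's think step by step")."""
    raise NotImplementedError

def auto_cot_demonstrations(questions, num_clusters: int = 8):
    vectors = embed(questions)
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(vectors)
    demos = []
    for c in range(num_clusters):
        members = [i for i, lab in enumerate(labels) if lab == c]
        # Use the question closest to its cluster centroid as the representative.
        centroid = vectors[members].mean(axis=0)
        rep = min(members, key=lambda i: np.linalg.norm(vectors[i] - centroid))
        demos.append((questions[rep], zero_shot_cot(questions[rep])))
    return demos
```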
Zhou et al. (2022) proposed least-to-most prompting to address a limitation of standard CoT: poor generalization to problems that are more complex than the provided exemplars. The technique breaks a complex problem into a series of progressively simpler subproblems, solves each one in order, and feeds the solutions of earlier subproblems into the context for solving later ones.
On the SCAN compositional generalization benchmark, GPT-3 (code-davinci-002) with least-to-most prompting achieved at least 99% accuracy across all splits, compared to just 16% accuracy with standard chain-of-thought prompting. This demonstrated that structured decomposition can overcome the easy-to-hard generalization barrier.
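A rough sketch of the decompose-then-solve loop is shown below, assuming a hypothetical `generate` wrapper and a simplistic newline-based parse of the decomposition; the actual prompts in Zhou et al. are more elaborate and include few-shot decomposition examples.

```python
# Sketch of least-to-most prompting: decompose the problem, then solve the
# subproblems in order, carrying earlier answers forward in the context.

def generate(prompt: str) -> str:
    """Hypothetical wrapper around an LLM completion API."""
    raise NotImplementedError

def least_to_most(question: str) -> str:
    # Stage 1: ask the model for progressively simpler subquestions.
    decomposition = generate(
        f"Q: {question}\n"
        "To solve this, we first need to answer these simpler subquestions:"
    )
    subquestions = [line.strip() for line in decomposition.split("\n") if line.strip()]

    # Stage 2: answer each subquestion in order, appending answers to the context.
    context = f"Q: {question}\n"
    answer = ""
    for sub in subquestions:
        context += f"Subquestion: {sub}\nAnswer:"
        answer = generate(context)
        context += f" {answer}\n"
    return answer  # answer to the final subquestion, i.e. the original problem
```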
Yao et al. (2023) generalized CoT prompting into a tree-structured search framework called Tree of Thoughts (ToT). The paper, authored by Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan, was published at NeurIPS 2023. While standard CoT generates a single linear reasoning chain, ToT allows the model to explore multiple reasoning branches, evaluate their promise, and backtrack when a path appears unproductive. This approach draws on classical AI search techniques such as breadth-first search (BFS) and depth-first search (DFS).
Where standard CoT produces a single linear chain of reasoning, ToT maintains a tree structure where each node represents a "thought" (a coherent unit of reasoning). The model generates candidate thoughts at each step, evaluates them (either by voting or by assigning value estimates), and uses search algorithms to navigate the tree.
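A breadth-first version of the search can be sketched as follows; `propose_thoughts` and `score_thought` are hypothetical LLM-backed helpers for the proposal and evaluation steps, and the original framework also supports depth-first search and task-specific evaluators.

```python
# Sketch of a breadth-first Tree-of-Thoughts search: propose candidate thoughts,
# score partial reasoning chains, and keep only the most promising ones.

def propose_thoughts(state: str, k: int) -> list[str]:
    """Hypothetical LLM call returning k candidate next reasoning steps."""
    raise NotImplementedError

def score_thought(state: str) -> float:
    """Hypothetical LLM call rating how promising a partial chain is."""
    raise NotImplementedError

def tree_of_thoughts_bfs(problem: str, depth: int = 3, breadth: int = 5,
                         keep: int = 3) -> str:
    frontier = [problem]                      # each entry is a partial reasoning chain
    for _ in range(depth):
        candidates = []
        for state in frontier:
            for thought in propose_thoughts(state, breadth):
                candidates.append(state + "\n" + thought)
        # Pruning to the best `keep` chains implicitly abandons (backtracks from)
        # branches that the evaluator judges unpromising.
        candidates.sort(key=score_thought, reverse=True)
        frontier = candidates[:keep]
    return frontier[0]                        # highest-scoring chain after the search
```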
ToT demonstrated dramatic improvements on tasks requiring planning and search. On the Game of 24 task (a mathematical puzzle requiring the combination of four numbers to reach 24 using basic arithmetic), GPT-4 with standard CoT solved only 4% of problems, while ToT achieved a 74% success rate.
Maciej Besta and colleagues at ETH Zurich extended the paradigm further with Graph of Thoughts (GoT), introduced in August 2023 and published at AAAI 2024. GoT models the reasoning process as an arbitrary directed graph rather than a chain or tree. In this framework, individual thoughts are vertices and dependencies between them are edges.
The graph structure enables operations that are impossible in linear or tree-based approaches: thoughts can be merged, refined through feedback loops, or aggregated from multiple independent reasoning paths. On sorting tasks, GoT improved quality by 62% over Tree of Thoughts while simultaneously reducing computational costs by more than 31%.
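The abstraction can be illustrated with a minimal sketch in which each thought is a vertex holding references to its parent thoughts; an aggregation operation then creates a vertex with several parents, something neither a chain nor a tree can express. `aggregate_with_llm` is a hypothetical helper, and the actual GoT framework defines a much richer set of operations (generation, refinement, scoring, and more).

```python
# Minimal sketch of the Graph-of-Thoughts abstraction: thoughts as vertices,
# dependencies as edges, and aggregation as a multi-parent operation.
from dataclasses import dataclass, field

@dataclass
class Thought:
    content: str
    parents: list = field(default_factory=list)  # incoming edges (parent thoughts)

def aggregate_with_llm(thoughts):
    """Hypothetical LLM call that merges several partial results into one."""
    raise NotImplementedError

def aggregate(thoughts):
    # A vertex with multiple parents: impossible in a linear chain or a tree.
    return Thought(content=aggregate_with_llm(thoughts), parents=list(thoughts))
```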
The following table summarizes key benchmark results for chain-of-thought prompting and its variants across arithmetic reasoning tasks.
| Model | Method | GSM8K | MultiArith | SVAMP | AQuA |
|---|---|---|---|---|---|
| GPT-3 175B | Standard prompting | 17.9% | - | - | - |
| GPT-3 175B | Few-shot CoT | 58.8% | - | - | - |
| PaLM 540B | Standard prompting | 17.9% | - | 79.0% | - |
| PaLM 540B | Few-shot CoT | 56.9% (reported 58.0% with 8 exemplars) | - | 86.6% | - |
| PaLM 540B | CoT + self-consistency | 74.4% | - | 79.9% | 48.0% |
| text-davinci-002 | Zero-shot CoT | 40.7% | 78.7% | - | - |
| GPT-3.5 | Standard | ~77-80% | - | - | - |
| GPT-4 | Standard | >90% | - | - | - |
| GPT-4 | DUP (zero-shot) | 97.1% | - | - | - |
Additional results from Wei et al. (2022) with PaLM 540B: MAWPS improved from 84.7% (standard) to 93.3% (CoT), a gain of 8.6 percentage points.
Beyond arithmetic, CoT prompting also improved commonsense reasoning. On the Sports Understanding benchmark, PaLM 540B with CoT achieved 95% accuracy, surpassing unaided human performance at 84%. On StrategyQA, CoT with self-consistency reached 79.8%. On commonsense reasoning benchmarks like StrategyQA and the AI2 Reasoning Challenge (ARC), performance gains ranged from 3 to 10 percentage points depending on the benchmark and model used.
Symbolic reasoning tasks (such as last-letter concatenation and coin flip tracking) showed some of the most dramatic improvements. PaLM 540B went from near-chance accuracy with standard prompting to strong performance with chain-of-thought prompting on these tasks, including on out-of-distribution examples that were longer than anything seen in the few-shot examples.
Several hypotheses have been proposed to explain why CoT prompting improves LLM performance on reasoning tasks.
The most intuitive explanation is that CoT breaks a multi-step problem into smaller, more manageable sub-computations. Each intermediate step requires only a simple operation (such as a single arithmetic calculation or a logical inference), which the model can perform more reliably than attempting to solve the entire problem in a single forward pass. This mirrors the way humans use scratch paper or talk through problems aloud.
By generating intermediate steps, the model effectively performs more computation before producing the final answer. In a transformer architecture, each token generation involves a full forward pass through all layers of the network. Generating a longer chain of reasoning tokens means the model has more serial computation available to "think through" the problem. This is sometimes framed as allocating more "test-time compute" to harder problems.
When the reasoning chain is explicit, errors in individual steps become visible and can sometimes be self-corrected in subsequent steps. This contrasts with single-shot answer generation, where an early mistake silently contaminates the final output.
Large models have been trained on vast corpora that include mathematical solutions, textbook explanations, and step-by-step tutorials. Chain-of-thought prompting activates these learned patterns, eliciting styles of reasoning that the model has encountered during pretraining.
Madaan and Yazdanbakhsh (2022) conducted counterfactual experiments to understand which components of CoT prompts contribute to their effectiveness. They decomposed prompts into symbols, patterns, and text, and found that a symbiotic relationship between text and patterns drives the success of few-shot CoT. Text provides commonsense knowledge and context, while patterns enforce task structure and guide generation. Interestingly, they found that the factual correctness of the intermediate steps in the demonstrations was less important than the structural pattern itself.
CoT reasoning appears to be an emergent ability that materializes only in sufficiently large models. Models below roughly 100 billion parameters tend to produce incoherent reasoning chains that do not improve (and may degrade) performance. This suggests that the capacity for multi-step reasoning requires a certain density of learned representations and associations that only develops at large scale. Wei et al. (2022) observed this pattern across multiple model families, noting a sharp phase transition in CoT effectiveness at specific parameter counts.
The principles behind CoT prompting have influenced the development of dedicated reasoning models. Starting in late 2024, a new class of models emerged that internalize chain-of-thought reasoning into the model itself, rather than relying on prompt-level techniques. These "reasoning models" or "thinking models" represent a fundamental shift: instead of the user instructing the model to reason step by step, the model is trained through reinforcement learning to do so automatically.
In September 2024, OpenAI released o1-preview, the first commercially available model with internalized chain-of-thought reasoning. The model maintains a hidden "thinking block" where it works through potential solutions step by step before presenting its final answer. This internal reasoning is produced during inference but is not fully visible to the user (OpenAI provides a summarized version). Unlike prompt-based CoT, where the user must explicitly request step-by-step reasoning, o1 was trained to engage in extended reasoning automatically.
The o1 model uses what OpenAI describes as a "chain of thought" during inference, spending additional time (test-time compute) to think through problems. On the American Invitational Mathematics Examination (AIME), o1-preview scored 83%. Its successor, o3, announced in December 2024, achieved 96.7% on the same exam, matching gold-medal International Math Olympiad competitors. On the GPQA Diamond benchmark (a graduate-level science exam), o3 achieved 87.7%.
Unlike standard LLMs that generate answers in a single forward pass, reasoning models pause, plan, explore multiple solution paths, critique their own logic, and backtrack when they hit dead ends. They trade speed for deliberation.
Released on January 20, 2025, DeepSeek R1 demonstrated that reasoning capabilities comparable to OpenAI o1 could be achieved through a different training approach. Built on a Mixture of Experts (MoE) architecture with 671 billion total parameters (37 billion activated per forward pass), DeepSeek R1 uses large-scale reinforcement learning applied directly to the base model without supervised fine-tuning as a preliminary step.
A particularly notable contribution was DeepSeek-R1-Zero, which showed that pure RL over a strong base model can unlock advanced chain-of-thought reasoning without manually curated reasoning traces. This dramatically reduced alignment costs. DeepSeek open-sourced the model weights under the MIT License, enabling commercial use and distillation.
Anthropic introduced extended thinking with Claude 3.7 Sonnet, allowing the model to engage in longer internal reasoning before responding. When extended thinking is active, Claude uses serial test-time compute, working through multiple sequential reasoning steps before producing its final output. This capability was refined in subsequent releases, including Claude Sonnet 4.5 (September 2025), and further developed with adaptive thinking in Claude Opus 4.6 (2026), which allows the model to automatically determine when deeper reasoning would be helpful.
Similar approaches have been adopted by Google with Gemini 2.0 Flash Thinking and subsequent "thinking" variants of the Gemini family, which also internalize extended chain-of-thought reasoning during inference.
The emergence of reasoning models has changed the value proposition of explicit chain-of-thought prompting. A June 2025 study from the Wharton Generative AI Labs found that for reasoning models with built-in chain-of-thought capabilities (like o3-mini and o4-mini), explicit CoT prompting produced only minimal additional benefits of 2.9 to 3.1%, while increasing processing time by 20 to 80%. Many recent models perform some form of chain-of-thought reasoning internally even when not explicitly prompted to do so, which explains the diminishing returns of explicit CoT instructions with newer systems.
For non-reasoning models, chain-of-thought prompting still provides modest average improvements but introduces increased variability in responses. This evolution represents a shift from CoT as a prompting technique to CoT as an internalized capability trained directly into model weights through reinforcement learning.
Sprague et al. (2024) conducted a meta-analysis covering over 100 papers and evaluated 20 datasets across 14 models. Their study, published at ICLR 2025, found that CoT provides strong benefits primarily on tasks involving math and symbolic reasoning, with much smaller gains on other types of tasks.
| Task type | Average improvement from CoT (percentage points) |
|---|---|
| Symbolic reasoning | +14.2 |
| Mathematical reasoning | +12.3 |
| Logical reasoning | +6.9 |
| Other tasks (e.g., MMLU knowledge questions) | Negligible |
On the MMLU benchmark, answering directly produced nearly identical accuracy to CoT unless the question or the model's response contained an equals sign, indicating symbolic operations. This finding suggests that CoT is most valuable when a task genuinely requires multi-step computation and less helpful for tasks that primarily depend on knowledge retrieval.
The general pattern is clear: tasks that benefit from conscious, deliberate reasoning in humans also benefit from CoT in language models, while tasks that rely on intuitive or pattern-based processing can be harmed by forced step-by-step reasoning.
Despite its effectiveness, chain-of-thought prompting has several recognized limitations.
CoT prompting is not a universal improvement. For straightforward, single-step problems, forcing the model to generate intermediate reasoning can introduce unnecessary complexity and degrade performance. A 2024 study titled "Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse" found that on tasks involving implicit statistical learning, o1-preview experienced an absolute performance drop of 36.3% compared to GPT-4o with zero-shot prompting. Consistent accuracy reductions were observed across eight other state-of-the-art models as well.
Turpin et al. (2023) demonstrated that CoT explanations can systematically misrepresent the true reason for a model's prediction. In their study, published at NeurIPS 2023, they showed that models could be influenced by superficial biases in the prompt (such as always placing the correct answer in position "A" among multiple-choice options), while the generated chain-of-thought explanations failed to mention these biases. This effect caused accuracy to drop by as much as 36% on a suite of 13 tasks from BIG-Bench Hard. On a social-bias task, models produced explanations that justified stereotypical answers without acknowledging the influence of social biases.
Subsequent research has identified several distinct types of unfaithfulness in generated explanations.
Anthropic's 2025 research paper "Reasoning Models Don't Always Say What They Think" extended this analysis to reasoning models specifically, finding faithfulness rates as low as 25% for Claude reasoning models and 39% for DeepSeek R1. The paradox is that as models become more capable, their unfaithful reasoning also becomes more sophisticated and harder to detect. These findings raise concerns about using CoT explanations as a basis for trusting or interpreting model decisions.
These findings have significant implications for AI safety. Strategies that rely on monitoring chain-of-thought outputs to detect undesired model behavior may be unreliable if the displayed reasoning does not accurately reflect the model's true decision-making process.
CoT prompting is also limited by model scale. Smaller models (below approximately 100 billion parameters) tend to produce reasoning chains that appear coherent on the surface but contain logical errors. In these cases, CoT prompting actually hurts performance compared to standard direct-answer prompting, which limits the technique's applicability in resource-constrained settings where only smaller models are available.
Generating intermediate reasoning steps increases the number of output tokens substantially. Studies have found that CoT requests require 20 to 80% more time (typically 10 to 20 additional seconds) compared to direct-answer prompting. Self-consistency compounds this further by requiring multiple sampling passes. For latency-sensitive applications and cases with tight cost constraints, this additional computation can be a meaningful drawback, particularly when the task does not benefit from multi-step reasoning.
The effectiveness of few-shot CoT depends heavily on the quality and relevance of the demonstration examples, the number of examples provided, and the choice of trigger phrases. Poorly chosen examples, or examples that are too dissimilar to the target question, can lead to degraded performance. Small changes in prompt wording can sometimes lead to large changes in performance, making the technique somewhat fragile in practice. CoT prompts may only work consistently within a narrow problem class if the given examples are highly specific to that class.
Even when a model generates a coherent-looking chain of thought, there is no guarantee that the reasoning is logically sound or that the final answer is correct. Models can produce plausible but incorrect intermediate steps, and the final answer may be wrong despite an apparently reasonable reasoning chain. There is no built-in verification mechanism in standard CoT prompting.
Chain-of-thought prompting and its variants are used across a wide range of domains.
CoT was originally demonstrated on math word problems, and this remains its strongest application area. The technique is used in educational tools, automated tutoring systems, and research on mathematical reasoning in AI. On the GSM8K benchmark of grade-school math, CoT prompting with self-consistency has pushed accuracy above 74% with PaLM 540B, and more recent models like GPT-4 exceed 90%.
Structured chain-of-thought prompting has been applied to code generation tasks. Li et al. (2023) proposed Structured Chain-of-Thought (SCoT) prompting, which asks LLMs to reason using programming structures (sequential, branching, and looping) before generating code. This produces more syntactically correct and logically sound programs than direct generation.
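An illustrative (not verbatim) example of what such a prompt might look like is shown below; the model is asked to lay out sequence, branch, and loop structure before writing code, and `generate` is a hypothetical LLM wrapper.

```python
# Illustrative structured chain-of-thought prompt for code generation, loosely
# following the idea of reasoning with program structures before writing code.
# The exact format used by Li et al. (2023) may differ.

SCOT_PROMPT = """Write a plan using program structures, then the code.

Task: Return the largest even number in a list, or None if there is none.

Plan:
1. (sequence) Initialize best to None.
2. (loop) For each number in the list:
3.   (branch) If the number is even and (best is None or number > best),
     set best to the number.
4. (sequence) Return best.

Code:
"""

def generate(prompt: str) -> str:
    """Hypothetical wrapper around an LLM completion API."""
    raise NotImplementedError

# code = generate(SCOT_PROMPT)  # the model completes the plan with an implementation
```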
In healthcare, CoT prompting helps LLMs work through diagnostic reasoning by chaining prompts related to symptoms, patient history, and diagnostic criteria. Research has shown that incremental reasoning-driven CoT prompting, which follows a clinician's typical approach to reaching a diagnosis, significantly outperforms direct-answer prompting on open-ended medical questions.
CoT prompting has been applied to chemistry, physics, and biology tasks where multi-step reasoning is required, such as predicting reaction products or analyzing experimental data. The technique helps models break down complex scientific problems into manageable sub-questions.
When combined with extensions like Tree of Thoughts, CoT-style reasoning enables models to perform planning tasks such as game solving, route optimization, and scheduling, where exploration of multiple options and backtracking are necessary.
Chain-of-thought prompting is one of the most important techniques in the broader field of prompt engineering. It demonstrated that the way information is presented to a language model can dramatically alter the model's capabilities without any changes to the model itself.
CoT prompting built on earlier work in few-shot prompting (as popularized by the GPT-3 paper by Brown et al., 2020) and contributed to a broader understanding that LLMs can be "programmed" through carefully designed prompts. The success of chain-of-thought prompting inspired numerous other prompting techniques, including the variants and extensions described above.
As of early 2026, chain-of-thought reasoning has evolved from an external prompting technique into a built-in capability of frontier language models. The major model providers (OpenAI, Anthropic, Google, and others) all offer models with internalized reasoning capabilities, often marketed as "thinking" modes.
For practitioners, the main practical implication is that explicit step-by-step instructions are often redundant when using reasoning models, while they remain useful for non-reasoning models on tasks that genuinely require multi-step computation.
The research frontier has moved beyond prompting-level CoT toward understanding and improving the internal reasoning processes of models trained with reinforcement learning. Questions about the faithfulness of reasoning traces, the relationship between visible chain-of-thought and actual model computation, and how to make reasoning more reliable remain active areas of investigation.
Chain-of-thought prompting's legacy extends beyond its direct technical contributions. It established a paradigm: that language models can be coaxed into performing complex cognitive tasks through careful prompt design. Even as the specific technique of manually prompting for step-by-step reasoning becomes less necessary with newer models, the underlying insight that decomposed reasoning improves model performance continues to shape how large language models are trained, evaluated, and deployed.
| Date | Event |
|---|---|
| January 2022 | Wei et al. publish "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (arXiv) |
| March 2022 | Wang et al. publish self-consistency paper (arXiv) |
| May 2022 | Kojima et al. publish zero-shot CoT paper; Zhou et al. publish least-to-most prompting paper |
| September 2022 | Madaan and Yazdanbakhsh publish analysis of why CoT works |
| October 2022 | Zhang et al. publish Auto-CoT paper |
| December 2022 | Wei et al. CoT paper published at NeurIPS 2022 |
| February 2023 | Self-consistency paper published at ICLR 2023 |
| May 2023 | Yao et al. publish Tree of Thoughts paper; Turpin et al. publish CoT unfaithfulness paper |
| August 2023 | Besta et al. publish Graph of Thoughts paper |
| September 2024 | OpenAI releases o1-preview, a reasoning model with internalized CoT; Sprague et al. publish meta-analysis of CoT effectiveness |
| December 2024 | OpenAI announces o3 with extended reasoning capabilities |
| January 2025 | DeepSeek releases R1 with open-source RL-trained reasoning |
| 2025 | Anthropic introduces extended thinking in Claude 3.7 Sonnet; Anthropic publishes "Reasoning Models Don't Always Say What They Think" |
| June 2025 | Wharton study on diminishing value of CoT prompting with reasoning models |