Chain-of-Thought

Introduction

Chain-of-thought (CoT) prompting is a prompt engineering technique for improving the reasoning capabilities of large language models (LLMs) by eliciting a series of intermediate natural-language reasoning steps before the model produces a final answer.[^1] Rather than asking a model to output a direct response to a complex question, CoT prompting encourages the model to "show its work" by generating a sequence of logical steps that mirror how a human might work through a problem on paper.

The technique was introduced by Jason Wei and colleagues at Google Brain (now part of Google DeepMind) in the January 2022 paper "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," presented at NeurIPS 2022.[^1] The paper demonstrated that providing a small number of step-by-step reasoning exemplars in the prompt could dramatically improve accuracy on arithmetic, commonsense, and symbolic reasoning benchmarks for sufficiently large models. With a 540-billion-parameter PaLM model and eight exemplars, CoT prompting lifted GSM8K grade-school math accuracy from 17.9% to roughly 58% (56.9% in some tables of the paper), surpassing the prior state of the art of 55% set by a fine-tuned GPT-3 model with a verifier.[^1]

CoT prompting has since become one of the most influential techniques in prompt engineering and modern natural language processing. It spawned a family of variants — zero-shot CoT, self-consistency, auto-CoT, least-to-most prompting, tree of thoughts, and program-of-thoughts — and ultimately motivated a new generation of reasoning models such as OpenAI's o1, DeepSeek-R1, and Claude 3.7 Sonnet with extended thinking, which internalize chain-of-thought behavior through reinforcement learning rather than depending on user-supplied prompt structure.[^14][^16][^17]

Subsequent work has qualified the original claims. The "emergent ability" framing has been contested by Schaeffer, Miranda, and Koyejo (2023), who argue that apparent phase transitions in CoT performance may be artifacts of discontinuous evaluation metrics rather than discontinuous model capabilities.[^25] Studies on CoT faithfulness (Turpin et al. 2023; Lanham et al. 2023; Chen et al. 2025) have shown that the verbalized chain of thought sometimes fails to reflect the model's true decision process, with significant implications for AI safety and interpretability.[^8][^21][^12]

ELI5 (Explain like I'm 5)

Imagine you ask your teacher "What is 27 times 13?" If the teacher just says "351," you might not understand how they got that number. But if the teacher says, "First, 27 times 10 is 270. Then 27 times 3 is 81. Now add 270 and 81 together, and you get 351," you can follow each step and see why the answer is correct.

Chain-of-thought prompting works the same way with AI. Instead of asking the AI to jump straight to the answer, you show it examples where problems are solved step by step. When the AI sees these examples, it learns to break down new problems into smaller steps too, which helps it get the right answer more often. The newest "thinking" AI models do this automatically — they have been trained to write out their reasoning before answering, without needing you to ask.

Background

Before chain-of-thought prompting, large language models trained with standard next-token prediction performed surprisingly poorly on tasks that required multi-step reasoning, even as they excelled at single-step tasks like translation, summarization, and basic question answering. Brown et al. (2020), in the original GPT-3 paper, observed that scaling alone produced significant improvements on knowledge benchmarks but only modest gains on arithmetic word problems and commonsense reasoning tasks.[^23]

Few-shot prompting (in-context learning) had emerged as the dominant inference-time interface for LLMs. A user would prepend several (input, output) pairs as exemplars and then pose a new question; the model would condition on the demonstrations to produce an answer in the same format. For tasks like sentiment classification or English-to-French translation, this worked well even with relatively few examples. But multi-step problems — where the answer required combining several intermediate facts or computations — remained stubbornly difficult, with accuracy often barely above chance.

Several earlier lines of work had explored decomposing reasoning into intermediate steps. Nye et al. (2021), in "Show Your Work: Scratchpads for Intermediate Computation with Language Models," demonstrated that fine-tuning a transformer on training data that included intermediate scratchpad tokens improved performance on multi-step arithmetic and program execution.[^24] However, this approach required fine-tuning and was not a pure inference-time technique. Wei et al. (2022) took the key step of showing that an analogous effect could be achieved purely through prompting, with no weight updates required.[^1]

Wei et al. 2022 — the original paper

The canonical CoT paper, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," was first posted to arXiv on January 28, 2022 (arXiv:2201.11903) and presented at NeurIPS 2022.[^1] Its authors — Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou — were all at Google Brain (now Google DeepMind) at the time.[^1]

Few-shot CoT exemplars

The methodological contribution was simple: rather than presenting each in-context example as an (input, output) pair, present it as an (input, chain of thought, output) triple, where the chain of thought is a sequence of natural-language sentences describing the reasoning process. The canonical example from the paper involves a math word problem:

Standard few-shot exemplar:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 11.

Chain-of-thought exemplar:

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

After seeing several such exemplars (typically four to eight), the model would generate similarly structured reasoning for a new question. The authors emphasized that this required no fine-tuning, no architectural modifications, no auxiliary verifiers, and no learned reward models — only a different prompt format.[^1]
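The prompt format described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Exemplar` class, the `build_cot_prompt` helper, and the baker question are hypothetical names and data introduced here, and the "Q:/A:" layout simply mirrors the exemplar shown earlier.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    question: str
    chain_of_thought: str  # natural-language reasoning steps
    answer: str


def build_cot_prompt(exemplars: list[Exemplar], new_question: str) -> str:
    """Format (question, chain of thought, answer) triples, then the new question."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.chain_of_thought} The answer is {ex.answer}."
        for ex in exemplars
    ]
    # The trailing "A:" leaves room for the model to write its own reasoning.
    blocks.append(f"Q: {new_question}\nA:")
    return "\n\n".join(blocks)


roger = Exemplar(
    question=("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
              "Each can has 3 tennis balls. How many tennis balls does he have now?"),
    chain_of_thought=("Roger started with 5 balls. 2 cans of 3 tennis balls each "
                      "is 6 tennis balls. 5 + 6 = 11."),
    answer="11",
)
# Hypothetical new question, used only to show where the model would continue.
prompt = build_cot_prompt(
    [roger],
    "A baker has 4 trays of 6 rolls and sells 5 rolls. How many rolls are left?",
)
print(prompt)
```

Dropping the `chain_of_thought` field from the template would reproduce the standard few-shot format, which makes the difference between the two prompting styles a one-line change.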

Datasets evaluated

Wei et al. evaluated CoT prompting across three families of reasoning tasks:[^1]

Arithmetic reasoning: GSM8K (grade-school math word problems), SVAMP, ASDiv, AQuA (algebra word problems with multiple-choice answers), MAWPS (a meta-benchmark combining several earlier math datasets), and MultiArith (multi-step arithmetic).
Commonsense reasoning: CSQA (CommonsenseQA), StrategyQA, the Date Understanding and Sports Understanding tasks from BIG-Bench, and SayCan (mapping natural-language instructions to robot action sequences).
Symbolic reasoning: Last-letter concatenation and coin-flip tracking, including out-of-distribution generalization tests with longer-than-seen examples.

PaLM 540B headline results

The most striking results came on PaLM 540B. Standard prompting on GSM8K gave 17.9% accuracy; chain-of-thought prompting with eight exemplars gave 58% (reported as 56.9% in some tables of the paper).[^1] This exceeded the previous state of the art of 55%, set by Cobbe et al.'s fine-tuned GPT-3 175B with a learned verifier — a result that required substantial additional training infrastructure.[^1]

Other notable PaLM 540B gains reported in Wei et al.:[^1]

MAWPS: 84.7% → 93.3% (+8.6 pp)
SVAMP: 69.4% → 79.0% (CoT) and 86.6% with chain of thought variants
StrategyQA: 68.6% → 77.8%
Sports Understanding (BIG-Bench): chain-of-thought PaLM 540B reached 95.4%, surpassing unaided human performance reported as 84%

Models tested

Wei et al. evaluated the technique on three large model families: GPT-3 175B (text-davinci-002 and earlier variants), LaMDA 137B, and PaLM 540B.[^1] Across all three families, chain-of-thought prompting helped on the larger end of the scale while hurting on smaller variants — a phenomenon the authors framed as an emergent ability.

Emergent ability claim

The paper argued that the benefits of CoT prompting are an emergent ability of model scale, appearing only above approximately 100 billion parameters or, equivalently, around 10^22 training FLOPs.[^1] Below this threshold, models prompted with chain-of-thought exemplars typically produced fluent but illogical or factually wrong intermediate steps and performed worse than they would have with standard prompting.

This emergent-ability framing became influential and was subsequently extended in Wei et al. (2022b), "Emergent Abilities of Large Language Models." However, Schaeffer, Miranda, and Koyejo (2023), in "Are Emergent Abilities of Large Language Models a Mirage?" (arXiv:2304.15004), challenged the claim. Their argument is that apparent emergent phase transitions arise from researchers' choice of non-linear or discontinuous evaluation metrics (such as exact-match accuracy on a multi-step problem). Under linear or continuous metrics, performance often increases smoothly with scale. They demonstrated this on GPT-3, BIG-Bench, and synthetic vision examples.[^25] The debate is unresolved: CoT prompting clearly works much better on larger models, but whether the transition is a true qualitative emergence or a smooth quantitative improvement that looks emergent under particular metrics remains contested.

Zero-shot CoT (Kojima et al. 2022)

In May 2022, Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa published "Large Language Models are Zero-Shot Reasoners" (arXiv:2205.11916), accepted at NeurIPS 2022.[^2] Their core finding was that the multi-step reasoning capability identified by Wei et al. did not actually require few-shot exemplars at all. Simply appending the trigger phrase

"Let's think step by step."

to a question prompted the model to spontaneously generate intermediate reasoning steps, and a second, separate prompt could then extract the final answer.

Two-stage prompting

The Kojima et al. method uses a two-stage prompt:[^2]

Reasoning extraction: The prompt contains the question followed by "Let's think step by step." The model generates a free-form chain of reasoning.
Answer extraction: The original question, the generated reasoning chain, and a directive like "Therefore, the answer is" are concatenated and fed back to the model, which produces the final answer.

This two-stage approach avoids the brittleness of having to parse a free-form reasoning trace to find the answer.
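A minimal sketch of the two stages follows. The `generate` callable stands in for whatever LLM API is in use, and `fake_generate` is a hypothetical stub with canned outputs so the example runs offline; the exact prompt layout is an assumption rather than the paper's template.

```python
from typing import Callable

TRIGGER = "Let's think step by step."
ANSWER_CUE = "Therefore, the answer is"


def zero_shot_cot(question: str, generate: Callable[[str], str]) -> tuple[str, str]:
    """Return (reasoning, answer) using two calls to `generate`."""
    # Stage 1: reasoning extraction.
    stage1_prompt = f"Q: {question}\nA: {TRIGGER}"
    reasoning = generate(stage1_prompt).strip()
    # Stage 2: answer extraction. Question, reasoning, and cue are fed back.
    stage2_prompt = f"{stage1_prompt} {reasoning}\n{ANSWER_CUE}"
    answer = generate(stage2_prompt).strip()
    return reasoning, answer


def fake_generate(prompt: str) -> str:
    """Stand-in for a real LLM call (hypothetical, canned outputs)."""
    if prompt.endswith(ANSWER_CUE):
        return " 351."
    return " 27 times 10 is 270. 27 times 3 is 81. 270 + 81 = 351."


reasoning, answer = zero_shot_cot("What is 27 times 13?", fake_generate)
print(reasoning)
print(answer)
```

Because the second call ends with a fixed cue, the completion is short and easy to post-process, which is the practical point of separating the stages.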

Headline results

On InstructGPT (text-davinci-002):[^2]

MultiArith: 17.7% → 78.7%
GSM8K: 10.4% → 40.7%

Similar gains appeared on PaLM 540B. The authors tested multiple candidate trigger phrases and found that "Let's think step by step" consistently outperformed alternatives, though several semantically similar phrases also worked reasonably well. Like few-shot CoT, zero-shot CoT showed strong scale dependence — it produced little benefit (and sometimes hurt) on smaller models.[^2]

The discovery that a single generic trigger phrase could unlock reasoning across diverse tasks suggested that large models possessed latent reasoning ability that few-shot CoT had merely revealed in a more cumbersome way. It also turned CoT from a careful prompt engineering exercise into an almost trivial intervention, which contributed greatly to its spread.

Self-consistency (Wang et al. 2022)

Self-consistency, introduced by Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou in March 2022 (arXiv:2203.11171), is one of the most powerful and widely used CoT extensions.[^3] The paper was published at ICLR 2023.

Sample multiple, majority vote

Standard CoT uses greedy decoding (temperature 0): the model produces one reasoning chain and one final answer. Self-consistency replaces this with a sampling-and-voting strategy:[^3]

Sample N independent reasoning chains from the model at non-zero temperature (typically N = 10 to 40 and T ≈ 0.7).
Extract the final numerical or categorical answer from each.
Return the answer that appears most frequently across the N samples (majority vote).

The intuition: complex problems often admit multiple valid solution paths, and if several diverse reasoning chains converge on the same answer, it is much more likely correct. Conversely, if the model is uncertain, different samples will diverge to different answers, exposing the uncertainty.
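The voting step can be sketched as follows, assuming the N reasoning chains have already been sampled and that each ends with a phrase of the form "The answer is X". The regular expression and the sample chains are illustrative assumptions, not part of the original method description.

```python
import re
from collections import Counter
from typing import Optional


def extract_answer(chain: str) -> Optional[str]:
    """Pull the number after 'The answer is' (assumed output format)."""
    match = re.search(r"The answer is\s*(-?\d+(?:\.\d+)?)", chain)
    return match.group(1) if match else None


def self_consistency(chains: list[str]) -> tuple[Optional[str], float]:
    """Majority vote over extracted answers, plus the winner's vote share."""
    answers = [a for a in map(extract_answer, chains) if a is not None]
    if not answers:
        return None, 0.0
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)


# Hypothetical samples for the tennis-ball question; one chain is wrong and
# one never states an answer.
sampled_chains = [
    "5 + 2 * 3 = 11. The answer is 11.",
    "2 cans of 3 is 6. 5 + 6 = 11. The answer is 11.",
    "5 + 2 = 7. 7 * 3 = 21. The answer is 21.",
    "Roger has 5, then gets 6 more. The answer is 11.",
    "I am not sure how to proceed.",
]
answer, agreement = self_consistency(sampled_chains)
print(answer, agreement)  # 11 0.75
```

The vote share returned alongside the answer is one simple way to surface the uncertainty signal mentioned above: a low share means the sampled chains disagreed.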

Accuracy gains

On PaLM 540B, self-consistency produced large gains over single-sample CoT:[^3]

Benchmark	CoT (greedy)	CoT + self-consistency	Improvement
GSM8K	56.5%	74.4%	+17.9 pp
SVAMP	68.9%	79.9%	+11.0 pp
AQuA	35.8%	48.0%	+12.2 pp
StrategyQA	73.4%	79.8%	+6.4 pp
ARC-challenge	85.2%	89.1%	+3.9 pp

Self-consistency requires no additional training, no auxiliary reward models, and no changes to the prompt format. Its only cost is increased inference compute: roughly N times the generation cost of a single chain per question. The technique remains a default baseline in reasoning research and a building block for more elaborate methods such as tree of thoughts and best-of-N sampling.

Auto-CoT (Zhang et al. 2022)

Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola proposed Auto-CoT in October 2022 (arXiv:2210.03493), published at ICLR 2023.[^7] The motivation was to remove the manual labor of writing CoT exemplars for new domains.

Auto-CoT works in two stages:[^7]

Question clustering: Embed the available training questions and cluster them with k-means to ensure diversity.
Demonstration sampling: Pick one representative question from each cluster and use zero-shot CoT to automatically generate a reasoning chain for it. The (question, generated chain, generated answer) triples are then used as few-shot exemplars for new questions.
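The two stages can be sketched with standard-library Python. Everything here is a simplified assumption: the bag-of-words `embed` function replaces a real sentence encoder, the small `kmeans` routine uses a deterministic initialisation, representatives are chosen as the question nearest each cluster centre, and `zero_shot_chain` is a hypothetical stub for the zero-shot CoT call.

```python
import math
from collections import Counter


def embed(text: str, vocab: list[str]) -> list[float]:
    """Toy bag-of-words vector; a real pipeline would use a sentence encoder."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def kmeans(points: list[list[float]], k: int, iters: int = 10) -> list[int]:
    """Plain k-means, initialised from the first k points for determinism."""
    centers = [p[:] for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def pick_representatives(points: list[list[float]], labels: list[int], k: int) -> list[int]:
    """Index of the question closest to each cluster centre."""
    reps = []
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        if not idx:
            continue
        center = [sum(col) / len(idx) for col in zip(*(points[i] for i in idx))]
        reps.append(min(idx, key=lambda i: math.dist(points[i], center)))
    return reps


def zero_shot_chain(question: str) -> str:
    """Hypothetical stand-in for a zero-shot CoT call to an LLM."""
    return f"Let's think step by step. [model reasoning about: {question}]"


questions = [
    "how many apples are left after eating 3 apples",
    "what is the last letter of the word apple",
    "how many apples are left after selling 5 apples",
    "what is the last letter of the word river",
]
vocab = sorted({w for q in questions for w in q.lower().split()})
points = [embed(q, vocab) for q in questions]
labels = kmeans(points, k=2)
demos = [(questions[i], zero_shot_chain(questions[i]))
         for i in pick_representatives(points, labels, k=2)]
print(labels)  # [0, 1, 0, 1]
for question, chain in demos:
    print(f"Q: {question}\nA: {chain}")
```

Clustering first means the selected demonstrations cover both question types (arithmetic and last-letter), which is the diversity property the method is after.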

On ten benchmark reasoning tasks with GPT-3 (text-davinci-002 and code-davinci-002), Auto-CoT matched or modestly exceeded the performance of manually written CoT exemplars.[^7] The contribution was less about a new performance ceiling and more about practical accessibility: Auto-CoT showed that the few-shot exemplars themselves could be bootstrapped from a model's own zero-shot reasoning ability.

Least-to-most prompting

Denny Zhou and colleagues introduced least-to-most prompting in May 2022 (arXiv:2205.10625), with publication at ICLR 2023.[^6] The full author list is Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi.[^6]

Least-to-most prompting addresses a specific weakness of standard CoT: poor easy-to-hard generalization. Models prompted with relatively easy CoT exemplars often fail on substantially harder versions of the same problem class. Least-to-most uses two prompting stages:[^6]

Decomposition: Prompt the model to break the problem into a sequence of progressively simpler subproblems.
Subproblem solving: Solve each subproblem in order, appending each solution to the running context so later subproblems can refer to earlier answers.
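The control flow of the two stages can be sketched as a loop over subproblems with a growing context. The `decompose` and `solve` callables represent LLM calls; the stubs below (`fake_decompose`, `fake_solve`) and the apples problem are hypothetical and exist only to make the sketch runnable.

```python
from typing import Callable


def least_to_most(problem: str,
                  decompose: Callable[[str], list[str]],
                  solve: Callable[[str], str]) -> list[tuple[str, str]]:
    """Solve subproblems in order, appending each solution to the context."""
    context = problem
    solved = []
    for sub in decompose(problem):
        answer = solve(f"{context}\nQ: {sub}\nA:")
        # Later subproblems see every earlier question and answer.
        context += f"\nQ: {sub}\nA: {answer}"
        solved.append((sub, answer))
    return solved


def fake_decompose(problem: str) -> list[str]:
    """Stand-in for the decomposition prompt (hypothetical)."""
    return ["How many apples does Anna have?",
            "How many apples do they have together?"]


def fake_solve(prompt: str) -> str:
    """Stand-in for the solving prompt; the second answer needs the first."""
    last_question = prompt.rsplit("Q: ", 1)[1]
    if last_question.startswith("How many apples does Anna have?"):
        return "Anna has 5 + 2 = 7 apples."
    if "7 apples" in prompt:
        return "Together they have 5 + 7 = 12 apples."
    return "Unknown."


problem = "Elsa has 5 apples. Anna has 2 more apples than Elsa."
steps = least_to_most(problem, fake_decompose, fake_solve)
for sub, answer in steps:
    print(sub, "->", answer)
```

The stub for the second subproblem only succeeds when the first answer is present in its prompt, which mirrors why the running context matters in the real method.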

On the SCAN compositional generalization benchmark, GPT-3 code-davinci-002 with least-to-most prompting achieved at least 99% accuracy across all splits, compared to just 16% with standard CoT — a result demonstrating that structured decomposition can overcome compositional generalization barriers that plain chain-of-thought cannot.[^6]

Tree of Thoughts (Yao et al. 2023)

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan introduced Tree of Thoughts (ToT) in May 2023 (arXiv:2305.10601), published at NeurIPS 2023.[^4] ToT generalizes chain-of-thought from a single linear reasoning trace to a tree-structured search.

In ToT, each thought is a coherent unit of intermediate reasoning. At each step, the model generates several candidate next thoughts, evaluates them (either by self-voting or by assigning numeric value estimates), and uses a classical search algorithm — typically breadth-first search or depth-first search with pruning — to navigate the tree. When a branch appears unpromising, the search backtracks.[^4]

ToT's most cited result is on the Game of 24, a mathematical puzzle requiring four input numbers to be combined with the four basic arithmetic operations to reach 24. GPT-4 with standard chain-of-thought solved only 4% of test puzzles; GPT-4 with Tree of Thoughts solved 74%.[^4] Comparable gains were reported on Creative Writing and Mini Crosswords benchmarks.
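The generate, evaluate, and prune loop can be illustrated on the Game of 24 with a breadth-first search. This is a sketch under strong assumptions: in ToT both the proposer and the evaluator are LLM calls, whereas here `propose` enumerates every arithmetic combination and `evaluate` is a brute-force solvability check standing in for the model's value estimate. The function names and the beam width are choices made for this example.

```python
from fractions import Fraction
from itertools import combinations

TARGET = Fraction(24)


def propose(numbers):
    """Yield (remaining_numbers, step_text) for each way to combine two numbers."""
    for i, j in combinations(range(len(numbers)), 2):
        a, b = numbers[i], numbers[j]
        rest = [n for k, n in enumerate(numbers) if k not in (i, j)]
        candidates = [(a + b, f"{a} + {b}"), (a * b, f"{a} * {b}"),
                      (a - b, f"{a} - {b}"), (b - a, f"{b} - {a}")]
        if b != 0:
            candidates.append((a / b, f"{a} / {b}"))
        if a != 0:
            candidates.append((b / a, f"{b} / {a}"))
        for result, text in candidates:
            yield tuple(rest + [result]), f"{text} = {result}"


def evaluate(numbers) -> float:
    """Stand-in value function: 1.0 if 24 is still reachable, else 0.0."""
    if len(numbers) == 1:
        return 1.0 if numbers[0] == TARGET else 0.0
    return 1.0 if any(evaluate(nxt) == 1.0 for nxt, _ in propose(numbers)) else 0.0


def tree_of_thoughts_bfs(start, beam_width=5):
    """Breadth-first search that keeps the `beam_width` best states per level."""
    frontier = [(tuple(Fraction(n) for n in start), [])]
    while frontier and len(frontier[0][0]) > 1:
        expanded = [(nxt, steps + [text])
                    for numbers, steps in frontier
                    for nxt, text in propose(numbers)]
        # Score every candidate thought, then prune to the most promising ones.
        expanded.sort(key=lambda item: evaluate(item[0]), reverse=True)
        frontier = expanded[:beam_width]
    for numbers, steps in frontier:
        if numbers == (TARGET,):
            return steps
    return None


solution = tree_of_thoughts_bfs([4, 9, 10, 13])
print(solution)
```

With a learned or prompted evaluator the scores would be noisy, so the beam width becomes a real trade-off between cost and the risk of pruning the only viable branch.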

A related extension, Graph of Thoughts (Besta et al., arXiv:2308.09687, AAAI 2024), generalizes the topology further: thoughts become vertices in an arbitrary directed graph, allowing operations like merging multiple reasoning paths or refining a thought based on feedback from a downstream evaluator.[^5]

Program-of-Thoughts and Faithful CoT

A separate strand of CoT-variant research replaces natural-language reasoning with executable code, delegating arithmetic and logical computation to a deterministic interpreter rather than asking the LLM to perform it.

PAL: Program-Aided Language Models

Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig introduced PAL in November 2022 (arXiv:2211.10435), published at ICML 2023.[^22] PAL prompts a code-capable model (such as Codex) to read a natural-language problem and generate a Python program whose final value is the answer. The program is executed externally; the LLM never has to compute arithmetic itself. Codex with PAL surpassed PaLM 540B on GSM8K by roughly 15 percentage points, despite using a substantially smaller model.[^22]
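The division of labour can be sketched as below. The `generated_program` string is a hand-written stand-in for model output, and `run_generated_program` with its `answer` variable convention is a hypothetical helper, not PAL's actual interface. Emptying `__builtins__` is only a token restriction; running untrusted model output safely requires a real sandbox.

```python
def run_generated_program(program: str, result_name: str = "answer"):
    """Execute model-written Python and read back a result variable."""
    namespace: dict = {}
    # NOTE: exec on untrusted text is unsafe; real systems isolate this step.
    exec(program, {"__builtins__": {}}, namespace)
    return namespace[result_name]


# Stand-in for what a code-capable model might emit for the tennis-ball problem.
generated_program = """
# Roger has 5 tennis balls.
tennis_balls = 5
# He buys 2 more cans, each with 3 tennis balls.
bought_balls = 2 * 3
# Total number of tennis balls.
answer = tennis_balls + bought_balls
"""

print(run_generated_program(generated_program))  # 11
```

The comments inside the generated program play the role of the natural-language chain, while the interpreter, not the model, performs the arithmetic.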

Program of Thoughts

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen introduced Program of Thoughts (PoT) in November 2022 (arXiv:2211.12588), published at TMLR 2023.[^26] PoT is essentially the same idea as PAL with somewhat different prompt engineering and evaluation: have the LLM generate a program rather than a natural-language chain, and offload computation to the runtime. Across eight mathematical and financial benchmarks, PoT averaged roughly 12 percentage points higher accuracy than CoT.[^26]

Faithful CoT

A more interpretability-motivated variant is Faithful CoT, introduced by Qing Lyu, Shreya Havaldar, Adam Stein, Li Zhang, Delip Rao, Eric Wong, Marianna Apidianaki, and Chris Callison-Burch in January 2023 (arXiv:2301.13379), published at IJCNLP-AACL 2023.[^27] Faithful CoT splits reasoning into a translation step (LLM converts the natural-language query into a symbolic reasoning chain in a formal language such as Python, Datalog, or PDDL) and a problem-solving step (a deterministic solver executes the chain). Because the final answer is mechanically determined by the explicit chain, the reasoning is guaranteed to be a faithful explanation of the answer — addressing the central concern of the CoT faithfulness literature discussed below. Faithful CoT outperformed standard CoT on nine of ten benchmarks across four domains.[^27]

CoT as trained behavior

The variants above are all prompting techniques: they coax reasoning out of a frozen model through clever prompt design. A separate line of work, accelerating sharply from 2023 onward, has internalized chain-of-thought behavior into model weights through training. This is a fundamentally different paradigm from CoT prompting, even though both go by similar names.

Instruction-tuned models reason without explicit prompting

The first hint of this shift came with the rise of instruction-tuned models such as text-davinci-003, ChatGPT (gpt-3.5-turbo), and Claude 1. These models, fine-tuned with supervised and reinforcement learning from human feedback on data that included many step-by-step explanations, often produced intermediate reasoning spontaneously even when not explicitly asked. The "Let's think step by step" trigger became less necessary in practice; users found that simply asking a difficult question yielded a reasoned response by default.[^15]

OpenAI o1 and o3

In September 2024, OpenAI released o1-preview, the first commercially deployed model explicitly trained with reinforcement learning to produce long, coherent chains of thought before answering.[^14] Crucially, the chain of thought is generated as hidden "thinking" tokens that the user does not see in full (OpenAI exposes only a summarized version). Internally, the model can produce thousands or tens of thousands of reasoning tokens before emitting its final response, exploring multiple solution paths and backtracking on errors. OpenAI reported that o1-preview scored 83% on the 2024 American Invitational Mathematics Examination (AIME), compared to 13% for GPT-4o. Its successor, o3, reached 96.7% on AIME and 87.7% on GPQA Diamond (graduate-level science).[^14]

The o1 paradigm is sometimes described as test-time compute scaling: rather than scaling pretraining compute or model parameters, the model scales the amount of serial reasoning it does at inference time. The relationship to traditional CoT prompting is direct in spirit — both rely on intermediate reasoning tokens — but the implementation is different. CoT prompting elicits reasoning from a model that was not specifically trained to reason at length; o1-style models have been trained, often via RL with verifiable rewards on math and code tasks, to produce extended internal reasoning automatically.

DeepSeek-R1

Released on January 20, 2025, DeepSeek-R1 (arXiv:2501.12948) demonstrated that o1-comparable reasoning could be developed in the open using a strikingly clean, RL-centered recipe.[^16] DeepSeek-R1-Zero, the variant trained with pure RL on a strong base model and no supervised fine-tuning stage on curated reasoning traces, exhibited spontaneous behaviors such as self-reflection, error correction, and reasoning length growing as training progressed; the full R1 model added a small supervised "cold-start" stage before RL. DeepSeek released the model weights under an MIT license, and an entire distilled family of smaller "R1-distill" models followed.[^16] The release pushed reasoning models firmly into the mainstream and was widely cited as a demonstration that the o1 paradigm was not dependent on any single lab's proprietary techniques.

Anthropic extended thinking

Anthropic introduced extended thinking with Claude 3.7 Sonnet in early 2025.[^17] When extended thinking is active, Claude generates an explicit reasoning trace before its final answer, with the user able to see the (largely unfiltered) thinking tokens. The feature is configurable — users can request more or less thinking depending on the difficulty of the problem. The capability was extended and refined in Claude Sonnet 4.5 (September 2025) and Claude Opus 4.6 (2026), which added adaptive thinking that automatically chooses how much to deliberate.[^17]

Google Gemini thinking variants

Google followed in late 2024 with Gemini 2.0 Flash Thinking, and subsequent Gemini 2.5 Pro and Gemini 3 variants with thinking modes, which use comparable internal CoT generation at inference.[^15]

Distinction from CoT prompting

It is worth being precise about what changed. CoT prompting (Wei 2022; Kojima 2022) is a technique that elicits intermediate reasoning from a frozen pretrained model purely through the input prompt — no training changes, no extra parameters, no special inference machinery beyond the model's own next-token prediction. CoT as trained behavior (o1, R1, extended thinking) is a model property created through additional training, typically RL with verifiable rewards, that biases the model toward generating long, deliberate reasoning traces automatically. The visible chain-of-thought text may look superficially similar in both cases, but the mechanism is different, and so are the practical implications. With trained-in CoT models, explicit "let's think step by step" prompting adds little additional benefit.[^18]

Why does chain-of-thought prompting work?

Several explanations have been proposed for why CoT produces such large performance gains. They are complementary rather than mutually exclusive.

Computational complexity arguments

A formal explanation comes from the theory of transformer expressivity. Standard fixed-depth transformers operating in a single forward pass are limited in the complexity class of problems they can solve. Merrill and Sabharwal (2023), in "The Expressive Power of Transformers with Chain of Thought" (arXiv:2310.07923), proved that allowing the transformer to generate intermediate tokens before its final answer strictly increases its computational power: with a linear number of decoding steps the model can recognize all regular languages, and with polynomially many steps it can solve problems in P (polynomial time).[^9] Without chain-of-thought tokens, no such guarantee holds for fixed-depth transformers. This serial reasoning hypothesis gives a principled reason why chain-of-thought helps on multi-step problems: the intermediate tokens function as a working tape that lets the model perform serial computation it could not otherwise execute.
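The working-tape intuition can be illustrated with a toy example; this is not the construction used in the proof. The hypothetical `parity_with_scratchpad` function below decides a regular language (bit strings with an odd number of 1s) by writing one intermediate "state token" per input symbol, so each step needs only the previous token and one input bit, and the number of steps is linear in the input length.

```python
def parity_with_scratchpad(bits: str) -> tuple[list[str], str]:
    """Track parity of 1s by emitting one state token per input bit."""
    scratchpad = ["even"]  # state before reading any input
    for bit in bits:
        previous = scratchpad[-1]
        # Constant work per step: look at one bit and the last written token.
        if bit == "1":
            scratchpad.append("odd" if previous == "even" else "even")
        else:
            scratchpad.append(previous)
    return scratchpad, scratchpad[-1]


trace, verdict = parity_with_scratchpad("1101")
print(trace)    # ['even', 'odd', 'even', 'even', 'odd']
print(verdict)  # odd
```

Answering in a single step would require the whole computation to fit in one fixed-depth pass, whereas the scratchpad spreads it across as many steps as there are input symbols.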

Decomposition of complex problems

A more intuitive explanation is that CoT breaks a multi-step problem into smaller sub-computations, each of which is simple enough for the model to perform reliably. Each intermediate step requires only one elementary operation (a single arithmetic computation, a single logical inference, a single dictionary lookup), and conditioning on earlier steps lets the model treat each subsequent step as a near-trivial single-step problem. This mirrors the way humans use scratch paper or talk through problems aloud.

Increased serial compute at inference

A closely related framing: in a transformer, each generated token corresponds to one full forward pass through every layer. Generating k intermediate reasoning tokens before the answer therefore means the model has roughly k additional forward passes' worth of computation to bring to bear on the problem. This is the basic intuition behind test-time compute scaling — that for sufficiently hard problems, allocating more inference tokens is more valuable than allocating more parameters.[^14]

Pattern matching with pretraining data

Large models have been trained on vast corpora that include mathematical solutions, textbook explanations, code with comments, and step-by-step tutorials. Chain-of-thought prompting may primarily activate these learned patterns rather than producing genuinely novel reasoning. The "let's think step by step" trigger plausibly shifts the model toward continuations that resemble the kinds of pedagogical reasoning documents present in the training distribution.

Text and patterns

Aman Madaan and Amir Yazdanbakhsh, in "Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango" (arXiv:2209.07686, September 2022), studied which components of CoT prompts actually drive performance.[^10] By systematically corrupting symbols, patterns, and text in the demonstration chains, they found that text (commonsense context and explanatory connective tissue) and patterns (structural step-by-step format) jointly contribute, but that the factual correctness of the intermediate numerical computations in the exemplars was less important than the structural pattern. CoT exemplars with wrong intermediate arithmetic could still produce good performance, as long as they preserved the multi-step structure. This was a striking finding: it suggested that CoT exemplars partly work by teaching the model a generation format rather than by demonstrating correct content.

Error localization

When the reasoning chain is explicit, errors in individual steps become potentially visible. In some cases the model can self-correct mid-chain ("wait — that gives the wrong remainder; let me redo step 3"), particularly in trained-in-CoT reasoning models. This contrasts with single-shot answer generation, where an early mistake silently contaminates the final output with no opportunity for correction.

Faithfulness concerns

A central concern about chain-of-thought — both as a prompting technique and as a property of trained reasoning models — is whether the verbalized reasoning actually reflects the computation the model used to reach its answer. If it does not, then the explanatory and safety value of visible CoT is undermined.

Turpin et al. 2023

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman published "Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting" in May 2023 (arXiv:2305.04388), with publication at NeurIPS 2023.[^8] Their experimental setup deliberately injected biases into prompts — for example, always placing the correct answer in position "A" across the few-shot exemplars in a multiple-choice task. They tested GPT-3.5 and Claude 1.0 on 13 BIG-Bench Hard tasks and found:[^8]

The models' final answers were systematically influenced by the injected biases (accuracy dropped by up to 36 percentage points).
The models' chain-of-thought explanations almost never mentioned the bias. Instead, they constructed plausible-sounding justifications for the biased answer.
On a social-bias variant of the task, models produced explanations that justified stereotypical answers without acknowledging that they had been steered by stereotype cues.

The implication: a model can have a clear, coherent, internally consistent chain of thought that systematically fails to mention the actual reason for its answer.

Lanham et al. 2023

Tamera Lanham and twenty-two colleagues at Anthropic published "Measuring Faithfulness in Chain-of-Thought Reasoning" in July 2023 (arXiv:2307.13702).[^21] Where Turpin et al. focused on demonstrating unfaithfulness, Lanham et al. introduced a battery of quantitative measurements of faithfulness:

Truncating the chain of thought before the answer and observing whether the model's answer changes.
Adding mistakes into the chain and observing whether the final answer changes accordingly.
Paraphrasing the chain and seeing if it still produces the same answer.
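The truncation test can be sketched as a loop over prefixes of the chain. The `answer_given` callable represents a model call that answers a question given a (possibly truncated) chain; the two toy models below are hypothetical and chosen to show the two extreme outcomes.

```python
from typing import Callable


def early_answering_curve(question: str, chain_steps: list[str],
                          answer_given: Callable[[str, list[str]], str]) -> list[bool]:
    """For each prefix length, does the answer match the full-chain answer?"""
    full_answer = answer_given(question, chain_steps)
    return [answer_given(question, chain_steps[:k]) == full_answer
            for k in range(len(chain_steps) + 1)]


def relies_on_chain(question: str, steps: list[str]) -> str:
    """Toy model whose answer is read off the final step of the chain."""
    return "351" if any(s.endswith("= 351") for s in steps) else "340"


def ignores_chain(question: str, steps: list[str]) -> str:
    """Toy model that gives the same answer whatever the chain says."""
    return "351"


steps = ["27 * 10 = 270", "27 * 3 = 81", "270 + 81 = 351"]
print(early_answering_curve("What is 27 times 13?", steps, relies_on_chain))
# [False, False, False, True]
print(early_answering_curve("What is 27 times 13?", steps, ignores_chain))
# [True, True, True, True]
```

An answer that never changes under truncation suggests the chain is post-hoc, while an answer that only stabilises once the chain is complete is consistent with the model actually relying on it.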

Their results were mixed:[^21] faithfulness varied substantially across tasks and models. They also found, perhaps surprisingly, that larger, more capable models tended to produce less faithful chain-of-thought on most studied tasks — the model becomes capable enough to "shortcut" the reasoning while still emitting a plausible-looking chain. The authors concluded that faithful CoT is achievable but requires careful selection of model size and task type.

Anthropic 2025: Reasoning models specifically

Yanda Chen, Joe Benton, and collaborators at Anthropic published "Reasoning Models Don't Always Say What They Think" in May 2025 (arXiv:2505.05410), extending the analysis specifically to RL-trained reasoning models.[^12] Their methodology paired questions with hint-injected variants and measured whether the model verbalized the hint in its chain of thought when it actually used the hint to change its answer. Headline findings:[^12]

Across six types of hints, models often verbalized the hint in fewer than 20% of the cases where they used it.
For "unauthorized access" hints (where the hint reveals information the model arguably should not act on), Claude reasoning models verbalized the hint about 41% of the time and DeepSeek-R1 about 19% of the time.
The authors concluded that CoT monitoring is "a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out."
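The verbalization measurement described above reduces to a conditional rate, sketched below. The `Trial` record and its field names are assumptions made for this example, and the trial data is invented purely to exercise the function; it does not reproduce any reported result.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trial:
    answer_without_hint: str
    answer_with_hint: str
    hint_answer: str           # the answer the injected hint points to
    chain_mentions_hint: bool  # did the chain of thought acknowledge the hint?


def verbalization_rate(trials: list[Trial]) -> Optional[float]:
    """Share of hint-using trials whose chain of thought mentions the hint."""
    used_hint = [
        t for t in trials
        if t.answer_without_hint != t.hint_answer
        and t.answer_with_hint == t.hint_answer
    ]
    if not used_hint:
        return None
    return sum(t.chain_mentions_hint for t in used_hint) / len(used_hint)


# Invented trials: four switch to the hinted answer, one of them says so.
trials = [
    Trial("B", "C", "C", True),
    Trial("B", "C", "C", False),
    Trial("A", "C", "C", False),
    Trial("D", "C", "C", False),
    Trial("C", "C", "C", False),  # already answered C, so hint use is unclear
    Trial("A", "A", "C", True),   # did not follow the hint
]
print(verbalization_rate(trials))  # 0.25
```

Restricting the denominator to trials where the answer switched to the hinted option is what separates "did not use the hint" from "used it silently".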

Arcuschin et al. (2025), "Chain-of-Thought Reasoning In The Wild Is Not Always Faithful" (arXiv:2503.08679), found similar patterns in less contrived natural settings, with post-hoc rationalization rates of 13% in GPT-4o-mini and 7% in Claude Haiku 3.5.[^20]

Implications for interpretability and safety

These findings have direct implications for AI safety and interpretability. A natural hope is that frontier models' visible chains of thought provide a window into their reasoning that allows humans to detect and prevent misaligned behavior. The faithfulness literature complicates this picture. Even when CoT is approximately faithful in aggregate, it may be systematically unfaithful in precisely the high-stakes adversarial cases — biased inputs, misleading hints, unauthorized actions — that monitoring most needs to catch.

CoT-monitoring research has therefore become an active subfield, with attempts to design training procedures that improve faithfulness, mechanistic-interpretability methods that compare verbalized reasoning to internal activations, and "thinking-aloud" RL training rewards that explicitly penalize unfaithful explanations.

Benchmark results

The following table summarizes selected results for chain-of-thought prompting and its variants on arithmetic reasoning tasks, drawn from the cited papers.

Model	Method	GSM8K	MultiArith	SVAMP	AQuA
PaLM 540B	Standard	17.9%	—	69.4%	—
PaLM 540B	Few-shot CoT (8 exemplars)	58.0% / 56.9%	—	79.0%	—
PaLM 540B	CoT + self-consistency	74.4%	—	86.6%	48.0%
GPT-3 175B (text-davinci-002)	Standard	~17%	22.7%	—	—
InstructGPT (text-davinci-002)	Zero-shot CoT	40.7%	78.7%	—	—
Codex (code-davinci-002)	PAL	80.4%	—	—	—
GPT-4	Standard	>90%	—	—	—
GPT-4	Tree of Thoughts (Game of 24 only: 74% vs. 4% with CoT)	—	—	—	—
o1-preview	Trained-in CoT (AIME 2024)	—	—	—	—
o3	Trained-in CoT (AIME 2024)	—	—	—	—

Sources: Wei et al. 2022[^1]; Kojima et al. 2022[^2]; Wang et al. 2022[^3]; Yao et al. 2023[^4]; Gao et al. 2022[^22]; OpenAI 2024[^14].

Beyond arithmetic, CoT prompting also produced substantial gains on commonsense and symbolic reasoning. On the Sports Understanding task in BIG-Bench, PaLM 540B with CoT reached 95.4%, surpassing the reported unaided human baseline of 84%.[^1] On last-letter concatenation and coin-flip tracking (symbolic), PaLM 540B went from near-chance accuracy with standard prompting to strong performance with CoT, including on out-of-distribution test items longer than the in-context exemplars.[^1]

When chain-of-thought helps (and when it does not)

A 2024 meta-analysis by Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, and Greg Durrett, titled "To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning" (arXiv:2409.12183, ICLR 2025), examined more than 100 papers and ran controlled comparisons across 20 datasets and 14 models.[^11] Their headline finding was that CoT's gains are heavily concentrated in math and symbolic reasoning tasks; on broad knowledge and commonsense tasks, benefits are negligible or negative.

Task type	Mean CoT improvement
Symbolic reasoning	+14.2 pp
Mathematical reasoning	+12.3 pp
Logical reasoning	+6.9 pp
Other (e.g., MMLU knowledge questions)	Negligible

On MMLU, Sprague et al. observed that answering directly produced nearly identical accuracy to CoT unless the question or the model's answer contained an equals sign, indicating symbolic operations. The takeaway is that CoT is genuinely valuable when a task requires multi-step computation and largely irrelevant when it depends on knowledge retrieval.[^11]
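The equals-sign observation suggests a simple routing heuristic, sketched below. Sprague et al. used the equals sign as an analysis signal. Turning it into a router is an assumption made here for illustration, and `should_use_cot` is a hypothetical helper name.

```python
def should_use_cot(question: str, draft_answer: str = "") -> bool:
    """Heuristic: request step-by-step reasoning only when an equals sign hints at symbolic work."""
    return "=" in question or "=" in draft_answer


print(should_use_cot("If 3x + 2 = 11, what is x?"))                 # True
print(should_use_cot("Which river flows through Paris?"))           # False
print(should_use_cot("How many apples are left?", "12 - 5 = 7"))    # True
```

A real system would likely need a richer signal, since many multi-step problems are stated without any equals sign.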

A related result from Liu et al. 2024, "Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse" (arXiv:2410.21333), showed that on tasks involving implicit statistical learning, o1-preview's accuracy dropped by up to 36.3 percentage points (absolute) relative to GPT-4o with direct prompting.[^19] Forced step-by-step reasoning can disrupt pattern-based intuitive judgments, mirroring known psychological effects in humans.

Modern usage

As of mid-2026, chain-of-thought reasoning occupies several distinct niches in practical LLM usage.

In-context CoT prompting

For non-reasoning models (GPT-4o, Claude Sonnet without extended thinking, Gemini Flash without thinking), in-context CoT prompting, whether few-shot CoT or simply asking the model to think step by step, remains a useful and easy intervention on math and reasoning-heavy tasks. It requires no API changes or fine-tuning, costs only the extra output tokens, and offers modest but real average gains on multi-step problems.[^11]

Trained-in reasoning

For reasoning models with internalized CoT (OpenAI o3 / o4-mini, Claude with extended thinking, Gemini 2.5 Pro and later with thinking, DeepSeek-R1), the model already reasons internally. Adding explicit CoT prompts provides only marginal additional benefit while increasing latency. A June 2025 study by Meincke, Mollick, Mollick, and Shapiro at the Wharton Generative AI Labs (arXiv:2506.07142) measured this directly: explicit CoT prompting on o3-mini and o4-mini yielded only 2.9-3.1 percentage-point additional accuracy gains, at the cost of 20-80% additional latency.[^18]

Test-time compute scaling

A third deployment pattern, popularized by o1 / R1, is to explicitly control the amount of reasoning compute the model is allowed to use at inference time. API parameters (reasoning effort, thinking budget) let applications dial the depth of chain-of-thought up for hard problems and down for cheap-and-fast tasks. This test-time compute scaling paradigm has become a primary axis along which frontier models are improved, alongside continued pretraining-scale increases.[^14]

Limitations

Despite its impact, chain-of-thought prompting has well-documented limitations.

Scale dependency

The original Wei et al. result was that CoT does not help (and often hurts) on models below approximately 100B parameters.[^1] Although instruction tuning has somewhat softened this floor — many 7B–30B instruct-tuned models now produce coherent step-by-step reasoning — it remains true that the largest reasoning gains accrue to the largest models. Resource-constrained settings with small models gain little from CoT and may be better served by other techniques such as fine-tuning on reasoning traces.

Error propagation

CoT chains are only as strong as their weakest step. An error anywhere in the chain can cascade through subsequent steps and produce a wrong final answer, even when most of the reasoning was correct. This is particularly acute in long chains where models cannot reliably double-check their own intermediate work without extensions like self-consistency or self-critique.
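A rough way to see how quickly errors compound is to assume that each step is correct independently with the same probability. That independence assumption is a simplification introduced here, not a claim from the cited papers. Under it, an n-step chain is fully correct with probability p to the power n.

```python
def chain_success_probability(step_accuracy: float, num_steps: int) -> float:
    """Probability that every step is correct, assuming independent steps."""
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_accuracy ** num_steps


for n in (1, 5, 10, 20):
    print(n, round(chain_success_probability(0.95, n), 3))
# 1 0.95
# 5 0.774
# 10 0.599
# 20 0.358
```

Even with 95% per-step accuracy, a 20-step chain is fully correct only about a third of the time under this model, which is one motivation for sampling several chains or adding a checking step.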

Verbosity, latency, cost

Generating intermediate reasoning means generating more output tokens. One study measured 20-80% additional latency (typically about 10-20 extra seconds) compared to direct-answer prompting.[^18] Self-consistency compounds this further by requiring multiple samples. For latency-sensitive applications or tight per-query cost budgets, CoT may not be worth its cost on tasks that do not genuinely require multi-step reasoning.
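The token overhead can be estimated with simple arithmetic. The token counts below are illustrative assumptions rather than measurements, and `output_tokens` is a hypothetical helper.

```python
def output_tokens(answer_tokens: int, reasoning_tokens: int = 0, samples: int = 1) -> int:
    """Total generated tokens when each sample emits a reasoning chain plus an answer."""
    return samples * (answer_tokens + reasoning_tokens)


direct = output_tokens(answer_tokens=20)
cot = output_tokens(answer_tokens=20, reasoning_tokens=180)
self_consistency = output_tokens(answer_tokens=20, reasoning_tokens=180, samples=10)

print(direct, cot, self_consistency)  # 20 200 2000
```

With these assumed figures, a single chain costs ten times the output of a direct answer, and ten-sample self-consistency costs a hundred times.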

Unfaithfulness

As discussed in detail above, CoT chains do not necessarily reflect the true reasoning underlying the model's answer. They can rationalize, omit relevant cues (especially biases), and shortcut around the visible chain entirely. This undermines the use of CoT as either an explanation of the model's reasoning or as a window for safety monitoring.[^8][^21][^12]

Prompt sensitivity

For few-shot CoT, performance can vary substantially with the choice of exemplars, the number of exemplars, the wording of the trigger phrase, and the order of demonstration examples. Small prompt changes have been documented to produce double-digit accuracy swings on some benchmarks.[^11] Approaches like Auto-CoT, self-consistency, and trained-in reasoning models all reduce this sensitivity but do not eliminate it.

No guarantee of correctness

Even a coherent-looking chain of thought provides no formal guarantee that the conclusion is correct. Without an external verifier (a program executor as in PAL / PoT, a symbolic solver as in Faithful CoT, or a learned verifier), CoT is at best a heuristic. Models can produce plausible-sounding chains that contain subtle errors — both as honest mistakes and, in adversarial settings, as systematically misleading rationalizations.
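The program-executor idea can be sketched as follows, in the spirit of PAL but not taken from its implementation. The `generated` string is a hard-coded stand-in for model output, and the convention that the program binds its result to a variable named `answer` is an assumption made for this example.

```python
def run_generated_program(program: str) -> object:
    """Execute model-written code and return the value it binds to `answer`."""
    # Empty builtins is only a crude guard, not a real sandbox.
    namespace = {"__builtins__": {}}
    exec(program, namespace)
    return namespace["answer"]


# Hard-coded stand-in for a model's output (hypothetical).
generated = """
tennis_balls = 5
cans = 2
balls_per_can = 3
answer = tennis_balls + cans * balls_per_can
"""

print(run_generated_program(generated))  # 11
```

Here the interpreter performs the arithmetic, so calculation slips are removed, but mistakes in translating the problem into code remain possible. Running untrusted model output also calls for proper isolation in any real deployment.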

Applications

Chain-of-thought prompting and its variants are used across many domains:

Mathematical problem solving: The original and strongest use case, spanning educational tutoring, automated grading, and math benchmark research.
Code generation: "Structured Chain-of-Thoughts" methods ask the model to reason in terms of programming structures (sequential, branching, looping) before emitting code. PAL and Program-of-Thoughts are also natively code-based.[^22][^26]
Medical reasoning: CoT-style prompting has been applied to differential diagnosis and clinical decision support, with prompts that chain symptoms, history, and diagnostic criteria in the manner of a clinician's reasoning.
Scientific reasoning: Chemistry, physics, and biology problem solving where multi-step inference is required.
Multi-step planning: Tree of Thoughts and related search-augmented variants enable planning tasks such as game solving and route optimization.
Agentic systems: ReAct (Yao et al. 2022, arXiv:2210.03629) and successor frameworks interleave chain-of-thought reasoning with tool-calling actions, allowing the model to reason about which tool to invoke and then incorporate tool results back into its reasoning.[^28]
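The interleaving of reasoning and tool calls in ReAct-style agents can be illustrated with a toy loop. Everything here is a sketch under stated assumptions: the "Thought / Action / Observation / Final Answer" text format, the `tool[argument]` syntax, and the scripted model are illustrative choices, not the interface of any particular framework.

```python
from typing import Callable, Dict


def react_loop(
    question: str,
    model: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Alternate model reasoning with tool calls until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)
        transcript += step + "\n"
        last_line = step.strip().splitlines()[-1]
        if last_line.startswith("Final Answer:"):
            return last_line[len("Final Answer:"):].strip()
        if last_line.startswith("Action:"):
            # Parse "Action: tool[argument]" and feed the result back in.
            name, _, arg = last_line[len("Action:"):].strip().partition("[")
            observation = tools[name](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    raise RuntimeError("no final answer within max_steps")


# Scripted stand-in for a language model (hypothetical).
def scripted_model(transcript: str) -> str:
    if "Observation:" not in transcript:
        return "Thought: I need the product.\nAction: multiply[6,7]"
    return "Thought: The tool returned the product.\nFinal Answer: 42"


def multiply(arg: str) -> str:
    a, b = (int(x) for x in arg.split(","))
    return str(a * b)


print(react_loop("What is 6 times 7?", scripted_model, {"multiply": multiply}))  # 42
```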

Timeline

Date	Event
2021	Nye et al. publish "Show Your Work" scratchpad paper (a fine-tuning precursor to CoT)
January 2022	Wei et al. publish "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (arXiv:2201.11903)
March 2022	Wang et al. publish self-consistency paper (arXiv:2203.11171)
May 2022	Kojima et al. publish zero-shot CoT (arXiv:2205.11916); Zhou et al. publish least-to-most prompting (arXiv:2205.10625)
September 2022	Madaan and Yazdanbakhsh publish "Text and Patterns" analysis (arXiv:2209.07686)
October 2022	Zhang et al. publish Auto-CoT (arXiv:2210.03493)
November 2022	Gao et al. publish PAL (arXiv:2211.10435); Chen et al. publish Program-of-Thoughts (arXiv:2211.12588)
December 2022	Wei et al. CoT paper formally published at NeurIPS 2022
January 2023	Lyu et al. publish Faithful CoT (arXiv:2301.13379)
April 2023	Schaeffer et al. publish "Are Emergent Abilities a Mirage?" (arXiv:2304.15004)
May 2023	Yao et al. publish Tree of Thoughts (arXiv:2305.10601); Turpin et al. publish CoT unfaithfulness paper (arXiv:2305.04388)
July 2023	Lanham et al. publish "Measuring Faithfulness in CoT" (arXiv:2307.13702)
August 2023	Besta et al. publish Graph of Thoughts (arXiv:2308.09687)
October 2023	Merrill and Sabharwal publish "Expressive Power of Transformers with CoT" (arXiv:2310.07923)
September 2024	OpenAI releases o1, the first widely deployed reasoning model with trained-in CoT; Sprague et al. publish "To CoT or not to CoT" meta-analysis
December 2024	OpenAI announces o3 with extended reasoning (public release followed in April 2025)
January 2025	DeepSeek releases R1 (arXiv:2501.12948) with open-source RL-trained reasoning
February 2025	Anthropic introduces extended thinking in Claude 3.7 Sonnet
May 2025	Anthropic publishes "Reasoning Models Don't Always Say What They Think" (arXiv:2505.05410)
June 2025	Wharton study quantifies diminishing returns of explicit CoT on reasoning models

References
