Auto-CoT
Last reviewed
Jun 7, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 4,092 words
Auto-CoT (Automatic Chain of Thought) is an automated prompting method for eliciting multi-step reasoning from large language models, introduced by Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola in the paper "Automatic Chain of Thought Prompting in Large Language Models" (arXiv:2210.03493, October 2022).[1] Its goal is to remove the labor of hand-crafting Chain-of-Thought demonstrations: instead of writing few-shot exemplars by hand, Auto-CoT clusters a pool of unlabeled questions by semantic similarity, picks one representative question per cluster, generates a reasoning chain for each using the zero-shot prompt "Let's think step by step", and concatenates these question/rationale/answer triples into a few-shot prompt.[1] The method's central empirical claim is that on ten public benchmark reasoning tasks evaluated with GPT-3 (text-davinci-002), Auto-CoT consistently matches or exceeds Manual-CoT, despite using no human-written exemplars.[1] The paper was accepted as a poster at the International Conference on Learning Representations (ICLR) 2023.[2] An official reference implementation is maintained by Amazon Web Services in the amazon-science/auto-cot repository on GitHub under the Apache 2.0 license.[3]
Chain-of-Thought (CoT) prompting elicits intermediate reasoning steps from a language model by either appending a trigger such as "Let's think step by step" to a question (Zero-Shot-CoT, due to Kojima et al., 2022) or by prefixing the question with several question/rationale/answer demonstrations that show how to reason (Manual-CoT, due to Wei et al., 2022).[1] In the original Manual-CoT formulation, the few-shot exemplars are hand-written by humans, and Wei et al. reported that demonstrations produced by different annotators could yield up to 28.2% accuracy disparity on a symbolic reasoning task, illustrating how sensitive few-shot CoT is to the specific exemplars chosen.[1]
This sensitivity motivated Auto-CoT's central question: can the rationales for the few-shot prompt themselves be generated automatically by the language model, removing the human-authoring step? An obvious candidate is to retrieve a handful of test questions, run Zero-Shot-CoT on each to obtain a rationale, and use those as in-context demonstrations. The Auto-CoT paper documents that the naive version of this idea fails: while LLMs are "decent zero-shot reasoners," they are not perfect, and the rationales produced by "Let's think step by step" sometimes contain mistakes.[1] When demonstration questions are chosen by similarity to a target test question, those mistakes tend to cluster together and reinforce one another, an effect the authors call "misleading by similarity."[1]
The authors performed the work while Zhuosheng Zhang was on an internship at AWS; Zhang's home affiliation is Shanghai Jiao Tong University, while Aston Zhang, Mu Li, and Alex Smola were at Amazon Web Services at the time of the paper.[1][4] Aston Zhang, Mu Li, and Alex Smola are also co-authors of the open-source textbook "Dive into Deep Learning," and the Auto-CoT example is discussed in that book's prompting section.[4]
The paper was first posted to arXiv on 2022-10-07 as version v1, with the title "Automatic Chain of Thought Prompting in Large Language Models" and the four authors listed above.[1] It was reviewed at the International Conference on Learning Representations in the September 2022 cycle and accepted as a poster at ICLR 2023, with the official conference page listing it under poster slot 11360.[2] In parallel, the official implementation was released on GitHub as amazon-science/auto-cot under an Apache 2.0 license and has been mirrored at the legacy URL amazon-research/auto-cot referenced from the paper.[3] Secondary coverage in prompt-engineering documentation (Learn Prompting, Prompt Engineering Guide) and a KDnuggets explainer followed in mid-2023.[5][6][7]
The motivation for Auto-CoT is summarized by the paper's three observations.[1]
Manual prompts are labor-intensive and task-specific. Manual-CoT requires writing rationales tailored to each reasoning task: arithmetic problems need different demonstrations than commonsense or symbolic reasoning tasks. The authors note that the cost of designing demonstrations leads practitioners to reuse the same hand-written exemplars across multiple datasets, which limits how task-adaptive Manual-CoT can be.[1]
Zero-Shot-CoT sometimes produces bad rationales. Although "Let's think step by step" elicits coherent reasoning on average, the resulting chains contain factual or computational mistakes for a non-negligible fraction of questions. On the MultiArith dataset, Zero-Shot-CoT applied to GPT-3 (text-davinci-002) produced wrong final answers for 128 out of 600 questions, a 21.3% error rate.[1]
Similarity-based retrieval amplifies those mistakes. If exemplars are retrieved by cosine similarity of question embeddings, semantically similar test questions tend to land in the same error region, and a wrong rationale on one is more likely to be copied into the model's reasoning on the others.[1]
The first observation calls for an automatic method. The second and third observations explain why a naive automatic method is not enough: the algorithm must select demonstrations that are robust to imperfect rationales. The paper's resolution is that diversity in demonstration selection mitigates the effect of incidental errors, while similarity amplifies them.[1]
Auto-CoT operates in two stages: (i) question clustering to identify a diverse set of demonstration candidates, and (ii) demonstration sampling to generate a rationale for each cluster representative and assemble the final few-shot prompt.[1]
Given a set of questions Q (typically the test set itself, since the paper assumes no training annotations are available), Auto-CoT first computes a vector representation of each question with Sentence-BERT, averaging the contextualized embeddings to obtain a fixed-size sentence vector.[1] It then runs k-means clustering over those vectors to partition Q into k clusters, where k matches the number of demonstrations the final prompt will contain (typically 8, but 4 for AQUA-RAT and the Letter task, 7 for CommonsenseQA, and 6 for StrategyQA).[1]
Within each cluster i, questions are sorted in ascending order of distance to the cluster center, producing an ordered list q^(i) = [q_1^(i), q_2^(i), ...]. This ordering matters in Stage 2, because Auto-CoT iterates through the list and prefers the candidate closest to the center.[1] The paper describes this as Algorithm 1 ("Cluster").[1]
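The clustering and ordering step can be sketched as follows. This is a minimal illustration, not the reference implementation: it uses a plain Lloyd's k-means written with the standard library and assumes the question vectors have already been computed (in the paper they come from Sentence-BERT; here they are passed in as plain tuples). The helper names `euclid`, `nearest`, `kmeans`, and `cluster_and_order` are hypothetical.

```python
import math
import random

def euclid(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest(v, centers):
    """Index of the center closest to vector v."""
    return min(range(len(centers)), key=lambda c: euclid(v, centers[c]))

def kmeans(vectors, k, iters=100, seed=0):
    """Plain Lloyd's k-means. Returns (centers, labels)."""
    rng = random.Random(seed)
    centers = [tuple(v) for v in rng.sample(list(vectors), k)]
    for _ in range(iters):
        # Assignment step: attach every vector to its nearest center.
        labels = [nearest(v, centers) for v in vectors]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # an empty cluster keeps its center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, [nearest(v, centers) for v in vectors]

def cluster_and_order(questions, vectors, k, seed=0):
    """Return k lists of questions, each sorted by ascending distance to its center."""
    centers, labels = kmeans(vectors, k, seed=seed)
    clusters = [[] for _ in range(k)]
    for q, v, lab in zip(questions, vectors, labels):
        clusters[lab].append((euclid(v, centers[lab]), q))
    return [[q for _, q in sorted(members)] for members in clusters]
```

In practice one would likely swap the hand-written `kmeans` for a library implementation and feed it real sentence embeddings; the part that matters for Stage 2 is the per-cluster ordering returned by `cluster_and_order`.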
For each cluster i, Auto-CoT walks through q^(i) until it finds a question that satisfies simple heuristics. For the j-th candidate question q_j^(i), the method constructs a prompt of the form [Q: q_j^(i). A: Let's think step by step.] and feeds it to the LLM in the standard Zero-Shot-CoT fashion to obtain a rationale r_j^(i) and an extracted final answer a_j^(i).[1] The candidate demonstration d_j^(i) is the concatenation Q: q_j^(i). A: r_j^(i) ∘ a_j^(i).[1]
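The heuristics for accepting a candidate, inspired by the criteria Wei et al. used for Manual-CoT, are twofold: the question must contain at most 60 tokens, and the rationale must contain at most 5 reasoning steps.[1]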
The paper notes that the second rule is easy to implement because Zero-Shot-CoT typically separates reasoning steps with the newline character "\n", so step count reduces to counting newlines.[1] If the closest candidate to the center violates the rule, Auto-CoT moves on to the next-closest candidate, and so on. The procedure is captured as Algorithm 2 ("Construct").[1]
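A rough sketch of this candidate walk is shown below, assuming the 60-token and 5-step thresholds. Several details are assumptions made for illustration: `zero_shot_cot` is a hypothetical callable that returns a `(rationale, answer)` pair for a question, tokens are approximated by whitespace-separated words, and steps are approximated by non-empty lines of the rationale.

```python
def count_steps(rationale):
    """Approximate step count: non-empty lines in a newline-separated rationale."""
    return sum(1 for line in rationale.split("\n") if line.strip())

def construct_demos(ordered_clusters, zero_shot_cot,
                    max_question_tokens=60, max_steps=5):
    """Pick at most one demonstration per cluster.

    ordered_clusters: list of question lists, each sorted closest-to-center first.
    zero_shot_cot:    hypothetical callable, question -> (rationale, answer).
    """
    demos = []
    for cluster in ordered_clusters:
        for question in cluster:
            rationale, answer = zero_shot_cot(question)
            short_question = len(question.split()) <= max_question_tokens
            short_rationale = count_steps(rationale) <= max_steps
            if short_question and short_rationale:
                demos.append({"question": question,
                              "rationale": rationale,
                              "answer": answer})
                break  # first acceptable candidate wins; go to the next cluster
        # If no candidate passes, this cluster contributes no demonstration.
    return demos
```

Because the question-length check does not depend on the model output, a cost-conscious variant could test it before calling `zero_shot_cot`; the order here simply mirrors the description above.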
After all k clusters are processed, Auto-CoT has constructed a list of k demonstrations [d^(1), ..., d^(k)]. The final prompt at inference time concatenates these demonstrations and appends the test question with the same "Let's think step by step" trigger: [d^(1), ..., d^(k), Q: q_test. A: Let's think step by step.]. The LLM then generates a reasoning chain for the test question, ending in an answer.[1]
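The prompt assembly can be illustrated with a short sketch. The exact whitespace layout (a newline between the question and the answer, a blank line between demonstrations) is an assumption made for readability, and `format_demo` and `build_prompt` are hypothetical helper names.

```python
TRIGGER = "Let's think step by step."

def format_demo(demo):
    """Render one demonstration as 'Q: question / A: rationale answer'."""
    return f"Q: {demo['question']}\nA: {demo['rationale']} {demo['answer']}"

def build_prompt(demos, test_question):
    """Concatenate the k demonstrations and append the test question with the trigger."""
    blocks = [format_demo(d) for d in demos]
    blocks.append(f"Q: {test_question}\nA: {TRIGGER}")
    return "\n\n".join(blocks)

# Toy usage with a single made-up demonstration.
demo = {"question": "Tom has 3 apples and buys 5 more. How many apples does he have?",
        "rationale": "Tom starts with 3 apples. He buys 5 more. 3 + 5 = 8.",
        "answer": "The answer is 8."}
print(build_prompt([demo], "A shelf holds 4 books and 6 more are added. How many books are there?"))
```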
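A central analytical claim of the paper is that diversity of demonstration questions is the key ingredient that protects Auto-CoT from the imperfect rationales generated by Zero-Shot-CoT.[1] To make this concrete the authors compare two ablated variants: Retrieval-Q-CoT, which selects the top-k demonstration questions most similar to the test question by cosine similarity, and Random-Q-CoT, which selects them at random; both pair each selected question with its Zero-Shot-CoT rationale.[1]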
On the 128 MultiArith questions where Zero-Shot-CoT was already wrong, the "unresolving rate" (fraction still unsolved with extra demonstrations) was 46.9% for Retrieval-Q-CoT versus 25.8% for Random-Q-CoT. The interpretation is that retrieval pulls in demonstrations whose rationales contain the same mistakes the model is about to make, while random sampling occasionally introduces a diverse, correct rationale that breaks the pattern.[1] An additional analysis showed that the Zero-Shot-CoT error rate is uneven across clusters: in one of eight MultiArith clusters the error rate was 52.3%, far higher than the overall 21.3% rate. The paper calls this the "frequent-error cluster," and notes that similarity-based retrieval is structurally biased toward over-sampling such clusters.[1]
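The unresolving rate is a simple ratio, sketched below with a hypothetical helper `unresolving_rate` that takes two collections of question identifiers: those Zero-Shot-CoT got wrong, and those still wrong after demonstrations are added.

```python
def unresolving_rate(previously_wrong, still_wrong):
    """Fraction of previously wrong questions that remain wrong with demonstrations."""
    previously_wrong = set(previously_wrong)
    if not previously_wrong:
        raise ValueError("previously_wrong must not be empty")
    unresolved = previously_wrong & set(still_wrong)
    return len(unresolved) / len(previously_wrong)

# Toy identifiers, not real MultiArith data: 2 of the 4 earlier errors persist.
rate = unresolving_rate({1, 2, 3, 4}, {2, 4, 9})
```

For scale, with 128 previously wrong questions, rates of 46.9% and 25.8% correspond to roughly 60 and 33 unresolved questions (0.469 × 128 ≈ 60 and 0.258 × 128 ≈ 33). These counts are back-calculated here from the reported percentages, not quoted from the paper.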
Clustering with k-means and taking one representative per cluster is, in this view, an explicit step away from similarity sampling and toward diversity sampling. Even if all wrong demonstrations fall into the same frequent-error cluster, sampling exactly one question from each of k clusters yields at most one bad demonstration out of k; the paper argues this is the structural reason Auto-CoT survives Zero-Shot-CoT's imperfections.[1]
| Stage | Step | Operation |
|---|---|---|
| 1 | Encode | Compute Sentence-BERT vector for each question in Q |
| 1 | Cluster | Run k-means on the question vectors into k clusters |
| 1 | Order | Sort each cluster's members by distance to the cluster center, ascending |
| 2 | Generate | For each cluster, run Zero-Shot-CoT on the closest candidate to obtain rationale + answer |
| 2 | Filter | Accept the demonstration if the question is at most 60 tokens and the rationale has at most 5 steps; else move to the next candidate |
| 2 | Assemble | Concatenate k accepted demonstrations and append the test question with the "Let's think step by step" trigger |
Source: Algorithms 1 and 2 of Zhang et al. (2022).[1]
The Auto-CoT paper evaluates the method on ten reasoning benchmarks spanning three task categories, with the public GPT-3 model (text-davinci-002) as the default LLM.[1] The arithmetic reasoning tasks are MultiArith, GSM8K, AddSub, AQUA-RAT, SingleEq, and SVAMP. The commonsense reasoning tasks are CommonsenseQA (CSQA) and StrategyQA. The symbolic reasoning tasks are Last Letter Concatenation and Coin Flip.[1]
The headline result is that Auto-CoT matches or beats Manual-CoT on all ten datasets, frequently by a small margin.[1]
| Method | MultiArith | GSM8K | AddSub | AQuA | SingleEq | SVAMP | CSQA | StrategyQA | Letter | Coin |
|---|---|---|---|---|---|---|---|---|---|---|
| Zero-Shot | 22.7 | 12.5 | 77.0 | 22.4 | 78.7 | 58.8 | 72.6 | 54.3 | 0.2 | 53.8 |
| Zero-Shot-CoT | 78.7 | 40.7 | 74.7 | 33.5 | 78.7 | 63.7 | 64.6 | 54.8 | 57.6 | 91.4 |
| Few-Shot | 33.8 | 15.6 | 83.3 | 24.8 | 82.7 | 65.7 | 79.5 | 65.9 | 0.2 | 57.2 |
| Manual-CoT | 91.7 | 46.9 | 81.3 | 35.8 | 86.6 | 68.9 | 73.5 | 65.4 | 59.0 | 97.2 |
| Auto-CoT | 92.0 | 47.9 | 84.8 | 36.5 | 87.0 | 69.5 | 74.4 | 65.4 | 59.7 | 99.9 |
All numbers are accuracy in percent on the standard test split for each dataset, with Auto-CoT results averaged over three random runs. Zero-Shot and Zero-Shot-CoT numbers are reproduced from Kojima et al. (2022); Few-Shot and Manual-CoT numbers are reproduced from Wei et al. (2022).[1] On the StrategyQA benchmark Auto-CoT ties Manual-CoT at 65.4%; on CommonsenseQA both Auto-CoT and Manual-CoT lag the Few-Shot baseline, an effect the paper attributes to the comparatively weak performance of CoT on multiple-choice commonsense reasoning at the GPT-3 scale.[1]
To test that the result is not specific to the InstructGPT family, the paper also evaluates Codex (code-davinci-002) as the underlying LLM on three arithmetic datasets. Auto-CoT remains competitive with Manual-CoT in this setting, outperforming it on GSM8K (62.8% vs 59.4%) and AddSub (91.9% vs 84.6%) while trailing slightly on MultiArith (93.2% vs 96.8%).[1]
The paper also reports an "Effect of Wrong Demonstrations" ablation in which a fraction of the eight demonstrations are deliberately replaced with incorrect Zero-Shot-CoT outputs. Diversity-based Auto-CoT remains robust up to roughly 50% wrong demonstrations on MultiArith, while an "In-Cluster Sampling" baseline that draws all demonstrations from the same cluster as the test question degrades much more quickly. This is the empirical evidence the paper uses to argue that diversity, not exemplar correctness alone, drives performance.[1]
The paper also describes a bootstrapped variant, Auto-CoT*, for a streaming setting in which test questions arrive in small batches. The bootstrapped variant uses Zero-Shot-CoT on the first batch (which is too small to cluster meaningfully), accumulates question/rationale pairs in a memory, and then applies the standard Auto-CoT clustering and sampling to that memory for subsequent batches. By the second batch on MultiArith (with batch size m=30), Auto-CoT* performs comparably to Manual-CoT.[1]
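The control flow of the streaming variant can be sketched as below. All three callables are hypothetical stand-ins supplied by the caller: `zero_shot_cot(question)` and `few_shot_cot(demos, question)` each return a `(rationale, answer)` pair, and `build_demos(memory)` is assumed to run the usual clustering and sampling over the accumulated triples.

```python
def auto_cot_star(batches, zero_shot_cot, build_demos, few_shot_cot):
    """Process question batches in arrival order and return the answers.

    The first batch falls back to Zero-Shot-CoT; later batches use
    demonstrations built from everything answered so far.
    """
    memory = []    # accumulated (question, rationale, answer) triples
    answers = []
    for index, batch in enumerate(batches):
        if index == 0:
            results = [zero_shot_cot(q) for q in batch]
        else:
            demos = build_demos(list(memory))
            results = [few_shot_cot(demos, q) for q in batch]
        for question, (rationale, answer) in zip(batch, results):
            memory.append((question, rationale, answer))
            answers.append(answer)
    return answers
```

Rebuilding the demonstrations once per batch, rather than once per question, keeps the number of clustering runs proportional to the number of batches; that choice is part of this sketch rather than something the text above specifies.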
The paper's Appendix B.1 documents the dataset sizes and answer formats for the ten benchmarks:[1]
| Dataset | Samples | Avg. words | Answer format | License |
|---|---|---|---|---|
| MultiArith | 600 | 31.8 | Number | Unspecified |
| AddSub | 395 | 31.5 | Number | Unspecified |
| GSM8K | 1319 | 46.9 | Number | MIT |
| AQUA | 254 | 51.9 | Multiple choice | Apache 2.0 |
| SingleEq | 508 | 27.4 | Number | None |
| SVAMP | 1000 | 31.8 | Number | MIT |
| CSQA | 1221 | 27.8 | Multiple choice | Unspecified |
| StrategyQA | 2290 | 9.6 | Yes/No | Apache 2.0 |
| Last Letters | 500 | 15.0 | String | Unspecified |
| Coin Flip | 500 | 37.0 | Yes/No | Unspecified |
GSM8K is the largest arithmetic benchmark in the suite and the most widely cited; it requires multi-step arithmetic reasoning on grade-school math word problems.[1] StrategyQA is by far the largest non-arithmetic benchmark and requires multi-hop implicit reasoning.[1] All evaluations use greedy decoding with max_tokens=256 and temperature=0 via the OpenAI GPT-3 API.[1]
Auto-CoT's Appendix C reports several ablations of the selection criteria.[1] When restricted to demonstrations whose Zero-Shot-CoT answer is correct (so the "wrong demonstration" confound is removed), the choice of where to draw from inside a cluster matters:
| In-cluster ordering | MultiArith accuracy |
|---|---|
| Auto-CoT (closest to center, in-cluster min distance) | 93.7 |
| In-Cluster Min Dist (same as Auto-CoT) | 93.7 |
| In-Cluster Random | 89.2 |
| In-Cluster Max Dist | 88.7 |
In other words, taking the most "prototypical" question for each cluster (closest to the centroid) is a small but consistent improvement over sampling randomly inside the cluster, and a non-trivial improvement over choosing the outliers.[1]
A separate appendix ablation looks at which of the three demonstration components (question, rationale, answer) is most sensitive to corruption. Shuffling the questions across demonstrations reduces accuracy from 91.7% to 73.8%, while shuffling the rationales drops it to 43.8% and shuffling the answers to 17.0%. The asymmetry indicates that the rationale-answer mapping carries most of the in-context learning signal in CoT settings, which is the basis for the heuristic that demonstrations must have an answer that actually appears inside the rationale.[1] For arithmetic tasks except AQUA (which is multiple-choice), Auto-CoT applies this additional filter: the extracted answer a_j^(i) must be non-empty and must appear inside the rationale r_j^(i) for the demonstration to be accepted.[1]
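The answer-in-rationale filter can be sketched as a small predicate. The task identifiers in `FILTERED_TASKS` and the function name `passes_answer_filter` are hypothetical; the set simply lists the arithmetic benchmarks other than AQUA.

```python
# Arithmetic tasks other than AQUA (identifiers are illustrative).
FILTERED_TASKS = {"multiarith", "gsm8k", "addsub", "singleeq", "svamp"}

def passes_answer_filter(task, rationale, answer):
    """True if the demonstration may be kept under the answer-in-rationale rule."""
    if task.lower() not in FILTERED_TASKS:
        return True  # the rule is not applied to this task
    answer = answer.strip()
    return answer != "" and answer in rationale

# Toy usage: the extracted answer "8" appears in the rationale, so it is kept.
keep = passes_answer_filter("gsm8k", "3 + 5 = 8. The answer is 8.", "8")
```

A plain substring test is deliberately loose: an answer of "8" would also match a rationale that only mentions "18", so a stricter variant might compare whole numeric tokens instead.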
The official implementation of Auto-CoT is hosted on GitHub at amazon-science/auto-cot under the Amazon Science organization, with an Apache 2.0 license.[3] The repository is the reference codebase cited from the paper, and includes:
run_demo.py, which builds the demonstrations for a given task by clustering questions, generating Zero-Shot-CoT rationales, and writing the selected demonstrations to disk; and
run_inference.py, which runs in-context inference on the test set using a saved set of demonstrations.[3]
The repository also ships a Jupyter notebook try_cot.ipynb (and a Google Colab version try_cot_colab.ipynb) that walks through Auto-CoT end-to-end on a small example.[3] Dependencies declared in the README include Python 3.8 or newer and PyTorch 1.8.2, and the test datasets are pulled from the earlier Zero-Shot-CoT repository.[3] The default clustering uses Sentence-BERT through the sentence-transformers library; the README points users to the original Reimers and Gurevych (2019) implementation that the paper cites for question encoding.[1][3]
The README's framing emphasizes that the contribution is a way to "SAVE huge manual efforts in chain of thought prompt design," and that the empirical claim is matching or exceeding manually designed prompts on GPT-3 (text-davinci-002) on the ten benchmarks above.[3] The official BibTeX entry for the work is:
@inproceedings{zhang2023automatic,
title={Automatic Chain of Thought Prompting in Large Language Models},
author={Zhang, Zhuosheng and Zhang, Aston and Li, Mu and Smola, Alex},
booktitle={The Eleventh International Conference on Learning Representations (ICLR 2023)},
year={2023}
}
This citation is reproduced verbatim from the project README.[3]
Auto-CoT is one of the earliest works to push back on the assumption that few-shot CoT demonstrations must be hand-authored, and on the related assumption that the best demonstrations are the ones most semantically similar to the query. The paper's analysis that similarity-based retrieval can actively hurt in-context learning when the underlying rationales are imperfect has been picked up by later prompting research as a general design principle: maintain diversity among in-context demonstrations, even at the cost of relevance to the specific query.[5][6]
The paper sits in a cluster of 2022 work on automating or stabilizing CoT, alongside self-consistency decoding (which samples many CoT chains and takes a majority vote) and zero-shot CoT (which proposes the "Let's think step by step" trigger).[1] Auto-CoT differs in that it manipulates the few-shot demonstrations themselves rather than the decoding process, and explicitly couples a clustering step to a generation step.
Outside of the academic literature, Auto-CoT is now a standard entry in prompt-engineering reference materials: Prompt Engineering Guide and Learn Prompting both list Automatic Chain of Thought as a named CoT variant alongside Zero-Shot-CoT, Manual-CoT, and self-consistency, citing the Zhang et al. (2022) paper as the source.[5][6] KDnuggets covered the method in a July 2023 explainer, emphasizing the "treat one LLM as a demonstration generator and another as an inference engine" framing and the diversity-over-similarity insight.[7]
The methodological observation that diversity, not similarity, should drive demonstration selection has shown up repeatedly in subsequent prompting research. Auto-CoT's experiments on the "frequent-error cluster" phenomenon, where a single cluster on the MultiArith dataset had a Zero-Shot-CoT error rate of 52.3% versus a 21.3% overall rate, give a mechanistic explanation: if your in-context examples are all retrieved from one neighborhood of a model's failure modes, the prompt becomes a self-reinforcing trap.[1] Clustering questions and taking one representative per cluster bounds this risk because a single bad demonstration in eight (k=8) is empirically tolerable. The paper's appendix shows that even at 50% wrong demonstrations on MultiArith, the diversity-based variant degrades only modestly, while an "In-Cluster Sampling" variant that draws all demonstrations from the same cluster degrades sharply.[1]
This robustness story is a direct extension of earlier in-context learning analyses (Min et al., 2022; Lu et al., 2022) that documented the heavy sensitivity of few-shot prompts to ordering and exemplar choice, but Auto-CoT generalizes those findings to the CoT setting where the demonstrations contain rationales as well as labels and where the demonstrations are generated rather than retrieved.[1] The Min et al. (2022) result that even incorrect labels in few-shot prompts only marginally hurt classification accuracy is explicitly contrasted with Auto-CoT's findings: in CoT settings, mistakes in either the question-to-rationale mapping or the rationale-to-answer mapping cause a much larger performance drop than the analogous corruption in flat classification.[1]
The paper itself acknowledges several limitations of Auto-CoT:[1]
Clustering is only meaningful at sufficient scale. In the streaming setting where the first batch is small, Auto-CoT degenerates to Zero-Shot-CoT because there is not yet enough data to form k meaningful clusters. The bootstrapped Auto-CoT* variant addresses this only partially, requiring accumulation across batches.
Selection heuristics are simple. The 60-token / 5-step rule is hand-set, follows precedent from Wei et al. (2022), and is not learned. The paper makes no claim that these particular thresholds are optimal.
It still depends on Zero-Shot-CoT being a "decent" reasoner. If the underlying LLM does not produce a reasonable rationale most of the time in response to "Let's think step by step," there is no number of demonstrations Auto-CoT can construct that will succeed; the method exploits, rather than replaces, the model's zero-shot reasoning ability.
External commentary has pointed to two further limitations. First, Auto-CoT's clustering uses Sentence-BERT, which was trained on general-domain text; on highly specialized reasoning domains the embedding may not be a good proxy for "diverse skills." Second, the gains over Manual-CoT in the original paper are typically a few accuracy points and are within the variability of in-context learning runs on some benchmarks, so the strongest practical argument for Auto-CoT is convenience and reproducibility rather than raw accuracy.[5][7] Subsequent work such as Automate-CoT (Shum et al., 2023) and Reprompting (Xu et al., 2023) explicitly targets the variance and exemplar-selection issues that Auto-CoT leaves open.[8][9]
The "automatic exemplar selection for CoT" line of work that Auto-CoT helped open includes a number of follow-up papers, notably Automate-CoT (Shum et al., 2023) and Reprompting (Xu et al., 2023).
Both of these works cite Auto-CoT as the baseline against which automatic demonstration construction methods are now measured.[8][9] Reprompting in particular reports an average accuracy improvement of 9.4 absolute points over human-written CoT prompts and 11 to 33 points over Auto-CoT on its evaluated tasks, framing its contribution as inheriting Auto-CoT's "no manual exemplars" stance while reducing the variance in demonstration quality.[9] Automate-CoT explicitly cites Auto-CoT's prompt sensitivity issue as the motivation for its variance-reduction approach.[8]
Auto-CoT's diversity-sampling idea has also been carried over to multimodal reasoning: a follow-up effort from the same Amazon Science group, amazon-science/mm-cot, extends CoT to vision-language inputs and reuses the demonstration-construction philosophy of Auto-CoT in the multimodal setting.[3] More broadly, the paper has been cited as inspiration for retrieval-augmented prompting methods that use clustering for diversity rather than top-k similarity for relevance, including hybrid pipelines that combine cluster-based selection with retrieval reranking.[5]
| Method | Demonstrations | Trigger | Selection mechanism |
|---|---|---|---|
| Zero-Shot prompting | none | "The answer is" | none |
| Zero-Shot-CoT | none | "Let's think step by step" | none |
| Few-Shot prompting | hand-written | none | hand-picked |
| Manual-CoT | hand-written rationales | "Let's think step by step" | hand-picked |
| Retrieval-Q-CoT (ablation) | Zero-Shot-CoT rationales | "Let's think step by step" | top-k cosine similarity |
| Random-Q-CoT (ablation) | Zero-Shot-CoT rationales | "Let's think step by step" | random |
| Auto-CoT | Zero-Shot-CoT rationales | "Let's think step by step" | k-means cluster + closest-to-center + heuristics |
Definitions of the baselines are drawn from Zhang et al. (2022).[1]