Auto-CoT

Updated Jun 28, 2026

Auto-CoT (Automatic Chain of Thought) is an automated prompting method that builds few-shot Chain-of-Thought demonstrations for large language models without any human-written exemplars, introduced by Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola in the 2022 paper "Automatic Chain of Thought Prompting in Large Language Models" (arXiv:2210.03493).^[1] Instead of hand-crafting reasoning examples, Auto-CoT clusters a pool of unlabeled questions by semantic similarity, picks one representative question per cluster, generates a reasoning chain for each with the zero-shot prompt "Let's think step by step", and concatenates those question/rationale/answer triples into a single few-shot prompt.^[1] The paper's headline finding is that on ten public benchmark reasoning tasks evaluated with GPT-3 (text-davinci-002), "Auto-CoT consistently matches or exceeds the performance of the CoT paradigm that requires manual designs of demonstrations," with its key insight being that "diversity matters for automatically constructing demonstrations."^[1] The work was accepted as a poster at the International Conference on Learning Representations (ICLR) 2023, and an official Apache 2.0 implementation is maintained by Amazon Web Services at the amazon-science/auto-cot GitHub repository.^[2]^[3]

What is Auto-CoT?

Auto-CoT is an automatic prompt engineering technique that removes the labor of hand-crafting Chain-of-Thought demonstrations. Where Manual-CoT (Wei et al., 2022) requires a human to write a handful of worked examples for each new reasoning task, Auto-CoT lets the language model generate its own demonstrations and selects a diverse set of them automatically.^[1] The paper frames its own contribution with a play on the zero-shot trigger phrase: "let's think not just step by step, but also one by one," meaning the model produces a reasoning chain for one sampled question at a time and those chains become the in-context demonstrations.^[1]

The method was created while Zhuosheng Zhang was an intern at AWS; his home affiliation is Shanghai Jiao Tong University, while Aston Zhang, Mu Li, and Alex Smola were at Amazon Web Services at the time of the paper.^[1]^[4] Aston Zhang, Mu Li, and Alex Smola are also co-authors of the open-source textbook "Dive into Deep Learning," and the Auto-CoT example is discussed in that book's prompting section.^[4]

Background: how does Auto-CoT relate to Chain-of-Thought?

Chain-of-Thought (CoT) prompting elicits intermediate reasoning steps from a language model by either appending a trigger such as "Let's think step by step" to a question (Zero-Shot-CoT, due to Kojima et al., 2022) or by prefixing the question with several question/rationale/answer demonstrations that show how to reason (Manual-CoT, due to Wei et al., 2022).^[1] In the original Manual-CoT formulation, the few-shot exemplars are hand-written by humans, and Wei et al. reported that demonstrations produced by different annotators could yield up to 28.2% accuracy disparity on a symbolic reasoning task, illustrating how sensitive few-shot CoT is to the specific exemplars chosen.^[1]

This sensitivity motivated Auto-CoT's central question: can the rationales for the few-shot prompt themselves be generated automatically by the language model, removing the human-authoring step? An obvious candidate is to retrieve a handful of test questions, run Zero-Shot-CoT on each to obtain a rationale, and use those as in-context demonstrations. The Auto-CoT paper documents that the naive version of this idea fails: while LLMs are "decent zero-shot reasoners," they are not perfect, and the rationales produced by "Let's think step by step" sometimes contain mistakes.^[1] When demonstration questions are chosen by similarity to a target test question, those mistakes tend to cluster together and reinforce one another, an effect the authors call "misleading by similarity."^[1]

When was Auto-CoT published?

The paper was first posted to arXiv on 2022-10-07 as version v1, with the title "Automatic Chain of Thought Prompting in Large Language Models" and the four authors listed above.^[1] It was reviewed at the International Conference on Learning Representations in the September 2022 cycle and accepted as a poster at ICLR 2023, where the official conference page lists it as poster 11360.^[2] In parallel, the official implementation was released on GitHub as amazon-science/auto-cot under an Apache 2.0 license; the paper references it at the legacy URL amazon-research/auto-cot.^[3] Secondary coverage in prompt-engineering documentation (Learn Prompting, Prompt Engineering Guide) and a KDnuggets explainer followed in mid-2023.^[5]^[6]^[7]

Why was Auto-CoT created?

The motivation for Auto-CoT is summarized by the paper's three observations.^[1]

Manual prompts are labor-intensive and task-specific. Manual-CoT requires writing rationales tailored to each reasoning task: arithmetic problems need different demonstrations than commonsense or symbolic reasoning tasks. The authors note that the cost of designing demonstrations leads practitioners to reuse the same hand-written exemplars across multiple datasets, which limits how task-adaptive Manual-CoT can be.^[1]
Zero-Shot-CoT sometimes produces bad rationales. Although "Let's think step by step" elicits coherent reasoning on average, the resulting chains contain factual or computational mistakes for a non-negligible fraction of questions. On the MultiArith dataset, Zero-Shot-CoT applied to GPT-3 (text-davinci-002) produced wrong final answers for 128 out of 600 questions, a 21.3% error rate.^[1]
Similarity-based retrieval amplifies those mistakes. If exemplars are retrieved by cosine similarity of question embeddings, semantically similar test questions tend to land in the same error region, and a wrong rationale on one is more likely to be copied into the model's reasoning on the others.^[1]

The first observation calls for an automatic method. The second and third observations explain why a naive automatic method is not enough: the algorithm must select demonstrations that are robust to imperfect rationales. The paper's resolution is that diversity in demonstration selection mitigates the effect of incidental errors, while similarity amplifies them.^[1] As the abstract puts it: "To mitigate the effect of such mistakes, we find that diversity matters for automatically constructing demonstrations."^[1]

How does Automatic Chain-of-Thought prompting work?

Auto-CoT operates in two stages: (i) question clustering to identify a diverse set of demonstration candidates, and (ii) demonstration sampling to generate a rationale for each cluster representative and assemble the final few-shot prompt.^[1]

Stage 1: Question clustering

Given a set of questions Q (typically the test set itself, since the paper assumes no training annotations are available), Auto-CoT first computes a vector representation of each question with Sentence-BERT, averaging the contextualized embeddings to obtain a fixed-size sentence vector.^[1] It then runs k-means clustering over those vectors to partition Q into k clusters, where k matches the number of demonstrations the final prompt will contain (typically 8, but 4 for AQUA-RAT and the Letter task, 7 for CommonsenseQA, and 6 for StrategyQA).^[1]

Within each cluster i, questions are sorted in ascending order of distance to the cluster center, producing an ordered list q^(i) = [q_1^(i), q_2^(i), ...]. This ordering matters in Stage 2, because Auto-CoT iterates through the list and prefers the candidate closest to the center.^[1] The paper describes this as Algorithm 1 ("Cluster").^[1]
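The clustering and ordering step can be sketched in a few lines of Python. This is a minimal illustration, not the official implementation: it assumes the question vectors have already been computed (for example by a Sentence-BERT encoder), uses a plain Lloyd's k-means with centers initialized to the first k vectors so the result is deterministic, and the function names kmeans and cluster_questions are hypothetical.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(vector, centers):
    """Index of the center closest to the vector."""
    return min(range(len(centers)), key=lambda c: squared_distance(vector, centers[c]))


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means. Assumption: centers start at the first k vectors."""
    centers = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        assignment = [nearest_center(v, centers) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment against the final centers.
    return centers, [nearest_center(v, centers) for v in vectors]


def cluster_questions(questions, vectors, k):
    """Return k lists of questions, each sorted by ascending distance to its center."""
    centers, assignment = kmeans(vectors, k)
    clusters = [[] for _ in range(k)]
    for question, vector, a in zip(questions, vectors, assignment):
        distance = math.sqrt(squared_distance(vector, centers[a]))
        clusters[a].append((distance, question))
    return [[q for _, q in sorted(members, key=lambda m: m[0])] for members in clusters]


# Toy 2-D "embeddings" standing in for real sentence vectors.
questions = ["q_a", "q_b", "q_c", "q_d", "q_e", "q_f"]
vectors = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0], [0.0, 5.0], [10.0, 14.0]]
ordered = cluster_questions(questions, vectors, k=2)
```

In the toy data, the first element of each ordered list is the question nearest its cluster center, which is the candidate Stage 2 tries first.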

Stage 2: Demonstration sampling

For each cluster i, Auto-CoT walks through q^(i) until it finds a question that satisfies simple heuristics. For the j-th candidate question q_j^(i), the method constructs a prompt of the form [Q: q_j^(i). A: Let's think step by step.] and feeds it to the LLM in the standard Zero-Shot-CoT fashion to obtain a rationale r_j^(i) and an extracted final answer a_j^(i).^[1] The candidate demonstration d_j^(i) is the concatenation [Q: q_j^(i). A: r_j^(i) ∘ a_j^(i)], where ∘ denotes string concatenation.^[1]

The heuristics for accepting a candidate, inspired by the criteria Wei et al. used for Manual-CoT, are:^[1]

the question q_j^(i) has no more than 60 tokens, and
the generated rationale r_j^(i) contains no more than 5 reasoning steps.

The paper notes that the second rule is easy to implement because Zero-Shot-CoT typically separates reasoning steps with the newline character "\n", so step count reduces to counting newlines.^[1] If the closest candidate to the center violates the rule, Auto-CoT moves on to the next-closest candidate, and so on. The procedure is captured as Algorithm 2 ("Construct").^[1]
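The candidate walk described above might look like the following sketch. It is illustrative only: whitespace splitting stands in for whatever tokenizer is used to count question tokens, non-empty lines stand in for reasoning steps, and zero_shot_cot is a hypothetical callable that takes a prompt and returns a (rationale, answer) pair from an LLM.

```python
def passes_heuristics(question, rationale, max_question_tokens=60, max_steps=5):
    """Check the two acceptance rules for a candidate demonstration."""
    # Assumption: whitespace-separated words approximate tokens.
    n_tokens = len(question.split())
    # Assumption: each non-empty line of the rationale is one reasoning step.
    n_steps = len([line for line in rationale.split("\n") if line.strip()])
    return n_tokens <= max_question_tokens and n_steps <= max_steps


def build_demonstration(ordered_cluster, zero_shot_cot):
    """Walk one cluster from closest-to-center outward; return the first accepted demo."""
    for question in ordered_cluster:
        prompt = f"Q: {question}\nA: Let's think step by step."
        rationale, answer = zero_shot_cot(prompt)
        if passes_heuristics(question, rationale):
            return f"Q: {question}\nA: {rationale} {answer}"
    return None  # no candidate in this cluster satisfied the rules


# Stub standing in for a real LLM call, for illustration only.
def fake_llm(prompt):
    if "hard" in prompt:
        return "\n".join(f"step {i}" for i in range(7)), "The answer is 1."
    return "step one\nstep two", "The answer is 2."


demo = build_demonstration(["a hard question", "an easy question"], fake_llm)
```

With the stub above, the first candidate is rejected for having seven steps, so the walk falls through to the second candidate.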

After all k clusters are processed, Auto-CoT has constructed a list of k demonstrations [d^(1), ..., d^(k)]. The final prompt at inference time concatenates these demonstrations and appends the test question with the same "Let's think step by step" trigger: [d^(1), ..., d^(k), Q: q_test. A: Let's think step by step.]. The LLM then generates a reasoning chain for the test question, ending in an answer.^[1]
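Prompt assembly is plain string concatenation. The sketch below assumes each demonstration is already a formatted string and joins them with a blank line; the separator and the function name assemble_prompt are illustrative choices, not details taken from the paper.

```python
def assemble_prompt(demonstrations, test_question):
    """Concatenate k demonstrations and append the test question with the trigger."""
    parts = list(demonstrations)
    parts.append(f"Q: {test_question}\nA: Let's think step by step.")
    # Assumption: a blank line separates consecutive demonstrations.
    return "\n\n".join(parts)


demos = [
    "Q: What is 2 + 3?\nA: 2 plus 3 is 5. The answer is 5.",
    "Q: What is 4 * 2?\nA: 4 times 2 is 8. The answer is 8.",
]
prompt = assemble_prompt(demos, "What is 7 - 4?")
```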

Why does diversity beat similarity in Auto-CoT?

A central analytical claim of the paper is that diversity of demonstration questions is the key ingredient that protects Auto-CoT from the imperfect rationales generated by Zero-Shot-CoT.^[1] To make this concrete the authors compare two ablated variants:^[1]

Retrieval-Q-CoT: for each test question, retrieve the top-k most similar other questions by Sentence-BERT cosine similarity and use Zero-Shot-CoT to generate their rationales.
Random-Q-CoT: for each test question, randomly sample k other questions and use Zero-Shot-CoT to generate their rationales.

On the 128 MultiArith questions where Zero-Shot-CoT was already wrong, the "unresolving rate" (fraction still unsolved with extra demonstrations) was 46.9% for Retrieval-Q-CoT versus 25.8% for Random-Q-CoT. The interpretation is that retrieval pulls in demonstrations whose rationales contain the same mistakes the model is about to make, while random sampling occasionally introduces a diverse, correct rationale that breaks the pattern.^[1] An additional analysis showed that the Zero-Shot-CoT error rate is uneven across clusters: in one of eight MultiArith clusters the error rate was 52.3%, far higher than the overall 21.3% rate. The paper calls this the "frequent-error cluster," and notes that similarity-based retrieval is structurally biased toward over-sampling such clusters.^[1]
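The difference between the two ablations lies only in how demonstration questions are picked. A rough sketch, assuming question vectors are already available as lists of floats and using the hypothetical names retrieval_q and random_q:

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieval_q(test_index, vectors, k):
    """Retrieval-Q-CoT style: indices of the k questions most similar to the test question."""
    others = [i for i in range(len(vectors)) if i != test_index]
    others.sort(key=lambda i: cosine_similarity(vectors[test_index], vectors[i]), reverse=True)
    return others[:k]


def random_q(test_index, n_questions, k, seed=0):
    """Random-Q-CoT style: k other questions sampled uniformly at random."""
    others = [i for i in range(n_questions) if i != test_index]
    return random.Random(seed).sample(others, k)


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
similar = retrieval_q(0, vectors, k=2)  # the two nearest neighbours of question 0
sampled = random_q(0, len(vectors), k=2)
```

Retrieval always returns the test question's nearest neighbours, so if that neighbourhood is a frequent-error region, every demonstration is drawn from it; random sampling has no such bias.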

Clustering with k-means and taking one representative per cluster is, in this view, an explicit step away from similarity sampling and toward diversity sampling. Even if all wrong demonstrations fall into the same frequent-error cluster, sampling exactly one question from each of k clusters yields at most one bad demonstration out of k; the paper argues this is the structural reason Auto-CoT survives Zero-Shot-CoT's imperfections.^[1]

Algorithm summary

Stage	Step	Operation
1	Encode	Compute Sentence-BERT vector for each question in Q
1	Cluster	Run k-means on the question vectors into k clusters
1	Order	Sort each cluster's members by distance to the cluster center, ascending
2	Generate	For each cluster, run Zero-Shot-CoT on the closest candidate to obtain rationale + answer
2	Filter	Accept the demonstration if the question is at most 60 tokens and the rationale has at most 5 steps; else move to the next candidate
2	Assemble	Concatenate k accepted demonstrations and append the test question with the "Let's think step by step" trigger

Source: Algorithms 1 and 2 of Zhang et al. (2022).^[1]

How does Auto-CoT compare to manual CoT on benchmarks?

The Auto-CoT paper evaluates the method on ten reasoning benchmarks spanning three task categories, with the public GPT-3 model (text-davinci-002) as the default LLM.^[1] The arithmetic reasoning tasks are MultiArith, GSM8K, AddSub, AQUA-RAT, SingleEq, and SVAMP. The commonsense reasoning tasks are CommonsenseQA (CSQA) and StrategyQA. The symbolic reasoning tasks are Last Letter Concatenation and Coin Flip.^[1]

The headline result is that Auto-CoT matches or beats Manual-CoT on all ten datasets, frequently by a small margin.^[1]

Method	MultiArith	GSM8K	AddSub	AQuA	SingleEq	SVAMP	CSQA	StrategyQA	Letter	Coin
Zero-Shot	22.7	12.5	77.0	22.4	78.7	58.8	72.6	54.3	0.2	53.8
Zero-Shot-CoT	78.7	40.7	74.7	33.5	78.7	63.7	64.6	54.8	57.6	91.4
Few-Shot	33.8	15.6	83.3	24.8	82.7	65.7	79.5	65.9	0.2	57.2
Manual-CoT	91.7	46.9	81.3	35.8	86.6	68.9	73.5	65.4	59.0	97.2
Auto-CoT	92.0	47.9	84.8	36.5	87.0	69.5	74.4	65.4	59.7	99.9

All numbers are accuracy in percent on the standard test split for each dataset, with Auto-CoT results averaged over three random runs. Zero-Shot and Zero-Shot-CoT numbers are reproduced from Kojima et al. (2022); Few-Shot and Manual-CoT numbers are reproduced from Wei et al. (2022).^[1] On the StrategyQA benchmark Auto-CoT ties Manual-CoT at 65.4%; on CommonsenseQA both Auto-CoT and Manual-CoT lag the Few-Shot baseline, an effect the paper attributes to the comparatively weak performance of CoT on multiple-choice commonsense reasoning at the GPT-3 scale.^[1]
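To put the size of the margins in perspective, the per-dataset difference between the Auto-CoT and Manual-CoT rows can be computed directly from the table. The snippet only re-uses the numbers shown above; the variable names are illustrative.

```python
datasets = ["MultiArith", "GSM8K", "AddSub", "AQuA", "SingleEq",
            "SVAMP", "CSQA", "StrategyQA", "Letter", "Coin"]
manual_cot = [91.7, 46.9, 81.3, 35.8, 86.6, 68.9, 73.5, 65.4, 59.0, 97.2]
auto_cot = [92.0, 47.9, 84.8, 36.5, 87.0, 69.5, 74.4, 65.4, 59.7, 99.9]

# Auto-CoT minus Manual-CoT, in accuracy points, rounded to one decimal.
margins = {name: round(a - m, 1) for name, a, m in zip(datasets, auto_cot, manual_cot)}
mean_margin = round(sum(margins.values()) / len(margins), 2)
largest = max(margins, key=margins.get)

print(margins)
print("mean margin:", mean_margin, "| largest:", largest)
```

The margins range from 0.0 (StrategyQA) to 3.5 points (AddSub), with a mean of about 1.1 points across the ten datasets.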

Does Auto-CoT also work with Codex?

To test that the result is not specific to the InstructGPT family, the paper also evaluates Codex (code-davinci-002) as the underlying LLM on three arithmetic datasets. Auto-CoT remains competitive with Manual-CoT in this setting, outperforming it on GSM8K (62.8% vs 59.4%) and AddSub (91.9% vs 84.6%) while trailing slightly on MultiArith (93.2% vs 96.8%).^[1]

The paper also reports an "Effect of Wrong Demonstrations" ablation in which a fraction of the eight demonstrations are deliberately replaced with incorrect Zero-Shot-CoT outputs. Diversity-based Auto-CoT remains robust up to roughly 50% wrong demonstrations on MultiArith, while an "In-Cluster Sampling" baseline that draws all demonstrations from the same cluster as the test question degrades much more quickly. This is the empirical evidence the paper uses to argue that diversity, not exemplar correctness alone, drives performance.^[1]

Streaming setting

The paper also describes a bootstrapped variant, Auto-CoT*, for a streaming setting in which test questions arrive in small batches. The bootstrapped variant uses Zero-Shot-CoT on the first batch (which is too small to cluster meaningfully), accumulates question/rationale pairs in a memory, and then applies the standard Auto-CoT clustering and sampling to that memory for subsequent batches. By the second batch on MultiArith (with batch size m=30), Auto-CoT* performs comparably to Manual-CoT.^[1]
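The control flow of the streaming variant can be outlined as follows. This is a sketch under stated assumptions, not the paper's code: zero_shot_cot, select_demonstrations, and answer_with_demos are hypothetical callables standing in for the LLM call, the standard clustering-and-sampling procedure, and few-shot inference, and the sketch assumes that the outputs of every batch are added to the memory.

```python
def auto_cot_streaming(batches, zero_shot_cot, select_demonstrations, answer_with_demos):
    """Bootstrapped streaming loop: Zero-Shot-CoT first, demonstrations from memory afterwards."""
    memory = []   # accumulated (question, rationale, answer) triples
    outputs = []  # one list of answers per batch
    for batch_index, batch in enumerate(batches):
        if batch_index == 0:
            # Nothing to cluster yet, so fall back to Zero-Shot-CoT.
            results = [zero_shot_cot(q) for q in batch]
        else:
            # Build demonstrations from everything seen so far.
            demos = select_demonstrations(memory)
            results = [answer_with_demos(demos, q) for q in batch]
        memory.extend((q, r, a) for q, (r, a) in zip(batch, results))
        outputs.append([a for _, a in results])
    return outputs, memory
```

Each of the two answering callables is expected to return a (rationale, answer) pair, so the memory keeps growing with material that later batches can cluster.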

Which datasets does the evaluation use?

The paper's Appendix B.1 documents the dataset sizes and answer formats for the ten benchmarks:^[1]

Dataset	Samples	Avg. words	Answer format	License
MultiArith	600	31.8	Number	Unspecified
AddSub	395	31.5	Number	Unspecified
GSM8K	1319	46.9	Number	MIT
AQUA	254	51.9	Multiple choice	Apache 2.0
SingleEq	508	27.4	Number	None
SVAMP	1000	31.8	Number	MIT
CSQA	1221	27.8	Multiple choice	Unspecified
StrategyQA	2290	9.6	Yes/No	Apache 2.0
Last Letters	500	15.0	String	Unspecified
Coin Flip	500	37.0	Yes/No	Unspecified

GSM8K is the largest arithmetic benchmark in the suite and the most widely cited; it requires multi-step arithmetic reasoning on grade-school math word problems.^[1] StrategyQA, with 2,290 questions, is the largest benchmark overall and requires multi-hop implicit reasoning.^[1] All evaluations use greedy decoding (temperature=0) with max_tokens=256 via the OpenAI API.^[1]

Ablations of selection criteria

Auto-CoT's Appendix C reports several ablations of the selection criteria.^[1] When restricted to demonstrations whose Zero-Shot-CoT answer is correct (so the "wrong demonstration" confound is removed), the choice of where to draw from inside a cluster matters:

In-cluster selection	MultiArith accuracy (%)
In-Cluster Min Dist (Auto-CoT: closest to center)	93.7
In-Cluster Random	89.2
In-Cluster Max Dist	88.7

In other words, taking the most "prototypical" question for each cluster (closest to the centroid) is a small but consistent improvement over sampling randomly inside the cluster, and a non-trivial improvement over choosing the outliers.^[1]

A separate appendix ablation looks at which of the three demonstration components (question, rationale, answer) is most sensitive to corruption. Shuffling the questions across demonstrations reduces accuracy from 91.7% to 73.8%, while shuffling the rationales drops it to 43.8% and shuffling the answers to 17.0%. The asymmetry indicates that the rationale-answer mapping carries most of the in-context learning signal in CoT settings, which is the basis for the heuristic that demonstrations must have an answer that actually appears inside the rationale.^[1] For arithmetic tasks except AQUA (which is multiple-choice), Auto-CoT applies this additional filter: the extracted answer a_j^(i) must be non-empty and must appear inside the rationale r_j^(i) for the demonstration to be accepted.^[1]
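The additional answer filter amounts to a substring check. A minimal sketch, with the hypothetical function name answer_supported_by_rationale:

```python
def answer_supported_by_rationale(rationale, answer):
    """Accept only if the extracted answer is non-empty and occurs in the rationale."""
    answer = answer.strip()
    return answer != "" and answer in rationale


ok = answer_supported_by_rationale("She has 3 + 4 = 7 apples.", "7")
missing = answer_supported_by_rationale("She has 3 + 4 apples.", "7")
empty = answer_supported_by_rationale("She has 3 + 4 = 7 apples.", "  ")
```

A plain substring test is deliberately loose: the answer "7" would also match a rationale that only mentions "17", so a stricter implementation might compare whole numbers instead.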

How is Auto-CoT implemented?

The official implementation of Auto-CoT is hosted on GitHub at amazon-science/auto-cot, under the Amazon Science organization, with an Apache 2.0 license.^[3] The repository's README summarizes the contribution in one line: "Auto-CoT uses more cheers & diversity to SAVE huge manual efforts in chain of thought prompt design, matching or even exceeding performance of manual design on GPT-3."^[3] It is the reference codebase cited from the paper, and includes:

run_demo.py, which builds the demonstrations for a given task by clustering questions, generating Zero-Shot-CoT rationales, and writing the selected demonstrations to disk; and
run_inference.py, which runs in-context inference on the test set using a saved set of demonstrations.^[3]

The repository also ships a Jupyter notebook try_cot.ipynb (and a Google Colab version try_cot_colab.ipynb) that walks through Auto-CoT end-to-end on a small example.^[3] Dependencies declared in the README include Python 3.8 or newer and PyTorch 1.8.2, and the test datasets are pulled from the earlier Zero-Shot-CoT repository.^[3] The default clustering uses Sentence-BERT through the sentence-transformers library; the README points users to the original Reimers and Gurevych (2019) implementation that the paper cites for question encoding.^[1]^[3]

The README frames the empirical claim as matching or exceeding manually designed prompts on GPT-3 (text-davinci-002) across the ten benchmarks above.^[3] The official BibTeX entry for the work is:

@inproceedings{zhang2023automatic,
  title={Automatic Chain of Thought Prompting in Large Language Models},
  author={Zhang, Zhuosheng and Zhang, Aston and Li, Mu and Smola, Alex},
  booktitle={The Eleventh International Conference on Learning Representations (ICLR 2023)},
  year={2023}
}

This citation is reproduced verbatim from the project README.^[3]

Why is Auto-CoT significant?

Auto-CoT is one of the earliest works to push back on the assumption that few-shot CoT demonstrations must be hand-authored, and on the related assumption that the best demonstrations are the ones most semantically similar to the query. The paper's analysis that similarity-based retrieval can actively hurt in-context learning when the underlying rationales are imperfect has been picked up by later prompting research as a general design principle: maintain diversity among in-context demonstrations, even at the cost of relevance to the specific query.^[5]^[6]

The paper sits in a cluster of late-2022 work on automating or stabilizing CoT, alongside self-consistency decoding (which samples many CoT chains and takes a majority vote) and zero-shot CoT (which proposes the "Let's think step by step" trigger).^[1] Auto-CoT differs in that it manipulates the few-shot demonstrations themselves rather than the decoding process, and explicitly couples a clustering step to a generation step.

Outside of the academic literature, Auto-CoT is now a standard entry in prompt-engineering reference materials: Prompt Engineering Guide and Learn Prompting both list Automatic Chain of Thought as a named CoT variant alongside Zero-Shot-CoT, Manual-CoT, and self-consistency, citing the Zhang et al. (2022) paper as the source.^[5]^[6] KDnuggets covered the method in a July 2023 explainer, emphasizing the "treat one LLM as a demonstration generator and another as an inference engine" framing and the diversity-over-similarity insight.^[7]

Connection to "diversity matters" as a design principle

The methodological observation that diversity, not similarity, should drive demonstration selection has shown up repeatedly in subsequent prompting research. Auto-CoT's experiments on the "frequent-error cluster" phenomenon, where a single cluster on the MultiArith dataset had a Zero-Shot-CoT error rate of 52.3% versus a 21.3% overall rate, give a mechanistic explanation: if the in-context examples are all retrieved from one neighborhood of a model's failure modes, the prompt becomes a self-reinforcing trap.^[1] Clustering questions and taking one representative per cluster bounds this risk because a single bad demonstration in eight (k=8) is empirically tolerable. The paper's appendix shows that even at 50% wrong demonstrations on MultiArith, the diversity-based variant degrades only modestly, while an "In-Cluster Sampling" variant that draws all demonstrations from the same cluster degrades sharply.^[1]

This robustness story is a direct extension of earlier in-context learning analyses (Min et al., 2022; Lu et al., 2022) that documented the heavy sensitivity of few-shot prompts to ordering and exemplar choice, but Auto-CoT generalizes those findings to the CoT setting where the demonstrations contain rationales as well as labels and where the demonstrations are generated rather than retrieved.^[1] The Min et al. (2022) result that even incorrect labels in few-shot prompts only marginally hurt classification accuracy is explicitly contrasted with Auto-CoT's findings: in CoT settings, mistakes in either the question-to-rationale mapping or the rationale-to-answer mapping cause a much larger performance drop than the analogous corruption in flat classification.^[1]

What are the limitations of Auto-CoT?

The paper itself acknowledges several limitations of Auto-CoT:^[1]

Clustering is only meaningful at sufficient scale. In the streaming setting where the first batch is small, Auto-CoT degenerates to Zero-Shot-CoT because there is not yet enough data to form k meaningful clusters. The bootstrapped Auto-CoT* variant addresses this only partially, requiring accumulation across batches.
Selection heuristics are simple. The 60-token / 5-step rule is hand-set, follows precedent from Wei et al. (2022), and is not learned. The paper makes no claim that these particular thresholds are optimal.
It still depends on Zero-Shot-CoT being a "decent" reasoner. If the underlying LLM does not produce a reasonable rationale most of the time in response to "Let's think step by step," there is no number of demonstrations Auto-CoT can construct that will succeed; the method exploits, rather than replaces, the model's zero-shot reasoning ability.

External commentary has pointed to two further limitations. First, Auto-CoT's clustering uses Sentence-BERT, which was trained on general-domain text; on highly specialized reasoning domains the embedding may not be a good proxy for "diverse skills." Second, the gains over Manual-CoT in the original paper are typically a few accuracy points and are within the variability of in-context learning runs on some benchmarks, so the strongest practical argument for Auto-CoT is convenience and reproducibility rather than raw accuracy.^[5]^[7] Subsequent work such as Automate-CoT (Shum et al., 2023) and Reprompting (Xu et al., 2023) explicitly target the variance and exemplar-selection issues that Auto-CoT leaves open.^[8]^[9]

What methods built on Auto-CoT?

The "automatic exemplar selection for CoT" line of work that Auto-CoT helped open includes a number of follow-up papers:

Automate-CoT (Shum, Diao, and Zhang, 2023) proposes an automatic augmentation and selection procedure that, like Auto-CoT, removes manual writing of exemplars, but uses labeled data when available and a variance-reduced selection criterion to choose the final demonstration set.^[8]
Reprompting (Xu et al., 2023) treats prompt construction as a Gibbs sampling problem and iteratively refines the demonstration set, reporting accuracy gains over Auto-CoT on several arithmetic and commonsense benchmarks.^[9]

Both of these works cite Auto-CoT as the baseline against which automatic demonstration construction methods are now measured.^[8]^[9] Reprompting in particular reports an average accuracy improvement of 9.4 absolute points over human-written CoT prompts and 11 to 33 points over Auto-CoT on its evaluated tasks, framing its contribution as inheriting Auto-CoT's "no manual exemplars" stance while reducing the variance in demonstration quality.^[9] Automate-CoT explicitly cites Auto-CoT's prompt sensitivity issue as the motivation for its variance-reduction approach.^[8]

Auto-CoT's diversity-sampling idea has also been carried over to multimodal reasoning: a follow-up effort from the same Amazon Science group, amazon-science/mm-cot, extends CoT to vision-language inputs and reuses the demonstration-construction philosophy of Auto-CoT in the multimodal setting.^[3] More broadly, the paper has been cited as inspiration for retrieval-augmented prompting methods that use clustering for diversity rather than top-k similarity for relevance, including hybrid pipelines that combine cluster-based selection with retrieval reranking.^[5]

How does Auto-CoT compare with related prompting techniques?

Method	Demonstrations	Trigger	Selection mechanism
Zero-Shot prompting	none	"The answer is"	none
Zero-Shot-CoT	none	"Let's think step by step"	none
Few-Shot prompting	hand-written	none	hand-picked
Manual-CoT	hand-written rationales	none (reasoning is shown in the demonstrations)	hand-picked
Retrieval-Q-CoT (ablation)	Zero-Shot-CoT rationales	"Let's think step by step"	top-k cosine similarity
Random-Q-CoT (ablation)	Zero-Shot-CoT rationales	"Let's think step by step"	random
Auto-CoT	Zero-Shot-CoT rationales	"Let's think step by step"	k-means cluster + closest-to-center + heuristics

Definitions of the baselines are drawn from Zhang et al. (2022).^[1]

References

1. Zhuosheng Zhang, Aston Zhang, Mu Li, Alex Smola, "Automatic Chain of Thought Prompting in Large Language Models", arXiv, 2022-10-07. https://arxiv.org/abs/2210.03493. Accessed 2026-06-28.
2. ICLR, "Automatic Chain of Thought Prompting in Large Language Models (Poster)", ICLR 2023 virtual proceedings, 2023-05-01. https://iclr.cc/virtual/2023/poster/11360. Accessed 2026-06-28.
3. Amazon Science, "amazon-science/auto-cot (README)", GitHub, 2023-02-01. https://github.com/amazon-science/auto-cot. Accessed 2026-06-28.
4. Zhuosheng Zhang, "Homepage", BCMI Lab, Shanghai Jiao Tong University, 2024-09-01. https://bcmi.sjtu.edu.cn/home/zhangzs/. Accessed 2026-06-28.
5. Elvis Saravia et al., "Chain-of-Thought Prompting (Automatic Chain-of-Thought)", Prompt Engineering Guide, 2024-01-15. https://www.promptingguide.ai/techniques/cot. Accessed 2026-06-28.
6. Learn Prompting, "Automatic Chain of Thought (Auto-CoT)", Learn Prompting documentation, 2024-03-20. https://learnprompting.org/docs/advanced/thought_generation/automatic_chain_of_thought. Accessed 2026-06-28.
7. Matthew Mayo, "Automating the Chain of Thought: How AI Can Prompt Itself to Reason", KDnuggets, 2023-07-17. https://www.kdnuggets.com/2023/07/automating-chain-of-thought-ai-prompt-itself-reason.html. Accessed 2026-06-28.
8. KaShun Shum, Shizhe Diao, Tong Zhang, "Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data", arXiv, 2023-02-24. https://arxiv.org/abs/2302.12822. Accessed 2026-06-28.
9. Weijia Xu, Andrzej Banburski-Fahey, Nebojsa Jojic, "Reprompting: Automated Chain-of-Thought Prompt Inference Through Gibbs Sampling", arXiv, 2023-05-17. https://arxiv.org/abs/2305.09993. Accessed 2026-06-28.
