Least-to-Most Prompting

Updated Jun 7, 2026

Least-to-Most Prompting is a few-shot prompting technique for large language models introduced by researchers at Google Brain in May 2022. The method addresses a key weakness of chain-of-thought prompting: the tendency to fail on test problems that are structurally harder than the examples shown in the prompt.^[1] Least-to-most prompting operates in two stages, first using the model to decompose a complex problem into an ordered list of easier sub-problems (problem reduction), then solving those sub-problems sequentially with each answer appended to the context used to solve the next one (problem solving).^[1] On the SCAN compositional generalization benchmark, the GPT-3 code-davinci-002 model reached at least 99% accuracy in every split (including the length split) using just 14 exemplars, compared to about 16% with chain-of-thought prompting.^[1]^[2] The paper was first posted to arXiv on 21 May 2022 and accepted as a poster at ICLR 2023.^[1]^[3]

Background

By early 2022, chain-of-thought prompting (Wei et al.) had shown that prepending a few worked-out reasoning traces to a question could elicit multi-step reasoning from sufficiently large language models without any fine-tuning.^[4] Despite headline gains on GSM8K and similar benchmarks, follow-up analyses noted a recurring failure mode: when the test problems required more reasoning steps, longer command strings, or deeper compositional structure than any of the exemplars, accuracy collapsed.^[1]^[2] The SCAN benchmark, designed by Lake and Baroni to test compositional generalization in sequence-to-sequence models, was a particularly stark example: even very large models struggled on its length split, where the test set contains command sequences longer than anything seen during training.^[2]^[5]

The Least-to-Most Prompting paper, titled "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models", was written by Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi, all affiliated with Google Research and the Google Brain team.^[1]^[6] Denny Zhou, who later founded and led the Reasoning Team at Google Brain (subsequently folded into the Gemini effort at Google DeepMind), described least-to-most as one of four pillars of his reasoning research program, alongside chain-of-thought, self-consistency, and instruction fine-tuning.^[6]

The paper was first uploaded to arXiv as 2205.10625v1 on 21 May 2022, revised on 6 October 2022 (v2), and reached its final v3 on 16 April 2023.^[1] It was accepted to ICLR 2023 as a poster in the NLP applications track.^[3] The method draws on a long tradition in problem-solving heuristics (decomposing a hard problem into easier sub-problems is a staple of Polya-style mathematical pedagogy), but its specific contribution is showing that a sufficiently capable LLM can be prompted to perform the decomposition itself, with no human supervision beyond the in-context exemplars.^[1]

How It Works

Least-to-most prompting separates reasoning into two prompting stages run against the same model. Both stages are pure few-shot prompting; no weights are updated.

Stage 1: Problem Reduction

The first prompt contains a small number of demonstrations in which a complex problem is rewritten as an ordered list of easier sub-problems, with the original problem as the last item.^[1]^[7] For a math word problem about how long it takes to make and eat a pizza, the decomposition might be "(1) How long does it take to bake the pizza? (2) How long does it take to cool the pizza? (3) How long does the whole task take?", with the third sub-problem being the original question.^[7] At test time the new question is appended and the model is asked to generate its own decomposition. Crucially, the decomposition is only structural: it lists the sub-questions in order and does not contain any answers.^[1]
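The reduction stage can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's prompt: the exemplar wording, the `Q:` / `Decomposition:` layout, and the function names `build_reduction_prompt` and `parse_decomposition` are all assumptions made for this example.

```python
import re

# Hypothetical exemplar: a problem paired with its ordered sub-problems.
# The last sub-problem restates the original question.
EXEMPLARS = [
    (
        "How long does it take to make a pizza that is ready to eat?",
        [
            "How long does it take to bake the pizza?",
            "How long does it take to cool the pizza?",
            "How long does the whole task take?",
        ],
    ),
]


def build_reduction_prompt(question, exemplars=EXEMPLARS):
    """Assemble a few-shot prompt that asks for a decomposition, not answers."""
    parts = []
    for problem, subs in exemplars:
        numbered = " ".join(f"({i}) {s}" for i, s in enumerate(subs, start=1))
        parts.append(f"Q: {problem}\nDecomposition: {numbered}")
    # The new question goes last, with the decomposition left for the model.
    parts.append(f"Q: {question}\nDecomposition:")
    return "\n\n".join(parts)


def parse_decomposition(completion):
    """Split a completion like '(1) ... (2) ...' into ordered sub-questions."""
    pieces = re.split(r"\(\d+\)", completion)
    return [p.strip() for p in pieces if p.strip()]
```

The parser assumes the model mirrors the numbered format of the exemplars; a production system would need to handle completions that do not.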

Stage 2: Problem Solving

The second stage iterates through the decomposition produced by the first. At each step the model is given a prompt that contains: a few worked examples of solving similarly small sub-problems, the question being decomposed, all previously solved sub-questions together with their answers, and finally the current sub-question.^[7]^[8] The answer the model produces is appended to the context, and the loop advances to the next sub-question. The final sub-question is the original problem, so its answer is the answer to the task. Because each step only adds a small amount of new reasoning on top of explicitly stated prior results, the model never has to reason about more than one additional step at a time.^[1]^[8]
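The solving loop described above can be sketched as follows. This is a minimal sketch under stated assumptions: `decompose` and `solve` are hypothetical callables standing in for model calls (no particular API is implied), and the `Problem:` / `Q:` / `A:` prompt layout is chosen for illustration rather than taken from the paper.

```python
from typing import Callable, List, Tuple


def build_solve_prompt(
    exemplars: str,
    question: str,
    solved: List[Tuple[str, str]],
    current: str,
) -> str:
    """Build one solving prompt: exemplars, problem, prior Q/A pairs, current Q."""
    lines = []
    if exemplars:
        lines.append(exemplars)
    lines.append(f"Problem: {question}")
    for sub, answer in solved:
        lines.append(f"Q: {sub}\nA: {answer}")
    lines.append(f"Q: {current}\nA:")
    return "\n\n".join(lines)


def least_to_most(
    question: str,
    decompose: Callable[[str], List[str]],  # stage 1: hypothetical model call
    solve: Callable[[str], str],            # stage 2: hypothetical model call
    solve_exemplars: str = "",
) -> Tuple[str, List[Tuple[str, str]]]:
    """Run both stages and return (final answer, list of solved sub-questions)."""
    sub_questions = decompose(question)
    solved: List[Tuple[str, str]] = []
    for sub in sub_questions:
        prompt = build_solve_prompt(solve_exemplars, question, solved, sub)
        answer = solve(prompt).strip()
        # Each answer is appended so the next prompt can build on it.
        solved.append((sub, answer))
    # The last sub-question is the original problem, so its answer is final.
    final_answer = solved[-1][1] if solved else ""
    return final_answer, solved
```

Passing the model calls in as plain callables keeps the loop independent of any provider and makes it easy to test with stubs.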

Comparison with Chain-of-Thought

Property	Chain-of-Thought	Least-to-Most
Number of prompts per query	1	1 decomposition prompt plus 1 solving prompt per sub-problem
Sub-problems explicitly listed	No, intermediate steps emerge inline	Yes, generated in stage 1
Intermediate answers re-fed as context	No, single generation	Yes, appended after each step
Robustness when test items exceed exemplar difficulty	Drops sharply	Generalizes to harder items
Typical extra inference cost	None	One decomposition call plus one call per sub-problem

The structural difference is that the running solution is recycled back into the prompt at every step, so each prompt asks the model for only a small increment of work beyond what is already known.^[1]^[7] The name itself is borrowed from educational psychology, where least-to-most prompting denotes a progressive sequence of prompts used to help a student learn a new skill.^[1] Zero-shot chain-of-thought (Kojima et al.'s "Let's think step by step") shares the goal of cheaper reasoning prompts but still solves the whole problem in a single generation, so it inherits the same easy-to-hard cliff that motivated least-to-most.^[8]

Evaluation

The paper evaluates least-to-most prompting on four families of tasks: symbolic manipulation (last-letter concatenation), compositional generalization (SCAN), reading comprehension over paragraphs requiring numerical reasoning (DROP), and grade-school math word problems (GSM8K).^[1]^[9] Most experiments use GPT-3-family models, principally code-davinci-002 (the Codex model branched from GPT-3) and text-davinci-002.^[1]^[9]

SCAN compositional generalization

SCAN tests whether a model that has seen short navigation-style commands (e.g. "jump twice and walk thrice") can produce the correct action sequence for longer or more deeply nested commands.^[5] In the paper's length split, where test commands are systematically longer than any training command, code-davinci-002 with least-to-most prompting reaches 99.7% accuracy using only 14 exemplars.^[1]^[2] Chain-of-thought prompting on the same model and the same exemplars achieves about 16.2% on the length split.^[2] With text-davinci-002, standard prompting on SCAN reaches roughly 6% while least-to-most rises to about 76%, again with no fine-tuning.^[7] For context, the prior specialized neural-symbolic systems that approached perfect SCAN accuracy were trained on the full SCAN training set of more than 15,000 examples.^[1]
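To make the task concrete, the sketch below interprets a small subset of SCAN-style commands. It is an illustration only: it covers four primitive verbs, the modifiers "twice" and "thrice", and a single "and" or "after" conjunction, and it omits the directional modifiers of the full benchmark. The `I_` action names follow the convention of the SCAN dataset; the function names are assumptions.

```python
# Simplified subset of the SCAN command language (assumption: no directions).
ACTIONS = {"jump": "I_JUMP", "walk": "I_WALK", "run": "I_RUN", "look": "I_LOOK"}
REPEATS = {"twice": 2, "thrice": 3}


def interpret_clause(clause):
    """Translate 'verb' or 'verb twice/thrice' into a list of actions."""
    words = clause.split()
    actions = [ACTIONS[words[0]]]
    if len(words) > 1:
        actions = actions * REPEATS[words[1]]
    return actions


def interpret(command):
    """Translate a command with at most one 'and' / 'after' conjunction."""
    if " and " in command:
        first, second = command.split(" and ", 1)
        return interpret(first) + interpret(second)
    if " after " in command:
        # 'x after y' means: do y first, then x.
        first, second = command.split(" after ", 1)
        return interpret(second) + interpret(first)
    return interpret_clause(command)


print(interpret("jump twice and walk thrice"))
# ['I_JUMP', 'I_JUMP', 'I_WALK', 'I_WALK', 'I_WALK']
```

Longer commands produce longer action sequences, which is why a length split isolates the ability to generalize beyond the demonstrated examples.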

Last-letter concatenation

Given a list of words, the model must output the string formed by taking the last letter of each word. With four words, chain-of-thought is roughly competitive, but its accuracy falls quickly as the list grows. The paper reports chain-of-thought accuracy near 38% on 8-word lists, while least-to-most maintains substantially higher accuracy as length increases by reducing each step to appending one letter to a previously concatenated running answer.^[7]^[8]
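The incremental structure of this task can be illustrated with a short Python sketch. It computes the ground-truth answer and the chain of sub-problems a least-to-most prompt would walk through; it does not call a model. The function names are hypothetical, and starting the chain from a single word is a simplification made for this example.

```python
def last_letter_concat(words):
    """Ground truth: join the last letter of every word."""
    return "".join(w[-1] for w in words)


def least_to_most_steps(words):
    """Return (sub-list, running answer) pairs, one new letter per step."""
    steps = []
    running = ""
    for i, word in enumerate(words, start=1):
        # Each step reuses the previous answer and appends a single letter.
        running += word[-1]
        steps.append((words[:i], running))
    return steps


for sub_list, answer in least_to_most_steps(["think", "machine", "learning"]):
    print(", ".join(sub_list), "->", answer)
# think -> k
# think, machine -> ke
# think, machine, learning -> keg
```

Each step depends only on the previous running answer and one new word, so the per-step difficulty stays constant no matter how long the list becomes.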

Numerical reasoning on DROP

On DROP, a reading-comprehension benchmark requiring discrete operations over passages, least-to-most outperforms chain-of-thought by a sizeable margin, especially on football game and history passages, where the most useful decompositions enumerate the individual events to be counted or aggregated before the final arithmetic.^[1]^[9] The improvement is more pronounced when test problems require more sub-steps than the few-shot exemplars demonstrated.^[9]

Math word problems on GSM8K

On GSM8K grade-school math, least-to-most matches chain-of-thought on items requiring a small number of reasoning steps and pulls ahead on the harder subset that requires roughly five or more steps.^[7]^[8] The paper reports the relative gap widens as the required number of intermediate calculations grows, again consistent with the easy-to-hard generalization narrative.^[1]^[7]

Aggregate finding

Across all four task families, the dominant pattern is that least-to-most prompting and chain-of-thought perform similarly when the test items are no harder than the prompted exemplars, while least-to-most retains accuracy as items get harder.^[1]^[7] The 99.7% result on SCAN is the most striking instance because the length split was specifically constructed so that test commands are longer than the examples available beforehand.^[1]^[2]

Significance

Least-to-Most Prompting is one of the canonical entries in the early "wave" of prompt engineering techniques aimed at unlocking reasoning in large language models without fine-tuning. Alongside chain-of-thought, self-consistency, Step-Back Prompting, Auto-CoT, ReAct, and Tree of Thoughts, it represents a family of techniques that use the model itself to structure its own computation.^[6]^[10] In particular, it was an early and concrete demonstration of a recursive idea now common in agentic systems: ask the model to plan, then execute the plan step-by-step using the same model.^[1]^[7]

The technique also influenced subsequent compositional reasoning work. Drozdov et al.'s ICLR 2023 paper "Compositional Semantic Parsing with Large Language Models" builds directly on least-to-most prompting, using a syntactic-parse decomposition stage and dynamic exemplar selection to set new state-of-the-art numbers on the CFQ semantic-parsing benchmark with only about 1% of the training data used by prior approaches.^[10] Within the broader in-context learning literature, the paper is frequently cited as a demonstration that easy-to-hard generalization, long considered a major weakness of sequence models, can be partially bridged by structuring the prompt rather than scaling the model.^[1]^[11]

Adoption

After the paper appeared, "least-to-most" became one of the standard entries in tutorial-style guides on prompt engineering and was incorporated into general references such as the Learn Prompting documentation.^[7] The approach is closely related to, and often combined with, self-consistency decoding and ReAct-style tool use. In agentic frameworks the decomposition stage corresponds to a "planner" call and the iterative solving stage corresponds to "worker" calls that may also invoke external tools, code interpreters, or retrieval.^[7]^[8] The technique remains a useful baseline in compositional generalization, complex math word problems, and reading comprehension where problems can be decomposed before solving.^[9]^[11]

In Denny Zhou's later talks on teaching language models to reason, he framed least-to-most as the explicit-decomposition analogue of chain-of-thought, with chain-of-thought relying on implicit decomposition inside a single generation and least-to-most exposing the decomposition explicitly so the model can recurse on it.^[6]

Limitations

The paper and follow-up work identify several limitations.

Decomposition prompts are task-specific. The few-shot demonstrations of how to decompose a problem typically transfer poorly across domains; a decomposition recipe tuned for last-letter concatenation will not work for math word problems, and one tuned for SCAN will not work for DROP.^[7]^[9]
Errors cascade. Because each sub-problem's answer is fed into the prompt for the next step, an early arithmetic or parsing mistake can corrupt every subsequent step. Unlike chain-of-thought with self-consistency sampling, the iterative structure makes simple majority voting harder to apply directly.^[7]
Decomposition can itself fail. Smaller or non-instruction-tuned models sometimes fail to produce a well-formed decomposition at all, which makes the method dependent on the capability of the underlying model. The paper's strongest results rely on code-davinci-002, a Codex-family model that is no longer offered through the OpenAI API.^[1]^[9]
Inference cost is higher. Two prompt evaluations per query are required at minimum, and the solving stage requires one inference per sub-problem, which can become expensive for long decompositions.^[7]
Some benchmarks favor it more than others. On problems that genuinely require only a single short reasoning trace, least-to-most adds latency without improving accuracy, and chain-of-thought is preferable.^[1]^[7]
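The cost and error-cascade points can be made concrete with a back-of-the-envelope model. The sketch below is not from the paper: it assumes one decomposition call plus one solving call per sub-problem, and it assumes each step succeeds independently with the same probability, which is a deliberate simplification.

```python
def model_calls(num_sub_problems: int) -> int:
    """One decomposition call plus one solving call per sub-problem."""
    return 1 + num_sub_problems


def chained_accuracy(per_step_accuracy: float, num_sub_problems: int) -> float:
    """Chance that every step is right, assuming independent, equal steps."""
    return per_step_accuracy ** num_sub_problems


# Example: five sub-problems, each solved correctly 95% of the time.
print(model_calls(5))                        # 6
print(round(chained_accuracy(0.95, 5), 3))   # 0.774
```

Under these assumptions a single weak step dominates the outcome, which is why long decompositions are both more expensive and more fragile than short ones.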

Related Techniques

Least-to-most prompting belongs to a cluster of in-context learning techniques that explicitly structure the reasoning process.

Technique	Decomposition style	Iteration	Year
Chain-of-Thought	Implicit, inline	Single generation	2022
Zero-shot CoT	Implicit, triggered by "let's think step by step"	Single generation	2022
Least-to-Most	Explicit list of sub-problems	One prompt per sub-problem	2022
Self-consistency	Same as CoT	Multiple samples, majority vote	2022
Auto-CoT	Implicit, but clustered exemplars chosen automatically	Single generation	2022
Step-Back Prompting	Abstraction first, then solve	Two prompts	2023
Self-Refine	Critique and revise the same answer	Multi-pass	2023
ReAct	Interleaved reason / act	Multi-pass with tools	2022
Tree of Thoughts	Tree search over partial solutions	Many evaluations	2023

Compared to chain-of-thought, least-to-most makes the decomposition explicit and replays prior answers; compared to Tree of Thoughts, it keeps a single linear plan rather than branching; compared to ReAct, it does not interleave external tool calls.^[1]^[7]^[11]

References

1. Zhou, Denny et al., "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models", arXiv, 2022-05-21 (v1) / 2023-04-16 (v3). https://arxiv.org/abs/2205.10625. Accessed 2026-05-21.
2. Semantic Scholar entry, "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models" (Zhou, Schärli et al.), Semantic Scholar, 2022. https://www.semanticscholar.org/paper/Least-to-Most-Prompting-Enables-Complex-Reasoning-Zhou-Scharli/5437e8adab596d7294124c0e798708e050e25321. Accessed 2026-05-21.
3. OpenReview, "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models", ICLR 2023 Conference, 2023-02-01. https://openreview.net/forum?id=WZH7099tgfM. Accessed 2026-05-21.
4. Wei, Jason et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", arXiv, 2022-01-28. https://arxiv.org/abs/2201.11903. Accessed 2026-05-21.
5. Lake, Brenden and Baroni, Marco, "Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks", arXiv, 2017-10-31. https://arxiv.org/abs/1711.00350. Accessed 2026-05-21.
6. Denny Zhou, "Denny Zhou's Home Page" (research bio, Reasoning Team lead at Google Brain / Google DeepMind), Google DeepMind, 2024. https://dennyzhou.github.io/. Accessed 2026-05-21.
7. Learn Prompting, "Least-to-Most Prompting", Learn Prompting documentation, 2023. https://learnprompting.org/docs/intermediate/least_to_most. Accessed 2026-05-21.
8. PromptHub, "Least-to-Most Prompting Guide", PromptHub Blog, 2023. https://www.prompthub.us/blog/least-to-most-prompting-guide. Accessed 2026-05-21.
9. Emergent Mind, "Least-to-Most Prompting in LLMs (paper summary of arXiv:2205.10625)", Emergent Mind, 2023. https://www.emergentmind.com/papers/2205.10625. Accessed 2026-05-21.
10. Drozdov, Andrew et al., "Compositional Semantic Parsing with Large Language Models", arXiv, 2022-09-29. https://arxiv.org/abs/2209.15003. Accessed 2026-05-21.
11. alphaXiv, "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models (v3 overview)", alphaXiv, 2023. https://www.alphaxiv.org/overview/2205.10625v3. Accessed 2026-05-21.
