Least-to-Most Prompting
Last reviewed
Jun 7, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 2,299 words
Least-to-Most Prompting is a few-shot prompting technique for large language models introduced by researchers at Google Brain in May 2022. The method addresses a key weakness of chain-of-thought prompting: the tendency to fail on test problems that are structurally harder than the examples shown in the prompt.[1] Least-to-most prompting operates in two stages, first using the model to decompose a complex problem into an ordered list of easier sub-problems (problem reduction), then solving those sub-problems sequentially with each answer appended to the context used to solve the next one (problem solving).[1] On the SCAN compositional generalization benchmark, the GPT-3 code-davinci-002 model reached at least 99% accuracy in every split (including the length split) using just 14 exemplars, compared to about 16% with chain-of-thought prompting.[1][2] The paper was first posted to arXiv on 21 May 2022 and accepted as a poster at ICLR 2023.[1][3]
By early 2022, chain-of-thought prompting (Wei et al.) had shown that prepending a few worked-out reasoning traces to a question could elicit multi-step reasoning from sufficiently large language models without any fine-tuning.[4] Despite headline gains on GSM8K and similar benchmarks, follow-up analyses noted a recurring failure mode: when the test problems required more reasoning steps, longer command strings, or deeper compositional structure than any of the exemplars, accuracy collapsed.[1][2] The SCAN benchmark, designed by Lake and Baroni to test compositional generalization in sequence-to-sequence models, was a particularly stark example: even very large models struggled on its length split, where the test set contains command sequences longer than anything seen during training.[2][5]
The Least-to-Most Prompting paper, titled "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models", was written by Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi, all affiliated with Google Research and the Google Brain team.[1][6] Denny Zhou, who later founded and led the Reasoning Team at Google Brain (subsequently folded into the Gemini effort at Google DeepMind), described least-to-most as one of four pillars of his reasoning research program, alongside chain-of-thought, self-consistency, and instruction fine-tuning.[6]
The paper was first uploaded to arXiv as 2205.10625v1 on 21 May 2022, revised on 6 October 2022 (v2), and reached its final v3 on 16 April 2023.[1] It was accepted to ICLR 2023 as a poster in the NLP applications track.[3] The method draws on a long tradition in problem-solving heuristics (decomposing a hard problem into easier sub-problems is a staple of Polya-style mathematical pedagogy), but its specific contribution is showing that a sufficiently capable LLM can be prompted to perform the decomposition itself, with no human supervision beyond the in-context exemplars.[1]
Least-to-most prompting separates reasoning into two prompting stages run against the same model. Both stages are pure few-shot prompting; no weights are updated.
The first prompt contains a small number of demonstrations in which a complex problem is rewritten as an ordered list of strictly easier sub-problems, terminating with the original problem.[1][7] For a math word problem about how long it takes to make and eat a pizza, the decomposition might be "(1) How long does it take to bake the pizza? (2) How long does it take to cool the pizza? (3) How long does the whole task take?", with the third sub-problem being the original question.[7] At test time the new question is appended and the model is asked to generate its own decomposition. Crucially, the decomposition is only structural; it does not contain answers.[1]
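The decomposition stage can be sketched roughly as follows. This is a minimal illustration, not the paper's prompts: `complete` is a hypothetical callable that wraps whatever LLM is in use, and the exemplar wording, the `Q:` / `Sub-questions:` layout, and the numbered-list output format are all assumptions made for the example.

```python
import re
from typing import Callable, List, Sequence, Tuple

# Hypothetical exemplar: a question paired with its ordered sub-questions.
# The wording is illustrative and not taken from the paper.
DECOMPOSITION_EXEMPLARS: List[Tuple[str, List[str]]] = [
    (
        "Making a pizza involves baking it and letting it cool. "
        "How long does the whole task take?",
        [
            "How long does it take to bake the pizza?",
            "How long does it take to cool the pizza?",
            "How long does the whole task take?",
        ],
    ),
]


def build_decomposition_prompt(
    exemplars: Sequence[Tuple[str, List[str]]], question: str
) -> str:
    """Few-shot prompt: each exemplar shows a question and its sub-questions."""
    parts = []
    for q, subs in exemplars:
        numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(subs, start=1))
        parts.append(f"Q: {q}\nSub-questions:\n{numbered}")
    # The new question goes last, leaving the list for the model to fill in.
    parts.append(f"Q: {question}\nSub-questions:\n")
    return "\n\n".join(parts)


def parse_decomposition(text: str) -> List[str]:
    """Extract items from lines shaped like '1. ...', '1) ...' or '(1) ...'."""
    subs = []
    for line in text.splitlines():
        match = re.match(r"\s*\(?(\d+)[.)]\s*(.+)", line)
        if match:
            subs.append(match.group(2).strip())
    return subs


def decompose(
    question: str,
    complete: Callable[[str], str],  # hypothetical LLM wrapper
    exemplars: Sequence[Tuple[str, List[str]]] = DECOMPOSITION_EXEMPLARS,
) -> List[str]:
    """Ask the model for an ordered list of sub-questions (no answers)."""
    subs = parse_decomposition(complete(build_decomposition_prompt(exemplars, question)))
    # Design choice of this sketch: make sure the list ends with the original
    # question, since the final sub-problem is meant to be the task itself.
    if not subs or subs[-1] not in question:
        subs.append(question)
    return subs
```

Note that the output of `decompose` contains only questions; producing answers is left entirely to the second stage.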
The second stage iterates through the decomposition produced by the first. At each step the model is given a prompt that contains: a few worked examples of solving similarly small sub-problems, the question being decomposed, all previously solved sub-questions together with their answers, and finally the current sub-question.[7][8] The answer the model produces is appended to the context, and the loop advances to the next sub-question. The final sub-question is the original problem, so its answer is the answer to the task. Because each step only adds a small amount of new reasoning on top of explicitly stated prior results, the model never has to reason about more than one additional step at a time.[1][8]
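The solving loop described above can be sketched as follows, under stated assumptions: `complete` is a hypothetical callable wrapping the LLM, `solved_exemplars` is a hypothetical string of worked sub-problem demonstrations, and the `Q:` / `A:` layout is illustrative rather than the paper's exact format.

```python
from typing import Callable, List, Sequence, Tuple


def solve_least_to_most(
    context: str,
    sub_questions: Sequence[str],
    complete: Callable[[str], str],  # hypothetical LLM wrapper
    solved_exemplars: str = "",      # hypothetical few-shot demonstrations
) -> List[Tuple[str, str]]:
    """Answer sub-questions in order, replaying earlier answers each time."""
    solved: List[Tuple[str, str]] = []
    for sub_q in sub_questions:
        # Every previously solved sub-question and its answer is re-fed
        # as context before the current sub-question.
        history = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in solved)
        prompt = f"{solved_exemplars}{context}\n\n{history}Q: {sub_q}\nA:"
        answer = complete(prompt).strip()
        solved.append((sub_q, answer))
    return solved
```

If the last sub-question is the original problem, the task answer is `solved[-1][1]`. The loop makes one model call per sub-question, and each prompt is longer than the previous one because it replays all earlier question and answer pairs.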
| Property | Chain-of-Thought | Least-to-Most |
|---|---|---|
| Number of prompts per query | 1 | 1 decomposition prompt, then 1 solving prompt per sub-problem |
| Sub-problems explicitly listed | No, intermediate steps emerge inline | Yes, generated in stage 1 |
| Intermediate answers re-fed as context | No, single forward pass | Yes, appended after each step |
| Robustness when test items exceed exemplar difficulty | Drops sharply | Generalizes to harder items |
| Typical extra inference cost | None | One pass per sub-problem |
The structural difference is that the running solution is recycled back into the prompt at every step, so each prompt asks the model for only a small increment of new reasoning given what is already known. The name itself is borrowed from educational psychology, where least-to-most prompting denotes teaching a skill through a progressive sequence of prompts.[1][7] Zero-shot chain-of-thought (Kojima et al.'s "Let's think step by step") also aims to make reasoning prompts cheaper to write, but it still solves the whole problem in a single generation, so it inherits the same easy-to-hard cliff that motivated least-to-most.[8]
The paper evaluates least-to-most prompting on four families of tasks: symbolic manipulation (last-letter concatenation), compositional generalization (SCAN), reading comprehension over paragraphs requiring numerical reasoning (DROP), and grade-school math word problems (GSM8K).[1][9] Most experiments use GPT-3-family models, principally code-davinci-002 (the Codex model branched from GPT-3) and text-davinci-002.[1][9]
SCAN tests whether a model that has seen short navigation-style commands (e.g. "jump twice and walk thrice") can produce the correct action sequence for longer or more deeply nested commands.[5] In the paper's length split, where test commands are systematically longer than any training command, code-davinci-002 with least-to-most prompting reaches 99.7% accuracy using only 14 exemplars.[1][2] Chain-of-thought prompting on the same model and the same exemplars achieves about 16.2% on the length split.[2] With text-davinci-002, standard prompting on SCAN reaches roughly 6% while least-to-most rises to about 76%, again with no fine-tuning.[7] For context, the prior specialized neural-symbolic systems that approached perfect SCAN accuracy were trained on the full SCAN training set of more than 15,000 examples.[1]
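To make the task concrete, the sketch below interprets a small subset of the SCAN command language: primitive verbs, the repetition modifiers `twice` and `thrice`, and a single `and` or `after` conjunction. It is a deterministic reference for what the expected output looks like, not the prompting method itself. Turns, `opposite`, and `around` are omitted, and the `I_`-prefixed action tokens follow the convention of the released dataset files; treat the exact token spellings as an assumption.

```python
from typing import List

# Assumed token spellings; only a subset of SCAN's verbs is covered.
ACTIONS = {"jump": "I_JUMP", "walk": "I_WALK", "run": "I_RUN", "look": "I_LOOK"}
REPEATS = {"twice": 2, "thrice": 3}


def interpret_phrase(phrase: str) -> List[str]:
    """Interpret '<verb>' or '<verb> twice|thrice'."""
    words = phrase.split()
    if not words or words[0] not in ACTIONS:
        raise ValueError(f"unsupported phrase: {phrase!r}")
    count = 1
    if len(words) > 1:
        if len(words) != 2 or words[1] not in REPEATS:
            raise ValueError(f"unsupported modifier in: {phrase!r}")
        count = REPEATS[words[1]]
    return [ACTIONS[words[0]]] * count


def interpret(command: str) -> List[str]:
    """Interpret a command with at most one 'and' / 'after' conjunction."""
    if " and " in command:
        left, right = command.split(" and ", 1)
        return interpret_phrase(left) + interpret_phrase(right)
    if " after " in command:
        # 'x after y' means: do y first, then x.
        left, right = command.split(" after ", 1)
        return interpret_phrase(right) + interpret_phrase(left)
    return interpret_phrase(command)


print(interpret("jump twice and walk thrice"))
# ['I_JUMP', 'I_JUMP', 'I_WALK', 'I_WALK', 'I_WALK']
```

The length split is hard because the output grows with every modifier and conjunction, so test commands can expand to action sequences longer than any sequence seen in the exemplars.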
In the last-letter concatenation task, the model is given a list of words and must output the string formed by taking the last letter of each word. With four words, chain-of-thought is roughly competitive, but its accuracy falls off quickly as the list grows. The paper reports chain-of-thought accuracy near 38% on 8-word lists, while least-to-most maintains substantially higher accuracy as length increases by reducing each step to appending one letter to a previously concatenated running answer.[7][8]
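The reduction can be illustrated with a deterministic reference implementation. This is a sketch of the target computation only, showing the sequence of sub-problems and running answers that least-to-most prompting asks the model to reproduce; it does not call a model, and the function name `last_letter_steps` and the word list are chosen for illustration.

```python
from typing import List, Tuple


def last_letter_steps(words: List[str]) -> List[Tuple[List[str], str]]:
    """Return each prefix of the word list with its running concatenation."""
    running = ""
    steps = []
    for i, word in enumerate(words, start=1):
        # Each step adds exactly one letter to the previous running answer.
        running += word[-1]
        steps.append((words[:i], running))
    return steps


for prefix, answer in last_letter_steps(["think", "machine", "learning"]):
    print(prefix, "->", answer)
# ['think'] -> k
# ['think', 'machine'] -> ke
# ['think', 'machine', 'learning'] -> keg
```

Each sub-problem extends the previous one by a single word, so the amount of new work per step stays constant no matter how long the full list is.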
On DROP, a reading-comprehension benchmark requiring discrete operations over passages, least-to-most outperforms chain-of-thought by a sizeable margin, especially on football game and history passages, where the most useful decompositions enumerate the individual events to be counted or aggregated before the final arithmetic.[1][9] The improvement is more pronounced when test problems require more sub-steps than the few-shot exemplars demonstrated.[9]
On GSM8K grade-school math, least-to-most matches chain-of-thought on items requiring a small number of reasoning steps and pulls ahead on the harder subset that requires roughly five or more steps.[7][8] The paper reports that the relative gap widens as the required number of intermediate calculations grows, again consistent with the easy-to-hard generalization narrative.[1][7]
Across all four task families, the dominant pattern is that least-to-most prompting and chain-of-thought perform similarly when the test items are no harder than the prompted exemplars, while least-to-most retains accuracy as items get harder.[1][7] The 99.7% result on SCAN is the most striking instance because the length split was specifically constructed to be out-of-distribution from the prompt.[1][2]
Least-to-Most Prompting is one of the canonical entries in the early "wave" of prompt engineering techniques aimed at unlocking reasoning in large language models without fine-tuning. Alongside chain-of-thought, self-consistency, Step-Back Prompting, Auto-CoT, ReAct, and Tree of Thoughts, it represents a family of techniques that use the model itself to structure its own computation.[6][10] In particular, it was an early and concrete demonstration of a recursive idea now common in agentic systems: ask the model to plan, then execute the plan step-by-step using the same model.[1][7]
The technique also influenced subsequent compositional reasoning work. Drozdov et al.'s ICLR 2023 paper "Compositional Semantic Parsing with Large Language Models" builds directly on least-to-most prompting, using a syntactic-parse decomposition stage and dynamic exemplar selection to set new state-of-the-art numbers on the CFQ semantic-parsing benchmark with only about 1% of the training data used by prior approaches.[10] Within the broader in-context learning literature, the paper is frequently cited as a demonstration that easy-to-hard generalization, long considered a major weakness of sequence models, can be partially bridged by structuring the prompt rather than scaling the model.[1][11]
After the paper appeared, "least-to-most" became one of the standard entries in tutorial-style guides on prompt engineering and was incorporated into general references such as the Learn Prompting documentation.[7] The approach is closely related to, and often combined with, self-consistency decoding and ReAct-style tool use. In agentic frameworks the decomposition stage corresponds to a "planner" call and the iterative solving stage corresponds to "worker" calls that may also invoke external tools, code interpreters, or retrieval.[7][8] The technique remains a useful baseline in compositional generalization, complex math word problems, and reading comprehension where problems can be decomposed before solving.[9][11]
In Denny Zhou's later talks on teaching language models to reason, he framed least-to-most as the explicit-decomposition analogue of chain-of-thought, with chain-of-thought relying on implicit decomposition inside a single forward pass and least-to-most exposing the decomposition explicitly so the model can recurse on it.[6]
The paper and follow-up work identify several limitations.
Least-to-most prompting belongs to a cluster of in-context learning techniques that explicitly structure the reasoning process.
| Technique | Decomposition style | Iteration | Year |
|---|---|---|---|
| Chain-of-Thought | Implicit, inline | Single forward pass | 2022 |
| Zero-shot CoT | Implicit, triggered by "let's think step by step" | Single forward pass | 2022 |
| Least-to-Most | Explicit list of sub-problems | One prompt per sub-problem | 2022 |
| Self-consistency | Same as CoT | Multiple samples, majority vote | 2022 |
| Auto-CoT | Implicit, but clustered exemplars chosen automatically | Single forward pass | 2022 |
| Step-Back Prompting | Abstraction first, then solve | Two prompts | 2023 |
| Self-Refine | Critique and revise the same answer | Multi-pass | 2023 |
| ReAct | Interleaved reason / act | Multi-pass with tools | 2022 |
| Tree of Thoughts | Tree search over partial solutions | Many evaluations | 2023 |
Compared to chain-of-thought, least-to-most makes the decomposition explicit and replays prior answers; compared to Tree of Thoughts, it keeps a single linear plan rather than branching; compared to ReAct, it does not interleave external tool calls.[1][7][11]