Step-Back Prompting

Step-Back Prompting is a two-stage prompting technique introduced by researchers at Google DeepMind in October 2023. It elicits stronger reasoning from large language models by first asking the model to "step back" and articulate a higher-level concept or first principle related to a question, then to answer the original question with that abstraction supplied as additional context.[1] The method was described in the paper "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models" by Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, Heng-Tze Cheng, Ed H. Chi, Quoc V. Le, and Denny Zhou, accepted as a poster at the Twelfth International Conference on Learning Representations (ICLR 2024).[1][2][3] The authors report sizable accuracy gains on STEM, knowledge-intensive question answering, and multi-hop reasoning benchmarks for PaLM-2L, GPT-4, and Llama2-70B, attributing the improvement to the model's ability to recall and apply relevant principles before grounding a specific answer.[1][4]

Background and motivation

Researchers in prompt engineering have long observed that complex problems become tractable when a solver first identifies the relevant abstract principle. Chain-of-Thought prompting, introduced in 2022, demonstrated that asking a model to produce intermediate reasoning steps before a final answer dramatically improves accuracy on arithmetic and commonsense tasks.[1] Subsequent techniques such as self-consistency sample multiple chains and majority-vote the final answer, while decomposition methods like Least-to-Most split a problem into smaller sub-questions.[1]

The Step-Back Prompting authors argue that even with these advances, LLMs frequently fail on questions whose surface form contains many concrete details. The model is distracted by particulars and never retrieves the underlying physical law, knowledge fact, or temporal scope that would make the question easy.[1] Their stated motivation is an analogy to how humans tackle hard tasks: skilled problem solvers "step back and do abstractions to arrive at high-level principles to guide the process" before attempting a specific derivation.[1][5]

The paper frames the failure mode this way: when faced with "What happens to the pressure P of an ideal gas if temperature is increased by a factor of 2 and the volume is increased by a factor of 8?" a model often manipulates numbers without invoking the ideal gas law PV = nRT, leading to errors.[5] Likewise, knowledge questions such as "Which school did Estella Leopold go to between Aug 1954 and Nov 1954?" prove difficult even when the relevant biographical information is easily retrievable in answer to a broader question about her education history.[5] Step-Back Prompting is positioned as a lightweight, training-free intervention that closes this gap.

The method

Step-Back Prompting is a two-step procedure.[1][5]

  1. Abstraction. The model is shown the original question along with a short instruction and a few in-context examples that demonstrate how to rewrite a concrete question into a more generic "step-back question." For STEM tasks the abstraction prompt is "You are an expert at Physics/Chemistry. You are given a Physics/Chemistry problem. Your task is to extract the Physics/Chemistry concepts and principles involved in solving the problem."[4] For knowledge and multi-hop questions the prompt is "You are an expert at world knowledge. Your task is to step back and paraphrase a question to a more generic step-back question, which is easier to answer."[4][6] The model emits either the underlying principle (for STEM) or a paraphrased higher-level question (for knowledge tasks).

  2. Reasoning. A second call answers the step-back question (either by direct generation or by retrieval-augmented generation over an external corpus) and then composes a final answer to the original question conditioned on both the step-back content and the original query.[1][5] When retrieval is used, both the original question and the step-back question are independently issued to the retriever, and the combined evidence is concatenated into the answer prompt.[6]

The procedure adds one extra model call compared with a direct prompt or a single-pass Chain-of-Thought prompt, plus one extra retrieval pass when retrieval is used. It does not require any fine-tuning, weight updates, or special decoding strategy; only the prompt template and a handful of demonstrations change.[1][5]
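The two-step procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `llm` callable, the "Original question / Step-back question" demonstration layout, and the wording of the reasoning prompt are all assumptions made for the example. Only the abstraction instruction is quoted from the knowledge-task prompt described above.

```python
from typing import Callable, Sequence, Tuple

# Instruction quoted from the knowledge-task abstraction prompt.
ABSTRACTION_INSTRUCTION = (
    "You are an expert at world knowledge. Your task is to step back and "
    "paraphrase a question to a more generic step-back question, which is "
    "easier to answer."
)


def build_abstraction_prompt(question: str, demos: Sequence[Tuple[str, str]]) -> str:
    """Assemble the instruction, few-shot demonstrations, and the new question.

    The demonstration layout is an assumption for illustration.
    """
    parts = [ABSTRACTION_INSTRUCTION, ""]
    for original, step_back in demos:
        parts += [f"Original question: {original}",
                  f"Step-back question: {step_back}", ""]
    parts += [f"Original question: {question}", "Step-back question:"]
    return "\n".join(parts)


def build_reasoning_prompt(question: str, step_back_question: str) -> str:
    """Ask for the abstraction to be answered first, then the original question.

    The wording is hypothetical; the paper's reasoning prompts differ by task.
    """
    return (
        f"Step-back question: {step_back_question}\n"
        f"Original question: {question}\n\n"
        "First answer the step-back question, then use that answer "
        "to answer the original question."
    )


def step_back_answer(
    question: str,
    demos: Sequence[Tuple[str, str]],
    llm: Callable[[str], str],
) -> str:
    # Call 1 (abstraction): obtain the higher-level question or principle.
    step_back_question = llm(build_abstraction_prompt(question, demos)).strip()
    # Call 2 (reasoning): answer the original question with the abstraction in context.
    return llm(build_reasoning_prompt(question, step_back_question)).strip()
```

Because `llm` is just a function from prompt string to completion string, any client can be plugged in, and the sketch makes exactly two model calls per question.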

Worked example: ideal gas law

| Stage | Prompt or output |
| --- | --- |
| Original question | "What happens to the pressure P of an ideal gas if temperature is increased by a factor of 2 and the volume is increased by a factor of 8?"[5] |
| Step-back question | "What are the physics principles behind this question?"[5] |
| Abstraction answer | The ideal gas law: PV = nRT.[5] |
| Final reasoning | Apply PV = nRT with T -> 2T and V -> 8V, giving P -> P/4.[5] |
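The final reasoning step can be written out explicitly. The primed symbols P', T', and V' for the values after the change are notation introduced here, and the amount of gas n is assumed to stay constant:

```latex
\begin{aligned}
PV &= nRT \quad\Longrightarrow\quad P = \frac{nRT}{V}, \\
P' &= \frac{nRT'}{V'} = \frac{nR\,(2T)}{8V}
    = \frac{2}{8}\cdot\frac{nRT}{V} = \frac{P}{4}.
\end{aligned}
```

The pressure therefore falls to one quarter of its original value.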

Worked example: biographical knowledge

| Stage | Prompt or output |
| --- | --- |
| Original question | "Which school did Estella Leopold go to between Aug 1954 and Nov 1954?"[5] |
| Step-back question | "What is Estella Leopold's education history?"[5] |
| Abstraction answer | A summary of the universities she attended and corresponding date ranges.[5] |
| Final reasoning | Cross-reference the August-November 1954 window with the timeline to select the correct institution.[5] |
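The cross-referencing in the final row amounts to an interval-overlap check. The sketch below illustrates that check only: the `Enrollment` type, the function name, and above all the timeline entries are hypothetical placeholders, not real biographical data.

```python
from datetime import date
from typing import List, NamedTuple


class Enrollment(NamedTuple):
    institution: str
    start: date
    end: date


def institutions_during(
    timeline: List[Enrollment], window_start: date, window_end: date
) -> List[str]:
    """Return institutions whose enrollment period overlaps the query window."""
    return [
        entry.institution
        for entry in timeline
        if entry.start <= window_end and entry.end >= window_start
    ]


# Placeholder timeline: hypothetical names and dates, NOT real biographical data.
timeline = [
    Enrollment("Institution A", date(1944, 9, 1), date(1948, 6, 30)),
    Enrollment("Institution B", date(1948, 9, 1), date(1950, 6, 30)),
    Enrollment("Institution C", date(1951, 9, 1), date(1955, 6, 30)),
]

# Query window taken from the original question: Aug 1954 to Nov 1954.
print(institutions_during(timeline, date(1954, 8, 1), date(1954, 11, 30)))
# prints ['Institution C']
```

In the actual method the language model performs this step in natural language. The code only makes explicit why the broader education-history answer is sufficient to resolve the narrow date range.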

Evaluation

The authors evaluate Step-Back Prompting with PaLM-2L as the primary model and additionally report results for GPT-4 and Llama2-70B baselines.[1][4] They divide tasks into three categories.

STEM

For MMLU high-school physics and high-school chemistry, the authors report the following PaLM-2L accuracies.[4]

| Method | MMLU Physics | MMLU Chemistry |
| --- | --- | --- |
| PaLM-2L baseline | 66.4% | 70.9% |
| PaLM-2L + Chain-of-Thought | 65.0% | 75.3% |
| PaLM-2L + Take a Deep Breath | 65.7% | 73.8% |
| PaLM-2L + Step-Back | 73.2% | 81.8% |

This is a gain of 6.8 percentage points over baseline on Physics and 10.9 points on Chemistry, with Step-Back outperforming Chain-of-Thought on both subsets.[1][4] On GSM8K the authors note that Step-Back is competitive with strong baselines but produces a smaller gap because the underlying arithmetic principles can already be inferred without explicit abstraction.[1]
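The gains can be recomputed directly from the table. The snippet below is only a convenience check on the reported accuracies; the dictionary keys and the `gain` helper are names chosen for this example.

```python
# PaLM-2L accuracies (%) copied from the STEM table above.
accuracy = {
    "baseline": {"physics": 66.4, "chemistry": 70.9},
    "chain_of_thought": {"physics": 65.0, "chemistry": 75.3},
    "take_a_deep_breath": {"physics": 65.7, "chemistry": 73.8},
    "step_back": {"physics": 73.2, "chemistry": 81.8},
}


def gain(method: str, reference: str, subject: str) -> float:
    """Percentage-point difference between two methods on one subject."""
    return round(accuracy[method][subject] - accuracy[reference][subject], 1)


for subject in ("physics", "chemistry"):
    print(
        subject,
        gain("step_back", "baseline", subject),
        gain("step_back", "chain_of_thought", subject),
    )
# prints:
# physics 6.8 8.2
# chemistry 10.9 6.5
```

The margin over Chain-of-Thought is larger on Physics (8.2 points) than the margin over the baseline (6.8 points) because Chain-of-Thought scores slightly below the baseline on that subset.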

Knowledge-intensive QA

The knowledge tasks are TimeQA, which probes time-sensitive biographical and historical facts, and SituatedQA, which targets context-dependent answers.[1][4]

| Method | TimeQA | SituatedQA |
| --- | --- | --- |
| PaLM-2L baseline | 41.5% | 54.3% |
| PaLM-2L + Chain-of-Thought | 40.8% | 56.4% |
| PaLM-2L + RAG | 57.4% | n/a |
| PaLM-2L + Step-Back | 66.0% | n/a |
| PaLM-2L + Step-Back + RAG | 68.7% | 61.0% |
| GPT-4 baseline | 45.6% | 63.2% |

On TimeQA the combination of Step-Back and retrieval lifts PaLM-2L by 27.2 percentage points over the direct prompt and surpasses GPT-4's baseline by more than 23 points; on SituatedQA the technique narrows but does not close the gap to GPT-4.[1][4]

Multi-hop reasoning

Multi-hop reasoning is measured with MuSiQue and StrategyQA.[1][4]

| Method | MuSiQue | StrategyQA |
| --- | --- | --- |
| PaLM-2L baseline | 35.5% | 82.8% |
| PaLM-2L + Chain-of-Thought | 38.7% | n/a |
| PaLM-2L + Step-Back + RAG | 42.8% | 86.4% |
| GPT-4 baseline | 38.5% | 78.3% |

Step-Back combined with RAG lifts PaLM-2L on MuSiQue by 7.3 percentage points over the baseline and surpasses the GPT-4 baseline on both multi-hop benchmarks.[1][4]

Error analysis

The paper includes a fine-grained breakdown of where the remaining errors originate. On MMLU Physics, Step-Back corrects about 20.5% of errors made by the baseline while introducing 11.9% new errors, so the net gain is positive. More than 90% of the remaining mistakes occur in the reasoning step rather than in the abstraction step, indicating that producing the principle is easier for the model than applying it.[5] On TimeQA, Step-Back fixes 39.9% of baseline errors and introduces only 5.6% new ones, but roughly 45% of remaining mistakes are traced to retrieval failures even when the step-back question is well formed.[5] The authors conclude that "abstraction is easier" for current models, leaving reasoning and retrieval as the dominant bottlenecks.[1][5]

Comparison with related techniques

Step-Back Prompting sits in a family of training-free strategies that change how a problem is posed before the model attempts an answer. The table below summarises how it relates to several adjacent methods.

| Method | Core mechanism | Strength | Weakness relative to Step-Back |
| --- | --- | --- | --- |
| Chain-of-Thought | Asks the model to produce intermediate reasoning before a final answer.[1] | Single prompt, works on a wide range of arithmetic and commonsense problems. | Does not surface a higher-level principle when surface details overwhelm the model. |
| Self-Consistency | Samples multiple Chain-of-Thought traces and majority-votes the answer.[7] | Reduces variance from a single sampling. | Cannot recover an answer when every chain shares the same misapplied principle; multiplies inference cost without changing the abstraction. |
| Retrieval-Augmented Generation (RAG) | Issues the query against an external corpus and conditions the answer on retrieved passages.[1] | Brings in fresh evidence and grounds factual responses. | When the original query is too narrow, the retriever misses passages; combining RAG with Step-Back materially improves TimeQA and SituatedQA scores.[4] |
| Take a Deep Breath | Adds a generic instruction such as "take a deep breath and work on this problem step-by-step." | Trivial implementation. | Underperforms Step-Back on MMLU Physics and Chemistry in the paper.[4] |
| Step-Back Prompting | Generates a higher-level principle or paraphrase, then answers the original question with that abstraction in context.[1][5] | Recovers the relevant principle even when the original prompt is detail-heavy; composable with RAG.[1] | Adds at least one extra model call; remaining errors concentrate in the reasoning step.[5] |

Adoption and implementations

Step-Back Prompting was rapidly picked up by the open-source prompting community. The LangChain team released a chat-model implementation roughly two weeks after the arXiv preprint, adapting the original few-shot abstraction template ("You are an expert at world knowledge. Your task is to step back and paraphrase a question to a more generic step-back question, which is easier to answer.") into a runnable chain that retrieves over both the original and step-back queries and then synthesises an answer.[6][8] The cookbook demonstration uses ChatOpenAI with temperature 0 for determinism and combines DuckDuckGo search results from both queries before the final answer step.[6]

The technique has also been incorporated into educational prompt-engineering resources such as the LearnPrompting vocabulary and curriculum, which catalogues it as an extension of Chain-of-Thought that prepends a "preparatory stage" before reasoning.[9] Google DeepMind lists the paper among its publications on reasoning research, citing the abstraction-then-reasoning framing.[2][3][11] Practitioner-facing writeups have summarised the technique as adding a "reflection phase" before the model attempts an answer.[12]

Applications

Use cases that have been reported or demonstrated for Step-Back Prompting include:

  • Physics and chemistry tutoring where the model must first recall the controlling law (ideal gas law, conservation principles, stoichiometry) before manipulating numbers.[1][5]
  • Time-sensitive question answering, in which a paraphrased question covering a broader time range produces stronger retrieval recall than the narrowly framed original query.[4][6]
  • Multi-hop biographical and geographical reasoning, where the step-back question makes the connecting entity explicit before the final hop.[1][5]
  • General-purpose RAG pipelines that combine the original query and a step-back query at retrieval time to assemble a richer evidence context.[6][8]
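The retrieval pattern in the last item can be sketched as follows. This is an illustration under stated assumptions rather than the LangChain implementation: `retrieve` stands for any function that maps a query string to a list of passages, the function names are hypothetical, and the order-preserving de-duplication is a design choice added here (the sources describe simply combining the evidence from both queries).

```python
from typing import Callable, List


def retrieve_with_step_back(
    question: str,
    step_back_question: str,
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Query the retriever with both questions and merge the passages.

    Duplicates are dropped while preserving first-seen order (an assumption
    of this sketch, not something the sources specify).
    """
    merged: List[str] = []
    seen = set()
    for query in (question, step_back_question):
        for passage in retrieve(query):
            if passage not in seen:
                seen.add(passage)
                merged.append(passage)
    return merged


def build_answer_prompt(question: str, passages: List[str]) -> str:
    """Place the combined evidence ahead of the original question."""
    context = "\n\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```

The original question, not the step-back question, is the one placed in the final prompt. The abstraction only widens the evidence that the answer is conditioned on.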

Limitations and criticisms

The authors and subsequent analyses identify several constraints.[1][5][10]

  • Reasoning bottleneck. On MMLU Physics, more than 90% of remaining errors after Step-Back occur in the reasoning step, not in the abstraction step. The technique surfaces the right principle but the model still misapplies it. The authors describe reasoning as "the dominant failure mode even after large reduction of task complexity."[1][5]
  • Not universally applicable. Some questions ask directly about a first principle (for example, "What is the speed of light?") or require pure factual recall (for example, "Who was president of the United States in 2000?"). In these cases there is nothing meaningful to abstract from and Step-Back can add latency without benefit.[5]
  • Retrieval errors persist. On TimeQA, about 45% of remaining errors come from the retriever, even when the step-back question is well formed. Step-Back improves retrieval coverage but does not eliminate retrieval failure.[5]
  • SituatedQA gap to GPT-4. Even with retrieval, PaLM-2L plus Step-Back reaches 61.0% on SituatedQA, still below the 63.2% baseline of GPT-4. The technique narrows but does not always close cross-model accuracy gaps.[4]
  • Extra inference cost. Step-Back adds at least one extra LLM call (and one extra retrieval pass if RAG is used), which can roughly double or triple per-query latency and token spend compared with a single-call prompt.[1][6]
  • Effectiveness varies by task. Independent analyses note that Step-Back outperforms Chain-of-Thought and few-shot prompting on most tested datasets but is sensitive to the choice and number of demonstrations, with more examples not always producing better abstractions.[10]

Related work

Step-Back Prompting belongs to a broader literature on prompt-time reasoning aids. Closely related methods include Chain-of-Thought prompting, which introduced explicit intermediate reasoning; self-consistency decoding, which samples multiple chains; Tree of Thoughts, which explores branching reasoning paths; ReAct, which interleaves reasoning with action calls; HyDE (Hypothetical Document Embeddings), which generates a hypothetical answer to improve retrieval; and meta-prompting approaches that ask the model to compose its own reasoning structure. As in-context learning research has expanded, abstraction-first techniques have proven complementary to retrieval-based and ensembling-based approaches rather than substitutes.[1][4]

The authors emphasise that Step-Back is orthogonal to retrieval: combining Step-Back with Retrieval-Augmented Generation yields larger gains than either component alone on TimeQA, SituatedQA, MuSiQue, and StrategyQA.[1][4]

References

  1. Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, Heng-Tze Cheng, Ed H. Chi, Quoc V. Le, Denny Zhou, "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models", arXiv, 2023-10-09. https://arxiv.org/abs/2310.06117. Accessed 2026-05-21.
  2. ICLR 2024 program committee, "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models (poster)", ICLR 2024 Virtual Site, 2024-05-07. https://iclr.cc/virtual/2024/poster/19503. Accessed 2026-05-21.
  3. OpenReview, "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models", OpenReview (ICLR 2024 forum 3bq3jsvcQ1), 2024-01-16. https://openreview.net/forum?id=3bq3jsvcQ1. Accessed 2026-05-21.
  4. Huaixiu Steven Zheng et al., "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models (HTML edition v2)", arXiv, 2024-03-12. https://arxiv.org/html/2310.06117v2. Accessed 2026-05-21.
  5. Huaixiu Steven Zheng et al., "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models (HTML rendering)", arXiv, 2024-03-12. https://arxiv.org/html/2310.06117. Accessed 2026-05-21.
  6. Cobus Greyling, "The LangChain Implementation of DeepMind's Step-Back Prompting", Medium, 2023-10-26. https://cobusgreyling.medium.com/the-langchain-implementation-of-deepminds-step-back-prompting-9d698cf3e0c2. Accessed 2026-05-21.
  7. Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou, "Self-Consistency Improves Chain of Thought Reasoning in Language Models", arXiv, 2022-03-21. https://arxiv.org/abs/2203.11171. Accessed 2026-05-21.
  8. LangChain (@LangChainAI), "Step-back prompting: a new prompting technique from Google DeepMind, can be used to improve RAG results. Now in LangChain.", X (Twitter), 2023-10-23. https://x.com/LangChainAI/status/1716482177331011833. Accessed 2026-05-21.
  9. Learn Prompting, "Step-Back Prompting", Learn Prompting Vocabulary, 2024-06-04. https://learnprompting.org/vocabulary/step-back_prompting. Accessed 2026-05-21.
  10. Aziz Belaweid, "Is Step Back Prompting The Best Prompting Strategy?", Substack, 2024-01-21. https://azizbelaweid.substack.com/p/is-step-back-prompting-the-best-prompting. Accessed 2026-05-21.
  11. Google DeepMind, "Step-Back Prompting Enables Reasoning via Abstraction in Large Language Models", Google DeepMind Publications, 2023-10-09. https://deepmind.google/research/publications/step-back-prompting-enables-reasoning-via-abstraction-in-large-language-models/. Accessed 2026-05-21.
  12. PromptHub, "A Step Forward with Step-Back Prompting", PromptHub Blog, 2024-02-15. https://www.prompthub.us/blog/a-step-forward-with-step-back-prompting. Accessed 2026-05-21.
