Step-Back Prompting
Last reviewed: May 21, 2026
Sources: No citations yet
Review status: Needs citations
Revision: v1 · 2,560 words
Step-Back Prompting is a two-stage prompting technique introduced by researchers at Google DeepMind in October 2023. It elicits stronger reasoning from large language models by first asking the model to "step back" and articulate a higher-level concept or first principle related to a question, then to answer the original question with that abstraction supplied as additional context.[^1] The method was described in the paper "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models" by Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, Heng-Tze Cheng, Ed H. Chi, Quoc V. Le, and Denny Zhou, accepted as a poster at the Twelfth International Conference on Learning Representations (ICLR 2024).[^1][^2][^3] The authors report sizable accuracy gains on STEM, knowledge-intensive question answering, and multi-hop reasoning benchmarks for PaLM-2L, GPT-4, and Llama2-70B, attributing the improvement to the model's ability to recall and apply relevant principles before grounding a specific answer.[^1][^4]
Researchers in prompt engineering have long observed that complex problems become tractable when a solver first identifies the relevant abstract principle. Chain-of-Thought prompting, introduced in 2022, demonstrated that asking a model to produce intermediate reasoning steps before a final answer dramatically improves accuracy on arithmetic and commonsense tasks.[^1] Subsequent techniques such as self-consistency sample multiple chains and majority-vote the final answer, while decomposition methods like Least-to-Most split a problem into smaller sub-questions.[^1]
The Step-Back Prompting authors argue that even with these advances, LLMs frequently fail on questions whose surface form contains many concrete details. The model is distracted by particulars and never retrieves the underlying physical law, knowledge fact, or temporal scope that would make the question easy.[^1] Their stated motivation is an analogy to how humans tackle hard tasks: skilled problem solvers "step back and do abstractions to arrive at high-level principles to guide the process" before attempting a specific derivation.[^1][^5]
The paper frames the failure mode this way: when faced with "What happens to the pressure P of an ideal gas if temperature is increased by a factor of 2 and the volume is increased by a factor of 8?" a model often manipulates numbers without invoking the ideal gas law PV = nRT, leading to errors.[^5] Likewise, knowledge questions such as "Which school did Estella Leopold go to between Aug 1954 and Nov 1954?" prove difficult even when the relevant biographical information is easily retrievable in answer to a broader question about her education history.[^5] Step-Back Prompting is positioned as a lightweight, training-free intervention that closes this gap.
Step-Back Prompting is a two-step procedure.[^1][^5]
Abstraction. The model is shown the original question along with a short instruction and a few in-context examples that demonstrate how to rewrite a concrete question into a more generic "step-back question." For STEM tasks the abstraction prompt is "You are an expert at Physics/Chemistry. You are given a Physics/Chemistry problem. Your task is to extract the Physics/Chemistry concepts and principles involved in solving the problem."[^4] For knowledge and multi-hop questions the prompt is "You are an expert at world knowledge. Your task is to step back and paraphrase a question to a more generic step-back question, which is easier to answer."[^4][^6] The model emits either the underlying principle (for STEM) or a paraphrased higher-level question (for knowledge tasks).
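The abstraction prompt described above can be sketched as a small template builder. This is an illustrative sketch, not the authors' code: the `Demo` class, the `build_abstraction_prompt` function, the "Original question / Step-back question" layout, and the ExampleCorp question are all assumptions made for the example. Only the instruction string and the Estella Leopold demonstration come from the text.

```python
from dataclasses import dataclass

# Instruction quoted in the text for knowledge and multi-hop questions.
KNOWLEDGE_INSTRUCTION = (
    "You are an expert at world knowledge. Your task is to step back and "
    "paraphrase a question to a more generic step-back question, which is "
    "easier to answer."
)


@dataclass(frozen=True)
class Demo:
    """One in-context demonstration (hypothetical container type)."""
    question: str
    step_back: str


def build_abstraction_prompt(question: str, demos: list[Demo],
                             instruction: str = KNOWLEDGE_INSTRUCTION) -> str:
    """Join instruction, few-shot demos, and the new question.

    The line layout is an assumption; the text does not specify it.
    """
    parts = [instruction, ""]
    for demo in demos:
        parts += [f"Original question: {demo.question}",
                  f"Step-back question: {demo.step_back}",
                  ""]
    # Leave the last step-back slot empty for the model to complete.
    parts += [f"Original question: {question}", "Step-back question:"]
    return "\n".join(parts)


demos = [Demo(
    "Which school did Estella Leopold go to between Aug 1954 and Nov 1954?",
    "What is Estella Leopold's education history?",
)]
# Hypothetical new question, used only to show the template shape.
prompt = build_abstraction_prompt(
    "Who led ExampleCorp between Jan 2015 and Mar 2015?", demos)
print(prompt)
```

The prompt deliberately ends on an open "Step-back question:" line so that the model's completion is the abstraction itself.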
Reasoning. A second call answers the step-back question (either by direct generation or by retrieval-augmented generation over an external corpus) and then composes a final answer to the original question conditioned on both the step-back content and the original query.[^1][^5] When retrieval is used, both the original question and the step-back question are independently issued to the retriever, and the combined evidence is concatenated into the answer prompt.[^6]
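The retrieval variant of the reasoning step can be illustrated with a toy example. The sketch below assumes a word-overlap retriever (`retrieve`) and a two-passage corpus about a hypothetical person, "Jane Doe"; neither is from the paper, which uses a real retrieval system. It shows only the pattern described above: both queries are issued independently and the evidence is concatenated.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    return {w.strip("?.,:'\"").lower() for w in text.split()} - {""}


def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Toy retriever (assumption): rank passages by word overlap with the query."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda p: len(q & tokenize(p)), reverse=True)
    return [p for p in ranked[:k] if q & tokenize(p)]


def combined_context(original: str, step_back: str,
                     corpus: list[str], k: int = 1) -> str:
    """Query with both questions and concatenate the de-duplicated passages."""
    passages: list[str] = []
    for query in (original, step_back):
        for passage in retrieve(query, corpus, k):
            if passage not in passages:
                passages.append(passage)
    return "\n".join(passages)


# Hypothetical corpus: placeholder facts about a made-up person.
corpus = [
    "Jane Doe attended a conference between Aug 1954 and Nov 1954.",
    "Jane Doe education history: University A 1950 to 1953, "
    "University B 1953 to 1955.",
]
original = "Which school did Jane Doe attend between Aug 1954 and Nov 1954?"
step_back = "What is Jane Doe's education history?"

context = combined_context(original, step_back, corpus)
print(context)
```

In this toy setup the narrow original query matches the distractor passage on its date terms, while the step-back query surfaces the education-history passage that is actually needed.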
The procedure adds one extra model call compared with a direct prompt or a single-pass Chain-of-Thought approach: one call for abstraction and one for reasoning. It does not require any fine-tuning, weight updates, or special decoding strategy; only the prompt template and a handful of demonstrations change.[^1][^5]
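A minimal sketch of the abstraction and reasoning calls, assuming the model is any callable from prompt string to completion string. The `step_back_answer` function, the prompt wording inside it, and the `StubLLM` class are hypothetical; the stub returns canned text taken from the ideal gas example so that the sketch runs without a real model.

```python
from typing import Callable

LLM = Callable[[str], str]  # assumed interface: prompt in, completion out


def step_back_answer(question: str, llm: LLM) -> tuple[str, str]:
    """Run the abstraction call, then the reasoning call."""
    # Abstraction: ask for the governing concepts or principles.
    principle = llm(
        "Extract the concepts and principles involved in solving "
        f"this problem.\n\nProblem: {question}"
    )
    # Reasoning: answer the original question with the abstraction in context.
    answer = llm(
        f"Principles:\n{principle}\n\n"
        f"Using the principles above, answer the question.\n\n"
        f"Question: {question}"
    )
    return principle, answer


class StubLLM:
    """Stand-in model (assumption) that counts calls and returns canned text."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        if prompt.startswith("Extract"):
            return "Ideal gas law: PV = nRT"
        return "P decreases by a factor of 4"


stub = StubLLM()
principle, answer = step_back_answer(
    "What happens to the pressure P of an ideal gas if temperature is "
    "increased by a factor of 2 and the volume is increased by a factor of 8?",
    stub,
)
print(stub.calls, principle, answer, sep=" | ")
```

Replacing `StubLLM` with a wrapper around a real model client leaves the control flow unchanged, which is why the technique needs no fine-tuning or custom decoding.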

Worked example for a STEM question:

| Stage | Prompt or output |
|---|---|
| Original question | "What happens to the pressure P of an ideal gas if temperature is increased by a factor of 2 and the volume is increased by a factor of 8?"[^5] |
| Step-back question | "What are the physics principles behind this question?"[^5] |
| Abstraction answer | The ideal gas law: PV = nRT.[^5] |
| Final reasoning | Apply PV = nRT with T -> 2T and V -> 8V, giving P -> P/4.[^5] |
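The final reasoning row can be written out explicitly. Assuming the amount of gas n stays fixed (the question does not say so, but the result depends on it) and writing primed symbols for the final state:

```latex
\[
P' \;=\; \frac{nRT'}{V'} \;=\; \frac{nR\,(2T)}{8V}
   \;=\; \frac{2}{8}\cdot\frac{nRT}{V} \;=\; \frac{P}{4}
\]
```

The pressure therefore falls to one quarter of its initial value.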

Worked example for a knowledge question:

| Stage | Prompt or output |
|---|---|
| Original question | "Which school did Estella Leopold go to between Aug 1954 and Nov 1954?"[^5] |
| Step-back question | "What is Estella Leopold's education history?"[^5] |
| Abstraction answer | A summary of the universities she attended and corresponding date ranges.[^5] |
| Final reasoning | Cross-reference the August–November 1954 window with the timeline to select the correct institution.[^5] |
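The cross-referencing in the final row amounts to an interval-overlap check. The sketch below uses placeholder institutions and dates ("University A", "University B"), which are hypothetical and do not reflect Estella Leopold's actual record. `institutions_in_window` is likewise a name invented for the example.

```python
from datetime import date

# Hypothetical timeline entries: (institution, enrolment start, enrolment end).
timeline = [
    ("University A", date(1950, 9, 1), date(1953, 6, 30)),
    ("University B", date(1953, 9, 1), date(1955, 6, 30)),
]


def institutions_in_window(entries, start: date, end: date) -> list[str]:
    """Return institutions whose enrolment interval overlaps [start, end]."""
    return [name for name, s, e in entries if s <= end and e >= start]


# The Aug 1954 to Nov 1954 window from the original question.
result = institutions_in_window(timeline, date(1954, 8, 1), date(1954, 11, 30))
print(result)  # ['University B']
```

Once the step-back answer supplies the full timeline, the original question reduces to this kind of lookup, which is the effect the technique relies on.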
The authors evaluate Step-Back Prompting with PaLM-2L as the primary model and additionally report results for GPT-4 and Llama2-70B baselines.[^1][^4] They divide tasks into three categories.
For MMLU high-school physics and high-school chemistry, the authors report the following PaLM-2L accuracies.[^4]
| Method | MMLU Physics | MMLU Chemistry |
|---|---|---|
| PaLM-2L baseline | 66.4% | 70.9% |
| PaLM-2L + Chain-of-Thought | 65.0% | 75.3% |
| PaLM-2L + Take a Deep Breath | 65.7% | 73.8% |
| PaLM-2L + Step-Back | 73.2% | 81.8% |
This is a gain of 6.8 percentage points over the baseline on Physics and 10.9 points on Chemistry, with Step-Back outperforming Chain-of-Thought on both subsets.[^1][^4] On GSM8K the authors note that Step-Back is competitive with strong baselines but produces a smaller gap because the underlying arithmetic principles can already be inferred without explicit abstraction.[^1]
The knowledge tasks are TimeQA, which probes time-sensitive biographical and historical facts, and SituatedQA, which targets context-dependent answers.[^1][^4]
| Method | TimeQA | SituatedQA |
|---|---|---|
| PaLM-2L baseline | 41.5% | 54.3% |
| PaLM-2L + Chain-of-Thought | 40.8% | 56.4% |
| PaLM-2L + RAG | 57.4% | n/a |
| PaLM-2L + Step-Back | 66.0% | n/a |
| PaLM-2L + Step-Back + RAG | 68.7% | 61.0% |
| GPT-4 baseline | 45.6% | 63.2% |
On TimeQA the combination of Step-Back and retrieval lifts PaLM-2L by 27.2 percentage points over the direct prompt and surpasses GPT-4's baseline by 23.1 points; on SituatedQA the technique narrows the gap to GPT-4 to 2.2 points but does not close it.[^1][^4]
Multi-hop reasoning is measured with MuSiQue and StrategyQA.[^1][^4]
| Method | MuSiQue | StrategyQA |
|---|---|---|
| PaLM-2L baseline | 35.5% | 82.8% |
| PaLM-2L + Chain-of-Thought | 38.7% | n/a |
| PaLM-2L + Step-Back + RAG | 42.8% | 86.4% |
| GPT-4 baseline | 38.5% | 78.3% |
Step-Back lifts PaLM-2L on MuSiQue by 7.3 percentage points and surpasses the GPT-4 baseline on both multi-hop benchmarks.[^1][^4]
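The percentage-point gains quoted for the three task groups can be re-derived from the table values. The snippet below is only an arithmetic check: the numbers are copied from the tables above, and the dictionary layout is an assumption. For TimeQA, SituatedQA, MuSiQue, and StrategyQA the "best" figure is the Step-Back + RAG row.

```python
# (PaLM-2L baseline, best Step-Back variant) accuracies in percent.
scores = {
    "MMLU Physics": (66.4, 73.2),
    "MMLU Chemistry": (70.9, 81.8),
    "TimeQA": (41.5, 68.7),
    "SituatedQA": (54.3, 61.0),
    "MuSiQue": (35.5, 42.8),
    "StrategyQA": (82.8, 86.4),
}

# Percentage-point gain over the baseline, rounded to one decimal place.
gains = {task: round(best - base, 1) for task, (base, best) in scores.items()}

for task, gain in gains.items():
    print(f"{task}: +{gain} points")
```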
The paper includes a fine-grained breakdown of where remaining errors originate. On MMLU Physics, Step-Back corrects about 20.5% of errors made by the baseline while introducing 11.9% new errors; the net gain is positive but more than 90% of remaining mistakes occur in the reasoning step rather than in the abstraction step, indicating that producing the principle is easier for the model than applying it.[^5] On TimeQA, Step-Back fixes 39.9% of baseline errors and introduces only 5.6% new ones, but roughly 45% of remaining mistakes are traced to retrieval failures even when the step-back question is well formed.[^5] The authors conclude that "abstraction is easier" for current models, leaving reasoning and retrieval as the dominant bottlenecks.[^1][^5]
Step-Back Prompting sits in a family of training-free strategies that change how a problem is posed before the model attempts an answer. The table below summarises how it relates to several adjacent methods.
| Method | Core mechanism | Strength | Weakness relative to Step-Back |
|---|---|---|---|
| Chain-of-Thought (CoT) | Asks the model to produce intermediate reasoning before a final answer.[^1] | Single prompt, works on a wide range of arithmetic and commonsense problems. | Does not surface a higher-level principle when surface details overwhelm the model. |
| Self-Consistency | Samples multiple Chain-of-Thought traces and majority-votes the answer.[^7] | Reduces variance from a single sampling. | Cannot recover an answer when every chain shares the same misapplied principle; multiplies inference cost without changing the abstraction. |
| Retrieval-Augmented Generation (RAG) | Issues the query against an external corpus and conditions the answer on retrieved passages.[^1] | Brings in fresh evidence and grounds factual responses. | When the original query is too narrow, the retriever misses passages; combining RAG with Step-Back materially improves TimeQA and SituatedQA scores.[^4] |
| Take a Deep Breath | Adds a generic instruction such as "take a deep breath and work on this problem step-by-step." | Trivial implementation. | Underperforms Step-Back on MMLU Physics and Chemistry in the paper.[^4] |
| Step-Back Prompting | Generates a higher-level principle or paraphrase, then answers the original question with that abstraction in context.[^1][^5] | Recovers the relevant principle even when the original prompt is detail-heavy; composable with RAG.[^1] | Adds at least one extra model call; remaining errors concentrate in the reasoning step.[^5] |
Step-Back Prompting was rapidly picked up by the open-source prompting community. The LangChain team released a chat-model implementation roughly two weeks after the arXiv preprint, adapting the original few-shot abstraction template ("You are an expert at world knowledge. Your task is to step back and paraphrase a question to a more generic step-back question, which is easier to answer.") into a runnable chain that retrieves over both the original and step-back queries and then synthesises an answer.[^6][^8] The cookbook demonstration uses ChatOpenAI with temperature 0 for determinism and combines DuckDuckGo search results from both queries before the final answer step.[^6]
The technique has also been incorporated into educational prompt-engineering resources such as the LearnPrompting vocabulary and curriculum, which catalogues it as an extension of Chain-of-Thought that prepends a "preparatory stage" before reasoning.[^9] Google DeepMind lists the paper among its publications on reasoning research, citing the abstraction-then-reasoning framing.[^2][^3][^11] Practitioner-facing writeups have summarised the technique as adding a "reflection phase" before the model attempts an answer.[^12]
Use cases that have been reported or demonstrated for Step-Back Prompting include:
The authors and subsequent analyses identify several constraints.[^1][^5][^10]
Step-Back Prompting belongs to a broader literature on prompt-time reasoning aids. Closely related methods include Chain-of-Thought prompting, which introduced explicit intermediate reasoning; self-consistency decoding, which samples multiple chains; Tree of Thoughts, which explores branching reasoning paths; ReAct, which interleaves reasoning with action calls; HyDE (Hypothetical Document Embeddings), which generates a hypothetical answer to improve retrieval; and meta-prompting approaches that ask the model to compose its own reasoning structure. As in-context learning research has expanded, abstraction-first techniques have proven complementary to retrieval-based and ensembling-based approaches rather than substitutes.[^1][^4]
The authors emphasise that Step-Back is orthogonal to retrieval: combining Step-Back with Retrieval-Augmented Generation yields larger gains than either component alone on TimeQA, SituatedQA, MuSiQue, and StrategyQA.[^1][^4]