Step-Back Prompting

Updated Jul 23, 2026

Step-Back Prompting is a two-stage prompting technique introduced by researchers at Google DeepMind in October 2023. It elicits stronger reasoning from large language models by first asking the model to "step back" and articulate a higher-level concept or first principle related to a question, then to answer the original question with that abstraction supplied as additional context.^[1] The method was described in the paper "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models" by Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, Heng-Tze Cheng, Ed H. Chi, Quoc V. Le, and Denny Zhou, accepted as a poster at the Twelfth International Conference on Learning Representations (ICLR 2024).^[1]^[2]^[3] In the original experiments the technique improved PaLM-2L accuracy on MMLU high-school physics and chemistry by about 7 and 11 percentage points respectively, on TimeQA by 27 points, and on MuSiQue by 7 points; applied to GPT-4, the same prompt raised MMLU physics accuracy from 69.4% to 84.5%.^[1]^[4] The evaluation covers STEM, knowledge-intensive question answering, and multi-hop reasoning benchmarks with PaLM-2L, GPT-4, and Llama2-70B, and the authors attribute the improvement to the model's ability to recall and apply relevant principles before grounding a specific answer.^[1]^[4]

Why was Step-Back Prompting developed?

Researchers in prompt engineering have long observed that complex problems become tractable when a solver first identifies the relevant abstract principle. Chain-of-Thought prompting, introduced in 2022, demonstrated that asking a model to produce intermediate reasoning steps before a final answer dramatically improves accuracy on arithmetic and commonsense tasks.^[1] Subsequent techniques such as self-consistency sample multiple chains and majority-vote the final answer, while decomposition methods like Least-to-Most split a problem into smaller sub-questions.^[1]

The Step-Back Prompting authors argue that even with these advances, LLMs frequently fail on questions whose surface form contains many concrete details: the model is distracted by particulars and never retrieves the underlying physical law, knowledge fact, or temporal scope that would make the question easy.^[1] Their stated motivation is an analogy to human problem solving. They are "inspired by the fact that when faced with challenging tasks humans often step back and do abstractions to arrive at high-level principles to guide the process," and they design Step-Back Prompting "to ground reasoning on abstractions to reduce the chance of making errors in the intermediate reasoning steps."^[1]^[5]

The paper frames the failure mode this way: when faced with "What happens to the pressure P of an ideal gas if temperature is increased by a factor of 2 and the volume is increased by a factor of 8?" a model often manipulates numbers without invoking the ideal gas law PV = nRT, leading to errors.^[5] Likewise, knowledge questions such as "Which school did Estella Leopold go to between Aug 1954 and Nov 1954?" prove difficult even when the relevant biographical information is easily retrievable in answer to a broader question about her education history.^[5] Step-Back Prompting is positioned as a lightweight, training-free intervention that closes this gap.

How does Step-Back Prompting work?

Step-Back Prompting is a two-step procedure.^[1]^[5]

Abstraction. The model is shown the original question along with a short instruction and a few in-context examples that demonstrate how to rewrite a concrete question into a more generic "step-back question." For STEM tasks the abstraction prompt is "You are an expert at Physics/Chemistry. You are given a Physics/Chemistry problem. Your task is to extract the Physics/Chemistry concepts and principles involved in solving the problem."^[4] For knowledge and multi-hop questions the prompt is "You are an expert at world knowledge. Your task is to step back and paraphrase a question to a more generic step-back question, which is easier to answer."^[4]^[6] The model emits either the underlying principle (for STEM) or a paraphrased higher-level question (for knowledge tasks).
Reasoning. A second call answers the step-back question (either by direct generation or by retrieval-augmented generation over an external corpus) and then composes a final answer to the original question conditioned on both the step-back content and the original query.^[1]^[5] When retrieval is used, both the original question and the step-back question are independently issued to the retriever, and the combined evidence is concatenated into the answer prompt.^[6]

The procedure uses two model calls, one for abstraction and one for reasoning, which is one more than a direct prompt or a single-pass Chain-of-Thought prompt. When retrieval is used, it also adds a retrieval pass for the step-back question. It does not require any fine-tuning, weight updates, or special decoding strategy; only the prompt template and a handful of demonstrations change.^[1]^[5]
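The two stages can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `llm` is a hypothetical callable that maps a prompt string to a completion, the function names `build_abstraction_prompt` and `step_back_answer` are invented here, and the wording of the reasoning prompt is an assumption. Only the abstraction instruction is taken from the text above.

```python
from typing import Callable, List, Tuple

# Abstraction instruction for knowledge tasks, as quoted in the article.
ABSTRACTION_INSTRUCTION = (
    "You are an expert at world knowledge. Your task is to step back and "
    "paraphrase a question to a more generic step-back question, which is "
    "easier to answer."
)


def build_abstraction_prompt(
    question: str, demonstrations: List[Tuple[str, str]]
) -> str:
    """Assemble the few-shot prompt for the abstraction stage."""
    parts = [ABSTRACTION_INSTRUCTION]
    for original, step_back in demonstrations:
        parts.append(
            f"Original question: {original}\nStep-back question: {step_back}"
        )
    parts.append(f"Original question: {question}\nStep-back question:")
    return "\n\n".join(parts)


def step_back_answer(
    question: str,
    llm: Callable[[str], str],  # hypothetical: prompt in, completion out
    demonstrations: List[Tuple[str, str]],
) -> Tuple[str, str]:
    """Return (step-back question, final answer) using two model calls."""
    # Call 1: abstraction. The model rewrites the question at a higher level.
    step_back_question = llm(
        build_abstraction_prompt(question, demonstrations)
    ).strip()

    # Call 2: reasoning. The prompt wording below is an assumption.
    reasoning_prompt = (
        f"Step-back question: {step_back_question}\n"
        f"Original question: {question}\n\n"
        "First answer the step-back question, then use that answer to "
        "answer the original question."
    )
    final_answer = llm(reasoning_prompt).strip()
    return step_back_question, final_answer
```

Passing the model as a plain callable keeps the sketch independent of any particular client library; a real API client only needs to be wrapped in a function that takes and returns a string.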

Worked example: ideal gas law

Stage	Prompt or output
Original question	"What happens to the pressure P of an ideal gas if temperature is increased by a factor of 2 and the volume is increased by a factor of 8?"^[5]
Step-back question	"What are the physics principles behind this question?"^[5]
Abstraction answer	The ideal gas law: PV = nRT.^[5]
Final reasoning	Apply PV = nRT with T -> 2T and V -> 8V, giving P -> P/4.^[5]
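
The final reasoning step can be written out explicitly. The derivation assumes the amount of gas n stays constant, which the question implies but does not state:

```latex
\begin{aligned}
P  &= \frac{nRT}{V} \\
P' &= \frac{nR\,(2T)}{8V} = \frac{2}{8}\cdot\frac{nRT}{V} = \frac{P}{4}
\end{aligned}
```

The pressure therefore falls to one quarter of its original value.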

Worked example: biographical knowledge

Stage	Prompt or output
Original question	"Which school did Estella Leopold go to between Aug 1954 and Nov 1954?"^[5]
Step-back question	"What is Estella Leopold's education history?"^[5]
Abstraction answer	A summary of the universities she attended and corresponding date ranges.^[5]
Final reasoning	Cross-reference the August-November 1954 window with the timeline to select the correct institution.^[5]
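
The cross-referencing in the last row amounts to an interval lookup. The sketch below assumes the abstraction answer has already been parsed into dated enrollment records. The `Enrollment` type, the `institution_for_window` helper, and the placeholder timeline are all hypothetical and do not reproduce Leopold's actual record.

```python
from datetime import date
from typing import List, NamedTuple, Optional


class Enrollment(NamedTuple):
    institution: str
    start: date
    end: date


def institution_for_window(
    timeline: List[Enrollment], window_start: date, window_end: date
) -> Optional[str]:
    """Return the institution whose enrollment covers the whole window."""
    for entry in timeline:
        if entry.start <= window_start and window_end <= entry.end:
            return entry.institution
    return None  # no single enrollment spans the requested window


# Placeholder data for illustration only (not taken from the paper).
EXAMPLE_TIMELINE = [
    Enrollment("Institution A", date(1944, 9, 1), date(1948, 6, 30)),
    Enrollment("Institution B", date(1951, 9, 1), date(1955, 6, 30)),
]
```

In the technique itself the language model performs this lookup in natural language; the code only makes explicit why a full timeline is enough to answer a narrowly dated question.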

How much does Step-Back Prompting improve accuracy?

The authors evaluate Step-Back Prompting with PaLM-2L as the primary model and additionally report results for GPT-4 and Llama2-70B baselines.^[1]^[4] They divide tasks into three categories.

STEM

For MMLU high-school physics and high-school chemistry, the authors report the following accuracies.^[4]

Method	MMLU Physics	MMLU Chemistry
PaLM-2L baseline	66.4%	70.9%
PaLM-2L + Chain-of-Thought	65.0%	75.3%
PaLM-2L + Take a Deep Breath	65.7%	73.8%
PaLM-2L + Step-Back	73.2%	81.8%
GPT-4 baseline	69.4%	80.9%
GPT-4 + Step-Back	84.5%	85.6%

This is a +6.8 percentage-point gain over baseline on Physics (rounded to 7% in the abstract) and +10.9 points on Chemistry (11%), with Step-Back outperforming Chain-of-Thought on both subsets.^[1]^[4] The same prompt applied to GPT-4 lifted its MMLU physics accuracy from 69.4% to 84.5% and chemistry from 80.9% to 85.6%, showing that the abstraction step helps a stronger model as well as PaLM-2L; with Step-Back, PaLM-2L itself (73.2% physics, 81.8% chemistry) also surpassed the GPT-4 baseline (69.4% and 80.9%).^[4] On GSM8K the authors note that Step-Back is competitive with strong baselines but produces a smaller gap because the underlying arithmetic principles can already be inferred without explicit abstraction.^[1]

Knowledge-intensive QA

The knowledge tasks are TimeQA, which probes time-sensitive biographical and historical facts, and SituatedQA, which targets context-dependent answers.^[1]^[4]

Method	TimeQA	SituatedQA
PaLM-2L baseline	41.5%	54.3%
PaLM-2L + Chain-of-Thought	40.8%	56.4%
PaLM-2L + RAG	57.4%	59.3%
PaLM-2L + Step-Back	66.0%	57.5%
PaLM-2L + Step-Back + RAG	68.7%	61.0%
GPT-4 baseline	45.6%	63.2%

On TimeQA the combination of Step-Back and retrieval lifts PaLM-2L by 27.2 percentage points over the direct prompt and surpasses GPT-4's baseline by more than 23 points; on SituatedQA the technique narrows but does not close the gap to GPT-4.^[1]^[4]

Multi-hop reasoning

Multi-hop reasoning is measured with MuSiQue and StrategyQA.^[1]^[4]

Method	MuSiQue	StrategyQA
PaLM-2L baseline	35.5%	82.8%
PaLM-2L + Chain-of-Thought	38.7%	83.6%
PaLM-2L + RAG	39.6%	84.2%
PaLM-2L + Step-Back	42.6%	82.7%
PaLM-2L + Step-Back + RAG	42.8%	86.4%
GPT-4 baseline	38.5%	78.3%

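Combined with retrieval, Step-Back lifts PaLM-2L on MuSiQue by 7.3 percentage points over the baseline (7.1 points without retrieval), and both Step-Back configurations surpass the GPT-4 baseline on the two multi-hop benchmarks.^[1]^[4] On StrategyQA the gain over baseline comes entirely from combining Step-Back with retrieval: Step-Back alone scores 82.7%, marginally below the 82.8% baseline, versus 86.4% with RAG.^[4]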
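
Written out from the table above, the differences from the PaLM-2L baseline, in percentage points, are:

```latex
\begin{aligned}
\text{MuSiQue, Step-Back:}          &\quad 42.6 - 35.5 = 7.1 \\
\text{MuSiQue, Step-Back + RAG:}    &\quad 42.8 - 35.5 = 7.3 \\
\text{StrategyQA, Step-Back:}       &\quad 82.7 - 82.8 = -0.1 \\
\text{StrategyQA, Step-Back + RAG:} &\quad 86.4 - 82.8 = 3.6
\end{aligned}
```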

Error analysis

The paper includes a fine-grained breakdown of where remaining errors originate. On MMLU Physics, Step-Back corrects about 20.5% of errors made by the baseline while introducing 11.9% new errors; the net gain is positive but more than 90% of remaining mistakes occur in the reasoning step rather than in the abstraction step, indicating that producing the principle is easier for the model than applying it.^[5] On TimeQA, Step-Back fixes 39.9% of baseline errors and introduces only 5.6% new ones, but roughly 45% of remaining mistakes are traced to retrieval failures even when the step-back question is well formed.^[5] The authors conclude that "abstraction is an easy skill for the LLMs such as PaLM-2L via sample-efficient in-context learning," whereas "reasoning is still one of the hardest skills for LLMs to acquire: it is still the dominant failure mode even after the large reduction of task complexity by Step-Back Prompting."^[5]

How does Step-Back Prompting compare to other techniques?

Step-Back Prompting sits in a family of training-free strategies that change how a problem is posed before the model attempts an answer. The table below summarises how it relates to several adjacent methods.

Method	Core mechanism	Strength	Weakness relative to Step-Back
Chain-of-Thought	Asks the model to produce intermediate reasoning before a final answer.^[1]	Single prompt, works on a wide range of arithmetic and commonsense problems.	Does not surface a higher-level principle when surface details overwhelm the model.
Self-Consistency	Samples multiple Chain-of-Thought traces and majority-votes the answer.^[7]	Reduces variance from a single sampling.	Cannot recover an answer when every chain shares the same misapplied principle; multiplies inference cost without changing the abstraction.
Retrieval-Augmented Generation	Issues the query against an external corpus and conditions the answer on retrieved passages.^[1]	Brings in fresh evidence and grounds factual responses.	When the original query is too narrow, the retriever misses passages; combining RAG with Step-Back materially improves TimeQA and SituatedQA scores.^[4]
Take a Deep Breath	Adds a generic instruction such as "take a deep breath and work on this problem step-by-step."	Trivial implementation.	Underperforms Step-Back on MMLU Physics and Chemistry in the paper.^[4]
Step-Back Prompting	Generates a higher-level principle or paraphrase, then answers the original question with that abstraction in context.^[1]^[5]	Recovers the relevant principle even when the original prompt is detail-heavy; composable with RAG.^[1]	Adds at least one extra model call; remaining errors concentrate in the reasoning step.^[5]
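
For contrast, the majority vote used by self-consistency can be sketched as follows. `sample_chain` is a hypothetical callable that runs one sampled Chain-of-Thought pass and returns only its final answer; the function name and the default of five samples are assumptions made for illustration.

```python
from collections import Counter
from typing import Callable


def self_consistency_answer(
    question: str,
    sample_chain: Callable[[str], str],  # hypothetical: one sampled CoT pass
    n_samples: int = 5,
) -> str:
    """Sample several reasoning chains and return the most common answer."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [sample_chain(question) for _ in range(n_samples)]
    # most_common(1) yields a one-element list of (answer, count) pairs.
    return Counter(answers).most_common(1)[0][0]
```

If every sampled chain applies the same wrong principle, the vote returns that wrong answer, which is the weakness noted in the table. Step-Back changes what the chains are conditioned on, so the two methods address different failure modes.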

How is Step-Back Prompting implemented in practice?

Step-Back Prompting was rapidly picked up by the open-source prompting community. The LangChain team released a chat-model implementation roughly two weeks after the arXiv preprint, adapting the original few-shot abstraction template ("You are an expert at world knowledge. Your task is to step back and paraphrase a question to a more generic step-back question, which is easier to answer.") into a runnable chain that retrieves over both the original and step-back queries and then synthesises an answer.^[6]^[8] The cookbook demonstration uses ChatOpenAI with temperature 0 for determinism and combines DuckDuckGo search results from both queries before the final answer step.^[6]
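
The dual-query retrieval pattern described above can be sketched without any framework. This is not the LangChain cookbook code: `retrieve` is a hypothetical callable that returns a list of passages for a query, the helper names are invented here, dropping exact duplicate passages is an added assumption, and the answer-prompt wording is illustrative.

```python
from typing import Callable, List


def gather_evidence(
    question: str,
    step_back_question: str,
    retrieve: Callable[[str], List[str]],  # hypothetical retriever
) -> List[str]:
    """Query the retriever with both questions and merge the passages."""
    merged: List[str] = []
    seen = set()
    for query in (question, step_back_question):
        for passage in retrieve(query):
            # Assumption: skip passages already returned by the other query.
            if passage not in seen:
                seen.add(passage)
                merged.append(passage)
    return merged


def build_answer_prompt(
    question: str, step_back_question: str, passages: List[str]
) -> str:
    """Concatenate the combined evidence into the final answer prompt."""
    context = "\n\n".join(passages)
    return (
        f"Context:\n{context}\n\n"
        f"Step-back question: {step_back_question}\n"
        f"Original question: {question}\n"
        "Answer the original question using the context above."
    )
```

Issuing both queries matters because the narrow original question and the broader step-back question tend to surface different passages; the final prompt then sees the union of the two result sets.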

The technique has also been incorporated into educational prompt-engineering resources such as the Learn Prompting vocabulary and curriculum, which catalogues it as an extension of Chain-of-Thought that prepends a "preparatory stage" before reasoning.^[9] Google DeepMind lists the paper among its publications on reasoning research, citing the abstraction-then-reasoning framing.^[2]^[3]^[11] Practitioner-facing writeups have summarised the technique as adding a "reflection phase" before the model attempts an answer.^[12]

What is Step-Back Prompting used for?

Use cases that have been reported or demonstrated for Step-Back Prompting include:

Physics and chemistry tutoring where the model must first recall the controlling law (ideal gas law, conservation principles, stoichiometry) before manipulating numbers.^[1]^[5]
Time-sensitive question answering, in which a paraphrased question covering a broader time range produces stronger retrieval recall than the narrowly framed original query.^[4]^[6]
Multi-hop biographical and geographical reasoning, where the step-back question makes the connecting entity explicit before the final hop.^[1]^[5]
General-purpose RAG pipelines that combine the original query and a step-back query at retrieval time to assemble a richer evidence context.^[6]^[8]

What are the limitations of Step-Back Prompting?

The authors and subsequent analyses identify several constraints.^[1]^[5]^[10]

Reasoning bottleneck. On MMLU Physics, more than 90% of remaining errors after Step-Back occur in the reasoning step, not in the abstraction step. The technique surfaces the right principle but the model still misapplies it. The authors describe reasoning as "the dominant failure mode even after the large reduction of task complexity by Step-Back Prompting."^[1]^[5]
Not universally applicable. Some questions ask directly about a first principle (for example, "What is the speed of light?") or require pure factual recall (for example, "Who was president of the United States in 2000?"). In these cases there is nothing meaningful to abstract from and Step-Back can add latency without benefit.^[5]
Retrieval errors persist. On TimeQA, about 45% of remaining errors come from the retriever, even when the step-back question is well formed. Step-Back improves retrieval coverage but does not eliminate retrieval failure.^[5]
SituatedQA gap to GPT-4. Even with retrieval, PaLM-2L plus Step-Back reaches 61.0% on SituatedQA, still below the 63.2% baseline of GPT-4. The technique narrows but does not always close cross-model accuracy gaps.^[4]
Extra inference cost. Step-Back adds at least one extra LLM call (and one extra retrieval pass if RAG is used), which doubles or triples per-query latency and token spend compared with a single-shot prompt.^[1]^[6]
Effectiveness varies by task. Independent analyses note that Step-Back outperforms Chain-of-Thought and few-shot prompting on most tested datasets but is sensitive to the choice and number of demonstrations, with more examples not always producing better abstractions.^[10]

Related work

Step-Back Prompting belongs to a broader literature on prompt-time reasoning aids. Closely related methods include Chain-of-Thought prompting, which introduced explicit intermediate reasoning; self-consistency decoding, which samples multiple chains; Tree of Thoughts, which explores branching reasoning paths; ReAct, which interleaves reasoning with action calls; HyDE (Hypothetical Document Embeddings), which generates a hypothetical answer to improve retrieval; and meta-prompting approaches that ask the model to compose its own reasoning structure. As in-context learning research has expanded, abstraction-first techniques have proven complementary to retrieval-based and ensembling-based approaches rather than substitutes.^[1]^[4]

The authors emphasise that Step-Back is orthogonal to retrieval: combining Step-Back with Retrieval-Augmented Generation yields larger gains than either component alone on TimeQA, SituatedQA, MuSiQue, and StrategyQA.^[1]^[4]

References

[1] Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, Heng-Tze Cheng, Ed H. Chi, Quoc V. Le, Denny Zhou, "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models", arXiv, 2023-10-09. https://arxiv.org/abs/2310.06117. Accessed 2026-05-21.
[2] ICLR 2024 program committee, "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models (poster)", ICLR 2024 Virtual Site, 2024-05-07. https://iclr.cc/virtual/2024/poster/19503. Accessed 2026-05-21.
[3] OpenReview, "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models", OpenReview (ICLR 2024 forum 3bq3jsvcQ1), 2024-01-16. https://openreview.net/forum?id=3bq3jsvcQ1. Accessed 2026-05-21.
[4] Huaixiu Steven Zheng et al., "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models (HTML edition v2)", arXiv, 2024-03-12. https://arxiv.org/html/2310.06117v2. Accessed 2026-05-21.
[5] Huaixiu Steven Zheng et al., "Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models (HTML rendering)", arXiv, 2024-03-12. https://arxiv.org/html/2310.06117. Accessed 2026-05-21.
[6] Cobus Greyling, "The LangChain Implementation of DeepMind's Step-Back Prompting", Medium, 2023-10-26. https://cobusgreyling.medium.com/the-langchain-implementation-of-deepminds-step-back-prompting-9d698cf3e0c2. Accessed 2026-05-21.
[7] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou, "Self-Consistency Improves Chain of Thought Reasoning in Language Models", arXiv, 2022-03-21. https://arxiv.org/abs/2203.11171. Accessed 2026-05-21.
[8] LangChain (@LangChainAI), "Step-back prompting: a new prompting technique from Google DeepMind, can be used to improve RAG results. Now in LangChain.", X (Twitter), 2023-10-23. https://x.com/LangChainAI/status/1716482177331011833. Accessed 2026-05-21.
[9] Learn Prompting, "Step-Back Prompting", Learn Prompting Vocabulary, 2024-06-04. https://learnprompting.org/vocabulary/step-back_prompting. Accessed 2026-05-21.
[10] Aziz Belaweid, "Is Step Back Prompting The Best Prompting Strategy?", Substack, 2024-01-21. https://azizbelaweid.substack.com/p/is-step-back-prompting-the-best-prompting. Accessed 2026-05-21.
[11] Google DeepMind, "Step-Back Prompting Enables Reasoning via Abstraction in Large Language Models", Google DeepMind Publications, 2023-10-09. https://deepmind.google/research/publications/step-back-prompting-enables-reasoning-via-abstraction-in-large-language-models/. Accessed 2026-05-21.
[12] PromptHub, "A Step Forward with Step-Back Prompting", PromptHub Blog, 2024-02-15. https://www.prompthub.us/blog/a-step-forward-with-step-back-prompting. Accessed 2026-05-21.
