Self-Discover prompting
Last reviewed: May 21, 2026
Sources: No citations yet
Review status: Needs citations
Revision: v1 · 4,139 words
Self-Discover is a prompting framework for large language models in which the model first composes a task-specific reasoning structure from a library of atomic reasoning modules, and then follows that structure to solve every instance of the task.[^1] The method was introduced in the February 2024 paper "Self-Discover: Large Language Models Self-Compose Reasoning Structures" by Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou, Swaroop Mishra and Huaixiu Steven Zheng of Google DeepMind and the University of Southern California (USC).[^1][^2] Unlike Chain-of-Thought (CoT), which applies the same "think step by step" prompt to every task, Self-Discover runs a one-time meta-reasoning phase per task that selects, adapts and implements a structure tailored to that task's intrinsic shape, after which inference proceeds with no extra sampling overhead.[^1][^3] Reported gains over CoT reach as much as 32% across BIG-Bench Hard, T4D and MATH, while the method requires 10 to 40 times less inference compute than inference-heavy ensembles such as CoT with Self-Consistency.[^1][^4] The paper was accepted at the 38th Conference on Neural Information Processing Systems (NeurIPS) 2024.[^5][^6]
By late 2023 the prompting literature had produced a long catalogue of techniques designed to elicit better reasoning from LLMs: few-shot and zero-shot Chain-of-Thought prompting, decomposition methods such as least-to-most prompting and decomposed prompting, step-back prompting, scratchpad prompting, ReAct, and search-based variants like Tree of Thoughts and Graph of Thoughts.[^1] Each technique encodes a single implicit hypothesis about how a reasoning problem should be tackled. CoT assumes linear step-by-step decomposition; least-to-most assumes recursive sub-problem decomposition; step-back prompting assumes reflection on first principles is helpful; Tree of Thoughts assumes exploratory search with backtracking.[^1]
The authors of Self-Discover argued that this monolithic choice is a fundamental limitation: applying any single prior assumption uniformly to all problems is suboptimal because each task has its own intrinsic reasoning shape.[^1] They cited the example that least-to-most prompting is much more effective than CoT at symbolic manipulation and compositional generalization owing to the decomposition structure of those tasks, but is not necessarily better elsewhere.[^1] A second motivation was efficiency. Methods that improve CoT, such as CoT with Self-Consistency or majority voting across many reasoning modules, require sampling many decoding trajectories per instance and therefore pay a 10 to 40 times inference cost penalty.[^1][^4]
A third motivation was interpretability. Prompt-optimization methods such as OPRO (Yang et al. 2023) produce strings that improve task accuracy but read as opaque token sequences whose effect on reasoning is hard to inspect. The Self-Discover authors wanted a structure that exposes the model's beliefs about how to solve the task in a form that humans can read, audit and edit.[^1] The JSON-style key-value reasoning template that the IMPLEMENT step produces is intended to be both machine-followable (the decoder fills in values) and human-readable (a developer can inspect or hand-edit it).[^1]
Self-Discover was framed as the "missing piece" in the prompting literature: a mechanism that lets the model itself compose a structure from a library of basic reasoning building blocks, analogous to how a program is written from primitive constructs such as for-loops and conditionals.[^1] The authors drew an explicit cognitive-science analogy to Newell, Shaw and Simon's 1958 "Elements of a theory of human problem solving" and to Jens Rasmussen's 1983 hierarchical model of human skills, knowledge and rules, framing the meta-reasoning step as analogous to the way humans first decide what kind of problem they face before applying any particular technique to it.[^1]
Self-Discover operates in two stages. Stage 1 is task-level and produces a reasoning structure once per task. Stage 2 is instance-level and reuses the structure on every test example. Because Stage 1 is amortized across all instances of a task, the per-instance call count is the same as for Direct prompting or CoT (one LLM call per instance), plus three task-level meta-prompt calls in total.[^1][^3]
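Stage 1 consists of three sequential meta-prompt calls, SELECT, ADAPT and IMPLEMENT, all executed on a small set of unlabeled task examples: SELECT picks the reasoning modules relevant to the task from the library, ADAPT rephrases the selected modules so that they are specific to the task, and IMPLEMENT turns the adapted modules into a JSON-style key-value reasoning structure.[^1]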
Each of the three actions is implemented as a single LLM call with its own meta-prompt. Stage 1 thus costs exactly three additional LLM calls per task and produces a frozen reasoning template D_I.[^1][^3]
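The three Stage 1 calls can be sketched as a short Python function. This is a minimal illustration, not the authors' code: the `llm` callable (prompt in, completion out), the function name `discover_structure` and the wording of the three meta-prompts are all assumptions made here, and the paper's actual meta-prompts differ.

```python
from typing import Callable, Sequence

# Hypothetical interface: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]


def discover_structure(llm: LLM, modules: Sequence[str], examples: Sequence[str]) -> str:
    """Run SELECT, ADAPT and IMPLEMENT once per task and return the structure D_I."""
    module_text = "\n".join(f"{i + 1}. {m}" for i, m in enumerate(modules))
    example_text = "\n\n".join(examples)

    # SELECT: pick the modules that look useful for this task (paraphrased meta-prompt).
    selected = llm(
        "Select the reasoning modules that are crucial for solving the tasks below.\n\n"
        f"Reasoning modules:\n{module_text}\n\nTask examples:\n{example_text}"
    )

    # ADAPT: rephrase the selected modules so they are specific to the task.
    adapted = llm(
        "Rephrase each selected module so that it is specific to the tasks below.\n\n"
        f"Selected modules:\n{selected}\n\nTask examples:\n{example_text}"
    )

    # IMPLEMENT: turn the adapted modules into a JSON-style key-value plan.
    structure = llm(
        "Turn the adapted modules into a step-by-step reasoning structure in JSON, "
        "with one key per step and empty values.\n\n"
        f"Adapted modules:\n{adapted}\n\nTask examples:\n{example_text}"
    )
    return structure
```

Each call consumes the output of the previous one, so the function makes exactly three LLM calls regardless of how many instances the task later contains.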
In Stage 2, the implemented structure D_I is prepended to every test instance t of the task, and the model is instructed to fill in the values for each key step by step and produce a final answer.[^1] This stage requires exactly one LLM call per instance, the same number as Direct or CoT prompting. Because the structure already encodes the chain of reasoning steps, no extra sampling, no self-consistency voting and no tree-search expansion is needed.[^1][^4]
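Stage 2 can be sketched in the same spirit. The snippet below is an assumption-laden illustration: `llm`, `solve_instance`, `solve_task` and the instruction wording are hypothetical names and text introduced here, not taken from the paper.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]


def solve_instance(llm: LLM, structure: str, instance: str) -> str:
    """Prepend the frozen structure to one instance and make a single LLM call."""
    prompt = (
        "Follow the step-by-step reasoning structure below to solve the task. "
        "Fill in the value for each key, then give the final answer.\n\n"
        f"Reasoning structure:\n{structure}\n\nTask:\n{instance}"
    )
    return llm(prompt)


def solve_task(llm: LLM, structure: str, instances: Sequence[str]) -> List[str]:
    """Reuse the same structure for every instance: one call per instance."""
    return [solve_instance(llm, structure, t) for t in instances]
```

The structure is passed in as a plain string and never regenerated, which is what keeps the per-instance call count at one.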
The library of 39 reasoning modules is adopted from Promptbreeder, the self-referential prompt-evolution framework introduced by Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero and Tim Rocktäschel in 2023.[^7] Each module is a short natural-language description of a high-level cognitive heuristic. The modules span experimental design ("How could I devise an experiment to help solve that problem?"), critical reasoning ("Critical Thinking: This style involves analyzing the problem from different perspectives"), systems thinking ("Use systems thinking"), reflection, collaboration, risk analysis, creative thinking and explicit step-by-step decomposition ("Let's think step by step"), among others.[^1][^7] The full list is published in Appendix A of the Self-Discover paper.[^1]
Representative module categories include:
The modules are intentionally redundant and overlap with existing prompting techniques: "Let's think step by step" reproduces zero-shot CoT, decomposition modules reproduce least-to-most prompting, and reflective modules approximate step-back prompting.[^1][^3] Self-Discover does not replace these techniques so much as treat them as primitive building blocks that the model can combine. Because the library was inherited from Promptbreeder rather than tuned for Self-Discover, the paper makes no claim that 39 is optimal; it is a fixed, public, model-independent seed set.[^1][^7]
The authors evaluated Self-Discover on a deliberately heterogeneous set of 25 reasoning tasks drawn from three benchmarks.[^1]
Four LLMs were tested: GPT-4 (gpt-4-turbo-preview), GPT-3.5-turbo, the instruction-tuned PaLM 2-L (a stronger instruction-tuned PaLM 2-L variant was used for MATH), and the open-source Llama 2-70B.[^1]
Baselines included Direct prompting, zero-shot CoT (Kojima et al. 2022), Plan-and-Solve prompting (Wang et al. 2023), CoT with Self-Consistency (sampling 10 outputs and majority-voting), majority voting across each of the 39 reasoning modules applied independently, "best of each RM" (an oracle baseline that picks the highest-accuracy single module per task using gold labels), and the prompt-optimization method OPRO (Yang et al. 2023).[^1]
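For comparison, the CoT with Self-Consistency baseline amounts to sampling several answers and taking the most common one. A minimal sketch, assuming a hypothetical `sample_answer` callable that returns one sampled final answer per call:

```python
from collections import Counter
from typing import Callable


def self_consistency(sample_answer: Callable[[str], str], question: str, n_samples: int = 10) -> str:
    """Sample n_samples answers for one question and return the majority answer."""
    # Each sample is a separate LLM call, so the cost is n_samples calls per instance.
    answers = [sample_answer(question) for _ in range(n_samples)]
    # most_common(1) returns the answer with the highest count.
    return Counter(answers).most_common(1)[0][0]
```

With the default of 10 samples this issues 10 calls for every instance, which is the source of the cost gap discussed in the efficiency comparison.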
On the 23 aggregated BBH tasks, Self-Discover gave PaLM 2-L a 7-point absolute improvement over CoT and a 6-point improvement over Plan-and-Solve; on GPT-4 the corresponding gains were 6 and 8 points.[^1] Self-Discover outperformed both Direct answering and CoT on more than 20 of 25 tasks in the zero-shot setting on PaLM 2-L.[^1]
On the social-reasoning T4D benchmark, the gains were larger. Self-Discover reached 69% accuracy on PaLM 2-L and 85% on GPT-4, an absolute improvement of at least 27 percentage points (32 points on GPT-4) over all baselines, and it surpassed the previous state-of-the-art prompting method Foresee and Reflect (FaR) without needing an expert-designed structure.[^1]
On MATH, improvements were more modest: 1 to 7 percentage points on PaLM 2-L and 2 to 3 points on GPT-4. Error analysis showed that 87.5% of the discovered reasoning structures on MATH were judged correct by human experts (an expert could follow the structure to solve the task), and that the dominant remaining failure mode (74.7% of errors) was arithmetic computation inside the otherwise correct reasoning structure, consistent with prior findings about LLM arithmetic.[^1]
Across all benchmarks the headline gain over CoT was reported as up to 32% on PaLM 2-L and GPT-4, the figure that was widely quoted in subsequent coverage.[^4][^8]
The efficiency advantage is the second headline. On a representative subset of tasks the authors plotted accuracy against inference calls per instance. Self-Discover uses one call per instance plus three task-level meta-prompts (amortized over the whole task), CoT with Self-Consistency uses 10 calls per instance (because 10 samples are drawn), and majority voting across all 39 reasoning modules uses 40 calls per instance.[^1][^4] Self-Discover outperformed both of those inference-heavy ensembles on the subset while using 10 to 40 times fewer calls.[^1][^4]
| Method | Inference calls per instance | Source |
|---|---|---|
| Direct prompting | 1 | [^1] |
| Chain-of-Thought | 1 | [^1] |
| Self-Discover | 1 (plus 3 task-level meta-prompts) | [^1] |
| CoT with Self-Consistency (10 samples) | 10 | [^1] |
| Majority voting across all 39 reasoning modules | 40 | [^1] |
The paper also noted that the meta-prompts and JSON-style structure inflate the input and output tokens compared with Direct or CoT prompting, so the per-instance dollar cost is somewhat higher than the call count alone suggests; the efficiency advantage grows as the number of instances per task grows because Stage 1 is paid only once.[^1]
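The amortization argument can be checked with simple arithmetic. The sketch below counts LLM calls only (not tokens), uses the per-instance figures from the table above, and introduces the hypothetical helper names `total_calls` and `call_ratio` for illustration.

```python
def total_calls(n_instances: int, per_instance: int, task_level: int = 0) -> int:
    """Total LLM calls for a task: one-off task-level calls plus per-instance calls."""
    return task_level + per_instance * n_instances


def call_ratio(n_instances: int, baseline_per_instance: int) -> float:
    """How many times more calls a baseline needs than Self-Discover (1 per instance + 3)."""
    baseline = total_calls(n_instances, baseline_per_instance)
    self_discover = total_calls(n_instances, per_instance=1, task_level=3)
    return baseline / self_discover


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, round(call_ratio(n, 10), 2), round(call_ratio(n, 40), 2))
```

For a task with 100 instances, Self-Discover needs 103 calls against 1,000 for 10-sample Self-Consistency, a ratio of about 9.7; with a single instance the ratio is only 2.5, because the three meta-prompt calls dominate. The ratios approach 10 and 40 only as the number of instances grows.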
An ablation on GPT-4 across four tasks showed that all three Stage 1 actions contribute: SELECT alone helps, SELECT plus ADAPT helps more, and the full SELECT-ADAPT-IMPLEMENT pipeline is the best.[^1] Removing IMPLEMENT in particular hurts because the JSON-style structure is what binds the modules into a coherent decoding plan.[^1]
A transferability study addressed whether the discovered structures encode universal task-knowledge or are model-specific. Reasoning structures discovered by PaLM 2-L were applied to GPT-4 and beat OPRO-optimized prompts (which had been trained on 20% of each task's data) on 3 of 4 tasks, despite Self-Discover being zero-shot.[^1] Reasoning structures discovered by GPT-4 were applied to Llama 2-70B and ChatGPT (GPT-3.5-turbo): on disambiguation-QA, Llama 2-70B reached 52% with the GPT-4-discovered structure versus 42% with CoT, and on geometry GPT-3.5-turbo reached 56% versus 51% for CoT (with a 3-shot demonstration).[^1] The authors interpreted this transferability as evidence that the structures capture task-intrinsic, not model-intrinsic, reasoning patterns.[^1]
Breaking BBH down by Suzgun et al.'s four task categories, the authors found that Self-Discover's largest gains over Direct and CoT were on tasks requiring world knowledge (such as sports understanding, movie recommendation and ruin names). Gains on the algorithmic and multi-step arithmetic category were moderate, mirroring the relatively small lift on MATH and the finding that the bottleneck on those tasks is arithmetic execution rather than reasoning structure.[^1]
The interpretation offered by the authors is that world-knowledge tasks benefit most because they reward the integration of several distinct cognitive moves (recalling facts, reasoning from constraints, considering exceptions) within a single decoding pass; a CoT chain composed of a single uniform thinking style tends to skip one of those moves, while a discovered structure that explicitly names "recall similar problems", "consider constraints" and "verify consistency" as separate keys forces the model to produce content for each.[^1] On purely algorithmic tasks, by contrast, the optimal structure is closer to plain step-by-step decomposition, which CoT already supplies, leaving Self-Discover with less headroom.[^1]
The paper reproduces several discovered structures to illustrate the heterogeneity of what the model composes for different tasks.[^1] For a geometric-shapes task in BBH, the discovered structure tells the decoder to first list the line segments, then analyze coordinates carefully, then determine whether the path is closed; CoT and Plan-and-Solve both got the same task wrong because they asserted early that the path was not closed and never recovered. For a parenthesis-parsing task, the structure invokes a creative-thinking module and proposes a stack-based algorithm in natural language, an unusual move that does not appear in a plain CoT trace.[^1] These examples illustrate the paper's central qualitative claim: the discovered structures are not just CoT in disguise but encode task-specific algorithmic choices that the model would not otherwise commit to in advance.[^1]
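To make the key-value format concrete, the snippet below builds an illustrative structure for the geometric-shapes example. The key names paraphrase the steps described above and are assumptions made here; they are not the wording of the structure printed in the paper, and the helper names are hypothetical.

```python
import json
from typing import Dict, List

# Illustrative only: keys paraphrase the steps described in the text.
# Values are left empty because the model fills them in at Stage 2.
geometric_shapes_structure: Dict[str, str] = {
    "List the line segments in the path": "",
    "Analyze the coordinates of each segment": "",
    "Determine whether the path is closed": "",
    "Final answer": "",
}


def render_structure(structure: Dict[str, str]) -> str:
    """Serialize the structure as indented JSON, ready to prepend to a task instance."""
    return json.dumps(structure, indent=2)


def unfilled_keys(structure: Dict[str, str]) -> List[str]:
    """Return the steps that still have no value, for example to audit a model's output."""
    return [key for key, value in structure.items() if not value.strip()]
```

Because the template is ordinary JSON, a developer can read it, reorder or reword the steps by hand, and check a model's completed output for skipped steps with a helper such as `unfilled_keys`.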
Self-Discover situates itself as a meta-method that composes over the others rather than competing with any single technique.[^1] The contrast with the four most closely related methods is summarised in the table below.
| Method | When the structure is fixed | What the model does | Per-instance inference calls |
|---|---|---|---|
| Chain-of-Thought (Wei et al. 2022; Kojima et al. 2022) | A priori; same structure for every task | Generates a linear step-by-step trace | 1 |
| Plan-and-Solve (Wang et al. 2023) | A priori; "make a plan then solve" template | Writes a plan then executes it | 1 |
| Tree of Thoughts (Yao et al. 2023) | A priori; tree search with backtracking | Explores branching reasoning tree | Many (search) |
| CoT + Self-Consistency (Wang et al. 2022) | A priori CoT, ensembled | Samples many CoT traces and majority-votes | 10 or more |
| Self-Discover (Zhou et al. 2024) | Discovered per task at Stage 1 | Composes selected modules, then follows the structure | 1 plus 3 task-level meta-prompts amortized over the task |
Plan-and-Solve is the closest prior work because both methods first generate something resembling a plan and then act on it. Self-Discover differs in two ways: the plan is grounded in atomic modules drawn from a library, rather than being free-form, and the decoded answer must follow an explicit key-value reasoning structure rather than free prose.[^1] Tree of Thoughts differs more sharply, exploring many reasoning trajectories per instance and incurring substantial extra compute; Self-Discover trades trajectory exploration for a single, well-structured trajectory whose shape was chosen by the model itself at Stage 1.[^1] CoT with Self-Consistency is an ensemble; Self-Discover beats it on the comparison subset with 10 times less inference compute.[^1]
The paper also compares against OPRO, a method that uses an LLM to iteratively optimise a prompt against a labelled training set. Self-Discover matches or exceeds OPRO without any training data, and the authors argue that explicit reasoning structures transfer between models more robustly than optimised prompt wordings.[^1]
The paper was posted to arXiv on 6 February 2024 as 2402.03620, and was widely covered in trade press within days.[^2] VentureBeat reported the announcement on 8 February 2024 under the headline "Google DeepMind proposes 'self-discover' framework for LLMs, improves GPT-4 performance" and highlighted the 32% headline gain over CoT and the 10 to 40 times inference-compute reduction over Self-Consistency.[^8] Tech Xplore covered it shortly thereafter, framing Self-Discover as a method that allows LLMs "to find and use task-intrinsic reasoning structures" and emphasising the two-step "first compose, then execute" mechanism.[^9] Google DeepMind also listed the paper on its own publications page under the title "Large Language Models Self-Discover Reasoning Structures", attributing the work to a joint team from DeepMind and USC.[^13]
The arXiv version carries the keyword tag "Machine Learning, ICML", reflecting the ICML-style LaTeX template the authors used, but the paper was ultimately accepted to the 38th Conference on Neural Information Processing Systems (NeurIPS 2024) as a poster, not at ICML.[^5][^6] Co-author Pei Zhou announced the acceptance on Twitter / X in September 2024.[^10] The paper appears in the official NeurIPS proceedings and is indexed in dblp under the key conf/nips/ZhouP0CCLCZMZ24.[^5][^6] The OpenReview record gives a publication date of 25 September 2024 and licenses the work under Creative Commons Attribution 4.0.[^11]
| Field | Value |
|---|---|
| arXiv ID | 2402.03620 |
| First posted | 6 February 2024 |
| Lead author | Pei Zhou |
| Affiliations | Google DeepMind; University of Southern California |
| Conference | NeurIPS 2024 (poster) |
| License | Creative Commons Attribution 4.0 (CC BY 4.0) |
Sources: arXiv listing,[^2] NeurIPS proceedings entry,[^6] dblp record for Denny Zhou,[^5] OpenReview record.[^11]
The paper itself names two limitations.[^1] First, on tasks where the bottleneck is execution rather than planning (most notably arithmetic-heavy benchmarks such as MATH), Self-Discover's gain shrinks because the discovered structure is already roughly correct; the model fails on calculation rather than on choosing the right reasoning steps. The 74.7% computation-error rate on MATH failures is the strongest evidence for this ceiling.[^1] Second, while per-instance inference calls are the same as CoT, the meta-prompts and the JSON-style structure inflate the input and output token counts and therefore the dollar cost; the amortized advantage only manifests when many instances share a task.[^1]
Implementation analyses in the broader prompting community add several caveats.[^3] Stage 1 requires that the underlying model be capable enough to follow long meta-prompts and produce coherent JSON structures; the authors observed low-quality structure outputs when they tried to run Stage 1 on Llama 2 directly, which is why the cross-model experiments used GPT-4 or PaLM 2-L as the discoverer and treated Llama 2 as the consumer.[^1] Self-Discover thus presumes the existence of a strong "discovery" model and inherits a teacher / student asymmetry: a small model cannot bootstrap its own structures with the same quality, and benefits most when a larger model provides them.
Self-Discover does not address how the 39-module library was chosen, and uses the Promptbreeder set unchanged. The library is not exhaustive (it lacks, for instance, modules for tool use, retrieval and code execution) and the framework gives no procedure for extending it.[^1][^7] More recent agentic prompting work that relies on tool calls, retrieval-augmented generation and code execution is therefore complementary rather than a direct comparison.
A further concern is that the published comparison is restricted to prompting baselines: zero-shot methods such as Direct, CoT, Plan-and-Solve and CoT with Self-Consistency, plus the prompt optimizer OPRO, which does use labelled training data. The paper does not compare directly against fine-tuned reasoning models, against Tree of Thoughts on its tested tasks, or against later 2024 methods such as reflection-based self-correction and verifier-augmented decoding. The headline 32% figure is also the maximum gain, an absolute improvement reached on T4D with GPT-4; aggregated gains on BBH are 6 to 8 percentage points absolute, which several commentators noted is a more modest but still meaningful improvement.[^4][^8]
There is no official Google DeepMind code release. Several open-source community implementations exist, including a Python reimplementation by Kailash S P on GitHub, and Self-Discover has been packaged as a reusable pattern in agent-orchestration tutorials and "agentic patterns" libraries.[^12][^14] The four-step pattern (SELECT, ADAPT, IMPLEMENT, EXECUTE) is also taught as a reference design for structured reasoning in agent frameworks, where the three Stage 1 calls (SELECT, ADAPT and IMPLEMENT) are typically run offline as part of an agent's setup routine and the discovered structure is cached for reuse across user sessions.[^12]
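The caching pattern mentioned above can be sketched with the standard library alone. This is a hypothetical illustration, not code from any of the community implementations: the function name `load_or_discover`, the one-file-per-task layout and the `discover` callback (which stands in for the three Stage 1 calls) are all assumptions.

```python
import json
from pathlib import Path
from typing import Callable


def load_or_discover(task_name: str, discover: Callable[[], str], cache_dir: Path) -> str:
    """Return the cached structure for a task, running discovery only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{task_name}.json"

    # Cache hit: reuse the structure produced in an earlier session.
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["structure"]

    # Cache miss: pay the Stage 1 cost once, then persist the result.
    structure = discover()
    path.write_text(
        json.dumps({"task": task_name, "structure": structure}, indent=2),
        encoding="utf-8",
    )
    return structure
```

Storing the structure as a readable JSON file also leaves room for the hand-editing that the key-value format is meant to allow.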
Subsequent prompting research has cited Self-Discover as motivation for further structured-reasoning methods, including hybrid approaches that combine self-discovered structures with retrieval or with tool-using agents. The reasoning-module library itself has been reused in follow-up work, and the SELECT / ADAPT / IMPLEMENT decomposition is now a common reference for "compose-then-execute" prompting designs. Self-Discover is also frequently grouped with Chain-of-Thought, Tree of Thoughts, Self-Consistency and ReAct in textbooks and survey papers as one of the canonical 2022 to 2024 prompting techniques for reasoning in large language models.
Self-Discover represents a shift in how the prompting literature thinks about reasoning structure. Earlier methods, from Chain-of-Thought to Tree of Thoughts, embedded a single human prior about how reasoning should unfold and applied that prior to every problem. Self-Discover instead treats reasoning structure itself as a variable to be inferred, with the model selecting and composing modules from a public library at zero-shot inference time. The result is a method that is simultaneously more accurate than CoT on heterogeneous benchmarks and substantially cheaper than ensemble methods such as Self-Consistency, because the reasoning template is paid for once per task rather than once per instance.[^1][^4]
The transferability results add a second contribution: discovered structures generalise across model families, suggesting that good reasoning templates encode something about the task rather than about the model that produced them.[^1] That observation is part of the reason Self-Discover has been cited in subsequent work on "reasoning structure libraries" for agentic AI and on universal prompting designs for smaller open-source models.