Self-Discover prompting

Updated Jul 16, 2026

Self-Discover is a prompting framework for large language models in which the model first composes a task-specific reasoning structure from a library of atomic reasoning modules, and then follows that structure to solve every instance of the task.^[1] The method was introduced in the February 2024 paper "Self-Discover: Large Language Models Self-Compose Reasoning Structures" by Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou, Swaroop Mishra and Huaixiu Steven Zheng of Google DeepMind and the University of Southern California (USC).^[1]^[2] Unlike Chain-of-Thought (CoT), which applies the same "think step by step" prompt to every task, Self-Discover runs a one-time meta-reasoning phase per task that selects, adapts and implements a structure tailored to that task's intrinsic shape, after which inference proceeds with no extra sampling overhead.^[1]^[3] Reported gains over CoT reach 32% on BIG-Bench Hard, T4D and MATH, while inference compute is 10 to 40 times lower than Self-Consistency ensembles.^[1]^[4] The paper was accepted at the 38th Conference on Neural Information Processing Systems (NeurIPS) 2024.^[5]^[6]

Background and motivation

By late 2023 the prompting literature had produced a long catalogue of techniques designed to elicit better reasoning from LLMs: few-shot and zero-shot Chain-of-Thought prompting, decomposition methods such as least-to-most prompting and decomposed prompting, step-back prompting, scratchpad prompting, ReAct, and search-based variants like Tree of Thoughts and Graph of Thought.^[1] Each technique encodes a single implicit hypothesis about how a reasoning problem should be tackled. CoT assumes linear step-by-step decomposition; least-to-most assumes recursive sub-problem decomposition; step-back prompting assumes reflection on first principles is helpful; Tree of Thoughts assumes exploratory search with backtracking.^[1]

The authors of Self-Discover argued that this monolithic choice is a fundamental limitation: applying any single prior assumption uniformly to all problems is suboptimal because each task has its own intrinsic reasoning shape.^[1] They cited the example that least-to-most prompting is much more effective than CoT at symbolic manipulation and compositional generalization owing to the decomposition structure of those tasks, but is not necessarily better elsewhere.^[1] A second motivation was efficiency. Methods that improve CoT, such as CoT with Self-Consistency or majority voting across many reasoning modules, require sampling many decoding trajectories per instance and therefore pay a 10 to 40 times inference cost penalty.^[1]^[4]

A third motivation was interpretability. Prompt-optimization methods such as OPRO (Yang et al. 2023) produce strings that improve task accuracy but read as opaque token sequences whose effect on reasoning is hard to inspect. The Self-Discover authors wanted a structure that exposes the model's beliefs about how to solve the task in a form that humans can read, audit and edit.^[1] The JSON-style key-value reasoning template that the IMPLEMENT step produces is intended to be both machine-followable (the decoder fills in values) and human-readable (a developer can inspect or hand-edit it).^[1]

Self-Discover was framed as the "missing piece" in the prompting literature: a mechanism that lets the model itself compose a structure from a library of basic reasoning building blocks, analogous to how a program is written from primitive constructs such as for-loops and conditionals.^[1] The authors drew an explicit cognitive-science analogy to Newell, Shaw and Simon's 1958 "Elements of a theory of human problem solving" and to Jens Rasmussen's 1983 hierarchical model of human skills, knowledge and rules, framing the meta-reasoning step as analogous to the way humans first decide what kind of problem they face before applying any particular technique to it.^[1]

Method

Self-Discover operates in two stages. Stage 1 is task-level and produces a reasoning structure once per task. Stage 2 is instance-level and reuses that structure on every test example. Because Stage 1 is amortized across all instances of a task, the per-instance inference cost matches Direct prompting or CoT (one LLM call per instance); the only overhead is three task-level meta-prompt calls in total.^[1]^[3]

Stage 1: SELECT, ADAPT, IMPLEMENT

Stage 1 consists of three sequential meta-prompt calls, all executed on a small set of unlabeled task examples.^[1]

SELECT. Given the full library D of 39 reasoning module descriptions and a handful of unlabeled examples t_i from task T, the LLM is prompted to pick a subset D_S of modules that look useful for this task. The paper's illustrative examples are that "reflective thinking" might be selected for science questions and "creative thinking" for narrative-continuation problems.^[1]
ADAPT. The selected module descriptions, which are written in generic English, are rewritten by the LLM to be task-specific. For instance, "break the problem into sub-problems" might be specialised to "calculate each arithmetic operation in order" for an arithmetic task. The result is an adapted set D_A.^[1]
IMPLEMENT. The adapted modules are operationalized into a structured plan D_I with concrete instructions for what to produce at each step. The IMPLEMENT meta-prompt includes a human-written demonstration of a reasoning structure on a different task to anchor the format. Structures are emitted as key-value pairs resembling JSON, because the authors had observed that following JSON-style structures improves the model's reasoning and generation quality.^[1]

Each of the three actions is implemented as a single LLM call with its own meta-prompt. Stage 1 thus costs exactly three additional LLM calls per task and produces a frozen reasoning template D_I.^[1]^[3]
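The three Stage 1 calls can be sketched as a small pipeline. This is a minimal illustration, not the authors' code: the `LLM` callable interface, the function name `discover_structure` and the meta-prompt wording are all assumptions made for the example, and the human-written demonstration that the paper includes in the IMPLEMENT meta-prompt is omitted for brevity. A stub model stands in for a real LLM so that the call count is visible.

```python
from typing import Callable, List

# Hypothetical interface: prompt text in, completion text out.
LLM = Callable[[str], str]


def discover_structure(llm: LLM, modules: List[str], examples: List[str]) -> str:
    """Run SELECT, ADAPT and IMPLEMENT as three sequential LLM calls."""
    task_block = "\n".join(f"- {ex}" for ex in examples)
    module_block = "\n".join(f"{i + 1}. {m}" for i, m in enumerate(modules))

    # SELECT: choose the subset D_S of modules that look useful for the task.
    selected = llm(
        "Select the reasoning modules that are useful for solving the task.\n"
        f"Modules:\n{module_block}\nTask examples:\n{task_block}"
    )
    # ADAPT: rewrite the generic descriptions into task-specific ones (D_A).
    adapted = llm(
        "Rephrase each selected module so that it is specific to the task.\n"
        f"Selected modules:\n{selected}\nTask examples:\n{task_block}"
    )
    # IMPLEMENT: turn the adapted modules into a key-value plan (D_I).
    structure = llm(
        "Turn the adapted modules into a step-by-step reasoning structure "
        "written as JSON keys with empty values.\n"
        f"Adapted modules:\n{adapted}\nTask examples:\n{task_block}"
    )
    return structure


# Stub model (an assumption for the demo): records prompts, returns a tag.
calls: List[str] = []


def stub_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"<output of call {len(calls)}>"


structure = discover_structure(
    stub_llm,
    ["Let's think step by step", "Use systems thinking"],
    ["2 + 3 * 4 = ?"],
)
print(len(calls), structure)
```

Each call consumes the previous call's output, which is why the three actions cannot be parallelised; the returned string plays the role of the frozen template D_I.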

Stage 2: applying the discovered structure

In Stage 2, the implemented structure D_I is prepended to every test instance t of the task, and the model is instructed to fill in the values for each key step by step and produce a final answer.^[1] This stage requires exactly one LLM call per instance, the same number as Direct or CoT prompting. Because the structure already encodes the chain of reasoning steps, no extra sampling, no self-consistency voting and no tree-search expansion is needed.^[1]^[4]
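A Stage 2 prompt might be assembled as follows. The structure shown is a hypothetical one for an arithmetic task, written for this example rather than taken from the paper, and the instruction wording and the helper name `build_stage2_prompt` are likewise assumptions.

```python
import json

# Hypothetical discovered structure (D_I) for an arithmetic task.
structure = {
    "Identify the operations in the expression": "",
    "Calculate each arithmetic operation in order": "",
    "Final answer": "",
}


def build_stage2_prompt(structure: dict, instance: str) -> str:
    """Prepend the frozen structure to one task instance."""
    return (
        "Follow the reasoning structure below step by step, filling in the "
        "value for each key, and then give the final answer.\n\n"
        f"Reasoning structure:\n{json.dumps(structure, indent=2)}\n\n"
        f"Task: {instance}"
    )


prompt = build_stage2_prompt(structure, "What is 2 + 3 * 4?")
print(prompt)
```

The same `structure` object is reused unchanged for every instance of the task; only the final `Task:` line varies, so each instance costs one LLM call.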

Atomic reasoning modules

The library of 39 reasoning modules is adopted from Promptbreeder, the self-referential prompt-evolution framework introduced by Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero and Tim Rocktäschel in 2023.^[7] Each module is a short natural-language description of a high-level cognitive heuristic. The modules span experimental design ("How could I devise an experiment to help solve that problem?"), critical reasoning ("Critical Thinking: This style involves analyzing the problem from different perspectives"), systems thinking ("Use systems thinking"), reflection, collaboration, risk analysis, creative thinking and explicit step-by-step decomposition ("Let's think step by step"), among others.^[1]^[7] The full list is published in Appendix A of the Self-Discover paper.^[1]

Representative module categories include:

Decomposition. "How can I break down this problem into smaller, more manageable parts?", "Let's think step by step" (the zero-shot CoT trigger from Kojima et al.), and sub-problem identification.^[1]^[7]
Critical and reflective thinking. "Critical Thinking" and reflective-thinking modules that ask the model to analyze a problem from multiple perspectives or to inspect the assumptions baked into a question.^[1]^[7]
Knowledge retrieval and analogy. Modules that prompt the model to recall similar prior problems or to draw on relevant domain knowledge.^[1]^[7]
Strategic and creative thinking. Modules such as "How can I devise an experiment?", "Use systems thinking" and "What kinds of solution typically are produced for this kind of problem specification?".^[1]^[7]
Verification and risk. Modules for checking logical consistency, considering edge cases, and analysing risks or trade-offs.^[1]^[7]

The modules are intentionally redundant and overlap with existing prompting techniques: "Let's think step by step" reproduces zero-shot CoT, decomposition modules reproduce least-to-most prompting, and reflective modules approximate step-back prompting.^[1]^[3] Self-Discover does not replace these techniques so much as treat them as primitive building blocks that the model can combine. Because the library was inherited from Promptbreeder rather than tuned for Self-Discover, the paper makes no claim that 39 is optimal; it is a fixed, public, model-independent seed set.^[1]^[7]

Experimental setup

The authors evaluated Self-Discover on a deliberately heterogeneous set of 25 reasoning tasks drawn from three benchmarks.^[1]

BIG-Bench Hard (BBH): a 23-task subset of BIG-Bench curated by Suzgun et al. that LLMs found unusually difficult. The tasks span four categories: algorithmic and multi-step arithmetic reasoning, natural language understanding, use of world knowledge, and multilingual knowledge and reasoning.^[1]
Thinking for Doing (T4D): a grounded social agent reasoning benchmark in which models must use mental-state reasoning to determine an action. At the time of the paper, GPT-4 with CoT solved only roughly 50% of T4D.^[1]
MATH: a subsample of 200 examples from the Hendrycks et al. MATH test set, evaluated with instance-level reasoning structures (a one-shot demonstration was added because MATH problems are individually complex).^[1]

Four LLMs were tested: GPT-4 (gpt-4-turbo-preview), GPT-3.5-turbo, the instruction-tuned PaLM 2-L (a stronger instruction-tuned PaLM 2-L variant was used for MATH), and the open-source Llama 2-70B.^[1]

Baselines included Direct prompting, zero-shot CoT (Kojima et al. 2022), Plan-and-Solve prompting (Wang et al. 2023), CoT with Self-Consistency (sampling 10 outputs and majority-voting), majority voting across each of the 39 reasoning modules applied independently, "best of each RM" (an oracle baseline that picks the highest-accuracy single module per task using gold labels), and the prompt-optimization method OPRO (Yang et al. 2023).^[1]
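The aggregation step shared by the Self-Consistency and module-voting baselines is a plain majority vote over final answers. A minimal sketch, assuming answers have already been extracted and normalised to strings (the sample answers below are invented for illustration):

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


# Ten hypothetical final answers, one per sampled CoT trace.
sampled = ["12", "14", "14", "12", "14", "14", "9", "14", "12", "14"]
print(majority_vote(sampled))  # prints 14
```

Every entry in `sampled` costs one LLM call, which is the source of the per-instance overhead these baselines carry.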

Results

Headline numbers

On the 23 aggregated BBH tasks, Self-Discover gave PaLM 2-L a 7-point absolute improvement over CoT and a 6-point improvement over Plan-and-Solve; on GPT-4 the corresponding gains were 6 and 8 points.^[1] Self-Discover outperformed both Direct answering and CoT on more than 20 of 25 tasks in the zero-shot setting on PaLM 2-L.^[1]

On the social-reasoning T4D benchmark, the gains were larger. Self-Discover reached 69% accuracy on PaLM 2-L and 85% on GPT-4, an absolute improvement of at least 27 percentage points (32 points on GPT-4) over all baselines, and it surpassed the previous state-of-the-art prompting method Foresee and Reflect (FaR) without needing an expert-designed structure.^[1]

On MATH, improvements were more modest: 1 to 7 percentage points on PaLM 2-L and 2 to 3 points on GPT-4. Error analysis showed that 87.5% of the discovered reasoning structures on MATH were judged correct by human experts (an expert could follow the structure to solve the task), and that the dominant remaining failure mode (74.7% of errors) was arithmetic computation inside the otherwise correct reasoning structure, consistent with prior findings about LLM arithmetic.^[1]

Across all benchmarks the headline gain over CoT was reported as up to 32% on PaLM 2-L and GPT-4, the figure that was widely quoted in subsequent coverage.^[4]^[8]

Efficiency comparison

The efficiency advantage is the second headline. On a representative subset of tasks the authors plotted accuracy against inference calls per instance. Self-Discover uses one call per instance plus three task-level meta-prompts (amortized over the whole task), CoT with Self-Consistency uses 10 calls per instance (because 10 samples are drawn), and majority voting across all 39 reasoning modules uses 40 calls per instance.^[1]^[4] Self-Discover outperformed both of those inference-heavy ensembles on the subset while using 10 to 40 times fewer calls.^[1]^[4]

Method	Inference calls per instance	Source
Direct prompting	1	^[1]
Chain-of-Thought	1	^[1]
Self-Discover	1 (plus 3 task-level meta-prompts)	^[1]
CoT with Self-Consistency (10 samples)	10	^[1]
Majority voting across all 39 reasoning modules	40	^[1]

The paper also noted that the meta-prompts and JSON-style structure inflate the input and output tokens compared with Direct or CoT prompting, so the per-instance dollar cost is somewhat higher than the call count alone suggests; the efficiency advantage grows as the number of instances per task grows because Stage 1 is paid only once.^[1]
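The amortization argument can be made concrete with a short calculation. The sketch below counts LLM calls only, using the per-instance figures from the table above; it does not model token counts, so it understates Self-Discover's cost relative to CoT. The function name and dictionary keys are illustrative.

```python
def total_calls(num_instances: int) -> dict:
    """Total LLM calls for one task with num_instances test examples."""
    return {
        "cot": num_instances,
        "self_discover": 3 + num_instances,          # 3 one-off meta-prompts
        "cot_self_consistency": 10 * num_instances,  # 10 samples per instance
        "module_voting": 40 * num_instances,         # figure quoted above
    }


for n in (1, 10, 250):
    calls = total_calls(n)
    ratio = calls["cot_self_consistency"] / calls["self_discover"]
    print(n, calls, round(ratio, 2))
```

With a single instance the three meta-prompts dominate and the saving over Self-Consistency is only 2.5 times; as the number of instances grows, the ratio approaches 10 for Self-Consistency and 40 for module voting.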

Ablations and transferability

An ablation on GPT-4 across four tasks showed that all three Stage 1 actions contribute: SELECT alone helps, SELECT plus ADAPT helps more, and the full SELECT-ADAPT-IMPLEMENT pipeline is the best.^[1] Removing IMPLEMENT in particular hurts because the JSON-style structure is what binds the modules into a coherent decoding plan.^[1]

A transferability study addressed whether the discovered structures encode universal task-knowledge or are model-specific. Reasoning structures discovered by PaLM 2-L were applied to GPT-4 and beat OPRO-optimized prompts (which had been trained on 20% of each task's data) on 3 of 4 tasks, despite Self-Discover being zero-shot.^[1] Reasoning structures discovered by GPT-4 were applied to Llama 2-70B and ChatGPT (GPT-3.5-turbo): on disambiguation-QA, Llama 2-70B reached 52% with the GPT-4-discovered structure versus 42% with CoT, and on geometry GPT-3.5-turbo reached 56% versus 51% for CoT (with a 3-shot demonstration).^[1] The authors interpreted this transferability as evidence that the structures capture task-intrinsic, not model-intrinsic, reasoning patterns.^[1]

Where Self-Discover helps most

Breaking BBH down by Suzgun et al.'s four task categories, the authors found that Self-Discover's largest gains over Direct and CoT were on tasks requiring world knowledge (such as sports understanding, movie recommendation and ruin names). Gains on the algorithmic and multi-step arithmetic category were moderate, mirroring the relatively small lift on MATH and the finding that the bottleneck on those tasks is arithmetic execution rather than reasoning structure.^[1]

The interpretation offered by the authors is that world-knowledge tasks benefit most because they reward the integration of several distinct cognitive moves (recalling facts, reasoning from constraints, considering exceptions) within a single decoding pass; a CoT chain composed of a single uniform thinking style tends to skip one of those moves, while a discovered structure that explicitly names "recall similar problems", "consider constraints" and "verify consistency" as separate keys forces the model to produce content for each.^[1] On purely algorithmic tasks, by contrast, the optimal structure is closer to plain step-by-step decomposition, which CoT already supplies, leaving Self-Discover with less headroom.^[1]

Qualitative examples

The paper reproduces several discovered structures to illustrate the heterogeneity of what the model composes for different tasks.^[1] For a geometric-shapes task in BBH, the discovered structure tells the decoder to first list the line segments, then analyze coordinates carefully, then determine whether the path is closed; CoT and Plan-and-Solve both got the same task wrong because they asserted early that the path was not closed and never recovered. For a parenthesis-parsing task, the structure invokes a creative-thinking module and proposes a stack-based algorithm in natural language, an unusual move that does not appear in a plain CoT trace.^[1] These examples illustrate the paper's central qualitative claim: the discovered structures are not just CoT in disguise but encode task-specific algorithmic choices that the model would not otherwise commit to in advance.^[1]
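For readers unfamiliar with the stack-based approach mentioned for the parenthesis task, the following sketch shows the algorithm in code. It is an illustration of the general technique, not a structure or program from the paper; it assumes the task is to output the closing brackets that balance a given prefix, and the bracket set and function name are choices made for the example.

```python
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}


def complete_brackets(prefix: str) -> str:
    """Return the closing brackets needed to balance a valid prefix."""
    stack = []
    for ch in prefix:
        if ch in PAIRS:
            stack.append(ch)  # remember every opener
        elif ch in PAIRS.values():
            # A closer must match the most recent unmatched opener.
            if not stack or PAIRS[stack.pop()] != ch:
                raise ValueError(f"unbalanced prefix at {ch!r}")
    # Close the remaining openers, innermost first.
    return "".join(PAIRS[ch] for ch in reversed(stack))


print(complete_brackets("[ { ( ) < "))  # prints >}]
```

A discovered structure expresses the same idea in natural language, with keys such as tracking the open brackets and closing them in reverse order, rather than as executable code.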

Comparison with related prompting methods

Self-Discover situates itself as a meta-method that composes over the others rather than competing with any single technique.^[1] The contrast with the four most closely related methods is summarised in the table below.

Method	When the structure is fixed	What the model does	Per-instance inference calls
Chain-of-Thought (Wei et al. 2022; Kojima et al. 2022)	A priori; same structure for every task	Generates a linear step-by-step trace	1
Plan-and-Solve (Wang et al. 2023)	A priori; "make a plan then solve" template	Writes a plan then executes it	1
Tree of Thoughts (Yao et al. 2023)	A priori; tree search with backtracking	Explores branching reasoning tree	Many (search)
CoT + Self-Consistency (Wang et al. 2022)	A priori CoT, ensembled	Samples many CoT traces and majority-votes	10 or more
Self-Discover (Zhou et al. 2024)	Discovered per task at Stage 1	Composes selected modules, then follows the structure	1 plus 3 task-level meta-prompts amortized over the task

Plan-and-Solve is the closest prior work because both methods first generate something resembling a plan and then act on it. Self-Discover differs in two ways: the plan is grounded in atomic modules drawn from a library, rather than being free-form, and the decoded answer must follow an explicit key-value reasoning structure rather than free prose.^[1] Tree of Thoughts differs more sharply, exploring many reasoning trajectories per instance and incurring substantial extra compute; Self-Discover trades trajectory exploration for a single, well-structured trajectory whose shape was chosen by the model itself at Stage 1.^[1] CoT with Self-Consistency is an ensemble; Self-Discover beats it on the comparison subset with 10 times less inference compute.^[1]

The paper also compares against OPRO, a method that uses an LLM to iteratively optimise a prompt against a labelled training set. Self-Discover matches or exceeds OPRO without any training data, and the authors argue that explicit reasoning structures transfer between models more robustly than optimised prompt wordings.^[1]

Publication and reception

The paper was posted to arXiv on 6 February 2024 as 2402.03620, and was widely covered in trade press within days.^[2] VentureBeat reported the announcement on 8 February 2024 under the headline "Google DeepMind proposes 'self-discover' framework for LLMs, improves GPT-4 performance" and highlighted the 32% headline gain over CoT and the 10 to 40 times inference-compute reduction over Self-Consistency.^[8] Tech Xplore covered it shortly thereafter, framing Self-Discover as a method that allows LLMs "to find and use task-intrinsic reasoning structures" and emphasising the two-step "first compose, then execute" mechanism.^[9] Google DeepMind also listed the paper on its own publications page under the title "Large Language Models Self-Discover Reasoning Structures", attributing the work to a joint team from DeepMind and USC.^[13]

The arXiv version carries the keyword tag "Machine Learning, ICML", reflecting the ICML-style LaTeX template the authors used, but the paper was ultimately accepted to the 38th Conference on Neural Information Processing Systems (NeurIPS 2024) as a poster, not at ICML.^[5]^[6] Co-author Pei Zhou announced the acceptance on Twitter / X in September 2024.^[10] The dblp bibliography lists the NeurIPS paper under the key conf/nips/ZhouP0CCLCZMZ24.^[5] The OpenReview record gives a publication date of 25 September 2024 and licenses the work under Creative Commons Attribution 4.0.^[11]

Field	Value
arXiv ID	2402.03620
First posted	6 February 2024
Lead author	Pei Zhou
Affiliations	Google DeepMind; University of Southern California
Conference	NeurIPS 2024 (poster)
License	Creative Commons Attribution 4.0 (CC BY 4.0)

Sources: arXiv listing,^[2] NeurIPS proceedings entry,^[6] dblp record for Denny Zhou,^[5] OpenReview record.^[11]

Limitations and criticisms

The paper itself names two limitations.^[1] First, on tasks where the bottleneck is execution rather than planning (most notably arithmetic-heavy benchmarks such as MATH), Self-Discover's gain shrinks because the discovered structure is already roughly correct; the model fails on calculation rather than on choosing the right reasoning steps. The 74.7% computation-error rate on MATH failures is the strongest evidence for this ceiling.^[1] Second, while per-instance inference calls are the same as CoT, the meta-prompts and the JSON-style structure inflate the input and output token counts and therefore the dollar cost; the amortized advantage only manifests when many instances share a task.^[1]

Implementation analyses in the broader prompting community add several caveats.^[3] Stage 1 requires that the underlying model be capable enough to follow long meta-prompts and produce coherent JSON structures; the authors observed low-quality structure outputs when they tried to run Stage 1 on Llama 2 directly, which is why the cross-model experiments used GPT-4 or PaLM 2-L as the discoverer and treated Llama 2 as the consumer.^[1] Self-Discover thus presumes the existence of a strong "discovery" model and inherits a teacher / student asymmetry: a small model cannot bootstrap its own structures with the same quality, and benefits most when a larger model provides them.

Self-Discover does not address how the 39-module library was chosen, and uses the Promptbreeder set unchanged. The library is not exhaustive (it lacks, for instance, modules for tool use, retrieval and code execution) and the framework gives no procedure for extending it.^[1]^[7] More recent agentic prompting work that relies on tool calls, retrieval-augmented generation and code execution is therefore complementary rather than a direct comparison.

A further concern is that the published comparison is restricted to prompting baselines: the zero-shot methods Direct, CoT, Plan-and-Solve and CoT with Self-Consistency, plus the prompt-optimization method OPRO, which does use labelled task data. Self-Discover does not compare directly against fine-tuned reasoning models, against Tree of Thoughts on its tested tasks, or against later 2024 methods such as reflection-based self-correction and verifier-augmented decoding. The headline 32% figure is also the maximum gain on a per-task basis; aggregated gains on BBH are 6 to 8 percentage points absolute, which several commentators noted is a more modest but still meaningful improvement.^[4]^[8]

Implementations and downstream use

There is no official Google DeepMind code release. Several open-source community implementations exist, including a Python reimplementation by Kailash S P on GitHub, and Self-Discover has been packaged as a reusable pattern in agent-orchestration tutorials and "agentic patterns" libraries.^[12]^[14] The four-step pattern (SELECT, ADAPT, IMPLEMENT, EXECUTE), in which EXECUTE corresponds to Stage 2 above, is also taught as a reference design for structured reasoning in agent frameworks, where the Stage 1 calls are typically run offline as part of an agent's setup routine and the discovered structure is cached for reuse across user sessions.^[12]
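The caching arrangement described above might look like the following sketch. It is not drawn from any of the cited implementations: the helper `load_or_discover`, the one-JSON-file-per-task layout and the `fake_discover` stand-in are assumptions for illustration, and task names are assumed to be safe to use as file names.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def load_or_discover(task: str, cache_dir: Path, discover: Callable[[], dict]) -> dict:
    """Return the cached structure for a task, running discovery only on a miss."""
    path = cache_dir / f"{task}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    structure = discover()  # stands in for the three Stage 1 LLM calls
    path.write_text(json.dumps(structure, indent=2), encoding="utf-8")
    return structure


# Stand-in for Stage 1 that records how often discovery actually runs.
runs: List[int] = []


def fake_discover() -> dict:
    runs.append(1)
    return {"Step 1": "", "Final answer": ""}


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_discover("arithmetic", Path(tmp), fake_discover)
    second = load_or_discover("arithmetic", Path(tmp), fake_discover)

print(len(runs), first == second)  # prints 1 True
```

Storing the structure as JSON also keeps it in the human-readable, hand-editable form that the paper emphasises.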

Subsequent prompting research has cited Self-Discover as motivation for further structured-reasoning methods, including hybrid approaches that combine self-discovered structures with retrieval or with tool-using agents. The reasoning-module library itself has been reused in follow-up work, and the SELECT / ADAPT / IMPLEMENT decomposition is now a common reference for "compose-then-execute" prompting designs. Self-Discover is also frequently grouped with Chain-of-Thought, Tree of Thoughts, Self-Consistency and ReAct in textbooks and survey papers as one of the canonical 2022 to 2024 prompting techniques for reasoning in large language models.

Significance

Self-Discover represents a shift in how the prompting literature thinks about reasoning structure. Earlier methods, from Chain-of-Thought to Tree of Thoughts, embedded a single human prior about how reasoning should unfold and applied that prior to every problem. Self-Discover instead treats reasoning structure itself as a variable to be inferred, with the model selecting and composing modules from a public library at zero-shot inference time. The result is a method that is simultaneously more accurate than CoT on heterogeneous benchmarks and substantially cheaper than ensemble methods such as Self-Consistency, because the reasoning template is paid for once per task rather than once per instance.^[1]^[4]

The transferability results add a second contribution: discovered structures generalise across model families, suggesting that good reasoning templates encode something about the task rather than about the model that produced them.^[1] That observation is part of the reason Self-Discover has been cited in subsequent work on "reasoning structure libraries" for agentic AI and on universal prompting designs for smaller open-source models.

See also

Chain-of-Thought prompting, the canonical baseline that Self-Discover composes over and outperforms by up to 32%.^[1]
Tree of Thoughts, a search-based alternative that explores many reasoning branches per instance at much higher inference cost.^[1]
Self-Consistency, an ensemble of CoT samples that Self-Discover beats with 10 to 40 times less compute.^[1]^[4]
ReAct, an interleaved reasoning-and-acting prompting style relevant for tool-using agents.
Promptbreeder by Chrisantha Fernando et al., the source of the 39 reasoning modules used by Self-Discover.^[7]

References

1. Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou, Swaroop Mishra, Huaixiu Steven Zheng, "Self-Discover: Large Language Models Self-Compose Reasoning Structures", arXiv preprint 2402.03620, 2024-02-06. https://arxiv.org/abs/2402.03620. Accessed 2026-05-21.
2. arXiv listing for 2402.03620, "Self-Discover: Large Language Models Self-Compose Reasoning Structures", arXiv, 2024-02-06. https://arxiv.org/abs/2402.03620. Accessed 2026-05-21.
3. Self-Discover HTML rendering on arXiv, "Self-Discover: Large Language Models Self-Compose Reasoning Structures", arXiv, 2024-02-07. https://arxiv.org/html/2402.03620v1. Accessed 2026-05-21.
4. Pei Zhou et al., abstract of "Self-Discover: Large Language Models Self-Compose Reasoning Structures" (claims about 32% gain over CoT and 10 to 40 times less inference compute vs. Self-Consistency), arXiv, 2024-02-06. https://arxiv.org/abs/2402.03620. Accessed 2026-05-21.
5. dblp bibliography for Denny Zhou (entry `conf/nips/ZhouP0CCLCZMZ24` for Self-Discover at NeurIPS 2024), dblp, 2024. https://dblp.org/pid/178/3277.html. Accessed 2026-05-21.
6. "Self-Discover: Large Language Models Self-Compose Reasoning Structures", NeurIPS 2024 proceedings paper, Neural Information Processing Systems Foundation, 2024. https://proceedings.neurips.cc/paper_files/paper/2024/file/e41efb03e20ca3c231940a3c6917ef6f-Paper-Conference.pdf. Accessed 2026-05-21.
7. Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, Tim Rocktäschel, "Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution", arXiv preprint 2309.16797, 2023-09-28. https://arxiv.org/abs/2309.16797. Accessed 2026-05-21.
8. Shubham Sharma, "Google DeepMind proposes 'self-discover' framework for LLMs, improves GPT-4 performance", VentureBeat, 2024-02-08. https://venturebeat.com/ai/google-deepmind-proposes-self-discover-framework-for-llms-improves-gpt-4-performance/. Accessed 2026-05-21.
9. "A self-discovery approach: DeepMind framework allows LLMs to find and use task-intrinsic reasoning structures", Tech Xplore, 2024-02. https://techxplore.com/news/2024-02-discovery-approach-deepmind-framework-llms.html. Accessed 2026-05-21.
10. Pei Zhou (@peizNLP), "Self-Discover is now accepted in NeurIPS", X (formerly Twitter), 2024-09-25. https://twitter.com/peizNLP/status/1839015744929149322. Accessed 2026-05-21.
11. "Self-Discover: Large Language Models Self-Compose Reasoning Structures", OpenReview record (NeurIPS 2024 acceptance), OpenReview, 2024-09-25. https://openreview.net/forum?id=BROvXhmzYK. Accessed 2026-05-21.
12. "Self-Discover: LLM Self-Composed Reasoning Structures", Agentic Patterns reference design, 2024. https://www.agentic-patterns.com/patterns/self-discover-reasoning-structures/. Accessed 2026-05-21.
13. Google DeepMind, "Large Language Models Self-Discover Reasoning Structures", Google DeepMind Publications, 2024-02-06. https://deepmind.google/research/publications/64816/. Accessed 2026-05-21.
14. Kailash S P, "SELF-DISCOVER (open-source Python reimplementation)", GitHub, 2024. https://github.com/kailashsp/SELF-DISCOVER. Accessed 2026-05-21.
