Meta Prompting
Last reviewed
May 7, 2026
Sources
17 citations
Review status
Source-backed
Revision
v3 · 5,162 words
See also: artificial intelligence terms
Meta prompting (also spelled meta-prompting) is an advanced prompt engineering technique where large language models (LLMs) are used to generate, refine, critique, select, or optimize prompts for themselves or other LLMs.[1][2] It involves creating higher-level prompts, often called "meta-prompts," that guide an AI in constructing or improving more specific, task-oriented prompts.[3] The core idea is to leverage an LLM's own capabilities to enhance the quality, effectiveness, and specificity of prompts, producing more reliable outputs from AI systems in the process.[1]
Meta prompting operates at a higher level of abstraction than conventional prompt writing. Rather than composing a prompt for a specific task, a practitioner writes a prompt about prompts, specifying what makes a good prompt, what the target task requires, and what constraints the generated prompt must satisfy. The LLM then produces the final task prompt from those specifications.[4][3] This separation of concerns, between what should be done (the task) and how to instruct the model to do it (the prompt), is central to the technique.
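A minimal sketch of this idea, assuming a hypothetical helper named `build_meta_prompt` (the function, its parameters, and the wording of the text are illustrative, not taken from any vendor's tooling). It assembles a prompt whose requested output is itself a prompt, from the three kinds of specification mentioned above:

```python
def build_meta_prompt(task: str, quality_criteria: list[str], constraints: list[str]) -> str:
    """Assemble a prompt whose requested output is itself a prompt."""
    criteria_lines = "\n".join(f"* {c}" for c in quality_criteria)
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    return (
        "You are an expert prompt engineer.\n"
        "Write a prompt that another language model will follow.\n\n"
        f"Target task: {task}\n\n"
        "A good prompt for this task:\n"
        f"{criteria_lines}\n\n"
        "The generated prompt must satisfy these constraints:\n"
        f"{constraint_lines}\n\n"
        "Return only the text of the generated prompt."
    )


meta_prompt = build_meta_prompt(
    task="classify customer support tickets by urgency",
    quality_criteria=["defines each urgency level", "gives the model a clear role"],
    constraints=["state the output format explicitly", "stay under 300 words"],
)
print(meta_prompt)
```

The string returned here would be sent to a model, and the model's reply would be the task prompt that is actually deployed.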
The approach connects to a broader family of self-directed AI behaviors, including self-reflection, self-refinement, and prompt optimization, and it underpins commercial tools from Anthropic, OpenAI, and Microsoft.
A meta-prompt is a prompt designed to elicit another prompt, or to modify an existing one. It provides instructions to a language model on how to create or improve a prompt for a specific downstream task. The term covers a wide range of practices, from generating and refining prompts to critiquing, selecting, and optimizing them.
The common thread is that the prompt itself becomes the object being manipulated by the language model, rather than the task output.[5] This is distinct from chain-of-thought prompting, which guides reasoning within a single prompt, and from retrieval-augmented generation, which supplements a prompt with retrieved content.
Meta prompting is sometimes used as a synonym for automatic prompt engineering, though the two overlap rather than coincide. Automatic prompt engineering refers specifically to computational search over a space of candidate prompts, while meta prompting is broader and includes manual, semi-manual, and fully automated approaches.
The idea that a language model could write better prompts than a human had early support in the 2022 paper "Large Language Models Are Human-Level Prompt Engineers" by Yongchao Zhou, Andrei Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba.[6] Published as a conference paper at ICLR 2023, the work introduced the Automatic Prompt Engineer (APE) framework, which treats instruction generation as a search problem: the model proposes many candidate instructions, each is scored by its ability to elicit correct answers on training examples, and the highest-scoring instruction is selected.
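The score-and-select half of that search can be sketched in a few lines. This is an illustrative simplification, not the APE implementation: candidates are supplied by hand rather than proposed by a model, scoring is plain exact-match accuracy, and `toy_model` is a hypothetical stand-in for an LLM call.

```python
from typing import Callable


def select_best_instruction(
    candidates: list[str],
    examples: list[tuple[str, str]],
    run_model: Callable[[str, str], str],
) -> tuple[str, float]:
    """Score each candidate by exact-match accuracy and return the best one."""

    def accuracy(instruction: str) -> float:
        hits = sum(run_model(instruction, x) == y for x, y in examples)
        return hits / len(examples)

    scored = [(accuracy(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Hypothetical stand-in for an LLM: it only "understands" one phrasing.
def toy_model(instruction: str, text: str) -> str:
    return text.upper() if "uppercase" in instruction.lower() else text


candidates = ["Repeat the input.", "Rewrite the input in uppercase.", "Reverse the input."]
examples = [("cat", "CAT"), ("dog", "DOG")]
best, score = select_best_instruction(candidates, examples, toy_model)
print(best, score)
```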
APE results were notable. On 24 NLP instruction induction tasks, automatically generated instructions matched or outperformed human-written baselines on 19 of 24, achieving an interquartile mean of 0.810 against the human baseline of 0.749. On BIG-Bench tasks, APE matched or improved on human prompts in 17 of 21 cases. Particularly striking, when APE was used to search for a better version of the chain-of-thought trigger "Let's think step by step," it found a variant that improved MultiArith performance from 78.7 to 82.0 and GSM8K from 40.7 to 43.0.[6]
The DSPy framework (Declarative Self-improving Python), introduced by Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, and colleagues from Stanford NLP in October 2023, took a different angle.[7] Rather than treating prompts as strings to be searched, DSPy treats them as programs: the developer specifies what a module should do using a typed signature, and DSPy compiles that signature into an optimized prompt through a training loop. The framework separates the interface (what the model should do) from the implementation (how to instruct it), which is the same separation at the heart of meta prompting.
DSPy's teleprompters, later renamed optimizers, search over prompt candidates and few-shot demonstration orderings using small training sets. In December 2023 DSPy added instruction optimizers, and in mid-2024 released MIPROv2 (optimizing instructions and demonstrations jointly) and BetterTogether (combining prompt optimization with model weight fine-tuning).[7] The original paper reports that, within minutes of compiling, short DSPy programs let GPT-3.5 and Llama 2 (13B chat) bootstrap pipelines that outperform standard few-shot prompting.[7]
Google DeepMind's OPRO (Optimization by PROmpting), published as a conference paper at ICLR 2024, proposed using a language model as the optimizer itself.[8] The setup is simple: a meta-prompt fed to the optimizer model includes past candidate instructions paired with their task scores (the optimization trajectory), a few task examples, and a meta-instruction describing the goal. The optimizer reads this meta-prompt and proposes a new candidate instruction that tries to improve on previous attempts.
OPRO produced prompts that outperformed human-designed instructions by up to 8 percentage points on GSM8K and up to 50 percentage points on BIG-Bench Hard tasks. The best prompt found by OPRO for some tasks reads like something a careful human might produce, but the search often discovers phrasing that humans would not try intuitively.[8]
The paper that most directly named and formalized the conductor-expert interpretation was "Meta-Prompting: Enhancing Language Models with Task-Agnostic Scaffolding" by Mirac Suzgun and Adam Tauman Kalai, submitted to arXiv in January 2024.[9] The paper introduced a scaffolding technique in which a single LLM instance acts as a conductor: it receives a high-level task, breaks it into subtasks, delegates each subtask to a separate expert instance of the same model (with tailored system instructions per subtask), collects the expert outputs, applies critical thinking and verification, and synthesizes a final answer.
The approach is zero-shot and task-agnostic: the conductor meta-prompt contains no task-specific examples. In experiments with GPT-4 across Game of 24, Checkmate-in-One, and Python Programming Puzzles, the meta-prompting system outperformed standard prompting by 17.1%, expert (dynamic) prompting by 17.3%, and multipersona prompting by 15.2% when averaged across tasks. The Python interpreter was integrated as one of the callable experts, enabling the conductor to offload computation-heavy subtasks to code execution.[9]
TextGrad, released on arXiv in June 2024 and subsequently published in Nature, extended the optimization idea with a framework for automatic differentiation through text.[10] Inspired by PyTorch's autograd, TextGrad propagates "textual gradients" backward through a computation graph: each LLM call in a pipeline generates natural language feedback on the input it received, and that feedback is used to revise the input in the next iteration. This allows the same loop to optimize prompts, code snippets, molecular structures, or any variable that passes through an LLM. On LeetCode-Hard problems, TextGrad achieved a 20% relative performance gain; on Google-Proof Question Answering (GPQA), it improved GPT-4o accuracy from 51% to 55%.[10]
Some meta prompting approaches, notably the Zhang et al. (2023) paper "Meta Prompting for AI Systems," formalize the idea using category theory.[4] The framework defines:
| Category | Description |
|---|---|
| Task category (T) | Objects are individual problem instances; morphisms are logical transformations between tasks. |
| Prompt category (P) | Objects are prompts or prompt templates; morphisms are transformations and refinements of prompts. |
The Meta Prompting Functor (MT: T → P) maps each task to a corresponding structured prompt. The functor is required to preserve compositional structure and identity elements, so that related tasks map to appropriately related prompts. This formalism is mostly useful for reasoning about the design space rather than for implementation, but it shows that the field has begun looking for principled foundations beyond empirical benchmarking.[4]
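Stated explicitly, preserving composition and identities amounts to the standard functor laws. The sketch below uses conventional notation rather than the paper's own: the functor is written as a calligraphic M, and the task objects A, B, C and morphisms f: A → B, g: B → C in T are symbols introduced here for illustration.

```latex
\mathcal{M}(g \circ f) = \mathcal{M}(g) \circ \mathcal{M}(f),
\qquad
\mathcal{M}(\mathrm{id}_{A}) = \mathrm{id}_{\mathcal{M}(A)}
```

Read informally, the first law says that the prompt for a task built by chaining two transformations is the chained transformation of the corresponding prompts, and the second says that leaving a task unchanged leaves its prompt unchanged.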
A more practical theoretical framing comes from the separation between what and how: a meta-prompt specifies what a good prompt must accomplish (the task interface), and the model instantiates how to phrase it (the implementation). DSPy explicitly calls this out as a core design principle, arguing that most prompting practice conflates interface and implementation in ways that make systems brittle and hard to optimize.[7]
The simplest form of meta prompting: a user describes a task in plain language, and the meta-prompt instructs the model to write a production-ready prompt. Anthropic's prompt generator in the developer console, released in 2024, is an example. The user provides a brief description (for instance, "classify customer support tickets by urgency"), and Claude generates a full system prompt with role definition, instructions, output format, and XML-tagged structure.[11]
The generated template uses Anthropic's prompt engineering conventions, including handlebars-style variables ({{variable_name}}), XML tags for separating instructions from context, and explicit output format specifications. The same approach is documented in OpenAI's prompt generation guide, where the model is instructed to produce a "detailed, high-quality prompt" from a task description.
Prompt improvement takes an existing, working prompt and asks the model to make it better by applying known best practices. This is where the meta-prompt encodes engineering knowledge: chain-of-thought instructions, XML structure, format specifications, and few-shot example standardization.
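Anthropic's prompt improver, released on November 14, 2024, applies several techniques automatically: adding chain-of-thought reasoning, standardizing examples, enriching examples with intermediate reasoning steps, clarifying structure, and adding a prefill for the assistant turn.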
Internal testing at Anthropic showed the improver increased accuracy by 30% on a multilabel classification task and brought word count adherence to 100% on a summarization task.[11]
OpenAI offers an equivalent in its Playground, accessed through an "Optimize" button. The Playground optimizer detects contradictions in instructions, missing output format specifications, and inconsistencies between the prompt and any provided few-shot examples, then rewrites the prompt according to best practices for the target model.[12]
The Suzgun-Kalai architecture runs the LLM in a loop: the conductor decides what expert to call and what instructions to give it, the expert runs, and the conductor evaluates the result and decides whether to call another expert, ask for a revision, or synthesize a final answer. This is distinct from simple prompt generation in that the meta-prompting happens at runtime rather than at authoring time. The conductor's meta-prompt is static, but the expert prompts it generates are dynamic and task-specific.[9]
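The runtime loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the `EXPERT <name>: ...` and `FINAL: ...` reply conventions are invented for this example, and `scripted_model` is a hypothetical stand-in that lets the sketch run without an API. Note that the same callable plays both conductor and expert, with only the system instructions changing.

```python
from typing import Callable

Model = Callable[[str, str], str]  # (system_instructions, message) -> reply


def run_conductor(task: str, model: Model, max_rounds: int = 5) -> str:
    """Loop until the conductor emits a final answer or the budget runs out."""
    conductor_system = (
        "You are the conductor. Reply with 'EXPERT <name>: <instructions>' "
        "to consult an expert, or 'FINAL: <answer>' when done."
    )
    history = f"Task: {task}"
    for _ in range(max_rounds):
        decision = model(conductor_system, history)
        if decision.startswith("FINAL:"):
            return decision.removeprefix("FINAL:").strip()
        # Parse "EXPERT <name>: <instructions>" into a fresh expert call.
        header, _sep, instructions = decision.partition(":")
        expert_name = header.removeprefix("EXPERT").strip()
        expert_reply = model(f"You are an expert {expert_name}.", instructions.strip())
        history += f"\n{expert_name} said: {expert_reply}"
    return "no final answer within budget"


# Hypothetical scripted model so the sketch runs offline.
def scripted_model(system: str, message: str) -> str:
    if system.startswith("You are the conductor"):
        if "said:" in message:
            return "FINAL: 4"
        return "EXPERT mathematician: Compute 2 + 2."
    return "4"


answer = run_conductor("What is 2 + 2?", scripted_model)
print(answer)
```

The conductor's system text is fixed across tasks, while the expert instructions are produced fresh on each round, which mirrors the static versus dynamic split described above.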
This architecture shares structure with agentic frameworks like LangChain's Reflection Agents and with multi-agent systems more broadly, but the key distinction is that a single model fills both the conductor and expert roles, using context to modulate its behavior between them.
Self-Refine, introduced in 2023 by Madaan et al., uses the model itself to evaluate and improve its own outputs through a three-step loop: generate an initial draft, produce feedback on it, and use that feedback to write a refined version.[13] When applied to prompts rather than task outputs, this becomes iterative meta prompting: the model critiques its own prompt draft and revises it. Across the tasks evaluated, Self-Refine outputs were preferred by human evaluators and automatic metrics over single-pass outputs, with task performance improving by roughly 20% absolute on average.[13]
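Applied to a prompt draft, the loop might look like the sketch below. The function names, the `"OK"` stop signal, and the two toy callables are assumptions made for illustration; in practice `critique` and `revise` would each be an LLM call.

```python
from typing import Callable


def refine_prompt(
    draft: str,
    critique: Callable[[str], str],
    revise: Callable[[str, str], str],
    max_iters: int = 3,
) -> str:
    """Alternate critique and revision until the critic has no complaints."""
    for _ in range(max_iters):
        feedback = critique(draft)
        if feedback == "OK":
            break
        draft = revise(draft, feedback)
    return draft


# Hypothetical stand-ins for the two LLM calls.
def toy_critique(prompt: str) -> str:
    return "OK" if "Output format:" in prompt else "Specify the output format."


def toy_revise(prompt: str, feedback: str) -> str:
    return prompt + "\nOutput format: one label per line."


final = refine_prompt("Classify each ticket by urgency.", toy_critique, toy_revise)
print(final)
```

The iteration cap matters: without it, a critic that never returns the stop signal would loop indefinitely and accumulate cost.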
Reflexion, introduced by Shinn et al. in 2023, takes a related approach for agents: after each episode, the agent reflects verbally on what went wrong and stores that reflection as a memory for the next attempt. Reflexion agents using ReAct completed 130 of 134 AlfWorld tasks through self-evaluation and reflection, compared to lower rates without reflection.[14]
In recursive meta prompting, the meta-prompt itself is generated by a higher-level meta-prompt. This mirrors metaprogramming in software: just as a program can write programs, a meta-prompt can write meta-prompts. In practice this is rarely more than two levels deep, because deeper recursion compounds latency and cost without proportional gains, but the structure is useful in automated pipelines that bootstrap themselves from minimal initial specifications.
Prompt templates are reusable prompt skeletons with named variables that are filled at query time. They are a lightweight form of meta prompting in that the template author reasons about prompt structure independently from the content that will be inserted. Anthropic uses handlebars notation ({{variable_name}}) throughout its console tooling, including the prompt generator, prompt improver, and evaluation suite. XML tags are recommended to separate variable content from fixed instructions, reducing ambiguity when the variable content might otherwise blend with the surrounding text.[11]
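A minimal sketch of such a template and a renderer for it, using only the standard library. The template text and the `render` helper are illustrative and are not Anthropic's console implementation; the point is the combination of `{{name}}` placeholders with XML tags around the variable content.

```python
import re

TEMPLATE = (
    "Classify the ticket below by urgency.\n"
    "<ticket>\n{{ticket_text}}\n</ticket>\n"
    "Answer with one of: {{labels}}."
)


def render(template: str, values: dict[str, str]) -> str:
    """Replace each {{name}} placeholder; fail loudly on a missing value."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for {{{{{name}}}}}")
        return values[name]

    return re.sub(r"\{\{(\w+)\}\}", substitute, template)


prompt = render(
    TEMPLATE,
    {"ticket_text": "Checkout page is down.", "labels": "low, medium, high"},
)
print(prompt)
```

Raising on a missing variable, rather than leaving the placeholder in place, is a design choice for this sketch: an unfilled `{{name}}` reaching the model is usually a bug that is easier to catch at render time.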
Template-based prompting is the industrial complement to the research techniques above: while APE and OPRO automate prompt search, templates codify the output of that search into maintainable, versionable artifacts that developers can test, review, and deploy. Tools like PromptHub, discussed below, are built on this concept.
DSPy occupies a unique position in the meta prompting landscape because it blurs the line between prompt engineering and software engineering. A DSPy program is a Python module whose components correspond to LLM calls. Each call is described by a signature (typed input and output fields with natural language descriptions), and the developer writes no prompt text: DSPy generates prompts from the signatures and optimizes them using a training set and a metric function.
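To illustrate the idea without depending on the library, the sketch below uses a hypothetical `Signature` dataclass and `compile_prompt` function. Neither is the DSPy API; they only show how a declarative description of inputs and outputs can be turned into prompt text mechanically, with demonstrations slotted in by whatever optimizer selected them.

```python
from dataclasses import dataclass, field


@dataclass
class Signature:
    """Hypothetical stand-in for a typed module signature (not the DSPy API)."""

    task: str
    inputs: dict[str, str]   # field name -> description
    outputs: dict[str, str]  # field name -> description
    demos: list[dict[str, str]] = field(default_factory=list)


def compile_prompt(sig: Signature) -> str:
    """Turn the declarative signature into prompt text."""
    lines = [sig.task, "", "Inputs:"]
    lines += [f"- {name}: {desc}" for name, desc in sig.inputs.items()]
    lines += ["", "Outputs:"]
    lines += [f"- {name}: {desc}" for name, desc in sig.outputs.items()]
    for demo in sig.demos:  # demonstrations an optimizer might have selected
        lines += ["", "Example:"] + [f"{k}: {v}" for k, v in demo.items()]
    return "\n".join(lines)


sig = Signature(
    task="Answer the question using the context.",
    inputs={"context": "retrieved passages", "question": "user question"},
    outputs={"answer": "short factual answer"},
)
print(compile_prompt(sig))
```

Because the developer edits only the signature, swapping in a different set of demonstrations or a different rendering strategy changes the prompt without touching the program that uses it.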
The main optimizers available in DSPy include:
| Optimizer | Description |
|---|---|
| BootstrapFewShot | Samples demonstrations from the training set and includes them in the prompt; scores each configuration and keeps the best. |
| BootstrapFewShotWithRandomSearch | Runs BootstrapFewShot multiple times with different random seeds and returns the best-scoring configuration. |
| MIPROv2 | Jointly optimizes instructions and few-shot demonstrations by proposing instruction candidates using a meta-prompt, scoring them, and selecting the best combination. |
| BetterTogether | Interleaves prompt optimization with model fine-tuning to get benefits from both approaches simultaneously. |
Because DSPy separates the task specification (signatures) from the prompt implementation (generated by the optimizer), changing the underlying model often requires only re-running the optimizer rather than rewriting prompts by hand. This makes DSPy programs more portable across model families than conventional prompt-based systems.[7]
DSPy's approach to meta prompting is fully automated: the developer never writes a meta-prompt manually. The optimizer itself is a program that uses an LLM to generate instruction candidates for each signature, evaluate them against the training metric, and iterate. The meta-prompt inside the optimizer is fixed and generic; the task-specific prompts it produces vary by module.
OPRO's contribution was to show that optimization by prompting is competitive with gradient-based prompt tuning methods on many benchmarks.[8] The key insight is that an LLM can read a history of (instruction, score) pairs and extract patterns that suggest what to try next, much as a human expert would look at experimental results and hypothesize about what changes might help.
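The meta-prompt in OPRO has three parts: the optimization trajectory (past candidate instructions paired with their task scores), a few task examples, and a meta-instruction describing the goal.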
The optimizer model reads this and proposes a new instruction. That instruction is evaluated on the training examples, and its score is added to the trajectory before the next iteration. The loop runs until a time or iteration budget is exhausted.
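The loop can be sketched as below. This is a simplified illustration, not the published implementation: the meta-prompt wording, the ascending sort, the fixed step count, and the `toy_propose` and `toy_score` callables are all assumptions made so the example runs offline. In a real setup `propose` would call the optimizer model and `score` would evaluate the candidate on training examples.

```python
from typing import Callable


def build_optimizer_prompt(
    trajectory: list[tuple[str, float]], examples: list[tuple[str, str]]
) -> str:
    """Combine scored past instructions, task examples, and the meta-instruction."""
    ranked = sorted(trajectory, key=lambda pair: pair[1])  # ascending: best last
    past = "\n".join(f"text: {ins}\nscore: {score:.2f}" for ins, score in ranked)
    shown = "\n".join(f"input: {x}\noutput: {y}" for x, y in examples)
    return (
        "Here are previous instructions and how well each one did:\n"
        f"{past}\n\nHere are some task examples:\n{shown}\n\n"
        "Write a new instruction that differs from the ones above "
        "and performs better."
    )


def optimize(
    seed: str,
    examples: list[tuple[str, str]],
    propose: Callable[[str], str],
    score: Callable[[str], float],
    steps: int = 3,
) -> tuple[str, float]:
    """Run a fixed number of propose-and-score iterations; return the best pair."""
    trajectory = [(seed, score(seed))]
    for _ in range(steps):
        candidate = propose(build_optimizer_prompt(trajectory, examples))
        trajectory.append((candidate, score(candidate)))
    return max(trajectory, key=lambda pair: pair[1])


# Hypothetical optimizer and scorer so the loop runs offline.
def toy_propose(meta_prompt: str) -> str:
    return "Think carefully. " * meta_prompt.count("score:") + "Solve the problem."


def toy_score(instruction: str) -> float:
    return min(1.0, 0.4 + 0.2 * instruction.count("Think carefully."))


best, best_score = optimize("Solve the problem.", [("2+2", "4")], toy_propose, toy_score)
print(best, best_score)
```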
OPRO's design means the optimization is entirely in the space of natural language, with no gradient computation. This makes it applicable to any black-box model where only the output can be observed, not the weights. The trade-off is that each iteration requires multiple LLM calls and can be slow for large training sets.
Anthropic offers two meta prompting tools in its developer console.
The prompt generator, released in May 2024, creates a prompt from scratch given a brief task description. The user types a few sentences describing what they want Claude to do, and the generator outputs a complete system prompt including role, instructions, constraints, output format, and XML structure. The generated prompt follows Anthropic's documented best practices and uses the same conventions the company's own models were trained on.[11]
The prompt improver, released on November 14, 2024, refines an existing prompt. The developer pastes their current prompt, and the improver rewrites it by adding chain-of-thought reasoning, standardizing any examples, enriching examples with intermediate reasoning steps, clarifying structure, and adding a prefill for the assistant turn. The tool is particularly useful when adapting prompts from other providers, because it translates conventions from one platform's idiom into Anthropic's. In published testing, the improver raised accuracy by 30% on a classification task and achieved 100% adherence to a word count constraint on summarization.[11]
Both tools are themselves implemented as meta-prompts: the generator and improver are system prompts that instruct Claude on how to produce or refine a prompt given a task. Anthropic has shared portions of the meta-prompt logic in its documentation as a resource for developers who want to build similar tools.
OpenAI's Playground includes a prompt generation and optimization feature accessible through the "Generate" and "Optimize" buttons in the system message editor. The generation feature takes a task description and produces a system prompt; the Optimize feature takes an existing system prompt and rewrites it to fix contradictions, add missing format specifications, and align with best practices for the selected model.[12]
For GPT-5, OpenAI released a dedicated prompt optimizer cookbook in 2025 that walks through prompt migration from earlier models, showing how the optimizer restructures prompts to take advantage of GPT-5's extended context and instruction-following improvements. The Playground optimizer is free at the point of use within the Playground (standard token charges apply when using the resulting prompt via the API).[12]
OpenAI also released a meta-prompt for prompt optimization in its Cookbook, demonstrating the practice of using a more capable model (such as o1) to optimize prompts intended for a less expensive model (such as GPT-4o). The guide uses a news summarization example: a simple prompt asking for a summary is fed to the meta-prompt, which uses o1 to produce an improved prompt specifying content type, tags, and sentiment analysis fields.[12]
Context engineering is a related concept that became prominent in 2025 as models' context windows grew large enough to hold complex multi-document inputs, tool results, conversation histories, and retrieved knowledge alongside instructions. The framing, popularized in part by Andrej Karpathy's formulation that "the LLM is a CPU and the context window is RAM," treats the context window as the primary resource to be managed.[15]
Meta prompting and context engineering overlap in the system prompt layer. A meta-prompt can be used to generate a system prompt that organizes the context window more effectively: specifying what information should appear in what order, how retrieved documents should be tagged, how tool results should be formatted for downstream steps. In this sense, meta prompting is a tool within the broader context engineering workflow.
The distinction is one of emphasis. Prompt engineering and meta prompting focus on the instruction text itself, on how to phrase what the model should do. Context engineering focuses on everything that surrounds the instructions: the memory, state, and retrieved information that shapes what the model can do. In practice, both concerns are addressed together in production systems.
Anthropic's Constitutional AI (CAI), introduced in the December 2022 paper "Constitutional AI: Harmlessness from AI Feedback" by Yuntao Bai and colleagues, uses a structural form of meta prompting in its training pipeline.[16] The process works in two phases.
In the first phase (supervised learning), the model generates responses, then receives a meta-prompt asking it to critique its own responses against a set of principles (the "constitution"), and then revises its responses based on those critiques. The revised responses form a fine-tuning dataset.
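The shape of this supervised phase can be sketched as follows. It is a simplified illustration, not Anthropic's pipeline: the prompt wording is invented, every principle is applied in sequence rather than sampled, and `toy_model` is a hypothetical stand-in for the model being trained.

```python
from typing import Callable


def build_revision_dataset(
    prompts: list[str],
    principles: list[str],
    model: Callable[[str], str],
) -> list[dict[str, str]]:
    """For each prompt: respond, critique against each principle, revise, record."""
    dataset = []
    for prompt in prompts:
        response = model(prompt)
        for principle in principles:
            critique = model(
                f"Critique this response against the principle '{principle}'.\n"
                f"Response: {response}"
            )
            response = model(
                "Rewrite the response to address the critique.\n"
                f"Critique: {critique}\nResponse: {response}"
            )
        dataset.append({"prompt": prompt, "completion": response})
    return dataset


# Hypothetical scripted model so the sketch runs offline.
def toy_model(text: str) -> str:
    if text.startswith("Critique"):
        return "Add a polite closing."
    if text.startswith("Rewrite"):
        return text.split("Response: ", 1)[1] + " Thank you for asking."
    return "The store opens at 9am."


data = build_revision_dataset(["When do you open?"], ["be polite"], toy_model)
print(data)
```

Only the final revised response is stored alongside the original prompt; the intermediate critiques are discarded, which is what turns an inference-time loop into training data.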
In the second phase (reinforcement learning from AI feedback, or RLAIF), a separate model evaluates pairs of responses against the constitution and assigns preference scores. These scores are used to train a reward model, which then guides further RL training.
The self-critique and revision loop in CAI's supervised phase is a direct application of meta prompting: the critique prompt is a meta-prompt that instructs the model to evaluate its own output against explicit criteria. CAI extended this pattern from single-prompt refinement into a training procedure, showing that meta prompting techniques can be lifted from inference time into model training itself.
PromptHub is a prompt management platform built for teams. It applies Git-style version control concepts to prompts: branches for development and production versions, commits with change diffs, merge requests with review and approval workflows, and rollback to previous versions. Teams can test prompts across multiple inputs and models side by side, set up evaluation pipelines with automated checks before promoting a prompt to production, and deploy prompts as shareable forms or via API.
The platform connects to major LLM providers and supports both private team workspaces and public sharing. For organizations maintaining large prompt libraries across multiple products and model versions, PromptHub provides the kind of change management infrastructure that software teams use for code.[17]
Microsoft PromptFlow (now called Azure AI PromptFlow, integrated into Azure AI Foundry and Azure Machine Learning) is an end-to-end development environment for LLM applications that includes prompt management as a core feature. Developers build flows as visual graphs that link LLM calls, Python functions, and prompt templates. Flows can be versioned, tested across multiple input sets, deployed as Azure endpoints, and monitored in production.
PromptFlow's variant system allows developers to define multiple versions of a prompt for the same node in a flow and compare their outputs systematically. This is a structured form of A/B testing for prompts, which is closer to OPRO's evaluation loop than to manual iteration. The open-source PromptFlow project is available independently as a Python SDK and VS Code extension, separate from Azure.
Note: PromptFlow feature development ended on April 20, 2026, with Microsoft directing users to migrate to the Microsoft Agent Framework before the April 2027 retirement date.
LangChain provides a prompt template system (PromptTemplate and ChatPromptTemplate classes) that parameterizes prompts with named variables, enabling the same prompt structure to be reused across different inputs. LangSmith, LangChain's observability and testing platform, adds prompt versioning, run tracing, and evaluation datasets, giving teams visibility into how prompt changes affect output quality across a dataset rather than on single examples. LangSmith's prompt hub allows teams to store, version, and pull prompts from a central registry.
Weights & Biases, primarily known for ML experiment tracking, added a Prompts feature that captures the full context of LLM calls, including the system prompt, user messages, model parameters, and outputs, alongside performance metrics. This lets teams correlate prompt changes with output quality shifts over time, connecting prompt engineering to the same observability infrastructure used for model training.
The blank-page problem in prompt engineering, knowing that a good prompt exists for a task but not knowing how to write it, is addressed by prompt generation tools. A developer describes the task, the meta-prompt generates a first draft, and the developer iterates from there. This is particularly valuable when working with a model for the first time or when migrating a workflow from one model family to another.
The conductor-expert architecture from Suzgun and Kalai is well suited to tasks where different subtasks require different types of reasoning: a mathematical reasoning expert, a code execution expert, and a natural language synthesis expert can be called in sequence or in parallel by the conductor, each receiving an appropriately tailored prompt for its role. This separation avoids the "jack of all trades, master of none" problem that arises when a single prompt tries to handle everything.
Meta prompting is used to construct evaluation prompts: prompts that instruct a model to act as a judge and score another model's outputs against a rubric. The quality of such evaluation prompts affects the reliability of the evaluation, so meta prompting the evaluation prompt is a natural step. OpenAI and Anthropic both document this pattern in their evaluation guides.
When a team moves from one LLM to another (for example from GPT-3.5 to GPT-4, or from one provider to another), prompts often require revision because models respond differently to the same phrasing. A meta-prompt can be used to translate prompts: it takes the original prompt as input and rewrites it using the conventions and idioms of the target model. OpenAI's GPT-5 prompt optimizer cookbook provides a worked example of this pattern.
Prompt optimization tools can generate high-quality instruction-response pairs for fine-tuning. An optimized prompt that reliably elicits correct answers can be used to generate a training dataset at scale by varying the inputs. This connects meta prompting to the broader pipeline of model improvement, including the RLAIF approach used in Constitutional AI.
Meta prompts can adapt a general-purpose prompt to a specialized domain by injecting domain-specific constraints, terminology, and output formats. A clinical documentation assistant might need a system prompt that reflects HIPAA requirements, ICD-10 coding formats, and clinical note structure; rather than writing all of these constraints by hand, the practitioner can use a meta-prompt to generate the domain-adapted system prompt from a high-level description.
| Technique | Level of abstraction | Automation | Primary goal |
|---|---|---|---|
| Manual prompt engineering | Direct prompt authoring | None | Task-specific quality |
| Meta prompting (generative) | Prompt about prompts | Semi-automated | Prompt generation and improvement |
| APE / OPRO | Automated prompt search | Fully automated | Optimal instruction discovery |
| DSPy | Program compilation | Fully automated | Portable, optimizable pipelines |
| Self-Refine | Output refinement loop | Automated | Output quality improvement |
| Constitutional AI | Training-time self-critique | Automated (training) | Model safety and alignment |
| Context engineering | Context window management | Mixed | Information architecture |
Meta prompting also differs from chain-of-thought prompting (which elicits reasoning steps within a prompt rather than generating the prompt itself), from few-shot prompting (which adds examples to a prompt rather than writing the prompt), and from prompt chaining (which connects outputs of one prompt to inputs of the next, without generating any of the prompts automatically).
Meta prompting at inference time requires at least two LLM calls: one for the meta-prompt and one for the generated prompt. Conductor-expert systems multiply this further. For latency-sensitive applications, the additional round trips can be prohibitive. Cost scales with the number of optimization iterations in OPRO-style or DSPy-style loops, where dozens to hundreds of candidates may be evaluated before the optimizer converges.[9][8]
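A back-of-the-envelope estimate makes the scaling concrete. The formula below is an assumption for illustration, not a figure from either paper: it supposes each iteration spends one call proposing each candidate and one call per evaluation example to score it.

```python
def optimization_calls(iterations: int, candidates_per_step: int, eval_examples: int) -> int:
    """Count LLM calls: one to propose each candidate plus one per scored example."""
    return iterations * candidates_per_step * (1 + eval_examples)


total = optimization_calls(iterations=20, candidates_per_step=8, eval_examples=50)
print(total)  # 8160
```

Under these assumptions the evaluation set size dominates, so shrinking or subsampling it is usually the first lever for reducing cost.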
The quality of the generated prompt depends heavily on the quality of the meta-prompt. If the meta-prompt embeds incorrect assumptions, outdated best practices, or biases about what makes a good prompt, those errors will propagate systematically into every generated prompt. This is harder to debug than errors in a single hand-written prompt because the failure mode is structural rather than textual.
Meta-prompting systems are typically calibrated for a specific model or model family. A meta-prompt that generates excellent instructions for Claude may generate mediocre instructions for GPT-4, and vice versa, because the two models respond differently to the same phrasing. When the target model is updated, the meta-prompt may require recalibration.
The optimization process in OPRO and DSPy is opaque in the sense that it is hard to explain why a particular generated prompt works better than the alternatives. The winning prompt in a DSPy compilation run may contain phrasing that seems arbitrary or unusual to human readers. This makes it difficult to audit prompts for safety, bias, or policy compliance, since the prompt is a product of search rather than deliberate authoring.
In multi-stage meta prompting pipelines, errors introduced at the meta-prompt stage multiply. A subtly misspecified meta-prompt may generate prompts that appear reasonable but consistently elicit a particular class of errors. Because each generated prompt may be used for many queries, the impact of a single meta-prompt failure can be large.
Automated prompt generation can encode and amplify biases present in the LLM used to generate the prompts. If that LLM has systematic biases in how it frames tasks, those biases will appear in generated prompts and in the outputs those prompts elicit. Prompt optimization systems that maximize a single metric can find prompts that game that metric in unintended ways, producing results that score well but behave badly on out-of-distribution inputs or edge cases.