Self-Refine
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,036 words
Self-Refine is an inference-time prompting framework in which a single large language model iteratively improves its own output by alternating between generating natural-language feedback on a draft and producing a revised draft conditioned on that feedback. The method was introduced in the paper "Self-Refine: Iterative Refinement with Self-Feedback" by Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark, first posted to arXiv on 30 March 2023 and presented at NeurIPS 2023.[^1][^2] Unlike approaches that rely on additional training, reinforcement learning, or external tools, Self-Refine uses the same frozen model in three roles, namely generator, feedback provider, and refiner, all driven by few-shot prompts.[^1] Across seven tasks ranging from sentiment reversal to mathematical reasoning, the authors reported an average absolute task-performance improvement of roughly 20 percentage points over conventional one-step generation with GPT-3.5, ChatGPT, and GPT-4.[^1][^3] Follow-up studies have substantially qualified those claims, particularly for reasoning and planning tasks, where intrinsic self-critique is now widely held to be unreliable without external verifiers.[^4][^5]
The Self-Refine paper draws an explicit analogy with human writing practice. Authors rarely produce their best draft in one pass; instead, they re-read, identify deficiencies, and revise.[^1] By 2023, several lines of work had begun probing whether large language models could similarly improve their outputs after generation. Approaches such as PEER, Self-Correction (Welleck et al.), and various retrieval-augmented critique pipelines required supervised pairs of bad-and-good drafts, scalar reward models, or task-specific refiner models.[^1] The contribution of Self-Refine was to show that, for a sufficiently strong instruction-following model, the same network can produce both the critique and the rewrite without any parameter updates, simply by switching prompts.[^1]
The work was a collaboration of researchers from Carnegie Mellon University, the Allen Institute for AI, the University of Washington, NVIDIA, the University of California San Diego, and Google Research.[^2][^3] Aman Madaan, the lead author, was a Ph.D. student at CMU at the time, and Peter Clark of the Allen Institute was the senior author.[^2] The paper was posted to arXiv as 2303.17651 on 30 March 2023 and revised on 25 May 2023 (v2); it was then accepted to the 37th Conference on Neural Information Processing Systems (NeurIPS 2023) and presented as a poster.[^1][^2][^6]
The motivation section of the paper frames the problem as one of test-time improvement: practitioners have access to powerful frozen models such as GPT-4, and the question is whether output quality can be raised without fine-tuning, additional models, or environment access.[^1] The authors connect this to broader literatures on chain-of-thought reasoning, self-consistency, and human revision practices, arguing that natural-language feedback offers a richer signal than scalar rewards because it can point to specific failure modes such as "this code uses a quadratic loop where a linear pass would suffice" or "the response is generic and does not address the user's stated preference."[^1]
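Self-Refine operates as a three-step loop driven by three task-specific few-shot prompts.[^1][^3] Given an input x and a base model M, the procedure is:
1. Initial generation: produce a first draft, y_0 = M(p_gen, x).
2. Feedback: ask the same model to critique the current draft, f_t = M(p_fb, x, y_t).
3. Refinement: produce a revised draft conditioned on that feedback, y_{t+1} = M(p_refine, x, y_t, f_t).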
Steps 2 and 3 are repeated until a stopping condition is met. The paper uses a maximum of four refinement iterations in its main experiments and additionally stops early when the feedback indicates that no further improvement is needed, or when a task-specific numerical score (for example, a model-reported positivity score in sentiment reversal) exceeds a threshold.[^3][^7] In all rounds the prompt retains a record of previous outputs and previous feedback, so the model can avoid repeating earlier mistakes.[^3]
A simplified pseudocode for the loop, omitting prompt-formatting details, is:
y_0 = M(p_gen, x)
for t in 0, 1, ..., T-1:
    f_t = M(p_fb, x, y_t)
    if stop(f_t): return y_t
    y_{t+1} = M(p_refine, x, y_t, f_t)
return y_T
The three prompts together specify the framework in full; the underlying model is treated as a black box and never updated.[^1] The same model fills the generator, feedback, and refiner roles, which means the technique is not an ensemble or distillation method and incurs only additional decoding cost at test time.[^1][^3]
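The loop can be exercised end to end with a stand-in model. The following is a minimal sketch, not the reference implementation: the `Model` type, the `GENERATE` / `FEEDBACK` / `REFINE` prompt headers, the `render` helper, and the rule-based `toy_model` are all hypothetical names introduced here for illustration, and a real deployment would substitute few-shot prompts and an actual LLM call.

```python
from typing import Callable, List, Tuple

# Assumption: a "model" is any function from a prompt string to generated text.
Model = Callable[[str], str]
History = List[Tuple[str, str]]  # (draft, feedback) pairs from earlier rounds


def render(history: History) -> str:
    """Format earlier drafts and feedback so later prompts can see them."""
    return "".join(f"Draft: {d}\nFeedback: {f}\n" for d, f in history)


def self_refine(model: Model, x: str, max_iters: int = 4,
                stop_phrase: str = "no further changes needed") -> Tuple[str, History]:
    """Alternate feedback and refinement using one frozen model."""
    history: History = []
    draft = model(f"GENERATE\nInput: {x}")
    for _ in range(max_iters):
        feedback = model(f"FEEDBACK\nInput: {x}\n{render(history)}Draft: {draft}")
        history.append((draft, feedback))
        if stop_phrase in feedback.lower():
            break  # the critique reports nothing left to fix
        draft = model(f"REFINE\nInput: {x}\n{render(history)}")
    return draft, history


def toy_model(prompt: str) -> str:
    """Rule-based stand-in for an LLM; it routes on the first prompt line."""
    role, _, body = prompt.partition("\n")
    last_draft = body.rsplit("Draft: ", 1)[-1].split("\n", 1)[0]
    if role == "GENERATE":
        return "send the report"
    if role == "FEEDBACK":
        if not last_draft.endswith("."):
            return "Missing final period."
        if not last_draft[0].isupper():
            return "Start with a capital letter."
        return "No further changes needed."
    # REFINE: apply the most recent feedback to the most recent draft.
    last_feedback = body.rsplit("Feedback: ", 1)[-1].split("\n", 1)[0]
    if "period" in last_feedback:
        return last_draft + "."
    return last_draft[0].upper() + last_draft[1:]


final, trace = self_refine(toy_model, "Write a short request.")
print(final)  # Send the report.
```

The toy run takes two refinements and stops on the third feedback call. Passing the same callable for all three roles mirrors the single-model design described above; only the prompt changes between calls.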
The released project page and the open-source GitHub repository at github.com/madaan/self-refine document task-specific prompt files for each of the seven tasks reported in the paper.[^2][^8] In the main experiments every step uses few-shot demonstrations: for example, the feedback prompt for code optimization contains worked examples in which an inefficient Python program is followed by a critique that identifies the algorithmic bottleneck, and the refine prompt contains the corresponding rewrite in O(n) instead of O(n^2).[^8] The math-reasoning prompts contain triples in which the feedback singles out the specific reasoning step that produced a wrong intermediate value.[^3]
A zero-shot or "instruction-only" version was studied as a baseline. The authors report substantially larger gains once few-shot demonstrations are supplied for at least the feedback step, since untemplated self-critique tends to produce vague or sycophantic comments such as "the answer looks correct" rather than actionable corrections.[^3]
A distinguishing element of Self-Refine relative to free-form self-criticism is the design of the feedback prompt itself. The authors deliberately demonstrate, in the few-shot examples, what counts as useful feedback: critiques that point to a specific span of the output, name the deficiency in concrete terms, and propose a direction for repair.[^1][^3] For acronym generation, a typical feedback demonstration might note that the candidate acronym ignores a domain-relevant keyword, abbreviates a less-important word, or produces a non-pronounceable result, and would then suggest swapping in a different word root. For sentiment reversal, the feedback would identify which sentences in the rewritten review still carry the original polarity. For code optimization, the feedback would name an algorithmic bottleneck (for example, a redundant nested loop or a repeated computation that could be cached).[^3][^8]
This structuring is one of the reasons the framework outperforms generic "review your previous answer and try again" prompting strategies, which in subsequent studies have been shown to produce regressive trajectories on hard-correctness tasks.[^13] By demonstrating multi-aspect, actionable feedback rather than holistic praise or criticism, Self-Refine implicitly defines a critique vocabulary for each task that the model can follow at inference time.[^3]
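For the code-optimization case, the kind of draft, critique, and rewrite that a feedback demonstration pairs together might look like the following. This is a hypothetical illustration written for this article, not an excerpt from the repository's prompt files; the function names and the feedback wording are assumptions.

```python
from collections import Counter
from typing import List


def count_pairs_slow(nums: List[int], target: int) -> int:
    """Draft: nested loops check every pair of positions, O(n^2)."""
    count = 0
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                count += 1
    return count


# Hypothetical feedback in the actionable style described above:
# "The inner loop re-scans the list for every element. Keep a running
#  count of the values seen so far so that a single pass suffices."


def count_pairs_fast(nums: List[int], target: int) -> int:
    """Rewrite: one pass with a running counter, O(n)."""
    seen: Counter = Counter()
    count = 0
    for value in nums:
        count += seen[target - value]  # earlier values that complete a pair
        seen[value] += 1
    return count
```

The critique names the span at fault (the inner loop), states the deficiency (redundant re-scanning), and proposes a repair (a running count), which are the three properties the feedback demonstrations are designed to elicit.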
The paper's main configuration runs the loop for up to four iterations, but a task-specific stop is invoked earlier in several cases.[^3][^7] For tasks with a scalar quality signal that the model can be asked to emit (for example, "score the positivity of this review from 1 to 5" in sentiment reversal), the loop stops once the score crosses a threshold. For other tasks the loop stops when the feedback string contains explicit indicators of completion, such as "no further changes needed" or an empty critique.[^7] In practice, the average number of iterations per example is reported to be lower than the maximum, and the authors discuss diminishing returns after the first or second refinement step on most tasks.[^3]
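The three stopping rules described above (iteration cap, completion phrase or empty critique, and score threshold) can be combined in one predicate. This is a sketch under stated assumptions: the `Score: N` format, the `DONE_PHRASES` tuple, and the function names are hypothetical, and the paper's task-specific checks may be implemented differently.

```python
import re
from typing import Optional

# Assumed completion phrase; the text above gives this as one example.
DONE_PHRASES = ("no further changes needed",)


def parse_score(feedback: str) -> Optional[float]:
    """Extract a self-reported score such as 'Score: 4/5' from the feedback."""
    match = re.search(r"score:\s*(\d+(?:\.\d+)?)", feedback, flags=re.IGNORECASE)
    return float(match.group(1)) if match else None


def should_stop(feedback: str, iteration: int, max_iters: int = 4,
                score_threshold: Optional[float] = None) -> bool:
    """Return True when any of the three stopping rules fires."""
    if iteration >= max_iters:
        return True  # iteration budget exhausted
    text = feedback.strip().lower()
    if not text or any(phrase in text for phrase in DONE_PHRASES):
        return True  # empty critique or explicit completion signal
    score = parse_score(feedback)
    return score_threshold is not None and score is not None and score >= score_threshold
```

Because the last two rules rely on the model's own judgement, they inherit whatever errors the model makes when assessing its draft; only the iteration cap is independent of the model.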
The original NeurIPS paper evaluates Self-Refine on seven generation and reasoning tasks chosen to span open-ended writing, structured rewriting, and quantitative problem solving.[^1][^3] The tasks, dataset sizes, and primary metrics are summarised below.
| Task | Description | Dataset size | Primary metric |
|---|---|---|---|
| Sentiment Reversal | Rewrite Yelp reviews to flip sentiment | 1,000 reviews | Sentiment success rate |
| Dialogue Response Generation | Produce richer assistant turns | 372 conversations | Human / GPT-4 preference |
| Code Optimization | Speed up Python programs (PIE benchmark) | 1,000 programs | % programs improved |
| Code Readability Improvement | Refactor for clarity | 300 programs | Readability score |
| Math Reasoning | Solve grade-school word problems (GSM8K) | 1,319 questions | Solve rate |
| Acronym Generation | Generate acronyms for titles | 250 items | Multi-criterion score |
| Constrained Generation | CommonGen-Hard with 20-30 keywords | 200 prompts | Coverage |
Across all tasks, the paper reports that outputs produced under Self-Refine are preferred by humans and by automatic metrics over the same model's single-pass outputs, with an average absolute task-performance improvement of approximately 20 percentage points.[^1][^3] Per-task results extracted from the paper, expressed as the base score followed by the Self-Refine score for the strongest reported configuration of each model, illustrate where the gains concentrate. For GPT-4 the improvements are largest on tasks with diffuse or multi-dimensional quality criteria: Dialogue Response rose from 25.4 to 74.6 (a 49.2-point preference gain), Sentiment Reversal from 3.8 to 36.2, Code Readability from 27.4 to 56.2, and Constrained Generation from 15.0 to 45.0.[^3] On the math-reasoning benchmark GSM8K the gain was negligible, with GPT-4 reported at 92.9 versus 93.1 after Self-Refine, and the same near-zero delta was seen for GPT-3.5 and ChatGPT on that benchmark.[^3] Code optimization improvements were modest, ranging from +3.6 percentage points (ChatGPT) to +8.7 (GPT-4).[^3]
The authors interpret this pattern as evidence that Self-Refine helps most when "quality" is loosely defined and the model has many surface-level levers (lexical, structural, stylistic) for incremental improvement, and helps least when the answer is a single token whose correctness is decidable by an external checker, such as a numerical math answer.[^1][^3]
Sentiment reversal uses examples from Yelp reviews. The model is given a positive (or negative) review and asked to rewrite it with the opposite polarity while preserving content. The metric is the fraction of rewrites that, when scored by an automatic sentiment classifier, change polarity.[^3] Self-Refine drives the sentiment-reversal rate from a low baseline (single-digit to low-double-digit percentages depending on the model) to 30 to 43 percent.[^3]
Dialogue response generation uses conversations and asks the model to produce richer, more engaging assistant turns. Because there is no single ground-truth response, evaluation is by preference judgement (either by humans or by a GPT-4 judge).[^3] The dialogue task is where Self-Refine showed its largest reported gain, a 49.2-point preference improvement when using GPT-4.[^3]
Code optimization uses the Performance-Improving Edits (PIE) benchmark of Python programs, in which a slow reference solution must be rewritten into a faster one. The metric is the fraction of programs for which the model's rewrite, when executed, runs faster than the input.[^3][^8] Self-Refine raises that fraction by between 3.6 and 8.7 percentage points across the three tested models.[^3]
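As a rough illustration of the metric, the share of improved programs can be computed from paired runtimes. The function below is a hypothetical sketch; it assumes runtimes have already been measured and ignores measurement noise, correctness checks, and any minimum-speedup margin the actual benchmark harness may apply.

```python
from typing import Iterable, Tuple


def percent_improved(timings: Iterable[Tuple[float, float]]) -> float:
    """Percentage of (original_seconds, rewritten_seconds) pairs where the rewrite is faster."""
    pairs = list(timings)
    if not pairs:
        return 0.0
    faster = sum(1 for original, rewritten in pairs if rewritten < original)
    return 100.0 * faster / len(pairs)
```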
Code readability improvement uses Python programs and asks the model to refactor them for clarity, with evaluation against a learned readability score; Self-Refine produced gains of about 14 to 35 percentage points depending on the base model.[^3]
Math reasoning uses GSM8K grade-school word problems. The metric is exact-match solve rate. As discussed above, Self-Refine yields essentially no gain on this benchmark for any of the three tested models, a finding the original authors flag and that subsequent critiques use as evidence for the broader claim that intrinsic self-critique does not help on hard-correctness reasoning tasks.[^3][^4][^13]
Acronym generation asks the model to produce acronyms for given titles. The metric is a multi-criterion score combining pronounceability, ease of spelling, relevance, and a few related axes. Self-Refine improves the score by 10 to 25 percentage points depending on the base model.[^3]
Constrained generation uses CommonGen-Hard prompts that supply 20 to 30 keywords and require the model to produce a coherent sentence using all of them. The metric is coverage of the supplied keywords. Self-Refine raises coverage by 9 to 30 percentage points across the tested models, with the largest gains on the strongest model.[^3]
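Keyword coverage is simple to compute, and the list of missing keywords is exactly the kind of concrete feedback a refinement step can act on. The sketch below assumes case-insensitive whole-word matching with no lemmatisation; the function names are hypothetical and the paper's scoring script may match inflected forms differently.

```python
import re
from typing import List, Sequence, Set


def _words(sentence: str) -> Set[str]:
    """Lower-cased word tokens of the sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def missing_keywords(sentence: str, keywords: Sequence[str]) -> List[str]:
    """Keywords that do not occur as whole words in the sentence."""
    present = _words(sentence)
    return [k for k in keywords if k.lower() not in present]


def keyword_coverage(sentence: str, keywords: Sequence[str]) -> float:
    """Fraction of the supplied keywords that the sentence uses."""
    if not keywords:
        return 1.0
    return 1.0 - len(missing_keywords(sentence, keywords)) / len(keywords)
```

For example, a draft that uses three of four required keywords scores 0.75, and the one absent keyword can be passed back verbatim as feedback for the next draft.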
Self-Refine sits in a family of inference-time techniques that exploit additional decoding compute to improve output quality.[^9] The table below contrasts it with four contemporaneous methods.
| Method | Year | Core mechanism | Feedback source | External tools |
|---|---|---|---|---|
| Self-Consistency | 2022 | Sample N reasoning paths, majority-vote final answer | None | None |
| Self-Refine | 2023 | Iterative draft / critique / revise loop | Same LLM | None |
| Reflexion (Shinn et al.) | 2023 | Verbal RL: maintain reflective memory across trials | Environment + LLM | Optional |
| Tree of Thoughts | 2023 | Search over partial-solution tree with self-evaluation | LLM evaluator | None |
| CRITIC (Gou et al.) | 2023 | Tool-interactive self-critique | LLM + external tools | Yes |
Self-Consistency, introduced by Wang et al. in 2022, samples multiple chain-of-thought traces and selects the most frequent final answer. It improves answer accuracy through diversity without ever critiquing any individual trace.[^9] Self-Refine, by contrast, generates and edits a single trajectory, and is therefore best suited to open-ended tasks where the notion of "majority vote" does not apply.[^1][^9]
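The contrast is visible in code: Self-Consistency aggregates over independent samples and never feeds one sample's output back into another call. In this minimal sketch, `sample_answer` is a hypothetical callable standing in for one sampled chain-of-thought run reduced to its final answer.

```python
from collections import Counter
from typing import Callable, List


def self_consistency(sample_answer: Callable[[], str], n: int = 5) -> str:
    """Draw n independent final answers and return the most frequent one."""
    answers: List[str] = [sample_answer() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Majority voting needs answers that can be compared for equality, which is why it suits short final answers and does not transfer directly to open-ended text.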
Reflexion, by Shinn, Cassano, Berman, Gopinath, Narasimhan, and Yao (arXiv 2303.11366, also NeurIPS 2023), is framed as "verbal reinforcement learning."[^10] Reflexion is set in an agent loop where the model receives sparse rewards or environment feedback (for example, a unit-test pass/fail signal) and uses that feedback to update a reflective text memory that persists across trials. Whereas Self-Refine performs free-standing iterative refinement on the output of a single generation call, Reflexion expects an external signal (success/failure or scalar reward) and is therefore well-suited to interactive coding tasks such as HumanEval-style benchmarks with executable tests.[^10] The two methods are sometimes confused but differ in the source of feedback (intrinsic in Self-Refine, extrinsic or hybrid in Reflexion) and in the granularity of intervention (per-draft in Self-Refine, per-episode in Reflexion).[^10]
Tree of Thoughts, by Yao et al., reframes problem solving as deliberate search over a tree of partial solutions, with the LLM acting as both proposer and evaluator at each node. Self-Refine, Self-Consistency, and chain-of-thought can be viewed as special cases of Tree of Thoughts with constrained breadth or depth.[^11]
CRITIC, by Gou, Shao, Gong, Yang, Huang, Duan, and Chen (arXiv 2305.11738, ICLR 2024), externalises the critique step. Instead of asking the model to imagine flaws, CRITIC routes the draft through tools such as web search, code interpreters, or fact checkers and uses the tool output as the feedback signal. Empirically the authors of CRITIC report that removing tool access erodes most of the gains, which they take as evidence that LLMs are unreliable critics of their own work in tasks like open-domain QA.[^12]
The reference implementation lives at https://github.com/madaan/self-refine under an Apache-2.0 license.[^8] The repository contains subdirectories per task (sentiment_yelp, dialogue, acronym, gsm, pie/code optimization, code_readability, commongen) together with the few-shot prompt files used in the paper, a shared driver loop, and evaluation scripts.[^8] A separate project page at https://selfrefine.info collects task demos, example trajectories, and the citation block; an extension called "Visual Self-Refine" experiments with iteratively improving diagram generation through GPT-4V.[^2][^8]
Self-Refine has been adopted as a building block in several downstream systems and benchmarks. Prompt-engineering libraries and tutorials (including the Learn Prompting curriculum) treat it as one of the canonical self-criticism strategies for LLM applications and contrast it with Reflexion and Self-Consistency.[^7] The method is also frequently re-implemented inside agent frameworks where a "critique then revise" subroutine is plugged into longer pipelines.[^9]
The reference repository organises code by task. The sentiment_yelp directory contains the Yelp-derived sentiment-reversal prompts and the loop that drives feedback and refinement. The dialogue directory contains dialogue response examples with multi-criterion feedback templates. The pie directory targets program improvement, with Python execution scripts that measure wall-clock speedups. The gsm directory contains the GSM8K prompts and evaluation harness. The acronym, code_readability, and commongen directories each provide their own prompt files and evaluation logic.[^8] The colabs directory contains notebooks that allow practitioners to step through the feedback / refine loop interactively.[^8]
Variants and re-implementations have appeared in many later papers, often under names like "iterative self-correction", "critic-refiner loops", or "verbal RL". Among the more notable derivatives are systems that combine Self-Refine style critique with majority-vote selection (Self-Consistency), with retrieval-augmented critique, or with execution-based verifiers in code generation. CRITIC explicitly contrasts itself with Self-Refine by routing the critique step through external tools.[^12] Reflexion can be understood as a variant of Self-Refine in which the feedback signal includes an external reward and persists across episodes rather than being regenerated from scratch.[^10]
While Self-Refine's reported gains on open-ended generation are large, subsequent work has documented important limits, especially on reasoning and planning tasks where the original paper itself reported only tiny improvements on GSM8K.[^3][^4][^5]
The most direct critique comes from Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati at Arizona State University, who argued in "On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks" (arXiv 2402.08115, first posted 12 February 2024, accepted to ICLR 2025) that iterative self-critique frameworks of the Self-Refine family are unreliable when the underlying task has a hard correctness criterion.[^4][^5] Their study evaluates GPT-4 on three planning and reasoning domains: Game of 24, Graph Coloring, and STRIPS planning. The authors compare an iterative self-critique pipeline against a setup that uses a sound external verifier and find "significant performance collapse with self-critique and significant performance gains with sound external verification."[^4][^5] They further observe that merely re-prompting with a verifier that is guaranteed correct preserves most of the benefit of more elaborate self-improvement scaffolds.[^4]
Stechly and co-authors argue that the standard intuition motivating self-correction, namely that verification should be easier than generation, is a classical computational-complexity argument that does not transfer cleanly to LLMs, whose behaviour they characterise as approximate retrieval rather than principled reasoning.[^4] On their domains, prompting GPT-4 to critique its own answer often introduces new errors or fails to recognise correct solutions, so the iterative loop can degrade an initially correct answer.[^4][^5]
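For a domain such as Game of 24, a sound external verifier is a small deterministic program, which makes the contrast with model-generated critique concrete. The sketch below is a hypothetical verifier written for illustration, not the one used in the study; it assumes candidate solutions are arithmetic expressions over non-negative integer literals with `+`, `-`, `*`, `/`, and parentheses.

```python
import ast
import operator
from fractions import Fraction
from typing import List

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}


def _evaluate(node: ast.AST, used: List[int]) -> Fraction:
    """Evaluate the expression exactly, recording each integer literal it uses."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def verify_24(expression: str, numbers: List[int]) -> bool:
    """True only if the expression uses each given number exactly once and equals 24."""
    used: List[int] = []
    try:
        value = _evaluate(ast.parse(expression, mode="eval").body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return sorted(used) == sorted(numbers) and value == 24
```

Exact rational arithmetic avoids floating-point false negatives on solutions such as `8 / (3 - 8 / 3)`. A verifier of this kind never rejects a correct answer or accepts a wrong one, a guarantee that a self-critique prompt cannot offer.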
A separate and earlier critique, "Large Language Models Cannot Self-Correct Reasoning Yet" by Jie Huang and colleagues at Google DeepMind (arXiv 2310.01798, ICLR 2024), reaches a similar conclusion for arithmetic, commonsense, and multi-hop QA benchmarks: intrinsic self-correction without external feedback often hurts performance, and apparent gains in prior work sometimes stem from using oracle ground-truth labels to decide when to stop refining.[^13]
Beyond reasoning failures, several other limitations are acknowledged in the Self-Refine paper itself and in secondary coverage: the method requires a strong instruction-following base model (gains are smaller or negative for weak models that cannot produce useful feedback), evaluations were performed only on English text, the use of LLM-based or human preference judgements on highly preference-sensitive tasks may inflate apparent improvements, and inference cost grows with the number of iterations, since each one adds a feedback call and a refine call.[^1][^7]
A practical caveat is that the framework's gains are sensitive to prompt design: poorly chosen feedback demonstrations can produce vague critiques that yield non-monotonic or even regressive refinement trajectories, and stopping criteria that rely on the model's own self-assessment of completeness are vulnerable to the same self-verification problem highlighted by Stechly et al.[^4][^7]
Each refinement iteration triggers two additional LLM calls (one for feedback, one for the rewrite). With a maximum of four iterations, the worst case is nine model calls where single-shot generation needs one (one initial draft plus four feedback / refine pairs).[^1][^3] Because the prompts also retain earlier drafts and feedback, token usage can grow faster than the call count. In practice the average cost is lower because of early stopping, but Self-Refine consistently uses substantially more tokens per example than single-pass baselines. Whether the trade-off is favourable depends on the value of the additional quality and the price of generation. As a test-time compute technique, Self-Refine sits in the same family as repeated sampling with majority vote, beam search over reasoning steps, and Tree of Thoughts: all of them buy quality with additional compute.[^9][^11]
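The call count follows directly from the loop structure. The helper below is an illustrative sketch with hypothetical names; it counts model calls only and does not estimate tokens.

```python
def self_refine_calls(refinements: int, early_stop: bool = False) -> int:
    """Model calls for a run that performed `refinements` refine steps.

    Each refine step is preceded by a feedback call. An early stop adds one
    final feedback call that is not followed by a refine step.
    """
    return 1 + 2 * refinements + (1 if early_stop else 0)


print(self_refine_calls(4))                   # 9: the worst case with four iterations
print(self_refine_calls(2, early_stop=True))  # 6: two rewrites, then a stop signal
```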
Because Self-Refine relies on instruction-following models that are also capable of generating harmful content, the same loop can in principle be aimed at producing more refined toxic, deceptive, or copyright-infringing outputs. The original paper notes that the technique was evaluated only on English-language text and that its applicability to other languages, dialects, or low-resource settings is not guaranteed.[^1] In safety-sensitive deployments, practitioners typically combine intrinsic self-refinement with external safety filters and content classifiers rather than rely on the model's own critique.[^9]
Despite the reasoning-task caveats, Self-Refine remains an influential reference point in the literature on test-time compute and on prompting frameworks for large language models.[^9][^11] It demonstrated that natural-language critique, expressed in the same channel as generation, can serve as a useful intermediate signal between scalar rewards and full external verification, and it provided one of the first systematic studies of intrinsic iterative refinement across heterogeneous task families.[^1][^3] The published code and prompts have made it straightforward to compare new self-correction proposals against a common baseline.[^8]
In subsequent years the field has moved toward hybrid systems that combine self-critique with external verifiers (CRITIC, agentic tool-use), with verified rewards (Reflexion variants), or with test-time compute scaling laws over search trees and verifiers.[^11][^12] Self-Refine occupies a notable position in that lineage as the cleanest demonstration that, for tasks where quality is multidimensional and there is no oracle verifier, a model can usefully edit its own draft.[^1][^9]
The broader contribution is methodological: by isolating the critique-and-rewrite loop as the only intervention and holding the underlying model fixed, the paper made it possible to study the effect of intrinsic self-feedback as an independent variable. Subsequent papers that combine self-feedback with sampling, with tools, with training, or with search build on a baseline that Self-Refine helped to standardise.[^1][^9] The negative result on GSM8K is in retrospect one of the most informative parts of the original work, since it telegraphed the limitations that Stechly et al. and Huang et al. would later document at greater scope.[^3][^4][^13]