Self-Refine


Self-Refine is an inference-time prompting framework in which a single large language model iteratively improves its own output by alternating between generating natural-language feedback on a draft and producing a revised draft conditioned on that feedback, all without any additional training. Introduced in the 2023 paper "Self-Refine: Iterative Refinement with Self-Feedback" by Aman Madaan and colleagues, the method uses the same frozen model in three roles (generator, feedback provider, and refiner). The authors reported an average improvement of about 20 percentage points in task performance across 7 diverse tasks using GPT-3.5, ChatGPT, and GPT-4.^[1]^[3] In the authors' words, the model generates an initial output and then "the same LLM provides feedback for its output and uses it to refine itself, iteratively," a loop that "does not require any supervised training data, additional training, or reinforcement learning."^[1]

The full author list is Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark; the paper was first posted to arXiv on 30 March 2023 and presented at NeurIPS 2023.^[1]^[2] Unlike approaches that rely on additional training, reinforcement learning, or external tools, Self-Refine drives all three roles with few-shot prompts.^[1] Follow-up studies have substantially qualified the headline gains, particularly for reasoning and planning tasks, where intrinsic self-critique is now widely held to be unreliable without external verifiers.^[4]^[5]

What is Self-Refine?

Self-Refine is a test-time technique that lets a frozen large language model edit its own work. Given an input, the model first writes a draft, then critiques that draft in natural language, then rewrites the draft using its own critique, repeating the feedback-and-refine cycle until the output is judged good enough.^[1]^[3] The key design choice is that one and the same model fills every role: there is no separate critic model, no reward model, and no parameter update. The framework is fully specified by three few-shot prompts (one to generate, one to give feedback, one to refine), so it can be wrapped around any sufficiently capable instruction-following model such as GPT-4.^[1]

The Self-Refine paper draws an explicit analogy with human writing practice. Authors rarely produce their best draft in one pass; instead, they re-read, identify deficiencies, and revise.^[1] By 2023, several lines of work had begun probing whether large language models could similarly improve their outputs after generation. Approaches such as PEER, Self-Correction (Welleck et al.), and various retrieval-augmented critique pipelines required either supervised pairs of bad-and-good drafts, scalar reward models, or task-specific refiner models.^[1] The contribution of Self-Refine was to show that, for a sufficiently strong instruction-following model, the same network can produce the critique and the rewrite without any parameter updates, simply by switching prompts.^[1]

The work was a collaboration of researchers from Carnegie Mellon University, the Allen Institute for AI, the University of Washington, NVIDIA, the University of California San Diego, and Google Research.^[2]^[3] Aman Madaan, the lead author, was a Ph.D. student at CMU at the time, and Peter Clark of the Allen Institute is the senior author.^[2] The paper was posted to arXiv as 2303.17651 on 30 March 2023 and revised on 25 May 2023 (v2); it was then accepted to the 37th Conference on Neural Information Processing Systems (NeurIPS 2023) and presented as a poster.^[1]^[2]^[6]

The motivation section of the paper frames the problem as one of test-time improvement: practitioners have access to powerful frozen models such as GPT-4, and the question is whether output quality can be raised without fine-tuning, additional models, or environment access.^[1] The authors connect this to broader literatures on chain-of-thought reasoning, self-consistency, and human revision practices, arguing that natural-language feedback offers a richer signal than scalar rewards because it can point to specific failure modes such as "this code uses a quadratic loop where a linear pass would suffice" or "the response is generic and does not address the user's stated preference."^[1]

How does the feedback-and-refine loop work?

Self-Refine operates as a three-step loop driven by three task-specific few-shot prompts.^[1]^[3] Given an input x and a base model M, the procedure is:

1. Initial generation. The model produces an initial output y_0 from a prompt p_gen that includes input-output demonstrations for the target task.^[3]
2. Feedback. The model is re-prompted with p_fb, a few-shot template whose demonstrations are triples of the form (input, output, feedback). The model generates a natural-language critique f_t of the current output y_t, typically pointing to concrete, actionable problems.^[3]
3. Refinement. The model is re-prompted with p_refine, a few-shot template whose demonstrations are quadruples (input, output, feedback, refined output). Conditioned on x, y_t, and f_t, the model emits a revised output y_{t+1}.^[3]

Steps 2 and 3 are repeated until a stopping condition is met. The paper uses a maximum of four refinement iterations in its main experiments and additionally stops early when the feedback indicates that no further improvement is needed, or when a task-specific numerical score (for example, a model-reported positivity score in sentiment reversal) exceeds a threshold.^[3]^[7] In all rounds the prompt retains a record of previous outputs and previous feedback, so the model can avoid repeating earlier mistakes.^[3]

A simplified pseudocode for the loop, omitting prompt-formatting details, is:

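y = M(p_gen, x)                  # initial draft y_0
for t in 0, 1, ..., T-1:
    f = M(p_fb, x, y)            # feedback f_t on the current draft y_t
    if stop(f): break
    y = M(p_refine, x, y, f)     # refined draft y_{t+1}
return y                         # latest draft, including the final refinement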

The three prompts together specify the framework in full; the underlying model is treated as a black box and never updated.^[1] The same model fills the generator, feedback, and refiner roles, which means the technique is not an ensemble or distillation method and incurs only additional decoding cost at test time.^[1]^[3]
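The loop can also be written as a small runnable sketch. It assumes the model is any Python callable that maps a prompt string to a completion string; the names `self_refine` and `toy_model`, the prompt wording, and the "no further changes needed" stop phrase are illustrative assumptions rather than the authors' implementation. The `history` list mirrors the practice, described above, of keeping earlier drafts and feedback in the prompt.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]  # hypothetical interface: prompt in, completion out


def self_refine(model: Model, x: str, max_iters: int = 4) -> Tuple[str, List[Tuple[str, str]]]:
    """Run generate -> feedback -> refine with one model; return final draft and history."""
    history: List[Tuple[str, str]] = []            # (draft, feedback) pairs seen so far
    y = model(f"GENERATE\nInput: {x}")             # initial draft y_0
    for _ in range(max_iters):
        past = "\n".join(f"Draft: {d}\nFeedback: {f}" for d, f in history)
        fb = model(f"FEEDBACK\nInput: {x}\n{past}\nDraft: {y}")
        history.append((y, fb))
        if "no further changes needed" in fb.lower():   # assumed stop phrase
            break
        y = model(f"REFINE\nInput: {x}\n{past}\nDraft: {y}\nFeedback: {fb}")
    return y, history


def toy_model(prompt: str) -> str:
    """Offline stand-in for an LLM: it is satisfied once the draft ends with '!'."""
    draft = prompt.rsplit("Draft: ", 1)[-1].split("\nFeedback:")[0]
    if prompt.startswith("GENERATE"):
        return "hello"
    if prompt.startswith("FEEDBACK"):
        return "No further changes needed." if draft.endswith("!") else "Add an exclamation mark."
    return draft + "!"                             # REFINE: apply the requested fix


final, history = self_refine(toy_model, "greet the user")
print(final, len(history))
```

Replacing `toy_model` with a real API client leaves the control flow unchanged, which is the sense in which the underlying model is treated as a black box.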

Are there zero-shot and few-shot variants?

The released project page and the open-source GitHub repository at github.com/madaan/self-refine document task-specific prompt files for each of the seven tasks reported in the paper.^[2]^[8] In the main experiments every step uses few-shot demonstrations: for example, the feedback prompt for code optimization contains worked examples in which an inefficient Python program is followed by a critique that identifies the algorithmic bottleneck, and the refine prompt contains the corresponding rewrite in O(n) instead of O(n^2).^[8] The math-reasoning prompts contain triples in which the feedback singles out the specific reasoning step that produced a wrong intermediate value.^[3]

A zero-shot or "instruction-only" version was studied as a baseline; the authors report substantial gains when at least feedback demonstrations are provided, since untemplated self-critique tends to produce vague or sycophantic comments such as "the answer looks correct" rather than actionable corrections.^[3]
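As a rough illustration of the difference between the two settings, the sketch below assembles a feedback prompt from (input, output, feedback) triples and a refine prompt from (input, output, feedback, refined output) quadruples. The function names, field labels, `###` separator, and the demonstration text are all hypothetical; the prompt files in the released repository may be laid out differently.

```python
from typing import List, Tuple

FeedbackDemo = Tuple[str, str, str]        # (input, output, feedback)
RefineDemo = Tuple[str, str, str, str]     # (input, output, feedback, refined output)
SEP = "\n\n###\n\n"                        # assumed separator between demonstrations


def build_feedback_prompt(demos: List[FeedbackDemo], x: str, y: str) -> str:
    """Worked triples first, then an open 'Feedback:' slot for the new draft."""
    shots = [f"Input: {i}\nOutput: {o}\nFeedback: {f}" for i, o, f in demos]
    shots.append(f"Input: {x}\nOutput: {y}\nFeedback:")
    return SEP.join(shots)


def build_refine_prompt(demos: List[RefineDemo], x: str, y: str, fb: str) -> str:
    """Worked quadruples first, then an open 'Refined output:' slot."""
    shots = [
        f"Input: {i}\nOutput: {o}\nFeedback: {f}\nRefined output: {r}"
        for i, o, f, r in demos
    ]
    shots.append(f"Input: {x}\nOutput: {y}\nFeedback: {fb}\nRefined output:")
    return SEP.join(shots)


# Illustrative demonstration in the spirit of the code-optimization task.
demo_fb: List[FeedbackDemo] = [(
    "Sum the integers from 1 to n.",
    "total = 0\nfor i in range(1, n + 1):\n    total += i",
    "The loop is O(n); the closed form n * (n + 1) // 2 is O(1).",
)]

few_shot = build_feedback_prompt(demo_fb, "Count vowels in s.", "c = 0  # draft")
zero_shot = build_feedback_prompt([], "Count vowels in s.", "c = 0  # draft")
```

Passing an empty demonstration list gives the zero-shot shape: the model sees only the draft and an open feedback slot, with no example of what a specific critique looks like.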

How is the feedback prompt structured?

A distinguishing element of Self-Refine relative to free-form self-criticism is the design of the feedback prompt itself. The authors deliberately demonstrate, in the few-shot examples, what counts as useful feedback: critiques that point to a specific span of the output, name the deficiency in concrete terms, and propose a direction for repair.^[1]^[3] For acronym generation, a typical feedback demonstration might note that the candidate acronym ignores a domain-relevant keyword, abbreviates a less-important word, or produces a non-pronounceable result, and would then suggest swapping in a different word root. For sentiment reversal, the feedback would identify which sentences in the rewritten review still carry the original polarity. For code optimization, the feedback would name an algorithmic bottleneck (for example, a redundant nested loop or a repeated computation that could be cached).^[3]^[8]

This structuring is one of the reasons the framework outperforms generic "review your previous answer and try again" prompting strategies, which in subsequent studies have been shown to produce regressive trajectories on hard-correctness tasks.^[13] By demonstrating multi-aspect, actionable feedback rather than holistic praise or criticism, Self-Refine implicitly defines a critique vocabulary for each task that the model can follow at inference time.^[3]
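One way to picture "specific span, concrete deficiency, direction for repair" is as a small record type. The sketch below is only an illustration of that three-part shape: the paper's feedback is free-form text, and the `FeedbackItem` class, its field names, and the rendering format are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FeedbackItem:
    span: str        # the part of the draft the critique points at
    problem: str     # the deficiency, stated in concrete terms
    suggestion: str  # a direction for repair


def is_actionable(item: FeedbackItem) -> bool:
    """Crude check that all three parts are present (non-empty after stripping)."""
    return all(part.strip() for part in (item.span, item.problem, item.suggestion))


def render_feedback(items: List[FeedbackItem]) -> str:
    """Turn structured items into the text a feedback demonstration might contain."""
    if not items:
        return "No further changes needed."
    return "\n".join(
        f'- "{it.span}": {it.problem} Suggestion: {it.suggestion}' for it in items
    )


critique = [
    FeedbackItem(
        span="for i in range(n): for j in range(n)",
        problem="The nested loop recomputes the same sums, giving O(n^2) time.",
        suggestion="Keep a running total and make a single pass.",
    )
]
print(render_feedback(critique))
```

A holistic comment such as "the answer looks correct" has no span and no suggestion, so it would fail `is_actionable`; that is the kind of critique the few-shot demonstrations are meant to steer the model away from.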

When does the loop stop?

The paper's main configuration runs the loop for up to four iterations, but a task-specific stop is invoked earlier in several cases.^[3]^[7] For tasks with a scalar quality signal that the model can be asked to emit (for example, "score the positivity of this review from 1 to 5" in sentiment reversal), the loop stops once the score crosses a threshold. For other tasks the loop stops when the feedback string contains explicit indicators of completion, such as "no further changes needed" or an empty critique.^[7] In practice, the average number of iterations per example is reported to be lower than the maximum, and the authors discuss diminishing returns after the first or second refinement step on most tasks.^[3]
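The stopping rules described above can be combined into one predicate. This is a minimal sketch under stated assumptions: the completion phrases, the "Score: N" format, and the threshold of 4 on a 1-to-5 scale are hypothetical choices, not values taken from the paper.

```python
import re
from typing import Optional

DONE_PHRASES = ("no further changes needed", "no further improvement")  # assumed wording


def parse_score(feedback: str) -> Optional[float]:
    """Extract the number from text like 'Score: 4' or 'score: 4/5', if present."""
    m = re.search(r"score\s*[:=]\s*(\d+(?:\.\d+)?)", feedback, flags=re.IGNORECASE)
    return float(m.group(1)) if m else None


def should_stop(feedback: str, iteration: int, max_iters: int = 4,
                threshold: float = 4.0) -> bool:
    """Stop on budget exhaustion, an empty critique, a 'done' phrase, or a high score."""
    if iteration >= max_iters:              # iteration budget used up
        return True
    text = feedback.strip().lower()
    if not text:                            # empty critique counts as completion
        return True
    if any(phrase in text for phrase in DONE_PHRASES):
        return True
    score = parse_score(feedback)
    return score is not None and score >= threshold


print(should_stop("Score: 2. Two sentences are still negative.", iteration=1))
print(should_stop("Score: 5. The review now reads as positive.", iteration=1))
```

Every signal except the iteration budget comes from the model's own output, so a predicate like this inherits whatever blind spots the model has when judging its own draft.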

How well does Self-Refine work?

The original NeurIPS paper evaluates Self-Refine on seven generation and reasoning tasks chosen to span open-ended writing, structured rewriting, and quantitative problem solving.^[1]^[3] The headline result is that, in the authors' words, "outputs generated with Self-Refine are preferred by humans and automatic metrics over those generated with the same LLM using conventional one-step generation, improving by ~20% absolute on average in task performance."^[1] The tasks, dataset sizes, and primary metrics are summarised below.

Task	Description	Dataset size	Primary metric
Sentiment Reversal	Rewrite Yelp reviews to flip sentiment	1,000 reviews	Sentiment success rate
Dialogue Response Generation	Produce richer assistant turns	372 conversations	Human / GPT-4 preference
Code Optimization	Speed up Python programs (PIE benchmark)	1,000 programs	% programs improved
Code Readability Improvement	Refactor for clarity	300 programs	Readability score
Math Reasoning	Solve grade-school word problems (GSM8K)	1,319 questions	Solve rate
Acronym Generation	Generate acronyms for titles	250 items	Multi-criterion score
Constrained Generation	CommonGen-Hard with 20-30 keywords	200 prompts	Coverage

Across all tasks, the paper reports that Self-Refine outputs are preferred by humans and by automatic metrics over the same model's single-pass outputs, with an average absolute improvement of approximately 20 percentage points.^[1]^[3] The per-task scores, given here as the base score followed by the Self-Refine score, show where the gains concentrate. For GPT-4 the improvements are largest on tasks with diffuse or multi-dimensional quality criteria: Dialogue Response rose from 25.4 to 74.6 (+49.2 points), Sentiment Reversal from 3.8 to 36.2 (+32.4), Code Readability from 27.4 to 56.2 (+28.8), and Constrained Generation from 15.0 to 45.0 (+30.0).^[3] On the math-reasoning benchmark GSM8K the gain was negligible: GPT-4 scored 92.9 without and 93.1 with Self-Refine, and GPT-3.5 and ChatGPT showed the same near-zero delta.^[3] Code optimization improvements were modest, ranging from +3.6 percentage points (ChatGPT) to +8.7 (GPT-4).^[3]
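The absolute gains follow directly from the base and Self-Refine scores quoted above. The short sketch below does that subtraction for the five GPT-4 figures given in this section and sorts the tasks by gain; the dictionary and function names are illustrative, and only numbers already stated in the text are used.

```python
from typing import Dict, Tuple

# GPT-4 scores as quoted in the text: (base, with Self-Refine)
gpt4_scores: Dict[str, Tuple[float, float]] = {
    "Dialogue Response": (25.4, 74.6),
    "Sentiment Reversal": (3.8, 36.2),
    "Code Readability": (27.4, 56.2),
    "Constrained Generation": (15.0, 45.0),
    "Math Reasoning (GSM8K)": (92.9, 93.1),
}


def absolute_gains(scores: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Self-Refine score minus base score, in percentage points, rounded to 0.1."""
    return {task: round(after - base, 1) for task, (base, after) in scores.items()}


gpt4_gains = absolute_gains(gpt4_scores)
for task, gain in sorted(gpt4_gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task:<26} +{gain:.1f}")
```

Sorted this way, the open-ended tasks sit at the top and GSM8K at the bottom, which is the pattern the next paragraph interprets.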

The authors interpret this pattern as evidence that Self-Refine helps most when "quality" is loosely defined and the model has many surface-level levers (lexical, structural, stylistic) for incremental improvement, and helps least when the answer is a single token whose correctness is decidable by an external checker, such as a numerical math answer.^[1]^[3]

What does each task measure?

Sentiment reversal uses examples from Yelp reviews. The model is given a positive (or negative) review and asked to rewrite it with the opposite polarity while preserving content. The metric is the fraction of rewrites that, when scored by an automatic sentiment classifier, change polarity.^[3] Self-Refine drives the sentiment-reversal rate from a low baseline (single-digit to low-double-digit percentages depending on the model) to 30 to 43 percent.^[3]

Dialogue response generation uses conversations and asks the model to produce richer, more engaging assistant turns. Because there is no single ground-truth response, evaluation is by preference judgement (either by humans or by a GPT-4 judge).^[3] The dialogue task is where Self-Refine showed its largest reported gain, a 49.2-point preference improvement when using GPT-4.^[3]

Code optimization uses the Performance-Improving Edits (PIE) benchmark of Python programs, in which a slow input program must be rewritten into a faster one. The metric is the fraction of programs for which the model's rewrite, when executed, runs faster than the input.^[3]^[8] Self-Refine adds between 3.6 and 8.7 percentage points to the share of improved programs across the three tested models.^[3]
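A "percent of programs improved" metric of this kind can be sketched as follows. This is not the benchmark's evaluation harness: the helper names, the best-of-three timing via `timeit`, and the rule that a rewrite must also return the same result are assumptions made for illustration.

```python
import timeit
from typing import Callable, Iterable, Tuple

Program = Callable[[int], int]


def is_improved(original: Program, rewrite: Program, arg: int, repeats: int = 3) -> bool:
    """True if the rewrite agrees with the original and has a lower best-of-N runtime."""
    if original(arg) != rewrite(arg):
        return False                       # a faster but wrong rewrite does not count
    t_orig = min(timeit.repeat(lambda: original(arg), number=1, repeat=repeats))
    t_new = min(timeit.repeat(lambda: rewrite(arg), number=1, repeat=repeats))
    return t_new < t_orig


def percent_improved(pairs: Iterable[Tuple[Program, Program]], arg: int) -> float:
    """Share of (original, rewrite) pairs, in percent, where the rewrite is an improvement."""
    pairs = list(pairs)
    return 100.0 * sum(is_improved(o, r, arg) for o, r in pairs) / len(pairs)


def pair_sum_quadratic(n: int) -> int:
    """Sum of i + j over all pairs 0 <= i, j < n, with a nested loop (O(n^2))."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += i + j
    return total


def pair_sum_closed_form(n: int) -> int:
    """Same quantity in O(1): each index sums to n(n-1)/2 and is counted n times, twice."""
    return n * n * (n - 1)


print(is_improved(pair_sum_quadratic, pair_sum_closed_form, 300))
```

Wall-clock comparisons are noisy for programs with similar runtimes, so a real harness would likely need repeated runs and some minimum speedup margin before counting a rewrite as improved.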

Code readability improvement uses Python programs and asks the model to refactor them for clarity, with evaluation against a learned readability score; Self-Refine produced gains of about 14 to 35 percentage points depending on the base model.^[3]

Math reasoning uses GSM8K grade-school word problems. The metric is exact-match solve rate. As discussed above, Self-Refine yields essentially no gain on this benchmark for any of the three tested models, a finding the original authors flag and that subsequent critiques use as evidence for the broader claim that intrinsic self-critique does not help on hard-correctness reasoning tasks.^[3]^[4]^[13]

Acronym generation asks the model to produce acronyms for given titles. The metric is a multi-criterion score combining pronounceability, ease of spelling, relevance, and a few related axes. Self-Refine improves the score by 10 to 25 percentage points depending on the base model.^[3]

Constrained generation uses CommonGen-Hard prompts that supply 20 to 30 keywords and require the model to produce a coherent sentence using all of them. The metric is coverage of the supplied keywords. Self-Refine raises coverage by 9 to 30 percentage points across the tested models, with the largest gains on the strongest model.^[3]
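Keyword coverage is simple to compute, and the list of missing keywords is exactly the kind of concrete feedback a refine step can act on. The sketch below uses exact whole-word matching, which is a simplifying assumption: an actual evaluation may also accept inflected forms such as "running" for "run". The function names are hypothetical.

```python
import re
from typing import Iterable, List, Set


def word_set(text: str) -> Set[str]:
    """Lower-cased alphabetic tokens of the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def coverage(sentence: str, keywords: Iterable[str]) -> float:
    """Fraction of keywords that appear as whole words in the sentence."""
    kws = [k.lower() for k in keywords]
    if not kws:
        return 1.0                          # nothing required, so nothing is missing
    present = word_set(sentence)
    return sum(k in present for k in kws) / len(kws)


def missing_keywords(sentence: str, keywords: Iterable[str]) -> List[str]:
    """Keywords not yet used; these can be quoted verbatim in the feedback."""
    present = word_set(sentence)
    return [k for k in keywords if k.lower() not in present]


draft = "The dog chased a ball in the park."
required = ["dog", "ball", "park", "frisbee"]
print(coverage(draft, required), missing_keywords(draft, required))
```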

How does Self-Refine differ from related prompting frameworks?

Self-Refine sits in a family of inference-time techniques that exploit additional decoding compute to improve output quality.^[9] The table below contrasts it with four contemporaneous methods.

Method	Year	Core mechanism	Feedback source	External tools
Self-Consistency	2022	Sample N reasoning paths, majority-vote final answer	None	None
Self-Refine	2023	Iterative draft / critique / revise loop	Same LLM	None
Reflexion (Shinn et al.)	2023	Verbal RL: maintain reflective memory across trials	Environment + LLM	Optional
Tree of Thoughts	2023	Search over partial-solution tree with self-evaluation	LLM evaluator	None
CRITIC (Gou et al.)	2023	Tool-interactive self-critique	LLM + external tools	Yes

Self-Consistency, introduced by Wang et al. in 2022, samples multiple chain-of-thought traces and selects the most frequent final answer. It improves answer accuracy through diversity without ever critiquing any individual trace.^[9] Self-Refine, by contrast, generates and edits a single trajectory, and is therefore best suited to open-ended tasks where the notion of "majority vote" does not apply.^[1]^[9]
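For contrast, the selection step of Self-Consistency reduces to a majority vote over final answers. The sketch below assumes the answers have already been extracted from the sampled reasoning traces; `majority_vote` is an illustrative name, and ties are resolved by whatever order `collections.Counter` happens to return.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent final answer among the sampled traces."""
    counts = Counter(a.strip() for a in answers)
    return counts.most_common(1)[0][0]


# Final answers from five hypothetical chain-of-thought samples for one question.
sampled = ["18", "18", "20", "18 ", "17"]
print(majority_vote(sampled))
```

The vote needs answers that can be compared for equality, which is why it fits short numeric or multiple-choice outputs and has no obvious counterpart for a rewritten review or a dialogue turn.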

Reflexion, by Noah Shinn and colleagues (arXiv 2303.11366, also NeurIPS 2023), is framed as "verbal reinforcement learning."^[10] Reflexion is set in an agent loop where the model receives sparse rewards or environment feedback (for example, a unit-test pass/fail signal) and uses that feedback to update a reflective text memory that persists across trials. Whereas Self-Refine performs free-standing iterative refinement on the output of a single generation call, Reflexion expects an external signal (success/failure or scalar reward) and is therefore well-suited to interactive coding tasks such as HumanEval-style benchmarks with executable tests.^[10] The two methods are sometimes confused but differ in the source of feedback (intrinsic in Self-Refine, extrinsic or hybrid in Reflexion) and in the granularity of intervention (per-draft in Self-Refine, per-episode in Reflexion).^[10]

Tree of Thoughts, by Yao et al., reframes problem solving as deliberate search over a tree of partial solutions, with the LLM acting as both proposer and evaluator at each node. Self-Refine, Self-Consistency, and chain-of-thought can be viewed as special cases of Tree of Thoughts with constrained breadth or depth.^[11]

CRITIC, by Zhibin Gou and colleagues (arXiv 2305.11738, ICLR 2024), externalises the critique step. Instead of asking the model to imagine flaws, CRITIC routes the draft through tools such as web search, code interpreters, or fact checkers and uses the tool output as the feedback signal. Empirically the authors of CRITIC report that removing tool access erodes most of the gains, which they take as evidence that LLMs are unreliable critics of their own work in tasks like open-domain QA.^[12]

How is Self-Refine implemented and used?

The reference implementation lives at https://github.com/madaan/self-refine under an Apache-2.0 license.^[8] The repository contains subdirectories per task (sentiment_yelp, dialogue, acronym, gsm, pie/code optimization, code_readability, commongen) together with the few-shot prompt files used in the paper, a shared driver loop, and evaluation scripts.^[8] A separate project page at https://selfrefine.info collects task demos, example trajectories, and the citation block; an extension called "Visual Self-Refine" experiments with iteratively improving diagram generation through GPT-4V.^[2]^[8]

Self-Refine has been adopted as a building block in several downstream systems and benchmarks. Prompt-engineering libraries and tutorials (including the Learn Prompting curriculum) treat it as one of the canonical self-criticism strategies for LLM applications and contrast it with Reflexion and Self-Consistency.^[7] The method is also frequently re-implemented inside agent frameworks where a "critique then revise" subroutine is plugged into longer pipelines.^[9]

The reference repository organises code by task. The sentiment_yelp directory contains the Yelp-derived sentiment-reversal prompts and the loop that drives feedback and refinement. The dialogue directory contains dialogue response examples with multi-criterion feedback templates. The pie directory targets program improvement, with Python execution scripts that measure wall-clock speedups. The gsm directory contains the GSM8K prompts and evaluation harness. The acronym, code_readability, and commongen directories each provide their own prompt files and evaluation logic.^[8] The colabs directory contains notebooks that allow practitioners to step through the feedback / refine loop interactively.^[8]

Variants and re-implementations have appeared in many later papers, often under names like "iterative self-correction", "critic-refiner loops", or "verbal RL". Among the more notable derivatives are systems that combine Self-Refine style critique with majority-vote selection (Self-Consistency), with retrieval-augmented critique, or with execution-based verifiers in code generation. CRITIC explicitly contrasts itself with Self-Refine by routing the critique step through external tools.^[12] Reflexion, which was posted to arXiv ten days before Self-Refine, is not a derivative but can be read as a closely related design in which the feedback signal includes an external reward and persists across episodes rather than being regenerated from scratch.^[10]

What are its limitations?

While Self-Refine's reported gains on open-ended generation are large, subsequent work has documented important limits, especially on reasoning and planning tasks where the original paper itself reported only tiny improvements on GSM8K.^[3]^[4]^[5] The central caveat is that Self-Refine relies on the model being able to judge its own errors, and that judgement is unreliable precisely on tasks where a single answer is either right or wrong.

The most direct critique comes from Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati at Arizona State University, who argued in "On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks" (arXiv 2402.08115, first posted 12 February 2024, accepted to ICLR 2025) that iterative self-critique frameworks of the Self-Refine family are unreliable when the underlying task has a hard correctness criterion.^[4]^[5] Their study evaluates GPT-4 on three planning and reasoning domains: Game of 24, Graph Coloring, and STRIPS planning. The authors compare an iterative self-critique pipeline against a setup that uses a sound external verifier and find "significant performance collapse with self-critique and significant performance gains with sound external verification."^[4]^[5] They further observe that simply re-prompting the model whenever a sound verifier rejects its answer preserves most of the benefit of more elaborate self-improvement scaffolds.^[4]
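To make "sound external verifier" concrete, the sketch below checks a Game of 24 answer exactly: the expression must use each given number once and evaluate to 24. It is an illustration written for this article, not code from the cited study. The names `verify_24` and `solve_with_verifier` are hypothetical, and `propose` stands in for an LLM call.

```python
import ast
import operator
from collections import Counter
from fractions import Fraction
from typing import Callable, List, Optional

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}


def _eval(node: ast.AST, used: List[int]) -> Fraction:
    """Evaluate +, -, *, / over integer literals exactly, recording each literal used."""
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](_eval(node.left, used), _eval(node.right, used))
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    raise ValueError("unsupported expression")     # names, calls, unary ops, etc.


def verify_24(expr: str, numbers: List[int]) -> bool:
    """Sound check: expr uses exactly the given numbers, once each, and equals 24."""
    used: List[int] = []
    try:
        value = _eval(ast.parse(expr, mode="eval").body, used)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(numbers) and value == 24


def solve_with_verifier(propose: Callable[[List[int], int], str],
                        numbers: List[int], max_tries: int = 5) -> Optional[str]:
    """Re-prompt until the verifier accepts a candidate; never accept an unverified one."""
    for attempt in range(max_tries):
        candidate = propose(numbers, attempt)
        if verify_24(candidate, numbers):
            return candidate
    return None


# Hypothetical proposer: wrong on the first two attempts, right on the third.
guesses = ["8 + 8 + 3 + 3", "8 * 3", "8 / (3 - 8 / 3)"]
print(solve_with_verifier(lambda nums, i: guesses[i], [3, 3, 8, 8], max_tries=3))
```

Because the verifier never accepts a wrong answer, the loop can only return a correct expression or give up. A self-critique loop has no such guarantee, since the same model that made the error is asked to detect it.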

Stechly and co-authors argue that the standard intuition motivating self-correction, namely that verification should be easier than generation, is a classical computational-complexity argument that does not transfer cleanly to LLMs, whose behaviour they characterise as approximate retrieval rather than principled reasoning.^[4] On their domains, prompting GPT-4 to critique its own answer often introduces new errors or fails to recognise correct solutions, so the iterative loop can degrade an initially correct answer.^[4]^[5]

A separate and earlier critique, "Large Language Models Cannot Self-Correct Reasoning Yet" by Jie Huang and colleagues at Google DeepMind (arXiv 2310.01798, ICLR 2024), reaches a similar conclusion for arithmetic, commonsense, and multi-hop QA benchmarks: intrinsic self-correction without external feedback often hurts performance, and apparent gains in prior work sometimes stem from using oracle ground-truth labels to decide when to stop refining.^[13]

Beyond reasoning failures, several other limitations are acknowledged in the Self-Refine paper itself and in secondary coverage: the method requires a strong instruction-following base model (gains are smaller or negative for weak models that cannot produce useful feedback), evaluations were performed only on English text, the use of LLM-based or human preference judgements on highly preference-sensitive tasks may inflate apparent improvements, and the technique multiplies inference cost by the number of iterations.^[1]^[7]

A practical caveat is that the framework's gains are sensitive to prompt design: poorly chosen feedback demonstrations can produce vague critiques that yield non-monotonic or even regressive refinement trajectories, and stopping criteria that rely on the model's own self-assessment of completeness are vulnerable to the same self-verification problem highlighted by Stechly et al.^[4]^[7]

How much does Self-Refine cost to run?

Each refinement iteration triggers two additional LLM calls (one for feedback, one for the rewrite). With a maximum of four iterations, the worst case is nine model calls (one initial draft plus four feedback / refine pairs) against one call for single-pass generation.^[1]^[3] Token cost grows faster than the call count, because each later prompt also carries the earlier drafts and feedback. In practice the average cost is lower because of early stopping, but Self-Refine consistently uses substantially more tokens per example than single-pass baselines. Whether the trade-off is favourable depends on the value of the additional quality and the price of generation. As a test-time compute technique, Self-Refine sits in the same family as repeated sampling with majority vote, beam search over reasoning steps, and Tree of Thoughts: all of them buy quality with additional compute.^[9]^[11]
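The call count can be worked out mechanically. The sketch below follows the loop as written in this article's pseudocode, where an early stop is detected by a feedback call that is not followed by a rewrite; it counts calls only, not tokens, and the function names and the example distribution of stopping points are illustrative assumptions.

```python
from typing import Dict


def worst_case_calls(max_iters: int) -> int:
    """One initial draft plus a feedback call and a refine call per iteration."""
    return 1 + 2 * max_iters


def calls_used(refinements: int, max_iters: int = 4) -> int:
    """LLM calls for a run that performed `refinements` rewrites.

    A run that stops early pays for one extra feedback call: the one whose
    critique said the draft was good enough.
    """
    if not 0 <= refinements <= max_iters:
        raise ValueError("refinements must be between 0 and max_iters")
    calls = 1 + 2 * refinements
    return calls if refinements == max_iters else calls + 1


def mean_calls(share_by_refinements: Dict[int, float], max_iters: int = 4) -> float:
    """Average calls per example, given the share of examples at each stopping point."""
    return sum(share * calls_used(r, max_iters)
               for r, share in share_by_refinements.items())


print(worst_case_calls(4))
# Hypothetical mix: half the examples stop after one rewrite, the rest after two.
print(mean_calls({1: 0.5, 2: 0.5}))
```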

What are the risks and misuse concerns?

Because Self-Refine relies on instruction-following models that are also capable of generating harmful content, the same loop can in principle be aimed at producing more refined toxic, deceptive, or copyright-infringing outputs. The original paper notes that the technique was evaluated only on English-language text and that its applicability to other languages, dialects, or low-resource settings is not guaranteed.^[1] In safety-sensitive deployments, practitioners typically combine intrinsic self-refinement with external safety filters and content classifiers rather than rely on the model's own critique.^[9]

Why does Self-Refine matter?

Despite the reasoning-task caveats, Self-Refine remains an influential reference point in the literature on test-time compute and on prompting frameworks for large language models.^[9]^[11] It demonstrated that natural-language critique, expressed in the same channel as generation, can serve as a useful intermediate signal between scalar rewards and full external verification, and it provided one of the first systematic studies of intrinsic iterative refinement across heterogeneous task families.^[1]^[3] The published code and prompts have made it straightforward to compare new self-correction proposals against a common baseline.^[8]

In subsequent years the field has moved toward hybrid systems that combine self-critique with external verifiers (CRITIC, agentic tool-use), with verified rewards (Reflexion variants), or with test-time compute scaling laws over search trees and verifiers.^[11]^[12] Self-Refine occupies a notable position in that lineage as the cleanest demonstration that, for tasks where quality is multidimensional and there is no oracle verifier, a model can usefully edit its own draft.^[1]^[9]

The broader contribution is methodological: by isolating the critique-and-rewrite loop as the only intervention and holding the underlying model fixed, the paper made it possible to study the effect of intrinsic self-feedback as an independent variable. Subsequent papers that combine self-feedback with sampling, with tools, with training, or with search build on a baseline that Self-Refine helped to standardise.^[1]^[9] The negative result on GSM8K is in retrospect one of the most informative parts of the original work, since it telegraphed the limitations that Stechly et al. and Huang et al. would later document at greater scope.^[3]^[4]^[13]

Timeline

30 March 2023: arXiv preprint v1 posted as 2303.17651.^[1]
25 May 2023: arXiv preprint v2 posted with additional analysis and tasks.^[1]
21 September 2023: paper accepted to NeurIPS 2023 (decision recorded on OpenReview).^[3]
3 October 2023: Huang et al. post "Large Language Models Cannot Self-Correct Reasoning Yet" (later ICLR 2024).^[13]
December 2023: presented as a poster at NeurIPS 2023 in New Orleans.^[6]
12 February 2024: Stechly, Valmeekam, and Kambhampati post the "Self-Verification Limitations" critique.^[4]
22 January 2025: Stechly et al.'s critique is accepted to ICLR 2025.^[5]

References

1. Aman Madaan et al., "Self-Refine: Iterative Refinement with Self-Feedback", arXiv, 2023-03-30 (v1) and 2023-05-25 (v2). https://arxiv.org/abs/2303.17651. Accessed 2026-05-21.
2. Aman Madaan et al., "Self-Refine: Iterative Refinement with Self-Feedback" (project page), selfrefine.info, 2023. https://selfrefine.info/. Accessed 2026-05-21.
3. Aman Madaan et al., "Self-Refine: Iterative Refinement with Self-Feedback", NeurIPS 2023 (OpenReview), 2023-09-21. https://openreview.net/forum?id=S37hOerQLB. Accessed 2026-05-21.
4. Kaya Stechly, Karthik Valmeekam, Subbarao Kambhampati, "On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks", arXiv, 2024-02-12. https://arxiv.org/abs/2402.08115. Accessed 2026-05-21.
5. Kaya Stechly, Karthik Valmeekam, Subbarao Kambhampati, "On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks", OpenReview / ICLR 2025, 2025-01-22. https://openreview.net/forum?id=4O0v4s3IzY. Accessed 2026-05-21.
6. NeurIPS, "Self-Refine: Iterative Refinement with Self-Feedback" (poster page), neurips.cc, 2023. https://neurips.cc/virtual/2023/poster/71632. Accessed 2026-05-21.
7. Learn Prompting, "Self-Refine: Iterative Refinement with Self-Feedback for LLMs", learnprompting.org, 2024. https://learnprompting.org/docs/advanced/self_criticism/self_refine. Accessed 2026-05-21.
8. Aman Madaan et al., "madaan/self-refine" (source code repository), GitHub, 2023. https://github.com/madaan/self-refine. Accessed 2026-05-21.
9. Athina AI, "Self-Refine: Iterative Refinement with Self-Feedback", blog.athina.ai, 2024. https://blog.athina.ai/self-refine-iterative-refinement-with-self-feedback. Accessed 2026-05-21.
10. Noah Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning", arXiv, 2023-03-20. https://arxiv.org/abs/2303.11366. Accessed 2026-05-21.
11. Shunyu Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models", NeurIPS 2023, 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/271db9922b8d1f4dd7aaef84ed5ac703-Paper-Conference.pdf. Accessed 2026-05-21.
12. Zhibin Gou et al., "CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing", arXiv, 2023-05-19. https://arxiv.org/abs/2305.11738. Accessed 2026-05-21.
13. Jie Huang et al., "Large Language Models Cannot Self-Correct Reasoning Yet", arXiv, 2023-10-03. https://arxiv.org/abs/2310.01798. Accessed 2026-05-21.
