Reflexion is a framework for reinforcing language agents through verbal feedback rather than traditional weight updates. Introduced in a 2023 paper by Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao, Reflexion enables agents to learn from their mistakes by generating natural language self-critiques and storing them in an episodic memory buffer. The agent then uses these reflections as context in subsequent attempts, improving its performance over successive trials without any fine-tuning or gradient-based optimization. The paper was published at NeurIPS 2023 and has since become one of the most widely cited works in the emerging field of LLM-based agent design.
Traditional reinforcement learning (RL) methods teach agents to improve through trial and error, but they typically require large numbers of training samples, carefully shaped reward functions, and expensive gradient updates to model parameters. When the agent in question is a large language model with billions of parameters, the cost of RL-based fine-tuning becomes substantial. Approaches like RLHF (Reinforcement Learning from Human Feedback) have demonstrated success in aligning language models, but they demand significant compute resources and carefully curated preference data.
At the same time, researchers observed that LLMs already possess strong reasoning abilities and can generate useful self-assessments when prompted appropriately. The ReAct framework, developed by Shunyu Yao and colleagues at Princeton, showed that LLMs could interleave reasoning traces with actions in an environment, but ReAct agents had no mechanism for learning from failures across episodes. If a ReAct agent failed a task, it would make the same types of mistakes when given the same problem again.
Reflexion was designed to fill this gap. Instead of updating model weights, Reflexion converts environmental feedback (whether scalar rewards, binary success/failure signals, or free-form text) into natural language reflections. These reflections are stored in memory and provided as additional context to the agent on its next attempt. The core insight is that language itself can serve as a reinforcement signal, and that an LLM can improve its behavior by reading and reasoning about its own past failures.
The Reflexion framework consists of three distinct components that work together in an iterative loop: the Actor, the Evaluator, and the Self-Reflection model.
The Actor is the LLM-based agent that generates text and takes actions in the given environment. Depending on the task, the Actor may use different prompting strategies. For sequential decision-making tasks, the Actor typically uses the ReAct prompting approach, which interleaves reasoning ("Thought") steps with action steps. For reasoning tasks, it may use Chain-of-Thought (CoT) prompting. The Actor is augmented with a memory component that provides additional context drawn from previous episodes.
Formally, the Actor generates a trajectory $\tau_t$ at trial $t$ by interacting with the environment. The trajectory consists of a sequence of observations, thoughts, and actions. The Actor conditions its generation on the current environment state, any few-shot examples provided in the prompt, and the contents of its long-term memory (the stored reflections from previous trials).
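To make this concrete, here is a minimal sketch of how an Actor prompt might be assembled from the task, few-shot examples, and stored reflections. The function name and layout are illustrative, not the authors' implementation:

```python
def build_actor_prompt(task: str, few_shot: list[str], reflections: list[str]) -> str:
    """Assemble the Actor's prompt: few-shot examples, then any stored
    reflections from failed trials, then the task itself."""
    parts = list(few_shot)  # few-shot ReAct or CoT demonstrations
    if reflections:
        parts.append("Reflections from previous failed attempts:")
        parts.extend(f"- {r}" for r in reflections)
    parts.append(f"Task: {task}")
    return "\n".join(parts)
```

On the first trial the reflections list is empty and the prompt reduces to a standard few-shot prompt; on later trials the reflections section is what carries the learned experience.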
The Evaluator assesses the quality of a trajectory produced by the Actor and outputs a reward signal. The nature of the Evaluator varies by task domain:
| Task domain | Evaluator type | How it works |
|---|---|---|
| Sequential decision-making (AlfWorld) | Heuristic rules + LLM classification | Detects repeated actions (same action 3+ cycles), excessive trajectory length (>30 actions), and uses an LLM to classify success or failure |
| Reasoning (HotpotQA) | Exact-match grading | Compares the agent's final answer against the ground-truth answer string |
| Programming (HumanEval, MBPP) | Test execution | Runs the generated code against a suite of test cases and checks for pass/fail |
The Evaluator provides the feedback signal that drives the self-reflection loop. In some configurations, the feedback is binary (success or failure); in others, it may include more granular information such as specific test case results or error messages.
The Self-Reflection model is the component that distinguishes Reflexion from other agent architectures. It takes as input the current trajectory, the reward signal from the Evaluator, and the agent's existing memory, then generates a natural language reflection summarizing what went wrong and suggesting concrete improvements for the next attempt.
For example, after a failed coding attempt, the Self-Reflection model might produce output like: "The implementation failed because it did not handle the edge case where the input list is empty. In the next attempt, I should add a check at the beginning of the function to return an empty list if the input is empty."
These reflections are stored in an episodic memory buffer. The memory has a bounded capacity (typically 1 to 3 recent reflections) to fit within the LLM's context window. When memory is full, older reflections are discarded using a sliding window approach.
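A bounded buffer with sliding-window eviction can be sketched with `collections.deque`; the class name is illustrative:

```python
from collections import deque

class EpisodicMemory:
    """Bounded reflection buffer: when full, the oldest reflection
    is discarded automatically (sliding window)."""
    def __init__(self, capacity: int = 3):
        self.buffer = deque(maxlen=capacity)

    def add(self, reflection: str) -> None:
        self.buffer.append(reflection)

    def as_context(self) -> str:
        """Render stored reflections for inclusion in the Actor's prompt."""
        return "\n".join(self.buffer)
```

The small default capacity mirrors the 1-to-3 reflection limit described above, keeping the memory within the LLM's context window.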
Reflexion distinguishes between two types of memory: short-term memory, the trajectory history of the current episode, and long-term memory, the self-reflections accumulated across episodes.
The long-term memory acts as an experience pool that grows over successive trials. Each reflection encodes specific, actionable feedback (such as "avoid using the go-to action before checking inventory" or "the recursive solution causes a stack overflow for large inputs; use an iterative approach instead"). When the Actor begins a new trial, these reflections are prepended to its prompt, giving it concrete guidance on what to do differently.
The Reflexion algorithm follows a simple iterative procedure:

1. The Actor generates a trajectory by interacting with the environment, conditioned on any reflections stored in memory.
2. The Evaluator scores the trajectory and emits a reward signal.
3. If the trial succeeds (or the trial budget is exhausted), the loop ends.
4. Otherwise, the Self-Reflection model converts the trajectory and reward into a natural language reflection.
5. The reflection is appended to the episodic memory, and the Actor begins the next trial.
This process mirrors how a human might approach a difficult problem: try a solution, observe where it fails, think about what went wrong, and adjust the approach for the next attempt. The difference is that in Reflexion, every step is performed by an LLM, and the "thinking about what went wrong" is itself a language generation task.
The Reflexion paper evaluated the framework across three distinct task categories: sequential decision-making, knowledge-intensive reasoning, and code generation. The base language model used was GPT-4 in most experiments.
AlfWorld is a text-based game environment where an agent must complete household tasks (such as finding and placing objects) by issuing text commands. The environment provides text observations describing the current state of the world.
| Method | Tasks solved (out of 134) | Success rate |
|---|---|---|
| ReAct (baseline) | ~100 | ~75% |
| ReAct + Reflexion | 130 | 97% |
Reflexion improved performance by an absolute 22% over the ReAct baseline after 12 iterative learning steps, bringing the success rate to 97% (130 out of 134 tasks). The authors found that a common failure mode for the baseline ReAct agent was "hallucinating" that it possessed an item when it did not. Reflexion effectively eliminated this class of errors because the self-reflection step would identify the hallucination and instruct the agent to verify item possession before attempting to use or place items.
The Evaluator for AlfWorld used a combination of heuristic rules (detecting if the agent repeated the same action three or more times in a row, or took more than 30 total actions) and an LLM-based binary classifier to determine success or failure.
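Those heuristic rules are simple to express in code. The following is an illustrative sketch, not the authors' implementation, flagging trajectories that repeat the same action three or more times consecutively or exceed 30 total actions:

```python
def alfworld_heuristic_fail(actions: list[str],
                            max_repeat: int = 3,
                            max_len: int = 30) -> bool:
    """Heuristic failure detection for AlfWorld-style trajectories:
    True if the agent repeated one action `max_repeat`+ times in a row
    or took more than `max_len` actions overall."""
    if len(actions) > max_len:
        return True
    run = 1  # length of the current streak of identical actions
    for prev, cur in zip(actions, actions[1:]):
        run = run + 1 if cur == prev else 1
        if run >= max_repeat:
            return True
    return False
```

Trajectories that pass these cheap checks would then go to the LLM-based classifier for the final success/failure judgment.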
HotpotQA is a multi-hop question answering dataset that requires reasoning across multiple Wikipedia paragraphs. The agent must search for and synthesize information from different sources to answer questions.
| Method | Accuracy |
|---|---|
| Chain-of-Thought (CoT) baseline | ~34% |
| CoT + Reflexion | ~54% |
| ReAct baseline | Lower than CoT |
| ReAct + Reflexion | Higher than CoT + Reflexion |
Reflexion improved performance by approximately 20 percentage points over the respective baselines. Notably, the baseline CoT agent showed no improvement across multiple trials (since it had no mechanism to learn from failures), while the Reflexion-augmented agent showed consistent improvement with each successive trial.
An ablation study on HotpotQA revealed the contribution of each component:
| Configuration | Accuracy |
|---|---|
| Baseline CoT (with ground truth context) | 61% |
| Baseline + Episodic Memory (last trajectory only) | 69% |
| Full Reflexion (Episodic Memory + Self-Reflection) | 77% |
The self-reflection component contributed an 8-percentage-point improvement beyond what episodic memory alone provided. This demonstrates that the natural language reflections add value above and beyond simply showing the agent its previous trajectory.
The code generation experiments were among the most striking results in the paper. Reflexion achieved state-of-the-art pass@1 accuracy on several benchmarks.
| Benchmark | Language | Previous SOTA | GPT-4 (baseline) | Reflexion + GPT-4 |
|---|---|---|---|---|
| HumanEval | Python | 65.8% (CodeT) | 80.1% | 91.0% |
| HumanEval | Rust | N/A | 60.0% | 68.0% |
| MBPP | Python | 67.7% (CodeT) | 80.1% | 77.1% |
| MBPP | Rust | N/A | 70.9% | 75.4% |
| LeetcodeHardGym | Python | N/A | 7.5% | 15.0% |
On HumanEval (Python), Reflexion achieved 91% pass@1 accuracy, surpassing GPT-4's baseline of 80.1% by nearly 11 percentage points. This was a notable result because it showed that a relatively simple iterative self-correction loop could push code generation accuracy well beyond what the base model achieved in a single attempt.
The code generation approach in Reflexion introduced an additional innovation: self-generated test suites. Before writing the implementation, the agent first generates a set of unit tests (up to six) based on the problem description. These tests are filtered for syntactic validity using abstract syntax tree (AST) parsing. The agent then iteratively refines its implementation against these self-generated tests, using the test results as feedback for the self-reflection step.
This test-first approach shifts what the authors called the "accuracy bottleneck" from correct code generation to correct test generation. The reasoning is that generating accurate tests for a function (given its specification) is generally easier than generating the correct implementation. If the agent can produce a diverse and accurate set of tests, it can use those tests as a reliable feedback signal for iterative refinement.
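The AST-based filtering step can be sketched with Python's standard `ast` module. The function name is illustrative; the cap of six tests follows the description above:

```python
import ast

def filter_valid_tests(candidate_tests: list[str], max_tests: int = 6) -> list[str]:
    """Keep only syntactically valid test statements by attempting to
    parse each candidate; discard anything that fails to parse."""
    valid = []
    for test in candidate_tests:
        try:
            ast.parse(test)
            valid.append(test)
        except SyntaxError:
            continue
    return valid[:max_tests]
```

Note that parsing only guarantees the tests are well-formed Python, not that they are semantically correct; a syntactically valid but wrong test can still mislead the refinement loop.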
An ablation study on the 50 hardest HumanEval problems (in Rust) confirmed that both components were necessary:
| Configuration | Accuracy |
|---|---|
| Base GPT-4 | 60% |
| Without test generation | 52% |
| Without self-reflection | 60% |
| Full Reflexion | 68% |
Removing test generation actually hurt performance (dropping to 52%), while removing self-reflection kept performance at the baseline level (60%). Only the full Reflexion system, combining both self-generated tests and verbal self-reflection, achieved the highest accuracy of 68%.
The authors also introduced LeetcodeHardGym, a new benchmark consisting of 40 hard-level problems from Leetcode, all released after October 2022 so that they fall outside GPT-4's training data and avoid contamination. On this benchmark, Reflexion doubled GPT-4's baseline performance from 7.5% to 15.0%, though the absolute numbers remained low, reflecting the genuine difficulty of competitive programming problems.
Reflexion sits within a broader landscape of techniques for improving LLM agent performance. Several related methods address similar goals but differ in their mechanisms.
Chain-of-Thought (CoT) prompting encourages the LLM to "think step by step" before producing a final answer. While CoT improves reasoning on single attempts, it provides no mechanism for learning from failed attempts. A CoT agent that fails on a problem will produce the same (or similar) incorrect reasoning if given the same problem again. Reflexion can use CoT as the Actor's prompting strategy while adding the outer self-reflection loop for cross-episode learning.
ReAct interleaves reasoning traces ("Thought") with environment actions ("Action") and observations ("Observation"). ReAct agents can use tools, search the web, and interact with APIs. Reflexion extends ReAct by adding a self-reflection step after each episode and maintaining episodic memory across episodes. In the AlfWorld experiments, ReAct + Reflexion significantly outperformed ReAct alone (97% vs. roughly 75%).
Self-Refine (Madaan et al., 2023) is an approach where an LLM iteratively refines its own output within a single episode. The model generates an initial output, critiques it, and produces a revised version, repeating this cycle several times. The key difference from Reflexion is that Self-Refine operates within a single trial and does not maintain memory across episodes. Reflexion's cross-episode memory means it can learn from fundamentally different approaches rather than just polishing the same initial solution.
Tree of Thoughts (ToT) explores multiple reasoning paths simultaneously, evaluating partial solutions and backtracking when a path appears unpromising. ToT is a search-time strategy that operates within a single problem-solving episode. Reflexion, by contrast, operates across episodes, learning from complete failed attempts. The two approaches are complementary: an agent could use ToT within each trial and Reflexion across trials.
LATS (Zhou et al., 2023), published at ICML 2024, explicitly combines ideas from Reflexion, Tree of Thoughts, and Monte Carlo Tree Search. LATS uses an LLM as both the agent and the value function, performing tree search over possible action sequences while incorporating self-reflection for backtracking decisions. LATS can be viewed as a unification of the reasoning, acting, and planning components from these earlier frameworks.
Conventional RL methods (such as PPO or policy gradient algorithms) update model weights based on reward signals. This requires backpropagation through the model, which for large language models is computationally expensive. Reflexion avoids weight updates entirely, relying instead on the LLM's in-context learning abilities. The tradeoff is that Reflexion's improvements are not permanently baked into the model; they exist only in the memory buffer. If the memory is cleared, the agent reverts to its base performance.
| Method | Learning mechanism | Cross-episode memory | Weight updates required | Feedback type |
|---|---|---|---|---|
| Chain-of-Thought | Single-episode prompting | No | No | None |
| ReAct | Single-episode reasoning + acting | No | No | Environment observations |
| Self-Refine | Within-episode iteration | No | No | Self-critique |
| Tree of Thoughts | Within-episode search | No | No | Self-evaluation |
| Traditional RL (PPO, etc.) | Gradient-based optimization | Yes (in weights) | Yes | Scalar rewards |
| Reflexion | Verbal reinforcement across episodes | Yes (in memory buffer) | No | Natural language reflections |
| LATS | Tree search + reflection | Yes | No | Self-reflection + value estimates |
Reflexion offers several practical advantages over alternative approaches:
No weight updates required. Because Reflexion stores its learned experience in natural language memory rather than model parameters, it works with any LLM, including closed-source models accessible only through APIs (such as GPT-4 or Claude). There is no need for access to model gradients or the ability to run backpropagation.
Interpretable learning. The self-reflections stored in memory are human-readable natural language. A developer or researcher can inspect the agent's memory to understand what it has learned and why it changed its behavior. This is a significant advantage over traditional RL, where learned behaviors are encoded in opaque weight matrices.
Nuanced feedback. Scalar reward signals (such as a score of 0 or 1) provide limited information about what went wrong. Reflexion converts these sparse signals into detailed, actionable natural language feedback. A reflection like "the function fails on negative inputs because the absolute value conversion is missing" carries far more information than a binary failure signal.
Lightweight implementation. Reflexion requires only prompt engineering and a memory buffer. It does not require training infrastructure, GPU clusters, or custom training loops. This makes it accessible to practitioners who want to build better agents without significant engineering overhead.
Flexibility across feedback types. The framework can incorporate scalar values, binary signals, or free-form natural language as feedback from the environment. This flexibility allows Reflexion to be applied across diverse task domains.
Despite its strengths, Reflexion has several important limitations that the authors and subsequent researchers have identified.
Like any optimization process, Reflexion can get stuck in local minima where the agent repeatedly tries slight variations of the same flawed approach. The self-reflection model may not always identify the fundamental issue, instead suggesting superficial changes that do not address the root cause of failure. The authors demonstrated this problem on the WebShop benchmark, where the agent needed to navigate an e-commerce website to find and purchase products. After testing a ReAct + Reflexion agent across 100 environments, the runs were terminated after only four trials because the agent showed no signs of improvement. The agent produced unhelpful self-reflections and could not escape its initial strategy.
The episodic memory is bounded by the LLM's context window. With a typical memory capacity of 1 to 3 reflections, the agent can only draw on a limited amount of past experience. As tasks become more involved and require learning many distinct lessons, this memory bottleneck becomes a constraint. Older reflections must be discarded to make room for newer ones, potentially losing useful information.
The entire framework relies on the LLM's ability to accurately assess its own performance and generate useful self-critiques. If the LLM cannot correctly identify why it failed, the reflections will be misleading, potentially making performance worse. This is particularly problematic for tasks where the LLM lacks the domain knowledge to diagnose its own errors.
The code generation approach depends on the agent's ability to produce correct test cases. For certain types of programs, generating accurate tests is difficult or impossible: non-deterministic functions, functions that interact with external APIs, hardware-dependent behavior, and concurrent programs all pose challenges for automated test generation. If the self-generated tests are incorrect, the agent may iteratively "fix" working code to satisfy broken tests.
Reflexion's improvements exist only in the memory buffer. If the buffer is cleared, or if the agent encounters a new task with no relevant prior reflections, it starts from scratch. The lessons learned during one problem-solving session do not transfer to future sessions unless the memory is explicitly carried over. This contrasts with fine-tuning approaches, where improvements are permanently encoded in the model's weights.
For Reflexion to work well, the agent needs to be able to try meaningfully different approaches across trials. If the task space is such that small perturbations to the agent's strategy do not yield useful signal, or if the agent cannot generate sufficiently diverse strategies, the self-reflection loop may not converge to a solution.
Reflexion has had considerable influence on the design of LLM-based agents since its publication. As of early 2025, the paper had accumulated over 2,000 citations according to Semantic Scholar, making it one of the most cited papers in the agent research space.
The framework's ideas have been adopted in several practical systems and frameworks, including reference implementations in agent libraries such as LangGraph.
The broader concept of "reflection" in AI agents, while not invented by the Reflexion paper, was significantly advanced by it. Andrew Ng identified reflection as one of four key agentic design patterns, citing Reflexion as a foundational example. The pattern of generating output, evaluating it, reflecting on failures, and trying again has become a standard component in many production agent systems.
As part of the Reflexion paper, the authors introduced LeetcodeHardGym, a new benchmark for evaluating code generation on genuinely difficult programming problems. The benchmark consists of 40 hard-level Leetcode problems that were published after October 2022, placing them outside the training data of models like GPT-4 at the time of the paper's writing.
The benchmark was designed to test code generation systems on problems that require algorithmic reasoning, data structure knowledge, and careful edge-case handling. Unlike HumanEval, which primarily tests basic programming competency, LeetcodeHardGym problems often require sophisticated algorithms like dynamic programming, graph traversal, and advanced data structures.
The code and benchmark data are publicly available on the project's GitHub repository.
The authors released their full implementation on GitHub, including code for the decision-making, reasoning, and programming experiments as well as the LeetcodeHardGym benchmark data.
Reflexion has also been reimplemented in various agent frameworks. LangGraph provides a reference implementation that developers can adapt for their own use cases. The key implementation requirements are straightforward: an LLM for the Actor, an evaluation function appropriate to the task, a self-reflection prompt, and a memory buffer to store reflections.
A minimal Reflexion loop in pseudocode looks like this:
```python
def reflexion_loop(task, actor, evaluator, self_reflect,
                   max_trials=10, max_memory=3):
    memory = []                                    # episodic memory buffer
    for trial in range(max_trials):
        trajectory = actor.run(task, memory)       # Actor conditions on reflections
        reward = evaluator.evaluate(trajectory)    # Evaluator scores the attempt
        if reward == SUCCESS:
            return trajectory
        # Convert the failure into a verbal reflection and store it
        reflection = self_reflect(trajectory, reward, memory)
        memory.append(reflection)
        if len(memory) > max_memory:               # sliding-window eviction
            memory.pop(0)
    return FAILURE
```