# Reflexion

> Source: https://aiwiki.ai/wiki/reflexion
> Updated: 2026-06-24
> Categories: AI Agents, Machine Learning, Reasoning Models
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Reflexion** is a 2023 framework for reinforcing [language agents](/wiki/ai_agents) through verbal self-reflection rather than weight updates: the agent reflects in natural language on feedback from failed attempts, stores those reflections in an episodic memory buffer, and reads them back on later attempts to make better decisions.[1] Introduced by Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao, Reflexion let a [GPT-4](/wiki/gpt-4)-based agent reach 91% pass@1 on the [HumanEval](/wiki/humaneval) coding benchmark, surpassing the previous state-of-the-art GPT-4 result of 80%.[1] Because the learning lives in a text memory and not in the model's parameters, Reflexion works with any [large language model](/wiki/large_language_model), including closed-source models accessed only through an API, with no [fine-tuning](/wiki/fine_tuning) or gradient-based optimization.[1] The paper, "Reflexion: Language Agents with Verbal Reinforcement Learning," was published at [NeurIPS](/wiki/neurips) 2023[1] and has become one of the most cited works in [LLM](/wiki/large_language_model)-based agent design.[22] The NeurIPS proceedings credit the five authors above; the arXiv version of the paper additionally lists Edward Berman as a co-author.[1]

The authors framed the goal in human terms. "Our goal was to create AI agents that learn by reflecting on failures and enhancing their results, much like humans do," Shinn and Gopinath wrote.[7] In the paper's own words, Reflexion agents "verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials."[1]

## What problem does Reflexion solve?

Traditional [reinforcement learning](/wiki/reinforcement_learning) (RL) methods teach agents to improve through trial and error, but they typically require large numbers of training samples, carefully shaped reward functions, and expensive gradient updates to model parameters.[1] When the agent in question is a [large language model](/wiki/large_language_model) with billions of parameters, the cost of RL-based fine-tuning becomes substantial. Approaches like [RLHF](/wiki/reinforcement_learning_from_human_feedback) (Reinforcement Learning from Human Feedback) have demonstrated success in aligning language models, but they demand significant compute resources and carefully curated preference data.

At the same time, researchers observed that LLMs already possess strong [reasoning](/wiki/reasoning) abilities and can generate useful self-assessments when prompted appropriately. The [ReAct](/wiki/react_prompting) framework, developed by Shunyu Yao and colleagues at Princeton, showed that LLMs could interleave reasoning traces with actions in an environment,[2] but ReAct agents had no mechanism for learning from failures across episodes. If a ReAct agent failed a task, it would make the same types of mistakes when given the same problem again.

Reflexion was designed to fill this gap. Instead of updating model weights, Reflexion converts environmental feedback (whether scalar rewards, binary success/failure signals, or free-form text) into natural language reflections.[1] These reflections are stored in memory and provided as additional context to the agent on its next attempt. The core insight is that language itself can serve as a reinforcement signal, and that an LLM can improve its behavior by reading and reasoning about its own past failures.

## How does Reflexion work?

The Reflexion framework consists of three distinct components that work together in an iterative loop: the Actor, the Evaluator, and the Self-Reflection model.[1] At a high level, "Reflexion converts binary or scalar feedback from the environment into verbal feedback in the form of a textual summary, which is then added as additional context for the LLM agent in the next episode."[1]

### Actor

The Actor is the LLM-based agent that generates text and takes actions in the given environment. The paper describes it as built "upon a large language model (LLM) that is specifically prompted to generate the necessary text and actions conditioned on the state observations."[1] Depending on the task, the Actor may use different [prompting](/wiki/prompt_engineering) strategies. For sequential decision-making tasks, the Actor typically uses the [ReAct](/wiki/react_prompting) prompting approach, which interleaves reasoning ("Thought") steps with action steps.[2] For reasoning tasks, it may use [Chain-of-Thought](/wiki/chain_of_thought) (CoT) prompting.[3] The Actor is augmented with a memory component that provides additional context drawn from previous episodes.[1]

Formally, the Actor generates a trajectory \(\tau_t\) at trial \(t\) by interacting with the environment. The trajectory consists of a sequence of observations, thoughts, and actions. The Actor conditions its generation on the current environment state, any few-shot examples provided in the prompt, and the contents of its long-term memory (the stored reflections from previous trials).[1]

### Evaluator

The Evaluator assesses the quality of a trajectory produced by the Actor and outputs a reward signal.[1] The nature of the Evaluator varies by task domain:

| Task domain | Evaluator type | How it works |
|---|---|---|
| Sequential decision-making ([AlfWorld](/wiki/alfworld)) | Heuristic rules + LLM classification | Detects repeated actions (same action 3+ cycles), excessive trajectory length (>30 actions), and uses an LLM to classify success or failure |
| Reasoning ([HotpotQA](/wiki/hotpotqa)) | Exact-match grading | Compares the agent's final answer against the ground-truth answer string |
| Programming ([HumanEval](/wiki/humaneval), [MBPP](/wiki/mbpp)) | Test execution | Runs the generated code against a suite of test cases and checks for pass/fail |

The Evaluator provides the feedback signal that drives the self-reflection loop.[1] In some configurations, the feedback is binary (success or failure); in others, it may include more granular information such as specific test case results or error messages.
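The rule-based evaluators in the table are simple enough to sketch directly. The following is an illustrative sketch, not the authors' code: the function names `is_stuck` and `exact_match` are hypothetical, the thresholds are taken from the table above, and the whitespace and case normalization in `exact_match` is an assumption added for the example.

```python
from typing import List

MAX_ACTIONS = 30  # trajectory-length limit quoted in the table above
MAX_REPEATS = 3   # same action this many times in a row counts as stuck


def is_stuck(actions: List[str]) -> bool:
    """Return True if the trajectory trips either heuristic (hypothetical helper)."""
    if len(actions) > MAX_ACTIONS:
        return True
    run = 1  # length of the current run of identical consecutive actions
    for prev, cur in zip(actions, actions[1:]):
        run = run + 1 if cur == prev else 1
        if run >= MAX_REPEATS:
            return True
    return False


def _normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def exact_match(prediction: str, gold: str) -> bool:
    """Binary reward for question answering: normalized string equality."""
    return _normalize(prediction) == _normalize(gold)
```

A trajectory such as `["look", "look", "look"]` would be flagged by `is_stuck`, while a final answer of `" Paris "` would match a gold answer of `"paris"` under the assumed normalization.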

### Self-Reflection model

The Self-Reflection model is the component that distinguishes Reflexion from other agent architectures. The paper describes it as an LLM that "plays a crucial role in the Reflexion framework by generating verbal self-reflections to provide valuable feedback for future trials."[1] It takes as input the current trajectory, the reward signal from the Evaluator, and the agent's existing memory, then generates a natural language reflection summarizing what went wrong and suggesting concrete improvements for the next attempt.[1]

For example, after a failed coding attempt, the Self-Reflection model might produce output like: "The implementation failed because it did not handle the edge case where the input list is empty. In the next attempt, I should add a check at the beginning of the function to return an empty list if the input is empty."

These reflections are stored in an episodic memory buffer. The memory has a bounded capacity (typically 1 to 3 recent reflections) to fit within the LLM's [context window](/wiki/context_window).[1] When memory is full, older reflections are discarded using a sliding window approach.
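One simple way to get this sliding-window behavior in Python is a bounded `collections.deque`. This is a minimal sketch under the assumption that the buffer holds at most three reflections; the constant name `MEMORY_SIZE` and the placeholder reflection strings are illustrative only.

```python
from collections import deque

MEMORY_SIZE = 3  # assumed capacity, the upper end of the 1-3 range above

# A deque with maxlen silently drops the oldest item when a new one is appended.
memory = deque(maxlen=MEMORY_SIZE)
for trial in range(1, 6):
    memory.append(f"reflection from trial {trial}")

print(list(memory))
# ['reflection from trial 3', 'reflection from trial 4', 'reflection from trial 5']
```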

### Memory system

Reflexion distinguishes between two types of memory:

- **Short-term memory**: The current trajectory, including all observations, thoughts, and actions from the ongoing episode. This is analogous to the scratchpad or working memory used during a single attempt.
- **Long-term memory**: The stored self-reflections from previous trials. These are distilled summaries of past experiences, not the full trajectories themselves. By storing only the reflection rather than the entire trajectory, Reflexion keeps the memory compact enough to fit in the LLM's context window.[1]

The long-term memory acts as an experience pool that grows over successive trials. Each reflection encodes specific, actionable feedback (such as "avoid using the go-to action before checking inventory" or "the recursive solution causes a stack overflow for large inputs; use an iterative approach instead"). When the Actor begins a new trial, these reflections are prepended to its prompt, giving it concrete guidance on what to do differently.
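A sketch of how stored reflections might be placed ahead of the task in the Actor's prompt is shown below. The function `build_actor_prompt`, the section labels, and the ordering of few-shot examples relative to reflections are all assumptions made for illustration; the actual prompt templates live in the authors' repository.

```python
from typing import List, Sequence


def build_actor_prompt(task: str, reflections: Sequence[str], few_shot: str = "") -> str:
    """Assemble a prompt: few-shot examples, then past reflections, then the task."""
    parts: List[str] = []
    if few_shot:
        parts.append(few_shot)
    if reflections:
        lessons = "\n".join(f"- {r}" for r in reflections)
        parts.append("Reflections from previous attempts:\n" + lessons)
    parts.append("Task: " + task)
    return "\n\n".join(parts)


prompt = build_actor_prompt(
    "put the mug on the shelf",
    ["check inventory before trying to place an item"],
)
```

On the first trial the reflection list is empty, so the prompt reduces to the task (plus any few-shot examples); later trials see the accumulated lessons before the task statement.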

## The Reflexion algorithm

The Reflexion algorithm follows a simple iterative procedure:

1. **Generate trajectory**: The Actor interacts with the environment, producing a trajectory \(\tau_t\).
2. **Evaluate**: The Evaluator computes a reward \(r_t\) based on the trajectory.
3. **Check for success**: If the task is solved (i.e., \(r_t\) meets the success threshold), the loop terminates.
4. **Self-reflect**: The Self-Reflection model generates a verbal summary \(sr_t\) from the trajectory \(\tau_t\), reward \(r_t\), and existing memory.
5. **Update memory**: The reflection \(sr_t\) is appended to the long-term memory buffer.
6. **Repeat**: Return to step 1 with the updated memory, up to a maximum number of trials.[1]

This process mirrors how a human might approach a difficult problem: try a solution, observe where it fails, think about what went wrong, and adjust the approach for the next attempt. The difference is that in Reflexion, every step is performed by an LLM, and the "thinking about what went wrong" is itself a language generation task.

## How well does Reflexion perform?

The Reflexion paper evaluated the framework across three distinct task categories: sequential decision-making, knowledge-intensive reasoning, and code generation.[1] The base language model used was [GPT-3](/wiki/gpt-3) in the AlfWorld experiments, while the programming experiments primarily used [GPT-4](/wiki/gpt-4).[1] Across these domains the abstract reports that Reflexion "obtains significant improvements over a baseline agent," with the headline result a "91% pass@1 accuracy on the HumanEval coding benchmark, surpassing the previous state-of-the-art GPT-4 that achieves 80%."[1]

### Sequential decision-making: AlfWorld

[AlfWorld](/wiki/alfworld) is a text-based game environment where an agent must complete household tasks (such as finding and placing objects) by issuing text commands.[9] The environment provides text observations describing the current state of the world.

| Method | Tasks solved (out of 134) | Success rate |
|---|---|---|
| ReAct (baseline) | ~100 | ~75% |
| ReAct + Reflexion | 130 | 97% |

Reflexion improved the success rate by 22 percentage points over the ReAct baseline after 12 iterative learning steps, reaching 97% (130 out of 134 tasks).[1] The authors found that a common failure mode for the baseline ReAct agent was "hallucinating" that it possessed an item when it did not.[1] Reflexion largely eliminated this class of errors because the self-reflection step would identify the hallucination and instruct the agent to verify item possession before attempting to use or place items.

The Evaluator for AlfWorld used a combination of heuristic rules (detecting if the agent repeated the same action three or more times in a row, or took more than 30 total actions) and an LLM-based binary classifier to determine success or failure.[1]

### Reasoning: HotpotQA

[HotpotQA](/wiki/hotpotqa) is a multi-hop question answering dataset that requires reasoning across multiple Wikipedia paragraphs.[10] The agent must search for and synthesize information from different sources to answer questions.

| Method | Accuracy |
|---|---|
| Chain-of-Thought (CoT) baseline | ~34% |
| CoT + Reflexion | ~54% |
| ReAct baseline | Lower than CoT |
| ReAct + Reflexion | Higher than CoT + Reflexion |

Reflexion improved performance by approximately 20 percentage points over the respective baselines.[1] Notably, the baseline CoT agent showed no improvement across multiple trials (since it had no mechanism to learn from failures), while the Reflexion-augmented agent showed consistent improvement with each successive trial.[1]

An ablation study on HotpotQA revealed the contribution of each component:

| Configuration | Accuracy |
|---|---|
| Baseline CoT (with ground truth context) | 61% |
| Baseline + Episodic Memory (last trajectory only) | 69% |
| Full Reflexion (Episodic Memory + Self-Reflection) | 77% |

The self-reflection component contributed an 8-percentage-point improvement beyond what episodic memory alone provided.[1] This demonstrates that the natural language reflections add value above and beyond simply showing the agent its previous trajectory.

### Code generation: HumanEval, MBPP, and LeetcodeHardGym

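The code generation experiments were among the most striking results in the paper. Reflexion achieved state-of-the-art pass@1 accuracy on most of the benchmarks tested; the exception was MBPP (Python), where it scored 77.1%, below the GPT-4 baseline of 80.1%.[1]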

| Benchmark | Language | Previous SOTA | GPT-4 (baseline) | Reflexion + GPT-4 |
|---|---|---|---|---|
| [HumanEval](/wiki/humaneval) | Python | 65.8% (CodeT)[8] | 80.1% | 91.0% |
| HumanEval | Rust | N/A | 60.0% | 68.0% |
| MBPP | Python | 67.7% (CodeT)[8] | 80.1% | 77.1% |
| MBPP | Rust | N/A | 70.9% | 75.4% |
| LeetcodeHardGym | Python | N/A | 7.5% | 15.0% |

On HumanEval (Python), Reflexion achieved 91% pass@1 accuracy, surpassing GPT-4's baseline of 80.1% by nearly 11 percentage points.[1][7] This was a notable result because it showed that a relatively simple iterative self-correction loop could push code generation accuracy well beyond what the base model achieved in a single attempt.

The code generation approach in Reflexion introduced an additional innovation: self-generated test suites. Before writing the implementation, the agent first generates a set of unit tests (up to six) based on the problem description.[1] These tests are filtered for syntactic validity using abstract syntax tree (AST) parsing. The agent then iteratively refines its implementation against these self-generated tests, using the test results as feedback for the self-reflection step.
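The syntactic filter can be sketched with Python's standard `ast` module. This is an illustrative sketch rather than the authors' implementation: the function name `filter_valid_tests` is hypothetical, and keeping the first six valid candidates is an assumption about how the cap of six is applied.

```python
import ast
from typing import List


def filter_valid_tests(candidates: List[str], limit: int = 6) -> List[str]:
    """Keep candidate tests that parse as Python, up to `limit` of them."""
    valid: List[str] = []
    for test in candidates:
        try:
            ast.parse(test)  # raises SyntaxError on malformed source
        except (SyntaxError, ValueError):
            continue
        valid.append(test)
        if len(valid) == limit:
            break
    return valid


tests = filter_valid_tests([
    "assert add(1, 2) == 3",
    "assert add(1, == 3",      # malformed, dropped by the parser check
    "assert add(0, 0) == 0",
])
```

Note that parsing only checks syntax: a test that parses can still assert the wrong expected value.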

This test-first approach shifts what the authors called the "accuracy bottleneck" from correct code generation to correct test generation.[1] The reasoning is that generating accurate tests for a function (given its specification) is generally easier than generating the correct implementation. If the agent can produce a diverse and accurate set of tests, it can use those tests as a reliable feedback signal for iterative refinement.
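To make the feedback signal concrete, the sketch below runs assert-style tests against a candidate implementation and collects failure messages that could be handed to the self-reflection step. The helper `run_tests` and its message format are hypothetical, and the sketch uses plain `exec` with no sandboxing or timeouts, which a real system would need.

```python
from typing import List, Tuple


def run_tests(implementation: str, tests: List[str]) -> Tuple[bool, List[str]]:
    """Execute each test against the code; return (all_passed, failure messages)."""
    namespace: dict = {}
    try:
        exec(implementation, namespace)  # define the candidate function(s)
    except Exception as exc:
        return False, [f"implementation failed to load: {exc!r}"]
    failures: List[str] = []
    for test in tests:
        try:
            exec(test, dict(namespace))  # copy so tests cannot affect each other
        except AssertionError:
            failures.append(f"FAILED: {test}")
        except Exception as exc:
            failures.append(f"ERROR: {test} raised {exc!r}")
    return not failures, failures


buggy = "def add(a, b):\n    return a - b\n"
passed, feedback = run_tests(buggy, ["assert add(2, 0) == 2", "assert add(1, 1) == 2"])
print(passed, feedback)
# False ['FAILED: assert add(1, 1) == 2']
```

The first test passes by coincidence even though the implementation is wrong, which illustrates why a diverse test set matters for this loop.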

An ablation study on the 50 hardest HumanEval problems (in Rust) confirmed that both components were necessary:

| Configuration | Accuracy |
|---|---|
| Base GPT-4 | 60% |
| Without test generation | 52% |
| Without self-reflection | 60% |
| Full Reflexion | 68% |

Removing test generation actually hurt performance (dropping to 52%), while removing self-reflection kept performance at the baseline level (60%). Only the full Reflexion system, combining both self-generated tests and verbal self-reflection, achieved the highest accuracy of 68%.[1]

The authors also introduced LeetcodeHardGym, a new benchmark consisting of 40 hard-level problems from Leetcode. These problems were selected from after October 2022 to avoid data contamination (since GPT-4's training data cutoff preceded that date).[1] On this benchmark, Reflexion doubled GPT-4's baseline performance from 7.5% to 15.0%, though the absolute numbers remained low, reflecting the genuine difficulty of competitive programming problems.[1]
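As a rough check on those absolute numbers, and assuming each of the 40 problems counts once toward pass@1, the reported rates correspond to:

\[
0.075 \times 40 = 3 \ \text{problems (GPT-4 baseline)}, \qquad 0.150 \times 40 = 6 \ \text{problems (Reflexion)}.
\]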

## How does Reflexion differ from ReAct and Chain-of-Thought?

Reflexion sits within a broader landscape of techniques for improving LLM agent performance. Several related methods address similar goals but differ in their mechanisms.

### Chain-of-Thought prompting

[Chain-of-Thought](/wiki/chain_of_thought) (CoT) prompting encourages the LLM to "think step by step" before producing a final answer.[3] While CoT improves reasoning on single attempts, it provides no mechanism for learning from failed attempts. A CoT agent that fails on a problem will produce the same (or similar) incorrect reasoning if given the same problem again. Reflexion can use CoT as the Actor's prompting strategy while adding the outer self-reflection loop for cross-episode learning.

### ReAct

[ReAct](/wiki/react_prompting) interleaves reasoning traces ("Thought") with environment actions ("Action") and observations ("Observation").[2] ReAct agents can use tools, search the web, and interact with APIs. Reflexion extends ReAct by adding a self-reflection step after each episode and maintaining episodic memory across episodes. In the AlfWorld experiments, ReAct + Reflexion significantly outperformed ReAct alone (97% vs. roughly 75%).[1]

### Self-Refine

[Self-Refine](/wiki/self_refine) (Madaan et al., 2023) is an approach where an LLM iteratively refines its own output within a single episode.[4] The model generates an initial output, critiques it, and produces a revised version, repeating this cycle several times. The key difference from Reflexion is that Self-Refine operates within a single trial and does not maintain memory across episodes.[4] Reflexion's cross-episode memory means it can learn from fundamentally different approaches rather than just polishing the same initial solution.

### Tree of Thoughts

[Tree of Thoughts](/wiki/tree_of_thoughts) (ToT) explores multiple reasoning paths simultaneously, evaluating partial solutions and backtracking when a path appears unpromising.[5] ToT is a search-time strategy that operates within a single problem-solving episode. Reflexion, by contrast, operates across episodes, learning from complete failed attempts. The two approaches are complementary: an agent could use ToT within each trial and Reflexion across trials.

### Language Agent Tree Search (LATS)

LATS (Zhou et al., 2023), published at ICML 2024, explicitly combines ideas from Reflexion, Tree of Thoughts, and Monte Carlo Tree Search.[6] LATS uses an LLM as both the agent and the value function, performing tree search over possible action sequences while incorporating self-reflection for backtracking decisions.[6] LATS can be viewed as a unification of the reasoning, acting, and planning components from these earlier frameworks.

### Traditional reinforcement learning

Conventional RL methods (such as [PPO](/wiki/reinforcement_learning) and other policy gradient algorithms) update model weights based on reward signals. This requires [backpropagation](/wiki/backpropagation) through the model, which for large language models is computationally expensive. Reflexion avoids weight updates entirely, relying instead on the LLM's in-context learning abilities.[1] The tradeoff is that Reflexion's improvements are not permanently baked into the model; they exist only in the memory buffer. If the memory is cleared, the agent reverts to its base performance.

| Method | Learning mechanism | Cross-episode memory | Weight updates required | Feedback type |
|---|---|---|---|---|
| [Chain-of-Thought](/wiki/chain_of_thought) | Single-episode prompting | No | No | None |
| [ReAct](/wiki/react_prompting) | Single-episode reasoning + acting | No | No | Environment observations |
| Self-Refine | Within-episode iteration | No | No | Self-critique |
| [Tree of Thoughts](/wiki/tree_of_thoughts) | Within-episode search | No | No | Self-evaluation |
| Traditional RL ([PPO](/wiki/reinforcement_learning), etc.) | Gradient-based optimization | Yes (in weights) | Yes | Scalar rewards |
| Reflexion | Verbal reinforcement across episodes | Yes (in memory buffer) | No | Natural language reflections |
| LATS | Tree search + reflection | Yes | No | Self-reflection + value estimates |

## Why use Reflexion? Key advantages

Reflexion offers several practical advantages over alternative approaches:

**No weight updates required.** Because Reflexion stores its learned experience in natural language memory rather than model parameters, it works with any LLM, including closed-source models accessible only through APIs (such as [GPT-4](/wiki/gpt-4) or [Claude](/wiki/claude)).[1] There is no need for access to model gradients or the ability to run backpropagation.

**Interpretable learning.** The self-reflections stored in memory are human-readable natural language. A developer or researcher can inspect the agent's memory to understand what it has learned and why it changed its behavior. This is a significant advantage over traditional RL, where learned behaviors are encoded in opaque weight matrices.

**Nuanced feedback.** Scalar reward signals (such as a score of 0 or 1) provide limited information about what went wrong. Reflexion converts these sparse signals into detailed, actionable natural language feedback.[1] A reflection like "the function fails on negative inputs because the absolute value conversion is missing" carries far more information than a binary failure signal.

**Lightweight implementation.** Reflexion requires only prompt engineering and a memory buffer. It does not require training infrastructure, GPU clusters, or custom training loops. This makes it accessible to practitioners who want to build better agents without significant engineering overhead.

**Flexibility across feedback types.** The framework can incorporate scalar values, binary signals, or free-form natural language as feedback from the environment.[1] This flexibility allows Reflexion to be applied across diverse task domains.

## What are the limitations of Reflexion?

Despite its strengths, Reflexion has several important limitations that the authors and subsequent researchers have identified.

### Local minima

Like any optimization process, Reflexion can get stuck in local minima where the agent repeatedly tries slight variations of the same flawed approach. The self-reflection model may not always identify the fundamental issue, instead suggesting superficial changes that do not address the root cause of failure. The authors demonstrated this problem on the WebShop benchmark, where the agent needed to navigate an e-commerce website to find and purchase products. After testing a ReAct + Reflexion agent across 100 environments, the runs were terminated after only four trials because the agent showed no signs of improvement.[1] The agent produced unhelpful self-reflections and could not escape its initial strategy.

### Context window constraints

The episodic memory is bounded by the LLM's context window. With a typical memory capacity of 1 to 3 reflections, the agent can only draw on a limited amount of past experience.[1] As tasks become more involved and require learning many distinct lessons, this memory bottleneck becomes a constraint. Older reflections must be discarded to make room for newer ones, potentially losing useful information.

### Dependence on self-evaluation quality

The entire framework relies on the LLM's ability to accurately assess its own performance and generate useful self-critiques. If the LLM cannot correctly identify why it failed, the reflections will be misleading, potentially making performance worse. This is particularly problematic for tasks where the LLM lacks the domain knowledge to diagnose its own errors.

Subsequent research formalized this concern. In "Large Language Models Cannot Self-Correct Reasoning Yet" (ICLR 2024), researchers from Google DeepMind and the University of Illinois reported that intrinsic self-correction, in which a model critiques its own reasoning without any external feedback, tends to degrade rather than improve accuracy on reasoning benchmarks.[15] The same paper noted that Reflexion's reported gains depend on access to external signals, such as ground-truth answer checks or executable unit tests, to decide when to stop iterating.[15]

### Test generation limitations for code

The code generation approach depends on the agent's ability to produce correct test cases. For certain types of programs, generating accurate tests is difficult or impossible: non-deterministic functions, functions that interact with external APIs, hardware-dependent behavior, and concurrent programs all pose challenges for automated test generation.[1] If the self-generated tests are incorrect, the agent may iteratively "fix" working code to satisfy broken tests.

### No permanent learning

Reflexion's improvements exist only in the memory buffer. If the buffer is cleared, or if the agent encounters a new task with no relevant prior reflections, it starts from scratch. The lessons learned during one problem-solving session do not transfer to future sessions unless the memory is explicitly carried over. This contrasts with fine-tuning approaches, where improvements are permanently encoded in the model's weights. Later work such as Meta-Policy Reflexion (2025) targets this limitation by consolidating reflections into a structured, reusable memory designed to transfer across tasks.[21]

### Task diversity requirement

For Reflexion to work well, the agent needs to be able to try meaningfully different approaches across trials. If the task space is such that small perturbations to the agent's strategy do not yield useful signal, or if the agent cannot generate sufficiently diverse strategies, the self-reflection loop may not converge to a solution.

## Influence and subsequent work

Reflexion has had considerable influence on the design of LLM-based agents since its publication. As of early 2025, the paper had accumulated over 2,000 citations according to Semantic Scholar, making it one of the most cited papers in the agent research space. By June 2026, the [Semantic Scholar](/wiki/semantic_scholar) citation count had grown to more than 3,800, including over 300 citations the service classifies as highly influential.[22]

The framework's ideas have been adopted in several practical systems and frameworks:

- **[LangGraph](/wiki/langgraph)** by [LangChain](/wiki/langchain) includes built-in support for reflection-based agent loops inspired by Reflexion. Developers can construct "draft, execute tools, revise" pipelines that mirror the Reflexion cycle.[12]
- **LATS** (Language Agent Tree Search) directly extends Reflexion by combining its self-reflection mechanism with tree search, achieving improved results on reasoning and coding benchmarks.[6]
- **Multi-Agent Reflexion** (MAR) extends the single-agent framework to multi-agent settings, where multiple LLM agents reflect on each other's outputs to improve collective reasoning.[19]
- **AlphaCodium** by [CodiumAI](/wiki/qodo) applies Reflexion-style iterative test-and-refine loops specifically to code generation,[14] drawing on the test-first methodology introduced in the Reflexion paper.
- **ReflexiCoder** (2026) uses reinforcement learning to explicitly train LLMs to self-reflect on generated code and self-correct, building on the verbal reinforcement paradigm.[20]

The broader concept of "reflection" in AI agents, while not invented by the Reflexion paper, was significantly advanced by it. [Andrew Ng](/wiki/andrew_ng) identified reflection as one of four key [agentic design patterns](/wiki/agentic_design_patterns), citing Reflexion as a foundational example.[13] The pattern of generating output, evaluating it, reflecting on failures, and trying again has become a standard component in many production agent systems.

### 2025-2026 developments

Work since the paper's publication has pushed verbal self-reflection in two main directions: internalizing reflective behavior into model weights through training, and making the reflective memory more durable or more adversarial.

A major shift came with [reasoning models](/wiki/reasoning_models) trained end to end with reinforcement learning. The [DeepSeek-R1](/wiki/deepseek_r1) technical report (January 2025) described an "aha moment" during pure RL training in which the model spontaneously learned to pause, flag a mistake in its earlier steps, and re-derive the solution, behavior that Reflexion had implemented as an explicit outer loop around a frozen model.[16] The interpretation of this result is contested: researchers at Sea AI Lab reported in March 2025 that self-reflection patterns, including "aha moment" keywords, already appear in DeepSeek-V3-Base before any reinforcement learning is applied, suggesting that RL amplifies rather than creates the behavior.[17]

A second line of work uses reinforcement learning to improve the quality of the reflections themselves rather than relying on prompting alone. "Reflect, Retry, Reward" (May 2025) has a model generate a self-reflection after a failed attempt and retry the task, then uses [GRPO](/wiki/grpo) to reward only the reflection tokens when the retry succeeds; using purely binary success signals, the authors report improvements of up to 34.7% on math equation writing and 18.1% on function calling, with small fine-tuned models outperforming models from the same family that are 10 times larger.[18] ReflexiCoder (March 2026) internalizes the full generate-reflect-correct trajectory for code into model weights using an RL-only training recipe with no supervised fine-tuning stage; the authors report that ReflexiCoder-8B reaches 94.51% pass@1 on HumanEval in a single-attempt setting while reducing inference-time compute overhead by roughly 40%.[20]

Other extensions target Reflexion's memory and self-evaluation bottlenecks directly. Meta-Policy Reflexion (September 2025) consolidates LLM-generated reflections into a structured, reusable "meta-policy memory" with rule admissibility checks, and reports consistent gains in execution accuracy and robustness over Reflexion baselines on an AlfWorld-based evaluation without any model weight updates.[21] MAR (Multi-Agent Reflexion, December 2025), from researchers at the University of Michigan, replaces the single self-reflecting agent with multiple debater personas and a judge model that synthesizes their critiques into a unified reflection, aiming to counter the confirmation bias of a model grading its own work; the authors report 47% exact match on HotpotQA and 82.7% accuracy on HumanEval, both above single-agent reflection baselines.[19]

## What is LeetcodeHardGym?

As part of the Reflexion paper, the authors introduced LeetcodeHardGym, a new benchmark for evaluating code generation on genuinely difficult programming problems. The benchmark consists of 40 hard-level Leetcode problems that were published after October 2022, placing them outside the training data of models like GPT-4 at the time of the paper's writing.[1]

The benchmark was designed to test code generation systems on problems that require algorithmic reasoning, data structure knowledge, and careful edge-case handling. Unlike HumanEval, which primarily tests basic programming competency, LeetcodeHardGym problems often require sophisticated algorithms like dynamic programming, graph traversal, and advanced data structures.

The code and benchmark data are publicly available on the project's GitHub repository.[11] The repository is released under the MIT license.[11]

## How do you implement Reflexion?

The authors released their full implementation on GitHub.[11] The codebase includes:

- Reflexion agents for AlfWorld, HotpotQA, and programming tasks
- Self-reflection prompt templates for each domain
- Evaluation scripts for all benchmarks
- The LeetcodeHardGym benchmark dataset

Reflexion has also been reimplemented in various agent frameworks. LangGraph provides a reference implementation that developers can adapt for their own use cases.[12] The key implementation requirements are straightforward: an LLM for the Actor, an evaluation function appropriate to the task, a self-reflection prompt, and a memory buffer to store reflections.

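A minimal Reflexion loop can be written in Python as follows, where `actor`, `evaluator`, and `self_reflect` stand in for the LLM-backed components:

```python
def reflexion(task, actor, evaluator, self_reflect, max_trials, max_memory):
    """Return the first successful trajectory, or None if every trial fails."""
    memory = []  # episodic buffer of natural-language reflections
    for _ in range(max_trials):
        # The Actor conditions on the task plus all stored reflections.
        trajectory = actor.run(task, memory)
        reward = evaluator.evaluate(trajectory)
        if reward:  # assumes a truthy reward signals success
            return trajectory
        # Convert the failed attempt and its feedback into a verbal lesson.
        reflection = self_reflect(trajectory, reward, memory)
        memory.append(reflection)
        if len(memory) > max_memory:
            memory.pop(0)  # evict the oldest reflection to bound the buffer
    return None
```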
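To see the loop run end to end without an LLM, the sketch below wires it to toy stand-ins. Everything in it is hypothetical and for illustration only: `run_reflexion`, `toy_actor`, `toy_evaluator`, and `toy_reflector` are not part of the official codebase, the task is a trivial number-guessing game, and `collections.deque(maxlen=...)` is used as one convenient way to bound the memory buffer in place of the manual `pop(0)`.

```python
from collections import deque


def run_reflexion(task, actor, evaluator, reflector, max_trials=4, max_memory=2):
    """Toy Reflexion loop: returns (answer, trials_used) or (None, max_trials)."""
    memory = deque(maxlen=max_memory)  # oldest reflection is evicted automatically
    for trial in range(1, max_trials + 1):
        answer = actor(task, list(memory))
        if evaluator(answer):
            return answer, trial
        memory.append(reflector(answer))
    return None, max_trials


# Hypothetical stand-ins for the LLM-backed components.
def toy_actor(task, reflections):
    # Guess 0 first; afterwards follow the hint in the most recent reflection.
    if not reflections:
        return 0
    return int(reflections[-1].split()[-1])


def toy_evaluator(answer, target=3):
    # Binary success signal, playing the role of a test suite or exact match.
    return answer == target


def toy_reflector(answer):
    # A real reflector would be an LLM call that critiques the failed attempt.
    return f"{answer} was wrong; next time try {answer + 1}"


result = run_reflexion("guess the number", toy_actor, toy_evaluator, toy_reflector)
print(result)  # (3, 4): the fourth trial succeeds
```

In a real agent the three callables would wrap model calls and a task-specific checker, and the reflections would be inserted into the Actor's prompt instead of being parsed for a number.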

## See also

- [AI agents](/wiki/ai_agents)
- [ReAct](/wiki/react_prompting)
- [Chain-of-Thought prompting](/wiki/chain_of_thought)
- [Tree of Thoughts](/wiki/tree_of_thoughts)
- [Reinforcement learning](/wiki/reinforcement_learning)
- [Prompt engineering](/wiki/prompt_engineering)
- [HumanEval](/wiki/humaneval)
- [LangGraph](/wiki/langgraph)

## References

1. Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2023). "Reflexion: Language Agents with Verbal Reinforcement Learning." *Advances in Neural Information Processing Systems 36 (NeurIPS 2023)*. arXiv:2303.11366. https://arxiv.org/abs/2303.11366
2. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). "ReAct: Synergizing Reasoning and Acting in Language Models." *ICLR 2023*. arXiv:2210.03629.
3. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." *NeurIPS 2022*. arXiv:2201.11903.
4. Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B.P., Hermann, K.M., Welleck, S., Yazdanbakhsh, A., & Clark, P. (2023). "Self-Refine: Iterative Refinement with Self-Feedback." *NeurIPS 2023*. arXiv:2303.17651.
5. Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T.L., Cao, Y., & Narasimhan, K. (2023). "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." *NeurIPS 2023*. arXiv:2305.10601.
6. Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., & Wang, Y.-X. (2024). "Language Agent Tree Search Unifies Reasoning, Acting, and Planning in Language Models." *ICML 2024*. arXiv:2310.04406.
7. Shinn, N. & Gopinath, A. (2023). "Reflecting on Reflexion." *Nanothoughts (Substack)*. https://nanothoughts.substack.com/p/reflecting-on-reflexion
8. Chen, B., Zhang, F., Nguyen, A., Zan, D., Lin, Z., Lou, J.-G., & Chen, W. (2023). "CodeT: Code Generation with Generated Tests." *ICLR 2023*. arXiv:2207.10397.
9. Shridhar, M., Yuan, X., Côté, M.-A., Bisk, Y., Trischler, A., & Hausknecht, M. (2021). "ALFWorld: Aligning Text and Embodied Environments for Interactive Learning." *ICLR 2021*. arXiv:2010.03768.
10. Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W.W., Salakhutdinov, R., & Manning, C.D. (2018). "HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering." *EMNLP 2018*. arXiv:1809.09600.
11. Shinn, N. (2023). "Reflexion: Language Agents with Verbal Reinforcement Learning (official implementation)." GitHub repository, MIT license. https://github.com/noahshinn/reflexion
12. LangChain (2024). "Reflection Agents." *LangChain Blog*, February 21, 2024. https://blog.langchain.dev/reflection-agents/
13. Ng, A. (2024). "Agentic Design Patterns Part 2: Reflection." *The Batch*, DeepLearning.AI, March 27, 2024. https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-2-reflection/
14. Ridnik, T., Kredo, D., & Friedman, I. (2024). "Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering." arXiv:2401.08500. https://arxiv.org/abs/2401.08500
15. Huang, J., Chen, X., Mishra, S., Zheng, H.S., Yu, A.W., Song, X., & Zhou, D. (2024). "Large Language Models Cannot Self-Correct Reasoning Yet." *ICLR 2024*. arXiv:2310.01798. https://arxiv.org/abs/2310.01798
16. DeepSeek-AI (2025). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948. https://arxiv.org/abs/2501.12948
17. Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W.S., & Lin, M. (2025). "Understanding R1-Zero-Like Training: A Critical Perspective." arXiv:2503.20783. https://arxiv.org/abs/2503.20783
18. Bensal, S., Jamil, U., Bryant, C., Russak, M., Kamble, K., Mozolevskyi, D., Ali, M., & AlShikh, W. (2025). "Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning." arXiv:2505.24726. https://arxiv.org/abs/2505.24726
19. Ozer, O., Wang, Y., Wu, G., Dosti, D., Zhang, H., & De La Rue, V. (2025). "MAR: Multi-Agent Reflexion Improves Reasoning Abilities in LLMs." arXiv:2512.20845. https://arxiv.org/abs/2512.20845
20. Jiang, J., Shen, J., Kim, S., Yoo, K.M., Kim, J., & Kim, S. (2026). "ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning." arXiv:2603.05863. https://arxiv.org/abs/2603.05863
21. Wu, C., Luo, Y., Qu, Z., & Wang, M. (2025). "Meta-Policy Reflexion: Reusable Reflective Memory and Rule Admissibility for Resource-Efficient LLM Agent." arXiv:2509.03990. https://arxiv.org/abs/2509.03990
22. Semantic Scholar (2026). "Reflexion: language agents with verbal reinforcement learning" citation record, accessed June 2026. https://www.semanticscholar.org/arxiv/2303.11366

