ReAct (prompting)
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v4 · 5,395 words
See also: Prompt engineering, Chain-of-thought prompting, Tool use, AI agents, and LangChain
ReAct (short for Reasoning and Acting) is a prompting paradigm for large language models that interleaves verbal reasoning traces with task-specific actions executed against an external environment.[1] Introduced by Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao in the October 2022 paper "ReAct: Synergizing Reasoning and Acting in Language Models" (arXiv:2210.03629),[1] ReAct enables an LLM to alternately generate free-form thoughts and concrete actions in a loop, where the thoughts plan and revise behavior while the actions allow the model to read from search APIs, knowledge bases, or simulated environments. The paper was accepted as an Oral (Notable Top 5%) at the International Conference on Learning Representations (ICLR) 2023.[2]
Yao and Narasimhan were at Princeton University at the time of writing; Zhao, Yu, Du, Shafran, and Cao were at Google Research, Brain team. The work was conducted during Yao's internship at Google.[1] An accompanying Google Research blog post announced the method on November 8, 2022,[3] and the authors released open-source code on GitHub as Jupyter notebooks (now over 3,800 stars) implementing ReAct prompting with GPT-3 text-davinci-002.[4]
ReAct was one of the first frameworks to demonstrate that combining chain-of-thought reasoning with tool use produces agents that are simultaneously more accurate, more interpretable, and less prone to hallucination than either reasoning-only or action-only baselines.[1] Its Thought-Action-Observation loop became the default control flow in production agent frameworks including LangChain, LangGraph, LlamaIndex, the Hugging Face transformers agents API, and the underlying logic of many commercial tool-calling and "agentic" runtimes.[5][6]
Before ReAct, two separate lines of research had shown that large language models possess useful but isolated capabilities. On one hand, chain-of-thought prompting (Wei et al., 2022) demonstrated that showing a model worked examples with intermediate reasoning steps before the answer substantially improves performance on arithmetic, commonsense, and symbolic reasoning tasks.[7] On the other hand, work on action generation and tool use, including SayCan, Inner Monologue, WebGPT, and later Toolformer, showed that language models can learn to call external APIs, navigate web pages, or control robotic systems when prompted with appropriate action schemas.
These two capabilities had mostly been studied in isolation. Chain-of-thought prompting relies entirely on the model's internal knowledge, which means it can produce plausible-sounding but factually incorrect reasoning chains, what the ReAct authors call "fact hallucination" and "error propagation."[1] Conversely, action-only approaches let models interact with the world but lack the ability to reason about what action to take next, why a previous action failed, or how to synthesize information gathered across multiple steps.
Yao et al. argued that humans do not separate reasoning from acting in practice. When solving a problem, a person typically thinks about what information is needed, searches for it, reads the results, reasons about what was found, decides on a next step, and continues until the answer becomes clear. ReAct formalizes this intuition by prompting language models to produce interleaved sequences of thoughts and actions. The paper introduces two reciprocal mechanisms: the model can "reason to act" (use thinking to decide what to do next) and "act to reason" (use observations from the environment to update its thinking).[1] This mutual feedback is what the title's "synergizing" refers to.
The core mechanism of ReAct is a structured loop with three components:
| Component | Role | Effect on environment |
|---|---|---|
| Thought | Free-form natural language reasoning. The model decomposes the problem, interprets observations, formulates hypotheses, or plans the next step. | None. Thoughts are internal to the model and do not change the external environment. |
| Action | A task-specific command that interacts with the external environment. Actions follow a predefined format (e.g., Search[query], Lookup[keyword], Finish[answer]). | Yes. The action is executed, and the environment returns a result. |
| Observation | The result returned by the environment after an action is executed. Observations are appended to the model's context so it can reason about them in the next thought step. | None. Observations are inputs to the model, not generated by it. |
The loop proceeds as follows: the model generates a Thought, then emits an Action; the environment executes the Action and returns an Observation, which is appended to the context; and the cycle repeats until the model emits a terminating action (e.g., Finish[answer]).
This structure allows the model to dynamically adjust its plan based on what it discovers at each step, rather than committing to a fixed sequence of actions in advance. Crucially, the loop does not require fine-tuning or any architectural modification; it is implemented entirely through few-shot exemplars in the prompt and an external orchestrator that parses each Action and feeds the resulting Observation back into the next call.[1]
The following example, adapted from the original paper, shows ReAct solving a multi-hop question from the HotpotQA dataset using a simple Wikipedia API.[1]
Question: What is the elevation range for the area that the eastern sector of the Colorado orogeny extends into?
Thought 1: I need to search Colorado orogeny, find the area that the
eastern sector extends into, then find the elevation range
of that area.
Action 1: Search[Colorado orogeny]
Observation 1: The Colorado orogeny was an episode of mountain building
(an orogeny) in Colorado and surrounding areas.
Thought 2: It does not mention the eastern sector. So I need to look
up eastern sector.
Action 2: Lookup[eastern sector]
Observation 2: (Result 1 / 1) The eastern sector extends into the High
Plains and is called the Central Plains orogeny.
Thought 3: The eastern sector of Colorado orogeny extends into the
High Plains. So I need to search High Plains and find its
elevation range.
Action 3: Search[High Plains]
Observation 3: High Plains refers to one of two distinct land regions.
Thought 4: I need to instead search High Plains (United States).
Action 4: Search[High Plains (United States)]
Observation 4: The High Plains are a subregion of the Great Plains.
From east to west, the High Plains rise in elevation
from around 1,800 to 7,000 ft (550 to 2,130 m).
Thought 5: High Plains rise in elevation from around 1,800 to 7,000 ft,
so the answer is 1,800 to 7,000 ft.
Action 5: Finish[1,800 to 7,000 ft]
Several features of this trace illustrate why interleaving helps. The model decomposes the original question into sub-questions (Thought 1), recovers from an uninformative search result (Thought 2), disambiguates a query that returned the wrong entity (Thought 4), and extracts the relevant fact from a longer passage (Thought 5).[1] These are exactly the kinds of flexible, adaptive behaviors that action-only approaches struggle with because they lack an explicit reasoning channel, and that pure chain-of-thought struggles with because it has no way to ground a guess against the world.
The specific actions available to a ReAct agent depend on the task and environment. The original paper defined actions for two categories of tasks.
For question answering (HotpotQA) and fact verification (FEVER), the paper used a simple Wikipedia API with three actions:[1]
| Action | Description |
|---|---|
| Search[entity] | Returns the first five sentences of the Wikipedia page for the given entity. If no exact match is found, returns the top five similar entities from the Wikipedia search engine. |
| Lookup[string] | Returns the next sentence on the current page that contains the given string. Simulates the browser's Ctrl+F (find) functionality. |
| Finish[answer] | Terminates the episode and submits the given answer. |
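The three actions in the table can be sketched as a small environment class. This is a minimal illustration, assuming a tiny in-memory dictionary in place of the real Wikipedia API; the names `ToyWikiEnv` and `PAGES`, the sentence splitting, and the substring-based "similar entities" fallback are all hypothetical simplifications, not the paper's implementation.

```python
import re

# Hypothetical two-page corpus standing in for Wikipedia.
PAGES = {
    "Colorado orogeny": (
        "The Colorado orogeny was an episode of mountain building. "
        "The eastern sector extends into the High Plains."
    ),
    "High Plains (United States)": (
        "The High Plains are a subregion of the Great Plains. "
        "They rise in elevation from around 1,800 to 7,000 ft."
    ),
}


class ToyWikiEnv:
    """Toy environment exposing search, lookup, and finish."""

    def __init__(self, pages):
        self.pages = pages
        self.sentences = []   # sentences of the current page
        self.cursor = 0       # index where the next lookup resumes
        self.keyword = None   # keyword of the lookup in progress
        self.answer = None    # set once finish() is called

    def search(self, entity):
        """Open a page and return up to its first five sentences."""
        if entity in self.pages:
            self.sentences = re.split(r"(?<=\.)\s+", self.pages[entity])
            self.cursor, self.keyword = 0, None
            return " ".join(self.sentences[:5])
        # No exact match: suggest up to five titles containing the query.
        similar = [t for t in self.pages if entity.lower() in t.lower()]
        return f"Could not find {entity}. Similar: {similar[:5]}"

    def lookup(self, string):
        """Return the next sentence on the current page containing string."""
        if string != self.keyword:
            self.keyword, self.cursor = string, 0
        for i in range(self.cursor, len(self.sentences)):
            if string.lower() in self.sentences[i].lower():
                self.cursor = i + 1
                return self.sentences[i]
        return "No more results."

    def finish(self, answer):
        """Record the final answer and end the episode."""
        self.answer = answer
        return "Episode finished."
```

Repeating the same `lookup` call advances through successive matches, which is how the sketch imitates pressing Ctrl+F more than once.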
For text-based games (ALFWorld) and web navigation (WebShop), the actions were environment-specific commands such as "go to cabinet 1," "take knife 2," "click [Buy Now]," and similar instructions that directly manipulate the environment state.[1]
In both cases, the key design choice is that the action space is simple and well-defined. The model does not need to generate arbitrary code or API calls; it selects from a small set of action templates and fills in the arguments based on its reasoning. This constraint is what makes few-shot prompting tractable: the model only needs to learn the schema, not invent new tools.
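Because the action space is a handful of fixed templates, extracting an action from model output can be done with a single regular expression. The sketch below is one possible parser, assuming the `Name[argument]` surface form used in the paper's question-answering prompts; the `Action` tuple, the `parse_action` helper, and the default whitelist are illustrative names rather than part of any library.

```python
import re
from typing import NamedTuple, Optional, Sequence

# Matches lines such as "Action 3: Search[High Plains]".
ACTION_RE = re.compile(r"Action(?: \d+)?:\s*(\w+)\[(.*)\]\s*$", re.MULTILINE)


class Action(NamedTuple):
    name: str
    argument: str


def parse_action(
    completion: str,
    allowed: Sequence[str] = ("Search", "Lookup", "Finish"),
) -> Optional[Action]:
    """Return the first well-formed, whitelisted action, or None."""
    match = ACTION_RE.search(completion)
    if match is None:
        return None
    name, argument = match.groups()
    if name not in allowed:
        return None  # reject actions outside the declared schema
    return Action(name, argument.strip())
```

Returning `None` for anything outside the whitelist lets the orchestrator decide how to respond to a malformed step instead of executing it.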
ReAct prompts are constructed using few-shot learning. The prompt contains a small number of human-written trajectories (typically 3 to 6 examples) that demonstrate the desired Thought-Action-Observation format. Each trajectory shows a complete problem-solving episode, including the question, the interleaved thoughts and actions, the observations returned by the environment, and the final answer.[1]
For knowledge-intensive tasks, the authors manually wrote trajectories that include dense reasoning at every step: each thought explains the model's current understanding and what it plans to do next. For decision-making tasks, the authors used a sparser style, inserting reasoning only at points where it was most useful (for instance, when forming a subgoal or when recovering from a mistake). The model was free to decide the rhythm of reasoning versus acting.[1]
The prompts in the paper were designed for PaLM-540B, the large language model used in the original experiments. The open-source code release also implements ReAct against GPT-3 text-davinci-002, and the authors note that PaLM and GPT-3 are stronger at different tasks (for example, GPT-3 narrowly outperforms PaLM-540B on HotpotQA exact match, while PaLM-540B is significantly stronger on FEVER accuracy).[4] No task-specific fine-tuning was required for the prompting experiments; the model learned the ReAct format purely from the in-context examples.
Two implementation approaches have emerged in practice:
| Approach | Description | When to use |
|---|---|---|
| Few-shot | The prompt includes several complete worked examples demonstrating the Thought-Action-Observation cycle. | When a base model (not instruction-tuned) is used, or when strict format adherence is required. |
| Zero-shot | The prompt provides detailed written instructions describing the format and available actions, without worked examples. | When an instruction-tuned model (e.g., GPT-4, Claude) is used, since these models can follow abstract instructions reliably. |
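The difference between the two approaches is mostly a matter of what text precedes the new question. The following sketch builds both kinds of prompt as plain strings; the helper names, the wording of the instructions, and the `ACTION_HELP` text are assumptions made for illustration and are not taken from the paper's prompts.

```python
from typing import Sequence

# Hypothetical one-line descriptions of the available actions.
ACTION_HELP = (
    "Search[entity]: open the page for an entity.\n"
    "Lookup[string]: find the next sentence containing a string.\n"
    "Finish[answer]: submit the final answer."
)


def build_few_shot_prompt(exemplars: Sequence[str], question: str) -> str:
    """Concatenate worked trajectories, then open a new one for the model."""
    blocks = [exemplar.strip() for exemplar in exemplars]
    blocks.append(f"Question: {question}\nThought 1:")
    return "\n\n".join(blocks)


def build_zero_shot_prompt(question: str) -> str:
    """Describe the format in words instead of showing worked examples."""
    instructions = (
        "Answer the question by alternating Thought, Action, and "
        "Observation steps. Available actions:\n" + ACTION_HELP
    )
    return f"{instructions}\n\nQuestion: {question}\nThought 1:"
```

Both builders end the prompt with `Thought 1:` so that the model's first output is a reasoning step.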
The original paper evaluated ReAct on four benchmarks spanning two task categories. All prompting experiments used PaLM-540B unless otherwise noted.[1]
HotpotQA is a multi-hop question answering dataset that requires synthesizing information from multiple Wikipedia articles.[8] The paper evaluated on a random subset of 500 examples from the validation set using 6-shot prompting.[1]
| Method | Exact Match (EM) |
|---|---|
| Standard prompting | 28.7 |
| Chain-of-thought (CoT) | 29.4 |
| CoT with self-consistency (CoT-SC) | 33.4 |
| Act only | 25.7 |
| ReAct | 27.4 |
| ReAct → CoT-SC (switch on failure) | 35.1 |
| CoT-SC → ReAct (switch on failure) | 34.2 |
On HotpotQA, ReAct (27.4 EM) underperformed pure CoT (29.4 EM) in raw accuracy. However, the authors showed that the two methods fail on different types of questions. CoT scores higher but has a much higher false-positive rate because it hallucinates facts. ReAct hallucinates less because it grounds its reasoning in retrieved Wikipedia content, but it sometimes fails because of search errors or retrieval of irrelevant articles. The best overall performance came from combining the two methods: running ReAct first and falling back to CoT-SC when ReAct fails to produce an answer reached 35.1 EM.[1]
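The two "switch on failure" combinations can be sketched as thin wrappers around a ReAct runner and a chain-of-thought sampler. This is an illustration under stated assumptions: the trigger conditions (ReAct returning no answer within its step budget; the CoT-SC majority answer receiving fewer than half the votes), the default of five samples, and the function names are choices made for this sketch, and `run_react` and `sample_cot` are hypothetical callables supplied by the caller.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def majority_vote(samples: List[str]) -> Tuple[str, int]:
    """Return the most common answer and its vote count."""
    answer, votes = Counter(samples).most_common(1)[0]
    return answer, votes


def react_then_cot_sc(
    run_react: Callable[[str], Optional[str]],
    sample_cot: Callable[[str], str],
    question: str,
    n: int = 5,
) -> str:
    """Try ReAct first; fall back to CoT-SC if it returns no answer."""
    answer = run_react(question)
    if answer is not None:
        return answer
    samples = [sample_cot(question) for _ in range(n)]
    return majority_vote(samples)[0]


def cot_sc_then_react(
    run_react: Callable[[str], Optional[str]],
    sample_cot: Callable[[str], str],
    question: str,
    n: int = 5,
) -> str:
    """Try CoT-SC first; fall back to ReAct if the vote is not decisive."""
    samples = [sample_cot(question) for _ in range(n)]
    answer, votes = majority_vote(samples)
    if votes >= n / 2:
        return answer
    fallback = run_react(question)
    # If ReAct also fails, keep the weak majority answer.
    return fallback if fallback is not None else answer
```

The design intent is that each method covers the other's characteristic failure: retrieval dead ends for ReAct, low-agreement guesses for CoT-SC.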
FEVER is a fact verification dataset where the model must classify a claim as "SUPPORTS," "REFUTES," or "NOT ENOUGH INFO" based on Wikipedia evidence. The paper used 3-shot prompting on 500 random validation examples.[1]
| Method | Accuracy |
|---|---|
| Standard prompting | 57.1 |
| CoT | 56.3 |
| Act only | 58.9 |
| ReAct | 60.9 |
| ReAct → CoT-SC | 62.0 |
| CoT-SC → ReAct | 64.6 |
On FEVER, ReAct outperformed both CoT and act-only baselines outright, and the CoT-SC → ReAct combination produced the strongest result at 64.6 accuracy.[1]
ALFWorld is a text-based game environment requiring household tasks (e.g., "put a clean apple on the counter").[9] The paper used 1 or 2 in-context examples and ran 6 trials, reporting the best trial.[1]
| Method | Success Rate |
|---|---|
| BUTLER (imitation learning) | 37% |
| Act only | 45% |
| ReAct | 71% |
ReAct outperformed the act-only baseline by 26 percentage points and the BUTLER imitation-learning baseline by 34 points, matching the 34-point gain over imitation/RL methods that the abstract claims for interactive decision-making.[1] The relative performance gain of ReAct over Act ranged from 33% to 90% across the six trials, averaging 62%. The large gap demonstrates the value of interleaved reasoning: the model uses thoughts to decompose goals into subgoals, track which subgoals have been completed, and determine where to look for objects.[1]
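The distinction between an absolute gain in percentage points and a relative gain can be made concrete with a short calculation. The sketch below uses only the best-trial success rates from the table above; the helper name `gains` is illustrative. Because it works from best-trial figures, the relative gain it produces need not equal the per-trial average quoted from the paper.

```python
from typing import Tuple


def gains(baseline: float, method: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = method - baseline
    relative = absolute / baseline * 100
    return absolute, relative


# Best-trial ALFWorld success rates (percent) from the table.
abs_vs_act, rel_vs_act = gains(baseline=45, method=71)
abs_vs_butler, _ = gains(baseline=37, method=71)

print(abs_vs_act)            # 26 percentage points over Act only
print(round(rel_vs_act, 1))  # 57.8 percent relative gain over Act only
print(abs_vs_butler)         # 34 percentage points over BUTLER
```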
WebShop is a web navigation benchmark where the agent must find and purchase a product matching a text description by navigating a simulated e-commerce site. The paper reports success rate (finding the exact product) and average reward score.[1]
| Method | Success Rate | Score |
|---|---|---|
| Imitation learning (IL) | 29.1% | 59.9 |
| IL + reinforcement learning | 28.7% | 62.4 |
| Act only | 30.1% | 62.3 |
| ReAct | 40.0% | 66.6 |
| Human expert | 59.6% | 82.1 |
ReAct achieved a 40.0% success rate, roughly 10 percentage points above the act-only baseline (30.1%) and the prior best learned methods (IL at 29.1%, IL+RL at 28.7%), despite using only one or two in-context examples, whereas the learning-based baselines were trained on roughly 1,000 human-annotated trajectories and, for IL+RL, roughly 10,000 additional training instructions.[1]
The authors performed a detailed error analysis by randomly sampling 50 correct and 50 incorrect trajectories from each of ReAct and CoT on HotpotQA (200 examples total), then manually categorizing the success and failure modes.[1]
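The key findings were consistent with the HotpotQA results: CoT's errors were dominated by hallucinated facts, while ReAct's trajectories were better grounded but more often derailed by uninformative or irrelevant search results.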
Beyond few-shot prompting, the paper also explored using ReAct-format trajectories to fine-tune smaller language models: trajectories generated by a large prompted model served as training data for the smaller models.[1]
The results showed that fine-tuning consistently and significantly improved EM scores over prompting alone. A fine-tuned PaLM-62B outperformed the prompted PaLM-540B on HotpotQA, demonstrating that the ReAct format transfers effectively to smaller models through distillation. Standard and CoT fine-tuning degraded after relatively few steps, while ReAct and Act methods generally benefited from more training steps and more training data.[1]
This finding was significant because it suggested a practical pipeline: use a large prompted model to generate high-quality trajectories, then fine-tune a smaller, cheaper model on those trajectories for deployment. The pattern was later picked up by FireAct (Chen et al., 2023), which fine-tuned language models specifically for ReAct-style agent behavior across multiple tasks and tools.[10]
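The data-preparation half of such a pipeline can be sketched in a few lines. This is a minimal illustration, assuming that "high-quality" means the trajectory ended in the correct answer; the `Trajectory` dataclass, the normalization rule, and the `input`/`target` record fields are hypothetical and would need to match whatever fine-tuning tool is actually used.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Trajectory:
    question: str
    steps: str       # interleaved Thought/Action/Observation text
    predicted: str   # answer the prompted model submitted
    gold: str        # reference answer


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def to_finetune_records(trajectories: Iterable[Trajectory]) -> List[Dict[str, str]]:
    """Keep trajectories with a correct answer and format them as records."""
    records = []
    for t in trajectories:
        if normalize(t.predicted) != normalize(t.gold):
            continue  # discard trajectories that reached a wrong answer
        records.append({"input": f"Question: {t.question}\n", "target": t.steps})
    return records
```

Filtering on the final answer is a cheap proxy for quality: it cannot tell whether the intermediate reasoning was sound, only that the trajectory ended in the right place.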
ReAct sits at the intersection of several prompting paradigms. The following table summarizes how it relates to other approaches.
| Method | Reasoning | External actions | Grounding | Key limitation |
|---|---|---|---|---|
| Standard prompting | No | No | Internal knowledge only | Cannot reason through multi-step problems |
| Chain-of-thought (CoT) | Yes | No | Internal knowledge only | Prone to hallucination; cannot access new information |
| CoT with self-consistency (CoT-SC) | Yes | No | Internal knowledge only | Higher cost from multiple samples; still no external grounding |
| Act only | No | Yes | External via tool calls | Cannot reason about what action to take next or synthesize observations |
| ReAct | Yes | Yes | Both internal and external | Search/retrieval errors; higher token usage per step |
| Reflexion | Yes | Yes | Both, plus self-evaluation memory | Requires additional self-critique step and episodic memory |
| Tree of Thoughts (ToT) | Yes (branching) | No (in original) | Internal, with search | Cost of exploring branches; needs evaluator |
| Toolformer | Implicit | Yes (learned) | External via self-supervised tool calls | Requires training on tool-use traces; fixed API set |
| ReWOO | Yes (plan only) | Yes | Both, but planning is upfront | Cannot adapt mid-execution if conditions change unexpectedly |
The central insight from the paper is that reasoning and acting are complementary, not redundant. Reasoning without acting leads to hallucination because the model has no way to verify its claims. Acting without reasoning leads to inefficient or aimless behavior because the model cannot plan, recover from errors, or synthesize information across steps.[1]
ReAct is best understood as an extension of chain-of-thought prompting rather than a replacement for it. CoT showed that natural-language reasoning, elicited by exemplars, can improve performance on tasks that require multi-step inference.[7] ReAct preserves this insight, as the Thought steps in a ReAct trajectory are essentially CoT-style reasoning, but adds an Action channel that lets the model query the world between thoughts. In the limit where the action set is empty, ReAct degenerates to chain-of-thought; in the limit where the thoughts are empty, it degenerates to a pure action policy. The paper's headline finding is that neither extreme is optimal: on grounded tasks the best results come from combining the two, although on HotpotQA ReAct alone trailed CoT and needed a CoT-SC fallback to come out ahead.[1]
Where ReAct teaches a model to use tools through prompting alone, Toolformer (Schick et al., 2023) takes a complementary, training-based approach: it has the model annotate a corpus with self-generated tool calls, filters for calls that reduce the loss on the next token, and fine-tunes on the resulting traces.[11] The two methods can be combined, as a Toolformer-trained model can be prompted in ReAct format, and they represent two of the dominant paradigms for equipping LLMs with tool use.
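The filtering step described above can be sketched as a comparison of losses. This illustrates the idea only: the function name, the three loss inputs, and the threshold value are assumptions, and comparing against the better of two baselines (no call at all, or a call without its result) reflects one reading of the Toolformer criterion rather than a verified reproduction.

```python
def keep_tool_call(
    loss_without_call: float,
    loss_call_no_result: float,
    loss_call_with_result: float,
    threshold: float = 1.0,
) -> bool:
    """Keep a candidate call only if seeing its result lowers the loss enough."""
    # The call must beat the stronger of the two baselines by the threshold.
    baseline = min(loss_without_call, loss_call_no_result)
    return baseline - loss_call_with_result >= threshold


print(keep_tool_call(3.0, 3.2, 1.5))  # True: the result cuts the loss by 1.5
print(keep_tool_call(3.0, 3.2, 2.5))  # False: a gain of 0.5 is below 1.0
```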
ReAct was accepted at ICLR 2023 as an Oral and designated "Notable Top 5%."[2] Reviewers highlighted the elegance of the prompting recipe and the consistency of the gains across qualitatively different benchmarks. According to Google Scholar tallies, the paper had accumulated well over 5,000 citations within three years of publication, placing it among the most-cited works in the post-2022 LLM-agent literature.[12]
Beyond academic citations, the paradigm spread quickly into industrial and open-source software. Andrew Ng, in his 2024 series on "agentic design patterns," listed tool use and planning as two of the four foundational patterns and pointed to ReAct as the canonical demonstration of their combined value.[13] Major model providers, including Anthropic, OpenAI, and Google, shipped structured tool-calling APIs whose orchestration loops mirror the Thought-Action-Observation cycle, even when the surface syntax differs from raw text prompting.
Shunyu Yao completed his PhD at Princeton in 2024 and subsequently joined the research staff at OpenAI in August of that year, where he contributed to the design of agentic products including Operator and Deep Research.[14] Karthik Narasimhan remains a faculty member at Princeton's Computer Science department.
ReAct was one of the earliest and most influential demonstrations that prompting alone can turn a language model into a functional agent. Its Thought-Action-Observation loop became the conceptual blueprint for a generation of agent frameworks and research directions.
Several popular libraries adopted ReAct as a core pattern:
- LangChain exposes langchain.agents.create_react_agent, which takes a model, a list of tools, and a prompt template and returns an agent that runs the Thought-Action-Observation loop. LangChain's early agent architecture was built almost entirely around this pattern, and the initialize_agent helper exposed several ReAct variants (zero-shot-react-description, react-docstore, and others).[5]
- LangGraph shipped the langgraph.prebuilt.create_react_agent factory as one of its first prebuilt components. In LangGraph v1 (released in late 2025), this prebuilt was deprecated in favor of the more general langchain.agents.create_agent, which runs on LangGraph and adds a configurable middleware system on top of the ReAct loop.[15]
- Hugging Face implemented ReAct-style agents in its transformers agents API, using the Thought-Action-Observation format as the default architecture.[6]
ReAct directly inspired or informed a wave of follow-up works, including FireAct, Reflexion, ReWOO, and LATS.
The ReAct paradigm influenced the design of tool-use and agentic capabilities in commercial language-model APIs. OpenAI's function calling, Anthropic's tool use, and Google's Gemini function calling all incorporate elements of the Thought-Action-Observation pattern, though they typically implement it through structured API protocols (JSON tool definitions and tool-result messages) rather than raw text prompting. Many production agent systems use a ReAct-inspired loop internally, even when the implementation details differ from the original paper's prompting approach. The 2024–2026 "agentic" product wave, including OpenAI Operator and Deep Research, Anthropic's computer use, Google's Project Astra/Mariner, and dozens of vertical agent startups, relies on Thought-Action-Observation control as a default scaffold.
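Despite its wide influence, ReAct has several known limitations, including sensitivity to search and retrieval errors and higher token usage per step than reasoning-only or action-only prompting.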
Practitioners building ReAct agents typically follow one of several patterns.
A minimal ReAct prompt includes a description of the available actions and their format, a small set of worked Thought-Action-Observation trajectories, the new question, and a trailing Thought 1: to prompt the model to begin reasoning.
The orchestration system then parses the model's output to detect action commands (using string matching, regular expressions, or structured-output parsing), executes the action, appends the observation to the context, and calls the model again to generate the next thought. A typical control loop in pseudocode looks like:
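    def react_loop(llm, tools, examples, task, max_steps):
        # Seed the context with worked trajectories, then the new question.
        history = render_few_shot_prompt(examples) + render_question(task)
        for step in range(max_steps):
            # Stop before the model can invent its own observation.
            completion = llm.generate(history, stop=["Observation"])
            history += completion
            action = parse_action(completion)
            if action.name == "Finish":
                return action.args
            # Run the tool and feed its result back as the next observation.
            obs = tools[action.name](action.args)
            history += f"\nObservation {step+1}: {obs}\n"
        return None  # exceeded budget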
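The pseudocode above can be exercised end to end without a real model by replaying canned completions. The sketch below is a self-contained illustration: `ScriptedLLM`, `run_react`, the one-entry `facts` table, and the handling of malformed or unknown actions are all assumptions made for the demo, and a real deployment would replace `ScriptedLLM.generate` with a call to an actual model API.

```python
import re

# Matches "Action 2: Search[some query]" within a completion.
ACTION_RE = re.compile(r"Action \d+:\s*(\w+)\[(.*)\]")


class ScriptedLLM:
    """Stand-in for a real model: replays canned completions in order."""

    def __init__(self, completions):
        self._completions = iter(completions)

    def generate(self, prompt, stop=None):
        return next(self._completions)


def run_react(llm, tools, question, max_steps=5):
    """Run the Thought-Action-Observation loop; return (answer, trace)."""
    history = f"Question: {question}\n"
    for step in range(1, max_steps + 1):
        completion = llm.generate(history, stop=[f"\nObservation {step}:"])
        history += completion
        match = ACTION_RE.search(completion)
        if match is None:
            obs = "Invalid action format."
        else:
            name, arg = match.groups()
            if name == "Finish":
                return arg, history
            tool = tools.get(name)
            obs = tool(arg) if tool else f"Unknown action {name}."
        history += f"\nObservation {step}: {obs}\n"
    return None, history  # step budget exhausted


# Hypothetical single-fact knowledge source for the demo.
facts = {"High Plains (United States)": "The High Plains rise from 1,800 to 7,000 ft."}
tools = {"Search": lambda q: facts.get(q, f"Could not find {q}.")}
llm = ScriptedLLM([
    "Thought 1: I should search High Plains.\nAction 1: Search[High Plains]",
    "Thought 2: Try the disambiguated title.\n"
    "Action 2: Search[High Plains (United States)]",
    "Thought 3: The range is 1,800 to 7,000 ft.\n"
    "Action 3: Finish[1,800 to 7,000 ft]",
])
answer, trace = run_react(llm, tools, "What is the elevation range of the High Plains?")
print(answer)  # 1,800 to 7,000 ft
```

Feeding an error message back as the observation, rather than raising, gives the model a chance to correct a malformed step on the next turn.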
Using a framework like LangChain, a basic ReAct agent can be created with a few lines of code by specifying a model, a list of tool objects (each with a name, description, and callable function), and optionally a custom prompt. The framework handles the orchestration loop, action parsing, tool execution, and observation injection automatically. In LangGraph v1, the recommended API is langchain.agents.create_agent, which wraps the ReAct loop in a graph node and exposes middleware hooks for logging, validation, and human-in-the-loop interventions.[15]
Production ReAct agents typically include additional safeguards, such as a cap on the number of loop iterations (the max_steps budget in the pseudocode above).
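As an illustration of what such safeguards can look like, the sketch below wraps a tool so that exceptions become observations and oversized results are truncated before they reach the context window. Both safeguards, the `safe_tool` name, and the 500-character default are assumptions chosen for this example, not a list taken from any particular framework.

```python
from typing import Callable


def safe_tool(tool: Callable[[str], object], max_chars: int = 500) -> Callable[[str], str]:
    """Wrap a tool so failures and oversized outputs stay manageable."""

    def wrapped(argument: str) -> str:
        try:
            result = str(tool(argument))
        except Exception as exc:
            # Report the failure as an observation instead of crashing the loop.
            return f"Tool error: {type(exc).__name__}: {exc}"
        if len(result) > max_chars:
            # Keep the context small; mark that text was cut off.
            return result[:max_chars] + " [truncated]"
        return result

    return wrapped
```

A wrapped tool can be dropped into the tool table of a control loop unchanged, since it keeps the same one-argument call signature.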
The ReAct loop bears a structural resemblance to the state-action-reward loop in reinforcement learning. In RL, an agent observes a state, takes an action, receives a reward, and updates its policy. In ReAct, the model reads the current context (state), generates a thought and action, receives an observation (analogous to a new state), and continues. However, ReAct differs from RL in several important ways: there is no explicit reward signal during the episode, the model is not trained through trial and error, and the "policy" is defined entirely by the prompt and the model's pretrained weights.
The connection to RL became more explicit in LATS, which introduced Monte Carlo Tree Search (a technique from game-playing RL) into the ReAct framework, and in Reflexion, which introduced a form of episodic learning through self-critique.[16][18] More recent reasoning models trained with reinforcement learning from process or outcome rewards (such as DeepSeek-R1 and OpenAI's o-series) can be viewed as internalizing the ReAct loop: their training rewards reasoning trajectories that lead to correct answers, with tool use sometimes embedded directly in the policy.